## Web Scraper Script
#### Scraping news headlines for sentiment analysis of the following 10 companies: Meta, Apple, Amazon, Netflix, Google, Microsoft, Reliance Inc., Infosys Ltd., HDFC Bank and Tesla


#### **Step - 1: Installing and importing the necessary libraries**

In [1]:
!pip install requests beautifulsoup4 html5lib vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [2]:
import time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime, date, timedelta

#### **Step - 2: Scraping Real-time News headlines, related to each stock, for Sentiment Analysis**

#### a) Scraping headlines for the stocks from Finviz

In [3]:
finviz_tickers = {"Meta": "META",
                  "Apple": "AAPL",
                  "Amazon": "AMZN",
                  "Netflix": "NFLX",
                  "Google": "GOOG",
                  "Microsoft": "MSFT",
                  "Tesla": "TSLA",
                  "Reliance" : "RS",
                  "Infosys": "INFY",
                  "HDFC": "HDB"}

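# Adjacent string literals are concatenated, so the header value contains
# no stray runs of spaces (a backslash inside a string keeps the indentation).
headers = {"user-agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/84.0.4147.105 Safari/537.36")}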

def scrape_finviz_headlines():
  all_headlines = []
  current_date = None
  for name, ticker in finviz_tickers.items():
    print(f"Scraping News Headlines for {name}")
    url = f"https://finviz.com/quote.ashx?t={ticker}"
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, "html5lib")
    news_table = soup.find("table", class_="fullview-news-outer")

    if news_table:
      for row in news_table.find_all("tr"):
        try:
          timestamp = row.td.text.strip()
          headline = row.a.text.strip()

          if "-" in timestamp:
            current_date = timestamp.split()[0]
            time_part = timestamp.split()[1]
          else:
            time_part = timestamp

          full_datetime = f"{current_date} {time_part}"
          all_headlines.append({"Company": name, "Timestamp": full_datetime, "Headlines": headline})

        except (AttributeError, IndexError):
          # Skip rows that lack a timestamp cell or a headline link.
          continue
    time.sleep(5)

  return all_headlines

us_news_df = pd.DataFrame(scrape_finviz_headlines())
us_news_df.tail()

Scraping News Headlines for Meta
Scraping News Headlines for Apple
Scraping News Headlines for Amazon
Scraping News Headlines for Netflix
Scraping News Headlines for Google
Scraping News Headlines for Microsoft
Scraping News Headlines for Tesla
Scraping News Headlines for Reliance
Scraping News Headlines for Infosys
Scraping News Headlines for HDFC


Unnamed: 0,Company,Timestamp,Headlines
995,HDFC,Sep-22-23 03:53AM,HDFC Bank's share price declines; brokerages a...
996,HDFC,Sep-20-23 01:42AM,HDFC Bank shares drop after Nomura downgrades ...
997,HDFC,Sep-20-23 01:27AM,Indian stock indices slump amid global market ...
998,HDFC,Sep-19-23 11:52PM,HDFC Bank shares face sharp decline following ...
999,HDFC,Sep-19-23 11:43PM,RBI extends tenure of HDFC Bank's CEO Sashidha...


In [4]:
us_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    1000 non-null   object
 1   Timestamp  1000 non-null   object
 2   Headlines  1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


In [5]:
us_news_df["Timestamp"] = pd.to_datetime(us_news_df["Timestamp"], format = "%b-%d-%y %I:%M%p", errors = "coerce")
us_news_df["Timestamp"].head()

Unnamed: 0,Timestamp
0,NaT
1,NaT
2,NaT
3,2025-07-14 22:51:00
4,2025-07-14 20:00:00
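
The `NaT` values above are rows whose scraped label did not match `%b-%d-%y %I:%M%p`. One plausible cause, stated here as an assumption about the Finviz markup rather than something the output confirms, is a relative label such as `Today 01:05AM`, which contains no `-` and therefore never sets a date. The sketch below shows one way such labels could be resolved before building the DataFrame. The helper name `resolve_finviz_timestamps` and the `today` argument are hypothetical and not part of the original notebook.

```python
from datetime import date, datetime


def resolve_finviz_timestamps(labels, today):
    """Turn labels like 'Jul-14-25 10:51PM', 'Today 01:05AM' or '08:00PM'
    into datetimes, carrying the most recent date forward to time-only rows.

    Assumption: 'Today' is the only relative label; anything else that
    cannot be resolved becomes None.
    """
    resolved = []
    current_date = None  # reset on every call, i.e. once per ticker
    for label in labels:
        parts = label.split()
        if len(parts) == 2:
            day_label, time_part = parts
            if day_label == "Today":
                current_date = today
            else:
                current_date = datetime.strptime(day_label, "%b-%d-%y").date()
        elif len(parts) == 1:
            time_part = parts[0]
        else:
            resolved.append(None)
            continue

        if current_date is None:
            # A time-only row appeared before any dated row.
            resolved.append(None)
            continue

        clock = datetime.strptime(time_part, "%I:%M%p").time()
        resolved.append(datetime.combine(current_date, clock))
    return resolved


labels = ["Today 01:05AM", "12:30AM", "Jul-14-25 10:51PM", "08:00PM"]
for stamp in resolve_finviz_timestamps(labels, today=date(2025, 7, 15)):
    print(stamp)
```

Calling the helper once per ticker also keeps a date from one company's table from leaking into the next company's time-only rows.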


In [6]:
us_news_df.dropna(inplace=True)

In [7]:
us_news_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 993 entries, 3 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Company    993 non-null    object        
 1   Timestamp  993 non-null    datetime64[ns]
 2   Headlines  993 non-null    object        
dtypes: datetime64[ns](1), object(2)
memory usage: 31.0+ KB


In [8]:
us_news_df["Date"] = us_news_df["Timestamp"].dt.date
max_date = us_news_df.groupby(["Company"])["Date"].max()
max_date

Company,Date
Amazon,2025-07-14
Apple,2025-07-14
Google,2025-07-14
HDFC,2025-07-07
Infosys,2025-07-07
Meta,2025-07-14
Microsoft,2025-07-14
Netflix,2025-07-14
Reliance,2025-07-11
Tesla,2025-07-14


#### b) Scraping news headlines for the Indian stocks from the Economic Times website
##### To get the latest news (posted within the last few hours, with timestamps in IST)

In [9]:
et_parameters = {"HDFC": "hdfc",
           "Infosys": "infosys",
           "Reliance": "reliance"}

def scrape_et_headlines():
  all_headlines = []
  for name, param in et_parameters.items():
    print(f"Scraping News Headlines for {name}")
    url = f"https://economictimes.indiatimes.com/topic/{param}"
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, "html5lib")
    news = soup.find("div", class_="news_sec")

    if not news:
      continue

    for item in news.find_all("li"):
      try:
        p_tag = item.find("p", class_ = "stry")
        time_tag = item.find("time", class_ ="date-format")

        if time_tag and p_tag:
          headline = p_tag.find("a").text.strip()
          timestamp = time_tag.get("data-time")

          if headline and timestamp:
            all_headlines.append({"Company": name, "Timestamp": timestamp, "Headlines": headline})

      except Exception as e:
        print(f"Error parsing item for {name}: {e}")
        continue
  # time.sleep(5)
  return all_headlines
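
Both scrapers call `requests.get` once per page, with no timeout and no retry, so a single slow or failed response either stalls the run or silently yields no rows. A minimal sketch of a retry wrapper with exponential back-off follows. The name `call_with_retries` and its parameters are hypothetical, and it takes any zero-argument callable so it can be tried without network access.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Possible use inside either scraper (assumes `requests`, `url`, `headers`):
# r = call_with_retries(lambda: requests.get(url, headers=headers, timeout=10))
```

Passing `timeout=10` to `requests.get`, as in the commented usage line, is what turns a hung connection into an exception the wrapper can retry.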


In [10]:
in_news_df = pd.DataFrame(scrape_et_headlines())
in_news_df.head()

Scraping News Headlines for HDFC
Scraping News Headlines for Infosys
Scraping News Headlines for Reliance


Unnamed: 0,Company,Timestamp,Headlines
0,HDFC,"Jul 15, 2025, 12:47 PM",Waaree Renewable Technologies shares jump 15% ...
1,HDFC,"Jul 15, 2025, 09:28 AM","Rallis India shares zoom 7%, hit 52-week high ..."
2,HDFC,"Jul 15, 2025, 08:09 AM",Tata Technologies shares jump 3% after Q1 prof...
3,HDFC,"Jul 03, 2023, 12:43 PM IST","Disclosure Under Regulations 30, 42 And 60 Of ..."
4,Infosys,"Jul 14, 2025, 11:00 AM",Mazagon Dock and CONCOR among stocks bought an...


In [11]:
in_news_df.iloc[0, 2]

'Waaree Renewable Technologies shares jump 15% ahead of Q1 results'

In [12]:
in_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    12 non-null     object
 1   Timestamp  12 non-null     object
 2   Headlines  12 non-null     object
dtypes: object(3)
memory usage: 420.0+ bytes


In [13]:
in_news_df["Timestamp"].unique()[:10]

array(['Jul 15, 2025, 12:47 PM', 'Jul 15, 2025, 09:28 AM',
       'Jul 15, 2025, 08:09 AM', 'Jul 03, 2023, 12:43 PM IST',
       'Jul 14, 2025, 11:00 AM', 'Jul 14, 2025, 10:40 AM',
       'Jul 13, 2025, 09:37 AM', 'Jul 15, 2025, 10:58 AM IST',
       'Jul 11, 2025, 01:33 PM', 'Jul 11, 2025, 10:12 AM'], dtype=object)

In [14]:
in_news_df["Timestamp"] = in_news_df["Timestamp"].str.replace(" IST", "", regex = False)
in_news_df["Timestamp"] = pd.to_datetime(in_news_df["Timestamp"], format = "%b %d, %Y, %I:%M %p", errors = "coerce")
in_news_df["Timestamp"].head()

Unnamed: 0,Timestamp
0,2025-07-15 12:47:00
1,2025-07-15 09:28:00
2,2025-07-15 08:09:00
3,2023-07-03 12:43:00
4,2025-07-14 11:00:00
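
Stripping the ` IST` suffix leaves naive timestamps, so the Economic Times rows carry no record of their time zone once they are merged with the Finviz rows. A small sketch of keeping that information is shown below. It assumes every Economic Times timestamp is in IST even when the suffix is missing, and the helper name `parse_et_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def parse_et_timestamp(raw):
    """Parse 'Jul 15, 2025, 10:58 AM IST' (suffix optional) as an
    IST-aware datetime. Assumes the site always reports IST."""
    cleaned = raw.replace(" IST", "").strip()
    naive = datetime.strptime(cleaned, "%b %d, %Y, %I:%M %p")
    return naive.replace(tzinfo=IST)


stamp = parse_et_timestamp("Jul 15, 2025, 10:58 AM IST")
print(stamp.isoformat())
print(stamp.astimezone(timezone.utc).isoformat())
```

Converting both sources to a common zone such as UTC before deriving `Date` would keep a late-evening headline from being assigned to a different calendar day depending on its source.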


In [15]:
in_news_df["Date"] = in_news_df["Timestamp"].dt.date
in_news_df.head()

Unnamed: 0,Company,Timestamp,Headlines,Date
0,HDFC,2025-07-15 12:47:00,Waaree Renewable Technologies shares jump 15% ...,2025-07-15
1,HDFC,2025-07-15 09:28:00,"Rallis India shares zoom 7%, hit 52-week high ...",2025-07-15
2,HDFC,2025-07-15 08:09:00,Tata Technologies shares jump 3% after Q1 prof...,2025-07-15
3,HDFC,2023-07-03 12:43:00,"Disclosure Under Regulations 30, 42 And 60 Of ...",2023-07-03
4,Infosys,2025-07-14 11:00:00,Mazagon Dock and CONCOR among stocks bought an...,2025-07-14


In [16]:
in_news_df = in_news_df[["Timestamp", "Date", "Company", "Headlines"]]
in_news_df.head()

Unnamed: 0,Timestamp,Date,Company,Headlines
0,2025-07-15 12:47:00,2025-07-15,HDFC,Waaree Renewable Technologies shares jump 15% ...
1,2025-07-15 09:28:00,2025-07-15,HDFC,"Rallis India shares zoom 7%, hit 52-week high ..."
2,2025-07-15 08:09:00,2025-07-15,HDFC,Tata Technologies shares jump 3% after Q1 prof...
3,2023-07-03 12:43:00,2023-07-03,HDFC,"Disclosure Under Regulations 30, 42 And 60 Of ..."
4,2025-07-14 11:00:00,2025-07-14,Infosys,Mazagon Dock and CONCOR among stocks bought an...


In [17]:
print(in_news_df.shape)
print(us_news_df.shape)

(12, 4)
(993, 4)


In [18]:
news_df = pd.concat([us_news_df, in_news_df], ignore_index= True)
print(news_df.shape)
news_df.sample(10)

(1005, 4)


Unnamed: 0,Company,Timestamp,Headlines,Date
341,Netflix,2025-07-10 10:28:00,"Dow climbs, Nasdaq slides as Big Tech struggles",2025-07-10
360,Netflix,2025-07-07 12:02:00,Netflix (NFLX) Downgraded as Analyst Sees 'Too...,2025-07-07
594,Microsoft,2025-07-10 16:14:00,Amazon's Azure Still Has More Room To Run,2025-07-10
757,Reliance,2024-02-16 12:53:00,Reliance's (RS) Earnings and Revenues Surpass ...,2024-02-16
748,Reliance,2024-04-15 15:40:00,Canadian Investment Regulatory Organization Tr...,2024-04-15
827,Infosys,2025-04-30 06:22:00,Infosys Collaborates with Yorkshire Building S...,2025-04-30
738,Reliance,2024-07-25 06:50:00,"Reliance, Inc. Reports Second Quarter 2024 Fin...",2024-07-25
410,Google,2025-07-13 12:10:00,2 Artificial Intelligence (AI) Stocks That Cou...,2025-07-13
10,Meta,2025-07-14 16:03:00,Mark Zuckerberg's Privacy Showdown in Delaware,2025-07-14
230,Amazon,2025-07-14 09:00:00,Pega Signs Five-Year Strategic Collaboration A...,2025-07-14


In [19]:
news_df.duplicated().sum()

np.int64(0)

In [20]:
news_df = news_df[["Timestamp", "Date", "Company", "Headlines"]]
news_df.head(3)

Unnamed: 0,Timestamp,Date,Company,Headlines
0,2025-07-14 22:51:00,2025-07-14,Meta,Meta's Zuckerberg bets hundreds of billions on...
1,2025-07-14 20:00:00,2025-07-14,Meta,How 1 Rocket Engineer's Vegetarian Engine Coul...
2,2025-07-14 18:44:00,2025-07-14,Meta,Meta built its AI reputation on openness that...


#### **Step - 3: Performing Sentiment Analysis of the News headlines**

In [21]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
news_df["Sentiment_Score"] = news_df["Headlines"].apply(lambda x: analyzer.polarity_scores(x)["compound"])
news_df.head()

Unnamed: 0,Timestamp,Date,Company,Headlines,Sentiment_Score
0,2025-07-14 22:51:00,2025-07-14,Meta,Meta's Zuckerberg bets hundreds of billions on...,0.0
1,2025-07-14 20:00:00,2025-07-14,Meta,How 1 Rocket Engineer's Vegetarian Engine Coul...,0.0
2,2025-07-14 18:44:00,2025-07-14,Meta,Meta built its AI reputation on openness that...,0.34
3,2025-07-14 18:28:00,2025-07-14,Meta,Markets Close in the Green; Nasdaq at Fresh Highs,0.3182
4,2025-07-14 18:04:00,2025-07-14,Meta,Nuclear's Moment: Securing US AI Supremacy,0.3612
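
The `compound` value is a normalized score between -1 and 1. The VADER README suggests cut-offs of 0.05 and -0.05 for turning it into a positive, negative or neutral label. The sketch below applies those cut-offs. The function name `label_compound` is hypothetical, and the thresholds are a convention that can be tuned rather than a fixed rule.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for score in (0.0, 0.34, 0.3182, -0.19785):
    print(score, label_compound(score))
```

In the notebook this could be applied as `news_df["Sentiment_Score"].apply(label_compound)` to add a categorical column next to the numeric score.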


In [22]:
daily_sentiment = news_df.groupby(["Date", "Company"])["Sentiment_Score"].mean().reset_index()
daily_sentiment.head(10)

Unnamed: 0,Date,Company,Sentiment_Score
0,2023-07-03,HDFC,0.0
1,2023-09-11,Reliance,0.0
2,2023-09-19,HDFC,0.212
3,2023-09-20,HDFC,-0.19785
4,2023-09-21,Reliance,0.31845
5,2023-09-22,HDFC,0.296
6,2023-09-25,HDFC,0.6743
7,2023-09-27,Reliance,0.6249
8,2023-10-03,Reliance,0.0
9,2023-10-12,Reliance,0.0


#### **Step - 4: Saving and exporting the sentiment data as CSV files**

In [23]:
daily_sentiment.to_csv("Daily_Sentiment_Data_for_Stocks.csv", index = False)
news_df.to_csv("News_Headlines_Data.csv", index = False)

In [24]:
from google.colab import files

files.download("Daily_Sentiment_Data_for_Stocks.csv")
files.download("News_Headlines_Data.csv")
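
The `google.colab` import only succeeds inside Google Colab, so this cell raises an error in a local Jupyter session. A minimal sketch of a guard is shown below. The helper name `download_if_colab` is hypothetical. Outside Colab the CSV files are already in the working directory, so the helper does nothing.

```python
def download_if_colab(paths):
    """Trigger a browser download for each path when running in Colab.

    Returns the list of paths that were sent to the browser, or an empty
    list when google.colab is not importable (for example, local Jupyter).
    """
    try:
        from google.colab import files  # only available inside Google Colab
    except ImportError:
        return []

    for path in paths:
        files.download(path)
    return list(paths)


sent = download_if_colab(
    ["Daily_Sentiment_Data_for_Stocks.csv", "News_Headlines_Data.csv"]
)
print(f"Files sent to the browser: {sent}")
```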

---

_This notebook was created and authored by Ramya Vijayalayan for educational and portfolio use only._  
© 2025 Ramya | [github.com/ramyavijayalayan-portfolio](https://github.com/ramyavijayalayan10)