# Web Scraping Stock Details of 500 Companies, Step by Step

We are going to read stock-related data for 500 companies from the moneycontrol website using web scraping.

BeautifulSoup is used to parse the HTML content.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

In [2]:
#This url has details and hyperlinks of all 500 companies

r=requests.get('https://www.moneycontrol.com/india/stockpricequote/')

In [3]:
#We want to view the data in text format
data=r.text

In [4]:
#we can check some part of this html data
print(data[:1000])

 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <title>Stock Quotes|Company Stock Price quotes|NSE/ BSE Listed Company Stocks|Indian Stock Market</title>
 <meta name="description" content="Get all Indian company stock quotes listed in the share market. NSE/ BSE Listed companies stock price quotes list, top company stock list on Moneycontrol.">
 <meta name="keywords" content="Stock Market Quotes, Company Stock Quotes, Share Market Price & Chart Quotes">
<link rel="canonical" href="https://www.moneycontrol.com/india/stockpricequote/" />
<meta property="og:url" content="https://www.moneycontrol.com/india/stockpricequote/" />
<meta property="og:title" content="Stock Quotes|Company Stock Price quotes|NSE/ BSE Listed Company Stocks|Indian Stock Market" />
<meta property="og:description" content="Get all Indian company stock quotes listed in the share market. NSE/ BSE Listed companies stock price quotes list, top company stock list on Moneycont

# Web Scraping



Use BeautifulSoup to parse the HTML so that we can focus on the part of the web page we actually intend to target.
Here we follow a few steps:


1. On the moneycontrol website, right-click and select Inspect to view the source code of the page.
   Then click the element-picker arrow at the top of the inspector so that we can hover over each of our 500 company links.
    
2. Try the first company (say 3M India): as we hover over it, its HTML is highlighted, which tells us that its class is bl_12.

3. So filter the page content below by tag "a" and class bl_12.
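
The filter from the steps above can be tried offline first. The sketch below runs it on a small hand-written snippet; the markup is a hypothetical stand-in for the real page (only the class name bl_12 and the URL pattern come from this notebook), so treat it as an illustration rather than a copy of the site's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the listing page: company links carry class "bl_12".
sample_html = """
<table>
  <tr><td><a class="bl_12" href="https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42">3M India</a></td></tr>
  <tr><td><a class="bl_12" href="https://www.moneycontrol.com/india/stockpricequote/cementmajor/acc/ACC06">ACC</a></td></tr>
  <tr><td><a class="other" href="https://example.com/about">About</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class of the tag.
company_links = soup.find_all("a", class_="bl_12")

hrefs = [a.get("href") for a in company_links]
names = [a.get_text(strip=True) for a in company_links]

print(hrefs)
print(names)
```

Only the two anchors with class bl_12 are returned; the third link is ignored because its class differs.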


In [5]:
soup=BeautifulSoup(data,'html.parser')

# find_all is the current name of the method; findAll is a legacy alias
mydivs = soup.find_all("a", {"class": "bl_12"})

Let's collect the hyperlinks of all the companies.

In [6]:
# Collect the href of every matching <a> tag
values = [link.get('href') for link in mydivs]
values[0:5]


['javascript:;',
 'https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42',
 'https://www.moneycontrol.com/india/stockpricequote/financegeneral/aavasfinanciers/AF17',
 'https://www.moneycontrol.com/india/stockpricequote/financeinvestments/adityabirlacapital/ABC9',
 'https://www.moneycontrol.com/india/stockpricequote/pharmaceuticals/abbottindia/AI51']

Let's store our results in a DataFrame.

In [7]:
# create a temporary DataFrame; widen the column display so full URLs are visible
pd.set_option('display.max_colwidth', 800)
stock_data = pd.DataFrame({'LINK': values})

print(stock_data.shape)
stock_data.head(5)

(499, 1)


Unnamed: 0,LINK
0,javascript:;
1,https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42
2,https://www.moneycontrol.com/india/stockpricequote/financegeneral/aavasfinanciers/AF17
3,https://www.moneycontrol.com/india/stockpricequote/financeinvestments/adityabirlacapital/ABC9
4,https://www.moneycontrol.com/india/stockpricequote/pharmaceuticals/abbottindia/AI51


# Time to find some Data Patterns!

We intend to collect each company's name, code and sector.

Where can we get these details on the website?
Maybe the URL!

Let's look at a company URL to get some insight:

https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42
    
The sector is diversified, the name is 3mindia, and the code is MI42.
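
The pattern can also be checked with the standard library's URL parser instead of counting raw '/' characters. This is a minimal sketch: `CompanyRef` and `parse_company_url` are hypothetical helper names, and the expected path layout (india / stockpricequote / sector / name / code) is assumed from the example URL above.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class CompanyRef(NamedTuple):
    sector: str
    name: str
    code: str


def parse_company_url(url: str) -> Optional[CompanyRef]:
    """Return (sector, name, code) for a company URL, or None if it does not fit the pattern."""
    parsed = urlparse(url)
    # Drop empty pieces so a trailing slash does not change the count.
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.scheme not in ("http", "https") or len(segments) != 5:
        return None
    if segments[:2] != ["india", "stockpricequote"]:
        return None
    sector, name, code = segments[2:]
    return CompanyRef(sector, name, code)


ref = parse_company_url(
    "https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42"
)
print(ref)
print(parse_company_url("javascript:;"))
```

Returning None for anything that does not match makes placeholder links such as 'javascript:;' easy to filter out later.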



# Some Outliers

Not all the links we retrieved are company pages.
To check the URL format, split each LINK on '/' and extract the code, sector and name; a valid company URL splits into exactly 8 parts.
The indexes of the outliers are printed so that they can be dropped.
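
As an alternative to printing indexes and dropping them by hand, the same length check can be written as a boolean mask. This is a minimal sketch on three sample links; the `links` frame and the variable names are for illustration only.

```python
import pandas as pd

links = pd.DataFrame({"LINK": [
    "javascript:;",
    "https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42",
    "https://www.moneycontrol.com/india/stockpricequote/cementmajor/acc/ACC06",
]})

# Split every link on '/' and keep only the rows that yield exactly 8 parts.
parts = links["LINK"].str.split("/")
is_company = parts.str.len() == 8

outliers = links.index[~is_company].tolist()  # indexes that do not fit the pattern
clean = links[is_company].copy()

# The last three parts are sector, name and code, in that order.
clean["sector"] = parts[is_company].str[-3]
clean["name"] = parts[is_company].str[-2]
clean["code"] = parts[is_company].str[-1]

print(outliers)
print(clean[["sector", "name", "code"]])
```

Because the mask is computed from the data, nothing needs to be hard-coded if the page later returns a different set of stray links.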

In [8]:

codes=[]
sector=[]
name=[]

for index,row in stock_data.iterrows():
    temp=  row['LINK'].split('/')
    if len(temp)==8:
        codes.append(temp[-1])
        sector.append(temp[-3])
        name.append(temp[-2])
    else:
        print(index)
        

0
497
498


In [9]:
# so let's drop the rows at indexes 0, 497 and 498
stock_data=stock_data.drop([0, 497,498])

stock_data.shape


(496, 1)

In [10]:
stock_data['sector']=sector
# codes holds the last URL segment (e.g. MI42); name holds the company slug (e.g. 3mindia)
stock_data['code']=codes
stock_data['name']=name
stock_data.head()

Unnamed: 0,LINK,sector,code,name
1,https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42,diversified,MI42,3mindia
2,https://www.moneycontrol.com/india/stockpricequote/financegeneral/aavasfinanciers/AF17,financegeneral,AF17,aavasfinanciers
3,https://www.moneycontrol.com/india/stockpricequote/financeinvestments/adityabirlacapital/ABC9,financeinvestments,ABC9,adityabirlacapital
4,https://www.moneycontrol.com/india/stockpricequote/pharmaceuticals/abbottindia/AI51,pharmaceuticals,AI51,abbottindia
5,https://www.moneycontrol.com/india/stockpricequote/cementmajor/acc/ACC06,cementmajor,ACC06,acc


Cheers! Let's save the Data.

In [11]:
# save the links, sectors, codes and names to a CSV file
stock_data.to_csv('Initial500CompanyData.csv',index=False)

# Let's dive deeper

Now that we have the URL of each of the 500 companies, let's analyse each company page to identify its Standalone properties.

For example, take a particular link: https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42 

Scroll down to find the Standalone properties.
    
Inspecting them shows that the property names (keys) are in div tags with class value_txtfl and their values are in div tags with class value_txtfr.
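
To see how the two classes line up, the labels and values can be paired into a dictionary. The fragment below is hypothetical markup written for this sketch (the real page has more rows and surrounding tags); only the two class names and the sample figures come from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each label div is followed by its value div.
# "P/E" appears twice to mimic a label that repeats further down the page.
sample_html = """
<div class="value_txtfl">Market Cap (Rs Cr.)</div><div class="value_txtfr">19843.42</div>
<div class="value_txtfl">P/E</div><div class="value_txtfr">65.73</div>
<div class="value_txtfl">Book Value (Rs)</div><div class="value_txtfr">1650.7</div>
<div class="value_txtfl">P/E</div><div class="value_txtfr">70.00</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
labels = [d.get_text(strip=True) for d in soup.find_all("div", class_="value_txtfl")]
values = [d.get_text(strip=True) for d in soup.find_all("div", class_="value_txtfr")]

# Pair labels with values in page order, keeping the first occurrence of each label.
properties = {}
for label, value in zip(labels, values):
    properties.setdefault(label, value)

print(properties)
```

Using `setdefault` keeps the first value seen for a repeated label, which matches the idea of taking only the first block of properties on the page.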


In [12]:
r=requests.get('https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42')
d=r.text
soup=BeautifulSoup(d,'html.parser')

# labels (keys) and their values; find_all is the current name of findAll
mydivs = soup.find_all("div", {"class": "value_txtfl"})
myvalues=soup.find_all("div",{"class":"value_txtfr"})


Let's create our final DataFrame, which consists of a company's link and its standalone properties as columns.


In [13]:
# Create the column names of our Dataframe
col_names=[]
col_names.append('Link')
for div in mydivs:
    if div.text not in col_names:
        col_names.append(div.text)
   
    
print(col_names)  # verify that the column names are correct


['Link', 'Market Cap (Rs Cr.)', 'P/E', 'Book Value (Rs)', 'Dividend (%)', 'Market Lot', 'Industry P/E', 'EPS (TTM)', 'P/C', 'Price/Book', 'Dividend Yield.(%)', 'Face Value (RS)', 'Deliverables (%)']


In [14]:
# create the final DataFrame (empty for now, with the columns above)
scraped_data=pd.DataFrame(columns=col_names)
scraped_data.head()

Unnamed: 0,Link,Market Cap (Rs Cr.),P/E,Book Value (Rs),Dividend (%),Market Lot,Industry P/E,EPS (TTM),P/C,Price/Book,Dividend Yield.(%),Face Value (RS),Deliverables (%)


# All set to gather the Data

In [17]:
# Now we have to get the values for the above columns from the web page of each of the 500 companies.
# Let's first read the DataFrame back from our previous 'Initial500CompanyData.csv' file so we can iterate over each link in it.
stock_data=pd.read_csv('Initial500CompanyData.csv')
stock_data.head(5)

Unnamed: 0,LINK,sector,code,name
0,https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42,diversified,MI42,3mindia
1,https://www.moneycontrol.com/india/stockpricequote/financegeneral/aavasfinanciers/AF17,financegeneral,AF17,aavasfinanciers
2,https://www.moneycontrol.com/india/stockpricequote/financeinvestments/adityabirlacapital/ABC9,financeinvestments,ABC9,adityabirlacapital
3,https://www.moneycontrol.com/india/stockpricequote/pharmaceuticals/abbottindia/AI51,pharmaceuticals,AI51,abbottindia
4,https://www.moneycontrol.com/india/stockpricequote/cementmajor/acc/ACC06,cementmajor,ACC06,acc


For each link, we request its web page, pick out the values by tag and class, and store them as a row in our final DataFrame.
We cannot send this many requests back to back without the server eventually refusing the connection, so when a request fails we sleep for 10 seconds before moving on to the next link.
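
One way to make the pause more useful is to retry the same link after waiting, rather than skipping it. The sketch below is an assumption-laden illustration, not the method used in this notebook: `fetch_with_retry` is a hypothetical helper, the fetch function is passed in (for example a small wrapper around `requests.get`), and the doubling delay is one common choice among many.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); on failure wait and try again, doubling the wait each time."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection refused")
    return "<html>ok</html>"


waits = []  # record the requested delays instead of actually sleeping
html = fetch_with_retry(flaky_fetch, "https://example.com", sleep=waits.append)
print(html, waits)
```

Passing the fetch and sleep functions in as arguments keeps the helper testable without touching the network; in the scraping loop, the fetch argument could be something like `lambda u: requests.get(u, timeout=10).text`.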


In [18]:
j=0
for link in stock_data['LINK']:
        
    try:
        req=requests.get(link)
        data=req.text
        soup=BeautifulSoup(data,'html.parser')
        myvalues=soup.find_all("div",{"class":"value_txtfr"})

        # keep every value in page order; identical values (such as '-') are legitimate
        entries = [value.text for value in myvalues]
        entries.insert(0,link)
        
        #we have to select link and the first 12 values for our 12 Standalone properties
        scraped_data.loc[j]=entries[0:13]

        j=j+1
    except Exception as exc:
        # a failed request or a page with too few values: report it, pause, then move on
        print(link, exc)
        time.sleep(10)

In [19]:
# Export this result to a CSV file
 
scraped_data.to_csv('500FilesDataExtractionFinal.csv',index=False)


Cheers! Here are the results:

In [20]:
# Results
scraped_data.iloc[0:5,0:9]

Unnamed: 0,Link,Market Cap (Rs Cr.),P/E,Book Value (Rs),Dividend (%),Market Lot,Industry P/E,EPS (TTM),P/C
0,https://www.moneycontrol.com/india/stockpricequote/diversified/3mindia/MI42,19843.42,65.73,1650.7,0.0,1,30.56,268.93,57.74
1,https://www.moneycontrol.com/india/stockpricequote/financegeneral/aavasfinanciers/AF17,7519.14,31.6,234.56,0.0,1,17.0,31.09,30.39
2,https://www.moneycontrol.com/india/stockpricequote/financeinvestments/adityabirlacapital/ABC9,10668.92,-,34.11,0.0,1,23.53,-,-
3,https://www.moneycontrol.com/india/stockpricequote/cementmajor/acc/ACC06,22412.41,16.73,613.52,140.0,1,29.76,71.38,11.54
4,https://www.moneycontrol.com/india/stockpricequote/trading/adanienterprises/AE13,15496.32,22.21,34.2,100.0,1,21.55,6.35,18.93
