# Indian States Web Scraping

In this notebook, we will **scrape** the webpage at https://www.mapsofindia.com/states/ using **Beautiful Soup** and load its table into a **Pandas DataFrame** for further cleaning and analysis.


#### Pandas : Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

#### Beautiful Soup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.



Web scraping falls under data collection, which is typically the first step of a machine learning or data science pipeline.

First, we will use **pip** (the package installer for Python, which downloads packages from PyPI) to install **Beautiful Soup**.

In [1]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


# Imports

Now we will import the modules required for scraping web pages.

**urllib** : urllib is a package in the Python standard library that collects several modules for working with URLs:
1. ***urllib.request*** (for opening and reading URLs)
2. ***urllib.error*** (containing the exceptions raised by urllib.request)
3. ***urllib.parse*** (for parsing URLs)
4. ***urllib.robotparser*** (for parsing robots.txt files)

In this project, we will use the **urllib.request** module to open the URL.
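As a small illustration of two of the other modules, the sketch below splits the target URL with **urllib.parse** and checks paths against a set of robots.txt rules with **urllib.robotparser**. The rules and the `/private/` path are made up for this example; they are not the site's actual robots.txt.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

url = 'https://www.mapsofindia.com/states/'

# Split the URL into its components (scheme, host, path, ...)
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Hypothetical robots.txt rules, used only for illustration
rules = ['User-agent: *', 'Disallow: /private/']
robots = RobotFileParser()
robots.parse(rules)

# Ask whether a generic crawler ('*') may fetch each URL under these rules
allowed = robots.can_fetch('*', url)
blocked = robots.can_fetch('*', 'https://www.mapsofindia.com/private/page')
print(allowed, blocked)
```

In a real project, `RobotFileParser.set_url()` followed by `read()` would load the rules from the site itself instead of a hard-coded list.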



In [2]:
from bs4 import BeautifulSoup
import urllib.request
import csv

Now we will define a few things which are needed for Web Scraping.

The **urllib.request** module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

Here we will use the **.urlopen()** function, which takes a URL as its argument, either as a **string** or as a **Request** object, and opens it. For **HTTP and HTTPS URLs**, this function returns a slightly modified **http.client.HTTPResponse** object.

Next we call the **BeautifulSoup()** constructor with the **HTTPResponse** that we got from **urllib** and the name of a parser, **'html.parser'**.
This returns a **BeautifulSoup object**, the parse tree created when the parser read the **HTML page** from the **HTTPResponse**.

In [3]:
url = 'https://www.mapsofindia.com/states/'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
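Since **.urlopen()** also accepts a **Request** object, a slightly more defensive variant of the cell above might look like the sketch below. The helper names `build_request` and `fetch_html`, the User-Agent string, and the timeout value are assumptions made for this example and are not part of the original notebook.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent='Mozilla/5.0 (compatible; notebook-example)'):
    """Wrap the URL in a Request object that carries a User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page, or None if the request fails."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first
        print(f'Server returned HTTP {err.code} for {url}')
    except urllib.error.URLError as err:
        print(f'Could not reach {url}: {err.reason}')
    return None


# Building the Request does not touch the network
request = build_request('https://www.mapsofindia.com/states/')
print(request.full_url)
```

The bytes returned by `fetch_html(url)` can be passed to `BeautifulSoup(..., 'html.parser')` in the same way as the response object, provided the result is not `None`.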

In [4]:
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>List of Indian States, Union Territories and Capitals In India Map</title>
<meta content="Get list of Indian states and union territories with detailed map. Detailed information about each state and union territories is also provided here." name="description"/>
<meta content="List of Indian States and union territories, India states and union territories, India states list. list of union territories" name="keywords"/>
<link href="https://m.mapsofindia.com/states/" media="only screen and (max-width:736px)" rel="alternate"/>
<link href="https://m.mapsofindia.com/states-ampcontent/" rel="amphtml"/>
<script src="https://www.mapsofindia.com/js_2009/style.js"></script>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<script src="https://www.mapsofindia.com/widgets/electionsutility/js/responsive-style.js"></script>
<link href="../style_2009/style-new.css

Next we use the **find()** method to find the table in the **soup** object. The **find()** method can be called with or without the **attrs** argument. The **attrs** argument stands for attributes and can be used to find an HTML tag with a particular set of attributes. For example, here we find the **table** tag with the attribute **class** set to **tableizer-table**.

In [5]:
table = soup.find('table', attrs={'class':'tableizer-table'})

This returns the first part of the HTML that matches the given tag and attributes, and we store it in the **table** variable.
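To see how **find()** behaves without fetching the page, here is a minimal sketch on a made-up HTML snippet. It shows the **attrs** form used above, the equivalent `class_` keyword, and what happens when nothing matches. The snippet and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables, only one of which has the class we want
html = """
<table class="other"><tr><td>ignored</td></tr></table>
<table class="tableizer-table"><tr><th>State</th></tr></table>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Two equivalent ways to match on the class attribute
by_attrs = demo_soup.find('table', attrs={'class': 'tableizer-table'})
by_keyword = demo_soup.find('table', class_='tableizer-table')

# find() returns None when no tag matches
missing = demo_soup.find('table', attrs={'class': 'no-such-class'})

print(by_attrs.th.getText())
print(by_attrs == by_keyword)
print(missing)
```

Because **find()** returns `None` when there is no match, checking the result before using it gives a clearer error if the page layout ever changes.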

In [6]:
print(table)

<table class="tableizer-table">
<tr class="tableizer-firstrow"><th>State</th><th>Capital</th><th>Area</th><th>Population</th><th>Official Languages</th><th>LargestCities </th><th>Districts<br/>/Admin<br/> divisions</th><th>Population Density</th><th>Literacy Rate%</th><th>Urban Pop.<br/>%</th><th>Sex Ratio</th><th>Estd <br/>Year</th></tr>
<tr><td><a href="https://www.mapsofindia.com/andhra-pradesh/" title="Andhra Pradesh">Andhra Pradesh</a></td><td>Hyderabad (De jure - 2 June 2024) Amaravati (proposed)</td><td>160,205 km2	</td><td>4,93,78,776</td><td>Telugu</td><td>Visakhapatnam,<br/>Vijayawada,<br/>Guntur</td><td>13</td><td> </td><td>67.66*</td><td>33.49*</td><td>992*</td><td>1956</td></tr>
<tr><td><a href="https://www.mapsofindia.com/arunachal-pradesh/">Arunachal Pradesh</a></td><td>Itanagar</td><td>83,743 km2 </td><td>1,382,611</td><td>English</td><td>Itanagar</td><td>17</td><td>17 /km2</td><td>66.95</td><td>22.67</td><td>920</td><td>1987</td></tr>
<tr><td><a href="https://www.mapso

Next, we will use the **find()** method again to find the header row, the **< tr >** tag with the class **tableizer-firstrow**, in the table and store it in the variable **headings**.

We need a list of column names for the **dataset**, so we use the **find_all()** method, which finds all the tags that match the given name and returns them in a list.

Here we use the variable **headingList** to store that list.

In [7]:
headings = table.find('tr', attrs={'class':'tableizer-firstrow'})
headingList = headings.find_all('th')
print(headingList)

[<th>State</th>, <th>Capital</th>, <th>Area</th>, <th>Population</th>, <th>Official Languages</th>, <th>LargestCities </th>, <th>Districts<br/>/Admin<br/> divisions</th>, <th>Population Density</th>, <th>Literacy Rate%</th>, <th>Urban Pop.<br/>%</th>, <th>Sex Ratio</th>, <th>Estd <br/>Year</th>]


Next we will get the column names from the list by using the **.getText()** method, which returns the text content of a tag, and store them in the **title** variable.

In [8]:
title=[]
for t in headingList:
    title.append(t.getText())
print(title)

['State', 'Capital', 'Area', 'Population', 'Official Languages', 'LargestCities ', 'Districts/Admin divisions', 'Population Density', 'Literacy Rate%', 'Urban Pop.%', 'Sex Ratio', 'Estd Year']
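Some of the names above keep a trailing space or lose the gap where a line break used to be. One possible way to tidy them, sketched here on three of the header cells copied from the table, is to pass a separator and `strip=True` to `get_text()` (the longer spelling of **getText()**). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Three header cells copied from the scraped table
header_html = ('<tr><th>LargestCities </th>'
               '<th>Districts<br/>/Admin<br/> divisions</th>'
               '<th>Urban Pop.<br/>%</th></tr>')
cells = BeautifulSoup(header_html, 'html.parser').find_all('th')

# Default behaviour: text pieces are joined as-is
raw_titles = [th.getText() for th in cells]

# Strip each text piece and join the pieces with a single space
clean_titles = [th.get_text(' ', strip=True) for th in cells]

print(raw_titles)
print(clean_titles)
```

Whether the extra space in names such as `Urban Pop. %` is preferable is a matter of taste; the point is only that the separator and `strip` arguments control how the pieces around each **< br >** tag are joined.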


This list will be used to provide the column names for the **Pandas** DataFrame that we create now.

In [9]:
import pandas as pd
dataset = pd.DataFrame()

Next, we use the same **find_all()** method to find all the **< tr >** tags, each of which represents one row of the table, and store them in the variable rows.

We iterate through the rows, skip the header row, use the same **getText()** method to get the content of each cell, and add it to the **dataset**. Rows that do not contain exactly 12 cells are reported as outliers and skipped.

In [11]:
rows = table.find_all('tr')
records = []
for row in rows[1:]:  # skip the header row
    info = row.find_all('td')
    if len(info) != 12:
        # Rows without exactly 12 cells do not fit the table layout
        print("Outlier")
    else:
        # Collect the text of every cell in this row
        records.append([cell.getText() for cell in info])

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
dataset = pd.DataFrame(records)

Outlier
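The single "Outlier" line says that one row did not have 12 cells, but not which one. A possible way to inspect such rows is sketched below on a small made-up table with two columns; the HTML, the expected cell count, and the variable names are assumptions for this example and say nothing about what the real outlier row contains.

```python
from bs4 import BeautifulSoup

# Made-up table: one regular row and one row that spans both columns
demo_html = """
<table>
<tr><th>State</th><th>Capital</th></tr>
<tr><td>Assam</td><td>Dispur</td></tr>
<tr><td colspan="2">Union Territories</td></tr>
</table>
"""
expected_cells = 2
demo_rows = BeautifulSoup(demo_html, 'html.parser').find_all('tr')

outliers = []
for index, row in enumerate(demo_rows[1:], start=1):
    cells = row.find_all('td')
    if len(cells) != expected_cells:
        # Keep the row position, its cell count, and its text for inspection
        outliers.append((index, len(cells), row.get_text(' ', strip=True)))

print(outliers)
```

Running the same loop over `rows` with `expected_cells = 12` would show the position and text of the skipped row in the real table.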


In [13]:
dataset.columns=title

In [14]:
dataset.head()

Unnamed: 0,State,Capital,Area,Population,Official Languages,LargestCities,Districts/Admin divisions,Population Density,Literacy Rate%,Urban Pop.%,Sex Ratio,Estd Year
0,Andhra Pradesh,Hyderabad (De jure - 2 June 2024) Amaravati (p...,"160,205 km2\t",49378776,Telugu,"Visakhapatnam,Vijayawada,Guntur",13,,67.66*,33.49*,992*,1956
1,Arunachal Pradesh,Itanagar,"83,743 km2",1382611,English,Itanagar,17,17 /km2,66.95,22.67,920,1987
2,Assam,Dispur,"78,438 km2",31169272,Assamese,"Guwahati, Silchar, Dibrugarh, Nagaon",33,397 /km2,73.18,14.08,954,1975
3,Bihar,Patna,"94,163 km2",103804637,Hindi,"Patna, Gaya, Bhagalpur, Muzaffarpur, Biharsharif",38,"1,102 /km2",63.82,11.30,916,1935
4,Chhattisgarh,Raipur,"135,191 km2",25540196,Chhattisgarhi,"Raipur, Bhilai Nagar, Korba, Bilaspur, Durg",27,189 /km2,71.04,23.24,991,2000


In [13]:
len(dataset)

36
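All columns in the scraped DataFrame are still strings. As a first cleaning step, a sketch along the following lines could turn a few of them into numbers. It works on a small sample copied from the first two rows of the table, the output column names are hypothetical, and it simply drops the `*` marker on some values without interpreting it.

```python
import pandas as pd

# Sample values copied from the first two rows of the scraped table
sample = pd.DataFrame({
    'Area': ['160,205 km2\t', '83,743 km2 '],
    'Population': ['4,93,78,776', '1,382,611'],
    'Literacy Rate%': ['67.66*', '66.95'],
})

cleaned = pd.DataFrame({
    # Remove the unit, the thousands separators, and surrounding whitespace
    'area_km2': (sample['Area']
                 .str.replace('km2', '', regex=False)
                 .str.replace(',', '', regex=False)
                 .str.strip()
                 .astype(int)),
    # Commas are dropped regardless of how the digits are grouped
    'population': (sample['Population']
                   .str.replace(',', '', regex=False)
                   .astype(int)),
    # Drop the trailing '*' marker before converting to float
    'literacy_rate': sample['Literacy Rate%'].str.rstrip('*').astype(float),
})

print(cleaned.dtypes)
```

Columns with empty cells, such as the missing population density in the first row, would need extra handling (for example `pd.to_numeric(..., errors='coerce')`) before a conversion like this.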