## <font color=green>_The Delhi Perspective_</font>

# 2. DATA

We will use data from the following sources and tools:

1. Web scraping the list of districts of Delhi from Wikipedia.
2. Web scraping population data for Delhi from other websites.
3. Using the Geocoder package to get latitudes and longitudes.
4. Using Foursquare location data to fetch details of nearby venues.

### 2.1 Data Collection

#### Import libraries:

In [31]:
# library for BeautifulSoup
from bs4 import BeautifulSoup

# library to handle data in a vectorized manner
import numpy as np

# libraries for data analysis and visualization
import seaborn as sns # for visualization
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# library to handle JSON files
import json
print('numpy, pandas, ..., imported...')

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
print('Nominatim imported...')

# library to handle requests
import requests
print('requests imported...')

# transform JSON data into a pandas dataframe
# (json_normalize is exposed at the top level of pandas since version 1.0)
from pandas import json_normalize
print('json_normalize imported...')

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplotlib imported...')

# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Kmeans imported...')

# import the Geocoder
import geocoder

# import time
import time

import folium # map rendering library
print('folium imported...')
print('...Done')


numpy, pandas, ..., imported...
Nominatim imported...
requests imported...
json_normalize imported...
matplotlib imported...
Kmeans imported...
folium imported...
...Done


In [2]:
# URL of the census page listing the districts of Delhi
link = "https://www.census2011.co.in/census/state/districtlist/delhi.html"
page = requests.get(link)
page

<Response [200]>

In [3]:
# Parse the HTML of the page
soup = BeautifulSoup(page.content, 'html.parser')
# Extract the first table found in the page
my_table=soup.find("table")
my_table

<table>
<thead>
<tr>
<th>#</th>
<th class="alignleft">District</th>
<th>Sub-Districts</th>
<th>Population</th>
<th>Increase</th>
<th>Sex Ratio</th>
<th>Literacy</th>
<th>Density</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td class="alignleft"><a href="/census/district/168-north-west-delhi.html">North West Delhi</a></td>
<td><a href="/data/district/168-north-west-delhi-delhi.html">List</a></td>
<td>3,656,539</td>
<td>27.81 %</td>
<td>865</td>
<td>84.45 %</td>
<td>8254</td>
</tr>
<tr>
<td>2</td>
<td class="alignleft"><a href="/census/district/176-south-delhi.html">South Delhi</a></td>
<td><a href="/data/district/176-south-delhi-delhi.html">List</a></td>
<td>2,731,929</td>
<td>20.51 %</td>
<td>862</td>
<td>86.57 %</td>
<td>11060</td>
</tr>
<tr>
<td>3</td>
<td class="alignleft"><a href="/census/district/174-west-delhi.html">West Delhi</a></td>
<td><a href="/data/district/174-west-delhi-delhi.html">List</a></td>
<td>2,543,243</td>
<td>19.46 %</td>
<td>875</td>
<td>86.98 %</td>
<td>19563</t

In [4]:
# Extracts all "tr" (table rows) within the table above
rows=my_table.find_all("tr")

In [5]:
# Extract the column headers from the "th" tags, removing any '\n' characters
columns = [i.text.replace('\n', '') for i in rows[0].find_all("th")]
columns

['#',
 'District',
 'Sub-Districts',
 'Population',
 'Increase',
 'Sex Ratio',
 'Literacy',
 'Density']

In [6]:
# Extract the cells of every row and build a dataframe
l = []
for tr in rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    l.append(row)
del l[0]  # the header row has no "td" cells, so its entry is an empty list
df = pd.DataFrame(l, columns=columns, index=None)
df

Unnamed: 0,#,District,Sub-Districts,Population,Increase,Sex Ratio,Literacy,Density
0,1,North West Delhi,List,3656539.0,27.81 %,865.0,84.45 %,8254.0
1,2,South Delhi,List,2731929.0,20.51 %,862.0,86.57 %,11060.0
2,3,West Delhi,List,2543243.0,19.46 %,875.0,86.98 %,19563.0
3,4,South West Delhi,List,2292958.0,30.65 %,840.0,88.28 %,5446.0
4,5,North East Delhi,List,2241624.0,26.78 %,886.0,83.09 %,36155.0
5,6,East Delhi,List,1709346.0,16.79 %,884.0,89.31 %,27132.0
6,7,North Delhi,List,887978.0,13.62 %,869.0,86.85 %,14557.0
7,\n\n\n (adsbygoogle = window.adsbygoogle |...,,,,,,,
8,8,Central Delhi,List,582320.0,-9.91 %,892.0,85.14 %,27730.0
9,9,New Delhi,List,142004.0,-20.72 %,822.0,88.34 %,4057.0
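
Row 7 in the output above is not a district: it is an advertisement row embedded in the table body. One way to keep such rows out of the dataframe is to accept only rows whose cell count matches the header count. The following is a minimal sketch on a hand-written, abbreviated sample modelled on the census table; the sample HTML and the names `sample_html`, `records` and `census_sample` are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Abbreviated, hand-written sample modelled on the census table above
# (illustration only; the middle row stands in for the injected ad row).
sample_html = """
<table>
<thead><tr><th>#</th><th>District</th><th>Population</th></tr></thead>
<tbody>
<tr><td>1</td><td>North West Delhi</td><td>3,656,539</td></tr>
<tr><td colspan="3"><script>(adsbygoogle = window.adsbygoogle || []).push({});</script></td></tr>
<tr><td>2</td><td>South Delhi</td><td>2,731,929</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Header names, with surrounding whitespace removed
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows whose cell count matches the header count. This skips
    # the header row (it has no <td> cells) and the single-cell ad row.
    if len(cells) == len(headers):
        records.append(cells)

census_sample = pd.DataFrame(records, columns=headers)
print(census_sample)
```

Filtering at parse time removes the need for the later `del l[0]` and `dropna` steps, at the cost of silently skipping any legitimate row that happens to have a different number of cells.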


In [7]:
# The "#" and "Sub-Districts" columns do not contain any useful information, so we drop them.
df.drop(["#","Sub-Districts"],axis=1,inplace=True)
df

Unnamed: 0,District,Population,Increase,Sex Ratio,Literacy,Density
0,North West Delhi,3656539.0,27.81 %,865.0,84.45 %,8254.0
1,South Delhi,2731929.0,20.51 %,862.0,86.57 %,11060.0
2,West Delhi,2543243.0,19.46 %,875.0,86.98 %,19563.0
3,South West Delhi,2292958.0,30.65 %,840.0,88.28 %,5446.0
4,North East Delhi,2241624.0,26.78 %,886.0,83.09 %,36155.0
5,East Delhi,1709346.0,16.79 %,884.0,89.31 %,27132.0
6,North Delhi,887978.0,13.62 %,869.0,86.85 %,14557.0
7,,,,,,
8,Central Delhi,582320.0,-9.91 %,892.0,85.14 %,27730.0
9,New Delhi,142004.0,-20.72 %,822.0,88.34 %,4057.0


In [8]:
# Remove any row that contains missing values (here, the advertisement row)
df = df.dropna(how='any', axis=0)
df

Unnamed: 0,District,Population,Increase,Sex Ratio,Literacy,Density
0,North West Delhi,3656539,27.81 %,865,84.45 %,8254
1,South Delhi,2731929,20.51 %,862,86.57 %,11060
2,West Delhi,2543243,19.46 %,875,86.98 %,19563
3,South West Delhi,2292958,30.65 %,840,88.28 %,5446
4,North East Delhi,2241624,26.78 %,886,83.09 %,36155
5,East Delhi,1709346,16.79 %,884,89.31 %,27132
6,North Delhi,887978,13.62 %,869,86.85 %,14557
8,Central Delhi,582320,-9.91 %,892,85.14 %,27730
9,New Delhi,142004,-20.72 %,822,88.34 %,4057
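
The scraped values are text, so they need to be converted to numbers before they can be plotted or compared. The following is a minimal sketch of that conversion, assuming the Population strings use commas as thousands separators (as in the raw HTML above) and the percentage columns end in " %". The `raw` and `clean` names and the two sample rows are illustrative.

```python
import pandas as pd

# Two sample rows, kept as strings the way a scrape returns them.
raw = pd.DataFrame({
    "District": ["North West Delhi", "New Delhi"],
    "Population": ["3,656,539", "142,004"],
    "Increase": ["27.81 %", "-20.72 %"],
    "Literacy": ["84.45 %", "88.34 %"],
    "Density": ["8254", "4057"],
})

clean = raw.copy()

# Remove thousands separators, then convert to integers
clean["Population"] = (
    clean["Population"].str.replace(",", "", regex=False).astype(int)
)

# Remove the percent sign and surrounding spaces, then convert to floats
for col in ["Increase", "Literacy"]:
    clean[col] = (
        clean[col].str.replace("%", "", regex=False).str.strip().astype(float)
    )

clean["Density"] = clean["Density"].astype(int)

print(clean.dtypes)
```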


### Let's get the headquarters and sub-divisions of the districts of Delhi by web scraping

In [9]:
link=("https://en.wikipedia.org/wiki/List_of_districts_of_Delhi")
wikipedia_page = requests.get(link)
# Parse the HTML of the page
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# Extract the first table whose class is "wikitable"
table = soup.find('table', {'class':'wikitable'})
table

<table border="0" cellpadding="1" cellspacing="1" class="wikitable" style="border:1px solid black; background-color: Solid White">
<tbody><tr>
<th style="background-color:#99CCFF">Sl.No.
</th>
<th style="background-color:#99CCFF">District
</th>
<th style="background-color:#99CCFF">Headquarters
</th>
<th colspan="3" style="background-color:#99CCFF">Sub divisions (Tehsils)
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/New_Delhi" title="New Delhi">New Delhi</a></td>
<td><a href="/wiki/Connaught_Place,_New_Delhi" title="Connaught Place, New Delhi">Connaught Place</a>
</td>
<td><a href="/wiki/Chanakyapuri" title="Chanakyapuri">Chanakyapuri</a>
</td>
<td><a href="/wiki/Delhi_Cantonment" title="Delhi Cantonment">Delhi Cantonment</a>
</td>
<td><a class="mw-redirect" href="/wiki/Vasant_Vihar" title="Vasant Vihar">Vasant Vihar</a>
</td></tr>
<tr>
<td>2
</td>
<td><a href="/wiki/North_Delhi" title="North Delhi">North Delhi</a></td>
<td><a href="/wiki/Alipur,_Delhi" title="Alipur, Delhi">Alipur</a

In [10]:
# Extract all "tr" (table rows) within the table above
rows = table.find_all('tr')
# Extract the column headers from the "th" tags, removing any '\n' characters
columns1 = [i.text.replace('\n', '') for i in rows[0].find_all("th")]
columns1

['Sl.No.', 'District', 'Headquarters', 'Sub divisions (Tehsils)']

In [11]:
# Extract the cells of every row and build a dataframe
l = []
for tr in rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    l.append(row)
del l[0]  # the header row has no "td" cells, so its entry is an empty list
# Split each row into serial number, district, headquarter and sub-divisions
Serial = [item[0] for item in l]
District = [item[1] for item in l]
Headquarter = [item[2] for item in l]
Sub_divisions = [item[3:] for item in l]
df1 = pd.DataFrame()
df1['Serial'] = Serial
df1['District'] = District
df1['Headquarter'] = Headquarter
df1["Sub_divisions"] = Sub_divisions
df1

Unnamed: 0,Serial,District,Headquarter,Sub_divisions
0,1\n,New Delhi,Connaught Place\n,"[Chanakyapuri\n, Delhi Cantonment\n, Vasant Vi..."
1,2\n,North Delhi,Alipur\n,"[Model Town[3]\n, Narela\n, Alipur\n]"
2,3\n,North West Delhi,Kanjhawala\n,"[Rohini\n, Kanjhawala\n, Saraswati Vihar\n]"
3,4\n,West Delhi,Rajouri Garden\n,"[Patel Nagar\n, Punjabi Bagh\n, Rajouri Garden\n]"
4,5\n,South West Delhi,Dwarka\n,"[Dwarka\n, Najafgarh\n, Kapashera\n]"
5,6\n,South Delhi,Saket\n,"[Saket\n, Hauz Khas\n, Mehrauli\n]"
6,7\n,South East Delhi,Defence Colony\n,"[Defence Colony\n, Kalkaji\n, Sarita Vihar\n]"
7,8\n,Central Delhi,Daryaganj\n,"[Kotwali\n, Civil Lines\n, Karol Bagh\n]"
8,9\n,North East Delhi,Nand Nagri\n,"[Seelampur\n, Yamuna Vihar\n, Karawal Nagar\n]"
9,10\n,Shahdara,Shahdara\n,"[Shahdara\n, Seemapuri\n, Vivek Vihar\n]"


The "Serial" and "Sub_divisions" columns are not required at this stage, so we will drop them.

In [12]:
df1.drop(["Sub_divisions","Serial"],axis=1,inplace=True)
df1

Unnamed: 0,District,Headquarter
0,New Delhi,Connaught Place\n
1,North Delhi,Alipur\n
2,North West Delhi,Kanjhawala\n
3,West Delhi,Rajouri Garden\n
4,South West Delhi,Dwarka\n
5,South Delhi,Saket\n
6,South East Delhi,Defence Colony\n
7,Central Delhi,Daryaganj\n
8,North East Delhi,Nand Nagri\n
9,Shahdara,Shahdara\n


Looking at the data, every value in the Headquarter column ends with a newline character ("\n") carried over from the HTML of the Wikipedia page, so we strip it.

In [13]:
df1['Headquarter'] = df1['Headquarter'].map(lambda x: x.rstrip('\n'))
df1

Unnamed: 0,District,Headquarter
0,New Delhi,Connaught Place
1,North Delhi,Alipur
2,North West Delhi,Kanjhawala
3,West Delhi,Rajouri Garden
4,South West Delhi,Dwarka
5,South Delhi,Saket
6,South East Delhi,Defence Colony
7,Central Delhi,Daryaganj
8,North East Delhi,Nand Nagri
9,Shahdara,Shahdara
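
Besides trailing newlines, the Wikipedia output above also contains a footnote marker ("Model Town[3]"). The following is a minimal sketch that removes both in one pass using pandas string methods. The `scraped` and `cleaned` names are illustrative, and the pattern assumes footnote markers are purely numeric.

```python
import pandas as pd

# Values copied from the scraped Wikipedia output above
scraped = pd.Series(["Connaught Place\n", "Alipur\n", "Model Town[3]\n"])

cleaned = (
    scraped
    .str.replace(r"\[\d+\]", "", regex=True)  # drop footnote markers such as [3]
    .str.strip()                               # drop leading/trailing whitespace, including "\n"
)

print(cleaned.tolist())  # ['Connaught Place', 'Alipur', 'Model Town']
```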


In [14]:
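# Drop South East Delhi (position 6) and Shahdara (position 9), which do not appear
# in the census table, then reset the index so both dataframes line up.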
df2=df1.drop(df1.index[[9,6]])
df2=df2.reset_index(drop=True)
df2

Unnamed: 0,District,Headquarter
0,New Delhi,Connaught Place
1,North Delhi,Alipur
2,North West Delhi,Kanjhawala
3,West Delhi,Rajouri Garden
4,South West Delhi,Dwarka
5,South Delhi,Saket
6,Central Delhi,Daryaganj
7,North East Delhi,Nand Nagri
8,East Delhi,Preet Vihar
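
Dropping rows by position depends on the row order of the Wikipedia table, which can change. An alternative is to keep only the districts whose names also occur in the census dataframe. The following is a minimal sketch with small stand-in dataframes; `census`, `wiki`, `wiki_matched` and `dropped` are illustrative names, and the approach assumes both sources spell each district name identically once surrounding whitespace is removed.

```python
import pandas as pd

# Small stand-ins for df (census) and df1 (Wikipedia); names taken from the tables above
census = pd.DataFrame({"District": ["New Delhi", "North Delhi", "East Delhi"]})
wiki = pd.DataFrame({
    "District": ["New Delhi", "North Delhi", "South East Delhi", "Shahdara", "East Delhi"],
    "Headquarter": ["Connaught Place", "Alipur", "Defence Colony", "Shahdara", "Preet Vihar"],
})

# Boolean mask: True where the Wikipedia district also exists in the census data
keep = wiki["District"].str.strip().isin(census["District"].str.strip())

wiki_matched = wiki[keep].reset_index(drop=True)

# Report what was left out, so the exclusion is visible rather than implicit
dropped = wiki.loc[~keep, "District"].tolist()
print(dropped)  # ['South East Delhi', 'Shahdara']
```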


In [15]:
df = df.reset_index(drop=True)
df

Unnamed: 0,District,Population,Increase,Sex Ratio,Literacy,Density
0,North West Delhi,3656539,27.81 %,865,84.45 %,8254
1,South Delhi,2731929,20.51 %,862,86.57 %,11060
2,West Delhi,2543243,19.46 %,875,86.98 %,19563
3,South West Delhi,2292958,30.65 %,840,88.28 %,5446
4,North East Delhi,2241624,26.78 %,886,83.09 %,36155
5,East Delhi,1709346,16.79 %,884,89.31 %,27132
6,North Delhi,887978,13.62 %,869,86.85 %,14557
7,Central Delhi,582320,-9.91 %,892,85.14 %,27730
8,New Delhi,142004,-20.72 %,822,88.34 %,4057


Let's merge the two dataframes, "df2" and "df", on the District column for further analysis.

In [16]:
result=pd.merge(df2,df,on="District")
result

Unnamed: 0,District,Headquarter,Population,Increase,Sex Ratio,Literacy,Density
0,New Delhi,Connaught Place,142004,-20.72 %,822,88.34 %,4057
1,North Delhi,Alipur,887978,13.62 %,869,86.85 %,14557
2,North West Delhi,Kanjhawala,3656539,27.81 %,865,84.45 %,8254
3,West Delhi,Rajouri Garden,2543243,19.46 %,875,86.98 %,19563
4,South Delhi,Saket,2731929,20.51 %,862,86.57 %,11060
5,Central Delhi,Daryaganj,582320,-9.91 %,892,85.14 %,27730
6,North East Delhi,Nand Nagri,2241624,26.78 %,886,83.09 %,36155
7,East Delhi,Preet Vihar,1709346,16.79 %,884,89.31 %,27132
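
Note that the merged table has eight rows although both inputs have nine: South West Delhi is missing. An inner merge silently drops keys that do not match exactly, and the output alone does not show why the two spellings differ. The following is a minimal sketch for surfacing such mismatches with an outer merge and the `indicator` flag. The trailing space in the sample is a hypothetical cause chosen for illustration, and `left`, `right`, `check`, `unmatched` and `fixed` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({
    "District": ["New Delhi", "South West Delhi "],  # trailing space: hypothetical mismatch
    "Headquarter": ["Connaught Place", "Dwarka"],
})
right = pd.DataFrame({
    "District": ["New Delhi", "South West Delhi"],
    "Population": [142004, 2292958],
})

# An outer merge keeps every key; the _merge column says where each key was found
check = pd.merge(left, right, on="District", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["District", "_merge"]])

# After normalising the key, every row pairs up.
# validate raises an error if a key is duplicated on either side.
left["District"] = left["District"].str.strip()
fixed = pd.merge(left, right, on="District", how="inner", validate="one_to_one")
print(fixed)
```

Printing `repr()` of the suspect names in the real dataframes is a quick way to see invisible characters before deciding how to normalise them.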


In [33]:
result.to_csv("DelhiPerspectiveData.csv")

We have created a dataframe containing the districts and headquarters of Delhi along with their population, the increase in population since the previous census (the data comes from the 2011 census of Delhi), sex ratio, literacy and density.

- The initial preparation of a dataframe for Delhi is complete.
- The next step is to analyse each feature in the dataframe through data visualization.
- Further steps include using the Geocoder package to obtain latitudes and longitudes.
- After that, Foursquare location data will be used to fetch venue details and related information.
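
As a preview of the geocoding step, the following is a minimal sketch of looking up coordinates for a list of place names. It accepts any geopy-style `geocode` callable that returns an object with `latitude` and `longitude` attributes, or `None` when nothing is found. The function name `geocode_places`, the `suffix` default and the pause length are assumptions made for illustration, and the notebook's own approach may differ.

```python
import time


def geocode_places(names, geocode, suffix=", Delhi, India", pause_seconds=1.0):
    """Return {name: (latitude, longitude) or None} for each place name.

    `geocode` is any callable that takes a query string and returns an object
    with .latitude and .longitude, or None if the place is not found.
    """
    coords = {}
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            coords[name] = None  # keep the miss visible instead of skipping it
        else:
            coords[name] = (location.latitude, location.longitude)
        time.sleep(pause_seconds)  # pause between requests to a public service
    return coords


# Usage with geopy (needs network access; the user_agent value is a placeholder):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="delhi-perspective-example")
# coords = geocode_places(result["Headquarter"], geolocator.geocode)
```

Passing the geocoder in as an argument keeps the function testable without network access, and recording `None` for misses makes it easy to spot headquarters that need a manual lookup.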