## Scraping Information on Indian Politicians by City

## Project Outline

- Scrape "https://en.wikipedia.org/wiki/Category:Indian_politicians_by_city_or_town".
- The page gives a list of sub-categories in alphabetical order; each alphabetical group contains further sub-categories, one per city.
- Each city sub-category lists the politicians from that city, along with the URL of each politician's page.
- For each politician, collect the following info:
    - Birth date
    - Death date
    - Image URL
    - If any of these is missing, use 0 in its place
- Finally, store the information in Excel format.
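
The "use 0 for anything missing" rule from the outline can be sketched as a small helper. The name `fill_missing` and the `FIELDS` tuple are illustrative assumptions and do not appear in the notebook itself.

```python
# Illustrative helper (hypothetical name): apply the "missing -> 0" rule to one record.
FIELDS = ("Name", "Birth Date", "Death Date", "Image URL")

def fill_missing(record):
    """Return a copy of record with every expected field present; gaps become 0."""
    filled = {}
    for field in FIELDS:
        value = record.get(field)
        # Absent keys, None and empty strings all count as missing
        filled[field] = value if value not in (None, "") else 0
    return filled

row = fill_missing({"Name": "Narhari_Amin", "Birth Date": "1955-06-05"})
print(row)
```

Keeping the rule in one place means every row written to Excel has the same four columns, whatever the source page provided.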


## Getting the Web Page First

In [2]:
import requests

In [3]:
main_URL="https://en.wikipedia.org/wiki/Category:Indian_politicians_by_city_or_town"

In [4]:
response=requests.get(main_URL)

In [5]:
# Response is successful
response.status_code

200
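
A single request returning 200 is fine here, but the later steps fetch many pages in a loop, where an occasional failure is more likely. Below is a minimal retry sketch. `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `fetch` stands in for any callable that returns the page text or raises on failure.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # pause before the next try
    raise last_error

# Stand-in fetcher that fails twice and then succeeds (no network involved)
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

html = fetch_with_retries(flaky_fetch, "https://example.org", delay=0)
print(html, len(calls))
```

With `requests`, `fetch` could be a small function that calls `requests.get(url, timeout=10)`, then `raise_for_status()` on the response, and returns its `text`.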

In [6]:
page_content= response.text

In [7]:
# Page Content is in HTML Format
page_content[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Category:Indian politicians by city or town - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b0c45d07-9b09-4945-a292-7e0da8b436c8","wgCSPNonce":false,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":14,"wgPageName":"Category:Indian_politicians_by_city_or_town","wgTitle":"Indian politicians by city or town","wgCurRevisionId":811748416,"wgRevisionId":811748416,"wgArticleId":47585689,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Container categories","Indian people by occupation and cit

In [21]:
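# Save the page as UTF-8 text; writing str(bytes) would store the b'...' representation instead of the HTML
with open('Indian Politicians by cities WebPage.html','w',encoding='utf-8') as file:
    file.write(page_content)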

## Using Beautiful Soup To Parse and extract the information

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_content, 'html.parser')

In [10]:
a_tags=doc.select('div.CategoryTreeItem a')

In [11]:
len(a_tags)

50

In [12]:
# Skip the first three anchors; the city sub-categories start at index 3
a_tags=a_tags[3:]

In [13]:
# URLs of the first ten sub-categories found
for i in range(10):
    print(i+1,a_tags[i]['href'])

1 /wiki/Category:Politicians_from_Ahmedabad
2 /wiki/Category:Politicians_from_Alappuzha
3 /wiki/Category:Politicians_from_Amritsar
4 /wiki/Category:Politicians_from_Bangalore
5 /wiki/Category:Politicians_from_Bhopal
6 /wiki/Category:Politicians_from_Bhubaneswar
7 /wiki/Category:Chandigarh_politicians
8 /wiki/Category:Politicians_from_Chennai
9 /wiki/Category:Politicians_from_Coimbatore
10 /wiki/Category:Politicians_from_Dehradun


In [14]:
subCategory_URLS=[]
for i in range(len(a_tags)):
    subCategory_URLS.append('https://en.wikipedia.org'+(a_tags[i]['href']))

In [15]:
# The final URLs look like this:
subCategory_URLS[0]

'https://en.wikipedia.org/wiki/Category:Politicians_from_Ahmedabad'
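
String concatenation works because every `href` here is a root-relative path. As an alternative sketch, the standard library's `urljoin` also handles links that are already absolute. `BASE_URL` and `absolute_url` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def absolute_url(href):
    """Resolve a link taken from a Wikipedia page against the site root."""
    return urljoin(BASE_URL, href)

relative = absolute_url("/wiki/Category:Politicians_from_Ahmedabad")
already_absolute = absolute_url("https://en.wikipedia.org/wiki/Violet_Alva")
print(relative)
print(already_absolute)
```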

In [16]:
subCategory_titles=[]
for i in range(len(a_tags)):
    subCategory_titles.append(a_tags[i].text)

In [17]:
subCategory_titles[0]

'Politicians from Ahmedabad'

In [18]:
# Named data rather than dict so that the built-in dict type is not shadowed
data={'Titles':subCategory_titles,'URLS':subCategory_URLS}

In [19]:
import pandas as pd

In [20]:
df=pd.DataFrame(data)

In [21]:
df.head()

| | Titles | URLS |
|---|---|---|
| 0 | Politicians from Ahmedabad | https://en.wikipedia.org/wiki/Category:Politic... |
| 1 | Politicians from Alappuzha | https://en.wikipedia.org/wiki/Category:Politic... |
| 2 | Politicians from Amritsar | https://en.wikipedia.org/wiki/Category:Politic... |
| 3 | Politicians from Bangalore | https://en.wikipedia.org/wiki/Category:Politic... |
| 4 | Politicians from Bhopal | https://en.wikipedia.org/wiki/Category:Politic... |


In [23]:
df.to_excel('Sub-Categories Urls.xlsx')

## The Excel File with the URLs of Every Sub-Category Is Now Created

## Moving On to Extract the URLs Within These Sub-Categories

In [658]:
sub_url=subCategory_URLS[0]
sub_url

'https://en.wikipedia.org/wiki/Category:Politicians_from_Ahmedabad'

In [659]:
res=requests.get(sub_url)

In [660]:
res.status_code

200

In [661]:
sub_doc=BeautifulSoup(res.text,'html.parser')

In [662]:
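# Collect the links inside the category listing.
# Each matching div overwrites the previous result, so only the last matching group is kept.
sub_a_tags=[]
for groups in sub_doc.find_all('div',{'class':'mw-category mw-category-columns'}):
    sub_a_tags=groups.find_all('a')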

In [663]:
len(sub_a_tags)

28

In [669]:
# First five politician links for this city
sub_a_tags[:5]

[<a href="/wiki/Violet_Alva" title="Violet Alva">Violet Alva</a>,
 <a href="/wiki/Narhari_Amin" title="Narhari Amin">Narhari Amin</a>,
 <a href="/wiki/Navin_Chandra_Barot" title="Navin Chandra Barot">Navin Chandra Barot</a>,
 <a href="/wiki/Ashok_Bhatt" title="Ashok Bhatt">Ashok Bhatt</a>,
 <a href="/wiki/Brahmkumar_Bhatt" title="Brahmkumar Bhatt">Brahmkumar Bhatt</a>]
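
The anchors above all point to article pages. If a listing ever mixed in other kinds of links, a simple filter could screen them out. This is a heuristic sketch only: `is_article_href` is a hypothetical helper, and the rule "no colon in the title" would also reject the few real articles whose titles contain a colon.

```python
def is_article_href(href):
    """Heuristic: True for links like /wiki/Violet_Alva, False for namespaced pages."""
    prefix = "/wiki/"
    if not href.startswith(prefix):
        return False
    title = href[len(prefix):]
    # Namespaced pages look like "Category:..." or "File:..."; skip them
    return bool(title) and ":" not in title

hrefs = [
    "/wiki/Violet_Alva",
    "/wiki/Category:Politicians_from_Ahmedabad",
    "/w/index.php?title=Special:Search",
]
article_hrefs = [h for h in hrefs if is_article_href(h)]
print(article_hrefs)
```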

In [665]:
politicians_URL=[]
for a in sub_a_tags:
    politicians_URL.append("https://en.wikipedia.org"+a['href'])

In [666]:
politicians_URL

['https://en.wikipedia.org/wiki/Violet_Alva',
 'https://en.wikipedia.org/wiki/Narhari_Amin',
 'https://en.wikipedia.org/wiki/Navin_Chandra_Barot',
 'https://en.wikipedia.org/wiki/Ashok_Bhatt',
 'https://en.wikipedia.org/wiki/Brahmkumar_Bhatt',
 'https://en.wikipedia.org/wiki/I._I._Chundrigar',
 'https://en.wikipedia.org/wiki/Kantilal_Ghia',
 'https://en.wikipedia.org/wiki/Harihar_Khambholja',
 'https://en.wikipedia.org/wiki/Purushottam_Mavalankar',
 'https://en.wikipedia.org/wiki/Sushila_Ganesh_Mavalankar',
 'https://en.wikipedia.org/wiki/Chhabildas_Mehta',
 'https://en.wikipedia.org/wiki/Dilip_Parikh',
 'https://en.wikipedia.org/wiki/Babubhai_Patel_(politician)',
 'https://en.wikipedia.org/wiki/Bhupendrabhai_Patel',
 'https://en.wikipedia.org/wiki/Kamlesh_Patel_(politician)',
 'https://en.wikipedia.org/wiki/Siddharth_Patel',
 'https://en.wikipedia.org/wiki/Suresh_Patel',
 'https://en.wikipedia.org/wiki/Vijay_Patel_(politician)',
 'https://en.wikipedia.org/wiki/Harin_Pathak',
 'https:/

In [667]:
politician_url=politicians_URL[23]
politician_url

'https://en.wikipedia.org/wiki/Gabhaji_Thakor'
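
The last path segment of a URL like this one doubles as the politician's name. Here is a minimal sketch of turning it into readable text with the standard library; `display_name` is an illustrative name, and replacing underscores with spaces is an assumption about how the name should be shown.

```python
from urllib.parse import unquote, urlsplit

def display_name(article_url):
    """Turn the last path segment of an article URL into a readable name."""
    path = urlsplit(article_url).path
    title = path.rsplit("/", 1)[-1]
    # Decode percent-escapes and replace underscores with spaces
    return unquote(title).replace("_", " ")

first = display_name("https://en.wikipedia.org/wiki/Gabhaji_Thakor")
second = display_name("https://en.wikipedia.org/wiki/Babubhai_Patel_(politician)")
print(first)
print(second)
```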

## Extracting the Info of Every Politician

In [24]:
def get_politician_info(politician_url):
    """Return the name, birth date, death date and image URL for one politician page.

    Any value that cannot be found is reported as 0.
    """
    res=requests.get(politician_url)
    politician_doc=BeautifulSoup(res.text,'html.parser')

    # Name of the politician: the last path segment of the URL
    name=politician_url.rsplit('/',1)[-1]

    # Image URL: the first link inside the infobox image cell
    base_url='https://en.wikipedia.org'
    image_links=politician_doc.select('td.infobox-image a')
    image_Url=base_url+image_links[0]['href'] if image_links else 0

    # Infobox labels and data cells; the lookups below assume they line up by position
    labelsClass=politician_doc.find_all('th',{'class':'infobox-label'})
    dataClass=politician_doc.find_all('td',{'class':'infobox-data'})

    def infobox_text(label):
        # Text of the data cell paired with the given label, or None if the label is absent
        header=politician_doc.find('th',{'class':'infobox-label'},string=label)
        if header is None:
            return None
        return dataClass[labelsClass.index(header)].text

    # Birth date: prefer the machine-readable span.bday, else the raw text of the "Born" cell
    bday=politician_doc.select('span.bday')
    if bday:
        birthdate=bday[0].text
    else:
        born_text=infobox_text('Born')
        birthdate=0 if born_text is None else born_text

    # Death date: the text inside the first pair of parentheses in the "Died" cell,
    # or the whole cell text when there is no such pair
    deathdate=0
    died_text=infobox_text('Died')
    if died_text is not None:
        start=died_text.find('(')
        end=died_text.find(')',start+1) if start!=-1 else -1
        deathdate=died_text[start+1:end] if end!=-1 else died_text

    return {"Name" : name,"Birth Date":birthdate,"Death Date":deathdate,"Image URL":image_Url}

In [567]:
get_politician_info(politician_url)

{'Name': 'Gabhaji_Thakor',
 'Birth Date': '15 June 1933Amdavad (Gujarat)',
 'Death Date': '2 January 2017',
 'Image URL': 0}
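
In this output the birth date arrives fused with the birthplace (`'15 June 1933Amdavad (Gujarat)'`) because the whole infobox cell is returned when no `span.bday` is present. A possible clean-up step is sketched below. `to_iso_date` is a hypothetical helper, it only recognises the "day Month year" layout, and it assumes English month names (the default C locale for `strptime`).

```python
import re
from datetime import datetime

# Matches dates such as "15 June 1933" or "2 January 2017"
DATE_PATTERN = re.compile(r"(\d{1,2}) ([A-Z][a-z]+) (\d{4})")

def to_iso_date(text):
    """Return the first 'day Month year' date in text as YYYY-MM-DD, or 0 if none parses."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return 0
    try:
        parsed = datetime.strptime(" ".join(match.groups()), "%d %B %Y")
    except ValueError:  # e.g. the capitalised word is not a month name
        return 0
    return parsed.date().isoformat()

birth = to_iso_date("15 June 1933Amdavad (Gujarat)")
death = to_iso_date("2 January 2017")
print(birth, death)
```

Returning 0 on failure keeps the helper consistent with the notebook's convention for missing values.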

In [568]:
politician_info_dict={
    "Name":[],
    "Birth Date":[],
    "Death Date":[],
    "Image URL":[]
}

for i in range(len(politicians_URL)):
    info=get_politician_info(politicians_URL[i])
    politician_info_dict['Name'].append(info['Name'])
    politician_info_dict['Birth Date'].append(info['Birth Date'])
    politician_info_dict['Death Date'].append(info['Death Date'])
    politician_info_dict['Image URL'].append(info['Image URL'])

In [580]:
politician_info_df=pd.DataFrame(politician_info_dict)
politician_info_df

| | Name | Birth Date | Death Date | Image URL |
|---|---|---|---|---|
| 0 | Violet_Alva | 1908-04-24 | 1969-11-20 | https://en.wikipedia.org/wiki/File:Joachim_and... |
| 1 | Narhari_Amin | 1955-06-05 | 0 | https://en.wikipedia.org/wiki/File:Narhari_Ami... |
| 2 | Navin_Chandra_Barot | 0 | 0 | 0 |
| 3 | Ashok_Bhatt | 1939-01-28 | 2010-09-29 | 0 |
| 4 | Brahmkumar_Bhatt | 1921-10-08 | 2009-01-06 | 0 |


## Finally, Convert the Data to Excel Files for Each City

In [38]:
def generate_excel_Files(subCategoryTitles,subCategoryURLS):
    """Write one Excel file per city sub-category, named after the sub-category title."""
    for i in range(len(subCategoryURLS)):
        # Use the function parameters here, not the notebook-level globals
        print(i,') Working on File :',subCategoryTitles[i])
        sub_url=subCategoryURLS[i]
        res=requests.get(sub_url)
        sub_doc=BeautifulSoup(res.text,'html.parser')

        # Links to the politician pages; only the last matching group is kept
        sub_a_tags=[]
        for groups in sub_doc.find_all('div',{'class':'mw-category mw-category-columns'}):
            sub_a_tags=groups.find_all('a')

        politicians_URL=["https://en.wikipedia.org"+a['href'] for a in sub_a_tags]

        politician_info_dict={
            "Name":[],
            "Birth Date":[],
            "Death Date":[],
            "Image URL":[]
        }

        # Collect the info of every politician in this city
        for politician_url in politicians_URL:
            info=get_politician_info(politician_url)
            for key in politician_info_dict:
                politician_info_dict[key].append(info[key])

        politician_info_df=pd.DataFrame(politician_info_dict)

        fileName=subCategoryTitles[i]+".xlsx"
        politician_info_df.to_excel(fileName)
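
The file name is built directly from the sub-category title. The titles seen in this run are safe, but a title containing characters such as `/` or `:` would not be a valid file name on every operating system. A cautious sketch follows; `excel_file_name` is an illustrative helper, and the second title in the example is made up.

```python
import re

# Characters that are not allowed in file names on Windows
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')

def excel_file_name(title):
    """Build an .xlsx file name from a title, replacing unsafe characters with '_'."""
    cleaned = INVALID_CHARS.sub("_", title).strip()
    return cleaned + ".xlsx"

unchanged = excel_file_name("Politicians from Hyderabad, India")
cleaned = excel_file_name("Mayors/Councillors: Delhi")  # made-up title
print(unchanged)
print(cleaned)
```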

In [39]:
generate_excel_Files(subCategory_titles,subCategory_URLS)

0 ) Working on File : Politicians from Ahmedabad
1 ) Working on File : Politicians from Alappuzha
2 ) Working on File : Politicians from Amritsar
3 ) Working on File : Politicians from Bangalore
4 ) Working on File : Politicians from Bhopal
5 ) Working on File : Politicians from Bhubaneswar
6 ) Working on File : Chandigarh politicians
7 ) Working on File : Politicians from Chennai
8 ) Working on File : Politicians from Coimbatore
9 ) Working on File : Politicians from Dehradun
10 ) Working on File : Delhi politicians
11 ) Working on File : Politicians from Faridabad
12 ) Working on File : Politicians from Guwahati
13 ) Working on File : Politicians from Hyderabad, India
14 ) Working on File : Politicians from Indore
15 ) Working on File : Politicians from Jaipur
16 ) Working on File : Politicians from Jammu
17 ) Working on File : Politicians from Kannur
18 ) Working on File : Politicians from Kochi
19 ) Working on File : Politicians from Kolkata
20 ) Working on File : Politicians from 