## Assignment

Make sure the requirements of this scraper are installed.

In [1]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn
# !pip install lxml  # parser used by BeautifulSoup below; urllib ships with Python
# !pip install bs4

The first step is to import the libraries needed. We use Beautiful Soup to parse the HTML and extract the city links from the page.

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlopen
from bs4 import BeautifulSoup
%matplotlib inline

We set the URL to the Wikipedia page that lists United States cities by population, which links to each city's article, and open it with `urlopen` from `urllib.request`.



In [70]:
#baseURL is defined separately so it can be reused for the second-level scraping of city pages
baseURL="https://en.wikipedia.org"

In [71]:
#main URL for scraping the initial table
url = "/wiki/List_of_United_States_cities_by_population"
html = urlopen(baseURL+url)
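
Plain string concatenation works here because every link on the page starts with `/wiki/`. As a slightly more defensive alternative, `urllib.parse.urljoin` from the standard library resolves a relative link against a base URL. A small sketch, where `build_url` is a hypothetical helper that is not part of the original notebook:

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def build_url(relative_link, base=BASE_URL):
    """Resolve a site-relative link such as '/wiki/Chicago' against the base URL."""
    return urljoin(base, relative_link)

print(build_url("/wiki/List_of_United_States_cities_by_population"))
```

Links that are already absolute pass through `urljoin` unchanged, which concatenation would not handle.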

Parse the downloaded HTML into a BeautifulSoup object so that its structure can be searched.

In [72]:
#Parse the website into a beautiful soup object
soup = BeautifulSoup(html, 'lxml')

Next, find the table that lists the top cities and collect all of its `<tr>` tags. The class name was found by inspecting the page's HTML, so the search can be limited to tables of that class instead of every table on the page.

In [73]:
#fetches every table with the class wikitable; one of them is the table of cities
rows = soup.find_all("table",class_="wikitable")

In [74]:
#takes the table at index 1 of the result set (the cities table) and fetches all of its tr tags
trs = rows[1].find_all("tr")
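
Selecting the table by position (`rows[1]`) depends on the page layout at the time of scraping. A sketch of a more layout-tolerant alternative is to pick the first `wikitable` whose header cells mention a given word. The function `find_table_with_header` and the sample HTML are hypothetical and used only for illustration; the built-in `html.parser` is used so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

def find_table_with_header(soup, header_text):
    """Return the first wikitable whose <th> cells mention header_text, or None."""
    wanted = header_text.lower()
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]
        if any(wanted in h for h in headers):
            return table
    return None

# Made-up page with two wikitables; only the second one lists cities.
sample = """
<table class="wikitable"><tr><th>Legend</th></tr><tr><td>State capital</td></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th><th>City</th></tr>
<tr><td>1</td><td>New York City</td></tr></table>
"""
soup = BeautifulSoup(sample, "html.parser")
table = find_table_with_header(soup, "city")
print(len(table.find_all("tr")))
```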

Here we iterate through the table rows, read the cells of each row, and pull out the link to each city's page. Each row becomes a dictionary, and the dictionaries are collected in a list that is later converted into a dataframe.

In [75]:
city_list = []
for tr in trs:
    try:
        tds = tr.find_all("td")
        temp = {}
        temp["rank"] = tds[0].text.replace("\n","")
        temp["city"] = tds[1].text.replace("\n","")
        #code to find link to city
        temp["link"] = tds[1].find_all("a",href=True)[0]['href']
        temp["state"] = tds[2].text.replace("\n","").replace("\xa0","")
        temp["estimate"] = tds[3].text.replace("\n","")
        temp["census"] = tds[4].text.replace("\n","")
        temp["change"] = tds[5].text.replace("\n","")
        temp["land_area_mi"] = tds[6].text.replace("\n","").replace("\xa0","")
        temp["land_area_km2"] = tds[7].text.replace("\n","").replace("\xa0","")
        temp["population_density_mi"] = tds[8].text.replace("\n","").replace("\xa0","")
        temp["population_density_km2"] = tds[9].text.replace("\n","")
        temp["location"] = tds[10].text.replace("\ufeff","").replace("\n","")
        city_list.append(temp)
    except IndexError:
        #header rows have no td cells (and a cell may lack a link), so skip them
        continue

This produces a list of dictionaries, one per city, holding the city's information.

In [76]:
city_list[0]

{'rank': '1',
 'city': 'New York City[d]',
 'link': '/wiki/New_York_City',
 'state': 'New York',
 'estimate': '8,398,748',
 'census': '8,175,133',
 'change': '+2.74%',
 'land_area_mi': '301.5sqmi',
 'land_area_km2': '780.9km2',
 'population_density_mi': '28,317/sqmi',
 'population_density_km2': '10,933/km2',
 'location': '40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / 40.6635; -73.9387 (1 New York City)'}
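
The scraped values are still raw strings: city names carry footnote markers such as `[d]`, and numbers carry thousands separators and units. A minimal sketch of how they could be cleaned with the standard `re` module follows. The helper names `strip_footnotes` and `to_number` are hypothetical, and the handling of the typographic minus sign is an assumption about what the page may contain.

```python
import re

def strip_footnotes(text):
    """Remove bracketed footnote markers such as '[d]' or '[3]'."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

def to_number(text):
    """Return the first number in a string like '8,398,748' or '301.5sqmi' as a float."""
    # Drop thousands separators; map the typographic minus (U+2212) to ASCII '-'.
    cleaned = text.replace(",", "").replace("\u2212", "-")
    match = re.search(r"[-+]?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None

print(strip_footnotes("New York City[d]"))
print(to_number("8,398,748"))
print(to_number("301.5sqmi"))
print(to_number("+2.74%"))
```

Taking only the first number matters for values such as `780.9km2`, where the trailing `2` belongs to the unit.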

In [77]:
#converted list into dataframe
df= pd.DataFrame(city_list)
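
The column order of a dataframe built from a list of dictionaries appears to depend on the pandas version (in the output below the columns come out sorted alphabetically). If a fixed order is wanted, one option is to pass `columns=` explicitly. A small sketch with made-up records:

```python
import pandas as pd

# Two hand-written records standing in for city_list.
records = [
    {"rank": "1", "city": "New York City", "state": "New York"},
    {"rank": "2", "city": "Los Angeles", "state": "California"},
]

# columns= both selects and orders the columns of the resulting dataframe.
frame = pd.DataFrame(records, columns=["rank", "city", "state"])
print(list(frame.columns))
```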

In [78]:
df.head()

Unnamed: 0,census,change,city,estimate,land_area_km2,land_area_mi,link,location,population_density_km2,population_density_mi,rank,state
0,8175133,+2.74%,New York City[d],8398748,780.9km2,301.5sqmi,/wiki/New_York_City,40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / ...,"10,933/km2","28,317/sqmi",1,New York
1,3792621,+5.22%,Los Angeles,3990456,"1,213.9km2",468.7sqmi,/wiki/Los_Angeles,34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W ...,"3,276/km2","8,484/sqmi",2,California
2,2695598,+0.39%,Chicago,2705994,588.7km2,227.3sqmi,/wiki/Chicago,41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / ...,"4,600/km2","11,900/sqmi",3,Illinois
3,2100263,+10.72%,Houston[3],2325502,"1,651.1km2",637.5sqmi,/wiki/Houston,29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / ...,"1,395/km2","3,613/sqmi",4,Texas
4,1445632,+14.85%,Phoenix,1660272,"1,340.6km2",517.6sqmi,"/wiki/Phoenix,_Arizona",33°34′20″N 112°05′24″W / 33.5722°N 112.0901°W ...,"1,200/km2","3,120/sqmi",5,Arizona
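
The `location` column stores the same coordinates in three notations, the last being a decimal `latitude; longitude` pair. A sketch of how that pair could be pulled out with a regular expression, assuming the format shown in `city_list[0]` holds for every row (`parse_lat_lon` is a hypothetical helper):

```python
import re

# Matches a decimal pair such as "40.6635; -73.9387".
_DECIMAL_PAIR = re.compile(r"(-?\d+(?:\.\d+)?);\s*(-?\d+(?:\.\d+)?)")

def parse_lat_lon(location):
    """Return (latitude, longitude) as floats, or None if no decimal pair is found."""
    match = _DECIMAL_PAIR.search(location)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

sample = ("40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / "
          "40.6635; -73.9387 (1 New York City)")
print(parse_lat_lon(sample))
```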


The first function below crawls a link from the dataframe and returns the parsed page. Keeping the parsed page in memory makes later scraping easier, because the web page does not have to be downloaded again for every field we extract (a space-for-time trade-off).

A second function separates the Wikipedia infobox from the page so that it can be stored separately.

In [91]:
#crawls the page and return the soup object of the particular page
def crawl_page(x):
    url = baseURL+x
    html = urlopen(url)
    return BeautifulSoup(html, 'lxml')

#locate the info box and extract the data
def get_info_box(x):
    return x.find_all("table",{"class":"infobox geography vcard"})
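
The same space-for-time idea can be expressed as a small memoizing wrapper, so that each URL is downloaded at most once even if it is requested again. This is only a sketch: `make_cached_fetcher`, its optional `delay_seconds` pause, and the `fake_fetch` stand-in are all hypothetical and not part of the original notebook. In practice `fetch` could be a function such as `crawl_page`.

```python
import time

def make_cached_fetcher(fetch, delay_seconds=0.0):
    """Wrap fetch(url) so that each URL is fetched at most once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            if cache and delay_seconds > 0:
                time.sleep(delay_seconds)  # pause between real downloads
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch

# Stand-in for a real download function, so the sketch runs without network access.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"

fetcher = make_cached_fetcher(fake_fetch)
fetcher("/wiki/Chicago")
fetcher("/wiki/Chicago")   # served from the cache, no second download
fetcher("/wiki/Houston")
print(len(calls))
```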


Store all the crawled pages in the dataframe; the result can optionally be written to a CSV file.

In [92]:
df["page_info"] = df["link"].apply(crawl_page)

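You can save the current dataframe as a CSV to avoid crawling the pages again. The code is below, commented out. Note that after reloading, `page_info` holds plain text and has to be parsed with BeautifulSoup again before the infobox can be extracted.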

In [93]:
# df.to_csv("page_crawl.csv", sep='\t', encoding='utf-8')
# df = pd.read_csv("page_crawl.csv",sep="\t")

In [94]:
df.head()

Unnamed: 0,census,change,city,estimate,land_area_km2,land_area_mi,link,location,population_density_km2,population_density_mi,rank,state,page_info
0,8175133,+2.74%,New York City[d],8398748,780.9km2,301.5sqmi,/wiki/New_York_City,40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / ...,"10,933/km2","28,317/sqmi",1,New York,"<!DOCTYPE html> <html class=""client-nojs"" dir=..."
1,3792621,+5.22%,Los Angeles,3990456,"1,213.9km2",468.7sqmi,/wiki/Los_Angeles,34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W ...,"3,276/km2","8,484/sqmi",2,California,"<!DOCTYPE html> <html class=""client-nojs"" dir=..."
2,2695598,+0.39%,Chicago,2705994,588.7km2,227.3sqmi,/wiki/Chicago,41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / ...,"4,600/km2","11,900/sqmi",3,Illinois,"<!DOCTYPE html> <html class=""client-nojs"" dir=..."
3,2100263,+10.72%,Houston[3],2325502,"1,651.1km2",637.5sqmi,/wiki/Houston,29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / ...,"1,395/km2","3,613/sqmi",4,Texas,"<!DOCTYPE html> <html class=""client-nojs"" dir=..."
4,1445632,+14.85%,Phoenix,1660272,"1,340.6km2",517.6sqmi,"/wiki/Phoenix,_Arizona",33°34′20″N 112°05′24″W / 33.5722°N 112.0901°W ...,"1,200/km2","3,120/sqmi",5,Arizona,"<!DOCTYPE html> <html class=""client-nojs"" dir=..."


In [96]:
df["infobox"] = df["page_info"].apply(get_info_box)

In [97]:
df.head()

Unnamed: 0,census,change,city,estimate,land_area_km2,land_area_mi,link,location,population_density_km2,population_density_mi,rank,state,page_info,infobox
0,8175133,+2.74%,New York City[d],8398748,780.9km2,301.5sqmi,/wiki/New_York_City,40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / ...,"10,933/km2","28,317/sqmi",1,New York,"<!DOCTYPE html> <html class=""client-nojs"" dir=...","[<table class=""infobox geography vcard"" style=..."
1,3792621,+5.22%,Los Angeles,3990456,"1,213.9km2",468.7sqmi,/wiki/Los_Angeles,34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W ...,"3,276/km2","8,484/sqmi",2,California,"<!DOCTYPE html> <html class=""client-nojs"" dir=...","[<table class=""infobox geography vcard"" style=..."
2,2695598,+0.39%,Chicago,2705994,588.7km2,227.3sqmi,/wiki/Chicago,41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / ...,"4,600/km2","11,900/sqmi",3,Illinois,"<!DOCTYPE html> <html class=""client-nojs"" dir=...","[<table class=""infobox geography vcard"" style=..."
3,2100263,+10.72%,Houston[3],2325502,"1,651.1km2",637.5sqmi,/wiki/Houston,29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / ...,"1,395/km2","3,613/sqmi",4,Texas,"<!DOCTYPE html> <html class=""client-nojs"" dir=...","[<table class=""infobox geography vcard"" style=..."
4,1445632,+14.85%,Phoenix,1660272,"1,340.6km2",517.6sqmi,"/wiki/Phoenix,_Arizona",33°34′20″N 112°05′24″W / 33.5722°N 112.0901°W ...,"1,200/km2","3,120/sqmi",5,Arizona,"<!DOCTYPE html> <html class=""client-nojs"" dir=...","[<table class=""infobox geography vcard"" style=..."


1. `seperate_field` - takes a list of candidate field names (`field_list`), finds the first infobox row whose header contains one of them, and returns that row's value so it can be stored in the dataframe.
2. `seperate_text_block` - collects the text of the page section whose heading contains the given name (passed as `field_list`) so it can be stored in the dataframe.

In [98]:
def seperate_field(x,field_list):
    #x is the result set returned by get_info_box; use its first table
    x= x[0]
    for row in x.find_all('tr'):
        for header in row.find_all("th"):
            for k in field_list:
                #match against the lower-cased header text, e.g. "time zone"
                if k in header.text.lower():
                    return row.find("td").text.replace("\n"," ")
    #falls through and returns None when no header matches


def seperate_text_block(x,field_list):
    #field_list is a single heading name here, e.g. "History"
    flag = False
    final_string = ""
    for i in x.find_all(["h2",'h3',"p"]):
        if i.name in ('h2','h3') and field_list in i.text:
            #start collecting after the matching heading
            flag = True
            continue
        elif i.name == 'h2' and field_list not in i.text:
            #the next top-level heading ends the section
            flag = False
        if flag:
            final_string+= i.text + " "
    return final_string
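
To see what the two helpers do, the sketch below applies the same idea to a tiny hand-written page. The HTML fragment and the names `lookup_field` and `section_text` are made up for illustration. `section_text` stops at the next `h2`, which is an assumption about the intended behaviour rather than something the notebook states.

```python
from bs4 import BeautifulSoup

def lookup_field(infobox, keywords):
    """Return the <td> text of the first infobox row whose <th> contains a keyword."""
    for row in infobox.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            continue
        if any(k in header.get_text().lower() for k in keywords):
            return cell.get_text(" ", strip=True)
    return None

def section_text(page, title):
    """Join the paragraphs between the <h2> containing title and the next <h2>."""
    parts, inside = [], False
    for tag in page.find_all(["h2", "p"]):
        if tag.name == "h2":
            inside = title in tag.get_text()
        elif inside:
            parts.append(tag.get_text(strip=True))
    return " ".join(parts)

# Made-up page: one infobox and two sections.
html = """
<table class="infobox geography vcard">
  <tr><th>Time zone</th><td>UTC-06:00 (CST)</td></tr>
  <tr><th>Area codes</th><td>312/872</td></tr>
</table>
<h2>History</h2><p>Founded long ago.</p><p>It grew quickly.</p>
<h2>Geography</h2><p>It sits beside a lake.</p>
"""
page = BeautifulSoup(html, "html.parser")
infobox = page.find("table", class_="infobox")
print(lookup_field(infobox, ["area code"]))
print(section_text(page, "History"))
```

Returning `None` for a missing field keeps the dataframe cell empty, which matches how the Phoenix row shows no `summer_time_zone` in the output further down.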
         

Store each separated field in its own dataframe column.

In [101]:
df["county"] = df["infobox"].apply(seperate_field,field_list=["county","counties"])

In [102]:
df["settled"] = df["infobox"].apply(seperate_field,field_list=["settled"])

In [103]:
df["website"] = df["infobox"].apply(seperate_field,field_list=["website"])

In [104]:
df["zip_code"] = df["infobox"].apply(seperate_field,field_list=["zip codes","zip code","zip code(s)"])

In [105]:
df["major_airport"] = df["infobox"].apply(seperate_field,field_list=["major airport(s)","major airport","major airports","primary airport"])

In [106]:
df["demonym"] = df["infobox"].apply(seperate_field,field_list=["demonym(s)"])

In [107]:
df["government_type"] = df["infobox"].apply(seperate_field,field_list=["type"])

In [108]:
df["mayor"] = df["infobox"].apply(seperate_field,field_list=["mayor"])

In [109]:
df["time_zone"] = df["infobox"].apply(seperate_field,field_list=["time zone"])

In [110]:
df["summer_time_zone"] = df["infobox"].apply(seperate_field,field_list=["summer"])

In [111]:
df["area_code"] = df["infobox"].apply(seperate_field,field_list = ["area code"])

In [112]:
df["government_body"] = df["infobox"].apply(seperate_field,field_list= ["body"])

In [113]:
df["history"] = df["page_info"].apply(seperate_text_block,field_list= "History")

In [114]:
df["geography"] = df["page_info"].apply(seperate_text_block,field_list= "Geography")

In [115]:
df["demographics"] = df["page_info"].apply(seperate_text_block,field_list= "Demographics")

In [116]:
df["economy"] = df["page_info"].apply(seperate_text_block,field_list= "Economy")

In [117]:
df["transportation"] = df["page_info"].apply(seperate_text_block,field_list= "Transportation")

In [118]:
df["education"] = df["page_info"].apply(seperate_text_block,field_list = "Education")

In [119]:
df["sports"] = df["page_info"].apply(seperate_text_block,field_list="Sports")

In [120]:
len(df.columns)

33

In [121]:
df.columns

Index(['census', 'change', 'city', 'estimate', 'land_area_km2', 'land_area_mi',
       'link', 'location', 'population_density_km2', 'population_density_mi',
       'rank', 'state', 'page_info', 'infobox', 'county', 'settled', 'website',
       'zip_code', 'major_airport', 'demonym', 'government_type', 'mayor',
       'time_zone', 'summer_time_zone', 'area_code', 'government_body',
       'history', 'geography', 'demographics', 'economy', 'transportation',
       'education', 'sports'],
      dtype='object')

In [122]:
df.head()

Unnamed: 0,census,change,city,estimate,land_area_km2,land_area_mi,link,location,population_density_km2,population_density_mi,...,summer_time_zone,area_code,government_body,history,geography,demographics,economy,transportation,education,sports
0,8175133,+2.74%,New York City[d],8398748,780.9km2,301.5sqmi,/wiki/New_York_City,40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / ...,"10,933/km2","28,317/sqmi",...,UTC−04:00 (EDT),"212/646/332, 718/347/929, 917",New York City Council,"Etymology In 1664, the city was named in honor...",New York City is situated in the Northeastern ...,New York City is the most populous city in the...,City economic overview New York City is a glob...,New York City's comprehensive transportation s...,Primary and secondary education The New York C...,New York City is home to the headquarters of t...
1,3792621,+5.22%,Los Angeles,3990456,"1,213.9km2",468.7sqmi,/wiki/Los_Angeles,34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W ...,"3,276/km2","8,484/sqmi",...,UTC−07:00 (PDT),"213/323, 310/424, 747/818",Los Angeles City Council,Pre-colonial history The Los Angeles coastal a...,Topography The city of Los Angeles covers a to...,The 2010 United States Census[97] reported Los...,The economy of Los Angeles is driven by intern...,Freeways The city and the rest of the Los Ange...,Colleges and universities There are three publ...,The city of Los Angeles and its metropolitan a...
2,2695598,+0.39%,Chicago,2705994,588.7km2,227.3sqmi,/wiki/Chicago,41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / ...,"4,600/km2","11,900/sqmi",...,UTC−05:00 (Central),312/872 and 773/872,Chicago City Council,"Beginnings In the mid-18th century, the area w...",Topography Chicago is located in northeastern ...,"During its first hundred years, Chicago was on...",Chicago has the third-largest gross metropolit...,Chicago is a major transportation hub in the U...,Schools and libraries Chicago Public Schools (...,"Sporting News named Chicago the ""Best Sports C..."
3,2100263,+10.72%,Houston[3],2325502,"1,651.1km2",637.5sqmi,/wiki/Houston,29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / ...,"1,395/km2","3,613/sqmi",...,UTC−5 (CDT),"713, 281, 832, 346",Houston City Council,The Allen brothers—Augustus Chapman and John K...,Houston is located 165 miles (266 km) east of ...,The 2010 United States Census reported that Ho...,Houston is recognized worldwide for its energy...,Houston is considered an automobile-dependent ...,Nineteen school districts exist within the cit...,Houston has sports teams for every major profe...
4,1445632,+14.85%,Phoenix,1660272,"1,340.6km2",517.6sqmi,"/wiki/Phoenix,_Arizona",33°34′20″N 112°05′24″W / 33.5722°N 112.0901°W ...,"1,200/km2","3,120/sqmi",...,,East: 480 Central: 602 West: 623,Phoenix City Council,Early history[edit] The Hohokam people occupie...,"Phoenix is in the southwestern United States, ...",Phoenix is the sixth most populous city in the...,The early economy of Phoenix was focused prima...,Phoenix is served by Phoenix Sky Harbor Intern...,Public education in the Phoenix area is provid...,Major league[edit] Phoenix is home to several ...


In [123]:
df.to_csv("big.csv",index = False)

In [124]:
pd.read_csv("big.csv")

Unnamed: 0,census,change,city,estimate,land_area_km2,land_area_mi,link,location,population_density_km2,population_density_mi,...,summer_time_zone,area_code,government_body,history,geography,demographics,economy,transportation,education,sports
0,8175133,+2.74%,New York City[d],8398748,780.9km2,301.5sqmi,/wiki/New_York_City,40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / ...,"10,933/km2","28,317/sqmi",...,UTC−04:00 (EDT),"212/646/332, 718/347/929, 917",New York City Council,"Etymology In 1664, the city was named in honor...",New York City is situated in the Northeastern ...,New York City is the most populous city in the...,City economic overview New York City is a glob...,New York City's comprehensive transportation s...,Primary and secondary education The New York C...,New York City is home to the headquarters of t...
1,3792621,+5.22%,Los Angeles,3990456,"1,213.9km2",468.7sqmi,/wiki/Los_Angeles,34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W ...,"3,276/km2","8,484/sqmi",...,UTC−07:00 (PDT),"213/323, 310/424, 747/818",Los Angeles City Council,Pre-colonial history The Los Angeles coastal a...,Topography The city of Los Angeles covers a to...,The 2010 United States Census[97] reported Los...,The economy of Los Angeles is driven by intern...,Freeways The city and the rest of the Los Ange...,Colleges and universities There are three publ...,The city of Los Angeles and its metropolitan a...
2,2695598,+0.39%,Chicago,2705994,588.7km2,227.3sqmi,/wiki/Chicago,41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / ...,"4,600/km2","11,900/sqmi",...,UTC−05:00 (Central),312/872 and 773/872,Chicago City Council,"Beginnings In the mid-18th century, the area w...",Topography Chicago is located in northeastern ...,"During its first hundred years, Chicago was on...",Chicago has the third-largest gross metropolit...,Chicago is a major transportation hub in the U...,Schools and libraries Chicago Public Schools (...,"Sporting News named Chicago the ""Best Sports C..."
3,2100263,+10.72%,Houston[3],2325502,"1,651.1km2",637.5sqmi,/wiki/Houston,29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / ...,"1,395/km2","3,613/sqmi",...,UTC−5 (CDT),"713, 281, 832, 346",Houston City Council,The Allen brothers—Augustus Chapman and John K...,Houston is located 165 miles (266 km) east of ...,The 2010 United States Census reported that Ho...,Houston is recognized worldwide for its energy...,Houston is considered an automobile-dependent ...,Nineteen school districts exist within the cit...,Houston has sports teams for every major profe...
4,1445632,+14.85%,Phoenix,1660272,"1,340.6km2",517.6sqmi,"/wiki/Phoenix,_Arizona",33°34′20″N 112°05′24″W / 33.5722°N 112.0901°W ...,"1,200/km2","3,120/sqmi",...,,East: 480 Central: 602 West: 623,Phoenix City Council,Early history[edit] The Hohokam people occupie...,"Phoenix is in the southwestern United States, ...",Phoenix is the sixth most populous city in the...,The early economy of Phoenix was focused prima...,Phoenix is served by Phoenix Sky Harbor Intern...,Public education in the Phoenix area is provid...,Major league[edit] Phoenix is home to several ...
5,1526006,+3.81%,Philadelphia[e],1584138,347.6km2,134.2sqmi,/wiki/Philadelphia,40°00′34″N 75°08′00″W / 40.0094°N 75.1333°W / ...,"4,511/km2","11,683/sqmi",...,UTC-4 (EDT),"215, 267, 445",Philadelphia City Council,"Before Europeans arrived, the Philadelphia are...",Topography The geographic center of Philadelph...,According to the 2018 United States Census Bur...,Philadelphia is the center of economic activit...,Philadelphia is served by the Southeastern Pen...,Primary and secondary education Education in P...,Philadelphia's first professional sports team ...
6,1327407,+15.43%,San Antonio,1532233,"1,194.0km2",461.0sqmi,/wiki/San_Antonio,29°28′21″N 98°31′30″W / 29.4724°N 98.5251°W / ...,"1,250/km2","3,238/sqmi",...,UTC−5 (CDT),"210 (majority), 830 (portions), 726",San Antonio City Council,"At the time of European encounter, Payaya Indi...",San Antonio is approximately 75 miles (121 km)...,"According to the 2010 U.S. Census, 1,327,407 p...",San Antonio has a diversified economy with a g...,Air[edit] The San Antonio International Airpor...,"San Antonio hosts over 100,000 students in its...",Professional sports[edit] The city's only top-...
7,1307402,+9.07%,San Diego,1425976,842.3km2,325.2sqmi,/wiki/San_Diego,32°48′55″N 117°08′06″W / 32.8153°N 117.1350°W ...,"1,670/km2","4,325/sqmi",...,UTC−7 (PDT),"619, 858",San Diego City Council,Pre-colonial period The original inhabitants o...,According to SDSU professor emeritus Monte Mar...,"The city had a population of 1,307,402 accordi...",The largest sectors of San Diego's economy are...,With the automobile being the primary means of...,Primary and secondary schools Public schools i...,Major League teams Minor League teams College ...
8,1197816,+12.29%,Dallas,1345047,882.9km2,340.9sqmi,/wiki/Dallas,32°47′36″N 96°45′59″W / 32.7933°N 96.7665°W / ...,"1,493/km2","3,866/sqmi",...,UTC−5 (Central),"214, 469, 972, 682, 817[4][5]",Dallas City Council,Preceded by thousands of years of varying cult...,Dallas is situated in the Southern United Stat...,Dallas is the ninth most-populous city in the ...,"In its beginnings, Dallas relied on farming, n...",Like many other major cities in the United Sta...,"There are 337 public schools, 89 private schoo...",The Dallas—Fort Worth metropolitan area is hom...
9,945942,+8.90%,San Jose,1030119,459.7km2,177.5sqmi,"/wiki/San_Jose,_California",37°17′48″N 121°49′08″W / 37.2967°N 121.8189°W ...,"2,231/km2","5,777/sqmi",...,UTC−7 (Pacific Daylight Time),408/669,San Jose City Council,Pre-Columbian period[edit] The Santa Clara Val...,San Jose is located at 37°20′07″N 121°53′31″W﻿...,"In 2014, the U.S. Census Bureau released its n...",The cost of living in San Jose and the surroun...,Like other American cities built mostly after ...,Higher education[edit] San Jose is home to sev...,San Jose is home to the San Jose Sharks of the...
