Python for Data Analytics | Module 5
<br>Professor James Ng

# Web Scraping

In [7]:
import numpy as np
import pandas as pd

Web scraping is a broad topic. It requires an understanding of how websites are built and some fundamentals of HTML, and it almost always involves a lot of string manipulation. In this section we will go through the basics of scraping HTML tables from a static webpage.

### `pd.read_html()`

Using the pandas package, you can read and parse HTML tables from webpages directly. 

The following example extracts the NBA 2005 draft data set from the [Sports Reference](https://www.basketball-reference.com/draft/NBA_2005.html) website.

In [2]:
nba_data_list = pd.read_html("https://www.basketball-reference.com/draft/NBA_2005.html") 

In [3]:
print(type(nba_data_list), '\n', len(nba_data_list))

<class 'list'> 
 1


You will notice that `read_html()` returns a list. A webpage can contain multiple tables, so `read_html()` returns a list of DataFrames, one per table. This webpage has only one table, so you can access it as the element at index 0. 
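The list behaviour is easy to see offline. The following is a minimal sketch, assuming the `lxml` parser is installed; the two-table HTML snippet is made up for illustration and is not taken from the Sports Reference page.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment containing two separate <table> elements
html = """
<table>
  <tr><th>Player</th><th>Pk</th></tr>
  <tr><td>Andrew Bogut</td><td>1</td></tr>
</table>
<table>
  <tr><th>Tm</th></tr>
  <tr><td>MIL</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table>, collected in a list
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Index into the list to get an individual DataFrame
first = tables[0]
print(first)
```

Wrapping the string in `StringIO` makes pandas treat it as a file-like object rather than a URL or path.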

In [4]:
nba_df = nba_data_list[0]
nba_df

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Round 1,Unnamed: 4_level_0,Totals,Shooting,Per Game,Advanced,Unnamed: 9_level_0,...,Unnamed: 12_level_0,Unnamed: 13_level_0,Unnamed: 14_level_0,Unnamed: 15_level_0,Unnamed: 16_level_0,Unnamed: 17_level_0,Unnamed: 18_level_0,Unnamed: 19_level_0,Unnamed: 20_level_0,Unnamed: 21_level_0
Unnamed: 0_level_1,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP,PTS,TRB,AST,WS,WS/48,BPM,VORP
0,1,1,MIL,Andrew Bogut,Utah,14,706,19862,6808,6112,...,.120,.557,28.1,9.6,8.7,2.2,50.6,.122,2.3,21.6
1,2,2,ATL,Marvin Williams,UNC,15,1037,29501,10803,5428,...,.363,.808,28.4,10.4,5.2,1.3,63.8,.104,-0.2,13.3
2,3,3,UTA,Deron Williams,Illinois,12,845,28865,13804,2619,...,.357,.822,34.2,16.3,3.1,8.1,77.3,.129,1.3,23.8
3,4,4,NOH,Chris Paul,Wake Forest,15,970,33963,17899,4345,...,.370,.869,35.0,18.5,4.5,9.6,173.6,.245,7.2,79.0
4,5,5,CHA,Raymond Felton,UNC,14,971,28829,10853,2867,...,.329,.790,29.7,11.2,3.0,5.2,40.2,.067,-0.6,9.9
5,6,6,POR,Martell Webster,,10,580,13914,5056,1811,...,.382,.791,24.0,8.7,3.1,1.0,24.7,.085,-0.7,4.6
6,7,7,TOR,Charlie Villanueva,UConn,11,656,13578,6808,3019,...,.341,.772,20.7,10.4,4.6,0.8,22.6,.080,-1.8,0.6
7,8,8,NYK,Channing Frye,Arizona,13,890,19772,7786,4002,...,.388,.822,22.2,8.7,4.5,1.0,38.9,.094,-0.2,8.9
8,9,9,GSW,Ike Diogu,Arizona State,6,225,2795,1348,689,...,.500,.786,12.4,6.0,3.1,0.3,6.5,.112,-4.8,-2.0
9,10,10,LAL,Andrew Bynum,,8,418,10690,4822,3221,...,.111,.690,25.6,11.5,7.7,1.2,37.4,.168,1.3,9.0


#### Clean the data

Data scraped from webpages almost always needs to be cleaned. Here are some examples.

You might have observed that the column names are all tuples. This happens because the table has a two-level header. You can change the column names to something cleaner by assigning to the dataframe's `columns` attribute. 

In [5]:
nba_df.columns = ['Rk', 'Pk', 'Tm','Player','College', 'Yrs','G', 'MP', 'PTS','TRB','AST','FG%', 
                    '3P%', 'FT%', 'MP', 'PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP']

nba_df.head()

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
0,1,1,MIL,Andrew Bogut,Utah,14,706,19862,6808,6112,...,0.12,0.557,28.1,9.6,8.7,2.2,50.6,0.122,2.3,21.6
1,2,2,ATL,Marvin Williams,UNC,15,1037,29501,10803,5428,...,0.363,0.808,28.4,10.4,5.2,1.3,63.8,0.104,-0.2,13.3
2,3,3,UTA,Deron Williams,Illinois,12,845,28865,13804,2619,...,0.357,0.822,34.2,16.3,3.1,8.1,77.3,0.129,1.3,23.8
3,4,4,NOH,Chris Paul,Wake Forest,15,970,33963,17899,4345,...,0.37,0.869,35.0,18.5,4.5,9.6,173.6,0.245,7.2,79.0
4,5,5,CHA,Raymond Felton,UNC,14,971,28829,10853,2867,...,0.329,0.79,29.7,11.2,3.0,5.2,40.2,0.067,-0.6,9.9
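Typing out every column name by hand works, but the two header levels can also be flattened programmatically. The sketch below uses a small made-up frame (not the scraped table) and a hypothetical helper `flatten`; the rule for which top-level labels to discard is an assumption chosen to suit this particular table.

```python
import pandas as pd

# Toy frame with a two-level header, mimicking what read_html() produced above
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Round 1", "Player"),
    ("Totals", "PTS"),
    ("Per Game", "PTS"),
])
toy = pd.DataFrame([[1, "Andrew Bogut", 6808, 9.6]], columns=cols)


def flatten(col):
    """Keep the group label only when it carries information."""
    top, bottom = col
    if top.startswith(("Unnamed", "Round")):
        return bottom
    return f"{top} {bottom}"


toy.columns = [flatten(c) for c in toy.columns]
print(list(toy.columns))
```

Prefixing with the group label keeps the totals and per-game versions of `PTS` distinguishable, which plain bottom-level names would not.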


You will notice that the data is **messy**. For example, if you look at rows 28 through 34, you will see that the rows with index 30 and 31 hold header text rather than player data. Where did this come from? Looking at the source [webpage](https://www.basketball-reference.com/draft/NBA_2005.html), the HTML table breaks after Round 1 and repeats its header rows before Round 2.

In [6]:
nba_df.loc[28:34]

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
28,29,29,MIA,Wayne Simien,Kansas,2,51,507,169,99,...,,.854,9.9,3.3,1.9,0.2,0.9,.083,-4.8,-0.4
29,30,30,NYK,David Lee,Florida,12,829,24293,11232,7320,...,.034,.772,29.3,13.5,8.8,2.2,76.0,.150,1.6,22.3
30,,,,Round 2,,Totals,Shooting,Per Game,Advanced,,...,,,,,,,,,,
31,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP,PTS,TRB,AST,WS,WS/48,BPM,VORP
32,31,31,ATL,Salim Stoudamire,Arizona,3,157,2672,1260,214,...,.366,.882,17.0,8.0,1.4,1.0,2.2,.040,-5.1,-2.1
33,32,32,LAC,Daniel Ewing,Duke,2,127,1683,431,158,...,.295,.780,13.3,3.4,1.2,1.4,0.8,.024,-3.7,-0.7
34,33,33,NOH,Brandon Bass,LSU,12,758,16410,6575,3448,...,.207,.832,21.6,8.7,4.5,0.8,42.8,.125,-1.1,3.7


In [7]:
# Drop the two rows with index labels 30 and 31. inplace=True modifies nba_df directly instead of returning a new DataFrame.
nba_df.drop([30, 31], axis=0, inplace=True)
nba_df.loc[28:34]

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST,WS,WS/48,BPM,VORP
28,29,29,MIA,Wayne Simien,Kansas,2,51,507,169,99,...,,0.854,9.9,3.3,1.9,0.2,0.9,0.083,-4.8,-0.4
29,30,30,NYK,David Lee,Florida,12,829,24293,11232,7320,...,0.034,0.772,29.3,13.5,8.8,2.2,76.0,0.15,1.6,22.3
32,31,31,ATL,Salim Stoudamire,Arizona,3,157,2672,1260,214,...,0.366,0.882,17.0,8.0,1.4,1.0,2.2,0.04,-5.1,-2.1
33,32,32,LAC,Daniel Ewing,Duke,2,127,1683,431,158,...,0.295,0.78,13.3,3.4,1.2,1.4,0.8,0.024,-3.7,-0.7
34,33,33,NOH,Brandon Bass,LSU,12,758,16410,6575,3448,...,0.207,0.832,21.6,8.7,4.5,0.8,42.8,0.125,-1.1,3.7
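Hard-coding the labels 30 and 31 only works if you already know where the stray rows are. As an alternative sketch, the rows can be detected by their content: anything whose `Rk` value is not a number is treated as a repeated header. The toy frame below is hand-written to resemble the messy rows; it is not the scraped data.

```python
import pandas as pd

# Toy version of the messy table: two stray header rows sit between real rows
toy = pd.DataFrame({
    "Rk": ["29", "30", None, "Rk", "31"],
    "Player": ["Wayne Simien", "David Lee", "Round 2", "Player", "Salim Stoudamire"],
    "PTS": ["169", "11232", "Per Game", "PTS", "1260"],
})

# Non-numeric values in 'Rk' become NaN, which flags the stray rows
rk_numeric = pd.to_numeric(toy["Rk"], errors="coerce")
clean = toy[rk_numeric.notna()].reset_index(drop=True)

# The stray text forced every column to hold strings; convert back to numbers
clean["PTS"] = pd.to_numeric(clean["PTS"])
print(clean)
```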


### Activity

* Use `pd.read_html()` to download the information on all the states from this [Wikipedia](https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals) page. 
    * Do the column names come through correctly? If not, set them yourself. 

In [8]:
state_info = pd.read_html("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
state_df = state_info[0]
state_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,State,Abr.,State-hood,Capital,Capital since,Area (mi²),Population (2010),Notes,,,
1,Municipal (Within city proper boundaries),Metropolitan (Both within the capital city pro...,Rank in state,Rank in US,,,,,,,
2,Alabama,AL,1819,Montgomery,1846,155.4,205764,374536,2.0,102.0,Birmingham is the state's largest city.
3,Alaska,AK,1959,Juneau,1906,2716.7,31275,710231,3.0,,Largest capital by municipal land area.
4,Arizona,AZ,1912,Phoenix,1889,474.9,1445632,4192887,1.0,6.0,Phoenix is the most populous capital city in t...


In [9]:
state_df.columns = ['State', 'State Abr', 'State-hood', 'Capital', 'Capital since', 'Area (mi^2)',
                    'Population (2010) Municipal', 'Population (2010) Metropolitan', 'Population (2010) Rank in State',
                    'Population (2010) Rank in US', 'Notes']
state_df

Unnamed: 0,State,State Abr,State-hood,Capital,Capital since,Area (mi^2),Population (2010) Municipal,Population (2010) Metropolitan,Population (2010) Rank in State,Population (2010) Rank in US,Notes
0,State,Abr.,State-hood,Capital,Capital since,Area (mi²),Population (2010),Notes,,,
1,Municipal (Within city proper boundaries),Metropolitan (Both within the capital city pro...,Rank in state,Rank in US,,,,,,,
2,Alabama,AL,1819,Montgomery,1846,155.4,205764,374536,2.0,102.0,Birmingham is the state's largest city.
3,Alaska,AK,1959,Juneau,1906,2716.7,31275,710231,3.0,,Largest capital by municipal land area.
4,Arizona,AZ,1912,Phoenix,1889,474.9,1445632,4192887,1.0,6.0,Phoenix is the most populous capital city in t...
5,Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757,1.0,117.0,
6,California,CA,1850,Sacramento,1854,97.2,466488,2149127,6.0,35.0,
7,Colorado,CO,1876,Denver,1867,153.4,600158,2543482,1.0,26.0,Denver was called Denver City until 1882.
8,Connecticut,CT,1788,Hartford,1875,17.3,124775,1212381,3.0,199.0,
9,Delaware,DE,1787,Dover,1777,22.4,36047,162310,2.0,,Longest-serving capital in terms of statehood.


### Activity

* Download the list of the top 250 movies from [IMDb](http://www.imdb.com/chart/top?ref_=nv_wl_img_3).  
* <b>Which movie released in 2014 has the highest IMDb rating?</b>

In [10]:
movies = pd.read_html("https://www.imdb.com/chart/top?ref_=nv_wl_img_3")

In [11]:
movies.head()

AttributeError: 'list' object has no attribute 'head'

Don't forget that `read_html()` always returns the extracted tables as elements of a list, even when there is only one table.

In [12]:
# On further inspection the list contains only one element
len(movies)

1

In [13]:
# The list element is a dataframe
movies[0].head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,


In [14]:
# Store the dataframe in the movies list as a separate dataframe.
# The copy() method here is needed to explicitly tell Python that 
# movie_df is meant to be a new, separate copy of movies[0].

movie_df = movies[0].copy()

In [15]:
# Create a new column for year. The year is embedded in 'Rank & Title', but we can pull it out with basic string
# slicing: the last five characters are the four-digit year followed by the closing parenthesis.

movie_df['year'] = movie_df['Rank & Title'].str[-5:-1]
movie_df.head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4,year
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,,1994
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,,1972
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,,1974
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,,2008
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,,1957


In [16]:
# Extract movies from 2014. Also sort by IMDb rating to ensure highest comes first.
movie_df[ movie_df['year'] == '2014' ].sort_values('IMDb Rating', ascending=False)

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4,year
29,,30. Interstellar (2014),8.5,12345678910 NOT YET RELEASED Seen,,2014
45,,46. Whiplash (2014),8.5,12345678910 NOT YET RELEASED Seen,,2014
183,,184. Relatos salvajes (2014),8.1,12345678910 NOT YET RELEASED Seen,,2014
186,,187. Gone Girl (2014),8.1,12345678910 NOT YET RELEASED Seen,,2014
193,,194. The Grand Budapest Hotel (2014),8.1,12345678910 NOT YET RELEASED Seen,,2014
235,,236. PK (2014),8.0,12345678910 NOT YET RELEASED Seen,,2014
245,,246. Guardians of the Galaxy (2014),8.0,12345678910 NOT YET RELEASED Seen,,2014


In [17]:
# And we're done! INTERSTELLAR and WHIPLASH had the highest rating among movies released in 2014.
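The year column above holds strings, which is why the filter compares against `'2014'`. As a hedged variation, the year can be pulled out with a regular expression and stored as an integer, and ties for the top rating can be selected explicitly. The three-row frame is a hand-typed excerpt of the values shown above, not a fresh download.

```python
import pandas as pd

# Hand-typed excerpt of the scraped table
toy = pd.DataFrame({
    "Rank & Title": [
        "30. Interstellar (2014)",
        "46. Whiplash (2014)",
        "4. The Dark Knight (2008)",
    ],
    "IMDb Rating": [8.5, 8.5, 9.0],
})

# Capture the four digits inside the trailing parentheses and store them as integers
toy["year"] = (
    toy["Rank & Title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
)

# Keep every 2014 movie that shares the maximum rating, so ties are not lost
in_2014 = toy[toy["year"] == 2014]
best = in_2014[in_2014["IMDb Rating"] == in_2014["IMDb Rating"].max()]
print(best)
```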

### More data cleaning/string manipulation

In [18]:
movie_df.head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4,year
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,,1994
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,,1972
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,,1974
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,,2008
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,,1957


In [19]:
# 'Unnamed' columns are useless, let's drop them

# boolean array indicating whether columns have the specified string

movie_df.columns.str.contains('Unnamed:')

array([ True, False, False, False,  True, False])

In [20]:
# keep only the columns that do NOT contain the specified string (i.e. drop the 'Unnamed:' columns)

movie_df = movie_df.loc[: , ~movie_df.columns.str.contains('Unnamed:') ]
movie_df.head()

Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,year
0,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,1994
1,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,1972
2,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,1974
3,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,2008
4,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,1957


In [21]:
# Rank is embedded in 'Rank & Title'. Let's pull out rank and store it in its own column.

# To do so we will split the 'Rank & Title' strings with the period '.' as separator. Let's see how this works.

movie_df['Rank & Title'].str.split('.')

# this returns a Series of lists, one list of string pieces per row

0                [1,   The Shawshank Redemption  (1994)]
1                           [2,   The Godfather  (1972)]
2                  [3,   The Godfather: Part II  (1974)]
3                         [4,   The Dark Knight  (2008)]
4                            [5,   12 Angry Men  (1957)]
5                        [6,   Schindler's List  (1993)]
6      [7,   The Lord of the Rings: The Return of the...
7                            [8,   Pulp Fiction  (1994)]
8          [9,   The Good, the Bad and the Ugly  (1966)]
9                             [10,   Fight Club  (1999)]
10     [11,   The Lord of the Rings: The Fellowship o...
11                          [12,   Forrest Gump  (1994)]
12                             [13,   Inception  (2010)]
13     [14,   Star Wars: Episode V - The Empire Strik...
14     [15,   The Lord of the Rings: The Two Towers  ...
15                            [16,   The Matrix  (1999)]
16                            [17,   Goodfellas  (1990)]
17       [18,   One Flew Over t

In [22]:
for row in movie_df['Rank & Title'].str.split('.'):
    print(row[0])

# These are what we want to store in a new 'rank' column 

1
2
3
4
5
...
246
247
248
249
250


In [23]:
# Store rank iteratively row by row

for idx, row in enumerate(movie_df['Rank & Title'].str.split('.')):
    movie_df.loc[idx, 'rank'] = row[0]

In [24]:
movie_df.sample(10)

Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,year,rank
96,97. A Clockwork Orange (1971),8.3,12345678910 NOT YET RELEASED Seen,1971,97
1,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,1972,2
181,182. Tokyo Story (1953),8.1,12345678910 NOT YET RELEASED Seen,1953,182
126,127. Unforgiven (1992),8.2,12345678910 NOT YET RELEASED Seen,1992,127
153,154. V for Vendetta (2005),8.1,12345678910 NOT YET RELEASED Seen,2005,154
182,183. The Big Lebowski (1998),8.1,12345678910 NOT YET RELEASED Seen,1998,183
218,219. The 400 Blows (1959),8.0,12345678910 NOT YET RELEASED Seen,1959,219
170,171. Jurassic Park (1993),8.1,12345678910 NOT YET RELEASED Seen,1993,171
54,55. Memento (2000),8.4,12345678910 NOT YET RELEASED Seen,2000,55
80,81. Taare Zameen Par (2007),8.3,12345678910 NOT YET RELEASED Seen,2007,81


In [25]:
# This is perfectly fine, but can be done more compactly using Python's list comprehension

#### List comprehension

In [26]:
# Create a list of ranks from the original Series of split strings
[ elem[0] for elem in movie_df['Rank & Title'].str.split('.') ]

['1',
 '2',
 '3',
 '4',
 '5',
 ...

In [27]:
# Can assign to DataFrame column directly
movie_df['rank2'] = [ elem[0] for elem in movie_df['Rank & Title'].str.split('.') ]

In [28]:
movie_df.sample(5)

Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,year,rank,rank2
227,228. Rocky (1976),8.0,12345678910 NOT YET RELEASED Seen,1976,228,228
40,41. Psycho (1960),8.5,12345678910 NOT YET RELEASED Seen,1960,41,41
18,19. Joker (2019),8.6,12345678910 NOT YET RELEASED Seen,2019,19,19
235,236. PK (2014),8.0,12345678910 NOT YET RELEASED Seen,2014,236,236
102,103. Singin' in the Rain (1952),8.2,12345678910 NOT YET RELEASED Seen,1952,103,103
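To make the equivalence concrete, here is a small sketch that builds the same list twice, once with an explicit loop and once with a list comprehension. It uses two title strings copied from the table above rather than the full DataFrame.

```python
rows = ["1. The Shawshank Redemption (1994)", "2. The Godfather (1972)"]

# Explicit loop: start with an empty list and append one item at a time
ranks_loop = []
for s in rows:
    ranks_loop.append(s.split(".")[0])

# List comprehension: the same logic in a single expression
ranks_comp = [s.split(".")[0] for s in rows]

# A comprehension can also transform each item, e.g. convert to int
ranks_int = [int(s.split(".")[0]) for s in rows]

print(ranks_loop, ranks_comp, ranks_int)
```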


In [29]:
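# This is cool! How about extracting each title into its own column?
# n=1 splits only at the first period, so titles that themselves contain periods stay intact.

movie_df['title'] = [ elem[1].split('(')[0].strip() for elem in movie_df['Rank & Title'].str.split('.', n=1) ]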
movie_df.sample(5)

Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,year,rank,rank2,title
203,204. Stand by Me (1986),8.1,12345678910 NOT YET RELEASED Seen,1986,204,204,Stand by Me
230,231. It Happened One Night (1934),8.0,12345678910 NOT YET RELEASED Seen,1934,231,231,It Happened One Night
95,96. North by Northwest (1959),8.3,12345678910 NOT YET RELEASED Seen,1959,96,96,North by Northwest
112,113. To Kill a Mockingbird (1962),8.2,12345678910 NOT YET RELEASED Seen,1962,113,113,To Kill a Mockingbird
171,172. Andhadhun (2018),8.1,12345678910 NOT YET RELEASED Seen,2018,172,172,Andhadhun
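Rank, title and year can also be separated in one pass with pandas' vectorized `str.extract` and a regular expression with named groups. This is a sketch of an alternative, not the approach used above; the regular expression is an assumption about the `rank. title (year)` layout, and the rank in the third row is made up to show a title that contains a period.

```python
import pandas as pd

toy = pd.DataFrame({"Rank & Title": [
    "1. The Shawshank Redemption (1994)",
    "5. 12 Angry Men (1957)",
    "99. Mr. Smith Goes to Washington (1939)",  # rank is made up for illustration
]})

# Each named group becomes a column in the result
pattern = r"^(?P<rank>\d+)\.\s+(?P<title>.+?)\s+\((?P<year>\d{4})\)$"
parts = toy["Rank & Title"].str.extract(pattern)

# Extracted pieces are strings; convert the numeric ones
parts["rank"] = parts["rank"].astype(int)
parts["year"] = parts["year"].astype(int)
print(parts)
```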


**`pd.read_html()` is useful and convenient, but it won't work on webpages that do not have HTML tables.**

Example:

In [30]:
pd.read_html('http://www.nd.edu')

ValueError: No tables found

# Most commonly used packages for web scraping 

* urllib
* requests
* BeautifulSoup
* re
* selenium (not covered here)

Working with these packages requires some familiarity with the fundamentals of HTML, the markup language that browsers use to display webpages. 

In [36]:
import urllib
import requests
from bs4 import BeautifulSoup

In [37]:
req = requests.get("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
page = req.text

page_soup = BeautifulSoup(page, 'html.parser')


In [38]:
page_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XeUbtgpAIDEAAHb63n4AAAEK","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_U.S._state_capitals","wgTitle":"List of U.S. state capitals","wgCurRevisionId":6726415,"wgRevisionId":6726415,"wgArticleId":18635,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["State cap

The contents of a webpage are messy and may not make much sense at first glance. However, if you want to scrape a website, you will have to be patient and look through its source code to find what you want. **Use your browser's Inspect Element feature for this.**

In [39]:
# You can extract html elements based on tags, classes, attributes and more.

# Simple example of extracting HTML title
page_soup.title

<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>

In [40]:
page_soup.title.text

'List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia'
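The same navigation works on any HTML string, which makes it easy to experiment without downloading a page. Below is a minimal sketch on a made-up snippet (the markup is illustrative, only loosely modelled on the Wikipedia page), showing access by tag name, by class, and to an attribute.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p class="intro">Hello</p>
    <a href="/wiki/Alabama" title="Alabama">Alabama</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Access by tag name: returns the first matching element
title_text = soup.title.text

# Search by tag and class
intro = soup.find("p", {"class": "intro"}).text

# Read an attribute of an element
first_link = soup.find("a")
link_href = first_link.get("href")

print(title_text, intro, link_href)
```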

### Searching the HTML source code of a webpage

You can programmatically search through the source code of a webpage to find the elements you need, using BeautifulSoup's **`find_all()`** method. You can search based on HTML tags (preferred) or even just plain old text.
Let's search for all `table` tags.

In [41]:
states_table = page_soup.find_all("table")
states_table

[<table class="wikitable sortable">
 <caption>State capitals of the United States
 </caption>
 <tbody><tr>
 <th rowspan="2">State</th>
 <th rowspan="2">Abr.</th>
 <th rowspan="2">State-hood</th>
 <th rowspan="2">Capital</th>
 <th rowspan="2">Capital since</th>
 <th rowspan="2">Area (mi²)</th>
 <th colspan="4">Population (2010)</th>
 <th rowspan="2">Notes
 </th></tr>
 <tr>
 <th><a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">Municipal</a> (Within city proper boundaries)
 </th>
 <th><a class="new" href="/w/index.php?title=List_of_metropolitan_statistical_areas&amp;action=edit&amp;redlink=1" title="List of metropolitan statistical areas (not yet started)">Metropolitan</a> (Both within the capital city proper and the surrounding area of the city proper)
 </th>
 <th>Rank in state
 </th>
 <th>Rank in US
 </th></tr>
 <tr>
 <td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
 <td>AL</td>
 <td align="center">1819</td>
 <td><

In [42]:
# This webpage happens to have only 1 table. 
len(states_table)

1
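Once a `table` element is in hand, its rows and cells can be walked to build a DataFrame manually, which is roughly what `pd.read_html()` does for you. The sketch below assumes a simple table with a single header row; the HTML is a hand-written three-column excerpt, not the full Wikipedia table (whose two-row header would need extra handling).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified, hand-written excerpt with a single header row
html = """
<table class="wikitable">
  <tr><th>State</th><th>Abr.</th><th>Capital</th></tr>
  <tr><td>Alabama</td><td>AL</td><td>Montgomery</td></tr>
  <tr><td>Alaska</td><td>AK</td><td>Juneau</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]
rows = table.find_all("tr")

# The first row holds <th> header cells, the remaining rows hold <td> data cells
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]

df = pd.DataFrame(records, columns=header)
print(df)
```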

In [43]:
# Multiple tables will simply be returned in a list. Example:

# Name the parser explicitly so BeautifulSoup does not have to guess one and warn about it.
ncaab_soup = BeautifulSoup(requests.get(
    "https://en.wikipedia.org/wiki/NCAA_Division_I_Men%27s_Basketball_Tournament").text, "lxml")
ncaab_tables = ncaab_soup.find_all("table")
len(ncaab_tables)

26

#### Extract references in the Wikipedia page
Going back to US State Capitals, let's extract the URLs of all the references at the bottom of [this Wikipedia entry](https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals). We previously stored the HTML in a BeautifulSoup object that we named `page_soup`.

In [44]:
refs = page_soup.find_all('ol', {'class': 'references'})

In [45]:
refs

[<ol class="references">
 <li id="cite_note-1"><span class="mw-cite-backlink"><a href="#cite_ref-1">↑</a></span> <span class="reference-text"><a class="external free" href="http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml" rel="nofollow">http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml</a></span>
 </li>
 </ol>, <ol class="references">
 <li id="cite_note-2"><span class="mw-cite-backlink"><a href="#cite_ref-2">↑</a></span> <span class="reference-text"><cite class="citation web"><a class="external text" href="http://spscp.weebly.com/" rel="nofollow">"The Super PAC That Supports Certain People (SPSCP)"</a>. <i>The Super PAC That Supports Certain People (SPSCP)</i><span class="reference-accessdate">. Retrieved <span class="nowrap">2019-09-19</span></span>.</cite><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=unknown&amp;rft.jtitle=The+Super+PAC+That+Supports+Certain+People+%28SPSCP%29&amp;rft.

In [46]:
# Find all li tags where id contains "cite_note". You want to pick up both "cite_note-1" and "cite_note-2" so
# we use the re package that makes it easy to detect patterns in strings.

import re
citenotes = page_soup.find_all('li', {'id': re.compile('cite_note-')})
citenotes

[<li id="cite_note-1"><span class="mw-cite-backlink"><a href="#cite_ref-1">↑</a></span> <span class="reference-text"><a class="external free" href="http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml" rel="nofollow">http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml</a></span>
 </li>,
 <li id="cite_note-2"><span class="mw-cite-backlink"><a href="#cite_ref-2">↑</a></span> <span class="reference-text"><cite class="citation web"><a class="external text" href="http://spscp.weebly.com/" rel="nofollow">"The Super PAC That Supports Certain People (SPSCP)"</a>. <i>The Super PAC That Supports Certain People (SPSCP)</i><span class="reference-accessdate">. Retrieved <span class="nowrap">2019-09-19</span></span>.</cite><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=unknown&amp;rft.jtitle=The+Super+PAC+That+Supports+Certain+People+%28SPSCP%29&amp;rft.atitle=The+Super+PAC+That+Supports+Certain+People+%28SPS

In [47]:
for elem in citenotes:
    print(elem.find('a', {'class': 'external'}))

<a class="external free" href="http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml" rel="nofollow">http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml</a>
<a class="external text" href="http://spscp.weebly.com/" rel="nofollow">"The Super PAC That Supports Certain People (SPSCP)"</a>


In [48]:
# Almost there
for elem in citenotes:
    print(elem.find('a', {'class': 'external'}).get('href'))

http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml
http://spscp.weebly.com/


In [49]:
# You could populate a list, Series or DataFrame with the extracted items iteratively.

# But as we saw previously, Python's "list comprehension" is very useful for situations like this.

ref_links = [ elem.find('a', {'class': 'external'}).get('href') for elem in citenotes ]
ref_links

['http://www.vtliving.com/towns/montpelier/visit_montpelier.shtml',
 'http://spscp.weebly.com/']
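One caveat with chaining `.find(...).get('href')`: `find()` returns `None` when nothing matches, so a reference without an external link would raise an `AttributeError`. The sketch below shows a guarded version on a made-up snippet (the URLs and ids are illustrative only).

```python
import re

from bs4 import BeautifulSoup

# Made-up reference list: the second note has no external link at all
html = """
<ol class="references">
  <li id="cite_note-1"><a class="external free" href="http://example.com/a">a</a></li>
  <li id="cite_note-2"><span>no link here</span></li>
  <li id="other-3"><a class="external text" href="http://example.com/c">c</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <li> elements whose id starts with "cite_note-"
notes = soup.find_all("li", {"id": re.compile(r"^cite_note-")})

links = []
for li in notes:
    a = li.find("a", {"class": "external"})
    if a is not None:  # skip notes that have no external link
        links.append(a.get("href"))

print(links)
```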

## Grab and embed our favorite song

Let's do something more fun: embed everybody's favorite song on this page.

In [50]:
import urllib.parse   # import the submodule explicitly so urllib.parse.quote is available
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import random
from IPython.display import IFrame

In [51]:
textToSearch = "Notre Dame Our Mother Notre Dame Folk Choir"

query = urllib.parse.quote(textToSearch)

query

'Notre%20Dame%20Our%20Mother%20Notre%20Dame%20Folk%20Choir'
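`quote` is one of several helpers in the standard library's `urllib.parse` module for making text safe to put in a URL. The short sketch below compares it with `quote_plus` and `urlencode`; the search text is shortened for readability.

```python
from urllib.parse import quote, quote_plus, urlencode

text = "Notre Dame Our Mother"

# quote() percent-encodes spaces as %20
print(quote(text))

# quote_plus() uses '+' for spaces, the form-encoding style used in query strings
print(quote_plus(text))

# urlencode() builds a whole query string from a dict of parameters
query_string = urlencode({"search_query": text})
print(query_string)
```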

In [52]:
url = "https://www.youtube.com/results?search_query=" + query

response = urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'lxml')

In [53]:
# In the following we are just picking up the first link that shows up. 
soup.find_all(attrs={'class':'yt-uix-tile-link'})[0]

<a aria-describedby="description-id-616784" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " data-sessionlink="itct=CF4Q3DAYACITCO-2iu3qnuYCFeJBTAgdpSgGHjIGc2VhcmNoUitOb3RyZSBEYW1lIE91ciBNb3RoZXIgTm90cmUgRGFtZSBGb2xrIENob2ly" dir="ltr" href="/watch?v=ENQMjEskM7s" rel="spf-prefetch" title="Notre Dame, Our Mother | Notre Dame Folk Choir">Notre Dame, Our Mother | Notre Dame Folk Choir</a>

In [54]:
# Now we pick up the link's href, which contains the ID string that uniquely identifies every YouTube video
soup.find_all(attrs={'class':'yt-uix-tile-link'})[0]['href']

'/watch?v=ENQMjEskM7s'

In [55]:
# Extract the ID without the leading '/watch?v='. That prefix is 9 characters long, so slice from index 9 onward.
soup.find_all(attrs={'class':'yt-uix-tile-link'})[0]['href'][9:]

'ENQMjEskM7s'
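Slicing from index 9 relies on the prefix always being exactly `/watch?v=`. A somewhat more robust sketch is to parse the query string with the standard library, which keeps working if extra parameters are present. The second link below, with its `t=42` parameter, is a made-up variation for illustration.

```python
from urllib.parse import parse_qs, urlparse

href = "/watch?v=ENQMjEskM7s"

# urlparse() splits the link into components; parse_qs() turns the query into a dict of lists
params = parse_qs(urlparse(href).query)
video_id = params["v"][0]
print(video_id)

# Still works when other parameters come before or after 'v' (hypothetical link)
other = parse_qs(urlparse("/watch?t=42&v=ENQMjEskM7s").query)["v"][0]
print(other)
```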

In [56]:
songLink = 'https://www.youtube.com/embed/'+ soup.find_all(attrs={'class':'yt-uix-tile-link'})[0]['href'][9:]

# Embed the video we found in this notebook using an HTML iframe

IFrame(songLink, width="560", height="315")

# Web Scraping through Application Programming Interface (API)

What's an API? In general an API is "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service." (lexico.com.)

In the context of the Web, an API for a website simply returns data in response to a request made by a client (i.e. us).

Say we are FedEx couriers. We want to pick up some packages from someone's house for overnight delivery. Instead of trying to break into the house and searching for the packages, imagine the home owner gave us specific instructions on where the packages are and how to access them (perhaps a back door with a keypad). That's an API!

If a website you wish to harvest data from is kind enough to provide an API, use it and play by its rules! (And thank it profusely.)

**If an API is available for a website, that is always preferred rather than scraping.**

In this section, we will use simple APIs from [Open Notify](http://open-notify.org/), an open-source project that serves some of NASA's data, to retrieve data about the International Space Station (ISS). 

Some of the content presented here is based on [dataquest](https://www.dataquest.io/blog/python-api-tutorial/). 

## International Space Station APIs

In [19]:
import pandas as pd
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")

print(response.status_code)

200


There are various status codes that you will get when you send any web request. [This](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) page describes them in detail.
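Status codes are grouped by their first digit: 2xx means success, 3xx redirection, 4xx a client-side error, and 5xx a server-side error. As a small sketch, Python's standard library can translate a code into its standard phrase; `describe` is a hypothetical helper written for this example.

```python
from http import HTTPStatus


def describe(status_code):
    """Return a short label such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(status_code).phrase
    if 200 <= status_code < 300:
        kind = "success"
    elif 300 <= status_code < 400:
        kind = "redirect"
    elif 400 <= status_code < 500:
        kind = "client error"
    elif status_code >= 500:
        kind = "server error"
    else:
        kind = "informational"
    return f"{status_code} {phrase} ({kind})"


for code in (200, 301, 404, 503):
    print(describe(code))
```

In a scraping script, checking `response.status_code == 200` before parsing is a simple way to avoid processing an error page by mistake.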

#### Current ISS Position

In [20]:
response1 = requests.get("http://api.open-notify.org/iss-now.json")

# The content is typically returned as JSON. pandas' read_json is great for parsing JSON into dataframes automatically.
pd.read_json(response1.content)

Unnamed: 0,iss_position,message,timestamp
latitude,3.9359,success,2019-12-05 16:17:19
longitude,-84.2903,success,2019-12-05 16:17:19


#### Current Number of People In Space

In [59]:
response2 = requests.get("http://api.open-notify.org/astros.json")
pd.read_json(response2.content)

Unnamed: 0,people,number,message
0,"{'name': 'Christina Koch', 'craft': 'ISS'}",6,success
1,"{'name': 'Alexander Skvortsov', 'craft': 'ISS'}",6,success
2,"{'name': 'Luca Parmitano', 'craft': 'ISS'}",6,success
3,"{'name': 'Andrew Morgan', 'craft': 'ISS'}",6,success
4,"{'name': 'Oleg Skripochka', 'craft': 'ISS'}",6,success
5,"{'name': 'Jessica Meir', 'craft': 'ISS'}",6,success


## Google Maps API


### Google API key

Unlike NASA, Google Maps is a commercial enterprise, so Google wants to know who is using its APIs in order to control access. Everyone who uses Google's APIs therefore needs an API key tied to a Google account. Ideally you would create this key from your Google API dashboard after logging in with your own Google account. I have provided a key for a dummy account. It comes with its own restrictions, as you'll see below.

**NOTE**: The key below, `AIzaSyBwoKBVreiCa9HmRNmvIwZgeP5bn3tk-h0`, will be disabled after the class. You can create your own key using the link [here](https://support.google.com/googleapi/answer/6158862?hl=en). 

It is more convenient to access the Google Maps API using the `googlemaps` package, but we first need to install it.

In [3]:
!pip install googlemaps --user



In [4]:
import googlemaps
from datetime import datetime

In [8]:
gmaps = googlemaps.Client(key='AIzaSyBwoKBVreiCa9HmRNmvIwZgeP5bn3tk-h0')
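When you create your own key, a common practice is to keep it out of the notebook and read it from an environment variable instead. The sketch below is one way to do that; the variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the `googlemaps` package requires.

```python
import os


def load_api_key(env_var="GOOGLE_MAPS_API_KEY", environ=None):
    """Read an API key from the environment; the variable name is our own choice."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Demo with a stand-in mapping so the example runs without a real key
fake_env = {"GOOGLE_MAPS_API_KEY": "not-a-real-key"}
key = load_api_key(environ=fake_env)
print(key)

# With a real key set in the environment, the client would be created like this:
# gmaps = googlemaps.Client(key=load_api_key())
```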

#### Using Google Maps API, find expected driving time between South Bend (SB) and Chicago

In [9]:
cities_zip = pd.read_csv("https://www3.nd.edu/~jng2/uscities_zip.csv", index_col = ['city', 'state_id'])

In [10]:
cities_zip.head(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,city_ascii,state_name,county_fips,county_name,county_fips_all,county_name_all,lat,lng,population,density,source,military,incorporated,timezone,ranking,zips,id
city,state_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
New York,NY,New York,New York,36061,New York,36061,New York,40.6943,-73.9249,19354922,11083,polygon,False,True,America/New_York,1,11229 11226 11225 11224 11222 11221 11220 1138...,1840059961
Los Angeles,CA,Los Angeles,California,6037,Los Angeles,6037,Los Angeles,34.1139,-118.4068,12815475,3295,polygon,False,True,America/Los_Angeles,1,90291 90293 90292 91316 91311 90037 90031 9000...,1840107920


In [11]:
starttime = datetime.strptime('Dec 20 2019  8:00AM', '%b %d %Y %I:%M%p')
starttime

datetime.datetime(2019, 12, 20, 8, 0)

In [12]:
results = gmaps.distance_matrix((cities_zip.loc['South Bend', 'IN']['lat'], cities_zip.loc['South Bend', 'IN']['lng']), 
                      (cities_zip.loc['Chicago', 'IL']['lat'], cities_zip.loc['Chicago', 'IL']['lng']), 
                      departure_time= starttime)

In [13]:
results

{'destination_addresses': ['3201 S Western Ave, Chicago, IL 60608, USA'],
 'origin_addresses': ['123 Spruce St, South Bend, IN 46616, USA'],
 'rows': [{'elements': [{'distance': {'text': '145 km', 'value': 145084},
     'duration': {'text': '1 hour 42 mins', 'value': 6095},
     'duration_in_traffic': {'text': '1 hour 44 mins', 'value': 6261},
     'status': 'OK'}]}],
 'status': 'OK'}
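The result is a nested dictionary, so the numbers of interest have to be dug out by key and list position. The sketch below copies the dictionary shown above and converts the raw values, which the `text` fields suggest are in meters and seconds, into kilometers and minutes.

```python
# Copied from the output above
results = {
    "destination_addresses": ["3201 S Western Ave, Chicago, IL 60608, USA"],
    "origin_addresses": ["123 Spruce St, South Bend, IN 46616, USA"],
    "rows": [{"elements": [{
        "distance": {"text": "145 km", "value": 145084},
        "duration": {"text": "1 hour 42 mins", "value": 6095},
        "duration_in_traffic": {"text": "1 hour 44 mins", "value": 6261},
        "status": "OK",
    }]}],
    "status": "OK",
}

# One origin and one destination means one row containing one element
element = results["rows"][0]["elements"][0]

# Check both the overall status and the status of this origin/destination pair
if results["status"] == "OK" and element["status"] == "OK":
    distance_km = element["distance"]["value"] / 1000
    minutes_in_traffic = element["duration_in_traffic"]["value"] / 60
    print(f"{distance_km:.1f} km, about {minutes_in_traffic:.0f} minutes in traffic")
```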

#### Using Google Maps API, find the current location of ISS.

In [21]:
iss_df = pd.read_json(response1.content)
iss_df.head()

Unnamed: 0,iss_position,message,timestamp
latitude,3.9359,success,2019-12-05 16:17:19
longitude,-84.2903,success,2019-12-05 16:17:19


In [22]:
lat = iss_df.loc['latitude', 'iss_position']
long = iss_df.loc['longitude', 'iss_position']
(lat, long)

(3.9359, -84.2903)

In [28]:
gmaps.reverse_geocode( (lat, long) )

ApiError: REQUEST_DENIED (This API project is not authorized to use this API.)

In [25]:
# Oops, Google doesn't give everything away for free
# Let's use a free package instead
!pip install reverse_geocoder --user



In [26]:
import reverse_geocoder

In [27]:
reverse_geocoder.search( (lat, long) )

Loading formatted geocoded file...


[OrderedDict([('lat', '8.03333'),
              ('lon', '-82.86667'),
              ('name', 'Punta de Burica'),
              ('admin1', 'Chiriqui'),
              ('admin2', ''),
              ('cc', 'PA')])]