# Web Scraping Using urllib and BeautifulSoup

### Problem 1: Build a web crawler to download New York MTA's weekly turnstile data
#### Source: http://web.mta.info/developers/ (Right click and "Inspect" hyperlinks)

In [1]:
import urllib.request                   # Library to fetch webpages
from bs4 import BeautifulSoup           # Library to parse data from webpages
from time import sleep
from random import uniform

In [2]:
# Fetch webpage, parse HTML code from fetched page, and save to BeautifulSoup object
url = 'http://web.mta.info/developers/turnstile.html'
webpage = urllib.request.urlopen(url)
soup = BeautifulSoup(webpage, 'lxml')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<title>mta.info | Turnstile Data</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!--<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7">-->
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="/siteimages/favicon.ico" rel="shortcut icon"/>
<link href="/css/base.css" rel="stylesheet" type="text/css"/>
<link href="/css/grid.css" rel="stylesheet" type="text/css"/>
<link href="/css/topbar.css" rel="stylesheet" type="text/css"/>
<link href="/css/formalize.css" rel="stylesheet" type="text/css"/>
<!-- <link rel="stylesheet" type="text/css" href="/css/jquery.datepick.css"> -->
<!-- jQuery include should be at the top -->
<script language="javascript" src="/js/jquery-1.4.4.min.js" type="text/javascript"></script>
<!-- Global site tag (gtag.js) - Google Analytics 

In [3]:
soup.get_text()                         # All text content with the HTML tags stripped

'\n\nmta.info | Turnstile Data\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n\r\n  gtag(\'config\', \'UA-139746469-1\');\r\n\n\n\n\n@import url(/mta/mtahq_custom_clean.css);\n@import url(/mta/news/newsroom_custom.css);\n#contentbox h2,h3 {\npadding-bottom:8px;\npadding-top:12px;\n}\n.indented {\npadding-left:15px;\n}\n\n\n\n\n\n\n\nSkip to main content\n\n\n\n\n\n\n\n\nAccessibility\nText-only\nCustomer Self-Service\nEmployment\nFAQs/Contact Us\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCoronavirus updates:  MTA Service During the Coronavirus Pandemic, Read more\n\n\n\n\nHome\n\nMTA Home\nNYC Subways and Buses\nLong Island Rail Road\nMetro-North Railroad\nBridges and Tunnels\nMTA Capital Program\n\n\nSchedules\nFares & Tolls\nMaps\nPlanned Service Changes\nMTA Info\nDoing Business With Us\nTransparency\n\nMain Page\nBoard Materials\nBudget Info\nCapital Program Info\nCapital Program D

In [4]:
# prettify() re-indents the parsed tree so the nesting is easy to read.
# BeautifulSoup supports several parsers: 'lxml', 'html.parser' (built in),
# 'html5lib', and 'xml' (lxml's XML parser). lxml is fast and lenient.

print(soup.prettify())                  # Indented view of the parsed HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
 <head>
  <title>
   mta.info | Turnstile Data
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!--<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7">-->
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="/siteimages/favicon.ico" rel="shortcut icon"/>
  <link href="/css/base.css" rel="stylesheet" type="text/css"/>
  <link href="/css/grid.css" rel="stylesheet" type="text/css"/>
  <link href="/css/topbar.css" rel="stylesheet" type="text/css"/>
  <link href="/css/formalize.css" rel="stylesheet" type="text/css"/>
  <!-- <link rel="stylesheet" type="text/css" href="/css/jquery.datepick.css"> -->
  <!-- jQuery include should be at the top -->
  <script language="javascript" src="/js/jquery-1.4.4.min.js" type="text/javascript">
  </script>
  <!-- Global
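
The difference between `get_text()` and `prettify()` is easier to see on a small input. The following is a minimal sketch on a hypothetical inline HTML snippet (the markup and the name `demo` are made up for illustration). It uses the built-in `'html.parser'` so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a fetched page
html = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"
demo = BeautifulSoup(html, 'html.parser')

text_only = demo.get_text()     # Tags stripped, text nodes concatenated
pretty = demo.prettify()        # Same markup, re-indented over several lines

print(text_only)
print(pretty)
```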

In [5]:
soup.title                              # Returns title tag

<title>mta.info | Turnstile Data</title>

In [6]:
soup.title.string                       # Returns content between <title> and </title>

'mta.info | Turnstile Data'

In [7]:
soup.a                                  # Returns first a tag

<a href="#main-content">Skip to main content</a>

In [8]:
soup.find_all('a')                      # Returns all a tags (for URL links)

[<a href="#main-content">Skip to main content</a>,
 <a href="http://www.mta.info"><img alt="To MTA.info homepage" src="/template/images/mta_info.gif"/></a>,
 <a href="/accessibility">Accessibility</a>,
 <a href="http://assistive.usablenet.com/tt/http://www.mta.info">Text-only</a>,
 <a href="/selfserve">Customer Self-Service</a>,
 <a href="/mta/employment/">Employment</a>,
 <a href="/faqs.htm">FAQs/Contact Us</a>,
 <a href="https://new.mta.info/coronavirus" style="text-decoration: none;color: yellow !important;" target="_blank">Read more</a>,
 <a href="http://www.mta.info" style="padding-left:18px;">Home</a>,
 <a href="http://www.mta.info">MTA Home</a>,
 <a href="http://www.mta.info/nyct">NYC Subways and Buses</a>,
 <a href="http://www.mta.info/lirr">Long Island Rail Road</a>,
 <a href="http://www.mta.info/mnr">Metro-North Railroad</a>,
 <a href="http://www.mta.info/bandt">Bridges and Tunnels</a>,
 <a href="http://web.mta.info/capital">MTA Capital Program</a>,
 <a href="http://www.mta.i

In [9]:
len(soup.find_all('a'))

597

In [10]:
for link in soup.find_all('a'):
    print(link.get('href'))

#main-content
http://www.mta.info
/accessibility
http://assistive.usablenet.com/tt/http://www.mta.info
/selfserve
/mta/employment/
/faqs.htm
https://new.mta.info/coronavirus
http://www.mta.info
http://www.mta.info
http://www.mta.info/nyct
http://www.mta.info/lirr
http://www.mta.info/mnr
http://www.mta.info/bandt
http://web.mta.info/capital
http://www.mta.info/schedules
http://web.mta.info/fares
http://web.mta.info/maps
http://web.mta.info/service
http://web.mta.info/about
http://web.mta.info/business
http://web.mta.info/accountability/
http://web.mta.info/accountability
http://web.mta.info/mta/boardmaterials.html
http://web.mta.info/mta/budget/
http://web.mta.info/capital
http://web.mta.info/capitaldashboard/CPDHome.html
http://web.mta.info/mta/investor/
http://web.mta.info/mta/leadership/
http://web.mta.info/persdashboard/performance14.html
http://www.mta.info/mta-news
http://web.mta.info/mta/news/hearings
http://web.mta.info/mta/news/hearings/index-reinvention.html
None
resources/nyc
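
The list above mixes absolute URLs, relative paths, in-page anchors (`#main-content`), and a `None` for an `a` tag without an `href`. One way to normalize such a list, sketched here on a hypothetical inline snippet rather than the live page, is to pass `href=True` to `find_all()` and resolve each link against the page URL with `urllib.parse.urljoin`. The names `demo` and `absolute_links` are illustrative.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://web.mta.info/developers/turnstile.html'

# Hypothetical snippet with the same kinds of links as the page above
html = '''
<a href="#main-content">Skip to main content</a>
<a name="top">No href here</a>
<a href="/accessibility">Accessibility</a>
<a href="data/nyct/turnstile/turnstile_210123.txt">Data</a>
'''
demo = BeautifulSoup(html, 'html.parser')

absolute_links = [
    urljoin(base_url, a['href'])
    for a in demo.find_all('a', href=True)      # Skips <a> tags without an href
    if not a['href'].startswith('#')            # Skips in-page anchors
]
print(absolute_links)
```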

In [11]:
soup.find_all('a')[36].get('href')      # Data links start at the 37th a tag (index 36)

'resources/nyct/turnstile/Remote-Booth-Station.xls'

In [12]:
soup.findAll('a')[37].get('href')       # findAll() is the older alias of find_all()

'data/nyct/turnstile/turnstile_210123.txt'
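
Selecting links by position (36, 37, ...) breaks as soon as the page gains or loses a link. A possible alternative, sketched below on a hypothetical snippet, is to pass a compiled regular expression as the `href` filter so that only the weekly data files match. The pattern assumes the `turnstile_YYMMDD.txt` naming seen in the output above; the second data link is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet; only the first two hrefs appear in the real output above
html = '''
<a href="resources/nyct/turnstile/Remote-Booth-Station.xls">Key</a>
<a href="data/nyct/turnstile/turnstile_210123.txt">Week 1</a>
<a href="data/nyct/turnstile/turnstile_210116.txt">Week 2</a>
<a href="/faqs.htm">FAQs</a>
'''
demo = BeautifulSoup(html, 'html.parser')

# Assumed file naming: turnstile_ followed by six digits and .txt
pattern = re.compile(r'turnstile_\d{6}\.txt$')
data_links = [a['href'] for a in demo.find_all('a', href=pattern)]
print(data_links)
```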

In [13]:
# Download data from a tags and save to local files

a_tags = soup.find_all('a')             # Collect the links once, not on every iteration
for i in range(37, 40):
    link = a_tags[i]['href']            # e.g. 'data/nyct/turnstile/turnstile_210123.txt'
    download_url = 'http://web.mta.info/developers/' + link
    filename = link[link.find('/turnstile_')+1:]
    urllib.request.urlretrieve(download_url, 'C:/Users/abhatt/Desktop/' + filename)
    sleep(uniform(1.0, 3.0))

# urlretrieve() takes two arguments: the download URL and the local filename
# Local files are saved as e.g. 'turnstile_210123.txt' in the Desktop directory
# Delay successive requests (1-3 sec) to avoid overloading the server and being
# blocked as a DoS attacker/bot
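
The URL concatenation and filename slicing above can also be sketched with standard-library helpers. The function below, `download_plan`, is a hypothetical helper (not part of the original notebook) that only computes the (URL, filename) pairs and downloads nothing, which makes it easy to check before any request is sent.

```python
import posixpath
from urllib.parse import urljoin, urlparse

def download_plan(page_url, links):
    """Return (absolute_url, local_filename) pairs without downloading anything."""
    plan = []
    for link in links:
        absolute = urljoin(page_url, link)                       # Resolve relative href
        filename = posixpath.basename(urlparse(absolute).path)   # Last path segment
        plan.append((absolute, filename))
    return plan

page_url = 'http://web.mta.info/developers/turnstile.html'
links = ['data/nyct/turnstile/turnstile_210123.txt']
plan = download_plan(page_url, links)
print(plan)
```

Each pair could then be passed to `urllib.request.urlretrieve` inside the same delayed loop.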

### Problem 2: Write a web scraper to extract selected data from a wiki page into a CSV file using requests library instead of urllib and html.parser instead of lxml
#### Source: https://en.wikipedia.org/wiki/Fields_Medal

In [14]:
import requests
url = 'https://en.wikipedia.org/wiki/Fields_Medal'
response = requests.get(url)
response.status_code                    # Status 200 indicates success

200

In [15]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Fields Medal - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YApDwgpAMMEAAWRYfP8AAABD","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Fields_Medal","wgTitle":"Fields Medal","wgCurRevisionId":991319782,"wgRevisionId":991319782,"wgArticleId":10859,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Spanish-language sources (es)","Webarchive template wayback links","Articles with short description","Short description matches Wikidata","Wikipedia semi-protected pages","Use 

In [16]:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Fields Medal - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YApDwgpAMMEAAWRYfP8AAABD","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Fields_Medal","wgTitle":"Fields Medal","wgCurRevisionId":991319782,"wgRevisionId":991319782,"wgArticleId":10859,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Spanish-language sources (es)","Webarchive template wayback links","Articles with short description","Short description matches Wikidata","Wikipedia semi-protected 

In [17]:
soup.title

<title>Fields Medal - Wikipedia</title>

In [18]:
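# requests.get() sends any parameters in the URL, where they can end up in logs and
# browser history; for sensitive data such as logins, post() sends them in the request
# body instead (use HTTPS in either case so the traffic is encrypted)
# e.g., r = requests.post('https://facebook.com/post', data = {'key':'value'})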

# Locate the right table on the Wiki page to source data

len(soup.find_all('table'))

6
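
The note above on `get()` versus `post()` can be illustrated without sending any request. This sketch uses the standard library's `urllib.request.Request` rather than `requests`, and the URL and parameters are placeholders: it only builds the two request objects and shows where the parameters end up.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'user': 'alice', 'token': 'secret'}          # Placeholder values
encoded = urlencode(params)

# GET: parameters become part of the URL
get_req = Request('https://example.com/login?' + encoded)

# POST: parameters travel in the request body; the URL stays clean
post_req = Request('https://example.com/login', data=encoded.encode('utf-8'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```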

In [19]:
len(soup.find_all('table', {'class':'wikitable sortable'}))

1

In [20]:
my_table = soup.find('table', {'class':'wikitable sortable'})
all_rows = my_table.find_all('tr')
print(all_rows[:5])

[<tr>
<th>Year
</th>
<th><a href="/wiki/International_Congress_of_Mathematicians#List_of_Congresses" title="International Congress of Mathematicians">ICM</a> location
</th>
<th>Medalists<sup class="reference" id="cite_ref-IMU_21-0"><a href="#cite_note-IMU-21">[21]</a></sup>
</th>
<th>Affiliation<br/>(when awarded)
</th>
<th>Affiliation<br/>(current/last)
</th>
<th>Reasons
</th></tr>, <tr>
<td rowspan="2">1936
</td>
<td rowspan="2"><a href="/wiki/Oslo" title="Oslo">Oslo</a>, Norway
</td>
<td><a href="/wiki/Lars_Ahlfors" title="Lars Ahlfors">Lars Ahlfors</a>
</td>
<td><a href="/wiki/University_of_Helsinki" title="University of Helsinki">University of Helsinki</a>, Finland
</td>
<td><a href="/wiki/Harvard_University" title="Harvard University">Harvard University</a>, US<sup class="reference" id="cite_ref-22"><a href="#cite_note-22">[22]</a></sup><sup class="reference" id="cite_ref-23"><a href="#cite_note-23">[23]</a></sup>
</td>
<td>"Awarded medal for research on covering surfaces related

In [21]:
len(all_rows)

# Note: Some rows have 6 columns, some have 4 columns
# There are also URLs, \n, etc. in the cells that need cleaning 

61

In [22]:
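# Read data from HTML table into a list of rows

data = []
for row in all_rows:
    cells = row.find_all('td')          # Header row has only <th> cells, so it is skipped
    if len(cells) == 6:
        # find(string=True) returns only the first text node of a cell,
        # so text that follows a link or other nested tag is cut off
        a = cells[0].find(string=True)
        b = cells[1].find(string=True)
        c = cells[2].find(string=True)
        d = cells[3].find(string=True)
        e = cells[4].find(string=True)
        f = cells[5].find(string=True)
        data.append([a, b, c, d, e, f])
    elif len(cells) == 4:
        # Year and location span several rows (rowspan), so reuse a and b
        c = cells[0].find(string=True)
        d = cells[1].find(string=True)
        e = cells[2].find(string=True)
        f = cells[3].find(string=True)
        data.append([a, b, c, d, e, f])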
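
The loop above carries the year and location forward for rows that fall under a `rowspan`. The same idea can be sketched on a small hypothetical table (three columns instead of six, placeholder names). This variant uses `get_text()` instead of taking the first text node, which is a different choice from the cell above: it keeps the text that follows a link, such as ", Norway".

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same rowspan layout as the wiki table
html = '''
<table>
<tr><th>Year</th><th>Location</th><th>Name</th></tr>
<tr><td rowspan="2">1936</td><td rowspan="2"><a href="/wiki/Oslo">Oslo</a>, Norway</td><td>Person A</td></tr>
<tr><td>Person B</td></tr>
</table>
'''
demo = BeautifulSoup(html, 'html.parser')

rows = []
year = location = None
for tr in demo.find('table').find_all('tr'):
    cells = [td.get_text().strip() for td in tr.find_all('td')]
    if len(cells) == 3:                 # Full row: update the carried values
        year, location, name = cells
    elif len(cells) == 1:               # Row under a rowspan: reuse year and location
        name = cells[0]
    else:                               # Header row has no <td> cells
        continue
    rows.append([year, location, name])

print(rows)
```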

In [23]:
# Clean newline characters in data cells

for row in range(len(data)):
    for col in range(6):
        data[row][col] = data[row][col].replace('\n', '')
        
data

[['1936',
  'Oslo',
  'Lars Ahlfors',
  'University of Helsinki',
  'Harvard University',
  '"Awarded medal for research on covering surfaces related to '],
 ['1936',
  'Oslo',
  'Jesse Douglas',
  'Massachusetts Institute of Technology',
  'City College of New York',
  '"Did important work on the '],
 ['1950',
  'Cambridge',
  'Laurent Schwartz',
  'University of Nancy',
  'University of Paris VII',
  '"Developed the '],
 ['1950',
  'Cambridge',
  'Atle Selberg',
  'Institute for Advanced Study',
  'Institute for Advanced Study',
  '"Developed generalizations of the '],
 ['1954',
  'Amsterdam',
  'Kunihiko Kodaira',
  'Princeton University',
  'University of Tokyo',
  '"Achieved major results in the theory of harmonic integrals and numerous applications to Kählerian and more specifically to '],
 ['1954',
  'Amsterdam',
  'Jean-Pierre Serre',
  'University of Nancy',
  'Collège de France',
  '"Achieved major results on the '],
 ['1958',
  'Edinburgh',
  'Klaus Roth',
  'University Coll
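
The nested index loop above can also be written as a list comprehension. The sketch below works on a tiny sample shaped like the output above; stripping the stray leading quote and trailing space is an optional extra step that the original cell does not perform.

```python
# Sample rows shaped like the scraped data above
raw_rows = [['1936\n', 'Oslo', '"Did important work on the ']]

clean_rows = [
    [cell.replace('\n', '').strip().strip('"') for cell in row]
    for row in raw_rows
]
print(clean_rows)
```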

In [24]:
# Transfer data from the list of rows to a Pandas dataframe and write to a CSV file

import pandas as pd
df = pd.DataFrame(data)
df.columns = ['Year', 'Location', 'Name', 'Affiliation', 'Last_Affiliation', 'Reason']
df[0:10]

Unnamed: 0,Year,Location,Name,Affiliation,Last_Affiliation,Reason
0,1936,Oslo,Lars Ahlfors,University of Helsinki,Harvard University,"""Awarded medal for research on covering surfac..."
1,1936,Oslo,Jesse Douglas,Massachusetts Institute of Technology,City College of New York,"""Did important work on the"
2,1950,Cambridge,Laurent Schwartz,University of Nancy,University of Paris VII,"""Developed the"
3,1950,Cambridge,Atle Selberg,Institute for Advanced Study,Institute for Advanced Study,"""Developed generalizations of the"
4,1954,Amsterdam,Kunihiko Kodaira,Princeton University,University of Tokyo,"""Achieved major results in the theory of harmo..."
5,1954,Amsterdam,Jean-Pierre Serre,University of Nancy,Collège de France,"""Achieved major results on the"
6,1958,Edinburgh,Klaus Roth,University College London,Imperial College London,"""Solved in 1955 the famous"
7,1958,Edinburgh,René Thom,University of Strasbourg,Institut des Hautes Études Scientifiques,"""In 1954 invented and developed the theory of"
8,1962,Stockholm,Lars Hörmander,University of Stockholm,Lund University,"""Worked in"
9,1962,Stockholm,John Milnor,Princeton University,Stony Brook University,"""Proved that a 7-dimensional sphere can have s..."


In [25]:
df.to_csv('C:/Users/abhatt/Desktop/Text_Analytics/python/FieldsMedal.csv')
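
By default `to_csv()` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, using a one-row stand-in frame (`df_demo` is illustrative) and in-memory strings instead of a file:

```python
import io
import pandas as pd

df_demo = pd.DataFrame([['1936', 'Oslo']], columns=['Year', 'Location'])

with_index = df_demo.to_csv()                  # Index written as an unnamed column
without_index = df_demo.to_csv(index=False)    # Only the named columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the first version back shows where 'Unnamed: 0' comes from
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```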

### Problem 3: Download product and price data from a live e-commerce website
#### We need selenium to extract dynamic data rendered through JavaScript. We also need chromedriver or a similar browser driver to run the script
#### Download chromedriver.exe from https://sites.google.com/a/chromium.org/chromedriver/downloads (chromedriver version must match the installed Google Chrome version)

In [26]:
# pip install selenium
from selenium import webdriver
import time

In [27]:
url = 'https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops'
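# Selenium 4 expects the driver path wrapped in a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('c:/Users/abhatt/Desktop/Text_Analytics/python/chromedriver'))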
driver.get(url)

In [28]:
time.sleep(1)                              # Give JavaScript time to render
soup = BeautifulSoup(driver.page_source, 'lxml')   # Name the parser explicitly to avoid a warning

In [29]:
for caption in soup.find_all(class_='caption'):
    product_name = caption.find(class_='title').text
    price = caption.find(class_='pull-right price').text
    print(product_name, price)

Asus VivoBook X4 $295.99
Prestigio SmartB $299.00
Prestigio SmartB $299.00
Aspire E1-510 $306.99
Lenovo V110-15IA $321.94
Lenovo V110-15IA $356.49
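
Once the rendered HTML is in a BeautifulSoup object, the prices are still strings. The sketch below runs on a hypothetical snippet that mimics the caption markup (product names and prices are placeholders) and converts each price to a float. Searching for `class_='price'` alone matches any element that has `price` among its classes, so it does not depend on the exact `pull-right price` string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the caption blocks scraped above
html = '''
<div class="caption">
  <h4 class="pull-right price">$100.50</h4>
  <h4><a class="title" href="#">Laptop A</a></h4>
</div>
<div class="caption">
  <h4 class="pull-right price">$1,200.00</h4>
  <h4><a class="title" href="#">Laptop B</a></h4>
</div>
'''
demo = BeautifulSoup(html, 'html.parser')

products = []
for caption in demo.find_all(class_='caption'):
    name = caption.find(class_='title').get_text(strip=True)
    price_text = caption.find(class_='price').get_text(strip=True)
    price = float(price_text.replace('$', '').replace(',', ''))   # '$1,200.00' -> 1200.0
    products.append((name, price))

print(products)
```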


### Pickling/serializing Python objects

In [30]:
# Note: Pickles are binary files; a file extension (e.g. .pkl) is optional

import pickle
filename = 'c:/Users/abhatt/Desktop/Text_Analytics/python/data/FieldsMedal_pkl'
with open(filename, 'wb') as outfile:
    pickle.dump(df, outfile)
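
The dump and load steps can be tried end to end without touching the Desktop paths. This is a minimal sketch that pickles a small placeholder dictionary (standing in for the dataframe) into a temporary directory and reads it back.

```python
import os
import pickle
import tempfile

record = {'Year': '1936', 'Name': 'Lars Ahlfors'}   # Stand-in for the dataframe

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pkl')
    with open(path, 'wb') as outfile:               # Pickles must be written in binary mode
        pickle.dump(record, outfile)
    with open(path, 'rb') as infile:                # ...and read in binary mode
        restored = pickle.load(infile)

print(restored == record, restored is record)
```

The restored object is equal to the original but is a new object, which is what a serialization round trip is expected to produce.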

In [31]:
with open('c:/Users/abhatt/Desktop/Text_Analytics/python/data/FieldsMedal_pkl', 'rb') as infile:
    df_new = pickle.load(infile)            # with closes the file automatically
df_new

Unnamed: 0,Year,Location,Name,Affiliation,Last_Affiliation,Reason
0,1936,Oslo,Lars Ahlfors,University of Helsinki,Harvard University,"""Awarded medal for research on covering surfac..."
1,1936,Oslo,Jesse Douglas,Massachusetts Institute of Technology,City College of New York,"""Did important work on the"
2,1950,Cambridge,Laurent Schwartz,University of Nancy,University of Paris VII,"""Developed the"
3,1950,Cambridge,Atle Selberg,Institute for Advanced Study,Institute for Advanced Study,"""Developed generalizations of the"
4,1954,Amsterdam,Kunihiko Kodaira,Princeton University,University of Tokyo,"""Achieved major results in the theory of harmo..."
5,1954,Amsterdam,Jean-Pierre Serre,University of Nancy,Collège de France,"""Achieved major results on the"
6,1958,Edinburgh,Klaus Roth,University College London,Imperial College London,"""Solved in 1955 the famous"
7,1958,Edinburgh,René Thom,University of Strasbourg,Institut des Hautes Études Scientifiques,"""In 1954 invented and developed the theory of"
8,1962,Stockholm,Lars Hörmander,University of Stockholm,Lund University,"""Worked in"
9,1962,Stockholm,John Milnor,Princeton University,Stony Brook University,"""Proved that a 7-dimensional sphere can have s..."


In [34]:
# multiprocessing pickles functions and arguments to send them to worker processes

import multiprocessing as mp
p = mp.Pool(2)                    # Pool specifies number of parallel processes

from math import cos
p.map(cos, range(10))

[1.0,
 0.5403023058681398,
 -0.4161468365471424,
 -0.9899924966004454,
 -0.6536436208636119,
 0.28366218546322625,
 0.960170286650366,
 0.7539022543433046,
 -0.14550003380861354,
 -0.9111302618846769]

In [None]:
# Objects that can’t be pickled, such as lambda functions, can’t be multiprocessed

p.map(lambda x: 2**x, range(10))
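
Whether a callable can be sent to a worker process comes down to whether `pickle` can serialize it. The check below is a small sketch (`is_picklable` is a hypothetical helper name) that tries to pickle `math.cos`, as used with the pool above, and a lambda.

```python
import math
import pickle

def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

cos_ok = is_picklable(math.cos)             # Importable by name, so it pickles
lambda_ok = is_picklable(lambda x: 2**x)    # Anonymous, so pickle cannot look it up

print(cos_ok, lambda_ok)
```

Functions are pickled by reference to their module and name, which is why an anonymous lambda fails while a named, importable function works.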

In [None]:
# However, you can multiprocess them using the dill package for serialization
# and a fork of the multiprocessing package called pathos.multiprocessing

# pip install pathos dill
import pathos.multiprocessing as mp
import dill
p = mp.Pool(2)

p.map(lambda x: 2**x, range(10))
with open('dillfile', 'wb') as outfile:    # dill can serialize the lambda itself
    dill.dump(lambda x: x**2, outfile)