----
# Web Scraping with Python 
----
## Session 1 - Hands On Part


#### Dr. Zeyad Elkelani 

----
## Introduction:
----
How to get data from the Internet?
</br> 

Mainly there are two ways to extract data from a given website:
1. **API (Application Programming Interface)**
    - this method retrieves data through a standardized interface designed by the website's developers. 
2. **Web Scraping/Harvesting**
    - this method reads the HTML source of a webpage and extracts the relevant information/data from it.


**In this particular workshop, we are going to focus on the second method of web data extraction or "Web Scraping". In future sessions, we will talk about "Web Spiders/Crawlers" which is related to Web Scraping.**


----
## Learning Objectives for this sequence: 
----
- **Session One:** 
  - Understanding Robots.txt and HTTP requests. 
  - Understanding basic components of a webpage and HTML. 
  - Get familiar with Pandas Module. 
  - Parsing html string into Pandas.
  - Parse URL class into Pandas.
  - Parse Tables from Wikipedia into Pandas. 
  - Parse non-Wikipedia Tables into Pandas. 
  - Parse Wiki InfoBoxes.
  - Write html parsed tables into flat csv.
  
</br>

- **Future Sessions:**
  - Advanced understanding of HTML parsing using tagging and CSS selection.
  - Parsing multiple URLs at the same time. 
  - Loop through multiple webpages and extract data (web crawling).
  - Building Web Spiders (Automated Scraping).
  - Use advanced open-source web spider modules (Scrapy). 
  - Use the Scrapy shell.
  
</br> 

**By the end of these workshops, you should be able to build your own scraper/crawler and schedule your scraping jobs.** 
    


----
## Agenda for Session One
----
- Setting up the working environment
- What is the Robots Exclusion Standard?
- How does Web Scraping work? (High-Level) 
- What is inside a Web Page?
- Parsing HTML tables 
- Parsing Wiki Tables with Pandas
- Parsing Wiki Infoboxes
- Parsing non-Wikipedia Tables
- Write HTML tables to a CSV document 

----
## Let's Setup our working environment: 
----
### In this workshop, we will use Python 3.7 to execute commands. 
  - We strongly recommend using **Google Colab** for this session.
  - Use the IPython notebook file sent by QCL prior to the start of the workshop.
  - **At the end of this session, please send your completed IPython notebook to qcl@cmc.edu as part of the QCL Digital Badges Program**

</br>

### Installing Packages in COLAB
One of the good things about Google Colab is that it comes pre-packaged with several scientific computing Python modules such as Numpy, Pandas, and Tensorflow. If you need a package that is not available, you can install it by running a shell command such as `!pip install [Package_Name]`


### Jupyter Applications 

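- **Make sure the following packages are available (`io` is part of Python's standard library, so only the other three need installing):**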
```
io
pandas
requests
IPython
```
- **Installing Packages from Shell for Project Jupyter:** 
    - `conda install [Package_name]`
    - `pip install [package_name]`

- In case you want to upgrade/update **pip** itself:
    - Check the pip version in your command prompt: `pip --version`
    - Then update pip: `pip install --upgrade pip`

- In case you want to upgrade/update a package system-wide:
    - `sudo pip install [package_name] --upgrade`
    - `sudo` will ask for your password to confirm the action (administrator rights are required).

- In case you do not have administrator rights, use a virtual environment and, inside it, run:
    - `pip install [package_name] --upgrade`
    
- To install packages from inside a running Jupyter kernel, we recommend the following in a cell of the .ipynb file (**shell-like install**):
    - `import sys` 
    - `!conda install --yes --prefix {sys.prefix} <package_name>`
    
    OR
    
    - `import sys`
    - `!{sys.executable} -m pip install <package_name>`


----
## Let's Get Started:
----
- Import packages into the current Jupyter Kernel:



In [0]:
import io
import pandas
import IPython

## What is Robots Exclusion Standard? 

In [0]:
# Check cmc.edu 
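
The Robots Exclusion Standard is a plain-text file, `robots.txt`, placed at the root of a site to tell automated clients which paths they may visit. As a minimal sketch, Python's standard-library `urllib.robotparser` can evaluate such rules. The rules and the `example.com` URLs below are invented for illustration and are not the real file of any website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only;
# it is NOT the real file served by any website.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network request is made here.
parser.parse(sample_robots_txt.splitlines())

# Ask whether a generic client ("*") may fetch each URL.
allowed_public = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")

print("public page allowed: ", allowed_public)
print("private page allowed:", allowed_private)
```

To check a live site instead, the same class offers `set_url(...)` followed by `read()`, for example with `https://www.cmc.edu/robots.txt`.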

----
## How does Web Scraping work? (High-Level Overview)
----
1. The process starts by sending a request to the URL of the page of interest. **If the HTTP request is allowed**, the server responds by returning the HTML content of the webpage. In this workshop, this task is done with a third-party Python library.  

2. After retrieving the HTML content, the next step is to parse the nested content of the HTML source code. For this, we use third-party Python libraries that build a nested/tree structure from the HTML data.

3. There are many HTML parser libraries available; `html5lib` is a widely used one that follows the same parsing rules as web browsers.

4. To walk through the resulting parse tree, we use another third-party Python library that renders the content in a given data type (here, a Pandas DataFrame). 
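
The steps above can be sketched without any third-party library. This is only an illustration under stated assumptions: the standard-library `html.parser` stands in for the parser used in the workshop, the HTML string stands in for a server response, and `TableCellParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being built
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Step 1 (stand-in): the HTML text a server would return.
html_text = (
    "<table>"
    "<tr><th>Language</th><th>Year</th></tr>"
    "<tr><td>Python</td><td>1989</td></tr>"
    "</table>"
)

# Steps 2-3: feed the markup to the parser, which walks the nested tags.
parser = TableCellParser()
parser.feed(html_text)

# Step 4: the content is now an ordinary Python data type (a list of lists).
rows = parser.rows
print(rows)
```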

----
## What is inside a Web Page? 
----
When you visit a web page, your web browser sends a request to a web server. This request is known as a `GET` request. The server then sends back files that tell the browser how to render the webpage. These files generally fall into the following types:
- `HTML`: contains the main content of the page.
- `CSS`: adds styling to make the page look nicer.
- `JS`: JavaScript files add interactivity to web pages.
- `Images`: image formats such as JPG and PNG allow web pages to show pictures.

After the browser receives all the files, it renders the page as they specify and displays it to us. 
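
To make the idea of a `GET` request concrete, here is a small sketch that builds a request object with the standard-library `urllib.request` and inspects it without sending anything over the network. The `User-Agent` value is a made-up client name used only for illustration.

```python
from urllib.request import Request

# Build (but do not send) a request for a page.
request_obj = Request(
    "https://www.cmc.edu/",
    headers={"User-Agent": "workshop-example/0.1"},  # hypothetical client name
)

# With no body attached, the request method defaults to GET.
method = request_obj.get_method()
host = request_obj.host

print(method, host)
```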

----
## What is HTML?
----
**HyperText Markup Language (HTML)** is the language that web pages are written in. It is not a programming language; rather, it is a markup language that instructs a browser on how to lay out content. It does the same job a word processor like Microsoft Word does when making text bold, creating paragraphs, inserting images, etc.

Let's look at the HTML element below:

In [0]:
basic_html = """
<html>
<head>
basic html
</head>
</html>
"""

In [0]:
# Use a function to display the HTML stored above as a string.
# We use the IPython module here: 
from IPython.display import display_html 

display_html(basic_html, raw = True)

# raw=True: the string is already raw HTML, so it is displayed as-is

In [0]:
simple_html = """
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>
"""

In [0]:
# What is different here? 
display_html(simple_html, raw = True)

In [0]:
html_with_link = """
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.cmc.edu/">Claremont McKenna College</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.cmc.edu/qcl">Murty Sunak Quantitative and Computing Lab</a> </p>
</body></html>
"""

In [0]:
# What did we add here? 
display_html(html_with_link, raw = True)

## Parsing HTML Tables: 

- In this section we will use the Pandas module.
    - Let's look at `read_html` (imported from `pandas.io.html`) and `pandas.read_html`, two names for the same function.

- `pandas.read_html` is a function that accepts a URL, a file-like object, or a raw string containing HTML, and returns a list of DataFrames, one per table found.

- **Let's start by passing a raw html string:**

In [0]:
html_sample_string = """
<table>
  <thead>
    <tr>
      <th>Programming Language</th>
      <th>Creator</th> 
      <th>Year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C</td>
      <td>Dennis Ritchie</td> 
      <td>1972</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>Guido Van Rossum</td> 
      <td>1989</td>
    </tr>
    <tr>
      <td>Ruby</td>
      <td>Yukihiro Matsumoto</td> 
      <td>1995</td>
    </tr>
  </tbody>
</table>
"""

In [0]:
# What is </thead>?
# What is <tr> and <td>
display_html(html_sample_string, raw = True)


Programming Language,Creator,Year
C,Dennis Ritchie,1972
Python,Guido Van Rossum,1989
Ruby,Yukihiro Matsumoto,1995


Let's use `read_html`, importing Pandas under its usual alias `pd`:

In [0]:
import pandas as pd
l = pd.read_html(html_sample_string)
#l
# What is df?
type(l)
len(l)

1

In [0]:
l
df1 = l[0] 
df1

Unnamed: 0,Programming Language,Creator,Year
0,C,Dennis Ritchie,1972
1,Python,Guido Van Rossum,1989
2,Ruby,Yukihiro Matsumoto,1995
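
Newer pandas releases (2.1 and later) deprecate passing a literal HTML string straight to `read_html` and ask for a file-like object instead. A minimal sketch of that variant, wrapping the string in `io.StringIO`, is shown below. It assumes an HTML parser backend (`lxml`, or `bs4` with `html5lib`) is installed, and the table is a shortened version of the sample above.

```python
import io

import pandas as pd

# Shortened version of the sample table used in this section.
html_text = """
<table>
  <thead><tr><th>Language</th><th>Year</th></tr></thead>
  <tbody>
    <tr><td>C</td><td>1972</td></tr>
    <tr><td>Ruby</td><td>1995</td></tr>
  </tbody>
</table>
"""

# StringIO turns the string into a file-like object that read_html accepts.
tables = pd.read_html(io.StringIO(html_text))

# read_html always returns a list, even when only one table is found.
languages = tables[0]
print(languages)
```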


In [0]:
# Check this URL for more information: https://python.swaroopch.com/data_structures.html

In [0]:
# Which language came after 1990?
df1[df1.Year > 1990]

Unnamed: 0,Programming Language,Creator,Year
2,Ruby,Yukihiro Matsumoto,1995


- What if `pd.read_html` cannot locate the header of the HTML table, or the header is missing? 

In [0]:
html_string_other = """
<table>
  <tr>
    <td>Programming Language</td>
    <td>Creator</td> 
    <td>Year</td>
  </tr>
  <tr>
    <td>C</td>
    <td>Dennis Ritchie</td> 
    <td>1972</td>
  </tr>
  <tr>
    <td>Python</td>
    <td>Guido Van Rossum</td> 
    <td>1989</td>
  </tr>
  <tr>
    <td>Ruby</td>
    <td>Yukihiro Matsumoto</td> 
    <td>1995</td>
  </tr>
</table>
"""

In [0]:
# What is different in this table? What is missing? 

In [0]:
pd.read_html(html_string_other)

[                      0                   1     2
 0  Programming Language             Creator  Year
 1                     C      Dennis Ritchie  1972
 2                Python    Guido Van Rossum  1989
 3                  Ruby  Yukihiro Matsumoto  1995]

In [0]:
pd.read_html(html_string_other)
what_is_this = pd.read_html(html_string_other, header = 0)
type(what_is_this)

df_only = pd.read_html(html_string_other, header = 0)[0]
type(df_only)
df_only

Unnamed: 0,Programming Language,Creator,Year
0,C,Dennis Ritchie,1972
1,Python,Guido Van Rossum,1989
2,Ruby,Yukihiro Matsumoto,1995
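
Passing `header = 0` tells the parser to treat the first row as column names. For intuition, roughly the same result can be reached by hand on a frame that was read without a header. The sketch below builds such a frame directly (a shortened stand-in, not the output of the cell above), so it needs only pandas.

```python
import pandas as pd

# Same shape as a headerless parse: integer column labels,
# with the real header sitting in row 0 and every cell read as text.
raw = pd.DataFrame([
    ["Programming Language", "Year"],
    ["C", "1972"],
    ["Ruby", "1995"],
])

fixed = raw.iloc[1:].reset_index(drop=True)  # drop the header row from the data
fixed.columns = raw.iloc[0].tolist()         # use that row as the column labels
fixed["Year"] = fixed["Year"].astype(int)    # convert the text years to integers

print(fixed)
```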


Let's Read a table from an existing webpage:

In [0]:
# http://www.contextures.com/xlSampleData01.html

In [0]:
# Let's return a list and df from this website? 
df = pd.read_html('http://www.contextures.com/xlSampleData01.html', header = 0)[0]
df

Unnamed: 0,OrderDate,Region,Rep,Item,Units,UnitCost,Total
0,1/6/2018,East,Jones,Pencil,95,1.99,189.05
1,1/23/2018,Central,Kivell,Binder,50,19.99,999.5
2,2/9/2018,Central,Jardine,Pencil,36,4.99,179.64
3,2/26/2018,Central,Gill,Pen,27,19.99,539.73
4,3/15/2018,West,Sorvino,Pencil,56,2.99,167.44
5,4/1/2018,East,Jones,Binder,60,4.99,299.4
6,4/18/2018,Central,Andrews,Pencil,75,1.99,149.25
7,5/5/2018,Central,Jardine,Pencil,90,4.99,449.1
8,5/22/2018,West,Thompson,Pencil,32,1.99,63.68
9,6/8/2018,East,Jones,Binder,60,8.99,539.4


In [0]:
df.head(11)
df.describe()
df.dtypes

OrderDate     object
Region        object
Rep           object
Item          object
Units          int64
UnitCost     float64
Total        float64
dtype: object

In [0]:
# Let's take a look at the first 10 rows? 

In [0]:
# Let's do some stats? 

In [0]:
#df.describe()

In [0]:
# Pandas Data Types
#df.dtypes

In [0]:
# What is the average cost per region?

In [0]:
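# numeric_only=True skips text columns such as Rep and Item,
# which newer pandas versions refuse to average
df.groupby('Region').mean(numeric_only=True)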


Region,Units,UnitCost,Total
Central,49.958333,18.01875,464.127917
East,53.153846,9.143846,461.699231
West,38.5,53.658333,414.453333


In [0]:
# You will learn more about DataFrames (and Pandas, Numpy, and Scipy in general) in other workshops such as ML, coding, and Python 

In [0]:
# Break 5 mins

##  Parsing Wiki Tables with Pandas
- `read_html documentation`
- We will introduce another function that will help us in reading Wiki Tables: 

- In this case, we pass a specific Wikipedia page to a parser function built on top of a third-party Python module, such as the `read_html` / `pandas.read_html` function from Pandas. 
- When passing the URL to the parser, we identify the exact table class to be parsed.
- On Wikipedia pages, the table class is called "**wikitable**".
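
The effect of selecting by class can be tried offline first. In this sketch the page is a made-up two-table HTML string (the `navbox` class and its content are invented; the Nigeria figure is the one shown later in this notebook), and the `attrs` argument of `read_html` keeps only the table whose `class` attribute is `wikitable`. It assumes a parser backend such as `lxml` is installed.

```python
import io

import pandas as pd

# Made-up page with two tables; only the second has class "wikitable".
page_html = """
<table class="navbox">
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Nigeria</td><td>193392517</td></tr>
</table>
"""

# Without attrs, every table on the page is returned.
all_tables = pd.read_html(io.StringIO(page_html))

# With attrs, only tables whose attributes match are returned.
wiki_only = pd.read_html(io.StringIO(page_html), attrs={"class": "wikitable"})

print(len(all_tables), len(wiki_only))
print(wiki_only[0])
```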

In [0]:
# How to locate Wikitable Class in a Wikipedia Webpage?

In [0]:
#https://en.wikipedia.org/wiki/List_of_African_countries_by_population

In [0]:
# Let's import our function
from pandas.io.html import read_html 
# Define Webpage:
page = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
# Pass arguments: 
wikitable = read_html(page, attrs={"class" : "wikitable"})

In [0]:
len(wikitable)
print ("I got this number {num} of tables from Wikipedia Page".format(num=len(wikitable)))

I got this number 1 of tables from Wikipedia Page


In [0]:
type(wikitable)
wikitable 
df_wikitable = wikitable[0]
#df_wikitable

Unnamed: 0,Country(or dependent territory),Officialfigure(whereavailable),Date oflast figure,Source
0,Nigeria,193392517,2016,Official estimate
1,Ethiopia,99391000,2015,Official estimate
2,Egypt,96983083,"March 31, 2019",Official population clock
3,Democratic Republic of the Congo,86026000,"July 1, 2015",Official estimate
4,South Africa,54956900,"July 1, 2015",Official estimate
5,Tanzania,51046000,2015,Official estimate
6,Kenya,45533000,2015,Official estimate
7,Sudan,40235000,2015,Official estimate
8,Algeria,40100000,"January 1, 2016",Official estimate
9,Uganda,34856813,"August 28, 2014",Preliminary 2014 census result


In [0]:
# How many tables did we get? 


In [0]:
#wikitable_df = wikitable[0]
#wikitable_df

In [0]:
# Let's use pd.read_html: 

In [0]:
#https://en.wikipedia.org/wiki/Timeline_of_programming_languages

In [0]:
list_wiki = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages')
len(list_wiki)

13

In [0]:
# Let's look at the length of our dfs list? 

In [0]:
#type()

In [0]:
list_wiki[12]

Unnamed: 0,vteProgramming languages,vteProgramming languages.1
0,Comparison Timeline History,Comparison Timeline History
1,APL Assembly BASIC C C++ C# COBOL Elixir Fortr...,APL Assembly BASIC C C++ C# COBOL Elixir Fortr...
2,Category Lists Alphabetical Categorical Gener...,Category Lists Alphabetical Categorical Gener...


In [0]:
#type(list_wiki[11])

In [0]:
# Let's merge them all together by row:
merged_long_df = pd.concat(list_wiki[4:12])
#merged_long_df

Unnamed: 0,Year,Name,"Chief developer, company",Predecessor(s)
0,1804,Jacquard Loom,Joseph Marie Jacquard,none (unique language)
1,1943–45,Plankalkül (concept),Konrad Zuse,none (unique language)
2,1943–46,ENIAC coding system,"John von Neumann, John Mauchly, J. Presper Eck...",none (unique language)
3,1946,ENIAC Short Code,"Richard Clippinger, John von Neumann after Ala...",ENIAC coding system
4,1946,Von Neumann and Goldstine graphing system (Not...,John von Neumann and Herman Goldstine,ENIAC coding system
5,1947,ARC Assembly,Kathleen Booth[1][2],ENIAC coding system
6,1948,CPC Coding scheme,Howard H. Aiken,Analytical Engine order code
7,1948,Curry notation system,Haskell Curry,ENIAC coding system
8,1948,Plankalkül (concept published),Konrad Zuse,none (unique language)
9,1949,Short Code,John Mauchly and William F. Schmitt,ENIAC Short Code


In [0]:
# Remove repeated header rows (rows whose Year cell is the literal text 'Year')
merged_long_df = merged_long_df[merged_long_df.Year != 'Year']
merged_long_df

Unnamed: 0,Year,Name,"Chief developer, company",Predecessor(s)
0,1804,Jacquard Loom,Joseph Marie Jacquard,none (unique language)
1,1943–45,Plankalkül (concept),Konrad Zuse,none (unique language)
2,1943–46,ENIAC coding system,"John von Neumann, John Mauchly, J. Presper Eck...",none (unique language)
3,1946,ENIAC Short Code,"Richard Clippinger, John von Neumann after Ala...",ENIAC coding system
4,1946,Von Neumann and Goldstine graphing system (Not...,John von Neumann and Herman Goldstine,ENIAC coding system
5,1947,ARC Assembly,Kathleen Booth[1][2],ENIAC coding system
6,1948,CPC Coding scheme,Howard H. Aiken,Analytical Engine order code
7,1948,Curry notation system,Haskell Curry,ENIAC coding system
8,1948,Plankalkül (concept published),Konrad Zuse,none (unique language)
9,1949,Short Code,John Mauchly and William F. Schmitt,ENIAC Short Code
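
The merge-then-filter pattern used above can be seen on a tiny example. The two frames below are made up for illustration (they are not the Wikipedia tables); the second one carries a stray header row like the ones being filtered out.

```python
import pandas as pd

# Two small made-up tables; the second carries a repeated header row.
part_a = pd.DataFrame({"Year": ["1972", "1989"], "Name": ["C", "Python"]})
part_b = pd.DataFrame({"Year": ["Year", "1995"], "Name": ["Name", "Ruby"]})

# concat stacks the frames row-wise by default (axis=0).
stacked = pd.concat([part_a, part_b])

# Keep only rows whose Year cell is not the literal text "Year".
cleaned = stacked[stacked["Year"] != "Year"]

# Optional: renumber the rows 0..n-1 after stacking and filtering.
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

Note that without `reset_index`, each row keeps the index it had in its original table, which is why index values can repeat in the merged frame.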


In [0]:
# In what year was Julia created?

In [0]:
merged_long_df[merged_long_df.Name == "Julia"].Year

12    2012
Name: Year, dtype: object

In [0]:
# Can we read tables in different languages?
# How about encoding issues? 

- Another way to get HTML elements:
    - Send Request to Wikipedia Server and get HTML elements
    - Read HTML document with Pandas. 

In [0]:
'https://en.wikipedia.org/wiki/Python_(programming_language)'

In [0]:
# here we import requests package: 
import requests 
request = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
#request.text
#type(request.text)
#type(request)

In [0]:
type(request)

requests.models.Response

In [None]:
len(request.text)

392286

In [0]:
list_new = pd.read_html(request.text)
len(list_new)

8

In [0]:
# Try this table?
# https://en.wikipedia.org/wiki/List_of_footballers_with_the_most_official_appearances

In [0]:
# Which playing position appears most often in this list? 

In [0]:
# Task 10 - 15 mins

### Parse Wiki Infoboxes
- What if you are interested in a Wiki infobox rather than a table?
    - You can identify the infobox (itself an HTML table) on a specific wiki page and pass it through the `read_html` function

In [0]:
# page = 'https://en.wikipedia.org/wiki/Claremont,_California'



In [0]:
# len(infobox)

In [0]:
# infobox[0]

In [0]:
# infobox[0][9:26]

## Parsing non-Wikipedia Tables

In [0]:
# Let's get some financial data, say Dow Jones big 30 stock market prices: 
# https://money.cnn.com/data/dow30/

In [0]:
page = 'https://money.cnn.com/data/dow30/'
dowjones_list = read_html(page, attrs={"class" : "wsod_dataTable wsod_dataTableBig"})

len(dowjones_list)

1

In [0]:
dow30 = dowjones_list[0]

file_name = './dow30.csv'
dow30.to_csv(file_name, sep=',', encoding='utf-8')

In [0]:
# Let's try this website: 
from pandas.io.html import read_html
page = 'https://www.wunderground.com/weather/us/ca/claremont/91711'
# call table class:  
table = read_html(page, index_col=0, attrs={"class":"mat-table"})

## Write html tables to csv/excel

In [0]:
# Find your csv? View it 
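
One way to check what was written is to read the CSV back. The sketch below uses a made-up two-row table and an in-memory buffer in place of a file on disk, so it runs anywhere; passing a path such as `./dow30.csv` works the same way. `index=False` is an optional choice that leaves the row numbers out of the file.

```python
import io

import pandas as pd

# Made-up stand-in for a scraped table.
scraped = pd.DataFrame({"Company": ["A Corp", "B Inc"], "Price": [10.5, 20.25]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
scraped.to_csv(buffer, index=False)  # index=False omits the row-number column

# Rewind the buffer and read the CSV back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(buffer.getvalue())
print(restored)
```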