# Intro to web scraping with BeautifulSoup

### 1. Intro to HTML

Before trying to scrape a web page, we have to understand what HTML is, what it looks like, and how information is displayed on a web page.

Take a look at this HTML code and execute it:

In [None]:
%%HTML
<html>  
    <head>
        <title>PyLadiesBCN HTML mini Tutorial</title>
    </head>
    <body>
        <h1>PyLadiesBCN</h1>
         <img src="https://www.pyladies.com/assets/images/pylady_geek.png" width="200" height="150">
        
        <h3> Welcome to the Data, APIs and web scraping Workshop!!</h3>
        
        <p> Visit our GitHub
        <a href="https://github.com/pyladies-bcn">repository</a>
        </p>
        
        <p> Or our twitter: 
        <a href="https://twitter.com/pyladiesbcn">here</a>
        </p>
    </body>
</html>

<div class="alert alert-block alert-info">
Jupyter Notebook understands that the code in the cell above is HTML because the **%%HTML** magic command is placed on its first line. If you want to learn more about cell magics, you can visit this <a href="https://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb">tutorial</a>. The **%lsmagic** command in the next cell lists all the available magics.
</div>

In [None]:
%lsmagic
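
To make the nesting of HTML tags more concrete, here is a minimal sketch that lists the opening tags of a trimmed-down copy of the page above, using only Python's standard library. The `TagLister` class and the `sample_html` string are illustrative names, not part of the workshop material.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser each time it meets an opening tag
        self.tags.append(tag)


# A trimmed-down copy of the page shown above (illustrative only)
sample_html = """
<html>
    <head><title>PyLadiesBCN HTML mini Tutorial</title></head>
    <body>
        <h1>PyLadiesBCN</h1>
        <p>Visit our GitHub <a href="https://github.com/pyladies-bcn">repository</a></p>
    </body>
</html>
"""

lister = TagLister()
lister.feed(sample_html)
print(lister.tags)
```

Every piece of information on the page lives inside one of these tags, which is what lets us search for it later by tag name.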

Now try visiting this [web page](https://www.icc-cricket.com/rankings/womens/team-rankings/odi) about cricket rankings. Take a look at the HTML code that generates this page by right-clicking on it and selecting **Inspect** or **Inspect this object**, depending on the web browser you are using.

### 2. Obtaining web page information

In [None]:
import requests

page = requests.get("https://www.icc-cricket.com/rankings/womens/team-rankings/odi")

In [None]:
page

<div class="alert alert-block alert-info">
`<Response [200]>` indicates that there was no problem reaching the web page and obtaining its information. To learn more about this library and its possible exceptions, please visit this <a href="http://docs.python-requests.org/en/master/api/#exceptions">guide</a>.
</div>
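
Requests do not always succeed, so it can help to check for errors explicitly. The following is a minimal sketch, not the workshop's own code: `fetch_html` is a hypothetical helper and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # RequestException is the base class of the errors raised by requests
        print(f"Request failed: {error}")
        return None
    return response.text


# An invalid URL fails before any network traffic happens
result = fetch_html("not-a-url")
print(result)
```

With a valid address, the same call would return the HTML of the page as a string.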

Requests "simulates" what your browser does when you ask it for a particular web page: it obtains all the information contained in the desired web page (the whole HTML code of the page). In this case all this information is stored, as raw bytes, in the **page.content** attribute:

In [None]:
page.content

### 3. Searching and filtering data using BeautifulSoup

But all this information in plain text is hard to manage, and it is even harder to find the information you are interested in. For this reason we use specialized libraries like **BeautifulSoup**.  
This library not only stores the web page information but also understands HTML tags and how information is stored in them. Take a look at what this library can do for us:

In [None]:
from bs4 import BeautifulSoup

content = BeautifulSoup(page.content,"html.parser")

In [None]:
content

We can now print out the HTML content of the page, nicely formatted, using the **prettify** method on the **content** object:

In [None]:
print(content.prettify())

From this point we can find the desired information in the HTML by tag type. Take a look at the following examples:

In [None]:
content.find_all('h3')

In [None]:
content.find_all('a')

In [None]:
content.find_all('img')
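
The cells above return whole tags. As a small sketch of what can be done with them, the example below runs **find_all** on an inline HTML snippet (so it works without a network connection) and reads each link's `href` attribute and text. The `sample_html` snippet and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline snippet modelled on the mini page from section 1 (illustrative only)
sample_html = """
<body>
  <h3>Welcome!</h3>
  <p>Visit our GitHub <a href="https://github.com/pyladies-bcn">repository</a></p>
  <p>Or our twitter: <a href="https://twitter.com/pyladiesbcn">here</a></p>
</body>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of Tag objects
links = sample.find_all("a")

# Attributes are read like dictionary keys; the text comes from get_text()
hrefs = [link["href"] for link in links]
labels = [link.get_text() for link in links]

print(hrefs)
print(labels)
```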

But we are going to focus on getting the information in the classification table. For this reason we have to *Inspect* the desired web page element and find its tag type and class name. In this case, the classification is contained in a **`<div class="wrapper wrapper--sticky">`** tag.

In [None]:
data = content.find_all('div', class_='wrapper wrapper--sticky')

data

In [None]:
len(data)

<div class="alert alert-block alert-info">
**len(data) = 1** indicates that only one element satisfies the **find_all** filter. If we know this in advance, we can also use the **find** method, which returns the first matching element directly.
</div>

In [None]:
data = content.find('div', class_='wrapper wrapper--sticky')
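
The difference between **find_all** and **find** matters most when nothing matches. The sketch below uses an inline snippet (the `sample_html` markup is made up for illustration) to show what each call returns.

```python
from bs4 import BeautifulSoup

# Two divs: only the first one carries the full class string we search for
sample_html = """
<div class="wrapper wrapper--sticky"><h3 class="widget__title">Rankings</h3></div>
<div class="wrapper">Other content</div>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# find_all always returns a list-like result, possibly empty
matches = sample.find_all("div", class_="wrapper wrapper--sticky")
none_found = sample.find_all("div", class_="does-not-exist")

# find returns the first matching Tag, or None when nothing matches
first = sample.find("div", class_="wrapper wrapper--sticky")
missing = sample.find("div", class_="does-not-exist")

print(len(matches), len(none_found))
print(first.get_text())
print(missing)
```

Because **find** can return `None`, chaining another call onto its result raises an `AttributeError` when the element is missing, so it is worth checking the result first on pages whose layout may change.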

Using prettify we can get a better idea of the table structure:

In [None]:
print(data.prettify())

And we can also find elements within this part of the web page's HTML code. Look at some examples:

In [None]:
data.find('h3', class_='widget__title')

<div class="alert alert-block alert-info">
If we just want the text inside the desired element, we use **.get_text()**
</div>

In [None]:
title = data.find('h3', class_='widget__title').get_text()
title

In [None]:
last_update = data.find('div', class_='rankings-table__last-updated').get_text()
last_update
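
The text returned by **.get_text()** keeps the line breaks and indentation of the original HTML. A small sketch of how `strip=True` cleans this up, using a made-up table cell similar to the ones on the rankings page:

```python
from bs4 import BeautifulSoup

# A cell with an empty span and surrounding whitespace (illustrative only)
sample_html = """
<td class="table-body__cell rankings-table__team u-text-left">
  <span class="flag-15"></span>
  Australia
</td>
"""
cell = BeautifulSoup(sample_html, "html.parser").find("td")

raw_text = cell.get_text()              # whitespace and newlines preserved
clean_text = cell.get_text(strip=True)  # each text fragment stripped

print(repr(raw_text))
print(repr(clean_text))
```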

Finally, we may only be interested in getting the table content:

In [None]:
table = data.find('table')

table

We are going to represent this information in HTML format:

In [None]:
%%HTML
<table class="table">
    <thead>
     <tr class="table-head">
      <th class="table-head__cell">
       Pos
      </th>
      <th class="table-head__cell rankings-table__team u-text-left">
       Team
      </th>
      <th class="table-head__cell">
       Weighted Matches
      </th>
      <th class="table-head__cell">
       Points
      </th>
      <th class="table-head__cell u-text-right rating">
       Rating
      </th>
     </tr>
    </thead>
    <tbody>
     <tr class="table-body rankings-table__hero" data-team-id="15">
      <td class="table-body__cell table-body__cell--position">
       1
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo AUS u-show-phablet u-hide-mobile">
       </span>
       <span class="flag-30 table-body_logo AUS u-hide-phablet">
       </span>
       Australia
      </td>
      <td class="table-body__cell">
       23
      </td>
      <td class="table-body__cell">
       3,275
      </td>
      <td class="table-body__cell u-text-right rating">
       142
      </td>
     </tr>
     <tr class="table-body" data-team-id="14">
      <td class="table-body__cell table-body__cell--position">
       2
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo IND u-hide-mobile">
       </span>
       India
      </td>
      <td class="table-body__cell">
       31
      </td>
      <td class="table-body__cell">
       3,788
      </td>
      <td class="table-body__cell u-text-right rating">
       122
      </td>
     </tr>
     <tr class="table-body" data-team-id="11">
      <td class="table-body__cell table-body__cell--position">
       3
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo ENG u-hide-mobile">
       </span>
       England
      </td>
      <td class="table-body__cell">
       25
      </td>
      <td class="table-body__cell">
       3,033
      </td>
      <td class="table-body__cell u-text-right rating">
       121
      </td>
     </tr>
     <tr class="table-body" data-team-id="16">
      <td class="table-body__cell table-body__cell--position">
       4
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo NZ u-hide-mobile">
       </span>
       New Zealand
      </td>
      <td class="table-body__cell">
       31
      </td>
      <td class="table-body__cell">
       3,529
      </td>
      <td class="table-body__cell u-text-right rating">
       114
      </td>
     </tr>
     <tr class="table-body" data-team-id="19">
      <td class="table-body__cell table-body__cell--position">
       5
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo SA u-hide-mobile">
       </span>
       South Africa
      </td>
      <td class="table-body__cell">
       39
      </td>
      <td class="table-body__cell">
       3,864
      </td>
      <td class="table-body__cell u-text-right rating">
       99
      </td>
     </tr>
     <tr class="table-body" data-team-id="21">
      <td class="table-body__cell table-body__cell--position">
       6
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo WI u-hide-mobile">
       </span>
       West Indies
      </td>
      <td class="table-body__cell">
       22
      </td>
      <td class="table-body__cell">
       1,921
      </td>
      <td class="table-body__cell u-text-right rating">
       87
      </td>
     </tr>
     <tr class="table-body" data-team-id="20">
      <td class="table-body__cell table-body__cell--position">
       7
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo PAK u-hide-mobile">
       </span>
       Pakistan
      </td>
      <td class="table-body__cell">
       26
      </td>
      <td class="table-body__cell">
       1,978
      </td>
      <td class="table-body__cell u-text-right rating">
       76
      </td>
     </tr>
     <tr class="table-body" data-team-id="13">
      <td class="table-body__cell table-body__cell--position">
       8
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo SL u-hide-mobile">
       </span>
       Sri Lanka
      </td>
      <td class="table-body__cell">
       26
      </td>
      <td class="table-body__cell">
       1,478
      </td>
      <td class="table-body__cell u-text-right rating">
       57
      </td>
     </tr>
     <tr class="table-body" data-team-id="22">
      <td class="table-body__cell table-body__cell--position">
       9
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo BAN u-hide-mobile">
       </span>
       Bangladesh
      </td>
      <td class="table-body__cell">
       13
      </td>
      <td class="table-body__cell">
       632
      </td>
      <td class="table-body__cell u-text-right rating">
       49
      </td>
     </tr>
     <tr class="table-body" data-team-id="12">
      <td class="table-body__cell table-body__cell--position">
       10
      </td>
      <td class="table-body__cell rankings-table__team u-text-left">
       <span class="flag-15 table-body_logo IRE u-hide-mobile">
       </span>
       Ireland
      </td>
      <td class="table-body__cell">
       10
      </td>
      <td class="table-body__cell">
       211
      </td>
      <td class="table-body__cell u-text-right rating">
       21
      </td>
     </tr>
    </tbody>
   </table>

In [None]:
data.find('thead')

In [None]:
header = [item.get_text(strip=True) for item in data.find('thead').find_all('th')]

In [None]:
header

In [None]:
header =[]

for item in data.find('thead').find_all('th'):
    header.append(item.get_text(strip=True))
    
header

In [None]:
table_content =[]

for tablerow in data.find('tbody').find_all('tr'):

    table_row =[]
    
    for tableitem in tablerow.find_all('td'):
        table_row.append(tableitem.get_text(strip=True))
        
    table_content.append(table_row)
    
table_content
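
The header and row loops above can be wrapped into one reusable function. This is only a sketch under the assumption that the table has a `<thead>` and a `<tbody>`, as the rankings table does; `parse_table` and the sample markup are hypothetical names for illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in for the rankings table (illustrative data layout)
sample_html = """
<table class="table">
  <thead>
    <tr><th>Pos</th><th>Team</th><th>Points</th></tr>
  </thead>
  <tbody>
    <tr><td> 1 </td><td> Australia </td><td> 3,275 </td></tr>
    <tr><td> 2 </td><td> India </td><td> 3,788 </td></tr>
  </tbody>
</table>
"""


def parse_table(table):
    """Return (header, rows) as lists of stripped strings."""
    header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return header, rows


sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_header, sample_rows = parse_table(sample_table)

print(sample_header)
print(sample_rows)
```

All values come back as strings, including the numbers, which is why a conversion step is still needed before doing any arithmetic.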

### 4. Using Pandas to make information usable

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(table_content,columns=header)

In [None]:
df

In [None]:
df["Points"] = df["Points"].str.replace(",","").astype(int)
df
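
The cell above removes the thousands separator before converting the column to integers. Here is the same idea on a tiny hand-made DataFrame (the `sample_df` name is illustrative), together with `pd.to_numeric` as an alternative for columns that have no separators.

```python
import pandas as pd

# Scraped values arrive as strings, even when they look like numbers
sample_df = pd.DataFrame(
    [["1", "Australia", "3,275"], ["2", "India", "3,788"]],
    columns=["Pos", "Team", "Points"],
)

# Remove the thousands separator, then convert to integers
sample_df["Points"] = sample_df["Points"].str.replace(",", "", regex=False).astype(int)

# Columns without separators can be converted directly
sample_df["Pos"] = pd.to_numeric(sample_df["Pos"])

print(sample_df.dtypes)
print(sample_df["Points"].mean())
```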

In [None]:
df["Points"].mean()

In [None]:
df["Points"].min()

In [None]:
df.to_csv(title.strip() + '.csv', index=False, sep=',')

### 5. Pandas SUPERPOWERS in structured data

In [None]:
df2 = pd.read_html("https://www.icc-cricket.com/rankings/womens/team-rankings/odi",attrs = {'class': 'table'})
df2

In [None]:
df2[0]

In [None]:
df2 = df2[0]

Pandas automatically recognises HTML tables and assigns them the proper types, so we can work with them directly.
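
To try this without a network connection, **read_html** can also parse an HTML string wrapped in `StringIO`. The sketch below uses a made-up table and assumes an HTML parser that pandas supports (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Small stand-in for the rankings table (illustrative only)
sample_html = """
<table class="table">
  <thead>
    <tr><th>Pos</th><th>Team</th><th>Points</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Australia</td><td>3,275</td></tr>
    <tr><td>2</td><td>India</td><td>3,788</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per matching table
sample_tables = pd.read_html(StringIO(sample_html), attrs={"class": "table"})
sample_df2 = sample_tables[0]

print(sample_df2)
print(sample_df2.dtypes)
```

By default **read_html** treats `,` as a thousands separator, so the `Points` column is already numeric and no manual cleaning is needed.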

In [None]:
df2

In [None]:
df2['Team']

In [None]:
len(df2['Team'][0])

In [None]:
df2['Weighted Matches']

In [None]:
df2['Points']

In [None]:
df2['Rating']

In [None]:
df2['Points'].mean()

In [None]:
df2['Points'].min()

In [None]:
df2['Points'].max()

In [None]:
df2.to_csv(title.strip() + '_pandas.csv', index=False, sep=',')