
In [1]:
!pip install lxml
import pandas as pd



### Parsing raw HTML strings

Another useful pandas function is read_html(). It reads HTML tables from a URL, a file-like object, or a raw string containing HTML, and returns a list of DataFrame objects, one per table found.

Let's try to read the following html_string into a DataFrame.

In [2]:
html_string = """
<table>
    <thead>
      <tr>
        <th>Order date</th>
        <th>Region</th>
        <th>Item</th>
        <th>Units</th>
        <th>Unit cost</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>1/6/2018</td>
        <td>East</td>
        <td>Pencil</td>
        <td>95</td>
        <td>1.99</td>
      </tr>
      <tr>
        <td>1/23/2018</td>
        <td>Central</td>
        <td>Binder</td>
        <td>50</td>
        <td>19.99</td>
      </tr>
      <tr>
        <td>2/9/2018</td>
        <td>Central</td>
        <td>Pencil</td>
        <td>36</td>
        <td>4.99</td>
      </tr>
      <tr>
        <td>3/15/2018</td>
        <td>West</td>
        <td>Pen</td>
        <td>27</td>
        <td>19.99</td>
      </tr>
    </tbody>
</table>
"""

In [5]:
dfs = pd.read_html(html_string)
len(dfs)

1
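Note that recent pandas releases (2.1 and later) deprecate passing a literal HTML string to read_html and emit a FutureWarning when you do. A minimal sketch of the recommended form, assuming a shortened stand-in table (`small_html` and `small_df` are hypothetical names, not part of this notebook), wraps the string in io.StringIO:

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for html_string: same header, first two rows only.
small_html = """
<table>
  <thead>
    <tr><th>Order date</th><th>Region</th><th>Item</th><th>Units</th><th>Unit cost</th></tr>
  </thead>
  <tbody>
    <tr><td>1/6/2018</td><td>East</td><td>Pencil</td><td>95</td><td>1.99</td></tr>
    <tr><td>1/23/2018</td><td>Central</td><td>Binder</td><td>50</td><td>19.99</td></tr>
  </tbody>
</table>
"""

# StringIO makes the string behave like a file object, which read_html
# accepts without the deprecation warning.
tables = pd.read_html(StringIO(small_html))
small_df = tables[0]
print(len(tables), small_df.shape)
```

The same wrapping should apply to any HTML held in a string, including text downloaded with another library.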

In [7]:
df = dfs[0]
df

Unnamed: 0,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
3,3/15/2018,West,Pen,27,19.99


In [8]:
df.shape

(4, 5)

In [9]:
df.loc[df['Region'] == 'Central']

Unnamed: 0,Order date,Region,Item,Units,Unit cost
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99


In [10]:
df.loc[df['Units']>35]

Unnamed: 0,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
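
The two filters above can also be combined into a single selection. A minimal sketch, assuming a hand-built copy of the parsed table (the `orders` name is hypothetical) so that it runs without an HTML parser:

```python
import pandas as pd

# Hand-built copy of the four rows parsed earlier.
orders = pd.DataFrame({
    "Order date": ["1/6/2018", "1/23/2018", "2/9/2018", "3/15/2018"],
    "Region": ["East", "Central", "Central", "West"],
    "Item": ["Pencil", "Binder", "Pencil", "Pen"],
    "Units": [95, 50, 36, 27],
    "Unit cost": [1.99, 19.99, 4.99, 19.99],
})

# Each condition is a boolean Series. Wrap each one in parentheses and join
# them with & (and) or | (or); Python's plain `and`/`or` do not work here.
central_big = orders.loc[(orders["Region"] == "Central") & (orders["Units"] > 35)]
print(central_big)
```

With these rows, the selection keeps the Binder and Pencil orders from the Central region.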


### Defining header

Pandas finds the header row automatically when the table marks it with a `<thead>` tag, as in the example above.

In many cases, though, we'll find incomplete or malformed tables that lack this markup, so read_html parses them without the proper headers.

To fix that, we can use the header parameter to say which row holds the column names.

In [12]:
html_string = """
<table>
  <tr>
    <td>Order date</td>
    <td>Region</td>
    <td>Item</td>
    <td>Units</td>
    <td>Unit cost</td>
  </tr>
  <tr>
    <td>1/6/2018</td>
    <td>East</td>
    <td>Pencil</td>
    <td>95</td>
    <td>1.99</td>
  </tr>
  <tr>
    <td>1/23/2018</td>
    <td>Central</td>
    <td>Binder</td>
    <td>50</td>
    <td>19.99</td>
  </tr>
  <tr>
    <td>2/9/2018</td>
    <td>Central</td>
    <td>Pencil</td>
    <td>36</td>
    <td>4.99</td>
  </tr>
  <tr>
    <td>3/15/2018</td>
    <td>West</td>
    <td>Pen</td>
    <td>27</td>
    <td>19.99</td>
  </tr>
</table>
"""

In [13]:
pd.read_html(html_string)[0]

Unnamed: 0,0,1,2,3,4
0,Order date,Region,Item,Units,Unit cost
1,1/6/2018,East,Pencil,95,1.99
2,1/23/2018,Central,Binder,50,19.99
3,2/9/2018,Central,Pencil,36,4.99
4,3/15/2018,West,Pen,27,19.99


In [15]:
pd.read_html(html_string,
             header=0)[0]

Unnamed: 0,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
3,3/15/2018,West,Pen,27,19.99
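
If a table has already been parsed without a header, the first row can also be promoted afterwards. A rough sketch, assuming a hand-built stand-in for the headerless result (the `raw` and `fixed` names are hypothetical): because the header text was read as data, every column holds strings, so the numeric columns need an explicit conversion.

```python
import pandas as pd

# Hand-built stand-in for a headerless parse: the header text sits in row 0
# and every value is a string.
raw = pd.DataFrame([
    ["Order date", "Region", "Item", "Units", "Unit cost"],
    ["1/6/2018", "East", "Pencil", "95", "1.99"],
    ["1/23/2018", "Central", "Binder", "50", "19.99"],
])

# Drop the header row from the data, then reuse its values as column names.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = raw.iloc[0].tolist()

# Convert the numeric columns from strings to numbers.
fixed["Units"] = pd.to_numeric(fixed["Units"])
fixed["Unit cost"] = pd.to_numeric(fixed["Unit cost"])
print(fixed.dtypes)
```

Passing header=0 at parse time avoids this extra work, which is why it is usually the simpler choice.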


### Parsing HTML tables from the web

Now that we know how read_html works, let's go one step further and parse HTML tables directly from a URL.

To do that, we'll call read_html with a URL as the argument.

In [16]:
html_url = "https://www.basketball-reference.com/leagues/NBA_2019_per_game.html"

In [17]:
nba_tables = pd.read_html(html_url)

In [18]:
len(nba_tables)

1

In [21]:
nba = nba_tables[0]
nba.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,...,0.923,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3
1,2,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,...,0.7,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7
2,3,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,...,0.778,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2
3,4,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,...,0.5,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,5,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,...,0.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9


### Complex example

We can also use the requests module to download the HTML from a URL ourselves and then parse it into DataFrame objects.

If we look at the given URL we can see multiple tables about The Simpsons TV show.

We want to keep the table with information about each season.


In [22]:
import requests

In [23]:
html_url = "https://en.wikipedia.org/wiki/The_Simpsons"

In [28]:
r = requests.get(html_url)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
wiki_tables = pd.read_html(r.text,
                           header=0)
len(wiki_tables)

47
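
With this many tables, picking one by position is brittle, since the index can shift whenever the page changes. read_html also accepts a match argument (a string or regular expression) and returns only the tables that contain matching text. A minimal sketch on an invented two-table page (the `page` string and its values are placeholders, not data from Wikipedia; whether "Season" is the right pattern for the live page is an assumption to check):

```python
from io import StringIO

import pandas as pd

# Invented two-table page; the values are placeholders, not real data.
page = """
<table>
  <tr><th>Cast member</th><th>Role</th></tr>
  <tr><td>Actor A</td><td>Character A</td></tr>
</table>
<table>
  <tr><th>Season</th><th>Episodes</th></tr>
  <tr><td>1</td><td>10</td></tr>
  <tr><td>2</td><td>12</td></tr>
</table>
"""

# Only tables containing text that matches the pattern are returned.
season_tables = pd.read_html(StringIO(page), match="Season")
seasons = season_tables[0]
print(len(season_tables))
print(seasons)
```

If no table matches the pattern, read_html raises a ValueError, so a wrong pattern fails loudly rather than silently returning the wrong table.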

In [33]:
simpsons = wiki_tables[1]
simpsons.head()

Unnamed: 0,Cast members,Cast members.1,Cast members.2,Cast members.3,Cast members.4,Cast members.5,Cast members.6,Cast members.7,Cast members.8
0,,,,,,,,,
1,Dan Castellaneta,Julie Kavner,Nancy Cartwright,Yeardley Smith,Hank Azaria,Harry Shearer,,,
2,"Homer Simpson, Abe Simpson, Krusty the Clown, ...","Marge Simpson, Patty and Selma Bouvier, additi...","Bart Simpson, Maggie Simpson, various characters",Lisa Simpson,"Moe Szyslak, Chief Wiggum, Apu Nahasapeemapeti...","Ned Flanders, Mr. Burns, Dr. Hibbert (1990–202...",,,


In [31]:
simpsons.drop([0,1], inplace=True)