
# Original Table (in HTML) vs. Our DataFrame (made in PANDAS)


![image.png](attachment:image.png)

![image.png](attachment:image.png)

# Introduction

As data scientists, we will not always find well-organized data saved in an xlsx or csv file. Often the data lives in another source and has to be extracted first. One well-known extraction technique is called "web scraping".
Web scraping allows us to collect data from an HTML page, that is, from a website.

In this notebook, we will learn how to turn a table on a website into a DataFrame and manipulate it using the powerful library PANDAS.

**LET'S CODE!!**

# Step-by-step:
1. Import libraries
2. Get the URL
3. Parse the HTML page of this URL
4. Find the table
5. Convert the HTML table to a Pandas DataFrame
6. Complete code

## 1. Import libraries

For this tutorial, we will use **bs4, requests and pandas**.

### BeautifulSoup

![Bs4](https://funthon.files.wordpress.com/2017/05/bs.png)

BeautifulSoup is a library for web scraping with a short and easy syntax. With it we can extract phrases, words, numbers, images, and **TABLES** from a page.
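To make this concrete, here is a minimal sketch that runs on a tiny hand-written page instead of a real website, so no download is needed. The HTML snippet, the `note` class and the `soup_demo` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (an assumption for this demo, not the real site).
html = """
<html>
  <body>
    <h1>NBA totals</h1>
    <p class="note">Season 2019-20</p>
    <img src="logo.png" alt="logo">
  </body>
</html>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

heading = soup_demo.find('h1').get_text()             # text inside the first <h1>
note = soup_demo.find('p', class_='note').get_text()  # filter a tag by its CSS class
image_src = soup_demo.find('img')['src']              # read an attribute of a tag

print(heading, '|', note, '|', image_src)
```

The same `find` call is what we rely on later to locate the table.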

### Requests

![Requests](https://www.nicepng.com/png/detail/70-702215_building-with-python-requests-python-requests-logo.png)

Requests is a library for sending HTTP requests, which we use here to download a web page.
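As a rough illustration of what "sending a request" involves, the sketch below builds a request without sending it, so we can look at its parts offline. The `User-Agent` value is a made-up placeholder, not something the site requires as far as this notebook knows.

```python
import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2020_totals.html'

# Build the request but do not send it, so nothing is downloaded here.
# The User-Agent string is a hypothetical placeholder for illustration.
request = requests.Request('GET', url, headers={'User-Agent': 'scraping-tutorial/0.1'})
prepared = request.prepare()

print(prepared.method)                 # the HTTP verb
print(prepared.url)                    # the address that would be requested
print(prepared.headers['User-Agent'])  # one of the headers that would be sent
```

In the rest of the notebook we simply call `requests.get(url)`, which builds and sends the request in one step.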

### Pandas

![Pandas](https://www.dlf.pt/dfpng/middlepng/442-4429904_pandas-python-logo-png-pandas-python-logo-transparent.png)

Pandas is a powerful library used to create, manipulate, plot and explore tabular data. It is one of the libraries most used by data scientists who work in Python, as it covers many of the tasks a project needs.
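A small sketch of "create, manipulate and explore", using three rows copied from the table shown in this notebook. The `players`, `PPG` and `centers` names are only illustrative.

```python
import pandas as pd

# Create: three rows copied from the table shown in this notebook.
players = pd.DataFrame({
    'Player': ['Steven Adams', 'Bam Adebayo', 'LaMarcus Aldridge'],
    'Pos': ['C', 'PF', 'C'],
    'G': [63, 72, 53],
    'PTS': [684, 1146, 1001],
})

# Manipulate: add a points-per-game column.
players['PPG'] = (players['PTS'] / players['G']).round(1)

# Explore: keep only the centers, sorted by total points.
centers = players[players['Pos'] == 'C'].sort_values('PTS', ascending=False)
print(centers)
```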


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## 2. Get the URL

![image.png](attachment:image.png)

In [2]:

url = 'https://www.basketball-reference.com/leagues/NBA_2020_totals.html'
req = requests.get(url)
print(req) #To verify output

<Response [200]>


If the output is `<Response [200]>`, it worked!!
If the output is `<Response [404]>`, the page was not found!!
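The same check can be done in code instead of by eye. The sketch below is one possible way, assuming a hypothetical helper named `describe_status`; the `Response` objects are built by hand only so the example runs offline.

```python
import requests

def describe_status(response):
    """Return a short verdict for a requests.Response (hypothetical helper)."""
    if response.ok:  # True when the status code is below 400
        return f'{response.status_code}: success'
    if response.status_code == 404:
        return '404: page not found'
    return f'{response.status_code}: request failed'

# Hand-built responses so nothing is downloaded; with a real page we would
# pass the result of requests.get(url) instead.
fake_ok = requests.Response()
fake_ok.status_code = 200

fake_missing = requests.Response()
fake_missing.status_code = 404

print(describe_status(fake_ok))
print(describe_status(fake_missing))
```

In a script, calling `req.raise_for_status()` is a shorter alternative: it raises `requests.HTTPError` for 4xx and 5xx responses and does nothing otherwise.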

## 3. Parse HTML page

In [3]:
soup = BeautifulSoup(req.content, 'html.parser')
#print(soup) #To verify output (VERY LONG)

The output is the whole HTML page.

## 4. Find the table 

In [4]:
tabela = soup.find(name='table')
#print(tabela) #To verify output (VERY LONG)

The output is the table in HTML tags
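Note that `find` returns only the first match. When a page holds several tables, we may need to be more specific. The sketch below uses a made-up two-table page; the `standings` and `totals` ids are hypothetical and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hand-written page with two tables (ids are hypothetical).
html = """
<table id="standings"><tr><th>Team</th></tr><tr><td>MIA</td></tr></table>
<table id="totals"><tr><th>Player</th></tr><tr><td>Bam Adebayo</td></tr></table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

first_table = soup_demo.find(name='table')           # first <table> in document order
all_tables = soup_demo.find_all('table')             # every <table>, as a list
totals_table = soup_demo.find('table', id='totals')  # one table, chosen by its id

print(first_table['id'], len(all_tables), totals_table.td.get_text())
```

Inspecting the page in the browser shows which id or class the wanted table carries.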

## 5. From HTML table to PANDAS DataFrame

In [5]:
from io import StringIO  # recent pandas versions deprecate passing a literal HTML string
# read_html returns a list (here with a single table), so "[0]" picks the DataFrame itself
df = pd.read_html(StringIO(str(tabela)))[0].set_index('Rk')
df.head()

| Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 1680 | 283 | 478 | 0.592 | ... | 0.582 | 207 | 376 | 583 | 146 | 51 | 67 | 94 | 122 | 684 |
| 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 2417 | 440 | 790 | 0.557 | ... | 0.691 | 176 | 559 | 735 | 368 | 82 | 93 | 204 | 182 | 1146 |
| 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 1754 | 391 | 793 | 0.493 | ... | 0.827 | 103 | 289 | 392 | 129 | 36 | 87 | 74 | 128 | 1001 |
| 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 13 | 1 | 2 | 0.5 | ... | NaN | 2 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 2 |
| 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 591 | 98 | 266 | 0.368 | ... | 0.676 | 9 | 75 | 84 | 89 | 17 | 8 | 54 | 57 | 267 |
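To see roughly what this conversion involves, here is a hand-rolled sketch that builds a DataFrame from table rows with BeautifulSoup alone. It is not how `pd.read_html` is implemented, only an illustration of the idea, and it uses a small hand-written table with values copied from the output above. The `manual_df` name is made up.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Small hand-written table; the values are copied from the output above.
html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Pos</th><th>PTS</th></tr>
  <tr><td>1</td><td>Steven Adams</td><td>C</td><td>684</td></tr>
  <tr><td>2</td><td>Bam Adebayo</td><td>PF</td><td>1146</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')

# One list of cell texts per <tr>; header cells are <th>, data cells are <td>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in table.find_all('tr')
]

header, body = rows[0], rows[1:]
manual_df = pd.DataFrame(body, columns=header).set_index('Rk')

# Every cell starts out as text, so numeric columns are converted explicitly.
manual_df['PTS'] = pd.to_numeric(manual_df['PTS'])
print(manual_df)
```

`pd.read_html` saves us this work and also infers the numeric types for us.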


## 6. Complete code

In [6]:
# Step one: import libraries
from io import StringIO

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Step two: get the URL
url = 'https://www.basketball-reference.com/leagues/NBA_2020_totals.html'
req = requests.get(url)

# Step three: parse HTML page
soup = BeautifulSoup(req.content, 'html.parser')

# Step four: find the table
tabela = soup.find(name='table')

# Step five: from HTML table to PANDAS DataFrame
# (StringIO is used because recent pandas versions deprecate literal HTML strings)
df = pd.read_html(StringIO(str(tabela)))[0].set_index('Rk')
df.head()

| Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 1680 | 283 | 478 | 0.592 | ... | 0.582 | 207 | 376 | 583 | 146 | 51 | 67 | 94 | 122 | 684 |
| 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 2417 | 440 | 790 | 0.557 | ... | 0.691 | 176 | 559 | 735 | 368 | 82 | 93 | 204 | 182 | 1146 |
| 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 1754 | 391 | 793 | 0.493 | ... | 0.827 | 103 | 289 | 392 | 129 | 36 | 87 | 74 | 128 | 1001 |
| 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 13 | 1 | 2 | 0.5 | ... | NaN | 2 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 2 |
| 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 591 | 98 | 266 | 0.368 | ... | 0.676 | 9 | 75 | 84 | 89 | 17 | 8 | 54 | 57 | 267 |
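Once the data is in a DataFrame, a little cleaning is often needed. Long web tables sometimes repeat their header row inside the body, and such a row then arrives as ordinary data. The sketch below assumes that situation with a hypothetical three-row frame (the `raw` and `clean` names and the repeated row are made up; check your own table before applying it).

```python
import pandas as pd

# Hypothetical scraped result in which the header row shows up again as data.
raw = pd.DataFrame({
    'Player': ['Steven Adams', 'Player', 'Bam Adebayo'],
    'Age': ['26', 'Age', '22'],
    'PTS': ['684', 'PTS', '1146'],
})

# Drop the rows that merely repeat the column names.
clean = raw[raw['Player'] != 'Player'].copy()

# Convert the text columns that hold numbers.
for column in ['Age', 'PTS']:
    clean[column] = pd.to_numeric(clean[column])

clean = clean.reset_index(drop=True)
print(clean.dtypes)
```

After this step, numeric operations such as `clean['PTS'].sum()` behave as expected.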

