---
<img alt="Colaboratory logo" width="15%" src="https://i.postimg.cc/8CdBQXmP/JPLavor.png">

#### **Data Science**
*by [jplavorr](https://linktr.ee/jplavorr)*  

---

#Web Scraping for Sports

When I started looking for datasets on sports leagues and their players' statistics, I ran into a problem. Although I found such datasets on Kaggle, they were either incomplete or lacked the columns I wanted to analyze. That is when I realized that the data we need is not always readily available. To carry on with a data analysis project, we sometimes have to do a little more work to obtain appropriate, up-to-date data.

That brings us to the topic of this article, **Web Scraping**, which will be used to build the datasets for the upcoming series of articles on Data Science applied to sports. This article serves as the basis for how we will extract game and season statistics for the analyses that follow.

To gather statistics for a variety of sports, we will use the [Sports Reference](https://www.sports-reference.com/) site. It is essentially an encyclopedia of sports statistics. That led to my next question: why not pull the data directly from Basketball Reference? After some more research, I found a great Python library that solved this part of my project: BeautifulSoup. This web-scraping library lets us search the HTML of a web page and extract the information we need. From there, we store the collected data in a DataFrame using pandas.
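
Before touching the live site, the idea can be tried on a tiny hand-written table. The sketch below is only an illustration: the HTML snippet, the `demo_stats` id and the player names are made up, and the real page is far larger.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of a stats table; not copied from the real site.
SAMPLE_HTML = """
<table id="demo_stats">
  <thead>
    <tr><th>Player</th><th>Pos</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><td>Player A</td><td>SG</td><td>4.7</td></tr>
    <tr><td>Player B</td><td>C</td><td>13.9</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="demo_stats")

# Header cells live in <th>, data cells in <td>.
headers = [th.get_text() for th in table.thead.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

demo_df = pd.DataFrame(rows, columns=headers)
print(demo_df)
```

Every value extracted this way is still text, including the numbers.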


In [28]:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [29]:
# Season to analyze
year = 2018

In [30]:
# URL of the page we are going to scrape
url_nba = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)

In [31]:
page = requests.get(url_nba)
page

<Response [200]>
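
A status of 200 means the request succeeded. As a rough sketch, the same check can be made explicit with the standard library alone. `build_url` and `fetch_html` are hypothetical helper names, and the `User-Agent` string and the 10-second timeout are illustrative assumptions rather than anything the site prescribes.

```python
from urllib.request import Request, urlopen

BASE = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

def build_url(year):
    """Return the per-game stats URL for one season (same pattern as url_nba)."""
    return BASE.format(year)

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, failing loudly on a non-200 status."""
    # The User-Agent value is an illustrative assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (sports-scraping-tutorial)"})
    # urlopen itself raises HTTPError for 4xx/5xx responses.
    with urlopen(request, timeout=timeout) as response:
        if response.status != 200:
            raise RuntimeError("Unexpected status {} for {}".format(response.status, url))
        return response.read()

print(build_url(2018))
# html_nba = fetch_html(build_url(2018))  # uncomment to actually hit the network
```

The bytes returned by `fetch_html` can be passed straight to `BeautifulSoup`, so the page only needs to be downloaded once.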

In [None]:
page.content

In [32]:
html_nba = urlopen(url_nba)
# name the parser explicitly so BeautifulSoup does not have to guess one
soup_nba = BeautifulSoup(html_nba, "html.parser")

#Players

In [33]:
# avoid the first header row
rows = soup_nba.findAll('tr')[1:]
players_stats_2018 = [[td.getText() for td in rows[i].findAll('td')] for i in range(len(rows))]

In [34]:
headers_2018 = [th.getText() for th in soup_nba.findAll('tr', limit=2)[0].findAll('th')]

In [35]:
headers_2018

['Rk',
 'Player',
 'Pos',
 'Age',
 'Tm',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'eFG%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS']

In [36]:
headers_2018_final = headers_2018[1:]

In [37]:
stats_2018 = pd.DataFrame(players_stats_2018, columns = headers_2018_final)

In [38]:
stats_2018.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Álex Abrines,SG,24,OKC,75,8,15.1,1.5,3.9,0.395,...,0.848,0.3,1.2,1.5,0.4,0.5,0.1,0.3,1.7,4.7
1,Quincy Acy,PF,27,BRK,70,8,19.4,1.9,5.2,0.356,...,0.817,0.6,3.1,3.7,0.8,0.5,0.4,0.9,2.1,5.9
2,Steven Adams,C,24,OKC,76,76,32.7,5.9,9.4,0.629,...,0.559,5.1,4.0,9.0,1.2,1.2,1.0,1.7,2.8,13.9
3,Bam Adebayo,C,20,MIA,69,19,19.8,2.5,4.9,0.512,...,0.721,1.7,3.8,5.5,1.5,0.5,0.6,1.0,2.0,6.9
4,Arron Afflalo,SG,32,ORL,53,3,12.9,1.2,3.1,0.401,...,0.846,0.1,1.2,1.2,0.6,0.1,0.2,0.4,1.1,3.4


In [39]:
stats_2018.shape

(690, 29)
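
The row count may include entries that are completely empty. Any `tr` that holds only `th` cells (for example a header row repeated inside the table, which is an assumption about this page's layout) produces an empty list, and pandas fills such a row with missing values. A minimal sketch of filtering those out, on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a data row, a repeated header row, then another data row.
SAMPLE_HTML = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td>Player A</td><td>4.7</td></tr>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>2</th><td>Player B</td><td>13.9</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")[1:]  # skip the first header row

raw = [[td.get_text() for td in tr.find_all("td")] for tr in rows]
cleaned = [cells for cells in raw if cells]  # drop rows with no <td> cells

print(len(raw), len(cleaned))
```

Comparing the two lengths on the real data shows how many rows, if any, were empty.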

#Players (alternative)

In [40]:
table_Basket = soup_nba.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="per_game_stats") 
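
The lambda spells out the match step by step. BeautifulSoup also accepts the tag name and the `id` as arguments, which reads more compactly. The sketch below uses a made-up two-table snippet to show that both forms find the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two tables; only the second has the id we want.
SAMPLE_HTML = """
<div>
  <table id="other_table"><tr><td>ignored</td></tr></table>
  <table id="per_game_stats"><tr><td>wanted</td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Lambda form: every condition is written out by hand.
by_lambda = soup.find(lambda tag: tag.name == "table"
                      and tag.has_attr("id")
                      and tag["id"] == "per_game_stats")

# Keyword form: BeautifulSoup matches the id attribute directly.
by_keyword = soup.find("table", id="per_game_stats")

print(by_lambda is by_keyword)
print(by_keyword.get_text(strip=True))
```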

In [42]:
columns_Basket = table_Basket.findAll(lambda tag: tag.name=='tr',limit=1)

In [43]:
headers_Basket = [th.getText() for th in columns_Basket[0].findAll('th')]
headers_Basket_final = headers_Basket[1:]

In [44]:
# Build a list with every player's statistics
rows_Basket = table_Basket.tbody.findAll('tr')
# iterate over rows_Basket (the rows of this table), not the earlier `rows` variable
player_stats_Basket = [[td.getText() for td in rows_Basket[i].findAll('td')] for i in range(len(rows_Basket))]

In [45]:
nba_stats_Basket_2018 = pd.DataFrame(player_stats_Basket, columns = headers_Basket_final)

In [46]:
nba_stats_Basket_2018.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Álex Abrines,SG,24,OKC,75,8,15.1,1.5,3.9,0.395,...,0.848,0.3,1.2,1.5,0.4,0.5,0.1,0.3,1.7,4.7
1,Quincy Acy,PF,27,BRK,70,8,19.4,1.9,5.2,0.356,...,0.817,0.6,3.1,3.7,0.8,0.5,0.4,0.9,2.1,5.9
2,Steven Adams,C,24,OKC,76,76,32.7,5.9,9.4,0.629,...,0.559,5.1,4.0,9.0,1.2,1.2,1.0,1.7,2.8,13.9
3,Bam Adebayo,C,20,MIA,69,19,19.8,2.5,4.9,0.512,...,0.721,1.7,3.8,5.5,1.5,0.5,0.6,1.0,2.0,6.9
4,Arron Afflalo,SG,32,ORL,53,3,12.9,1.2,3.1,0.401,...,0.846,0.1,1.2,1.2,0.6,0.1,0.2,0.4,1.1,3.4


In [47]:
nba_stats_Basket_2018.shape

(690, 29)
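
Every value returned by `getText()` is a string, so columns such as `PTS` need converting before any arithmetic. A minimal sketch using `pd.to_numeric`, on two rows taken from the `head()` output and reduced to four columns (`text_cols` is a name chosen for this example):

```python
import pandas as pd

# Two rows from the head() output, reduced to a few columns.
demo = pd.DataFrame(
    [["Álex Abrines", "SG", "24", "4.7"], ["Quincy Acy", "PF", "27", "5.9"]],
    columns=["Player", "Pos", "Age", "PTS"],
)

text_cols = ["Player", "Pos"]
numeric_cols = [c for c in demo.columns if c not in text_cols]

# errors="coerce" turns blanks or other unparseable strings into NaN
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
print(demo["PTS"].mean())
```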

# Team

In [48]:
# URL of the page we are going to scrape
url_time = "https://www.basketball-reference.com/leagues/NBA_{}.html".format(year)

In [49]:
page = requests.get(url_time)
page

<Response [200]>

In [50]:
html_time = urlopen(url_time)
# name the parser explicitly so BeautifulSoup does not have to guess one
soup_time = BeautifulSoup(html_time, "html.parser")

In [51]:
table_time = soup_time.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="per_game-team") 

In [52]:
columns_time = table_time.findAll(lambda tag: tag.name=='tr',limit=2)

In [53]:
columns_time

[<tr> <th aria-label="Rk" class="ranker poptip sort_default_asc show_partial_when_sorting center" data-stat="ranker" data-tip="Rank" scope="col">Rk</th> <th aria-label="team" class=" poptip center" data-stat="team" scope="col">Team</th> <th aria-label="Games" class=" poptip center" data-stat="g" data-tip="Games" scope="col">G</th> <th aria-label="Minutes Played" class=" poptip center" data-stat="mp" data-tip="Minutes Played" scope="col">MP</th> <th aria-label="Field Goals" class=" poptip center" data-stat="fg" data-tip="Field Goals" scope="col">FG</th> <th aria-label="FGA" class=" poptip center" data-stat="fga" data-tip="Field Goal Attempts" scope="col">FGA</th> <th aria-label="Field Goal Percentage" class=" poptip center" data-stat="fg_pct" data-tip="Field Goal Percentage" scope="col">FG%</th> <th aria-label="3-Point Field Goals" class=" poptip center" data-stat="fg3" data-tip="3-Point Field Goals" scope="col">3P</th> <th aria-label="3-Point Field Goal Attempts" class=" poptip center"

In [54]:
headers_time = [th.getText() for th in columns_time[0].findAll('th')]
headers_time_final = headers_time[1:]

In [55]:
headers_time_final

['Team',
 'G',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS']
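
The leading `Rk` header is dropped because the row extraction only collects `td` cells, so each row ends up one cell shorter than the full header list. A quick width check before building the DataFrame catches any mismatch early. This is only a sketch, and `find_mismatched_rows` is a hypothetical helper name.

```python
def find_mismatched_rows(headers, rows):
    """Return the indices of rows whose cell count differs from the header count."""
    return [i for i, cells in enumerate(rows) if len(cells) != len(headers)]

# Made-up example: the second row is missing one cell.
demo_headers = ["Team", "G", "PTS"]
demo_rows = [
    ["Team A", "82", "113.5"],
    ["Team B", "82"],
]

print(find_mismatched_rows(demo_headers, demo_rows))
```

An empty result means every row lines up with the headers.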

In [56]:
# Build a list with every team's statistics
rows_time = table_time.tbody.findAll('tr')[0:]
time_stats_Basket = [[td.getText() for td in rows_time[i].findAll('td')] for i in range(len(rows_time))]

In [57]:
nba_stats_2018 = pd.DataFrame(time_stats_Basket, columns = headers_time_final)

In [58]:
nba_stats_2018

Unnamed: 0,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,2P,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Golden State Warriors*,82,240.6,42.8,85.1,0.503,11.3,28.9,0.391,31.5,...,0.815,8.4,35.1,43.5,29.3,8.0,7.5,15.5,19.6,113.5
1,Houston Rockets*,82,240.9,38.7,84.2,0.46,15.3,42.3,0.362,23.4,...,0.781,9.0,34.5,43.5,21.5,8.5,4.8,13.8,19.5,112.4
2,New Orleans Pelicans*,82,243.4,42.7,88.3,0.483,10.2,28.2,0.362,32.5,...,0.772,8.7,35.7,44.3,26.8,8.0,5.9,14.9,19.1,111.7
3,Toronto Raptors*,82,241.8,41.3,87.4,0.472,11.8,33.0,0.358,29.5,...,0.794,9.8,34.2,44.0,24.3,7.6,6.1,13.4,21.7,111.7
4,Cleveland Cavaliers*,82,240.6,40.4,84.8,0.476,12.0,32.1,0.372,28.4,...,0.779,8.5,33.7,42.1,23.4,7.1,3.8,13.7,18.6,110.9
5,Denver Nuggets,82,242.4,40.7,86.6,0.47,11.5,30.9,0.371,29.2,...,0.767,11.0,33.5,44.5,25.1,7.6,4.9,15.0,18.7,110.0
6,Philadelphia 76ers*,82,241.2,40.8,86.6,0.472,11.0,29.8,0.369,29.9,...,0.752,10.9,36.5,47.4,27.1,8.3,5.1,16.5,22.1,109.8
7,Minnesota Timberwolves*,82,241.5,41.0,86.1,0.476,8.0,22.5,0.357,33.0,...,0.804,10.3,31.6,42.0,22.7,8.4,4.2,12.5,18.2,109.5
8,Los Angeles Clippers,82,240.3,40.3,85.4,0.471,9.5,26.8,0.354,30.8,...,0.743,10.1,33.7,43.9,22.3,7.7,4.5,14.7,20.0,109.0
9,Charlotte Hornets,82,241.2,39.0,86.7,0.45,10.0,27.2,0.369,28.9,...,0.747,10.1,35.4,45.5,21.6,6.8,4.5,12.7,17.2,108.2
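
Several team names end in `*`. Assuming, as this sketch does, that the asterisk marks teams that reached the playoffs, the marker can be moved out of the name and into its own column. `Playoffs` is a hypothetical column name.

```python
import pandas as pd

# Two rows from the output above, reduced to two columns.
demo = pd.DataFrame({"Team": ["Golden State Warriors*", "Denver Nuggets"],
                     "PTS": ["113.5", "110.0"]})

# Assumption: a trailing '*' flags a playoff team.
demo["Playoffs"] = demo["Team"].str.endswith("*")
demo["Team"] = demo["Team"].str.rstrip("*")

print(demo)
```

Clean names also make it easier to join this table with other sources later.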


#ETL Process

In [59]:
from datetime import date

start = date(2018, 1, 1) 
end = date(2020, 1, 1)

year_range = [year for year in range(start.year, end.year +1)]

In [60]:
year_range

[2018, 2019, 2020]

In [61]:
names = ['Season_{}'.format(i) for i in year_range]

In [62]:
names

['Season_2018', 'Season_2019', 'Season_2020']
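
Since only the years matter here, the same two lists can also be built without `date` objects. A minimal equivalent, where `first_season` and `last_season` are names chosen for this sketch:

```python
first_season, last_season = 2018, 2020

# range() excludes its stop value, hence the + 1
year_range = list(range(first_season, last_season + 1))
names = [f"Season_{year}" for year in year_range]

print(year_range)
print(names)
```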

In [63]:
for i in range(len(year_range)):
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year_range[i])
    # this is the HTML from the given URL
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    # column names come from the first table row
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    # drop 'Rk', which the <td>-only row extraction below does not capture
    headers = headers[1:]
    # avoid the first header row
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows[j].findAll('td')]
                for j in range(len(rows))]
    # bind the DataFrame to a notebook-level name such as Season_2018
    globals()[names[i]] = pd.DataFrame(player_stats, columns = headers)
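
The loop creates one variable per season by name. An alternative, sketched here under a few assumptions, keeps the seasons in a dictionary keyed by year and waits between requests. `scrape_seasons`, `scrape_one` and the 3-second default pause are hypothetical choices, and `fake_scrape` stands in for the real download-and-parse step so the sketch runs offline.

```python
import time

def scrape_seasons(years, scrape_one, pause_seconds=3.0):
    """Call scrape_one(year) for each year and collect the results in a dict."""
    seasons = {}
    for index, year in enumerate(years):
        if index > 0:
            time.sleep(pause_seconds)  # wait between requests, not before the first
        seasons[year] = scrape_one(year)
    return seasons

# Stand-in for the real download-and-parse step.
def fake_scrape(year):
    return "table for {}".format(year)

seasons = scrape_seasons([2018, 2019, 2020], fake_scrape, pause_seconds=0)
print(seasons[2019])
```

With a dictionary, looping over all seasons later is just `for year, df in seasons.items()`.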


In [64]:
Season_2019.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,0.357,...,0.923,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3
1,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,0.222,...,0.7,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7
2,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,0.345,...,0.778,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2
3,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,0.595,...,0.5,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,0.576,...,0.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9


In [66]:
Season_2020.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Steven Adams,C,26,OKC,63,63,26.7,4.5,7.6,0.592,...,0.582,3.3,6.0,9.3,2.3,0.8,1.1,1.5,1.9,10.9
1,Bam Adebayo,PF,22,MIA,72,72,33.6,6.1,11.0,0.557,...,0.691,2.4,7.8,10.2,5.1,1.1,1.3,2.8,2.5,15.9
2,LaMarcus Aldridge,C,34,SAS,53,53,33.1,7.4,15.0,0.493,...,0.827,1.9,5.5,7.4,2.4,0.7,1.6,1.4,2.4,18.9
3,Kyle Alexander,C,23,MIA,2,0,6.5,0.5,1.0,0.5,...,,1.0,0.5,1.5,0.0,0.0,0.0,0.5,0.5,1.0
4,Nickeil Alexander-Walker,SG,21,NOP,47,1,12.6,2.1,5.7,0.368,...,0.676,0.2,1.6,1.8,1.9,0.4,0.2,1.1,1.2,5.7
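
To use the seasons together in later analyses, they can be stacked into a single DataFrame with a column recording the season, then written out as CSV. The sketch below uses two one-row stand-ins built from values shown above. `Season` is a hypothetical column name, and an in-memory buffer replaces a real file path.

```python
import io
import pandas as pd

# Tiny stand-ins for the per-season DataFrames.
seasons = {
    2019: pd.DataFrame([["Steven Adams", "13.9"]], columns=["Player", "PTS"]),
    2020: pd.DataFrame([["Steven Adams", "10.9"]], columns=["Player", "PTS"]),
}

# Tag each frame with its season, then stack them.
frames = [df.assign(Season=year) for year, df in seasons.items()]
all_seasons = pd.concat(frames, ignore_index=True)

# Write to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
all_seasons.to_csv(buffer, index=False)
print(buffer.getvalue())
```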
