## Scraping the List of International Players to Play in the NBA

Wikipedia keeps a list of every international player to appear in the NBA, so we will start by scraping the table that holds that data.

Link to Wikipedia table: https://en.wikipedia.org/wiki/List_of_NBA_players_born_outside_the_United_States

#### Importing Packages

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

#### Scraping Wikipedia

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_NBA_players_born_outside_the_United_States'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
# The page holds several wikitables; the player list is the seventh (index 6)
table = soup.find_all('table', {'class': "wikitable"})[6]

#### Creating the Players Dataframe

In [3]:
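# Newer pandas versions expect a file-like object rather than a raw HTML string
from io import StringIO

# read_html returns a list of dataframes; keep the first one and its first seven columns
players = pd.read_html(StringIO(str(table)))[0]
players = players.iloc[:, 0:7]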

# Dropping the position column, as I will pull that from Basketball Reference
players = players.drop(players.columns[3], axis=1)
players.head()

Unnamed: 0,Nationality[A],Birthplace[B],Player,Career[C],Yrs,Notes
0,Angola,—,Bruno Fernando*,2019–present,3,—
1,Antigua and Barbuda,United States,Julius Hodge,2005–2007,2,"Born in the United States, became a naturalize..."
2,Argentina,—,Leandro Bolmaro*,2021–present,1,Also holds Italian citizenship.[13]
3,Argentina,—,Nicolás Brussino,2016–2017,2,Also holds Italian citizenship.
4,Argentina,—,Facundo Campazzo*,2020–present,2,—


In [4]:
# Cleaning the column names
players.columns = players.columns.str.replace(r'\[.*\]', '', regex=True)
players.tail(10)


Unnamed: 0,Nationality,Birthplace,Player,Career,Yrs,Notes
627,U.S. Virgin Islands,—,Charles Claxton,1995,1,—
628,U.S. Virgin Islands,United States,Nic Claxton*,2019–present,3,Born on the United States mainland by U.S. Vir...
629,U.S. Virgin Islands,United States,David Vanterpool,2001,1,"Born on the United States mainland, represents..."
630,Uruguay,—,Esteban Batista,2005–2007,2,—
631,Venezuela,Trinidad and Tobago,Carl Herrera,1991–1999,8,"Born in Trinidad and Tobago, represented Venez..."
632,Venezuela,United States,Askia Jones,1994,1,"Born in the United States, represented Venezue..."
633,Venezuela,United States,Harold Keeling,1986,1,"Born in the United States, represented Venezue..."
634,Venezuela,United States,Donta Smith,2004–2006,2,"Born on the United States, represented Venezuela."
635,Venezuela,—,Óscar Torres,2001–2002,2,—
636,Venezuela,—,Greivis Vásquez,2010–2016,7,—
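
The column cleanup above relies on a regular expression that removes the bracketed footnote markers Wikipedia appends to header names. Below is a minimal sketch of the same pattern using only the standard library `re` module, with header names taken from the output above; `strip_footnotes` is a hypothetical helper name, not part of the notebook.

```python
import re

def strip_footnotes(name):
    """Remove a bracketed footnote marker such as [A] from a column name."""
    return re.sub(r'\[.*\]', '', name)

raw_columns = ['Nationality[A]', 'Birthplace[B]', 'Player', 'Career[C]', 'Yrs', 'Notes']
clean_columns = [strip_footnotes(c) for c in raw_columns]
print(clean_columns)
```

Because `.*` is greedy, a name with two markers would lose everything between the first `[` and the last `]`. The non-greedy form `\[.*?\]` would be a safer choice if the headers ever carried more than one marker.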


In [5]:
# Save to csv
players.to_csv(r'Documents\International Players.csv', index = False)

### Scraping Player Stats from Basketball Reference

We will pull advanced stats and per 100 possession stats for our analysis. Advanced stats pages go back to 1950, although not every advanced stat category is populated that far back. Per 100 possession stats are available from 1974 onward. Advanced stats will help us determine which players contributed the most to winning games, while per 100 possession stats let us compare players after adjusting for the pace of the game and minutes played per game.

- Example Advanced Stats page: https://www.basketball-reference.com/leagues/NBA_2020_advanced.html
- Example Per 100 Possessions page: https://www.basketball-reference.com/leagues/NBA_2020_per_poss.html
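
To make the pace adjustment concrete, here is a rough sketch of how a per 100 possessions rate can be computed. The function name and the sample numbers are invented for illustration, and the possession estimates Basketball Reference uses are more involved than this simple ratio.

```python
def per_100_possessions(stat_total, possessions):
    """Scale a raw counting stat to a rate per 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return stat_total * 100 / possessions

# Hypothetical players: same raw point total, different numbers of possessions
fast_pace = per_100_possessions(stat_total=500, possessions=2500)
slow_pace = per_100_possessions(stat_total=500, possessions=2000)
print(fast_pace, slow_pace)
```

The two players have identical raw totals, but the one who needed fewer possessions ends up with the higher rate, which is the comparison the per 100 possession pages are meant to support.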

#### Scraping Advanced Stats

In [6]:
# We will pull advanced stats from 1950 through the current 2022 season 

years = list(range(1950,2023))

advanced_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_advanced.html"

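for year in years:
    url = advanced_stats_url.format(year)
    data = requests.get(url)
    # Stop on an HTTP error rather than saving an error page to disk
    data.raise_for_status()

    # The "Basketball Reference" folder must already exist
    with open("Basketball Reference/{}_Advanced.html".format(year), "w", encoding="utf-8") as f:
        f.write(data.text)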
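
A loop like the one above sends one request per season in quick succession, and sites such as Basketball Reference may throttle rapid requests. The sketch below shows one way to pause between requests and skip pages that are already on disk. The helper name `download_pages`, its parameters, and the 3-second default are assumptions for illustration, not a documented requirement of the site.

```python
import os
import time

def download_pages(years, url_template, out_dir, suffix, fetch, delay_seconds=3.0):
    """Save one HTML page per year, skipping files that already exist.

    fetch is any callable that takes a URL and returns the page text.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for year in years:
        path = os.path.join(out_dir, "{}_{}.html".format(year, suffix))
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        html = fetch(url_template.format(year))
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        saved.append(path)
        time.sleep(delay_seconds)  # pause before the next request
    return saved
```

In this notebook it could be called with `fetch=lambda url: requests.get(url).text`, the `advanced_stats_url` template, and `"Advanced"` as the suffix. Passing the fetch function in keeps the helper easy to try out without touching the network.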

#### Creating the Advanced Stats Dataframe

In [7]:
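# Parsing the saved HTML pages into a list of dataframes, one per season
from io import StringIO  # newer pandas expects a file-like object, not a raw HTML string

dfs = []
for year in years:
    with open("Basketball Reference/{}_Advanced.html".format(year), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')
    # find returns the single table element with this id
    advanced_table = soup.find(id="advanced_stats")
    advanced_df = pd.read_html(StringIO(str(advanced_table)))[0]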
    advanced_df["Year"] = year
    dfs.append(advanced_df)
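
Basketball Reference tables typically repeat their header row partway through the body, so a parsed dataframe can contain rows whose `Rk` value is the literal string `Rk`. Assuming that holds for these pages, here is a small sketch of filtering those rows out before concatenating. The sample frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up sample that mimics a parsed table with one repeated header row
sample = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "Player": ["Player A", "Player B", "Player", "Player C"],
    "PER": ["15.0", "20.1", "PER", "12.3"],
})

# Keep only real player rows, then convert the stat column from text to numbers
cleaned = sample[sample["Rk"] != "Rk"].reset_index(drop=True)
cleaned = cleaned.assign(PER=pd.to_numeric(cleaned["PER"]))
print(cleaned)
```

Removing the repeated headers first matters because a single text value in a column keeps pandas from treating the whole column as numeric.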

In [8]:
advanced_stats = pd.concat(dfs)

In [9]:
# Save to csv
advanced_stats.to_csv(r'Documents\Advanced Stats.csv', index = False)

#### Scraping Per 100 Possession Stats

In [10]:
# We will pull per 100 possession stats from 1974 through the current 2022 season 

years = list(range(1974,2023))

per_poss_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_poss.html"

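for year in years:
    url = per_poss_url.format(year)
    data = requests.get(url)
    # Stop on an HTTP error rather than saving an error page to disk
    data.raise_for_status()

    # The "Basketball Reference" folder must already exist
    with open("Basketball Reference/{}_Per_Poss.html".format(year), "w", encoding="utf-8") as f:
        f.write(data.text)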

#### Creating the Per 100 Possessions Stats Dataframe

In [11]:
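# Parsing the saved HTML pages into a list of dataframes, one per season
from io import StringIO  # newer pandas expects a file-like object, not a raw HTML string

dfs = []
for year in years:
    with open("Basketball Reference/{}_Per_Poss.html".format(year), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')
    # find returns the single table element with this id
    per_poss_table = soup.find(id="per_poss_stats")
    per_poss_df = pd.read_html(StringIO(str(per_poss_table)))[0]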
    per_poss_df["Year"] = year
    dfs.append(per_poss_df)

In [12]:
per_poss = pd.concat(dfs)

In [13]:
# Save to csv
per_poss.to_csv(r'Documents\Per 100 Poss Stats.csv', index = False)