##### STA 141B Data & Web Technologies for Data Analysis

# Lecture 12 - 02/17/26, Scraping

### Announcements
- Third homework will be uploaded today/tomorrow.
- Midterm grades will be uploaded soon.
- Midterm solutions will be discussed in the discussion sections on February 18.

### Today's topics
 - Web Scraping: 
     - Tornado Watch
     - WhereTheISS

### Resources
 - [WhereTheISS](https://wheretheiss.at)
 - [Tornado Watch](https://www.tornadohq.com/)

## `pd.read_html`

Use `pd.read_html` whenever you want to read a table from a webpage. It is by far the most convenient way to do so.

Always provide your User-Agent and slow down your requests!
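
One way to make the "slow down" habit hard to forget is to put it into a small helper. The following is only a sketch: the class name `Throttle` and the contact string in `HEADERS` are made up for this illustration and are not part of `requests` or `pandas`.

```python
import time

# Hypothetical header; replace the contact details with your own.
HEADERS = {"User-Agent": "sta141b-student-scraper (your.email@example.edu)"}

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough so that calls are at least min_delay apart."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_delay=0.5)
# Usage idea: call throttle.wait() right before every request that sends HEADERS.
```

Compared with a fixed `time.sleep(0.5)`, this only waits for the part of the delay that has not already passed while the previous response was being processed.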

In [1]:
# Example
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area', storage_options=headers)

In [2]:
tables[1] # et voila!

| | City | ST | Land area (mi2) | Land area (km2) | Water area (mi2) | Water area (km2) | Total area (mi2) | Total area (km2) | Population (2020) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sitka | AK | 2870.2 | 7434 | 1904.3 | 4932.0 | 4774.5 | 12366 | 8458 |
| 1 | Juneau | AK | 2702.9 | 7000 | 555.1 | 1438.0 | 3258.0 | 8438 | 32255 |
| 2 | Wrangell | AK | 2556.1 | 6620 | 915.0 | 2370.0 | 3471.1 | 8990 | 2127 |
| 3 | Anchorage | AK | 1706.8 | 4421 | 237.7 | 616.0 | 1944.5 | 5036 | 291247 |
| 4 | Tribune[a]* | KS | 778.2 | 2016 | 0.0 | 0.0 | 778.2 | 2016 | 1182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | Toledo | OH | 80.5 | 208 | 3.3 | 8.5 | 83.8 | 217 | 270871 |
| 146 | Jonesboro | AR | 80.2 | 208 | 0.6 | 1.6 | 80.7 | 209 | 78576 |
| 147 | El Reno | OK | 79.6 | 206 | 0.6 | 1.6 | 80.2 | 208 | 16989 |
| 148 | Ellsworth | ME | 79.3 | 205 | 14.6 | 38.0 | 93.9 | 243 | 8399 |


## Scrape manually (XPath/BS4)

In [3]:
import requests
import lxml.html as lx
import requests_cache
import time
from re import sub
session = requests_cache.CachedSession('../output/lecture12')

We want to recreate the seasonality chart shown on [this page](https://foodwise.org/eat-seasonally/seasonality-charts/seasonality-chart-vegetables/).

For this, we need to take two steps:
- Get all vegetables listed on the page
- For each product, get the months in which the product is 'In Season'

### FIRST ATTEMPT

#### How to get the product in the first place? 

Visit https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable
and use Inspect.

In [4]:
def get_produce(page):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = requests.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    products = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
    return products

In [5]:
get_produce(2)

['Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi']

There are four pages in total.
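
If you do not want to hard-code the number of pages, you can keep requesting pages until one comes back empty. The sketch below assumes that an out-of-range page yields an empty list, which you should verify for the site you scrape. The helper name `collect_all_pages` and the stand-in data are hypothetical.

```python
def collect_all_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # assumption: a page past the end returns no results
            break
        items.extend(batch)
    return items

# Stand-in for get_produce so that the sketch runs without network access.
fake_site = {1: ["Artichokes", "Arugula"], 2: ["Chicory", "Corn"]}
print(collect_all_pages(lambda page: fake_site.get(page, [])))
# ['Artichokes', 'Arugula', 'Chicory', 'Corn']
```

The `max_pages` cap is a safety net in case the site repeats its last page instead of returning nothing.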

In [6]:
produce = [item for pages in [get_produce(i) for i in range(1,5)] for item in pages]
produce

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed',
 'Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi',
 'Komatsuna',
 'Lambsquarters',
 'Leeks',
 'Lettuce',
 'Mushrooms',
 'Mustard greens',
 'Nettles',
 'Okra',
 'Onions',
 'Orach',
 'Parsnips',
 'Pea shoots',
 'Peas',
 'Peppers, chile',
 'Peppers, sweet',
 'Potatoes',
 'Purslane',
 'Radishes',
 'Romanesco',
 'Rutabagas',
 'Salsify',
 'Scallions',
 'Shallots',
 'Shelling beans',
 'Spinach',
 'Sprouts',
 'Squash, summer',
 'Squash, winter',
 'Sunchokes',
 'Sweet potatoes',
 'Taro root',
 'Tatsoi',
 'To

#### How to get the months?

Visit https://foodwise.org/foods/corn/

In [7]:
def get_months(product): 
    time.sleep(0.5)
    url = "https://foodwise.org/foods/" + product + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    try: # the XPath result may have fewer than two text nodes, so guard the [1] lookup
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() # remove (In Season) or any non-alphanumeric content
    except IndexError:
        month = []
    
    return month

In [8]:
month = get_months('corn')
month 

['June', 'July', 'August', 'September', 'October']
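
To see what the regular expression in `get_months` does, it helps to run it on a single string. The input below is a made-up text node, shaped like what the XPath query might return; the real whitespace on the page may differ.

```python
from re import sub

# Hypothetical text node (the actual page content may look different).
raw = "\n\t\tJune, July, August, September, October\n\t"

# Replace the literal phrase "In Season" and every non-word character
# (commas, tabs, newlines, spaces) with a space, then split on whitespace.
months = sub(r'(In Season)|\W', ' ', raw).split()
print(months)
# ['June', 'July', 'August', 'September', 'October']
```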

#### Iterate over produce items

In [9]:
seasonality_info = [get_months(p) for p in produce]

HTTPError: 404 Client Error: Not Found for url: https://foodwise.org/foods/Peppers,%20chile/

Go back to [Page 3](https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable&_paged=3). If we click on the link for Peppers, chile, it opens a different page:
https://foodwise.org/foods/peppers-chile/

__SOLUTION:__ Get the links, rather than the names!
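
The last path segment of such a link (the "slug") is what identifies the page, not the display name. If you need it as a key, for example for file names, a small sketch using only the standard library could look like this; `slug_from_link` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def slug_from_link(link):
    """Return the last non-empty path segment of a product URL."""
    path = urlparse(link).path  # e.g. '/foods/peppers-chile/'
    return path.rstrip('/').rsplit('/', 1)[-1]

print(slug_from_link('https://foodwise.org/foods/peppers-chile/'))
# peppers-chile
```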

### SECOND ATTEMPT

#### Get Links

In [10]:
def get_products(page):
    time.sleep(0.5)
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = session.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    return html.xpath('//a[@class="card-image-title__outer-link"]/@href')

In [11]:
all_veggies = [el for p in range(1,5) for el in get_products(p)]

In [12]:
all_veggies

['https://foodwise.org/foods/artichokes/',
 'https://foodwise.org/foods/arugula/',
 'https://foodwise.org/foods/asparagus/',
 'https://foodwise.org/foods/beets/',
 'https://foodwise.org/foods/bitter-melon/',
 'https://foodwise.org/foods/bok-choy/',
 'https://foodwise.org/foods/broccoli/',
 'https://foodwise.org/foods/broccoli-rabe/',
 'https://foodwise.org/foods/brussels-sprouts/',
 'https://foodwise.org/foods/burdock/',
 'https://foodwise.org/foods/cabbage/',
 'https://foodwise.org/foods/cactus-pads/',
 'https://foodwise.org/foods/cardoons/',
 'https://foodwise.org/foods/carrots/',
 'https://foodwise.org/foods/cauliflower/',
 'https://foodwise.org/foods/celeriac/',
 'https://foodwise.org/foods/celery/',
 'https://foodwise.org/foods/celtuce/',
 'https://foodwise.org/foods/chard/',
 'https://foodwise.org/foods/chickweed/',
 'https://foodwise.org/foods/chicory/',
 'https://foodwise.org/foods/collard-greens/',
 'https://foodwise.org/foods/corn/',
 'https://foodwise.org/foods/cress/',
 'ht

#### Get months

In [17]:
def get_months(produce_link): 
    time.sleep(0.5) # slow down between requests
    response = session.get(produce_link, headers = headers)
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return [None, []] # page could not be retrieved
    html = lx.fromstring(response.text)
    try: 
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    except IndexError:
        return [None, []] # page has no usable "In Season" section
    month = sub(r'(In Season)|\W', ' ', string).split() 
    name = html.xpath("//h1/text()")[0]
    return [name, month]

In [14]:
get_months(all_veggies[0])

['Artichokes',
 ['March',
  'April',
  'May',
  'June',
  'September',
  'October',
  'November',
  'December']]

#### Combine everything

In [18]:
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 
        'October', 'November', 'December']

In [19]:
month = get_months('https://foodwise.org/foods/artichokes/')

In [20]:
month

['Artichokes',
 ['March',
  'April',
  'May',
  'June',
  'September',
  'October',
  'November',
  'December']]

In [21]:
[item in month for item in year]

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]
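
The all-`False` result is expected: `month` is the two-element list `[name, months]`, so `item in month` compares each month name with the string `'Artichokes'` and with the inner list as a whole, never with the entries of the inner list. A minimal sketch with a shortened, hand-typed result:

```python
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
        'August', 'September', 'October', 'November', 'December']

# Same shape as the output of get_months, shortened for illustration.
result = ['Artichokes', ['March', 'April', 'May', 'June']]

# Membership is tested against the two top-level elements only.
wrong = [m in result for m in year]

# Unpack first, then test against the inner list of months.
name, months = result
right = [m in months for m in year]

print(any(wrong))  # False
print(right[:6])   # [False, False, True, True, True, True]
```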

In [25]:
def assemble_row(produce_link): 
    name, months = get_months(produce_link)
    months = [item in months for item in year]
    months.insert(0, name)
    return months

In [26]:
assemble_row('https://foodwise.org/foods/artichokes/')

['Artichokes',
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True]

In [27]:
data = [assemble_row(i) for i in all_veggies] 
data

[['Artichokes',
  False,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Arugula',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Asparagus',
  False,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  False,
  False,
  False,
  False],
 ['Beets',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Bitter melon',
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  False],
 ['Bok choy',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli rabe',
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Brussels sprouts',
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  Fal

In [28]:
df = pd.DataFrame(data)
df.shape

(76, 13)

In [29]:
df.head()

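| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Artichokes | False | False | True | True | True | True | False | False | True | True | True | True |
| 1 | Arugula | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Asparagus | False | True | True | True | True | True | False | False | False | False | False | False |
| 3 | Beets | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | Bitter melon | False | False | False | False | False | True | True | True | True | True | True | False |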


In [30]:
columnames = year.copy()
columnames.insert(0, 'Produce')
df.columns = columnames

In [31]:
df

| | Produce | January | February | March | April | May | June | July | August | September | October | November | December |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Artichokes | False | False | True | True | True | True | False | False | True | True | True | True |
| 1 | Arugula | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Asparagus | False | True | True | True | True | True | False | False | False | False | False | False |
| 3 | Beets | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | Bitter melon | False | False | False | False | False | True | True | True | True | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | Tatsoi | True | True | True | True | False | False | False | False | False | True | True | True |
| 72 | Tomatillos | False | False | False | False | False | True | True | True | True | True | True | False |
| 73 | Tomatoes | False | False | False | False | False | True | True | True | True | True | False | False |
| 74 | Turnips | True | True | True | True | True | True | True | True | True | True | True | True |


## Pro-Tips for undocumented APIs

Many websites load their data from an API in the background, but their operators prefer that the API be reached only through the website, by human users.

Let's go to the [NBA Stats Website](https://www.nba.com/stats/players/traditional?PerMode=Totals&sort=PTS&dir=-1).

Step-by-step Guide:
- Click on the link
- Open the Inspector (right-click + Inspect or shortcut)
- Reload the page
- Be confused. Too many entries!
- Click on the pause symbol to stop logging new network traffic
- Filter for NBA
- Sort the entries by type and look at the JSON responses
- Find the following link:
https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2025-26&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=

Great. We are done, right? At least, this is what I/Nicolai told you in the lectures.

Unfortunately, if you click on the link or send a request, you don't get the JSON data you want to see.

In [32]:
import requests
url = "https://stats.nba.com/stats/leaguedashplayerbiostats"

In [34]:
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
}
response = requests.get(url, headers=headers, timeout=10, params = {
    "Season":"2025-26",
    "SeasonType":"Regular Season"
})
response.raise_for_status()
response.json()

ReadTimeout: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=10)

The reason is that NBA.com wants to show these data on its website, but prefers not to share its API.

However, you can still access the API if you add headers to the request.
For this, you have to

- Right-click on the entry of the request in the inspector
- Choose Copy Value, then Copy as cURL
- Open the website [curlconverter.com](https://www.curlconverter.com)
- Paste the copied command into the main field
- Copy the resulting Python code
- Paste it into your Jupyter notebook

In [35]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.9',
    # 'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Referer': 'https://www.nba.com/',
    'Origin': 'https://www.nba.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
}

params = {
    'College': '',
    'Conference': '',
    'Country': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'DraftPick': '',
    'DraftYear': '',
    'GameScope': '',
    'GameSegment': '',
    'Height': '',
    'ISTRound': '',
    'LastNGames': '0',
    'LeagueID': '00',
    'Location': '',
    'MeasureType': 'Base',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PaceAdjust': 'N',
    'PerMode': 'Totals',
    'Period': '0',
    'PlayerExperience': '',
    'PlayerPosition': '',
    'PlusMinus': 'N',
    'Rank': 'N',
    'Season': '2025-26',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'ShotClockRange': '',
    'StarterBench': '',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
    'Weight': '',
}

response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats', params=params, headers=headers)

In [36]:
response.raise_for_status()

In [37]:
raw_data = response.json()

In [38]:
type(raw_data)

dict

In [39]:
raw_data.keys()

dict_keys(['resource', 'parameters', 'resultSets'])

In [40]:
tmp = raw_data['resultSets']

In [41]:
type(tmp)

list

In [42]:
len(tmp)

1

In [43]:
tmp[0]

{'name': 'LeagueDashPlayerStats',
 'headers': ['PLAYER_ID',
  'PLAYER_NAME',
  'NICKNAME',
  'TEAM_ID',
  'TEAM_ABBREVIATION',
  'AGE',
  'GP',
  'W',
  'L',
  'W_PCT',
  'MIN',
  'FGM',
  'FGA',
  'FG_PCT',
  'FG3M',
  'FG3A',
  'FG3_PCT',
  'FTM',
  'FTA',
  'FT_PCT',
  'OREB',
  'DREB',
  'REB',
  'AST',
  'TOV',
  'STL',
  'BLK',
  'BLKA',
  'PF',
  'PFD',
  'PTS',
  'PLUS_MINUS',
  'NBA_FANTASY_PTS',
  'DD2',
  'TD3',
  'WNBA_FANTASY_PTS',
  'GP_RANK',
  'W_RANK',
  'L_RANK',
  'W_PCT_RANK',
  'MIN_RANK',
  'FGM_RANK',
  'FGA_RANK',
  'FG_PCT_RANK',
  'FG3M_RANK',
  'FG3A_RANK',
  'FG3_PCT_RANK',
  'FTM_RANK',
  'FTA_RANK',
  'FT_PCT_RANK',
  'OREB_RANK',
  'DREB_RANK',
  'REB_RANK',
  'AST_RANK',
  'TOV_RANK',
  'STL_RANK',
  'BLK_RANK',
  'BLKA_RANK',
  'PF_RANK',
  'PFD_RANK',
  'PTS_RANK',
  'PLUS_MINUS_RANK',
  'NBA_FANTASY_PTS_RANK',
  'DD2_RANK',
  'TD3_RANK',
  'WNBA_FANTASY_PTS_RANK',
  'TEAM_COUNT'],
 'rowSet': [[1630639,
   'A.J. Lawson',
   'A.J.',
   1610612761,
   'T

In [44]:
tbl = pd.DataFrame(tmp[0]['rowSet'])

In [45]:
clm_names = tmp[0]['headers']

In [46]:
tbl.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,57,58,59,60,61,62,63,64,65,66
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,25.0,13,6,7,0.462,...,87,106,420,426,343,435,222,28,432,1
1,1631260,AJ Green,AJ,1610612749,MIL,26.0,49,22,27,0.449,...,252,463,233,142,179,181,222,28,157,1
2,1642358,AJ Johnson,AJ,1610612742,DAL,21.0,28,5,23,0.179,...,288,106,409,403,287,411,222,28,410,2
3,203932,Aaron Gordon,Aaron,1610612743,DEN,30.0,23,17,6,0.739,...,216,183,152,193,41,243,100,28,236,1
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,29.0,35,22,13,0.629,...,193,224,301,304,148,353,222,28,339,1


In [47]:
tbl.columns = clm_names

In [48]:
tbl.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,25.0,13,6,7,0.462,...,87,106,420,426,343,435,222,28,432,1
1,1631260,AJ Green,AJ,1610612749,MIL,26.0,49,22,27,0.449,...,252,463,233,142,179,181,222,28,157,1
2,1642358,AJ Johnson,AJ,1610612742,DAL,21.0,28,5,23,0.179,...,288,106,409,403,287,411,222,28,410,2
3,203932,Aaron Gordon,Aaron,1610612743,DEN,30.0,23,17,6,0.739,...,216,183,152,193,41,243,100,28,236,1
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,29.0,35,22,13,0.629,...,193,224,301,304,148,353,222,28,339,1
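
The `headers`/`rowSet` layout is easy to reshape even without pandas: pair every row with the list of column names. The sketch below uses a tiny hand-made payload that only mimics the structure of `raw_data['resultSets'][0]` (three columns, two rows); the helper name `result_set_to_records` is hypothetical.

```python
def result_set_to_records(result_set):
    """Pair each row in 'rowSet' with the column names in 'headers'."""
    headers = result_set['headers']
    return [dict(zip(headers, row)) for row in result_set['rowSet']]

# Hand-made stand-in for one entry of raw_data['resultSets'].
sample = {
    'name': 'LeagueDashPlayerStats',
    'headers': ['PLAYER_ID', 'PLAYER_NAME', 'TEAM_ABBREVIATION'],
    'rowSet': [[1630639, 'A.J. Lawson', 'TOR'],
               [1631260, 'AJ Green', 'MIL']],
}

records = result_set_to_records(sample)
print(records[0]['PLAYER_NAME'])  # A.J. Lawson
```

With pandas, `pd.DataFrame(result_set['rowSet'], columns=result_set['headers'])` builds the named table in a single step.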


In [49]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.9',
    # 'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Referer': 'https://www.nba.com/',
    'Origin': 'https://www.nba.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Priority': 'u=4',
}

params = {
    'College': '',
    'Conference': '',
    'Country': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'DraftPick': '',
    'DraftYear': '',
    'GameScope': '',
    'GameSegment': '',
    'Height': '',
    'ISTRound': '',
    'LastNGames': '0',
    'LeagueID': '00',
    'Location': '',
    'MeasureType': 'Base',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PaceAdjust': 'N',
    'PerMode': 'Totals',
    'Period': '0',
    'PlayerExperience': '',
    'PlayerPosition': '',
    'PlusMinus': 'N',
    'Rank': 'N',
    'Season': '2022-23',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'ShotClockRange': '',
    'StarterBench': '',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
    'Weight': '',
}

response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats', params=params, headers=headers)

In [50]:
import requests

def get_season(year):
    
    params = {
        'College': '',
        'Conference': '',
        'Country': '',
        'DateFrom': '',
        'DateTo': '',
        'Division': '',
        'DraftPick': '',
        'DraftYear': '',
        'GameScope': '',
        'GameSegment': '',
        'Height': '',
        'ISTRound': '',
        'LastNGames': '0',
        'LeagueID': '00',
        'Location': '',
        'MeasureType': 'Base',
        'Month': '0',
        'OpponentTeamID': '0',
        'Outcome': '',
        'PORound': '0',
        'PaceAdjust': 'N',
        'PerMode': 'Totals',
        'Period': '0',
        'PlayerExperience': '',
        'PlayerPosition': '',
        'PlusMinus': 'N',
        'Rank': 'N',
        'Season': year,
        'SeasonSegment': '',
        'SeasonType': 'Regular Season',
        'ShotClockRange': '',
        'StarterBench': '',
        'TeamID': '0',
        'VsConference': '',
        'VsDivision': '',
        'Weight': '',
    }
    time.sleep(1) # don't forget to slow down the process!
    response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats', params=params, headers=headers)
    response.raise_for_status()
    result = response.json()['resultSets'][0]
    tbl = pd.DataFrame(result['rowSet'])
    tbl.columns = result['headers']

    return tbl

In [51]:
get_season('2025-26')

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,25.0,13,6,7,0.462,...,87,106,420,426,343,435,222,28,432,1
1,1631260,AJ Green,AJ,1610612749,MIL,26.0,49,22,27,0.449,...,252,463,233,142,179,181,222,28,157,1
2,1642358,AJ Johnson,AJ,1610612742,DAL,21.0,28,5,23,0.179,...,288,106,409,403,287,411,222,28,410,2
3,203932,Aaron Gordon,Aaron,1610612743,DEN,30.0,23,17,6,0.739,...,216,183,152,193,41,243,100,28,236,1
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,29.0,35,22,13,0.629,...,193,224,301,304,148,353,222,28,339,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
527,1641744,Zach Edey,Zach,1610612763,MEM,23.0,11,7,4,0.636,...,174,179,283,348,85,335,73,28,343,1
528,203897,Zach LaVine,Zach,1610612758,SAC,30.0,39,9,30,0.231,...,415,343,122,75,526,146,222,28,126,1
529,1630192,Zeke Nnaji,Zeke,1610612743,DEN,25.0,39,27,12,0.692,...,318,239,275,334,446,326,122,28,330,1
530,1630533,Ziaire Williams,Ziaire,1610612751,BKN,24.0,38,11,27,0.289,...,216,332,202,209,495,245,222,28,233,1


In [52]:
get_season('2024-25')

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,24.0,26,14,12,0.538,...,241,200,327,337,299,375,159,44,364,1
1,1631260,AJ Green,AJ,1610612749,MIL,25.0,73,44,29,0.603,...,99,498,290,214,62,247,281,44,225,1
2,1642358,AJ Johnson,AJ,1610612764,WAS,20.0,29,8,21,0.276,...,312,217,369,347,495,379,281,44,372,2
3,203932,Aaron Gordon,Aaron,1610612743,DEN,29.0,51,33,18,0.647,...,451,301,107,139,41,183,134,44,181,1
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,28.0,62,39,23,0.629,...,224,259,292,292,115,322,281,44,315,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,203897,Zach LaVine,Zach,1610612758,SAC,30.0,74,32,42,0.432,...,501,418,46,12,414,40,93,44,23,2
565,1630192,Zeke Nnaji,Zeke,1610612743,DEN,24.0,57,36,21,0.632,...,254,252,327,365,323,343,281,44,357,1
566,1630533,Ziaire Williams,Ziaire,1610612751,BKN,23.0,63,22,41,0.349,...,458,476,142,178,551,189,159,44,184,1
567,1629627,Zion Williamson,Zion,1610612740,NOP,24.0,30,10,20,0.333,...,536,301,80,141,369,180,77,15,190,1


In [53]:
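# Build the season labels '2020-21', ..., '2025-26' without overwriting the list `year`
seasons = [str(2000 + y) + '-' + str(y + 1) for y in range(20, 26)]
# Collect one table per season and concatenate them once at the end
df = pd.concat([get_season(s) for s in seasons], ignore_index=True)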

df.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
0,203932,Aaron Gordon,Aaron,1610612743,DEN,25.0,50,29,21,0.58,...,409,314,95,156,124,150,115,17,152,2
1,1628988,Aaron Holiday,Aaron,1610612754,IND,24.0,66,30,36,0.455,...,432,325,201,203,218,244,178,29,232,1
2,1630174,Aaron Nesmith,Aaron,1610612738,BOS,21.0,46,22,24,0.478,...,237,301,354,343,258,347,245,29,344,1
3,1627846,Abdel Nader,Abdel,1610612756,PHX,27.0,24,16,8,0.667,...,149,152,328,367,162,380,245,29,377,1
4,1629690,Adam Mokoka,Adam,1610612741,CHI,22.0,14,3,11,0.214,...,39,34,519,500,262,510,245,29,509,1
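
The season labels passed to `get_season` follow the pattern `'2025-26'`. Building the second part from the two-digit year plus one works for 2020 onward, but it would drop the leading zero for a season such as 2008-09. A small sketch of a more general formatter follows; the helper name `season_label` is hypothetical, and the assumption that the API expects a zero-padded two-digit end year is inferred only from the labels used in this lecture.

```python
def season_label(start_year):
    """Format a season start year, e.g. 2025 -> '2025-26'."""
    # Assumption: the end year is always written with two digits.
    return f"{start_year}-{(start_year + 1) % 100:02d}"

print([season_label(y) for y in range(2020, 2026)])
# ['2020-21', '2021-22', '2022-23', '2023-24', '2024-25', '2025-26']
```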


In [54]:
df.shape

(3357, 67)

In [55]:
df.sort_values('PLAYER_NAME')

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
1684,1630639,A.J. Lawson,A.J.,1610612742,DAL,23.0,42,27,15,0.643,...,211,148,415,385,192,400,257,38,405,1
2256,1630639,A.J. Lawson,A.J.,1610612761,TOR,24.0,26,14,12,0.538,...,241,200,327,337,299,375,159,44,364,1
2825,1630639,A.J. Lawson,A.J.,1610612761,TOR,25.0,13,6,7,0.462,...,87,106,420,426,343,435,222,28,432,1
1145,1630639,A.J. Lawson,A.J.,1610612742,DAL,22.0,15,5,10,0.333,...,96,74,457,437,369,466,253,39,461,2
1685,1631260,AJ Green,AJ,1610612749,MIL,24.0,56,35,21,0.625,...,121,223,375,313,145,369,257,38,343,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2255,1629627,Zion Williamson,Zion,1610612740,NOP,23.0,70,42,28,0.600,...,571,495,13,25,89,31,54,38,35,1
539,1629627,Zion Williamson,Zion,1610612740,NOP,20.0,61,29,32,0.475,...,540,457,6,10,107,16,33,29,15,1
3356,1629627,Zion Williamson,Zion,1610612740,NOP,25.0,40,12,28,0.300,...,519,397,23,49,477,75,73,28,78,1
1144,1629597,Zylan Cheatham,Zylan,1610612762,UTA,26.0,1,0,1,0.000,...,1,1,564,582,332,596,268,40,596,1


In [56]:
df.sort_values('NBA_FANTASY_PTS_RANK')

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
2679,203999,Nikola Jokić,Nikola,1610612743,DEN,30.0,70,46,24,0.657,...,540,505,4,3,2,1,2,1,1,1
2118,203999,Nikola Jokić,Nikola,1610612743,DEN,29.0,79,55,24,0.696,...,541,542,6,5,1,1,2,2,2,1
1544,203999,Nikola Jokić,Nikola,1610612743,DEN,28.0,69,48,21,0.696,...,476,470,11,17,1,1,2,1,4,1
989,203999,Nikola Jokić,Nikola,1610612743,DEN,27.0,74,46,28,0.622,...,586,577,3,5,7,1,1,1,1,1
3330,1630178,Tyrese Maxey,Tyrese,1610612755,PHI,25.0,52,29,23,0.558,...,510,475,24,2,94,1,62,28,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
661,1630525,David Johnson,David,1610612761,TOR,21.0,2,2,0,1.000,...,1,1,564,582,272,596,268,40,596,1
601,101139,CJ Miles,CJ,1610612738,BOS,35.0,1,1,0,1.000,...,1,23,564,582,287,596,268,40,596,1
672,1630610,DeJon Jarreau,DeJon,1610612754,IND,24.0,1,0,1,0.000,...,1,1,564,582,252,596,268,40,596,1
546,1630278,Ade Murkey,Ade,1610612758,SAC,24.0,1,0,1,0.000,...,1,1,564,582,267,596,268,40,596,1


In [84]:
df.sort_values('NBA_FANTASY_PTS_RANK').head(10)

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT
399,203999,Nikola Jokić,Nikola,1610612743,DEN,26.0,72,47,25,0.653,...,517,529,4,3,10,1,1,2,1,1
989,203999,Nikola Jokić,Nikola,1610612743,DEN,27.0,74,46,28,0.622,...,586,577,3,5,7,1,1,1,1,1
3328,1630178,Tyrese Maxey,Tyrese,1610612755,PHI,25.0,52,29,23,0.558,...,508,474,24,2,94,1,62,27,1,1
2118,203999,Nikola Jokić,Nikola,1610612743,DEN,29.0,79,55,24,0.696,...,541,542,6,5,1,1,2,2,2,1
1544,203999,Nikola Jokić,Nikola,1610612743,DEN,28.0,69,48,21,0.696,...,476,470,11,17,1,1,2,1,4,1
2679,203999,Nikola Jokić,Nikola,1610612743,DEN,30.0,70,46,24,0.657,...,540,505,4,3,2,1,2,1,1,1
2746,1628983,Shai Gilgeous-Alexander,Shai,1610612760,OKC,26.0,76,63,13,0.829,...,561,511,2,1,1,2,93,44,2,1
452,201566,Russell Westbrook,Russell,1610612764,WAS,32.0,65,30,35,0.462,...,535,528,17,19,433,2,2,1,3,1
1485,1629029,Luka Dončić,Luka,1610612742,DAL,24.0,66,33,33,0.5,...,459,457,4,3,83,2,11,3,2,1
2056,1629029,Luka Dončić,Luka,1610612742,DAL,25.0,70,46,24,0.657,...,497,475,4,1,32,2,6,3,1,1


Find a Wikipedia entry about [Nikola Jokic](https://en.wikipedia.org/wiki/Nikola_Joki%C4%87)

## Summary 

- Scraping does not necessarily return the desired result; make use of error handling
- Use the browser's developer tools to see how the website is structured
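
As an illustration of the first point, a loop over many pages can record failures instead of stopping at the first error. This is only a sketch under simple assumptions: `scrape_all` and `fake_scrape` are hypothetical names, and the stand-in scraper raises a `ValueError` where a real one would raise, for example, `requests.HTTPError`.

```python
def scrape_all(links, scrape_one, errors=(Exception,)):
    """Apply scrape_one to every link; collect failures instead of stopping."""
    results, failed = [], []
    for link in links:
        try:
            results.append(scrape_one(link))
        except errors as exc:
            failed.append((link, str(exc)))  # remember what went wrong
    return results, failed

# Stand-in scraper so that the sketch runs offline: it fails on one "page".
def fake_scrape(link):
    if 'missing' in link:
        raise ValueError('404 Not Found')
    return link.upper()

results, failed = scrape_all(['corn', 'missing-page', 'kale'], fake_scrape,
                             errors=(ValueError,))
print(results)  # ['CORN', 'KALE']
print(failed)   # [('missing-page', '404 Not Found')]
```

Passing the expected exception types explicitly keeps unrelated bugs (such as a typo in the scraper) from being silently swallowed.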