# Barttorvik Full Season

Load the Barttorvik ratings from previous full seasons (regular season plus postseason) to build features describing past team strength. This notebook assumes the data for each season has been copied from the website (https://barttorvik.com/#) into Excel and saved as a CSV, as specified in the steps below.

Steps to complete before running this notebook:

1. Copy the table from the website, excluding the initial row with the D1 averages.
2. Format the REC column in Excel as text before pasting (otherwise Excel tries to convert the records into dates).
3. Paste into Excel with "Match Destination Formatting".
4. Save as a CSV named `barttorvik_full_season_{season}.csv` in `../data/unprocessed/barttorvik_full_season/`.
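
Step 2 is easy to miss, so a quick check of a saved file can help. The following is a minimal sketch, not part of the original workflow: the inline sample, the `wins-losses` record format, and the `3-Jan` value (the kind of text a date conversion can leave behind) are all illustrative assumptions.

```python
import io
import re

import pandas as pd

# Hypothetical excerpt of a saved season file; the real files have more columns.
sample_csv = io.StringIO(
    "RK,TEAM,REC,BARTHAG\n"
    "1,Kansas,37-3,0.9825\n"
    "2,Memphis,3-Jan,0.9715\n"  # assumed example of a record mangled into a date
)

# Read every column as text so pandas does not reinterpret the records.
sample = pd.read_csv(sample_csv, dtype=str)

# Assumption: a valid record looks like "wins-losses", e.g. "37-3".
record_pattern = re.compile(r"^\d+-\d+$")
is_valid = sample["REC"].map(lambda rec: bool(record_pattern.match(rec)))

# Rows whose record does not match the expected pattern.
bad_rows = sample.loc[~is_valid, ["TEAM", "REC"]]
print(bad_rows)
```

If `bad_rows` is not empty for a real file, the REC column was probably not set to text before pasting, and the file should be re-exported.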

In [1]:
import pandas as pd

pd.set_option('display.max_columns', 100)

df = pd.concat(
    [
        pd.read_csv(f'../data/unprocessed/barttorvik_full_season/barttorvik_full_season_{season}.csv')
        .assign(Season=season)
        for season in range(2008, 2024)
    ],
    ignore_index=True
)

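# Coerce RK to numeric; any non-numeric entries become NaN
df['RK'] = pd.to_numeric(df['RK'], errors='coerce')

# Keep only rows with a valid rank, and only the columns needed downstream
df = df.loc[df['RK'].notna(), ['Season', 'TEAM', 'BARTHAG']].reset_index(drop=True)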

df

Unnamed: 0,Season,TEAM,BARTHAG
0,2008,Kansas,0.9825
1,2008,Memphis,0.9715
2,2008,UCLA,0.9664
3,2008,North Carolina,0.9659
4,2008,Wisconsin,0.96
...,...,...,...
5593,2023,Delaware St.,0.0913
5594,2023,IUPUI,0.0826
5595,2023,Green Bay,0.0669
5596,2023,Hartford,0.0418


Some teams (like the Ivy League teams in 2021) are missing from certain seasons, so add a row for every team-season combination and leave the missing ratings as NA.
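
To make the effect concrete, here is a minimal sketch on made-up data. It uses `pd.MultiIndex.from_product` with `reindex` rather than the merge used in this notebook; the team rows and ratings are invented for illustration.

```python
import pandas as pd

# Toy ratings (made-up values): Yale has no 2021 row.
toy = pd.DataFrame(
    {
        "Season": [2020, 2021, 2020],
        "TEAM": ["Kansas", "Kansas", "Yale"],
        "BARTHAG": [0.95, 0.90, 0.60],
    }
)

# Every (Season, TEAM) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [sorted(toy["Season"].unique()), sorted(toy["TEAM"].unique())],
    names=["Season", "TEAM"],
)

# Reindexing onto the full grid inserts NaN for the missing pairs.
toy_full = (
    toy.set_index(["Season", "TEAM"])
    .reindex(full_index)
    .reset_index()
)
print(toy_full)
```

The result has one row per team per season, with a missing `BARTHAG` for Yale in 2021.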

In [2]:
df_full_teams = pd.DataFrame(
    [(season, team) for team in df['TEAM'].unique() for season in df['Season'].unique()],
    columns=['Season', 'TEAM']
)

df_full_teams

Unnamed: 0,Season,TEAM
0,2008,Kansas
1,2009,Kansas
2,2010,Kansas
3,2011,Kansas
4,2012,Kansas
...,...,...
5867,2019,Lindenwood
5868,2020,Lindenwood
5869,2021,Lindenwood
5870,2022,Lindenwood


In [3]:
df = (
    pd.merge(
        df, 
        df_full_teams,
        how='right',
        on=['Season', 'TEAM']
    )
    .sort_values(
        ['Season', 'TEAM'], 
        ignore_index=True
    )
)

df

Unnamed: 0,Season,TEAM,BARTHAG
0,2008,Abilene Christian,
1,2008,Air Force,0.5644
2,2008,Akron,0.7541
3,2008,Alabama,0.76
4,2008,Alabama A&M,0.1214
...,...,...,...
5867,2023,Wright St.,0.4449
5868,2023,Wyoming,0.536
5869,2023,Xavier,0.8891
5870,2023,Yale,0.7534


In [4]:
df['Past 4 Years BARTHAG'] = (
    df
    .groupby('TEAM')
    ['BARTHAG']
    .rolling(window=4, min_periods=2)  # need at least 2 seasons of data in the window
    .mean()
    .reset_index(level='TEAM', drop=True)  # drop the group level to align on the original row index
)

df = df.rename(columns={'BARTHAG': 'Past Year BARTHAG'})

df['Season'] += 1  # shift forward a year so each row holds ratings from prior seasons, not the current one

# 2012 is the first season with four prior seasons (2008-2011) of data
df = df.loc[df['Season'] >= 2012, :].reset_index(drop=True)

df

Unnamed: 0,Season,TEAM,Past Year BARTHAG,Past 4 Years BARTHAG
0,2012,Abilene Christian,,
1,2012,Air Force,0.5782,0.463200
2,2012,Akron,0.6049,0.678300
3,2012,Alabama,0.8419,0.782050
4,2012,Alabama A&M,0.1283,0.102125
...,...,...,...,...
4766,2024,Wright St.,0.4449,0.569925
4767,2024,Wyoming,0.536,0.540900
4768,2024,Xavier,0.8891,0.835725
4769,2024,Yale,0.7534,0.684333
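
The interaction between the rolling window, `min_periods`, and the season shift can be seen on a small example. This is a minimal sketch with a hypothetical team (`Example U`) and made-up ratings, including one missing season.

```python
import numpy as np
import pandas as pd

# Made-up ratings for one hypothetical team, one row per season.
toy = pd.DataFrame(
    {
        "Season": [2008, 2009, 2010, 2011, 2012],
        "TEAM": ["Example U"] * 5,
        "BARTHAG": [0.50, np.nan, 0.70, 0.90, 0.30],
    }
)

# Mean over the current and up to three preceding rows per team,
# requiring at least two non-missing ratings in the window.
rolling_mean = (
    toy.groupby("TEAM")["BARTHAG"]
    .rolling(window=4, min_periods=2)
    .mean()
    .reset_index(level=0, drop=True)  # drop the TEAM level, keep the original row index
)
toy["Past 4 Years BARTHAG"] = rolling_mean

# Shift the label forward so each row describes the seasons before it.
toy["Season"] += 1
print(toy)
```

The first two rows have no rolling value because their windows contain fewer than two ratings. After the shift, the row labeled 2013 averages the three available ratings from 2009 through 2012, since the 2009 rating is missing.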


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4771 entries, 0 to 4770
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Season                4771 non-null   int64  
 1   TEAM                  4771 non-null   object 
 2   Past Year BARTHAG     4566 non-null   object 
 3   Past 4 Years BARTHAG  4566 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 149.2+ KB
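
The output above lists `Past Year BARTHAG` as `object` rather than `float64`. One possible cause, which is an assumption and not confirmed by this notebook, is that the raw column was read as text or mixed values. A minimal sketch of coercing such a column, using made-up values:

```python
import pandas as pd

# Made-up column mimicking ratings stored as text alongside a missing value.
ratings = pd.Series(["0.5782", "0.6049", None], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(ratings, errors="coerce")
print(numeric.dtype)
```

Applying the same call to the ratings column before saving would give downstream code a numeric column, at the cost of silently turning unparseable entries into NaN.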


In [6]:
df.to_csv('../data/preprocessed/barttorvik_full_season/barttorvik_full_season.csv', index=False)

'Done'

'Done'