# Retrieving College Basketball Data

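In this notebook, I retrieve the scores, basic team stats, kenpom, and T-Rank data for college basketball teams. The notebook was originally written for the 2002 to 2017 seasons; the cells below, as last run, download only the 2024 season. The data is saved in csv files so that it can be accessed later by other notebooks.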

In [15]:
# Import packages
import sys
sys.path.append('./')

import datetime
import pandas as pd
import collegebasketball as cbb
cbb.__version__

'2023'

In [16]:
START_YEAR = 2012
END_YEAR = 2024
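
The two constants above are not referenced by the download loops in this notebook, which hard-code `range(2024, 2025)`. As a minimal sketch, assuming the intent is to cover every season from `START_YEAR` to `END_YEAR`, the list of seasons and the date window for each could be built as follows. The helper names `seasons_to_fetch` and `season_window` and the `skip` parameter are hypothetical; the November 1 to April 10 window mirrors the dates used in the scores cell, and 2020 is skipped by default because the NCAA tournament was not played that year.

```python
import datetime

START_YEAR = 2012
END_YEAR = 2024


def seasons_to_fetch(start_year, end_year, skip=(2020,)):
    """Return the season years from start_year to end_year inclusive, minus any in skip."""
    return [year for year in range(start_year, end_year + 1) if year not in skip]


def season_window(year):
    """Return the (start, end) dates for the season that ends in `year`."""
    start = datetime.date(year - 1, 11, 1)
    end = datetime.date(year, 4, 10)
    return start, end


# Show the date window that would be requested for each season
for year in seasons_to_fetch(START_YEAR, END_YEAR):
    start, end = season_window(year)
    print(year, start, end)
```

Each download loop could then iterate over `seasons_to_fetch(START_YEAR, END_YEAR)` instead of a fixed range, at the cost of re-downloading seasons that are already on disk.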

## Getting the Scores Data

The scores are from https://www.sports-reference.com/cbb/. The code below is what I used to download the college basketball scores, originally for the 2002 to 2017 seasons; as run here, it fetches the 2024 season.

For each season, I create a csv file with all of the game scores for that season. Each record contains the team names, the scores, and the tournament the game was played in (if applicable).

In [17]:
# The location where the files will be saved
path = './Data/Scores/'

In [18]:
# Create one csv file per season, covering the regular season and the tournament games.
# As written, only 2024 is fetched; widen the range for more seasons (you might want to ignore 2020).
for year in range(2024, 2025):

    # Set up the starting and ending dates so they span the regular season and March Madness
    start = datetime.date(year - 1, 11, 1)
    end = datetime.date(year, 4, 10)
    
    # Set up the path for this year's scores
    path_regular = path + str(year) + '_season.csv'

    # Download the scores between start and end and save them to the csv file for the year
    cbb.load_scores_dataframe(start, end, csv_file_path=path_regular, debug=True)
    print('{} Done'.format(year))

2023-11-01: 0
2023-11-02: 0
2023-11-03: 0
2023-11-04: 0
2023-11-05: 0
2023-11-06: 184
2023-11-07: 29
2023-11-08: 30
2023-11-09: 48
2023-11-10: 81
2023-11-11: 58
2023-11-12: 35
2023-11-13: 39
2023-11-14: 80
2023-11-15: 38
2023-11-16: 39
2023-11-17: 84
2023-11-18: 53
2023-11-19: 63
2023-11-20: 60
2023-11-21: 52
2023-11-22: 69
2023-11-23: 16
2023-11-24: 65
2023-11-25: 71
2023-11-26: 53
2023-11-27: 22
2023-11-28: 40
2023-11-29: 79
2023-11-30: 28
2023-12-01: 34
2023-12-02: 95
2023-12-03: 40
2023-12-04: 10
2023-12-05: 60
2023-12-06: 69
2023-12-07: 10
2023-12-08: 10
2023-12-09: 118
2023-12-10: 39
2023-12-11: 15
2023-12-12: 33
2023-12-13: 27
2023-12-14: 14
2023-12-15: 13
2023-12-16: 94
2023-12-17: 37
2023-12-18: 36
2023-12-19: 51
2023-12-20: 43
2023-12-21: 81
2023-12-22: 58
2023-12-23: 18
2023-12-24: 4
2023-12-25: 0
2023-12-26: 0
2023-12-27: 2
2023-12-28: 22
2023-12-29: 58
2023-12-30: 100
2023-12-31: 28
2024-01-01: 5
2024-01-02: 36
2024-01-03: 52
2024-01-04: 56
2024-01-05: 11
2024-01-06: 146
... (output truncated)

In [19]:
# Load a dataset to take an initial look
file_path = path + '2024_season.csv'
data = pd.read_csv(file_path)

data.head()

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
0,North Carolina Central,Kansas,56.0,99.0,
1,Dartmouth,Duke,54.0,92.0,
2,Samford,Purdue,45.0,98.0,
3,James Madison,Michigan State,79.0,76.0,
4,Northern Illinois,Marquette,70.0,92.0,


In [20]:
# Let's take a look at all the tournament games involving Sam Houston St. in the 2024 data
data = cbb.filter_tournament(data)
data[(data['Home'] == 'Sam Houston St.') | (data['Away'] == 'Sam Houston St.')]

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
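
The filter above returns no rows. One possible reason is a spelling mismatch: the scores file spells out names such as `Michigan State`, while the filter uses the abbreviated `Sam Houston St.`. Another is simply that the team has no tournament games in the file. Below is a minimal sketch for checking which spellings actually occur, using a hypothetical helper `find_team_names` that works on any iterable of names. The sample names are illustrative only and are not read from the file.

```python
def find_team_names(names, fragment):
    """Return the sorted unique names that contain `fragment`, ignoring case."""
    needle = fragment.lower()
    matches = {
        name for name in names
        # Skip missing values, which are not strings
        if isinstance(name, str) and needle in name.lower()
    }
    return sorted(matches)


# Illustrative names only; the 'Sam Houston State' spelling is an assumption
sample_names = ['Kansas', 'Michigan State', 'Sam Houston State', 'Samford', None]
print(find_team_names(sample_names, 'sam houston'))
```

With the notebook's data, the names could be passed as `list(data['Home']) + list(data['Away'])`, taken before `cbb.filter_tournament` is applied so that regular season games are included in the search.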


## Getting Basic Team Stats

The team stats data is also from https://www.sports-reference.com/cbb/. This data contains basic basketball statistics for each team at the end of each season. These stats will later be used to train the model and evaluate teams.

In [21]:
# The location where the files will be saved
path = './Data/SportsReference/'

# We will be creating a csv file of data for each season
for year in range(2024, 2025):
    
    # Set the path for the current year data
    stats_path = path + str(year) + '_stats.csv'
    
    # Save the basic stats data into a csv file
    cbb.load_stats_dataframe(year=year, csv_file_path=stats_path, debug=True)

    print('{} Done'.format(year))

361
2024 Done


In [22]:
# Load some data to take a look
stats_path = path + '2024_stats.csv'
data = pd.read_csv(stats_path)

data.head()

Unnamed: 0,School,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Abilene Christian,33,-3.47,-1.22,2393,2414,1335,854,1858,0.46,...,542,746,0.727,322,1108,421,258,70,418,653
1,Air Force,31,-4.22,1.98,2051,2243,1250,774,1631,0.475,...,324,475,0.682,225,872,451,202,122,372,541
2,Akron,34,3.0,-2.53,2517,2239,1365,827,1939,0.427,...,463,636,0.728,354,1252,447,192,99,389,563
3,Alabama,32,21.06,11.38,2904,2594,1290,884,2008,0.44,...,572,730,0.784,407,1267,515,232,133,383,636
4,Alabama A&M,33,-14.69,-7.59,2267,2501,1325,801,1877,0.427,...,622,865,0.719,373,1170,344,251,127,534,695


## Getting the Kenpom Data

The kenpom data is from https://kenpom.com. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [23]:
# The location where the files will be saved
path = './Data/Kenpom/'

# We will be creating a csv file of kenpom data for each season
for year in range(2024, 2025):
    
    # Set the path for the current year data
    kp_path = path + str(year) + '_kenpom.csv'
    
    # Save the kenpom data into a csv file
    cbb.load_kenpom_dataframe(year=year, csv_file_path=kp_path)

In [24]:
# Load some data to take a look
kp_path = path + '2024_kenpom.csv'
data = pd.read_csv(kp_path)

data.head()

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
0,1,Connecticut,1.0,BE,31,3,31.67,126.4,1,94.7,...,0.047,72,10.04,42,111.4,47,101.4,37,-3.23,284
1,2,Houston,1.0,B12,30,4,31.45,118.7,18,87.3,...,0.053,59,11.6,15,111.8,38,100.2,7,-0.78,227
2,3,Purdue,1.0,B10,29,4,29.07,125.0,4,95.9,...,0.045,76,13.81,4,114.1,5,100.3,8,10.36,13
3,4,Auburn,4.0,SEC,27,7,28.87,120.6,10,91.7,...,-0.067,324,9.69,49,111.9,34,102.2,71,1.65,148
4,5,Arizona,2.0,P12,25,8,26.77,121.2,8,94.4,...,-0.043,287,11.05,24,112.0,31,101.0,23,10.6,11


In [25]:
# Let's take a look at Sam Houston St.'s kenpom numbers for 2024
data[data['Team'] == 'Sam Houston St.']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
140,141,Sam Houston St.,,CUSA,21,12,2.36,106.0,181,103.7,...,0.066,42,0.12,168,104.4,264,104.3,108,3.12,111


## Getting the T-Rank Data

The T-Rank data is from http://www.barttorvik.com/#. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [26]:
# The location where the files will be saved
path = './Data/TRank/'

# We will be creating a csv file of data for each season
for year in range(2024, 2025):
    
    # Set the path for the current year data
    TRank_path = path + str(year) + '_TRank.csv'
    
    # Save the T-Rank data into a csv file
    cbb.load_TRank_dataframe(year=year, csv_file_path=TRank_path)
    print('{} Done'.format(year))

2024 Done


In [27]:
# Load some data to take a look
TRank_path = path + '2024_TRank.csv'
data = pd.read_csv(TRank_path)

data.head()

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,3P%D,3P%D Rank,3PR,3PR Rank,3PRD,3PRD Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
0,1,Houston,B12,34,30,4,119.0,16,85.6,1,...,30.0,13,36.6,201,40.9,294,63.4,349,10.4,3
1,2,Connecticut,BE,34,31,3,126.8,1,93.9,13,...,31.9,61,40.9,87,33.2,47,64.6,325,10.9,2
2,3,Purdue,B10,33,29,4,126.1,2,94.8,17,...,31.4,42,35.0,244,37.2,184,67.6,167,10.9,1
3,4,Iowa St.,B12,34,27,7,113.4,51,86.6,2,...,31.5,49,32.0,301,45.0,353,67.6,169,6.7,4
4,5,Auburn,SEC,34,27,7,120.7,10,92.1,5,...,29.8,9,37.5,176,33.3,50,69.9,59,5.4,9


In [28]:
# Let's take a look at Sam Houston St.'s T-Rank numbers for 2024
data[data['Team'] == 'Sam Houston St.']

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,3P%D,3P%D Rank,3PR,3PR Rank,3PRD,3PRD Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
128,129,Sam Houston St.,CUSA,31,21,12,105.5,171,101.9,115,...,32.5,92,35.0,244,39.3,251,67.7,161,-5.7,123
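
Since other notebooks read these csv files later, it can help to confirm that every expected file exists before moving on. The sketch below is one way to do that with the standard library. The helper names `expected_files` and `missing_files` and the `SOURCES` mapping are hypothetical; the folder names and file name patterns are copied from the cells above.

```python
from pathlib import Path

# Folder and file name suffix for each data source, as used in this notebook
SOURCES = {
    'Scores': 'season',
    'SportsReference': 'stats',
    'Kenpom': 'kenpom',
    'TRank': 'TRank',
}


def expected_files(years, root='./Data'):
    """Return the csv paths this notebook would write for the given years."""
    return [
        Path(root) / folder / '{}_{}.csv'.format(year, suffix)
        for year in years
        for folder, suffix in SOURCES.items()
    ]


def missing_files(years, root='./Data'):
    """Return the expected csv paths that do not exist on disk."""
    return [path for path in expected_files(years, root) if not path.is_file()]


# Report any file that a later notebook would fail to find
for path in missing_files(range(2024, 2025)):
    print('Missing:', path)
```

This only checks that the files are present; it does not verify that a download completed or that a file contains the full season.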
