**Christian Data Exploration**

This notebook outlines the steps taken to transform the data so that it is ready for model use. Specifically, it will:
- Convert binary columns like win_loss, OT, home_game, etc. into 1s and 0s, where 1 indicates that the condition is present (a win, an overtime game, a home game).
- Convert NaN values in numeric string columns to 0.
- Convert conference into a series of binary variables.
- Attempt to identify a solution to the AP Top 25 flow-in and flow-out predicament (teams entering and leaving the rankings).
- Shift the dataframe to identify the rank changes.
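
A minimal sketch of the NaN-to-0 conversion listed above, on a made-up frame that borrows two column names from the dataset (`AP/CFP`, `OT_num`). Treating the `--` placeholder as missing is an assumption, and 0 may not be a sensible fill for rank-like columns.

```python
import pandas as pd

# Toy values only; the real columns come from Cleaned_Dataset.csv.
toy = pd.DataFrame({
    "AP/CFP": ["11", "--", None],
    "OT_num": [None, 2.0, None],
})

# ASSUMPTION: anything that cannot be parsed as a number (such as "--")
# counts as missing. errors="coerce" turns those entries into NaN.
for col in ["AP/CFP", "OT_num"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce").fillna(0)

print(toy)
```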

**Steps and Workflow**
1. Data setup
2. Data Exploration
    - Explore AP Rank Data
    - Get all Unique Conference Values
3. Data Manipulation
    - Setting up AP Rank Shift
    - Setup all Binary Variables
4. Data Validation Check

**Data Concepts**
- The dataset contains two records for each game: (1) the winner and (2) the loser. I think this is fine, but rankings are zero-sum.
- AP ranks are only shown for the top 25 teams, and a row's AP rank represents the rank either before or after that game. Either way, a shift is needed to line up the before and after ranks for each game and create the differential.

**Outstanding Questions / Data Issues**
- What does "Bye" mean in the opponent field? It seems the team did not have a game during that week, so perhaps these rows could be dropped?
- There still seem to be negative values in the week column. It may be best to drop 2020 from the dataset, since it may cause errors.
- I may need to drop the first week of each season or remove its shift values, since the shift pulls in the previous season's information.
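
A minimal sketch of how the first two issues above could be handled, using a made-up frame with the same column names as the dataset (`opponent`, `week`, `season`). Whether bye weeks and the 2020 season should actually be dropped is still an open decision; the filters below only illustrate the mechanics.

```python
import pandas as pd

# Toy rows that mimic the real columns; the values are illustrative only.
toy = pd.DataFrame({
    "team": ["A", "A", "A", "A"],
    "opponent": ["Navy", "Bye", "Baylor", "Wyoming"],
    "week": [-2, -1, 1, 2],
    "season": [2020, 2020, 2021, 2021],
})

# Drop bye weeks: rows whose opponent field is the literal string "Bye".
no_byes = toy[toy["opponent"] != "Bye"]

# Drop the 2020 season, the season the note above ties to odd week values.
no_2020 = no_byes[no_byes["season"] != 2020]

# Count any non-positive week numbers that survive the filters.
remaining_bad_weeks = int((no_2020["week"] <= 0).sum())

print(no_2020)
print("rows with week <= 0:", remaining_bad_weeks)
```

Running the same count on the real dataframe before dropping anything would show whether the negative weeks are confined to 2020.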


**1. Data Setup**

In [None]:
import pandas as pd
import seaborn as sn

# Set Standard dataframe settings
pd.set_option('display.max_columns', None)
df_clean = pd.read_csv(r'C:\Users\chris\OneDrive - Indiana University\Graduate School\MIS\INFO-H 501\Projects\Group-8-Project\03 - Cleaned Data Space\Cleaned_Dataset.csv')


#df_clean.drop(columns =['Unnamed: 0','Team_id'], inplace=True)
df_clean.head(20)

Unnamed: 0,team,date,opponent,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,week,season,CONF,SOR,FPI,AP/CFP,SOS,GC,AVGWP,AP_rank,state,code
0,Air Force Falcons,"Sat, Nov 16",Oregon State,W,False,,110.0,97.0,68.0,0,28,28,TRUE,12,2024,Mountain West,108,106,--,117,115,100,,,
1,Air Force Falcons,"Sat, Aug 31",Merrimack,W,False,,71.0,63.0,21.0,6,21,15,TRUE,1,2024,Mountain West,108,106,--,117,115,100,,,
2,Air Force Falcons,"Sat, Sep 7",San José State,L,False,,54.0,50.0,36.0,17,7,-10,TRUE,2,2024,Mountain West,108,106,--,117,115,100,,,
3,Air Force Falcons,"Sat, Sep 14",Baylor,L,False,,18.0,71.0,18.0,31,3,-28,FALSE,3,2024,Mountain West,108,106,--,117,115,100,,,
4,Air Force Falcons,"Sat, Sep 21",Bye,Bye,False,,0.0,0.0,0.0,0,0,0,Bye,4,2024,Mountain West,108,106,--,117,115,100,,,
5,Air Force Falcons,"Sat, Sep 28",Wyoming,L,False,,115.0,54.0,106.0,31,19,-12,FALSE,5,2024,Mountain West,108,106,--,117,115,100,,,
6,Air Force Falcons,"Sat, Oct 5",Navy,L,False,,115.0,29.0,45.0,34,7,-27,TRUE,6,2024,Mountain West,108,106,--,117,115,100,,,
7,Air Force Falcons,"Sat, Oct 12",New Mexico,L,False,,79.0,103.0,82.0,52,37,-15,FALSE,7,2024,Mountain West,108,106,--,117,115,100,,,
8,Air Force Falcons,"Sat, Oct 19",Colorado State,L,False,,175.0,60.0,51.0,21,13,-8,TRUE,8,2024,Mountain West,108,106,--,117,115,100,,,
9,Air Force Falcons,"Sat, Oct 26",Bye,Bye,False,,0.0,0.0,0.0,0,0,0,Bye,9,2024,Mountain West,108,106,--,117,115,100,,,


**2. Data Exploration - Explore AP Rank Data** 

In [47]:
# Select all rows where AP_rank is not null, take the first 14 of them, and sort those rows by team and week
mask = df_clean['AP_rank'].notna()
df_ap_rank = df_clean[mask]
df_ap_rank.iloc[0:14].sort_values(by=['team','week']).reset_index()

Unnamed: 0,index,team,date,opponent,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,week,season,CONF,SOR,FPI,AP/CFP,SOS,GC,AVGWP,AP_rank,state,code
0,27,Alabama Crimson Tide,"Sat, Aug 31",Western Kentucky,W,False,,200.0,102.0,139.0,0,63,63,TRUE,1,2024,SEC,17,4,11,20,7,12,5.0,Alabama,333.0
1,28,Alabama Crimson Tide,"Sat, Sep 7",South Florida,W,False,,194.0,140.0,68.0,16,42,26,TRUE,2,2024,SEC,17,4,11,20,7,12,4.0,Alabama,333.0
2,29,Alabama Crimson Tide,"Sat, Sep 14",Wisconsin,W,False,,196.0,75.0,78.0,10,42,32,FALSE,3,2024,SEC,17,4,11,20,7,12,4.0,Alabama,333.0
3,30,Alabama Crimson Tide,"Sat, Sep 21",Bye,Bye,False,,0.0,0.0,0.0,0,0,0,Bye,4,2024,SEC,17,4,11,20,7,12,4.0,Alabama,333.0
4,31,Alabama Crimson Tide,"Sat, Sep 28",Georgia,W,False,,374.0,117.0,177.0,34,41,7,TRUE,5,2024,SEC,17,4,11,20,7,12,4.0,Alabama,333.0
5,32,Alabama Crimson Tide,"Sat, Oct 5",Vanderbilt,L,False,,310.0,45.0,82.0,40,35,-5,FALSE,6,2024,SEC,17,4,11,20,7,12,2.0,Alabama,333.0
6,34,Alabama Crimson Tide,"Sat, Oct 12",South Carolina,W,False,,209.0,42.0,89.0,25,27,2,TRUE,7,2024,SEC,17,4,11,20,7,12,7.0,Alabama,333.0
7,35,Alabama Crimson Tide,"Sat, Oct 19",Tennessee,L,False,,239.0,42.0,73.0,24,17,-7,FALSE,8,2024,SEC,17,4,11,20,7,12,7.0,Alabama,333.0
8,36,Alabama Crimson Tide,"Sat, Oct 26",Missouri,W,False,,215.0,79.0,82.0,0,34,34,TRUE,9,2024,SEC,17,4,11,20,7,12,15.0,Alabama,333.0
9,33,Alabama Crimson Tide,"Sat, Nov 02",Bye,Bye,False,,0.0,0.0,0.0,0,0,0,Bye,10,2024,SEC,17,4,11,20,7,12,14.0,Alabama,333.0


**2. Data Exploration - Get all Unique Conference Values**

These values will be used to create a set of binary columns, one for each conference.

In [86]:
# Groups by conference and counts unique teams in each conference.
df_clean.groupby('CONF')['team'].nunique().reset_index(name='unique_teams')

# Create the conference list for binary column creation.
# unique() returns each conference once, in order of first appearance.
conference_list = df_clean['CONF'].unique().tolist()

print(conference_list)

['Mountain West', 'MAC', 'SEC', 'Sun Belt', 'Big 12', 'American', 'ACC', 'CUSA', 'Big Ten', 'FBS Indep.', 'Pac-12']
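
As an alternative to building each conference column by hand, `pd.get_dummies` can produce the indicator columns in one call. This is a sketch on a toy frame, not the approach this notebook uses; the team names and conference assignments are made up.

```python
import pandas as pd

# Toy conference column; the real notebook would pass df_clean["CONF"].
toy = pd.DataFrame({
    "team": ["A", "B", "C"],
    "CONF": ["SEC", "MAC", "SEC"],
})

# get_dummies builds one 0/1 indicator column per distinct value.
# dtype=int keeps the result as integers rather than booleans.
dummies = pd.get_dummies(toy["CONF"], dtype=int)

# Attach the indicator columns to the original frame.
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```

One difference to keep in mind: `get_dummies` orders the new columns alphabetically, not by first appearance in the data.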


**3. Data Manipulation - Setup AP Rank Shift and AP Rank Differential**

In [None]:
# Define main dataframe, sorted by team, season, and week for manipulation
df_base = df_clean.sort_values(by=['team','season','week']).reset_index()

# Previous row's AP rank for each team (grouped by team only, so week 1 picks up the prior season's last rank)
df_base["Previous_AP_Rank"] = df_base.groupby('team')['AP_rank'].shift(1)
df_base['AP_Rank_Differential'] = df_base['AP_rank'] - df_base['Previous_AP_Rank']

# Check the first rows for the Alabama Crimson Tide (excluding 2020) to determine whether the previous AP rank shift is working correctly
mask = ((df_base['team'] == 'Alabama Crimson Tide') & (df_base['season'] != 2020))
df_base[mask].head()


Unnamed: 0,index,team,date,opponent,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,week,season,CONF,SOR,FPI,AP/CFP,SOS,GC,AVGWP,AP_rank,state,code,Previous_AP_Rank
136,5531,Alabama Crimson Tide,"Sat, Sep 4",Miami,W,False,,344.0,60.0,126.0,13,44,31,True,1,2021,SEC,2,2,1,1,2,3,1.0,Alabama,333.0,2.0
137,5544,Alabama Crimson Tide,"Sat, Sep 11",Mercer,W,False,,227.0,70.0,85.0,14,48,34,True,2,2021,SEC,2,2,1,1,2,3,1.0,Alabama,333.0,1.0
138,5532,Alabama Crimson Tide,"Sat, Sep 18",Florida,W,False,,240.0,78.0,61.0,29,31,2,False,3,2021,SEC,2,2,1,1,2,3,1.0,Alabama,333.0,1.0
139,5533,Alabama Crimson Tide,"Sat, Sep 25",Southern Miss,W,False,,313.0,110.0,105.0,14,63,49,True,4,2021,SEC,2,2,1,1,2,3,1.0,Alabama,333.0,1.0
140,5534,Alabama Crimson Tide,"Sat, Oct 2",Ole Miss,W,False,,241.0,171.0,65.0,21,42,21,True,5,2021,SEC,2,2,1,1,2,3,1.0,Alabama,333.0,1.0
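
In the output above, week 1 of 2021 already has a `Previous_AP_Rank` of 2.0, presumably carried over from the previous season's final row. A minimal sketch of one way to avoid that, and to get a finite differential when a team enters or leaves the top 25, is shown below on made-up data. The placeholder rank of 26 for unranked teams (`UNRANKED`) is an assumption, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Toy data: one team, two seasons; NaN means unranked that week.
toy = pd.DataFrame({
    "team": ["A"] * 4,
    "season": [2020, 2020, 2021, 2021],
    "week": [1, 2, 1, 2],
    "AP_rank": [3.0, 2.0, 1.0, np.nan],
})
toy = toy.sort_values(["team", "season", "week"]).reset_index(drop=True)

# Shifting within (team, season) keeps the last rank of one season
# from leaking into week 1 of the next.
groups = toy.groupby(["team", "season"])
toy["Previous_AP_Rank"] = groups["AP_rank"].shift(1)

# True only for rows that have an earlier game in the same season.
has_prev = groups.cumcount() > 0

# ASSUMPTION: treat an unranked team as rank 26 so that entering or
# leaving the top 25 yields a number instead of NaN.
UNRANKED = 26
rank_now = toy["AP_rank"].fillna(UNRANKED)
rank_prev = toy["Previous_AP_Rank"].fillna(UNRANKED)

# Leave the differential as NaN for each season's first row.
toy["AP_Rank_Differential"] = (rank_now - rank_prev).where(has_prev)

print(toy)
```

With this convention a positive differential means the team dropped in the rankings, and the first game of each season has no differential at all.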


**3. Data Manipulation - Setup all Binary Variables**

In [None]:
# Note -- may need to remove "Bye" / non-game weeks prior to engaging in this manipulation

# Create binary columns for each conference in the conference list
for conf in conference_list:
    df_base[conf] = (df_base['CONF'] == conf).astype(int)

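# Creating the binary columns
df_base['win_loss'] = (df_base['win_loss'] == 'W').astype(int)  # note: 'Bye' rows become 0
df_base['OT'] = (df_base['OT'] == True).astype(int)
# home_game holds strings ('TRUE', 'True', 'FALSE', 'Bye'), so compare text rather than the boolean True
df_base['home_game'] = df_base['home_game'].astype(str).str.upper().eq('TRUE').astype(int)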

df_base.head()

Unnamed: 0,index,team,date,opponent,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,week,season,CONF,SOR,FPI,AP/CFP,SOS,GC,AVGWP,AP_rank,state,code,Previous_AP_Rank,Mountain West,MAC,SEC,Sun Belt,Big 12,American,ACC,CUSA,Big Ten,FBS Indep.,Pac-12
0,7267,Air Force Falcons,"Sat, Oct 3",Navy,1,False,,41.0,118.0,29.0,7,40,33,TRUE,-2,2020,Mountain West,83,74,--,115,66,40,,,,,1,0,0,0,0,0,0,0,0,0,0
1,7268,Air Force Falcons,"Sat, Oct 10",Bye,0,False,,0.0,0.0,0.0,0,0,0,Bye,-1,2020,Mountain West,83,74,--,115,66,40,,,,,1,0,0,0,0,0,0,0,0,0,0
2,7266,Air Force Falcons,"Sat, Oct 17",Bye,0,False,,0.0,0.0,0.0,0,0,0,Bye,0,2020,Mountain West,83,74,--,115,66,40,,,,,1,0,0,0,0,0,0,0,0,0,0
3,7270,Air Force Falcons,"Sat, Oct 24",San José State,0,False,,92.0,60.0,56.0,17,6,-11,FALSE,1,2020,Mountain West,83,74,--,115,66,40,,,,,1,0,0,0,0,0,0,0,0,0,0
4,7271,Air Force Falcons,"Sat, Oct 31",Boise State,0,False,,38.0,112.0,38.0,49,30,-19,TRUE,2,2020,Mountain West,83,74,--,115,66,40,,,,,1,0,0,0,0,0,0,0,0,0,0
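
The workflow lists a data validation check as step 4. A minimal sketch of what such a check might look like is below, run on a toy frame with only two conference columns. The specific checks are assumptions about what is worth validating, not the notebook's actual validation step.

```python
import pandas as pd

# Toy frame standing in for df_base after the binary conversion.
toy = pd.DataFrame({
    "win_loss": [1, 0, 1],
    "OT": [0, 0, 1],
    "home_game": [1, 0, 0],
    "SEC": [1, 0, 0],
    "MAC": [0, 1, 1],
})
binary_cols = ["win_loss", "OT", "home_game"]
conference_cols = ["SEC", "MAC"]

# Every binary column should contain only 0 and 1.
checked = toy[binary_cols + conference_cols]
only_binary = bool(checked.isin([0, 1]).all().all())

# Each row should be flagged for exactly one conference.
conf_counts = toy[conference_cols].sum(axis=1)
one_conf_per_row = bool((conf_counts == 1).all())

print("only 0/1 values:", only_binary)
print("one conference per row:", one_conf_per_row)
```

A column that comes back entirely 0 after conversion is also worth checking by hand, since that can indicate a comparison against the wrong type.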
