**Christian Data Exploration**

This notebook outlines the steps taken to transform the data so that it is ready for model use. Specifically, it will:
- Convert binary columns like win_loss, OT, home_game, etc. into 1's and 0's, where 1 indicates the semantic presence of "something."
- Convert NaN values in numeric string columns to 0.
- Convert conference into a series of binary variables.
- Attempt to identify a solution to the AP top 25 flow-in and flow-out predicament.
- Shift the dataframe to identify the rank changes.
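
The NaN-to-0 conversion is not shown in the cells of this notebook, so here is a minimal sketch of how it might look. The toy values are made up, and the choice of which columns to fill (`OT_num` and `rec` here) is an assumption. AP_rank is deliberately not in the list, since a missing rank means "unranked" rather than rank 0.

```python
import pandas as pd

# Toy frame: column names mirror the notebook's, values are made up.
toy = pd.DataFrame({
    "OT_num": [None, "3", None],
    "rec": ["27", None, "46"],
})

# Assumption: only these count-like columns should be zero-filled.
fill_cols = ["OT_num", "rec"]

for col in fill_cols:
    # Coerce numeric strings to numbers (unparseable values become NaN),
    # then replace every NaN with 0.
    toy[col] = pd.to_numeric(toy[col], errors="coerce").fillna(0)
```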

**Steps and Workflow**
1. Data setup
2. Data Exploration
    - Explore AP Rank Data
    - Get all Unique Conference Values
3. Data Manipulation
    - Setting up AP Rank Shift
    - Setup all Binary Variables
4. Data Validation Check

**Data Concepts**
- The dataset contains two records for each game: (1) the winner and (2) the loser. I think this is fine, but rankings are zero-sum.
- AP ranks are only present for the top 25 teams, and the AP rank represents either the before or the after state. Either way, a shift is needed to determine the before and after for each game and create the differential.
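
One way to check the "two records per game" idea is to group both sides of a game under a shared key and confirm that there are two rows whose point differentials cancel out. This is only a sketch on made-up rows. The `matchup` column is a hypothetical helper, and it assumes the opponent field uses the same spelling as the Team field, which may not hold in the real data.

```python
import pandas as pd

# Toy rows: the week 6 game has both sides, the week 7 game has only one.
toy = pd.DataFrame({
    "season": [2021, 2021, 2021],
    "week": [6, 6, 7],
    "Team": ["Alabama", "Texas A&M", "Alabama"],
    "opponent": ["Texas A&M", "Alabama", "Mississippi State"],
    "point_differential": [-3, 3, 40],
})

# Hypothetical helper column: an order-independent label shared by both rows of a game.
toy["matchup"] = [
    " vs ".join(sorted(pair)) for pair in zip(toy["Team"], toy["opponent"])
]

games = (
    toy.groupby(["season", "week", "matchup"])["point_differential"]
    .agg(["count", "sum"])
)

# A complete game has exactly two rows and a differential sum of zero.
incomplete = games[(games["count"] != 2) | (games["sum"] != 0)]
print(incomplete)
```

Any rows left in `incomplete` are games where the second record is missing or the two differentials disagree.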

**Outstanding Questions / Data Issues**
- What does "Bye" mean in the opponent field? It seems like the team did not have a game during that week, so these rows could perhaps be dropped?
- There still seem to be negative values in the week column; may want to just drop 2020 from the dataset since it may cause errors.
- I may need to drop the first week of the season or remove its shift values since it's pulling in the previous season's information.
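
If the "Bye" rows and the negative weeks do end up being dropped, the filter could look something like the sketch below. The rows are made up, and treating every negative week as invalid is an assumption. Dropping 2020 outright (`toy["season"] != 2020`) would be the blunter alternative.

```python
import pandas as pd

# Toy rows modeled on the 2020 Air Force records; values are illustrative.
toy = pd.DataFrame({
    "season": [2020, 2020, 2020, 2021],
    "week": [-2, -1, 1, 2],
    "Team": ["Air Force"] * 4,
    "opponent": ["Navy", "Bye", "San José State", "Navy"],
})

is_bye = toy["opponent"].eq("Bye")   # non-game weeks
bad_week = toy["week"] < 0           # assumption: negative weeks are data errors

# Keep only real games with a usable week number.
filtered = toy[~is_bye & ~bad_week].reset_index(drop=True)
print(filtered)
```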


**1. Data Setup**

In [None]:
import pandas as pd
import seaborn as sn

# Set Standard dataframe settings
pd.set_option('display.max_columns', None)
df_clean = pd.read_csv(r'C:\Users\martchfr\OneDrive - Indiana University\Graduate School\MIS\INFO-H 501\Projects\Group-8-Project\03 - Cleaned Data Space\mergedTrainingData.csv')

df_clean.drop(columns =['Unnamed: 0'], inplace=True)
df_clean.sort_values(by=['season','Team','week']).reset_index().head(20)

Unnamed: 0,index,week,season,Team,opponent,code,date,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,CONF,SOR,FPI,SOS,GC,AVGWP,AP_rank,opponent_rank,rank_change
0,2238,2,2021,Air Force,Navy,2005,"Sat, Sep 11",W,False,,49.0,97.0,27.0,3,23,20,False,Mountain West,33,66,110,24,6,,,
1,2600,3,2021,Air Force,Utah State,2005,"Sat, Sep 18",L,False,,182.0,102.0,,49,45,-4,True,Mountain West,33,66,110,24,6,,,0.0
2,3007,4,2021,Air Force,Florida Atlantic,2005,"Sat, Sep 25",W,False,,70.0,164.0,46.0,7,31,24,True,Mountain West,33,66,110,24,6,,,0.0
3,3454,5,2021,Air Force,New Mexico,2005,"Sat, Oct 2",W,False,,33.0,142.0,33.0,10,38,28,False,Mountain West,33,66,110,24,6,,,0.0
4,3912,6,2021,Air Force,Wyoming,2005,"Sat, Oct 9",W,False,,110.0,140.0,77.0,14,24,10,True,Mountain West,33,66,110,24,6,,,0.0
5,4340,7,2021,Air Force,Boise State,2005,"Sat, Oct 16",W,False,,59.0,138.0,59.0,17,24,7,False,Mountain West,33,66,110,24,6,,,0.0
6,4763,8,2021,Air Force,San Diego State,2005,"Sat, Oct 23",L,False,,58.0,50.0,31.0,20,14,-6,True,Mountain West,33,66,110,24,6,,21.0,0.0
7,0,10,2021,Air Force,Army,2005,"Sat, Nov 6",L,True,,226.0,68.0,106.0,21,14,-7,True,Mountain West,33,66,110,24,6,,,0.0
8,464,11,2021,Air Force,Colorado State,2005,"Sat, Nov 13",W,False,,121.0,151.0,92.0,21,35,14,False,Mountain West,33,66,110,24,6,,,0.0
9,929,12,2021,Air Force,Nevada,2005,"Fri, Nov 19",W,True,3.0,23.0,208.0,23.0,39,41,2,False,Mountain West,33,66,110,24,6,,,0.0


**2. Data Exploration - Explore AP Rank Data** 

In [33]:
# Select all rows where AP_rank is not null, take the first 14 of them, and display those sorted by team and week
mask = df_clean['AP_rank'].notna()
df_ap_rank = df_clean[mask]
df_ap_rank.iloc[0:14].sort_values(by=['Team','week']).reset_index()

Unnamed: 0,index,week,season,Team,opponent,code,date,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,CONF,SOR,FPI,SOS,GC,AVGWP,AP_rank,opponent_rank,rank_change
0,2,10,2021,Alabama,LSU,333,"Sat, Nov 6",W,False,,302.0,18.0,160.0,14,20,6,True,SEC,2,2,1,2,3,3.0,,-1.0
1,7,10,2021,Auburn,Texas A&M,2,"Sat, Nov 6",L,False,,153.0,69.0,50.0,20,3,-17,False,SEC,48,20,4,29,58,12.0,13.0,-9.0
2,9,10,2021,Baylor,TCU,239,"Sat, Nov 6",L,False,,214.0,125.0,121.0,30,28,-2,False,Big 12,7,15,25,6,7,14.0,,-4.0
3,16,10,2021,Cincinnati,Tulsa,2132,"Sat, Nov 6",W,False,,274.0,43.0,113.0,20,28,8,True,American,6,10,54,5,2,2.0,,0.0
4,17,10,2021,Coastal Carolina,Georgia State,324,"Sat, Nov 13",L,False,,233.0,128.0,101.0,42,40,-2,True,Sun Belt,34,45,130,47,5,21.0,,-3.0
5,27,10,2021,Fresno State,San Diego State,278,"Sat, Oct 30",W,False,,306.0,186.0,107.0,20,30,10,False,Mountain West,30,53,95,37,19,25.0,,-1.0
6,28,10,2021,Georgia,Missouri,61,"Sat, Nov 6",W,False,,255.0,41.0,76.0,6,43,37,True,SEC,1,1,3,1,1,1.0,,0.0
7,33,10,2021,Houston,South Florida,248,"Sat, Nov 6",W,False,,385.0,130.0,164.0,42,54,12,False,American,17,37,78,27,10,20.0,,-6.0
8,36,10,2021,Iowa,Northwestern,2294,"Sat, Nov 6",W,False,,172.0,141.0,68.0,12,17,5,False,Big Ten,16,32,23,33,54,19.0,,9.0
9,41,10,2021,Kentucky,Tennessee,96,"Sat, Nov 6",L,False,,372.0,109.0,166.0,45,42,-3,True,SEC,19,28,32,15,15,18.0,,6.0


**2. Data Exploration - Get all Unique Conference Values**

These values will be used to create a set of binary columns for each conference.

In [13]:
# Groups by conference and counts unique teams in each conference.
df_clean.groupby('CONF')['Team'].nunique().reset_index(name='unique_teams')

# Create the conference list for binary column creation.
# unique() keeps the values in order of first appearance.
conference_list = df_clean['CONF'].unique().tolist()

print(conference_list)

['Mountain West', 'MAC', 'SEC', 'Pac-12', 'FBS Indep.', 'Big 12', 'ACC', 'CUSA', 'American', 'Sun Belt', 'Big Ten']
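
As an alternative to building the binary columns one conference at a time, `pd.get_dummies` can produce them in one call. This is a sketch on a made-up three-row frame, not the approach used in the manipulation cell later in this notebook. Note that it orders the new columns alphabetically rather than by first appearance.

```python
import pandas as pd

# Toy frame: three teams, three conferences.
toy = pd.DataFrame({
    "Team": ["Air Force", "Alabama", "Baylor"],
    "CONF": ["Mountain West", "SEC", "Big 12"],
})

# One 0/1 column per conference value found in CONF.
dummies = pd.get_dummies(toy["CONF"], dtype=int)

# Attach the indicator columns while keeping the original CONF column.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```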


**3. Data Manipulation - Setup AP Rank Shift and AP Rank Differential**

In [23]:
# Define main dataframe, sorted by team, season, and week for manipulation
df_base = df_clean.sort_values(by=['Team','season','week']).reset_index()

df_base["Previous_AP_Rank"] = df_base.groupby('Team')['AP_rank'].shift(1)
df_base['AP_Rank_Differential'] = df_base['AP_rank'] - df_base['Previous_AP_Rank']

# Check on the Alabama Crimson Tide for the first 14 weeks to determine if the previous AP rank shift is working correctly
mask = df_base['Team'] == 'Alabama'

df_base[mask].head(14)


Unnamed: 0.1,index,Unnamed: 0,week,season,Team,opponent,code,date,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,CONF,SOR,FPI,SOS,GC,AVGWP,AP_rank,opponent_rank,rank_change,Previous_AP_Rank,AP_Rank_Differential
85,2601,2602,3,2021,Alabama,Florida,333,"Sat, Sep 18",W,False,,240.0,78.0,61.0,29,31,2,False,SEC,2,2,1,2,3,1.0,9.0,0.0,,
86,3009,3010,4,2021,Alabama,Southern Miss,333,"Sat, Sep 25",W,False,,313.0,110.0,105.0,14,63,49,True,SEC,2,2,1,2,3,1.0,,0.0,1.0,0.0
87,3456,3457,5,2021,Alabama,Ole Miss,333,"Sat, Oct 2",W,False,,241.0,171.0,65.0,21,42,21,True,SEC,2,2,1,2,3,1.0,12.0,0.0,1.0,0.0
88,3914,3915,6,2021,Alabama,Texas A&M,333,"Sat, Oct 9",L,False,,369.0,147.0,146.0,41,38,-3,False,SEC,2,2,1,2,3,1.0,,0.0,1.0,0.0
89,4342,4343,7,2021,Alabama,Mississippi State,333,"Sat, Oct 16",W,False,,348.0,78.0,117.0,9,49,40,False,SEC,2,2,1,2,3,5.0,,4.0,1.0,4.0
90,4765,4766,8,2021,Alabama,Tennessee,333,"Sat, Oct 23",W,False,,371.0,107.0,123.0,24,52,28,True,SEC,2,2,1,2,3,4.0,,-1.0,5.0,-1.0
91,2,3,10,2021,Alabama,LSU,333,"Sat, Nov 6",W,False,,302.0,18.0,160.0,14,20,6,True,SEC,2,2,1,2,3,3.0,,-1.0,4.0,-1.0
92,466,467,11,2021,Alabama,New Mexico State,333,"Sat, Nov 13",W,False,,270.0,99.0,158.0,3,59,56,True,SEC,2,2,1,2,3,3.0,,0.0,3.0,0.0
93,931,932,12,2021,Alabama,Arkansas,333,"Sat, Nov 20",W,False,,559.0,122.0,190.0,35,42,7,True,SEC,2,2,1,2,3,2.0,21.0,-1.0,3.0,-1.0
94,1407,1408,13,2021,Alabama,Auburn,333,"Sat, Nov 27",W,True,4.0,317.0,71.0,150.0,22,24,2,False,SEC,2,2,1,2,3,3.0,,1.0,2.0,1.0
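
Because the shift above groups by Team only, the first game of a season picks up the last rank of the previous season. A possible fix, sketched on made-up ranks, is to group by both Team and season so the first week of each season gets NaN instead. Whether NaN is the right value for that first week is still an open question.

```python
import pandas as pd

# Toy data: the last two weeks of 2020 and the first two weeks of 2021.
toy = (
    pd.DataFrame({
        "Team": ["Alabama"] * 4,
        "season": [2020, 2020, 2021, 2021],
        "week": [12, 13, 1, 2],
        "AP_rank": [1.0, 1.0, 3.0, 2.0],
    })
    .sort_values(["Team", "season", "week"])
    .reset_index(drop=True)
)

# Shifting within (Team, season) keeps ranks from leaking across seasons.
toy["Previous_AP_Rank"] = toy.groupby(["Team", "season"])["AP_rank"].shift(1)

# Negative values mean the team moved up in the poll.
toy["AP_Rank_Differential"] = toy["AP_rank"] - toy["Previous_AP_Rank"]
print(toy)
```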


**3. Data Manipulation - Setup all Binary Variables**

In [106]:
# Note -- may need to remove "Bye" / non-game weeks prior to engaging in this manipulation

# Create binary columns for each conference in the conference list
for conf in conference_list:
    df_base[conf] = (df_base['CONF'] == conf).astype(int)

# Creating the binary columns
df_base['win_loss'] = (df_base['win_loss'] == 'W').astype(int)
df_base['OT'] = (df_base['OT'] == True).astype(int)
df_base['home_game'] = (df_base['home_game'] == True).astype(int)

df_base.head()

Unnamed: 0,index,team,date,opponent,win_loss,OT,OT_num,pass,rush,rec,points_allowed,points_scored,point_differential,home_game,week,season,CONF,SOR,FPI,AP/CFP,SOS,GC,AVGWP,AP_rank,state,code,Previous_AP_Rank,AP_Rank_Differential,Mountain West,MAC,SEC,Sun Belt,Big 12,American,ACC,CUSA,Big Ten,FBS Indep.,Pac-12
0,7267,Air Force Falcons,"Sat, Oct 3",Navy,1,0,,41.0,118.0,29.0,7,40,33,0,-2,2020,Mountain West,83,74,--,115,66,40,,,,,,1,0,0,0,0,0,0,0,0,0,0
1,7268,Air Force Falcons,"Sat, Oct 10",Bye,0,0,,0.0,0.0,0.0,0,0,0,0,-1,2020,Mountain West,83,74,--,115,66,40,,,,,,1,0,0,0,0,0,0,0,0,0,0
2,7266,Air Force Falcons,"Sat, Oct 17",Bye,0,0,,0.0,0.0,0.0,0,0,0,0,0,2020,Mountain West,83,74,--,115,66,40,,,,,,1,0,0,0,0,0,0,0,0,0,0
3,7270,Air Force Falcons,"Sat, Oct 24",San José State,0,0,,92.0,60.0,56.0,17,6,-11,0,1,2020,Mountain West,83,74,--,115,66,40,,,,,,1,0,0,0,0,0,0,0,0,0,0
4,7271,Air Force Falcons,"Sat, Oct 31",Boise State,0,0,,38.0,112.0,38.0,49,30,-19,0,2,2020,Mountain West,83,74,--,115,66,40,,,,,,1,0,0,0,0,0,0,0,0,0,0
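
For the data validation step listed in the workflow, a couple of quick checks on the binary columns might look like the sketch below. The frame is a made-up stand-in for `df_base` with only two conferences, and the two checks are suggestions rather than a complete validation.

```python
import pandas as pd

# Stand-in for the real conference list and dataframe.
conference_list = ["Mountain West", "SEC"]
toy = pd.DataFrame({
    "win_loss": [1, 0, 1],
    "OT": [0, 0, 1],
    "home_game": [0, 1, 1],
    "Mountain West": [1, 0, 0],
    "SEC": [0, 1, 1],
})

binary_cols = ["win_loss", "OT", "home_game"] + conference_list

# Every binary column should contain only 0 or 1.
all_binary = toy[binary_cols].isin([0, 1]).all().all()

# Every row should belong to exactly one conference.
one_conf_per_row = toy[conference_list].sum(axis=1).eq(1).all()

print(all_binary, one_conf_per_row)
```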
