### Men's NCAA Tournament Lab

Welcome!  This lab is designed to introduce you all to building features and scoring models on game data from the NCAA tournament.  

When you're done, you should be able to work through the basics of using predictive models in these types of situations.

**Step 1:** Import files for the seeds, ncaa tournament games, and regular season games.  Also import the exported csv you made from class for the initial one variable model you fit.

In [1]:
# your answer here
import pandas as pd
import numpy as np
seeds = pd.read_csv('../data/NCAA/MNCAATourneySeeds.csv')
results = pd.read_csv('../data/NCAA/MNCAATourneyCompactResults.csv')
seasons = pd.read_csv('../data/NCAA/MRegularSeasonCompactResults.csv')
game_data = pd.read_csv('../data/NCAA/game_data.csv')

**Step 2:** Create a Training & Test Set, With the Test Set Comprising of All Games 2015 & After.  Use the exported csv from class for this, since it's already prepped.

In [2]:
# your answer here
query = game_data.Season < 2015
train = game_data[query].copy()
test  = game_data[~query].copy()

**Step 3:** Find an initial validation score with the 1 seed model, and a RandomForest Classifier, right out of the box.

 - Run KFold, using 10 splits
 - Just use the seed difference for X
 - FYI: The score being returned is prediction accuracy

In [3]:
# your answer here
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = train[['SeedDiff']]
y = train['Result']

rf = RandomForestClassifier()

scores = cross_val_score(estimator=rf, X=X, y=y, cv=10)

What is your initial validation score?

In [4]:
# take the average of all scores -- about 0.7
np.mean(scores)

0.7066865208877283

**Step 4:** Create new data that captures the won-loss record of each team

We're going to break this down into smaller steps to make it easier to digest

**a).** Use `groupby()` to group teams based on `Season` and `WTeamID` in the dataset for regular season games.  Apply the `count()` aggregator to one of the columns to determine how many games each team won.

In [5]:
# your answer here
seasons.groupby(['Season', 'WTeamID'])['WScore'].count()

Season  WTeamID
1985    1102        5
        1103        9
        1104       21
        1106       10
        1108       19
        1109        1
        1110        7
        1111       10
        1112       18
        1113       11
        1114       17
        1116       21
        1117       11
        1119       15
        1120       18
        1121        7
        1122        6
        1123       13
        1124        7
        1126        6
        1129       12
        1130       16
        1131       14
        1132       11
        1133       16
        1134       11
        1135        8
        1137       15
        1139       16
        1140       15
                   ..
2019    1435        9
        1436       26
        1437       25
        1438       29
        1439       24
        1440        8
        1441        6
        1442        8
        1443       20
        1444        7
        1447       11
        1448       11
        1449       26
        1450    

**b).** Save the grouping from the previous step as it's own variable, but with the following additions:

 - tack on the `reset_index()` method at the end -- note what this does
 - as an argument for the `reset_index()` method, pass in `name=Wins`

In [6]:
# your answer here
wins = seasons.groupby(['Season', 'WTeamID'])['WScore'].count().reset_index(name='Wins')

**c).** Repeat steps `a` and `b`, but this time group in `LTeamID` and make the new column called `Losses` instead of `Wins`.

In [7]:
# your answer here
losses = seasons.groupby(['Season', 'LTeamID'])['LScore'].count().reset_index(name='Losses')

At this point -- look at the two variables you created, and just make sure you can make sense out of what they're telling you.  You should have two separate dataframes that tell you how many wins & losses each team in each season had from 1985 until tolday.

**Step 5:** Merge your data back into your original data set

This can be a little tedious and time consuming, but you have to be careful in order to make sure you get it right.

**Part 1:** Building Features for Team 1

**a).** How many games did team 1 win?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's losses
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'WTeamID'`
 - **new column name:** `'T1Wins'`

In [8]:
game_data = game_data.merge(wins, left_on=['Season', 'T1TeamID'], right_on=['Season', 'WTeamID'], how='left')
game_data.columns.values[-1] = 'T1Wins'

**b).** How many games did team 1 lose?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's losses
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'LTeamID'`
 - **new column name:** `'T1Losses'`

In [9]:
# your answer here
#game_data.merge(losses, left_on=['Season', 'T1TeamID'], right_on=['Season', 'LTeamID'], how='left')
game_data = game_data.merge(losses, left_on=['Season', 'T1TeamID'], right_on=['Season', 'LTeamID'], how='left')
game_data.rename({'Losses': 'T1Losses'}, axis=1, inplace=True)

**c).** Some teams have gone undefeated.  If that's the case there will be no entries for them in the loss column.  Fill in these values with 0 now.

In [10]:
# your answer here
game_data['T1Losses'].fillna(0, inplace=True)

**d).** You probably have some unnecessary columns right now.  Remove unnecessary columns created from the merges if they exist.  These are most likely going to be the `WTeamID` and `LTeamID` columns.

In [11]:
# your answer here
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

**e).** Now create a new column called `T1WinPCT` that's the winning percentage of team 1.

In [12]:
# your answer here
game_data['T1WinPct'] = game_data['T1Wins'] / (game_data['T1Wins'] + game_data['T1Losses'])

**Part II:**  Build the same features for Team II

Your turn:  Try and recreate the exact same features you just created for the first team, but for the second.

**Hint:**  In your original dataset, swap out `T1TeamID` for `T2TeamID` for the merges.

In [13]:
# number of wins for team 2
game_data = game_data.merge(wins, left_on=['Season', 'T2TeamID'], right_on=['Season', 'WTeamID'], how='left')
game_data.columns.values[-1] = 'T2Wins'

In [14]:
# number of losses for team 2
game_data = game_data.merge(losses, left_on=['Season', 'T2TeamID'], right_on=['Season', 'LTeamID'], how='left')
game_data.rename({'Losses': 'T2Losses'}, axis=1, inplace=True)

In [17]:
# fill in the empty values
game_data['T2Losses'].fillna(0, inplace=True)

In [19]:
# and drop unnecessary columns
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

In [22]:
# and Team 2 Win percentage
game_data['T2WinPct'] = game_data['T2Wins'] / (game_data['T2Wins'] + game_data['T2Losses'])

**Step 6:** Recreate your training and test sets from the original data source, using the same criteria as before

In [24]:
# your answer here
train = game_data[query].copy()
test  = game_data[~query].copy()

**Step 7:** Recreate `X` and `y`, except this time include the new features that you added -- Wins and losses for each team, as well as their winning percentage

In [27]:
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,T1Seed,T2Seed,SeedDiff,Result,T1Wins,T1Losses,T1WinPct,T2Wins,T2Losses,T2WinPct
0,1985,1116,63,1234,54,9,8,1,1,21,12.0,0.636364,20,10.0,0.666667
1,1985,1120,59,1345,58,11,6,5,1,18,11.0,0.62069,17,8.0,0.68
2,1985,1207,68,1250,43,1,16,-15,1,25,2.0,0.925926,11,18.0,0.37931
3,1985,1229,58,1425,55,9,8,1,1,20,7.0,0.740741,19,9.0,0.678571
4,1985,1242,49,1325,38,3,14,-11,1,23,7.0,0.766667,20,7.0,0.740741


In [37]:
# your answer
X = train[['SeedDiff', 'T1Wins', 'T2Losses','T1WinPct', 'T2Wins', 'T2Losses', 'T2WinPct']]
y = train['Result']

**Step 8:** Re-check your validation scores with the new data, using the same conditions that we did in the previous step.  See if your validation scores improved at all.

In [38]:
scores = cross_val_score(estimator=rf, X=X, y=y, cv=10)

Did your results improve?

In [39]:
# no, they didn't
np.mean(scores)

0.6717220137075719

**Step 9:** Of the two different versions of our model that we just tested, take the best one, fit your random forest on its training data, and then score it on your test set to see how your final results come out.

In [40]:
# since just the seed difference performed better, we'll stick with that
X = train[['SeedDiff']]
y = train['Result']

# fit the Random Forest
rf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [41]:
# and score it on the test
X_test = test[['SeedDiff']]
y_test = test['Result']

# and we get these results
rf.score(X_test, y_test)

0.7059701492537314

**Step 10:** How close were your validation and test results?  Ie, how reliable were our validation results?

**Answer:** The validation scores from the one variable model with the seed difference were very similar....both very nearly 0.7.  Which suggests that, at the very least, the results predicted from it are reliable, even if they could potentially be improved upon with more information.

**Bonus:** If time permits, you can try a few different permutations of what we just did to continue to improve your results.  Including:

 - Trying to add more features beyon each team's winning percentage (perhaps average point differential would be more informative)
 - Using a grid search to find the best parameters of your random forest and seeing how that improves your results

In [None]:
# your answer here