## Data
The model I used (which performed fairly well, finishing in the top 3%) was a very simple random forest. I think what helped it outperform some others was the data I used.

I scraped historical NCAAM data from ESPN. Beginning in 2008 (the first year that BPI rankings are available on ESPN), I pulled many different team statistics for each team that played Division I basketball:

* Team Wins/Losses
* Team Stats
    * Points
    * Rebounds
    * Assists
    * Steals
    * FG%
    * etc.
* Strength of Schedule Ranking
* Strength of Record Ranking
* BPI Ranking
* and a few more

From there, I built a dataset that contained games from each March Madness tournament since 2008. In this dataset, I included all the team stats listed above as well as the team's seed and the outcome of the game.
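As a rough sketch of how a matchup dataset like this could be assembled, the snippet below joins per-team season stats onto each tournament game twice, once for each side. The frames, team names, and column names (`TEAM`, `SEED`, `PPG`, `BPI_RANK`, `T1`, `T2`) are made up for illustration and are not the columns of the real CSV; only the `T1`/`T2` prefix idea and the `WINNER` label mirror the dataset used here.

```python
import pandas as pd

# Hypothetical per-team season stats (illustrative values and column names).
team_stats = pd.DataFrame({
    "YEAR": [2021, 2021, 2021, 2021],
    "TEAM": ["A", "B", "C", "D"],
    "SEED": [1, 16, 8, 9],
    "PPG": [82.1, 68.4, 74.0, 73.2],
    "BPI_RANK": [2, 210, 35, 41],
})

# Hypothetical tournament games. WINNER = 1 is assumed to mean team 1 won.
games = pd.DataFrame({
    "YEAR": [2021, 2021],
    "T1": ["A", "C"],
    "T2": ["B", "D"],
    "WINNER": [1, 0],
})

def attach_stats(games, team_stats, side):
    """Merge one side's season stats onto each game, prefixing stat columns with the side."""
    stat_cols = [c for c in team_stats.columns if c not in ("YEAR", "TEAM")]
    renamed = team_stats.rename(columns={c: f"{side}{c}" for c in stat_cols})
    renamed = renamed.rename(columns={"TEAM": side})
    return games.merge(renamed, on=["YEAR", side], how="left")

matchups = attach_stats(attach_stats(games, team_stats, "T1"), team_stats, "T2")

# Keep the label as the last column so features and target are easy to separate.
matchups = matchups[[c for c in matchups.columns if c != "WINNER"] + ["WINNER"]]
print(matchups)
```

Each row then describes one game, with both teams' stats side by side and the outcome at the end.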

In [None]:
import pandas as pd
import numpy as np
import csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/ncaam-2022/MARCH_MADNESS_DATA.csv')
df.head()

In [None]:
df = df.drop(columns=['T1NAME','T2NAME','T1SCORE','T2SCORE','REGION','YEAR','ROUND',
                      'T1ID','T1HW','T1HL','T1AW','T1AL','T1HREC','T1AREC',
                      'T2ID','T2HW','T2HL','T2AW','T2AL','T2HREC','T2AREC',])
df = df.dropna()

In [None]:
# Features: every remaining column except the label. Target: WINNER.
x = df.drop(columns=['WINNER'])
y = df['WINNER']

In [None]:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=99)

## Model
I ran `RandomizedSearchCV` to tune the hyperparameters of my random forest.

Below are the parameters the search returned as the best combination.

In [None]:
# ### Random Forest
# from sklearn.model_selection import RandomizedSearchCV
# random_grid = {}
# random_grid['n_estimators'] = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# random_grid['max_features'] = ['auto', 'sqrt']  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
# random_grid['max_depth'] = [int(x) for x in np.linspace(10, 110, num = 11)]
# random_grid['max_depth'].append(None)
# random_grid['min_samples_split'] = [2, 5, 10]
# random_grid['min_samples_leaf'] = [1, 2, 4]
# random_grid['bootstrap'] = [True, False]
# rf = RandomForestClassifier()
# rf_random = RandomizedSearchCV(
#     estimator=rf, 
#     param_distributions=random_grid, 
#     n_iter=500, cv=3, verbose=2,
#     n_jobs=-1)
# rf_random.fit(train_x, train_y)

```
{
    'n_estimators': 1600,
    'min_samples_split': 5,
    'min_samples_leaf': 4,
    'max_features': 'auto',
    'max_depth': 50,
    'bootstrap': False
}
```
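For reference, here is a small sketch of how the result of a randomized search can be read back and passed straight into a new model, which avoids copying values by hand. It runs on synthetic data from `make_classification`, with a deliberately tiny grid so it finishes quickly. The names `X_demo`, `y_demo`, `small_grid`, and `tuned_rf` are placeholders, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the tournament dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A much smaller grid than the one used in the notebook, to keep the demo fast.
small_grid = {
    "n_estimators": [20, 40],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_params_ is a plain dict, so it can be unpacked into a fresh estimator.
best_params = search.best_params_
tuned_rf = RandomForestClassifier(random_state=0, **best_params).fit(X_demo, y_demo)
print(best_params)
```

With the default `refit=True`, `search.best_estimator_` is also available as an already-fitted model using the same parameters.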

In [None]:
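# Note: these settings differ slightly from the search output above
# (n_estimators=1500 rather than 1600, bootstrap=True rather than False,
# and min_samples_split left at its default of 2).
rf = RandomForestClassifier(
    n_estimators=1500,
    min_samples_leaf=4,
    max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    max_depth=50,
    bootstrap=True)
rf.fit(train_x, train_y)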

In [None]:
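### Random Forest
# Hard class predictions for accuracy, class probabilities for log loss
rf_pred = rf.predict(test_x)
rf_acc = accuracy_score(test_y, rf_pred)
rf_predprob = rf.predict_proba(test_x)
# The first line prints held-out accuracy as a percentage
print(f"{'Random Forest':<20} ==> {round(rf_acc * 100, 2)}")
print(f"{'Log Loss':<20} ==> {round(log_loss(test_y, rf_predprob), 3)}")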

In [None]:
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

In [None]:
feature_names = list(train_x.columns)  # the columns the model was fit on, in fit order
forest_importances = pd.Series(importances, index=feature_names).sort_values()
fig, ax = plt.subplots(figsize=(10, 6))
forest_importances.plot.bar(ax=ax)
ax.set_title("Variable Importances")
ax.set_ylabel("Impurity Decrease")
fig.tight_layout()

I found this interesting. I know that the selection committee uses Strength of Record (SOR) as a major factor in seeding the tournament. As a result, my model (nearly always) picked the higher seed to win the game.
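One way to put a number on "nearly always" is to measure how often the model's pick matches the better-seeded team. The sketch below does this on a made-up frame: `T1SEED`, `T2SEED`, and `PRED` are assumed column names, and it assumes a prediction of 1 means team 1 wins. Neither is confirmed by the real dataset.

```python
import pandas as pd

# Hypothetical held-out games with the model's picks (1 = team 1, 0 = team 2).
demo = pd.DataFrame({
    "T1SEED": [1, 9, 5, 3, 10],
    "T2SEED": [16, 8, 12, 14, 7],
    "PRED":   [1, 0, 1, 1, 1],
})

# Games between equal seeds have no "higher seed", so leave them out.
different_seeds = demo[demo["T1SEED"] != demo["T2SEED"]]

# A lower seed number is the better seed, so the "chalk" pick is team 1
# whenever T1SEED is smaller than T2SEED.
chalk_pick = (different_seeds["T1SEED"] < different_seeds["T2SEED"]).astype(int)

# Share of games where the model agreed with the chalk pick.
chalk_rate = (different_seeds["PRED"] == chalk_pick).mean()
print(f"Model picked the better seed in {chalk_rate:.0%} of games")
```

A rate close to 100% would confirm that the model is mostly reproducing the committee's seeding.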

## Future Improvements
There are three main things I want to tackle next year.

1. See if there is a way to include momentum going into the tournament (something North Carolina rode to the championship game). This could be each team's record over its last 10 games (L10) before the end of the season.
2. Include conference data for the teams. It's no secret that some conferences perform better than others in the tournament.
3. Train the model on ALL games (regular season and tournament). This would require removing the 'seed' variable, since regular-season games have no seeds, but it could be interesting to experiment with.
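The momentum idea in item 1 could start from something like the sketch below, which computes each team's win percentage over its last 10 games from a per-team game log. The log layout (`TEAM`, `DATE`, `WIN`) and the helper `last_n_win_pct` are hypothetical. The current dataset has no game-level history, so that data would need to be scraped first.

```python
import pandas as pd

# Hypothetical game log: one row per team per game, WIN = 1 for a win, 0 for a loss.
game_log = pd.DataFrame({
    "TEAM": ["A"] * 12 + ["B"] * 12,
    "DATE": list(pd.date_range("2022-01-01", periods=12, freq="7D")) * 2,
    "WIN": [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

def last_n_win_pct(game_log, n=10):
    """Win percentage over each team's most recent n games."""
    ordered = game_log.sort_values(["TEAM", "DATE"])
    pct = ordered.groupby("TEAM")["WIN"].apply(lambda wins: wins.tail(n).mean())
    return pct.rename(f"L{n}_WIN_PCT")

l10 = last_n_win_pct(game_log)
print(l10)
```

Here team A started slowly but won its last 10, while team B went 4-6 over the same stretch despite a strong start. The resulting series could be merged onto the matchup rows as `T1`/`T2` features in the same way as the other team stats.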