<a href="https://colab.research.google.com/github/todnewman/coe_training/blob/master/NCAA_GamePlay_Kaggle2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NCAA Basketball Gameplay and Win Probability Predictor

This notebook contains everything needed to predict the NCAA tournament.  The prediction is based on more than 10 years' worth of NCAA tournament game results.  The features come from Kenpom.com (I have to pay him $20 a year to update my data), plus a few extras I have blended in.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math
import matplotlib
import sklearn as sk
import numpy as np
from sklearn import preprocessing
from sklearn import metrics
import csv
from itertools import tee
import sys
from random import random
!pip install mlxtend
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")




## Location of Data

Change the path below to wherever the input files live (this .ipynb file and the actual_xx.csv files).  This is also where data outputs will go.

* SampleSubmissionStage2_0.csv:  The output file, formatted for Kaggle submission.
* pred_xx_out.csv: The predictive dataframe for each round (in case you're interested in looking the data over).
* NCAA_game_pred-xx-lasso.csv:  The results of the games played.  Each numbered column holds the predicted win probability for the team in the first column from one run.  The last two columns are the mean and the standard deviation across runs.
* There is also a logfile, which should be very useful.  It will go wherever you tell it to go in the main block.

## Mount YOUR Google Drive

This will bring up a link you need to click to give this script permission to save files in your Google Drive.

In [2]:
#Here's where you can mount your own Google Drive and copy results from /content into it...

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/'My Drive'/data
%ls



Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
/content/drive/My Drive/data
1_FBB_2017_matrix.png              PlayByPlay_2011.zip
1_rain_confusion_matrix.png        PlayByPlay_2012.zip
2_FBB_2017_matrix.png              PlayByPlay_2013.zip
2_rain_confusion_matrix.png        PlayByPlay_2014.zip
Cities.csv                         PlayByPlay_2015.zip
Conferences.csv                    PlayByPlay_2016.zip
ConferenceTourneyGames.csv         PlayByPlay_2017.zip
corr_mat.png                       PlayByPlay_2018

# Dataset Download Area

Here we grab data from GitHub.  It gets saved in /content.

In [5]:
!wget -O /content/kenpom_2019.csv 'https://raw.githubusercontent.com/todnewman/data/master/kenpom_2019.csv'
!wget -O /content/historical_tourney_data_2019.csv 'https://raw.githubusercontent.com/todnewman/data/master/historical_tourney_data_2019.csv'
!wget -O /content/SampleSubmissionStage2-temp.csv  'https://raw.githubusercontent.com/todnewman/data/master/SampleSubmissionStage2-temp.csv'
!wget -O /content/Teams.csv 'https://raw.githubusercontent.com/todnewman/data/master/Teams.csv'
!wget -O /content/pytools.py 'https://raw.githubusercontent.com/todnewman/data/master/pytools.py'

%cd /content
%ls


--2019-03-29 02:50:35--  https://raw.githubusercontent.com/todnewman/data/master/kenpom_2019.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1993149 (1.9M) [text/plain]
Saving to: ‘/content/kenpom_2019.csv’


2019-03-29 02:50:35 (23.3 MB/s) - ‘/content/kenpom_2019.csv’ saved [1993149/1993149]

--2019-03-29 02:50:36--  https://raw.githubusercontent.com/todnewman/data/master/historical_tourney_data_2019.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 563356 (550K) [text/plain]
Saving to: ‘/content/historical_tourney_data_2019.csv’


2

# Prepare the dataframe - Feature Engineering

Here we have a few convenience functions that get the pandas dataframe ready for the machine learning algorithms.

prep_dataframe gives the developer a single place to choose which features (columns) go into the ML algorithm.

split_data takes a prepared dataframe and splits it into training and test sets.  This is also where I handle unbalanced datasets when necessary, using the oversamplers from the imbalanced-learn (imblearn) package.

split_pred_data is a simple function that prepares a predictive dataset for inference.

The mutual information functions are for future use.

The correlation matrix is used to identify the features with the most positive and negative correlations with the target.
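
The scaling step in split_data and split_pred_data fits a new scaler each time it is called, so the historical data and the prediction data are scaled independently.  As a minimal sketch of an alternative (not what this notebook does), a single scaler could be fit on the historical features and then reused on the prediction features.  The function name scale_with_training_stats and the toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def scale_with_training_stats(X_hist, X_pred):
    """Fit a MinMaxScaler on historical data and apply it to both arrays.

    Hypothetical helper for illustration; it is not part of this notebook.
    """
    scaler = MinMaxScaler().fit(X_hist)  # learn each column's min and max once
    return scaler.transform(X_hist), scaler.transform(X_pred)


# Toy data: two features, three historical rows, one prediction row
X_hist = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_pred = np.array([[5.0, 40.0]])

X_hist_scaled, X_pred_scaled = scale_with_training_stats(X_hist, X_pred)
print(X_pred_scaled)
```

With this approach a prediction value outside the historical range lands outside [0, 1], which at least makes the mismatch visible instead of silently rescaling it.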

In [0]:
def prep_dataframe(df, flag, target, drop_seed):
    '''Drop the bookkeeping columns so that only model features (and the target) remain.

    flag == 'predict': prep the dataset we wish to apply the model to.  There won't be a
    Target vector, so a placeholder column of zeros is added.  Returns (df, Name1, Name2).

    Any other flag: prep the historical (training) data.  Returns (df, Name), where
    Name is just a placeholder 0.
    '''
    if flag == 'predict':
        df[target] = 0
        drop_cols = ['Season.1', 'KeyField.1', 'TARGET16.1', 'Season.2',
                     'KeyField.2', 'TARGET16.2']
        if drop_seed:
            drop_cols += ['SEED.1', 'SEED.2'] # eliminate seed vector
        df = df.drop(drop_cols, axis=1)

        Name1 = df.pop('TeamName.1')
        Name2 = df.pop('TeamName.2')
        return df, Name1, Name2
    else:
        drop_cols = ['Unnamed: 0', 'TeamName.1', 'TeamName.2', 'Season.1', 'KeyField.1',
                     'TARGET16.1', 'Score.1', 'Season.2', 'KeyField.2', 'TARGET16.2',
                     'Score.2', 'score_delta']
        if drop_seed:
            drop_cols += ['SEED.1', 'SEED.2'] # eliminate seed vector
        df = df.drop(drop_cols, axis=1)

        df = df.fillna(0)
        Name = 0

        return df, Name

In [0]:
def split_data(df_s, size, target, os_flag):
    ''' Split a prepared dataframe into training and testing sets.
    NaN's are filled with zero to maintain consistent data shapes.
    Reads the global minmax_flag to choose the scaler.

    Returns X_train, X_test, y_train, y_test, plus the full scaled X and y.
    '''
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from imblearn.over_sampling import RandomOverSampler

    y = df_s.pop(target) # Remove the 'target' column that has the labels
    X = df_s.fillna(0)
    y = np.array(y).astype(int)
    X = np.array(X)
    
    # Optionally run the oversampling algorithm to balance the classes
    if os_flag:
        sm = RandomOverSampler(random_state=41)    
        X, y = sm.fit_resample(X, y)  # fit_sample() was the older name for this method
        
    # Scale the data so it works better with the ML algorithms
    if minmax_flag:
        X = preprocessing.MinMaxScaler().fit_transform(X)
    else:
        X = preprocessing.MaxAbsScaler().fit_transform(X)
    
    # size is the fraction of rows held out for testing
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=size)

    return X_train, X_test, y_train, y_test, X, y

In [0]:
def split_pred_data(df_s, size, target):
    ''' Convert a predictive dataframe into a scaled feature array plus a placeholder target.
    No train/test split happens here, and the size argument is unused.
    NaN's are filled with zero to maintain consistent data shapes.
    '''
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    df_s[target] = 0 # Just so we have a target vector in the DF, even though we won't use it

    y = df_s.pop(target) # Remove the 'target' column that has the labels
    X = df_s.fillna(0)
    y = np.array(y).astype(int)
    X = np.array(X)
    
    # Scale the data so it works better with the ML algorithms.  Note that the scaler
    # is fit on the prediction data itself; it is not reused from training.
    if minmax_flag:
        X = preprocessing.MinMaxScaler().fit_transform(X)
    else:
        X = preprocessing.MaxAbsScaler().fit_transform(X)
    
    return X, y

## Feature Selection

This module is designed to select the most informative features.  Right now it builds a basic correlation matrix and then selects the most positively and most negatively correlated features, controlled by a variable called "half_range".  Setting half_range to 4 takes the top 25% of the ordered matrix as well as the bottom 25%.  
For the small NCAA dataset this function is probably less useful than it would be on a larger dataset, so I created a global called all_feat that accepts all features (it sets half_range to 2, so the top and bottom halves together cover every feature).
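
The half_range idea described above can be sketched on a toy dataframe.  This is a simplified illustration rather than the notebook's exact indexing; the function name select_by_correlation and the column names are hypothetical.

```python
import pandas as pd


def select_by_correlation(df, target, half_range):
    """Return the features most positively and most negatively correlated with target.

    Hypothetical helper: keeps 1/half_range of the features from each end of
    the correlation ranking.
    """
    # Rank the features by correlation with the target, leaving the target itself out
    corr = df.corr()[target].drop(target).sort_values(ascending=False)
    n_keep = int(len(corr) / half_range)  # how many features to take from each end
    top = corr.index[:n_keep].tolist()
    bottom = corr.index[len(corr) - n_keep:].tolist()
    # Preserve order and drop duplicates in case the two ends overlap
    return list(dict.fromkeys(top + bottom))


# Toy data with made-up column names
df = pd.DataFrame({
    'win':     [1, 0, 1, 0, 1, 0],
    'off_eff': [110, 100, 112, 98, 109, 101],      # strongly positive
    'def_eff': [95, 105, 94, 106, 96, 104],        # strongly negative
    'tempo':   [68, 67, 66, 68, 67, 66],           # roughly uncorrelated
    'luck':    [0.01, 0.02, 0.03, 0.02, 0.01, 0.03],
})

print(select_by_correlation(df, 'win', 4))  # one feature from each end
print(select_by_correlation(df, 'win', 2))  # both halves, i.e. every feature
```

With only four candidate features, half_range of 4 keeps one feature per end, while half_range of 2 keeps them all, which matches the behaviour all_feat is meant to give.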

In [0]:
def feature_select(target, df_hist, mi_flag): 
    
    half_range = 3.0  # This will determine how many of the most positive and negative correlated features are used
    if all_feat:
        half_range = 2.0
    # Open a dataframe just to create the correlation matrix                             
    df_new = pd.DataFrame()
    df_copy = pd.DataFrame()
    df_copy['Target'] = df_hist[target]
    
    # Convert Objects to Values
    cols = df_hist.columns
    df_hist[cols] = df_hist[cols].apply(pd.to_numeric, errors='coerce')
    
    df_new = df_hist.corr().sort_values(target, ascending = False) 

    size = len(df_new)
    
    ratio = (1/half_range) # will let us pick the highest and lowest correlated features by this ratio
    
    upper_limit = int(size*ratio)
    lower_limit = int(size*(1-ratio))
    upper_range = df_new[target].index[1:upper_limit].tolist()
    lower_range = df_new[target].index[lower_limit:size].tolist()
    
    '''
    Below allows us to switch between mutual information-based and correlation-based feature selectors
    '''
    if mi_flag:
        total_range = get_mutual_information(df_hist, target, upper_limit)
    else:
        total_range = (upper_range + lower_range)
        
    target_vector = df_copy['Target']
    # Limit the dataframe used for training to just the most informative features.
    # The copy avoids assigning a new column onto a slice of the original dataframe.
    df_hist = df_hist[total_range].copy()
    df_hist['Target'] = target_vector


    return (df_hist, 'Target', total_range)


def get_mutual_information(df, target, n_features):
    from sklearn.feature_selection import mutual_info_classif
    # Work on a copy so the caller's dataframe keeps its target column
    df_copy = df.copy()
    
    y = df_copy.pop(target)
    X = df_copy.fillna(0)
    X = np.array(X)

    df_n = pd.DataFrame()
    df_n['MI'] = mutual_info_classif(X, y)
    df_n['Keys'] = df_copy.keys()
    df_n = df_n.sort_values(['MI'], ascending=False)
    # Keep the n_features columns with the highest mutual information.  The target
    # was popped above, so the first row is a real feature and must not be skipped.
    features = np.array(df_n['Keys'][:n_features].tolist())
    
    return features

## Process data from the Kaggle input into Predictive DataFrame Format

This reads the Kaggle example submission file and pulls out the teams competing in each match.  It then calls grab_kp_data, a function that combines those matchups with the 2019 statistics for each team and creates the file the model uses for inference in the next round.
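
As a small illustration of the first step, each row ID in the Kaggle file has the form YEAR_ID1_ID2.  The sketch below splits one such ID and looks up both team names; the function name parse_matchup_id and the two-team mapping are hypothetical stand-ins for the lookup built from Teams.csv.

```python
def parse_matchup_id(matchup_id, team_names):
    """Split a 'YEAR_ID1_ID2' string and look up both team names."""
    year, id1, id2 = matchup_id.split('_')
    return int(year), team_names[int(id1)], team_names[int(id2)]


# Hypothetical ID-to-name mapping, standing in for Teams.csv
team_names = {1001: 'Team A', 1002: 'Team B'}

result = parse_matchup_id('2019_1001_1002', team_names)
print(result)
```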

In [0]:
def process_data(df_team, df_input, round_t, curr_year, verbose):
    '''
    Pass in results of the prediction and output the file used for prediction.
    
    Returns df_pred, the official predictive dataframe.  We will predict the results of the games using the model
    trained on years of tournament games.
    '''
     # Lists that we'll use    
    team1 = []
    team2 = []
    
    # Map each TeamID to its TeamName, pairing the two columns row by row
    team_dict = dict(zip(df_team['TeamID'], df_team['TeamName']))
    
    # Below, we go through the Kaggle submission file and we separate out the year and the two team
    # ID values.  Then we go into the dictionary to pull out the names from the IDs.
        
    input_val = df_input['ID'].astype(str)
    
    for i, val in enumerate(input_val):
        year, id1, id2 = val.split(sep='_')
        name1 = team_dict[int(id1)]
        name2 = team_dict[int(id2)]
        team1.append(name1)
        team2.append(name2)
            
    # Dataframes that we'll use

    df_prediction = pd.DataFrame()   
       
    # Output Filename
    pred_file = ('pred_%s_out.csv' % round_t)

    #
    # Create the DataFrame of matchups that grab_kp_data needs.  Despite the column
    # names, WTeamID and LTeamID hold team names here, not integer IDs.
    df_prediction['WTeamID'] = team1
    df_prediction['LTeamID'] = team2
    df_prediction['Season'] = curr_year
    
    if verbose:
        print("This is the dataframe we end up routing to the play_games function.")
        print(df_prediction)
        
    #
    # Send info on the teams from the Kaggle example file over to the function
    #   that creates the predictive dataframe for each match (team1 vs. team2)
    df_pred = grab_kp_data(df_prediction, curr_year)
    df_pred.to_csv(pred_file)
    return df_pred

def grab_kp_data(df_winlose, curr_year):
    '''
    Here we take the actual basketball data for the teams playing each other and build out the
    predictive dataframe.  I have a strange practice of flipping the teams so that each team is
    in the Team1 spot once.  This seems to make a difference.
    
    df_winlose:  this is the prediction dataframe that captures the teamnames from the
                 Kaggle example submission file ID's.
    '''
    games = []  # One single-row dataframe per matchup
    
    filename_kp_data = 'kenpom_2019.csv'
    
    ncaa_data = pd.read_csv(filename_kp_data, parse_dates=True) # Open Up Latest Kenpom File
    
    ncaa_data = ncaa_data[ncaa_data['Season']==curr_year]  # We only want the current year's data
        
    for i, row in df_winlose.iterrows():
        key_team1 =  ("%s%s" % (row['Season'], row['WTeamID'])) # Key to use to enter the NCAA KP Data
        key_team2 = ("%s%s" % (row['Season'], row['LTeamID'])) # Key for the second team
                
        team1_filter = ncaa_data['KeyField']==(key_team1) # Filter to grab the first team's data from the KP Data
        team2_filter = ncaa_data['KeyField']==(key_team2) # Filter for the second team
               
        team1=ncaa_data[team1_filter]  # Record from current year for Team 1
        team2=ncaa_data[team2_filter] # Record from current year for Team 2
        team2.index = team1.index.copy()  # So they're not on separate lines            
        team1 = team1.add_suffix(".1") # Done so we don't have columns with the same name
        team2 = team2.add_suffix(".2")
        
        game = pd.concat([team1, team2], axis=1)  # This record will contain both teams' stats
        
        games.append(game)
        
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    combined_pred_df = pd.concat(games) if games else pd.DataFrame()
        
    return (combined_pred_df)

# CLASSIFICATION Function

This predicts the probability of a win for the team in the first column (team 1) over the team in the second column (team 2).  Right now it is set up to use the ensemble stacking classifier (through a call to find_best_ensemble), but you could substitute any other classifier with a 'model=XX' assignment.
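
To make the "probability of a win for team 1" concrete, here is a minimal sketch of how a scikit-learn classifier exposes that number.  It uses a plain LogisticRegression on a single made-up feature instead of the stacking classifier; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature (say, a rating difference) and a binary win label
X_train = np.array([[-8.0], [-5.0], [-2.0], [2.0], [5.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_
X_new = np.array([[6.0], [-6.0]])
proba = model.predict_proba(X_new)
win_col = list(model.classes_).index(1)  # column holding P(win == 1)
p_team1_wins = proba[:, win_col]
print(p_team1_wins)
```

With labels 0 and 1 the win column is index 1, which is why the function below can slice with [:,1].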

In [0]:
def run_classification(df_hist, df_predict, round_t):
    from sklearn import linear_model
    from sklearn.neural_network import MLPClassifier 
    
    # Local settings.  Note that os_flag here shadows the global of the same name.
    os_flag = False  # Oversampling
    drop_seed = False # Should we drop the SEED feature?    
    
    target = 'win'  # Column header that contains our binary target

    best_features = []

    df_test_train = df_hist
    
    #C_values = [0.001, 0.01, 0.05, 0.1, 1, 10]  # For Logistic Regression CV    
    #model = linear_model.LogisticRegressionCV(Cs=C_values, cv=5, penalty='l2', scoring='roc_auc', fit_intercept=False)
    #model = MLPClassifier(solver='lbfgs', early_stopping = True, hidden_layer_sizes=(64,32,8))
    
    '''
    First, process the training data set.
    '''
    
    df_train, Name = prep_dataframe(df_test_train, 'train', target, drop_seed)
    df_train, target, best_features = feature_select(target, df_train, mi_flag)
    
    #
    # Split data: determine the percentage of data that goes into the test set
    X_train, X_test, y_train, y_test, X, y = split_data(df_train, 0.20, target, os_flag)
    
    '''
    Call the ensemble function to return the stacking model that we will use.
    '''
    model = find_best_ensemble(X,y)
    
    #    
    # Fit the training set data.    
    model.fit(X_train, y_train)
    
    expected = y_test # This is the values of the Target that we 'know' from the historical data
    predicted = model.predict(X_test) # These are the Targets we predict (to compare to the expected values later)
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
    #pytools.roc(model, X_train, y_train, target, 'model_kaggle.png')
    
    '''
    Now we use the trained model to predict how this season's games will go.
    '''        
    
    df_pred, Name1, Name2 = prep_dataframe(df_predict, 'predict', target, drop_seed)
    df_pred = df_pred[best_features]  # Align predictive data with the training data features
    X_predict, y_bogus = split_pred_data(df_pred, 0, target)  # Do scaling of data to match training data
    
    prob_in_class = model.predict_proba(X_predict)[:,1] # Grab the probability of team 1 beating team 2
    
    return (Name1, Name2, prob_in_class, model)

## Build the Stacking Classifier

Here we show how to use sklearn classifiers and mlxtend's nice stacking classifier (http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/) to return a stacking classifier model to the run_classification function.

This is a much simpler way of building stacking classifiers than the way I did it in the past.

My strategy for picking classifiers for a stacker is generally to select models that are dissimilar from one another; for example, a Gaussian Process using an RBF kernel is very different from KNN.

In this case, the stacker lowers the log-loss score significantly compared with a standalone Multi-layer Perceptron or an optimized Logistic Regression.
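
For readers without mlxtend installed, the same idea can be sketched with scikit-learn's built-in StackingClassifier.  This is a stand-in and not the mlxtend StackingCVClassifier used below; the synthetic data from make_classification and the choice of base models are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the tournament features
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)

# Deliberately dissimilar base models: distance-based, tree-based, and linear
base_models = [
    ('knn', KNeighborsClassifier(n_neighbors=3)),
    ('rf', RandomForestClassifier(n_estimators=25, random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
]

# The meta-classifier is trained on the base models' out-of-fold probabilities
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           stack_method='predict_proba',
                           cv=5)

scores = cross_val_score(stack, X, y, cv=3, scoring='roc_auc')
print("ROC AUC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```

Comparing this score against each base model's own cross-validated score is a quick way to check whether stacking is actually helping on a given dataset.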

In [0]:
def find_best_ensemble(X,y):
    '''
    Currently I'm not doing anything to actually find the optimal set of models for a stacker.  But I intend to do
    this soon, hence the name of this function.  For now, the function prints cross-validated scores
    for each base classifier and for the stacker, and then returns the stacker model.
    
    Returns:  model
    '''
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from mlxtend.classifier import StackingCVClassifier
    from sklearn.neural_network import MLPClassifier 
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.gaussian_process.kernels import RBF
    import numpy as np
    
    kernel_GP = 1.0 * RBF(1.0) # Kernel for the Gaussian Process Classifier
    C_values = [0.001, 0.01, 0.05, 0.1, 1, 10] # For the Logistic Regression C value
    
    #
    # Here are a couple of different models that could be substituted for the stacking classifier
    #clf1 = AdaBoostClassifier(n_estimators=285, learning_rate=0.001)
    #clf3 = MLPClassifier(solver='lbfgs', early_stopping = True, hidden_layer_sizes=(16,32,8))
    
    #
    # Classifiers to use in a Stacking Classifier
    clf3 = KNeighborsClassifier(n_neighbors = 3)
    clf2 = RandomForestClassifier(n_estimators=144, criterion='gini', bootstrap=False, min_samples_leaf = 7, 
               min_samples_split = 5, max_features = 10, max_depth = None)
    clf1 = GaussianProcessClassifier(kernel=kernel_GP, random_state = 0)    
    clf4 = LogisticRegressionCV(Cs=C_values, cv=5, penalty='l2', scoring='roc_auc', fit_intercept=False)
    #
    # Meta-Classifier below for classifying across the Above Classifiers
    lr   = LogisticRegressionCV(Cs=C_values, cv=3, scoring = 'roc_auc')
    #
    # Set up the Stacking Classifier
    sclf = StackingCVClassifier(classifiers=[clf3, clf1, clf2, clf4],
                          use_probas=True,
                          meta_classifier=lr)

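    print('5-fold cross validation:\n')

    # The classifiers are listed in the same order as their labels.  The scores are
    # ROC AUC values (scoring='roc_auc'), even though the printed text says "Accuracy".
    for clf, label in zip([clf3, clf2, clf1, clf4, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Gaussian Process',
                       'Logistic Regression CV',
                       'StackingCVClassifier']):

        scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=5, scoring='roc_auc')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
              % (scores.mean(), scores.std(), label))
        
    return sclf  # Return the stacker explicitly instead of relying on the loop variable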

## Handle the Results of the Played Games

play_games: sets up the data and runs the classification function.  It plays each game as many times as desired and then takes the mean and standard deviation of the resulting win probabilities.  Set the number of iterations with n_iter in the main block.  It calls run_classification, which does the bulk of the machine learning; in this case, it is set up to build a stacking classifier.

determine_win: takes the mean probability of winning and builds the dataframe for the Kaggle output.
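
A minimal sketch of the mean and standard deviation step, on made-up probabilities from three runs.  It restricts the statistics to the per-run columns so that the team names, and the mean itself, do not feed into the result.  The team names and values are hypothetical.

```python
import pandas as pd

# Hypothetical win probabilities from three runs for two matchups
runs = pd.DataFrame({
    'Team1': ['Team A', 'Team C'],
    'Team2': ['Team B', 'Team D'],
    0: [0.70, 0.40],
    1: [0.80, 0.50],
    2: [0.75, 0.45],
})

iter_cols = [0, 1, 2]  # only the per-run probability columns
runs['mean'] = runs[iter_cols].mean(axis=1)
runs['std'] = runs[iter_cols].std(axis=1)  # sample std (ddof=1), the pandas default
print(runs[['Team1', 'Team2', 'mean', 'std']])
```

A small standard deviation here suggests the runs agree with each other; it says nothing about whether the mean probability is well calibrated.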


In [0]:
def play_games(pred_df, round_t, df_hist_train, n_iter, verbose ):
    '''
    This is the executive for the gameplaying feature.
    '''
    combined_df = pd.DataFrame()
    first_time = True
          
    for n in range (0, n_iter):
        Team1, Team2, Prediction, modelinfo = run_classification(df_hist_train, pred_df, round_t)
        if first_time:
            combined_df['Team1'] = Team1.values
            combined_df['Team2'] = Team2.values
        first_time = False
        combined_df[n] = np.round(Prediction,3)
    
    # Summarize only the per-iteration columns, so that the team names and the
    # mean column itself stay out of the statistics
    iter_cols = list(range(n_iter))
    combined_df['mean'] = combined_df[iter_cols].mean(axis = 1)
    combined_df['std'] = combined_df[iter_cols].std(axis = 1)
    combined_df.to_csv('NCAA_game_pred-%s-lasso.csv' % round_t)
    
    if verbose:
        print ("Here are the win probabilities from the play_games function for %s iterations" % n_iter)
        print (combined_df[['Team1', 'Team2', 'mean', 'std']].sort_values(['mean'], ascending = False))
    return combined_df, modelinfo


def determine_win(df, round_t, verbose):
    id_val = []  # List to hold the integer ID for each team
    team_dict = {}
    
    teamfile = 'Teams.csv'
    outfile = ('SampleSubmissionStage2_%s.csv' % round_t) # Save off output in Kaggle Format
    df_team = pd.read_csv(teamfile) # to create dictionary relating teams and IDs
    
    df_kaggle = pd.DataFrame() # Final Kaggle Output Format
    
    #
    # Build the team dictionary, pairing each TeamName with its TeamID row by row
    team_dict = dict(zip(df_team['TeamName'], df_team['TeamID']))
    
    # Need to map into the dictionary the right way.
    # Due to multiple spellings.
    
    id1 = df['Team1'].map(team_dict)
    id2 = df['Team2'].map(team_dict)
    
    df['id1'] = id1
    df['id2'] = id2
    for i, row in df.iterrows():
        id_val.append(('2019_%s_%s' % (row['id1'], row['id2'])))
    
    df['prob1'] = df['mean']
    
    if verbose:
        print("Team1, Team1 ID, Team2, Team2 ID, and Team1's mean win probability (first 35 games):")
        print(df[['Team1', 'id1', 'Team2', 'id2','prob1']][0:35])        
    
    df_kaggle['ID'] = id_val
    df_kaggle['Pred'] = df['prob1']
    df_kaggle.to_csv(outfile, index=False)  # Write only the ID and Pred columns
    


# MAIN function



In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, BayesianRidge, LassoCV, RidgeCV, LassoLarsCV, MultiTaskLassoCV, ElasticNetCV
from sklearn.svm import SVR
from sklearn.model_selection import KFold
import pytools
import logging
import warnings
warnings.filterwarnings("ignore")

# Input Files
filename_train = 'historical_tourney_data_2019.csv' # This is the historical dataset from years of playoffs results
input_filename = 'SampleSubmissionStage2-temp.csv'  # Kaggle's example format file
teamfile = 'Teams.csv' # List mapping Team Names and Team Integer ID's

# Logging setup.  This will go in the path chosen at the top of this notebook.
logname = 'ncaa_logfile.txt'

logger = logging.getLogger('NCAA Predictor')
logger.setLevel(logging.DEBUG)
fh = logging.FileHandler(logname)
fh.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info("NCAA tourney output RUN Starts here")

# Global settings, read by the helper functions above.  (A "global" statement has no
# effect at module level, so plain assignments are enough.)
mi_flag = False     # Use mutual information instead of correlation
os_flag = False     # Use random oversampling to address imbalance in the target
minmax_flag = True  # Use the MinMax Scaler instead of MaxAbs
all_feat = True     # Use all numerical features

curr_year = 2019 # change every year
n_iter = 5
round_t = 0

logger.info("mi_flag %s * os_flag %s * minmax_flag %s * all_feat %s\n" % 
             ( mi_flag, os_flag, minmax_flag,  all_feat))

df_hist_train = pd.read_csv(filename_train) # Historical NCAA Tournament match data
df_input      = pd.read_csv(input_filename) # Dataframe from the Kaggle Example Submission File    
df_team       = pd.read_csv(teamfile)
   
verbose = False
pred_df = process_data(df_team, df_input, round_t, curr_year, verbose)        # Dataframe of games to predict
combined, model = play_games(pred_df, round_t, df_hist_train, n_iter, verbose)  # Run games for this year's teams
    
logger.info("Model Info: %s" % model)

verbose = True
determine_win(combined, round_t, verbose)   # Determine the results and format for Kaggle
    

5-fold cross validation:

Accuracy: 0.64 (+/- 0.03) [KNN]
Accuracy: 0.76 (+/- 0.03) [Random Forest]
Accuracy: 0.75 (+/- 0.02) [Gaussian Process]
Accuracy: 0.77 (+/- 0.02) [Logistic Regression CV]
Accuracy: 0.77 (+/- 0.03) [StackingCVClassifier]
              precision    recall  f1-score   support

           0       0.70      0.75      0.72        72
           1       0.78      0.74      0.76        87

   micro avg       0.74      0.74      0.74       159
   macro avg       0.74      0.74      0.74       159
weighted avg       0.74      0.74      0.74       159

[[54 18]
 [23 64]]
5-fold cross validation:

Accuracy: 0.64 (+/- 0.03) [KNN]
Accuracy: 0.76 (+/- 0.03) [Random Forest]
Accuracy: 0.75 (+/- 0.03) [Gaussian Process]
Accuracy: 0.77 (+/- 0.02) [Logistic Regression CV]
Accuracy: 0.77 (+/- 0.03) [StackingCVClassifier]
              precision    recall  f1-score   support

           0       0.69      0.60      0.64        84
           1       0.61      0.71      0.65        75



In [16]:
%ls /content
%cp SampleSubmissionStage2_0.csv /content/drive/'My Drive'/.

drive/                            __pycache__/
historical_tourney_data_2019.csv  pytools.py
kenpom_2019.csv                   sample_data/
NCAA_game_pred-0-lasso.csv        SampleSubmissionStage2_0.csv
ncaa_logfile.txt                  SampleSubmissionStage2-temp.csv
pred_0_out.csv                    Teams.csv
