# Before I start off

Kudos to @theoviel (https://www.kaggle.com/theoviel/ncaa-starter-the-simpler-the-better) for the reference code notebook.

I am trying to learn how to solve these problems, and simple implementations like this one help me practice my Python and give me something to fall back on if I get lost.

# Initial setup

Kaggle setup for listing directories and files available

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing important libraries

Import statements for all libraries used throughout the code

In [None]:
# Importing required libraries

import os
import re
import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from itertools import product, combinations

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score # Used to validate the predictions of the per-season models


## Set useful options 

Setting display options for pandas dataframe to make visualization easier in notebook

In [None]:
# Setting important options to make visualization easier

pd.set_option('display.max_rows', 400)
pd.set_option('display.max_columns', None)  # None shows every column
pd.set_option('display.max_colwidth', 40)

warnings.filterwarnings("ignore")


# Importing datasets

Import required datasets for stage 1 competition

We will mainly focus on results and seeds for base set of predictions


In [None]:
base_path = "/kaggle/input/ncaam-march-mania-2021/MDataFiles_Stage1/"

for filename in os.listdir(base_path):
    print(filename)

In [None]:
# Import basic datasets to get started on a base model
# We will skip using the regular tournament data for the base model
# MTeams.csv, MSeasons.csv, MNCAATourneySeeds.csv, MRegularSeasonCompactResults.csv, MNCAATourneyCompactResults.csv, MSampleSubmissionStage1.csv

teams = pd.read_csv(base_path + "MTeams.csv")
seasons = pd.read_csv(base_path + "MSeasons.csv")
seeds = pd.read_csv(base_path + "MNCAATourneySeeds.csv")
results = pd.read_csv(base_path + "MNCAATourneyCompactResults.csv")
regular_results = pd.read_csv(base_path + "MRegularSeasonCompactResults.csv")

sample = pd.read_csv(base_path + "MSampleSubmissionStage1.csv")

list_datasets = [teams, seasons, seeds, results, regular_results, sample]

string_list_datasets = ['teams', 'seasons', 'seeds', 'results', 'regular_results', 'sample']

## Visualize the datasets

Print top 5 rows to get general idea of datasets

Also, generate summary of datasets for few preliminary observations

In [None]:
for i, dataset in enumerate(list_datasets):
    print(string_list_datasets[i])
    print("\n")
    print(dataset.head())
    print("\n")
    print(dataset.tail())
    print("\n\n\n")

In [None]:
for i, dataset in enumerate(list_datasets):
    print(string_list_datasets[i])
    display(dataset.describe())

### Dataset comments

0. regular_results:
    - Has regular season (non-NCAA tournament) results from 1985 until 2019
    - We can use this to create a few metrics that support our model. The features can be similar to the ones made for the NCAA tournament, as the data format is the same
    - WLoc and NumOT can be safely removed: they might create significant bias in the model, and we have no way of predicting whether a team will run into overtime
1. results:
    - Has NCAA tournament results from 1985 until 2019
    - We can directly use this to evaluate our model, since the task is to predict 2015 to 2019 matches
    - WLoc and NumOT can be safely removed for the same reasons as above
2. teams:
    - Has team IDs along with the first and last seasons each team played in Division I (`FirstD1Season`, `LastD1Season`)
    - Can be used to calculate the age of teams. It might help our model factor the experience of a team into the results
3. seasons:
    - Has the season start dates and the region names
    - We can safely ignore this dataset, as the results already have a standardized date variable
4. seeds:
    - Has the seed information for the teams per season
    - Since it directly correlates with the round of play, we may provide it as a factor which can account for player exhaustion
5. sample:
    - Sample submission
    - We can use this file as the base for creating our model dataframe and test models on it

# Preprocessing and creating base lists

We will use the datasets to understand the participating teams, regions, etc., and create the submission file based on all possible team combinations for predictions

## Base list creation

We need to predict for 2015 to 2019 seasons.
We can create a dataset containing all years, all team combinations on which we can keep adding features
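
As a minimal sketch of that idea, the snippet below builds one row per season and unordered team pair, with an ID in the `Season_TeamA_TeamB` shape that the sample submission uses. The years and team IDs are made up for illustration, and putting the lower team ID first is an assumption about the submission format.

```python
from itertools import combinations, product

import pandas as pd

# Toy values for illustration only; the notebook derives these from the CSVs.
toy_years = [2015, 2016]
toy_teams = [1101, 1102, 1103]

# Sort the IDs so each pair has the lower TeamID first (assumed ID convention).
toy_pairs = list(combinations(sorted(toy_teams), 2))

rows = [
    {"ID": f"{year}_{a}_{b}", "Season": year, "TeamID_A": a, "TeamID_B": b}
    for year, (a, b) in product(toy_years, toy_pairs)
]
toy_base = pd.DataFrame(rows)
print(toy_base)
```

With 2 seasons and 3 teams this gives 2 x 3 = 6 rows, which is a quick sanity check to repeat on the real lists.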

In [None]:
all_teams = teams['TeamID'].unique()
all_seeds = seeds['Seed'].unique()
all_years = results['Season'].unique()

print("# of seeds: n = %d\n" % len(all_seeds))

print("# of teams: n = %d\n" % len(all_teams))

print("years: %d to %d \n" % (min(all_years),max(all_years)))

# Create base file with IDs in required format

base_team_matchups = list(combinations(all_teams, 2))

print("Combinations array: \n")
print(base_team_matchups[:10])  # first 10 matchups


# EDA and feature creation

Main focus is on using results table to generate basic variables

Following variables will be created:
1. Avg. wins/loss per team per season
2. Avg. Score for win/lose team per team per season
3. Avg. Score gap

## Generate similar variables based on regular season results

We will use data for matches played before the actual NCAA tournaments to compute similar features

In [None]:
# We will focus on just 1 main dataset for the first pass at a baseline, i.e. the regular season results

df = regular_results

Create a score gap variable by Winning score - Losing score

In [None]:
# Create a score gap variable

df['ScoreGap'] = df['WScore'] - df['LScore']

### Function for quick rollups

In [None]:
def rollup_df(base_df,rollup_cols,aggregation):
    temp = base_df.groupby(rollup_cols).agg(aggregation)
    temp = temp.reset_index()
    return temp

Get # of wins, avg. score and avg. score gap for each set of winning and losing teams
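
To see what the rollup produces, here is a small sketch on a made-up game log. It redefines the `rollup_df` helper so it runs on its own; the three games and their scores are invented for illustration.

```python
import pandas as pd

def rollup_df(base_df, rollup_cols, aggregation):
    """Group by rollup_cols, apply the aggregation dict, and flatten the index."""
    return base_df.groupby(rollup_cols).agg(aggregation).reset_index()

# Toy game log: three games in one season, made up for illustration.
toy_games = pd.DataFrame({
    "Season": [2019, 2019, 2019],
    "DayNum": [10, 12, 15],
    "WTeamID": [1101, 1101, 1102],
    "WScore": [80, 70, 65],
    "LTeamID": [1102, 1103, 1103],
    "LScore": [70, 64, 60],
})
toy_games["ScoreGap"] = toy_games["WScore"] - toy_games["LScore"]

# Counting DayNum per (Season, WTeamID) gives the number of wins.
toy_wins = rollup_df(
    toy_games,
    ["Season", "WTeamID"],
    {"DayNum": "count", "WScore": "mean", "ScoreGap": "mean"},
).rename(columns={"WTeamID": "TeamID", "DayNum": "num_wins",
                  "WScore": "WScore_avg", "ScoreGap": "WScoreGap_avg"})
print(toy_wins)
```

Team 1101 won twice, by 10 and by 6, so its row should show `num_wins` of 2 and an average winning gap of 8.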

In [None]:
# Get number of wins and losses per year per team

num_wins = rollup_df(df,['Season','WTeamID'],{'DayNum':'count','WScore':'mean','ScoreGap':'mean'})

num_wins = num_wins.rename(columns = {'WTeamID':'TeamID','DayNum':'num_wins','WScore':'WScore_avg','ScoreGap':'WScoreGap_avg'})

num_loss = rollup_df(df,['Season','LTeamID'],{'DayNum':'count','LScore':'mean','ScoreGap':'mean'})

num_loss = num_loss.rename(columns = {'LTeamID':'TeamID','DayNum':'num_loss','LScore':'LScore_avg','ScoreGap':'LScoreGap_avg'})


In [None]:
num_loss

# Create base dataframe with all teams and the created variables

We remove the distinction between winning and losing teams and join all features.
Since the same team can have both won and lost matches in a season, we deduplicate to ensure we have 1 entry per team per season
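
A minimal sketch of the deduplication, followed by the left joins used in the next step, on made-up tables. Team 1102 appears as both a winner and a loser, and team 1103 never won, so it shows how the missing side ends up as 0.

```python
import pandas as pd

# Toy per-team tables; team 1102 appears in both, team 1103 only lost.
toy_wins = pd.DataFrame({"Season": [2019, 2019], "TeamID": [1101, 1102], "num_wins": [2, 1]})
toy_loss = pd.DataFrame({"Season": [2019, 2019], "TeamID": [1102, 1103], "num_loss": [1, 2]})

keys = ["Season", "TeamID"]

# Stack both key sets, then keep one row per (Season, TeamID).
toy_base = (
    pd.concat([toy_wins[keys], toy_loss[keys]])
    .drop_duplicates()
    .reset_index(drop=True)
)

# Left joins keep every team; a team with no wins (or no losses) gets NaN, filled with 0.
toy_feat = (
    toy_base.merge(toy_wins, on=keys, how="left")
    .merge(toy_loss, on=keys, how="left")
    .fillna(0)
)
print(toy_feat)
```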


In [None]:
# Create set with all possible combos of season and teams
# Stack winners and losers, then keep one row per (Season, TeamID)

df_feat_merged = pd.concat([num_wins[['Season','TeamID']], num_loss[['Season','TeamID']]])

df_feat_merged = df_feat_merged.drop_duplicates().reset_index(drop = True)

### Join previously created features for each team

Here, we make a single consistent dataset with appropriate wins/loss values attached to each team

In [None]:
df_feat_1 = pd.merge(df_feat_merged,num_wins, on = ['Season','TeamID'], how = 'left')
df_feat_2 = pd.merge(df_feat_1,num_loss, on = ['Season','TeamID'], how = 'left')

df_feat_2 = df_feat_2.fillna(0)

In [None]:
df_feat_2.head()

## Computing features based on wins and losses

We compute the win ratio, i.e. the share of games the team won in the specific season

Also, we will average out the score gap by weighting the winning score gap by # of wins (and the losing score gap by # of losses) to establish the general score gap a team achieves in each season
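
One reading of the description above, worked through on made-up numbers: a team with 3 wins by 10 points on average and 1 loss by 6 points. The column names follow the notebook; the values are invented.

```python
import pandas as pd

# Toy season summary for one team: 3 wins by 10 on average, 1 loss by 6.
toy = pd.DataFrame({
    "num_wins": [3], "num_loss": [1],
    "WScoreGap_avg": [10.0], "LScoreGap_avg": [6.0],
})

games = toy["num_wins"] + toy["num_loss"]
toy["WinRatio"] = toy["num_wins"] / games

# Margins in wins count positively, margins in losses count negatively.
toy["ScoreGapAvg"] = (
    toy["num_wins"] * toy["WScoreGap_avg"] - toy["num_loss"] * toy["LScoreGap_avg"]
) / games
print(toy[["WinRatio", "ScoreGapAvg"]])
```

Here the win ratio is 3 / 4 = 0.75 and the average gap is (3 * 10 - 1 * 6) / 4 = 6.0 points per game.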

In [None]:
df_feat_final = df_feat_2

df_feat_final['WinRatio'] = df_feat_final['num_wins']/(df_feat_final['num_wins'] + df_feat_final['num_loss'])

# Total margin over all wins and over all losses (average gap * number of games)
df_feat_final['total_win_gap'] = df_feat_final['WScoreGap_avg'] * df_feat_final['num_wins']

df_feat_final['total_lose_gap'] = df_feat_final['LScoreGap_avg'] * df_feat_final['num_loss']

# Signed average margin per game: wins add to it, losses subtract from it
df_feat_final['ScoreGapAvg'] = (df_feat_final['total_win_gap'] - df_feat_final['total_lose_gap'])/(df_feat_final['num_wins'] + df_feat_final['num_loss'])



### Drop columns from dataset

We will drop a few intermediate columns which should not be used as features for the final predictions,
i.e. num_wins, num_loss and the score gap columns

In [None]:
df_feat_final = df_feat_final.drop(['num_wins','num_loss','total_win_gap','total_lose_gap','WScoreGap_avg','LScoreGap_avg'],axis = 1)

In [None]:
display(df_feat_final.head(5))
display(df_feat_final.tail(5))

In [None]:
df_reg_season_feat = df_feat_final.copy()

## Using NCAA compact results to generate features based on the tournament

In [None]:
df = results

df = df.drop(columns = ['WLoc','NumOT'])

Join match seeds as we will use this dataset as base for our predictions

In [None]:
df = pd.merge(
    df,
    seeds,
    how = 'left',
    left_on=['Season','WTeamID'],
    right_on = ['Season','TeamID']).drop(columns = ['TeamID'],axis = 1).rename(columns = {'Seed':'WSeed'})

In [None]:
df = pd.merge(
    df,
    seeds,
    how = 'left',
    left_on=['Season','LTeamID'],
    right_on = ['Season','TeamID']
    ).drop(columns = ['TeamID'],axis = 1).rename(columns = {'Seed':'LSeed'})

Remove the regions from the seeds. We won't need the regions for our model
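
A quick sketch of what the cleaning does, using the same regular expression as the `clean_seed` helper in this notebook. The example strings assume seeds look like a region letter, a two-digit number and an optional play-in letter.

```python
import re

def clean_seed(seed):
    """Drop every non-digit character and return the seed number as an int."""
    return int(re.sub(r"[^0-9]", "", seed))

# Example seed strings, assumed format: region letter + number (+ optional play-in letter).
for raw in ["W01", "X16a", "Z11b"]:
    print(raw, "->", clean_seed(raw))
```

Note that `int()` also drops the leading zero, so "W01" becomes 1.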

In [None]:
def clean_seed(seed):
    return int(re.sub("[^0-9]","",seed))

In [None]:
df['LSeed'] = df['LSeed'].apply(clean_seed)
df['WSeed'] = df['WSeed'].apply(clean_seed)


In [None]:
df.head()

In [None]:
df_reg_season_feat.head()
df_reg_season_feat.columns.values

### Merge the NCAA results with season features we made previously

In [None]:
df = pd.merge(df,
             df_reg_season_feat,
             how = 'left',
             left_on = ['Season','WTeamID'],
             right_on = ['Season','TeamID'],
             ).rename(columns = 
                      {'WScore_avg':'WScore_avg_W', 
                       'LScore_avg':'LScore_avg_W',
                       'WinRatio':'WinRatio_W',
                       'ScoreGapAvg':'ScoreGapAvg_W'}
                     ).drop(['TeamID'],axis = 1)

In [None]:
df = pd.merge(df,
             df_reg_season_feat,
             how = 'left',
             left_on = ['Season','LTeamID'],  # losing team's regular season features
             right_on = ['Season','TeamID'],
             ).rename(columns = 
                      {'WScore_avg':'WScore_avg_L', 
                       'LScore_avg':'LScore_avg_L',
                       'WinRatio':'WinRatio_L',
                       'ScoreGapAvg':'ScoreGapAvg_L'}
                     ).drop(['TeamID'],axis = 1)

In [None]:
df.head()

Since our data always lists the winning team first, we need to create a general dataset in which a team in a given column can be either the winner or the loser.

We will duplicate the dataframe and swap the winning and losing columns for this
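
A minimal sketch of the duplicate-and-swap idea on one made-up game. Only the ID and score columns are included here; the notebook applies the same renaming to the seed and feature columns too.

```python
import pandas as pd

# One toy tournament game in winner/loser layout (values made up).
toy = pd.DataFrame({
    "Season": [2019], "WTeamID": [1101], "WScore": [80],
    "LTeamID": [1102], "LScore": [70],
})

as_winner_first = toy.rename(columns={
    "WTeamID": "TeamID_A", "WScore": "Score_A",
    "LTeamID": "TeamID_B", "LScore": "Score_B",
})
as_loser_first = toy.rename(columns={
    "WTeamID": "TeamID_B", "WScore": "Score_B",
    "LTeamID": "TeamID_A", "LScore": "Score_A",
})

# Stack both views; concat aligns columns by name, so the differing column order is harmless.
toy_final = pd.concat([as_winner_first, as_loser_first], ignore_index=True)
toy_final["Win_A"] = (toy_final["Score_A"] > toy_final["Score_B"]).astype(int)
print(toy_final)
```

Each game now appears twice, once with `Win_A` equal to 1 and once with 0, so the target is balanced by construction.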

In [None]:
# Rename winning team identifiers as A and losing as B
win_rename = {'WTeamID':'TeamID_A',
              'WScore':'Score_A',
              'LTeamID':'TeamID_B',
              'LScore':'Score_B',
              'WSeed':'Seed_A',
              'LSeed':'Seed_B',
              'WScore_avg_W':'WScore_avg_A',
              'LScore_avg_W':'LScore_avg_A',
              'WinRatio_W':'WinRatio_A',
              'ScoreGapAvg_W':'ScoreGapAvg_A',
              'WScore_avg_L':'WScore_avg_B',
              'LScore_avg_L':'LScore_avg_B',
              'WinRatio_L':'WinRatio_B',
              'ScoreGapAvg_L':'ScoreGapAvg_B'}

# Rename losing team identifiers as A and winning as B
lose_rename = {'WTeamID':'TeamID_B',
              'WScore':'Score_B',
              'LTeamID':'TeamID_A',
              'LScore':'Score_A',
              'WSeed':'Seed_B',
              'LSeed':'Seed_A',
              'WScore_avg_W':'WScore_avg_B',
              'LScore_avg_W':'LScore_avg_B',
              'WinRatio_W':'WinRatio_B',
              'ScoreGapAvg_W':'ScoreGapAvg_B',
              'WScore_avg_L':'WScore_avg_A',
              'LScore_avg_L':'LScore_avg_A',
              'WinRatio_L':'WinRatio_A',
              'ScoreGapAvg_L':'ScoreGapAvg_A'}

In [None]:
win_df = df.copy()
win_df = win_df.rename(columns = win_rename)

lose_df = df.copy()
lose_df = lose_df.rename(columns = lose_rename)

final_df = pd.concat([win_df,lose_df], axis = 0).reset_index(drop = True)

# 1 when team A outscored team B, else 0
final_df['Win_A'] = (final_df['Score_A'] > final_df['Score_B']).astype(int)

In [None]:
final_df

### Compute difference between team A and team B

We will compute difference features, as they help us assess whether A is better or worse than B

In [None]:
final_df['SeedDiff'] = final_df['Seed_A'] - final_df['Seed_B']

final_df['ScoreGapDiff'] = final_df['ScoreGapAvg_A'] - final_df['ScoreGapAvg_B']

final_df['WinRatioDiff'] = final_df['WinRatio_A'] - final_df['WinRatio_B']

# Build test/submission dataset

In [None]:
sample.head()

df_test = sample.copy()

Split out the team ID and Season year
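
The same split can be sketched in a single pass with `str.split(..., expand=True)`. This is an alternative to the three `apply` calls, shown on made-up IDs rather than the real submission file.

```python
import pandas as pd

# Toy submission rows; the IDs are illustrative, not taken from the real file.
toy_sub = pd.DataFrame({"ID": ["2015_1101_1102", "2016_1103_1104"]})

# Split once on "_" into three columns, then cast them all to int.
parts = toy_sub["ID"].str.split("_", expand=True).astype(int)
parts.columns = ["Season", "TeamID_A", "TeamID_B"]
toy_sub = pd.concat([toy_sub, parts], axis=1)
print(toy_sub)
```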

In [None]:
df_test['Season'] = df_test['ID'].apply(lambda x: int(x.split('_')[0])) 

df_test['TeamID_A'] = df_test['ID'].apply(lambda x: int(x.split('_')[1])) 

df_test['TeamID_B'] = df_test['ID'].apply(lambda x: int(x.split('_')[2])) 

Join match Seeds

In [None]:
df_test = pd.merge(df_test,
                  seeds,
                  how = 'left',
                  left_on = ['Season','TeamID_A'],
                  right_on = ['Season','TeamID']
                  ).rename(columns = {'Seed':'Seed_A'}).drop(['TeamID'], axis = 1) 

In [None]:
df_test = pd.merge(df_test,
                  seeds,
                  how = 'left',
                  left_on = ['Season','TeamID_B'],
                  right_on = ['Season','TeamID']
                  ).rename(columns = {'Seed':'Seed_B'}).drop(['TeamID'], axis = 1) 

Clean seeds

In [None]:
df_test['Seed_A'] = df_test['Seed_A'].apply(clean_seed)
df_test['Seed_B'] = df_test['Seed_B'].apply(clean_seed)

In [None]:
df_test['SeedDiff'] = df_test['Seed_A'] - df_test['Seed_B']

### Join Season stats

In [None]:
df_test = pd.merge(df_test,
             df_reg_season_feat,
             how = 'left',
             left_on = ['Season','TeamID_A'],
             right_on = ['Season','TeamID'],
             ).rename(columns = 
                      {'WScore_avg':'WScore_avg_A', 
                       'LScore_avg':'LScore_avg_A',
                       'WinRatio':'WinRatio_A',
                       'ScoreGapAvg':'ScoreGapAvg_A'}
                     ).drop(['TeamID'],axis = 1)

In [None]:
df_test = pd.merge(df_test,
             df_reg_season_feat,
             how = 'left',
             left_on = ['Season','TeamID_B'],
             right_on = ['Season','TeamID'],
             ).rename(columns = 
                      {'WScore_avg':'WScore_avg_B', 
                       'LScore_avg':'LScore_avg_B',
                       'WinRatio':'WinRatio_B',
                       'ScoreGapAvg':'ScoreGapAvg_B'}
                     ).drop(['TeamID'],axis = 1)

In [None]:
df_test['SeedDiff'] = df_test['Seed_A'] - df_test['Seed_B']

df_test['ScoreGapDiff'] = df_test['ScoreGapAvg_A'] - df_test['ScoreGapAvg_B']

df_test['WinRatioDiff'] = df_test['WinRatio_A'] - df_test['WinRatio_B']

# Validate model season by season

For each validation season, we train on all earlier seasons, predict that season and check the score. This is a time-ordered split rather than a random k-fold
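
The split logic on its own can be sketched as below. `season_splits` and its `first_val_index` parameter are hypothetical names introduced here for illustration; the toy frame has one row per season.

```python
import pandas as pd

def season_splits(df, first_val_index=10, season_col="Season"):
    """Yield (season, train_mask, val_mask): train on earlier seasons, validate on one."""
    seasons = sorted(df[season_col].unique())
    for season in seasons[first_val_index:]:
        yield season, df[season_col] < season, df[season_col] == season

# Toy frame with one row per season, 2000 to 2004.
toy = pd.DataFrame({"Season": range(2000, 2005), "x": range(5)})
for season, train_mask, val_mask in season_splits(toy, first_val_index=3):
    print(season, "train rows:", int(train_mask.sum()), "val rows:", int(val_mask.sum()))
```

Skipping the first few seasons guarantees that every model has some history to train on, and the training set grows by one season per step.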

In [None]:
print(final_df.columns.values)

print(final_df.Season.unique())


In [None]:
features = [
    'Seed_A',
    'Seed_B',
    'WinRatio_A',
    'ScoreGapAvg_A',
    'WinRatio_B',
    'ScoreGapAvg_B',
    'SeedDiff',
    'WinRatioDiff',
    'ScoreGapDiff']

Take a quick look at the training and test dataframes before modelling

In [None]:
final_df
df_test

In [None]:
seasons = final_df.Season.unique()    

f_score = [] # Store the season and the f-score for predictions.
predictions = [] # Store the predictions on test set based on all intermediate models. We will average the predictions

SEED = 11

for season_yr in seasons[10:]:
    
    model = LogisticRegression(C = 10, random_state=SEED)
    
    std_scaler = StandardScaler()
    
    train_X = final_df.loc[final_df['Season'] < season_yr,features]
    val_X = final_df.loc[final_df['Season'] == season_yr,features]
    
    train_y = final_df.loc[final_df['Season'] < season_yr, 'Win_A']
    val_y = final_df.loc[final_df['Season'] == season_yr, 'Win_A']
    
    train_X = std_scaler.fit_transform(train_X)
    val_X = std_scaler.transform(val_X)
    test_X = std_scaler.transform(df_test[features])
    
    model.fit(train_X, train_y)
    
    pred_y = model.predict(val_X)
    
    f_score.append([season_yr, f1_score(val_y, pred_y)])
    
    test_y = model.predict_proba(test_X)[:,1]
    
    predictions.append(test_y)
    
predictions = np.vstack(predictions).transpose()

In [None]:
exp_df = pd.DataFrame(predictions)

exp_df.to_csv('./temp.csv')

In [None]:
f_score
#pred_y
#predictions.shape


In [None]:
results = np.mean(predictions,axis = 1)
submission = df_test.loc[:,['ID']]

submission['Pred'] = results

submission.to_csv('submission.csv', index=False)
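
To double check the averaging step, here is a small sketch with made-up probabilities from three models for four matchups. It follows the same stack, transpose and mean-over-columns pattern as the cells above.

```python
import numpy as np

# Toy win probabilities from three models for four matchups (made-up numbers).
per_model = [
    np.array([0.9, 0.4, 0.5, 0.2]),
    np.array([0.7, 0.6, 0.5, 0.1]),
    np.array([0.8, 0.5, 0.5, 0.3]),
]

# Stack to (n_models, n_rows), transpose to (n_rows, n_models), average across models.
stacked = np.vstack(per_model).transpose()
averaged = stacked.mean(axis=1)
print(stacked.shape, averaged)
```

The result has one averaged probability per matchup, which is the shape the `Pred` column needs.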