# NCAA Women's, Gradient Boosted Trees (XGBoost)

## Work In Progress, Stay Tuned...
**Notebook Strategy...**

Hello! After studying multiple notebooks, I want to describe the strategy I will follow.
These are the steps:

* Load the Seed, Season and Tournament Results information.
* Build aggregated features by Team and Season from the Season Results dataset.
* Merge the Seed data and the aggregated features into the Tournament Results dataset by Team and Season.
* Use the Tournament Results dataset, which holds the outcome of each game, to construct a target variable (did they win?).
* The target variable is the outcome of the game for Team A: Win = 1, Loss = 0.
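
As a rough illustration of the steps above, here is a minimal sketch on a tiny made-up results table. The data and the names `toy_season`, `toy_tourney`, `win_ratio`, and `relabel` are hypothetical; they only mirror the `Season` / `WTeamID` / `LTeamID` layout of the compact results files. As a simplification, the target is assigned directly instead of being derived from a score difference.

```python
import pandas as pd

# Hypothetical regular-season results: one row per game.
toy_season = pd.DataFrame({
    "Season":  [2021, 2021, 2021, 2021],
    "WTeamID": [1, 1, 2, 3],
    "LTeamID": [2, 3, 3, 2],
})

# Aggregate by team and season (here, only a win ratio).
keys = ["Season", "TeamID"]
wins = toy_season.groupby(["Season", "WTeamID"]).size().rename("NumWins").rename_axis(keys)
losses = toy_season.groupby(["Season", "LTeamID"]).size().rename("NumLosses").rename_axis(keys)
win_ratio = pd.concat([wins, losses], axis=1).fillna(0).reset_index()
win_ratio["WinRatio"] = win_ratio["NumWins"] / (win_ratio["NumWins"] + win_ratio["NumLosses"])

# Merge the aggregated feature into a hypothetical tournament game, once per side.
toy_tourney = pd.DataFrame({"Season": [2021], "W_TeamID": [1], "L_TeamID": [2]})
feats = win_ratio[["Season", "TeamID", "WinRatio"]]
for side in ["W", "L"]:
    toy_tourney = (
        toy_tourney
        .merge(feats, how="left", left_on=["Season", f"{side}_TeamID"], right_on=["Season", "TeamID"])
        .drop(columns="TeamID")
        .rename(columns={"WinRatio": f"{side}_WinRatio"})
    )

def relabel(df, winner_as, loser_as):
    """Rename W_/L_ prefixed columns to the given team labels."""
    mapping = {}
    for col in df.columns:
        if col.startswith("W_"):
            mapping[col] = winner_as + col[1:]
        elif col.startswith("L_"):
            mapping[col] = loser_as + col[1:]
    return df.rename(columns=mapping)

# Each game appears twice: once with the winner as Team A, once as Team B.
a_won = relabel(toy_tourney, "A", "B").assign(A_Win=1)
a_lost = relabel(toy_tourney, "B", "A").assign(A_Win=0)
games = pd.concat([a_won, a_lost], ignore_index=True, sort=False)
games["WinRatioDiff"] = games["A_WinRatio"] - games["B_WinRatio"]
print(games[["A_TeamID", "B_TeamID", "WinRatioDiff", "A_Win"]])
```

Mirroring every game this way keeps the target balanced, since each win for Team A has a matching loss row with the teams swapped.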

**Notebook Updates**

1. 02/24/2022: First baseline model constructed using an XGBoost classifier.
2. 02/26/2022: Converted the Notebook to the NCAA Women's version.
3. 03/11/2022: Added new aggregated features, added ranking metrics.

**Notebooks Used**
* https://www.kaggle.com/theoviel/using-last-year-s-2nd-place
* https://www.kaggle.com/theoviel/ncaa-starter-the-simpler-the-better

Both notebooks are by **theoviel**, so many thanks to the author.

**Experiment Summary**
...


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import optuna # import hyperparam optimization libraries

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import the regular expressions library, used later to parse the seed strings
import re # Regular expression

In [None]:
%%time
# I like to disable my Notebook Warnings, so the execution is cleaner.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook configuration, more useful with massive datasets...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

# Reading the Datasets

In [None]:
%%time
# Read the required datasets.
season_data = pd.read_csv('/kaggle/input/womens-march-mania-2022/WDataFiles_Stage2/WRegularSeasonCompactResults.csv')
seeds = pd.read_csv('/kaggle/input/womens-march-mania-2022/WDataFiles_Stage2/WNCAATourneySeeds.csv')
ranking_system = pd.read_csv("../input/ncaa-women-538-team-ratings/538ratingsWomen.csv")

In [None]:
%%time
season_data.info()

In [None]:
%%time
season_data.head()

In [None]:
%%time
season_data.describe()

# Creating Features for the Model.

In [None]:
%%time
def score_gap(df):
    """
    Calculates the difference of scores between the Winner and Losing Teams...
    """
    df['ScoreGap'] = df['WScore'] - df['LScore']
    return df

In [None]:
%%time
season_data = score_gap(season_data)

In [None]:
%%time
season_data.head()

In [None]:
%%time
def create_team_list(df, group_list = ['Season', 'WTeamID'], team_id = 'WTeamID'):
    """
    Creates a feature-less list of (Season, TeamID) pairs for one side (winners or losers) to merge data back onto...
    """
    group = df.groupby(group_list).count().reset_index()
    group = group[group_list].rename(columns={team_id: "TeamID"})
    return group

In [None]:
%%time
winners = create_team_list(season_data, group_list = ['Season', 'WTeamID'], team_id = 'WTeamID')
lossers = create_team_list(season_data, group_list = ['Season', 'LTeamID'], team_id = 'LTeamID')

# Build the base table of every (Season, TeamID) pair; aggregated features are merged onto it later.
team_agg_features = pd.concat([winners, lossers], axis = 0).drop_duplicates().sort_values(['Season', 'TeamID']).reset_index(drop = True)

In [None]:
%%time
# Display the first few rows of the dataset.
team_agg_features.head()

In [None]:
%%time
# Creating aggregated features...
def winner_aggregated_features(df, group_list = ['Season', 'WTeamID']):
    '''
    Create multiple aggregated features for the Winner Team...
    '''
    tmp = df.groupby(group_list).agg(NumWins       = ('WTeamID', 'count'), 
                                     AvgWinsGap    = ('ScoreGap', 'mean'),
                                     W_TotalPoints = ('WScore', 'sum'),
                                     W_MaxPoints   = ('WScore', 'max'),
                                     W_MinPoints   = ('WScore', 'min'),
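                                     # Mean absolute deviation, written out because the 'mad' alias is unavailable in pandas 2.x
                                     W_MadPoints   = ('WScore', lambda s: (s - s.mean()).abs().mean()),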
                                    )
    tmp = tmp.reset_index()
    tmp = tmp.rename(columns={"WTeamID": "TeamID"})
    
    return tmp

In [None]:
%%time
# Creating aggregated features...
def losser_aggregated_features(df, group_list = ['Season', 'LTeamID']):
    '''
    Create multiple aggregated features for the Loser Team...
    '''
    tmp = df.groupby(group_list).agg(NumLosses       = ('LTeamID', 'count'), 
                                     AvgLossesGap    = ('ScoreGap', 'mean'),
                                     L_TotalPoints   = ('LScore', 'sum'),
                                     L_MaxPoints     = ('LScore', 'max'),
                                     L_MinPoints     = ('LScore', 'min'),
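                                     # Mean absolute deviation of the losing team's own score (LScore, not WScore),
                                     # written out because the 'mad' alias is unavailable in pandas 2.x
                                     L_MadPoints     = ('LScore', lambda s: (s - s.mean()).abs().mean()),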
                                    )
    tmp = tmp.reset_index()
    tmp = tmp.rename(columns={"LTeamID": "TeamID"})
    return tmp

In [None]:
%%time
winner_team_aggregation = winner_aggregated_features(season_data)
losser_team_aggregation = losser_aggregated_features(season_data)

In [None]:
%%time
def merge_back(df):
    '''
    Merge the winner and loser aggregations back onto the team list, keyed on Season and TeamID
    '''
    df = df.merge(winner_team_aggregation, on = ['Season', 'TeamID'], how = 'left')
    df = df.merge(losser_team_aggregation, on = ['Season', 'TeamID'], how = 'left')
    df.fillna(0, inplace = True) 
    return df

In [None]:
%%time
team_agg_features = merge_back(team_agg_features)

In [None]:
%%time
def calculate_features(df):
    '''
    Calculate some new features based on Aggregated Features...
    '''
    df['WinRatio'] = df['NumWins'] / (df['NumWins'] + df['NumLosses'])
    df['AvgScoreGap'] = ((df['NumWins'] * df['AvgWinsGap'] - df['NumLosses'] * df['AvgLossesGap']) / (df['NumWins'] + df['NumLosses']))
    df['PointsRatio'] = df['W_TotalPoints'] / (df['L_TotalPoints'] + df['W_TotalPoints'])
    df['MadScoreGap'] = df['W_MadPoints'] - df['L_MadPoints'] 
    return df

team_agg_features = calculate_features(team_agg_features)

In [None]:
team_agg_features.head()

In [None]:
%%time
ranking_system = ranking_system.rename(columns = {'538rating': 'Rating'})
ranking_system.drop('TeamName', axis = 1, inplace = True)
ranking_system.head()

In [None]:
%%time
team_agg_features = team_agg_features.merge(ranking_system, on = ['Season', 'TeamID'], how = 'left')

In [None]:
team_agg_features

In [None]:
%%time
team_agg_features = team_agg_features[['Season','TeamID', 'WinRatio', 'AvgScoreGap','PointsRatio', 'MadScoreGap', 'Rating']]

In [None]:
%%time
team_agg_features.head()

# Creating the Training Dataset...

In [None]:
%%time
tournament_data = pd.read_csv('/kaggle/input/womens-march-mania-2022/WDataFiles_Stage2/WNCAATourneyCompactResults.csv')
tournament_data.head()

In [None]:
%%time
tournament_data.describe()

In [None]:
%%time
tournament_data = tournament_data.rename(columns = {'WTeamID' : 'W_TeamID', 'LTeamID' : 'L_TeamID', 'WScore' : 'W_Score', 'LScore' : 'L_Score'})  

In [None]:
%%time
# Drop unimportant features from the dataset...
tournament_data.drop(['NumOT', 'WLoc'], axis = 1, inplace = True)

In [None]:
%%time
MIN_SEASON = 2016
# Remove data before 2016, not all the data is available...
tournament_data = tournament_data[tournament_data['Season'] >= MIN_SEASON].reset_index(drop = True)

In [None]:
%%time
tournament_data.head()

In [None]:
%%time
def merge_seed(df, seed_df, left_on = ['Season', 'W_TeamID'], field_name = 'SeedW'):
    df = pd.merge(df,seed_df, how = 'left', left_on = left_on, right_on = ['Season', 'TeamID'])
    df = df.drop('TeamID', axis = 1).rename(columns = {'Seed': field_name})
    return df

In [None]:
%%time
tournament_data = merge_seed(tournament_data, seeds, left_on = ['Season', 'W_TeamID'], field_name = 'W_Seed')
tournament_data = merge_seed(tournament_data, seeds, left_on = ['Season', 'L_TeamID'], field_name = 'L_Seed')

In [None]:
%%time
def seed_number(row):
    return int(re.sub("[^0-9]", "", row))

tournament_data['W_Seed'] = tournament_data['W_Seed'].apply(seed_number)
tournament_data['L_Seed'] = tournament_data['L_Seed'].apply(seed_number)
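
To make the seed parsing concrete, here is a small standalone sketch. The example seed strings are assumptions about the format (a region letter, a two-digit seed, and an optional suffix), not values read from the dataset.

```python
import re

def seed_number(seed):
    """Keep only the digits of a seed string and return them as an int."""
    return int(re.sub(r"[^0-9]", "", seed))

# Hypothetical seed strings, assumed to follow the region + number (+ suffix) pattern.
for raw in ["W01", "X16", "Y11a"]:
    print(raw, "->", seed_number(raw))
```

Note that a missing seed (for example, a NaN left behind by a left merge) is not a string, so `re.sub` would raise a `TypeError` on it.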

In [None]:
%%time
def merge_agg_features(df, agg_features):
    for result in ['W', 'L']:
        df = pd.merge(df, agg_features, how = 'left', left_on = ['Season', result +'_'+ 'TeamID'], right_on = ['Season', 'TeamID'])
        avoid = ['Season', 'TeamID']
        new_names = {col: result +'_'+ col for col in agg_features.columns if col not in avoid}
        df = df.rename(columns = new_names)        
        df = df.drop(columns = 'TeamID', axis = 1)
    return df

tournament_data = merge_agg_features(tournament_data, team_agg_features)

In [None]:
%%time
tournament_data.head()

In [None]:
def replace_win_loser(df):
    '''
    Relabel the Winner / Loser columns of the dataframe as Team A / Team B, once in each order, so every game appears from both sides
    '''
    team_a = df.copy()
    team_b = df.copy()
    
    team_a_dict, team_b_dict = {}, {}
    
    for col in team_a.columns:
        if col.find('W_') == 0:
            new_col_name = str(col).replace('W_', 'A_')
            team_a_dict[col] = new_col_name
        if col.find('L_') == 0:
            new_col_name = col.replace('L_', 'B_')    
            team_a_dict[col] = new_col_name
            
    for col in team_b.columns:
        if col.find('W_') == 0:
            new_col_name = str(col).replace('W_', 'B_')
            team_b_dict[col] = new_col_name
        if col.find('L_') == 0:
            new_col_name = col.replace('L_', 'A_')
            team_b_dict[col] = new_col_name

    team_a = team_a.rename(columns = team_a_dict)
    team_b = team_b.rename(columns = team_b_dict)
    
    merged_df = pd.concat([team_a, team_b], axis = 0, sort = False)
    return merged_df

In [None]:
%%time
tournament_data = replace_win_loser(tournament_data)

In [None]:
%%time
def calculate_differences(df):
    """
    Calculates the Team A minus Team B difference for each paired feature...
    """
    df['SeedDiff'] = df['A_Seed'] - df['B_Seed']
    df['WinRatioDiff'] = df['A_WinRatio'] - df['B_WinRatio']
    df['GapAvgDiff'] = df['A_AvgScoreGap'] - df['B_AvgScoreGap']    
    df['PointsRatioDiff'] = df['A_PointsRatio'] - df['B_PointsRatio']
    df['WinGapMadDiff'] = df['A_MadScoreGap'] - df['B_MadScoreGap']
    df['RatingDiff'] = df['A_Rating'] - df['B_Rating']
    
    return df

tournament_data = calculate_differences(tournament_data)

---

# Creating the Target Variables

In [None]:
%%time
# This code cell creates two target variables, one for regression (ScoreDiff) and one for classification (A_Win).

tournament_data['ScoreDiff'] = tournament_data['A_Score'] - tournament_data['B_Score']
tournament_data['A_Win'] = (tournament_data['ScoreDiff'] > 0).astype(int)
tournament_data = tournament_data.drop(columns=['A_Score', 'B_Score'])

In [None]:
%%time
tournament_data.head()

In [None]:
tournament_data.info()

---

# Creating the Test Dataset

In [None]:
%%time
sub_stage_two = pd.read_csv('/kaggle/input/womens-march-mania-2022/WDataFiles_Stage2/WSampleSubmissionStage2.csv')
tst_data = sub_stage_two.copy()

In [None]:
tst_data.shape

In [None]:
%%time
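def separate_id(df):
    """
    Splits the submission ID ('Season_TeamIdA_TeamIdB') into integer Season, TeamIdA and TeamIdB columns...
    """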
    df['Season']  = df['ID'].apply(lambda x: int(x.split('_')[0]))
    df['TeamIdA'] = df['ID'].apply(lambda x: int(x.split('_')[1]))
    df['TeamIdB'] = df['ID'].apply(lambda x: int(x.split('_')[2]))
    return df

tst_data = separate_id(tst_data)
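
The ID split can be checked on a single string. The ID below is a made-up example that assumes the `Season_TeamIdA_TeamIdB` layout implied by `separate_id`; `split_id` is a hypothetical helper used only for this illustration.

```python
def split_id(match_id):
    """Split 'Season_TeamIdA_TeamIdB' into three integers."""
    season, team_a, team_b = (int(part) for part in match_id.split("_"))
    return season, team_a, team_b

example_id = "2022_3101_3102"  # hypothetical ID, not taken from the submission file
print(split_id(example_id))
```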

In [None]:
%%time
tst_data = merge_seed(tst_data, seeds, left_on = ['Season', 'TeamIdA'], field_name = 'A_Seed')
tst_data = merge_seed(tst_data, seeds, left_on = ['Season', 'TeamIdB'], field_name = 'B_Seed')

In [None]:
%%time
tst_data['A_Seed'] = tst_data['A_Seed'].apply(seed_number)
tst_data['B_Seed'] = tst_data['B_Seed'].apply(seed_number)

In [None]:
%%time
tst_data = tst_data.rename(columns = {'TeamIdA': 'A_TeamID', 'TeamIdB': 'B_TeamID'})

In [None]:
tst_data

In [None]:
team_agg_features

In [None]:
%%time
def merge_agg_features(df, agg_features):
    for result in ['A', 'B']:
        df = pd.merge(df, agg_features, how = 'left', left_on = ['Season', result +'_'+ 'TeamID'], right_on = ['Season', 'TeamID'])
        avoid = ['Season', 'TeamID']
        new_names = {col: result +'_'+ col for col in agg_features.columns if col not in avoid}
        df = df.rename(columns = new_names)        
        df = df.drop(columns = 'TeamID', axis = 1)
    return df

tst_data = merge_agg_features(tst_data, team_agg_features)

In [None]:
%%time
tst_data = calculate_differences(tst_data)

In [None]:
%%time
tst_data

In [None]:
%%time
tst_data.shape

# Building the Model...
OK, so up to this point we have been aggregating and merging data from the season datasets...

* We used the season data to create aggregated features by team and season, and merged them back into the tournament data.
* Using the merged data we calculated the outcome of each game, Win or Loss.
* This newly calculated variable will be our target.
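
The models in this section are scored with log loss, which rewards well-calibrated probabilities and heavily penalizes confident mistakes. The standalone sketch below, with made-up labels and predictions, computes the metric by hand; `binary_log_loss` is a hypothetical helper written for illustration, not the scikit-learn function.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
soft = [0.8, 0.3, 0.6, 0.1]  # made-up probabilities
hard = [1, 0, 0, 0]          # hard 0/1 labels with one mistake

print(f"probabilities: {binary_log_loss(y_true, soft):.3f}")
print(f"hard labels:   {binary_log_loss(y_true, hard):.3f}")
```

A single confident mistake dominates the score, which is why hard 0/1 predictions are usually a poor fit for this metric.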

In [None]:
%%time
from sklearn import tree
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import log_loss
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [None]:
%%time
target_feature = 'A_Win'
avoid = ['ScoreDiff', 'Season', 'DayNum', 'A_Win']
features = [col for col in tournament_data.columns if col not in avoid]

In [None]:
%%time
features

In [None]:
features = ['A_TeamID',
            'B_TeamID',
            'A_Seed',
            'B_Seed',
            'A_WinRatio',
            'A_AvgScoreGap',
            'A_PointsRatio',
            'A_MadScoreGap',
            'A_Rating',
            'B_WinRatio',
            'B_AvgScoreGap',
            'B_PointsRatio',
            'B_MadScoreGap',
            'B_Rating',
            'SeedDiff',
            'WinRatioDiff',
            'GapAvgDiff',
            'PointsRatioDiff',
            'WinGapMadDiff',
            'RatingDiff'
           ]

In [None]:
%%time
season = 2021
X_train = tournament_data[tournament_data['Season'] < season][features].reset_index(drop = True).copy()
X_valid = tournament_data[tournament_data['Season'] == season][features].reset_index(drop = True).copy()

y_train = tournament_data[tournament_data['Season'] < season][target_feature].reset_index(drop = True).copy()
y_valid = tournament_data[tournament_data['Season'] == season][target_feature].reset_index(drop = True).copy()

scaler = MinMaxScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)        
X_valid = scaler.transform(X_valid)

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 8, 2048)
    max_depth = trial.suggest_int("max_depth", 1, 16)
    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.2)
    subsample = trial.suggest_float("subsample", 0.5, 1)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.5, 1)
    reg_lambda = trial.suggest_float("reg_lambda", 1, 20)
    reg_alpha = trial.suggest_float("reg_alpha", 0, 20)
    gamma = trial.suggest_float("gamma", 0, 20)
    min_child_weight  = trial.suggest_int("min_child_weight", 0, 128)
    
    clf = XGBClassifier(n_estimators  = n_estimators,
                       learning_rate = learning_rate,
                       max_depth = max_depth,
                       subsample = subsample,
                       colsample_bytree = colsample_bytree,
                       reg_lambda = reg_lambda,
                       reg_alpha = reg_alpha,
                       gamma = gamma,
                       min_child_weight = min_child_weight,
                       random_state  = 69,
                       #objective = 'binary:logistic',
                       objective = 'reg:logistic',
                       tree_method = 'gpu_hist',
                      )
    
    clf.fit(X_train, y_train)
    
    # Log loss needs predicted probabilities, not hard 0/1 labels
    valid_pred = clf.predict_proba(X_valid)[:, 1]
    score = log_loss(y_valid, valid_pred)
    
    return score

In [None]:
%%time
study = optuna.create_study(direction = "minimize")
study.optimize(objective, n_trials = 30)

In [None]:
%%time
parameters = study.best_params
parameters

In [None]:
%%time
params = {'n_estimators': 654,
          'max_depth': 11,
          'learning_rate': 0.08580077249980977,
          'subsample': 0.9805151546884083,
          'colsample_bytree': 0.9189021882456367,
          'reg_lambda': 19.96904416121937,
          'reg_alpha': 16.502095478188046,
          'gamma': 19.854799282674364,
          'min_child_weight': 7,
          'random_state': 69,
          'objective': 'reg:logistic',
          'tree_method':'gpu_hist',
         }

params = {'n_estimators': 1388,
          'max_depth': 11,
          'learning_rate': 0.085,
          'subsample': 0.9805151546884083,
          'colsample_bytree': 0.9189021882456367,
          'reg_lambda': 19.96904416121937,
          #'reg_alpha': 16.502095478188046,
          #'gamma': 19.854799282674364,
          'tree_method':'gpu_hist',
          'objective': 'reg:logistic',
          'random_state': 69,}

In [None]:
%%time
# Develop a CV loop to avoid leaking data from future tournaments...
def kfold_model(train_df, tst_df):
    cvs = []
    preds_test = []
    seasons = train_df['Season'].unique()
    
    for season in seasons[1:]:
        print(f'\nValidating on season {season}')
        # Train only on seasons strictly before the validation season, so the fold never sees the games it is scored on
        X_train = train_df[train_df['Season'] < season][features].reset_index(drop = True).copy()
        X_val = train_df[train_df['Season'] == season][features].reset_index(drop = True).copy()
        
        y_train = train_df[train_df['Season'] < season][target_feature].reset_index(drop = True).copy()
        y_val = train_df[train_df['Season'] == season][target_feature].reset_index(drop = True).copy()
        
        tst_dataset = tst_df[features].copy()
        
        scaler = MinMaxScaler()
        scaler.fit(X_train)
        
        X_train = scaler.transform(X_train)        
        X_val = scaler.transform(X_val)
        tst_dataset = scaler.transform(tst_dataset)
        
        model = XGBClassifier(**params)
        model.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0, early_stopping_rounds = 64)
        pred = model.predict_proba(X_val)[:, 1]
        
        pred_test = model.predict_proba(tst_dataset)[:, 1]
        preds_test.append(pred_test)
        
        loss = log_loss(y_val, pred)
        cvs.append(loss)
        
        print(f'\t -> Scored {loss:.4f}')
    print(f'\nLocal Cross Validation Score Is: {np.mean(cvs):.3f}')
    return preds_test

In [None]:
%%time
predictions = kfold_model(tournament_data, tst_data)
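
The season-by-season validation can be pictured with a short sketch. It assumes an expanding window in which each validation season is predicted from strictly earlier seasons; the season list and the `expanding_season_splits` helper are hypothetical.

```python
def expanding_season_splits(seasons):
    """Yield (train_seasons, valid_season) pairs, training only on earlier seasons."""
    ordered = sorted(set(seasons))
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

# Hypothetical list of tournament seasons.
for train, valid in expanding_season_splits([2016, 2017, 2018, 2019, 2021]):
    print(f"train on {train} -> validate on {valid}")
```

The first season never serves as a validation fold, because there is no earlier data to train on.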

In [None]:
# Model Records 
# Local Cross Validation Score Is: 0.509 
# Local Cross Validation Score Is: 0.494
# Local Cross Validation Score Is: 0.488
# Local Cross Validation Score Is: 0.466
# Local Cross Validation Score Is: 0.582
# Local Cross Validation Score Is: 0.623
# Local Cross Validation Score Is: 0.468
# Local Cross Validation Score Is: 0.461
# Local Cross Validation Score Is: 0.453

In [None]:
%%time
mean_predictions = np.mean(predictions, 0)

sub = tst_data[['ID', 'Pred']].copy()
sub['Pred'] = mean_predictions
sub.to_csv('submission.csv', index = False)
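
For reference, averaging the per-fold test predictions along axis 0 yields one probability per test game. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical per-fold probabilities for three test games (two folds).
fold_preds = [np.array([0.70, 0.40, 0.90]),
              np.array([0.60, 0.50, 0.80])]

# axis=0 averages across folds, leaving one probability per test game.
mean_pred = np.mean(fold_preds, axis=0)
print(mean_pred)
```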

---