**Introduction**

How important are team seeds in the tournament? A good deal of effort goes into assigning seeds to teams. In this kernel, let's explore how much the seed feature alone tells us about the outcome of a game. The result can serve as a baseline for other models, to see whether they improve our predictions.

The next kernel in this set is here - [Starter-Advanced Features, Model Tune using FastAI](https://www.kaggle.com/bshyammm/starter-advanced-features-model-tune-using-fastai)

In [None]:
import numpy as np 
import pandas as pd 

in_path = '../input/datafiles/'

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

#suppress warnings
import warnings
warnings.filterwarnings('ignore')


**Import NCAA Tourney Results data**

In [None]:
#import data
NCAATourneyCompactResults = pd.read_csv(in_path + 'NCAATourneyCompactResults.csv')
NCAATourneyCompactResults.head(5)

**Import Seed data**

In [None]:
NCAATourneySeeds = pd.read_csv(in_path + 'NCAATourneySeeds.csv')
#convert seed to int
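#regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
NCAATourneySeeds.Seed = NCAATourneySeeds.Seed.str.replace('[a-zA-Z]', '', regex=True)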
NCAATourneySeeds.Seed = NCAATourneySeeds.Seed.astype('int64')
NCAATourneySeeds.head(5)
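
To make the seed conversion concrete, here is a minimal sketch in plain Python. It assumes seed strings look like `W01` or `X16a` (letters around a numeric seed); the helper name `parse_seed` is hypothetical and not part of the kernel.

```python
import re

def parse_seed(seed: str) -> int:
    """Return the numeric part of a seed string such as 'W01' or 'X16a'."""
    match = re.search(r"\d+", seed)
    if match is None:
        # Fail loudly instead of silently producing a bad seed
        raise ValueError(f"no numeric seed found in {seed!r}")
    return int(match.group())

# Illustrative seed strings (assumed format, not read from the data files)
print([parse_seed(s) for s in ["W01", "X16a", "Y11b"]])
```

Extracting the digits rather than deleting letters also raises an error on an unexpected value, instead of failing later in `astype`.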

**Join Seed data with the raw data**

In [None]:
#Join winning team's seed
NCAA = pd.merge(NCAATourneyCompactResults, NCAATourneySeeds, how='inner', 
               left_on=['Season', 'WTeamID'], 
               right_on=['Season', 'TeamID'])
NCAA.rename(columns={"Seed": "W_SEED"}, inplace=True)
#Join losing team's seed
NCAA = pd.merge(NCAA, NCAATourneySeeds, how='inner', 
               left_on=['Season', 'LTeamID'], 
               right_on=['Season', 'TeamID'])
NCAA.rename(columns={"Seed": "L_SEED"}, inplace=True)
NCAA.drop(columns=['TeamID_x', 'TeamID_y'], inplace=True)
NCAA.head(5)

**Derivations**

*Please note: a higher-seeded team (larger seed number) is actually a lower-ranked team, and vice versa.*

Let's derive:
1. OUTCOME - defaults to 1, because the primary team ID is the winning team ID
2. Seed_diff - Losing_Team_Seed minus Winning_Team_Seed
3. Lower_Seed_Win - 1 where the lower-seeded (higher-ranked) team wins the game, otherwise 0
4. Higher_Seed_Win - 1 where the higher-seeded (lower-ranked) team wins the game, otherwise 0

In [None]:
NCAA['OUTCOME'] = 1
NCAA['Seed_diff'] = NCAA.L_SEED - NCAA.W_SEED
NCAA['Lower_Seed_Win'] = np.where(NCAA.Seed_diff>0, 1, 0)
NCAA['Higher_Seed_Win'] = np.where(NCAA.Seed_diff<0, 1, 0)
NCAA.tail(5)
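
As a quick check of these definitions, the sketch below applies the same derivations to three made-up games (the `toy` rows are hypothetical, not tournament results). Note that a game between two teams with equal seeds has `Seed_diff == 0` and is counted in neither win column.

```python
import numpy as np
import pandas as pd

# Hypothetical games: winner seed and loser seed
toy = pd.DataFrame({"W_SEED": [1, 12, 8], "L_SEED": [16, 5, 8]})

# Same derivations as in the cell above
toy["Seed_diff"] = toy.L_SEED - toy.W_SEED
toy["Lower_Seed_Win"] = np.where(toy.Seed_diff > 0, 1, 0)
toy["Higher_Seed_Win"] = np.where(toy.Seed_diff < 0, 1, 0)
print(toy)
```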

**Visualize counts of lower seeded (higher ranked) team wins**

In [None]:
counts = pd.DataFrame(NCAA.Lower_Seed_Win.value_counts()/len(NCAA))
counts = counts.reset_index()
counts.columns = ['Outcome', 'Percent']
counts

In [None]:
data = [
    go.Bar(
        x = counts.Outcome,
        y = counts.Percent,
        #text = (NCAA.Lower_Seed_Win.value_counts()/len(NCAA)), 
        #textposition = 'auto', 
        marker = dict(
          color = ['rgba(50, 171, 96, 0.7)', 'rgba(219, 64, 82, 0.7)']
        ),
        name = 'Seeds'
    )
]
fig = go.Figure(data=data)
iplot(fig, filename='base-bar')

This shows that the lower-seeded (higher-ranked) team wins about 70% of these games.
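
One way to turn this into a baseline is to score the rule "always pick the lower seed". The sketch below is one possible reading rather than the kernel's method: the function name `lower_seed_baseline` is hypothetical, the rows are made up, and it assumes equal-seed games should be dropped because the rule makes no pick there.

```python
import pandas as pd

def lower_seed_baseline(games: pd.DataFrame) -> float:
    """Share of games won by the lower seed, ignoring equal-seed games."""
    decided = games[games.W_SEED != games.L_SEED]
    return float((decided.W_SEED < decided.L_SEED).mean())

# Hypothetical games, not tournament results
toy = pd.DataFrame({"W_SEED": [1, 12, 8, 2, 3], "L_SEED": [16, 5, 8, 7, 6]})
print(lower_seed_baseline(toy))
```

Called on the joined `NCAA` frame, the same function would give the accuracy of the seed-only rule on historical games.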

**Visualize the wins by Season**

In [None]:
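#select multiple columns with a list (double brackets); tuple-style selection fails on current pandas
NCAA_counts = NCAA.groupby(['Season'])[['Lower_Seed_Win', 'Higher_Seed_Win']].agg('sum').reset_index()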
NCAA_counts.tail(5)

In [None]:
data = [
    go.Bar(
        x = NCAA_counts.Season,
        y = NCAA_counts.Higher_Seed_Win,
        marker = dict(
          color = 'rgba(219, 64, 82, 0.7)'
        ),
        name = 'Higher Seed Win'
    ),
    go.Bar(
        x = NCAA_counts.Season,
        y = NCAA_counts.Lower_Seed_Win,
        marker = dict(
          color = 'rgba(55, 128, 191, 0.7)'
        ),
        name = 'Lower Seed Win'
    )
]
layout = go.Layout(
    barmode='group'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='base-bar')

**Visualize by Seed Diff**

In [None]:
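NCAA['Seed_diff_abs'] = NCAA.Seed_diff.abs()
#select multiple columns with a list (double brackets); tuple-style selection fails on current pandas
NCAA_counts = NCAA.groupby(['Seed_diff_abs'])[['Lower_Seed_Win', 'Higher_Seed_Win']].agg('sum').reset_index()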
NCAA_counts

In [None]:
data = [
    go.Bar(
        x = NCAA_counts.Seed_diff_abs,
        y = NCAA_counts.Higher_Seed_Win,
        text = NCAA_counts.Higher_Seed_Win, 
        textposition = 'auto', 
        marker = dict(
          color = 'rgba(219, 64, 82, 0.7)'
        ),
        name = 'Higher Seed Win'
    ),
    go.Bar(
        x = NCAA_counts.Seed_diff_abs,
        y = NCAA_counts.Lower_Seed_Win,
        text = NCAA_counts.Lower_Seed_Win, 
        textposition = 'auto', 
        marker = dict(
          color = 'rgba(55, 128, 191, 0.7)'
        ),
        name = 'Lower Seed Win'
    )
]


fig = go.Figure(data=data)
iplot(fig, filename='base-bar')

**Summary**

As the seed difference increases, the gap between the number of lower-seed wins and higher-seed wins also widens.

This analysis should give us a good baseline to beat by generating advanced features and creating data models. 
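
Raw win counts depend on how many games were played at each seed gap, so a win rate per gap may be easier to compare against a model. The following is a minimal sketch on made-up rows (the `toy` frame is hypothetical); it is one possible way to express the baseline, not something computed in this kernel.

```python
import pandas as pd

# Hypothetical games, not tournament results
toy = pd.DataFrame({
    "W_SEED": [1, 1, 1, 16, 5, 12],
    "L_SEED": [16, 16, 16, 1, 12, 5],
})
toy["Seed_diff_abs"] = (toy.L_SEED - toy.W_SEED).abs()
toy["Lower_Seed_Win"] = (toy.W_SEED < toy.L_SEED).astype(int)

# Fraction of games won by the lower seed at each absolute seed gap
rate = toy.groupby("Seed_diff_abs")["Lower_Seed_Win"].mean()
print(rate)
```

Applied to the `NCAA` frame, these per-gap rates could serve as simple seed-only win probabilities.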

Please refer to this kernel for a starter feature and model - [Starter-Advanced Features, Model Tune using FastAI](https://www.kaggle.com/bshyammm/starter-advanced-features-model-tune-using-fastai)