**Getting Data**

We will start with the data provided by Kaggle.

Detailed data descriptions can be found here:

https://www.kaggle.com/c/mens-machine-learning-competition-2019/data

Let's first load the necessary packages.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

# Importing useful packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

Input data files are available in the "../input/" directory.

The list of files is also shown in the right-side panel, under "Draft Environment".

In [None]:
# Input data files are available in the "../input/" directory.
# List the files in the input directory, one name per line

import os
print("\n".join(sorted(os.listdir("../input"))))

Many files are in .zip format; however, a Kaggle kernel reads the CSV files directly from the zip archives.
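Outside a Kaggle kernel, pandas itself can read a zipped CSV. The sketch below is only an illustration: it builds a tiny archive with made-up file names (`Teams_demo.zip`, `Teams_demo.csv`) and sample rows, then reads it back with `pd.read_csv`, which infers the compression from the `.zip` extension as long as the archive holds a single data file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV in a temporary directory.
# The file names and rows are made up for this demo.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "Teams_demo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("Teams_demo.csv", "TeamID,TeamName\n1101,Team A\n1102,Team B\n")

# pandas infers zip compression from the ".zip" extension;
# the archive must contain exactly one data file for this to work.
df_demo = pd.read_csv(zip_path)
print(df_demo.shape)
```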

Let's read a couple of files into Pandas DataFrames, a popular data structure for analytics. Use the built-in .head() method to examine the first few records.

In [None]:
# Use function .read_csv to read CSV files into DataFrame

# Read game results since 1985
df_compact = pd.read_csv('../input/datafiles/RegularSeasonCompactResults.csv')
df_compact.head(10)

In [None]:
# Read team info
df_teams = pd.read_csv('../input/datafiles/Teams.csv')
df_teams['TeamName'].head()

In [None]:
# Read play by play data for year 2018
play2018 = pd.read_csv('../input/playbyplay_2018/Events_2018.csv')
play2018.head()

In [None]:
# Read tournament seed data
df_seeds = pd.read_csv('../input/datafiles/NCAATourneySeeds.csv')
df_seeds.head()

Feel free to load additional files and check what they contain.


Here are some useful ways to inspect the data:

.shape  # review the number of rows and columns of the dataframe

.dtypes  # review the data type of each column

.nunique()  # get the count of unique values in each column


In [None]:
# Check the data size for the dataframe for teams
df_teams.shape

In [None]:
# Check the data type of each column in the teams data
df_teams.dtypes

In [None]:
# Count the unique values in each column of the teams data. One can see that there are 366 unique teams
df_teams.nunique()

Sometimes several files share the same naming convention, and it is convenient to load them into a single table. For the play-by-play data, each year's folder contains two CSV files: Events and Players. The naming convention is very regular, with only the year changing. We can use the following script to read each Events file and concatenate them into a single table.

In [None]:
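# Read the Events file for each year from 2010 to 2017.
# DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat.
frames = []
for i in range(8):
    df = pd.read_csv('../input/playbyplay_201' + str(i) + '/Events_201' + str(i) + '.csv')
    frames.append(df)
    print("Cumulative rows through year 201" + str(i) + ": " + str(sum(len(f) for f in frames)))

df_pbp = pd.concat(frames)
df_pbp.shape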
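The same idea can be wrapped in a small helper. The sketch below is not part of the original notebook: `load_yearly_events` is a hypothetical name, and the demo runs on tiny stand-in files with made-up columns written to a temporary directory, so it does not depend on the Kaggle input folder. It only assumes the `playbyplay_<year>/Events_<year>.csv` layout described above.

```python
import os
import tempfile

import pandas as pd


def load_yearly_events(input_dir, years):
    """Read Events_<year>.csv for each year and stack them into one DataFrame."""
    frames = []
    for year in years:
        path = os.path.join(input_dir, f"playbyplay_{year}", f"Events_{year}.csv")
        frames.append(pd.read_csv(path))
    # ignore_index=True gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Demo: write two small stand-in files (columns are made up for illustration).
demo_dir = tempfile.mkdtemp()
for year, n_rows in [(2010, 2), (2011, 3)]:
    folder = os.path.join(demo_dir, f"playbyplay_{year}")
    os.makedirs(folder)
    pd.DataFrame({"Season": [year] * n_rows, "EventID": range(n_rows)}).to_csv(
        os.path.join(folder, f"Events_{year}.csv"), index=False
    )

df_demo_pbp = load_yearly_events(demo_dir, range(2010, 2012))
print(df_demo_pbp.shape)
```

With the real data, the call would presumably look like `load_yearly_events('../input', range(2010, 2018))`.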

**Data Wrangling**



Now we want to prepare the data so that we can build a naive model on it. A team's seed, that is, its ranking within its tournament region, seems to be a good predictor of how likely the team is to win. So let's prepare the data with the seed in mind.

The first step is to clean the seed column. In the original data, each "Seed" value has a region letter (for example "W") in front of the actual seed number. The following script strips that letter, converts the seed to an integer, and attaches the winning and losing teams' seeds to each game.

In [None]:
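# Characters 1-2 of the seed string hold the seed number (character 0 is the region letter)
df_seeds['seed_int'] = df_seeds['Seed'].apply(lambda x: int(x[1:3]))
# Build one lookup table for the winning team's seed and one for the losing team's seed
df_winseeds = df_seeds.loc[:, ['TeamID', 'Season', 'seed_int']].rename(columns={'TeamID':'WTeamID', 'seed_int':'WSeed'})
df_lossseeds = df_seeds.loc[:, ['TeamID', 'Season', 'seed_int']].rename(columns={'TeamID':'LTeamID', 'seed_int':'LSeed'})
# Inner merges keep only games in which both teams have a seed, so neither WSeed nor LSeed is NaN
df_temp = pd.merge(left=df_compact, right=df_winseeds, on=['Season', 'WTeamID'])
df_concat = pd.merge(left=df_temp, right=df_lossseeds, on=['Season', 'LTeamID'])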

df_concat.head()
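To see what the slice `x[1:3]` used above does on individual values, here is a small sketch on made-up seed strings. It assumes the seed format is a region letter, two digits, and sometimes a trailing letter; the sample values are illustrative, not taken from the data.

```python
import pandas as pd

# Made-up seed strings: a region letter, two digits,
# and sometimes a trailing letter (an assumption about the format).
demo_seeds = pd.Series(["W01", "X16a", "Y08", "Z11b"])

# Characters 1 and 2 (zero-based) hold the seed number.
demo_seed_int = demo_seeds.str[1:3].astype(int)
print(demo_seed_int.tolist())
```

The vectorized `.str[1:3]` accessor gives the same result as the `apply` with a lambda.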

Then we create a calculated field named "SeedDiff", which is the difference between the winning team's seed number and the losing team's seed number.

In [None]:
df_concat['SeedDiff'] = df_concat.WSeed - df_concat.LSeed
df_concat.head()

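Lastly, we need to create the target variable for this naive model. Here we separate the wins and losses: each game contributes one row with its SeedDiff and a result of 1 (the winner's view), and one row with the negated SeedDiff and a result of 0 (the loser's view).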

In [None]:
df_wins = pd.DataFrame()
df_wins['SeedDiff'] = df_concat['SeedDiff']
df_wins['Result'] = 1

df_losses = pd.DataFrame()
df_losses['SeedDiff'] = -df_concat['SeedDiff']
df_losses['Result'] = 0

df_predictions = pd.concat((df_wins, df_losses))
df_predictions.head()
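The mirroring step can be checked on a toy example. The sketch below uses three hypothetical seed differences (the `demo_*` names and values are made up) and shows that the resulting table is balanced: half the rows are wins, and the seed differences cancel out.

```python
import pandas as pd

# Hypothetical seed differences (winner's seed minus loser's seed) for three games.
demo = pd.DataFrame({"SeedDiff": [-15, 3, -4]})

# Winner's view: original difference, labelled 1.
demo_wins = pd.DataFrame({"SeedDiff": demo["SeedDiff"], "Result": 1})
# Loser's view: negated difference, labelled 0.
demo_losses = pd.DataFrame({"SeedDiff": -demo["SeedDiff"], "Result": 0})

demo_predictions = pd.concat((demo_wins, demo_losses), ignore_index=True)
print(demo_predictions)
```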

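Use the shuffle function from sklearn to randomize the row order of the training data. The features and labels are shuffled together, so each SeedDiff stays paired with its Result.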

In [None]:
X_train = df_predictions.SeedDiff.values.reshape(-1,1)
y_train = df_predictions.Result.values
X_train, y_train = shuffle(X_train, y_train)

In [None]:
X_train.shape

In [None]:
y_train.shape
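As a quick sanity check, the sketch below uses toy arrays (not the competition data) to confirm that `shuffle` applies the same permutation to both inputs and leaves their shapes unchanged. The `random_state` argument is added here only to make the demo repeatable.

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data in which each label is 10 times its feature value.
X_demo = np.arange(6).reshape(-1, 1)
y_demo = np.arange(6) * 10

X_shuf, y_shuf = shuffle(X_demo, y_demo, random_state=0)

# If the pairing survived the shuffle, the relationship still holds row by row.
aligned = bool((X_shuf.ravel() * 10 == y_shuf).all())
print(X_shuf.shape, y_shuf.shape, aligned)
```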