___
This notebook attempts a simple prediction of which team will win a given tournament match, based on seeding alone.

In [51]:
import pandas as pd
import numpy as np
import scipy
from sklearn import linear_model, metrics

## Read data

In [36]:
raw_data = pd.read_csv("input/tour-results-seed.csv")

## Data Transformation
- `diff_seed`: seeding differential (opponent's seed minus the team's own seed)
- `yhat`: match outcome (1 = win, 0 = loss)

In [40]:
winning_team_perspective_df = (
    raw_data
    # Seed differential from the winner's side: opponent seed minus own seed
    .assign(diff_seed=lambda x: x.L_seed - x.W_seed)
    # Label: 1 = this team won
    .assign(yhat=1)
)

In [41]:
losing_team_perspective_df = (
    raw_data
    # Seed differential from the loser's side: opponent seed minus own seed
    .assign(diff_seed=lambda x: x.W_seed - x.L_seed)
    # Label: 0 = this team lost
    .assign(yhat=0)
)

In [99]:
prediction_df = (
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
    pd.concat([winning_team_perspective_df, losing_team_perspective_df])
)
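
To make the transformation concrete, here is a small sketch on a made-up two-row frame. The `toy` data is hypothetical and not taken from `tour-results-seed.csv`; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy data: two matches with the same seed columns as raw_data.
toy = pd.DataFrame({
    "Season": [2016, 2017],
    "W_seed": [1, 7],
    "L_seed": [8, 2],
})

# Winner's view: opponent seed minus own seed, labelled as a win.
winners = toy.assign(diff_seed=toy.L_seed - toy.W_seed, yhat=1)
# Loser's view: the same matches with the sign flipped, labelled as a loss.
losers = toy.assign(diff_seed=toy.W_seed - toy.L_seed, yhat=0)

toy_prediction_df = pd.concat([winners, losers])
print(toy_prediction_df)
```

Every match appears twice with opposite signs, so the two classes are balanced by construction and `diff_seed` sums to zero.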

## Splitting data into train and test sets
- train data: <= 2013 Season
- test data: >= 2014 Season

In [48]:
train_df = prediction_df.query("Season <= 2013")
test_df = prediction_df.query("Season >= 2014")

In [82]:
train_data_x = train_df[['diff_seed']]
train_data_y = train_df['yhat']

test_data_x = test_df[['diff_seed']]
test_data_y = test_df['yhat']

## Initializing Logistic Regression

In [83]:
logreg = linear_model.LogisticRegression()

In [84]:
logreg.fit(train_data_x,train_data_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
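
The fitted model maps `diff_seed` to a win probability through the logistic function. A minimal sketch of that relationship follows, assuming synthetic data in place of the real training set, so the coefficients it prints are not the notebook's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (assumption, not the real CSV):
# a larger diff_seed makes a win more likely.
rng = np.random.default_rng(0)
diff = rng.integers(-15, 16, size=500)
p_true = 1.0 / (1.0 + np.exp(-0.2 * diff))
won = (rng.random(500) < p_true).astype(int)

model = LogisticRegression().fit(diff.reshape(-1, 1), won)

# Recompute P(win) by hand from the fitted intercept and slope.
b0, b1 = model.intercept_[0], model.coef_[0, 0]
x = np.array([[-5], [0], [5]])
manual = 1.0 / (1.0 + np.exp(-(b0 + b1 * x[:, 0])))

# Column 1 of predict_proba is the probability of class 1 (a win).
assert np.allclose(manual, model.predict_proba(x)[:, 1])
print(b0, b1, manual)
```

On the real data, the same two attributes (`logreg.intercept_` and `logreg.coef_`) would show how strongly each seed step shifts the predicted win probability.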

## Getting the accuracy of the logistic regression
- accuracy on the test set: 0.71
- confusion matrix included

In [103]:
logreg.score(test_data_x,test_data_y)

0.70895522388059706

In [106]:
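# Rows are actual outcomes, columns are predicted outcomes.
# Predicting directly avoids relying on a column that is only added in a later cell.
metrics.confusion_matrix(test_data_y, logreg.predict(test_data_x))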

array([[180,  88],
       [ 68, 200]])
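
The accuracy reported above can be recovered from the confusion matrix, along with precision and recall for the "win" class. The sketch below hard-codes the matrix printed above; scikit-learn puts true labels on the rows and predicted labels on the columns.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
cm = np.array([[180, 88],
               [68, 200]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # predicted wins that were real wins
recall = tp / (tp + fn)           # real wins that were predicted as wins

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The accuracy works out to 380/536, which matches the value returned by `logreg.score`.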

## Joining predictions to the actual results

In [None]:
test_results = pd.DataFrame(logreg.predict(test_df[['diff_seed']])).rename(columns={0:"prediction_result"})

In [93]:
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing a column into the sliced test_df in place
test_df = test_df.assign(prediction_results=test_results.prediction_result.values)

In [96]:
test_df.tail(20)

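| Unnamed: 0 | Season | WTeamID | W_seed | LTeamID | L_seed | diff_seed | yhat | prediction_results |
|---|---|---|---|---|---|---|---|---|
| 2097 | 2017 | 1276 | 7 | 1257 | 2 | 5 | 0 | 1 |
| 2098 | 2017 | 1314 | 1 | 1116 | 8 | -7 | 0 | 0 |
| 2099 | 2017 | 1332 | 3 | 1348 | 11 | -8 | 0 | 0 |
| 2100 | 2017 | 1376 | 7 | 1181 | 2 | 5 | 0 | 1 |
| 2101 | 2017 | 1417 | 3 | 1153 | 6 | -3 | 0 | 0 |
| 2102 | 2017 | 1211 | 1 | 1452 | 4 | -3 | 0 | 0 |
| 2103 | 2017 | 1242 | 1 | 1345 | 4 | -3 | 0 | 0 |
| 2104 | 2017 | 1332 | 3 | 1276 | 7 | -4 | 0 | 0 |
| 2105 | 2017 | 1462 | 11 | 1112 | 2 | 9 | 0 | 1 |
| 2106 | 2017 | 1196 | 4 | 1458 | 8 | -4 | 0 | 0 |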


http://blog.yhat.com/posts/roc-curves.html
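
The linked post covers ROC curves. As a rough sketch of how that could be applied here, the snippet below scores a logistic regression with `roc_curve` and `roc_auc_score` from `sklearn.metrics`. The data is synthetic (an assumption standing in for `train_data_x` and `test_data_x`), so the printed AUC says nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in for diff_seed and the win/loss label (assumption).
rng = np.random.default_rng(1)
diff = rng.integers(-15, 16, size=600).reshape(-1, 1)
won = (rng.random(600) < 1.0 / (1.0 + np.exp(-0.2 * diff[:, 0]))).astype(int)

# Simple holdout split, for illustration only.
x_train, x_test = diff[:400], diff[400:]
y_train, y_test = won[:400], won[400:]

model = LogisticRegression().fit(x_train, y_train)

# ROC analysis needs scores, not hard 0/1 predictions.
scores = model.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")
```

With the notebook's own objects, the equivalent scores would come from `logreg.predict_proba(test_data_x)[:, 1]`.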

# Concluding remarks for logistic regression modelling
- the current approach isn't going to work for predicting results
    - we are using post-season results to predict post-season results
- we will need regular-season data to predict the winning probability

## Next Steps
- combine regular-season and post-season data to determine the winner
    - refer to images (link)
- calculate intermediate variables for prediction
- feed all our features into the prediction model
- use feature selection later to decide which ones remain
- check that the model does not overfit
- try out different models
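
One possible way to approach the overfitting check and the model comparison is k-fold cross-validation, looking at train and validation accuracy side by side. This is only a sketch: the data is synthetic, and `DecisionTreeClassifier` is an arbitrary example of a second model, not something the notebook has settled on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels (assumption).
rng = np.random.default_rng(2)
X = rng.integers(-15, 16, size=(800, 1))
y = (rng.random(800) < 1.0 / (1.0 + np.exp(-0.2 * X[:, 0]))).astype(int)

candidates = {
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, model in candidates.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    # A large gap between train and validation accuracy hints at overfitting.
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(name, results[name])
```

For the real data, folds that keep each season together would be closer to the season-based train/test split used earlier in this notebook.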