## Project Outline:

  * **Data Extraction: Read CSV data and remove/fill-in missing values**
  * **Feature Extraction: Categorize/encode features and remove unwanted/redundant features.**
  * **Pipeline: Chain the transformations and predictors/classifiers**
  * **Evaluate Classifiers/Predictors: Cross-validate the pipeline using an appropriate scoring metric**
  * **Fine Tuning: Searching over Parameters using GridSearchCV**


In [14]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
import pandas as pd
import numpy as np
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import metrics
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in later releases
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in later releases

* **The scikit-learn version used here has no built-in way to choose particular features for a transformation inside a pipeline (newer releases provide `ColumnTransformer` for this). We therefore write our own transformer to handle column selection.**


In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to sklearn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
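
To illustrate how such a selector behaves on a pandas DataFrame, here is a minimal sketch. It repeats a stripped-down copy of the class so it runs on its own, and the `toy` frame holds a few illustrative rows rather than the real data. The point it shows: a list `key` keeps the result two-dimensional, which downstream scikit-learn transformers generally expect, while a bare string `key` returns a one-dimensional Series.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Minimal copy of the selector above, repeated so this sketch runs on its own."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


# A few illustrative rows (hypothetical `toy` frame, not the full data set)
toy = pd.DataFrame({"SHOT_CLOCK": [10.8, 3.4, 10.3],
                    "TOUCH_TIME": [1.9, 0.8, 1.9]})

# A list key selects a one-column DataFrame (2D)
as_frame = ItemSelector(key=["SHOT_CLOCK"]).fit_transform(toy)
# A string key selects a Series (1D)
as_series = ItemSelector(key="SHOT_CLOCK").fit_transform(toy)

print(as_frame.shape)
print(as_series.shape)
```

This is why the pipeline later in the notebook passes keys such as `['SHOT_CLOCK']` as single-element lists.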

## Data Extraction

* **Import the data using pandas and clean it.**

In [3]:
# NBA shot log data, which we use to classify/predict whether a shot was made.
data = pd.read_csv("../data/shot_logs.csv")

* **Data Cleaning -- Check for nulls and decide to drop them or fill them.**

In [4]:
data.isnull().any()

GAME_ID                       False
MATCHUP                       False
LOCATION                      False
W                             False
FINAL_MARGIN                  False
SHOT_NUMBER                   False
PERIOD                        False
GAME_CLOCK                    False
SHOT_CLOCK                     True
DRIBBLES                      False
TOUCH_TIME                    False
SHOT_DIST                     False
PTS_TYPE                      False
SHOT_RESULT                   False
CLOSEST_DEFENDER              False
CLOSEST_DEFENDER_PLAYER_ID    False
CLOSE_DEF_DIST                False
FGM                           False
PTS                           False
player_name                   False
player_id                     False
dtype: bool

In [5]:
# Only SHOT_CLOCK has null values.
# Check how many there are and whether the affected rows can be dropped.
print("Number of nulls in SHOT CLOCK feature {0}".format(data.SHOT_CLOCK.isnull().sum()))

print("The number is small, so we can drop those rows.")
dataForAnalysis = data.dropna().copy()

Number of nulls in SHOT CLOCK feature 5567
The number is small, so we can drop those rows.
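
Whether to drop or fill usually depends on what share of the rows is affected, not only on the raw count. The sketch below shows one way to check both and compares the two options. The `toy` frame is made up for illustration, and the median fill is an alternative shown for contrast, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing shot-clock readings (illustration only)
toy = pd.DataFrame({
    "SHOT_CLOCK": [10.8, np.nan, 10.3, 9.1, np.nan, 15.0, 4.2, 20.1],
    "TOUCH_TIME": [1.9, 0.8, 1.9, 4.4, 2.7, 1.1, 3.0, 0.9],
})

null_counts = toy.isnull().sum()   # number of nulls per column
null_share = toy.isnull().mean()   # fraction of rows that are null per column
print(null_counts)
print(null_share)

# Option 1: drop every row with at least one null (the approach taken above)
dropped = toy.dropna().copy()

# Option 2 (alternative, not used in this notebook): fill with the column median
filled = toy.fillna({"SHOT_CLOCK": toy["SHOT_CLOCK"].median()})

print(len(toy), len(dropped), len(filled))
```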


## Feature Extraction
* **Bucket the features or drop the unwanted/redundant features.**

In [6]:
print('*'*30)
print("Based on the knowledge of data, the below columns are unwanted for classification")
del dataForAnalysis["GAME_ID"]
del dataForAnalysis['MATCHUP']
del dataForAnalysis['GAME_CLOCK']
del dataForAnalysis["FINAL_MARGIN"]
del dataForAnalysis["PTS"]
del dataForAnalysis["player_name"]
del dataForAnalysis["CLOSEST_DEFENDER"]
del dataForAnalysis["W"]  # match result
del dataForAnalysis['SHOT_RESULT']  # duplicate information, already captured in FGM
del dataForAnalysis["player_id"]
del dataForAnalysis["CLOSEST_DEFENDER_PLAYER_ID"]

print('*'*30)
print("Categorizing distance of the shot...")
# pd.cut with an integer splits the observed range into that many equal-width bins
dataForAnalysis["SHOT_DIST_CAT"] = pd.cut(dataForAnalysis.SHOT_DIST, 7, labels=list(range(1, 8)))
del dataForAnalysis["SHOT_DIST"]

print('*'*30)
print("Categorizing the number of dribbles before the shot")
dataForAnalysis["DRIBBLES_CAT"] = pd.cut(dataForAnalysis.DRIBBLES, 4, labels=list(range(1, 5)))
del dataForAnalysis["DRIBBLES"]

print('*'*30)
print("Categorizing the defender distance before the shot")
dataForAnalysis["CLOSE_DEF_DIST_CAT"] = pd.cut(dataForAnalysis.CLOSE_DEF_DIST, 11, labels=list(range(1, 12)))
del dataForAnalysis["CLOSE_DEF_DIST"]
print('*'*30)

print("Encoding should be part of the pipeline, but as Pipeline doesn't support label encoding, we do it during feature extraction")
le = LabelEncoder()
# LabelEncoder assigns integer codes to the sorted LOCATION labels
dataForAnalysis["IS_HOME"] = le.fit_transform(dataForAnalysis["LOCATION"])
del dataForAnalysis["LOCATION"]

******************************
Based on the knowledge of data, the below columns are unwanted for classification
******************************
Categorizing distance of the shot...
******************************
Categorizing the number of dribbles before the shot
******************************
Categorizing the defender distance before the shot
******************************
Encoding should be part of the pipeline, but as Pipeline doesn't support label encoding, we do it during feature extraction
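
The sketch below shows what the two feature-extraction steps do on a small scale. The `toy` frame and its values are made up, and it assumes `LOCATION` holds the labels `"A"` and `"H"`. It is an illustration of `pd.cut` and `LabelEncoder`, not a reproduction of the real columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values, for illustration only
toy = pd.DataFrame({"DRIBBLES": [0, 1, 2, 3, 4, 5, 6, 7],
                    "LOCATION": ["A", "H", "H", "A", "A", "H", "A", "H"]})

# An integer `bins` argument splits [min, max] into equal-width intervals,
# so bin membership depends on the observed range, not on how many rows fall in each bin.
toy["DRIBBLES_CAT"] = pd.cut(toy["DRIBBLES"], 4, labels=[1, 2, 3, 4])

# LabelEncoder sorts the distinct labels and numbers them from 0,
# so "A" becomes 0 and "H" becomes 1.
le = LabelEncoder()
toy["IS_HOME"] = le.fit_transform(toy["LOCATION"])

print(toy)
```

Because the bins are equal-width, a heavily skewed column can end up with most rows in the first bin. `pd.qcut` is the quantile-based alternative if balanced bins are wanted.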


## Pipeline
* **Function transformers, encoders and classifiers.**

In [7]:
def cat_shot_clock(times, y=None):
    """
    Custom function transformer that converts the shot clock reading into 3 categories.
    """
    rt = []
    # np.asarray lets this work whether a DataFrame or a 2D array is passed in
    for time_a in np.asarray(times):
        time = time_a[0]
        if 0 < time < 9:
            rt.append(0)
        elif 9 <= time < 17:
            rt.append(1)
        else:
            # 17 and above; a reading of exactly 0 also falls through to here
            rt.append(2)
    return pd.DataFrame(rt)

In [8]:
def touch_time_cat(touch_times, y=None):
    """
    Custom function transformer that converts the touch time before the shot into 3 categories.
    """
    rt = []
    # np.asarray lets this work whether a DataFrame or a 2D array is passed in
    for touch_time_a in np.asarray(touch_times):
        touch_time = touch_time_a[0]
        if touch_time <= 2:
            rt.append(0)
        elif touch_time <= 6:
            rt.append(1)
        else:
            rt.append(2)
    return pd.DataFrame(rt)
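
The row-by-row loops above can be slow on a large table. As a hedged sketch, the same three shot-clock categories can be computed in one vectorized step with `np.select`. The function name `cat_shot_clock_vectorized` and the `sample` array are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def cat_shot_clock_vectorized(times):
    """Hypothetical vectorized variant: the same three shot-clock categories via np.select."""
    t = np.asarray(times, dtype=float)
    conditions = [(t > 0) & (t < 9), (t >= 9) & (t < 17)]
    # Anything matching neither condition (17 and above, or exactly 0) gets category 2
    return np.select(conditions, [0, 1], default=2)


# validate=False passes the input through to the function unchanged
encoder = FunctionTransformer(cat_shot_clock_vectorized, validate=False)

# Made-up single-column input, shaped (n_samples, 1) like the selector output
sample = np.array([[0.0], [5.0], [9.0], [16.9], [17.0], [24.0]])
categories = encoder.fit_transform(sample)
print(categories.ravel())
```

The boundary cases in `sample` (0, 9 and 17) are included on purpose, since those are where the loop version and a rewrite are most likely to disagree.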

In [9]:
print('*'*30)
print("Pipelining all the transformations on the data before we hand it over to the classifier")
# Note: binaryEncoder is defined here but is not used in the FeatureUnion below
binaryEncoder = ce.BinaryEncoder(cols=['player_id', 'CLOSEST_DEFENDER_PLAYER_ID', 'SHOT_NUMBER'])
pipeReadyForPrediction = FeatureUnion([
    ('cat_shot_clock', Pipeline([
        ('selector', ItemSelector(key=['SHOT_CLOCK'])),
        ('encoder', FunctionTransformer(cat_shot_clock))
        #('encoder', OneHotEncoder())
    ])),
    ('touch_time_cat', Pipeline([
        ('selector', ItemSelector(key=['TOUCH_TIME'])),
        ('encoder', FunctionTransformer(touch_time_cat))
    ])),
    # Despite the step name, this branch one-hot encodes SHOT_NUMBER
    ('shot_distance', Pipeline([
        ('selector', ItemSelector(key=['SHOT_NUMBER'])),
        ('encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))
    ]))
])

******************************
Pipelining all the transformations on the data before we hand it over to the classifier
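
A `FeatureUnion` fits each branch on the same input and stacks the branch outputs side by side, so the classifier sees one combined feature matrix. The sketch below shows this on a made-up frame. The `select` helper is a hypothetical stand-in for a column selector, and the two branches are chosen only to keep the example small.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up rows, for illustration only
toy = pd.DataFrame({"SHOT_CLOCK": [3.0, 10.0, 20.0, 15.0],
                    "TOUCH_TIME": [1.0, 4.0, 7.0, 2.5]})


def select(column):
    """Hypothetical helper: a transformer that returns one column as a 2D array."""
    return FunctionTransformer(lambda df: df[[column]].values, validate=False)


union = FeatureUnion([
    # Branch 1: standardize the shot clock
    ("clock_scaled", Pipeline([("selector", select("SHOT_CLOCK")),
                               ("scale", StandardScaler())])),
    # Branch 2: flag touch times longer than 2 seconds
    ("touch_flag", Pipeline([("selector", select("TOUCH_TIME")),
                             ("flag", FunctionTransformer(lambda a: (a > 2).astype(int),
                                                          validate=False))])),
])

combined = union.fit_transform(toy)
print(combined.shape)   # one column per branch output, one row per sample
print(combined)
```

Note that only the columns named in a branch reach the classifier. Any column that no branch selects is silently left out of the combined matrix.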


In [10]:
print('*'*30)
print("Plug in Logistic Classifier in the Pipeline")
logPipe = Pipeline([("transformation", pipeReadyForPrediction), ("LogisticRegression", LogisticRegression())])

print('*'*30)
print("Plug in Random Forest Classifier in the Pipeline")
randomForestPipe = Pipeline([("transformation", pipeReadyForPrediction), ("RandomForest", RandomForestClassifier(n_estimators=7))])

******************************
Plug in Logistic Classifier in the Pipeline
******************************
Plug in Random Forest Classifier in the Pipeline


## Evaluate Classifiers
* **Cross-validation score for each pipeline.**

In [11]:
print("Separate the features and the prediction")

print("Extracting Features...")
# Every column except the target FGM is a feature (.ix is deprecated, so use .loc)
features_X = dataForAnalysis.loc[:, dataForAnalysis.columns != 'FGM']
print(features_X.head())
print("Prediction...")
predict_Y = dataForAnalysis.FGM
print(predict_Y.head())

Separate the features and the prediction
Extracting Features...
   SHOT_NUMBER  PERIOD  SHOT_CLOCK  TOUCH_TIME  PTS_TYPE SHOT_DIST_CAT  \
0            1       1        10.8         1.9         2             2   
1            2       1         3.4         0.8         3             5   
3            4       2        10.3         1.9         2             3   
4            5       2        10.9         2.7         2             1   
5            6       2         9.1         4.4         2             3   

  DRIBBLES_CAT CLOSE_DEF_DIST_CAT  IS_HOME  
0            1                  1        0  
1            1                  2        0  
3            1                  1        0  
4            1                  1        0  
5            1                  1        0  
Prediction...
0    1
1    0
3    0
4    0
5    0
Name: FGM, dtype: int64


In [12]:
folds = 7
scoring_metric = 'roc_auc'
for pipe in [logPipe, randomForestPipe]:
    mean_score = np.mean(cross_val_score(pipe, features_X, predict_Y, cv = folds, scoring = scoring_metric))
    print("Average {0} score across {1} folds of {2} is {3} ".format(scoring_metric, folds, pipe.steps[-1][0], mean_score))
    print('*'*30)



Average roc_auc score across 7 folds of LogisticRegression is 0.552466983055 
******************************
Average roc_auc score across 7 folds of RandomForest is 0.549155766368 
******************************
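
For reference, a ROC AUC of 0.5 corresponds to ranking the samples at random, so scores around 0.55 are only slightly better than chance. The sketch below shows the same evaluation pattern on synthetic data, using `make_classification` as a stand-in for the shot logs. It also prints the per-fold scores, which the mean alone hides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the shot logs)
X, y = make_classification(n_samples=350, n_features=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("LogisticRegression", LogisticRegression())])

folds = 7
# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(pipe, X, y, cv=folds, scoring="roc_auc")

print("per-fold roc_auc: {0}".format(np.round(scores, 3)))
print("mean roc_auc of {0}: {1}".format(pipe.steps[-1][0], scores.mean()))
```

Passing the whole pipeline to `cross_val_score` means the transformers are refit on each training fold, which keeps information from the held-out fold out of the preprocessing.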


## Grid Search
* **Fine tuning C (Logistic Regression), Estimators (Random Forest).**

In [15]:
print("Grid Search is good for fine tuning and finding the best estimator but it's computationally expensive")

Grid Search is good for fine tuning and finding the best estimator but it's computationally expensive


In [18]:
C = [0.001, 0.1, 10]
estimators_range = [11,31,88]
cv_folds = 37
log_param_grid = dict(LogisticRegression__C = C)
rf_param_grid = dict(RandomForest__n_estimators = estimators_range)
for pipe, params in [(logPipe, log_param_grid), (randomForestPipe, rf_param_grid)]:
    grid = GridSearchCV(pipe, params, cv = cv_folds, scoring = scoring_metric, n_jobs=-1)
    grid.fit(features_X, predict_Y)
    print("Best cross-validated auc: {0}".format(grid.best_score_))
    print("Best parameter found: {0}".format(grid.best_params_))
    print("Fitted_model: {0}".format(grid.best_estimator_.steps[1][1]))

Best cross-validated auc:  0.552942755629
Best parameter found:  {'LogisticRegression__C': 0.001}
Fitted_model:  LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Best cross-validated auc:  0.549745372766
Best parameter found:  {'RandomForest__n_estimators': 88}
Fitted_model:  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=88, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
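
The cost of a grid search is the number of parameter combinations times the number of folds, plus one final refit. With 3 candidate values and 37 folds, the search above runs 3 × 37 = 111 fits per pipeline before refitting. The sketch below shows the `<step name>__<parameter>` naming convention on synthetic data. The data, the `scale` step and the 5-fold setting are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the shot logs)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("LogisticRegression", LogisticRegression())])

# Grid keys follow "<step name>__<parameter name>"
param_grid = {"LogisticRegression__C": [0.001, 0.1, 10]}

# 3 candidates x 5 folds = 15 fits, plus one final refit on all the data
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_.named_steps["LogisticRegression"])
```

Looking the fitted model up by name through `named_steps` avoids depending on the step's position in the pipeline, unlike `steps[1][1]`.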
