# Using Machine Learning to Predict NBA Game Winners

This Jupyter notebook is auxiliary material for my capstone project report in Udacity's Machine Learning Engineer Nanodegree. The PDF file can be found in my GitHub repository:

* https://github.com/vilacham/capstone_report

---
## Importing data

As a first step, I will import the dataset and create a copy of it to work on:

In [1]:
# Import pandas
import pandas as pd

# Import dataset and create a copy of it
try:
    original_data = pd.read_excel('capstone_database.xlsx')
    # .copy() returns an independent DataFrame; plain assignment would only alias the original
    data = original_data.copy()
    print('Dataset was successfully imported and has {} samples with {} features each.'.format(*data.shape))
except FileNotFoundError:
    print('Dataset could not be loaded. Is it missing?')

Dataset was successfully imported and has 36154 samples with 96 features each.


---
## Importing functions and modules
To make this Jupyter notebook easier to read, I moved the longer functions and classes into separate Python files: `functions.py`, `best_streak_classifier.py` and `majority_vote_classifier.py`.

The first file, `functions.py`, can be found at https://github.com/vilacham/capstone_report/blob/master/functions.py and contains the following functions:
* `preprocess`, which I will use to preprocess data (rename its columns, drop NaNs, deal with categorical data, create the label column, drop unnecessary columns and convert all features to numerical data);
* `get_frequent_outliers`, which I will use to identify and drop samples that are outliers for more than one feature;
* `standardize`, which I will use to standardize the features in my dataset;
* `divide_data`, which I will use to divide my dataset into three (last game data, last two games data and last five games data);
* `get_n_principal_components`, which I will use to find the *n* principal components to reduce my dataset (I aim to use those principal components that explain at least 60% of the variance);
* `plot_pca_graph`, which I will use to plot a graph with the explained variance ratios of the principal components and their cumulative sum;
* `reduce`, which will reduce my dataset to those principal components.
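
As a rough illustration of the 60% criterion mentioned for `get_n_principal_components`, the sketch below picks the smallest number of components whose cumulative explained variance ratio reaches a threshold. The helper name `n_components_for_variance` and the example ratios are hypothetical; the actual implementation lives in `functions.py` and may differ.

```python
from itertools import accumulate


def n_components_for_variance(explained_variance_ratios, threshold=0.60):
    """Return the smallest n whose cumulative explained variance >= threshold.

    The ratios are assumed to be sorted in descending order, as in
    scikit-learn's `PCA.explained_variance_ratio_`.
    """
    cumulative_sums = accumulate(explained_variance_ratios)
    for n, cumulative in enumerate(cumulative_sums, start=1):
        if cumulative >= threshold:
            return n
    # Threshold never reached: keep every component
    return len(explained_variance_ratios)


# Made-up ratios, for illustration only
example_ratios = [0.30, 0.25, 0.20, 0.15, 0.10]
n_comp_example = n_components_for_variance(example_ratios)
print('Number of components: {}'.format(n_comp_example))
```

With real data, the ratios would come from a PCA fitted on the training set only, so that the test set does not influence the choice of *n*.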

The second file, `best_streak_classifier.py`, can be found at https://github.com/vilacham/capstone_report/blob/master/best_streak_classifier.py and defines a class that predicts a winner based only on the streak features of the home team and the visitor team: the team with the higher streak is predicted as the winner and, in the case of a tie, the home team is predicted as the winner.
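
The rule described above can be sketched in a few lines. This is not the code from `best_streak_classifier.py`: the class name `BestStreakSketch`, the label encoding (1 for a home win, 0 for a visitor win) and the streak values are assumptions made for illustration.

```python
class BestStreakSketch:
    """Predict the team with the higher streak; ties go to the home team."""

    def __init__(self, home_streaks, visitor_streaks):
        self.home_streaks = list(home_streaks)
        self.visitor_streaks = list(visitor_streaks)

    def predict(self):
        # Assumed encoding: 1 = home team wins, 0 = visitor team wins
        return [1 if home >= visitor else 0
                for home, visitor in zip(self.home_streaks, self.visitor_streaks)]

    def score(self, y_true):
        # Accuracy: fraction of games whose winner was predicted correctly
        y_true = list(y_true)
        hits = sum(pred == true for pred, true in zip(self.predict(), y_true))
        return hits / len(y_true)


# Made-up streak values for four games
sketch = BestStreakSketch([3, -2, 1, 0], [1, 4, 1, -1])
predictions = sketch.predict()
accuracy = sketch.score([1, 0, 0, 1])
print(predictions, accuracy)
```

Because the rule has no trainable parameters, it serves as a simple baseline that the learned models should beat.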

The third file, `majority_vote_classifier.py`, can be found at https://github.com/vilacham/capstone_report/blob/master/majority_vote_classifier.py and defines a class that predicts a winner based on the majority vote of the following classifiers:
* Logistic Regression;
* Decision Tree;
* K Neighbors;
* Multi-Layer Perceptron;
* Support Vector Machine;
* Gaussian NB;
* Random Forest.
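
To make the voting step concrete, here is a minimal sketch of hard (majority) voting over label predictions. The function name `majority_vote` and the example predictions are hypothetical, and the real `MajorityVoteClassifier` may combine its classifiers differently.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine per-classifier label predictions by hard (majority) voting.

    `predictions_per_classifier` holds one equally long sequence of
    predicted labels per classifier.
    """
    voted = []
    # zip(*...) groups the votes that all classifiers cast for the same sample
    for votes in zip(*predictions_per_classifier):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Made-up predictions from seven classifiers for three games
example_votes = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
combined = majority_vote(example_votes)
print(combined)
```

With seven classifiers and two possible labels a tie cannot occur. scikit-learn's `VotingClassifier` with `voting='hard'` offers the same kind of combination off the shelf.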

In [2]:
import functions as f
from best_streak_classifier import BestStreakClassifier
from majority_vote_classifier import MajorityVoteClassifier

---
## Data preprocessing

Now that I have a copy of the dataset, my next steps are: 
* rename its columns;
* remove NaNs;
* deal with categorical data;
* create the label column; and
* drop unnecessary columns.

In [3]:
data = f.preprocess(data)

In [4]:
# TODO: review this code
# Baseline: accuracy of the streak-based classifier on the whole preprocessed dataset
bsc = BestStreakClassifier(data['H STK'], data['A STK'])
bsc.score(data['WINNER'])

0.58229942100909848

In [4]:
# Positions of the samples that are outliers for more than one feature
outliers = f.get_frequent_outliers(data)

# Drop those samples and rebuild a clean 0..n-1 index
good_data = data.drop(data.index[outliers]).reset_index(drop=True)
print('{} outliers for more than one feature.'.format(len(outliers)))
print('Original data had {} samples.'.format(data.shape[0]))
print('Good data has {} samples.'.format(good_data.shape[0]))

3214 outliers for more than one feature.
Original data had 16926 samples.
Good data has 13712 samples.
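
The sketch below illustrates one way to find samples that are outliers for more than one feature. It assumes Tukey's rule (values more than 1.5 times the interquartile range outside the quartiles), which may not be the criterion used by `get_frequent_outliers`; the function name `frequent_outliers` and the data are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def frequent_outliers(columns, step_factor=1.5, min_features=2):
    """Return row indices flagged as outliers in at least `min_features` columns.

    `columns` maps a feature name to its list of values (one per sample).
    Outliers are flagged with Tukey's rule, which is an assumption here.
    """
    flagged = Counter()
    for values in columns.values():
        q1, _median, q3 = quantiles(values, n=4, method='inclusive')
        step = step_factor * (q3 - q1)
        low, high = q1 - step, q3 + step
        for index, value in enumerate(values):
            if value < low or value > high:
                flagged[index] += 1
    return sorted(index for index, hits in flagged.items() if hits >= min_features)


# Made-up data: sample 4 is extreme in two features, sample 0 in only one
example_columns = {
    'a': [1, 2, 3, 4, 100],
    'b': [10, 11, 12, 13, -50],
    'c': [60, 5, 6, 6, 5],
}
print(frequent_outliers(example_columns))
```

Only samples flagged in two or more features are returned, so a sample that is extreme in a single feature is kept.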


In [5]:
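# Import train_test_split
from sklearn.model_selection import train_test_split

# Split dataset: the last column is the label, all other columns are features.
# Hold out 30% of the samples for testing, stratified by label.
X, y = good_data.iloc[:, :-1], good_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)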

In [6]:
X_train, X_test = f.standardize(X_train, X_test)
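
For a single feature, standardizing with training statistics can be sketched as follows. This assumes that `f.standardize` scales each feature to zero mean and unit variance using statistics computed on the training set only; its actual implementation is in `functions.py`, and the helper name `standardize_column` and the values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_column(train_values, test_values):
    """Scale one feature using the mean and standard deviation of the training set."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # guard against constant features

    def scale(values):
        return [(value - mean) / std for value in values]

    # The test set is scaled with the training statistics, never its own
    return scale(train_values), scale(test_values)


# Made-up values, for illustration only
train_scaled, test_scaled = standardize_column([2.0, 4.0, 6.0], [4.0, 8.0])
print(train_scaled, test_scaled)
```

Reusing the training mean and standard deviation for the test set keeps information about the test samples out of the preprocessing step.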

In [7]:
# Last game data (training and test sets): first 30 columns plus columns 86 onward
X_train_last_game = X_train[list(X_train.columns[:30]) + list(X_train.columns[86:])]
X_test_last_game = X_test[list(X_test.columns[:30]) + list(X_test.columns[86:])]

# Last two games data (training and test sets): first 2 columns, columns 30 to 57 and columns 86 onward
X_train_last_two_games = X_train[list(X_train.columns[:2]) + list(X_train.columns[30:58]) + list(X_train.columns[86:])]
X_test_last_two_games = X_test[list(X_test.columns[:2]) + list(X_test.columns[30:58]) + list(X_test.columns[86:])]

# Last five games data (training and test sets): first 2 columns plus columns 58 onward
X_train_last_five_games = X_train[list(X_train.columns[:2]) + list(X_train.columns[58:])]
X_test_last_five_games = X_test[list(X_test.columns[:2]) + list(X_test.columns[58:])]

---
## Last game

In [8]:
n_comp = f.get_n_principal_components(X_train_last_game)
print('Number of components: {}'.format(n_comp))

Number of components: 6


In [9]:
X_train_reduced, X_test_reduced = f.reduce(X_train_last_game, X_test_last_game, n_comp)

In [21]:
# TODO: do this last
bsc_last_game = BestStreakClassifier(X_train_last_game['H STK'].values, X_train_last_game['A STK'].values)
bsc_pred_train = bsc_last_game.predict()
print('Best Streak Classifier score in last game dataset: {}'.format(bsc_last_game.score(y_train)))

Best Streak Classifier score in last game dataset: 0.5822046259637425


In [26]:
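from sklearn.linear_model import LogisticRegression

# Fit a logistic regression on the PCA-reduced last game training set
lr_last_game = LogisticRegression()
lr_last_game.fit(X_train_reduced, y_train)

# score() returns accuracy as a fraction between 0 and 1, so it is printed without a percent sign
train_score = lr_last_game.score(X_train_reduced, y_train)
print('Logistic regression score in the training set: {:.3f}'.format(train_score))

Logistic regression score in the training set: 0.612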


---
## Last two games

In [18]:
n_comp_last_two_games = f.get_n_principal_components(X_train_last_two_games)
print('Number of components: {}'.format(n_comp_last_two_games))

Number of components: 6


In [19]:
X_train_last_two_games_reduced, X_test_last_two_games_reduced = f.reduce(X_train_last_two_games, X_test_last_two_games, n_comp_last_two_games)

In [31]:
# Fit a logistic regression on the PCA-reduced last two games training set
lr_last_two_games = LogisticRegression()
lr_last_two_games.fit(X_train_last_two_games_reduced, y_train)

# score() returns accuracy as a fraction between 0 and 1, so it is printed without a percent sign
train_score_last_two_games = lr_last_two_games.score(X_train_last_two_games_reduced, y_train)
print('Logistic regression score in the training set: {:.3f}'.format(train_score_last_two_games))

Logistic regression score in the training set: 0.614


In [37]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost with default settings on the PCA-reduced last two games training set
adaboost_last_two_games = AdaBoostClassifier()
adaboost_last_two_games.fit(X_train_last_two_games_reduced, y_train)
ada_score_l2g = adaboost_last_two_games.score(X_train_last_two_games_reduced, y_train)
print('AdaBoost Classifier score in the training set: {}'.format(ada_score_l2g))

AdaBoost Classifier score in the training set: 0.6263804959366535


---
## Last five games

In [23]:
n_comp_last_five_games = f.get_n_principal_components(X_train_last_five_games)
print('Number of components: {}'.format(n_comp_last_five_games))

Number of components: 6


In [24]:
X_train_last_five_games_reduced, X_test_last_five_games_reduced = f.reduce(X_train_last_five_games, X_test_last_five_games, n_comp_last_five_games)

In [33]:
lr_last_five_games = LogisticRegression()
lr_last_five_games.fit(X_train_last_five_games_reduced, y_train)
train_score_last_five_games = lr_last_five_games.score(X_train_last_five_games_reduced, y_train)
print('Logistic regression score in the training set: {:.3f}'.format(train_score_last_five_games))

Logistic regression score in the training set: 0.614


In [50]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Decision stump (depth-1 tree) used as the weak learner
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)

# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; newer versions need the new name
adaboost_last_five_games = AdaBoostClassifier(base_estimator=tree, n_estimators=500, learning_rate=0.1, random_state=42)
adaboost_last_five_games.fit(X_train_last_five_games_reduced, y_train)

# Compare accuracy on the training and testing sets
ada_score_l5g = adaboost_last_five_games.score(X_train_last_five_games_reduced, y_train)
ada_test_score_l5g = adaboost_last_five_games.score(X_test_last_five_games_reduced, y_test)
print('AdaBoost Classifier score in the training set: {}'.format(ada_score_l5g))
print('AdaBoost Classifier score in the testing set: {}'.format(ada_test_score_l5g))

AdaBoost Classifier score in the training set: 0.6236715982496354
AdaBoost Classifier score in the testing set: 0.6098687408847837
