# Introduction to Machine Learning and Classification #
#### Presented by: Data Science Society ####
#### Authors: Roshan Lodha and Varun Murthy ####

<i> Credits: This notebook borrows heavily from the Data 100 lecture notebook. </i>

Topics covered:
- exploratory data analysis and feature selection / engineering
- loss functions
- gradient descent / logistic regression
- classification and measures of model integrity

## Introduction ##
Let's start by simply loading our packages and data. Lucky for us, sklearn provides many learning datasets that are already cleaned. Since data cleaning is not the focus of this workshop, we'll ignore this portion of the data science life cycle. 

In [None]:
#imports and loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn.datasets
from scipy.optimize import minimize

data_raw = sklearn.datasets.load_breast_cancer()
breast_cancer_data = pd.DataFrame(data_raw['data'], columns=data_raw['feature_names'])
# In data_raw['target'], 0 is malignant and 1 is benign; flip it so that malignant = 1
breast_cancer_data['malignant'] = 1 - data_raw['target']

### Feature Selection and Engineering ###

Let's begin selecting our features by sifting through the data.

In [None]:
breast_cancer_data.describe()

As noted in the introduction, the dataset is already cleaned: there are no null values, so the counts of all the columns are equal.

In [None]:
breast_cancer_data.head()

#### Questions: ####
1) What are some observations you can make about this data? <br>
2) What is the granularity of the data (what does each row represent)? <br>

Planning Ahead:<br>
1) What are we trying to figure out from this dataset? <br>

Going back to the observations, let's analyze the columns more closely.

In [None]:
breast_cancer_data.iloc[0]

There are clearly a lot of factors in play, which (based on the column values) seem more or less independent of one another. We can check whether this is true using seaborn's pairplot function.

In [None]:
sns.pairplot(breast_cancer_data)

Some of the features have nearly linear relationships! Looking briefly through the pairplot, we can see this specifically in row 1, column 3.

For the remainder of the notebook, we will ignore this correlation for the sake of simplicity. However, correlated features can have significant implications for our results.
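
One way to put a number on the near-linear relationships seen in the pairplot is the Pearson correlation coefficient. The sketch below does not touch the real dataset: `radius_demo` and `perimeter_demo` are hypothetical stand-in columns, built on the assumption that a roughly circular shape has a perimeter of about 2π times its radius.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two pairplot columns (not the real data).
radius_demo = rng.uniform(6, 28, size=500)
# Assumption: roughly circular shapes, so perimeter ~ 2*pi*radius plus noise.
perimeter_demo = 2 * np.pi * radius_demo + rng.normal(scale=3.0, size=500)

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r for the two columns.
r = np.corrcoef(radius_demo, perimeter_demo)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the two columns carry nearly the same information. On the real data, `breast_cancer_data.corr()` would produce the same kind of matrix for every pair of columns at once.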

<b>Before we continue, briefly glance across the bottom row of the pairplot, and think about what that row represents and how we can use it for feature selection.</b>

In [None]:
plt.xlabel('') #add the xlabel here
plt.ylabel('malignant')
x_vals = breast_cancer_data[''] #choose your features here
plt.scatter(x_vals, breast_cancer_data['malignant']);

We can better approximate the probabilities associated with these features by binning the radii, and calculating the proportion of malignant tumors in each bin.

<i>Credit: Data 100 Staff</i>

In [None]:
#binned data logistic simulation
radii = np.linspace(5, 30, 50)
averages = [np.average(breast_cancer_data[np.abs(breast_cancer_data['worst radius']-r)<2]['malignant']) for r in radii]
plt.xlabel('') #add the xlabel here
plt.ylabel('malignant')
x_vals = breast_cancer_data[''] #enter your feature here
plt.scatter(x_vals, breast_cancer_data['malignant']);
plt.scatter(radii, averages, color='red');

In [None]:
#added intercept column
breast_cancer_data['bias'] = 1.0 

### Classification ###

Now that we have selected our features, we can start learning! The first thing we need to do is split the data into a training set and a test set. This can be easily done using sklearn's train_test_split function.
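
Conceptually, a train-test split shuffles the rows and sets a fraction of them aside. The sketch below illustrates that idea with NumPy only. `split_rows` is a hypothetical helper, not sklearn's implementation, and the 0.25 fraction is purely illustrative rather than a recommendation.

```python
import numpy as np

def split_rows(n_rows, test_fraction, seed=100):
    """Return (train_idx, test_idx): a shuffled partition of range(n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # random order of row indices
    n_test = int(round(n_rows * test_fraction))  # number of rows held out
    return shuffled[n_test:], shuffled[:n_test]

# The breast cancer dataset has 569 rows.
train_idx, test_idx = split_rows(n_rows=569, test_fraction=0.25)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two index arrays, so the model is never evaluated on data it was trained on. Fixing the seed plays the same role as `random_state` in the cell below: it makes the split reproducible.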

#### Questions: ####
1) Hypothesize what a reasonable train-test split size would be. <br>
2) Which column represents the y values?

In [None]:
#prepwork for classification
from sklearn.model_selection import train_test_split

train, test = train_test_split(breast_cancer_data, test_size=?, random_state=100) #enter a test size here
print("Training Data Size: ", len(train))
print("Test Data Size: ", len(test))

Based on the EDA we did above, let's select some features. Think back to the pairplot and which features you think would work best, and try them below.

In [None]:
#selecting features (exercise: explain the code in this cell, specifically x_train/y_train)
def features(t):
    return t[['feature1', 'feature2', 'feature3']].values.T #replace feature(n) with the desired feature
    
x_train, y_train = features(train), train['malignant'].values

The last thing we need to do is fit the model to the data. Again, this can be easily done using sklearn's LogisticRegression class.

The math behind this is fairly complex. While the model below is being built, we will talk about loss minimization and what sklearn is doing behind the scenes. <br>

Specifically, we will discuss: <br>
1) What the numbers below mean. <br>
2) What the different loss functions are, and why logistic (cross-entropy) loss, built on the sigmoid function, is used in classification. <br>
3) How the loss is minimized. <br>
4) How we can use the model below to classify future inputs.
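
To make the loss-minimization discussion concrete, here is a minimal sketch of the loss that logistic regression typically minimizes, the cross-entropy (log) loss, together with plain gradient descent on it. It runs on made-up toy data, and every name in it (`sigmoid`, `cross_entropy_loss`, `gradient_descent`, `X_toy`, `y_toy`) is hypothetical. The `lbfgs` solver used in the cell below is a more sophisticated optimizer; gradient descent is swapped in here only because it is the simplest way to show the idea.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def cross_entropy_loss(theta, X, y):
    """Average log loss of predicted probabilities sigmoid(X @ theta)."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss_gradient(theta, X, y):
    """Gradient of the average log loss with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, learning_rate=0.1, steps=2000):
    theta = np.zeros(X.shape[1])  # start from all-zero coefficients
    for _ in range(steps):
        theta = theta - learning_rate * loss_gradient(theta, X, y)
    return theta

# Toy data (an assumption for illustration): one noisy feature plus a bias column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_toy = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)
X_toy = np.column_stack([np.ones_like(x), x])

theta_hat = gradient_descent(X_toy, y_toy)
print("loss at zeros: ", cross_entropy_loss(np.zeros(2), X_toy, y_toy))
print("loss after fit:", cross_entropy_loss(theta_hat, X_toy, y_toy))
```

Each step moves the coefficients a small distance against the gradient, so the loss shrinks from its starting value. The fitted coefficients play the same role as the numbers printed by the sklearn cell below.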

In [None]:
from sklearn.linear_model import LogisticRegression

breast_cancer_model = LogisticRegression(fit_intercept=False, C=1e9, solver='lbfgs')
breast_cancer_model.fit(x_train.T, y_train)
breast_cancer_model_features = breast_cancer_model.coef_[0]
breast_cancer_model_features

In [None]:
def sigma(t):
    return 1 / (1 + np.exp(-t))

Our predictions are in N-dimensional space, where N is the number of features. We can plot 2D cross sections to see the impact of a specific feature below. Take some time to try out all of the features you chose above and make note of any interesting observations. 

In [None]:
plt.scatter(train[''], train['malignant'], label = 'original data'); #enter your feature here
plt.xlabel('') #enter x-axis label here
plt.ylabel('malignant')
plt.scatter(radii, averages, color='gold', label = 'binned means');
#assumes the first selected feature is 'bias' and the second is the feature plotted on the x-axis
plt.plot(radii, sigma(breast_cancer_model_features[0] + radii * breast_cancer_model_features[1]), color='r', label = 'logistic model');
plt.legend();

### Predicting ###

Let's first define some helper functions, predict_prob and classify.

<i>Credit: Data 100 Staff </i>

In [None]:
def predict_prob(X, betas = breast_cancer_model_features):
    return sigma(X.T @ betas)

def classify(probabilities, threshold = 0.5):
    return np.int64(probabilities > threshold)

In the classify function, we see a "threshold." This is simply the predicted probability that marks the cutoff between malignant and benign: a tumor whose probability is above the threshold is classified as malignant.
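
A small sketch of how the threshold changes the labels, using made-up probabilities. `classify_demo` and `probs_demo` are hypothetical names chosen so they do not clash with the notebook's own `classify`.

```python
import numpy as np

def classify_demo(probabilities, threshold=0.5):
    # 1 = malignant, 0 = benign, mirroring the notebook's labels
    return (np.asarray(probabilities) > threshold).astype(int)

probs_demo = np.array([0.20, 0.45, 0.55, 0.90])  # made-up probabilities

print(classify_demo(probs_demo, threshold=0.5))  # [0 0 1 1]
print(classify_demo(probs_demo, threshold=0.3))  # [0 1 1 1]
```

Lowering the threshold from 0.5 to 0.3 flips the second tumor from benign to malignant, even though its predicted probability has not changed.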

In [None]:
#predicting our trained values
train_predicted = classify(predict_prob(x_train))
train_predicted

We can measure how well our model's predictions match the true labels using metrics like accuracy, precision, and recall. Shown below is the accuracy.

In [None]:
#training data accuracy
training_acc = np.sum(train_predicted == y_train) / len(train_predicted)
training_acc

#### Optional Question: ####
1) Calculate the train and test mean squared error. 

### Assessing The Model ###
First, we should check how well our model did on test data. 

In [None]:
x_test, y_test = features(test), test['malignant'].values
test_predicted = classify(predict_prob(x_test))
test_acc = np.sum(test_predicted == y_test) / len(test_predicted)
test_acc

We can now go back and assess how the threshold affects the accuracy. Try different threshold values to better understand the relationship between accuracy and threshold.

In [None]:
#try different thresholds here
test_predicted = classify(predict_prob(x_test), threshold = ?) #enter threshold here
test_acc = np.sum(test_predicted == y_test) / len(test_predicted)
test_acc

#### Questions: ####
1) Why can't we just pick the threshold that gives us the highest accuracy?<br>

<i>Credit: Data 100 Staff</i>

In [None]:
#precision and recall
def precision_recall(classified, actual):
    # It's not necessary to define each of these in both the function for precision
    # and recall, but they're here just for the sake of clarity
    tp = sum((actual == classified) & (actual == 1))
    tn = sum((actual == classified) & (actual == 0))
    fp = sum((actual != classified) & (actual == 0))
    fn = sum((actual != classified) & (actual == 1))
    
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

precision, recall = precision_recall(test_predicted, y_test)
print('precision = ', precision)
print('recall = ', recall)
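
To make the counts above concrete, here is a tiny example that can be checked by hand. The labels in `actual_demo` and `predicted_demo` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up labels: 3 malignant (1) and 5 benign (0) tumors.
actual_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
# A cautious classifier that flags only one tumor as malignant.
predicted_demo = np.array([1, 0, 0, 0, 0, 0, 0, 0])

tp = np.sum((predicted_demo == 1) & (actual_demo == 1))  # true positives: 1
fp = np.sum((predicted_demo == 1) & (actual_demo == 0))  # false positives: 0
fn = np.sum((predicted_demo == 0) & (actual_demo == 1))  # false negatives: 2

accuracy_demo = np.mean(predicted_demo == actual_demo)   # 6 of 8 correct
precision_demo = tp / (tp + fp)                          # 1 / 1
recall_demo = tp / (tp + fn)                             # 1 / 3

print(accuracy_demo, precision_demo, recall_demo)
```

Accuracy is 0.75 and precision is 1.0, yet recall is only one third: two of the three malignant tumors were missed. This is why accuracy on its own can paint too rosy a picture.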

That concludes our walkthrough of machine learning! In this notebook, we explored a dataset, selected and engineered features, built a model, and assessed how well it performed.

Some notable topics we didn't cover in this workshop: <br>
1. K-fold cross validation <br>
2. Neural Networks <br>
3. Linear Regression

## Mini-Contest ##
Now that you know a little bit about machine learning, try building your own classifier to predict the probability of a basketball team winning given statistics from the National Basketball Association.

In [None]:
#Note: This code is copied from Data 100 Course Staff
#    It creates the games database to use for the classification problem
import requests
import os

def fetch():
    path = 'nba.csv'
    if not os.path.exists(path):
        url = 'https://stats.nba.com/stats/leaguegamelog/'
        params = (
            ('Counter', '0'),
            ('DateFrom', ''),
            ('DateTo', ''),
            ('Direction', 'ASC'),
            ('LeagueID', '00'),
            ('PlayerOrTeam', 'T'),
            ('Season', '2017-18'),
            ('SeasonType', 'Regular Season'),
            ('Sorter', 'DATE'),
        )
        headers = {
            'User-Agent': 'PostmanRuntime/7.4.0'
        }
        response = requests.get(url, params=params, headers=headers)
        data = response.json()['resultSets'][0]
        df = pd.DataFrame(data=data['rowSet'], columns=data['headers'])
        df.to_csv(path, index=False)
        return df
    else:
        return pd.read_csv(path)
    
df = fetch()

one_team = df.groupby("GAME_ID").first()
opponent = df.groupby("GAME_ID").last()
games = one_team.merge(opponent, left_index = True, right_index = True, suffixes = ["", "_OPP"])
games["FG_PCT_DIFF"] = games["FG_PCT"] - games["FG_PCT_OPP"]
#map W/L to 1/0 (avoids relying on replace() downcasting the column to integers)
games['WON'] = games['WL'].map({'L': 0, 'W': 1})

In [None]:
games.head()