In [None]:
!pip install plotly
import numpy as np
import pandas as pd


import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import altair as alt

# Binary Classification and Logistic Regression

This lab is more of a walkthrough than a coding exercise. All questions in this lab are free response, so submit your answers as a PDF to Gradescope.

The goal of this lab is to show you how to classify data into two categories, with a short note at the end on how to predict more than two.

This assignment should be completed and submitted before **11:59 PM on Friday, June 5, 2020**.

## Collaboration Policy
Data science is a collaborative activity. While you may talk to others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** in the following cell:



*List collaborators here*

## Data

For this lab we will use the Pokemon Gen VI Tierlist Dataset, which we can obtain from [Kaggle](https://www.kaggle.com/notgibs/smogon-6v6-pokemon-tiers). You can find the description of the original attributes [here](https://www.kaggle.com/abcsds/pokemon).

- **X.**: Pokedex Number
- **Name**: Name of each pokemon
- **Type 1**: Primary type. Each pokemon has a type, which determines its weakness/resistance to attacks.
- **Type 2**: Secondary type. Some pokemon are dual type and have a second type.
- **Total**: sum of all stats given below, a general guide to how strong a pokemon is.
- **HP**: hit points, or health, defines how much damage a pokemon can withstand before fainting.
- **Attack**: the base modifier for normal attacks (e.g., Scratch, Punch).
- **Defense**: the base damage resistance against normal attacks.
- **SP Atk**: special attack, the base modifier for special attacks (e.g., fire blast, bubble beam).
- **SP Def**: the base damage resistance against special attacks.
- **Speed**: determines which pokemon attacks first each round.

In [None]:
df = pd.read_csv("pokemon.csv")
df.info()

In [None]:
df["Tier"].unique()

Pokemon in the competitive scene are sorted into tiers, and a Pokemon can be played in the format for its own tier and in any format above it. In today's lab, we will consider a Pokemon to be competitively viable (called "being in meta") if it is in tier UU or above, meaning that the following tiers will be considered "meta" (*if you are curious, you can read more in this [explanation of tiers](https://pokemon.neoseeker.com/wiki/Tier_listings)*):

* AG
* Uber
* OU
* BL
* UU
* BL2


Create a new column `"Meta"` that is True when a pokemon is in one of the above tiers, and False when it is not.

In [None]:
df["Meta"] = df["Tier"].isin(["AG", "Uber", "OU", "BL", "UU", "BL2"])

In [None]:
# TEST
sum(df["Meta"])

To visualize and classify our Pokemon, we will use the `Total` column (i.e., the sum of all stats, which summarizes a Pokemon's overall strength).

Altair lets us visualize the spread of the two classes with very little code.

In [None]:
alt.Chart(df).mark_tick().encode(
    x = 'Total', 
    y = 'Meta',
    color = 'Meta'
).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
).properties(
    width=600,
    height=200
).interactive()

Perhaps a better way to visualize the data is using overlaid histograms, one per class.

In [None]:
# Overlaid (unstacked) histograms of Total, one per Meta class.
# No fold transform is needed: folding would duplicate every row and double the counts.
hist = alt.Chart(df).mark_area(
    opacity=0.5,
    interpolate='step'
).encode(
    alt.X('Total:Q', bin=alt.Bin(maxbins=30)),
    alt.Y('count()', stack=None),
    alt.Color('Meta:N')
)
hist

What if we wanted to visualize these data as a scatterplot?

In [None]:
alt.Chart(df).mark_point().encode(
    x = 'Total:Q', 
    y = 'Meta:Q',
    color = 'Meta'
).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
).properties(
    width=600,
    height=200
)

## Least Squares Regression


**Goal:** We would like to predict whether a Pokemon is meta or not based on the sum of its stat values.

We will define $X$ and $Y$ as variables containing the training features and labels.

In [None]:
X = df[['Total']].values
Y = df['Meta'].values.astype('float')

Fit a least squares regression model.

In [None]:
import sklearn.linear_model as linear_model

least_squares_model = linear_model.LinearRegression()
least_squares_model.fit(X,Y)
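
As a rough illustration of the kind of problem `LinearRegression` solves, the sketch below fits a line to a 0/1 label with plain NumPy. The `totals` and `labels` arrays are made-up numbers, not values from `pokemon.csv`, and `np.linalg.lstsq` stands in for whatever solver scikit-learn uses internally.

```python
import numpy as np

# Hypothetical toy data: stat totals and 0/1 "meta" labels (not from pokemon.csv)
totals = np.array([300.0, 400.0, 500.0, 600.0])
labels = np.array([0.0, 0.0, 1.0, 1.0])

# Design matrix with a column of ones so the model has an intercept
A = np.column_stack([np.ones_like(totals), totals])

# Solve min ||A @ theta - labels||^2
theta, *_ = np.linalg.lstsq(A, labels, rcond=None)
intercept, slope = theta

# Fitted values; note that nothing keeps them inside [0, 1]
predictions = A @ theta
```

On this toy data the fitted values fall below 0 and above 1 at the extremes, which is the behavior the following sections examine.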

## How is our fit?

Adding a small amount of noise avoids overplotting when we draw our scatterplot.

In [None]:
jitter_y = Y + 0.1*np.random.rand(len(Y)) - 0.05
jitter_x = X + 10*np.random.rand(len(X), 1) - 5
print(jitter_x.shape)
points = go.Scatter(name="Jittered Data", 
                    x=np.squeeze(jitter_x), y = jitter_y, 
                    mode="markers", marker=dict(opacity=0.5))
X_plt = np.linspace(np.min(X), np.max(X), 10)
model_line = go.Scatter(name="Least Squares",
    x=X_plt, y=least_squares_model.predict(X_plt[:,np.newaxis]), 
    mode="lines", line=dict(color="orange"))
py.iplot([points, model_line])

## What is the Root Mean Squared Error?

In [None]:
from sklearn.metrics import mean_squared_error as mse
print("RMSE:", np.sqrt(mse(Y, least_squares_model.predict(X))))
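
For reference, the RMSE can also be computed by hand. This is a minimal sketch with hypothetical labels and predictions; `rmse` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical 0/1 labels and real-valued predictions
example_rmse = rmse([0, 0, 1, 1], [-0.1, 0.3, 0.7, 1.1])
```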

## Question 1

1. Are we happy with the fit?
2. What is the meaning of predictions that are neither 0 nor 1?
3. Put the RMSE in the context of the goal.

*Write your answer here, replacing this text.*

## Classification Error

This is a classification problem so we probably want to measure how often we predict the correct value.  This is sometimes called the zero-one loss (or error):

$$ \large
\textbf{ZeroOneLoss} = \frac{1}{n} \sum_{i=1}^n \textbf{I}\left[ y_i \neq f_\theta(x_i) \right]
$$

However, to use the classification error we need to define a decision rule that maps $f_\theta(x)$ to the $\{0,1\}$ classification values.
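
The zero-one loss is simply the fraction of predictions that disagree with the labels. A minimal sketch, using hypothetical labels and an illustrative helper name:

```python
import numpy as np

def zero_one_loss_manual(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Hypothetical labels: one of the four predictions is wrong
example_loss = zero_one_loss_manual([0, 0, 1, 1], [0, 1, 1, 1])
```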

## Simple Decision Rule

Suppose we instituted the following simple decision rule:

$$\Large
\text{If } f_\theta(x) > 0.5  \text{ predict 1 (meta) else predict 0 (non-meta).}
$$

This simple **decision rule** decides that a Pokemon is meta if our model predicts a value above 0.5 (closer to 1 than to 0).
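
For a line $f_\theta(x) = \theta_0 + \theta_1 x$ with $\theta_1 > 0$, the rule $f_\theta(x) > 0.5$ is the same as $x > (0.5 - \theta_0)/\theta_1$. The sketch below applies the rule with hypothetical parameters; `intercept` and `slope` are made-up values, not the ones fitted in this lab.

```python
import numpy as np

# Hypothetical fitted parameters (not the values from the lab's model)
intercept, slope = -1.3, 0.004

def decide(total, threshold=0.5):
    """Return 1 (meta) when the linear prediction exceeds the threshold, else 0."""
    scores = intercept + slope * np.asarray(total, dtype=float)
    return (scores > threshold).astype(int)

# Total at which the line crosses the threshold
boundary = (0.5 - intercept) / slope

decisions = decide([300, 449, 451, 600])
```

With these numbers the boundary sits at a Total of about 450, so Pokemon just below it are labeled 0 and those just above it are labeled 1.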

In the following we plot the implication of these decisions on our training data.

In [None]:
jitter_y = Y + 0.1*np.random.rand(len(Y)) - 0.05
jitter_x = X + 10*np.random.rand(len(X), 1) - 5
ind_mal = least_squares_model.predict(X) > 0.5

mal_points = go.Scatter(name="Classified as meta", 
                    x=np.squeeze(jitter_x[ind_mal]), y = jitter_y[ind_mal], 
                    mode="markers", marker=dict(opacity=0.5, color="red"))
ben_points = go.Scatter(name="Classified as non-meta", 
                    x=np.squeeze(jitter_x[~ind_mal]), y = jitter_y[~ind_mal], 
                    mode="markers", marker=dict(opacity=0.5, color="blue"))
dec_boundary = (0.5 - least_squares_model.intercept_)/least_squares_model.coef_[0]
dec_line = go.Scatter(name="Least Squares Decision Boundary", 
                      x = [dec_boundary,dec_boundary], y=[-0.5,1.5], mode="lines",
                     line=dict(color="black", dash="dot"))
py.iplot([mal_points, ben_points, model_line,dec_line])

## Compute `ZeroOneLoss`

In [None]:
from sklearn.metrics import zero_one_loss
print("Training Fraction incorrect:", 
      zero_one_loss(Y, least_squares_model.predict(X) > 0.5))

## Question 2

1. Are we happy with this error level?
2. 40% of eligible pokemon are considered meta. If we always guessed the majority label (meta or non-meta), what percentage would we get wrong?

*Write your answer here, replacing this text.*



#### Can we think of the line as a "probability"?


Not really.  Probabilities are constrained between 0 and 1.   How could we learn a model that captures this probabilistic interpretation?



#### Could we just truncate the line?

Maybe we can define the probability as:

$$ \large
p_i = \min\left(\max \left( x_i^T \theta , 0 \right), 1\right)
$$

This would look like:

In [None]:
def bound01(z):
    u = np.where(z > 1, 1, z)
    return np.where(u < 0, 0, u)

In [None]:
X_plt = np.linspace(np.min(X), np.max(X), 100)
p_line = go.Scatter(name="Truncated Least Squares",
    x=X_plt, y=bound01(least_squares_model.predict(np.array([X_plt]).T)), 
    mode="lines", line=dict(color="green", width=8))
py.iplot([mal_points, ben_points, model_line, p_line, dec_line], filename="lr-06")

So far least squares regression seems pretty reasonable and we can "force" the predicted values to be bounded between 0 and 1. 

## Logistic Regression

In most cases, a truncated linear regression is not a robust enough model for predicting probabilities because it is very sensitive to outliers. This is why we typically use logistic regression to model probabilities.
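
Logistic regression passes a linear score through the logistic (sigmoid) function $\sigma(z) = 1/(1 + e^{-z})$, which keeps every output strictly between 0 and 1. The sketch below uses hypothetical parameters (`slope` and `midpoint` are made up, not fitted to the Pokemon data) to show the probabilities and the labels obtained by thresholding them.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Hypothetical parameters: the probability is exactly 0.5 at Total = midpoint
slope, midpoint = 0.02, 450.0

totals = np.array([200.0, 450.0, 800.0])
probabilities = sigmoid(slope * (totals - midpoint))

# Thresholding the probabilities at 0.5 gives hard class labels
labels = (probabilities > 0.5).astype(int)
```

In scikit-learn, `predict_proba` returns probabilities of this kind, while `predict` returns the thresholded labels.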

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logistic_model = LogisticRegression(solver='lbfgs').fit(X,Y)
# Note: predict() returns hard 0/1 class labels, not probabilities
Y_pred = logistic_model.predict(np.linspace(200,800,100).reshape(-1,1))
source1 = pd.DataFrame({
    'x': X.squeeze(),
    'y': Y
})
source2 = pd.DataFrame({
    'model_x': np.linspace(200,800,100),
    'model_y': Y_pred
})
alt.Chart(source1).mark_point().encode(
    x='x',
    y='y'
) + alt.Chart(source2).mark_line(color="red").encode(
    x=alt.X('model_x', title="Total stat values", scale= alt.Scale(zero=False)),
    y=alt.Y('model_y', title="Predicted class (1 = meta)")
)

Because `predict` returns hard 0/1 labels, the red line is a step function. Notice how it jumps at a `Total` value close to the decision boundary that we chose earlier.

We therefore expect the accuracy to be about the same as with our earlier decision rule. Let's also take a look at the RMSE.

In [None]:
print("Fraction incorrect:", 
      zero_one_loss(Y, logistic_model.predict(X) > 0.5))
print("RMSE:", 
      np.sqrt(mse(Y, logistic_model.predict(X))))

## Question 3

Notice how the RMSE is larger for logistic regression. Why do you think this is, and is this kind of loss useful for this kind of model?

*Write your answer here, replacing this text.*

## Multiclass classification (Generalized Logistic Regression)

The rest of this notebook is just informational. There are no further questions, but you may find this useful for your final project.

Logistic regression can be thought of as binary classification, where we pick between the 0 class and the 1 class. In many cases, there are not just two classes we would like to predict. For this Pokemon dataset, we would instead like to predict the highest format that a Pokemon with a certain set of stats will be legal under. 

First, let's build a dataframe that is easier to work with. We select the numeric features and map each tier to an integer class index, merging each banlist with its corresponding tier (so the class is the format rather than the tier).

In [None]:
df_classification = df[["HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Legendary", "Mega"]]
df_classification = df_classification.astype("int64")
df_target = df["Tier"].replace({"AG":0,
                                "Uber":1,
                                "OU":2,
                                "BL":2,
                                "UU":3, 
                                'BL2':3, 
                                'RU':4, 
                                'BL3':4, 
                                'NU':5, 
                                'BL4':5,
                                'PU':6})
target_onehot = pd.get_dummies(df_target)

We can also check how many examples of each tier we have in the dataset.

In [None]:
df['Tier'].value_counts()

We now have 7 different categories that we want to predict.

To get a probability for each category, we will use the softmax function, which takes an array of real-valued scores (one per class) and turns them into probabilities that sum to 1:

$$\large p_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}$$

Let's define a loss function that can handle multiple classes. This is called cross entropy. Note that, on its own, it does not correct for class imbalance: classes with many examples contribute more terms to the total loss. For a single observation $o$, it is:

$$\large-\sum_{c=1}^My_{o,c}\log(p_{o,c})$$

Here $M$ is the number of classes, $y_{o,c}$ is a binary indicator of whether class $c$ is the correct label for observation $o$, and $p_{o,c}$ is the predicted probability that observation $o$ is of class $c$.
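
To make the two formulas concrete, here is a small worked example on made-up scores. The helper names `softmax_rows` and `cross_entropy_total` are illustrative only; subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy_total(probs, onehot):
    """Sum over observations of -log(probability assigned to the true class)."""
    return -np.sum(onehot * np.log(probs))

# Two hypothetical observations, three classes; both truly belong to class 0
scores = np.array([[0.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0]])
onehot = np.array([[1, 0, 0],
                   [1, 0, 0]])

probs = softmax_rows(scores)
loss = cross_entropy_total(probs, onehot)
```

The first row has equal scores, so each class gets probability 1/3 and contributes $\log 3$ to the loss; the second row favors the correct class and contributes less.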

In [None]:
from scipy.special import softmax

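def cross_entropy(prediction, target, epsilon=1e-12):
    # Clip probabilities away from 0 and 1 so that log() stays finite
    predictions = np.clip(np.asarray(prediction, dtype=float), epsilon, 1. - epsilon)
    # Total over all observations and classes of -y * log(p)
    ce = -np.sum(np.asarray(target, dtype=float) * np.log(predictions))
    return ce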

def softmax_cross_entropy(weights):
    weights = weights.reshape(7,7)
    p = softmax(df_classification @ weights, axis=1)
    return cross_entropy(p, target_onehot)
    
softmax_cross_entropy(np.zeros((7,7)))

Let's try to minimize this loss function using SciPy's `minimize` function. Normally we would want to use something like gradient descent, but to keep the code simple, we will use this prebuilt optimizer. This can take a while, so sit back and relax.

In [None]:
from scipy.optimize import minimize
# minimize expects a 1-D starting point; softmax_cross_entropy reshapes it to (7, 7)
m = minimize(softmax_cross_entropy, x0=np.zeros(7 * 7))
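
For comparison, the gradient descent approach mentioned above can be sketched in a few lines of NumPy. This is an illustration on hypothetical toy data, not the lab's method: it minimizes the mean (rather than the summed) cross entropy so that the step size does not depend on the number of rows, and `learning_rate` and `steps` are arbitrary choices.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def mean_cross_entropy(W, X, Y):
    """Average over rows of -log(probability of the true class)."""
    probs = softmax_rows(X @ W)
    return -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))

def gradient_descent(X, Y, learning_rate=0.1, steps=200):
    """Minimize the mean softmax cross entropy with plain gradient descent."""
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    for _ in range(steps):
        probs = softmax_rows(X @ W)
        grad = X.T @ (probs - Y) / n  # gradient of the mean cross entropy w.r.t. W
        W -= learning_rate * grad
    return W

# Hypothetical toy data: a constant feature plus one value, two classes
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
Y_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

loss_before = mean_cross_entropy(np.zeros((2, 2)), X_toy, Y_toy)
W_fit = gradient_descent(X_toy, Y_toy)
loss_after = mean_cross_entropy(W_fit, X_toy, Y_toy)
predicted = np.argmax(X_toy @ W_fit, axis=1)
```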

What we have built here is multinomial (softmax) logistic regression, which is equivalent to a single-layer neural network and is closely related to the **Perceptron**, a basic building block of many neural networks. Let's see how well it did.

In [None]:
from sklearn.metrics import confusion_matrix
def predict(df, weights):
    # Score every class for each row of df, then pick the class with the highest score
    return (df @ weights.reshape(7,7)).idxmax(axis=1)

y_pred = predict(df_classification, m['x'])

color = confusion_matrix(df_target,y_pred, normalize='true')
x,y = np.meshgrid(range(0,7),range(0,7))
source = pd.DataFrame({
    'x':x.ravel(),
    'y':y.ravel(),
    'z':color.ravel()
})
alt.Chart(source).mark_rect().encode(
    x=alt.X('x:O', title="Predicted"),
    y=alt.Y('y:O', title="Target"),
    color='z:Q'
)

This is called a confusion matrix, and it tells us, for each actual class, what proportion of its Pokemon our model assigned to each predicted class. Ideally, we would have a dark diagonal, but we can see that our model predicts a lot of Pokemon to be in the PU format. There are probably several reasons for this. Most likely, there are more Pokemon in the PU format, and their stats closely resemble those found in UU, RU, and NU, so we need to look at more than just the numerical stats of each Pokemon, such as its types, movesets, and abilities.
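
To show what the row-normalized matrix contains, here is a minimal hand-rolled version on hypothetical labels. `confusion_matrix_manual` is an illustrative helper; it follows the same convention as the plot above, with rows for the actual class and columns for the predicted class.

```python
import numpy as np

def confusion_matrix_manual(y_true, y_pred, n_classes):
    """counts[i, j] = number of observations with true class i predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    return counts

# Hypothetical labels for three classes; the model over-predicts class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 2, 1, 2, 2, 2]

counts = confusion_matrix_manual(y_true, y_pred, n_classes=3)

# Row-normalize so each row shows how one true class was spread over the predictions
row_normalized = counts / counts.sum(axis=1, keepdims=True)
```

In this toy example the last column is heavy in every row, which is the same pattern as a model that assigns too many Pokemon to one format.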

To save as a PDF, click `File` -> `Download as` -> `PDF`.