We'll be playing around a little bit more with the *Super Smash Bros. Melee* dataset.  Last time, I promise.  The dataset has been altered to have a larger sample size.  This data generation was intentionally done to lean into logistic regression.

We want to predict the outcome variable `won`; this column takes the values `1` or `0`.  A `1` indicates that the player in the row won the set; a `0` indicates that the player lost.

Review questions:
* What are the two major classes of machine learning problems? What type is this one?
* The correct answer to the first question has two major subclasses of machine learning problems. What are they? Which of the two would you put our SSBM task into?

In [1]:
%reload_ext nb_black

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

from mlxtend.plotting import plot_decision_regions
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


data_url = "https://raw.githubusercontent.com/AdamSpannbauer/twitch_chat/master/data/slippi_data/generated_ssbm.csv"
ssbm = pd.read_csv(data_url).drop(columns=["index"])
ssbm.head()

Let's say we've already done our EDA and want to get straight into modeling.  How might we start down this road?

A good place to start could be to split our data into its `X` and `y` components and then do a train test split.

* Our target variable is `won` so we'll call it `y`
* The rest of the features (except for the `gamerTag`) we'll call `X`
* Use the seed 1969 for the random_state in the train test split
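
The steps in the list above could be sketched roughly as follows. This is only an illustration: `split_features_target` is a hypothetical helper name, the tiny `demo` frame is made-up data standing in for `ssbm`, and `test_size` is left at scikit-learn's default because the notebook doesn't specify one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target="won", drop_cols=("gamerTag",), seed=1969):
    """Hypothetical helper: separate X and y, then do a train test split."""
    y = df[target]
    # Everything except the target and the dropped columns becomes X
    X = df.drop(columns=[target, *drop_cols])
    # test_size is not given in the notebook, so the default (25%) is used
    return train_test_split(X, y, random_state=seed)


# Made-up stand-in for `ssbm` (these values are NOT real data)
demo = pd.DataFrame(
    {
        "gamerTag": list("abcdefgh"),
        "numKillingPunishes": [3, 12, 5, 14, 4, 11, 6, 13],
        "openingsPerKill": [9.0, 4.0, 8.5, 3.5, 10.0, 4.5, 7.0, 3.0],
        "won": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

X_tr, X_te, y_tr, y_te = split_features_target(demo)
print(X_tr.shape, X_te.shape)
```

With the real data, the same call would presumably take `ssbm` in place of `demo`.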

Let's say we only want to move forward with the 2 best predictors of our output variable `won`.  How might we do that?

We could use `SelectKBest` from `sklearn.feature_selection`.  We'll pass it the `f_classif` score function, which runs an ANOVA F-test for each feature against the target.  We do this because our features are continuous and our target is categorical.

In [None]:
# Storing columns since we're going to overwrite
# X with a numpy array (which will delete its column names)
cols = X_train.columns

# Perform ANOVAs for each of our features and outcome
selector = SelectKBest(f_classif, k=2)
X_train = selector.fit_transform(X_train, y_train)

# We don't have to transform this back into a dataframe
# this is just being done for better display
selected_cols = cols[selector.get_support()]
X_train = pd.DataFrame(X_train, columns=selected_cols, index=y_train.index)
X_train.head()
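
The cell above only transforms the training data, but later cells use a two-column `X_test`, so the test set presumably needs the same selection applied. Below is a minimal sketch of that idea on synthetic data; the names `X_demo`, `X_tr_sel`, and `X_te_sel` are assumptions for illustration, not the notebook's variables. The selector is fit on the training split only and then reused, so the test data never influences which features are kept.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the SSBM dataset)
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=1969
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y_demo = pd.Series(y_arr, name="won")
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1969)

# Fit the selector on the training split only
selector = SelectKBest(f_classif, k=2)
selector.fit(X_tr, y_tr)
kept = X_tr.columns[selector.get_support()]

# Apply the same fitted selector to both splits
X_tr_sel = pd.DataFrame(selector.transform(X_tr), columns=kept, index=X_tr.index)
X_te_sel = pd.DataFrame(selector.transform(X_te), columns=kept, index=X_te.index)
print(list(kept))
```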

Let's visualize our remaining 2 features with our target variable.  How do we want to do this?

Maybe boxplots?

Maybe a scatterplot?

Now let's build a `LogisticRegression` model and score it on our train and test data.

But first! Let's look at what logistic regression is doing.  If we were to just use one of our columns to predict who won, we might look at a scatter plot of that column with the target variable.

In doing so, we see a pattern.  In general, the higher the `numKillingPunishes`, the more likely a player is to win.  But how could we draw a line to predict the probability of someone winning based on this info?

The trick is that the line doesn't have to be straight.

In [None]:
# Recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x="numKillingPunishes", y="won", data=ssbm)
plt.show()

We'll talk about the code below in a couple of cells.  For now, just know we're using a logistic regression model to draw a fitted curve on the scatter plot from before.

In [None]:
# Nothing to see here.. scroll down to the plot
model = LogisticRegression()
model.fit(X_train[["numKillingPunishes"]], y_train)

pred_df = model.predict_proba(X_train[["numKillingPunishes"]])
pred_df = pd.DataFrame(pred_df, columns=["prob_lose", "prob_win"])
pred_df["numKillingPunishes"] = X_train["numKillingPunishes"].reset_index(drop=True)
pred_df = pred_df.sort_values("numKillingPunishes")

# Recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x="numKillingPunishes", y="won", data=ssbm, label="Actual")
plt.plot(pred_df["numKillingPunishes"], pred_df["prob_win"], c="orange", label="Fit")
plt.axvline(11.42896875, c="red", ls="--", alpha=0.5, label="Decision\nBoundary")
plt.axhline(0.5, c="black", alpha=0.1, label="50%")
plt.legend(loc="upper left")
plt.show()
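
The dashed line in the plot above is drawn at a hard-coded x value. For a one-feature logistic regression, the point where the fitted probability crosses 50% is where the log odds equal zero, which works out to `-intercept / coef`. Here is a minimal sketch on made-up data (the generating numbers `0.5` and `-5` are arbitrary assumptions), showing that the model's predicted probability at that point is 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data: the chance of a 1 rises with x
rng = np.random.default_rng(1969)
x = rng.uniform(0, 20, size=300)
true_prob = 1 / (1 + np.exp(-(0.5 * x - 5)))
y = (rng.uniform(size=300) < true_prob).astype(int)

model_1d = LogisticRegression()
model_1d.fit(x.reshape(-1, 1), y)

# log(odds) = b0 + b1 * x is zero (probability 0.5) at x = -b0 / b1
b0 = model_1d.intercept_[0]
b1 = model_1d.coef_[0, 0]
boundary = -b0 / b1

prob_at_boundary = model_1d.predict_proba([[boundary]])[0, 1]
print(f"boundary = {boundary:.3f}, P(class 1) there = {prob_at_boundary:.3f}")
```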

But we want to predict with more than just one variable! What does that look like?

In [None]:
px.scatter_3d(ssbm, "numKillingPunishes", "openingsPerKill", "won")

We'll talk about the code below in a couple of cells.  For now, just know we're using a logistic regression model to add its fitted probabilities to the 3D scatter plot from before.

In [None]:
# Nothing to see here.. scroll down to the plot
model = LogisticRegression()
model.fit(X_train, y_train)

pred_df = model.predict_proba(X_train)
pred_df = pd.DataFrame(pred_df[:, 1], columns=["prob_win"])
pred_df["numKillingPunishes"] = X_train["numKillingPunishes"].reset_index(drop=True)
pred_df["openingsPerKill"] = X_train["openingsPerKill"].reset_index(drop=True)
pred_df["won"] = y_train.reset_index(drop=True)
pred_df = pd.melt(
    pred_df, id_vars=["numKillingPunishes", "openingsPerKill"], var_name="won"
)
pred_df.loc[pred_df["won"] == "prob_win", "won"] = "Fit"
pred_df.loc[pred_df["won"] == "won", "won"] = "Actual"

px.scatter_3d(pred_df, "numKillingPunishes", "openingsPerKill", "value", color="won")

Alright, now to talk about actually fitting the model.

* Define a `LogisticRegression` model and `fit` it to the training data

In [None]:
model = ____
model.____

* `score` the model on the training data

* `score` the model on the testing data

Let's make some predictions, and see how our model is making mistakes.

In [None]:
y_pred = ____

In [None]:
confusion_mat = ____
confusion_mat

Convert the confusion matrix to a dataframe
* Use `['actual_0', 'actual_1']` for the `index`
* Use `['predicted_0', 'predicted_1']` for the `columns`
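
To show what the labeled version looks like, here is a small sketch using hand-made labels rather than the notebook's `y_test` and `y_pred` (the `_demo` names and values are made up). In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, which is why the index and column labels line up the way the bullets above describe.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
cm_df = pd.DataFrame(
    cm,
    index=["actual_0", "actual_1"],
    columns=["predicted_0", "predicted_1"],
)
print(cm_df)
```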

To visualize our mistakes we might put our data back into a dataframe.  

In [None]:
diagnostic_df = pd.DataFrame(X_test, columns=X_train.columns)
diagnostic_df['won'] = y_test.values
diagnostic_df['won_pred'] = y_pred

* Add a column indicating `True` or `False` when we made an error in prediction.

In [None]:
diagnostic_df['error'] = ____
diagnostic_df.head()

We can then use `seaborn` like before, but this time let's somehow display whether each observation was an error.

If we wanted to be more formal about seeing the decision boundary we could plot using `plot_decision_regions` from `mlxtend`.

In [None]:
# This wants an array that's not the float type
labels = y_test.values.astype(int)

plot_decision_regions(X_test, labels, clf=model)
plt.show()

So what is actually happening when making predictions?

In [None]:
coef_1 = model.coef_[0, 0]
coef_2 = model.coef_[0, 1]
intercept = model.intercept_[0]

print(
    f"log(odds) = {intercept:.2f} + {coef_1:.2f}*numKillingPunishes + {coef_2:.2f}*openingsPerKill"
)

Above shows how we can write the formula that our logistic regression model learned from our data.

But we need to get some intuition of what the `log(odds)` means.  Let's see how we could make a prediction 'by hand' with a single observation.

First we'll subset out a single observation and call it `obs`.

In [None]:
obs = ssbm.loc[[0], ['numKillingPunishes', 'openingsPerKill']]
obs

Now we can fill out the right hand side of the equation and do the math to calculate the value of this `log(odds)` thing.

In [None]:
X1 = ____
X2 = ____

log_odds = intercept + coef_1 * X1 + coef_2 * X2
# Some manipulation to get to just a number from log_odds (which is a pandas.Series)
log_odds = log_odds.values[0]
log_odds

Like it sounds, log(odds) is the natural log of the odds.  So, to go from log(odds) to odds we compute $e^{\log(odds)}$.  This works because raising $e$ to a power is the inverse of taking the natural log (like how multiplying by x is the inverse of dividing by x).

In [None]:
odds = np.exp(log_odds)
odds

Now we have the odds as 0.0269.  This number doesn't mean much to me; I'd much prefer probability instead of odds. If you're familiar with odds then you might be comfortable with this number.

If we did want to convert to probability we can do the following: $\frac{odds}{1 + odds}$.  For example, saying "5 to 1 odds" ($\frac{5}{1}$) is the same as saying a probability of $\frac{5}{6}$ or $0.833$.

In [None]:
prob = odds / (1 + odds)

# Formula can be re-written to match the slides
# prob = 1 / (1 + np.exp(-log_odds))

print(f"Probability of losing: {1 - prob:.4f}")
print(f"Probability of winning: {prob:.4f}")
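
The commented-out line in the cell above is the same conversion written a different way. As a short derivation, writing $z$ for the log odds (the symbol $z$ is just shorthand introduced here):

$$
\text{odds} = e^{z}, \qquad z = \log(\text{odds})
$$

$$
p = \frac{\text{odds}}{1 + \text{odds}} = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
$$

The last step divides the numerator and denominator by $e^{z}$.  Plugging in the odds from before, $\frac{0.0269}{1 + 0.0269} \approx 0.026$.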

Our results show that our observation only has a 2.6% chance of winning.  If we wanted to compare this result to what our model would predict, we could use its `predict_proba` method.  This method outputs a probability for each class.
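
Here is a minimal sketch of that comparison on synthetic data (the `_demo` names and the `clf` model are stand-ins, not the notebook's objects). It computes the probability by hand from the intercept and coefficients and checks it against the second column of `predict_proba`, which holds the probability of class `1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data standing in for the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=1969
)
clf = LogisticRegression().fit(X_demo, y_demo)

# One observation, kept 2D so predict_proba accepts it
obs_demo = X_demo[[0]]

# By hand: log(odds) = intercept + coef_1 * X1 + coef_2 * X2
log_odds_demo = clf.intercept_[0] + clf.coef_[0] @ obs_demo[0]
prob_by_hand = 1 / (1 + np.exp(-log_odds_demo))

# From the model: column 1 is the probability of class 1
prob_sklearn = clf.predict_proba(obs_demo)[0, 1]
print(prob_by_hand, prob_sklearn)
```

In the notebook, the equivalent call would presumably be `model.predict_proba(obs)`.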

Using our coefficients and some math, we were able to mimic our `sklearn` model.  But what do these coefficients actually mean?

The coefficient for `numKillingPunishes` is 0.7239 which is interpreted as the expected change in log odds for a one-unit increase in `numKillingPunishes`.  For example, if we went from 5 `numKillingPunishes` to 6 `numKillingPunishes` we would see our log odds increase by 0.7239 (aka our coefficient).  This still isn't the most interpretable thing.

We can use the coefficient to calculate the *odds ratio* and this will lead us to a little more interpretable result.  To go from our coefficient to this odds ratio we will raise $e$ to it just like we did before.  We can then compare this ratio to 1 to see how it affects our odds of winning.

In [None]:
np.exp(coef_1) - 1

The odds ratio itself is $e^{0.7239} \approx 2.062$.  Comparing it to 1, as the cell above does by subtracting, gives 2.062 - 1 = 1.062.  This means we expect to see about a 106% increase in the odds of winning for every one-unit increase in `numKillingPunishes`; in other words, the odds roughly double.

Let's also interpret the coefficient for `openingsPerKill`, which was -1.31.

In [None]:
np.exp(coef_2) - 1

This means we expect to see a 73% decrease in the odds of winning for every unit increase of `openingsPerKill`.  That is, if we went from 5 `openingsPerKill` to 6 `openingsPerKill` we would decrease our odds of winning by 73%.
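
If the odds ratio interpretation feels abstract, it can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `odds_of_class_1`; it bumps one feature by exactly one unit, holds the other fixed, and confirms that the odds get multiplied by $e$ raised to that feature's coefficient.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (not the SSBM dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=1969
)
clf = LogisticRegression().fit(X_demo, y_demo)


def odds_of_class_1(model, row):
    """Hypothetical helper: convert the predicted probability to odds."""
    p = model.predict_proba(row.reshape(1, -1))[0, 1]
    return p / (1 - p)


row = X_demo[0].copy()
bumped = row.copy()
bumped[0] += 1  # one-unit increase in the first feature only

ratio_observed = odds_of_class_1(clf, bumped) / odds_of_class_1(clf, row)
ratio_from_coef = np.exp(clf.coef_[0, 0])
print(ratio_observed, ratio_from_coef)
```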