# DS 2001 - Business Practicum (Spring 2022)
# Week 12: Prediction when outcome is categorical (i.e., classification)
Hello! We are now in week 12. Last week we learned to make predictions when the outcome variable is continuous.

This week we'll learn how to make predictions when the outcome variable is categorical -- specifically, when the outcome is binary (yes/no, 1/0, will buy/will not buy).

## Overview of the steps we need to take (most of what we learned last week still applies here!)

1. you need to read in the data and required packages
2. you need to explore the data (e.g., plot distribution of a given variable, compute correlations between any pair of variables)
3. you need to process the data (e.g., remove outliers, interpolate missing values, generate new columns/features)
4. you need to split the data into training and test sets
    <br/><br/>
    When predicting an outcome variable that is categorical, we are looking at a **classification** problem. Basically, we want to train an algorithm to help us *classify* future observations. One simple yet powerful classification model is called **logistic regression**.
    <br/><br/>
5. you need to train a classifier, such as a logistic regression, using the **training data**.
6. you need to evaluate the classifier performance using the **test data**.

Once the classification performance is satisfactory, we are done! Keep in mind that people usually experiment with different classification algorithms and pick the algorithm that yields the best performance. Basically, all you need to do is repeat steps 5-6 using different algorithms and compare their prediction performance.
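
As a rough illustration of repeating steps 5-6 with more than one algorithm, here is a minimal sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the advertising data), and the decision tree is just one arbitrary second candidate; the names `candidates`, `scores`, and `best_name` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the advertising data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# Candidate classifiers to compare (hypothetical choices for illustration)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)        # step 5: train on the training data
    yd_pred = model.predict(Xd_test)     # step 6: evaluate on the test data
    scores[name] = {
        "accuracy": accuracy_score(yd_test, yd_pred),
        "f1": f1_score(yd_test, yd_pred),
    }
    print(f"{name}: accuracy={scores[name]['accuracy']:.4f}, f1={scores[name]['f1']:.4f}")

# Pick the candidate with the highest F1 score on the test set
best_name = max(scores, key=lambda n: scores[n]["f1"])
print(f"Best by F1: {best_name}")
```

Which metric to rank by (accuracy, F1, or something else) depends on the problem; F1 is used here only as an example.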

**Make sure to read the slide deck for this week to get information on ways to evaluate prediction performance.**

Below, let's take a look at an online advertising data set. **We want to predict whether a given user will click the ad (yes/no).** Since the outcome is binary, this is a **classification problem**, and we'll build a **logistic regression model**.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


%matplotlib inline

In [None]:
### below, import packages for logistic regression and performance evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split


Make sure to download the data file (advertising.csv) and place it in the same folder as this notebook.

In [None]:
data = pd.read_csv("advertising.csv")


In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
plt.figure(figsize=(10, 8))
data['Age'].hist(bins=20)
plt.xlabel('Age')
plt.show()

In [None]:
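### correlations are only defined for numeric columns, so select those first
data.select_dtypes(include='number').corr()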

Below, we will quickly check for missing values and either interpolate or drop observations if needed. However, this data set has already been cleaned for us, so there is nothing to do here:

In [None]:
data.isnull().sum()  ### no missing values!
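
For a data set that does have missing values, the two options mentioned above could look roughly like this. This is a minimal sketch on a tiny made-up DataFrame (the values are hypothetical; only the column names are borrowed from the advertising data), and linear interpolation only makes sense when the row order is meaningful.

```python
import numpy as np
import pandas as pd

# Tiny made-up example with two missing values (hypothetical data)
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 40.0],
    "Area Income": [50000.0, 52000.0, np.nan, 58000.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each gap by linear interpolation between its neighbors
filled = toy.interpolate()

print(dropped)
print(filled)
```

Dropping rows is the simplest choice but throws away data; interpolating keeps every row at the cost of inventing plausible values.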

## Feature Engineering
In some cases you might want to generate additional columns, a process called **feature engineering**, if you think those extra columns can help improve the prediction performance. (You can always do this, regardless of whether the outcome variable is continuous or categorical!)

Say, for example, we think **age^2** can help improve our prediction. (There are a million other variables that we could potentially generate. Remember that the goal is to make predictions, so the interpretation of why a given variable is included is not that important. We just care whether the prediction is better with the new variables.)

It's very easy to generate the new variable and add it to our pandas dataframe. Let's create a new column called Age2 to store the square of age:


In [None]:
data['Age2'] = data['Age']*data['Age']

In [None]:
data[['Age','Age2']]

We are done! We just need to make sure that we include both columns as predictors when we create the model later on.
We will also add a few other variables that we think are relevant for prediction to the list of predictors, and ignore the rest.


In [None]:
y = data['Clicked on Ad']
X = data[['Daily Time Spent on Site', 'Age', 'Age2', 'Area Income', 'Daily Internet Usage', 'Male']]

Now we split the data into training and test sets. With `test_size=0.4`, 40% of the observations are held out for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

Training a logistic regression model is similar to training a linear regression model. You just have to create a logistic regression object, fit it on the training data, and then use it to make predictions on the test data.

In [None]:
log_reg = LogisticRegression() ## create a logistic regression object
log_reg.fit(X_train, y_train)  ## fit (train) the logistic regression model with training data

y_pred = log_reg.predict(X_test) ## make prediction based on the test set data's X variables.


## Model evaluation
Keep in mind that to evaluate the classification performance **we need to compare the predicted results (y_pred) with the actual results (y_test)**
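
To make the metrics less of a black box, the sketch below computes them by hand from a confusion matrix and checks the results against scikit-learn. The labels `y_true_toy` and `y_pred_toy` are made up for illustration and have nothing to do with the advertising data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up actual and predicted labels (hypothetical, 10 observations)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision_manual = tp / (tp + fp)                   # of predicted 1s, how many are truly 1
recall_manual = tp / (tp + fn)                      # of actual 1s, how many we caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy_manual:.4f}, precision={precision_manual:.4f}, "
      f"recall={recall_manual:.4f}, f1={f1_manual:.4f}")

# The hand-computed values should agree with scikit-learn's functions
print(np.isclose(accuracy_manual, accuracy_score(y_true_toy, y_pred_toy)))
print(np.isclose(precision_manual, precision_score(y_true_toy, y_pred_toy)))
```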

In [None]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

## Below we use something called f-string to print results. It's just another way to print strings that allows us to
## specify decimal places.
##
## For more information: https://realpython.com/python-f-strings/

print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")
print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

The analyses done in the cell above show that our logistic regression performs really well. Out of 400 observations in the test set, 188 + 191 = 379 were classified correctly (an accuracy of 94.75%). The other metrics (precision, recall, and F1 score) also look good. Overall, this is a good model!

As a comparison, let's try running the model without the Age^2 term.

In [None]:
X_train2 = X_train.drop(["Age2"], axis=1)

In [None]:
X_test2 = X_test.drop(["Age2"], axis=1)

In [None]:
log_reg2 = LogisticRegression() ## create a logistic regression object
log_reg2.fit(X_train2, y_train)  ## fit (train) the logistic regression model with training data

y_pred2 = log_reg2.predict(X_test2) ## make prediction based on the test set data's X variables.


In [None]:
precision2 = precision_score(y_test, y_pred2)
recall2 = recall_score(y_test, y_pred2)
f12 = f1_score(y_test, y_pred2)

## Print the same metrics as before, this time for the model without Age2.

print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred2)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred2):.4f}")
print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision2:.4f}\n\tRecall: {recall2:.4f}\n\tF1_Score: {f12:.4f}")

We can see that the model without the Age^2 term performs worse than the model with it. Therefore, if we were to pick our final model between the two, **we would pick the model with Age^2 as our best prediction model.**

In [None]:
### If we want to know each variable's coefficient, run this cell

list(zip(X_train.columns, log_reg.coef_[0]))

In [None]:
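### Predicted class probabilities for each training observation
### (one column per class, in the order given by log_reg.classes_)
log_reg.predict_proba(X_train)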
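
To see how these probabilities relate to the 0/1 predictions from `predict`, here is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and `clf` are made-up names, not the advertising model). For a binary logistic regression, `predict` effectively labels an observation as 1 when its class-1 probability is above 0.5, and you can apply a stricter cutoff yourself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the advertising data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row per observation, one column per class; each row sums to 1
proba = clf.predict_proba(X_demo)
p_class1 = proba[:, 1]  # probability of the class labeled 1

# Thresholding at 0.5 reproduces what predict() returns
labels_at_half = (p_class1 > 0.5).astype(int)
agreement = np.mean(labels_at_half == clf.predict(X_demo))
print(f"Agreement with predict(): {agreement:.4f}")

# A stricter cutoff labels fewer observations as 1
labels_strict = (p_class1 > 0.8).astype(int)
print(f"Predicted 1s at 0.5: {labels_at_half.sum()}, at 0.8: {labels_strict.sum()}")
```

Raising the cutoff generally trades recall for precision, which can be useful when false positives are costly.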