# Introduction to Business Analytics

## Lecture 3 - Classification

It's time to learn about Classification models. This notebook walks you through the essential concepts and gives examples of how to run several Classification algorithms.


As always, let's do some imports. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

-----

## Part 1 - Logistic Regression

_The first model of the lecture is Logistic Regression. It is important that you understand its mechanics well, because they are applicable to other types of Classification models._

We want you to understand the intuition behind the logit function, so let's work on the basics. Imagine you want to make a function that determines the probability of some event (let's call it x). The "event" x could simply mean that "the input data corresponds to class 1" (in which case, ~x means "the input data corresponds to class 0"). 

So, to plot such a function, let's just create a vector with values between 0 and 1

In [None]:
px=np.arange(0.0001, 1, 0.001)

So, now we determine the odds ratio function

In [None]:
y=px/(1-px)

...and plot it

In [None]:
plt.plot(y,px)
plt.xlim([0,50])
plt.xlabel("odds ratio")
plt.ylabel("p(x)");

As discussed in class, this functional form is far from ideal. What happens if we apply the log?

In [None]:
y=np.log(px/(1-px))

It will be clearer in a plot

In [None]:
plt.plot(y,px)
#plt.xlim([0,50])
plt.xlabel("logit")
plt.ylabel("p(x)");

OK, now it seems much more balanced, doesn't it? 

Understanding this function is important:
- What is the probability of x (p(x)) when the log odds ratio is 0? 

- What is the odds ratio itself? 

- What about the extremes (when is it 1 and when is it 0?)?
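To explore these questions numerically, here is a minimal sketch of the logit and its inverse (the logistic, or sigmoid, function). The helper names `logit` and `inverse_logit` are our own and are not used elsewhere in the notebook. Try evaluating them at a few probabilities and check the results against your answers.

```python
import numpy as np

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return np.log(p / (1 - p))

def inverse_logit(z):
    """Map a log-odds value z back to a probability (the sigmoid function)."""
    return 1 / (1 + np.exp(-z))

# A few probabilities, their log odds, and the round trip back.
p = np.array([0.1, 0.5, 0.9])
z = logit(p)
p_back = inverse_logit(z)

print(z)
print(p_back)
```

Notice that the log odds are symmetric around p = 0.5, and that `inverse_logit` recovers the original probabilities.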

------

Let's start playing with data. First we need to load:

In [None]:
f=pd.read_csv("NYC_taxis_weather_2016_with_dummies.csv")

Take a look at the dataframe yourself

Yes, the index is no longer the time, let's put it back
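If you are unsure how to do this, the following sketch shows one possible way on a toy DataFrame. The column name "datetime" is an assumption made for illustration only; check `f.columns` to find the actual name of the time column in the CSV.

```python
import pandas as pd

# Toy stand-in for the real data; "datetime" is a hypothetical column name.
toy = pd.DataFrame({
    "datetime": ["2016-01-01 00:00:00", "2016-01-01 00:15:00", "2016-01-01 00:30:00"],
    "pickups1": [12, 9, 15],
})

# Parse the strings into timestamps, then promote the column to the index.
toy["datetime"] = pd.to_datetime(toy["datetime"])
toy = toy.set_index("datetime")

print(toy.index)
```

An alternative is to pass `index_col` and `parse_dates=True` to `pd.read_csv()` when loading, so that the index is set in one step.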

So, imagine that you are an NYC taxi fleet manager. Every 15 minutes, your goal is to make sure your company has enough cars for very big spikes in demand across the city (like above the 90th percentile). If you detect a very big spike in a specific area, you coordinate with the cars in the neighbourhood to go there. 

For this exercise, let's assume that area 1 is the only truly important one for you. Doing this manually would be very tiring (if at all possible), so you rely on your Data Science skills to get a model that does it for you:

**At each 15-minute time interval, predict whether the next time interval will have a demand spike ("stress").**

Let's first find the actual value above which you call it a "stress" situation:

In [None]:
stress_threshold=np.percentile(f['pickups1'], 90)

How many pickup observations lie above the 90th percentile? And above other percentiles (e.g. the 50th)?  

Now, let's create a new column (or variable) that is True when it is a "stress" scenario, and False when it is not
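As a hedged sketch of how this can look, the example below builds the flag on synthetic data. The DataFrame `toy` and its Poisson-distributed counts are invented for illustration; on the real data you would apply the same comparison to `f['pickups1']` and `stress_threshold`.

```python
import numpy as np
import pandas as pd

# Synthetic pickup counts standing in for the real 'pickups1' column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"pickups1": rng.poisson(lam=20, size=1000)})

# Value at the 90th percentile, then a boolean flag for rows strictly above it.
threshold = np.percentile(toy["pickups1"], 90)
toy["stress"] = toy["pickups1"] > threshold

# Summing a boolean column counts the True values.
n_stress = int(toy["stress"].sum())
share = toy["stress"].mean()
print(threshold, n_stress, share)
```

Because many observations can share the same count, the share of "stress" rows may be somewhat below 10% when the comparison is strict.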

Do you want to inspect this new data that you created? (e.g. use describe(), hist(), etc.)

Let's now create our model. The first thing to do is to import the sklearn package that has Logistic Regression, and then just create the respective object.

In [None]:
from sklearn.linear_model import LogisticRegression
LogReg=LogisticRegression()

We have a model, but it is "empty", so let's create the training and test sets now. 

We will create the model with 2/3 of the data (training set), and then 1/3 of the data is kept aside for later validation (test set).

In [None]:
split=int(len(f)*2.0/3)
training=f[:split]
test=f[split:]

We need to create the x and y for the training and test set now. Notice that the y is the "target" variable, i.e. our "stress" column, and the x comes from **almost** all other columns. Let's check all columns

In [None]:
training.columns

Ok, we need to create the x using EVERYTHING but the 'stress' variable, but also we need to remove "pickups1" (**why?**)

Notice also that we have several dummy variables relating to the categorical 'time of day' variable. Recall that we discussed dummy variables as a way of including categorical variables as predictors in our model. In this case, the time of day variable has seven possible values, and we created seven dummy variables using the pd.get_dummies() function. In logistic regression, when including dummy variables for a categorical variable that takes k values, it is recommended to include only k-1 of the dummies and not all k of them. The reason is that these k dummy variables always sum to 1, leading to multicollinearity (linear dependence) with the intercept, which makes it difficult to interpret the coefficients. Thus, we will arbitrarily exclude the dummy 'time_of_day_night'.

It is also important to note that the coefficients of the remaining dummy variables are interpreted as measuring an 'effect' relative to the baseline category that you left out (in this case, the remaining six dummies measure effects relative to 'time_of_day_night'). FYI, in Pandas, we can automatically create k-1 dummies instead of k dummies using the option 'drop_first=True' in the pd.get_dummies() function.
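The small example below illustrates the difference between k and k-1 dummies. The toy column and its three categories are made up for illustration. Note that `drop_first=True` drops the first category in sorted order, which is not necessarily the baseline you would choose by hand (here it drops 'evening', not 'night').

```python
import pandas as pd

# Toy categorical column with k = 3 distinct values.
toy = pd.DataFrame({"time_of_day": ["night", "morning", "evening", "night", "morning"]})

# All k dummies: each row has exactly one dummy equal to 1.
all_dummies = pd.get_dummies(toy, columns=["time_of_day"])

# k-1 dummies: the first category (in sorted order) becomes the baseline.
k_minus_1 = pd.get_dummies(toy, columns=["time_of_day"], drop_first=True)

print(list(all_dummies.columns))
print(list(k_minus_1.columns))
print(all_dummies.sum(axis=1).tolist())
```

If you want a specific baseline such as 'time_of_day_night', one option is to create all k dummies and leave that column out when selecting the predictors.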

The other methods for classification that we will cover (e.g., SVM, decision trees) are less sensitive to multicollinearity and there you can include all the seven dummies.

In [None]:
# Predictors: lagged pickups, weather, weekend flag, and six of the seven
# time-of-day dummies ('time_of_day_night' is left out as the baseline).
feature_cols=['pickups17_lag1', 'pickups17_lag2', 'pickups1_lag1',
       'pickups1_lag2', 'pickups21_lag1', 'pickups21_lag2', 'pickups28_lag1',
       'pickups28_lag2',  'temp', 'prcp','fog', 'rain_drizzle', 'is_weekend', 'time_of_day_afternoon',
       'time_of_day_afternoon rush', 'time_of_day_evening',
       'time_of_day_lunch time', 'time_of_day_morning',
       'time_of_day_morning rush']
x_train=training[feature_cols]
x_test=test[feature_cols]

To make sure you understand things, don't forget to ALWAYS play with the code here... For example, what's the content of the new DataFrames x_train and x_test?

...and now the ys are trivial

In [None]:
y_train=training['stress']

y_test=test['stress']

To make sure you understood, do you want to see what's inside these two vectors?

Ok, we have our Xs and Ys! Ready to go... it's totally trivial with sklearn:

In [None]:
LogReg.fit(x_train, y_train)

Congrats! You trained your first Logistic Regression model. What's its accuracy (**on the test set**)?

In [None]:
LogReg.score(x_test,y_test)

Do you want to try it on the training set? What do you expect?

If the values tend to be similar, then congrats, it's very likely that your model is not overfitting! :-)



Anyway, accuracy is not everything in a classifier. Another VERY interesting concept is the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

To use it properly, let's first obtain the predictions of our model using the test set:

In [None]:
ypred=LogReg.predict(x_test)

Now, we can compare the predictions with the observations, using the confusion matrix

In [None]:
cm = confusion_matrix(y_test, ypred)
cm

Important: pay attention to the rows and columns in the confusion matrix. The format that Sklearn uses is different from what we had in the slides - the rows represent actual (true) values and the columns represent predicted values.
You can see the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
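To make the layout concrete, here is a small sketch with hand-made labels (the two arrays are invented for illustration). For a binary problem, flattening the matrix with `ravel()` yields the four counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels: 1 means "stress", 0 means "no stress".
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Row 0 holds (tn, fp) and row 1 holds (fn, tp).
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

If you prefer the layout with predicted values in the rows, you can look at the transpose, `cm.T`.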

Alternatively, we can also plot the confusion matrix just to be sure:

In [None]:
display = ConfusionMatrixDisplay(cm)
display.plot() 

Remember we mentioned that accuracy is not a good measure if the data set is imbalanced. In our problem and data set, is the response variable imbalanced? For imbalanced data, a better measure to use is often the f1 score, which combines precision and recall. We can compute this easily:

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test,ypred)
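To see why accuracy can mislead on imbalanced data, consider the following sketch with invented labels where only 10% of the observations are positive. A naive classifier that never predicts "stress" reaches high accuracy but an f1 score of zero. The second set of predictions is also made up, and is used to compute the f1 score by hand from precision and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# 90 negatives followed by 10 positives (an imbalanced toy target).
y_true = np.array([0] * 90 + [1] * 10)

# Naive classifier: always predicts the majority class.
y_naive = np.zeros_like(y_true)
acc_naive = accuracy_score(y_true, y_naive)
f1_naive = f1_score(y_true, y_naive, zero_division=0)

# Invented model output: 4 false alarms, 6 of the 10 positives caught.
y_model = np.array([1] * 4 + [0] * 86 + [1] * 6 + [0] * 4)
tn, fp, fn, tp = confusion_matrix(y_true, y_model).ravel()

# f1 is the harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(acc_naive, f1_naive)
print(accuracy_score(y_true, y_model), f1_manual, f1_score(y_true, y_model))
```

The two classifiers have similar accuracy, yet their f1 scores are very different.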

One last useful thing about Logistic Regression: it is a parametric model, so its parameters beta (its "coefficients") can actually mean something. Let's take a look at them:

In [None]:
LogReg.coef_

This is a bit confusing. Which coefficient corresponds to which variable? Let's make it more interpretable:

In [None]:
for cname, val in zip(x_train.columns, LogReg.coef_.tolist()[0]):
    print("%s=%.3f"%(cname, val))

What do you think of the results? Take a look at the signs. Do they make sense? Try to play with the stress_threshold above (instead of the 90th percentile, you can try others...)
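One common way to read the coefficients is to exponentiate them: a one-unit increase in a variable multiplies the odds of the positive class by exp(beta), holding the other variables fixed. The sketch below shows this on synthetic data; the feature names `lagged_demand` and `noise` are invented for illustration and are not columns of the taxi data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends positively on 'lagged_demand' only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "lagged_demand": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = (X["lagged_demand"] + 0.5 * rng.normal(size=500)) > 0

model = LogisticRegression().fit(X, y)

# exp(coef) is the multiplicative change in the odds per unit increase.
summary = pd.DataFrame(
    {"coef": model.coef_[0], "odds_multiplier": np.exp(model.coef_[0])},
    index=X.columns,
)
print(summary)
```

Keep in mind that sklearn's LogisticRegression applies L2 regularization by default, which shrinks the coefficients, and that coefficients of variables measured on different scales are not directly comparable.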

------

## Part 2 - Support Vector Machines


For the rest of the notebook, we will include the dummy that we left out earlier - 'time_of_day_night':

In [None]:
# Same predictors as in Part 1, plus the 'time_of_day_night' dummy.
all_feature_cols=['pickups17_lag1', 'pickups17_lag2', 'pickups1_lag1',
       'pickups1_lag2', 'pickups21_lag1', 'pickups21_lag2', 'pickups28_lag1',
       'pickups28_lag2',  'temp', 'prcp','fog', 'rain_drizzle', 'is_weekend', 'time_of_day_afternoon',
       'time_of_day_afternoon rush', 'time_of_day_evening',
       'time_of_day_lunch time', 'time_of_day_morning',
       'time_of_day_morning rush','time_of_day_night']
x_train=training[all_feature_cols]
x_test=test[all_feature_cols]

In Part 1, we did almost everything for you. But now, we'll just help you with the import:

In [None]:
from sklearn.svm import SVC

Create the object

Remember that SVMs rely on computing 'distances' between data points in feature space. Hence, it is critical that your features are standardized. Can you standardize the data first? 
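If you need a hint, here is a minimal sketch of standardization with sklearn's StandardScaler on synthetic data. The arrays `X_train` and `X_test` are invented stand-ins for your own x_train and x_test. The key point is that the scaler is fitted on the training set only and then applied unchanged to the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 0.5], scale=[20.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[50.0, 0.5], scale=[20.0, 0.1], size=(100, 2))

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (no refitting).
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std.mean(axis=0), X_test_std.std(axis=0))
```

The standardized training features have mean 0 and standard deviation 1 by construction; the test features will only be approximately so, which is expected.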

Train the model

Check its accuracy and f1 score

Check its confusion matrix

Is this model better than the Logistic regression model? Look at both the f1 score and the confusion matrix. In this case, do you care more about false positives or false negatives?