In [None]:
from IPython.display import Pretty as disp
hint = 'https://raw.githubusercontent.com/soltaniehha/Business-Analytics/master/docs/hints/'  # path to hints on GitHub

import pandas as pd
import seaborn as sns

### Classification: Predicting discrete labels

We will first take a look at a simple *classification* task, in which you are given a set of labeled points and want to use these to classify some unlabeled points.

Imagine that we have the data shown in this figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-1.png?raw=true" width="600" align="center"/>

Here we have two-dimensional data: that is, we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane.
In addition, we have one of two *class labels* for each point, here represented by the colors of the points.
From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled "blue" or "red."

There are a number of possible models for such a classification task, but here we will use an extremely simple one. We will make the assumption that the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group.
Here the *model* is a quantitative version of the statement "a straight line separates the classes", while the *model parameters* are the particular numbers describing the location and orientation of that line for our data.
The optimal values for these model parameters are learned from the data (this is the "learning" in machine learning), which is often called *training the model*.

The following figure shows a visual representation of what the trained model looks like for this data:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-2.png?raw=true" width="600" align="center"/>

Now that this model has been trained, it can be generalized to new, unlabeled data.
In other words, we can take a new set of data, draw this model line through it, and assign labels to the new points based on this model.
This stage is usually called *prediction*. See the following figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-3.png?raw=true" width="900" align="center"/>

This is the basic idea of a classification task in machine learning, where "classification" indicates that the data has discrete class labels.
At first glance this may look fairly trivial: it would be relatively easy to simply look at this data and draw such a discriminatory line to accomplish this classification.
A benefit of the machine learning approach, however, is that it can generalize to much larger datasets in many more dimensions.

For example, this is similar to the task of automated spam detection for email; in this case, we might use the following features and labels:

- *feature 1*, *feature 2*, etc. $\to$ normalized counts of important words or phrases ("Serious cash", "Nigerian prince", etc.)
- *label* $\to$ "spam" or "not spam"

For the training set, these labels might be determined by individual inspection of a small representative sample of emails; for the remaining emails, the label would be determined using the model.
For a suitably trained classification algorithm with enough well-constructed features (typically thousands or millions of words or phrases), this type of approach can be very effective.
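The train-then-predict workflow described above can be sketched on synthetic data. This is a minimal illustration, not the code behind the figures: the two clusters, the new points, and the choice of scikit-learn's `LogisticRegression` as the linear classifier are all assumptions made here for demonstration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the two colored groups: two well-separated 2D clusters.
X, y = make_blobs(n_samples=200, centers=[(-3, -3), (3, 3)],
                  cluster_std=1.0, random_state=0)

# Training: learn the parameters (orientation and offset) of the separating line.
model = LogisticRegression()
model.fit(X, y)

# Prediction: assign labels to new, unlabeled points (hypothetical coordinates).
X_new = np.array([[-4.0, -2.5], [2.5, 3.5]])
labels_new = model.predict(X_new)

print("line coefficients:", model.coef_[0], "intercept:", model.intercept_[0])
print("predicted labels:", labels_new)
```

The fitted `coef_` and `intercept_` play the role of the *model parameters* mentioned earlier: together they fix where the line sits in the plane.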

**Examples:**

Here are a few of the multitude of ways classification can be used in the real world.

**Predicting credit risk**

A financing company might look at a number of variables before offering a loan to a company or individual. Whether or not to offer the loan is a binary classification problem.

**News classification**

An algorithm might be trained to predict the topic of a news article (sports, politics, business, etc.).

**Classifying human activity**

By collecting data from sensors such as a phone accelerometer or smart watch, you can predict the person’s activity. The output will be one of a finite set of classes (e.g., walking, sleeping, standing, or running).

## Types of Classification

Before we continue, let’s review several different types of classification.

### Binary Classification

The simplest example of classification is binary classification, where there are only two labels you can predict. Examples include fraud analytics, where a given transaction is classified as fraudulent or not, and spam detection, where a given email is classified as spam or not spam.

### Multiclass Classification

Beyond binary classification lies multiclass classification, where one label is chosen from more than two distinct possible labels. A typical example is Facebook predicting the people in a given photo or a meteorologist predicting the weather (rainy, sunny, cloudy, etc.). Note how there is always a finite set of classes to predict; it’s never unbounded. This is also called multinomial classification.

### Multilabel Classification

Finally, there is multilabel classification, where a given input can produce multiple labels. For example, you might want to predict a book’s genre based on the text of the book itself. While this could be multiclass, it’s probably better suited for multilabel because a book may fall into multiple genres. Another example of multilabel classification is identifying the objects that appear in an image. Note that in this example, the number of output predictions is not necessarily fixed, and could vary from image to image.
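One way to see the difference between these types is to look at the shape of the target. The sketch below uses made-up labels and scikit-learn's `MultiLabelBinarizer` (one possible encoding, chosen here for illustration) to turn sets of genres into a 0/1 indicator matrix.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Binary and multiclass targets: exactly one label per sample.
y_binary = ["spam", "not spam", "spam"]
y_multiclass = ["rainy", "sunny", "cloudy"]

# Multilabel target: each sample carries a set of labels (hypothetical book genres).
y_multilabel = [{"fantasy", "romance"}, {"mystery"}, {"fantasy", "mystery", "romance"}]

# Encode the label sets as an indicator matrix: one row per book, one column per genre.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(y_multilabel)

print(mlb.classes_)   # column order: genres sorted alphabetically
print(indicator)      # a row may contain several 1s
```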

## Popular Classification Algorithms



*   Logistic Regression - 1930s
*   Decision Trees - 1960s (ID3 in 1980s)
*   Random Forests - 2001
*   Gradient-boosted Trees - Late 1990s
*   Multilayer Perceptron Classifier - 1950s (Backpropagation in 1980s)

From the early days of logistic regression in the 1930s, we've come a long way to the complex world of deep learning today. This growth in machine learning, starting from basic algorithms, shows how much the field has evolved. Modern AI systems, like deep networks, can trace their roots back to these simple beginnings, reminding us of the value of learning the basics in this rapidly changing field.

Today, let's begin our journey with Logistic Regression.





## Logistic Regression

Logistic regression is one of the most popular methods of classification. It is a linear method: each input (or feature) is multiplied by a weight learned during the training process, and the weighted inputs are then combined to get a probability of belonging to a particular class. These weights are helpful because they are good indicators of feature importance; if a weight is large in magnitude, you can assume that variations in that feature have a significant effect on the outcome (assuming you performed normalization). A weight close to zero means the feature is less likely to be important.

Consider the following example: The *Default* Dataset

<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-the-default-dataset.png?raw=true" width="800" align="center"/>

<div style="text-align: center"> The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status. </div>

*<div style="text-align: right"> From [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, et al </div>*

Considering this data where the response "default" falls into a "Yes" or "No" category, let's find a model that can predict the probability of default by using the "Balance" on the credit card. Rather than modeling this response Y
directly, logistic regression models the probability that Y belongs to a particular category.

The probability of default given balance can be written as:
> **Pr(**default = Yes|balance**)**

The values of Pr(default = Yes|balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as p(balance) > 0.1.
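The effect of the threshold can be sketched with a few made-up probabilities. The values in `p_default` are hypothetical and are not taken from the Default data.

```python
import numpy as np

# Hypothetical estimated probabilities of default, p(balance), for five individuals.
p_default = np.array([0.02, 0.08, 0.15, 0.45, 0.80])

# Standard rule: predict default = Yes when p(balance) > 0.5.
pred_050 = np.where(p_default > 0.5, "Yes", "No")

# More conservative rule: flag anyone with p(balance) > 0.1.
pred_010 = np.where(p_default > 0.1, "Yes", "No")

print(pred_050)  # only the last individual is flagged
print(pred_010)  # the last three individuals are flagged
```

Lowering the threshold flags more individuals as at risk, at the cost of more false alarms.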


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-Classification-lr.png?raw=true" width="800" align="center"/>

<div style="text-align: center"> Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default(No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.
 </div>

*<div style="text-align: right"> From [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, et al </div>*

### The Logistic Model

How should we model the relationship between $p(X) = Pr(Y = 1|X)$ and $X$? (For convenience we are using the generic 0/1 coding for the response). One approach would be using a linear regression model to represent these probabilities:

> $p(X) = \beta_0 + \beta_1 X$

If we use this approach to predict default=Yes using balance, then we obtain the model shown in the left-hand panel of the figure above. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1.

To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Many functions meet this description. In logistic regression, we use the logistic function:

> $p(X) = \dfrac{e ^ {\beta_0 + \beta_1 X }}{1 + e ^ {\beta_0 + \beta_1 X }}$

To fit the model above, we use a method called *maximum likelihood*. The right-hand panel of the figure above illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of $X$, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot.

After a bit of manipulation of the logistic function, we find that:

> $\dfrac{p(X)}{1-p(X)} = e ^ {\beta_0 + \beta_1 X }$
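For readers who want the intermediate steps, one way to carry out this manipulation is sketched here. The shorthand $z = \beta_0 + \beta_1 X$ is introduced only for brevity and is not part of the original notation. First compute the denominator:

> $1 - p(X) = 1 - \dfrac{e^{z}}{1 + e^{z}} = \dfrac{1}{1 + e^{z}}$

Dividing $p(X)$ by this quantity cancels the common factor $1 + e^{z}$:

> $\dfrac{p(X)}{1-p(X)} = \dfrac{e^{z}}{1 + e^{z}} \cdot (1 + e^{z}) = e^{z} = e ^ {\beta_0 + \beta_1 X }$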

The quantity $p(X)/[1-p(X)]$ is called the odds, and can take on any value between $0$ and $\infty$. Values of the odds close to $0$ and $\infty$ indicate very low and very high probabilities of default, respectively.

For example, suppose the probability of default is $0.2$ (on average, $1$ in $5$ people defaults). Then $p(X) = 0.2$ implies odds of $\dfrac{0.2}{1-0.2} = 1/4$, that is, odds of default of $1$ to $4$: defaulting is $25\%$ as likely as not defaulting.

Likewise, on average $9$ out of every $10$ people with odds of $9$ will default, since $p(X) = 0.9$ implies odds of $\dfrac{0.9}{1-0.9} = 9$.

Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.
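The conversion between probabilities and odds can be checked numerically. The helper names below are illustrative and do not come from any library.

```python
def odds_from_probability(p):
    """Return the odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert the relationship: p = odds / (1 + odds)."""
    return odds / (1 + odds)


print(odds_from_probability(0.2))  # about 0.25, i.e. odds of 1 to 4
print(odds_from_probability(0.9))  # about 9 (up to floating-point rounding)
print(probability_from_odds(9))    # back to a probability of 0.9
```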

By taking the logarithm of both sides of the equation above, we arrive at

> $\log\left(\dfrac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X$


The left-hand side is called the *log-odds* or *logit*. We see that the logistic regression model (defined above) has a logit that is linear in $X$.

The table below shows estimated coefficients of the logistic regression model that predicts the probability of "default" using "balance":

| - |Coefficient| Std. error| Z-statistic| P-value|
|--|--|--|--|--|
|Intercept| −10.6513| 0.3612| −29.5| <0.0001|
|balance |0.0055 |0.0002 |24.9 |<0.0001|

We can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in the table above plays the same role as the t-statistic in the linear regression output. A large (absolute) value of the z-statistic indicates evidence against the null hypothesis $H_0 : \beta_1 = 0$. This null hypothesis implies that $p(X) = \dfrac{e ^ {\beta_0 }}{1 + e ^ {\beta_0 }}$; in other words, that the probability of default does not depend on balance. Since the p-value associated with balance in the table is tiny, we can reject $H_0$ and conclude that there is indeed an association between balance and probability of default.

The estimated intercept in the table is typically not of interest; its main purpose is to adjust the average fitted
probabilities to the proportion of ones in the data.
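To make the coefficients concrete, the sketch below plugs the estimates from the table into the logistic function. The two balances are illustrative choices made here, and `p_default` is a hypothetical helper name.

```python
import math

# Coefficient estimates copied from the table above.
beta_0 = -10.6513  # intercept
beta_1 = 0.0055    # balance


def p_default(balance):
    """Logistic model: estimated probability of default for a given balance."""
    z = beta_0 + beta_1 * balance
    return math.exp(z) / (1 + math.exp(z))


for balance in (1000, 2000):  # illustrative balances, not taken from the table
    print(balance, round(p_default(balance), 4))

# A one-unit increase in balance multiplies the odds of default by exp(beta_1).
print(math.exp(beta_1))
```

With these estimates, the predicted probability of default is well below 1% at a balance of 1000 and close to 59% at a balance of 2000, which shows how quickly the S-shaped curve rises once the log-odds cross zero.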

# Logistic Regression Example - Titanic survival dataset



In [None]:
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


## Preprocessing - missing values:

1. 177 missing values for "age"; we will replace them with 28, which is the median age.
2. 2 missing values in "embarked"; we will impute with "S" (the most common boarding port).
3. Too many missing values in "deck". We won't use this field and will drop it.
4. "embarked" is the short version of "embark_town"; we will drop "embark_town"
5. "survived" is our target variable `y`. Note that we also have another variable "alive" which is equivalent to "survived". We will drop that too.
6. "pclass" is the numeric representation of "class"; we will drop "class"
7. Information in "sex" is captured in "who"; we will drop "sex"

In [None]:
# Assign the results back instead of calling fillna(..., inplace=True) on a selected column,
# which newer pandas versions warn about and which may not modify the DataFrame.
titanic["age"] = titanic["age"].fillna(titanic["age"].median())                                # replace 177 missing ages with the median
titanic["embarked"] = titanic["embarked"].fillna(titanic["embarked"].value_counts().idxmax())  # replace 2 missing embarked values with the most common
titanic = titanic.drop(columns=['deck', 'embark_town', 'alive', 'class', 'sex'])               # drop deck, embark_town, alive, class & sex
titanic.isnull().sum()  # check for missing values

survived      0
pclass        0
age           0
sibsp         0
parch         0
fare          0
embarked      0
who           0
adult_male    0
alone         0
dtype: int64

In [None]:
titanic.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked,who,adult_male,alone
0,0,3,22.0,1,0,7.25,S,man,True,False
1,1,1,38.0,1,0,71.2833,C,woman,False,False
2,1,3,26.0,0,0,7.925,S,woman,False,True
3,1,1,35.0,1,0,53.1,S,woman,False,False
4,0,3,35.0,0,0,8.05,S,man,True,True


## Preprocessing - categorical variables:

We will make sure all of our variables are numerical by converting the categorical ones to *dummy* variables. We can do this with the pandas `get_dummies()` function, which converts a categorical variable into dummy/indicator variables: if a variable has 3 categories, it creates 3 new columns whose names are modified to indicate the category. Only n-1 dummy variables are needed to represent n categories (the remaining category is implied when all the others are 0), so we will drop the extra one.
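The n versus n-1 point can be seen on a tiny made-up frame. The `toy` DataFrame is hypothetical, and `drop_first=True` is shown only as an alternative to dropping a column by hand; note that it always drops the first category in sorted order.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column that has three categories.
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One indicator column per category (3 columns); dtype=int gives 0/1 instead of booleans.
all_dummies = pd.get_dummies(toy, columns=["embarked"], dtype=int)

# drop_first=True keeps n-1 columns by dropping the first category ("C" here).
reduced = pd.get_dummies(toy, columns=["embarked"], drop_first=True, dtype=int)

print(list(all_dummies.columns))
print(list(reduced.columns))
```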

In [None]:
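# dtype=int keeps the 0/1 encoding shown below (recent pandas versions default to True/False)
titanic = pd.get_dummies(titanic, columns=['embarked', 'who', 'alone', 'adult_male'], dtype=int)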
titanic.drop(['embarked_C', 'who_man', 'alone_False', 'adult_male_False'], axis=1, inplace=True)
titanic.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked_Q,embarked_S,who_child,who_woman,alone_True,adult_male_True
0,0,3,22.0,1,0,7.25,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,0,0,0,1,0,0
2,1,3,26.0,0,0,7.925,0,1,0,1,1,0
3,1,1,35.0,1,0,53.1,0,1,0,1,0,0
4,0,3,35.0,0,0,8.05,0,1,0,0,1,1


Now that all of our variables are numeric we can continue with our logistic regression.

Define a feature matrix (DataFrame) that includes all the variables except our target variable "survived". Call this DataFrame `X`. Check out the shape of `X` to make sure it makes sense. You can also visually inspect it by looking at the first few rows:

In [None]:
# Your answer goes here


In [None]:
# Don't run this cell to keep the outcome as your frame of reference

(891, 11)

In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-x')

Create a target vector with "survived" and call it `y`:

In [None]:
# Your answer goes here


In [None]:
# Don't run this cell to keep the outcome as your frame of reference

(891,)

In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-y')

We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. Use a 30% split for test. You can use seed value 833 if you would like to get similar values as this notebook:
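As a reminder of how the splitting utility behaves, here is a minimal sketch on made-up arrays rather than on the Titanic data. The names `X_toy` and `y_toy` and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and a 0/1 target.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 30% of the rows for testing; random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

print(X_tr.shape, X_te.shape)  # 7 training rows and 3 test rows
```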

In [None]:
# Your answer goes here


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-split')

With the data arranged, we can follow our recipe to predict the labels:

First, instantiate a logistic regression model. You will first need to import `LogisticRegression`; it can be found under the `linear_model` module in `sklearn`. Call this model: `model`.

While instantiating the model, specify the `solver` as "liblinear". The solver is the algorithm used to solve the optimization problem.

In [None]:
# Your answer goes here


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-instantiate')

Fit model to the training data:

In [None]:
# Your answer goes here


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-fit')

Predict on new (test) data and store the results as `y_model`:

In [None]:
# Your answer goes here


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-predict')

Now that our predictions are ready we can merge them along with the ground truth, our `survived`, to the test features and visually inspect our model performance:

In [None]:
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()

Unnamed: 0,index,pclass,age,sibsp,parch,fare,embarked_Q,embarked_S,who_child,who_woman,alone_True,adult_male_True,survived,predicted
0,243,3,22.0,0,0,7.125,0,1,0,0,1,1,0,0
1,241,3,28.0,1,0,15.5,1,0,0,1,0,0,1,1
2,561,3,40.0,0,0,7.8958,0,1,0,0,1,1,0,0
3,108,3,38.0,0,0,7.8958,0,1,0,0,1,1,0,0
4,166,1,28.0,0,1,55.0,0,1,0,1,0,0,1,1


Finally, we can use the ``accuracy_score`` utility to see the fraction of predicted labels that match their true value:
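For reference, this is how `accuracy_score` behaves on two short made-up label lists; these are not the Titanic predictions.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five samples.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Fraction of positions where the prediction equals the true label.
acc = accuracy_score(y_true, y_pred)
print(acc)  # 4 of the 5 labels match, so 0.8
```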

In [None]:
# Your answer goes here


In [None]:
# Don't run this cell to keep the outcome as your frame of reference

0.7873134328358209

In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-acc')

Our basic model is giving us an accuracy of 79%. What accuracy can you reach by trying Gaussian naive Bayes? Repeat the steps for `GaussianNB`:

In [None]:
# Your answer goes here


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-GaussianNB')

And its accuracy on the test set?

In [None]:
# Your answer goes here


In [None]:
# Don't run this cell to keep the outcome as your frame of reference

0.7657232704402517

In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-02-acc')

Now that we have finished the example, compare and contrast the logistic regression and Gaussian naive Bayes models. Think about their similarities and differences in terms of:

1.   The nature of the target variable
2.   Underlying Assumptions
3.   Interpretability
4.   Performance
5.   Limits


There are several ways to improve a model. We won't go into their details here, but simply list them:

* Model tuning or hyper-parameter tuning
* Feature engineering
* Trying out different models
* Bringing in new sources of data

For other classification algorithms, please visit the [scikit-learn documentation page](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).