### Classification: Predicting discrete labels

We will first take a look at a simple *classification* task, in which you are given a set of labeled points and want to use these to classify some unlabeled points.

Imagine that we have the data shown in this figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-1.png?raw=true" width="600" align="center"/>

Here we have two-dimensional data: that is, we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane.
In addition, we have one of two *class labels* for each point, here represented by the colors of the points.
From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled "blue" or "red."

There are a number of possible models for such a classification task, but here we will use an extremely simple one. We will make the assumption that the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group.
Here the *model* is a quantitative version of the statement "a straight line separates the classes", while the *model parameters* are the particular numbers describing the location and orientation of that line for our data.
The optimal values for these model parameters are learned from the data (this is the "learning" in machine learning), which is often called *training the model*.

The following figure shows a visual representation of what the trained model looks like for this data:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-2.png?raw=true" width="600" align="center"/>

Now that this model has been trained, it can be generalized to new, unlabeled data.
In other words, we can take a new set of data, draw this model line through it, and assign labels to the new points based on this model.
This stage is usually called *prediction*. See the following figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-classification-3.png?raw=true" width="900" align="center"/>
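The prediction step can be sketched in a few lines, assuming a separating line has already been learned. The weights `w`, the offset `b`, and the new points below are made-up values for illustration, not the parameters behind the figures.

```python
# Hypothetical parameters of an already-trained line: w[0]*x + w[1]*y + b = 0
w = (1.0, -1.0)
b = 0.5

def predict_label(point, w=w, b=b):
    """Label a 2D point according to the side of the line it falls on."""
    x, y = point
    score = w[0] * x + w[1] * y + b
    return "red" if score > 0 else "blue"

# New, unlabeled points (made up for this sketch)
new_points = [(2.0, 0.0), (0.0, 3.0), (1.0, 1.0)]
labels = [predict_label(p) for p in new_points]
print(labels)
```

Training would choose `w` and `b` from labeled data; here they are fixed by hand so that only the prediction stage is shown.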

This is the basic idea of a classification task in machine learning, where "classification" indicates that the data has discrete class labels.
At first glance this may look fairly trivial: it would be relatively easy to simply look at this data and draw such a discriminatory line to accomplish this classification.
A benefit of the machine learning approach, however, is that it can generalize to much larger datasets in many more dimensions.

For example, this is similar to the task of automated spam detection for email; in this case, we might use the following features and labels:

- *feature 1*, *feature 2*, etc. $\to$ normalized counts of important words or phrases ("Serious cash", "Nigerian prince", etc.)
- *label* $\to$ "spam" or "not spam"

For the training set, these labels might be determined by individual inspection of a small representative sample of emails; for the remaining emails, the label would be determined using the model.
For a suitably trained classification algorithm with enough well-constructed features (typically thousands or millions of words or phrases), this type of approach can be very effective.
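As a rough sketch of what "normalized counts" could look like, the function below divides the number of times each phrase occurs by the number of words in the email. The function name `phrase_features`, the normalization choice, and the sample email are assumptions made for illustration; real spam filters use far richer feature pipelines.

```python
def phrase_features(text, phrases):
    """Return each phrase's count divided by the number of words in the text."""
    lowered = text.lower()
    n_words = max(len(lowered.split()), 1)  # guard against empty text
    return [lowered.count(p.lower()) / n_words for p in phrases]

phrases = ["serious cash", "nigerian prince"]
email = "Serious cash awaits you: a Nigerian prince needs help moving serious cash"
features = phrase_features(email, phrases)
print(features)
```

Each email becomes one row of such features, and the "spam" or "not spam" label is what the classifier learns to predict from them.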

**Examples:**

Here are a few of the multitude of ways classification can be used in the real world.

**Predicting credit risk**

A financing company might look at a number of variables before offering a loan to a company or individual. Whether or not to offer the loan is a binary classification problem.

**News classification**

An algorithm might be trained to predict the topic of a news article (sports, politics, business, etc.).

**Classifying human activity**

By collecting data from sensors such as a phone accelerometer or smart watch, you can predict the person’s activity. The output will be one of a finite set of classes (e.g., walking, sleeping, standing, or running).

## Types of Classification

Before we continue, let’s review several different types of classification.

### Binary Classification

The simplest example of classification is binary classification, where there are only two labels you can predict. Examples include fraud analytics, where a given transaction is classified as fraudulent or not, and email filtering, where a given email is classified as spam or not spam.

### Multiclass Classification

Beyond binary classification lies multiclass classification, where one label is chosen from more than two distinct possible labels. Typical examples are Facebook predicting the people in a given photo or a meteorologist predicting the weather (rainy, sunny, cloudy, etc.). Note how there is always a finite set of classes to predict; it’s never unbounded. This is also called multinomial classification.

### Multilabel Classification

Finally, there is multilabel classification, where a given input can produce multiple labels. For example, you might want to predict a book’s genre based on the text of the book itself. While this could be multiclass, it’s probably better suited for multilabel because a book may fall into multiple genres. Another example of multilabel classification is identifying the objects that appear in an image. Note that in this example, the number of output predictions is not necessarily fixed, and could vary from image to image.
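One way to see the difference between multiclass and multilabel targets is in how they might be encoded. The sketch below uses a hypothetical three-genre label set: a multiclass target is a single index, while a multilabel target is a 0/1 indicator for every label.

```python
GENRES = ["mystery", "romance", "sci-fi"]  # hypothetical label set

def encode_multiclass(genre):
    """Multiclass: exactly one label, stored as its index in GENRES."""
    return GENRES.index(genre)

def encode_multilabel(genres):
    """Multilabel: any subset of labels, stored as one 0/1 flag per label."""
    return [1 if g in genres else 0 for g in GENRES]

print(encode_multiclass("romance"))
print(encode_multilabel({"mystery", "sci-fi"}))
```

Binary classification is the special case of the multiclass encoding with only two possible labels.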

## Popular Classification Algorithms

* Logistic regression
* Decision trees
* Random forests
* Gradient-boosted trees
* Multilayer perceptron classifier

## Logistic Regression

Logistic regression is one of the most popular methods of classification. It is a linear method: each input (or feature) is multiplied by a weight learned during training, and the weighted inputs are summed and transformed into a probability of belonging to a particular class. These weights are useful because they give a rough indication of feature importance: if a feature has a large weight (in absolute value), you can assume that variations in that feature have a significant effect on the outcome (assuming you performed normalization). A smaller weight means the feature is less likely to be important.

Consider the following example: The *Default* Dataset

<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-the-default-dataset.png?raw=true" width="800" align="center"/>

<div style="text-align: center"> The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status. </div>

*<div style="text-align: right"> From [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, et al </div>*

In this data the response "default" falls into one of two categories, "Yes" or "No". Let's find a model that predicts the probability of default from the "Balance" on the credit card. Rather than modeling this response Y directly, logistic regression models the probability that Y belongs to a particular category.

The probability of default given balance can be written as:
> **Pr(**default = Yes|balance**)**

The values of Pr(default = Yes|balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as p(balance) > 0.1.
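The effect of the threshold can be sketched as follows. The probabilities in `p_values` are hypothetical, chosen only to show how the same estimates lead to different predictions under the 0.5 and 0.1 thresholds mentioned above.

```python
def classify(probability, threshold=0.5):
    """Predict default = Yes when the estimated probability exceeds the threshold."""
    return "Yes" if probability > threshold else "No"

p_values = [0.03, 0.15, 0.62]  # hypothetical values of p(balance)

default_rule = [classify(p) for p in p_values]
conservative_rule = [classify(p, threshold=0.1) for p in p_values]

print(default_rule)
print(conservative_rule)
```

Lowering the threshold flags more individuals as at risk, at the cost of more false alarms.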


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/11-02-Classification-lr.png?raw=true" width="800" align="center"/>

<div style="text-align: center"> Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.
 </div>

*<div style="text-align: right"> From [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, et al </div>*

### The Logistic Model

How should we model the relationship between $p(X) = \Pr(Y = 1|X)$ and $X$? (For convenience, we are using the generic 0/1 coding for the response.) One approach would be to use a linear regression model to represent these probabilities:

> $p(X) = \beta_0 + \beta_1 X$

If we use this approach to predict default=Yes using balance, then we obtain the model shown in the left-hand panel of the figure above. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1.

To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Many functions meet this description. In logistic regression, we use the logistic function:

> $p(X) = \dfrac{e ^ {\beta_0 + \beta_1 X }}{1 + e ^ {\beta_0 + \beta_1 X }}$

To fit the model above, we use a method called *maximum likelihood*. The right-hand panel of the figure above illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of X, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot.
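The contrast between the two models can be sketched numerically. The coefficients below are illustrative assumptions, not values fitted to the Default data; they are chosen only so that the linear model leaves the $[0, 1]$ range while the logistic model does not.

```python
import math

def linear_probability(balance, b0, b1):
    """Linear model: nothing keeps the output inside [0, 1]."""
    return b0 + b1 * balance

def logistic_probability(balance, b0, b1):
    """Logistic model: e^z / (1 + e^z), written equivalently as 1 / (1 + e^-z)."""
    z = b0 + b1 * balance
    return 1.0 / (1.0 + math.exp(-z))

balances = [0, 1000, 2000, 3000]
# Illustrative coefficients (assumptions for this sketch)
linear_fit = [linear_probability(x, b0=-0.1, b1=0.0005) for x in balances]
logistic_fit = [logistic_probability(x, b0=-10.0, b1=0.005) for x in balances]

for x, lin, logit in zip(balances, linear_fit, logistic_fit):
    print(x, round(lin, 3), round(logit, 5))
```

The linear values start below zero and end above one, while the logistic values trace the S-shaped curve between the two bounds.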

After a bit of manipulation of the logistic function, we find that:

> $\dfrac{p(X)}{1-p(X)} = e ^ {\beta_0 + \beta_1 X }$
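The manipulation can be spelled out step by step. Writing $z = \beta_0 + \beta_1 X$ as shorthand (the symbol $z$ is introduced here only for brevity), we first compute $1 - p(X)$ and then divide:

$$
\begin{aligned}
p(X) &= \frac{e^{z}}{1 + e^{z}} \\
1 - p(X) &= \frac{(1 + e^{z}) - e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{z}} \\
\frac{p(X)}{1 - p(X)} &= \frac{e^{z}}{1 + e^{z}} \cdot \left(1 + e^{z}\right) = e^{z} = e^{\beta_0 + \beta_1 X}
\end{aligned}
$$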

The quantity $p(X)/[1-p(X)]$ is called the *odds*, and can take on any value between $0$ and $\infty$. Values of the odds close to $0$ and $\infty$ indicate very low and very high probabilities of default, respectively.

For example, suppose the probability of default is $0.2$ (or $1$ in $5$ people). Since $p(X) = 0.2$ implies odds of $\dfrac{0.2}{1-0.2} = 1/4$, the odds of default are $1$ to $4$: defaulting is $25\%$ as likely as not defaulting.

Likewise, on average $9$ out of every $10$ people with odds of $9$ will default, since $p(X) = 0.9$ implies odds of $\dfrac{0.9}{1-0.9} = 9$.
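The conversion between probabilities and odds can be checked with a short sketch. It uses Python's `fractions.Fraction` so the arithmetic is exact; the helper names `odds` and `probability` are chosen here for illustration.

```python
from fractions import Fraction

def odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1 - p)

def probability(o):
    """Convert odds o back into a probability o / (1 + o)."""
    return o / (1 + o)

print(odds(Fraction(1, 5)))      # 1/4
print(odds(Fraction(9, 10)))     # 9
print(probability(Fraction(9)))  # 9/10
```

The second function is the inverse of the first, which is why odds of $9$ correspond to $9$ defaults out of every $10$ people.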

Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.

By taking the logarithm of both sides of the equation above, we arrive at

> $\log\left(\dfrac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X$


The left-hand side is called the *log-odds* or *logit*. We see that the logistic regression model (defined above) has a logit that is linear in $X$.

The table below shows estimated coefficients of the logistic regression model that predicts the probability of "default" using "balance":

| - |Coefficient| Std. error| Z-statistic| P-value|
|--|--|--|--|--|
|Intercept| −10.6513| 0.3612| −29.5| <0.0001|
|balance |0.0055 |0.0002 |24.9 |<0.0001|

We can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in the table above plays the same role as the t-statistic in the linear regression output. A large (absolute) value of the z-statistic indicates evidence against the null hypothesis $H_0 : \beta_1 = 0$. This null hypothesis implies that $p(X) = \dfrac{e ^ {\beta_0 }}{1 + e ^ {\beta_0 }}$, in other words, that the probability of default does not depend on balance. Since the p-value associated with balance in the table is tiny, we can reject $H_0$ and conclude that there is indeed an association between balance and probability of default.

The estimated intercept in the table is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data.
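To see what the estimated coefficients imply, the sketch below plugs the intercept and the `balance` coefficient from the table into the logistic function. The two balance values, 1000 and 2000, are chosen here purely for illustration.

```python
import math

# Coefficients copied from the table above
B0 = -10.6513  # intercept
B1 = 0.0055    # balance

def p_default(balance):
    """Estimated probability of default for a given balance."""
    z = B0 + B1 * balance
    return 1.0 / (1.0 + math.exp(-z))

for balance in (1000, 2000):
    print(balance, round(p_default(balance), 4))
```

With these coefficients the estimated probability is below 1% at a balance of 1000 and a little under 60% at a balance of 2000, which matches the steep middle section of the S-shaped curve.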

# Basic Logistic Regression Example

This example is a simplified version of the Qwiklab notebook [Predict Visitor Purchases with a Classification Model in BQML](https://google.qwiklabs.com/focuses/1794?parent=catalog).

We will create a simple logistic regression model, using a sample dataset with 573,054 rows and only two input fields. While this elementary model is helpful for understanding how the process works within BigQuery, its predictive power is very limited because we are giving it so little information:

In [1]:
%%bigquery
SELECT * FROM `ba-770.public.ecommerce_training_sample` LIMIT 5

Unnamed: 0,fullVisitorId,bounces,time_on_site,will_buy_on_return_visit
0,9202145451116083716,1,0,0
1,5545935958268388079,1,0,0
2,4755910499190436580,1,0,0
3,473125307681564143,1,0,0
4,5588278221318225661,1,0,0


`bounces` (whether the visitor left the website immediately)

`time_on_site` (how long the visitor was on our website)

`will_buy_on_return_visit` (whether the visitor made a purchase on a later visit)

**Question:** What are the risks of only using the above two fields?

**Answer:** Machine learning is only as good as the training data that is fed into it. If there isn't enough information for the model to determine and learn the relationship between your input features and your label (in this case, whether the visitor bought in the future) then you will not have an accurate model.


**Question:** Which fields are the input features and the label?

**Answer:** The inputs are `bounces` and `time_on_site`. The label is `will_buy_on_return_visit`.

**Which model type should you choose?**

Since you are bucketing visitors into "will buy in future" or "won't buy in future", this is a classification problem; use the `logistic_reg` model type.

## Train the model - TRAINING
The following query creates a model and specifies model options. Run this query to train your model (<5 minutes):

In [2]:
%%bigquery
CREATE OR REPLACE MODEL `temp_dataset.classification_model`
OPTIONS(model_type='logistic_reg', labels = ['will_buy_on_return_visit'])
AS
SELECT * EXCEPT(fullVisitorId) FROM `ba-770.public.ecommerce_training_sample`

## Evaluate classification model performance - EVALUATE

### Select your performance criteria

For classification problems in ML, you want to minimize the False Positive Rate (predict that the user will return and purchase and they don't) and maximize the True Positive Rate (predict that the user will return and purchase and they do).

This relationship is visualized with a ROC (Receiver Operating Characteristic) curve like the one shown here, where you try to maximize the area under the curve or AUC:

<img src="https://github.com/soltaniehha/Business-Analytics-Toolbox/blob/master/docs/images/roc-curve.png?raw=true" width="400" align="center"/>

In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.
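For intuition about what these quantities measure, here is a small sketch on made-up labels and scores (not the ecommerce data). It computes the true and false positive rates at one threshold, and the AUC as the share of (positive, negative) pairs that the scores rank correctly, which is one standard way to define it. BQML computes its own `roc_auc` internally; this is only an illustration.

```python
def rates(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at a threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = purchased on return visit, 0 = did not
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

print(rates(labels, scores, threshold=0.5))
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so values closer to 1 are better.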

Now that training is complete, run this query to evaluate how well the model performs using `ML.EVALUATE`:

In [3]:
%%bigquery
SELECT *
FROM ML.EVALUATE
(
    MODEL temp_dataset.classification_model,  
    (SELECT * EXCEPT(fullVisitorId) FROM `ba-770.public.ecommerce_eval_sample`)
)

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.16129,0.00303,0.983624,0.005949,0.077383,0.723829
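The `f1_score` column is the harmonic mean of precision and recall, so it can be re-derived from the two reported values as a quick sanity check. The sketch below uses the rounded numbers from the output above, so the result agrees with the reported score only up to rounding.

```python
# Rounded values copied from the ML.EVALUATE output above
precision = 0.16129
recall = 0.00303

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(precision, recall), 6))
```

Because the harmonic mean is dominated by the smaller of the two inputs, the very low recall pulls the F1 score close to zero even though accuracy is high.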


## Predict which new visitors will come back and purchase - TEST

Next you will write a query to predict which new visitors will come back and make a purchase.

The prediction query below uses the classification model to predict the probability that a first-time visitor to the Google Merchandise Store will make a purchase in a later visit:

In [4]:
%%bigquery
CREATE OR REPLACE TABLE temp_dataset.ecommerce_predictions
AS
SELECT 
    predicted_will_buy_on_return_visit, 
    predicted_will_buy_on_return_visit_probs[OFFSET(0)].prob,
    will_buy_on_return_visit,
    bounces,
    time_on_site
FROM ML.PREDICT
(
    MODEL `temp_dataset.classification_model`,
    (SELECT * EXCEPT(fullVisitorId) FROM `ba-770.public.ecommerce_test_sample`), 
    STRUCT(0.0229 AS threshold)
)
ORDER BY prob DESC

In [5]:
%%bigquery
SELECT * FROM temp_dataset.ecommerce_predictions LIMIT 10

Unnamed: 0,predicted_will_buy_on_return_visit,prob,will_buy_on_return_visit,bounces,time_on_site
0,1,0.877775,0,0,6808
1,1,0.852786,1,0,6555
2,1,0.847594,1,0,6507
3,1,0.782035,0,0,5991
4,1,0.777807,0,0,5962
5,1,0.762166,0,0,5858
6,1,0.756735,0,0,5823
7,1,0.713996,0,0,5564
8,1,0.713649,0,0,5562
9,1,0.69382,0,0,5450


Note that we are setting a `threshold` to make the classification. Our logistic regression model gives us a probability, but we need to set this threshold to the right value to get the most meaningful results. The selected value was chosen from the ROC curve, along with considerations of the confusion matrix and the precision-recall curve. You can find all of these in the BigQuery UI, under the evaluation tab of the model you are using to make predictions.

In [6]:
%%bigquery
SELECT 
    COUNT(*) total_visits,
    COUNTIF(will_buy_on_return_visit=1) actual_purchases, 
    COUNTIF(will_buy_on_return_visit=1)/COUNT(*)*100 rate_percent,
    COUNTIF(predicted_will_buy_on_return_visit=1) predicted_positive,
    COUNTIF(will_buy_on_return_visit=1 AND predicted_will_buy_on_return_visit=1) true_positive,
    COUNTIF(will_buy_on_return_visit=1 AND predicted_will_buy_on_return_visit=1)/COUNTIF(predicted_will_buy_on_return_visit=1)*100 rate_percent_predicted
FROM temp_dataset.ecommerce_predictions

Unnamed: 0,total_visits,actual_purchases,rate_percent,predicted_positive,true_positive,rate_percent_predicted
0,59608,583,0.978057,17134,399,2.328703


Out of 17,134 positive predictions, 399 visitors actually make a purchase. The model flags about 29% of all visitors (17,134 of 59,608), and the conversion rate within that group is about 2.33%, more than double the 0.98% baseline we would get from a random selection. Not too bad for a very basic model with only two variables. In the corresponding Qwiklab we will build a more sophisticated model with more variables.
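The comparison with the baseline can be reproduced directly from the four counts in the query output. The variable names below are chosen for this sketch; `lift` is the ratio of the conversion rate among flagged visitors to the overall rate.

```python
# Counts copied from the query output above
total_visits = 59608
actual_purchases = 583
predicted_positive = 17134
true_positive = 399

baseline_rate = actual_purchases / total_visits    # conversion rate of a random selection
targeted_rate = true_positive / predicted_positive # conversion rate among flagged visitors
lift = targeted_rate / baseline_rate
share_flagged = predicted_positive / total_visits
share_of_buyers_found = true_positive / actual_purchases

print(f"baseline: {baseline_rate:.2%}, targeted: {targeted_rate:.2%}, lift: {lift:.2f}")
print(f"flagged: {share_flagged:.1%}, buyers found: {share_of_buyers_found:.1%}")
```

Targeting only the flagged visitors would reach a bit more than two thirds of the eventual buyers while contacting well under a third of all visitors.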

## Predicting the unknown future
Below we will make a prediction on a new data point (bounces=0 and time_on_site=600):

In [7]:
%%bigquery
SELECT 
    predicted_will_buy_on_return_visit, 
    predicted_will_buy_on_return_visit_probs[OFFSET(0)].prob,
--  will_buy_on_return_visit,
    bounces,
    time_on_site
FROM ML.PREDICT
(
    MODEL `temp_dataset.classification_model`,
    (SELECT 0 bounces, 600 time_on_site), 
    STRUCT(0.0229 AS threshold)
)

Unnamed: 0,predicted_will_buy_on_return_visit,prob,bounces,time_on_site
0,1,0.035518,0,600
