# <center>Classification</center>

Classification is one of the most common and important types of machine learning problem. Many machine learning algorithms have been designed to tackle classification problems, in which the target is discrete rather than continuous. Examples of classification-based predictive analytics problems are:

* Diabetic Retinopathy: Given a retinal image, classify the image (eye) as Diabetic or Non-Diabetic.
* Sentiment Analysis: Given a sentence, analyze the sentiment of the sentence (for example happiness/sadness, praise/insult, etc.)
* Digit Recognition: Given an image of a digit, recognize the digit (0–9). This is an example of Multi-Class Classification.

and many more...

## Terminology related to classification

- Classifier: An algorithm that maps the input data to a specific category.
- Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
- Feature: A feature is an individual measurable property of a phenomenon being observed.
- Binary Classification: Classification task with two possible outcomes. E.g. gender classification (Male / Female).
- Multi-class classification: Classification with more than two classes. In multi-class classification each sample is assigned to one and only one target label. E.g. an animal can be a cat or a dog but not both at the same time.
- Multi-label classification: Classification task where each sample is mapped to a set of target labels (more than one class). E.g. a news article can be about sports, a person, and a location at the same time.
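
The three task types differ mainly in the shape of the target. The following is a minimal sketch with made-up labels (the animal names and news topics are illustrative only and do not come from the dataset used later):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Binary: one label per sample, two possible values.
y_binary = np.array([0, 1, 1, 0])

# Multi-class: one label per sample, more than two possible values.
y_multiclass = np.array(["cat", "dog", "bird", "cat"])

# Multi-label: a set of labels per sample, encoded as a 0/1 indicator matrix.
articles = [{"sports", "person"}, {"location"}, {"sports", "person", "location"}]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(articles)

print(mlb.classes_)   # column order of the indicator matrix
print(y_multilabel)   # one row per article, one column per label
```

Each row of `y_multilabel` can contain several ones, which is exactly what separates multi-label from multi-class targets.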

# Logistic Regression

Despite being called a regression, logistic regression is actually a widely used supervised classification technique. 
It allows us to predict the probability that an observation belongs to a certain class.

<h2>Some Python Libraries</h2>

<p style="text-align: justify;">First, let's import some libraries that help us manipulate the data set: `pandas`, `numpy`, `matplotlib` and `seaborn`. In this tutorial, we implement logistic regression with `scikit-learn`. The goal here is to keep things as simple as possible, so we rely on ready-made libraries and their functionality.</p>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

<h4> Ad click Project </h4>

Let us now put what we learned in the previous section into Python code. We will use a website's customer data to understand which customers will click the ad. By the end of this section we will be able to make predictions using our own logistic regression model.

This data set contains the following features:

* '`Daily Time Spent on Site`': consumer time on site in minutes
* '`Age`': customer age in years
* '`Area Income`': Avg. Income of geographical area of consumer
* '`Daily Internet Usage`': Avg. minutes a day consumer is on the internet
* '`Ad Topic Line`': Headline of the advertisement
* '`City`': City of consumer
* '`Male`': Whether or not consumer was male
* '`Country`': Country of consumer
* '`Timestamp`': Time at which consumer clicked on Ad or closed window
* '`Clicked on Ad`': 0 or 1, indicating whether the consumer clicked on the Ad

In [None]:
# import the dataset
import os

# Build the path with os.path.join so it works on any operating system
df = pd.read_csv(os.path.join(os.getcwd(), "Datasets", "Web_data_v3.csv"))

## Basic Data Exploration

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe(include='all').T

In [None]:
df.nunique()

## Check Duplicates

In [None]:
print(df.duplicated().value_counts())
df.drop_duplicates(inplace = True)
print(len(df))

## Checking missing value

In [None]:
df.isnull().sum()

## Basic Data Exploration Results
Based on the basic exploration, this dataset has 6657 rows and 14 columns. There are no missing values and no duplicate rows.

#### The selected columns in this step are not final, further study will be done and then a final list will be created
- VistID: Qualitative
- Time_Spent: Continuous
- Age: Continuous
- Avg_Income: Continuous
- Internet_Usage: Continuous
- Ad_Topic: Categorical
- Country_Name: Categorical
- City_code: Categorical
- Male: Categorical
- Time_Period: Categorical
- Weekday: Categorical
- Month: Categorical
- Year: Categorical
- Clicked: Categorical. This is the Target Variable!

## Target Variable

In [None]:
plt.figure(figsize=(15,5))
plt.rc("font", size=14)
ax = sns.countplot(y='Clicked', data=df)
total = len(df['Clicked'])
# Annotate each bar with its share of the total
for p in ax.patches:
    percentage = '{:.2f}%'.format(100 * p.get_width()/total)
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y))
plt.show()

- Class 0 has 3619 rows and class 1 has 3038 rows.
- That is roughly 54% versus 46%, so the data is not severely imbalanced.

# Exploratory Data Analysis

In [None]:
# Create variable lists based on column types.
# The slice below skips the first two columns and the last column.
categorical_col=[]
numerical_col=[]

for col in df.columns[2:-1]:
    if df[col].dtype =="object":
        categorical_col.append(col)
    elif df[col].dtype =="int64" or df[col].dtype =="float64":
        numerical_col.append(col)

## Visual exploration (Categorical Vs Categorical) -- Bar Charts or Grouped Bar Charts
When the target variable is Categorical and the predictor is also Categorical, we explore the relationship between them visually using bar plots and grouped bar plots.

In [None]:
for col in categorical_col:
    plt.figure(figsize=(20, 6))
    sns.barplot(data=df,x=col,y="Clicked",palette='hot')
    plt.xticks(rotation=45)
    plt.show()

In [None]:
plt.figure(figsize=(45, 6))
sns.barplot(data=df,x="Ad_Topic",y="Clicked",palette='hot')
plt.show()

In [None]:
# Creating Grouped bar plots for each categorical predictor against the Target Variable "Clicked"
for col in categorical_col:
    CrossTabResult=pd.crosstab(index=df[col], columns=df["Clicked"])
    CrossTabResult.plot.bar(color=['lightblue','blue'], figsize=(15,6))
    plt.show()

## From above plots we learn that
- In `Ad_Topic`, the ads are distributed about equally, except that 'product_2' has the lowest count and 'product_3' the highest; 'product_3' also has the highest click rate of any product.
- In `City_code`, most countries have data for only one city, which is why 'city_1' dominates the distribution. Where a country has more than one city, visitors from 'city_1' still clicked the ad more than visitors from the other cities.
- In `Male`, the distribution and click rate are similar for both groups; there are more male visitors than female visitors, so the number of clicks by males is also higher.
- In `Time_Period`, most people visit the website in 'early_morning', and the click rate is also higher at that time.
- In `Weekday`, the distribution and click rate are about the same across days.
- In `Month`, there are more visitors in February and May, and the ad click rate is also higher in those months.

In [None]:
# Why do we need feature selection?

# Imagine a dataset with 100, 200 or 400 features: not all of them are significant for the prediction.
# Larger dimensions (more rows and columns) call for more complex models.
# Overfitting: the model starts using insignificant features to explain random error.


# Feature selection looks at the relevance of each feature w.r.t. the output
# and at redundancy with other input features.

# Both features and output can be numerical or categorical:
#  Input          Output
# numerical   and numerical
# numerical   and categorical
# categorical and numerical
# categorical and categorical

## Statistical Feature Selection (Categorical Vs Categorical) using Chi-Square Test
Chi-Square test is conducted to check whether two categorical variables are associated

* Assumption(H0): The two columns are NOT related to each other
* Result of Chi-Sq Test: The p-value, i.e. the probability of seeing an association at least this strong if H0 were true
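
As a small illustration of the test, the sketch below runs `chi2_contingency` on a hypothetical toy table (the `topic` and `clicked` values are invented for this example and are unrelated to the real dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: ad topic vs. whether the ad was clicked.
toy = pd.DataFrame({
    "topic":   ["a"] * 40 + ["b"] * 40,
    "clicked": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})

# Contingency table of observed counts.
table = pd.crosstab(toy["topic"], toy["clicked"])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the counts expected under H0 (no association).
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4g}")
```

Because the click rate differs sharply between the two topics in this toy table, the p-value falls well below 0.05 and H0 is rejected.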

In [None]:
# Hypothesis Testing

# Null hypothesis - the opposite of the idea (there is no relationship)
# Alternative hypothesis - acceptance of the idea


# p-value < 0.05  - reject the null hypothesis, i.e. accept the alternative hypothesis
# p-value >= 0.05 - cannot reject the null hypothesis, i.e. no evidence for the idea

In [None]:
# Function to test the association of each categorical variable with the target variable
def FunctionChisq(inpData, TargetVariable, CategoricalVariablesList):
    from scipy.stats import chi2_contingency
    
    # Creating an empty list of final selected predictors
    SelectedPredictors=[]
    
    print('##### chi-square Results ##### \n')
    for predictor in CategoricalVariablesList:
        CrossTabResult=pd.crosstab(index=inpData[TargetVariable], columns=inpData[predictor])
        ChiSqResult = chi2_contingency(CrossTabResult)
        
        # If the ChiSq P-Value is <0.05, we reject H0 ("p low, null go")
        if ( ChiSqResult[1] < 0.05 ):
            SelectedPredictors.append(predictor)
            print(predictor, 'is correlated with', TargetVariable, '| P-Value:', ChiSqResult[1])
        else:
            print(predictor, 'is NOT correlated with', TargetVariable, '| P-Value:', ChiSqResult[1])
            
    # Return the predictors that passed the test
    return SelectedPredictors

In [None]:
temp = FunctionChisq(inpData=df, 
              TargetVariable="Clicked",
              CategoricalVariablesList= categorical_col)

- 'Ad_Topic', 'Country_Name', 'City_code', 'Male' and 'Time_Period' are important variables.

## Visual exploration (Continuous Vs Categorical) -- Histogram and Box/Violin Plots
When the target variable is Categorical and the predictor is Continuous, we explore the relationship between them visually using Histograms and Box Plots or Violin Plots.

In [None]:
for col in numerical_col:
    df.hist(col, figsize=(12,6))
    plt.show()

In [None]:
for col in numerical_col:
    plt.figure(figsize=(14,6))
    sns.boxplot(x="Clicked", y=col, data=df)
    plt.show()

In [None]:
for col in numerical_col:
    plt.figure(figsize=(14,6))
    sns.violinplot(x="Clicked", y=col, data=df)
    plt.show()

## From above plots we learn that
- In `Age`, older people visited the website less frequently, and middle-aged people clicked the ad more frequently.
- In `Avg_Income`, the distribution is left-skewed, and the higher the income, the less frequently people clicked the ad.
- In `Internet_Usage`, people who spent the most time on the website clicked the ad less frequently.

## Statistical Feature Selection (Categorical Vs Continuous) using ANOVA test
Analysis of variance (ANOVA) is performed to check whether there is any relationship between a given continuous variable and a categorical variable

- Assumption(H0): There is NO relation between the given variables (i.e. the average (mean) value of the numeric Predictor variable is the same for all the groups in the categorical Target variable)
- ANOVA Test result: The p-value, i.e. the probability of seeing group means at least this different if H0 were true
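
To make the test concrete, here is a minimal sketch on simulated data. The two age groups are drawn from normal distributions with assumed means of 32 and 40; none of these numbers come from the real dataset:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Simulated ages for the two target groups (hypothetical values).
age_not_clicked = rng.normal(loc=32, scale=5, size=200)
age_clicked = rng.normal(loc=40, scale=5, size=200)

# One-way ANOVA: H0 says both groups share the same mean age.
f_stat, p_value = f_oneway(age_not_clicked, age_clicked)
print(f"F={f_stat:.2f}, p-value={p_value:.3g}")
```

With group means this far apart, the p-value is far below 0.05, so the predictor would be kept.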

In [None]:
# Defining a function to find the statistical relationship of all the continuous variables with the target
def FunctionAnova(inpData, TargetVariable, ContinuousPredictorList):
    from scipy.stats import f_oneway

    # Creating an empty list of final selected predictors
    SelectedPredictors=[]
    
    print('##### ANOVA Results ##### \n')
    for predictor in ContinuousPredictorList:
        CategoryGroupLists=inpData.groupby(TargetVariable)[predictor].apply(list)
        AnovaResults = f_oneway(*CategoryGroupLists)
        
        # If the ANOVA P-Value is <0.05, that means we reject H0
        if (AnovaResults[1] < 0.05):
            print(predictor, 'is correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
            SelectedPredictors.append(predictor)
        else:
            print(predictor, 'is NOT correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
    
    return (SelectedPredictors)

In [None]:
# Calling the function to check which continuous variables are correlated with target
FunctionAnova(inpData=df, TargetVariable="Clicked", ContinuousPredictorList=numerical_col)

- From the ANOVA test we see that 'Age', 'Avg_Income' and 'Internet_Usage' are the variables that have some impact on the target variable.

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(df, hue='Clicked')
plt.show()

## Check Correlation between two variables

In [None]:
corr_data=df[['Age', 'Time_Spent', 'Avg_Income', 'Internet_Usage', 'Clicked']]
plt.figure(figsize=(12, 8))
sns.heatmap(corr_data.corr(), annot=True)
plt.show()

In [None]:
corr=corr_data.corr()
corr['Clicked'][abs(corr['Clicked']) > 0.5 ]

## Prepare Data for Logistic Regression Model
The assumptions made by logistic regression about the distribution and relationships in your data are much the same as the assumptions made in linear regression.

Ultimately in predictive modeling machine learning projects you are more focused on making accurate predictions rather than interpreting the results. As such, you can break some assumptions as long as the model is robust and performs well.

- **Binary Output Variable:** This might be obvious as we have already mentioned it, but logistic regression is intended for binary (two-class) classification problems. It will predict the probability of an instance belonging to the default class, which can be snapped into a 0 or 1 classification.
- **Remove Noise:** Logistic regression assumes no error in the output variable (y). Consider removing outliers and possibly misclassified instances from your training data.
- **Linear Relationship:** Logistic regression is a linear algorithm (with a non-linear transform on the output). It assumes a linear relationship between the input variables and the log-odds of the output. Data transforms of your input variables that better expose this linear relationship can result in a more accurate model. For example, you can use log, root, Box-Cox and other univariate transforms to better expose this relationship.
- **Remove Correlated Inputs:** Like linear regression, the model can overfit if you have multiple highly-correlated inputs. Consider calculating the pairwise correlations between all inputs and removing highly correlated inputs.
- **Fail to Converge:** It is possible for the maximum likelihood estimation process that learns the coefficients to fail to converge. This can happen if there are many highly correlated inputs in your data or the data is very sparse (e.g. lots of zeros in your input data).
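
The advice on correlated inputs can be sketched as follows. This is only one possible approach, shown on simulated columns `x1`, `x2`, `x3`; the 0.9 cut-off is an assumption, not a rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.1, size=500),  # nearly a copy of x1
    "x3": rng.normal(size=500),                        # unrelated to x1
})

# Absolute pairwise correlations; keep only the upper triangle so that
# each pair of columns appears once.
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column whose correlation with an earlier column exceeds the cut-off.
threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Only the later column of each highly correlated pair is flagged, so one representative of the pair always stays in the data.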

In [None]:
# Mathematical Requirement - Feature Selection has happened - Feature Processing
# Categorical Data -> Categorical Transformation
# 2 - type of categorical data ->  Ordinal, Nominal 

# Ordinal categorical data has an order ( Negative, Neutral, Positive ) - numerical conversion
# Nominal Categorical data - has no order ( Red, Green Blue )  - Encoding of data

In [None]:
# ( Negative, Neutral, Positive ) -> ( -1 , 0 , +1 )
# Color

# Red
# Green
# Blue
# Green
# Green
# Blue
# Blue
# Red

In [None]:
# Add n number of columns , where n  = number of unique categories in the column
# for each existing value in the original categorical column - replace the new columns created with 1, rest all new columns will be 0

# Red , Green , Blue 
# 1 ,   0 ,     0
# 0 ,   1 ,     0
# 0 ,   0  ,    1
# 0 ,   1 ,     0
# 0 ,   1 ,     0
# 0 ,   0  ,    1
# 0 ,   0  ,    1
# 1 ,   0 ,     0

## Label Encoding 

In machine learning, we usually deal with datasets that contain labels in one or more columns. These labels can be words or numbers. To keep the data human-readable, training data is often labelled in words. Label encoding converts such labels into numbers so that a model can work with them.
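
The notes above describe two encodings: mapping each category to a number, and adding one 0/1 column per category. A minimal sketch of both on a hypothetical colour column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"])

# Label encoding: one integer per category, assigned in sorted order.
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)
print(encoder.classes_)                   # categories in sorted order
print(codes)                              # integer code of each row
print(encoder.inverse_transform(codes))   # back to the original strings

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Label encoding is compact but implies an order (Blue < Green < Red here), which is why one-hot encoding is often preferred for nominal data.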

In [None]:
# apply Label encoder to df_categorical
from sklearn.preprocessing import LabelEncoder

label_encoders = {}
for column in categorical_col:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

In [None]:
# look at the final data
df.head()

## Machine Learning: Splitting the data into Training and Testing sample
We don't use the full data for creating the model. Some data is randomly selected and kept aside for checking how good the model is. This is known as Testing data, and the remaining data, on which the model is built, is called Training data. Typically 70% of the data is used as Training data and the remaining 30% as Testing data.
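
A small sketch of a 70/30 split on made-up data follows. The `stratify` argument is an optional extra that is not used in the cell below; it keeps the class ratio identical in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 30% positive class.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)      # 70 training rows, 30 testing rows
print(y_tr.mean(), y_te.mean())    # share of positives in each part
```

Fixing `random_state` makes the split reproducible from run to run.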

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(['VistID', 'Year', 'Clicked'], axis=1)
y = df['Clicked']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Standardization/Normalization of data
You can choose not to run this step if you want to compare the resultant accuracy of this transformation with the accuracy of raw data.
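
As a minimal sketch of what standardization does, the example below uses two made-up columns on very different scales. The scaler learns its mean and standard deviation from the training rows only and reuses them for the test row:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data only
test_scaled = scaler.transform(test)        # reuse the training statistics

print(scaler.mean_)                                          # per-column training means
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))   # about 0 and 1
print(test_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1 on the training data, so neither dominates purely because of its units.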

In [None]:
# from sklearn.preprocessing import StandardScaler

# ss = StandardScaler()

# df[col] = ss.fit_transform(df[col])

In [None]:
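# Column-wise scaling with a ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (MinMaxScaler(), categorical_col),        # label-encoded categorical columns -> [0, 1]
    (StandardScaler(), numerical_col[:-1]),   # all numeric columns except the last one in the list
    remainder='passthrough'                   # keep any remaining columns unchanged
)

# Fit on the training data only, then apply the same transformation to the test data
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)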

In [None]:
# Data Processing for ML - 

# 1. Scaling / Standardisation
# 2. Train and Testing Split
# 3. Feature Selection - Correlation of num to num, num to cat, cat to cat

## Recursive Feature Elimination

The idea behind recursive feature elimination (RFE) is to repeatedly train a model that has parameters (also called weights or coefficients), such as linear regression or support vector machines. The first time we train the model, we include all the features. Then we find the feature with the smallest parameter in absolute value (this assumes the features are either rescaled or standardized), meaning it is the least important, and remove it from the feature set.

The obvious question then is: how many features should we keep? We can (hypothetically) repeat this loop until we only have one feature left. A better approach relies on a concept called cross-validation (CV). Here is the general idea.

Given data containing 1) a target we want to predict and 2) a feature matrix, first we split the data into two groups: a training set and a test set. Second, we train our model using the training set. Third, we pretend that we do not know the target of the test set, and apply our model to the test set’s features in order to predict the values of the test set. Finally, we compare our predicted target values with the true target values to evaluate our model.

We can use CV to find the optimum number of features to keep during RFE. Specifically, in RFE with CV after every iteration, we use cross-validation to evaluate our model. If CV shows that our model improved after we eliminated a feature, then we continue on to the next loop. However, if CV shows that our model got worse after we eliminated a feature, we put that feature back into the feature set and select those features as the best.

In scikit-learn, RFE with CV is implemented using `RFECV`, which has a number of important parameters. The `estimator` parameter determines the type of model we want to train (e.g., linear regression). The `step` parameter sets the number or proportion of features to drop during each loop. The `scoring` parameter sets the metric of quality we use to evaluate our model during cross-validation.
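
Before applying this to the ad data, here is a self-contained sketch of `RFECV` on synthetic data. The dataset from `make_classification`, the `max_iter=1000` setting and the 5-fold `cv` are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which carry signal.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)
X_syn = StandardScaler().fit_transform(X_syn)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,               # drop one feature per iteration
    scoring="accuracy",   # metric used during cross-validation
    cv=5,
)
selector.fit(X_syn, y_syn)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of kept features
print(selector.ranking_)      # 1 = kept; larger = eliminated earlier
```

The three printed attributes are the same ones inspected in the cells that follow.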

In [None]:
from sklearn.feature_selection import RFECV # Recursive feature elimination with cross validation
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# Use a classification metric for scoring, since the estimator is a classifier
rfe = RFECV(logreg, step=1, scoring="accuracy")
rfe = rfe.fit(X_train, y_train.values.ravel())
print(rfe.support_)
print(rfe.ranking_)

Once we have conducted RFE, we can see the number of features we should keep:

In [None]:
# Number of best features
rfe.n_features_

We can also see which of those features we should keep:

In [None]:
# Which categories are best
rfe.support_

In [None]:
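# ct reorders the columns, so take the names from the transformer rather than from X
ct.get_feature_names_out()[rfe.support_]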

We can even view the rankings of the features:

In [None]:
# Rank features best (1) to worst
rfe.ranking_

## Training a Binary Classifier

Despite having "regression" in its name, logistic regression is actually a widely used binary classifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\beta_0 + \beta_1 x$) is included in a logistic (also called sigmoid) function, $\frac{1}{1+e^{-z }}$, such that:
$$
P(y_i = 1 | X) = \frac{1}{1+e^{-(\beta_0 + \beta_1x)}}
$$
where $P(y_i = 1 | X)$ is the probability of the ith observation's target, $y_i$, being class 1, X is the training data, $\beta_0$ and $\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that it can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted.
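
The formula can be evaluated by hand. In the sketch below the coefficients `beta_0` and `beta_1` are hypothetical values picked for illustration, not learned from data:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a single-feature model.
beta_0, beta_1 = -3.0, 0.1

x = np.array([10.0, 30.0, 50.0])
prob = sigmoid(beta_0 + beta_1 * x)   # P(y = 1 | x)
pred = (prob > 0.5).astype(int)       # class 1 only if the probability exceeds 0.5

print(prob.round(3))   # approximately 0.119, 0.5 and 0.881
print(pred)            # classes 0, 0 and 1
```

The middle observation lands exactly on 0.5 and is therefore assigned to class 0, matching the "greater than 0.5" rule.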

In [None]:
# import the logistic regression algorithm
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0)

# Customisation of your logistic regression model
print(logreg)
logreg.fit(X_train, y_train)

In [None]:
y_pred = logreg.predict(X_test)
# Accuracy = share of test observations whose predicted class matches the actual class
accuracy = (y_pred == y_test.values).mean() * 100
print("Accuracy of Logistic Regression is : {}%".format(round(accuracy, 2)))

In [None]:
# import necessary packages to measure model performance
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, accuracy_score, f1_score, precision_score, recall_score,auc
from sklearn.model_selection import cross_val_score
from scipy import stats

<h2>Performance Measurement</h2>

#### 1. Confusion Matrix
- Each row: actual class
- Each column: predicted class

#### 2. Precision

**Precision** measures the accuracy of positive predictions: of all the instances the classifier labels positive, the fraction that really are positive.

$$\textrm{precision} = \frac{\textrm{True Positives}}{\textrm{True Positives} + \textrm{False Positives}}$$

#### 3. Recall

`Precision` is typically used together with `recall` (also called `Sensitivity` or `True Positive Rate`): the ratio of positive instances that are correctly detected by the classifier.

$$\textrm{recall} = \frac{\textrm{True Positives}}{\textrm{True Positives} + \textrm{False Negatives}}$$ 

#### 4. F1 Score

$F_1$ score is the harmonic mean of precision and recall. Regular mean gives equal weight to all values. Harmonic mean gives more weight to low values.


$$F_1=\frac{2}{\frac{1}{\textrm{precision}}+\frac{1}{\textrm{recall}}}=2\times \frac{\textrm{precision}\times \textrm{recall}}{\textrm{precision}+ \textrm{recall}}=\frac{TP}{TP+\frac{FN+FP}{2}}$$

The $F_1$ score favours classifiers that have similar precision and recall.
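
The three formulas can be checked against scikit-learn on a tiny example. The labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical actual and predicted labels for ten observations.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

Each pair of printed values should agree, since the library functions implement the same definitions.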

In [None]:
# Classification model - "My model's accuracy is 93%": what does that mean in classification, and why can it be misleading?

# Actual - True , False ( 2 ) 
# Predicted - True, False  ( 2 ) 
# Pred  Actual
# True  True -> True Positive Prediction
# True  False -> False Positive Prediction
# False True -> False negative
# False False -> True negative

In [None]:
# COVID Question ? 

# initial phase of the COVID detection
# Focused on -> True positive + False Negative => Actual positive
# Recall measure

# Later stage of COVID detection
# focused on -> True Positive + False Positive => Positive predictions made by me
# Precision Measure

In [None]:
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(10,8))
ax = sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'viridis_r', fmt = 'd')
# Workaround for matplotlib 3.1.1, which cuts the top and bottom rows of a heatmap in half
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()

print(classification_report(y_test, y_pred))

## Precision / Recall Tradeoff

Increasing precision reduces recall, and vice versa.

In [None]:
from sklearn.metrics import precision_recall_curve

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.title("Precisions/recalls tradeoff")

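# Use predicted probabilities of the positive class (not hard 0/1 predictions) so that many thresholds are evaluated
precisions, recalls, thresholds = precision_recall_curve(y_test, logreg.predict_proba(X_test)[:, 1])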

plt.figure(figsize=(15, 8))
plt.subplot(2, 2, 1)
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

plt.subplot(2, 2, 2)
plt.plot(precisions, recalls)
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.title("PR Curve: precisions/recalls tradeoff")
plt.show()

With this chart, you can select the threshold value that gives you the best precision/recall tradeoff for your task.

Some tasks call for higher precision (accuracy of positive predictions). An example is a classifier that approves content as safe for children: it should set a high bar before allowing any content through, even if it rejects some harmless content along the way.

Some tasks call for higher recall (ratio of positive instances that are correctly detected by the classifier). An example is detecting shoplifters or intruders in surveillance images, where anything that remotely resembles a "positive" instance should be picked up.
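
Choosing a threshold amounts to comparing predicted probabilities against a cut-off of your own instead of the default 0.5. A minimal sketch on made-up scores (with a fitted classifier, `scores` would typically come from `predict_proba(X)[:, 1]`):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])

results = []
for threshold in (0.35, 0.5, 0.75):
    y_hat = (scores >= threshold).astype(int)   # apply the custom cut-off
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy example, raising the threshold moves precision up and recall down, which is the tradeoff described above.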

<h2>The Receiver Operating Characteristics (ROC) Curve</h2>

Instead of plotting precision versus recall, the ROC curve plots the `true positive rate` (another name for recall) against the `false positive rate`. The `false positive rate` (FPR) is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the `true negative rate`, which is the ratio of negative instances that are correctly classified as negative.

The TNR is also called `specificity`. Hence the ROC curve plots `sensitivity` (recall) versus `1 - specificity`.
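
To connect these definitions to numbers, the sketch below computes one ROC point by hand and the area under the curve with scikit-learn. The labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_value = roc_auc_score(y_true, scores)

# Check one point by hand: count the outcomes at a threshold of 0.5.
y_hat = (scores >= 0.5).astype(int)
tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))
tn = np.sum((y_hat == 0) & (y_true == 0))

tpr_at_half = tp / (tp + fn)   # sensitivity (recall)
fpr_at_half = fp / (fp + tn)   # 1 - specificity
print("TPR:", tpr_at_half, "FPR:", fpr_at_half, "AUC:", auc_value)
```

A perfect classifier would reach an AUC of 1, while random guessing gives about 0.5 (the dashed diagonal in the plot below).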

In [None]:
from sklearn.metrics import roc_curve

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')

# Use predicted probabilities of the positive class so the curve covers many thresholds
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(12,8))
plot_roc_curve(fpr, tpr)
plt.show()

Use the PR curve whenever the **positive class is rare** or when you care more about the false positives than the false negatives.

Otherwise, use the ROC curve.

In the example above, the ROC curve seems to suggest that the classifier is good. However, the PR curve shows that there is room for improvement.

## Training a Multiclass Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data
target = iris.target

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

logistic_regression = LogisticRegression(random_state=0, multi_class="ovr")
#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class="multinomial")

model = logistic_regression.fit(features_standardized, target)

On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class to predict whether an observation is that class or not (thus making it a binary classification problem). It assumes that each classification problem (e.g. class 0 or not) is independent.

Alternatively, in multinomial logistic regression (MLR) the logistic function we saw earlier is replaced with a softmax function:
$$
P(y_i = k | X) = \frac{e^{\beta_k x_i}}{\sum_{j=1}^{K}{e^{\beta_j x_i}}}
$$
where $P(y_i = k | X)$ is the probability that the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of MLR is that its predicted probabilities, obtained with the `predict_proba` method, are more reliable.

We can switch to MLR by setting `multi_class='multinomial'`.
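
The softmax function itself is short enough to write out. In this sketch the three class scores stand in for the linear terms $\beta_k x_i$ and are hypothetical values:

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Hypothetical linear scores for K = 3 classes.
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)

print(probs.round(3))    # roughly 0.659, 0.242 and 0.099
print(probs.argmax())    # index of the most probable class
```

Subtracting the maximum does not change the result, because the same factor cancels in the numerator and denominator, but it avoids overflow for large scores.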

## Reducing Variance Through Regularization

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data
target = iris.target

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

logistic_regression = LogisticRegressionCV(
    penalty='l2', Cs=10, random_state=0, n_jobs=-1)

model = logistic_regression.fit(features_standardized, target)

Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize, typically the L1 or L2 penalty.

In the L1 penalty:
$$
\alpha \sum_{j=1}^{p}{|\hat\beta_j|}
$$
where $\hat\beta_j$ is the parameter of the jth of p features being learned and $\alpha$ is a hyperparameter denoting the regularization strength.

With the L2 penalty:
$$
\alpha \sum_{j=1}^{p}{\hat\beta_j^2}
$$
Higher values of $\alpha$ increase the penalty for larger parameter values (i.e. more complex models). scikit-learn follows the common method of using C instead of $\alpha$, where C is the inverse of the regularization strength: $C = \frac{1}{\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find the value of C that creates the best model. In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.
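
The effect of C can be seen directly on the fitted coefficients. The sketch below reuses the iris data; the three values of C and `max_iter=5000` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)

# Smaller C means a stronger L2 penalty and therefore smaller coefficients.
norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_std, iris.target)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C:<6}  coefficient norm={norms[C]:.3f}")
```

The coefficient norm grows with C, which is the shrinkage that the penalty terms above describe.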

## Training a Classifier on Very Large Data

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data
target = iris.target

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

logistic_regression = LogisticRegression(random_state=0, solver="sag") # stochastic average gradient (SAG) solver
model = logistic_regression.fit(features_standardized, target)

scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.

Stochastic average gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important.

## Handling Imbalanced Classes

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

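iris = datasets.load_iris()
# Drop the first 40 observations (all class 0) to make the classes imbalanced
features = iris.data[40:, :]
target = iris.target[40:]

# Merge classes 1 and 2 into a single class: 0 stays 0, everything else becomes 1
target = np.where((target == 0), 0, 1)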

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

logistic_regression = LogisticRegression(random_state=0, class_weight="balanced")
model = logistic_regression.fit(features_standardized, target)

`LogisticRegression` comes with a built-in method of handling imbalanced classes.
`class_weight="balanced"` will automatically weight classes inversely proportional to their frequency:
$$
w_j = \frac{n}{kn_j}
$$
where $w_j$ is the weight of class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes.
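
The formula can be verified against scikit-learn's `compute_class_weight` helper. The target below is hypothetical but mirrors the 10-versus-100 split produced by the slice above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 10 observations of class 0, 100 of class 1.
target_demo = np.array([0] * 10 + [1] * 100)

n = len(target_demo)
classes, counts = np.unique(target_demo, return_counts=True)
k = len(classes)

manual_weights = n / (k * counts)   # w_j = n / (k * n_j)
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=target_demo
)

print(manual_weights)    # 5.5 for class 0, 0.55 for class 1
print(sklearn_weights)
```

The rare class receives a weight ten times larger than the common class, so mistakes on it cost proportionally more during training.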