# Logistic Regression

<font color='steelblue'>

<span style="font-family:verdana; font-size:1.6em;">
    <strong>Logistic Regression Example</strong><br><br>
    Use the Income Prediction dataset and apply a data preprocessor to process the data<br><br>
</span>
<span style="font-family:verdana; font-size:1.4em;">
    <b>The following steps are included in the processing:</b><em>
    <ol>
        <li>Load dataset</li>
        <li>Convert target column from string to numbers</li>
        <li>Build a pipeline for data processing</li>
        <li>Apply the pipeline to the dataframe</li>
        <li>Build a Logistic Regression Model</li>
        <li>Explore trained model performance</li>
        <li>Make predictions using test dataset</li>
        <li>Explore model performance using Confusion Matrix</li>
        <li>Persist the pipeline and model</li>
    </ol></em>
</span>

</font>

## Dataset Review

The Adult dataset we are going to use is publicly available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult).
This data derives from census data, and consists of information about individuals and their annual income.
We will use this information to predict whether an individual earns **<=50K or >50K** a year.
The dataset is rather clean, and consists of both numeric and categorical variables.

Attribute Information:

- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...

<b>Target/Label: - <=50K, >50K</b>

In [None]:
%config IPCompleter.greedy = True

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# grids in the plots (on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid')
plt.style.use('seaborn-whitegrid')
import warnings
warnings.filterwarnings('ignore')

## Load Data

Load data from the datasets directory (agent.csv)

In [None]:
df = pd.read_csv('../datasets/agent.csv')

In [None]:
df.head()

In [None]:
df.shape

## Convert the income column
<span style="font-family:times, serif; font-size:16pt; font-style:bold">
Convert the income column from strings to numbers using LabelEncoder
</span>

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['income'] = labelencoder.fit_transform(df.income)

In [None]:
df['income'].tail()
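
To see which integer each label received, the fitted encoder can be inspected. The following is a minimal sketch on a made-up label column; the sample values are illustrative assumptions and are not read from the dataset file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the target column (illustrative values only)
income = pd.Series(['<=50K', '>50K', '<=50K', '>50K', '<=50K'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(income)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original income labels.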

## Handle Categorical columns
<span style="font-family:Arial; font-size:14pt; font-style:bold">
    Building a pipeline to process data: 
<ol>
<li>For numerical columns, replace missing values with the column mean</li>
<li>Then apply StandardScaler to standardize these columns</li>
<li>For categorical columns, fill any missing value with the constant 'missing'</li>
<li>Then perform One Hot Encoding (similar to pd.get_dummies)</li>
</ol>
</span>
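
The effect of these steps is easiest to see on a few rows. The sketch below applies both kinds of pipeline to toy data; the column values are invented for illustration, and the numeric imputer here uses the mean (`strategy='median'` would fill with the median instead).

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy columns with one missing value each
ages = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0]})
jobs = pd.DataFrame({'workclass': ['Private', np.nan, 'State-gov', 'Private']},
                    dtype=object)

num_pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                           ('scaler', StandardScaler())])
cat_pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant',
                                                     fill_value='missing')),
                           ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Missing age is filled with the mean (30), then the column is standardized
scaled = num_pipe.fit_transform(ages)
print(scaled.ravel())

# Missing workclass becomes its own 'missing' category, then one column per category
encoded = cat_pipe.fit_transform(jobs).toarray()
print(cat_pipe.named_steps['onehot'].categories_)
print(encoded)
```

Because the imputed value equals the column mean, it maps to 0 after scaling, and the missing category gets its own indicator column rather than being dropped.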

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
# Define the pipeline stages for numeric and categorical columns
numericPipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                              ('scaler', StandardScaler())])
stringPipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', 
                                                       fill_value='missing')),
                             ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
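# Create lists of numeric and categorical feature columns.
# Drop the target explicitly: LabelEncoder output is int32 on some platforms and
# int64 on others, so relying on its dtype to exclude it is fragile.
numericCols = df.drop(columns=['income']).select_dtypes(include=['int64', 'float64']).columns
stringCols = df.select_dtypes(include=['object']).columns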

In [None]:
numericCols

In [None]:
stringCols

<font color='red'>
    <h3>The income column (the target) should not appear in either list</h3>
</font>

## Process the pipeline
<span style="font-family:verdana; font-size:14pt; font-style:bold">
    Processing the pipeline 
<ol>
<li>Use ColumnTransformer to combine the numeric and categorical pipelines defined above</li>
<li>Fit and Transform the dataframe</li>
</ol>
    
<i>ColumnTransformer applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.</i>
</span>

In [None]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[('num', numericPipe, numericCols),
                                               ('cat', stringPipe, stringCols)])

In [None]:
df1 = preprocessor.fit_transform(df)

In [None]:
df1.shape

In [None]:
print(type(df1))
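
The result of `fit_transform` is not a DataFrame: depending on how many zeros the one-hot columns introduce, ColumnTransformer returns either a SciPy sparse matrix or a dense NumPy array (controlled by its `sparse_threshold` parameter). A minimal sketch on an invented four-row frame, where the column names mirror the dataset but the values are assumptions:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy frame (values are illustrative only)
toy = pd.DataFrame({
    'age': [25, 38, 52, 41],
    'hours-per-week': [40, 50, 40, 60],
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'race': ['White', 'Black', 'White', 'Other'],
})

toy_prep = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'race']),
])

out = toy_prep.fit_transform(toy)

# 2 scaled numeric columns + 2 'sex' categories + 3 'race' categories
print(out.shape)
print(type(out))

# Convert to a dense array only if the result came back sparse
dense = out.toarray() if sparse.issparse(out) else out
print(dense.shape)
```

scikit-learn estimators such as LogisticRegression accept both sparse and dense input, so the conversion is only needed for inspection.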

In [None]:
from sklearn.model_selection import train_test_split
X = df1
y = df['income'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40, 
                                                    random_state = 2345)

In [None]:
X_train.shape

In [None]:
X_test.shape

## Use Logistic Regression to build model
<span style="font-family:times, serif; font-size:14pt; font-style:bold">
    Build a model using the Logistic Regression algorithm

</span>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logReg = LogisticRegression(solver = 'lbfgs', random_state = 2345)
logReg.fit(X_train, y_train)
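
For intuition about what the fitted model computes: logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability for the positive class. The sketch below checks this on synthetic data generated with `make_classification`; the data and variable names are assumptions for illustration, not part of the income dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver='lbfgs', random_state=0).fit(X_demo, y_demo)

# Linear score z = w . x + b for every row
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid turns the score into the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# A positive score corresponds to a probability above 0.5, i.e. class 1
labels_manual = (z > 0).astype(int)
print(np.array_equal(labels_manual, model.predict(X_demo)))
```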

## Explore Trained Model Performance

<font color='teal'><h2>ROC Curve</h2></font>
<ul>
<span style="font-family:Arial; font-size:16pt; font-style:italic">
    <li>The ROC (Receiver Operating Characteristic) curve is a visual way to measure the performance of a binary classifier</li>
    <li>It is created by plotting the True Positive Rate (TPR, or recall) against the False Positive Rate (FPR) at varying classification thresholds</li>

</span></ul>


<font color='teal'><h2>AUC - Area Under the ROC curve</h2></font>

<ul>
    <span style="font-family:Arial; font-size:16pt; font-style:italic">
    <li>AUC is a single-number summary of classifier performance across all thresholds</li>
    <li>If it is near 0.5, the classifier is not much better than random guessing</li>
    <li>The classifier is better the closer the AUC is to 1</li>
    <li>A value close to 1 indicates that the classifier separates the two classes well:
it reaches a high true positive rate (individuals earning above 50K identified as such)
while keeping the false positive rate (individuals earning at most 50K classified as above 50K) low</li> 

</span></ul>
<font color='tomato'><h2>ONLY VALID FOR BINARY CLASSIFICATION</h2></font>
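
The definitions above can be checked by hand on a small example. This sketch uses eight invented labels and scores (assumptions for illustration) to compute TPR and FPR at one threshold, then compares the area under the ROC curve with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

def rates_at(threshold):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / np.sum(y_true == 1), fp / np.sum(y_true == 0)

# One point on the ROC curve
tpr_05, fpr_05 = rates_at(0.5)
print(tpr_05, fpr_05)

# roc_curve sweeps all thresholds; auc integrates the resulting curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, scores))
```

Note that both functions take scores or probabilities as input; passing hard 0/1 predictions collapses the curve to a single threshold.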

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
# Use predicted probabilities (not hard class labels) for both the AUC and the ROC curve
y_score = logReg.predict_proba(X_test)[:, 1]
logReg_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

auc = str(np.round(logReg_auc, 4))
plt.figure(figsize = (10, 6))
plt.plot([0, 1], [0, 1], linestyle = '--', color = 'red', label = 'Random guess')
plt.plot(fpr, tpr, label = "Test AUC " + auc)
# FPR is on the x-axis and TPR on the y-axis
plt.xlabel('False Positive Rate', fontsize = 16)
plt.ylabel('True Positive Rate', fontsize = 16)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.legend(loc = 4, fontsize = 16)
plt.show()

## Make Predictions

In [None]:
# make predictions on the test data and save them
y_pred = logReg.predict(X_test)

In [None]:
# Score how well our model performed (mean accuracy on the test data)
logReg.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# Print accurate and erroneous prediction counts
correct = cm[0, 0] + cm[1, 1]
error = cm[0, 1] + cm[1, 0]
total = correct + error
print('Correct predictions: {} of {}'.format(correct, total))
print('Errored predictions: {} of {}'.format(error, total))
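
The four cells of the confusion matrix also yield the metrics shown in the classification report. A minimal sketch with ten invented labels and predictions (illustrative assumptions), using scikit-learn's layout where rows are true labels and columns are predicted labels:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```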

In [None]:
cseg = ["<= 50K", "> 50K"]
cm_df = pd.DataFrame(cm, index = cseg, columns = cseg)

In [None]:
# Plot the confusion matrix
import seaborn as sns
plt.figure(figsize = (10, 6))
sns.heatmap(cm_df, annot=True, cmap=plt.cm.Blues, fmt = 'g', annot_kws={"size": 16})
sns.set(font_scale=0.5)
plt.title('Confusion Matrix\n', fontsize = 18)
plt.ylabel('True label', fontsize = 16)
plt.xlabel('Predicted label', fontsize = 16)
plt.show()

## Persist the preprocessor so that it can be reused

In [None]:
import pickle

# Use a context manager so the file is closed after writing
with open('../preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)

## Persist the model
<span style="font-family:times, serif; font-size:14pt; font-style:bold">
    Persist the model using the joblib library

</span>

In [None]:
from joblib import dump

In [None]:
dump(logReg, '../logRegModel.joblib')
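
Reusing the persisted objects means loading both files and calling `transform` (not `fit_transform`) on new rows before predicting. The sketch below runs the full round trip on a tiny invented dataset and writes to a temporary directory; the data, column subset, and paths are assumptions for illustration rather than the notebook's actual files.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical training data (illustrative only)
train = pd.DataFrame({'age': [25, 38, 52, 41, 29, 60],
                      'sex': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male']})
target = np.array([0, 0, 1, 1, 0, 1])

prep = ColumnTransformer([('num', StandardScaler(), ['age']),
                          ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex'])])
clf = LogisticRegression(solver='lbfgs').fit(prep.fit_transform(train), target)

with tempfile.TemporaryDirectory() as tmp:
    prep_path = os.path.join(tmp, 'preprocessor.pkl')
    model_path = os.path.join(tmp, 'logRegModel.joblib')

    # Persist both objects
    with open(prep_path, 'wb') as f:
        pickle.dump(prep, f)
    joblib.dump(clf, model_path)

    # Load them back, as a separate scoring script would
    with open(prep_path, 'rb') as f:
        prep_loaded = pickle.load(f)
    clf_loaded = joblib.load(model_path)

# New rows are transformed with the already-fitted preprocessor, never refitted
new_rows = pd.DataFrame({'age': [33, 57], 'sex': ['Female', 'Male']})
pred_loaded = clf_loaded.predict(prep_loaded.transform(new_rows))
pred_original = clf.predict(prep.transform(new_rows))
print(pred_loaded, pred_original)
```

Pickle and joblib files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.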