# Logistic Regression for Docker App

The purpose of this project is to develop a simple Logistic Regression model for a production app deployed with Docker. The focus is on deploying the app rather than on model performance; we can come back later and improve the model by tuning its parameters.

The input dataframe has three feature columns and one target column, the latter being the **"converted"** column. We will try to predict whether a sale is converted or not.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('online_sales.csv')
df.head()

   'age  new_user  total_pages_visited  converted
0    25         1                    1          0
1    23         1                    5          0
2    28         1                    4          0
3    39         1                    5          0
4    30         1                    6          0


In [3]:
# Let's check the size of the dataframe (rows, columns)
df.shape

(316200, 4)

### Target class balance

Let's check whether the target class is balanced or whether we need to resample before training the model. Since we are not looking for the best model, any resampling is deferred to a later time.

Let's also check if there is missing information.

In [4]:
df['converted'].value_counts()

converted
0    306000
1     10200
Name: count, dtype: int64
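
The counts above show a strongly imbalanced target. As a rough illustration of why that matters, the sketch below redoes the arithmetic from those counts: it computes the share of converted rows and the accuracy a trivial "always predict 0" model would reach. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 306000, 1: 10200}

total = sum(class_counts.values())

# Share of rows that belong to the positive ("converted") class.
positive_share = class_counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = max(class_counts.values()) / total

print(f"positive share: {positive_share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Because the trivial baseline already scores close to 97% accuracy, accuracy alone says little here; precision and recall on the converted class are more informative.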

In [5]:
# Missing values?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316200 entries, 0 to 316199
Data columns (total 4 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   'age                 316200 non-null  int64
 1   new_user             316200 non-null  int64
 2   total_pages_visited  316200 non-null  int64
 3   converted            316200 non-null  int64
dtypes: int64(4)
memory usage: 9.6 MB


### Defining train and test sets

In [6]:
input_columns = [column for column in df.columns if column != 'converted']
input_columns

["'age", 'new_user', 'total_pages_visited']

In [7]:
output_column = 'converted'

In [8]:
X = df[input_columns].values

In [9]:
X.shape

(316200, 3)

In [10]:
y = df[output_column]

In [11]:
y.shape

(316200,)

In [12]:
# Import sklearn dependencies
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=555, stratify=y)

In [14]:
np.sum(y_train)

7140

In [15]:
np.sum(y_test)

3060
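
The two sums above can be used to check that `stratify=y` kept the class ratio the same in both splits. The sketch below only repeats that arithmetic with the figures printed in this notebook; the test-set size is derived from `test_size=0.3`, and the variable names are illustrative.

```python
# Figures taken from the notebook outputs above.
total_rows, total_converted = 316200, 10200
train_converted, test_converted = 7140, 3060

# test_size=0.3 means 30% of the rows go to the test set.
test_rows = total_rows * 3 // 10
train_rows = total_rows - test_rows

# Conversion rate in the full data and in each split.
overall_rate = total_converted / total_rows
train_rate = train_converted / train_rows
test_rate = test_converted / test_rows

print(f"overall: {overall_rate:.4f}  train: {train_rate:.4f}  test: {test_rate:.4f}")
```

All three rates agree, which is the behaviour expected from a stratified split.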

### Define and fit the model

In [16]:
logreg = LogisticRegression(class_weight='balanced').fit(X_train,y_train)
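
For context, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to the training split. The training size (221,340 rows, i.e. 70% of 316,200) is derived from the split above, and the helper name `balanced_class_weights` is hypothetical, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Apply the documented 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Training split: 221,340 rows in total, 7,140 of them converted.
train_counts = {0: 221340 - 7140, 1: 7140}
weights = balanced_class_weights(train_counts)

print(weights)
```

Under these assumptions the converted class is weighted about 30 times more heavily than the non-converted class, which pushes the model toward higher recall on conversions at the cost of precision.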

In [17]:
logreg.score(X_test,y_test)

0.9369175627240144

In [18]:
# use the model to make predictions
predictions = logreg.predict(X_test)

### Confusion matrix

We will use a classification report, which is derived from the confusion matrix, to find the recall and precision scores for this model.

In [19]:
# Import dependencies
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [20]:
print(classification_report(y_test, predictions, target_names=['Non Converted', 'Converted']))

               precision    recall  f1-score   support

Non Converted       1.00      0.94      0.97     91800
    Converted       0.33      0.92      0.49      3060

     accuracy                           0.94     94860
    macro avg       0.66      0.93      0.73     94860
 weighted avg       0.98      0.94      0.95     94860
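
To make the link between the confusion matrix and the reported scores explicit, here is a minimal sketch on toy labels (not the notebook's data). It counts the four cells by hand and derives precision, recall and F1 for the positive class. For 0/1 labels, `sklearn.metrics.confusion_matrix` lays out the same counts as `[[tn, fp], [fn, tp]]`; the helper `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy labels for illustration only.
y_true_toy = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)

precision = tp / (tp + fp)  # share of predicted conversions that were real
recall = tp / (tp + fn)     # share of real conversions that were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp, precision, recall, f1)
```

Read this way, the report above (precision 0.33, recall 0.92 for the converted class) says the model finds most real conversions but also flags many non-converting customers.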



### Export the model

This is the critical part of this notebook. We will use pickle (joblib is an alternative) to export the model that will be used later to make predictions. We will handle two types of predictions:

- Single customer.  
- Group/multiple customers.
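
One possible way to organise the two prediction modes is a pair of thin wrappers around any object that exposes a scikit-learn style `predict` method. This is only a sketch: `predict_single`, `predict_batch` and the `ThresholdStub` class are hypothetical names, and the stub is a stand-in so the example runs without the fitted model.

```python
class ThresholdStub:
    """Hypothetical stand-in for the fitted model; NOT the real classifier."""

    def predict(self, rows):
        # Flags a row as converted when the third feature
        # (total_pages_visited) is greater than 10.
        return [1 if row[2] > 10 else 0 for row in rows]


def predict_single(model, age, new_user, total_pages_visited):
    """Predict for one customer; the model expects a 2-D input (one row)."""
    row = [[age, new_user, total_pages_visited]]
    return int(model.predict(row)[0])


def predict_batch(model, rows):
    """Predict for several customers; each row holds the three features in order."""
    return [int(p) for p in model.predict(rows)]


stub = ThresholdStub()
single_result = predict_single(stub, 45, 0, 5)
batch_result = predict_batch(stub, [[30, 1, 12], [25, 0, 3]])
print(single_result, batch_result)
```

The single-customer case is just a batch of one row, so both wrappers share the same model call and the same feature order as the training data.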

In [21]:
import pickle

In [22]:
# define a pickle object with the name of the file to which the model will be written, then dump the model
pickle_out = open('logreg.pkl', 'wb')
pickle.dump(logreg, pickle_out)

In [23]:
pickle_out.close()
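
An alternative way to write and read the file is with `with` blocks, which close the file even if an error is raised in between. The sketch below round-trips a stand-in object through a temporary directory; the dictionary and its values are hypothetical placeholders for the fitted model, and the file name is illustrative.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model (values are made up).
stand_in_model = {"coef": [0.1, -0.2, 0.3], "intercept": -1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write the object; the file is closed when the block exits.
    with open(path, "wb") as pickle_out:
        pickle.dump(stand_in_model, pickle_out)

    # Read it back the same way.
    with open(path, "rb") as pickle_in:
        restored = pickle.load(pickle_in)

print(restored == stand_in_model)
```

Keep in mind that a pickled scikit-learn model should be loaded with the same library version that produced it, and that pickle files from untrusted sources should never be loaded.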

### Test the pickled model

We will load the pickled model to test if it is making correct predictions.

In [24]:
# Load the model; the with block closes the file once it has been read
with open('logreg.pkl', 'rb') as pickle_in:
    model = pickle.load(pickle_in)

In [26]:
model.predict([[45,0,5]])[0]

0

In [27]:
# Predict for a group of customers using the data in test_data.csv
df_test = pd.read_csv('test_data.csv')

In [28]:
# Pass a plain array, matching the format the model was trained on
predictions = model.predict(df_test.values)



In [29]:
print(predictions)

[0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0]


We are now ready to build a Flask app to deploy the model.