# Background

In this notebook we'll train a [logistic regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between spam data (food reviews) and legitimate data (Austen).

Logistic regression is a standard statistical technique for modelling a binary variable. In our case, the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Combined with a reasonable feature engineering approach, logistic regression is often a sensible first choice for a classification problem!
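
To make the idea concrete, here is a minimal sketch of how a logistic regression model turns a feature vector into a class prediction: a weighted sum of the features is passed through the logistic (sigmoid) function to give a probability, which is then thresholded. The weights, bias and feature values below are made up for illustration; they are not taken from the model we train in this notebook.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real-valued score into the interval (0, 1)."""
    # Two branches avoid overflow in exp() for large positive or negative scores.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(features, weights, bias):
    """Probability of the 'spam' class for a single feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical parameters and features, chosen only for illustration.
weights = [4.0, -2.0]
bias = -0.5
features = [0.3, 0.1]

p = spam_probability(features, weights, bias)  # score is 0.5, so p is about 0.62
label = "spam" if p >= 0.5 else "legitimate"
print(round(p, 4), label)
```

Training amounts to choosing the weights and bias so that these probabilities agree with the labels in the training data as closely as possible.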

We begin by loading the feature vectors which we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb). The cell below loads the TF-IDF features.

In [1]:
import pandas as pd
import os.path

feats = pd.read_parquet(os.path.join("data", "features_tfidf.parquet"))

When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check.  In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!

In [2]:
feats.sample(10)

Unnamed: 0,index,label,0,1,2,3,4,5,6,7,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
26441,6441,spam,0.0,0.154098,0.0,0.0,0.0,0.0,0.306593,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38964,18964,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21345,1345,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.231271,0.0,0.0,0.0
186,186,legitimate,0.0,0.0,0.0,0.0,0.13503,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21122,1122,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.144964,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21176,1176,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39405,19405,spam,0.0,0.0,0.0,0.088632,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14200,14200,legitimate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13687,13687,legitimate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.152935,0.0,0.0
11363,11363,legitimate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.19175,0.0,0.0,0.0,0.0


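The first two columns of the `feats` data frame are `index` and `label` (the unlabelled leftmost column in the output above is the data frame's own row index). The remaining 1024 columns, named `0` to `1023`, hold the components of each feature vector.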
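
Since the rest of the notebook relies on this column layout, it can be worth checking it explicitly. The helper below is a hypothetical sketch (it is not part of `mlworkflows`), and it runs on a small made-up data frame with 4 feature columns rather than on `feats` itself.

```python
import pandas as pd


def split_features_and_labels(df):
    """Return (X, y), checking that the first two columns are 'index' and 'label'."""
    if list(df.columns[:2]) != ["index", "label"]:
        raise ValueError("expected the first two columns to be 'index' and 'label'")
    X = df.iloc[:, 2:]  # everything from the third column onwards is a feature
    y = df["label"]
    # Every feature column should hold numbers, not strings.
    if not all(pd.api.types.is_numeric_dtype(t) for t in X.dtypes):
        raise ValueError("found a non-numeric feature column")
    return X, y


# Toy stand-in for `feats`, for illustration only.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "label": ["legitimate", "spam", "spam"],
    "0": [0.0, 0.15, 0.0],
    "1": [0.13, 0.0, 0.0],
    "2": [0.0, 0.0, 0.23],
    "3": [0.0, 0.31, 0.0],
})

X, y = split_features_and_labels(toy)
print(X.shape, sorted(y.unique()))
```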

Next, we split the data into two sets:

* `train` - a set of feature vectors which will be used to train the model
* `test` - a set of feature vectors which will be used to evaluate the model we trained

In [3]:
from sklearn import model_selection
train, test = model_selection.train_test_split(feats, random_state=43)
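
By default, `train_test_split` shuffles the rows and holds out 25% of them for the test set. If we wanted to guarantee that both sets keep the same proportion of spam and legitimate examples, one option is the `stratify` argument. The sketch below uses a small made-up data frame (`toy`) rather than `feats`.

```python
import pandas as pd
from sklearn import model_selection

# Toy data: 20 legitimate and 20 spam rows with a single feature column.
toy = pd.DataFrame({
    "index": range(40),
    "label": ["legitimate"] * 20 + ["spam"] * 20,
    "0": [i / 40 for i in range(40)],
})

# stratify= keeps the class balance the same in both halves of the split.
train_toy, test_toy = model_selection.train_test_split(
    toy, random_state=43, stratify=toy["label"]
)

print(len(train_toy), len(test_toy))
print(test_toy["label"].value_counts().to_dict())
```

Fixing `random_state`, as the cell above does, makes the split reproducible from one run to the next.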

In [4]:
from sklearn.linear_model import LogisticRegression

In [5]:
model = LogisticRegression(solver = 'lbfgs', max_iter = 4000)

In [6]:
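# Train the model and time how long fitting takes.
import time

start = time.time()
# Columns 0 and 1 are `index` and `label`; the features start at column 2.
model.fit(X=train.iloc[:, 2:], y=train["label"])
end = time.time()
print(end - start)  # elapsed wall-clock time in seconds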


1.664494514465332


With the model trained, we can use it to make predictions. We apply the model to the `test` set, then compare the predicted labels (spam or legitimate) with the true labels.

In [7]:
predictions = model.predict(test.iloc[:, 2:])

In [8]:
predictions

array(['legitimate', 'spam', 'legitimate', ..., 'spam', 'legitimate',
       'legitimate'], dtype=object)
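
`predict` returns hard labels only. If we also wanted the estimated class probabilities behind those labels, scikit-learn's `predict_proba` provides them, with columns ordered as in `classes_`. The sketch below fits a throwaway model on made-up one-dimensional data (`X_toy`, `y_toy`), not on our TF-IDF features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real feature vectors.
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array(["legitimate"] * 3 + ["spam"] * 3)

toy_model = LogisticRegression(solver="lbfgs", max_iter=4000).fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)  # shape (n_samples, n_classes)
classes = toy_model.classes_            # column order of `proba`
hard = toy_model.predict(X_toy)

# For these inputs, the hard label is the class with the larger probability.
from_proba = classes[proba.argmax(axis=1)]
print(classes)
print((hard == from_proba).all())
```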

We use a binary confusion matrix to visualise how the model's predictions compare with the true labels.

In [9]:
from mlworkflows import plot

In [10]:
df, chart = plot.binary_confusion_matrix(test["label"], predictions)

In [11]:
chart

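We can look at the raw counts and the proportions of correctly and incorrectly classified items. Each `value` is the `raw_count` divided by the total for the class in the first column, so the two values for each first-column class sum to 1: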

In [12]:
df

Unnamed: 0,predicted,actual,raw_count,value
0,legitimate,legitimate,4749,0.949041
1,legitimate,spam,255,0.050959
2,spam,legitimate,115,0.023018
3,spam,spam,4881,0.976982
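
As a quick check on how the `value` column relates to `raw_count`, the sketch below recomputes the proportions from the counts in the table above. It makes no use of `mlworkflows`; the counts are simply copied in by hand.

```python
# Counts copied from the table above, keyed by (first column, second column).
counts = {
    ("legitimate", "legitimate"): 4749,
    ("legitimate", "spam"): 255,
    ("spam", "legitimate"): 115,
    ("spam", "spam"): 4881,
}

# Total number of items for each class in the first column.
totals = {}
for (first, _), n in counts.items():
    totals[first] = totals.get(first, 0) + n

# Each proportion is a count divided by its first-column total.
proportions = {key: n / totals[key[0]] for key, n in counts.items()}

for key, value in proportions.items():
    print(key, round(value, 6))
```

The totals come out as 5004 and 4996, and the rounded proportions reproduce the `value` column.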


We can also look at the precision, recall and F1-score for each class.

In [13]:
from sklearn.metrics import classification_report
print(classification_report(test["label"], predictions))

              precision    recall  f1-score   support

  legitimate       0.98      0.95      0.96      5004
        spam       0.95      0.98      0.96      4996

    accuracy                           0.96     10000
   macro avg       0.96      0.96      0.96     10000
weighted avg       0.96      0.96      0.96     10000
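
These figures can be reproduced by hand from the confusion matrix counts shown earlier. The sketch below assumes that the first label in each row of that table is the true class and the second is the predicted class; that reading is an assumption, made because the row totals (5004 and 4996) then match the `support` column above.

```python
# Counts keyed as (true class, predicted class); see the assumption above.
counts = {
    ("legitimate", "legitimate"): 4749,
    ("legitimate", "spam"): 255,
    ("spam", "legitimate"): 115,
    ("spam", "spam"): 4881,
}


def precision_recall_f1(counts, cls):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    tp = counts[(cls, cls)]
    fn = sum(n for (true, pred), n in counts.items() if true == cls and pred != cls)
    fp = sum(n for (true, pred), n in counts.items() if true != cls and pred == cls)
    precision = tp / (tp + fp)  # of the items predicted as cls, how many were cls
    recall = tp / (tp + fn)     # of the items that were cls, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


correct = sum(n for (true, pred), n in counts.items() if true == pred)
accuracy = correct / sum(counts.values())

for cls in ("legitimate", "spam"):
    print(cls, [round(v, 2) for v in precision_recall_f1(counts, cls)])
print("accuracy", round(accuracy, 2))
```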



We want to save the model so that we can use it outside of this notebook.

In [14]:
model

In [15]:
from mlworkflows import util
util.serialize_to(model, "model_tfidf_logistic.sav")
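
We don't look inside `util.serialize_to` here. As a rough sketch of the general idea, and assuming a pickle-based approach (which may differ from what `mlworkflows` actually does), a fitted scikit-learn model can be saved and restored like this. The model and data below are throwaway examples, and an in-memory buffer stands in for a file on disk.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the one trained in this notebook.
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array(["legitimate"] * 3 + ["spam"] * 3)
toy_model = LogisticRegression(solver="lbfgs", max_iter=4000).fit(X_toy, y_toy)

# Serialise to an in-memory buffer, then read the model back.
buffer = io.BytesIO()
pickle.dump(toy_model, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# The restored model should make the same predictions as the original.
same = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same)
```

Pickled models should only be loaded from sources we trust, and ideally with the same scikit-learn version that was used to save them.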