# Background

In this notebook we'll train a [logistic regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between spam data (food reviews) and legitimate data (Austen).

Logistic regression is a standard statistical technique for modelling a binary variable. In our case, the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Combined with a reasonable feature engineering approach, logistic regression is often a sensible first choice for a classification problem!
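
To make the idea concrete, here is a minimal sketch of how a fitted logistic regression turns a feature vector into a prediction. The weights, bias and feature values are made up for illustration; they are not taken from the model trained in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative values only).
weights = np.array([0.8, -0.5])
bias = 0.1

# One hypothetical feature vector.
x = np.array([2.0, 1.0])

# Linear score, then the probability of the positive class ("spam" here).
score = weights @ x + bias
prob_spam = sigmoid(score)

# Threshold the probability at 0.5 to obtain a hard label.
label = "spam" if prob_spam >= 0.5 else "legitimate"
print(prob_spam, label)
```

Training amounts to choosing the weights and bias so that these probabilities match the labels in the training data as closely as possible.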

We begin by loading the feature vectors that we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb).

In [1]:
import pandas as pd
import os.path

feats = pd.read_parquet(os.path.join("data", "features.parquet"))

When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check.  In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!

In [2]:
feats.sample(10)

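| | index | label | no_punct | number_words | mean_wl | max_wl | min_wl | pc_10_wl | pc_90_wl | upper | stop_words |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38487 | 18487 | spam | 11 | 70 | 4.628571 | 14 | 1 | 2.0 | 7.0 | 11 | 31 |
| 38401 | 18401 | spam | 18 | 99 | 4.151515 | 12 | 1 | 2.0 | 7.0 | 15 | 44 |
| 21777 | 1777 | spam | 14 | 93 | 4.451613 | 12 | 1 | 2.0 | 8.0 | 22 | 40 |
| 36305 | 16305 | spam | 6 | 72 | 3.847222 | 9 | 1 | 2.0 | 7.0 | 6 | 35 |
| 7625 | 7625 | legitimate | 9 | 54 | 4.259259 | 11 | 1 | 2.0 | 7.0 | 10 | 25 |
| 24848 | 4848 | spam | 13 | 58 | 4.551724 | 10 | 1 | 2.0 | 7.3 | 8 | 25 |
| 17089 | 17089 | legitimate | 7 | 46 | 5.021739 | 11 | 1 | 2.0 | 10.0 | 4 | 23 |
| 33058 | 13058 | spam | 20 | 119 | 3.983193 | 11 | 1 | 2.0 | 7.0 | 16 | 59 |
| 14449 | 14449 | legitimate | 20 | 127 | 4.385827 | 12 | 1 | 2.0 | 8.4 | 14 | 76 |
| 6538 | 6538 | legitimate | 17 | 43 | 4.930233 | 12 | 1 | 2.0 | 8.8 | 4 | 21 |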


The first two columns of the `feats` DataFrame are `index` and `label`. The remaining columns hold the features.
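
Because of this layout, the feature columns can be selected either by position or by name. The sketch below uses a small made-up frame (`toy`, with invented values) that mimics the layout of `feats`, and shows that both approaches give the same result.

```python
import pandas as pd

# Toy frame mimicking the layout of `feats` (values are made up).
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "label": ["spam", "legitimate", "spam"],
    "no_punct": [11, 9, 13],
    "number_words": [70, 54, 58],
})

# Positional selection: skip the first two columns.
X_by_position = toy.iloc[:, 2:]

# Name-based selection: drop the non-feature columns explicitly.
X_by_name = toy.drop(columns=["index", "label"])
y = toy["label"]

print(X_by_position.equals(X_by_name))
```

Selecting by name is a little more verbose, but it keeps working if the column order changes.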

We begin by splitting the data into two sets (by default, `train_test_split` holds out 25% of the rows for testing):

* `train` - a set of feature vectors which will be used to train the model
* `test` - a set of feature vectors which will be used to evaluate the model we trained

In [3]:
from sklearn import model_selection
train, test = model_selection.train_test_split(feats, random_state=43)
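
As a rough illustration of what the split produces, the sketch below runs `train_test_split` on a made-up 100-row frame. The `stratify` argument is not used in this notebook; it is shown only as an option for keeping the class proportions the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 50 spam rows and 50 legitimate rows (values are made up).
toy = pd.DataFrame({
    "label": ["spam"] * 50 + ["legitimate"] * 50,
    "number_words": range(100),
})

# Default behaviour: 75% of rows go to train, 25% to test.
train_toy, test_toy = train_test_split(toy, random_state=43)
print(len(train_toy), len(test_toy))

# Optional: stratify on the label so both sets keep roughly the same class balance.
train_s, test_s = train_test_split(toy, random_state=43, stratify=toy["label"])
print(test_s["label"].value_counts())
```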

In [4]:
from sklearn.linear_model import LogisticRegression

In [5]:
model = LogisticRegression(solver='lbfgs', max_iter=4000)

In [6]:
# Train the model on the feature columns and time how long fitting takes.
import time

start = time.time()
model.fit(X=train.iloc[:, 2:], y=train["label"])
end = time.time()
print(end - start)


0.6096940040588379


With the model trained, we can use it to make predictions. We apply the model to the `test` set, then compare each predicted label (spam or legitimate) to the true label.

In [7]:
predictions = model.predict(test.iloc[:, 2:])

In [8]:
predictions

array(['legitimate', 'spam', 'legitimate', ..., 'spam', 'legitimate',
       'spam'], dtype=object)
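
The simplest way to compare predictions with the truth is the fraction of labels that match. The sketch below uses two short made-up label arrays rather than the notebook's `test` set, and checks the hand-computed value against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels, for illustration only.
y_true = np.array(["spam", "legitimate", "spam", "legitimate", "spam"])
y_pred = np.array(["spam", "legitimate", "legitimate", "legitimate", "spam"])

# Accuracy by hand: the fraction of positions where the labels agree.
manual_accuracy = (y_true == y_pred).mean()

# The same quantity computed by scikit-learn.
sk_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
```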

In [15]:
import pickle
logmodel = pickle.dumps(model)

We use a binary confusion matrix to visualise the accuracy of the model. Importing the `plot` helper from `mlworkflows` requires the `altair` package to be installed.

In [9]:
from mlworkflows import plot

In [10]:
df, chart = plot.binary_confusion_matrix(test["label"], predictions)

In [11]:
chart

We can look at the raw numbers and the proportions of correctly and incorrectly classified items:

In [12]:
df

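
The same raw counts and proportions can also be computed with scikit-learn's `confusion_matrix`. This is a swapped-in alternative to the `mlworkflows` helper, not that helper's implementation, and the labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
labels = ["legitimate", "spam"]
y_true = ["spam", "legitimate", "spam", "legitimate", "spam", "legitimate"]
y_pred = ["spam", "legitimate", "legitimate", "legitimate", "spam", "spam"]

# Rows are true labels, columns are predicted labels, both ordered as in `labels`.
counts = confusion_matrix(y_true, y_pred, labels=labels)

# normalize="true" divides each row by its total, giving per-class proportions.
proportions = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(counts)
print(proportions)
```

With the notebook's data, the equivalent call would pass `test["label"]` and `predictions` in place of the toy lists.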

We can also look at the precision, recall and F1-score for the model.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test.label.values, predictions))
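
To show where these numbers come from, the sketch below computes precision, recall and F1 by hand for the "spam" class on made-up labels, then checks them against scikit-learn's `precision_recall_fscore_support`. Treating "spam" as the positive class is an assumption made for this example.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels, for illustration only.
y_true = ["spam", "legitimate", "spam", "legitimate", "spam", "legitimate"]
y_pred = ["spam", "legitimate", "legitimate", "legitimate", "spam", "spam"]

# Count true positives, false positives and false negatives for "spam".
pairs = list(zip(y_true, y_pred))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)
fp = sum(t != "spam" and p == "spam" for t, p in pairs)
fn = sum(t == "spam" and p != "spam" for t, p in pairs)

precision = tp / (tp + fp)  # of the items predicted spam, how many were spam
recall = tp / (tp + fn)     # of the true spam items, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The same metrics from scikit-learn, for the "spam" class only.
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="spam", average="binary"
)

print(precision, recall, f1)
```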

We want to save the model so that we can use it outside of this notebook.

In [None]:
model

In [None]:
from mlworkflows import util
util.serialize_to(model, "model.sav")
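
How `util.serialize_to` stores the model is not shown in this notebook. As a rough sketch of the general idea, assuming plain `pickle` is acceptable, the example below saves a toy model to a temporary file and loads it back. The toy data, the `toy_model` name and the file name are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a tiny toy model (made-up data) so there is something to save.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["legitimate", "legitimate", "spam", "spam"])
toy_model = LogisticRegression(solver="lbfgs").fit(X, y)

# Write the model to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and confirm the restored model predicts the same labels.
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == toy_model.predict(X)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.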