## Detecting false statements with L1-regularized logistic regression

You have been hired by a news wire service to train a machine learning model to identify potentially false statements and claims made by public figures. These statements will be flagged for additional fact-checking by a human fact checker.

To train your model, you are given a dataset of already-fact-checked statements from the past decade, half of which have been evaluated as true and half of which have been evaluated as false.

In the attached workspace, you will load in this data, then use it to train a [LogisticRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) model. Since the text data has many features once converted to a "bag of words" representation, you will also explore the use of L1 regularization.

> Note: your code will be evaluated on a random array of C values! For this reason, the grader feedback may change even if your code does not. For example, incorrect code might happen to produce the correct result for one random array of values but not for another.

| Name | Type | Description |
| ---- | ---- | ---- |
| Xtr_vec | 2d numpy array | Vector representation of training data. |
| Xts_vec | 2d numpy array | Vector representation of test data. |
| acc_val | 2d numpy array | Accuracy of LogisticRegression model for each fold and each value of C. |
| C_best | float | Value of C for which the validation accuracy is highest. |
| acc_best | float | Test accuracy of model trained with C_best. |
| count_best | int | Number of non-zero coefficients of model trained with C_best. |
| C_one_se | float | Value of C to use according to one-SE rule. |
| acc_one_se | float | Test accuracy of model trained with C_one_se. |
| count_one_se | int | Number of non-zero coefficients of model trained with C_one_se. |

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold


First, load in the data:

In [2]:
df = pd.read_csv("data_factcheck.csv")

The data frame includes:

* a `statement` field, which has the text of the statement, and
* a `label_binary` field indicating whether it is false (1) or true/mostly true (0).


In [3]:
df.sample(5)

Unnamed: 0,statement,label_binary
1352,Nearly one in four people in their prime worki...,0
286,More than 43 percent of all food stamps are gi...,1
756,Gov. Chris Christie owes the state money for a...,1
1472,"In Harrisburg, I passed more bills than all th...",0
701,"Says Barack Obama said, If we keep talking abo...",1


We will split the data into a training and test set. Note that the data is ordered by class, so we *must* shuffle the data.
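To see why the shuffle matters, here is a minimal sketch with a hypothetical list of 12 labels sorted by class (the labels and the 25% tail split are illustrative assumptions, not the real data). Holding out the last quarter without shuffling yields a test set with only one class.

```python
import random

# Hypothetical labels, ordered by class (all 1s first, then all 0s).
labels = [1] * 6 + [0] * 6
n_test = len(labels) // 4  # hold out 25% of the rows

# Without shuffling, the held-out tail contains a single class.
test_unshuffled = labels[-n_test:]

# Shuffling the indices first makes the held-out rows a random draw
# from the whole dataset instead of the tail of one class.
idx = list(range(len(labels)))
random.Random(0).shuffle(idx)
test_shuffled = [labels[i] for i in idx[-n_test:]]

print(test_unshuffled)
print(test_shuffled)
```

A model evaluated on the unshuffled tail would never be tested on false statements at all, so its accuracy would say little about the task.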

In [4]:
Xtr_str, Xts_str, ytr, yts = train_test_split(df['statement'].values, df['label_binary'].values, shuffle=True, random_state=0, test_size=0.25)

To train a model, we will need to get this text into some kind of numeric representation. We will use a basic approach called "bag of words", which works as follows:

0. Remove the "trivial" words that you want to ignore, such as "the", "an", "has", etc. from the text.
1. Compile a "vocabulary" - a list of all of the words in the dataset - with integer indices from 0 to $d-1$.
2. Convert every sample into a $d$-dimensional vector $x$, by letting the $j$th coordinate of $x$ be the number of occurrences of the $j$th word in the document. (This number is often called the "term frequency".)

Now, we have a set of vectors - one for each sample - containing the frequency of each word.
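The three steps above can be sketched in plain Python. The toy corpus, the three-word stop list, and the whitespace tokenization are assumptions for illustration only. `CountVectorizer` applies its own tokenization rules and a much longer English stop-word list, so its output on real text will differ.

```python
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = [
    "the tax cut helped the economy",
    "the economy lost jobs",
    "jobs jobs jobs",
]
stop_words = {"the", "an", "has"}

# Step 0: drop the stop words from each document.
tokens = [[w for w in doc.split() if w not in stop_words] for doc in docs]

# Step 1: build a sorted vocabulary with integer indices 0..d-1.
all_words = sorted({w for doc in tokens for w in doc})
vocab = {w: j for j, w in enumerate(all_words)}

# Step 2: count how often each vocabulary word occurs in each document.
def to_vector(doc_tokens):
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

X = [to_vector(doc) for doc in tokens]

print(vocab)
for row in X:
    print(row)
```

Each row of `X` has one entry per vocabulary word, and most entries are zero even in this tiny example. That pattern becomes far more pronounced with thousands of words.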

We will use the `sklearn` implementation of this, which is called `CountVectorizer`. You may refer to the [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

We will create an instance of a `CountVectorizer`, passing `stop_words = 'english'` and leaving other arguments at their default values.

Create `Xtr_vec` by fitting it on the training data, then using it to transform the training data. Use the already fitted vectorizer to transform the test data, to create `Xts_vec`. (This is the 'typical' data pre-processing pattern, where data pre-processing always uses the statistics of the training data *only*.)

In [5]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
vec = CountVectorizer(stop_words='english')
Xtr_vec = vec.fit_transform(Xtr_str)
Xts_vec = vec.transform(Xts_str)

In [6]:
Xtr_vec.shape

(1500, 4569)

You notice that in its vector representation, the data has several thousand features. You wonder whether a classifier that relies on only a small subset of these features could do just as well. You decide to try L1 regularization, which tends to drive many coefficients to exactly zero and so produces a sparse model.
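One way to build intuition for why an L1 penalty produces exact zeros is the soft-thresholding operator, which is the proximal operator of the L1 norm. The sketch below applies it to a hypothetical coefficient vector. It is only an illustration of the shrinkage effect, not a description of how the `liblinear` solver is implemented.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink each entry of w toward zero by t; entries with |w| <= t become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Hypothetical coefficients, for illustration only.
w = np.array([2.0, -0.3, 0.05, -1.5, 0.2])
w_l1 = soft_threshold(w, t=0.5)

print(w_l1)
print(np.count_nonzero(w_l1), "of", w.size, "coefficients remain non-zero")
```

Small coefficients are set to exactly zero while large ones are only shrunk. An L2 penalty, by contrast, scales coefficients down without zeroing them.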

In the following cell, use K-fold CV with 5 folds to evaluate different values of `C` in the list defined by

```
np.logspace(-3, 3, num=20)
```

for an `l1` penalty in the `LogisticRegression`.  Iterate over the K-fold split (don't shuffle the data again!) and in each fold:

* train a `LogisticRegression` with an `l1` penalty and the specified value of `C`. Also set `random_state = 0`, and use the `liblinear` solver. Leave other settings at their default values.
* compute the accuracy of the model on the validation data, and save it in `acc_val`.
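Before writing the loop, it can help to look at the grid and the folds themselves. The sketch below shows that `np.logspace(-3, 3, num=20)` produces values with a constant ratio between neighbours. It also uses `np.array_split` as a stand-in (an assumption made here to keep the example dependency-free) to mimic the contiguous validation blocks that an unshuffled 5-fold split produces for 1500 training rows.

```python
import numpy as np

# 20 candidate values from 1e-3 to 1e3, evenly spaced on a log scale.
C_grid = np.logspace(-3, 3, num=20)
ratio = C_grid[1:] / C_grid[:-1]  # constant ratio between neighbours

# With shuffling off, a 5-fold split of 1500 rows uses 5 contiguous
# blocks of 300 validation indices; np.array_split mimics that layout.
n_samples, n_folds = 1500, 5
val_blocks = np.array_split(np.arange(n_samples), n_folds)

print(C_grid[0], C_grid[-1], ratio[0])
for k, block in enumerate(val_blocks):
    print(k, block[0], block[-1], len(block))
```

Because every candidate `C` is scored on the same five validation blocks, the differences in `acc_val` reflect the regularization strength rather than a different partition of the data.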



In [7]:
C_test = np.logspace(-3, 3, num=20)
# Note: use this C_test array to implement your solution, and do *not* re-define C_test.
# The grader will evaluate your solution over a *DIFFERENT* array, that is also named C_test.

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

nfold = 5
acc_val = np.zeros((len(C_test), nfold))

kf = KFold(n_splits=nfold)

# For each fold
for ifold, (Itr, Ival) in enumerate(kf.split(Xtr_vec)):
    # For each C in the list, fit a LogisticRegression model
    for iC, C in enumerate(C_test):
        clf = LogisticRegression(random_state=0, penalty='l1', solver='liblinear', C=C)
        clf.fit(Xtr_vec[Itr], ytr[Itr])
        yhat = clf.predict(Xtr_vec[Ival])
        # update the appropriate entry in acc_val
        acc_val[iC, ifold] = accuracy_score(ytr[Ival], yhat)



Find the value of C that had the best validation accuracy, and save it in `C_best`. (Note: do not hard-code this value!)

In [9]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# C_best = ...
acc_mean = np.mean(acc_val, axis=1)
C_best = C_test[np.argmax(acc_mean)]

Using the entire training set, train a `LogisticRegression` with `l1` penalty and this value of `C`, `random_state = 0`, `liblinear` solver, and leave other settings at their default values. Get the accuracy of this model on the test data, and save it in `acc_best`.

In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# acc_best = ..
model_best = LogisticRegression(penalty='l1', C=C_best, random_state=0, solver='liblinear')
model_best.fit(Xtr_vec, ytr)
y_pred_best = model_best.predict(Xts_vec)
acc_best = accuracy_score(yts, y_pred_best)

Also check whether this model achieves your goal of "zeroing" out many features - use [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html) to count the number of model coefficients that are *not* zero. Save this count in `count_best`.
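As a small illustration of the counting step, the sketch below uses a hypothetical coefficient array shaped like the `coef_` attribute of a binary classifier, i.e. `(1, n_features)`. The values are made up.

```python
import numpy as np

# Hypothetical coefficients with shape (1, n_features), for illustration only.
coef = np.array([[0.0, 1.2, 0.0, -0.7, 0.0, 0.0]])

n_nonzero = np.count_nonzero(coef)       # features the model actually uses
frac_zero = 1 - n_nonzero / coef.size    # fraction of features zeroed out

print(n_nonzero, round(frac_zero, 3))
```

Comparing the count against the total number of features (here `coef.size`) shows how much of the vocabulary the model ignores.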

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# count_best = ..
count_best = np.count_nonzero(model_best.coef_)

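Now, find the value of C that is best according to the one-SE rule, and save it in `C_one_se`. (Note: do not hard-code this value!) Remember that C is the *inverse* of the strength of the regularization penalty, i.e. a larger value of C means *less* regularization. The simplest model is therefore the one with the *smallest* C, so the one-SE rule selects the smallest C whose mean validation accuracy is within one standard error of the best mean validation accuracy.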
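The one-SE rule can be worked through on a small example. The sketch below uses a hypothetical 4-by-3 accuracy table (four values of C, three folds) and assumes the standard error is the sample standard deviation (`ddof=1`) divided by the square root of the number of folds. Check that convention against your course notes. All names ending in `_toy` are illustrative.

```python
import numpy as np

# Hypothetical grid and validation accuracies (rows: C values, columns: folds).
C_toy = np.array([0.01, 0.1, 1.0, 10.0])
acc_toy = np.array([
    [0.60, 0.62, 0.61],
    [0.70, 0.74, 0.72],
    [0.68, 0.80, 0.74],
    [0.69, 0.75, 0.72],
])
n_folds_toy = acc_toy.shape[1]

# Mean and standard error of the accuracy across folds, for each C.
mean_toy = acc_toy.mean(axis=1)
se_toy = acc_toy.std(axis=1, ddof=1) / np.sqrt(n_folds_toy)

# Target: best mean accuracy minus one standard error at that best C.
i_best = np.argmax(mean_toy)
target_toy = mean_toy[i_best] - se_toy[i_best]

# Simplest model = smallest C whose mean accuracy reaches the target.
C_toy_one_se = C_toy[mean_toy >= target_toy].min()

print(C_toy[i_best], round(target_toy, 4), C_toy_one_se)
```

Here the best mean accuracy is at `C = 1.0`, but `C = 0.1` is within one standard error of it, so the rule prefers the more strongly regularized model.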

In [12]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# C_one_se = ...
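# Mean and standard error of the validation accuracy across folds, for each C
acc_mean = acc_val.mean(axis=1)
acc_se = acc_val.std(axis=1, ddof=1) / np.sqrt(acc_val.shape[1])
# Target: best mean accuracy minus one standard error at that best C
idx_best = np.argmax(acc_mean)
acc_target = acc_mean[idx_best] - acc_se[idx_best]
# Smallest C (strongest regularization) whose mean accuracy meets the target
C_one_se = np.min(C_test[acc_mean >= acc_target])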

Using the entire training set, train a `LogisticRegression` with `l1` penalty and this value of `C`, `random_state = 0`, `liblinear` solver, and leave other settings at their default values. Get the accuracy of this model on the test data, and save it in `acc_one_se`.

In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# acc_one_se = ..
model_one_se = LogisticRegression(penalty='l1', C=C_one_se, random_state=0, solver='liblinear')
model_one_se.fit(Xtr_vec, ytr)
y_pred_one_se = model_one_se.predict(Xts_vec)
acc_one_se = accuracy_score(yts, y_pred_one_se)

Also check whether this model achieves your goal of "zeroing" out many features - use [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html) to count the number of model coefficients that are *not* zero. Save this count in `count_one_se`.

In [14]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# count_one_se = ..
count_one_se = np.count_nonzero(model_one_se.coef_)