## Detecting false statements with L1-regularized logistic regression

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold


First, load in the data:

In [2]:
df = pd.read_csv("data_factcheck.csv")

The data frame includes

* a `statement` field, which contains the text of the statement, and
* a `label_binary` field indicating whether the statement is false (1) or true/mostly true (0).


In [3]:
df.sample(5)

| Unnamed: 0 | statement | label_binary |
|---|---|---|
| 262 | Kesha Rogers is not a Democrat. | 1 |
| 214 | The newly created state Immigration Enforcemen... | 1 |
| 346 | On support for the Bridge to Nowhere. | 1 |
| 1906 | The Congressional Budget Office, a nonpartisan... | 0 |
| 658 | Says Gov. Scott Walker is cooking the books by... | 1 |


We will split the data into a training and test set. Note that the data is ordered by class, so we *must* shuffle the data.

In [6]:
Xtr_str, Xts_str, ytr, yts = train_test_split(
    df['statement'].values, df['label_binary'].values,
    shuffle=True, random_state=0, test_size=0.25
)

To train a model, we need to convert this text into a numeric representation. We will use a basic approach called "bag of words", which works as follows:

0. Remove the "trivial" words that you want to ignore, such as "the", "an", "has", etc. from the text.
1. Compile a "vocabulary" - a list of all of the words in the dataset - with integer indices from 0 to $d-1$.
2. Convert every sample into a $d$-dimensional vector $x$, by letting the $j$th coordinate of $x$ be the number of occurrences of the $j$th word in the document. (This number is often called the "term frequency".)

Now, we have a set of vectors - one for each sample - containing the frequency of each word.
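
As a rough illustration of the three steps above, here is a minimal sketch on a made-up two-sentence corpus. The sentences, the tiny stop-word list, and the helper names `tokenize` and `bag_of_words` are invented for this example; `CountVectorizer` uses its own tokenizer and a much longer stop-word list.

```python
from collections import Counter

# Hypothetical toy inputs, not taken from the dataset
docs = ["The tax cut has a cost", "Tax on tax has a cost"]
stop_words = {"the", "a", "an", "has", "on"}

def tokenize(doc):
    # Step 0: lowercase, split on whitespace, drop stop words
    return [w for w in doc.lower().split() if w not in stop_words]

# Step 1: vocabulary mapping each remaining word to an index 0..d-1
all_words = sorted({w for doc in docs for w in tokenize(doc)})
vocab = {w: j for j, w in enumerate(all_words)}

def bag_of_words(doc):
    # Step 2: the j-th coordinate is the number of occurrences of word j
    counts = Counter(tokenize(doc))
    return [counts[w] for w in all_words]

X = [bag_of_words(doc) for doc in docs]
print(vocab)
print(X)
```

Each row of `X` has one entry per vocabulary word, so every document maps to a vector of the same length $d$ regardless of how long the original text was.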

We will use the `sklearn` implementation of this, which is called `CountVectorizer`. You may refer to the [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

We will create an instance of a `CountVectorizer`, passing `stop_words = 'english'` and leaving other arguments at their default values.

Create `Xtr_vec` by fitting it on the training data, then using it to transform the training data. Use the already fitted vectorizer to transform the test data, to create `Xts_vec`. (This is the 'typical' data pre-processing pattern, where data pre-processing always uses the statistics of the training data *only*.)

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
vec = CountVectorizer(stop_words='english')

# Fit and transform the training data
Xtr_vec = vec.fit_transform(Xtr_str)

# Use the fitted vectorizer to transform the test data
Xts_vec = vec.transform(Xts_str)

In [8]:
Xtr_vec.shape

(1500, 4569)
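
To see why the vectorizer is fitted on the training data only, it may help to try the same fit/transform pattern on a toy corpus. The documents and the `_toy` variable names below are made up for illustration; the point is that a word appearing only in the test text has no column in the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the dataset
train_docs = ["taxes taxes budget", "crime budget"]
test_docs = ["unemployment taxes"]

toy_vec = CountVectorizer(stop_words='english')
Xtr_toy = toy_vec.fit_transform(train_docs)  # learns the vocabulary from training text only
Xts_toy = toy_vec.transform(test_docs)       # reuses that vocabulary, learns nothing new

print(sorted(toy_vec.vocabulary_))  # words seen during fitting
print(Xtr_toy.toarray())
print(Xts_toy.toarray())            # "unemployment" has no column, so it is ignored
```

Both matrices end up with the same number of columns, which is what allows a classifier trained on `Xtr_toy` to be applied to `Xts_toy`.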

You notice that in its vector representation, the data has several thousand features. You wonder whether a sparse model, one that relies on only a small subset of these features, would work for your classifier. You decide to try L1 regularization, which you know tends to fit sparse coefficients.
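
The tendency of an L1 penalty to zero out coefficients can be checked on synthetic data. The sketch below is an illustration under assumed settings (200 random samples, 50 features, only the first 3 of which affect the label, and three arbitrary values of `C`); it is not part of the assignment and says nothing about the fact-check data itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): only features 0, 1, 2 matter
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 50))
noise = 0.5 * rng.normal(size=200)
y_syn = (X_syn[:, 0] + X_syn[:, 1] - X_syn[:, 2] + noise > 0).astype(int)

nonzero = {}
for penalty in ['l1', 'l2']:
    for C in [0.001, 0.1, 10]:
        clf = LogisticRegression(solver='liblinear', penalty=penalty, C=C, random_state=0)
        clf.fit(X_syn, y_syn)
        # Number of coefficients the fitted model actually uses
        nonzero[(penalty, C)] = np.count_nonzero(clf.coef_)
        print(penalty, C, nonzero[(penalty, C)])
```

With a very small `C` (strong regularization), the L1 model should drive every coefficient to exactly zero, while the L2 model shrinks coefficients without setting them to zero.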

In the following cell, use K-fold CV with 5 folds to evaluate different values of `C` in the list defined by

```
np.logspace(-3, 3, num=20)
```

for an `l1` penalty in the `LogisticRegression`.  Iterate over the K-fold split (don't shuffle the data again!) and in each fold:

* train a `LogisticRegression` with an `l1` penalty and the specified value of `C`. Also set `random_state = 0`, and use the `liblinear` solver. Leave other settings at their default values.
* compute the accuracy of the model on the validation data, and save it in `acc_val`.
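
Because `train_test_split` already shuffled the rows, `KFold` can be left at its default of `shuffle=False`. A small sketch of what that default does, using a stand-in array of 10 samples (the array and the `folds` name are made up for this example):

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # stand-in for 10 already-shuffled training samples

# Without shuffling, each validation fold is a contiguous block of rows
folds = [(tr.tolist(), va.tolist()) for tr, va in KFold(n_splits=5).split(toy)]
for tr, va in folds:
    print("train:", tr, "validation:", va)
```

Every sample lands in exactly one validation fold, so each row of the training set contributes to the validation accuracy once.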



In [9]:
C_test = np.logspace(-3, 3, num=20)
# Note: use this C_test array to implement your solution, and do *not* re-define C_test.
# The grader will evaluate your solution over a *DIFFERENT* array, that is also named C_test.

In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

nfold = 5
acc_val = np.zeros((len(C_test), nfold))

kf = KFold(n_splits=nfold)
  
# Iterate over each fold
for fold_idx, (train_idx, val_idx) in enumerate(kf.split(Xtr_vec)):
    X_train, X_val = Xtr_vec[train_idx], Xtr_vec[val_idx]
    y_train, y_val = ytr[train_idx], ytr[val_idx]
    
    # Iterate over each C value
    for c_idx, C in enumerate(C_test):
        # Train a LogisticRegression model with L1 penalty and the specified value of C
        clf = LogisticRegression(solver='liblinear', penalty='l1', C=C, random_state=0)
        clf.fit(X_train, y_train)
        
        # Make predictions on the validation data
        y_val_pred = clf.predict(X_val)
        
        # Compute and store the accuracy of the model on the validation data
        acc_val[c_idx, fold_idx] = accuracy_score(y_val, y_val_pred)

# acc_val now contains the accuracy for each value of C and each fold



Find the value of C that had the best mean validation accuracy (averaged across the folds), and save it in `C_best`. (Note: do not hard-code this value!)

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

acc_mean = np.mean(acc_val, axis=1)

# Find the value of C that gives the highest mean accuracy
C_best = C_test[np.argmax(acc_mean)]

Using the entire training set, train a `LogisticRegression` with `l1` penalty and this value of `C`, `random_state = 0`, `liblinear` solver, and leave other settings at their default values. Get the accuracy of this model on the test data, and save it in `acc_best`.

In [12]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
clf_best = LogisticRegression(solver='liblinear', penalty='l1', C=C_best, random_state=0)
clf_best.fit(Xtr_vec, ytr)  # Use Xtr_vec and ytr for training

# Get the accuracy on the test data
y_hat = clf_best.predict(Xts_vec)  # Use Xts_vec for testing
acc_best = accuracy_score(yts, y_hat)  # Calculate accuracy

Also check whether this model achieves your goal of "zeroing" out many features - use [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html) to count the number of model coefficients that are *not* zero. Save this count in `count_best`.
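
For a binary problem, `coef_` is a 2D array with a single row, and `np.count_nonzero` counts over the whole array. A minimal sketch with a made-up coefficient array (the values and the `_toy` name are hypothetical):

```python
import numpy as np

# Hypothetical coefficients with the (1, d) shape that a binary
# LogisticRegression stores in coef_
coef_toy = np.array([[0.0, 1.5, 0.0, -0.2, 0.0]])

n_nonzero = np.count_nonzero(coef_toy)     # features the model actually uses
frac_nonzero = n_nonzero / coef_toy.size   # fraction of all d features
print(n_nonzero, frac_nonzero)
```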

In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
count_best = np.count_nonzero(clf_best.coef_)

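Now, find the value of C that is best according to the one-SE rule, i.e. the simplest (most regularized) model whose mean validation accuracy is within one standard error of the best mean validation accuracy, and save it in `C_one_se`. (Note: do not hard-code this value!) Remember that C is the *inverse* of the strength of the regularization penalty, i.e. a larger value of C means *less* regularization.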
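
The one-SE rule can be sketched on a small made-up table of fold accuracies. Everything in the block below is hypothetical (the `C_toy` values, the accuracies, and the `_toy` names), and it assumes the standard error is the sample standard deviation across folds divided by $\sqrt{K}$; course materials may define the standard error slightly differently.

```python
import numpy as np

# Hypothetical validation accuracies: one row per C value, one column per fold
C_toy = np.array([0.01, 0.1, 1.0, 10.0])
acc_toy = np.array([
    [0.54, 0.56, 0.55, 0.55, 0.55],
    [0.60, 0.62, 0.61, 0.61, 0.61],
    [0.58, 0.66, 0.60, 0.64, 0.62],
    [0.59, 0.61, 0.60, 0.60, 0.60],
])

mean_toy = acc_toy.mean(axis=1)
idx_best = np.argmax(mean_toy)

# Standard error of the mean for the best row (assumed convention: ddof=1, divide by sqrt(K))
se_best = acc_toy[idx_best].std(ddof=1) / np.sqrt(acc_toy.shape[1])
threshold = mean_toy[idx_best] - se_best

C_best_toy = C_toy[idx_best]
# Smallest C (strongest regularization) whose mean accuracy clears the threshold
C_one_se_toy = C_toy[mean_toy >= threshold].min()
print(C_best_toy, C_one_se_toy)
```

In this toy table the best mean accuracy belongs to `C = 1.0`, but `C = 0.1` is within one standard error of it, so the one-SE rule prefers the more strongly regularized `C = 0.1`.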

In [14]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# 1. Mean validation accuracy for each C, and the index of the best one
mean_val_accuracy = np.mean(acc_val, axis=1)
idx_best = np.argmax(mean_val_accuracy)

# 2. Standard error of the mean across folds, for the best C only
#    (sample standard deviation divided by sqrt(number of folds))
se_best = np.std(acc_val[idx_best], ddof=1) / np.sqrt(acc_val.shape[1])

# 3. Among the C values within one SE of the best mean accuracy, pick the
#    smallest C, i.e. the most regularized (simplest) model
threshold = mean_val_accuracy[idx_best] - se_best
C_one_se = np.min(C_test[mean_val_accuracy >= threshold])



Using the entire training set, train a `LogisticRegression` with `l1` penalty and this value of `C`, `random_state = 0`, `liblinear` solver, and leave other settings at their default values. Get the accuracy of this model on the test data, and save it in `acc_one_se`.

In [15]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# Train a LogisticRegression model using the best C from the one-SE rule
clf_one_se = LogisticRegression(solver='liblinear', penalty='l1', C=C_one_se, random_state=0)
clf_one_se.fit(Xtr_vec, ytr)  # Train on the entire training set

# Get the accuracy of this model on the test data
yts_pred = clf_one_se.predict(Xts_vec)  # Make predictions on the test set
acc_one_se = accuracy_score(yts, yts_pred)  # Calculate accuracy on test set


Also check whether this model achieves your goal of "zeroing" out many features - use [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html) to count the number of model coefficients that are *not* zero. Save this count in `count_one_se`.

In [16]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# Count the non-zero coefficients of the one-SE model trained in the previous cell
count_one_se = np.count_nonzero(clf_one_se.coef_)