You have been hired by a news wire service to train a machine learning model to identify potentially false statements and claims made by public figures. These statements will be flagged for additional fact-checking by a human fact checker.

To train your model, you are given a dataset of already-fact-checked statements from the past decade, half of which have been evaluated as true and half of which have been evaluated as false.

In the attached workspace, you will load in this data, then use it to train an SVC model. Since the text data has many features once converted to a "bag of words" representation, you will also explore the use of feature selection to see if an equally effective model can be trained on fewer features.

> Note: your code will be evaluated on slightly different data than the file given in the workspace, so don't hard-code anything - the code you write should work with any data frame having similar column names and types.

| Name	| Type	| Description |
| --- | --- | --- |
|`Xtr_vec`|	2d numpy array	|Vector representation of training data (part 1).|
|`Xts_vec`|	2d numpy array	|Vector representation of test data (part 1).|
|`acc_score`|	float	|Accuracy of SVC model trained on all features (part 1).|
|`mi_scores`|1d numpy array	|Mutual information score for each feature (part 2).|
|`threshold`|	float	|Threshold for feature selection, based on 0.5 quantile mutual information score (part 2).|
|`Xtr_selected`|	2d numpy array	|Columns of the training data whose mutual information score is strictly above the 0.5 quantile (part 2).|
|`Xts_selected`|	2d numpy array	|Columns of the test data whose mutual information score is strictly above the 0.5 quantile (part 2).|
|`acc_score_median`|	float	|Accuracy of SVC model trained on features with mutual information score above the 0.5 quantile (part 2).|
|`acc_scores_kfold`|	2d numpy array	|Accuracy of SVC model for each fold and each quantile value (part 3).|
|`best_quantile_mean`	|float	|The optimal quantile to use as a threshold according to best mean accuracy (part 3).|
|`best_quantile_one_se`	|float	|The optimal quantile to use as a threshold according to one-SE rule (part 3).|

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, KFold
from sklearn.feature_selection import mutual_info_classif



First, load in the data:

In [2]:
df = pd.read_csv("data_factcheck.csv")

The data frame includes:

* a `statement` field, which has the text of the statement,
* and a `label_binary` field indicating whether the statement is false (1) or true/mostly true (0).


In [3]:
df.sample(5)

Unnamed: 0,statement,label_binary
1199,I have always opposed drivers licenses for ill...,0
275,If an Iranian woman shows too much hair in pub...,1
472,On an earmark moratorium.,1
1597,Barack Obamas health care bill is nothing new....,0
87,When Mitt Romney chose Paul Ryan as his vice-p...,1


We will split the data into a training and test set. Note that the data is ordered by class, so we *must* shuffle the data.

In [4]:
Xtr_str, Xts_str, ytr, yts = train_test_split(
    df['statement'].values, df['label_binary'].values,
    shuffle=True, random_state=42, test_size=0.25
)

#### Part 1 - train an SVC classifier

Note: Graded variables in this section: `Xtr_vec`, `Xts_vec`, `acc_score`.

First, we will train an SVC classifier.

To train a model, we will need to get this text into some kind of numeric representation. We will use a basic approach called "bag of words", that works as follows:

0. (Optional) Remove the "trivial" words that you want to ignore, such as "the", "an", "has", etc. from the text.
1. Compile a "vocabulary" - a list of all of the words in the dataset - with integer indices from 0 to $d-1$.
2. Convert every sample into a $d$-dimensional vector $x$, by letting the $j$th coordinate of $x$ be the number of occurrences of the $j$th word in the document. (This number is often called the "term frequency".)

Now, we have a set of vectors - one for each sample - containing the frequency of each word.
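
To make the steps above concrete, here is a minimal sketch that builds a bag-of-words representation by hand. The three toy documents and the one-word stop list are made up for illustration; they are not taken from the dataset, and the whitespace tokenization is simpler than what `CountVectorizer` does.

```python
# Toy documents and stop list (hypothetical, for illustration only)
docs = [
    "the tax cut helped the economy",
    "the economy lost jobs",
    "jobs jobs jobs",
]
stop_words = {"the"}

# Step 0: drop the "trivial" words
tokens = [[w for w in d.split() if w not in stop_words] for d in docs]

# Step 1: compile the vocabulary, with integer indices 0 .. d-1
vocab = sorted({w for doc_tokens in tokens for w in doc_tokens})
index = {w: j for j, w in enumerate(vocab)}

# Step 2: count how often each vocabulary word occurs in each document
vectors = []
for doc_tokens in tokens:
    x = [0] * len(vocab)
    for w in doc_tokens:
        x[index[w]] += 1
    vectors.append(x)

print(vocab)
print(vectors)
```

Each row of `vectors` has one entry per vocabulary word, so the third document, which repeats a single word, has one non-zero count.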

We will use the `sklearn` implementation of this, which is called `CountVectorizer`. You may refer to the [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

We will create an instance of a `CountVectorizer`, passing `stop_words = 'english'` and leaving other arguments at their default values.

Create `Xtr_vec` by fitting it on the training data, then using it to transform the training data. Use the already fitted vectorizer to transform the test data, to create `Xts_vec`. (This is the 'typical' data pre-processing pattern, where data pre-processing always uses the statistics of the training data *only*.)
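
The fit-on-train, transform-on-test pattern can be sketched on toy data as follows. The two training sentences and the test sentence are hypothetical, and the stop word list is left out here so that the toy vocabulary is easy to predict. Words that appear only in the test sentence are not part of the fitted vocabulary, so they are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (hypothetical, not from the fact-check dataset)
train_docs = ["taxes went up", "jobs went down"]
test_docs = ["taxes and wages went up"]

toy_vec = CountVectorizer()
Xtr_toy = toy_vec.fit_transform(train_docs)  # learns the vocabulary from training data only
Xts_toy = toy_vec.transform(test_docs)       # reuses that vocabulary, learns nothing new

toy_vocab = sorted(toy_vec.vocabulary_)      # words ordered by column index
print(toy_vocab)
print(Xts_toy.toarray())
```

The test matrix has the same number of columns as the training matrix, which is what lets a model trained on one be applied to the other.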

In [5]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
vec = CountVectorizer(stop_words='english')
Xtr_vec = vec.fit_transform(Xtr_str)
Xts_vec = vec.transform(Xts_str)


In [6]:
Xtr_vec.shape

(1500, 4551)

Next, fit an SVC classifier on the vector representation of the training data, leaving all arguments at their default values. 

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
clf = SVC()
clf.fit(Xtr_vec, ytr)

Compute the accuracy of this model on the test data.

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
acc_score = accuracy_score(yts, clf.predict(Xts_vec))

#### Part 2 - use only the most relevant features

Note: Graded variables in this section: `mi_scores`, `threshold`, `Xtr_selected`, `Xts_selected`, `acc_score_median`.

You notice that in its vector representation, the data has several thousand features. You wonder if maybe you can get comparable performance just by using the most relevant features (i.e. the words that are most related to whether or not the claim is true).

Compute a score for each feature in the training data, using `mutual_info_classif` (you can refer to the [`mutual_info_classif` documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html)) as the scoring metric. Save the scores in `mi_scores`.
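
For intuition about what a mutual information score measures, here is a small sketch that computes the plug-in estimate $\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$ for a discrete feature and a discrete label. The helper `mutual_information` and the toy vectors are hypothetical, and the sketch is not claimed to reproduce the exact values returned by `mutual_info_classif`.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (xv, yv), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[xv] / n) * (py[yv] / n)))
    return mi

labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # word count tracks the label exactly
uninformative = [1, 0, 1, 0]  # word count is independent of the label

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
print(mi_informative, mi_uninformative)
```

A feature that determines the label gets a high score, while a feature that is independent of the label scores zero.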

Then, set `threshold` to the 0.5 quantile value of the mutual information scores you computed. Use the `quantile` function in `numpy` (you can find the [`quantile` documentation here](https://numpy.org/doc/stable/reference/generated/numpy.quantile.html)).

Let `Xtr_selected` include only the columns of `Xtr_vec` for which the mutual information score is strictly greater than the 0.5 quantile (and similarly, `Xts_selected` will only include those columns of `Xts_vec`). Use this data to fit an SVC classifier (leaving all arguments at their default values) and compute its accuracy on the test set. Save this value in `acc_score_median`.
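
The thresholding step can be sketched on a toy score vector. The names `scores_toy`, `X_toy`, and `keep` are hypothetical. The example is chosen so that many scores are tied at zero, which shows that a strict inequality can keep fewer than half of the columns.

```python
import numpy as np

# Toy scores for 6 features; four of them are tied at 0 (hypothetical values)
scores_toy = np.array([0.0, 0.4, 0.0, 0.1, 0.0, 0.0])
X_toy = np.arange(18).reshape(3, 6)  # 3 samples, 6 features

threshold_toy = np.quantile(scores_toy, 0.5)     # median of the scores
keep = np.where(scores_toy > threshold_toy)[0]   # strictly greater than the threshold
X_sel = X_toy[:, keep]                           # same column indices for train and test

print(threshold_toy, keep, X_sel.shape)
```

Because the median of these scores is itself zero, only the two features with a strictly positive score survive.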


In [9]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
mi_scores = mutual_info_classif(Xtr_vec, ytr)
threshold = np.quantile(mi_scores, 0.5)
selected_indices = np.where(mi_scores > threshold)[0]
Xtr_selected = Xtr_vec[:, selected_indices]
Xts_selected = Xts_vec[:, selected_indices]

clf_selected = SVC()
clf_selected.fit(Xtr_selected, ytr)
acc_score_median = accuracy_score(yts, clf_selected.predict(Xts_selected))

#### Part 3 - find the threshold by CV

Note: Graded variables in this section: `acc_scores_kfold`, `best_quantile_mean`, `best_quantile_one_se`.

You still think this idea could work, but maybe excluding half of the features (by using the `0.5` quantile value as the threshold) wasn't exactly right. You decide to use K-fold CV to find a better threshold.

In the following cell, use K-fold CV with `n_folds = 5` to evaluate different quantiles for setting the threshold: `[0.25, 0.5, 0.75, 0.9, 0.95, 0.99]`. Iterate over the K-fold split (don't shuffle the data again!) and in each fold:

* train a `CountVectorizer` (passing `stop_words = 'english'` and leaving other arguments at their default values) using the training data for that fold, then use it to transform the training and validation data for the fold,
* compute the mutual information scores using the training data for that fold (using the default arguments to `mutual_info_classif`, as before),
* compute `threshold_list` as the list of threshold mutual information scores corresponding to the quantiles `[0.25, 0.5, 0.75, 0.9, 0.95, 0.99]`,
* then iterate over `threshold_list`! In this inner loop:
   * fit an SVC classifier using the training data for that fold, selecting only the columns with mutual information score strictly exceeding the threshold,
   * and compute the accuracy score of this model, saving the results in `acc_scores_kfold`.
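
As a small illustration of the splitting step only, the sketch below shows which validation indices `KFold` produces when shuffling is disabled. The ten-element array `toy_X` is hypothetical. Each validation fold is a contiguous block, and every sample is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10)  # stand-in for 10 training samples (hypothetical)

kf_toy = KFold(n_splits=5, shuffle=False)
val_folds = [val_idx.tolist() for _, val_idx in kf_toy.split(toy_X)]

print(val_folds)
```

Fitting the vectorizer and computing the scores inside each fold, rather than once up front, keeps the validation samples from influencing the vocabulary or the feature selection.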

In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

n_folds = 5
quantiles = np.array([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

acc_scores_kfold = np.zeros((len(quantiles), n_folds))
  
# K-fold cross-validation
kf = KFold(n_folds, shuffle=False)
for i, (train_index, val_index) in enumerate(kf.split(Xtr_str)):
    X_train_fold, X_val_fold = Xtr_str[train_index], Xtr_str[val_index]
    y_train_fold, y_val_fold = ytr[train_index], ytr[val_index]
    
    vec_fold = CountVectorizer(stop_words='english')
    X_train_vec = vec_fold.fit_transform(X_train_fold)
    X_val_vec = vec_fold.transform(X_val_fold)
    
    mi_scores_fold = mutual_info_classif(X_train_vec, y_train_fold)
    for j, q in enumerate(quantiles):
        threshold_fold = np.quantile(mi_scores_fold, q)
        selected_indices_fold = np.where(mi_scores_fold > threshold_fold)[0]
        X_train_selected = X_train_vec[:, selected_indices_fold]
        X_val_selected = X_val_vec[:, selected_indices_fold]
        
        clf_fold = SVC()
        clf_fold.fit(X_train_selected, y_train_fold)
        acc_scores_kfold[j, i] = accuracy_score(y_val_fold, clf_fold.predict(X_val_selected))


Find the quantile value with the best mean accuracy across folds, and save this value in `best_quantile_mean`:

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
mean_scores = np.mean(acc_scores_kfold, axis=1)
best_quantile_mean = quantiles[np.argmax(mean_scores)]

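Find the quantile value that is best according to the one-SE rule (i.e. the simplest model whose mean accuracy is within one standard error of the best model; here, the simplest model is the one using the fewest features, which corresponds to the largest quantile), and save this value in `best_quantile_one_se`: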

In [12]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
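# Standard error of the mean accuracy for each quantile (rows: quantiles, columns: folds)
se_scores = np.std(acc_scores_kfold, axis=1, ddof=1) / np.sqrt(acc_scores_kfold.shape[1])
# One-SE bound: best mean accuracy minus the standard error of that best model
idx_best = np.argmax(mean_scores)
one_se_bound = mean_scores[idx_best] - se_scores[idx_best]
# "Simplest" means fewest features, i.e. the largest quantile still within the bound
candidate_quantiles = quantiles[mean_scores >= one_se_bound]
best_quantile_one_se = np.max(candidate_quantiles)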
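
As a sanity check of the one-SE rule on made-up numbers, the sketch below uses a hypothetical accuracy matrix with one row per quantile and one column per fold. It assumes the standard error is the sample standard deviation (with `ddof=1`) divided by the square root of the number of folds; other conventions exist.

```python
import numpy as np

q_toy = np.array([0.25, 0.5, 0.75, 0.9])
# Hypothetical accuracies: rows are quantiles, columns are folds
acc_toy = np.array([
    [0.600, 0.620, 0.580, 0.610, 0.590],
    [0.640, 0.660, 0.620, 0.650, 0.630],
    [0.635, 0.655, 0.615, 0.645, 0.625],
    [0.550, 0.570, 0.530, 0.560, 0.540],
])

mean_toy = acc_toy.mean(axis=1)
se_toy = acc_toy.std(axis=1, ddof=1) / np.sqrt(acc_toy.shape[1])

best = np.argmax(mean_toy)
bound = mean_toy[best] - se_toy[best]

q_best_mean = q_toy[best]
# Largest quantile (fewest features) whose mean accuracy is still within the bound
q_one_se = q_toy[mean_toy >= bound].max()

print(q_best_mean, q_one_se)
```

Here the best mean accuracy belongs to the 0.5 quantile, but the 0.75 quantile is within one standard error of it and uses fewer features, so the one-SE rule prefers it.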