# Selecting a Logistic Regression Model with K-Fold CV

In the attached workspace, you will use K-fold CV to select the regularization strength in an L2-regularized classification model. Then, you will fit the optimal model and evaluate its accuracy on a test set not used for model fitting.

| Name | Type | Description |
| ---- | ---- | ---- |
| `acc_mean` | 1d numpy array | The mean validation accuracy for each model in the K-fold CV. |
| `C_best` | float | The value of the tuning parameter `C` that yields the best accuracy on the validation data. |
| `acc_ts` | float | The accuracy of the final model on the test set. |


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold


When interacting with an ML-enabled chatbot or personal assistant application, the user expects to speak to the assistant in natural language. An *intent classification* model is used to identify the user's intent (which capability of the personal assistant the user is trying to access) so that the user's request can then be processed further.

In this notebook, you will use part of a [dataset](https://github.com/xliuhw/NLU-Evaluation-Data) of natural language commands for a smart home assistant labeled with the corresponding intent, to train an intent classifier.

First, load in the data:

In [2]:
df = pd.read_csv("personal_assistant_intent.csv")

The data frame includes

*  an `intent` field, which has the label indicating the intent: is it a command related to `cleaning` with a robotic vacuum cleaner, playing `music` or adjusting the sound level (`volume_mute`, `volume_up`, `volume_down`), configuring the lights (`hue_lightchange`, `hue_lightoff`, `hue_lighton`, `hue_lightup`, `hue_lightdim`), making `coffee`, ordering a `taxi`, or placing an `order` for food?
* and an `answer_normalised` field, with the text of the command.


In [3]:
df.sample(5)

| Unnamed: 0 | intent | answer_normalised |
| ---- | ---- | ---- |
| 168 | order | order the best priced general tofu in nyack |
| 134 | cleaning | start the vacuum at nine am |
| 695 | cleaning | it's dirty here make some noise |
| 343 | music | play michael jackson from my playlist |
| 421 | volume_down | please turn the volume down |


To train a model, we will need to get this text into some kind of numeric representation. We will use a basic approach called "bag of words", which works as follows:

0. (Optional) Remove the "trivial" words that you want to ignore, such as "the", "an", "has", etc. from the text.
1. Compile a "vocabulary" - a list of all of the words in the dataset - with integer indices from 0 to $d-1$.
2. Convert every sample into a $d$-dimensional vector $x$, by letting the $j$th coordinate of $x$ be the number of occurrences of the $j$th word in that sample. (This number is often called the "term frequency".)

Now, we have a set of vectors - one for each sample - containing the frequency of each word.

For example, if we had two samples:

```
dog eats dog
dog eats cat
```

our "vocabulary" might be

```
dog,0
eats,1
cat,2
```

and the two samples would be represented by the term frequencies

```
2,1,0
1,1,1
```

There are more sophisticated ways of representing text, but this approach will work for now. We will use the `sklearn` implementation of this, which is called `CountVectorizer`. 
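
As a small sketch of the two-sample example above, the following runs `CountVectorizer` with its default settings on the same text. The names `docs`, `vec_demo`, `vocab_demo`, and `counts_demo` are illustrative only. Note that `CountVectorizer` assigns indices in alphabetical order, so the column order differs from the hand-written vocabulary, although the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog eats dog", "dog eats cat"]

vec_demo = CountVectorizer()
X_demo = vec_demo.fit_transform(docs)

# Recover the words in column order from the fitted vocabulary (word -> index)
vocab_demo = sorted(vec_demo.vocabulary_, key=vec_demo.vocabulary_.get)
counts_demo = X_demo.toarray().tolist()

print(vocab_demo)   # ['cat', 'dog', 'eats']
print(counts_demo)  # [[0, 2, 1], [1, 1, 1]]
```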

First, we will split the data into training and test sets:

In [4]:
Xtr_str, Xts_str, ytr, yts = train_test_split(
    df['answer_normalised'].values, df['intent'].values,
    shuffle=True, random_state=42, test_size=0.3,
)

Then, we'll create an instance of a `CountVectorizer`, specify the list of "stop words" to remove, specify that it should use only the 400 most frequent words, and "fit" it using the text from the training set:

In [5]:
vec = CountVectorizer(stop_words='english', max_features=400)
Xtr_vec = vec.fit_transform(Xtr_str)
Xts_vec = vec.transform(Xts_str)

Now we have `Xtr_vec` and `Xts_vec`, the text of the commands in the form of numeric arrays that we can use to train a `LogisticRegression` classifier. (Note that these arrays have a large number of columns, one for every word in the vocabulary, and most of their entries are zero, so they are represented internally as "sparse" matrices.)

In [6]:
Xtr_vec.shape

(490, 400)

In [7]:
Xts_vec.shape

(210, 400)

(Optional: If you are interested, you can run

```
vec.vocabulary_
```

to see the "vocabulary".)
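
The following sketch illustrates what fitting the vectorizer on the training text only implies for the test text. It uses made-up sentences (`train_docs`, `test_docs`) rather than the intent dataset, and default `CountVectorizer` settings. A word that never appears in the training text has no column, so it is silently ignored when the test text is transformed.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["turn the lights on", "play some music"]  # made-up examples
test_docs = ["play jazz music"]  # "jazz" never appears in the training text

vec_demo = CountVectorizer()
Xtr_demo = vec_demo.fit_transform(train_docs)  # learns the vocabulary
Xts_demo = vec_demo.transform(test_docs)       # reuses the learned vocabulary

print(issparse(Xtr_demo))              # the result is a sparse matrix
print(Xtr_demo.shape, Xts_demo.shape)  # same number of columns in both
print(Xts_demo.toarray())              # only "music" and "play" are counted
```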

**Note**: when using a regularized model, we standardize the data if features do not share a common scale. In this case, all features are on the same scale (frequency) so we do *not* standardize.

Since there is likely to be a lot of correlation between features, the model may have high variance. We can use L2 regularization to try to improve the model performance.

In an `sklearn` `LogisticRegression`, the hyperparameter `C` controls the "strength" of the regularization term in the objective function. `C` is the **inverse** of the regularization strength; a greater value of `C` means the model is *less* regularized.
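
To make the direction of `C` concrete, here is a minimal sketch on synthetic data from `make_classification` (not the intent dataset), using the default solver rather than the settings used elsewhere in this notebook. It fits the same L2-regularized model with three values of `C` and records the norm of the fitted coefficients. A smaller `C` should shrink the coefficients more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is L2; only C changes between fits
    clf_demo = LogisticRegression(C=C, max_iter=1000)
    clf_demo.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf_demo.coef_))

# Smaller C (stronger regularization) gives a smaller coefficient norm
print(coef_norms)
```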

We will evaluate models for the following values of `C`:

In [8]:
C_test = np.array([0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000])

In the following cells, we are going to set up a K-fold CV to select a value of `C`.  First, we will set up an array to hold the results of each model in each fold. (Note that our K-fold CV will use 5 folds.)

In [9]:
nfold = 5
acc_val = np.zeros((len(C_test), nfold))
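
As a rough illustration of how an unshuffled 5-fold split partitions the data, the sketch below applies `KFold` to ten toy samples (`X_demo` is a stand-in, not the training matrix). Each fold holds out one contiguous block of indices for validation and trains on the rest.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten toy samples

kf_demo = KFold(n_splits=5, shuffle=False)
folds_demo = [(tr.tolist(), val.tolist()) for tr, val in kf_demo.split(X_demo)]

for i, (tr, val) in enumerate(folds_demo):
    print(f"fold {i}: train={tr} val={val}")
```

Every sample appears in exactly one validation fold, so each model is always scored on data it was not fitted on.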

Now, create a KFold object using the `sklearn` implementation. Use 5 folds (and don't shuffle the data inside the K-Fold CV). 

Use this to evaluate an `sklearn` `LogisticRegression` model for each of the `C` values in `C_test`, and save the validation accuracy inside `acc_val`. In the `LogisticRegression`, 

* specify `solver = 'saga'`
* specify `tol = 1e-3`
* specify `max_iter = 500`
* specify `multi_class = 'multinomial'`
* specify `penalty = 'l2'`
* specify `random_state = 42`

and leave other hyperparameters and settings at their default values, except for `C`.

In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

kf = KFold(n_splits=nfold, shuffle=False)

# For each fold
for ifold, (Itr, Ival) in enumerate(kf.split(Xtr_vec)):
    # For each C in the list, fit a LogisticRegression model
    for iC, C in enumerate(C_test):
        clf = LogisticRegression(
            random_state=42, solver='saga', multi_class='multinomial',
            penalty='l2', tol=1e-3, max_iter=500, C=C,
        )
        clf.fit(Xtr_vec[Itr], ytr[Itr])
        yhat = clf.predict(Xtr_vec[Ival])
        # update the appropriate entry in acc_val
        acc_val[iC, ifold] = accuracy_score(ytr[Ival], yhat)
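
For comparison, and not as a replacement for the graded loop, the following sketch checks on synthetic data that a manual fold loop for a single `C` gives the same per-fold accuracies as `cross_val_score` when it is passed the same `KFold` object. The data, the `_demo` names, and the default solver are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=False)
clf_demo = LogisticRegression(C=1.0, max_iter=1000)

# Manual loop: fit on the training indices, score on the validation indices
manual = np.zeros(5)
for ifold, (Itr, Ival) in enumerate(kf_demo.split(X_demo)):
    clf_demo.fit(X_demo[Itr], y_demo[Itr])
    manual[ifold] = accuracy_score(y_demo[Ival], clf_demo.predict(X_demo[Ival]))

# Same folds, handled by sklearn
auto = cross_val_score(clf_demo, X_demo, y_demo, cv=kf_demo, scoring="accuracy")

print(manual)
print(auto)
```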





Next, compute the mean validation accuracy for each of the models, and identify the value of `C` for which the validation accuracy is maximized.

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
acc_mean = np.mean(acc_val, axis=1)
C_best = C_test[np.argmax(acc_mean)]
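
One detail of this selection step is worth a small sketch: `np.argmax` returns the first index at which the maximum occurs. The accuracies below are made up to include a tie. Because the candidate values are sorted in increasing order, a tie is resolved in favor of the smaller `C`, that is, the more strongly regularized model.

```python
import numpy as np

C_demo = np.array([0.1, 1.0, 10.0, 100.0])
acc_demo = np.array([0.80, 0.90, 0.90, 0.85])  # made-up accuracies with a tie

i_best = int(np.argmax(acc_demo))  # first index of the maximum
C_best_demo = float(C_demo[i_best])

print(i_best, C_best_demo)  # 1 1.0
```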

Using the `C` value you identified in the previous step (and the rest of the logistic regression parameters as specified earlier), fit a logistic regression model on the entire training set. Then, save its predictions on the test set in `y_hat`. Finally, compute its accuracy on the test set, and save this value in `acc_ts`.

In [12]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
clf_best = LogisticRegression(
    random_state=42, solver='saga', multi_class='multinomial',
    penalty='l2', tol=1e-3, max_iter=500, C=C_best,
)
clf_best.fit(Xtr_vec, ytr)
y_hat = clf_best.predict(Xts_vec)
acc_ts = accuracy_score(yts, y_hat)

