# COGS 108 - Assignment 5: Natural Language Processing

# Important Reminders
**You must submit this file (`A5_NLP.ipynb`) to TritonED to finish the homework.**

- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
    - In particular many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values. 
        - It is up to you to check the values, and make sure they seem reasonable.
- If things start behaving strangely, restart the kernel and re-run all of the code as a first check.
    - For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail. 
    - Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

# Background & Work Flow

- In this homework assignment, we will be analyzing text data. A common approach to analyzing text data is to use methods that allow us to convert text data into some kind of numerical representation - since we can then use all of our mathematical tools on such data. In this assignment, we will explore 2 feature engineering methods that convert raw text data into numerical vectors:
    - Bag of Words (BoW)
        - BoW encodes an input sentence as the frequency of each word in the sentence. 
        - In this approach, all words contribute equally to the feature vectors.
    - Term Frequency - Inverse Document Frequency (TF-IDF)
        - TF-IDF is a measure of how important each term is to a specific document, as compared to an overall corpus. 
        - TF-IDF encodes each word as its frequency in the document of interest, divided by a measure of how common the word is across all documents (the corpus).
        - Using this approach each word contributes differently to the feature vectors.
        - The assumption behind using TF-IDF is that words that appear commonly everywhere are not that informative about what is specifically interesting about a document of interest, so it is tuned to representing a document in terms of the words it uses that are different from other documents. 
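
As a rough illustration of the two encodings, the sketch below computes both by hand on a made-up three-document corpus. It assumes a plain whitespace tokenizer and the textbook `count * log(N / df)` weighting; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the assignment data)
docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of Words: raw count of each vocabulary word in each document
bow = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Document frequency: number of documents in which each word appears
df = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Textbook TF-IDF: count * log(N / df)
n_docs = len(docs)
tfidf = [
    [count * math.log(n_docs / df[word]) for word, count in zip(vocab, row)]
    for row in bow
]

print(vocab)
for bow_row, tfidf_row in zip(bow, tfidf):
    print(bow_row, [round(value, 2) for value in tfidf_row])
```

In the last document every word appears once, so BoW treats them all equally, while TF-IDF gives the largest weight to "acting" because it occurs in no other document.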

- To compare those 2 methods, we will first apply them on the same Movie Review dataset to analyse sentiment (how positive or negative a text is). In order to make the comparison fair, the same SVM (support vector machine) classifier will be used to classify positive reviews and negative reviews.

- SVM is a simple yet powerful and interpretable linear model. To use it as a classifier, we need to have at least 2 splits of the data: training data and test data. The training data is used to tune the weight parameters in the SVM to learn an optimal way to classify the training data. We can then test this trained SVM classifier on the test data, to see how well it works on data that the classifier has not seen before. 

In [1]:
# Imports - these are all the imports needed for the assignment
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
#   PennTreeBank word tokenizer 
#   English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# scikit-learn imports
#   SVM (Support Vector Machine) classifier
#   CountVectorizer, which transforms text data into bag-of-words features
#   TfidfVectorizer, which transforms text data into features that down-weight words common across the dataset
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

For this assignment we will be using nltk: the Natural Language Toolkit.

To do so, we will need to download some text data.

Natural language processing (NLP) often requires corpus data (lists of words and/or example text data), which we will download here if you don't already have it.

In [2]:
# In the next cell, we will download some files from nltk.
#   If you hit an error doing so, come back to this cell, and uncomment and run the code below.
#   This code disables HTTPS certificate verification, which works around SSL certificate errors during the download

# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context

In [3]:
# Download the NLTK English tokenizer and the stopwords of all languages
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Error loading punkt: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zafri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Part 1: Sentiment Analysis on Movie Review Data

In part 1 we will apply sentiment analysis to Movie Review (MR) data.

- The MR data contains more than 10,000 reviews collected from the IMDB website, and each review is annotated as either a positive or a negative sample. The numbers of positive and negative reviews are roughly the same. For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/

- For this homework assignment, we've already shuffled the data, and truncated the data to contain only 5000 reviews.

In this part of the assignment we will:
- Transform the raw text data into vectors with the BoW encoding method
- Split the data into training and test sets
- Write a function to train an SVM classifier on the training set
- Test this classifier on the test set and report the results

### 1a) Import data
Import the text file 'data/rt-polarity.tsv' into a DataFrame called `MR_df`.

Set the column names as `index`, `label`, `review`.

Note that 'data/rt-polarity.tsv' is a tab-separated raw text file, in which fields are separated by tabs ('\t').

You can load this file with `read_csv`:
- Specify the `sep` (separator) argument as a tab ('\t')
- You will have to set `header` to None
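
To make the two arguments concrete, the standard-library sketch below mimics what a tab separator and a missing header row mean. The two sample lines are shortened, made-up stand-ins for the review file, and the `csv` module is used here only as an analogy; the assignment itself should use `pd.read_csv`.

```python
import csv
import io

# Hypothetical two-line sample in the same layout as the review file
raw = "8477\tneg\texcept as an acting exercise\n4031\tpos\tjapanese director\n"

column_names = ["index", "label", "review"]
reader = csv.reader(io.StringIO(raw), delimiter="\t")

# No header row: every line is data, so the column names are supplied separately
rows = [dict(zip(column_names, fields)) for fields in reader]

print(len(rows))
print(rows[0]["label"], "|", rows[0]["review"])
```

If the first line were treated as a header instead, the first review would be lost and its values would become the column names, which is why `header` is set to None.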

In [4]:
MR_filepath='data/rt-polarity.tsv'

# YOUR CODE HERE
MR_df=pd.read_csv(MR_filepath, delimiter="\t",header=None, names=['index', 'label', 'review'])
MR_df.head()

Unnamed: 0,index,label,review
0,8477,neg,except as an acting exercise or an exceptional...
1,4031,pos,japanese director shohei imamura 's latest fil...
2,10240,neg,i walked away not really know who `` they `` w...
3,8252,neg,what could have been a neat little story about...
4,1346,pos,no screen fantasy-adventure in recent memory h...


In [5]:
assert isinstance(MR_df, pd.DataFrame)
assert list(MR_df.columns) == ['index', 'label', 'review']


In [6]:
# Check the data
MR_df.head()

Unnamed: 0,index,label,review
0,8477,neg,except as an acting exercise or an exceptional...
1,4031,pos,japanese director shohei imamura 's latest fil...
2,10240,neg,i walked away not really know who `` they `` w...
3,8252,neg,what could have been a neat little story about...
4,1346,pos,no screen fantasy-adventure in recent memory h...


### 1b) String labels to numerical
Create a function that converts string labels to numerical labels

Function name: `convert_label`

The function should do the following:
- if the input label is "pos" return 1.0
- if the input label is "neg" return 0.0
- otherwise, return the input label as is

In [7]:
def convert_label(label):
    if label =="pos":
        output=1.0
    elif label=="neg":
        output=0.0
    else: 
        output=label
    return output

In [8]:
assert callable(convert_label)


### 1c) 
Convert all labels in `MR_df["label"]` to numerical labels, using the `convert_label` function

Save them as a new column named "y" in MR_df

In [9]:
# YOUR CODE HERE
MR_df["y"]=MR_df["label"].apply(convert_label)

In [10]:
assert set(MR_df['y']) == { 0., 1. }
# Check for roughly the right labels
assert 0.3 < MR_df['y'].mean() < 0.7


In [11]:
# Check the MR_df data
MR_df.head()

Unnamed: 0,index,label,review,y
0,8477,neg,except as an acting exercise or an exceptional...,0.0
1,4031,pos,japanese director shohei imamura 's latest fil...,1.0
2,10240,neg,i walked away not really know who `` they `` w...,0.0
3,8252,neg,what could have been a neat little story about...,0.0
4,1346,pos,no screen fantasy-adventure in recent memory h...,1.0


### 1d) Convert Text data into vector 

We will now create a "CountVectorizer" object to transform the text data into vectors with numerical values. 

To do so, we will initialize a "CountVectorizer" object, and name it as "vectorizer".

We need to pass 4 arguments to initialize a CountVectorizer:
  1. analyzer: 'word'
       Analyze the data at the word level
  2. max_features: 2000
       Set the maximum number of unique words to keep
  3. tokenizer: word_tokenize
       Tokenize the text data using `word_tokenize` from NLTK
  4. stop_words: stopwords.words('english')
       Remove all English stopwords. We do this since they generally don't provide useful discriminative information.
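
The sketch below imitates, in plain Python, what these arguments do together: tokenize, drop stop words, keep only the most frequent words across the corpus, and count them per review. The tiny stop list, the whitespace tokenizer, and the two reviews are hypothetical stand-ins; the real `CountVectorizer` with `word_tokenize` handles punctuation and tie-breaking in its own way.

```python
from collections import Counter

# Hypothetical stand-ins for stopwords.words('english') and the review data
stop_words = {"the", "a", "is", "and"}
reviews = [
    "the plot is thin and the acting is flat",
    "a sharp plot and sharp acting",
]
max_features = 3


def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace
    return text.lower().split()


# Count every non-stop-word token across the whole corpus
corpus_counts = Counter(
    token
    for review in reviews
    for token in tokenize(review)
    if token not in stop_words
)

# Keep only the most frequent tokens as the vocabulary
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# Encode each review as counts over that fixed vocabulary
vectors = [[tokenize(review).count(word) for word in vocabulary] for review in reviews]

print(vocabulary)
print(vectors)
```

Words that fall outside the top `max_features` ("thin" and "flat" here) simply do not appear in the vectors, so every review ends up with the same fixed length.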

In [12]:
# YOUR CODE HERE
vectorizer = CountVectorizer(analyzer="word", max_features=2000, tokenizer=word_tokenize, stop_words=stopwords.words('english'))


In [13]:
assert vectorizer.analyzer == 'word'
assert vectorizer.max_features == 2000
assert vectorizer.tokenizer == word_tokenize
assert vectorizer.stop_words == stopwords.words('english')
assert hasattr(vectorizer, "fit_transform")

### 1e) 

Transform reviews (`MR_df["review"]`) into vectors using the "vectorizer" we created above:

The method you will be using is: `MR_X = vectorizer.fit_transform(...).toarray()`

Note that we apply the `toarray` method at the end to cast the output to a numpy array. This is something we will do multiple times, to turn the sparse matrices returned by sklearn back into arrays. 

In [14]:
# YOUR CODE HERE
MR_X = vectorizer.fit_transform(MR_df["review"]).toarray()



In [15]:
assert type(MR_X) == np.ndarray


### 1f) 

Copy out `y` column in MR_df and save it as an np.ndarray named `MR_y`

Make sure the shape of `MR_y` is (5000,) - you may have to use `reshape` to do so. 

In [16]:
# YOUR CODE HERE
# `as_matrix()` was removed from pandas; `to_numpy()` returns the same ndarray
MR_y = MR_df["y"].to_numpy()
print(MR_y.shape)

(5000,)


In [17]:
assert MR_y.shape == (5000,)


### 1g) Split up train and test sets
We first set aside 80% of the data as the training set to train an SVM classifier. We will then test the learned classifier on the remaining 20% of the data samples.

- Calculate the number of training data samples (80% of total) and store it in `num_training`
- Calculate the number of test data samples (20% of total) and store it in `num_testing`
- Make sure both of these variables are of type `int`
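
One detail worth knowing: truncating both fractions with `int` independently can lose a row when the row count is not a multiple of 5. For the 5000 reviews here both approaches give the same answer, so this is only a defensive sketch; `split_sizes` is a hypothetical helper name, not part of the assignment.

```python
def split_sizes(n_rows, train_fraction=0.8):
    """Return (num_training, num_testing) that always sum to n_rows."""
    num_training = int(n_rows * train_fraction)
    # Derive the test size from the remainder instead of truncating again
    num_testing = n_rows - num_training
    return num_training, num_testing


print(split_sizes(5000))
print(split_sizes(7))
# Truncating both parts separately drops a row for 7 samples
print(int(7 * 0.8) + int(7 * 0.2))
```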

In [18]:
# YOUR CODE HERE
rows,cols=MR_df.shape
num_training=int(rows*(.8))
num_testing=int(rows*(.2))
print (num_training, num_testing)

4000 1000


In [19]:
assert type(num_training) == int
assert type(num_testing) == int


### 1h) 

Split the `MR_X` and `MR_y` into training set and test set. You should use the `num_training` variable to extract the data from MR_X and MR_y.
     
Extract the first 'num_training' samples as training data, and extract the rest as test data.

Name them as:
- `MR_train_X` and `MR_train_y` for the training set
- `MR_test_X` and `MR_test_y` for the test set

In [20]:
# YOUR CODE HERE
#train
MR_train_X=MR_X[0:num_training]
MR_train_y=MR_y[0:num_training]
#test
MR_test_X=MR_X[num_training:]
MR_test_y=MR_y[num_training:]
print ("Is everything the right shape?       ",len(MR_test_X)==len(MR_test_y)==num_testing)

Is everything the right shape?        True


In [21]:
assert MR_train_X.shape[0] == MR_train_y.shape[0]
assert MR_test_X.shape[0] == MR_test_y.shape[0]

assert len(MR_train_X) == 4000
assert len(MR_test_y) == 1000

### 1i) SVM

Define a function called `train_SVM` that initializes an SVM classifier and trains it

Inputs: 
- `X`: np.ndarray, training samples, 
- `y`: np.ndarray, training labels,
- `kernel`: string, set the default value of "kernel" as "linear"

Output: a trained classifier `clf`

Hint: There are 2 steps involved in this function:
- Initializing an SVM classifier: `clf = SVC(...)`
- Training the classifier: `clf.fit(X, y)`

In [22]:
def train_SVM(X, y, kernel='linear'):
    clf= SVC(kernel=kernel)
    clf.fit(X, y)
    
    return clf

In [23]:
assert callable(train_SVM)


### 1j) Train SVM

Train an SVM classifier on the samples `MR_train_X` and the labels `MR_train_y`

You need to call the function `train_SVM` you just created. Name the returned object as `MR_clf`.

Note that running this function may take anywhere from several seconds to a few minutes.

In [24]:
# YOUR CODE HERE
MR_clf=train_SVM(MR_train_X, MR_train_y)

In [25]:
assert isinstance(MR_clf, SVC)
assert hasattr(MR_clf, "predict")

### 1k) 

Predict labels for both training samples and test samples. You will need to use `MR_clf.predict(...)`

Name the predicted labels for the training samples as `MR_predicted_train_y`.
Name the predicted labels for the testing samples as `MR_predicted_test_y`.

Your code here will also take a minute to run.

In [26]:
# YOUR CODE HERE
MR_predicted_train_y=MR_clf.predict(MR_train_X)
MR_predicted_test_y=MR_clf.predict(MR_test_X)

In [27]:
# Now we will use the function 'classification_report'
#  to print out the performance of the classifier on the training set

# Your classifier should be able to reach above 90% accuracy on the training set
print(classification_report(MR_train_y,MR_predicted_train_y))

              precision    recall  f1-score   support

         0.0       0.91      0.92      0.91      2008
         1.0       0.92      0.91      0.91      1992

   micro avg       0.91      0.91      0.91      4000
   macro avg       0.91      0.91      0.91      4000
weighted avg       0.91      0.91      0.91      4000



In [28]:
# Tests for 1k
assert MR_predicted_train_y.shape == (4000,)
assert MR_predicted_test_y.shape == (1000,)

precision, recall, _, _ = precision_recall_fscore_support(MR_train_y,MR_predicted_train_y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.92, 0.02)


In [29]:
# And finally, we check the performance of the trained classifier on the test set

# Your classifier should be able to reach around 70% accuracy on the test set.
print(classification_report(MR_test_y, MR_predicted_test_y))

              precision    recall  f1-score   support

         0.0       0.70      0.68      0.69       482
         1.0       0.71      0.72      0.72       518

   micro avg       0.70      0.70      0.70      1000
   macro avg       0.70      0.70      0.70      1000
weighted avg       0.70      0.70      0.70      1000
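
To read the reports above, it can help to see how the per-class numbers are defined. The sketch below computes precision, recall, and F1 for one class from two label lists; the labels are made up for illustration, and `precision_recall_f1` is a hypothetical helper rather than the sklearn implementation.

```python
def precision_recall_f1(y_true, y_pred, positive=1.0):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical true and predicted labels
y_true = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The gap between roughly 0.91 on the training set and roughly 0.70 on the test set suggests that the classifier fits the training reviews more closely than it generalizes to unseen ones.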



# Part 2: TF-IDF

In this part, we will explore TF-IDF on sentiment analysis.

TF-IDF is used as an alternate way to encode text data, as compared to the BoW approach used in Part 1. 

To do so, we will:
- Transform the raw text data into vectors using TF-IDF
- Train an SVM classifier on the training set and report the performance of this classifier on the test set

### 2a) Text Data to Vectors

We will create a "TfidfVectorizer" object to transform the text data into vectors with TF-IDF

To do so, we will initialize a "TfidfVectorizer" object, and name it as "tfidf".

We need to pass 4 arguments into the "TfidfVectorizer" to initialize a "tfidf":
  1. sublinear_tf: True
       Set to apply sublinear TF scaling, i.e. replace tf with 1 + log(tf).
  2. analyzer: 'word'
       Set to analyze the data at the word-level
  3. max_features: 2000
       Set the max number of unique words
  4. tokenizer: word_tokenize
       Set to tokenize the text data by using the word_tokenizer from NLTK
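
As a small illustration of the first argument, sublinear scaling replaces a raw term count `tf` with `1 + log(tf)`, so repeating a word many times adds less and less weight. The sketch below shows only that dampening step (with a hypothetical helper name), not the full TF-IDF computation.

```python
import math


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, otherwise 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


for count in [0, 1, 2, 10, 100]:
    print(count, round(sublinear_tf(count), 3))
```

A word that appears 100 times ends up with a weight below 6 rather than 100, which keeps a few very repetitive reviews from dominating the feature vectors.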

In [30]:
# YOUR CODE HERE
tfidf=TfidfVectorizer(sublinear_tf=True, analyzer="word", 
                      max_features=2000, tokenizer=word_tokenize)

In [31]:
assert tfidf.analyzer == 'word'
assert tfidf.max_features == 2000
assert tfidf.tokenizer == word_tokenize
assert tfidf.stop_words is None
assert hasattr(tfidf, "fit_transform")

### 2b) Transform Reviews 

Transform the `review` column of MR_df into vectors using the `tfidf` we created above.

Save the transformed data into a variable called `MR_tfidf_X`

Hint: You might need to cast the datatype of `MR_tfidf_X` to `numpy.ndarray` by using `.toarray()`

In [32]:
# YOUR CODE HERE
MR_tfidf_X = tfidf.fit_transform(MR_df["review"]).toarray()

In [33]:
assert isinstance(MR_tfidf_X, np.ndarray)

assert "skills" in set(tfidf.stop_words_)
assert "risky" in set(tfidf.stop_words_)
assert "adopts" in set(tfidf.stop_words_)


### 2c) 
Split the `MR_tfidf_X` and `MR_y` into training set and test set. 

Name these variables as:
- `MR_train_tfidf_X` and `MR_train_tfidf_y` for the training set
- `MR_test_tfidf_X` and `MR_test_tfidf_y` for the test set

We will use the same 80/20 split as in part 1. You can use the same `num_training` variable from part 1 to split up the data.

In [34]:
# YOUR CODE HERE
#train
MR_train_tfidf_X=MR_tfidf_X[0:num_training]
MR_train_tfidf_y=MR_y[0:num_training]
#test
MR_test_tfidf_X=MR_tfidf_X[num_training:]
MR_test_tfidf_y=MR_y[num_training:]
print ("Is everything the right shape?       ",len(MR_test_tfidf_X)==len(MR_test_tfidf_y)==num_testing)

Is everything the right shape?        True


In [35]:
assert MR_train_tfidf_X[0].tolist() == MR_tfidf_X[0].tolist()
assert MR_train_tfidf_X.shape == (4000, 2000)
assert MR_test_tfidf_X.shape == (1000, 2000)

### 2d) Training

Train an SVM classifier on the samples `MR_train_tfidf_X` and the labels `MR_train_tfidf_y`.

You need to call the function `train_SVM` you created in part 1. Name the returned object as `MR_tfidf_clf`.

Note that this may take many seconds, up to a few minutes, to run.

In [36]:
# YOUR CODE HERE
MR_tfidf_clf=train_SVM(MR_train_tfidf_X, MR_train_tfidf_y)

In [37]:
assert isinstance(MR_tfidf_clf, SVC)
assert hasattr(MR_tfidf_clf, "predict")

### 2e) Prediction

Predict the labels for both the training and test samples (the 'X' data). You will need to use `MR_tfidf_clf.predict(...)`

Name the predicted labels on training samples as `MR_pred_train_tfidf_y`. Name the predicted labels on testing samples as `MR_pred_test_tfidf_y`

In [38]:
# YOUR CODE HERE
MR_pred_train_tfidf_y=MR_tfidf_clf.predict(MR_train_tfidf_X)
MR_pred_test_tfidf_y=MR_tfidf_clf.predict(MR_test_tfidf_X)

In [39]:
# Again, we use 'classification_report' to check the performance on the training set 

# Your classifier should be able to reach above 85% accuracy.
print(classification_report(MR_train_tfidf_y, MR_pred_train_tfidf_y))

              precision    recall  f1-score   support

         0.0       0.86      0.88      0.87      2008
         1.0       0.87      0.85      0.86      1992

   micro avg       0.87      0.87      0.87      4000
   macro avg       0.87      0.87      0.87      4000
weighted avg       0.87      0.87      0.87      4000



In [40]:
# Tests for 2e
precision, recall, _, _ = precision_recall_fscore_support(MR_train_tfidf_y, MR_pred_train_tfidf_y)
assert np.isclose(precision[0], 0.86, 0.02)
assert np.isclose(precision[1], 0.87, 0.02)


In [41]:
# And check performance on the test set

# Your classifier should be able to reach around 70% accuracy.
print(classification_report(MR_test_tfidf_y, MR_pred_test_tfidf_y))

              precision    recall  f1-score   support

         0.0       0.72      0.72      0.72       482
         1.0       0.74      0.74      0.74       518

   micro avg       0.73      0.73      0.73      1000
   macro avg       0.73      0.73      0.73      1000
weighted avg       0.73      0.73      0.73      1000



# Part 3: Sentiment Analysis on Customer Review with TF-IDF

In this part, we will use TF-IDF to analyse the sentiment of some Customer Review (CR) data.

The CR data contains around 3771 reviews, all collected from the Amazon website. The reviews are annotated by humans as either positive or negative. In this dataset, the 2 classes are not balanced, as there are roughly twice as many positive reviews as negative reviews.

For more information on this dataset, you can visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In this part, we have already split the data into a training set and a test set. The training set has labels for the reviews, but the test set doesn't. 

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

To do so, we will:
- Use the TF-IDF feature engineering method to encode the raw text data into vectors
- Train an SVM classifier on the training set
- Predict labels for the reviews in the test set

The performance of your trained classifier on the test set will be checked by a hidden test.

### 3a) Loading the data

The customer review task has 2 files:
- "data/custrev_train.tsv" contains training data with labels
- "data/custrev_test.tsv" contains test data without labels, which need to be predicted

Import the raw text file `data/custrev_train.tsv` into a DataFrame called `CR_train_df`. Set the column names as `index`, `label`, `review`.

Import the raw text file `data/custrev_test.tsv` into a DataFrame called `CR_test_df`. Set the column names as `index`, `review`.

Note that both will need to be imported with the `sep` and `header` arguments (like in 1a).

In [42]:
CR_train_file='data/custrev_train.tsv'
CR_test_file = 'data/custrev_test.tsv'

# YOUR CODE HERE
CR_train_df=pd.read_csv(CR_train_file, sep='\t', header=None, names=['index', 'label', 'review'])
CR_test_df=pd.read_csv(CR_test_file, sep='\t', header=None, names=['index', 'review'])

In [43]:
assert isinstance(CR_train_df, pd.DataFrame)
assert list(CR_train_df.columns) == ['index', 'label', 'review']
assert CR_train_df.shape == (3016, 3)

assert isinstance(CR_test_df, pd.DataFrame)
assert list(CR_test_df.columns) == ['index', 'review']
assert CR_test_df.shape == (755, 2)

### 3b) 
Concatenate the 2 DataFrames (`CR_train_df` and `CR_test_df`) into 1 DataFrame, and name it `CR_df`

In [44]:
# YOUR CODE HERE
print (CR_train_df.shape)
print (CR_test_df.shape)
CR_df=pd.concat([CR_train_df, CR_test_df])
print (CR_df.shape)

(3016, 3)
(755, 2)
(3771, 3)


In [45]:
assert len(CR_df) == 3771


### 3c) 

Convert all labels in `CR_df["label"]` using the function we defined above `convert_label`. Save these numerical labels as a new column named `y` in CR_df.
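
After concatenation, the rows that came from `CR_test_df` have no `label`, so pandas fills them with NaN. Because `convert_label` returns any unrecognized input unchanged, those NaN values carry over into `y`, which is what the provided cell further down relies on when it calls `isnull()`. The plain-Python sketch below shows the same idea on a made-up label list.

```python
import math


def convert_label(label):
    # Same behaviour as the function defined in 1b
    if label == "pos":
        return 1.0
    if label == "neg":
        return 0.0
    return label


# Hypothetical combined label column: the last two rows have no label
labels = ["pos", "neg", "pos", float("nan"), float("nan")]
y = [convert_label(label) for label in labels]

# Rows whose y is NaN are the unlabeled (test) rows
is_unlabeled = [isinstance(value, float) and math.isnan(value) for value in y]

print(y)
print(is_unlabeled)
```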

In [46]:
# YOUR CODE HERE
CR_df["y"]=CR_df['label'].apply(convert_label)

In [47]:
assert isinstance(CR_df['y'], pd.Series)

### 3d) 

Transform reviews `CR_df["review"]` into vectors using the `tfidf` vectorizer we created in part 2. Save the transformed data into a variable called `CR_tfidf_X`.

In [48]:
# YOUR CODE HERE
CR_tfidf_X = tfidf.fit_transform(CR_df["review"]).toarray()

In [49]:
assert isinstance(CR_tfidf_X, np.ndarray)
assert CR_tfidf_X.shape == (3771, 2000)

In [50]:
# Here we will collect all training samples & numerical labels from CR_tfidf_X [code provided]
#   The code provided below will extract all samples with labels from the dataframe

CR_train_X = CR_tfidf_X[~CR_df['y'].isnull()]
CR_train_y = CR_df['y'][~CR_df['y'].isnull()]

# Note: if these asserts fail, something went wrong
#  Go back and check your code (in part 3) above this cell
assert CR_train_X.shape == (3016, 2000)
assert CR_train_y.shape == (3016, )

### 3e) 

Train an SVM classifier on the samples `CR_train_X` and the labels `CR_train_y`
- You need to call the function `train_SVM` you created above.
- Name the returned object as `CR_clf`.
- Note that this function may take anywhere from several seconds to a few minutes to run.

In [51]:
# YOUR CODE HERE
CR_clf=train_SVM(CR_train_X, CR_train_y)

In [52]:
assert isinstance(CR_clf, SVC)

### 3f) 

Predict labels on the training set, and name the returned variable as `CR_pred_train_y`

In [53]:
# YOUR CODE HERE
CR_pred_train_y=CR_clf.predict(CR_train_X)

In [54]:
# Check the classifier accuracy on the train data
#   Note that your classifier should be able to reach above 90% accuracy.
print(classification_report(CR_train_y, CR_pred_train_y))

              precision    recall  f1-score   support

         0.0       0.90      0.84      0.87      1097
         1.0       0.91      0.95      0.93      1919

   micro avg       0.91      0.91      0.91      3016
   macro avg       0.91      0.89      0.90      3016
weighted avg       0.91      0.91      0.91      3016
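
Because the classes are imbalanced, accuracy should be read against a simple baseline. Using the support counts printed in the report above (1097 negative, 1919 positive), the sketch below computes the accuracy of a classifier that always predicts the majority class.

```python
# Support counts taken from the classification report above
n_neg, n_pos = 1097, 1919
total = n_neg + n_pos

# Accuracy of always predicting the more frequent class
majority_accuracy = max(n_neg, n_pos) / total
print(round(majority_accuracy, 3))  # 0.636
```

Any score on this training set is therefore best compared with about 0.64 rather than with 0.5, and the lower recall for class 0.0 in the report is consistent with the minority class being harder to recover.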



In [55]:
# Tests for 3f
precision, recall, _, _ = precision_recall_fscore_support(CR_train_y, CR_pred_train_y)
assert np.isclose(precision[0], 0.90, 0.02)
assert np.isclose(precision[1], 0.91, 0.02)

In [56]:
# Collect all test samples from CR_tfidf_X
CR_test_X = CR_tfidf_X[CR_df['y'].isnull()]

### 3g) 
Predict the labels on the test set. Name the returned variable as `CR_pred_test_y`

In [57]:
# YOUR CODE HERE
CR_pred_test_y=CR_clf.predict(CR_test_X)

In [58]:
assert isinstance(CR_test_X, np.ndarray)
assert isinstance(CR_pred_test_y, np.ndarray)

### 3h) 

Convert the predicted numerical labels back to string labels.

Create a column called `label` in `CR_test_df` to store the converted labels.

In [59]:
# YOUR CODE HERE
def convert_label_tostr(label):
    if label ==1.0:
        output="pos"
    elif label==0:
        output="neg"
    else: 
        output=label
    return output


CR_pred_test_y

array([1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
       1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 1., 0., 1.,
       1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0.,
       1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
       0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 0.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 0., 1., 1.,
       1., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1.,
       1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
       1., 0., 1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.,
       1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.,
       0., 1., 1., 1., 1.

In [60]:
string_label=[convert_label_tostr(i) for i in CR_pred_test_y]
CR_test_df["label"]=string_label

In [61]:
assert isinstance(CR_test_df['label'], pd.Series)
assert set(CR_test_df['label']) == {'neg', 'pos'}
