# COGS 108 - Assignment 6: Natural Language Processing

This assignment covers working with text data and NLP.

This assignment is out of 8 points, worth 8% of your grade.

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**

# Important

- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
    - In particular many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values. 
        - It is up to you to check the values, and make sure they seem reasonable.
- Reminder: if things start behaving strangely, restart the kernel and re-run the code as a first check.
    - For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail. 
    - Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

# Background & Work Flow

- In this homework assignment, we will be analyzing text data. A common approach to analyzing text data is to use methods that allow us to convert text data into some kind of numerical representation - since we can then use all of our mathematical tools on such data. In this assignment, we will explore 2 feature engineering methods that convert raw text data into numerical vectors:
    - **Bag of Words (BoW)**
        - BoW encodes an input sentence as the frequency of each word in the sentence. 
        - In this approach, all words contribute equally to the feature vectors.
    - **Term Frequency - Inverse Document Frequency (TF-IDF)**
        - TF-IDF is a measure of how important each term is to a specific document, as compared to an overall corpus. 
        - TF-IDF encodes each word as its frequency in the document of interest, divided by a measure of how common the word is across all documents (the corpus).
        - Using this approach, each word contributes differently to the feature vectors.
        - The assumption behind using TF-IDF is that words that appear commonly everywhere are not that informative about what is specifically interesting about a document of interest, so it is tuned to representing a document in terms of the words it uses that are different from other documents. 
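
To make the contrast concrete, here is a minimal sketch on a made-up three-document corpus. It assumes whitespace tokenization and the textbook weighting `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only
docs = [
    "a great great movie",
    "a boring movie",
    "a great cast",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of Words: raw count of each vocabulary word in each document
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[w] for w in vocab])

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Textbook inverse document frequency (an assumption, not sklearn's exact formula)
idf = {w: math.log(len(docs) / df[w]) for w in vocab}

# TF-IDF: scale each count by how rare the word is across the corpus
tfidf = [[count * idf[w] for count, w in zip(row, vocab)] for row in bow]

for name, rows in (("BoW", bow), ("TF-IDF", tfidf)):
    print(name)
    for row in rows:
        print("  ", dict(zip(vocab, (round(v, 2) for v in row))))
```

The word "a" appears in every document, so its TF-IDF weight is zero, while a word such as "boring" that appears in only one document is weighted most heavily.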

- To compare those 2 methods, we will first apply them on the same Movie Review dataset to analyse sentiment (how positive or negative a text is). In order to make the comparison fair, an **SVM (support vector machine)** classifier will be used to classify positive reviews and negative reviews.

- SVM is a simple yet powerful and interpretable linear model. To use it as a classifier, we need to have at least 2 splits of the data: training data and test data. The training data is used to tune the weight parameters in the SVM to learn an optimal way to classify the training data. We can then test this trained SVM classifier on the test data, to see how well it works on data that the classifier has not seen before. 
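
As a rough illustration of what "weight parameters" means for a linear classifier, the sketch below scores a feature vector as a weighted sum plus a bias and predicts the positive class when the score is above zero. The feature names, weights, and bias are hypothetical values chosen by hand; a real SVM learns them from the training data.

```python
def linear_predict(weights, bias, x):
    """Return 1.0 (pos) if the weighted sum of the features is positive, else 0.0 (neg)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 if score > 0 else 0.0

# Hypothetical weights for the word-count features ["great", "boring", "movie"]
weights = [1.2, -1.5, 0.0]
bias = -0.1

pos_review = [2, 0, 1]  # "great" twice, "movie" once
neg_review = [0, 1, 1]  # "boring" once, "movie" once

predictions = [linear_predict(weights, bias, x) for x in (pos_review, neg_review)]
print(predictions)
```

A weight near zero (here, for "movie") means the word barely affects the decision, which is one reason a linear model is easy to interpret.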

In [1]:
# Imports - these are all the imports needed for the assignment
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

For this assignment we will be using `nltk`: the Natural Language Toolkit.

To do so, we will need to download some text data.

Natural language processing (NLP) often requires corpus data (lists of words and/or example text data). We will download that data now, if you don't already have it.

In [2]:
# Download the NLTK English tokenizer and the stopwords of all languages
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rabwa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rabwa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Part 1: Sentiment Analysis on Movie Review Data (4.75 points)

In part 1 we will apply sentiment analysis to Movie Review (MR) data.

- The MR data contains more than 10,000 reviews collected from the IMDB website, and each review is annotated as either positive or negative. The numbers of positive and negative reviews are roughly the same. For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/

- For this homework assignment, we've already shuffled the data, and truncated the data to contain only 5000 reviews.

In this part of the assignment we will:
- Transform the raw text data into vectors with the BoW encoding method
- Split the data into training and test sets
- Write a function to train an SVM classifier on the training set
- Test this classifier on the test set and report the results

### 1a) Import data

Import the text file 'rt-polarity.tsv' in the `data/` directory into a DataFrame called `MR_df`.

Set the column names as 'index', 'label', 'review'

Note that 'rt-polarity.tsv' is a tab separated raw text file, in which data is separated by tabs ('\t'). You can load this file with `read_csv`, specifying the `sep` (separator) argument as tabs ('\t'). You will have to set `header` as None.

In [3]:
# YOUR CODE HERE
file_path = 'data/rt-polarity.tsv'

MR_df = pd.read_csv(file_path, sep='\t', header=None, names=['index', 'label', 'review'])

FileNotFoundError: [Errno 2] No such file or directory: 'data/rt-polarity.tsv'

In [4]:
assert isinstance(MR_df, pd.DataFrame)


NameError: name 'MR_df' is not defined

In [5]:
# Check the data
MR_df.head()

NameError: name 'MR_df' is not defined

### 1b) Create a function that converts string labels to numerical labels

Function name: `convert_label`

The function should do the following:
- take two parameters `label` and `direction`
- if `direction` is 'tonumber', 
    - and if the input label is "pos" return 1.0
    - and if the input label is "neg" return 0.0
    - otherwise, return the input label as is
- if `direction` is 'tolabel'
    - and if the input label is `1.0` return "pos"
    - and if the input label is `0.0` return "neg"
    - otherwise, return the label as is
        

In [6]:
# YOUR CODE HERE
def convert_label(label, direction):
    if direction == 'tonumber':
        return 1.0 if label == 'pos' else 0.0 if label == 'neg' else label
    elif direction == 'tolabel':
        return 'pos' if label == 1.0 else 'neg' if label == 0.0 else label
    else:
        raise ValueError("Invalid direction")

In [7]:
assert convert_label


In [8]:
assert callable(convert_label)


### 1c) Numerical Labels

Convert all labels in `MR_df["label"]` to numerical labels, using the `convert_label` function. Be sure to specify the appropriate argument to the `direction` parameter.

Save them as a new column named "Y" in `MR_df`. 

In [9]:
# YOUR CODE HERE
MR_df['Y'] = MR_df['label'].apply(lambda x: convert_label(x, 'tonumber'))

NameError: name 'MR_df' is not defined

In [10]:
assert sorted(set(MR_df['Y'])) == [0., 1.]


NameError: name 'MR_df' is not defined

In [11]:
# Check the MR_df data
MR_df.head()

NameError: name 'MR_df' is not defined

### 1d) Convert text data into vectors

We will now create a `CountVectorizer` object to transform the text data into vectors with numerical values. 

To do so, we will initialize a `CountVectorizer` object, and name it as `vectorizer`.

We need to pass 4 arguments to initialize a CountVectorizer:
  1. `analyzer`: `'word'` 
          Analyze the data at the word level.
  2. `max_features`: `2000`
          Set a max number of unique words.
  3. `tokenizer`: `word_tokenize`
          Set to tokenize the text data by using `word_tokenize` from NLTK.
  4. `stop_words`: `stopwords.words('english')`
          Set to remove all stopwords in English. We do this since they generally don't provide useful discriminative information.
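
The effect of `stop_words` and `max_features` can be sketched in plain Python: drop the stopwords, count the remaining words across the whole corpus, and keep only the most frequent ones as the vocabulary. The reviews and the tiny stopword set below are made up for illustration (NLTK's English list is much longer), and whitespace splitting stands in for `word_tokenize`.

```python
from collections import Counter

# Made-up reviews, for illustration only
reviews = [
    "the acting is flat and the plot is thin",
    "the acting is superb",
    "a thin plot but superb acting",
    "superb acting",
]
stop_words = {"the", "is", "and", "a", "but"}  # tiny illustrative list, not NLTK's
max_features = 2

# Tokenize and remove stopwords
tokenized = [[w for w in r.split() if w not in stop_words] for r in reviews]

# Keep only the `max_features` most frequent words across the corpus
corpus_counts = Counter(w for doc in tokenized for w in doc)
vocab = sorted(w for w, _ in corpus_counts.most_common(max_features))

# One count vector per review, with one column per vocabulary word
vectors = [[doc.count(w) for w in vocab] for doc in tokenized]

print(vocab)
print(vectors)
```

With `max_features=2000`, every review in this assignment ends up as a vector of length 2000, no matter how long the review is.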

In [12]:
# YOUR CODE HERE
vectorizer = CountVectorizer(analyzer='word',
                             max_features=2000,
                             tokenizer=word_tokenize,
                             stop_words=stopwords.words('english'))

In [13]:
assert vectorizer.analyzer == 'word'
assert vectorizer.max_features == 2000
assert vectorizer.tokenizer == word_tokenize
assert vectorizer.stop_words == stopwords.words('english')
assert hasattr(vectorizer, "fit_transform")

### 1e) Vectorize reviews

Transform reviews `MR_df["review"]` into vectors using the `vectorizer` we created above:

The method you will be using is: `MR_X = vectorizer.fit_transform(...).toarray()`

Note that we apply the `toarray` method to cast the output to a numpy array. This is something we will do multiple times, converting the sparse matrices that sklearn returns back into arrays. 

Note that this may raise a warning about stopwords. This is ok.

In [14]:
# YOUR CODE HERE
MR_X = vectorizer.fit_transform(MR_df["review"]).toarray()

NameError: name 'MR_df' is not defined

In [15]:
assert type(MR_X) == np.ndarray


NameError: name 'MR_X' is not defined

### 1f)  Outcome variable

Store the `Y` column in `MR_df` as an np.array named `MR_Y`

Make sure the shape of `MR_Y` is (5000,) - depending upon your earlier approach, you *may* have to use `reshape` to do so. 

In [16]:
# YOUR CODE HERE
MR_Y = MR_df['Y'].values
MR_Y = MR_Y.reshape(-1)

NameError: name 'MR_df' is not defined

In [17]:
assert MR_Y.shape == (5000,)


NameError: name 'MR_Y' is not defined

### 1g) Defining the train & test sets

Now we'll use `sklearn`'s `train_test_split()` function to define our train and test sets. Store train data (predictors) into `MR_train_X` and labels (outcomes) into `MR_train_Y`. Similarly, store test data into `MR_test_X` and test labels into `MR_test_Y`.

In addition to providing the predictors (`MR_X`) and outcomes (`MR_Y`) to the function, we will use the following arguments for this task:
- `test_size`: 0.2
- `random_state`: 200
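
Conceptually, a seeded split shuffles the row indices reproducibly and holds out a fixed fraction for testing. The sketch below uses a hypothetical helper, `shuffled_split`, built on Python's `random` module. It is not how `train_test_split` is implemented and will not select the same rows, but it shows why a fixed seed gives the same split every time.

```python
import random

def shuffled_split(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and hold out `test_size` of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = shuffled_split(5000, test_size=0.2, seed=200)
repeat_train_idx, repeat_test_idx = shuffled_split(5000, test_size=0.2, seed=200)

print(len(train_idx), len(test_idx))
print(train_idx == repeat_train_idx)  # same seed, same split
```

Because the split depends only on the seed and the number of rows, reusing the same `random_state` later lets two feature encodings be evaluated on the same held-out reviews.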

In [18]:
# YOUR CODE HERE
MR_train_X, MR_test_X, MR_train_Y, MR_test_Y = train_test_split(
    MR_X, MR_Y, test_size=0.2, random_state=200)

NameError: name 'MR_X' is not defined

In [19]:
assert MR_train_X.shape[0] == MR_train_Y.shape[0]
assert MR_test_X.shape[0] == MR_test_Y.shape[0]

assert len(MR_train_X) == 4000
assert len(MR_test_Y) == 1000

NameError: name 'MR_train_X' is not defined

### 1i) SVM

Define a function called `train_SVM` that initializes an SVM classifier and trains it

Inputs: 
- `X`: np.ndarray, training samples, 
- `y`: np.ndarray, training labels,
- `kernel`: string, set the default value of "kernel" as "linear"

Output: a trained classifier `clf`

Hint: There are 2 steps involved in this function:
- Initializing an SVM classifier: `clf = SVC(...)`
- Training the classifier: `clf.fit(X, y)`

In [20]:
def train_SVM(X, y, kernel='linear'):
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    return clf

In [21]:
assert callable(train_SVM)


### 1j) Train SVM

Train an SVM classifier with the default linear kernel on the samples `MR_train_X` and the labels `MR_train_Y`

You need to call the function `train_SVM` you just created. Name the returned object as `MR_clf`.

Note: training your model may take many seconds / up to a few minutes to run.

In [22]:
# YOUR CODE HERE
# train_SVM was already defined in 1i, so we only need to call it here
MR_clf = train_SVM(MR_train_X, MR_train_Y)

NameError: name 'MR_train_X' is not defined

In [23]:
assert isinstance(MR_clf, SVC)
assert hasattr(MR_clf, "predict")

NameError: name 'MR_clf' is not defined

### 1k) Predict outcome

Predict labels for both training samples and test samples. You will need to use `MR_clf.predict(...)`

Name the predicted labels for the training samples as `MR_predicted_train_Y`.
Name the predicted labels for the testing samples as `MR_predicted_test_Y`.

Note: Your code here will also take a minute to run.

In [24]:
# YOUR CODE HERE
MR_predicted_train_Y = MR_clf.predict(MR_train_X)
MR_predicted_test_Y = MR_clf.predict(MR_test_X)

NameError: name 'MR_clf' is not defined

Now we will use the function `classification_report` to print out the performance of the classifier on the training set:

In [25]:
# Your classifier should be able to reach above 90% accuracy 
# on the training set
print(classification_report(MR_train_Y,MR_predicted_train_Y))

NameError: name 'MR_train_Y' is not defined

And finally, we check the performance of the trained classifier on the test set:

In [26]:
# Your classifier should be able to reach around 67% accuracy on the test set.
print(classification_report(MR_test_Y, MR_predicted_test_Y))

NameError: name 'MR_test_Y' is not defined
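
The per-class precision and recall that `classification_report` prints can be computed by hand. The sketch below uses made-up labels and a hypothetical helper, `precision_recall`, to show the definitions: precision is the fraction of predictions for a class that were correct, and recall is the fraction of true members of a class that were found.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: the classifier is too eager to predict 0.0 (neg)
y_true = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

pos_precision, pos_recall = precision_recall(y_true, y_pred, positive=1.0)
neg_precision, neg_recall = precision_recall(y_true, y_pred, positive=0.0)
print("pos:", pos_precision, pos_recall)
print("neg:", neg_precision, neg_recall)
```

`precision_recall_fscore_support` returns these values as arrays ordered by sorted label, so index 0 corresponds to the label `0.0` (neg) and index 1 to `1.0` (pos).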

In [27]:
assert MR_predicted_train_Y.shape == (4000,)
assert MR_predicted_test_Y.shape == (1000,)

precision, recall, _, _ = precision_recall_fscore_support(MR_train_Y,MR_predicted_train_Y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.92, 0.02)


NameError: name 'MR_predicted_train_Y' is not defined
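
The asserts above pass `0.02` as the third positional argument of `np.isclose`, which is the relative tolerance `rtol`. The comparison is `abs(a - b) <= atol + rtol * abs(b)`, so for a target of 0.91 the accepted window is roughly 0.91 ± 0.018. The sketch below re-implements that rule in plain Python (the helper `is_close` is a stand-in, not the numpy function) to show which precision values would pass.

```python
def is_close(a, b, rtol=1e-05, atol=1e-08):
    """Plain-Python stand-in for the comparison rule used by np.isclose."""
    return abs(a - b) <= atol + rtol * abs(b)

candidates = (0.88, 0.89, 0.90, 0.92, 0.93)
accepted = [p for p in candidates if is_close(p, 0.91, 0.02)]
print(accepted)
```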

# Part 2: TF-IDF (1.5 points)

In this part, we will explore TF-IDF on sentiment analysis.

TF-IDF is used as an alternate way to encode text data, as compared to the BoW approach used in Part 1. 

To do this, we will:
- Transform the raw text data into vectors using TF-IDF
- Train an SVM classifier on the training set and report the performance of this classifier on the test set

### 2a) Text Data to Vectors

We will create a `TfidfVectorizer` object to transform the text data into vectors with TF-IDF

To do so, we will initialize a `TfidfVectorizer` object, and name it as `tfidf`.

We need to pass 4 arguments to `TfidfVectorizer` to initialize `tfidf`:
  1. `sublinear_tf`: `True`
           Set to apply sublinear TF scaling (replace `tf` with `1 + log(tf)`).
  2. `analyzer`: `'word'`
           Set to analyze the data at the word-level
  3. `max_features`: `2000`
           Set the max number of unique words
  4. `tokenizer`: `word_tokenize`
           Set to tokenize the text data by using `word_tokenize` from NLTK
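
With `sublinear_tf=True`, a raw term count `tf` is replaced by `1 + log(tf)`, so a word that appears ten times in a review counts for more than a word that appears once, but not ten times more. A minimal sketch of that scaling, using a hypothetical helper `sublinear_tf`:

```python
import math

def sublinear_tf(count):
    """Dampen a raw term count: 1 + ln(count) for positive counts, 0 otherwise."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# Compare raw counts with their dampened values
scaled = {count: round(sublinear_tf(count), 3) for count in (0, 1, 2, 10)}
for count, value in scaled.items():
    print(count, "->", value)
```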

In [28]:
# YOUR CODE HERE
tfidf = TfidfVectorizer(sublinear_tf=True,
                        analyzer='word',
                        max_features=2000,
                        tokenizer=word_tokenize)

In [29]:
assert tfidf.analyzer == 'word'
assert tfidf.max_features == 2000
assert tfidf.tokenizer == word_tokenize
assert tfidf.stop_words is None
assert hasattr(tfidf, "fit_transform")

### 2b) Transform Reviews 

Transform the `review` column of `MR_df` into vectors using the `tfidf` we created above.

Save the transformed data into a variable called `MR_tfidf_X`

Hint: You might need to cast the datatype of `MR_tfidf_X` to `numpy.ndarray` by using `.toarray()`

In [30]:
# YOUR CODE HERE
MR_tfidf_X = tfidf.fit_transform(MR_df['review']).toarray()

NameError: name 'MR_df' is not defined

In [31]:
assert isinstance(MR_tfidf_X, np.ndarray)

assert "skills" in set(tfidf.stop_words_)
assert "risky" in set(tfidf.stop_words_)
assert "adopts" in set(tfidf.stop_words_)


NameError: name 'MR_tfidf_X' is not defined

### 2c) 
Again, using `train_test_split`, split the `MR_tfidf_X` and `MR_Y` into training set and test set. 

Name these variables as:
- `MR_train_tfidf_X` and `MR_train_tfidf_Y` for the training set
- `MR_test_tfidf_X` and `MR_test_tfidf_Y` for the test set

We will use the same 80/20 split as in part 1 and same arguments for the parameters `test_size` (0.2) and `random_state` (200). 

In [32]:
# YOUR CODE HERE
MR_train_tfidf_X, MR_test_tfidf_X, MR_train_tfidf_Y, MR_test_tfidf_Y = train_test_split(
    MR_tfidf_X, MR_Y, test_size=0.2, random_state=200)

NameError: name 'MR_tfidf_X' is not defined

In [33]:
assert MR_train_tfidf_X.shape == (4000, 2000)
assert MR_test_tfidf_X.shape == (1000, 2000)
assert MR_train_tfidf_Y.shape == (4000,)
assert MR_test_tfidf_Y.shape == (1000,)

NameError: name 'MR_train_tfidf_X' is not defined

### 2d) Training

Train an SVM classifier on the samples `MR_train_tfidf_X` and the labels `MR_train_tfidf_Y`.

You need to call the function `train_SVM` you created in part 1. Name the returned object as `MR_tfidf_clf`.

Note: training your model may take many seconds, up to a few minutes, to run.

In [34]:
# YOUR CODE HERE
MR_tfidf_clf = train_SVM(MR_train_tfidf_X, MR_train_tfidf_Y)

NameError: name 'MR_train_tfidf_X' is not defined

In [35]:
assert isinstance(MR_tfidf_clf, SVC)
assert hasattr(MR_tfidf_clf, "predict")

NameError: name 'MR_tfidf_clf' is not defined

### 2e) Prediction

Predict the labels for both the training and test samples (the 'X' data). You will need to use `MR_tfidf_clf.predict(...)`

Name the predicted labels on training samples as `MR_pred_train_tfidf_Y`. Name the predicted labels on testing samples as `MR_pred_test_tfidf_Y`

Note: this may take a few seconds to run.

In [36]:
# YOUR CODE HERE
MR_pred_train_tfidf_Y = MR_tfidf_clf.predict(MR_train_tfidf_X)
MR_pred_test_tfidf_Y = MR_tfidf_clf.predict(MR_test_tfidf_X)

NameError: name 'MR_tfidf_clf' is not defined

Again, we use `classification_report` to check the performance on the training set.

In [37]:
# Your classifier should be able to reach above 86% accuracy.
print(classification_report(MR_train_tfidf_Y, MR_pred_train_tfidf_Y))

NameError: name 'MR_train_tfidf_Y' is not defined

Again, check performance on the test set:

In [38]:
# Your classifier should be able to reach around 71% accuracy.
print(classification_report(MR_test_tfidf_Y, MR_pred_test_tfidf_Y))

NameError: name 'MR_test_tfidf_Y' is not defined

In [39]:
precision, recall, _, _ = precision_recall_fscore_support(MR_train_tfidf_Y, MR_pred_train_tfidf_Y)
assert np.isclose(precision[0], 0.85, 0.02)
assert np.isclose(precision[1], 0.89, 0.02)


NameError: name 'MR_train_tfidf_Y' is not defined

### Written Answer Question

How does the performance of the TF-IDF classifier compare to the classifier used in part 1?

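On the test set, the TF-IDF classifier does better: it should reach around 71% accuracy, compared to around 67% for the BoW classifier in part 1. Its training accuracy is lower (above 86% versus above 90%), so the gap between training and test performance is smaller.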

# Part 3: Sentiment Analysis on Customer Review with TF-IDF (2 points)

In this part, we will use TF-IDF to analyse the sentiment of some Customer Review (CR) data.

The CR data contains 3771 reviews, all collected from the Amazon website. The reviews are annotated by humans as either positive reviews or negative reviews. In this dataset, the 2 classes are not balanced, as there are twice as many positive reviews as negative reviews.

For more information on this dataset, you can visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In this part, we have already split the data into a training set and a test set, in which the training set has labels for the reviews, but the test set doesn't. 

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

To do so, we will:
- Use the TF-IDF feature engineering method to encode the raw text data into vectors
- Train an SVM classifier on the training set
- Predict labels for the reviews in the test set

The performance of your trained classifier on the test set will be checked by a hidden test.

### 3a) Loading the data

The customer review task has 2 files:
- "data/custrev_train.tsv" contains training data with labels
- "data/custrev_test.tsv" contains test data without labels which need to be predicted 

Import the raw text file `data/custrev_train.tsv` into a DataFrame called `CR_train_df`. Set the column names as `index`, `label`, `review`.

Import the raw text file `data/custrev_test.tsv` into a DataFrame called `CR_test_df`. Set the column names as `index`, `review`

Note that both will need to be imported with `sep` and `header` arguments (like in 1a)

In [40]:
CR_train_file = 'data/custrev_train.tsv'
CR_test_file = 'data/custrev_test.tsv'

CR_train_df = pd.read_csv(CR_train_file, sep='\t', header=None, names=['index', 'label', 'review'])
CR_test_df = pd.read_csv(CR_test_file, sep='\t', header=None, names=['index', 'review'])

FileNotFoundError: [Errno 2] No such file or directory: 'data/custrev_train.tsv'

In [41]:
assert isinstance(CR_train_df, pd.DataFrame)
assert isinstance(CR_test_df, pd.DataFrame)

NameError: name 'CR_train_df' is not defined

### 3b) Concatenation
Concatenate the 2 DataFrames from the last step into a single DataFrame, and name it `CR_df`. 

In [42]:
# YOUR CODE HERE
CR_df = pd.concat([CR_train_df, CR_test_df], ignore_index=True)

NameError: name 'CR_train_df' is not defined

In [43]:
assert len(CR_df) == 3771


NameError: name 'CR_df' is not defined

### 3c) Cleaning

Convert all labels in `CR_df["label"]` using the `convert_label` function we defined above. Save these numerical labels as a new column named `Y` in CR_df.
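
Because `convert_label` returns any unrecognized input unchanged, the rows that came from the unlabeled test file keep their missing label (NaN) after conversion. That is what later makes it possible to separate labeled from unlabeled rows with `isnull()`. The sketch below uses a hypothetical stand-in, `to_number`, with the same pass-through rule.

```python
import math

def to_number(label):
    # Hypothetical stand-in with the same pass-through rule as `convert_label`
    return {"pos": 1.0, "neg": 0.0}.get(label, label)

# Two labeled rows followed by one unlabeled row (missing label)
labels = ["pos", "neg", float("nan")]
converted = [to_number(label) for label in labels]

# The unlabeled row is still NaN after conversion
is_unlabeled = [isinstance(y, float) and math.isnan(y) for y in converted]
print(converted)
print(is_unlabeled)
```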

In [44]:
# YOUR CODE HERE
CR_df['Y'] = CR_df["label"].apply(lambda x: convert_label(x, 'tonumber'))

NameError: name 'CR_df' is not defined

In [45]:
assert isinstance(CR_df['Y'], pd.Series)

NameError: name 'CR_df' is not defined

### 3d)  Use `tfidf`

Transform reviews `CR_df["review"]` into vectors using the `tfidf` vectorizer we created in part 2. Save the transformed data into a variable called `CR_tfidf_X`.

In [46]:
# YOUR CODE HERE
CR_tfidf_X = tfidf.fit_transform(CR_df["review"]).toarray()

NameError: name 'CR_df' is not defined

In [47]:
assert isinstance(CR_tfidf_X, np.ndarray)


NameError: name 'CR_tfidf_X' is not defined

Here we will collect all training samples & numerical labels from `CR_tfidf_X`. The code provided below will extract all samples with labels from the dataframe:


In [48]:
# code provided to collect labels
CR_train_X = CR_tfidf_X[~CR_df['Y'].isnull()]
CR_train_Y = CR_df['Y'][~CR_df['Y'].isnull()]

# Note: if these asserts fail, something went wrong
#  Go back and check your code (in part 3) above this cell
assert CR_train_X.shape == (3016, 2000)
assert CR_train_Y.shape == (3016, )

NameError: name 'CR_tfidf_X' is not defined

### 3e) SVM 

Train an SVM classifier on the samples `CR_train_X` and the labels `CR_train_Y`:
- You need to call the function `train_SVM` you created above.
- Name the returned object as `CR_clf`.

Note: training your model may take many seconds / up to a few minutes to run.

In [49]:
# YOUR CODE HERE
CR_clf = train_SVM(CR_train_X, CR_train_Y)

NameError: name 'CR_train_X' is not defined

In [50]:
assert isinstance(CR_clf, SVC)

NameError: name 'CR_clf' is not defined

### 3f) Predict: training data

Predict labels on the training set, and name the returned variable as `CR_pred_train_Y`

In [51]:
# YOUR CODE HERE
CR_pred_train_Y = CR_clf.predict(CR_train_X)

NameError: name 'CR_clf' is not defined

In [52]:
# Check the classifier accuracy on the train data
#   Note that your classifier should be able to reach above 90% accuracy.
print(classification_report(CR_train_Y, CR_pred_train_Y))

NameError: name 'CR_train_Y' is not defined

In [53]:
precision, recall, _, _ = precision_recall_fscore_support(CR_train_Y, CR_pred_train_Y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.91, 0.02)

NameError: name 'CR_train_Y' is not defined

In [54]:
# Collect all test samples from CR_tfidf_X
CR_test_X = CR_tfidf_X[CR_df['Y'].isnull()]

NameError: name 'CR_tfidf_X' is not defined

### 3g)  Predict: test set
Predict the labels on the test set. Store the predictions in a pandas DataFrame called `CR_pred_test_Y`, with the numeric predictions in a column (series) `'label'`

In [55]:
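# YOUR CODE HERE
# Predict numeric labels for the unlabeled reviews and store them in a 'label' column
CR_pred_test_Y = pd.DataFrame({'label': CR_clf.predict(CR_test_X)})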

In [56]:
assert isinstance(CR_test_X, np.ndarray)
assert isinstance(CR_pred_test_Y, pd.DataFrame)
assert CR_pred_test_Y.columns == 'label'

NameError: name 'CR_test_X' is not defined

### 3h) Convert labels

Using the `convert_label` function, convert the predicted numerical labels back to string labels ("pos" and "neg").

Create a column called `label` in `CR_test_df` to store the converted labels.

In [57]:
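# YOUR CODE HERE
# Map the numeric predictions (1.0 / 0.0) back to the string labels ("pos" / "neg")
CR_test_df['label'] = [convert_label(y, 'tolabel') for y in CR_pred_test_Y['label']]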

In [58]:
assert isinstance(CR_test_df['label'], pd.Series)
assert set(CR_test_df['label']) == {'neg', 'pos'}


NameError: name 'CR_test_df' is not defined

The hidden assignment tests for the cell above will check that your model predicts the right number of pos/neg reviews in the test data provided. 

We now have a model that can predict positive or negative sentiment! 

In the cell below, as a written answer question, briefly explain, in your own words, what the BoW and TF-IDF word representations are, and how they differ. Also, think about and write a quick example of when and why it might be useful to computationally analyze the sentiment of text data. [This whole answer can/should be a couple of sentences].

After you answer this question, you are done! 

YOUR ANSWER HERE

# Complete! 

Good work! Have a look back over your answers, and also make sure to `Restart & Run All` from the kernel menu to double check that everything is working properly. While you can typically use the 'Validate' button above, which runs your notebook from top to bottom and checks to ensure all `assert` statements pass silently, ***this may fail on this assignment as the code takes too long to run. Use Restart & Run All instead***. When you are ready, submit on datahub!

Note that ***the final validation is for your reassurance and is not a required step***. You can submit without validating. You can also submit without passing all asserts (for partial credit on the assignment). We grade whatever is submitted on datahub. We will grade your most recent submission.