In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# use seaborn plotting defaults
import seaborn as sns; sns.set()

# TF-IDF Application (OPTIONAL, 0 Points)

In this task we look at how to use TF-IDF on real data. The task is optional and carries no points.

We will work with the dataset `imdb` from the library `datasets`. The description of the dataset can be found [here](https://huggingface.co/datasets/imdb).

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
training = pd.DataFrame(dataset["train"])
test = pd.DataFrame(dataset["test"])
X_train, y_train = training["text"], training["label"]
X_test, y_test = test["text"], test["label"]

Our predictor variable is the text of the movie reviews. To use the reviews in an SVM model, we first have to transform the texts into numerical features.

In [None]:
X_train.head()

0    I rented I AM CURIOUS-YELLOW from my video sto...
1    "I Am Curious: Yellow" is a risible and preten...
2    If only to avoid making this type of film in t...
3    This film was probably inspired by Godard's Ma...
4    Oh, brother...after hearing about this ridicul...
Name: text, dtype: object

Our label vector consists of 0 and 1 values. 0 represents negative reviews and 1 represents positive reviews.

In [None]:
y_train.head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

In [None]:
X_train.shape, X_test.shape

((25000,), (25000,))

### 5a. Preprocessing
When working with text data, it is important to reduce the number of distinct words (the vocabulary) as far as possible without losing semantic meaning. The first step is to preprocess the text. There are many possible preprocessing steps, and which ones to choose depends on the data and the application (e.g., if you analyse tweets, you probably want to keep the punctuation that forms emoticons). The function `preprocessing()` below shows one possible way to do it.

In [None]:
!pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [None]:
import re
import unidecode

def preprocessing(text):
  """Preprocess a string: lowercase it, normalize it to ASCII, replace punctuation and special characters with spaces, and collapse extra whitespace."""
  text = text.lower() # lowercase the text
  text = unidecode.unidecode(text) # text normalization, e.g., substitute ê with e
  text = re.sub(r'[^\w\s\d\']', ' ', text) # replace everything except word characters (\w: letters, digits, underscore), whitespace (\s) and apostrophes (\') with a space; \d is already covered by \w
  text = re.sub(r"\s+", " ", text) # remove extra spaces
  text = text.strip() # remove spaces at the start and the end of the sentence
  return text

In [None]:
X_train_preprocessed = X_train.apply(preprocessing)
X_test_preprocessed = X_test.apply(preprocessing)
X_train_preprocessed.head()

0    i rented i am curious yellow from my video sto...
1    i am curious yellow is a risible and pretentio...
2    if only to avoid making this type of film in t...
3    this film was probably inspired by godard's ma...
4    oh brother after hearing about this ridiculous...
Name: text, dtype: object
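
To see what each step does on a single string, here is a minimal standalone sketch. It is not the function used above: the name `preprocess_stdlib` is made up for illustration, and it swaps `unidecode` for the standard library's `unicodedata` so that it runs without extra installs. The two can differ for characters that have no ASCII decomposition (for example `ß`), which this variant simply drops.

```python
import re
import unicodedata

def preprocess_stdlib(text):
    """Stdlib-only approximation of the preprocessing steps (assumption: NFKD + ASCII instead of unidecode)."""
    text = text.lower()
    # Split accented letters into base letter + accent, then drop the non-ASCII accents
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace everything except word characters, whitespace and apostrophes with a space
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "Amélie -- it's GREAT!!  10/10"
cleaned = preprocess_stdlib(sample)
print(cleaned)
```

Note that the apostrophe in `it's` survives, while the dashes, exclamation marks and the slash become word separators.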

### 5b. Lemmatization
A second way to reduce the vocabulary size (i.e., the data dimensionality) without losing semantic meaning is lemmatization. The goal is to map different inflected forms of a word (e.g., fishing, fished, fishes) to the same token. Lemmatization determines the lemma (dictionary form) of each word and replaces the word with it (e.g., feet -> foot).
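
As a rough illustration of why this shrinks the vocabulary, the sketch below uses a tiny hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and exist only for this example; a real lemmatizer relies on much richer linguistic data and rules.

```python
# Hypothetical lookup table, invented for illustration only
TOY_LEMMAS = {
    "fishing": "fish", "fished": "fish", "fishes": "fish",
    "feet": "foot", "was": "be", "were": "be",
}

def toy_lemmatize(text):
    """Replace each whitespace-separated token with its lemma if the table knows it."""
    return " ".join(TOY_LEMMAS.get(token, token) for token in text.split())

reviews = ["he fished all day", "she was fishing", "they were on their feet"]
lemmatized = [toy_lemmatize(review) for review in reviews]

# Compare the number of distinct tokens before and after lemmatization
vocab_before = {token for review in reviews for token in review.split()}
vocab_after = {token for review in lemmatized for token in review.split()}
print(len(vocab_before), "->", len(vocab_after))
```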

We will use the library `spacy` for this subtask.

[spaCy](https://spacy.io/) is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently.

We install the library and its English-language model:

In [None]:
# Install and update spaCy
!pip install -U spacy

# Download the small English pipeline (the 'en' shortcut is deprecated as of spaCy v3.0)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Import the library and load the English language model:

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')

Apply lemmatization to `X_train_preprocessed` and `X_test_preprocessed`. You might want to go make some tea, as this will take a while...

In [None]:
def lemmatize(text):
  """Substitute every word with its lemma"""
  doc = sp(text)
  # Join the lemma of each token with single spaces (no leading space)
  return " ".join(token.lemma_ for token in doc)

In [None]:
X_train_lemmatized = X_train_preprocessed.apply(lemmatize)
X_test_lemmatized = X_test_preprocessed.apply(lemmatize)
X_train_lemmatized.head()

### 5c. Vectorizing the reviews
Transform the reviews from text to a vector representation using `TfidfVectorizer` from `sklearn.feature_extraction.text`. Please set the parameter `max_features` to `50000`, which limits the vocabulary to the 50,000 most frequent terms. Otherwise the Colab notebook may run out of available RAM and crash.

**Hint**: 

-   ***fit*** lemmatized X **training** samples, 
-   ***transform*** lemmatized X **training AND test** samples
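
To make the fit/transform split concrete, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation, which is what `TfidfVectorizer` uses by default, but it tokenises by plain whitespace splitting, so it is an illustration rather than a drop-in replacement. The helper names `fit_idf` and `transform` are made up for this example.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Map one document to an L2-normalised {term: weight} dict; terms unseen during fit are dropped."""
    counts = Counter(token for token in doc.split() if token in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["good movie", "bad movie", "good good plot"]
idf = fit_idf(train_docs)                       # fit: training data only
test_vec = transform("good unseen movie", idf)  # transform: reuses the fitted vocabulary and IDF
print(test_vec)
```

Because the vocabulary and IDF weights come from the training data alone, the word `unseen` contributes nothing to the test vector, and rarer training terms such as `bad` receive a higher IDF than common ones such as `good`.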

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=50000)
X_train_vectors = tfidf.fit_transform(X_train_lemmatized)
X_test_vectors = tfidf.transform(X_test_lemmatized)
X_train_vectors.shape # the sparse matrix exposes .shape directly; .todense() would allocate a dense 25000 x 50000 array

(25000, 50000)

### 5d. Applying Linear SVM
Now we apply a simple `LinearSVC` model to the vectorized data to predict the "mood" of the reviews. In Natural Language Processing this task is called sentiment analysis. As mentioned above, 0 represents negative sentiment and 1 represents positive sentiment.

Fit a simple linear SVM on the data. You don't need to tune the hyperparameters; the default values of `LinearSVC()` are fine.
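
For intuition about what a fitted linear model does at prediction time, the sketch below scores a sparse feature dict with a weight per term and thresholds the score at zero. The weights are invented for illustration and are not taken from any trained model; the function names are likewise hypothetical.

```python
def decision_function(features, weights, bias=0.0):
    """Linear score w . x + b for a sparse {term: value} feature dict."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias

def predict(features, weights, bias=0.0):
    """Return 1 (positive) if the score is above zero, otherwise 0 (negative)."""
    return 1 if decision_function(features, weights, bias) > 0 else 0

# Hypothetical weights, invented for illustration only
toy_weights = {"great": 1.2, "boring": -1.5, "plot": 0.1}

positive_review = {"great": 0.8, "plot": 0.6}   # TF-IDF-like feature values
negative_review = {"boring": 0.7, "plot": 0.7}

print(predict(positive_review, toy_weights), predict(negative_review, toy_weights))
```

Training amounts to choosing the weights and bias so that this rule separates the two classes with a large margin.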

In [None]:
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train_vectors, y_train)

LinearSVC()

Predict the labels for the test data and compare them with the ground truth `y_test`. You can use the confusion matrix, accuracy, precision, recall, F1-score or any other evaluation criterion.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

y_pred = svc.predict(X_test_vectors)

print(f"Confusion matrix: {confusion_matrix(y_test, y_pred)}")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print()
print(classification_report(y_true=y_test, y_pred=y_pred, digits=3))

Confusion matrix: [[11018  1482]
 [ 1483 11017]]
Accuracy: 0.8814
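
As a sanity check, the metrics can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels, and treats label 1 (positive review) as the positive class.

```python
# Confusion matrix from the output above: rows = true labels, columns = predicted labels
tn, fp = 11018, 1482   # true label 0
fn, tp = 1483, 11017   # true label 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way agrees with the value reported by `accuracy_score`, and the near-identical precision and recall reflect how evenly the errors are split between the two classes.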
