# Text Classification for Sentiment Analysis
---
<small><i>February 2023 - Notebook created by Mariona Carós</i></small>

One of the most common use cases of NLP is text classification. Given a set of texts and their labels, the goal is to learn a model that can predict the label of a new, unseen text.

Take, for example, a simple **Sentiment Analysis** problem on movie reviews. **Each review can be classified as positive or negative**. We will train a model on a set of labelled reviews so that it can classify a new review as one or the other.



### Playing with spaCy

To preprocess the data, extracting the text of each review and the label that tells whether it is positive or not, we will use the NLP library [spaCy](https://spacy.io/).


In [1]:
# We import the libraries that we are going to use
import os
import shutil
import string
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import spacy
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from tqdm.notebook import tqdm_notebook
import gensim
import time
from sklearn import metrics
import random

In [2]:
spacy.__version__

'3.7.2'

In [3]:
# workaround for solving UTF-8 error of colab
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding


We will start by defining a dummy text.

In [4]:
# Fill in the list with sentences in Spanish.
## TODO ##
texts = ["En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor. Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lantejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda. El resto della concluían sayo de velarte, calzas de velludo para las fiestas, con sus pantuflos de lo mesmo, y los días de entresemana se honraba con su vellorí de lo más fino. Tenía en su casa una ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte, y un mozo de campo y plaza, que así ensillaba el rocín como tomaba la podadera. Frisaba la edad de nuestro hidalgo con los cincuenta años; era de complexión recia, seco de carnes, enjuto de rostro, gran madrugador y amigo de la caza. Quieren decir que tenía el sobrenombre de Quijada, o Quesada, que en esto hay alguna diferencia en los autores que deste caso escriben; aunque por conjeturas verosímiles se deja entender que se llamaba Quijana. Pero esto importa poco a nuestro cuento: basta que en la narración dél no se salga un punto de la verdad."]

In [5]:
# print elements of our first string
for elem in texts[0]:
    print(elem)

E
n
 
u
n
 
l
u
g
a
r
 
d
e
 
l
a
 
M
a
n
c
h
a
,
 
d
e
 
c
u
y
o
 
n
o
m
b
r
e
 
n
o
 
q
u
i
e
r
o
 
a
c
o
r
d
a
r
m
e
,
 
n
o
 
h
a
 
m
u
c
h
o
 
t
i
e
m
p
o
 
q
u
e
 
v
i
v
í
a
 
u
n
 
h
i
d
a
l
g
o
 
d
e
 
l
o
s
 
d
e
 
l
a
n
z
a
 
e
n
 
a
s
t
i
l
l
e
r
o
,
 
a
d
a
r
g
a
 
a
n
t
i
g
u
a
,
 
r
o
c
í
n
 
f
l
a
c
o
 
y
 
g
a
l
g
o
 
c
o
r
r
e
d
o
r
.
 
U
n
a
 
o
l
l
a
 
d
e
 
a
l
g
o
 
m
á
s
 
v
a
c
a
 
q
u
e
 
c
a
r
n
e
r
o
,
 
s
a
l
p
i
c
ó
n
 
l
a
s
 
m
á
s
 
n
o
c
h
e
s
,
 
d
u
e
l
o
s
 
y
 
q
u
e
b
r
a
n
t
o
s
 
l
o
s
 
s
á
b
a
d
o
s
,
 
l
a
n
t
e
j
a
s
 
l
o
s
 
v
i
e
r
n
e
s
,
 
a
l
g
ú
n
 
p
a
l
o
m
i
n
o
 
d
e
 
a
ñ
a
d
i
d
u
r
a
 
l
o
s
 
d
o
m
i
n
g
o
s
,
 
c
o
n
s
u
m
í
a
n
 
l
a
s
 
t
r
e
s
 
p
a
r
t
e
s
 
d
e
 
s
u
 
h
a
c
i
e
n
d
a
.
 
E
l
 
r
e
s
t
o
 
d
e
l
l
a
 
c
o
n
c
l
u
í
a
n
 
s
a
y
o
 
d
e
 
v
e
l
a
r
t
e
,
 
c
a
l
z
a
s
 
d
e
 
v
e
l
l
u
d
o
 
p
a
r
a
 
l
a
s
 
f
i
e
s
t
a
s
,
 
c
o
n
 
s
u
s
 
p
a
n
t
u
f
l
o
s
 
d
e
 
l
o
 
m
e
s
m
o
,
 
y
 
l
o
s
 
d
í
a
s


In [19]:
#Length of first string
len(texts[0])

1204

We would like to split the text into words instead of letters. As you know, this is easily done with spaCy!
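As a rough illustration of why a proper tokenizer helps, the sketch below contrasts iterating over characters with a naive whitespace split, using only the Python standard library. The shortened sentence and the variable names are illustrative assumptions, not part of the original notebook.

```python
# Illustrative sketch: characters vs. naive whitespace "words".
text = "En un lugar de la Mancha, de cuyo nombre no quiero acordarme"

chars = list(text)    # iterating over a string yields single characters
words = text.split()  # splitting on whitespace yields word-like chunks

print(len(chars), len(words))
print(words[5])  # punctuation stays glued to the word
```

A whitespace split leaves punctuation attached to words (here `Mancha,`), which is one of the cases a tokenizer such as spaCy's is designed to handle.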

First, we need to select a language model from those available in the library. In this case, the texts are in Spanish, so we will instantiate the Spanish model.

In [20]:
# Mind the ! sign at the beginning of the command that indicates that it belongs to a command line call.
! python -m spacy download es_core_news_sm



Once you've downloaded and installed a trained pipeline, you can load it via `spacy.load`. This will return a Language object containing all components and data needed to process text. We usually call it `nlp`.

In [21]:
nlp_sp = spacy.load("es_core_news_sm") # loading the small Spanish model

**`nlp_sp`** represents the core processing engine for performing various linguistic analyses on text data. This NLP pipeline includes a sequence of processing components to perform linguistic tasks on text. Let's check them...

In [22]:
nlp_sp.component_names

['tok2vec',
 'morphologizer',
 'parser',
 'senter',
 'attribute_ruler',
 'lemmatizer',
 'ner']

`.component_names` is an attribute that provides a list of the names of all the components in a spaCy processing pipeline in the order they are added to the pipeline. A processing pipeline in spaCy consists of various components that perform different tasks on the text data, such as tokenization, part-of-speech tagging, named entity recognition, and more.


When you call the `nlp` object on a string of text, spaCy first tokenizes it to produce a `Doc`. The `Doc` is then processed in several steps, which is also referred to as the processing pipeline. The pipeline typically includes a tagger, a lemmatizer, a parser and an entity recognizer.

*Figure: the spaCy processing pipeline (`spacy-pipe.svg`).*

In [23]:
# A Doc is a sequence of Token objects.
doc = nlp_sp(texts[0])
doc

En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor. Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lantejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda. El resto della concluían sayo de velarte, calzas de velludo para las fiestas, con sus pantuflos de lo mesmo, y los días de entresemana se honraba con su vellorí de lo más fino. Tenía en su casa una ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte, y un mozo de campo y plaza, que así ensillaba el rocín como tomaba la podadera. Frisaba la edad de nuestro hidalgo con los cincuenta años; era de complexión recia, seco de carnes, enjuto de rostro, gran madrugador y amigo de la caza. Quieren decir que tenía el sobrenombre de Quijada, o Quesada, que en esto hay alguna diferencia en los autores q

Next, we can check the length of the document (its number of tokens) and the linguistic annotations of each token.

In [24]:
print(f'Length: {len(doc)}')

Length: 247


In [25]:
# a Doc is composed of tokens that carry linguistic annotations
print("word ->, pos_, tag_, dep_, shape_, is_alpha, is_stop")

for token in doc:
    print(f"{token.text} -> {token.pos_}, {token.tag_}, {token.dep_}, {token.shape_}, {token.is_alpha}, {token.is_stop}")

word ->, pos_, tag_, dep_, shape_, is_alpha, is_stop
En -> ADP, ADP, case, Xx, True, True
un -> DET, DET, det, xx, True, True
lugar -> NOUN, NOUN, obl, xxxx, True, False
de -> ADP, ADP, case, xx, True, True
la -> DET, DET, det, xx, True, True
Mancha -> PROPN, PROPN, nmod, Xxxxx, True, False
, -> PUNCT, PUNCT, punct, ,, False, False
de -> ADP, ADP, case, xx, True, True
cuyo -> PRON, PRON, nmod, xxxx, True, False
nombre -> NOUN, NOUN, obl, xxxx, True, False
no -> ADV, ADV, advmod, xx, True, True
quiero -> VERB, VERB, acl, xxxx, True, False
acordarme -> VERB, VERB, xcomp, xxxx, True, False
, -> PUNCT, PUNCT, punct, ,, False, False
no -> ADV, ADV, advmod, xx, True, True
ha -> AUX, AUX, ROOT, xx, True, True
mucho -> DET, DET, det, xxxx, True, True
tiempo -> NOUN, NOUN, obj, xxxx, True, False
que -> SCONJ, SCONJ, mark, xxx, True, True
vivía -> VERB, VERB, acl, xxxx, True, False
un -> DET, DET, det, xx, True, True
hidalgo -> NOUN, NOUN, obj, xxxx, True, False
de -> ADP, ADP, case, xx, True, T

In [26]:
## TODO ## get list of lemmas by using lemma_ attribute of each word in doc
# Try using a list comprehension
my_lemmas = [word.lemma_ for word in doc]

Let's inspect the lemmas we just extracted. After that, we will use displaCy's dependency visualizer (`style="dep"`), which shows part-of-speech tags and syntactic dependencies.

In [27]:
my_lemmas

['en',
 'uno',
 'lugar',
 'de',
 'el',
 'Mancha',
 ',',
 'de',
 'cuyo',
 'nombre',
 'no',
 'querer',
 'acordar yo',
 ',',
 'no',
 'haber',
 'mucho',
 'tiempo',
 'que',
 'vivir',
 'uno',
 'hidalgo',
 'de',
 'el',
 'de',
 'lanza',
 'en',
 'astillero',
 ',',
 'adarga',
 'antiguo',
 ',',
 'rocín',
 'flaco',
 'y',
 'galgo',
 'corredor',
 '.',
 'uno',
 'olla',
 'de',
 'algo',
 'más',
 'vaco',
 'que',
 'carnero',
 ',',
 'salpicón',
 'el',
 'más',
 'noche',
 ',',
 'duelo',
 'y',
 'quebranto',
 'el',
 'sábado',
 ',',
 'lanteja',
 'el',
 'viernes',
 ',',
 'alguno',
 'palomino',
 'de',
 'añadidura',
 'el',
 'domingo',
 ',',
 'consumir',
 'el',
 'tres',
 'parte',
 'de',
 'su',
 'hacienda',
 '.',
 'el',
 'resto',
 'della',
 'concluir',
 'sayo',
 'de',
 'velarte',
 ',',
 'calza',
 'de',
 'velludo',
 'para',
 'el',
 'fiesta',
 ',',
 'con',
 'su',
 'pantuflo',
 'de',
 'él',
 'mesmo',
 ',',
 'y',
 'el',
 'día',
 'de',
 'entresemana',
 'él',
 'honrar',
 'con',
 'su',
 'vellorí',
 'de',
 'él',
 'más',
 '

In [28]:
# Visualizing a dependency parse
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

In [29]:
# get stop words
spanish_spacy_stopwords = spacy.lang.es.stop_words.STOP_WORDS

In [32]:
## TODO ## show first 10 stop words
# spanish_spacy_stopwords[:10]  # Does not work: STOP_WORDS is a set, not a list, so it cannot be sliced
for i, element in enumerate(spanish_spacy_stopwords):
    if i >= 10:
        break
    print(element)

aquello
cierta
adelante
dicho
próximo
esas
expresó
suyo
sabe
pues


A set has no defined order, so the first 10 elements can differ from one run to the next.

In [None]:
print(list(spanish_spacy_stopwords)[0:10])

['qué', 'saben', 'ademas', 'hoy', 'once', 'consigues', 'propios', 'delante', 'sobre', 'tuyo']
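If a reproducible listing is needed, one option is to sort the set before slicing. The sketch below uses a small, made-up stop word set (`toy_stopwords` is an assumption for illustration, not spaCy's actual list) and only the standard library.

```python
import itertools

# Toy stand-in for spaCy's STOP_WORDS set (illustrative values only).
toy_stopwords = {"de", "la", "que", "el", "en", "y", "a", "los"}

# Sorting gives a deterministic order, so the slice is reproducible.
first_three_sorted = sorted(toy_stopwords)[:3]

# islice takes the first items in the set's arbitrary iteration order.
arbitrary_three = list(itertools.islice(toy_stopwords, 3))

print(first_three_sorted)
print(arbitrary_three)
```

The sorted slice is stable across runs, whereas the `islice` result depends on how the set happens to be iterated.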


In [None]:
# all stopwords
print(spanish_spacy_stopwords)
# It is a set, which makes membership checks fast


{'y', 'hemos', 'últimos', 'hace', 'siguiente', 'también', 'demasiado', 'nuevo', 'estamos', 'puedo', 'dicen', 'ésta', 'mal', 'cierto', 'atras', 'hubo', 'usais', 'ellas', 'cuantas', 'cualquier', 'tuyo', 'propio', 'quiza', 'nuestra', 'aquéllas', 'debajo', 'antes', 'vaya', 'fuera', 'en', 'alguno', 'próximos', 'estar', 'muchas', 'puede', 'enseguida', 'quiénes', 'agregó', 'pues', 'ningunos', 'suya', 'vosotras', 'final', 'delante', 'parece', 'tengo', 'nada', 'haciendo', 'casi', 'quizá', 'respecto', 'llegó', 'cuanta', 'aún', 'sé', 'todo', 'qeu', 'lo', 'pueda', 'otra', 'sigue', 'ésos', 'siete', 'usar', 'poder', 'informo', 'claro', 'propias', 'toda', 'había', 'solas', 'salvo', 'verdadero', 'tuvo', 'estará', 'dónde', 'entre', 'ultimo', 'éstas', 'voy', 'aquéllos', 'bien', 'tus', 'muchos', 'ademas', 'mismas', 'están', 'aqui', 'otros', 'suyos', 'eres', 'mías', 'usamos', 'pesar', 'hecho', 'él', 'varios', 'esa', 'nosotras', 'contra', 'poco', 'estados', 'nuestro', 'os', 'vuestro', 'cierta', 'tuya', 'co

In [33]:
# Create our set of punctuation marks
# string is a package we had imported in the first cell already
punctuations = set(string.punctuation) #Searching through a set is faster than searching through a list
punctuations

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~'}
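As a small sketch of how such a set can be used, the snippet below filters punctuation out of a hand-written token list. The `tokens` list is an illustrative assumption; in the notebook the tokens would come from spaCy.

```python
import string

# Set of ASCII punctuation marks; membership tests on a set are fast.
punctuations = set(string.punctuation)

# Hand-written tokens standing in for the output of a tokenizer.
tokens = ["En", "un", "lugar", ",", "de", "la", "Mancha", "."]

# Keep only the tokens that are not punctuation marks.
no_punct = [tok for tok in tokens if tok not in punctuations]
print(no_punct)
```

Note that `string.punctuation` only covers ASCII characters, so marks such as `¿` or `¡` would need to be added separately for Spanish text.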


### Download the data
First, we download the data from the dataset web page with the `wget` command line tool.

The example that we use here is based on the IMDB dataset. The IMDB dataset includes 50K movie reviews for natural language processing or text analytics. This is a dataset for binary sentiment classification, which includes a set of 25,000 movie reviews for training and 25,000 for testing.

In [34]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz




In [None]:
!tar xzf aclImdb_v1.tar.gz
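On systems where `wget` or `tar` is not available, the same two steps can be approximated with the Python standard library. This is a minimal sketch, not the notebook's original method; the helper names `download` and `extract` are assumptions introduced here.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, archive_path):
    """Download `url` to `archive_path` unless the file already exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into `dest_dir`."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction filter, available in recent Python versions.
            tar.extractall(str(dest_dir), filter="data")
        else:
            tar.extractall(str(dest_dir))


# Example usage (commented out because the archive is large):
# archive = download(
#     "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
#     "aclImdb_v1.tar.gz",
# )
# extract(archive)
```

Skipping the download when the file already exists avoids fetching the archive again when the notebook is rerun.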


In [35]:
!ls aclImdb/

README     imdb.vocab imdbEr.txt test       train


In [36]:
# remove unnecessary files from the dataset
shutil.rmtree('aclImdb/train/unsup')

### Load the data

According to the description of the dataset:
"There are two top-level directories `[train/, test/]` corresponding to
the training and test sets. Each contains `[pos/, neg/]` directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention `[[id]_[rating].txt]` where `[id]` is a unique ID and `[rating]` is
the star rating for that review on a 1-10 scale.

For example, the file `test/pos/200_8.txt` is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
`[train/unsup/]` directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
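The naming convention described in the quote can be decoded with a few lines of standard-library Python. The helper `parse_review_path` below is a hypothetical name used only for illustration; the notebook itself relies on the directory name for the label and does not use the rating.

```python
from pathlib import PurePosixPath


def parse_review_path(path):
    """Split a path like 'test/pos/200_8.txt' into its components."""
    p = PurePosixPath(path)
    review_id, rating = p.stem.split("_")  # '200_8' -> '200', '8'
    return {
        "split": p.parts[-3],      # 'train' or 'test'
        "sentiment": p.parts[-2],  # 'pos' or 'neg'
        "id": int(review_id),
        "rating": int(rating),
    }


info = parse_review_path("test/pos/200_8.txt")
print(info)
```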

Next, we show how to load this data to start preprocessing it.

## **Workflow**

`preprocess` → `train` → `evaluate`

### Data preprocessing

In the `load_data` function, we prepare our dataset. This is a very important step, since we need to clean the data before using it to train our model. In NLP, cleaning the data usually involves a series of common steps: removing HTML tags, removing punctuation and numbers, tokenizing, and extracting lemmas. Which of these steps to apply depends on the use case.
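To make these cleaning steps concrete, here is a simplified sketch that uses only the standard library. It is not the spaCy-based tokenizer used in this notebook: it does no lemmatization, and the names `strip_html` and `simple_clean` as well as the tiny stop word set are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of `raw` with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def simple_clean(raw, stop_words):
    """Lowercase, drop tags, punctuation, stop words and 1-letter words."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z]+|\d+", text)  # the regex skips punctuation
    clean = []
    for tok in tokens:
        if tok.isdigit():
            clean.append("number")  # same placeholder idea as in the notebook
        elif tok not in stop_words and len(tok) > 1:
            clean.append(tok)
    return clean


cleaned = simple_clean("I loved it!<br /><br />10 out of 10.", {"i", "it", "of", "out"})
print(cleaned)
```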

In this case, we use the spaCy model to tokenize the text, i.e. to split it into a sequence of meaningful units, here words.

We need to select a language model from those available in the library. Since the texts are in English, we instantiate the English model and retrieve the English stop words and the punctuation marks that we will use later to remove uninformative tokens.

In [40]:
# If this command fails inside the notebook, run it from a terminal instead.
! python -m spacy download en_core_web_sm

In [6]:
## TODO ## load "en_core_web_sm" model by using spacy and store it in a variable called nlp
nlp = spacy.load("en_core_web_sm")

Add the code needed to retrieve `clean_tokens` without punctuation and stop words in the function `spacy_tokenizer()`. Additionally, remove words shorter than 2 letters.

In [8]:
# Creating our tokenizer function
def spacy_tokenizer(text, nlp, stop_words, punctuations):
    '''
    Lemmatize each token
    convert each token into lowercase
    Remove stop words
    modify integer numbers into word 'number'
    Note: `punctuations` is kept in the signature but is not used here;
    punctuation is detected with spaCy's `is_punct` attribute instead.
    '''
    clean_tokens = []

    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(text, disable=["parser", "ner", "senter"])

    for word in mytokens:
        # Modify integer numbers into word 'number'
        if word.like_num:
            clean_tokens.append('number')
            continue

        ## TODO ##
        ## 1. Remove punctuation and stopwords
        if word.is_punct:
            continue
        # `word` is a spaCy Token, so compare its text (not the Token itself)
        # against the set of stop word strings
        if word.text.lower() in stop_words:
            continue

        ## 2. Remove words shorter than 2 letters
        if len(word) < 2:
            continue

        # Lemmatizing each token and converting each token into lowercase
        clean_tokens.append(word.lemma_.lower().strip())

    # return preprocessed list of tokens
    return clean_tokens



In [9]:
# Mariona's approach: rely on spaCy's token attributes (is_punct, is_stop)
# Creating our tokenizer function
def spacy_tokenizer(text, nlp, stop_words, punctuations):
    '''
    Lemmatize each token
    convert each token into lowercase
    Remove stop words
    modify integer numbers into word 'number'
    '''
    clean_tokens = []

    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(text, disable=[ "parser", "ner", "senter"])

    for word in mytokens:
        # Modify integer numbers into word 'number'
        if word.like_num:
            clean_tokens.append('number')

        elif not word.is_punct and not word.is_stop and len(word)>1:
        # Lemmatizing each token and converting each token into lowercase
            lemma = word.lemma_.lower().strip()
            clean_tokens.append(lemma)

    # return preprocessed list of tokens
    return clean_tokens



In [15]:
# Create stop words set outside the function for efficiency
stop_words = spacy.lang.en.stop_words.STOP_WORDS
print(type(stop_words))


<class 'set'>


Test your spacy tokenizer with a text

In [11]:
my_test_text = "Uno dos 3 5 tres Hola como estas? Como te llamas? Me llamo Ulises. 4 5 Y tu? Como te llamas?"

In [12]:
## TODO ##
spacy_tokenizer(my_test_text, nlp, stop_words, "")

['uno',
 'do',
 'number',
 'number',
 'tre',
 'hola',
 'como',
 'esta',
 'como',
 'te',
 'llamas',
 'llamo',
 'ulises',
 'number',
 'number',
 'tu',
 'como',
 'te',
 'llamas']

Now you are going to preprocess the data to be used in your models. The `load_data()` function loads text data from files in the IMDB movie review dataset, tokenizes and preprocesses the text, and returns a list of dictionaries, where each dictionary represents a single review. The function takes two boolean arguments, `is_train` and `is_neg`, which specify whether to load training or test data, and whether to load negative or positive reviews, respectively. The maximum number of files to load can also be specified with `max_files`.

In [16]:
def load_data(is_train, is_neg = True, max_files=None):
    """
    Input arguments: is_train (boolean), is_neg (optional boolean, default value
    is True), max_files (integer)
    Output: A list of dictionaries, where each dictionary contains information
    about a single review.

    The function reads all files in the specified directory, preprocesses the text
    by removing HTML tags, converting to lowercase, lemmatizing, and removing stop
    words and punctuation. It then creates a dictionary for each review, which
    contains the file name, label (0 for negative reviews, 1 for positive reviews),
    original text, and preprocessed tokens. The function returns a list of these
    dictionaries.
    """
    reviews = []
    if is_train:
        base_dir = 'aclImdb/train'
    else:
        base_dir = 'aclImdb/test'
    if is_neg:
        subdir = 'neg'
        label = 0
    else:
        subdir = 'pos'
        label = 1

    reviews_dir = os.path.join(base_dir,subdir)

    file_names = os.listdir(reviews_dir)
    random.shuffle(file_names)

    if max_files is not None:
        file_names = file_names[:max_files]

    for filename in tqdm_notebook(file_names):
        # open in read-only mode
        with open(os.path.join(reviews_dir, filename), 'r', encoding='UTF-8') as f:
            text = f.read()
            # use BeautifulSoup library to parse the HTML content and remove any HTML tags
            soup = BeautifulSoup(text, "lxml")
            tags_del = soup.get_text()

            # tokenize, clean and get lemmas
            clean_tokens = spacy_tokenizer(tags_del, nlp, stop_words, punctuations)
            file_data = {
              "file_name": f.name,
              "label": label,
              "text": text,
              "clean_tokens":clean_tokens
            }
            reviews.append(file_data)
    return reviews




In [17]:
# calling preprocessing function, it takes time
train_neg_reviews = load_data(is_train=True, max_files=10000)



Let's visualize our data! Print a review with raw text and then the clean tokens of the same review. Are punctuation marks and stop words removed?

*Hint: You need to obtain a review from the list.
Then, use the dictionary key to obtain text or clean tokens*

In [18]:
train_neg_reviews[0].keys()

dict_keys(['file_name', 'label', 'text', 'clean_tokens'])

In [19]:
## TODO ## show raw text of a negative review
train_neg_reviews[0]["text"]


'I\'m going to go on the record as the second person who has, after years of using the IMDb to look up movies, been motivated by Nacho\'s film, The Abandoned to create an account and post a comment. This was hands down the worst movie I\'ve ever seen in my entire life. The plot was on the verge of non-existence, and none of the "puzzle-pieces" added up in any way whatsoever. The acting was laughable and the writing was embarrassing. How this film got backed and came to be is completely beyond me. The only saving grace I could find was Anastasia Hille\'s cunning and repetitive use of the f word. (and brilliant sound design) If I were faced with the option of seeing this film again or being mauled by wild bores I would be up against a difficult decision. I\'m disappointed that I am unable to give it 0 stars.'

In [20]:
## TODO ## show clean tokens for the same review. You can use ' '.join() to get
# a string of clean tokens
" ".join(train_neg_reviews[0]["clean_tokens"])

'go record number person year imdb look movie motivate nacho film abandon create account post comment hand bad movie see entire life plot verge non existence puzzle piece add way whatsoever acting laughable writing embarrassing film get back come completely saving grace find anastasia hille cunning repetitive use word brilliant sound design face option see film maul wild bore difficult decision disappointed unable number star'

In [21]:
# Process remaining data
train_pos_reviews = load_data(is_train=True, is_neg=False, max_files=10000)
test_neg_reviews = load_data(is_train=False, max_files=5000)
test_pos_reviews = load_data(is_train=False, is_neg=False, max_files=5000)


In [22]:
train_data = pd.DataFrame(train_pos_reviews + train_neg_reviews).sample(frac = 1) # frac=1 returns all rows in random order, i.e. it shuffles the DataFrame
test_data = pd.DataFrame(test_pos_reviews + test_neg_reviews).sample(frac = 1)

In [23]:
train_data.head()

| | file_name | label | text | clean_tokens |
|---|---|---|---|---|
| 8569 | aclImdb/train/pos/9731_8.txt | 1 | Sidney Young (Pegg) moves from England to New ... | [sidney, young, pegg, move, england, new, york... |
| 12349 | aclImdb/train/neg/784_3.txt | 0 | Despite positive reviews and screenings at the... | [despite, positive, review, screening, interna... |
| 18499 | aclImdb/train/neg/2004_3.txt | 0 | I read the book a long time back and don't spe... | [read, book, long, time, specifically, remembe... |
| 2547 | aclImdb/train/pos/10625_7.txt | 1 | "Hatred of a Minute" is arguably one of the be... | [hatred, minute, arguably, number, well, film,... |
| 7500 | aclImdb/train/pos/4252_9.txt | 1 | this film takes you inside itself in the early... | [film, take, inside, early, minute, hold, till... |


We add clean texts to our DataFrames

In [24]:
train_data['clean_texts'] = train_data.apply(lambda row: ' '.join(row['clean_tokens']), axis=1)
test_data['clean_texts'] = test_data.apply(lambda row: ' '.join(row['clean_tokens']), axis=1)

In [25]:
train_data.head()

| | file_name | label | text | clean_tokens | clean_texts |
|---|---|---|---|---|---|
| 8569 | aclImdb/train/pos/9731_8.txt | 1 | Sidney Young (Pegg) moves from England to New ... | [sidney, young, pegg, move, england, new, york... | sidney young pegg move england new york work p... |
| 12349 | aclImdb/train/neg/784_3.txt | 0 | Despite positive reviews and screenings at the... | [despite, positive, review, screening, interna... | despite positive review screening internationa... |
| 18499 | aclImdb/train/neg/2004_3.txt | 0 | I read the book a long time back and don't spe... | [read, book, long, time, specifically, remembe... | read book long time specifically remember plot... |
| 2547 | aclImdb/train/pos/10625_7.txt | 1 | "Hatred of a Minute" is arguably one of the be... | [hatred, minute, arguably, number, well, film,... | hatred minute arguably number well film come m... |
| 7500 | aclImdb/train/pos/4252_9.txt | 1 | this film takes you inside itself in the early... | [film, take, inside, early, minute, hold, till... | film take inside early minute hold till end hu... |


### Basic classification structure
Usually a text classifier has these basic components:


* **Text preprocessor**: this step usually includes tasks like converting to lower case, removing stop words, removing HTML, etc. It has to be designed carefully for each use case, as it can have a high impact on the performance of the model. For example, in sentiment analysis, is it a good idea to remove punctuation or not?
* **Vectorizer**: the component that transforms a string of text into a vector, also known as an embedding.
* **Classification network**: a model that maps the vector to the output classes. The result of this component is what we call the logits.
* **Model output**: a probability distribution across the classes, usually obtained by applying a sigmoid (binary case) or softmax (multi-class case) transformation to the logits.
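The last step, turning logits into probabilities, can be sketched in a few lines of plain Python. The logit values below are made up for illustration; they do not come from any model in this notebook.

```python
import math


def sigmoid(logit):
    """Map a single logit to a probability (binary classification)."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Map a list of logits to a probability distribution (multi-class)."""
    shifted = [z - max(logits) for z in logits]  # shift for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


p_positive = sigmoid(0.0)           # a logit of 0 means "undecided"
probs = softmax([2.0, 0.0, -1.0])   # made-up logits for three classes
print(p_positive, probs)
```

With two classes, as in this sentiment task, a single logit passed through the sigmoid is enough: the probability of the negative class is one minus that of the positive class.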


In the following sections, we will see **different approaches** to solving this problem, from a very simple approach based on the scikit-learn framework to advanced neural networks. All of them follow the described architecture; **the differences lie in how the embedding is obtained**.

### A simple baseline

Let's build a very simple version of this classifier based on a simple **vectorizer** and a **logistic regression** model to have a reference of what we can do with a few lines of code.

We reuse the spaCy-based tokenizer defined previously, which removes punctuation marks and stop words, so the texts passed to the vectorizers are the `clean_texts` built above.

#### Vectorizers
To be able to apply a model to our reviews, we first need to transform them into something numeric, as our model only understands numbers.
Here, we propose two ways of vectorizing, i.e. extracting a vector from the text:
* [Bag of words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
* [Term Frequency – Inverse Document Frequency (TF/iDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

`CountVectorizer` creates a sparse matrix of word counts for each document in the corpus. It counts the frequency of each word that appears in the document and creates a feature vector where each feature represents a unique word in the corpus.

`TfidfVectorizer` creates a sparse matrix of term frequency-inverse document frequency (TF-IDF) values for each document in the corpus. It calculates the importance of each word in the document by taking into account the frequency of the word in the document and the frequency of the word in the corpus. This helps to give more weight to words that are rare in the corpus but common in a particular document, and less weight to words that are common throughout the corpus.
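To see what the two representations look like, the sketch below computes word counts and TF-IDF weights by hand on a three-document toy corpus. It is a plain-Python illustration, not scikit-learn's implementation; the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is meant to mirror the defaults documented for `TfidfVectorizer`, which is an assumption worth checking against the library version in use.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of words: one row per document, one column per vocabulary word.
counts = [Counter(doc) for doc in tokenized]
bow = [[c[word] for word in vocab] for c in counts]

# Document frequency and smoothed inverse document frequency.
n_docs = len(docs)
df = {word: sum(word in doc for doc in tokenized) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weights = [count * idf[word] for count, word in zip(row, vocab)]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]


tfidf = [tfidf_row(row) for row in bow]
print(vocab)
print(bow)
print(tfidf)
```

In the second document ("bad movie"), "bad" gets a higher weight than "movie" because "movie" appears in more documents of the corpus.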

The `ngram_range=(1,1)` parameter in the CountVectorizer constructor specifies that only individual words (unigrams) should be used as features. By default, TfidfVectorizer also uses unigrams as features. However, it can be configured to use different n-gram ranges by specifying the ngram_range parameter in its constructor.
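As an illustration of what an n-gram range means, the following sketch extracts all n-grams between a minimum and maximum size from a token list. The function name `ngrams` is an assumption for this example and is not part of scikit-learn.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams of sizes n_min..n_max as space-joined strings."""
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


tokens = ["bad", "movie", "watch"]
unigrams = ngrams(tokens, 1, 1)              # analogous to ngram_range=(1, 1)
unigrams_and_bigrams = ngrams(tokens, 1, 2)  # analogous to ngram_range=(1, 2)
print(unigrams)
print(unigrams_and_bigrams)
```

Adding bigrams lets a model see short phrases such as "bad movie" as single features, at the cost of a larger vocabulary.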

In [26]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

bow_vector = CountVectorizer(ngram_range=(1,1))
tfidf_vector = TfidfVectorizer()


In [27]:
## TODO ## use fit_transform() method from CountVectorizer and TfidfVectorizer
# objects for creating feature vectors for the clean texts of train data
bow_train = bow_vector.fit_transform(train_data['clean_texts'])
tfidf_train = tfidf_vector.fit_transform(train_data['clean_texts'])

#### **Training Logistic Regression with bag of words**

We are going to train a classification model with bag of words and compare its performance with the same model trained with TF-IDF word representations.

In [28]:
# Define train data and labels
X_train = bow_train
y_train = list(train_data.label.values)

In [29]:
X_train

<20000x59162 sparse matrix of type '<class 'numpy.int64'>'
	with 1678683 stored elements in Compressed Sparse Row format>

In [30]:
y_train

[1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,



Here we define the classification model, in this case a simple [Logistic Regression](https://towardsdatascience.com/logistic-regression-explained-9ee73cede081).

Instantiate the model and train it by using `fit()` method.

In [31]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression

## TODO ## Use LogisticRegression to train a model for sentiment analysis
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
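For intuition about what `fit()` does, here is a deliberately tiny logistic regression trained with batch gradient descent in plain Python. It is only a sketch under simplifying assumptions: scikit-learn uses a different solver and applies regularization by default, and the toy features (counts of the words "bad" and "good") and hyperparameters are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by minimizing the average log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Columns: count of "bad", count of "good". Labels: 1 = positive review.
X = [[0, 1], [0, 2], [1, 0], [2, 0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(w, b, [predict(x, w, b) for x in X])
```

The learned weight for "good" ends up positive and the one for "bad" negative, which is the same kind of per-word evidence the notebook's classifier learns over the full vocabulary.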

#### Testing the model
Once we have trained our model, we evaluate it using the portion of data kept for that purpose, the test dataset.

For the test dataset, we look at different evaluation metrics that allow us to assess the quality of our model.

These are [accuracy, precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
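The three metrics can be computed directly from the counts of true/false positives and negatives. The sketch below does this in plain Python on made-up labels; `binary_metrics` is a hypothetical helper, and in the notebook the equivalent values come from `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels and predictions for five reviews.
accuracy, precision, recall = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(accuracy, precision, recall)
```

Precision answers "of the reviews predicted positive, how many really are?", while recall answers "of the truly positive reviews, how many did the model find?".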

In [None]:
import sklearn
sklearn.__version__

'1.2.2'

In [32]:
## TODO ##obtain x_test by using BoW transformation of clean texts
# obtain y_test by using labels of test data
bow_test = bow_vector.transform(test_data["clean_texts"])


In [33]:
y_test = list(test_data.label.values)


In [34]:
## TODO ## use the method predict() of your classifier to obtain the predictions
predicted = classifier.predict(bow_test)


# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.8515
Logistic Regression Precision: 0.8572880666802195
Logistic Regression Recall: 0.8434


Now that we have a trained model, we can start using it for predicting new reviews and tell if they are positive or not.

```
 0 -> Negative review
 1 -> Positive review
```




Use the spacy tokenizer to obtain tokens for this text: '`might be the worst movie I have ever watched`'

In [36]:
## TODO ## use the tokenizer to obtain tokens for a given text
my_new_review = "might be the worst movie I have ever watched"
clean_tokens = spacy_tokenizer(my_new_review, nlp, stop_words, punctuations)
clean_tokens

['bad', 'movie', 'watch']

In [37]:
# join tokens into a string
clean_text = [' '.join(clean_tokens)]
clean_text

['bad movie watch']

In [38]:
## TODO ## get prediction of your clean text by using BoW and the trained classifier
bow_my_text = bow_vector.transform(clean_text)
my_review_prediction = classifier.predict(bow_my_text)
my_review_prediction

array([0])

In [39]:
# Test your model with a positive review
my_positive_review = "This is a very good movie, I would recommend it to all my friends, and family"
clean_tokens = spacy_tokenizer(my_positive_review, nlp, stop_words, punctuations)
clean_text = [' '.join(clean_tokens)]
bow_my_text = bow_vector.transform(clean_text)
my_review_prediction = classifier.predict(bow_my_text)
my_review_prediction


array([1])

#### **Training Logistic Regression with TF-IDF**
Let's try the Logistic Regression with TF-IDF vectors!

In [42]:
X_train = tfidf_train
y_train = list(train_data.label.values)

In [43]:
## TODO ## train a classifier model
classifier_tfidf = LogisticRegression(max_iter=1000)

classifier_tfidf.fit(X_train, y_train)

#### Testing the model

In [44]:
# vectorize test data
tfidf_test = tfidf_vector.transform(test_data['clean_texts'])

# Predicting with a test dataset (use the classifier trained on TF-IDF features)
predicted = classifier_tfidf.predict(tfidf_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))


Is there any difference between the performance of the models? Which one is better? Why?

Take care to evaluate with `classifier_tfidf`. Calling the BoW-trained `classifier` on the TF-IDF features gave an accuracy of 0.8286, a precision of 0.8192771084337349 and a recall of 0.8432, but those figures reflect a mismatch between training and test features rather than the quality of the TF-IDF model, so they should not be compared with the BoW results (accuracy 0.8515).

If the simpler Bag of Words (BoW) representation still performs a bit better than Term Frequency-Inverse Document Frequency (TF-IDF) in a like-for-like evaluation, some possible reasons are:
 - TF-IDF does not give the same importance to every word, while BoW does. This could benefit the BoW model on a dataset that contains rather short texts, where term frequency differences are not so crucial.
 - TF-IDF is a slightly richer weighting scheme than BoW, so it may be somewhat more prone to overfitting the training data.