# Python Text Analysis: Bag of Words

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Learn how to convert text data into numbers through a Bag-of-Words approach.
* Understand the TF-IDF algorithm and how it complements the Bag-of-Words representation.
* Implement Bag-of-Words and TF-IDF using the `sklearn` package and understand its parameter settings.
* Use the numerical representations of text data to perform sentiment analysis.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive excersise. We'll work through these in the workshop!<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br> 

### Sections
1. [Exploratory Data Analysis](#section1)
2. [Preprocessing](#section2)
3. [The Bag-of-Words Representation](#section3)
4. [Term Frequency-Inverse Document Frequency](#section4)
5. [Sentiment Classification Using the TF-IDF Representation](#section5)

In the previous part, we learned how to perform text preprocessing. However, we didn't move beyond the text data itself. If we're interested in doing any computational analysis on the text data, we still need approaches to convert the text into a **numeric representation**.

In Part 2 of our workshop series, we'll explore one of the most straightforward ways to generate a numeric representation from text: the **bag-of-words** (BoW). We will implement the BoW representation to transform our airline tweets data, and then build a classifier to explore what it tells us about the sentiment of the tweets. At the heart of the bag-of-words approach lies the assumption that the frequency of specific tokens is informative about the semantics and sentiment underlying the text.

We'll make heavy use of the `scikit-learn` package to do so, as it provides a nice framework for constructing the numeric representation.

Let's install `scikit-learn` firstǃ

In [None]:
# Uncomment to install the package
%pip install scikit-learn #instalar la librería scikit-learn usada para machine learning en Python.

Collecting scikit-learn
  Downloading scikit_learn-1.7.1-cp313-cp313-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.1-cp313-cp313-win_amd64.whl.metadata (60 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.1-cp313-cp313-win_amd64.whl (8.7 MB)
   ---------------------------------------- 0.0/8.7 MB ? eta -:--:--
   - -------------------------------------- 0.3/8.7 MB ? eta -:--:--
   -------- ------------------------------- 1.8/8.7 MB 6.0 MB/s eta 0:00:02
   --------------- ------------------------ 3.4/8.7 MB 6.7 MB/s eta 0:00:01
   --------------------- ------------------ 4.7/8.7 MB 7.0 MB/s eta 0:00:01
   -------------------------------- ------- 7.1/8.7 MB 7.6 MB/s eta 0:00:01
   ---------------------------------------- 8.7/8.7 MB 8.0 MB/s  0:00:01
Downloading scipy-1.16.1-cp313-cp313-win_amd64.whl (38.5 MB)
   ---------------------

In [None]:
# Uncomment to install the NLP packages introduced in Part 1
%pip install NLTK  # instala NLTK
%pip install spaCy # instala spaCy
!python -m spacy download en_core_web_sm # descarga el modelo de lenguaje en inglés

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
# Import other packages
import re  # Expresiones regulares (para limpieza de texto)
import numpy as np # Operaciones numéricas y manejo de arreglos
import pandas as pd # Manejo de datos en tablas (DataFrames)
import matplotlib.pyplot as plt # Gráficas y visualización básica
import seaborn as sns # Visualización avanzada y estética mejorada sobre matplotlib
from string import punctuation # Lista de signos de puntuación, útil para limpieza de texto
%matplotlib inline # Mostrar gráficos dentro del Jupyter Notebook


<a id='section1'></a>

# Exploratory Data Analysis

Before we ever do any preprocessing or modeling, we always should perform exploratory data analysis to familiarize ourselves with the data.

In [8]:
# Read in data
tweets_path = '../data/airline_tweets.csv' # Ruta donde está guardado el archivo CSV

tweets = pd.read_csv(tweets_path, sep=',') # Carga el archivo CSV en un DataFrame de pandas


In [None]:
tweets.head() # Mostra las primeras 5 filas del DataFrame tweets

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


As a refresher, each row in this dataframe correponds to a tweet. The following columns are of main interests to us. There are other columns containing metadata of the tweet, such as the author of the tweet, when it was created, the timezone of the user, and others, which we will set aside for now. 

- `text` (`str`): the text of the tweet.
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral," "positive," or "negative." 
- `airline` (`str`): the airline that is tweeted about.
- `retweet count` (`int`): how many times the tweet was retweeted.

To prepare us for sentiment classification, we'll partition the dataset to focus on the "positive" and "negative" tweets for now. 

In [12]:
# Filtra los tweets que no sean 'neutral' en 'airline_sentiment' y reinicia los índices
tweets = tweets[tweets['airline_sentiment'] != 'neutral'].reset_index(drop=True)

Let's take a look at a few tweets first!

In [13]:
# Print first five tweets
for idx in range(5):
    print(tweets['text'].iloc[idx]) # Accede al texto del tweet en la posición 'idx' y lo imprime

@VirginAmerica plus you've added commercials to the experience... tacky.
@VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
@VirginAmerica and it's a really big bad thing about it
@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.
it's really the only bad thing about flying VA
@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)


We can already see that some of these tweets contain negative sentiment—how can we tell this is the case? 

Next, let's take a look at the distribution of sentiment labels in this dataset. 

In [None]:
import seaborn as sns
# Make a bar plot showing the count of tweet sentiments
sns.countplot(
            data=tweets,  # DataFrame de origen
              x='airline_sentiment', # Columna que define las categorías del eje x
              color='cornflowerblue', # Color de las barras
              order=['positive', 'negative']); # Orden de las categorías en el gráfico

It looks like the majority of the tweets in this dataset are expressing negative sentiment!

Let's take a look at what gets more retweeted:

In [5]:
import pandas as pd

# Ruta al archivo CSV
tweets_path = '../data/airline_tweets.csv'

# Cargar el CSV en un DataFrame llamado 'tweets'
tweets = pd.read_csv(tweets_path, sep=',')

# Verificar que se cargó correctamente
tweets.head()

# Get the mean retweet count for each sentiment
# Agrupa los tweets por la columna 'airline_sentiment'. Luego selecciona la columna 'retweet_count' y calcula la media (promedio) de cada grupo
tweets.groupby('airline_sentiment')['retweet_count'].mean()

airline_sentiment
negative    0.093375
neutral     0.060987
positive    0.069403
Name: retweet_count, dtype: float64

Negative tweets are clearly retweeted more often than positive ones!

Let's see which airline receives most negative tweets:

In [7]:
# Get the proportion of negative tweets by airline
# 1. Agrupa por 'airline' y 'airline_sentiment' y cuenta el número de tweets por grupo
# 2. Divide por el total de tweets de cada aerolínea
proportions = tweets.groupby(['airline', 'airline_sentiment']).size() / tweets.groupby('airline').size()
proportions.unstack().sort_values('negative', ascending=False) # Reorganiza los datos en formato tabla (sentimientos como columnas) y ordena por la proporción de negativos

airline_sentiment,negative,neutral,positive
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
US Airways,0.776862,0.130793,0.092345
American,0.710402,0.167814,0.121783
United,0.688906,0.182365,0.128728
Southwest,0.490083,0.27438,0.235537
Delta,0.429793,0.325383,0.244824
Virgin America,0.359127,0.339286,0.301587


It looks like people are most dissatified with US Airways, followed by American Airline, both having over 85\% negative tweets!

A lot of interesting discoveries could be made if you want to explore more about the data. Now let's return to our task of sentiment analysis. Before that, we need to preprocess the text data so that they are in a standard format.

<a id='section2'></a>
# Preprocessing

We spent much of Part 1 learning how to preprocess data. Let's apply what we learned! Looking at some of the tweets above, we can see that while they are in pretty good shape, we can do some additional processing on them.

In our pipeline, we'll omit the tokenization process since we will perform it in a later step. 

## 🥊 Challenge 1: Apply a Text Cleaning Pipeline

Write a function called `preprocess` that performs the following steps on a text input:

* Step 1: Lowercase the text input.
* Step 2: Replace the following patterns with placeholders:
    * URLs &rarr; ` URL `
    * Digits &rarr; ` DIGIT `
    * Hashtags &rarr; ` HASHTAG `
    * Tweet handles &rarr; ` USER `
* Step 3: Remove extra blankspace.

Here are some hints to guide you through this challenge:

* For Step 1, recall from Part 1 that a string method called [`.lower()`](https://docs.python.org/3.11/library/stdtypes.html#str.lower) can be usd to convert text to lowercase. 
* We have integrated Step 2 into a function called `placeholder`. Run the cell below to import it into your notebook, and you can use it just like any other functions.
* For Step 3, we have provided the regex pattern for identifying whitespace characters as well as the correct replacement for extract whitespace. 

Run your `preprocess` function on `example_tweet` (three cells below) to check if it works. If it does, apply it to the entire `text` column in the tweets dataframe.

In [None]:
from utils import placeholder #importa una función o variable llamada placeholder desde un módulo llamado utils.

  digit_pattern = '\d+'
  digit_pattern = '\d+'


In [10]:
blankspace_pattern = r'\s+'# Patrón de expresión regular: uno o más espacios en blanco consecutivos
blankspace_repl = ' ' # Reemplazo: un solo espacio

def preprocess(text):
    '''Create a preprocess pipeline that cleans the tweet data.'''
    
    # Step 1: Lowercase
    text = text.lower()    # Paso 1: Convertir todo el texto a minúsculas


    # Step 2: Replace patterns with placeholders
    text = re.sub(pattern=blankspace_pattern, repl=blankspace_repl, string=text)   # Usamos la función re.sub para reemplazar todos los espacios múltiples por un solo espacio


    # Step 3: Remove extra whitespace characters
    text = text.strip()    # Paso 3: Eliminar espacios en blanco al inicio y final del texto


    return text     # Devolver el texto limpio


In [16]:
import re  # Librería para expresiones regulares

# Define un tweet de ejemplo con menciones, hashtags, URL y números
example_tweet = 'lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo'

# Print the example tweet
print(example_tweet) # Imprime el tweet original
print(f"{'='*50}") # Imprime una línea separadora

# Print the preprocessed tweet
print(preprocess(example_tweet))# Esto convierte a minúsculas, reemplaza espacios múltiples por uno solo y elimina espacios al inicio y final


lol @justinbeiber and @BillGates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo
lol @justinbeiber and @billgates are like soo 2000 #yesterday #amiright saw it on https://twitter.com #yolo


In [17]:
# Apply the function to the text column and assign the preprocessed tweets to a new column
tweets['text_processed'] = tweets['text'].apply(lambda x: preprocess(x))# Guarda los tweets limpios en una nueva columna llamada 'text_processed'

tweets['text_processed'].head()# Muestra los primeros cinco tweets preprocesados


0                  @virginamerica what @dhepburn said.
1    @virginamerica plus you've added commercials t...
2    @virginamerica i didn't today... must mean i n...
3    @virginamerica it's really aggressive to blast...
4    @virginamerica and it's a really big bad thing...
Name: text_processed, dtype: object

Congratualtions! Preprocessing is done. Let's dive into the bag-of-words!

<a id='section3'></a>
# The Bag-of-Words Representation

The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a document and toss it in a bag. The action of "throwing" the document in a bag disregards the relative position between words, so what is "in the bag" is essentially "an unsorted set of words" [(Jurafsky & Martin, 2024)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). In return, we have a list of unique words and the frequency of each of them. 

For example, as shown in the following illustration, the word "coffee" appears twice. 

<img src='../images/bow-illustration-1.png' alt="BoW-Part2" width="600">

With a bag-of-words representation, we make heavy use of word frequency but not too much of word order. 

In the context of sentiment analysis, the sentiment of a tweet is conveyed more strongly by specific words. For example, if a tweet contains the word "happy," it likely conveys positive sentiment, but not always (e.g., "not happy" denotes the opposite sentiment). When these words come up more often, they'll probably more strongly convey the sentiment.

## Document Term Matrix

Now let's implement the idea of bag-of-words. Before we dive deeper, let's step back for a moment. In practice, text analysis often involves handling many documents; from now on, we use the term **document** to represent a piece of text on which we perform analysis. It could be a phrase, a sentence, a tweet, or any other text—as long as it can be represented by a string, the length dosen't really matter. 

Imagine we have four documents (i.e., the four phrases shown above), and we toss them all in the bag. Instead of a word-frequency list, we'd expect a document-term matrix (DTM) in return. In a DTM, the word list is the **vocabulary** (V) that holds all unique words occur across the documents. For each **document** (D), we count the number of occurence of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we will construct is a $D \times V$ matrix, where each row corresponds to a document, and each column corresponds to a token (or "term").

The unique tokens in this set of documents, arranged in alphabetical order, form the columns. For each document, we mark the occurence of each word present in the document. The numerical representation for each document is a row in the matrix. For example, the first document, "the coffee roaster," has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.

Note that the left index column now displays these documents as text, but typically we would just assign an index to each of them. 

$$
\begin{array}{c|cccccccccccc}
 & \text{americano} & \text{coffee} & \text{iced} & \text{light} & \text{roast} & \text{roaster} & \text{the} & \text{time} \\\hline
\text{the coffee roaster} &0 &1	&0	&0	&0	&1	&1	&0 \\ 
\text{light roast} &0 &0	&0	&1	&1	&0	&0	&0 \\
\text{iced americano} &1 &0	&1	&0	&0	&0	&0	&0 \\
\text{coffee time} &0 &1	&0	&0	&0	&0	&0	&1 \\
\end{array}
$$

To create a DTM, we will use `CountVectorizer` from the package `sklearn`.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer# Importa CountVectorizer de scikit-learn, que convierte texto en una matriz de conteo de palabras

The following illustration depicts the three-step workflow of creating a DTM with `CountVectorizr`.

<img src='../images/CountVectorizer1.png' alt="CountVectorizer" width="500">

Let's walk through these steps with the toy example shown above.

### A Toy Example

In [20]:
# A toy example containing four documents
test = ['the coffee roaster',# Documento 1: contiene las palabras 'the', 'coffee', 'roaster'
        'light roast',  # Documento 2: contiene las palabras 'light', 'roast'
        'iced americano', # Documento 3: contiene las palabras 'iced', 'americano'
        'coffee time']  # Documento 4: contiene las palabras 'coffee', 'time'

The first step is to initialize a `CountVectorizer` object. Within the round paratheses, we can specify parameter settings if desired. Let's take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available.  

For now we can just leave it blank to use the default settings. 

In [21]:
# Create a CountVectorizer object
vectorizer = CountVectorizer() # Este objeto se utilizará para transformar la lista de documentos de texto en una matriz de conteo de palabras

The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. Thirdly, we `transform` our data according to the "fitted" `CountVectorizer` object, which means taking each of the document and counting the occurrences of tokens according to the vocabulary established during the "fitting" step.

It may sound a bit complex but steps 2 and 3 can be done in one swoop using a `fit_transform` function.

In [22]:
# Fit and transform to create a DTM
test_count = vectorizer.fit_transform(test) # Ajusta (fit) el CountVectorizer al corpus y transforma (transform) los documentos en una matriz de conteo de palabras (DTM)


The return of `fit_transform` is supposed to be the DTM. 

Let's take a look at it!

In [23]:
test_count #Es la matriz dispersa que resulta de aplicar CountVectorizer a la lista test

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 9 stored elements and shape (4, 8)>

Apparently we've got a "sparse matrix"—a matrix that contains a lot of zeros. This makes sense. For each document, there are words that don't occur at all, and these are counted as zero in the DTM. This sparse matrix is stored in a "Compressed Sparse Row" format, a memory-saving format designed for handling sparse matrices. 

Let's convert it to a dense matrix, where those zeros are probably represented, as in a numpy array.

In [None]:
# Convert DTM to a dense matrix 
test_count_dense = test_count.todense() # Convierte la matriz dispersa (DTM) en una matriz densa para poder visualizar los conteos de palabras


matrix([[0, 1, 0, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 1]])

So this is our DTM! The matrix is the same as shown above. To make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with the `get_feature_names_out` function.

In [24]:
# Retrieve the vocabulary
vectorizer.get_feature_names_out()# Devuelve un array con todas las palabras únicas de los documentos en orden alfabético


array(['americano', 'coffee', 'iced', 'light', 'roast', 'roaster', 'the',
       'time'], dtype=object)

In [25]:
# Create a DTM dataframe
# Cada fila representa un documento
# Cada columna representa una palabra del vocabulario
# Los valores indican cuántas veces aparece cada palabra en cada documento
test_dtm = pd.DataFrame(data=test_count.todense(), # Convertimos la matriz dispersa a una matriz densa
                        columns=vectorizer.get_feature_names_out()) # Nombramos las columnas con las palabras del vocabulario
                 

Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list. 

In [26]:
test_dtm # DataFrame que representa la Document-Term Matrix (DTM) de los documentos de ejemplo 'test'

Unnamed: 0,americano,coffee,iced,light,roast,roaster,the,time
0,0,1,0,0,0,1,1,0
1,0,0,0,1,1,0,0,0
2,1,0,1,0,0,0,0,0
3,0,1,0,0,0,0,0,1


Hopefully this toy example provides a clear walkthrough of creating a DTM.

Now it's time for our tweets data!

### DTM for Tweets

We'll begin by initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. These parameters are currently set to their default values.

When we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them. The next three parameters are used to control the size of the vocabulary, which we'll return to in a minute.

In [28]:
# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True,# Convierte todo el texto a minúsculas antes de tokenizar
                             stop_words=None, # No elimina ninguna palabra por defecto (no usa stopwords)
                             min_df=1, # La palabra debe aparecer al menos en 1 documento para incluirse en el vocabulario
                             max_df=1.0, # La palabra puede aparecer como máximo en el 100% de los documentos
                             max_features=None)  # No hay límite en el número de características (palabras) a extraer

In [27]:
# Fit and transform to create DTM
# Ajusta (fit) el CountVectorizer al corpus de tweets preprocesados y transforma (transform) cada tweet en un vector de conteo de palabras (DTM)
counts = vectorizer.fit_transform(tweets['text_processed'])
counts # counts ahora contiene la Document-Term Matrix (DTM) en formato de matriz dispersa

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 234281 stored elements and shape (14640, 15051)>

In [None]:
# Do not run if you have limited memory - this includes DataHub and Binder
np.array(counts.todense()) # Convierte la matriz dispersa 'counts' a una matriz densa de NumPy


In [None]:
# Extract tokens
tokens = vectorizer.get_feature_names_out() # Estos tokens corresponden a las columnas de la Document-Term Matrix (DTM)

In [None]:
# Create DTM
# Cada fila corresponde a un tweet (usando el mismo índice que el DataFrame original)
# Cada columna corresponde a una palabra del vocabulario
first_dtm = pd.DataFrame(data=counts.todense(), # Convertimos la matriz dispersa a matriz densa
                         index=tweets.index, # Usamos los índices originales de los tweets
                         columns=tokens)  # Nombres de columnas = palabras únicas aprendidas por CountVectorizer

# Print the shape of DTM
print(first_dtm.shape) # Muestra (cantidad de tweets, cantidad de palabras únicas)

If we leave the `CountVectorizer` to the default setting, the vocabulary size of the tweet data is 8751. 

In [None]:
first_dtm.head() # Muestra las primeras 5 filas del DataFrame `first_dtm`
                   # Cada fila = un tweet
                   # Cada columna = una palabra del vocabulario
                   # Cada valor = número de veces que aparece esa palabra en ese tweet

Most of the tokens have zero occurences at least in the first five tweets. 

Let's take a closer look at the DTM!

In [None]:
# Most frequent tokens
# 1. first_dtm.sum() → suma los valores de cada columna, es decir, cuenta cuántas veces aparece cada palabra en todos los tweets
# 2. .sort_values(ascending=False) → ordena las palabras de mayor a menor frecuencia
# 3. .head(10) → muestra las 10 palabras más frecuentes
first_dtm.sum().sort_values(ascending=False).head(10)

In [None]:
# Least frequent tokens
# 1. first_dtm.sum() → suma los valores de cada columna, es decir, cuenta cuántas veces aparece cada palabra en todos los tweets
# 2. .sort_values(ascending=True) → ordena las palabras de menor a mayor frecuencia
# 3. .head(10) → muestra las 10 palabras menos frecuentes
first_dtm.sum().sort_values(ascending=True).head(10)

_exact_                     1
mightmismybrosgraduation    1
midterm                     1
midnite                     1
midland                     1
michelle                    1
michele                     1
michael                     1
mhtt                        1
mgmt                        1
dtype: int64

It is not surprising to see "user" and "digit" to be among the most frequent tokens as we replaced each idiosyncratic one with these placeholders. The rest of the most frequent tokens are mostly stop words.

Perhaps a more interesting pattern is to look for which token appears most in any given tweet:

In [None]:
counts = pd.DataFrame() # Crear un DataFrame vacío para almacenar los tokens más frecuentes por tweet

# Retrieve the index of the tweet where a token appears most frequently
counts['token'] = first_dtm.idxmax(axis=1)# idxmax(axis=1) devuelve el nombre de la columna con el valor máximo por fila

# Retrieve the number of occurrence
# Recuperar el nombre del token que aparece con mayor frecuencia en cada tweet 
counts['number'] = first_dtm.max(axis=1) # max(axis=1) devuelve el valor máximo por fila (frecuencia del token)

# Filter out placeholders
# Filtrar los tokens que son placeholders (digit, hashtag, user)
# y mostrar los 10 tokens más frecuentes después del filtrado
counts[(counts['token']!='digit')
       & (counts['token']!='hashtag')
       & (counts['token']!='user')].sort_values('number', ascending=False).head(10)

It looks like among all tweets, at most a token appears six times, and it is either the word "It" or the word "worst." 

Let's go back to our tweets dataframe and locate the 918th tweet.

In [41]:
# Retrieve 918th tweet: "worst"
# Recuperar el tweet en la posición 918 del DataFrame
# Muestra el contenido del tweet original en la columna 'text'
tweets.iloc[918]['text']

"@united Agent in LAS letting 20 customers know they can't help them rebook delayed flight to DEN #unfriendlyskies http://t.co/QuzVmK2rTR"

## Customize the `CountVectorizer`

So far we've always used the default parameter setting to create our DTMs, but in many cases we may want to customize the `CountVectorizer` object. The purpose of doing so is to further filter out unnecessary tokens. In the example below, we tweak the following parameters:

- `stop_words = 'english'`: ignore English stop words 
- `min_df = 2`: ignore words that don't occur at least twice
- `max_df = 0.95`: ignore words if they appear in more than 95\% of the documents

🔔 **Question**: Let's pause for a minute to discuss whether it sounds reasonable to set these parameters! What do you think?

Oftentimes, we are not interested in words whose frequencies are either too low or too high, so we use `min_df` and `max_df` to filter them out. Alternatively, we can define our vocabulary size as $N$ by setting `max_features`. In other words, we tell `CountVectorizer` to only consider the top $N$ most frequent tokens when constructing the DTM.

In [42]:
# Customize the parameter setting
# Crear un objeto CountVectorizer con parámetros personalizados
vectorizer = CountVectorizer(lowercase=True, # Convierte todo el texto a minúsculas antes de tokenizar
                             stop_words='english', # Elimina las palabras vacías (stop words)
                             min_df=2,  # Ignora palabras que aparecen en menos de 2 documentos
                             max_df=0.95, # Ignora palabras que aparecen en más del 95% de los documentos
                             max_features=None) # No limita el número máximo de palabras (todas se incluyen)

In [43]:
# Fit, transform, and get tokens
# Ajustar el vectorizador al texto preprocesado y transformar los tweets en una matriz de conteo
counts = vectorizer.fit_transform(tweets['text_processed']) # Devuelve una matriz dispersa donde cada fila = tweet, cada columna = palabra, y cada celda = número de apariciones
tokens = vectorizer.get_feature_names_out() # Obtener los tokens (palabras únicas) identificadas por el CountVectorizer

# Create the second DTM
# Crear un DataFrame de la segunda Document-Term Matrix (DTM)
second_dtm = pd.DataFrame(data=counts.todense(), # Convertir la matriz dispersa a matriz densa
                          index=tweets.index,  # Mantener los índices originales de los tweets
                          columns=tokens) # Usar los tokens como nombres de columnas

Our second DTM has a substantially smaller vocabulary compared to the first one.

In [None]:
# Imprimir la forma (número de filas y columnas) de la primera DTM
# Cada fila = un tweet, cada columna = una palabra
print(first_dtm.shape)
# Imprimir la forma (número de filas y columnas) de la segunda DTM
# Esta DTM usa parámetros de CountVectorizer más estrictos (stop words, min_df, max_df)
print(second_dtm.shape)

In [45]:
# Muestra las primeras 5 filas del DataFrame `second_dtm`
# Cada fila = un tweet
# Cada columna = una palabra del vocabulario filtrado por CountVectorizer
# Cada valor = número de veces que aparece esa palabra en ese tweet
second_dtm.head()

Unnamed: 0,00,000,00am,00pm,02,03,05,0510,05am,05pm,...,yvonne,yvr,yyj,yyz,zero,zik2uoxgnw,zkatcher,zone,zoom,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The most frequent token list now includes words that make more sense to us, such as "cancelled" and "service." 

In [46]:
# Obtener las 10 palabras más frecuentes en la segunda DTM
# 1. second_dtm.sum() → suma los valores de cada columna, contando cuántas veces aparece cada palabra en todos los tweets
# 2. .sort_values(ascending=False) → ordena las palabras de mayor a menor frecuencia
# 3. .head(10) → muestra las 10 palabras más frecuentes
second_dtm.sum().sort_values(ascending=False).head(10)

united          4164
flight          3939
usairways       3053
americanair     2964
southwestair    2461
jetblue         2395
http            1155
thanks          1083
cancelled       1065
just             974
dtype: int64

## 🥊 Challenge 2: Lemmatize the Text Input

Recall from Part 1 that we introduced using `spaCy` to perform lemmatization, i.e., to "recover" the base form of a word. This process will reduce vocabulary size by keeping word variations minimal—a smaller vocabularly may help improve model performance in sentiment classification. 

Now let's implement lemmatization on our tweet data and use the lemmatized text to create a third DTM. 

Complete the function `lemmatize_text`. It requires a text input and returns the lemmas of all tokens. 

Here are some hints to guide you through this challenge:

- Step 1: initialize a list to hold lemmas
- Step 2: apply the `nlp` pipeline to the input text
- Step 3: iterate over tokens in the processed text and retrieve the lemma of the token
    - HINT: lemmatization is one of the linguistic annotations that the `nlp` pipeline automatically does for us. We can use `token.lemma_` to access the annotation.

In [47]:
# Import spaCy
import spacy # Importa la librería spaCy para procesamiento de lenguaje natural
nlp = spacy.load('en_core_web_sm') # Carga el modelo en inglés pequeño 'en_core_web_sm' para tokenización, lematización y reconocimiento de entidades

In [48]:
# Create a function to lemmatize text
def lemmatize_text(text):
    '''Lemmatize the text input with spaCy annotations.'''

    # Step 1: Initialize an empty list to hold lemmas
    lemma = [] 

    # Step 2: Apply the nlp pipeline to input text
    doc = nlp(text)

    # Step 3: Iterate over tokens in the text to get the token lemma
    for token in doc:
        lemma.append(token.lemma_)# token.lemma_ devuelve la forma base del token

    # Step 4: Join lemmas together into a single string
    text_lemma = ' '.join(lemma)
    
    return text_lemma # Devolver el texto lematizado

Let's apply the function to the following example tweet first!

In [49]:
# Apply the function to an example tweet
print(tweets.iloc[33]["text_processed"])# Mostra el tweet preprocesado en la posición 33
print(f"{'='*50}") # Imprime una línea separadora
# Mostrar el mismo tweet después de aplicar la función de lematización
# Cada palabra se convierte a su forma base (por ejemplo: "running" → "run")
print(lemmatize_text(tweets.iloc[33]['text_processed']))

@virginamerica awaiting my return phone call, just would prefer to use your online self-service option :(
@virginamerica await my return phone call , just would prefer to use your online self - service option :(


And then let's lemmatize the tweet data and save the output to a new column `text_lemmatized`.

In [50]:
# This may take a while!
tweets['text_lemmatized'] = tweets['text_processed'].apply(lambda x: lemmatize_text(x))# Aplica la función de lematización a todos los tweets preprocesados

Now with the `text_lemmatized` column, let's create a third DTM. The parameter setting is the same as the second DTM. 

In [51]:
# Create the vectorizer (the same param setting as previous)
vectorizer = CountVectorizer(lowercase=True,# Convierte todo el texto a minúsculas
                             stop_words='english', # Eliminar stop words en inglés
                             min_df=2, # Ignora palabras que aparecen en menos de 2 documentos
                             max_df=0.95, # Ignora palabras que aparecen en más del 95% de los documentos
                             max_features=None) # No limita el número de palabras

# Fit, transform, and get tokens
counts = vectorizer.fit_transform(tweets['text_lemmatized']) # Ajusta el vectorizador a los tweets lematizados y transformar en matriz de conteo
tokens = vectorizer.get_feature_names_out() # Obtiene los tokens (palabras únicas) identificadas por CountVectorizer

# Create the third DTM
third_dtm = pd.DataFrame(data=counts.todense(), # Convierte la matriz dispersa a matriz densa
                         index=tweets.index, # Mantene los índices originales de los tweets
                         columns=tokens) # Usa los tokens como nombres de columnas
third_dtm.head() # Mostrar las primeras 5 filas de la tercera DTM


Unnamed: 0,00,000,00am,00pm,02,03,05,0510,05am,05pm,...,yvonne,yvr,yyj,yyz,zero,zik2uoxgnw,zkatcher,zone,zoom,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Print the shapes of three DTMs
# Imprime las dimensiones de las tres Document-Term Matrices (DTM)
# Cada fila = número de tweets
# Cada columna = número de palabras únicas en el vocabulario correspondiente
print(first_dtm.shape)   # Primera DTM: sin filtrar stop words ni aplicar frecuencia mínima/máxima
print(second_dtm.shape)  # Segunda DTM: con stop words y filtros de frecuencia aplicados
print(third_dtm.shape)   # Tercera DTM: después de lematización y filtros, más refinada

Let's print the top 10 most frequent tokens as usual. These tokens are now lemmas and their counts also change after lemmatization. 

In [53]:
# Get the most frequent tokens in the third DTM
# 1. third_dtm.sum() → suma los valores de cada columna, contando cuántas veces aparece cada palabra en todos los tweets lematizados
# 2. .sort_values(ascending=False) → ordena las palabras de mayor a menor frecuencia
# 3. .head(10) → muestra las 10 palabras más frecuentes
third_dtm.sum().sort_values(ascending=False).head(10)

flight          4754
americanair     2964
united          2591
southwestair    2461
jetblue         2395
usairway        2366
thank           1680
unite           1574
http            1155
hour            1152
dtype: int64

In [54]:
# Compared to the most frequent tokens in the second DTM
# 1. second_dtm.sum() → suma los valores de cada columna, contando cuántas veces aparece cada palabra en todos los tweets preprocesados (sin lematización)
# 2. .sort_values(ascending=False) → ordena las palabras de mayor a menor frecuencia
# 3. .head(10) → muestra las 10 palabras más frecuentes
second_dtm.sum().sort_values(ascending=False).head(10)

united          4164
flight          3939
usairways       3053
americanair     2964
southwestair    2461
jetblue         2395
http            1155
thanks          1083
cancelled       1065
just             974
dtype: int64

<a id='section4'></a>

# Term Frequency-Inverse Document Frequency 

So far, we're relying on word frequency to give us information about a document. This assumes if a word appears more often in a document, it's more informative. However, this may not always be the case. For example, we've already removed stop words because they are not informative, despite the fact that they appear many times in a document. We also know the word "flight" is among the most frequent words, but it is not that informative, because it appears in many documents. Since we're looking at airline tweets, we shouldn't be surprised to see the word "flight"!

To remedy this, we use a weighting scheme called **tf-idf (term frequency-inverse document frequency)**. The big idea behind tf-idf is to weight a word not just by its frequency within a document, but also by its frequency in one document relative to the remaining documents. So, when we construct the DTM, we will be assigning each term a **tf-idf score**. Specifically, term $t$ in document $d$ is assigned a tf-idf score as follows:

<img src='../images/tf-idf_finalized.png' alt="TF-IDF" width="1200">

In essence, the tf-idf score of a word in a document is the product of two components: **term frequency (tf)** and **inverse document frequency (idf)**. The idf acts as a scaling factor. If a word occurs in all documents, then idf equals 1. No scaling will happen. But idf is typically greater than 1, which is the weight we assign to the word to make the tf-idf score higher, so as to highlight that the word is informative. In practice, we add 1 to both the denominator and numerator ("add-1 smooth") to prevent any issues with zero occurrences.

We can also create a tf-idf DTM using `sklearn`. We'll use a `TfidfVectorizer` this time:

In [55]:
# Importa TfidfVectorizer de scikit-learn, usado para convertir texto en una matriz TF-IDF
# TF-IDF pondera las palabras según su frecuencia en un documento y su rareza en todo el corpus
from sklearn.feature_extraction.text import TfidfVectorizer

In [56]:
# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True, # Convierte todo el texto a minúsculas antes de vectorizar
                             stop_words='english', # Elimina palabras vacías (stop words) en inglés
                             min_df=2, # Ignora palabras que aparecen en menos de 2 documentos
                             max_df=0.95, # Ignora palabras que aparecen en más del 95% de los documentos
                             max_features=None) # No limitar el número de palabras

In [57]:
# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])# Ajusta el vectorizador TF-IDF a los tweets lematizados y transforma el texto en una matriz TF-IDF
tf_dtm # Muestra la matriz TF-IDF resultante (matriz dispersa)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 119862 stored elements and shape (14640, 5090)>

In [58]:
# Create a tf-idf dataframe
# Crear un DataFrame TF-IDF a partir de la matriz dispersa
tfidf = pd.DataFrame(tf_dtm.todense(),  # Convertir la matriz TF-IDF dispersa a matriz densa
                     columns=vectorizer.get_feature_names_out(), # Usar los tokens (palabras) como nombres de columnas
                     index=tweets.index)  # Mantener los índices originales de los tweets
tfidf.head() # Mostrar las primeras 5 filas del DataFrame TF-IDF

Unnamed: 0,00,000,00am,00pm,02,03,05,0510,05am,05pm,...,yvonne,yvr,yyj,yyz,zero,zik2uoxgnw,zkatcher,zone,zoom,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


You may have noticed that the vocabulary size is the same as we saw in Challenge 2. This is because we used the same parameter setting when creating the vectorizer. But the values in the matrix are different—they are tf-idf scores instead of raw counts. 

## Interpret TF-IDF Values

Let's take a look the document where a term has the highest tf-idf values. We'll use the `.idxmax()` method to find the index.

In [59]:
# Retrieve the index of the document
tfidf.idxmax() # Obtener el índice del tweet donde cada palabra tiene su valor TF-IDF máximo

00            10280
000            3288
00am          10222
00pm          11271
02             7905
              ...  
zik2uoxgnw    11888
zkatcher       8389
zone           3975
zoom           4970
zurich         2727
Length: 5090, dtype: int64

For example, the term "worst" occurs most distinctively in the 918th tweet. 

In [60]:
tfidf.idxmax()['worst']# Obtener el índice del tweet donde la palabra 'worst' tiene el valor TF-IDF más alto


np.int64(10265)

Recall that this is the tweet where the word "worst" appears six times!

In [61]:
tweets['text_processed'].iloc[918]# Obtiene el tweet preprocesado en la posición 918


"@united agent in las letting 20 customers know they can't help them rebook delayed flight to den #unfriendlyskies http://t.co/quzvmk2rtr"

How about "cancel"? Let's take a look at another example. 

In [62]:
tfidf.idxmax()['cancel'] # Obtener el índice del tweet donde la palabra 'cancel' tiene el valor TF-IDF más alto

np.int64(7840)

In [63]:
tweets['text_processed'].iloc[5945]# Obtiene el tweet preprocesado en la posición 5945

"@southwestair you need to get your act together. you new this morning at 830 our plane was malfunctioning. yet i've been delayed 3 times .."

## 🥊 Challenge 3: Words with Highest Mean TF-IDF scores

We have obtained tf-idf values for each term in each document. But what do these values tell us about the sentiments of tweets? Are there any words that are  particularly informative for positive/negative tweets? 

To explore this, let's gather the indices of all positive/negative tweets and calculate the mean tf-idf scores of words appear in each category. 

We've provided the following starter code to guide you:
- Subset the `tweets` dataframe according to the `airline_sentiment` label and retrieve the index of each subset (`.index`). Assign the index to `positive_index` or `negative_index`.
- For each subset:
    - Retrieve the td-idf representation 
    - Take the mean tf-idf values across the subset using `.mean()`
    - Sort the mean values in the descending order using `.sort_values()`
    - Get the top 10 terms using `.head()`

Next, run `pos.plot` and `neg.plot` to plot the words with the highest mean tf-idf scores for each subset. 

In [64]:
# Complete the boolean masks 
positive_index = tweets[tweets['airline_sentiment'] == 'positive'].index # Índices de tweets con sentimiento positivo
negative_index = tweets[tweets['airline_sentiment'] == 'negative'].index # Índices de tweets con sentimiento negativo

In [65]:
# Complete the following two lines
# Obtenerlas palabras más frecuentes en tweets positivos
pos = tfidf.loc[positive_index]       # Selecciona los tweets con sentimiento positivo usando sus índices
pos = pos.mean()                      # Calcula el valor promedio TF-IDF de cada palabra en esos tweets
pos = pos.sort_values(ascending=False) # Ordena de mayor a menor importancia
pos = pos.head(10)   
# Obtener las palabras más frecuentes en tweets negativos
neg = tfidf.loc[negative_index]       # Selecciona los tweets con sentimiento negativo usando sus índices
neg = neg.mean()                      # Calcula el valor promedio TF-IDF de cada palabra en esos tweets
neg = neg.sort_values(ascending=False) # Ordena de mayor a menor importancia
neg = neg.head(10)                     # Selecciona las 10 palabras más representativas

In [None]:
# Graficar las 10 palabras más representativas de tweets positivos usando un gráfico de barras horizontal
pos.plot(kind='barh', # Tipo de gráfico: barras horizontales
         xlim=(0, 0.18), # Limitar el eje X entre 0 y 0.18
         color='cornflowerblue',  # Color de las barras
         title='Top 10 terms with the highest mean tf-idf values for positive tweets');# Título del gráfico

In [None]:
# Graficar las 10 palabras más representativas de tweets negativos usando un gráfico de barras horizontal
neg.plot(kind='barh', # Tipo de gráfico: barras horizontales
         xlim=(0, 0.18), # Limitar el eje X entre 0 y 0.18
         color='darksalmon', # Color de las barras
         title='Top 10 terms with the highest mean tf-idf values for negative tweets');# Título del gráfico

🔔 **Question**: How would you interpret these results? Share your thoughts in the chat!

<a id='section5'></a>

## 🎬 **Demo**: Sentiment Classification Using the TF-IDF Representation

Now that we have a tf-idf representation of the text, we are ready to do sentiment analysis!

In this demo, we will use a logistic regression model to perform the classification task. Here we briefly step through how logistic regression works as one of the supervised Machine Learning methods, but feel free to explore our workshop on [Python Machine Learning Fundamentals](https://github.com/dlab-berkeley/Python-Machine-Learning) if you want to learn more about it.

Logistic regression is a linear model, with which we use to predict the label of a tweet, based on a set of features ($x_1, x_2, x_3, ..., x_i$), as shown below:

$$
L = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_T x_T
$$

The list of features we'll pass to the model is the vocabulary of the DTM. We also feed the model with a portion of the data, known as the training set, along with other model specification, to learn the coeffient ($\beta_1, \beta_2, \beta_3, ..., \beta_i$) of each feature. The coefficients tell us whether a feature contributes positively or negatively to the predicted value. The predicted value corresponds to adding all features (multiplied by their coefficients) up, and the predicted value gets passed to a [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) to be converted into the probability space, which tells us whether the predicted label is positive (when $p>0.5$) or negative (when $p<0.5$). 

The remaining portion of the data, known as the test set, is used to test whether the learned coefficients could be generalized to unseen data. 

Now that we already have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!

In [7]:
from sklearn.linear_model import LogisticRegressionCV # Importar el modelo de regresión logística con validación cruzada
from sklearn.model_selection import train_test_split # Importar función para dividir los datos en conjuntos de entrenamiento y prueba


We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:

In [None]:
# Train-test split
# Definir las variables predictoras (features) y la variable objetivo (target)
X = tfidf # Variables predictoras: matriz TF-IDF de los tweets
y = tweets['airline_sentiment'] # Variable objetivo: sentimiento de cada tweet
# Dividir los datos en conjuntos de entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(
    X, y, # Datos y etiquetas
    test_size=0.15 # Proporción del 15% para el conjunto de prueba
)
  

The `fit_logistic_regression` function is written below to streamline the training process.

In [10]:
# Definir una función para entrenar un modelo de regresión logística
def fit_logistic_regression(X, y):
    '''Fits a logistic regression model to provided data.'''
    # Crear y entrenar el modelo de regresión logística con validación cruzada
    model = LogisticRegressionCV(Cs=10,# Número de valores de regularización C a probar
                                 penalty='l1', # Tipo de penalización L1 (Lasso) para selección de variables
                                 cv=5,# Número de pliegues para validación cruzada
                                 solver='liblinear', # Algoritmo de optimización (compatible con L1)
                                 class_weight='balanced',# Ajusta pesos para manejar clases desbalanceadas
                                 random_state=42,# Fijar semilla para reproducibilidad
                                 refit=True # Reentrena el modelo final usando todos los datos
                                 ).fit(X, y)# Ajusta el modelo a los datos proporcionados
    return model # Devuelve el modelo entrenado

We'll fit the model and compute the training and test accuracy.

In [None]:
# Fit the logistic regression model
model = fit_logistic_regression(X_train, y_train)# Ajusta el modelo de regresión logística usando los datos de entrenamiento


In [None]:
# Get the training and test accuracy
print(f"Training accuracy: {model.score(X_train, y_train)}")# Muestra la precisión (accuracy) del modelo en los datos de entrenamiento
print(f"Test accuracy: {model.score(X_test, y_test)}")# Muestra la precisión (accuracy) del modelo en los datos de prueba

The model achieved ~94% accuracy on the training set and ~89% on the test set—that's pretty good! The model generalizes reasonably well to the test data.

Next, let's also take a look at the fitted coefficients to see if what we see makes sense. 

We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:

In [None]:
# Get coefs of all features
coefs = model.coef_.ravel()# Obtener los coeficientes (pesos) que el modelo de regresión logística aprendió

# Get all tokens
tokens = vectorizer.get_feature_names_out()# Obtener la lista de tokens (palabras) del vocabulario generado por el vectorizador TF-IDF

# Create a token-coef dataframe
importance = pd.DataFrame()# Crear un DataFrame vacío para organizar la relación entre tokens y coeficientes
importance['token'] = tokens # Agregar la columna "token" con todas las palabras del vocabulario
importance['coefs'] = coefs # Agregar la columna "coefs" con los valores de los coeficientes aprendidos para cada palabra

In [None]:
# Get the top 10 tokens with lowest coefs
# Ordena el DataFrame 'importance' por la columna 'coefs' en orden ascendente
# De esta forma, los coeficientes más negativos (palabras asociadas a sentimientos negativos) quedan arriba
neg_coef = importance.sort_values('coefs').head(10)
neg_coef # Muestra las primeras 10 filas de ese nuevo DataFrame


In [None]:
# Get the top 10 tokens with highest coefs
# Ordena el DataFrame 'importance' por la columna 'coefs' en orden ascendente
# y selecciona las últimas 10 filas, que corresponden a los coeficientes más altos (positivos)
pos_coef = importance.sort_values('coefs').tail(10)
pos_coef 

Let's plot the top 10 tokens with the highest/lowest coefficients. 

In [None]:
importance.head()
pos_coef = importance.sort_values(by='coefs', ascending=True).tail(10)
# Plot the top 10 tokens that have the highest coefs
# Ordena el DataFrame 'pos_coef' de mayor a menor según la columna 'coefs'
pos_coef.sort_values('coefs', ascending=False) \
        .plot(kind='barh', # Genera un gráfico de barras horizontal (barh) con los datos ordenados
              xlim=(0, 18), # Establece el rango del eje X entre 0 y 18
              x='token', # Define que el eje Y (categorías) muestre el valor de la columna 'token'
              color='cornflowerblue', # Asigna el color 'cornflowerblue' a las barras
              title='Top 10 tokens with highest coeffient values'); # Coloca un título al gráfico

In [None]:
# Plot the top 10 tokens that have the lowest coefs
neg_coef.plot( 
                kind='barh',       # Tipo de gráfico: barras horizontales
                xlim=(0, -18),     # Limite del eje x (aunque negativo aquí probablemente sea un error)
                x='token',         # Define que el eje Y muestre los nombres de los tokens
                color='darksalmon',# Color de las barras
                title='Top 10 tokens with lowest coeffient values' # Título del gráfico
)

Words like "ruin," "rude," and "hour" are strong indicators of negative sentiment, while "thank," "awesome," and "wonderful" are associated with positive sentiment. 

We will wrap up Part 2 with these plots. These coefficient terms and the words with the highest TF-IDF values provide different perspectives on the sentiment of tweets. If you'd like, take some time to compare the two sets of plots and see which one provides a better account of the sentiments conveyed in tweets.

<div class="alert alert-success">

## ❗ Key Points

* A Bag-of-Words representation is a simple method to transform our text data to numbers. It focuses on word frequency but not word order. 
* A TF-IDF representation is a step further; it also considers if a certain word distinctively appears in one document or occurs uniformally across all documents. 
* With a numerical representation, we can perform a range of text classification task, such as sentiment analysis. 

</div>