# Python Text Analysis: Preprocessing

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.
* Know commonly used NLP packages and what they are capable of.
* Understand tokenizers, and how they have changed since the advent of Large Language Models.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br> 

### Sections
1. [Preprocessing](#section1)
2. [Tokenization](#section2)

In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis, from simple string methods to dedicated NLP packages such as `nltk` and `spaCy`, and more recent ones built around Large Language Models (e.g., `BERT`).

Let's make sure these packages are properly installed before diving into the materials.

In [None]:
# Run the following lines to install the packages and the spaCy model
# Install NLTK, a library for classic natural language processing
%pip install NLTK
# Install Hugging Face transformers to use pretrained NLP models such as BERT or GPT
%pip install transformers
# Install spaCy, a fast, production-oriented NLP library
%pip install spaCy
# Download the small English model en_core_web_sm for spaCy
!python -m spacy download en_core_web_sm
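
To confirm that the installation worked, a quick check like the one below can help. It is only a sketch: the helper name `missing_packages` is made up for this example, and the check only tells us whether a package can be found, not whether the spaCy model `en_core_web_sm` has been downloaded.

```python
from importlib.util import find_spec

# Import names of the packages this workshop relies on
required = ['nltk', 'transformers', 'spacy']

def missing_packages(names):
    """Return the names that cannot be found in the current environment.

    Hypothetical helper written for this example.
    """
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(required)
if missing:
    print(f"Not installed yet: {', '.join(missing)}")
else:
    print("All packages are installed.")
```

If a package shows up as missing, rerun the corresponding install line above.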


<a id='section1'></a>

# Preprocessing

In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.

You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers. 

🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data. 
- What is the format of the text data you have interacted with (plain text, CSV, or XML)?
- Where does it come from (structured corpus, scraped from the web, survey data)?
- Is it messy (i.e., is the data formatted consistently)?

## Common Processes

Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.

Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.

The following processes, for example, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functionality, such as string methods and regular expressions.
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters
- Remove stop words

After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features).  

Before we jump into these operations, let's take a look at our data!

### Import the Text Data

The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scraped in February 2015.

Let's read the file `airline_tweets.csv` into a dataframe with `pandas`.

In [None]:
# Import pandas under its conventional alias
import pandas as pd

# File path to data
csv_path = '../data/airline_tweets.csv'

# Read the CSV file into a dataframe, specifying the separator
tweets = pd.read_csv(csv_path, sep=',')

In [None]:
# Show the first five rows (the default for .head())
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


The dataframe has one row per tweet. The text of each tweet is stored in the `text` column.
- `text` (`str`): the text of the tweet.

Other metadata we are interested in include: 
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral," "positive," or "negative."
- `airline` (`str`): the airline that is tweeted about.
- `retweet_count` (`int`): how many times the tweet was retweeted.

Let's take a look at some of the tweets:

In [4]:
print(tweets['text'].iloc[0])  # First tweet (row 0 of the 'text' column)
print(tweets['text'].iloc[1])  # Second tweet (row 1 of the 'text' column)
print(tweets['text'].iloc[2])  # Third tweet (row 2 of the 'text' column)

@VirginAmerica What @dhepburn said.
@VirginAmerica plus you've added commercials to the experience... tacky.
@VirginAmerica I didn't today... Must mean I need to take another trip!


🔔 **Question**: What have you noticed? What are the stylistic features of tweets?

### Lowercasing

While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.

More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.
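
To make the frequency-analysis point concrete, here is a minimal sketch using `collections.Counter` from the standard library. The sentence is a made-up example, not taken from the tweets data.

```python
from collections import Counter

# Hypothetical example sentence (not taken from the tweets data)
sample = "Flight delayed. flight cancelled. FLIGHT rebooked."

# Naive whitespace split, stripping the trailing period from each word
words = [word.strip('.') for word in sample.split()]

# Case-sensitive counts treat each casing as a different word
case_sensitive = Counter(words)

# Lowercasing first merges the variants into one entry
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive['flight'])    # 1
print(case_insensitive['flight'])  # 3
```

Without lowercasing, the three spellings of "flight" end up in three separate counts.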

We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.

Let's apply it to the following example:

In [5]:
# Print the first example
first_example = tweets['text'][108]  # Tweet at index 108 of the 'text' column
print(first_example)

@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?


In [6]:
# Check if all cased characters are in lowercase
print(first_example.islower())
print(f"{'=' * 50}")  # Print a separator line of 50 "=" signs

# Convert it to lowercase
print(first_example.lower())
print(f"{'=' * 50}")  # Print another separator line

# Convert it to uppercase
print(first_example.upper())

False
==================================================
@virginamerica i was scheduled for sfo 2 dal flight 714 today. changed to 24th due weather. looks like flight still on?
==================================================
@VIRGINAMERICA I WAS SCHEDULED FOR SFO 2 DAL FLIGHT 714 TODAY. CHANGED TO 24TH DUE WEATHER. LOOKS LIKE FLIGHT STILL ON?


### Remove Extra Whitespace Characters

Sometimes we might come across texts with extraneous whitespace, such as spaces, tabs, and newline characters, which is particularly common when the text is scraped from web pages. Before we dive into the details, let's briefly introduce Regular Expressions (regex) and the `re` package. 

Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but they can be very efficient once we get a handle on them. Many NLP packages rely heavily on regex under the hood. Regex testers, such as [regex101](https://regex101.com), are useful tools for both understanding and writing regular expressions.

Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!

The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\n`) that we want to remove.

In [7]:
# File path to the poem
text_path = '../data/poem_wordsworth.txt'

# Read the poem in; the file is assumed to be UTF-8 encoded, and stating this
# explicitly avoids garbled characters on platforms with a different default
with open(text_path, 'r', encoding='utf-8') as file:
    text = file.read()  # Read the whole file into a single string
# No explicit close is needed: the with block closes the file for us

As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, making it difficult to read. 

In [None]:
text  # Display the full contents of the file as a single string

"I wandered lonely as a cloud\n\n\nI wandered lonely as a cloud\nThat floats on high o'er vales and hills,\nWhen all at once I saw a crowd,\nA host, of golden daffodils;\nBeside the lake, beneath the trees,\nFluttering and dancing in the breeze.\n\nContinuous as the stars that shine\nAnd twinkle on the milky way,\nThey stretched in never-ending line\nAlong the margin of a bay:\nTen thousand saw I at a glance,\nTossing their heads in sprightly dance.\n\nThe waves beside them danced; but they\nOut-did the sparkling waves in glee:\nA poet could not but be gay,\nIn such a jocund company:\nI gazed—and gazed—but little thought\nWhat wealth the show to me had brought:\n\nFor oft, when on my couch I lie\nIn vacant or in pensive mood,\nThey flash upon that inward eye\nWhich is the bliss of solitude;\nAnd then my heart with pleasure fills,\nAnd dances with the daffodils."

One handy function we can use to display the poem properly is `.splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character.   

In [10]:
# Split the single string into a list of lines at each line break
text.splitlines()


['I wandered lonely as a cloud',
 '',
 '',
 'I wandered lonely as a cloud',
 "That floats on high o'er vales and hills,",
 'When all at once I saw a crowd,',
 'A host, of golden daffodils;',
 'Beside the lake, beneath the trees,',
 'Fluttering and dancing in the breeze.',
 '',
 'Continuous as the stars that shine',
 'And twinkle on the milky way,',
 'They stretched in never-ending line',
 'Along the margin of a bay:',
 'Ten thousand saw I at a glance,',
 'Tossing their heads in sprightly dance.',
 '',
 'The waves beside them danced; but they',
 'Out-did the sparkling waves in glee:',
 'A poet could not but be gay,',
 'In such a jocund company:',
 'I gazed—and gazed—but little thought',
 'What wealth the show to me had brought:',
 '',
 'For oft, when on my couch I lie',
 'In vacant or in pensive mood,',
 'They flash upon that inward eye',
 'Which is the bliss of solitude;',
 'And then my heart with pleasure fills,',
 'And dances with the daffodils.']
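
The list above still contains empty strings where the poem had blank lines. As a small sketch of one way to drop them, the example below filters out lines that contain nothing but whitespace. The sample string is a hypothetical stand-in for the poem.

```python
# Hypothetical short text with blank lines, standing in for the poem
sample = "First line\n\n\nSecond line\nThird line\n\nFourth line"

# Keep only lines that contain something other than whitespace
non_empty_lines = [line for line in sample.splitlines() if line.strip()]

# Print the remaining lines, one per row
print('\n'.join(non_empty_lines))
```

Whether blank lines should be dropped depends on the task: in a poem they mark stanza breaks, which may be worth keeping.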

Let's return to our tweet data for an example.

In [11]:
# Print the second example
second_example = tweets['text'][5]  # Tweet at index 5 of the 'text' column
second_example  # Display the raw string, including escape sequences

"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA"

In this case, we don't really want to split the tweet into a list of strings. We still expect a single string of text but would like to remove the line break completely from the string.

The string method `.strip()` effectively does the job of stripping away spaces at both ends of the text. However, it won't work in our example as the newline character is in the middle of the string.

In [13]:
# .strip() only removes whitespace at both ends of the string
second_example.strip()

"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\r\nit's really the only bad thing about flying VA"

This is where regex could be really helpful.

In [1]:
import re  # Python's built-in module for regular expressions


With regex, we match a pattern that we have identified in the text data and then operate on the matched part: extract it, replace it with something else, or remove it completely. The workflow can be unpacked into the following steps:

- Identify and write the pattern in regex (`r'PATTERN'`)
- Write the replacement for the pattern (`'REPLACEMENT'`)
- Call the specific regex function (e.g., `re.sub()`)
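
As a rough illustration of the three operations mentioned above (extract, replace, remove), here is a sketch on a made-up sentence. The digit pattern `\d+` is chosen only for this example.

```python
import re

# Hypothetical example string
sample = "Flight 714 was moved to gate 12."

# Pattern: one or more digits
digit_pattern = r'\d+'

# Extract every match
extracted = re.findall(digit_pattern, sample)

# Replace every match with a placeholder
replaced = re.sub(digit_pattern, 'NUM', sample)

# Remove every match by replacing it with an empty string
removed = re.sub(digit_pattern, '', sample)

print(extracted)  # ['714', '12']
print(replaced)   # Flight NUM was moved to gate NUM.
print(removed)    # Flight  was moved to gate .
```

Note that removal leaves the surrounding spaces behind, which is one reason whitespace cleanup often comes after other substitutions.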

In our example, the pattern we are looking for is `\s`, which is the regex shorthand for any whitespace character (`\n` and `\t` included). We also add the quantifier `+` to the end: `\s+`. It means we'd like to capture one or more occurrences of a whitespace character.

In [2]:
# Write a pattern in regex: one or more whitespace characters
blankspace_pattern = r'\s+'


The replacement for one or more whitespace characters is exactly one single whitespace, which is the canonical word boundary in English. Any additional whitespace will be reduced to a single whitespace. 

In [3]:
# Write a replacement for the pattern identified: a single space
blankspace_repl = ' '


Lastly, let's put everything together using the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments—the pattern, the replacement, and the string to which we want to apply the function.

In [7]:
# Replace whitespace(s) with ' '
clean_text = re.sub(
    pattern = blankspace_pattern,  # One or more whitespace characters
    repl = blankspace_repl,        # A single space
    string = second_example)       # The tweet itself (the variable, not a quoted string)
print(clean_text)  # Print the tweet with normalized whitespace


@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA


Ta-da! The newline character is no longer there.
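
For this particular task, there is also a regex-free alternative worth knowing: calling `.split()` with no arguments and joining the pieces back together. The sketch below uses a made-up string.

```python
# Hypothetical string with a Windows line break, repeated spaces, and a tab
messy = "seats that didn't have this playing.\r\nit's   really\tbad"

# split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace; join() then puts single spaces back
tidy = ' '.join(messy.split())

print(tidy)  # seats that didn't have this playing. it's really bad
```

One difference from the `re.sub()` approach: this version also trims whitespace at both ends of the string.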

### Remove Punctuation Marks

Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks. 

The `string` module contains a list of predefined punctuation marks. Let's print them out.

In [8]:
# Load in the predefined punctuation marks from the standard library
from string import punctuation
print(punctuation)  # Show all the punctuation characters


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In practice, to remove these punctuation characters, we can simply iterate over the text and keep only the characters not found in the list, as shown below in the `remove_punct` function.

In [None]:
def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []  # Collects the characters that are NOT punctuation
    for char in text:  # Iterate over each character in the text
        if char not in punctuation:  # Keep the character if it is not punctuation
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)
    
    return text_no_punct  # Return the text without punctuation marks
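
The loop above spells out the logic step by step. As a sketch of a more compact alternative, the built-in string methods `str.maketrans()` and `.translate()` can do the same job. The function name `remove_punct_translate` is made up for this example.

```python
from string import punctuation

# Build a translation table that maps every punctuation character to None
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    """Remove punctuation marks using str.translate (hypothetical helper)."""
    return text.translate(punct_table)

print(remove_punct_translate("Hello, world! It's 9:30."))  # Hello world Its 930
```

Both versions only remove the ASCII characters listed in `punctuation`; other symbols, such as curly quotes, are left untouched.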

Let's apply the function to the example below. 

In [None]:
# Print the third example
third_example = tweets['text'][20]  # Tweet at index 20 of the 'text' column
print(third_example)                # Print the original tweet
print(f"{'=' * 50}")                # Print a separator line of 50 "=" signs

# Apply the function 
clean_third_example = remove_punct(third_example)  # Remove punctuation from the tweet
print(clean_third_example)                         # Print the tweet without punctuation marks

Let's give it a try with another tweet. What have you noticed?

In [None]:
# Print another tweet
print(tweets['text'][100])  # Tweet at index 100 of the 'text' column
print(f"{'=' * 50}")        # Print a separator line of 50 "=" signs
# Apply the function
clean_example_100 = remove_punct(tweets['text'][100])  # Remove punctuation from the tweet
print(clean_example_100)                               # Print the tweet without punctuation marks


What about the following example?

In [None]:
# A text with contractions, punctuation, a hashtag, and a mention
contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# Apply the function
clean_contraction_text = remove_punct(contraction_text)  # Remove all punctuation marks
print(clean_contraction_text)                            # Print the text without punctuation

Weve got quite a bit of punctuation here dont we Python DLab


⚠️ **Warning:** In many cases, we want to remove punctuation marks **after** tokenization, which we will discuss in a minute. This tells us that the **order** of preprocessing is a matter of importance!
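
To see why order matters, consider the sketch below. It runs the same two operations in both orders on a made-up tweet. The URL pattern is a deliberately simplified assumption for illustration, not a complete URL matcher.

```python
import re
from string import punctuation

# Hypothetical tweet; the simplified URL pattern is an assumption
sample = "Great flight! https://t.co/abc123"
simple_url_pattern = r'https?://\S+'
punct_table = str.maketrans('', '', punctuation)

# Order A: strip punctuation first, then look for URLs
stripped_first = sample.translate(punct_table)
order_a = re.sub(simple_url_pattern, 'URL', stripped_first)

# Order B: replace URLs first, then strip punctuation
url_first = re.sub(simple_url_pattern, 'URL', sample)
order_b = url_first.translate(punct_table)

print(order_a)  # Great flight httpstcoabc123
print(order_b)  # Great flight URL
```

In order A, the punctuation that made the URL recognizable is already gone by the time we search for it, so the pattern never matches.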

## 🥊 Challenge 1: Preprocessing with Multiple Steps

So far we've learned a few preprocessing operations. Let's put them together in a function! This function will be handy to refer to if you happen to work with some messy English text data and want to preprocess it in a single step. 

The example text data for challenge 1 is shown below. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to reuse the code we've written above!

In [None]:
challenge1_path = '../data/example1.txt'  # Path to the text file we want to read

with open(challenge1_path, 'r') as file:  # Open the file in read mode ('r')
    challenge1 = file.read()  # Read the whole file into the variable 'challenge1'
    
print(challenge1)  # Print the contents of the file

In [None]:
def clean_text(text):
    '''Lowercase the text, remove punctuation, and normalize whitespace'''

    # Step 1: Lowercase
    text = ...

    # Step 2: Use remove_punct to remove punctuation marks
    text = ...

    # Step 3: Remove extra whitespace characters
    text = ...

    return text 

In [None]:
# Apply the above function to the challenge 1 text once it is complete
clean_text(challenge1)

## Task-specific Processes

Now that we understand common preprocessing operations, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.

For example, if we are working with financial documents, we might want to standardize monetary symbols by converting them to digits. In our tweets data, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify the subsequent analysis.

### 🎬 **Demo**: Remove Hashtags and URLs 

Although URLs, hashtags, and numbers are informative in their own right, oftentimes we don't necessarily care about the exact meaning of each of them. 

While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. In practice, we replace individual URLs and hashtags with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL" and "HASHTAG."

Since these types of text often follow a regular structure, they're an apt case for using regular expressions. Let's apply these patterns to the tweets data.

In [None]:
# Print the example tweet 
url_tweet = tweets['text'][13]  # Tweet at index 13 of the 'text' column
print(url_tweet)

In [None]:
# URL 
# Regex pattern that detects URLs in the text
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
url_repl = ' URL '  # Placeholder that replaces each URL
re.sub(url_pattern, url_repl, url_tweet)  # Replace every URL in url_tweet with ' URL '

In [None]:
# Hashtag
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'  # Regex pattern that detects hashtags in the text
hashtag_repl = ' HASHTAG '                  # Placeholder that replaces each hashtag
re.sub(hashtag_pattern, hashtag_repl, url_tweet)  # Replace every hashtag and show the result
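
The two replacements can also be chained in one helper. The sketch below is one possible way to do it. The function name `replace_urls_hashtags` and the two simplified patterns are assumptions made for this example, and they are less thorough than the patterns used in the demo.

```python
import re

# Simplified patterns, assumed for illustration only
simple_url_pattern = r'https?://\S+'
simple_hashtag_pattern = r'#\w+'

def replace_urls_hashtags(text):
    """Replace URLs and hashtags with placeholders (hypothetical helper)."""
    # URLs go first, since a URL may itself contain a '#' fragment
    text = re.sub(simple_url_pattern, ' URL ', text)
    text = re.sub(simple_hashtag_pattern, ' HASHTAG ', text)
    # Collapse the extra spaces introduced by the padded placeholders
    return re.sub(r'\s+', ' ', text).strip()

print(replace_urls_hashtags("@united thanks! #travel https://t.co/abc"))
# @united thanks! HASHTAG URL
```

Padding the placeholders with spaces keeps them from fusing with neighboring words; the final whitespace cleanup then removes any doubled spaces.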

<a id='section2'></a>

# Tokenization

## Tokenizers Before LLMs

One of the most important steps in text analysis is tokenization. This is the process of breaking a long sequence of text into word tokens. With these tokens available, we are ready to perform word-level analysis. For instance, we can filter out tokens that don't contribute to the core meaning of the text.
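
Before turning to dedicated packages, it may help to see what a very naive tokenizer looks like. The sketch below compares a plain whitespace split with a simple regex on a made-up sentence. Neither is what `nltk` or `spaCy` do internally; they only illustrate the idea of deciding where tokens begin and end.

```python
import re

# Hypothetical sentence
sample = "Really missed a prime opportunity, there."

# Splitting on whitespace leaves punctuation attached to words
whitespace_tokens = sample.split()

# A simple regex: runs of word characters, or single non-space symbols
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
# ['Really', 'missed', 'a', 'prime', 'opportunity,', 'there.']
print(regex_tokens)
# ['Really', 'missed', 'a', 'prime', 'opportunity', ',', 'there', '.']
```

The regex version separates punctuation into its own tokens, but it would also break up contractions, handles, and URLs, which is where purpose-built tokenizers come in.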

In this section, we'll introduce how to perform tokenization using `nltk`, `spaCy`, and a Large Language Model (`bert`). The purpose is to expose you to different NLP packages, help you understand their functionalities, and demonstrate how to access key functions in each package.

### `nltk`

The first package we'll be using is called **Natural Language Toolkit**, or `nltk`. 

Let's download a couple of resources that the package needs.

In [None]:
import nltk  # Import the Natural Language Toolkit (NLTK)

In [None]:
# Download the NLTK resources used in this notebook
nltk.download('wordnet')    # WordNet dictionary, used for lemmatization
nltk.download('stopwords')  # Stop word lists
nltk.download('punkt')      # Tokenizer models
nltk.download('punkt_tab')  # Assumption: recent NLTK releases (3.9+) load tokenizer data from 'punkt_tab'

`nltk` has a function called `word_tokenize`. It requires one argument, which is the text to be tokenized, and it returns a list of tokens for us.

In [None]:
# Load word_tokenize, which splits text into tokens (words and punctuation)
from nltk.tokenize import word_tokenize

# Print the example
text = tweets['text'][7]  # Tweet at index 7 of the 'text' column
print(text)               # Print the raw tweet before tokenization

@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP


In [None]:
# Apply the NLTK tokenizer
nltk_tokens = word_tokenize(text)  # Split the tweet into a list of tokens (words, punctuation, etc.)
nltk_tokens                        # Display the list of tokens

Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect them! 

🔔 **Question**: Do the word boundaries decided by `nltk` make sense to you? Pay attention to the Twitter handle and the URL in the example tweet. 

You may feel that accessing functions in `nltk` is pretty straightforward. The function we used above was imported from the `nltk.tokenize` module, which as the name suggests, primarily does the job of tokenization. 

Under the hood, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that fulfill different purposes. To name a few:

| NLTK module   | Function                  | Link                                                         |
|---------------|---------------------------|--------------------------------------------------------------|
| nltk.tokenize | Tokenization              | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |
| nltk.corpus   | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/)             |
| nltk.tag      | Part-of-speech tagging    | [Documentation](https://www.nltk.org/api/nltk.tag.html)      |
| nltk.stem     | Stemming                  | [Documentation](https://www.nltk.org/api/nltk.stem.html)     |
| ...           | ...                       | ...                                                          |

Let's import `stopwords` from the `nltk.corpus` module, which hosts a range of built-in corpora. 

In [43]:
# Load predefined stop words from nltk
from nltk.corpus import stopwords

Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies. 

In [None]:
# Print the first 10 stopwords
stop = stopwords.words('english')  # Load the list of English stop words from NLTK
stop[:10]                          # Show the first 10 stop words in the list
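
Once we have a stop word list, filtering tokens against it is a one-liner. The sketch below uses a tiny made-up stop list and a hand-written token list so that it runs on its own; in practice you would use the `stop` list loaded above. Converting the list to a `set` is optional, but it makes membership checks faster on large texts.

```python
# Hypothetical mini stop list; in practice, use set(stopwords.words('english'))
mini_stop = {'a', 'the', 'is', 'to', 'for'}

# Hypothetical, already tokenized text
tokens = ['The', 'flight', 'to', 'Dallas', 'is', 'delayed']

# Compare in lowercase so that 'The' matches 'the'
content_tokens = [token for token in tokens if token.lower() not in mini_stop]

print(content_tokens)  # ['flight', 'Dallas', 'delayed']
```

The lowercase comparison matters because the entries in the stop list are lowercase, while tokens at the start of a sentence are usually capitalized.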


### `spaCy`
Other than `nltk`, we have another widely-used package called `spaCy`. 

`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, as well as [other text analysis components](https://spacy.io/usage/processing-pipelines#custom-components) requested by the user. These components are pretty similar to modules in `nltk`. 

<img src='../images/spacy.png' alt="spacy pipeline" width="700">

Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. The name suggests that it is a lightweight model trained on some text data (e.g., blogs); see model descriptions [here](https://spacy.io/models/en#en_core_web_sm).

This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the model has been trained on a vast amount of data. As a result, it comes with a certain "knowledge" of word structure and grammar of the language.

Therefore, when we apply the model to our own data, we can expect it to be reasonably accurate in performing various annotation tasks, e.g., tagging a word's part of speech or identifying the syntactic head of a phrase. 

Let's dive in! We'll first need to load the pretrained language model we installed earlier.

In [45]:
import spacy  # Import the spaCy library for natural language processing (NLP)
nlp = spacy.load('en_core_web_sm')  # Load the small English model

The `nlp` pipeline, by default, includes a set of components, which we can access via the `.pipe_names` attribute. 

You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes.

In [47]:
# Retrieve the names of the components included in the NLP pipeline
nlp.pipe_names


['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`.

In [None]:
# Apply the pipeline to the example tweet
doc = nlp(tweets['text'][7])  # Run the spaCy pipeline on the tweet at index 7

Under the hood, the `doc` object contains the tokens (created by the tokenizer) and their annotations (created by other components), which are [linguistic features](https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve the token and its annotations by accessing corresponding attributes. 

| Attribute      | Annotation                              | Link                                                                      |
|----------------|-----------------------------------------|---------------------------------------------------------------------------|
| token.text     | The verbatim text of the token          | [Documentation](https://spacy.io/api/token#attributes)                    |
| token.is_stop  | Whether the token is a stop word        | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.lemma_   | The base form of the token              | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |
| token.pos_     | The simple POS-tag of the token         | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging)   |
| ...            | ...                                     | ...                                                                       |

Let's first get the tokens themselves! We'll iterate over the `doc` object and retrieve the text of each token. 

In [None]:
# Get the verbatim texts of tokens
spacy_tokens = [token.text for token in doc]  # Text of each token in the spaCy Doc object

spacy_tokens  # Display the list of tokens

In [None]:
# Get the NLTK tokens
nltk_tokens  # The list of tokens produced earlier by NLTK's word_tokenize

['@',
 'VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

🔔 **Question**: Let's pause for a minute to compare the tokens generated by `nltk` and `spaCy`. What have you noticed?

Remember that we can also access various annotations of these tokens. For instance, `spaCy` conveniently encodes whether a token is a stop word. 

In [None]:
# Retrieve the is_stop annotation
spacy_stops = [token.is_stop for token in doc]

# The results are boolean values
spacy_stops
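
A boolean list like `spacy_stops` is typically used as a mask over the token list. The sketch below shows the idea in plain Python. The `tokens` and `is_stop` values are made-up stand-ins for illustration, not the actual output for the `doc` above.

```python
from itertools import compress

# Hypothetical tokens and stop-word flags (illustrative values only,
# not the actual spaCy output for the tweet above)
tokens = ["really", "missed", "a", "prime", "opportunity", "there"]
is_stop = [True, False, True, False, False, True]

# Keep a token only when its flag is False
kept = [tok for tok, stop in zip(tokens, is_stop) if not stop]

# Equivalent formulation: compress() keeps items whose selector is truthy,
# so the flags are inverted first
kept_alt = list(compress(tokens, [not stop for stop in is_stop]))

print(kept)
```

Either formulation works; the list comprehension is usually easier to read, and it is the same pattern that filtering on `token.is_stop` directly would follow.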

## 🥊 Challenge 2: Remove Stop Words

We now know how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package. 

Let's write **two** functions to remove stop words from our text data. 

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input
 
A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!

In [55]:
def remove_stopword_nltk(raw_text, stopword):
    """
    Remove stop words from a text using NLTK.
    """
    # Step 1: Tokenization with nltk
    # YOUR CODE HERE 
    tokens = word_tokenize(raw_text)  # Split the text into words and punctuation marks
    
    # Step 2: Filter out tokens in the stop word list
    # YOUR CODE HERE
    clean_tokens = [token for token in tokens if token.lower() not in stopword]  # Compare in lowercase so capitalized stop words are removed too

    return clean_tokens


In [57]:
def remove_stopword_spacy(raw_text):
    """
    Remove stop words from a text using spaCy.
    """
    # Step 1: Apply the nlp pipeline
    # YOUR CODE HERE
    doc = nlp(raw_text)  # Build a Doc object holding tokens and their linguistic annotations
    
    # Step 2: Filter out tokens that are stop words
    # YOUR CODE HERE
    clean_tokens = [token.text for token in doc if not token.is_stop and not token.is_space]  # Also skip whitespace-only tokens

    return clean_tokens



In [None]:
remove_stopword_nltk(text, stop)  # Remove stop words from `text` using the `stop` list

In [None]:
remove_stopword_spacy(text)  # Remove stop words from `text` using spaCy

## 🎬 **Demo**: Powerful Features from `spaCy`

`spaCy`'s nlp pipeline includes a number of linguistic annotations, which could be very useful for text analysis. 

For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like a URL.

In [None]:
# Print tokens and their annotations
# `!s` formats like_url as the text True/False before padding
for token in doc:
    print(f"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url!s:<12} |")

As you can imagine, it is typical for this dataset to contain place names and airport codes. It would be cool if we were able to identify them and extract them from tweets. 

In [None]:
# Print example tweets with place names and airport codes
tweet_city = tweets['text'][8273]  # Tweet containing city names
tweet_airport = tweets['text'][502]  # Tweet containing airport codes
print(tweet_city)
print(f"{'=' * 50}")  # Separator line
print(tweet_airport)


We can use the "ner" ([Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) component to identify entities and their categories.

In [62]:
# Spanish example sentence (overrides the tweet selected above)
tweet_city = "Voy a visitar Nueva York y París el próximo verano."
# Print entities identified from the text
doc_city = nlp(tweet_city)  # Run the spaCy pipeline on the example sentence
for ent in doc_city.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")

Nueva York      | 14         | 24         | GPE       
París el próximo verano | 27         | 50         | GPE       
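
The two numeric columns are character offsets into the original string. As a quick check, the sketch below slices the sentence with the offsets copied from the output above; no spaCy call is involved, so the `spans` list is simply retyped by hand.

```python
# The sentence and offsets below are copied from the cell and output above
text = "Voy a visitar Nueva York y París el próximo verano."
spans = [(14, 24, "GPE"), (27, 50, "GPE")]

# start_char is inclusive and end_char is exclusive, like a Python slice
recovered = [(text[start:end], label) for start, end, label in spans]

for span_text, label in recovered:
    print(f"{span_text!r} -> {label}")
```

Note that the second span also swallows "el próximo verano", which is not a place name. One plausible reason, assuming `nlp` was loaded from the English `en_core_web_sm` model installed at the top of the notebook, is that the model was not trained on Spanish text.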


We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category. 

In the following example, the highlighted entities are labeled `GPE` (i.e., geopolitical entities, usually countries and cities). 

In [None]:
import spacy
from spacy import displacy

# Visualize the identified entities inline in the notebook
displacy.render(doc_city, style='ent', jupyter=True)

Let's give it a try with another example.

In [None]:
# Print entities identified from the text
doc_airport = nlp(tweet_airport)  # Run the spaCy pipeline on the tweet
for ent in doc_airport.ents:
    # Columns: entity text, start offset, end offset, entity label
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")


Interesting that airport codes are identified as `ORG`—organizations, and the tweet handle as `CARDINAL`.

In [None]:
# Visualize the identified entities inline in the notebook
displacy.render(doc_airport, style='ent', jupyter=True)



## Tokenizers Since LLMs

So far, we've seen what tokenization looks like with two widely used NLP packages. They work quite well in some settings, but not in others. Recall that `nltk` struggles with URLs. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, etc. (collectively called "out of vocabulary" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.

In fact, tokenization schemes change substantially with **Large Language Models** (LLMs), which are models trained on an enormous amount of data from mixed sources. With that magnitude of data, LLMs are better at chunking a longer sequence into tokens and tokens into **subtokens**. These subtokens can be morphological units of a word, such as an affix, but they can also be parts of a word where the model sets a "meaningful" boundary. 

In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6). 

We will load BERT's tokenizer from the `transformers` package, which hosts a number of Transformer-based LLMs (e.g., BERT). We won't go into the Transformer architecture in this workshop, but feel free to check out the D-Lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!

### WordPiece Tokenization

Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model has a moderate size (referred to as `base`) and is case-insensitive, meaning the input text will be lowercased by default.

In [None]:
# Load the BERT tokenizer from Hugging Face transformers
from transformers import BertTokenizer

# Initialize the pretrained 'bert-base-uncased' tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The tokenizer has multiple functions, as we will see in a minute. Now we want to access the `.tokenize()` function from the tokenizer. 

Let's tokenize an example tweet below. What have you noticed?

In [None]:
# Select an example tweet from dataframe
text = tweets['text'][194]
print(f"Text: {text}")
print(f"{'=' * 50}")  # Separator line

# Apply tokenizer
tokens = tokenizer.tokenize(text)  # Split the text into WordPiece tokens
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

The double hash symbols (`##`) mark a subword token: a piece that continues the preceding token rather than starting a new word.
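
To make the `##` convention concrete, here is a minimal sketch that glues continuation pieces back onto the preceding token. `merge_wordpieces` is a hypothetical helper written for this illustration, and the input list is typed by hand rather than produced by the tokenizer.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed pieces back onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word
            words[-1] += token[2:]
        else:
            # A token without the marker starts a new word
            words.append(token)
    return words

print(merge_wordpieces(["hug", "##ga", "##ble", "is", "a", "word"]))
# → ['huggable', 'is', 'a', 'word']
```

The `transformers` tokenizers also provide a `convert_tokens_to_string` method that serves a similar purpose; the helper above only exists to show what the marker means.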

🔔 **Question**: Do these subwords make sense to you? 

One significant development with LLMs is that each token is assigned an ID from its vocabulary. Our computer does not understand text in its raw form, so each token is translated into an ID. These IDs are the inputs that the model accesses and operates on.

Tokens and IDs can be converted bidirectionally, for example:

In [6]:
# Get the input ID of the word 
print(f"ID of just is: {tokenizer.vocab['just']}")  # Look up 'just' in BERT's vocabulary

# Get the text of the input ID
print(f"Token 2074 is: {tokenizer.decode([2074])}")  # Decode ID 2074 back into text

ID of just is: 2074
Token 2074 is: just


Let's convert tokens to input IDs.

In [15]:
# Convert a list of tokens to a list of input IDs
tokens = ['[CLS]', 'Hola', 'Daniela', '2']
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # Look up each token's ID in BERT's vocabulary
print(f"Number of input IDs: {len(input_ids)}")
print(f"Input IDs of text: {input_ids}")

Number of input IDs: 4
Input IDs of text: [101, 100, 100, 1016]
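
The two 100s in this output are the ID of the `[UNK]` (unknown) token in `bert-base-uncased`. `convert_tokens_to_ids` looks up each string exactly as given, without lowercasing or subword splitting, and the capitalized strings `'Hola'` and `'Daniela'` are not entries in the lowercase vocabulary. The sketch below mimics that lookup with a tiny stand-in dictionary; `toy_vocab` and `tokens_to_ids` are hypothetical names, and only the IDs that appear in the outputs above are included.

```python
# Tiny stand-in vocabulary; the IDs are the ones printed in the outputs above
# ([UNK] = 100 is the unknown-token ID in bert-base-uncased)
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "2": 1016, "just": 2074}

# Reverse mapping for ID -> token lookups
toy_ids_to_tokens = {idx: tok for tok, idx in toy_vocab.items()}

def tokens_to_ids(tokens, vocab=toy_vocab):
    """Exact-match lookup; anything missing falls back to the [UNK] ID."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]

ids = tokens_to_ids(["[CLS]", "Hola", "Daniela", "2"])
print(ids)                                   # → [101, 100, 100, 1016]
print([toy_ids_to_tokens[i] for i in ids])   # → ['[CLS]', '[UNK]', '[UNK]', '2']
```

Converting back from IDs shows that the mapping is lossy once a token has been replaced by `[UNK]`: the original strings cannot be recovered.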


### Special Tokens

In addition to the tokens and subtokens discussed above, BERT also makes use of special tokens, three of which we cover here: `SEP`, `CLS`, and `UNK`. The `SEP` token acts as a sentence terminator, similar to what is commonly known as an `EOS` (end-of-sequence) token. The `UNK` token represents any token that is not found in the vocabulary, hence "unknown" tokens. The `CLS` token is added to the beginning of the sentence. It originates from text classification tasks (e.g., spam detection), where researchers found it useful to have a token that aggregates the information of the entire sentence for classification purposes.

When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps: 
- Tokenize the text
- Add special tokens
- Convert tokens to input IDs
- Other model-specific processes
  
Let's print them out.

In [17]:
# Get the input IDs by providing the key 
text = "Hola mundo"
input_ids_from_tokenizer = tokenizer(text)['input_ids']  # Encode the text and keep only the input IDs
print(f"Number of input IDs: {len(input_ids_from_tokenizer)}")
print(f"IDs from tokenizer: {input_ids_from_tokenizer}")

Number of input IDs: 5
IDs from tokenizer: [101, 7570, 2721, 25989, 102]


It looks like two extra IDs were added: 101 at the start and 102 at the end. 

Let's convert them to texts!

In [18]:
# Convert input IDs to texts
print(f"The 101st token: {tokenizer.convert_ids_to_tokens(101)}")  # Token for ID 101 in BERT's vocabulary
print(f"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}")  # Token for ID 102 in BERT's vocabulary


The 101st token: [CLS]
The 102nd token: [SEP]


As you can see, our text example is now a list of vocabulary IDs. In addition, BERT adds the beginning token `CLS` and the sentence terminator `SEP` to the original text. BERT's tokenizer encodes every input text in the same way, after which the IDs are ready for further processing.
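
The encoding steps can be sketched end to end in plain Python. This is a simplified illustration, not BERT's implementation: `toy_encode` is a hypothetical helper, lowercasing plus whitespace splitting stands in for WordPiece, and the IDs for "hello" and "world" are placeholders rather than real `bert-base-uncased` IDs. Only the special-token IDs (100, 101, 102) match the outputs above.

```python
def toy_encode(text, vocab):
    """Mimic the encode steps with a whitespace tokenizer and a toy vocabulary."""
    # Step 1: tokenize (lowercase + whitespace split stands in for WordPiece)
    tokens = text.lower().split()
    # Step 2: add the special tokens around the sequence
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: convert tokens to IDs, falling back to [UNK] for unknown tokens
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return tokens, ids

# Special-token IDs match the outputs above; the IDs for "hello" and "world"
# are hypothetical placeholders
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "hello": 5001, "world": 5002}

toy_tokens, toy_ids = toy_encode("Hello world again", toy_vocab)
print(toy_tokens)  # → ['[CLS]', 'hello', 'world', 'again', '[SEP]']
print(toy_ids)     # → [101, 5001, 5002, 100, 102]
```

The real tokenizer differs mainly in step 1, where a word missing from the vocabulary is first split into subwords and only becomes `[UNK]` if no split is possible.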

## 🥊 Challenge 3: Find the Word Boundary

Now we know that tokenization in BERT often returns subwords. Let's try a few more examples. 

- What do you think is the correct boundary for splitting the following words into subwords?
- What other examples have you tested?

In [None]:
def get_tokens(string):
    '''Tokenize the input string with BERT and print the tokens'''
    tokens = tokenizer.tokenize(string)  # Split the input into WordPiece tokens

    print(tokens)


In [21]:
# Abbreviations
get_tokens('dlab')

# OOV
get_tokens('covid')

# Suffix
get_tokens('huggable')

# Digits
get_tokens('378')

# YOUR EXAMPLE
get_tokens('airplane')

['dl', '##ab']
['co', '##vid']
['hug', '##ga', '##ble']
['37', '##8']
['airplane']
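
How are these boundaries chosen? At tokenization time, WordPiece is commonly described as a greedy longest-match-first search: starting from the left, take the longest piece found in the vocabulary, then repeat on the remainder. The sketch below implements that idea under stated assumptions: `wordpiece_tokenize` is a hypothetical function, and `toy_vocab` is a hand-picked mini-vocabulary chosen so that the splits mirror the outputs above. BERT's real vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary (illustrative only)
toy_vocab = {"hug", "##ga", "##ble", "##g", "co", "##vid", "airplane", "air", "##plane"}

for word in ["huggable", "covid", "airplane", "zzz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Two details are worth noticing: `airplane` stays whole even though `air` and `##plane` are also available, because the longest match wins, and `zzz` collapses to `[UNK]` because no piece of it is in the vocabulary.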


We will wrap up Part 1 with this (hopefully) thought-provoking challenge. LLMs often come with a much more sophisticated tokenization scheme, but there is ongoing discussion about their limitations in real-world applications. The reference section includes a few blog posts discussing this problem. Feel free to explore further if this sounds like an interesting question to you!

## References

1. A tutorial introducing the tokenization scheme in BERT: [The huggingface NLP course on wordpiece tokenization](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)
2. A specific example of "failure" in tokenization: [Weaknesses of wordpiece tokenization: Findings from the front lines of NLP at VMware.](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99)
3. How does BERT decide boundaries between subtokens: [Subword tokenization in BERT](https://tinkerd.net/blog/machine-learning/bert-tokenization/#subword-tokenization)

<div class="alert alert-success">

## ❗ Key Points

* Preprocessing includes multiple steps; some are common to text data in general, and some are task-specific. 
* Both `nltk` and `spaCy` could be used for tokenization and stop word removal. The latter is more powerful in providing various linguistic annotations. 
* Tokenization works differently in BERT, which often involves breaking down a whole word into subwords. 

</div>