# Python Text Analysis: Preprocessing

* * *

<div class="alert alert-success">  
    
### Learning Objectives
    
* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.
* Know commonly used NLP packages and what they are capable of.
* Understand tokenizers, and how they have changed since the advent of Large Language Models.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Sections
1. [Preprocessing](#section1)
2. [Tokenization](#section2)

In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).

Now, let's have these packages properly installed before diving into the materials.

In [332]:
# Uncomment the following lines to install packages/model
# %pip install NLTK
# %pip install transformers
# %pip install spaCy
# !python -m spacy download en_core_web_sm

<a id='section1'></a>

# Preprocessing

In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.

You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers.

🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data.
- What is the format of the text data you have interacted with (plain text, CSV, or XML)?
- Where does it come from (structured corpus, scraped from the web, survey data)?
- Is it messy (i.e., is the data formatted consistently)?

## Common Processes

Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.

Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.

The following processes, for example, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions.
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters
- Remove stop words

After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features).  

Before we jump into these operations, let's take a look at our data!

### Import the Text Data

The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scraped in Feb 2015.

Let's read the file `airline_tweets.csv` into a dataframe with `pandas`.

In [333]:
# Import pandas
import pandas as pd

# File path to data
csv_path = 'ejercicio1/data/airline_tweets.csv'

# Specify the separator
tweets = pd.read_csv(csv_path, sep=',')
# This snippet imports pandas, defines the path to a CSV file of airline tweets,
# and loads that file into a DataFrame called 'tweets' for later analysis.


In [334]:
!git clone https://github.com/dlab-berkeley/Python-Text-Analysis.git
# This command clones (downloads) the 'Python-Text-Analysis' GitHub repository into the local environment,
# giving you access to all the files and materials it contains.


fatal: destination path 'Python-Text-Analysis' already exists and is not an empty directory.


In [335]:
from google.colab import drive
drive.mount('/content/drive')
# This code imports the 'drive' module from Google Colab and mounts your Google Drive at '/content/drive',
# so you can access your Drive files directly from the Colab environment.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [336]:
# Show the first five rows of the 'tweets' DataFrame
# to get a quick preview of the data loaded from the CSV file.
tweets.head()


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


The dataframe has one row per tweet. The text of each tweet is stored in the `text` column.
- `text` (`str`): the text of the tweet.

Other metadata we are interested in include:
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral," "positive," or "negative."
- `airline` (`str`): the airline that is tweeted about.
- `retweet_count` (`int`): how many times the tweet was retweeted.

Let's take a look at some of the tweets:

In [337]:
# Print the text of the first three tweets (rows 0, 1, and 2) from the 'text' column of 'tweets'
# so we can look directly at the content of the messages.
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])


@VirginAmerica What @dhepburn said.
@VirginAmerica plus you've added commercials to the experience... tacky.
@VirginAmerica I didn't today... Must mean I need to take another trip!


🔔 **Question**: What have you noticed? What are the stylistic features of tweets?

### Lowercasing

While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.

More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.

We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.

Let's apply it to the following example:

In [338]:
# Assign the tweet at index 108 of the 'text' column to 'first_example',
# then print it to see this specific example.
first_example = tweets['text'][108]
print(first_example)


@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?


In [339]:
# Check whether all cased characters in 'first_example' are lowercase (returns True or False).
print(first_example.islower())
print(f"{'=' * 50}")  # Print a divider of 50 '=' characters to separate the results.

# Convert the tweet to lowercase and print it.
print(first_example.lower())
print(f"{'=' * 50}")  # Another divider for clarity.

# Convert the tweet to uppercase and print it.
print(first_example.upper())


False
@virginamerica i was scheduled for sfo 2 dal flight 714 today. changed to 24th due weather. looks like flight still on?
@VIRGINAMERICA I WAS SCHEDULED FOR SFO 2 DAL FLIGHT 714 TODAY. CHANGED TO 24TH DUE WEATHER. LOOKS LIKE FLIGHT STILL ON?
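
To make the point about case-insensitive frequency analysis concrete, here is a minimal sketch using `collections.Counter` from the standard library. The `sample` string is made up for illustration and is not taken from the tweets data.

```python
from collections import Counter

# Made-up sample in which the same word appears with three different casings
sample = "Flight delayed. flight cancelled. FLIGHT rebooked."

# Count whitespace-separated words before and after lowercasing
raw_counts = Counter(sample.split())
lower_counts = Counter(sample.lower().split())

print(raw_counts['flight'])    # counted once: 'Flight' and 'FLIGHT' are separate keys
print(lower_counts['flight'])  # all three casings now share one key
```

Without lowercasing, the three variants are counted as three different words; after lowercasing, they are counted together.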


### Remove Extra Whitespace Characters

Sometimes we might come across texts with extraneous whitespace, such as spaces, tabs, and newline characters, which is particularly common when the text is scraped from web pages. Before we dive into the details, let's briefly introduce Regular Expressions (regex) and the `re` package.

Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but they can be very efficient when we get a handle on them. Many NLP packages heavily rely on regex under the hood. Regex testers, such as [regex101](https://regex101.com), are useful tools in both understanding and creating regular expressions.

Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!

The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\n`) that we want to remove.

In [340]:
# Path to the text file that contains the poem ('poem_wordsworth.txt').
text_path = 'ejercicio1/data/poem_wordsworth.txt'

# Open the file in read mode ('r') and read its entire contents into 'text'.
# The 'with' block closes the file automatically, so no explicit close() is needed.
with open(text_path, 'r') as file:
    text = file.read()

As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, making it difficult to read.

In [341]:
text

"I wandered lonely as a cloud\n\n\nI wandered lonely as a cloud\nThat floats on high o'er vales and hills,\nWhen all at once I saw a crowd,\nA host, of golden daffodils;\nBeside the lake, beneath the trees,\nFluttering and dancing in the breeze.\n\nContinuous as the stars that shine\nAnd twinkle on the milky way,\nThey stretched in never-ending line\nAlong the margin of a bay:\nTen thousand saw I at a glance,\nTossing their heads in sprightly dance.\n\nThe waves beside them danced; but they\nOut-did the sparkling waves in glee:\nA poet could not but be gay,\nIn such a jocund company:\nI gazed—and gazed—but little thought\nWhat wealth the show to me had brought:\n\nFor oft, when on my couch I lie\nIn vacant or in pensive mood,\nThey flash upon that inward eye\nWhich is the bliss of solitude;\nAnd then my heart with pleasure fills,\nAnd dances with the daffodils."

One handy function we can use to display the poem properly is `.splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character.   

In [342]:
# Split the full string stored in 'text' into a list of lines,
# breaking the original text at each newline character.
text.splitlines()


['I wandered lonely as a cloud',
 '',
 '',
 'I wandered lonely as a cloud',
 "That floats on high o'er vales and hills,",
 'When all at once I saw a crowd,',
 'A host, of golden daffodils;',
 'Beside the lake, beneath the trees,',
 'Fluttering and dancing in the breeze.',
 '',
 'Continuous as the stars that shine',
 'And twinkle on the milky way,',
 'They stretched in never-ending line',
 'Along the margin of a bay:',
 'Ten thousand saw I at a glance,',
 'Tossing their heads in sprightly dance.',
 '',
 'The waves beside them danced; but they',
 'Out-did the sparkling waves in glee:',
 'A poet could not but be gay,',
 'In such a jocund company:',
 'I gazed—and gazed—but little thought',
 'What wealth the show to me had brought:',
 '',
 'For oft, when on my couch I lie',
 'In vacant or in pensive mood,',
 'They flash upon that inward eye',
 'Which is the bliss of solitude;',
 'And then my heart with pleasure fills,',
 'And dances with the daffodils.']
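
The list above still contains empty strings where the poem had blank lines. As a small sketch, the snippet below drops those empty entries and prints the remaining lines one by one. The `stanza` string is a shortened stand-in for the poem, used here so that the example runs on its own.

```python
# Shortened stand-in for the poem, including the extra blank lines
stanza = ("I wandered lonely as a cloud\n\n\n"
          "That floats on high o'er vales and hills,\n"
          "When all at once I saw a crowd,")

# Keep only lines that contain something other than whitespace
lines = [line for line in stanza.splitlines() if line.strip()]

# Print one line at a time so the text reads like a poem again
for line in lines:
    print(line)
```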

Let's return to our tweet data for an example.

In [343]:
# Assign the tweet at index 5 of the 'text' column to 'second_example'
# and display it to see this second example.
second_example = tweets['text'][5]
second_example


"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA"

In this case, we don't really want to split the tweet into a list of strings. We still expect a single string of text but would like to remove the line break completely from the string.

The string method `.strip()` removes whitespace, including newline characters, from both ends of the text. However, it won't work in our example because the newline character is in the middle of the string.

In [344]:
# Remove whitespace at the start and end of 'second_example',
# leaving any whitespace inside the string untouched.
second_example.strip()


"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA"

This is where regex could be really helpful.

In [345]:
# Import Python's 're' module, which lets us search,
# manipulate, and process text with regular expressions.
import re


Now, with regex, we are essentially calling it to match a pattern that we have identified in the text data, and we want to do some operations to the matched part—extract it, replace it with something else, or remove it completely. Therefore, the way regex works could be unpacked into the following steps:

- Identify and write the pattern in regex (`r'PATTERN'`)
- Write the replacement for the pattern (`'REPLACEMENT'`)
- Call the specific regex function (e.g., `re.sub()`)

In our example, the pattern we are looking for is `\s`, which is the regex shorthand for any whitespace character (`\n` and `\t` included). We also add a quantifier `+` to the end: `\s+`. It means we'd like to capture one or more occurrences of the whitespace character.

In [346]:
# Define a regex pattern called 'blankspace_pattern' that matches one or more consecutive whitespace characters.
blankspace_pattern = r'\s+'


The replacement for one or more whitespace characters is exactly one single whitespace, which is the canonical word boundary in English. Any additional whitespace will be reduced to a single whitespace.

In [347]:
# Define the replacement for the whitespace pattern:
# a single space, which will stand in for each run of whitespace.
blankspace_repl = ' '


Lastly, let's put everything together using the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments—the pattern, the replacement, and the string to which we want to apply the function.

In [348]:
# Replace every run of one or more whitespace characters in 'second_example'
# with a single space, using the regex pattern defined above.
# Then print the resulting text with clean, uniform spacing.
clean_text = re.sub(pattern = blankspace_pattern,
                    repl = blankspace_repl,
                    string = second_example)
print(clean_text)


@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA


Ta-da! The newline character is no longer there.
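
The same idea extends to text that also has whitespace at its edges. The sketch below wraps the substitution in a small helper and chains `.strip()` to trim both ends. The name `normalize_whitespace` and the `messy` string are assumptions made up for this illustration.

```python
import re

def normalize_whitespace(text):
    """Collapse each run of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up input with leading/trailing spaces, tabs, and blank lines
messy = "  help\t\tus    with\n\nthis.  "
print(normalize_whitespace(messy))
```

Collapsing first and stripping second means any whitespace left at the edges is at most one space, which `.strip()` then removes.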

### Remove Punctuation Marks

Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks.

The `string` module contains a list of predefined punctuation marks. Let's print them out.

In [349]:
# Import the predefined string of punctuation marks from Python's 'string' module,
# then print it to show all the standard punctuation characters.
from string import punctuation
print(punctuation)


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In practice, to remove these punctuation characters, we can simply iterate over the text and drop any character found in the list, as shown below in the `remove_punct` function.

In [350]:
def remove_punct(text):
    '''Remove punctuation marks in input text'''

    # Iterate over each character and keep only those that are not punctuation marks.
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the remaining characters into a new string without punctuation.
    text_no_punct = ''.join(no_punct)

    # Return the punctuation-free text.
    return text_no_punct
# This function removes every punctuation mark from the given text,
# returning a clean version that is easier to analyze.


Let's apply the function to the example below.

In [351]:
# Assign the tweet at index 20 to 'third_example' and print it,
# followed by a divider to separate the output visually.

third_example = tweets['text'][20]
print(third_example)
print(f"{'=' * 50}")

# Apply 'remove_punct' to strip punctuation from the text and show the cleaned result.
remove_punct(third_example)


@VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select???


'VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select'

Let's give it a try with another tweet. What have you noticed?

In [352]:
# Print the tweet at index 100 and a divider for visual clarity.
print(tweets['text'][100])
print(f"{'=' * 50}")

# Apply 'remove_punct' to strip punctuation from the tweet shown above
# and inspect what happens to the URL.
remove_punct(tweets['text'][100])


@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM


'VirginAmerica trying to add my boy Prince to my ressie SF this Thursday VirginAmerica from LAX httptcoGsB2J3c4gM'

What about the following example?

In [353]:
# Define an example text that contains contractions, punctuation marks, and special symbols.
contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# Apply 'remove_punct' to strip all punctuation from the example text,
# leaving only letters, numbers, and spaces.
remove_punct(contraction_text)


'Weve got quite a bit of punctuation here dont we Python DLab'
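
As an alternative to looping over characters, the standard library offers `str.maketrans` and `str.translate`, which delete a set of characters in one pass. The sketch below is one possible variant; the name `remove_punct_translate` is hypothetical and not part of the workshop code.

```python
from string import punctuation

# Translation table whose third argument lists the characters to delete
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    """Remove punctuation marks using a translation table."""
    return text.translate(punct_table)

sample = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."
print(remove_punct_translate(sample))
```

It gives the same result as the loop-based function, including the merged contractions (`Weve`, `dont`).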

⚠️ **Warning:** In many cases, we want to remove punctuation marks **after** tokenization, which we will discuss in a minute. This tells us that the **order** of preprocessing is a matter of importance!
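
To illustrate why order matters, the sketch below tokenizes first and removes punctuation second. It uses a deliberately simple regex tokenizer, which is an assumption for this example and not the algorithm used by `nltk` or `spaCy`.

```python
import re
from string import punctuation

sample = "We've got quite a bit of punctuation here, don't we?!?"

# Simple tokenizer: a word with an optional apostrophe part, or a single punctuation mark
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)

# Drop tokens made up entirely of punctuation characters
words_only = [tok for tok in tokens if not all(ch in punctuation for ch in tok)]
print(words_only)
```

Because the apostrophes are handled during tokenization, `We've` and `don't` survive intact instead of turning into `Weve` and `dont`.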

## 🥊 Challenge 1: Preprocessing with Multiple Steps

So far we've learned a few preprocessing operations. Let's put them together in a function! This function is handy to refer to when you work with messy English text data and want to preprocess it in a single call.

The example text data for challenge 1 is shown below. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to reuse the code we've written above!

In [354]:
# Path to the text file 'example1.txt'.
challenge1_path = 'ejercicio1/data/example1.txt'

# Open the file in read mode and store its entire contents in 'challenge1'.
with open(challenge1_path, 'r') as file:
    challenge1 = file.read()

# Print the full contents of the file to inspect it.
print(challenge1)




This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches blankspace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.





In [355]:
def clean_text(text):
    """
    Clean a text by applying several transformations:
    1. Convert the text to lowercase.
    2. Remove punctuation marks using the 'remove_punct' function.
    3. Remove extra whitespace, leaving a single space between words.
    """

    # Step 1: Convert the text to lowercase for consistency
    text = text.lower()

    # Step 2: Remove punctuation marks
    text = remove_punct(text)

    # Step 3: Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text


In [356]:
# Uncomment to apply the above function to challenge 1 text
# clean_text(challenge1)

## Task-specific Processes

Now that we understand common preprocessing operations, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.

For example, if we are working with financial documents, we might want to standardize monetary symbols by converting them to digits. In our tweets data, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify the subsequent analysis.

### 🎬 **Demo**: Remove Hashtags and URLs

Although URLs, hashtags, and numbers are informative in their own right, oftentimes we don't necessarily care about the exact meaning of each of them.

While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. In practice, we replace individual URLs and hashtags with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL" and "HASHTAG."

Since these types of text often follow a regular structure, they're an apt case for using regular expressions. Let's apply these patterns to the tweets data.

In [357]:
# Assign the tweet at index 13 to 'url_tweet',
# then print it to see the content of this specific tweet.
url_tweet = tweets['text'][13]
print(url_tweet)


@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn


In [358]:
# Define a regex pattern that detects URLs in a text,
# covering the http, https, and ftp protocols.

url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'

# Define the replacement text that will stand in for each URL: the string ' URL '.
url_repl = ' URL '

# Use re.sub to find every URL in 'url_tweet' and replace it with ' URL '.
re.sub(url_pattern, url_repl, url_tweet)


"@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel  URL "

In [359]:
# Define a regex pattern that detects hashtags in a text.
# A hashtag starts with '#' or its full-width variant '＃', preceded by the start of the string or whitespace.

hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'

# Define the replacement text that will stand in for each hashtag: the string ' HASHTAG '.
hashtag_repl = ' HASHTAG '

# Use re.sub to find every hashtag in 'url_tweet' and replace it with ' HASHTAG '.
re.sub(hashtag_pattern, hashtag_repl, url_tweet)


"@VirginAmerica @virginmedia I'm flying your HASHTAG  HASHTAG  skies again! U take all the HASHTAG  away from travel http://t.co/ahlXHhKiyn"

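The two substitutions can also be chained in one helper. The sketch below uses simplified patterns (`SIMPLE_URL` and `SIMPLE_HASHTAG`, assumptions for illustration that are looser than the patterns in the demo) and a hypothetical function name, `replace_placeholders`. URLs are replaced first because a URL may itself contain a `#`.

```python
import re

# Simplified patterns, assumed for this sketch only
SIMPLE_URL = r'https?://\S+'
SIMPLE_HASHTAG = r'#\w+'

def replace_placeholders(text):
    """Replace URLs first, then hashtags, with placeholder strings."""
    text = re.sub(SIMPLE_URL, 'URL', text)
    text = re.sub(SIMPLE_HASHTAG, 'HASHTAG', text)
    return text

tweet = ("@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! "
         "U take all the #stress away from travel http://t.co/ahlXHhKiyn")
placeholder_tweet = replace_placeholders(tweet)
print(placeholder_tweet)
```

Because these patterns do not consume the surrounding whitespace, the placeholders can be inserted without padding spaces.
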
<a id='section2'></a>

# Tokenization

## Tokenizers Before LLMs

One of the most important steps in text analysis is tokenization. This is the process of breaking a long sequence of text into word tokens. With these tokens available, we are ready to perform word-level analysis. For instance, we can filter out tokens that don't contribute to the core meaning of the text.

In this section, we'll introduce how to perform tokenization using `nltk`, `spaCy`, and a Large Language Model (`bert`). The purpose is to expose you to different NLP packages, help you understand their functionalities, and demonstrate how to access key functions in each package.
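
Before turning to dedicated packages, it helps to see why plain string methods fall short. The sketch below tokenizes a sentence with `str.split()`, which cuts only at whitespace, so punctuation stays attached to the neighboring word. The `sentence` string is adapted from one of the tweets.

```python
sentence = "Really missed a prime opportunity for Men Without Hats parody, there."

# Whitespace tokenization: split wherever there is a run of whitespace
naive_tokens = sentence.split()
print(naive_tokens)

# The last two tokens still carry their punctuation
print(naive_tokens[-2:])
```

A frequency count on these tokens would treat `there.` and `there` as different words, which is one reason to use a proper tokenizer.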

### `nltk`

The first package we'll be using is called **Natural Language Toolkit**, or `nltk`.

Let's install a couple modules from the package.

In [360]:
# Import NLTK (Natural Language Toolkit), which provides tools for processing and analyzing natural language text.
import nltk


In [361]:
# Uncomment the following lines to download essential NLTK resources:
# - 'wordnet': a lexical database of synonyms and relations between words.
# - 'stopwords': lists of common words (such as "the" and "and") that are often removed in text analysis.
# - 'punkt': a model for tokenization, i.e., splitting text into sentences and words.
# nltk.download('wordnet')
# nltk.download('stopwords')
# nltk.download('punkt')


`nltk` has a function called `word_tokenize`. It requires one argument, which is the text to be tokenized, and it returns a list of tokens for us.

In [362]:
# Import NLTK's 'word_tokenize' function, which splits a text into word tokens.
from nltk.tokenize import word_tokenize

# Assign the tweet at index 7 to 'text' and print it.
text = tweets['text'][7]
print(text)


@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP


In [363]:
import nltk
# Download the 'punkt_tab' resource, which recent NLTK releases need for tokenization
# (older releases used 'punkt' instead).
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [364]:
# Apply NLTK's 'word_tokenize' to split the text stored in 'text'
# into a list of tokens (words and punctuation marks as separate items).
nltk_tokens = word_tokenize(text)

# Show the resulting list of tokens.
nltk_tokens


['@',
 'VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect them!

🔔 **Question**: Do word boundaries decided by `nltk` make sense to you? Pay attention to the Twitter handle and the URL in the example tweet.
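
For comparison, `nltk` also ships a tokenizer designed for social media text. The sketch below runs `TweetTokenizer` from `nltk.tokenize` on the same tweet; unlike `word_tokenize`, it is expected to keep the handle and the URL as single tokens. The variable names are illustrative.

```python
from nltk.tokenize import TweetTokenizer

tweet = ("@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. "
         "https://t.co/mWpG7grEZP")

# TweetTokenizer is rule-based, so it needs no downloaded model
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print(tweet_tokens)
```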

You may feel that accessing functions in `nltk` is pretty straightforward. The function we used above was imported from the `nltk.tokenize` module, which as the name suggests, primarily does the job of tokenization.

Under the hood, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that serve different purposes. To name a few:

| NLTK module   | Function                  | Link                                                         |
|---------------|---------------------------|--------------------------------------------------------------|
| nltk.tokenize | Tokenization              | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |
| nltk.corpus   | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/)             |
| nltk.tag      | Part-of-speech tagging    | [Documentation](https://www.nltk.org/api/nltk.tag.html)      |
| nltk.stem     | Stemming                  | [Documentation](https://www.nltk.org/api/nltk.stem.html)     |
| ...           | ...                       | ...                                                          |

Let's import `stopwords` from the `nltk.corpus` module, which hosts a range of built-in corpora.

In [365]:
# Import the predefined stop word lists from the NLTK corpus module.
# Stop words are common words (such as "the", "of", "and") that are often removed in text analysis.
from nltk.corpus import stopwords

Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies.

In [366]:
import nltk
# Download NLTK's 'stopwords' corpus,
# which is needed to filter out common words during text processing.
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [367]:
# Get the list of English stop words from NLTK's 'stopwords' corpus.
stop = stopwords.words('english')

# Show the first 10 stop words to get a sense of what the list contains.
stop[:10]


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']
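
To see how such a list is typically used, here is a minimal sketch that filters stop words out of a token list. To stay self-contained, it uses a tiny hand-picked stop list (`mini_stop`, an assumption for illustration) rather than the full `nltk` list; in the notebook you would check against `stop` instead.

```python
# Tokens from the example tweet (handle and URL pieces left out for brevity)
tokens = ['Really', 'missed', 'a', 'prime', 'opportunity', 'for',
          'Men', 'Without', 'Hats', 'parody', ',', 'there', '.']

# Hypothetical mini stop list; a set gives fast membership checks
mini_stop = {'a', 'for', 'there'}

# Compare in lowercase, because stop word lists are lowercase
content_tokens = [tok for tok in tokens if tok.lower() not in mini_stop]
print(content_tokens)
```

Punctuation tokens are untouched here, since stop word removal and punctuation removal are separate steps.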

### `spaCy`
Besides `nltk`, another widely used package is `spaCy`.

`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, as well as [other text analysis components](https://spacy.io/usage/processing-pipelines#custom-components) requested by the user. These components are pretty similar to modules in `nltk`.

<img src='https://github.com/sauls98/Python-NLP-Fundamentals_Grupo_4/blob/main/images/spacy.png?raw=1' alt="spacy pipeline" width="700">

Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. The name suggests that it is a lightweight model trained on some text data (e.g., blogs); see model descriptions [here](https://spacy.io/models/en#en_core_web_sm).

This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the model has been trained on a vast amount of data. As a result, it comes with a certain "knowledge" of word structure and grammar of the language.

Therefore, when we apply the model to our own data, we can expect it to be reasonably accurate in performing various annotation tasks, e.g., tagging a word's part of speech, identifying the syntactic head of a phrase, etc.

Let's dive in! We'll first need to load the pretrained language model we installed earlier.

In [368]:
# Import spaCy and load the small English language model ('en_core_web_sm'),
# which includes tools for text processing such as tokenization, tagging, and syntactic parsing.
import spacy
nlp = spacy.load('en_core_web_sm')


The `nlp` pipeline, by default, includes a set of components, which we can access via the `.pipe_names` attribute.

You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes.

In [369]:
# Show the names of the components that make up the loaded NLP pipeline,
# such as the part-of-speech tagger, the parser, and the named entity recognizer.
nlp.pipe_names


['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
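To make the idea of "tokenizer first, then named components" concrete, here is a toy sketch in plain Python. It is not how `spaCy` is implemented: the names `ToyToken`, `ToyPipeline`, `whitespace_tokenizer`, and `flag_punct` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
import string


@dataclass
class ToyToken:
    """A token plus a dictionary of annotations (hypothetical, not spaCy's Token)."""
    text: str
    annotations: dict = field(default_factory=dict)


def whitespace_tokenizer(text):
    # Always runs first: turns the raw string into a list of tokens.
    return [ToyToken(piece) for piece in text.split()]


def flag_punct(tokens):
    # An optional component: annotates each token, then passes the list on.
    for token in tokens:
        token.annotations["is_punct"] = all(ch in string.punctuation for ch in token.text)
    return tokens


class ToyPipeline:
    def __init__(self, components):
        # A list of (name, function) pairs, similar in spirit to what `.pipe_names` lists.
        self.components = components

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        tokens = whitespace_tokenizer(text)   # the tokenizer is not listed in pipe_names
        for _, component in self.components:  # components run in order
            tokens = component(tokens)
        return tokens


toy_nlp = ToyPipeline([("punct_flagger", flag_punct)])
toy_doc = toy_nlp("Really missed a prime opportunity , there .")
print(toy_nlp.pipe_names)
print([(t.text, t.annotations["is_punct"]) for t in toy_doc])
```

The point of the sketch is the ordering: the tokenizer produces the tokens, and every later component only adds annotations to them.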

Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`.

In [370]:
# Run the spaCy pipeline on the text of the tweet at index 7,
# producing a 'doc' object that holds the tokens and their linguistic annotations.
doc = nlp(tweets['text'][7])


Under the hood, the `doc` object contains the tokens (created by the tokenizer) and their annotations (created by other components), which are [linguistic features](https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve a token and its annotations by accessing the corresponding attributes.

| Attribute      | Annotation                              | Link                                                                      |
|----------------|-----------------------------------------|---------------------------------------------------------------------------|
| token.text     | The token in verbatim text              | [Documentation](https://spacy.io/api/token#attributes)                    |
| token.is_stop  | Whether the token is a stop word        | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.lemma_   | The base form of the token              | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |
| token.pos_     | The simple POS-tag of the token         | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging)   |
| ...            | ...                                     | ...                                                                       |

Let's first get the tokens themselves! We'll iterate over the `doc` object and retrieve the text of each token.

In [371]:
# Extract the verbatim text of each token in the 'doc' object produced by spaCy
# and collect those texts in a list.
spacy_tokens = [token.text for token in doc]

# Display the list of extracted tokens.
spacy_tokens


['@VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https://t.co/mWpG7grEZP']

In [372]:
# Display the list of tokens obtained earlier with the NLTK tokenizer.
nltk_tokens


['@',
 'VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

🔔 **Question**: Let's pause for a minute to compare the tokens generated by `nltk` and `spaCy`. What have you noticed?
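One way to think about the difference is in terms of tokenization rules. The sketch below uses two regular expressions written for this illustration; neither is the actual rule set of `nltk` or `spaCy`. The first splits on every punctuation mark, while the second tries URL and `@handle` patterns before falling back to the generic rule.

```python
import re

text = "@VirginAmerica there. https://t.co/mWpG7grEZP"

# Naive rule: a run of word characters, or any single non-space symbol.
naive_tokens = re.findall(r"\w+|[^\w\s]", text)

# Twitter-aware rule: match URLs and @handles first, then fall back to the naive rule.
aware_tokens = re.findall(r"https?://\S+|@\w+|\w+|[^\w\s]", text)

print(naive_tokens)
print(aware_tokens)
```

Because the alternatives in a regular expression are tried from left to right, placing the special-case patterns first keeps handles and URLs intact.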

Remember that we can also access various annotations of these tokens. For instance, `spaCy` conveniently encodes whether a token is a stop word.

In [373]:
# For each token in 'doc', extract the 'is_stop' annotation, which indicates whether it is a stop word.
spacy_stops = [token.is_stop for token in doc]

# Display the list of booleans, where True means the token is a stop word.
spacy_stops


[False,
 True,
 False,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False]
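The boolean list lines up one-to-one with the token list, so the two can be paired to filter out stop words. The sketch below copies the tokens and flags printed above into two new lists (the names `example_tokens` and `example_stops` are introduced here so the notebook's own variables stay untouched) and needs no packages.

```python
# Tokens and stop word flags copied from the outputs above.
example_tokens = ['@VirginAmerica', 'Really', 'missed', 'a', 'prime', 'opportunity',
                  'for', 'Men', 'Without', 'Hats', 'parody', ',', 'there', '.',
                  'https://t.co/mWpG7grEZP']
example_stops = [False, True, False, True, False, False, True, False, True,
                 False, False, False, True, False, False]

# Pair each token with its flag and keep only the tokens that are not stop words.
kept_tokens = [tok for tok, is_stop in zip(example_tokens, example_stops) if not is_stop]
print(kept_tokens)
```

Note that punctuation and the URL survive this filter; removing them would take a separate check.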

## 🥊 Challenge 2: Remove Stop Words

We now know how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package.

Let's write **two** functions to remove stop words from our text data.

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input

A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!

In [374]:
import nltk
from nltk.tokenize import word_tokenize

# Make sure the required tokenizer resources are downloaded
nltk.download('punkt')

def remove_stopword_nltk(raw_text, stopword):
    """
    Remove stop words from a text using NLTK.

    Step 1: Tokenize the input text into words with word_tokenize.
    Step 2: Filter the tokens, dropping any that appear in the stop word list (case-insensitive).
    Returns the list of tokens with the stop words removed.
    """
    # Tokenize the text
    tokens = word_tokenize(raw_text)

    # Filter out the stop words
    filtered_tokens = [token for token in tokens if token.lower() not in stopword]

    return filtered_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [375]:
import spacy

# Load spaCy's English language model, which includes tools for text analysis.
nlp = spacy.load("en_core_web_sm")

def remove_stopword_spacy(raw_text):
    """
    Remove stop words from a text using spaCy.

    Step 1: Process the text with the spaCy pipeline to get a 'doc' object with annotated tokens.
    Step 2: Filter the tokens, excluding those flagged as stop words.
    Returns a list of tokens with the stop words removed.
    """
    # Process the text
    doc = nlp(raw_text)

    # Keep the tokens that are not stop words
    filtered_tokens = [token.text for token in doc if not token.is_stop]

    return filtered_tokens


In [376]:
# remove_stopword_nltk(text, stop)

In [377]:
# remove_stopword_spacy(text)

## 🎬 **Demo**: Powerful Features from `spaCy`

`spaCy`'s nlp pipeline includes a number of linguistic annotations, which could be very useful for text analysis.

For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like a URL.

In [378]:
# Loop over each token in the 'doc' object produced by spaCy and print its features:
# - The original text of the token
# - The lemma (base form of the token)
# - The part-of-speech (POS) tag
# - A human-readable explanation of the POS tag
# - A flag showing whether the token looks like a URL
for token in doc:
    print(f"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |")


@VirginAmerica           | @VirginAmerica           | PROPN        | proper noun  | 0            |
Really                   | really                   | ADV          | adverb       | 0            |
missed                   | miss                     | VERB         | verb         | 0            |
a                        | a                        | DET          | determiner   | 0            |
prime                    | prime                    | ADJ          | adjective    | 0            |
opportunity              | opportunity              | NOUN         | noun         | 0            |
for                      | for                      | ADP          | adposition   | 0            |
Men                      | man                      | NOUN         | noun         | 0            |
Without                  | without                  | ADP          | adposition   | 0            |
Hats                     | Hats                     | PROPN        | proper noun  | 0            |
parody    

As you can imagine, tweets in this dataset often contain place names and airport codes. It would be useful to identify and extract them.

In [379]:
# Assign and print two example tweets: the one at index 8273, which contains city names,
# and the one at index 502, which contains airport codes.
# A divider line is printed to visually separate the two texts.
tweet_city = tweets['text'][8273]
tweet_airport = tweets['text'][502]
print(tweet_city)
print(f"{'=' * 50}")
print(tweet_airport)


@JetBlue Vegas, San Francisco, Baltimore, San Diego and Philadelphia so far! I'm a very frequent business traveler.
@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.


We can use the "ner" ([Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) component to identify entities and their categories.

In [380]:
# Process the tweet text with spaCy to identify named entities (people, places, organizations, etc.).
# Then loop over each entity found and print:
# - The text of the entity
# - Its start and end character positions in the original text
# - The label that classifies the entity type (for example, location, person, organization).
doc_city = nlp(tweet_city)
for ent in doc_city.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")


Vegas           | 9          | 14         | GPE       
San Francisco   | 16         | 29         | GPE       
Baltimore       | 31         | 40         | GPE       
San Diego       | 42         | 51         | GPE       
Philadelphia    | 56         | 68         | GPE       
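The start and end positions are character offsets into the original string, so slicing the text with them recovers each entity. The sketch below hard-codes the first part of the tweet and the offsets printed above instead of calling `spaCy`; the masking step at the end is an illustration added here, not part of the workshop code.

```python
# First part of the example tweet and the entity spans printed above.
tweet = "@JetBlue Vegas, San Francisco, Baltimore, San Diego and Philadelphia so far!"
spans = [(9, 14, "GPE"), (16, 29, "GPE"), (31, 40, "GPE"), (42, 51, "GPE"), (56, 68, "GPE")]

# Slicing with (start_char, end_char) returns the entity text.
places = [tweet[start:end] for start, end, label in spans if label == "GPE"]
print(places)

# Replace each span with its label, working right to left so earlier offsets stay valid.
masked = tweet
for start, end, label in sorted(spans, reverse=True):
    masked = masked[:start] + f"[{label}]" + masked[end:]
print(masked)
```

Working from the last span to the first matters: replacing text changes the string's length, which would shift any offsets to the right of the edit.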


We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category.

In the following example, we have five `GPE` entities (i.e., geopolitical entities, usually countries and cities) identified.

In [381]:
# Import spaCy's 'displacy' tool for visualizing named entities.
# Render the entities found in 'doc_city' inside the Jupyter environment,
# showing them highlighted in the text.
from spacy import displacy
displacy.render(doc_city, style='ent', jupyter=True)


Let's give it a try with another example.

In [382]:
# Process the text of 'tweet_airport' with spaCy to detect named entities.
# Then print each entity with its text, its start and end character positions,
# and the label that indicates the entity type (for example, organization).
doc_airport = nlp(tweet_airport)
for ent in doc_airport.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")


SFO             | 29         | 32         | ORG       


Interestingly, the airport code `SFO` is labeled `ORG` (organization), while `LAX` and the tweet handle are not recognized as entities in this output. The exact labels may vary with the model version.

In [383]:
# Visualize the named entities detected in 'doc_airport'
# with spaCy's 'displacy' tool inside a Jupyter environment,
# highlighting the entities in the original text.
displacy.render(doc_airport, style='ent', jupyter=True)


## Tokenizers Since LLMs

So far, we've seen what tokenization looks like with two widely used NLP packages. They work quite well in some settings, but not in others. Recall that `nltk` struggles with URLs. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, etc. (collectively called "out of vocabulary" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.

In fact, tokenization schemes change substantially with **Large Language Models** (LLMs), which are models trained on an enormous amount of data from mixed sources. With that magnitude of data, LLMs are better at chunking a longer sequence into tokens and tokens into **subtokens**. These subtokens can be morphological units of a word, such as an affix, but they can also be parts of a word where the model sets a "meaningful" boundary.

In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6).

We will load BERT's tokenizer from the `transformers` package, which hosts a number of Transformer-based LLMs (e.g., BERT). We won't go into the Transformer architecture in this workshop, but feel free to check out the D-Lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!

### WordPiece Tokenization

Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model has a moderate size (referred to as `base`) and is case-insensitive, meaning the input text will be lowercased by default.

In [384]:
# Import the BERT tokenizer from the transformers library.
from transformers import BertTokenizer

# Initialize the pretrained, lowercased BERT tokenizer ('bert-base-uncased'),
# which converts text into tokens compatible with the BERT model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


The tokenizer has multiple methods, as we will see in a minute. For now, we want to call its `.tokenize()` method.

Let's tokenize an example tweet below. What have you noticed?

In [385]:
# Select an example tweet at index 194 of the DataFrame and print it.
text = tweets['text'][194]
print(f"Text: {text}")
print(f"{'=' * 50}")

# Apply the BERT tokenizer to split the text into tokens compatible with the model,
# then print the list of tokens and the total number of tokens produced.
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")


Text: @VirginAmerica Just DM'd. Same issue persisting.
Tokens: ['@', 'virgin', '##ame', '##rica', 'just', 'd', '##m', "'", 'd', '.', 'same', 'issue', 'persist', '##ing', '.']
Number of tokens: 15


The double hash symbols (`##`) mark a subword token: a piece that continues the preceding token rather than starting a new word.

🔔 **Question**: Do these subwords make sense to you?
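To see where such boundaries can come from, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece applies to a single word at tokenization time. The vocabulary `toy_vocab` is a tiny hand-picked set for illustration, not BERT's real vocabulary, and the function name `wordpiece_tokenize` is hypothetical. The real tokenizer also lowercases the text and splits off punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into WordPiece-style subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch).
toy_vocab = {"persist", "##ing", "virgin", "##ame", "##rica", "hug", "##ga", "##ble"}

print(wordpiece_tokenize("persisting", toy_vocab))
print(wordpiece_tokenize("virginamerica", toy_vocab))
print(wordpiece_tokenize("zzz", toy_vocab))
```

The split depends entirely on which pieces are in the vocabulary, which is why boundaries such as `##ame` and `##rica` need not line up with morphemes.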

One significant development with LLMs is that each token is assigned an ID from its vocabulary. Our computer does not understand text in its raw form, so each token is translated into an ID. These IDs are the inputs that the model accesses and operates on.

Tokens and IDs can be converted bidirectionally, for example:

In [386]:
# Look up the numeric ID of the token 'just' in the BERT tokenizer's vocabulary and print it.
print(f"ID of just is: {tokenizer.vocab['just']}")

# Decode and print the token that corresponds to ID 2074 in the tokenizer's vocabulary.
print(f"Token 2074 is: {tokenizer.decode([2074])}")


ID of just is: 2074
Token 2074 is: just


Let's convert tokens to input IDs.

In [387]:
# Convert the list of tokens produced by the BERT tokenizer into a list of numeric IDs,
# which are the encoded input the model will use.
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Print the total number of IDs and the full list of IDs.
print(f"Number of input IDs: {len(input_ids)}")
print(f"Input IDs of text: {input_ids}")


Number of input IDs: 15
Input IDs of text: [1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012]


### Special Tokens

In addition to the tokens and subtokens discussed above, BERT also makes use of special tokens, including `SEP`, `CLS`, and `UNK`. The `SEP` token marks the end of a sentence or segment, playing a role similar to the `EOS` (end-of-sequence) token used by other models. The `UNK` token stands in for any token that is not found in the vocabulary, hence "unknown". The `CLS` token is added to the beginning of the input. It originates from text classification tasks (e.g., spam detection), where researchers found it useful to have a token that aggregates the information of the entire sentence for classification purposes.

When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps:
- Tokenize the text
- Add special tokens
- Convert tokens to input IDs
- Other model-specific processes
  
Let's print them out.

In [388]:
# Get the list of input IDs for the text directly
# by calling the tokenizer, which prepares the input for the BERT model.
input_ids_from_tokenizer = tokenizer(text)['input_ids']

# Print the number of IDs and the full list of IDs produced.
print(f"Number of input IDs: {len(input_ids_from_tokenizer)}")
print(f"IDs from tokenizer: {input_ids_from_tokenizer}")


Number of input IDs: 17
IDs from tokenizer: [101, 1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012, 102]


It looks like we have two more tokens added: 101 and 102.

Let's convert them back to tokens!

In [389]:
# Convert the token IDs 101 and 102 back to their text form
# with the BERT tokenizer and print them.
print(f"The 101st token: {tokenizer.convert_ids_to_tokens(101)}")
print(f"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}")


The 101st token: [CLS]
The 102nd token: [SEP]


As you can see, our text example is now a list of vocabulary IDs. In addition, BERT adds the beginning `CLS` token and the terminating `SEP` token to the original text. BERT's tokenizer encodes any text in the same way, after which the IDs are ready to be passed to the model.
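The encoding steps in this section can be mimicked with a small dictionary. The sketch below is a toy stand-in for the real tokenizer: `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical names, the IDs are copied from the outputs above, and the ID 100 for `[UNK]` is an assumption added here for the unknown-token case.

```python
# A tiny slice of the vocabulary, with IDs taken from the outputs above.
# The ID 100 for [UNK] is an assumption made for this sketch.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, ".": 1012,
             "same": 2168, "issue": 3277, "##ing": 2075, "persist": 29486}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def toy_encode(tokens):
    # Step 1: add the special tokens around the already tokenized text.
    with_special = ["[CLS]", *tokens, "[SEP]"]
    # Step 2: map every token to its ID, falling back to [UNK] if it is missing.
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in with_special]


def toy_decode(ids):
    # The reverse lookup: IDs back to tokens.
    return [id_to_token[i] for i in ids]


ids = toy_encode(["same", "issue", "persist", "##ing", "."])
print(ids)
print(toy_decode(ids))
print(toy_encode(["dlab"]))  # 'dlab' is not in the toy vocabulary
```

With the real tokenizer an unknown word is usually broken into known subwords first, so `[UNK]` appears only when no subword split is possible.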

## 🥊 Challenge 3: Find the Word Boundary

Now we know that tokenization in BERT often returns subwords. Let's try a few more examples.

- What do you think is the correct boundary for splitting the following words into subwords?
- What other examples have you tested?

In [390]:
def get_tokens(string):
    '''Tokenize the input string with BERT and print the tokens'''
    # Use the BERT tokenizer to split the input string into (sub)word tokens
    tokens = tokenizer.tokenize(string)
    # Print the resulting list of tokens
    print(tokens)


In [393]:
# Abbreviations
get_tokens('dlab')

# OOV
get_tokens('covid')

# Suffix
get_tokens('huggable')

# Digits
get_tokens('378')

# The calls above test 'get_tokens' on different kinds of strings to see how BERT tokenizes them:
# - Abbreviations or acronyms (example: 'dlab')
# - Out-of-vocabulary (OOV) words, absent from the pretrained vocabulary (example: 'covid')
# - Words with suffixes (example: 'huggable')
# - Numbers (example: '378')
# You can add your own example by calling 'get_tokens' with another word or phrase.



# This notebook covers the fundamental concepts and techniques of text preprocessing
# for Natural Language Processing (NLP) tasks.
#
# Goals and main content:
#
# - Loading and exploring text data: We work with real datasets, mainly tweets,
#   to show how to handle text in Python.
#
# - Text cleaning: We cover techniques for removing unwanted characters, such as punctuation marks
#   and extra whitespace, which can affect the quality of the analysis.
#
# - Tokenization: We show methods for splitting text into meaningful units, such as words or tokens,
#   using libraries like NLTK and spaCy.
#
# - Stop word removal: We explain how to identify and remove common words that carry little
#   meaning, which helps focus on the important terms.
#
# - Normalization: We show steps for lowercasing the text and handling contractions or variations
#   to make the data uniform.
#
# - Visualization and analysis: We include practical examples for viewing tokens, part-of-speech tags, and
#   named entities, which makes the structure of the language easier to understand.
#
# Taken together, this notebook provides a solid foundation for preparing text before applying more advanced
# NLP models, so that the data is clean, consistent, and ready for analysis or modeling.


['dl', '##ab']
['co', '##vid']
['hug', '##ga', '##ble']
['37', '##8']
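Since `##` marks a continuation, subwords can be glued back into whole words by attaching each `##` piece to the token before it. The helper below, `merge_subwords`, is a hypothetical name written for this sketch; it works on plain lists of strings and does not restore the original casing or spacing.

```python
def merge_subwords(tokens):
    """Glue '##' continuation pieces back onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append to the last word
        else:
            words.append(tok)
    return words


print(merge_subwords(['hug', '##ga', '##ble']))
print(merge_subwords(['@', 'virgin', '##ame', '##rica', 'just', 'd', '##m', "'", 'd', '.']))
```

Running it on the outputs above gives back `huggable` and `virginamerica`, which shows that the subword split loses no characters, only the word boundaries.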


We will wrap up Part 1 with this (hopefully) thought-provoking challenge. LLMs often come with a much more sophisticated tokenization scheme, but there is ongoing discussion about their limitations in real-world applications. The reference section includes a few blog posts discussing this problem. Feel free to explore further if this sounds like an interesting question to you!

## References

1. A tutorial introducing the tokenization scheme in BERT: [The huggingface NLP course on wordpiece tokenization](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)
2. A specific example of "failure" in tokenization: [Weaknesses of wordpiece tokenization: Findings from the front lines of NLP at VMware.](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99)
3. How does BERT decide boundaries between subtokens: [Subword tokenization in BERT](https://tinkerd.net/blog/machine-learning/bert-tokenization/#subword-tokenization)

<div class="alert alert-success">

## ❗ Key Points

* Preprocessing includes multiple steps; some are common to most text data, and some are task-specific.
* Both `nltk` and `spaCy` can be used for tokenization and stop word removal. The latter is more powerful, providing a range of linguistic annotations.
* Tokenization works differently in BERT, which often involves breaking down a whole word into subwords.

</div>