# tidyX examples

In [16]:
# pip install tidyX==1.6.7

In [17]:
!pip show tidyX

Name: tidyX
Version: 1.6.7
Summary: Python package to clean raw tweets for ML applications
Home-page: 
Author: Lucas Gómez Tobón, Jose Fernando Barrera
Author-email: lucasgomeztobon@gmail.com, jf.barrera10@uniandes.edu.co
License: MIT
Location: c:\users\lucas\anaconda3\envs\bx\lib\site-packages
Requires: emoji, nltk, numpy, pandas, regex, spacy, thefuzz, Unidecode
Required-by: 


In [1]:
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv

## Stemming and Lemmatizing Texts Efficiently

The `stemmer()` and `lemmatizer()` functions each accept a single token as input. Thus, if we aim to normalize an entire text or a corpus, we would need to iterate over each token in the string using these functions. This approach might be inefficient, especially if the input contains repeated words.

This tutorial demonstrates how to utilize the `unnest_tokens()` function to apply normalization functions just once for every unique word.
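
The saving comes from caching: each distinct word is normalized once and the result is reused for every occurrence. The idea can be sketched in plain Python as follows. Here `toy_stem` is a hypothetical stand-in for `tn.stemmer()` (it only strips one trailing vowel), the corpus is made up, and the call counter exists purely to compare the two approaches.

```python
from collections import Counter

calls = Counter()

def toy_stem(token: str) -> str:
    """Hypothetical stand-in for tn.stemmer(): drops one trailing vowel."""
    calls["toy_stem"] += 1
    return token[:-1] if token and token[-1] in "aeiou" else token

corpus = [
    "banco venezolano activa servicio",
    "capturado venezolano mercado publico",
    "banco publico servicio",
]

# Naive approach: one call per token occurrence.
naive = [" ".join(toy_stem(t) for t in doc.split()) for doc in corpus]
naive_calls = calls["toy_stem"]

# Dictionary approach: one call per unique token, then a plain lookup.
calls.clear()
vocabulary = {t for doc in corpus for t in doc.split()}
stem_of = {t: toy_stem(t) for t in vocabulary}
cached = [" ".join(stem_of[t] for t in doc.split()) for doc in corpus]
cached_calls = calls["toy_stem"]

print(naive_calls, cached_calls)  # prints: 11 7
```

Both approaches yield the same stemmed documents; the gap in call counts grows with how often words repeat in the corpus.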

In [28]:
# First, load a dataframe containing 1000 tweets from Colombia discussing Venezuela.
tweets = tp.load_data(file = "spanish")
tweets.head()

Unnamed: 0,Tweet
0,RT @emilsen_manozca ¿Me regala una moneda pa u...
1,RT @CriptoNoticias Banco venezolano activa ser...
2,Capturado venezolano que asesinó a comerciante...
3,RT @PersoneriaVpar @PersoneriaVpar acompaña al...
4,"Bueno ya sacaron la carta de ""amenaza de atent..."


In [29]:
# First, clean the text with the preprocess function
tweets['clean'] = tweets['Tweet'].apply(lambda x: tp.preprocess(x, 
                                                                delete_emojis = False, 
                                                                remove_stopwords = True, 
                                                                language_stopwords = "spanish"))
tweets.head()

Unnamed: 0,Tweet,clean
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,regala moneda pa cafe venezolano no tuitero ah...
1,RT @CriptoNoticias Banco venezolano activa ser...,banco venezolano activa servicio usuarios crip...
2,Capturado venezolano que asesinó a comerciante...,capturado venezolano asesino comerciante merca...
3,RT @PersoneriaVpar @PersoneriaVpar acompaña al...,acompa grupo especial migratorio cesar reunion...
4,"Bueno ya sacaron la carta de ""amenaza de atent...",bueno sacaron carta amenaza atentado president...


In this step, we use the `unnest_tokens()` function to split each tweet into multiple rows, one token per row. With `unique = True`, identical tokens are aggregated, which yields an auxiliary dataframe that acts as a dictionary for lemmas or stems.

In [30]:
dictionary_normalization = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
dictionary_normalization

Unnamed: 0,clean,id
0,,246
1,abajo,"352, 577"
2,abandonar,"337, 509"
3,abarrotarse,993
4,abiertos,72
...,...,...
5878,🤪,519
5879,🤬,"483, 520, 908, 908"
5880,🤯,615
5881,🤷,"482, 736, 841, 947, 947, 947"


Note that the `id` column lists the indices of the tweets that contain each token in the `clean` column. We can now apply the `stemmer()` and `lemmatizer()` functions to add new columns to `dictionary_normalization`.
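
One way to picture this table is as an inverted index from token to tweet indices. The following plain-Python sketch illustrates that structure; it is not tidyX's implementation, `build_token_index` is a hypothetical helper name, and the three short documents are made up.

```python
from collections import defaultdict

def build_token_index(docs):
    """Map each whitespace-separated token to the list of document indices
    where it occurs (one entry per occurrence), sorted by token."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for token in doc.split():
            index[token].append(doc_id)
    return dict(sorted(index.items()))

docs = ["abajo el regimen", "abandonar abajo", "abajo abajo"]
token_index = build_token_index(docs)
for token, ids in token_index.items():
    print(token, ids)
# abajo [0, 1, 2, 2]
# abandonar [1]
# el [0]
# regimen [0]
```

In this sketch an index repeats when a token occurs more than once in the same document, which is presumably also why rows such as `908, 908` appear in the output above.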

In [31]:
# Apply the stemmer function to stem each unique token
dictionary_normalization["stemm"] = dictionary_normalization["clean"].apply(lambda x: tn.stemmer(token = x, language = "spanish"))

Before lemmatizing, download the corresponding spaCy model. For Spanish lemmatization, we suggest the `es_core_news_sm` model:

```bash
!python -m spacy download es_core_news_sm   
```

For English lemmatization, we suggest the `en_core_web_sm` model:

```bash
!python -m spacy download en_core_web_sm 
```

For a full list of available models for different languages, see [spaCy's documentation](https://spacy.io/models/).
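
spaCy's downloadable models are installed as regular Python packages, so a quick check before calling `spacy.load()` can avoid an `OSError` at load time. A small sketch using only the standard library follows; `is_package_installed` is a hypothetical helper, not part of tidyX or spaCy.

```python
import importlib.util

def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None

# Report which of the suggested models are available in this environment.
for model_name in ("es_core_news_sm", "en_core_web_sm"):
    status = "installed" if is_package_installed(model_name) else "missing"
    print(f"{model_name}: {status}")
```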


In [38]:
import spacy

# Load model
model_es = spacy.load("es_core_news_sm")

# Apply the lemmatizer function to each unique token
dictionary_normalization["lemma"] = dictionary_normalization["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model_es))

# Lemmatizing can produce stopwords, so we apply the remove_words function afterwards
dictionary_normalization["lemma"] = dictionary_normalization["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True, language = "spanish"))

dictionary_normalization

Unnamed: 0,clean,id,stemm,lemma
0,,246,,
1,abajo,"352, 577",abaj,abajo
2,abandonar,"337, 509",abandon,abandonar
3,abarrotarse,993,abarrot,abarrotar
4,abiertos,72,abiert,abierto
...,...,...,...,...
5878,🤪,519,🤪,🤪
5879,🤬,"483, 520, 908, 908",🤬,🤬
5880,🤯,615,🤯,🤯
5881,🤷,"482, 736, 841, 947, 947, 947",🤷,🤷


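To rebuild the tweets from the normalized tokens, we use the `unnest_tokens()` function again, this time with `unique = False`, and then merge the result with the dictionary.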

In [40]:
tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long

Unnamed: 0,Tweet,clean,id
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,regala,0
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,moneda,0
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,pa,0
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,cafe,0
0,RT @emilsen_manozca ¿Me regala una moneda pa u...,venezolano,0
...,...,...,...
999,"RT infopresidencia: ""Sin lugar a dudas hay uno...",recibido,999
999,"RT infopresidencia: ""Sin lugar a dudas hay uno...",cerca,999
999,"RT infopresidencia: ""Sin lugar a dudas hay uno...",venezolanos,999
999,"RT infopresidencia: ""Sin lugar a dudas hay uno...",presidente,999


In [47]:
# Attach the stem and lemma of every token, then rejoin the tokens of each tweet
tweets_normalized = tweets_long \
    .merge(dictionary_normalization, how = "left", on = "clean") \
    .groupby(["id_x", "Tweet"])[["lemma", "stemm"]] \
    .agg(lambda x: " ".join(x)) \
    .reset_index()
tweets_normalized.head()

Unnamed: 0,id_x,Tweet,lemma,stemm
0,0,RT @emilsen_manozca ¿Me regala una moneda pa u...,regalar moneda pa cafar venezolano tuitero ah...,regal moned pa caf venezolan no tuiter ah 😂 👋
1,1,RT @CriptoNoticias Banco venezolano activa ser...,banco venezolano activo servicio usuario cript...,banc venezolan activ servici usuari criptomoned
2,2,Capturado venezolano que asesinó a comerciante...,capturado venezolano asesino comerciante merca...,captur venezolan asesin comerci merc public
3,3,RT @PersoneriaVpar @PersoneriaVpar acompaña al...,acompa grupo especial migratorio cesar reunion...,acomp grup especial migratori ces reunion real...
4,4,"Bueno ya sacaron la carta de ""amenaza de atent...",bueno sacar cartar amenazar atentado president...,buen sac cart amenaz atent president duqu func...
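
The merge-and-aggregate step above amounts to a dictionary lookup per token followed by a join per tweet. A plain-Python sketch of the same idea follows. The lemma dictionary is hand-written for illustration (its entries mirror the outputs above), `rebuild` is a hypothetical helper, and tokens whose lemma is empty are skipped so that no double spaces are left behind.

```python
def rebuild(tokens_by_doc, normalized):
    """Replace each token by its normalized form and join per document.
    Unknown tokens are kept unchanged; tokens mapped to an empty string
    (e.g. removed stopwords) are dropped."""
    rebuilt = {}
    for doc_id, tokens in tokens_by_doc.items():
        forms = (normalized.get(tok, tok) for tok in tokens)
        rebuilt[doc_id] = " ".join(form for form in forms if form)
    return rebuilt

lemma_of = {
    "regala": "regalar",
    "moneda": "moneda",
    "venezolano": "venezolano",
    "no": "",  # lemma removed as a stopword
    "tuitero": "tuitero",
}
tokens_by_doc = {0: ["regala", "moneda", "venezolano", "no", "tuitero"]}

rebuilt = rebuild(tokens_by_doc, lemma_of)
print(rebuilt[0])  # prints: regalar moneda venezolano tuitero
```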


In [54]:
for i in range(3):
    print("-"*50)
    print("Example", i + 1)
    print("Original tweet:", tweets_normalized.loc[i, "Tweet"])
    print("Lemmatized tweet:", tweets_normalized.loc[i, "lemma"])
    print("Stemmed tweet:", tweets_normalized.loc[i, "stemm"])

--------------------------------------------------
Example 1
Original tweet: RT @emilsen_manozca ¿Me regala una moneda pa un café? -¿Eres venezolano? Noo! Tuitero. -Ahhh 😂😂😂👋
Lemmatized tweet: regalar moneda pa cafar venezolano  tuitero ah 😂 👋
Stemmed tweet: regal moned pa caf venezolan no tuiter ah 😂 👋
--------------------------------------------------
Example 2
Original tweet: RT @CriptoNoticias Banco venezolano activa servicio para usuarios de criptomonedas #ServiciosFinancieros https://t.co/1r2rZIUdlo
Lemmatized tweet: banco venezolano activo servicio usuario criptomoneda
Stemmed tweet: banc venezolan activ servici usuari criptomoned
--------------------------------------------------
Example 3
Original tweet: Capturado venezolano que asesinó a comerciante del Mercado Público https://t.co/XrmWKVYMR8 https://t.co/CfMLaB25jI
Lemmatized tweet: capturado venezolano asesino comerciante mercado publico
Stemmed tweet: captur venezolan asesin comerci merc public


## Tutorial: Word Cloud

In [None]:
import os
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import spacy
 
os.getcwd()

In [None]:
tweets = pd.read_excel(r"../../../data/Tweets sobre venezuela.xlsx")
tweets.head()

In [None]:
# Combine all documents into a single string
text = " ".join(doc for doc in tweets['Snippet'])

# Generate a word cloud image
wordcloud = WordCloud(background_color = "white", width = 800, height = 400).generate(text)

# Display the generated image
plt.figure(figsize=(10, 5))
plt.title("WordCloud before tidyX")
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off");

In [None]:
tweets['clean'] = tweets['Snippet'].apply(lambda x: tp.preprocess(x, delete_emojis = False, extract = False,
                                                                  remove_stopwords = True))
tweets

In [None]:
token_df = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
token_df

In [None]:
# Load spacy's model
model = spacy.load('es_core_news_lg')

In [None]:
# Apply the lemmatizer function to each unique token
token_df["lemma"] = token_df["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model))
token_df

In [None]:
token_df["lemma"] = token_df["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True))
token_df = token_df[["clean", "lemma"]]
token_df

In [None]:
tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long 

In [None]:
tweets_clean2 = tweets_long \
    .merge(token_df, how = "left", on = "clean") \
    .groupby(["Snippet", "id"])["lemma"] \
    .agg(lambda x: " ".join(x)) \
    .reset_index()
tweets_clean2

In [None]:
tweets_clean2['lemma'] = tweets_clean2['lemma'].apply(lambda x: tp.remove_extra_spaces(x))

In [None]:
# Combine all documents into a single string
text = " ".join(doc for doc in tweets_clean2['lemma'])

# Generate a word cloud image
wordcloud = WordCloud(background_color = "white", width = 800, height = 400).generate(text)

# Display the generated image
plt.figure(figsize=(10, 5))
plt.title("WordCloud after tidyX")
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off");
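
A word cloud sizes each word by how often it occurs, so inspecting the raw counts can be a useful sanity check alongside the image. A minimal sketch with the standard library follows; the two example strings are made up and stand in for the `lemma` column. `WordCloud` applies its own tokenization and defaults, so its weights need not match these counts exactly.

```python
from collections import Counter

# Made-up stand-in for tweets_clean2['lemma']
lemmas = [
    "banco venezolano activo servicio",
    "capturado venezolano mercado",
]

# Count every whitespace-separated token across all documents.
counts = Counter(token for doc in lemmas for token in doc.split())
top = counts.most_common(1)
print(top)  # prints: [('venezolano', 2)]
```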