# tidyX examples

In [8]:
# pip install tidyX==1.6.6

In [9]:
!pip show tidyX

Name: tidyX
Version: 1.6.6
Summary: Python package to clean raw tweets for ML applications
Home-page: 
Author: Lucas Gómez Tobón, Jose Fernando Barrera
Author-email: lucasgomeztobon@gmail.com, jf.barrera10@uniandes.edu.co
License: MIT
Location: c:\users\lucas\anaconda3\envs\bx\lib\site-packages
Requires: emoji, nltk, numpy, pandas, regex, spacy, thefuzz, Unidecode
Required-by: 


In [11]:
from tidyX import TextPreprocessor as tp
from tidyX import TextNormalization as tn
from tidyX import TextVisualizer as tv

## Stemming and Lemmatizing Texts Efficiently

The `stemmer()` and `lemmatizer()` functions each accept a single token as input. To normalize an entire text or corpus, we therefore have to call them once per token. This is wasteful when the input contains repeated words, because the same word is normalized again every time it appears.

This tutorial demonstrates how to use the `unnest_tokens()` function to apply a normalization function only once per unique word.
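
To make the saving concrete, here is a minimal sketch in plain Python. `toy_stem` is a hypothetical stand-in for an expensive normalizer (it is not tidyX's `stemmer()`), and the two-sentence corpus is invented for illustration. The sketch counts how many normalizer calls each strategy makes.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for an expensive normalizer (NOT tidyX's stemmer())."""
    toy_stem.calls += 1  # count how often the normalizer is invoked
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


toy_stem.calls = 0

# Invented example corpus with repeated words
corpus = [
    "the cats jumped and the dogs jumped",
    "the dogs kept jumping",
]

# Naive strategy: one normalizer call per token occurrence
toy_stem.calls = 0
naive = [" ".join(toy_stem(t) for t in doc.split()) for doc in corpus]
naive_calls = toy_stem.calls

# Unique-token strategy: one call per distinct token, then a dictionary lookup
toy_stem.calls = 0
unique_tokens = {tok for doc in corpus for tok in doc.split()}
lookup = {tok: toy_stem(tok) for tok in unique_tokens}
cached = [" ".join(lookup[t] for t in doc.split()) for doc in corpus]
unique_calls = toy_stem.calls

print(naive_calls, unique_calls)
```

Both strategies produce the same normalized documents; only the number of normalizer calls differs. On a real corpus of tweets, where common words repeat many times, the gap is typically much larger than in this toy case.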

In [10]:
tp.load_data(file = "spanish")


## Tutorial: Word Cloud

In [None]:
import os
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import spacy

# Check the working directory; the data path used below is relative to it
os.getcwd()

In [None]:
tweets = pd.read_excel(r"../../../data/Tweets sobre venezuela.xlsx")
tweets.head()

In [None]:
# Combine all documents into a single string
text = " ".join(tweets['Snippet'].dropna().astype(str))

# Generate a word cloud image
wordcloud = WordCloud(background_color = "white", width = 800, height = 400).generate(text)

# Display the generated image
plt.figure(figsize=(10, 5))
plt.title("WordCloud before tidyX")
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off");

In [None]:
tweets['clean'] = tweets['Snippet'].apply(lambda x: tp.preprocess(x, delete_emojis = False, extract = False,
                                                                  remove_stopwords = True))
tweets

In [None]:
token_df = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = True)
token_df

In [None]:
# Load spacy's model
model = spacy.load('es_core_news_lg')

In [None]:
# Apply tn.lemmatizer to each unique token
token_df["lemma"] = token_df["clean"].apply(lambda x: tn.lemmatizer(token = x, model = model))
token_df

In [None]:
token_df["lemma"] = token_df["lemma"].apply(lambda x: tp.remove_words(x, remove_stopwords = True))
token_df = token_df[["clean", "lemma"]]
token_df

In [None]:
tweets_long = tp.unnest_tokens(df = tweets.copy(), input_column = "clean", id_col = None, unique = False)
tweets_long

In [None]:
tweets_clean2 = (
    tweets_long.merge(token_df, how = "left", on = "clean")
    .groupby(["Snippet", "id"])["lemma"]
    .agg(lambda x: " ".join(x))
    .reset_index()
)
tweets_clean2
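
The merge-and-regroup step above can be easier to follow on toy data. The sketch below assumes a hypothetical two-row frame and a hand-written lemma table. It uses only pandas (`str.split`, `explode`, `merge`, `groupby`) rather than tidyX, so it illustrates the pattern, not the library's internals.

```python
import pandas as pd

# Hypothetical mini-corpus: one cleaned document per row
docs = pd.DataFrame({"id": [0, 1], "clean": ["gatos corren", "gatos duermen"]})

# Long format: one row per token occurrence (pandas-only stand-in for unnesting)
long_df = docs.assign(token=docs["clean"].str.split()).explode("token")

# Hand-written lookup table: one row per unique token with its lemma
lemma_table = pd.DataFrame(
    {"token": ["gatos", "corren", "duermen"], "lemma": ["gato", "correr", "dormir"]}
)

# Attach each occurrence's lemma, then rebuild one string per document
rebuilt = (
    long_df.merge(lemma_table, how="left", on="token")
    .groupby("id")["lemma"]
    .agg(" ".join)
    .reset_index()
)
print(rebuilt)
```

A left merge keeps the row order of the long table, so the lemmas are joined back in their original token order within each document.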

In [None]:
tweets_clean2['lemma'] = tweets_clean2['lemma'].apply(lambda x: tp.remove_extra_spaces(x))

In [None]:
# Combine all documents into a single string
text = " ".join(doc for doc in tweets_clean2['lemma'])

# Generate a word cloud image
wordcloud = WordCloud(background_color = "white", width = 800, height = 400).generate(text)

# Display the generated image
plt.figure(figsize=(10, 5))
plt.title("WordCloud after tidyX")
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off");