# 1. Google Colab

Because some NLP models are very large and benefit greatly from a GPU during training, we're going to use Google Colab for this exercise. You can read this blog [post](https://www.dataquest.io/blog/getting-started-with-google-colab-for-deep-learning?_gl=1*1svxc04*_gcl_au*MTExNjMxNDExNS4xNzA0MjE3NTA5) to learn more about it.

# 2. TensorFlow Datasets (TFDS)

TensorFlow Datasets (TFDS) is a library that provides a collection of ready-to-use datasets, along with utilities that simplify loading and preprocessing data in TensorFlow. It offers high-level APIs to create reproducible input pipelines for TensorFlow models. Because TFDS includes many popular public datasets, developers can quickly test their models on real-world data. TFDS can also read data from cloud storage systems such as Google Cloud Storage (GCS) and Amazon S3, which helps when working with large datasets.

The following cell loads the "train" split of the `imdb_reviews` dataset, as opposed to the "test" split, and converts it to a pandas DataFrame.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

imdb_data = tfds.load(name="imdb_reviews", split="train")
imdb_df = tfds.as_dataframe(imdb_data)
print(f"Shape of data: {imdb_df.shape}")

We can list the available datasets with the following command:

In [None]:
tfds.list_builders()

# 3. Exploring the IMDB Dataset

- The IMDB dataset contains 50,000 movie review texts, each categorized as either positive or negative. You can learn more about the dataset [here](http://ai.stanford.edu/~amaas/data/sentiment/).
- The IMDB dataset is evenly split between the "train" and "test" sets – each containing 25,000 observations. So if we call the `tfds.load()` function and pass it the `split="train"` or `split="test"` argument, the results will be different datasets of 25,000 records each.

In [None]:
imdb_df['text'] = imdb_df['text'].str.decode('utf-8')
imdb_sample = imdb_df.sample(frac=0.2, random_state=100)

In [None]:
print(f'Shape of sample: {imdb_sample.shape}')

In [None]:
imdb_sample.head(10)

In [None]:
imdb_sample.tail(10)

Next, we check the distribution of the target variable, `label`:

In [None]:
imdb_sample['label'].value_counts()

Positive and negative movie reviews are indicated by the `label` values 1 and 0, respectively. As the output above shows, the sample is approximately balanced, with a similar number of positive and negative reviews.
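
Raw counts are easier to compare as proportions. The sketch below uses a made-up list of labels (an assumption, standing in for `imdb_sample['label']`) to show the idea; on the DataFrame itself, `value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter

# Hypothetical labels standing in for imdb_sample['label'] (1 = positive, 0 = negative)
labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class; values near 0.5 indicate a balanced binary target
proportions = {label: count / total for label, count in counts.items()}
print(proportions)
```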

In [None]:
imdb_sample.isna().sum()

# 4. Word Distribution

Understanding the distribution of text lengths is important:
- It can provide insight into a reviewer's messaging strategy. For instance, if most of their reviews are short, it could mean they are aiming for brevity and efficiency. Conversely, if they write longer reviews, it could indicate they are looking to provide detailed information or engage in meaningful dialogue.
- Review length might also be indicative of whether a review is positive or negative. We can test this hypothesis by comparing summary statistics of review length for each group. First, we create a new variable, `text_length`, in the dataframe `imdb_sample` with the following code:

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Count whitespace-separated words; split() without arguments ignores repeated spaces
imdb_sample['text_length'] = imdb_sample['text'].str.split().str.len()
sns.histplot(data=imdb_sample, x='text_length', bins=30)
plt.show()
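
How the text is split affects the word count: `str.split(' ')` breaks on every single space, so repeated or trailing spaces produce empty strings that get counted as words, whereas `str.split()` with no argument breaks on runs of whitespace. A small illustration with a made-up review string:

```python
# Hypothetical review with a double space and a trailing space
review = "great  movie, would watch again "

tokens_single_space = review.split(' ')  # splits on every single space
tokens_whitespace = review.split()       # splits on runs of whitespace

print(tokens_single_space)
print(tokens_whitespace)
print(len(tokens_single_space), len(tokens_whitespace))
```

The first list contains empty strings, which inflate the count; the second contains only the actual words.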

To understand the distribution of `text_length` for both positive and negative reviews, we can calculate its mean, median, and standard deviation grouped by the target variable, `label`.

In [None]:
imdb_sample.groupby(by="label")[["text_length"]].mean()

In [None]:
imdb_sample.groupby(by="label")[["text_length"]].median()

In [None]:
imdb_sample.groupby(by="label")[["text_length"]].std()

In [None]:
# Optional bar plot code
mean_length = imdb_sample.groupby(by="label")[["text_length"]].mean().reset_index()
median_length = imdb_sample.groupby(by="label")[["text_length"]].median().reset_index()
std_length = imdb_sample.groupby(by="label")[["text_length"]].std().reset_index()

combined_df = pd.concat([mean_length, median_length["text_length"], std_length["text_length"]], axis=1)
combined_df.columns = ["label", "mean_length", "median_length", "std_length"]
melted_df = pd.melt(combined_df, id_vars="label", var_name="Measure", value_name="Value")

sns.barplot(data=melted_df, x="label", y="Value", hue="Measure")
plt.xlabel("Label")
plt.ylabel("Value")
plt.title("Comparison of Measures")
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0.)
plt.show()


Based on the results above, there doesn't appear to be a correlation between `text_length` and `label`.
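
One way to put a number on this is the Pearson correlation between `text_length` and the binary `label` (with a 0/1 variable this is also known as the point-biserial correlation). The sketch below is only an illustration: the lengths and labels are made up, and `pearson` is a hypothetical helper, not part of the notebook. On the real data, `imdb_sample['text_length'].corr(imdb_sample['label'])` should compute the same coefficient.

```python
import math

# Hypothetical word counts and labels standing in for the imdb_sample columns
text_length = [120, 340, 95, 410, 230, 180, 305, 150]
label = [1, 0, 1, 0, 1, 0, 1, 0]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(text_length, label)
print(f"correlation between text_length and label: {r:.3f}")
```

A coefficient close to 0 would support the conclusion that review length carries little information about the label.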

# 5. Visualizing the Most Frequent Words

Visualizing the most frequently used words is critical:
- It gives us an idea of the general theme or tone of the dataset. This can help us understand its purpose and potential applications.
- It helps us spot problems in the data. For example, if we see that stopwords dominate the dataset, it could be a sign that we should eliminate them during text preprocessing. Stopwords are words that are commonly used in a language but carry little semantic meaning (for example, "a", "an", "the", "of", "to", and "in").
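
To make the stopword point concrete, here is a minimal sketch that measures what share of all tokens are stopwords. The two reviews and the `toy_stop_words` set are made up for illustration; they are not taken from the dataset or from NLTK.

```python
from collections import Counter

# Hypothetical mini-corpus and stopword set, for illustration only
reviews = [
    "the plot of the movie is thin",
    "a joy to watch in the cinema",
]
toy_stop_words = {"a", "an", "the", "of", "to", "in", "is"}

tokens = " ".join(reviews).split()
counts = Counter(tokens)

# Fraction of all tokens that are stopwords
stopword_tokens = sum(n for word, n in counts.items() if word in toy_stop_words)
stopword_share = stopword_tokens / len(tokens)

print(counts.most_common(1))
print(f"stopword share: {stopword_share:.2f}")
```

A high share suggests that the most frequent words say little about the content, which is a reason to remove them before looking for themes.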

In [None]:
freq_words = imdb_sample['text'].str.split(expand=True).stack().value_counts()
freq_words_top100 = freq_words[:100]
freq_words_top100.describe()

A treemap is a great way to visualize frequently used words by arranging them in rectangles of varying size. The size of each rectangle is directly proportional to the frequency of that word in a given text or dataset

In [None]:
import plotly.express as px

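# Build an explicit two-column frame so the plot does not depend on the Series' default name
top100_df = freq_words_top100.rename_axis('word').reset_index(name='count')
fig = px.treemap(top100_df, path=['word'], values='count')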
fig.update_layout(title_text='Most Frequent 100 Words in the Dataset', title_font=dict(size=20))
fig.show()

# 6. Text Preprocessing

Preprocessing text data is crucial for several reasons. It is a necessary step in any natural language processing (NLP) task, as it helps to clean and standardize the text, making it easier for the NLP algorithm to process. Furthermore, preprocessing can improve the accuracy of our results.

Preprocessing involves several steps that can help improve the overall quality of our text data. These steps include:

1. Converting the text to lowercase
2. Removing punctuation from the text
3. Tokenizing the text
4. Removing stopwords from the text
5. Lemmatizing the text (reducing each word to its dictionary form)
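
Before applying these steps to the whole sample, it can help to see them chained on a single sentence. The sketch below is a simplified, pure-Python stand-in: `TOY_STOP_WORDS`, `TOY_LEMMAS`, and `preprocess` are hypothetical names introduced here, whitespace splitting replaces a real tokenizer, and a lookup table replaces a real lemmatizer.

```python
import re

# Toy stand-ins (assumptions): a tiny stopword set and a lookup-table "lemmatizer"
TOY_STOP_WORDS = {"the", "a", "was", "and", "it"}
TOY_LEMMAS = {"loved": "love", "scenes": "scene", "watching": "watch"}

def preprocess(text):
    text = text.lower()                                       # 1. lowercase
    text = re.sub(r"[^\w\s]", " ", text)                      # 2. punctuation -> space
    tokens = text.split()                                     # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # 4. drop stopwords
    return [TOY_LEMMAS.get(t, t) for t in tokens]             # 5. map to base form

result = preprocess("I LOVED the opening scenes, and it was worth watching!")
print(result)
```

The order matters: lowercasing first ensures that the stopword and lemma lookups match regardless of the original capitalization.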

In [None]:
imdb_sample['text'] = imdb_sample['text'].str.lower()

import re

def punctuation(inputs):
    return re.sub(r'[^\w\s]', ' ', inputs)

imdb_sample['text'] = imdb_sample['text'].apply(punctuation)

imdb_sample.head()

In [None]:
# Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this resource
from nltk.tokenize import word_tokenize

def tokenization(inputs):
    return word_tokenize(inputs)

imdb_sample['text_tokenized'] = imdb_sample['text'].apply(tokenization)
imdb_sample['text_tokenized'].head()

In [None]:
# Stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.remove('not')
stop_words.add('br')

def stopwords_remove(inputs):
    return [word for word in inputs if word not in stop_words]

imdb_sample['text_stop'] = imdb_sample['text_tokenized'].apply(stopwords_remove)
imdb_sample['text_stop'].head()
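
The cell above removes `not` from the stopword set before filtering, presumably because negation matters for sentiment. A small sketch of the effect, using made-up sentences and stopword sets (the names below are hypothetical and independent of the notebook's `stop_words`):

```python
# Hypothetical stopword sets: one treats "not" as a stopword, the other keeps it
stop_with_not = {"the", "is", "a", "this", "not"}
stop_without_not = stop_with_not - {"not"}

tokens_pos = "this is a good movie".split()
tokens_neg = "this is not a good movie".split()

def remove(tokens, stop_words):
    """Return the tokens that are not in the given stopword set."""
    return [t for t in tokens if t not in stop_words]

# With "not" treated as a stopword, both reviews collapse to the same tokens
same_after_filtering = remove(tokens_pos, stop_with_not) == remove(tokens_neg, stop_with_not)
print(same_after_filtering)

# Keeping "not" preserves the negation
kept_negation = remove(tokens_neg, stop_without_not)
print(kept_negation)
```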

In [None]:
# Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def lemmatization(inputs):
    return [lemmatizer.lemmatize(word=word, pos='v') for word in inputs]

imdb_sample['text_lemmatized'] = imdb_sample['text_stop'].apply(lemmatization)
imdb_sample['text_lemmatized'].head()

# 7. Visualization of Reviews after Text Preprocessing

Let's look at the 100 most frequently used words after text preprocessing.

In [None]:
imdb_sample['final'] = imdb_sample['text_lemmatized'].str.join(' ')

freq_words = imdb_sample['final'].str.split(expand=True).stack().value_counts()
freq_words_top100 = freq_words[:100]

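# Build an explicit two-column frame so the plot does not depend on the Series' default name
top100_clean_df = freq_words_top100.rename_axis('word').reset_index(name='count')
fig = px.treemap(top100_clean_df, path=['word'], values='count')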
fig.update_layout(title_text='Most Frequently Used 100 Words after Text Preprocessing', title_font=dict(size=20))
fig.show()

A WordCloud is a graphical representation of the most common words used in a piece of text. The more often a word is used, the larger it appears in the word cloud visualization. A WordCloud can be used to quickly identify the most prominent themes in a text. Let's visualize the negative and the positive reviews separately.

In [None]:
from wordcloud import WordCloud

# Negative movie reviews
imdb_sample_0 = imdb_sample[imdb_sample['label'] == 0]
word_cloud_0 = WordCloud(max_words=100, stopwords=stop_words, random_state=100).generate(' '.join(imdb_sample_0['final'].tolist()))

plt.figure(figsize=(15, 10))
plt.imshow(word_cloud_0, interpolation='bilinear')
plt.title('WordCloud of Frequently Used Words in Negative Reviews', fontsize=20)
plt.axis("off")
plt.show()

In [None]:
# Positive movie reviews
imdb_sample_1 = imdb_sample[imdb_sample['label'] == 1]
word_cloud_1 = WordCloud(max_words=100, stopwords=stop_words, random_state=100).generate(' '.join(imdb_sample_1['final'].tolist()))

plt.figure(figsize=(15, 10))
plt.imshow(word_cloud_1, interpolation='bilinear')
plt.title('WordCloud of Frequently Used Words in Positive Reviews', fontsize=20)
plt.axis("off")
plt.show()

Surprisingly, positive and negative movie reviews tend to use similar words. For example, both word clouds are dominated by words like `film`, `movie`, `one`, `not`, `like`, `make`, `see`, `get`, and `character`. This makes sense given the treemap shown earlier, which displays the most frequently used words regardless of whether the reviews are positive or negative.

As an optional exercise: if we wanted to create word clouds that are more distinct between positive and negative reviews, we could use the `stopwords` argument of the `WordCloud()` constructor to specifically remove some of these shared top words and generate the word clouds again. Feel free to experiment!
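
One possible way to find candidate words to exclude is to look for words that are frequent in both classes. The sketch below uses made-up, already preprocessed reviews and a hypothetical `min_count` threshold; the resulting set could then be merged into the set passed to the `stopwords` argument.

```python
from collections import Counter

# Hypothetical preprocessed reviews per class, for illustration only
negative = ["film bore plot weak movie", "movie bad act film"]
positive = ["film great story movie", "movie love act film"]

neg_counts = Counter(" ".join(negative).split())
pos_counts = Counter(" ".join(positive).split())

# Words that appear at least `min_count` times in both classes
min_count = 2
shared = {w for w in neg_counts
          if neg_counts[w] >= min_count and pos_counts[w] >= min_count}

# Tokens that remain in the negative reviews once shared words are excluded
neg_distinct = [w for w in " ".join(negative).split() if w not in shared]

print(sorted(shared))
print(neg_distinct)
```

On the real data, the threshold would need tuning, since a value that is too low would also remove words that actually distinguish the two classes.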