# Project 1 - Text Classification Task

## 1 - Data Provenance and Characteristics

The dataset we will be working with consists of publications sourced from Reddit and Google, authored by individuals from England, Australia, and India. 

The Reddit-sourced data is divided as follows:

- Reddit (England): Training data and test data
- Reddit (Australia): Training data and test data
- Reddit (India): Training data and test data

Similarly, an equivalent division applies to the Google-sourced data:

- Google (England): Training data and test data
- Google (Australia): Training data and test data
- Google (India): Training data and test data

All datasets share the same attributes: `id`, a unique identifier for each entry, `text`, the content of the publication, and `sentiment_label`, the target variable for our analysis. The `sentiment_label` is binary, where `0` indicates a negative sentiment and `1` indicates a positive sentiment.

## 2 - Exploratory Data Analysis

### 2.1 - Initial Setup

We begin by reading all 12 datasets. Since the distinction between training and test data is not relevant for our analysis, we first merge them, reducing the total to 6 datasets.

To further facilitate the analysis, we create 3 additional datasets grouped by country of origin: Reddit and Google data are combined, while England, Australia, and India are kept separate. We also create 2 datasets grouped by source, that is, Reddit or Google.

Finally, we create a global dataset containing all the available data.
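
An alternative to keeping many separate DataFrames is to tag each subset with its source and country and derive the groupings on demand. The sketch below only illustrates that idea: the `source` and `country` columns, the `tag` helper, and the toy rows are all hypothetical and are not part of the original data or notebook.

```python
import pandas as pd

def tag(df, source, country):
    """Return a copy of df with hypothetical `source` and `country` columns."""
    return df.assign(source=source, country=country)

# toy stand-ins for two of the merged datasets (invented, not real data)
reddit_uk = pd.DataFrame({"text": ["bad day", "great pub"], "sentiment_label": [0, 1]})
google_in = pd.DataFrame({"text": ["nice food"], "sentiment_label": [1]})

tagged = pd.concat(
    [tag(reddit_uk, "reddit", "uk"), tag(google_in, "google", "in")],
    ignore_index=True,
)

# any grouping is then a filter or a groupby on the tagged frame
uk_view = tagged[tagged["country"] == "uk"]
sizes_by_source = tagged.groupby("source").size()
```

With this layout, the per-country, per-source, and global views all come from one frame, at the cost of two extra columns.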

In addition, we remove the `id` attribute at the beginning of our process. This decision was made to prevent inconsistencies, as some publications shared the same `id` across different datasets. Keeping this attribute could lead to ambiguity in the data.
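
A check for shared identifiers could look like the sketch below. It assumes the `id` column is still present (so it would have to run before the column is dropped), and the `shared_ids` helper and toy frames are hypothetical, introduced only for illustration.

```python
import pandas as pd

def shared_ids(df_a, df_b, key="id"):
    """Return the sorted identifiers that occur in both DataFrames."""
    return sorted(set(df_a[key]) & set(df_b[key]))

# toy frames, invented for illustration only
a = pd.DataFrame({"id": [1, 2, 3], "text": ["x", "y", "z"]})
b = pd.DataFrame({"id": [3, 4], "text": ["p", "q"]})

overlap = shared_ids(a, b)
```

A non-empty result means the identifier alone cannot distinguish entries once the datasets are concatenated.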

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import spacy
import os

from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

In [None]:
# read Reddit-sourced data 
reddit_uk_train = pd.read_json("data/reddit-uk-train.jsonl", lines=True).drop("id", axis=1)
reddit_in_train = pd.read_json("data/reddit-in-train.jsonl", lines=True).drop("id", axis=1)
reddit_au_train = pd.read_json("data/reddit-au-train.jsonl", lines=True).drop("id", axis=1)
reddit_uk_valid = pd.read_json("data/reddit-uk-valid.jsonl", lines=True).drop("id", axis=1)
reddit_in_valid = pd.read_json("data/reddit-in-valid.jsonl", lines=True).drop("id", axis=1)
reddit_au_valid = pd.read_json("data/reddit-au-valid.jsonl", lines=True).drop("id", axis=1)

# read Google-sourced data 
google_uk_train = pd.read_json("data/google-uk-train.jsonl", lines=True).drop("id", axis=1)
google_in_train = pd.read_json("data/google-in-train.jsonl", lines=True).drop("id", axis=1)
google_au_train = pd.read_json("data/google-au-train.jsonl", lines=True).drop("id", axis=1)
google_uk_valid = pd.read_json("data/google-uk-valid.jsonl", lines=True).drop("id", axis=1)
google_in_valid = pd.read_json("data/google-in-valid.jsonl", lines=True).drop("id", axis=1)
google_au_valid = pd.read_json("data/google-au-valid.jsonl", lines=True).drop("id", axis=1)

# merge Reddit-sourced data by country
reddit_uk_union = pd.concat([reddit_uk_train, reddit_uk_valid], ignore_index=True)
reddit_au_union = pd.concat([reddit_au_train, reddit_au_valid], ignore_index=True)
reddit_in_union = pd.concat([reddit_in_train, reddit_in_valid], ignore_index=True)

# merge Google-sourced data by country
google_uk_union = pd.concat([google_uk_train, google_uk_valid], ignore_index=True)
google_au_union = pd.concat([google_au_train, google_au_valid], ignore_index=True)
google_in_union = pd.concat([google_in_train, google_in_valid], ignore_index=True)

# merge data by country
uk_union = pd.concat([reddit_uk_union, google_uk_union], ignore_index=True)
au_union = pd.concat([reddit_au_union, google_au_union], ignore_index=True)
in_union = pd.concat([reddit_in_union, google_in_union], ignore_index=True)

# merge data by source
reddit_union = pd.concat([reddit_uk_union, reddit_au_union, reddit_in_union], ignore_index=True)
google_union = pd.concat([google_uk_union, google_au_union, google_in_union], ignore_index=True)

# merge all data
global_union = pd.concat([reddit_union, google_union], ignore_index=True)

### 2.2 - Class Distribution by Source

We begin by comparing the number of entries from Reddit and Google. Both sources contain approximately the same number of entries.

Next, we analyze the distribution of the sentiment class in each source. Reddit data is predominantly negative, while Google data is mostly positive.
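
When comparing class balance between sources, proportions are often easier to read than raw counts. The following is a minimal sketch using `value_counts(normalize=True)`. The label values are invented toy data and do not reflect the real counts.

```python
import pandas as pd

# toy labels, invented for illustration; not the real distribution
reddit_labels = pd.Series([0, 0, 0, 1], name="sentiment_label")
google_labels = pd.Series([1, 1, 1, 0], name="sentiment_label")

# share of each class per source (each column sums to 1)
proportions = pd.DataFrame({
    "Reddit": reddit_labels.value_counts(normalize=True),
    "Google": google_labels.value_counts(normalize=True),
}).sort_index()
```

Each row of `proportions` is a sentiment label and each column a source, so the imbalance of each source can be read directly.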

In [None]:
print(f"From Reddit there are {len(reddit_union)} entries")
print(f"From Google there are {len(google_union)} entries")

In [None]:
# count distribution of the target class by source
reddit_counts = reddit_union["sentiment_label"].value_counts().reset_index()
google_counts = google_union["sentiment_label"].value_counts().reset_index()

# create dataframe
reddit_counts.columns = ["sentiment_label", "Reddit"]
google_counts.columns = ["sentiment_label", "Google"]
df = pd.merge(reddit_counts, google_counts, on="sentiment_label", how="outer")
df_melted = df.melt(id_vars="sentiment_label", var_name="Source", value_name="Count")

# create graph
plt.figure(figsize=(4,3))
sns.barplot(data=df_melted, x="sentiment_label", y="Count", hue="Source", palette=["#1f77b4", "#ff7f0e"])
plt.xlabel("Sentiment Label")
plt.ylabel("Count")
plt.title("Comparison of Sentiment Label Distribution: Reddit vs Google")
plt.xticks(ticks=[0, 1], labels=["Negative (0)", "Positive (1)"])
plt.legend(title="Source")
plt.show()

### 2.3 - Class Distribution by Country

Similarly to the source analysis, we also examined the distribution of the data across countries, as well as the balance of the target class distribution.

As shown below, the datasets have roughly the same number of entries, and the target class is approximately evenly distributed.

In [None]:
print(f"From UK there are {len(uk_union)} entries")
print(f"From AU there are {len(au_union)} entries")
print(f"From IN there are {len(in_union)} entries")

In [None]:
# count distribution of the target class by country
uk_counts = uk_union["sentiment_label"].value_counts().reset_index()
au_counts = au_union["sentiment_label"].value_counts().reset_index()
in_counts = in_union["sentiment_label"].value_counts().reset_index()

# create dataframe
uk_counts.columns = ["sentiment_label", "UK"]
au_counts.columns = ["sentiment_label", "AU"]
in_counts.columns = ["sentiment_label", "IN"]
df_counts = pd.merge(uk_counts, au_counts, on="sentiment_label", how="outer")
df_counts = pd.merge(df_counts, in_counts, on="sentiment_label", how="outer")
df_melted = df_counts.melt(id_vars="sentiment_label", var_name="Country", value_name="Count")

# create graph
plt.figure(figsize=(6,4))
sns.barplot(data=df_melted, x="sentiment_label", y="Count", hue="Country", palette=["#1f77b4", "#ff7f0e", "#2f8c1f"])
plt.xlabel("Sentiment Label")
plt.ylabel("Count")
plt.title("Comparison of Sentiment Label Distribution: UK vs AU vs IN")
plt.xticks(ticks=[0, 1], labels=["Negative (0)", "Positive (1)"])
plt.legend(title="Country")
plt.show()

### 2.4 - Word Distribution by Source

Next, we analyze the distribution of words by source, focusing on the 10 words with the highest summed TF-IDF score. Although the rankings differ between sources, the two top-10 lists share 3 words. This small overlap may be due to the topics of the posts, which can differ considerably between platforms.
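
To make the ranking criterion concrete, here is a hand-rolled sketch of a summed TF-IDF score. It is a deliberate simplification and not scikit-learn's implementation: it uses raw counts and `idf = ln(N / df)`, with no stop-word removal, smoothing, or normalization. The `summed_tfidf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter

def summed_tfidf(docs):
    """Sum a simple TF-IDF score per term over all documents.

    tf is the raw count of a term in a document and idf = ln(N / df),
    where df is the number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            totals[term] += count * math.log(n_docs / doc_freq[term])
    return totals

# toy documents, invented for illustration only
docs = ["good food good service", "bad service", "good coffee"]
scores = summed_tfidf(docs)
top_term = scores.most_common(1)[0][0]
```

In this simplified form, a term that occurs in every document gets an idf of zero, which is why very common words do not automatically dominate the ranking.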

In [None]:
# function to calculate and plot TF-IDF
def plot_tfidf(dataset, title):
    
    # get text column
    texts = dataset["text"]  
    
    # fit TF-IDF on the 20 most frequent terms (English stop words excluded)
    vectorizer = TfidfVectorizer(stop_words='english', max_features=20)
    tfidf_matrix = vectorizer.fit_transform(texts)
    terms = vectorizer.get_feature_names_out()
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)
    # sum each term's score over all documents and rank the terms
    sum_tfidf = tfidf_df.sum(axis=0)
    sorted_tfidf = sum_tfidf.sort_values(ascending=False)
    
    # plot graph
    plt.figure(figsize=(6, 3))
    sorted_tfidf.head(10).plot(kind='bar')
    plt.title(title)
    plt.ylabel('TF-IDF Score')
    plt.xlabel('Words')
    plt.xticks(rotation=45)
    plt.show()
    
    return sorted_tfidf.head(10).index.tolist()

def plot_wordcolud(dataset):
    wordcloud = WordCloud().generate(" ".join(dataset["text"].dropna().astype(str)))
    plt.figure(figsize=(6, 3))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

In [None]:
reddit_top = plot_tfidf(reddit_union, 'Top 10 TF-IDF Words for Reddit Data')

In [None]:
plot_wordcolud(reddit_union)

In [None]:
google_top = plot_tfidf(google_union, 'Top 10 TF-IDF Words for Google Data')

In [None]:
plot_wordcolud(google_union)

In [None]:
common_top = list(set(reddit_top) & set(google_top))

print("There are " + str(len(common_top)) + " common words.")
print("They are: " + str(common_top))

### 2.5 - Word Distribution by Country

We also analyze the distribution of words by country. The vocabulary does not seem to vary significantly across countries: any two datasets share 7 to 8 of their top 10 words.
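
The pairwise comparison can be generalized to any number of groups with `itertools.combinations`. This is only a sketch: the `pairwise_overlap` helper and the word lists are hypothetical and do not reproduce the actual top-10 lists.

```python
from itertools import combinations

def pairwise_overlap(top_words):
    """Map each pair of group names to the sorted words their top lists share."""
    return {
        (a, b): sorted(set(top_words[a]) & set(top_words[b]))
        for a, b in combinations(top_words, 2)
    }

# invented word lists, for illustration only
top_words = {
    "UK": ["good", "place", "time"],
    "AU": ["good", "time", "food"],
    "IN": ["good", "food", "people"],
}
overlaps = pairwise_overlap(top_words)
```

Adding a fourth group would then require only one more dictionary entry rather than three more hand-written intersections.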

In [None]:
uk_top = plot_tfidf(uk_union, 'Top 10 TF-IDF Words for UK Data')

In [None]:
plot_wordcolud(uk_union)

In [None]:
au_top = plot_tfidf(au_union, 'Top 10 TF-IDF Words for AU Data')

In [None]:
plot_wordcolud(au_union)

In [None]:
in_top = plot_tfidf(in_union, 'Top 10 TF-IDF Words for IN Data')

In [None]:
plot_wordcolud(in_union)

In [None]:
common_uk_au_top = list(set(uk_top) & set(au_top))
common_au_in_top = list(set(au_top) & set(in_top))
common_in_uk_top = list(set(in_top) & set(uk_top))

print("UK and AU")
print("There are " + str(len(common_uk_au_top)) + " common words.")
print("They are: " + str(common_uk_au_top))

print("\nAU and IN")
print("There are " + str(len(common_au_in_top)) + " common words.")
print("They are: " + str(common_au_in_top))

print("\nIN and UK")
print("There are " + str(len(common_in_uk_top)) + " common words.")
print("They are: " + str(common_in_uk_top))

### 2.6 - Word Distribution by Class

We also analyze the word distribution by class, separating the publications by sentiment, positive or negative.

In positive publications, words like "good," "nice," and "great" stand out, which aligns with their typical use in expressing positive sentiments.

Conversely, in negative publications, these words are noticeably less common. While "good" still appears frequently, it is not as prominent as in positive posts. Additionally, the word "n't" is highly recurrent, highlighting its role in negation and negative sentiment formation.
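
Word clouds give a qualitative picture. A numeric counterpart could be sketched with `collections.Counter`, as below. The `top_words_by_label` helper, the whitespace tokenization, and the toy texts are assumptions made for illustration and are not the tokenization used elsewhere in this notebook.

```python
from collections import Counter

def top_words_by_label(texts, labels, n=3):
    """Return the n most frequent whitespace-separated tokens for each label."""
    counters = {}
    for text, label in zip(texts, labels):
        counters.setdefault(label, Counter()).update(text.lower().split())
    return {label: [w for w, _ in c.most_common(n)] for label, c in counters.items()}

# toy texts and labels, invented for illustration only
texts = ["good good place", "not good", "not nice not fun"]
labels = [1, 0, 0]
top = top_words_by_label(texts, labels, n=1)
```

Comparing the per-label lists side by side makes it easier to quantify observations such as negation words being more frequent in negative posts.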

In [None]:
plot_wordcolud(global_union[global_union["sentiment_label"] == 0])

In [None]:
plot_wordcolud(global_union[global_union["sentiment_label"] == 1])

## 3 - Data Pre-Processing

In [None]:
tokenizer = nltk.word_tokenize
stemmer = nltk.PorterStemmer()
stop_words = set(nltk.corpus.stopwords.words('english'))

# keep negation words (and the "t" left over from contractions) by removing them from the stop-word list
stop_words_remove = {"no", "not", "nor", "t"}
stop_words.difference_update(stop_words_remove)

def text_clean(dataset):
    # replace every non-letter character with a space (this also turns "didn't" into "didn t")
    dataset["text"] = dataset["text"].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x).strip())
    # convert to lowercase
    dataset["text"] = dataset["text"].apply(str.lower)
    # collapse multiple spaces into one
    dataset["text"] = dataset["text"].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
    

def text_parser(dataset):
    # tokenize with NLTK and drop stop words before stemming, so that
    # stemmed forms (e.g. "thi" for "this") do not slip past the stop-word list
    dataset['text_tokst'] = dataset['text'].apply(tokenizer)
    dataset['text_tokst'] = dataset['text_tokst'].apply(lambda x: [stemmer.stem(w) for w in x if w not in stop_words])
    # lemmatize with spaCy; apostrophes were already removed by text_clean,
    # so contractions such as "didn't" arrive here already split
    dataset['text_spacy'] = dataset['text'].apply(lambda x: [w.lemma_ for w in nlp(x)])
    # remove stop words from the lemmas
    dataset['text_spacy'] = dataset['text_spacy'].apply(lambda x: [word for word in x if word not in stop_words])


text_clean(uk_union)
text_clean(au_union)
text_clean(in_union)

text_parser(uk_union)
text_parser(au_union)
text_parser(in_union)

print(uk_union['text_tokst'].iloc[0])
print(uk_union['text_spacy'].iloc[0])

print(len(uk_union['text_tokst'].iloc[0]))
print(len(uk_union['text_spacy'].iloc[0]))

In [None]:
folder_path = "data_prepared"

# create the folder if it does not exist
os.makedirs(folder_path, exist_ok=True)

# save the DataFrames as CSV (or another format)
uk_union.to_csv(os.path.join(folder_path, "uk_union.csv"), index=False)
au_union.to_csv(os.path.join(folder_path, "au_union.csv"), index=False)
in_union.to_csv(os.path.join(folder_path, "in_union.csv"), index=False)
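
One thing to keep in mind when reloading these files: columns that hold Python lists, such as `text_tokst` and `text_spacy`, are written to CSV as their string representation. The sketch below shows one possible way to restore them with `ast.literal_eval`. It uses an in-memory buffer and toy data rather than the real files, so it is only an illustration.

```python
import ast
import io

import pandas as pd

# toy frame with a list-valued column, invented for illustration
df = pd.DataFrame({"text_tokst": [["good", "place"], ["not", "good"]]})

# write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# without a converter, the column would come back as strings like "['good', 'place']"
restored = pd.read_csv(buffer, converters={"text_tokst": ast.literal_eval})
```

A format that preserves nested types, such as JSON Lines or pickle, would avoid the conversion step altogether.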