# Project 1 - Text Classification Task
Developed by Group 05:
- Emanuel Maia - up202107486
- Rita Leite - up202105309
- Tiago Azevedo - up202108699

## 0 - Initial Set Up

This project is a continuation of the first one, but now we are exploring the use of Hugging Face Transformers.

That said, the theme of the project and the structure of our data remain the same. We have one text attribute that corresponds to a comment and another attribute that indicates the sentiment of that comment, which can be either positive or negative.

The data comes from two distinct sources, Reddit and Google, and spans three different countries: Australia, the United Kingdom, and India. Since some of the comments from India were written in languages other than English, we attempted to translate them into English.

We start by importing the necessary libraries and defining some utility functions, such as one that translates all comments into English to keep the dataset consistent.

In [7]:
import pandas as pd
import nltk
import re
import os
import contractions
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from googletrans import Translator
from langdetect import detect

translator = Translator()

def translate_text(text):
    """Return `text` in English, trying Hindi, Urdu and Bengali as source languages."""
    if detect(text) == "en":
        return text
    for src in ("hi", "ur", "bn"):
        try:
            return translator.translate(text, src=src, dest="en").text
        except Exception:
            continue  # this source language failed, try the next one
    # every attempt failed: log the comment and keep it untranslated
    print(f"{text}")
    return text

Next, we load all the datasets, creating four separate ones: one containing data from Australia, another from the United Kingdom, one from India, and finally a combined dataset that includes all the data.
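The file names read in the next cell follow a regular `source-country-split` pattern. As a minimal sketch of that naming scheme (the helper `dataset_paths` and its constants are hypothetical and not part of our notebook), the twelve paths can be generated and grouped by country like this:

```python
from itertools import product

SOURCES = ("reddit", "google")
COUNTRIES = ("uk", "au", "in")
SPLITS = ("train", "valid")

def dataset_paths(folder="data"):
    """Map each country code to its four JSONL files (hypothetical helper)."""
    return {
        country: [
            f"{folder}/{source}-{country}-{split}.jsonl"
            for source, split in product(SOURCES, SPLITS)
        ]
        for country in COUNTRIES
    }

paths = dataset_paths()
print(paths["uk"])
```

Each path could then be passed to `pd.read_json(path, lines=True)` and the four frames of a country concatenated, which is what the explicit version below does line by line.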

In [8]:
# read Reddit-sourced data 
reddit_uk_train = pd.read_json("data/reddit-uk-train.jsonl", lines=True).drop("id", axis=1)
reddit_in_train = pd.read_json("data/reddit-in-train.jsonl", lines=True).drop("id", axis=1)
reddit_au_train = pd.read_json("data/reddit-au-train.jsonl", lines=True).drop("id", axis=1)
reddit_uk_valid = pd.read_json("data/reddit-uk-valid.jsonl", lines=True).drop("id", axis=1)
reddit_in_valid = pd.read_json("data/reddit-in-valid.jsonl", lines=True).drop("id", axis=1)
reddit_au_valid = pd.read_json("data/reddit-au-valid.jsonl", lines=True).drop("id", axis=1)

# read Google-sourced data 
google_uk_train = pd.read_json("data/google-uk-train.jsonl", lines=True).drop("id", axis=1)
google_in_train = pd.read_json("data/google-in-train.jsonl", lines=True).drop("id", axis=1)
google_au_train = pd.read_json("data/google-au-train.jsonl", lines=True).drop("id", axis=1)
google_uk_valid = pd.read_json("data/google-uk-valid.jsonl", lines=True).drop("id", axis=1)
google_in_valid = pd.read_json("data/google-in-valid.jsonl", lines=True).drop("id", axis=1)
google_au_valid = pd.read_json("data/google-au-valid.jsonl", lines=True).drop("id", axis=1)

# merge and translate data by country
uk_union = pd.concat([reddit_uk_train, reddit_uk_valid, google_uk_train, google_uk_valid], ignore_index=True)
au_union = pd.concat([reddit_au_train, reddit_au_valid, google_au_train, google_au_valid], ignore_index=True)
in_union = pd.concat([reddit_in_train, reddit_in_valid, google_in_train, google_in_valid], ignore_index=True)
in_union["text"] = in_union["text"].apply(translate_text)

# merge all data
global_union = pd.concat([uk_union, au_union, in_union]).reset_index(drop=True)

In this step, we first expand contractions, then remove all characters that are not alphabetic or whitespace, convert the text to lowercase, and collapse consecutive spaces. After cleaning the text, we tokenize it and remove the words in the stopword list, except for negation words ("no", "not", "nor", "t"), as they are crucial for our classification task. Finally, we lemmatize the remaining tokens.
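To make the cleaning steps concrete, here is a minimal sketch on a single made-up comment. It uses only the standard library: the tiny `STOP_WORDS` set and the whitespace split are assumptions standing in for `NLTK`'s English stopword list and `word_tokenize`, and contraction expansion and lemmatization are left out.

```python
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "at", "all", "a", "and", "it", "was", "no", "not", "nor", "t"}
# Negation words are taken out of the stop-word list so they survive filtering.
NEGATIONS = {"no", "not", "nor", "t"}
EFFECTIVE_STOP_WORDS = STOP_WORDS - NEGATIONS

def clean(text):
    """Keep only letters and spaces, lowercase, and collapse repeated whitespace."""
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def remove_stop_words(text):
    """Split on whitespace and drop stop words, keeping negations."""
    return [word for word in text.split() if word not in EFFECTIVE_STOP_WORDS]

cleaned = clean("The food is NOT good at all!!  10/10")
tokens = remove_stop_words(cleaned)
print(cleaned)  # the food is not good at all
print(tokens)   # ['food', 'not', 'good']
```

Keeping "not" matters here: without it, the comment would reduce to "food good" and read as positive.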

We initially experimented with stemming as an alternative to lemmatization but ultimately chose the latter. At first, our lemmatization approach did not account for a word’s context, leading to incorrect results. Upon further investigation, we discovered the importance of specifying the `pos` parameter of the `lemmatize` function, which indicates whether a word is a verb, noun, adjective, or adverb. To lemmatize each word correctly, we used the `pos_tag` function from `NLTK` to assign the appropriate part of speech before lemmatization.
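The link between the tags produced by `pos_tag` (Penn Treebank tags such as `VBG` or `JJ`) and the `pos` values accepted by `lemmatize` can be sketched as a small mapping. The helper name `wordnet_pos` is hypothetical; the fallback to `"n"` mirrors the default `pos` of `WordNetLemmatizer.lemmatize`.

```python
def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (hypothetical helper)."""
    if penn_tag.startswith("N"):
        return "n"  # noun
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("J"):
        return "a"  # adjective
    if penn_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # anything else falls back to the lemmatizer's default

for tag in ("NNS", "VBG", "JJR", "RB", "DT"):
    print(tag, "->", wordnet_pos(tag))
```

This is why the `pos` argument matters: with the default noun setting, `lemmatize("running")` leaves the word unchanged, whereas `lemmatize("running", pos="v")` returns "run".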

In [9]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words_remove = {"no", "not", "nor", "t"}
stop_words.difference_update(stop_words_remove)

lemma = nltk.WordNetLemmatizer()
token = nltk.word_tokenize

def lemmatize_with_pos(text):
    words = token(text)   
    words_tag = nltk.pos_tag(words)   
    words_lem = []
    for word, tag in words_tag:
        if tag.startswith('N'): words_lem.append(lemma.lemmatize(word, pos='n')) # noun
        elif tag.startswith('V'): words_lem.append(lemma.lemmatize(word, pos='v')) # verb
        elif tag.startswith('J'): words_lem.append(lemma.lemmatize(word, pos='a')) # adjective
        elif tag.startswith('R'): words_lem.append(lemma.lemmatize(word, pos='r')) # adverb
        else: words_lem.append(lemma.lemmatize(word))
    return words_lem

def text_pre_processing(dataset):
    # expand contractions (e.g. "don't" -> "do not")
    dataset['text_vader'] = dataset['text'].apply(contractions.fix)
    # replace every character that is not an ASCII letter or a space with a space
    dataset['text_vader'] = dataset["text_vader"].apply(lambda x: re.sub(r'[^a-zA-Z ]', ' ', x).strip())
    # convert all characters to lowercase
    dataset["text_vader"] = dataset["text_vader"].apply(str.lower)
    # collapse multiple whitespace characters into a single space
    dataset["text_vader"] = dataset["text_vader"].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
    # apply tokenization
    dataset['text_processed'] = dataset['text_vader'].apply(token)
    # remove stopwords, keeping the negation words 'no', 'not', 'nor', 't'
    dataset['text_processed'] = dataset['text_processed'].apply(lambda x: [word for word in x if word not in stop_words])
    # apply lemmatization
    dataset['text_processed'] = dataset['text_processed'].apply(lambda x: lemmatize_with_pos(" ".join(x)))
    # join the list of words back into a sentence
    dataset['text_processed'] = [" ".join(text) for text in dataset["text_processed"]]
    return dataset
       
def metrics(y_test, y_pred, time):
    [[tn, fp], [fn, tp]] = (confusion_matrix(y_test, y_pred))    
    print("                    predicted negative   predicted positive")
    print("real negative       " + str(tn) + (" " * (len(str(fn)) - 1)) + "                " + str(fp))
    print("real positive       " + str(fn) + (" " * (len(str(tn)) - 1)) + "                " + str(tp))
    print(f"\nAccuracy {accuracy_score(y_test, y_pred):.2f} Precision {precision_score(y_test, y_pred):.2f} Recall {recall_score(y_test, y_pred):.2f} F1 Score {f1_score(y_test, y_pred):.2f} Time {time:.2f}") 

text_pre_processing(global_union)
text_pre_processing(uk_union)
text_pre_processing(au_union)
text_pre_processing(in_union)



| | text | sentiment_label | text_vader | text_processed |
|---|---|---|---|---|
| 0 | Zepto has a mandate that the delivery boy need... | 1 | zepto has a mandate that the delivery boy need... | zepto mandate delivery boy need click picture ... |
| 1 | Give me a little money too | 0 | give me a little money too | give little money |
| 2 | Nooo don't protest against secular freedom fig... | 0 | nooo do not protest against secular freedom fi... | nooo not protest secular freedom fighter owais... |
| 3 | Har 3 mahine baad kisi bhi global celebrity ko... | 0 | har mahine baad kisi bhi global celebrity ko b... | har mahine baad kisi bhi global celebrity ko b... |
| 4 | Just because you don't find anything serious b... | 0 | just because you do not find anything serious ... | not find anything serious not mean nothing gir... |
| ... | ... | ... | ... | ... |
| 3783 | It was ok. Chef need to bring taste in food. J... | 0 | it was ok chef need to bring taste in food jus... | ok chef need bring taste food ok type restaura... |
| 3784 | Food is best for middle class people here. All... | 1 | food is best for middle class people here all ... | food best middle class people item give quanti... |
| 3785 | I think cinema hall is better and full air con... | 1 | i think cinema hall is better and full air con... | think cinema hall well full air condition soun... |
| 3786 | The cafe looks good and we can celebrate birth... | 1 | the cafe looks good and we can celebrate birth... | cafe look good celebrate birthday food taste a... |


After making these modifications, we checked for words that were not recognized in `NLTK`'s vocabulary, which likely indicated spelling errors. We attempted to use libraries like `TextBlob` to correct these mistakes but ultimately decided against it due to the high processing time.

We then deleted, from the combined dataset, the entries that had an empty `text_processed` attribute. This can happen after pre-processing, for example when the original text consisted only of non-alphabetic characters.
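As a small illustration of how an entry ends up empty, the sketch below applies the same kind of letters-only cleaning to three made-up comments and keeps only those with something left. The function `letters_only` and the sample comments are assumptions for illustration.

```python
import re

def letters_only(text):
    """Keep letters and single spaces only, in lowercase (illustrative helper)."""
    stripped = re.sub(r"[^a-zA-Z ]", " ", text).lower()
    return re.sub(r"\s+", " ", stripped).strip()

comments = ["Lovely place, great staff!", "10/10 !!!", "???"]
processed = [letters_only(comment) for comment in comments]

# Keep only the comments whose processed text is non-empty.
kept = [comment for comment, text in zip(comments, processed) if len(text) > 0]
print(processed)
print(kept)
```

Comments made only of digits, punctuation or emoji carry no tokens after cleaning, so they are dropped.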

Once these modifications are made, we save the datasets as CSV files in the `data_prepared` folder of our repository.
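Before saving, the code below replaces real line breaks inside a comment with the two characters `\n`, so that each entry stays on one physical line of the CSV file. A minimal standard-library sketch of that idea follows; the `unescape_newlines` helper and the sample row are assumptions, not part of our notebook.

```python
import csv
import io

def escape_newlines(value):
    """Replace real line breaks with the two-character sequence backslash + n."""
    return value.replace("\n", "\\n") if isinstance(value, str) else value

def unescape_newlines(value):
    """Inverse of escape_newlines (hypothetical helper for loading the data back)."""
    return value.replace("\\n", "\n") if isinstance(value, str) else value

row = ["Great food.\nWould come back", 1]

buffer = io.StringIO()
csv.writer(buffer).writerow([escape_newlines(value) for value in row])
line = buffer.getvalue()

# Reading the line back yields strings; the text field still holds the escaped form.
restored = next(csv.reader(io.StringIO(line)))
print(repr(line))
print(repr(unescape_newlines(restored[0])))
```

Note that the reversal is only exact when the original text does not already contain a literal backslash followed by `n`.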

In [10]:
nltk.download('words')

valid_words = set(nltk.corpus.words.words())
invalid_words = set()
invalid_entries = set()

corpus = global_union["text_processed"].dropna().astype(str).tolist()
for idx, comment in enumerate(corpus):
    words = comment.split()
    for word in words:
        if word not in valid_words:
            invalid_words.add(word)
            invalid_entries.add(idx)

print("There are " + str(len(invalid_entries)) + " entries with invalid words")
print("The words are " + str(invalid_words))

print("\nRemoving empty entries:")
print("- Before we had " + str(len(global_union)) + " entries.")
global_union = global_union[global_union["text_processed"].str.len() > 0]
print("- Now we have " + str(len(global_union)) + " entries.")

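def escape_newlines(df):
    # element-wise escape via Series.map on each column (DataFrame.applymap is deprecated in recent pandas)
    return df.apply(lambda col: col.map(lambda x: x.replace("\n", "\\n") if isinstance(x, str) else x))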

folder_path = "data_prepared"
os.makedirs(folder_path, exist_ok=True)
escape_newlines(global_union).to_csv(os.path.join(folder_path, "global_union.csv"), index=False, encoding="utf-8")
escape_newlines(uk_union).to_csv(os.path.join(folder_path, "uk_union.csv"), index=False, encoding="utf-8")
escape_newlines(au_union).to_csv(os.path.join(folder_path, "au_union.csv"), index=False, encoding="utf-8")
escape_newlines(in_union).to_csv(os.path.join(folder_path, "in_union.csv"), index=False, encoding="utf-8")



There are 7558 entries with invalid words
The words are {'dhruv', 'pattices', 'sneakily', 'hadd', 'mann', 'awared', 'atc', 'hugeee', 'purulia', 'aqi', 'darbar', 'speaks', 'playlist', 'westfields', 'committ', 'fowzzu', 'reinvested', 'frenchisey', 'tatse', 'stairwell', 'idgaf', 'abar', 'strathmore', 'wexler', 'bhik', 'batra', 'pressed', 'pansare', 'nimbys', 'barcode', 'boudi', 'sweetcorn', 'rsus', 'bich', 'bidisha', 'yemeni', 'barrrow', 'andre', 'lmao', 'jailcummings', 'eb', 'brakefast', 'heinz', 'maine', 'africa', 'dipshit', 'coonawarra', 'flustered', 'sadi', 'sunglasses', 'scamees', 'cybersecurity', 'hiding', 'askanaustralian', 'murtoa', 'laps', 'jeypore', 'criminalise', 'sympathiser', 'licensing', 'idc', 'kokilaben', 'lobbyists', 'topgolf', 'hillingdon', 'fridges', 'focaccia', 'aarti', 'whyalla', 'sooraj', 'purified', 'narasaraopet', 'ambattur', 'delishers', 'apetite', 'malteser', 'prevailing', 'saadi', 'badam', 'bafala', 'mantain', 'adversarial', 'cleared', 'bekaar', 'acclimatise', '

