# Natural Language Processing First Assignment
#### This is the notebook for the first assignment, which uses the **"Polite Guard"** dataset. The objective of this work is to build a pipeline that produces a robust text classification model.
#### The following dependencies are needed:
`
pip install datasets pandas nltk scikit-learn wordcloud matplotlib gensim
`

## **Importing the dataset**
##### The first step is to import the dataset directly from the Hugging Face `datasets` library. The original dataset is already split into training, validation and test sets.


In [1]:
import pandas as pd
from datasets import load_dataset

# Download (or load from the local cache) all splits of the dataset
dataset = load_dataset("Intel/polite-guard")
print(dataset)

# Convert each split to a pandas DataFrame
training_set = pd.DataFrame.from_dict(dataset['train'])
test_set = pd.DataFrame.from_dict(dataset['test'])
validation_set = pd.DataFrame.from_dict(dataset['validation'])

print(training_set.head())
print(test_set.head())
print(validation_set.head())





DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'source', 'reasoning'],
        num_rows: 80000
    })
    validation: Dataset({
        features: ['text', 'label', 'source', 'reasoning'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['text', 'label', 'source', 'reasoning'],
        num_rows: 10200
    })
})
                                                text            label  \
0  Your flight has been rescheduled for 10:00 AM ...          neutral   
1  We're happy to accommodate your dietary prefer...           polite   
2  Our vegetarian options are available on the me...          neutral   
3  I understand your frustration with the recent ...  somewhat polite   
4  I'll do my best to find a suitable replacement...  somewhat polite   

                                  source  \
0  meta-llama/Meta-Llama-3.1-8B-Instruct   
1  meta-llama/Meta-Llama-3.1-8B-Instruct   
2  meta-llama/Meta-Llama-3.1-8B-Instruct   
3  meta-llama/Meta-Llama-3.1

##### The texts cover different degrees of politeness, and the label to predict is how polite each text is. **There are 4 classes: "impolite", "neutral", "somewhat polite" and "polite"**
##### The dataset has four features:
- ##### **text** - the actual text;
- ##### **label** - class label identifying the politeness of the text;
- ##### **reasoning** - why the text was given that label;
- ##### **source** - source of the text (for example, the language model that generated it);
##### Most of the dataset is **synthetic, generated by prompting with different techniques**; only 200 samples are annotated real-life examples from corporate training.
##### The samples break down as follows:
- ##### 50,000 samples generated using Few-Shot prompting
- ##### 50,000 samples generated using Chain-of-Thought (CoT) prompting
- ##### 200 annotated samples from corporate training;
##### The train/validation/test split, as shown in the output above, is:
- ##### **80,000 rows for training**;
- ##### **10,000 rows for validation**;
- ##### **10,200 rows for testing** (the extra 200 rows match the number of annotated samples);
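
##### As a quick sanity check, the split sizes printed by `load_dataset` can be compared with the sample counts listed above. This is only a sketch: the numbers are copied by hand from this notebook, and the dictionary names are made up for the illustration.

```python
# Split sizes copied from the DatasetDict printed above
split_sizes = {"train": 80_000, "validation": 10_000, "test": 10_200}

# Sample counts by origin, as listed in the dataset description
origin_sizes = {"few_shot": 50_000, "chain_of_thought": 50_000, "annotated": 200}

total_rows = sum(split_sizes.values())

# Fraction of the whole dataset that falls in each split
split_fractions = {name: size / total_rows for name, size in split_sizes.items()}

print(total_rows, sum(origin_sizes.values()))
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.1%}")
```

##### Both totals agree, so the three splits account for every generated and annotated sample.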

## **Extracting text corpus**
##### We have to extract the text from the documents in the dataset so that we can build different representations from it.
##### Note that this is the unclean version of the corpus.

In [2]:
# Raw (unclean) training texts as a plain Python list
unclean_corpus = training_set["text"].tolist()
print(unclean_corpus[0:5])

["Your flight has been rescheduled for 10:00 AM tomorrow. Please check the airport's website for any updates or changes.", "We're happy to accommodate your dietary preferences. Our vegetarian options are carefully crafted to ensure a delicious and satisfying meal. Would you like me to recommend some dishes that fit your needs?", 'Our vegetarian options are available on the menu, and our chef can modify any dish to suit your dietary needs.', "I understand your frustration with the recent tournament results, and I'll review the standings to see what we can do to improve your experience.", "I'll do my best to find a suitable replacement for the item you're looking for, but I need to know more about what you're looking for."]


## **Cleaning the text corpus**
##### Now we need to process the unclean text corpus, by performing actions such as:
- ##### Removing punctuation;
- ##### Lower case folding;
- ##### Stemming (using PorterStemmer);
- ##### Removing Stop Words (optional);
##### For that purpose we will use Python's [regular expression](https://docs.python.org/3/library/re.html) library and [nltk](https://www.nltk.org/api/nltk.html)

In [3]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

ps = PorterStemmer()
# A set makes the stop-word membership test fast
sw = set(stopwords.words('english'))
clean_corpus = []
for raw_text in unclean_corpus:
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', raw_text)
    text = text.lower()

    # Drop stop words, then stem the remaining tokens
    text = [ps.stem(word) for word in text.split() if word not in sw]
    text = ' '.join(text)
    clean_corpus.append(text)
print(clean_corpus[0:5])

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/not_real_fu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['flight reschedul tomorrow pleas check airport websit updat chang', 'happi accommod dietari prefer vegetarian option care craft ensur delici satisfi meal would like recommend dish fit need', 'vegetarian option avail menu chef modifi dish suit dietari need', 'understand frustrat recent tournament result review stand see improv experi', 'best find suitabl replac item look need know look']


## **Updating the DataFrame with the clean corpus for comparison**

In [4]:
training_set["clean_text"] = clean_corpus
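
##### The same cleaning has to be applied to any other split (or any new text) before it reaches a model trained on the clean corpus. A minimal sketch of a reusable helper follows. `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set only stands in for `stopwords.words('english')` so that the example runs without downloading anything.

```python
import re

from nltk.stem.porter import PorterStemmer

# Hypothetical, deliberately tiny stop-word set for illustration only;
# in the notebook this would be set(stopwords.words('english')).
DEMO_STOP_WORDS = {"your", "has", "been", "for", "am", "the", "any", "or"}

_stemmer = PorterStemmer()


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Keep letters only, lower-case, drop stop words and stem each token."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text).lower()
    tokens = [_stemmer.stem(w) for w in letters_only.split() if w not in stop_words]
    return ' '.join(tokens)


print(clean_text("Your flight has been rescheduled for 10:00 AM tomorrow."))
```

##### With pandas, such a helper could be applied to another split with something like `validation_set["text"].apply(clean_text)`, passing the full stop-word set instead of the demo one.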

## **Representing the corpus in different models**
##### Now that we have cleaned the documents in our dataset, it's time to build different textual representations of them so we can use them to train our model.
#### **You might want to comment out the code blocks in the sparse representation models section: because of the sparsity of these vectors and the large dataset size, they can be very inefficient.**
#### **Some computers might not meet the hardware requirements to store these sparse vectors as dense arrays. Comment the blocks out if you think your computer can't handle a feature space of just under 9,000 columns for a dataset of 80k rows.**
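
##### To get a feeling for the memory involved, here is a back-of-the-envelope estimate. It uses the shape reported for the cleaned Bag of Words matrix in this notebook and assumes 8 bytes per value. The figure of about 15 distinct terms per document is a guess made for the illustration, not a measurement.

```python
n_rows = 80_000
n_columns = 4_981          # vocabulary size of the cleaned corpus
bytes_per_value = 8        # assumption: 64-bit counts or weights

# Memory needed once the matrix is materialised as a dense array
dense_bytes = n_rows * n_columns * bytes_per_value
print(f"dense: {dense_bytes / 1e9:.2f} GB")

# Rough CSR estimate, assuming about 15 distinct terms per document:
# one 8-byte value and one 4-byte column index per stored entry,
# plus one 4-byte row pointer per row (and one extra).
assumed_terms_per_doc = 15
stored_entries = n_rows * assumed_terms_per_doc
sparse_bytes = stored_entries * (bytes_per_value + 4) + (n_rows + 1) * 4
print(f"sparse (CSR, estimated): {sparse_bytes / 1e6:.1f} MB")
```

##### Under these assumptions the dense array needs a few gigabytes, while the sparse matrix returned by `fit_transform` needs a few megabytes, so skipping `.toarray()` is one way to keep these cells runnable on weaker machines.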

## **Sparse representation models**
#### We work mainly with 3 sparse representation models:
- #### **Bag of Words**;
- #### **1-Hot vector**;
- #### **Tf-Idf vector**;

### First representation model is **Bag of Words**

In [5]:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Print arrays in full instead of truncating them
np.set_printoptions(threshold=np.inf)

cv = CountVectorizer()

# Note: .toarray() materialises the whole matrix as a dense array
BoW = cv.fit_transform(clean_corpus).toarray()
print("Cleaned text", BoW.shape)


#print(cv.get_feature_names_out())

#print(BoW[0])

Cleaned text (80000, 4981)


##### We also compared the representation of the unclean and clean versions of the corpus.
##### The cleaned text has a much smaller feature space (the number of columns, i.e. `y` in the shape pair `(x, y)`): 4,981 columns instead of 8,532 for the unclean text. `CountVectorizer` already lower-cases the text and ignores punctuation by default, so the reduction comes mainly from reducing words to their stem and from removing stop words and digits.
##### A smaller vocabulary removes redundant features, which reduces memory use and can help the model generalize.


In [None]:

'''cv_unclean = CountVectorizer();
Bow_unclean = cv_unclean.fit_transform(unclean_corpus).toarray();
print("Unclean text",Bow_unclean.shape);
print("Cleaned text", BoW.shape);
'''



#### We can also plot a **word cloud** for each version of the corpus (it counts words directly from the text rather than using the vectorizer), respectively for:
- #### Unclean text;
- #### Clean text;

In [7]:
import wordcloud;
import matplotlib.pyplot as plt;

'''
wordcloud_unclean = wordcloud.WordCloud(width = 800, height = 800, background_color = 'white', stopwords = sw, min_font_size = 10).generate(" ".join(unclean_corpus));
plt.figure(figsize = (8, 8), facecolor = None);
plt.imshow(wordcloud_unclean);
plt.axis("off");
plt.tight_layout(pad = 0);
print("Word Cloud for unclean corpus");
plt.show();

wordcloud_clean = wordcloud.WordCloud(width = 800, height = 800, background_color = 'white', stopwords = sw, min_font_size = 10).generate(" ".join(clean_corpus));
plt.figure(figsize = (8, 8), facecolor = None);
plt.imshow(wordcloud_clean);
plt.axis("off");
plt.tight_layout(pad = 0);
print("Word Cloud for clean corpus");
plt.show();'''




#### One noticeable difference is that some of the most common words in the unclean corpus, like "help", "question" or "appreciate", are less prominent in the clean one. Words that were scattered across different forms in the unclean text are reduced to a common stem, so their counts add up and stems like "provid" and "hear" stand out more.

### Second representation model is **1-Hot vector**
##### This representation is basically Bag of Words without counts: each entry is a binary value indicating whether the word occurs in the document or not.

In [32]:

binary_vectorizer = CountVectorizer(binary=True)
one_hot_clean = binary_vectorizer.fit_transform(clean_corpus)
print(one_hot_clean.shape)

#print(one_hot_clean[0])
# Use the separate vectorizer here so the one fitted on the clean corpus is not overwritten
binary_vectorizer_unclean = CountVectorizer(binary=True)
one_hot_unclean = binary_vectorizer_unclean.fit_transform(unclean_corpus)
print(one_hot_unclean.shape)


(80000, 4981)
(80000, 8532)
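
##### To make the difference between counts and binary indicators concrete, here is a small sketch on two invented documents. The only thing that changes between the two vectorizers is the `binary=True` flag.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["help help please", "please wait"]

counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are ordered alphabetically: help, please, wait
print(counts.tolist())
print(binary.tolist())
```

##### The repeated "help" in the first document is counted twice by Bag of Words but is capped at 1 in the binary representation.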


### Third representation is **TF-IDF**
##### This measure takes into account the discriminative power of each word in the vocabulary (how well the word distinguishes one document from the others) by assigning a weight to each term in each document.
##### **TF** stands for Term Frequency: the number of times a term appears in a given document;
##### **DF** stands for Document Frequency: the number of documents that contain a certain term. The higher it is, the more common the word is across the documents (bad thing -> low discriminative power);
##### **IDF** stands for Inverse Document Frequency, which decreases as DF grows (typically the logarithm of the number of documents divided by DF). A higher value means the word is rarer across the documents (good thing -> high discriminative power).
##### **TF-IDF** is the product of **TF** and **IDF**:
- ##### Highest when the term occurs many times within a small number of documents (A);
- ##### Lower when the term occurs fewer times in a document, or occurs in many documents (B);
- ##### Lowest when the term occurs in virtually all documents (C).
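
##### The three cases can be checked by hand on a toy corpus. The sketch below is plain Python, not the library implementation. It uses the smoothed IDF `ln((1 + N) / (1 + DF)) + 1`, which to our knowledge is the variant `TfidfVectorizer` applies by default, and it leaves out the row normalization that the vectorizer also performs. The documents and function names are made up for the illustration.

```python
import math

# Toy tokenised corpus (hypothetical data)
docs = [
    ["refund", "refund", "please"],
    ["please", "help"],
    ["please", "wait"],
]


def idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + DF)) + 1."""
    n_docs = len(docs)
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tf_idf(term, doc):
    """Raw count of the term in the document times its IDF."""
    return doc.count(term) * idf(term)


print(tf_idf("refund", docs[0]))  # (A) frequent in one document, rare overall
print(tf_idf("help", docs[1]))    # (B) appears once, in a single document
print(tf_idf("please", docs[0]))  # (C) appears in every document
```

##### The weights come out in the order (A) > (B) > (C), matching the list above.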

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF on the unclean corpus
TfIdf_vectorizer_unclean = TfidfVectorizer()
Tf_idf_unclean = TfIdf_vectorizer_unclean.fit_transform(unclean_corpus).toarray()
features_unclean = TfIdf_vectorizer_unclean.get_feature_names_out()
print("TfIdf unclean", Tf_idf_unclean.shape)

# TF-IDF on the clean corpus
TfIdf_vectorizer = TfidfVectorizer()
Tf_Idf = TfIdf_vectorizer.fit_transform(clean_corpus).toarray()
features_clean = TfIdf_vectorizer.get_feature_names_out()
print("TfIdf clean", Tf_Idf.shape)
#print(Tf_Idf[0])


TfIdf unclean (80000, 8532)
TfIdf clean (80000, 4981)


#### We can plot a word cloud to see the distribution of the words for the clean and unclean corpus according to their weight and importance, with Tf-Idf representation:

In [10]:
# dictionary to sum the weights of each unique word for the clean corpus
'''word_count = dict()
for i in range(0,len(clean_corpus)):
    for feature in range(0,len(features_clean)):
        word_count[features_clean[feature]] = word_count.get(features_clean[feature],0) + Tf_Idf[i][feature]

wordcloud_tfidf_clean = wordcloud.WordCloud(width = 800, height = 800, background_color = 'white', min_font_size = 10).generate_from_frequencies(word_count);
plt.figure(figsize = (8, 8), facecolor = None);
plt.imshow(wordcloud_tfidf_clean);
plt.axis("off");
plt.tight_layout(pad = 0);
print("Word Cloud for clean corpus");
plt.show();

#dict for unclean wordcount
word_count_unclean = dict()
for i in range(0,len(unclean_corpus)):
    for feature in range(0,len(features_unclean)):
        word_count_unclean[features_unclean[feature]] = word_count_unclean.get(features_unclean[feature],0) + Tf_idf_unclean[i][feature];

wordcloud_tfidf_unclean = wordcloud.WordCloud(width = 800, height = 800, background_color = 'white', min_font_size = 10).generate_from_frequencies(word_count_unclean);
plt.imshow(wordcloud_tfidf_unclean);
plt.axis("off");
plt.tight_layout(pad = 0);
print("Word Cloud for unclean corpus");
plt.show();'''

    


##### As we predicted, stop words take over the entire word cloud for the unclean dataset. They occur the most and give us the least information about a document, fulfilling condition (C) presented in the block above.
##### As for the clean version of the text corpus, words like "like" and "help" most likely fall under condition (B), since they are repeated often and are common to many documents in the dataset.

#### We opted not to use N-grams because the feature space they generate is too large for our dataset and would take a toll on memory.
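
##### The growth of the feature space is visible even on two invented sentences. The sketch below compares the vocabulary size with unigrams only against unigrams plus bigrams, using the `ngram_range` parameter of `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["thank you for your help", "thank you for your patience"]

unigram_vocab = CountVectorizer(ngram_range=(1, 1)).fit(docs).vocabulary_
# ngram_range=(1, 2) keeps the unigrams and adds every bigram
uni_bigram_vocab = CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_

print(len(unigram_vocab), len(uni_bigram_vocab))
```

##### Adding bigrams almost doubles the number of columns here, and on a real corpus the number of distinct word pairs grows much faster than the number of distinct words.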

## **Beyond sparse representations**

#### **Dense vectors suit this dataset better because they reduce the dimensionality by a significant amount. The sparse vector models (all the previous models) waste memory when stored as dense arrays, and given the size of the dataset it can be impossible to run them on weaker machines.**

##### We considered 3 types of **embeddings**:
- ##### **Word2Vec**;
- ##### **FastText**;
- ##### **Doc2Vec**;
##### Each of them has pros and cons:
- ##### Word2Vec learns a dense vector for each word, while Doc2Vec learns a vector for each whole document, so Doc2Vec handles larger pieces of text (such as sentences or paragraphs) better than Word2Vec;
- ##### FastText handles subword information better, but we are working at the level of words and their semantics, so it would not help this project.
##### So we opted for Word2Vec and Doc2Vec, to compare them later on.
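
##### Since Word2Vec yields one vector per word while a classifier needs one fixed-length vector per document, the word vectors have to be combined somehow. One common option, which is an assumption on our side and not something this notebook has settled on, is to average the vectors of the words in a document. In the sketch below `document_vector` and `toy_vectors` are hypothetical, and a plain dictionary stands in for the trained model's `wv` attribute, which can be queried with `in` and `[]` in the same way.

```python
import numpy as np


def document_vector(tokens, word_vectors, vector_size):
    """Average the vectors of the known tokens; zeros if none is known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# Hypothetical 2-dimensional word vectors standing in for a trained model
toy_vectors = {
    "help": np.array([1.0, 0.0]),
    "pleas": np.array([0.0, 1.0]),
}

print(document_vector(["pleas", "help", "unknownword"], toy_vectors, vector_size=2))
```

##### Words missing from the vocabulary are simply skipped, and a document with no known word maps to the zero vector.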


In [None]:
from gensim.models import Word2Vec

# Each document is split on whitespace into a list of tokens
word2vec_embedding_unclean = Word2Vec(sentences=[text.split() for text in unclean_corpus], vector_size=100, window=5, min_count=1, workers=4)
word2vec_embedding_clean = Word2Vec(sentences=[text.split() for text in clean_corpus], vector_size=100, window=5, min_count=1, workers=4)




In [25]:
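# Ten nearest neighbours of "help" in each embedding space
print(word2vec_embedding_unclean.wv.most_similar('help'))
print(word2vec_embedding_clean.wv.most_similar('help'))
# The 100-dimensional vector of "help" in each model
print(word2vec_embedding_unclean.wv["help"])
print(word2vec_embedding_clean.wv["help"])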

[('assist', 0.5985382795333862), ('happy', 0.4253496825695038), ('us!', 0.4059656858444214), ('help.', 0.4044547975063324), ('shopping', 0.3524917662143707), ('skills!', 0.34916985034942627), ('equip', 0.3451002836227417), ('frustration', 0.3441314995288849), ('answer', 0.34399035573005676), ('specialists', 0.33235788345336914)]
[('assist', 0.655402421951294), ('happi', 0.4399434030056), ('regard', 0.4313318133354187), ('appreci', 0.35334745049476624), ('provid', 0.3486132025718689), ('suitabl', 0.34127140045166016), ('concern', 0.332386314868927), ('misunderstand', 0.32823318243026733), ('thank', 0.32348790764808655), ('certainli', 0.322783887386322)]
[ 1.7538501   2.4359643   2.075145    6.5213313   0.33731833  3.710494
  2.4197502   2.3675892  -0.7035968  -2.6525035  -2.1895292  -1.2027342
  3.8873377   0.24061668  0.35402876 -0.8865496   0.47483388  1.6164935
  1.2088014   0.4597224  -1.6718328   1.3218437   0.8569037   2.8103912
  4.327201    3.5933511  -3.5309832   0.12094443  0.

In [20]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# TaggedDocument expects a list of tokens, so each text is split into words
documents_clean = [TaggedDocument(_d.split(), [str(i)]) for i, _d in enumerate(clean_corpus)]
documents_unclean = [TaggedDocument(_d.split(), [str(i)]) for i, _d in enumerate(unclean_corpus)]
doc2vec_embedding_clean = Doc2Vec(documents_clean, vector_size=5, window=2, min_count=1, workers=4)
doc2vec_embedding_unclean = Doc2Vec(documents_unclean, vector_size=5, window=2, min_count=1, workers=4)


In [31]:
# Vectors of the first document in the clean and unclean corpus
print(doc2vec_embedding_clean.dv[0])
print(doc2vec_embedding_unclean.dv[0])
# Finding most similar documents to the first document in the clean corpus
print(doc2vec_embedding_clean.dv.most_similar([doc2vec_embedding_clean.dv[0]]))
# Finding most similar documents to the first document in the unclean corpus
print(doc2vec_embedding_unclean.dv.most_similar([doc2vec_embedding_unclean.dv[0]]))

[-0.09117847 -0.17477754  0.00054884  0.19604598  0.04935723]
[-0.11124151 -0.11866955 -0.39550716  0.5509691  -0.19465415]
[('0', 1.0), ('2668', 0.9994635581970215), ('3868', 0.9990739226341248), ('5521', 0.9988207221031189), ('76529', 0.9988175630569458), ('66069', 0.9987872242927551), ('14095', 0.9985340237617493), ('73649', 0.9981640577316284), ('19306', 0.9979440569877625), ('36119', 0.9978556036949158)]
[('0', 0.9999999403953552), ('36829', 0.9990250468254089), ('15922', 0.9989022612571716), ('45547', 0.9988818764686584), ('32645', 0.9984160661697388), ('29271', 0.9983428120613098), ('19578', 0.9981349110603333), ('39596', 0.997815728187561), ('19047', 0.9977831244468689), ('21903', 0.9976970553398132)]


### In summary, these are the representation models we are going to work with:

In [None]:
# BoW
print(BoW.shape)
# TfIdf
print(Tf_Idf.shape)
# One Hot
print(one_hot_clean.shape)
# Word2Vec
print(word2vec_embedding_clean)
# Doc2Vec
print(doc2vec_embedding_clean)

(80000, 4981)
(80000, 4981)
(80000, 4981)
Word2Vec<vocab=4989, vector_size=100, alpha=0.025>
Doc2Vec<dm/m,d5,n5,w2,s0.001,t4>
