# Natural Language Processing First Assignment
#### This notebook covers the first assignment, which uses the **"Polite Guard"** dataset. The objective is to build a pipeline that produces a robust text classification model.

### **Importing the dataset**
#### The first step is to import the dataset. The original dataset is already split into training, test, and validation data.

In [5]:
import pandas as pd

# Paths to the three pre-split CSV files
traning_file = "data/train/train_cot.csv"
test_file = "data/test/test_cot.csv"
validation_file = "data/validation/validation_cot.csv"

# Load each split into its own DataFrame
traning_set = pd.read_csv(traning_file)
test_set = pd.read_csv(test_file)
validation_set = pd.read_csv(validation_file)

# Preview the first rows of each split
print(traning_set.head())
print(test_set.head())
print(validation_set.head())


                                                text            label  \
0  Your flight has been rescheduled for 10:00 AM ...          neutral   
1  We're happy to accommodate your dietary prefer...           polite   
2  Our vegetarian options are available on the me...          neutral   
3  I understand your frustration with the recent ...  somewhat polite   
4  I'll do my best to find a suitable replacement...  somewhat polite   

                                  source  \
0  meta-llama/Meta-Llama-3.1-8B-Instruct   
1  meta-llama/Meta-Llama-3.1-8B-Instruct   
2  meta-llama/Meta-Llama-3.1-8B-Instruct   
3  meta-llama/Meta-Llama-3.1-8B-Instruct   
4  meta-llama/Meta-Llama-3.1-8B-Instruct   

                                           reasoning  
0  This text would be classified as "neutral" bec...  
1  This text is polite because it expresses grati...  
2  This text would be classified as "neutral" bec...  
3  This text would be classified as "somewhat pol...  
4  This text would be

### **Extracting text corpus**
#### We have to extract the text from the documents in the dataset so we can build different representations of it.
#### Note that this is the uncleaned version of the corpus.

In [12]:
# Collect the raw training texts into a plain Python list
unclean_corpus = traning_set["text"].tolist()
print(unclean_corpus[0:5])

["Your flight has been rescheduled for 10:00 AM tomorrow. Please check the airport's website for any updates or changes.", "We're happy to accommodate your dietary preferences. Our vegetarian options are carefully crafted to ensure a delicious and satisfying meal. Would you like me to recommend some dishes that fit your needs?", 'Our vegetarian options are available on the menu, and our chef can modify any dish to suit your dietary needs.', "I understand your frustration with the recent tournament results, and I'll review the standings to see what we can do to improve your experience.", "I'll do my best to find a suitable replacement for the item you're looking for, but I need to know more about what you're looking for."]


### **Cleaning the text corpus**
#### Now we need to process the uncleaned text corpus by performing the following steps:
- #### Removing punctuation (the regular expression below replaces every non-letter character, so digits are removed as well);
- #### Lowercase folding;
- #### Stemming (using PorterStemmer);
- #### Removing stop words (optional);
#### To do this, we import Python's [regular expression](https://docs.python.org/3/library/re.html) library and [nltk](https://www.nltk.org/api/nltk.html).

In [None]:
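import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Requires the NLTK stop word list; if it is missing, run nltk.download('stopwords') once
ps = PorterStemmer()
# A set makes the membership test in the loop below fast
sw = set(stopwords.words('english'))
clean_corpus = []
for raw_text in unclean_corpus:
    # Replace every non-letter character with a space, then lowercase
    text = re.sub('[^a-zA-Z]', ' ', raw_text)
    text = text.lower()

    # Drop stop words, then stem the remaining words
    text = [ps.stem(word) for word in text.split() if word not in sw]
    text = ' '.join(text)
    clean_corpus.append(text)
print(clean_corpus[0:5])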

['flight reschedul tomorrow pleas check airport websit updat chang', 'happi accommod dietari prefer vegetarian option care craft ensur delici satisfi meal would like recommend dish fit need', 'vegetarian option avail menu chef modifi dish suit dietari need', 'understand frustrat recent tournament result review stand see improv experi', 'best find suitabl replac item look need know look']
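
#### Since stop word removal is listed as optional, the cleaning steps could also be wrapped in a function where each optional step is switched on by an argument. The sketch below is one possible way to do that, not the pipeline used in this notebook: `clean_text`, its parameters, and `toy_stop_words` are hypothetical names, and the tiny stop word set is only an illustration, not NLTK's list.

```python
import re

def clean_text(text, stop_words=None, stem=None):
    """Hypothetical helper mirroring the cleaning steps of this section."""
    # Replace every non-letter character (punctuation, digits) with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase folding, then whitespace tokenization.
    tokens = text.lower().split()
    # Optional stop word removal: skipped when no stop word collection is given.
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    # Optional stemming: `stem` is any callable mapping a word to its stem.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)

# Tiny illustrative stop word set (an assumption, not NLTK's full list).
toy_stop_words = {"we", "re", "to", "your"}

sample = "We're happy to accommodate your dietary preferences."
print(clean_text(sample))
print(clean_text(sample, stop_words=toy_stop_words))
```

#### Passing `set(stopwords.words('english'))` as `stop_words` and `PorterStemmer().stem` as `stem` should give the same behaviour as the loop above. Without stop word removal, the apostrophe in "We're" leaves the fragment "re" in the output, which is one reason to keep that step.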
