# Introduction to Natural Language Processing: Assignment 2

In this exercise we'll practice training and testing classifiers.

- You can use any built-in Python packages, as well as scikit-learn and pandas.
- Please comment your code.
- Submissions are due Thursday at 23:59 and should be submitted **ONLY** on eCampus: **Assignments >> Student Submissions >> Assignment 2 (Deadline: 21.11.2023, at 23:59)**
- Name the file appropriately: "Assignment_2_\<Your_Name\>.ipynb".
- Please use relative paths. Your code should run on my computer if the notebook and the file are both in the same directory.

Example: file_name = polarity.txt >> **DON'T use:** /Users/ComputerName/Username/Documents/.../polarity.txt
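
A minimal sketch of what a relative path looks like in code, assuming the data file sits next to the notebook. The variable name `data_file` is chosen for illustration only.

```python
from pathlib import Path

# Relative path: resolved against the directory the notebook runs in,
# so it works on any machine where the file sits next to the notebook.
data_file = Path("polarity.txt")

# A relative path carries no machine-specific prefix such as /Users/...
print(data_file.is_absolute())  # False
```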

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import naive_bayes
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import re
from collections import Counter

[nltk_data] Downloading package punkt to /home/faris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Task 1.1 (2 points)

Create a DataFrame from the `polarity.txt` file and name the columns appropriately (e.g., "Text", "Label").

In [2]:
file_path = "./polarity.txt"

In [3]:
polarity_df = pd.read_csv(file_path, sep='\t', names=['Text', 'Label'])

In [4]:
polarity_df.head(4)

| |Text |Label |
|-|-----|----- |
|0|every now and then a movie comes along from a ...|pos|
|1|mtv films' _election , a high school comedy st...|pos|
|2|did anybody know this film existed a week befo...|pos|
|3|the plot is deceptively simple .|pos|


### Task 1.2 (2 points)

Create a new DataFrame column that contains the labels converted from strings to numerical values using the function `apply()`, and drop the original column afterwards.

Hint: The numerical values can be any meaningful values, e.g., pos >> 1 and neg >> 0

In [5]:
polarity_df['numLabel'] = polarity_df['Label'].apply(lambda x: 1 if x == 'pos' else 0)
polarity_df.drop('Label', axis=1, inplace=True)

In [6]:
polarity_df.head(4)

| |Text |numLabel |
|-|-----|-------- |
|0|every now and then a movie comes along from a ...|1|
|1|mtv films' _election , a high school comedy st...|1|
|2|did anybody know this film existed a week befo...|1|
|3|the plot is deceptively simple .|1|


### Task 2 (8 points)

Write a function `create_count_and_probability` that takes a file (`corpus.txt`) as input and returns a csv file as output containing three columns:
1. Text
2. Count_Vector
3. Probability

Example:

For the line: `This document is the second document.`

The row in the csv file should contain:
`This document is the second document.`   `[0,2,0,1,0,1,1,0,1]`   `[1/6, 2/6, 1/6, 1/6, 1/6, 2/6]`

**Note**:

1. Define your own function. Do not use, e.g., `CountVectorizer()`, which gives you the `count vector` directly.

2. To extract the words, you can either `split` on whitespace as the separator or use a `Regular Expression (re)`, as follows:

```
import re
TEXT = "Hey, - How are you doing today!?"
words_list = re.findall(r"[\w']+", TEXT)
print(words_list)
```

3. To count the words, you can use e.g., the library: `collections`, more specifically `Counter`.

4. Please don't upload the output file. Your function should generate the file.
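
The example row above can be reproduced with a few lines. This is only a sketch under assumptions: the contents of `corpus.txt` are not shown here, so the four sentences below are an illustrative corpus chosen to give a nine-word vocabulary, and the helper name `tokenize` is hypothetical. Words are lowercased and the vocabulary is sorted alphabetically.

```python
import re
from collections import Counter

# Assumed corpus: corpus.txt is not shown in the assignment,
# so these four sentences are only an illustration.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return re.findall(r"[\w']+", text.lower())

# Vocabulary: every distinct word in the corpus, sorted for a stable order.
vocabulary = sorted({word for sentence in corpus for word in tokenize(sentence)})

tokens = tokenize(corpus[1])
counts = Counter(tokens)

# Count vector: one entry per vocabulary word.
count_vector = [counts[word] for word in vocabulary]
# Probability: relative frequency of each token, in sentence order.
probabilities = [f"{counts[word]}/{len(tokens)}" for word in tokens]

print(vocabulary)
print(count_vector)   # [0, 2, 0, 1, 0, 1, 1, 0, 1]
print(probabilities)  # ['1/6', '2/6', '1/6', '1/6', '1/6', '2/6']
```

The count vector has one entry per vocabulary word, while the probability list has one entry per token of the sentence, which is why the two lists differ in length.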

In [7]:
# Build the vocabulary, i.e. the bag-of-words features (same as "count_vectorizer.vocabulary_.keys()" from tutorial)
# Input: corpus => list of documents/sentences
def unique_words(corpus):
    all_words = [re.findall(r"[\w']+", sentence) for sentence in corpus]
    unique_words_set = set(word.lower() for words in all_words for word in words)
    # Sort so the feature order is the same on every run (set order is not guaranteed)
    unique_words_list = sorted(unique_words_set)
    return unique_words_list

In [8]:
# Build the vocabulary, i.e. the bag-of-words features (same as "count_vectorizer.vocabulary_.keys()" from tutorial)
# Input: plain text
def unique_words2(text):
    words_list = re.findall(r"[\w']+", text)
    unique_words_set = set(word.lower() for word in words_list)
    # Sort so the feature order is the same on every run (set order is not guaranteed)
    unique_words_list = sorted(unique_words_set)
    return unique_words_list

In [9]:
def words_probability(sentence):
    word_list = re.findall(r"[\w']+", sentence)
    word_list = list(word.lower() for word in word_list)
    
    # Calculate word frequencies (returns dictionary where key is word, value is frequency)
    word_counts = Counter(word_list)
    
    total_words = len(word_list)
    
    # Calculate probabilities for each word
    word_probabilities = [word_counts[word] / total_words for word in word_list]
    
    return word_probabilities

In [10]:
def count_vectorizer(sentence, unique_words):
    word_list = re.findall(r"[\w']+", sentence)
    word_list = list(word.lower() for word in word_list)
    
    # Count occurrences of UNIQUE words (i.e of all vocabulary words / features / bag-of-words)
    feature_vector = [word_list.count(word.lower()) for word in unique_words]
    return feature_vector

#### Note: The function returns the DataFrame and writes the .csv file as a side effect

In [11]:
def create_count_and_probability(file_name):
    # Read the whole file into a string
    with open(file_name, 'r') as file:
        text = file.read()

    # Use the sentence tokenizer from nltk to split the text. Output is a list of sentences (i.e. the corpus)
    corpus = sent_tokenize(text)

    # Get the unique words of the whole text => these words are the features (bag of words)
    # Capitalized and non-capitalized words are treated as the same word (everything is lowercased)
    bag_of_words = unique_words(corpus)

    rows = []
    for sentence in corpus:
        # For each sentence (i.e. sample), count how often each vocabulary word occurs in it
        feature_vector = count_vectorizer(sentence, bag_of_words)
        word_probabilities = words_probability(sentence)
        rows.append({'Text': sentence, 'Count_Vector': feature_vector, 'Probability': word_probabilities})

    # Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
    # Column names follow the task description: Text, Count_Vector, Probability
    df = pd.DataFrame(rows, columns=['Text', 'Count_Vector', 'Probability'])

    df.to_csv("corpus_parsed_csv.csv", index=False)

    return df

In [12]:
df = create_count_and_probability("corpus.txt")
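
One thing to keep in mind when reading the generated file back: list-valued cells are written to CSV as text, so they load as strings. The sketch below shows one way to parse them with `ast.literal_eval`. It uses an in-memory buffer instead of the real output file, and the one-row frame `df_out` is a hypothetical stand-in shaped like the columns the task asks for.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame shaped like the function's output.
df_out = pd.DataFrame({
    "Text": ["This is the first document."],
    "Count_Vector": [[0, 1, 1, 1, 0, 0, 1, 0, 1]],
    "Probability": [[0.2, 0.2, 0.2, 0.2, 0.2]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)

# Lists are stored as text in the CSV, so parse them back on load.
df_in = pd.read_csv(
    buffer,
    converters={"Count_Vector": ast.literal_eval, "Probability": ast.literal_eval},
)

print(type(df_in.loc[0, "Count_Vector"]))
print(df_in.loc[0, "Count_Vector"])
```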

### Task 3 (8 points)

The goal of this task is to train and test classifiers provided in scikit-learn, using two datasets `rural.txt` and `science.txt`.

a) Each file (rural and science) contains sentence-wise documents. Create a DataFrame with two columns, "Document" and "Class", as shown below. This DataFrame will be used later as input for the vectorizer.

|Document                             |Class |
| ------------------------------------|----- |
|PM denies knowledge of AWB kickbacks | rural |
|The crocodile ancestor fossil, found...| science |


b) Split the data into train (70%) and test (30%) sets and use the tf-idf-vectorizer to train the following classifiers provided by scikit-learn:

- naive_bayes.GaussianNB()
- svm.LinearSVC()

c) Evaluate both classifiers on the test set and report accuracy, recall, precision, F1 score and the confusion matrix.

**Hints:**
1. The Gaussian NB Classifier takes a dense matrix as input and the output of the vectorizer is a sparse matrix. Use `my_matrix.toarray()` for this conversion.
2. You can play around with various parameters in both the tf-idf-vectorizer and the classifier to get a better performance in terms of the accuracy. (In the exercise, we will discuss the accuracy of your model.)
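
The first hint can be sketched on a toy example. The documents and labels below are invented for illustration only and are not taken from `rural.txt` or `science.txt`.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy documents and labels, invented for illustration only.
docs = [
    "farmers expect a good wheat harvest",
    "drought hits cattle farmers",
    "scientists discover a new fossil",
    "study finds a gene linked to disease",
]
labels = ["rural", "rural", "science", "science"]

vectorizer = TfidfVectorizer()
features_sparse = vectorizer.fit_transform(docs)
print(issparse(features_sparse))  # True: the vectorizer returns a sparse matrix

# GaussianNB needs a dense array, so convert before fitting.
features_dense = features_sparse.toarray()
classifier = GaussianNB()
classifier.fit(features_dense, labels)

# New text must be transformed with the same fitted vectorizer.
new_doc = vectorizer.transform(["wheat farmers face drought"]).toarray()
print(classifier.predict(new_doc))
```

Converting to a dense array is fine for a dataset of this size, but memory use grows with the number of documents times the vocabulary size.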

## Load .txt files into the DataFrame (create dataset)

Each document is on its own line, so I split each file on newlines. I then add every element of the two resulting lists to an empty DataFrame using `.loc`.

In [13]:
science_path = "./science.txt"
rural_path = "./rural.txt"

In [14]:
with open(science_path, 'r') as file:
    science = file.read()

In [15]:
with open(rural_path, 'r') as file:
    rural = file.read()

In [16]:
# Split into one document per line
# [:-1] drops the empty string left after the trailing newline (assumes each file ends with a newline)
science_corpus = science.split('\n')[:-1]
rural_corpus = rural.split('\n')[:-1]

In [17]:
len(science_corpus) + len(rural_corpus)

1122

In [18]:
df_dataset = pd.DataFrame(columns=['Document', 'Class'])

In [19]:
for i in range(len(science_corpus)):
    df_dataset.loc[i,'Document'] = science_corpus[i]
    df_dataset.loc[i,'Class'] = "science"

In [20]:
prev_length = len(science_corpus)

In [21]:
for j in range(len(rural_corpus)):
    df_dataset.loc[prev_length + j,'Document'] = rural_corpus[j]
    df_dataset.loc[prev_length + j,'Class'] = "rural"
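
As an alternative to filling the DataFrame row by row, the same table can be built in one step from the two lists. This is a sketch with placeholder documents; `science_docs`, `rural_docs` and `df_sketch` are hypothetical names standing in for the notebook's variables.

```python
import pandas as pd

# Placeholder lists; in the notebook these would be science_corpus and rural_corpus.
science_docs = ["first science document", "second science document"]
rural_docs = ["first rural document"]

# One column with all documents, one column with the matching class labels.
df_sketch = pd.DataFrame({
    "Document": science_docs + rural_docs,
    "Class": ["science"] * len(science_docs) + ["rural"] * len(rural_docs),
})

print(df_sketch)
```

Building the frame once avoids repeated row insertion, which gets slow as the number of documents grows.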

In [22]:
df_dataset.head(5)

| |Document |Class |
|-|---------|----- |
|0|Cystic fibrosis affects 30,000 children and yo...|science|
|1|Inhaling the mists of salt water can reduce th...|science|
|2|That's the conclusion of two studies published...|science|
|3|They found that inhaling a mist with a salt co...|science|
|4|Cystic fibrosis, a progressive and frequently ...|science|


## Split dataset into train and test using scikit-learn

In [23]:
text_train, text_test, label_train, label_test = train_test_split(df_dataset['Document'], df_dataset['Class'], 
                                                                  test_size=0.30, 
                                                                  random_state=1234, shuffle=True)

## Creating Features (TF-IDF-Vectorizer)

In [24]:
# Create an instance of TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()

# Fit that instance to the train data only
tf_idf_vectorizer.fit(text_train)

# Transform textual data to numeric data (TF-IDF features)
train_bow_features = tf_idf_vectorizer.transform(text_train)
test_bow_features = tf_idf_vectorizer.transform(text_test)

In [25]:
# Convert from sparse matrix to dense numpy matrix
train_bow_features = train_bow_features.toarray()
test_bow_features = test_bow_features.toarray()

## Accuracy, Precision, Recall, F1 Score and Confusion Matrix

In [26]:
def categorical_to_numerical(categorical):
    numerical = [1 if label == 'science' else 0 for label in categorical]
    return np.array(numerical)

def calculate_accuracy(predicted_labels_categorical, gt_labels_categorical):
    predicted_labels = categorical_to_numerical(predicted_labels_categorical)
    gt_labels = categorical_to_numerical(gt_labels_categorical)
    
    correct_predictions = np.sum(predicted_labels == gt_labels)
    total_samples = len(gt_labels)
    accuracy = correct_predictions / total_samples
    return accuracy

def calculate_precision(predicted_labels_categorical, gt_labels_categorical):
    predicted_labels = categorical_to_numerical(predicted_labels_categorical)
    gt_labels = categorical_to_numerical(gt_labels_categorical)
    
    true_positives = np.sum((predicted_labels == 1) & (gt_labels == 1))
    false_positives = np.sum((predicted_labels == 1) & (gt_labels == 0))
    precision = true_positives / (true_positives + false_positives)
    return precision

def calculate_recall(predicted_labels_categorical, gt_labels_categorical):
    predicted_labels = categorical_to_numerical(predicted_labels_categorical)
    gt_labels = categorical_to_numerical(gt_labels_categorical)
    
    true_positives = np.sum((predicted_labels == 1) & (gt_labels == 1))
    false_negatives = np.sum((predicted_labels == 0) & (gt_labels == 1))
    recall = true_positives / (true_positives + false_negatives)
    return recall

def calculate_f1(predicted_labels_categorical, gt_labels_categorical):    
    precision = calculate_precision(predicted_labels_categorical, gt_labels_categorical)
    recall = calculate_recall(predicted_labels_categorical, gt_labels_categorical)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def calculate_confusion_matrix(predicted_labels_categorical, gt_labels_categorical):
    predicted_labels = categorical_to_numerical(predicted_labels_categorical)
    gt_labels = categorical_to_numerical(gt_labels_categorical)
    
    true_positives = np.sum((predicted_labels == 1) & (gt_labels == 1))
    false_positives = np.sum((predicted_labels == 1) & (gt_labels == 0))
    true_negatives = np.sum((predicted_labels == 0) & (gt_labels == 0))
    false_negatives = np.sum((predicted_labels == 0) & (gt_labels == 1))

    confusion_matrix = np.array([[true_negatives, false_positives], [false_negatives, true_positives]])

    return confusion_matrix
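
The hand-written metrics above can be cross-checked against `sklearn.metrics`. The sketch below uses invented toy labels, with "science" as the positive class to match `categorical_to_numerical`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels, invented for illustration only.
y_true = ["science", "science", "science", "rural", "rural", "rural"]
y_pred = ["science", "science", "rural", "rural", "rural", "science"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label="science")
recall = recall_score(y_true, y_pred, pos_label="science")
f1 = f1_score(y_true, y_pred, pos_label="science")

# labels=["rural", "science"] puts the negative class first,
# so the layout is [[TN, FP], [FN, TP]] with rows = true class, columns = predicted class.
matrix = confusion_matrix(y_true, y_pred, labels=["rural", "science"])

print(accuracy, precision, recall, f1)
print(matrix)
```

Passing the same predictions and ground-truth labels to the functions defined above should give the same numbers.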

## ML Model - LinearSVC

In [27]:
svm_classifier = svm.LinearSVC()
svm_classifier.fit(train_bow_features, label_train);



In [28]:
predicted_label_test = svm_classifier.predict(test_bow_features)

In [29]:
accuracy_svc = calculate_accuracy(predicted_label_test, label_test)
precision_svc = calculate_precision(predicted_label_test, label_test)
recall_svc = calculate_recall(predicted_label_test, label_test)
f1_svc = calculate_f1(predicted_label_test, label_test)
confusion_matrix_svc = calculate_confusion_matrix(predicted_label_test, label_test)

In [30]:
print("Accuracy of LinearSVC: ", accuracy_svc)
print("Precision of LinearSVC: ", precision_svc)
print("Recall of LinearSVC: ", recall_svc)
print("F1 Score of LinearSVC: ", f1_svc)
print("Confusion Matrix of LinearSVC: ", confusion_matrix_svc)

Accuracy of LinearSVC:  0.9406528189910979
Precision of LinearSVC:  0.9273743016759777
Recall of LinearSVC:  0.9595375722543352
F1 Score of LinearSVC:  0.9431818181818181
Confusion Matrix of LinearSVC:  [[151  13]
 [  7 166]]


## ML Model - Gaussian Naive Bayes

In [31]:
naive_bayes_classifier = naive_bayes.GaussianNB()
naive_bayes_classifier.fit(train_bow_features, label_train);

In [32]:
predicted_label_test = naive_bayes_classifier.predict(test_bow_features)

In [33]:
accuracy_gnb = calculate_accuracy(predicted_label_test, label_test)
precision_gnb = calculate_precision(predicted_label_test, label_test)
recall_gnb = calculate_recall(predicted_label_test, label_test)
f1_gnb = calculate_f1(predicted_label_test, label_test)
confusion_matrix_gnb = calculate_confusion_matrix(predicted_label_test, label_test)

In [34]:
print("Accuracy of GNB: ", accuracy_gnb)
print("Precision of GNB: ", precision_gnb)
print("Recall of GNB: ", recall_gnb)
print("F1 Score of GNB: ", f1_gnb)
print("Confusion Matrix of GNB: ", confusion_matrix_gnb)

Accuracy of GNB:  0.9050445103857567
Precision of GNB:  0.9122807017543859
Recall of GNB:  0.9017341040462428
F1 Score of GNB:  0.9069767441860466
Confusion Matrix of GNB:  [[149  15]
 [ 17 156]]
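
As a sanity check, the reported scores can be recomputed from the two confusion matrices alone. The helper `metrics_from_confusion` is a hypothetical name introduced for this sketch. It assumes the `[[TN, FP], [FN, TP]]` layout used by `calculate_confusion_matrix`.

```python
import numpy as np

def metrics_from_confusion(matrix):
    """Derive accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the printed results of the two classifiers.
svc_matrix = np.array([[151, 13], [7, 166]])
gnb_matrix = np.array([[149, 15], [17, 156]])

for name, matrix in [("LinearSVC", svc_matrix), ("GNB", gnb_matrix)]:
    accuracy, precision, recall, f1 = metrics_from_confusion(matrix)
    print(f"{name}: accuracy={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} f1={f1:.4f}")
```

On this split, LinearSVC makes 20 errors out of 337 test documents and Gaussian Naive Bayes makes 32.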
