# Introduction to Natural Language Processing: Assignment 2

In this exercise we'll practice training and testing classifiers.

- You can use any built-in Python packages, scikit-learn and Pandas.
- Please comment your code.
- Submissions are due Thursday at 23:59 and should be submitted **ONLY** on eCampus: **Assignments >> Student Submissions >> Assignment 2 (Deadline: 21.11.2023, at 23:59)**
- Name the file appropriately: "Assignment_2_\<Your_Name\>.ipynb".
- Please use relative paths; your code should run on my computer if the notebook and the file are both in the same directory.

Example: `file_name = "polarity.txt"` >> **DON'T use:** `/Users/ComputerName/Username/Documents/.../polarity.txt`
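
A minimal sketch of what a relative path looks like in code, assuming the data file sits in the same directory as the notebook. The variable name `data_file` is only illustrative.

```python
from pathlib import Path

# Relative path: resolved against the directory the notebook is run from.
data_file = Path("polarity.txt")

# A relative path is not tied to one machine's folder layout.
print(data_file.is_absolute())

# resolve() shows where the path points on *this* machine (for debugging only;
# do not paste the result back into the notebook).
print(data_file.resolve())
```

Passing `data_file` (or simply the string `"polarity.txt"`) to `pd.read_csv` keeps the notebook portable.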

### Task 1.1 (2 points)

Create a DataFrame from the `polarity.txt` file and name the columns appropriately (e.g., "Text", "Label").

In [228]:
# here comes your code
import pandas as pd

df = pd.read_csv("polarity.txt", sep='\t', header=None)
df.columns = ["text", "polarity"]
df.sample(5)

|   |text |polarity |
|---|-----|-------- |
|15 |the most significant tension of \_election\_ is ... |pos |
|70 |whatever . . . skip |neg |
|31 |one thing that's been bothering me since i've ... |pos |
|44 |critique : a mind-fuck movie for the teen gene... |neg |
|28 |witherspoon is a revelation . |pos |
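
As a variation on the cell above, the column names can be passed directly to `read_csv`. This is a sketch under assumptions: the two rows are made up, and `io.StringIO` stands in for `polarity.txt`, which is assumed to hold one tab-separated text/label pair per line, as the cell above implies. `quoting=csv.QUOTE_NONE` is an optional extra that treats quote characters in the reviews as ordinary text.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for polarity.txt: "text<TAB>label" on each line.
sample = io.StringIO('a "fine" film .\tpos\nskip it .\tneg\n')

df_sample = pd.read_csv(
    sample,
    sep="\t",
    header=None,                # the file has no header row
    names=["Text", "Label"],    # name the columns while reading
    quoting=csv.QUOTE_NONE,     # keep quote characters as plain text
)
print(df_sample)
```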


### Task 1.2 (2 points)

Create a new column for the DataFrame that contains the labels converted to numerical values instead of strings, using the function `apply()`, and drop the original column afterwards.

Hint: The numerical values can be any meaningful values, e.g., pos >> 1 and neg >> 0

In [229]:
# here comes your code
# Custom function that is applied to each element
def numberize(polarity):
    if polarity == "pos":
        return 1
    else:
        return 0

df["labels"] = df["polarity"].apply(numberize)
df.sample(5)

|   |text |polarity |labels |
|---|-----|---------|------ |
|40 |they get into an accident . |neg |0 |
|32 |there is an extraordinary amount of sexuality ... |pos |1 |
|50 |there are dreams , there are characters coming... |neg |0 |
|31 |one thing that's been bothering me since i've ... |pos |1 |
|60 |do we really need to see it over and over agai... |neg |0 |
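
The task also asks for the original column to be dropped, which the cell above does not do yet. A minimal sketch of that last step, on a made-up two-row DataFrame (`df_demo` is a hypothetical name) with the same column names as above:

```python
import pandas as pd

# Made-up stand-in for the polarity DataFrame.
df_demo = pd.DataFrame({"text": ["great .", "skip ."], "polarity": ["pos", "neg"]})

def numberize(polarity):
    """Map 'pos' to 1 and everything else to 0."""
    return 1 if polarity == "pos" else 0

# Convert the string labels, then drop the original string column.
df_demo["labels"] = df_demo["polarity"].apply(numberize)
df_demo = df_demo.drop(columns=["polarity"])

print(df_demo.columns.tolist())
```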


### Task 2 (8 points)

Write a function `create_count_and_probability` that takes a file (`corpus.txt`) as input and returns a csv file as output containing three columns:
1. Text
2. Count_Vector
3. Probability

Example:

For the line: `This document is the second document.`

The row in the csv file should contain:
`This document is the second document.`   `[0,2,0,1,0,1,1,0,1]`   `[1/6, 2/6, 1/6, 1/6, 1/6, 2/6]`

**Note**:

1. You should define your own function and not use, e.g., `CountVectorizer()`, which gives you the count vector directly.

2. You can either split on whitespace with `split` or use a regular expression (`re`) to extract the words, as follows:

```
import re
TEXT = "Hey, - How are you doing today!?"
words_list = re.findall(r"[\w']+", TEXT)
print(words_list)
```

3. To count the words, you can use, e.g., `Counter` from the `collections` library.

4. Please don't upload the output file. Your function should generate the file.
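
The example row and the notes above can be sketched as follows. This is one possible reading, not a reference solution: it assumes lowercased tokens, the `re.findall` pattern from note 2, and a hypothetical nine-word vocabulary chosen so that the result matches the example row (the real vocabulary would be built from `corpus.txt`). The helper name `count_and_probability` is made up for illustration.

```python
import re
from collections import Counter

def count_and_probability(text, vocabulary):
    """Return (count vector over the vocabulary, per-token probabilities)."""
    # Extract words with the pattern from note 2 and lowercase them (assumption).
    tokens = [token.lower() for token in re.findall(r"[\w']+", text)]
    counts = Counter(tokens)
    # One entry per vocabulary word; Counter returns 0 for absent words.
    count_vector = [counts[word] for word in vocabulary]
    # One entry per token in the line: its count divided by the line length.
    probabilities = [counts[token] / len(tokens) for token in tokens]
    return count_vector, probabilities

# Hypothetical vocabulary, sorted alphabetically.
vocabulary = ["and", "document", "first", "is", "one", "second", "the", "third", "this"]

count_vector, probabilities = count_and_probability(
    "This document is the second document.", vocabulary
)
print(count_vector)
print(probabilities)
```

Under these assumptions the count vector has one entry per vocabulary word, while the probability list has one entry per token of the line, which is how the six-element list in the example row comes about.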

In [230]:
def vocab(file):
    '''
    Input: path to a txt file, or a DataFrame with a "Document" column
    Output: list of unique whitespace-separated tokens
    '''

    if isinstance(file, pd.DataFrame):
        # Collect the unique tokens over all documents
        words = set()
        for text in file["Document"]:
            # Plain split on spaces: punctuation stays attached to the tokens
            words.update(text.split(" "))
    else:
        with open(file, 'r') as f:
            # Split the whole file on any whitespace
            words = set(f.read().split())

    # Return as a list (the order of a set is arbitrary)
    return list(words)




def count_vector_n_probabilities(text, vocab):
    '''
    Input: text, vocab (list of unique words)
    Output: tuple of (counts of the vocab words in text,
            relative frequency of each vocab word in text)
    '''
    count_vector = [0]*len(vocab)
    # Split the text into words
    words = text.split()
    # For each word - add 1 to the corresponding index
    for word in words:
        if word in vocab:
            count_vector[vocab.index(word)] += 1

    # Calculate the probability of each word (all zeros for an empty text)
    total_words = len(words)
    probabilities = [count/total_words if total_words else 0.0 for count in count_vector]
    return count_vector, probabilities




def create_count_and_probability(file_name):
    '''
    Input: path to a txt file with one document per line
    Output: DataFrame with "text", "count_vector", "probability" columns,
            also saved as count_vector_and_probability.csv
    '''
    df = pd.read_csv(file_name, sep='\t', header=None)
    # Add a name to the column
    df.columns = ["text"]
    # Build the vocabulary once, from the input file, so both columns share one word order
    vocabulary = vocab(file_name)
    # Compute the count vector and the probabilities once per row
    results = df["text"].apply(lambda x: count_vector_n_probabilities(x, vocabulary))
    # Create a column with the count vector
    df["count_vector"] = results.apply(lambda r: r[0])
    # Create a column with the probability
    df["probability"] = results.apply(lambda r: r[1])
    # Save as a CSV file
    df.to_csv("count_vector_and_probability.csv", index=False)

    return df

df = create_count_and_probability("corpus.txt")
df

|   |text |count_vector |probability |
|---|-----|-------------|----------- |
|0 |This document is a sample document. |[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, ... |[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16666666... |
|1 |In this document, we have repeated words. |[1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, ... |[0.14285714285714285, 0.0, 0.14285714285714285... |
|2 |Repeated words can help us understand word fre... |[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ... |[0.0, 0.0, 0.0, 0.0, 0.125, 0.0, 0.0, 0.0, 0.1... |
|3 |Can you identify the repeated words in this do... |[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, ... |[0.0, 0.1111111111111111, 0.0, 0.1111111111111... |
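
One thing worth knowing about the generated CSV: list-valued cells are written as text, so they come back as strings when the file is read again. The sketch below illustrates this on a made-up one-row DataFrame, with an in-memory buffer standing in for `count_vector_and_probability.csv`; using `ast.literal_eval` to restore the lists is a suggestion, not part of the task.

```python
import ast
import io

import pandas as pd

# Made-up row with the same column names as the output above.
df_demo = pd.DataFrame(
    {"text": ["a b a"], "count_vector": [[2, 1]], "probability": [[2 / 3, 1 / 3]]}
)

# Write to an in-memory buffer instead of a real file, then read it back.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# After the round trip the list is a plain string such as "[2, 1]".
raw_value = restored.loc[0, "count_vector"]
print(type(raw_value))

# literal_eval safely parses the string back into a Python list.
restored["count_vector"] = restored["count_vector"].apply(ast.literal_eval)
restored["probability"] = restored["probability"].apply(ast.literal_eval)
print(restored.loc[0, "count_vector"])
```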


### Task 3 (8 points)

The goal of this task is to train and test classifiers provided in scikit-learn, using two datasets `rural.txt` and `science.txt`.

a) Each file (rural and science) contains sentence-wise documents. You should create a DataFrame containing two columns, "Document" and "Class", as shown below. This DataFrame will be used later as input for the vectorizer.

|Document                             |Class |
| ------------------------------------|----- |
|PM denies knowledge of AWB kickbacks | rural |
|The crocodile ancestor fossil, found...| science |


b) Split the data into train (70%) and test (30%) sets and use the tf-idf vectorizer to train the following classifiers provided by scikit-learn:

- naive_bayes.GaussianNB()
- svm.LinearSVC().

c) Evaluate both classifiers on the test set and report accuracy, recall, precision, F1 score, and the confusion matrix.

**Hints:**
1. The Gaussian NB classifier takes a dense matrix as input, while the output of the vectorizer is a sparse matrix. Use `my_matrix.toarray()` for this conversion.
2. You can play around with various parameters in both the tf-idf-vectorizer and the classifier to get a better performance in terms of the accuracy. (In the exercise, we will discuss the accuracy of your model.)
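
Hint 1 can be illustrated on a toy example. The four documents and their labels below are made up, and the default `TfidfVectorizer()` settings are an arbitrary choice; the point is only the sparse-to-dense conversion before calling `GaussianNB`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up miniature corpus with two classes.
docs = ["milk prices rise", "fossil found in rock", "dairy farmers meet", "new fossil study"]
labels = ["rural", "science", "rural", "science"]

vectorizer = TfidfVectorizer()
sparse_matrix = vectorizer.fit_transform(docs)   # SciPy sparse matrix
dense_matrix = sparse_matrix.toarray()           # NumPy array, as GaussianNB expects

classifier = GaussianNB()
classifier.fit(dense_matrix, labels)

# New text must go through the same vectorizer and the same conversion.
prediction = classifier.predict(vectorizer.transform(["fossil study"]).toarray())
print(type(sparse_matrix).__name__, type(dense_matrix).__name__, prediction)
```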

In [231]:
rural_df = pd.read_csv("rural.txt", header=None, sep="\t")
rural_df.columns = ['Document']
rural_df["Class"] = "rural"

science_df = pd.read_csv("science.txt", header=None, sep="\t")
science_df.columns = ['Document']
science_df["Class"] = "science"

df = pd.concat([science_df, rural_df])
df

|   |Document |Class |
|---|---------|----- |
|0 |Cystic fibrosis in Yellandu Khammam Hyderabad ... |science |
|1 |Inhaling the mists of salt water can reduce th... |science |
|2 |That's the conclusion of two studies published... |science |
|3 |They found that inhaling a mist with a salt co... |science |
|4 |Cystic fibrosis, a progressive and frequently ... |science |
|... |... |... |
|445 |Hindmarsh mayor Darryl Argall says the taskfor... |rural |
|446 |Dairy companies compete for suppliers |rural |
|447 |Competition for milk suppliers is intensifying... |rural |
|448 |After years of dismal prices, milk company Fon... |rural |


In [232]:
# Splitting into testing and training
# Using scikit learn
from sklearn.model_selection import train_test_split


X = df["Document"]
y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0) # Use same random state for reproducibility
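
An optional refinement that the task does not require: passing `stratify` keeps the class proportions roughly the same in the train and test sets. The sketch below uses made-up documents and labels (`X_demo`, `y_demo`) rather than the real DataFrame.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for df["Document"] and df["Class"].
X_demo = [f"document {i}" for i in range(20)]
y_demo = ["rural"] * 10 + ["science"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    train_size=0.7,    # 70% train, the rest test
    random_state=0,    # fixed seed for reproducibility
    stratify=y_demo,   # keep the rural/science ratio in both splits
)

print(len(X_tr), len(X_te))
```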


In [233]:
# TF-IDF: Term Frequency - Inverse Document Frequency

# TF(term, document) = (# term appearances / # total terms in document)
# IDF(term, corpus) = log(# documents in corpus / (# documents containing term + 1))

from sklearn.feature_extraction.text import TfidfVectorizer

# Creating a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=4000, ngram_range=(1,2))

# Making tf-idf matrix
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_test = vectorizer.transform(X_test)

# Print the shape of the matrices
print(tfidf_train.shape)
print(tfidf_test.shape)


(714, 4000)
(306, 4000)


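Note that the vocabulary size differs between the two approaches: the vocabulary in Task 2 is based on a simple `split`, while the one here is built by scikit-learn's `TfidfVectorizer` (capped at 4000 features and including bigrams).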
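
The TF and IDF definitions given in the comments of the vectorizer cell can be written out by hand as follows. This sketch uses a made-up three-document corpus and implements exactly that textbook variant. scikit-learn's `TfidfVectorizer` uses a different, smoothed IDF by default (`ln((1 + n) / (1 + df)) + 1`) and L2-normalises each row, so its values will not match these numbers.

```python
import math

# Made-up, already tokenised corpus.
corpus = [
    ["milk", "prices", "rise"],
    ["fossil", "found"],
    ["milk", "fossil", "study"],
]

def tf(term, document):
    """Term frequency: occurrences of the term divided by the document length."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Inverse document frequency: log(N / (documents containing the term + 1))."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# "prices" occurs in one document, "milk" in two of the three.
print(tf_idf("prices", corpus[0], corpus))
print(tf_idf("milk", corpus[0], corpus))
```

With this variant, a term that occurs in all but one document gets an IDF of zero, and a term that occurs in every document gets a negative one; the smoothed scikit-learn formula avoids that.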

In [234]:

# Classification using GaussianNB

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
# GaussianNB needs dense input, so the sparse tf-idf matrices are converted
classifier.fit(tfidf_train.toarray(), y_train)

# Predicting the Test set results
y_pred = classifier.predict(tfidf_test.toarray())

# Accuracy
from sklearn.metrics import accuracy_score
# Results get their own names so the imported functions are not overwritten
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
# Recall and Precision
from sklearn.metrics import recall_score, precision_score
recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
print("Recall: ", recall)
print("Precision: ", precision)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)

Accuracy:  0.9248366013071896
Recall:  0.9248366013071896
Precision:  0.9249137179049571
Confusion Matrix: 
 [[131  13]
 [ 10 152]]
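
To see how the reported numbers relate to the confusion matrix, the weighted metrics can be recomputed by hand. The sketch below starts from the matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. It also derives the weighted F1 score, which the task asks for but the cell above does not print.

```python
# Confusion matrix from the GaussianNB output above:
# rows = true class, columns = predicted class.
cm = [[131, 13],
      [10, 152]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

weighted_precision = weighted_recall = weighted_f1 = 0.0
for k in range(len(cm)):
    true_positives = cm[k][k]
    support = sum(cm[k])                      # samples whose true class is k
    predicted = sum(row[k] for row in cm)     # samples predicted as class k
    precision = true_positives / predicted
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    weight = support / total                  # 'weighted' average: weight by support
    weighted_precision += weight * precision
    weighted_recall += weight * recall
    weighted_f1 += weight * f1

print("Accuracy: ", accuracy)
print("Recall: ", weighted_recall)
print("Precision: ", weighted_precision)
print("F1: ", weighted_f1)
```

Note that the support-weighted recall always equals the accuracy, which is why the two values in the output above are identical.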


In [235]:
# Classification using svm.LinearSVC()

from sklearn.svm import LinearSVC

classifier = LinearSVC()
classifier.fit(tfidf_train.toarray(), y_train)

# Predicting the Test set results
y_pred = classifier.predict(tfidf_test.toarray())

# Accuracy, recall and precision
from sklearn.metrics import accuracy_score, recall_score, precision_score
# Results get their own names so the imported functions are not overwritten
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
print("Accuracy: ", accuracy)
print("Recall: ", recall)
print("Precision: ", precision)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)


Accuracy:  0.9084967320261438
Recall:  0.9084967320261438
Precision:  0.912508261731659
Confusion Matrix: 
 [[122  22]
 [  6 156]]
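
Since both classifiers are evaluated the same way, the metric calls could be collected in one helper. This is a sketch rather than part of the submitted solution: the function name `evaluate` and the toy labels are made up, and `f1_score` is added because the task lists the F1 score among the metrics to report.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def evaluate(y_true, y_pred):
    """Collect the metrics requested in part c) in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

# Made-up labels, only to show the call.
y_true_demo = ["rural", "rural", "science", "science"]
y_pred_demo = ["rural", "science", "science", "science"]

report = evaluate(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(name, value, sep=": ")
```

In the notebook, the same helper could be called once per classifier with `y_test` and the corresponding `y_pred`.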
