# ASSIGNMENT 4 - Classification Empirical Study: Text Classification

## Group Description

Group Number: 25

Member Name 1: Jake Wang

Member Student Number 1: ***REMOVED***

Member Name 2: Victor Li

Member Student Number 2: ***REMOVED***

## Derived Datasets

This notebook is a starting point for Assignment 4. In this assignment, you will perform a classification empirical study. This notebook will help you to create derived datasets in Section 2 of the assignment.

In [None]:
# Let's start by installing spaCy
!pip install spacy

In [None]:
import spacy
import pandas as pd
import numpy as np

You have been given a list of datasets in the assignment description. Choose one of the datasets, provide the link below, and read the dataset using pandas. You should provide a link to your own GitHub repository even if you are using a reduced version of a dataset from your TA's repository.

<span style="color: #f00; font-size: xx-large">TODO: add description of the dataset and your justification of the choices made to obtain the derived datasets</span>

In [None]:
# url = "https://raw.githubusercontent.com/uOttawa-Collabs/CSI4106-Fall-2023/master/Assignment%204/reduced_drugsComTest_raw_fiveclasses.csv"
url = "https://raw.githubusercontent.com/uOttawa-Collabs/CSI4106-Fall-2023/master/Assignment%204/reduced_file_AirPassengerReviews.csv"
# url = "https://raw.githubusercontent.com/uOttawa-Collabs/CSI4106-Fall-2023/master/Assignment%204/reduced_file_cnnnews.csv"

In [None]:
print(url)
data = pd.read_csv(url)

In [None]:
data.head()

This is where you create the NLP pipeline. The `spacy download` command fetches the English model, and `spacy.load()` then loads it.

In [None]:
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Applying the pipeline to each text creates a `Doc` object in which every word is a `Token` object.

Doc: https://spacy.io/api/doc

Token: https://spacy.io/api/token

In [None]:
# Apply nlp pipeline to the column that has your sentences.
data["tokenized"] = data["customer_review"].apply(nlp)

In [None]:
data.head()

A Token object has many attributes such as part-of-speech (pos_), lemma (lemma_), etc. Take a look at the documentation to see all attributes.

The following function is an example of how to fetch the tokens that carry specific part-of-speech tags from a sentence. It returns lemmas because we only want the base form of each word.

In [None]:
# Create empty dataframes that will store your derived datasets
derived_dataset1 = pd.DataFrame(columns=["Class", "pos"])
derived_dataset2 = pd.DataFrame(columns=["Class", "pos-np"])

In [None]:
def get_pos(sentence, wanted_pos):  # wanted_pos refers to the desired pos tagging
    words = []
    for token in sentence:
        if token.pos_ in wanted_pos:
            words.append(token.lemma_)  # lemma is an integer hash; lemma_ is the string form
    return " ".join(words)  # return a string, not a list, because scikit-learn's text vectorizers expect strings
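
To see what `get_pos` keeps and drops without loading a spaCy model, the sketch below runs the same filtering logic on hand-tagged stand-in tokens. `FakeToken` and the tags assigned to each word are assumptions made for illustration; real `Token` objects come from `nlp(...)` and may be tagged differently.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy's Token, exposing only the attributes get_pos reads
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])


def get_pos(sentence, wanted_pos):
    # Same logic as the notebook's get_pos: keep lemmas whose POS tag is wanted
    return " ".join(token.lemma_ for token in sentence if token.pos_ in wanted_pos)


# Hand-tagged example (assumed tags, not model output)
toy_sentence = [
    FakeToken("The", "DET", "the"),
    FakeToken("flights", "NOUN", "flight"),
    FakeToken("were", "AUX", "be"),
    FakeToken("delayed", "VERB", "delay"),
    FakeToken("badly", "ADV", "badly"),
]

content_words = get_pos(toy_sentence, ["NOUN", "VERB", "ADJ"])
print(content_words)  # flight delay
```

The determiner, auxiliary, and adverb are dropped, and the remaining words are reduced to their lemmas.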

<span style="color: #f00; font-size: xx-large">TODO: Explain the choice of wanted POS.</span>

In [None]:
derived_dataset1["Class"] = data["NPS Score"]
derived_dataset1["pos"] = data["tokenized"].apply(lambda s: get_pos(s, ["NOUN", "VERB", "ADJ"]))

In [None]:
derived_dataset1.head()

In [None]:
def get_ne(sentence):
    words = []
    for entity in sentence.ents:
        words.append(entity.text)
    return " ".join(words)

In [None]:
# Change this line to fetch the POS tags you want for the second derived dataset
derived_dataset2["Class"] = data["NPS Score"]
derived_dataset2["pos-np"] = data["tokenized"].apply(lambda s: get_ne(s) + " " + get_pos(s, ["VERB", "ADJ"]))

In [None]:
derived_dataset2.head()

In [None]:
# For Derived Dataset 2, you also need to include Named Entities
# Below is just an example of obtaining such entities on a specific sentence, but you would do NER
#   on the dataset of your choice.
# You can choose the types of entities (dates, organization, people) that you want,
#   and then in your derived dataset, just make sure you include these entities separated by spaces (as shown for verbs)
#   in a previous cell.

sentence = "apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
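
If only some entity types are wanted, one possible approach is to filter on `label_` before joining the entity texts. The sketch below is a hedged illustration: `get_ne_by_label`, `FakeEntity`, and `FakeDoc` are hypothetical names, and the entities are labelled by hand rather than produced by the model, so the block runs without spaCy installed.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy's Span and Doc, exposing only what the function reads
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])
FakeDoc = namedtuple("FakeDoc", ["ents"])


def get_ne_by_label(sentence, wanted_labels):
    # Keep only the entities whose label is in wanted_labels, joined by spaces
    return " ".join(ent.text for ent in sentence.ents if ent.label_ in wanted_labels)


# Hand-labelled entities (assumed, not model output)
fake_doc = FakeDoc(ents=[
    FakeEntity("Apple", "ORG"),
    FakeEntity("U.K.", "GPE"),
    FakeEntity("$1 billion", "MONEY"),
])

organizations_and_places = get_ne_by_label(fake_doc, {"ORG", "GPE"})
print(organizations_and_places)  # Apple U.K.
```

With a real pipeline, the same function would accept the `Doc` objects stored in `data["tokenized"]`, since it only reads `ents`, `text`, and `label_`.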

Now that you have your derived datasets, you can move on to the classification task.

## Perform Classification Task

### Encode the text as input features with associated values

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform refits the vectorizer on every call, so each dataset gets its own vocabulary.
# str() on a spaCy Doc yields its original text.
data_vectorized = tfidf_vectorizer.fit_transform(data["tokenized"].apply(str))
derived_dataset1_vectorized = tfidf_vectorizer.fit_transform(derived_dataset1["pos"])
derived_dataset2_vectorized = tfidf_vectorizer.fit_transform(derived_dataset2["pos-np"])
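
To make the shape of the encoded features concrete, here is a minimal sketch on three invented reviews. The `toy_` names and the review texts are assumptions for illustration only; the vectorizer settings match the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration
toy_reviews = [
    "the flight was late",
    "the crew was friendly",
    "late flight and rude crew",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# One row per review, one column per vocabulary term
print(toy_matrix.shape)
# vocabulary_ maps each kept term to its column index; stop words are absent
print(sorted(toy_vectorizer.vocabulary_))
```

The result is a sparse matrix, which is why the evaluation function later indexes it by row with `X[train_index]`.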

### Define 2 models using some default parameters

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

default_logistic_regression = LogisticRegression()
default_multilayer_perceptron = MLPClassifier()

### Train/test/evaluate the 2 models with default parameters

***Initialize KFold Cross Validator***

Here we are initializing the `KFold` validator. Explanation of parameters:
* `n_splits=4` means we are using 4-fold validation.
* `shuffle=True` means the data will be shuffled before being split into folds.
* `random_state` is the random seed used for shuffling. We use a fixed number so that the splits are reproducible.

In [None]:
from sklearn.model_selection import KFold
four_fold = KFold(n_splits=4, shuffle=True, random_state=4106)
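
The effect of these parameters can be checked on a tiny example. The sketch below splits eight invented sample indices with the same settings; the `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_samples = np.arange(8)  # eight sample indices, invented for illustration
toy_kfold = KFold(n_splits=4, shuffle=True, random_state=4106)

test_folds = []
for fold, (train_index, test_index) in enumerate(toy_kfold.split(toy_samples)):
    # Each fold holds out 8 / 4 = 2 samples and trains on the other 6
    print(f"fold {fold}: train={train_index}, test={test_index}")
    test_folds.append(test_index)
```

Every sample lands in exactly one test fold, and rerunning the cell gives the same split because the seed is fixed.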

***Training and Evaluating the Model***

****Utility Functions****

Description for core function `train_and_evaluate`:

Inputs:
* `model`: The machine learning model to be trained and evaluated.
* `validator`: A cross-validator object used for splitting the data into training and testing sets.
* `csr_matrix_training`: A scipy `csr_matrix` containing input features for training the model.
* `series_validating`: A pandas `Series` of target labels for validating the model's predictions.

The function iterates through the training and testing sets created by the `validator`.
1. For each iteration, it trains the model using the training sets.
2. Then, it makes predictions on the test data.
3. Precision and recall scores are calculated for both micro and macro averages using the `precision_score` and `recall_score` functions from scikit-learn.
4. Zero division is handled by setting `zero_division=0`.

The calculated precision and recall scores are appended to respective lists.

After iterating through all folds, the function averages each list of scores with the helper function `average`.
It returns a tuple containing the average micro precision, average macro precision, average micro recall, and average macro recall, in that order.
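
The difference between micro and macro averaging, and the role of `zero_division=0`, can be seen on a small hand-made example. The labels below are invented for illustration; class 2 is never predicted, so its precision is undefined and is counted as 0.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: class 2 appears once in y_true and is never predicted
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 0]

# Micro: pool all decisions. 4 of 7 predictions are correct.
micro_p = precision_score(y_true, y_pred, average="micro", zero_division=0)
micro_r = recall_score(y_true, y_pred, average="micro", zero_division=0)

# Macro: average the per-class scores with equal weight.
# Precision per class: 3/5, 1/2, 0 (undefined, set to 0)  -> mean = 1.1 / 3
# Recall per class:    3/4, 1/2, 0                        -> mean = 1.25 / 3
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(f"micro precision={micro_p:.3f}, micro recall={micro_r:.3f}")
print(f"macro precision={macro_p:.3f}, macro recall={macro_r:.3f}")
```

For single-label classification, micro precision and micro recall both equal accuracy, so the macro scores are the ones that reveal poor performance on rare classes.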

In [None]:
from sklearn.metrics import precision_score, recall_score

def train_and_evaluate(model, validator, csr_matrix_training, series_validating):
    micro_precision_scores = []
    macro_precision_scores = []
    micro_recall_scores = []
    macro_recall_scores = []

    X = csr_matrix_training
    y = series_validating

    # Split the dataset into training set and testing set with validator
    for train_index, test_index in validator.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions ŷ
        y_hat = model.predict(X_test)

        # Calculate precision and recall for micro and macro averages
        micro_precision = precision_score(y_test, y_hat, average="micro", zero_division=0)
        macro_precision = precision_score(y_test, y_hat, average="macro", zero_division=0)
        micro_recall = recall_score(y_test, y_hat, average="micro", zero_division=0)
        macro_recall = recall_score(y_test, y_hat, average="macro", zero_division=0)

        # Append every fold's scores, zeros included, so the averages are not inflated
        micro_precision_scores.append(micro_precision)
        macro_precision_scores.append(macro_precision)
        micro_recall_scores.append(micro_recall)
        macro_recall_scores.append(macro_recall)

    # Calculate average precision and recall scores for micro and macro averages
    return (
        average(micro_precision_scores),
        average(macro_precision_scores),
        average(micro_recall_scores),
        average(macro_recall_scores)
    )


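def average(numeric_list):
    # Guard against an empty list so the call cannot raise ZeroDivisionError
    if not numeric_list:
        return 0.0
    return sum(numeric_list) / len(numeric_list)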


def print_result(
    average_micro_precision,
    average_macro_precision,
    average_micro_recall,
    average_macro_recall
):
    print("Average Micro Precision: {:.2f}".format(average_micro_precision))
    print("Average Macro Precision: {:.2f}".format(average_macro_precision))
    print("Average Micro Recall: {:.2f}".format(average_micro_recall))
    print("Average Macro Recall: {:.2f}".format(average_macro_recall))
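
As a possible cross-check on the hand-written loop, scikit-learn's `cross_validate` can compute the same four averages. The sketch below is an alternative, not the method used in this notebook: it wraps the vectorizer and the classifier in a pipeline so that TF-IDF is refit on each training fold only. The toy corpus, its labels, and the `toy_` names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: 6 positive and 6 negative reviews
toy_texts = [
    "great flight friendly crew", "comfortable seat great food",
    "friendly staff smooth boarding", "excellent service great crew",
    "smooth flight comfortable seat", "great food excellent staff",
    "late flight rude crew", "lost luggage terrible service",
    "rude staff dirty seat", "terrible food late boarding",
    "dirty cabin lost luggage", "late departure terrible crew",
]
toy_labels = np.array([1] * 6 + [0] * 6)

# The pipeline refits the vectorizer on each training fold only
toy_pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
toy_scoring = ["precision_micro", "precision_macro", "recall_micro", "recall_macro"]

toy_results = cross_validate(
    toy_pipeline,
    toy_texts,
    toy_labels,
    cv=KFold(n_splits=4, shuffle=True, random_state=4106),
    scoring=toy_scoring,
)

# cross_validate stores one score per fold under the key "test_<scorer name>"
for name in toy_scoring:
    print(f"{name}: {toy_results['test_' + name].mean():.2f}")
```

Scores from this variant need not match the loop above exactly, because here the vocabulary and IDF weights never see the held-out fold.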

****Logistic Regression on Original Dataset****

In [None]:
default_lr_original_evaluation = train_and_evaluate(
    default_logistic_regression,
    four_fold, data_vectorized,
    data["NPS Score"]
)

print_result(
    *default_lr_original_evaluation
)

****Multilayer Perceptron on Original Dataset****

In [None]:
default_mlp_original_evaluation = train_and_evaluate(
    default_multilayer_perceptron,
    four_fold, data_vectorized,
    data["NPS Score"]
)

print_result(
    *default_mlp_original_evaluation
)

****Logistic Regression on Derived Dataset #1****

In [None]:
default_lr_derived_1_evaluation = train_and_evaluate(
    default_logistic_regression,
    four_fold, derived_dataset1_vectorized,
    derived_dataset1["Class"]
)

print_result(
    *default_lr_derived_1_evaluation
)

****Multilayer Perceptron on Derived Dataset #1****

In [None]:
default_mlp_derived_1_evaluation = train_and_evaluate(
    default_multilayer_perceptron,
    four_fold, derived_dataset1_vectorized,
    derived_dataset1["Class"]
)

print_result(
    *default_mlp_derived_1_evaluation
)

****Logistic Regression on Derived Dataset #2****

In [None]:
default_lr_derived_2_evaluation = train_and_evaluate(
    default_logistic_regression,
    four_fold, derived_dataset2_vectorized,
    derived_dataset2["Class"]
)

print_result(
    *default_lr_derived_2_evaluation
)

****Multilayer Perceptron on Derived Dataset #2****

In [None]:
default_mlp_derived_2_evaluation = train_and_evaluate(
    default_multilayer_perceptron,
    four_fold, derived_dataset2_vectorized,
    derived_dataset2["Class"]
)

print_result(
    *default_mlp_derived_2_evaluation
)

### Train/test/evaluate the 2 models with modified parameters (#1)

### Train/test/evaluate the 2 models with modified parameters (#2)

## Analyze the obtained results

## References