# **Unit 2 Assignment**: Feature Engineering \& Supervised Classification
## *DATA 5420/6420*

In this second assignment you will train your own supervised classification model, either for document classification of some sort or for sentiment analysis. You will first select a labeled text dataset and train a supervised classifier on it, then apply that classifier to your dataset from Unit 1.

Next, you will find a pretrained supervised model on Hugging Face, which hosts a large collection of pretrained document classification and sentiment analysis models. You will compare the results of the model you trained against those of the pretrained model. This will help you decide how you might incorporate some form of either document classification or sentiment analysis into your final product.

**General breakdown of steps**:


1.   Select a labeled dataset to perform document classification or sentiment analysis
2.   Train at least two different models on the dataset, compare performance - If in the 6420 section, select at least 2 different models AND perform at least two steps of parameter tuning
3.   Apply the classification model to your dataset from Unit 1
4.   Examine results, speak to how well it appears to perform
5.   Apply a pretrained transformer model to your dataset from Unit 1
6.   Examine results, speak to how well it appears to perform
7.   Compare and contrast your trained model vs the pretrained model

**Some suggested datasets for document classification**:


*   Brown Corpus -- accessible through NLTK
*   20 Newsgroups -- accessible through scikit-learn
*   [Yelp Reviews Dataset](https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset)

**Some suggested datasets for sentiment analysis**:

*   [IMDB movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
*   [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)
*   Yelp Reviews Dataset - linked above

You are by no means limited to these datasets. [Kaggle](https://www.kaggle.com/datasets) has lots of datasets available for document classification and sentiment analysis, so you may find something more relevant to your dataset there. Just make sure it is labeled data (i.e., has a labeled class like positive or negative).


**Pretrained Models**:

You can find pretrained models for sentiment analysis and document classification on the models page for [HuggingFace](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending). Remember, tools like Poe, ChatGPT, Claude, etc. are excellent resources for developing code for implementing models such as these!!

Try something like: *I need a pretrained model from hugging face to do XYZ, can you provide python code*

In [1]:
# Import dependencies
import nltk
import re
from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import seaborn as sns
import matplotlib.pyplot as plt

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
# load in your selected labeled dataset
df = pd.read_csv(r'..\data\debates.csv')  # raw string so the backslashes are not read as escapes

In [6]:
df

| Unnamed: 0 | clean_speech | label |
|---|---|---|
| 0 | good evening washington dc welcome unique even... | democrat |
| 1 | come together tonight extraordinary time count... | democrat |
| 2 | setting debate also different reduce unnecessa... | democrat |
| 3 | come course four state florida arizona ohio il... | democrat |
| 4 | well first heart go already lost someone suffe... | democrat |
| ... | ... | ... |
| 9117 | here believe believe verge greatest time alive... | republican |
| 9118 | mr trump closing statement sir | republican |
| 9119 | country serious trouble dont win anymore dont ... | republican |
| 9120 | gentleman thank | republican |


**Will you be performing document classification or sentiment analysis? What is your outcome variable (i.e., positive, negative, genre type, etc.)**

I will be performing document classification. My outcome variable is political affiliation.

**Which dataset did you decide to go with and why?**

I decided to go with presidential campaign transcripts. I assigned each transcript the appropriate label, either republican or democrat, based on which debate it came from. I read an academic paper that said this kind of data makes great training data for this type of document classification. I think I still need to preprocess the training set more carefully, but where it is now will work for this assignment.

**What, if any cleaning or text normalization steps did you apply to this dataset and why?**

You'll need to look in the train_data.ipynb file for the code, but I only did the basic steps: removing stopwords, lowercasing all words, removing all special characters, and lemmatizing. I think I still need to chop off the beginning and end of each document, and I will do so before the final project. I may also break documents out into smaller segments.
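The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from train_data.ipynb: the function name `clean_text` is hypothetical, scikit-learn's built-in English stop-word list stands in for whichever list was actually used, and lemmatization is passed in as an optional callable (for example `WordNetLemmatizer().lemmatize` from NLTK) so the sketch runs without downloading NLTK data.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def clean_text(text, lemmatize=None, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, delete special characters, drop stop words, optionally lemmatize."""
    text = text.lower()
    # Assumption: special characters are deleted rather than replaced by spaces,
    # which matches tokens such as "dont" in the cleaned debates data.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


example = clean_text("The Senators' votes were COUNTED!!")
print(example)
```

Whatever the exact steps are, the same function should be applied to both the training data and any new text the model is asked to label.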

In [7]:
# perform feature engineering on your cleaned corpus
# Split data into text and labels
texts = df['clean_speech'].tolist()
labels = df['label'].tolist()

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [8]:
# Replace missing values (None/NaN) with empty strings in X_train and X_test
X_train = ['' if x is None or pd.isna(x) else x for x in X_train]
X_test = ['' if x is None or pd.isna(x) else x for x in X_test]

# Create a TF-IDF vectorizer (unigrams by default; drop terms found in more than 80% of documents)
vectorizer = TfidfVectorizer(min_df=1, max_df=0.80)

# Fit on the training text only, then reuse the fitted vocabulary for the test text
X_train_vec = vectorizer.fit_transform(X_train)  # determines the number of features
X_test_vec = vectorizer.transform(X_test)  # uses the same features as X_train_vec

# (number of training documents, number of features)
X_train_vec.shape

(7297, 10971)

**Which form of feature engineering did you choose (count or TFIDF) and did you go with unigrams, bigrams, etc.? Why?**

I chose TFIDF and unigrams because that seemed like an excellent place to start. 
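For comparison, the sketch below shows how the feature count grows when bigrams are added through `ngram_range`. The three toy sentences are made up for illustration and are not from the debates data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents, already "cleaned" in the same style as the corpus
toy_docs = [
    "cut tax middle class family",
    "raise tax wealthy american",
    "secure border protect american job",
]

unigram_vec = TfidfVectorizer(ngram_range=(1, 1))         # single words only
unigram_bigram_vec = TfidfVectorizer(ngram_range=(1, 2))  # words plus adjacent pairs

n_unigram = unigram_vec.fit_transform(toy_docs).shape[1]
n_with_bigrams = unigram_bigram_vec.fit_transform(toy_docs).shape[1]

print("unigram features:", n_unigram)
print("unigram + bigram features:", n_with_bigrams)
```

Bigrams can capture phrases such as "middle class", but they also enlarge the feature space quickly, so unigrams are a reasonable baseline before trying larger n-grams.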

**Next, train your supervised classifier. Remember:**



*   Create at least a training and a test set (fine if you don't have enough data to do a validation set)
*   Perform cross-validation
*   Train at least two different supervised classifiers on your training set
*   If in the 6420 section, also plan to try out at least two changes to the model parameters
* Apply your best performing model to the test set
* Provide model evaluation metrics



In [9]:
# fill in with coding steps to follow above instructions
# Define and train models
models = {
        "Multinomial Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Random Forest": RandomForestClassifier(random_state=42)
}

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")

Multinomial Naive Bayes Accuracy: 0.8581
Linear SVM Accuracy: 0.8614
Random Forest Accuracy: 0.8405
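The assignment also asks for cross-validation, while the loop above scores each model once on the held-out test set. A minimal sketch of how `cross_val_score` could be applied is below. The toy corpus is invented so the block runs on its own; in the notebook, the training texts and labels would take its place. Wrapping the vectorizer and classifier in a pipeline is one reasonable choice, since it refits the TF-IDF vocabulary inside each fold instead of letting the validation fold influence it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 6 documents per class
toy_texts = [
    "cut tax small business", "secure border strong military",
    "lower tax less regulation", "strong military secure nation",
    "tax cut grow business", "border security less regulation",
    "expand health care worker", "raise minimum wage worker",
    "climate change clean energy", "health care every family",
    "clean energy union job", "minimum wage family union",
]
toy_labels = ["republican"] * 6 + ["democrat"] * 6

candidates = {
    "Multinomial Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "Linear SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

cv_results = {}
for name, pipeline in candidates.items():
    # An integer cv with a classifier uses stratified folds
    scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean CV accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing mean cross-validated accuracy on the training set, and only then scoring the chosen model on the test set, keeps the test set as an untouched final check.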


In [10]:
param_grid_svc = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'loss': ['hinge', 'squared_hinge'],
    'max_iter': [1000, 5000, 10000]
}

# Grid search for LinearSVC (5-fold cross-validation by default)
grid_search_svc = GridSearchCV(LinearSVC(), param_grid_svc, verbose=10)
grid_search_svc.fit(X_train_vec, y_train)

# Report the best parameters found by the grid search
print("Best parameters for LinearSVC:", grid_search_svc.best_params_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV 1/5; 1/30] START C=0.001, loss=hinge, max_iter=1000.........................
[CV 1/5; 1/30] END C=0.001, loss=hinge, max_iter=1000;, score=0.540 total time=   0.0s
[CV 2/5; 1/30] START C=0.001, loss=hinge, max_iter=1000.........................
[CV 2/5; 1/30] END C=0.001, loss=hinge, max_iter=1000;, score=0.540 total time=   0.0s
[CV 3/5; 1/30] START C=0.001, loss=hinge, max_iter=1000.........................
[CV 3/5; 1/30] END C=0.001, loss=hinge, max_iter=1000;, score=0.539 total time=   0.0s
[CV 4/5; 1/30] START C=0.001, loss=hinge, max_iter=1000.........................
[CV 4/5; 1/30] END C=0.001, loss=hinge, max_iter=1000;, score=0.539 total time=   0.0s
[CV 5/5; 1/30] START C=0.001, loss=hinge, max_iter=1000.........................
[CV 5/5; 1/30] END C=0.001, loss=hinge, max_iter=1000;, score=0.539 total time=   0.0s
[CV 1/5; 2/30] START C=0.001, loss=hinge, max_iter=5000.........................
[CV 1/5; 2/30] EN

**Which model performed best and how do you know?**

The linear SVM performed best: it had the highest test accuracy of the three models (0.8614, versus 0.8581 and 0.8405). Grid search selected the following parameters: {'C': 1, 'loss': 'squared_hinge', 'max_iter': 1000}.
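Accuracy alone can hide class-specific errors, so per-class metrics are worth reporting for the tuned model. The sketch below uses an invented toy corpus and a much smaller grid than the notebook's; in the notebook, `grid_search_svc`, `X_test_vec`, and `y_test` would be used instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Invented toy corpus: 6 documents per class
toy_texts = [
    "cut tax small business", "secure border strong military",
    "lower tax less regulation", "strong military secure nation",
    "tax cut grow business", "border security less regulation",
    "expand health care worker", "raise minimum wage worker",
    "climate change clean energy", "health care every family",
    "clean energy union job", "minimum wage family union",
]
toy_labels = ["republican"] * 6 + ["democrat"] * 6

# Hold out 4 documents (2 per class) as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=4, stratify=toy_labels, random_state=42
)

vec = TfidfVectorizer()
X_tr_vec = vec.fit_transform(X_tr)
X_te_vec = vec.transform(X_te)

# Small illustrative grid; GridSearchCV refits the best model on all training data
search = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=2)
search.fit(X_tr_vec, y_tr)

y_hat = search.predict(X_te_vec)
class_names = ["democrat", "republican"]
cm = confusion_matrix(y_te, y_hat, labels=class_names)

print("Best parameters:", search.best_params_)
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_te, y_hat, labels=class_names, zero_division=0))
```

The confusion matrix shows whether errors fall mostly on one party, which a single accuracy number cannot reveal.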

**Now, bring in your dataset from Unit 1 and apply your best performing model to add labels to this dataset (sentiment or document class). Remember:**

*   Apply the same cleaning and text normalization steps to this dataset as you did the training data
*   Apply the same feature engineering type and parameters
*   Use the `.transform()` on your Unit 1 dataset with the vectorizer to ensure you match the number of features used to train your model
*  Store the predictions and your text observations in a dataframe
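The last two bullets can be sketched as follows. The training and new texts are invented placeholders, and `predicted_label` is a hypothetical column name. The key point is that the vectorizer is fit only on the training text, and new text goes through `.transform()` so it lands in the same feature space.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented training data standing in for the labeled corpus
train_texts = [
    "cut tax small business", "secure border strong military",
    "lower tax less regulation", "strong military secure nation",
    "expand health care worker", "raise minimum wage worker",
    "climate change clean energy", "health care every family",
]
train_labels = ["republican"] * 4 + ["democrat"] * 4

# Invented new documents standing in for the Unit 1 dataset
new_df = pd.DataFrame(
    {"clean_speech": ["tax cut strong military", "health care clean energy"]}
)

vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(train_texts)  # learns the vocabulary
clf = LinearSVC().fit(X_train_vec, train_labels)

X_new_vec = vec.transform(new_df["clean_speech"])  # reuses the learned vocabulary
assert X_new_vec.shape[1] == X_train_vec.shape[1]  # same number of features

# Keep the predictions next to the text they belong to
new_df["predicted_label"] = clf.predict(X_new_vec)
print(new_df)
```

Calling `fit_transform` on the new text instead would build a different vocabulary, and the classifier would either fail on the shape mismatch or silently read the wrong columns.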



In [28]:
speech_df = pd.read_csv(r'..\data\cleaned_speeches.csv')  # raw string so the backslashes are not read as escapes

In [16]:
speech_vec = vectorizer.transform(speech_df['clean_speech'].tolist())

In [19]:
# verify the correct number of features
speech_vec.shape

(4, 10971)

In [27]:
predicted_labels = grid_search_svc.predict(speech_vec)

for label, text in zip(predicted_labels, speech_df['clean_speech']):
    print(f"'{label}' is the prediction for '{text}'.")

'democrat' is the prediction for 'trump inaugural address chief justice robert president carter president clinton president bush president obama fellow american people world thank citizen america joined great national effort rebuild country restore promise people together determine course america world year come face challenge confront hardship get job done every four year gather step carry orderly peaceful transfer power grateful president obama first lady michelle obama gracious aid throughout transition magnificent today ceremony however special meaning today merely transferring power one administration another one party another – transferring power washington dc giving back american p eople long small group nation capital reaped reward government people borne cost washington flourished – people share wealth politician prospered – job left fac tory closed establishment protected citizen country victory victory triumph triumph celebrated nation capital little celebra te struggling fa

**Now examine your results, look at some individual observations and investigate whether the model predictions are logical/appear accurate. Describe your findings below:**

Instead of using the dataset I put together for the Unit 1 Assignment, I put together a really small dataset for this assignment. I made this switch because my Unit 1 dataset is no longer relevant to my final project. The predictions were correct for three of the four speeches (75%). The one mistake was classifying Donald Trump's inaugural address as democrat. Considering how different an inaugural address is from a presidential candidate debate, I think the mislabeling comes more from trying to classify out-of-domain data than from the characteristics of the speaker.
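One way to check whether a prediction like this was a close call is to look at the SVM's signed distance from the decision boundary, which `LinearSVC` exposes through `decision_function`. The sketch below uses invented toy data; in the notebook the same call could be made on `grid_search_svc` with `speech_vec`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented training data standing in for the debates corpus
train_texts = [
    "cut tax small business", "secure border strong military",
    "lower tax less regulation", "strong military secure nation",
    "expand health care worker", "raise minimum wage worker",
    "climate change clean energy", "health care every family",
]
train_labels = ["republican"] * 4 + ["democrat"] * 4

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_texts), train_labels)

# First text overlaps the training vocabulary; second text shares no words with it
new_texts = ["tax cut secure border", "fellow citizen thank gather today"]
X_new = vec.transform(new_texts)

preds = clf.predict(X_new)
margins = clf.decision_function(X_new)  # positive values favor clf.classes_[1]

for text, pred, margin in zip(new_texts, preds, margins):
    print(f"{pred:>10}  margin={margin:+.3f}  {text}")
```

The second toy text shares no vocabulary with the training set, so its margin is just the model's intercept and the classifier is effectively guessing. A small margin on a real speech would suggest a similar lack of evidence.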

**Now select a pretrained model from Hugging Face (linked above) and make predictions onto your Unit 1 dataset. Compare how it appears to perform against how the model you trained appeared to perform.**

In [62]:
# download/import hugging face model
# apply to your dataset
# store the predictions as another column in your corpus dataframe

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Use the speech text itself; a quoted "speech_df[...]" would classify that literal string
text = speech_df['clean_speech'][2]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

model = AutoModelForSequenceClassification.from_pretrained("bucketresearch/politicalBiasBERT")

# Truncate long speeches to BERT's 512-token input limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only: no labels or gradients are needed to get the logits
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = torch.argmax(logits, dim=1).item()
# [0] -> left
# [1] -> center
# [2] -> right
# print(logits.softmax(dim=-1)[0].tolist())
print(predicted_label)


1


0 (Trump) - center  
1 (Obama) - center  
2 (Bush) - center  
3 (Biden) - center  

This model adds a third class, 'center', which makes a lot more sense for these data. It predicted 'center' for all four speeches.
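The commented-out softmax line in the cell above hints at a useful check: the argmax alone does not show how close the three classes were. The sketch below converts a logits vector to probabilities and a label name using NumPy. The logits values are made up for illustration and are not output from politicalBiasBERT; the index-to-label mapping follows the comments in the cell above, and `logits_to_prediction` is a hypothetical helper name.

```python
import numpy as np

# Mapping taken from the comments in the notebook cell
ID2LABEL = {0: "left", 1: "center", 2: "right"}


def logits_to_prediction(logits):
    """Return (label, probabilities) for one vector of class logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return ID2LABEL[int(probs.argmax())], probs


example_logits = [0.2, 1.1, 0.9]  # hypothetical values, not real model output
label, probs = logits_to_prediction(example_logits)
print(label, np.round(probs, 3))
```

With these made-up logits, 'center' wins with less than half of the probability mass. If the real predictions look like that, "all center" may reflect an uncertain model more than a confident judgment.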

**How could you incorporate supervised classification (document or sentiment classification) into a product? -- think about what it could be useful for as we continue to work towards your final project.**

- Tax Policy Type Classification
    - Categorize tax code into certain types of policy; this could be used in an attempt to measure tax complexity based on forms, instructions, and/or code
- Taxpayer Sentiment Analysis
    - How does a group of taxpayers feel about a certain tax policy?
- Tax Policy Alignment Index
    - How much does a firm respond to an incentive created by tax policy?
- Measure of Tax Aggressiveness
    - How aggressively does a firm act with regard to tax uncertainty?
