The classifier should be trained and tested on the FPB dataset. Shuffle the FPB data with random
seed 42 and split it into training, validation, and test sets with an 80/10/10% ratio (e.g., use
random_state in sklearn.model_selection.train_test_split). Note that the data file is not
encoded in UTF-8, the default character encoding. You may need to specify the encoding
ISO-8859-1 when you load the file.
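
As a small illustration of why the encoding matters, the sketch below encodes a made-up string (hypothetical, not taken from FPB.csv) as ISO-8859-1 and shows that decoding those bytes as UTF-8 fails, while decoding them as ISO-8859-1 recovers the text.

```python
# Hypothetical sample text, not taken from the FPB file.
text = "Hämeenlinna plant"
raw = text.encode("iso-8859-1")  # bytes as they would be stored in a Latin-1 file

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The Latin-1 byte for "ä" (0xE4) is not followed by valid UTF-8 continuation bytes.
    utf8_ok = False

latin1_text = raw.decode("iso-8859-1")
print(utf8_ok, latin1_text == text)
```

The same mismatch is what pandas runs into when a Latin-1 file is read with the default encoding.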

In [11]:
import pandas as pd
import re
from nltk.stem import PorterStemmer

In [12]:
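file_path = "HW1/FPB.csv"
# Note: the file has no header row, so with default settings read_csv uses the
# first record as the column names (visible in the output below).
# Passing header=None would keep that record as data.
fpb = pd.read_csv(file_path, encoding="ISO-8859-1")
fpb.head()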

| | neutral | According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing . |
|---|---|---|
| 0 | neutral | Technopolis plans to develop in stages an area... |
| 1 | negative | The international electronic industry company ... |
| 2 | positive | With the new production plant the company woul... |
| 3 | positive | According to the company 's updated strategy f... |
| 4 | positive | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... |


## Problem 1: Text Pre-Processing

Pre-process the text data using text preprocessing techniques such as tokenization, stemming, etc.
Be sure to make each step clear in your code. In the writeup, show the results of tokenization and
stemming for the first 5 sentences in the original FPB.csv file.

In [13]:
# Rename columns for clarity
fpb.columns = ["Sentiment", "Headline"]

In [14]:
import nltk
nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # English stop-word list

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Tokenization

In [15]:
ps = PorterStemmer() 
stop = set(stopwords.words('english'))

# Return a flat list of tokens, or one token list per sentence if need_sent=True
def pre_processing_by_nltk1(doc, stemming = False, need_sent = False):
    # step 1: split the document into sentences
    sentences = sent_tokenize(doc)
    tokens = []
    for sent in sentences:
        # step 2: split each sentence into word tokens
        words = word_tokenize(sent)
        # step 3 (optional): stemming
        if stemming:
            words = [ps.stem(word) for word in words]
        # step 4: lowercase and drop stop words
        # (done per sentence so that need_sent=True also works)
        words = [w.lower() for w in words if w.lower() not in stop]
        if need_sent:
            tokens.append(words)
        else:
            tokens += words
    return tokens

In [16]:
fpb.Headline[:5].apply(pre_processing_by_nltk1)

0    [technopolis, plans, develop, stages, area, le...
1    [international, electronic, industry, company,...
2    [new, production, plant, company, would, incre...
3    [according, company, 's, updated, strategy, ye...
4    [financing, aspocomp, 's, growth, aspocomp, ag...
Name: Headline, dtype: object

### Stemming

In [17]:
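# Same pipeline as pre_processing_by_nltk1, but with stemming enabled by default
def pre_processing_by_nltk2(doc, stemming = True, need_sent = False):
    # step 1: split the document into sentences
    sentences = sent_tokenize(doc)
    tokens = []
    for sent in sentences:
        # step 2: split each sentence into word tokens
        words = word_tokenize(sent)
        # step 3: stemming with the Porter stemmer
        if stemming:
            words = [ps.stem(word) for word in words]
        # step 4: lowercase and drop stop words
        # (the filter runs after stemming, so it compares stems against the stop list)
        words = [w.lower() for w in words if w.lower() not in stop]
        if need_sent:
            tokens.append(words)
        else:
            tokens += words
    return tokens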

In [18]:
fpb.Headline[:5].apply(pre_processing_by_nltk2)

0    [technopoli, plan, develop, stage, area, less,...
1    [intern, electron, industri, compani, elcoteq,...
2    [new, product, plant, compani, would, increas,...
3    [accord, compani, 's, updat, strategi, year, 2...
4    [financ, aspocomp, 's, growth, aspocomp, aggre...
Name: Headline, dtype: object

## Problem 2: Bag Of Words

Train a text classifier (e.g., logistic regression) using each of the following document representation
techniques and report AUROC, macro-F1, and micro-F1 on the test set. You are highly
encouraged to pay attention to details. If your model raises a warning and fails to converge,
think about ways to address it.

i. Each document is represented as a binary-valued vector of dimension equal to the size of the
vocabulary. The value at an index is 1 if the word corresponding to that index is present in
the document, else 0.

ii. A document is represented by a vector of dimension equal to the size of the vocabulary where
the value corresponding to each word is its frequency in the document.

iii. Each document is represented by a vector of dimension equal to the size of the vocabulary
where the value corresponding to each word is its tf-idf value.
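
The three representations can be sketched on a toy corpus in plain Python. The corpus and helper names below are made up for illustration. The tf-idf variant assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what the scikit-learn documentation describes as the TfidfVectorizer defaults. Other tf-idf definitions exist and give different numbers.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already tokenized.
docs = [
    ["profit", "rose", "profit"],
    ["profit", "fell"],
    ["sales", "rose"],
]

vocab = sorted({w for d in docs for w in d})    # fixed column order
n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))   # document frequency per word

def binary_vector(doc):
    # (i) 1 if the word occurs in the document, else 0
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

def count_vector(doc):
    # (ii) raw term frequency of each word in the document
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    # (iii) term frequency times smoothed idf, then L2-normalized
    # (assumes a non-empty document, so the norm is non-zero)
    counts = Counter(doc)
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
print(binary_vector(docs[0]), count_vector(docs[0]))
print([round(v, 3) for v in tfidf_vector(docs[0])])
```

In the first document "profit" appears twice, so the binary vector records 1 where the count vector records 2. The tf-idf vector additionally down-weights words that occur in many documents.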

In [19]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import train_test_split

In [20]:
# Convert Sentiment labels to numeric values
label_mapping = {"negative": 0, "neutral": 1, "positive": 2}
fpb["Sentiment"] = fpb["Sentiment"].map(label_mapping)

# Split data into train, validation, and test sets (80/10/10), stratified by label
train_fpb, temp_fpb = train_test_split(fpb, test_size=0.2, random_state=42, stratify=fpb["Sentiment"])
val_fpb, test_fpb = train_test_split(temp_fpb, test_size=0.5, random_state=42, stratify=temp_fpb["Sentiment"])

# Extract Labels
y_train = train_fpb["Sentiment"]
y_val = val_fpb["Sentiment"]
y_test = test_fpb["Sentiment"]

len(train_fpb), len(val_fpb), len(test_fpb)

(3876, 484, 485)
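
The sizes above can be reproduced by hand. The sketch below assumes that train_test_split rounds the test portion up and gives the remainder to the training portion. That is my understanding of its behavior for a fractional test_size, but treat it as an assumption. The helper split_sizes is hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    # Assumption: the test portion is rounded up, the rest goes to the train portion.
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

# 4845 rows in total (3876 + 484 + 485).
n_train, n_temp = split_sizes(4845, 0.2)   # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)   # held-out part halved into val / test
print(n_train, n_val, n_test)
```

Splitting off 20% first and then halving it is what yields the 80/10/10 ratio. The odd held-out count of 969 explains why validation and test differ by one row.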

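Macro-F1 (class-wise averaging): compute the F1 score for each class separately, then take the unweighted mean of the per-class scores.
Per-class F1 = 2TP / (2TP + FP + FN)

Micro-F1 (instance-wise averaging): sum TP, FP, and FN over all classes, then compute a single F1 score from the pooled counts.
Micro-F1 = 2ΣTP / (2ΣTP + ΣFP + ΣFN)

- Use Macro-F1 if you care about rare classes, since every class contributes equally.
- Use Micro-F1 if overall accuracy is the priority. With exactly one label per document, Micro-F1 equals accuracy, which is why the Validation Accuracy and Micro-F1 columns match in the results below.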
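
To make the difference between the two averages concrete, here is a minimal sketch that computes both by hand on made-up labels (the data and helper names are hypothetical), using the same 0/1/2 class coding as the notebook.

```python
# Toy labels (made up): 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 1, 1, 1, 2, 2, 2, 1, 2]
classes = sorted(set(y_true))

def confusion_counts(label):
    # One-vs-rest counts of true positives, false positives, false negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

per_class = [confusion_counts(c) for c in classes]

# Macro: average the per-class F1 scores.
macro_f1 = sum(f1(*counts) for counts in per_class) / len(classes)

# Micro: pool the counts over classes, then compute one F1 score.
micro_f1 = f1(*(sum(col) for col in zip(*per_class)))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro_f1, 4), micro_f1, accuracy)
```

On this toy data the smallest class (label 0) has the same weight as the others in Macro-F1, while Micro-F1 coincides with plain accuracy.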

In [21]:
# Determine Dictionary Sizes
vectorizer = CountVectorizer()
vectorizer.fit(train_fpb["Headline"])  # Learn vocabulary only from training data
total_vocab_size = len(vectorizer.vocabulary_)

dictionary_sizes = [
    int(0.1 * total_vocab_size),  
    int(0.25 * total_vocab_size), 
    int(0.5 * total_vocab_size),  
    total_vocab_size  
]

In [22]:
# Evaluate Logistic Regression Model
def evaluate_model(vectorizer, dictionary_size, regularization_constant, rep_name):
    """Trains and evaluates a model using different feature representations."""
    vectorizer.set_params(max_features=dictionary_size)
    X_train = vectorizer.fit_transform(train_fpb["Headline"])
    X_val = vectorizer.transform(val_fpb["Headline"])

    model = LogisticRegression(C=regularization_constant, max_iter=500, solver="lbfgs")
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)  # Needed for AUROC

    # Compute evaluation metrics
    accuracy = accuracy_score(y_val, y_pred)
    macro_f1 = f1_score(y_val, y_pred, average="macro")
    micro_f1 = f1_score(y_val, y_pred, average="micro")
    auroc = roc_auc_score(y_val, y_pred_proba, multi_class="ovr")  

    return rep_name, dictionary_size, regularization_constant, accuracy, macro_f1, micro_f1, auroc
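
For reference, the multi-class AUROC requested above with multi_class="ovr" can be pictured as follows: score each class against the rest, compute a binary AUROC per class, and average the results. The sketch below is a hand-rolled illustration on made-up probabilities, not scikit-learn's implementation. It assumes macro averaging and that class label k indexes column k of the probability matrix. The function names are hypothetical.

```python
def binary_auc(labels, scores):
    # Probability that a random positive is scored above a random negative;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(y_true, proba):
    # One-vs-rest: treat each class in turn as the positive class,
    # then take the unweighted mean of the per-class AUROCs.
    aucs = []
    for k in sorted(set(y_true)):
        labels = [1 if y == k else 0 for y in y_true]
        scores = [row[k] for row in proba]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)

# Made-up predicted probabilities for six documents and three classes.
y_true = [0, 1, 2, 1, 2, 0]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
]
auroc = ovr_macro_auc(y_true, proba)
print(round(auroc, 4))
```

Because it depends only on how the probabilities rank the documents, this metric can move differently from the F1 scores, which depend on the predicted labels.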

In [23]:
regularization_constants = [0.1, 1, 10]  
binary_vectorizer = CountVectorizer(binary=True)
freq_vectorizer = CountVectorizer(binary=False)
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2')

In [24]:
# Tune Hyperparameters Using Validation Set
results = []
for size in dictionary_sizes:
    for c in regularization_constants:
        results.append(evaluate_model(binary_vectorizer, size, c, "Binary"))
        results.append(evaluate_model(freq_vectorizer, size, c, "Frequency"))
        results.append(evaluate_model(tfidf_vectorizer, size, c, "TF-IDF"))


In [25]:
results_df = pd.DataFrame(results, columns=["Representation", "Dictionary Size", "Regularization (C)", "Validation Accuracy", "Macro-F1", "Micro-F1", "AUROC"])
results_df.head()

| | Representation | Dictionary Size | Regularization (C) | Validation Accuracy | Macro-F1 | Micro-F1 | AUROC |
|---|---|---|---|---|---|---|---|
| 0 | Binary | 898 | 0.1 | 0.764463 | 0.689895 | 0.764463 | 0.859768 |
| 1 | Frequency | 898 | 0.1 | 0.760331 | 0.684003 | 0.760331 | 0.860656 |
| 2 | TF-IDF | 898 | 0.1 | 0.694215 | 0.493466 | 0.694215 | 0.82909 |
| 3 | Binary | 898 | 1.0 | 0.762397 | 0.713571 | 0.762397 | 0.852974 |
| 4 | Frequency | 898 | 1.0 | 0.741736 | 0.673417 | 0.741736 | 0.84683 |


In [26]:
# Select best hyperparameters separately for each representation
best_params_binary = results_df.loc[results_df[results_df["Representation"] == "Binary"]["Macro-F1"].idxmax()]
best_params_frequency = results_df.loc[results_df[results_df["Representation"] == "Frequency"]["Macro-F1"].idxmax()]
best_params_tfidf = results_df.loc[results_df[results_df["Representation"] == "TF-IDF"]["Macro-F1"].idxmax()]
best_params_binary, best_params_frequency, best_params_tfidf

(Representation           Binary
 Dictionary Size            8981
 Regularization (C)          1.0
 Validation Accuracy     0.78719
 Macro-F1               0.737779
 Micro-F1                0.78719
 AUROC                  0.862295
 Name: 30, dtype: object,
 Representation         Frequency
 Dictionary Size             2245
 Regularization (C)           1.0
 Validation Accuracy      0.77686
 Macro-F1                0.729881
 Micro-F1                 0.77686
 AUROC                   0.852602
 Name: 13, dtype: object,
 Representation           TF-IDF
 Dictionary Size            2245
 Regularization (C)         10.0
 Validation Accuracy    0.774793
 Macro-F1               0.730011
 Micro-F1               0.774793
 AUROC                  0.856902
 Name: 17, dtype: object)

In [27]:
# Train the Final Model Using Best Parameters for Each Representation
import scipy.sparse

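# Merge Training & Validation Sets
# Note: each vectorizer still carries the max_features value set by the last
# tuning iteration (the full vocabulary), not the best Dictionary Size selected
# above, and its vocabulary is fit on the training split only.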
X_final_train_binary = scipy.sparse.vstack([
    binary_vectorizer.fit_transform(train_fpb["Headline"]),
    binary_vectorizer.transform(val_fpb["Headline"])
])
X_final_train_freq = scipy.sparse.vstack([
    freq_vectorizer.fit_transform(train_fpb["Headline"]),
    freq_vectorizer.transform(val_fpb["Headline"])
])
X_final_train_tfidf = scipy.sparse.vstack([
    tfidf_vectorizer.fit_transform(train_fpb["Headline"]),
    tfidf_vectorizer.transform(val_fpb["Headline"])
])

y_final_train = pd.concat([y_train, y_val])

# Train final models using best hyperparameters for each representation
final_model_binary = LogisticRegression(C=best_params_binary["Regularization (C)"], 
                                        max_iter=500, solver='lbfgs')
final_model_frequency = LogisticRegression(C=best_params_frequency["Regularization (C)"], 
                                           max_iter=500, solver='lbfgs')
final_model_tfidf = LogisticRegression(C=best_params_tfidf["Regularization (C)"], 
                                       max_iter=500, solver='lbfgs')

# Fit each model
final_model_binary.fit(X_final_train_binary, y_final_train)
final_model_frequency.fit(X_final_train_freq, y_final_train)
final_model_tfidf.fit(X_final_train_tfidf, y_final_train)

In [28]:
# Evaluate the Final Model on the Test Set
def evaluate_final_model(model, vectorizer, X_test, y_test, rep_name):
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)

    test_auroc = roc_auc_score(y_test, y_pred_proba, multi_class="ovr")
    test_macro_f1 = f1_score(y_test, y_pred, average="macro")
    test_micro_f1 = f1_score(y_test, y_pred, average="micro")

    return [rep_name, test_auroc, test_macro_f1, test_micro_f1]

In [29]:
final_results = []
final_results.append(evaluate_final_model(final_model_binary, binary_vectorizer, binary_vectorizer.transform(test_fpb["Headline"]), y_test, "Binary"))
final_results.append(evaluate_final_model(final_model_frequency, freq_vectorizer, freq_vectorizer.transform(test_fpb["Headline"]), y_test, "Frequency"))
final_results.append(evaluate_final_model(final_model_tfidf, tfidf_vectorizer, tfidf_vectorizer.transform(test_fpb["Headline"]), y_test, "TF-IDF"))

final_results_df = pd.DataFrame(final_results, columns=["Representation", "AUROC", "Macro-F1", "Micro-F1"])
final_results_df

| | Representation | AUROC | Macro-F1 | Micro-F1 |
|---|---|---|---|---|
| 0 | Binary | 0.899281 | 0.740409 | 0.797938 |
| 1 | Frequency | 0.895827 | 0.732831 | 0.793814 |
| 2 | TF-IDF | 0.895387 | 0.721844 | 0.785567 |
