<a href="https://colab.research.google.com/github/xhaactre/IST736-HW/blob/main/IST736_HW01_Dujun_copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')


1.	Download the Kaggle Financial Sentiment Data

In [None]:
# s1: data collection - kaggle financial sentiment data

# the dataset 'data.csv' is already downloaded using the professor's link,
# and it's uploaded to google drive in the same folder as the jupyter notebook.
# let's load it and check the first few rows to make sure it's ready for use.

import pandas as pd

# load the dataset from google drive
file_path = '/content/drive/MyDrive/IST708HW02/data.csv'  # update this path if necessary
df = pd.read_csv(file_path)

# display the first few rows to verify the data is loaded properly
df.head()  # quick check to make sure we're good to go


2. Use a randomized sample of 80% data for training, and the rest 20% for testing

In [None]:
# s2: split the data into training and testing sets (80% training, 20% testing)

from sklearn.model_selection import train_test_split

# X is the input data (the sentences) and y is the target labels (sentiments)
X = df['Sentence']  # sentences (input)
y = df['Sentiment']  # sentiment labels (output)

# split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# check the size of the splits to ensure it's correct
print(f"Training data size: {len(X_train)}")
print(f"Testing data size: {len(X_test)}")


3. Build a LinearSVC classifier using unigrams. You can decide on the other vectorization options.

a.	Report the top 20 positive features and negative features.

b.	Report the f1 and accuracy results.

c.	Examine up to 25 FP and FN errors and report linguistic patterns.


In [None]:
# s3: build a LinearSVC classifier using unigrams

# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# define file path for the dataset in google drive
file_path = '/content/drive/MyDrive/IST708HW02/data.csv'  # file is already accessible

# load the dataset
df = pd.read_csv(file_path)

# separate features (X) and labels (y)
X = df['Sentence']  # input text
y = df['Sentiment']  # sentiment labels

# split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# convert text into numerical format using tf-idf (unigrams only)
vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit and transform training data
X_test_tfidf = vectorizer.transform(X_test)  # transform test data (same vectorizer)

# train the LinearSVC model
svc = LinearSVC(random_state=42, dual=False)  # linear support vector classifier, dual=False for efficiency
svc.fit(X_train_tfidf, y_train)  # train on labeled data

# extract feature names
feature_names = np.array(vectorizer.get_feature_names_out())  # get all feature names

# LinearSVC trains one classifier per class (one-vs-rest), so svc.coef_ holds one
# row of feature weights per class, in the order given by svc.classes_
classes = list(svc.classes_)
positive_coefs = svc.coef_[classes.index('positive')]  # weights for the 'positive' class
negative_coefs = svc.coef_[classes.index('negative')]  # weights for the 'negative' class

# get the 20 features with the largest weight for each class
top_positive_idx = positive_coefs.argsort()[-20:][::-1]  # strongest 'positive' indicators
top_negative_idx = negative_coefs.argsort()[-20:][::-1]  # strongest 'negative' indicators

# get the actual words
top_positive_features = feature_names[top_positive_idx]
top_negative_features = feature_names[top_negative_idx]

# s3a: visualize top 20 positive & negative features (black/gray bars)
plt.figure(figsize=(12, 6))

# positive features bar chart (black bars)
plt.subplot(1, 2, 1)
plt.barh(top_positive_features, positive_coefs[top_positive_idx], color='black')
plt.xlabel("coefficient for the 'positive' class")
plt.title("top 20 positive features")
plt.gca().invert_yaxis()  # largest weight at the top

# negative features bar chart (gray bars)
plt.subplot(1, 2, 2)
plt.barh(top_negative_features, negative_coefs[top_negative_idx], color='gray')
plt.xlabel("coefficient for the 'negative' class")
plt.title("top 20 negative features")
plt.gca().invert_yaxis()  # largest weight at the top

plt.tight_layout()
plt.show()

# s3b: model performance (using a table, no visualization)
y_pred = svc.predict(X_test_tfidf)  # make predictions on test data

# compute accuracy and f1 score
accuracy = accuracy_score(y_test, y_pred)  # percentage of correct predictions
report = classification_report(y_test, y_pred, output_dict=True)  # full classification report
f1_score = report['weighted avg']['f1-score']  # f1 averaged over classes, weighted by class support

# display model performance as a table
evaluation_df = pd.DataFrame({
    "metric": ["accuracy", "f1 score"],
    "score": [accuracy, f1_score]
})
print("\nmodel performance metrics:")
print(evaluation_df)

# s3c: analyze false positives (FP) and false negatives (FN)
false_positives = []  # predicted positive, but should not be
false_negatives = []  # predicted negative, but should not be

# iterate through test data and collect errors
for i in range(len(y_test)):
    if y_pred[i] == 'positive' and y_test.iloc[i] != 'positive':
        false_positives.append((X_test.iloc[i], y_test.iloc[i], y_pred[i]))
    elif y_pred[i] == 'negative' and y_test.iloc[i] != 'negative':
        false_negatives.append((X_test.iloc[i], y_test.iloc[i], y_pred[i]))

# limit to first 25 examples of each
num_fp_display = min(25, len(false_positives))
num_fn_display = min(25, len(false_negatives))

# create dataframe for fp & fn errors
fp_fn_df = pd.DataFrame({
    "sentence": [fp[0] for fp in false_positives[:num_fp_display]] + [fn[0] for fn in false_negatives[:num_fn_display]],
    "true label": [fp[1] for fp in false_positives[:num_fp_display]] + [fn[1] for fn in false_negatives[:num_fn_display]],
    "predicted label": [fp[2] for fp in false_positives[:num_fp_display]] + [fn[2] for fn in false_negatives[:num_fn_display]]
})

print("\nfalse positives & false negatives:")
print(fp_fn_df)


# s4: build a logistic regression classifier using fasttext embeddings

In [None]:
# s1: load dataset and download fasttext embeddings

# install fasttext (if not already installed)
!pip install fasttext

# load necessary libraries
import pandas as pd
import numpy as np
import fasttext.util  # for downloading and loading fasttext embeddings
from sklearn.model_selection import train_test_split

# define file path for the dataset (UPDATED TO CORRECT DIRECTORY)
file_path = '/content/drive/MyDrive/IST708HW02/data.csv'  # correct dataset path

# load dataset
df = pd.read_csv(file_path)

# separate features (X) and labels (y)
X = df['Sentence']  # input text, raw sentences
y = df['Sentiment']  # sentiment labels (positive, negative, neutral)

# split data into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# download and load pre-trained fasttext embeddings (english, 300-dimensional vectors)
fasttext.util.download_model('en', if_exists='ignore')  # download model if not already present
ft = fasttext.load_model('cc.en.300.bin')  # load fasttext model from local storage

# fasttext embeddings are now ready to be used for sentence vectorization
print("dataset loaded, fasttext embeddings downloaded and initialized")


**s2: convert sentences into fasttext embeddings (high-level breakdown)**

 before training logistic regression, sentences must be converted into numerical format.

key steps:

-split each sentence into words

-retrieve the fasttext vector for each word (if available)

-ignore words that are not in the fasttext vocabulary

-average the word vectors to create a sentence vector

-apply this process to all training and test sentences

-this generates a meaningful numeric representation for each sentence, making it ready for classification
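
the averaging step can be sketched on its own with toy data. this is only an illustration: `toy_vectors` is a hypothetical plain dict standing in for the fasttext model, its 3-dimensional values are made up, and `average_vector` is a helper name introduced here.

```python
# toy stand-in for the fasttext model: a plain dict of 3-dimensional vectors
# (hypothetical values; real fasttext vectors have 300 dimensions)
toy_vectors = {
    "profit": [1.0, 0.0, 2.0],
    "rose": [3.0, 2.0, 0.0],
}

def average_vector(sentence, vectors, dim=3):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    # zip(*known) groups the values dimension by dimension
    return [sum(values) / len(known) for values in zip(*known)]

sentence_vec = average_vector("profit rose sharply", toy_vectors)
print(sentence_vec)  # [2.0, 1.0, 1.0]  ("sharply" is unknown and ignored)
```

every sentence ends up with a vector of the same length no matter how many words it has, which is what lets a standard classifier consume it.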

In [None]:
# s2: convert sentences into fasttext embeddings

# function to convert a sentence into a fasttext vector
def sentence_to_vector(sentence, model, embedding_dim=300):
    words = sentence.split()  # split sentence into words
    word_vectors = [model.get_word_vector(word) for word in words if word in model]  # get word vectors

    if len(word_vectors) == 0:  # if none of the words are in fasttext vocab
        return np.zeros(embedding_dim)  # return a zero vector for empty cases

    return np.mean(word_vectors, axis=0)  # compute the average word vector

# convert training and testing sentences into embeddings
X_train_vectors = np.array([sentence_to_vector(sentence, ft) for sentence in X_train])
X_test_vectors = np.array([sentence_to_vector(sentence, ft) for sentence in X_test])

print("sentences converted into fasttext embeddings")


**s3: train logistic regression classifier (high-level breakdown)**


sentences have been converted into fasttext embeddings, allowing a logistic regression classifier to be trained.

key steps:

-train a logistic regression model using the sentence embeddings

-make predictions on the test data

-evaluate performance using accuracy and f1 score

-identify false positives (FP) and false negatives (FN) to analyze errors

-this will show how well fasttext embeddings perform for sentiment classification.
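
accuracy and f1 are different quantities: accuracy is the share of correct predictions, while f1 combines precision and recall per class and then averages over classes. a minimal sketch on made-up labels (not the notebook's data), with `accuracy_and_macro_f1` as a hypothetical helper written in plain python:

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    f1_per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_per_class) / len(labels)

# made-up labels: a model that always predicts "neutral"
toy_true = ["neutral", "neutral", "neutral", "positive", "negative"]
toy_pred = ["neutral", "neutral", "neutral", "neutral", "neutral"]
toy_accuracy, toy_macro_f1 = accuracy_and_macro_f1(toy_true, toy_pred)
print(toy_accuracy, toy_macro_f1)  # 0.6 0.25
```

the always-neutral model looks acceptable on accuracy but poor on macro f1, which is why the two numbers are worth reporting separately. scikit-learn's `f1_score(y_true, y_pred, average='macro')` computes the same macro average.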

In [None]:
# s3: train logistic regression classifier

# load logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# create the model
clf = LogisticRegression(max_iter=1000, random_state=42)  # increase max_iter to ensure convergence

# train the model on sentence embeddings
clf.fit(X_train_vectors, y_train)  # learns the relationship between embeddings and sentiment labels

# make predictions on test data
y_pred = clf.predict(X_test_vectors)  # uses trained model to predict sentiment on test sentences

# evaluate model performance (accuracy and f1 score)
accuracy = accuracy_score(y_test, y_pred)  # calculates the percentage of correct predictions
report = classification_report(y_test, y_pred, output_dict=True)  # generates full classification report
f1_score = report['weighted avg']['f1-score']  # f1 averaged over classes, weighted by class support

# display model evaluation metrics
print("\nmodel performance:")
print(f"accuracy: {accuracy:.4f}")  # print accuracy with 4 decimal places
print(f"f1 score: {f1_score:.4f}")  # print f1 score with 4 decimal places


**s4: analyze false positives (FP) & false negatives (FN) (high-level breakdown)**

-logistic regression has been trained, and accuracy has been evaluated. next, errors need to be analyzed.

key steps:

-identify false positives (FP) → sentences predicted as positive but should not be

-identify false negatives (FN) → sentences predicted as negative but should not be

-examine patterns in misclassified sentences → look for trends in mistakes

-this helps pinpoint where the model struggles and provides insights for potential improvements.
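
with three classes, "false positive" and "false negative" follow the definitions above: predicted positive (or negative) while the true label is something else. a small sketch on made-up labels (not the notebook's data) that counts every kind of disagreement first and then derives the two totals:

```python
from collections import Counter

# made-up labels, not the notebook's data
toy_true = ["negative", "neutral", "positive", "neutral", "negative"]
toy_pred = ["positive", "positive", "positive", "negative", "negative"]

# count every (true label, predicted label) pair that disagrees
error_pairs = Counter((t, p) for t, p in zip(toy_true, toy_pred) if t != p)

# "false positive" here: predicted positive, true label is something else
fp_count = sum(n for (t, p), n in error_pairs.items() if p == "positive")
# "false negative" here: predicted negative, true label is something else
fn_count = sum(n for (t, p), n in error_pairs.items() if p == "negative")

print(error_pairs.most_common())
print(fp_count, fn_count)  # 2 1
```

the pair counts also show which true class each error came from (for example neutral vs. negative sentences predicted as positive). note that sentences wrongly predicted as neutral fall into neither list under these definitions.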

In [None]:
# s4: analyze false positives (FP) & false negatives (FN)

# initialize lists to store misclassified examples
false_positives = []  # sentences wrongly classified as positive
false_negatives = []  # sentences wrongly classified as negative

# loop through test data to find misclassified cases
for i in range(len(y_test)):
    if y_pred[i] == 'positive' and y_test.iloc[i] != 'positive':
        false_positives.append((X_test.iloc[i], y_test.iloc[i], y_pred[i]))  # collect FP examples
    elif y_pred[i] == 'negative' and y_test.iloc[i] != 'negative':
        false_negatives.append((X_test.iloc[i], y_test.iloc[i], y_pred[i]))  # collect FN examples

# limit display to first 25 examples in each category
num_fp_display = min(25, len(false_positives))  # show up to 25 false positives
num_fn_display = min(25, len(false_negatives))  # show up to 25 false negatives

# create a dataframe for FP & FN results
fp_fn_df = pd.DataFrame({
    "sentence": [fp[0] for fp in false_positives[:num_fp_display]] + [fn[0] for fn in false_negatives[:num_fn_display]],
    "true label": [fp[1] for fp in false_positives[:num_fp_display]] + [fn[1] for fn in false_negatives[:num_fn_display]],
    "predicted label": [fp[2] for fp in false_positives[:num_fp_display]] + [fn[2] for fn in false_negatives[:num_fn_display]]
})

# print the misclassified sentences
print("\nfalse positives & false negatives:")
print(fp_fn_df)


**some notes**

 the logistic regression model using fasttext embeddings achieved 68.09% accuracy, which suggests that averaged fasttext vectors capture general word meaning but miss sentiment nuances.

 -findings:

negative sentences misclassified as positive → likely due to words like "profit" appearing in a loss-related context.

neutral sentences often misclassified → financial terms may not have strong sentiment signals, leading to misalignment.


-model limitations:

fasttext embeddings focus on word meaning, but they do not always capture sentiment intensity.

sentiment in financial text is complex, and context matters more than individual word meaning.


-possible improvements:

fine-tune fasttext on finance-specific data for better domain adaptation.

combine fasttext embeddings with tf-idf features to balance meaning and sentiment strength.

-conclusion

fasttext provides a strong foundation, but improvements are needed for more precise
sentiment classification in financial contexts.
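
one way to see the context limitation noted above: a sentence vector built by averaging word vectors ignores word order, so two sentences with the same words but opposite meaning get the same vector. a minimal sketch with made-up 2-dimensional vectors (the `toy_vectors` values and the `mean_vector` helper are hypothetical, not real fasttext output):

```python
# toy 2-dimensional vectors (hypothetical values, not real fasttext output)
toy_vectors = {
    "profit": [1.0, 0.5],
    "turned": [0.0, 0.0],
    "into": [0.0, 0.0],
    "loss": [-1.0, 0.5],
}

def mean_vector(sentence, vectors):
    """Average the word vectors of a sentence; word order plays no role."""
    rows = [vectors[w] for w in sentence.split()]
    return [sum(col) / len(rows) for col in zip(*rows)]

vec_a = mean_vector("profit turned into loss", toy_vectors)
vec_b = mean_vector("loss turned into profit", toy_vectors)
print(vec_a == vec_b)  # True: both sentences collapse to the same vector
```

any classifier trained on such averaged vectors cannot tell these two sentences apart, whatever its other strengths.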

**Extra points - Use fasttext supervised method to train a classifier using 80% labeled data**

breakdown of the task: fasttext supervised classification & comparison

this task trains a supervised fasttext classifier on 80% of the labeled data and tests it on the remaining 20%. the evaluation follows the same structure as for the previous classifiers (LinearSVC and logistic regression).

**s1: prepare dataset for fasttext**




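Each line must start with `__label__` followed by the sentence's sentiment label, then a space and the sentence text.

Labels must contain no spaces (the sentiment labels in this dataset are already single lowercase words).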

Data needs to be split into 80% training and 20% testing, saved as text files for FastText.
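
As a rough sketch of these format rules, the snippet below builds one line and checks its basic shape. Both helper names (`to_fasttext_line`, `is_valid_fasttext_line`) and the example sentence are made up for illustration; the check covers only the points listed above, not everything FastText accepts.

```python
def to_fasttext_line(sentence, label):
    """Build one training line; whitespace (including newlines) collapses to single spaces."""
    clean_label = str(label).strip().replace(" ", "_")
    clean_text = " ".join(str(sentence).split())
    return f"__label__{clean_label} {clean_text}"

def is_valid_fasttext_line(line):
    """Check the basic shape: label prefix, non-empty label, some text, one line only."""
    if "\n" in line or not line.startswith("__label__"):
        return False
    label, _, text = line.partition(" ")
    return len(label) > len("__label__") and bool(text.strip())

# made-up sentence with an embedded newline
example_line = to_fasttext_line("Sales rose\n10% this quarter", "positive")
print(example_line)                          # __label__positive Sales rose 10% this quarter
print(is_valid_fasttext_line(example_line))  # True
```

Collapsing whitespace matters because FastText reads one example per line, so a newline inside a sentence would otherwise split it into two malformed examples.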

In [None]:
# s1: prepare dataset for fasttext

# function to format sentences for fasttext
def format_for_fasttext(sentence, label):
    # fasttext reads one example per line, so collapse any newlines/tabs into single spaces
    text = " ".join(str(sentence).split())
    return f"__label__{label} {text}"

# apply formatting to training and testing data
train_data = [format_for_fasttext(s, l) for s, l in zip(X_train, y_train)]
test_data = [format_for_fasttext(s, l) for s, l in zip(X_test, y_test)]

# define file paths for fasttext
train_file = "/content/fasttext_train.txt"
test_file = "/content/fasttext_test.txt"

# save formatted data to text files
with open(train_file, "w") as f:
    f.write("\n".join(train_data))

with open(test_file, "w") as f:
    f.write("\n".join(test_data))

print("training and testing datasets formatted and saved for fasttext")


**s2: train fasttext supervised model**


 FastText will now be trained as a supervised classifier using the formatted training data.

The model will learn sentiment classification from the labeled data.

After training, it will predict sentiment on the test dataset.

Training time depends on dataset size, and if needed, a smaller sample can be used.
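
If a smaller sample is needed, one simple option is a reproducible random subset of the formatted training lines. This is only a sketch: `sample_lines` is a hypothetical helper, the 30% fraction is an arbitrary choice, and the toy lines stand in for the real training file contents.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Return a reproducible random subset holding roughly `fraction` of the lines."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(lines) * fraction))
    # a dedicated Random instance keeps the global random state untouched
    return random.Random(seed).sample(lines, k)

# made-up lines standing in for the formatted training data
toy_lines = [f"__label__neutral sentence number {i}" for i in range(10)]
subset = sample_lines(toy_lines, fraction=0.3)
print(len(subset))  # 3
```

The subset would then be written to its own text file and passed to training in place of the full file. A plain random sample does not preserve class proportions exactly, so small classes can shrink further.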

In [None]:
# s2: train fasttext supervised model

# install fasttext if not installed
!pip install fasttext

# load fasttext
import fasttext

# define model parameters
fasttext_model = fasttext.train_supervised(input="/content/fasttext_train.txt", epoch=25, lr=0.5, wordNgrams=2, verbose=2)

# save the trained model
fasttext_model.save_model("/content/fasttext_model.bin")

print("fasttext model trained and saved")


**s3: evaluate model performance**

The trained FastText classifier is now ready for evaluation.

The model will predict sentiment on the test dataset.

Accuracy and F1-score will be measured (same as 3b).

This will help determine how well FastText performs compared to the previous models.

In [None]:
# s3: evaluate fasttext model performance

# load the trained model
fasttext_model = fasttext.load_model("/content/fasttext_model.bin")

# function to predict sentiment using fasttext
def predict_fasttext(sentence, model):
    text = " ".join(str(sentence).split())  # predict() handles one line at a time, so remove newlines
    label = model.predict(text)[0][0]  # get the top predicted label
    return label.replace("__label__", "")  # remove fasttext label prefix

# apply the model to test data
y_pred_fasttext = [predict_fasttext(sentence, fasttext_model) for sentence in X_test]

# calculate accuracy and f1-score
from sklearn.metrics import accuracy_score, classification_report

accuracy_fasttext = accuracy_score(y_test, y_pred_fasttext)
report_fasttext = classification_report(y_test, y_pred_fasttext, output_dict=True)
f1_score_fasttext = report_fasttext['weighted avg']['f1-score']  # f1 averaged over classes, weighted by class support

# display model performance metrics
print("\nmodel performance:")
print(f"accuracy: {accuracy_fasttext:.4f}")  # print accuracy with 4 decimal places
print(f"f1 score: {f1_score_fasttext:.4f}")  # print f1 score with 4 decimal places


**s4: analyze false positives (FP) & false negatives (FN)**

Now that the FastText classifier's performance has been evaluated (69.55% accuracy), the next step is to analyze its errors.


False Positives (FP) → Sentences wrongly classified as positive.

False Negatives (FN) → Sentences wrongly classified as negative.

Identify patterns in misclassifications to compare with previous models.

In [None]:
# s4: analyze false positives (FP) & false negatives (FN)

# initialize lists to store misclassified cases
false_positives_fasttext = []  # sentences incorrectly classified as positive
false_negatives_fasttext = []  # sentences incorrectly classified as negative

# loop through test data to find misclassified examples
for i in range(len(y_test)):
    if y_pred_fasttext[i] == 'positive' and y_test.iloc[i] != 'positive':
        false_positives_fasttext.append((X_test.iloc[i], y_test.iloc[i], y_pred_fasttext[i]))  # collect FP examples
    elif y_pred_fasttext[i] == 'negative' and y_test.iloc[i] != 'negative':
        false_negatives_fasttext.append((X_test.iloc[i], y_test.iloc[i], y_pred_fasttext[i]))  # collect FN examples

# limit display to first 25 examples in each category
num_fp_display = min(25, len(false_positives_fasttext))  # show up to 25 false positives
num_fn_display = min(25, len(false_negatives_fasttext))  # show up to 25 false negatives

# create a dataframe for FP & FN results
fp_fn_df_fasttext = pd.DataFrame({
    "sentence": [fp[0] for fp in false_positives_fasttext[:num_fp_display]] + [fn[0] for fn in false_negatives_fasttext[:num_fn_display]],
    "true label": [fp[1] for fp in false_positives_fasttext[:num_fp_display]] + [fn[1] for fn in false_negatives_fasttext[:num_fn_display]],
    "predicted label": [fp[2] for fp in false_positives_fasttext[:num_fp_display]] + [fn[2] for fn in false_negatives_fasttext[:num_fn_display]]
})

# print the misclassified sentences
print("\nfalse positives & false negatives:")
print(fp_fn_df_fasttext)


**final step: interpretation & comparison**

fasttext supervised classification achieved 69.55% accuracy, slightly better than logistic regression (68.09%) and LinearSVC (68.09%).


to compare performance, a black & gray bar chart will visualize the accuracy of all three models.

In [None]:
# final step: enhanced visualization for model comparison

import matplotlib.pyplot as plt

# define model names and their accuracy scores
models = ["LinearSVC + TF-IDF", "Logistic Regression + FastText", "FastText Supervised"]
accuracy_scores = [68.09, 68.09, 69.55]  # accuracy percentages

# create a bar chart (black and gray only) with enhanced visualization
plt.figure(figsize=(7, 5))

# draw bars with improved aesthetics
bars = plt.bar(models, accuracy_scores, color=['black', 'dimgray', 'gray'], edgecolor='black', linewidth=1.2)

# set labels and title
plt.ylabel("accuracy (%)")
plt.title("model performance comparison")
plt.ylim(65, 71)  # keeps scale compact for easier comparison

# add text labels on bars with better positioning
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 0.2, f"{height:.2f}%", ha='center', fontsize=12, fontweight='bold')

# add grid lines for better readability
plt.grid(axis='y', linestyle='--', linewidth=0.5, alpha=0.7)

# show the chart
plt.show()


 Dr. Yu, sorry for the delay, and thanks for being patient with this one.

-just adding a few quick thoughts about last week’s homework experience

-ai helped with a few things, specifically: structuring code, fixing syntax, formatting, setting up visualizations, and explaining concepts when things weren’t clear.

-my strategy has become this: ai is useful for breaking things down into smaller, manageable steps, but the big-picture thinking and decision-making (model training, evaluation, and interpretation) were entirely mine

-also, some thoughts about the technical issues last week

-colab was lagging and codespaces kept freezing or crashing. I tried different fixes and nothing really helped. I ended up upgrading to cloud computing, which instantly solved everything.

-not a fan of extra spending, but I guess this one was worth it since it saved time

-Dujun