# Stock Market Sentiment Analysis
**Goals**:
- Train a model that predicts sentiment towards a stock described in a social media post
- Analyze which words are related to a positive and which to a negative sentiment
- Export the model for production

**Data**:
- Data for this project is obtained from Kaggle: https://www.kaggle.com/datasets/yash612/stockmarket-sentiment-dataset

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
import nltk
import string
import time
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

## Load and analyze raw data

In [None]:
data = pd.read_csv("../input/stockmarket-sentiment-dataset/stock_data.csv")

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data["Sentiment"].value_counts()

In [None]:
positive_count = data[data['Sentiment'] == 1]['Sentiment'].count()
negative_count = data[data['Sentiment'] == -1]['Sentiment'].count()

percentage_positive = (positive_count / data['Sentiment'].count() * 100).round()
percentage_negative = (negative_count / data['Sentiment'].count() * 100).round()

print(f'Positive vs negative (%): {percentage_positive} - {percentage_negative}')

- The data set contains 5791 examples:
    - 3685 examples of a positive sentiment (Sentiment = 1)
    - 2106 examples of a negative sentiment (Sentiment = -1)
- The ratio of positive to negative examples is 64% - 36%. Because the classes are only moderately imbalanced, I will not take additional action to address the imbalance at this point.
- The machine learning task is a binary classification task.
- Data is unstructured in the form of short texts. Features will be extracted from the available texts.

## Split data into training, validation and test sets
- Examples will be split into:
    - Training set - Examples used by machine learning algorithms to train models,
    - Validation set - Examples used to validate the models, compare the models and select the best model,
    - Test set - Examples used to report predictive performance of the final model.
- Since the task is classification, I will apply stratified split to keep the same ratio of class labels in training, validation and test sets.
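As a rough check on the resulting split sizes, the arithmetic below works through an 80/10/10 split of the 5791 examples. It is only a sketch: it assumes that the held-out portion is rounded up to a whole example, which is my understanding of how `train_test_split` handles fractional sizes, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, assuming the held-out part is rounded up."""
    held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - held_out, held_out

# First split: 80% training, 20% remaining
n_train, n_remaining = split_sizes(5791, 0.2)
# Second split: the remaining examples are halved into validation and test
n_val, n_test = split_sizes(n_remaining, 0.5)
print(n_train, n_val, n_test)
```

The sizes printed by the notebook cells that follow the split are the authoritative values; this sketch only shows where they come from.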


In [None]:
# First, split the data into training and remaining data (80% for training, 20% for the remaining)
X_train, X_remaining, y_train, y_remaining = train_test_split(data['Text'], data['Sentiment'], test_size=0.2, stratify=data['Sentiment'], random_state=42)
#data.drop(columns=['Text'])

# Then, split the remaining data into validation and test sets (50% for each)
X_val, X_test, y_val, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, stratify=y_remaining, random_state=42)

In [None]:
print(f'Training split ({X_train.shape[0]} examples):')
print(y_train.value_counts())

In [None]:
print(f'Validation split ({X_val.shape[0]} examples):')
print(y_val.value_counts())

In [None]:
print(f'Test split ({X_test.shape[0]} examples):')
print(y_test.value_counts())

## Exploratory data analysis
In this step, feature extraction and selection are performed, followed by an analysis of words that frequently appear in texts of positive and negative sentiment.

### Feature extraction and selection
- Features are tokens that appear in texts, e.g., words, numbers, punctuation marks.
- To extract features:
    1. Texts are tokenized
    2. Porter stemmer is used to find common stems of words, so that different forms like buy and buying are treated as the same feature
    3. Stop words and punctuation marks are removed
    4. Tokens that have fewer than four characters are removed, with the exception of buy, which is kept. In this manner domain names and stock tickers are removed. In addition, four-character stock tickers are filtered out manually.
    5. Tokens that do not appear in at least 10 texts are removed
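Step 5 is a document-frequency filter. A minimal sketch of the idea on made-up toy posts is shown here; the function name `filter_by_document_frequency` and the toy data are hypothetical, and in the notebook itself this step is delegated to the `min_df` argument of `CountVectorizer`.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_posts, min_df):
    """Keep tokens that occur in at least `min_df` posts."""
    document_frequency = Counter()
    for tokens in tokenized_posts:
        # set() so that a token repeated within one post is counted once
        document_frequency.update(set(tokens))
    return {token for token, df in document_frequency.items() if df >= min_df}

# Toy, made-up posts that are already tokenized and stemmed
toy_posts = [
    ["buy", "long", "breakout"],
    ["short", "sell", "breakdown"],
    ["buy", "buy", "rally"],
]
vocabulary = filter_by_document_frequency(toy_posts, min_df=2)
print(sorted(vocabulary))
```

Note that the threshold counts posts, not occurrences: a token repeated many times in a single post still has a document frequency of one.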

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
print(stop_words)

In [None]:
punctuation = set(string.punctuation)
print(punctuation)

In [None]:
stock_tickers = set(['aapl', 'affi', 'amid', 'amzn', 'appl', 'bvsn', 'dndn', 'dvax', 'ebay', 'gevo', 'goog', 'imho', 'intc', 'invn', 'mani', 'morn', 'msft', 'nvda', 'pphm', 'qcom', 'swhc', 'yhoo', 'znga'])
print(stock_tickers)

In [None]:
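stemmer = PorterStemmer()

def custom_tokenizer(text):
    # Split on whitespace; CountVectorizer lowercases the text before calling the tokenizer
    tokens = text.split()
    # Keep purely alphabetic tokens that are not stop words, then stem them
    filtered_tokens = [stemmer.stem(token) for token in tokens if token.lower() not in stop_words and token.isalpha() and token not in punctuation]
    # Keep 'buy' and any stem longer than three characters that is not a known stock ticker
    filtered_tokens = [token for token in filtered_tokens if token == 'buy' or (len(token) > 3 and token not in stock_tickers)]

    return filtered_tokens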

In [None]:
count_vect = CountVectorizer(lowercase=True, min_df=10, tokenizer=custom_tokenizer)
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

In [None]:
feature_names = count_vect.get_feature_names_out()
print(feature_names)

### Wordcloud of texts with positive sentiment

In [None]:
positive_sentiment_data = X_train_counts[np.where(y_train == 1)[0]]

# Sum the counts of each feature over all positive posts and flatten the result to a 1-D array
positive_sentiment_word_counts = np.asarray(positive_sentiment_data.sum(axis=0)).flatten()

# Create a dictionary of words and their counts
positive_sentiment_word_count_dict = dict(zip(feature_names, positive_sentiment_word_counts))

# Print the sorted dictionary
print(', '.join([f'{key}: {value}' for key, value in dict(sorted(positive_sentiment_word_count_dict.items(), key=lambda item: item[1], reverse=True)).items()]))

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(positive_sentiment_word_count_dict)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

### Wordcloud of texts with negative sentiment

In [None]:
negative_sentiment_data = X_train_counts[np.where(y_train == -1)[0]]

# Sum the counts of each feature over all negative posts and flatten the result to a 1-D array
negative_sentiment_word_counts = np.asarray(negative_sentiment_data.sum(axis=0)).flatten()

# Create a dictionary of words and their counts
negative_sentiment_word_count_dict = dict(zip(feature_names, negative_sentiment_word_counts))

# Print the sorted dictionary
print(', '.join([f'{key}: {value}' for key, value in dict(sorted(negative_sentiment_word_count_dict.items(), key=lambda item: item[1], reverse=True)).items()]))

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(negative_sentiment_word_count_dict)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

### Results
- After feature extraction and selection, the data set contains 423 features representing content words from posts, excluding stock tickers and domain names.
- Word frequencies and wordclouds are computed separately for training set posts of positive and negative sentiment to understand whether the features are meaningful.
- As expected, the words buy and long are mentioned more frequently in posts of positive sentiment, while the words short and sell are mentioned more frequently in posts of negative sentiment.
- Interestingly, the word coronavirus appears more frequently in posts of negative sentiment.

## TF-IDF
- CountVectorizer counts how many times each word appeared in a post. The drawback of counts is that they are not normalized, i.e., it is more likely that a word will appear more frequently in a longer post. To normalize feature values, I will apply TF-IDF.
- **TF-IDF** stands for term frequency-inverse document frequency.
- In this representation features are words that appear in posts. Words are referred to as terms.
- Posts, i.e., short texts, are referred to as documents. A set of documents is referred to as corpus.
- **Inverse document frequency (IDF)** is computed for each term as: log(the number of documents in a corpus / the number of documents that contain the term)
- **Term frequency (TF)** is computed for a specific term and a document as: the number of times the term appeared in the document / total number of terms in that document
- **TF-IDF (term, document)** = TF (term, document) * IDF (term)
- IDF is computed once on the training set for all terms / features. When we want to compute TF-IDF for documents from validation and test sets, for each document we compute term frequencies and multiply them with IDFs computed on the training set.
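The definitions above can be worked through by hand on a toy corpus. The sketch below follows the textbook formulas exactly as written in this section; the function names and the toy documents are made up for illustration.

```python
import math

def inverse_document_frequency(corpus):
    """IDF(term) = log(number of documents / number of documents containing the term)."""
    n_documents = len(corpus)
    vocabulary = {term for document in corpus for term in document}
    return {
        term: math.log(n_documents / sum(term in document for document in corpus))
        for term in vocabulary
    }

def tf_idf(document, idf):
    """TF-IDF per term for one tokenized document; terms without an IDF are skipped."""
    return {
        term: (document.count(term) / len(document)) * idf[term]
        for term in set(document)
        if term in idf
    }

# Made-up toy corpus standing in for the training set
train_corpus = [["buy", "long"], ["sell", "short"], ["buy", "dip", "buy"]]
idf = inverse_document_frequency(train_corpus)

# A new document is scored with the IDF values learned from the training corpus
new_document = ["buy", "short", "moon"]
scores = tf_idf(new_document, idf)
print(scores)
```

The values produced by `TfidfTransformer` differ from this textbook form: with default settings it uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, multiplies it with raw counts, and then L2-normalizes each row.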

### Compute word occurrences for validation and test set

In [None]:
X_train_counts.shape

In [None]:
X_val_counts = count_vect.transform(X_val)
X_val_counts.shape

In [None]:
X_test_counts = count_vect.transform(X_test)
X_test_counts.shape

### Convert occurrences to frequencies

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

In [None]:
X_val_tfidf = tfidf_transformer.transform(X_val_counts)
X_val_tfidf.shape

In [None]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
X_test_tfidf.shape

In [None]:
X_train_tfidf_dense = X_train_tfidf.toarray()
print(X_train_tfidf_dense[1])

In [None]:
# Compute summary statistics
mean_tfidf = np.mean(X_train_tfidf_dense)
median_tfidf = np.median(X_train_tfidf_dense)
std_tfidf = np.std(X_train_tfidf_dense)
min_tfidf = np.min(X_train_tfidf_dense)
max_tfidf = np.max(X_train_tfidf_dense)

# Print the summary statistics
print("Summary Statistics for TF-IDF Matrix:")
print(f"Mean: {mean_tfidf}")
print(f"Median: {median_tfidf}")
print(f"Standard Deviation: {std_tfidf}")
print(f"Minimum: {min_tfidf}")
print(f"Maximum: {max_tfidf}")

## Measuring predictive performance of a classification model
- The performance_metrics function measures accuracy, precision, recall and F1-score of a classification model on the training and validation sets. Precision, recall and F1-score are reported separately for the positive and the negative class.
- The plot_confusion_matrix function plots the confusion matrix for predicted and true labels.

### Performance metrics

In [None]:
def performance_metrics(y_train_true, y_train_pred, y_val_true, y_val_pred):
    # Compute metrics for the training set
    train_accuracy = accuracy_score(y_train_true, y_train_pred)
    train_precision_positive = precision_score(y_train_true, y_train_pred, pos_label=1)
    train_recall_positive = recall_score(y_train_true, y_train_pred, pos_label=1)
    train_f1_positive = f1_score(y_train_true, y_train_pred, pos_label=1)
    
    train_precision_negative = precision_score(y_train_true, y_train_pred, pos_label=-1)
    train_recall_negative = recall_score(y_train_true, y_train_pred, pos_label=-1)
    train_f1_negative = f1_score(y_train_true, y_train_pred, pos_label=-1)

    # Compute metrics for the validation set
    val_accuracy = accuracy_score(y_val_true, y_val_pred)
    val_precision_positive = precision_score(y_val_true, y_val_pred, pos_label=1)
    val_recall_positive = recall_score(y_val_true, y_val_pred, pos_label=1)
    val_f1_positive = f1_score(y_val_true, y_val_pred, pos_label=1)
    
    val_precision_negative = precision_score(y_val_true, y_val_pred, pos_label=-1)
    val_recall_negative = recall_score(y_val_true, y_val_pred, pos_label=-1)
    val_f1_negative = f1_score(y_val_true, y_val_pred, pos_label=-1)

    # Print the metrics for the positive class
    print("Positive Class Metrics:")
    print("Training Set:")
    print(f"Accuracy: {train_accuracy:.2f}")
    print(f"Precision: {train_precision_positive:.2f}")
    print(f"Recall: {train_recall_positive:.2f}")
    print(f"F1-Score: {train_f1_positive:.2f}")
    
    print("\nValidation Set:")
    print(f"Accuracy: {val_accuracy:.2f}")
    print(f"Precision: {val_precision_positive:.2f}")
    print(f"Recall: {val_recall_positive:.2f}")
    print(f"F1-Score: {val_f1_positive:.2f}")

    # Print the metrics for the negative class
    print("\nNegative Class Metrics:")
    print("Training Set:")
    print(f"Accuracy: {train_accuracy:.2f}")
    print(f"Precision: {train_precision_negative:.2f}")
    print(f"Recall: {train_recall_negative:.2f}")
    print(f"F1-Score: {train_f1_negative:.2f}")
    
    print("\nValidation Set:")
    print(f"Accuracy: {val_accuracy:.2f}")
    print(f"Precision: {val_precision_negative:.2f}")
    print(f"Recall: {val_recall_negative:.2f}")
    print(f"F1-Score: {val_f1_negative:.2f}")

### Confusion matrix

In [None]:
def plot_confusion_matrix(true_labels, predicted_labels, title):
    # Create a confusion matrix; fix the label order so rows and columns match the tick labels below
    cm = confusion_matrix(true_labels, predicted_labels, labels=[-1, 1])

    # Plot the confusion matrix using seaborn and matplotlib
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
                xticklabels=["Predicted Negative", "Predicted Positive"],
                yticklabels=["Actual Negative", "Actual Positive"])
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title('Confusion Matrix' + (' - ' + title if len(title) > 0 else ''))
    plt.show()

## Training and validation of classification models
- The train_and_validate function takes a classifier instance, trains and validates a model, prints performance metrics and plots confusion matrices.
- A most frequent class model is used as a baseline. It predicts the most frequent class label in the training set (positive) for all examples in the validation set. This simple baseline serves as a point of comparison for more sophisticated models: any trained model should perform better than it. If a trained model performs worse than the baseline, this typically points to a bug that should be addressed.
- The following models will be trained and validated:
    - Naive Bayes
    - SVM
    - Gradient boosting
    - Random forest
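To make the baseline concrete, here is a minimal pure-Python sketch of a most frequent class predictor. The function name is hypothetical, and the labels are rebuilt from the class counts of the whole data set purely for illustration; the notebook itself uses `DummyClassifier` and evaluates on the validation split.

```python
from collections import Counter

def most_frequent_class_accuracy(train_labels, eval_labels):
    """Predict the majority training label for every evaluation example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in eval_labels)
    return majority_label, correct / len(eval_labels)

# Labels rebuilt from the reported class counts (3685 positive, 2106 negative)
labels = [1] * 3685 + [-1] * 2106
# Evaluating on the same labels is for illustration only
majority_label, accuracy = most_frequent_class_accuracy(labels, labels)
print(majority_label, round(accuracy, 2))
```

The accuracy of this baseline equals the share of the majority class, which is why a moderately imbalanced data set already yields a baseline well above 50%.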

In [None]:
def train_and_validate(clf, X_train, y_train, X_val, y_val):
    start_time = time.time()
    clf.fit(X_train, y_train)
    end_time = time.time()
    training_time = end_time - start_time
    print(f"Training time: {training_time:.2f} seconds\n")
    predictions_train = clf.predict(X_train)
    predictions_val = clf.predict(X_val)
    performance_metrics(y_train, predictions_train, y_val, predictions_val)
    plot_confusion_matrix(y_train, predictions_train, 'Training set')
    plot_confusion_matrix(y_val, predictions_val, 'Validation set')

### Most frequent class baseline

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
train_and_validate(dummy_clf, X_train_tfidf, y_train, X_val_tfidf, y_val)

### Naive Bayes

In [None]:
nb_clf = MultinomialNB()
train_and_validate(nb_clf, X_train_tfidf, y_train, X_val_tfidf, y_val)

#### Results
- The Naive Bayes model outperformed the most frequent class baseline in terms of:
    - Accuracy measured on the validation set --> 72% Naive Bayes vs. 64% baseline
    - F1 score measured on the validation set --> positive label: 80% Naive Bayes vs. 78% baseline - negative label: 51% Naive Bayes vs. 0% baseline
- The confusion matrix reveals a high number of false positives (126), i.e., posts of negative sentiment that are predicted as positive. More posts of negative sentiment are misclassified (126) than correctly classified (84). The Naive Bayes model performs much better when classifying posts of positive sentiment --> 333 true positives and 36 false negatives.
- This may happen because the negative class label is underrepresented. Naive Bayes had more positive training examples and learned to recognize positive sentiment better. In the next step, I will address class imbalance by giving a higher weight to negative examples.
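The reported percentages can be recomputed directly from the four confusion matrix counts given above. The sketch below does this arithmetic without scikit-learn; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(hits, false_alarms, misses):
    """Precision, recall and F1 for one class, computed from raw counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Validation counts reported above for Naive Bayes
tp, fn = 333, 36   # positive posts: correctly / incorrectly classified
fp, tn = 126, 84   # negative posts: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + fn + fp + tn)
positive = precision_recall_f1(tp, fp, fn)
# For the negative class the roles swap: tn are its hits, fn its false alarms, fp its misses
negative = precision_recall_f1(tn, fn, fp)

print(f"accuracy={accuracy:.2f}")
print("positive: precision=%.2f recall=%.2f f1=%.2f" % positive)
print("negative: precision=%.2f recall=%.2f f1=%.2f" % negative)
```

The recall of the negative class, 84 / (84 + 126), is where the weakness of the model shows most clearly.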

### Random forest
- The Random forest algorithm has an option to assign different weights to examples of different class labels.
- I will compare two versions of the Random forest model: one that weights examples of the positive and negative class equally, and another with weights that simulate a balanced data set.
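Weights that simulate a balanced data set can be derived from the class counts alone. The sketch below uses the formula that scikit-learn documents for its `'balanced'` mode, n_samples / (n_classes * count); the function name is hypothetical, and the counts are those of the full data set, used only for illustration, since the notebook computes the weights from the training split.

```python
def balanced_class_weights(class_counts):
    """weight(c) = n_samples / (n_classes * count(c))."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Class counts of the full data set, for illustration only
counts = {1: 3685, -1: 2106}
weights = balanced_class_weights(counts)
print(weights)

# With these weights both classes contribute the same total weight
print(weights[1] * counts[1], weights[-1] * counts[-1])
```

The minority class receives a weight above 1 and the majority class a weight below 1, so that the weighted class totals are equal.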

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
train_and_validate(rf_clf, X_train_tfidf, y_train, X_val_tfidf, y_val)

In [None]:
class_labels = [-1, 1]
class_weights = compute_class_weight('balanced', classes=class_labels, y=y_train)
class_weight_dict = {class_labels[i]: weight for i, weight in enumerate(class_weights)}
print(f'Negative class weight: {class_weights[0]}\nPositive class weight: {class_weights[1]}')

In [None]:
rf_weight_clf = RandomForestClassifier(n_estimators=100, class_weight=class_weight_dict, random_state=42)
train_and_validate(rf_weight_clf, X_train_tfidf, y_train, X_val_tfidf, y_val)

#### Results
- Both versions of the Random forest model outperformed the most frequent class baseline in terms of:
    - Accuracy measured on the validation set --> 74% and 73% Random forest vs. 64% baseline
    - F1 score measured on the validation set --> positive label: 81% both Random forest models vs. 78% baseline - negative label: 58% and 57% Random forest models vs. 0% baseline
- Comparing the Random forest model that weights positive and negative examples equally with the Naive Bayes model shows the largest improvement in recall on the negative class label --> 50% recall of Random forest vs. 40% recall of Naive Bayes. Other performance metrics are also slightly improved for Random forest. Therefore, the Random forest algorithm is a better choice than Naive Bayes for this problem.
- Changing the weights of the class labels to simulate a balanced training set yields no additional improvement in the performance of the Random forest model.
- Since class imbalance is eliminated as a cause of poor performance on the negative class label, another option is to expand the feature set with features that can better capture negative sentiment.