# Group 4: Script Assignment 5

1. Load the provided dataset containing financial news headlines and sentiment labels. Perform exploratory data analysis to understand the structure of the dataset, distribution of sentiment labels, and any other relevant insights. (5 points)
2. Clean the text data by removing punctuation, special characters, and irrelevant symbols. Tokenize the headlines and convert them to lowercase for uniformity. Implement techniques like stemming or lemmatization to normalize the text data. (5 points)
3. Convert the text data into numerical features suitable for machine learning models. You can use techniques like bag-of-words, TF-IDF, or word embeddings. Split the dataset into training and testing sets. (5 points)
4. Choose appropriate machine learning algorithms (e.g., Naive Bayes, Support Vector Machines, or Neural Networks) for sentiment analysis. Train the model using the training data and evaluate its performance using appropriate evaluation metrics (accuracy, precision, recall, F1-score). (5 points)

## Import Packages

In [16]:
# Load the required libraries (duplicate and unused imports removed)
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

## Load NLTK VADER Lexicon

In [17]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\vbort\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Load Data and Clean Null Values

In [32]:
# Load the CNBC headlines dataset
filename = r'../data/cnbc_headlines.csv'
cnbc_headlines = pd.read_csv(filename)

# Drop rows with missing values
cnbc_headlines = cnbc_headlines.dropna()

# Display the first rows after dropping missing values
cnbc_headlines.head()

|   | Headlines | Time | Description |
|---|-----------|------|-------------|
| 0 | Jim Cramer: A better way to invest in the Covi... | 7:51 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer recommended buying... |
| 1 | Cramer's lightning round: I would own Teradyne | 7:33 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer rings the lightnin... |
| 3 | Cramer's week ahead: Big week for earnings, ev... | 7:25 PM ET Fri, 17 July 2020 | "We'll pay more for the earnings of the non-Co... |
| 4 | IQ Capital CEO Keith Bliss says tech and healt... | 4:24 PM ET Fri, 17 July 2020 | Keith Bliss, IQ Capital CEO, joins "Closing Be... |
| 5 | Wall Street delivered the 'kind of pullback I'... | 7:36 PM ET Thu, 16 July 2020 | "Look for the stocks of high-quality companies... |


## Calculate Sentiment

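The original dataset does not include sentiment labels. We use the `SentimentIntensityAnalyzer` class from the Natural Language Toolkit's (NLTK) VADER module to score each headline and assign it a sentiment label. Because the labels come from VADER rather than from human annotation, the models trained later in this notebook learn to reproduce VADER's judgments.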

In [33]:
# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Extract the compound sentiment score, which ranges from -1 (most negative) to +1 (most positive)
def get_sentiment_score(text):
    sentiment_score = analyzer.polarity_scores(text)
    return sentiment_score['compound']  # Compound score summarizes overall sentiment

# Classify sentiment using the conventional VADER cutoffs of +/-0.05 on the compound score
def classify_sentiment(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Assign sentiment score to the dataset
cnbc_headlines['sentiment_score'] = cnbc_headlines['Headlines'].apply(get_sentiment_score)

# Add sentiment label to the dataset
cnbc_headlines['Sentiment'] = cnbc_headlines['sentiment_score'].apply(classify_sentiment)

# Display the updated dataset
cnbc_headlines.head()


|   | Headlines | Time | Description | sentiment_score | Sentiment |
|---|-----------|------|-------------|-----------------|-----------|
| 0 | Jim Cramer: A better way to invest in the Covi... | 7:51 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer recommended buying... | 0.4404 | Positive |
| 1 | Cramer's lightning round: I would own Teradyne | 7:33 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer rings the lightnin... | 0.0 | Neutral |
| 3 | Cramer's week ahead: Big week for earnings, ev... | 7:25 PM ET Fri, 17 July 2020 | "We'll pay more for the earnings of the non-Co... | 0.0 | Neutral |
| 4 | IQ Capital CEO Keith Bliss says tech and healt... | 4:24 PM ET Fri, 17 July 2020 | Keith Bliss, IQ Capital CEO, joins "Closing Be... | 0.5719 | Positive |
| 5 | Wall Street delivered the 'kind of pullback I'... | 7:36 PM ET Thu, 16 July 2020 | "Look for the stocks of high-quality companies... | 0.0 | Neutral |


In [22]:
# Check the structure of the dataset
cnbc_headlines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2800 entries, 0 to 3079
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Headlines        2800 non-null   object 
 1   Time             2800 non-null   object 
 2   Description      2800 non-null   object 
 3   sentiment_score  2800 non-null   float64
 4   Sentiment        2800 non-null   object 
dtypes: float64(1), object(4)
memory usage: 131.2+ KB


In [23]:
# Check the distribution of sentiment labels
cnbc_headlines['Sentiment'].value_counts()

Sentiment
Neutral     1046
Positive    1008
Negative     746
Name: count, dtype: int64
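
The counts above can be turned into class shares and a majority-class baseline, which gives a rough reference point for the accuracy figures reported later. This is a minimal sketch with the counts copied from the output above; `label_counts`, `shares`, and `majority_baseline` are illustrative names that do not appear in the notebook, and the baseline is computed on the full dataset rather than on the test split.

```python
# Label counts copied from the value_counts() output above.
label_counts = {"Neutral": 1046, "Positive": 1008, "Negative": 746}

total = sum(label_counts.values())

# Share of each class in the labelled data.
shares = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

The classes are moderately imbalanced, so a classifier should clear the majority-class share by a clear margin before its accuracy is considered informative.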

2. Clean the text data by removing punctuation, special characters, and irrelevant symbols. Tokenize the headlines and convert them to lowercase for uniformity. Implement techniques like stemming or lemmatization to normalize the text data. 

In [24]:
# WordNet is needed by the lemmatizer. word_tokenize and stopwords also rely on the
# 'punkt' ('punkt_tab' in newer NLTK releases) and 'stopwords' resources being installed.
nltk.download('wordnet')

# Remove punctuation and digits, then lowercase
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Drop anything that is not a word character or whitespace
    text = re.sub(r'\d+', '', text)      # Drop digits
    text = text.lower()
    return text

# Clean the headlines, then tokenize the cleaned text
cnbc_headlines['cleaned_headline'] = cnbc_headlines['Headlines'].apply(clean_text)
cnbc_headlines['tokenized_headline'] = cnbc_headlines['cleaned_headline'].apply(word_tokenize)

# Initialize WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each token (the default part of speech is noun, so verb forms such as 'delivered' stay unchanged)
cnbc_headlines['lemmatized_headline'] = cnbc_headlines['tokenized_headline'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vbort\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
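
To make the effect of the cleaning step concrete, the sketch below applies the same regular expressions to one headline from the dataset and to one made-up headline (the "S&P 500" string is hypothetical, chosen only to show digit removal). It uses only the standard library, so it runs without the NLTK resources.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text: strip punctuation, strip digits, lowercase.
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    return text.lower()

# Headline taken from the dataset preview above.
headline = "Cramer's lightning round: I would own Teradyne"
cleaned = clean_text(headline)
print(repr(cleaned))

# Hypothetical headline showing what happens to symbols and numbers.
with_digits = clean_text("S&P 500 hits record")
print(repr(with_digits))
```

Two side effects are worth noting: deleting apostrophes fuses possessives and contractions into single tokens ("cramers"), and deleting digits can leave doubled spaces, which the tokenizer later absorbs.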


In [25]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
cnbc_headlines['cleaned_headline'] = cnbc_headlines['lemmatized_headline'].apply(lambda x: [word for word in x if word not in stop_words])

# Join the tokens back into sentences
cnbc_headlines['cleaned_headline'] = cnbc_headlines['cleaned_headline'].apply(lambda x: ' '.join(x))

In [26]:
cnbc_headlines.head()

|   | Headlines | Time | Description | sentiment_score | Sentiment | cleaned_headline | tokenized_headline | lemmatized_headline |
|---|-----------|------|-------------|-----------------|-----------|------------------|--------------------|---------------------|
| 0 | Jim Cramer: A better way to invest in the Covi... | 7:51 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer recommended buying... | 0.4404 | Positive | jim cramer better way invest covid vaccine gol... | [jim, cramer, a, better, way, to, invest, in, ... | [jim, cramer, a, better, way, to, invest, in, ... |
| 1 | Cramer's lightning round: I would own Teradyne | 7:33 PM ET Fri, 17 July 2020 | "Mad Money" host Jim Cramer rings the lightnin... | 0.0 | Neutral | cramers lightning round would teradyne | [cramers, lightning, round, i, would, own, ter... | [cramers, lightning, round, i, would, own, ter... |
| 3 | Cramer's week ahead: Big week for earnings, ev... | 7:25 PM ET Fri, 17 July 2020 | "We'll pay more for the earnings of the non-Co... | 0.0 | Neutral | cramers week ahead big week earnings even bigg... | [cramers, week, ahead, big, week, for, earning... | [cramers, week, ahead, big, week, for, earning... |
| 4 | IQ Capital CEO Keith Bliss says tech and healt... | 4:24 PM ET Fri, 17 July 2020 | Keith Bliss, IQ Capital CEO, joins "Closing Be... | 0.5719 | Positive | iq capital ceo keith bliss say tech healthcare... | [iq, capital, ceo, keith, bliss, says, tech, a... | [iq, capital, ceo, keith, bliss, say, tech, an... |
| 5 | Wall Street delivered the 'kind of pullback I'... | 7:36 PM ET Thu, 16 July 2020 | "Look for the stocks of high-quality companies... | 0.0 | Neutral | wall street delivered kind pullback ive waitin... | [wall, street, delivered, the, kind, of, pullb... | [wall, street, delivered, the, kind, of, pullb... |


3. Convert the text data into numerical features suitable for machine learning models. You can use techniques like bag-of-words, TF-IDF, or word embeddings. Split the dataset into training and testing sets.

In [27]:
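# Convert the text data into numerical features using TfidfVectorizer
# Note: the vectorizer is fitted on the raw 'Headlines' column (it lowercases and tokenizes internally),
# so the 'cleaned_headline' column built above is not used here.
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X = tfidf_vectorizer.fit_transform(cnbc_headlines['Headlines'])
y = cnbc_headlines['Sentiment']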

In [28]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
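
In the cells above the vectorizer is fitted on all headlines before the split, so the IDF weights also reflect the test headlines. One common alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the TF-IDF statistics come from the training rows only. The toy `texts` and `labels` are made up for illustration and are not from the CNBC dataset; `stratify` keeps the class proportions similar in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data for illustration only (not from the CNBC dataset).
texts = [
    "stocks rally on strong earnings", "markets gain as profits beat forecasts",
    "shares surge after upbeat outlook", "investors cheer record growth",
    "stocks plunge on recession fears", "markets fall as losses deepen",
    "shares sink after weak guidance", "investors worry about rising defaults",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Split the raw text first, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("svm", SVC()),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(len(X_train), len(X_test), len(predictions))
```

With eight toy sentences the predictions themselves carry no meaning; the point is only the order of operations (split, then fit the vectorizer inside the pipeline).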

4. Choose appropriate machine learning algorithms (e.g., Naive Bayes, Support Vector Machines, or Neural Networks) for sentiment analysis. Train the model using the training data and evaluate its performance using appropriate evaluation metrics (accuracy, precision, recall, F1-score).

In [29]:
# Sentiment analysis with Support Vector Machines
# Initialize SVM classifier (scikit-learn defaults: RBF kernel, C=1.0)
svm_classifier = SVC()

# Train the model
svm_classifier.fit(X_train, y_train)

# Predict sentiment labels for the test set
y_pred = svm_classifier.predict(X_test)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

In [30]:
# Display metrics 

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.6089285714285714
Precision: 0.6645849297573436
Recall: 0.6089285714285714
F1 Score: 0.6027940528832338
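
Accuracy and weighted recall are identical in the output above. This is expected: weighting each class's recall by its support and summing gives the overall fraction of correct predictions. The sketch below works this through in plain Python on made-up labels (`y_true` and `y_hat` are illustrative and are not the notebook's data), and shows that weighted precision can still differ from accuracy.

```python
from collections import Counter

# Made-up labels for illustration only.
y_true = ["Pos", "Pos", "Pos", "Neg", "Neg", "Neu", "Neu", "Neu", "Neu", "Neg"]
y_hat = ["Pos", "Neu", "Pos", "Neg", "Pos", "Neu", "Neu", "Pos", "Neu", "Neg"]

support = Counter(y_true)    # number of true instances per class
predicted = Counter(y_hat)   # number of predictions per class
correct = Counter(t for t, p in zip(y_true, y_hat) if t == p)

n = len(y_true)
accuracy = sum(correct.values()) / n

# Per-class scores, then averages weighted by class support.
recall = {c: correct[c] / support[c] for c in support}
precision = {c: correct[c] / predicted[c] if predicted[c] else 0.0 for c in support}
weighted_recall = sum(recall[c] * support[c] for c in support) / n
weighted_precision = sum(precision[c] * support[c] for c in support) / n

print(f"accuracy={accuracy:.3f}")
print(f"weighted recall={weighted_recall:.3f}")
print(f"weighted precision={weighted_precision:.3f}")
```

Because weighted recall carries no information beyond accuracy, per-class precision and recall are usually more revealing when the classes behave differently.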


In [31]:
# Sentiment analysis with Naive Bayes

from sklearn.naive_bayes import MultinomialNB

# Initialize Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the model
nb_classifier.fit(X_train, y_train)

# Predict sentiment labels for the test set
y_pred = nb_classifier.predict(X_test)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.5839285714285715
Precision: 0.6252564935064936
Recall: 0.5839285714285715
F1 Score: 0.5774796409319105
