# Stock Prediction With News Sentiment Analysis

### Data Import

This project uses two CSV files:

World-Stock-Prices-Dataset.csv: contains historical stock price data.
Data.csv: contains news headline data with a sentiment label.

1. World-Stock-Prices-Dataset.csv:
This file contains stock price data with the following columns:

Date: The date of the stock prices.
Open, High, Low, Close: Standard stock market metrics.
Volume: The number of shares traded.
Dividends: Dividend payouts (if any).
Stock Splits: Information on stock splits.
Brand_Name: The company or brand name.
Ticker: Stock ticker symbol.
Industry_Tag: The industry of the stock.
Country: The country of the stock's origin.
Capital Gains: Gains associated with the stock.



2. Data.csv:
This file contains sentiment data, likely extracted from news articles or other sources. Key columns:

Date: The date of the data.
Label: A binary label indicating positive (1) or negative (0) sentiment.
Top1, Top2, …, Top25: Headline snippets from news articles for that date.


In [136]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import re

from sklearn.model_selection import train_test_split,  cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

In [108]:
sentiment_data_df = pd.read_csv(r"C:\Users\skanagal\Downloads\Data.csv\Data.csv",  encoding='ISO-8859-1' )   
stock_prices_df = pd.read_csv(r"C:\Users\skanagal\Downloads\World-Stock-Prices-Dataset.csv\World-Stock-Prices-Dataset.csv",  encoding='ISO-8859-1' )

In [109]:
stock_prices_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Capital Gains
0,2024-09-16 00:00:00-04:00,4.73,4.84,4.62,4.72,8400000.0,0.0,0.0,peloton,PTON,fitness,usa,
1,2024-09-16 00:00:00-04:00,10.59,10.699,10.435,10.6,7558300.0,0.0,0.0,zoominfo,ZI,technology,usa,
2,2024-09-16 00:00:00-04:00,122.18,122.949997,121.559998,122.459999,19800.0,0.0,0.0,adidas,ADDYY,apparel,germany,
3,2024-09-16 00:00:00-04:00,261.309998,262.850006,259.149994,261.089996,2169700.0,0.0,0.0,american express,AXP,finance,usa,
4,2024-09-16 00:00:00-04:00,42.104,42.104,42.104,42.104,100.0,0.0,0.0,puma,PMMAF,apparel,germany,


In [110]:
sentiment_data_df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links
3,2000-01-06,1,Pilgrim knows how to progress,Thatcher facing ban,McIlroy calls for Irish fighting spirit,Leicester bin stadium blueprint,United braced for Mexican wave,"Auntie back in fashion, even if the dress look...",Shoaib appeal goes to the top,Hussain hurt by 'shambles' but lays blame on e...,...,Putin admits Yeltsin quit to give him a head s...,BBC worst hit as digital TV begins to bite,How much can you pay for...,Christmas glitches,"Upending a table, Chopping a line and Scoring ...","Scientific evidence 'unreliable', defence claims",Fusco wins judicial review in extradition case,Rebels thwart Russian advance,Blair orders shake-up of failing NHS,Lessons of law's hard heart
4,2000-01-07,1,Hitches and Horlocks,Beckham off but United survive,Breast cancer screening,Alan Parker,Guardian readers: are you all whingers?,Hollywood Beyond,Ashes and diamonds,Whingers - a formidable minority,...,Most everywhere: UDIs,Most wanted: Chloe lunettes,Return of the cane 'completely off the agenda',From Sleepy Hollow to Greeneland,Blunkett outlines vision for over 11s,"Embattled Dobson attacks 'play now, pay later'...",Doom and the Dome,What is the north-south divide?,Aitken released from jail,Gone aloft


### Data Exploration and Cleaning

In [111]:
stock_prices_df.isnull().sum()

Date                  0
Open                  0
High                  0
Low                   0
Close                 0
Volume                0
Dividends             0
Stock Splits          0
Brand_Name            0
Ticker                0
Industry_Tag          0
Country               0
Capital Gains    295775
dtype: int64

In [112]:
sentiment_data_df.isnull().sum()

Date     0
Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64

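The sentiment data has almost no missing values (one in Top23 and three each in Top24 and Top25). The stock data is complete except for the Capital Gains column, which is missing in 295,775 rows.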
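To put null counts like the ones above in proportion, they can be expressed as a share of rows per column. This is a minimal sketch on a hypothetical toy frame (`toy_df`, with made-up values), not the project data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for stock_prices_df (values are made up)
toy_df = pd.DataFrame({
    "Close": [10.0, 11.0, 12.0, 13.0],
    "Capital Gains": [np.nan, np.nan, np.nan, 0.5],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy_df.isnull().mean()
print(missing_share)

# Columns that are mostly empty are candidates for dropping (0.5 is an arbitrary threshold)
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(mostly_empty)
```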

In [113]:
stock_prices_df['Date'] = pd.to_datetime(stock_prices_df['Date'], utc=True)
sentiment_data_df['Date'] = pd.to_datetime(sentiment_data_df['Date'], utc=True)

In [114]:
stock_prices_df.dtypes

Date             datetime64[ns, UTC]
Open                         float64
High                         float64
Low                          float64
Close                        float64
Volume                       float64
Dividends                    float64
Stock Splits                 float64
Brand_Name                    object
Ticker                        object
Industry_Tag                  object
Country                       object
Capital Gains                float64
dtype: object

### Sentiment Analysis

The news headlines are analyzed to compute a daily sentiment score for each date.

In [115]:
analyzer = SentimentIntensityAnalyzer()

# Function to clean the text (remove punctuation, lower case, etc.)
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = text.lower()  # Convert to lowercase
    return text

In [116]:
# Join the 25 headline columns into one text field per day, then clean it
headline_cols = [f'Top{i}' for i in range(1, 26)]
sentiment_data_df['Combined_Text'] = sentiment_data_df[headline_cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
sentiment_data_df['Combined_Text'] = sentiment_data_df['Combined_Text'].apply(clean_text)
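Because `Top23` to `Top25` contain a few missing values, `astype(str)` in the cell above turns them into the literal word `nan` inside `Combined_Text`. One way to avoid that is sketched below on a hypothetical two-row frame; the helper name `join_headlines` is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a missing headline
toy = pd.DataFrame({
    "Top1": ["Markets rally", "Bank collapses"],
    "Top2": ["Oil prices fall", np.nan],
})

def join_headlines(row):
    # Keep only non-missing headlines so NaN never becomes the string "nan"
    return " ".join(str(h) for h in row if pd.notna(h))

toy["Combined_Text"] = toy[["Top1", "Top2"]].apply(join_headlines, axis=1)
print(toy["Combined_Text"].tolist())
```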

In [117]:
# Step 2.2: Perform sentiment analysis with VADER
def analyze_sentiment_vader(text):
    scores = analyzer.polarity_scores(text)
    return scores['compound']  # Compound score indicates overall sentiment (-1 to 1)

sentiment_data_df['Sentiment_Score_VADER'] = sentiment_data_df['Combined_Text'].apply(analyze_sentiment_vader)

# Optional: Perform sentiment analysis with TextBlob (use polarity: ranges from -1 to 1)
def analyze_sentiment_textblob(text):
    return TextBlob(text).sentiment.polarity

sentiment_data_df['Sentiment_Score_TextBlob'] = sentiment_data_df['Combined_Text'].apply(analyze_sentiment_textblob)

# Step 2.3: Aggregate sentiment scores by date
daily_sentiment = sentiment_data_df.groupby('Date')['Sentiment_Score_VADER'].mean().reset_index()





In [118]:
# Check the aggregated sentiment data
daily_sentiment['Date'] = pd.to_datetime(daily_sentiment['Date'])
daily_sentiment.head()

Unnamed: 0,Date,Sentiment_Score_VADER
0,2000-01-03 00:00:00+00:00,-0.8271
1,2000-01-04 00:00:00+00:00,-0.9818
2,2000-01-05 00:00:00+00:00,0.7003
3,2000-01-06 00:00:00+00:00,-0.9091
4,2000-01-07 00:00:00+00:00,-0.9772


In [119]:
# Confirm the stock price column types before merging
stock_prices_df.dtypes

Date             datetime64[ns, UTC]
Open                         float64
High                         float64
Low                          float64
Close                        float64
Volume                       float64
Dividends                    float64
Stock Splits                 float64
Brand_Name                    object
Ticker                        object
Industry_Tag                  object
Country                       object
Capital Gains                float64
dtype: object

In [120]:
daily_sentiment['Date'] = daily_sentiment['Date'].dt.date


In [121]:
stock_prices_df['Date'] = stock_prices_df['Date'].dt.date


In [122]:
# Step 3: Merge the stock price and daily sentiment scores by date
merged_df = pd.merge(stock_prices_df, daily_sentiment, on='Date', how='inner')

# Check the merged dataframe
merged_df.head()


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Capital Gains,Sentiment_Score_VADER
0,2016-07-01,49.410243,49.687439,49.245656,49.366932,8330300.0,0.0,0.0,starbucks,SBUX,food & beverage,usa,,-0.9983
1,2016-07-01,95.695235,97.820845,92.656209,95.567184,6594200.0,0.0,0.0,hershey company,HSY,food & beverage,usa,,-0.9983
2,2016-07-01,137.803308,137.803308,136.234067,136.856506,2317200.0,0.0,0.0,costco,COST,retail,usa,,-0.9983
3,2016-07-01,99.978444,100.176256,99.558087,99.9702,7051400.0,0.0,0.0,johnson & johnson,JNJ,healthcare,usa,,-0.9983
4,2016-07-01,140.160633,141.494461,138.890328,139.588989,2048800.0,0.0,0.0,fedex,FDX,logistics,usa,,-0.9983
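An inner merge silently drops every stock row whose date has no sentiment score, so it can be worth checking how much of each table survives. This is a minimal sketch with hypothetical toy frames, using the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Hypothetical toy frames; the last price date has no sentiment score
prices = pd.DataFrame({
    "Date": ["2016-06-30", "2016-07-01", "2016-07-05"],
    "Close": [100.0, 101.0, 102.0],
})
sentiment = pd.DataFrame({
    "Date": ["2016-06-30", "2016-07-01"],
    "Sentiment_Score_VADER": [0.2, -0.5],
})

# A left merge with indicator=True labels each row as 'both' or 'left_only'
check = pd.merge(prices, sentiment, on="Date", how="left", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` are the ones an inner merge would discard.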


### Model Building and Evaluation

Now that the final dataset is ready, the model can be created

In [123]:
merged_df['Price_Change_Percentage'] = merged_df['Close'].pct_change() * 100

# Lagged features for sentiment
merged_df['Lagged_Sentiment'] = merged_df['Sentiment_Score_VADER'].shift(1)

# Drop rows with missing data after adding lagged features
#merged_df.dropna(inplace=True)
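Note that `merged_df` stacks many tickers, so an ungrouped `pct_change()` or `shift(1)` compares each row with whatever row precedes it, which may belong to a different company. The sketch below shows a per-ticker variant on a hypothetical toy frame; the tickers and values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: two tickers over two dates
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-06-30", "2016-06-30", "2016-07-01", "2016-07-01"]),
    "Ticker": ["AAA", "BBB", "AAA", "BBB"],
    "Close": [100.0, 50.0, 110.0, 45.0],
    "Sentiment_Score_VADER": [0.2, 0.2, -0.5, -0.5],
})

# Order rows so that each ticker's history is contiguous and chronological
toy = toy.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Percentage change computed within each ticker only
toy["Price_Change_Percentage"] = toy.groupby("Ticker")["Close"].pct_change() * 100

# Within a ticker, the previous row is the previous date, so this is a true one-day lag
toy["Lagged_Sentiment"] = toy.groupby("Ticker")["Sentiment_Score_VADER"].shift(1)

print(toy)
```

The first row of each ticker has no previous value, so both new columns are NaN there and those rows would need to be dropped before modeling.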


In [124]:
merged_df['Price_Movement'] = (merged_df['Price_Change_Percentage'] > 0).astype(int)


In [125]:

# Features: same-day and previous-row sentiment scores
X = merged_df[['Sentiment_Score_VADER', 'Lagged_Sentiment']]
y = merged_df['Price_Movement']  # Target: 1 if the price change is positive, else 0

In [126]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
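`train_test_split` shuffles rows by default, so with time-ordered data the training set can contain dates later than those in the test set. One common alternative is to split by time. The sketch below assumes the rows can be ordered by a `Date` column; the toy frame and the 80/20 cutoff are illustrative:

```python
import pandas as pd

# Hypothetical toy data: ten consecutive days
toy = pd.DataFrame({
    "Date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "Sentiment_Score_VADER": [0.1, -0.3, 0.5, 0.0, -0.8, 0.4, 0.2, -0.1, 0.9, -0.6],
    "Price_Movement": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Pick the date that separates the first 80% of distinct dates from the rest
dates = toy["Date"].sort_values().unique()
cutoff = dates[int(len(dates) * 0.8)]

# Everything before the cutoff is training data; the cutoff date onward is test data
train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]

print(len(train), len(test))
```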


In [128]:
# Step 3.4: Train a Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)



In [129]:
# Step 3.5: Predict on test data
y_pred = rf_model.predict(X_test)



In [130]:
# Step 3.6: Evaluate the model
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy Score: 0.4709614725048588

Classification Report:
               precision    recall  f1-score   support

           0       0.47      0.49      0.48     17474
           1       0.47      0.45      0.46     17514

    accuracy                           0.47     34988
   macro avg       0.47      0.47      0.47     34988
weighted avg       0.47      0.47      0.47     34988
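As a reference point for the accuracy above, the class supports in the report give the majority-class baseline, that is, the accuracy of always predicting the larger class. A short worked calculation using only the numbers printed above:

```python
# Class supports and accuracy copied from the classification report above
support = {0: 17474, 1: 17514}
model_accuracy = 0.4709614725048588

total = sum(support.values())
majority_baseline = max(support.values()) / total

print(f"Majority-class baseline: {majority_baseline:.4f}")                 # 0.5006
print(f"Model minus baseline:    {model_accuracy - majority_baseline:+.4f}")  # -0.0296
```

On this split the classifier sits about three percentage points below the trivial baseline.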



In [132]:
import xgboost as xgb
from sklearn.metrics import f1_score

# Train XGBoost Classifier
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate F1 score
print("F1 Score:", f1_score(y_test, y_pred))

F1 Score: 0.47889785337721724


In [137]:



# Candidate regressors; note that they are fitted on the binary Price_Movement target
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

# 5-fold cross-validation, scored by mean squared error
model_scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    model_scores[name] = -cv_scores.mean()
    print(f"{name} CV Mean Squared Error: {-cv_scores.mean():.4f}")

Linear Regression CV Mean Squared Error: 0.2500
Random Forest CV Mean Squared Error: 0.2561
Gradient Boosting CV Mean Squared Error: 0.2503
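Since these regressors are fitted on the 0/1 `Price_Movement` target, the errors can be read against a simple yardstick: a constant prediction equal to the share of up-moves p has a mean squared error of p(1 - p), which is 0.25 when the classes are balanced. A small check of that arithmetic on made-up, perfectly balanced labels:

```python
# Hypothetical balanced labels: 500 zeros and 500 ones
labels = [0, 1] * 500

# Constant prediction equal to the share of positive labels
p = sum(labels) / len(labels)

# Mean squared error of always predicting p
mse_constant = sum((y - p) ** 2 for y in labels) / len(labels)

print(p, mse_constant)  # 0.5 0.25
```

Assuming the training labels are roughly balanced, as the test-set supports suggest, cross-validated errors of about 0.25 indicate that the regressors do little better than predicting the class frequency.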


In [138]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [139]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 3.4: Train a Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Step 3.5: Predict on test data
y_pred = rf_model.predict(X_test)

# Step 3.6: Evaluate the model
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy Score: 0.47093289127700927

Classification Report:
               precision    recall  f1-score   support

           0       0.47      0.49      0.48     17474
           1       0.47      0.45      0.46     17514

    accuracy                           0.47     34988
   macro avg       0.47      0.47      0.47     34988
weighted avg       0.47      0.47      0.47     34988



In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model
rf = RandomForestClassifier(random_state=42)

# Set up Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, scoring='f1')

# Fit the model
grid_search.fit(X_train, y_train)

# Check best parameters
print("Best Parameters:", grid_search.best_params_)

# Use the best model
best_rf_model = grid_search.best_estimator_

# Predict and evaluate
y_pred = best_rf_model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred))

### Conclusion

The Random Forest model achieved an accuracy of 47%, which is slightly worse than random guessing for a balanced binary classification problem (where baseline accuracy would be about 50%). The precision, recall, and F1-scores for both classes (positive and negative price movements) are nearly identical, with F1-scores of 0.48 for class 0 and 0.46 for class 1. This suggests that the model struggles equally with both classes and has little discriminative power for predicting stock price movements from the features provided.

The model's low performance is further highlighted by the macro and weighted averages of precision, recall, and F1-score, all hovering around 0.47. These metrics indicate that the model does not favor any particular class but performs poorly across the board.

Given these results, the model may not be capturing sufficient signal from the sentiment analysis and stock price data. News headlines may not have been the best choice of media; tweets might have been a better choice. The feature construction may also play a part: the price change and lagged sentiment were computed over consecutive rows of the merged table, which mixes many tickers, and the train/test split was random rather than chronological. Additional feature engineering, more sophisticated modeling techniques, or more diverse data may be necessary to improve performance. Further efforts should focus on enhancing the model's ability to distinguish between positive and negative stock price movements with more meaningful predictive power.