# End-to-End EDA: Stock, Macro, and Sentiment Data

This notebook performs exploratory data analysis (EDA) in preparation for stock price prediction, using historical stock prices, macroeconomic indicators, and financial news sentiment data. The workflow covers data cleaning, per-dataset visualization, and merging the three datasets to examine how their features relate to one another.

## 1. Import Required Libraries

We begin by importing the necessary libraries for data analysis and visualization.

In [1]:
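import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# For better plot aesthetics (this style name requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8-darkgrid')

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')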

## 2. Load Datasets

Load the following CSV files into pandas DataFrames:
- `all_stocks_5y_data.csv`: Historical stock prices
- `macro_data.csv`: Macroeconomic indicators
- `stock_news_sentiment.csv`: Financial news sentiment

In [3]:
# File paths (update if needed)
stock_fp = '../data/raw/all_stocks_5y_data.csv'
macro_fp = '../data/raw/macro_data.csv'
sentiment_fp = '../data/raw/stock_news_sentiment.csv'

# Load datasets
stocks_df = pd.read_csv(stock_fp)
macro_df = pd.read_csv(macro_fp)
sentiment_df = pd.read_csv(sentiment_fp)

print("Stock data shape:", stocks_df.shape)
print("Macro data shape:", macro_df.shape)
print("Sentiment data shape:", sentiment_df.shape)

FileNotFoundError: [Errno 2] No such file or directory: '../data/raw/all_stocks_5y_data.csv'
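
The `FileNotFoundError` above indicates that the relative path did not resolve from the notebook's working directory. As a minimal sketch, the paths can be checked before calling `pd.read_csv`. The helper name `find_missing_files` is hypothetical and not part of the notebook; the paths are the ones used in the cell above.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of `paths` that do not exist as files on disk."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Same relative paths as in the loading cell
data_files = [
    '../data/raw/all_stocks_5y_data.csv',
    '../data/raw/macro_data.csv',
    '../data/raw/stock_news_sentiment.csv',
]

missing = find_missing_files(data_files)
if missing:
    # Relative paths are resolved against this directory
    print("Working directory:", Path.cwd())
    print("Missing files:", missing)
```

If any file is reported missing, either adjust the paths or start the notebook from the directory they are relative to.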

## 3. Convert Date Columns to Datetime

Convert the `date` column in all datasets to pandas datetime format for proper time-based analysis.

In [None]:
# Convert 'date' columns to datetime
stocks_df['date'] = pd.to_datetime(stocks_df['date'])
macro_df['date'] = pd.to_datetime(macro_df['date'])
sentiment_df['date'] = pd.to_datetime(sentiment_df['date'])

# Check conversion
print(stocks_df['date'].dtype, macro_df['date'].dtype, sentiment_df['date'].dtype)
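
If a `date` column contains malformed entries, `pd.to_datetime` raises by default. The sketch below, on a toy frame with made-up values, shows one way to surface such rows with `errors='coerce'` instead; whether the real files contain bad dates is not known from the notebook.

```python
import pandas as pd

# Toy frame standing in for any of the three datasets (values are illustrative)
raw = pd.DataFrame({'date': ['2017-01-03', 'not a date', '2017-01-05']})

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw['date'], errors='coerce')

# Count the rows that failed to parse so they can be inspected or dropped
n_bad = int(parsed.isna().sum())
print("Unparseable dates:", n_bad)
```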

## 4. Handle Missing Values

- Forward-fill missing sentiment scores per `ticker` in `stock_news_sentiment.csv`.
- If the first score for a ticker is missing, fill it with 0.
- Handle other missing values as needed.

In [None]:
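# Forward-fill sentiment scores within each ticker, then set any leading NaNs to 0.
# groupby(...).ffill() keeps the original row index, so the result aligns on assignment
# (groupby(...).apply can return a MultiIndexed Series that does not).
sentiment_df = sentiment_df.sort_values(['ticker', 'date'])
sentiment_df['sentiment_score'] = (
    sentiment_df.groupby('ticker')['sentiment_score']
    .ffill()
    .fillna(0)
)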

# Check for remaining missing values
print("Missing sentiment scores:", sentiment_df['sentiment_score'].isna().sum())

# Check missing values in other datasets
print("Stocks missing values:\n", stocks_df.isna().sum())
print("Macro missing values:\n", macro_df.isna().sum())
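
The per-ticker forward fill described in this section can be checked on a small toy frame. This is only a sketch with made-up values; the column names `ticker`, `date`, and `sentiment_score` follow the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                            '2020-01-01', '2020-01-02']),
    'sentiment_score': [np.nan, 0.5, np.nan, np.nan, -0.2],
})
toy = toy.sort_values(['ticker', 'date'])

# Forward-fill within each ticker so values never leak from A into B,
# then replace the leading NaN of each ticker with 0
filled = toy.groupby('ticker')['sentiment_score'].ffill().fillna(0)
print(filled.tolist())  # [0.0, 0.5, 0.5, 0.0, -0.2]
```

Without the `groupby`, the first row of ticker B would inherit 0.5 from ticker A, which is why the fill is done per ticker.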

## 5. EDA on Stock Data

- Display basic info (shape, columns, missing values).
- Plot distribution of closing prices.
- Time-series plots for `close` prices of selected tickers (AAPL, MSFT, AMZN).
- Top 10 tickers by average trading volume.
- Correlation heatmap for `open`, `high`, `low`, `close`, and `volume`.

In [None]:
# Basic info
print("Stocks DataFrame shape:", stocks_df.shape)
print("Columns:", stocks_df.columns.tolist())
print("Missing values:\n", stocks_df.isna().sum())

# Distribution of closing prices
plt.figure(figsize=(8,5))
sns.histplot(stocks_df['close'], bins=50, kde=True)
plt.title('Distribution of Closing Prices')
plt.xlabel('Close Price')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Time-series plot for selected tickers
selected_tickers = ['AAPL', 'MSFT', 'AMZN']
plt.figure(figsize=(12,6))
for ticker in selected_tickers:
    ticker_df = stocks_df[stocks_df['ticker'] == ticker]
    plt.plot(ticker_df['date'], ticker_df['close'], label=ticker)
plt.title('Close Price Over Time: AAPL, MSFT, AMZN')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.show()

In [None]:
# Top 10 tickers by average trading volume
top_vol = (
    stocks_df.groupby('ticker')['volume']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
plt.figure(figsize=(10,5))
sns.barplot(x=top_vol.index, y=top_vol.values, palette='viridis')
plt.title('Top 10 Tickers by Average Trading Volume')
plt.xlabel('Ticker')
plt.ylabel('Average Volume')
plt.show()

In [None]:
# Correlation heatmap for price and volume columns
corr_cols = ['open', 'high', 'low', 'close', 'volume']
corr = stocks_df[corr_cols].corr()
plt.figure(figsize=(6,5))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap: Stock Price & Volume')
plt.show()
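
Correlations computed on raw price levels pooled across all tickers can be dominated by differences in price scale between companies. One common alternative is to correlate daily returns instead. The sketch below computes per-ticker returns on toy data; the column name `daily_return` is an assumption introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2),
    'close': [100.0, 110.0, 99.0, 10.0, 10.5, 10.5],
})
toy = toy.sort_values(['ticker', 'date'])

# Percentage change within each ticker; the first row of each ticker is NaN
toy['daily_return'] = toy.groupby('ticker')['close'].pct_change()
print(toy)
```

The resulting `daily_return` column can then be passed to `.corr()` together with `volume` or other features in place of the raw price columns.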

## 6. EDA on Macroeconomic Data

- Line plots for `GDP`, `CPI`, `unemployment`, and `interest_rate` over time.
- Correlation matrix of all macro indicators.
- Highlight trends or anomalies (e.g., inflation spikes).

In [None]:
# Line plots for macro indicators
macro_cols = ['GDP', 'CPI', 'unemployment', 'interest_rate']
plt.figure(figsize=(14,8))
for i, col in enumerate(macro_cols, 1):
    plt.subplot(2,2,i)
    plt.plot(macro_df['date'], macro_df[col])
    plt.title(col)
    plt.xlabel('Date')
    plt.ylabel(col)
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix of macro indicators
macro_corr = macro_df[macro_cols].corr()
plt.figure(figsize=(5,4))
sns.heatmap(macro_corr, annot=True, cmap='Blues', fmt='.2f')
plt.title('Correlation Matrix: Macro Indicators')
plt.show()

In [None]:
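# Highlight inflation (CPI) spikes. Assumption: a "spike" is a period-over-period
# CPI change more than 2 standard deviations above the mean change; tune as needed.
macro_sorted = macro_df.sort_values('date')
cpi_change = macro_sorted['CPI'].pct_change()
spikes = macro_sorted[cpi_change > cpi_change.mean() + 2 * cpi_change.std()]

plt.figure(figsize=(10,5))
plt.plot(macro_sorted['date'], macro_sorted['CPI'], color='orange', label='CPI')
plt.scatter(spikes['date'], spikes['CPI'], color='red', zorder=3, label='Spike')
plt.title('CPI Over Time (Highlighting Inflation Trends)')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show()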

## 7. EDA on Sentiment Data

- Number of sentiment entries per ticker.
- Line plot of average daily sentiment across all tickers.
- Sentiment score distribution (histogram).
- Analyze if certain tickers are consistently positive or negative after forward-filling.

In [None]:
# Number of sentiment entries per ticker
sentiment_counts = sentiment_df['ticker'].value_counts().head(15)
plt.figure(figsize=(10,5))
sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values, palette='mako')
plt.title('Number of Sentiment Entries per Ticker (Top 15)')
plt.xlabel('Ticker')
plt.ylabel('Count')
plt.show()

In [None]:
# Line plot of average daily sentiment across all tickers
daily_sentiment = sentiment_df.groupby('date')['sentiment_score'].mean()
plt.figure(figsize=(12,5))
plt.plot(daily_sentiment.index, daily_sentiment.values)
plt.title('Average Daily Sentiment Score Across All Tickers')
plt.xlabel('Date')
plt.ylabel('Average Sentiment Score')
plt.show()

In [None]:
# Sentiment score distribution
plt.figure(figsize=(8,5))
sns.histplot(sentiment_df['sentiment_score'], bins=40, kde=True, color='purple')
plt.title('Distribution of Sentiment Scores')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Rank tickers by mean sentiment after forward-filling.
# Note: the mean captures the average level, not how consistently a ticker stays positive or negative.
ticker_sentiment_mean = sentiment_df.groupby('ticker')['sentiment_score'].mean().sort_values()
print("5 tickers with the lowest mean sentiment:\n", ticker_sentiment_mean.head(5))
print("\n5 tickers with the highest mean sentiment:\n", ticker_sentiment_mean.tail(5))
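
A mean alone does not show whether a ticker is consistently positive or negative, since a single extreme score can dominate it. As a sketch on toy data, the share of positive entries per ticker can be reported alongside the mean. The column names `mean_score` and `share_positive` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'sentiment_score': [0.9, -0.1, -0.1, -0.1, 0.1, 0.1, 0.2, 0.1],
})

grouped = toy.groupby('ticker')['sentiment_score']
summary = pd.DataFrame({
    # Average sentiment level per ticker
    'mean_score': grouped.mean(),
    # Fraction of entries with a strictly positive score
    'share_positive': grouped.apply(lambda s: (s > 0).mean()),
})
print(summary)
```

Here ticker A has the higher mean but is positive in only a quarter of its entries, while ticker B is positive in all of them.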

## 8. Merge Datasets (Optional)

Join stock data with sentiment and macro data on `date` (and `ticker` where applicable) to create a combined DataFrame for further analysis.

In [None]:
# Merge stocks with sentiment on date and ticker
merged_df = pd.merge(
    stocks_df, sentiment_df[['date', 'ticker', 'sentiment_score']],
    on=['date', 'ticker'], how='left'
)

# Merge with macro data on date
merged_df = pd.merge(
    merged_df, macro_df, on='date', how='left'
)

print("Merged DataFrame shape:", merged_df.shape)
merged_df.head()
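
An exact-date merge only matches rows whose dates appear in both frames. If the macro data is published at a lower frequency than the daily stock data (an assumption; the notebook does not state its frequency), most stock rows would receive NaN macro values. A sketch of one alternative, `pd.merge_asof`, on toy data:

```python
import pandas as pd

stocks_toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-31', '2020-02-03', '2020-03-02']),
    'ticker': ['A', 'A', 'A'],
    'close': [10.0, 11.0, 12.0],
})
macro_toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'CPI': [100.0, 100.5, 101.0],
})

# merge_asof requires both frames to be sorted by the key
aligned = pd.merge_asof(
    stocks_toy.sort_values('date'),
    macro_toy.sort_values('date'),
    on='date',
    direction='backward',  # latest macro value at or before each trading day
)
print(aligned)
```

With these toy dates, a plain `pd.merge(..., on='date', how='left')` would leave `CPI` empty for every row, whereas the as-of merge carries the most recent macro reading forward.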

## 9. Combined Visualizations

- Plot closing price, sentiment score, and interest rate over time for a selected stock.
- Create a correlation heatmap across all combined features.

In [None]:
# Example: Plot for AAPL
aapl_df = merged_df[merged_df['ticker'] == 'AAPL'].sort_values('date')

fig, ax1 = plt.subplots(figsize=(14,6))

ax1.plot(aapl_df['date'], aapl_df['close'], color='blue', label='Close Price')
ax1.set_ylabel('Close Price', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

ax2 = ax1.twinx()
ax2.plot(aapl_df['date'], aapl_df['sentiment_score'], color='green', alpha=0.6, label='Sentiment Score')
ax2.set_ylabel('Sentiment Score', color='green')
ax2.tick_params(axis='y', labelcolor='green')

ax3 = ax1.twinx()
ax3.spines['right'].set_position(('outward', 60))
ax3.plot(aapl_df['date'], aapl_df['interest_rate'], color='red', alpha=0.5, label='Interest Rate')
ax3.set_ylabel('Interest Rate', color='red')
ax3.tick_params(axis='y', labelcolor='red')

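# Combine the line labels from all three axes into a single legend
lines = ax1.get_lines() + ax2.get_lines() + ax3.get_lines()
ax1.legend(lines, [line.get_label() for line in lines], loc='upper left')

fig.suptitle('AAPL: Close Price, Sentiment Score, and Interest Rate Over Time')
fig.tight_layout()
plt.show()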

In [None]:
# Correlation heatmap across combined features
combined_cols = ['open', 'high', 'low', 'close', 'volume', 'sentiment_score', 'GDP', 'CPI', 'unemployment', 'interest_rate']
combined_corr = merged_df[combined_cols].corr()
plt.figure(figsize=(10,8))
sns.heatmap(combined_corr, annot=True, cmap='Spectral', fmt='.2f')
plt.title('Correlation Heatmap: Combined Features')
plt.show()