# Exploratory Data Analysis (EDA)

This notebook explores the raw historical stock data used in the quant-bot pipeline. The goal is to understand the structure of the data, spot data-quality issues (missing values, gaps, outliers), and gather insights for feature engineering and modeling.

## 1. Imports and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style="whitegrid")

# Load the data (adjust ticker if needed)
df = pd.read_csv("../data/AAPL_data.csv", parse_dates=["Date"])
df.head()

## 2. Data Overview

In [None]:
# Data types and missing values
df.info()

In [None]:
# Basic statistics
df.describe()

In [None]:
# Check for missing values
df.isnull().sum()
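
Beyond per-column null counts, price data often has problems tied to the date column itself, such as duplicated or out-of-order rows. The sketch below is one possible check, not part of the pipeline: `date_quality_report` is a hypothetical helper, and it assumes the frame has a parsed `Date` column like the one loaded above. It runs on a small synthetic frame so it does not depend on the CSV.

```python
import pandas as pd


def date_quality_report(frame: pd.DataFrame, date_col: str = "Date") -> dict:
    """Hypothetical helper: summarize basic date and null issues in a price frame."""
    dates = frame[date_col]
    return {
        "n_rows": len(frame),
        # Rows whose date repeats an earlier row
        "n_duplicate_dates": int(dates.duplicated().sum()),
        # True when dates never decrease from one row to the next
        "is_sorted": bool(dates.is_monotonic_increasing),
        # Total number of null cells across all columns
        "n_missing_cells": int(frame.isnull().sum().sum()),
    }


# Synthetic example data (illustrative values only)
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-05"]),
    "Close": [185.0, 184.0, 184.0, None],
})
report = date_quality_report(sample)
print(report)
```

On the real data, the same call would be `date_quality_report(df)`.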

## 3. Time Series Plots

In [None]:
# Plot closing price over time
plt.figure(figsize=(12, 6))
plt.plot(df["Date"], df["Close"])
plt.title("AAPL Closing Price Over Time")
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.show()

In [None]:
# Plot volume over time
plt.figure(figsize=(12, 4))
plt.plot(df["Date"], df["Volume"], color="orange")
plt.title("AAPL Volume Over Time")
plt.xlabel("Date")
plt.ylabel("Volume")
plt.show()

## 4. Distribution Analysis

In [None]:
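# Calculate simple daily returns (assumes rows are sorted by ascending Date).
# fill_method=None leaves returns next to missing prices as NaN instead of forward-filling them.
df["Return"] = df["Close"].pct_change(fill_method=None)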

# Plot histogram of daily returns
plt.figure(figsize=(8, 4))
sns.histplot(df["Return"].dropna(), bins=50, kde=True)
plt.title("Distribution of Daily Returns")
plt.xlabel("Daily Return")
plt.show()
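
A histogram shows the shape of the return distribution, and a few summary numbers can make that shape easier to compare across tickers. The sketch below is an illustration under assumptions: `summarize_returns` is a hypothetical helper, and the prices are synthetic. It uses pandas' `skew()` and `kurt()`, where `kurt()` reports excess kurtosis (0 for a normal distribution).

```python
import pandas as pd


def summarize_returns(close: pd.Series) -> pd.Series:
    """Hypothetical helper: summary statistics of simple daily returns."""
    returns = close.pct_change(fill_method=None).dropna()
    return pd.Series({
        "count": returns.count(),
        "mean": returns.mean(),
        "std": returns.std(),
        "skew": returns.skew(),             # asymmetry of the distribution
        "excess_kurtosis": returns.kurt(),  # tail weight relative to a normal
        "min": returns.min(),
        "max": returns.max(),
    })


# Synthetic prices that alternate +10% and -10% (illustrative values only)
prices = pd.Series([100.0, 110.0, 99.0, 108.9, 98.01])
stats = summarize_returns(prices)
print(stats)
```

On the real data, the same call would be `summarize_returns(df["Close"])`.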

## 5. Correlation Analysis

In [None]:
# Correlation matrix
corr = df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

## 6. Missing Data Visualization

In [None]:
# Visualize missing data (transposed so rows run left to right and each column gets its own band)
plt.figure(figsize=(10, 2))
sns.heatmap(df.isnull().T, cbar=False, xticklabels=False)
plt.title("Missing Data Heatmap")
plt.show()

## 7. Feature Exploration (Optional)

If you have engineered features (e.g., SMA, RSI, sentiment), load and plot them here for further analysis.

In [None]:
# Example: load features and plot SMA/RSI if available
try:
    features = pd.read_csv("../data/AAPL_features.csv", parse_dates=["Date"])
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot(features["Date"], features["Close"], label="Close")
    if "SMA_20" in features.columns:
        ax.plot(features["Date"], features["SMA_20"], label="SMA 20")
    if "RSI_14" in features.columns:
        # RSI is bounded to 0-100, so it gets its own y-axis instead of sharing the price axis
        ax_rsi = ax.twinx()
        ax_rsi.plot(features["Date"], features["RSI_14"], color="gray", alpha=0.6, label="RSI 14")
        ax_rsi.set_ylabel("RSI")
        ax_rsi.grid(False)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    fig.legend(loc="upper left")
    ax.set_title("Close Price and Technical Indicators")
    plt.show()
except Exception as e:
    print("Feature file not found or error loading features:", e)
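
If the features file does not exist yet, the two indicators can be approximated directly from closing prices. The sketch below is an assumption about how `SMA_20` and `RSI_14` might be computed, not the pipeline's actual feature code. `add_sma_rsi` is a hypothetical helper, and the RSI here uses a simple rolling average of gains and losses rather than Wilder's smoothing, so values can differ from other implementations.

```python
import numpy as np
import pandas as pd


def add_sma_rsi(frame: pd.DataFrame, price_col: str = "Close",
                sma_window: int = 20, rsi_window: int = 14) -> pd.DataFrame:
    """Hypothetical helper: add SMA and a simple-average RSI column to a copy of frame."""
    out = frame.copy()
    price = out[price_col]

    # Simple moving average over the last sma_window rows
    out[f"SMA_{sma_window}"] = price.rolling(sma_window).mean()

    # Split day-over-day changes into gains and losses (both non-negative)
    delta = price.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta).clip(lower=0).rolling(rsi_window).mean()

    # RSI = 100 - 100 / (1 + RS); a window with no losses gives RS = inf and RSI = 100
    rs = avg_gain / avg_loss
    out[f"RSI_{rsi_window}"] = 100 - 100 / (1 + rs)
    return out


# Synthetic, strictly rising prices (illustrative values only)
demo = pd.DataFrame({"Close": np.arange(1, 31, dtype=float)})
demo_features = add_sma_rsi(demo)
print(demo_features.tail())
```

The first `sma_window - 1` rows of the SMA and the first `rsi_window` rows of the RSI are NaN because their windows are not yet full.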

## 8. Insights and Next Steps

- Summarize key findings from the EDA.
- Note any data quality issues or outliers.
- Suggest ideas for feature engineering or modeling based on your observations.
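
As a starting point for the outlier note above, one simple option is to flag daily returns that sit far from the mean in standard-deviation terms. This is a minimal sketch under assumptions: `flag_return_outliers` is a hypothetical helper, the threshold of 3 is an arbitrary illustrative choice, and z-scores are only a rough guide for heavy-tailed return data.

```python
import pandas as pd


def flag_return_outliers(returns: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: return the returns whose absolute z-score exceeds threshold."""
    clean = returns.dropna()
    z_scores = (clean - clean.mean()) / clean.std()
    return clean[z_scores.abs() > threshold]


# Synthetic returns: small alternating moves plus one large jump (illustrative values only)
rets = pd.Series([0.01, -0.01] * 20 + [0.5])
outliers = flag_return_outliers(rets)
print(outliers)
```

On the real data, the same call would be `flag_return_outliers(df["Return"])`. Flagged dates are worth checking against splits, earnings releases, or data errors before deciding how to treat them.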