# Stock Price Movement Prediction
## Objective
Predict whether the closing price of an S&P 500 stock will increase or decrease on the next trading day based on the following features:
- Opening Price
- Highest Price
- Lowest Price
- Closing Price (split-adjusted, the `close` column)
- Trading Volume

## Data Description
- **Date Range**: 2010 to end of 2016
- **Companies**: 501 S&P 500 companies
- **Data Points**: 851,264 entries
- **Adjustments**: 140 stock splits adjusted in `prices-split-adjusted.csv`


## Importing Necessary Libraries
Import all essential libraries required for data manipulation, visualization, and modeling.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, classification_report

# Model Explainability
import shap

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set plot aesthetics
sns.set_theme(style="whitegrid", palette="muted")
%matplotlib inline

## Loading and Inspecting the Data
Load the dataset, inspect the first few rows, check for missing values, and understand the data types.

In [None]:
# Load the dataset
df = pd.read_csv('prices-split-adjusted.csv')

# Display the first five rows
print("First five rows of the dataset:")
display(df.head())

In [None]:
# Check the data types and non-null counts
print("\nData Information:")
df.info()

# Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

# Check unique values
print("\nUnique Values per Column:")
print(df.nunique())

## Data Preprocessing
Handle missing values, create the target variable, and perform necessary feature engineering.

### Defining the Target Variable
Create a binary target variable indicating whether the closing price rises on the next trading day (1) or not (0). Days on which the close is unchanged fall into class 0.

In [None]:
# Sort the dataframe by symbol and date to ensure correct ordering
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by=['symbol', 'date'])

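# Next trading day's close within each symbol (NaN on each symbol's last day)
df['next_close'] = df.groupby('symbol')['close'].shift(-1)

# Drop the last day for each symbol before labeling: a comparison against NaN
# evaluates to False, so those rows would otherwise be silently labeled 0
df = df.dropna(subset=['next_close']).copy()

# Create the target variable: 1 if next day's close > today's close, else 0
df['target'] = (df['next_close'] > df['close']).astype(int)
df = df.drop(columns='next_close')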

print("\nSample data with target variable:")
display(df.head())
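
The handling of each symbol's last day can be checked on a toy frame. The sketch below uses made-up tickers and prices (`AAA`, `BBB` are not from the dataset) to show that comparing a missing next-day close with `>` yields `False` rather than `NaN`, so unlabeled rows have to be dropped before the comparison, not after it.

```python
import pandas as pd

# Hypothetical toy data: two tickers, three trading days each
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "close":  [10.0, 11.0, 10.5, 50.0, 49.0, 49.0],
})

# Next day's close within each symbol; the last row per symbol is NaN
toy["next_close"] = toy.groupby("symbol")["close"].shift(-1)

# Labeling first: the NaN comparison is False, so the last days become 0
naive_target = (toy["next_close"] > toy["close"]).astype(int)

# Dropping first: only days with a known next close receive a label
labeled = toy.dropna(subset=["next_close"]).copy()
labeled["target"] = (labeled["next_close"] > labeled["close"]).astype(int)

print(naive_target.tolist())       # six labels, including the two last days
print(labeled["target"].tolist())  # four labels, last days excluded
```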

### Handling Missing Values
Report the columns that contain missing values, then impute the numerical columns with their median.

In [None]:
# Check for missing values
missing = df.isnull().sum()
print("\nMissing Values:")
print(missing[missing > 0])

# Impute missing values for numerical columns using median
numerical_cols = ['open', 'close', 'low', 'high', 'volume']
imputer = SimpleImputer(strategy='median')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])

# Verify no missing values remain
print("\nMissing Values After Imputation:")
print(df.isnull().sum())

### Feature Engineering
Create additional features that may help in prediction.

In [None]:
# Create daily return as a feature
df['daily_return'] = (df['close'] - df['open']) / df['open']

# Create volatility feature
df['volatility'] = (df['high'] - df['low']) / df['open']

# Drop any remaining missing values just in case
df = df.dropna()

print("\nData with new features:")
display(df.head())
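
As a quick worked example of the two engineered features, the sketch below applies the same formulas to a single day of hypothetical prices (the numbers are invented for illustration).

```python
# Hypothetical prices for a single trading day
open_, high, low, close = 100.0, 105.0, 99.0, 102.0

# Open-to-close return, relative to the open: (102 - 100) / 100
daily_return = (close - open_) / open_

# Intraday high-low range, relative to the open: (105 - 99) / 100
volatility = (high - low) / open_

print(round(daily_return, 4), round(volatility, 4))
```

Both features are expressed as fractions of the opening price, which makes them comparable across stocks that trade at very different price levels.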

## Exploratory Data Analysis (EDA)
Understand the distribution of features and the target variable.

### Distribution of the Target Variable

In [None]:
# Plot the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=df, palette='Set2')
plt.title('Distribution of Target Variable')
plt.xlabel('Price Increase (1) or Decrease (0)')
plt.ylabel('Count')
plt.show()

# Print the class balance
print("\nClass Distribution:")
print(df['target'].value_counts(normalize=True))
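
The class shares also give a simple reference point for later results: a model that always predicts the majority class reaches an accuracy equal to that class's share. A minimal sketch with hypothetical labels (not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels: 6 days with a higher next close, 4 without
target = pd.Series([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

class_share = target.value_counts(normalize=True)
majority_class = class_share.idxmax()

# Accuracy of always predicting the majority class
baseline_accuracy = class_share.max()

print(majority_class, baseline_accuracy)
```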

### Feature Distributions
Visualize the distribution of key numerical features.

In [None]:
# Select numerical features for plotting
features = ['open', 'close', 'low', 'high', 'volume', 'daily_return', 'volatility']

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(features, 1):
    plt.subplot(3, 3, i)
    sns.histplot(df[feature], kde=True, bins=50)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

### Correlation Analysis
Examine the correlation between features and the target variable.

In [None]:
# Compute the correlation matrix
corr_matrix = df[['open', 'close', 'low', 'high', 'volume', 'daily_return', 'volatility', 'target']].corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

## Data Preparation
Encode categorical variables, split the data into training and testing sets, and scale the features.

### Encoding Categorical Variables
Encode the `symbol` column using One-Hot Encoding.

In [None]:
# Select features and target
features = ['open', 'close', 'low', 'high', 'volume', 'daily_return', 'volatility', 'symbol']
X = df[features]
y = df['target']

# Identify categorical columns
categorical_cols = ['symbol']

if categorical_cols:
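    # `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions pass sparse=False
    encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')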
    encoded_features = encoder.fit_transform(X[categorical_cols])
    encoded_cols = encoder.get_feature_names_out(categorical_cols)
    
    # Create DataFrame for encoded features
    encoded_df = pd.DataFrame(encoded_features, columns=encoded_cols, index=X.index)
    
    # Concatenate with original features
    X = pd.concat([X.drop(columns=categorical_cols), encoded_df], axis=1)

print("\nFeatures after encoding categorical variables:")
display(X.head())

### Splitting Data into Training and Testing Sets
Split the data based on chronological order to prevent data leakage.

In [None]:
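# Sort the data by date. X no longer has a 'date' column, so take the order
# from df and apply the same order to X and y to keep them aligned
date_order = df.loc[X.index, 'date'].sort_values(kind='stable').index
X = X.loc[date_order]
y = y.loc[date_order]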

# Define the split point (e.g., last 20% as test)
split_index = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

print(f"\nTraining set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
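
A split by row position can place rows from the same trading day on both sides of the boundary. One alternative, sketched below on a hypothetical frame (tickers and dates are invented), is to pick a cutoff date so that every day belongs entirely to one set.

```python
import pandas as pd

# Hypothetical frame: 10 business days, 2 tickers per day
dates = pd.date_range("2016-01-04", periods=10, freq="B")
toy = pd.DataFrame({
    "date": dates.repeat(2),
    "symbol": ["AAA", "BBB"] * 10,
})

# Cutoff = first date of the final 20% of unique trading days
unique_dates = toy["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

# Everything before the cutoff trains; the cutoff date and later is held out
train = toy[toy["date"] < cutoff]
test = toy[toy["date"] >= cutoff]

print(len(train), len(test))
```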

### Scaling Features
Scale numerical features to ensure they contribute equally to the model.

In [None]:
# Identify numerical columns (excluding the encoded symbols)
numerical_cols = ['open', 'close', 'low', 'high', 'volume', 'daily_return', 'volatility']

# Initialize StandardScaler
scaler = StandardScaler()

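# Work on copies so the assignments below do not write into slices of X
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on training data only, then transform both training and testing data
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])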

print("\nFeatures scaled using StandardScaler.")
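
To illustrate why the scaler is fitted on the training set only, the sketch below scales a hypothetical volume column. The test values are standardized with the training mean and standard deviation, so they do not end up centered at zero, and no information from the test period enters the fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical volumes: the test period trades at a higher level than training
train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[400.0], [500.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean=200 and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```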

## Model Training and Hyperparameter Tuning
Train multiple classification models and perform hyperparameter tuning to find the best model.

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
}

# Define hyperparameters for tuning
param_grids = {
    'Logistic Regression': {'C': [0.01, 0.1, 1, 10]},
    'Random Forest': {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}
}

# Dictionary to store best models
best_models = {}

# Train and tune models
for name, model in models.items():
    print(f"\nTraining {name}...")
    if name in param_grids:
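        # Time-ordered folds: each validation fold comes after the data it is trained on
        grid = GridSearchCV(model, param_grids[name], cv=TimeSeriesSplit(n_splits=5), scoring='f1', n_jobs=-1)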
        grid.fit(X_train, y_train)
        best_models[name] = grid.best_estimator_
        print(f"Best parameters for {name}: {grid.best_params_}")
    else:
        model.fit(X_train, y_train)
        best_models[name] = model
        print(f"{name} trained without hyperparameter tuning.")
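
For chronologically ordered rows, `TimeSeriesSplit` (imported at the top of the notebook) produces folds in which validation rows always come after the training rows. The sketch below prints the fold indices for a hypothetical array of 8 ordered rows. The array and the choice of 3 splits are for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for 8 chronologically ordered rows
X_toy = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

# Each fold trains on an expanding window and validates on the rows after it
for train_idx, val_idx in folds:
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())
```

Passing such a splitter as the `cv` argument of `GridSearchCV` keeps later rows out of the training folds during tuning.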

## Model Evaluation
Evaluate the performance of each model using appropriate classification metrics.

In [None]:
# Collect one row of evaluation metrics per model
metric_rows = []

# Evaluate each model
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_proba) if y_proba is not None else np.nan
    
    metric_rows.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC-AUC': roc_auc
    })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
evaluation_metrics = pd.DataFrame(metric_rows, columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC'])

print("\nModel Evaluation Metrics on Test Set:")
display(evaluation_metrics)

# Plot evaluation metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
evaluation_metrics_melted = evaluation_metrics.melt(id_vars='Model', value_vars=metrics, var_name='Metric', value_name='Score')

plt.figure(figsize=(12, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=evaluation_metrics_melted)
plt.title('Model Comparison based on Evaluation Metrics')
plt.ylim(0, 1)
plt.legend(title='Model')
plt.show()
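
The threshold-based metrics in the table all derive from four counts. As a worked example, the sketch below computes them by hand for ten hypothetical predictions (the labels are invented, not model output).

```python
# Hypothetical actual and predicted labels for 10 trading days
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted up, was up
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted up, was not
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted not up, was up
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted not up, was not

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```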

### Confusion Matrix
Visualize the confusion matrix for each model.

In [None]:
# Plot confusion matrix for each model
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix for {name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

### Classification Report
Detailed classification report for each model.

In [None]:
# Print classification report for each model
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    print(f"\nClassification Report for {name}:")
    print(classification_report(y_test, y_pred))

### Feature Importance
Identify the most important features for each model.

In [None]:
# Feature Importance for Random Forest
if 'Random Forest' in best_models:
    rf = best_models['Random Forest']
    importances = pd.Series(rf.feature_importances_, index=X_train.columns)
    importances = importances.sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(10,6))
    sns.barplot(x=importances.values, y=importances.index, palette='viridis')
    plt.title('Top 10 Feature Importances - Random Forest')
    plt.xlabel('Importance Score')
    plt.ylabel('Features')
    plt.show()

# Feature Importance for Logistic Regression
if 'Logistic Regression' in best_models:
    lr = best_models['Logistic Regression']
    coefficients = pd.Series(lr.coef_[0], index=X_train.columns)
    coefficients = coefficients.abs().sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(10,6))
    sns.barplot(x=coefficients.values, y=coefficients.index, palette='magma')
    plt.title('Top 10 Feature Coefficients - Logistic Regression')
    plt.xlabel('Coefficient Magnitude')
    plt.ylabel('Features')
    plt.show()

## Model Explainability
Use SHAP values to interpret the contribution of each feature to the model's predictions.

In [None]:
# Select the best model based on F1 Score
best_model_name = evaluation_metrics.sort_values(by='F1 Score', ascending=False).iloc[0]['Model']
best_model = best_models[best_model_name]
print(f"\nBest Model: {best_model_name}")

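# Initialize SHAP explainer on samples of the data
# (the sample size of 1000 is an arbitrary choice to keep runtime manageable)
background = X_train.sample(n=min(1000, len(X_train)), random_state=42)
X_explain = X_test.sample(n=min(1000, len(X_test)), random_state=42)
explainer = shap.Explainer(best_model, background)
shap_values = explainer(X_explain)

# Tree classifiers can return one set of SHAP values per class; keep class 1
values = shap_values.values
if values.ndim == 3:
    values = values[:, :, 1]

# Summary plot
shap.summary_plot(values, X_explain, plot_type="bar", show=False)
plt.title('SHAP Feature Importance')
plt.tight_layout()
plt.show()

# Dependence plot for the feature with the largest mean absolute SHAP value
top_feature = X_explain.columns[np.abs(values).mean(axis=0).argmax()]
shap.dependence_plot(top_feature, values, X_explain, show=False)
plt.title(f'SHAP Dependence Plot for {top_feature}')
plt.tight_layout()
plt.show()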

## Conclusion
Summarize the findings, model performance, and potential next steps for improving predictions.