# Predictive Analytics for Stock Market Trends


This notebook walks through the complete pipeline of building machine‑learning models to predict stock market trends using publicly available historical price data.

## Objective
- Predict the **next day's closing price** of a chosen stock index (default: *NIFTY 50*) and
- Classify whether the price will **move Up or Down**.

We will compare several models and discuss their merits, with reproducible code you can adapt to any ticker symbol.

## Data Source
Historical prices are downloaded on‑the‑fly from **Yahoo Finance** via the `yfinance` Python package. The data includes *Open, High, Low, Close,* and *Volume* columns; *Adj Close* appears only when prices are not auto‑adjusted, which depends on the `yfinance` version and its `auto_adjust` setting.

> **Note**: An internet connection is required to fetch the data when the notebook runs.

## Import Libraries

In [None]:
!pip -q install yfinance ta statsmodels shap scikit-learn matplotlib seaborn

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ta import add_all_ta_features
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, classification_report, confusion_matrix
import shap
sns.set_style('whitegrid')
%matplotlib inline


## Import Data

In [None]:
# Choose ticker and date range
ticker = '^NSEI'  # NIFTY 50 Index
start_date = '2015-01-01'
end_date   = None   # today

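df = yf.download(ticker, start=start_date, end=end_date, progress=False)

# Recent yfinance releases may return (field, ticker) MultiIndex columns even for
# a single ticker; flatten them so df['Close'] is a plain Series
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)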
df.head()


## Describe Data

In [None]:
df.describe().T

## Exploratory Data Analysis

In [None]:
plt.figure(figsize=(12,4))
plt.plot(df['Close'])
plt.title(f'{ticker} Closing Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()


## Feature Engineering – Technical Indicators

In [None]:
# Use the 'ta' package to compute ~40 technical indicators
df_ta = add_all_ta_features(
    df.copy(),
    open='Open', high='High', low='Low', close='Close', volume='Volume')

# Shift close to obtain next‑day target
df_ta['Target_Close'] = df_ta['Close'].shift(-1)

# Binary target (Up = 1, Down = 0)
df_ta['Target_Direction'] = (df_ta['Target_Close'] > df_ta['Close']).astype(int)

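# Some indicators (e.g. the parabolic SAR up/down columns) are NaN on many rows
# by construction; drop such sparse columns first so dropna() does not empty the frame
df_ta = df_ta.dropna(axis=1, thresh=int(0.9 * len(df_ta)))

# Drop remaining rows with NaNs: the indicator warm-up period and the
# last row, which has no next-day target
df_ta = df_ta.dropna()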

df_ta.head()
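
To make the target construction concrete, here is a minimal sketch on a toy series. The four prices are made-up values, not market data, and the `toy` frame is purely illustrative. It shows that `shift(-1)` pulls tomorrow's close onto today's row, and that the final row has no target and must be dropped.

```python
import pandas as pd

# Toy closing prices (made-up values for illustration only)
close = pd.Series([100.0, 102.0, 101.0, 105.0], name="Close")

toy = close.to_frame()
toy["Target_Close"] = toy["Close"].shift(-1)       # next day's close on today's row
toy = toy.dropna(subset=["Target_Close"])          # last row has no "tomorrow"
toy["Target_Direction"] = (toy["Target_Close"] > toy["Close"]).astype(int)

print(toy)
```

Dropping the last row before labelling avoids marking it as "Down" simply because its target is missing.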


## Data Preprocessing

In [None]:
# Separate features and targets
feature_cols = df_ta.columns.difference(['Target_Close', 'Target_Direction'])
X = df_ta[feature_cols]
y_reg = df_ta['Target_Close']
y_cls = df_ta['Target_Direction']

# Train‑test split preserving temporal order (last 20% as test)
split_idx = int(len(df_ta)*0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_reg_train, y_reg_test = y_reg.iloc[:split_idx], y_reg.iloc[split_idx:]
y_cls_train, y_cls_test = y_cls.iloc[:split_idx], y_cls.iloc[split_idx:]

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
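
A single 80/20 split gives one estimate of out-of-sample error. As a hedged sketch of an alternative, the snippet below runs a walk-forward evaluation with `TimeSeriesSplit`, refitting the scaler inside each fold so no test-period statistics leak into training. It uses synthetic data (`X_demo`, `y_demo` are assumptions, not the notebook's features) so it can run without a download.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for time-ordered features and a target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []
for train_idx, test_idx in tscv.split(X_demo):
    # Every training row precedes every test row in each fold
    assert train_idx.max() < test_idx.min()
    # The pipeline fits the scaler on the training fold only
    pipe = make_pipeline(StandardScaler(), LinearRegression())
    pipe.fit(X_demo[train_idx], y_demo[train_idx])
    preds = pipe.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], preds))

print([round(m, 3) for m in fold_mae])
```

In the notebook itself, the same loop could be applied to `X` and `y_reg` in place of the synthetic arrays.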


## Modeling – Regression (Predict Price)

In [None]:
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest'    : RandomForestRegressor(
                            n_estimators=300,
                            max_depth=None,
                            n_jobs=-1,
                            random_state=42)
}

reg_results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_reg_train)
    preds = model.predict(X_test_scaled)
    reg_results[name] = {
        'MAE': mean_absolute_error(y_reg_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_reg_test, preds)),
        'R²': r2_score(y_reg_test, preds)
    }

pd.DataFrame(reg_results).T
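
Error metrics on price levels are easier to interpret next to a naive reference. The sketch below scores a persistence forecast ("tomorrow's close equals today's close"). The helper name `persistence_metrics` is hypothetical and the numbers are toy values.

```python
import numpy as np

def persistence_metrics(close_today, close_next):
    """MAE and RMSE of the 'tomorrow equals today' forecast."""
    close_today = np.asarray(close_today, dtype=float)
    close_next = np.asarray(close_next, dtype=float)
    err = close_next - close_today
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
    }

# Toy values for illustration only
today = [100.0, 102.0, 101.0]
tomorrow = [102.0, 101.0, 105.0]
baseline = persistence_metrics(today, tomorrow)
print(baseline)
```

Assuming `Close` is among the feature columns, `persistence_metrics(X_test['Close'], y_reg_test)` would give the corresponding baseline for the test period. A model that does not beat it adds little over carrying the last price forward.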


## Modeling – Classification (Up / Down)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

cls_model = GradientBoostingClassifier(random_state=42)
cls_model.fit(X_train_scaled, y_cls_train)
cls_pred = cls_model.predict(X_test_scaled)

print(classification_report(y_cls_test, cls_pred, target_names=['Down','Up']))
conf_mat = confusion_matrix(y_cls_test, cls_pred)

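# Label the axes so rows (actual) and columns (predicted) are unambiguous
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Down', 'Up'], yticklabels=['Down', 'Up'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')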
plt.show()
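
Classification accuracy is only meaningful relative to the class balance. As a rough sketch, the hypothetical helper below computes the accuracy of always predicting the majority class seen in training. The labels are toy values.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the training set's majority class."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    majority = int(np.mean(y_train) >= 0.5)   # 1 = Up, 0 = Down
    return majority, float(np.mean(y_test == majority))

# Toy labels for illustration only
majority, acc = majority_baseline_accuracy([1, 1, 0, 1], [1, 0, 1, 1, 0])
print(majority, acc)
```

Calling it with `y_cls_train` and `y_cls_test` gives the figure the gradient-boosting classifier should exceed before its accuracy is read as predictive skill.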


## Prediction vs Actual – Regression

In [None]:
best_model_name = min(reg_results, key=lambda k: reg_results[k]['RMSE'])
best_model = models[best_model_name]
pred = best_model.predict(X_test_scaled)

plt.figure(figsize=(12,4))
plt.plot(y_reg_test.values, label='Actual')
plt.plot(pred, label='Predicted')
plt.title(f'Actual vs Predicted Closing Price ({best_model_name})')
plt.legend()
plt.show()


## Explainability – SHAP Feature Importance

In [None]:
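# Compute SHAP values for a sample subset to save time.
# Passing feature_names keeps the column labels, which are lost when scaling to NumPy arrays
explainer = shap.Explainer(best_model, X_train_scaled, feature_names=list(feature_cols))
shap_values = explainer(X_test_scaled[:100])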

shap.plots.beeswarm(shap_values)


## Conclusion
- Technical indicators can improve predictive performance over raw price data alone.
- *Random Forest* captures non‑linear relationships but can overfit, and as a tree ensemble it cannot predict price levels outside the range seen in training.
- Always evaluate models on **out‑of‑sample** data and update with new data.
- Consider more advanced architectures (e.g., LSTM/Transformer) for sequential dependencies.