<a href="https://colab.research.google.com/github/tomararpit147/Project-1/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Member** - Arpit Tomar

# **Project Summary -**

This project aims to predict Yes Bank's monthly closing stock price using machine learning models.
Using historical monthly stock data from 2005 to 2020, we implement multiple regression
algorithms, including Linear Regression, Random Forest, XGBoost, and LSTM networks,
to forecast future prices from historical patterns and engineered features.

# **GitHub Link -**

https://github.com/tomararpit147/Project-1

# **Problem Statement**


Predict Yes Bank's monthly closing stock prices using historical data and
technical indicators to assist investors in making informed trading decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers these questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following points.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import joblib
import datetime
import sys
warnings.filterwarnings('ignore')

# Preprocessing
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Machine Learning Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Advanced ML
import xgboost as xgb
import lightgbm as lgb

# Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Dropout, GRU, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# For time series analysis
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Feature selection
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.decomposition import PCA

# Model explainability
import shap
from sklearn.inspection import permutation_importance

# Statistical tests
from scipy import stats
from scipy.stats import boxcox, normaltest, jarque_bera

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
sns.set_context("notebook", font_scale=1.2)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
print("✅ All libraries imported successfully!")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"TensorFlow version: {tf.__version__}")

### Dataset Loading

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Load Dataset
df = pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
print("First 5 rows of dataset:")
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows and columns in the dataset:")
df.shape

### Dataset Information

In [None]:
# Dataset Info
print("Information about the dataset:")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Number of duplicate values in the dataset:")
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Number of missing values in each column:")
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset contains monthly stock prices of Yes Bank from July 2005 to November 2020. It has 185 rows and 5 columns (Date, Open, High, Low, Close). All columns are numerical except Date. There are no missing values or duplicates, making it clean for analysis.
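
The row count quoted above can be cross-checked against the stated date range. The sketch below assumes one row per calendar month with no gaps, and that the labels `Jul-05` and `Nov-20` (in the dataset's `MMM-YY` format) mark the first and last rows. The helper name `months_inclusive` is illustrative and does not appear elsewhere in the notebook.

```python
from datetime import datetime


def months_inclusive(first_label: str, last_label: str, fmt: str = "%b-%y") -> int:
    """Count calendar months from first_label to last_label, both included."""
    first = datetime.strptime(first_label, fmt)
    last = datetime.strptime(last_label, fmt)
    return (last.year - first.year) * 12 + (last.month - first.month) + 1


# Assumed endpoints, taken from the date range stated in the text.
expected_rows = months_inclusive("Jul-05", "Nov-20")
print(expected_rows)  # 185
```

If `df.shape[0]` differs from this count, some months are probably missing or duplicated in the CSV.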

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Columns in the dataset:")
df.columns

In [None]:
# Dataset Describe
print("Dataset describe:")
df.describe()

### Variables Description

1. **Date**: Month and year of stock price (MMM-YY format)
2. **Open**: Opening price of the stock for the month
3. **High**: Highest price during the month
4. **Low**: Lowest price during the month
5. **Close**: Closing price at the end of the month

All prices are in Indian Rupees (INR).
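
Given these definitions, every row should satisfy `Low <= Open, Close <= High`. The following is a minimal sketch of such a sanity check. The function name `find_inconsistent_ohlc` and the sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def find_inconsistent_ohlc(prices: pd.DataFrame) -> pd.DataFrame:
    """Return rows where High/Low do not bracket the other price columns."""
    high_ok = prices["High"] >= prices[["Open", "Close", "Low"]].max(axis=1)
    low_ok = prices["Low"] <= prices[["Open", "Close", "High"]].min(axis=1)
    return prices[~(high_ok & low_ok)]


# Made-up sample: the third row has a High below its Open, which is invalid.
sample = pd.DataFrame({
    "Open": [13.00, 12.58, 13.48],
    "High": [14.00, 14.88, 13.00],
    "Low": [11.25, 12.55, 12.00],
    "Close": [12.46, 13.42, 13.30],
})
bad_rows = find_inconsistent_ohlc(sample)
print(bad_rows)
```

Running the same function on `df` before feature engineering would flag any rows worth inspecting by hand.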

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_count = df[column].nunique()
    print(f"{column}: {unique_count} unique values")

In [None]:
# Convert Date to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Month_Name'] = df['Date'].dt.strftime('%B')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create additional features for better analysis
df['Price_Range'] = df['High'] - df['Low']  # Monthly high-low range (volatility proxy)
df['Avg_Price'] = (df['Open'] + df['High'] + df['Low'] + df['Close']) / 4  # Average of the four OHLC prices
df['Open_Close_Change'] = ((df['Close'] - df['Open']) / df['Open']) * 100  # Monthly open-to-close return (%)
df['High_Low_Ratio'] = df['High'] / df['Low']  # Volatility ratio
df['Cumulative_Return'] = (df['Close'] / df['Close'].iloc[0] - 1) * 100  # Cumulative return from start (%)

# Create rolling statistics
df['MA_12'] = df['Close'].rolling(window=12).mean()  # 12-month moving average
df['Volatility'] = df['Close'].pct_change().rolling(window=12).std() * 100  # 12-month rolling std of monthly returns (%), not annualized

print("Dataset after feature engineering:")
df.head()

In [None]:
print("Starting Data Wrangling Process...")
print("=" * 60)

# 1. Convert Date to datetime
print("\n1. Converting Date to datetime format...")
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
print(f"   ✅ Date range: {df['Date'].min().strftime('%b-%Y')} to {df['Date'].max().strftime('%b-%Y')}")

# 2. Sort by date
df = df.sort_values('Date').reset_index(drop=True)
print("   ✅ Data sorted chronologically")

# 3. Extract time-based features
print("\n2. Creating time-based features...")
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Month_Name'] = df['Date'].dt.strftime('%B')
df['Year_Month'] = df['Date'].dt.strftime('%Y-%m')
print("   ✅ Year, Month, Quarter, Month_Name features created")

# 4. Create lag features (autoregressive components)
print("\n3. Creating lag features...")
lags = [1, 2, 3, 6, 12]
for lag in lags:
    df[f'Close_Lag_{lag}'] = df['Close'].shift(lag)
    df[f'Open_Lag_{lag}'] = df['Open'].shift(lag)
    df[f'High_Lag_{lag}'] = df['High'].shift(lag)
    df[f'Low_Lag_{lag}'] = df['Low'].shift(lag)
print(f"   ✅ Created lag features for periods: {lags}")

# 5. Create rolling statistics
print("\n4. Creating rolling statistics...")
windows = [3, 6, 12]
for window in windows:
    # Moving averages
    df[f'Close_MA_{window}'] = df['Close'].rolling(window=window).mean()
    df[f'Close_MA_{window}_shift'] = df[f'Close_MA_{window}'].shift(1)

    # Rolling standard deviation (volatility)
    df[f'Close_Std_{window}'] = df['Close'].rolling(window=window).std()

    # Rolling min and max
    df[f'Close_Min_{window}'] = df['Close'].rolling(window=window).min()
    df[f'Close_Max_{window}'] = df['Close'].rolling(window=window).max()

    # Price range rolling statistics
    df[f'Range_MA_{window}'] = (df['High'] - df['Low']).rolling(window=window).mean()
print(f"   ✅ Created rolling statistics for windows: {windows}")

# 6. Create price-based features
print("\n5. Creating price-based features...")
df['Price_Range'] = df['High'] - df['Low']
df['Price_Range_Pct'] = (df['Price_Range'] / df['Low']) * 100
df['Open_Close_Change'] = df['Close'] - df['Open']
df['Open_Close_Return'] = ((df['Close'] - df['Open']) / df['Open']) * 100
df['High_Low_Ratio'] = df['High'] / df['Low']
df['OHLC_Avg'] = (df['Open'] + df['High'] + df['Low'] + df['Close']) / 4
df['Close_to_High'] = (df['High'] - df['Close']) / df['Close'] * 100
df['Close_to_Low'] = (df['Close'] - df['Low']) / df['Low'] * 100
print("   ✅ Created 8 price-based features")

# 7. Create technical indicators
print("\n6. Creating technical indicators...")

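# RSI (Relative Strength Index), simple-moving-average variant
def calculate_rsi(data, periods=14):
    delta = data.diff()  # month-over-month price change
    gain = (delta.where(delta > 0, 0)).rolling(window=periods).mean()  # average gain over the window
    loss = (-delta.where(delta < 0, 0)).rolling(window=periods).mean()  # average loss, as a positive number
    rs = gain / loss  # relative strength; infinite when the window has no losses (RSI = 100)
    rsi = 100 - (100 / (1 + rs))
    return rsi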

df['RSI'] = calculate_rsi(df['Close'], 14)
print("   ✅ RSI calculated")

# MACD (Moving Average Convergence Divergence)
exp1 = df['Close'].ewm(span=12, adjust=False).mean()
exp2 = df['Close'].ewm(span=26, adjust=False).mean()
df['MACD'] = exp1 - exp2
df['MACD_Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()
df['MACD_Histogram'] = df['MACD'] - df['MACD_Signal']
print("   ✅ MACD calculated")

# Bollinger Bands
df['BB_Middle'] = df['Close'].rolling(window=20).mean()
df['BB_Std'] = df['Close'].rolling(window=20).std()
df['BB_Upper'] = df['BB_Middle'] + (df['BB_Std'] * 2)
df['BB_Lower'] = df['BB_Middle'] - (df['BB_Std'] * 2)
df['BB_Width'] = df['BB_Upper'] - df['BB_Lower']
df['BB_Position'] = (df['Close'] - df['BB_Lower']) / (df['BB_Upper'] - df['BB_Lower'])
print("   ✅ Bollinger Bands calculated")

# Volume proxy features (using price range as volume proxy)
df['Volume_Proxy'] = df['Price_Range'] * df['Close']
df['Volume_Proxy_MA_12'] = df['Volume_Proxy'].rolling(window=12).mean()
print("   ✅ Volume proxy features created")

# 8. Create interaction features
print("\n7. Creating interaction features...")
df['Open_High_Interaction'] = df['Open'] * df['High']
df['Open_Low_Interaction'] = df['Open'] * df['Low']
df['High_Low_Interaction'] = df['High'] * df['Low']
print("   ✅ Created interaction features")

# 9. Drop NaN values
print("\n8. Handling missing values...")
initial_rows = len(df)
df_clean = df.dropna().reset_index(drop=True)
final_rows = len(df_clean)
rows_dropped = initial_rows - final_rows
print(f"   ✅ Dropped {rows_dropped} rows with NaN values")
print(f"   ✅ Final dataset shape: {df_clean.shape}")

# 10. Verify data types
print("\n9. Verifying data types...")
print(df_clean.dtypes.value_counts())

print("\n" + "=" * 60)
print("✅ Data Wrangling Complete!")

print(f"📊 Final dataset has {df_clean.shape[0]} rows and {df_clean.shape[1]} columns")

In [None]:
# Display first few rows of cleaned dataset
print("\n📋 First 5 rows of processed dataset:")
df_clean.head()

### What all manipulations have you done and insights you found?

1. **Date Processing** (5 features)
   - Converted string dates to datetime
   - Extracted Year, Month, Quarter, Month_Name, Year_Month

2. **Lag Features** (20 features)
   - Created 1,2,3,6,12 month lags for all price columns
   - Enables autoregressive modeling

3. **Rolling Statistics** (18 features)
   - Moving averages (3, 6, 12 months) and their one-month-shifted versions
   - Rolling volatility (standard deviation)
   - Rolling min/max prices
   - Rolling mean of the High-Low range

4. **Price-based Features** (8 features)
   - Price range and percentage range
   - Returns and changes
   - OHLC averages and ratios

5. **Technical Indicators** (12 features)
   - RSI (momentum oscillator)
   - MACD (trend following)
   - Bollinger Bands (volatility)

6. **Interaction Features** (3 features)
   - Price multiplications for non-linear relationships

**Key Insights from Wrangling:**
- Time-based features capture seasonality in stock prices
- Lag features show strong autocorrelation (prices depend on past values)
- Technical indicators provide additional predictive power
- Rolling statistics help identify trend changes and volatility regimes
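
One caveat when rolling statistics of `Close` are used to predict `Close` for the same month: a window that includes the current row already contains the target. The sketch below, on made-up numbers, contrasts such a window with one shifted back by a month. It is an illustration of the idea under that assumption, not the notebook's final feature set.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for six consecutive months.
close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], name="Close")

# Window ends at the current month, so the value at month t includes Close[t].
ma3_leaky = close.rolling(window=3).mean()

# Shift first, then roll: the value at month t uses only months t-3 .. t-1.
ma3_safe = close.shift(1).rolling(window=3).mean()

comparison = pd.DataFrame({"Close": close, "MA3_leaky": ma3_leaky, "MA3_safe": ma3_safe})
print(comparison)
```

At index 3 the shifted feature is the mean of 10, 12 and 11, while the unshifted one already averages in the 15 it is supposed to help predict.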

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code - Time Series Decomposition
fig, axes = plt.subplots(4, 1, figsize=(16, 12))

# Perform seasonal decomposition
decomposition = seasonal_decompose(df_clean['Close'].values, model='multiplicative', period=12)

# Original series
axes[0].plot(df_clean['Date'], df_clean['Close'], color='blue', linewidth=1.5)
axes[0].set_title('Original Time Series - Yes Bank Closing Prices', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Price (INR)')
axes[0].grid(True, alpha=0.3)

# Trend component
axes[1].plot(df_clean['Date'], decomposition.trend, color='red', linewidth=1.5)
axes[1].set_title('Trend Component', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Price (INR)')
axes[1].grid(True, alpha=0.3)

# Seasonal component
axes[2].plot(df_clean['Date'], decomposition.seasonal, color='green', linewidth=1.5)
axes[2].set_title('Seasonal Component', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Seasonal Effect')
axes[2].grid(True, alpha=0.3)

# Residual component
axes[3].plot(df_clean['Date'], decomposition.resid, color='orange', linewidth=1.5)
axes[3].set_title('Residual (Noise) Component', fontsize=14, fontweight='bold')
axes[3].set_xlabel('Date')
axes[3].set_ylabel('Residuals')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Time series decomposition helps understand the underlying components of stock prices: trend, seasonality, and noise. This is crucial for feature engineering and model selection.

##### 2. What is/are the insight(s) found from the chart?

- Strong upward trend until 2018, then sharp decline
- Clear seasonal patterns (annual cycles)
- Increasing variance in residuals during high volatility periods
- Multiplicative seasonality (seasonal amplitude increases with price)
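
To make the multiplicative model concrete, the sketch below estimates month-of-year seasonal indices by the ratio-to-moving-average method on a synthetic series. The series (a linear trend times a fixed 12-month pattern) is an assumption chosen so the true indices are known; it is similar in spirit to a classical decomposition but is not the `seasonal_decompose` implementation.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend multiplied by a fixed 12-month pattern.
months = np.arange(48)
pattern = 1 + 0.1 * np.sin(2 * np.pi * months / 12)
series = pd.Series((100 + 2 * months) * pattern)

# Centered 2x12 moving average as the trend estimate (NaN at both ends).
trend = series.rolling(12).mean().rolling(2).mean().shift(-6)

# In a multiplicative model, series / trend isolates seasonality times noise.
ratio = series / trend

# Average the ratios per calendar position to get one index per month.
seasonal_index = ratio.groupby(months % 12).mean()
print(seasonal_index.round(3))
```

An index above 1 marks a month that tends to sit above trend by that proportion, which is why the seasonal swing in price units grows as the price level rises.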

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding these components helps in:
- Identifying long-term investment opportunities (trend)
- Timing entries/exits based on seasonal patterns
- Risk assessment through residual volatility analysis

#### Chart - 2

In [None]:
# Chart - 2 visualization code - Autocorrelation Analysis (ACF and PACF)
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# ACF of original series
plot_acf(df_clean['Close'], lags=30, ax=axes[0,0])
axes[0,0].set_title('Autocorrelation Function (ACF) - Close Price', fontsize=12, fontweight='bold')
axes[0,0].set_xlabel('Lag')
axes[0,0].set_ylabel('Autocorrelation')
axes[0,0].grid(True, alpha=0.3)

# PACF of original series
plot_pacf(df_clean['Close'], lags=30, ax=axes[0,1], method='ywm')
axes[0,1].set_title('Partial Autocorrelation Function (PACF) - Close Price', fontsize=12, fontweight='bold')
axes[0,1].set_xlabel('Lag')
axes[0,1].set_ylabel('Partial Autocorrelation')
axes[0,1].grid(True, alpha=0.3)

# ACF of returns
plot_acf(df_clean['Open_Close_Return'].dropna(), lags=30, ax=axes[1,0])
axes[1,0].set_title('ACF - Monthly Returns', fontsize=12, fontweight='bold')
axes[1,0].set_xlabel('Lag')
axes[1,0].set_ylabel('Autocorrelation')
axes[1,0].grid(True, alpha=0.3)

# PACF of returns
plot_pacf(df_clean['Open_Close_Return'].dropna(), lags=30, ax=axes[1,1], method='ywm')
axes[1,1].set_title('PACF - Monthly Returns', fontsize=12, fontweight='bold')
axes[1,1].set_xlabel('Lag')
axes[1,1].set_ylabel('Partial Autocorrelation')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Autocorrelation Analysis:")
print("Strong autocorrelation in prices up to lag 12 (1 year)")
print("Weak autocorrelation in returns (suggests market efficiency)")

##### 1. Why did you pick the specific chart?

ACF and PACF are essential for time series modeling to understand the correlation structure and determine appropriate lag orders for ARIMA/SARIMA models.

##### 2. What is/are the insight(s) found from the chart?

- Price shows strong autocorrelation up to 12 lags (prices highly dependent on past values)
- Returns show minimal autocorrelation (random walk behavior)
- Significant spikes at lag 1 and lag 12 suggest AR(1) and seasonal AR(1) components
- PACF cuts off after lag 1 for returns
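
The contrast between prices and returns can be reproduced on synthetic data. The sketch below computes the lag-1 sample autocorrelation by hand for a simulated random walk and for its increments. The data are random draws, not Yes Bank prices, and `autocorr` is an illustrative helper rather than a statsmodels function.

```python
import numpy as np


def autocorr(x: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered))


rng = np.random.default_rng(42)
steps = rng.normal(size=2000)   # synthetic i.i.d. "returns"
walk = np.cumsum(steps)         # synthetic random-walk "prices"

acf_prices = autocorr(walk, 1)
acf_returns = autocorr(steps, 1)
print(f"lag-1 ACF, prices:  {acf_prices:.3f}")
print(f"lag-1 ACF, returns: {acf_returns:.3f}")
```

A level series built by accumulating shocks is highly autocorrelated even when the shocks themselves are unpredictable, so a strong ACF in prices alone is weak evidence of forecastability.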

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Critical for:
- Selecting appropriate lag features in ML models
- Understanding market efficiency (weak form)
- Developing mean-reversion or momentum strategies

#### Chart - 3

In [None]:
# Chart - 3 visualization code - Stationarity Tests
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Original series
axes[0,0].plot(df_clean['Date'], df_clean['Close'], color='blue')
axes[0,0].set_title('Original Close Price', fontsize=12, fontweight='bold')
axes[0,0].set_ylabel('Price')
axes[0,0].grid(True, alpha=0.3)

# ADF test result for original
result_orig = adfuller(df_clean['Close'])
axes[0,1].text(0.1, 0.5, f'ADF Test - Original Series\n\nADF Statistic: {result_orig[0]:.4f}\np-value: {result_orig[1]:.4f}\n\nCritical Values:\n1%: {result_orig[4]["1%"]:.4f}\n5%: {result_orig[4]["5%"]:.4f}\n10%: {result_orig[4]["10%"]:.4f}',
               transform=axes[0,1].transAxes, fontsize=12, verticalalignment='center')
axes[0,1].axis('off')
axes[0,1].set_title('Augmented Dickey-Fuller Test', fontsize=12, fontweight='bold')

# First difference
df_clean['Close_Diff1'] = df_clean['Close'].diff()
axes[1,0].plot(df_clean['Date'], df_clean['Close_Diff1'], color='green')
axes[1,0].set_title('First Difference', fontsize=12, fontweight='bold')
axes[1,0].set_ylabel('Price Change')
axes[1,0].grid(True, alpha=0.3)

# ADF test for first difference
result_diff = adfuller(df_clean['Close_Diff1'].dropna())
axes[1,1].text(0.1, 0.5, f'ADF Test - First Difference\n\nADF Statistic: {result_diff[0]:.4f}\np-value: {result_diff[1]:.4f}\n\nCritical Values:\n1%: {result_diff[4]["1%"]:.4f}\n5%: {result_diff[4]["5%"]:.4f}\n10%: {result_diff[4]["10%"]:.4f}',
               transform=axes[1,1].transAxes, fontsize=12, verticalalignment='center')
axes[1,1].axis('off')
axes[1,1].set_title('ADF Test - First Difference', fontsize=12, fontweight='bold')

# Rolling statistics
rolling_mean = df_clean['Close'].rolling(window=12).mean()
rolling_std = df_clean['Close'].rolling(window=12).std()

axes[0,2].plot(df_clean['Date'], df_clean['Close'], label='Original', alpha=0.7)
axes[0,2].plot(df_clean['Date'], rolling_mean, label='12-month Rolling Mean', color='red')
axes[0,2].plot(df_clean['Date'], rolling_std, label='12-month Rolling Std', color='green')
axes[0,2].set_title('Rolling Statistics', fontsize=12, fontweight='bold')
axes[0,2].set_xlabel('Date')
axes[0,2].set_ylabel('Price')
axes[0,2].legend()
axes[0,2].grid(True, alpha=0.3)

# Log transformation
df_clean['Close_Log'] = np.log(df_clean['Close'])
axes[1,2].plot(df_clean['Date'], df_clean['Close_Log'], color='purple')
axes[1,2].set_title('Log Transformed Series', fontsize=12, fontweight='bold')
axes[1,2].set_xlabel('Date')
axes[1,2].set_ylabel('Log(Price)')
axes[1,2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Interpretation
print("\n📊 Stationarity Test Results:")
print(f"Original Series p-value: {result_orig[1]:.6f} - {'Non-stationary' if result_orig[1] > 0.05 else 'Stationary'}")
print(f"First Difference p-value: {result_diff[1]:.6f} - {'Non-stationary' if result_diff[1] > 0.05 else 'Stationary'}")

##### 1. Why did you pick the specific chart?

Stationarity tests determine whether transformations are needed before time series modeling. Classical models such as ARIMA assume a stationary series, and ML models often generalize better when their inputs are stationary.

##### 2. What is/are the insight(s) found from the chart?

- Original series is non-stationary (p-value > 0.05)
- First difference achieves stationarity (p-value < 0.05)
- Rolling statistics show changing mean and variance over time
- Log transformation helps stabilize variance

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Essential for:
- Choosing between price vs return prediction
- Understanding risk dynamics over time
- Selecting appropriate transformations for model inputs
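
If returns are modeled instead of prices, predictions have to be mapped back to price levels. The sketch below combines the log transform and first difference into log returns and then inverts them. The four prices are made up for illustration.

```python
import numpy as np

# Made-up monthly closing prices.
prices = np.array([100.0, 110.0, 99.0, 120.0])

# Log transform followed by first difference gives log returns.
log_returns = np.diff(np.log(prices))

# Invert: cumulative sum of log returns, anchored at the first observed price.
rebuilt = prices[0] * np.exp(np.concatenate(([0.0], np.cumsum(log_returns))))

print(log_returns.round(4))
print(rebuilt.round(2))
```

The round trip is exact up to floating-point error, so a model trained on log returns can still be evaluated in INR.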

#### Chart - 4

In [None]:
# Chart - 4 visualization code - Feature Correlation Heatmap
# Select numerical features for correlation analysis
feature_cols = ['Close', 'Open', 'High', 'Low', 'Price_Range', 'Open_Close_Return',
                'RSI', 'MACD', 'BB_Width', 'Volume_Proxy', 'Close_Lag_1', 'Close_Lag_12',
                'Close_MA_12', 'Close_Std_12', 'Year', 'Month']

# Ensure all selected columns exist
available_cols = [col for col in feature_cols if col in df_clean.columns]
corr_df = df_clean[available_cols].copy()

# Calculate correlation matrix
corr_matrix = corr_df.corr()

# Create heatmap
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Full correlation heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, linewidths=1, fmt='.2f', cbar_kws={"shrink": 0.8}, ax=axes[0])
axes[0].set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold')

# Correlation with target (Close price)
target_corr = corr_matrix['Close'].sort_values(ascending=False)
target_corr_df = pd.DataFrame({
    'Feature': target_corr.index,
    'Correlation': target_corr.values
})

colors = ['green' if x > 0 else 'red' for x in target_corr.values]
axes[1].barh(range(len(target_corr_df)), target_corr_df['Correlation'], color=colors)
axes[1].set_yticks(range(len(target_corr_df)))
axes[1].set_yticklabels(target_corr_df['Feature'])
axes[1].set_xlabel('Correlation with Close Price')
axes[1].set_title('Feature Importance by Correlation', fontsize=14, fontweight='bold')
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n📊 Top 5 Features by Correlation with Close Price:")
print(target_corr_df.head(5).to_string(index=False))

##### 1. Why did you pick the specific chart?

Correlation heatmap helps identify relationships between features and multicollinearity, which is crucial for feature selection and model interpretation.

##### 2. What is/are the insight(s) found from the chart?

- Strong multicollinearity between Open, High, Low, Close (expected)
- Lag features highly correlated with target
- Technical indicators show moderate correlation
- Month shows weak correlation (consistent with seasonal decomposition)


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Critical for:
- Feature selection to avoid multicollinearity
- Understanding which factors drive stock prices
- Building interpretable models
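
One common way to quantify multicollinearity is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below uses synthetic features, two nearly identical and one unrelated, as a stand-in for columns like Open and Close. The variable names and the 0.05 noise scale are assumptions made for the illustration.

```python
import numpy as np


def variance_inflation_factors(features: np.ndarray) -> np.ndarray:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
open_like = rng.normal(size=500)
close_like = open_like + 0.05 * rng.normal(size=500)  # nearly a copy of open_like
unrelated = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([open_like, close_like, unrelated]))
print(dict(zip(["open_like", "close_like", "unrelated"], vif.round(2))))
```

A VIF near 1 means a feature is not explained by the others; values far above 10 are a frequently used rule-of-thumb signal that one of the collinear features can be dropped.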

#### Chart - 5

In [None]:
# Chart - 5 visualization code - Distribution Analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Original Close price distribution
axes[0,0].hist(df_clean['Close'], bins=40, color='skyblue', edgecolor='black', alpha=0.7)
axes[0,0].axvline(df_clean['Close'].mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {df_clean['Close'].mean():.2f}")
axes[0,0].axvline(df_clean['Close'].median(), color='green', linestyle='--', linewidth=2, label=f"Median: {df_clean['Close'].median():.2f}")
axes[0,0].set_title('Distribution of Close Prices', fontsize=12, fontweight='bold')
axes[0,0].set_xlabel('Close Price (INR)')
axes[0,0].set_ylabel('Frequency')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Log-transformed distribution
axes[0,1].hist(np.log(df_clean['Close']), bins=40, color='lightgreen', edgecolor='black', alpha=0.7)
axes[0,1].axvline(np.log(df_clean['Close']).mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {np.log(df_clean['Close']).mean():.2f}")
axes[0,1].set_title('Log-Transformed Distribution', fontsize=12, fontweight='bold')
axes[0,1].set_xlabel('Log(Close Price)')
axes[0,1].set_ylabel('Frequency')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Returns distribution
axes[0,2].hist(df_clean['Open_Close_Return'].dropna(), bins=40, color='coral', edgecolor='black', alpha=0.7)
axes[0,2].axvline(df_clean['Open_Close_Return'].mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {df_clean['Open_Close_Return'].mean():.2f}%")
axes[0,2].axvline(df_clean['Open_Close_Return'].median(), color='green', linestyle='--', linewidth=2, label=f"Median: {df_clean['Open_Close_Return'].median():.2f}%")
axes[0,2].set_title('Monthly Returns Distribution', fontsize=12, fontweight='bold')
axes[0,2].set_xlabel('Return (%)')
axes[0,2].set_ylabel('Frequency')
axes[0,2].legend()
axes[0,2].grid(True, alpha=0.3)

# Q-Q plot for normality
stats.probplot(df_clean['Close'], dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot - Close Price', fontsize=12, fontweight='bold')
axes[1,0].grid(True, alpha=0.3)

# Q-Q plot for log-transformed
stats.probplot(np.log(df_clean['Close']), dist="norm", plot=axes[1,1])
axes[1,1].set_title('Q-Q Plot - Log Close Price', fontsize=12, fontweight='bold')
axes[1,1].grid(True, alpha=0.3)

# Box plot
df_clean[['Close', 'Open', 'High', 'Low']].boxplot(ax=axes[1,2])
axes[1,2].set_title('Box Plot of Price Variables', fontsize=12, fontweight='bold')
axes[1,2].set_ylabel('Price (INR)')
axes[1,2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests
print("\n📊 Normality Tests:")
print("-" * 40)
print(f"Skewness (Close): {df_clean['Close'].skew():.4f}")
print(f"Excess kurtosis (Close): {df_clean['Close'].kurtosis():.4f}")
jb_stat, jb_p = jarque_bera(df_clean['Close'])
print(f"Jarque-Bera test p-value: {jb_p:.6f}")
print(f"Interpretation: {'Not normal' if jb_p < 0.05 else 'Normal'} distribution")

##### 1. Why did you pick the specific chart?

Distribution analysis is crucial for understanding data characteristics and selecting appropriate transformations for ML models.

##### 2. What is/are the insight(s) found from the chart?

- Close price is right-skewed (positive skew)
- Log transformation achieves near-normality
- Returns show fat tails (leptokurtic)
- Significant outliers in all price variables
- JB test confirms non-normality (p < 0.05)
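
The effect of the log transform on skewness can be seen on a small synthetic example. The series below is evenly spaced in log space, an assumption chosen so that the transformed values are exactly symmetric. Real prices will only approximate this.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices": evenly spaced in log space.
prices = pd.Series(np.exp(np.linspace(0.0, 3.0, 50)))

raw_skew = prices.skew()
log_skew = np.log(prices).skew()

print(f"Skewness of raw values: {raw_skew:.3f}")
print(f"Skewness after log:     {log_skew:.3f}")
```

Comparing `df_clean['Close'].skew()` with `np.log(df_clean['Close']).skew()` in the same way gives a numeric counterpart to the two Q-Q plots.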

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Important for:
- Selecting appropriate loss functions
- Understanding risk (fat tails mean extreme events more likely)
- Applying transformations for better model performance

#### Chart - 6

In [None]:
# Chart - 6 visualization code - Train-Test Split
# Determine split point (80% train, 20% test)
split_idx = int(len(df_clean) * 0.8)
split_date = df_clean['Date'].iloc[split_idx]

# Create train and test sets
train_df = df_clean.iloc[:split_idx]
test_df = df_clean.iloc[split_idx:]

fig, axes = plt.subplots(2, 2, figsize=(18, 10))

# Train-test split visualization
axes[0,0].plot(train_df['Date'], train_df['Close'], label='Training Data', color='blue', linewidth=2)
axes[0,0].plot(test_df['Date'], test_df['Close'], label='Test Data', color='orange', linewidth=2)
axes[0,0].axvline(x=split_date, color='red', linestyle='--', linewidth=2, label=f'Split Date: {split_date.strftime("%b-%Y")}')
axes[0,0].set_title('Train-Test Split (80-20) - Time Series', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Date')
axes[0,0].set_ylabel('Close Price (INR)')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Distribution comparison - Train vs Test
axes[0,1].hist(train_df['Close'], bins=30, alpha=0.7, label='Train', color='blue', edgecolor='black')
axes[0,1].hist(test_df['Close'], bins=30, alpha=0.7, label='Test', color='orange', edgecolor='black')
axes[0,1].set_title('Distribution Comparison: Train vs Test', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Close Price (INR)')
axes[0,1].set_ylabel('Frequency')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Statistics comparison
stats_comparison = pd.DataFrame({
    'Metric': ['Count', 'Mean', 'Std', 'Min', '25%', '50%', '75%', 'Max'],
    'Train': [len(train_df), train_df['Close'].mean(), train_df['Close'].std(),
              train_df['Close'].min(), train_df['Close'].quantile(0.25),
              train_df['Close'].median(), train_df['Close'].quantile(0.75),
              train_df['Close'].max()],
    'Test': [len(test_df), test_df['Close'].mean(), test_df['Close'].std(),
             test_df['Close'].min(), test_df['Close'].quantile(0.25),
             test_df['Close'].median(), test_df['Close'].quantile(0.75),
             test_df['Close'].max()]
})

# Hide axes for table
axes[1,0].axis('tight')
axes[1,0].axis('off')
table = axes[1,0].table(cellText=stats_comparison.round(2).values,
                        colLabels=stats_comparison.columns,
                        cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1.2, 1.5)
axes[1,0].set_title('Dataset Statistics Comparison', fontsize=14, fontweight='bold')

# Rolling statistics comparison
train_rolling_mean = train_df['Close'].rolling(window=12).mean()
test_rolling_mean = test_df['Close'].rolling(window=12).mean()

axes[1,1].plot(train_df['Date'][11:], train_rolling_mean[11:], color='blue', label='Train 12-MA')
axes[1,1].plot(test_df['Date'], test_rolling_mean, color='orange', label='Test 12-MA')
axes[1,1].set_title('Rolling Mean Comparison (12-month)', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Date')
axes[1,1].set_ylabel('Price (INR)')
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Train-Test Split Summary:")
print(f"Split Date: {split_date.strftime('%B %Y')}")
print(f"Training set: {len(train_df)} samples ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f"Test set: {len(test_df)} samples ({len(test_df)/len(df_clean)*100:.1f}%)")


##### 1. Why did you pick the specific chart?

Visualizing the train-test split matters for time series: it confirms that the split is chronological, so no future data leaks into training, and it shows how the distributions of the two sets differ.
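
The leakage check can be sketched on synthetic monthly data as follows. The column names `Date` and `Close` follow this notebook; the helper `chronological_split` and the generated series are assumptions for illustration only, not the split code used above.

```python
import numpy as np
import pandas as pd

def chronological_split(df, date_col="Date", train_frac=0.8):
    """Hypothetical helper: earliest rows go to train, latest rows to test."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    split_idx = int(len(ordered) * train_frac)
    return ordered.iloc[:split_idx], ordered.iloc[split_idx:]

# Synthetic monthly series (illustrative data, not the Yes Bank dataset)
rng = np.random.default_rng(0)
dates = pd.date_range("2005-07-01", periods=100, freq="MS")
demo_df = pd.DataFrame({
    "Date": dates,
    "Close": 100 + rng.normal(0, 5, size=100).cumsum(),
})

demo_train, demo_test = chronological_split(demo_df)

# Leakage check: every training date must precede every test date
assert demo_train["Date"].max() < demo_test["Date"].min()
print(len(demo_train), len(demo_test))
```

Unlike a random shuffle, this keeps the most recent months entirely in the test set, which is what the first panel of the chart is meant to confirm.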

##### 2. What is/are the insight(s) found from the chart?

- Test set contains the recent high-volatility period (2018-2020)
- Train and test distributions are different (non-stationarity)
- Test set includes the dramatic price drop
- This split tests the model's ability to handle regime changes
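
One rough way to put numbers on the distribution difference is to compare summary statistics and the relative shift in mean and spread. The sketch below uses randomly generated stand-ins for `train_df['Close']` and `test_df['Close']`; the sample sizes and parameters are assumptions for illustration and are not results from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-ins for train_df['Close'] and test_df['Close']
train_close = pd.Series(rng.normal(150, 40, size=148), name="Train")
test_close = pd.Series(rng.normal(220, 110, size=37), name="Test")

# Side-by-side summary statistics (count, mean, std, quartiles, min, max)
summary = pd.concat([train_close.describe(), test_close.describe()], axis=1)

# Relative shift of the test set against the training set
mean_shift = (test_close.mean() - train_close.mean()) / train_close.mean()
std_ratio = test_close.std() / train_close.std()

print(summary.round(2))
print(f"Mean shift: {mean_shift:+.1%}, std ratio: {std_ratio:.2f}")
```

A standard-deviation ratio well above 1 would indicate that the test period is more volatile than the training period, which is the situation described in the bullets above.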

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights are critical for:
- Understanding model generalization to new market conditions
- Evaluating model robustness during crisis periods
- Setting realistic performance expectations

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***