# **Project Name**    - Predicting Yes Bank's Monthly Stock Closing Price



#### **Project Type**    - Regression
#### **Contribution**    - Individual
#### **Name** - Sagar Zujam


# **Project Summary -**

### **This project predicts the monthly closing stock price of Yes Bank from historical stock data. Yes Bank, one of India's prominent private banks, saw major fluctuations in its stock price, particularly after 2018, due to financial instability and fraud cases. The dataset contains monthly open, high, low, and close prices, which supports regression modeling of the closing price. The goal is to understand trends, perform in-depth exploratory data analysis (EDA), and apply machine learning techniques to predict the closing price accurately, providing insight into the stock's behavior.**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


###**To build a predictive model that estimates Yes Bank's monthly stock closing price using historical stock price data, and to analyze how trends and volatility can be captured using time series and regression techniques.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
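# Load Dataset
# In Google Colab, upload the CSV interactively; elsewhere, place
# data_YesBank_StockPrices.csv in the working directory before running.
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    print("Not running in Colab; reading the CSV from the working directory.")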


In [None]:
df = pd.read_csv("data_YesBank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
print("\nFirst 5 Rows:")
print(df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("\nNumber of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
print("\nData Types:")
print(df.dtypes)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("\nChecking for duplicate records:")
print(df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("\nNull Values:")
print(df.isnull().sum())

In [None]:
# Visualizing Missing Values with Heatmap
plt.figure(figsize=(10, 2))
sns.heatmap(df.isnull(), cbar=False, cmap="Reds", yticklabels=False)
plt.title("Missing Value Heatmap")
plt.show()

### What did you know about your dataset?

The dataset consists of monthly stock price data for Yes Bank spanning from July 2005 to November 2020, with 185 observations and 5 key columns:

1. Date – Contains month and year of the record.
2. Open – Stock price at the beginning of the month.
3. High – Highest stock price during the month.
4. Low – Lowest stock price during the month.
5. Close – Stock price at the end of the month (Target variable)

### Convert Date to datetime and sort

In [None]:
print("\nConverting 'Date' to datetime format...")
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df = df.sort_values('Date')
df.reset_index(drop=True, inplace=True)

#### 1. Add time-based features (month and year) for better analysis

In [None]:
print("\nAdding 'Month' column...")
df['Month'] = df['Date'].dt.month

#### 2. Derive the year from the parsed Date column

In [None]:
print("\nExtracting 'Year' from the parsed dates...")
# The '%b-%y' format already carries the year, so read it directly; a positional
# i//12 + 2005 counter would mislabel rows because the data starts in July 2005.
df['Year'] = df['Date'].dt.year

#### 3. Rearrange columns

In [None]:
df = df[['Date', 'Year', 'Month', 'Open', 'High', 'Low', 'Close']]

# Show final structure
print("\nUpdated DataFrame Head:")
print(df.head())

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("\nColumn Names:")
print(df.columns)

In [None]:
# Dataset Describe
print("\nDescriptive Statistics of Dataset:")
print(df.describe())

# Data Types confirmation
print("\nData Types after processing:")
print(df.dtypes)

### Variables Description

1. Strong correlation observed between High and Close, and Low and Close.
2. Open, High, and Low may serve as good predictors for the Close value.
3. No duplicate records and all dates are unique after transformation.
4. Date was only used for temporal ordering and visualization; it was dropped before modeling due to its non-numeric format.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\nAre Dates Unique?:", df['Date'].is_unique)

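# Correlation Check (numeric columns only; the datetime 'Date' column is excluded
# so the call behaves the same across pandas versions)
corr_matrix = df.select_dtypes(include='number').corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Heatmap for correlation
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')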
plt.title('Correlation Matrix')
plt.show()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Checking for missing values again
print("\nChecking for missing values before imputation:")
print(df.isnull().sum())

# Check for outliers using IQR method
def detect_outliers(column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers

outlier_summary = {}
for col in ['Open', 'High', 'Low', 'Close']:
    outliers = detect_outliers(col)
    outlier_summary[col] = len(outliers)
    print(f"Outliers in {col}: {len(outliers)}")

# Optionally handle outliers: Capping method used here (optional, for demonstration)
for col in ['Open', 'High', 'Low', 'Close']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower, lower, df[col])
    df[col] = np.where(df[col] > upper, upper, df[col])

print("\nShape after outlier capping:", df.shape)

# Check again for missing values
print("\nMissing values after handling:")
print(df.isnull().sum())

# Final check of dataset
df.reset_index(drop=True, inplace=True)
print("\nData Wrangling complete.")

### What all manipulations have you done and insights you found?

1. **Missing Values:** No missing values were found in the dataset.

2. **Outlier Detection:** Used the IQR method to detect outliers in Open, High, Low, and Close.

3. **Outlier Handling:** Applied capping to limit the influence of extreme values.

4. **Outcome:** Data became clean, with reduced skewness, making it suitable for modeling.
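
The IQR capping summarized above can be illustrated on a small made-up series. This is a minimal sketch rather than the notebook's exact code: the helper name `cap_with_iqr` and the sample prices are assumptions chosen for illustration.

```python
import pandas as pd

def cap_with_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical prices with one extreme value at the end
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
capped = cap_with_iqr(prices)

print("Before:", prices.tolist())
print("After: ", capped.tolist())
```

Capping keeps every row and only pulls the extreme value back to the upper fence, which is why the DataFrame shape is unchanged after this step.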

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Univariate Analysis: Distribution of each stock price component

In [None]:
# Chart - 1 visualization code
features = ['Open', 'High', 'Low', 'Close']
for feature in features:
    plt.figure(figsize=(8,4))
    sns.histplot(df[feature], kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {feature} Price')
    plt.xlabel(f'{feature} Price')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

    print(f"Why this chart: To observe the distribution and skewness of {feature} prices.")
    print("Insights: Outliers and skewed patterns observed due to major price fluctuations over years.")
    print("Business Impact: Helps in normalization decisions for modeling, detecting non-stationary behavior.")


##### 1. Why did you pick the specific chart?

To observe the distribution and skewness of the Open, High, Low, and Close prices.

##### 2. What is/are the insight(s) found from the chart?

Insights: Outliers and skewed patterns observed due to major price fluctuations over years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Helps in normalization decisions for modeling, detecting non-stationary behavior.

Yes Bank’s stock showed abnormal peaks and drops around 2018–2019, reflecting periods of financial instability and fraud.
By identifying these trends early through visualizations, we were able to design a more robust prediction pipeline, aiding better financial decision-making and reducing risk.

#### Chart - 2 Bivariate Analysis: Relation of Open, High, Low with Close

In [None]:
# Chart - 2 visualization code
for feature in ['Open', 'High', 'Low']:
    plt.figure(figsize=(6,4))
    sns.scatterplot(x=df[feature], y=df['Close'])
    plt.title(f'{feature} vs Close')
    plt.xlabel(feature)
    plt.ylabel('Close')
    plt.grid(True)
    plt.show()

    print(f"Why this chart: To analyze the relationship between {feature} and Close price.")
    print("Insights: Strong linear relationship evident, especially for High and Low prices.")
    print("Business Impact: These features are valuable predictors for stock closing price.")

##### 1. Why did you pick the specific chart?

To analyze the relationship between each of Open, High, and Low and the Close price.

##### 2. What is/are the insight(s) found from the chart?

Insights: Strong linear relationship evident, especially for High and Low prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: These features are valuable predictors for the stock closing price.

#### Chart - 3: Multivariate Analysis: Time-based trends

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(15,5))
sns.lineplot(data=df, x='Year', y='Close', label='Close')
sns.lineplot(data=df, x='Year', y='High', label='High')
sns.lineplot(data=df, x='Year', y='Low', label='Low')
plt.title('Trend of Stock Prices Over Time')
plt.grid(True)
plt.legend()
plt.show()

print("Why this chart: To understand stock price evolution over the years.")
print("Insights: Significant spike and drop observed around 2018-2019.")
print("Business Impact: Helps explain volatility periods and link them to external events like financial fraud.")


##### 1. Why did you pick the specific chart?

To understand stock price evolution over the years.

##### 2. What is/are the insight(s) found from the chart?

Insights: Significant spike and drop observed around 2018-2019.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Helps explain volatility periods and link them to external events like financial fraud.

1. The sharp decline after 2018 reflects real-world financial fraud at Yes Bank.

2. This affects investor trust and must be accounted for in modeling to avoid misleading projections.

These trends guide decision-makers in assessing when and why drastic price changes occurred, directly impacting business strategies.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

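1. The mean Close price before 2018 differs from the mean Close price from 2018 onward.

2. High and Close prices are linearly correlated.

3. The mean Close price differs across calendar months.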

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): The mean Close price before 2018 is equal to the mean Close price from 2018 onward.

Alternate hypothesis (H1): The two means differ.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis 1: Is the mean 'Close' price before 2018 significantly different from after 2018?
before_2018 = df[df['Year'] < 2018]['Close']
after_2018 = df[df['Year'] >= 2018]['Close']

# Perform t-test
t_stat, p_value = stats.ttest_ind(before_2018, after_2018, equal_var=False)
print("\nHypothesis Test 1: Mean comparison of Close price before and after 2018")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion: Statistically significant difference in Close prices before and after 2018.")
else:
    print("Conclusion: No significant difference in Close prices before and after 2018.")

##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample t-test (Welch’s t-test):

1. stats.ttest_ind() is used to compare the means of two independent samples (in this case, Close prices before and after 2018).

2. The parameter equal_var=False indicates we do not assume equal population variance, which makes it a Welch's t-test, a more robust version of the t-test.



##### Why did you choose the specific statistical test?

1. The test compares the mean of a numeric variable (Close) across two independent groups (pre-2018 vs. 2018 onward).

2. The two groups differ in sample size and variance, so equal variances are not assumed.
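
To make the Welch's t-test concrete, the sketch below computes the statistic by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The two samples are hypothetical values, not Yes Bank prices; the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with unequal sizes and variances
a = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 12.0])
b = np.array([20.0, 35.0, 15.0, 40.0])

# Welch's t-statistic: difference in means over the unpooled standard error
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / a.size + var_b / b.size)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (var_a / a.size + var_b / b.size) ** 2 / (
    (var_a / a.size) ** 2 / (a.size - 1) + (var_b / b.size) ** 2 / (b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The manual and SciPy values should agree, which shows that `equal_var=False` only changes the standard error and the degrees of freedom, not the overall logic of the test.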

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): There is no linear correlation between High and Close prices (ρ = 0).

Alternate hypothesis (H1): High and Close prices are linearly correlated (ρ ≠ 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
correlation, pval = stats.pearsonr(df['High'], df['Close'])
print("\nHypothesis Test 2: Correlation between High and Close")
print(f"Correlation Coefficient: {correlation:.4f}, P-Value: {pval:.4f}")

if pval < 0.05:
    print("Conclusion: Statistically significant correlation between High and Close.")
else:
    print("Conclusion: No statistically significant correlation.")

##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Coefficient Test.

1. stats.pearsonr() computes the linear correlation coefficient between two continuous variables — in this case, High and Close prices.

2. It also returns a p-value, which tests the null hypothesis that the correlation is zero (i.e., no linear relationship).

##### Why did you choose the specific statistical test?

This test is appropriate because:

1. Both High and Close are continuous variables. The univariate analysis showed some skewness, so the normality assumption behind the p-value holds only approximately.

2. It measures both the strength and the significance of a linear relationship.
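
Because price distributions are often skewed, a rank-based check can complement the Pearson test. The sketch below is an illustration on synthetic data, not a result from the Yes Bank dataset: the `high` and `close` arrays are hypothetical, and Spearman's rank correlation is an additional technique not used in the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed "High" prices and a noisy, closely related "Close"
high = rng.lognormal(mean=4.0, sigma=0.8, size=100)
close = 0.95 * high + rng.normal(scale=2.0, size=100)

# Pearson measures linear association; Spearman measures monotonic association on ranks
pearson_r, pearson_p = stats.pearsonr(high, close)
spearman_rho, spearman_p = stats.spearmanr(high, close)

print(f"Pearson r:    {pearson_r:.4f} (p={pearson_p:.2e})")
print(f"Spearman rho: {spearman_rho:.4f} (p={spearman_p:.2e})")
```

When both coefficients are high, the conclusion does not hinge on the normality assumption.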

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): The mean Close price is the same in every calendar month.

Alternate hypothesis (H1): At least one month's mean Close price differs from the others.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis 3: Is there a significant difference in average Close prices across months?
from scipy.stats import f_oneway

# Group Close prices by month
monthly_groups = [group["Close"].values for name, group in df.groupby("Month")]

# Perform One-Way ANOVA
f_stat, p_val = f_oneway(*monthly_groups)

print("\nHypothesis Test 3: Difference in average Close prices across months (One-Way ANOVA)")
print(f"F-Statistic: {f_stat:.4f}, P-Value: {p_val:.4f}")

if p_val < 0.05:
    print("Conclusion: Significant variation in Close prices across different months.")
else:
    print("Conclusion: No significant variation in Close prices across different months.")


##### Which statistical test have you done to obtain P-Value?

One-Way ANOVA (Analysis of Variance)

1. **Test Used:** f_oneway (One-Way ANOVA)

2. **Null Hypothesis:** All monthly means of Close prices are equal.

3. **Alternative Hypothesis:** At least one month's mean Close price is different.

4. **Reason:** Appropriate for comparing means across more than two independent groups (12 months here).

##### Why did you choose the specific statistical test?

1. The test compares mean Close prices across multiple independent groups (in this case, the 12 months).

2. We grouped the Close prices by Month and used scipy.stats.f_oneway() to check if at least one month's mean is significantly different from others.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Answer:

No separate transformation (such as a log transform) was applied.
1. The target variable Close and the predictors Open, High, and Low are all continuous prices on the same scale.

2. Visual inspection of their distributions showed mild rather than extreme skewness, so a transformation was not strictly necessary.

3. Instead, skewness and outliers were handled with IQR-based capping, which limited the effect of extreme values.

4. StandardScaler was then used to put all features on a comparable scale.

### 2. Data Scaling

In [None]:
# Scaling your data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


##### Which method have you used to scale you data and why?

Answer:

I used StandardScaler from sklearn.preprocessing.
1. StandardScaler standardizes features by removing the mean and scaling to unit variance.

2. Each feature therefore has a mean of 0 and a standard deviation of 1. This mainly benefits scale-sensitive models (for example regularized or distance-based ones); tree-based models such as Random Forest and XGBoost are unaffected by feature scaling, and ordinary Linear Regression predictions do not change with it.

3. Scaling puts all features on the same scale, so coefficient magnitudes are comparable and no feature dominates purely because of its units.
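
The standardization described above can be checked by hand on a small matrix. This is a minimal sketch with hypothetical values standing in for Open, High, and Low. It fits the scaler on the training rows only, a common way to keep test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: columns stand in for Open, High, Low
X_train = np.array([[10.0, 12.0, 9.0],
                    [20.0, 23.0, 18.0],
                    [30.0, 36.0, 27.0],
                    [40.0, 45.0, 38.0]])
X_test = np.array([[25.0, 28.0, 22.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print("Matches manual formula:", np.allclose(manual, X_train_scaled))
print("Column means:", X_train_scaled.mean(axis=0).round(6))
print("Column stds: ", X_train_scaled.std(axis=0).round(6))
```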

### 3. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Since the dataset is small, focused, and well-engineered, dimensionality reduction techniques are not required for this regression task.

### 4. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# Drop 'Date' since it's not a numeric feature
df_model = df.drop(['Date'], axis=1)

# Define X and y
X = df_model.drop('Close', axis=1)
y = df_model['Close']

# Train-test split first, so the test rows stay unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("\nFeature Engineering and Preprocessing complete.")
print("Training Feature Shape:", X_train.shape)
print("Testing Feature Shape:", X_test.shape)

##### What data splitting ratio have you used and why?

I used an 80:20 train-test split ratio.

1. 80% of the data (about 148 of 185 rows) was used to train the models.

2. 20% of the data (about 37 rows) was held out to evaluate how well the models generalize to unseen data.

3. 80:20 is a common default that leaves most of the data for training while keeping enough rows for a stable test estimate.

4. With only 185 rows, a larger test share would leave too few training samples, and a smaller one would make the test metrics noisy.
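
Because the rows are monthly observations in time order, a shuffled split lets the model train on months that come after some of the test months. One alternative, sketched below under stated assumptions, holds out the most recent 20% of months instead. The `demo` frame and its random prices are hypothetical stand-ins for the sorted dataset, and this is not the split used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame: 185 month-start dates beginning July 2005
dates = pd.date_range("2005-07-01", periods=185, freq="MS")
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Date": dates, "Close": rng.uniform(10, 400, size=185)})

# Hold out the most recent 20% of months instead of a random 20%
split_idx = int(len(demo) * 0.8)
train, test = demo.iloc[:split_idx], demo.iloc[split_idx:]

print("Train rows:", len(train), "| Test rows:", len(test))
print("Last training month:", train["Date"].max().date())
print("First test month:   ", test["Date"].min().date())
```

With this layout every test month lies strictly after every training month, which is closer to how a forecast would be used in practice.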

## ***7. ML Model Implementation***

### ML Model - 1: Linear Regression

In [None]:
# ML Model - 1 : Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Fit the Algorithm
model1 = LinearRegression()
model1.fit(X_train, y_train)

# Predict on the model
y_pred1 = model1.predict(X_test)

print("\nModel 1 - Linear Regression")
print("MAE:", mean_absolute_error(y_test, y_pred1))
print("MSE:", mean_squared_error(y_test, y_pred1))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred1)))
print("R2 Score:", r2_score(y_test, y_pred1))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
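import matplotlib.pyplot as plt

# Linear Regression metrics, computed from the predictions above rather than hard-coded
# (the original run reported MAE 5.0563, MSE 71.5326, RMSE 8.4577, R2 0.9914)
lr_mse = mean_squared_error(y_test, y_pred1)
lr_metrics = {
    'MAE': mean_absolute_error(y_test, y_pred1),
    'MSE': lr_mse,
    'RMSE': np.sqrt(lr_mse),
    'R2': r2_score(y_test, y_pred1)
}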

# Plotting
plt.figure(figsize=(8, 5))
plt.bar(lr_metrics.keys(), lr_metrics.values(), color='skyblue')
plt.title('Linear Regression Evaluation Metrics')
plt.ylabel('Score')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
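# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Linear Regression Model
lr = LinearRegression()

# Define parameter grid for GridSearchCV
param_grid = {
    'fit_intercept': [True, False]
}

# Fit the Algorithm: GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=lr, param_grid=param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

# Best Estimator
best_lr = grid_search.best_estimator_
print("Best Parameters from GridSearchCV:", grid_search.best_params_)
print("Best cross-validated R2:", grid_search.best_score_)

# Predict on the model and evaluate on the held-out test set
y_pred1_tuned = best_lr.predict(X_test)
print("Tuned MAE:", mean_absolute_error(y_test, y_pred1_tuned))
print("Tuned RMSE:", np.sqrt(mean_squared_error(y_test, y_pred1_tuned)))
print("Tuned R2 Score:", r2_score(y_test, y_pred1_tuned))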

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV as the hyperparameter optimization technique.

1. Exhaustive Search: It evaluates all possible combinations of the given hyperparameter values, ensuring the best set is selected.

2. Simplicity: Easy to implement and interpret, especially when the parameter space is small (as in Linear Regression).

3. Cross-validation: It uses cross-validation internally, which helps in reducing overfitting and gives a more generalized model performance.

4. Best suited: For models with fewer hyperparameters like Linear Regression, GridSearchCV is efficient and effective.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There was no significant improvement in the metrics after tuning. Linear Regression exposes very few hyperparameters (only fit_intercept was searched here), so cross-validation mainly confirmed the default configuration.

### ML Model - 2: RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor(random_state=42)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)

print("\nModel 2 - Random Forest Regressor")
print("MAE:", mean_absolute_error(y_test, y_pred2))
print("MSE:", mean_squared_error(y_test, y_pred2))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred2)))
print("R2 Score:", r2_score(y_test, y_pred2))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart for the default Random Forest
rf_base_mse = mean_squared_error(y_test, y_pred2)
rf_base_metrics = {
    'MAE': mean_absolute_error(y_test, y_pred2),
    'MSE': rf_base_mse,
    'RMSE': np.sqrt(rf_base_mse),
    'R2': r2_score(y_test, y_pred2)
}

plt.figure(figsize=(8, 5))
plt.bar(rf_base_metrics.keys(), rf_base_metrics.values(), color='lightgreen')
plt.title('Random Forest Evaluation Metrics')
plt.ylabel('Score')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Define the model
rf_model = RandomForestRegressor(random_state=42)

# Define hyperparameter grid
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=10,  # Number of combinations to try
    cv=5,
    scoring='neg_mean_squared_error',
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit on training data
random_search.fit(X_train, y_train)

# Best model from RandomizedSearch
best_rf_model = random_search.best_estimator_

# Predict on test set
rf_y_pred = best_rf_model.predict(X_test)

# Evaluation
rf_mae = mean_absolute_error(y_test, rf_y_pred)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_y_pred)

# Print evaluation
print("Random Forest with RandomizedSearchCV:")
print(f"MAE: {rf_mae}")
print(f"MSE: {rf_mse}")
print(f"RMSE: {rf_rmse}")
print(f"R² Score: {rf_r2}")


##### Which hyperparameter optimization technique have you used and why?

I have used RandomizedSearchCV for hyperparameter optimization of the Random Forest Regressor. This technique was chosen because it is computationally more efficient than GridSearchCV, especially when the hyperparameter space is large. It randomly selects a fixed number of parameter combinations from the specified grid, reducing the search time while still achieving good performance. It also helps prevent overfitting by using cross-validation.
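
The efficiency argument can be quantified by counting candidates. The sketch below reuses the Random Forest parameter values from the tuning cell and compares the full grid with the 10 sampled settings. `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that enumerate and sample such spaces; no model is fitted here.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination an exhaustive grid search would have to fit
total = len(ParameterGrid(param_dist))

# The settings a randomized search with n_iter=10 would draw
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print(f"Grid search candidates:       {total}")
print(f"Randomized search candidates: {len(sampled)}")
print(f"Model fits with cv=5:         {total * 5} vs {len(sampled) * 5}")
```

The trade-off is that a randomized search evaluates only a fraction of the space, so the best setting it finds is not guaranteed to be the best in the full grid.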

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying RandomizedSearchCV, the model's performance improved compared to the default Random Forest Regressor.

MAE reduced by ~25%, indicating more accurate point predictions.

RMSE and MSE also dropped significantly, showing fewer extreme errors.

R² Score increased from 0.9788 to 0.9873, reflecting better overall model fit.



#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

For the Random Forest Regressor, the evaluation metrics used were:

1. MAE (Mean Absolute Error): Indicates the average error in predicted prices. In business terms, it reflects the average monthly deviation between actual and predicted stock prices, helping stakeholders understand typical forecasting errors.

2. RMSE (Root Mean Square Error): Penalizes larger errors more than MAE, making it useful to identify high-risk months where stock behavior was volatile. This helps risk management teams plan better.

3. R² Score: Shows how well the model explains the variance in stock prices. A high R² value indicates strong model performance, making it reliable for forecasting stock movements.

**Business Impact**: These metrics help financial analysts trust the model’s predictions, plan better investments, and understand the consistency and reliability of the forecasts.
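
To show what each metric measures, the sketch below computes MAE, RMSE, and R² directly from their definitions with NumPy. The actual and predicted prices are hypothetical values chosen for easy arithmetic, not model output.

```python
import numpy as np

# Hypothetical actual and predicted closing prices (in rupees)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 230.0])

errors = y_true - y_pred

# MAE: average size of the miss, in rupees
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared miss; large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# R²: share of the variance in y_true explained by the predictions
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.4f}")
# prints: MAE=11.25 RMSE=12.50 R2=0.9500
```

RMSE exceeds MAE here because the single 20-rupee miss dominates the squared errors, which is the behaviour that makes RMSE useful for flagging volatile months.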

### ML Model - 3: XGBRegressor

In [None]:
# ML Model - 3 Implementation
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Instantiate and fit the model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# Predict
xgb_preds = xgb_model.predict(X_test)

# Evaluate
xgb_mae = mean_absolute_error(y_test, xgb_preds)
xgb_mse = mean_squared_error(y_test, xgb_preds)
xgb_rmse = np.sqrt(xgb_mse)
xgb_r2 = r2_score(y_test, xgb_preds)

print("XGBoost Regressor Performance:")
print(f"MAE: {xgb_mae}")
print(f"MSE: {xgb_mse}")
print(f"RMSE: {xgb_rmse}")
print(f"R2 Score: {xgb_r2}")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# ----- Linear Regression -----
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

lr_mae = mean_absolute_error(y_test, lr_pred)
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_pred))
lr_r2 = r2_score(y_test, lr_pred)

# ----- Random Forest Regressor -----
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

rf_mae = mean_absolute_error(y_test, rf_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_r2 = r2_score(y_test, rf_pred)

# ----- XGBoost Regressor -----
xgb_model = XGBRegressor(random_state=42, verbosity=0)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

xgb_mae = mean_absolute_error(y_test, xgb_pred)
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
xgb_r2 = r2_score(y_test, xgb_pred)

# ----- Visualizing Evaluation Metric Score Chart -----
models = ['Linear Regression', 'Random Forest', 'XGBoost']
mae_scores = [lr_mae, rf_mae, xgb_mae]
rmse_scores = [lr_rmse, rf_rmse, xgb_rmse]
r2_scores = [lr_r2, rf_r2, xgb_r2]

plt.figure(figsize=(15, 4))

# MAE
plt.subplot(1, 3, 1)
plt.bar(models, mae_scores, color='orange')
plt.title('MAE Comparison')
plt.ylabel('MAE')

# RMSE
plt.subplot(1, 3, 2)
plt.bar(models, rmse_scores, color='green')
plt.title('RMSE Comparison')
plt.ylabel('RMSE')

# R² Score
plt.subplot(1, 3, 3)
plt.bar(models, r2_scores, color='skyblue')
plt.title('R² Score Comparison')
plt.ylabel('R² Score')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import RandomizedSearchCV

# Define parameter grid
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Create the model
xgb_base = XGBRegressor(random_state=42)

# RandomizedSearchCV
xgb_random_search = RandomizedSearchCV(
    estimator=xgb_base,
    param_distributions=xgb_param_grid,
    n_iter=20,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

xgb_random_search.fit(X_train, y_train)

# Best model
best_xgb_model = xgb_random_search.best_estimator_

# Predictions
xgb_best_preds = best_xgb_model.predict(X_test)

# Evaluation
xgb_best_mae = mean_absolute_error(y_test, xgb_best_preds)
xgb_best_mse = mean_squared_error(y_test, xgb_best_preds)
xgb_best_rmse = np.sqrt(xgb_best_mse)
xgb_best_r2 = r2_score(y_test, xgb_best_preds)

print("Tuned XGBoost Regressor Performance:")
print(f"MAE: {xgb_best_mae}")
print(f"MSE: {xgb_best_mse}")
print(f"RMSE: {xgb_best_rmse}")
print(f"R2 Score: {xgb_best_r2}")


##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV for hyperparameter optimization. This technique is more efficient than GridSearchCV when dealing with a large hyperparameter space. It samples a fixed number of parameter settings from the specified distributions, saving time and computational resources while still providing good results. It is ideal when we want a quicker yet effective tuning of model performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying RandomizedSearchCV to the XGBoost model, I observed a noticeable improvement in performance metrics:

1. MAE reduced from 4.89 to 4.53

2. RMSE reduced from 7.92 to 7.38

3. R2 Score improved from 0.9921 to 0.9932

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Evaluation Metrics Used:

1. **MAE** gives a clear interpretation of the average magnitude of prediction errors in real business units (rupees in this case), helping to understand the expected deviation from actual stock prices.

2. **RMSE** penalizes larger errors more than MAE, making it suitable for stock prices where large deviations could significantly impact investment decisions.

3. **R² Score** indicates how well the independent features explain the variation in the closing stock price. A higher R² directly relates to better model accuracy, which improves investor confidence and decision-making.

These metrics collectively ensure the model's precision, reliability, and generalizability, which are essential when making financial predictions where every rupee matters.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

 **XGBoost Regressor** with RandomizedSearchCV (Tuned Version)

1. Achieved the lowest MAE and RMSE among all models.

2. Gave the highest R² Score (0.9932), indicating excellent prediction accuracy.

3. XGBoost can model non-linearity and feature interactions that Linear Regression cannot, and its boosting with built-in regularization often gives it an edge over Random Forest.

4. Hyperparameter tuning via RandomizedSearchCV further improved its test-set performance (R² from 0.9921 to 0.9932).

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

1. XGBoost is an ensemble-based gradient boosting algorithm that builds multiple decision trees in sequence.

2. It optimizes performance by focusing on errors made by previous models, improving overall accuracy.

3. It has built-in regularization, making it less prone to overfitting.

Understanding feature contributions helps stakeholders and investors trust the model and make informed decisions. It also aids in identifying which stock attributes drive prices the most, leading to better investment strategies.
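
One way to obtain feature contributions is permutation importance, sketched below under stated assumptions. To stay self-contained, the example uses scikit-learn's `GradientBoostingRegressor` as a stand-in for the tuned XGBoost model, and the `High` and `Noise` features are synthetic. The numbers it prints say nothing about the Yes Bank data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 200

# Hypothetical features: 'High' drives the target, 'Noise' is unrelated to it
high = rng.uniform(10, 400, size=n)
noise = rng.normal(size=n)
X = np.column_stack([high, noise])
y = 0.97 * high + rng.normal(scale=3.0, size=n)

# Stand-in boosted-tree model (assumption: used here instead of XGBRegressor)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Shuffle one column at a time and record how much the R² score drops
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
for name, score in zip(['High', 'Noise'], result.importances_mean):
    print(f"{name}: mean importance = {score:.4f}")
```

Since `XGBRegressor` follows the scikit-learn estimator interface, the same call should work on the notebook's objects, for example `permutation_importance(best_xgb_model, X_test, y_test, n_repeats=10, random_state=42)`.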

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save the best model
filename = 'best_xgb_model.pkl'
joblib.dump(best_xgb_model, filename)
print(f"\nModel saved as {filename}")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load(filename)
sample_prediction = loaded_model.predict(X_test[:5])
print("\nSample Predictions from loaded model:", sample_prediction)

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The tuned XGBoost Regressor gave the best results on the test set (MAE 4.53, RMSE 7.38, R² 0.9932), followed by
Linear Regression (R² 0.9914) and the tuned Random Forest Regressor (R² 0.9873). Linear Regression outperforming
Random Forest suggests that the relationship between Open, High, Low, and Close is largely linear. The tuned XGBoost
model was saved and validated on sample data successfully, making it deployment-ready. Future work may include
time-series-specific models like ARIMA or LSTM for finer trend prediction.

In [None]:
print("\nModel creation, evaluation, and deployment preparation complete.")

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***