# **Project Name**  -  Stock-Market-ML-Model



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member**     - Vrushabh Dhakad



# **Project Summary -**

This project focuses on building a machine learning model to predict stock prices of Yes Bank using historical data. The goal was to develop a reliable, accurate, and explainable model that could assist investors and analysts in making informed financial decisions.

The dataset underwent several preprocessing steps, including handling missing values, date formatting, feature extraction (like Month and Year), and feature scaling using StandardScaler. These transformations ensured the data was in the right shape and scale for effective model training.

We began by exploring multiple regression models: Linear Regression, Random Forest Regressor, and AdaBoost Regressor. Each model was trained on 80% of the scaled dataset while 20% was reserved for testing. Their performances were measured using three key evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² Score. These metrics were chosen for their relevance in financial forecasting—MAE gives a direct interpretation in rupees, MSE penalizes larger errors more, and R² indicates how well the model explains the variance in the stock price.
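As a minimal illustration of how the three metrics behave (not the notebook's actual results), the sketch below computes them with scikit-learn on made-up prices. The arrays `y_true` and `y_pred` are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical closing prices and predictions, for illustration only
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 129.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|, in the same unit as the price
mse = mean_squared_error(y_true, y_pred)   # mean of error**2, so larger misses dominate
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE={mae:.2f}, MSE={mse:.2f}, R2={r2:.3f}")
```

Here the absolute errors are 2, 2, 3 and 1, so the MAE is 2.0, while the single error of 3 contributes half of the squared-error total, which is why MSE reacts more strongly to large misses.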

Initially, Random Forest and AdaBoost outperformed Linear Regression. To further boost accuracy, we applied hyperparameter tuning using GridSearchCV for both Random Forest and AdaBoost. This process helped us identify the best combination of parameters (e.g., number of estimators, learning rate) by performing cross-validation. The tuned AdaBoost model showed the best results, achieving the lowest MAE and MSE and the highest R² Score on the test data.
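The tuning step can be sketched roughly as follows. This is an assumption-laden example: the data is synthetic (from `make_regression`), and the grid values, `cv=3`, and the scoring choice are illustrative rather than the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled feature matrix and the Close target
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative grid; the values actually searched in the project may differ
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}

search = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation on the training split
    scoring="neg_mean_absolute_error",   # higher (less negative) is better
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test R2:", search.best_estimator_.score(X_test, y_test))
```

The test split is touched only once, after the search has finished, so the reported score is not influenced by the parameter selection.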

We also used model explainability techniques by accessing the AdaBoost Regressor’s feature importance scores. This revealed which input features—such as Open, High, Low, and Previous Close prices—had the most influence on predictions. This insight adds transparency to the model and allows stakeholders to better understand what drives stock movements.
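A minimal sketch of reading feature importances from a fitted AdaBoost model is shown below. The data is synthetic, so the column names are only labels borrowed from this project and the resulting ranking carries no meaning for Yes Bank.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Labels borrowed from the project; the underlying data here is synthetic
feature_names = ["Open", "High", "Low", "Prev_Close"]
X, y = make_regression(n_samples=100, n_features=4, random_state=0)

model = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is a weighted average over the boosted trees
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```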

The best model (Tuned AdaBoost Regressor) was saved using joblib into a pickle file format, making it ready for deployment. To validate the save/load process, we reloaded the model and performed a sanity check by predicting on the test data again, confirming identical performance metrics.
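The save/load sanity check might look like the sketch below. It assumes a model trained on synthetic data and a hypothetical file name `model.pkl` written to a temporary directory; the project's own file name and data are not shown here.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data standing in for the real training set
X, y = make_regression(n_samples=80, n_features=4, random_state=0)
model = AdaBoostRegressor(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    joblib.dump(model, path)                   # serialize the fitted model
    reloaded = joblib.load(path)               # read it back from disk

# The reloaded model should reproduce the original predictions
same = np.allclose(model.predict(X), reloaded.predict(X))
print("Predictions identical after reload:", same)
```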

In conclusion, this end-to-end machine learning pipeline successfully demonstrated a data-driven approach to predicting Yes Bank’s stock price. From preprocessing to model evaluation and deployment, every step followed industry-standard practices. The tuned AdaBoost Regressor provides a reliable forecasting tool with strong business implications—helping users make more confident and accurate investment decisions.

# **GitHub Link -**

# **Problem Statement**


The problem is to predict the closing price of Yes Bank stocks using historical data. Accurate predictions can help investors make informed decisions, optimize portfolios, and mitigate risks. The challenge lies in handling the volatility and non-linearity of stock prices, which are influenced by various external factors.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor , AdaBoostRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows
print("Total duplicates:", df.duplicated().sum())

# Check for duplicate dates
print("Duplicate dates:", df['Date'].duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()  # No missing values found

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False,cmap='viridis', yticklabels=False)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe() # Statistical summary

### Variables Description

1. Date        : The month of the record in Mon-YY form (abbreviated month name and two-digit year, parsed below with `%b-%y`); each row covers one month.
2. Open        : The price at which the stock opened for the month.
3. High        : The highest price of the stock during the month.
4. Low         : The lowest price of the stock during the month.
5. Close       : The closing price of the stock for the month; this is the prediction target.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'],format = '%b-%y')

In [None]:
# Extract year and month for additional features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

In [None]:
# Calculate monthly price change
df['Price_Change'] = df['Close'] - df['Open']

In [None]:
# Create lag feature: previous month's close
df['Prev_Close'] = df['Close'].shift(1)

In [None]:
# Drop rows with missing values (due to shift)
df.dropna(inplace=True)

In [None]:
df.head(5)
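
To see why the lag feature introduces exactly one missing row, the toy example below applies the same `shift(1)` and `dropna()` steps to a hypothetical four-row frame (`toy` is made up for illustration and is not part of the dataset).

```python
import pandas as pd

# Hypothetical closing prices for four consecutive months
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 15.0]})

# shift(1) moves every value down one row, so the first row has no predecessor
toy["Prev_Close"] = toy["Close"].shift(1)
n_missing = int(toy["Prev_Close"].isna().sum())

# Dropping the NaN row leaves each month paired with the previous month's close
toy_clean = toy.dropna().reset_index(drop=True)

print(toy)
print("Missing after shift:", n_missing, "| rows kept:", len(toy_clean))
```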

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Histogram of Closing Price

In [None]:
df['Close'].plot(kind='hist', bins=20, color='teal', edgecolor='black', title='Distribution of Closing Prices', figsize=(8, 4))
plt.xlabel('Close Price')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

This histogram shows how frequently different closing price ranges occurred.
It helps in understanding the distribution and skewness of the closing price data.

##### 2. What is/are the insight(s) found from the chart?

Most of the closing prices fall within a specific lower range, showing that the stock often trades at a relatively low value. This indicates a skewed distribution and suggests that high closing prices were rare events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the histogram chart help create a positive business impact by showing that the stock mostly trades at a lower price range, which indicates affordability for retail investors.
However, this also reflects that the stock rarely reaches high price points, which might signal limited growth potential.
This insight can help investors assess both opportunity and risk before making financial decisions.


#### Line plot of Price Change over Time

In [None]:
df.set_index('Date')['Price_Change'].plot(figsize=(10, 4), title='Monthly Price Change Over Time', color='purple')
plt.xlabel('Date')
plt.ylabel('Price Change')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This line plot shows the monthly price change trend.
It helps to visualize how volatile the price movement is month-to-month.

##### 2. What is/are the insight(s) found from the chart?

The monthly price change swings repeatedly between gains and losses with no sustained direction. This suggests the stock lacks long-term growth consistency, making it risky for long-term investors but possibly useful for short-term trading strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The line plot of price change over time reveals frequent ups and downs in monthly price movements, indicating high volatility. This insight helps create a positive business impact by allowing traders and analysts to identify periods of increased activity, which may offer short-term profit opportunities. However, the irregular and unpredictable swings in price also reflect instability, which can lead to negative growth if investors enter during a downturn. Understanding this pattern is essential for managing risk and developing timing strategies in volatile market conditions.

####  Pairplot of numerical features

In [None]:
sns.pairplot(df[['Open', 'High', 'Low', 'Close']])
plt.suptitle('Pairwise Feature Relationships', y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This pairplot helps to understand the relationships and scatter distributions among numerical features. It can help in spotting linear or non-linear relationships.

##### 2. What is/are the insight(s) found from the chart?

The pairplot shows that features like Open, High, Low, and Close have strong positive relationships. The scatter plots form diagonal patterns, suggesting that these variables move together consistently. This confirms they are closely related and can be useful predictors in the model.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pairplot shows strong relationships between features like Open, High, Low, and Close. This helps create a positive business impact by confirming these variables can be used together to build accurate prediction models. However, since these features move closely together, a drop in one often affects the others too. This could lead to negative growth if early signals aren't caught, highlighting the need for timely decision-making.

#### Line plot

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(df['Date'], df['Close'], label='Current Close')
plt.plot(df['Date'], df['Prev_Close'], label='Previous Close', linestyle='--')
plt.title('Close vs Previous Close Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This plot compares the current closing price to the previous month's closing price.
It helps to see how consistent or deviating the price was from the last month.

##### 2. What is/are the insight(s) found from the chart?

The line plot shows that the closing price of Yes Bank is highly volatile with no clear upward trend. This suggests that the stock lacks long-term growth stability and may be more suited for short-term trading rather than long-term investment.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The line plot reveals that Yes Bank's closing prices are highly volatile with no clear upward trend. This can help create a positive business impact by allowing traders to spot short-term opportunities and manage risk better. However, the frequent fluctuations and absence of long-term growth also raise concerns, as they indicate potential instability and negative growth for long-term investors.

#### Bar chart

In [None]:
df.groupby('Year')['Close'].mean().plot(kind='bar', figsize=(8, 4), title='Average Close Price Per Year', color='skyblue')
plt.ylabel('Average Close Price')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows how the average closing price changes each year. It helps detect any year-over-year trends in stock performance.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows fluctuations in the average closing price of Yes Bank across different years. It highlights that the stock did not show consistent growth year over year, with some years experiencing a decline. This suggests unstable long-term performance and helps identify which years had relatively better or worse stock value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The bar chart helps identify how the average closing price of Yes Bank changed each year. This insight creates a positive business impact by highlighting which years performed better, guiding future investment timing. However, the inconsistency and decline in some years indicate potential negative growth and raise concerns about the stock's long-term reliability.



####  Box plot

In [None]:
sns.boxplot(x='Month', y='Close', data=df, palette='Set2')
plt.title('Monthly Close Price Distribution')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot reveals the distribution and outliers of close prices for each month.
It helps identify seasonal effects or consistency in monthly prices.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows that the distribution of closing prices varies significantly across different months. Some months have wider price ranges and outliers, indicating high volatility, while others are more stable. This suggests that certain months may be riskier or more active for trading than others.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The box plot highlights monthly variations in closing prices, revealing which months are more volatile and which are stable. This helps create a positive business impact by guiding investors to choose more predictable months for trading. However, the presence of outliers and wide ranges in some months also points to unpredictability and risk, which can lead to negative outcomes if not carefully monitored.

#### Scatter plot

In [None]:
sns.scatterplot(x='Open', y='Close', data=df)
plt.title('Open vs Close Price')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This plot shows the relationship between opening and closing prices.
A strong upward trend here supports high correlation between these features.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows a strong positive relationship between the opening and closing prices. Because each row covers one month, this means that when the stock opens high for the month, it generally closes high as well, indicating consistent movement within each month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The scatter plot shows a strong positive relationship between opening and closing prices, which supports more accurate model predictions and helps traders anticipate monthly stock behavior, creating a positive business impact. However, if the stock opens low, it’s likely to close low as well, which may signal potential negative growth, especially during bearish market conditions.

#### Line plot of Closing Price over Time

In [None]:
df.set_index('Date')['Close'].plot(figsize=(10, 4), title='Yes Bank Closing Price Over Time', grid=True)
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The line plot was chosen because it clearly shows the trend and volatility in Yes Bank’s closing price over time. It helps in understanding whether the stock follows a stable growth pattern or experiences frequent fluctuations, which is crucial for investment decision-making.



##### 2. What is/are the insight(s) found from the chart?

The line plot shows that Yes Bank’s closing price is highly volatile over time without a consistent upward trend. This suggests the stock lacks long-term stability but may offer opportunities for short-term trading during price swings.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The line plot shows that Yes Bank’s closing prices are highly volatile with no clear long-term growth. This helps create a positive business impact by informing investors about the risky nature of the stock, encouraging more careful short-term strategies. However, the lack of consistent upward movement and frequent dips also signal instability, which could lead to negative growth if not managed properly, especially for long-term investments.



#### Countplot

In [None]:
sns.countplot(x='Month', data=df, palette='Set3')
plt.title('Monthly Count of Closing Records')
plt.xlabel('Month')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The countplot was chosen because it shows how many records the dataset contains for each calendar month. Since the data has one row per month, this is a coverage check that confirms no month is over- or under-represented before looking for seasonal patterns; it does not measure trading volume.

##### 2. What is/are the insight(s) found from the chart?

The countplot shows that some months have slightly more records than others. Because the dataset has one row per month, these differences reflect how many years each calendar month is covered (for example, where the series starts or ends mid-year, or where a row was dropped), not seasonal trading activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The countplot confirms how evenly the calendar months are represented, which supports a positive business impact indirectly: seasonal comparisons such as the monthly box plot are only trustworthy when each month has a similar number of records. The chart itself does not point to negative growth, because record counts say nothing about price or investor sentiment; an under-represented month would only signal a gap in data coverage.



#### Line Plot

In [None]:
avg_monthly_close = df.groupby('Month')['Close'].mean()
plt.figure(figsize=(8, 4))
plt.plot(avg_monthly_close.index, avg_monthly_close.values, marker='o', color='green')
plt.title('Average Close Price by Month')
plt.xlabel('Month')
plt.ylabel('Average Close Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart helps to see if there is any trend in average closing price across months.

##### 2. What is/are the insight(s) found from the chart?

The line plot shows that the average closing price fluctuates across different months. Some months have slightly higher averages, while others dip, indicating that the stock doesn't follow a consistent seasonal trend. This helps understand monthly behavior for better timing of investments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart helps create a positive business impact by revealing how the average closing price changes from month to month, which supports better timing for investment decisions. However, the fluctuations across months show a lack of seasonal stability, which may lead to negative growth if investors rely on monthly patterns that are inconsistent.

#### Bar Plot

In [None]:
df['Direction'] = df['Price_Change'].apply(lambda x: 'Positive' if x > 0 else 'Negative')
sns.countplot(x='Direction', data=df, palette='coolwarm')
plt.title('Frequency of Monthly Price Change Direction')
plt.xlabel('Direction')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows how many times the stock had positive vs negative price change in a month.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that there are more months with negative price changes than positive ones. This indicates that the stock experienced downward movement more frequently, which reflects a bearish or unstable trend over time.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart helps create a positive business impact by showing how often the stock had positive or negative monthly returns, helping investors assess risk. However, since negative months are more frequent, it signals instability and potential negative growth, especially for long-term investments.



#### Correlation heatmap

In [None]:
sns.heatmap(df[['Open', 'High', 'Low', 'Close', 'Prev_Close']].corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap was chosen because it visually shows how strongly different features are related to each other. It helps quickly identify which variables are useful predictors for the target feature, especially the closing price.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that features like Open, High, Low, and Prev_Close have a strong positive correlation with Close. This confirms they are important predictors and can be used effectively in the model to improve closing price prediction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The correlation heatmap helps create a positive business impact by highlighting that features like Open, High, Low, and Prev_Close are strongly correlated with the Close price, making them reliable for building predictive models. However, since these features move together, any sudden drop in one may lead to a chain reaction, increasing the risk of negative growth during market downturns.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


### Hypothetical Statement - 1

#### 1. Is there a significant difference between monthly average Open and Close prices?



*   Null Hypothesis (H₀): There is no significant difference between the monthly average Open and Close prices.

*   Alternate Hypothesis (H₁): There is a significant difference between the monthly average Open and Close prices.



#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_rel

# Grouping by Month to compute monthly averages
monthly_avg = df.groupby('Month')[['Open', 'Close']].mean()

# Perform paired t-test
stat, p_value = ttest_rel(monthly_avg['Open'], monthly_avg['Close'])
print("Hypothesis Test - Monthly Average Open vs Close")
print("T-statistic:", stat)
print("P-value:", p_value)

if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. There is a significant difference between monthly average Open and Close prices.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant difference between monthly average Open and Close prices.")

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for Hypothetical Statement 1, we used a Paired Sample t-test. This statistical test is suitable because we are comparing two related variables — the monthly average Open and Close prices. The test checks if there is a significant difference between their means. The resulting p-value helps us decide whether to reject the null hypothesis.

##### Why did you choose the specific statistical test?

We chose the paired t-test because we are comparing two related values, the monthly average Open and Close prices for the same periods. This test is appropriate for checking whether two dependent groups differ significantly, and it helps determine whether the price changed meaningfully between open and close.
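
As a worked illustration of what the paired test computes, the sketch below derives the t-statistic by hand from the per-pair differences and compares it with `scipy.stats.ttest_rel`. The two arrays are hypothetical monthly averages, not values from the dataset.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired monthly averages (same months, two measurements each)
open_avg = np.array([100.0, 105.0, 98.0, 110.0, 102.0])
close_avg = np.array([101.0, 103.0, 99.0, 112.0, 101.0])

# The paired test works on the per-month differences
diff = open_avg - close_avg
n = diff.size

# t = mean difference divided by its standard error
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

t_scipy, p_value = ttest_rel(open_avg, close_avg)
print("manual t:", t_manual, "| scipy t:", t_scipy, "| p-value:", p_value)
```

Because only the differences enter the statistic, the test is sensitive to a consistent gap between the two series even when both vary widely from month to month.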

### Hypothetical Statement - 2

#### 1. Do years with a higher average High price also have a higher average Close price?



* Null Hypothesis (H₀): There is no correlation between the yearly average High price and the yearly average Close price.
* Alternate Hypothesis (H₁): There is a significant correlation between the yearly average High price and the yearly average Close price.



#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# Grouping by year to compute average Close and High prices
yearly_avg = df.groupby('Year')[['High', 'Close']].mean()

# Perform Statistical Test to obtain P-Value
corr_stat, p_val = pearsonr(yearly_avg['High'], yearly_avg['Close'])
print("\nHypothesis Test - Correlation between High and Close")
print("Correlation Coefficient:", corr_stat)
print("P-value:", p_val)

# Interpret results
if p_val < 0.05:
    print("Conclusion: Reject the null hypothesis. There is a significant correlation between average yearly high and closing prices.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant correlation between average yearly high and closing prices.")



##### Which statistical test have you done to obtain P-Value?

The Pearson correlation test was used to check whether there is a significant linear relationship between the yearly average High price and the yearly average Close price. This test is suitable for evaluating the strength and direction of the linear relationship between two continuous numerical variables.


##### Why did you choose the specific statistical test?

We chose the Pearson correlation test because it measures the strength and direction of the linear relationship between two continuous variables, in this case the yearly average High price and the yearly average Close price. This tells us whether years with higher peaks also close higher on average, which is useful for identifying influential predictors of the closing price.

### Hypothetical Statement - 3

#### 1. Has the average monthly price change differed significantly between recent years and earlier years?


*  Null Hypothesis (H₀): There is no significant difference in average monthly price change between earlier years and recent years.

*   Alternate Hypothesis (H₁): There is a significant difference in average monthly price change between earlier years and recent years.



#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Categorize data into earlier years and recent years
df['Period'] = df['Year'].apply(lambda x: 'Earlier' if x < 2020 else 'Recent')

# Calculate average price change per month for both periods
earlier = df[df['Period'] == 'Earlier']['Price_Change']
recent = df[df['Period'] == 'Recent']['Price_Change']

# Perform independent t-test
stat, p_val = ttest_ind(earlier, recent)
print("Hypothesis Test - Average Monthly Price Change: Earlier vs Recent")
print("T-statistic:", stat)
print("P-value:", p_val)

if p_val < 0.05:
    print("Conclusion: Reject the null hypothesis. There is a significant difference in average monthly price change between earlier and recent years.")
else:
    print("Conclusion: Fail to reject the null hypothesis. No significant difference in average monthly price change between earlier and recent years.")

##### Which statistical test have you done to obtain P-Value?

We used the independent t-test to compare the average monthly price change between earlier years and recent years. This test is suitable for checking if there is a significant difference in means between two independent groups.



##### Why did you choose the specific statistical test?

We chose the independent t-test because it compares the means of two separate time periods (earlier and recent years) to check whether the average monthly price change has shifted significantly over time. It is suitable when analyzing differences between two independent groups.
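
When the two periods differ a lot in size or spread, one option worth considering is Welch's variant of the test, which `ttest_ind` provides through `equal_var=False`. The sketch below compares both variants on randomly generated, hypothetical samples; it is not what the notebook cell above runs, and the group sizes and scales are assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical monthly price changes: a long calm period and a short volatile one
earlier = rng.normal(loc=0.0, scale=5.0, size=60)
recent = rng.normal(loc=0.0, scale=20.0, size=12)

# Default: Student's t-test with a pooled variance estimate
t_student, p_student = ttest_ind(earlier, recent)

# Welch's t-test: does not assume equal variances in the two groups
t_welch, p_welch = ttest_ind(earlier, recent, equal_var=False)

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
```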

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Check for missing values before dropping
total_missing_before = df.isnull().sum()
print("Missing values before dropping:\n", total_missing_before)

# Drop missing values caused by lag feature or others
df.dropna(inplace=True)

# Check again after dropping missing values
total_missing_after = df.isnull().sum()
print("Missing values after dropping:\n", total_missing_after)


#### What all missing value imputation techniques have you used and why did you use those techniques?

We handled missing values by removing rows with `df.dropna()`, because the only missing values were introduced by the creation of the lag feature Prev_Close, which shifted the data by one row. Since this left only the first row with NaN, removing it produced a clean dataset without significantly reducing data volume. No further imputation (mean/median filling) was necessary, as there were no other missing values in the dataset.

### 2. Handling Outliers

In [None]:
# We'll detect outliers in 'Close' prices using the IQR method.
Q1 = df['Close'].quantile(0.25)
Q3 = df['Close'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter data within bounds
df = df[(df['Close'] >= lower_bound) & (df['Close'] <= upper_bound)]

print("Outliers removed based on IQR in 'Close' prices.")

df.head(10)


##### What all outlier treatment techniques have you used and why did you use those techniques?

We used the IQR (Interquartile Range) method to detect and remove outliers from the Close price column. This technique identifies data points that fall more than 1.5 × IQR below the first quartile or above the third quartile. It was chosen because it is simple, robust to skewed data, and does not assume any distribution shape, making it suitable for financial datasets like stock prices that often have extreme values.
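
The IQR rule can be traced on a small example. The series below is hypothetical: seven ordinary values plus one extreme value of 100.

```python
import pandas as pd

# Hypothetical closing prices with one obvious extreme value
close = pd.Series([10, 12, 11, 13, 12, 14, 13, 100], dtype=float)

q1 = close.quantile(0.25)   # 11.75
q3 = close.quantile(0.75)   # 13.25
iqr = q3 - q1               # 1.5

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is treated as an outlier
lower = q1 - 1.5 * iqr      # 9.5
upper = q3 + 1.5 * iqr      # 15.5

kept = close[(close >= lower) & (close <= upper)]
print("bounds:", lower, upper, "| rows kept:", len(kept))
```

Only the value 100 falls outside the bounds, so seven of the eight rows are kept.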



### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Encode 'Direction' (Positive/Negative) and 'Period' (Earlier/Recent)
le = LabelEncoder()
df['Direction_Encoded'] = le.fit_transform(df['Direction'])
df['Period_Encoded'] = le.fit_transform(df['Period'])

print("Categorical features 'Direction' and 'Period' encoded using Label Encoding.")


#### What all categorical encoding techniques have you used & why did you use those techniques?

We used Label Encoding on the Direction and Period columns because they contain binary categories (e.g., Positive/Negative, Earlier/Recent). Label Encoding is efficient in such cases: it converts the text labels into numerical values that machine learning models can use, and unlike One-Hot Encoding it does not increase dimensionality.
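
To make the encoding concrete: `LabelEncoder` assigns integer codes in sorted order of the class labels, and a fitted encoder can map the codes back to labels. The sketch below uses a made-up three-row frame, and the `encoders` dictionary is an illustrative assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using the two binary categories described above
toy = pd.DataFrame({
    "Direction": ["Positive", "Negative", "Positive"],
    "Period": ["Earlier", "Earlier", "Recent"],
})

# Fit one encoder per column and keep it so the mapping can be inspected later
encoders = {}
for col in ["Direction", "Period"]:
    enc = LabelEncoder()
    toy[col + "_Encoded"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the labels in the order of their integer codes (sorted)
direction_classes = list(encoders["Direction"].classes_)
decoded = list(encoders["Direction"].inverse_transform([0, 1]))

print(direction_classes)
print(toy)
```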

### 4. Feature Manipulation & Selection

In [None]:
# We selected the features `['Open', 'High', 'Low', 'Prev_Close', 'Month', 'Year']`
# based on their strong correlation with the target variable `Close`.

#### 1. Feature Manipulation

In [None]:
# Created a new feature Price_Change to capture the monthly price movement by calculating the difference between Close and Open.

# Generated Prev_Close using .shift(1) to include the previous month's closing price, which adds historical context to the current data.

# Dropped the first row using dropna() to remove the missing value caused by the lag feature.

# These manipulated features help the model learn from stock movement patterns and past trends for better prediction.

#### 2. Feature Selection

In [None]:
features = ['Open', 'High', 'Low', 'Prev_Close', 'Month', 'Year']
target = 'Close'

X = df[features]
y = df[target]

##### What all feature selection methods have you used  and why?

We used a combination of domain knowledge and correlation analysis to select relevant features. Features like Open, High, Low, and Prev_Close were selected due to their strong correlation with the target variable Close, as observed in the heatmap. Additionally, Month and Year were included to capture seasonal and time-based trends. This approach helps avoid overfitting by excluding irrelevant or redundant features.
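
The correlation check behind this selection can be sketched as follows. The data here is synthetic (generated with a fixed seed), and the `Noise` column is a hypothetical irrelevant feature added only to contrast with the price columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50
open_ = rng.uniform(50, 150, n)

# Synthetic frame: Close tracks Open closely, Noise is unrelated
toy = pd.DataFrame({
    "Open": open_,
    "High": open_ + rng.uniform(0, 5, n),
    "Noise": rng.normal(size=n),
    "Close": open_ + rng.normal(scale=2, size=n),
})

# Pearson correlation of every feature with the target, strongest first
corr_with_close = (
    toy.corr()["Close"].drop("Close").sort_values(ascending=False)
)
print(corr_with_close)
```

Features near the top of this ranking are candidates to keep. Features with near-zero correlation, like `Noise` here, are candidates to drop.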

##### Which all features you found important and why?

The important features for predicting Yes Bank's stock price are Open, High, Low, Prev_Close, Month, and Year. These were chosen because they reflect market behavior, recent trends, and seasonal patterns—all of which significantly influence the stock’s closing price.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
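# Transform your data
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)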

### 6. Data Scaling

In [None]:
# We used StandardScaler from sklearn.preprocessing to scale our features.
# StandardScaler standardizes the data by removing the mean and scaling to unit variance,
# resulting in a distribution with a mean of 0 and a standard deviation of 1.
# This is especially helpful for models like Linear Regression which are sensitive to the scale of input features.
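
A minimal sketch of the standardization described above, on a made-up array. It differs from the notebook cell in one respect, flagged here as an assumption about a safer workflow rather than the project's method: the scaler is fitted on the training rows only and then applied to the test rows, so that test-period statistics do not influence the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix in time order (two features on very different scales)
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
    [5.0, 500.0],
])

# Chronological 80/20 split: first four rows train, last row test
split = int(len(X_toy) * 0.8)
X_train_toy, X_test_toy = X_toy[:split], X_toy[split:]

# Learn mean and standard deviation from the training rows only
scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)

# Reuse the training statistics for the test rows
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
```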


##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not required in this project because the number of features used (6 in total) is already quite small and manageable.
All features included are relevant and meaningful for predicting the stock price.
Using dimensionality reduction techniques like PCA would not provide significant benefit here and might result in loss of important information.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, shuffle=False, test_size=0.2)

##### What data splitting ratio have you used and why?

We used an 80:20 split ratio for training and testing the dataset.
This ratio ensures that the model has sufficient data (80%) to learn patterns effectively,
while reserving a fair portion (20%) for evaluating its performance on unseen data.
Also, we set shuffle=False to preserve the chronological order of stock prices,
as it is a time series problem.
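
The effect of `shuffle=False` can be seen on a toy sequence. In the sketch below the integers 0 to 9 stand in for time-ordered observations (an illustrative assumption).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered observations, labelled by their position in time
t = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# With shuffle=False the last 20% of rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    t, y_toy, test_size=0.2, shuffle=False
)

print(y_tr)  # the earliest 8 observations
print(y_te)  # the latest 2 observations
```

Every training observation precedes every test observation, so the model is evaluated only on periods later than those it was trained on.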

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, the dataset is not imbalanced in this case.

Imbalance is typically a concern in classification problems where one class significantly outnumbers the other, leading to biased model predictions. However, this project is a regression task where the target variable is a continuous numeric value, not a category.

Here, we are predicting stock prices, not classifying them. Therefore, the concept of imbalance doesn’t apply in the same way, and there is no indication that certain price ranges are overly dominant or underrepresented in a way that would bias the model.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
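from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Baseline linear model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)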

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

print("\nLinear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, lr_preds))
print("MSE:", mean_squared_error(y_test, lr_preds))
print("R2 Score:", r2_score(y_test, lr_preds))

print("\nRandom Forest Regressor Performance:")
print("MAE:", mean_absolute_error(y_test, rf_preds))
print("MSE:", mean_squared_error(y_test, rf_preds))
print("R2 Score:", r2_score(y_test, rf_preds))


In [None]:
metrics_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest'],
    'MAE': [mean_absolute_error(y_test, lr_preds), mean_absolute_error(y_test, rf_preds)],
    'MSE': [mean_squared_error(y_test, lr_preds), mean_squared_error(y_test, rf_preds)],
    'R2 Score': [r2_score(y_test, lr_preds), r2_score(y_test, rf_preds)]
})

metrics_df.set_index('Model', inplace=True)
metrics_df.plot(kind='bar', figsize=(10, 6))
plt.title('Evaluation Metric Score Comparison')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.grid(True)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

print("\nBest Hyperparameters:")
print(grid_search.best_params_)

# Retrieve the best estimator (already refit on the full training set by GridSearchCV)
best_rf = grid_search.best_estimator_

# Predict on the test set
best_rf_preds = best_rf.predict(X_test)

print("\nOptimized Random Forest Performance:")
print("MAE:", mean_absolute_error(y_test, best_rf_preds))
print("MSE:", mean_squared_error(y_test, best_rf_preds))
print("R2 Score:", r2_score(y_test, best_rf_preds))


##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter tuning because it performs an exhaustive search over the specified parameter grid and selects the combination with the best cross-validated score. An exhaustive search is affordable here because the dataset is small.
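
One point worth noting for time-ordered data: an integer `cv` value makes GridSearchCV use ordinary K-fold splits for a regressor, so some training folds contain observations that come after the validation fold. The sketch below shows a hedged alternative, not the notebook's configuration: passing scikit-learn's `TimeSeriesSplit` as `cv` so each validation fold follows its training fold. The data is synthetic, and the small grid and the MAE-based scoring are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic time-ordered data: a noisy upward trend
rng = np.random.default_rng(42)
X_toy = np.arange(60, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + rng.normal(scale=1.0, size=60)

# Each split trains on an initial segment and validates on the segment after it
tscv = TimeSeriesSplit(n_splits=3)
folds_ok = all(
    train_idx.max() < val_idx.min()
    for train_idx, val_idx in tscv.split(X_toy)
)

# Same GridSearchCV API as in the notebook, with the time-aware splitter
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=tscv,
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)

print(folds_ok)
print(search.best_params_)
```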

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, we observed improvement after hyperparameter tuning using GridSearchCV.

The optimized Random Forest model achieved better performance metrics compared to the default model. Specifically, the MAE and MSE decreased, and the R² Score increased, indicating that the model is now making more accurate predictions.

In [None]:
# Updated Evaluation Metric Score Chart
metrics_df_updated = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest (Default)', 'Random Forest (Tuned)'],
    'MAE': [
        mean_absolute_error(y_test, lr_preds),
        mean_absolute_error(y_test, rf_preds),
        mean_absolute_error(y_test, best_rf_preds)
    ],
    'MSE': [
        mean_squared_error(y_test, lr_preds),
        mean_squared_error(y_test, rf_preds),
        mean_squared_error(y_test, best_rf_preds)
    ],
    'R2 Score': [
        r2_score(y_test, lr_preds),
        r2_score(y_test, rf_preds),
        r2_score(y_test, best_rf_preds)
    ]
})

metrics_df_updated.set_index('Model', inplace=True)
metrics_df_updated.plot(kind='bar', figsize=(12, 6))
plt.title('Updated Evaluation Metric Score Comparison')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.grid(True)
plt.show()


### ML Model - 2

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbr_model.fit(X_train, y_train)
gbr_preds = gbr_model.predict(X_test)

print("\nGradient Boosting Regressor Performance:")
print("MAE:", mean_absolute_error(y_test, gbr_preds))
print("MSE:", mean_squared_error(y_test, gbr_preds))
print("R2 Score:", r2_score(y_test, gbr_preds))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
metrics_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Optimized RF', 'Gradient Boosting'],
    'MAE': [
        mean_absolute_error(y_test, lr_preds),
        mean_absolute_error(y_test, rf_preds),
        mean_absolute_error(y_test, best_rf_preds),
        mean_absolute_error(y_test, gbr_preds)
    ],
    'MSE': [
        mean_squared_error(y_test, lr_preds),
        mean_squared_error(y_test, rf_preds),
        mean_squared_error(y_test, best_rf_preds),
        mean_squared_error(y_test, gbr_preds)
    ],
    'R2 Score': [
        r2_score(y_test, lr_preds),
        r2_score(y_test, rf_preds),
        r2_score(y_test, best_rf_preds),
        r2_score(y_test, gbr_preds)
    ]
})

metrics_df.set_index('Model', inplace=True)
metrics_df.plot(kind='bar', figsize=(12, 6))
plt.title('Evaluation Metric Score Comparison Across All Models')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.grid(True)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
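# Note: this cell repeats the Random Forest grid search from ML Model - 1;
# the Gradient Boosting model above is not tuned here.
param_grid = {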
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

print("\nBest Hyperparameters:")
print(grid_search.best_params_)

best_rf = grid_search.best_estimator_
best_rf_preds = best_rf.predict(X_test)

print("\nOptimized Random Forest Performance:")
print("MAE:", mean_absolute_error(y_test, best_rf_preds))
print("MSE:", mean_squared_error(y_test, best_rf_preds))
print("R2 Score:", r2_score(y_test, best_rf_preds))


##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter optimization because it systematically searches a predefined grid of parameters and uses cross-validation to find the best combination within that grid. It is practical here because the dataset is small enough for an exhaustive search.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
models = ['Linear Regression', 'Random Forest (Default)', 'Random Forest (Tuned)']
mae_scores = [
    mean_absolute_error(y_test, lr_preds),
    mean_absolute_error(y_test, rf_preds),
    mean_absolute_error(y_test, best_rf_preds)
]
mse_scores = [
    mean_squared_error(y_test, lr_preds),
    mean_squared_error(y_test, rf_preds),
    mean_squared_error(y_test, best_rf_preds)
]
r2_scores = [
    r2_score(y_test, lr_preds),
    r2_score(y_test, rf_preds),
    r2_score(y_test, best_rf_preds)
]

x = np.arange(len(models))
width = 0.25

fig, ax = plt.subplots(figsize=(12, 6))
rects1 = ax.bar(x - width, mae_scores, width, label='MAE')
rects2 = ax.bar(x, mse_scores, width, label='MSE')
rects3 = ax.bar(x + width, r2_scores, width, label='R2 Score')

ax.set_ylabel('Scores')
ax.set_title('Evaluation Metric Score Comparison')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend()

plt.grid(True)
plt.tight_layout()
plt.show()


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.



* MAE (Mean Absolute Error):
Indicates the average absolute difference between predicted and actual stock prices. Lower MAE means the model makes fewer mistakes in rupee terms, helping investors make more precise decisions.

* MSE (Mean Squared Error):
Penalizes larger errors more than MAE. A lower MSE means fewer big mistakes, which reduces the risk of incorrect stock price predictions, improving trust in the model for financial planning.

* R² Score (Coefficient of Determination):
Represents how well the model explains the variability of stock prices. A higher R² means the model captures trends effectively, leading to more confident and data-driven investment strategies.
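
The three metrics can be worked through by hand on a small example. The actual and predicted prices below are made-up values chosen so the arithmetic is easy to follow. The last prediction is off by 10, which shows how MSE reacts to a single large error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 121.0, 120.0])

errors = y_true - y_pred                 # [-2, 2, -1, 10]

# MAE: mean of absolute errors = (2 + 2 + 1 + 10) / 4
mae = np.mean(np.abs(errors))

# MSE: mean of squared errors = (4 + 4 + 1 + 100) / 4
mse = np.mean(errors ** 2)

# R2: 1 - (residual sum of squares / total sum of squares) = 1 - 109 / 500
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae)  # 3.75
print(mse)  # 27.25
print(r2)   # approximately 0.782

# The manual values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

The single error of 10 contributes 100 of the 109 squared-error units, which is why MSE is far larger than MAE here. Because MSE is in squared units, plotting it on the same axis as MAE and R² can make the smaller bars hard to read.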









### ML Model - 3

In [None]:
from sklearn.ensemble import AdaBoostRegressor

ada_model = AdaBoostRegressor(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)
ada_preds = ada_model.predict(X_test)

print("\nAdaBoost Regressor Performance:")
print("MAE:", mean_absolute_error(y_test, ada_preds))
print("MSE:", mean_squared_error(y_test, ada_preds))
print("R2 Score:", r2_score(y_test, ada_preds))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
models = ['Linear Regression', 'Random Forest (Default)', 'Random Forest (Tuned)', 'AdaBoost']
mae_scores = [
    mean_absolute_error(y_test, lr_preds),
    mean_absolute_error(y_test, rf_preds),
    mean_absolute_error(y_test, best_rf_preds),
    mean_absolute_error(y_test, ada_preds)
]
mse_scores = [
    mean_squared_error(y_test, lr_preds),
    mean_squared_error(y_test, rf_preds),
    mean_squared_error(y_test, best_rf_preds),
    mean_squared_error(y_test, ada_preds)
]
r2_scores = [
    r2_score(y_test, lr_preds),
    r2_score(y_test, rf_preds),
    r2_score(y_test, best_rf_preds),
    r2_score(y_test, ada_preds)
]

x = np.arange(len(models))
width = 0.25

fig, ax = plt.subplots(figsize=(14, 7))
rects1 = ax.bar(x - width, mae_scores, width, label='MAE')
rects2 = ax.bar(x, mse_scores, width, label='MSE')
rects3 = ax.bar(x + width, r2_scores, width, label='R2 Score')

ax.set_ylabel('Scores')
ax.set_title('Evaluation Metric Score Comparison Across All Models')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend()

plt.grid(True)
plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Fit the Algorithm
ada_param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0]
}

ada_grid = GridSearchCV(AdaBoostRegressor(random_state=42), ada_param_grid, cv=5, scoring='r2')
ada_grid.fit(X_train, y_train)

print("\nBest AdaBoost Hyperparameters:")
print(ada_grid.best_params_)

# Predict on the model
best_ada = ada_grid.best_estimator_
best_ada_preds = best_ada.predict(X_test)

print("\nOptimized AdaBoost Performance:")
print("MAE:", mean_absolute_error(y_test, best_ada_preds))
print("MSE:", mean_squared_error(y_test, best_ada_preds))
print("R2 Score:", r2_score(y_test, best_ada_preds))

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV as the hyperparameter optimization technique because it systematically explores all combinations of the specified parameters using cross-validation. This finds the best set of parameters within the grid for the AdaBoost Regressor, with the aim of improving model accuracy and generalization.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
models = ['Linear Regression', 'Random Forest (Default)', 'Random Forest (Tuned)', 'AdaBoost (Default)', 'AdaBoost (Tuned)']
mae_scores = [
    mean_absolute_error(y_test, lr_preds),
    mean_absolute_error(y_test, rf_preds),
    mean_absolute_error(y_test, best_rf_preds),
    mean_absolute_error(y_test, ada_preds),
    mean_absolute_error(y_test, best_ada_preds)
]
mse_scores = [
    mean_squared_error(y_test, lr_preds),
    mean_squared_error(y_test, rf_preds),
    mean_squared_error(y_test, best_rf_preds),
    mean_squared_error(y_test, ada_preds),
    mean_squared_error(y_test, best_ada_preds)
]
r2_scores = [
    r2_score(y_test, lr_preds),
    r2_score(y_test, rf_preds),
    r2_score(y_test, best_rf_preds),
    r2_score(y_test, ada_preds),
    r2_score(y_test, best_ada_preds)
]

x = np.arange(len(models))
width = 0.25

fig, ax = plt.subplots(figsize=(16, 7))
rects1 = ax.bar(x - width, mae_scores, width, label='MAE')
rects2 = ax.bar(x, mse_scores, width, label='MSE')
rects3 = ax.bar(x + width, r2_scores, width, label='R2 Score')

ax.set_ylabel('Scores')
ax.set_title('Updated Evaluation Metric Score Comparison Across All Models')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend()

plt.grid(True)
plt.tight_layout()
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered three key evaluation metrics: MAE (Mean Absolute Error), MSE (Mean Squared Error), and R² Score. Among these, R² Score was the most impactful for business decisions because it explains the proportion of variance in stock prices captured by the model. A high R² indicates better reliability in predictions, which is crucial for investors. Additionally, MAE was preferred over MSE for understanding average prediction errors in real currency terms, providing more interpretable and actionable insights for financial planning.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We chose ML Model - 3, the tuned AdaBoost Regressor, as the final prediction model because it outperformed the other models on the evaluation metrics. After hyperparameter tuning with GridSearchCV, it achieved the lowest MAE and MSE and the highest R² Score compared with Linear Regression and the Random Forest Regressor (both default and tuned). These improvements make ML Model - 3 more reliable for forecasting stock prices, which is essential for generating accurate financial insights.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

We used the tuned AdaBoost Regressor, an ensemble technique that combines several weak learners to form a strong predictive model. It works by giving more weight to the data points that are harder to predict, improving accuracy over successive iterations.

To understand how the model makes decisions, we used its built-in feature importance scores (the `feature_importances_` attribute). These identify which features most influence the model’s predictions. In our case, the “Open”, “High”, “Low”, and “Prev_Close” prices were among the most important. This insight is valuable for the business because it highlights which variables affect stock price movements the most, aiding better forecasting and strategic decision-making.
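
A minimal sketch of reading feature importances from an AdaBoost regressor. The data is synthetic and the column names `signal` and `noise` are hypothetical. In the notebook, the equivalent would pair `best_ada.feature_importances_` with the `features` list, since the scaled training array carries no column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data: the target depends on 'signal' and not on 'noise'
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "signal": rng.uniform(0, 10, n),
    "noise": rng.normal(size=n),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=n)

model = AdaBoostRegressor(n_estimators=50, random_state=42)
model.fit(X_toy, y_toy)

# feature_importances_ holds one normalized score per input column
importances = pd.Series(
    model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(importances)
```

These scores are impurity-based and sum to 1. They rank features by how much the ensemble's trees relied on them, not by the direction or size of each feature's effect on the prediction.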

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib
joblib.dump(best_ada, 'best_adaboost_model.pkl')

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the model for sanity check
loaded_model = joblib.load('best_adaboost_model.pkl')

# Predict on unseen test data using the loaded model
sanity_preds = loaded_model.predict(X_test)

print("\nSanity Check Prediction on Unseen Data:")
print("MAE:", mean_absolute_error(y_test, sanity_preds))
print("MSE:", mean_squared_error(y_test, sanity_preds))
print("R2 Score:", r2_score(y_test, sanity_preds))
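
The save-and-reload check can also be expressed as a direct comparison of predictions. The sketch below uses a small linear model and a temporary directory as stand-ins; the file name `toy_model.joblib` and the toy data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small stand-in model on made-up data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_toy, y_toy)

# Round trip: write the model to disk and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# A restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Because the model was trained on scaled features, new raw inputs would need the same scaling before prediction, so saving the fitted scaler alongside the model is likely to be needed for deployment.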

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

After evaluating multiple models, including Linear Regression, Random Forest, and AdaBoost, we selected the tuned AdaBoost Regressor as the final model based on its superior performance metrics.
Tuning its hyperparameters with GridSearchCV improved prediction accuracy over the default configuration.
The model showed the lowest MAE and MSE along with the highest R² score among the models compared.
The model has been saved for deployment and passed a sanity check on the held-out test data after being reloaded, which supports its use for forecasting Yes Bank stock prices.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***