# **Project Name - Yes Bank**

##### **Project Type - Regression**
##### **Contribution - Individual**
##### **Name - Shivan Mishra**

# **Project Summary :--**
*In this project, we aim to predict the monthly closing stock price of Yes Bank using regression models based on historical stock data. The dataset contains monthly values including the opening, highest, lowest, and closing prices of the stock.*

*To perform this task, we implemented and compared three different regression models:*

    *1. Linear Regression*

    *2. Lasso Regression*
    
    *3. K-Nearest Neighbors (KNN) Regressor*
    
*We applied both normalization and standardization techniques during the preprocessing phase to ensure optimal model performance. Data was split into training and testing sets using train_test_split.*

### *Library's are:--*

***1. Numpy***

***2. Pandas***

***3. Matplotlib***

***4. Seaborn***

***5. Sklearn***

#### *Evaluation Metrics Used:- Mean Absolute Error, Mean Square Error and R2*


# **GitHub Link -** <a href="https://github.com/shivan632/Labmentix-1.git"> (https://github.com/shivan632/Labmentix-1.git)</a>

# **Problem Statement :--**

**The main objective is to predict the Stock's Closing price of the Month.**

****

# ***Let's Begin !***

## ***1. Know Your Data***

### **Import Libraries**

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns       
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2


### **Loading Dataset**

In [None]:
# Step 1: Load the dataset
df = pd.read_csv("data_YesBank_StockPrices.csv")
df

### **Dataset First View**

In [None]:
df.head(6)  # Display the first few rows of the dataset

### **Dataset Rows & Columns count**

In [None]:
#Dataset Rows & Columns count
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])
print("Column Names:", df.columns.tolist())

### **Dataset Information**

In [None]:
# Dataset Info
df.info()

### **Checking Duplicate Values / Missing/Null Values**

In [None]:
#Step 2: Checking duplicates and null values
duplicates = df.duplicated().sum()      # Check for duplicate rows  
print(f"Number of duplicate rows: {duplicates}")
null_counts = df.isnull().sum()     # Count null values in each column
print(f"Null values in each column:\n{null_counts}")

### **What did you know about your dataset?**

📊 Lets Explore What Did We Learn About the Dataset?

The dataset contains historical stock prices of Yes Bank, with 185 rows and 5 columns.

Each row represents stock data for a specific date, and the columns include:

- **Date:** The date of the record (object type; can be converted to datetime)

- **Open:** Stock price at market open

- **High:** Highest price of the day
 
- **Low:** Lowest price of the day
 
- **Close:** Stock price at market close

✅ All columns have complete data — no missing or null values were found.

✅ There are no duplicate rows, indicating the dataset is clean.

🧮 The numeric columns (Open, High, Low, Close) are of type float64, suitable for further statistical analysis and modeling.

****

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe()

### **Variables Description**

Below is a brief description of each variable in the dataset:

- **Date**: The trading date for each record. While it is not used directly in prediction models, it is important for time-based analysis, trend visualization, and for creating additional time-related features (e.g., day of the week, month).
  
- **Open**: The price at which Yes Bank's stock opened for trading on a given day.

- **High**: The highest price reached by the stock during the trading day.

- **Low**: The lowest price reached by the stock during the trading day.

- **Close**: The final price of the stock at market close. This is often used as the **target variable** for stock price prediction tasks.

> All price-related columns (`Open`, `High`, `Low`, `Close`) are continuous numerical variables and are essential for trend and pattern analysis.


### **Check Unique Values for each variable.**

In [None]:
# Check Unique Values for each variable.
df.nunique()

****

## ***3. Data Wrangling***

### **Data Wrangling Code**

In [None]:

df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

df = df.drop_duplicates()

print("Missing values:\n", df.isna().sum())

df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

print("\nData types after conversion:\n", df.dtypes)

df = df.reset_index(drop=True)
df.head()


### **Insights Gained and Manipulations Performed**

#### ✅ Data Manipulations Performed:
- Converted the **`Date`** column from a string like `'Jul-05'` to proper `datetime` format using `%b-%y`
- Created new **`Month`** and **`Year`** columns from the `Date` column for better time-based analysis
- Checked for and confirmed that there are **no missing values** or **null entries**
- Removed any **duplicate rows** (none were found in this case)
- Verified data types to ensure all price-related columns (`Open`, `High`, `Low`, `Close`) are in **float64** format
- Reset the dataframe index after cleaning

#### 📌 Key Insights Observed:
- The dataset contains **monthly stock prices** for Yes Bank
- Data ranges from **July 2005 onward**, with each row representing a **monthly summary**
- All price features (`Open`, `High`, `Low`, `Close`) appear to be **continuous numerical variables**
- No immediate data quality issues like missing or duplicated values were found

This clean dataset is now ready for visualization (EDA) and modeling steps.


****

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## ***U — Univariate Analysis:--***

### *1. Histogram of Closing Prices*

In [None]:
# Chart visualization code
plt.figure(figsize=(8, 5))
sns.histplot(df['Close'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


Why this chart?
To understand how the closing prices are distributed — skewed, normal, etc.

Insights:
If prices are clustered around a range or show outliers, we can identify volatility or stability.

Business Impact:
Understanding distribution helps in risk assessment. Skewed data may imply a need for transformation or highlight price manipulation.






































### *2. Line Plot of Closing Price Over Time*

In [None]:
# Chart visualization code
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], label='Close Price', color='blue')
plt.title('Closing Price Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.legend()
plt.show()


Why this chart?
To observe trends, spikes, crashes, or steady movement in closing prices.

Insights:
Sharp drops or increases can indicate major events (like the Rana Kapoor fraud case).

Business Impact:
Helps correlate financial events with stock performance for strategic planning.



## ***B — Bivariate Analysis:--***

### *3. Scatter Plot: Opening Price vs Closing Price*

In [None]:
# Chart visualization code
plt.figure(figsize=(7, 5))
sns.scatterplot(x='Open', y='Close', data=df, color='purple')
plt.title('Opening Price vs Closing Price')
plt.xlabel('Opening Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()


Why this chart?
To identify whether there's a linear or non-linear relationship between opening and closing prices.

Insights:
A strong linear pattern indicates the closing price can be predicted using the opening price.

Business Impact:
Useful in short-term trading strategies — if opening price reliably predicts closing price, it enables profitable same-day trading decisions.

Why this chart?
To study quarterly trends and distribution shapes (including multi-modal ones).

Insights:
Wider violins show more variability in specific quarters. Can indicate seasonal investment behavior.

Business Impact:
Guides quarterly investment planning. For instance, Q1/Q3 might be more volatile or rewarding.

### *4. Bar Plot: Average High Prices by Year*

In [None]:
# Chart visualization code
avg_high_year = df.groupby('Year')['High'].mean().reset_index()

plt.figure(figsize=(9, 5))
sns.barplot(x='Year', y='High', data=avg_high_year, palette='YlGnBu')
plt.title('Average High Prices by Year')
plt.xlabel('Year')
plt.ylabel('Average High Price')
plt.grid(True)
plt.show()


Why this chart?
Helps understand how stock peaks have evolved annually.

Insights:
Rising or falling high prices signal market perception of potential growth or decline.

Business Impact:
Identifies years of investor confidence or fear — useful for back-testing strategies or investor sentiment analysis.

## ***M — Multivariate Analysis:--***

### *5. Multi-line Plot: Open, High, Low, Close Over Time*

In [None]:
# Chart visualization code
plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['Open'], label='Open', alpha=0.7)
plt.plot(df['Date'], df['High'], label='High', alpha=0.7)
plt.plot(df['Date'], df['Low'], label='Low', alpha=0.7)
plt.plot(df['Date'], df['Close'], label='Close', alpha=0.7)
plt.title('Time Series: Open, High, Low, Close Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()


Why this chart?
Shows all major stock price points on a single timeline for full insight.

Insights:
You can identify volatility periods (wide gaps between high/low) and price crashes or booms.

Business Impact:
Helps in building time-aware models and understanding how each metric evolves over time.

****

## ***5. Hypothesis Testing***

****

## ***6. Feature Engineering & Data Pre-processing***

### **1. Handling Missing Values**

In [None]:
print(df.isnull().sum())            # Count total missing values in each column

print("\nAfter filling:\n", df.fillna(0))            # Replaces all NaN with 0

print("\nDropping rows with NaN:\n", df.dropna())  # Drops all rows with NaN values


**Detects missing data in each column.**

**Helps you understand how much data is missing before deciding what to do.**

**Prevents errors during model training, as most ML models can't handle NaN.**

### **2. Handling Outliers**

In [None]:

num_cols = ['Open', 'High', 'Low', 'Close']
original_data=df.copy()
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"🔎 {col}: {outliers.shape[0]} outliers detected")
    df= df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
df.reset_index(drop=True, inplace=True)
print("All Outliers removed")

### **3. Feature Seletion**

In [None]:
X = df[['Open', 'High', 'Low']]   # Independent variables
y = df['Close']                   # Dependent variable

### **4. Data Splitting**

#### What data splitting ratio have you used and why? (Answer)

In [None]:
# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=5)

****

## ***7. ML Model Implementation***

### ***ML Model 1- Linear Regression***

In [None]:
# ML Model - 1 Implementation
model = LinearRegression()          # Create a Linear Regression model object

# Fit the Algorithm
model.fit(X_train, y_train)          # Fit the model to the training data

# Predict on the model
y_pred = model.predict(X_test)           # Predict the target variable for the test set

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(8, 6))
sns.regplot(x=y_test, y=y_pred, ci=None, line_kws={'color': 'red'})

plt.xlabel("Actual Close Price")
plt.ylabel("Predicted Close Price")
plt.title("Linear Regression: Actual vs Predicted Close Price")
plt.grid(True)
plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross- Validation Score
print("Mean Absolute Error (Linear Regression):", mae(y_test, y_pred))
print("Mean Squared Error (Linear Regression):", mse(y_test, y_pred))
print("R-squared Score (Linear Regression):", r2(y_test, y_pred))

****

### **ML Model 2- Lasso Regression**

In [None]:
# ML Model - 2 Implementation
lasso = Lasso(alpha=0.1)          # Create a Lasso Regression model object.

# Fit the Algorithm
lasso.fit(X_train, y_train)          # Fit the model to the training data

# Predict on the model
y_pred = lasso.predict(X_test)           #Predict the target variable for the test set

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(8, 5))
sns.regplot(x=y_test, y=y_pred, ci=None, line_kws={"color": "red"})
plt.xlabel("Actual Close Price")
plt.ylabel("Predicted Close Price")
plt.title("Lasso Regression: Actual vs Predicted")
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross- Validation Score
print("Mean Absolute Error (Linear Regression):", mae(y_test, y_pred))
print("Mean Squared Error (Linear Regression):", mse(y_test, y_pred))
print("R-squared Score (Linear Regression):", r2(y_test, y_pred))

****

### **ML Model 3- KNN Regression**

In [None]:
# ML Model - 3 Implementation
knn_model = KNeighborsRegressor(n_neighbors=4)          # Create a KNN Regressor model object.

# Fit the Algorithm
knn_model.fit(X_train, y_train)          # Fit the model to the training data

# Predict on the model
y_pred = knn_model.predict(X_test)           # Predict the target variable for the test set

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(8, 6))
sns.regplot(x=y_test, y=y_pred, ci=None, scatter_kws={'alpha':0.6}, line_kws={'color': 'red'})
plt.xlabel("Actual Close Price")
plt.ylabel("Predicted Close Price")
plt.title("KNN Regression: Actual vs Predicted Close Price")
plt.grid(True)
plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross- Validation Score
print("Mean Absolute Error (Linear Regression):", mae(y_test, y_pred))
print("Mean Squared Error (Linear Regression):", mse(y_test, y_pred))
print("R-squared Score (Linear Regression):", r2(y_test, y_pred))

****

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

        *To ensure accurate and business-relevant predictions of Yes Bank’s monthly stock closing prices, I used the following evaluation metrics:
                1. Mean Absolute Error
                2. Mean Squared Error
                3. R² Score
        *These combined metrics provide a balanced evaluation — understanding average errors, large risks, and overall model performance, which is critical in stock 
         price prediction.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

        *I chose Linear Regression because of regions:
                1. Best Performance on Evaluation Metrics
                2. Accurate Predictions
                3. Model Simplicity & Interpretability
                4. No Overfitting Detected
        *Due to its strong performance on error metrics, reliable predictions, and simplicity, Linear Regression was chosen as the final model for predicting Yes 
         Bank’s monthly stock closing prices. It ensures both accuracy and transparency, which is critical for financial forecasting and decision-making.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

            ->Model Used: Linear Regression
            *Linear Regression is a supervised learning algorithm used to predict a continuous target variable based on one or more input features.
            *In this project, it was used to predict the monthly closing price of Yes Bank’s stock using features like:
                    1. Open price
                    2. High price
                    3. Low Price
            ->How It Works:
            *Linear Regression fits a straight line (y = mx + c) that best represents the relationship between the features (X) and the target (y).

****

### ***My model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# ***<u>Conclusion</u>***

*-> In this project, we successfully built a regression model to predict the monthly closing stock prices of Yes Bank using historical stock price data. The dataset included essential features such as opening, highest, lowest, and closing prices, which were used to train the model.*

## **Steps:-**

### 📊 **1. Dataset Loading**

The stock price dataset was loaded, containing columns like **opening, closing, high, and low prices** for each month.

### 🔍 **2. Data Cleaning**

We performed:

* **Null value checking and handling** (e.g., `fillna()` or `dropna()`)
* **Duplicate value detection and removal**

### 🔧 **3. Data Wrangling**

Converted data types (like converting date strings to `datetime`), ensured consistency in column formats, and prepared the dataset for visualization and modeling.

### 📈 **4. Data Visualization**

Created charts to understand patterns:

* Line plots of stock trends
* Correlation heatmaps
* Boxplots to detect outliers

### 🧠 **5. Feature Selection**

Selected important columns/features for the model — such as `Open`, `High`, `Low`, etc. — after visual and statistical analysis.

### 🔀 **6. Data Splitting**

Split the dataset into **training and testing sets** using an 70-30 ratio for unbiased model evaluation.

### 🤖 **7. ML Modeling & Prediction**

* **ML Model Implementation**
* **Fit the Algorithm**
* **Predict on the model**

The models were evaluated using:

* **Mean Squared Error (MSE)**
* **Root Mean Squared Error (RMSE)**
* **R² score**

The best model was then used to predict the stock’s closing price with good accuracy.

### 📊 **8. Final Visualization**

Plotted predicted vs. actual closing prices to visually interpret model performance and confirm the validity of predictions.

****

#### **🔍 Final Results:**
*The best-performing model achieved an R² score of 0.61(Liner), 0.55(Lasso) and 0.16(KNN) on the test set, indicating a good fit and predictive power.*

*Predicted stock closing prices closely followed the actual values in most months, with minimal error in volatile periods.*

****