<a href="https://colab.research.google.com/github/tejask108/Yes-Bank-Stock-Closing-Price-Prediction/blob/main/Yes_Bank_Stock_Closing_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Rakshanda Shaikh
##### **Team Member 2 -** Nirvi Prabhale
##### **Team Member 3 -** Tejesh Khadke


# **Project Summary -**


This project focuses on a critical challenge: predicting the closing price of Yes Bank's stock. This prediction is crucial for stakeholders, investors, and market participants due to the bank's recent difficulties, like bad loans and fraud cases. Regulatory intervention by the Reserve Bank of India has added complexity to predicting the bank's stock prices.

To address this, the project uses a comprehensive dataset covering Yes Bank's stock prices since its inception. This dataset includes key metrics like monthly closing, starting, highest, and lowest prices. The goal is to create predictive models that can grasp the complex trends in the bank's stock prices, considering its turbulent history.

The project employs various modeling techniques, including time series models and regression methods. These techniques aim to accurately forecast the closing price of Yes Bank's stock and account for significant events like fraud cases and regulatory interventions.

By achieving accurate predictions, this project provides valuable insights for stakeholders' investment decisions. It navigates the complexities and uncertainties surrounding Yes Bank's stock prices, ultimately aiding in better decision-making and understanding the bank's financial performance.

# **GitHub Link -**

https://github.com/Rakshanda19/Yes-Bank-Stock-Closing-Price-Prediction.git

# **Problem Statement**


This project centers on developing a robust predictive model for forecasting the closing price of Yes Bank's stock. The primary challenge involves comprehending the intricate dynamics of stock price trends, particularly the abrupt decline post-2018 after a period of growth. A significant hurdle lies in managing multicollinearity within the dataset, arising from interrelated independent variables. Tackling this issue is crucial to ensure the model's accuracy in prediction by adequately considering each variable's contribution.

In addition, the model should be adept at accounting for major events that have left an impact on Yes Bank's stock performance. Notable occurrences, such as the involvement of the bank's founders in fraud cases and regulatory interventions by the Reserve Bank of India, have the potential to sway stock prices significantly. A key objective is for the predictive model to adeptly capture and reflect the influence of such events with precision.

Furthermore, the project sets out to attain a level of predictive accuracy akin to the benchmark set by the K-Nearest Neighbors (KNN) Regression model, which achieved an impressive 99% accuracy. This high accuracy level is crucial for providing valuable insights to stakeholders, investors, and market participants. Armed with a reliable predictive tool, they can make well-informed decisions regarding investments in Yes Bank's stock, even amidst the intricate challenges posed by its price fluctuations. Ultimately, this endeavor strives to empower stakeholders with a dependable means of navigating the complexities of Yes Bank's stock performance landscape.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Import necessary libraries

# Import NumPy for numerical computations
import numpy as np

# Import Pandas for data manipulation and analysis
import pandas as pd

# Import Matplotlib for basic data visualization
import matplotlib.pyplot as plt

# Import Plotly express for interactive visualisations
import plotly.express as px

# Import Seaborn for advanced statistical visualizations
import seaborn as sns

# Import Plotly graph objects for interactive visualizations
import plotly.graph_objects as go

# Import the datetime module for working with dates and times
from datetime import datetime

# Import warnings module to control warning output
import warnings

# Import train_test_split for splitting data into training and test sets
from sklearn.model_selection import train_test_split

### Dataset Loading

In [None]:
# Load Dataset
#Nirvi
from google.colab import drive
drive.mount('/content/drive')
stock_df=pd.read_csv("/content/drive/MyDrive/data_YesBank_StockPrices.csv")


In [None]:
#Rakshanda
stock_df=pd.read_csv('/content/data_YesBank_StockPrices (1).csv')

### Dataset First View

In [None]:
# Dataset First Look
stock_df.head()

In [None]:
stock_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
stock_df.shape


The shape of the dataframe is : (185,5)

### Dataset Information

In [None]:
# Dataset Info
stock_df.info()







#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
stock_df.duplicated().sum()

The number of duplicate rows is : 0

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
stock_df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno

msno.matrix(stock_df)


### What did you know about your dataset?

From the above analysis, we can conclude that:

*  The dataset has 185 rows and 5 columns.
*  The Date column has the object datatype, which means its values are stored as strings. It needs to be converted to DateTime so that date-related operations, such as extracting the month or the year, can be performed on it.
*  The remaining 4 columns (Open, High, Low, and Close) hold floating-point values representing the opening, highest, lowest, and closing price of the stock. The dataset does not contain a volume column.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
stock_df.columns

In [None]:
# Dataset Describe
stock_df.describe(include="all")

### Variables Description

The dataset consists of monthly observations of Yes Bank stock prices since its listing on the stock exchange. The dataset includes the following features:

Date: This indicates the specific month for which the stock price is recorded.

Open: This represents the price of the stock at the start of the given month.

High: This indicates the highest price reached by the stock during the given month.

Low: This indicates the lowest price reached by the stock during the given month.

Close: This represents the price of the stock at the end of the given month.

The dataset provides a comprehensive overview of the monthly performance of Yes Bank stock, including the opening, highest, lowest, and closing prices for each month since its listing on the stock exchange.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for variable in stock_df.columns:
  print(f"The unique values for the '{variable}' variable are:\n\n {stock_df[variable].unique()}\n\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Save a copy of the original dataframe before making any changes to it.
stock1_df = stock_df.copy()


**Data type Correction**

The dataset is clean, with no duplicate or missing values, so no cleaning of that kind is needed. Outliers are examined later in this section.

However, the Date column's datatype is currently 'object' (string), which isn't suitable for date and time data, so we convert it to the datetime format.

The code below parses each string with datetime.strptime() using the format "%b-%y" (abbreviated month name and two-digit year). Calling pd.to_datetime() with the same format string would give the same result:

In [None]:
# convert string object to datetime object
stock_df['Date'] = stock_df['Date'].apply(lambda x: datetime.strptime(x, "%b-%y"))
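
As an alternative to the row-by-row `apply`, the same conversion can be sketched in vectorized form with `pd.to_datetime`. The frame `demo_df` and its sample strings are hypothetical stand-ins for the Date column, not rows from the project data.

```python
import pandas as pd

# Hypothetical sample in the same "Mon-YY" layout as the Date column
demo_df = pd.DataFrame({"Date": ["Jul-05", "Aug-05", "Sep-05"]})

# Vectorized parse; format="%b-%y" mirrors the strptime pattern used above
demo_df["Date"] = pd.to_datetime(demo_df["Date"], format="%b-%y")

print(demo_df.dtypes)
```

Passing an explicit `format` avoids relying on pandas to guess the layout, which matters for two-digit years.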



In [None]:
print(stock_df)

In [None]:
# setting Date column as index.
stock_df.set_index('Date',inplace=True)
stock_df.head()

In [None]:
#categorical columns
cat_columns=stock_df.select_dtypes(include='object').columns
print(f'categorical columns:{list(cat_columns)}')


In [None]:

#non-categorical columns
num_columns=stock_df.select_dtypes(exclude='object').columns
print('non-categorical columns:',list(num_columns))

In [None]:
dependent_variable = ['Close']
independent_variables = list(stock_df.columns[:-1])

In [None]:

fig = plt.figure(figsize =(10, 7))
boxplot = stock_df.boxplot(column=['Open','High','Low',"Close"],grid=False,notch=True)

plt.show()


The boxplot above flags outliers because the stock price fell from around 400 to about 20 within just a few months. Since the high prices were short-lived compared with the rest of the series, the top values appear as outliers.
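
To make the boxplot's notion of an outlier concrete, here is a minimal sketch of the 1.5 × IQR rule that matplotlib box plots use by default to draw points beyond the whiskers. The helper `iqr_outliers` and the `prices` series are hypothetical and only illustrate the rule; they are not part of the project code.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical prices: mostly low values with a short-lived peak
prices = pd.Series([12, 14, 15, 13, 16, 18, 20, 17, 15, 350, 380])
print(iqr_outliers(prices))
```

Only the two peak values are flagged here, which mirrors how a brief period of high prices shows up as outliers in the boxplot.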

### What all manipulations have you done and insights you found?

Our dataset has no null or duplicate values.
We made a copy of the data to preserve the original and converted the Date variable from the object datatype to datetime.
We set the Date column as the index to track variation in the stock price over time.
After this step, all remaining columns in the dataframe contain numerical data.


Furthermore, during the examination of the dataset, it is evident that outliers are present. These outliers are data points that significantly deviate from the majority of the data. Before proceeding with modeling or conducting further analysis, it is crucial to address these outliers. Dealing with outliers involves assessing their impact on the data and making decisions regarding appropriate actions, such as removing or transforming them. By addressing the outliers, we can enhance the robustness and reliability of our models and analyses.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**Univariate Analysis**

#### Chart - 1 - **Candle stick graph with price movement**

In [None]:
# Create a Figure object with Candlestick chart
import plotly.graph_objects as go

fig = go.Figure(go.Candlestick(
    x = stock_df.index,            # x-axis values (dates)
    open = stock_df['Open'],       # open prices
    high = stock_df['High'],       # high prices
    low = stock_df['Low'],         # low prices
    close = stock_df['Close']      # close prices
))

# Update the layout of the figure with a title
fig.update_layout(
    title={'text': 'Describing the Price Movements', 'x': 0.5, 'y': 0.95, 'font': {'color': 'white'}},
    xaxis=dict(title='Year', title_font={'color': 'white'}, tickfont={'color': 'white'}),
    yaxis=dict(title='Price', title_font={'color': 'white'}, tickfont={'color': 'white'}),
    width=800,
    height=800,
    plot_bgcolor='rgb(36, 40, 47)',  # Set the background color to a professional dark gray
    paper_bgcolor='rgb(51, 56, 66)'  # Set the paper color
)


# Show the figure
fig.show()



##### 1. Why did you pick the specific chart?

The Candlestick chart was chosen due to its effectiveness in displaying crucial price information. It represents open, high, low, and close prices, making it valuable for financial analysis. This chart is particularly useful for stocks, conveying market sentiment and trends. Each candlestick covers a time interval and its color and shape indicate price changes. High and low points show peak prices, while the body displays opening and closing prices. This helps identify patterns, trends, and potential reversals, aiding informed decisions on buying or selling assets. The larger size enhances visibility, allowing detailed analysis. Overall, the Candlestick chart is a powerful tool for understanding price dynamics in financial markets.

##### 2. What is/are the insight(s) found from the chart?

The analysis of Yes Bank's stock prices reveals two distinct phases. Until 2018, the stock rose consistently, reflecting investor optimism and positive market conditions. After 2018, it dropped sharply, mainly due to the fraud case involving former CEO Rana Kapoor.

The fraud case damaged investor trust and negatively affected the bank's reputation and stability, which led to the significant fall in the stock's value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights could potentially have a positive business impact. Understanding the influence of the Yes Bank fraud case on the stock prices helps the bank and investors comprehend the consequences of such events. This awareness can guide future decision-making, risk management, and communication strategies to rebuild trust and investor confidence.

The insights indeed lead to negative growth. The Yes Bank fraud case caused a sharp decline in stock prices. Increased scrutiny and regulatory interventions created uncertainty about the bank's future, leading investors to sell shares. This negative sentiment directly impacted stock prices and reflects the tangible impact of external events on financial performance.

#### Chart - 2- **Distribution of dependent variable Close Price of stock**

---



In [None]:
# Chart - 2 visualization code

# Set the figure size and title
plt.figure(figsize=(15, 9))
plt.suptitle('Overall Distribution of Each Variable')  # default text color stays visible on the white figure background

# Define the color list for each variable (using Yes Bank color scheme)
color_list = ['#003366', '#FF6600', '#99CC00', '#FFCC00']


for i, column in enumerate(stock_df.columns):
    # Create subplots
    ax1 = plt.subplot(2, 2, i + 1)
    ax2 = ax1.twinx()

    # Plot histogram
    sns.histplot(stock_df[column], color=color_list[i], ax=ax1)

    # Plot KDE curve
    sns.kdeplot(stock_df[column], color=color_list[i], ax=ax2)

    # Set gridlines
    ax1.grid(which='major', alpha=0.5)
    ax1.grid(which='minor', alpha=0.5)

    # Add vertical lines for mean (black) and median (red), visible on the default white background
    plt.axvline(stock_df[column].mean(), color='black', linestyle='dashed', linewidth=1.5)
    plt.axvline(stock_df[column].median(), color='red', linestyle='dashed', linewidth=1.5)
plt.show()


##### 1. Why did you pick the specific chart?

The above chart, a mix of histograms and KDE plots, effectively displays how data is distributed in the dataset. It shows center, spread, and shape of distributions, allowing easy comparison. Colors match Yes Bank branding. The chart helps explore skewness, multimodality, and outliers. It's a compact representation, combining histograms' frequency view and KDE's smooth curve. This cohesive chart aids pattern spotting and variable relationships, offering insights into the dataset's nature.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the open, high, low, and close distributions are all positively skewed: most of the data lies on the left side, with a tail on the right for larger values. Both the histograms and the KDE curves highlight this. In other words, most months have relatively low prices, while high prices occur in far fewer months. This skewness needs to be handled carefully in the analysis, possibly through a transformation or a technique that is less sensitive to skewed inputs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights about positively skewed distributions can positively impact decision-making and highlight buying opportunities. However, positive skewness alone doesn't guarantee negative growth. Negative growth depends on multiple factors beyond skewness, like trends and market conditions. Concluding negative growth solely based on skewness isn't justified. Further analysis is needed to understand potential negative impacts on business growth.






#### Chart - 3 - **Plotting target variable**

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(12,7))
stock_df['Close'].plot(color='blue')
plt.grid(which='major',linestyle='-',linewidth='0.5',color='black')
plt.grid(which='minor',linestyle='-',linewidth='0.5',color='black')


##### 1. Why did you pick the specific chart?

The specific chart, a line plot of the 'Close' data, was chosen to visualize the trend and fluctuations in the closing prices of the Yes Bank stock over time.



##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that the stock price increased until 2018, followed by a sharp decline after the Rana Kapoor fraud case.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insight about the stock price increasing until 2018 and then sharply declining due to the Rana Kapoor fraud case can be valuable for creating a positive business impact. It allows stakeholders and decision-makers to understand the significant events that affected the stock price and to adjust strategies accordingly.

The insight doesn't directly indicate negative growth, but it highlights a specific event (fraud case) that led to a decline. However, negative growth would require a comprehensive analysis considering various factors like market conditions, financial performance, and external influences beyond this specific event.


**Bivariate Analysis**

#### Chart - 4 - **Distribution of numerical features High, Low and Open price of stock.**

In [None]:
#Chart visualization
# List of independent features
numerical_features = list(set(stock_df.describe().columns)-{'Close'})
numerical_features

In [None]:

# Plotting distribution for each of the numerical features.
# sns.distplot is deprecated; histplot with kde=True draws the histogram plus KDE curve.
for col in numerical_features:
    plt.figure(figsize=(6,5))
    sns.histplot(stock_df[col], color='green', kde=True, stat='density')
    plt.title("Distribution", fontsize=16)
    plt.xlabel(col, fontsize=12)
    plt.ylabel('Density', fontsize=12)

plt.show()

- All numerical features appear to be right-skewed.

- We apply a log transformation to bring their distributions closer to normal.
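
The effect of the log transformation on skewness can also be checked numerically. The sketch below uses a hypothetical, synthetic right-skewed series (`prices`, drawn from a log-normal distribution) rather than the project data, and compares the sample skewness before and after `np.log10`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive "prices" (log-normal draw)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

skew_before = prices.skew()            # clearly positive for this draw
skew_after = np.log10(prices).skew()   # close to zero after the transform

print(f"skew before: {skew_before:.2f}, after log10: {skew_after:.2f}")
```

Note that a log transformation only applies to strictly positive values, which holds for stock prices.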

In [None]:
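# Applying log transformation
for col in numerical_features:
    plt.figure(figsize=(6,5))
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(np.log10(stock_df[col]), color='green', kde=True, stat='density')
    plt.title("Distribution after log10 transformation", fontsize=16)
    plt.xlabel(f"log10({col})", fontsize=12)
    plt.ylabel('Density', fontsize=12)
plt.show()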

##### 1. Why did you pick the specific chart?

We used distribution plots (a histogram with a KDE curve) to examine the shape of each independent feature ('Open', 'High', and 'Low'), both on the original scale and after a log10 transformation.

##### 2. What is/are the insight(s) found from the chart?

On the original scale, all three features are right-skewed. Plotting them again after a log10 transformation lets us check whether the transformed distributions are closer to normal, which guides how the features are preprocessed before modeling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially lead to a positive business impact by helping with data preprocessing and analysis, which can improve decision-making and model performance. For example, identifying outliers and understanding data distributions can lead to more accurate predictions and better risk management.

#### Chart - 5 **Scatter Plot to see the Best Fit line**

Best Fit Line:- A line of best fit is a straight line that is the best approximation of the given set of data. It is used to study the nature of the relation between two variables.

In [None]:
for col in numerical_features:
   fig = plt.figure(figsize=(9, 8))
   ax = fig.gca()
   feature = stock_df[col]
   label = stock_df['Close']
   correlation = feature.corr(label)
   plt.scatter(x=feature, y=label,marker="^",c="b",s = label*2)
   plt.xlabel(col)
   plt.ylabel('Close')
   ax.set_title(col + ' Vs. Close' + '         Correlation: ' + str(round(correlation,2)), fontsize=16)
   z = np.polyfit(stock_df[col], stock_df['Close'], 1)
   y_hat = np.poly1d(z)(stock_df[col])

   plt.plot(stock_df[col], y_hat, "r", lw=1)

plt.show()

##### 1. Why did you pick the specific chart?

Using scatter plots with a best fit line allows for visualizing the relationship between numerical features and the 'Close' price. The correlation coefficient quantifies the strength of the relationship. The best fit line provides an estimate of the trend and predictive power. The plot aids interpretation and communication of the relationship to stakeholders. Annotations, such as the correlation coefficient, provide valuable insights. Customization enhances clarity and aesthetics. The plots help identify potential predictors and support analysis and decision-making in stock market analysis.

##### 2. What is/are the insight(s) found from the chart?

After reviewing scatter plots with the best fit line, it's clear that all independent variables have a linear connection with the dependent variable, 'Close'. This means their changes go hand-in-hand predictably.

A linear relationship holds significance for analysis and modeling. It implies changes in independent variables relate proportionally to changes in 'Close'. This insight helps build regression models, predict outcomes, and understand the independent variables' influence on 'Close' price.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying linear relationships between independent variables and the dependent variable has significant business impact, especially in stock market analysis:

Prediction and Forecasting: Clear linear relationships allow regression models to predict future 'Close' prices, aiding investment forecasts.

Risk Assessment: Analyzing relationships helps assess risk from independent variable changes, supporting risk management.

Feature Selection: Recognizing influential variables guides selection for future analyses and model development.

Strategy Development: Linear relationships offer insights into stock price drivers, helping develop trading strategies.

Understanding these relationships enhances forecasting, risk management, and decision-making for more accurate and informed choices in financial markets.


#### Chart - 6 :- **Skewness in the Dataset**

**Data distribution with the mean and median of each variable**

In [None]:
numeric_features = stock_df.describe().columns
for col in numeric_features[0:4]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = stock_df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

##### 1. Why did you pick the specific chart?

 Histograms are versatile tools that help researchers, analysts, and decision-makers gain insights into data distributions, patterns, and characteristics. They are a fundamental part of exploratory data analysis and are often the first step in understanding a dataset before applying more advanced statistical techniques.






##### 2. What is/are the insight(s) found from the chart?

The charts display the data distribution of each variable in your dataset. The magenta dashed line represents the mean, while the cyan dashed line represents the median. When the mean is close to the peak of the histogram, it suggests a relatively symmetric distribution, but if it's distant from the peak, skewness may be present. The relationship between the mean and median indicates the direction of skewness: a mean to the right of the median suggests positive skew (right-skewed), while a mean to the left suggests negative skew (left-skewed). These visual cues offer valuable insights into the shape and skewness of each variable's data distribution.
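
The mean-versus-median rule of thumb described above can be written as a short check. This is only a rough heuristic sketch; the helper name `skew_direction` and the sample values are hypothetical.

```python
import pandas as pd

def skew_direction(series: pd.Series) -> str:
    """Classify skew from the mean/median ordering (a rough heuristic)."""
    mean, median = series.mean(), series.median()
    if mean > median:
        return "right-skewed"
    if mean < median:
        return "left-skewed"
    return "symmetric"

# Hypothetical values with a long right tail
values = pd.Series([10, 12, 13, 14, 15, 16, 90])
print(skew_direction(values))
```

A single large value pulls the mean well above the median, which is the same pattern the dashed lines show in the histograms.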

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In summary, the insights gained from data analysis can be a valuable tool for making informed decisions, optimizing operations, and identifying growth opportunities. However, it's crucial for businesses to interpret these insights accurately and take appropriate actions to leverage them for positive impact and avoid negative consequences. The insights themselves do not inherently cause negative growth but can indirectly contribute to it if not acted upon wisely.

#### Chart - 7 **Correlation between each independent variable and Close using scatter plots**

In [None]:
# Chart - 7 visualization code
fig=px.scatter(stock_df, x= 'High', y='Close', title= 'Relations between High and Close')
fig.update_layout (autosize=False, width=1000,height=500)
fig.show()

In [None]:
fig=px.scatter(stock_df, x= 'Open', y='Close', title= 'Relations between Open and Close')
fig.update_layout (autosize=False, width=1000,height=500)
fig.show()


In [None]:
fig=px.scatter(stock_df, x= 'Low', y='Close', title= 'Relations between Low and Close')
fig.update_layout (autosize=False, width=1000,height=500)
fig.show()


##### 1. Why did you pick the specific chart?

Scatter plots were chosen to visualize the relationship between each independent variable ('High', 'Open', and 'Low') and the 'Close' price of the Yes Bank stock.



##### 2. What is/are the insight(s) found from the chart?

The insights reveal strong correlations between all independent variables and the dependent variable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insight about strong correlations between independent variables and the dependent variable can potentially create a positive business impact. It suggests that changes in independent variables are associated with changes in the dependent variable. This understanding can aid in making informed decisions, optimizing strategies, and enhancing predictive models.

However, these insights don't directly lead to negative growth. High correlations indicate relationships, but not necessarily causation or direction. Negative growth would require comprehensive analysis considering external factors and trends beyond correlations. So, while correlations offer valuable insights, they don't inherently predict negative growth.



#### Chart - 8 **The relationship between dependent & independent variables**

In [None]:
# Chart - 8 visualization code
fig=px.scatter(stock_df, x= 'High', y='Close', title= 'Relations between High and Close')
fig.update_layout (autosize=False, width=1000,height=500)
fig.show()

##### 1. Why did you pick the specific chart?

We used scatter plots to showcase the relationship between the dependent and independent variables. They help visualize how changes in the independent variables relate to the dependent variable. Each point on the scatter plot represents one observation with its independent and dependent variable values. The pattern of points reveals trends, correlations, or potential nonlinear relationships between the variables. This aids in feature selection, understanding the data distribution, and making informed decisions when building predictive models.






##### 2. What is/are the insight(s) found from the chart?

The presence of high correlations between the independent variables in our dataset indicates the potential for multicollinearity. Multicollinearity can adversely affect model fitting and the stability of the estimated coefficients, as even slight changes in one independent variable can lead to unpredictable results. To assess the extent of multicollinearity in our dataset, we can calculate the Variance Inflation Factor (VIF). By analyzing the VIF values, we can determine which variables should be retained in our analysis and prediction model and identify variables that may need to be removed from the dataset to mitigate multicollinearity. This evaluation helps ensure the robustness and reliability of our models and supports accurate predictions and interpretations of the relationships between variables.
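
As a rough illustration of the VIF idea, the sketch below computes VIF_j = 1 / (1 - R_j²) by regressing each column on the remaining ones with NumPy least squares. The helper `vif_table` and the synthetic frame `demo` are hypothetical; the column names only imitate the dataset, and the values are randomly generated.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    vifs = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Hypothetical data: 'High' and 'Low' track 'Open' closely, 'Noise' does not
rng = np.random.default_rng(1)
open_ = rng.uniform(50, 300, size=200)
demo = pd.DataFrame({
    "Open": open_,
    "High": open_ * 1.05 + rng.normal(0, 2, 200),
    "Low": open_ * 0.95 + rng.normal(0, 2, 200),
    "Noise": rng.normal(0, 1, 200),
})
print(vif_table(demo).round(1))
```

In this synthetic example the three price-like columns get very large VIF values while the unrelated column stays near 1. The statsmodels library also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.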

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The impact of multicollinearity in a business context can be significant. Here are a few business implications:

Model Reliability: Multicollinearity affects predictive model reliability by making it hard to isolate individual variable impacts, leading to less trustworthy predictions.

Interpretation of Results: It complicates regression coefficient interpretation, hindering identification of key drivers for informed decisions.

Overfitting and Generalization: Multicollinearity raises overfitting risk, causing models to struggle with new data, potentially leading to flawed strategies.

Resource Allocation: Highly correlated variables might inefficiently use resources. Identifying these helps optimize allocation for relevant predictors.

Risk Assessment: Models with multicollinearity can misguide decisions. Awareness is vital for proper risk assessment and mitigation strategies.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(13,9))
cor= sns.heatmap(stock_df.corr(), annot=True)

##### 1. Why did you pick the specific chart?

The correlation heatmap is an effective way to visualize the pairwise correlations between numerical variables in a dataset. It uses color coding to represent the strength and direction of the correlations, making it easier to identify patterns and relationships. By using a heatmap, it allows for a quick and intuitive understanding of the correlation structure of the variables.

##### 2. What is/are the insight(s) found from the chart?

Every feature is extremely correlated with every other feature, so using just one feature, or an average of these features, would suffice for our regression model, since linear regression assumes little or no multicollinearity among the features.
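
The averaging idea mentioned above could look like the following sketch. The prices are invented for illustration and the column name `Mean_OHL` is a hypothetical choice, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical prices, invented only to show the mechanics
demo_df = pd.DataFrame({
    'Open':  [10.0, 12.0, 15.0],
    'High':  [13.0, 15.0, 18.0],
    'Low':   [9.0, 11.0, 13.0],
    'Close': [12.0, 14.0, 16.0],
})

# Collapse the three correlated predictors into a single averaged feature
demo_df['Mean_OHL'] = demo_df[['Open', 'High', 'Low']].mean(axis=1)
print(demo_df[['Mean_OHL', 'Close']])
```

A single averaged predictor removes the collinearity by construction, at the cost of discarding whatever distinct information the individual columns carry.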

#### Chart - 10 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(stock_df)

##### 1. Why did you pick the specific chart?

The pair plot is suitable for visualizing the relationships between multiple variables in a dataset. It creates a grid of scatter plots covering all possible variable combinations, making it easier to identify patterns, trends, and potential outliers. This gives a more comprehensive view of the pairwise relationships than a single scatter plot and helps show how the variables interact with each other.

##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals strong correlations among the Open, High, Low, and Close variables, indicating that they move closely together in Yes Bank's stock. The predictors Open, High, and Low are also tightly correlated with one another, suggesting synchronized trends. These correlations are valuable for prediction and decision-making, but correlation does not imply causation, and an in-depth analysis should consider additional factors.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#Check for missing values
print(stock_df.isnull().sum())

#Drop rows with missing values
stock_df.dropna(inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Dropping Rows with Missing Values:** This technique removes rows that contain missing values. It is suitable when the missing values are random and do not significantly impact the overall dataset. Dropping rows can be appropriate when the missing data is relatively small compared to the available dataset.

### 2. Handling Outliers

In [None]:
# Create a figure with a size of 10x6 inches
fig = plt.figure(figsize=(10, 6))



# Add a super title to the plot
plt.suptitle('Studying the Outliers after Log Transformation', color='black', fontsize=16)

# Define a list of colors for the boxplots
color_list = ['blue', 'green', 'red', 'grey']

# Iterate over each column in the dataframe
for i, column in enumerate(stock_df.columns):
    # Create subplots for each column
    plt.subplot(2, 2, i + 1)

    # Apply a log transformation to the column and create a boxplot
    sns.boxplot(x=np.log10(stock_df[column]), color=color_list[i])



    # Add a title to each subplot
    plt.title(column, color='Black')

# Adjust the layout of the subplots
plt.tight_layout()

plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

The log transformation was applied as a treatment for outliers. This approach not only addresses outliers but also helps to alleviate skewness in the features' distribution. By using log transformation, two problems - outlier treatment and skewness correction - are tackled simultaneously, providing a consolidated solution. This technique aids in normalizing the data and improving the suitability of the features for analysis and modeling purposes.
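
To illustrate the effect described above, here is a small sketch that counts boxplot-style outliers (points beyond 1.5 times the IQR from the quartiles) before and after a log10 transformation. The data are synthetic, right-skewed values rather than the Yes Bank prices, and `count_iqr_outliers` is a hypothetical helper.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points outside 1.5 * IQR of the quartiles (the usual boxplot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Synthetic right-skewed, strictly positive data (not the Yes Bank prices)
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

outliers_raw = count_iqr_outliers(skewed)
outliers_log = count_iqr_outliers(np.log10(skewed))
print("Outliers on raw scale:", outliers_raw)
print("Outliers on log10 scale:", outliers_log)
```

Note that the transformation does not remove any observations; it compresses large values so that fewer of them fall outside the boxplot whiskers.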

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

In our dataset, the columns 'Open', 'High', 'Low', and 'Close' are all numerical. The 'Date' column is stored as an object type, but it represents a date rather than a category, so there is no need for categorical encoding in this particular dataset.

Categorical encoding is typically required when you have categorical variables that need to be converted into numerical representations for analysis or machine learning tasks. In our case, the columns used for modeling ('Open', 'High', 'Low', 'Close') are numerical, representing different aspects of the stock price.

### **4. Feature Manipulation & Selection**

#### **1. Feature Manipulation**

In [None]:
# Manipulate Features to minimize feature correlation and create new features
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create an empty dataframe to store the VIF for each feature
vif_df = pd.DataFrame()

# Select the independent variables (every column except the last one)
features = stock_df.iloc[:, :-1]

# Assign the feature names to the 'Features' column
vif_df['Features'] = features.columns.tolist()

# Calculate the VIF for each feature and store it in the 'VIF' column
vif_df['VIF'] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]

# Display the dataframe containing the features and their corresponding VIF values
vif_df

The VIF values for all the features indicate high multicollinearity. However, considering the small size of the dataset and having only three numerical independent variables, there is limited potential for feature manipulation that could be beneficial. With the absence of categorical variables, the scope for feature engineering or transformation is constrained. Therefore, the focus should be on alternative modeling approaches or additional data collection to address the issue of multicollinearity.

#### **2. Feature Selection**

**Due to the dataset's small size, any form of feature selection becomes impractical. Given the limited number of observations, attempting to reduce the feature space may lead to unreliable or biased results. Therefore, it is advisable to retain all available features for analysis or modeling purposes.**

### **5. Data Transformation**

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

To address the skewed distribution of the features, a data transformation is necessary to approximate a normal distribution. In this case, a log transformation will be applied. This transformation aims to reduce skewness and make the data more symmetrical. Furthermore, as observed earlier, the log transformation also aids in handling outliers. By employing this transformation, we can simultaneously improve the normality of the data distribution and mitigate the impact of outliers.

In [None]:
# Iterate over each column in the dataframe
for column in stock_df.columns:
    # Apply a log transformation to the column using np.log10()
    stock_df[column] = np.log10(stock_df[column])

In [None]:
# Create a figure with a size of 10x8 inches
plt.figure(figsize=(10,8))




# Add a super title to the plot
plt.suptitle('Overall Distribution of Each Variable after Log Transformation', color='black', fontsize=16)

color_list = ['blue', 'green', 'red', 'purple']

for i, column in enumerate(stock_df.columns):
    plt.subplot(2, 2, i + 1)
    ax1 = plt.gca()
    sns.histplot(stock_df[column], color=color_list[i], ax=ax1)
    ax2 = ax1.twinx()
    sns.kdeplot(stock_df[column], color=color_list[i], ax=ax2)  # Overlapping the KDE plot on the histogram.



    # Add gridlines
    plt.grid(which='major', alpha=0.5)
    plt.grid(which='minor', alpha=0.5)

    # Add dashed lines for mean and median
    plt.axvline(stock_df[column].mean(), color='purple', linestyle='dashed', linewidth=2.5)
    plt.axvline(stock_df[column].median(), color='orange', linestyle='dashed', linewidth=1.5)

plt.tight_layout()

plt.show()

The mean (indicated by the purple vertical line) and the median (represented by the orange vertical line) are nearly equal for each feature. This alignment suggests that the log transformation successfully reduced the skewness and brought the data closer to symmetry. The convergence of the mean and median highlights the relative balance in the distribution, indicating a more representative central tendency. Overall, these observations indicate an improved approximation to a normal distribution after the log transformation.
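
The visual check above can also be expressed numerically. The following sketch, on synthetic right-skewed data (not the project's prices), compares the sample skewness and the standardized gap between mean and median before and after a log10 transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed series standing in for a price column
rng = np.random.default_rng(7)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.9, size=500))
log_prices = np.log10(prices)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skew_raw = prices.skew()
skew_log = log_prices.skew()

# Gap between mean and median, expressed in standard deviations
gap_raw = abs(prices.mean() - prices.median()) / prices.std()
gap_log = abs(log_prices.mean() - log_prices.median()) / log_prices.std()

print(f"Skewness: raw={skew_raw:.3f}, log10={skew_log:.3f}")
print(f"Mean-median gap (in std units): raw={gap_raw:.3f}, log10={gap_log:.3f}")
```

A skewness closer to zero and a smaller mean-median gap after the transformation are consistent with the more symmetric histograms.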





### **6. Dimensionality Reduction**

##### Do you think that dimensionality reduction is needed? Explain Why?

Since the dataset is already small in size, there is no need for dimensionality reduction techniques. With a limited number of observations, attempting to reduce the number of features may not provide significant benefits and could potentially lead to loss of valuable information. Therefore, it is advisable to retain all the available features for analysis or modeling purposes without applying dimensionality reduction methods.

### **7. Data Splitting**

In [None]:
from sklearn.model_selection import train_test_split

# Assign the independent and dependent variables to X and y, respectively
X = stock_df[independent_variables]
y = stock_df[dependent_variable]

# Split the data into training and testing datasets using a test size of 0.2 (20%)
# Set random_state to 0 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


##### What data splitting ratio have you used and why?

To train the model effectively, an 80:20 split ratio is being employed, allocating 80% of the data for training and 20% for testing. However, considering the small dataset size, it may be beneficial to acquire more data for training purposes. Increasing the training data size helps improve the model's ability to learn and generalize from the patterns present in the data. Gathering additional data can enhance the model's performance, reduce the risk of overfitting, and provide a more comprehensive representation of the underlying relationships within the dataset.
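
Because `train_test_split` shuffles rows by default, the test set in this notebook is a random sample of months. For time-ordered data, a chronological hold-out is a common alternative. The sketch below shows that variant on synthetic monthly data; it is not the split used in this notebook, and it assumes the rows are already sorted by date.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data, sorted by date (stand-in for the real dataset)
dates = pd.date_range("2015-01-01", periods=20, freq="MS")
demo = pd.DataFrame(
    {"Open": np.arange(20, dtype=float), "Close": np.arange(20, dtype=float) + 0.5},
    index=dates,
)

# Keep the first 80% of rows for training and the last 20% for testing
split_at = int(len(demo) * 0.8)
train, test = demo.iloc[:split_at], demo.iloc[split_at:]

print(len(train), len(test))
print("Training ends before testing starts:", train.index.max() < test.index.min())
```

With this layout, the model is always evaluated on months that come after everything it was trained on, which is closer to how a forecast would be used.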

### **8. Data Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
scaler = StandardScaler()

# Scale the training data (X_train) using fit_transform
X_train = scaler.fit_transform(X_train)

# Scale the testing data (X_test) using transform
X_test = scaler.transform(X_test)


In [None]:
# Checking the training dataset
X_train[0: 10]

In [None]:
# Checking the test dataset
X_test[0: 10]

##### Which method have you used to scale your data and why?

We used StandardScaler to preprocess the data before fitting the regression models. StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1, which puts all features on a common scale. Linear regression does not require normally distributed features, and standardization does not change the shape of a feature's distribution; the benefit is comparability across features. This matters most for the regularized models used later (Lasso and Ridge), whose penalties depend on coefficient magnitudes and therefore on feature scale. The scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into preprocessing.
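
The standardization described above can be sketched with plain NumPy. The arrays are hypothetical values chosen for easy arithmetic, and the manual formula mirrors what `StandardScaler` computes with its default settings (per-column mean and population standard deviation).

```python
import numpy as np

# Synthetic train/test feature matrices (hypothetical values)
X_tr = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0], [40.0, 800.0]])
X_te = np.array([[25.0, 500.0], [50.0, 1000.0]])

# Learn the statistics from the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training statistics to both sets; the test set is never refit
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print("Train means:", X_tr_scaled.mean(axis=0))
print("Train stds: ", X_tr_scaled.std(axis=0))
print("Scaled test rows:\n", X_te_scaled)
```

The scaled test set will generally not have mean 0 and standard deviation 1, which is expected: it is expressed in units of the training distribution.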

## ***7. ML Model Implementation***

### **ML Model 1 - Linear Regression**

Linear Regression is a powerful machine learning algorithm that falls under the category of supervised learning. It is specifically designed for regression tasks, where the goal is to predict a continuous target variable based on independent variables. In regression analysis, the algorithm establishes a relationship between the predictor variables and the target variable to make accurate predictions.

The primary objective of Linear Regression is to identify and quantify the relationship between variables. By examining the patterns and trends in the data, the algorithm enables us to understand how changes in one variable affect the target variable. This understanding is crucial for making informed decisions and forecasting future outcomes.

Linear Regression is widely employed in various domains, including finance, economics, social sciences, and engineering. It finds applications in areas such as sales forecasting, housing price prediction, demand estimation, and trend analysis. By leveraging the insights gained from analyzing the relationship between variables, Linear Regression empowers us to make reliable forecasts and make informed business decisions.

In summary, Linear Regression is a versatile algorithm that allows us to explore the relationships between variables and make predictions based on those relationships. Its ability to model the dependencies between variables makes it a valuable tool for understanding data and making accurate forecasts in numerous fields.

In [None]:
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression model
linear_reg = LinearRegression()

# Fit the Linear Regression model to the training data
linear_reg.fit(X_train, y_train)

In [None]:
# Predict on the model
y_pred_lin = linear_reg.predict(X_test)

In [None]:
# Checking the model parameters
print("Coefficients:", linear_reg.coef_)
print("Intercept:", linear_reg.intercept_)


In [None]:
plt.figure(figsize=(8, 4))


# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Linear Regression model
plt.plot(10**y_pred_lin, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot (black so it is visible on the default white background)
plt.title("Linear Regression", color='black')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()


#### **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

Linear Regression aims to establish a linear connection between the independent and dependent variables by minimizing the sum of squared differences between the observed and predicted dependent values. It assumes a linear relationship and calculates the best-fitting line by adjusting the model's coefficients. The objective is to minimize the overall distance between the observed data points and the line of best fit. This approach enables the model to capture the underlying linear pattern and make predictions based on the learned relationship between the variables.
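
As a minimal illustration of the least-squares idea, the sketch below fits coefficients with `np.linalg.lstsq` on synthetic, noise-free data generated from a known linear rule. The data and variable names are assumptions made for this example only.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# Add a column of ones so the intercept is estimated together with the slopes
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: minimizes the sum of squared residuals ||A @ b - y||^2
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print("Intercept and slopes:", beta)
```

Because the data contain no noise, the recovered intercept and slopes match the generating rule up to floating-point error. With real data, the same procedure returns the coefficients that minimize the squared error.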

In [None]:
# importing libraries
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Calculate the Mean Squared Error (MSE)
mse_lin = round(mean_squared_error(10**y_test, 10**y_pred_lin), 4)

# Calculate the Root Mean Squared Error (RMSE)
rmse_lin = round(np.sqrt(mse_lin), 4)

# Calculate the Mean Absolute Error (MAE)
mae_lin = round(mean_absolute_error(10**y_test, 10**y_pred_lin), 4)

# Calculate the R-squared Score (R2)
r2_lin = round(r2_score(10**y_test, 10**y_pred_lin), 4)

# Calculate the Adjusted R-squared Score (Adjusted R2)
adj_r2_lin = round(1 - (1 - r2_lin) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1)), 4)


In [None]:
# Create a dataframe to store the evaluation metrics
evametdf_lin = pd.DataFrame()

# Set the 'Metrics' column in the dataframe
evametdf_lin['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the 'Linear Regression' column in the dataframe with the corresponding metric values
evametdf_lin['Linear Regression'] = [mse_lin, rmse_lin, mae_lin, r2_lin, adj_r2_lin]

# Display the dataframe
evametdf_lin

The evaluation metrics for the Linear Regression model are as follows:

Mean Squared Error (MSE): The MSE value is 70.4204, indicating the average squared difference between the actual and predicted Close prices. Lower values indicate better model performance, as they represent a smaller overall prediction error.

Root Mean Squared Error (RMSE): The RMSE value is 8.3917, which is the square root of the MSE. It provides a measure of the average difference between the actual and predicted Close prices in the original scale. Again, a lower value signifies better predictive accuracy.

Mean Absolute Error (MAE): The MAE value is 4.8168, representing the average absolute difference between the actual and predicted Close prices. Similar to MSE and RMSE, a smaller MAE indicates better model performance.

R-2 Score: The R-2 score is 0.9937, reflecting the proportion of variance in the dependent variable (Close prices) explained by the independent variables. A score closer to 1 indicates a better fit of the model to the data.

Adjusted R-2 Score: The adjusted R-2 score is 0.9931, which considers the number of independent variables and sample size when assessing the model's goodness of fit. This adjustment helps mitigate potential overfitting issues and provides a more reliable measure of model performance.

These evaluation metrics collectively demonstrate that the Linear Regression model performs well in predicting the Close prices, with low errors, a high R-2 score, and a relatively stable adjusted R-2 score.
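
For reference, the five metrics above can be computed by hand as in the following sketch. The actual and predicted values are made up, and the number of predictors `p` is an assumed value for the adjusted R-2 formula.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up actual values
y_hat = np.array([2.0, 5.0, 8.0, 9.0])   # made-up predictions
n, p = len(y_true), 1                    # p = number of predictors (assumed)

mse = np.mean((y_true - y_hat) ** 2)           # average squared error
rmse = np.sqrt(mse)                            # same units as the target
mae = np.mean(np.abs(y_true - y_hat))          # average absolute error

ss_res = np.sum((y_true - y_hat) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total variation
r2 = 1 - ss_res / ss_tot

# Adjusted R-2 penalizes the score for the number of predictors used
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mse, rmse, mae, r2, adj_r2)
```

The adjusted R-2 is always at most the plain R-2, and the gap grows as predictors are added relative to the number of observations.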

### **ML Model 2 - Lasso Regression**

Lasso regression is a penalized (L1-regularized) form of linear regression that is commonly used for variable selection. By shrinking the coefficients of less relevant variables, in some cases to exactly zero, it can improve interpretability and, when some predictors are redundant, prediction accuracy. This regularization technique plays a central role in feature selection and can contribute to a more accurate and interpretable model.

In [None]:

# ML Model - 2 Implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.01)

# Fit the Algorithm
lasso.fit(X_train, y_train)

In [None]:

# Predict on the model
y_pred_lasso = lasso.predict(X_test)


# Print the coefficients of the Lasso model
print("Coefficients:", lasso.coef_)

# Print the intercept of the Lasso model
print("Intercept:", lasso.intercept_)


In [None]:
plt.figure(figsize=(8, 4))


# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Lasso Regression model
plt.plot(10**y_pred_lasso, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot (black so it is visible on the default white background)
plt.title("Lasso Regression", color='black')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()

#### **1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

Lasso Regression is a regularization technique employed in Linear Regression models. It incorporates a penalty term into the loss function that is based on the sum of the absolute values of the coefficients. This penalty term encourages sparsity in the model by driving some coefficients to exactly zero. As a result, Lasso Regression not only reduces the magnitudes of the coefficients but can also eliminate some features from the model by setting their corresponding coefficients to zero.

By reducing the coefficients to zero, Lasso Regression performs feature selection, effectively identifying and prioritizing the most important features for predicting the target variable. This characteristic makes Lasso Regression particularly useful when dealing with high-dimensional datasets where feature reduction is desired.

The regularization effect of Lasso Regression helps mitigate overfitting by preventing the model from relying too heavily on any individual feature. It encourages a more parsimonious model representation, improving its generalizability to unseen data. The capability of Lasso Regression to shrink coefficients towards zero and perform feature selection makes it a valuable tool for both improving model interpretability and enhancing prediction accuracy.
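
The way an L1 penalty produces exact zeros can be seen in the soft-thresholding operator, which is the building block of the coordinate-descent updates commonly used to fit Lasso models. The sketch below applies it to hypothetical coefficient values; the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within the threshold become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

unpenalized = np.array([3.0, -0.2, 0.05, -1.5])  # hypothetical coefficients
shrunk = soft_threshold(unpenalized, 0.5)

print("Before:", unpenalized)
print("After: ", shrunk)
print("Non-zero coefficients kept:", np.count_nonzero(shrunk))
```

Large coefficients are reduced by the threshold, while small ones are set to zero outright. This is why Lasso can drop features entirely, whereas a squared (L2) penalty only shrinks them.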

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Mean Squared Error
mse_lasso = round( mean_squared_error((10**y_test), 10**(y_pred_lasso)), 4)

# Root Mean Squared Error
rmse_lasso = round(np.sqrt(mse_lasso), 4)

# Mean Absolute Error
mae_lasso = round(mean_absolute_error((10**y_test), 10**(y_pred_lasso)), 4)

# R-2 Score
r2_lasso = round(r2_score((10**y_test), (10**y_pred_lasso)), 4)

# Adjusted R-2 Score
adj_r2_lasso = round(1 - (1 - r2_lasso)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)

In [None]:
# Create a dataframe to store the evaluation metrics for Lasso Regression
evametdf_lasso = pd.DataFrame()

# Set the 'Metrics' column in the dataframe
evametdf_lasso['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the 'Lasso Regression' column in the dataframe with the corresponding metric values
evametdf_lasso['Lasso Regression'] = [mse_lasso, rmse_lasso, mae_lasso, r2_lasso, adj_r2_lasso]

# Display the dataframe
evametdf_lasso

#### **2. Cross-Validation & Hyperparameter Tuning**

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Lasso Regression
lasso_param_grid = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 1, 10, 100, 1000]}

# Perform GridSearchCV with Lasso Regression
lasso_gscv = GridSearchCV(lasso, param_grid=lasso_param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the Lasso Regression model with GridSearchCV
lasso_gscv.fit(X_train, y_train)

In [None]:
# Finding the best parameter value
print("The best value of 'alpha' would be:", lasso_gscv.best_params_)

In [None]:
# Print the coefficients of the best estimator from GridSearchCV
print("Coefficients:", lasso_gscv.best_estimator_.coef_)

# Print the intercept of the best estimator from GridSearchCV
print("Intercept:", lasso_gscv.best_estimator_.intercept_)

In [None]:
# Predict on the model
y_pred_lasso_gscv = lasso_gscv.predict(X_test)

In [None]:
plt.figure(figsize=(8, 4))



# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Lasso Regression model with GridSearchCV
plt.plot(10**y_pred_lasso_gscv, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot (black so it is visible on the default white background)
plt.title("Lasso Regression with GridSearchCV", color='black')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()

In [None]:
# Mean Squared Error
mse_lasso_gscv = round( mean_squared_error((10**y_test), 10**(y_pred_lasso_gscv)), 4)

# Root Mean Squared Error
rmse_lasso_gscv = round(np.sqrt(mse_lasso_gscv), 4)

# Mean Absolute Error
mae_lasso_gscv = round(mean_absolute_error((10**y_test), 10**(y_pred_lasso_gscv)), 4)

# R-2 Score
r2_lasso_gscv = round(r2_score((10**y_test), (10**y_pred_lasso_gscv)), 4)

# Adjusted R-2 Score
adj_r2_lasso_gscv = round(1 - (1 - r2_lasso_gscv)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)


In [None]:
# Create a dataframe to store the evaluation metrics for Lasso Regression with GridSearchCV
evametdf_lasso_gscv = pd.DataFrame()

# Add the column "Metrics" to the dataframe
evametdf_lasso_gscv['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Add the column "Lasso Regression with GridSearchCV" to the dataframe with the corresponding evaluation metric values
evametdf_lasso_gscv['Lasso Regression with GridSearchCV'] = [mse_lasso_gscv, rmse_lasso_gscv, mae_lasso_gscv, r2_lasso_gscv, adj_r2_lasso_gscv]

# Display the dataframe
evametdf_lasso_gscv


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used with a smaller set of hyperparameters to find the best combination of hyperparameter values for Lasso Regression. The hyperparameter grid specified a range of alpha values. By narrowing down the set of hyperparameters, the search space was reduced, making the grid search more efficient. GridSearchCV then performed cross-validation to evaluate the performance of each combination of hyperparameters based on the negative mean squared error. The best set of hyperparameters was determined based on the highest cross-validated score, resulting in the optimal regularization strength for Lasso Regression. This approach allowed for an effective and efficient search for the optimal hyperparameters and minimized the mean squared error.
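
The search procedure described above can be reproduced in miniature as follows. This is a sketch on synthetic data with an illustrative alpha grid, not the notebook's dataset or its exact grid.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the Yes Bank dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Candidate regularization strengths (illustrative grid)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]}

# 3-fold cross-validation, scored by negative mean squared error
search = GridSearchCV(Lasso(), param_grid=param_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

# Scores are negated MSE values, so the highest score is the smallest MSE
print("Best alpha:", search.best_params_['alpha'])
print("Cross-validated MSE at best alpha:", -search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on the full training data with the selected alpha, which is what `search.predict` uses.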

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Create a dataframe to store the comparison of evaluation metrics for Lasso Regression and Lasso Regression with GridSearchCV
lasso_comp_df = pd.concat([evametdf_lasso, evametdf_lasso_gscv.iloc[:, 1]], axis=1)

# Display the dataframe
lasso_comp_df

Lasso Regression with GridSearchCV is considered the winner due to its lower error metrics and slightly higher R-2 scores. The lower mean squared error, root mean squared error, and mean absolute error indicate improved accuracy and better predictive performance compared to Lasso Regression without GridSearchCV. Additionally, the slightly higher R-2 score suggests that Lasso Regression with GridSearchCV captures a greater amount of variance in the target variable and provides a better fit to the data. Overall, these evaluation metrics demonstrate that Lasso Regression with GridSearchCV outperforms Lasso Regression without GridSearchCV in terms of predictive accuracy and model fit.

### **ML Model 3 - Ridge Regression**

Ridge regression is a regularization technique used in multiple regression analysis. While it may seem daunting at first, a solid understanding of multiple regression provides a foundation for understanding how Ridge regression works.

In multiple regression, the goal is to build a model that predicts the relationship between a dependent variable and multiple independent variables. This is done by estimating the coefficients of the independent variables that minimize the difference between the predicted and actual values of the dependent variable. The traditional least squares method is commonly used to estimate these coefficients.

Ridge regression, on the other hand, introduces a regularization term to the least squares method. This regularization term, known as the Ridge penalty or L2 regularization, adds a constraint to the coefficient estimation process. The purpose of this constraint is to prevent overfitting and improve the model's generalization ability.

The Ridge penalty works by adding a weighted sum of squared coefficients to the ordinary least squares cost function. This sum penalizes larger coefficient values, encouraging them to be smaller. Consequently, Ridge regression tends to shrink the coefficient estimates towards zero, while still allowing them to have non-zero values. This shrinkage effect helps mitigate the impact of multicollinearity, a situation where the independent variables are highly correlated with each other.

Implementing Ridge regression involves specifying a tuning parameter, often denoted as lambda (scikit-learn's `Ridge` calls it `alpha`). This parameter controls the amount of regularization applied to the model. A larger lambda value results in stronger regularization, leading to smaller coefficient estimates. Conversely, a smaller lambda value reduces the regularization effect, allowing the coefficients to approach the values obtained from ordinary least squares regression.

By understanding the fundamentals of multiple regression, researchers can grasp the underlying principles of Ridge regression. This regularization technique offers a valuable tool for handling multicollinearity and improving the generalization performance of multiple regression models.
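
To make the shrinkage effect concrete, the sketch below solves the Ridge problem in closed form, b = (X^T X + lambda * I)^(-1) X^T y, on synthetic data with two nearly identical predictors. The helper `ridge_closed_form` is hypothetical and omits the intercept for simplicity.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y; no intercept, for simplicity."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data with two highly correlated predictors
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + rng.normal(scale=0.1, size=100)

beta_ols = ridge_closed_form(X_demo, y_demo, lam=0.0)    # ordinary least squares
beta_ridge = ridge_closed_form(X_demo, y_demo, lam=1.0)  # with L2 penalty

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With collinear predictors, the unpenalized coefficients can be large and of opposite sign even though only their sum is well determined. The penalty pulls them toward similar, smaller values while leaving the combined effect nearly unchanged.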

In [None]:
# ML Model - 3 Implementation
# Import the Ridge regression model from scikit-learn
from sklearn.linear_model import Ridge

# Create an instance of the Ridge regression model
ridge = Ridge()

# Fit the Ridge regression model to the training data
ridge.fit(X_train, y_train)

# Predict on the model
y_pred_ridge = ridge.predict(X_test)

# Print the coefficients of the Ridge regression model
print("Coefficients:", ridge.coef_)

# Print the intercept of the Ridge regression model
print("Intercept:", ridge.intercept_)


In [None]:
plt.figure(figsize=(8, 4))



# Plot the actual Close prices from the test data in blue
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Ridge regression model in red
plt.plot(10**y_pred_ridge, color='red')

# Set the label for the y-axis as "Close Price"
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot as "Ridge Regression" with black color
plt.title("Ridge Regression", color='black')

# Add grid lines to the plot
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

# Display the plot
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Ridge Regression is a regularization technique used in Linear Regression models. It introduces a penalty term to the loss function, which is the sum of squared values of the coefficients. This penalty term helps control the magnitude of the coefficients, limiting their impact on the model and reducing the chances of overfitting. By adding this penalty term, Ridge Regression encourages a balance between fitting the training data well and maintaining generalization to unseen data. It is an effective approach to handle multicollinearity and stabilize the model's performance.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate the Mean Squared Error (MSE)
mse_ridge = round(mean_squared_error(10**y_test, 10**y_pred_ridge), 4)

# Calculate the Root Mean Squared Error (RMSE)
rmse_ridge = round(np.sqrt(mse_ridge), 4)

# Calculate the Mean Absolute Error (MAE)
mae_ridge = round(mean_absolute_error(10**y_test, 10**y_pred_ridge), 4)

# Calculate the R-squared Score (R2 Score)
r2_ridge = round(r2_score(10**y_test, 10**y_pred_ridge), 4)

# Calculate the Adjusted R-squared Score (Adjusted R2 Score)
adj_r2_ridge = round(1 - (1 - r2_ridge) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1)), 4)



In [None]:
# Create a dataframe to store the evaluation metrics
evametdf_ridge = pd.DataFrame()

# Set the metrics as a column in the dataframe
evametdf_ridge['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the corresponding values for Ridge Regression in the dataframe
evametdf_ridge['Ridge Regression'] = [mse_ridge, rmse_ridge, mae_ridge, r2_ridge, adj_r2_ridge]

evametdf_ridge

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the hyperparameter grid for Ridge regression
ridge_param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}

# Create an instance of the Ridge regression model
ridge = Ridge()

# Create an instance of GridSearchCV with the Ridge regression model,
# the hyperparameter grid, scoring metric, and cross-validation settings
ridge_gscv = GridSearchCV(ridge, param_grid=ridge_param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the GridSearchCV instance to the training data
ridge_gscv.fit(X_train, y_train)


In [None]:
# Predict on the model
y_pred_ridge_gscv = ridge_gscv.predict(X_test)

In [None]:
# Finding the best parameter value
print("The best value of 'alpha' would be:", ridge_gscv.best_params_)


In [None]:

# Checking the model parameters after GridSearchCV
print("Coefficients:", ridge_gscv.best_estimator_.coef_)
print("Intercept:", ridge_gscv.best_estimator_.intercept_)

In [None]:
plt.figure(figsize=(8, 4))


# Plot the actual Close prices from the test set in blue
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Ridge regression model with GridSearchCV in red
plt.plot(10**ridge_gscv.predict(X_test), color='red')

# Set the y-axis label
plt.ylabel("Close Price")

# Add a legend for the plotted lines
plt.legend(["Actual", "Predicted"])

# Add grid lines to the plot
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

# Set the title of the plot in black so it stays visible on the default background
plt.title("Ridge Regression with GridSearchCV", color='black')

# Display the plot
plt.show()

In [None]:
# Mean Squared Error
mse_ridge_gscv = round( mean_squared_error((10**y_test), 10**(y_pred_ridge_gscv)), 4)

# Root Mean Squared Error
rmse_ridge_gscv = round(np.sqrt(mse_ridge_gscv), 4)

# Mean Absolute Error
mae_ridge_gscv = round(mean_absolute_error((10**y_test), 10**(y_pred_ridge_gscv)), 4)

# R-2 Score
r2_ridge_gscv = round(r2_score((10**y_test), (10**y_pred_ridge_gscv)), 4)

# Adjusted R-2 Score
adj_r2_ridge_gscv = round(1 - (1 - r2_ridge_gscv)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)



In [None]:
# Create an empty dataframe
evametdf_ridge_gscv = pd.DataFrame()

# Create a column for the evaluation metrics
evametdf_ridge_gscv['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Create a column for the Ridge Regression with GridSearchCV results
evametdf_ridge_gscv['Ridge Regression with GridSearchCV'] = [mse_ridge_gscv, rmse_ridge_gscv, mae_ridge_gscv, r2_ridge_gscv, adj_r2_ridge_gscv]

# Display the dataframe
evametdf_ridge_gscv

##### Which hyperparameter optimization technique have you used and why?

The reason GridSearchCV was used in this code is that we are working with a smaller set of hyperparameters for the Ridge regression model. GridSearchCV allows us to exhaustively search through the specified hyperparameter grid and find the best combination of hyperparameters that yields the optimal model performance.

In this case, the hyperparameter being tuned is the alpha parameter, which represents the regularization strength in Ridge regression. The ridge_param_grid contains a predefined list of potential alpha values to explore. By using GridSearchCV, the code iterates through each alpha value in the grid, fits the Ridge regression model with that particular alpha, and evaluates the model's performance using cross-validation.

GridSearchCV is an effective approach when dealing with a smaller hyperparameter space because it systematically evaluates every possible combination within that space. However, as the hyperparameter space grows larger, GridSearchCV may become computationally expensive and time-consuming.

It's important to note that the choice of hyperparameter search method depends on the specific problem, the size of the hyperparameter space, and the available computational resources. GridSearchCV is suitable for smaller hyperparameter spaces, while other techniques like RandomizedSearchCV or Bayesian optimization may be more efficient for larger hyperparameter spaces.

Overall, GridSearchCV provides a systematic way to search through a smaller set of hyperparameters and identify the optimal combination for the Ridge regression model, leading to improved model performance.
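The search described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not what `GridSearchCV` does internally: it uses a one-feature ridge model without an intercept, hypothetical toy data, and contiguous folds; the helper names `fit_ridge_1d` and `cv_mse` are invented for the example. Only the alpha grid is taken from `ridge_param_grid`.

```python
def fit_ridge_1d(xs, ys, alpha):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_mse(xs, ys, alpha, k=3):
    """Mean validation MSE over k contiguous folds."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        lo, hi = fold * n // k, (fold + 1) * n // k
        # Train on everything outside the current fold
        train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        w = fit_ridge_1d(train_x, train_y, alpha)
        # Score on the held-out fold
        errors = [(w * x - y) ** 2 for x, y in zip(xs[lo:hi], ys[lo:hi])]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

# Hypothetical toy data with a roughly linear relationship
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]

# Same candidate values as ridge_param_grid
alpha_grid = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]

# Evaluate every candidate and keep the one with the lowest mean validation MSE
scores = {alpha: cv_mse(xs, ys, alpha) for alpha in alpha_grid}
best_alpha = min(scores, key=scores.get)
print("Best alpha:", best_alpha, "with mean CV MSE:", scores[best_alpha])
```

The cost grows with the number of candidates times the number of folds, which is why an exhaustive search is practical for a single hyperparameter with nine values but becomes expensive for larger grids.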

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Concatenating two DataFrames side by side using pd.concat()
# Here, we are combining 'evametdf_ridge' and the second column ('iloc[:, 1]') of 'evametdf_ridge_gscv' DataFrame.
ridge_comp_df = pd.concat([evametdf_ridge, evametdf_ridge_gscv.iloc[:, 1]], axis=1)

# Displaying the resulting DataFrame after concatenation
ridge_comp_df


In terms of error metrics, the Ridge Regression model tuned with GridSearchCV outperformed the untuned Ridge model shown alongside it in the table above. It achieved lower error values, indicating better predictive accuracy. The alpha value selected through cross-validation improved the model's fit, which suggests that the tuned Ridge Regression model is the more reliable of the two for the given dataset.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In this case, the evaluation and comparison of the models' performance primarily focus on two key metrics: Root Mean Square Error (RMSE) and R-2 Score. The RMSE is a measure of the average magnitude of the prediction errors, providing insights into the models' ability to accurately estimate the Close prices. A lower RMSE indicates better predictive accuracy, as it signifies that the models' predictions are closer to the actual Close prices.

The R-2 Score, also known as the coefficient of determination, quantifies the proportion of the variance in the target variable (Close prices) that is explained by the predictor variables. A higher R-2 Score indicates a better fit of the model to the data, as it suggests that a larger portion of the variation in the Close prices can be accounted for by the predictors.

In this analysis, the dataset has been preprocessed to effectively handle outliers, ensuring that they do not significantly impact the models' performance. Therefore, there is no need to be concerned about the models' sensitivity to outliers.

Additionally, because all models are trained on the same predictor variables and evaluated on the same test set, there is no requirement to consider adjusted scores separately. Adjusted R-2 penalizes the number of predictors relative to the sample size, so it matters mainly when comparing models with different numbers of predictors. Here the number of predictors and the number of test observations are identical across models, so the Adjusted R-2 Score ranks the models in the same order as the R-2 Score.

By placing emphasis on RMSE and R-2 Score, we can effectively evaluate the models' predictive power and their ability to explain the variation in the Close prices. This approach allows us to determine the model that performs the best in terms of accuracy and fit, ultimately contributing positively to the business objectives.
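As a worked illustration of the metrics discussed in this answer, the sketch below computes RMSE, R-2 and Adjusted R-2 by hand. The `actual` and `predicted` values are hypothetical and are not taken from the Yes Bank data; `regression_metrics` is a helper written for this example, using the same Adjusted R-2 formula as the metric cells earlier in the notebook.

```python
import math

def regression_metrics(actual, predicted, n_features):
    """Return (RMSE, R-2, Adjusted R-2) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2

# Hypothetical values on the original price scale
actual = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
predicted = [12.0, 18.0, 33.0, 39.0, 52.0, 58.0]

rmse, r2, adj_r2 = regression_metrics(actual, predicted, n_features=3)
print(f"RMSE={rmse:.4f}, R-2={r2:.4f}, Adjusted R-2={adj_r2:.4f}")
```

For a fixed number of observations and predictors, Adjusted R-2 is an increasing function of R-2, so models that share both will be ranked identically by the two scores.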

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:

# Concatenate the evaluation metric dataframes of the best performing models from each section
overall_evametdf = pd.concat([evametdf_lin,
                              ridge_comp_df.loc[:, 'Ridge Regression with GridSearchCV'],
                              lasso_comp_df.loc[:, 'Lasso Regression with GridSearchCV'],
                              ], axis=1)

# Display the concatenated dataframe
overall_evametdf


In [None]:
data = {
    'Linear Regression': [70.4204, 8.3917, 4.8168, 0.9937, 0.9931],
    'Ridge Regression with GridSearchCV': [70.2044, 8.3788, 4.9692, 0.9938, 0.9932],
    'Lasso Regression with GridSearchCV': [70.3311, 8.3864, 4.8262, 0.9938, 0.9932],

}
overall_evametdf = pd.DataFrame(data, index=['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score'])

# Keep metrics as rows and models as columns, so that each trace is one metric
# and the x-axis lists the models

# Create a Plotly figure
fig = go.Figure()

# Add one bar trace per metric, with one bar per model
colors = ['rgb(91, 155, 213)', 'rgb(237, 125, 49)', 'rgb(165, 165, 165)', 'rgb(112, 173, 71)', 'rgb(255, 192, 0)']
for i, metric in enumerate(overall_evametdf.index):
    fig.add_trace(go.Bar(x=overall_evametdf.columns, y=overall_evametdf.loc[metric],
                         name=metric, marker_color=colors[i]))

# Update layout
fig.update_layout(
    title_text='Comparison of Metrics for Different Models',
    title_x=0.5,  # Set the title at the middle of the plot
    xaxis_title='Models',
    yaxis_title='Metric Values',
    xaxis=dict(title_font=dict(size=14)),  # Set the font size of the x-axis title to 14
    yaxis=dict(title_font=dict(size=14)),  # Set the font size of the y-axis title to 14
    barmode='group',
    legend=dict(y=1.0, bgcolor='rgba(255, 255, 255, 0.5)'),
    width=1200,  # Increase the figure width for better visual appeal
    height=600,  # Increase the figure height for better visual appeal
    plot_bgcolor='rgb(36, 40, 47)',  # Set the dark blue background color of the plot
    paper_bgcolor='rgb(51, 56, 66)',  # Set the dark blue background color of the paper area
    font_color='white',
    hovermode='x',  # Show hover information for each bar
)

# Show the plot
fig.show()



Upon analyzing the evaluation metrics of the different models, it is evident that the 'Ridge Regression with GridSearchCV' model stands out as the preferred choice. This conclusion is based on several factors:

**Comparatively lower (Root) Mean Squared Error (MSE/RMSE):**

- The 'Ridge Regression with GridSearchCV' model has the lowest MSE/RMSE among the evaluated models (an RMSE of 8.3788, versus 8.3864 for Lasso and 8.3917 for Linear Regression). MSE is the average squared difference between the predicted and actual values, and a lower value indicates better predictive accuracy. The margin is small, and the model's Mean Absolute Error (4.9692) is slightly higher than that of the other two models, so its advantage is specific to the squared-error metrics.

**Dataset size and feature retention:**

- The small size of the dataset warrants caution when considering the Lasso Regression model. Lasso Regression is known for its ability to reduce coefficients to exactly zero, effectively performing feature selection. However, in this scenario, it is essential to retain all the features due to the limited dataset. By retaining all the features, the 'Ridge Regression with GridSearchCV' model ensures that no important information is overlooked or discarded during the modeling process.


Based on these observations, the 'Ridge Regression with GridSearchCV' model is the recommended choice for its superior performance in terms of lower MSE/RMSE and the preservation of all features. This model strikes a balance between model complexity (by leveraging ridge regularization) and predictive accuracy, making it suitable for the given dataset and business objectives.

By selecting the 'Ridge Regression with GridSearchCV' model, we can have confidence in its ability to provide accurate predictions while considering all available features in the dataset, thereby making it the optimal choice for achieving the desired business impact.
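The comparison can also be checked directly from the reported figures. The sketch below copies the RMSE and MAE values from the comparison chart above into a plain dictionary (the name `reported` is introduced only for this example) and shows which model is best under each metric and how large the RMSE gap is.

```python
# RMSE and MAE values as reported in the comparison chart above
reported = {
    'Linear Regression': {'RMSE': 8.3917, 'MAE': 4.8168},
    'Ridge Regression with GridSearchCV': {'RMSE': 8.3788, 'MAE': 4.9692},
    'Lasso Regression with GridSearchCV': {'RMSE': 8.3864, 'MAE': 4.8262},
}

# Model with the lowest value under each metric
best_by_rmse = min(reported, key=lambda m: reported[m]['RMSE'])
best_by_mae = min(reported, key=lambda m: reported[m]['MAE'])

# Relative gap between the best and worst RMSE, in percent
rmse_values = [scores['RMSE'] for scores in reported.values()]
rmse_spread_pct = 100 * (max(rmse_values) - min(rmse_values)) / min(rmse_values)

print("Lowest RMSE:", best_by_rmse)
print("Lowest MAE:", best_by_mae)
print(f"RMSE spread across models: {rmse_spread_pct:.2f}%")
```

The models are very close on RMSE, so the choice of Ridge rests as much on retaining all features as on the error metrics themselves.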

In [None]:
# Choosing Ridge Regression with GridSearchCV as the final prediction model

final_pred_model = ridge_gscv

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

As discussed earlier, Ridge Regression is a regularization technique used in Linear Regression models to address overfitting and multicollinearity. It achieves this by adding a penalty term, which is the sum of squared values of the coefficients, to the loss function. This penalty term limits the magnitude of the coefficients, promoting a balance between the model's complexity and its ability to generalize well.

On the other hand, GridSearchCV is a hyperparameter tuning technique that performs an exhaustive search over a predefined set of potential hyperparameters. It aims to find the best combination of hyperparameters for a model by evaluating them using cross-validation.

When Ridge Regression is combined with GridSearchCV, Ridge Regression serves as the base model, and GridSearchCV is utilized to determine the optimal value for the regularization strength, also known as the hyperparameter controlling the penalty term.

By leveraging GridSearchCV, we can systematically explore different values of the regularization strength and identify the one that yields the best performance according to the chosen evaluation metric. This approach allows us to fine-tune the Ridge Regression model and optimize its performance based on the given hyperparameters.

To assess the feature importance of the Ridge Regression model, we can examine the coefficients associated with each feature. The magnitude of these coefficients indicates the relative influence of the corresponding features on the predicted outcome. By analyzing the feature importance, we gain insights into which features have the greatest impact on the predictions and can focus on interpreting their influence on the target variable.

In [None]:

# Installing Eli5. It is a Python library used for interpreting machine learning models.

! pip install eli5

In [None]:

# Importing the libraries

import eli5

# Show feature weights using eli5
eli5.show_weights(final_pred_model.best_estimator_, feature_names=independent_variables)


- The 'High price' feature has a weight of +0.333, indicating a positive influence on the predictions. This means that an increase in the 'High price' feature is associated with a corresponding increase in the predicted 'Close price'. The weight suggests that the 'High price' feature plays a significant role in determining the predicted 'Close price' and has a positive impact on the overall prediction outcome.

- The 'Low price' feature has a weight of +0.315, indicating a positive influence on the predictions. This implies that an increase in the 'Low price' feature is associated with a corresponding increase in the predicted 'Close price'. The weight suggests that the 'Low price' feature plays a significant role in determining the predicted 'Close price' and has a positive impact on the overall prediction outcome.

- On the other hand, the 'Open price' feature has a weight of -0.226, suggesting a negative influence on the predictions. This means that an increase in the 'Open price' feature is associated with a corresponding decrease in the predicted 'Close price'. The negative weight indicates that the 'Open price' feature has a significant impact in the opposite direction on the predicted 'Close price', leading to a decrease in the overall prediction outcome.
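The same reading can be reproduced without eli5 by ranking the weights by absolute value. The sketch below uses the three weights quoted above; the dictionary name `weights` and the short feature labels are assumptions made for the example, and in the notebook the values would come from `final_pred_model.best_estimator_.coef_`.

```python
# Weights as quoted in the interpretation above
weights = {'High': 0.333, 'Low': 0.315, 'Open': -0.226}

# Rank features by the magnitude of their weight, largest first
ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)

for name, weight in ranked:
    direction = 'raises' if weight > 0 else 'lowers'
    print(f"{name}: weight {weight:+.3f} ({direction} the predicted Close as the feature increases)")
```

Comparing magnitudes this way is only meaningful when the features are on comparable scales, which is assumed here.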

# **8. Hypothesis Testing**

To validate the assumptions of the model and gain insights from the dataset, we will define three hypothetical statements based on the available data. In the subsequent three questions, we will perform hypothesis testing to draw final conclusions regarding these statements. Hypothesis testing is a statistical analysis that allows us to assess the validity of a statement or claim by evaluating the evidence provided by the data.

We will select an appropriate statistical test based on the nature of the hypothesis, type of data, and assumptions involved. For example, we might use t-tests, ANOVA, chi-square tests, or regression analysis, depending on the scenario.

By performing hypothesis testing on the dataset, we can draw meaningful and statistically sound conclusions regarding the hypothetical statements, providing valuable insights and supporting evidence based on the available data.

**Hypothetical Statement - 1**

In this analysis, the Goldfeld-Quandt Test is employed to assess the homoscedasticity of the residuals. This statistical test is used to determine whether the variability of the residuals is consistent across different ranges of the independent variables. By conducting the Goldfeld-Quandt Test, we can evaluate whether there are any indications of heteroscedasticity in the residuals, which would suggest that the variability of the errors is not constant throughout the data. Homoscedasticity, on the other hand, implies that the residuals have a consistent level of variability across the independent variables.

**1. State Your research hypothesis as a null hypothesis and alternate hypothesis.**

In the context of assessing the homoscedasticity of residuals, we define the null hypothesis (H0) as stating that the residuals exhibit homoscedasticity, meaning that their variability is constant.

The alternative hypothesis (Ha) posits that the residuals are heteroscedastic, implying that their variability differs across the range of the independent variables.

Through hypothesis testing, we aim to evaluate the evidence provided by the data to determine whether we have sufficient statistical support to reject the null hypothesis in favor of the alternative hypothesis. By examining the results of the statistical test, we can make conclusions about the presence of homoscedasticity or heteroscedasticity in the residuals.

**2. Perform an appropriate statistical test.**

In [None]:
# Importing the libraries

from statsmodels.stats.diagnostic import het_goldfeldquandt

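# Calculate the residuals on the original price scale as a 1-D array
# (np.ravel avoids accidental broadcasting if y_test and the predictions differ in shape)
residuals = 10**np.ravel(y_test) - 10**np.ravel(y_pred_ridge_gscv)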

# Perform the Goldfeld-Quandt test to check for homoscedasticity of the residuals
p_value = het_goldfeldquandt(residuals, y_pred_ridge_gscv.reshape(-1, 1))[1]

# Interpret the results
if p_value < 0.05:
    print('The residuals are heteroscedastic (Reject Null Hypothesis).')
else:
    print('No evidence of heteroscedasticity in the residuals (Fail to Reject Null Hypothesis).')

print(f'\nThe p-value is: {p_value}')

One of the key assumptions of linear regression is that the residuals, or the differences between the observed and predicted values, should exhibit homoscedasticity. Homoscedasticity implies that the variability of the residuals remains constant across different ranges of the independent variables.

To validate this assumption, we performed the Goldfeld-Quandt test on the residuals of the tuned Ridge Regression model. The Goldfeld-Quandt test examines whether the variance of the residuals differs significantly between two subsets of the observations.

After conducting the test, we found that the p-value was greater than the significance level (0.05), so we do not have sufficient evidence to reject the null hypothesis. The test therefore gives no indication of heteroscedasticity in the residuals.

This suggests that the assumption of constant residual variance is reasonable for our model: the spread of the prediction errors shows no systematic increase or decrease across the observations.



**Why did you choose the specific statistical test?**

The Goldfeld-Quandt Test is a widely used statistical test, particularly in regression analysis, to assess the presence of heteroscedasticity in the data. Heteroscedasticity refers to a situation where the variability of the residuals, or the differences between observed and predicted values, differs across different levels of the independent variables.

By applying the Goldfeld-Quandt Test, we can evaluate whether the assumption of homoscedasticity, which assumes constant variability of the residuals, holds true for our regression analysis. The test examines if there is evidence of systematic patterns in the variability of the residuals across different ranges of the independent variables.

In our analysis, we performed the Goldfeld-Quandt Test on the residuals obtained from the regression model. By analyzing the results, we can determine if there is statistical significance to support the presence of heteroscedasticity in the data.

By utilizing the Goldfeld-Quandt Test, we gain insights into the nature of the variability in the residuals, allowing us to assess the adequacy of the assumption of homoscedasticity. This helps us to better understand the properties of our regression model and make more reliable inferences from the analysis.
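The intuition behind the test can be sketched with a plain variance ratio. This is a deliberate simplification: the actual Goldfeld-Quandt test fits a separate regression on each subsample and compares their residual variances against an F distribution, whereas the hypothetical helper `variance_ratio` below only compares the sample variances of two halves of an already computed residual series. Both example series are invented.

```python
from statistics import variance

def variance_ratio(residuals):
    """Ratio of the sample variance of the second half to that of the first half."""
    half = len(residuals) // 2
    first, second = residuals[:half], residuals[-half:]
    return variance(second) / variance(first)

# Hypothetical residuals with a stable spread
steady = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5]

# Hypothetical residuals whose spread grows over the series
growing = [0.1, -0.1, 0.2, -0.2, 2.0, -2.0, 3.0, -3.0]

print("Variance ratio, steady residuals:", variance_ratio(steady))
print("Variance ratio, growing residuals:", variance_ratio(growing))
```

A ratio close to 1 is consistent with homoscedasticity, while a ratio far from 1 points to a change in variance between the two subsets.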

**Hypothetical Statement - 2**

**1. State Your research hypothesis as a null hypothesis and alternate hypothesis.**

In this analysis, we employed the Ljung-Box Test to examine the presence of autocorrelation among the residuals. Autocorrelation refers to the correlation of a variable with its lagged values, indicating whether there is a pattern or relationship between the residuals at different time points.

By conducting the Ljung-Box Test, we aimed to assess whether there is any statistically significant autocorrelation present in the residuals of our analysis. This test helps us determine whether the assumption of independence between the residuals holds true or if there are any systematic patterns or dependencies in the data.

Null Hypothesis, H<sub>0</sub> : Autocorrelation is absent among the residuals.

Alternate Hypothesis, H<sub>A</sub>: Autocorrelation is present among the residuals.

**2. Perform an appropriate statistical test.**

In [None]:
# Importing the libraries

from statsmodels.stats.diagnostic import acorr_ljungbox

# Perform the Ljung-Box test to check for autocorrelation among the residuals
p_values = acorr_ljungbox(residuals)['lb_pvalue']

# Take the smallest p-value across the tested lags (one p-value is returned per lag)
p = min(p_values)

# Interpret the results
if p < 0.05:
    print('Autocorrelation is present among the residuals (Reject Null Hypothesis).')
else:
    print('No evidence of autocorrelation among the residuals (Fail to Reject Null Hypothesis).')

print(f'\nThe p-value is: {p}')


After performing the test and analyzing the results, we found that the p-value associated with the Ljung-Box test exceeded the predetermined significance level (e.g., 0.05). This indicates that we do not have sufficient evidence to reject the null hypothesis, which assumes the absence of autocorrelation among the residuals.

Based on these findings, there is no evidence of significant autocorrelation in the residuals, so the assumption of no autocorrelation appears reasonable for our Ridge Regression model. This suggests that the prediction errors are not driven by systematic patterns or dependencies between successive observations.

**Why did you choose the specific statistical test?**

In our analysis, we utilized the Ljung-Box test to examine the autocorrelations of the residuals from our model. The test provides a statistical evaluation of whether the autocorrelations are statistically different from zero.


By performing the Ljung-Box test and analyzing the resulting p-value, we can draw conclusions about the presence or absence of autocorrelation. If the p-value is below a predetermined significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is evidence of autocorrelation. Conversely, if the p-value exceeds the significance level, we fail to reject the null hypothesis and conclude that there is no evidence of significant autocorrelation among the residuals.


By utilizing the Ljung-Box test, we gain insights into the autocorrelation structure of the residuals, allowing us to assess the adequacy of the assumption of no autocorrelation. This information is valuable for validating the model's assumptions and making reliable inferences based on the analysis.
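For reference, the Ljung-Box statistic is Q = n(n + 2) Σ ρ_k² / (n − k), summed over lags k = 1..h, where ρ_k is the sample autocorrelation at lag k. The sketch below computes Q by hand for two hypothetical series; the helper name `ljung_box_q` and both series are invented for the example, and the p-value step (comparison against a chi-square distribution) is left to `acorr_ljungbox`.

```python
def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic: n * (n + 2) * sum_k(rho_k^2 / (n - k)) for k = 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

# Hypothetical series that flips sign every step: strong negative lag-1 autocorrelation
alternating = [1.0, -1.0] * 6

# Hypothetical series with very little lag-1 autocorrelation
weak_lag1 = [1.0, 1.0, -1.0, -1.0] * 3

print("Q at lag 1, alternating series:", ljung_box_q(alternating, 1))
print("Q at lag 1, weak lag-1 series:", ljung_box_q(weak_lag1, 1))
```

At a single lag, Q is compared with the 95th percentile of a chi-square distribution with one degree of freedom, about 3.841; values above that threshold lead to rejecting the null hypothesis of no autocorrelation.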

**Hypothetical Statement - 3**

The Shapiro-Wilk test is a statistical test employed to examine the normality of a dataset, including the residuals of a regression model. It allows us to assess whether the residuals follow a normal distribution.

By conducting the Shapiro-Wilk test, we aim to determine whether the assumption of normality holds for the residuals, that is, whether they follow a symmetric, bell-shaped normal distribution.

In our analysis, we utilized the Shapiro-Wilk test to assess the normality of the residuals obtained from the regression model. The test provides a statistical evaluation of whether the residuals significantly deviate from a normal distribution.

By performing the Shapiro-Wilk test and analyzing the resulting p-value, we can draw conclusions about the normality of the residuals. If the p-value is below a predetermined significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is evidence of non-normality in the residuals. Conversely, if the p-value exceeds the significance level, we fail to reject the null hypothesis and conclude that there is no significant departure from normality.

**1. State Your research hypothesis as a null hypothesis and alternate hypothesis.**

Null Hypothesis, H<sub>0</sub> : The residuals are normally distributed.

Alternate Hypothesis, H<sub>A</sub> : The residuals are NOT normally distributed.

**2. Perform an appropriate statistical test.**

In [None]:

# Importing the libraries

from scipy.stats import shapiro

# Perform the Shapiro-Wilk Test to check if the residuals are normally distributed
p_value = shapiro(residuals)[1]

# Interpret the results
if p_value < 0.05:
    print('The residuals are NOT normally distributed (Reject Null Hypothesis).')
else:
    print('No evidence against normality of the residuals (Fail to Reject Null Hypothesis).')

print(f'\nThe p-value is: {p_value}')


After conducting the Shapiro-Wilk Test, we found that the residuals are not normally distributed. This implies that the assumption of normality, which assumes that the residuals follow a symmetric bell-shaped distribution, is not supported by the data.

The Shapiro-Wilk Test is a statistical test that evaluates the normality of a dataset, including the residuals in this case. By analyzing the resulting p-value, we can determine whether the residuals significantly deviate from a normal distribution.

Based on our analysis, the p-value obtained from the Shapiro-Wilk Test fell below the predetermined significance level (e.g., 0.05). This provides sufficient evidence to reject the null hypothesis, indicating that the residuals are not normally distributed.



**Why did you choose the specific statistical test?**

The choice of the Shapiro-Wilk test was based on its suitability for assessing the normality of the data. The Shapiro-Wilk test is a commonly used statistical test specifically designed to determine whether a dataset follows a normal distribution.

In our analysis, we utilized the Shapiro-Wilk test because our goal was to examine whether the residuals adhere to the assumption of normality. Normally distributed residuals are a standard assumption for inference in many statistical models, including linear regression.

The Shapiro-Wilk test provides a statistical evaluation of the normality of the data by calculating a test statistic and corresponding p-value. By comparing the p-value to a predetermined significance level (e.g., 0.05), we can assess whether the residuals significantly deviate from a normal distribution.

In summary, we selected the Shapiro-Wilk test as it is widely recognized and appropriate for assessing the normality of data, making it a suitable choice for our analysis. Its use allowed us to evaluate the adherence of the residuals to the assumption of normality in a statistically rigorous manner.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

A careful examination of the data reveals a pronounced decline in the stock prices of Yes Bank following the exposure of the Rana Kapoor fraud in 2018.

The dataset exhibited exceptional cleanliness, devoid of any missing values or duplicated rows, minimizing the need for extensive data wrangling.

Although outliers were present in the features, effective outlier mitigation was achieved through the implementation of a log transformation across all features.

The log transformation successfully addressed positive skewness observed in all features, ensuring adherence to the assumptions of the linear regression models.

Strong positive correlations were observed between the independent variables (Open, High, Low) and the dependent variable (Close), implying a high predictive potential of the dependent variable based on the independent variables.

The presence of positive correlations among the independent variables suggested the presence of multicollinearity; however, given the limited dataset size, feature removal was deemed unnecessary.

Among the various implemented regression models, the Ridge Regression model, combined with GridSearchCV for hyperparameter optimization, emerged as the preferred choice. It achieved a commendable performance, with an RMSE of 8.3788 and an R-2 score of 0.9938.

Notably, the 'High' and 'Low' features received positive weights, meaning that higher values of these features raise the predicted closing price. Conversely, the 'Open' feature received a negative weight, meaning that, with the other features held fixed, a higher opening price lowers the predicted closing price.

The residuals showed no evidence of heteroscedasticity or autocorrelation and had a mean of zero, which supports the reliability of the regression model. The Shapiro-Wilk test, however, rejected normality of the residuals.

The robustness of the conclusions was supported by a thorough exploration of the data, leaving little room for ambiguity.

The observed decline in Yes Bank's stock prices following the Rana Kapoor fraud exposure underscored the substantial impact of such events on the financial market.

The meticulous data cleaning process instilled confidence in the dataset's integrity, fostering accurate and reliable analyses.

Employing an appropriate transformation technique mitigated the influence of outliers, ensuring a more accurate representation of the data.

Addressing positive skewness through a log transformation enhanced the conformity of the data to the assumptions of linear regression models.

The strong positive correlations between the independent and dependent variables bolstered the predictive power of the regression models.

Although multicollinearity was present, feature removal was deemed unnecessary given the limited dataset size.

The selection of Ridge Regression with GridSearchCV as the final prediction model was substantiated by its exceptional performance, as demonstrated by the low RMSE and high R-2 score.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***