<a href="https://colab.research.google.com/github/shinobi-itachi/YES-BANK-STOCK-CLOSING-PRICE-PREDICTION/blob/main/Copy_of_Yes_Bank_Stock_Closing_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### **Project Type**    - Supervised Regression
##### **Contribution**    - Team
##### **Team Member 1**    - Rohit Kamble
##### **Team Member 2**    - Pratik Jogdand

# **Project Name**    - YES BANK STOCK CLOSING PRICE PREDICTION



# **Project Summary -**

The primary objective of this analysis is to employ predictive models, particularly time series models, to forecast the stock's closing price for each month. By doing so, researchers seek to understand if these models can effectively capture and respond to significant events like the fraud case involving Rana Kapoor and its potential impact on the bank's stock prices. Through this investigation, the study aims to shed light on the predictive capabilities of various models when confronted with significant real-world financial events.

# **GitHub Link -**

# **Problem Statement**


**Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Owing to this fact, it was interesting to see how that impacted the stock prices of the company and whether Time series models or any other predictive models can do justice to such situations. This dataset has monthly stock prices of the bank since its inception and includes closing, starting, highest, and lowest stock prices of every month. The main objective is to predict the stock’s closing price of the month.**

### Dataset Information

We have 185 rows and 5 columns in our dataset with no null values. The dependent variable is Close, and the independent variables are Open, High and Low.

Date: The month and year to which the prices apply.

Open: The price at which the stock started trading that month.

High: The maximum price that month.

Low: The minimum price that month.

Close: The final trading price for that month, which we have to predict using regression.

# ***Let's Begin !***

In [None]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.metrics import mean_absolute_error, mean_squared_error

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as LR
from xgboost import XGBRegressor
from sklearn import neighbors
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

In [None]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#read csv file
df=pd.read_csv('/content/drive/MyDrive/Copy of data_YesBank_StockPrices.csv')

## ***DATA OVERVIEW***

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
#copying data to preserve original file
df1= df.copy()

**EDA AND Data Pre-Processing**

In [None]:
# DEPENDENT AND INDEPENDENT VARIABLES
indep_var=df1[['High','Low','Open']]
dep_var=df1['Close']


In [None]:
#Instead of dropping the date, we convert it into a proper datetime and use it as the index.

# converting 'Date' (month-year strings in '%b-%y' format) into datetime - YYYY-MM-DD
df1['Date'] = pd.to_datetime(df1['Date'], format='%b-%y')

In [None]:
#set date as index
df1.set_index('Date',inplace=True)

In [None]:
#check if changes are being reflected
df1.head()
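The date conversion above can be checked on a couple of hand-written strings. This is only a minimal sketch: the two sample values are hypothetical and merely follow the `%b-%y` layout (abbreviated month, two-digit year) that the notebook parses.

```python
import pandas as pd

# Hypothetical sample strings in the '%b-%y' layout (not taken from the dataset)
sample = pd.Series(['Jul-05', 'Nov-20'])

# Parse with an explicit format; the day defaults to the first of the month
parsed = pd.to_datetime(sample, format='%b-%y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

Because the strings carry no day, every row lands on the first of its month, which is sufficient for a monthly index.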

**CHECK AND HANDLE DUPLICATES**

In [None]:
#check duplicate entries
len(df1[df1.duplicated()])

In [None]:
# No Duplicates found

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df1.isnull().sum())


In [None]:
# Visualizing the missing values (a uniform heatmap means there are none)
sns.heatmap(df1.isnull(), cbar=False)
plt.show()

### What did you know about your dataset?

The dataset provided for the problem consists of 185 rows and 5 columns. The columns in the dataset are labeled as follows:


* Date: Represents the month and year for which the stock prices are recorded.
* Open: Indicates the opening stock price of the bank's stock for the corresponding month.
* High: Represents the highest stock price reached during the month.
* Low: Represents the lowest stock price reached during the month.
* Close: Indicates the closing stock price of the bank's stock for the corresponding month.


The dataset covers monthly stock prices of Yes Bank since its inception, providing a comprehensive record of the stock's performance over time. With this dataset, the main objective of the analysis is to predict the closing stock price of the bank for each month, exploring the potential use of time series models or other predictive models to address the impact of significant events like the fraud case involving Rana Kapoor on the stock's prices.

### Variables Description

* Independent Variables: High, Low, and Open represent the highest, lowest, and opening stock prices of Yes Bank for each month, respectively. These variables are used as predictors in the analysis.

* Dependent Variable: Close indicates the closing stock price of Yes Bank for each month. It is the target variable to be predicted using the independent variables.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df1.columns.tolist():
  print(f"No. of unique values in {col} is {df1[col].nunique()}.")

**UNIVARIATE ANALYSIS**
---
We will take a look at distribution plots of our features.

INDEPENDENT VARIABLES

In [None]:
#plots for independent variables
for var in indep_var:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
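    # distplot is deprecated in recent seaborn; histplot with kde=True is the closest replacement
    fig = sns.histplot(df1[var].dropna(), kde=True, stat='density')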
    fig.set_ylabel(' ')
    fig.set_xlabel(var)

    plt.subplot(1, 2, 2)
    fig = sns.boxplot(y=df1[var])
    fig.set_title('')
    fig.set_ylabel(var)

DEPENDENT VARIABLE

In [None]:
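#plots for dependent variable
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
# distplot is deprecated in recent seaborn; histplot with kde=True is the closest replacement
fig = sns.histplot(df1['Close'].dropna(), kde=True, stat='density')
fig.set_ylabel(' ')
fig.set_xlabel('Close')  # label with the target name, not the leftover loop variable

plt.subplot(1, 2, 2)
fig = sns.boxplot(y=df1['Close'])
fig.set_title('')
fig.set_ylabel('Close')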

**Since our data is skewed, we will perform some transformations during regression analysis. Let's visualize what our data will look like after transformation.**

In [None]:
#independent variables
for var in indep_var:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
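    # distplot is deprecated in recent seaborn; histplot with kde=True is the closest replacement
    fig = sns.histplot(np.log10(df1[var]), kde=True, stat='density')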
    fig.set_ylabel(' ')
    fig.set_xlabel(var)

    plt.subplot(1, 2, 2)
    fig = sns.boxplot(y=np.log10(df1[var]))
    fig.set_title('')
    fig.set_ylabel(var)

After the log transformation, our data resembles a normal distribution to some extent.
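The visual impression from the plots can also be expressed as a number. The following is a minimal sketch on synthetic, right-skewed data (a log-normal sample standing in for the price columns, which is an assumption and not the Yes Bank data); it compares the sample skewness before and after `np.log10`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (assumption: log-normal stand-in, not the real data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=185))

# Sample skewness before and after the log10 transformation
skew_raw = prices.skew()
skew_log = np.log10(prices).skew()

print(f"skewness raw: {skew_raw:.2f}, after log10: {skew_log:.2f}")
```

A skewness close to zero after the transformation is consistent with a roughly symmetric distribution; the same two calls could be applied to `df1[var]` for each feature.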

In [None]:
#scatter plot between dependent variable with all independent variables.
for col in indep_var:
   fig = plt.figure(figsize=(9, 6))
   ax = fig.gca()
   feature = df1[col]
   label = df1['Close']
   correlation = feature.corr(label)
   plt.scatter(x=feature, y=label)
   plt.xlabel(col)
   plt.ylabel('closing Price')
   ax.set_title('closing Price - ' + col + ' correlation: ' + str(correlation))
   z = np.polyfit(df1[col], df1['Close'], 1)
   y_hat = np.poly1d(z)(df1[col])

   plt.plot(df1[col], y_hat, "r--", lw=1)

plt.show()

The relationship between each independent variable (High, Low, Open) and the dependent variable (Close) appears to be linear: as an independent variable changes, the closing price changes in roughly constant proportion. This suggests that the closing price is closely tied to the month's highest, lowest, and opening prices, which supports the use of linear models for Yes Bank's stock.

**CORRELATION**
---

In [None]:
#correlation plot
plt.figure(figsize=(10,5))
sns.heatmap(df1.corr(), annot = True, cmap='coolwarm')
plt.show()

**MULTICOLLINEARITY**
---

In [None]:
#Multicollinearity
#VIF score  {Variance Inflation Factor}

def calc_vif(X):

   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
calc_vif(indep_var)

Despite the high VIF scores, we do not drop or combine features, as each one is crucial for this use case. Combining High and Low into a single range feature increased the errors significantly, which affirms that all of the features are needed for accurate predictions.
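For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on the remaining ones. The sketch below computes it by hand with NumPy on synthetic, strongly correlated columns (hypothetical stand-ins for Open, High and Low). Note one assumption: this version fits an intercept in each auxiliary regression, whereas the `calc_vif` call above passes the features to statsmodels without a constant column, so the two sets of numbers may differ.

```python
import numpy as np
import pandas as pd

def manual_vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(X)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic, highly collinear columns (assumption: illustrative data only)
rng = np.random.default_rng(0)
base = rng.uniform(10, 400, size=185)
demo = pd.DataFrame({
    "Open": base + rng.normal(0, 5, 185),
    "High": base + np.abs(rng.normal(0, 5, 185)),
    "Low": base - np.abs(rng.normal(0, 5, 185)),
})

vif = manual_vif(demo)
print(vif.round(1))
```

Values far above the common rule-of-thumb threshold of 10 indicate that each column is almost fully explained by the other two, which is what one would expect for monthly Open, High and Low prices.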

**DATAFRAME TO STORE EVALUATION METRICS**

I'll store regression model evaluation metrics in a data frame to compare their performance and make an informed decision.
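The same four metrics are computed for every model, so the bookkeeping could be wrapped in one helper. The following is a minimal sketch of that idea; `regression_metrics` and the toy numbers are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_metrics(model_name, y_true, y_pred):
    """Return one row of rounded evaluation metrics for a fitted model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "Model_Name": model_name,
        "MAE": round(mean_absolute_error(y_true, y_pred), 2),
        "MSE": round(mse, 2),
        "RMSE": round(np.sqrt(mse), 2),
        # MAPE: mean of |actual - predicted| / |actual|, expressed in percent
        "MAPE": round(np.mean(np.abs((y_true - y_pred) / y_true)) * 100, 2),
    }

# Toy values chosen by hand to make the arithmetic easy to follow
rows = [regression_metrics("toy example", [100, 200, 400], [110, 190, 380])]
metrics_df = pd.DataFrame(rows)
print(metrics_df)
```

Appending one dictionary per model to `rows` and building the data frame once at the end avoids the manual row counter, although the cell-by-cell approach used in this notebook produces the same table.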

In [None]:
#empty data frame creation
i=0
error_df=pd.DataFrame()

**LINEAR REGRESSION**
---

In [None]:
#train test data split
x_train, x_test,y_train, y_test = train_test_split(indep_var,dep_var,test_size=.20,random_state=1)

In [None]:
#data transformation
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
# Fitting Linear Regression to the Training set
reg = LR().fit(x_train, y_train)

In [None]:
#predictions
y_pred = reg.predict(x_test)

**EVALUATION**

In [None]:
#defining mape
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100

In [None]:
#evaluation metrics
MAE = mean_absolute_error((y_test), (y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAPE=mape(y_test, y_pred)
print("MAPE :" ,MAPE)

In [None]:
#Inserting errors in dataframe
error_df.loc[i,"Model_Name"]='Linear regression'
error_df.loc[i,"MAE"]=round(MAE,2)
error_df.loc[i,"MSE"]=round(MSE,2)
error_df.loc[i,"RMSE"]=round(RMSE,2)
error_df.loc[i,"MAPE"]=round(MAPE,2)

i+=1

**VISUALIZATION**

In [None]:
#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

**LASSO REGRESSION**
---

In [None]:
#fitting data
lasso  = Lasso(alpha=0.0001 , max_iter= 3000)
lasso.fit(x_train, y_train)

In [None]:
lasso.score(x_train, y_train)

In [None]:
# Cross validation (GridSearchCV is already imported at the top)
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(x_train, y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)
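The grid search above is fitted on features that were scaled once on the whole training set, so each cross-validation fold has already seen scaling statistics from its own validation part. The `Pipeline` class imported at the top is not otherwise used; one possible use, sketched here on synthetic data (an assumption, not the notebook's data), is to put the scaler inside the search so that it is refitted within every fold.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and target (assumption: illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 400, size=(150, 3))
y_demo = X_demo @ np.array([0.5, 0.4, 0.1]) + rng.normal(0, 2, 150)

# Scaling and the model are tuned together, so the scaler is fitted per CV fold
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"lasso__alpha": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With this arrangement, `search.predict` applies the fitted scaler automatically, so raw test features can be passed in directly.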

In [None]:
#prediction
y_pred = lasso_regressor.predict(x_test)

In [None]:
#evaluation metrics
MAE=mean_absolute_error((y_test), (y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAPE=mape(y_test, y_pred)
print("MAPE :" ,MAPE)

In [None]:
#Inserting errors in dataframe
error_df.loc[i,"Model_Name"]='Lasso regression'
error_df.loc[i,"MAE"]=round(MAE,2)
error_df.loc[i,"MSE"]=round(MSE,2)
error_df.loc[i,"RMSE"]=round(RMSE,2)
error_df.loc[i,"MAPE"]=round(MAPE,2)


i=i+1

**VISUALIZATION**

In [None]:
#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

**RIDGE REGRESSION**
---

In [None]:
#fitting data
ridge  = Ridge(alpha=0.1)
ridge.fit(x_train,y_train)

In [None]:
ridge.score(x_train, y_train)

In [None]:
# Cross validation
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(x_train,y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
#Prediction
y_pred = ridge_regressor.predict(x_test)

In [None]:
#evaluation metrics
MAE=mean_absolute_error((y_test), (y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAPE=mape(y_test, y_pred)
print("MAPE :" ,MAPE)

In [None]:
#Inserting errors in dataframe
error_df.loc[i,"Model_Name"]='Ridge regression'
error_df.loc[i,"MAE"]=round(MAE,2)
error_df.loc[i,"MSE"]=round(MSE,2)
error_df.loc[i,"RMSE"]=round(RMSE,2)
error_df.loc[i,"MAPE"]=round(MAPE,2)

i=i+1

**VISUALIZATION**

In [None]:
#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

**KNN**
---

In [None]:
#hyperparameter tuning
params = {'n_neighbors':[2,3,4,5,6,7,8,9]}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)

In [None]:
#fitting data
model.fit(x_train,y_train)

In [None]:
#prediction
y_pred=model.predict(x_test)

In [None]:
#evaluation metrics
MAE=mean_absolute_error((y_test), (y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAPE=mape(y_test, y_pred)
print("MAPE :" ,MAPE)

In [None]:
#Inserting errors in dataframe
error_df.loc[i,"Model_Name"]='KNN regressor'
error_df.loc[i,"MAE"]=round(MAE,2)
error_df.loc[i,"MSE"]=round(MSE,2)
error_df.loc[i,"RMSE"]=round(RMSE,2)
error_df.loc[i,"MAPE"]=round(MAPE,2)

i=i+1

**VISUALIZATION**

In [None]:
#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

**XGBOOST REGRESSOR**
---

XGBoost is a comparatively complex and opaque technique, so we decided to feed it the raw, untransformed data. Surprisingly, compared with our manual preprocessing attempts, XGBoost achieved its best results without any human intervention. This is plausible because tree-based models split on feature thresholds and are largely insensitive to monotonic rescaling of the inputs.
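Why raw, untransformed data can be enough for a tree-based model may be illustrated with a small experiment. The sketch below uses scikit-learn's `DecisionTreeRegressor` as a stand-in for XGBoost (an assumption made to keep the example light; XGBoost itself is not run here) and synthetic data, and it checks whether min-max scaling changes the predictions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: illustrative only)
rng = np.random.default_rng(0)
X_train = rng.uniform(10, 400, size=(150, 3))
y_train = X_train.sum(axis=1) + rng.normal(0, 5, 150)
X_test = rng.uniform(10, 400, size=(30, 3))

scaler = MinMaxScaler().fit(X_train)

# Same tree settings, once on raw features and once on scaled features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Scaling preserves the ordering of each feature, so the splits should match
print(np.allclose(pred_raw, pred_scaled))
```

Because a tree only asks whether a feature lies above or below a threshold, an order-preserving rescaling leaves the resulting partitions unchanged; linear models and KNN do not share this property, which is why they are fitted on scaled features earlier in the notebook.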
                                                   

In [None]:
#data split
x_train, x_test,y_train, y_test = train_test_split((indep_var),(dep_var),test_size=.20,random_state=1)

In [None]:
#fitting data
xgb = XGBRegressor()
xgb.fit(x_train,y_train)

In [None]:
#prediction
y_pred = xgb.predict(x_test)

In [None]:
#evaluation metrics
MAE=mean_absolute_error((y_test), (y_pred))
print("MAE :" ,MAE)

MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MAPE=mape(y_test, y_pred)
print("MAPE :" ,MAPE)

In [None]:
#Inserting errors in dataframe
error_df.loc[i,"Model_Name"]='XGBoost regressor'
error_df.loc[i,"MAE"]=round(MAE,2)
error_df.loc[i,"MSE"]=round(MSE,2)
error_df.loc[i,"RMSE"]=round(RMSE,2)
error_df.loc[i,"MAPE"]=round(MAPE,2)

i=i+1

**VISUALIZATION**

In [None]:
plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

# **METRICS COMPARISON**
---

In [None]:
#sorting metrics
error_df.sort_values(by=['MAE'],ascending=True,inplace=True)

In [None]:
error_df

# **Conclusion**
---

1. The target variable demonstrates high dependence on the input variables.

  * This implies that the input variables have a significant influence on the target variable, indicating a strong relationship between them.
  
2. Linear Regression achieved the best results among the tested models, with the lowest MAE, MSE, RMSE, and MAPE scores.

  * Linear Regression's superior performance in terms of evaluation metrics suggests that it fits the data well and provides accurate predictions with low errors. However, it's essential to consider potential overfitting or underfitting issues and assess the significance of the relationship between the target and input variables.

3. Ridge regression effectively reduced complexity and addressed multicollinearity by shrinking parameters. However, it also had some impact on the evaluation metrics.

  * Ridge regression's ability to handle multicollinearity is a valuable aspect. However, the impact on evaluation metrics implies that the model's regularization might have led to some loss in predictive performance. Careful tuning of hyperparameters may be necessary to strike the right balance between complexity reduction and predictive accuracy.

4. Lasso regression performed feature selection, but its results were inferior to Ridge regression, indicating that all features are important for accurate predictions.

  * Lasso's feature selection capability is useful for identifying important features. However, its inferior performance compared to Ridge regression suggests that, in this specific case, all features might contribute significantly to the target variable. Feature importance analysis and domain knowledge are required to validate this finding.

5. All models achieved an accuracy of more than 90%, where accuracy is presumably read as 100% minus MAPE, since the notebook reports no separate accuracy metric.

  * This indicates that the models' predictions were, on average, within 10% of the actual closing prices. However, a single summary figure might not be sufficient to assess a model's performance, because different error metrics (MAE, RMSE, MAPE) reveal different aspects of its behavior.

6. Both KNN and XGBoost showed similar results in their predictive capabilities.

  * The similarity in results between KNN and XGBoost suggests that both models might be suitable for the task at hand. However, it's important to consider other factors such as model interpretability, computational efficiency, and ease of implementation while selecting the final model.

# **Usefulness to Stakeholders**

This project is useful for stakeholders in the financial domain, particularly for Yes Bank and its investors, in the following ways:

**1. Stock Price Prediction:** The main objective of the project is to predict the stock's closing price of the month. Stakeholders, including investors and traders, can use these predictive models to make informed decisions about when to buy or sell Yes Bank's stock. Accurate predictions can help them maximize their profits and minimize potential losses.

**2. Risk Management:** Understanding the relationship between input variables and the target variable (stock prices) helps stakeholders assess the risk associated with investing in Yes Bank. By using predictive models like Linear Regression, Ridge Regression, Lasso Regression, KNN, and XGBoost, they can estimate potential price movements and identify risk factors that might affect the stock's performance.

**3. Impact Analysis:** The project's exploration of how the fraud case involving Rana Kapoor impacted the stock prices provides valuable insights for stakeholders to understand how external events can influence the bank's stock performance. Such analysis can aid in developing risk mitigation strategies and managing the bank's reputation during challenging times.

**4. Model Selection:** The comparison of different predictive models (Linear Regression, Ridge Regression, Lasso Regression, KNN, and XGBoost) allows stakeholders to choose the most suitable model for their specific needs. Depending on factors such as interpretability, efficiency, and implementation ease, stakeholders can select the model that best fits their requirements.

**5. Performance Evaluation:** The evaluation metrics (MAE, MSE, RMSE, and MAPE) collected in the comparison table enable stakeholders to assess the effectiveness of the predictive models. By considering various metrics, they can gain a comprehensive understanding of the models' performance and select the one that aligns with their priorities.

**6. Investment Strategy:** Armed with accurate stock price predictions, stakeholders can devise better investment strategies. For example, they can use the models' output to optimize their trading strategies, portfolio allocations, and risk management plans.

**7. Financial Planning:** Yes Bank, as an organization, can also benefit from this project. The predictive models can aid in financial planning and forecasting. They can use the models to estimate future stock prices, which can inform their decision-making processes and strategic planning.


Overall, this project's findings and predictive models provide valuable tools and insights for stakeholders to make data-driven decisions related to investing in Yes Bank, managing risk, and optimizing their financial strategies. It can be a valuable resource for anyone involved in the bank's financial ecosystem, from individual investors to the bank's management team.