# **Project Name**    -
#**YES BANK STOCK CLOSING PRICE PREDICTION**


#**Project Type**    - REGRESSION
#**Contribution**    -Sanjana Nasa

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


##Yes Bank is a well-known bank in the Indian financial sector. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Given this, it is interesting to see how the case affected the company's stock price and whether time series models or other predictive models can handle such situations. The dataset contains the bank's monthly stock prices since its inception, including the opening, highest, lowest, and closing price for each month. The main objective is to predict the stock's closing price for each month.



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path='/content/drive/MyDrive/data_YesBank_StockPrices.csv'
data=pd.read_csv(path)


### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows=data.shape[0]
columns=data.shape[1]
print('rows:',rows)
print('columns:',columns)

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().value_counts()

There are no duplicate rows in the data.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

There are no null values either.

### What did you know about your dataset?

#The dataset contains the following features:

* **Date:** The month and year to which the prices apply.
* **Open:** The price at which the stock started trading that month.
* **High:** The maximum price during that month.
* **Low:** The minimum price during that month.
* **Close:** The final trading price for that month.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

## 3. ***Data Wrangling***

In [None]:
#converting "Date" from month-year strings ('%b-%y') to datetime; the day defaults to the first of the month
data['Date'] = pd.to_datetime(data['Date'], format='%b-%y')

In [None]:
#checking the change
data.head(2)

In [None]:
#setting 'Date' as index
data=data.set_index('Date')

In [None]:
data.head(2)

#**UNIVARIATE ANALYSIS**

#Dependent variable (target variable i.e. "Close")

In [None]:
#visualising distribution of 'Close'
plt.figure(figsize=(9,6))
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable plot
sns.histplot(data['Close'], kde=True, stat='density')

# Plotting the mean (green, solid) and the median (red, dashed).
plt.axvline(data['Close'].mean(),color='green',linewidth=2)      # axvline plots a vertical line at a value (mean in this case).
plt.axvline(data['Close'].median(),color='red',linestyle='dashed',linewidth=1.5)
plt.show()

#**CLOSING PRICE WITH DATE**

In [None]:
plt.figure(figsize=(12,7))
data['Close'].plot(color = 'r')
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='green')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('Closing Price with Date')
plt.show()

#We can see that the stock price rises until 2018, when the fraud case involving Rana Kapoor came to light, after which the price declines sharply.

#Independent variables

In [None]:
#list of independent variables
features=data[['Open','High','Low']]
print(features)

In [None]:
#plots for each of the independent variables
for var in features:
  plt.figure(figsize=(10,4))
  plt.subplot(1,2,1)
  sns.histplot(data[var].dropna(), kde=True, stat='density')   # replaces the deprecated sns.distplot

  # Plotting the mean (green, solid) and the median (red, dashed).
  plt.axvline(data[var].mean(),color='green',linewidth=2)      # axvline plots a vertical line at a value (mean in this case).
  plt.axvline(data[var].median(),color='red',linestyle='dashed',linewidth=1.5)

  plt.subplot(1,2,2)
  sns.boxplot(y=data[var])

  plt.show()

#From the above plots of both the dependent and independent variables, we can observe that the data is skewed, so a transformation is needed.

#Here I am going to apply a **log** transformation.
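
As a rough illustration of why a log transformation helps with skewed data, the sketch below compares the skewness of a series before and after applying `np.log10`. It uses synthetic, log-normally distributed values rather than the Yes Bank data, and the names `prices`, `skew_raw`, and `skew_log` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (assumption: log-normal, NOT the real dataset)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

skew_raw = prices.skew()            # clearly positive for right-skewed data
skew_log = np.log10(prices).skew()  # near 0 when the logged data is roughly symmetric

print(f"skewness before log10: {skew_raw:.2f}")
print(f"skewness after  log10: {skew_log:.2f}")
```

A skewness close to zero after the transformation suggests the logged values are roughly symmetric, which is what the plots in the next cells check visually.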

In [None]:
#target variable
plt.figure(figsize=(9,6))
sns.histplot(np.log10(data['Close']), kde=True, stat='density')   # replaces the deprecated sns.distplot
# Plotting the mean (green, solid) and the median (red, dashed) of the log-transformed values.
plt.axvline(np.log10(data['Close']).mean(),color='green',linewidth=2)      # axvline plots a vertical line at a value (mean in this case).
plt.axvline(np.log10(data['Close']).median(),color='red',linestyle='dashed',linewidth=1.5)
plt.show()

In [None]:
#independent features
for var in features:
  plt.figure(figsize=(10,4))
  plt.subplot(1,2,1)
  sns.histplot(np.log10(data[var].dropna()), kde=True, stat='density')   # replaces the deprecated sns.distplot

  # Plotting the mean (green, solid) and the median (red, dashed) of the log-transformed values.
  plt.axvline(np.log10(data[var]).mean(),color='green',linewidth=2)      # axvline plots a vertical line at a value (mean in this case).
  plt.axvline(np.log10(data[var]).median(),color='red',linestyle='dashed',linewidth=1.5)

  plt.subplot(1,2,2)
  sns.boxplot(y=np.log10(data[var]))

  plt.show()

#After the log transformation, the distributions look much closer to a normal distribution.
#The green line represents the mean, whereas the red dashed line represents the median.


#**BIVARIATE ANALYSIS**

In [None]:
#plotting dependent variable against each independent variable (scatter plot) and also checking the correlation between them

for var in features:
  fig=plt.figure(figsize=(10,6))
  ax=fig.gca()
  correlation=data[var].corr(data["Close"])   # Pearson correlation with the target
  plt.scatter(x=data[var],y=data['Close'])

  ax.set_title('Closing price - ' + var + ' correlation: ' + str(round(correlation,4)))
  plt.xlabel(var)
  plt.ylabel("Closing price")

  # fitting a straight line (degree-1 polynomial) to show the linear trend
  z=np.polyfit(data[var],data['Close'],1)
  y_hat=np.poly1d(z)(data[var])

  plt.plot(data[var],y_hat,'red')
  plt.show()


#We can see that all of our independent variables are highly correlated with the dependent variable.

#The relationship between the dependent and independent variables is also linear in nature.
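
To make the two quantities used above concrete, here is a minimal sketch of the Pearson correlation and the degree-1 `np.polyfit` line on synthetic data. The data and the names `x`, `y`, `slope`, and `intercept` are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic example (assumption): y is roughly 2*x + 1 plus some noise
rng = np.random.default_rng(1)
x = pd.Series(rng.uniform(10, 100, size=200))
y = 2.0 * x + 1.0 + rng.normal(0, 5, size=200)

r = x.corr(y)                                # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight line

print(f"correlation: {r:.3f}, slope: {slope:.2f}, intercept: {intercept:.2f}")
```

A correlation close to 1 together with a fitted line that tracks the points is the pattern the scatter plots above are checking for.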

In [None]:
#correlation plot
plt.figure(figsize=(10,5))
sns.heatmap(data.corr(), annot = True, cmap='coolwarm')
plt.show()

#From the heatmap above, we can clearly see that there is a very high correlation between each pair of features in our dataset. While it is desirable for the dependent variable to be highly correlated with the independent variables, the independent variables should ideally not be highly correlated with one another.

#**This gives rise to the problem of multicollinearity.**

In [None]:
# Checking for multicollinearity using VIF analysis.
# Calculating the VIF (Variance Inflation Factor) of each independent variable to quantify how strongly it is correlated with the others

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(data[[i for i in data.describe().columns if i not in ['Date','Close']]])

#As we can see, the VIF values are very high. However, since the dataset is so small and has just 3 independent features, multicollinearity is hard to avoid here, as dropping or combining features would lead to a loss of information.
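
For intuition about what a VIF measures, the sketch below computes the textbook definition by hand: each column is regressed on the remaining columns (with an intercept) and the VIF is 1 / (1 - R^2). The helper `vif_manual` and the synthetic `X_demo` are hypothetical names introduced for this example. The numbers may differ from the statsmodels output above, because, as far as I know, `variance_inflation_factor` does not add an intercept column on its own.

```python
import numpy as np

def vif_manual(X):
    """Textbook VIF: regress each column on the others (with intercept), return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # design matrix: intercept + all columns except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (assumption): two nearly identical columns plus one independent column
rng = np.random.default_rng(2)
a = rng.normal(size=300)
X_demo = np.column_stack([a, a + rng.normal(scale=0.05, size=300), rng.normal(size=300)])
vifs_demo = vif_manual(X_demo)
print(vifs_demo)
```

The two nearly identical columns get very large VIFs, while the independent column stays close to 1, which mirrors the situation with Open, High, and Low.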

#**DATA PREPROCESSING**

In [None]:
#creating input variables
Y=np.log10(data['Close']).values
X=np.log10(data[['Open','High','Low']]).values

#TRAIN TEST SPLIT

In [None]:
#splitting the data into training and test data
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,random_state=0)

In [None]:
X_train.shape

In [None]:
X_test.shape

#Scaling the data is important so that features with larger values are not given more weight. This is achieved by normalizing or standardizing the data; here the scaler is fitted on the training data only and then applied to the test data.
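
A minimal sketch of this fit-on-train, transform-on-test pattern is shown below on synthetic data. The arrays `train_demo` and `test_demo` and the object `scaler_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption, not the notebook's data)
rng = np.random.default_rng(3)
train_demo = np.column_stack([rng.normal(1000, 200, 100), rng.normal(1, 0.1, 100)])
test_demo = np.column_stack([rng.normal(1000, 200, 40), rng.normal(1, 0.1, 40)])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learn mean/std from the training data only
test_scaled = scaler_demo.transform(test_demo)        # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # per-column mean and std of the training set
print(test_scaled.mean(axis=0))                             # test means are near 0 but not exactly 0
```

Fitting the scaler on the training split alone keeps information about the test split out of the preprocessing step.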

In [None]:
# Scaling the data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#**LINEAR REGRESSION**

In [None]:
# importing LinearRegression model and the metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error


In [None]:
#initialising the model
model=LinearRegression()

In [None]:
#fitting the model on training data
model.fit(X_train,Y_train)

In [None]:
#predicting  values on test data
y_pred=model.predict(X_test)

#Model parameters

In [None]:
#model intercept
model.intercept_

In [None]:
#model coefficients
model.coef_

#EVALUATION

In [None]:
#defining mape
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100

In [None]:
#calculating performance metrics
MAE = mean_absolute_error((Y_test), (y_pred))
print("mean absolute error :" ,MAE)

MSE  = mean_squared_error((Y_test), (y_pred))
print("mean squared error :" , MSE)

RMSE = np.sqrt(MSE)
print("root mean squared error :" ,RMSE)

MAPE=mape(Y_test, y_pred)
print("mean absolute percentage error:" ,MAPE)
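
Because `Y` is `np.log10` of the closing price, the metrics above are expressed in log10 units rather than in price units. The sketch below shows one way to undo the transformation before computing errors. The arrays hold hypothetical, made-up values and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical log10-scale targets and predictions (illustrative values only)
y_true_log = np.log10(np.array([50.0, 120.0, 300.0, 15.0]))
y_pred_log = np.log10(np.array([55.0, 110.0, 330.0, 14.0]))

# Undo the log10 transform to get back to price units
y_true_price = 10 ** y_true_log
y_pred_price = 10 ** y_pred_log

mae_log = mean_absolute_error(y_true_log, y_pred_log)        # error in log10 units
mae_price = mean_absolute_error(y_true_price, y_pred_price)  # error in price units
mape_price = np.mean(np.abs((y_true_price - y_pred_price) / y_true_price)) * 100

print(f"MAE (log10 units): {mae_log:.4f}")
print(f"MAE (price units): {mae_price:.2f}")
print(f"MAPE (price units): {mape_price:.2f}%")
```

Reporting errors in price units can make the models easier to compare with the actual stock prices, while the log-scale metrics remain valid for ranking the models against each other.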




In [None]:
#creating dataframe to store all the evaluation metrics of each model
i=0
metric_df=pd.DataFrame()

In [None]:
#Inserting metrics of linear regression in dataframe created above
metric_df.loc[i,"Model_Name"]='Linear regression'
metric_df.loc[i,"MAE"]=round(MAE,4)
metric_df.loc[i,"MSE"]=round(MSE,4)
metric_df.loc[i,"RMSE"]=round(RMSE,4)
metric_df.loc[i,"MAPE"]=round(MAPE,4)

i+=1

In [None]:
#visualising actual and predicted values (both on the log10 scale)

plt.figure(figsize=(10,5))
plt.plot(y_pred)
plt.plot(np.array(Y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test sample index')
plt.ylabel('log10(Closing price)')
plt.show()

#**LASSO REGRESSION**

In [None]:
from sklearn.linear_model import Lasso

In [None]:
#fitting data
lasso  = Lasso(alpha=0.0001 , max_iter= 3000)
lasso.fit(X_train, Y_train)

In [None]:
# Printing the intercept
lasso.intercept_

In [None]:
#printing the coefficients
lasso.coef_

#Cross Validation and hyperparameter tuning
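
As a minimal sketch of how the grid search in the next cells works, the example below tunes `alpha` for a Lasso model on synthetic data. The data, the grid `grid_demo`, and the object `search` are assumptions for illustration. One detail worth noting is that with `scoring='neg_mean_squared_error'` the reported `best_score_` is the negative of the cross-validated MSE.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumption): 3 features, known linear relationship
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.1, size=120)

grid_demo = {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]}
search = GridSearchCV(Lasso(max_iter=5000), grid_demo,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["alpha"])
# best_score_ is the NEGATIVE mean squared error, so flip the sign to read it as an MSE
print("cross-validated MSE:", -search.best_score_)
```

After fitting, calling `predict` on the search object uses the model refitted with the best `alpha`, which is how the test-set predictions are produced later in this section.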

In [None]:
from sklearn.model_selection import GridSearchCV



In [None]:
lasso_param_grid={'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,0.005,0.006,0.007,0.01,0.015,0.02,1e-1,1,5,10,20,30,40,44,50]}

lasso_regressor = GridSearchCV(lasso, lasso_param_grid, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X_train, Y_train)


In [None]:
#getting the best parameter
lasso_regressor.best_params_

In [None]:
#getting the best score
lasso_regressor.best_score_

In [None]:
# Predicting on the test dataset.
y_pred_lasso = lasso_regressor.predict(X_test)
print(y_pred_lasso)

In [None]:
#evaluation metrics for checking the performance
MAE_lasso=mean_absolute_error((Y_test), (y_pred_lasso))
print("mean absolute error :" ,MAE_lasso)

MSE_lasso = mean_squared_error((Y_test), (y_pred_lasso))
print("mean squared error :" , MSE_lasso)

RMSE_lasso = np.sqrt(MSE_lasso)
print("root mean squared error :" ,RMSE_lasso)

MAPE_lasso=mape(Y_test, y_pred_lasso)
print("mean absolute percentage error:" ,MAPE_lasso)



In [None]:
#inserting lasso regression evaluation metrics in dataframe
metric_df.loc[i,"Model_Name"]='Lasso regression'
metric_df.loc[i,"MAE"]=round(MAE_lasso,4)
metric_df.loc[i,"MSE"]=round(MSE_lasso,4)
metric_df.loc[i,"RMSE"]=round(RMSE_lasso,4)
metric_df.loc[i,"MAPE"]=round(MAPE_lasso,4)


i=i+1

In [None]:
# plotting the predicted values vs actual (both on the log10 scale).
plt.figure(figsize=(9,5))
plt.plot(y_pred_lasso)
plt.plot(np.array(Y_test))
plt.legend(["Predicted","Actual"])
plt.ylabel("log10(Closing price)")
plt.title("Actual vs Predicted Closing price Lasso regression")
plt.show()

#**RIDGE REGRESSION**

In [None]:
#initialising the model
from sklearn.linear_model import Ridge

ridge=Ridge()
ridge_param_grid={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100]}

#cross validation
ridge_regressor = GridSearchCV(ridge, ridge_param_grid, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train,Y_train)


In [None]:
# finding the best parameter value (for alpha)
ridge_regressor.best_params_

In [None]:
# predicting on the test dataset now.
y_pred_ridge = ridge_regressor.predict(X_test)


In [None]:
# evaluating performance.
MAE_ridge = mean_absolute_error(Y_test,y_pred_ridge)
print(f"Mean Absolute Error : {MAE_ridge}")

MSE_ridge  = mean_squared_error(Y_test,y_pred_ridge)
print("Mean squared Error :" , MSE_ridge)

RMSE_ridge = np.sqrt(MSE_ridge)
print("Root Mean squared Error :" ,RMSE_ridge)

MAPE_ridge=mape(Y_test,y_pred_ridge)
print("mean absolute percentage error",MAPE_ridge)



In [None]:
#inserting ridge regression evaluation metrics in dataframe
metric_df.loc[i,"Model_Name"]='Ridge regression'
metric_df.loc[i,"MAE"]=round(MAE_ridge,4)
metric_df.loc[i,"MSE"]=round(MSE_ridge,4)
metric_df.loc[i,"RMSE"]=round(RMSE_ridge,4)
metric_df.loc[i,"MAPE"]=round(MAPE_ridge,4)

i=i+1

In [None]:
# Plotting predicted and actual target variable values (both on the log10 scale).
plt.figure(figsize=(9,5))
plt.plot(y_pred_ridge)
plt.plot(Y_test)
plt.legend(["Predicted","Actual"])
plt.ylabel("log10(Closing price)")
plt.title("Actual vs Predicted Closing price Ridge regression")
plt.show()

#**ELASTIC NET REGRESSION**
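
Elastic Net mixes the L1 (Lasso) and L2 (Ridge) penalties, and `l1_ratio` controls the mix. The sketch below, on synthetic data with hypothetical names such as `alpha_demo` and `enet_mix`, illustrates that `l1_ratio=1.0` reproduces the Lasso fit, while smaller values move part of the penalty to the L2 term.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumption): 3 features with a known linear relationship
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=150)

alpha_demo = 0.05
enet_l1 = ElasticNet(alpha=alpha_demo, l1_ratio=1.0).fit(X_demo, y_demo)   # pure L1 penalty
lasso_demo = Lasso(alpha=alpha_demo).fit(X_demo, y_demo)                   # same objective as above
enet_mix = ElasticNet(alpha=alpha_demo, l1_ratio=0.5).fit(X_demo, y_demo)  # half L1, half L2

print("ElasticNet (l1_ratio=1):  ", enet_l1.coef_)
print("Lasso:                    ", lasso_demo.coef_)
print("ElasticNet (l1_ratio=0.5):", enet_mix.coef_)
```

This is why the parameter grid below searches over both `alpha` (overall penalty strength) and `l1_ratio` (the balance between the two penalties).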

In [None]:
# importing and initializing Elastic-Net Regression.
from sklearn.linear_model import ElasticNet
elasticnet_model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# initializing parameter grid.
elastic_net_param_grid = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,0.001,0.01,0.02,0.03,0.04,1,5,10,20,40,50,60,100],
                          'l1_ratio':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}

# cross-validation.
elasticnet_regressor = GridSearchCV(elasticnet_model, elastic_net_param_grid, scoring='neg_mean_squared_error',cv=5)
elasticnet_regressor.fit(X_train, Y_train)

In [None]:
# finding the best parameter
elasticnet_regressor.best_params_


In [None]:
# finding the best score for the optimal parameter.
elasticnet_regressor.best_score_

In [None]:
# making the predictions on test data
y_pred_elastic_net = elasticnet_regressor.predict(X_test)


In [None]:
#evaluation metrics (evaluation performance of the model)

MAE_elastic_net = mean_absolute_error(Y_test,y_pred_elastic_net)
print(f"Mean Absolute Error : {MAE_elastic_net}")

MSE_elastic_net  = mean_squared_error(Y_test,y_pred_elastic_net)
print("Mean squared Error :" , MSE_elastic_net)

RMSE_elastic_net = np.sqrt(MSE_elastic_net)
print("Root Mean squared Error :" ,RMSE_elastic_net)

MAPE_elastic_net=mape(Y_test,y_pred_elastic_net)
print("Mean absolute percentage error :",MAPE_elastic_net)

In [None]:
#inserting elastic net regression evaluation metrics in dataframe
metric_df.loc[i,"Model_Name"]='Elastic Net regression'
metric_df.loc[i,"MAE"]=round(MAE_elastic_net,4)
metric_df.loc[i,"MSE"]=round(MSE_elastic_net,4)
metric_df.loc[i,"RMSE"]=round(RMSE_elastic_net,4)
metric_df.loc[i,"MAPE"]=round(MAPE_elastic_net,4)

In [None]:
# Now let us plot the actual and predicted target variable values (both on the log10 scale).
plt.figure(figsize=(9,5))
plt.plot(y_pred_elastic_net)
plt.plot(Y_test)
plt.legend(["Predicted","Actual"])
plt.ylabel("log10(Closing price)")
plt.title("Actual vs Predicted Closing price Elastic Net regression")
plt.show()

#Comparing the performance of all models

In [None]:
metric_df

#The Linear regression model performed best, with the lowest MAPE.

# **Conclusion**

# * Visualizing the target variable clearly shows the impact of the 2018 fraud case involving Rana Kapoor: the stock price declines dramatically during that period.

# * We implemented several models to predict the closing price and found that all of them perform well, with Linear Regression being the best-performing model.

# * There is a high correlation between the dependent and independent variables. This indicates that the dependent variable depends strongly on our features and can be predicted accurately from them.