# **Project Name**    - **Yes Bank Stock Closing Price Prediction**



# **Problem Statement**


Yes Bank is a well-known bank in the Indian financial sector. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. It is therefore interesting to see how this affected the company's stock price, and whether time series models or other predictive models can do justice to such situations. This dataset contains the bank's monthly stock prices since its inception and includes the closing, opening, highest, and lowest stock prices for every month.

**The main objective is to predict the stock's closing price of the month.**


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/capstone_project_02/data_YesBank_StockPrices.csv'
dataset = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
dataset

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape 

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

### The given dataset has 185 observations and 5 features, with no missing values and no duplicate rows. The Date feature is of object type; the closing, opening, highest and lowest stock prices are float features.
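The individual checks above can also be bundled into one helper, which is convenient if the CSV is reloaded later. This is only a sketch: `summarize_frame` is a hypothetical helper name, and the small `demo` frame is made-up data standing in for the real file.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the basic structural checks in one place (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Made-up stand-in data, NOT the Yes Bank file
demo = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05"],
    "Open": [10.0, 11.0, 12.0],
    "High": [11.5, 12.5, 13.5],
    "Low": [9.5, 10.5, 11.5],
    "Close": [11.0, 12.0, 13.0],
})
summary = summarize_frame(demo)
print(summary)
```

In the notebook itself, the equivalent call would be `summarize_frame(dataset)`.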



## ***2. Understanding Your Variables***

In [None]:
# Convert date column to a proper datetime datatype yyyy-mm-dd.
from datetime import datetime
dataset['Date'] = pd.to_datetime(dataset['Date'].apply(lambda x: datetime.strptime(x, "%b-%y")))

In [None]:
dataset.head()

In [None]:
# Dataset Describe
dataset.describe()

### Variables Description 

As the dataset above shows, all the price variables are quantitative, i.e. they hold numerical values. There is no categorical data present.



*   Date :- The date (month and year provided).
*   Open :- The price of the stock at the beginning of a particular time period.
*   Close :- The trading price at the end of the period (in this case, the end of the month).
*   High :- The maximum price at which the stock traded during the period.
*   Low :- The lowest price at which the stock traded during the period.

The main objective is to predict the stock's closing price of the month, so the closing price is treated as the dependent variable, whereas the rest of the features are independent variables.

In [None]:
# Set the Date column as the index, as we need to track variation in the stock price across dates.
dataset.set_index('Date', inplace = True)

In [None]:
dependent_variable = 'Close'
# A list comprehension keeps the original column order (a set does not guarantee it).
independent_variables = [col for col in dataset.columns if col != dependent_variable]
independent_variables

## ***Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

**Assumptions:**

The multiple regression model is based on the following assumptions:


*   **Linearity:** There should be a linear relationship between the dependent and independent variables.

*   **Normality:** Residuals should be normally distributed with mean zero and constant variance σ².


*   **Homoscedasticity:** The variance of the residuals around the regression line is the same for all values of the predictor variables.

*   **Multicollinearity:** The independent variables should not be highly correlated with one another.

First, we need to check these assumptions.
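One common way to put a number on the multicollinearity assumption is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below is a minimal NumPy version, not the method used in this notebook: `vif` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column of X (hypothetical helper).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept.
    """
    n, k = X.shape
    factors = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors

# Synthetic predictors (an assumption, not the Yes Bank data)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = np.column_stack([
    base + rng.normal(scale=0.1, size=200),   # nearly a copy of `base`
    base + rng.normal(scale=0.1, size=200),   # nearly a copy of `base`
    rng.normal(size=200),                     # unrelated column
])
vif_values = vif(X_demo)
print(vif_values)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. On this notebook's data the call would be `vif(dataset[independent_variables].to_numpy())`.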




#### Charts:-

In [None]:
# plot dependent variable:
plt.rcParams['figure.figsize']=(10,5)
plt.plot(dataset['Close'], color= 'r')
plt.title('Closing price with date')
plt.xlabel('Date')
plt.ylabel('Close')
plt.grid(linestyle=':', linewidth=0.5, color='b')  # display and customize gridlines on the plot
plt.show()

We can see that the Yes Bank stock price kept rising until 2018.
Once the bank was in the news because of the fraud case involving Rana Kapoor, the stock price declined rapidly. This suggests that the news impacted the company's stock price.

#### Charts:-

To check the linearity between the dependent variable and the independent variables, we plot scatter plots.

In [None]:
# Scatter plots
for i in independent_variables:
  plt.scatter(dataset[i], dataset['Close']) 
  plt.title(f"{i} v/s Close")
  plt.xlabel(f'{i}')
  plt.ylabel("Close")
  plt.show()


The scatter plots above show that the independent variables Low, Open and High each have a linear relationship with the dependent variable Close.
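The visual impression of linearity can be backed with a number, for example the Pearson correlation coefficient and a least-squares straight line. The sketch below uses synthetic `x` and `y` arrays (an assumption, not the notebook's columns) purely to show the calls.

```python
import numpy as np

# Synthetic data (an assumption): y is roughly linear in x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(10, 300, size=150)
y = 0.98 * x + rng.normal(scale=5.0, size=150)

# Pearson correlation coefficient between predictor and target
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With the real data, `x` would be one of `dataset[i]` for `i` in `independent_variables` and `y` would be `dataset['Close']`.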

#### Charts:-

To understand the distribution of the features, we plot histograms with a kernel density estimate, which gives an idea of how the data is distributed.

In [None]:
# Histogram
for i in dataset.columns:
  plt.figure(figsize=(10,6))
  sns.histplot(dataset[i], kde=True)  # distplot is deprecated in recent seaborn releases
  plt.xlabel(f"{i}")

  # axvline adds a vertical line across the axes; used here to mark the mean and the median.
  plt.axvline(dataset[i].mean(),color='green',linestyle='dashed', linewidth=1)  # vertical line at the mean.
  plt.axvline(dataset[i].median(),color='red',linestyle='dashed',linewidth=1)   # vertical line at the median.
  plt.show()
  

From the above graph, it is observed that:



*   All the features have positively skewed distributions.

*   The mean is greater than the median, i.e. Mean > Median.
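The same observation can be checked numerically rather than read off the plots. The following sketch uses a synthetic right-skewed sample (an assumption, not the Yes Bank prices) to show that a positive sample skewness goes together with a mean above the median.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (an assumption, not the Yes Bank data)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500), name="price")

skewness = prices.skew()          # sample skewness; > 0 means a longer right tail
mean_value = prices.mean()
median_value = prices.median()

print(f"skewness = {skewness:.2f}, mean = {mean_value:.2f}, median = {median_value:.2f}")
```

On the notebook's data, `dataset.skew()` returns one skewness value per column.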
  





#### Charts:-

We use box-and-whisker plots to visualize the five-number summary and to detect outliers.

In [None]:
# Box-Whisker plot.
for i in dataset.columns:
  plt.figure(figsize=(8,5))
  sns.boxplot(x=dataset[i])  # pass the column as x explicitly for a horizontal box
  plt.xlabel(f'{i}', fontsize=13)
  plt.show()



It is observed that some outliers are present in the given dataset.
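By default, box plots in seaborn and matplotlib draw points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule, assuming a hypothetical helper `iqr_outlier_mask` and made-up values, could look like this:

```python
import numpy as np

def iqr_outlier_mask(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up values (an assumption): one value is far larger than the rest
sample = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 10.5, 95.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

Applied per column, for example `iqr_outlier_mask(dataset['Close'].to_numpy()).sum()`, this gives a count of the points the box plot would flag.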

####  Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(16,8))
sns.heatmap(dataset.corr(), cmap='coolwarm', annot=True)
plt.show()

#### Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(dataset)  # the DataFrame in this notebook is named `dataset`, not `df`
plt.show()

To examine the pairwise relationships among the variables Close, High, Open and Low, a pair plot is used. The scatter plots show the joint relationship between each pair of variables, whereas the histograms on the diagonal show the univariate distribution of each variable.

#### Charts:-