<a href="https://colab.research.google.com/github/rohitp5551/Yes-Bank-Stock-Closing-Price-Prediction/blob/main/Regression_Model_on_Yes_Bank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Name**            - Rohit Patil
##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

The project focused on predicting Yes Bank's monthly closing stock price from historical data. The dataset was checked for missing and duplicate values (none were found), the Date column was converted to a datetime type and split into month and year features, and outliers were removed with the IQR method. The numerical features were then standardized for modeling.

A Ridge regression model was selected for its ability to manage multicollinearity and prevent overfitting. Hyperparameter tuning using GridSearchCV identified the best model with an optimal alpha value of 0.1.

The model demonstrated strong performance during evaluation on held-out test data. It achieved a high R-squared score of 0.98 before and 0.97 after hyperparameter tuning. Mean squared error (MSE) values were also low, at 39.03 before and 55.98 after tuning.

Exploratory Data Analysis (EDA) provided insights into stock price trends and relationships between variables. Visualizations such as line plots and scatter plots facilitated understanding of data patterns and influential factors affecting stock prices.

In summary, the project successfully applied machine learning techniques, specifically Ridge regression, to predict stock prices. Rigorous data preprocessing, effective model selection through hyperparameter tuning, and comprehensive evaluation metrics underscored the project's methodology and findings. Future work could explore additional models or advanced feature engineering techniques to further enhance predictive capabilities.

# **GitHub Link -**

https://github.com/rohitp5551/Yes-Bank-Stock-Closing-Price-Prediction

# **Problem Statement**


The project aims to predict stock prices using historical data, focusing on developing a machine learning model that accurately forecasts the closing price from features such as Open, High, and Low. Key tasks include thorough data cleaning, preprocessing (including numerical scaling), testing multiple models (Linear, Lasso, and Ridge regression), and optimizing model performance through hyperparameter tuning. Evaluation is based on metrics such as R-squared (R2), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE), with insights from Exploratory Data Analysis (EDA) informing model selection and feature engineering decisions.








# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split,cross_val_score,KFold
from sklearn.linear_model import LinearRegression,Lasso,Ridge
# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score,root_mean_squared_error


### Dataset Loading

In [None]:

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df=pd.read_csv('/content/drive/MyDrive/module 6/data_YesBank_StockPrices.csv')


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull())
plt.show()

### What did you know about your dataset?

The dataset contains monthly Yes Bank stock prices: 185 rows and 5 columns. There are no duplicate or missing values. The only issue is the Date column, which is stored as an object type rather than a datetime type.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description



*   Date: Month and year of the record
*   Open: Opening price of the stock for the month
*   High: Highest price of the stock during the month
*   Low: Lowest price of the stock during the month
*   Close: Closing price of the stock for the month



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

In [None]:
#Checking Outliers
sns.boxplot(data=df[["Open","High","Low","Close"]])
plt.title("Box Plot of Open,High,Low,Close")
plt.ylabel("Price")
plt.show()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df['Date']=pd.to_datetime(df['Date'],format='%b-%y')
df["Month_number"]=df["Date"].dt.month
df["Year"]=df["Date"].dt.year
df.drop("Date",axis=1,inplace=True)
df.head()

### What all manipulations have you done and insights you found?



*   Converted the Date column to a datetime datatype
*   Extracted the month and year numbers from the Date column
*   Dropped the Date column after extracting the month and year
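
The date handling above can be illustrated on a tiny hand-made frame. This is only a sketch: the rows below are hypothetical values, chosen to show how the `%b-%y` format parses strings such as `Jul-05` (it assumes an English locale for month abbreviations).

```python
import pandas as pd

# Hypothetical sample rows in the same 'Mon-YY' layout as the Date column
sample = pd.DataFrame({"Date": ["Jul-05", "Aug-05", "Jan-06"],
                       "Close": [12.46, 13.42, 15.33]})

# Parse the month abbreviation and two-digit year into a datetime
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Derive numeric month and year features, then drop the raw date
sample["Month_number"] = sample["Date"].dt.month
sample["Year"] = sample["Date"].dt.year
sample = sample.drop(columns="Date")

print(sample)
```

Dropping the date after extraction keeps every remaining column numeric, which is what the correlation heatmap and the regression models expect.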






## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **Chart-1 Distribution of monthly price over Year**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10,7))
sns.lineplot(x="Month_number",y="Close",hue="Year",data=df,marker="o",palette='hls')
plt.title("Distribution of price monthly")
plt.xlabel("Month")
plt.ylabel("Price")
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

This line plot was chosen to visualize the monthly closing prices of stocks over different years. It helps in understanding how the closing prices fluctuate month by month across multiple years, highlighting trends and patterns over time.

##### 2. What is/are the insight(s) found from the chart?

From the chart, insights can include identifying seasonal trends in stock prices. For example, consistent peaks or dips in certain months across multiple years could indicate recurring market behaviors. It also shows if there are any outlier months where prices significantly deviate from the general trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Yes, these insights can be valuable for strategic decision-making in trading or investment. Understanding seasonal patterns can help in timing buy or sell decisions, optimizing portfolio management, and managing risk more effectively.
*   Insights indicating negative growth might include prolonged periods of declining closing prices across all or specific months over the years. This trend could imply economic downturns, sector-specific issues, or company-specific challenges affecting stock performance negatively. Identifying such trends early can prompt proactive measures to mitigate risks or adjust investment strategies accordingly.



## **Chart-2 Comparing Average high and low price over year**

In [None]:
# Chart - 2 visualization code
avg_prices_by_year = df.groupby('Year').agg({'High': 'mean', 'Low': 'mean'})

# Create a bar chart to compare the average high and low prices over the years
plt.figure(figsize=(10, 6))
plt.bar(avg_prices_by_year.index, avg_prices_by_year['High'], label='Average High Price', alpha=0.7)
plt.bar(avg_prices_by_year.index, avg_prices_by_year['Low'], label='Average Low Price', alpha=0.7)
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.title('Comparison of Average High and Low Prices Over Years')
plt.legend()
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart to compare the average high and low prices over the years because it visually represents the price range for each year effectively, enabling us to identify potential trends or patterns in the stock's price movement during those years.




##### 2. What is/are the insight(s) found from the chart?

From the chart, we can infer whether the overall price volatility has increased or decreased over time. It also allows us to observe if the difference between the average high and average low prices has widened or narrowed, which can be an indication of increased or decreased market uncertainty or risk.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 * Yes, insights from this chart can help businesses and investors identify periods of heightened volatility, which can be crucial for making informed trading or investment decisions.
 * Insights about increased volatility can signify higher risk and might lead to negative growth. For example, significant widening of the average high and low prices over time might suggest increasing market instability or an overall decline in the company's performance. This could cause investor confidence to decrease, leading to potential negative growth and a decline in the company's value.


## **Chart-3 Relation Between Open and Close over Year**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
for year in df['Year'].unique():
    year_data = df[df['Year'] == year]
    plt.scatter(year_data['Open'], year_data['Close'], label=f'Year {year}')
plt.xlabel('Open Price')
plt.ylabel('Close Price')
plt.title('Relationship Between Open and Close Prices by Year')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot showing the relationship between open and close prices by year was chosen because it visually represents how these two key price points correlate and vary across different years. It helps in understanding if there's a consistent pattern or trend in how stocks open and close over time.

##### 2. What is/are the insight(s) found from the chart?

Insights from the chart include identifying if there's a strong linear relationship between open and close prices each year. It helps in spotting any outliers where stocks opened significantly higher or lower compared to their closing prices, indicating intraday volatility or market sentiment shifts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Yes, these insights are crucial for traders and investors in making informed decisions. Understanding the relationship between open and close prices can assist in predicting price movements throughout the trading day, optimizing entry or exit points, and managing risk more effectively.
*   Insights indicating negative growth might include years where there's a noticeable divergence between open and close prices, especially if closing prices consistently fall below opening prices across multiple years. This divergence could signify bearish market conditions, economic downturns, or company-specific challenges affecting stock performance negatively. Recognizing such trends allows stakeholders to adjust strategies, hedge risks, or explore alternative investment opportunities during periods of market decline.



## **Chart - 4 Closing and opening Price over Year**

In [None]:
# Chart - 4 visualization code
avg_close_per_year=df.groupby("Year").agg({"Close":"mean"})
avg_open_per_year=df.groupby("Year").agg({"Open":"mean"})
plt.figure(figsize=(10,6))
plt.plot(avg_close_per_year,marker="o")
plt.plot(avg_open_per_year,marker="o")
plt.title("Closing and Opening Price Over Year")
plt.xlabel("Year")
plt.ylabel("Average Price")
plt.legend(["Closing Price","Opening Price"])
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?


 Line plots are preferred in the above chart because they effectively depict the trend and fluctuations of average closing and opening prices over the years.
 The line plot provides a clear visualization of how the prices changed from year to year, highlighting the overall trend and any significant changes in average values.


##### 2. What is/are the insight(s) found from the chart?

The chart shows the trend of the stock's average closing and opening prices across years. Significant changes in these trends can signal potential turning points in the stock's performance.
These might indicate shifts in market conditions or changes in investor sentiment.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart allows us to visualize the general direction of the stock's performance over time.
It helps in identifying if the stock is experiencing long-term growth, decline, or stagnation.
These insights are important for investment decisions, resource allocation, and overall business strategy.

## **Chart-5 Correlation**

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(data=df.corr(),annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a correlation heatmap to visualize the relationships between numerical variables in the dataset. It helps in identifying patterns of linear correlation between different features (Open, High, Low, Close, Month_number, Year). This information is crucial for feature selection in predictive models and understanding how variables are interconnected.


##### 2. What is/are the insight(s) found from the chart?

The heatmap indicates the strength and direction of the linear correlation between each pair of features. For example, we can observe if "Open" and "Close" have a strong positive correlation, meaning that when one increases, the other tends to increase as well. It also shows if there's a negative correlation, meaning that as one variable increases, the other tends to decrease.


## **Chart-6 Pair Plot**

In [None]:
# Pair Plot visualization code
# pairplot creates its own figure, so size it with `height` instead of plt.figure
sns.pairplot(data=df,height=2)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot because it provides a comprehensive overview of the relationships between multiple numerical variables in the dataset in a single visualization. It creates a matrix of scatter plots for each pair of variables, displaying their distribution and potential correlation. This helps in quickly spotting potential trends, patterns, and outliers across different feature combinations.


##### 2. What is/are the insight(s) found from the chart?

The pair plot helps in understanding the relationship between every combination of numerical features in the dataset. For example, it can visualize if there is a linear relationship between "Open" and "Close" or if "High" and "Low" tend to move together. It also helps in identifying if there are any outliers in specific features or combinations of features.


## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

### 2. Handling Outliers

In [None]:
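# Handling Outliers & Outlier treatments
# Keep only rows that lie within Q1 - 1.5*IQR and Q3 + 1.5*IQR for each price column.
# The frame is filtered column by column, so later quartiles are computed on the already-filtered rows.
columns=["Open","High","Low","Close"]
for i in columns:
  q3=df[i].quantile(0.75)
  q1=df[i].quantile(0.25)
  iqr=q3-q1
  upper_limit=q3+1.5*iqr
  lower_limit=q1-1.5*iqr
  df=df[(df[i]<=upper_limit) & (df[i]>=lower_limit)]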


In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(data=df[["Open","High","Low","Close"]])
plt.title("Box Plot of Open,High,Low,Close")
plt.show()

In [None]:
df.shape

##### What all outlier treatment techniques have you used and why did you use those techniques?

**IQR Method**

The technique used here is Interquartile Range (IQR) outlier removal. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a column. Outliers are values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. For each price column (Open, High, Low, Close), the loop keeps only the rows whose value lies inside this range, so any row with an outlier in at least one of these columns is dropped. This filters out extreme values that could otherwise skew statistical analysis or model performance.
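
As a rough illustration of the IQR rule, the sketch below wraps it in a small helper. The function name `iqr_filter`, its parameter `k`, and the toy prices are hypothetical and not part of the notebook. Unlike the loop above, which recomputes the quartiles on the already-filtered frame for each column, this variant computes all limits once on the full data.

```python
import pandas as pd

def iqr_filter(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends by default
        keep &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Hypothetical prices with one obvious extreme value
toy = pd.DataFrame({"Close": [10, 11, 12, 13, 14, 15, 200]})
filtered = iqr_filter(toy, ["Close"])

print(filtered["Close"].tolist())
```

Here Q1 is 11.5 and Q3 is 14.5, so the accepted range is 7 to 19 and only the value 200 is removed.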

### 3. Data Scaling

In [None]:
x=df.drop(columns="Close")
y=df["Close"]


In [None]:
x

In [None]:
y

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
x_trans=ss.fit_transform(x)


In [None]:
x_trans

##### Which method have you used to scale you data and why?

The method used here is StandardScaler from sklearn.preprocessing. StandardScaler subtracts each feature's mean and divides by its standard deviation, so every feature ends up with a mean of 0 and a standard deviation of 1. Putting features on a common scale matters most for models whose fit depends on feature magnitude, such as the regularized regressions used here (Lasso and Ridge), whose penalty is applied to the coefficients, as well as support vector machines and models trained with gradient descent. Ordinary linear regression predictions are not affected by this scaling, but standardized coefficients are easier to compare across features.
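
A minimal sketch of what the scaler computes, using a hypothetical four-row matrix rather than the notebook's data. It compares `StandardScaler` with the manual formula z = (x - mean) / std, where std is the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[10.0, 2005.0],
              [20.0, 2006.0],
              [30.0, 2007.0],
              [40.0, 2008.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```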

### 4. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test=train_test_split(x_trans,y,test_size=0.2,random_state=42)



##### What data splitting ratio have you used and why?

The data splitting ratio used here is 80% training data and 20% testing data, specified by test_size=0.2 in the train_test_split function. This ratio is commonly chosen to allocate a significant portion of the data for training the machine learning model (80%), while reserving a smaller portion for evaluating its performance (20%).

The rationale behind this ratio is to ensure that the model learns patterns and relationships from a sufficiently large dataset during training, which helps in achieving better generalization and performance on unseen data. The test set serves as an independent dataset to assess how well the model can generalize to new, unseen data points, thus providing a measure of its predictive capability and robustness. This approach helps in detecting overfitting and ensures that the model's performance estimates are reliable.
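
One caveat worth noting: in this notebook the scaler is fitted on all rows before the split, so statistics from the test rows influence the scaling. A common alternative, sketched below on synthetic data (the arrays and coefficients are made up for illustration and this is not what the notebook does), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=25.0, size=(100, 3))                 # synthetic features
y = X @ np.array([0.2, 0.5, 0.3]) + rng.normal(scale=1.0, size=100)  # synthetic target

# 80/20 split, the same ratio as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on training rows only, then reuse its statistics for the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the test set stays fully unseen until evaluation, which makes the reported test metrics a cleaner estimate of performance on new data.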

### 5. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
x_train.shape,y_train.shape

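Class imbalance is a concern for classification problems, where some target classes have far fewer samples than others. This is a regression problem with a continuous target (Close), so class imbalance does not apply. The shape check above only confirms that the features and the target have the same number of rows.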

## ***6. ML Model Implementation***

### ML Model - 1 (Linear Regression)

In [None]:
# ML Model - 1 Implementation
lr=LinearRegression()

# Fit the Algorithm
lr.fit(x_train,y_train)
# Predict on the model
y_test_pred=lr.predict(x_test)


In [None]:
#Checking Score of the model
lr.score(x_test,y_test)*100

In [None]:
lr.coef_

In [None]:
plt.figure(figsize=(9,5))
plt.bar(x.columns,lr.coef_)
plt.title("Linear Regression")
plt.xlabel("Columns")
plt.ylabel("Coefficient")
plt.show()

In [None]:
sample1=np.array([[13.48,14.87,12.27,9,2005]])
sample2=np.array([[13.20,14.47,12.40,10,2005]])
sample_t1=ss.transform(sample1)
sample_t2=ss.transform(sample2)
print(lr.predict(sample_t1))
print(lr.predict(sample_t2))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10,7))
plt.plot(y_test.values,label="Actual_price")
plt.plot(y_test_pred,label="Predicted_price")
plt.title("Actual vs Predicted Price")
plt.legend()
plt.show()

#### Evaluation

In [None]:
# Evaluation of the model on test data

mse_test=mean_squared_error(y_test,y_test_pred)
rmse_test=root_mean_squared_error(y_test,y_test_pred)
mae_test=mean_absolute_error(y_test,y_test_pred)
r_square_test=r2_score(y_test,y_test_pred)

evaluation_df=pd.DataFrame({"Metric":["MSE","RMSE","MAE","R2"],"Test Values":[mse_test,rmse_test,mae_test,r_square_test]})
evaluation_df

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross Validation
kfold=KFold(n_splits=5,shuffle=True,random_state=42)


cv=cross_val_score(lr,x,y,cv=kfold)
print(f"Cross-validation R2 scores of Linear Regression = {cv}")

mean_accuracy=round(sum(cv)/len(cv)*100,2)
print(f"Mean cross-validation R2 (%) of Linear Regression = {mean_accuracy}")

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

With the Linear Regression model, a single train-test split gives a score of 97.34%, whereas 5-fold cross-validation gives a mean score of 98.47%.
Cross-validation is generally more reliable than a single train-test split.
In k-fold cross-validation the data is split into k subsets (folds). In each round one fold is used as test data and the remaining folds are used to train the model, with a different fold held out each time. Testing the model on every part of the data gives more confidence that it will work well on unseen data.
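
The k-fold procedure can be sketched as follows. The data is synthetic (generated with `make_regression`) and the pipeline is an assumption added for illustration: placing the scaler inside a pipeline means each fold is scaled using only its own training part.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the stock features
X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=42)

# Scaling lives inside the pipeline, so it is refit on the training part of every fold
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

print(scores)         # one R2 value per fold
print(scores.mean())  # mean R2 across the five folds
```

For a regressor the values returned by `cross_val_score` are R2 scores rather than accuracies, so the mean is best read as an average R2.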

### ML Model - 2 Lasso Regression

In [None]:
# ML Model - 2 Implementation
la=Lasso()
la.fit(x_train,y_train)
la.score(x_test,y_test)*100

In [None]:
y_pred1=la.predict(x_test)

In [None]:
y_pred1

In [None]:
la.coef_

In [None]:
plt.figure(figsize=(8,5))
plt.bar(x.columns,la.coef_)
plt.title("Lasso Regression")
plt.xlabel("Columns")
plt.ylabel("Coefficient")
plt.show()

In [None]:
sample3=np.array([[16.20,20.95,16.02,3,2006]])
sample4=np.array([[20.56,20.80,18.02,4,2006]])
sample_t3=ss.transform(sample3)
sample_t4=ss.transform(sample4)
print(la.predict(sample_t3))
print(la.predict(sample_t4))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10,6))
plt.plot(y_test.values,label="Actual Price")
plt.plot(y_pred1,label="Predicted Price")
plt.title("Actual vs Predicted Price")
plt.legend()
plt.show()

#### Evaluation

In [None]:
# Evaluation of model on test data
mse1=mean_squared_error(y_test,y_pred1)
rmse1=root_mean_squared_error(y_test,y_pred1)
mae1=mean_absolute_error(y_test,y_pred1)
r_square1=r2_score(y_test,y_pred1)

evaluation_df_lasso=pd.DataFrame({"Metric":["MSE","RMSE","MAE","R2"],"Test Values":[mse1,rmse1,mae1,r_square1]})
evaluation_df_lasso

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross Validation
kfold=KFold(n_splits=5,shuffle=True,random_state=42)
cv2=cross_val_score(la,x,y,cv=kfold)
print(f"Cross-validation R2 scores of Lasso regression = {cv2}")

mean_accuracy2=round(sum(cv2)/len(cv2)*100,2)
print(f"Mean cross-validation R2 (%) of Lasso regression = {mean_accuracy2}")

In [None]:
la.get_params()

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameter={"alpha":[0.01,0.1,1,10,100]}

In [None]:
gs=GridSearchCV(la,parameter,cv=5,scoring="r2")

In [None]:
gs.fit(x_train,y_train)

In [None]:
best_params=gs.best_params_
best_params

In [None]:
highest=gs.best_score_
highest_score=(highest)*100
highest_score

In [None]:
model_best_estimator=gs.best_estimator_

In [None]:
y_test_pred_gs=model_best_estimator.predict(x_test)

In [None]:
y_test_pred_gs

##### Which hyperparameter optimization technique have you used and why?


The hyperparameter optimization technique used here is GridSearchCV. It evaluates every combination in a predefined grid of hyperparameter values with cross-validation and selects the combination with the best mean score (here R2). It is particularly effective when the number of hyperparameters to tune is small and the grid size is manageable, as it provides an exhaustive search over a limited space.
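
A small sketch of the grid search on synthetic data, using the same alpha grid as the notebook. The data, the `max_iter` setting, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the scaled stock features
X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same alpha grid as the notebook: five candidates, each scored with 5-fold CV
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print(search.best_params_)            # alpha with the highest mean CV score
print(search.best_score_)             # mean cross-validated R2 on the training folds
print(search.score(X_test, y_test))   # R2 of the refit best model on the test set
```

Note that `best_score_` is a cross-validated score on the training data, while `score(X_test, y_test)` is measured on the held-out test set, so the two numbers are not directly comparable.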

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

For Lasso regression, the model scores 98.75% after the train-test split, whereas the tuned model scores 98.58%.

In [None]:
mse_test_gs=mean_squared_error(y_test,y_test_pred_gs)
rmse_test_gs=root_mean_squared_error(y_test,y_test_pred_gs)
mae_test_gs=mean_absolute_error(y_test,y_test_pred_gs)
r2_test_gs=r2_score(y_test,y_test_pred_gs)

evaluation_df_lasso_gs=pd.DataFrame({"Metric":["MSE","RMSE","MAE","R2"],"Test Values":[mse_test_gs,rmse_test_gs,mae_test_gs,r2_test_gs]})
evaluation_df_lasso_gs

### ML Model - 3 Ridge Regression

In [None]:
# ML Model - 3 Implementation
rd=Ridge()

# Fit the Algorithm
rd.fit(x_train,y_train)

# Predict on the model
y_pred2=rd.predict(x_test)

In [None]:
rd.score(x_test,y_test)*100

In [None]:
rd.coef_

In [None]:
plt.figure(figsize=(8,5))
plt.bar(x.columns,rd.coef_)
plt.title("Ridge Regression")
plt.xlabel("Columns")
plt.ylabel("Coefficient")
plt.show()

In [None]:
sample5=np.array([[15.90,18.60,15.70,8,2006]])
sample6=np.array([[18.00,18.88,16.80,9,2006]])
sample_t5=ss.transform(sample5)
sample_t6=ss.transform(sample6)
print(rd.predict(sample_t5))
print(rd.predict(sample_t6))

#### Evaluation

In [None]:
# Evaluation of the model on test data

mse_rd=mean_squared_error(y_test,y_pred2)
rmse_rd=root_mean_squared_error(y_test,y_pred2)
mae_rd=mean_absolute_error(y_test,y_pred2)
r2_rd=r2_score(y_test,y_pred2)

evaluation_df_rd=pd.DataFrame({"Metric":["MSE","RMSE","MAE","R2"],"Test Values":[mse_rd,rmse_rd,mae_rd,r2_rd]})
evaluation_df_rd

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross Validation
kfold=KFold(n_splits=5,shuffle=True,random_state=42)
cv3=cross_val_score(rd,x,y,cv=kfold,scoring="r2")
print(f"Cross-validation R2 scores of Ridge Regression = {cv3}")

mean_accuracy3=round(sum(cv3)/len(cv3)*100,2)
print(f"Mean cross-validation R2 (%) of Ridge model = {mean_accuracy3}")

In [None]:
# Hyperparameter Tuning
rd.get_params()

In [None]:
gs1=GridSearchCV(rd,parameter,cv=5,scoring="r2")

In [None]:
gs1.fit(x_train,y_train)

In [None]:
best_params1=gs1.best_params_
best_params1

In [None]:
gs1.best_score_

In [None]:
gs1.best_params_

In [None]:
model_best_estimator_rd=gs1.best_estimator_


In [None]:
y_test_pred_gs1=model_best_estimator_rd.predict(x_test)

In [None]:
y_test_pred_gs1

##### Which hyperparameter optimization technique have you used and why?


The hyperparameter optimization technique used here is GridSearchCV. It evaluates every combination in a predefined grid of hyperparameter values with cross-validation and selects the combination with the best mean score (here R2). It is particularly effective when the number of hyperparameters to tune is small and the grid size is manageable, as it provides an exhaustive search over a limited space.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

For the Ridge model, the score after the train-test split is 98.28%, whereas after hyperparameter tuning it is 98.61%.

In [None]:
mse_rd1=mean_squared_error(y_test,y_test_pred_gs1)
rmse_rd1=root_mean_squared_error(y_test,y_test_pred_gs1)
mae_rd1=mean_absolute_error(y_test,y_test_pred_gs1)
r2_rd1=r2_score(y_test,y_test_pred_gs1)

evaluation_df_rd1=pd.DataFrame({"Metric":["MSE","RMSE","MAE","R2"],"Test Values":[mse_rd1,rmse_rd1,mae_rd1,r2_rd1]})
evaluation_df_rd1

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10,6))
plt.plot(y_test.values,label="Actual Price")
plt.plot(y_pred2,label="Predicted Price")
plt.title("Actual vs Predicted Price")
plt.legend()
plt.show()

## Model Selection

### 1. Which Evaluation metrics did you consider for a positive business impact and why?



*   Metrics Considered: Mean Squared Error (MSE) and R-squared (R2).

*   Reasoning: MSE measures the average squared difference between predicted values and actual values, providing insight into prediction accuracy. R-squared (R2) indicates the proportion of the variance in the dependent variable that is predictable from the independent variables, demonstrating how well the model fits the data. These metrics are crucial for assessing model performance and ensuring accurate predictions, which are essential for making informed business decisions.
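
To make the metric definitions concrete, the sketch below computes them by hand on four hypothetical prices and checks the results against scikit-learn. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted closing prices
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 16.0, 18.0])

errors = y_true - y_pred                 # [-1, 1, -1, 2]
mse = np.mean(errors ** 2)               # (1 + 1 + 1 + 4) / 4 = 1.75
rmse = np.sqrt(mse)                      # same units as the price
mae = np.mean(np.abs(errors))            # (1 + 1 + 1 + 2) / 4 = 1.25

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

MSE penalizes large misses more heavily than MAE because errors are squared, while RMSE brings that penalty back to price units.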



### 2. Which ML model did you choose from the above created models as your final prediction model and why?



*   Chosen Model: Ridge Regression (Ridge model after hyperparameter tuning).

*   Reasoning: Ridge regression was chosen as the final model due to its ability to handle multicollinearity (correlation among predictors) by introducing a regularization term (alpha). This helps in reducing model complexity and overfitting, thereby improving generalization to new data. The best alpha parameter obtained from GridSearchCV ensures optimal regularization strength, balancing between bias and variance to enhance model performance.
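
The multicollinearity mentioned above can be quantified with the variance inflation factor (VIF). The sketch below is an illustration on synthetic, price-like columns (not the notebook's data) and relies on the identity that the VIFs are the diagonal entries of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic columns: High and Low are built from Open, so they are strongly correlated
open_ = rng.normal(100, 20, n)
high = open_ + np.abs(rng.normal(5, 2, n))
low = open_ - np.abs(rng.normal(5, 2, n))
month = rng.integers(1, 13, n).astype(float)  # unrelated to the prices

X = np.column_stack([open_, high, low, month])
names = ["Open", "High", "Low", "Month_number"]

# VIF_j = 1 / (1 - R2_j), which equals the j-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(names, vif):
    print(f"{name}: {value:.1f}")
```

A VIF far above 10 is a common rule-of-thumb signal of strong collinearity, which is the situation where a Ridge penalty tends to stabilize the coefficients.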



### 3. Explain the model which you have used and the feature importance using any model explainability tool?



*   Model Explanation: Ridge regression is a linear regression model that incorporates L2 regularization. It adds a penalty term to the ordinary least squares objective function, minimizing the sum of squared residuals plus the regularization term (alpha times the sum of squared coefficients). This regularization helps in shrinking the coefficients towards zero, reducing their variance and improving the model's ability to generalize.

*   Feature Importance: In linear models like Ridge regression, feature importance can be inferred from the magnitude of the coefficients after fitting the model. Larger coefficients indicate stronger influence of those features on the predicted outcome. Tools like permutation importance or partial dependence plots can further help visualize and interpret the impact of each feature on the model predictions, providing insights into which variables are most influential in determining the target variable.
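
The Ridge objective described above has a closed-form solution, which the following sketch compares with scikit-learn's `Ridge` on synthetic data. The data and true coefficients are made up for illustration. It also shows that a larger alpha shrinks the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                                          # synthetic features
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)   # synthetic target

alpha = 0.1

# Closed form on centred data (the intercept is not penalized):
# w = (Xc^T Xc + alpha * I)^-1 Xc^T yc
Xc = X - X.mean(axis=0)
yc = y - y.mean()
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

ridge = Ridge(alpha=alpha).fit(X, y)
print(w_closed)
print(ridge.coef_)

# A much stronger penalty pulls the coefficients closer to zero
ridge_strong = Ridge(alpha=100.0).fit(X, y)
print(np.linalg.norm(ridge.coef_), np.linalg.norm(ridge_strong.coef_))
```

When the features are standardized, as in this notebook, the absolute size of each coefficient gives a rough ranking of feature influence, although strongly correlated features share that influence between them.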



# **Conclusion**

The project successfully applied Ridge regression to predict stock prices using historical data, demonstrating the effectiveness of machine learning techniques in financial forecasting. Data preprocessing ensured high-quality inputs by checking for missing values, correcting data types, removing outliers, and scaling the numerical features for model training.

Ridge regression was selected to handle multicollinearity and prevent overfitting, with hyperparameter tuning via GridSearchCV identifying an optimal alpha value of 0.1.

Exploratory Data Analysis (EDA) provided valuable insights into stock price trends and relationships between variables, with visualizations like line plots and scatter plots aiding in understanding the data patterns and influential factors.

In summary, the project effectively utilized machine learning techniques, specifically Ridge regression, to predict stock prices with high accuracy. The rigorous data preprocessing, careful model selection, and thorough evaluation highlighted the robustness of the methodology. Future work could involve exploring additional models or advanced feature engineering techniques to further enhance the predictive capabilities.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***