# **Regression Model on GameStop Historical Stock Prices Dataset**

In this study, I applied a Machine Learning regression algorithm to the dataset: I fit a Linear Regression model and evaluated it with the R-squared (R^2) score.

**GameStop Historical Stock Prices Dataset**

The dataset contains a lot of fields, but as far as I observed, two of them increase together roughly linearly: the higher the opening price of the stock, the higher that day's high price. I use these two columns for the regression model:

open_price: The opening price of the stock

high_price: The highest price of the stock on that day
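
Before fitting anything, a quick way to check whether two columns move together linearly is the Pearson correlation. The snippet below is only a minimal sketch: `demo_df` and its values are synthetic stand-ins (an assumption, not the real GameStop data), and only the column names `open_price` and `high_price` come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the GameStop data: synthetic prices, not real quotes.
rng = np.random.default_rng(0)
open_price = rng.uniform(5, 100, size=200)
# A daily high is never below the open, so add a non-negative offset.
high_price = open_price + rng.uniform(0, 3, size=200)

demo_df = pd.DataFrame({"open_price": open_price, "high_price": high_price})

# A Pearson correlation close to 1 suggests a strong positive linear relationship.
corr = demo_df["open_price"].corr(demo_df["high_price"])
print(f"Pearson correlation: {corr:.4f}")
```

On the real dataframe, the same `.corr` call on the two columns would give a first hint of how linear the relationship is.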



# Linear Regression

First, we import the libraries needed to load the dataset. Kaggle generates this cell automatically for us.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Next, we read the *.csv* file with *pandas* and assign it to a dataframe variable. We also import the *matplotlib.pyplot* library to visualize the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/kaggle/input/gamestop-historical-stock-prices/GME_stock.csv')
print(plt.style.available) # look at available plot styles
plt.style.use('ggplot')

Now we specify the x and y axes. X holds the feature and Y holds the target values. The target values vary with the feature: in this scenario, as the x-value increases, so does the y-value.

Let's visualize this by plotting it.

In [None]:
x = np.array(df.loc[:,'open_price']).reshape(-1,1)
y = np.array(df.loc[:,'high_price']).reshape(-1,1)

plt.figure(figsize=[10,10])
plt.scatter(x, y)
plt.xlabel('open_price')
plt.ylabel('high_price')
plt.show()

The plot suggests a largely linear relationship, but it is too early to say for sure. To be sure, we need to create and fit the linear regression model and then measure how well it fits.
* Linear regression: y = ax + b, where y = target, x = feature, a = slope (coefficient) and b = intercept; a and b are the parameters of the model
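
For a single feature, the least-squares values of a and b have a closed form: a is the covariance of x and y divided by the variance of x, and b = mean(y) - a * mean(x). The sketch below illustrates this on hypothetical points (`x_demo` and `y_demo` are made up for the example and lie exactly on y = 2x + 1); it is not the notebook's own fitting code.

```python
import numpy as np

# Hypothetical points that lie exactly on the line y = 2x + 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 * x_demo + 1.0

x_mean = x_demo.mean()
y_mean = y_demo.mean()

# Slope: sum of co-deviations divided by sum of squared x-deviations.
a = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
# Intercept: the fitted line passes through the point (x_mean, y_mean).
b = y_mean - a * x_mean

print(f"a (slope) = {a}, b (intercept) = {b}")
```

Because the points are noise-free, the recovered slope and intercept match the line that generated them.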

To do this, we import *LinearRegression* from the *sklearn.linear_model* module to create and fit the model.

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(x,y)
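
After fitting, scikit-learn stores the learned parameters on the model: `coef_` holds a and `intercept_` holds b. The following sketch shows how they could be read; `demo_model`, `x_demo` and `y_demo` are hypothetical names with made-up data, shaped like the notebook's `x` and `y` (both reshaped to a single column).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on the line y = 2x + 1, as column vectors of shape (n, 1).
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y_demo = 2.0 * x_demo + 1.0

demo_model = LinearRegression()
demo_model.fit(x_demo, y_demo)

# With a 2-D target, coef_ has shape (n_targets, n_features)
# and intercept_ has shape (n_targets,).
slope = demo_model.coef_[0, 0]
intercept = demo_model.intercept_[0]

print(f"slope a = {slope:.4f}, intercept b = {intercept:.4f}")
```

The same two attributes on `linear_regression` would give the a and b of the line fitted to the GameStop prices.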


We pass these *open_price* values to the fitted model to obtain its predictions, which we call "*y_head*". Then we plot the fitted line over the real values.


In [None]:
y_head = linear_regression.predict(x)

plt.scatter(x, y)
plt.plot(x, y_head, color='green', linewidth=3)
plt.xlabel('open_price')
plt.ylabel('high_price')
plt.show()



A *residual* is the difference between a real *y* value and the corresponding *y_head* prediction. Residuals can be both negative and positive, so they may cancel each other out if we simply sum them. That's why we sum the squares of the *residuals* and divide by the sample count (n) to get the *Mean Squared Error*.

In these algorithms our main purpose is to reach the minimum *Mean Squared Error* value.
* Mean Squared Error = sum((residual)^2) / n -> n = sample count
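
To see why the residuals are squared, here is a small sketch with hypothetical numbers (`y_true` and `y_pred` are made up for illustration). The raw residuals sum to zero even though every prediction is off by 1, while the Mean Squared Error reflects the misfit. The result is compared with scikit-learn's `mean_squared_error` as a sanity check.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical real values and predictions; every prediction is off by exactly 1.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 6.0, 10.0])

residuals = y_true - y_pred           # [ 1., -1.,  1., -1.]
residual_sum = residuals.sum()        # positives and negatives cancel out

# Mean Squared Error = sum((residual)^2) / n
mse = np.sum(residuals ** 2) / len(residuals)

print("sum of residuals:", residual_sum)
print("mean squared error:", mse)
print("sklearn mean_squared_error:", mean_squared_error(y_true, y_pred))
```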

Now we need to calculate the R^2 score, which shows how well the line fits the real values. The closer this value is to 1, the better the model fits.
* residual = y - y_head
* square_residual = (residual)^2
* sum_square_residual = sum((residual)^2) -> *we call it here SSR*
* sum_square_total = sum((y - y_avg)^2), where y_avg is the mean of y -> *we call it here SST*
* R-square = 1 - (SSR/SST)
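
The steps in this list can be followed by hand. The sketch below uses hypothetical values for `y` and `y_head` (not the GameStop data) and checks the manual result against `r2_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical real values and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_head = np.array([2.0, 6.0, 6.0, 10.0])

residual = y - y_head
ssr = np.sum(residual ** 2)           # sum of squared residuals (SSR)
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean (SST)
r_square = 1 - ssr / sst

print("manual R^2  :", r_square)
print("sklearn R^2 :", r2_score(y, y_head))
```

Here SSR = 4 and SST = 20, so both lines report an R^2 of 0.8 (up to floating-point rounding).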

In practice, we don't need to do these calculations manually one by one. The *sklearn.metrics* module provides a ready-made *r2_score* function for this.

In [None]:
from sklearn.metrics import r2_score
print("r_square score = ", r2_score(y,y_head))

As we can see, this result is very close to 1. Now we can predict the "*high_price*" value for a specific "*open_price*".

For example: if *open_price* = 350, what would the *high_price* be?

In [None]:
print("Prediction of high_price : ", linear_regression.predict([[350]]))

We can also check the model by giving it an input taken from the real data and comparing the prediction with the actual value. Let's see what the regression model predicts for *open_price = 354.8299865722656*, a value from this dataset.

In [None]:
print("Prediction of high_price : ", linear_regression.predict([[354.8299865722656]]))

Looking at the dataset, the real *high_price* for this row is 380.0, which is very close to the prediction.