##### Project Scribble
Project: Portfolio Optimization

#### Completed
- Implemented Pearson's Correlation Coefficient function
- Created dataset with all features (x values) raised to the power of their Pearson correlation coefficient with the target

#### What's Next
- Implement **multiple** linear regression
- Split data into training and testing sets (predicting stocks for 2 weeks)
- Train model
- Test multiple linear regression on test set
- Clean up code so that it works for other stock datasets
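
The train/test split in the list above could be sketched as follows. This is only a sketch: it assumes "2 weeks" means the last 10 trading days and that the rows are already sorted by date. `split_by_time` and `test_rows` are hypothetical names, not part of this notebook.

```python
import pandas as pd

def split_by_time(dataframe: pd.DataFrame, test_rows: int = 10):
    """Hold out the last `test_rows` rows as the test set (no shuffling)."""
    # Assumption: rows are already in chronological order
    train_df = dataframe.iloc[:-test_rows]
    test_df = dataframe.iloc[-test_rows:]
    return train_df, test_df

# Tiny made-up frame, only to show the resulting sizes
demo_df = pd.DataFrame({"Open": range(30), "Close": range(30)})
train_df, test_df = split_by_time(demo_df)
print(len(train_df), len(test_df))
```

The split is not shuffled because the goal is to predict later dates from earlier ones.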

In [259]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# This line is for displaying our graphs in Jupyter
%matplotlib inline

#### Functions

In [260]:
# Pearson's Correlation Coefficient function

# Inputs:
## dataframe: The dataframe where the data is stored
## x_column_name: The name of the x column we want to measure
## y_column_name: The name of the y column we want to measure

# Output:
## r: The Pearson's Correlation Coefficient

def PCC(dataframe: pd.DataFrame, x_column_name: str, y_column_name: str):
    x_values = dataframe.loc[:, x_column_name]
    y_values = dataframe.loc[:, y_column_name]
    
    x_mean = x_values.mean()
    y_mean = y_values.mean()
    
    numerator = sum((x_values - x_mean) * (y_values - y_mean))
    denominator = np.sqrt(sum((x_values - x_mean)**2) * sum((y_values - y_mean)**2))
    
    r = numerator / denominator
    
    return r
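
One way to sanity-check the formula used in `PCC` is to compare it with the built-in Pearson implementations in NumPy and pandas. The sketch below repeats the same formula on a small made-up dataframe (`check_df` and its numbers are invented for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up numbers, only for the sanity check
check_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                         "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

x_dev = check_df["x"] - check_df["x"].mean()
y_dev = check_df["y"] - check_df["y"].mean()

# Same formula as in PCC
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Built-in equivalents (Series.corr uses the Pearson method by default)
r_numpy = np.corrcoef(check_df["x"], check_df["y"])[0, 1]
r_pandas = check_df["x"].corr(check_df["y"])

print(r_manual, r_numpy, r_pandas)
```

All three values should agree up to floating point rounding.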

In [261]:
# Mean Squared Error function
# Used to stop gradient descent once the error starts growing,
# which helps with preventing "inf/nan" when doing linear regression

# Inputs:
## dataframe: The dataframe where the data is stored
## x_column_name: The name of the x column we want to measure
## y_column_name: The name of the y column we want to measure
## weight: The slope
## bias: The y-intercept

# Output:
## cost: The mean squared error value

def MSE(dataframe: pd.DataFrame, x_column_name: str, y_column_name: str, weight: float, bias: float):
    data_length = len(dataframe)
    
    x_values = dataframe.loc[:, x_column_name]
    y_values = dataframe.loc[:, y_column_name]
    
    sum_error = sum((bias + (weight * x_values) - y_values) ** 2)
    
    cost = (1 / data_length) * sum_error
    
    return cost

###### Ignore this linear regression function. It only works for one x value (Simple Linear Regression).
###### We need Multiple Linear Regression
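
A minimal sketch of what a multiple linear regression with gradient descent might look like, generalizing the single weight to one weight per feature. `MultipleLinearRegression`, `feature_names`, and the default `rate` and `epochs` are assumptions for illustration, not the notebook's final implementation. The demo data is made up, and unscaled features such as "Volume" would likely need a much smaller rate or rescaling first.

```python
import numpy as np
import pandas as pd

def MultipleLinearRegression(dataframe: pd.DataFrame,
                             feature_names: list,
                             y_column_name: str,
                             rate: float = 0.01,
                             epochs: int = 5000):
    # X has one row per sample and one column per feature
    X = dataframe[feature_names].to_numpy(dtype=float)
    y = dataframe[y_column_name].to_numpy(dtype=float)
    n, k = X.shape

    weights = np.zeros(k)
    bias = 0.0

    for _ in range(epochs):
        # Prediction for every row: w1*x1 + ... + wk*xk + b
        pred = X @ weights + bias
        error = y - pred

        # Partial derivatives of the MSE with respect to each weight and the bias
        der_weights = (-2 / n) * (X.T @ error)
        der_bias = (-2 / n) * error.sum()

        weights = weights - rate * der_weights
        bias = bias - rate * der_bias

    return weights, bias

# Made-up data generated from y = 2*a - 3*b + 1
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo_df["y"] = 2 * demo_df["a"] - 3 * demo_df["b"] + 1

weights, bias = MultipleLinearRegression(demo_df, ["a", "b"], "y")
print(weights, bias)
```

On this noise-free demo data the fitted weights and bias should land close to 2, -3, and 1.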

In [262]:
# Modified Simple Linear Regression function (easier to read)

# Inputs:
## dataframe: The dataframe where the data is stored
## x_column_name: The name of the x column we want to measure
## y_column_name: The name of the y column we want to measure
## weight: The slope to start gradient descent from (default weight = 0)
## bias: The y-intercept to start gradient descent from (default bias = 0)
## rate: The learning rate, i.e. how large each gradient descent step is (default rate = 0.0001)
## epochs: The maximum number of gradient descent steps (default epochs = 1000)

# Outputs:
## A tuple that includes the new weight and new bias (slope and y-intercept)

def LinearRegression(dataframe: pd.DataFrame, 
                     x_column_name: str, 
                     y_column_name: str,
                     weight = 0,
                     bias = 0,
                     rate = 0.0001,
                     epochs = 1000
                    ):
    
    # Setting up the weights and bias that will most likely change in value over time
    new_weight = weight
    new_bias = bias
    
    mse = MSE(dataframe, x_column_name, y_column_name, weight, bias)
    
    # x_values contains the values of the entire column of "x_column"
    # y_values contains the values of the entire column of "y_column"
    x_values = dataframe.loc[:, x_column_name]
    y_values = dataframe.loc[:, y_column_name]
    
    # Just getting the full length of the dataframe
    data_length = len(dataframe)
    
    # Gradient descent portion
    for i in range(epochs):
        # pred_y_values is a new column, where the m'x+b' formula has been applied
        # Every row in pred_y_values has the formula applied
        pred_y_values = (new_weight * x_values) + new_bias
        
        # Obtaining the partial derivatives of the weight and bias
        der_weight = (-2 / data_length) * sum(x_values * (y_values - pred_y_values))
        der_bias = (-2 / data_length) * sum(y_values - pred_y_values)
        
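        # Calculating the candidate weight and bias for the next step
        next_weight = new_weight - rate * der_weight
        next_bias = new_bias - rate * der_bias
        
        # If the step made the error worse, stop and keep the last good values
        new_mse = MSE(dataframe, x_column_name, y_column_name, next_weight, next_bias)
        if new_mse > mse:
            return (new_weight, new_bias)
        
        new_weight, new_bias = next_weight, next_bias
        mse = new_mse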
        
    return (new_weight, new_bias)
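
For reference, the partial derivatives in the gradient descent loop follow from the mean squared error cost. Here $n$ is `data_length`, $w$ is the weight, $b$ is the bias, $\hat{y}_i = w x_i + b$ is the predicted value, and $\alpha$ is `rate` (the symbols are introduced here for the derivation only).

```latex
J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( b + w x_i - y_i \right)^2

\frac{\partial J}{\partial w}
  = \frac{2}{n} \sum_{i=1}^{n} x_i \left( b + w x_i - y_i \right)
  = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{y}_i \right)

\frac{\partial J}{\partial b}
  = \frac{2}{n} \sum_{i=1}^{n} \left( b + w x_i - y_i \right)
  = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)

w \leftarrow w - \alpha \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```

These match `der_weight`, `der_bias`, and the update step in the code.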

#### Beginning of main program

In [263]:
# Creating a dataframe with AMC stock data
# A dataframe is basically a class that can neatly store values into a table
amc_df = pd.read_csv("stock data/AMC.csv")

In [264]:
# Taking a peek at the first 5 rows of our table
amc_df.head()

         Date  Open  High   Low  Close  Adj Close    Volume
0  2020-04-16  2.35  2.49  2.15   2.44       2.44   6290800
1  2020-04-17  3.85  3.85  3.05   3.20       3.20  26799900
2  2020-04-20  3.22  3.44  3.06   3.18       3.18  10787200
3  2020-04-21  3.20  3.38  3.15   3.25       3.25   5788400
4  2020-04-22  3.39  3.39  3.16   3.18       3.18   4568600


In [265]:
# Checking for null values.
amc_df.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

In [266]:
# Changing order of columns, moving Y ("Close" column) to the far right
# Also removing "Adj Close" column, since it's the same as "Close"
columns = ["Date", "Open", "High", "Low", "Volume", "Close"]
amc_df = amc_df[columns]
amc_df.head()

         Date  Open  High   Low    Volume  Close
0  2020-04-16  2.35  2.49  2.15   6290800   2.44
1  2020-04-17  3.85  3.85  3.05  26799900   3.20
2  2020-04-20  3.22  3.44  3.06  10787200   3.18
3  2020-04-21  3.20  3.38  3.15   5788400   3.25
4  2020-04-22  3.39  3.39  3.16   4568600   3.18


Features (What we will use to predict the target): Open, High, Low, Volume

Target (What we are predicting): Close

In [267]:
# I will now find the Pearson correlation coefficient between every feature and the target
# Note: "P value" / "P-value" below means the Pearson correlation coefficient (r),
# not a statistical p-value
target = "Close"

p_open = PCC(amc_df, "Open", target)
p_high = PCC(amc_df, "High", target)
p_low = PCC(amc_df, "Low", target)
p_volume = PCC(amc_df, "Volume", target)

In [268]:
print("P-value for 'Open':", p_open)
print("P-value for 'High':", p_high)
print("P-value for 'Low':", p_low)
print("P-value for 'Volume':", p_volume)

P-value for 'Open': 0.9817938374075329
P-value for 'High': 0.9792750125803519
P-value for 'Low': 0.9755985440455316
P-value for 'Volume': 0.5364640523185773


In [269]:
# Now I will create a new dataframe, which will have all of the features, but raised to
# the power of their respective Pearson correlation coefficient

new_amc_df = amc_df.copy()

new_amc_df["Open"] = new_amc_df["Open"] ** p_open
new_amc_df["High"] = new_amc_df["High"] ** p_high
new_amc_df["Low"] = new_amc_df["Low"] ** p_low
new_amc_df["Volume"] = new_amc_df["Volume"] ** p_volume

print("Original AMC Stock Data")
print(amc_df.head())
print()
print("AMC Stock Data after raising the features to the power of their respective p_value")
print(new_amc_df.head())

Original AMC Stock Data
         Date  Open  High   Low    Volume  Close
0  2020-04-16  2.35  2.49  2.15   6290800   2.44
1  2020-04-17  3.85  3.85  3.05  26799900   3.20
2  2020-04-20  3.22  3.44  3.06  10787200   3.18
3  2020-04-21  3.20  3.38  3.15   5788400   3.25
4  2020-04-22  3.39  3.39  3.16   4568600   3.18

AMC Stock Data after raising the features to the power of their respective p_value
         Date      Open      High       Low       Volume  Close
0  2020-04-16  2.313727  2.443364  2.110214  4438.756343   2.44
1  2020-04-17  3.756659  3.743924  2.968125  9658.872281   3.20
2  2020-04-20  3.152171  3.353036  2.977619  5927.925232   3.18
3  2020-04-21  3.132948  3.295755  3.063029  4244.920447   3.25
4  2020-04-22  3.315483  3.305303  3.072515  3738.817342   3.18
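
As a step toward making this work for other stock datasets, the transform above could be wrapped in a helper that loops over the feature names. This is only a sketch: `power_transform_by_correlation` is a hypothetical name, and it uses pandas' built-in `Series.corr` instead of `PCC` so the block runs on its own. It assumes all feature values are positive, as prices and volumes are here.

```python
import pandas as pd

def power_transform_by_correlation(dataframe: pd.DataFrame,
                                   feature_names: list,
                                   target_name: str) -> pd.DataFrame:
    """Return a copy where each feature is raised to its Pearson r with the target."""
    transformed = dataframe.copy()
    for name in feature_names:
        # Series.corr uses the Pearson method by default
        r = dataframe[name].corr(dataframe[target_name])
        transformed[name] = dataframe[name] ** r
    return transformed

# First five AMC rows shown earlier. The coefficient is computed on these
# five rows only, so the result differs from the full-dataset output.
sample_df = pd.DataFrame({
    "Open": [2.35, 3.85, 3.22, 3.20, 3.39],
    "Close": [2.44, 3.20, 3.18, 3.25, 3.18],
})
result = power_transform_by_correlation(sample_df, ["Open"], "Close")
print(result)
```

The original dataframe is left unchanged, matching the `amc_df.copy()` approach used above.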
