# Stock Price Prediction Using Linear Regression

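Stock closing prices are predicted using a linear regression model. The workflow covers data loading, filling of missing prices, feature engineering, cleaning of the engineered features, model training, and evaluation, followed by a simple buy/sell signal derived from the predictions.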

## Features Used

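- **7DMA**: 7-day moving average of the closing price, taken over the previous 7 rows (the current day's close is excluded)
- **30DMA**: 30-day moving average of the closing price, taken over the previous 30 rows in the same way
- **RSI**: Relative Strength Index (14-day), computed from simple rolling sums of the previous 14 daily gains and losses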
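
Written out, the indicators as computed by the `markers` function in this notebook look roughly as follows. Here $C_t$ is the close on row $t$ and $\Delta C_t = C_t - C_{t-1}$; the symbols $G_t$ and $L_t$ are introduced only for this illustration. The first rows use shorter windows, and the RSI here uses simple rolling sums rather than Wilder's smoothing.

```latex
\begin{aligned}
\text{7DMA}_t &= \frac{1}{7}\sum_{k=1}^{7} C_{t-k}, \qquad
\text{30DMA}_t = \frac{1}{30}\sum_{k=1}^{30} C_{t-k} \\
G_t &= \frac{1}{14}\sum_{k=1}^{14} \max(\Delta C_{t-k},\, 0), \qquad
L_t = \frac{1}{14}\sum_{k=1}^{14} \max(-\Delta C_{t-k},\, 0) \\
\text{RSI}_t &= 100 \cdot \frac{G_t}{G_t + L_t} \;=\; 100 - \frac{100}{1 + G_t / L_t}
\end{aligned}
```

The two forms of the last line are algebraically identical; the code uses the first one, which avoids dividing by $L_t$ when there are no losses in the window.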

## Functions

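- `dt(loc)`: The CSV dataset is loaded from the given path.
- `fillin(loc)`: Missing `Close` values are filled by linear interpolation, with a backward fill for any values still missing.
- `markers(loc)`: 7DMA, 30DMA, Return, and RSI are added to the dataset.
- `clean(loc)`: Missing values in 7DMA, 30DMA, and RSI are filled in the same way as `Close`.
- `ins(loc)`: The feature columns (7DMA, 30DMA, RSI) are prepared for model input.
- `outs(loc)`: The target column (`Close`) is prepared for model output.
- `linreg(loc)`: The model is trained and evaluated, and its coefficients and performance metrics are printed.
- `signal(loc)`: Buy/Sell signals are derived by comparing test-period predictions with the last training-period close.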

Required libraries are imported for data manipulation, model training, and evaluation.

In [1]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

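Data loading and preprocessing functions are defined. The dataset is loaded, missing `Close` values are filled, and technical indicators are computed. Every indicator is built from closes shifted by one row, so the features for a given day use only prices from earlier days.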

In [2]:
def dt(loc):
    stock = pd.read_csv(loc, parse_dates=True)
    return stock

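def fillin(loc):
    stock = dt(loc)
    # Label each run of consecutive NaNs: notna().cumsum() stays constant
    # across a gap, so all NaNs in one gap share the same label.
    g = stock['Close'].notna().cumsum().where(stock['Close'].isna())
    # Gaps longer than 3 values stay NaN; shorter gaps are linearly interpolated.
    stock['Close'] = stock['Close'].where(g.map(g.value_counts()) > 3, stock['Close'].interpolate('linear'))
    # Remaining NaNs (long gaps, leading values) are backward filled.
    stock['Close'] = stock['Close'].bfill()
    return stock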

def markers(loc):
    stock = fillin(loc)
    # shift(1) drops the current day's close, so each feature uses past data only.
    stock['7DMA'] = stock['Close'].shift(1).rolling(window=7, min_periods=0).mean()
    stock['30DMA'] = stock['Close'].shift(1).rolling(window=30, min_periods=0).mean()
    # Previous day's percentage return.
    stock['Return'] = stock['Close'].shift(1).pct_change() * 100
    # RSI from simple rolling sums of gains and losses over the previous 14 changes.
    change = stock['Close'].diff()
    avgain = change.where(change > 0, 0).shift(1).rolling(window=14, min_periods=1).sum() / 14
    avloss = abs(change.where(change < 0, 0).shift(1).rolling(window=14, min_periods=1).sum()) / 14
    stock['RSI'] = (avgain / (avgain + avloss)) * 100
    return stock
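
The gap-filling idea used for `Close` can be tried on a toy series. The sketch below is a minimal illustration under assumptions: the helper name `fill_gaps`, its `max_gap` parameter, and the toy values are hypothetical, and it assumes the intent of the threshold of 3 is "interpolate short gaps, backfill long ones". It labels runs of missing values with `notna().cumsum()`.

```python
import numpy as np
import pandas as pd


def fill_gaps(series: pd.Series, max_gap: int = 3) -> pd.Series:
    """Interpolate NaN runs of at most `max_gap` values; backfill longer runs.

    Hypothetical helper for illustration, not part of the notebook.
    """
    is_na = series.isna()
    # notna().cumsum() stays constant across a run of NaNs, so every NaN in
    # the same run shares one label; observed values are masked out.
    run_id = series.notna().cumsum().where(is_na)
    # Length of the run each NaN belongs to (NaN for observed values).
    run_len = run_id.map(run_id.value_counts())
    # Keep NaN for long runs, take the interpolated value everywhere else.
    filled = series.where(run_len > max_gap, series.interpolate("linear"))
    # Whatever is still missing (long runs, leading NaNs) is backfilled.
    return filled.bfill()


# One gap of length 1 (interpolated) and one of length 4 (backfilled).
toy = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0])
result = fill_gaps(toy)
print(result.tolist())
# → [1.0, 2.0, 3.0, 8.0, 8.0, 8.0, 8.0, 8.0]
```

The single missing value is interpolated to 2.0, while the run of four missing values exceeds the threshold and is filled from the next observed value instead.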

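Data cleaning is performed to handle the missing values left in the engineered features: the first row of each moving average has no prior close, and RSI is undefined (0/0) whenever the preceding window contains no price change.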

In [3]:
def clean(loc):
    stock = markers(loc)
    for col in ['7DMA', '30DMA', 'RSI']:
        # Label each run of consecutive NaNs so its length can be measured.
        g = stock[col].notna().cumsum().where(stock[col].isna())
        # Interpolate gaps of up to 3 values; leave longer gaps for the backward fill.
        stock[col] = stock[col].where(g.map(g.value_counts()) > 3, stock[col].interpolate('linear'))
        stock[col] = stock[col].bfill()
    return stock
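
To see why the RSI column can contain missing values in the first place, the RSI expression can be run on two toy series. This is a small sketch only: `rsi_simple` is a hypothetical standalone copy of the RSI lines from `markers`, and the price values are made up.

```python
import pandas as pd


def rsi_simple(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI from simple rolling sums over the previous `window` price changes.

    Hypothetical standalone version of the RSI lines in `markers`.
    """
    change = close.diff()
    gain = change.where(change > 0, 0).shift(1).rolling(window, min_periods=1).sum() / window
    loss = abs(change.where(change < 0, 0).shift(1).rolling(window, min_periods=1).sum()) / window
    return gain / (gain + loss) * 100


flat = pd.Series([10.0] * 5)                   # price never moves
rising = pd.Series([10.0, 11.0, 12.0, 13.0])   # price only rises

rsi_flat = rsi_simple(flat)
rsi_rising = rsi_simple(rising)
print(rsi_flat.isna().all())
print(rsi_rising.tolist())
```

For the flat series both rolling sums are zero, so every RSI value is 0/0 and comes out as NaN. For the rising series the first two rows are NaN (no earlier change is available yet) and the remaining rows are 100, since the window contains gains only.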

Feature and target extraction functions are defined: `ins` returns the three feature columns (7DMA, 30DMA, RSI) and `outs` returns the `Close` target, both taken from the cleaned dataset.

In [4]:
def ins(loc):
    stock = clean(loc)
    # Feature matrix: the three engineered indicators.
    return stock[['7DMA', '30DMA', 'RSI']].copy()

def outs(loc):
    stock = clean(loc)
    # Target kept as a one-column DataFrame, so the fitted coefficients are 2-D.
    return stock[['Close']].copy()

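The linear regression workflow is defined. The data is split chronologically, with the last 20% of rows held out as the test set (`shuffle=False`). The model is trained on the earlier rows, predictions are made for the test rows, and the coefficients, mean absolute error, and R² score are printed.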

In [5]:
def linreg(loc):
    stock = clean(loc)
    inputs = ins(loc)
    outputs = outs(loc)
    # shuffle=False keeps chronological order: the last 20% of rows form the test set.
    intrain, intest, outtrain, outtest = train_test_split(inputs, outputs, test_size=0.2, shuffle=False)
    lr = LinearRegression()
    lr.fit(intrain, outtrain)
    pred = lr.predict(intest)
    # Predictions exist only for test rows; training rows keep NaN.
    stock['Predicted'] = np.nan
    stock.loc[intest.index, 'Predicted'] = pred
    print(f"Coefficients: {lr.coef_}")
    print(f"Intercept: {lr.intercept_}")
    print("Mean Absolute Error:", mean_absolute_error(outtest, pred))
    print("R^2 Score:", r2_score(outtest, pred))
    return stock
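
The effect of `shuffle=False` is easiest to see on a tiny synthetic dataset. The sketch below uses made-up, noise-free data (`y = 2x + 1`), not the stock data, and only illustrates that the test set is the final 20% of rows and that the same scikit-learn metrics are computed on it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y = 2 * x + 1 (illustrative values only).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# shuffle=False keeps row order, so the test set is the final 20% of rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("Test rows:", X_test.ravel())
print("MAE:", mean_absolute_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```

With time-ordered prices this kind of split means the model is always evaluated on dates later than anything it was trained on.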

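A simple trading signal is derived from the predictions. Each test-period prediction is compared with the last closing price of the training period: a higher prediction is labeled `Buy`, anything else `Sell`. `Profit` is the absolute difference between the predicted price and that last known close, so it is based on predicted prices rather than actual ones.

In [6]: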
def signal(loc):
    stock = linreg(loc)
    trainp = stock[stock['Predicted'].isna()]
    testp = stock[stock['Predicted'].notna()].copy()
    lastknown = trainp['Close'].iloc[-1]
    testp['Signal'] = np.where(testp['Predicted'] > lastknown, 'Buy', 'Sell')
    testp['Profit'] = abs(testp['Predicted'] - lastknown)
    print('Maximum Possible Profit in Test Period:', testp['Predicted'].max() - lastknown)
    print('Best Date to Sell:', testp.loc[testp['Predicted'].idxmax()])
    return testp
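
The signal rule can be checked by hand on a few rows. This is a minimal sketch with made-up numbers; `last_known`, the placeholder dates, and the predicted values are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

last_known = 100.0  # assumed last training-period close (illustrative)
test = pd.DataFrame(
    {"Date": ["d1", "d2", "d3"], "Predicted": [95.0, 104.0, 110.0]}
)

# Buy when the predicted price is above the last known close, else Sell.
test["Signal"] = np.where(test["Predicted"] > last_known, "Buy", "Sell")
# Absolute predicted move, regardless of direction.
test["Profit"] = (test["Predicted"] - last_known).abs()

# Row with the highest predicted price.
best = test.loc[test["Predicted"].idxmax()]
print(test)
print("Highest predicted price on:", best["Date"])
```

Because `Profit` is an absolute value, a `Sell` row also shows a positive number: here the first row gets 5.0 even though its prediction is below the reference price.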

The full pipeline is run on the dataset by calling `signal` with the path to the CSV file. The model's coefficients and performance metrics are printed, followed by the test-period rows with their predictions and signals.

In [7]:
# Replace this path with your own
print(signal(Path.home() / 'Downloads' / 'Amazon' / 'amzn.us.csv'))

Coefficients: [[0.88736783 0.1154485  0.11187364]]
Intercept: [-5.7468903]
Mean Absolute Error: 11.923524545199198
R^2 Score: 0.9948009246029088
Maximum Possible Profit in Test Period: 814.4417054055914
Best Date to Sell: Date          2017-11-10
Open              1126.1
High             1131.75
Low              1124.06
Close            1125.35
Volume           2179181
OpenInt                0
7DMA         1116.477143
30DMA        1026.415333
Return         -0.331015
RSI            82.187197
Predicted    1112.671705
Signal               Buy
Profit        814.441705
Name: 5152, dtype: object
            Date     Open     High      Low    Close   Volume  OpenInt  \
4122  2013-10-10   304.65   306.70   302.59   305.17  2556190        0   
4123  2013-10-11   304.77   310.93   303.84   310.89  2162268        0   
4124  2013-10-14   309.22   311.64   307.00   310.70  1938900        0   
4125  2013-10-15   309.92   310.79   305.26   306.40  2261100        0   
4126  2013-10-16   308.38   310.