# Goal of this project
I will predict future stock prices from the historical time series data of those stocks.

# What do I exactly mean by it?
Imagine I have a toy that moves back and forth in a certain pattern. If I can understand how it moves now, I will be able to guess where it will go next. That's what I will be doing here with stock prices: predicting future prices based on past movements.

# What will I need to do to achieve this goal?
- I will need to collect data from some source
- After collecting it, I will need to understand the data and preprocess it to remove irrelevant or missing records
- Next, I will build a model
- Then I will train the model so that it captures patterns in the past data and can predict future values
- Test the model's performance

# Collecting The Data 

I am using the yfinance library to collect the stock prices from Yahoo Finance 😄
- Before collecting the data, I install the yfinance library.

In [57]:
import numpy as np # For numerical operations
import pandas as pd # For analyzing and handling the dataset
from sklearn.preprocessing import MinMaxScaler # To normalize data between 0 and 1 
import tensorflow as tf # Necessary library for building a neural network
from tensorflow.keras.models import Sequential # Build the model layer by layer
from tensorflow.keras.layers import LSTM,Dense,Dropout # Dense for final output and Dropout to avoid overfitting
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input

In [3]:
import yfinance as yf 

### Downloading the stock data for multiple tickers (Apple, Google and Microsoft) for the last 5 years

In [4]:
ticker_list=['AAPL','GOOGL','MSFT'] # Creating a ticker list 
stock_data=yf.download(ticker_list,start='2020-01-01',end='2025-01-01') # Downloads the data of last 5 years 

YF.download() has changed argument auto_adjust default to True


[*********************100%***********************]  3 of 3 completed


In [5]:
# Displaying the first and last 5 rows 
print(stock_data.head())
print(stock_data.tail())

Price           Close                              High             \
Ticker           AAPL      GOOGL        MSFT       AAPL      GOOGL   
Date                                                                 
2020-01-02  72.716042  68.108376  153.323242  72.776568  68.108376   
2020-01-03  72.009125  67.752075  151.414124  72.771752  68.360669   
2020-01-06  72.582901  69.557945  151.805496  72.621639  69.583321   
2020-01-07  72.241554  69.423592  150.421417  72.849231  69.841098   
2020-01-08  73.403656  69.917725  152.817337  73.706287  70.256604   

Price                         Low                              Open  \
Ticker            MSFT       AAPL      GOOGL        MSFT       AAPL   
Date                                                                  
2020-01-02  153.428246  71.466782  67.004158  151.137280  71.720989   
2020-01-03  152.683705  71.783969  67.045454  150.879566  71.941336   
2020-01-06  151.872323  70.876068  67.228582  149.399972  71.127858   
2020-01-07  1

Observation : The columns are in a multi-index format with two levels, Price (Close, High, Low, Open, Volume) and Ticker, while the rows are indexed by Date.
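To make the two-level column layout concrete, here is a small sketch on a made-up frame with the same `(Price, Ticker)` structure. The numbers and the name `frame` are placeholders of mine, not downloaded data.

```python
import pandas as pd

# Hypothetical miniature frame mimicking the (Price, Ticker) column layout
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAPL", "MSFT"]], names=["Price", "Ticker"]
)
dates = pd.to_datetime(["2020-01-02", "2020-01-03"])
frame = pd.DataFrame(
    [[72.7, 153.3, 100, 200], [72.0, 151.4, 110, 210]],
    index=dates,
    columns=columns,
)

# A single column is addressed by the full tuple (Price, Ticker)
aapl_close = frame[("Close", "AAPL")]

# Selecting only the top level returns every ticker under it
all_close = frame["Close"]

# xs() slices across the second level: every field for one ticker
msft_all = frame.xs("MSFT", axis=1, level="Ticker")

print(aapl_close.tolist())
print(list(all_close.columns))
print(list(msft_all.columns))
```

The same tuple addressing is what lets a new column be stored under any top-level label I choose.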

In [6]:
stock_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2020-01-02 to 2024-12-31
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   (Close, AAPL)    1258 non-null   float64
 1   (Close, GOOGL)   1258 non-null   float64
 2   (Close, MSFT)    1258 non-null   float64
 3   (High, AAPL)     1258 non-null   float64
 4   (High, GOOGL)    1258 non-null   float64
 5   (High, MSFT)     1258 non-null   float64
 6   (Low, AAPL)      1258 non-null   float64
 7   (Low, GOOGL)     1258 non-null   float64
 8   (Low, MSFT)      1258 non-null   float64
 9   (Open, AAPL)     1258 non-null   float64
 10  (Open, GOOGL)    1258 non-null   float64
 11  (Open, MSFT)     1258 non-null   float64
 12  (Volume, AAPL)   1258 non-null   int64  
 13  (Volume, GOOGL)  1258 non-null   int64  
 14  (Volume, MSFT)   1258 non-null   int64  
dtypes: float64(12), int64(3)
memory usage: 157.2 KB


Observation : The data has no null entries

# Data Preprocessing 

- Through feature engineering, I will add extra columns, such as moving averages over a certain period of time, which might help my model make better predictions
- I will also normalize the values so that all features are on a comparable scale, which usually helps the model train

## Feature Engineering

In [7]:
# Calculating the 50-day and 200-day moving averages of the closing price for AAPL, MSFT and GOOGL
# The columns form a multi-index with two levels (Price, Ticker), so each column is addressed by a tuple, e.g. ('Open','MSFT').
# The new columns are stored under a new top-level label per ticker, e.g. ('AAPL','AAPL_50MA').
stock_data[('AAPL','AAPL_50MA')]=stock_data[('Close','AAPL')].rolling(window=50).mean()
stock_data[('AAPL','AAPL_200MA')]=stock_data[('Close','AAPL')].rolling(window=200).mean()
stock_data[('MSFT','MSFT_50MA')]=stock_data[('Close','MSFT')].rolling(window=50).mean()
stock_data[('MSFT','MSFT_200MA')]=stock_data[('Close','MSFT')].rolling(window=200).mean()
stock_data[('GOOGL','GOOGL_50MA')]=stock_data[('Close','GOOGL')].rolling(window=50).mean()
stock_data[('GOOGL','GOOGL_200MA')]=stock_data[('Close','GOOGL')].rolling(window=200).mean()

In [8]:
stock_data.head()

Price,Close,Close,Close,High,High,High,Low,Low,Low,Open,Open,Open,Volume,Volume,Volume,AAPL,AAPL,MSFT,MSFT,GOOGL,GOOGL
Ticker,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,...,MSFT,AAPL,GOOGL,MSFT,AAPL_50MA,AAPL_200MA,MSFT_50MA,MSFT_200MA,GOOGL_50MA,GOOGL_200MA
Date
2020-01-02,72.716042,68.108376,153.323242,72.776568,68.108376,153.428246,71.466782,67.004158,151.13728,71.720989,...,151.566834,135480400,27278000,22622100,,,,,,
2020-01-03,72.009125,67.752075,151.414124,72.771752,68.360669,152.683705,71.783969,67.045454,150.879566,71.941336,...,151.127764,146322800,23408000,21116200,,,,,,
2020-01-06,72.582901,69.557945,151.805496,72.621639,69.583321,151.872323,70.876068,67.228582,149.399972,71.127858,...,149.944085,118387200,46768000,20813700,,,,,,
2020-01-07,72.241554,69.423592,150.421417,72.849231,69.841098,152.416469,72.021238,69.246938,150.173234,72.592601,...,152.082377,108872000,34330000,21634100,,,,,,
2020-01-08,73.403656,69.917725,152.817337,73.706287,70.256604,153.495089,71.943766,69.300178,150.774555,71.943766,...,151.710031,132079200,35314000,27746500,,,,,,


When calculating a moving average with a window of 50, the first 49 rows have no value because there are not yet 50 days to average. Similarly, with a window of 200 the first 199 rows are NaN, because a full 200-day window is only available from the 200th row onward.
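The same effect is easy to see on a tiny series. This is a minimal sketch with made-up prices and a window of 3 instead of 50 or 200; the `min_periods=1` variant is shown only as an option, it is not what this notebook uses.

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# A window of 3 needs 3 observations, so the first 2 results are NaN
ma3 = prices.rolling(window=3).mean()
print(ma3.tolist())
print(int(ma3.isna().sum()))   # window - 1 missing values

# min_periods=1 averages whatever is available instead of returning NaN
ma3_partial = prices.rolling(window=3, min_periods=1).mean()
print(ma3_partial.tolist())
```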

In [9]:
stock_data.head(300)

Price,Close,Close,Close,High,High,High,Low,Low,Low,Open,Open,Open,Volume,Volume,Volume,AAPL,AAPL,MSFT,MSFT,GOOGL,GOOGL
Ticker,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,...,MSFT,AAPL,GOOGL,MSFT,AAPL_50MA,AAPL_200MA,MSFT_50MA,MSFT_200MA,GOOGL_50MA,GOOGL_200MA
Date
2020-01-02,72.716042,68.108376,153.323242,72.776568,68.108376,153.428246,71.466782,67.004158,151.137280,71.720989,...,151.566834,135480400,27278000,22622100,,,,,,
2020-01-03,72.009125,67.752075,151.414124,72.771752,68.360669,152.683705,71.783969,67.045454,150.879566,71.941336,...,151.127764,146322800,23408000,21116200,,,,,,
2020-01-06,72.582901,69.557945,151.805496,72.621639,69.583321,151.872323,70.876068,67.228582,149.399972,71.127858,...,149.944085,118387200,46768000,20813700,,,,,,
2020-01-07,72.241554,69.423592,150.421417,72.849231,69.841098,152.416469,72.021238,69.246938,150.173234,72.592601,...,152.082377,108872000,34330000,21634100,,,,,,
2020-01-08,73.403656,69.917725,152.817337,73.706287,70.256604,153.495089,71.943766,69.300178,150.774555,71.943766,...,151.710031,132079200,35314000,27746500,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-03-05,118.777977,104.354584,223.935349,119.286666,104.808909,225.550079,115.011752,100.906568,218.965454,118.347556,...,221.924183,153766600,53100000,41872800,128.543370,111.140800,221.943701,205.320566,95.331406,82.158522
2021-03-08,113.828087,99.897392,219.864609,118.367123,105.184113,225.646699,113.681349,99.782440,219.613219,118.298646,...,223.712888,154376600,36868000,35267400,128.243574,111.321540,222.020368,205.528865,95.617319,82.307395
2021-03-09,118.455170,101.532585,226.043167,119.404064,102.704981,227.590222,116.205220,101.295719,224.002996,116.439996,...,225.172958,129525800,33920000,33080500,128.054293,111.528311,222.276942,205.778760,95.927964,82.465044
2021-03-10,117.369308,101.325073,224.728210,119.511650,102.579574,229.156640,116.850835,100.486082,224.360782,119.042099,...,229.156640,111943300,27100000,29746800,127.823563,111.727170,222.473840,206.021695,96.228557,82.620041


## Normalizing The data 

I will be using a MinMaxScaler instead of a Normalizer here.

Why?
- Because a MinMaxScaler rescales each feature (column) to a specific range, 0 to 1 by default, so every price in a column is expressed relative to that column's minimum and maximum
- While a Normalizer rescales each sample (row) to unit norm, which discards how large the values in that row are. That makes it less useful for stock data, where the level of the price over time is what I want the model to learn.
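The difference can be sketched by hand with NumPy. The formulas below are the per-column min-max rescaling and the per-row L2 normalization that the two scikit-learn classes apply with their default settings; the price values are made up.

```python
import numpy as np

# Hypothetical closing prices: rows are days, columns are two tickers
prices = np.array([[100.0, 10.0],
                   [150.0, 20.0],
                   [200.0, 40.0]])

# Min-max scaling works per column: (x - min) / (max - min)
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
minmax_scaled = (prices - col_min) / (col_max - col_min)

# Unit-norm scaling works per row: each day's vector is divided by its L2 norm
row_norms = np.linalg.norm(prices, axis=1, keepdims=True)
unit_norm = prices / row_norms

print(minmax_scaled)  # each column now runs from 0 to 1 over time
print(unit_norm)      # each row now has length 1, the price level is gone
```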

In [45]:
from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()  # Created a scaler object

# Normalizing the stock data using MinMaxScaler
stock_data[('Close','AAPL')] = scaler.fit_transform(stock_data[('Close','AAPL')].values.reshape(-1,1)) 
# The columns are a multi-index, so I target the 'Close' price (the final price of the day)
# of each ticker, e.g. ('Close', 'AAPL'), and normalize it.
# values.reshape(-1, 1) turns the column into the 2-D shape (n_samples, 1) that MinMaxScaler expects.
# fit_transform refits the scaler on every call, so each ticker is scaled by its own minimum and maximum.

stock_data[('Close','GOOGL')] = scaler.fit_transform(stock_data[('Close','GOOGL')].values.reshape(-1,1))
stock_data[('Close','MSFT')] = scaler.fit_transform(stock_data[('Close','MSFT')].values.reshape(-1,1))

In [46]:
print(stock_data.head())

Price          Close                           High                         \
Ticker          AAPL     GOOGL      MSFT       AAPL      GOOGL        MSFT   
Date                                                                         
2020-01-02  0.089415  0.108716  0.070703  72.776568  68.108376  153.428246   
2020-01-03  0.085954  0.106241  0.065008  72.771752  68.360669  152.683705   
2020-01-06  0.088763  0.118784  0.066176  72.621639  69.583321  151.872323   
2020-01-07  0.087092  0.117850  0.062047  72.849231  69.841098  152.416469   
2020-01-08  0.092781  0.121282  0.069194  73.706287  70.256604  153.495089   

Price             Low                              Open  ...              \
Ticker           AAPL      GOOGL        MSFT       AAPL  ...        MSFT   
Date                                                     ...               
2020-01-02  71.466782  67.004158  151.137280  71.720989  ...  151.566834   
2020-01-03  71.783969  67.045454  150.879566  71.941336  ...  151.12776

# Building the LSTM Model

##  Why LSTM ?
LSTMs are a common choice for time series data because they can carry information across many time steps. I need to prepare the data so that the model can learn from past stock data to predict future values.

But before we feed our data to the LSTM, it needs to be preprocessed. We will normalize it so that it is scaled to the range 0 to 1. Neural networks generally train more reliably on normalized inputs, because features on very different scales make gradient-based optimization harder.

In [47]:
# ['AAPL','MSFT','GOOGL'] are the top-level labels created during feature engineering, so this selects
# the six moving-average columns (50-day and 200-day for each ticker), not the 'Close' columns
stock_closing_price=stock_data[['AAPL','MSFT','GOOGL']]

# Converting the data to a NumPy array, the format used by the windowing code and by Keras
data=stock_closing_price.values

# Scaling the data
scaler=MinMaxScaler(feature_range=(0,1)) # Rescale each feature so that it falls in the range 0 to 1
data=scaler.fit_transform(data) # keep the result: fit_transform returns a new array and does not change data in place

# To predict the next day's values, I let the model look at the previous 60 days (a window size I chose).
# So I create sequences of 60 consecutive days, and the LSTM predicts the 61st day from each sequence.

def create_dataset(data,time_step=60): # Use the past 60 days of data to predict the 61st day
    X,y=[],[] # X holds the past 60 days of data and y holds the next day's values (the target)
    for i in range(time_step,len(data)): # start at time_step so that a full window is available
        X.append(data[i-time_step:i,:]) # if i=100, rows 40 to 99 and all 6 columns
        y.append(data[i,:]) # the next day's values for all 6 columns
    return np.array(X),np.array(y)

X,y=create_dataset(data) # build the sequences that are split into train and test sets
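The windowing logic is easier to follow on a toy array. This sketch uses a hypothetical helper `make_windows` with the same idea (a block of `time_step` rows paired with the row that follows it) on 6 "days" of 2 features.

```python
import numpy as np

def make_windows(values, time_step):
    """Pair each block of `time_step` rows with the row that follows it."""
    X, y = [], []
    for i in range(time_step, len(values)):
        X.append(values[i - time_step:i, :])
        y.append(values[i, :])
    return np.array(X), np.array(y)

# Toy data: 6 "days", 2 features, values 0..11
toy = np.arange(12, dtype=float).reshape(6, 2)
X_toy, y_toy = make_windows(toy, time_step=3)

print(X_toy.shape, y_toy.shape)  # (3, 3, 2) (3, 2)
print(X_toy[0])                  # rows 0, 1 and 2
print(y_toy[0])                  # row 3, the target for the first window
```

With `n` rows and a window of `time_step`, this always yields `n - time_step` sequences.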


## Split the data into train and test set

In [48]:
# I am splitting the data chronologically into 80% train and 20% test (no shuffling, so the test set is the most recent period)
train_size=int(len(X)*0.8) # if len(X)=1000 then 1000*0.8 = 800, which is 80% of the data
X_train,X_test=X[:train_size],X[train_size:] # train = :800 and test = 800:
y_train,y_test=y[:train_size],y[train_size:]


## Checking the shape of the data to make sure everything is organized correctly

In [49]:
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(958, 60, 6) (958, 6) (240, 60, 6) (240, 6)


So the X train set has 958 sequences, where each sequence contains 60 days of stock data with 6 features.

The X test set has 240 sequences, each with 60 days of stock data and 6 features.

The y train set has 958 target samples with 6 values each.

The y test set has 240 target samples with 6 values each.
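These shapes follow directly from the row count, the window size and the 80/20 split. A short check of the arithmetic, using only numbers that appear in the outputs above:

```python
n_rows = 1258      # entries reported by stock_data.info()
time_step = 60     # look-back window
n_features = 6     # columns selected for the model

n_sequences = n_rows - time_step     # one sequence per row that has a full window before it
n_train = int(n_sequences * 0.8)     # first 80% of the sequences
n_test = n_sequences - n_train       # remaining 20%

print((n_train, time_step, n_features), (n_train, n_features))
print((n_test, time_step, n_features), (n_test, n_features))
```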


In [50]:
print(np.isnan(X_train).sum())  # Check if X_train has NaNs
print(np.isnan(y_train).sum())  # Check if y_train has NaNs
print(np.isinf(X_train).sum())  # Check if X_train has infinite values
print(np.isinf(y_train).sum())  # Check if y_train has infinite values

34185
417
0
0


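The output indicates that there are NaN values in both X_train and y_train, and this can make the loss NaN during training. They come from the moving-average columns, whose first rows are undefined until a full window is available.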
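The two counts can be reproduced from the NaN pattern alone, without any price data. This sketch assumes three 50-day columns that are NaN for their first 49 rows, three 200-day columns that are NaN for their first 199 rows, 1258 rows, a window of 60 and an 80% train split; the column positions are arbitrary.

```python
import numpy as np

n_rows, time_step = 1258, 60
data = np.zeros((n_rows, 6))
data[:49, [0, 2, 4]] = np.nan    # 50-day MA columns: first 49 rows undefined
data[:199, [1, 3, 5]] = np.nan   # 200-day MA columns: first 199 rows undefined

# Same windowing as in the notebook: 60 rows in, the following row as target
X = np.array([data[i - time_step:i] for i in range(time_step, n_rows)])
y = np.array([data[i] for i in range(time_step, n_rows)])

train_size = int(len(X) * 0.8)
x_nan = int(np.isnan(X[:train_size]).sum())
y_nan = int(np.isnan(y[:train_size]).sum())
print(x_nan, y_nan)
```

The count for X is much larger than for y because every missing row is repeated in up to 60 overlapping windows.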

In [51]:
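# Replace the NaN values with the mean of all non-NaN entries in the array
# (np.nanmean without an axis returns one overall mean, not a per-column mean)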
X_train = np.nan_to_num(X_train, nan=np.nanmean(X_train))
y_train = np.nan_to_num(y_train, nan=np.nanmean(y_train))
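For comparison, here is a sketch of filling each feature with its own mean instead of one overall mean. This is an alternative I am only illustrating on a made-up array, not what the cell above does.

```python
import numpy as np

# Hypothetical batch: 2 sequences, 3 time steps, 2 features
X = np.array([[[np.nan, 10.0], [1.0, np.nan], [2.0, 30.0]],
              [[3.0, 40.0], [4.0, 50.0], [np.nan, 60.0]]])

# Global fill: one mean over every non-NaN entry in the array
global_fill = np.nan_to_num(X, nan=np.nanmean(X))

# Per-feature fill: one mean per feature, taken over sequences and time steps
feature_means = np.nanmean(X, axis=(0, 1))          # shape (2,)
per_feature_fill = np.where(np.isnan(X), feature_means, X)

print(feature_means)
print(global_fill[0, 0, 0], per_feature_fill[0, 0, 0])
```

When the features live on different scales, the per-feature version keeps a filled value within the range of its own column.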

### Okay, now that I have finished preprocessing the data and split it into train and test sets, it's time to build the model!!!😊

# Building LSTM

In [58]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input


# Here the model is built layer by layer. Each layer will pass an output to the next layer in a linear/sequential way. 
model=Sequential()

# Adding Input Layer as the First Layer 
model.add(Input(shape=(X_train.shape[1], X_train.shape[2])))

# Adding my first LSTM layer
model.add(LSTM(units=50,return_sequences=True))
# units=50 is the number of LSTM units. return_sequences=True returns the output at every time step, which the next
# LSTM layer needs as its input sequence; with False only the output of the final time step is returned.
# X_train.shape[1] is 60 days and X_train.shape[2] is 6 features:
# 3 tickers (AAPL, MSFT, GOOGL) x 2 moving averages (50-day and 200-day)

# Adding a dropout layer.
# Dropout is a technique used to reduce overfitting: Dropout(0.2) randomly sets 20% of the layer's outputs to zero during training.
# This pushes the model to learn more general patterns
model.add(Dropout(0.2))


# Adding Second Layer of LSTM
model.add(LSTM(units=50,return_sequences=False))
# Since this is the last LSTM layer, I only want the final output, so I have kept return_sequences=False
# This layer takes the sequence of outputs from the first LSTM layer and condenses it into one vector for the output layer.

# Adding Second Dropout Layer
model.add(Dropout(0.2))

# Adding The Output Layer (Dense Layer)
model.add(Dense(units=6)) # one output per feature: the next day's 50-day and 200-day moving averages for AAPL, MSFT and GOOGL

# Compile the Model
optimizer = Adam(learning_rate=0.0001, clipvalue=1.0)  # Clip gradients if they get too large
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Train the Model
model.fit(X_train,y_train,epochs=10,batch_size=32)
# In each epoch, the model goes through the entire training data once. By default Keras shuffles the samples every epoch
# and splits them into batches of 32; for each batch it computes the gradients and updates the weights.

Epoch 1/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 30ms/step - loss: 0.1556
Epoch 2/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 30ms/step - loss: 0.0756
Epoch 3/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 30ms/step - loss: 0.0258
Epoch 4/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 28ms/step - loss: 0.0164
Epoch 5/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 27ms/step - loss: 0.0158
Epoch 6/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 26ms/step - loss: 0.0140
Epoch 7/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 29ms/step - loss: 0.0127
Epoch 8/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 25ms/step - loss: 0.0119
Epoch 9/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 29ms/step - loss: 0.0120
Epoch 10/10
[1m30/30[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 33ms/step - loss: 0.0110

<keras.src.callbacks.history.History at 0x201522ba3a0>
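The `30/30` in the training log is the number of batches per epoch. It can be checked from the training set size and the batch size:

```python
import math

n_train = 958      # training sequences
batch_size = 32

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # 30 batches per epoch
print(last_batch)       # the final batch is smaller: 30 samples
```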

## Predict the Stock Prices 

In [61]:
def predict_function(inputs):
    return model(inputs)
predictions=predict_function(X_test)

During preprocessing, we scaled the data to be between 0 and 1. Now, we need to convert the predicted values back to the original stock price scale (the actual values) using the inverse transformation of the scaler.

# Rescaling the Predictions Back to their Original Range

In [62]:
# Rescale the predictions back to the original range
rescaled_predictions = scaler.inverse_transform(predictions)
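What the inverse transformation does can be written out by hand. This is a minimal sketch with made-up training columns and made-up model outputs; the helper names `to_scaled` and `to_original` are mine, and the formula is the inverse of min-max scaling to the range 0 to 1.

```python
import numpy as np

# Hypothetical training columns (e.g. two moving-average features)
train = np.array([[100.0, 50.0],
                  [150.0, 70.0],
                  [200.0, 90.0]])

# Per-column minimum and maximum, the values a min-max scaler learns when fitted
data_min = train.min(axis=0)
data_max = train.max(axis=0)

def to_scaled(x):
    # Forward transform: (x - min) / (max - min)
    return (x - data_min) / (data_max - data_min)

def to_original(x_scaled):
    # Inverse transform: x = x_scaled * (max - min) + min
    return x_scaled * (data_max - data_min) + data_min

scaled_predictions = np.array([[0.25, 0.5], [1.1, 0.0]])  # made-up model outputs
restored = to_original(scaled_predictions)
print(restored)
```

A scaled prediction above 1 simply maps to a value above the maximum seen when the scaler was fitted.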

In [63]:
rescaled_predictions

array([[184.63710867, 164.60311488, 343.14019921, 322.69260225,
        131.88508387, 120.5358016 ],
       [184.84696975, 164.76764046, 343.51749986, 323.01525544,
        132.02544566, 120.62545396],
       [185.06073098, 164.92901051, 343.92162457, 323.34220622,
        132.17377724, 120.71396152],
       ...,
       [214.43452564, 188.39446182, 401.87453664, 400.87190746,
        155.45219505, 139.39344261],
       [214.67209726, 188.48567032, 402.42210379, 401.23531516,
        155.63957228, 139.49778948],
       [214.91295484, 188.5761239 , 402.98421675, 401.6036738 ,
        155.83174116, 139.60465491]])
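The last step in my list is testing the model's performance. One common way to do that is to compare the rescaled predictions with the rescaled targets using the root-mean-squared error per column. The sketch below uses made-up arrays and a hypothetical helper `rmse_per_column`; in this notebook the inputs would be `rescaled_predictions` and `y_test` after the same inverse transformation.

```python
import numpy as np

def rmse_per_column(actual, predicted):
    """Root-mean-squared error for each output column."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2, axis=0))

# Made-up values standing in for rescaled targets and predictions
actual = np.array([[100.0, 200.0], [110.0, 210.0], [120.0, 220.0]])
predicted = np.array([[102.0, 200.0], [108.0, 213.0], [120.0, 217.0]])

errors = rmse_per_column(actual, predicted)
print(errors)  # one error per column, in the same units as the data
```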