# Goal of this project
I will predict future stock prices using the historical time-series data of stocks.

# What exactly do I mean by this?
Imagine I have a toy that moves back and forth in a certain pattern. If I can understand how it moves now, I will be able to guess where it will go next. That's what I will be doing here with stock prices: predicting future prices based on past movements.

# What will I need to do to achieve this goal?
- I will need to collect data from some source
- After collecting it, I will need to understand it and then preprocess it to remove irrelevant or missing records
- Next, I will have to build a model
- Then train the model so that it captures the patterns in the past data and predicts future values
- Test the model's performance

# Collecting The Data 

I use the Yahoo Finance API (through the yfinance library) to collect the stock prices 😄
- Before collecting the data, I install the yfinance library.

In [64]:
import yfinance as yf 

### Downloading the stock data for multiple tickers (Apple, Google and Microsoft) for the last 5 years

In [65]:
ticker_list=['AAPL','GOOGL','MSFT'] # Creating a ticker list 
stock_data=yf.download(ticker_list,start='2020-01-01',end='2025-01-01') # Downloads the data of last 5 years 

[*********************100%***********************]  3 of 3 completed


In [66]:
# Displaying the first and last 5 rows 
print(stock_data.head())
print(stock_data.tail())

Price           Close                              High             \
Ticker           AAPL      GOOGL        MSFT       AAPL      GOOGL   
Date                                                                 
2020-01-02  72.716064  68.108376  153.323242  72.776591  68.108376   
2020-01-03  72.009117  67.752075  151.414124  72.771745  68.360669   
2020-01-06  72.582893  69.557945  151.805511  72.621631  69.583321   
2020-01-07  72.241554  69.423592  150.421341  72.849231  69.841098   
2020-01-08  73.403641  69.917725  152.817322  73.706271  70.256604   

Price                         Low                              Open  \
Ticker            MSFT       AAPL      GOOGL        MSFT       AAPL   
Date                                                                  
2020-01-02  153.428246  71.466805  67.004158  151.137280  71.721011   
2020-01-03  152.683705  71.783962  67.045454  150.879566  71.941328   
2020-01-06  151.872338  70.876060  67.228582  149.399987  71.127851   
2020-01-07  1

Observation: The columns form a MultiIndex with two levels, Price (Close, High, Low, Open, Volume) and Ticker, while the rows are indexed by Date.
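
To make the column structure concrete, here is a minimal sketch of how such a frame can be sliced. The small `demo` frame and its values are made up for illustration (they are not the downloaded data); only the two column levels, `Price` and `Ticker`, mirror the output above.

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same column layout as the download:
# level 0 is the price field ("Price"), level 1 is the ticker ("Ticker").
dates = pd.date_range("2020-01-02", periods=3, freq="D", name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["AAPL", "MSFT"]], names=["Price", "Ticker"]
)
demo = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4), index=dates, columns=columns
)

# One column: pass the full (Price, Ticker) tuple.
aapl_close = demo[("Close", "AAPL")]

# All tickers for one price field: index by the top level only.
all_close = demo["Close"]

# All price fields for one ticker: take a cross-section on the "Ticker" level.
aapl_all = demo.xs("AAPL", axis=1, level="Ticker")

print(type(demo.index).__name__)    # rows: DatetimeIndex
print(type(demo.columns).__name__)  # columns: MultiIndex
```

The same three access patterns should apply to `stock_data`, since it has the same two column levels.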

In [67]:
stock_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2020-01-02 to 2024-12-31
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   (Close, AAPL)    1258 non-null   float64
 1   (Close, GOOGL)   1258 non-null   float64
 2   (Close, MSFT)    1258 non-null   float64
 3   (High, AAPL)     1258 non-null   float64
 4   (High, GOOGL)    1258 non-null   float64
 5   (High, MSFT)     1258 non-null   float64
 6   (Low, AAPL)      1258 non-null   float64
 7   (Low, GOOGL)     1258 non-null   float64
 8   (Low, MSFT)      1258 non-null   float64
 9   (Open, AAPL)     1258 non-null   float64
 10  (Open, GOOGL)    1258 non-null   float64
 11  (Open, MSFT)     1258 non-null   float64
 12  (Volume, AAPL)   1258 non-null   int64  
 13  (Volume, GOOGL)  1258 non-null   int64  
 14  (Volume, MSFT)   1258 non-null   int64  
dtypes: float64(12), int64(3)
memory usage: 157.2 KB


Observation: None of the 15 columns has null entries (each has 1258 non-null values).
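
The same check can be done programmatically instead of reading the `info()` table. This is a small sketch on a made-up frame (`demo` and its values are assumptions for illustration); on the real data the call would be `stock_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing value.
demo = pd.DataFrame({"Close": [1.0, np.nan, 3.0], "Volume": [10, 20, 30]})

# Count missing values per column, then in total.
missing_per_column = demo.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```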

# Data Preprocessing 

- Through feature engineering, I will add extra columns, such as moving averages over a fixed window, which might help my model make better predictions
- I will also normalize the stock prices so that all values lie on a comparable scale for the model

## Feature Engineering

In [68]:
# Calculating the 50-day and 200-day moving averages for AAPL, MSFT and GOOGL
# Since the columns are a MultiIndex with two levels (Price, Ticker), I have to reference
# both levels together. Each column is a tuple, for example ('Open', 'MSFT').
# The new moving-average columns are stored under the ticker as the top level, e.g. ('AAPL', 'AAPL_50MA').
stock_data[('AAPL','AAPL_50MA')]=stock_data[('Close','AAPL')].rolling(window=50).mean()
stock_data[('AAPL','AAPL_200MA')]=stock_data[('Close','AAPL')].rolling(window=200).mean()
stock_data[('MSFT','MSFT_50MA')]=stock_data[('Close','MSFT')].rolling(window=50).mean()
stock_data[('MSFT','MSFT_200MA')]=stock_data[('Close','MSFT')].rolling(window=200).mean()
stock_data[('GOOGL','GOOGL_50MA')]=stock_data[('Close','GOOGL')].rolling(window=50).mean()
stock_data[('GOOGL','GOOGL_200MA')]=stock_data[('Close','GOOGL')].rolling(window=200).mean()

In [69]:
stock_data.head()

Price,Close,Close,Close,High,High,High,Low,Low,Low,Open,Open,Open,Volume,Volume,Volume,AAPL,AAPL,MSFT,MSFT,GOOGL,GOOGL
Ticker,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,...,MSFT,AAPL,GOOGL,MSFT,AAPL_50MA,AAPL_200MA,MSFT_50MA,MSFT_200MA,GOOGL_50MA,GOOGL_200MA
Date,,,,,,,,,,,,,,,,,,,,,
2020-01-02,72.716064,68.108376,153.323242,72.776591,68.108376,153.428246,71.466805,67.004158,151.13728,71.721011,...,151.566834,135480400,27278000,22622100,,,,,,
2020-01-03,72.009117,67.752075,151.414124,72.771745,68.360669,152.683705,71.783962,67.045454,150.879566,71.941328,...,151.127764,146322800,23408000,21116200,,,,,,
2020-01-06,72.582893,69.557945,151.805511,72.621631,69.583321,151.872338,70.87606,67.228582,149.399987,71.127851,...,149.9441,118387200,46768000,20813700,,,,,,
2020-01-07,72.241554,69.423592,150.421341,72.849231,69.841098,152.416391,72.021238,69.246938,150.173158,72.592601,...,152.0823,108872000,34330000,21634100,,,,,,
2020-01-08,73.403641,69.917725,152.817322,73.706271,70.256604,153.495074,71.943751,69.300178,150.77454,71.943751,...,151.710016,132079200,35314000,27746500,,,,,,


If we calculate a moving average with a window of 50, the first 49 rows will have no value because there are not yet 50 days of data to average. Similarly, with a window of 200 we get NaN for the first 199 rows, since the first full 200-day window only completes on the 200th trading day.
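
A tiny worked example makes the NaN behaviour easy to see. It uses a made-up five-value series and a window of 3 instead of 50 or 200; the `min_periods=1` variant is shown only as one possible alternative, not something used in this notebook.

```python
import pandas as pd

# Made-up prices, window of 3 instead of 50/200 to keep the output short.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Default behaviour: the first (window - 1) = 2 results are NaN.
ma3 = prices.rolling(window=3).mean()
print(ma3.tolist())  # [nan, nan, 11.0, 12.0, 13.0]

# Alternative: allow partial windows, so early rows average whatever is available.
ma3_partial = prices.rolling(window=3, min_periods=1).mean()
print(ma3_partial.tolist())  # [10.0, 10.5, 11.0, 12.0, 13.0]
```

Rows that still contain NaN would typically have to be dropped or filled before they are fed to a model.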

In [70]:
stock_data.head(300)

Price,Close,Close,Close,High,High,High,Low,Low,Low,Open,Open,Open,Volume,Volume,Volume,AAPL,AAPL,MSFT,MSFT,GOOGL,GOOGL
Ticker,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,GOOGL,MSFT,AAPL,...,MSFT,AAPL,GOOGL,MSFT,AAPL_50MA,AAPL_200MA,MSFT_50MA,MSFT_200MA,GOOGL_50MA,GOOGL_200MA
Date,,,,,,,,,,,,,,,,,,,,,
2020-01-02,72.716064,68.108376,153.323242,72.776591,68.108376,153.428246,71.466805,67.004158,151.137280,71.721011,...,151.566834,135480400,27278000,22622100,,,,,,
2020-01-03,72.009117,67.752075,151.414124,72.771745,68.360669,152.683705,71.783962,67.045454,150.879566,71.941328,...,151.127764,146322800,23408000,21116200,,,,,,
2020-01-06,72.582893,69.557945,151.805511,72.621631,69.583321,151.872338,70.876060,67.228582,149.399987,71.127851,...,149.944100,118387200,46768000,20813700,,,,,,
2020-01-07,72.241554,69.423592,150.421341,72.849231,69.841098,152.416391,72.021238,69.246938,150.173158,72.592601,...,152.082300,108872000,34330000,21634100,,,,,,
2020-01-08,73.403641,69.917725,152.817322,73.706271,70.256604,153.495074,71.943751,69.300178,150.774540,71.943751,...,151.710016,132079200,35314000,27746500,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-03-05,118.777985,104.354584,223.935333,119.286674,104.808909,225.550064,115.011759,100.906568,218.965439,118.347564,...,221.924168,153766600,53100000,41872800,128.543371,111.140800,221.943702,205.320567,95.331405,82.158522
2021-03-08,113.828087,99.897400,219.864700,118.367123,105.184121,225.646793,113.681349,99.782448,219.613310,118.298646,...,223.712981,154376600,36868000,35267400,128.243575,111.321539,222.020370,205.528867,95.617318,82.307395
2021-03-09,118.455147,101.532578,226.043213,119.404041,102.704974,227.590268,116.205198,101.295711,224.003041,116.439974,...,225.173004,129525800,33920000,33080500,128.054293,111.528310,222.276946,205.778762,95.927963,82.465044
2021-03-10,117.369316,101.325073,224.728210,119.511658,102.579574,229.156640,116.850842,100.486082,224.360782,119.042107,...,229.156640,111943300,27100000,29746800,127.823564,111.727170,222.473844,206.021698,96.228556,82.620040


## Normalizing The data 

I will be using a MinMaxScaler instead of a Normalizer here. 

Why?
- Because a MinMaxScaler rescales each feature column to a specific range ([0, 1] by default) while preserving the relative differences between the prices in that column
- A Normalizer, in contrast, rescales each row (sample) to unit norm rather than each column, so it discards how large a price is compared with the other days. That makes it less useful for stock data, where the size of the price is exactly what we are interested in.
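
The difference shows up clearly on a single made-up price column. This is only an illustrative sketch with three invented prices, using the default settings of both scikit-learn classes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Three made-up closing prices as a single feature column (shape: 3 rows x 1 column).
prices = np.array([[50.0], [75.0], [100.0]])

# MinMaxScaler works per column: (x - min) / (max - min).
minmax = MinMaxScaler().fit_transform(prices)
print(minmax.ravel())  # [0.  0.5 1. ]

# Normalizer works per row: each row is divided by its own L2 norm.
# With a single positive value per row, every row collapses to 1.
rowwise = Normalizer().fit_transform(prices)
print(rowwise.ravel())  # [1. 1. 1.]
```

After the Normalizer, all three days look identical, which is why it is not a good fit for a single price series.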

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Since the columns are a MultiIndex, I target the 'Close' price (the final value of the day)
# of each ticker, e.g. ('Close', 'AAPL'), and normalize it to the range [0, 1].
# A separate scaler is kept per ticker so that scaled values can later be mapped back
# to real prices with inverse_transform.
scalers = {}
for ticker in ticker_list:
    scalers[ticker] = MinMaxScaler()  # One scaler object per ticker
    # values.reshape(-1, 1) reshapes the column into the 2D format MinMaxScaler expects;
    # ravel() flattens the result back to 1D so it can be assigned to a single column.
    stock_data[('Close', ticker)] = scalers[ticker].fit_transform(
        stock_data[('Close', ticker)].values.reshape(-1, 1)
    ).ravel()
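
Once prices are scaled, they no longer read as dollar amounts, so a fitted scaler is needed to get back to real prices. The sketch below shows that round trip on made-up values (the `close` array is an assumption for illustration, not data from this notebook).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up closing prices as a 2D column, the shape MinMaxScaler expects.
close = np.array([72.0, 90.0, 108.0]).reshape(-1, 1)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(close)         # values now lie in [0, 1]
restored = scaler.inverse_transform(scaled)  # back to the original price scale

print(scaled.ravel())    # [0.  0.5 1. ]
print(restored.ravel())  # [ 72.  90. 108.]
```

The inverse only works with the scaler that was fitted on that particular series, so each ticker needs its own fitted scaler.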

In [72]:
print(stock_data.head())

Price           Close                              High             \
Ticker           AAPL      GOOGL        MSFT       AAPL      GOOGL   
Date                                                                 
2020-01-02  72.716064  68.108376  153.323242  72.776591  68.108376   
2020-01-03  72.009117  67.752075  151.414124  72.771745  68.360669   
2020-01-06  72.582893  69.557945  151.805511  72.621631  69.583321   
2020-01-07  72.241554  69.423592  150.421341  72.849231  69.841098   
2020-01-08  73.403641  69.917725  152.817322  73.706271  70.256604   

Price                         Low                              Open  ...  \
Ticker            MSFT       AAPL      GOOGL        MSFT       AAPL  ...   
Date                                                                 ...   
2020-01-02  153.428246  71.466805  67.004158  151.137280  71.721011  ...   
2020-01-03  152.683705  71.783962  67.045454  150.879566  71.941328  ...   
2020-01-06  151.872338  70.876060  67.228582  149.399987  7