<a href="https://colab.research.google.com/github/sarcasmsc/AAPLPredictionModels/blob/main/AAPL_Stock_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AAPL Stock Prediction

## Part 1: Planning

### Task 1: Choose a Research Question

For this task, we're going to focus on formulating a research question with a compelling, real-world application that can be answered using a machine learning model. There is no specific area or topic that you're required to investigate.  In fact, it's usually more interesting if you're exploring your own personal interests or answering a question related to the industry where you will direct your job search.

**The most important rule of research questions is to actually have one.** Many, many projects suffer because the learner begins with a vague, general sense of direction but doesn’t have a clear, specific question to answer.  For example, you might wish to complete your Milestone 3 project about shopping habits at a particular online retailer, but the data you use and the ML model you develop will depend on your specific research question.  

For example, you’ll need different methods to answer the question “Are customers who see a new version of the store website more likely to make a purchase?” than to answer the question, “What products do customers often purchase in the same order?”

**The second most important rule of research questions is not to get in over your head.**  If you are really interested in your topic, you will probably have lots and lots of research questions that you are tempted to answer with this project.  Do not do it!  For this project, you will answer one (and only one) specific, well-defined question.  

Research is not a mechanical process and often doesn’t proceed in a straight line, but it's almost always best to spend time at the start of your project developing and refining a core question that will motivate your study: it makes every aspect of the research process easier.


**Although you may need to try a lot of things on your way to answering your research question, please remove any code that is not part of your finished project before submitting.**

**This notebook should be able to run from start to finish without error.** 


**Step 1** 

Brainstorm three industries or topics that you are most interested in exploring for this project.

**Step 1 Answer**

Healthcare, Finance, Company Sales Figures

**Step 2** 

Pick the industry or topic from Step 1 that interests you the most.  Brainstorm three potential research questions that you could answer for this project.  **Right now, the question will be fairly broad.  You will refine your research question once you select your data set.**  Each potential research question should have the following qualities:

*   It's a question, not a statement.
*   It has a real-world application with clear stakeholder(s).
*   You can answer the question using a machine learning model.
*   You know (or have a pretty good idea) which machine learning model will be appropriate for your research question.
*   You have a pretty good idea how to find the data that will answer your question (more on that later).
*   You don't know the answer in advance. 

**Step 2 Answer**

Healthcare: How are doctor's visits or life expectancy related to income? Finance: How much is a stock's price affected by volume, time of year, holidays, and news, and how does a stock perform on average after earnings calls? Company Sales: How much inventory is sold every Black Friday, and is that volume growing or shrinking?

**Step 3** 

Rank your three potential research questions from Step 2 in order from one to three with one being your top choice and two and three being backup choices.  

When ranking your research questions, think both about what interests you the most and what will be practical.  How well-defined is each research question?  How difficult will it be to find data to answer that question?  How difficult will it be to wrangle the data?  How confident do you feel about selecting an ML model to answer the research question?

This is a great time to involve your instructor, who can provide guidance on revising your ideas in Step 2 and identifying your top and backup research questions. 

**Step 3 Answer**

1. Stocks
2. Company Sales Figures
3. Healthcare

**My Milestone Project 3 Research Question Is:**

How accurately can you predict a stock's performance after an earnings call based on whether the company missed, matched, or exceeded its expected earnings?

### Task 2: Select a Data Set 

Once you have a research question, you need to find a publicly available data set that will allow you to answer it.


There are lots of potential sources.  A few that are particularly useful are:

Awesome public data sets: https://github.com/awesomedata/awesome-public-datasets

Kaggle: https://www.kaggle.com/datasets

UC Irvine Machine Learning: https://archive.ics.uci.edu/ml/datasets.php

Definitely Google around for more data sources and ask your peers and instructor!
 


**Step 1** 

Find at least one data set that you can use to answer your research question from Task 1.  

Your data set should:
* Be at least 100 records long.  You'll need a lot more records if you want to use a neural network.  
* Be publicly available on the Internet. Don't collect your own data.  Don’t use data that can’t be attributed to a reputable source.
* Have at least four features (remember that, when modeling text data, each word is a feature).

This is another good opportunity to check in with your instructor and make sure you are on the right track.  




**Step 1 Answer**

I'm working with a data set of Apple's stock price history. I will add other features from sites that list earnings dates and product releases, along with price history for the S&P 500 (through SPY), to analyze Apple's stock price.

https://www.kaggle.com/datasets/tarunpaparaju/apple-aapl-historical-stock-data

https://finance.yahoo.com/quote/SPY/history?period1=1267401600&period2=1582848000&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

https://finance.yahoo.com/calendar/earnings?symbol=AAPL

https://en.wikipedia.org/wiki/Timeline_of_Apple_Inc._products

https://www.businessofapps.com/data/apple-statistics/#:~:text=Share-,iPhone%20statistics,for%2050%25%20of%20its%20revenue.

**Step 2** 

Refine your research question so it applies to your specific data set.  

For example, if your broad research question was "How can I use machine learning to group customers by what they buy?" and you select a data set that contains Target holiday shopping orders, your refined research question might be something like, "Can I group Target customers by their holiday shopping orders?"

**At this point, it is possible that you will discover that you can't find a data set and research question that work well together.  Work with your instructor to either modify your current question or select one of your backup research questions above. To complete this task successfully, you will need to have selected a research question AND have a data set that can be used to answer it.**


**Step 2 Answer**

How can I use machine learning to understand and predict how Apple's stock is affected by its previous movements, along with other factors such as volume, time of year, product releases, and earnings?

### Task 3: Conduct Exploratory Data Analysis 

The purpose of exploratory data analysis at the project planning stage is to make sure that your data will answer your research question. It is possible to do everything correctly during this step and still hit a roadblock when you actually run your ML model, but it’s much less likely.


**Step 1** 

Determine the type(s) of machine learning model(s) you will use to answer your research question.  There might be only one type of model that will work, or you might have a number of models to choose from.

**Step 1 Answer**

A time series model makes the most sense here because we are trying to forecast future prices. I could also label each day by whether the next day closed with a net gain or loss, then use a classifier to predict from today's data whether a gain or loss is likely the next day.

**Step 2** 

Are there any requirements your data must meet to use the ML model(s) you listed in Step 1?  For example, if you plan to use a logistic regression model, you must have a categorical target.  If you wish to use natural language processing, you'll need a large amount of text data.


**Step 2 Answer**

If I use a time series model, I don't believe I need to create anything, since the price to forecast is already in the data. If I use a classifier, I will have to create a next-day gain-or-loss column to use as the target. I ended up using classifiers to find the most likely outcome for the next day based on today's data.
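
The next-day target described above can be sketched on a toy price series. This is a minimal sketch, assuming rows are sorted oldest-first; the wrangling code later in this notebook shifts in the opposite direction, which suggests the source file is newest-first. The column names `close`, `gain`, and `next_day` are illustrative and are not the names used in the project data.

```python
import pandas as pd

# Toy closing prices, sorted oldest-first (an assumption for this sketch)
prices = pd.DataFrame({"close": [10.0, 10.5, 10.2, 10.8, 10.8]})

# 1 if today's close is above the previous day's close, else 0
prices["gain"] = (prices["close"] > prices["close"].shift(1)).astype(int)

# Target: tomorrow's gain flag; the last row has no tomorrow, so it is NaN
prices["next_day"] = prices["gain"].shift(-1)

# Drop the row without a target before modeling
model_rows = prices.dropna(subset=["next_day"])
print(model_rows)
```

Dropping the final row matters because a classifier cannot be trained on a missing target.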

**Step 3** 

Using visualizations or summary data, show that the requirements you listed in Step 2 are met by your data.  Feel free to add code blocks as necessary.


In [None]:
#Step 3 Answer:

#Use this space to complete Step 3.  Feel free to add code blocks as necessary.

**Step 4**

Explore your data to determine what kind of data cleaning or wrangling will be necessary before you run your ML model. 

Here are some questions to consider:
* How many observations does your data set have?
* How many features does your data set have?
* Does your data have any missing values?
* If your data has missing values, will you drop the records or impute the missing data?  What will your imputation strategy be? 
* Does your data have any outliers or unusual values?  If so, how will you handle them? 
* Will you need to do any feature engineering or drop any features from your data?
* Will you need to encode any categorical data or standardize or normalize quantitative features?  
* Will you need to split your data into training and testing sets?
* Will you need to do any preprocessing of text data?

If you aren’t sure how to answer any of these questions, your instructor can give you guidance and suggestions.



In [None]:
#Step 4 Answer:

#Use this space to explore your data in Step 4.

**Step 4 Answer**

Write your answer to any relevant questions from Step 4 here.
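
Several of the Step 4 questions (number of observations, number of features, missing values, columns that need type conversion) can be answered with a few pandas calls. The sketch below uses a made-up three-row frame named `eda_demo` as a stand-in for the real data; the values and the name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data set (values are made up)
eda_demo = pd.DataFrame({
    "Date": ["02/28/2020", "02/27/2020", "02/26/2020"],
    "Volume": [106721200, 80151380, np.nan],
    "Open": [" $257.26", " $281.1", " $286.53"],
})

# Number of observations (rows) and features (columns)
n_rows, n_cols = eda_demo.shape

# Missing values per column
missing = eda_demo.isnull().sum()

# Columns that are not numeric yet and may need cleaning or encoding
non_numeric = [c for c in eda_demo.columns
               if not pd.api.types.is_numeric_dtype(eda_demo[c])]

print(n_rows, n_cols)
print(missing)
print(non_numeric)
```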

**At this point, it is possible that you will discover that you can't answer your research question with your data set.  Work with your instructor to either modify your research question, select a backup research question, or select a new data set. To complete this task successfully, you will need to have selected a research question, have a data set that can be used to answer it, AND have performed EDA.**

### Task 4: Develop a Project Plan 

Now you are ready to plan out everything that will be in your final project slide deck.  


**Step 1** 

Write about one paragraph of background to give your audience some context for your research question.  What motivated you to ask this specific research question?  What is the real world application?  Think about your stakeholder(s) and what that person would want to know about the topic before you got started.

**Step 1 Answer**

Understanding and evaluating stock prices is something I've always found interesting, so I wanted to create a model that could evaluate a stock's movement based on the data currently available. The model can help us better evaluate, understand, and predict the movement of stock prices, specifically Apple's stock price in this case.

**Step 2** 

Write one or two paragraphs describing your data set.  What was the source of the data?   Why did you choose to use this particular data set?  Did you experience any challenges with accessing or loading the data?  Describe any data wrangling you need to do to run your ML model.

**Step 2 Answer**

The main data comes from a Kaggle data set that tracks Apple's stock price. I combined it with other data sets that I either found or created myself: a data set that tracks the movement of the S&P 500 (through SPY) to relate to Apple's movement, and tables of Apple's earnings dates and product release dates that I merged into the main Apple stock price data set.

**Step 3**

Write one paragraph describing the model or models you plan to use.  Why did you pick this model or models?  How will they answer your research question?  What metrics will you use to evaluate the model performance?  

**Step 3 Answer**

I will use decision trees, random forests, and boosting models as classifiers to see whether the current day's movements can help predict the next day's movements. I will also use a time series model to predict the next day's movements.
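
One possible way to compare the classifiers named above is mean cross-validated accuracy. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the stock data, and the names `X_demo`, `y_demo`, `candidates`, and `cv_accuracy` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stock features (not real market data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Accuracy is only one choice; for a gain-or-loss target with uneven classes, precision and recall may be more informative.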

**Step 4**

Think about your intended audience.  How will you communicate your results to your stakeholders?  What data storytelling techniques will you use in your presentation to engage your audience?

**Step 4 Answer**

I plan to show my process of cleaning the data and explain how the variables I chose, such as earnings and release dates, work in tandem with the prices to help the classification models predict better. I will also show the time series model that predicts the movement of the stock. I prefer the classification method because time series models work better on things that repeat, like weather, whereas a company's choices or national and world economic events can drastically break a stock's value out of its patterns.

## Task 5: Data Wrangling

Use the following code block (feel free to add more) to do any data wrangling. It may be helpful to refer to the questions you answered previously in the Exploratory Data Analysis section. For reference, they are:

* How many observations does your data set have?
* How many features does your data set have?
* Does your data have any missing values?
* If your data has missing values, will you drop the records or impute the missing data? What will your imputation strategy be?
* Does your data have any outliers or unusual values? If so, how will you handle them?
* Will you need to do any feature engineering or drop any features from your data?
* Will you need to encode any categorical data or standardize or normalize quantitative features?
* Will you need to split your data into training and testing sets?
* Will you need to do any pre-processing of text data?

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn import tree 
from sklearn.tree import DecisionTreeClassifier
import graphviz
from io import StringIO
import pydotplus
from IPython.display import Image
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error as MSE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier

In [None]:
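#Upload data (I did these in separate blocks but thought it would look nicer together)
#files comes from google.colab, so it has to be imported before the first upload
from google.colab import files
aaplquotes = files.upload()
spyquotes = files.upload()
appleproducthistory = files.upload()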

In [None]:
#Put data into dataframes

df = pd.read_csv('HistoricalQuotes.csv')
df2 = pd.read_csv("spy2.csv")
df4 = pd.read_csv('appletimeline.csv')

In [None]:
#Working with Apple's historical quotes
#First opened the data to see what I was working with
df.head()
#Checked column names because initially df['Open'] was not working; it turned out it was titled ' Open' in the data set, so I had to change that
df.columns
#Removed the leading space in the column names
df = df.rename(columns={" Close/Last": "Close/Last", " Volume": "Volume", " Open":"Open", " High":"High", " Low":"Low"})
#Converted Close/Last to a float so the comparison below is numeric; if the column is text with a '$' sign, comparing it directly would compare strings
df['Close/Last'] = df['Close/Last'].astype(str).str.replace('$', '', regex=False).str.strip().astype(float)
#Created a column that determined if the day's closing price was higher than the previous day's closing price (this assumes rows are sorted newest-first, so shift(-1) is the previous day)
df['Gain or Loss'] = np.where(df['Close/Last'] > df['Close/Last'].shift(-1), 1,0)
#Created a column titled 'Next Day' that showed whether or not the next day was a gain or loss
df['Next Day'] = df['Gain or Loss'].shift(1)


In [None]:
#Working with SPY's stock data so that it could be compared to Apple's movement in the model
df2 = pd.read_csv("spy2.csv")
#Changed the column names because they were the same as df's column names
df2 = df2.rename(columns={"Open": "SPY Open", "High": "SPY High", "Low":"SPY Low", "Close*":"SPY Close", "Adj Close**": "SPY Adj Close", "Volume":"SPY Volume"})
#Created a gain or loss column for SPY
df2['SPY Gain or Loss'] = np.where(df2['SPY Close'] > df2['SPY Close'].shift(-1), 1,0)
#Created a next day column for SPY
df2['SPY Next Day'] = df2['SPY Gain or Loss'].shift(1)
#Merged the two data frames
df3 = pd.merge(df, df2, on='Date')
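
`pd.merge` defaults to an inner join, so any date present in only one table is silently dropped. One way to see how many rows are affected is an outer merge with `indicator=True`. The frames below are made-up stand-ins (`aapl`, `spy`) with invented values, not the project data.

```python
import pandas as pd

# Toy frames standing in for the AAPL and SPY tables (values are made up)
aapl = pd.DataFrame({"Date": ["01/02/2020", "01/03/2020", "01/06/2020"],
                     "Close/Last": [75.1, 74.4, 75.0]})
spy = pd.DataFrame({"Date": ["01/02/2020", "01/06/2020", "01/07/2020"],
                    "SPY Close": [324.9, 323.6, 322.7]})

# An outer merge with indicator=True labels where each row came from
check = pd.merge(aapl, spy, on="Date", how="outer", indicator=True)
print(check["_merge"].value_counts())

# The inner merge (the default) keeps only dates found in both tables
inner = pd.merge(aapl, spy, on="Date")
print(len(inner))
```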

In [None]:
#Working with df3 which is a combination of Apple's price history with SPY's price history
#Making sure it merged the way I wanted it to
df3.head()
#Created a column that would divide the dates into quarters
df3['Quarter'] = pd.PeriodIndex(df3.Date, freq='Q')
#Created 4 new columns named Quarter1, Quarter2, Quarter3, and Quarter4, set to 1 when the row falls in that quarter of the year and 0 otherwise
for q in range(1, 5):
    df3[f'Quarter{q}'] = (df3['Quarter'].dt.quarter == q).astype(int)
#Added earnings dates to the data set. I couldn't find a data set, so I entered each one manually.
earnings_dates = [
    '01/28/2020', '10/30/2019', '07/30/2019', '04/30/2019', '01/29/2019',
    '11/02/2018', '07/31/2018', '05/01/2018', '02/01/2018',
    '11/02/2017', '08/01/2017', '05/02/2017', '01/31/2017',
    '10/25/2016', '07/26/2016', '04/26/2016', '01/26/2016',
    '10/27/2015', '07/21/2015', '04/27/2015', '01/27/2015',
    '10/20/2014', '07/22/2014', '04/23/2014', '01/27/2014',
    '10/28/2013', '07/23/2013', '04/23/2013', '01/23/2013',
    '10/25/2012', '07/24/2012', '04/24/2012', '01/24/2012',
    '10/18/2011', '07/19/2011', '04/20/2011', '01/18/2011',
    '10/18/2010', '07/20/2010', '04/20/2010', '01/25/2010',
]
#Created a new column named Earnings: 1 on an earnings date, 0 on every other day
df3['Earnings'] = df3['Date'].isin(earnings_dates).astype(int)
#Checked for null values. I was missing some data from the SPY data set, but I was okay with it since it was only 40 values out of the whole data set.
df3.isnull().sum()
#Checked to make sure the earnings got added in (df5 does not exist yet at this point, so this checks df3)
df3['Earnings'].value_counts()[1]

In [None]:
#Adding in the 4th data set. I created it by copying the tables I needed from https://en.wikipedia.org/wiki/Timeline_of_Apple_Inc._products into an Excel sheet
df4 = pd.read_csv('appletimeline.csv')
#Dropped columns I accidentally left in the data set
df4 = df4.drop(columns=['Unnamed: 2', 'Unnamed: 3'])
#Changed the names of values that didn't properly move over from excel
df4.loc[df4['Family'] == 'AirP+C1:D23ort,\xa0drives', 'Family'] = 'Airport, drives'
df4.loc[df4['Family'] == 'AirPort,\xa0drives', 'Family'] = 'Airport, drives'
#Checked what values were in Family after
df4['Family'].unique()
#Renamed the products based on the type of product. I divided them into iPhone, iPad, Mac and Other Hardware, based on this site that shows how much revenue Apple gets from each product https://www.businessofapps.com/data/apple-statistics/#:~:text=Share-,iPhone%20statistics,for%2050%25%20of%20its%20revenue.
mac_families = ['MacBook Pro', 'MacBook', 'Mac Mini', 'iMac', 'Mac Pro', 'MacBook Air']
other_families = ['Airport, drives', 'Trackpad', 'Input Device Accessories', 'Displays',
                  'iPod Touch', 'iPod Nano', 'iPod Shuffle', 'Apple TV', 'AirPort',
                  'AirPort Express', 'Headphones', 'Apple Watch', 'Apple Mouse',
                  'Apple Keyboard', 'Speakers', 'iPhone accessories']
#'iPhone' and 'iPad' already have the names I want, so they are left as is
family_map = {name: 'Mac' for name in mac_families}
family_map.update({name: 'Other Hardware' for name in other_families})
df4['Family'] = df4['Family'].replace(family_map)
#Dropped duplicates because on some days Apple released several products of the same type (such as two types of iPhones on the same day)
df4 = df4.drop_duplicates()
#In Excel I created new columns named iPhone, iPad, Mac and Other Hardware and assigned them values of 1 or 0 based on whether or not they occurred, then reuploaded the Excel file to work on again
#I then combined the rows because some dates had multiple releases and were in the table twice, once for each release (if an iPad and iPhone were released on the same day, it would show up in the table as two separate rows)
agg_functions = {'iPhone':'sum', 'iPad':'sum', 'Mac':'sum', 'Other':'sum'}
df4 = df4.groupby(df4['Date']).aggregate(agg_functions)
#Merged df3 and df4
df5 = df3.merge(df4,how='left', left_on='Date', right_on='Date')
#Filled in the nulls (days with no product release) with 0
release_cols = ['iPhone', 'iPad', 'Mac', 'Other']
df5[release_cols] = df5[release_cols].fillna(0)
#Created a second dataframe to work with in case I change too much; .copy() keeps df5 unchanged
df5a = df5.copy()
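
The combine, merge, and fill steps above can be checked on a small example. This is a minimal sketch with invented rows; `releases`, `prices`, `per_day`, `merged`, and `working` are hypothetical names. It also shows why a working copy needs `.copy()`: plain assignment would only create a second name for the same dataframe.

```python
import pandas as pd

# Toy release table: one row per product type per date (values are made up)
releases = pd.DataFrame({
    "Date": ["09/20/2019", "09/20/2019", "03/18/2019"],
    "iPhone": [1, 0, 0],
    "iPad": [0, 0, 1],
    "Other": [0, 1, 0],
})
prices = pd.DataFrame({"Date": ["09/20/2019", "09/19/2019", "03/18/2019"],
                       "Close": [54.4, 55.2, 47.0]})

# Collapse duplicate dates so each date appears once
per_day = releases.groupby("Date", as_index=False).sum()

# Left merge keeps every trading day; days without a release become NaN
merged = prices.merge(per_day, on="Date", how="left")

# Fill the release flags with 0 and keep an independent copy for experiments
flag_cols = ["iPhone", "iPad", "Other"]
merged[flag_cols] = merged[flag_cols].fillna(0).astype(int)
working = merged.copy()
working.loc[0, "iPhone"] = 99  # does not touch `merged`
print(merged)
```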

## Task 6: Modeling

Use the following code block to build your ML model.  

In [None]:
#Though I tried to document everything I did, in case the data set has anything missing, I also sent a copy of the data set, and it can be uploaded here before testing the data.
from google.colab import files
dataset = files.upload()

In [None]:
#Assign the data set to a dataframe
df5a = pd.read_csv('milestone3projectfinal.csv')
#An extra unnamed column was saved that needs to be dropped
df5a = df5a.drop(columns=('Unnamed: 0'))

In [None]:
#For my first model, I decided to go with something simple to test out the data: logistic regression to see how much a gain or loss on one day affected the outcome on the next day.
#Dropped the rows with NaN values so the model could be fit
df5a = df5a.dropna()

#Assign X and y for the model, y being the target
y = df5a[['Next Day']]
X = df5a[['Gain or Loss']]

#Split the data into training, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42)

#Get the logistic regression accuracy for next day vs gain or loss
log_reg = LogisticRegression(random_state=0)
log_reg_model = log_reg.fit(X_train, y_train)
#Note that this scores the model on the same rows it was trained on
accuracy = log_reg_model.score(X_train, y_train)
print('Gain or Loss log_reg accuracy')
print(accuracy)
#Training accuracy ended up being 52%, so not much better than random chance

#Though I don't think LogReg would be a good model overall, I wanted to see how accurate a model with Volume would be

#Assign X and y again
y = df5a[['Next Day']]
X = df5a[['Volume']]

#Train model
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

#Logistic regression for next day vs volume
log_reg = LogisticRegression(random_state=0)
log_reg_model = log_reg.fit(X_train, y_train)
accuracy = log_reg_model.score(X_train, y_train)
print('Volume log_reg accuracy')
print(accuracy)
#Ended up being 49% accurate, so turned out worse than just flipping a coin
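
The single-feature experiments in this cell repeat the same split, fit, and score steps, and they score the model on its own training rows. A hedged alternative is a small helper that also reports validation accuracy and a majority-class baseline. The function name `single_feature_accuracy` and the random demo frame are assumptions for illustration; the demo data is random, so no real signal is expected.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def single_feature_accuracy(data, feature, target="Next Day", seed=42):
    """Fit logistic regression on one feature; return train/val accuracy and a baseline."""
    X = data[[feature]]
    y = data[target]
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=0.3333, random_state=seed)
    model = LogisticRegression(random_state=0).fit(X_train, y_train)
    # Baseline: always predict the most common class seen in training
    majority = y_train.mode().iloc[0]
    return {
        "train": model.score(X_train, y_train),
        "val": model.score(X_val, y_val),
        "baseline_val": float((y_val == majority).mean()),
    }


# Synthetic demo data (random, so no real signal is expected)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Gain or Loss": rng.integers(0, 2, size=400),
    "Next Day": rng.integers(0, 2, size=400),
})
result = single_feature_accuracy(demo, "Gain or Loss")
print(result)
```

Comparing `val` against `baseline_val` gives a fairer picture than comparing training accuracy against 50%.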

In [None]:
#Used .corr() to see correlation between Next Day and all numeric columns
df5.corr(numeric_only=True)
#I noticed here that not all the columns were showing up, and realized I had never removed the '$' signs from the original data set, so I went back and removed them
#regex=False treats '$' as a literal character rather than a regex anchor
df5['Open'] = df5['Open'].str.replace('$', '', regex=False)
df5['High'] = df5['High'].str.replace('$', '', regex=False)
df5['Low'] = df5['Low'].str.replace('$', '', regex=False)
#I then put them back into the data set as floats so that they could be used in corr()
df5['Open'] = df5['Open'].astype(float)
df5['High'] = df5['High'].astype(float)
df5['Low'] = df5['Low'].astype(float)
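
The dollar-sign cleanup can be verified on a few sample strings. In a regular expression `$` is an end-of-string anchor, so passing `regex=False` makes the intent explicit across pandas versions. The sample values below are made up to resemble the price text.

```python
import pandas as pd

# Toy column mimicking price text with a leading space and dollar sign
raw = pd.Series([" $273.36", " $273.52", " $292.65"])

# regex=False treats "$" as a literal character, not a regex anchor
cleaned = raw.str.replace("$", "", regex=False).str.strip().astype(float)
print(cleaned.tolist())
```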

In [None]:
#Because I found a few columns that had a much better correlation than the two I tried, I decided to try a few more logreg to see how well the model would do, though I didn't expect very good results.
#Assign X and y for the model, y being the target
y = df5a[['Next Day']]
X = df5a[['SPY Next Day']]

#Train model
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

#Get the logistic regression accuracy for next day vs SPY Next Day
log_reg = LogisticRegression(random_state=0)
log_reg_model = log_reg.fit(X_train, y_train)
accuracy = log_reg_model.score(X_train, y_train)
print('SPY Next Day log_reg accuracy')
print(accuracy)
#Accuracy ended up being 68% which was a lot higher than I expected

#I wanted to try SPY Gain or Loss since it is just the previous day shifted, though the correlation using .corr() showed it to be much lower
#Assign X and y for the model, y being the target
y = df5a[['Next Day']]
X = df5a[['SPY Gain or Loss']]

#Train model
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

#Get the logistic regression accuracy for next day vs SPY Gain or Loss
log_reg = LogisticRegression(random_state=0)
log_reg_model = log_reg.fit(X_train, y_train)
accuracy = log_reg_model.score(X_train, y_train)
print('SPY Gain or Loss log_reg accuracy')
print(accuracy)
#Ended up being barely above 50%, so not a very good model once again

Trees

In [None]:
#I decided to make a tree with everything included to get a starting point on the accuracy of a model with everything I had gotten data for, I figured this wouldn't be accurate but a good starting point.
#Tree1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)

#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

#Pipeline for decision tree
tree0 = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(criterion='entropy',random_state=42))

])

tree0.fit(X_train, y_train)

import sklearn
#Decision tree visual

features = ['Close/Last', 'Volume', 'Open', 'High', 'Low', 'Gain or Loss',
        'SPY Open', 'SPY High', 'SPY Low', 'SPY Close',
       'SPY Adj Close', 'SPY Volume', 'SPY Gain or Loss', 'SPY Next Day',
        'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'Earnings',
       'iPhone', 'iPad', 'Mac', 'Other']

dot_data = StringIO()
sklearn.tree.export_graphviz(tree0.named_steps['tree'], out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = features,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('tree.png')
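#display() renders the image even though this is not the last line of the cell
from IPython.display import display
display(Image(graph.create_png()))

#Cross-validated accuracy of the tree on the training set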
scores = cross_val_score(tree0, X_train, y_train, cv=10)
print('Tree1 score with Close/Last, Volume, Open, High, Low, Gain or Loss, SPY Open, SPY High, SPY Low, SPY Close, SPY Adj Close, SPY Volume, SPY Gain or Loss, SPY Next Day, Quarter1, Quarter2, Quarter3, Quarter4, Earnings, iPhone, iPad, Mac, Other')
print(scores.mean())
#The model ended up scoring 57% with all the columns included, though before I could run it through, I got a few errors and had to fix the data up some more.

In [None]:
#I got an error saying that there was a string of '1.05\xa0Dividend' somewhere in my data, so I had to find it
df5a[df5a.eq('1.05\xa0Dividend').any(axis=1)]
#It turned out that the dividend days of SPY were put into some of the columns. At first I removed it because I thought those days had only the dividend and not the actual prices, but when I checked the day, I actually had 2 rows for the dates with dividends: one for the dividend and one showing the actual values.
df5a[df5a.eq('03/18/2016').any(axis=1)]
#So instead, I dropped all days with NaNs in SPY High since it would remove all the dividend rows
df5a = df5a[df5a['SPY High'].notna()]
#After fixing that problem, I got another error: a string was keeping my decision tree from working, so I had to find what had the value of '58,126,000'
df5a[df5a.eq('58,126,000').any(axis=1)]
#It turned out that the SPY Volume column had ',' in its values instead of just the numbers, so I had to remove the commas and turn the column into a float.
df5a['SPY Volume'] = df5a['SPY Volume'].str.replace(',', '', regex=False)
df5a['SPY Volume'] = df5a['SPY Volume'].astype(float)
#Next, there were also NaNs still left in the data which was causing issues, but luckily it was only the NaNs for Next Day and Gains and Losses since the first and last day of the data set wouldn't have something to reference for the last/next day's value.
df5a.isnull().sum()
df5a = df5a.dropna()
df5a.isnull().sum()
#And finally, the data was ready for the tree.
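The cleanup steps above can be sketched as one small function. This is only an illustration under stated assumptions: the frame `raw`, its values, and the helper `clean_spy` are hypothetical, and `pd.to_numeric(..., errors='coerce')` is swapped in for the manual search-and-replace so that any leftover non-numeric string becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical toy frame imitating the problems described above:
# a dividend row with no High value, and comma-formatted volumes.
raw = pd.DataFrame({
    'Date': ['03/17/2016', '03/18/2016', '03/18/2016'],
    'SPY Open': ['203.00', '1.05\xa0Dividend', '204.17'],
    'SPY High': ['204.50', None, '204.78'],
    'SPY Volume': ['58,126,000', None, '61,000,000'],
})


def clean_spy(frame):
    """Drop dividend rows and convert the SPY columns to numbers."""
    # Dividend rows carry no High value, so filtering on 'SPY High' removes them.
    out = frame[frame['SPY High'].notna()].copy()
    for col in ['SPY Open', 'SPY High', 'SPY Volume']:
        # Strip thousands separators, then convert; unparsable strings become NaN.
        stripped = out[col].str.replace(',', '', regex=False)
        out[col] = pd.to_numeric(stripped, errors='coerce')
    return out


cleaned = clean_spy(raw)
print(cleaned)
```

Running `cleaned.isnull().sum()` afterwards shows whether any value failed to convert.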

AdaBoost

In [None]:
#Adaboost pipeline (all major columns)
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline
AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB1 accuracy(all):', scores)
#Accuracy turned out to be 65%. I'm happy that the result is much higher even with so much clutter still in the mix. I want to try each type of boosting before I move on to narrowing down which columns are important.
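Every cell repeats the same two-stage split, so it is worth checking what share of the data each part receives. The short calculation below uses only the two `test_size` values from the cells; exact row counts additionally depend on how scikit-learn rounds, which is not modeled here.

```python
# Fractions used by the two train_test_split calls in each cell.
test_fraction = 0.25      # first call: test_size=0.25
val_within_rest = 0.3333  # second call: test_size=0.3333 of what is left

rest = 1 - test_fraction
split = {
    'train': rest * (1 - val_within_rest),
    'val': rest * val_within_rest,
    'test': test_fraction,
}
for name, fraction in split.items():
    print(f'{name}: {fraction:.4f}')
```

The split is therefore roughly 50% train, 25% validation, and 25% test, and the cross-validation scores in this section are computed on the training half only.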

In [None]:
#AdaBoost pipeline 2 (Dropped High, Low, SPY High, SPY Low)
#I decided to drop High, Low, SPY High and SPY Low. I feel like highs and lows are important but don't give away a lot of information: if the closing price matches one of them they add nothing, and if it is somewhere in between they still don't add much.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB2 accuracy(Dropped High, Low, SPY High, SPY Low):', scores)

In [None]:
#AdaBoost pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4)
#On top of High, Low, SPY High and SPY Low, this run also drops Other and the four quarter columns.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB3 accuracy(Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4):', scores)
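The AdaBoost cells so far differ only in the list of dropped columns. A minimal sketch of how that repetition could be folded into one helper is shown below. The function `cv_accuracy` and the frame `toy` are hypothetical; `toy` holds random values and only borrows a few column names from `df5a`, so its scores say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def cv_accuracy(frame, drop_cols, target='Next Day', cv=10):
    """Mean cross-validated accuracy on the training split after dropping drop_cols."""
    y = frame[target]
    X = frame.drop([target] + list(drop_cols), axis=1)
    # Same two-stage split as in the notebook cells.
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=0.3333, random_state=42)
    pipe = Pipeline([
        ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('scaler', StandardScaler()),
        ('AdaBoost', AdaBoostClassifier(random_state=42)),
    ])
    return cross_val_score(pipe, X_train, y_train, cv=cv).mean()


# Hypothetical stand-in for df5a: random values under a few of its column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(240, 4)), columns=['Open', 'High', 'Low', 'Volume'])
toy['Next Day'] = (toy['Open'] + rng.normal(scale=0.5, size=240) > 0).astype(int)

results = {
    'all': cv_accuracy(toy, []),
    'dropped High, Low': cv_accuracy(toy, ['High', 'Low']),
}
for label, score in results.items():
    print(label, round(score, 3))
```

With the real frame, a call such as `cv_accuracy(df5a, ['Date', 'Quarter', 'High', 'Low'])` would stand in for one whole cell.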

In [None]:
#Adaboost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()
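The learning-rate loop above can also be written with scikit-learn's `GridSearchCV`, which runs the same cross-validation for each value and keeps the scores together. This is a sketch under assumptions: `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train`, and the parameter name `AdaBoost__learning_rate` relies on the pipeline step being called `'AdaBoost'` as in the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would pass X_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

pipe = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42)),
])

# '<step name>__<parameter>' addresses a parameter of one pipeline step.
rates = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]
search = GridSearchCV(pipe, {'AdaBoost__learning_rate': rates}, cv=10, scoring='accuracy')
search.fit(X_demo, y_demo)

# One mean accuracy per learning rate, in the same order as the grid.
mean_accuracy = search.cv_results_['mean_test_score'] * 100
for rate, acc in zip(rates, mean_accuracy):
    print(rate, round(acc, 2))
print('best:', search.best_params_)
```

`mean_accuracy` can be plotted against `rates` exactly as in the cell above.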

In [None]:
#Adaboost Pipeline 2 (Dropped High, Low, SPY High, SPY Low) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 2 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()
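In the sweep cells the learning-rate list is typed twice, once for the loop and once for the table, so the two copies can drift apart. A small sketch of generating the grid once, assuming evenly spaced rates; the names `start`, `step`, `count`, and `learning_rates` are illustrative.

```python
# Build the grid once and reuse it for both the loop and the results table.
start, step, count = 0.01, 0.01, 10
# round(..., 2) removes floating-point noise such as 0.030000000000000002.
learning_rates = [round(start + step * k, 2) for k in range(count)]
print(learning_rates)
```

The loop then becomes `for i in learning_rates:` and the table can be built from the same list.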

In [None]:
#AdaBoost pipeline 4 (Dropped Quarter1, Quarter2, Quarter3, Quarter4)
#This run keeps the highs and lows and drops only the four quarter columns.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB4 accuracy(Dropped Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#AdaBoost pipeline 5 (Dropped SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4)
#This run drops SPY High, SPY Low and the four quarter columns, but keeps AAPL's own High and Low.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'SPY High', 'SPY Low','Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB5 accuracy(Dropped SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#AdaBoost pipeline 6 (Dropped SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4)
#This run drops SPY High, SPY Low, Other and the four quarter columns, but keeps AAPL's own High and Low.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'SPY High', 'SPY Low', 'Other','Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

AdaBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('AdaBoost', AdaBoostClassifier(random_state=42))

])

AdaBoost.fit(X_train, y_train)
scores = cross_val_score(AdaBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('ADAB6 accuracy(Dropped SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#Adaboost Pipeline 7 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 7 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#Adaboost Pipeline 8 (Dropped SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 8 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#Adaboost Pipeline 9 (Dropped SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter','SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 9 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

Gradient Boost

In [None]:
#Gradientboost pipeline (all major columns)
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline
Gradientboost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('Gradientboost', GradientBoostingClassifier(random_state=42))

])

Gradientboost.fit(X_train, y_train)
scores = cross_val_score(Gradientboost, X_train, y_train, cv=10)
scores = scores.mean()
print('GB1 Pipeline accuracy(all):', scores)
#I recorded the rest of the accuracies in an Excel sheet and will post it when I finish my models.
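As an alternative to copying each accuracy into a spreadsheet by hand, the scores could be collected in code and written out once. The sketch below uses only the standard library; the helper `record`, the field names, and the file name mentioned in the comment are hypothetical, and the fold scores passed in are placeholders rather than results from this notebook.

```python
import csv
import io
import statistics

results = []  # one dict per model run


def record(name, dropped, fold_scores):
    """Store the mean and spread of one run's cross-validation scores."""
    results.append({
        'model': name,
        'dropped': ', '.join(dropped),
        'mean accuracy': round(statistics.mean(fold_scores), 4),
        'std': round(statistics.stdev(fold_scores), 4),
    })


# Placeholder fold scores for illustration only; these are not results from the notebook.
record('EXAMPLE', ['High', 'Low'], [0.5, 0.6, 0.7])

# Swap the buffer for open('results.csv', 'w', newline='') to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
writer.writeheader()
writer.writerows(results)
print(buffer.getvalue())
```

In a cell, `record('GB1', [], scores)` would be called before `scores` is replaced by its mean, and the resulting CSV opens directly in Excel.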

In [None]:
#gradient boost classifier with GB1 pipeline learning rate 0.1 to 0.5
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
mean_accuracy = []

for i in [0.10, 0.20, 0.30, 0.40, 0.50]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.20, 0.30, 0.40, 0.50])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#gradient boost classifier with GB1 pipeline learning rate 0.1 to 0.15
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
mean_accuracy = []

for i in [0.10, 0.11, 0.12, 0.13, 0.14, 0.15]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.11, 0.12, 0.13, 0.14, 0.15])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#GradientBoost Pipeline 2 (Dropped High, Low, SPY High, SPY Low)
#I decided to drop High, Low, SPY High and SPY Low. I feel like highs and lows are important but don't give away a lot of information: if the closing price matches one of them they add nothing, and if it is somewhere in between they still don't add much.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 2
Gradientboost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('Gradientboost', GradientBoostingClassifier(random_state=42))

])

Gradientboost.fit(X_train, y_train)
scores = cross_val_score(Gradientboost, X_train, y_train, cv=10)
scores = scores.mean()
print('GB2 accuracy(Dropped High, Low, SPY High, SPY Low):', scores)

In [None]:
#gradient boost classifier with GB2 pipeline
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 2 with learning rates
mean_accuracy = []

for i in [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#Gradientboost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#GradientBoost pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4)
#On top of High, Low, SPY High and SPY Low, this run also drops Other and the four quarter columns.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

Gradientboost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('Gradientboost', GradientBoostingClassifier(random_state=42))

])

Gradientboost.fit(X_train, y_train)
scores = cross_val_score(Gradientboost, X_train, y_train, cv=10)
scores = scores.mean()
print('GB3 accuracy(Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#GradientBoost pipeline 4 (Dropped Quarter1, Quarter2, Quarter3, Quarter4)
#This run keeps the highs and lows and drops only the four quarter columns.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

Gradientboost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('Gradientboost', GradientBoostingClassifier(random_state=42))

])

Gradientboost.fit(X_train, y_train)
scores = cross_val_score(Gradientboost, X_train, y_train, cv=10)
scores = scores.mean()
print('GB4 accuracy(Dropped Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#Gradientboost Pipeline 5 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.1 to 0.2
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 5 with learning rates
mean_accuracy = []

for i in [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#Gradientboost Pipeline 5 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 5 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

XGBoost

In [None]:
#I hadn't tried XGBoost before and wanted to build a model with it. I had not imported the library yet, so I did that first.
import xgboost as xgb
#XGBoost pipeline (all major columns)
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline
XGBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('XGBoost', xgb.XGBClassifier(random_state=42))

])

XGBoost.fit(X_train, y_train)
scores = cross_val_score(XGBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('XGB accuracy(all):', scores)
#Accuracy turned out to be 64%, which is lower than AdaBoost, but I'm sure tweaking both will change the outcomes. I will decide which is better based on that.

In [None]:
#XGBoost pipeline 2 (Dropped High, Low, SPY High, SPY Low)
#I decided to drop High, Low, SPY High and SPY Low. I feel like highs and lows are important but don't give away a lot of information: if the closing price matches one of them they add nothing, and if it is somewhere in between they still don't add much.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline
XGBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('XGBoost', xgb.XGBClassifier(random_state=42))

])

XGBoost.fit(X_train, y_train)
scores = cross_val_score(XGBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('XGB2 accuracy(Dropped High, Low, SPY High, SPY Low):', scores)

In [None]:
#XGBoost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4)
#I decided to drop more columns in case there was too much clutter for the data to be analyzed properly.
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

XGBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('XGBoost', xgb.XGBClassifier(random_state=42))

])

XGBoost.fit(X_train, y_train)
scores = cross_val_score(XGBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('XGB3 accuracy(Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#XGBoost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.1 to 0.5
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('XGBoost', xgb.XGBClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#XGBoost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.1 to 0.2
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('XGBoost', xgb.XGBClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#XGBoost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('XGBoost', xgb.XGBClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#XGBoost pipeline 4 (Quarter1, Quarter2, Quarter3, Quarter4)
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 

XGBoost = Pipeline([
    ('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('XGBoost', xgb.XGBClassifier(random_state=42))

])

XGBoost.fit(X_train, y_train)
scores = cross_val_score(XGBoost, X_train, y_train, cv=10)
scores = scores.mean()
print('XGBoost4 accuracy(Dropped Quarter1, Quarter2, Quarter3, Quarter4):', scores)

In [None]:
#XGBoost Pipeline 5 (Dropped High, Low, SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline 5 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('XGBoost', xgb.XGBClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_train, y_train, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()
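
The learning-rate sweeps above repeat the same loop with a different list of rates each time. As a minimal sketch, the loop could be wrapped in one helper built on scikit-learn's `GridSearchCV`, which uses the same stratified folds as `cross_val_score` when `cv` is an integer and the estimator is a classifier. The helper name `sweep_learning_rates`, the step name `clf`, and the synthetic data are assumptions for illustration. `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so the sketch runs without xgboost installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def sweep_learning_rates(X, y, learning_rates, cv=10):
    """Cross-validate one pipeline per learning rate and return a results table."""
    pipe = Pipeline([
        ('imp_mean', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        # Stand-in classifier; swap in xgb.XGBClassifier to mirror the notebook
        ('clf', GradientBoostingClassifier(random_state=42, n_estimators=20)),
    ])
    search = GridSearchCV(pipe, {'clf__learning_rate': list(learning_rates)},
                          cv=cv, scoring='accuracy')
    search.fit(X, y)
    return pd.DataFrame({
        # One row per candidate, in the order the rates were given
        'learning rate': [p['clf__learning_rate'] for p in search.cv_results_['params']],
        'mean accuracy': search.cv_results_['mean_test_score'] * 100,
    })


# Synthetic stand-in for X_train / y_train (not the AAPL data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
table = sweep_learning_rates(X_demo, y_demo, [0.01, 0.05, 0.1], cv=5)
print(table)
```

The returned table has the same two columns as `to_plot`, so the existing plotting lines would work on it unchanged.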

##Task 7: Evaluation

Use the following code block to evaluate your ML model.  

In [None]:
#Task 7 - Evaluation - Use this code block to evaluate your ML model.
# Make sure to clearly comment your code.

Log Reg Evaluation

In [None]:
#Because I found a few columns that correlated much better than the two I tried, I decided to try a few more logistic regressions to see how well the model would do, though I didn't expect very good results.
#Assign X and y for the model, y being the target
y = df5a['Next Day']
X = df5a[['SPY Next Day']]

#Split into train, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42)

#Get the logistic regression ROC AUC for Next Day vs SPY Next Day
log_reg = LogisticRegression(random_state=0)
log_reg_model = log_reg.fit(X_train, y_train)
scoring = 'roc_auc'
#Score the logistic regression itself; kfold is assumed to be the splitter defined earlier in the notebook
results = model_selection.cross_val_score(log_reg_model, X_val, y_val, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

AdaBoost Evaluation

In [None]:
#Adaboost Pipeline 3 (Dropped High, Low, SPY High, SPY Low, Quarter1, Quarter2, Quarter3, Quarter4) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#AdaBoost Pipeline 3 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('AdaBoost', AdaBoostClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_test, y_test, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

Gradient Boost Evaluation

In [None]:
#Gradientboost Pipeline 5 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.1 to 0.2
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 5 with learning rates
mean_accuracy = []

for i in [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_test, y_test, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

In [None]:
#Gradientboost Pipeline 5 (Dropped High, Low, SPY High, SPY Low, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#Gradientboost Pipeline 5 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('Gradient', GradientBoostingClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_test, y_test, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()

XGBoost Evaluation

In [None]:
#XGBoost Pipeline 6 (Dropped High, Low, SPY High, SPY Low, Earnings, Other, Quarter1, Quarter2, Quarter3, Quarter4, SPY Next Day) learning rates 0.01 to 0.1
y = df5a['Next Day']
X = df5a.drop(['Next Day', 'Date', 'Quarter', 'High', 'Low', 'SPY High', 'SPY Low', 'Earnings', 'Other', 'Quarter1', 'Quarter2', 'Quarter3', 'Quarter4', 'SPY Next Day'], axis=1)
#Train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
#XGBoost Pipeline 6 with learning rates
mean_accuracy = []

for i in [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
  pipe     = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('XGBoost', xgb.XGBClassifier(random_state=42, learning_rate=i))])

  pipe.fit(X_train, y_train)
  scores = cross_val_score(pipe, X_test, y_test, cv=10)
  mean_accuracy.append(scores.mean())


learning_rate_df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1])
learning_rate_df.rename({0:'learning rate'}, axis=1, inplace=True)

mean_accuracy_df = pd.DataFrame(mean_accuracy)*100
mean_accuracy_df.rename({0:'mean accuracy'}, axis=1, inplace=True)

to_plot = pd.concat([learning_rate_df, mean_accuracy_df], axis=1)

print(to_plot)

plt.plot(to_plot['learning rate'], to_plot['mean accuracy'])
plt.xlabel('Learning Rate')
plt.ylabel('Mean accuracy %')
plt.show()
#Surprisingly better results than during modeling: with the training data I was getting around 47% in my tests, but with the test data the model scored in the 54-55% range
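
One thing that may help explain the gap between the training and test numbers: `cross_val_score` clones the pipeline and refits it on folds of whatever data it is given. Passing `X_test` and `y_test` therefore scores pipelines trained on parts of the test set, not the pipeline that was fitted on `X_train`. The sketch below contrasts the two approaches on synthetic data. The data, the step name `clf`, and the use of `GradientBoostingClassifier` as a stand-in are assumptions for illustration, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the AAPL data set)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

pipe = Pipeline([
    ('imp_mean', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42, n_estimators=20,
                                       learning_rate=0.1)),
])

# Held-out evaluation: fit on training rows only, score once on unseen test rows
pipe.fit(X_train, y_train)
held_out_accuracy = pipe.score(X_test, y_test)

# Cross-validating on the test set: clones of the pipeline are refit on
# folds of X_test, so the model fitted above is never the one being scored
refit_on_test_accuracy = cross_val_score(pipe, X_test, y_test, cv=10).mean()

print('Held-out accuracy:', held_out_accuracy)
print('CV-on-test accuracy:', refit_on_test_accuracy)
```

The two numbers answer different questions, so comparing a cross-validated training score with a cross-validated test score does not show how the trained model generalizes. The single `pipe.score(X_test, y_test)` call does.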

##Task 8: Results

Use the following text block to summarize your results.  How will you communicate the answer to your research question to stakeholders? 

**Task 8 Answer**

Modeling with the current-day data in my data sets did not produce a model good enough to use for day trading or market prediction. During modeling, the models scored in the mid-40% range, which is worse than random chance. With the test data, accuracy improved to above 50% and even approached 55%, but that is still not good enough to base financial decisions on. With the SPY Next Day column available, the models reached above 66% accuracy, which would make them right about 2 out of 3 days and good enough to support financial decisions. The problem is that SPY Next Day is not available at prediction time either, since it is the next day's movement. It does show a correlation between the two movements, but that is most likely because AAPL has been a top-500 company by market cap throughout the period the data set covers, meaning it is inside the SPY index and its own movement influences SPY.
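
To make the "better than random chance" comparison concrete, an accuracy figure can be set against two references: the accuracy of always predicting the most common class, and the uncertainty that comes from a finite number of test days. The sketch below is a rough illustration using only the standard library. The function names, the label counts, and the 55% figure are hypothetical placeholders, not values measured in this notebook.

```python
import math
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples
    (normal approximation to the binomial)."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# Hypothetical numbers for illustration only, not results from this notebook
demo_labels = [1] * 53 + [0] * 47   # 53 "up" days and 47 "down" days
demo_accuracy = 0.55                # assumed model accuracy on those days

baseline = majority_baseline_accuracy(demo_labels)
low, high = accuracy_interval(demo_accuracy, n=len(demo_labels))

print('Majority-class baseline:', baseline)
print('Accuracy interval:', (round(low, 3), round(high, 3)))
print('Clearly beats baseline:', low > baseline)
```

With a small test set the interval is wide, so an accuracy a few points above 50% may not be distinguishable from the baseline.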

#Part 2: Presentation 



##Task 1: Slide Deck

Create and present a slide deck to your classmates showing how you answered your research question.  You can find a Slide Template in the course materials or create your own.  The presentation should be about five minutes long.

##Task 2: Reflection

Use the following text block to reflect on the project.  Did you run into anything that was particularly difficult?  What part(s) of the project did you enjoy most?  Did your results leave you with any new questions you'd investigate if you had more time?

**Task 2 Answer** 

My biggest difficulty was finding a model that I thought was accurate, and deciding when to stop modeling and move on to evaluating the models. Some of the data wrangling and cleaning also gave me trouble, because I didn't know how to manipulate the data exactly the way I wanted, or I didn't find out I hadn't cleaned it properly until I ran models and got errors. I enjoyed modeling and trying to find accurate models, but I do feel like I spent too much time on it. My results made me question how much AAPL affects SPY. Even though AAPL is in the index, I felt it wasn't a large enough part of the index for SPY to make the models accurate. But it did end up being accurate.