# Stock Market Analysis
##### Shubham Mittal - CMSC320 Final Project Tutorial

### 1. Introduction
Interest in the stock market has been increasing, especially since the introduction of more user-friendly trading apps such as Robinhood. Financial data has also become more accessible, which has partly fueled this increase. The pandemic shook the global economy, and the market experienced a lot of volatility. As a result, at the beginning of the pandemic, people started [moving away](https://www.statista.com/topics/7856/covid-19-and-investment-behavior-worldwide/#dossierKeyfigures) from the stock market toward lower-risk investments. However, as the pandemic has continued, a [new surge of people](https://www.cnbc.com/2021/04/08/a-large-chunk-of-the-retail-investing-crowd-got-their-start-during-the-pandemic-schwab-survey-shows.html) has started coming back to the stock market.

Data analysis of the stock market can help traders and investors make decisions about buying and selling securities and gain an edge in the market. In this tutorial, we will analyze stock market data at different intervals to identify intraday, monthly, and yearly trends. Then we will use machine learning to verify and predict these trends. These trends could help identify periods that may be more lucrative for trading (a certain time of day or a certain month of the year), allowing investors to make better trading decisions. This tutorial will also help you gain a better understanding of the data science pipeline and hopefully allow you to analyze stock market data for your own purposes.

The data science pipeline in this tutorial will consist of the following steps:
- Data Collection
- Data Cleaning & Processing
- Exploratory Analysis & Visualization
- Analysis & Hypothesis Testing
- Conclusions & Insights

### 2. Required Tools
To understand this tutorial and follow along, you will need a basic understanding of Python. [Click here](https://developers.google.com/edu/python/?hl=en) for a quick refresher on Python. We will be using Python 3.8 in this tutorial, along with the following libraries:
- [Pandas](https://pandas.pydata.org/docs/getting_started/install.html)
- [NumPy](https://numpy.org/install/)
- [yfinance](https://pypi.org/project/yfinance/) 

Pandas and NumPy will be useful for manipulating and storing our data. We will be getting financial data from Yahoo Finance using the yfinance library. There are also other services for obtaining stock data, such as Google Finance, the Bloomberg Finance API, and Quandl, and you may use any of them if you prefer.

All of the above libraries can be installed using pip (which is recommended), but the links provide more detailed installation instructions for specific operating systems. Successful execution of the cell below should import all the libraries required for the rest of this tutorial.
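
Before running the import cell, it can help to confirm that the three libraries are actually installed. The snippet below is a minimal sketch that uses only the Python standard library. The `required` list and the suggested pip command are assumptions based on the package names listed above.

```python
import importlib.util

# Package names as listed in this section (assumed to match the pip package names)
required = ["pandas", "numpy", "yfinance"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries, try: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```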

In [1]:
# Importing dependencies for purposes as defined above
import pandas as pd
import numpy as np
import yfinance as yfin

### 3. Data Collection

The first step of the data science pipeline is to obtain the data, in this case using the yfinance library. For the purposes of this tutorial, we will be analyzing two tickers, Google (GOOGL) and the S&P 500 index (^GSPC). Google is a major technology company, and the S&P 500, or Standard and Poor's 500, is an index tracking the performance of 500 large companies listed on stock exchanges in the United States (Google being one of them).

We will be collecting data for the above two tickers at three intervals (hourly, daily, and monthly) to analyze trends across different time periods. To get started, we can simply use the yfinance library's download function to obtain the necessary data. The function takes multiple parameters that let us specify how we want the data presented. Some of the main ones are:
- tickers (list or string): list for multiple tickers and string for a singular ticker
- start (string): start date in the format YYYY-MM-DD
- end (string): end date in the format YYYY-MM-DD
- period (string): can be used instead of start and end to get a period of the most recent data (Example: 1y)
- interval (string): the interval at which data is provided. Valid intervals include 1m (1 minute), 1h (1 hour), 1d (1 day), and 1mo (1 month)
- progress (boolean): True shows a progress bar while the data is being obtained and False hides it
- auto_adjust (boolean): True automatically adjusts all OHLC (open, high, low, close) values to account for things like stock splits, while False leaves them unadjusted
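
The relationship between `period` and `start`/`end` can be sketched with a small helper. Note that `build_download_kwargs` is a hypothetical name, not part of yfinance. It only assembles the parameters listed above into a dictionary, and the date range in the second call is purely illustrative.

```python
def build_download_kwargs(tickers, period=None, start=None, end=None, interval="1d"):
    """Hypothetical helper: collect keyword arguments for the download function."""
    # period is meant to be used instead of start/end, so reject mixing them
    if period is not None and (start is not None or end is not None):
        raise ValueError("Use either period or start/end, not both")

    kwargs = {
        "tickers": tickers,
        "interval": interval,
        "progress": False,    # hide the progress bar
        "auto_adjust": True,  # request adjusted OHLC values
    }
    if period is not None:
        kwargs["period"] = period
    else:
        if start is not None:
            kwargs["start"] = start  # format YYYY-MM-DD
        if end is not None:
            kwargs["end"] = end      # format YYYY-MM-DD
    return kwargs


# Most recent year of daily data for a single ticker
recent = build_download_kwargs("GOOGL", period="1y")

# Explicit (illustrative) date range for both tickers
ranged = build_download_kwargs(["GOOGL", "^GSPC"], start="2017-05-12", end="2017-06-12")
```

Either dictionary could then be unpacked into the call, for example `yfin.download(**recent)`.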

To learn more about the options the yfinance library provides, [click here](https://pypi.org/project/yfinance/).

In [12]:
# Obtaining hourly data for the past 365 days and daily and monthly data for the past 5 years for GOOGL
hourly_GOOGL = yfin.download("GOOGL", period="365d", interval="1h", progress=False, auto_adjust=True)
daily_GOOGL = yfin.download("GOOGL", period="5y", interval="1d", progress=False, auto_adjust=True)
monthly_GOOGL = yfin.download("GOOGL", period="5y", interval="1mo", progress=False, auto_adjust=True)

# Obtaining hourly data for the past 365 days and daily and monthly data for the past 5 years for the S&P 500
hourly_GSPC = yfin.download("^GSPC", period="365d", interval="1h", progress=False, auto_adjust=True)
daily_GSPC = yfin.download("^GSPC", period="5y", interval="1d", progress=False, auto_adjust=True)
monthly_GSPC = yfin.download("^GSPC", period="5y", interval="1mo", progress=False, auto_adjust=True)

We have now obtained hourly data for the past year along with daily and monthly data for the past 5 years. This should be enough information for us to conduct our analysis and identify trends. Make sure to pay special attention to the values you input for parameters to ensure you are receiving data as expected.

Next, let's take a look at what the data we have collected looks like. 

In [13]:
hourly_GOOGL.head(5)

| | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2020-11-30 09:30:00-05:00 | 1776.51001 | 1780.23999 | 1759.030029 | 1760.839844 | 270458 |
| 2020-11-30 10:30:00-05:00 | 1759.300049 | 1760.52002 | 1748.51001 | 1753.694946 | 155822 |
| 2020-11-30 11:30:00-05:00 | 1752.670044 | 1754.204956 | 1749.140015 | 1754.204956 | 105271 |
| 2020-11-30 12:30:00-05:00 | 1754.405029 | 1754.890015 | 1749.85498 | 1751.089966 | 103912 |
| 2020-11-30 13:30:00-05:00 | 1750.420044 | 1755.650024 | 1748.77002 | 1754.849976 | 103848 |
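
One way to confirm that the download matches the requested interval is to look at the spacing between consecutive timestamps. The sketch below rebuilds only the five timestamps shown in the hourly output above rather than downloading anything, so it checks spacing within a single trading session. Gaps between sessions (overnight and weekends) would be larger.

```python
import pandas as pd

# Timestamps copied from the hourly_GOOGL.head(5) output above
timestamps = pd.to_datetime([
    "2020-11-30 09:30:00-05:00",
    "2020-11-30 10:30:00-05:00",
    "2020-11-30 11:30:00-05:00",
    "2020-11-30 12:30:00-05:00",
    "2020-11-30 13:30:00-05:00",
])

# Difference between each timestamp and the one before it
steps = timestamps[1:] - timestamps[:-1]

# True when every step within the session is exactly one hour
is_hourly = bool((steps == pd.Timedelta(hours=1)).all())
print(is_hourly)
```

With the real data, the same check could be applied to `hourly_GOOGL.index`.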


In [14]:
daily_GOOGL.head()

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2017-05-12 | 957.849976 | 957.97998 | 952.059998 | 955.140015 | 1214900 |
| 2017-05-15 | 955.289978 | 962.700012 | 952.820007 | 959.219971 | 1337700 |
| 2017-05-16 | 963.549988 | 965.900024 | 960.349976 | 964.609985 | 1101500 |
| 2017-05-17 | 959.700012 | 960.98999 | 940.059998 | 942.169983 | 2449100 |
| 2017-05-18 | 943.200012 | 954.179993 | 941.27002 | 950.5 | 1800500 |


In [15]:
monthly_GOOGL.head()

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2017-06-01 | 990.960022 | 1008.609985 | 929.599976 | 929.679993 | 44085300 |
| 2017-07-01 | 933.219971 | 1006.190002 | 915.309998 | 945.5 | 41908600 |
| 2017-08-01 | 947.809998 | 957.200012 | 918.599976 | 955.23999 | 32846200 |
| 2017-09-01 | 957.469971 | 975.809998 | 924.51001 | 973.719971 | 29626200 |
| 2017-10-01 | 975.650024 | 1063.619995 | 961.950012 | 1033.040039 | 36853800 |


From the above dataframes, we can note a couple of things. All of our dataframes are indexed by date, or by date and time for the hourly dataframes. We can also notice that the daily_GOOGL dataframe has no rows for 2017-05-13 and 2017-05-14. Those two dates fall on a weekend, when the market is closed, so no data is expected for them. Other gaps in the data could be due to a number of reasons, including market holidays, early market closes, or data that simply was not recorded.
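
To see which calendar days are absent from a daily index, and whether any of them are weekdays, a minimal sketch with pandas is shown below. It rebuilds only the five dates from the daily_GOOGL output above instead of downloading data, and the variable names are illustrative.

```python
import pandas as pd

# Dates copied from the daily_GOOGL.head() output above
index = pd.to_datetime([
    "2017-05-12", "2017-05-15", "2017-05-16", "2017-05-17", "2017-05-18",
])

# Every calendar day between the first and last date in the index
all_days = pd.date_range(index.min(), index.max(), freq="D")

# Calendar days that have no row in the data
missing_days = all_days.difference(index)

# dayofweek is 0 for Monday through 6 for Sunday, so values below 5 are weekdays
missing_weekdays = missing_days[missing_days.dayofweek < 5]

print(list(missing_days.strftime("%Y-%m-%d")))
print(len(missing_weekdays))
```

Any dates left in `missing_weekdays` on the full dataset would be the ones worth checking against market holidays.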