# Prototype Of Stock Sets

Author: Taylor D. Gabatino

Description: Prototype of Stock Sets to Be Used for the Big Data Analysis

This notebook analyzes a small set of stocks to determine whether the approach is scalable. At a minimum, 5 to 10 stocks are selected and prototyped so that the same analysis can later be applied to a larger dataset.

The stocks that are chosen for this particular prototype are the top 10 stocks across the years 2010-2020. 

In [17]:
# If you wish to install the modules needed, here is a list of the installations
#!pip install pandas
#!pip install pandas_datareader
#!pip install yfinance

# Import Statements
import pandas as pd
from pandas import DataFrame
import pandas_datareader.data as web
import yfinance as yf
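# Patch pandas_datareader so its Yahoo requests go through yfinance.
# yfinance has no web_override(); pdr_override() is the call provided by older yfinance releases.
yf.pdr_override()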
import numpy as np
import scipy as sp
import csv

# For graphing purposes
import math
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import style
%matplotlib inline

# For styling
import datetime as dt
import os

# Imports for collections
import collections
from collections import Counter

# Imports for Machine Learning
import sklearn
from sklearn.model_selection import train_test_split

print("Done importing")

ModuleNotFoundError: No module named 'yfinance'

# Stock Analysis

An important part of stock analysis is to run the regression over years in which no significant event is likely to have distorted the data. Years to avoid would be, for example, those with political fluctuations that gave the stock market cause to react.

For the purposes of this study, the years 2000 to 2005 are analyzed, on the assumption that no significant event affected a sector of the selected stocks during that period.

In [16]:
# Date range for the stock market data
start = dt.datetime(2000, 1, 1) # Start of the 5-year window
end = dt.datetime(2005, 1, 1) # End of the 5-year window

# Selection of 5 stocks to run a base analysis on
aapl = web.DataReader('AAPL', 'yahoo', start, end)
nvda = web.DataReader('NVDA', 'yahoo', start, end)
msft = web.DataReader('MSFT', 'yahoo', start, end)
amzn = web.DataReader('AMZN', 'yahoo', start, end)
goog = web.DataReader('GOOGL', 'yahoo', start, end)

# Print out and determine the given values
print(aapl)
print(nvda)
print(msft)
print(amzn)
print(goog)

RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/AAPL/history?period1=946735200&period2=1104674399&interval=1d&frequency=1d&filter=history
Response Text:
[Yahoo HTML error page omitted. Its embedded comments report "status code : 404" and "Not Found on Server"; the visible message reads "Will be right back... Thank you for your patience. Our engineers are working quickly to resolve the issue."]
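
The error above comes from the first Yahoo request, which stops the cell before the remaining tickers are tried. As a minimal sketch, assuming a hypothetical helper named `fetch_all` (it is not part of pandas_datareader or yfinance), each ticker can be requested separately so that failures are recorded instead of raised. The `reader` argument is any callable taking `(ticker, start, end)`; a stand-in reader is used here so the sketch runs without network access.

```python
import datetime as dt

def fetch_all(tickers, reader, start, end):
    """Call reader(ticker, start, end) per ticker; collect failures instead of raising."""
    frames, errors = {}, {}
    for ticker in tickers:
        try:
            frames[ticker] = reader(ticker, start, end)
        except Exception as exc:  # broad on purpose: record any failure and move on
            errors[ticker] = f"{type(exc).__name__}: {exc}"
    return frames, errors

def demo_reader(ticker, start, end):
    # Hypothetical stand-in for a real data source; fails for one ticker on purpose.
    if ticker == "NVDA":
        raise RuntimeError("simulated download failure")
    return {"ticker": ticker, "start": start, "end": end}

start = dt.datetime(2000, 1, 1)
end = dt.datetime(2005, 1, 1)
frames, errors = fetch_all(["AAPL", "NVDA", "MSFT", "AMZN", "GOOGL"], demo_reader, start, end)
print(sorted(frames))
print(errors)
```

In the notebook itself, `reader` could be something like `lambda t, s, e: web.DataReader(t, 'yahoo', s, e)`, assuming that data source responds.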

## GICS Sector and Market Analysis

The Global Industry Classification Standard (GICS) is a methodology for assigning each company to the economic sector that best matches its business operations. There are 11 GICS sectors in total, and every classified stock falls under one of them.

The sector definitions are as follows:

* Energy
* Materials
* Industrials
* Consumer Discretionary
* Consumer Staples
* Health Care
* Financials
* Information Technology
* Real Estate
* Communication Services
* Utilities

## Stock Analysis of Volatility

In [10]:
'''
This cell loads the NASDAQ CSVs from their respective downloads.
Change DATA_DIR to the directory where these files were downloaded.
'''
DATA_DIR = '/Users/taylor/ICS-438/big-data-stock-analysis/data'

# All stocks
nasdaq_stocks = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_all.csv'))
# Energy
energy_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_energy.csv'))
# Capital Goods
capital_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_capitalgoods.csv'))

# Industrials (no file loaded)
# Consumer Discretionary
cons_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_consumerservices.csv'))
# Consumer Staples (no file loaded)
# Health Care
health_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_healthcare.csv'))
# Financials
finance_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_finance.csv'))
# Information Technology
tech_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_tech.csv'))
# Real Estate (no file loaded)
# Communication Services (no file loaded)
# Utilities
utils_nasdaq = pd.read_csv(os.path.join(DATA_DIR, 'nasdaq_utils.csv'))

# Store the sectors in a dictionary with a specific bin
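
The dictionary mentioned in the comment above is not written out in the notebook. A minimal sketch of one possible layout follows, keyed by GICS sector and using the file names from the loading cell. The names `DATA_DIR`, `SECTOR_FILES`, and `sector_paths` are assumptions for illustration, and `nasdaq_capitalgoods.csv` is left out because the cell does not tie it to a GICS sector.

```python
from pathlib import Path

# Assumption: DATA_DIR is wherever the NASDAQ screener CSVs were saved.
DATA_DIR = Path("data")

# GICS sector -> CSV file name from the loading cell (None where no file is loaded).
SECTOR_FILES = {
    "Energy": "nasdaq_energy.csv",
    "Materials": None,
    "Industrials": None,
    "Consumer Discretionary": "nasdaq_consumerservices.csv",
    "Consumer Staples": None,
    "Health Care": "nasdaq_healthcare.csv",
    "Financials": "nasdaq_finance.csv",
    "Information Technology": "nasdaq_tech.csv",
    "Real Estate": None,
    "Communication Services": None,
    "Utilities": "nasdaq_utils.csv",
}

def sector_paths(data_dir, sector_files):
    """Return {sector: Path} for the sectors that have a file assigned."""
    return {s: Path(data_dir) / f for s, f in sector_files.items() if f is not None}

paths = sector_paths(DATA_DIR, SECTOR_FILES)
missing = [s for s, f in SECTOR_FILES.items() if f is None]
print(len(paths), "sectors with files; missing:", missing)
```

Each path could then be passed to `pd.read_csv` in a loop, which keeps the sector label attached to its DataFrame.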


In [11]:
# Select an n number of stocks randomly from m given stocks
# Can do NASDAQ, NYSE, AMEX 
# Downloaded from NASDAQ
# https://www.nasdaq.com/market-activity/stocks/screener?exchange=NASDAQ&render=download
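num_stock_available = 500 # The total number of stocks (S&P 500)
num_stocks = 5 # The number of stocks to select (1% of the total)
np.random.seed(50) # Seed before drawing so the selection is reproducible
# np.random is used directly rather than through the scipy alias
x = np.random.uniform(low=1, high=num_stock_available, size=num_stocks)
y = []
for i in range(num_stocks):
    y.append(int(x[i])) # Truncate each draw to an integer index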
unique_stocks = np.unique(y)
print(unique_stocks)
print(len(unique_stocks))

# The above is placeholder code for the actual stocks that are going to be used

[114 128 189 198 247]
5
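
Truncating uniform draws to integers can produce repeated values, in which case `np.unique` returns fewer than `num_stocks` entries. A minimal sketch of an alternative uses the standard library's `random.sample`, which draws without replacement. The helper name `pick_indices` is hypothetical, and the indices it returns will not match the output recorded above.

```python
import random

def pick_indices(n_available, n_pick, seed):
    """Draw n_pick distinct indices from 1..n_available with a seeded generator."""
    rng = random.Random(seed)  # local generator, seeded before any draw
    return sorted(rng.sample(range(1, n_available + 1), n_pick))

picked = pick_indices(500, 5, seed=50)
print(picked, len(picked))
```

Using a local `random.Random` instance keeps the seed from affecting other random draws in the notebook.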


In [21]:
# Import Statements
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
'''
Documentation:
Importing pyspark requires a couple of environment path variables to be changed:
When running locally on your computer, you must have pyspark and hadoop installed, along with JDK 8 or 11.
(Using a virtual machine doesn't work; it needs to be a JDK.)
When I ran this on my own machine, I needed to launch the notebook with JAVA_HOME set:
JAVA_HOME=$(/usr/libexec/java_home -v 11.0) jupyter notebook
Taylor D. Gabatino
'''

In [23]:
sc = SparkContext.getOrCreate() # getOrCreate() reuses a running context; creating a second SparkContext raises an error stating that there cannot be multiple contexts
ss = SparkSession(sc)
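
Because the Spark context depends on a JDK being reachable, a quick check before creating it can make a missing Java setup easier to diagnose. This is a minimal sketch, assuming a hypothetical helper `java_available`; it only checks that `JAVA_HOME` is set or that a `java` executable is on `PATH`, and does not verify the JDK version.

```python
import os
import shutil

def java_available(env=None):
    """True if JAVA_HOME is set or a `java` executable is found on PATH."""
    env = os.environ if env is None else env
    if env.get("JAVA_HOME"):
        return True
    # Search only the PATH from the given environment mapping.
    return shutil.which("java", path=env.get("PATH", "")) is not None

print("Java found:", java_available())
```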

In [24]:
# Version Checking
print(f"Spark version is {sc.version}")

print(f"Python version is {sc.pythonVer}")

print(f"The name of the master is {sc.master}")

Spark version is 3.1.2
Python version is 3.9
The name of the master is local[*]


In [26]:
nasdaq_stocks = ss.read.csv('/Users/taylor/ICS-438/big-data-stock-analysis/data/nasdaq_all.csv', inferSchema=True, header=True)

In [27]:
nasdaq_stocks.columns

['Symbol',
 'Name',
 'Last Sale',
 'Net Change',
 '% Change',
 'Market Cap',
 'Country',
 'IPO Year',
 'Volume',
 'Sector',
 'Industry']

In [28]:
nasdaq_stocks.printSchema()

root
 |-- Symbol: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Last Sale: string (nullable = true)
 |-- Net Change: double (nullable = true)
 |-- % Change: string (nullable = true)
 |-- Market Cap: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- IPO Year: integer (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Sector: string (nullable = true)
 |-- Industry: string (nullable = true)
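
The schema shows `Last Sale` and `% Change` inferred as strings, since values in this file carry a `$` prefix or a `%` suffix (for example `'$1.57'` and `'10.563%'`). A minimal sketch of converting such values to numbers in plain Python follows. The helper names are hypothetical, and the handling of thousands separators is an assumption about the file rather than something shown in the output.

```python
def parse_currency(text):
    """'$1.57' -> 1.57; returns None for missing values."""
    if text is None or text.strip() == "":
        return None
    # Assumption: thousands separators, if present, are commas.
    return float(text.strip().replace("$", "").replace(",", ""))

def parse_percent(text):
    """'10.563%' -> 10.563 (percentage points, not a fraction)."""
    if text is None or text.strip() == "":
        return None
    return float(text.strip().rstrip("%").replace(",", ""))

print(parse_currency("$1.57"), parse_percent("-0.102%"))
```

In Spark, the same cleanup could be expressed with `pyspark.sql.functions.regexp_replace` followed by a cast to `double`, which avoids a Python UDF.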



In [29]:
for line in nasdaq_stocks.head(10):
    print(line, '\n')

Row(Symbol='AACG', Name='ATA Creativity Global American Depositary Shares', Last Sale='$1.57', Net Change=0.15, % Change='10.563%', Market Cap=49261764.0, Country='China', IPO Year=None, Volume=83892, Sector='Miscellaneous', Industry='Service to the Health Industry') 

Row(Symbol='AACI', Name='Armada Acquisition Corp. I Common Stock', Last Sale='$9.76', Net Change=-0.01, % Change='-0.102%', Market Cap=202124720.0, Country='United States', IPO Year=2021, Volume=10539, Sector=None, Industry=None) 

Row(Symbol='AACIW', Name='Armada Acquisition Corp. I Warrant', Last Sale='$0.501', Net Change=0.0008, % Change='0.16%', Market Cap=0.0, Country='United States', IPO Year=2021, Volume=20701, Sector=None, Industry=None) 

Row(Symbol='AADI', Name='Aadi Bioscience Inc. Common Stock', Last Sale='$22.40', Net Change=0.45, % Change='2.05%', Market Cap=468026250.0, Country='United States', IPO Year=None, Volume=151558, Sector='Health Care', Industry='Biotechnology: Pharmaceutical Preparations') 

Row(

In [30]:
nasdaq_stocks.describe().show()

                                                                                

+-------+------+--------------------+---------+-------------------+--------+--------------------+---------+------------------+------------------+----------------+--------------------+
|summary|Symbol|                Name|Last Sale|         Net Change|% Change|          Market Cap|  Country|          IPO Year|            Volume|          Sector|            Industry|
+-------+------+--------------------+---------+-------------------+--------+--------------------+---------+------------------+------------------+----------------+--------------------+
|  count|  4791|                4791|     4791|               4791|    4778|                4788|     4785|              3065|              4791|            3640|                3650|
|   mean|  null|                null|     null|-0.5892035900647064|    null| 6.243646469818504E9|     null|2015.9181076672105| 840550.9628470048|            null|                null|
| stddev|  null|                null|     null|  2.612771379758438|    null|7.90