# Libraries

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np

import re

# First Steps

1. Get some stock market data
2. Save it as CSV
3. Transform into a usable format (keep as price? transform into percent change? per day?)
4. Define Neural Network Structure
5. Create Neural Network


Keeping stock prices in their standard format would allow each entry to serve as ground truth, while percentage changes would be a relative system. Finding the change in a stock from day X to day Y would then require compounding every daily change in between, that is, multiplying the growth factors (1 + change) for each day from X to Y.
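A minimal sketch of that compounding step, to check that the relative format loses no information. The prices are made up, and `total_change` is a hypothetical helper rather than anything defined elsewhere in this notebook:

```python
import pandas as pd

# Made-up closing prices for four consecutive days.
prices = pd.Series([100.0, 110.0, 99.0, 108.9], name="close")

# Daily relative change; the first entry is NaN because it has no previous day.
returns = prices.pct_change()


def total_change(returns, start, end):
    """Relative change from position `start` to position `end`,
    recovered by compounding the daily changes in between."""
    growth_factors = 1.0 + returns.iloc[start + 1 : end + 1]
    return growth_factors.prod() - 1.0


compounded = total_change(returns, 0, 3)
direct = prices.iloc[3] / prices.iloc[0] - 1.0
print(compounded, direct)
```

Both numbers should agree up to floating-point error, so the choice between the two formats is about convenience, not about what can be recovered.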

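Keeping stock prices in percentage change format would also make prices comparable before and after a stock split, since only the relative move matters (the split day itself would still need handling if the prices are not split-adjusted). Adjusted price already accounts for splits, but I don't want to use it because I don't understand it well enough yet.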

## Get some stock market data

I expect the easiest way to do this is with yfinance, though I might look at some other options because yfinance isn't as well maintained as some more professional open source libraries, and it no longer links to an official Yahoo API.

Polygon.io seems promising, but the free version only has 5 API calls per minute (not sure the extent of data I can grab with a single API call, but it doesn't sound like very much), and I can only access two years of historical data.

I'll probably stick with yfinance for now. It's the easiest.

I've done some work with yfinance and one of the difficulties is that it pulls data into a **multi-index dataframe**. I'll have to **transform it into something tidy**.
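Here is a rough sketch of what that tidying could look like. It uses a small synthetic frame instead of a real download, and it assumes the columns are (field, ticker) pairs; the level names `Field` and `Ticker` and the tickers `AAA` and `BBB` are made up, and the real level names and order depend on the yfinance version and download options.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a multi-ticker download: columns are (field, ticker) pairs.
dates = pd.date_range("2023-01-02", periods=3, freq="B", name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAA", "BBB"]], names=["Field", "Ticker"]
)
wide = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=columns)

# Move the ticker level from the columns into the rows,
# giving one row per (Date, Ticker) and one column per field.
tidy = (
    wide.stack(level="Ticker")
    .reset_index()
    .rename_axis(columns=None)
)
print(tidy)
```

The result has plain `Date`, `Ticker`, `Close`, and `Volume` columns, which is easier to filter, group, and save to CSV than the wide multi-index layout.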

I'm going to start simple and get a dataset from the DJIA (Dow Jones Industrial Average). It represents a significant portion of the market and it's a good starting point without getting too much in the weeds of big stock data with larger indexes like the S&P 500 or the Wilshire 5000.

Some obstacles to grapple with are:
- **how to handle companies which were bankrupted**. If I don't include these, then the dataset will suffer heavily from survivorship bias.
- **how to handle companies which were merged**. A merged company doesn't go out of business (and so is probably still a good investment) but **representing the change in stock price from the merge** might be difficult.

I found a list of all companies ever listed on the DJIA, but it's stored in a Wikipedia article where the changes are grouped as companies listed on the DJIA on a certain date. There is some text about which companies are moved / merged, but I'll have to parse the complete list with some programming.

There's also the issue of how far to go back for good training data. The DJIA goes back all the way to 1896, but I don't need data back that far for training. Out of convenience, let's have an 80/20 train-test split, where the training set is 40 years, and the testing set is 10 years.

I won't pull until today - that seems ridiculous. Let's set the test cutoff to be December 31, 2023. Counting back 50 full years (10 for testing plus 40 for training), the first training observation will be from January 1, 1974. I don't need to pull data from before then, and I don't need to pull companies listed on the DJIA before then.
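A small sketch of how the split boundaries could be derived from the cutoff date instead of being hard-coded. `split_boundaries` is a hypothetical helper, and it assumes the training and testing spans are counted as full calendar years ending at the cutoff:

```python
import pandas as pd


def split_boundaries(test_end, train_years, test_years):
    """Return (train_start, test_start, test_end) as timestamps.

    Assumes both spans are whole calendar years and that the test span
    ends on `test_end` inclusive.
    """
    test_end = pd.Timestamp(test_end)
    # First day of the test span: step back `test_years`, then forward one day.
    test_start = test_end - pd.DateOffset(years=test_years) + pd.Timedelta(days=1)
    # The training span ends the day before the test span starts.
    train_start = test_start - pd.DateOffset(years=train_years)
    return train_start, test_start, test_end


train_start, test_start, test_end = split_boundaries("2023-12-31", 40, 10)
print(train_start.date(), test_start.date(), test_end.date())
```

Splitting by date rather than by random rows keeps every test observation strictly later than every training observation.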

### Get company names

In [2]:
djia_hist_comps = pd.read_html('https://en.wikipedia.org/wiki/Historical_components_of_the_Dow_Jones_Industrial_Average')

In [3]:
# combine the component tables, keeping the first 10 rows of each
# (10 rows x 3 columns = 30 companies per table)
djia_hist_comps_30_only = pd.concat(
    [df.iloc[:10, :] for df in djia_hist_comps[1:25]],
    axis='index',
    ignore_index=True
)
display(djia_hist_comps_30_only.sample(8))

|     | 0 | 1 | 2 |
|-----|---|---|---|
| 170 | Allied-Signal Incorporated | Eastman Kodak Company | Minnesota Mining & Manufacturing Company |
| 104 | Bank of America Corporation | Hewlett-Packard Company | Pfizer Inc. |
| 83  | AT&T Inc. | Hewlett-Packard Company | The Procter & Gamble Company |
| 191 | Aluminum Company of America | General Motors Corporation | Philip Morris Companies Inc. ↑ |
| 238 | E.I. du Pont de Nemours & Company | Owens-Illinois, Inc. ↑ | Westinghouse Electric Corporation |
| 40  | 3M Company | General Electric Company | Nike, Inc. |
| 17  | The Coca-Cola Company | Merck & Co., Inc. | Walgreens Boots Alliance, Inc. |
| 152 | American Express Company | General Motors Corporation | Microsoft Corporation ↑ |


In [4]:
# keeps only company name (removes arrows, subscripts, and extraneous info)
def remove_name_clutter(company_name):
    return re.sub(r"(?:\([^\(\)]*\)|\[[^\[\]]*\])|[^a-zA-Z0-9\s&]", '', company_name)
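To see what the pattern actually strips, here is a quick check on a few names. The first three are taken from the sample above; `Example Corporation (EXC) [1]` is a made-up entry that only exercises the parenthesis and bracket branches. The function is repeated so the snippet runs on its own.

```python
import re


def remove_name_clutter(company_name):
    # Drop (...) and [...] groups first, then any remaining character that is
    # not a letter, digit, whitespace, or ampersand.
    return re.sub(r"(?:\([^\(\)]*\)|\[[^\[\]]*\])|[^a-zA-Z0-9\s&]", '', company_name)


examples = [
    "Philip Morris Companies Inc. ↑",
    "Owens-Illinois, Inc. ↑",
    "Nike, Inc.",
    "Example Corporation (EXC) [1]",  # made-up entry, not from the Wikipedia tables
]
cleaned = [remove_name_clutter(name).strip() for name in examples]
print(cleaned)
```

Note that hyphens are removed along with the arrows and punctuation, so `Owens-Illinois, Inc. ↑` comes out as `OwensIllinois Inc`. That is fine for deduplication, but the cleaned strings are no longer the exact official company names.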

In [5]:
# convert to list of company names
djia_hist_comps_30_only = djia_hist_comps_30_only.to_numpy().flatten().tolist()

In [6]:
# get only pure company names
djia_hist_comps_30_only = [remove_name_clutter(name).strip() for name in djia_hist_comps_30_only]

In [7]:
# get distinct company names
djia_hist_comps_30_only = set(djia_hist_comps_30_only)

In [8]:
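# manually remove known duplicate names
duplicates = [
    'UnitedHealth Group Incorporated'
]
for d in duplicates:
    # discard() does not raise if the name is already absent
    # (for example, after the Wikipedia page is edited)
    djia_hist_comps_30_only.discard(d)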

In [9]:
# convert to a sorted list
djia_hist_comps_30_only = sorted(djia_hist_comps_30_only)
djia_hist_comps_30_only[:8]

['3M Company',
 'AT&T Corporation',
 'AT&T Inc',
 'Alcoa Inc',
 'Allied Chemical Corporation',
 'AlliedSignal Incorporated',
 'Altria Group Incorporated',
 'Aluminum Company of America']

A few of these entries are probably duplicates of the same company (like AT&T Corporation and AT&T Inc) that are listed under different names because of rebranding or mergers. I'll need to look into this more.
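One possible way to surface candidates for that manual review is to group names that become identical once common legal-form words are dropped. This is only a rough sketch: the `SUFFIXES` set and both helper functions are assumptions for illustration, and a match only flags a pair for inspection rather than proving the two names are the same company.

```python
from collections import defaultdict

# Assumed list of legal-form words to ignore when comparing names.
SUFFIXES = {"company", "corporation", "incorporated", "inc", "co", "corp"}


def name_key(name):
    """Lowercase the name and drop legal-form words."""
    words = [w for w in name.lower().split() if w not in SUFFIXES]
    return " ".join(words)


def group_possible_duplicates(names):
    """Map each shared key to the names that reduce to it (groups of 2+ only)."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


sample = [
    '3M Company', 'AT&T Corporation', 'AT&T Inc', 'Alcoa Inc',
    'Allied Chemical Corporation', 'AlliedSignal Incorporated',
    'Altria Group Incorporated', 'Aluminum Company of America',
]
candidates = group_possible_duplicates(sample)
print(candidates)
```

On the eight names shown above this flags only the AT&T pair. Renames that change the core name (such as Aluminum Company of America and Alcoa Inc) are not caught and would still need a hand-written mapping.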

### Get company tickers