# Libraries

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np

import re

# First Steps

1. Get some stock market data
2. Save it in csv
3. Transform into usable format (keep as price? transform into percent change? per day?)
4. Define Neural Network Structure
5. Create Neural Network


Keeping stock prices in their standard format would allow each entry to serve as ground truth, while percentage changes are a relative system. Finding the change from day X to day Y would mean compounding every daily change in between: multiplying the growth factors (1 + change) for each day after X up to Y, then subtracting 1.
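
A minimal sketch of that compounding, using made-up prices (the numbers are an assumption for illustration, not real market data). It also shows that simply adding the daily changes gives a slightly different answer.

```python
import math

# Made-up closing prices for days X .. Y, for illustration only.
prices = [100.0, 102.0, 99.96, 104.958]

# Daily percentage changes as fractions: r_t = p_t / p_(t-1) - 1
daily_changes = [curr / prev - 1 for prev, curr in zip(prices, prices[1:])]

# Total change from day X to day Y: compound the growth factors (1 + r_t).
total_change = math.prod(1 + r for r in daily_changes) - 1

# With absolute prices the same number is a single division.
direct_change = prices[-1] / prices[0] - 1

# Adding the daily changes is only an approximation of the compounded change.
summed_change = sum(daily_changes)

print(f"compounded: {total_change:.5f}")
print(f"direct:     {direct_change:.5f}")
print(f"summed:     {summed_change:.5f}")
```

The compounded and direct values agree, which is the sense in which absolute prices act as ground truth: the relative format has to be rebuilt step by step to recover them.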

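Keeping stock prices in percentage change format would make the data less sensitive to stock splits than raw prices, though not immune: unless the prices are split-adjusted first, the split day itself shows up as one large artificial move (about -50% for a 2-for-1 split). Adjusted price already accounts for this, but I don't want to use the adjusted price because I don't understand it enough.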

## Get some stock market data

I expect the easiest way to do this is with yfinance, though I might look at some other options: yfinance isn't as well maintained as some more professional open source libraries, and it no longer connects to an official Yahoo API.

Polygon.io seems promising, but the free version only allows 5 API calls per minute (I'm not sure how much data a single call returns, but it doesn't sound like very much), and I can only access two years of historical data.

I'll probably stick with yfinance for now. It's the easiest.

I've done some work with yfinance and one of the difficulties is that it pulls data into a **multi-index dataframe**. I'll have to **transform it into something tidy**.
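
A rough sketch of what "tidy" could look like here. The frame below is hand-built with made-up values to imitate a two-level (price field, ticker) column layout; the exact level names and ordering that yfinance returns are an assumption and may differ between versions.

```python
import pandas as pd

# Hypothetical stand-in for a multi-ticker download (values are made up).
dates = pd.to_datetime(["2023-01-03", "2023-01-04"])
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAPL", "MMM"]], names=["Price", "Ticker"]
)
wide = pd.DataFrame(
    [[125.0, 122.0, 1000, 2000],
     [126.0, 123.0, 1100, 2100]],
    index=pd.Index(dates, name="Date"),
    columns=columns,
)

# Slice out one ticker at a time, then stack the slices on top of each other.
tickers = wide.columns.get_level_values("Ticker").unique()
pieces = {t: wide.xs(t, axis=1, level="Ticker") for t in tickers}

# Result: one row per (Date, Ticker), one column per price field.
tidy = (
    pd.concat(pieces)
    .rename_axis(["Ticker", "Date"])
    .reset_index()
    .sort_values(["Date", "Ticker"], ignore_index=True)
)
print(tidy)
```

Each row is now a single observation (one ticker on one date), which is easier to filter, group, and save to csv than the wide multi-index layout.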

I'm going to start simple and get a dataset from the DJIA (Dow Jones Industrial Average). It represents a significant portion of the market and it's a good starting point without getting too much in the weeds of big stock data with larger indexes like the S&P 500 or the Wilshire 5000.

Some obstacles to grapple with are:
- **how to handle companies which went bankrupt**. If I don't include these, then the dataset will suffer heavily from survivorship bias.
- **how to handle companies which were merged**. A merged company doesn't go out of business (and so is probably still a good investment) but **representing the change in stock price from the merger** might be difficult.

I found a list of all companies ever listed on the DJIA, but it's stored in a Wikipedia article where the changes are grouped as companies listed on the DJIA on a certain date. There is some text about which companies are moved / merged, but I'll have to parse the complete list with some programming.

There's also the issue of how far to go back for good training data. The DJIA goes back all the way to 1896, but I don't need data back that far for training. Out of convenience, let's have an 80/20 train-test split, where the training set is 40 years, and the testing set is 10 years.

I won't pull until today - that seems ridiculous. Let's set the test cutoff to be December 31, 2023. Counting back 50 full years, that means the first training observation will be from January 1, 1974. I don't need to pull data from before then, and I don't need to pull companies that were only listed on the DJIA before then.
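
A quick sketch of the date arithmetic, assuming both windows are counted in whole calendar years ending on the test cutoff. The helper name `window_start` is hypothetical.

```python
from datetime import date

TEST_END = date(2023, 12, 31)  # test cutoff from the text
TRAIN_YEARS = 40
TEST_YEARS = 10

def window_start(end: date, years: int) -> date:
    """First day of a window of `years` whole calendar years ending on `end` (a Dec 31)."""
    return date(end.year - years + 1, 1, 1)

test_start = window_start(TEST_END, TEST_YEARS)
train_end = date(test_start.year - 1, 12, 31)
train_start = window_start(train_end, TRAIN_YEARS)

print("train:", train_start, "to", train_end)
print("test: ", test_start, "to", TEST_END)
print("train share:", TRAIN_YEARS / (TRAIN_YEARS + TEST_YEARS))
```

Splitting by date rather than at random keeps every test observation later than every training observation, which matters for time series.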

### Get company names

In [2]:
djia_hist_comps = pd.read_html('https://en.wikipedia.org/wiki/Historical_components_of_the_Dow_Jones_Industrial_Average')

In [3]:
# combine all dataframes
# Tables 1-24 on the page each list the components on a given date.
# Keep the first 10 rows of each (3 columns x 10 rows = 30 companies) and stack them.
djia_hist_comps_30_only = pd.concat(
    [df.iloc[:10, :] for df in djia_hist_comps[1:25]],
    axis='index',
    ignore_index=True
)
display(djia_hist_comps_30_only.sample(8))

|     | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 13 | The Boeing Company | International Business Machines Corporation | The Travelers Companies, Inc. |
| 83 | AT&T Inc. | Hewlett-Packard Company | The Procter & Gamble Company |
| 140 | 3M Company † (formerly Minnesota Mining & Manu... | Eastman Kodak Company | Johnson & Johnson |
| 108 | Citigroup Inc. | Johnson & Johnson | Wal-Mart Stores, Inc. |
| 216 | E.I. du Pont de Nemours & Company | International Paper Company | United States Steel Corporation |
| 78 | E.I. du Pont de Nemours & Company | Merck & Co., Inc. | Wal-Mart Stores, Inc. |
| 220 | Allied Chemical Corporation | Exxon Corporation † (formerly Standard Oil Co.... | Owens-Illinois, Inc. |
| 39 | Exxon Mobil Corporation | Nike, Inc. | The Walt Disney Company |


In [4]:
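# keeps only company name (removes arrows, subscripts, and extraneous info):
# drops anything in (...) or [...], then every character that isn't a letter, digit, whitespace, or '&'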
def remove_name_clutter(company_name):
    return re.sub(r"(?:\([^\(\)]*\)|\[[^\[\]]*\])|[^a-zA-Z0-9\s&]", '', company_name)
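
To make the behaviour of that regex concrete, here is the same function applied to a few names. The first three appear in the sample output above; the last one is a made-up string that combines a dagger, a parenthetical, and a footnote marker.

```python
import re

# Same function as in the cell above, repeated so this example runs on its own.
def remove_name_clutter(company_name):
    return re.sub(r"(?:\([^\(\)]*\)|\[[^\[\]]*\])|[^a-zA-Z0-9\s&]", '', company_name)

samples = [
    "The Travelers Companies, Inc.",
    "E.I. du Pont de Nemours & Company",
    "Wal-Mart Stores, Inc.",
    "Example Company † (formerly Old Name) [1]",  # made-up test string
]

# .strip() removes the spaces left behind at the ends of the name.
cleaned = [remove_name_clutter(name).strip() for name in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

One side effect: hyphens, periods, and apostrophes are removed along with the clutter, which is why names like 'HewlettPackard' and 'CocaCola' show up later in the list.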

In [5]:
# convert to list of company names
djia_hist_comps_30_only = djia_hist_comps_30_only.to_numpy().flatten().tolist()

In [6]:
# get only pure company names
djia_hist_comps_30_only = [remove_name_clutter(name).strip() for name in djia_hist_comps_30_only]

In [7]:
# get distinct company names
djia_hist_comps_30_only = set(djia_hist_comps_30_only)

In [8]:
# manual removal of explicit duplicates
duplicates = [
    'UnitedHealth Group Incorporated'
]
for d in duplicates:
    djia_hist_comps_30_only.remove(d)

In [9]:
# convert to sorted list
djia_hist_comps_30_only = sorted(djia_hist_comps_30_only)
djia_hist_comps_30_only[:8]

['3M Company',
 'AT&T Corporation',
 'AT&T Inc',
 'Alcoa Inc',
 'Allied Chemical Corporation',
 'AlliedSignal Incorporated',
 'Altria Group Incorporated',
 'Aluminum Company of America']

There are a few companies which are probably duplicates of one another (like AT&T Corporation and AT&T Inc) but are listed under different names because of rebranding or mergers. I'll need to look into this more.
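
One possible way to flag candidates for review is to strip corporate suffixes and group whatever is left. This is only a sketch: the suffix list and the helper names `name_key` and `candidate_duplicates` are assumptions, and the sample is the first few names from the list above.

```python
from collections import defaultdict

# Hypothetical list of corporate suffixes to ignore; extend as needed.
SUFFIXES = {"inc", "incorporated", "corporation", "corp", "company", "co"}

def name_key(name):
    """Lower-case the name and drop trailing corporate suffixes."""
    words = name.lower().split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)

def candidate_duplicates(names):
    """Group names that share a key; return only groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

sample = [
    '3M Company', 'AT&T Corporation', 'AT&T Inc', 'Alcoa Inc',
    'Allied Chemical Corporation', 'AlliedSignal Incorporated',
]
print(candidate_duplicates(sample))
```

This only catches names that differ by suffix. Renames such as 'Aluminum Company of America' to 'Alcoa Inc' would still need to be matched by hand.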

### Get company tickers

I can probably automate some of this, but I expect that I'll have to go through quite a few names in the list manually to find the tickers.

Never mind. I've done some research on automating this and I'm not having a lot of luck, so I'll go through the list manually. It'll take me a while, but it's all I have to go on right now, and I'll force myself to find a better solution down the line when I need to pull more companies.

In [10]:
len(djia_hist_comps_30_only)

86

In [11]:
np.array(djia_hist_comps_30_only[:45])

array(['3M Company', 'AT&T Corporation', 'AT&T Inc', 'Alcoa Inc',
       'Allied Chemical Corporation', 'AlliedSignal Incorporated',
       'Altria Group Incorporated', 'Aluminum Company of America',
       'American Can Company', 'American Express Company',
       'American International Group Inc',
       'American Telephone and Telegraph Company',
       'American Tobacco Company', 'Amgen Inc',
       'Anaconda Copper Mining Company', 'Apple Inc',
       'Bank of America Corporation', 'Bethlehem Steel Corporation',
       'Caterpillar Inc', 'Chevron Corporation', 'Chrysler Corporation',
       'Cisco Systems Inc', 'Citigroup Inc', 'Dow Inc', 'DowDuPont Inc',
       'EI du Pont de Nemours & Company', 'Eastman Kodak Company',
       'Esmark Corporation', 'Exxon Corporation',
       'Exxon Mobil Corporation', 'F W Woolworth Company',
       'General Electric Company', 'General Foods Corporation',
       'General Motors Corporation', 'Goodyear Tire and Rubber Company',
       'HewlettPa

In [12]:
djia_tickers = [
    ('3M Company', 'MMM'),
    ('AT&T Inc', 'T'),
    ('Alcoa Inc', 'AA'),
    ('Honeywell International', 'HON'),
    ('Altria Group Inc', 'MO'),
    ('American Express Company', 'AXP'),
    ('American International Group Inc', 'AIG'),
    ('Amgen Inc', 'AMGN'),
    ('Apple Inc', 'AAPL'),
    ('Bank of America Corporation', 'BAC'),
    ('Caterpillar Inc', 'CAT'),
    ('Chevron Corporation', 'CVX'),
    ('Cisco Systems Inc', 'CSCO'),
    ('Citigroup Inc', 'C'),
    ('Dow Inc', 'DOW'),
    ('DuPont de Nemours Inc', 'DD'),
    ('Eastman Kodak Company', 'KODK'),
    ('Exxon Mobil Corporation', 'XOM'),
    # TODO: add the remaining companies
]
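
A small sketch of how this list could be turned into a lookup table and written out as csv (step 2 of the plan). The excerpt below stands in for the full `djia_tickers` list, `()` marks an entry that hasn't been filled in yet, and `to_csv_text` is a hypothetical helper.

```python
import csv
import io

# Small excerpt standing in for djia_tickers; () marks an unfinished entry.
djia_tickers = [
    ('3M Company', 'MMM'),
    ('AT&T Inc', 'T'),
    ('Alcoa Inc', 'AA'),
    (),
]

# Keep only completed (name, ticker) pairs.
ticker_by_name = {pair[0]: pair[1] for pair in djia_tickers if len(pair) == 2}

def to_csv_text(mapping):
    """Serialise the name -> ticker mapping as csv text, sorted by company name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["company", "ticker"])
    writer.writerows(sorted(mapping.items()))
    return buffer.getvalue()

csv_text = to_csv_text(ticker_by_name)
print(csv_text)
```

Writing to a real file instead would only need `open(path, "w", newline="")` in place of the `StringIO` buffer.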

Noted Changes:
- No separate tickers between 'AT&T Corporation' and 'AT&T Inc'. Several tickers for the company appear, but they look like separate divisions / asset classes, rather than the company's stock before / after the merge or rename or whatever.
- 'Allied Chemical Corporation' changed its name to 'AlliedSignal Incorporated', which then merged with and became 'Honeywell International'.
- 'Aluminum Company of America' renamed to 'Alcoa Inc'.
- 'American Can Company' turned into the financial conglomerate 'Primerica' (it sold off its packaging business and took the new name).
- 'American Telephone and Telegraph Company' changed its name to 'AT&T Corporation' (no surprise there).
- 'American Tobacco Company' was restructured into a holding company called 'American Brands Inc'.
- 'Anaconda Copper Mining Company' was purchased by ARCO and then BP.
- 'Bethlehem Steel Corporation' declared bankruptcy in 2001. This is exactly the kind of company I would want to include in the bot, but alas, I'm not easily finding stock info on them.
- 'Chrysler Corporation' went bankrupt and was acquired by Fiat and the US and Canadian governments (not sure what that means - I'm assuming the stock price reflected the bankruptcy).
- 'DowDuPont Inc' spun off (separated into?) DuPont and Dow Inc.
- 'EI du Pont de Nemours & Company' is presumably now 'DuPont de Nemours Inc', but how it mingled with Dow, DowDuPont, etc. is beyond me right now.
- 'Esmark Corporation' is interwoven with 'JBS USA', but the two companies are too distinct to keep them properly on this list.
- There isn't even an article for 'F W Woolworth Company', much less a ticker.
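
The renames in these notes could be recorded as a simple successor map, so that a historical name resolves to the name the ticker is listed under. A minimal sketch, using only the renames stated above; `SUCCESSOR` and `resolve_current_name` are hypothetical names.

```python
# Old name -> next name, taken from the notes above.
SUCCESSOR = {
    'Allied Chemical Corporation': 'AlliedSignal Incorporated',
    'AlliedSignal Incorporated': 'Honeywell International',
    'Aluminum Company of America': 'Alcoa Inc',
    'American Telephone and Telegraph Company': 'AT&T Corporation',
}

def resolve_current_name(name, successor=SUCCESSOR):
    """Follow the rename chain until reaching a name with no recorded successor."""
    seen = set()
    while name in successor:
        if name in seen:
            raise ValueError(f"cycle in successor map at {name!r}")
        seen.add(name)
        name = successor[name]
    return name

print(resolve_current_name('Allied Chemical Corporation'))
print(resolve_current_name('Aluminum Company of America'))
print(resolve_current_name('Apple Inc'))
```

Names with no entry are returned unchanged. Spin-offs and bankruptcies don't fit a one-to-one map like this and would need separate handling.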


A lot of the companies have merged, changed tickers, changed names, gone bankrupt, changed exchanges, etc., which affects how easily I can access their historical data.

I've done a little research and it looks like I can probably find some data via the Library of Congress (though I need to locate a public or academic library that has access to the financial records database). There are probably also newspaper databases, though I don't want to comb through those now (or at all).

I expect that I will need to account for these variables at some point, but for right now, I'm going to stick with the stocks that are easily available. It's not great for the survivorship bias (which was the whole point) but it'll be better than ignoring it entirely. Besides, this is more of an exercise in exploring how a neural network might be able to make sense of the vast interconnectedness of the stock market, rather than getting a usable stock bot immediately.

In [13]:
np.array(djia_hist_comps_30_only[45:])

array(['JPMorgan Chase & Co', 'JohnsManville Corporation',
       'Johnson & Johnson', 'Kraft Foods Inc', 'McDonalds Corporation',
       'Merck & Co Inc', 'Microsoft Corporation',
       'Minnesota Mining & Manufacturing Company',
       'Navistar International Corporation', 'Nike Inc',
       'OwensIllinois Inc', 'Pfizer Inc', 'Philip Morris Companies Inc',
       'Raytheon Technologies Corporation', 'SBC Communications Inc',
       'Salesforce Inc', 'Sears Roebuck & Company',
       'Standard Oil Co of California', 'Standard Oil Co of New Jersey',
       'Swift & Company', 'Texaco Incorporated', 'The Boeing Company',
       'The CocaCola Company', 'The Goldman Sachs Group Inc',
       'The Home Depot Inc', 'The Procter & Gamble Company',
       'The Travelers Companies Inc', 'The Walt Disney Company',
       'Travelers Inc', 'USX Corporation', 'Union Carbide Corporation',
       'United Aircraft Corporation', 'United States Steel Corporation',
       'United Technologies Corporation