# Analyzing Stock Prices
Skills: stacks/queues, array and list algorithms, hash tables

In this project, we'll work with stock market data that was downloaded from Yahoo Finance using the yahoo_finance Python package. The data consists of the daily stock prices from 2007-01-01 to 2017-04-17 for several hundred stock symbols traded on the NASDAQ stock exchange, stored in the prices folder. The download_data.py script, in the same folder as the Jupyter notebook, was used to download all of the stock price data. Each file in the prices folder is named for a specific stock symbol.

To read in and store all of the data, we'll need several layers of indices:

Layer 1 -- the stock symbol, or a numeric index representing the stock symbol.
Layer 2 -- the rows in a stock symbol CSV file.
Layer 3 -- the column names in a stock symbol CSV file.

My choice of data structures for layers 1, 2, and 3 is:
hash table, list, list.

I chose a hash table for layer 1 because there is a fixed and limited set of distinct keys (the stock symbols), so we can easily hash the symbols provided. Layers 2 and 3 are lists because the rows and the columns of each file come in a fixed order.
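
As a minimal sketch of that layout, the three layers nest as shown below. The symbol `xyz` and its prices are invented for illustration; only the column order is taken from the files.

```python
# Hypothetical miniature of the structure: dict -> list of rows -> list of fields.
prices = {
    "xyz": [  # layer 1: hash table keyed by stock symbol
        ["date", "close", "open", "high", "low", "volume"],         # header row
        ["2007-01-03", "10.50", "10.00", "10.75", "9.90", "1200"],  # layer 2: one row per day
    ],
}

header = prices["xyz"][0]
first_row = prices["xyz"][1]
close_index = header.index("close")  # layer 3: position of a column within a row
first_close = first_row[close_index]
print(first_close)
```

Looking up a symbol is a single dictionary access, while rows and columns are reached by position.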

First, we'll use multiple threads to read the data into the data structure we chose. But before that, let's look at what is in our directory: how many files there are, and what a file contains.

In [1]:
# First, see what kinds of files are in our directory
import os
files = os.listdir('prices')
files

['aal.csv',
 'aame.csv',
 'aaon.csv',
 'aapl.csv',
 'aaww.csv',
 'aaxn.csv',
 'abax.csv',
 'abcb.csv',
 'abco.csv',
 'abeo.csv',
 'abio.csv',
 'abmd.csv',
 'abtl.csv',
 'acad.csv',
 'acet.csv',
 'acfc.csv',
 'acgl.csv',
 'achc.csv',
 'achn.csv',
 'aciw.csv',
 'acls.csv',
 'acnb.csv',
 'acor.csv',
 'acta.csv',
 'actg.csv',
 'acxm.csv',
 'adbe.csv',
 'adi.csv',
 'admp.csv',
 'adp.csv',
 'adra.csv',
 'adrd.csv',
 'adre.csv',
 'adru.csv',
 'adsk.csv',
 'adtn.csv',
 'adxs.csv',
 'aegn.csv',
 'aehr.csv',
 'aeis.csv',
 'aemd.csv',
 'aey.csv',
 'aezs.csv',
 'afam.csv',
 'afsi.csv',
 'agen.csv',
 'agii.csv',
 'agys.csv',
 'ahgp.csv',
 'ahpi.csv',
 'aimc.csv',
 'ainv.csv',
 'aiq.csv',
 'airm.csv',
 'airt.csv',
 'akam.csv',
 'akrx.csv',
 'alco.csv',
 'algn.csv',
 'algt.csv',
 'alks.csv',
 'allt.csv',
 'alny.csv',
 'alog.csv',
 'alot.csv',
 'alqa.csv',
 'alsk.csv',
 'alxn.csv',
 'amag.csv',
 'amat.csv',
 'amd.csv',
 'amed.csv',
 'amgn.csv',
 'amkr.csv',
 'amnb.csv',
 'amot.csv',
 'amrb.csv',
 'amr

In [2]:
# How many files are in the directory
len(files)

560

In [3]:
# Sample one file to see what is inside

with open('prices/{}'.format(files[280])) as file:
    data = file.read().strip()
    value = data.split("\n")
    value = [v.split(",") for v in value]
    print(value)

[['date', 'close', 'open', 'high', 'low', 'volume'], ['2007-01-03', '13.70', '13.10', '14.00', '13.10', '6900'], ['2007-01-04', '13.61', '13.61', '13.61', '13.61', '100'], ['2007-01-05', '13.65', '13.65', '13.65', '13.65', '100'], ['2007-01-08', '13.60', '13.60', '13.60', '13.60', '100'], ['2007-01-09', '13.60', '13.60', '13.60', '13.60', '000'], ['2007-01-10', '13.60', '13.60', '13.60', '13.60', '000'], ['2007-01-11', '13.65', '13.56', '13.65', '13.56', '2200'], ['2007-01-12', '13.79', '13.65', '13.79', '13.51', '2800'], ['2007-01-16', '13.79', '13.79', '13.79', '13.79', '000'], ['2007-01-17', '13.51', '13.51', '13.51', '13.51', '2400'], ['2007-01-18', '13.51', '13.51', '13.51', '13.51', '000'], ['2007-01-19', '13.51', '13.51', '13.51', '13.51', '000'], ['2007-01-22', '13.80', '13.80', '14.00', '13.74', '700'], ['2007-01-23', '13.80', '13.80', '13.80', '13.80', '000'], ['2007-01-24', '14.06', '14.00', '14.06', '14.00', '2000'], ['2007-01-25', '14.06', '14.06', '14.06', '14.06', '000']

Now that we have some idea of what the files contain, we'll read all of them using multithreading.

In [4]:
import concurrent.futures as cf

def read_data(filename):
    """Read one CSV file and return (symbol, rows), where rows is a list of lists."""
    with open('prices/{}'.format(filename)) as file:
        data = file.read().strip()
    key = filename.replace(".csv", "")
    value = [line.split(",") for line in data.split("\n")]
    return key, value

# Read the files in worker threads; the with block shuts the pool down afterwards
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    list_content = list(pool.map(read_data, files))
content = dict(list_content)

In [5]:
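# Reshape each stock from a list of rows into a dictionary of columns:
# transformed[symbol][column_name] -> list of that column's values
transformed = {}

for key1 in content.keys():
    transformed[key1] = {}
    rows = content[key1][1:]  # skip the header row
    for counter, key2 in enumerate(content[key1][0]):
        transformed[key1][key2] = [row[counter] for row in rows]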
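
The reshaping in the cell above turns row-oriented data into column-oriented data. As a small sketch of the same idea on invented rows (not the notebook's actual data), `zip` can do the regrouping in one step:

```python
# Illustrative rows in the same shape as one entry of `content` (values are invented).
rows = [
    ["date", "close", "volume"],
    ["2007-01-03", "10.50", "1200"],
    ["2007-01-04", "10.70", "800"],
]

header, *records = rows
# zip(*records) regroups the i-th field of every row into one tuple per column.
columns = {name: list(values) for name, values in zip(header, zip(*records))}
print(columns["volume"])
```

With the data in this shape, aggregating one column no longer requires walking every field of every row.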

After the data is transformed, we can easily do some aggregation. For example, the average volume for each stock.

In [10]:
# Average volume for each company
volume_avg = {}
for company in transformed.keys():
    volumes = [int(i) for i in transformed[company]['volume']]
    volume_avg[company] = sum(volumes)/len(volumes)
    
volume_avg

{'aal': 8469080.501930501,
 'aame': 6318.918918918919,
 'aaon': 211263.861003861,
 'aapl': 130112422.35521236,
 'aaww': 287414.78764478763,
 'aaxn': 1259518.9961389962,
 'abax': 194647.83783783784,
 'abcb': 92016.02316602317,
 'abco': 276554.0540540541,
 'abeo': 142349.69111969112,
 'abio': 25513.629343629345,
 'abmd': 401411.9305019305,
 'abtl': 70726.21621621621,
 'acad': 1265892.5482625482,
 'acet': 162641.66023166024,
 'acfc': 11198.764478764479,
 'acgl': 839836.833976834,
 'achc': 297078.5714285714,
 'achn': 1428632.1621621621,
 'aciw': 803301.2355212355,
 'acls': 208106.67953667953,
 'acnb': 3627.4517374517372,
 'acor': 603703.9768339768,
 'acta': 169822.47104247104,
 'actg': 417181.4285714286,
 'acxm': 550136.6795366795,
 'adbe': 5341678.532818533,
 'adi': 3337189.034749035,
 'admp': 67102.54826254827,
 'adp': 2804884.092664093,
 'adra': 10941.505791505791,
 'adrd': 17613.320463320462,
 'adre': 131657.83783783784,
 'adru': 6654.980694980695,
 'adsk': 3142473.7837837837,
 'adtn':

In [12]:
# Company with the biggest average volume
max(volume_avg, key=volume_avg.get)

'aapl'

It can be seen that Apple Inc. (aapl) has the biggest average volume.

Now that we've computed some aggregates, we can work on finding the most traded stock each day.

In [27]:
# For every trading day (taken from aapl's dates), collect each company's volume
transformed2 = {}
for day in transformed['aapl']['date']:
    transformed2[day] = {}
    for company in content.keys():
        dates = transformed[company]['date']
        if day not in dates:
            # No data for this company on this day: count it as zero volume
            transformed2[day][company] = 0
        else:
            date_index = dates.index(day)
            transformed2[day][company] = int(transformed[company]['volume'][date_index])

# Keep only the company with the highest volume for each day
for day in transformed2:
    transformed2[day] = max(transformed2[day], key=transformed2[day].get)

'aapl'

In [28]:
transformed2

{'2007-01-03': 'aapl',
 '2007-01-04': 'aapl',
 '2007-01-05': 'aapl',
 '2007-01-08': 'aapl',
 '2007-01-09': 'aapl',
 '2007-01-10': 'aapl',
 '2007-01-11': 'aapl',
 '2007-01-12': 'aapl',
 '2007-01-16': 'aapl',
 '2007-01-17': 'aapl',
 '2007-01-18': 'aapl',
 '2007-01-19': 'aapl',
 '2007-01-22': 'aapl',
 '2007-01-23': 'aapl',
 '2007-01-24': 'aapl',
 '2007-01-25': 'aapl',
 '2007-01-26': 'aapl',
 '2007-01-29': 'aapl',
 '2007-01-30': 'aapl',
 '2007-01-31': 'aapl',
 '2007-02-01': 'aapl',
 '2007-02-02': 'aapl',
 '2007-02-05': 'aapl',
 '2007-02-06': 'aapl',
 '2007-02-07': 'aapl',
 '2007-02-08': 'aapl',
 '2007-02-09': 'aapl',
 '2007-02-12': 'aapl',
 '2007-02-13': 'aapl',
 '2007-02-14': 'aapl',
 '2007-02-15': 'bidu',
 '2007-02-16': 'aapl',
 '2007-02-20': 'aapl',
 '2007-02-21': 'aapl',
 '2007-02-22': 'aapl',
 '2007-02-23': 'aapl',
 '2007-02-26': 'aapl',
 '2007-02-27': 'aapl',
 '2007-02-28': 'aapl',
 '2007-03-01': 'aapl',
 '2007-03-02': 'aapl',
 '2007-03-05': 'aapl',
 '2007-03-06': 'aapl',
 '2007-03-0
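
Because `dates.index(day)` rescans a list for every day and company pair, the approach in In [27] does more work than strictly needed. One possible alternative, sketched here on invented data shaped like `transformed`, makes a single pass and keeps a running maximum per day. Unlike the notebook cell, this sketch covers every date that appears in any file rather than only aapl's dates.

```python
# Invented column-oriented data in the same shape as `transformed`.
sample = {
    "aaa": {"date": ["2007-01-03", "2007-01-04"], "volume": ["500", "900"]},
    "bbb": {"date": ["2007-01-03", "2007-01-04"], "volume": ["700", "100"]},
    "ccc": {"date": ["2007-01-04"], "volume": ["950"]},
}

most_traded = {}  # date -> (symbol, volume) of the highest volume seen so far
for symbol, columns in sample.items():
    for day, volume in zip(columns["date"], columns["volume"]):
        volume = int(volume)
        if day not in most_traded or volume > most_traded[day][1]:
            most_traded[day] = (symbol, volume)

print(most_traded)
```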

Finally, we will find the most profitable stock. This can be done by:
- Subtracting the initial price from the final price, then computing a percentage relative to the initial price. This will tell us how much our initial investment would have grown or shrunk.
- Sorting all of the percentages.
- Finding the stock that grew the most in the time period.
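
The steps above can be sketched on a few invented prices. The symbols and numbers are hypothetical, and `percent_change` is a helper name introduced only for this illustration:

```python
def percent_change(start_price, end_price):
    """Return the gain or loss relative to start_price, in percent."""
    return (end_price - start_price) / start_price * 100

# Invented prices for three hypothetical symbols: (first price, last price).
sample = {"aaa": (10.0, 15.0), "bbb": (20.0, 10.0), "ccc": (4.0, 12.0)}
changes = {symbol: percent_change(first, last) for symbol, (first, last) in sample.items()}

# Sort from largest gain to largest loss; the first entry is the best performer.
ranked = sorted(changes.items(), key=lambda item: item[1], reverse=True)
best_symbol, best_change = ranked[0]
print(best_symbol, best_change)
```

A stock that goes from 4.0 to 12.0 gains 200 percent, while one that halves from 20.0 to 10.0 loses 50 percent.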

In [30]:
# Percentage change for each company
transformed3 = {}
for company in content.keys():
    start_price = float(content[company][1][2])   # opening price on the first day
    end_price = float(content[company][-1][1])    # closing price on the last day
    percentage = ((end_price - start_price) / start_price) * 100
    transformed3[company] = percentage
    
transformed3

{'aal': -17.92540207692341,
 'aame': 24.193548387096772,
 'aaon': 30.65607672516539,
 'aapl': 64.36435698649157,
 'aaww': 22.355824983299943,
 'aaxn': 177.73584905660377,
 'abax': 146.03419668044603,
 'abcb': 54.8371917751045,
 'abco': -12.861915128769205,
 'abeo': 71.87500000000001,
 'abio': -41.5188961156602,
 'abmd': 749.0154922644163,
 'abtl': 237.71428571428572,
 'acad': 275.36556805399323,
 'acet': 71.97231833910035,
 'acfc': -56.96969696969697,
 'acgl': 40.21610656873732,
 'achc': 1330.0000666666667,
 'achn': -77.90626902008522,
 'aciw': -35.95574480503211,
 'acls': 211.54499151103568,
 'acnb': 48.73740883584017,
 'acor': 5.113636363636355,
 'acta': 36.54033041788146,
 'actg': -62.313432835820905,
 'acxm': 6.001562743569759,
 'adbe': 219.22888459654013,
 'adi': 131.97691770422108,
 'admp': 7483.8389225948395,
 'adp': 109.15846940765832,
 'adra': -12.503754132852427,
 'adrd': -30.070395574924568,
 'adre': -8.695649675162434,
 'adru': -32.95905439369041,
 'adsk': 111.6416339738706

In [32]:
# Now, we can find the most profitable company in the period

profitable = max(transformed3, key=transformed3.get)
print(profitable)
print(transformed3[profitable])

admp
7483.8389225948395


It can be seen that the most profitable company to invest in over this period was Adamis Pharmaceuticals Corporation (admp), with an increase in value of around 7,500 percent.