## Analysing Stock Prices

In this guided project, we are working with stock market data that was downloaded from [Yahoo Finance](https://finance.yahoo.com/) using the `yahoo_finance` Python package. 

This data consists of daily stock prices from 2007-01-01 to 2017-04-17 for several hundred stock symbols traded on the NASDAQ stock exchange, stored in the `prices` folder. The `download_data.py` script, located in the same folder as the Jupyter notebook, was used to download all of the stock price data.

In [14]:
import concurrent.futures
import os

def read_file(filename):
    # Read one CSV file and return (symbol, rows); each row is a list of strings.
    with open(filename, 'r') as f:
        data = f.read().strip()
    key = filename.replace(".csv", "").replace("prices/", "")
    data = data.split("\n")
    data = [d.split(",") for d in data]
    return key, data

filenames = ["prices/{}".format(f) for f in os.listdir("prices")]
# The context manager shuts the worker processes down once every file is read.
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
    prices = dict(pool.map(read_file, filenames))

In [66]:
import time

# printing sample from data
start = time.time()
for k in list(prices)[:3]:
    print('\n Stock: {} \n'.format(k.upper()))
    for line in prices[k][:5]:
        print(line)

end = time.time()
print('\nTime taken to access and print data: {}'.format(end-start))


 Stock: CSCO 

['date', 'close', 'open', 'high', 'low', 'volume']
['2007-01-03', '27.73', '27.459999', '27.98', '27.33', '64226000']
['2007-01-04', '28.459999', '27.68', '28.49', '27.540001', '73012100']
['2007-01-05', '28.469999', '28.440001', '28.57', '28.049999', '62647800']
['2007-01-08', '28.629999', '28.540001', '28.74', '28.32', '47936500']

 Stock: BIOS 

['date', 'close', 'open', 'high', 'low', 'volume']
['2007-01-03', '3.41', '3.49', '3.49', '3.37', '91300']
['2007-01-04', '3.42', '3.43', '3.48', '3.41', '91900']
['2007-01-05', '3.46', '3.44', '3.53', '3.42', '126700']
['2007-01-08', '3.44', '3.48', '3.55', '3.44', '166300']

 Stock: CSBK 

['date', 'close', 'open', 'high', 'low', 'volume']
['2007-01-03', '12.200004', '12.200004', '12.239996', '12.140001', '23300']
['2007-01-04', '12.200004', '12.169998', '12.239996', '12.059997', '25400']
['2007-01-05', '12.129996', '12.129996', '12.189998', '12.12', '27800']
['2007-01-08', '12.180003', '12.12', '12.21', '12.070002', '46600']

The data is initially stored as hash (dict) -> array (list) -> array (list), because this matches the format of the files: each stock symbol maps to a list of rows, and each row is a list of string values.
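
To make this layout concrete, here is a minimal sketch using two rows copied from the CSCO sample output above. The variable name `sample` is hypothetical and only used for illustration. Because each row is a list of strings, pulling out a single column means walking every row and converting the values.

```python
# Toy version of the initial layout: dict -> list of rows -> list of strings.
# The rows are copied from the CSCO sample output; `sample` is a made-up name.
sample = {
    "csco": [
        ["date", "close", "open", "high", "low", "volume"],
        ["2007-01-03", "27.73", "27.459999", "27.98", "27.33", "64226000"],
        ["2007-01-04", "28.459999", "27.68", "28.49", "27.540001", "73012100"],
    ]
}

header = sample["csco"][0]
close_idx = header.index("close")

# Values are still strings here, so they are converted before any arithmetic.
closes = [float(row[close_idx]) for row in sample["csco"][1:]]
print(closes)
```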

## Computing aggregates

Now that we've read in the data, we can use it to compute aggregates. For example, we can find:

- The average closing price of all stocks over the time period.
- The average volume for each stock.
- The average difference between the opening price and closing price for each stock.
- The average difference between the high and low for each stock.

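We start by changing the format of the data to hash (dict) -> hash (dict) -> array (list), so that each stock maps to a dictionary of columns. Dates are parsed into `datetime` objects and all other values are converted to floats, which makes the data easier to aggregate.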

In [16]:
from dateutil.parser import parse

prices_dict = {}
for stock, data in prices.items():
    headers = data[0]
    stock_dict = {}
    for i, h in enumerate(headers):
        if i == 0:
            stock_dict[h] = [parse(line[i]) for line in data[1:]]
        else:
            stock_dict[h] = [float(line[i]) for line in data[1:]]
            
    prices_dict[stock] = stock_dict

### Average closing price

In [75]:
from statistics import mean

average_closing = {}
for stock,data in prices_dict.items():
    average_closing[stock] = mean(data["close"])
    
closing_tuples = [(k,v) for k,v in average_closing.items()]
ordered_closing = sorted(closing_tuples, key=lambda x:x[1])

print('Highest priced stocks on average at closing:\n')
print('STOCK | AVG CLOSING PRICE')
print(ordered_closing[-1][0].upper(), '|', ordered_closing[-1][1])
print(ordered_closing[-2][0].upper(), '|', ordered_closing[-2][1])
print('\nLowest priced stocks on average at closing:\n')
print('STOCK | AVG CLOSING PRICE')
print(ordered_closing[0][0].upper(), '|', ordered_closing[0][1])
print(ordered_closing[1][0].upper(), '|', ordered_closing[1][1])

Highest priced stocks on average at closing:

STOCK | AVG CLOSING PRICE
AMZN | 275.1340775710425
AAPL | 257.1765404023166

Lowest priced stocks on average at closing:

STOCK | AVG CLOSING PRICE
BLFS | 0.8122763011583012
APDN | 0.8241009938223939


AMZN and AAPL have the highest average closing prices, while BLFS and APDN have the lowest.
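
Sorting every stock works fine at this scale, but when only a few extremes are needed, the standard library's `heapq.nlargest` and `heapq.nsmallest` can pick them out without a full sort. The sketch below is an alternative to the approach above, not a replacement for it. The dictionary `sample_averages` is hypothetical and holds rounded or made-up values rather than the real results.

```python
import heapq

# Hypothetical, rounded averages used only to illustrate the calls.
sample_averages = {
    "amzn": 275.13,
    "aapl": 257.18,
    "csco": 22.5,
    "blfs": 0.81,
    "apdn": 0.82,
}

# Each call returns (symbol, average) pairs ordered by the average.
highest = heapq.nlargest(2, sample_averages.items(), key=lambda kv: kv[1])
lowest = heapq.nsmallest(2, sample_averages.items(), key=lambda kv: kv[1])

print(highest)
print(lowest)
```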

### Average daily volume

In [76]:
average_volume = {}
for stock,data in prices_dict.items():
    average_volume[stock] = mean(data["volume"])
    
volume_tuples = [(k,v) for k,v in average_volume.items()]
ordered_volume = sorted(volume_tuples, key=lambda x:x[1])

print('Most traded stocks on average:\n')
print('STOCK | AVG DAILY VOLUME')
print(ordered_volume[-1][0].upper(), '|', ordered_volume[-1][1])
print(ordered_volume[-2][0].upper(), '|', ordered_volume[-2][1])

print('\nLeast traded stocks on average:\n')
print('STOCK | AVG DAILY VOLUME')
print(ordered_volume[0][0].upper(), '|', ordered_volume[0][1])
print(ordered_volume[1][0].upper(), '|', ordered_volume[1][1])

Most traded stocks on average:

STOCK | AVG DAILY VOLUME
AAPL | 130112422.35521236
CSCO | 45224781.428571425

Least traded stocks on average:

STOCK | AVG DAILY VOLUME
DGICB | 509.72972972972974
EMCF | 637.3745173745174


AAPL and CSCO have the highest average trading volume, while DGICB and EMCF have the lowest.

### Average daily price range

In [73]:
open_close = {}
for stock,data in prices_dict.items():
    day_diff = []
    max_range = []
    for i in range(len(data['close'])):
        day_diff.append(abs(data['close'][i] - data['open'][i]))
        max_range.append(abs(data['high'][i] - data['low'][i]))
    open_close[stock] = [mean(day_diff), mean(max_range)]
    
    
range_tuples = [(k,v[0]) for k,v in open_close.items()]
ordered_range = sorted(range_tuples, key=lambda x:x[1])

print('Stocks with widest open-close price range on average:\n')
print('STOCK | AVG OPEN-CLOSE PRICE RANGE')
print(ordered_range[-1][0].upper(), '|', ordered_range[-1][1])
print(ordered_range[-2][0].upper(), '|', ordered_range[-2][1])
print('\nStocks with smallest open-close price range on average:\n')
print('STOCK | AVG OPEN-CLOSE PRICE RANGE')
print(ordered_range[0][0].upper(), '|', ordered_range[0][1])
print(ordered_range[1][0].upper(), '|', ordered_range[1][1])

Stocks with widest open-close price range on average:

STOCK | AVG OPEN-CLOSE PRICE RANGE
CME | 3.6001273864864864
BIDU | 3.5598418084942085

Stocks with smallest open-close price range on average:

STOCK | AVG OPEN-CLOSE PRICE RANGE
EQFN | 0.02162548301158301
BMRA | 0.025084942084942084


In [74]:
range_tuples = [(k,v[1]) for k,v in open_close.items()]
ordered_range = sorted(range_tuples, key=lambda x:x[1])

print('Most volatile stocks based on highest average min-max daily range:\n')
print('STOCK | AVG HIGH-LOW PRICE RANGE')
print(ordered_range[-1][0].upper(), '|', ordered_range[-1][1])
print(ordered_range[-2][0].upper(), '|', ordered_range[-2][1])
print('\nMost stable stocks based on lowest average min-max daily range:\n')
print('STOCK | AVG HIGH-LOW PRICE RANGE')
print(ordered_range[0][0].upper(), '|', ordered_range[0][1])
print(ordered_range[1][0].upper(), '|', ordered_range[1][1])

Most volatile stocks based on highest average min-max daily range:

STOCK | AVG HIGH-LOW PRICE RANGE
CME | 7.094718135521235
BIDU | 7.071447906563706

Most stable stocks based on lowest average min-max daily range:

STOCK | AVG HIGH-LOW PRICE RANGE
EQFN | 0.029586873359073357
BMRA | 0.042243243243243245


CME and BIDU have the largest average high-low daily price range, as well as the widest average open-close price difference. EQFN and BMRA have the smallest values for both measures.

### Most traded stock daily

In [61]:
trades = {}
for stock,data in prices_dict.items():
    dates = data['date']
    for i, d in enumerate(dates):
        if d in trades:
            trades[d].append((stock, data['volume'][i]))
        else:
            trades[d] = [(stock, data['volume'][i])]

most_traded = []
for date,data in trades.items():
    most_traded_stock = sorted(data, key=lambda x:x[1])[-1][0]
    most_traded.append([date, most_traded_stock])
    
# print('Most traded stock between 3/1/2007 and 9/1/2007')
print('DATE | STOCK')
for i in range(5):
    print(most_traded[i][0].strftime('%d, %b %Y'), '|', most_traded[i][1].upper())

DATE | STOCK
03, Jan 2007 | AAPL
04, Jan 2007 | AAPL
05, Jan 2007 | AAPL
08, Jan 2007 | AAPL
09, Jan 2007 | AAPL


## Searching for high volume days

Let's say we want to search a list for the transactions on a specific date. We can use either a linear or a binary search for this. Because each stock's dates are already sorted, binary search is an option, and it will be faster when we need to do repeated searches.
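
As a rough illustration of the two approaches, the sketch below looks up a date both ways. For the binary search it uses the standard library's `bisect` module, which is an alternative to writing the search by hand. The helper names and the sample dates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

def linear_search(sorted_dates, target):
    # Checks each element in turn: O(n) comparisons in the worst case.
    for idx, value in enumerate(sorted_dates):
        if value == target:
            return idx
    return None

def bisect_search(sorted_dates, target):
    # bisect_left needs sorted input and returns an insertion point,
    # so we confirm the value is actually present before returning it.
    idx = bisect_left(sorted_dates, target)
    if idx < len(sorted_dates) and sorted_dates[idx] == target:
        return idx
    return None

# Hypothetical sample: the first four trading days in the data.
dates = [
    datetime(2007, 1, 3),
    datetime(2007, 1, 4),
    datetime(2007, 1, 5),
    datetime(2007, 1, 8),
]

print(linear_search(dates, datetime(2007, 1, 5)))
print(bisect_search(dates, datetime(2007, 1, 5)))
# 2007-01-06 was not a trading day, so the search finds nothing.
print(bisect_search(dates, datetime(2007, 1, 6)))
```

Both functions return the same index for a date that exists; the binary version needs roughly log2(n) comparisons instead of up to n.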

Let's search for all transactions on days with unusually high volume. In order to do this, we'll need to:

- Compute total volume of trading for each day
- Sort and find the 10 highest volume days overall
- Find all prices for all stocks on each of the high volume days

In [57]:
daily_volumes = {}
for date, trades_vol in trades.items():
    daily_volumes[date] = sum([item[1] for item in trades_vol])
            
volume_tuples = [(k,v) for k,v in daily_volumes.items()]
daily_volumes = sorted(volume_tuples, key=lambda x:x[1])

    
print('Days with highest volume traded:\n')
print('DATE | TOTAL VOLUME')
for i in range(-1, -11, -1):
    print(daily_volumes[i][0].strftime('%d, %b %Y'), '|', daily_volumes[i][1])

Days with highest volume traded:

DATE | TOTAL VOLUME
23, Jan 2008 | 1964583900.0
10, Oct 2008 | 1770266900.0
26, Jul 2007 | 1611272800.0
08, Oct 2008 | 1599183500.0
22, Jan 2008 | 1578877700.0
07, Feb 2008 | 1559032100.0
29, Sep 2008 | 1555072400.0
08, Nov 2007 | 1553880500.0
16, Jan 2008 | 1536176400.0
24, Jan 2008 | 1533363200.0


In [63]:
high_volume_days = [v[0] for v in daily_volumes[-10:]]

def binary_search(array, search):
    # Return the index of `search` in the sorted `array`, or None if it is absent.
    low = 0
    high = len(array) - 1
    while low <= high:
        mid = low + (high - low) // 2
        if array[mid] == search:
            return mid
        elif array[mid] < search:
            low = mid + 1
        else:
            high = mid - 1
    return None

high_volume_transactions = {}
for stock, data in prices_dict.items():
    for day in high_volume_days:
        date_idx = binary_search(data["date"], day)
        if date_idx is None:
            # This stock has no row for this day.
            continue
        if stock not in high_volume_transactions:
            high_volume_transactions[stock] = []
        # Row 0 of prices[stock] is the header, so data rows are offset by one.
        high_volume_transactions[stock].append(prices[stock][date_idx + 1])

## Finding profitable stocks

Now that we've done some basic analysis, let's see which stocks would have been the most profitable to buy on 2007-01-03. We can do this by:

- Subtracting the initial price from the final price, then computing a percentage relative to the initial price. This will tell us how much our initial investment would have grown or shrunk.
- Sorting all of the percentages.
- Finding the stock that grew the most in the time period.
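
The calculation in the first step can be sketched as a small helper. The function name `percent_change` and the sample prices are made up for illustration.

```python
def percent_change(initial, final):
    # (final - initial) / initial, expressed as a percentage of the initial price.
    return (final - initial) / initial * 100

# Hypothetical prices: a stock bought at 10.0 that ends at 25.0 grew by 150%.
print(percent_change(10.0, 25.0))
# A stock that falls from 8.0 to 2.0 shrank by 75%.
print(percent_change(8.0, 2.0))
```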

In [65]:
profits = []
for stock, data in prices_dict.items():
    percentage = (data["close"][-1] - data["close"][0]) / data["close"][0]
    profits.append([stock,percentage * 100])

profits = sorted(profits, key=lambda x: x[1])

profits[-10:]

[['achc', 1330.0000666666667],
 ['bcli', 1339.2137535980346],
 ['cui', 1525.1625162516252],
 ['apdn', 1549.6700659868025],
 ['anip', 1707.3554472785033],
 ['amzn', 2230.7234281466817],
 ['blfs', 2437.4365640858978],
 ['arcw', 3898.60048982856],
 ['adxs', 4005.0000000000005],
 ['admp', 7483.8389225948395]]

The most profitable stock to buy at the start of 2007 would have been ADMP, which appreciated from around 7 cents to $4.43 at the end of the period.

## Next steps

We've done some basic analysis of the data, but there's still quite a bit more depth to go into:

- What stocks would have been best to short at the start of the period?
- Which stocks have the most after-hours trading, and show the biggest changes between the closing price and the next day open?
- Can technical indicators like Bollinger Bands help us forecast the market?
- What time periods have resulted in steady increases in prices, and what periods have resulted in steady declines?
- Based on price, what was the optimal day to buy each stock if we wanted to hold them until now?
- On days with high trading volume, do stocks move in one direction (up or down) more than the other one?
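
As a starting point for the after-hours question, one possible measure is the average absolute gap between each day's close and the next day's open. This is only a sketch under stated assumptions: the function name `average_overnight_gap` is hypothetical, and it assumes a per-stock dictionary with `close` and `open` lists in date order, as in `prices_dict`. The sample values are rounded from the CSCO rows shown earlier.

```python
from statistics import mean

def average_overnight_gap(stock_data):
    # Pair each day's close with the following day's open.
    # Needs at least two days of data, otherwise mean() raises StatisticsError.
    closes = stock_data["close"][:-1]
    next_opens = stock_data["open"][1:]
    return mean(abs(o - c) for c, o in zip(closes, next_opens))

# Rounded from the CSCO sample rows; real data would come from prices_dict.
sample_stock = {
    "close": [27.73, 28.46, 28.47],
    "open": [27.46, 27.68, 28.44],
}

print(average_overnight_gap(sample_stock))
```

Applied to every stock, the results could be ranked in the same way as the other aggregates above.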