In [5]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import networkx as nx
import csv
import matplotlib.pyplot as plt
import pylab
%matplotlib inline
from IPython.display import display, HTML
pylab.rcParams['figure.figsize'] = (10, 6)

In [None]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

# Stock Market Clustering

First, we will import our data, which consists of the following two *.csv* files:

* **SP_500_firms.csv:** This file contains all firms currently included in the S&P 500 index.
* **SP_500_close_2015.csv:** This file contains daily stock price data of the firms listed in the previous file for 2015 (without some firms for which data was not available for the entire year).

After importing the data, we inspect the first observations for each dataframe.

In [None]:
# Load companies information
firms = pd.read_csv('SP_500_firms.csv', index_col = 0)
# Load companies stock prices
stockPrices = pd.read_csv('SP_500_close_2015.csv', index_col = 0)

In [None]:
firms.iloc[:5, :]

In [None]:
stockPrices.iloc[:5, :5]

## The main part 

### 1. Stock returns

#### Daily returns

In order to analyse how similarly those companies perform, we will look at their daily stock price movements. We will assume that companies with similar day-to-day stock price movements perform similarly. To calculate the daily returns for all stocks in the data, we define the function `stockReturns`, which takes as input the dataframe of the stock prices and returns a dataframe with the daily returns of those stocks. The daily returns are calculated based on the following formula, where $p_t$ is the closing price on day $t$:

$$x_t = \frac{p_t - p_{t-1}}{p_{t-1}}$$

In [None]:
def stockReturns(priceDF):
    
    compTickers = priceDF.columns[0: ]    
    # as_matrix() was removed from pandas; to_numpy() returns the same array
    priceMat = priceDF.loc[ : , compTickers].to_numpy()    
    diffMat = (priceMat[1: ] - priceMat[ :-1]) / priceMat[ :-1]
    
    return pd.DataFrame(data = diffMat, index = priceDF.index[1: ], \
                        columns = compTickers)    

In [None]:
dailyReturns = stockReturns(stockPrices)
dailyReturns.iloc[:5, :5]
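
As a quick sanity check of the return formula, the following minimal sketch applies it to a short list of made-up prices (the numbers and the helper name `simple_returns` are illustrative assumptions, not part of the S&P 500 data). pandas' built-in `DataFrame.pct_change()` computes the same quantity and could be used as an alternative to `stockReturns`.

```python
# Toy closing prices for one hypothetical stock (not taken from the data files)
prices = [100.0, 110.0, 99.0, 99.0]

def simple_returns(prices):
    """Return the day-over-day returns (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev
            for prev, curr in zip(prices[:-1], prices[1:])]

returns = simple_returns(prices)
print(returns)
```

Note that a series of $n$ prices yields $n - 1$ returns, which is why `stockReturns` drops the first date from the index.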

The daily returns of the stocks can be shown in the following figure:

In [None]:
dailyReturns.plot(legend = False);

Next, we will find which companies experienced the **maximum** and **minimum** daily returns,  and what potential evidence there may be for those extreme returns.

To do so, we will merge (melt) the daily returns of all companies into one column and then sort this column in descending order.

In [None]:
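# Work on a copy so that dailyReturns itself keeps only numeric columns
# (it is reused later to compute volatilities)
dailyReturnsWithDate = dailyReturns.copy()
dailyReturnsWithDate['Date'] = dailyReturnsWithDate.index
# Melt data frame so each row is one price change observation
dailyReturnsMelted = pd.melt(dailyReturnsWithDate, id_vars = "Date")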
dailyReturnsMelted = dailyReturnsMelted.rename(columns = {
    'variable':'Symbol',
    'value':'Price Change'
})
# Sort melted dataframe in descending order of price change
dailyReturnsSorted = dailyReturnsMelted.sort_values(by = 'Price Change', \
                                                    ascending = False)
# Merge on firm data with the symbol as the key
dailyReturnsSorted = dailyReturnsSorted.merge(firms, left_on = 'Symbol', \
                                              right_index = True, how = 'left')

The **maximum** daily returns for 2015 were the following:

In [None]:
dailyReturnsSorted.iloc[:10, :]

Potential evidence for some of the **maximum** daily returns presented above can be found on the following links:

* TripAdvisor: http://fortune.com/2015/10/14/tripadvisor-stock-gain-priceline-deal/
* Williams: http://www.forbes.com/sites/antoinegara/2015/06/22/pipeline-giant-williams-rejects-64-a-share-takeover-bid-from-energy-transfer/#318056c339f8
* Harman: http://investor.harman.com/releasedetail.cfm?releaseid=890984 and http://investor.harman.com/releasedetail.cfm?releaseid=893546 
* Qorvo: http://www.qorvo.com/news/2015/qorvo-announces-proposed-1-billion-senior-notes-offering and http://www.bizjournals.com/triad/news/2015/11/06/qorvo-revenue-rises-in-latest-quarter.html

The **minimum** daily returns for 2015 were the following:

In [None]:
dailyReturnsSorted.iloc[-10:, :]

Potential evidence for some of the **minimum** daily returns presented above can be found on the following links:

* Akamai: http://www.fool.com/investing/general/2015/10/28/why-akamai-technologies-inc-fell-hard-on-wednesday.aspx
* Mallinckrodt: http://www.bloomberg.com/news/articles/2015-11-09/mallinckrodt-slumps-on-scrutiny-from-valeant-foe-citron-research
* NRG Energy: http://247wallst.com/infrastructure/2015/12/04/nrg-continues-to-fall-as-ceo-steps-down/
* Micron: http://marketrealist.com/2015/06/microns-share-price-fall-19-june-26/
* Yum: http://www.reuters.com/article/us-yum-brands-china-idUSKCN0S11SZ20151007
* Michael Kors: http://money.cnn.com/2015/05/27/investing/michael-kors-earnings-stock-drop/

#### Yearly returns

Next, we will look at the overall performance of the S&P 500 companies over the whole year. As before, we create a function, `yearlyStockReturns`, which takes as input the dataframe of the stock prices and returns a dataframe with the yearly returns of those stocks.

In [None]:
def yearlyStockReturns(priceDF):
    # as_matrix() was removed from pandas; to_numpy() returns the same array
    priceMatrix = priceDF.to_numpy()
    # Calculate the yearly returns:
    # (final price - start price) / start price
    TotalPriceChangeMatrix = (priceMatrix[-1: ] - priceMatrix[ :1]) \
                              / priceMatrix[ :1]
    # Convert the result to a dataframe 
    # with the correct index and column names
    TotalPriceChangeDF = pd.DataFrame(TotalPriceChangeMatrix, \
                                      columns = priceDF.columns)
    # Transpose dataframe
    TotalPriceChangeDFtransposed = TotalPriceChangeDF.transpose()
    TotalPriceChangeDFtransposed.columns = ['Price Change']
    
    return TotalPriceChangeDFtransposed

To find which companies performed overall best and worst over the year we will sort the yearly returns of all companies in descending order.

In [None]:
yearlyReturns = yearlyStockReturns(stockPrices)

# Sort them
yearlyReturnsSorted = yearlyReturns.sort_values(by='Price Change', ascending=False)
# Merge on firm data with the symbol as the key (index in both dfs)
yearlyReturnsSorted = yearlyReturnsSorted.merge(firms, left_index=True, right_index=True, how='left')

The companies that performed **best** over the year are the following:

In [None]:
yearlyReturnsSorted.head(10)

In [None]:
y_pos_top_10 = np.arange(len(yearlyReturnsSorted['Name'][:10])-1,-1,-1)
plt.barh(y_pos_top_10, yearlyReturnsSorted['Price Change'][:10]*100,
         align='center', color='darkgreen')
plt.yticks(y_pos_top_10, yearlyReturnsSorted['Name'][:10])
plt.xlabel("% change on year")
plt.title("Top 10 performing stocks in 2015");

The companies that performed **worst** over the year are the following:

In [None]:
yearlyReturnsSorted.tail(10)

In [None]:
y_pos_top_10 = np.arange(len(yearlyReturnsSorted['Name'][-10:]))
plt.barh(y_pos_top_10, yearlyReturnsSorted['Price Change'][-10:]*100,
         align='center', color='darkred')
plt.yticks(y_pos_top_10, yearlyReturnsSorted['Name'][-10:])
plt.xlabel("% change on year")
plt.title("Bottom 10 performing stocks in 2015");

#### Volatility

Finally, we will try to figure out which companies exhibited **most** and **least** volatility. The volatility of the companies is measured based on the standard deviation of their daily returns over the year. We create the function `volatility` which takes as input a dataframe with the daily returns of the companies and returns a dataframe with the volatility measure of those companies, as defined above.

In [None]:
def volatility(dailyReturns):
    ##calculate sds and change into panda dataframe
    sdPriceChangeDF = pd.DataFrame(np.std(dailyReturns, axis = 0), \
                                   columns = ['Standard Deviation'])
    
    return sdPriceChangeDF
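
One detail worth keeping in mind is which standard deviation is being computed. NumPy's `np.std` defaults to the population formula (`ddof=0`, dividing by $n$), whereas pandas' own `.std()` defaults to the sample formula (`ddof=1`, dividing by $n - 1$). The sketch below illustrates the difference with the standard library on made-up returns; the numbers are assumptions chosen only for illustration.

```python
import math
import statistics

# Made-up daily returns for one hypothetical stock
daily_returns = [0.01, -0.02, 0.015, 0.0, -0.005]

population_sd = statistics.pstdev(daily_returns)  # divides by n
sample_sd = statistics.stdev(daily_returns)       # divides by n - 1

# The two differ only by the constant factor sqrt(n / (n - 1))
n = len(daily_returns)
ratio = sample_sd / population_sd
print(population_sd, sample_sd, ratio, math.sqrt(n / (n - 1)))
```

Because every stock has the same number of observations, this constant factor should not change the ranking of the most and least volatile companies.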

Next we sort the dataframe of the volatilities in order to find the most and least volatile companies for 2015.

In [None]:
sdPriceChangeDF = volatility(dailyReturns)

# Sort on standard deviation
sdPriceChangeDFsorted = sdPriceChangeDF.sort_values(by = 'Standard Deviation', \
                                                    ascending = False)
# Merge on firm data with the symbol as the key (index in both dfs)
sdPriceChangeDFsortedfull = sdPriceChangeDFsorted.merge(firms, \
                                                        left_index = True, \
                                                        right_index = True, \
                                                        how = 'left')

The **most volatile** companies for 2015 were the following:

In [None]:
sdPriceChangeDFsortedfull.head(10)

In [None]:
## create list of company ticker names that are in the top 10 most variable
columnlist = sdPriceChangeDFsortedfull[0:10].index.values.tolist()
##create data frame of price data for just the top 10 most variable companies
variablePriceData = stockPrices[columnlist]
##scale based on first price
variablePricesScaled = variablePriceData.divide(stockPrices[columnlist].iloc[0])
y_pos_dates = np.arange(len(variablePricesScaled.index))
##Set colour scheme
colors = ['#9970ab','#5aae61','#4393c3','#de77ae','#35978f','#f768a1','#fec44f','#d0d1e6','#08306b','#a50f15']

for i in range(0,len(columnlist)):
    plt.plot(y_pos_dates,variablePricesScaled[columnlist[i]], c=colors[i], label=columnlist[i].format(i=i))
plt.legend(loc='best')
plt.xticks([len(variablePricesScaled.index)/4,
            len(variablePricesScaled.index)*2/4,
            len(variablePricesScaled.index)*3/4,
            len(variablePricesScaled.index)],
           ["Mar-2015","Jun-2015","Sep-2015","Dec-2015"])
plt.legend(loc=5,prop={'size':10})
plt.show();

The **least volatile** companies for 2015 were the following:

In [None]:
sdPriceChangeDFsortedfull.tail(10)

In [None]:
## create list of company ticker names that are in the 10 least variable
columnlist = sdPriceChangeDFsortedfull[-10:].index.values.tolist()
##create data frame of price data for just the 10 least variable companies
variablePriceData = stockPrices[columnlist]
##scale based on first price
variablePricesScaled = variablePriceData.divide(stockPrices[columnlist].iloc[0])
y_pos_dates = np.arange(len(variablePricesScaled.index))
##Set colour scheme
colors = ['#9970ab','#5aae61','#4393c3','#de77ae','#35978f','#f768a1','#fec44f','#d0d1e6','#08306b','#a50f15']

for i in range(0,len(columnlist)):
    plt.plot(y_pos_dates,variablePricesScaled[columnlist[i]], c=colors[i], label=columnlist[i].format(i=i))
plt.legend(loc='best')
plt.xticks([len(variablePricesScaled.index)/4,
            len(variablePricesScaled.index)*2/4,
            len(variablePricesScaled.index)*3/4,
            len(variablePricesScaled.index)],
           ["Mar-2015","Jun-2015","Sep-2015","Dec-2015"])
plt.legend(loc=5,prop={'size':10})
axes = plt.gca()
axes.set_ylim([0,3])
plt.show();

### 2. Correlations

To find the similarities between stock price movements for different companies, we will calculate the correlation between the returns of different stock prices. For two companies with stock price returns $x, y$ and observations for $n$ days, their correlation is given by the following formula:

$$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2} \sqrt{n\sum y_i^2 - (\sum y_i)^2}}$$

In order to compute the correlation of the companies, we define the function `calCorrelations`, which takes as input a dataframe with the daily returns of the S&P 500 companies and returns their correlations in two different formats:

1. A **correlations matrix**, where each element represents the correlation between the two companies indicated by the specific row and column. 
2. A **graph** where each node represents a company and each edge between two nodes represents the correlation between those two companies.

In [None]:
def calCorrelations(dailyReturn):
    col = dailyReturn.columns
    ncol = len(col)
    corrMat = np.identity(ncol)
    
    G = nx.Graph()
    G.add_nodes_from(col.values)
        
    n = len(dailyReturn)
    for i in range(0, ncol):
        for j in range(i + 1, ncol):
            x = dailyReturn[col[i]]
            y = dailyReturn[col[j]]
            xsum = sum(x)
            ysum = sum(y)
            corrMat[i][j] = (n * sum(x * y) - xsum * ysum) / (np.sqrt(n * sum(x**2) - xsum**2) * np.sqrt(n * sum(y**2) - ysum**2))
            corrMat[j][i] = corrMat[i][j]
            G.add_edge(col[i], col[j], weight = corrMat[i][j])
        
    return pd.DataFrame(data = corrMat, index = col, columns = col), G

In [None]:
corMatrix = calCorrelations(stockReturns(stockPrices))[0]
corMatrix.iloc[:5, :5]
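
To make the correlation formula concrete, here is a minimal pure-Python sketch of the same expression applied to made-up return series (the helper name `pearson` and all numbers are illustrative assumptions). A series that is an increasing linear function of another should give a correlation of about $1$, and a negated series about $-1$.

```python
import math

def pearson(x, y):
    """Pearson correlation using the sums-based formula from the text."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

x = [0.01, -0.02, 0.03, 0.00]           # made-up daily returns
y_same = [2 * v + 0.001 for v in x]     # moves with x
y_opposite = [-v for v in x]            # moves against x

r_same = pearson(x, y_same)
r_opposite = pearson(x, y_opposite)
print(r_same, r_opposite)
```

For the full dataset, pandas' `DataFrame.corr()` (Pearson by default) should produce the same matrix considerably faster than the double loop; it is mentioned here only as an alternative, not as the method used in this report.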

In addition, in order to be able to find and print correlations easily between companies, we provide the following helper functions:

1. `cal2CompCor`: It takes as inputs the correlation matrix of the companies, the dataframe with the companies information and the symbols of two companies. It returns their full names and correlation.
2. `highLowCorrelation`: It takes as inputs the correlation matrix of the companies and the symbol of a company. It returns two tables with the names, sectors and correlations of the five most and the five least correlated companies.

In [None]:
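def cal2CompCor(correlationmatrix, firms, company1, company2):
    # Full company names are stored in the first column of firms
    name1 = firms.loc[company1].iloc[0]
    name2 = firms.loc[company2].iloc[0]
    # Use the matrix passed in as argument rather than the global corMatrix
    print("The correlation between " + name1 + " and " + name2 + " is " + \
          str(correlationmatrix.loc[company1, company2]) + ".")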

In [None]:
def highLowCorrelation(correlationmatrix, company):
    namesDict = dict()
    input_file = csv.DictReader(open('SP_500_firms.csv'))
    for row in input_file:
        #print(row)
        namesDict[row['Symbol']] = [row['Name'],row['Sector']]
    
    i = correlationmatrix.columns.get_loc(company)  
    j = correlationmatrix.iloc[ : , i]
    high = j.nlargest(6)
    highindex = high.index
    listofhighcomp = []    
    for k in highindex:
        listofhighcomp.append(namesDict[k])
    low = j.nsmallest(5)
    lowindex = low.index
    listoflowcomp = []    
    for k in lowindex:
        listoflowcomp.append(namesDict[k])
       
    dflistofhighcomp = DataFrame(listofhighcomp)
    dfhighindex = DataFrame(highindex)
    dflistofhighcomp = dflistofhighcomp.merge(dfhighindex,right_index=True, left_index=True, how = 'left')
    dfhigh = DataFrame(high)
    table2 = dfhigh.merge(dflistofhighcomp, left_index = True, right_on='0_y', how = 'left')    
    table2.index = table2['0_y'].to_numpy()
    del table2['0_y']
    table2 = table2.rename(columns={company: 'Correlation', '0_x': 'Name of Company', 1: 'Industry'})

    dflistoflowcomp = DataFrame(listoflowcomp)
    dflowindex = DataFrame(lowindex)
    dflistoflowcomp = dflistoflowcomp.merge(dflowindex,right_index=True, left_index=True, how = 'left')
    dflow = DataFrame(low)
    table = dflow.merge(dflistoflowcomp, left_index = True, right_on='0_y', how = 'left')
    table.index = table['0_y'].to_numpy()
    del table['0_y']
    table = table.rename(columns={company: 'Correlation', '0_x': 'Name of Company', 1: 'Industry'})

    return(table2[1:], table)

We will use the functions defined above to explore some of the companies in the tech sector, like *Amazon*, *Microsoft*, *Facebook*, *Apple*, and *Google*. All stocks are affected by macroeconomic environments and thus are correlated with each other, either positively or negatively. There are various factors which determine how correlated two stocks are. 

Usually, the stock prices of two companies from the same industry move in tandem, as market conditions affect them both in the same way. However, there can also be cases where market conditions have different effects on companies in the same industry.

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'AMZN')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'AMZN')[1])

For the top five stocks correlated with Amazon's stocks, $4$ out of the $5$ companies are IT companies and are growth stocks, which is not surprising. However, Starbucks also happens to have a relatively high correlation with Amazon's stocks though it comes from a completely different industry. 

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'MSFT')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'MSFT')[1])

For Microsoft, $2$ out of the top $5$ correlated stocks are from the same industry.

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'FB')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'FB')[1])

For Facebook, $4$ out of $5$ of the top correlated companies are in the IT industry. Starbucks also appears in the top 5 for Facebook, surprisingly.

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'AAPL')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'AAPL')[1])

For Apple, surprisingly its top $3$ correlated stocks are all from the Industrials sector, with the subsequent $2$ being from the IT industry. These results are interesting, as the companies do not appear to have any relation to Apple.

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'GOOG')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'GOOG')[1])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'GOOGL')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corMatrix, 'GOOGL')[1])

Alphabet Inc. is the parent company of Google and has two types of stock in the index (`GOOGL` and `GOOG`), which are Class A and Class C shares respectively. The difference between the two is that the owners of Class A shares are allowed to vote, whereas the owners of Class C shares are not. 
The top 5 correlated shares for `GOOG` and `GOOGL` are the same, aside from Microsoft for `GOOG` and Mastercard Inc for `GOOGL`. It is not surprising that the highest stock correlation for `GOOG` is `GOOGL` at 0.98 and vice versa, since they are under the same parent company.

Most of the lowest correlated stocks for each of the aforementioned tech companies are from companies in the Energy/Industrials/Materials sectors, whose stock prices move in conjunction with commodity prices such as oil and gas based on global supply and demand. Thus, it is not surprising that their stock prices have such low correlation with the stock prices of tech companies.

For the companies which appear to have no relation to the above tech companies but have correlated stocks, this may be due to chance, as the time period we are looking at is only a year. It would be interesting to see if this correlation is sustained throughout a longer time period, to see if there is really a correlation between the two stocks. This will be investigated further in the in-depth analysis section of the report.

Another interesting thing to note is that there appear to be no stocks in the S&P 500 which are strongly negatively correlated with the stocks of the above tech companies. 

### 3. Clustering algorithm

We will now use the similarity information of the companies based on their correlations to divide them into clusters. The clusters will indicate companies with similar performance over the year 2015. The clustering will be performed using a *greedy* algorithm, which is described as follows:

1. Sort the edges in the graph by their weight (i.e. the correlation)
2. Create a single-node set from each node in the graph
3. Repeat k times:
    1. Pick the highest-weight edge that has not yet been considered
    2. Merge the sets containing the source and the destination of the edge
    3. Repeat from step 3.1 with the next-highest weight edge
    
4. Return the remaining sets

To perform the clustering, we define two new functions, `sortCorrelations` and `clusteringAlg`, that perform the following tasks:

* `sortCorrelations`: Takes as input the correlation matrix of the companies and returns a list of tuples sorted in descending order of the correlations between the companies. Each tuple has the following format:

$$(correlation, company_1, company_2)$$

* `clusteringAlg`: Takes as input the list of tuples created by the previous function and a constant $k$, which represents the number of iterations of the algorithm, and clusters the companies based on the algorithm described previously. The function returns two outputs: a list of sets, where each set represents a cluster of companies, and a list of integers, where each integer represents the number of clusters at each iteration.

In [None]:
def sortCorrelations(corMatrix):
    n = int(corMatrix.shape[0])
    corList = []
    for i in range(1, n):
        for j in range(0, i):
            corList.append((corMatrix.iloc[i, j], corMatrix.columns.values[i], \
                            corMatrix.columns.values[j]))
    return sorted(corList, reverse = True)
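
The ordering relies on how Python compares tuples: element by element, starting with the first. Since the correlation comes first in each tuple, sorting in reverse puts the most correlated pair at the front. A small sketch with mostly made-up values (only the 0.98 for `GOOGL`/`GOOG` echoes a figure quoted earlier in this report):

```python
# (correlation, company_1, company_2) tuples; values are illustrative only
pairs = [
    (0.35, "MSFT", "AAPL"),
    (0.98, "GOOGL", "GOOG"),
    (-0.10, "XOM", "FB"),
]

# Tuples compare on the first element first, so this sorts by correlation
ordered = sorted(pairs, reverse=True)
for correlation, company_1, company_2 in ordered:
    print(correlation, company_1, company_2)
```

If two pairs happen to have exactly the same correlation, the tie is broken by comparing the ticker symbols, which is harmless for the clustering.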

In [None]:
def clusteringAlg(corList, k = 0):
    """
    Input:  
    corList: The ordered list of tuples which include the
              correlations between firms and the firms themselves
    k: The number of iterations for the clustering algorithm
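    Output: 
    sets: A list of sets where each set represents an individual 
          cluster
    noOfClusters: A list with, for each iteration, the initial number
          of sets minus the number of sets before that iteration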
    """
    # Initialize the list of sets. Each set represents a cluster
    # which initially includes only one firm
    sets = []
    noOfClusters = []
    for i in range(len(corList)):
        if not({corList[i][1]} in sets):
            sets.append({corList[i][1]})
        if not({corList[i][2]} in sets):
            sets.append({corList[i][2]}) 
    inNoOfClusters = len(sets)
    # Repeat the algorithm k times
    # In iteration j we check the j-th tuple of the correlations list
    # and whether the 2 firms in that tuple are already in the same
    # set. If they are, we move on to the next tuple, otherwise we merge
    for j in range(min(k, len(corList))):
        nd1 = corList[j][1]
        nd2 = corList[j][2]
        fl1, fl2 = False, False 
        noOfClusters.append(inNoOfClusters - len(sets))
        for i in range(len(sets)):
            if (nd1 in sets[i]) and fl1 == False:
                idx1 = i
                fl1 = True
            if (nd2 in sets[i]) and fl2 == False:
                idx2 = i
                fl2 = True
        if idx1 != idx2:
            sets[idx1] = sets[idx1].union(sets[idx2])
            sets.remove(sets[idx2])
    # Return the final list of sets
    return sets, noOfClusters
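
The same greedy merging can also be sketched with a disjoint-set (union-find) structure, which avoids scanning the whole list of sets at every iteration. This is an alternative formulation under the same assumptions as `clusteringAlg` (a list of `(correlation, company_1, company_2)` tuples sorted in descending order), not the implementation used in this report; the function name `greedy_clusters` and the toy edges are hypothetical.

```python
def greedy_clusters(sorted_edges, k):
    """Merge the endpoints of the k highest-weight edges and return the clusters."""
    parent = {}

    def find(node):
        # Follow parent pointers to the representative, shortening the path
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Every node starts in a cluster of its own
    for _, a, b in sorted_edges:
        find(a)
        find(b)

    # Consider the k strongest edges; merge only if they join two clusters
    for _, a, b in sorted_edges[:k]:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster by representative
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Toy graph with made-up correlations, already sorted in descending order
edges = sorted([(0.9, "A", "B"), (0.8, "C", "D"), (0.7, "B", "C"),
                (0.2, "D", "E"), (0.1, "A", "E")], reverse=True)

after_two = sorted(sorted(c) for c in greedy_clusters(edges, 2))
after_three = sorted(sorted(c) for c in greedy_clusters(edges, 3))
print(after_two)
print(after_three)
```

In the toy graph, the third edge links the two existing pairs into one larger cluster, which mirrors the merging of clusters discussed for larger values of $k$.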

In [None]:
corList = sortCorrelations(corMatrix)

In [None]:
clusters, noClusters = clusteringAlg(corList, k = 10000)

In the following diagram we observe how the number of clusters increases as the number of iterations increases from $1$ to $10,000$. We define clusters as groups of companies that include more than one company. The count recorded at each iteration is the initial number of single-firm sets minus the current number of sets, i.e. the number of merges performed so far, which we use as a proxy for that number. We observe that the line is steeper during the first iterations, but as the number of iterations increases, the count grows at a slower pace. 

In [None]:
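# One point per iteration actually performed
t = np.arange(1, len(noClusters) + 1)
plt.plot(t, noClusters, '-')
# Iterations are on the x-axis, the cluster count on the y-axis
plt.xlabel('Number of iterations')
plt.ylabel('Number of clusters')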
plt.title('Clusters versus iterations')
plt.show()

The clustering algorithm implemented above is called **single-linkage clustering** and is one of the many methods of **hierarchical clustering**. Single-linkage clustering is based on agglomerative clustering. It involves starting with each firm in a cluster of its own and then combining two clusters at each iteration of the algorithm. It chooses the two clusters that are 'close to' i.e. highly correlated with each other.
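
What makes the linkage "single" is that the similarity between two clusters is taken to be the similarity of their closest pair of members, so one strongly correlated pair is enough to merge two clusters. A minimal sketch of that definition, using a hypothetical helper `single_linkage_similarity` and made-up correlations:

```python
def single_linkage_similarity(cluster_a, cluster_b, corr):
    """Highest correlation between any firm in cluster_a and any firm in cluster_b."""
    return max(corr[frozenset((a, b))] for a in cluster_a for b in cluster_b)

# Made-up pairwise correlations between three hypothetical firms
corr = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.7,
}

similarity = single_linkage_similarity({"A", "B"}, {"C"}, corr)
print(similarity)
```

Here the cluster `{A, B}` is considered close to `{C}` because of the `B`-`C` pair alone, even though `A` and `C` are only weakly correlated.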

This method is also known as **nearest neighbour clustering**. A similar greedy idea appears in the nearest-neighbour heuristic for the traveling salesman problem (TSP). In TSP, a salesman wants to travel to $N$ cities that are connected to each other. The connections between cities have weights that represent the cost or distance. The salesman wants to minimize the total cost incurred or total distance travelled, i.e. the salesman is looking for cities that are 'close to' each other in the same way we look for stock prices which are 'close to' each other.

In order to evaluate our clustering algorithm, we will focus on the resulting clusters for different values of $k$ and see if the companies included in those have similarities between them and whether their stock prices and daily returns perform in a similar way.

For that purpose we define the function `companyTracker`, which takes as input the correlation list previously computed, the symbol of a company, the dataframe which contains the companies' information, and three integers (start, stop and step) which specify the different values of $k$ for which we will execute the clustering algorithm.

In [None]:
def companyTracker(corList, company, firms, ks, kf, kint):
    kValues = np.arange(ks, kf, kint)
    sets = []
    for k in kValues:
        clusters = clusteringAlg(corList, k)[0]
        print('k = ' + str(k))
        for i in range(len(clusters)):
            if company in clusters[i]:
                if len(clusters[i]) == 1:
                    # The company is still alone in its cluster
                    display(firms.loc[company, :])
                else:
                    # .loc needs a list-like of labels, not a set
                    display(firms.loc[list(clusters[i]), :].sort_values(['Sector', 'Name']))

We will see how the cluster containing Bank of America (`BAC`) evolves when the number of iterations is $k = \{30, 50, 100, 500, 1000\}$.

In [None]:
companyTracker(corList, 'BAC', firms, 30, 60, 30)

In [None]:
companyTracker(corList, 'BAC', firms, 50, 100, 50)

In [None]:
companyTracker(corList, 'BAC', firms, 100, 200, 100)

In [None]:
companyTracker(corList, 'BAC', firms, 500, 1000, 500)

In [None]:
companyTracker(corList, 'BAC', firms, 1000, 2000, 1000)

After running the clustering algorithm, for increasing values of $k$, it is observed that as $k$ increases more elements are added to the existing clusters and more new clusters are formed. Clusters are formed with stocks belonging to firms within the same industry. However, after a certain value of $k$, we see clusters merging and becoming more 'broad' – for example, what was initially two clusters – insurance and banking – becomes a general 'finance' cluster. 

In the above tables, *Bank of America Merrill Lynch* was tracked for different values of $k$ to see how the composition of its cluster changes when $k=\{30, 50, 100, 500, 1000\}$. For values of $k$ less than or equal to $500$, we only see companies from the Financial sector within its cluster. For $k = 1000$, firms from the Consumer Discretionary, Industrial and Information Technology sectors join the cluster. This may happen because one or two stocks from these industries are correlated with companies from the Financial sector, and a single such link is enough to merge two clusters.

Next, we define the function `plotSetPrices`, which takes as inputs the stock prices of the S&P 500 companies, the resulting clusters of the clustering algorithm and the symbol of a company. The function plots the normalised prices of all the companies that belong to the cluster of the specified company.

In [None]:
def plotSetPrices(stockPrices, clusters, company):
    
    for i in range(len(clusters)):
            if company in clusters[i]:
                # .loc needs a list-like of labels, not a set
                normPrices = stockPrices.loc[:, list(clusters[i])]
                normPrices = normPrices / normPrices.iloc[0, :]
                normPrices.plot(legend = True);

In [None]:
clusters, noClusters = clusteringAlg(corList, k = 30)
plotSetPrices(stockPrices, clusters, 'BAC')
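
Normalising by the first price puts every stock on the same scale (starting at $1$), so that stocks with very different price levels can be compared on one chart. A minimal sketch of the idea on made-up prices for two hypothetical tickers `X` and `Y`:

```python
# Made-up closing prices for two hypothetical tickers
prices = {
    "X": [50.0, 55.0, 60.0],
    "Y": [200.0, 190.0, 210.0],
}

# Divide each series by its own first price so that every series starts at 1
normalised = {ticker: [p / series[0] for p in series]
              for ticker, series in prices.items()}
print(normalised)
```

After normalisation, a value of 1.2 simply means the price is 20% above its level on the first trading day, regardless of the absolute price.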

## The extra part 

### In-depth analysis

#### Stock correlations of the 5 tech stocks in 2014

In the main section of the report, we looked at the top correlated stocks with the 5 tech companies in 2015. We will now look to see whether these correlations exist in 2014 as well. Stock prices of the S&P500 companies in 2014 were drawn from the Yahoo Finance website and correlations are calculated based on the data.

In [None]:
#pull stock tickers from S&P500 list and convert them into a list

stocktickers = corMatrix.columns
stocktickers = list(stocktickers)

To pull data from the Yahoo Finance website, we used helper functions from hw2.py to generate the helper function `getStockfromYahoo`, which requires two inputs: the year of the data and the list of stock tickers of interest. It returns a dataframe of stock prices over the year for the specified stocks.

In [None]:
import pandas_datareader.data as web
from datetime import datetime

#additional helper functions from hw2.py file

def getStock(symbol, start, end):
    """
    Downloads stock price data from Yahoo Finance
    Returns a pandas dataframe.
    """
    df =  web.DataReader(symbol, 'yahoo', start, end)
    return df

def getClose(df):
    """
    Returns stock price dataframe's adjusted closing price as a list
    """
    L = df['Adj Close'].values.tolist()
    return L

In [7]:
def getStockfromYahoo(year, stocktickers):
    start = datetime(year,1,1)
    finish = datetime(year,12,31)
    stockdf = DataFrame()
    a = 0
    
    for i in stocktickers:
        try:
            if a == 0:
                stock = getStock(i,start, finish)
                stock = stock.loc[:, 'Close']
                stockdf = DataFrame(stock)
                stockdf = stockdf.rename(columns={'Close': i})
                a = a + 1
            else:            
                stock = getStock(i,start, finish)
                stock = stock.loc[:, 'Close']
                stockdf = stockdf.join(stock)
                stockdf = stockdf.rename(columns={'Close': i})
        except Exception:
            # Skip tickers whose data could not be downloaded
            pass
    
    return stockdf

In [None]:
#create dataframe with stock prices

data_2014 = getStockfromYahoo(2014, stocktickers)

In [None]:
#calculate stock returns for 2014 data

sr_2014 = stockReturns(data_2014)

In [None]:
#create correlation matrix for 2014

corr_2014 = calCorrelations(sr_2014)[0]

In [6]:
#top and bottom correlated companies for FB

pd.DataFrame(highLowCorrelation(corr_2014, 'FB')[0])

In [None]:
pd.DataFrame(highLowCorrelation(corr_2014, 'FB')[1])

In [None]:
#correlation between BRK-B and MSFT

cal2CompCor(corr_2014, firms, 'BRK-B', 'MSFT')

When we looked at the top and bottom correlated stocks for the 5 tech companies in 2014, we saw a change in the results. For example, for Facebook in 2015, the top correlated companies were mostly from the IT industry, but in 2014, it was most highly correlated with companies from the healthcare industry.

Overall, most of the companies which had the highest correlation with the 5 tech companies in 2015 did not appear in the top 5 correlated companies in 2014. This suggests that many of the surprising results seen before, e.g. Starbucks stocks having a high correlation with Facebook stocks in 2015, were probably due to chance. (Starbucks stocks and Facebook stocks have a correlation of 0.60 and 0.34 in 2015 and 2014 respectively.)

On the other hand, for Microsoft and Berkshire-Hathaway, we still find positive correlation between the two stocks in 2014, though not as strong (0.59 in 2015, 0.43 in 2014). This is interesting because though the two companies are from different industries, their stocks are moderately correlated over a period of two years. This may warrant further investigation, to see whether this relationship exists over an even longer period of time.

### Exploring other clustering methods

# Notes

*Any notes or comments about the report to be listed here:*

* 