# Clustering Exercise

In this section, we will extract structure from variations in historical stock prices. We don't have any target values (i.e., nothing in the data tells us in advance which stocks belong together), which is why this is an *unsupervised* machine learning task.

## Step 1: Import Libraries

In [None]:
import datetime
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import pandas as pd

%matplotlib notebook
from matplotlib import pyplot as plt

print("Libraries imported successfully!")

## Step 2: Load the Data

First, we define a *Python dict* which stores a mapping from ticker symbols to company names. This is useful for presentation purposes later on in this exercise.

In [None]:
# Define the stocks we want to investigate (ticker symbol -> company name)
symbol_dict = {
    'TOT': 'Total',
    'XOM': 'Exxon',
    'CVX': 'Chevron',
    'COP': 'ConocoPhillips',
    'VLO': 'Valero Energy',
    'MSFT': 'Microsoft',
    'IBM': 'IBM',
    'TWX': 'Time Warner',
    'CMCSA': 'Comcast',
    'CVC': 'Cablevision',
    'YHOO': 'Yahoo',
    'HPQ': 'HP',
    'AMZN': 'Amazon',
    'TM': 'Toyota',
    'CAJ': 'Canon',
    'MTU': 'Mitsubishi',
    'SNE': 'Sony',
    'F': 'Ford',
    'HMC': 'Honda',
    'NAV': 'Navistar',
    'NOC': 'Northrop Grumman',
    'BA': 'Boeing',
    'KO': 'Coca-Cola',
    'MMM': '3M',
    'MCD': "McDonald's",
    'PEP': 'Pepsi',
    'MDLZ': 'Kraft Foods',
    'K': 'Kellogg',
    'UN': 'Unilever',
    'MAR': 'Marriott',
    'PG': 'Procter & Gamble',
    'CL': 'Colgate-Palmolive',
    'GE': 'General Electric',
    'WFC': 'Wells Fargo',
    'JPM': 'JPMorgan Chase',
    'AIG': 'AIG',
    'AXP': 'American Express',
    'BAC': 'Bank of America',
    'GS': 'Goldman Sachs',
    'AAPL': 'Apple',
    'SAP': 'SAP',
    'CSCO': 'Cisco',
    'TXN': 'Texas Instruments',
    'XRX': 'Xerox',
    'LMT': 'Lockheed Martin',
    'WMT': 'Wal-Mart',
    'WBA': 'Walgreen',
    'HD': 'Home Depot',
    'GSK': 'GlaxoSmithKline',
    'PFE': 'Pfizer',
    'SNY': 'Sanofi-Aventis',
    'NVS': 'Novartis',
    'KMB': 'Kimberly-Clark',
    'R': 'Ryder',
    'GD': 'General Dynamics',
    'RTN': 'Raytheon',
    'CVS': 'CVS',
    'CAT': 'Caterpillar',
    'DD': 'DuPont de Nemours'}

# Split dict into list of symbols and list of names
symbols, names = np.array(list(symbol_dict.items())).T
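
To see what the split in the last line of the cell does, here is a minimal sketch on a two-entry dict. The tickers and company names are made up for illustration and are not part of the exercise data.

```python
import numpy as np

# Hypothetical two-entry mapping (not part of the exercise data)
toy_dict = {'AAA': 'Alpha Corp', 'BBB': 'Beta Inc'}

# items() yields (symbol, name) pairs; stacking them gives a (2, 2) array
pairs = np.array(list(toy_dict.items()))

# Transposing puts all symbols in row 0 and all names in row 1,
# so tuple unpacking yields two NumPy arrays
toy_symbols, toy_names = pairs.T

print(toy_symbols)
print(toy_names)
```

Because both results are NumPy arrays rather than plain lists, they can later be indexed with boolean masks.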

Opening and closing prices for the above symbols and the period 01-01-2009 to 01-01-2012 have been stored in *data/Open.csv* and *data/Close.csv* respectively. Inspect both files and note their structure.

In the next cell, load the contents of each file using *pd.read_csv(filepath, header=0, index_col=0)* and store the results in *open_vals* and *close_vals*, respectively.
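
As a minimal sketch of how this call behaves, the snippet below reads a tiny in-memory CSV instead of the real files. The column names and values are invented, and the layout (a header row plus a leading label column) is only an assumption about how such a file might look; inspect *data/Open.csv* to see its actual structure.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real files may be laid out differently
csv_text = """date,AAA,BBB
2009-01-02,10.0,20.0
2009-01-05,10.5,19.5
"""

# read_csv accepts a file path or any file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(toy.shape)
print(toy.columns.tolist())
print(toy.index.tolist())
```

Compare the printed shape, column labels, and index with the raw text to see what each argument did.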

**Bonus Question:** Explain the two additional arguments *header=0* and *index_col=0* in the context of these datasets.

In [None]:
# Get opening prices


# Get closing prices



## Step 4: Preprocess Data

Before we cluster the dataset, we have to convert the data to the right format. In general, data preprocessing might involve *filling missing values*, *extraction of features*, and *standardization/normalization* of data.
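
As a rough illustration of two of these steps, the sketch below fills a missing value and standardizes a small made-up table. The data and the choice of mean imputation are assumptions for demonstration only; whether either step is needed depends on the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: rows are samples, columns are features
toy = pd.DataFrame({'f1': [1.0, np.nan, 3.0],
                    'f2': [10.0, 20.0, 30.0]})

# Fill each missing value with its column mean (one of many strategies)
filled = toy.fillna(toy.mean())

# Rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(filled)

print(filled)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Standardization matters for distance-based methods such as k-means, because features with large numeric ranges otherwise dominate the distance computation.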

### Extract Features

In this example, we will use daily percent change as the feature for our clustering algorithm. Please note that *there are many other possible feature combinations* that can be used.

In the following cell, compute daily percent change (`(close_vals - open_vals) / open_vals`) and assign the result to a new variable `X`:
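
The formula works element-wise on DataFrames that share the same index and columns. Here is a minimal sketch on made-up prices (the tickers and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical opening and closing prices for two tickers over two days
toy_open = pd.DataFrame({'AAA': [10.0, 20.0], 'BBB': [50.0, 40.0]})
toy_close = pd.DataFrame({'AAA': [11.0, 19.0], 'BBB': [50.0, 44.0]})

# Element-wise relative change from open to close
toy_change = (toy_close - toy_open) / toy_open

print(toy_change)
```

Note that the result is a fraction (0.1 means a 10 % gain). Multiplying by 100 would give percentages, but scaling every value by the same constant does not change which points k-means groups together.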

In [None]:
# Calculate percent change



**Bonus Exercise**: Experiment with other features (e.g., absolute difference) and see how this changes your results.

## Step 5: Cluster Data

The data is in the right format now. *scikit-learn* provides convenient methods to cluster complex data. We will use the popular *k-means* clustering algorithm. *k-means* is a simple, yet powerful, clustering technique to partition a data set into k distinct, non-overlapping clusters/groups.

Before we run *k-means*, we have to specify *how many clusters* the algorithm should generate.

Create a variable `n_clusters` that holds the number of clusters we want to generate (e.g., start with 12 clusters). Afterwards, create a k-means clustering object with `KMeans(n_clusters)` and assign it to a variable called `kmeans`. To generate the clusters, call the `fit` method on the clustering object. `fit` requires one argument: the data that we want to cluster, with one row per sample. Here each sample is a stock, so transpose `X` first if your stocks are stored in columns.
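
The same create-then-fit pattern can be tried on synthetic data first. The sketch below clusters two well-separated groups of random 2-D points; the data, the variable names, and the settings `n_init=10` and `random_state=0` are illustrative assumptions, not the solution to this exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 5 points near (0, 0) and 5 points near (5, 5)
toy_X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
                   rng.normal(5.0, 0.1, size=(5, 2))])

# Create the clustering object, then fit it to the (n_samples, n_features) array
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_kmeans.fit(toy_X)

print(toy_kmeans.labels_)                 # one cluster index per row of toy_X
print(toy_kmeans.cluster_centers_.shape)  # (n_clusters, n_features)
```

After fitting, `labels_` holds one cluster index per row of the input, which is why the rows of the input must correspond to the items you want to group.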

In [None]:
# Define the number of clusters in a variable 'n_clusters'


# Cluster the data



[Help: Clustering Data With Scikit-learn](http://scikit-learn.org/stable/modules/clustering.html)

**Bonus Exercise**: Experiment with the number of clusters and observe how it changes the results.

## Step 6: Display Results

Let's combine the clustering results with the company names we specified in step 2 to visualize the clusters. Execute the code in the following cell and observe the result:

In [None]:
# Print results
labels = kmeans.labels_
for i in range(n_clusters):
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))
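
The expression `names[labels == i]` in the cell above relies on NumPy boolean indexing. A minimal sketch with made-up names and labels shows the mechanism:

```python
import numpy as np

# Hypothetical names and cluster labels, one label per name
toy_names = np.array(['Alpha', 'Beta', 'Gamma', 'Delta'])
toy_labels = np.array([0, 1, 0, 1])

for i in range(2):
    # toy_labels == i is a boolean mask that selects the members of cluster i
    members = toy_names[toy_labels == i]
    print('Cluster %i: %s' % (i + 1, ', '.join(members)))
```

This only works when the label array and the name array have the same length and the same order, so the rows passed to `fit` must line up with `names`.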

## Step 7: Evaluate Results

### Elbow Method

The *elbow method* can help select a suitable number of clusters. It looks at how well the clusters fit the data (measured here by the distortion, i.e., the within-cluster sum of squared distances) as a function of the number of clusters. We choose the number of clusters beyond which adding another cluster no longer gives a significantly better model of the data (i.e., the "elbow" of the curve).
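
Written out, the distortion that the code in this step collects from `km.inertia_` is the sum of squared distances between each sample and the centroid of its cluster. With $C_j$ the set of samples assigned to cluster $j$ and $\mu_j$ its centroid (notation assumed here, not taken from the exercise):

$$
W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

$W(k)$ can only shrink as $k$ grows and reaches zero when every sample is its own cluster, so the smallest value is not the goal; what matters is the point where the decrease flattens out.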

Execute the following code to test the different clustering configurations and to visualize the results:

In [None]:
min_clusters = 1
max_clusters = 20
distortions = []
for i in range(min_clusters, max_clusters+1):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
    
# Plot
plt.plot(range(min_clusters, max_clusters+1), distortions, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Distortion")
plt.show()

What is the optimal number of clusters?

[Help: Determining the number of clusters in a data set](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set)