
**Stocks moving together**


Cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). 

The dataset contains daily price movements of stocks from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.
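As a minimal sketch of what one entry in the dataset represents, the snippet below computes daily movements for a single company. The opening and closing prices are made-up values for illustration, not figures from the actual dataset.

```python
import numpy as np

# Hypothetical opening and closing prices (USD) for one company on three trading days
open_prices = np.array([100.0, 101.5, 99.0])
close_prices = np.array([100.5, 101.0, 102.0])

# Daily movement: closing price minus opening price
movement = close_prices - open_prices
print(movement)
```

A positive value means the stock closed above its opening price that day, and a negative value means it closed below.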

Some stocks are more expensive than others, so their daily movements are larger in dollar terms. To account for this, a Normalizer is included at the beginning of the pipeline. It rescales each company's row of price movements to a relative scale (unit L2 norm by default) before clustering begins.

A Normalizer() is very different from a StandardScaler(). A StandardScaler() standardizes each feature (here, each trading day's column) by removing the mean and scaling to unit variance, whereas a Normalizer() rescales each sample (here, each company's row of price movements) independently of the others.
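The difference can be sketched in plain NumPy. The arithmetic below mirrors what Normalizer() and StandardScaler() compute with their default settings, and the two-row array is a made-up example rather than data from the notebook.

```python
import numpy as np

# Two made-up companies (rows) with the same movement pattern at different price scales
X = np.array([[1.0, -2.0, 2.0],
              [10.0, -20.0, 20.0]])

# Row-wise rescaling to unit L2 norm (what Normalizer() does by default)
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
row_scaled = X / row_norms

# Column-wise standardization (what StandardScaler() does by default)
col_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(row_scaled)  # both rows become identical
print(col_scaled)  # the two rows become opposites of each other
```

After row-wise normalization the two companies look the same, which is the desired behaviour when clustering stocks that move together regardless of price level. Column-wise standardization instead pushes them apart, because it compares companies within each trading day.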

In [0]:
import numpy as np
import pandas as pd

In [0]:
data = np.genfromtxt('company-stock-movements.csv', delimiter=',', dtype='str')
data

In [0]:
companies = data[1:,0]
companies

array(['Apple', 'AIG', 'Amazon', 'American express', 'Boeing',
       'Bank of America', 'British American Tobacco', 'Canon',
       'Caterpillar', 'Colgate-Palmolive', 'ConocoPhillips', 'Cisco',
       'Chevron', 'DuPont de Nemours', 'Dell', 'Ford',
       'General Electrics', 'Google/Alphabet', 'Goldman Sachs',
       'GlaxoSmithKline', 'Home Depot', 'Honda', 'HP', 'IBM', 'Intel',
       'Johnson & Johnson', 'JPMorgan Chase', 'Kimberly-Clark',
       'Coca Cola', 'Lookheed Martin', 'MasterCard', 'McDonalds', '3M',
       'Microsoft', 'Mitsubishi', 'Navistar', 'Northrop Grumman',
       'Novartis', 'Pepsi', 'Pfizer', 'Procter Gamble', 'Philip Morris',
       'Royal Dutch Shell', 'SAP', 'Schlumberger', 'Sony',
       'Sanofi-Aventis', 'Symantec', 'Toyota', 'Total',
       'Taiwan Semiconductor Manufacturing', 'Texas instruments',
       'Unilever', 'Valero Energy', 'Walgreen', 'Wells Fargo', 'Wal-Mart',
       'Exxon', 'Xerox', 'Yahoo'], dtype='<U34')

In [0]:
movements = data[1:, 1:].astype(float)
movements

array([[ 5.8000000e-01, -2.2000500e-01, -3.4099980e+00, ...,
        -5.3599620e+00,  8.4001900e-01, -1.9589981e+01],
       [-6.4000200e-01, -6.5000000e-01, -2.1000100e-01, ...,
        -4.0001000e-02, -4.0000200e-01,  6.6000000e-01],
       [-2.3500060e+00,  1.2600090e+00, -2.3500060e+00, ...,
         4.7900090e+00, -1.7600090e+00,  3.7400210e+00],
       ...,
       [ 4.3000100e-01,  2.2999600e-01,  5.7000000e-01, ...,
        -2.6000200e-01,  4.0000100e-01,  4.8000300e-01],
       [ 9.0000000e-02,  1.0000000e-02, -8.0000000e-02, ...,
        -3.0000000e-02,  2.0000000e-02, -3.0000000e-02],
       [ 1.5999900e-01,  1.0001000e-02,  0.0000000e+00, ...,
        -6.0001000e-02,  2.5999800e-01,  9.9998000e-02]])

In [0]:
# Import Modules
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [0]:
# Create a normalizer: normalizer
normalizer = Normalizer()

In [0]:
# Create a KMeans model with 10 clusters: kmeans
# (random_state is not set, so cluster assignments can differ between runs)
kmeans = KMeans(n_clusters=10)

In [0]:
# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

In [0]:
# Fit pipeline to the daily price movements
pipeline.fit(movements)

Pipeline(memory=None,
         steps=[('normalizer', Normalizer(copy=True, norm='l2')),
                ('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=10, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0))],
         verbose=False)

In [0]:
# Predict the cluster labels: labels
labels = pipeline.predict(movements)

In [0]:
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

In [0]:
# Display df sorted by cluster label
print(df.sort_values('labels'))

    labels                           companies
29       0                     Lookheed Martin
36       0                    Northrop Grumman
4        0                              Boeing
54       0                            Walgreen
45       1                                Sony
42       2                   Royal Dutch Shell
32       2                                  3M
44       2                        Schlumberger
39       2                              Pfizer
16       2                   General Electrics
53       2                       Valero Energy
13       2                   DuPont de Nemours
49       2                               Total
10       2                      ConocoPhillips
8        2                         Caterpillar
57       2                               Exxon
12       2                             Chevron
1        3                                 AIG
18       3                       Goldman Sachs
55       3                         Wells Fargo
3        3   