# 1. Data source

Data is provided by cafef.vn

http://s.cafef.vn/du-lieu/download.chn

In [1]:
import numpy as np
import pandas as pd
%matplotlib auto
import matplotlib.pyplot as plt

data = pd.read_csv('CafeF.HNX.Upto23.11.2020.csv').set_index('<Ticker>')

print(data.head())
print(data.tail())

Using matplotlib backend: TkAgg
          <DTYYYYMMDD>  <Open>  <High>  <Low>  <Close>  <Volume>
<Ticker>                                                        
X20           20201123     6.5     6.5    6.5      6.5       100
X20           20201117     6.8     6.8    6.8      6.8       100
X20           20201116     6.8     6.8    6.8      6.8       100
X20           20201113     6.3     6.3    6.3      6.3      1000
X20           20201103     6.3     6.3    6.3      6.3       100
          <DTYYYYMMDD>   <Open>   <High>    <Low>  <Close>  <Volume>
<Ticker>                                                            
AAV           20180629  12.9912  13.1477  12.9129  13.0701    108100
AAV           20180628  12.9129  13.1477  12.9129  12.9918    210300
AAV           20180627  14.8694  14.8694  12.9129  12.9136    532700
AAV           20180626  13.5390  13.5390  12.5216  13.5397    800200
AAV           20180625  12.3651  12.3651  12.3651  12.3657    362000


# 2. Transforming the data

First we define the time range


In [2]:
data = data[data['<DTYYYYMMDD>'] > 20190101]

Assume that we are day traders - we would be interested in the daily stock movement. <br>
So then, we calculate the daily stock movement.

In [3]:
data['Movement'] = data['<Close>'] - data['<Open>']

Then we group the data by ticker name and have each column as movement of 1 day on the stock market.

In [4]:
df = data.pivot( columns = '<DTYYYYMMDD>', values='Movement').fillna(0).rename_axis(None)
#rename date label with number for visibility
df.columns = list(range(0, 474))
#remove any stock with all 0 movement
df = df[(df.T != 0).any()]

print(df.head())
print(df.tail())

        0       1       2       3       4       5       6       7       8    \
AAV -0.0900  0.0000  0.2700 -0.0900  0.0000  0.0000  0.0000  0.0000  0.1800   
ACB -0.2367  1.0654  0.1776 -0.1155 -1.7134  0.2368 -0.0563 -0.0563 -0.0563   
ACM  0.0000  0.0000  0.1000  0.1000  0.1000  0.0000 -0.1000  0.0000  0.0000   
ADC  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000   
ALT  0.0000  0.0000  0.0000  0.0000  0.0000  0.0005  0.0000  0.0000  0.0005   

        9    ...  464  465  466  467  468  469  470  471  472  473  
AAV -0.2700  ...  0.0  0.0  0.0  0.1 -0.1  0.0  0.0  0.0  0.0  0.0  
ACB  0.5327  ... -0.4  0.1  1.0  0.1 -0.3  1.0  0.0 -0.1  0.0 -0.2  
ACM  0.0000  ...  0.0 -0.1  0.0  0.1  0.1  0.0  0.0 -0.1  0.0  0.0  
ADC  0.0000  ... -1.5  0.0  0.0  2.4  0.0  0.0  0.0  0.0  0.0  3.2  
ALT  0.0000  ...  0.0  0.0  0.0  0.0  0.0 -0.1  0.0  0.0 -0.2  0.0  

[5 rows x 474 columns]
        0      1      2       3      4       5       6       7    8       9    \
VTV  0

# 3. Preprocessing

If we plot the daily movement of ACB and ACM we are going to see why we need Preprocessing

In [5]:
df.loc['ACB'].plot(kind='line')
df.loc['ACM'].plot(kind='line')

plt.xlim(0, 474)
plt.legend()
plt.show()

What we can see from these two stocks is that we have different scales between the price movements.

This means we need to do a normalization step before we apply k-means clustering. If we don't do this the algorithm would just cluster based on the price of the stock.

To do this we're going to use Normalizer() from sklearn.preprocessing, and then we'll check out the new minimum movement value, the maximum, and the mean.

In [6]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()

new = normalizer.fit_transform(df)

print('Ticker ACB')
print('Unormalized: ')
print('Max:', df.iloc[1].max())
print('Min:', df.iloc[1].min())
print('Mean:', df.iloc[1].mean())
print('Normalized: ')
print('Max:', new[1].max())
print('Min:', new[1].min())
print('Mean:', new[1].mean())

Ticker ACB
Unormalized: 
Max: 1.9236999999999984
Min: -1.9532999999999987
Mean: 0.01298713080168775
Normalized: 
Max: 0.21008735011530882
Min: -0.2133199672403352
Mean: 0.0014183250484625922


Let's now plot out the movements of AAPL and AMZN again and see how they've changed:

In [None]:
df.iloc[:, :] = new

df.loc['ACB'].plot(kind='line')
df.loc['ACM'].plot(kind='line')

plt.xlim(0, 474)
plt.legend()
plt.show()

# 4. K-means Clustering

The K-means algorithm operates as follows:

1. a number of "centroids" are randomly initialized (the number of hyperparameter of the model), these centroid
   match the dimension of the feature set, and can be imagine as a vector into some n-dimensional space
2. every sample in the data set is then compared to each of the randomly initialized centroids, to see how far 
   it is away from the centroid. Since the samples and centroids are vectors, the distance 
   between a vector v and a centroid u is the vector normal of the difference between the two vectors 
   ((u1-v1)^2 + (u2-v2)^2 + ....)^(1/2). Each sample is then "clustered" with the centroid it is closest to.
3. After each sample has been clustered with a specific centroid, each centroid is repositioned, such that it
   is the average of all of the samples that have been clustered with it.
4. The sample association and centroid repositioning steps are then repeated for some number of iterations



In [8]:
from sklearn.cluster import KMeans

## Finding k - the elbow method

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. <br>
The Elbow Method is one of the most popular methods to determine this optimal value of k.

In [9]:
sse = []
# iterate k from 1 to 10
for k in range(1,10):
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(df)
    sse.append(kmeans.inertia_)

# and then plot it
plt.plot(range(1,10), sse)
plt.title("Elbow Curve")
plt.show()


So, let's choose k = 3

In [15]:
clusters = KMeans(n_clusters = 3)
clusters.fit(df)
labels = clusters.labels_

companies = df.index

# create DataFrame aligning labels & companies
dff = pd.DataFrame({'companies': companies, 'labels': labels}).set_index('companies').sort_values(by = ['labels', 'companies'], ascending=True)

print(dff)

ValueError: could not convert string to float: 'AAV'

# 5. Principal components analysis
Using PCA on stocks movement data can help identified key dates

In [14]:
from sklearn.decomposition import PCA
reduced_data = PCA(n_components = 2 ).fit_transform(df)


print(reduced_data.shape)

kmeans = KMeans(n_clusters=10)
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)

# create DataFrame aligning labels & companies
df = pd.DataFrame({'companies': companies, 'labels': labels})

print(df.sort_values(by = ['labels', 'companies'], ascending=True)

(364, 2)
     labels companies
292       0       TJC
184       0       NTH
40        0       CLM
79        0       ECI
180       0       NHP
..      ...       ...
165       9       MSC
344       9       VLA
272       9       SPP
173       9       NDN
73        9       DST

[364 rows x 2 columns]
