# Question 1

You should run the k-means algorithm on the stock data, using init='random' and the default values for the other parameters. Compute the sum of squared errors (SSE) for the clustering you obtained and include it in your report.

In [3]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [4]:
data = pd.read_csv('clustering_data.csv', index_col=0)

Default number of clusters: 8 <br>
Method for initialization 'random': choose 8 observations (rows) at random from data for the initial centroids
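
As a rough illustration of the 'random' initialization described above, the sketch below picks k distinct rows of a data matrix as starting centroids. The helper `random_init_centroids` and the toy array `X_demo` are hypothetical names introduced here; scikit-learn's internal implementation may differ in its details.

```python
import numpy as np

def random_init_centroids(X, k, seed=None):
    """Pick k distinct rows of X as initial centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sample k row indices without replacement so no observation is chosen twice
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx], idx

# Toy stand-in for the stock data: 30 observations with 5 features (assumption)
X_demo = np.random.default_rng(0).normal(size=(30, 5))
centroids, chosen_rows = random_init_centroids(X_demo, k=8, seed=42)
print(centroids.shape)  # (8, 5)
```

Because the starting rows are random, two fits with init='random' can end in different clusterings with different SSE values unless random_state is fixed.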

In [5]:
kmeans = KMeans(init='random').fit(data)

inertia_: Sum of squared distances (SSE) of samples to their closest cluster center, weighted by the sample weights if provided

In [6]:
sse = kmeans.inertia_
print("SSE:", sse)

SSE: 1775.8271945588956
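
The SSE definition above can be cross-checked by hand. The following is a minimal sketch, assuming Euclidean distance and no sample weights; the function name `sum_of_squared_errors` and the toy arrays are illustrative and not part of the assignment.

```python
import numpy as np

def sum_of_squared_errors(X, centers, labels):
    """SSE: squared Euclidean distance of each sample to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] lines up each sample with the center of its own cluster
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

# Toy example: every point lies at distance 1 from its center
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])
print(sum_of_squared_errors(X_toy, centers_toy, labels_toy))  # 4.0
```

On the fitted model from this notebook, `sum_of_squared_errors(data.values, kmeans.cluster_centers_, kmeans.labels_)` should agree with `kmeans.inertia_` up to floating-point error.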


# Question 2

You should then try to decrease the SSE as much as possible (while keeping k=8) by changing some of the parameters accordingly. To this end, select two parameters that you think should impact the results the most.<br>
For each parameter explain:<br>
a) how you expect that changing that parameter would affect the results (increasing its value means better or worse results?)<br>
b) whether increasing or decreasing the value of the parameter should always improve the results or not necessarily.

max_iter: int, default=300<br>
Maximum number of iterations of the k-means algorithm for a single run.

In [7]:
kmeans_max_10000 = KMeans(
    init='random',
    max_iter=10000,  # matches the variable name and the printed label below
).fit(data)

In [8]:
sse_max_10000 = kmeans_max_10000.inertia_
print("SSE (max_iter=10000):", sse_max_10000)

SSE (max_iter=10000): 1716.1946463122863


Number of Iterations (max_iter):<br><br>
a) Expected impact: The max_iter parameter caps the number of iterations the k-means algorithm performs in a single run. If a run stops because it reaches this cap before converging, increasing max_iter lets it continue and can lower the SSE.<br><br>
b) Impact on results: Increasing max_iter does not necessarily improve the results. Once every run converges before reaching the cap, raising max_iter further has no effect on the SSE. With init='random', part of the difference between two SSE values can also come from the random initial centroids rather than from max_iter itself.
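
Whether max_iter matters for a given fit can be checked from the fitted model's `n_iter_` attribute, which reports how many iterations the best run actually used. The sketch below is only an illustration on synthetic data from `make_blobs` (an assumption, not the stock data). It fixes `random_state` and uses a single initialization so that max_iter is the only thing that differs between the two fits.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: not the stock data)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# Same seed and a single initialization, so only max_iter differs
km_300 = KMeans(n_clusters=8, init='random', n_init=1,
                max_iter=300, random_state=0).fit(X_demo)
km_10000 = KMeans(n_clusters=8, init='random', n_init=1,
                  max_iter=10000, random_state=0).fit(X_demo)

# If n_iter_ is below the cap, the run converged and a larger cap changes nothing
print("iterations used:", km_300.n_iter_, km_10000.n_iter_)
print("same SSE:", np.isclose(km_300.inertia_, km_10000.inertia_))
```

If `n_iter_` equals max_iter instead, the run was cut off and a larger cap may still lower the SSE.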

When n_init='auto', the number of runs depends on the value of init:<br>
10 if using init='random' or init is a callable;<br>
1 if using init='k-means++' or init is an array-like

In [9]:
kmeans_init_100 = KMeans(
    init='random', 
    n_init=100,
).fit(data)

In [10]:
sse_init_100 = kmeans_init_100.inertia_
print("SSE (n_init=100):", sse_init_100)

SSE (n_init=100): 1600.1838904567821


Number of Initializations (n_init):<br><br>
a) Expected impact: The n_init parameter controls the number of times the k-means algorithm is run with different centroid seeds. Each run starts from a different set of initial centroids, and the run with the lowest SSE is kept. Increasing n_init therefore increases the chance of finding a better clustering and a lower SSE.<br><br>
b) Impact on results: Increasing n_init generally improves the results, since the reported clustering is the best of more runs. It is not guaranteed for any particular pair of fits, because the initial centroids are random. The improvement also diminishes as n_init grows, while the running time increases roughly linearly with n_init.
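
The effect of restarts can be sketched by hand: run k-means once per seed and track the best SSE seen so far. This only mimics what n_init does and is not how scikit-learn seeds its restarts internally. The synthetic `make_blobs` data and the choice of 20 seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: not the stock data)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# One single-initialization run per seed
single_run_sse = np.array([
    KMeans(n_clusters=8, init='random', n_init=1, random_state=seed)
    .fit(X_demo).inertia_
    for seed in range(20)
])

# Best SSE seen after 1, 2, ..., 20 restarts (running minimum)
best_so_far = np.minimum.accumulate(single_run_sse)
print("single-run SSE range:", single_run_sse.min(), single_run_sse.max())
print("best after 1, 5, 20 restarts:",
      best_so_far[0], best_so_far[4], best_so_far[19])
```

The running minimum can only stay flat or decrease, which is why more restarts help on average but with diminishing returns once a good solution has been found.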


# Question 3

Then look at the clustering you obtained and try to label each cluster with a topic. For example: cluster of technology stocks, oil stocks, etc. Don’t expect your clustering to be perfect. In particular, you might have different kinds of stocks in a given cluster, while you might not be able to label all clusters. We expect that you should be able to label at least three clusters with a topic. It is fine to describe a cluster as a technology cluster if most of the stocks deal with technology, for example. Explain your answers.

In [11]:
kmeans_improved = KMeans(
    init='random', 
    n_init=100,
    max_iter=10000
).fit(data)

In [12]:
clusters = kmeans_improved.labels_
stockNames = data.index

In [13]:
table_data = {'Clusters': clusters, 'Stock Names': stockNames}
table_df = pd.DataFrame(table_data)

In [14]:
# Group the data by Clusters
grouped_data = table_df.groupby('Clusters')['Stock Names'].apply(list)

# Create the final table
display_table = pd.DataFrame({'Stock Names': grouped_data.values})

# Setting the maximum column width to 0 shows long cell contents in full instead of truncating them
pd.set_option('display.max_colwidth', 0)

In [15]:
print(display_table)

                                                                                                                               Stock Names
0  [Cisco Systems]                                                                                                                        
1  [Chevron, ExxonMobil, JPMorgan Chase]                                                                                                  
2  [Kraft, Procter & Gamble, AT&T, McDonalds, Coca-Cola]                                                                                  
3  [Hewlett-Packard]                                                                                                                      
4  [Intel, Merck, Johnson & Johnson]                                                                                                      
5  [American Express, Boeing, Microsoft, IBM, The Home Depot, Walt Disney, Wal-Mart, General Electric, United Technologies, Travelers, 3M]
6  [Bank of America]       

Cluster 1: Oil Stocks <br>
Both Chevron and ExxonMobil are in the integrated oil industry; JPMorgan Chase, a bank, is the exception in this cluster.<br><br>
Cluster 2: Consumers <br>
Kraft, Procter & Gamble, AT&T, McDonalds and Coca-Cola all sell consumer products or services.<br><br>
Cluster 4: Technology <br>
Intel is a technology company. Merck and Johnson & Johnson are related to health technology.