# Fuzzy clustering with the EM algorithm

Based on the clickstream event frequency pattern in Q2Q3_input.csv for a given lecture video, apply the EM algorithm to cluster the students into two classes with the following initial settings:

Initial centers: c1 = (1,1,1,1,1,1), c2 = (0,0,0,0,0,0)

Cluster features: frequency patterns for 6 given clickstream events: load_video, pause_video, play_video, seek_video, speed_change_video and stop_video. You can find them in Q2Q3_input.csv. You need to:

  (a). Report the updated centers and SSE for the first two iterations.
  
  (b). Report the total number of iteration steps when your algorithm terminates.
  
  (c). Report the final converged centers for each cluster.

You need to submit:
1. Your source code A3_itsc_stuid_Q2.xxx in a zip file named as A3_itsc_stuid_code.zip, and 
2. Report your result for (a)(b)(c) in A3_itsc_stuid_answer.pdf.

Notes:

1. Please use the termination condition below. The EM algorithm terminates when either of the following holds:

1) The sum of the L1-distances between each pair of old and new centers is smaller than 0.001:

$$\newcommand\norm[1]{\left\lVert#1\right\rVert}$$
$$\sum_{\text{each center}} \norm{C_{old}-C_{new}}_1 < 0.001$$

2) The iteration step is greater than the maximum iteration step of 50.
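
The two stopping rules can be sketched as a small check. This is only an illustration: the names `centers_l1_shift`, `should_stop`, `tol` and `max_steps` are hypothetical and not part of the assignment, and `step` is assumed to count the iterations completed so far.

```python
import numpy as np

def centers_l1_shift(old_centers, new_centers):
    """Sum of L1 distances between each old center and its updated counterpart."""
    old = np.asarray(old_centers, dtype=float)
    new = np.asarray(new_centers, dtype=float)
    return float(np.abs(old - new).sum())

def should_stop(old_centers, new_centers, step, tol=0.001, max_steps=50):
    """True when the centers moved less than tol in total, or step exceeds max_steps."""
    return centers_l1_shift(old_centers, new_centers) < tol or step > max_steps

# Hypothetical 2-D centers, used only to exercise the check
old = [[1.0, 1.0], [0.0, 0.0]]
new = [[0.5, 1.0], [0.0, 0.25]]
print(centers_l1_shift(old, new))      # 0.75
print(should_stop(old, new, step=3))   # False
print(should_stop(old, new, step=51))  # True
```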

In [71]:
%matplotlib inline
%pylab inline
import pandas as pd
import numpy as np
import seaborn as sns
import math

Populating the interactive namespace from numpy and matplotlib




## Read data from .csv

In [72]:
data_input = pd.read_csv('Q2Q3_input.csv')

Initialize the cluster centers and the stopping criterion

In [73]:
c1 = np.array([1,1,1,1,1,1])
c2 = np.array([0,0,0,0,0,0])
stopage_criteria = 0.001

In [74]:
#Function to calculate the Euclidean (L2) distance between two points (an object and a center, or two centers)
def dist(c2, c1):
    return np.linalg.norm(c2-c1)

In [75]:
#Function to calculate weight of object o in cluster 1
def w_1(c1, c2, o):
    return np.divide(np.square(dist(o,c2)),(np.square(dist(o,c2)) + np.square(dist(o,c1))))

In [76]:
#Function to calculate weight of object o in cluster 2
def w_2(c1, c2, o):
    return np.divide(np.square(dist(o,c1)),(np.square(dist(o,c2)) + np.square(dist(o,c1))))
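
The two weight functions can also be evaluated for all objects at once. The following is a minimal vectorized sketch under the same formula (each weight is the squared distance to the *other* center divided by the sum of both squared distances); the function name `memberships` and the three sample rows are hypothetical, and the sketch assumes no object has zero distance to both centers.

```python
import numpy as np

def memberships(X, c1, c2):
    """Return (w1, w2): membership weights of every row of X in clusters 1 and 2."""
    X = np.asarray(X, dtype=float)
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    d1 = np.sum((X - c1) ** 2, axis=1)  # squared distance of each row to c1
    d2 = np.sum((X - c2) ** 2, axis=1)  # squared distance of each row to c2
    total = d1 + d2                     # assumed non-zero for every row
    return d2 / total, d1 / total

# Hypothetical objects: one on each initial center and one exactly halfway between
X = [[1] * 6, [0] * 6, [0.5] * 6]
w1, w2 = memberships(X, c1=[1] * 6, c2=[0] * 6)
print(w1)  # [1.  0.  0.5]
print(w2)  # [0.  1.  0.5]
```

An object sitting on a center belongs fully to that cluster, and the two weights of any object sum to 1.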

In [77]:
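#Function to calculate the weighted sum of squared errors of all objects to both centers
def SSE(c1,c2,w1,w2,data_input):
    # `features` is otherwise only defined inside EM_algorithm, so derive it here to avoid a NameError
    features = data_input.columns[1:]
    dist_c1 = data_input.apply(lambda row: dist(c1,row[features]), axis=1)
    dist_c2 = data_input.apply(lambda row: dist(c2,row[features]), axis=1)
    return np.sum(np.multiply(w1,np.square(dist_c1))) + np.sum(np.multiply(w2,np.square(dist_c2)))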
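
Written out, the quantity this function returns is the following, where $w_{ij}$ denotes the membership weight of object $o_i$ in cluster $j$ and $n$ the number of objects (symbols introduced here only for illustration):

$$\mathrm{SSE} = \sum_{i=1}^{n} \left( w_{i1}\,\lVert o_i - c_1 \rVert_2^2 + w_{i2}\,\lVert o_i - c_2 \rVert_2^2 \right)$$

The weights enter linearly here, whereas the center update in `EM_algorithm` uses squared weights. If the course definition of SSE also squares the weights, `w1` and `w2` would need to be squared before the call.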

In [78]:
#Function to calculate the sum of L1 distances between each old center and its new counterpart (termination measure)
def measure_cluster(c_old, c_new):
    similarity = []
    for i in range(len(c_old)):
        similarity.append(np.abs(c_new[i]-c_old[i]))
    return np.sum(similarity)

In [79]:
#Save results to .txt file
def save_to_txt(a,b,c):
    # The context manager closes the file even if one of the writes fails
    with open("A3_dchepenko_20478954_answer.txt", "w") as text_file:
        text_file.write("Question a: \n\n")
        text_file.write("First iteration:\n Updated centers:\n c1 = {0},\n c2 = {1}\n SSE: {2}\n".format(a["1_iteration"][0][0],a["1_iteration"][0][1], a["1_iteration"][1]))
        text_file.write("Second iteration:\n Updated centers:\n c1 = {0},\n c2 = {1}\n SSE: {2}\n\n".format(a["2_iteration"][0][0],a["2_iteration"][0][1], a["2_iteration"][1]))
        text_file.write("Question b: %s \n\n" % b)
        text_file.write("Question c: \n")
        text_file.write("Final converged centers :\n c1 = {0},\n c2 = {1}".format(c[0],c[1]))

In [80]:
#Function to run the EM algorithm (fuzzy clustering into two clusters)
def EM_algorithm(data_input, c1, c2, max_epochs=50):
    # All columns except the first are treated as clustering features
    features = data_input.columns[1:]
    columns = ["c1", "c2"]
    c_new = [c1, c2]
    similarity = np.inf
    epochs = 0
    first_two_iteration_result = {}
    # Stop once the centers move less than the threshold or the iteration step exceeds max_epochs
    while similarity > stopage_criteria and epochs <= max_epochs:
        iteration = pd.DataFrame(columns=columns)
        c1 = c_new[0]
        c2 = c_new[1]
        
        # E-step: membership weight of every object in each cluster
        iteration.c1 = data_input.apply(lambda row: w_1(c1, c2, row[features]), axis=1)
        iteration.c2 = data_input.apply(lambda row: w_2(c1, c2, row[features]), axis=1)
    
        # M-step: each new center is the mean of the objects weighted by the squared memberships
        new_c1 = np.divide(np.dot(np.square(iteration.c1),data_input[features]),np.sum(np.square(iteration.c1)))
        new_c2 = np.divide(np.dot(np.square(iteration.c2),data_input[features]),np.sum(np.square(iteration.c2)))
        
        c_old = c_new
        c_new = [new_c1, new_c2]
        similarity = measure_cluster(c_old, c_new)
        epochs +=1
        # Keep the updated centers and SSE of the first two iterations for question (a)
        if epochs <= 2:
            sse = SSE(c1= c_new[0], c2= c_new[1], w1=iteration.c1, w2=iteration.c2, data_input=data_input)
            first_two_iteration_result[str(epochs)+"_iteration"] = (c_new, sse)
            
    
    return (first_two_iteration_result,epochs, c_new)
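
The center update inside the loop is a weighted mean of the objects, with the squared membership weights as weights. A minimal standalone sketch of that single step is shown here; the name `update_center` and the toy 2-D data are hypothetical and stand in for the six clickstream features.

```python
import numpy as np

def update_center(X, w):
    """Weighted mean of the rows of X, using the squared membership weights w**2."""
    X = np.asarray(X, dtype=float)
    w_sq = np.asarray(w, dtype=float) ** 2
    return w_sq @ X / w_sq.sum()

# Hypothetical data: two objects with membership weights 0.5 and 1.0 in one cluster
X = [[0.0, 0.0], [2.0, 2.0]]
print(update_center(X, [0.5, 1.0]))  # [1.6 1.6]
print(update_center(X, [1.0, 1.0]))  # [1. 1.]
```

With equal weights the update reduces to the ordinary mean; a larger weight pulls the center toward that object.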

## Clustering with EM algorithm

In [81]:
a,b,c = EM_algorithm(data_input, c1,c2)

In [83]:
save_to_txt(a,b,c)