<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Machine Learning - Clustering</font></h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<h2> Table of Contents</h2>  
<font size = 3>
1. <a href="#item1">Manual Simulation of *k*-means on 2-D Data</a>  
2. <a href="#item2">Cluster Real Data Using *k*-means</a>  
2.1 <a href="#item3">Download and Explore Customer Data</a>     
2.2 <a href="#item4">Preprocess Data</a>   
2.3 <a href="#item5">Cluster Data Using *k*-means</a>  
2.4 <a href="#item6">Get Most Popular Products in Each Cluster</a>      
</font>
<br>
<p></p>
</div>

<a id='item1'></a>

# 1. Manual Simulation of *k*-means on 2-D Data

#### Import libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

print('Libraries imported!')

#### 30 data points belonging to 2 different clusters (x1 is the first feature and x2 is the second feature)

In [None]:
# data
x1 = [-4.9, -3.5, 0, -4.5, -3, -1, -1.2, -4.5, -1.5, -4.5, -1, -2, -2.5, -2, -1.5, 4, 1.8, 2, 2.5, 3, 4, 2.25, 1, 0, 1, 2.5, 5, 2.8, 2, 2]
x2 = [-3.5, -4, -3.5, -3, -2.9, -3, -2.6, -2.1, 0, -0.5, -0.8, -0.8, -1.5, -1.75, -1.75, 0, 0.8, 0.9, 1, 1, 1, 1.75, 2, 2.5, 2.5, 2.5, 2.5, 3, 6, 6.5]

print('Datapoints defined!')

#### Define a function that carries out the cluster assignment step of each iteration

In [None]:
colors_map = np.array(['b', 'r'])
def assign_members(x1, x2, centers):
    # Euclidean distance from every point to each of the two centers
    compare_to_first_center = np.sqrt(np.square(np.array(x1) - centers[0][0]) + np.square(np.array(x2) - centers[0][1]))
    compare_to_second_center = np.sqrt(np.square(np.array(x1) - centers[1][0]) + np.square(np.array(x2) - centers[1][1]))
    # True means the point is closer to the second center, False to the first
    class_of_points = compare_to_first_center > compare_to_second_center
    # cast the boolean mask to 0/1 so it can index into colors_map
    colors = colors_map[class_of_points.astype(int)]
    return colors, class_of_points

print('assign_members function defined!')
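
The assignment step above is written for exactly two centers. As a minimal sketch, assuming the points are stacked into an `(n, 2)` array, the same step can be expressed for any number of centers with NumPy broadcasting. The name `assign_to_nearest` and the sample points are hypothetical and used only for illustration.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Return, for each point, the index of its closest center (Euclidean distance)."""
    points = np.asarray(points, dtype=float)    # shape (n, d)
    centers = np.asarray(centers, dtype=float)  # shape (k, d)
    # broadcasting yields an (n, k) matrix of point-to-center distances
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

# hypothetical sample: two points near each of the two initial centers
sample_points = [[-3.0, 1.5], [-1.0, 2.5], [2.5, -1.0], [1.0, -3.0]]
sample_labels = assign_to_nearest(sample_points, [[-2, 2], [2, -2]])
print(sample_labels)
```

An integer label per point, rather than a boolean mask, is what lets this version extend beyond two clusters.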

#### Define a function that carries out the centroid move step

In [None]:
# update means
def update_centers(x1, x2, class_of_points):
    center1 = [np.mean(np.array(x1)[~class_of_points]), np.mean(np.array(x2)[~class_of_points])]
    center2 = [np.mean(np.array(x1)[class_of_points]), np.mean(np.array(x2)[class_of_points])]
    return [center1, center2]

print('update_centers function defined!')
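
One case the two-center update above does not handle is a cluster that ends up with no members: `np.mean` of an empty selection returns `NaN`. The following sketch, under the assumption that labels are integers from `0` to `k - 1`, leaves such a center where it is. The name `move_centers` and the sample data are hypothetical.

```python
import numpy as np

def move_centers(points, labels, centers):
    """Move each center to the mean of its members; keep it in place if it has none."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    new_centers = np.array(centers, dtype=float)  # copy so the input is not modified
    for k in range(len(new_centers)):
        members = points[labels == k]
        if len(members) > 0:  # the mean of an empty selection would be NaN
            new_centers[k] = members.mean(axis=0)
    return new_centers

# hypothetical sample: two members per cluster
sample_points = [[-3.0, 1.0], [-1.0, 3.0], [2.0, -1.0], [4.0, -3.0]]
moved = move_centers(sample_points, [0, 0, 1, 1], [[-2, 2], [2, -2]])
print(moved)
```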

#### Define a function that plots the data points along with the cluster centroids

In [None]:
def plot_points(centroids=None, colors="g", figure_title=None):
    # plot the figure
    fig = plt.figure(figsize=(15, 10))  # create a figure object
    ax = fig.add_subplot(1, 1, 1)
    
    centroid_colors = ["bx", "rx"]
    if centroids:
        for (i, centroid) in enumerate(centroids):
            ax.plot(centroid[0], centroid[1], centroid_colors[i], markeredgewidth=5, markersize=20)
    plt.scatter(x1, x2, s=500, c=colors)
    
    # define the ticks
    xticks = np.linspace(-6, 8, 15, endpoint=True)
    yticks = np.linspace(-6, 6, 13, endpoint=True)

    # set the tick positions on both axes
    ax.set_xticks(xticks)
    ax.set_yticks(yticks)

    # add tick labels
    xlabels = xticks
    ax.set_xticklabels(xlabels)
    ylabels = yticks
    ax.set_yticklabels(ylabels)

    # style the ticks
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    ax.tick_params('both', length=2, width=1, which='major', labelsize=15)
    
    # add labels to axes
    ax.set_xlabel('x1', fontsize=20)
    ax.set_ylabel('x2', fontsize=20)
    
    # add title to figure
    ax.set_title(figure_title, fontsize=24)

    plt.show()

print('plot_points function defined!')

#### Initialize K-means - plot data points

In [None]:
plot_points(figure_title='Scatter Plot of x2 vs x1')

#### Initialize K-means - define the initial cluster centers and add them to the plot

In [None]:
centers = [[-2, 2], [2, -2]]
plot_points(centers, figure_title='k-means Initialization')

#### Run K-means (4 iterations only)

In [None]:
number_of_iterations = 4
for i in range(number_of_iterations):
    input("Iteration {} - Press Enter to update the members of each cluster".format(i+1))
    colors, class_of_points = assign_members(x1, x2, centers)
    title = "Iteration {} - Cluster Assignment".format(i+1)
    plot_points(centers, colors, figure_title=title)
    input("Iteration {} - Press Enter to update the centers".format(i+1))
    centers = update_centers(x1, x2, class_of_points)
    title = "Iteration {} - Centroid Move".format(i+1)
    plot_points(centers, colors, figure_title=title)
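
The loop above always runs four iterations. In practice *k*-means is usually stopped once the cluster assignments no longer change. The following is a minimal sketch of that stopping rule, without plotting; the function name `run_until_stable`, the `max_iterations` cap, and the sample data are assumptions made for illustration.

```python
import numpy as np

def run_until_stable(points, centers, max_iterations=100):
    """Alternate assignment and centroid-move steps until the labels stop changing."""
    points = np.asarray(points, dtype=float)
    centers = np.array(centers, dtype=float)
    labels = None
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # assignment step: index of the nearest center for every point
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the centers would not move either
        labels = new_labels
        # centroid-move step: mean of the members of each non-empty cluster
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return centers, labels, iteration

# hypothetical sample: three points on the left, three on the right
sample_points = [[-3, 1], [-2, 3], [-4, 2], [3, -1], [2, -3], [4, -2]]
final_centers, final_labels, n_iterations = run_until_stable(sample_points, [[-1, 0], [1, 0]])
print(final_centers, final_labels, n_iterations)
```

The result still depends on the initial centers: a poor starting position can make the procedure settle on a worse partition.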

#### Power of Scikit-learn

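In [None]:
from sklearn.cluster import KMeans

# stack the two features into an (n_samples, 2) array, the shape scikit-learn expects
X = np.column_stack((x1, x2))

kmeans = KMeans(n_clusters=2, max_iter=300, random_state=0).fit(X)
kmeans.labels_

In [None]:
kmeans.cluster_centers_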
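
Besides `labels_` and `cluster_centers_`, a fitted `KMeans` object exposes `inertia_`, the sum of squared distances from each sample to its assigned center. This is the quantity *k*-means tries to minimize. The sketch below recomputes it by hand on hypothetical toy data (`toy_X` is not part of this lab) to show what the attribute measures; `n_init=10` is set explicitly as an assumption so the behaviour does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(-3, 1, size=(20, 2)), rng.normal(3, 1, size=(20, 2))])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)

# look up the center assigned to every sample, then sum the squared distances
assigned_centers = toy_kmeans.cluster_centers_[toy_kmeans.labels_]
manual_inertia = np.sum((toy_X - assigned_centers) ** 2)

print(manual_inertia, toy_kmeans.inertia_)
```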

<a id='item2'></a>

# 2. Cluster Real Data Using *k*-means

### Import necessary Libraries

In [None]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np # library to handle data in a vectorized manner
import scipy # scientific computing library
    
# import k-means from scikit-learn
from sklearn.cluster import KMeans

<a id='item3'></a>

## 2.1. Download and Explore Dataset

#### Download data from IBM server

In [None]:
# download data from IBM server
!wget -O customers_data.csv --quiet https://ibm.box.com/shared/static/albv617gy292tmqljaacbqm0mz62tiz2.csv
    
print('Data Downloaded!')

#### Read data into a *pandas* dataframe

In [None]:
customers_data = pd.read_csv('customers_data.csv')

# print the size of the dataset
print(customers_data.shape)

# display first five rows of dataframe
customers_data.head()

#### How many customers are in the dataset?

In [None]:
num_customers = len(customers_data['CUST_ID'].unique())
print('There are {} customers'.format(num_customers))

#### Let's fix the gender code in the dataframe

Let's first check what values exist in the `GenderCode` column

In [None]:
customers_data['GenderCode'].unique()

So let's replace `Mr.` and `Master.` with `Male`, and `Mrs.` and `Miss.` with `Female`

In [None]:
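# assign through .loc on the dataframe itself so the change is not made on a temporary copy
is_male = customers_data['GenderCode'].isin(['Mr.', 'Master.'])
is_female = customers_data['GenderCode'].isin(['Mrs.', 'Miss.'])
customers_data.loc[is_male, 'GenderCode'] = 'Male'
customers_data.loc[is_female, 'GenderCode'] = 'Female'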

Let's take a look at the updated dataframe to make sure that everything looks fine

In [None]:
customers_data.head()

#### It looks like customer transactions were labelled as low, medium, or high in value. Let's explore that a little

Let's make sure that no other values exist in the `ORDER_TYPE` column

In [None]:
customers_data['ORDER_TYPE'].unique()

#### Let's explore the transactions labelled as `LowValue`

First, let's get all the data instances pertaining to this value

In [None]:
low_value = customers_data.loc[customers_data['ORDER_TYPE'] == 'LowValue']
low_value.head()

Next, let's see how many instances exist in the dataframe and get the minimum and the maximum values in this bucket

In [None]:
print(low_value.shape)
print(min(low_value['ORDER_VALUE']), max(low_value['ORDER_VALUE']))

#### Let's do the same thing to `MediumValue`

Get all the data instances

In [None]:
medium_value = customers_data.loc[customers_data['ORDER_TYPE'] == 'MediumValue']
medium_value.head()

Print how many instances exist in the dataframe and get the minimum and the maximum values in this bucket

In [None]:
print(medium_value.shape)
print(min(medium_value['ORDER_VALUE']), max(medium_value['ORDER_VALUE']))

#### And let's do that again for `HighValue`

In [None]:
high_value = customers_data.loc[customers_data['ORDER_TYPE'] == 'HighValue']
high_value.head()

Print how many instances exist in the dataframe and get the minimum and the maximum values in this bucket

In [None]:
print(high_value.shape)
print(min(high_value['ORDER_VALUE']), max(high_value['ORDER_VALUE']))
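
The three cells above repeat the same filter-and-summarize steps for each order type. As a sketch of a more compact alternative, a single `groupby` can produce the count, minimum, and maximum for every bucket at once. The dataframe `toy_orders` below is a hypothetical stand-in that only reuses the `ORDER_TYPE` and `ORDER_VALUE` column names from the lab; its values are made up.

```python
import pandas as pd

# hypothetical stand-in for customers_data with made-up order values
toy_orders = pd.DataFrame({
    'ORDER_TYPE': ['LowValue', 'LowValue', 'MediumValue', 'HighValue', 'HighValue'],
    'ORDER_VALUE': [5.0, 12.0, 40.0, 150.0, 90.0],
})

# one row per order type, one column per statistic
order_summary = toy_orders.groupby('ORDER_TYPE')['ORDER_VALUE'].agg(['count', 'min', 'max'])
print(order_summary)
```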

#### Let's save the product names in a variable called `products`

In [None]:
products = customers_data.columns.values[7:customers_data.shape[1]]
products

#### Let's see if there is a gender pattern in those whose transactions were labelled `LowValue`

In [None]:
customers = low_value['CUST_ID'].unique() # get a list of the customers

# define lists to save the minimum, average, and maximum transaction values for each male customer
male_min = []
male_avg = []
male_max = []

# define lists to save the minimum, average, and maximum transaction values for each female customer
female_min = []
female_avg = []
female_max = []

# loop through customers and compute the statistics
for customer in customers:
    customer_data = low_value[low_value['CUST_ID'] == customer]
    customer_data.reset_index(inplace=True, drop=True)
    if customer_data.loc[0, 'GenderCode'] == 'Male':
        male_min.append(min(customer_data['ORDER_VALUE']))
        male_avg.append(np.mean(customer_data['ORDER_VALUE']))
        male_max.append(max(customer_data['ORDER_VALUE']))
        
    else:
        female_min.append(min(customer_data['ORDER_VALUE']))
        female_avg.append(np.mean(customer_data['ORDER_VALUE']))
        female_max.append(max(customer_data['ORDER_VALUE']))
        
print('Male and female lists populated!')

Print statistics for male customers

In [None]:
print('Males:\n', 'Min: {}\n'.format(np.min(male_min)), 'Avg: {}\n'.format(np.mean(male_avg)), 'Max: {}\n'.format(np.max(male_max)))

Print statistics for female customers

In [None]:
print('Females:\n', 'Min: {}\n'.format(np.min(female_min)), 'Avg: {}\n'.format(np.mean(female_avg)), 'Max: {}\n'.format(np.max(female_max)))

#### Let's define a function that returns the summary of customers so we can repeat the same analysis for `MediumValue` and `HighValue`

In [None]:
def return_summary_customers(dataframe):
    customers = dataframe['CUST_ID'].unique()

    male_min = []
    male_avg = []
    male_max = []

    female_min = []
    female_avg = []
    female_max = []

    for customer in customers:
        customer_data = dataframe.loc[dataframe['CUST_ID'] == customer]
        customer_data.reset_index(inplace=True, drop=True)
        if customer_data.loc[0, 'GenderCode'] == 'Male':
            male_min.append(min(customer_data['ORDER_VALUE']))
            male_avg.append(np.mean(customer_data['ORDER_VALUE']))
            male_max.append(max(customer_data['ORDER_VALUE']))
        
        else:
            female_min.append(min(customer_data['ORDER_VALUE']))
            female_avg.append(np.mean(customer_data['ORDER_VALUE']))
            female_max.append(max(customer_data['ORDER_VALUE']))
    
    print('Males:\n', 'Min: {}\n'.format(np.min(male_min)), 'Avg: {}\n'.format(np.mean(male_avg)), 'Max: {}\n'.format(np.max(male_max)))
    print('Females:\n', 'Min: {}\n'.format(np.min(female_min)), 'Avg: {}\n'.format(np.mean(female_avg)), 'Max: {}\n'.format(np.max(female_max)))

print('return_summary_customers function defined!')

In [None]:
return_summary_customers(medium_value)

In [None]:
return_summary_customers(high_value)
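
The per-customer loop in `return_summary_customers` can also be written as two chained `groupby` steps. The sketch below assumes, as the loop does, that each customer has a single gender code. The dataframe `toy_orders` and its values are hypothetical; only the column names come from the lab.

```python
import pandas as pd

# hypothetical stand-in: one male customer and two female customers
toy_orders = pd.DataFrame({
    'CUST_ID':     [1, 1, 2, 3, 3, 3],
    'GenderCode':  ['Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'ORDER_VALUE': [10.0, 30.0, 50.0, 20.0, 40.0, 60.0],
})

# step 1: minimum, average, and maximum order value for every customer
per_customer = (toy_orders
                .groupby(['GenderCode', 'CUST_ID'])['ORDER_VALUE']
                .agg(['min', 'mean', 'max']))

# step 2: smallest minimum, mean of the averages, and largest maximum per gender
per_gender = per_customer.groupby(level='GenderCode').agg(
    {'min': 'min', 'mean': 'mean', 'max': 'max'})
print(per_gender)
```

Averaging the per-customer averages, rather than all orders at once, matches what the loop computes: every customer counts equally regardless of how many orders they placed.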

**Conclusion**: The minimum, average, and maximum order values show no notable difference between male and female customers

<a id='item4'></a>

## 2.2. Preprocess Data

Let's recall what the data looks like

In [None]:
customers_data.head()

#### Let's drop all columns except the customer ID column and the product columns

In [None]:
customers_purchase_data = customers_data.iloc[:, [0] + list(range(7, customers_data.shape[1]))]
customers_purchase_data.head(10)

#### Let's group data pertaining to each customer

In [None]:
grouped_data = customers_purchase_data.groupby('CUST_ID').sum().reset_index()
grouped_data.head(10)

#### For clustering, we only need the data pertaining to what was purchased

In [None]:
purchased_data = grouped_data.iloc[:, 1:]
purchased_data.head()

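**Note**. *k*-means groups points by Euclidean distance, so a product with a larger range of purchase quantities would dominate the clustering. Normalizing each column puts all products on a comparable scale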

#### Compute the mean of each column

In [None]:
col_means = purchased_data.mean(axis=0)
col_means

#### Compute the standard deviation of each column

In [None]:
col_stds = purchased_data.std(axis=0)
col_stds

#### Normalize the data by subtracting the mean from each column and dividing by the standard deviation

In [None]:
norm_purchased_data = (purchased_data - col_means) / col_stds
norm_purchased_data.head()
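
Two details of this normalization are worth checking. The pandas `std` method uses the sample standard deviation (`ddof=1`) by default, and a column whose values are all identical has a standard deviation of zero, so dividing by it yields `NaN`. The sketch below uses a hypothetical dataframe `toy_purchases` to show one way to handle this, by dropping constant columns before scaling. Whether the lab's dataset contains such a column is not stated in the source.

```python
import numpy as np
import pandas as pd

# hypothetical purchase counts; 'unsold' is constant and cannot be scaled
toy_purchases = pd.DataFrame({
    'apples': [0.0, 2.0, 4.0, 6.0],
    'bread':  [10.0, 10.0, 30.0, 30.0],
    'unsold': [0.0, 0.0, 0.0, 0.0],
})

# keep only the columns that actually vary
col_std = toy_purchases.std(axis=0)
varying = toy_purchases.loc[:, col_std > 0]

# z-score: subtract the column mean, divide by the column standard deviation
z_scores = (varying - varying.mean(axis=0)) / varying.std(axis=0)

print(z_scores.mean(axis=0))  # close to 0 for every column
print(z_scores.std(axis=0))   # close to 1 for every column
```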

#### Do a quick sanity check

In [None]:
norm_purchased_data.shape

<a id='item5'></a>

## 2.3. Cluster Data Using *k*-means

##### Let's use the _scikit-learn_ implementation of *k*-means to break the dataset into 100 clusters

In [None]:
kmeans = KMeans(n_clusters=100, random_state=0).fit(norm_purchased_data)
print('Clustering process is complete!')
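
The lab fixes the number of clusters at 100 without saying how it was chosen. One common heuristic, shown here only as a sketch on hypothetical toy data, is to fit *k*-means for several values of *k* and look at where the inertia stops dropping sharply (the "elbow"). The data `toy_X`, the range of *k*, and `n_init=10` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: three tight, well-separated blobs of 30 points each
rng = np.random.default_rng(0)
blob_centers = [(-5, -5), (0, 5), (5, -5)]
toy_X = np.vstack([rng.normal(center, 0.5, size=(30, 2)) for center in blob_centers])

# fit k-means for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print('k = {}: inertia = {:.1f}'.format(k, inertia))
```

On data like this the inertia falls steeply up to the true number of blobs and only slowly afterwards. Real purchase data rarely shows such a clean elbow, so the heuristic is a guide rather than a rule.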

In [None]:
labels = kmeans.labels_ # the generated labels
cluster_centroids = kmeans.cluster_centers_ # the centroids of the clusters that emerged

print('First five labels {}'.format(labels[0:5]))
print('First five centroids {}'.format(cluster_centroids[0:5]))

Let's make sure that the right number of labels was returned

In [None]:
len(labels)

#### Add labels to grouped data

In [None]:
grouped_data_w_labels = grouped_data.copy() # copy so that grouped_data itself is left unchanged
grouped_data_w_labels['labels'] = labels
grouped_data_w_labels.head()

#### Let's examine how many members are in each cluster

In [None]:
for cluster_num in np.arange(100):
    cluster_data = grouped_data_w_labels[grouped_data_w_labels['labels'] == cluster_num]
    print('Cluster {}: {} data points'.format(cluster_num, cluster_data.shape[0]))
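
The same counts can be obtained without a loop. As a small sketch, `value_counts` on the label column returns the size of every cluster in one call. The array `toy_labels` below is hypothetical and stands in for `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# hypothetical cluster labels for six customers
toy_labels = np.array([2, 0, 2, 1, 2, 0])

# number of members per cluster, ordered by cluster label
cluster_sizes = pd.Series(toy_labels).value_counts().sort_index()
print(cluster_sizes)

# the two largest clusters
print(cluster_sizes.nlargest(2))
```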

#### Let's check out the cluster with the highest number of members

In [None]:
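# look up the largest cluster instead of hard-coding its label, which may change if the clustering is rerun with different settings
largest_cluster = grouped_data_w_labels['labels'].value_counts().idxmax()
cluster_data = grouped_data_w_labels[grouped_data_w_labels['labels'] == largest_cluster]
cluster_data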

<a id='item6'></a>

## 2.4. Get Most Popular Products in Each Cluster

#### Let's sort the products in each cluster in descending order of quantity purchased by customers in cluster

First, let's group the data by labels

In [None]:
grouped_purchased_data = grouped_data_w_labels.iloc[:, 1:].groupby('labels').sum().reset_index()
grouped_purchased_data.head()

Next, let's define a function that returns the products sorted in descending order

In [None]:
def return_most_popular_products(row):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values

print('return_most_popular_products function defined!')

Next, let's define the columns for our new dataframe

In [None]:
# create one column per product rank
columns = ['labels']
for ind in np.arange(len(grouped_purchased_data.columns.values[1:])):
    columns.append('Most Popular Product {}'.format(ind+1))
    
print('columns created!')

Finally, let's define the dataframe and fill in each row by applying the `return_most_popular_products` function to the `grouped_purchased_data` dataframe

In [None]:
# create a new dataframe
grouped_purchased_data_sorted = pd.DataFrame(columns=columns)
grouped_purchased_data_sorted['labels'] = grouped_purchased_data['labels']

for ind in np.arange(grouped_purchased_data.shape[0]):
    grouped_purchased_data_sorted.iloc[ind, 1:] = return_most_popular_products(grouped_purchased_data.iloc[ind, :])

grouped_purchased_data_sorted.head() # display the products for first 5 clusters
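
The table above ranks every product for every cluster. Often only the top few are of interest. The sketch below, on a hypothetical two-cluster dataframe `toy_totals`, keeps the `top_n` best sellers per cluster using `np.argsort` on all rows at once. The product names and quantities are made up.

```python
import numpy as np
import pandas as pd

# hypothetical total quantities purchased per cluster
toy_totals = pd.DataFrame(
    {'apples': [5, 0], 'bread': [1, 7], 'milk': [3, 2]},
    index=pd.Index([0, 1], name='labels'),
)

top_n = 2
# negate so that argsort orders each row from the largest quantity to the smallest
order = np.argsort(-toy_totals.to_numpy(), axis=1)[:, :top_n]

# map the column positions back to product names
top_products = pd.DataFrame(
    toy_totals.columns.to_numpy()[order],
    index=toy_totals.index,
    columns=['Most Popular Product {}'.format(i + 1) for i in range(top_n)],
)
print(top_products)
```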

### Thank you for completing this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/). I hope you found this lab interesting and educational. Feel free to contact me if you have any questions!

## Want to learn more?

You can take the free [Python for Data Science](http://cocl.us/DX0111EN_PY0101EN), [Data Analysis with Python](http://cocl.us/DX0111EN_DA0101EN), or [Data Visualization with Python](http://cocl.us/DX0111EN_DV0101EN) courses.  

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience (DSX) is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](http://cocl.us/DX0111EN_DSX).


<hr>
Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).