## Part 1 : Introduction to Clustering

### Video 1 : Unsupervised Learning : basics

<b> What is Unsupervised Learning? </b>

- A group of machine learning algorithms that finds patterns in data
- Data for algorithms has not been labeled, classified, or characterized
- The objective of the algorithm is to interpret any structure in the data
- Common unsupervised learning algorithms : clustering, neural networks, and anomaly detection

<b>What is Clustering?</b>

- The process of grouping items with similar characteristics
- Items in a group are more similar to each other than to items in other groups

<img src = "https://miro.medium.com/proxy/0*G7LC_oXt4mNzavMe.jpg">

#### Practice 1 : Pokémon sightings


In [None]:
# Import the pyplot module from the matplotlib library
from matplotlib import pyplot as plt

# Create a scatter plot (x and y are assumed to be predefined lists of coordinates)
plt.scatter(x, y)

# Display the scatter plot
plt.show()

### Video 2 : Basics of cluster analysis

<b>What is a Cluster?</b>

- A group of items with similar characteristics
- Google News : articles where similar words and word associations appear together
- Customer segments

<b>Clustering Algorithms: </b>
- Hierarchical Clustering
- K-means clustering
- Other Clustering algorithms : DBSCAN, Gaussian Methods

In [None]:
# Hierarchical clustering in scipy
from scipy.cluster.hierarchy import linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# x_coordinates and y_coordinates are assumed to be predefined lists of numbers
df = pd.DataFrame({'x_coordinate' : x_coordinates,
                   'y_coordinate' : y_coordinates})

# Build the linkage matrix, then cut it into 3 clusters
Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion = 'maxclust')

sns.scatterplot(x = 'x_coordinate', y = 'y_coordinate', hue = 'cluster_labels', data = df)
plt.show()

In [None]:
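# K-means clustering in scipy
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from numpy import random

# kmeans picks its starting centers at random, so seed NumPy's generator
random.seed([1000, 2000])

# x_coordinates and y_coordinates are assumed to be predefined lists of numbers
df = pd.DataFrame({'x_coordinate' : x_coordinates,
                   'y_coordinate' : y_coordinates})

# Compute 3 cluster centers, then assign each point to its nearest center
centroids,_ = kmeans(df, 3)
df['cluster_labels'],_ = vq(df, centroids)

sns.scatterplot(x = 'x_coordinate', y = 'y_coordinate', hue = 'cluster_labels', data = df)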
plt.show()

#### Practice 1 : Pokémon sightings: hierarchical clustering


In [None]:
# Import linkage and fcluster functions
from scipy.cluster.hierarchy import linkage, fcluster

# Use the linkage() function to compute distance
Z = linkage(df, 'ward')

# Generate cluster labels
df['cluster_labels'] = fcluster(Z, 2, criterion='maxclust')

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()

#### Practice 2 : Pokémon sightings: k-means clustering


In [None]:
# Import kmeans and vq functions
from scipy.cluster.vq import kmeans, vq
# Compute cluster centers
centroids,_ = kmeans(df, 2)

# Assign cluster labels
df['cluster_labels'], _ = vq(df, centroids)

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()

### Video 3 : Data preparation for cluster analysis


- Variables have incomparable units
- Variables with the same units can have vastly different scales and variances
- Data in raw form may lead to bias in clustering
- Clustering may be heavily dependent on one variable
- Solution : normalization of individual variables

<b>Normalization</b>
- Process of rescaling data to a standard deviation of 1 (each value is divided by the standard deviation of its variable)
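
As a minimal sketch of what this rescaling does, the snippet below compares `whiten` with a manual division by each column's standard deviation. The array `data` is made up for illustration and is not part of the course data.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical data: two variables on very different scales
data = np.array([[1.0, 1000.0],
                 [2.0, 3000.0],
                 [4.0, 2000.0],
                 [5.0, 6000.0]])

# whiten() divides each column by its standard deviation
scaled = whiten(data)

# The same rescaling done by hand (population standard deviation, ddof=0)
manual = data / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))
```

Note that `whiten` does not subtract the mean, so only the spread of each variable changes, not its sign or ordering.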

In [None]:
from scipy.cluster.vq import whiten
scaled_data = whiten(data)
print(scaled_data)

#### Practice 1 : Normalize basic list data


In [None]:
# Import the whiten function
from scipy.cluster.vq import whiten
goals_for = [4,3,2,3,1,1,2,0,1,4]

# Use the whiten() function to standardize the data
scaled_data = whiten(goals_for)
print(scaled_data)

#### Practice 2 : Visualize normalized data


In [None]:
# Plot original data
plt.plot(goals_for, label='original')

# Plot scaled data
plt.plot(scaled_data, label='scaled')

# Show the legend in the plot
plt.legend()

# Display the plot
plt.show()

#### Practice 3 : Normalization of small numbers


In [None]:
# Prepare data
rate_cuts = [0.0025, 0.001, -0.0005, -0.001, -0.0005, 0.0025, -0.001, -0.0015, -0.001, 0.0005]

# Use the whiten() function to standardize the data
scaled_data = whiten(rate_cuts)

# Plot original data
plt.plot(rate_cuts, label='original')

# Plot scaled data
plt.plot(scaled_data, label='scaled')

plt.legend()
plt.show()

#### Practice 4 : FIFA 18: Normalize data


In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x='scaled_wage', y='scaled_value', kind = 'scatter')
plt.show()

# Check mean and standard deviation of scaled values
print(fifa[['scaled_wage', 'scaled_value']].describe())

## Part 2 : Hierarchical Clustering


### Video 1 : Basics of Hierarchical Clustering

In [None]:
#creating a distance matrix using linkage

scipy.cluster.hierarchy.linkage(observations,
                                method = 'single',
                                metric = 'euclidean',
                                optimal_ordering = False)


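<b>Which method should you use?</b>

- single : based on the two closest objects
- complete : based on the two farthest objects
- average : based on the arithmetic mean of the distances between all pairs of objects
- centroid : based on the distance between the cluster centroids
- median : like centroid, but a merged cluster's centroid is the midpoint of the two centroids it was formed from
- ward : based on the increase in the within-cluster sum of squares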
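
To get a feel for how the methods differ, the sketch below runs `linkage` with each method on a small synthetic dataset and prints the height of the final merge. The two blobs in `points` are an assumption made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: two well-separated blobs of 10 points each
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
                    rng.normal(5.0, 0.5, size=(10, 2))])

# Height of the final merge under each linkage method
final_heights = {}
for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(points, method=method, metric='euclidean')
    # Z has one row per merge: (cluster a, cluster b, distance, new cluster size)
    final_heights[method] = Z[-1, 2]
    print(method, Z.shape, round(final_heights[method], 2))
```

On data like this, `single` should report the smallest final height (the closest pair across the two blobs) and `complete` a larger one (the farthest pair), which is one way to see what each method measures.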

In [None]:
#create cluster labels with fcluster

scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion)
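
The `criterion` argument controls how the second argument is interpreted. The sketch below, on a tiny made-up dataset, contrasts `'maxclust'` (the criterion used throughout these notes) with `'distance'`, which cuts the tree at a given height. The cut height of 1.0 is chosen by hand for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two tight groups of three points each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

Z = linkage(points, method='ward')

# 'maxclust': ask for at most 2 flat clusters
labels_maxclust = fcluster(Z, 2, criterion='maxclust')

# 'distance': keep merges whose height is at most 1.0, cut everything above
labels_distance = fcluster(Z, 1.0, criterion='distance')

print(labels_maxclust)
print(labels_distance)
```

Both calls should split the points into the same two groups here. Labels returned by `fcluster` start at 1, not 0.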

#### Practice 1 : Hierarchical clustering: ward method


In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import linkage, fcluster
# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'ward', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

#### Practice 2 : Hierarchical clustering: single method


In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import linkage, fcluster
# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'single', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

#### Practice 3 : Hierarchical clustering: complete method


In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import linkage, fcluster
# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'complete', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

### Video 2 : Visualize clusters


<b>Why visualize clusters?</b>

- Try to make sense of the clusters formed
- An additional step in the validation of clusters
- Spot trends in data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x = 'x', y = 'y', hue = 'labels', data = df)
plt.show()

#### Practice 1 : Visualize clusters with matplotlib

In [None]:
# Import the pyplot module
import matplotlib.pyplot as plt

# Define a colors dictionary for clusters
colors = {1:'red', 2:'blue'}

# Plot a scatter plot, mapping each cluster label to its color
comic_con.plot.scatter(x='x_scaled',
                       y='y_scaled',
                       c=comic_con['cluster_labels'].apply(lambda x: colors[x]))
plt.show()

#### Practice 2 : Visualize clusters with seaborn

In [None]:
# Import the seaborn module
import seaborn as sns

# Plot a scatter plot using seaborn
sns.scatterplot(x= 'x_scaled', 
                y= 'y_scaled', 
                hue= 'cluster_labels', 
                data = comic_con)
plt.show()

### Video 3 : How many clusters?


A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.

In [None]:
from scipy.cluster.hierarchy import dendrogram

Z = linkage(df[['x_whiten', 'y_whiten']],
            method = 'ward',
            metric = 'euclidean')
dn = dendrogram(Z)
plt.show()
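
A dendrogram is usually read by eye, but the same information sits in the linkage matrix. As a rough sketch (a heuristic added here for illustration, not a rule from the course), the snippet below looks for the largest jump between consecutive merge heights and reports how many clusters remain if the tree is cut inside that jump. The two blobs in `points` are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(10.0, 0.5, size=(20, 2))])

Z = linkage(points, method='ward', metric='euclidean')

# no_plot=True returns the dendrogram layout without drawing it
dn = dendrogram(Z, no_plot=True)

# Column 2 of Z holds the height of each merge, in merge order
heights = Z[:, 2]
gaps = np.diff(heights)

# Cutting inside the largest gap leaves this many clusters
n_points = points.shape[0]
suggested_k = n_points - (int(np.argmax(gaps)) + 1)
print(suggested_k)
```

For these two blobs the largest gap sits just below the final merge, so the heuristic suggests two clusters. On real data the gaps are rarely this clear, and the plot remains the better guide.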

#### Practice 1 : Create a dendrogram


In [None]:
# Import the dendrogram function
from scipy.cluster.hierarchy import dendrogram

# Create a dendrogram
dn = dendrogram(linkage(comic_con[['x_scaled', 'y_scaled']],
            method = 'ward',
            metric = 'euclidean'))

# Display the dendrogram
plt.show()

### Video 4 : Limitations of hierarchical clustering


#### Practice 1 : FIFA 18: exploring defenders


In [None]:
# Fit the data into a hierarchical clustering algorithm
distance_matrix = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 'ward')

# Assign cluster labels to each row of data
fifa['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')

# Display cluster centers of each cluster
print(fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean())

# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_sliding_tackle', y='scaled_aggression', hue='cluster_labels', data=fifa)
plt.show()

## Part 3 : K-Means Clustering

### Video 1 : Basics of K-means clustering

<b>Step 1 : Generate cluster centers</b>

In [None]:
kmeans(obs, k_or_guess, iter, thresh, check_finite)

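- obs : standardized observations
- k_or_guess : number of clusters
- iter : number of times to run the algorithm (default: 20)
- thresh : threshold on the change in distortion that stops the algorithm (default: 1e-05)
- check_finite : whether to check if observations contain only finite numbers (default: True)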

<b>Step 2 : Generate cluster labels</b>

In [None]:
vq(obs, code_book, check_finite = True)
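
To make the two steps concrete, here is a small self-contained sketch showing what each function returns. The synthetic blobs in `obs` are an assumption for illustration, and the variable names simply mirror the ones used in these notes.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical data: two well-separated blobs of 25 points each
rng = np.random.default_rng(0)
obs = whiten(np.vstack([rng.normal(0.0, 0.5, size=(25, 2)),
                        rng.normal(8.0, 0.5, size=(25, 2))]))

# Step 1: kmeans returns the cluster centers (code book) and the distortion
cluster_centers, distortion = kmeans(obs, 2)

# Step 2: vq returns, for each observation, the index of its nearest center
# and the distance to that center
cluster_labels, distances = vq(obs, cluster_centers)

print(cluster_centers.shape)
print(np.unique(cluster_labels))
```

Unlike `fcluster`, the labels from `vq` start at 0, because they are row indices into `cluster_centers`.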

In [None]:
#running kmeans
from scipy.cluster.vq import kmeans, vq

# Use the same scaled columns for both the centers and the labels
cluster_centers, _ = kmeans(df[['x_scaled', 'y_scaled']], 3)
df['cluster_labels'],_ = vq(df[['x_scaled', 'y_scaled']], cluster_centers)

sns.scatterplot(x= 'x_scaled', 
                y= 'y_scaled', 
                hue= 'cluster_labels', 
                data = df)
plt.show()

#### Practice 1 : K-means clustering: first exercise


In [None]:
# Import the kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Generate cluster centers
cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)

# Assign cluster labels
comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

### Video 2 : How many clusters?


In [None]:
#elbow method

# Declaring variables for use
distortions = []
num_clusters = range(2, 7)

# Populating distortions for various clusters
for i in num_clusters:
    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)
    distortions.append(distortion)
    
# Plotting elbow plot data
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,'distortions': distortions})
sns.lineplot(x='num_clusters', y='distortions',
             data = elbow_plot_data)
plt.show()


#### Practice 1 : Elbow method on distinct clusters


In [None]:
distortions = []
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
    distortions.append(distortion)

# Create a DataFrame with two lists - num_clusters, distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x= 'num_clusters', y= 'distortions', data = elbow_plot)
plt.xticks(num_clusters)
plt.show()

#### Practice 2 : Elbow method on uniform data


In [None]:
distortions = []
num_clusters = range(2, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(uniform_data[['x_scaled', 'y_scaled']], i)
    distortions.append(distortion)

# Create a DataFrame with two lists - number of clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x= 'num_clusters', y= 'distortions', data = elbow_plot)
plt.xticks(num_clusters)
plt.show()

### Video 3 : Limitations of k-means clustering


- How to find the right K
- Impact of seeds
- Biased towards equal-sized clusters
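
The impact of seeds can be sketched on data with no real cluster structure, where the starting centers matter most. The example below assumes that `kmeans`, when called without an explicit seed argument, draws from NumPy's global random state (the same assumption the practice cells in these notes rely on). The uniform data in `obs` is made up.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Hypothetical data: points spread uniformly, with no real cluster structure
data_rng = np.random.default_rng(0)
obs = data_rng.uniform(0.0, 1.0, size=(200, 2))

# Assumption: with no seed argument, kmeans uses NumPy's global random state,
# so fixing the seed before each call should make the run repeatable
np.random.seed(1000)
centers_a, distortion_a = kmeans(obs, 3)

np.random.seed(1000)
centers_b, distortion_b = kmeans(obs, 3)

# A different seed may settle on different centers for data like this
np.random.seed(2000)
centers_c, distortion_c = kmeans(obs, 3)

print(np.allclose(centers_a, centers_b))
print(distortion_a, distortion_c)
```

If the two seeds give noticeably different centers or distortions, that is a hint the data has no clear cluster structure for the chosen K.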

#### Practice 1 : Impact of seeds on distinct clusters


In [None]:
# Import the random module from numpy
from numpy import random

# Initialize seed
random.seed([1,2,1000])

# Run kmeans clustering
cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)

# Plot the scatterplot
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

#### Practice 2 : Uniform clustering patterns


In [None]:
# Import the kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Generate cluster centers
cluster_centers, distortion = kmeans(mouse[['x_scaled', 'y_scaled']], 3)

# Assign cluster labels
mouse['cluster_labels'], distortion_list = vq(mouse[['x_scaled', 'y_scaled']], cluster_centers)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = mouse)
plt.show()

#### Practice 3 : FIFA 18: defenders revisited


In [None]:
# Set up a random seed in numpy
random.seed([1000,2000])

# Fit the data into a k-means algorithm
cluster_centers,_ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)

# Assign cluster labels
fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)

# Display cluster centers 
print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())

# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_def', y='scaled_phy', hue= 'cluster_labels', data=fifa)
plt.show()