## Part 1 : Clustering for dataset exploration

### Video 1 : Unsupervised Learning

- unsupervised learning finds patterns in data without using labels
- e.g. clustering, dimensionality reduction

<b>K-means clustering</b>
- finds clusters of samples
- number of clusters must be specified
- new samples can be assigned to existing clusters: each goes to the cluster with the nearest centroid
- K-means remembers the mean of each cluster (the "centroids")
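
The points above can be tried end to end on made-up data. This is only a minimal sketch: the three blobs, the `new_samples` values, and the `n_init` / `random_state` settings are assumptions chosen for illustration, not part of the course data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated 2D blobs of 50 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
samples = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers]
)

# The number of clusters has to be chosen up front
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# The fitted model keeps one centroid (mean) per cluster
centroids = model.cluster_centers_
print(centroids.shape)

# New samples are assigned to the cluster with the nearest centroid
new_samples = np.array([[9.5, 0.3], [0.2, 9.8]])
new_labels = model.predict(new_samples)
print(new_labels)
```

The label numbers themselves are arbitrary; what matters is that each new sample lands in the same cluster as the training points around it.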

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters = 3)
model.fit(samples)

In [None]:
labels = model.predict(samples)
print(labels)

In [None]:
# Scatter plot of two feature columns, colored by cluster label

import matplotlib.pyplot as plt

xs = samples[:, 0]
ys = samples[:, 2]
plt.scatter(xs, ys, c = labels)
plt.show()

#### Practice 1 : Clustering 2D points

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters = 3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)

#### Practice 2 : Inspect your clustering

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:, 0]
ys = new_points[:, 1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c = labels, alpha = 0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker = 'D', s = 50)
plt.show()

### Video 2 : Evaluating a clustering

In [None]:
import pandas as pd

df = pd.DataFrame({'labels' : labels, 'species' : species})
print(df)

In [None]:
# Cross-tabulate cluster labels against species

ct = pd.crosstab(df['labels'], df['species'])
ct

- a good clustering has tight clusters
- samples in each cluster are bunched together
- inertia measures how spread out the clusters are (lower is better)
- it is based on the distance from each sample to the centroid of its cluster
- K-means attempts to minimize the inertia when choosing clusters
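
To make the definition concrete, inertia can be recomputed by hand as the sum of squared distances from each sample to its assigned centroid and compared with scikit-learn's `inertia_` attribute. A minimal sketch, assuming made-up two-blob data (the `samples` array and the `n_init` / `random_state` settings are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two 2D blobs of 40 points each
rng = np.random.default_rng(1)
samples = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(40, 2)),
    rng.normal(loc=(8.0, 8.0), scale=1.0, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)

# Look up the centroid assigned to each sample
assigned = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each sample to its centroid
manual_inertia = np.sum((samples - assigned) ** 2)

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point error.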

In [None]:
# After fitting, the inertia is available as model.inertia_

from sklearn.cluster import KMeans

model = KMeans(n_clusters = 3)
model.fit(samples)
print(model.inertia_)

- <b>a good clustering has tight clusters (so low inertia)</b>
- <b>but not too many clusters</b>: choose the "elbow" in the inertia plot, where the inertia begins to decrease more slowly
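
The elbow can also be read from numbers rather than a plot. This is a sketch under stated assumptions: the data comes from `make_blobs` with three hand-picked centers, so the "right" answer of 3 is known in advance, which is not the case with real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data with three true clusters
samples, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 0), (0, 10)],
    cluster_std=1.0,
    random_state=0,
)

ks = range(1, 6)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# How much inertia each additional cluster removes
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(list(ks)[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

On data like this the drop is large up to k=3 and small afterwards, which is the elbow the plot would show.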

#### Practice 1 : How many clusters of grain?

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

#### Practice 2 : Evaluating the grain clustering

In [None]:
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters = 3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

### Video 3 : Transforming features for better clusterings
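
K-means relies on distances, so a feature measured on a much larger scale than the others dominates the clustering. `StandardScaler` rescales every feature to mean 0 and variance 1. A minimal sketch of that effect, assuming made-up data with two features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
samples = np.column_stack([
    rng.normal(loc=1000.0, scale=200.0, size=100),  # large-scale feature
    rng.normal(loc=0.5, scale=0.1, size=100),       # small-scale feature
])

scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# Before scaling, the first feature carries almost all of the variance
print(samples.var(axis=0))

# After scaling, every feature has mean ~0 and variance ~1
print(samples_scaled.mean(axis=0))
print(samples_scaled.var(axis=0))
```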

In [None]:
# Standard Scaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(samples)
# fit() returns the fitted scaler; a notebook displays it as:
# StandardScaler(copy = True, with_mean = True, with_std = True)
samples_scaled = scaler.transform(samples)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

scaler = StandardScaler()
kmeans = KMeans(n_clusters = 3)

# Chain the two steps: scale the features, then cluster
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)

#### Practice 1 : Scaling fish data for clustering

In [None]:
# Perform the necessary imports
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters = 4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)

#### Practice 2 : Clustering the fish data

In [None]:
# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels' : labels, 'species' : species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)

#### Practice 3 : Clustering stocks using KMeans

In [None]:
# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters = 10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)

#### Practice 4 : Which stocks move together?

In [None]:
# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))