@channel **Hello All,**

**2024-01-22 `20.2-Unsupervised-Learning-Machine Learning in Practice`**

Unsupervised learning follows the same pattern (model, fit, predict).

* [scikit learn User Guid](https://scikit-learn.org/stable/user_guide.html)
* [Standardization, or mean removal and variance scaling](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)
* [pandas.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)
* [pandas.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
* [sklearn.cluster.Birch](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html)
* [sklearn.cluster.AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
* [Clustering performance evaluation](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)


**Objectives**

* Segment data.
* Prepare data for complex algorithms.
* Explain the importance of preprocessing data for unsupervised learning.
* Transform categorical variables into a numerical representation using Pandas.
* Scale data by using the `StandardScaler` module from `scikit-learn`.

**Presentation**
* [19.2-Machine Learning in Practice](https://git.bootcampcontent.com/University-of-California---Berkeley/UCB-VIRT-DATA-PT-08-2023-U-LOLC/-/blob/main/Slides/Data-19.2-Machine_Learning_in_Practice.pdf)

**Best wishes.**

---
**Definition**

* **AgglomerativeClustering** Agglomerative Clustering. Recursively merges pair of clusters of sample data; uses linkage distance.
* **BIRCH** BIRCH clustering algorithm. It is a memory-efficient, online-learning algorithm provided as an alternative to MiniBatchKMeans. It constructs a tree data structure with the cluster centroids being read off the leaf. These can be either the final cluster centroids or can be provided as input to another clustering algorithm such as AgglomerativeClustering.
* **DBSCAN** Perform DBSCAN clustering from vector array or distance matrix..
* **dendrogram** Plot the hierarchical clustering as a dendrogram.
* **fcluster** Form flat clusters from the hierarchical clustering defined by the given linkage matrix.
* **GridSearchCV** Exhaustive search over specified parameter values for an estimator.
* **linkage** Perform hierarchical/agglomerative clustering.
* **load_breast_cancer** Load and return the breast cancer wisconsin dataset (classification).
* **load_wine** Load and return the wine dataset (classification).
* **LogisticRegression** Logistic Regression (aka logit, MaxEnt) classifier.
* **make_moons** Make two interleaving half circles..
* **normalize** Scale input vectors individually to unit norm (vector length).
* **PCA** Principal Component Analysis
* **Pipeline** Pipeline of transforms with a final estimator.
* **StandardScaler** Standardize features by removing the mean and scaling to unit variance
* **SVC** Support Vector Classifier
* **train_test_split** Split arrays or matrices into random train and test subsets
* **TSNE** t-distributed Stochastic Neighbor Embedding.
* **calinski_harabasz_score**: Compute the Calinski and Harabasz score. It is also known as the Variance Ratio Criterion. The score is defined as ratio of the sum of between-cluster dispersion and of within-cluster dispersion.

# ==========================================

### 2.01 Instructor Do: Warmup (0:10) 

In [1]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Read in the CSV file as a Pandas DataFrame
spread_df = pd.read_csv(
    Path("01-Ins_Elbow_Warm_Up/Resources/stock_data.csv"),
    index_col="date", 
    parse_dates=True, 
    infer_datetime_format=True
)

# Review the DataFrame
spread_df.head()

Unnamed: 0_level_0,close,volume,open,high,low,returns,hi_low_spread
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2009-04-30,3.61,18193730,3.55,3.73,3.53,0.02849,0.2
2009-05-01,3.82,16233940,3.55,3.9,3.55,0.058172,0.35
2009-05-04,4.26,21236940,3.9,4.3,3.83,0.115183,0.47
2009-05-05,4.32,16369170,4.36,4.39,4.11,0.014085,0.28
2009-05-06,4.31,15075630,4.45,4.45,4.12,-0.002315,0.33


In [2]:
# Create a a list to store inertia values
inertia = []

# Create a a list to store the values of k
k = list(range(1, 11))

# Create a for loop where each value of k is evaluated using the K-means algorithm
# Fit the model using the home_sales_df DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the K-means model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=1)
    k_model.fit(spread_df)
    inertia.append(k_model.inertia_)

# Create a dictionary that holds the list values for k and inertia
elbow_data = {"k": k, "inertia": inertia}

# Create a DataFrame using the elbow_data dictionary
df_elbow = pd.DataFrame(elbow_data)

# Plot the DataFrame
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

# ==========================================

### 2.02 Students Do: Warm Up (0:15) 

# Clustering National Home Markets
In this activity, you will use the K-means algorithm to identify trends within a dataset of residential house prices in the US.
## Background
You work for a large national bank with a large lending business that provides loans to people who want to buy homes in the United States. The bank wants to have a model that can identify how similar the national real estate market is at any point in time in comparison to real estate periods in the past. Quantifying today's real estate market against the past will help the bank better understand its lending risk as well as the potential for new growth.

You've decided to use the K-means algorithm to segment different periods in the US market for national residential house prices.
## Instructions
1. Review the Pandas DataFrame and plot associated with `national-home-sales.csv`.
2. Run the K-means algorithm identifying three clusters in the data. To do so, complete these steps: 
   - Create and initialize the K-means model for three clusters. Use a `random_state` value of 1 for the model.
   - Fit, or train, the model by using the `home_sales_df` DataFrame.
   - Make predictions about the clustering by using the trained model. Save the predictions to a variable called `home_segment_3`, and print that variable.
   - Create a copy of the DataFrame and name it `home_sales_predictions_df`.
   - Add a column to the `home_sales_predictions_df` DataFrame called "home_segment_3", and add the `home_segment_3` information to the column.
   - Plot the data by using the DataFrame adjusted to include home market segment information for three clusters.
3. Run the K-means algorithm identifying four clusters in the data. To do so, complete these steps:
   - Create and initialize the K-means model for four clusters. Use a `random_state` value of 1 for the model.
   - Fit, or train, the model by using the `home_sales_df` DataFrame.
   - Make predictions about the clustering by using the trained model. Save the predictions to a variable called `home_segment_4`, and print that variable.
   - Add a column to the `home_sales_predictions_df` DataFrame called "customer_segment_4", and add the `home_segment_4` information to the column.
   - Plot the data by using the DataFrame adjusted to include customer segment information for four clusters.
4. Answer the following question: Can any additional information be gleaned from the customer segmentation data when clusters of three and four are applied?

---

In [89]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans

In [90]:
# Read in the CSV file as a Pandas DataFrame
home_sales_df = pd.read_csv(
  Path("02-Stu_Warm_Up/Resources/national-home-sales.csv"),
  index_col="date", 
  parse_dates=True, 
  infer_datetime_format=True 
)

# Review the DataFrame
home_sales_df.head()

Unnamed: 0_level_0,inventory,homes_sold,median_sale_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-01-01,1250798,377964,289000
2020-02-01,1265253,405992,294000
2020-03-01,1316823,507324,303000
2020-04-01,1297460,436855,304000
2020-05-01,1289500,421351,299000


In [91]:
# Create a a list to store inertia values
inertia = []

# Create a a list to store the values of k
k = list(range(1, 11))

In [92]:
# Create a for-loop where each value of k is evaluated using the K-means algorithm
# Fit the model using the spread_df DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=2)
    k_model.fit(home_sales_df)
    inertia.append(k_model.inertia_)

In [93]:
# Create a Dictionary that holds the list values for k and inertia
elbow_data = {"k": k, "inertia": inertia}

# Create a DataFrame using the elbow_data Dictionary
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()

Unnamed: 0,k,inertia
0,1,8048111000000.0
1,2,3460149000000.0
2,3,1894158000000.0
3,4,1356238000000.0
4,5,1119878000000.0


In [94]:
# Plot the DataFrame
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

## Perform the following tasks for each of the two most likely values of `k`:

* Define a K-means model using `k` to define the clusters, fit the model, make predictions, and add the prediction values to a copy of the scaled DataFrame and call it `spread_predictions_df`.

* Plot the clusters. The x-axis should reflect home "inventory", and the y-axis should reflect either the "median_sale_price" or "homes_sold" variable.

In [9]:
# Define the model with the lower value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=3, random_state=1)

# Fit the model
model.fit(home_sales_df)

# Make predictions
k_lower = model.predict(home_sales_df)

# Create a copy of the DataFrame and name it as spread_df_predictions
home_sales_predictions_df = home_sales_df.copy()

# Add a class column with the labels to the spread_df_predictions DataFrame
home_sales_predictions_df['clusters_lower'] = k_lower

In [10]:
home_sales_predictions_df.head(3)

Unnamed: 0_level_0,inventory,homes_sold,median_sale_price,clusters_lower
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-01-01,1250798,377964,289000,2
2020-02-01,1265253,405992,294000,2
2020-03-01,1316823,507324,303000,2


In [99]:
# Plot the clusters
home_sales_predictions_df.hvplot.scatter(
    x="inventory",
    y="homes_sold",
    by="clusters_lower"
).opts(yformatter="%.0f", xformatter="%.0f")

In [96]:
# Define the model with the higher value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=4, random_state=1)

# Fit the model
model.fit(home_sales_df)

# Make predictions
k_higher = model.predict(home_sales_df)

# Add a class column with the labels to the spread_df_predictions DataFrame
home_sales_predictions_df['clusters_higher'] = k_higher

In [98]:
# Plot the clusters
home_sales_predictions_df.hvplot.scatter(
    x="inventory",
    y="homes_sold",
    by="clusters_higher"
).opts(yformatter="%.0f", xformatter="%.0f")

## Answer the following question

* Considering the plot, what’s the best number of clusters to choose, or value of k? 

From the scatter plots, it appears that the optimal value for k, the nubmer of clusters is probably 3. It appears to better group the monthly housing trends among different levels of inventory. However, 4 clusters is probably not wrong either as, for a certain range of inventory, it appears to identify two unique clusters according to the number of homes sold. Overall, the best differention across multiple variables though would likely be 3 clusters.

# ==========================================

### 2.03 Instructor Do: Scaling Data (0:20) 

In [14]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [15]:
# Read in the CSV file and create the Pandas DataFrame
df_shopping = pd.read_csv(
    Path("03-Ins_Scaling_Data/Resources/shopping_data.csv")
)

# Review the DataFrame
df_shopping.head()

Unnamed: 0,CustomerID,Card Type,Age,Annual Income,Spending Score
0,1,Credit,19,15000,39
1,2,Credit,21,15000,81
2,3,Debit,20,16000,6
3,4,Debit,23,16000,77
4,5,Debit,31,17000,40


In [16]:
# Check the DataFrame data types
df_shopping.dtypes

CustomerID         int64
Card Type         object
Age                int64
Annual Income      int64
Spending Score     int64
dtype: object

In [17]:
# Scaling the numeric columns
shopping_data_scaled = StandardScaler().fit_transform(df_shopping[["Age", "Annual Income", "Spending Score"]])

# Creating a DataFrame with with the scaled data
df_shopping_transformed = pd.DataFrame(shopping_data_scaled, columns=["Age", "Annual Income", "Spending Score"])

# Display sample data
df_shopping_transformed.head()

Unnamed: 0,Age,Annual Income,Spending Score
0,-1.424569,-1.738999,-0.434801
1,-1.281035,-1.738999,1.195704
2,-1.352802,-1.70083,-1.715913
3,-1.137502,-1.70083,1.040418
4,-0.563369,-1.66266,-0.39598


In [18]:
# Transform the Card Type column using get_dummies()
card_dummies = pd.get_dummies(df_shopping["Card Type"])

# Display sample data
card_dummies.head()

Unnamed: 0,Credit,Debit
0,1,0
1,1,0
2,0,1
3,0,1
4,0,1


In [19]:
# Concatenate the df_shopping_transformed and the card_dummies DataFrames
df_shopping_transformed = pd.concat([df_shopping_transformed, card_dummies], axis=1)

# Display sample data
df_shopping_transformed.head()

Unnamed: 0,Age,Annual Income,Spending Score,Credit,Debit
0,-1.424569,-1.738999,-0.434801,1,0
1,-1.281035,-1.738999,1.195704,1,0
2,-1.352802,-1.70083,-1.715913,0,1
3,-1.137502,-1.70083,1.040418,0,1
4,-0.563369,-1.66266,-0.39598,0,1


# ==========================================

### 2.04 Everyone Do: Preprocessing Data (0:15) 

In [20]:
# Import the required modules
import pandas as pd
from pathlib import Path
import hvplot.pandas

## Load the Credit Card Data into a Pandas DataFrame

In [21]:
# Read in the CSV file as a Pandas Dataframe
ccinfo_df = pd.read_csv(
    Path("04-Evr_Preprocessing/Resources/cc_info_default.csv")
)

In [22]:
# Review the DataFrame
ccinfo_df.tail()

Unnamed: 0,limit_bal,education,marriage,age,bill_amt,pay_amt,default
4994,20000,secondary,yes,36,110994,7293,0
4995,180000,other,yes,34,35240,22066,0
4996,200000,secondary,yes,45,691806,21443,1
4997,310000,post-grad,yes,44,1548067,72000,0
4998,160000,primary,no,40,4440,3725,0


In [23]:
# Review the info
ccinfo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   limit_bal  4999 non-null   int64 
 1   education  4999 non-null   object
 2   marriage   4999 non-null   object
 3   age        4999 non-null   int64 
 4   bill_amt   4999 non-null   int64 
 5   pay_amt    4999 non-null   int64 
 6   default    4999 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 273.5+ KB


## Transform "education" column with get_dummies

In [24]:
# Verify the categories of the "education" column
ccinfo_df["education"].value_counts()

secondary    2267
primary      1862
post-grad     822
other          48
Name: education, dtype: int64

In [25]:
# Transform the education column using get_dummies
education_dummies = pd.get_dummies(ccinfo_df["education"])

# Display the transformed data
education_dummies.tail()

Unnamed: 0,other,post-grad,primary,secondary
4994,0,0,0,1
4995,1,0,0,0
4996,0,0,0,1
4997,0,1,0,0
4998,0,0,1,0


In [26]:
# Concatenate the df_shopping_transformed and the card_dummies DataFrames
ccinfo_df = pd.concat([ccinfo_df, education_dummies], axis=1)

# Drop the original education column
ccinfo_df = ccinfo_df.drop(columns=["education"])

# Display the DataFrame
ccinfo_df.head()

Unnamed: 0,limit_bal,marriage,age,bill_amt,pay_amt,default,other,post-grad,primary,secondary
0,20000,yes,24,7704,689,1,0,0,0,1
1,120000,no,26,17077,5000,1,0,0,0,1
2,90000,no,34,101653,11018,0,0,0,0,1
3,50000,yes,37,231334,8388,0,0,0,0,1
4,50000,yes,57,109339,59049,0,0,0,0,1


## Transform "marriage" column with encoding function

In [27]:
# Encoding the marriage column using a custom function
def encode_marriage(marriage):
    """
    This function encodes marital status by setting yes as 1 and no as 0.
    """
    if marriage == "yes":
        return 1
    else:
        return 0

# Call the encode_marriage function on the marriage column
ccinfo_df["marriage"] = ccinfo_df["marriage"].apply(encode_marriage)

# Review the DataFrame 
ccinfo_df.head()

Unnamed: 0,limit_bal,marriage,age,bill_amt,pay_amt,default,other,post-grad,primary,secondary
0,20000,1,24,7704,689,1,0,0,0,1
1,120000,0,26,17077,5000,1,0,0,0,1
2,90000,0,34,101653,11018,0,0,0,0,1
3,50000,1,37,231334,8388,0,0,0,0,1
4,50000,1,57,109339,59049,0,0,0,0,1


## Apply the Standard Scaler to "limit_bal", "bill_amt", "pay_amt"

In [28]:
# Import the module
from sklearn.preprocessing import StandardScaler

In [29]:
# Scaling the numeric columns
ccinfo_data_scaled = StandardScaler().fit_transform(ccinfo_df[["limit_bal", "bill_amt", "pay_amt"]])

# Review the scaled data
ccinfo_data_scaled

array([[-1.1173411 , -0.66070266, -0.5427793 ],
       [-0.3499424 , -0.63637003, -0.46399421],
       [-0.58016201, -0.41680786, -0.35401308],
       ...,
       [ 0.26397655,  1.1152494 , -0.16349243],
       [ 1.10811512,  3.33813208,  0.76045505],
       [-0.04298292, -0.66917611, -0.4872953 ]])

In [30]:
# Create a DataFrame of the scaled data
ccinfo_data_scaled = pd.DataFrame(ccinfo_data_scaled, columns=["limit_bal", "bill_amt", "pay_amt"])

# Replace the original data with the columns of information from the scaled Data
ccinfo_df["limit_bal"] = ccinfo_data_scaled["limit_bal"]
ccinfo_df["bill_amt"] = ccinfo_data_scaled["bill_amt"]
ccinfo_df["pay_amt"] = ccinfo_data_scaled["pay_amt"]

# Review the DataFrame
ccinfo_df.head()

Unnamed: 0,limit_bal,marriage,age,bill_amt,pay_amt,default,other,post-grad,primary,secondary
0,-1.117341,1,24,-0.660703,-0.542779,1,0,0,0,1
1,-0.349942,0,26,-0.63637,-0.463994,1,0,0,0,1
2,-0.580162,0,34,-0.416808,-0.354013,0,0,0,0,1
3,-0.887121,1,37,-0.080152,-0.402077,0,0,0,0,1
4,-0.887121,1,57,-0.396855,0.523771,0,0,0,0,1


## Elbow Method to find k

In [31]:
# Import the KMeans module from SKLearn
from sklearn.cluster import KMeans

In [32]:
# Create a a list to store inertia values and the values of k
inertia = []
k = list(range(1, 11))

In [33]:
# Create a for-loop where each value of k is evaluated using the K-means algorithm
# Fit the model using the service_ratings DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=0)
    k_model.fit(ccinfo_df)
    inertia.append(k_model.inertia_)
    

In [34]:
# Define a DataFrame to hold the values for k and the corresponding inertia
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()

Unnamed: 0,k,inertia
0,1,449413.376075
1,2,152036.470987
2,3,83362.744848
3,4,58548.383261
4,5,45451.282971


In [35]:
# Plot the DataFrame
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

## Kmeans algo to cluster data

In [36]:
# Define the model with 3 clusters
model = KMeans(n_clusters=3, random_state=3)

# Fit the model
model.fit(ccinfo_df)

# Make predictions
k_3 = model.predict(ccinfo_df)

# Create a copy of the preprocessed data
ccinfo_predictions_df = ccinfo_df.copy()

# Add a class column with the labels
ccinfo_predictions_df['customer_segments'] = k_3

In [37]:
# Plot the clusters
ccinfo_predictions_df.hvplot.scatter(
    x="limit_bal",
    y="age",
    by="customer_segments"
)

# ==========================================

### BREAK (0:10)

# ==========================================

### 2.05 Students Do: Standardizing Stock Data (0:15) 

# Standardizing Stock Data
In this activity, you will use the K-means algorithm to segment customer data for mobile versus in-person banking service ratings. Before you’re able to cluster the data, you’ll need to preprocess it by using the techniques that you've learned in this lesson.
## Instructions
1. Read in the `tsx-energy-2018.csv` file from the `Resources` folder and create the DataFrame. Make sure to set the Ticker column as the DataFrame’s index. Then, review the DataFrame.
   > **Note** The stock data that’s provided for this activity contains the yearly mean prices (open, high, low, and close), volume, annual return, and annual variance from companies in the energy sector that the TSX lists.
2. To prepare the data, use the `StandardScaler` module and the `fit_transform` function to scale all the columns containing numerical values. Review a five-row sample of the scaled data using bracket notation ([0:5]).
3. Create a new DataFrame called `df_stocks_scaled` that contains the scaled data. Make sure to do the following:
   * Use the same labels that were referenced in the `StandardScaler` for the column names.
   * Add a column to the DataFrame that consists of the tickers from the original DataFrame. (Hint: This column was the index).
   * Set the new column of tickers as the index for the new DataFrame.
   * Review the resulting DataFrame.
4. Encode the “EnergyType” column using `pd.get_dummies`, and save the result in a separate DataFrame called `df_oil_dummies`. Note that because the company name isn’t relevant for clustering, you don’t need to encode the “CompanyName” column.
5. Using the `pd.concat` function, concatenate the `df_stocks_scaled` DataFrame with the `df_oil_dummies` DataFrame along an axis value of 1 (`axis=1` tells Pandas to join the data horizontally by columns). Review the concatenated DataFrame.
6. Using the concatenated DataFrame, cluster the data by using the K-means algorithm and a k value of 3. Create a copy of the concatenated DataFrame, and add the resulting list of company segment values as a new column.

---

In [100]:
# Import the required libraries and dependencies
import pandas as pd
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [101]:
# Read the CSV file into a Pandas DataFrame
# Set the index using the Ticker column
df_stocks = pd.read_csv(
    Path("05-Stu-Standardizing_Stock_Data/Resources/tsx-energy-2018.csv"), 
    index_col="Ticker"
)

# Review the DataFrame
df_stocks.head()

Unnamed: 0_level_0,CompanyName,MeanOpen,MeanHigh,MeanLow,MeanClose,MeanVolume,AnnualReturn,AnnualVariance,EnergyType
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
ARX,ARC Resources Ltd.,13.14,13.34,12.91,13.1,1479913.38,-0.7275,0.359,Oil
CCO,Cameco Corporation,13.7,13.92,13.5,13.7,1203788.22,0.2014,0.3693,Other Energy
CNQ,Canadian Natural Resources Limited,41.97,42.46,41.46,41.91,3218248.68,-0.3461,0.2947,Oil
CVE,Cenovus Energy Inc.,11.96,12.18,11.75,11.95,4566143.56,-0.3219,0.45,Oil
CPG,Crescent Point Energy Corp.,8.53,8.67,8.36,8.5,3919414.03,-1.0103,0.4597,Other Energy


In [102]:
#  Prepare the data. Use the StandardScaler module and fit_transform function to 
# scale all columns with numerical values
stock_data_scaled = StandardScaler().fit_transform(df_stocks[["MeanOpen", "MeanHigh", "MeanLow", "MeanClose", "MeanVolume", "AnnualReturn", "AnnualVariance"]])

# Diplay the first five rows of the scaled data
stock_data_scaled[0:5]

array([[-0.91683187, -0.91721692, -0.91804499, -0.9181346 , -0.15278563,
        -1.33244548,  0.46085356],
       [-0.88015205, -0.87947182, -0.87906242, -0.87878597, -0.37911694,
         1.69574215,  0.55941139],
       [ 0.97152411,  0.97784771,  0.96831488,  0.97125524,  1.27207441,
        -0.08909231, -0.15441525],
       [-0.9941215 , -0.99270713, -0.99468868, -0.9935528 ,  2.37690243,
        -0.01020099,  1.33160722],
       [-1.21878543, -1.2211301 , -1.21867327, -1.2198074 ,  1.8467981 ,
        -2.25436545,  1.42442382]])

In [103]:
# Create a DataFrame called with the scaled data
# The column names should match those referenced in the StandardScaler step
df_stocks_scaled = pd.DataFrame(
    stock_data_scaled,
    columns=["MeanOpen", "MeanHigh", "MeanLow", "MeanClose", "MeanVolume", "AnnualReturn", "AnnualVariance"]
)

# Create a Ticker column in the df_stocks_scaled DataFrame
# using the index of the original df_stocks DataFrame
df_stocks_scaled["Ticker"] = df_stocks.index

# Set the newly created Ticker column as index of the df_stocks_scaled DataFrame
df_stocks_scaled = df_stocks_scaled.set_index("Ticker")

# Review the DataFrame
df_stocks_scaled.head()

Unnamed: 0_level_0,MeanOpen,MeanHigh,MeanLow,MeanClose,MeanVolume,AnnualReturn,AnnualVariance
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ARX,-0.916832,-0.917217,-0.918045,-0.918135,-0.152786,-1.332445,0.460854
CCO,-0.880152,-0.879472,-0.879062,-0.878786,-0.379117,1.695742,0.559411
CNQ,0.971524,0.977848,0.968315,0.971255,1.272074,-0.089092,-0.154415
CVE,-0.994122,-0.992707,-0.994689,-0.993553,2.376902,-0.010201,1.331607
CPG,-1.218785,-1.22113,-1.218673,-1.219807,1.846798,-2.254365,1.424424


In [104]:
# Encode (convert to dummy variables) the EnergyType column
df_oil_dummies = pd.get_dummies(df_stocks["EnergyType"])

# Review the DataFrame
df_oil_dummies.head()

Unnamed: 0_level_0,Oil,Other Energy
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1
ARX,1,0
CCO,0,1
CNQ,1,0
CVE,1,0
CPG,0,1


In [105]:
# Concatenate the `EnergyType` encoded dummies with the scaled data DataFrame
df_stocks_scaled = pd.concat([df_stocks_scaled, df_oil_dummies], axis=1)

# Display the sample data
df_stocks_scaled.head()

Unnamed: 0_level_0,MeanOpen,MeanHigh,MeanLow,MeanClose,MeanVolume,AnnualReturn,AnnualVariance,Oil,Other Energy
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
ARX,-0.916832,-0.917217,-0.918045,-0.918135,-0.152786,-1.332445,0.460854,1,0
CCO,-0.880152,-0.879472,-0.879062,-0.878786,-0.379117,1.695742,0.559411,0,1
CNQ,0.971524,0.977848,0.968315,0.971255,1.272074,-0.089092,-0.154415,1,0
CVE,-0.994122,-0.992707,-0.994689,-0.993553,2.376902,-0.010201,1.331607,1,0
CPG,-1.218785,-1.22113,-1.218673,-1.219807,1.846798,-2.254365,1.424424,0,1


In [106]:
# Initialize the K-Means model with n_clusters=3
model = KMeans(n_clusters=3)

In [107]:
# Fit the model for the df_stocks_scaled DataFrame
model.fit(df_stocks_scaled)

KMeans(n_clusters=3)

In [108]:
# Predict the model segments (clusters)
stock_clusters = model.predict(df_stocks_scaled)

# View the stock segments
print(stock_clusters)

[0 1 2 0 0 2 1 1 1 2 1 1 0 1 1 1 2 1 0 2 2 2 2 0]


In [109]:
# Create a copy of the concatenated DataFrame
df_stocks_scaled_predictions = df_stocks_scaled.copy()

In [110]:
# Create a new column in the copy of the concatenated DataFrame with the predicted clusters
df_stocks_scaled_predictions["StockCluster"] = stock_clusters

# Review the DataFrame
df_stocks_scaled_predictions.head()

Unnamed: 0_level_0,MeanOpen,MeanHigh,MeanLow,MeanClose,MeanVolume,AnnualReturn,AnnualVariance,Oil,Other Energy,StockCluster
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
ARX,-0.916832,-0.917217,-0.918045,-0.918135,-0.152786,-1.332445,0.460854,1,0,0
CCO,-0.880152,-0.879472,-0.879062,-0.878786,-0.379117,1.695742,0.559411,0,1,1
CNQ,0.971524,0.977848,0.968315,0.971255,1.272074,-0.089092,-0.154415,1,0,2
CVE,-0.994122,-0.992707,-0.994689,-0.993553,2.376902,-0.010201,1.331607,1,0,0
CPG,-1.218785,-1.22113,-1.218673,-1.219807,1.846798,-2.254365,1.424424,0,1,0


# ==========================================

### 2.06 Instructor Do: Clustering Complex Data (0:15) 

**Definition**

* **AgglomerativeClustering** Agglomerative Clustering. Recursively merges pair of clusters of sample data; uses linkage distance.
* **BIRCH** BIRCH clustering algorithm. It is a memory-efficient, online-learning algorithm provided as an alternative to MiniBatchKMeans. It constructs a tree data structure with the cluster centroids being read off the leaf. These can be either the final cluster centroids or can be provided as input to another clustering algorithm such as AgglomerativeClustering.
* **calinski_harabasz_score**: Compute the Calinski and Harabasz score. It is also known as the Variance Ratio Criterion. The score is defined as ratio of the sum of between-cluster dispersion and of within-cluster dispersion.

In [111]:
# Import the dependencies
import numpy as np
np.random.seed(0)
import pandas as pd
import hvplot.pandas
from sklearn import datasets

### Build the Dataset

In [113]:
# Create a simulated dataset for illustration.
X, y = datasets.make_moons(n_samples=(500), noise=0.05, random_state=1)
X[0:10]

array([[ 0.26990344, -0.08961617],
       [ 0.65960878, -0.44401893],
       [ 0.85049952,  0.56270289],
       [ 0.60950684,  0.69134729],
       [ 2.00353027,  0.19446353],
       [ 1.98790193,  0.40053406],
       [ 0.24847592, -0.18050231],
       [ 0.94871933,  0.37843451],
       [-0.96065183, -0.10227327],
       [ 0.44418573,  0.90246804]])

### Fit and predict a K-Means Model

In [None]:
from sklearn.cluster import 

In [114]:
# Import the alternative algorithms for clustering.
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
# Use the Kmeans algorithm.
k_model = KMeans(n_clusters=3, random_state=0)
k_model.fit(X)
predictions = k_model.predict(X)

### Fit and Predict Birch and Agglomerative models

In [115]:
# Use the Birch algorithm.
birch_model = Birch(n_clusters=2)
birch_model.fit(X)
birch_predictions = birch_model.predict(X)

In [116]:
# Use the AgglomerativeClustering algorithm.
agglo_model = AgglomerativeClustering(n_clusters=3)
agglo_predictions = agglo_model.fit_predict(X)

### Plot Model Predictions for Birch

In [117]:
# Make predictions for the Birch algorithm. 
predictions_df = pd.DataFrame(X)
predictions_df['birch-labels'] = birch_predictions
predictions_df

Unnamed: 0,0,1,birch-labels
0,0.269903,-0.089616,0
1,0.659609,-0.444019,0
2,0.850500,0.562703,0
3,0.609507,0.691347,0
4,2.003530,0.194464,0
...,...,...,...
495,0.024589,0.392571,1
496,0.673013,0.715418,0
497,1.277861,-0.457502,0
498,-0.980544,0.194562,1


In [118]:
#  Rename the non-string columns 0 and 1, "No" and "Yes" to avoid the Holoviews warning for future versions. 
predictions_df.rename({0: 'feature_0', 1: 'feature_1'}, axis=1, inplace=True)
predictions_df

Unnamed: 0,feature_0,feature_1,birch-labels
0,0.269903,-0.089616,0
1,0.659609,-0.444019,0
2,0.850500,0.562703,0
3,0.609507,0.691347,0
4,2.003530,0.194464,0
...,...,...,...
495,0.024589,0.392571,1
496,0.673013,0.715418,0
497,1.277861,-0.457502,0
498,-0.980544,0.194562,1


In [119]:
# Plot predictions for the Birch algorithm. 
predictions_df.hvplot.scatter(
    x="feature_0",
    y="feature_1",
    by="birch-labels"
)

### Estimate Scores for two Versions of the Birch Model

In [124]:
# Estimate scores for two clusters for the Birch model. 
birch_model_two_clusters = Birch(n_clusters=2)
birch_model_two_clusters.fit(X)
birch_predictions_2 = birch_model_two_clusters.predict(X)

In [125]:
# Estimate scores for 3 clusters for the Birch model. 
birch_model_three_clusters = Birch(n_clusters=3)
birch_model_three_clusters.fit(X)
birch_predictions_3 = birch_model_three_clusters.predict(X)

In [126]:
# Use the Calinski-Harabasz index or variance ratio criterion to define two clusters. 
from sklearn import metrics
labels = birch_model_two_clusters.labels_
score = metrics.calinski_harabasz_score(X, labels)  
score

588.1123857523019

In [127]:
# Use the Calinski-Harabasz index or variance ratio criterion to define three clusters. 
labels = birch_model_three_clusters.labels_
score = metrics.calinski_harabasz_score(X, labels)  
score

654.2904571777168

In [130]:
### Added trick

from sklearn import metrics
from sklearn.cluster import SpectralClustering
n_clusters = 3

ml = {
    "KMeans":KMeans(n_clusters=n_clusters, random_state=0),
    "Birch":Birch(n_clusters=n_clusters),
    "AgglomerativeClustering":AgglomerativeClustering(n_clusters=n_clusters),
    "SpectralClustering":SpectralClustering(n_clusters=n_clusters),
    # "SpectralClustering":SpectralClustering(n_clusters, affinity='precomputed', n_init=100, assign_labels='discretize')
    # "cluster_optics_dbscan":cluster_optics_dbscan(n_clusters=n_clusters),
}

res = []
for x in ml:
    model = ml[x]
    model.fit(X)
    y_pred = model.fit_predict(X)

    labels = model.labels_
    score = metrics.calinski_harabasz_score(X, labels) 
    res.append({
        "model":x,
        "score": score
    })
    predictions_df["class"] = y_pred
    display(predictions_df.hvplot.scatter(
        x="feature_0",
        y="feature_1",
        title=x,
        by="class"
    ))
    
df_res = pd.DataFrame(res)
df_res.sort_values("score")

Unnamed: 0,model,score
2,AgglomerativeClustering,565.44924
3,SpectralClustering,573.450055
1,Birch,654.290457
0,KMeans,677.441052


# ==========================================

### 2.07 Students Do: Segmenting Customer Data (0:20) 

# Segmenting Customer Data
In this activity, you will use BIRCH, agglomerative clustering, and the K-means model to segment a dataset on thousands of consumer credit card holders. Then, you’ll compare the results of the three different clustering methods.
## Background
One of the world's biggest banks launched a machine learning competition in [Kaggle](https://www.kaggle.com/), an online community of data scientists and machine learning practitioners. The bank wants to improve their marketing campaigns by identifying the optimal number of customer segments for their credit card clients, and are offering $5,000 in prize money to the winner. The cash prize has piqued your interest, so you have decided to put your hat in the ring and your unsupervised learning skills into practice!

The bank provided a dataset that consists of customer data that includes 10 different features. The data columns were anonymized using generic names to protect customers' privacy, and data values were already normalized.
## Instructions
1. Load the raw data into a Pandas DataFrame.
2. Use the elbow method to determine the optimal number of clusters.
3. Segment the data with K-means using the optimal number of clusters.
4. Cluster the data by using agglomerative clustering and BIRCH.
    * Using the optimal number of clusters found in Step 2, estimate clusters by using both `AgglomerativeClustering` and `Birch`. Save each of these models and their results for comparison.
5. Compare the cluster results from using K-means, agglomerative clustering, and BIRCH.  Make sure to do the following:
    * Create a DataFrame that is a copy of the original `customers_df` data.
    * Add all of the predicted labels (`kmeans_predictions`, `agglo_predictions`, and `birch_predictions`) as columns to this DataFrame.
    * For each algorithm, plot the clusters by using the "feature_1" and "feature_2" columns.
### Bonus
Loop through each clustering algorithm using an alternative metric to determine the optimal number of clusters. To do so, follow these steps: 
1. Create three lists (or a dictionary or DataFrame) to contain the metrics to measure optimal clusters.
2. Using a `for` loop, cycle through a list of cluster counts, fitting each of the three clustering algorithms.
3. When fitting the clustering algorithms in the loop, estimate the [variance ratio criterion (Calinski-Harabasz index)`](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index) and save that metric to your metrics lists in (1).
    > **Hint:** Code samples for these and other metrics can be found in scikit-learn documentation on [clustering performance evaluation](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation).
4. Output each of the three lists. If larger metric values indicate a better number of clusters, what cluster count is best? Does it vary by the algorithm selected?

---

In [131]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path

## Part 1: Create a Pandas DataFrame

In [132]:
# Set the file path
file_path = Path("07-Stu_Segmenting_Customers/Resources/customers.csv")

# Read the csv file into a pandas DataFrame
customers_df = pd.read_csv(file_path)

# Review the DataFrame
customers_df.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,1.148534,4.606077,2.699069,-2.661824,1.526433,1.236671,0.211421,1.482896,-4.445627,-1.936831
1,-1.14941,-1.650549,2.530167,-3.227088,0.572138,4.1626,-0.291679,-1.237575,3.604765,-1.635689
2,0.332427,-0.887985,-0.309216,0.399891,0.828492,3.641945,-0.916946,-1.978024,1.056772,-1.882747
3,2.245599,3.826309,0.264039,0.095471,1.98438,0.373991,-0.280279,1.602786,-5.993331,-2.258925
4,0.705503,-1.312329,0.895406,-0.405408,1.116187,3.699562,-1.427985,-1.494409,1.156908,-1.434964


In [133]:
# Use the "info()" Pandas function to validate data types and null values
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   feature_1   1000 non-null   float64
 1   feature_2   1000 non-null   float64
 2   feature_3   1000 non-null   float64
 3   feature_4   1000 non-null   float64
 4   feature_5   1000 non-null   float64
 5   feature_6   1000 non-null   float64
 6   feature_7   1000 non-null   float64
 7   feature_8   1000 non-null   float64
 8   feature_9   1000 non-null   float64
 9   feature_10  1000 non-null   float64
dtypes: float64(10)
memory usage: 78.2 KB


In [134]:
# Use the Pandas "describe()" function to compute summary statistics
customers_df.describe()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,-0.022428,0.805748,1.942896,-2.36403,0.85498,1.232422,0.146269,0.833486,-0.53432,-1.219393
std,2.382021,2.335796,1.411307,1.716566,1.742986,3.250231,1.635576,2.039563,4.211831,1.979172
min,-6.259471,-4.649286,-2.894995,-8.735778,-4.641509,-9.11147,-4.260013,-4.911903,-9.522425,-6.083462
25%,-2.091657,-1.214774,1.026128,-3.438149,-0.23531,-0.333722,-0.967569,-0.894817,-4.129561,-2.505366
50%,0.16167,1.096439,1.905107,-2.437602,1.084556,1.367371,-0.222299,1.519069,-0.536849,-1.706372
75%,2.030005,2.513648,2.851613,-1.22973,2.287268,3.637304,1.061269,2.298862,2.626514,-0.553571
max,6.275723,7.955158,5.897102,4.296552,4.74135,8.705423,7.123969,5.789222,10.047819,5.413623


## Part 2. Use the Elbow Method to determine the optimal number of clusters for KMeans.

In [135]:
# Import the KMeans, Birch, and AgglomerativeClustering modules from SKLearn
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch

In [136]:
# Create a list to store inertia values and the values of k
inertia = []

# Create a list to set the range of k values to test
k = list(range(1, 11))

In [137]:
# Create a for loop where each value of k is evaluated using the K-means algorithm
# Fit the model using the "customers_df" DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=0)
    k_model.fit(customers_df)
    inertia.append(k_model.inertia_)

In [138]:
# Define a DataFrame to hold the values for k and the corresponding inertia
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()

Unnamed: 0,k,inertia
0,1,58103.759171
1,2,32183.537923
2,3,17080.936423
3,4,14890.068176
4,5,12816.235532


In [139]:
# Plot the DataFrame to identify the optimal value for k
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

## Part 3: Segment the data with K-means using the optimal number of clusters

In [140]:
# Define the model with optimal number of clusters
model = KMeans(n_clusters=3, random_state=0)

# Fit the model
model.fit(customers_df)

# Make predictions
kmeans_predictions = model.predict(customers_df)

## Part 4. Cluster the data using AgglomerativeClustering and Birch

Using your optimal number of clusters found above, additionally estimate clusters by using both `AgglomerativeClustering` and `Birch`. Save each of these models and their results for comparison.

In [141]:
agglo_model = AgglomerativeClustering(n_clusters=3)
agglo_predictions = agglo_model.fit_predict(customers_df)

In [142]:
birch_model = Birch(n_clusters=3)
birch_model.fit(customers_df)
birch_predictions = birch_model.predict(customers_df)

In [143]:
# Previewing the predicted customer classifcations for Birch
birch_predictions[-10:]

array([0, 1, 0, 0, 0, 0, 1, 0, 1, 1], dtype=int64)

## Part 5. Compare the cluster results from using Kmeans, AgglomerativeClustering, Birch

In [144]:
# Create a copy of the customers_df DataFrame
customers_predictions_df = customers_df.copy()
# Add class columns with the labels to the new DataFrame
customers_predictions_df["kmeans-segments"] = kmeans_predictions
customers_predictions_df["agglomerative-segments"] = agglo_predictions
customers_predictions_df["birch-segments"] = birch_predictions
customers_predictions_df[['kmeans-segments','agglomerative-segments', 'birch-segments']].head(3)

Unnamed: 0,kmeans-segments,agglomerative-segments,birch-segments
0,1,1,0
1,0,0,1
2,0,0,1


In [145]:
# Plot the kmeans clusters using the "feature_1" and "feature_2" columns
customers_predictions_df.hvplot.scatter(
    x="feature_1",
    y="feature_2",
    by="kmeans-segments"
)

In [146]:
# Plot the agglomerative clusters using the "feature_1" and "feature_2" columns
customers_predictions_df.hvplot.scatter(
    x="feature_1",
    y="feature_2",
    by="agglomerative-segments"
)

In [147]:
# Plot the birch clusters using the "feature_1" and "feature_2" columns
customers_predictions_df.hvplot.scatter(
    x="feature_1",
    y="feature_2",
    by="birch-segments"
)

In [148]:
customers_predictions_df[['kmeans-segments','agglomerative-segments', 'birch-segments']].corr()

Unnamed: 0,kmeans-segments,agglomerative-segments,birch-segments
kmeans-segments,1.0,0.982562,0.302117
agglomerative-segments,0.982562,1.0,0.30896
birch-segments,0.302117,0.30896,1.0


### Bonus

In [149]:
# Preview the predictions for one of the algorithms
birch_predictions[0:10]

array([0, 1, 1, 0, 1, 1, 0, 0, 0, 1], dtype=int64)

In [150]:
# Equivalently, preview the labels_ attribute for one of the algorithms
birch_model.labels_[0:10]

array([0, 1, 1, 0, 1, 1, 0, 0, 0, 1], dtype=int64)

In [158]:
# Create a list to store values and the values of k
score_kmeans = []
score_agglomerative = []
score_birch = []

# Create a list to set the range of k values to test
k = list(range(2, 11))

In [159]:
from sklearn import metrics

for i in k:
    k_model = KMeans(n_clusters=i, random_state=0)
    k_model.fit(customers_df)
    labels = k_model.labels_
    score = metrics.calinski_harabasz_score(customers_df, labels)    
    score_kmeans.append(score)
    
    agglo_model = AgglomerativeClustering(n_clusters=i)
    agglo_predictions = agglo_model.fit_predict(customers_df)
    labels = agglo_model.labels_
    score = metrics.calinski_harabasz_score(customers_df, labels)    
    score_agglomerative.append(score)    
    
    birch_model = Birch(n_clusters=i)
    birch_model.fit(customers_df)
    labels = birch_model.labels_
    score = metrics.calinski_harabasz_score(customers_df, labels)    
    score_birch.append(score)

In [153]:
display(score_kmeans)

[803.7767901000835,
 1197.2339591364607,
 963.5244943567246,
 878.984431645905,
 781.1913330352697,
 722.008246754843,
 678.3825232189985,
 638.9870291772176,
 603.5166505532345]

In [154]:
score_agglomerative

[793.1761769443768,
 1173.3765904855773,
 920.430407435551,
 783.1374540348882,
 698.3124513125239,
 642.0342150282685,
 609.5331449471877,
 573.5727292902812,
 542.4260224059782]

In [155]:
score_birch

[792.7549736617844,
 1172.1940395784054,
 905.8303632361597,
 807.3524405928957,
 710.299103155839,
 650.134014299598,
 601.7209094043105,
 569.5499222834262,
 533.4727554559031]

In [160]:
k

[2, 3, 4, 5, 6, 7, 8, 9, 10]

**Bonus Question:**If larger metric values indicate a better number of clusters, what cluster count is best? Does it vary by the algorithm selected?

>**Sample Answer**: Based on each of the three lists, the highest value for each of the three algorithms appears to be at the three-cluster count. Based on this metric, two clusters would actually be sufficient to classify these customers, regardless of which of these three algorithms were used.

In [88]:
### Added trick

from sklearn import metrics

n_clusters = 3

ml = {
    "KMeans":KMeans(n_clusters=n_clusters, random_state=0),
    "Birch":Birch(n_clusters=n_clusters),
    "AgglomerativeClustering":AgglomerativeClustering(n_clusters=n_clusters),
    "SpectralClustering":SpectralClustering(n_clusters=n_clusters),
#     "cluster_optics_dbscan":cluster_optics_dbscan(n_clusters=n_clusters),
}

res = []
p=[]
for x in ml:
    model = ml[x]
    model.fit(customers_df)
    y_pred = model.fit_predict(customers_df)

    labels = model.labels_
    score = metrics.calinski_harabasz_score(customers_df, labels) 
    res.append({
        "model":x,
        "score": score
    })
    customers_df["class"] = y_pred
    display(customers_df.hvplot.scatter(
        x="feature_1",
        y="feature_2",
        title=x,
        by="class"
    ))
    
df_res = pd.DataFrame(res)
df_res.sort_values("score")









Unnamed: 0,model,score
1,Birch,1186.538926
2,AgglomerativeClustering,1190.911709
3,SpectralClustering,1192.150635
0,KMeans,1197.233959


# ==========================================

### Rating Class Objectives

* rate your understanding using 1-5 method in each objective

In [None]:
title = "20.2-Unsupervised-Learning - Machine Learning in Practice"
objectives = [
    "Segment data",
    "Prepare data for complex algorithms",
    "Explain the importance of preprocessing data for unsupervised learning",
    "Transform categorical variables into a numerical representation using Pandas",
    "Scale data by using the StandardScaler module from scikit-learn",
]
rating = []
total = 0
for i in range(len(objectives)):
    rate = input(objectives[i]+"? ")
    total += int(rate)
    rating.append(objectives[i] + ". (" + rate + "/5)")
print("="*96)
print(f"Self Evaluation for: {title}")
print("-"*24)
for i in rating:
    print(i)
print("-"*64)
print("Average: " + str(total/len(objectives)))