# **Unsupervised Learning Practice Project: Fantasy Sports Clustering Analysis**

--------------------------------
## **Context** 
-------------------------------

Fantasy sports are online gaming platforms where participants draft and manage virtual teams of real professional sports players. Based on the performance of the players in the real world, players are allotted points in the fantasy sports platform every match. The objective is to create the best possible team with a fixed budget to score maximum fantasy points, and users compete against each other over an entire sports league or season. Some of these fantasy sports require actual financial investments for participation, with the chances of winning monetary rewards as well as free matchday tickets on a periodic basis.

The fantasy sports market has seen tremendous growth over the past few years, with a valuation of \$18.6 billion in 2019. The football (soccer) segment led in terms of market share in 2019, with over 8 million participants worldwide, and is expected to retain its dominance over the next couple of years. Digitalization is one of the primary factors driving the growth of the fantasy sports market, as it gives participants the opportunity to compete on a global level and test their skills. With an increase in smartphone usage and the availability of fantasy sports apps, this market is expected to witness a global surge and reach a \$48.6 billion valuation by 2027.


----------------------------
## **Objective**
-----------------------------

OnSports is a fantasy sports platform that has fantasy leagues for many different sports and has witnessed an increasing number of participants globally over the past 5 years. For each player, a price is set at the start of the season, and the price keeps changing over time based on the player's performance in the real world. With the new English Premier League season about to start, OnSports has collected data from the past season and wants to analyze it to determine the price of each player for the start of the new season. OnSports has hired you as a data scientist and asked you to conduct a cluster analysis to identify groups of players with different potential, based on their performance in the previous season. This will help them understand the patterns in player performances and fantasy returns and decide the exact price to be set for each player for the upcoming football season.

--------------------------
## **Data Description**
--------------------------

- **Player_Name:** Name of the player.
- **Club:** Club in which the player plays.
- **Position:** Position in which the player plays.
- **Goals_Scored:** Number of goals scored by the player in the previous season.
- **Assists:** Number of passes made by the player leading to goals in the previous season.
- **Total_Points:** Total number of fantasy points scored by the player in the previous season.
- **Minutes:** Number of minutes played by the player in the previous season.
- **Goals_Conceded:** Number of goals conceded by the player in the previous season.
- **Creativity:** A score, computed using a range of stats, that assesses player performance in terms of producing goalscoring opportunities for other players.
- **Influence:** A score, computed using a range of stats, that evaluates a player's impact on a match, taking into account actions that could directly or indirectly affect the match outcome.
- **Threat:** A score, computed using a range of stats, that gauges players who are most likely to score goals.
- **Bonus:** Total bonus points received. The three best performing players in each match receive additional bonus points based on a score computed using a range of stats. 3 points are awarded to the highest scoring player, 2 to the second best, and 1 to the third.
- **Clean_Sheets:** Number of matches without conceding a goal in the previous season.

## **Importing the necessary libraries and overview of the dataset**

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To perform PCA
from sklearn.decomposition import PCA

# To compute distances
from scipy.spatial.distance import cdist, pdist

# To perform K-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

# To import DBSCAN and Gaussian Mixture
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import the data
data = pd.read_csv('data/fpl_data.csv')

In [3]:
data.head()

| | Player_Name | Club | Position | Goals_Scored | Assists | Total_Points | Minutes | Goals_Conceded | Creativity | Influence | Threat | Bonus | Clean_Sheets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Runnarsson | Arsenal | Goalkeeper | 0 | 0 | 1 | 15 | 0 | 0.0 | 16.6 | 0 | 0 | 0 |
| 1 | Alexandre Lacazette | Arsenal | Forward | 13 | 3 | 129 | 1916 | 21 | 307.4 | 602.4 | 797 | 21 | 7 |
| 2 | Bernd Leno | Arsenal | Goalkeeper | 0 | 0 | 131 | 3131 | 37 | 0.0 | 702.2 | 2 | 11 | 11 |
| 3 | Bukayo Saka | Arsenal | Midfielder | 5 | 5 | 114 | 2554 | 31 | 650.6 | 493.0 | 984 | 8 | 8 |
| 4 | Calum Chambers | Arsenal | Defender | 0 | 3 | 36 | 751 | 10 | 169.4 | 171.8 | 77 | 3 | 2 |


In [4]:
# Check the shape of the data
data.shape

(476, 13)

In [None]:
# Let's view a sample of the data
data.sample(n = 10, random_state = 1)

In [None]:
# Checking the column names and datatypes (df is only created in the next cell, so use data here)
data.info()

In [None]:
# Copying the data to another variable to avoid any changes to original data
df = data.copy()

In [None]:
# Check for duplicate values

In [None]:
# Check for missing values in the data

## **Exploratory Data Analysis**

**Let's check the statistical summary of the data**

In [None]:
df.describe(_____).T

**Observations and Insights: _____**

### **Univariate Analysis**

In [None]:
# Function to plot a boxplot and a histogram along the same scale


def histogram_boxplot(data, feature, figsize = (12, 7), kde = False, bins = None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """

    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows = 2,  # Number of rows of the subplot grid = 2
        sharex = True,  # X-axis will be shared among all subplots
        gridspec_kw = {"height_ratios": (0.25, 0.75)},
        figsize = figsize,
    )  # Creating the 2 subplots
    sns.boxplot(
        data = data, x = feature, ax = ax_box2, showmeans = True, color = "violet"
    )  # Boxplot will be created and a triangle will indicate the mean value of the column
    if bins:
        sns.histplot(
            data = data, x = feature, kde = kde, ax = ax_hist2, bins = bins
        )  # Histogram with the requested number of bins
    else:
        sns.histplot(
            data = data, x = feature, kde = kde, ax = ax_hist2
        )  # Histogram with seaborn's default binning
    ax_hist2.axvline(
        data[feature].mean(), color = "green", linestyle = "--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color = "black", linestyle = "-"
    )  # Add median to the histogram

**`Goals_Scored`**

In [None]:
histogram_boxplot(df, 'Goals_Scored')

**`Assists`**

In [None]:
histogram_boxplot(_______)  # Complete the code to create histogram_boxplot for 'Assists'

In [None]:
# Plot the histogram and the boxplot for 'Goals_Conceded'

In [None]:
# Plot the histogram and the boxplot for 'Clean_Sheets'

In [None]:
# Plot the histogram and the boxplot for 'Minutes'

In [None]:
# Plot the histogram and the boxplot for 'Total_Points'

In [None]:
# Plot the histogram and the boxplot for 'Creativity'

In [None]:
# Plot the histogram and the boxplot for 'Influence'

In [None]:
# Plot the histogram and the boxplot for 'Threat'

In [None]:
# Plot the histogram and the boxplot for 'Bonus'

**Observations and Insights for all the plots: _____**

In [None]:
# Function to create labeled barplots


def labeled_barplot(data, feature, perc = False, n = None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # Length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize = (count + 1, 5))
    else:
        plt.figure(figsize = (n + 1, 5))

    plt.xticks(rotation = 90, fontsize = 15)
    ax = sns.countplot(
        data = data,
        x = feature,
        palette = "Paired",
        order = data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # Percentage of each class of the category
        else:
            label = p.get_height()  # Count of each level of the category

        x = p.get_x() + p.get_width() / 2  # Horizontal center of the bar
        y = p.get_height()  # Top of the bar

        ax.annotate(
            label,
            (x, y),
            ha = "center",
            va = "center",
            size = 12,
            xytext = (0, 5),
            textcoords = "offset points",
        )  # Annotate the percentage

    plt.show()  # Show the plot

**`Club`**

In [None]:
labeled_barplot(df, 'Club')

**`Position`**

In [None]:
labeled_barplot(_______)  # Complete the code to create a labeled barplot for 'Position'

**Observations and Insights:__**

### **Bivariate Analysis**

**We are done with univariate analysis. Let's explore the data a bit more with bivariate analysis.**

In [None]:
# Correlation check
cols_list = df.select_dtypes(include = np.number).columns.tolist()

plt.figure(figsize = (15, 7))

sns.heatmap(
    df[cols_list].corr(numeric_only = True), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral"
)

plt.show()

**Observations and Insights:__**

**Let's check which team's players scored the most fantasy points on average.**

In [None]:
plt.figure(figsize = (15, 8))

sns.barplot(data = df, x = ___ , y = ___ , errorbar = None)  # Complete the code to choose the right variables (errorbar = None hides the error bars)

plt.xticks(rotation = 90) 

plt.show()

**We know that players in different positions have specific roles to play in a team. Let's check which positions tend to score more fantasy points on average.**

In [None]:
# Complete the code with the right variables

**To effectively utilize their squad depth, managers often rotate the squad to keep key players in shape for tougher games. Let's check the total number of minutes played, on average, across different positions.**

In [None]:
# Complete the code with the right variables

**Every point counts in fantasy sports and getting bonus points for a player is always a treat. Let's check which team's players have secured the most bonus points, on average, last season.**

In [None]:
# Complete the code with the right variables

**Let's see which players scored the most fantasy points last season for different positions of play.**

In [None]:
pos_list = df.Position.unique().tolist()
best_df = pd.DataFrame()

for pos in pos_list:
    df_aux = df[df.Position == pos]
    best_df = pd.concat([best_df, df_aux[df_aux.Total_Points == df_aux.Total_Points.max()][['Player_Name', 'Club', 'Position', 'Total_Points']]])

best_df

**Let's see the top 10 players with the most fantasy points last season for different positions of play.**

In [None]:
best10_df = pd.DataFrame()

for pos in pos_list:
    df_aux = df[df.Position == pos]
    best10_df = pd.concat([best10_df, df_aux.sort_values('Total_Points', ascending=False).head(10)[['Player_Name', 'Club', 'Position', 'Total_Points']]])  # head(10) keeps exactly 10 rows per position

best10_df
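The same "top n per group" idea can also be written without an explicit loop. The sketch below uses a made-up `toy` frame and a hypothetical `top_n` variable. Note that label-based slicing such as `.loc[:n]` includes the end label and therefore returns `n + 1` rows after a reset index, which is why `head(n)` is used here.

```python
import pandas as pd

# Made-up data: three players in each of two positions
toy = pd.DataFrame(
    {
        "Player_Name": ["P1", "P2", "P3", "P4", "P5", "P6"],
        "Position": ["Forward", "Forward", "Forward", "Defender", "Defender", "Defender"],
        "Total_Points": [50, 120, 80, 30, 90, 60],
    }
)

top_n = 2  # Number of players to keep per position (hypothetical value)

top_by_position = (
    toy.sort_values("Total_Points", ascending=False)  # Best players first
    .groupby("Position")
    .head(top_n)                                       # First top_n rows of each group
    .sort_values(["Position", "Total_Points"], ascending=[True, False])
    .reset_index(drop=True)
)

print(top_by_position)
```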

### **Outlier Check**

- Let's plot the boxplots of all numerical columns to check for outliers.

In [None]:
plt.figure(figsize = (20, 10))

numeric_columns = df.select_dtypes(include = np.number).columns.tolist()

for i, variable in enumerate(numeric_columns):
    
    plt.subplot(2, 5, i + 1)
    
    plt.boxplot(df[variable], whis = 1.5)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
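With `whis = 1.5`, a boxplot draws as outliers the points that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile. A minimal sketch of that rule on made-up numbers (the `values` array is an assumption, not a column from the dataset):

```python
import numpy as np

# Made-up sample with one clearly extreme value
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 30], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # First and third quartiles
iqr = q3 - q1                             # Interquartile range
lower = q1 - 1.5 * iqr                    # Lower fence
upper = q3 + 1.5 * iqr                    # Upper fence

outliers = values[(values < lower) | (values > upper)]

print("Fences:", lower, upper)
print("Outliers:", outliers)
```

Counting the flagged points per column in this way gives a numeric complement to the visual check.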

### **Scaling**

- Let's scale the data before we proceed with clustering.

In [None]:
# Scaling the data before clustering
scaler = ________ # Initialize the Standard Scaler

subset = ___  # Complete the code to get the data with numerical features

subset_scaled = ______ # Fit_transform the scaler function on data subset

In [None]:
# Creating a dataframe of the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns = subset.columns)
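As a quick check of what z-score scaling does, the sketch below compares `StandardScaler` with the formula (x - mean) / std computed by hand. The array `X` is made up to mimic two features on very different scales. It is an illustration only, not the notebook's solution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (e.g., goals and minutes)
X = np.array([[0.0, 100.0], [5.0, 2000.0], [13.0, 3000.0], [2.0, 900.0]])

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof = 0), like np.std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance computations used by the clustering algorithms.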

### **Applying PCA**


In [None]:
# Defining the number of principal components to generate
n = subset.shape[1]                                    # Storing the number of variables in the subset data

pca = ____________                                     # Initialize PCA with n_components = n and random_state = 1

data_pca = pd.DataFrame(pca.____________)              # Fit_transform PCA on the scaled data

# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_                     
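One common way to read `explained_variance_ratio_` is through its cumulative sum. The sketch below runs on synthetic correlated data. The 90% threshold and the names `cum_var` and `n_keep` are assumptions chosen for illustration, not part of the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))

# Four synthetic features: columns 2 and 4 are noisy copies of columns 1 and 3
X = np.column_stack(
    [
        base[:, 0],
        base[:, 0] + 0.1 * rng.normal(size=200),
        base[:, 1],
        base[:, 1] + 0.1 * rng.normal(size=200),
    ]
)

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=X_scaled.shape[1], random_state=1)
pca.fit(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)  # Cumulative share of variance
n_keep = int(np.argmax(cum_var >= 0.90)) + 1        # Smallest number of components reaching 90%

print(cum_var.round(3))
print("Components needed for 90% of the variance:", n_keep)
```

Because the four features carry only two independent signals, two components are enough here. On the player data the number would have to be read off the actual cumulative variance.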

## **K-Means Clustering**

In [None]:
k_means_df = data_pca.copy()

In [None]:
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    
    model = KMeans(n_clusters = k, random_state = 1, n_init = "auto")
    
    model.fit(k_means_df)
    
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis = 1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")

plt.xlabel("k")

plt.ylabel("Average Distortion")

plt.title("Selecting k with the Elbow Method", fontsize = 20)

plt.show()

**Observations and Insights:**

- We will move ahead with k = 4. **What could be the reason for this choice?**
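To make the elbow idea concrete, the sketch below repeats the average-distortion calculation on synthetic data with four well-separated groups. The data, the choice of `n_init = 10`, and the variable names are assumptions for illustration. The curve for the player data will look different.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Four synthetic, well-separated groups of 50 points each
centers = np.array([[0, 0], [20, 0], [0, 20], [20, 20]], dtype=float)
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

distortions = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X)
    # Distance from every point to its nearest centroid, averaged over all points
    nearest = cdist(X, model.cluster_centers_, "euclidean").min(axis=1)
    distortions.append(nearest.mean())

# How much the distortion falls each time one more cluster is added
drops = -np.diff(distortions)
for k, drop in zip(range(2, 8), drops):
    print(f"k = {k}: distortion fell by {drop:.2f}")
```

The elbow is the value of k after which adding another cluster reduces the distortion only marginally. In this synthetic case that happens at k = 4.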

In [None]:
kmeans = KMeans(______, random_state = 1, n_init = "auto") # Create K-Means with n_clusters = 4

kmeans.fit(k_means_df)

In [None]:
# Creating a copy of the original data
df1 = df.copy()

# Adding K-Means cluster labels to the K-Means dataframe
k_means_df["KM_segments"] = kmeans.labels_

# Adding K-Means cluster labels to the original dataframe
df1["KM_segments"] = kmeans.labels_

### **Cluster Profiling**

In [None]:
km_cluster_profile = df1.groupby( ____ ).mean(numeric_only = True)  # Complete the code to groupby the cluster labels

In [None]:
# Creating the "count_in_each_segment" feature in K-Means cluster profile

km_cluster_profile["count_in_each_segment"] = (
    df1.groupby( ______ )["Total_Points"].count().values)  # Complete the code to groupby the cluster labels

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables
km_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0) 

In [None]:
# Complete the code to print the players in each cluster. Hint: Use the KM_segments feature

for cl in df1[ ___ ].unique(): 
    print("In cluster {}, the following players are present:".format(cl))
    print(df1[df1[ ____ ] == cl]["Player_Name"].unique())
    print()

In [None]:
df1.groupby(["KM_segments", "Position"])['Player_Name'].count()
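The profiling steps above follow one pattern: group by the cluster label, average the numeric columns, and attach the group sizes. A minimal sketch on a made-up frame (the `toy` data and its labels are assumptions):

```python
import pandas as pd

# Made-up players with cluster labels already attached
toy = pd.DataFrame(
    {
        "Player_Name": ["P1", "P2", "P3", "P4", "P5"],
        "Total_Points": [200, 100, 10, 20, 30],
        "Minutes": [3000, 2800, 100, 300, 200],
        "KM_segments": [0, 0, 1, 1, 1],
    }
)

# Cluster-wise mean of every numeric column
profile = toy.groupby("KM_segments").mean(numeric_only=True)

# Number of players in each cluster, aligned on the cluster label
profile["count_in_each_segment"] = toy.groupby("KM_segments").size()

print(profile)
```

Assigning the result of `size()` directly lets pandas align the counts by cluster label, so the counts cannot end up on the wrong row.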

**Let's plot the boxplot**

In [None]:
fig, axes = plt.subplots(3, 4, figsize = (20, 20))
counter = 0

for ii in range(3):
    for jj in range(4):
        if counter < 10:
            sns.boxplot(
                ax = axes[ii][jj],
                data = df1,
                y = df1.columns[3 + counter],
                x = "KM_segments",
            )
            counter = counter + 1

fig.tight_layout(pad = 3.0)

### **Characteristics of each cluster:___**

## **K-Medoids Clustering**

In [None]:
kmed_df = data_pca.copy()

In [None]:
kmed = KMedoids(______, random_state = 1) # Create K-Medoids with n_clusters = 4
kmed.fit(kmed_df)

In [None]:
# Creating a copy of the original data
df2 = df.copy()

# Add K-Medoids cluster labels to K-Medoids data
kmed_df["KMed_segments"] = ________

# Add K-Medoids cluster labels to original data
df2["KMed_segments"] =  _________

### **Cluster Profiling**

In [None]:
# Take the cluster-wise mean of all the variables. Hint: First group 'df2' by cluster labels column and then find mean

In [None]:
# Create the "count_in_each_segment" column in K-Medoids cluster profile

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables

In [None]:
# Complete the code to print the players in each cluster. Hint: Use the KMed_segments feature

**Let's plot the boxplot**

In [None]:
# Create boxplot for each of the variables

### **Characteristics of each cluster:___**

### **Comparison of cluster profiles from K-Means and K-Medoids:_______________**

##  **Hierarchical Clustering**

In [None]:
hc_df = data_pca.copy()

In [None]:
hc_df1 = hc_df.copy()

In [None]:
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df1, metric = dm, method = lm)
        c, coph_dists = cophenet(Z, pdist(hc_df1, metric = dm))  # Compare with pairwise distances in the same metric
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
            
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
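The cophenetic correlation compares the pairwise distances between observations with the heights at which those observations are first merged in the dendrogram, so both should be measured in the same metric. A minimal sketch on random data (the array `X` and the two metrics shown are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))  # Synthetic observations

results = {}
for metric in ["euclidean", "cityblock"]:
    d = pdist(X, metric=metric)        # Condensed pairwise distances in this metric
    Z = linkage(d, method="average")   # linkage also accepts a condensed distance matrix
    c, _ = cophenet(Z, d)              # Correlation between d and the dendrogram distances
    results[metric] = c

for metric, c in results.items():
    print(f"{metric}: cophenetic correlation = {c:.3f}")
```

Values closer to 1 mean the dendrogram preserves the original pairwise distances more faithfully.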

**Let's explore different linkage methods with Euclidean distance only.**

In [None]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df1, metric = "euclidean", method = lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
        
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)

**Let's view the dendrograms for the different linkage methods with Euclidean distance only.**

In [None]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize = (15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df1, metric = "euclidean", method = method)

    dendrogram(Z, ax = axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])

In [None]:
# Create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns = compare_cols)
df_cc = df_cc.sort_values(by = "Cophenetic Coefficient")
df_cc

In [None]:
HCmodel = AgglomerativeClustering(n_clusters = ___ , metric = ___ , linkage = ___ )  # Complete the code to define the hierarchical clustering with average linkage
HCmodel.fit(hc_df)

In [None]:
# Creating a copy of the original data
df3 = df.copy()

# Adding hierarchical cluster labels to the hierarchical and original dataframes
hc_df["HC_segments_L1"] = _______________
df3["HC_segments_L1"] = _______________

### **Cluster Profiling**

In [None]:
# Take the cluster-wise mean of all the variables. Hint: First group 'df3' by cluster labels column and then find mean

In [None]:
# Create the "count_in_each_segment" column in hierarchical cluster profile

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables

In [None]:
# Complete the code to print the players in each cluster. Hint: Use the HC_segments_L1 feature

**We see that most of the players have been grouped into one cluster, and there are two very sparse clusters. This clustering does not look good as the clusters do not have enough variability.**

**Let us try using Ward linkage as it has more distinct and separated clusters (as seen from its dendrogram before). 4 appears to be a good number of clusters from the dendrogram for Ward linkage.**

In [None]:
hc_df2 = data_pca.copy()

In [None]:
HCmodel = AgglomerativeClustering(n_clusters = ___ , metric = ___ , linkage = ___ )  # Complete the code to define the hierarchical clustering with Ward Linkage
HCmodel.fit(hc_df2)

In [None]:
# Creating a copy of the original data
df3 = df.copy()

# Adding hierarchical cluster labels to the hierarchical and original dataframes
hc_df2["HC_segments_L2"] = _______________
df3["HC_segments_L2"] = _______________

### **Cluster Profiling**

In [None]:
# Take the cluster-wise mean of all the variables. Hint: First group 'df3' by cluster labels column and then find mean

In [None]:
# Create the "count_in_each_segment" column in hierarchical cluster profile

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables

In [None]:
# Complete the code to print the players in each cluster. Hint: Use the HC_segments_L2 feature

**Let's plot the boxplot**

In [None]:
# Create boxplot for each of the variables

### **Characteristics of each cluster:___**

### **Comparison of cluster profiles from Hierarchical and previous algorithms:___________________**


## **GMM Clustering**

In [None]:
gmm_df = data_pca.copy()

In [None]:
gmm = GaussianMixture(________, random_state = 1) # Initializing the Gaussian Mixture algorithm with n_components = 4

gmm.fit(_____) # Fit the Gaussian Mixture algorithm on the gmm_df

### **Cluster Profiling**

In [None]:
# Creating a copy of the original data
df4 = df.copy()

# Adding gmm cluster labels to the GMM and original dataframes
gmm_df["GMM_segments"] = ____________
df4["GMM_segments"] = _______________
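Unlike K-Means, a Gaussian mixture gives each observation a probability of belonging to every component, and the cluster label is simply the most probable component. The sketch below shows this on two synthetic groups. The data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two synthetic groups of 50 points each
X = np.vstack(
    [
        rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
        rng.normal(loc=6.0, scale=1.0, size=(50, 2)),
    ]
)

gmm = GaussianMixture(n_components=2, random_state=1).fit(X)

hard_labels = gmm.predict(X)       # Most probable component for each point
soft_labels = gmm.predict_proba(X) # Probability of each component for each point

print(hard_labels[:5])
print(soft_labels[:5].round(3))
```

Points with probabilities close to 0.5 for two components sit between clusters, which is information a hard label alone does not show.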

In [None]:
# Take the cluster-wise mean of all the variables. Hint: First group 'df4' by cluster labels column and then find mean

In [None]:
# Create the "count_in_each_segment" column in gmm cluster profile

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables

In [None]:
# Complete the code to print the players in each cluster. Hint: Use the GMM_segments feature

**Let's plot the boxplot**

In [None]:
# Create boxplot for each of the variables

### **Characteristics of each cluster:___**

### **Comparison of cluster profiles from GMM and previous algorithms:______________________**

## **DBSCAN**

DBSCAN is a powerful algorithm for finding high-density clusters, but its results depend heavily on two hyperparameters, `eps` and `min_samples`, and choosing good values for them is the hard part.

Because DBSCAN is unsupervised, there are no labels against which to validate a choice of hyperparameters, unlike supervised learning, where candidate models can be compared on a validation set. The approach we follow is to try several combinations of values and compute the silhouette score for each of them.

### **What is the silhouette score?**

The silhouette score is one way to evaluate the quality of clusters produced by algorithms such as K-Means. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation), and takes values in the range [-1, 1].

* Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters.
* Silhouette coefficients near 0 indicate that the sample lies on or very close to the boundary between two neighboring clusters.
* Silhouette coefficients near -1 indicate that the sample might have been assigned to the wrong cluster.
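For a single sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The sketch below computes this by hand on six made-up points and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_of` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

# Six made-up points forming two tight groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = cdist(X, X)  # Pairwise Euclidean distances


def silhouette_of(i):
    """Silhouette coefficient of sample i, computed from the definition."""
    own = labels == labels[i]
    own[i] = False  # Exclude the sample itself
    a = D[i, own].mean()  # Cohesion: mean distance within its own cluster
    b = min(
        D[i, labels == other].mean()  # Separation: nearest other cluster
        for other in set(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)


s_manual = np.array([silhouette_of(i) for i in range(len(X))])
s_sklearn = silhouette_samples(X, labels)

print(s_manual.round(3))
print(np.allclose(s_manual, s_sklearn))
```

The value returned by `silhouette_score` is the mean of these per-sample coefficients.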

In [None]:
dbscan_df = data_pca.copy()
dbscan_df1 = dbscan_df.copy()

In [None]:
# Initializing lists
eps_value = [2,3]                       # Candidate eps values (chosen arbitrarily)
min_sample_values = [6,20]              # Candidate min_samples values (chosen arbitrarily)

# Creating a dictionary that maps each eps value to the list of min_samples values to try with it
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  

In [None]:
# Finding the silhouette score for each combination
high_silhouette_avg = 0                                               # Best silhouette score found so far
high_i_j = [0, 0]                                                     # [eps, min_samples] that produced the best score
key = res.keys()                                                      # Candidate eps values
for i in key:
    z = res[i]                                                        # Candidate min_samples values for this eps
    for j in z:
        db = DBSCAN(eps = i, min_samples = j).fit(dbscan_df)          # Applying DBSCAN to each combination
        labels = db.labels_                                           # Cluster labels (-1 marks noise points)
        if len(set(labels)) < 2:                                      # silhouette_score needs at least 2 distinct labels
            print("For eps value =" + str(i), "For min sample =" + str(j), "only one label was found, skipping")
            continue
        silhouette_avg = silhouette_score(dbscan_df, labels)          # Finding silhouette score
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                            # Printing the silhouette score for each combination
        )
        if high_silhouette_avg < silhouette_avg:                       # Keep the best score and the combination of i and j that produced it
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j

In [None]:
# Printing the highest silhouette score
print(
    "Highest silhouette_avg is {} for eps = {} and min sample = {}".format(
        high_silhouette_avg, high_i_j[0], high_i_j[1]
    )
)

In [None]:
# Applying DBSCAN with the hyperparameter values that we got

# Fit DBSCAN algorithm with the above hyperparameter values

In [None]:
# Creating a copy of the original data
df5 = df.copy()

# Add DBSCAN cluster labels to dbscan data
dbscan_df1["db_segments"] = ________________

# Add DBSCAN cluster labels to original data
df5["db_segments"] =  ___________________

### **Cluster Profiling**

In [None]:
# Take the cluster-wise mean of all the variables. Hint: First group 'df5' by cluster labels column and then find mean

In [None]:
# Create the "count_in_each_segment" column in DBSCAN cluster profile

In [None]:
# Highlight the maximum average value among all the clusters for each of the variables

**Think about it:**

- Will changing the `eps` and `min_samples` values lead to different DBSCAN results? Can we try more values for `eps` and `min_samples`?
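One way to explore this question is a wider grid search. The sketch below runs on synthetic data, and the grid values are arbitrary assumptions. It also makes two choices that are assumptions rather than the notebook's method: combinations that produce fewer than two clusters are skipped, and noise points (label -1) are left out when computing the silhouette score.

```python
import numpy as np
from itertools import product
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Two synthetic dense groups of 50 points each
X = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    ]
)

best = {"score": -1.0, "eps": None, "min_samples": None}

for eps, min_samples in product([0.2, 0.5, 1.0], [3, 5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1                      # Keep only points assigned to a cluster
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:                       # Silhouette is undefined for fewer than 2 clusters
        continue
    score = silhouette_score(X[mask], labels[mask])
    if score > best["score"]:
        best = {"score": score, "eps": eps, "min_samples": min_samples}

print(best)
```

Excluding noise points tends to raise the score, so it is worth reporting how many points were labeled as noise alongside the silhouette value.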

**Observations and Insights:**

### **Choosing the Best Algorithm**

- Since the cluster profiles are the same for every algorithm except DBSCAN, it is difficult to choose the best algorithm from the profiles alone. We can compute the silhouette score for each algorithm and compare the scores to choose the best one.

In [None]:
kmeans = KMeans(n_clusters = 4, random_state = 1, n_init = 'auto')        # Initializing K-Means with number of clusters as 4 and random_state = 1

preds = kmeans.fit_predict((data_pca))                   # Fitting and predicting K-Means on data_pca

score = silhouette_score(data_pca, preds)                # Calculating the silhouette score

print(score)         

In [None]:
# Initialize K-Medoids with number of clusters as 4 and random_state = 1

# Fitting and predicting K-Medoids on data_pca

# Calculate the silhouette score

# Print the score 

In [None]:
# Initialize Agglomerative Clustering with distance as Euclidean, linkage as ward with clusters = 4

# Fitting and predicting HC algorithm on data_pca  

# Calculate the silhouette score

# Print the score 

In [None]:
# Initialize Gaussian Mixture algorithm with number of clusters as 4 and random_state = 1

# Fitting and predicting Gaussian Mixture algorithm on data_pca

# Calculate the silhouette score

# Print the score 

**Think about it:**

- Which is the best algorithm here among all the algorithms?

## **Conclusion:**__


## **Recommendations:**__