# Module 10 Application

## Challenge: Crypto Clustering

In this Challenge, you’ll combine your financial Python programming skills with the new unsupervised learning skills that you acquired in this module.

The CSV file provided for this challenge contains price change data of cryptocurrencies in different periods.

The steps for this challenge are broken out into the following sections:

* Import the Data (provided in the starter code)
* Prepare the Data (provided in the starter code)
* Find the Best Value for `k` Using the Original Data
* Cluster Cryptocurrencies with K-means Using the Original Data
* Optimize Clusters with Principal Component Analysis
* Find the Best Value for `k` Using the PCA Data
* Cluster the Cryptocurrencies with K-means Using the PCA Data
* Visualize and Compare the Results

### Import the Data

This section imports the data into a new DataFrame. It follows these steps:

1. Read the “crypto_market_data.csv” file from the Resources folder into a DataFrame, and use `index_col="coin_id"` to set the cryptocurrency name as the index. Review the DataFrame.

2. Generate the summary statistics, and use HvPlot to visualize your data to observe what your DataFrame contains.


> **Rewind:** The [Pandas `describe()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) generates summary statistics for a DataFrame.
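
As a minimal sketch of the two import steps, the block below reads a small in-memory CSV instead of the real file. The three rows are copied from the sample output in this notebook (first two columns only), and the names `csv_text`, `df_sample`, and `summary` are illustrative assumptions, not part of the starter code.

```python
import io

import pandas as pd

# Three rows copied from this notebook's sample output (first two columns only)
csv_text = """coin_id,price_change_percentage_24h,price_change_percentage_7d
bitcoin,1.08388,7.60278
ethereum,0.22392,10.38134
tether,-0.21173,0.04935
"""

# index_col="coin_id" turns the coin_id column into the row index
df_sample = pd.read_csv(io.StringIO(csv_text), index_col="coin_id")

# describe() summarizes each numeric column (count, mean, std, min, quartiles, max)
summary = df_sample.describe()

print(df_sample.index.name)
print(summary.loc["count"])
```

Because `coin_id` becomes the index, it is left out of the summary statistics and out of any later scaling step.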

In [258]:
# Import required libraries and dependencies
# Here are all the required packages imported
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [259]:
# Load the data into a Pandas DataFrame
# Here we read the crypto_market_data.csv file from the Resources folder and set the coin_id
# column as the index right away.
df_market_data = pd.read_csv(
    Path("Resources/crypto_market_data.csv"),
    index_col="coin_id")

# Display sample data
# Here we call a simple display command to see what the dataframe looks like.
df_market_data.head(10)

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 1.08388 | 7.60278 | 6.57509 | 7.67258 | -3.25185 | 83.5184 | 37.51761 |
| ethereum | 0.22392 | 10.38134 | 4.80849 | 0.13169 | -12.8889 | 186.77418 | 101.96023 |
| tether | -0.21173 | 0.04935 | 0.0064 | -0.04237 | 0.28037 | -0.00542 | 0.01954 |
| ripple | -0.37819 | -0.60926 | 2.24984 | 0.23455 | -17.55245 | 39.53888 | -16.60193 |
| bitcoin-cash | 2.90585 | 17.09717 | 14.75334 | 15.74903 | -13.71793 | 21.66042 | 14.49384 |
| binancecoin | 2.10423 | 12.85511 | 6.80688 | 0.05865 | 36.33486 | 155.61937 | 69.69195 |
| chainlink | -0.23935 | 20.69459 | 9.30098 | -11.21747 | -43.69522 | 403.22917 | 325.13186 |
| cardano | 0.00322 | 13.99302 | 5.55476 | 10.10553 | -22.84776 | 264.51418 | 156.09756 |
| litecoin | -0.06341 | 6.60221 | 7.28931 | 1.21662 | -17.2396 | 27.49919 | -12.66408 |
| bitcoin-cash-sv | 0.9253 | 3.29641 | -1.86656 | 2.88926 | -24.87434 | 7.42562 | 93.73082 |


In [260]:
# Generate summary statistics
# Here we use the describe() function to get the summary statistics for the dataframe.
df_market_data.describe()

|  | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 |
| mean | -0.269686 | 4.497147 | 0.185787 | 1.545693 | -0.094119 | 236.537432 | 347.667956 |
| std | 2.694793 | 6.375218 | 8.376939 | 26.344218 | 47.365803 | 435.225304 | 1247.842884 |
| min | -13.52786 | -6.09456 | -18.1589 | -34.70548 | -44.82248 | -0.3921 | -17.56753 |
| 25% | -0.60897 | 0.04726 | -5.02662 | -10.43847 | -25.90799 | 21.66042 | 0.40617 |
| 50% | -0.06341 | 3.29641 | 0.10974 | -0.04237 | -7.54455 | 83.9052 | 69.69195 |
| 75% | 0.61209 | 7.60278 | 5.51074 | 4.57813 | 0.65726 | 216.17761 | 168.37251 |
| max | 4.84033 | 20.69459 | 24.23919 | 140.7957 | 223.06437 | 2227.92782 | 7852.0897 |


In [261]:
# Plot your data to see what's in your DataFrame
# Here we plot the DataFrame as a line chart, with one line per price-change column.
df_market_data.hvplot.line(
    width=800,
    height=400,
    rot=90
)

---

### Prepare the Data

This section prepares the data before running the K-Means algorithm. It follows these steps:

1. Use the `StandardScaler` class from scikit-learn to standardize the CSV file data. This requires the `fit_transform` method.

2. Create a DataFrame that contains the scaled data. Be sure to set the `coin_id` index from the original DataFrame as the index for the new DataFrame. Review the resulting DataFrame.
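
To make the scaling step concrete, here is a small sketch on toy data. The values, coin names, and column names are hypothetical. It checks that `StandardScaler` matches the by-hand formula (subtract the column mean, divide by the population standard deviation) and shows one possible way to keep the `coin_id` index by passing it straight to the `DataFrame` constructor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two columns on very different scales
df_toy = pd.DataFrame(
    {"short_term": [1.0, 2.0, 3.0, 4.0], "long_term": [100.0, 300.0, 500.0, 700.0]},
    index=pd.Index(["coin_a", "coin_b", "coin_c", "coin_d"], name="coin_id"),
)

scaled = StandardScaler().fit_transform(df_toy)

# The same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0)
values = df_toy.to_numpy()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Passing index= keeps coin_id as the index without a separate set_index step
df_toy_scaled = pd.DataFrame(scaled, columns=df_toy.columns, index=df_toy.index)

print(np.allclose(scaled, manual))
print(df_toy_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distances that K-Means computes.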


In [262]:
# Use the `StandardScaler` class from scikit-learn to standardize the data from the CSV file
# Here StandardScaler rescales each column to zero mean and unit variance, so that columns with large
# ranges (such as the 200d and 1y changes) do not dominate the distance calculations in K-Means.
scaled_data = StandardScaler().fit_transform(df_market_data)

In [263]:
# Create a DataFrame with the scaled data
# Here we are creating a new dataframe out of the scaled data from the original data
df_market_data_scaled = pd.DataFrame(
    scaled_data,
    columns=df_market_data.columns
)

# Copy the crypto names from the original data
# Here we copy the coin_id index of the original DataFrame into a new column of the scaled DataFrame.
df_market_data_scaled["coin_id"] = df_market_data.index

# Set the coin_id column as index
# Here we set coin_id as the index of the new DataFrame with the scaled values.
df_market_data_scaled = df_market_data_scaled.set_index("coin_id")

# Display sample data
# Here is a simple display command so we can see how our new dataframe looks.
df_market_data_scaled.head()

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 0.508529 | 0.493193 | 0.7722 | 0.23546 | -0.067495 | -0.355953 | -0.251637 |
| ethereum | 0.185446 | 0.934445 | 0.558692 | -0.054341 | -0.273483 | -0.115759 | -0.199352 |
| tether | 0.021774 | -0.706337 | -0.02168 | -0.06103 | 0.008005 | -0.550247 | -0.282061 |
| ripple | -0.040764 | -0.810928 | 0.249458 | -0.050388 | -0.373164 | -0.458259 | -0.295546 |
| bitcoin-cash | 1.193036 | 2.000959 | 1.76061 | 0.545842 | -0.291203 | -0.499848 | -0.270317 |


---

### Find the Best Value for k Using the Original Data

In this section, you will use the elbow method to find the best value for `k`.

1. Code the elbow method algorithm to find the best value for `k`. Use a range from 1 to 11. 

2. Plot a line chart with all the inertia values computed with the different values of `k` to visually identify the optimal value for `k`.

3. Answer the following question: What is the best value for `k`?
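
The elbow method relies on inertia, which is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below, on hypothetical toy points, recomputes that quantity by hand and compares it with `model.inertia_`. The explicit `n_init=10` is an assumption added here so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance from each point to its assigned centroid
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

Adding clusters can only shrink this sum, which is why the elbow curve always slopes downward and the useful signal is where it stops dropping quickly.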

In [264]:
# Create a list with the number of k-values to try
# Use a range from 1 to 11
# Here we make a list of the k values 1 through 10 (range excludes the stop value 11) and name it k.
k = list(range(1, 11))

In [265]:
# Create an empty list to store the inertia values
# Here we make an empty list called inertia that we can fill with data later.
inertia = []

In [266]:
# Create a for loop to compute the inertia with each possible value of k
# Inside the loop:
# 1. Create a KMeans model using the loop counter for the n_clusters
# 2. Fit the model to the data using `df_market_data_scaled`
# 3. Append the model.inertia_ to the inertia list
# Here the for loop fits a KMeans model to df_market_data_scaled for each value of k and then
# appends the resulting inertia to the inertia list.
for i in k:
    model = KMeans(n_clusters=i, random_state=0)
    model.fit(df_market_data_scaled)
    inertia.append(model.inertia_)

In [267]:
# Create a dictionary with the data to plot the Elbow curve
# Here we have a dictionary called elbow_data, which pairs each value of k with the inertia that the
# for loop computed, so we can plot the elbow curve.
elbow_data = {
    "k": k,
    "inertia": inertia
}


# Create a DataFrame with the data to plot the Elbow curve
# Here we are creating a dataframe out of the elbow_data, this is so we can plot it later. 
df_elbow = pd.DataFrame(elbow_data)

In [268]:
# Plot a line chart with all the inertia values computed with 
# the different values of k to visually identify the optimal value for k.
# Here we are plotting the elbow curve so that we can find the best value for k, this will help us find the optimal number of clusters
# to represent the data.
df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)


#### Answer the following question: What is the best value for k?
**Question:** What is the best value for `k`?

**Answer:** The best value for `k` is 4, based on the elbow curve above.
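
Reading an elbow off a chart is a judgment call. One rough way to back it up, sketched below, is to look at the percentage drop in inertia from each `k` to the next and see where the drops become small. The data is synthetic (three hypothetical blobs generated with NumPy), so the elbow at 3 applies to this toy example only, not to the crypto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (hypothetical): three tight blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([center + rng.normal(scale=0.5, size=(20, 2)) for center in centers])

k_values = list(range(1, 7))
inertias = []
for n_clusters in k_values:
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Percentage drop in inertia when moving from k to k + 1
inertias = np.array(inertias)
pct_drop = -np.diff(inertias) / inertias[:-1] * 100

for k_from, drop in zip(k_values[:-1], pct_drop):
    print(f"k={k_from} -> k={k_from + 1}: inertia falls by {drop:.1f}%")
```

The last large drop marks the elbow; after it, extra clusters only split groups that already belong together.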

### Cluster Cryptocurrencies with K-means Using the Original Data

In this section, you will use the K-Means algorithm with the best value for `k` found in the previous section to cluster the cryptocurrencies according to the price changes of cryptocurrencies provided.

1. Initialize the K-Means model with four clusters using the best value for `k`. 

2. Fit the K-Means model using the original data.

3. Predict the clusters to group the cryptocurrencies using the original data. View the resulting array of cluster values.

4. Create a copy of the original data and add a new column with the predicted clusters.

5. Create a scatter plot using hvPlot by setting `x="price_change_percentage_24h"` and `y="price_change_percentage_7d"`. Color the graph points with the labels found using K-Means and add the crypto name in the `hover_cols` parameter to identify the cryptocurrency represented by each data point.
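
The fit, predict, and copy steps above can be sketched on a toy DataFrame. The coin names, feature names, and values are hypothetical, and `n_init=10` and `random_state=0` are assumptions added to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy DataFrame (hypothetical coins and values): two obvious groups
df_toy = pd.DataFrame(
    {"feature_a": [0.0, 0.1, 5.0, 5.1], "feature_b": [0.0, 0.2, 5.0, 5.2]},
    index=pd.Index(["coin_w", "coin_x", "coin_y", "coin_z"], name="coin_id"),
)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(df_toy)
labels = model.predict(df_toy)

# Work on a copy so the feature DataFrame stays free of the label column
df_toy_predictions = df_toy.copy()
df_toy_predictions["Predictions"] = labels

print(df_toy_predictions)
```

Keeping the labels out of the feature DataFrame matters if that DataFrame is fitted again later, since otherwise the label column would be treated as another feature. The cluster numbers themselves are arbitrary identifiers; only the grouping carries meaning.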

In [269]:
# Initialize the K-Means model using the best value for k
# Here we initialize the KMeans algorithm with 4 clusters, the best value suggested by the elbow curve we
# plotted, and name it model. This creates the model that the K-Means algorithm will use.
model = KMeans(n_clusters=4)

In [270]:
# Fit the K-Means model using the scaled data
# Next we fit (train) the model. While fitting, the algorithm looks for the best centroid for each of the k clusters.
model.fit(df_market_data_scaled)

KMeans(n_clusters=4)

In [271]:
# Predict the clusters to group the cryptocurrencies using the scaled data
# After creating and training the model, we use it to predict by passing our DataFrame as the argument,
# and the method returns the cluster that each cryptocurrency belongs to.
crypto_precictions = model.predict(df_market_data_scaled)

# View the resulting array of cluster values.
# Here we display the array to see how it looks: in this run we have mostly 0's (26), thirteen 3's, one 1, and one 2.
print(crypto_precictions)

[3 3 0 0 3 3 3 3 3 0 0 0 0 3 0 3 0 0 3 0 0 3 0 0 0 0 0 0 3 0 0 0 1 3 0 0 2
 0 0 0 0]


In [272]:
# Create a copy of the DataFrame
# Here we create a copy of the df_market_data_scaled and give it a new name, df_market_data_predictions.
df_market_data_predictions = df_market_data_scaled.copy()

In [273]:
# Add a new column to the DataFrame with the predicted clusters
# Here we add a new column called "Predictions" to the DataFrame and fill it with the cluster labels from the KMeans algorithm.
df_market_data_predictions["Predictions"] = crypto_precictions

# Display sample data
# This is a simple display command so we can see what we have in the DataFrame.
df_market_data_predictions.head()

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 0.508529 | 0.493193 | 0.7722 | 0.23546 | -0.067495 | -0.355953 | -0.251637 | 3 |
| ethereum | 0.185446 | 0.934445 | 0.558692 | -0.054341 | -0.273483 | -0.115759 | -0.199352 | 3 |
| tether | 0.021774 | -0.706337 | -0.02168 | -0.06103 | 0.008005 | -0.550247 | -0.282061 | 0 |
| ripple | -0.040764 | -0.810928 | 0.249458 | -0.050388 | -0.373164 | -0.458259 | -0.295546 | 0 |
| bitcoin-cash | 1.193036 | 2.000959 | 1.76061 | 0.545842 | -0.291203 | -0.499848 | -0.270317 | 3 |


In [274]:
# Create a scatter plot using hvPlot by setting 
# `x="price_change_percentage_24h"` and `y="price_change_percentage_7d"`. 
# Color the graph points with the labels found using K-Means and 
# add the crypto name in the `hover_cols` parameter to identify 
# the cryptocurrency represented by each data point.
# Here we create a scatter plot with the cluster labels we got from the KMeans algorithm. The DataFrame has too many
# columns to show in one chart, so we plot price_change_percentage_24h against price_change_percentage_7d and color
# the points by the Predictions column.
df_market_data_predictions.hvplot.scatter(x="price_change_percentage_24h",
                                          y="price_change_percentage_7d",
                                          by="Predictions",
                                          hover_cols="coin_id",
                                          title="CryptoCoins by k-means")

---

### Optimize Clusters with Principal Component Analysis

In this section, you will perform a principal component analysis (PCA) and reduce the features to three principal components.

1. Create a PCA model instance and set `n_components=3`.

2. Use the PCA model to reduce to three principal components. View the first five rows of the DataFrame. 

3. Retrieve the explained variance to determine how much information can be attributed to each principal component.

4. Answer the following question: What is the total explained variance of the three principal components?

5. Create a new DataFrame with the PCA data. Be sure to set the `coin_id` index from the original DataFrame as the index for the new DataFrame. Review the resulting DataFrame.
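
As a rough sketch of these PCA steps, the block below reduces a toy DataFrame of five random features to three principal components and wraps the result in a DataFrame that reuses the original index. The data is randomly generated and the names `df_toy`, `components`, and `df_toy_pca` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy data (hypothetical): 10 coins described by 5 random features
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(
    rng.normal(size=(10, 5)),
    columns=[f"feature_{i}" for i in range(5)],
    index=pd.Index([f"coin_{i}" for i in range(10)], name="coin_id"),
)

pca = PCA(n_components=3)
components = pca.fit_transform(df_toy)  # NumPy array of shape (10, 3)

# Reuse the original index so each row still maps back to its coin
df_toy_pca = pd.DataFrame(components, columns=["PC1", "PC2", "PC3"], index=df_toy.index)

print(df_toy_pca.head())
print(pca.explained_variance_ratio_)
```

The components come back ordered from most to least explained variance, so PC1 always carries the largest share.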

In [275]:
# Create a PCA model instance and set `n_components=3`.
# Here we use the PCA class to reduce the number of features that we work with in the DataFrame, and we set the number of
# components to 3. This helps the K-Means algorithm identify clusters and makes the clusters easier to visualize.
# This is an example of dimensionality reduction: it eases the calculations for the algorithm while preserving most of the information in the data.
pca = PCA(n_components=3)

In [276]:
# Use the PCA model with `fit_transform` to reduce to 
# three principal components.
# Here we use the fit_transform function to apply the dimensionality reduction, this is one of several techniques that reduces the
# size of the dataset yet preserves as much of the data as possible.
df_market_pca_data = pca.fit_transform(df_market_data_scaled)

# View the first five rows of the transformed data.
# Here is a simple display so we can see the NumPy array returned by fit_transform.
df_market_pca_data[:5]

array([[-0.60066733,  0.84276006,  0.46159457],
       [-0.45826071,  0.45846566,  0.95287678],
       [-0.43306981, -0.16812638, -0.64175193],
       [-0.47183495, -0.22266008, -0.47905316],
       [-1.15779997,  2.04120919,  1.85971527]])

In [277]:
# Retrieve the explained variance to determine how much information 
# can be attributed to each principal component.
# Here we read the explained_variance_ratio_ attribute, which gives the fraction of the total variance in the
# scaled data that each principal component captures.

pca.explained_variance_ratio_


array([0.3719856 , 0.34700813, 0.17603793])

#### Answer the following question: What is the total explained variance of the three principal components?

**Question:** What is the total explained variance of the three principal components?

**Answer:** The total explained variance is the sum of the explained variance ratios of the three principal components: 0.372 + 0.347 + 0.176 ≈ 0.895. Explained variance is the share of the variability in the data that PCA captures in the principal components. So after reducing the dataset to 3 dimensions we still capture about 89.5% of the variance in the scaled data, and we give up about 10.5% of the information. Not bad.
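
The total can be checked with a few lines of arithmetic. This sketch copies the three ratios from the `explained_variance_ratio_` output above into a hypothetical array named `explained_variance_ratio` and sums them.

```python
import numpy as np

# Values copied from the explained_variance_ratio_ output in this notebook
explained_variance_ratio = np.array([0.3719856, 0.34700813, 0.17603793])

total = explained_variance_ratio.sum()
cumulative = np.cumsum(explained_variance_ratio)

print(f"Total explained variance: {total:.1%}")
print(f"Variance not captured:    {1 - total:.1%}")
print("Cumulative share by component:", cumulative.round(3))
```

In the notebook itself, the same total is available directly as `pca.explained_variance_ratio_.sum()`.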

In [278]:
# Create a new DataFrame with the PCA data.
# Note: The code for this step is provided for you

# Creating a DataFrame with the PCA data
# Here we convert the array of values from our PCA analysis into a Pandas DataFrame. Because we previously reduced our
# data to 3 principal components, the new DataFrame has 3 columns, one for each component: PC1, PC2, PC3.
df_market_pca = pd.DataFrame(
    df_market_pca_data,
    columns=["PC1", "PC2", "PC3"])

# Copy the crypto names from the original data
# Here we copy the coin_id index of the scaled DataFrame into a new column of the PCA DataFrame.
df_market_pca["coin_id"] = df_market_data_scaled.index

# Set the coin_id column as index
# Here we set the coin_id column as the index
df_market_pca = df_market_pca.set_index("coin_id")

# Display sample data
# Here we display the new dataframe so we can see what we have going on.
df_market_pca.head()

| coin_id | PC1 | PC2 | PC3 |
| --- | --- | --- | --- |
| bitcoin | -0.600667 | 0.84276 | 0.461595 |
| ethereum | -0.458261 | 0.458466 | 0.952877 |
| tether | -0.43307 | -0.168126 | -0.641752 |
| ripple | -0.471835 | -0.22266 | -0.479053 |
| bitcoin-cash | -1.1578 | 2.041209 | 1.859715 |


---

### Find the Best Value for k Using the PCA Data

In this section, you will use the elbow method to find the best value for `k` using the PCA data.

1. Code the elbow method algorithm and use the PCA data to find the best value for `k`. Use a range from 1 to 11. 

2. Plot a line chart with all the inertia values computed with the different values of `k` to visually identify the optimal value for `k`.

3. Answer the following questions: What is the best value for k when using the PCA data? Does it differ from the best k value found using the original data?

In [279]:
# Create a list with the number of k-values to try
# Use a range from 1 to 11
# Here we make a list of the k values 1 through 10 (range excludes the stop value 11) and name it k.
k = list(range(1, 11))

In [280]:
# Create an empty list to store the inertia values
# This is an empty list called inertia that we can append to later.
inertia = []

In [281]:
# Create a for loop to compute the inertia with each possible value of k
# Inside the loop:
# 1. Create a KMeans model using the loop counter for the n_clusters
# 2. Fit the model to the data using `df_market_pca`
# 3. Append the model.inertia_ to the inertia list
# Here the for loop fits a KMeans model to df_market_pca for each value of k and then
# appends the resulting inertia to the inertia list.
for i in k:
    model = KMeans(n_clusters=i, random_state=0)
    model.fit(df_market_pca)
    inertia.append(model.inertia_)

In [282]:
# Create a dictionary with the data to plot the Elbow curve
# Here we have a dictionary called elbow_data_1, which pairs each value of k with the inertia that the
# for loop computed, so we can plot the elbow curve.
elbow_data_1 = {
    "k": k,
    "inertia": inertia
}

# Create a DataFrame with the data to plot the Elbow curve
# We use the pandas DataFrame constructor to create a new DataFrame from the k values and the inertia values
# that the previous for loop appended to the inertia list.
df_elbow_pca = pd.DataFrame(elbow_data_1)

In [283]:
# Plot a line chart with all the inertia values computed with 
# the different values of k to visually identify the optimal value for k.
# Here we plot the df_elbow_pca data so we can find the optimal value for k, or the optimal number of clusters.
# It looks like 4 is the optimal value for k.
df_elbow_pca.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)

#### Answer the following questions: What is the best value for k when using the PCA data? Does it differ from the best k value found using the original data?
* **Question:** What is the best value for `k` when using the PCA data?

  * **Answer:** The best value for `k` when using the PCA data is 4.


* **Question:** Does it differ from the best k value found using the original data?

  * **Answer:** No. The best value for `k` was 4 for both the original data and the PCA data.

---

### Cluster Cryptocurrencies with K-means Using the PCA Data

In this section, you will use the PCA data and the K-Means algorithm with the best value for `k` found in the previous section to cluster the cryptocurrencies according to the principal components.

1. Initialize the K-Means model with four clusters using the best value for `k`. 

2. Fit the K-Means model using the PCA data.

3. Predict the clusters to group the cryptocurrencies using the PCA data. View the resulting array of cluster values.

4. Add a new column to the DataFrame with the PCA data to store the predicted clusters.

5. Create a scatter plot using hvPlot by setting `x="PC1"` and `y="PC2"`. Color the graph points with the labels found using K-Means and add the crypto name in the `hover_cols` parameter to identify the cryptocurrency represented by each data point.

In [284]:
# Initialize the K-Means model using the best value for k
# Here we initialize the KMeans algorithm by creating an instance and setting the number of clusters, or groups
# we want the model to separate our data into, to 4.
model = KMeans(n_clusters=4)

In [285]:
# Fit the K-Means model using the PCA data
# After we create the instance, we fit (train) it on the data we pass as the argument, df_market_pca.
model.fit(df_market_pca)

KMeans(n_clusters=4)

In [286]:
# Predict the clusters to group the cryptocurrencies using the PCA data
# Next we call the predict method and pass it the df_market_pca DataFrame to get a cluster for each coin.
crypto_segments = model.predict(df_market_pca)
# View the resulting array of cluster values.
# Here we print the array of cluster labels that the KMeans algorithm came up with.
print(crypto_segments)

[0 0 3 3 0 0 0 0 0 3 3 3 3 0 3 0 3 3 0 3 3 0 3 3 3 3 3 3 0 3 3 3 1 0 3 3 2
 3 3 3 3]


In [287]:
# Create a copy of the DataFrame with the PCA data
# Here we are taking a copy of the df_market_pca dataframe and giving it a new name, df_market_pca_predictions.
df_market_pca_predictions = df_market_pca.copy()

# Add a new column to the DataFrame with the predicted clusters
# Here we add a column for the cluster labels that the KMeans algorithm just produced. We call it Crypto Segment.
df_market_pca_predictions["Crypto Segment"] = crypto_segments

# Display sample data
# Here we display the PCA DataFrame with the new column for the predictions. So we have reduced the dimensions of our data
# down to 3 and then had the KMeans algorithm assign each coin to one of 4 clusters.
df_market_pca_predictions.head()

Unnamed: 0_level_0,PC1,PC2,PC3,Crypto Segment
coin_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bitcoin,-0.600667,0.84276,0.461595,0
ethereum,-0.458261,0.458466,0.952877,0
tether,-0.43307,-0.168126,-0.641752,3
ripple,-0.471835,-0.22266,-0.479053,3
bitcoin-cash,-1.1578,2.041209,1.859715,0


In [288]:
# Create a scatter plot using hvPlot by setting 
# `x="PC1"` and `y="PC2"`. 
# Color the graph points with the labels found using K-Means and 
# add the crypto name in the `hover_cols` parameter to identify 
# the cryptocurrency represented by each data point.
# Here we create a scatter plot from the data we got by reducing the DataFrame to 3 dimensions, plotting PC1 against PC2.
df_market_pca_predictions.hvplot.scatter(x="PC1",
                                         y="PC2",
                                         by="Crypto Segment",
                                         hover_cols="coin_id",
                                         title="Scatter Plot - PCA Data")


---

### Visualize and Compare the Results

In this section, you will visually analyze the cluster analysis results by contrasting the outcome with and without using the optimization techniques.

1. Create a composite plot using hvPlot and the plus (`+`) operator to contrast the Elbow Curve that you created to find the best value for `k` with the original and the PCA data.

2. Create a composite plot using hvPlot and the plus (`+`) operator to contrast the cryptocurrencies clusters using the original and the PCA data.

3. Answer the following question: After visually analyzing the cluster analysis results, what is the impact of using fewer features to cluster the data using K-Means?

> **Rewind:** Back in Lesson 3 of Module 6, you learned how to create composite plots. You can look at that lesson to review how to make these plots; also, you can check [the hvPlot documentation](https://holoviz.org/tutorial/Composing_Plots.html).

In [289]:
# Composite plot to contrast the Elbow curves
# Here we have two elbow curve line charts. The one on the left is from the original data that was scaled. The one on the right is from the data we got after
# reducing the dimensions down to three. You can see that they are slightly different. That is because we gave up about 10.5% of the variance so that we could reduce the
# dimensions down to 3. The optimal value for k stayed the same, so I think it was a good use of PCA.
elbow_composite = df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k) + df_elbow_pca.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)

elbow_composite

In [290]:
# I wanted to show an overlay to make the difference between the two charts easier to see.
elbow = df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve",legend="top_right", xticks=k) * df_elbow_pca.hvplot.line(x="k", y="inertia", title="Elbow Curve PCA",legend="top_right", xticks=k)

elbow


In [291]:
# Composite plot to contrast the clusters
# Here is a scatter plot for both DataFrames: the original scaled data and the PCA data.
scatter_original = df_market_data_predictions.hvplot.scatter(
    x="price_change_percentage_24h",
    y="price_change_percentage_7d",
    by="Predictions",
    hover_cols="coin_id",
    title="Scatter Plot - k-means")

scatter_pca = df_market_pca_predictions.hvplot.scatter(
    x="PC1",
    y="PC2",
    by="Crypto Segment",
    hover_cols="coin_id",
    title="Scatter Plot - PCA Data")

scatter_composite = scatter_original + scatter_pca


scatter_composite

#### Answer the following question: After visually analyzing the cluster analysis results, what is the impact of using fewer features to cluster the data using K-Means?

  * **Question:** After visually analyzing the cluster analysis results, what is the impact of using fewer features to cluster the data using K-Means?

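  * **Answer:** The main difference I can see is that the axis range is much larger in the PCA plot than in the plot of the original scaled data. Note that the two plots do not share axes: the original plot shows the scaled 24h and 7d price changes, while the PCA plot shows PC1 and PC2. In the original plot the clusters sit much closer together, whereas in the PCA plot they are more spread out, particularly along the y-axis. The PCA plot has an outlier at about 8 on the y-axis, where the outlier in the original plot is at about 1.92. The printed cluster arrays also show that every cryptocurrency ends up in the same group in both runs, with only the labels 0 and 3 swapped, so using fewer features did not change the cluster membership here.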
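
Because K-Means cluster numbers are arbitrary, comparing the two label arrays element by element can be misleading. One possible check, sketched below, uses scikit-learn's `adjusted_rand_score`, which ignores how clusters are numbered. The two arrays are copied from the printed outputs earlier in this notebook; since the models there were created without `random_state`, a rerun may number the clusters differently.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Cluster labels printed for the scaled data (copied from the notebook output)
labels_scaled = np.array([
    3, 3, 0, 0, 3, 3, 3, 3, 3, 0, 0, 0, 0, 3, 0, 3, 0, 0, 3, 0, 0,
    3, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 3, 0, 0, 2, 0, 0, 0, 0,
])

# Cluster labels printed for the PCA data (copied from the notebook output)
labels_pca = np.array([
    0, 0, 3, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3,
    0, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 1, 0, 3, 3, 2, 3, 3, 3, 3,
])

# A score of 1.0 means the two clusterings group the coins identically,
# regardless of which number each cluster happens to carry
score = adjusted_rand_score(labels_scaled, labels_pca)
print(score)
```

A score of 1.0 here indicates that the three principal components were enough to reproduce the grouping found with all seven scaled features.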