<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Customer Clustering with KMeans to Boost Business Strategy


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use the KMeans algorithm to cluster customer data.</p>


## __Table of Contents__
<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
      <li>
        <a href="#Task-1---Load-the-data-in-a-csv-file-into-a-dataframe">Task 1 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Task-2---Decide-how-many-clusters-to-create">Task 2 - Decide how many clusters to create
        </a>
      </li>
      <li>
        <a href="#Task-3---Create-a-clustering-model">Task 3 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Task-4---Make-Predictions">Task 4 - Make Predictions
        </a>
      </li>        
    </ol>
  </li>

  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 1 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Decide-how-many-clusters-to-create">Exercise 2 - Decide how many clusters to create
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Create-a-clustering-model">Exercise 3 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Exercise-4---Make-Predictions">Exercise 4 - Make Predictions
        </a>
      </li>
    </ol>
  </li>
</ol>














## Objectives

After completing this lab you will be able to:

 - Use Pandas to load data sets.
 - Use the KMeans algorithm to cluster the data.



## Datasets

In this lab you will be using the following datasets:

 - A modified version of the Wholesale customers dataset. The original dataset is available at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
 - The Seeds dataset, available at https://archive.ics.uci.edu/ml/datasets/seeds
 


## Setup


For this lab, we will be using the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine-learning-pipeline related functions.


### Installing Required Libraries

The following required libraries are pre-installed in the Skills Network Labs environment. However, if you run this notebook in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `!pip` in the code cell below.


In [ ]:
# All libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented out.
# !pip install pandas==1.3.4
# !pip install scikit-learn==0.20.1




### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [ ]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

## Clustering demo with generated sample data


In [ ]:
# Generate sample data for clustering
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# X now contains 300 rows of data, generated by the make_blobs function and spread across 4 clusters.
# In real life we would use an existing dataset.
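
To make the shape of the generated data concrete, here is a small check that repeats the `make_blobs` call from the cell above and inspects what comes back. The names `X_check` and `y_check` are illustrative and are not used elsewhere in this lab.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the demo cell above, stored under illustrative names
X_check, y_check = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

print(X_check.shape)          # (300, 2): 300 points with 2 features each
print(np.bincount(y_check))   # number of points generated for each of the 4 blobs
```

The second return value holds the blob each point was generated from. KMeans never sees it; the demo only uses `X`.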

In [ ]:
# Apply k-means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

In [ ]:
# Print cluster centers
kmeans.cluster_centers_
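
To see what `cluster_centers_` holds, the following minimal sketch recomputes each center by hand as the mean of the points assigned to it. It uses hypothetical, deliberately well-separated synthetic data (not the lab's data), and the names `X_demo`, `model`, and `manual_centers` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: three well-separated blobs (not the lab's dataset)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Recompute each center as the mean of the points that carry that label
manual_centers = np.array(
    [X_demo[model.labels_ == k].mean(axis=0) for k in range(3)]
)

# True when the hand-computed means match the fitted centers
print(np.allclose(manual_centers, model.cluster_centers_, atol=1e-6))
```

This is why the algorithm is called k-means: once fitting has converged, every center is the average of its own cluster.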

In [ ]:
# Plot the clusters and cluster centers
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', s=400, color='black')
plt.show()

End of Demo


# Examples


## Task 1 - Load the data in a csv file into a dataframe


In [ ]:
# the data set is available at the url below.
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv"

# using the read_csv function in the pandas library, we load the data into a dataframe.

df = pd.read_csv(URL)
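
`read_csv` accepts URLs, local paths, and file-like objects alike. The sketch below shows the same call on a small in-memory CSV, which can be handy for trying things out offline. The column names and values are invented for illustration and are not taken from the lab's file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = "Fresh,Milk,Grocery\n100,200,300\n400,500,600\n"

# read_csv treats the first line as the header row by default
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # column types inferred by pandas
```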

Let's look at some sample rows from the dataset we loaded:


In [ ]:
# show 5 random rows from the dataset
df.sample(5)

Let's find out the number of rows and columns in the dataset:


In [ ]:
df.shape

Let's plot the histograms of all columns:


In [ ]:
df.hist()

## Task 2 - Decide how many clusters to create


You must tell the KMeans algorithm how many clusters to create from your data.


In [ ]:
number_of_clusters = 3
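
This lab simply fixes the number of clusters. If you need a way to choose it, one common heuristic (not prescribed by this lab) is the elbow method: fit KMeans for several values of k and look for the point where `inertia_` stops dropping sharply. The sketch below runs on synthetic stand-in data with three obvious groups; `X_elbow` and `inertias` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 3 well-separated groups (assumption, not the lab's data)
X_elbow, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_elbow)
    inertias[k] = km.inertia_  # sum of squared distances to the nearest center

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia falls steeply up to k = 3 and only slightly afterwards, which suggests 3 clusters. Real customer data rarely shows such a clean elbow, so treat the result as a hint rather than a rule.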

## Task 3 - Create a clustering model


Create a KMeans clustering model


In [ ]:
cluster = KMeans(n_clusters = number_of_clusters)

Train the model on the dataset


In [ ]:
result = cluster.fit_transform(df)
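
`fit_transform` trains the model and returns, for every row, its distance to each cluster center. The following sketch illustrates this on hypothetical toy data (the names `X_toy`, `toy_model`, and `distances` are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 60 points in three well-separated blobs
X_toy, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

toy_model = KMeans(n_clusters=3, n_init=10, random_state=0)
distances = toy_model.fit_transform(X_toy)

print(distances.shape)  # one row per sample, one column per cluster

# The nearest center for each row should agree with the fitted labels
closest = distances.argmin(axis=1)
print(np.array_equal(closest, toy_model.labels_))
```

If you only need the trained model and not the distance matrix, calling `fit` alone is enough.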

Your model is now trained. Print the cluster centers:


In [ ]:
cluster.cluster_centers_

## Task 4 - Make Predictions


Make the predictions and save them into the column "cluster_number"


In [ ]:
df['cluster_number'] = cluster.predict(df)

In [ ]:
df.sample(5)

Print the cluster numbers and the number of customers in each cluster


In [ ]:
df.cluster_number.value_counts()
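
The predict-and-count pattern can be tried end to end on a tiny table. The sketch below uses an invented two-column DataFrame; the column names `spend_a` and `spend_b`, the values, and the `features` list are all hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical two-feature customer table (values invented for illustration)
toy_df = pd.DataFrame({
    "spend_a": [1, 2, 1, 50, 52, 51],
    "spend_b": [1, 1, 2, 50, 49, 51],
})

features = ["spend_a", "spend_b"]
toy_cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_df[features])

# Predict from the feature columns only, then store the labels
toy_df["cluster_number"] = toy_cluster.predict(toy_df[features])
print(toy_df["cluster_number"].value_counts())

# A new, unseen row is assigned to the nearest existing center
new_row = pd.DataFrame({"spend_a": [48], "spend_b": [50]})
print(toy_cluster.predict(new_row))
```

Selecting the feature columns explicitly keeps the cell safe to re-run: once `cluster_number` has been added, passing the whole DataFrame to `predict` would hand the model one column more than it was trained on.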

# Exercises


### Exercise 1 - Load the data in a csv file into a dataframe


In [ ]:
URL2 = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv"


Load the seeds dataset available at URL2


In [ ]:
df2 = # TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the read_csv function

</details>


<details>
    <summary>Click here for Solution</summary>

```python
df2 = pd.read_csv(URL2)
```

</details>


### Exercise 2 - Decide how many clusters to create


Cluster the data into 4 clusters


In [ ]:
number_of_clusters = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Set the variable number_of_clusters
</details>


<details>
    <summary>Click here for Solution</summary>

```python
number_of_clusters = 4
```

</details>


### Exercise 3 - Create a clustering model


Create a clustering model and train it using the data in the dataframe


In [ ]:
cluster = #TODO
result = #TODO

<details>
    <summary>Click here for a Hint</summary>
    
Use the fit_transform method of KMeans
</details>


<details>
    <summary>Click here for Solution</summary>

```python
cluster = KMeans(n_clusters = number_of_clusters)
result = cluster.fit_transform(df2)
```

</details>


In [ ]:
print(cluster.cluster_centers_)

### Exercise 4 - Make Predictions


Make the predictions and save them into the column "cluster_number"


In [ ]:
#your code goes here

<details>
    <summary>Click here for a Hint</summary>
    
Use cluster.predict
</details>


<details>
    <summary>Click here for Solution</summary>

```python
df2['cluster_number'] = cluster.predict(df2)

```

</details>


In [ ]:
df2.sample(5)

Print the cluster numbers and the number of seeds in each cluster


In [ ]:
#your code goes here

<details>
    <summary>Click here for a Hint</summary>
    
Use the value_counts() method on the cluster_number column
</details>


<details>
    <summary>Click here for Solution</summary>

```python
df2.cluster_number.value_counts()

```

</details>


Congratulations! You have completed this lab.<br>


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/)


## Contributors
[Vicky Kuo](https://author.skills.network/instructors/vicky_kuo)


Copyright © 2023 IBM Corporation. All rights reserved.


<!-- ## Change Log
-->


<!--
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-04-14|0.1|Ramesh Sannareddy|Initial Version Created|
|2023-06-20|0.3|Vicky Kuo|Proofreading|
-->
