Import the following libraries and modules:

- `pickle`
- `warnings`
- `numpy` as `np`
- `pandas` as `pd`
- `seaborn` as `sns`
- from `sklearn.cluster` import `KMeans`
- from `sklearn.metrics` import `silhouette_score`


In [None]:
# To suppress the warnings in the notebook
warnings.filterwarnings("ignore")

##### Step 0:

- Read the CSV file `mall_costumers.csv` and store it in the variable `df`
- Drop all columns except ``income`` and ``spending_score``
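
A minimal sketch of Step 0. The real file is not available here, so the sketch reads a small in-memory CSV instead; the extra column names (`customer_id`, `gender`, `age`) and all values are invented for illustration only.

```python
import io

import pandas as pd

# Stand-in for `mall_costumers.csv`: hypothetical extra columns and made-up values.
csv_text = """customer_id,gender,age,income,spending_score
1,Male,19,15,39
2,Female,21,15,81
3,Female,20,16,6
4,Male,35,120,79
"""

# In the notebook this would be pd.read_csv("mall_costumers.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Selecting the two columns to keep is equivalent to dropping all the others.
df = df[["income", "spending_score"]]
print(df.head())
```

Selecting the wanted columns avoids having to list every column to drop, which helps when the exact set of other columns is not known in advance.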

##### Step 1:
- Check the data health.
    - Are there any missing values? See [hint](https://stackoverflow.com/questions/26266362/how-do-i-count-the-nan-values-in-a-column-in-pandas-dataframe)
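
One possible way to check for missing values is `isna().sum()`, which counts the `NaN` entries per column. The small dataframe below is made up, with one value deliberately missing, so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing income, for illustration only.
df = pd.DataFrame(
    {
        "income": [15.0, np.nan, 16.0, 120.0],
        "spending_score": [39, 81, 6, 79],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Total number of missing values in the whole dataframe.
total_missing = int(missing_per_column.sum())
print(total_missing)
```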

##### Step 2:

Since this is an unsupervised learning algorithm, we do not have to split the data into `X` and `y` (and subsequently into a train dataset and a test dataset). In this particular problem, we need to identify clusters of mall clients with respect to their monthly ``income`` and ``spending_score``.

Here, our first challenge is to find the optimum number of clusters, which the shopping mall manager has not given us. So,

- we can either find it by trial and error,
- or by combining an `elbow plot` with the `silhouette_score`.

##### Elbow Plot

- Make a sequence of numbers from `2` to `15` and store it in the variable `list_clusters`
- Make an empty list and store it in the variable `list_within_cluster_sum_of_squared`
- Iterate over `list_clusters` such that:
    - Create a `KMeans()` instance with `n_clusters` set to the current number, and store it in the variable `kmeans`
    - Call `kmeans.fit()` on the dataframe `df` and store the result in the variable `kmeans`
    - Append `kmeans.inertia_` to `list_within_cluster_sum_of_squared`

- Make a dataframe `elbow` with two columns
    - `clusters` having the `list_clusters`
    - `within_cluster_sum_of_squared` having the `list_within_cluster_sum_of_squared`

- Plot a line plot using `seaborn`
    - x = `clusters`
    - y = `within_cluster_sum_of_squared`
    - data = `elbow`
    - marker= `+`

Where is the elbow in the graph? Which number on the x-axis does it correspond to?

##### Silhouette Plot

- Make an empty list and store in variable `list_silhouette_scores`
- Iterate over the sequence of numbers such that:
    - Create a `KMeans()` instance with `n_clusters` set to the current number and `random_state=200`, and store it in the variable `kmeans`
    - Perform `kmeans.fit()` on the dataframe `df` and store in the variable `kmeans`
    - Make a variable `label` to store `kmeans.labels_`
    - Call the `silhouette_score()` and store output in a variable `score`. Set:
        - `X` = `df`
        - `labels` = `label`
        - `metric` = `euclidean`
        - `random_state` = `200`
        - `sample_size` = 1000
    - Append the `score` in `list_silhouette_scores`. 

- Make a dataframe `silhouette` with two columns
    - `clusters` having the `list_clusters`
    - `silhouette_scores` having the `list_silhouette_scores`

- Plot a line plot using `seaborn`
    - x = `clusters`
    - y = `silhouette_scores`
    - data = `silhouette`
    - marker= `+`
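
A sketch of the silhouette computation described above, again on synthetic blobs rather than the mall data. `n_init=10` is an addition made here, and the plotting call is left out to keep the sketch focused on how the scores are collected.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the mall data: five artificial blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["income", "spending_score"])

list_clusters = list(range(2, 16))
list_silhouette_scores = []

for n in list_clusters:
    # n_init is an assumption added here; random_state=200 follows the text.
    kmeans = KMeans(n_clusters=n, n_init=10, random_state=200)
    kmeans = kmeans.fit(df)
    label = kmeans.labels_
    score = silhouette_score(
        df,
        label,
        metric="euclidean",
        sample_size=1000,
        random_state=200,
    )
    list_silhouette_scores.append(score)

silhouette = pd.DataFrame(
    {"clusters": list_clusters, "silhouette_scores": list_silhouette_scores}
)

# Number of clusters with the highest silhouette score.
best_n = int(silhouette.loc[silhouette["silhouette_scores"].idxmax(), "clusters"])
print(best_n)
```

Silhouette scores lie between -1 and 1, and higher values indicate tighter, better separated clusters.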

For which number of clusters do you find the highest score? Does it match what you deduced from the elbow plot? Write down the optimum number of clusters you found in a markdown cell below.

##### Step 3:

- Create a `KMeans()` instance with `n_clusters` set to the chosen number of clusters, and store it in a variable `model`
- Fit the `model` on the dataframe `df`, and store in variable `kmeans_model`.
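
A minimal sketch of Step 3 on synthetic data. The value `n_clusters=5` is only what suits these artificial blobs, not an answer for the mall dataset, and `n_init` and `random_state` are additions made here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data: five artificial blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["income", "spending_score"])

# Hypothetical choice of 5 clusters; use the number found in Step 2 instead.
model = KMeans(n_clusters=5, n_init=10, random_state=200)

# fit() returns the same estimator object, now fitted.
kmeans_model = model.fit(df)
print(kmeans_model.cluster_centers_)
```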

In [None]:
df['clusters'] = kmeans_model.labels_

##### Step 4:

- Plot a scatter plot using `seaborn`
    - x = `spending_score`
    - y = `income`
    - hue = `clusters`
    - data= `df`
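
The scatter plot can be sketched as below. The data is synthetic and the model is fitted inside the sketch only so that it runs on its own; in the notebook, `df` already carries the `clusters` column.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data: five artificial blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["income", "spending_score"])

# Hypothetical model with 5 clusters, fitted here to make the sketch self-contained.
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=200).fit(df)
df["clusters"] = kmeans_model.labels_

# One point per client, coloured by the assigned cluster.
ax = sns.scatterplot(x="spending_score", y="income", hue="clusters", data=df)
```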

#### Discussion

You see that the clusters are color-coded. What do you deduce from the cluster on the top-right? Similarly, what do you deduce from the other clusters? Write your answer in a markdown cell below.

What incentives should be offered to the clients in the bottom-left clusters?

##### Step 5:

Save the model with filename `model_kmeans.pkl`. You can use it later.
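
One way to save the model is with `pickle`. In this sketch the model is fitted on synthetic data, and the file is written to a temporary directory so that nothing is left in the working directory; in the notebook, opening `"model_kmeans.pkl"` directly would be enough.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data: five artificial blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["income", "spending_score"])

# Hypothetical fitted model standing in for `kmeans_model` from Step 3.
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=200).fit(df)

# The temporary directory is an assumption of this sketch, not part of the exercise.
model_path = Path(tempfile.mkdtemp()) / "model_kmeans.pkl"

# Pickle files must be opened in binary write mode.
with open(model_path, "wb") as file:
    pickle.dump(kmeans_model, file)

print(model_path.name)
```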

##### Real World Use-Case:

Now that you have trained a clustering model, let us test it on a real-world use case: assigning new client profiles to the relevant cluster, so that the mall manager can offer them the same incentives as those identified above.

- Read the CSV file `mall_costumers_rwi.csv` and store it in the variable `df_rwi`
- Drop all columns except ``income`` and ``spending_score``

- Check the data health.
    - Are there any missing values?

Load the saved model `model_kmeans.pkl` and store it in variable `loaded_model`. 
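
A sketch of loading the pickled model. To be runnable on its own it first fits and saves a model on synthetic data in a temporary directory; the three new profiles in `df_rwi` are invented stand-ins for `mall_costumers_rwi.csv`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the training data: five artificial blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=["income", "spending_score"])
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=200).fit(df)

# Save first, so that there is a file to load (temporary location is an assumption).
model_path = Path(tempfile.mkdtemp()) / "model_kmeans.pkl"
with open(model_path, "wb") as file:
    pickle.dump(kmeans_model, file)

# Load the model back; pickle files must be opened in binary read mode.
with open(model_path, "rb") as file:
    loaded_model = pickle.load(file)

# Invented new profiles with the same two columns used for training.
df_rwi = pd.DataFrame({"income": [18, 88, 57], "spending_score": [22, 83, 49]})
y_predict_rwi = loaded_model.predict(df_rwi)
print(y_predict_rwi)
```

The new data must have the same columns, in the same order, as the data the model was fitted on. Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.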

In [None]:
y_predict_rwi = loaded_model.predict(df_rwi)

df_rwi["clusters"] = y_predict_rwi

df_rwi.head()

Repeat Step 4 to visualize the clusters. What do you conclude from the dataframe and the visualization? Which cluster has the most clients? Write your answer in a markdown cell below.