# How to Choose the Number of Clusters

Using the same code as in the previous exercise, find the within-cluster sum of squares (WCSS) for clustering solutions with 1 to 10 clusters (you can try more if you wish).

Find the most suitable solutions, run them and compare the results.

## Import the relevant libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

## Load the data

Load data from the csv file: <i> 'Categorical.csv'</i>.

In [None]:
countries=pd.read_csv(r'/kaggle/input/categorical-country-geotags/Categorical.csv')
countries.head()

## Plot the data

Plot the <i>'Longitude'</i> and <i>'Latitude'</i> columns.

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x="Longitude",y="Latitude",data=countries)
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

## Select the features

Make sure to select the appropriate features: we are no longer clustering on the categorical variable, but on <i>'Longitude'</i> and <i>'Latitude'</i>.

In [None]:
x = countries.iloc[:,1:3]
x.head()
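
Positional slicing with `iloc` depends on the column order in the file. A minimal sketch of selecting the same features by name is shown below; the `sample` frame and its values are hypothetical stand-ins, assuming the columns are named 'Longitude' and 'Latitude' as in the plotting cell above.

```python
import pandas as pd

# Hypothetical rows standing in for the real file (illustrative values only).
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "Longitude": [-69.98, 66.00, 17.54],
    "Latitude": [12.52, 33.84, -12.29],
    "continent": ["North America", "Asia", "Africa"],
})

# Select the two features by name instead of by position.
features = sample[["Longitude", "Latitude"]]
print(features.head())
```

Selecting by name keeps working if columns are added or reordered in the file, whereas a positional slice would then pick up different columns without raising an error.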

## Clustering

Use 4 clusters initially.

In [None]:
kmeans = KMeans(4)
kmeans.fit(x)
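
K-means starts from randomly chosen centroids, so cluster labels (and sometimes the clusters themselves) can change between runs. The sketch below shows one way to make a run repeatable by fixing `random_state`; the `points` array is synthetic data standing in for the longitude/latitude features, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates standing in for the Longitude/Latitude features.
rng = np.random.default_rng(42)
points = rng.uniform(low=[-180, -90], high=[180, 90], size=(80, 2))

# Same data, same seed, same number of restarts: the labels should match.
first = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(points)
second = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(points)

print(np.array_equal(first, second))
```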

### Clustering results

In [None]:
clusters=kmeans.fit_predict(x)

clusters

In [None]:
clusters_data=countries.copy()
clusters_data['Cluster_4']=clusters
clusters_data.head()

Plot the data once again, this time coloring each point by the cluster it was assigned to.

## Plot the data

In [None]:
clusters=kmeans.fit_predict(x)
clusters_data['Cluster_4']=clusters
plt.figure(figsize=(8,8))
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_4'],cmap='rainbow')
plt.title("Data with 4 clusters")
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()


## Selecting the number of clusters

### WCSS

Use the built-in <i>sklearn</i> attribute <i>'inertia_'</i>, which is available on the model after fitting.

In [None]:
kmeans.inertia_
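
WCSS is the sum of squared distances from each observation to the centroid of its assigned cluster. As a sanity check, the sketch below recomputes it by hand and compares it with `inertia_`; the `points` array is synthetic and only stands in for the real features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates standing in for the Longitude/Latitude features.
rng = np.random.default_rng(0)
points = rng.uniform(low=[-180, -90], high=[180, 90], size=(60, 2))

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each observation,
# then sum the squared distances to those centroids.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_wcss = ((points - assigned_centroids) ** 2).sum()

print(manual_wcss, model.inertia_)
```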

Write a loop that calculates and saves the WCSS for any number of clusters from 1 up to 10 (or more if you wish).

In [None]:
wcss = []
# 'cl_num' is the exclusive upper bound on the number of clusters to evaluate.
# Note that 'range' doesn't include the upper boundary, so 11 covers 1 to 10 clusters.
cl_num = 11
for i in range(1, cl_num):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

In [None]:
wcss

### The Elbow Method

In [None]:
number_clusters = range(1,cl_num)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
plt.show()
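
The elbow is usually read off the plot by eye. As a rough numerical aid, one heuristic (an assumption here, not something the exercise prescribes) is to look for the number of clusters where the decrease in WCSS slows down the most, that is, where the second difference of the curve is largest. The sketch below uses synthetic blobs from `make_blobs` rather than the country data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three blobs, used only to illustrate the heuristic.
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

ks = list(range(1, 11))
wcss_values = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    for k in ks
]

# second_diff[j] is centered on ks[j + 1]; a large value means the curve
# bends sharply there (big drop before, small drop after).
second_diff = np.diff(wcss_values, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

print(elbow_k)
```

Treat the result as a suggestion to check against the plot: on real data the curve often has no single sharp bend, and several neighboring values may be equally defensible.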

Based on the elbow curve, plot several graphs using the numbers of clusters you believe would best fit the data.

Compare the scatter plots to determine which one to use in our further analysis. 

<i>Hint: we already created the scatter plot for 4 clusters, so we only have to slightly alter our code.</i>

In [None]:
kmeans = KMeans(2)
kmeans.fit(x)

In [None]:

clusters=kmeans.fit_predict(x)
clusters_data['Cluster_2']=clusters
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_2'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 2 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

In [None]:
kmeans = KMeans(3)
kmeans.fit(x)

In [None]:

clusters=kmeans.fit_predict(x)
clusters_data['Cluster_3']=clusters
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_3'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 3 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

In [None]:
clusters_data.head()

In [None]:
plt.figure(figsize=(25, 20))
plt.subplot(2,1,1)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_2'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 2 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.subplot(2,1,2)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_3'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 3 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)

plt.show()
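
The cells above repeat the same fit-and-label steps for each number of clusters. A minimal sketch of wrapping those steps in a helper is shown below; `add_cluster_labels` is a hypothetical name, and the `demo` frame holds random coordinates rather than the country data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def add_cluster_labels(data, feature_cols, k, random_state=0):
    """Return a copy of `data` with a 'Cluster_<k>' column of k-means labels."""
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labelled = data.copy()
    # Fit only on the feature columns so earlier label columns are ignored.
    labelled[f"Cluster_{k}"] = model.fit_predict(data[feature_cols])
    return labelled


# Random coordinates standing in for the Longitude/Latitude columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Longitude": rng.uniform(-180, 180, 50),
    "Latitude": rng.uniform(-90, 90, 50),
})

for k in (2, 3, 4):
    demo = add_cluster_labels(demo, ["Longitude", "Latitude"], k)

print(demo.head())
```

Each resulting `Cluster_<k>` column can then be passed to `plt.scatter` through the `c` argument, in the same way as the plotting cells above.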