# Clustering Categorical Data 

You are given a larger set of country data. Using the same methodology as in the lecture, group all the countries into 2 clusters.

<b>Already done that? Okay!</b>

There are two other features: name and continent.

Encode the continent feature and use it in the clustering solution. Think about how this differs from the previous exercise.

## Import the relevant libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

## Load the data

Load the data from the CSV file <i>'Categorical.csv'</i>.

In [None]:
countries=pd.read_csv(r'/kaggle/input/categorical-country-geotags/Categorical.csv')
countries.head()

## Map the data

Use the <i>'continent'</i> category for this analysis.

In [None]:
# Replace each continent name with an integer code (the numbering is arbitrary)
countries['continent'] = countries['continent'].map({
    'North America': 0, 'Europe': 1, 'Asia': 2, 'Africa': 3,
    'South America': 4, 'Oceania': 5,
    'Seven seas (open ocean)': 6, 'Antarctica': 7,
})
countries.head()
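
One thing to keep in mind about the mapping above: integer codes give the continents an order and a distance (Africa at 3 looks closer to Asia at 2 than to North America at 0), and k-means takes those distances literally. An alternative that this exercise does not use is one-hot encoding, where each continent gets its own 0/1 column. A minimal sketch, assuming a small made-up DataFrame `toy` rather than the real file:

```python
import pandas as pd

# Hypothetical stand-in for the 'continent' column (not rows from Categorical.csv)
toy = pd.DataFrame({"continent": ["Europe", "Asia", "Africa", "Europe"]})

# One 0/1 column per continent; every pair of different continents is equally far apart
one_hot = pd.get_dummies(toy["continent"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column of its continent, so no continent ends up numerically "between" two others.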

## Select the features

In [None]:
# Keep only the fourth column (expected to be the encoded 'continent') as a one-column DataFrame
x = countries.iloc[:,3:4]

## Clustering

Use 4 clusters initially.

In [None]:
kmeans = KMeans(4)
kmeans.fit(x)

## Clustering results

In [None]:
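# fit_predict fits the model on x and returns one cluster label per row
clusters=kmeans.fit_predict(x)
# Work on a copy so the loaded data stays unchanged
clusters_data=countries.copy()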

In [None]:
clusters_data['Cluster_4']=clusters
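
A note on the calls used here: `fit_predict` fits the model and returns the labels in a single call, so a separate `fit` beforehand is not strictly required. K-means also starts from randomly chosen centroids, so the cluster numbers can differ between runs; fixing `random_state` makes a run repeatable. A minimal sketch on made-up continent codes (the `toy` frame and the parameter values are assumptions, not part of the exercise):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical encoded continents: four distinct codes, two countries each
toy = pd.DataFrame({"continent": [0, 0, 1, 1, 2, 2, 3, 3]})

# n_init and random_state are set explicitly so repeated runs give the same labels
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(toy[["continent"]])  # fit and label in one step

toy["Cluster_4"] = labels
print(toy)
```

With four distinct codes and four clusters, countries sharing a code should end up in the same cluster, although which number each cluster receives is arbitrary.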

## Plot the data

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_4'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 4 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

Since you already have all the code necessary, go back and play around with the number of clusters. Try 3, 7 and 8 and see if the results match your expectations. 

Simply go back to the beginning of the <b>Clustering</b> section and change <i>kmeans = KMeans(4)</i> to <i>kmeans = KMeans(3)</i>. Then run the remaining cells until the end.
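
Instead of editing and re-running the cells by hand, the same experiment can be written as a loop over the cluster counts. This is only a sketch: `toy`, `x_toy`, and `results` are made-up names and data standing in for `countries`, `x`, and `clusters_data`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical encoded continents (eight distinct codes, as in the mapping)
toy = pd.DataFrame({"continent": [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7]})
x_toy = toy[["continent"]]

results = toy.copy()
for k in (3, 4, 7, 8):
    # Fit a fresh model per k and store its labels in a column named after k
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    results[f"Cluster_{k}"] = model.fit_predict(x_toy)

# Number of distinct labels found in each Cluster_k column
print(results.filter(like="Cluster_").nunique())
```

Storing each result in its own `Cluster_k` column mirrors what the cells in this notebook do one at a time.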

## Clustering (3 clusters)

In [None]:
kmeans = KMeans(3)
kmeans.fit(x)

## Clustering results (3 clusters)

In [None]:
clusters=kmeans.fit_predict(x)
clusters_data['Cluster_3']=clusters
plt.figure(figsize=(8,8))
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_3'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 3 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

## Clustering (7 clusters)

In [None]:
kmeans = KMeans(7)
kmeans.fit(x)

## Clustering results (7 clusters)

In [None]:
clusters=kmeans.fit_predict(x)
clusters_data['Cluster_7']=clusters
plt.figure(figsize=(8,8))
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_7'],cmap='rainbow')
plt.title("Data with 7 clusters") 
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

## Clustering (8 clusters)

In [None]:
kmeans = KMeans(8)
kmeans.fit(x)

## Clustering results (8 clusters)

In [None]:
clusters=kmeans.fit_predict(x)
clusters_data['Cluster_8']=clusters
plt.figure(figsize=(8,8))
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_8'],cmap='rainbow')
plt.title("Data with 8 clusters") 
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

Combine all four plots into a single figure for a summary.

In [None]:
plt.figure(figsize=(25, 10))
plt.subplot(2,2,1)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_4'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 4 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.subplot(2,2,2)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_3'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 3 clusters")
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.subplot(2,2,3)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_7'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 7 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.subplot(2,2,4)
plt.scatter(countries["Longitude"],countries["Latitude"],c=clusters_data['Cluster_8'],cmap='rainbow')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title("Data with 8 clusters") 
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()
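
The summary figure repeats the same plotting lines for every panel, which makes it easy for a title to drift out of sync with the column being plotted. One way to avoid that is to build the panels in a loop and derive both the column name and the title from the same `k`. A minimal sketch, assuming a made-up `toy` frame in place of `clusters_data`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for clusters_data (invented coordinates and labels)
toy = pd.DataFrame({
    "Longitude": [-100.0, 10.0, 100.0, 20.0],
    "Latitude": [40.0, 50.0, 30.0, 0.0],
    "Cluster_3": [0, 1, 2, 1],
    "Cluster_4": [0, 1, 2, 3],
})

ks = [3, 4]
fig, axes = plt.subplots(1, len(ks), figsize=(12, 5))
for ax, k in zip(axes, ks):
    # Column name and title both come from k, so they cannot disagree
    ax.scatter(toy["Longitude"], toy["Latitude"], c=toy[f"Cluster_{k}"], cmap="rainbow")
    ax.set_title(f"Data with {k} clusters")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_xlim(-180, 180)
    ax.set_ylim(-90, 90)
plt.show()
```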