# Country clustering with hdbscan
If the country information in the Doodle competition seems too specific and the [continent](https://www.kaggle.com/qlasty/localization-context-country-continent) information too general, why not try something in between: cluster countries to take advantage of possible cultural context. This notebook proposes a simple clustering based on geographic location.

In [None]:
!pip install hdbscan
!pip install pycountry-convert

from geopy.geocoders import Nominatim
import pycountry_convert
import geopandas as gp
import pandas as pd
import numpy as np
import hdbscan

Take country names along with their 2- and 3-letter codes, as provided by the pycountry_convert library.

In [None]:
valid_countries_dict = pycountry_convert.map_countries(cn_name_format="default")
valid_countries = [(k, v['alpha_2'], v['alpha_3']) for k, v in valid_countries_dict.items()]
countryDF = pd.DataFrame(valid_countries, columns=["country","alpha2","alpha3"])

Most countries appear twice (probably under two names, the official one and the English one), so let's remove the duplicates.

In [None]:
countryDF.drop_duplicates('alpha2', inplace=True)
# drop=True avoids keeping the old index as an extra 'index' column.
countryDF.reset_index(drop=True, inplace=True)

The following four country names need to be corrected manually so that the geopy library recognizes them:

In [None]:
countryDF.loc[countryDF['alpha3']=='BOL','country']='Bolivia'
countryDF.loc[countryDF['alpha3']=='VAT','country']='Vatican'
countryDF.loc[countryDF['alpha3']=='TWN','country']='Taiwan'
countryDF.loc[countryDF['alpha3']=='VIR','country']='Virgin Islands'

A function that finds the location of a country given its name. The returned latitude and longitude point to the center of the country. If the lookup fails, the function prints an error and falls back to the point [0, 90].

In [None]:
geoloc = Nominatim(user_agent="area_clustering")

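def getCoordinates(countryName):
    try:
        # geocode returns None when nothing is found, so the attribute
        # access below raises and we end up in the except branch.
        info = geoloc.geocode(countryName)
        return [info.latitude, info.longitude]
    except Exception:
        print('Error: Country {} not found'.format(countryName))
        # Fallback point (latitude 0, longitude 90) for failed lookups.
        return [0, 90]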
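
One thing to keep in mind with the fallback above: `[0, 90]` is a real point on the globe, so a failed lookup looks like a valid coordinate and gets clustered with its neighbours. A minimal sketch of an alternative, which returns NaN instead, is shown below. The `geocode` parameter and the `fake_geocode` stand-in are hypothetical names introduced here so that the example runs without network access; they are not part of the notebook.

```python
import math
from types import SimpleNamespace

def get_coordinates(country_name, geocode):
    """Return [latitude, longitude], or [nan, nan] if the lookup fails.

    `geocode` is any callable that returns an object with `.latitude`
    and `.longitude` attributes, or None when nothing is found.
    """
    try:
        info = geocode(country_name)
    except Exception as err:  # e.g. network errors or timeouts
        print('Error: lookup for {} failed: {}'.format(country_name, err))
        return [math.nan, math.nan]
    if info is None:
        print('Error: Country {} not found'.format(country_name))
        return [math.nan, math.nan]
    return [info.latitude, info.longitude]

# Hypothetical offline stand-in for a real geocoder (made-up coordinates).
fake_locations = {'Poland': SimpleNamespace(latitude=52.0, longitude=19.0)}
fake_geocode = fake_locations.get

found = get_coordinates('Poland', fake_geocode)
missing = get_coordinates('Atlantis', fake_geocode)
```

With this variant, rows containing NaN would have to be dropped or handled separately before clustering.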

In [None]:
coordinates = countryDF["country"].apply(getCoordinates)
countryDF = pd.concat([countryDF, pd.DataFrame(coordinates.tolist(), columns=['latitude', 'longitude'])], axis=1)
countryDF["latitude"] = countryDF["latitude"].apply(np.radians)
countryDF["longitude"] = countryDF["longitude"].apply(np.radians)

A function that groups countries based on their location. The "haversine" metric is the appropriate one for clustering latitude/longitude data.

In [None]:
def cluster_data(dataframe, min_cluster_size, min_samples):
    clus = hdbscan.HDBSCAN(metric='haversine', min_cluster_size=min_cluster_size, min_samples=min_samples)
    dataframe.loc[:,'groups'] = clus.fit_predict(dataframe[["latitude","longitude"]])
    # Labels start at 0, so the number of groups is the largest label plus one;
    # the label -1 marks unclustered objects.
    print('n_groups: {}, unclustered objects: {}'.format(dataframe['groups'].max() + 1, sum(dataframe['groups']==-1)))
    return dataframe

In [None]:
countryDF = cluster_data(dataframe=countryDF, min_cluster_size=5, min_samples=3)
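
To make the choice of metric more concrete, here is a small sketch of the haversine formula itself. It takes latitude/longitude in radians, which is why the coordinates were converted with `np.radians` earlier, and returns the great-circle distance as an angle. The city coordinates and the Earth radius are approximate values picked for illustration only.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance (in radians) between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(a))

# Approximate, hand-picked coordinates in degrees (latitude, longitude).
warsaw = np.radians([52.23, 21.01])
berlin = np.radians([52.52, 13.40])

angle = haversine(warsaw[0], warsaw[1], berlin[0], berlin[1])
distance_km = angle * EARTH_RADIUS_KM  # multiply by the radius to get kilometres
```

Unlike a plain Euclidean distance on degrees, this distance does not stretch near the poles and does not break at the ±180° meridian.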

Select the unclustered countries and cluster them again, this time with settings that allow smaller groups.

In [None]:
rec = countryDF.loc[countryDF['groups']==-1].copy()
rec = cluster_data(dataframe=rec, min_cluster_size=3, min_samples=1)

Merge the results of both clustering passes and display the group sizes.

In [None]:
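# Shift the second-pass labels so they do not collide with first-pass labels.
rec.loc[rec['groups']>=0,'groups'] += countryDF['groups'].max() + 1
# Write the second-pass labels back; rec keeps the original row index.
countryDF.loc[rec.index,'groups'] = rec['groups']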
countryDF['groups'].value_counts()
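
The merge step is easy to get wrong, so here is a toy sketch of the same idea on hand-made labels: labels from the second pass restart at 0 and must be shifted past the largest first-pass label before being written back. All values below are invented for illustration.

```python
import pandas as pd

# Toy first-pass labels; -1 means unclustered.
first = pd.Series([0, 1, -1, -1, 0, -1], name='groups')

# Toy second pass, run only on the unclustered rows. Its labels restart
# at 0, and its index still refers to the rows of `first`.
second = pd.Series([0, 0, -1], index=[2, 3, 5], name='groups')

# Shift clustered second-pass labels; leave -1 (still unclustered) as is.
offset = first.max() + 1
second_shifted = second.where(second < 0, second + offset)

# Write the second-pass labels back by index.
merged = first.copy()
merged.loc[second_shifted.index] = second_shifted
```

Rows 2 and 3 end up in a new group that follows the first-pass groups, while row 5 stays unclustered.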

Display the countries that remain unclustered.

In [None]:
countryDF[countryDF['groups']==-1]

In [None]:
# Shift all labels by one so that unclustered countries (-1) become group 0.
countryDF['groups']=countryDF['groups']+1

In [None]:
countryDF.head()

Save results to file.

In [None]:
countryDF.to_csv('area_mapping.csv', columns=['alpha3','groups'], index=False)

### Dict usage
How the file with country groups can be used as a dictionary.

In [None]:
our_mapping = pd.read_csv('area_mapping.csv')
our_mapping.head()

Unknown countries are assigned to group 0, the unclustered group.

In [None]:
our_dict = pd.Series(our_mapping.groups.values, index = our_mapping.alpha3).to_dict()

def get_area(iso_code):
    try:
        return our_dict[iso_code]
    except KeyError:
        return 0

In [None]:
print(get_area('POL'))
print(get_area('not valid'))
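
The same lookup can be written without `try`/`except` by using `dict.get` with a default value. The sketch below uses a small hand-made dictionary with invented group ids instead of the real `area_mapping.csv`, so that it runs on its own.

```python
# Hypothetical mapping with made-up group ids, standing in for our_dict.
area_by_code = {'POL': 3, 'DEU': 3, 'JPN': 7}

def get_area(iso_code, mapping=area_by_code):
    """Return the group of a country, or 0 (unclustered) for unknown codes."""
    return mapping.get(iso_code, 0)

known = get_area('POL')
unknown = get_area('not valid')
```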

### Visualize clustering result

In [None]:
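# Note: gp.datasets was removed in geopandas 1.0; with newer versions, pass the
# path of a downloaded Natural Earth file to gp.read_file instead.
world = gp.read_file(gp.datasets.get_path('naturalearth_lowres'))
world['groups']=0
world.head(10)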

In [None]:
for _id in range(len(world)):
    number=countryDF.index[countryDF['alpha3']==world.loc[_id,'iso_a3']]

    if len(number)>0:
        # Take the scalar group id of the first (only) matching row.
        world.at[_id,'groups'] = countryDF.loc[number[0],'groups']
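
The row-by-row loop can also be expressed as a single vectorized lookup. The sketch below shows the idea on two toy frames (`country_groups` and `world_toy` are hypothetical stand-ins for `countryDF` and `world`, with invented values): codes without a match fall back to group 0.

```python
import pandas as pd

# Toy stand-ins for countryDF and world, with made-up group ids.
country_groups = pd.DataFrame({'alpha3': ['POL', 'DEU', 'JPN'], 'groups': [3, 3, 7]})
world_toy = pd.DataFrame({'iso_a3': ['JPN', 'XXX', 'POL']})

# Series indexed by ISO code, used as a lookup table.
lookup = country_groups.set_index('alpha3')['groups']

# Map every code to its group; unmatched codes become NaN, then 0.
world_toy['groups'] = world_toy['iso_a3'].map(lookup).fillna(0).astype(int)
```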

In [None]:
world.head(10)

Show the country groups.

In [None]:
n_groups = max(our_mapping.groups)+1

The only purpose of shuffling the group ids is to spread them around the world map, so that neighbouring groups do not get similar colors, which could be confusing.

In [None]:
shuffled = np.random.permutation(n_groups)
categories_dict = {_id: shuffled[_id]  for _id in range(n_groups)}
world['groups'] = world['groups'].replace(categories_dict)
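
A short sketch of why the shuffle is safe: a permutation maps every group id to exactly one new id, so countries that shared a group before still share one afterwards. The seed and the toy labels are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for repeatability
n_groups = 5

# A permutation of 0..n_groups-1 gives a one-to-one relabeling.
shuffled = rng.permutation(n_groups)
relabel = {old: int(new) for old, new in enumerate(shuffled)}

# Toy group labels, invented for illustration.
groups = pd.Series([0, 1, 1, 4, 2, 3, 0])
recolored = groups.map(relabel)
```

Note that group 0 (the unclustered countries) is relabeled as well, so after shuffling it has to be looked up through the mapping rather than assumed to still be 0.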

In [None]:
world.head(10)

In [None]:
ax = world.plot(color='white', edgecolor='black', figsize=(20,20))
plot = world.plot(ax=ax, column='groups', cmap='jet')

Show the countries without a group.

In [None]:
wun=world[world['groups']==categories_dict[0]]
ax = world.plot(color='white', edgecolor='black', figsize=(20,20))
plot = wun.plot(ax=ax, column='groups', cmap='jet', vmin=0,vmax=1)

### Manual group tuning
From the first map one can clearly see that, for example, grouping Russia, Mongolia, China and Japan together may not be ideal if we expect clusters of culturally similar countries. One simple way to manually move some already assigned countries into a new group is presented below.

In [None]:
world.loc[world.iso_a3=='RUS']

In [None]:
world.loc[world.groups==world.loc[world.iso_a3=='RUS'].groups.values[0]]

In [None]:
def reGroup(iso_a3_list):
    new_group_id = max(world.groups)+1
    
    for item in iso_a3_list:
        world.loc[world.iso_a3==item,'groups']=new_group_id    

If you think that e.g. Russia and Mongolia should be in another group, run the following cell:

In [None]:
reGroup(['RUS', 'MNG'])
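
For experimenting with several alternative groupings, a variant that does not modify the global `world` frame may be more convenient. The sketch below is one possible way to write it; `regroup` and the toy frame are hypothetical and use invented group ids.

```python
import pandas as pd

def regroup(frame, iso_a3_list):
    """Return a copy of `frame` with the listed countries moved to a new group."""
    result = frame.copy()
    new_group_id = result['groups'].max() + 1
    result.loc[result['iso_a3'].isin(iso_a3_list), 'groups'] = new_group_id
    return result

# Toy frame: all four countries start in the same (made-up) group.
toy = pd.DataFrame({'iso_a3': ['RUS', 'MNG', 'CHN', 'JPN'], 'groups': [2, 2, 2, 2]})
regrouped = regroup(toy, ['RUS', 'MNG'])
```

Because the input frame is left untouched, different manual groupings can be compared side by side before one is chosen.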

In [None]:
ax = world.plot(color='white', edgecolor='black', figsize=(20,20))
plot = world.plot(ax=ax, column='groups', cmap='jet')