In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# K-Means Clustering: A Larger Example

Now that we understand the k-means clustering algorithm, let's try an example with more features and use cross-validation to choose k. We will also show how you can (and should!) run the algorithm multiple times with different initial centroids because, as we saw in the animations from the previous section, the initialization can have an effect on the final clustering.

## Clustering Countries

For this example, we will use a dataset[^*] with information about countries across the world. It includes demographic, economic, environmental, and socio-economic information from 2023. This data and more information about it can be found [here](https://doi.org/10.34740/KAGGLE/DSV/6101670). The first few lines are shown below.

In [2]:
countries = pd.read_csv("../../data/world-data-2023.csv")
countries.head()

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,323000.0,32.49,93.0,Kabul,8672,...,78.40%,0.28,38041754,48.90%,9.30%,71.40%,11.12%,9797273,33.93911,67.709953
1,Albania,105,AL,43.10%,28748,9000.0,11.78,355.0,Tirana,4536,...,56.90%,1.2,2854191,55.70%,18.60%,36.60%,12.33%,1747593,41.153332,20.168331
2,Algeria,18,DZ,17.40%,2381741,317000.0,24.28,213.0,Algiers,150006,...,28.10%,1.72,43053054,41.20%,37.20%,66.10%,11.70%,31510100,28.033886,1.659626
3,Andorra,164,AD,40.00%,468,,7.2,376.0,Andorra la Vella,469,...,36.40%,3.33,77142,,,,,67873,42.506285,1.521801
4,Angola,26,AO,47.50%,1246700,117000.0,40.73,244.0,Luanda,34693,...,33.40%,0.21,31825295,77.50%,9.20%,49.10%,6.89%,21061025,-11.202692,17.873887


We want to see if we can cluster countries based on their characteristics. First, we need to do some cleaning. I don't want to include `Abbreviation`, `Capital/Major City`, `Largest City`, `Latitude`, or `Longitude` in my analysis because they uniquely identify a given country. I also see some variables that are numeric with percentage signs, dollar signs, and commas. These are characters and indicate that the variable is a string, but I would like them to be floats or ints instead so that Python knows they have a numerical meaning.

In [36]:
countries_clean = countries.drop(columns = ['Abbreviation', 'Capital/Major City', 'Largest City', 'Latitude', 'Longitude'])
countries_clean.columns

Index(['Country', 'Density\n(P/Km2)', 'Agricultural Land( %)',
       'Land Area(Km2)', 'Armed Forces size', 'Birth Rate', 'Calling Code',
       'Capital/Major City', 'Co2-Emissions', 'CPI', 'CPI Change (%)',
       'Currency-Code', 'Fertility Rate', 'Forested Area (%)',
       'Gasoline Price', 'GDP', 'Gross primary education enrollment (%)',
       'Gross tertiary education enrollment (%)', 'Infant mortality',
       'Largest city', 'Life expectancy', 'Maternal mortality ratio',
       'Minimum wage', 'Official language', 'Out of pocket health expenditure',
       'Physicians per thousand', 'Population',
       'Population: Labor force participation (%)', 'Tax revenue (%)',
       'Total tax rate', 'Unemployment rate', 'Urban_population'],
      dtype='object')

In [52]:
def str_to_num(my_input):
    '''Takes in a number in string format and removes commas 
    and percentage signs before returning it as a float or int
    
    If the string is not a number or input is not a string, 
    returns the input'''


    if type(my_input) is str:
        print(my_input.replace("$",""))
        if (my_input.endswith("%") or my_input.startswith("$")) and (my_input.replace("%","").replace("$","").replace(".","").replace(" ","").isdigit()):
            print(my_input.replace("$",""))
            return float(my_input.replace("%","").replace("$",""))
        elif ("," in my_input) and (my_input.replace(",","").isdigit()):
            return int(my_input.replace(",",""))
        else:
            return my_input
    else:
        return my_input
    
countries_clean = countries_clean.map(str_to_num)
countries_clean.head()

Afghanistan
Albania
Algeria
Andorra
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
The Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei
Bulgaria
Burkina Faso
Burundi
Ivory Coast
Cape Verde
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Republic of the Congo
Costa Rica
Croatia
Cuba
Cyprus
Czech Republic
Democratic Republic of the Congo
Denmark
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Eswatini
Ethiopia
Fiji
Finland
France
Gabon
The Gambia
Georgia
Germany
Ghana
Greece
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Vatican City
Honduras
Hungary
Iceland
India
Indonesia
Iran
Iraq
Republic of Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Maldives
Mali
M

Unnamed: 0,Country,Density\n(P/Km2),Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,CPI,...,Minimum wage,Official language,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population
0,Afghanistan,60,58.1,652230,323000.0,32.49,93.0,Kabul,8672,149.9,...,0.43,Pashto,78.4,0.28,38041754,48.9,9.3,71.4,11.12,9797273.0
1,Albania,105,43.1,28748,9000.0,11.78,355.0,Tirana,4536,119.05,...,1.12,Albanian,56.9,1.2,2854191,55.7,18.6,36.6,12.33,1747593.0
2,Algeria,18,17.4,2381741,317000.0,24.28,213.0,Algiers,150006,151.36,...,0.95,Arabic,28.1,1.72,43053054,41.2,37.2,66.1,11.7,31510100.0
3,Andorra,164,40.0,468,,7.2,376.0,Andorra la Vella,469,,...,6.63,Catalan,36.4,3.33,77142,,,,,67873.0
4,Angola,26,47.5,1246700,117000.0,40.73,244.0,Luanda,34693,261.73,...,0.71,Portuguese,33.4,0.21,31825295,77.5,9.2,49.1,6.89,21061025.0


In [51]:
countries_clean['Minimum wage'][0].replace("%","").replace("$","").replace(".","")#.isdigit()

'043 '

## Disadvantages of `K-Means` Clustering

As we discussed above, k-means clustering has several disadvantages. It does not always converge to a solution that provides the global minimum within-cluster variability. Because of this, it can also give differing solutions depending on the initial starting points. In addition, the k-means algorithm requires the user to specify the number of clusters, which may not always be obvious, especially for data with high dimensionality. In the next section, we will discuss another clustering method that does not require you to specify a number of clusters: hierarchical clustering.

[^*]: Nidula Elgiriyewithana. (2023). Global Country Information Dataset 2023 [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6101670