### Note 3
As mentioned previously, you can follow any approach to arrive at the final list of countries. You can either use the binning approach, or use the outlier-assignment approach, where each outlier you found is assigned to the closest of the given cluster means. Here we'll take a look at the easier of the two: binning on indicators that show good variability across the different clusters.
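For comparison, here is a minimal sketch of the outlier-assignment idea, which this note does not otherwise use. The cluster means and outlier rows below are made-up numbers in a hypothetical two-feature scaled space, and `assign_to_nearest_mean` is an illustrative helper name, not something from the assignment. It assumes "assigning to a cluster mean" means picking the mean with the smallest Euclidean distance.

```python
import numpy as np

# Hypothetical cluster means in a 2-feature scaled space;
# in practice these would come from the fitted clustering model.
cluster_means = np.array([
    [-1.0,  1.5],   # cluster 0
    [ 0.0,  0.0],   # cluster 1
    [ 2.0, -1.0],   # cluster 2
])

# Hypothetical outlier rows, expressed in the same scaled space.
outliers = np.array([
    [-1.4,  2.2],
    [ 3.1, -0.8],
    [ 0.3,  0.4],
])

def assign_to_nearest_mean(points, means):
    """Return, for each point, the index of the closest mean (Euclidean distance)."""
    # distances[i, j] is the distance from point i to mean j
    distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    return distances.argmin(axis=1)

outlier_labels = assign_to_nearest_mean(outliers, cluster_means)
print(outlier_labels)
```

Note that the outliers have to be scaled with the same transformation as the data the cluster means were computed on, otherwise the distances are not comparable.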

In [26]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [38]:
fin = pd.read_csv('Country-data.csv')
# Convert exports, imports and health spending from percentages of gdpp to absolute values.
fin['exports'] = fin['exports']*fin['gdpp']/100
fin['imports'] = fin['imports']*fin['gdpp']/100
fin['health'] = fin['health']*fin['gdpp']/100
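
One thing to watch with an in-place conversion like this: re-running the cell applies the scaling a second time. A small sketch of a variant that returns a converted copy instead is shown here. The helper name `to_absolute` and the two-row frame are hypothetical and only there for illustration; the column names are the ones used in this notebook.

```python
import pandas as pd

def to_absolute(df, pct_cols=('exports', 'imports', 'health'), base_col='gdpp'):
    """Return a copy of df with percentage-of-gdpp columns converted to absolute values."""
    out = df.copy()  # leave the input frame untouched
    for col in pct_cols:
        out[col] = out[col] * out[base_col] / 100
    return out

# Hypothetical two-row frame with made-up values, for illustration only.
raw = pd.DataFrame({
    'country': ['A', 'B'],
    'exports': [10.0, 50.0],
    'imports': [40.0, 25.0],
    'health':  [5.0, 8.0],
    'gdpp':    [500, 2000],
})

converted = to_absolute(raw)
print(converted)
```

Because `raw` is left unchanged, calling `to_absolute(raw)` again gives the same result rather than scaling the columns twice.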

In [76]:
fin.head()

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
1,Albania,16.6,1145.2,267.895,1987.74,9930,4.49,76.3,1.65,4090
2,Algeria,27.3,1712.64,185.982,1400.44,12900,16.1,76.5,2.89,4460
3,Angola,119.0,2199.19,100.605,1514.37,5900,22.4,60.1,6.16,3530
4,Antigua and Barbuda,10.3,5551.0,735.66,7185.8,19100,1.44,76.8,2.13,12200


In [77]:
# Let's bin on gdpp first to see the list of countries which might be important.
# The upper limit that we got from the clustering process was 1700,
# so let's filter the complete dataset with 1700 as the cut-off limit for gdpp.
fin2=fin[fin['gdpp']<=1700]
fin2.head()

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
12,Bangladesh,49.4,121.28,26.6816,165.244,2440,7.14,70.4,2.33,758
17,Benin,111.0,180.404,31.078,281.976,1820,0.885,61.8,5.36,758
25,Burkina Faso,116.0,110.4,38.755,170.2,1430,6.81,57.9,5.87,575
26,Burundi,93.6,20.6052,26.796,90.552,764,12.3,57.7,6.26,231


In [78]:
len(fin2)

48

In [79]:
#So we got 48 countries here. We can create further sub categories by taking another good clustering indicator. 
#Let's use the describe function to see how the variables are aligned now.
fin2.describe()

Unnamed: 0,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
count,48.0,48.0,48.0,48.0,48.0,48.0,48.0,48.0,48.0
mean,84.808333,242.988282,53.166544,389.688794,2209.229167,8.849688,60.789583,4.5525,847.583333
std,37.864382,208.41119,36.338142,306.718665,1134.428833,5.849055,7.282776,1.382764,384.444824
min,17.2,1.07692,12.8212,0.651092,609.0,0.885,32.1,1.27,231.0
25%,61.35,101.63025,31.0795,175.9095,1390.0,4.08,57.175,3.465,551.5
50%,82.05,150.912,44.3886,280.956,1900.0,8.215,61.25,4.875,758.0
75%,108.25,388.0875,60.50125,450.765,2857.5,12.15,66.125,5.37,1205.0
max,208.0,943.2,190.71,1279.55,4490.0,23.6,73.1,7.49,1630.0


In [80]:
# From the clustering process, child_mort was at least 76 for the most downtrodden cluster.
#Let's see how many countries lie within that range
len(fin2[fin2['child_mort']>=76])

28

In [84]:
#Ok so we got 28 countries now. We can stop here or take one more indicator and find the final list.
#Here we are taking income as the next one, where around 3200 was the income mean of the downtrodden cluster.
fin3=fin2[fin2['child_mort']>=76]
fin4=fin3[fin3['income']<3200]
len(fin4)

23
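
The three filters applied so far (gdpp, child_mort, income) can also be written as one boolean mask, which keeps the thresholds in one place. This is only a sketch: the thresholds are the ones quoted in the cells above, while the function name `shortlist` and the four toy rows are hypothetical.

```python
import pandas as pd

# Thresholds quoted in the cells above.
GDPP_MAX = 1700
CHILD_MORT_MIN = 76
INCOME_MAX = 3200

def shortlist(df):
    """Keep rows that satisfy all three binning conditions at once."""
    mask = (
        (df['gdpp'] <= GDPP_MAX)
        & (df['child_mort'] >= CHILD_MORT_MIN)
        & (df['income'] < INCOME_MAX)
    )
    return df[mask]

# Hypothetical rows with made-up values, for illustration only.
toy = pd.DataFrame({
    'country': ['P', 'Q', 'R', 'S'],
    'gdpp': [500, 1500, 4000, 900],
    'child_mort': [90.0, 60.0, 120.0, 80.0],
    'income': [1200, 2500, 9000, 3500],
})

selected = shortlist(toy)
print(selected['country'].tolist())
```

Each condition needs its own parentheses, because `&` binds more tightly than the comparison operators in Python.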

In [85]:
#We've got 23 countries now, let's use the describe function to see how they're aligned again.
fin4.describe()

Unnamed: 0,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
count,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0
mean,113.743478,165.144704,41.870378,292.629174,1444.913043,7.125435,56.065217,5.463478,627.173913
std,30.255681,140.898936,23.15412,222.532254,587.515796,5.428622,6.853042,0.953232,288.989407
min,80.3,20.6052,17.7508,90.552,609.0,0.885,32.1,3.3,231.0
25%,90.4,79.3795,30.66305,170.185,974.0,3.42,55.3,5.08,432.5
50%,109.0,126.885,37.332,248.297,1410.0,5.45,57.3,5.34,562.0
75%,119.5,188.29,46.1196,328.251,1740.0,10.02,58.75,6.01,733.0
max,208.0,617.32,129.87,1181.7,2690.0,20.8,65.9,7.49,1310.0


In [87]:
#The final list of countries 
fin4

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
17,Benin,111.0,180.404,31.078,281.976,1820,0.885,61.8,5.36,758
25,Burkina Faso,116.0,110.4,38.755,170.2,1430,6.81,57.9,5.87,575
26,Burundi,93.6,20.6052,26.796,90.552,764,12.3,57.7,6.26,231
28,Cameroon,108.0,290.82,67.203,353.7,2660,1.91,57.3,5.11,1310
31,Central African Republic,149.0,52.628,17.7508,118.19,888,2.01,47.5,5.21,446
32,Chad,150.0,330.096,40.6341,390.195,1930,6.39,56.5,6.59,897
36,Comoros,88.2,126.885,34.6819,397.573,1410,3.87,65.9,4.75,769
37,"Congo, Dem. Rep.",116.0,137.274,26.4194,165.664,609,20.8,57.5,6.54,334
40,Cote d'Ivoire,111.0,617.32,64.66,528.26,2690,5.39,56.3,5.27,1220


#### Final Remarks
Major focus should be given to the countries mentioned above.

#### Additional Remarks (Non-Evaluative)

PCA was done at the beginning to remove the redundancies in the data and find the most important directions along which the data is aligned. A somewhat similar heuristic is used by the United Nations to calculate the Human Development Index (HDI), which ranks countries on the basis of their development. It takes 3 measures: Life Expectancy, Education (Literacy Rate in earlier versions of the index) and Gross National Income per capita. The logic behind using these 3 variables is that they directly affect the other variables that determine a country's development. For example, if you take a dataset covering many different socio-economic variables, most of those variables would be heavily correlated with at least one of the 3 factors mentioned previously.

In this assignment, however, we actually find the directions along which the maximum variability lies by using PCA. Then, instead of using a metric like HDI, we cluster the countries on the basis of those important directions. These directions are analogous to the 3 measures mentioned previously, and the clusters that are formed are similar to the HDI ranges used to denote development. Check out this link to read more on it: http://hdr.undp.org/en/content/human-development-index-hdi. And here is a list of the countries ranked according to HDI: http://hdr.undp.org/en/composite/HDI
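
To make "directions of maximum variability" a bit more concrete, here is a minimal sketch of PCA done by hand with NumPy's SVD. It is not the code used in the assignment. The data is randomly generated and hypothetical: three features, where the second is deliberately almost a copy of the first, to mimic the kind of redundancy PCA is meant to remove.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 rows, 3 features; feature 2 is nearly a copy of feature 1.
base = rng.normal(size=(50, 1))
X = np.hstack([
    base,
    0.9 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

# Standardize each feature so that no column dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of the total variance captured by each direction.
explained_ratio = S**2 / np.sum(S**2)

# Coordinates of every row along the principal directions.
scores = Z @ Vt.T

print(explained_ratio)
```

Since two of the three features carry almost the same information, the last direction captures very little variance and could be dropped before clustering on the remaining columns of `scores`.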