#Manipulating HDI Data

#Mounting Drive files and importing packages

The first step is to mount your Google Drive in the Colab notebook and import the pandas and numpy packages. Mounting the drive makes the data files readable from the notebook, and the packages provide the functions used below to manipulate the data.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')
import numpy as np
import pandas as pd

Mounted at /content/gdrive


#Creating a DataFrame
Next, use the pd.read_csv function to load each of the two data files into its own DataFrame.
* The file entitled "HDI_short" contains the United Nations' Human Development Index (HDI) data for all applicable countries in the world, along with some of the indicators that contribute to each country's HDI.
* The file entitled "Countries by continents" is a simple table of all countries in the world and their respective continents.

In [3]:
HDIFileName = "/content/gdrive/MyDrive/Colab Notebooks/HDI_short.csv"
CCFileName = "/content/gdrive/MyDrive/Colab Notebooks/Countries by continents.csv"
df=pd.read_csv(HDIFileName, encoding = 'unicode_escape')
df1=pd.read_csv(CCFileName, encoding = 'unicode_escape')
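
A side effect of the 'unicode_escape' encoding is that a UTF-8 byte-order mark at the start of a file is not stripped and can end up as stray characters (ï»¿) at the front of the first column name. The following is a minimal sketch of one alternative, assuming the file is valid UTF-8 (which may not hold for the files used here). The CSV bytes and the variable names `raw` and `table` are made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV bytes that start with a UTF-8 byte-order mark (EF BB BF)
raw = b"\xef\xbb\xbfContinent,Country\nEurope,Norway\nAfrica,Chad\n"

# 'utf-8-sig' decodes UTF-8 text and discards a leading byte-order mark
table = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")
print(list(table.columns))
```

With a clean header, the column can be selected by its plain name instead of a name that includes the stray characters.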

#Creating DataFrames for each HDI category
Next, to split the HDI data into smaller groups, filter the rows with boolean conditions and select the columns with .loc. Create a new DataFrame for each category (variable names 'cat1, cat2...' below) and specify the range of HDI ranks you would like in each subset. Instead of categorizing the countries by HDI rank, you could also separate them by HDI score, such as ranges of 0.1 to 0.2 and 0.2 to 0.3. However, it tends to be much easier to manipulate and visualize data when each subset has roughly the same number of elements, as you'll see later.

For the middle categories, the code filters df twice (once for the upper bound and once for the lower bound) and passes both results to pd.merge. With no key specified, merge joins on all shared columns, so only the rows present in both filtered frames are kept. That result is assigned to the category DataFrame.

As the column list passed to .loc shows, you may also keep other variables from the original DataFrame to further describe your data. The code below keeps not only country and HDI rank, but also life expectancy at birth, gross national income, and expected years of schooling.

In [4]:
cat1 = df[df['HDI rank'] <=20]
cat1 = cat1.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat2 = pd.merge(df[df['HDI rank'] <=40], df[df['HDI rank']>20])
cat2 = cat2.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat3 = pd.merge(df[df['HDI rank'] <=60], df[df['HDI rank']>40])
cat3 = cat3.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat4 = pd.merge(df[df['HDI rank'] <=80], df[df['HDI rank']>60])
cat4 = cat4.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat5 = pd.merge(df[df['HDI rank'] <=100], df[df['HDI rank']>80])
cat5 = cat5.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat6 = pd.merge(df[df['HDI rank'] <=120], df[df['HDI rank']>100])
cat6 = cat6.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat7 = pd.merge(df[df['HDI rank'] <=140], df[df['HDI rank']>120])
cat7 = cat7.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat8 = pd.merge(df[df['HDI rank'] <=160], df[df['HDI rank']>140])
cat8 = cat8.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
cat9 = df[df['HDI rank'] >160]
cat9 = cat9.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]

cat2


| | Country | HDI rank | Life expectancy at birth | Gross national income (GNI) per capita | Expected years of schooling |
|---|---|---|---|---|---|
| 0 | United States | 21 | 77.2 | 64765 | 16.3 |
| 1 | Israel | 22 | 82.3 | 41524 | 16.1 |
| 2 | Malta | 23 | 83.8 | 38884 | 16.8 |
| 3 | Slovenia | 23 | 80.7 | 39746 | 17.7 |
| 4 | Austria | 25 | 81.6 | 53619 | 16.0 |
| 5 | United Arab Emirates | 26 | 78.7 | 62574 | 15.7 |
| 6 | Spain | 27 | 83.0 | 38354 | 17.9 |
| 7 | France | 28 | 82.5 | 45937 | 15.8 |
| 8 | Cyprus | 29 | 81.2 | 38188 | 15.6 |
| 9 | Italy | 30 | 82.9 | 42840 | 16.2 |
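
A rank range can also be selected with one combined boolean mask instead of merging two filtered frames. This is a minimal sketch on a small made-up table; the `toy` frame, its values, and the names `mask` and `band` are illustrative and not part of the real HDI data.

```python
import pandas as pd

# Made-up stand-in for the HDI table; values are illustrative only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "HDI rank": [19, 20, 21, 40, 41],
})

# Keep ranks 21 through 40; each condition needs its own parentheses
mask = (toy["HDI rank"] > 20) & (toy["HDI rank"] <= 40)

# Select the matching rows and the wanted columns, then renumber rows from 0
band = toy.loc[mask, ["Country", "HDI rank"]].reset_index(drop=True)
print(band)
```

Resetting the index keeps the row labels starting at 0, which matters if later code looks rows up by label.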


#Finding Data Descriptors

Finally, it's time to put our data to use and employ statistics methods to summarize our findings.

###Initializing categories list
First, create a simple one-dimensional list with all the category dataframes from above.

In [5]:
categories = [cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9]

###Finding the mean life expectancy at birth

Next, make a list of integers initialized at 0 to store the mean values found below.

To evaluate the mean life expectancy at birth for each category, use a while loop to iterate through the categories. For each one, find the sum of the countries' life expectancies and divide it by the number of non-missing values in that column, which .count() returns.

Then, use the round() function to round the mean values, so they appear better in a table.

In [6]:
meanLife = [0,0,0,0,0,0,0,0,0]

i=0
while i<len(meanLife):
  meanLife[i] = categories[i]['Life expectancy at birth'].sum() / categories[i]['Life expectancy at birth'].count()
  meanLife[i]=round(meanLife[i],2)
  i=i+1
meanLife

[82.78, 79.4, 74.97, 74.01, 71.98, 69.63, 69.12, 63.84, 61.88]
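
The same calculation can be written more compactly with the Series method .mean(), which also skips missing values. This is a minimal sketch on made-up numbers; `groups` and `mean_life` are hypothetical names standing in for the category list and the result list.

```python
import pandas as pd

# Two made-up categories; the second contains one missing value
groups = [
    pd.DataFrame({"Life expectancy at birth": [80.0, 82.0, 84.0]}),
    pd.DataFrame({"Life expectancy at birth": [70.0, 71.0, None]}),
]

# .mean() ignores missing values, like dividing the sum by .count()
mean_life = [round(float(g["Life expectancy at birth"].mean()), 2) for g in groups]
print(mean_life)
```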

###Finding the mean GNI per capita

Next, make a list of integers to store the mean GNI values (the code below names this list meanGDI).

The iteration process to find the mean GNI is a bit more tedious, because it requires a nested while loop. Using the counter variables i and j, iterate through every row of each category's GNI per capita column. Each value is stored as a string with a thousands separator (such as "64,765"), so the comma must be removed with the string method .replace() before the value can be converted to a number.

Finally, round the values to two decimal places using the round() function and restore the thousands separator using the format() function.

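>Side note - I had to calculate the final category's mean by hand and enter its value into the list directly, because the loop failed on that category. A likely cause is that cat9 is created by filtering df without a merge, so it keeps its original row labels instead of being renumbered from 0, and the lookup by [i] then fails. Calling .reset_index(drop=True) on cat9 would probably avoid this.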

In [7]:
meanGDI=[0,0,0,0,0,0,0,0,0]
gniColumn = 'Gross national income (GNI) per capita'

# Sum the GNI values of the first eight categories (the ninth is set by hand below)
j = 0
while j<8:
  values = categories[j][gniColumn]
  i = 0
  while i<len(values):
    # Strip the thousands separator, then parse with float() rather than eval()
    meanGDI[j]+=float(values[i].replace(",",""))
    i=i+1

  meanGDI[j]/=values.count()
  j=j+1

# Round each mean and restore the thousands separator
j=0
while j<8:
  meanGDI[j]=round(meanGDI[j],2)
  meanGDI[j]=('{:,}'.format(meanGDI[j]))
  j+=1

# Mean for the final category, calculated by hand
meanGDI[8]=('{:,}'.format(2632.3))

meanGDI


['62,388.2',
 '41,717.0',
 '33,694.58',
 '17,710.05',
 '12,099.11',
 '10,501.6',
 '6,190.23',
 '4,545.5',
 '2,632.3']
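
The comma stripping and averaging can also be done on a whole column at once, which does not depend on the row labels starting at 0. This is a minimal sketch on made-up values; `gni`, `gni_numeric`, `mean_gni`, and `formatted` are hypothetical names.

```python
import pandas as pd

# Made-up GNI strings; the index deliberately does not start at 0,
# as happens when rows are selected by filtering a larger frame
gni = pd.Series(["1,200", "2,400", "3,000"], index=[160, 161, 162])

# Remove the thousands separators for the whole column, then convert to numbers
gni_numeric = pd.to_numeric(gni.str.replace(",", "", regex=False))

# Average, round to two decimals, and restore the thousands separator
mean_gni = round(float(gni_numeric.mean()), 2)
formatted = "{:,}".format(mean_gni)
print(formatted)
```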

#Making a new DataFrame

With the new variables, make a DataFrame that summarizes the statistics for each category of HDI ranks using the pd.DataFrame() function.
>With this table, you can already begin to see the trends in the data.

In [11]:
newStats = pd.DataFrame({"Category (by HDI rank)" : ["0-20","20-40","40-60","60-80","80-100","100-120","120-140","140-160","160+"],
                   "Mean life expectancy at birth" : meanLife,
                        "Mean GDI" : meanGDI})
newStats

| | Category (by HDI rank) | Mean life expectancy at birth | Mean GDI |
|---|---|---|---|
| 0 | 0-20 | 82.78 | 62388.2 |
| 1 | 20-40 | 79.4 | 41717.0 |
| 2 | 40-60 | 74.97 | 33694.58 |
| 3 | 60-80 | 74.01 | 17710.05 |
| 4 | 80-100 | 71.98 | 12099.11 |
| 5 | 100-120 | 69.63 | 10501.6 |
| 6 | 120-140 | 69.12 | 6190.23 |
| 7 | 140-160 | 63.84 | 4545.5 |
| 8 | 160+ | 61.88 | 2632.3 |
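
A summary table of this kind can also be produced in one pass by binning the ranks with pd.cut and grouping on the bins. This is a minimal sketch on made-up data; the `toy` frame, the bin edges, and the labels are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ranks and life expectancies
toy = pd.DataFrame({
    "HDI rank": [1, 20, 21, 40],
    "Life expectancy at birth": [84.0, 82.0, 78.0, 76.0],
})

# Bins include their right edge by default: (0, 20] and (20, 40]
toy["Category"] = pd.cut(toy["HDI rank"], bins=[0, 20, 40], labels=["1-20", "21-40"])

# Mean life expectancy per bin, returned as a two-column DataFrame
summary = (
    toy.groupby("Category", observed=True)["Life expectancy at birth"]
    .mean()
    .reset_index(name="Mean life expectancy at birth")
)
print(summary)
```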


#Making a Country-Continent Data Frame

One last thing you might want to do with your data is group it by continent and visualize it on a world map. However, the HDI csv file doesn't include a continent attribute.

To add one, use the merge function to join the HDI DataFrame with the country/continent DataFrame on their shared "Country" column, then build a new DataFrame that includes the continent data.

In [65]:
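# Outer join keeps every country that appears in either file
mergedSS = pd.merge(df, df1, on="Country",how="outer")
# The continent column name starts with stray byte-order-mark characters (ï»¿)
# left over from how the csv file was decoded, so they must be included here
newFrame = pd.DataFrame({"Country:" : mergedSS['Country'],
                   "Continent" : mergedSS['ï»¿Continent'],
                        "Human Development" : mergedSS['Human Development Index (HDI) ']})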

#Filtering Data

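Not all countries in the merged table have an HDI value, so remove those rows before doing any calculations. The dropna() function does this directly: it drops every row whose 'Human Development' value is missing.

In [68]:
# Missing values are stored as NaN, not as the string "NaN", so an == "NaN" test never matches;
# dropna() removes the rows with a missing HDI value and returns a new DataFrame
newFrame = newFrame.dropna(subset=['Human Development'])
newFrame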

| | Country: | Continent | Human Development |
|---|---|---|---|
| 0 | Switzerland | Europe | 0.962 |
| 1 | Norway | Europe | 0.961 |
| 2 | Iceland | Europe | 0.959 |
| 3 | Hong Kong, China (SAR) | | 0.952 |
| 4 | Australia | Oceania | 0.951 |
| ... | ... | ... | ... |
| 186 | Burundi | Africa | 0.426 |
| 187 | Central African Republic | Africa | 0.404 |
| 188 | Niger | Africa | 0.400 |
| 189 | Chad | Africa | 0.394 |
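
Row 3 in the output above has no continent, probably because the country's name is written differently in the two files. One way to list such mismatches is the indicator option of pd.merge. This is a minimal sketch on made-up frames; `hdi`, `continents`, `checked`, and `unmatched` are hypothetical names, and the spelling "Hong Kong" in the second frame is an assumption.

```python
import pandas as pd

# Made-up stand-ins for the two source tables
hdi = pd.DataFrame({
    "Country": ["Norway", "Hong Kong, China (SAR)"],
    "HDI": [0.961, 0.952],
})
continents = pd.DataFrame({
    "Country": ["Norway", "Hong Kong"],
    "Continent": ["Europe", "Asia"],
})

# indicator=True adds a _merge column saying where each row came from
checked = pd.merge(hdi, continents, on="Country", how="left", indicator=True)

# Rows marked left_only had no matching country name in the continent table
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

The unmatched names can then be corrected in one of the tables before merging again.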


#Specifying by Continent

With the merged dataset, you can now categorize countries by their continent and separate them into their own DataFrames using the same boolean filtering as before.

In [69]:
NA = newFrame[newFrame['Continent'] == "North America"]
SA = newFrame[newFrame['Continent'] == "South America"]
Europe = newFrame[newFrame['Continent'] == "Europe"]
Africa = newFrame[newFrame['Continent'] == "Africa"]
Asia = newFrame[newFrame['Continent'] == "Asia"]
Oceania = newFrame[newFrame['Continent'] == "Oceania"]

NA

| | Country: | Continent | Human Development |
|---|---|---|---|
| 14 | Canada | North America | 0.936 |
| 20 | United States | North America | 0.921 |
| 54 | Bahamas | North America | 0.812 |
| 56 | Trinidad and Tobago | North America | 0.81 |
| 57 | Costa Rica | North America | 0.809 |
| 60 | Panama | North America | 0.805 |
| 68 | Grenada | North America | 0.795 |
| 69 | Barbados | North America | 0.79 |
| 70 | Antigua and Barbuda | North America | 0.788 |
| 74 | Saint Kitts and Nevis | North America | 0.777 |


#Calculating Continental Means

To calculate the mean HDI for each continent, convert that continent's HDI values to floats, sum them, and divide by the number of countries in the continent's DataFrame.

Finally, create a Data Frame displaying the continents and their mean HDI values.

In [70]:
meanNA = NA['Human Development'].astype(float).sum()/len(NA['Human Development'])
meanSA = SA['Human Development'].astype(float).sum()/len(SA['Human Development'])
meanEurope = Europe['Human Development'].astype(float).sum()/len(Europe['Human Development'])
meanAfrica = Africa['Human Development'].astype(float).sum()/len(Africa['Human Development'])
meanAsia = Asia['Human Development'].astype(float).sum()/len(Asia['Human Development'])
meanOceania = Oceania['Human Development'].astype(float).sum()/len(Oceania['Human Development'])

continentData = pd.DataFrame({"Continent:" : ["North America", "South America", "Europe", "Africa", "Asia", "Oceania"],
                              "Mean HDI" : [meanNA,meanSA,meanEurope,meanAfrica,meanAsia,meanOceania]})
continentData

| | Continent: | Mean HDI |
|---|---|---|
| 0 | North America | 0.749348 |
| 1 | South America | 0.7675 |
| 2 | Europe | 0.877884 |
| 3 | Africa | 0.560426 |
| 4 | Asia | 0.739182 |
| 5 | Oceania | 0.705833 |
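
The per-continent means can also be computed without a separate DataFrame for each continent by grouping on the continent column. This is a minimal sketch on made-up values; the `toy` frame and the name `means` are illustrative.

```python
import pandas as pd

# Made-up HDI values stored as strings, one row per country
toy = pd.DataFrame({
    "Continent": ["Europe", "Europe", "Africa", "Africa"],
    "Human Development": ["0.875", "0.625", "0.5", "0.25"],
})

# Convert to floats once, then average within each continent
toy["Human Development"] = toy["Human Development"].astype(float)
means = toy.groupby("Continent")["Human Development"].mean().reset_index(name="Mean HDI")
print(means)
```

groupby sorts the continents alphabetically by default, so the row order may differ from a hand-built list.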


#Converting back to CSV

As a last step, convert the final, manipulated DataFrames (the two that would be useful for researchers) back into csv files using the .to_csv() function.

In [71]:
newStats.to_csv("HDIStats.csv", index=False)
continentData.to_csv("continentData.csv", index=False)

The files are saved in the notebook's working directory and appear in the 'Files' panel on the left, where you can download them to your computer.
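
Because the mean values in newStats are stored as strings with thousands separators, anyone reading the csv file back into pandas may want them parsed as numbers again. This is a minimal sketch that writes to an in-memory buffer instead of a file; the two-row `stats` frame is a made-up sample, and `buffer` and `restored` are hypothetical names.

```python
import io
import pandas as pd

# Small sample in the same shape as the summary table
stats = pd.DataFrame({
    "Category (by HDI rank)": ["0-20", "20-40"],
    "Mean GDI": ["62,388.2", "41,717.0"],
})

# Write to an in-memory buffer; values containing commas are quoted automatically
buffer = io.StringIO()
stats.to_csv(buffer, index=False)

# Read it back, telling pandas that "," is a thousands separator
buffer.seek(0)
restored = pd.read_csv(buffer, thousands=",")
print(restored.dtypes)
```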