#Manipulating HDI Data

#Mounting Drive files and importing packages

The first step is to mount your Google Drive in the Colab notebook and import the pandas and NumPy packages. Mounting makes the files stored in your Drive readable from the notebook, and the imports make each library's functions available for manipulating the data.

In [105]:
from google.colab import drive
drive.mount('/content/gdrive')
import numpy as np
import pandas as pd





Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


#Parsing csv name and creating a DataFrame
Next, I used functions I found online to store the path of the csv file (which I had already uploaded to my Drive) and read the data in the file into a DataFrame with pd.read_csv.
>Side note: I ran into decoding errors when reading the csv into the DataFrame, so I passed an encoding argument to read_csv, and that worked.

In [None]:
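filename = "/content/gdrive/MyDrive/Colab Notebooks/HDI_short.csv"
# Note: the result of this call is discarded, so it has no effect on the read below.
filename.encode('utf-8').strip()
# The encoding argument is what lets pandas decode bytes in the file that are not valid UTF-8.
df = pd.read_csv(filename, encoding='unicode_escape')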
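
To illustrate why the encoding argument is the part that matters, here is a minimal sketch. It assumes a hypothetical two-row file saved in Latin-1; the real encoding of HDI_short.csv is not stated in this notebook, so treat `latin-1` and the file name `sample_hdi.csv` as stand-ins rather than the actual values.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row file written in Latin-1 to mimic a csv that is not UTF-8.
csv_text = "Country,HDI rank\nCôte d'Ivoire,159\nNorway,2\n"
path = os.path.join(tempfile.mkdtemp(), "sample_hdi.csv")
with open(path, "w", encoding="latin-1", newline="") as f:
    f.write(csv_text)

# With the default UTF-8 decoder, the Latin-1 byte for "ô" cannot be decoded.
try:
    pd.read_csv(path)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the matching encoding to read_csv decodes the file correctly.
sample = pd.read_csv(path, encoding="latin-1")
```

As far as I understand it, `unicode_escape` also works here because that codec decodes bytes as Latin-1 (while additionally interpreting backslash escapes), so `latin-1` or `cp1252` may be a more conventional choice if the file has no backslashes to interpret.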

#Creating a DataFrame of the lower HDI countries
Next, I narrowed the data to the 20 countries with the lowest HDI ranking. A boolean mask checks whether the "HDI rank" column is greater than or equal to 174, and indexing df with that mask creates a new DataFrame containing only those countries.

I decided to keep the columns for country, HDI rank, life expectancy at birth, gross national income (GNI) per capita, and expected years of schooling, as listed in the column selector passed to .loc.

In [106]:
dflow = df[df['HDI rank'] >= 174]
dflow = dflow.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]
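
The two steps above (a boolean mask for rows, then .loc for columns) can also be written as a single .loc call. The sketch below uses a made-up four-row table; the country names, values, and the cutoff of 190 are illustrative assumptions, not real HDI data.

```python
import pandas as pd

# Made-up miniature table; the values are illustrative, not real HDI data.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "HDI rank": [1, 2, 190, 191],
    "Life expectancy at birth": [84.0, 83.2, 55.0, 52.5],
    "Expected years of schooling": [16.5, 18.2, 8.0, 7.4],
})

# Step 1: a boolean Series that is True where the rank is 190 or higher.
mask = toy["HDI rank"] >= 190

# Step 2: .loc takes the row selector first and the list of columns second.
toy_low = toy.loc[mask, ["Country", "HDI rank", "Life expectancy at birth"]]
```

Combining both selectors in one .loc call gives the same result as filtering first and selecting columns afterwards.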

##Doing the same for the higher HDI countries

In [None]:
dfhigh = df[df['HDI rank'] <=20]
dfhigh = dfhigh.loc[:,["Country","HDI rank","Life expectancy at birth","Gross national income (GNI) per capita","Expected years of schooling"]]

#Combining the data sets

After this, I used the pd.concat function to stack the two DataFrames into one. I used an axis argument of 0, which appends the rows of the second DataFrame below the rows of the first while keeping a single set of columns.

In [107]:
dfnew = pd.concat([dfhigh, dflow], axis=0)
dfnew

Unnamed: 0,Country,HDI rank,Life expectancy at birth,Gross national income (GNI) per capita,Expected years of schooling
0,Switzerland,1,84.0,66933,16.5
1,Norway,2,83.2,64660,18.2
2,Iceland,3,82.7,55782,19.2
3,"Hong Kong, China (SAR)",4,85.5,62607,17.3
4,Australia,5,84.5,49238,21.1
5,Denmark,6,81.4,60365,18.7
6,Sweden,7,83.0,54489,19.4
7,Ireland,8,82.0,76169,18.9
8,Germany,9,80.6,54534,17.0
9,Netherlands,10,81.7,55979,18.7
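
To make the effect of the axis argument concrete, here is a small sketch comparing axis=0 with axis=1. The two tiny DataFrames (`top` and `bottom`) and their index labels are hypothetical stand-ins for dfhigh and dflow.

```python
import pandas as pd

# Hypothetical stand-ins for dfhigh and dflow, keeping their original row labels.
top = pd.DataFrame({"Country": ["A", "B"], "HDI rank": [1, 2]})
bottom = pd.DataFrame({"Country": ["Y", "Z"], "HDI rank": [190, 191]},
                      index=[189, 190])

# axis=0 stacks rows: same two columns, four rows, original index labels kept.
stacked = pd.concat([top, bottom], axis=0)

# axis=1 places the frames side by side: the columns are repeated, and rows
# are aligned on the index, leaving NaN where a label exists in only one frame.
side_by_side = pd.concat([top, bottom], axis=1)
```

If fresh row labels are wanted in the stacked result, `pd.concat(..., ignore_index=True)` renumbers them from 0.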


#Finding Data Descriptors

Finally, I experimented with methods like .sum(), .astype(), and .replace() to create variables holding summary statistics, such as the mean of certain columns.

In [155]:
meanLifeHigh = dfhigh['Life expectancy at birth'].sum()/20
meanLifeLow = dflow['Life expectancy at birth'].sum()/20
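
Dividing the sum by a hard-coded 20 works as long as each DataFrame has exactly 20 rows with no missing values. As a hedged alternative, the built-in .mean() method counts the values itself. The numbers below are a small made-up sample, not the full columns.

```python
import pandas as pd

# Small made-up sample standing in for a life expectancy column.
life = pd.Series([84.0, 83.2, 82.7, 85.5], name="Life expectancy at birth")

manual_mean = life.sum() / 4   # divisor typed in by hand
builtin_mean = life.mean()     # divisor taken from the data

# With a missing value, the two approaches diverge: .sum() skips the gap,
# but a hard-coded divisor still counts it, while .mean() does not.
gap = pd.Series([80.0, None, 70.0])
manual_with_gap = gap.sum() / 3    # 150.0 / 3
builtin_with_gap = gap.mean()      # 150.0 / 2
```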

In [178]:
meanGDIHigh = dfhigh['Gross national income (GNI) per capita'].astype(str).replace(',','')
meanGDILow = dflow['Gross national income (GNI) per capita'].astype(str).replace(',','')

#Making a new DataFrame

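With the new variables, I attempted to make a DataFrame that summarizes the statistics of the top 20 and bottom 20 countries.
>As you can see, my calculation of the mean GNI hasn't been perfected yet. The "Mean GNI" column currently holds each whole column of strings instead of a single number, because calling .replace(',','') on a Series only matches cells whose entire value is a comma, so the commas inside the numbers are left in place. I'm still working on removing the commas from the original strings and converting them to integers in order to find the mean.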

In [183]:
newStats = pd.DataFrame({"Category" : ["Top 20","Bottom 20"],
                   "Mean life expectancy at birth" : [meanLifeHigh,meanLifeLow],
                         "Mean GNI" : [meanGDIHigh,meanGDILow]})
newStats

Unnamed: 0,Category,Mean life expectancy at birth,Mean GNI
0,Top 20,82.78,"0 66,933 1 64,660 2 55,782 3 ..."
1,Bottom 20,67.915,"173 2,172 174 2,361 175 1,729 176 ..."
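
One possible way to finish the GNI calculation is sketched below. It assumes the GNI column holds strings with comma thousands separators, as the output above suggests; the three sample values are taken from the first rows of that output, and the variable names are hypothetical.

```python
import pandas as pd

# Three sample values, assumed to be strings with thousands separators.
gni = pd.Series(["66,933", "64,660", "55,782"],
                name="Gross national income (GNI) per capita")

# Series.replace compares whole cell values; no cell equals ",", so nothing changes.
unchanged = gni.replace(",", "")

# Series.str.replace works inside each string, removing the separators,
# after which the strings can be converted to integers.
cleaned = gni.str.replace(",", "", regex=False).astype(int)

# A single number suitable for one cell of a summary DataFrame.
mean_gni = cleaned.mean()
```

Another option, if re-reading the file is acceptable, is to pass `thousands=","` to pd.read_csv so that pandas parses such columns as numbers from the start.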


#Converting back to CSV

As a last step, I wrote my two final DataFrames (the ones that would be useful for researchers) back to csv files.

In [186]:
newStats.to_csv("HDIStats.csv", index=False)
dfnew.to_csv("CombinedData.csv", index=False)
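
The index=False argument controls whether the row labels are written as an extra, unnamed first column. The sketch below shows the difference on a small DataFrame built from the two life expectancy means reported above. Calling to_csv without a path returns the csv text as a string, which keeps the example from writing any files.

```python
import io

import pandas as pd

# Small summary table using the two means reported earlier in this notebook.
stats = pd.DataFrame({
    "Category": ["Top 20", "Bottom 20"],
    "Mean life expectancy at birth": [82.78, 67.915],
})

# With no path given, to_csv returns the csv text instead of writing a file.
with_index = stats.to_csv()
without_index = stats.to_csv(index=False)

# Reading the index-free text back yields the same two columns.
reloaded = pd.read_csv(io.StringIO(without_index))
```

When a file written with the index is read back, pandas labels that first column "Unnamed: 0", which is why index=False is usually the tidier choice for files shared with others.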