# Python for Data Science Project Session 4: Social Sciences

In the previous notebooks, the datasets you used to complete the tasks had already been transformed into a format that is clean and easy to work with.  In this notebook, we will revisit the dataset used in session 3 called `income_df`.  That dataset had already been cleaned, ordered to match the countries of the happiness survey, and given the following categorical values:

* High income country = 3
* Upper middle income country = 2
* Lower middle income country = 1
* Low income country = 0

Session 4 has now given you the tools to transform the original raw form of the data yourself.  Working with country names is common in the social sciences, but can be complicated by different spellings and by the distinctions between territories, countries, regions, and continents.  The techniques used below to clean up the dataset can be applied whenever you work with country data.

## Initial dataset cleaning 

Start by importing pandas.

In [1]:
import pandas as pd

Import the file `Raw_country_income_category.csv`, naming the dataset as `raw_df`.

In [2]:
raw_df = pd.read_csv('Raw_country_income_category.csv')
raw_df

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName,Unnamed: 5
0,ABW,Latin America & Caribbean,High income,,Aruba,
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern,
2,AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...,Afghanistan,
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central,
4,AGO,Sub-Saharan Africa,Lower middle income,,Angola,
...,...,...,...,...,...,...
260,XKX,Europe & Central Asia,Upper middle income,,Kosovo,
261,YEM,Middle East & North Africa,Low income,,"Yemen, Rep.",
262,ZAF,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...,South Africa,
263,ZMB,Sub-Saharan Africa,Lower middle income,National accounts data were rebased to reflect...,Zambia,


We only need two columns from this dataframe: `IncomeGroup`, which gives each country an associated income band, and `TableName`, which contains the names of regions and countries.  Create a new dataframe called `raw_income_df` containing these two columns.

In [4]:
raw_income_df = raw_df[["IncomeGroup", "TableName"]].copy()
raw_income_df

Unnamed: 0,IncomeGroup,TableName
0,High income,Aruba
1,,Africa Eastern and Southern
2,Low income,Afghanistan
3,,Africa Western and Central
4,Lower middle income,Angola
...,...,...
260,Upper middle income,Kosovo
261,Low income,"Yemen, Rep."
262,Upper middle income,South Africa
263,Lower middle income,Zambia


We can now change the `IncomeGroup` values in `raw_income_df` into the following categorical values:

* High income = 3
* Upper middle income = 2
* Lower middle income = 1
* Low income = 0

Do this using `.replace({"Old Value": "New Value"}, inplace=True)`.

In [5]:
raw_income_df.replace({"High income":3, "Upper middle income":2, "Lower middle income":1, "Low income":0}, inplace=True)
raw_income_df

Unnamed: 0,IncomeGroup,TableName
0,3.0,Aruba
1,,Africa Eastern and Southern
2,0.0,Afghanistan
3,,Africa Western and Central
4,1.0,Angola
...,...,...
260,2.0,Kosovo
261,0.0,"Yemen, Rep."
262,2.0,South Africa
263,1.0,Zambia
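
As an aside, `.replace()` as used above looks for the labels in every column of the dataframe. A minimal sketch of a column-only alternative, assuming a small toy dataframe (`toy_df` and `income_codes` are illustrative names, not part of the session files), could use `.map()` on the single column:

```python
import pandas as pd

# Toy stand-in for raw_income_df (illustrative rows only)
toy_df = pd.DataFrame({
    "IncomeGroup": ["High income", None, "Low income"],
    "TableName": ["Aruba", "Africa Eastern and Southern", "Afghanistan"],
})

# Same label-to-number scheme as in the text above
income_codes = {
    "High income": 3,
    "Upper middle income": 2,
    "Lower middle income": 1,
    "Low income": 0,
}

# .map() only touches the IncomeGroup column; any label that is not a key
# of the dictionary (including a missing value) comes back as NaN
toy_df["IncomeGroup"] = toy_df["IncomeGroup"].map(income_codes)
print(toy_df)
```

One difference to keep in mind: `.replace()` leaves unrecognised labels untouched, whereas `.map()` turns them into missing values, which can help expose misspelt categories.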


Because of the missing values, pandas automatically converts the integers into floats: the default integer type cannot store `NaN`, so a plain `.astype(int)` fails while the missing values are still there.  This is annoying because we want the numbers in the `IncomeGroup` column to represent categorical values, and hence be integers.

We could correct this once we have dealt with our missing values, but the following code is a neat trick to make the change now!

Using `Int64` (with a capital `I`) as opposed to `int` selects pandas' nullable integer type, which can hold whole numbers alongside missing values. Run the code below to convert the floats in the `IncomeGroup` column of `raw_income_df` into integers.

In [6]:
raw_income_df.IncomeGroup = raw_income_df.IncomeGroup.astype('Int64')
raw_income_df

Unnamed: 0,IncomeGroup,TableName
0,3,Aruba
1,,Africa Eastern and Southern
2,0,Afghanistan
3,,Africa Western and Central
4,1,Angola
...,...,...
260,2,Kosovo
261,0,"Yemen, Rep."
262,2,South Africa
263,1,Zambia
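
To see the difference between the two integer types in isolation, here is a small sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

# A missing value forces the default dtype to float64
codes = pd.Series([3.0, None, 0.0])
print(codes.dtype)

# The nullable integer dtype keeps whole numbers and the missing value
nullable = codes.astype("Int64")
print(nullable)

# The plain NumPy integer dtype cannot represent the missing value
try:
    codes.astype("int64")
except ValueError as err:
    print("int64 conversion failed:", err)
```

The missing entry is displayed as `<NA>` after the conversion, and `.isnull()` still reports it as missing.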


## Data filtering: reindexing 

Having established the main components of our income category dataset, we want to start associating it with our happiness dataset in order to make sure each country is given an income category.  Working with country names is often tricky due to different spellings, abbreviations, and ongoing debates about whether a territory is a country or not - such as Taiwan and Palestine.  Trial and error is often an important practice when dealing with special cases.

Start by importing the `Happiness_survey_2019.csv` file and call it `happy_df`.

In [7]:
happy_df = pd.read_csv('Happiness_survey_2019.csv')
happy_df

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035


A quick and effective way to filter our income category data down to the countries in `happy_df` is to reindex the former in terms of the latter.  Study the following example (https://stackoverflow.com/questions/45576800/how-to-sort-dataframe-based-on-a-column-in-another-dataframe-in-pandas), and then follow the steps below.

First, we are going to create a new dataframe titled `income1_df`.  This will consist of the country names in `raw_income_df` being set as the index. 

In [8]:
income1_df = raw_income_df.set_index('TableName').copy()
income1_df

Unnamed: 0_level_0,IncomeGroup
TableName,Unnamed: 1_level_1
Aruba,3
Africa Eastern and Southern,
Afghanistan,0
Africa Western and Central,
Angola,1
...,...
Kosovo,2
"Yemen, Rep.",0
South Africa,2
Zambia,1


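Now, we are going to reindex `income1_df` in relation to the `Country or region` column of `happy_df`.  This puts the countries in `income1_df` in the same order as the countries in `happy_df`, which will simplify joining the two dataframes later on.  Rows whose name does not appear in `happy_df` (such as the regional aggregates) are dropped, and any country in `happy_df` without an exact name match is given a missing value.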

In [9]:
income1_df = income1_df.reindex(index=happy_df['Country or region'])
income1_df

Unnamed: 0_level_0,IncomeGroup
Country or region,Unnamed: 1_level_1
Finland,3
Denmark,3
Norway,3
Iceland,3
Netherlands,3
...,...
Rwanda,0
Tanzania,1
Afghanistan,0
Central African Republic,0
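
The behaviour of `reindex` is easier to see on a tiny example. The sketch below uses made-up stand-ins (`toy_income`, `toy_order`) rather than the session's files:

```python
import pandas as pd

# Toy income table indexed by country name
toy_income = pd.DataFrame(
    {"IncomeGroup": [3, 0, 1]},
    index=["Aruba", "Afghanistan", "Angola"],
)

# Toy "target" order, standing in for happy_df['Country or region']
toy_order = pd.Series(["Angola", "Taiwan", "Aruba"], name="Country or region")

# Rows follow toy_order: Afghanistan is dropped because it is not requested,
# and Taiwan gets NaN because it has no matching label in toy_income
reordered = toy_income.reindex(index=toy_order)
print(reordered)
```

Matching is on exact labels, so a name spelt differently in the two sources behaves like the Taiwan row here and ends up as a missing value.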


Finally, reset the index of `income1_df` so that it is numbered again.  Call this dataframe `income_df`.  Giving the result a new name means the cell can be re-run safely: if we overwrote `income1_df` instead, re-running `reset_index` would add the old numbered index to the dataframe as an extra column.

In [10]:
income_df = income1_df.reset_index()
income_df

Unnamed: 0,Country or region,IncomeGroup
0,Finland,3
1,Denmark,3
2,Norway,3
3,Iceland,3
4,Netherlands,3
...,...,...
151,Rwanda,0
152,Tanzania,1
153,Afghanistan,0
154,Central African Republic,0


## Missing Values

Having filtered through our income category data in relation to our happiness data, we now have to deal with missing values.  

To do this, first we are going to create a boolean array from `income_df` using the method `.isnull()`.  We want any value in the `IncomeGroup` column that is missing (i.e. `NaN`) to be marked as `True`.  Do this below.

In [11]:
income_df.isnull().IncomeGroup

0      False
1      False
2      False
3      False
4      False
       ...  
151    False
152    False
153    False
154    False
155    False
Name: IncomeGroup, Length: 156, dtype: bool

Next, use `.loc[]` to filter `income_df` with the boolean array you wrote above.  This will create a dataframe of all the countries that are missing an income group category.  Call this dataframe `missing`.

P.S. Don't forget to add `.copy()` to the end.

In [12]:
missing = income_df.loc[income_df.isnull().IncomeGroup].copy()
missing

Unnamed: 0,Country or region,IncomeGroup
24,Taiwan,
37,Slovakia,
38,Trinidad & Tobago,
53,South Korea,
63,Northern Cyprus,
67,Russia,
75,Hong Kong,
85,Kyrgyzstan,
98,Ivory Coast,
102,Congo (Brazzaville),
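
The same two steps can be sketched on a three-row toy dataframe (the rows are illustrative, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country or region": ["Finland", "Taiwan", "Rwanda"],
    "IncomeGroup": pd.array([3, None, 0], dtype="Int64"),
})

# Step 1: boolean Series, True wherever IncomeGroup is missing
is_missing = toy["IncomeGroup"].isnull()
print(is_missing.sum(), "row(s) with a missing income group")

# Step 2: keep only the True rows; .copy() makes an independent dataframe
toy_missing = toy.loc[is_missing].copy()
print(toy_missing)
```

Note that the filtered rows keep their original index labels, which is what later allows them to be lined up with the full dataframe again.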


Now, create a version of the `missing` dataframe with the `IncomeGroup` column dropped.  That way, once we have found the missing values, we can put them into a list that is easily added to the dataframe, instead of having to alter the value in each row individually.  Call this new dataframe `missing_countries`.

In [13]:
missing_countries = missing.drop(['IncomeGroup'], axis=1).copy()
missing_countries

Unnamed: 0,Country or region
24,Taiwan
37,Slovakia
38,Trinidad & Tobago
53,South Korea
63,Northern Cyprus
67,Russia
75,Hong Kong
85,Kyrgyzstan
98,Ivory Coast
102,Congo (Brazzaville)


Now comes the tricky part: finding each missing country's income level manually.  This can be done by checking the original `raw_income_df` dataset to see if the country is listed under a different name, or by doing wider research to determine the correct income category.

We are going to do the former first, searching for substrings in `raw_income_df` to help find the missing countries' income categories.

As an example, we are going to try to find the associated income group of Yemen.  

Use the following code: `df["column"].str.contains("pattern", case=False, na=False)`, in which:
* `pattern` = the substring we are searching for
* `case=False` makes the search case-insensitive
* `na=False` means any missing values are treated as not matching the pattern

Call the resulting boolean Series `mask`, and then apply it to the raw income dataframe: `raw_income_df[mask]`.

In [14]:
mask = raw_income_df["TableName"].str.contains("ye", case=False, na=False)
raw_income_df[mask]

Unnamed: 0,IncomeGroup,TableName
261,0,"Yemen, Rep."
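
One thing to watch for: by default `str.contains` treats the pattern as a regular expression, so characters such as `(` and `.` have a special meaning. The sketch below, on a made-up list of names, shows the plain substring search alongside a literal search using `regex=False`:

```python
import pandas as pd

# Toy names, including a missing value (illustrative only)
names = pd.Series(["Yemen, Rep.", "Congo, Rep.", "Congo (Brazzaville)", None])

# Case-insensitive substring search; the missing value counts as "no match"
mask_congo = names.str.contains("congo", case=False, na=False)
print(names[mask_congo].tolist())

# regex=False searches for the text exactly as typed, brackets included
mask_paren = names.str.contains("(Brazzaville)", regex=False, na=False)
print(names[mask_paren].tolist())
```

Searching for a short, distinctive fragment of the name is usually the most forgiving choice when the spelling in the other dataset is unknown.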


This would have to be done for all the countries, and if the data cannot be found in `raw_income_df`, wider research would have to be conducted.  With this information, we create a list called `missing_values` containing the income category values in the same order as `missing_countries`.  I have done the majority of the hard work, but have left the space for Yemen blank for you to fill in.

In [1]:
missing_values = [2,3,3,3,3,2,2,1,1,1,1,1,1,1,0,0,1,0,0,0]
missing_values

[2, 3, 3, 3, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

Finally, add the list of missing income values to the `missing` dataframe using the following syntax: `df["new column title"] = list`.  Call the new column `IncomeGroup`.

In [16]:
missing["IncomeGroup"]= missing_values
missing

Unnamed: 0,Country or region,IncomeGroup
24,Taiwan,2
37,Slovakia,3
38,Trinidad & Tobago,3
53,South Korea,3
63,Northern Cyprus,3
67,Russia,2
75,Hong Kong,2
85,Kyrgyzstan,1
98,Ivory Coast,1
102,Congo (Brazzaville),1
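
Assigning a list relies on the values being in exactly the same order, and of the same length, as the rows. As a hedged alternative sketch, the looked-up values can be stored in a dictionary keyed by country name and mapped onto the column; `toy_missing` and `looked_up` are illustrative names, and the three values are copied from the table above:

```python
import pandas as pd

# Three of the rows from the missing table, with their original index labels
toy_missing = pd.DataFrame(
    {"Country or region": ["Slovakia", "South Korea", "Kyrgyzstan"]},
    index=[37, 53, 85],
)

# Income categories keyed by name, so row order no longer matters
looked_up = {"Slovakia": 3, "South Korea": 3, "Kyrgyzstan": 1}

toy_missing["IncomeGroup"] = (
    toy_missing["Country or region"].map(looked_up).astype("Int64")
)

# Any country we forgot to look up would still be missing here
unmatched = toy_missing[toy_missing["IncomeGroup"].isnull()]
print(toy_missing)
print("Still missing:", unmatched["Country or region"].tolist())
```

The final check makes a forgotten or misspelt key visible straight away, whereas a list that is one value short simply raises a length error.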


## Final dataframe creation 

The final step is to put it all together, establishing our final country income category dataset!

To do so, we can use the following really handy method: `df1.combine_first(df2)`.

This will fill in the missing values in `df1` with the values at the corresponding index labels in `df2`.  Do so below with the `income_df` and `missing` dataframes, calling the completed dataframe `final_income_df`.

In [17]:
final_income_df = income_df.combine_first(missing)
final_income_df

Unnamed: 0,Country or region,IncomeGroup
0,Finland,3
1,Denmark,3
2,Norway,3
3,Iceland,3
4,Netherlands,3
...,...,...
151,Rwanda,0
152,Tanzania,1
153,Afghanistan,0
154,Central African Republic,0
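
A toy sketch of how `combine_first` lines the two dataframes up by index label (`base` and `patch` are made-up stand-ins for `income_df` and `missing`, and the values are illustrative):

```python
import pandas as pd

# "base" has a gap at index label 1
base = pd.DataFrame({
    "Country or region": ["Finland", "Taiwan", "Rwanda"],
    "IncomeGroup": pd.array([3, None, 0], dtype="Int64"),
})

# "patch" carries a value for the same index label
patch = pd.DataFrame(
    {
        "Country or region": ["Taiwan"],
        "IncomeGroup": pd.array([2], dtype="Int64"),
    },
    index=[1],
)

# Values already present in base are kept; only the gap is filled from patch
combined = base.combine_first(patch)
print(combined)
```

This is why it mattered that `missing` kept its original index labels: the filling is done label by label, not by position.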


Finally, export the dataframe we have created to a CSV file using `df.to_csv("Name_of_file", index=False)`.  Setting `index=False` stops pandas from writing the index as an extra column, which we don't need because the rows are simply numbered.

In [18]:
final_income_df.to_csv("Country_income_category.csv", index=False) 
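
To check what ends up in the file without writing to disk, the sketch below sends a toy dataframe to an in-memory text buffer and reads it back (the buffer and `toy_final` are assumptions made for illustration; the session itself writes a real file):

```python
import io
import pandas as pd

toy_final = pd.DataFrame({
    "Country or region": ["Finland", "Rwanda"],
    "IncomeGroup": pd.array([3, 0], dtype="Int64"),
})

# Write the CSV text into memory instead of a file
buffer = io.StringIO()
toy_final.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read it back, asking pandas to use the nullable integer type again
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"IncomeGroup": "Int64"})
print(reloaded.dtypes)
```

A CSV file does not store column types, so passing `dtype` when reading is one way to get the nullable integers back.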

## Final Thoughts

As I mentioned at the beginning, working with country data is tricky.  

Before wrapping up this session, I want to touch on the package `pycountry`. This is a useful library with various components: country letter codes, subdivisions, currencies, and a full country list, to name a few.  If we didn't have `happy_df` to filter with, using the country list would have been another approach.

For example, run the code below to create a list of countries:

In [None]:
import pycountry 

countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
    
countries_list

Now, use this list of countries to filter through our income category dataframe `raw_income_df`.

In [None]:
raw_income_df[raw_income_df.TableName.isin(countries_list)]

As you can see, this has created a dataframe of 186 countries and their associated income category, with the regional data points removed.  There are undoubtedly errors in this dataframe, such as countries wrongly removed because of different spellings, but it is a useful tool when working with country data.
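
To find out which rows a name filter has thrown away, the same `isin` test can be inverted with `~`. The sketch below uses a short hand-written list in place of the `pycountry` names, so both `toy_raw` and `toy_countries_list` are assumptions for illustration:

```python
import pandas as pd

toy_raw = pd.DataFrame({
    "IncomeGroup": pd.array([3, None, 0, 0], dtype="Int64"),
    "TableName": [
        "Aruba",
        "Africa Eastern and Southern",
        "Afghanistan",
        "Yemen, Rep.",
    ],
})

# Hand-written stand-in for the list built from pycountry
toy_countries_list = ["Aruba", "Afghanistan", "Yemen"]

in_list = toy_raw["TableName"].isin(toy_countries_list)
kept = toy_raw[in_list]
dropped = toy_raw[~in_list]   # ~ flips True and False

print(dropped["TableName"].tolist())
```

Reading through the dropped names separates the rows that should go (regional aggregates) from those lost only to a spelling difference, which can then be looked up by hand.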

Hope you enjoyed this session, and it provided you with some useful tools that will help you code in a more efficient way!