# UN Trends in International Migrant Stock - Data Cleaning
By: Yifei Chen (Yvette)
Date: Nov 16, 2022
Shareable link: https://colab.research.google.com/drive/1W7Mxf_cMCON33tj_7hTKawjN6mp_vPxZ?usp=sharing

## Introduction
Dataset: International migrant stock trends from 1990 to 2015 (the 2015 revision), from the United Nations website.

Method: Follow the tidy data principles:
1. Column headers should be variable names, not values.
2. Multiple variables should not be stored in one column.
3. Variables should not be stored in both rows and columns.
4. Multiple types of observational units should not be stored in the same table.
5. A single observational unit should not be stored in multiple tables.


# Table 1: International migrant stock

### Load Table 1 and prepare the data
Import the primary modules and load the raw data into a pandas DataFrame, starting with Table 1.

In [230]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 1 
df_intms = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', sheet_name='Table 1') 

#look at first 20 items in the dataset 
df_intms.head(20)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,United Nations,,,,,,...,,,,,,,,,,
4,,,,,Population Division,,,,,,...,,,,,,,,,,
5,,,,,Department of Economic and Social Affairs,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,Trends in International Migrant Stock: The 201...,,,,,,...,,,,,,,,,,
8,,,,,Table 1 - International migrant stock at mid-...,,,,,,...,,,,,,,,,,
9,,,,,POP/DB/MIG/Stock/Rev.2015,,,,,,...,,,,,,,,,,


The dataset does not have the right header assigned: the first rows hold the UN title banner, and the useful data starts at row 13.
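
One way to deal with banner rows like these is to skip them at load time. The sketch below uses a small in-memory CSV as a stand-in for the Excel sheet (the toy rows and the number of rows to skip are assumptions for illustration, not the layout of the UN workbook); `pd.read_excel` accepts the same `skiprows` and `header` arguments.

```python
import io
import pandas as pd

# Toy stand-in for a sheet with banner rows above the real header (not the UN file)
raw = io.StringIO(
    "United Nations,,\n"
    "Population Division,,\n"
    "Area,1990,1995\n"
    "WORLD,100,110\n"
    "Developed regions,55,60\n"
)

# Skip the two banner rows and use the next row as the header
toy = pd.read_csv(raw, skiprows=2, header=0)
print(toy)
```

The notebook instead loads everything and slices afterwards, which makes it easier to inspect the banner before deciding where the data begins.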

Let's get some high-level information about the dataset to help with cleaning.

In [231]:
#Get a high-level summary of the dataframe
df_intms.info()
df_intms.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   266 non-null    object 
 1   Unnamed: 1   266 non-null    object 
 2   Unnamed: 2   27 non-null     object 
 3   Unnamed: 3   266 non-null    object 
 4   Unnamed: 4   241 non-null    object 
 5   Unnamed: 5   267 non-null    object 
 6   Unnamed: 6   266 non-null    object 
 7   Unnamed: 7   266 non-null    object 
 8   Unnamed: 8   266 non-null    object 
 9   Unnamed: 9   266 non-null    float64
 10  Unnamed: 10  266 non-null    float64
 11  Unnamed: 11  267 non-null    object 
 12  Unnamed: 12  266 non-null    object 
 13  Unnamed: 13  266 non-null    object 
 14  Unnamed: 14  266 non-null    object 
 15  Unnamed: 15  266 non-null    float64
 16  Unnamed: 16  266 non-null    float64
 17  Unnamed: 17  267 non-null    object 
 18  Unnamed: 18  266 non-null    object 
 19  Unnamed:

Unnamed: 0,Unnamed: 9,Unnamed: 10,Unnamed: 15,Unnamed: 16,Unnamed: 21,Unnamed: 22
count,266.0,266.0,266.0,266.0,266.0,266.0
mean,4368413.0,4835420.0,2280930.0,2528531.0,2087509.0,2306916.0
std,18879230.0,20753740.0,9811595.0,10825000.0,9112168.0,9981618.0
min,154.0,141.0,85.0,78.0,69.0,63.0
25%,34402.0,36466.0,16987.5,17391.0,16598.25,18268.25
50%,201545.0,213510.0,90472.5,102608.5,103906.0,109324.0
75%,1257349.0,1382807.0,613899.5,712221.0,550851.2,615995.2
max,221714200.0,243700200.0,114613700.0,126115400.0,107100500.0,117584800.0


In [232]:
df_intms.isnull().sum()


Unnamed: 0      14
Unnamed: 1      14
Unnamed: 2     253
Unnamed: 3      14
Unnamed: 4      39
Unnamed: 5      13
Unnamed: 6      14
Unnamed: 7      14
Unnamed: 8      14
Unnamed: 9      14
Unnamed: 10     14
Unnamed: 11     13
Unnamed: 12     14
Unnamed: 13     14
Unnamed: 14     14
Unnamed: 15     14
Unnamed: 16     14
Unnamed: 17     13
Unnamed: 18     14
Unnamed: 19     14
Unnamed: 20     14
Unnamed: 21     14
Unnamed: 22     14
dtype: int64

In [233]:
# get the size of the dataframe
df_intms.shape

(280, 23)

In Table 1, the year columns are repeated for each sex (both sexes, male, female).
Problem: Column names are values, not variable names.
Tidy data principle 1: Column headers should be variable names, not values.

### Select specific rows or columns or both
Slice the dataframe so that it starts at the first row of valid values.

In [234]:
#keep all 23 columns and the rows at positions 15 to 279 (iloc excludes the stop position)
df_intms1= df_intms[df_intms.columns[0:23]].iloc[15:280]
df_intms1.head()


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
15,1,WORLD,,900,,152563212,160801752,172703309,191269100,221714243.0,...,87884839,97866674,114613714.0,126115435.0,74815702,79064275,84818470,93402426,107100529.0,117584801.0
16,2,Developed regions,(b),901,,82378628,92306854,103375363,117181109,132560325.0,...,50536796,57217777,64081077.0,67618619.0,42115231,47214055,52838567,59963332,68479248.0,72863336.0
17,3,Developing regions,(c),902,,70184584,68494898,69327946,74087991,89153918.0,...,37348043,40648897,50532637.0,58496816.0,32700471,31850220,31979903,33439094,38621281.0,44721465.0
18,4,Least developed countries,(d),941,,11075966,11711703,10077824,9809634,10018128.0,...,5361902,5383009,5462714.0,6463217.0,5236216,5573685,4721920,4432371,4560536.0,5493028.0
19,5,Less developed regions excluding least develop...,,934,,59105261,56778501,59244124,64272611,79130668.0,...,31986141,35265888,45069923.0,52033599.0,27464255,26276535,27257983,29006723,34060745.0,39228437.0
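
As a quick reminder of how positional slicing behaves, here is a minimal sketch on a toy frame (the data is made up): `iloc` counts from zero and excludes the stop position, so `iloc[15:280]` above keeps positions 15 through 279.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# iloc slices by position and excludes the stop position, like Python lists
subset = toy.iloc[3:6]
print(subset.index.tolist())
```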


In [235]:
#rename columns
df_intms1.columns=['ID','Area','Notes','Country Code','Type of data (a)','Both sexes 1990','Both sexes 1995',
                 'Both sexes 2000', 'Both sexes 2005','Both sexes 2010',
                 'Both sexes 2015', 'Male 1990', 'Male 1995', 'Male 2000', 'Male 2005',    
                 'Male 2010', 'Male 2015', 'Female 1990', 'Female 1995', 'Female 2000',  
                 'Female 2005', 'Female 2010', 'Female 2015']

df_intms1.head()


Unnamed: 0,ID,Area,Notes,Country Code,Type of data (a),Both sexes 1990,Both sexes 1995,Both sexes 2000,Both sexes 2005,Both sexes 2010,...,Male 2000,Male 2005,Male 2010,Male 2015,Female 1990,Female 1995,Female 2000,Female 2005,Female 2010,Female 2015
15,1,WORLD,,900,,152563212,160801752,172703309,191269100,221714243.0,...,87884839,97866674,114613714.0,126115435.0,74815702,79064275,84818470,93402426,107100529.0,117584801.0
16,2,Developed regions,(b),901,,82378628,92306854,103375363,117181109,132560325.0,...,50536796,57217777,64081077.0,67618619.0,42115231,47214055,52838567,59963332,68479248.0,72863336.0
17,3,Developing regions,(c),902,,70184584,68494898,69327946,74087991,89153918.0,...,37348043,40648897,50532637.0,58496816.0,32700471,31850220,31979903,33439094,38621281.0,44721465.0
18,4,Least developed countries,(d),941,,11075966,11711703,10077824,9809634,10018128.0,...,5361902,5383009,5462714.0,6463217.0,5236216,5573685,4721920,4432371,4560536.0,5493028.0
19,5,Less developed regions excluding least develop...,,934,,59105261,56778501,59244124,64272611,79130668.0,...,31986141,35265888,45069923.0,52033599.0,27464255,26276535,27257983,29006723,34060745.0,39228437.0
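
The 18 sex/year labels follow a regular pattern, so they could also be generated instead of typed out. A minimal sketch, assuming the same column order as the list above (the variable names `id_cols`, `sexes`, `years` are introduced here for illustration):

```python
from itertools import product

id_cols = ["ID", "Area", "Notes", "Country Code", "Type of data (a)"]
sexes = ["Both sexes", "Male", "Female"]
years = [1990, 1995, 2000, 2005, 2010, 2015]

# product() varies the last factor fastest: all years for one sex, then the next sex
value_cols = [f"{sex} {year}" for sex, year in product(sexes, years)]
columns = id_cols + value_cols
print(len(columns))
```

Generating the labels avoids typos in a long hand-written list, and the length can be checked against `df_intms1.shape[1]` before assigning.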


In [236]:
# Remove unnecessary columns 
df_intms1 = df_intms1.drop(['Notes','Country Code','Type of data (a)'], axis=1) #axis=1 drops columns rather than rows
df_intms1.head()

Unnamed: 0,ID,Area,Both sexes 1990,Both sexes 1995,Both sexes 2000,Both sexes 2005,Both sexes 2010,Both sexes 2015,Male 1990,Male 1995,Male 2000,Male 2005,Male 2010,Male 2015,Female 1990,Female 1995,Female 2000,Female 2005,Female 2010,Female 2015
15,1,WORLD,152563212,160801752,172703309,191269100,221714243.0,243700236.0,77747510,81737477,87884839,97866674,114613714.0,126115435.0,74815702,79064275,84818470,93402426,107100529.0,117584801.0
16,2,Developed regions,82378628,92306854,103375363,117181109,132560325.0,140481955.0,40263397,45092799,50536796,57217777,64081077.0,67618619.0,42115231,47214055,52838567,59963332,68479248.0,72863336.0
17,3,Developing regions,70184584,68494898,69327946,74087991,89153918.0,103218281.0,37484113,36644678,37348043,40648897,50532637.0,58496816.0,32700471,31850220,31979903,33439094,38621281.0,44721465.0
18,4,Least developed countries,11075966,11711703,10077824,9809634,10018128.0,11951316.0,5843107,6142712,5361902,5383009,5462714.0,6463217.0,5236216,5573685,4721920,4432371,4560536.0,5493028.0
19,5,Less developed regions excluding least develop...,59105261,56778501,59244124,64272611,79130668.0,91262036.0,31641006,30501966,31986141,35265888,45069923.0,52033599.0,27464255,26276535,27257983,29006723,34060745.0,39228437.0


### Problem 1: Column names ("1990", "1995") are values, not variable names
Tidy data principle 1: Column headers should be variable names, not values.

In [237]:
#unpivot measured variables to row axis
intms_melt = pd.melt(df_intms1, id_vars =["ID", "Area"] , var_name = "SexYear", value_name = "International migrant stock")
intms_melt.head()

Unnamed: 0,ID,Area,SexYear,International migrant stock
0,1,WORLD,Both sexes 1990,152563212
1,2,Developed regions,Both sexes 1990,82378628
2,3,Developing regions,Both sexes 1990,70184584
3,4,Least developed countries,Both sexes 1990,11075966
4,5,Less developed regions excluding least develop...,Both sexes 1990,59105261
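
To see what `pd.melt` does in isolation, here is a minimal sketch on a made-up two-row frame: the columns listed in `id_vars` are kept, and every other column becomes a (name, value) pair on its own row.

```python
import pandas as pd

# Toy wide table (values are made up)
wide = pd.DataFrame({
    "ID": [1, 2],
    "Area": ["A", "B"],
    "Male 1990": [10, 20],
    "Female 1990": [11, 21],
})

# Each non-id column is stacked into rows: 2 areas x 2 columns = 4 rows
long = pd.melt(wide, id_vars=["ID", "Area"], var_name="SexYear", value_name="Stock")
print(long)
```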


### Problem 2: Multiple variables are stored in one column (Sex and Year)
Tidy data principle 2: Each column should hold exactly one variable.

In [238]:
#Sex and Year share one cell, which violates tidy data principle 2
intms_melt1=(intms_melt.assign(Sex = lambda x: x.SexYear.str[:-4].astype(str), Year = lambda x: x.SexYear.str[-4:].astype(str)).drop("SexYear",axis=1))

intms_melt1.head()


Unnamed: 0,ID,Area,International migrant stock,Sex,Year
0,1,WORLD,152563212,Both sexes,1990
1,2,Developed regions,82378628,Both sexes,1990
2,3,Developing regions,70184584,Both sexes,1990
3,4,Least developed countries,11075966,Both sexes,1990
4,5,Less developed regions excluding least develop...,59105261,Both sexes,1990
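
Note that slicing with `str[:-4]` keeps the space that sat before the year, so the labels end in a trailing space (for example `"Both sexes "`). An alternative sketch, not the notebook's own method, splits once from the right and also stores the year as an integer:

```python
import pandas as pd

s = pd.Series(["Both sexes 1990", "Male 1995", "Female 2015"], name="SexYear")

# Slicing off the last four characters leaves the separator space behind
sliced = s.str[:-4]

# Splitting once from the right keeps "Both sexes" together and drops the separator
parts = s.str.rsplit(" ", n=1, expand=True)
parts.columns = ["Sex", "Year"]
parts["Year"] = parts["Year"].astype(int)
print(parts)
```

If the slicing approach is kept, calling `.str.strip()` on the result removes the trailing space.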


#### Reorder and styling

In [239]:
#sort the rows by ID and Year so each area's years appear together
intms_melt2= intms_melt1.sort_values(by =['ID', 'Year'] )
intms_melt2.head()

Unnamed: 0,ID,Area,International migrant stock,Sex,Year
0,1,WORLD,152563212,Both sexes,1990
1590,1,WORLD,77747510,Male,1990
3180,1,WORLD,74815702,Female,1990
265,1,WORLD,160801752,Both sexes,1995
1855,1,WORLD,81737477,Male,1995


### Problem 3: Variables are stored in both rows and columns (Sex)
Tidy data principle 3: Variables should not be stored in both rows and columns.
The "Sex" column holds three categories ("Both sexes", "Female", and "Male"), which are spread into three separate columns below.


In [240]:
#make a pivot table by splitting the Sex column into three columns: "Both sexes", "female", and "male"
intms_tidy = intms_melt2.pivot_table(
  index = ['ID','Year', 'Area'],
  columns = 'Sex',
  values = 'International migrant stock', aggfunc='first').reset_index()

intms_tidy1 = intms_tidy.set_index(['ID','Area','Year'])

intms_tidy1.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,152563212.0,74815702.0,77747510.0
1,WORLD,1995,160801752.0,79064275.0,81737477.0
1,WORLD,2000,172703309.0,84818470.0,87884839.0
1,WORLD,2005,191269100.0,93402426.0,97866674.0
1,WORLD,2010,221714243.0,107100529.0,114613714.0
1,WORLD,2015,243700236.0,117584801.0,126115435.0
2,Developed regions,1990,82378628.0,42115231.0,40263397.0
2,Developed regions,1995,92306854.0,47214055.0,45092799.0
2,Developed regions,2000,103375363.0,52838567.0,50536796.0
2,Developed regions,2005,117181109.0,59963332.0,57217777.0


In [323]:
#pandas does not read ".." as null, which can cause problems during analysis, so replace ".." with NaN
intms_tidy1.replace('..', np.nan, inplace=True)

#data is tidy now
intms_tidy1


Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,152563212.0,74815702.0,77747510.0
1,WORLD,1995,160801752.0,79064275.0,81737477.0
1,WORLD,2000,172703309.0,84818470.0,87884839.0
1,WORLD,2005,191269100.0,93402426.0,97866674.0
1,WORLD,2010,221714243.0,107100529.0,114613714.0
...,...,...,...,...,...
265,Wallis and Futuna Islands,1995,1680.0,821.0,859.0
265,Wallis and Futuna Islands,2000,2015.0,997.0,1018.0
265,Wallis and Futuna Islands,2005,2365.0,1171.0,1194.0
265,Wallis and Futuna Islands,2010,2776.0,1375.0,1401.0


In [242]:
#check how null values are being handled here
intms_tidy1.isnull().sum()

Sex
Both sexes     15
Female         15
Male           15
dtype: int64
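
A simple sanity check on the tidy table is that the male and female counts add up to the "both sexes" total wherever all three are present. A minimal sketch, using the two WORLD rows shown above plus one made-up missing row (the frame `check` is built by hand here rather than taken from `intms_tidy1`):

```python
import numpy as np
import pandas as pd

check = pd.DataFrame({
    "Both sexes": [152563212.0, 160801752.0, np.nan],
    "Female": [74815702.0, 79064275.0, np.nan],
    "Male": [77747510.0, 81737477.0, np.nan],
}, index=["WORLD 1990", "WORLD 1995", "toy missing row"])

# Compare only rows where all three values are present
complete = check.dropna()
diff = (complete["Male"] + complete["Female"] - complete["Both sexes"]).abs()
print(diff.max())
```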

In [243]:
intms_tidy1.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Both sexes   1575 non-null   float64
 1   Female       1575 non-null   float64
 2   Male         1575 non-null   float64
dtypes: float64(3)
memory usage: 65.7+ KB


# Table 2: Total population 

### Load Table 2 and prepare the data

In [281]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 2 
population_ims = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', sheet_name='Table 2')
population_ims.head(20)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,United Nations,,,,,,...,,,,,,,,,,
4,,,,,Population Division,,,,,,...,,,,,,,,,,
5,,,,,Department of Economic and Social Affairs,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,Trends in International Migrant Stock: The 201...,,,,,,...,,,,,,,,,,
8,,,,,Table 2 - Total population at mid-year by sex...,,,,,,...,,,,,,,,,,
9,,,,,POP/DB/MIG/Stock/Rev.2015,,,,,,...,,,,,,,,,,


In [282]:
#Get a high-level summary of the dataframe
population_ims.info()
population_ims.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   266 non-null    object 
 1   Unnamed: 1   266 non-null    object 
 2   Unnamed: 2   27 non-null     object 
 3   Unnamed: 3   266 non-null    object 
 4   Unnamed: 4   275 non-null    object 
 5   Unnamed: 5   266 non-null    float64
 6   Unnamed: 6   266 non-null    float64
 7   Unnamed: 7   266 non-null    float64
 8   Unnamed: 8   266 non-null    float64
 9   Unnamed: 9   266 non-null    float64
 10  Unnamed: 10  267 non-null    object 
 11  Unnamed: 11  266 non-null    object 
 12  Unnamed: 12  266 non-null    object 
 13  Unnamed: 13  266 non-null    object 
 14  Unnamed: 14  266 non-null    object 
 15  Unnamed: 15  266 non-null    object 
 16  Unnamed: 16  267 non-null    object 
 17  Unnamed: 17  266 non-null    object 
 18  Unnamed: 18  266 non-null    object 
 19  Unnamed:

Unnamed: 0,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
count,266.0,266.0,266.0,266.0,266.0
mean,125898.9,134885.1,143958.6,153458.8,163265.3
std,576532.7,617852.1,658924.5,701276.6,744742.8
min,0.781,0.787,0.798,0.799,0.8
25%,446.4555,484.807,494.8442,528.2893,584.5947
50%,5170.379,5358.465,5798.518,6454.381,7218.885
75%,23770.35,25565.71,27934.41,31797.84,35549.32
max,5735123.0,6126622.0,6519636.0,6929725.0,7349472.0


In [283]:
# get the size of the dataframe
population_ims.shape

(280, 22)

In [284]:
#Drop the description section and set up column names
population_ims1 = population_ims[population_ims.columns[0:22]].iloc[14:280].copy()
#Set the first row as the headers
population_ims1.columns = population_ims1.iloc[0]
#Drop the header row (label 14) now that it has been promoted
population_ims1 = population_ims1.reindex(population_ims1.index.drop(14))
population_ims1.head()


14,NaN,NaN.1,NaN.2,NaN.3,1990.0,1995.0,2000.0,2005.0,2010.0,2015.0,...,2000.0.1,2005.0.1,2010.0.1,2015.0.1,1990.0.1,1995.0.1,2000.0.2,2005.0.2,2010.0.2,2015.0.2
15,1,WORLD,,900,5309667.699,5735123.084,6126622.121,6519635.85,6929725.043,7349472.099,...,3084537.662,3285082.249,3493956.904,3707205.753,2639243.998,2848487.191,3042084.459,3234553.601,3435768.139,3642266.346
16,2,Developed regions,(b),901,1144463.062,1169761.211,1188811.731,1208919.509,1233375.711,1251351.086,...,578010.218,587962.213,599955.476,609297.148,589207.436,601492.755,610801.513,620957.296,633420.235,642053.938
17,3,Developing regions,(c),902,4165204.637,4565361.873,4937810.39,5310716.341,5696349.332,6098121.013,...,2506527.444,2697120.036,2894001.428,3097908.605,2050036.562,2246994.436,2431282.946,2613596.305,2802347.904,3000212.408
18,4,Least developed countries,(d),941,510057.629,585189.354,664386.087,752804.951,847254.847,954157.804,...,331482.475,375757.715,422397.532,476031.179,256015.073,293162.612,332903.612,377047.236,424857.315,478126.625
19,5,Less developed regions excluding least develop...,,934,3655147.008,3980172.519,4273424.303,4557911.39,4849094.485,5143963.209,...,2175044.969,2321362.321,2471603.896,2621877.426,1794021.489,1953831.824,2098379.334,2236549.069,2377490.589,2522085.783
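
Because the same "promote a row to header" step is repeated for several tables, it could be wrapped in a small function. The helper below, `promote_header_row`, is hypothetical (it is not part of pandas or of the original notebook) and is shown on made-up data:

```python
import pandas as pd

def promote_header_row(df, header_pos):
    """Use the row at position header_pos as column labels and keep the rows below it."""
    body = df.iloc[header_pos + 1:].copy()
    body.columns = df.iloc[header_pos].tolist()
    return body

# Toy frame with one banner row above the real header
toy = pd.DataFrame([
    ["banner", None, None],
    ["ID", "Area", 1990],
    [1, "WORLD", 100],
    [2, "Developed regions", 60],
])
clean = promote_header_row(toy, header_pos=1)
print(clean)
```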


In [285]:
#rename columns
population_ims1.columns=['ID','Area','Notes','Country Code','Both sexes 1990','Both sexes 1995',
                 'Both sexes 2000', 'Both sexes 2005','Both sexes 2010',
                 'Both sexes 2015', 'Male 1990', 'Male 1995', 'Male 2000', 'Male 2005',    
                 'Male 2010', 'Male 2015', 'Female 1990', 'Female 1995', 'Female 2000',  
                 'Female 2005', 'Female 2010', 'Female 2015']

#drop "notes" and "country code" as they are not necessary 
population_ims2 = population_ims1.drop(['Notes','Country Code'], axis=1)

population_ims2.head()

Unnamed: 0,ID,Area,Both sexes 1990,Both sexes 1995,Both sexes 2000,Both sexes 2005,Both sexes 2010,Both sexes 2015,Male 1990,Male 1995,Male 2000,Male 2005,Male 2010,Male 2015,Female 1990,Female 1995,Female 2000,Female 2005,Female 2010,Female 2015
15,1,WORLD,5309667.699,5735123.084,6126622.121,6519635.85,6929725.043,7349472.099,2670423.701,2886635.893,3084537.662,3285082.249,3493956.904,3707205.753,2639243.998,2848487.191,3042084.459,3234553.601,3435768.139,3642266.346
16,2,Developed regions,1144463.062,1169761.211,1188811.731,1208919.509,1233375.711,1251351.086,555255.626,568268.456,578010.218,587962.213,599955.476,609297.148,589207.436,601492.755,610801.513,620957.296,633420.235,642053.938
17,3,Developing regions,4165204.637,4565361.873,4937810.39,5310716.341,5696349.332,6098121.013,2115168.075,2318367.437,2506527.444,2697120.036,2894001.428,3097908.605,2050036.562,2246994.436,2431282.946,2613596.305,2802347.904,3000212.408
18,4,Least developed countries,510057.629,585189.354,664386.087,752804.951,847254.847,954157.804,254042.556,292026.742,331482.475,375757.715,422397.532,476031.179,256015.073,293162.612,332903.612,377047.236,424857.315,478126.625
19,5,Less developed regions excluding least develop...,3655147.008,3980172.519,4273424.303,4557911.39,4849094.485,5143963.209,1861125.519,2026340.695,2175044.969,2321362.321,2471603.896,2621877.426,1794021.489,1953831.824,2098379.334,2236549.069,2377490.589,2522085.783


### Problem 1: Column names ("1990", "1995") are values, not variable names
Tidy data principle 1: Column headers should be variable names, not values.

In [286]:
#unpivot measured variables to row axis
population_melt = pd.melt(population_ims2, id_vars =["ID", "Area"] , var_name = "SexYear", value_name = "Total Population")
population_melt.head()

Unnamed: 0,ID,Area,SexYear,Total Population
0,1,WORLD,Both sexes 1990,5309667.699
1,2,Developed regions,Both sexes 1990,1144463.062
2,3,Developing regions,Both sexes 1990,4165204.637
3,4,Least developed countries,Both sexes 1990,510057.629
4,5,Less developed regions excluding least develop...,Both sexes 1990,3655147.008


### Problem 2: Multiple variables are stored in one column (Sex and Year)
Tidy data principle 2: Each column should hold exactly one variable.

In [287]:
#Sex and Year share one cell, they should be separate columns 
population_melt1=(population_melt.assign(Sex = lambda x: x.SexYear.str[:-4].astype(str), Year = lambda x: x.SexYear.str[-4:].astype(str)).drop("SexYear",axis=1))

population_melt1.head()

Unnamed: 0,ID,Area,Total Population,Sex,Year
0,1,WORLD,5309667.699,Both sexes,1990
1,2,Developed regions,1144463.062,Both sexes,1990
2,3,Developing regions,4165204.637,Both sexes,1990
3,4,Least developed countries,510057.629,Both sexes,1990
4,5,Less developed regions excluding least develop...,3655147.008,Both sexes,1990


### Problem 3: Variables are stored in both rows and columns (Sex)
Tidy data principle 3: Variables should not be stored in both rows and columns.
The "Sex" column holds three categories ("Both sexes", "Female", and "Male"), which are spread into three separate columns below.

In [288]:
#make a pivot table by splitting the Sex column into three columns: "Both sexes", "female", and "male"
population_tidy = population_melt1.pivot_table(
  index = ['ID','Year', 'Area'],
  columns = 'Sex',
  values = 'Total Population', aggfunc='first').reset_index()

population_tidy = population_tidy.set_index(['ID','Area','Year'])

#now the data is tidy
population_tidy.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,5309667.699,2639243.998,2670423.701
1,WORLD,1995,5735123.084,2848487.191,2886635.893
1,WORLD,2000,6126622.121,3042084.459,3084537.662
1,WORLD,2005,6519635.85,3234553.601,3285082.249
1,WORLD,2010,6929725.043,3435768.139,3493956.904


In [322]:
#replace ".." with NaN
population_tidy.replace('..', np.nan, inplace=True)

#data is tidy now
population_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,5309667.699,2639243.998,2670423.701
1,WORLD,1995,5735123.084,2848487.191,2886635.893
1,WORLD,2000,6126622.121,3042084.459,3084537.662
1,WORLD,2005,6519635.850,3234553.601,3285082.249
1,WORLD,2010,6929725.043,3435768.139,3493956.904
...,...,...,...,...,...
265,Wallis and Futuna Islands,1995,14.143,,
265,Wallis and Futuna Islands,2000,14.497,,
265,Wallis and Futuna Islands,2005,14.246,,
265,Wallis and Futuna Islands,2010,13.565,,


In [290]:
# check for NaN
population_tidy.isnull().sum()

Sex
Both sexes       0
Female         192
Male           192
dtype: int64

In [291]:
population_tidy.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Both sexes   1590 non-null   float64
 1   Female       1398 non-null   float64
 2   Male         1398 non-null   float64
dtypes: float64(3)
memory usage: 65.7+ KB



# Table 3:  International migrant stock as a percentage of the total population 

### Load Table 3 and prepare the data

In [292]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 3 
ims_percentage = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', sheet_name='Table 3')
ims_percentage.head(20)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,United Nations,,,,,,...,,,,,,,,,,
4,,,,,Population Division,,,,,,...,,,,,,,,,,
5,,,,,Department of Economic and Social Affairs,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,Trends in International Migrant Stock: The 201...,,,,,,...,,,,,,,,,,
8,,,,,Table 3 - International migrant stock as a per...,,,,,,...,,,,,,,,,,
9,,,,,POP/DB/MIG/Stock/Rev.2015,,,,,,...,,,,,,,,,,


In [293]:
#Get a high-level summary of the dataframe
ims_percentage.info()
ims_percentage.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   266 non-null    object 
 1   Unnamed: 1   266 non-null    object 
 2   Unnamed: 2   27 non-null     object 
 3   Unnamed: 3   266 non-null    object 
 4   Unnamed: 4   241 non-null    object 
 5   Unnamed: 5   267 non-null    object 
 6   Unnamed: 6   266 non-null    object 
 7   Unnamed: 7   266 non-null    object 
 8   Unnamed: 8   266 non-null    object 
 9   Unnamed: 9   266 non-null    float64
 10  Unnamed: 10  266 non-null    float64
 11  Unnamed: 11  267 non-null    object 
 12  Unnamed: 12  266 non-null    object 
 13  Unnamed: 13  266 non-null    object 
 14  Unnamed: 14  266 non-null    object 
 15  Unnamed: 15  266 non-null    object 
 16  Unnamed: 16  266 non-null    object 
 17  Unnamed: 17  267 non-null    object 
 18  Unnamed: 18  266 non-null    object 
 19  Unnamed:

Unnamed: 0,Unnamed: 9,Unnamed: 10
count,266.0,266.0
mean,19.795613,20.10345
std,123.669558,123.966384
min,0.063377,0.071076
25%,1.555673,1.472722
50%,4.495284,4.767713
75%,14.866569,15.243307
max,2010.0,2015.0


In [294]:
# get the size of the dataframe
ims_percentage.shape

(280, 23)

In [295]:
#Drop the description section and set up column names
ims_percentage1= ims_percentage[ims_percentage.columns[0:23]].iloc[14:280]
ims_percentage1= pd.DataFrame(ims_percentage1)
#Set the first row as the headers
ims_percentage1.columns = ims_percentage1.iloc[0]
ims_percentage1 = ims_percentage1.reindex(ims_percentage1.index.drop(14))
ims_percentage1.head()


14,NaN,NaN.1,NaN.2,NaN.3,NaN.4,1990.0,1995.0,2000.0,2005.0,2010.0,...,2000.0.1,2005.0.1,2010.0.1,2015.0,1990.0.1,1995.0.1,2000.0.2,2005.0.2,2010.0.2,2015.0.1
15,1,WORLD,,900,,2.87331,2.803806,2.818899,2.933739,3.199467,...,2.849206,2.979124,3.280341,3.4019,2.83474,2.775658,2.788169,2.887645,3.117222,3.228342
16,2,Developed regions,(b),901,,7.198015,7.891085,8.695688,9.693045,10.747765,...,8.743236,9.73154,10.680972,11.097807,7.147777,7.84948,8.650694,9.656595,10.811029,11.348476
17,3,Developing regions,(c),902,,1.685021,1.500317,1.404022,1.395066,1.565106,...,1.490031,1.507122,1.746117,1.888268,1.595116,1.417459,1.315351,1.279428,1.378176,1.49061
18,4,Least developed countries,(d),941,,2.171513,2.001353,1.516863,1.303078,1.182422,...,1.617552,1.432574,1.293264,1.35773,2.045276,1.901226,1.418405,1.175548,1.073428,1.148865
19,5,Less developed regions excluding least develop...,,934,,1.617042,1.426534,1.386338,1.410133,1.631865,...,1.470597,1.519189,1.823509,1.984593,1.530877,1.344872,1.299002,1.296941,1.432634,1.555397


In [296]:
#rename columns
ims_percentage1.columns=['ID','Area','Notes','Country Code','Type of Data','Both sexes 1990','Both sexes 1995',
                 'Both sexes 2000', 'Both sexes 2005','Both sexes 2010',
                 'Both sexes 2015', 'Male 1990', 'Male 1995', 'Male 2000', 'Male 2005',    
                 'Male 2010', 'Male 2015', 'Female 1990', 'Female 1995', 'Female 2000',  
                 'Female 2005', 'Female 2010', 'Female 2015']

#drop "notes", "country code" and "type of data" as they are not necessary 
ims_percentage2 = ims_percentage1.drop(['Notes','Country Code','Type of Data'], axis=1)

ims_percentage2.head()


Unnamed: 0,ID,Area,Both sexes 1990,Both sexes 1995,Both sexes 2000,Both sexes 2005,Both sexes 2010,Both sexes 2015,Male 1990,Male 1995,Male 2000,Male 2005,Male 2010,Male 2015,Female 1990,Female 1995,Female 2000,Female 2005,Female 2010,Female 2015
15,1,WORLD,2.87331,2.803806,2.818899,2.933739,3.199467,3.315888,2.91143,2.831583,2.849206,2.979124,3.280341,3.4019,2.83474,2.775658,2.788169,2.887645,3.117222,3.228342
16,2,Developed regions,7.198015,7.891085,8.695688,9.693045,10.747765,11.226422,7.251326,7.935123,8.743236,9.73154,10.680972,11.097807,7.147777,7.84948,8.650694,9.656595,10.811029,11.348476
17,3,Developing regions,1.685021,1.500317,1.404022,1.395066,1.565106,1.692624,1.772158,1.580624,1.490031,1.507122,1.746117,1.888268,1.595116,1.417459,1.315351,1.279428,1.378176,1.49061
18,4,Least developed countries,2.171513,2.001353,1.516863,1.303078,1.182422,1.252551,2.30005,2.103476,1.617552,1.432574,1.293264,1.35773,2.045276,1.901226,1.418405,1.175548,1.073428,1.148865
19,5,Less developed regions excluding least develop...,1.617042,1.426534,1.386338,1.410133,1.631865,1.774158,1.700101,1.505273,1.470597,1.519189,1.823509,1.984593,1.530877,1.344872,1.299002,1.296941,1.432634,1.555397


### Problem 1: Column names ("1990", "1995") are values, not variable names
Tidy data principle 1: Column headers should be variable names, not values.

In [297]:
#unpivot measured variables to row axis
imsp_melt = pd.melt(ims_percentage2, id_vars =["ID", "Area"] , var_name = "SexYear", value_name = "Percentage")
imsp_melt.head()

Unnamed: 0,ID,Area,SexYear,Percentage
0,1,WORLD,Both sexes 1990,2.87331
1,2,Developed regions,Both sexes 1990,7.198015
2,3,Developing regions,Both sexes 1990,1.685021
3,4,Least developed countries,Both sexes 1990,2.171513
4,5,Less developed regions excluding least develop...,Both sexes 1990,1.617042


### Problem 2: Multiple variables are stored in one column (Sex and Year)
Tidy data principle 2: Each column should hold exactly one variable.

In [298]:
#Sex and Year share one cell, they should be separate columns 
imsp_melt1=(imsp_melt.assign(Sex = lambda x: x.SexYear.str[:-4].astype(str), Year = lambda x: x.SexYear.str[-4:].astype(str)).drop("SexYear",axis=1))

imsp_melt1.head()

Unnamed: 0,ID,Area,Percentage,Sex,Year
0,1,WORLD,2.87331,Both sexes,1990
1,2,Developed regions,7.198015,Both sexes,1990
2,3,Developing regions,1.685021,Both sexes,1990
3,4,Least developed countries,2.171513,Both sexes,1990
4,5,Less developed regions excluding least develop...,1.617042,Both sexes,1990


### Problem 3: Variables are stored in both rows and columns (Sex)
Tidy data principle 3: Variables should not be stored in both rows and columns.
The "Sex" column holds three categories ("Both sexes", "Female", and "Male"), which are spread into three separate columns below.

In [299]:
#make a pivot table by splitting the Sex column into three columns: "Both sexes", "female", and "male"
imsp_tidy = imsp_melt1.pivot_table(
  index = ['ID','Year', 'Area'],
  columns = 'Sex',
  values = 'Percentage', aggfunc='first').reset_index()

imsp_tidy = imsp_tidy.set_index(['ID','Area','Year'])
imsp_tidy.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,2.87331,2.83474,2.91143
1,WORLD,1995,2.803806,2.775658,2.831583
1,WORLD,2000,2.818899,2.788169,2.849206
1,WORLD,2005,2.933739,2.887645,2.979124
1,WORLD,2010,3.199467,3.117222,3.280341


In [300]:
imsp_tidy.replace('..', np.nan, inplace=True)
imsp_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990,2.873310,2.834740,2.911430
1,WORLD,1995,2.803806,2.775658,2.831583
1,WORLD,2000,2.818899,2.788169,2.849206
1,WORLD,2005,2.933739,2.887645,2.979124
1,WORLD,2010,3.199467,3.117222,3.280341
...,...,...,...,...,...
265,Wallis and Futuna Islands,1995,11.878668,,
265,Wallis and Futuna Islands,2000,13.899427,,
265,Wallis and Futuna Islands,2005,16.601151,,
265,Wallis and Futuna Islands,2010,20.464431,,


In [301]:
#Check for missing values in the dataset 
imsp_tidy.isnull().sum()

Sex
Both sexes      19
Female         204
Male           204
dtype: int64

In [302]:
imsp_tidy.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Both sexes   1571 non-null   float64
 1   Female       1386 non-null   float64
 2   Male         1386 non-null   float64
dtypes: float64(3)
memory usage: 65.7+ KB


#Table 4: Female migrants as a percentage of the international migrant stock


##Load table 4 and prepare the data 

In [303]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 4 
missing_values = ['..'] # define missing values 
df_female = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', sheet_name='Table 4', na_values = missing_values)
df_female.head(20)



Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,,,,,,,,,,,
1,,,,,,,,,,,
2,,,,,,,,,,,
3,,,,,United Nations,,,,,,
4,,,,,Population Division,,,,,,
5,,,,,Department of Economic and Social Affairs,,,,,,
6,,,,,,,,,,,
7,,,,,Trends in International Migrant Stock: The 201...,,,,,,
8,,,,,Table 4 - Female migrants as a percentage of t...,,,,,,
9,,,,,POP/DB/MIG/Stock/Rev.2015,,,,,,


In [304]:
#Get a high-level summary of the dataframe
df_female.info()
df_female.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   266 non-null    object 
 1   Unnamed: 1   266 non-null    object 
 2   Unnamed: 2   27 non-null     object 
 3   Unnamed: 3   266 non-null    object 
 4   Unnamed: 4   241 non-null    object 
 5   Unnamed: 5   263 non-null    object 
 6   Unnamed: 6   262 non-null    float64
 7   Unnamed: 7   262 non-null    float64
 8   Unnamed: 8   263 non-null    float64
 9   Unnamed: 9   266 non-null    float64
 10  Unnamed: 10  266 non-null    float64
dtypes: float64(5), object(6)
memory usage: 24.2+ KB


Unnamed: 0,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
count,262.0,262.0,263.0,266.0,266.0
mean,55.857045,56.004701,55.668023,55.597419,55.683104
std,120.411311,120.723619,120.837977,120.488473,120.793807
min,13.858366,13.859552,13.628353,13.458551,13.325719
25%,46.519688,46.227674,45.887502,45.872356,45.88161
50%,48.945826,49.202604,49.10075,49.274357,49.427732
75%,51.714282,52.001573,52.095902,51.979357,52.089893
max,1995.0,2000.0,2005.0,2010.0,2015.0


In [305]:
# get the size of the dataframe
df_female.shape

(280, 11)
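
The next cell hard-codes row 14 as the header row. As a hedged alternative, the header row could be located programmatically, assuming it is the first row whose first column is non-null (which matches the 266 non-null count reported by `info()` above). The toy frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw sheet: blank/title rows, then a header row, then data
raw = pd.DataFrame({
    "Unnamed: 0": [np.nan, np.nan, "Sort\norder", 1, 2],
    "Unnamed: 1": [np.nan, "United Nations", "Area", "WORLD", "Developed regions"],
    "Unnamed: 2": [np.nan, np.nan, 1990, 49.0, 51.1],
})

# First row where the first column has a value is assumed to be the header
header_row = raw["Unnamed: 0"].first_valid_index()

# Keep everything below the header and promote the header row to column names
body = raw.loc[header_row + 1:].copy()
body.columns = raw.loc[header_row].tolist()
body = body.reset_index(drop=True)
```

This relies on the default integer index that `read_excel` produces, so `header_row + 1` is a valid label.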

In [306]:
#Drop the description section
df_female= df_female[df_female.columns[0:11]].iloc[14:280]
df_female1= pd.DataFrame(df_female)
#Set the first row as the headers
df_female1.columns = df_female1.iloc[0]
df_female1 = df_female1.reindex(df_female1.index.drop(14))
df_female1.head()


14,NaN,NaN.1,NaN.2,NaN.3,NaN.4,1990.0,1995.0,2000.0,2005.0,2010.0,2015.0
15,1,WORLD,,900,,49.03915,49.16879,49.112244,48.832993,48.30566,48.249769
16,2,Developed regions,(b),901,,51.123977,51.149024,51.113307,51.171501,51.658932,51.866687
17,3,Developing regions,(c),902,,46.592099,46.500135,46.128444,45.134297,43.31978,43.327078
18,4,Least developed countries,(d),941,,47.261155,47.571664,46.826689,45.157406,45.499573,45.942752
19,5,Less developed regions excluding least develop...,,934,,46.466684,46.279022,46.009598,45.130768,43.043672,42.984398


In [307]:
#rename columns
df_female1.columns=['ID','Area','Notes','Country Code','Type of data (a)','1990', '1995', '2000',  
                 '2005', '2010', '2015']

#drop "Notes", "Country Code", and "Type of data (a)" as they are not necessary 
df_female2 = df_female1.drop(['Notes','Country Code','Type of data (a)'], axis=1)

df_female2.head()

Unnamed: 0,ID,Area,1990,1995,2000,2005,2010,2015
15,1,WORLD,49.03915,49.16879,49.112244,48.832993,48.30566,48.249769
16,2,Developed regions,51.123977,51.149024,51.113307,51.171501,51.658932,51.866687
17,3,Developing regions,46.592099,46.500135,46.128444,45.134297,43.31978,43.327078
18,4,Least developed countries,47.261155,47.571664,46.826689,45.157406,45.499573,45.942752
19,5,Less developed regions excluding least develop...,46.466684,46.279022,46.009598,45.130768,43.043672,42.984398


##Problem: Column names ("1990", "1995") are values, not variable names
Tidy data principle 1: Column headers should be variable names, not values.

In [308]:
#unpivot measured variables to row axis
df_female3 = pd.melt(df_female2, id_vars =["ID", "Area"] , var_name = "Year", value_name = "Female Stock Percentage")
df_female3.head()


Unnamed: 0,ID,Area,Year,Female Stock Percentage
0,1,WORLD,1990,49.03915
1,2,Developed regions,1990,51.123977
2,3,Developing regions,1990,46.592099
3,4,Least developed countries,1990,47.261155
4,5,Less developed regions excluding least develop...,1990,46.466684


In [309]:
#reorder the dataset
df_female4= df_female3.sort_values(by =['ID', 'Year'] )
df_female4 = df_female4.set_index(['ID','Area','Year'])

#The dataset is tidy now
df_female4 

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Female Stock Percentage
ID,Area,Year,Unnamed: 3_level_1
1,WORLD,1990,49.03915
1,WORLD,1995,49.16879
1,WORLD,2000,49.112244
1,WORLD,2005,48.832993
1,WORLD,2010,48.30566
...,...,...,...
265,Wallis and Futuna Islands,1995,48.869048
265,Wallis and Futuna Islands,2000,49.478908
265,Wallis and Futuna Islands,2005,49.513742
265,Wallis and Futuna Islands,2010,49.5317


In [310]:
df_female4["Female Stock Percentage"]=df_female4["Female Stock Percentage"].astype(float)

In [311]:
df_female4.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 1 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Female Stock Percentage  1575 non-null   float64
dtypes: float64(1)
memory usage: 40.9+ KB
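
The 1590 entries reported above equal 265 areas times 6 reference years, which is what a melt of six year columns should produce. A small sketch of that sanity check on a made-up toy frame (names and values are illustrative):

```python
import pandas as pd

# Toy wide table with two areas and three year columns
wide = pd.DataFrame({
    "ID": [1, 2],
    "Area": ["WORLD", "Developed regions"],
    "1990": [49.0, 51.1],
    "1995": [49.2, 51.1],
    "2000": [49.1, 51.1],
})
long = pd.melt(wide, id_vars=["ID", "Area"], var_name="Year",
               value_name="Female Stock Percentage")

# After melting, rows should equal areas x year columns
n_areas = wide["ID"].nunique()
n_years = wide.shape[1] - 2          # all columns except ID and Area
rows_match = len(long) == n_areas * n_years

# Each (ID, Year) pair should appear exactly once
keys_unique = not long.duplicated(subset=["ID", "Year"]).any()
```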



# Table 5 : Annual rate of change  

### Load Table 5 and prepare the data

In [312]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 5 
arc = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', sheet_name='Table 5')
arc.head(20)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
0,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,
3,,,,,United Nations,,,,,,,,,,,,,,,
4,,,,,Population Division,,,,,,,,,,,,,,,
5,,,,,Department of Economic and Social Affairs,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,,,,,,
7,,,,,Trends in International Migrant Stock: The 201...,,,,,,,,,,,,,,,
8,,,,,Table 5 - Annual rate of change of the migrant...,,,,,,,,,,,,,,,
9,,,,,POP/DB/MIG/Stock/Rev.2015,,,,,,,,,,,,,,,


In [313]:
#Get a high-level summary of the dataframe
arc.info()
arc.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   266 non-null    object
 1   Unnamed: 1   266 non-null    object
 2   Unnamed: 2   27 non-null     object
 3   Unnamed: 3   266 non-null    object
 4   Unnamed: 4   241 non-null    object
 5   Unnamed: 5   267 non-null    object
 6   Unnamed: 6   266 non-null    object
 7   Unnamed: 7   266 non-null    object
 8   Unnamed: 8   266 non-null    object
 9   Unnamed: 9   266 non-null    object
 10  Unnamed: 10  267 non-null    object
 11  Unnamed: 11  266 non-null    object
 12  Unnamed: 12  266 non-null    object
 13  Unnamed: 13  266 non-null    object
 14  Unnamed: 14  266 non-null    object
 15  Unnamed: 15  267 non-null    object
 16  Unnamed: 16  266 non-null    object
 17  Unnamed: 17  266 non-null    object
 18  Unnamed: 18  266 non-null    object
 19  Unnamed: 19  266 non-null    

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
count,266,266,27,266,241,267,266,266,266,266,267,266,266,266,266,267,266,266,266,266
unique,266,266,27,266,18,264,263,263,264,266,264,263,263,264,266,264,263,263,264,266
top,Sort\norder,"Major area, region, country or area of destina...",Notes,Country code,B,..,..,..,..,2010-2015,..,..,..,..,2010-2015,..,..,..,..,2010-2015
freq,1,1,1,1,132,4,4,4,3,1,4,4,4,3,1,4,4,4,3,1


In [314]:
# get the size of the dataframe
arc.shape

(280, 20)

In [315]:
#Drop the description section and set up column names
arc1= arc[arc.columns[0:20]].iloc[14:280]
arc1= pd.DataFrame(arc1)
#Set the first row as the headers
arc1.columns = arc1.iloc[0]
arc1 = arc1.reindex(arc1.index.drop(14))
arc1.head()


14,NaN,NaN.1,NaN.2,NaN.3,NaN.4,1990-1995,1995-2000,2000-2005,2005-2010,2010-2015,1990-1995.1,1995-2000.1,2000-2005.1,2005-2010.1,2010-2015.1,1990-1995.2,1995-2000.2,2000-2005.2,2005-2010.2,2010-2015.2
15,1,WORLD,,900,,1.051865,1.428058,2.042124,2.95416,1.890991,1.000922,1.450294,2.151575,3.159228,1.912603,1.104667,1.405044,1.92808,2.737012,1.867837
16,2,Developed regions,(b),901,,2.275847,2.264965,2.50708,2.466343,1.160824,2.265595,2.279583,2.483259,2.265689,1.074685,2.285643,2.250995,2.529838,2.65595,1.241097
17,3,Developing regions,(c),902,,-0.487389,0.241777,1.328107,3.702217,2.929634,-0.45298,0.380246,1.693824,4.352954,2.927058,-0.526904,0.081268,0.89236,2.881555,2.933003
18,4,Least developed countries,(d),941,,1.118175,-3.001139,-0.539636,0.419137,3.526927,1.000073,-2.718952,0.078575,0.293964,3.363629,1.249146,-3.316818,-1.265617,0.57011,3.72079
19,5,Less developed regions excluding least develop...,,934,,-0.803244,0.850177,1.62934,4.159339,2.852687,-0.733256,0.950231,1.952269,4.90598,2.87349,-0.88418,0.733402,1.243624,3.212358,2.825127


In [316]:
#rename columns
arc1.columns=['ID','Area','Notes','Country Code','Type of Data','Both sexes 1990-1995','Both sexes 1995-2000',
                 'Both sexes 2000-2005', 'Both sexes 2005-2010','Both sexes 2010-2015',
                'Male 1990-1995', 'Male 1995-2000', 'Male 2000-2005', 'Male 2005-2010',    
                 'Male 2010-2015', 'Female 1990-1995', 'Female 1995-2000', 'Female 2000-2005',  
                 'Female 2005-2010', 'Female 2010-2015']

#drop "Notes", "Country Code", and "Type of Data" as they are not necessary 
arc2 = arc1.drop(['Notes','Country Code','Type of Data'], axis=1)

arc2.head()


Unnamed: 0,ID,Area,Both sexes 1990-1995,Both sexes 1995-2000,Both sexes 2000-2005,Both sexes 2005-2010,Both sexes 2010-2015,Male 1990-1995,Male 1995-2000,Male 2000-2005,Male 2005-2010,Male 2010-2015,Female 1990-1995,Female 1995-2000,Female 2000-2005,Female 2005-2010,Female 2010-2015
15,1,WORLD,1.051865,1.428058,2.042124,2.95416,1.890991,1.000922,1.450294,2.151575,3.159228,1.912603,1.104667,1.405044,1.92808,2.737012,1.867837
16,2,Developed regions,2.275847,2.264965,2.50708,2.466343,1.160824,2.265595,2.279583,2.483259,2.265689,1.074685,2.285643,2.250995,2.529838,2.65595,1.241097
17,3,Developing regions,-0.487389,0.241777,1.328107,3.702217,2.929634,-0.45298,0.380246,1.693824,4.352954,2.927058,-0.526904,0.081268,0.89236,2.881555,2.933003
18,4,Least developed countries,1.118175,-3.001139,-0.539636,0.419137,3.526927,1.000073,-2.718952,0.078575,0.293964,3.363629,1.249146,-3.316818,-1.265617,0.57011,3.72079
19,5,Less developed regions excluding least develop...,-0.803244,0.850177,1.62934,4.159339,2.852687,-0.733256,0.950231,1.952269,4.90598,2.87349,-0.88418,0.733402,1.243624,3.212358,2.825127


##Problem 1: Column names ("Both sexes 1990-1995", "Male 1990-1995") are values, not variable names
Tidy data principle 1: Column headers should be variable names, not values.

In [317]:
#unpivot measured variables to row axis
arc_melt = pd.melt(arc2, id_vars =["ID", "Area"] , var_name = "SexYear", value_name = "Annual ROC")
arc_melt.head()

Unnamed: 0,ID,Area,SexYear,Annual ROC
0,1,WORLD,Both sexes 1990-1995,1.051865
1,2,Developed regions,Both sexes 1990-1995,2.275847
2,3,Developing regions,Both sexes 1990-1995,-0.487389
3,4,Least developed countries,Both sexes 1990-1995,1.118175
4,5,Less developed regions excluding least develop...,Both sexes 1990-1995,-0.803244


##Problem 2: There are multiple variables stored in 1 column (Sex & Year)
Tidy data principle 2: Each column needs to consist of one and only one variable

In [318]:
#Sex and Year share one cell, they should be separate columns 
arc_melt1=(arc_melt.assign(Sex = lambda x: x.SexYear.str[:-9].astype(str), Year = lambda x: x.SexYear.str[-9:].astype(str)).drop("SexYear",axis=1))

arc_melt1.head(20)

Unnamed: 0,ID,Area,Annual ROC,Sex,Year
0,1,WORLD,1.051865,Both sexes,1990-1995
1,2,Developed regions,2.275847,Both sexes,1990-1995
2,3,Developing regions,-0.487389,Both sexes,1990-1995
3,4,Least developed countries,1.118175,Both sexes,1990-1995
4,5,Less developed regions excluding least develop...,-0.803244,Both sexes,1990-1995
5,6,Sub-Saharan Africa,0.845374,Both sexes,1990-1995
6,7,Africa,0.826734,Both sexes,1990-1995
7,8,Eastern Africa,-3.435412,Both sexes,1990-1995
8,9,Burundi,-5.355717,Both sexes,1990-1995
9,10,Comoros,-0.199873,Both sexes,1990-1995
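
For labels such as "Both sexes 1990-1995", a regular expression is one alternative to counting characters from the end. The sketch below is only an illustration on a toy frame; the pattern and the toy values are assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame using the label format shown in the melted table above
toy = pd.DataFrame({
    "SexYear": ["Both sexes 1990-1995", "Male 1995-2000", "Female 2010-2015"],
    "Annual ROC": [1.05, 1.45, 1.87],
})

# Named groups become column names: the label, then a yyyy-yyyy period
pattern = r"^(?P<Sex>.+?)\s+(?P<Year>\d{4}-\d{4})$"
parts = toy["SexYear"].str.extract(pattern)

# Attach the two new columns and drop the combined one
toy_split = pd.concat([toy.drop(columns="SexYear"), parts], axis=1)
```

Because the whitespace between the label and the period is matched outside both groups, the "Sex" values carry no trailing space.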


##Problem 3: Variables are stored in both rows and columns (Sex)
Tidy data principle 3: Variables should not be stored in both rows and columns. The "Sex" column holds three categories ("Both sexes", "Female", and "Male"), and each becomes its own column.

In [319]:
#make a pivot table by splitting the Sex column into three columns: "Both sexes", "female", and "male"
arc_tidy = arc_melt1.pivot_table(
  index = ['ID','Year', 'Area'],
  columns = 'Sex',
  values = 'Annual ROC', aggfunc='first').reset_index()

arc_tidy= arc_tidy.set_index(['ID','Area','Year'])
arc_tidy.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990-1995,1.051865,1.104667,1.000922
1,WORLD,1995-2000,1.428058,1.405044,1.450294
1,WORLD,2000-2005,2.042124,1.92808,2.151575
1,WORLD,2005-2010,2.95416,2.737012,3.159228
1,WORLD,2010-2015,1.890991,1.867837,1.912603
2,Developed regions,1990-1995,2.275847,2.285643,2.265595
2,Developed regions,1995-2000,2.264965,2.250995,2.279583
2,Developed regions,2000-2005,2.50708,2.529838,2.483259
2,Developed regions,2005-2010,2.466343,2.65595,2.265689
2,Developed regions,2010-2015,1.160824,1.241097,1.074685
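
One design note on the pivot step: `pivot_table` with `aggfunc='first'` quietly keeps the first value when a key combination appears more than once. If duplicates would indicate a cleaning mistake, `DataFrame.pivot` is a stricter option because it raises instead. A hedged sketch on made-up data:

```python
import pandas as pd

# Toy long table: one area, two periods, two sexes (values are illustrative)
long = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "Area": ["WORLD"] * 4,
    "Year": ["1990-1995", "1990-1995", "1995-2000", "1995-2000"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Annual ROC": [1.00, 1.10, 1.45, 1.41],
})

# pivot reshapes without aggregating; the index is already (ID, Area, Year)
wide = long.pivot(index=["ID", "Area", "Year"], columns="Sex", values="Annual ROC")

# With a duplicated key, pivot refuses to reshape rather than picking a value
dup = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index=["ID", "Area", "Year"], columns="Sex", values="Annual ROC")
    raised = False
except ValueError:
    raised = True
```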


In [320]:
arc_tidy.replace('..', np.nan, inplace=True)
arc_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,Both sexes,Female,Male
ID,Area,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,WORLD,1990-1995,1.051865,1.104667,1.000922
1,WORLD,1995-2000,1.428058,1.405044,1.450294
1,WORLD,2000-2005,2.042124,1.928080,2.151575
1,WORLD,2005-2010,2.954160,2.737012,3.159228
1,WORLD,2010-2015,1.890991,1.867837,1.912603
...,...,...,...,...,...
265,Wallis and Futuna Islands,1990-1995,3.617880,3.886601,3.364378
265,Wallis and Futuna Islands,1995-2000,3.636508,3.884553,3.396526
265,Wallis and Futuna Islands,2000-2005,3.203177,3.217252,3.189382
265,Wallis and Futuna Islands,2005-2010,3.204660,3.211913,3.197545


In [321]:

#check data type 
arc_tidy.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1325 entries, (1, 'WORLD', '1990-1995') to (265, 'Wallis and Futuna Islands', '2010-2015')
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Both sexes   1310 non-null   float64
 1   Female       1310 non-null   float64
 2   Male         1310 non-null   float64
dtypes: float64(3)
memory usage: 58.2+ KB



# Table 6 : Estimated refugee stock

### Load Table 6 and prepare the data

In [256]:
#import libraries
import pandas as pd
import numpy as np

#import raw data and table 6 
missing_values = ['..'] # define missing values 
ers_df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/UN_MigrantStockTotal_2015.xlsx', 
                       sheet_name='Table 6',na_values = missing_values)
ers_df


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,United Nations,,,,,,...,,,,,,,,,,
4,,,,,Population Division,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,261,Samoa,,882,B,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
276,262,Tokelau,,772,B,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
277,263,Tonga,,776,B,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
278,264,Tuvalu,,798,C,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,


In [257]:
#Get a high-level summary of the dataframe
ers_df.info()
ers_df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   266 non-null    object 
 1   Unnamed: 1   266 non-null    object 
 2   Unnamed: 2   27 non-null     object 
 3   Unnamed: 3   266 non-null    object 
 4   Unnamed: 4   241 non-null    object 
 5   Unnamed: 5   264 non-null    object 
 6   Unnamed: 6   263 non-null    float64
 7   Unnamed: 7   263 non-null    float64
 8   Unnamed: 8   264 non-null    float64
 9   Unnamed: 9   266 non-null    float64
 10  Unnamed: 10  266 non-null    float64
 11  Unnamed: 11  263 non-null    object 
 12  Unnamed: 12  262 non-null    float64
 13  Unnamed: 13  262 non-null    float64
 14  Unnamed: 14  263 non-null    float64
 15  Unnamed: 15  266 non-null    float64
 16  Unnamed: 16  266 non-null    float64
 17  Unnamed: 17  151 non-null    object 
 18  Unnamed: 18  176 non-null    object 
 19  Unnamed:

Unnamed: 0,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
count,263.0,263.0,264.0,266.0,266.0,262.0,262.0,263.0,266.0,266.0
mean,412501.0,360301.7,300491.5,345734.5,446342.6,19.870113,17.978212,16.747038,17.00106,18.662542
std,1741854.0,1595135.0,1360263.0,1648258.0,2120758.0,126.494471,127.823271,129.68398,131.759169,133.279478
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.006424
50%,1320.0,1609.0,1535.5,1764.5,1524.5,0.606067,0.595974,0.522784,0.42266,0.559671
75%,72925.0,47130.5,46078.25,37433.5,60258.75,10.815753,6.643305,5.242832,5.043185,5.418954
max,17853840.0,15827800.0,13276730.0,15370760.0,19577470.0,1995.0,2000.0,2005.0,2010.0,2015.0


In [258]:
# get the size of the dataframe
ers_df.shape

(280, 22)

In [259]:
#Drop the description section and set up column names
ers_df1= ers_df[ers_df.columns[0:22]].iloc[14:280]
ers_df1= pd.DataFrame(ers_df1)
#Set the first row as the headers
ers_df1.columns = ers_df1.iloc[0]
ers_df1 = ers_df1.reindex(ers_df1.index.drop(14))
ers_df1.head()


14,NaN,NaN.1,NaN.2,NaN.3,NaN.4,1990,1995.0,2000.0,2005.0,2010.0,...,1995.0.1,2000.0.1,2005.0.1,2010.0.1,2015.0,1990-1995,1995-2000,2000-2005,2005-2010,2010-2015
15,1,WORLD,,900,,18836571,17853840.0,15827803.0,13276733.0,15370755.0,...,11.103013,9.164736,6.941389,6.932687,8.033424,-2.123497,-3.837069,-5.557223,-0.025089,2.947267
16,2,Developed regions,(b),901,,2014564,3609670.0,2997256.0,2361229.0,2046917.0,...,3.910511,2.899391,2.015025,1.54414,1.391085,9.388424,-5.983348,-7.277379,-5.323293,-2.087656
17,3,Developing regions,(c),902,,16822007,14244170.0,12830547.0,10915504.0,13323838.0,...,20.795958,18.507035,14.733162,14.944759,17.073768,-2.839417,-2.332154,-4.561,0.285195,2.663652
18,4,Least developed countries,(d),941,,5048391,5160131.0,3047488.0,2363782.0,1957884.0,...,44.041961,30.221557,24.08243,19.533425,28.801534,-0.680327,-7.531747,-4.541459,-4.187109,7.766031
19,5,Less developed regions excluding least develop...,,934,,11773616,9084039.0,9783059.0,8551722.0,11365954.0,...,15.999082,16.51313,13.305391,14.363526,15.537313,-4.3836,0.632489,-4.319731,1.530456,1.571047


In [260]:
#rename columns
ers_df1.columns=['ID','Area','Notes','Country Code','Type of Data','Both sexes 1990','Both sexes 1995',
                 'Both sexes 2000', 'Both sexes 2005','Both sexes 2010','Both sexes 2015',
                'Percentage 1990', 'Percentage 1995', 'Percentage 2000', 'Percentage 2005',    
                 'Percentage 2010', 'Percentage 2015','Refugee ROC 1990-1995', 'Refugee ROC 1995-2000', 'Refugee ROC 2000-2005',  
                 'Refugee ROC 2005-2010', 'Refugee ROC 2010-2015']

#drop "Notes", "Country Code", and "Type of Data" as they are not necessary 
ers_df2 = ers_df1.drop(['Notes','Country Code','Type of Data'], axis=1)

ers_df2.head()


Unnamed: 0,ID,Area,Both sexes 1990,Both sexes 1995,Both sexes 2000,Both sexes 2005,Both sexes 2010,Both sexes 2015,Percentage 1990,Percentage 1995,Percentage 2000,Percentage 2005,Percentage 2010,Percentage 2015,Refugee ROC 1990-1995,Refugee ROC 1995-2000,Refugee ROC 2000-2005,Refugee ROC 2005-2010,Refugee ROC 2010-2015
15,1,WORLD,18836571,17853840.0,15827803.0,13276733.0,15370755.0,19577474.0,12.346732,11.103013,9.164736,6.941389,6.932687,8.033424,-2.123497,-3.837069,-5.557223,-0.025089,2.947267
16,2,Developed regions,2014564,3609670.0,2997256.0,2361229.0,2046917.0,1954224.0,2.445494,3.910511,2.899391,2.015025,1.54414,1.391085,9.388424,-5.983348,-7.277379,-5.323293,-2.087656
17,3,Developing regions,16822007,14244170.0,12830547.0,10915504.0,13323838.0,17623250.0,23.968236,20.795958,18.507035,14.733162,14.944759,17.073768,-2.839417,-2.332154,-4.561,0.285195,2.663652
18,4,Least developed countries,5048391,5160131.0,3047488.0,2363782.0,1957884.0,3443582.0,45.56588,44.041961,30.221557,24.08243,19.533425,28.801534,-0.680327,-7.531747,-4.541459,-4.187109,7.766031
19,5,Less developed regions excluding least develop...,11773616,9084039.0,9783059.0,8551722.0,11365954.0,14179668.0,19.919743,15.999082,16.51313,13.305391,14.363526,15.537313,-4.3836,0.632489,-4.319731,1.530456,1.571047


##Problem: There are multiple types of data stored in table 6 
Tidy data principle 4: Multiple types of observational units should not be stored in the same table. 
Let's split Table 6 into 3 separate tables. 
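
The three splits follow the same recipe: select the columns that share a prefix, melt them, and keep the rest of each label as the year. A minimal sketch of that recipe as a reusable function; `split_block` is a hypothetical helper name, and the toy values are rounded for illustration:

```python
import pandas as pd

def split_block(df, prefix, value_name):
    """Melt the columns starting with `prefix`; the rest of each label becomes Year."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    long = df[["ID", "Area"] + cols].melt(
        id_vars=["ID", "Area"], var_name="Year", value_name=value_name)
    # Remove the prefix and surrounding whitespace, leaving "1990" or "1990-1995"
    long["Year"] = long["Year"].str[len(prefix):].str.strip()
    return long.sort_values(["ID", "Year"]).set_index(["ID", "Area", "Year"])

# Toy wide table with one column group per observational unit
toy = pd.DataFrame({
    "ID": [1, 2],
    "Area": ["WORLD", "Developed regions"],
    "Both sexes 1990": [18836571, 2014564],
    "Both sexes 1995": [17853840, 3609670],
    "Percentage 1990": [12.35, 2.45],
    "Percentage 1995": [11.10, 3.91],
    "Refugee ROC 1990-1995": [-2.12, 9.39],
})

stock = split_block(toy, "Both sexes", "Refugee Stock (Both Sexes)")
share = split_block(toy, "Percentage", "Refugee Percentage")
roc = split_block(toy, "Refugee ROC", "Refugee ROC")
```

Note that stripping the prefix keeps the full period label (for example "1990-1995") for the rate-of-change block, whereas slicing the last four characters keeps only the end year; which one is preferable depends on how the tables will be joined later.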

### Estimated refugee stock at mid-year (both sexes) as a separate dataset

In [261]:
#create dataframe with related columns
est_rs = ers_df2[['ID','Area','Both sexes 1990','Both sexes 1995',
                 'Both sexes 2000', 'Both sexes 2005','Both sexes 2010','Both sexes 2015']]

#year values should not be column headers, so unpivot them into rows 
est_rs = pd.melt(est_rs, id_vars =["ID", "Area"] , var_name = "SexYear", value_name = "Refugee Stock (Both Sexes)")
est_rs

Unnamed: 0,ID,Area,SexYear,Refugee Stock (Both Sexes)
0,1,WORLD,Both sexes 1990,18836571
1,2,Developed regions,Both sexes 1990,2014564
2,3,Developing regions,Both sexes 1990,16822007
3,4,Least developed countries,Both sexes 1990,5048391
4,5,Less developed regions excluding least develop...,Both sexes 1990,11773616
...,...,...,...,...
1585,261,Samoa,Both sexes 2015,0.0
1586,262,Tokelau,Both sexes 2015,0.0
1587,263,Tonga,Both sexes 2015,0.0
1588,264,Tuvalu,Both sexes 2015,0.0


In [262]:
#split SexYear into separate Sex and Year columns 
est_rs=(est_rs.assign(Sex = lambda x: x.SexYear.str[:-4].astype(str), Year = lambda x: x.SexYear.str[-4:].astype(str)).drop("SexYear",axis=1))

#we don't need the "sex" column
est_rs=est_rs.drop(['Sex'], axis=1)   

#Order dataset based on ID and Year 
est_rs= est_rs.sort_values(by =['ID', 'Year'] )

#Use ID, Area, and Year as the index
est_rs= est_rs.set_index(['ID','Area','Year'])

est_rs

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Refugee Stock (Both Sexes)
ID,Area,Year,Unnamed: 3_level_1
1,WORLD,1990,18836571
1,WORLD,1995,17853840.0
1,WORLD,2000,15827803.0
1,WORLD,2005,13276733.0
1,WORLD,2010,15370755.0
...,...,...,...
265,Wallis and Futuna Islands,1995,0.0
265,Wallis and Futuna Islands,2000,0.0
265,Wallis and Futuna Islands,2005,0.0
265,Wallis and Futuna Islands,2010,0.0


In [263]:
est_rs.dtypes

Refugee Stock (Both Sexes)    object
dtype: object

In [264]:
est_rs=est_rs.astype('float')
est_rs.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 1 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Refugee Stock (Both Sexes)  1579 non-null   float64
dtypes: float64(1)
memory usage: 40.9+ KB


###Refugees as a percentage of the international migrant stock as a separate dataset 

In [265]:
#Create a dataframe with columns for refugee as a percentage of the international migrant stock 
per_rs = ers_df2[['ID','Area','Percentage 1990', 'Percentage 1995', 'Percentage 2000', 'Percentage 2005',    
                 'Percentage 2010', 'Percentage 2015']]

#column headers "year" are values, not variable names. This violates principle 1. Unpivot to move year columns into rows. 
per_rs = pd.melt(per_rs, id_vars =["ID", "Area"] , var_name = "PYear", value_name = "Refugee Percentage")

#Column has 2 variables - percentage and year, let's split that 
per_rs=(per_rs.assign(Percentage = lambda x: x.PYear.str[:-4].astype(str), Year = lambda x: x.PYear.str[-4:].astype(str)).drop("PYear",axis=1))

#we don't need the "percentage" column because the whole observational unit refers to this. 
per_rs=per_rs.drop(['Percentage'], axis=1)  

#Order dataset based on ID and Year 
per_rs= per_rs.sort_values(by =['ID', 'Year'] )

#Use ID, Area, and Year as the index
per_rs= per_rs.set_index(['ID','Area','Year'])

per_rs

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Refugee Percentage
ID,Area,Year,Unnamed: 3_level_1
1,WORLD,1990,12.346732
1,WORLD,1995,11.103013
1,WORLD,2000,9.164736
1,WORLD,2005,6.941389
1,WORLD,2010,6.932687
...,...,...,...
265,Wallis and Futuna Islands,1995,0.0
265,Wallis and Futuna Islands,2000,0.0
265,Wallis and Futuna Islands,2005,0.0
265,Wallis and Futuna Islands,2010,0.0


In [266]:
#problem: values appear both as 0 and as 0.0, which suggests mixed types in the column; check the dtype 
per_rs.dtypes

Refugee Percentage    object
dtype: object

In [267]:
#Refugee Percentage values should be floats, not objects 
per_rs=per_rs.astype('float')
per_rs.info()


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1590 entries, (1, 'WORLD', '1990') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Refugee Percentage  1575 non-null   float64
dtypes: float64(1)
memory usage: 40.9+ KB


### Annual rate of change of the refugee stock as a separate dataset 

In [268]:
#Create data frame for this observational unit
roc_rs = ers_df2[['ID','Area','Refugee ROC 1990-1995', 'Refugee ROC 1995-2000', 'Refugee ROC 2000-2005',  
                 'Refugee ROC 2005-2010', 'Refugee ROC 2010-2015']]

#column headers "year" are values, not variable names. This violates principle 1. Unpivot to move year columns into rows. 
roc_rs = pd.melt(roc_rs, id_vars =["ID", "Area"] , var_name = "rocYear", value_name = "Refugee ROC")

#this violates principle 2: the column contains more than one variable (the rate-of-change label and the year), so we split it into two separate columns 
roc_rs=(roc_rs.assign(placeholder = lambda x: x.rocYear.str[:-4].astype(str), Year = lambda x: x.rocYear.str[-4:].astype(str)).drop("rocYear",axis=1))

#The placeholder column only repeats the rate-of-change label, which applies to the whole observational unit, so we drop it 
roc_rs=roc_rs.drop(['placeholder'], axis=1)  

#Order dataset based on ID and Year 
roc_rs= roc_rs.sort_values(by =['ID', 'Year'] )

#Use ID, Area, and Year as the index
roc_rs= roc_rs.set_index(['ID','Area','Year'])

roc_rs

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Refugee ROC
ID,Area,Year,Unnamed: 3_level_1
1,WORLD,1995,-2.123497
1,WORLD,2000,-3.837069
1,WORLD,2005,-5.557223
1,WORLD,2010,-0.025089
1,WORLD,2015,2.947267
...,...,...,...
265,Wallis and Futuna Islands,1995,
265,Wallis and Futuna Islands,2000,
265,Wallis and Futuna Islands,2005,
265,Wallis and Futuna Islands,2010,


In [269]:
#check the datatype
roc_rs.dtypes

Refugee ROC    object
dtype: object

In [270]:
#change datatype to float 
roc_rs=roc_rs.astype('float')
roc_rs.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1325 entries, (1, 'WORLD', '1995') to (265, 'Wallis and Futuna Islands', '2015')
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Refugee ROC  890 non-null    float64
dtypes: float64(1)
memory usage: 37.5+ KB


Table 6 is now split into three datasets: estimated refugee stock (both sexes), refugees as a percentage of the international migrant stock, and annual rate of change of the refugee stock. 





