# Capstone Project: BRICS Analysis Data Preparation
### By Ryan Kidd

## Introduction
For my Capstone Project, I want to analyze how different aspects of the BRICS (Brazil, Russia, India, China, and South Africa) economies distinguished these countries as the strongest emerging markets in the twentieth century. With their low labor costs, locational advantages, and abundant natural resources, these economies have the potential to become the world's fastest-growing economies, so why haven't they? Did political instability cause economic struggle in these countries, or are there underlying reasons in the data that suggest a particular industry underperformed? Using data science techniques, I will analyze which industries contribute the most to the BRICS economies over time. Further, I want to analyze how different aspects of each BRICS economy contributed to its overall GDP and growth per year. Through machine learning techniques, I want to identify why some of these countries underperformed and were not able to meet expectations.

## Data Preparation
To analyze the BRICS datasets, I first need to prepare and clean the data. I will look into the five datasets downloaded from Kaggle, `https://www.kaggle.com/docstein/brics-world-bank-indicators`, and understand what the columns represent. For each dataset, I will check the datatypes of the columns and count the null values. Depending on how many null values there are and the proportion of the data they account for, I will decide whether to drop or impute them. Further, I plan on pivoting the datasets and merging them into a single combined dataset that will allow me to fit different machine learning models. This pivoted dataset will have the BRICS countries (by year) as rows and the different Series in each dataset as features. I will also need to create some features that will help my analysis, such as an Increase/Decrease per Year, Percent Growth, and Total GDP. To construct the data in this way, I must first read in the files. Below, I import the packages needed to prepare and clean the datasets.
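
As a rough sketch of the merge and Percent Growth steps planned above, the following uses two tiny made-up tables in place of the real pivoted datasets. It assumes both tables share a (CountryName, CountryCode, Year) index; the table names, column names, and values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical toy tables standing in for two pivoted BRICS datasets,
# both indexed by (CountryName, CountryCode, Year). All values are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("Brazil", "BRA", 1995.0),
        ("Brazil", "BRA", 1996.0),
        ("China", "CHN", 1995.0),
        ("China", "CHN", 1996.0),
    ],
    names=["CountryName", "CountryCode", "Year"],
)
toy_economy = pd.DataFrame({"GDP (current US$)": [100.0, 110.0, 200.0, 250.0]}, index=index)
toy_health = pd.DataFrame({"Life expectancy (years)": [67.0, 67.5, 70.0, 70.3]}, index=index)

# Align the tables on their shared index; an outer join keeps
# country-years that appear in only one of the tables.
combined = toy_economy.join(toy_health, how="outer")

# Year-over-year percent growth, computed separately for each country
# so that one country's first year is not compared to another's last year.
combined["Percent Growth"] = (
    combined.groupby(level="CountryName")["GDP (current US$)"].pct_change() * 100
)
print(combined)
```

Grouping by country before calling `pct_change` matters here: without it, the first China row would be compared against the last Brazil row.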

In [24]:
# Import data science packages for Data Preparation, Cleaning, and Exploratory Data Analysis (EDA)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [25]:
# Reading in all of the CSV files from Kaggle for the BRICS analysis
df_economy = pd.read_csv('data/Economy_Data.csv', sep=';', index_col=0) # sep=';' is needed because the columns in every CSV file are separated by ';'
df_education = pd.read_csv('data/Education_And_Environ_Data.csv', sep=';', index_col=0) # index_col=0 uses the first column (SeriesName) as the index instead of adding a default integer index
df_public = pd.read_csv('data/Public_Sector_Indicators.csv', sep=';', index_col=0)
df_private = pd.read_csv('data/Private_Sector_Data.csv', sep=';', index_col=0)
df_health = pd.read_csv('data/Health_And_Poverty_Data.csv', sep=';', index_col=0)
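
To illustrate what `sep=';'` and `index_col=0` do in the cell above, here is a small sketch that reads an in-memory sample instead of the real files. The sample mimics the column layout shown in the outputs of this notebook, but the non-null value in it is made up.

```python
import io

import pandas as pd

# Two made-up rows in the same semicolon-separated layout as the Kaggle files.
# The trailing ';' on the first data row leaves Value empty, which pandas reads as NaN.
sample = io.StringIO(
    "SeriesName;SeriesCode;CountryName;CountryCode;Year;Value\n"
    "Adjusted net national income (annual % growth);NY.ADJ.NNTY.KD.ZG;Brazil;BRA;1970;\n"
    "Adjusted net national income (annual % growth);NY.ADJ.NNTY.KD.ZG;India;IND;1970;5.2\n"
)

# index_col=0 turns the first column (SeriesName) into the row index.
toy = pd.read_csv(sample, sep=";", index_col=0)
print(toy.index.name)
print(list(toy.columns))
print(toy["Value"].isna().sum())
```

Because SeriesName ends up in the index rather than in a regular column, any later operation that only looks at column values will not see it.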

In [26]:
# Created new CSV files with comma-separated columns by exporting each pandas DataFrame to CSV
# df_economy.to_csv(r'C:\Users\Jim\Desktop\Economy_Fixed.csv', index = False)
# df_education.to_csv(r'C:\Users\Jim\Desktop\Education_Fixed.csv', index = False)
# df_public.to_csv(r'C:\Users\Jim\Desktop\Public_Fixed.csv', index = False)
# df_private.to_csv(r'C:\Users\Jim\Desktop\Private_Fixed.csv', index = False)
# df_health.to_csv(r'C:\Users\Jim\Desktop\Health_Fixed.csv', index = False)

In [27]:
# Taking a look into the Economy_Data.csv
df_economy.head(10) #Looking at the first 10 rows

SeriesName,SeriesCode,CountryName,CountryCode,Year,Value
Adjusted net national income (annual % growth),NY.ADJ.NNTY.KD.ZG,Brazil,BRA,1970.0,
Adjusted net national income (annual % growth),NY.ADJ.NNTY.KD.ZG,China,CHN,1970.0,
Adjusted net national income (annual % growth),NY.ADJ.NNTY.KD.ZG,India,IND,1970.0,
Adjusted net national income (annual % growth),NY.ADJ.NNTY.KD.ZG,Russian Federation,RUS,1970.0,
Adjusted net national income (annual % growth),NY.ADJ.NNTY.KD.ZG,South Africa,ZAF,1970.0,
Adjusted net national income (constant 2010 US$),NY.ADJ.NNTY.KD,Brazil,BRA,1970.0,391897400000.0
Adjusted net national income (constant 2010 US$),NY.ADJ.NNTY.KD,China,CHN,1970.0,
Adjusted net national income (constant 2010 US$),NY.ADJ.NNTY.KD,India,IND,1970.0,191533500000.0
Adjusted net national income (constant 2010 US$),NY.ADJ.NNTY.KD,Russian Federation,RUS,1970.0,
Adjusted net national income (constant 2010 US$),NY.ADJ.NNTY.KD,South Africa,ZAF,1970.0,


In [5]:
# Looking at the shape of the df_economy dataframe and datatypes of the columns/features
print(df_economy.shape)
df_economy.dtypes

(86500, 5)


SeriesCode      object
CountryName     object
CountryCode     object
Year           float64
Value          float64
dtype: object

From this initial peek into the dataset, I can see that some things need to change before I can fit machine learning and prediction models. First, I must drop the SeriesCode column, as it provides no statistical information for the machine learning process. With only four features left in the dataset after the drop, I plan on pivoting the table so that each unique SeriesName becomes a feature, with one row per Country and Year. In theory, this should make the dataset more suitable for machine learning models. Further, three of the current features are non-numeric (SeriesCode, CountryName, and CountryCode), which will have to be addressed in the cleaning. Before pivoting the table, I want to understand the data better by checking for null values and duplicated rows in each dataset.
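
The planned pivot can be sketched on a toy long-format table. The series names and values below are made up; the point is to show how each SeriesName becomes a column, and that `pd.pivot_table` by default (`dropna=True`) leaves out columns whose entries are all NaN.

```python
import numpy as np
import pandas as pd

# Made-up long-format rows: one row per (series, country, year),
# with SeriesName in the index as in the dataframes of this notebook.
long_df = pd.DataFrame(
    {
        "CountryName": ["Brazil", "Brazil", "India", "India"],
        "Year": [1970.0, 1970.0, 1970.0, 1970.0],
        "Value": [1.0, np.nan, 2.0, np.nan],
    },
    index=pd.Index(["Series A", "Series B", "Series A", "Series B"], name="SeriesName"),
)

# Default behaviour: "Series B" has no values at all, so it is not kept as a column.
wide = pd.pivot_table(long_df, values="Value", index=["CountryName", "Year"], columns="SeriesName")
print(wide.columns.tolist())

# dropna=False keeps the all-NaN column, which makes empty series visible.
wide_all = pd.pivot_table(
    long_df, values="Value", index=["CountryName", "Year"], columns="SeriesName", dropna=False
)
print(wide_all.columns.tolist())
```

This default may be one reason a pivoted table ends up with fewer feature columns than there are series in the long-format data.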

In [4]:
df_economy = df_economy.drop(['SeriesCode'], axis=1)
df_economy.head(10)

SeriesName,CountryName,CountryCode,Year,Value
Adjusted net national income (annual % growth),Brazil,BRA,1970.0,
Adjusted net national income (annual % growth),China,CHN,1970.0,
Adjusted net national income (annual % growth),India,IND,1970.0,
Adjusted net national income (annual % growth),Russian Federation,RUS,1970.0,
Adjusted net national income (annual % growth),South Africa,ZAF,1970.0,
Adjusted net national income (constant 2010 US$),Brazil,BRA,1970.0,391897400000.0
Adjusted net national income (constant 2010 US$),China,CHN,1970.0,
Adjusted net national income (constant 2010 US$),India,IND,1970.0,191533500000.0
Adjusted net national income (constant 2010 US$),Russian Federation,RUS,1970.0,
Adjusted net national income (constant 2010 US$),South Africa,ZAF,1970.0,


In [25]:
# Checking the number of null values within the Economy dataset
df_economy.isna().sum()

CountryName      250
CountryCode      250
Year               0
Value          26152
dtype: int64

In [26]:
# Returns the percentage of rows which have NaNs for each column.
df_economy.isna().sum()/len(df_economy)*100.0 

CountryName     0.289017
CountryCode     0.289017
Year            0.000000
Value          30.233526
dtype: float64

In [31]:
# Checking to see the number of duplicated rows in the Economy dataset
df_economy.duplicated().sum()

27793

In [32]:
# Returns the percentage of rows that were duplicated in the Economy dataset.
df_economy.duplicated().sum()/len(df_economy)*100.0

32.13063583815029

We can see that there is a high proportion of null values and duplicated rows in the dataset. Prior to downloading this data, I noticed that the Kaggle data was originally pulled from The World Bank's DataBank. DataBank contains many databases that cover different development indicators, growing industries, and statistical data related to population dynamics. Exploring these databases, I realized that for many of the indicators, the time period selected had a major impact on how many values were present in the data. Considering this, I decided that my analysis of the BRICS economies will focus on the period from 1995 to 2018, as the database has not been updated for 2019 yet. I believe The World Bank does not release current data, and the datasets have many missing values, because some countries are unable to maintain a flow of survey data every year. The World Bank explains why some values are NaN on this webpage: `https://datahelpdesk.worldbank.org/knowledgebase/articles/191133-why-are-some-data-not-available`. Before removing the NaN values and duplicated rows, I will pivot the table in order to see how the number of rows and features changes.
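
Restricting the data to the 1995 to 2018 window could be done along these lines. This is a minimal sketch on made-up rows, assuming Year is stored in a regular column as it is in the long-format dataframes.

```python
import pandas as pd

# Made-up rows; only the Year column matters for this illustration.
toy = pd.DataFrame(
    {"Year": [1970.0, 1995.0, 2018.0, 2019.0], "Value": [1.0, 2.0, 3.0, 4.0]}
)

# Series.between is inclusive on both ends by default,
# so both 1995 and 2018 are kept.
recent = toy[toy["Year"].between(1995, 2018)]
print(recent["Year"].tolist())
```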

In [8]:
df_economy_reshape = pd.pivot_table(df_economy, values = 'Value', index = ['CountryName','CountryCode','Year'], columns = 'SeriesName')
print(f'The size of the reshaped df_economy: {df_economy_reshape.shape}')
df_economy_reshape.head(10)

The size of the reshaped df_economy: (250, 332)


CountryName,CountryCode,Year,Adjusted net national income (annual % growth),Adjusted net national income (constant 2010 US$),Adjusted net national income (current US$),Adjusted net national income per capita (annual % growth),Adjusted net national income per capita (constant 2010 US$),Adjusted net national income per capita (current US$),"Adjusted net savings, excluding particulate emission damage (% of GNI)","Adjusted net savings, excluding particulate emission damage (current US$)","Adjusted net savings, including particulate emission damage (% of GNI)","Adjusted net savings, including particulate emission damage (current US$)",...,"Total reserves (includes gold, current US$)",Total reserves in months of imports,Total reserves minus gold (current US$),Trade (% of GDP),Trade in services (% of GDP),"Transport services (% of service exports, BoP)","Transport services (% of service imports, BoP)","Travel services (% of service exports, BoP)","Travel services (% of service imports, BoP)","Use of IMF credit (DOD, current US$)"
Brazil,BRA,1970.0,,391897400000.0,37860210000.0,,4120.323413,398.053948,,,,,...,1189905000.0,,1141660000.0,14.479195,,,,,,0.0
Brazil,BRA,1971.0,11.642225,437523000000.0,44146200000.0,8.92838,4488.201546,452.860827,,,,,...,1753865000.0,,1696186000.0,14.55128,,,,,,0.0
Brazil,BRA,1972.0,12.379004,491684000000.0,52523400000.0,9.704598,4923.763456,525.973644,,,,,...,4218806000.0,,4132748000.0,16.103251,,,,,,0.0
Brazil,BRA,1973.0,12.21461,551741300000.0,70588200000.0,9.580841,5395.501408,690.284997,,,,,...,6508867000.0,,6359911000.0,17.773259,,,,,,0.0
Brazil,BRA,1974.0,5.160031,580211300000.0,93417050000.0,2.702725,5541.326994,892.182583,,,,,...,5463238000.0,,5215753000.0,21.896847,,,,,,0.0
Brazil,BRA,1975.0,2.92805,597200200000.0,108683800000.0,0.518431,5570.054972,1013.688084,14.1813,17303550000.0,,,...,4166486000.0,2.949725,3980375000.0,19.044153,2.880946,44.811321,56.829073,6.698113,16.813099,0.0
Brazil,BRA,1976.0,13.847055,679894800000.0,133601800000.0,11.177201,6192.631217,1216.874855,10.980208,16509610000.0,,,...,6666854000.0,4.48418,6488041000.0,16.468331,2.371658,48.330059,56.127545,5.500982,13.830196,0.0
Brazil,BRA,1977.0,6.897196,726788500000.0,154413900000.0,4.392284,6464.629179,1373.47851,11.946388,20697530000.0,,,...,7441921000.0,4.795567,7192022000.0,15.170116,2.270518,44.821872,54.171142,4.556752,8.199069,0.0
Brazil,BRA,1978.0,1.727041,739340400000.0,174350000000.0,-0.655074,6422.281063,1514.491267,11.763013,23065010000.0,,,...,12190030000.0,6.773181,11826400000.0,14.539973,2.203177,41.62963,51.366298,5.037037,8.26285,0.0
Brazil,BRA,1979.0,6.489032,787316400000.0,193614100000.0,3.99818,6679.055398,1642.490014,9.819068,21498380000.0,,,...,9838705000.0,4.143631,8966257000.0,16.299473,2.344318,47.186441,55.146091,5.084746,8.133719,0.0


This reshaped dataframe enables an analysis based on the many features present in the Economic Indicator data for each BRICS country over the years. From the shape of the dataset, I noticed that since we are analyzing 5 countries, there are 50 years of data per country in the dataframe above, starting in 1970 (250 / 5 = 50). Reducing the number of years will significantly reduce the number of rows in our dataset, while merging the other four datasets will only add more features to the combined dataframe. Next, I will apply the same steps I used for the df_economy dataframe to the other 4 dataframes.
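
The 250 / 5 = 50 arithmetic assumes that every country contributes the same number of year-rows. One way to check that assumption is to count rows per country on the pivoted index, sketched here on a made-up table in which one country is deliberately missing a year.

```python
import pandas as pd

# Hypothetical pivoted frame with a (CountryName, Year) index;
# India is missing 1971 on purpose.
index = pd.MultiIndex.from_tuples(
    [("Brazil", 1970.0), ("Brazil", 1971.0), ("India", 1970.0)],
    names=["CountryName", "Year"],
)
toy_wide = pd.DataFrame({"Some indicator": [1.0, 2.0, 3.0]}, index=index)

# Number of year-rows available for each country.
years_per_country = toy_wide.groupby(level="CountryName").size()
print(years_per_country.to_dict())
```

If the counts differ between countries, the total number of rows will not be a clean multiple of the number of years.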

In [28]:
# Creating a CSV file for the cleaning section
df_economy_reshape.to_csv(r'C:\Users\Jim\Desktop\Economy_Rotated.csv')

In [21]:
# Taking a look into the Education_And_Environ_Data.csv
df_education.head(10)

SeriesName,SeriesCode,CountryName,CountryCode,Year,Value
Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,Brazil,BRA,1970.0,
Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,China,CHN,1970.0,
Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,India,IND,1970.0,
Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,Russian Federation,RUS,1970.0,
Access to clean fuels and technologies for cooking (% of population),EG.CFT.ACCS.ZS,South Africa,ZAF,1970.0,
Access to electricity (% of population),EG.ELC.ACCS.ZS,Brazil,BRA,1970.0,
Access to electricity (% of population),EG.ELC.ACCS.ZS,China,CHN,1970.0,
Access to electricity (% of population),EG.ELC.ACCS.ZS,India,IND,1970.0,
Access to electricity (% of population),EG.ELC.ACCS.ZS,Russian Federation,RUS,1970.0,
Access to electricity (% of population),EG.ELC.ACCS.ZS,South Africa,ZAF,1970.0,


In [13]:
# Looking at the shape of the df_education dataframe and datatypes of the columns/features
print(df_education.shape)
df_education.dtypes

(71500, 5)


SeriesCode      object
CountryName     object
CountryCode     object
Year           float64
Value          float64
dtype: object

We can see a similar pattern to the df_economy dataframe. The datatypes are the same, and the dataset is smaller by 15,000 rows (71,500 versus 86,500). Again, I will remove the SeriesCode column, as I have decided that it will not be useful in my analysis and it adds a non-numeric feature to the dataframes. After removing the SeriesCode column, I will check for null values and duplicated rows.

In [9]:
df_education = df_education.drop(['SeriesCode'], axis=1)
df_education.head(10)

SeriesName,CountryName,CountryCode,Year,Value
Access to clean fuels and technologies for cooking (% of population),Brazil,BRA,1970.0,
Access to clean fuels and technologies for cooking (% of population),China,CHN,1970.0,
Access to clean fuels and technologies for cooking (% of population),India,IND,1970.0,
Access to clean fuels and technologies for cooking (% of population),Russian Federation,RUS,1970.0,
Access to clean fuels and technologies for cooking (% of population),South Africa,ZAF,1970.0,
Access to electricity (% of population),Brazil,BRA,1970.0,
Access to electricity (% of population),China,CHN,1970.0,
Access to electricity (% of population),India,IND,1970.0,
Access to electricity (% of population),Russian Federation,RUS,1970.0,
Access to electricity (% of population),South Africa,ZAF,1970.0,


In [41]:
# Checking the number of null values within the Education dataset
df_education.isna().sum()

CountryName      250
CountryCode      250
Year               0
Value          41640
dtype: int64

In [42]:
# Returns the percentage of rows which have NaNs for each column.
df_education.isna().sum()/len(df_education)*100.0 

CountryName     0.349650
CountryCode     0.349650
Year            0.000000
Value          58.237762
dtype: float64

In [43]:
# Checking to see the number of duplicated rows in the Education dataset
df_education.duplicated().sum()

42085

In [44]:
# Returns the percentage of rows that were duplicated in the Education dataset.
df_education.duplicated().sum()/len(df_education)*100.0

58.86013986013986

For the Education dataset, we can see that CountryName and CountryCode have the same number of missing values as in the Economy dataset. Further, there is a higher proportion of missing values and duplicated rows in this dataset. To address the large proportion of missing data, I will have to restrict the data to the time period discussed above when I reach the cleaning section. For now, I will prepare the data by pivoting it into the same format as the first pivot, `df_economy_reshape`.

In [10]:
df_education_reshape = pd.pivot_table(df_education, values = 'Value', index = ['CountryName','CountryCode','Year'], columns = 'SeriesName')
print(f'The size of the reshaped df_education: {df_education_reshape.shape}')
df_education_reshape.head(10)

The size of the reshaped df_education: (250, 285)


CountryName,CountryCode,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)","Adjusted net enrollment rate, primary (% of primary school age children)","Adjusted net enrollment rate, primary, female (% of primary school age children)","Adjusted net enrollment rate, primary, male (% of primary school age children)",Adolescents out of school (% of lower secondary school age),"Adolescents out of school, female (% of female lower secondary school age)","Adolescents out of school, male (% of male lower secondary school age)",...,"Trained teachers in upper secondary education, female (% of female teachers)","Trained teachers in upper secondary education, male (% of male teachers)",Urban land area (sq. km),Urban land area where elevation is below 5 meters (% of total land area),Urban land area where elevation is below 5 meters (sq. km),Urban population,Urban population (% of total population),Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Water productivity, total (constant 2010 US$ GDP per cubic meter of total freshwater withdrawal)"
Brazil,BRA,1970.0,,,,,,,,,,,...,,,,,,53176875.0,55.909,4.268093,,
Brazil,BRA,1971.0,,,,,,,,,,,...,,,,,,55461933.0,56.894,4.207327,,
Brazil,BRA,1972.0,,,,,,,,,,,...,,,,,,57797612.0,57.879,4.125057,,
Brazil,BRA,1973.0,,,,,,,,,,,...,,,,,,60184827.0,58.855,4.047282,,
Brazil,BRA,1974.0,,,,,,,,,,,...,,,,,,62641530.0,59.826,4.00082,,
Brazil,BRA,1975.0,,,,,,,,,,,...,,,,,,65175659.0,60.789,3.965759,,
Brazil,BRA,1976.0,,,,,,,,,,,...,,,,,,67790415.0,61.745,3.933474,,
Brazil,BRA,1977.0,,,,,,,,,,,...,,,,,,70478354.0,62.689,3.888481,,
Brazil,BRA,1978.0,,,,,,,,,,,...,,,,,,73245834.0,63.625,3.851575,,
Brazil,BRA,1979.0,,,,,,,,,,,...,,,,,,76091693.0,64.551,3.811773,,


For this dataset, we can see that some features are NaN for every row shown, while others have no missing values. When cleaning the data and choosing the appropriate columns for the analysis, I will drop the columns that do not have values and keep the features that have enough data. I believe that many of these NaN values exist because the underlying surveys did not start collecting some of the data until a certain year. I will have to find the year in which The World Bank started collecting each series. Now, I proceed to the Public dataset, df_public.
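
Finding the first year with data for each feature, and dropping features that are too sparse, might look like the sketch below. The table is made up and holds a single country indexed by Year; the 50% threshold is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table for one country, indexed by Year.
toy = pd.DataFrame(
    {
        "Indicator A": [np.nan, np.nan, 3.0, 4.0],
        "Indicator B": [1.0, 2.0, 3.0, 4.0],
        "Indicator C": [np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index([1993, 1994, 1995, 1996], name="Year"),
)

# First year with a non-null value for each feature
# (missing when a feature has no values at all).
first_year = toy.apply(lambda col: col.first_valid_index())
print(first_year)

# Keep only features with at least 50% non-null values (arbitrary threshold).
min_non_null = int(0.5 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```

On the real pivoted dataframes the same idea would need to be applied per country, since the index there also contains CountryName and CountryCode.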

In [29]:
# Creating a CSV file for the cleaning section
df_education_reshape.to_csv(r'C:\Users\Jim\Desktop\Education_Rotated.csv')

In [22]:
# Taking a look into the Public_Sector_Indicators.csv
df_public.head(10)

SeriesName,SeriesCode,CountryName,CountryCode,Year,Value
Adequacy of social insurance programs (% of total welfare of beneficiary households),per_si_allsi.adq_pop_tot,Brazil,BRA,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),per_si_allsi.adq_pop_tot,China,CHN,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),per_si_allsi.adq_pop_tot,India,IND,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),per_si_allsi.adq_pop_tot,Russian Federation,RUS,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),per_si_allsi.adq_pop_tot,South Africa,ZAF,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),per_allsp.adq_pop_tot,Brazil,BRA,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),per_allsp.adq_pop_tot,China,CHN,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),per_allsp.adq_pop_tot,India,IND,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),per_allsp.adq_pop_tot,Russian Federation,RUS,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),per_allsp.adq_pop_tot,South Africa,ZAF,1970.0,


In [15]:
# Looking at the shape of the df_public dataframe and datatypes of the columns/features
print(df_public.shape)
df_public.dtypes

(81500, 5)


SeriesCode      object
CountryName     object
CountryCode     object
Year           float64
Value          float64
dtype: object

Again, I will remove the SeriesCode column and check for null values and duplicated rows below.
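
Since the same null-value checks are repeated for every dataset, they could be wrapped in a small helper. The function name `summarize_quality` is hypothetical and the data is made up; this is only one way to organize the checks.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null percentage for each column of df."""
    null_count = df.isna().sum()
    return pd.DataFrame(
        {"null_count": null_count, "null_pct": null_count / len(df) * 100.0}
    )


# Made-up rows for illustration.
toy = pd.DataFrame(
    {"Year": [1970.0, 1971.0, 1972.0, 1973.0], "Value": [np.nan, 1.0, np.nan, np.nan]}
)
summary = summarize_quality(toy)
print(summary)

# Duplicated-row count, as checked for each dataset in this notebook.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```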

In [12]:
df_public = df_public.drop(['SeriesCode'], axis=1)
df_public.head(10)

SeriesName,CountryName,CountryCode,Year,Value
Adequacy of social insurance programs (% of total welfare of beneficiary households),Brazil,BRA,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),China,CHN,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),India,IND,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),Russian Federation,RUS,1970.0,
Adequacy of social insurance programs (% of total welfare of beneficiary households),South Africa,ZAF,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),Brazil,BRA,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),China,CHN,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),India,IND,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),Russian Federation,RUS,1970.0,
Adequacy of social protection and labor programs (% of total welfare of beneficiary households),South Africa,ZAF,1970.0,


In [47]:
# Checking the number of null values within the Public dataset
df_public.isna().sum()

CountryName      250
CountryCode      250
Year               0
Value          54662
dtype: int64

In [48]:
# Returns the percentage of rows which have NaNs for each column.
df_public.isna().sum()/len(df_public)*100.0 

CountryName     0.306748
CountryCode     0.306748
Year            0.000000
Value          67.069939
dtype: float64

In [49]:
# Checking to see the number of duplicated rows in the Public dataset
df_public.duplicated().sum()

54816

In [50]:
# Returns the percentage of rows that were duplicated in the Public dataset.
df_public.duplicated().sum()/len(df_public)*100.0

67.25889570552147

The same reasoning applies to the Public dataset: the data was most likely formatted to span the full 50-year time series, even though some of the indicators were not collected for that entire period. I will need to pay more attention to the time period here, as the proportion of NaN values and duplicated rows is higher. As an additional note, I believe that some of these duplicated rows are a product of the rows with NaN values. This will need to be verified in the cleaning section, though. Now, I will pivot the table into the format used above.
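
One possible mechanism behind that belief can be shown on made-up rows: `DataFrame.duplicated()` compares column values only and ignores the index, and it treats NaN values as equal to each other. Because SeriesName sits in the index here, two different series that are both missing for the same country and year look like duplicates. This sketch illustrates the mechanism but does not prove that it explains the counts in the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows: two different series that are both missing for Brazil in 1970.
toy = pd.DataFrame(
    {
        "CountryName": ["Brazil", "Brazil"],
        "Year": [1970.0, 1970.0],
        "Value": [np.nan, np.nan],
    },
    index=pd.Index(["Series A", "Series B"], name="SeriesName"),
)

# The index (SeriesName) is ignored, so the second row counts as a duplicate.
dupes_ignoring_index = int(toy.duplicated().sum())
print(dupes_ignoring_index)

# Moving SeriesName into a regular column makes the two rows distinct.
dupes_with_series = int(toy.reset_index().duplicated().sum())
print(dupes_with_series)
```

Running the second check on the real dataframes would show how many rows are duplicated once SeriesName is taken into account.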

In [13]:
df_public_reshape = pd.pivot_table(df_public, values = 'Value', index = ['CountryName','CountryCode','Year'], columns = 'SeriesName')
print(f'The size of the reshaped df_public: {df_public_reshape.shape}')
df_public_reshape.head(10)

The size of the reshaped df_public: (250, 316)


CountryName,CountryCode,Year,Adequacy of social insurance programs (% of total welfare of beneficiary households),Adequacy of social protection and labor programs (% of total welfare of beneficiary households),Adequacy of social safety net programs (% of total welfare of beneficiary households),Adequacy of unemployment benefits and ALMP (% of total welfare of beneficiary households),"Air transport, freight (million ton-km)","Air transport, passengers carried","Air transport, registered carrier departures worldwide","Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)",Armed forces personnel (% of total labor force),...,"Unemployment, youth male (% of male labor force ages 15-24) (modeled ILO estimate)","Unemployment, youth male (% of male labor force ages 15-24) (national estimate)","Unemployment, youth total (% of total labor force ages 15-24) (modeled ILO estimate)","Unemployment, youth total (% of total labor force ages 15-24) (national estimate)","Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)"
Brazil,BRA,1970.0,,,,,164.100006,3339800.0,137700.0,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,,,,,180.199997,3911000.0,149300.0,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,,,,,254.300003,4671400.0,156700.0,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,,,,,316.700012,5842400.0,173200.0,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,,,,,407.100006,6855500.0,204800.0,,,,...,,,,,,,,,,
Brazil,BRA,1975.0,,,,,460.5,7772900.0,209100.0,,,,...,,,,,,,,,,
Brazil,BRA,1976.0,,,,,471.600006,8799000.0,206100.0,,,,...,,,,,,,,,,
Brazil,BRA,1977.0,,,,,499.299988,9514400.0,194900.0,,,,...,,,,,,,,,,
Brazil,BRA,1978.0,,,,,571.299988,10621300.0,193000.0,,,,...,,,,,,,,,,
Brazil,BRA,1979.0,,,,,570.5,11856900.0,211100.0,,,,...,,,,,,,,,,


We can see a similar pattern to the df_education_reshape dataframe: some features have data going back to 1970, while many do not. Next, I will prepare the Private dataset, df_private.

In [30]:
# Creating a CSV file for the cleaning section
df_public_reshape.to_csv(r'C:\Users\Jim\Desktop\Public_Rotated.csv')

In [23]:
# Taking a look into the Private_Sector_Data.csv
df_private.head(10)

SeriesName,SeriesCode,CountryName,CountryCode,Year,Value
Agricultural raw materials exports (% of merchandise exports),TX.VAL.AGRI.ZS.UN,Brazil,BRA,1970.0,11.895098
Agricultural raw materials exports (% of merchandise exports),TX.VAL.AGRI.ZS.UN,China,CHN,1970.0,
Agricultural raw materials exports (% of merchandise exports),TX.VAL.AGRI.ZS.UN,India,IND,1970.0,5.556494
Agricultural raw materials exports (% of merchandise exports),TX.VAL.AGRI.ZS.UN,Russian Federation,RUS,1970.0,
Agricultural raw materials exports (% of merchandise exports),TX.VAL.AGRI.ZS.UN,South Africa,ZAF,1970.0,
Agricultural raw materials imports (% of merchandise imports),TM.VAL.AGRI.ZS.UN,Brazil,BRA,1970.0,1.851044
Agricultural raw materials imports (% of merchandise imports),TM.VAL.AGRI.ZS.UN,China,CHN,1970.0,
Agricultural raw materials imports (% of merchandise imports),TM.VAL.AGRI.ZS.UN,India,IND,1970.0,9.207277
Agricultural raw materials imports (% of merchandise imports),TM.VAL.AGRI.ZS.UN,Russian Federation,RUS,1970.0,
Agricultural raw materials imports (% of merchandise imports),TM.VAL.AGRI.ZS.UN,South Africa,ZAF,1970.0,


In [19]:
# Looking at the shape of the df_private dataframe and datatypes of the columns/features
print(df_private.shape)
df_private.dtypes

(42250, 5)


SeriesCode      object
CountryName     object
CountryCode     object
Year           float64
Value          float64
dtype: object

The first noticeable attribute of this dataframe is that it contains fewer rows than the other dataframes. This tells me that the Private dataset will have fewer features after pivoting. I will remove the SeriesCode column next.
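
The expected number of features can be estimated before pivoting by counting the distinct series names in the index. The sketch below does this on a made-up table; on the real data the count is only an upper bound, since series with no values at all may not survive the pivot.

```python
import pandas as pd

# Made-up long-format rows with SeriesName in the index.
toy = pd.DataFrame(
    {
        "CountryName": ["Brazil", "India", "Brazil", "India"],
        "Value": [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.Index(["Series A", "Series A", "Series B", "Series B"], name="SeriesName"),
)

# Each distinct series name is a candidate feature column after pivoting.
n_series = toy.index.nunique()
print(n_series)
```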

In [14]:
df_private = df_private.drop(['SeriesCode'], axis=1)
df_private.head(10)

SeriesName,CountryName,CountryCode,Year,Value
Agricultural raw materials exports (% of merchandise exports),Brazil,BRA,1970.0,11.895098
Agricultural raw materials exports (% of merchandise exports),China,CHN,1970.0,
Agricultural raw materials exports (% of merchandise exports),India,IND,1970.0,5.556494
Agricultural raw materials exports (% of merchandise exports),Russian Federation,RUS,1970.0,
Agricultural raw materials exports (% of merchandise exports),South Africa,ZAF,1970.0,
Agricultural raw materials imports (% of merchandise imports),Brazil,BRA,1970.0,1.851044
Agricultural raw materials imports (% of merchandise imports),China,CHN,1970.0,
Agricultural raw materials imports (% of merchandise imports),India,IND,1970.0,9.207277
Agricultural raw materials imports (% of merchandise imports),Russian Federation,RUS,1970.0,
Agricultural raw materials imports (% of merchandise imports),South Africa,ZAF,1970.0,


In [53]:
# Checking the number of null values within the Private dataset
df_private.isna().sum()

CountryName      250
CountryCode      250
Year               0
Value          24830
dtype: int64

In [54]:
# Returns the percentage of rows which have NaNs for each column.
df_private.isna().sum()/len(df_private)*100.0 

CountryName     0.591716
CountryCode     0.591716
Year            0.000000
Value          58.769231
dtype: float64

In [55]:
# Checking to see the number of duplicated rows in the Private dataset
df_private.duplicated().sum()

25907

In [56]:
# Returns the percentage of rows that were duplicated in the Private dataset.
df_private.duplicated().sum()/len(df_private)*100.0

61.318343195266266

Again, we see a high proportion of NaN values and duplicated rows within the Private dataset. I will now pivot the dataframe into the specified format.
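One likely reason the duplicate count tracks the NaN count so closely is that `DataFrame.duplicated()` compares column values only and ignores the index. If `SeriesName` is the index, as the displayed tables suggest, two different series with a missing value for the same country and year look identical. The sketch below illustrates this on a toy frame with invented rows; it is not a check run on the actual data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows): two different series, all values missing
toy = pd.DataFrame(
    {
        "CountryName": ["Brazil", "Brazil", "China", "China"],
        "CountryCode": ["BRA", "BRA", "CHN", "CHN"],
        "Year": [1970.0, 1970.0, 1970.0, 1970.0],
        "Value": [np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index(
        ["Series A", "Series B", "Series A", "Series B"], name="SeriesName"
    ),
)

# duplicated() ignores the index, so rows from different series collide
dupes_ignoring_index = int(toy.duplicated().sum())

# Moving SeriesName into the columns makes each row distinct again
dupes_with_series = int(toy.reset_index().duplicated().sum())

print(dupes_ignoring_index, dupes_with_series)
```

If the same pattern holds in the real data, most of the reported duplicates would be missing observations of different series rather than repeated records.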

In [15]:
df_private_reshape = pd.pivot_table(df_private, values = 'Value', index = ['CountryName','CountryCode','Year'], columns = 'SeriesName')
print(f'The size of the reshaped df_private: {df_private_reshape.shape}')
df_private_reshape.head(10)

The size of the reshaped df_private: (230, 168)


Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,Agricultural raw materials exports (% of merchandise exports),Agricultural raw materials imports (% of merchandise imports),Average number of visits or required meetings with tax officials (for affected firms),Average time to clear exports through customs (days),"Binding coverage, all products (%)","Binding coverage, manufactured products (%)","Binding coverage, primary products (%)","Bound rate, simple mean, all products (%)","Bound rate, simple mean, manufactured products (%)","Bound rate, simple mean, primary products (%)",...,"Time to import, documentary compliance (hours)",Time to obtain an electrical connection (days),Time to prepare and pay taxes (hours),Time to resolve insolvency (years),Total tax and contribution rate (% of profit),Transport services (% of commercial service exports),Transport services (% of commercial service imports),Travel services (% of commercial service exports),Travel services (% of commercial service imports),Value lost due to electrical outages (% of sales for affected firms)
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,11.895098,1.851044,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,10.494023,2.17832,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,9.322425,1.971959,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,8.521143,2.07326,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,5.991149,2.014877,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1975.0,3.891622,1.497433,,,,,,,,,...,,,,,,48.568507,59.589615,7.259714,17.629816,
Brazil,BRA,1976.0,2.273232,1.418146,,,,,,,,,...,,,,,,50.773994,58.769107,5.779154,14.481094,
Brazil,BRA,1977.0,2.353515,1.4587,,,,,,,,,...,,,,,,46.839827,56.5819,4.761905,8.563949,
Brazil,BRA,1978.0,2.918907,1.447775,,,,,,,,,...,,,,,,43.364198,53.525424,5.246914,8.610169,
Brazil,BRA,1979.0,3.442275,1.351869,,,,,,,,,...,,,,,,50.544662,62.425507,5.446623,9.20739,


As expected, the Private dataset contains fewer features than the other datasets because it had significantly fewer values. Importantly, it has only 230 rows, compared with 250 in the other 3 dataframes. This suggests that 20 country-year rows are missing for a specific time period somewhere in the dataframe. I will need to find these rows and understand why they are excluded from the dataset. When combining all the dataframes into one, I will need to understand which features are relevant and contain information for the time period I plan to analyze. Finally, I will prepare and gain an understanding of the Health dataset.
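A minimal sketch of how the absent rows could be located: with its default `dropna=True`, `pd.pivot_table` leaves out index combinations for which every value is NaN, so comparing the pivoted index against the full country-by-year grid reveals them. The frame below is a toy with invented values, and `SeriesName` is kept as an ordinary column for simplicity.

```python
import numpy as np
import pandas as pd

# Toy long-format frame (hypothetical values); China has no data for 1970
long_df = pd.DataFrame(
    {
        "SeriesName": ["Series A"] * 4,
        "CountryName": ["Brazil", "Brazil", "China", "China"],
        "CountryCode": ["BRA", "BRA", "CHN", "CHN"],
        "Year": [1970.0, 1971.0, 1970.0, 1971.0],
        "Value": [11.9, 10.5, np.nan, 3.2],
    }
)

wide = pd.pivot_table(
    long_df,
    values="Value",
    index=["CountryName", "CountryCode", "Year"],
    columns="SeriesName",
)

# Country-year pairs actually present after pivoting
present = wide.index.droplevel("CountryCode")

# Full grid of every country crossed with every year in the source data
expected = pd.MultiIndex.from_product(
    [long_df["CountryName"].dropna().unique(), sorted(long_df["Year"].unique())],
    names=["CountryName", "Year"],
)

# Pairs that the pivot dropped because they had no values at all
missing = expected.difference(present)
print(missing.tolist())
```

Applied to the real Private data, the same comparison should list the 20 missing country-year pairs, if the all-NaN explanation is the right one.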

In [31]:
# Creating a CSV file for the cleaning section
df_private_reshape.to_csv(r'C:\Users\Jim\Desktop\Private_Rotated.csv')
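The export above writes to an absolute path on one machine. The following sketch shows a more portable variant built with `pathlib`, and how the three index levels can be restored when the file is read back in the cleaning notebook. The output folder here is a temporary directory standing in for a hypothetical project folder, and the frame is a toy.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy pivoted frame (hypothetical values) with the same three index levels
index = pd.MultiIndex.from_tuples(
    [("Brazil", "BRA", 1970.0), ("Brazil", "BRA", 1971.0)],
    names=["CountryName", "CountryCode", "Year"],
)
wide = pd.DataFrame({"Series A": [11.9, 10.5]}, index=index)

# Hypothetical output folder; a temporary directory is used for the demo
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "Private_Rotated.csv"
wide.to_csv(out_path)

# Restore the MultiIndex by naming the index columns on read
restored = pd.read_csv(out_path, index_col=["CountryName", "CountryCode", "Year"])
print(restored)
```

Without `index_col`, the three index levels come back as ordinary columns and would need to be set as the index again before merging.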

In [24]:
# Taking a look into the Health_And_Poverty_Data.csv
df_health.head(10)

SeriesName,SeriesCode,CountryName,CountryCode,Year,Value
"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,Brazil,BRA,1970.0,77.1184
"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,China,CHN,1970.0,38.6866
"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,India,IND,1970.0,108.3178
"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,Russian Federation,RUS,1970.0,29.8818
"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,South Africa,ZAF,1970.0,93.7106
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,SH.HIV.INCD.TL,Brazil,BRA,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,SH.HIV.INCD.TL,China,CHN,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,SH.HIV.INCD.TL,India,IND,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,SH.HIV.INCD.TL,Russian Federation,RUS,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,SH.HIV.INCD.TL,South Africa,ZAF,1970.0,


In [18]:
# Looking at the shape of the df_health dataframe and datatypes of the columns/features
print(df_health.shape)
df_health.dtypes

(69500, 5)


SeriesCode      object
CountryName     object
CountryCode     object
Year           float64
Value          float64
dtype: object

With 69,500 rows, the Health dataframe is relatively large. I will now remove the SeriesCode column, as I did for the other 4 datasets.

In [17]:
df_health = df_health.drop(['SeriesCode'], axis=1)
df_health.head(10)

SeriesName,CountryName,CountryCode,Year,Value
"Adolescent fertility rate (births per 1,000 women ages 15-19)",Brazil,BRA,1970.0,77.1184
"Adolescent fertility rate (births per 1,000 women ages 15-19)",China,CHN,1970.0,38.6866
"Adolescent fertility rate (births per 1,000 women ages 15-19)",India,IND,1970.0,108.3178
"Adolescent fertility rate (births per 1,000 women ages 15-19)",Russian Federation,RUS,1970.0,29.8818
"Adolescent fertility rate (births per 1,000 women ages 15-19)",South Africa,ZAF,1970.0,93.7106
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,Brazil,BRA,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,China,CHN,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,India,IND,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,Russian Federation,RUS,1970.0,
Adults (ages 15+) and children (ages 0-14) newly infected with HIV,South Africa,ZAF,1970.0,


In [62]:
# Checking the number of null values within the Health dataset
df_health.isna().sum()

CountryName      250
CountryCode      250
Year               0
Value          39481
dtype: int64

In [63]:
# Returns the percentage of rows which have NaNs for each column.
df_health.isna().sum()/len(df_health)*100.0 

CountryName     0.359712
CountryCode     0.359712
Year            0.000000
Value          56.807194
dtype: float64

In [64]:
# Checking to see the number of duplicated rows in the Health dataset
df_health.duplicated().sum()

39632

In [65]:
# Returns the percentage of rows that were duplicated in the Health dataset.
df_health.duplicated().sum()/len(df_health)*100.0

57.024460431654674

In [18]:
df_health_reshape = pd.pivot_table(df_health, values = 'Value', index = ['CountryName','CountryCode','Year'], columns = 'SeriesName')
print(f'The size of the reshaped df_health: {df_health_reshape.shape}')
df_health_reshape.head(10)

The size of the reshaped df_health: (250, 263)


Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,ARI treatment (% of children under 5 taken to a health provider),"Adolescent fertility rate (births per 1,000 women ages 15-19)",Adults (ages 15+) and children (ages 0-14) newly infected with HIV,Adults (ages 15-49) newly infected with HIV,Age dependency ratio (% of working-age population),"Age dependency ratio, old (% of working-age population)","Age dependency ratio, young (% of working-age population)","Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)",Antiretroviral therapy coverage (% of people living with HIV),...,Tuberculosis treatment success rate (% of new cases),UHC service coverage index,Unmet need for contraception (% of married women ages 15-49),Use of insecticide-treated bed nets (% of under-5 population),Vitamin A supplementation coverage rate (% of children ages 6-59 months),Wanted fertility rate (births per woman),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,,77.1184,,,83.980449,6.318839,77.66161,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,,75.4532,,,82.948217,6.359471,76.588746,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,,73.788,,,81.878391,6.395842,75.482549,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,,73.808,,,80.746489,6.42757,74.318919,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,,73.828,,,79.514392,6.453058,73.061335,,,,...,,,,,,,,,,
Brazil,BRA,1975.0,,73.848,,,78.196756,6.472243,71.724512,,,,...,,,,,,,,,,
Brazil,BRA,1976.0,,73.868,,,77.037268,6.518907,70.518361,,,,...,,,,,,,,,,
Brazil,BRA,1977.0,,73.888,,,75.877456,6.555841,69.321616,,,,...,,,,,,,,,,
Brazil,BRA,1978.0,,75.0802,,,74.720059,6.582918,68.137142,,,,...,,,,,,,,,,
Brazil,BRA,1979.0,,76.2724,,,73.578631,6.598951,66.979679,,,,...,,,,,,,,,,


Looking at the reshaped Health dataframe, we can see that it has the same number of rows (250) as the other 3 datasets, indicating the same 50-year period. Before cleaning, this period will need to be adjusted to the time series I plan to analyze, so that the appropriate cleaning methods can be applied. With the final dataframe pivoted in the same way as the other 4 dataframes, I can now clean the datasets and set my time series in order to build a combined dataframe for the machine learning analysis. I will now clean each dataframe and explain the reasoning behind each step.
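Before choosing a time window, it may help to see how well each series is covered. The sketch below computes, on a toy pivoted frame with invented values, the share of non-null rows per series and the first year with data. The 0.5 threshold is an arbitrary assumption for illustration, not a value taken from this project.

```python
import numpy as np
import pandas as pd

# Toy pivoted frame (hypothetical values): 2 countries x 3 years
index = pd.MultiIndex.from_product(
    [["Brazil", "China"], [1970.0, 1971.0, 1972.0]],
    names=["CountryName", "Year"],
)
wide = pd.DataFrame(
    {
        "Series A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "Series B": [np.nan, np.nan, 1.0, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Fraction of country-year rows with a value, per series
coverage = wide.notna().mean()

# Earliest year in which each series has any value
first_year = wide.apply(
    lambda col: col.dropna().index.get_level_values("Year").min()
)

# Keep series whose coverage meets an assumed threshold
keep = coverage[coverage >= 0.5].index.tolist()
print(coverage, first_year, keep, sep="\n")
```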

In [32]:
# Creating a CSV file for the cleaning section
df_health_reshape.to_csv(r'C:\Users\Jim\Desktop\Health_Rotated.csv')

## Cleaning (Next Notebook)
With the data prepared, I can now clean all the dataframes according to my time series and combine them into one aggregated dataframe.
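One possible way to combine the pivoted dataframes is `pd.concat` along the columns, with a key per dataset. The keys keep series apart when the same series name appears in more than one dataset, and the outer join on the shared index leaves NaN where a dataset has no row. This is only a sketch on toy frames with invented values; the next notebook may combine the data differently.

```python
import pandas as pd

# Shared three-level index, as produced by the pivots above
index = pd.MultiIndex.from_tuples(
    [("Brazil", "BRA", 1970.0), ("Brazil", "BRA", 1971.0)],
    names=["CountryName", "CountryCode", "Year"],
)

# Toy stand-ins (hypothetical values) for two of the pivoted dataframes
economy = pd.DataFrame({"Series E": [14.5, 14.6]}, index=index)
private = pd.DataFrame({"Series P": [11.9]}, index=index[:1])  # one row absent

# Keys become the top level of the column MultiIndex
combined = pd.concat({"economy": economy, "private": private}, axis=1)
print(combined)
```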

In [67]:
df_economy_reshape

Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,Adjusted net national income (annual % growth),Adjusted net national income (constant 2010 US$),Adjusted net national income (current US$),Adjusted net national income per capita (annual % growth),Adjusted net national income per capita (constant 2010 US$),Adjusted net national income per capita (current US$),"Adjusted net savings, excluding particulate emission damage (% of GNI)","Adjusted net savings, excluding particulate emission damage (current US$)","Adjusted net savings, including particulate emission damage (% of GNI)","Adjusted net savings, including particulate emission damage (current US$)",...,"Total reserves (includes gold, current US$)",Total reserves in months of imports,Total reserves minus gold (current US$),Trade (% of GDP),Trade in services (% of GDP),"Transport services (% of service exports, BoP)","Transport services (% of service imports, BoP)","Travel services (% of service exports, BoP)","Travel services (% of service imports, BoP)","Use of IMF credit (DOD, current US$)"
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,,3.918974e+11,3.786021e+10,,4120.323413,398.053948,,,,,...,1.189905e+09,,1.141660e+09,14.479195,,,,,,0.000000e+00
Brazil,BRA,1971.0,11.642225,4.375230e+11,4.414620e+10,8.928380,4488.201546,452.860827,,,,,...,1.753865e+09,,1.696186e+09,14.551280,,,,,,0.000000e+00
Brazil,BRA,1972.0,12.379004,4.916840e+11,5.252340e+10,9.704598,4923.763456,525.973644,,,,,...,4.218806e+09,,4.132748e+09,16.103251,,,,,,0.000000e+00
Brazil,BRA,1973.0,12.214610,5.517413e+11,7.058820e+10,9.580841,5395.501408,690.284997,,,,,...,6.508867e+09,,6.359911e+09,17.773259,,,,,,0.000000e+00
Brazil,BRA,1974.0,5.160031,5.802113e+11,9.341705e+10,2.702725,5541.326994,892.182583,,,,,...,5.463238e+09,,5.215753e+09,21.896847,,,,,,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
South Africa,ZAF,2015.0,3.600109,3.438474e+11,2.588321e+11,2.028188,6208.159544,4673.210034,1.792771,5.553092e+09,1.488583,4.610872e+09,...,4.588707e+10,4.752628,4.161951e+10,61.617073,9.628084,16.412558,41.557783,54.875629,19.305016,2.474104e+09
South Africa,ZAF,2016.0,-0.103042,3.434931e+11,2.389447e+11,-1.555696,6111.579455,4251.408416,0.482315,1.389873e+09,0.180089,5.189576e+08,...,4.718013e+10,5.470245,4.256559e+10,60.638188,9.886579,15.230954,37.947026,55.142912,19.131440,2.400190e+09
South Africa,ZAF,2017.0,2.581799,3.523614e+11,2.825657e+11,1.147830,6181.729998,4957.253510,1.351016,4.580899e+09,1.040869,3.529281e+09,...,5.072289e+10,5.264496,4.549929e+10,57.973895,9.140373,14.700026,39.323200,55.896968,20.138800,2.542671e+09
South Africa,ZAF,2018.0,-0.093618,3.520315e+11,2.956752e+11,-1.440878,6092.658823,5117.291609,-0.242971,-8.665527e+08,-0.553118,-1.972689e+09,...,5.164204e+10,4.840142,4.647827e+10,59.470334,8.816827,13.833318,41.326406,56.292741,20.614470,2.483140e+09


In [68]:
df_education_reshape

Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)","Adjusted net enrollment rate, primary (% of primary school age children)","Adjusted net enrollment rate, primary, female (% of primary school age children)","Adjusted net enrollment rate, primary, male (% of primary school age children)",Adolescents out of school (% of lower secondary school age),"Adolescents out of school, female (% of female lower secondary school age)","Adolescents out of school, male (% of male lower secondary school age)",...,"Trained teachers in upper secondary education, female (% of female teachers)","Trained teachers in upper secondary education, male (% of male teachers)",Urban land area (sq. km),Urban land area where elevation is below 5 meters (% of total land area),Urban land area where elevation is below 5 meters (sq. km),Urban population,Urban population (% of total population),Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Water productivity, total (constant 2010 US$ GDP per cubic meter of total freshwater withdrawal)"
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,,,,,,,,,,,...,,,,,,53176875.0,55.909,4.268093,,
Brazil,BRA,1971.0,,,,,,,,,,,...,,,,,,55461933.0,56.894,4.207327,,
Brazil,BRA,1972.0,,,,,,,,,,,...,,,,,,57797612.0,57.879,4.125057,,
Brazil,BRA,1973.0,,,,,,,,,,,...,,,,,,60184827.0,58.855,4.047282,,
Brazil,BRA,1974.0,,,,,,,,,,,...,,,,,,62641530.0,59.826,4.000820,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
South Africa,ZAF,2015.0,83.64,85.500000,75.363531,90.999474,95.64551,,,,,,...,,,,,,35905874.0,64.828,2.328063,,
South Africa,ZAF,2016.0,84.75,84.200000,70.715722,91.352501,,,,,,,...,,,,,,36724030.0,65.341,2.253041,,
South Africa,ZAF,2017.0,,84.400000,70.314530,91.704765,92.44938,94.66547,90.27715,19.139099,21.98757,16.33079,...,,,,,,37534797.0,65.850,2.183711,,
South Africa,ZAF,2018.0,,91.229874,89.598819,92.056892,,,,14.363630,13.17866,15.53172,...,,,,,,38339668.0,66.355,2.121666,,


In [69]:
df_public_reshape

Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,Adequacy of social insurance programs (% of total welfare of beneficiary households),Adequacy of social protection and labor programs (% of total welfare of beneficiary households),Adequacy of social safety net programs (% of total welfare of beneficiary households),Adequacy of unemployment benefits and ALMP (% of total welfare of beneficiary households),"Air transport, freight (million ton-km)","Air transport, passengers carried","Air transport, registered carrier departures worldwide","Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)",Armed forces personnel (% of total labor force),...,"Unemployment, youth male (% of male labor force ages 15-24) (modeled ILO estimate)","Unemployment, youth male (% of male labor force ages 15-24) (national estimate)","Unemployment, youth total (% of total labor force ages 15-24) (modeled ILO estimate)","Unemployment, youth total (% of total labor force ages 15-24) (national estimate)","Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)"
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,,,,,164.100006,3.339800e+06,137700.000000,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,,,,,180.199997,3.911000e+06,149300.000000,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,,,,,254.300003,4.671400e+06,156700.000000,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,,,,,316.700012,5.842400e+06,173200.000000,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,,,,,407.100006,6.855500e+06,204800.000000,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
South Africa,ZAF,2015.0,,,,,892.734324,1.888290e+07,209734.000000,,,0.378745,...,46.543999,46.324200,50.320999,50.130402,9.815,9.133,9.429,87.933998,83.476997,85.412003
South Africa,ZAF,2016.0,,,,,767.271186,1.974493e+07,212865.000000,,,0.368445,...,48.896000,48.586601,53.660000,53.346401,9.698,9.498,9.585,87.663002,82.761002,84.884003
South Africa,ZAF,2017.0,,,,,833.347948,2.082104e+07,216275.000000,,,0.352470,...,49.206001,48.822399,53.620998,53.201801,10.220,9.689,9.921,87.024002,82.589996,84.530998
South Africa,ZAF,2018.0,,,,,716.246208,2.392175e+07,180317.186001,,,,...,49.505001,49.151199,53.791000,53.408199,10.242,10.388,10.324,87.463997,82.045998,84.428001


In [70]:
df_private_reshape

Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,Agricultural raw materials exports (% of merchandise exports),Agricultural raw materials imports (% of merchandise imports),Average number of visits or required meetings with tax officials (for affected firms),Average time to clear exports through customs (days),"Binding coverage, all products (%)","Binding coverage, manufactured products (%)","Binding coverage, primary products (%)","Bound rate, simple mean, all products (%)","Bound rate, simple mean, manufactured products (%)","Bound rate, simple mean, primary products (%)",...,"Time to import, documentary compliance (hours)",Time to obtain an electrical connection (days),Time to prepare and pay taxes (hours),Time to resolve insolvency (years),Total tax and contribution rate (% of profit),Transport services (% of commercial service exports),Transport services (% of commercial service imports),Travel services (% of commercial service exports),Travel services (% of commercial service imports),Value lost due to electrical outages (% of sales for affected firms)
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,11.895098,1.851044,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,10.494023,2.178320,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,9.322425,1.971959,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,8.521143,2.073260,,,,,,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,5.991149,2.014877,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
South Africa,ZAF,2015.0,2.179561,0.920385,,,94.29,99.45,80.03,19.27,17.12,27.32,...,36.0,,198.0,2.0,28.8,16.847245,42.714339,56.329009,19.842276,
South Africa,ZAF,2016.0,2.351764,0.989361,,,94.29,99.45,80.03,19.27,17.12,27.32,...,36.0,,203.0,2.0,28.8,15.653246,39.012228,56.671801,19.668474,
South Africa,ZAF,2017.0,2.346479,0.991302,,,94.42,99.44,80.56,19.35,17.15,27.54,...,36.0,,210.0,2.0,28.9,15.068708,40.328214,57.298885,20.653504,
South Africa,ZAF,2018.0,2.322842,1.001650,,,94.42,99.44,80.56,19.35,17.15,27.54,...,36.0,,210.0,2.0,29.1,14.162340,42.323253,57.631651,21.111718,


In [71]:
df_health_reshape

Unnamed: 0_level_0,Unnamed: 1_level_0,SeriesName,ARI treatment (% of children under 5 taken to a health provider),"Adolescent fertility rate (births per 1,000 women ages 15-19)",Adults (ages 15+) and children (ages 0-14) newly infected with HIV,Adults (ages 15-49) newly infected with HIV,Age dependency ratio (% of working-age population),"Age dependency ratio, old (% of working-age population)","Age dependency ratio, young (% of working-age population)","Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)",Antiretroviral therapy coverage (% of people living with HIV),...,Tuberculosis treatment success rate (% of new cases),UHC service coverage index,Unmet need for contraception (% of married women ages 15-49),Use of insecticide-treated bed nets (% of under-5 population),Vitamin A supplementation coverage rate (% of children ages 6-59 months),Wanted fertility rate (births per woman),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
CountryName,CountryCode,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Brazil,BRA,1970.0,,77.1184,,,83.980449,6.318839,77.661610,,,,...,,,,,,,,,,
Brazil,BRA,1971.0,,75.4532,,,82.948217,6.359471,76.588746,,,,...,,,,,,,,,,
Brazil,BRA,1972.0,,73.7880,,,81.878391,6.395842,75.482549,,,,...,,,,,,,,,,
Brazil,BRA,1973.0,,73.8080,,,80.746489,6.427570,74.318919,,,,...,,,,,,,,,,
Brazil,BRA,1974.0,,73.8280,,,79.514392,6.453058,73.061335,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
South Africa,ZAF,2015.0,,69.5352,280000.0,250000.0,52.254651,7.657545,44.597106,,,53.0,...,81.0,69.0,,,,,,,64.5,110000.0
South Africa,ZAF,2016.0,87.6,68.7216,260000.0,230000.0,52.289475,7.800602,44.488872,,,58.0,...,82.0,,14.9,,50.0,2.0,0.9,3.6,64.6,100000.0
South Africa,ZAF,2017.0,,67.9080,240000.0,210000.0,52.368040,7.952359,44.415681,,,62.0,...,77.0,69.0,,,47.0,,,,64.8,90000.0
South Africa,ZAF,2018.0,,67.8488,220000.0,190000.0,52.433201,8.106405,44.326796,,,66.0,...,,,,,,,,,65.0,78000.0


Please move on to the next notebook `Capstone Project: BRICS Analysis Cleaning` for the cleaning process of these datasets and the combination of them.