## Data Cleaning

The data is taken from the World Bank Gender Statistics dataset, which covers 790 features. These mainly relate to gender but also include some more general features such as GDP, GNI and population. Initially I have chosen to focus on a single year, 2019, due to the time constraints of this project. This was the most recent year with complete data for a large number of countries, and it avoids the COVID period, which would distort some features.

In [42]:
import pandas as pd
import numpy as np
import matplotlib.pylab as pl

Load the data into a pandas dataframe.

In [43]:
df=pd.read_csv('Data/wb_gender_data_large.csv')

In [44]:
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2019 [YR2019]
0,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,Afghanistan,AFG,1
1,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,Albania,ALB,1
2,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,Algeria,DZA,0
3,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,American Samoa,ASM,..
4,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,Andorra,AND,..
...,...,...,...,...,...
170779,,,,,
170780,,,,,
170781,,,,,
170782,Data from database: Gender Statistics,,,,


Remove the columns with codes for series and country, which are not needed.

In [45]:
df=df.drop(['Series Code','Country Code'],axis='columns')

Create a list of all the unique series names. This can be used to narrow down the features to focus on. The list is saved to a .csv file so it can be reviewed manually.

In [46]:
series = df['Series Name'].unique()
series = pd.DataFrame(series)
series.to_csv('Data/series_names.csv')
# print(series)
print(type(series))
print(series.shape)

<class 'pandas.core.frame.DataFrame'>
(790, 1)


Change all empty cells ('..') to NaN. This allows them to be dealt with during data cleaning.

In [47]:
df = df.replace('..', np.nan)
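
As a minimal sketch of this step on made-up data (the toy frame and its values are illustrative assumptions, not taken from the World Bank file), `DataFrame.replace` swaps the '..' placeholder for NaN in one call. Passing `na_values='..'` to `pd.read_csv` at load time would be another way to reach the same result.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw export; values are invented for illustration.
toy = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    '2019 [YR2019]': ['1', '..', '0'],
})

# Replace the '..' placeholder with NaN so pandas treats it as missing.
toy_clean = toy.replace('..', np.nan)

# Number of missing values in the value column after the replacement.
print(toy_clean['2019 [YR2019]'].isna().sum())
```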

Remove the last 5 rows as they do not contain data.

In [48]:
df = df.iloc[:-5,:]
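
Slicing by position relies on the footer always being exactly five rows long. A possible alternative, sketched here on invented data, is to drop every row that has no country name, which should remove the footer regardless of its length. The toy frame below is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the export: two data rows followed by footer rows.
toy = pd.DataFrame({
    'Series Name': ['Population, total', 'Population, total', np.nan,
                    'Data from database: Gender Statistics'],
    'Country Name': ['A', 'B', np.nan, np.nan],
    '2019 [YR2019]': ['100', '200', np.nan, np.nan],
})

# Footer rows have no country, so dropping rows with a missing
# 'Country Name' removes them without hard-coding a row count.
toy_data = toy.dropna(subset=['Country Name'])

print(toy_data.shape)
```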

Re-order the columns.

In [49]:
df=df.reindex(columns=['Country Name','Series Name','2019 [YR2019]'])

In [50]:
df = df.rename(columns={'2019 [YR2019]': '2019 Value'})

Pivot the data so the series names are shifted to columns. The country name is the index.

In [51]:
df = df.pivot_table(values='2019 Value', index='Country Name', columns='Series Name',aggfunc='min')
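
To illustrate what the pivot does, here is a small sketch on invented long-format data (the country and series names are placeholders). Each (country, series) pair becomes one cell in a wide table.

```python
import pandas as pd

# Toy long-format data: one row per (country, series) pair. Values are invented.
long_df = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Series Name': ['GDP', 'Population', 'GDP', 'Population'],
    '2019 Value': [10.0, 1.5, 20.0, 3.0],
})

# Pivot to wide format: one row per country, one column per series.
wide_df = long_df.pivot_table(values='2019 Value', index='Country Name',
                              columns='Series Name', aggfunc='min')

print(wide_df)
```

Note that `pivot_table` by default (`dropna=True`) omits columns whose entries are all NaN, which may leave fewer columns than there are series names.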

Make all the values in the dataframe numeric rather than strings.

In [52]:
df = df.apply(pd.to_numeric)
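
By default `pd.to_numeric` raises an error if it meets a value it cannot parse. If the data might still contain stray text, one option is `errors='coerce'`, which converts such values to NaN. The sketch below uses invented values; whether the real dataset contains any unparseable entries is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame of strings; 'n/a' is an invented example of an unparseable entry.
toy = pd.DataFrame({
    'GDP': ['10.5', '20', np.nan],
    'Population': ['1000', 'n/a', '3000'],
})

# errors='coerce' turns unparseable entries such as 'n/a' into NaN
# instead of raising a ValueError.
toy_num = toy.apply(pd.to_numeric, errors='coerce')

print(toy_num.dtypes)
```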

List all countries with a population under 1 million.

In [53]:
small_countries = df[df['Population, total'] < 1000000]
print(small_countries.index)


Index(['American Samoa', 'Andorra', 'Antigua and Barbuda', 'Aruba',
       'Bahamas, The', 'Barbados', 'Belize', 'Bermuda', 'Bhutan',
       'British Virgin Islands', 'Brunei Darussalam', 'Cabo Verde',
       'Cayman Islands', 'Channel Islands', 'Comoros', 'Curacao', 'Dominica',
       'Faroe Islands', 'Fiji', 'French Polynesia', 'Gibraltar', 'Greenland',
       'Grenada', 'Guam', 'Guyana', 'Iceland', 'Isle of Man', 'Kiribati',
       'Liechtenstein', 'Luxembourg', 'Macao SAR, China', 'Maldives', 'Malta',
       'Marshall Islands', 'Micronesia, Fed. Sts.', 'Monaco', 'Montenegro',
       'Nauru', 'New Caledonia', 'Northern Mariana Islands', 'Palau', 'Samoa',
       'San Marino', 'Sao Tome and Principe', 'Seychelles',
       'Sint Maarten (Dutch part)', 'Solomon Islands', 'St. Kitts and Nevis',
       'St. Lucia', 'St. Martin (French part)',
       'St. Vincent and the Grenadines', 'Suriname', 'Tonga',
       'Turks and Caicos Islands', 'Tuvalu', 'Vanuatu',
       'Virgin Islands (U.S.)'

Remove all countries with population under 1 million.

In [54]:
df = df[df['Population, total'] > 1000000]
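
One side effect worth being aware of: a comparison against NaN evaluates to False, so any row with a missing population value is removed by this filter as well. The sketch below shows this on invented data (the index labels are placeholders).

```python
import numpy as np
import pandas as pd

# Toy frame with one small, one large and one unknown population.
toy = pd.DataFrame(
    {'Population, total': [500_000, 2_000_000, np.nan]},
    index=['Small', 'Large', 'Unknown'],
)

# Comparisons against NaN evaluate to False, so 'Unknown' is dropped
# together with 'Small'.
kept = toy[toy['Population, total'] > 1_000_000]

print(list(kept.index))
```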

The feature I selected as the prediction target for the machine learning models is 'Female share of employment in senior and middle management (%)'. I chose it because it seemed a good marker for gender equality within a country. In total, 82 countries had a value for this feature.

In [55]:
df['Female share of employment in senior and middle management (%)'].describe()


count    82.000000
mean     32.599878
std      10.118658
min       6.326000
25%      28.093250
50%      32.848500
75%      39.163250
max      61.956000
Name: Female share of employment in senior and middle management (%), dtype: float64

In [56]:
df[~df['Female share of employment in senior and middle management (%)'].isna()].index

Index(['Albania', 'Angola', 'Argentina', 'Australia', 'Austria', 'Belarus',
       'Belgium', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Cambodia', 'Colombia', 'Costa Rica', 'Croatia', 'Cyprus',
       'Czechia', 'Denmark', 'Dominican Republic', 'Ecuador',
       'Egypt, Arab Rep.', 'El Salvador', 'Estonia', 'Finland', 'France',
       'Georgia', 'Germany', 'Greece', 'Guatemala', 'Honduras', 'Hungary',
       'India', 'Iran, Islamic Rep.', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Jordan', 'Kenya', 'Kosovo', 'Kyrgyz Republic', 'Latvia', 'Lebanon',
       'Lesotho', 'Lithuania', 'Mexico', 'Mongolia', 'Myanmar', 'Netherlands',
       'Nigeria', 'North Macedonia', 'Norway', 'Pakistan', 'Philippines',
       'Poland', 'Portugal', 'Romania', 'Russian Federation', 'Senegal',
       'Serbia', 'Slovak Republic', 'Slovenia', 'Somalia', 'South Africa',
       'Spain', 'Sri Lanka', 'Sweden', 'Switzerland', 'Thailand',
       'Trinidad and Tobago', 'Tunisia

Drop all countries which have NaN values in the selected feature, 'Female share of employment in senior and middle management (%)'.

In [64]:
df_select = df.dropna(subset = ['Female share of employment in senior and middle management (%)'])
df_select.shape

(82, 687)

Remove all other features which include NaN values. This leaves 162 features.

In [65]:
df_select = df_select.dropna(axis=1)
df_select.shape

(82, 162)
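
Before dropping columns, it can help to see how many values each feature is missing. The sketch below, on invented data with placeholder column names, counts missing values per feature and then applies the same `dropna(axis=1)` call, which keeps only fully populated columns.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete feature and one with gaps. Values are invented.
toy = pd.DataFrame({
    'target': [30.0, 40.0, 50.0],
    'complete': [1.0, 2.0, 3.0],
    'sparse': [1.0, np.nan, np.nan],
}, index=['A', 'B', 'C'])

# Count missing values per feature before dropping anything.
missing_per_feature = toy.isna().sum()

# Keep only features with no missing values at all.
toy_complete = toy.dropna(axis=1)

print(missing_per_feature)
print(list(toy_complete.columns))
```

A less strict variant would be `dropna(axis=1, thresh=...)`, which keeps columns with at least a given number of non-missing values, at the cost of needing imputation later.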

In [19]:
df_select.to_csv('Data/wb_gender_data_cleaned.csv')

In [20]:
series_select = df_select.columns
series_select
series_select = pd.DataFrame(series_select)
series_select.to_csv('Data/series_select_names.csv')

In [21]:
df_select

Series Name,A woman can apply for a passport in the same way as a man (1=yes; 0=no),"A woman can be ""head of household"" in the same way as a man (1=yes; 0=no)",A woman can choose where to live in the same way as a man (1=yes; 0=no),A woman can get a job in the same way as a man (1=yes; 0=no),A woman can obtain a judgment of divorce in the same way as a man (1=yes; 0=no),A woman can open a bank account in the same way as a man (1=yes; 0=no),A woman can register a business in the same way as a man (1=yes; 0=no),A woman can sign a contract in the same way as a man (1=yes; 0=no),A woman can travel outside her home in the same way as a man (1=yes; 0=no),A woman can travel outside the country in the same way as a man (1=yes; 0=no),...,"Time required to start a business, female (days)","Time required to start a business, male (days)","Unemployment, female (% of female labor force) (national estimate)","Unemployment, male (% of male labor force) (national estimate)","Unemployment, total (% of total labor force) (national estimate)","Unemployment, youth female (% of female labor force ages 15-24) (national estimate)","Unemployment, youth male (% of male labor force ages 15-24) (national estimate)","Unemployment, youth total (% of total labor force ages 15-24) (national estimate)",Women Business and the Law Index Score (scale 1-100),Women and men have equal ownership rights to immovable property (1=yes; 0=no)
Country Name,,,,,,,,,,,,,,,,,,,,,
Albania,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,4.5,4.5,11.316,11.585,11.466,25.851,27.764,26.978,91.250,1.0
Angola,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,36.0,36.0,16.546,16.446,16.497,29.072,33.384,31.198,73.125,1.0
Argentina,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,11.5,11.5,10.691,9.178,9.843,28.777,23.867,25.860,76.250,1.0
Australia,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,2.0,2.0,5.099,5.183,5.143,10.571,12.279,11.449,96.875,1.0
Austria,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,21.0,21.0,4.424,4.680,4.560,7.944,9.518,8.786,94.375,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,6.5,6.5,10.640,7.297,8.836,32.785,24.007,27.782,88.750,1.0
Viet Nam,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,16.0,16.0,1.597,1.757,1.681,5.615,5.954,5.800,81.875,1.0
West Bank and Gaza,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,44.0,43.0,41.191,21.349,25.340,67.171,34.792,40.158,26.250,1.0
Zambia,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,8.5,8.5,5.224,5.809,5.538,8.209,12.309,10.304,81.250,1.0


In [40]:


df_select['Female share of employment in senior and middle management (%)'].sort_values(ascending=False).head(20)

Country Name
Jordan                 61.956
Botswana               57.771
Dominican Republic     50.625
Kenya                  49.624
Honduras               47.470
Belarus                45.805
Trinidad and Tobago    45.248
El Salvador            43.128
Costa Rica             42.767
Mongolia               42.513
Ukraine                42.441
Russian Federation     41.989
Sweden                 41.902
Latvia                 41.739
Albania                41.340
Poland                 41.151
Kyrgyz Republic        40.864
United States          40.848
Slovenia               40.507
Guatemala              39.358
Name: Female share of employment in senior and middle management (%), dtype: float64