# Sourcing open data

#### Table of Contents

    1. Importing libraries
    2. Importing data
    3. Consistency checks & cleaning
        3.1 Renaming columns
        3.2 Missing values
        3.3 Unifying 'Country' values
        3.4 Data types
            Check for mixed-type values
            Change data types
        3.5 Duplicates
    4. Descriptive statistics
    5. Exporting data

# 1. Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 2. Importing data

In [2]:
# Turn project folder path into a string
path = r'/Users/sarahtischer/Desktop/CareerFoundry/Data Immersion/Achievement 6/03-2024_WorldRiskIndex_Analysis'

In [3]:
# Import "WRI_Kaggle.csv"
df = pd.read_csv(os.path.join(path, '02_Data', 'Original_data', 'WRI_Kaggle.csv'), index_col = False)
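
The import above depends on an absolute path that only exists on one machine. As a minimal sketch, assuming the notebook is run from a location where the project folder is reachable, the same path can be built with `pathlib` and checked before reading. `project_dir` is a hypothetical stand-in, not the author's actual folder.

```python
from pathlib import Path

# Hypothetical project root; replace with the actual project folder
project_dir = Path('03-2024_WorldRiskIndex_Analysis')

# Same sub-folders and file name as in the import cell
csv_path = project_dir / '02_Data' / 'Original_data' / 'WRI_Kaggle.csv'

# Report a readable message if the file is not where it is expected
if not csv_path.exists():
    print(f'File not found: {csv_path}')
```

`pd.read_csv` accepts a `Path` object directly, so `csv_path` could be passed in place of the `os.path.join(...)` call.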

In [4]:
# Confirm the shape of the dataset
df.shape

(1917, 12)

# 3. Consistency checks & cleaning

In [5]:
# Check the metadata of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1917 entries, 0 to 1916
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Region                        1917 non-null   object 
 1   WRI                           1917 non-null   float64
 2   Exposure                      1917 non-null   float64
 3   Vulnerability                 1917 non-null   float64
 4   Susceptibility                1917 non-null   float64
 5   Lack of Coping Capabilities   1917 non-null   float64
 6    Lack of Adaptive Capacities  1916 non-null   float64
 7   Year                          1917 non-null   int64  
 8   Exposure Category             1917 non-null   object 
 9   WRI Category                  1916 non-null   object 
 10  Vulnerability Category        1913 non-null   object 
 11  Susceptibility Category       1917 non-null   object 
dtypes: float64(6), int64(1), object(5)
memory usage: 179.8+ KB


## 3.1 Renaming columns

In [6]:
# Rename columns for the sake of consistency
df.rename(
    columns = {
        'Region' : 'Country', 
        'Lack of Coping Capabilities' : 'Lack of Coping Capacities', 
        ' Lack of Adaptive Capacities' : 'Lack of Adaptive Capacities'
    }, inplace = True
)
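
The third rename only removes a leading space from `' Lack of Adaptive Capacities'`. A more general option, sketched here on an empty toy DataFrame (the column `'Year '` with a trailing space is made up for illustration), is to strip whitespace from all column names at once.

```python
import pandas as pd

# Toy DataFrame with stray whitespace in two column names (illustration only)
toy = pd.DataFrame(columns=['Region', ' Lack of Adaptive Capacities', 'Year '])

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Renames that change the wording, such as `'Region'` to `'Country'`, would still need an explicit mapping.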

In [7]:
# Change order of columns for a more intuitive view

# Define the desired order of columns
column_order = [
    'Country', 'Year', 'WRI', 'Exposure', 'Vulnerability', 
    'Susceptibility', 'Lack of Coping Capacities', 'Lack of Adaptive Capacities',
    'WRI Category', 'Exposure Category', 'Vulnerability Category', 'Susceptibility Category'
]

# Reorder columns in DataFrame
df = df[column_order]
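
Selecting columns with a list silently drops any column that is not in the list. A small guard, sketched here on a toy DataFrame with three of the columns, could confirm that the list and the DataFrame hold the same names before reordering.

```python
import pandas as pd

# Toy DataFrame with three columns in the "wrong" order (illustration only)
toy = pd.DataFrame(columns=['WRI', 'Country', 'Year'])
column_order = ['Country', 'Year', 'WRI']

# Compare the two sets of names so that no column is lost or misspelled
if set(column_order) != set(toy.columns):
    raise ValueError('column_order does not match the DataFrame columns')

toy = toy[column_order]
print(list(toy.columns))
```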

In [8]:
# Check the output
df.head()

| | Country | Year | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities | WRI Category | Exposure Category | Vulnerability Category | Susceptibility Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vanuatu | 2011 | 32.0 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 | Very High | Very High | High | High |
| 1 | Tonga | 2011 | 29.08 | 56.04 | 51.9 | 28.94 | 81.8 | 44.97 | Very High | Very High | Medium | Medium |
| 2 | Philippinen | 2011 | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 | Very High | Very High | High | High |
| 3 | Salomonen | 2011 | 23.51 | 36.4 | 64.6 | 44.11 | 85.95 | 63.74 | Very High | Very High | Very High | High |
| 4 | Guatemala | 2011 | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 | Very High | Very High | High | High |


## 3.2 Missing values

In [9]:
# Find missing values
df.isnull().sum()

Country                        0
Year                           0
WRI                            0
Exposure                       0
Vulnerability                  0
Susceptibility                 0
Lack of Coping Capacities      0
Lack of Adaptive Capacities    1
WRI Category                   1
Exposure Category              0
Vulnerability Category         4
Susceptibility Category        0
dtype: int64

In [10]:
# Create subset of missing observations
df_nan = df[df[['Lack of Adaptive Capacities', 'WRI Category', 'Vulnerability Category']].isnull().any(axis=1)]

df_nan

| | Country | Year | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities | WRI Category | Exposure Category | Vulnerability Category | Susceptibility Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1193 | Österreich | 2019 | 2.87 | 13.18 | 21.75 | 13.63 | 39.27 | 12.34 | Very Low | Medium | NaN | Very Low |
| 1202 | Deutschland | 2019 | 2.43 | 11.51 | 21.11 | 14.3 | 36.44 | 12.6 | Very Low | Low | NaN | Very Low |
| 1205 | Norwegen | 2019 | 2.34 | 10.6 | 22.06 | 13.29 | 39.21 | 13.68 | Very Low | Low | NaN | Very Low |
| 1292 | Föd. Staaten v. Mikronesien | 2020 | 7.59 | 14.95 | 50.77 | 31.79 | 72.13 | 48.39 | NaN | High | High | High |
| 1858 | Korea Republic of 4.59 | 2016 | 14.89 | 30.82 | 14.31 | 46.55 | 31.59 | NaN | Very High | Very High | NaN | High |
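
In row 1858 the WRI value has been merged into the country name and the remaining values have shifted one column to the right. The missing-value check happened to catch this row, but a shifted row without a resulting NaN would slip through. One possible extra check, sketched here on toy data rather than the real dataset, is to look for digits in the country names.

```python
import pandas as pd

# Toy rows for illustration; the second one mimics the shifted row 1858
toy = pd.DataFrame({
    'Country': ['Norwegen', 'Korea Republic of 4.59', 'St. Lucia'],
})

# A digit inside a country name suggests that a numeric field was merged into it
suspicious = toy[toy['Country'].str.contains(r'\d', regex=True)]
print(suspicious)
```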


In [11]:
# Set the missing values as obtained from the original reports
df.loc[1193, 'Vulnerability Category'] = 'Very Low'
df.loc[1202, 'Vulnerability Category'] = 'Very Low'
df.loc[1205, 'Vulnerability Category'] = 'Very Low'
df.loc[1292, 'WRI Category'] = 'High'

In [12]:
# Correct apparently misplaced columns of row index 1858 (data obtained from original report)

# Specify the new values for each field
new_values = {
    'Country': 'Republic of Korea',
    'WRI': 4.59,
    'Exposure': 14.89,
    'Vulnerability': 30.82,
    'Susceptibility': 14.31,
    'Lack of Coping Capacities': 46.55,
    'Lack of Adaptive Capacities': 31.59,
    'Exposure Category': 'High',
    'WRI Category': 'Low',
    'Vulnerability Category': 'Very Low',
    'Susceptibility Category': 'Very Low'
}

# Update the values for the specified row
df.loc[1858, list(new_values.keys())] = list(new_values.values())

In [13]:
# Check the output of all changed rows
print(df.iloc[[1193, 1202, 1205, 1292, 1858]])

                          Country  Year   WRI  Exposure  Vulnerability  \
1193                   Österreich  2019  2.87     13.18          21.75   
1202                  Deutschland  2019  2.43     11.51          21.11   
1205                     Norwegen  2019  2.34     10.60          22.06   
1292  Föd. Staaten v. Mikronesien  2020  7.59     14.95          50.77   
1858            Republic of Korea  2016  4.59     14.89          30.82   

      Susceptibility  Lack of Coping Capacities  Lack of Adaptive Capacities  \
1193           13.63                      39.27                        12.34   
1202           14.30                      36.44                        12.60   
1205           13.29                      39.21                        13.68   
1292           31.79                      72.13                        48.39   
1858           14.31                      46.55                        31.59   

     WRI Category Exposure Category Vulnerability Category  \
1193     Ver

In [14]:
# Check for missing values again
df.isnull().sum()

Country                        0
Year                           0
WRI                            0
Exposure                       0
Vulnerability                  0
Susceptibility                 0
Lack of Coping Capacities      0
Lack of Adaptive Capacities    0
WRI Category                   0
Exposure Category              0
Vulnerability Category         0
Susceptibility Category        0
dtype: int64

## 3.3 Unifying 'Country' values

In [15]:
# Remove the row display limit so that all rows are shown
pd.options.display.max_rows = None

In [16]:
# Find frequencies of countries
df['Country'].value_counts(dropna = False)

Country
Vanuatu                             11
Lesotho                             11
Turkmenistan                        11
Eritrea                             11
Peru                                11
Uganda                              11
Panama                              11
Pakistan                            11
Sri Lanka                           11
Angola                              11
Malaysia                            11
Myanmar                             11
Ecuador                             11
Malawi                              11
Guyana                              11
Nigeria                             11
Liberia                             11
Sudan                               11
Thailand                            11
Namibia                             11
Guinea                              11
Ukraine                             11
Malta                               11
Bahrain                             11
Kiribati                            11
Grenada          

In [17]:
# Define a mapping dictionary to replace German and otherwise inconsistent country names with one consistent English name per country
mapping_dict = {
    'Albanien': 'Albania',
    'Algerien': 'Algeria',
    'Antigua und Barbuda': 'Antigua and Barbuda',
    'Argentinien': 'Argentina',
    'Armenien': 'Armenia',
    'Australien': 'Australia',
    'Österreich': 'Austria',
    'Aserbaidschan': 'Azerbaijan',
    'Bangladesch': 'Bangladesh',
    'Weißrussland': 'Belarus',
    'Belgien': 'Belgium',
    'Venezuela': 'Bolivarian Republic of Venezuela',
    'Bosnien und Herzegowina': 'Bosnia and Herzegovina',
    'Botsuana': 'Botswana',
    'Brasilien': 'Brazil',
    'Bulgarien': 'Bulgaria',
    'Kambodscha': 'Cambodia',
    'Kamerun': 'Cameroon',
    'Kanada': 'Canada',
    'Kap Verde': 'Cape Verde',
    'Zentralafrikanische Republik': 'Central African Republic',
    'Zentralafrik. Republik': 'Central African Republic',
    'Tschad': 'Chad',
    'Kolumbien': 'Colombia',
    'Komoren': 'Comoros',
    'Elfenbeinküste': 'Cote d\'Ivoire',
    'Kroatien': 'Croatia',
    'Kuba': 'Cuba',
    'Zypern': 'Cyprus',
    'Tschechische Republik': 'Czech Republic',
    'Demokratische Rep. Kongo': 'Democratic Republic of Congo',
    'Dänemark': 'Denmark',
    'Dschibuti': 'Djibouti',
    'Dominikanische Republik': 'Dominican Republic',
    'Ägypten': 'Egypt',
    'Äquatorialguinea': 'Equatorial Guinea',
    'Estland': 'Estonia',
    'Swasiland': 'Eswatini',
    'Swaziland': 'Eswatini',
    'Äthiopien': 'Ethiopia',
    'Föd. Staaten von Mikronesien': 'Federated States of Micronesia',
    'Föd. Staaten v. Mikronesien': 'Federated States of Micronesia',
    'Fidschi': 'Fiji',
    'Finnland': 'Finland',
    'Frankreich': 'France',
    'Gabun': 'Gabon',
    'Georgien': 'Georgia',
    'Deutschland': 'Germany',
    'Griechenland': 'Greece',
    'Ungarn': 'Hungary',
    'Island': 'Iceland',
    'Indien': 'India',
    'Indonesien': 'Indonesia',
    'Iran': 'Iran (Islamic Republic of)',
    'Irak': 'Iraq',
    'Irland': 'Ireland',
    'Italien': 'Italy',
    'Jamaika': 'Jamaica',
    'Jordanien': 'Jordan',
    'Kasachstan': 'Kazakhstan',
    'Kenia': 'Kenya',
    'Kirgisistan': 'Kyrgyzstan',
    'Laos': 'Lao People\'s Democratic Republic',
    'Lao People\'s Democ. Republic': 'Lao People\'s Democratic Republic',
    'Lettland': 'Latvia',
    'Libanon': 'Lebanon',
    'Libya': 'Libyan Arab Jamahiriya',
    'Libyen': 'Libyan Arab Jamahiriya',
    'Litauen': 'Lithuania',
    'Luxemburg': 'Luxembourg',
    'Madagaskar': 'Madagascar',
    'Malediven': 'Maldives',
    'Mauretanien': 'Mauritania',
    'Mexiko': 'Mexico',
    'Mongolei': 'Mongolia',
    'Mongolien': 'Mongolia',
    'Marokko': 'Morocco',
    'Mosambik': 'Mozambique',
    'Niederlande': 'Netherlands',
    'Neuseeland': 'New Zealand',
    'Republic of Macedonia': 'North Macedonia',
    'T. f. Yugo. Rep. of Macedonia': 'North Macedonia',
    'Mazedonien': 'North Macedonia',
    'Nordmazedonien': 'North Macedonia',
    'Norwegen': 'Norway',
    'Papua-Neuguinea': 'Papua New Guinea',
    'Philippinen': 'Philippines',
    'Bolivien': 'Plurinational State of Bolivia',
    'Bolivia': 'Plurinational State of Bolivia',
    'Polen': 'Poland',
    'Katar': 'Qatar',
    'Kongo': 'Republic of Congo',
    'Congo': 'Republic of Congo',
    'Südkorea': 'Republic of Korea',
    'South Korea': 'Republic of Korea',
    'Korea Republic of': 'Republic of Korea',
    'Moldawien': 'Republic of Moldova',
    'Moldau': 'Republic of Moldova',
    'Romänien': 'Romania',
    'Rumänien': 'Romania',
    'Russische Föderation': 'Russian Federation',
    'Russia': 'Russian Federation',
    'Ruanda': 'Rwanda',
    'St. Lucia': 'Saint Lucia',
    'St. Vincent und d. Grenadinen': 'Saint Vincent and the Grenadines',
    'St. Vincent u. d. Grenadinen': 'Saint Vincent and the Grenadines',
    'St. Vincent u. die Grenadinen': 'Saint Vincent and the Grenadines',
    'São Tomé and Príncipe': 'Sao Tome and Principe',
    'São Tomé und Príncipe': 'Sao Tome and Principe',
    'Saudi-Arabien': 'Saudi Arabia',
    'Serbien': 'Serbia',
    'Seychellen': 'Seychelles',
    'Singapur': 'Singapore',
    'Slowakei': 'Slovakia',
    'Slowenien': 'Slovenia',
    'Salomonen': 'Solomon Islands',
    'Südafrika': 'South Africa',
    'Spanien': 'Spain',
    'Surinam': 'Suriname',
    'Schweden': 'Sweden',
    'Schweiz': 'Switzerland',
    'Syria': 'Syrian Arab Republic',
    'Syrien': 'Syrian Arab Republic',
    'Tadschikistan': 'Tajikistan',
    'Trinidad und Tobago': 'Trinidad and Tobago',
    'Tunesien': 'Tunisia',
    'Türkei': 'Turkey',
    'Vereinigte Arabische Emirate': 'United Arab Emirates',
    'Vereinigte Arabisch Emirate': 'United Arab Emirates',
    'Ver. Arabische Emirate': 'United Arab Emirates',
    'Vereinigtes Königreich': 'United Kingdom of Great Britain and Northern Ireland',
    'United Kingdom': 'United Kingdom of Great Britain and Northern Ireland',
    'Tanzania': 'United Republic of Tanzania',
    'Tansania': 'United Republic of Tanzania',
    'Ver. Staaten von Amerika': 'United States of America',
    'Vereinigte Staaten v. A.': 'United States of America',
    'United States': 'United States of America',
    'Vereinigte Staaten von Amerika': 'United States of America',
    'Usbekistan': 'Uzbekistan',
    'Vietnam': 'Viet Nam',
    'Jemen': 'Yemen',
    'Sambia': 'Zambia',
    'Simbabwe': 'Zimbabwe'
}

# Replace inconsistent values with their English equivalents
df['Country'] = df['Country'].replace(mapping_dict)
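
With a mapping this long, it is easy to miss a spelling. A quick check after the replacement, sketched here with toy data and a shortened mapping (`toy_mapping` is illustrative, not the full dictionary), could confirm that none of the old spellings remain and list the distinct names for a visual scan.

```python
import pandas as pd

# Toy data and a shortened mapping, for illustration only
toy = pd.DataFrame({'Country': ['Deutschland', 'Germany', 'Norwegen', 'Peru']})
toy_mapping = {'Deutschland': 'Germany', 'Norwegen': 'Norway'}

toy['Country'] = toy['Country'].replace(toy_mapping)

# After the replacement, none of the old spellings should remain
leftover = toy.loc[toy['Country'].isin(list(toy_mapping)), 'Country'].unique().tolist()
print(leftover)

# One entry per distinct name makes remaining near-duplicates easier to spot by eye
distinct_names = sorted(toy['Country'].unique())
print(distinct_names)
```

This check assumes that no target name is itself a key of the mapping, which holds for the dictionary above.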

In [18]:
# Check the frequencies of countries again
df['Country'].value_counts(dropna = False)

Country
Vanuatu                                                 11
New Zealand                                             11
Plurinational State of Bolivia                          11
Jordan                                                  11
Iran (Islamic Republic of)                              11
Lebanon                                                 11
Republic of Moldova                                     11
Italy                                                   11
Bahamas                                                 11
Australia                                               11
Brazil                                                  11
Serbia                                                  11
Ireland                                                 11
Czech Republic                                          11
Republic of Korea                                       11
Paraguay                                                11
United Arab Emirates                            

In [19]:
# Check countries with fewer than 11 entries
incomplete_countries = [
    'Samoa', 'Sao Tome and Principe', 'Antigua and Barbuda', 'Democratic Republic of Congo',
    'Federated States of Micronesia', 'Montenegro', 'Saint Lucia', 'Maldives',
    'Saint Vincent and the Grenadines', 'Dominica'
]
df[df['Country'].isin(incomplete_countries)].sort_values(by=['Country', 'Year'])

| | Country | Year | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities | WRI Category | Exposure Category | Vulnerability Category | Susceptibility Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1041 | Antigua and Barbuda | 2019 | 30.8 | 69.95 | 44.03 | 23.38 | 76.65 | 32.05 | Very High | Very High | Medium | Medium |
| 1223 | Antigua and Barbuda | 2020 | 27.44 | 68.92 | 39.82 | 23.33 | 63.31 | 32.83 | Very High | Very High | Low | Medium |
| 692 | Antigua and Barbuda | 2021 | 27.28 | 67.73 | 40.28 | 23.8 | 64.41 | 32.62 | Very High | Very High | Low | Medium |
| 1095 | Democratic Republic of Congo | 2019 | 8.8 | 11.95 | 73.63 | 67.13 | 92.56 | 61.21 | High | Low | Very High | Very High |
| 1275 | Democratic Republic of Congo | 2020 | 8.77 | 11.8 | 74.28 | 67.78 | 92.95 | 62.12 | High | Low | Very High | Very High |
| 743 | Democratic Republic of Congo | 2021 | 8.78 | 11.86 | 74.04 | 67.76 | 92.8 | 61.55 | High | Low | Very High | Very High |
| 1222 | Dominica | 2020 | 28.47 | 62.74 | 45.38 | 26.12 | 71.21 | 38.82 | Very High | Very High | Medium | Medium |
| 691 | Dominica | 2021 | 27.42 | 61.74 | 44.41 | 23.42 | 71.13 | 38.67 | Very High | Very High | Medium | Medium |
| 1111 | Federated States of Micronesia | 2019 | 7.52 | 14.72 | 51.05 | 34.11 | 72.11 | 46.93 | High | Medium | High | High |
| 1292 | Federated States of Micronesia | 2020 | 7.59 | 14.95 | 50.77 | 31.79 | 72.13 | 48.39 | High | High | High | High |


#### <mark>Note:</mark>

Each country is expected to appear 11 times, once per year from 2011 to 2021. Countries with a lower frequency were not part of the WorldRiskReport in some years for various reasons.
The values for the missing years have since been recalculated and could be taken from the trend dataset at a later stage of the analysis, if necessary. However, the revised data reflect changes, adjustments, or updates in the source data, so the trend dataset may not match the individual yearly datasets. Any imputation should therefore be handled with care.
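
To see which years are missing for each country, rather than only how many, the expected years can be compared with the years present. The following is a minimal sketch on toy data with a shortened year range; it assumes `Year` is still stored as an integer, as it is at this point in the notebook.

```python
import pandas as pd

# Toy data for illustration; the real dataset has one row per country and year
toy = pd.DataFrame({
    'Country': ['Dominica', 'Dominica', 'Peru', 'Peru', 'Peru'],
    'Year': [2020, 2021, 2019, 2020, 2021],
})

# Years every country is expected to cover (shortened here; 2011-2021 in the dataset)
expected_years = set(range(2019, 2022))

# For each country, collect the expected years that have no row
missing_years = {
    country: sorted(expected_years - set(group['Year']))
    for country, group in toy.groupby('Country')
}

# Keep only countries with at least one gap
missing_years = {country: years for country, years in missing_years.items() if years}
print(missing_years)
```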

## 3.4 Data types

#### Check for mixed-type values

In [20]:
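# Check for mixed types
mixed_cols = []
for col in df.columns.tolist():
    # Flag rows whose value type differs from the type of the column's first value
    mixed = (df[[col]].map(type) != df[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df[mixed]) > 0:
        mixed_cols.append(col)

# Report the affected columns, or confirm that there are none
print(mixed_cols if mixed_cols else 'No mixed-type columns.')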

No mixed-type columns.
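
As a cross-check, pandas can also report the inferred type of each column through `pd.api.types.infer_dtype`, which returns a label beginning with `'mixed'` when a column holds values of more than one type. This is an alternative to the loop above, not the method used in this notebook, and it is sketched on toy data.

```python
import pandas as pd

# Toy columns for illustration
toy = pd.DataFrame({
    'a': ['x', 'y', 'z'],   # strings only
    'b': ['x', 2.5, 'y'],   # strings and a float
    'c': [1.0, 2.0, 3.0],   # floats only
})

# infer_dtype returns a label such as 'string', 'floating' or 'mixed'
inferred = {col: pd.api.types.infer_dtype(toy[col], skipna=True) for col in toy.columns}

# Collect the columns whose label starts with 'mixed'
mixed_cols = [col for col, kind in inferred.items() if kind.startswith('mixed')]
print(mixed_cols)
```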


#### Change data types

In [21]:
# Check the metadata of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1917 entries, 0 to 1916
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Country                      1917 non-null   object 
 1   Year                         1917 non-null   int64  
 2   WRI                          1917 non-null   float64
 3   Exposure                     1917 non-null   float64
 4   Vulnerability                1917 non-null   float64
 5   Susceptibility               1917 non-null   float64
 6   Lack of Coping Capacities    1917 non-null   float64
 7   Lack of Adaptive Capacities  1917 non-null   float64
 8   WRI Category                 1917 non-null   object 
 9   Exposure Category            1917 non-null   object 
 10  Vulnerability Category       1917 non-null   object 
 11  Susceptibility Category      1917 non-null   object 
dtypes: float64(6), int64(1), object(5)
memory usage: 179.8+ KB


In [22]:
# Change data type of 'Year' from integer to string to exclude it from statistics
df['Year'] = df['Year'].astype(str)

In [23]:
# Check the data types
df.dtypes

Country                         object
Year                            object
WRI                            float64
Exposure                       float64
Vulnerability                  float64
Susceptibility                 float64
Lack of Coping Capacities      float64
Lack of Adaptive Capacities    float64
WRI Category                    object
Exposure Category               object
Vulnerability Category          object
Susceptibility Category         object
dtype: object
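
The four category columns remain plain `object` columns. One option that this notebook does not apply is an ordered categorical dtype, which usually takes less memory for repeated labels and lets comparisons follow the risk scale. The sketch below uses toy data and assumes the category columns contain only the five levels seen in the outputs above.

```python
import pandas as pd

# Toy column for illustration; assumes only these five levels occur
toy = pd.DataFrame({'WRI Category': ['High', 'Very Low', 'Medium', 'Very High', 'Low', 'High']})

# Ordered categorical dtype, from lowest to highest risk
risk_levels = pd.CategoricalDtype(
    categories=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
    ordered=True,
)
toy['WRI Category'] = toy['WRI Category'].astype(risk_levels)

# Ordering makes comparisons follow the risk scale, not the alphabet
high_or_above = toy[toy['WRI Category'] >= 'High']
print(high_or_above)
```

Any value outside the listed categories would become missing after the conversion, so the levels should be checked with `value_counts()` first.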

In [24]:
# Check the output
df.head()

| | Country | Year | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities | WRI Category | Exposure Category | Vulnerability Category | Susceptibility Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vanuatu | 2011 | 32.0 | 56.33 | 56.81 | 37.14 | 79.34 | 53.96 | Very High | Very High | High | High |
| 1 | Tonga | 2011 | 29.08 | 56.04 | 51.9 | 28.94 | 81.8 | 44.97 | Very High | Very High | Medium | Medium |
| 2 | Philippines | 2011 | 24.32 | 45.09 | 53.93 | 34.99 | 82.78 | 44.01 | Very High | Very High | High | High |
| 3 | Solomon Islands | 2011 | 23.51 | 36.4 | 64.6 | 44.11 | 85.95 | 63.74 | Very High | Very High | Very High | High |
| 4 | Guatemala | 2011 | 20.88 | 38.42 | 54.35 | 35.36 | 77.83 | 49.87 | Very High | Very High | High | High |


## 3.5 Duplicates

In [25]:
# Find full duplicates
df_dups = df[df.duplicated()]

df_dups

| | Country | Year | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities | WRI Category | Exposure Category | Vulnerability Category | Susceptibility Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


#### <mark>Note:</mark>

No duplicates found.
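
`df.duplicated()` only flags rows that are identical in every column. Two rows for the same country and year with different scores would not be caught. A possible additional check, sketched here on made-up toy rows, restricts the comparison to the key columns.

```python
import pandas as pd

# Toy rows for illustration: Peru appears twice for 2020 with different scores
toy = pd.DataFrame({
    'Country': ['Peru', 'Peru', 'Samoa'],
    'Year': ['2020', '2020', '2020'],
    'WRI': [6.5, 6.7, 4.1],
})

# No full duplicates, because the two Peru rows differ in 'WRI'
full_dup_count = int(toy.duplicated().sum())
print(full_dup_count)

# Same country and year more than once; keep=False marks every row of the group
key_dups = toy[toy.duplicated(subset=['Country', 'Year'], keep=False)]
print(key_dups)
```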

# 4. Descriptive statistics

In [26]:
# Print descriptive statistics
df.describe()

| | WRI | Exposure | Vulnerability | Susceptibility | Lack of Coping Capacities | Lack of Adaptive Capacities |
|---|---|---|---|---|---|---|
| count | 1917.0 | 1917.0 | 1917.0 | 1917.0 | 1917.0 | 1917.0 |
| mean | 7.54639 | 15.380026 | 48.084371 | 30.722613 | 70.446093 | 43.084512 |
| std | 5.551136 | 10.234068 | 13.819766 | 15.667353 | 15.022557 | 13.550165 |
| min | 0.02 | 0.05 | 20.97 | 8.26 | 35.16 | 11.16 |
| 25% | 3.74 | 10.16 | 37.04 | 17.78 | 59.33 | 33.17 |
| 50% | 6.52 | 12.76 | 47.1 | 25.37 | 74.23 | 43.07 |
| 75% | 9.37 | 16.45 | 60.06 | 42.61 | 83.0 | 53.06 |
| max | 56.71 | 99.88 | 76.47 | 70.83 | 94.36 | 76.11 |


#### <mark>Note:</mark>

**Variable ranges:**
* WRI scores range from 0.02 to 56.71, a wide spread given that 75% of the scores lie at or below 9.37.

**Central tendency:**
* The means for WRI, exposure, vulnerability, susceptibility, lack of coping capacities, and lack of adaptive capacities are about 7.55, 15.38, 48.08, 30.72, 70.45, and 43.08, respectively.

**Variability:**
* WRI's standard deviation is about 5.55, roughly three quarters of its mean of 7.55, so scores vary considerably across countries and years.
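
Because the six scores sit on different levels, their standard deviations are easier to compare when divided by the mean. This ratio, the coefficient of variation, is not part of the original analysis. The sketch below computes it from the mean and std values in the `describe()` output.

```python
# Means and standard deviations copied from the describe() output
stats = {
    'WRI': (7.54639, 5.551136),
    'Exposure': (15.380026, 10.234068),
    'Vulnerability': (48.084371, 13.819766),
    'Susceptibility': (30.722613, 15.667353),
    'Lack of Coping Capacities': (70.446093, 15.022557),
    'Lack of Adaptive Capacities': (43.084512, 13.550165),
}

# Coefficient of variation = standard deviation / mean
cv = {name: std / mean for name, (mean, std) in stats.items()}

for name, value in cv.items():
    print(f'{name}: {value:.2f}')
```

On these figures WRI has the largest spread relative to its mean (about 0.74) and Lack of Coping Capacities the smallest.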

# 5. Exporting data

In [27]:
# Confirm the shape of the dataset
df.shape

(1917, 12)

In [28]:
# Export df as "WRI_clean.csv"
df.to_csv(os.path.join(path, '02_Data', 'Prepared_data', 'WRI_clean.csv'), index = False)
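
CSV files do not store data types, so the string conversion of `Year` is lost on export and pandas will infer an integer column when the file is read again. The sketch below shows this on a toy DataFrame written to an in-memory buffer, together with the `dtype` argument of `pd.read_csv` as one way to restore the string type on import.

```python
import io
import pandas as pd

# Toy DataFrame standing in for the cleaned data, with 'Year' stored as string
toy = pd.DataFrame({'Country': ['Vanuatu', 'Tonga'], 'Year': ['2011', '2011']})

# Write to an in-memory buffer instead of a file, for illustration only
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers 'Year' as an integer column again
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: 'Year' comes back as string
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={'Year': str})

print(default_read['Year'].dtype)
print(type(typed_read['Year'].iloc[0]))
```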