# Wrangling Countries & UN Regions
## By: Scott Kustes

### Objective:
Wrangle the UN regions, subregions, and their associated countries for insertion into a website database.

#### Dataset:
The original dataset was downloaded from: https://unstats.un.org/unsd/methodology/m49/

#### Contents
- <a href='#gather'>Data Gathering</a>
- <a href='#assess1'>Assess, Part 1</a>
- <a href='#clean1'>Clean, Part 1</a>
- <a href='#assess2'>Assess, Part 2</a>
- <a href='#clean2'>Clean, Part 2</a>
- <a href='#extract-un'>UN Regional Hierarchy Extraction</a>
- <a href='#assess3'>Assess, Part 3</a>
- <a href='#clean3'>Clean, Part 3</a>
- <a href='#extract-groups'>UN Groupings Extraction</a>
- <a href='#assess4'>Assess, Part 4</a>
- <a href='#clean4'>Clean, Part 4</a>
- <a href='#final'>Finished Dataframes</a>

In [1]:
# Import necessary packages
import pandas as pd

<a id='gather'></a>
## Gather

In [2]:
countries = pd.read_csv( 'countries.csv' )
countries.head()

Unnamed: 0,Global Code,Global Name,Region Code,Region Name,Sub-region Code,Sub-region Name,Intermediate Region Code,Intermediate Region Name,Common Name,Official Name,Capital,M49 Code,ISO-alpha3 Code,Least Developed Countries (LDC),Land Locked Developing Countries (LLDC),Small Island Developing States (SIDS),Developed / Developing Countries
0,1,World,2.0,Africa,202.0,Sub-Saharan Africa,11.0,Western Africa,"""Saint Helena, Ascension, and Tristan da Cunha""","""Saint Helena, Ascension, and Tristan da Cunha""",Jamestown,654,SHN,,,,Developing
1,1,World,142.0,Asia,34.0,Southern Asia,,,Afghanistan,Islamic Republic of Afghanistan,Kabul,4,AFG,x,x,,Developing
2,1,World,150.0,Europe,154.0,Northern Europe,,,Åland Islands,Åland Islands,Mariehamn,248,ALA,,,,Developed
3,1,World,150.0,Europe,39.0,Southern Europe,,,Albania,Republic of Albania,Tirana,8,ALB,,,,Developed
4,1,World,2.0,Africa,15.0,Northern Africa,,,Algeria,People's Democratic Republic of Algeria,Algiers,12,DZA,,,,Developing


<a id='assess1'></a>
## Assess, Part 1

In [3]:
countries['Global Code'].unique()

array([1], dtype=int64)

In [4]:
countries['Global Name'].unique()

array(['World'], dtype=object)

### Issues Found:
1) Drop columns: `Global Code` and `Global Name` - only 1 unique value

2) Rename columns: replace spaces with underscores, replace uppercase with lowercase

<a id='clean1'></a>
## Clean, Part 1
### 1) Drop columns `Global Code` and `Global Name`

Drop `Global Code` and `Global Name` columns due to each having only 1 unique value (1 and World, respectively).

#### Code

In [5]:
countries.drop( columns=['Global Code','Global Name'], axis=1, inplace=True )
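
The same check can be automated instead of inspecting columns one at a time. This is a minimal sketch on a hypothetical stand-in frame (`toy` and its values are illustrative, not the real dataset); it flags every column that holds a single distinct value.

```python
import pandas as pd

# Hypothetical stand-in for the raw `countries` dataframe
toy = pd.DataFrame( {
    'Global Code': [1, 1, 1],
    'Global Name': ['World', 'World', 'World'],
    'Common Name': ['Afghanistan', 'Albania', 'Algeria'],
} )

# A column with exactly one distinct value (NaN counted as a value) carries no information
constant_columns = [ col for col in toy.columns if toy[col].nunique( dropna=False ) == 1 ]

# Drop the constant columns and keep the rest
toy = toy.drop( columns=constant_columns )
print( constant_columns )
```

Passing `dropna=False` is a deliberate choice here: a column that is mostly NaN with one real value would otherwise look constant.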

#### Test

In [6]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 15 columns):
Region Code                                248 non-null float64
Region Name                                248 non-null object
Sub-region Code                            248 non-null float64
Sub-region Name                            248 non-null object
Intermediate Region Code                   108 non-null float64
Intermediate Region Name                   108 non-null object
Common Name                                249 non-null object
Official Name                              249 non-null object
Capital                                    243 non-null object
M49 Code                                   249 non-null int64
ISO-alpha3 Code                            248 non-null object
Least Developed Countries (LDC)            47 non-null object
Land Locked Developing Countries (LLDC)    32 non-null object
Small Island Developing States (SIDS)      53 non-null object
Developed / De

### 2) Rename Columns
Replace spaces with underscores, replace uppercase letters with lowercase

#### Code

In [7]:
# Dictionary of new column names
column_names = {
    'Region Code': 'region_code',
    'Region Name': 'region_name',
    'Sub-region Code': 'subregion_code',
    'Sub-region Name': 'subregion_name',
    'Intermediate Region Code': 'intermediate_region_code',
    'Intermediate Region Name': 'intermediate_region_name',
    'Common Name': 'country_common_name',
    'Official Name': 'country_official_name',
    'Capital': 'capital',
    'Territory of': 'territory_of',
    'M49 Code': 'un_m49',
    'ISO-alpha3 Code': 'iso_alpha3',
    'Least Developed Countries (LDC)': 'least_developed_countries',
    'Land Locked Developing Countries (LLDC)': 'landlocked_developing_countries',
    'Small Island Developing States (SIDS)': 'small_island_developing_states',
    'Developed / Developing Countries': 'developed_developing_countries'
}

countries.rename( mapper=column_names, axis=1, inplace=True )
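
The mechanical part of this renaming (lowercase, underscores) could also be done with a small function. This is a sketch only: `to_snake_case` is a hypothetical helper, and its output differs from the dictionary above in places (for example it yields `sub_region_code`, not `subregion_code`).

```python
import re

def to_snake_case( name ):
    # Lowercase, then collapse each run of non-alphanumeric characters into one underscore
    return re.sub( r'[^0-9a-z]+', '_', name.lower() ).strip( '_' )

sample = ['Region Code', 'Sub-region Code', 'Developed / Developing Countries']
renamed = [ to_snake_case( name ) for name in sample ]
print( renamed )
```

The explicit dictionary is still needed for names that are not a mechanical transformation, such as `country_common_name` and `un_m49`.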

#### Test

In [8]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 15 columns):
region_code                        248 non-null float64
region_name                        248 non-null object
subregion_code                     248 non-null float64
subregion_name                     248 non-null object
intermediate_region_code           108 non-null float64
intermediate_region_name           108 non-null object
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          47 non-null object
landlocked_developing_countries    32 non-null object
small_island_developing_states     53 non-null object
developed_developing_countries     248 non-null object
dtypes: float64(3), int64(1), object(11)
memory usage: 29.3+ KB


<a id='assess2'></a>
## Assess, Part 2

In [9]:
countries['region_code'].unique()

array([  2., 142., 150.,   9.,  19.,  nan])

In [10]:
countries['region_name'].unique()

array(['Africa', 'Asia', 'Europe', 'Oceania', 'Americas', nan],
      dtype=object)

In [11]:
countries[ countries['region_code'].isnull() ]

Unnamed: 0,region_code,region_name,subregion_code,subregion_name,intermediate_region_code,intermediate_region_name,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,developed_developing_countries
9,,,,,,,Antarctica,Antarctica,,10,ATA,,,,


In [12]:
countries['developed_developing_countries'].unique()

array(['Developing', 'Developed', nan], dtype=object)

In [13]:
countries['intermediate_region_name'].unique()

array(['Western Africa', nan, 'Middle Africa', 'Caribbean',
       'South America', 'Central America', 'Southern Africa',
       'Eastern Africa', 'Channel Islands'], dtype=object)

In [14]:
countries['region_name'].unique()

array(['Africa', 'Asia', 'Europe', 'Oceania', 'Americas', nan],
      dtype=object)

In [15]:
countries['subregion_name'].unique()

array(['Sub-Saharan Africa', 'Southern Asia', 'Northern Europe',
       'Southern Europe', 'Northern Africa', 'Polynesia',
       'Latin America and the Caribbean', nan, 'Western Asia',
       'Australia and New Zealand', 'Western Europe', 'Eastern Europe',
       'Northern America', 'South-eastern Asia', 'Eastern Asia',
       'Melanesia', 'Micronesia', 'Central Asia'], dtype=object)

### Issues Found:
1) Replace NaN with 0 in `region_code`, `subregion_code`, and `intermediate_region_code`

2) Replace NaN with 0 and 'x' with 1 in `least_developed_countries`, `landlocked_developing_countries`, and `small_island_developing_states`

3) Set datatypes:
- <strong>int64:</strong> `region_code`, `subregion_code`, `intermediate_region_code`
- <strong>bool:</strong> `least_developed_countries`, `landlocked_developing_countries`, `small_island_developing_states`

<a id='clean2'></a>
## Clean, Part 2
### 1) Fix Values in Columns with Region Codes
Replace NaN with 0 in `region_code`, `subregion_code`, and `intermediate_region_code`

#### Code

In [16]:
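# Assign the filled column back rather than calling fillna( inplace=True ) on a column selection
countries['region_code'] = countries['region_code'].fillna( 0 )
countries['subregion_code'] = countries['subregion_code'].fillna( 0 )
countries['intermediate_region_code'] = countries['intermediate_region_code'].fillna( 0 )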

#### Test

In [17]:
countries['region_code'].unique()

array([  2., 142., 150.,   9.,  19.,   0.])

In [18]:
countries['subregion_code'].unique()

array([202.,  34., 154.,  39.,  15.,  61., 419.,   0., 145.,  53., 155.,
       151.,  21.,  35.,  30.,  54.,  57., 143.])

In [19]:
countries['intermediate_region_code'].unique()

array([ 11.,   0.,  17.,  29.,   5.,  13.,  18.,  14., 830.])
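
The fill here and the integer cast in step 3 can also be done in one pass over all three columns. A minimal sketch, assuming a hypothetical stand-in frame `toy` whose last row mimics Antarctica (no codes at all):

```python
import pandas as pd

code_columns = ['region_code', 'subregion_code', 'intermediate_region_code']

# Hypothetical stand-in rows, not the real dataset
toy = pd.DataFrame( {
    'region_code': [2.0, 142.0, float( 'nan' )],
    'subregion_code': [202.0, 34.0, float( 'nan' )],
    'intermediate_region_code': [11.0, float( 'nan' ), float( 'nan' )],
} )

# Fill missing codes with 0, then cast to integers, for all three columns at once
toy[code_columns] = toy[code_columns].fillna( 0 ).astype( 'int64' )
print( toy.dtypes )
```

The cast has to come after the fill, because NaN cannot be stored in an `int64` column.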

### 2) Fix Values in 'Other Groupings'
Replace NaN with 0 and 'x' with 1 in `least_developed_countries`, `landlocked_developing_countries`, and `small_island_developing_states`

#### Code

In [20]:
# Assign the results back rather than modifying a column selection in place
countries['least_developed_countries'] = countries['least_developed_countries'].fillna( 0 ).replace( 'x', 1 )

countries['landlocked_developing_countries'] = countries['landlocked_developing_countries'].fillna( 0 ).replace( 'x', 1 )

countries['small_island_developing_states'] = countries['small_island_developing_states'].fillna( 0 ).replace( 'x', 1 )

#### Test

In [21]:
countries['least_developed_countries'].unique()

array([0, 1], dtype=int64)

In [22]:
countries['landlocked_developing_countries'].unique()

array([0, 1], dtype=int64)

In [23]:
countries['small_island_developing_states'].unique()

array([0, 1], dtype=int64)
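
Since these columns only ever hold 'x' or NaN, a direct comparison is another way to reach the boolean flags, skipping the intermediate 0/1 values. This is a sketch on hypothetical stand-in data, not the notebook's own method:

```python
import pandas as pd

flag_columns = ['least_developed_countries',
                'landlocked_developing_countries',
                'small_island_developing_states']

# Hypothetical stand-in rows: 'x' marks membership, NaN marks absence
toy = pd.DataFrame( {
    'least_developed_countries': ['x', float( 'nan' ), float( 'nan' )],
    'landlocked_developing_countries': ['x', float( 'nan' ), float( 'nan' )],
    'small_island_developing_states': [float( 'nan' ), float( 'nan' ), 'x'],
} )

# Comparing against 'x' gives True for members and False for NaN
toy[flag_columns] = toy[flag_columns].eq( 'x' )
print( toy )
```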

### 3) Set Datatypes
- <strong>int64:</strong> `region_code`, `subregion_code`, `intermediate_region_code`
- <strong>bool:</strong> `least_developed_countries`, `landlocked_developing_countries`, and `small_island_developing_states`

#### Code

In [24]:
# Set int64 columns
countries['region_code'] = countries['region_code'].astype( 'int64' )
countries['subregion_code'] = countries['subregion_code'].astype( 'int64' )
countries['intermediate_region_code'] = countries['intermediate_region_code'].astype( 'int64' )

# Set bool columns
countries['least_developed_countries'] = countries['least_developed_countries'].astype( 'bool' )
countries['landlocked_developing_countries'] = countries['landlocked_developing_countries'].astype( 'bool' )
countries['small_island_developing_states'] = countries['small_island_developing_states'].astype( 'bool' )

#### Test

In [25]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 15 columns):
region_code                        249 non-null int64
region_name                        248 non-null object
subregion_code                     249 non-null int64
subregion_name                     248 non-null object
intermediate_region_code           249 non-null int64
intermediate_region_name           108 non-null object
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
developed_developing_countries     248 non-null object
dtypes: bool(3), int64(4), object(8)
memory usage: 24.2+ KB


<a id='extract-un'></a>
## UN Regional Hierarchy Extraction
Extract UN Regional hierarchy and rationalize for insertion into website database.

### Create UN Region Hierarchy
In the UN Geoscheme, a region is made up of one or more subregions, and a subregion is made up of zero or more intermediate regions. Countries and areas are assigned at either the subregion or the intermediate region level. Antarctica is the sole exception: it has no regional assignment. The hierarchy levels are: `Region -> Subregion -> Intermediate Region`

Because Antarctica has no region, it is treated as its own top-level region with `region_code` 0.

The `countries` dataframe contains redundant information across the region, subregion, and intermediate region columns. Extract these values to create a hierarchy of UN regions in a new dataframe called `regions`. This new dataframe will contain 3 columns: 
- `region_code`: the value from the `region_code`, `subregion_code`, or `intermediate_region_code` column
- `region_name`: the value from the `region_name`, `subregion_name`, or `intermediate_region_name` column
- `parent_region_code`: the `region_code` of the region or subregion one level above in the hierarchy (0 for top-level regions)

In the `countries` dataframe, create a new column called `un_region` to hold the subregion or intermediate region to which each country is assigned.

#### Code

In [26]:
# Create an empty dataframe for holding the regions
regions = pd.DataFrame( columns=['region_code','region_name','parent_region_code'] )

# Iterate through the countries to extract the UN region information
for row in countries.itertuples():
    # Get the region information from this row
    region_code = row.region_code
    region_name = row.region_name if region_code != 0 else 'Antarctica' # Antarctica is the sole exception so make it a top-level region
    subregion_code = row.subregion_code
    subregion_name = row.subregion_name
    intermediate_region_code = row.intermediate_region_code
    intermediate_region_name = row.intermediate_region_name
    
    # If region doesn't exist in regions dataframe already, add it
    if region_code not in regions['region_code'].unique():
        regions = regions.append( { 'region_code': region_code, 
                                    'region_name': region_name, 
                                    'parent_region_code': 0 }, 
                                    ignore_index=True )
    
    # If subregion doesn't exist in regions dataframe already, add it
    if ( subregion_code != 0 ) and ( subregion_code not in regions['region_code'].unique() ):
        regions = regions.append( { 'region_code': subregion_code, 
                                    'region_name': subregion_name, 
                                    'parent_region_code': region_code }, 
                                    ignore_index=True )

    # If intermediate region doesn't exist in regions dataframe already, add it
    if ( intermediate_region_code != 0 ) and ( intermediate_region_code not in regions['region_code'].unique() ):
        regions = regions.append( { 'region_code': intermediate_region_code, 
                                    'region_name': intermediate_region_name, 
                                    'parent_region_code': subregion_code }, 
                                    ignore_index=True )
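
`DataFrame.append` was removed in pandas 2.0, so the loop above only runs on older versions. One alternative is to collect plain records first and build the dataframe once at the end. The sketch below shows that idea in pure Python; `build_region_records` and `sample_rows` are hypothetical names, and the rows are a small hand-picked subset of the data.

```python
def build_region_records( rows ):
    """Return one record per distinct region, subregion, and intermediate region.

    Each row is a tuple: (region_code, region_name, subregion_code,
    subregion_name, intermediate_region_code, intermediate_region_name).
    """
    records = {}  # keyed by code; dicts preserve insertion order

    def add( code, name, parent_code ):
        # Only the first occurrence of a code is recorded
        if code not in records:
            records[code] = { 'region_code': code,
                              'region_name': name,
                              'parent_region_code': parent_code }

    for r_code, r_name, s_code, s_name, i_code, i_name in rows:
        # Code 0 means "no region"; as in the loop above, label it Antarctica
        add( r_code, r_name if r_code != 0 else 'Antarctica', 0 )
        if s_code != 0:
            add( s_code, s_name, r_code )
        if i_code != 0:
            add( i_code, i_name, s_code )
    return list( records.values() )

sample_rows = [
    ( 2, 'Africa', 202, 'Sub-Saharan Africa', 11, 'Western Africa' ),
    ( 142, 'Asia', 34, 'Southern Asia', 0, None ),
    ( 2, 'Africa', 15, 'Northern Africa', 0, None ),
    ( 0, None, 0, None, 0, None ),
]
records = build_region_records( sample_rows )
print( len( records ) )
```

The resulting list can be passed to `pd.DataFrame( records )`, and the rows come out in the same first-seen order as the loop produces.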

In [27]:
# Function to populate un_region field in countries dataframe
# If intermediate_region_code is not 0, return the intermediate_region_code
# Else return subregion_code 
# Note that one entry contains 0 in subregion_code: Antarctica
# Antarctica was previously assigned a region_code of 0 so returning subregion_code of 0 will not cause problems
def get_un_region( row ):
    # If intermediate_region_code is not 0, return it
    if row['intermediate_region_code'] != 0:
        return row['intermediate_region_code']
    else:
        return row['subregion_code']

countries['un_region'] = countries.apply( get_un_region, axis=1 )
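
The same rule can be expressed without a row-wise `apply`, using `Series.where`, which keeps a value where the condition holds and takes the fallback elsewhere. A minimal sketch on hypothetical stand-in rows:

```python
import pandas as pd

# Hypothetical stand-in rows; the last one mimics Antarctica
toy = pd.DataFrame( {
    'subregion_code': [419, 15, 202, 0],
    'intermediate_region_code': [5, 0, 11, 0],
} )

# Keep the intermediate region code where it is non-zero, otherwise fall back to the subregion code
toy['un_region'] = toy['intermediate_region_code'].where(
    toy['intermediate_region_code'] != 0,
    toy['subregion_code'] )
print( toy )
```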

#### Test

In [28]:
regions.sort_values( 'parent_region_code' )

Unnamed: 0,region_code,region_name,parent_region_code
0,2,Africa,0
12,19,Americas,0
9,9,Oceania,0
5,150,Europe,0
15,0,Antarctica,0
3,142,Asia,0
8,15,Northern Africa,2
1,202,Sub-Saharan Africa,2
28,57,Micronesia,9
27,54,Melanesia,9


In [29]:
countries[['subregion_code','intermediate_region_code','un_region']].sample(10)

Unnamed: 0,subregion_code,intermediate_region_code,un_region
11,419,5,5
125,15,0,15
4,15,0,15
119,143,0,143
147,419,29,29
236,419,29,29
82,202,11,11
115,143,0,143
227,145,0,145
128,155,0,155


<a id='assess3'></a>
## Assess, Part 3

In [30]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 16 columns):
region_code                        249 non-null int64
region_name                        248 non-null object
subregion_code                     249 non-null int64
subregion_name                     248 non-null object
intermediate_region_code           249 non-null int64
intermediate_region_name           108 non-null object
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
developed_developing_countries     248 non-null object
un_region                          249 non-null int64
dtypes: bool(3), int64(5),

In [31]:
countries[ countries['iso_alpha3'].isnull() ]

Unnamed: 0,region_code,region_name,subregion_code,subregion_name,intermediate_region_code,intermediate_region_name,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,developed_developing_countries,un_region
192,150,Europe,154,Northern Europe,830,Channel Islands,Sark,Sark,,680,,False,False,False,Developed,830


In [32]:
countries[ countries['developed_developing_countries'].isnull() ]

Unnamed: 0,region_code,region_name,subregion_code,subregion_name,intermediate_region_code,intermediate_region_name,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,developed_developing_countries,un_region
9,0,,0,,0,,Antarctica,Antarctica,,10,ATA,False,False,False,,0


### Issues Found:
#### Tidiness
1) Drop 6 UN region columns: `region_code`,`region_name`,`subregion_code`,`subregion_name`,`intermediate_region_code`,`intermediate_region_name`

2) Create two columns, `developed` and `developing`, from `developed_developing_countries` column. 

3) Set datatype of `developed` and `developing` to bool.

4) Drop `developed_developing_countries`.

#### Quality
5) Replace NaN in `iso_alpha3` with 'NA'. The island of Sark has no ISO Alpha3 code.

<a id='clean3'></a>
## Clean, Part 3

### 1) Drop 6 UN region columns
Drop `region_code`,`region_name`,`subregion_code`,`subregion_name`,`intermediate_region_code`, and `intermediate_region_name`. These columns are no longer needed. This information is now contained in the `regions` dataframe to which each country is associated through the `un_region` column.

#### Code

In [33]:
countries.drop( columns=['region_code','region_name','subregion_code','subregion_name','intermediate_region_code','intermediate_region_name'], axis=1, inplace=True )

#### Test

In [34]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 10 columns):
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
developed_developing_countries     248 non-null object
un_region                          249 non-null int64
dtypes: bool(3), int64(2), object(5)
memory usage: 14.4+ KB


### 2) Create `developed` and `developing` Columns

`developed` will contain 1 if `developed_developing_countries` contains 'Developed', 0 otherwise<br/>
`developing` will contain 1 if `developed_developing_countries` contains 'Developing', 0 otherwise

#### Code

In [35]:
countries['developed'] = countries['developed_developing_countries'].apply( lambda x: 1 if x == 'Developed' else 0 )
countries['developing'] = countries['developed_developing_countries'].apply( lambda x: 1 if x == 'Developing' else 0 )
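
A comparison with `Series.eq` produces boolean columns directly, which would fold this step and the datatype step that follows into one. A sketch on hypothetical stand-in values, where the NaN row mimics Antarctica:

```python
import pandas as pd

# Hypothetical stand-in column
toy = pd.DataFrame( {
    'developed_developing_countries': ['Developed', 'Developing', float( 'nan' )],
} )

# NaN matches neither label, so such a row is False in both columns
toy['developed'] = toy['developed_developing_countries'].eq( 'Developed' )
toy['developing'] = toy['developed_developing_countries'].eq( 'Developing' )
print( toy )
```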

#### Test

In [36]:
countries[['developed','developing','developed_developing_countries']].sample(5)

Unnamed: 0,developed,developing,developed_developing_countries
220,1,0,Developed
18,0,1,Developing
159,0,1,Developing
243,0,1,Developing
144,1,0,Developed


### 3) Set datatype for `developed` and `developing` Columns to bool

#### Code

In [37]:
countries['developed'] = countries['developed'].astype('bool')
countries['developing'] = countries['developing'].astype('bool')

#### Test

In [38]:
countries['developed'].value_counts()

False    183
True      66
Name: developed, dtype: int64

In [39]:
countries['developing'].value_counts()

True     182
False     67
Name: developing, dtype: int64

In [40]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 12 columns):
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
developed_developing_countries     248 non-null object
un_region                          249 non-null int64
developed                          249 non-null bool
developing                         249 non-null bool
dtypes: bool(5), int64(2), object(5)
memory usage: 14.9+ KB


### 4) Drop `developed_developing_countries` Column

#### Code

In [41]:
countries.drop( columns=['developed_developing_countries'], inplace=True )

#### Test

In [42]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 11 columns):
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         248 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
un_region                          249 non-null int64
developed                          249 non-null bool
developing                         249 non-null bool
dtypes: bool(5), int64(2), object(4)
memory usage: 13.0+ KB


### 5) Replace NaN in `iso_alpha3` with 'NA'

The island of Sark has no ISO Alpha3 code.

#### Code

In [43]:
countries['iso_alpha3'] = countries['iso_alpha3'].fillna( 'NA' )

#### Test

In [44]:
countries[ countries['iso_alpha3'].isnull() ]

Unnamed: 0,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,un_region,developed,developing


<a id='extract-groups'></a>
## UN Groupings Extraction
Extract and rationalize information in the five "Other Groupings" columns: `least_developed_countries`, `landlocked_developing_countries`, `small_island_developing_states`, `developed`, and `developing`. This information will be stored in 2 database tables.

The first table will contain five grouping titles: 'Least Developed Countries', 'Landlocked Developing Countries', 'Small Island Developing States', 'Developed Countries', and 'Developing Countries'.

The second table will map each country ID (the row index in `countries`) to a grouping ID (the row index in `un_groupings`).

#### Code

In [45]:
# Create a dataframe for the Other Groupings
group_names = ['Least Developed Countries', 'Landlocked Developing Countries', 'Small Island Developing States', 'Developed Countries', 'Developing Countries']
un_groupings = pd.DataFrame( data=group_names, columns=['group_name'] )

In [46]:
# Get the indexes for the countries in each grouping
least_developed = countries.query( 'least_developed_countries == True' ).index.to_list()
landlocked = countries.query( 'landlocked_developing_countries == True' ).index.to_list()
small_island = countries.query( 'small_island_developing_states == True' ).index.to_list()
developed = countries.query( 'developed == True' ).index.to_list()
developing = countries.query( 'developing == True' ).index.to_list()

In [47]:
# Build the country to grouping mappings
# The lists are in the same order as the rows of un_groupings, so the position is the group_id
groups = [least_developed, landlocked, small_island, developed, developing]
rows = [ { 'country_id': country_id, 'group_id': group_id }
         for group_id, country_ids in enumerate( groups )
         for country_id in country_ids ]

country_to_group = pd.DataFrame( rows, columns=['country_id', 'group_id'] )

#### Test

In [48]:
un_groupings

Unnamed: 0,group_name
0,Least Developed Countries
1,Landlocked Developing Countries
2,Small Island Developing States
3,Developed Countries
4,Developing Countries


In [49]:
country_to_group.sample(10)

Unnamed: 0,country_id,group_id
180,178,3
212,22,4
293,137,4
211,19,4
342,202,4
286,125,4
179,175,3
351,215,4
21,123,0
33,194,0


In [50]:
# Found this one by luck
country_to_group.query( 'country_id == 230' )

Unnamed: 0,country_id,group_id
42,230,0
129,230,2
365,230,4


In [51]:
# Is True in the correct columns?
countries.query( 'index == 230' )

Unnamed: 0,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,un_region,developed,developing
230,Tuvalu,Ellice Islands,Funafuti,798,TUV,True,False,True,61,False,True
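
Rather than spot-checking one country, the mapping could be compared against the flag columns for every row. The sketch below is one way to do that; `mismatched_countries` and the two toy frames are hypothetical, with values for Tuvalu (index 230) and Algeria (index 4) copied from this notebook.

```python
import pandas as pd

flag_columns = ['least_developed_countries', 'landlocked_developing_countries',
                'small_island_developing_states', 'developed', 'developing']

# Hypothetical stand-ins for countries[flag_columns] and country_to_group
toy_flags = pd.DataFrame( [[True, False, True, False, True],
                           [False, False, False, False, True]],
                          index=[230, 4], columns=flag_columns )
toy_mapping = pd.DataFrame( { 'country_id': [230, 230, 230, 4],
                              'group_id': [0, 2, 4, 4] } )

def mismatched_countries( flags, mapping ):
    # Collect the group ids recorded for each country
    recorded = {}
    for row in mapping.itertuples( index=False ):
        recorded.setdefault( row.country_id, set() ).add( row.group_id )

    # The position of each flag column is assumed to equal its group_id
    mismatches = []
    for country_id in flags.index:
        expected = { group_id for group_id, col in enumerate( flags.columns )
                     if flags.loc[country_id, col] }
        if recorded.get( country_id, set() ) != expected:
            mismatches.append( country_id )
    return mismatches

print( mismatched_countries( toy_flags, toy_mapping ) )
```

An empty list means every country's flags agree with its mapping rows. The check depends on the flag columns being passed in the same order as the rows of `un_groupings`.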


<a id='assess4'></a>
## Assess, Part 4

In [52]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 11 columns):
country_common_name                249 non-null object
country_official_name              249 non-null object
capital                            243 non-null object
un_m49                             249 non-null int64
iso_alpha3                         249 non-null object
least_developed_countries          249 non-null bool
landlocked_developing_countries    249 non-null bool
small_island_developing_states     249 non-null bool
un_region                          249 non-null int64
developed                          249 non-null bool
developing                         249 non-null bool
dtypes: bool(5), int64(2), object(4)
memory usage: 13.0+ KB


### Issues Found:
#### Quality
1) Remove quotes from Common and Official Names. Several entries contain literal " characters in the common or official name (ex: "Bonaire, Sint Eustatius, and Saba"), a side effect of the commas in these fields interfering with CSV parsing on input.

#### Tidiness
2) Drop 5 UN grouping columns: `least_developed_countries`,`landlocked_developing_countries`,`small_island_developing_states`,`developed`,`developing`

<a id='clean4'></a>
## Clean, Part 4

### 1) Remove quotes from Common and Official Names

Several entries contain literal " characters in the common or official name (ex: "Bonaire, Sint Eustatius, and Saba"), a side effect of the commas in these fields interfering with CSV parsing on input. Remove the quotes.

#### Code

In [53]:
countries[ countries['country_official_name'].str.contains('"') ]

Unnamed: 0,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,un_region,developed,developing
0,"""Saint Helena, Ascension, and Tristan da Cunha""","""Saint Helena, Ascension, and Tristan da Cunha""",Jamestown,654,SHN,False,False,False,11,False,True
25,BES Islands,"""Bonaire, Sint Eustatius, and Saba""",Kralendijk,535,BES,False,False,True,29,False,True


In [54]:
countries.loc[0,'country_official_name'] = "Saint Helena, Ascension, and Tristan da Cunha"
countries.loc[0,'country_common_name'] = "Saint Helena, Ascension, and Tristan da Cunha"
countries.loc[25,'country_official_name'] = "Bonaire, Sint Eustatius, and Saba"
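
The assignments above fix the two known rows by index. A more general option, sketched here on hypothetical stand-in data, is to strip surrounding quote characters from both name columns so that no row has to be listed by hand:

```python
import pandas as pd

name_columns = ['country_common_name', 'country_official_name']

# Hypothetical stand-in rows; the first two reproduce the quoting seen above
toy = pd.DataFrame( {
    'country_common_name': ['"Saint Helena, Ascension, and Tristan da Cunha"',
                            'BES Islands',
                            'Albania'],
    'country_official_name': ['"Saint Helena, Ascension, and Tristan da Cunha"',
                              '"Bonaire, Sint Eustatius, and Saba"',
                              'Republic of Albania'],
} )

# Remove leading and trailing double quotes; other characters are left untouched
for col in name_columns:
    toy[col] = toy[col].str.strip( '"' )
print( toy )
```

`str.strip` only touches the ends of each value, so a quote character inside a name would survive.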

#### Test

In [55]:
countries.query( '( country_common_name == "BES Islands" ) | ( country_common_name == "Saint Helena, Ascension, and Tristan da Cunha")' )

Unnamed: 0,country_common_name,country_official_name,capital,un_m49,iso_alpha3,least_developed_countries,landlocked_developing_countries,small_island_developing_states,un_region,developed,developing
0,"Saint Helena, Ascension, and Tristan da Cunha","Saint Helena, Ascension, and Tristan da Cunha",Jamestown,654,SHN,False,False,False,11,False,True
25,BES Islands,"Bonaire, Sint Eustatius, and Saba",Kralendijk,535,BES,False,False,True,29,False,True


### 2) Drop 5 UN grouping columns
Drop `least_developed_countries`, `landlocked_developing_countries`, `small_island_developing_states`, `developed`, and `developing`. These columns are no longer needed. This information is now contained in the `un_groupings` and `country_to_group` dataframes.

#### Code

In [56]:
countries.drop( columns=['least_developed_countries','landlocked_developing_countries','small_island_developing_states','developed','developing'], axis=1, inplace=True )

#### Test

In [57]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 6 columns):
country_common_name      249 non-null object
country_official_name    249 non-null object
capital                  243 non-null object
un_m49                   249 non-null int64
iso_alpha3               249 non-null object
un_region                249 non-null int64
dtypes: int64(2), object(4)
memory usage: 11.8+ KB


<a id='final'></a>
## Finished Dataframes

These dataframes are ready for database insertion.
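
As an illustration of the insertion step, the sketch below writes a few rows to an in-memory SQLite database with `DataFrame.to_sql` and reads them back. SQLite, the table name `countries`, and the `country_id` index label are all assumptions for the example; the actual website database is not specified here.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the finished `countries` dataframe (a few rows only)
toy_countries = pd.DataFrame(
    { 'country_common_name': ['Djibouti', 'Faroe Islands', 'Malawi', 'Somalia'],
      'iso_alpha3': ['DJI', 'FRO', 'MWI', 'SOM'],
      'un_region': [14, 154, 14, 14] },
    index=[61, 74, 132, 203] )

connection = sqlite3.connect( ':memory:' )

# Store the dataframe index as the country_id column
toy_countries.to_sql( 'countries', connection, index_label='country_id' )

# Read back the countries assigned to UN region 14
region_14 = pd.read_sql_query(
    'SELECT country_common_name FROM countries WHERE un_region = 14 ORDER BY country_id',
    connection )
names = region_14['country_common_name'].tolist()
connection.close()
print( names )
```

Keeping the index as `country_id` matters because `country_to_group` refers to countries by that index.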

In [58]:
countries.sample(5)

Unnamed: 0,country_common_name,country_official_name,capital,un_m49,iso_alpha3,un_region
203,Somalia,Federal Republic of Somalia,Mogadishu,706,SOM,14
132,Malawi,Republic of Malawi,Lilongwe,454,MWI,14
61,Djibouti,Republic of Djibouti,Djibouti,262,DJI,14
74,Faroe Islands,Faroe Islands,Tórshavn,234,FRO,154
225,Trinidad and Tobago,Republic of Trinidad and Tobago,Port of Spain,780,TTO,29


In [59]:
regions.sample(10)

Unnamed: 0,region_code,region_name,parent_region_code
10,61,Polynesia,9
18,53,Australia and New Zealand,9
21,13,Central America,419
11,17,Middle Africa,202
20,151,Eastern Europe,150
1,202,Sub-Saharan Africa,2
14,29,Caribbean,419
8,15,Northern Africa,2
3,142,Asia,0
23,18,Southern Africa,202


In [60]:
un_groupings

Unnamed: 0,group_name
0,Least Developed Countries
1,Landlocked Developing Countries
2,Small Island Developing States
3,Developed Countries
4,Developing Countries


In [61]:
country_to_group.sample(10)

Unnamed: 0,country_id,group_id
114,176,2
258,82,4
113,169,2
344,204,4
52,29,1
274,106,4
272,104,4
152,77,3
164,110,3
191,213,3
