# Wrangling Former Colonies
## By: Scott Kustes

### Objective:
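Compile a list of former colonies of modern countries. The final output is four dataframes built from the ICOW Colonial History data: colonized countries, countries that gained independence, countries that seceded, and countries that merged into another country.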

#### Discussion:
History is complex. For many countries, the date of independence is not clear-cut. This dataset contains Dr. Hensel's determination of the independence date, defined as the point at which a country makes its own foreign and domestic policy decisions. It also contains the independence dates as determined by The Correlates of War Project and the Gleditsch/Ward list of independent states for comparison. In some instances, the three sources differ by decades or even centuries.

#### Datasets:
The primary dataset was downloaded from <a href='http://www.paulhensel.org/icow.html' target='_new'>The Issue Correlates of War Project</a>. Additional datasets were downloaded from <a href='http://www.correlatesofwar.org' target='_new'>The Correlates of War Project</a>.  For specific dataset citations, see the References section at the bottom of the notebook.

#### Contents
<ul>
    <li><a href='#gather'>Data Gathering</a></li>
    <li><a href='#assess1'>Assess, Part 1</a></li>
    <li><a href='#clean1'>Clean, Part 1</a></li>
    <li><a href='#assess2'>Assess, Part 2</a></li>
    <li><a href='#clean2'>Clean, Part 2</a></li>
    <li><a href='#assess3'>Assess, Part 3</a></li>
    <li><a href='#clean3'>Clean, Part 3</a></li>
    <li><a href='#final'>Finished Dataframes</a></li>
    <li><a href='#references'>References</a></li>
</ul>

In [1]:
# Import packages
import requests
import pandas as pd

In [2]:
# Import classes and functions needed for this analysis from config module
# These are only available on my computer
from config import dbaccess, validator, society, error_dict_to_string

# Create an instance of the DBAccess class for running queries
db = dbaccess.DBAccess()

# Create an instance of the Validator class for validating data prior to inserting/updating database
val = validator.Validator()

<a id='gather'></a>
## Data Gathering
### Colonization Data

In [3]:
icow = pd.read_csv( 'colonial_data.csv' )
icow.sample(5)

Unnamed: 0,State,Name,ColRuler,IndFrom,IndDate,IndViol,IndType,SecFrom,SecDate,SecViol,Into,IntoDate,COWsys,GWsys,Notes
59,300,Austria-Hungary,-9,-9,128212,1,1,-9,-9,-9,-9,-9,181601,181601,Polity2 coding is for #305 (Austria) but begin...
50,260,German Federal Rep. (West Germany),-9,255,194909,1,4,255,194909,1,255,199010,195505,194909,Temporary 1945-90 split of Germany following WWII
105,432,Mali,220,220,196009,0,2,-9,-9,-9,-9,-9,196006,196009,-9
25,110,Guyana,200,200,196605,0,2,-9,-9,-9,-9,-9,196605,196605,Founded by Dutch but ceded to British in 1814
70,335,Parma,-9,-9,154509,0,1,-9,-9,-9,325,186003,185101,181601,Merged into unified Italy
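
The gap between sources can be measured directly from the YYYYMM integers. The following is a minimal sketch rather than part of the original analysis: the values are copied from the Mali and Parma rows in the sample above, and `to_months` is a hypothetical helper name. It does not handle the -9 missing-value code, which would need to be filtered out first.

```python
import pandas as pd

# Values copied from the Mali and Parma rows in the sample output above
sample = pd.DataFrame({
    'Name': ['Mali', 'Parma'],
    'IndDate': [196009, 154509],   # ICOW independence date (YYYYMM)
    'COWsys': [196006, 185101],    # COW system membership date (YYYYMM)
})

def to_months(yyyymm):
    """Hypothetical helper: convert a YYYYMM integer Series to a month count."""
    return (yyyymm // 100) * 12 + (yyyymm % 100)

# Positive gap: the COW date is later than the ICOW date
sample['gap_months'] = to_months(sample['COWsys']) - to_months(sample['IndDate'])
print(sample)
```

For Mali the two sources are three months apart; for Parma they are more than three centuries apart.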


### Country Codes
Read in and de-duplicate the country codes used by The Correlates of War Project. Rename the columns so that `cow_code` can serve as the lookup key in later steps.

In [4]:
country_codes = pd.read_csv( 'cow_country_codes.csv' )
country_codes.drop_duplicates( inplace=True )
country_codes.rename( columns={'StateAbb': 'abbreviation', 'CCode': 'cow_code', 'StateNme': 'country_name'}, inplace=True )
country_codes.sample(5)

Unnamed: 0,abbreviation,cow_code,country_name
229,AUL,900,Australia
112,AZE,373,Azerbaijan
147,ZAN,511,Zanzibar
228,ETM,860,East Timor
231,NEW,920,New Zealand


#### Match COW Countries with `society` Table
Many countries already exist in the `society` database table, based on United Nations data. Match COW Project countries with those in the database and execute an UPDATE query to set their `cow_code` field to the corresponding value in the `country_codes` dataframe. Add countries in the COW data that are not in the database to `society`.

In [5]:
# Get societies from the database
query = db.run_query('SELECT society_id, common_name FROM society')
societies = pd.DataFrame.from_dict( query['data'] )
societies.sample(5)

Unnamed: 0,society_id,common_name
10,11,Armenia
159,160,North Korea
143,144,Mongolia
37,38,Cambodia
179,180,Rwanda


Find entries in `country_codes` that are not in the `societies` dataframe (i.e., not in the `society` database table).

In [6]:
# Rows of country_codes whose country_name has no match in societies
unmatched = country_codes[ ~country_codes['country_name'].isin(societies['common_name']) ]
print( unmatched.shape )
unmatched

(40, 3)


Unnamed: 0,abbreviation,cow_code,country_name
0,USA,2,United States of America
2,BHM,31,Bahamas
14,SLU,56,St. Lucia
15,SVG,57,St. Vincent and the Grenadines
16,AAB,58,Antigua & Barbuda
17,SKN,60,St. Kitts and Nevis
41,NTH,210,Netherlands
55,HAN,240,Hanover
56,BAV,245,Bavaria
59,GFR,260,German Federal Republic


Fifteen of these 40 unmatched countries are already in `societies`, but `country_name` in `country_codes` is different from `common_name` in `societies`. Update `country_name` in `country_codes` dataframe to match `common_name`. Create two dataframes, one for countries already in the database that need to be updated with COW country codes and one for countries that need to be added to the database.

In [7]:
# Update country_name to match common_name in the database
update_cow = {
    'United States of America': 'USA',
    'Bahamas': 'The Bahamas',
    'St. Lucia': 'Saint Lucia',
    'St. Vincent and the Grenadines': 'Saint Vincent and the Grenadines',
    'Antigua & Barbuda': 'Antigua and Barbuda',
    'St. Kitts and Nevis': 'Saint Kitts and Nevis',
    'Netherlands': 'The Netherlands',
    'Czech Republic': 'Czechia',
    'Cape Verde': 'Cabo Verde',
    'Sao Tome and Principe': 'São Tomé and Príncipe',
    'Ivory Coast': 'Côte d’Ivoire',
    'Democratic Republic of the Congo': 'DRC',
    'Swaziland': 'Eswatini',
    'East Timor': 'Timor-Leste',
    'Federated States of Micronesia': 'Micronesia'
}
country_codes.replace({'country_name': update_cow}, inplace=True)

####### Create two dataframes
# Societies already in the database that need to be updated with COW country code
# Use inner merge to get countries with data in database and country_codes dataframe
existing_societies = country_codes.merge( societies, left_on='country_name', right_on='common_name', how='inner' )

# Societies that need to be added to the database
# Find country_name in country_codes dataframe that aren't in existing_societies
new_societies = country_codes[ ~country_codes['country_name'].isin(list(existing_societies['country_name'].unique())) ]

In [8]:
# There should be 25 new_societies and 192 existing_societies
print( 'New:', new_societies.shape[0] )
print( 'Existing:', existing_societies.shape[0] )

New: 25
Existing: 192
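
An alternative way to produce the same two groups is a single left merge with pandas' `indicator=True` option, which labels each row by where it was found. This is a sketch on toy data, not the notebook's method: the three-row frames below are illustrative stand-ins for `country_codes` and `societies`.

```python
import pandas as pd

# Toy stand-ins for the country_codes and societies dataframes
country_codes = pd.DataFrame({
    'cow_code': [2, 20, 240],
    'country_name': ['USA', 'Canada', 'Hanover'],
})
societies = pd.DataFrame({
    'society_id': [1, 2],
    'common_name': ['USA', 'Canada'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = country_codes.merge(
    societies, left_on='country_name', right_on='common_name',
    how='left', indicator=True,
)

# Matched rows need an UPDATE; unmatched rows need an INSERT
existing = merged[merged['_merge'] == 'both'].drop(columns='_merge')
new = merged.loc[merged['_merge'] == 'left_only', ['cow_code', 'country_name']]

print('New:', new.shape[0])
print('Existing:', existing.shape[0])
```

Because both groups come from one merge, their row counts always add up to the number of rows in the left frame, provided `common_name` is unique.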


#### Update `society` Table
`1)` Update `society` table in database: Set `cow_code` for records in `existing_societies`.

`2)` Create necessary information for records in `new_societies` and add to `society` table.

##### Update Existing Societies with COW Country Code

In [9]:
# Validate values in cow_code
# Column dtype must be int
existing_societies.info()
# Check if any of the values fall outside of the 0-999 range for COW country codes
print( "\nNumber of errors:", existing_societies[ val.integer_out_of_bounds(existing_societies['cow_code'],0,999) ].shape[0] )

<class 'pandas.core.frame.DataFrame'>
Int64Index: 192 entries, 0 to 191
Data columns (total 5 columns):
abbreviation    192 non-null object
cow_code        192 non-null int64
country_name    192 non-null object
society_id      192 non-null int64
common_name     192 non-null object
dtypes: int64(2), object(3)
memory usage: 9.0+ KB

Number of errors: 0
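
The `Validator` class is private to the author's `config` module, so its implementation is not shown here. A bounds check of this kind could be sketched with `Series.between`, assuming that `integer_out_of_bounds` returns a boolean mask that is `True` for values outside the inclusive range. The function below is a hypothetical stand-in, not the actual validator.

```python
import pandas as pd

def integer_out_of_bounds(series, low, high):
    """Hypothetical stand-in: True where a value lies outside [low, high]."""
    # between() is inclusive on both ends by default
    return ~series.between(low, high)

codes = pd.Series([2, 900, 1000, -9])
mask = integer_out_of_bounds(codes, 0, 999)

# Only the values outside 0-999 are flagged
print(codes[mask].tolist())
```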


In [12]:
# Build an UPDATE statement for each row, then execute them
# Both values are cast to int so that only integers can reach the SQL string
def build_update_statement(row):
    return f"UPDATE society SET cow_code = {int(row['cow_code'])} WHERE society_id = {int(row['society_id'])}"

# Create UPDATE statement and execute
existing_societies['update_statement'] = existing_societies.apply(build_update_statement, axis=1)

records_updated = 0
print( 'Attempting update of', existing_societies.shape[0], 'rows' )
# enumerate supplies the 1-based progress counter
for row, update in enumerate( existing_societies['update_statement'], start=1 ):
    update_query = db.run_query( update )
    records_updated += update_query['rows']
    print( row, end=" ", flush=True )

print( '\n', records_updated, 'rows updated' )

Attempting update of 192 rows
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 
 0 rows updated
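
The same update loop can be written with parameterized queries, which pass values separately from the SQL text. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in, because `db.run_query` belongs to the author's private `config` module and the real database engine is not stated. The table layout and the COW codes are illustrative assumptions.

```python
import sqlite3

# In-memory database standing in for the real society table (assumed layout)
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE society (society_id INTEGER PRIMARY KEY, common_name TEXT, cow_code INTEGER)'
)
conn.executemany(
    'INSERT INTO society (society_id, common_name) VALUES (?, ?)',
    [(11, 'Armenia'), (144, 'Mongolia')],
)

# (cow_code, society_id) pairs; the codes are illustrative
updates = [(371, 11), (712, 144)]

records_updated = 0
for cow_code, society_id in updates:
    # Placeholders keep the values out of the SQL string
    cur = conn.execute(
        'UPDATE society SET cow_code = ? WHERE society_id = ?',
        (cow_code, society_id),
    )
    records_updated += cur.rowcount
conn.commit()

print(records_updated, 'rows updated')
```

Comparing `records_updated` with the number of statements attempted is one way to notice a run in which no rows were changed.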


##### Add New Societies

In [11]:
stop

NameError: name 'stop' is not defined

In [None]:
# Add minimum necessary information to dataframe
# common_name, official_name, society_type_id
new_societies.info()

In [None]:
# Create INSERT statement and execute

<a id='assess1'></a>
## Assess, Part 1

In [None]:
icow.info()

In [None]:
# Unique values in Colonial Ruler column
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()

In [None]:
icow['IndViol'].unique()

In [None]:
icow['IndType'].unique()

### Issues Found
`1)` Remove countries with -9 in all four of `ColRuler`, `IndFrom`, `SecFrom`, and `Into`.

`2)` Rename columns to more reader-friendly format.

`3)` Merge country information from Correlates of War Project with that already in the `society` database table.

<a id='clean1'></a>
## Clean, Part 1

### 1) Remove Uncolonized Countries
Delete all entries where the `ColRuler`, `IndFrom`, `SecFrom`, and `Into` columns are all -9. These countries were never colonized, never gained independence or seceded from another country, and never merged into another country, so they are unnecessary in this dataset.

#### Code

In [None]:
icow.drop( icow.query( '( ColRuler == -9 ) & ( IndFrom == -9 ) & ( SecFrom == -9 ) & ( Into == -9 )' ).index, inplace=True )

#### Test

In [None]:
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()
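
The row filter in this step can also be expressed as a boolean mask, which makes the "all four columns are -9" condition explicit. This is a minimal sketch on three rows taken from the sample shown earlier; `icow_demo` and `link_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Three rows from the earlier sample; only Austria-Hungary has -9 in all four columns
icow_demo = pd.DataFrame({
    'Name': ['Austria-Hungary', 'Mali', 'Parma'],
    'ColRuler': [-9, 220, -9],
    'IndFrom': [-9, 220, -9],
    'SecFrom': [-9, -9, -9],
    'Into': [-9, -9, 325],
})

link_columns = ['ColRuler', 'IndFrom', 'SecFrom', 'Into']

# True only where every one of the four columns equals -9
no_links = (icow_demo[link_columns] == -9).all(axis=1)
kept = icow_demo[~no_links]

print(kept['Name'].tolist())
```

Parma is kept even though it was never colonized, because its `Into` value records the merger into Italy.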

### 2) Rename Columns
Create reader-friendly column names.

#### Code

In [None]:
icow.rename( columns={
    'State': 'country_code',
    'Name': 'country',
    'ColRuler': 'colonizer',
    'IndFrom': 'indep_from',
    'IndDate': 'indep_date',
    'IndViol': 'indep_violent',
    'IndType': 'indep_type',
    'SecFrom': 'secession_from',
    'SecDate': 'secession_date',
    'SecViol': 'secession_violent',
    'Into': 'merged_into',
    'IntoDate': 'merged_date',
    'COWsys': 'cow_indep_date',
    'GWsys': 'gw_indep_date',
    'Notes': 'notes'
}, inplace=True )

#### Test

In [None]:
icow.info()

<a id='assess2'></a>
## Assess, Part 2

In [None]:
print( icow['colonizer'].nunique() )
icow['colonizer'].unique()

In [None]:
print( icow['indep_from'].nunique() )
icow['indep_from'].unique()

### Issues Found
`1)` Add country names to the dataframe for the country codes in the `colonizer`, `indep_from`, `secession_from`, and `merged_into` columns.

`2)` Convert `indep_date`, `secession_date`, `merged_date`, `cow_indep_date`, and `gw_indep_date` from int to Time Period.

<a id='clean2'></a>
## Clean, Part 2

### 1) Add Country Name for columns with Country Codes
Insert columns to hold country name based on country codes in `colonizer`, `indep_from`, `secession_from`, and `merged_into` columns. 

Five Central American countries were part of the Federal Republic of Central America (country code: 89), which is not in the Correlates of War Project list. Manually update these entries. Morocco and Saudi Arabia will remain as NaN because `indep_from` is -9 for both.

#### Code

In [None]:
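# country_codes still has its default integer index, so build a lookup keyed by COW code
# Keep one name per code so the join cannot duplicate rows
code_to_name = country_codes.drop_duplicates( subset='cow_code' ).set_index( 'cow_code' )['country_name']

columns = ['colonizer','indep_from','secession_from','merged_into']
for column in columns:
    # Join on country code, then rename the joined column
    icow = icow.join( code_to_name, on=column )
    new_column_name = column + '_name'
    icow.rename( columns={'country_name': new_column_name}, inplace=True )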

In [None]:
indexes = icow.query( 'indep_from == 89' ).index.tolist()
icow.loc[ indexes, 'indep_from_name' ] = 'Federal Republic of Central America'

#### Test

In [None]:
icow.info()

In [None]:
print( icow['colonizer'].nunique() )
print( icow['colonizer_name'].nunique() )
icow['colonizer_name'].unique()

In [None]:
print( icow['indep_from'].nunique() )
print( icow['indep_from_name'].nunique() )
icow['indep_from_name'].unique()

In [None]:
icow[ icow['indep_from_name'].isnull() ]
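
A code-to-name lookup like the one in this step can also be done with `Series.map`, which leaves codes without a match (including -9) as NaN. The sketch below uses a two-entry lookup table and three toy rows; `icow_demo` is an illustrative name, and the approach is an alternative to the join, not the notebook's own code.

```python
import pandas as pd

# Minimal lookup table: COW code -> country name
country_codes = pd.DataFrame({
    'cow_code': [200, 220],
    'country_name': ['United Kingdom', 'France'],
})

icow_demo = pd.DataFrame({
    'country': ['Guyana', 'Mali', 'Morocco'],
    'indep_from': [200, 220, -9],
})

# Index the names by code so that map() looks up by COW code, not by row position
code_to_name = country_codes.set_index('cow_code')['country_name']
icow_demo['indep_from_name'] = icow_demo['indep_from'].map(code_to_name)

print(icow_demo)
```

Codes missing from the lookup, such as 89 for the Federal Republic of Central America, would also come back as NaN and still need the manual update.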

### 2) Convert Date Columns to Time Period

Convert `indep_date`, `secession_date`, `merged_date`, `cow_indep_date`, and `gw_indep_date` from int to Time Period. 

**Note to future me**: I originally attempted a conversion to datetime, but pandas datetimes can only represent dates between 1677-09-21 and 2262-04-11. A few values fall outside this range, so the workaround found <a href='https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob' target='_new'>here</a> was used instead. That workaround caused the following problem: <a href='https://stackoverflow.com/questions/58019763/jupyter-kernel-crash-when-querying-dataframe-with-period-datatype' target='_new'>Jupyter kernel crash when querying dataframe with Period datatype</a>. 

*Solution:* The error was caused by -9 in the `gw_indep_date` column. While the conversion to Period didn't fail, it somehow corrupted the dataframe, making queries impossible. By setting -9 dates to 000101 prior to conversion, a valid Period of 0001-01-01 is obtained and the dataframe functions properly.
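
One likely reason -9 misbehaves is visible in the digit arithmetic alone. The sketch below applies the same year/month/day split that the conversion uses to a normal date, to -9, and to the 000101 replacement, each after '01' is appended. `split_yyyymmdd` is a hypothetical helper that only reproduces the integer arithmetic; it does not call pandas.

```python
def split_yyyymmdd(x):
    """Hypothetical helper: split a YYYYMMDD integer into (year, month, day)."""
    return (x // 10000, x // 100 % 100, x % 100)

# A normal YYYYMM value with '01' appended
normal = split_yyyymmdd(int('154509' + '01'))

# The -9 missing-value code with '01' appended becomes -901
missing = split_yyyymmdd(int('-9' + '01'))

# The 000101 replacement with '01' appended becomes 10101
fixed = split_yyyymmdd(int('000101' + '01'))

print(normal)   # a valid year, month, and day
print(missing)  # month 90 and day 99 are not valid calendar values
print(fixed)    # year 1, month 1, day 1
```

With -9, floor division and modulo on a negative number yield a month of 90 and a day of 99, so the resulting Period is built from values that do not describe a real date.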

#### Code

In [None]:
# Conversion function to create Time Period, copied without modification from Pandas documentation linked in the cell above
def conv(x):
    return pd.Period( year=x // 10000, month=x // 100 % 100, day=x % 100, freq='D' )

# -9 entries in date columns cause problems with the dataframe after conversion to Period
# This sets them to a valid far past date of 000101
def fix_dates(x):
    return '000101' if x == -9 else x

In [None]:
# Convert integer YYYYMM dates to Time Period
# First, apply fix_dates to each column so that missing data (-9 in the original dataset) converts properly
# Then cast to string and append '01' to create format YYYYMMDD (dataset contains only YYYYMM), then apply conversion function
columns = ['indep_date','secession_date','merged_date','cow_indep_date','gw_indep_date']
for column in columns:
    icow[column] = icow[column].apply(fix_dates)
    icow[column] = ( icow[column].astype('str') + '01' ).astype('int').apply(conv)

#### Test

In [None]:
icow.info()

In [None]:
# Check the three entries that failed when attempting conversion to datetime
icow[ icow.index.isin( [148,150,170] ) ][['country','indep_date','cow_indep_date','gw_indep_date']]
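
An alternative to the 0001-01-01 placeholder is to keep missing dates as `pd.NaT` and use monthly Periods, since the source data only has year and month. This is a sketch of that option under those assumptions, not the approach the notebook takes; `yyyymm_to_period` is a hypothetical helper.

```python
import pandas as pd

def yyyymm_to_period(value):
    """Hypothetical helper: monthly Period for a YYYYMM integer, NaT for -9."""
    if value == -9:
        return pd.NaT
    return pd.Period(year=value // 100, month=value % 100, freq='M')

# Two real dates from the sample rows plus the missing-value code
dates = pd.Series([154509, 196009, -9])
periods = dates.apply(yyyymm_to_period)

print(periods)
```

With this variant, missing dates are excluded by `isna()` instead of being compared against a placeholder date.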

<a id='assess3'></a>
## Assess, Part 3

In [None]:
icow['colonizer'].value_counts()

In [None]:
icow['indep_from'].value_counts()

In [None]:
icow['secession_from'].value_counts()

In [None]:
icow['merged_into'].value_counts()

### Issues Found
`1)` Numerous -9 values found in `colonizer`, `indep_from`, `secession_from`, and `merged_into` columns. Split into 4 datasets, one dataset for each of the four columns where the value is not -9. 

<a id='clean3'></a>
## Clean, Part 3

### 1) Split into 4 datasets
Create a dataset each for colonized countries, countries that declared independence, countries that seceded, and countries that merged into another. To populate these datasets, get all values that are not -9 from the corresponding columns: `colonizer`, `indep_from`, `secession_from`, `merged_into`

#### Code

In [None]:
colonized = icow.query( 'colonizer != -9' ).copy()
colonized.shape

In [None]:
independence = icow.query( 'indep_from != -9' ).copy()
independence.shape

In [None]:
seceded = icow.query( 'secession_from != -9' ).copy()
seceded.shape

In [None]:
merged = icow.query( 'merged_into != -9' ).copy()
merged.shape

#### Test

In [None]:
colonized['colonizer'].value_counts()

In [None]:
independence['indep_from'].value_counts()

In [None]:
seceded['secession_from'].value_counts()

In [None]:
merged['merged_into'].value_counts()
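
The four subsets follow one pattern, so they can also be built in a single dictionary comprehension. The sketch below uses three rows from the earlier sample with the renamed columns; `icow_demo`, `link_columns`, and `subsets` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Three rows from the earlier sample, using the renamed columns
icow_demo = pd.DataFrame({
    'country': ['Mali', 'Parma', 'German Federal Rep. (West Germany)'],
    'colonizer': [220, -9, -9],
    'indep_from': [220, -9, 255],
    'secession_from': [-9, -9, 255],
    'merged_into': [-9, 325, 255],
})

# Dataset name -> column whose non-missing values define it
link_columns = {
    'colonized': 'colonizer',
    'independence': 'indep_from',
    'seceded': 'secession_from',
    'merged': 'merged_into',
}

# One independent copy per dataset, keeping rows where the column is not -9
subsets = {
    name: icow_demo[icow_demo[column] != -9].copy()
    for name, column in link_columns.items()
}

for name, frame in subsets.items():
    print(name, frame.shape)
```

A country can appear in more than one subset: West Germany is in the independence, seceded, and merged datasets.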

<a id='references'></a>
## References

<ul>
    <li>Paul R. Hensel (2018). "ICOW Colonial History Data Set, version 1.1." Available at <a href='http://www.paulhensel.org/icowcol.html' target='_new'>http://www.paulhensel.org/icowcol.html</a></li>
    <li><a href='http://www.correlatesofwar.org/data-sets/downloadable-files/cow-country-codes' target='_new'>Correlates of War country codes</a></li>
</ul>