# Wrangling Former Colonies
## By: Scott Kustes

### Objective:
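Compile a list of former colonies of modern countries. The final output is four dataframes: colonized countries, countries that gained independence, countries that seceded, and countries that merged into another state. Each pairs the relevant country codes with country names and carries the independence, secession, and merger dates.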

#### Discussion:
History is complex. For many countries, the date of independence is not clear-cut. This dataset contains Dr. Hensel's determination of independence date, defined as the point at which a country makes its own foreign and domestic policy decisions. For comparison, it also contains the independence dates as determined by The Correlates of War Project and the Gleditsch/Ward list of independent states. In some instances, the dates differ vastly.

#### Datasets:
The primary dataset was downloaded from <a href='http://www.paulhensel.org/icow.html' target='_new'>The Issue Correlates of War Project</a>. Additional datasets were downloaded from <a href='http://www.correlatesofwar.org' target='_new'>The Correlates of War Project</a>.  For specific dataset citations, see the References section at the bottom of the notebook.

#### Contents
<ul>
    <li><a href='#gather'>Data Gathering</a></li>
    <li><a href='#assess1'>Assess, Part 1</a></li>
    <li><a href='#clean1'>Clean, Part 1</a></li>
    <li><a href='#assess2'>Assess, Part 2</a></li>
    <li><a href='#clean2'>Clean, Part 2</a></li>
    <li><a href='#assess3'>Assess, Part 3</a></li>
    <li><a href='#clean3'>Clean, Part 3</a></li>
    <li><a href='#final'>Finished Dataframes</a></li>
    <li><a href='#references'>References</a></li>
</ul>

In [1]:
# Import packages
import requests
import pandas as pd
import os.path as os_path

<a id='gather'></a>
## Data Gathering
### Colonization Data

In [2]:
icow = pd.read_csv( 'colonial_data.csv' )
icow.sample(5)

Unnamed: 0,State,Name,ColRuler,IndFrom,IndDate,IndViol,IndType,SecFrom,SecDate,SecViol,Into,IntoDate,COWsys,GWsys,Notes
1,20,Canada,200,200,186707,0,2,-9,-9,-9,-9,-9,192001,186707,Independent but not a COW system member from 1...
39,212,Luxembourg,210,210,186705,0,2,210,186705,0,-9,-9,192011,186705,COW system interrupted 5/1940-9/1944 (Ger occu...
105,432,Mali,220,220,196009,0,2,-9,-9,-9,-9,-9,196006,196009,-9
160,663,Jordan,640,200,194605,0,2,-9,-9,-9,-9,-9,194603,194605,Occupied by British 1918-23 before becoming Le...
139,560,South Africa,200,200,191005,0,2,-9,-9,-9,-9,-9,192001,191005,Ceded by Dutch to British in 1814


### Country Codes
Read in and de-duplicate the country codes used by The Correlates of War Project. Rename columns and set `country_code` as the dataframe key.

In [3]:
country_codes = pd.read_csv( 'cow_country_codes.csv' )
country_codes.drop_duplicates( inplace=True )
country_codes.rename( columns={'StateAbb': 'abbreviation', 'CCode': 'country_code', 'StateNme': 'country_name'}, inplace=True )
country_codes.set_index( 'country_code', inplace=True )
country_codes.sample(5)

country_code,abbreviation,country_name
920,NEW,New Zealand
626,SSD,South Sudan
702,TAJ,Tajikistan
712,MON,Mongolia
337,TUS,Tuscany
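
Because `country_code` is later used as a lookup key, it can be worth confirming that de-duplication really leaves one row per code. The following is a minimal sketch on a toy frame (the rows are illustrative stand-ins, not the real `cow_country_codes.csv`); it uses the `verify_integrity` option of `set_index`, which raises a `ValueError` if the new index contains duplicates.

```python
import pandas as pd

# Toy stand-in for cow_country_codes.csv (illustrative rows only)
raw = pd.DataFrame({
    'StateAbb': ['USA', 'USA', 'CAN'],
    'CCode': [2, 2, 20],
    'StateNme': ['United States of America', 'United States of America', 'Canada'],
})

codes = (
    raw.drop_duplicates()
       .rename(columns={'StateAbb': 'abbreviation',
                        'CCode': 'country_code',
                        'StateNme': 'country_name'})
       # verify_integrity=True raises ValueError if any country_code repeats
       .set_index('country_code', verify_integrity=True)
)
print(codes)
```

If the real file contained two different names for the same code, this version would fail loudly instead of silently producing duplicate index entries.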


<a id='assess1'></a>
## Assess, Part 1

In [4]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 15 columns):
State       217 non-null int64
Name        217 non-null object
ColRuler    217 non-null int64
IndFrom     217 non-null int64
IndDate     217 non-null int64
IndViol     217 non-null int64
IndType     217 non-null int64
SecFrom     217 non-null int64
SecDate     217 non-null int64
SecViol     217 non-null int64
Into        217 non-null int64
IntoDate    217 non-null int64
COWsys      217 non-null int64
GWsys       217 non-null int64
Notes       217 non-null object
dtypes: int64(13), object(2)
memory usage: 25.5+ KB


In [5]:
# Unique values in Colonial Ruler column
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()

17


array([200, 230, 220, 210, 235,  -9, 255, 300, 640, 365, 211, 325, 710,
       740, 900,   2, 920], dtype=int64)

In [6]:
icow['IndViol'].unique()

array([1, 0], dtype=int64)

In [7]:
icow['IndType'].unique()

array([2, 3, 1, 4], dtype=int64)

### Issues Found
`1)` Remove countries with -9 in all four of `ColRuler`, `IndFrom`, `SecFrom`, and `Into`.

`2)` Rename columns to more reader-friendly format.

<a id='clean1'></a>
## Clean, Part 1

### 1) Remove Uncolonized Countries
Delete all entries where `ColRuler`, `IndFrom`, `SecFrom`, and `Into` columns are -9. These countries were never colonized, never declared independence or seceded from another country, and never merged into another country and are therefore unnecessary in the dataset.

#### Code

In [8]:
icow.drop( icow.query( '( ColRuler == -9 ) & ( IndFrom == -9 ) & ( SecFrom == -9 ) & ( Into == -9 )' ).index, inplace=True )

#### Test

In [9]:
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()

17


array([200, 230, 220, 210, 235,  -9, 255, 300, 640, 365, 211, 325, 710,
       740, 900,   2, 920], dtype=int64)
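
Note that -9 still appears in `ColRuler` after the drop. This is expected, since only rows with -9 in all four columns were removed. A more direct check is to count the rows that still have -9 in every one of the four columns. The sketch below does this on a toy frame (the rows and the `key_columns` name are made up for illustration) and uses a boolean mask instead of `query`.

```python
import pandas as pd

# Toy frame: row B has -9 in all four columns, row C only in three of them
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'ColRuler': [200, -9, -9],
    'IndFrom': [200, -9, -9],
    'SecFrom': [-9, -9, -9],
    'Into': [-9, -9, 255],
})
key_columns = ['ColRuler', 'IndFrom', 'SecFrom', 'Into']

# True for rows where every key column equals -9
all_missing = (toy[key_columns] == -9).all(axis=1)
toy = toy[~all_missing]

# After the drop, no row should have -9 in all four columns
remaining = (toy[key_columns] == -9).all(axis=1).sum()
print(remaining)
```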

### 2) Rename Columns
Create reader-friendly column names.

#### Code

In [10]:
icow.rename( columns={
    'State': 'country_code',
    'Name': 'country',
    'ColRuler': 'colonizer',
    'IndFrom': 'indep_from',
    'IndDate': 'indep_date',
    'IndViol': 'indep_violent',
    'IndType': 'indep_type',
    'SecFrom': 'secession_from',
    'SecDate': 'secession_date',
    'SecViol': 'secession_violent',
    'Into': 'merged_into',
    'IntoDate': 'merged_date',
    'COWsys': 'cow_indep_date',
    'GWsys': 'gw_indep_date',
    'Notes': 'notes'
}, inplace=True )

#### Test

In [11]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 216
Data columns (total 15 columns):
country_code         193 non-null int64
country              193 non-null object
colonizer            193 non-null int64
indep_from           193 non-null int64
indep_date           193 non-null int64
indep_violent        193 non-null int64
indep_type           193 non-null int64
secession_from       193 non-null int64
secession_date       193 non-null int64
secession_violent    193 non-null int64
merged_into          193 non-null int64
merged_date          193 non-null int64
cow_indep_date       193 non-null int64
gw_indep_date        193 non-null int64
notes                193 non-null object
dtypes: int64(13), object(2)
memory usage: 24.1+ KB


<a id='assess2'></a>
## Assess, Part 2

In [12]:
print( icow['colonizer'].nunique() )
icow['colonizer'].unique()

17


array([200, 230, 220, 210, 235,  -9, 255, 300, 640, 365, 211, 325, 710,
       740, 900,   2, 920], dtype=int64)

In [13]:
print( icow['indep_from'].nunique() )
icow['indep_from'].unique()

34


array([200,   2, 220,  41, 230,  89, 100, 210, 235, 140,  -9, 255, 300,
       315, 640, 345, 365, 380, 390, 432, 211, 325, 530, 560, 625, 678,
       710, 730, 750, 770, 820, 850, 900, 920], dtype=int64)

### Issues Found
`1)` Add country name into dataframe for columns `colonizer` and `indep_from`.

`2)` Convert `indep_date`, `secession_date`, `merged_date`, `cow_indep_date`, and `gw_indep_date` from int to Time Period.

<a id='clean2'></a>
## Clean, Part 2

### 1) Add Country Name for columns with Country Codes
Insert columns to hold country name based on country codes in `colonizer`, `indep_from`, `secession_from`, and `merged_into` columns. 

Five Central American countries were part of the Federal Republic of Central America (country code: 89), which is not in the Correlates of War Project list. Manually update these entries. Entries whose `indep_from` is -9, such as Morocco and Saudi Arabia, will remain NaN.

#### Code

In [14]:
columns = ['colonizer','indep_from','secession_from','merged_into']
for column in columns:
    # Join on country code, then rename the joined column
    icow = icow.join( country_codes['country_name'], on=column )
    new_column_name = column + '_name'
    icow.rename( columns={'country_name': new_column_name}, inplace=True )

In [15]:
indexes = icow.query( 'indep_from == 89' ).index.tolist()
icow.loc[ indexes, 'indep_from_name' ] = 'Federal Republic of Central America'
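
The same lookup-then-patch pattern can be seen in isolation on a toy frame. This is a sketch only: the rows are illustrative, and it uses `Series.map` as a substitute for the `join` above. Codes missing from the lookup table (here 89 and -9) come back as NaN, and code 89 is then filled in by hand.

```python
import pandas as pd

# Toy lookup table keyed by country code (illustrative subset)
codes = pd.DataFrame(
    {'country_name': ['United Kingdom', 'Spain']},
    index=pd.Index([200, 230], name='country_code'),
)

# Toy rows; 89 is absent from the lookup and -9 means "not applicable"
df = pd.DataFrame({
    'country': ['Canada', 'Mexico', 'Guatemala', 'Morocco'],
    'indep_from': [200, 230, 89, -9],
})

# Look up names by code; unmatched codes become NaN
df['indep_from_name'] = df['indep_from'].map(codes['country_name'])

# Patch the one code that the lookup table does not cover
df.loc[df['indep_from'] == 89, 'indep_from_name'] = 'Federal Republic of Central America'
print(df)
```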

#### Test

In [16]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 216
Data columns (total 19 columns):
country_code           193 non-null int64
country                193 non-null object
colonizer              193 non-null int64
indep_from             193 non-null int64
indep_date             193 non-null int64
indep_violent          193 non-null int64
indep_type             193 non-null int64
secession_from         193 non-null int64
secession_date         193 non-null int64
secession_violent      193 non-null int64
merged_into            193 non-null int64
merged_date            193 non-null int64
cow_indep_date         193 non-null int64
gw_indep_date          193 non-null int64
notes                  193 non-null object
colonizer_name         171 non-null object
indep_from_name        178 non-null object
secession_from_name    33 non-null object
merged_into_name       19 non-null object
dtypes: int64(13), object(6)
memory usage: 35.2+ KB


In [17]:
print( icow['colonizer'].nunique() )
print( icow['colonizer_name'].nunique() )
icow['colonizer_name'].unique()

17
16


array(['United Kingdom', 'Spain', 'France', 'Netherlands', 'Portugal',
       nan, 'Germany', 'Austria-Hungary', 'Turkey', 'Russia', 'Belgium',
       'Italy', 'China', 'Japan', 'Australia', 'United States of America',
       'New Zealand'], dtype=object)

In [18]:
print( icow['indep_from'].nunique() )
print( icow['indep_from_name'].nunique() )
icow['indep_from_name'].unique()

34
33


array(['United Kingdom', 'United States of America', 'France', 'Haiti',
       'Spain', 'Federal Republic of Central America', 'Colombia',
       'Netherlands', 'Portugal', 'Brazil', nan, 'Germany',
       'Austria-Hungary', 'Czechoslovakia', 'Turkey', 'Yugoslavia',
       'Russia', 'Sweden', 'Denmark', 'Mali', 'Belgium', 'Italy',
       'Ethiopia', 'South Africa', 'Sudan', 'Yemen Arab Republic',
       'China', 'Korea', 'India', 'Pakistan', 'Malaysia', 'Indonesia',
       'Australia', 'New Zealand'], dtype=object)

In [19]:
icow[ icow['indep_from_name'].isnull() ]

Unnamed: 0,country_code,country,colonizer,indep_from,indep_date,indep_violent,indep_type,secession_from,secession_date,secession_violent,merged_into,merged_date,cow_indep_date,gw_indep_date,notes,colonizer_name,indep_from_name,secession_from_name,merged_into_name
47,240,Hanover,-9,-9,181410,1,1,-9,-9,-9,255,186607,183706,181601,Merged into unified Germany,,,,Germany
48,245,Bavaria,-9,-9,150507,1,1,-9,-9,-9,255,187101,181601,181601,Merged into unified Germany,,,,Germany
52,267,Baden,-9,-9,177110,0,1,-9,-9,-9,255,187012,181601,181601,Merged into unified Germany,,,,Germany
53,269,Saxony,-9,-9,180612,1,4,-9,-9,-9,255,186704,181601,181601,Formed by Napoleonic partition of Holy Roman E...,,,,Germany
54,271,Wuerttemburg,-9,-9,180601,1,4,-9,-9,-9,255,187012,181601,181601,Formed by Napoleonic partition of Holy Roman E...,,,,Germany
55,273,Hesse - Kassel/Cassel (Hesse Electoral),-9,-9,180608,1,4,-9,-9,-9,255,186607,181601,181601,Formed by Napoleonic partition of Holy Roman E...,,,,Germany
56,275,Hesse - Darmstadt (Hesse Grand Ducal),-9,-9,180608,1,4,-9,-9,-9,255,186704,181601,181601,Formed by Napoleonic partition of Holy Roman E...,,,,Germany
57,280,Mecklenburg-Schwerin,-9,-9,162100,0,1,-9,-9,-9,255,186704,184301,181601,Merged into unified Germany,,,,Germany
66,327,Papal States,-9,-9,150300,1,1,-9,-9,-9,325,186011,181601,181601,Merged into unified Italy,,,,Italy
67,329,Two Sicilies,-9,-9,173400,1,1,-9,-9,-9,325,186102,181601,181601,Merged into unified Italy,,,,Italy


### 2) Convert date columns to Time Period

Convert `indep_date`, `secession_date`, `merged_date`, `cow_indep_date`, and `gw_indep_date` from int to Time Period. 

**Note to future me**: I originally attempted a conversion to datetime, but pandas timestamps are limited to dates between 1677-09-21 and 2262-04-11. A few values fall outside this range, so I used the workaround described <a href='https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob' target='_new'>here</a>. That in turn caused the following problem: <a href='https://stackoverflow.com/questions/58019763/jupyter-kernel-crash-when-querying-dataframe-with-period-datatype' target='_new'>Jupyter kernel crash when querying dataframe with Period datatype</a>.

*Solution:* The error was caused by -9 in the `gw_indep_date` column. While the conversion to Period didn't fail, it somehow corrupted the dataframe, making queries impossible. By setting -9 dates to 000101 prior to conversion, a valid Period of 0001-01-01 is obtained and the dataframe functions properly.
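
The range limitation can be inspected directly. The short sketch below prints the bounds of the nanosecond-based timestamp and builds a daily Period for Morocco's 1666 independence date, which falls before the lower bound. How out-of-range values are handled by datetime conversion may differ between pandas versions, so this only demonstrates the bounds and the Period workaround.

```python
import pandas as pd

# Bounds of the nanosecond-resolution timestamp
lower, upper = pd.Timestamp.min, pd.Timestamp.max
print(lower)
print(upper)

# A Period is not restricted to that range
morocco = pd.Period(year=1666, month=6, day=1, freq='D')
print(morocco)
print(morocco.year < lower.year)
```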

#### Code

In [20]:
# Conversion function to create Time Period, copied without modification from Pandas documentation linked in the cell above
def conv(x):
    return pd.Period( year=x // 10000, month=x // 100 % 100, day=x % 100, freq='D' )

# -9 entries in date columns cause problems with the dataframe after conversion to Period
# This sets them to a valid far past date of 000101
def fix_dates(x):
    return '000101' if x == -9 else x

In [21]:
# Convert integer YYYYMM dates to Time Period
# First, apply fix_dates to each column so missing data (-9 in the original dataset) converts properly
# Then append '01' to create format YYYYMMDD (dataset contains only YYYYMM) and apply the conversion function
columns = ['indep_date','secession_date','merged_date','cow_indep_date','gw_indep_date']
for column in columns:
    icow[column] = icow[column].apply(fix_dates)
    icow[column] = ( icow[column].astype('str') + '01' ).astype('int').apply(conv)
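
To make the integer arithmetic in `conv` concrete, here is a worked example for a single value. It repeats `conv` as defined above and traces Canada's `IndDate` of 186707 (from the sample output earlier) as well as the -9 placeholder path.

```python
import pandas as pd

def conv(x):
    return pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq='D')

# 186707 with '01' appended becomes 18670701
x = int(str(186707) + '01')
parts = (x // 10000, x // 100 % 100, x % 100)  # year, month, day
print(parts)
canada = conv(x)
print(canada)

# Missing value path: -9 -> '000101' -> '00010101' -> 10101
placeholder = conv(int('000101' + '01'))
print(placeholder.year, placeholder.month, placeholder.day)
```

Integer division by 10000 strips the last four digits to leave the year, `x // 100 % 100` isolates the month, and `x % 100` isolates the day.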

#### Test

In [22]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 216
Data columns (total 19 columns):
country_code           193 non-null int64
country                193 non-null object
colonizer              193 non-null int64
indep_from             193 non-null int64
indep_date             193 non-null period[D]
indep_violent          193 non-null int64
indep_type             193 non-null int64
secession_from         193 non-null int64
secession_date         193 non-null period[D]
secession_violent      193 non-null int64
merged_into            193 non-null int64
merged_date            193 non-null period[D]
cow_indep_date         193 non-null period[D]
gw_indep_date          193 non-null period[D]
notes                  193 non-null object
colonizer_name         171 non-null object
indep_from_name        178 non-null object
secession_from_name    33 non-null object
merged_into_name       19 non-null object
dtypes: int64(8), object(6), period[D](5)
memory usage: 35.2+ KB


In [23]:
# Check the three entries that failed when attempting conversion to datetime
icow[ icow.index.isin( [148,150,170] ) ][['country','indep_date','cow_indep_date','gw_indep_date']]

Unnamed: 0,country,indep_date,cow_indep_date,gw_indep_date
148,Morocco,1666-06-01,1847-01-01,1816-01-01
150,Tunisia,1591-01-01,1825-01-01,1816-01-01
170,Oman,1741-01-01,1971-10-01,1816-01-01


<a id='assess3'></a>
## Assess, Part 3

In [24]:
icow['colonizer'].value_counts()

 200    62
 220    25
-9      22
 640    21
 230    20
 365    12
 235     8
 300     5
 210     4
 2       3
 211     3
 325     2
 740     2
 255     1
 900     1
 920     1
 710     1
Name: colonizer, dtype: int64

In [25]:
icow['indep_from'].value_counts()

 200    59
 220    25
-9      15
 365    15
 640     9
 230     9
 235     6
 345     6
 2       5
 89      5
 210     4
 300     3
 100     3
 255     3
 211     3
 900     2
 730     2
 315     2
 710     2
 390     1
 530     1
 41      1
 560     1
 820     1
 325     1
 850     1
 432     1
 678     1
 920     1
 625     1
 770     1
 750     1
 140     1
 380     1
Name: indep_from, dtype: int64

In [26]:
icow['secession_from'].value_counts()

-9      155
 345      6
 89       5
 365      3
 100      3
 210      2
 300      2
 315      2
 255      2
 380      1
 625      1
 390      1
 651      1
 140      1
 145      1
 432      1
 820      1
 710      1
 41       1
 530      1
 200      1
 770      1
Name: secession_from, dtype: int64

In [27]:
icow['merged_into'].value_counts()

-9      174
 255     10
 325      5
 679      2
 510      1
 816      1
Name: merged_into, dtype: int64

### Issues Found
`1)` Numerous -9 values found in `colonizer`, `indep_from`, `secession_from`, and `merged_into` columns. Split into 4 datasets, one dataset for each of the four columns where the value is not -9. 

<a id='clean3'></a>
## Clean, Part 3

### 1) Split into 4 datasets
Create one dataset each for colonized countries, countries that declared independence, countries that seceded, and countries that merged into another. To populate these datasets, select the rows where the corresponding column (`colonizer`, `indep_from`, `secession_from`, or `merged_into`) is not -9.

#### Code

In [28]:
colonized = icow.query( 'colonizer != -9' ).copy()
colonized.shape

(171, 19)

In [29]:
independence = icow.query( 'indep_from != -9' ).copy()
independence.shape

(178, 19)

In [30]:
seceded = icow.query( 'secession_from != -9' ).copy()
seceded.shape

(38, 19)

In [31]:
merged = icow.query( 'merged_into != -9' ).copy()
merged.shape

(19, 19)

#### Test

In [32]:
colonized['colonizer'].value_counts()

200    62
220    25
640    21
230    20
365    12
235     8
300     5
210     4
211     3
2       3
740     2
325     2
255     1
710     1
920     1
900     1
Name: colonizer, dtype: int64

In [33]:
independence['indep_from'].value_counts()

200    59
220    25
365    15
640     9
230     9
235     6
345     6
89      5
2       5
210     4
100     3
300     3
211     3
255     3
710     2
315     2
730     2
900     2
325     1
530     1
41      1
678     1
560     1
820     1
390     1
920     1
432     1
140     1
625     1
770     1
380     1
750     1
850     1
Name: indep_from, dtype: int64

In [34]:
seceded['secession_from'].value_counts()

345    6
89     5
365    3
100    3
255    2
210    2
315    2
300    2
390    1
200    1
380    1
140    1
145    1
530    1
651    1
41     1
710    1
432    1
625    1
820    1
770    1
Name: secession_from, dtype: int64

In [35]:
merged['merged_into'].value_counts()

255    10
325     5
679     2
510     1
816     1
Name: merged_into, dtype: int64
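
The four splits and their checks follow the same pattern, so they could also be written as a loop. The following is a sketch on a toy frame (rows are made up, and the `subsets` dictionary is a hypothetical name, not part of the notebook); it builds one subset per column and asserts that no -9 remains in the column used for the split.

```python
import pandas as pd

# Toy frame with the four code columns (illustrative values)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'colonizer': [200, -9, 220],
    'indep_from': [200, -9, 220],
    'secession_from': [-9, 345, -9],
    'merged_into': [-9, 255, -9],
})

split_columns = ['colonizer', 'indep_from', 'secession_from', 'merged_into']

# One independent copy per column, keeping rows where that column is not -9
subsets = {col: toy[toy[col] != -9].copy() for col in split_columns}

for col, subset in subsets.items():
    # The splitting column must no longer contain the missing-value code
    assert (subset[col] != -9).all()
    print(col, subset.shape)
```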

<a id='references'></a>
## References

<ul>
    <li>Paul R. Hensel (2018). "ICOW Colonial History Data Set, version 1.1." Available at <a href='http://www.paulhensel.org/icowcol.html' target='_new'>http://www.paulhensel.org/icowcol.html</a></li>
    <li><a href='http://www.correlatesofwar.org/data-sets/downloadable-files/cow-country-codes' target='_new'>Correlates of War country codes</a></li>
</ul>