# Wrangling Former Colonies
## By: Scott Kustes

### Objective:
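Compile a list of former colonies of modern countries. The cleaned dataframe built below lists each former colony with its colonizer, the state it became independent from, whether independence was violent, the type of independence, and three independence dates (ICOW, Correlates of War, and Gleditsch/Ward).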

#### Discussion:
History is complex. For many countries, the date of independence is not clear-cut. This dataset contains Dr. Paul Hensel's determination of each independence date, defined as the point at which a country makes its own foreign and domestic policy decisions. For comparison, it also contains the independence dates as determined by The Correlates of War Project and by the Gleditsch/Ward list of independent states. In some instances, the dates differ substantially.

#### Datasets:
The primary dataset was downloaded from <a href='http://www.paulhensel.org/icow.html' target='_new'>The Issue Correlates of War Project</a>. Additional datasets were downloaded from <a href='http://www.correlatesofwar.org' target='_new'>The Correlates of War Project</a>.  For specific dataset citations, see the References section at the bottom of the notebook.

#### Contents
<ul>
    <li><a href='#gather'>Data Gathering</a></li>
    <li><a href='#assess1'>Assess, Part 1</a></li>
    <li><a href='#clean1'>Clean, Part 1</a></li>
    <li><a href='#assess2'>Assess, Part 2</a></li>
    <li><a href='#clean2'>Clean, Part 2</a></li>
    <li><a href='#final'>Finished Dataframes</a></li>
    <li><a href='#references'>References</a></li>
</ul>

In [1]:
# Import packages
import requests
import pandas as pd
import os.path as os_path

<a id='gather'></a>
## Data Gathering
### Colonization Data

In [2]:
icow = pd.read_csv( 'colonial_data.csv' )
icow.sample(5)

Unnamed: 0,State,Name,ColRuler,IndFrom,IndDate,IndViol,IndType,SecFrom,SecDate,SecViol,Into,IntoDate,COWsys,GWsys,Notes
24,101,Venezuela,230,100,183001,0,3,100,183001,0,-9,-9,184101,182901,Seceded from Gran Colombia (not in COW system;...
27,130,Ecuador,230,100,183005,0,3,100,183005,0,-9,-9,185401,183005,Seceded from Gran Colombia (not in COW system;...
75,343,North Macedonia (Macedonia/FYROM),640,345,199109,0,3,345,199109,0,-9,-9,199304,199111,"""Former Yugoslav Republic of..."" due to Greek ..."
91,370,Belarus,365,365,199112,0,3,-9,-9,-9,-9,-9,199112,199108,Under Polish rule until obtained by Russia in ...
173,702,Tajikistan,365,365,199112,0,3,-9,-9,-9,-9,-9,199112,199109,-9


### Country Codes
Read in and de-duplicate the country codes used by The Correlates of War Project. Rename columns and set `country_code` as the dataframe key.

In [3]:
country_codes = pd.read_csv( 'cow_country_codes.csv' )
country_codes.drop_duplicates( inplace=True )
country_codes.rename( columns={'StateAbb': 'abbreviation', 'CCode': 'country_code', 'StateNme': 'country_name'}, inplace=True )
country_codes.set_index( 'country_code', inplace=True )
country_codes.sample(5)

Unnamed: 0_level_0,abbreviation,country_name
country_code,Unnamed: 1_level_1,Unnamed: 2_level_1
701,TKM,Turkmenistan
355,BUL,Bulgaria
365,RUS,Russia
438,GUI,Guinea
31,BHM,Bahamas
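The same de-duplicate, rename, and index steps can also be written as a single method chain that avoids `inplace=True`. This is a minimal sketch on a toy frame: the rows are illustrative stand-ins for `cow_country_codes.csv`, and `verify_integrity=True` is an added safeguard that the cell above does not use.

```python
import pandas as pd

# Toy stand-in for cow_country_codes.csv (illustrative rows, one of them repeated)
raw_codes = pd.DataFrame({
    'StateAbb': ['USA', 'USA', 'UKG', 'RUS'],
    'CCode': [2, 2, 200, 365],
    'StateNme': ['United States of America', 'United States of America',
                 'United Kingdom', 'Russia'],
})

codes_demo = (
    raw_codes
    .drop_duplicates()
    .rename(columns={'StateAbb': 'abbreviation',
                     'CCode': 'country_code',
                     'StateNme': 'country_name'})
    # verify_integrity raises ValueError if any country_code is still repeated
    .set_index('country_code', verify_integrity=True)
)
print(codes_demo)
```

A unique index matters here because the later join on country code would otherwise duplicate rows in the main dataframe.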


<a id='assess1'></a>
## Assess, Part 1

In [4]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 15 columns):
State       217 non-null int64
Name        217 non-null object
ColRuler    217 non-null int64
IndFrom     217 non-null int64
IndDate     217 non-null int64
IndViol     217 non-null int64
IndType     217 non-null int64
SecFrom     217 non-null int64
SecDate     217 non-null int64
SecViol     217 non-null int64
Into        217 non-null int64
IntoDate    217 non-null int64
COWsys      217 non-null int64
GWsys       217 non-null int64
Notes       217 non-null object
dtypes: int64(13), object(2)
memory usage: 25.5+ KB


In [5]:
# Unique values in Colonial Ruler column
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()

17


array([200, 230, 220, 210, 235,  -9, 255, 300, 640, 365, 211, 325, 710,
       740, 900,   2, 920], dtype=int64)

In [6]:
icow['IndViol'].unique()

array([1, 0], dtype=int64)

In [7]:
icow['IndType'].unique()

array([2, 3, 1, 4], dtype=int64)

### Issues Found
`1)` Remove countries with -9 in ColRuler. These places were never colonized.

`2)` Drop columns related to secession ("Sec" columns) and merging into another country ("Into" columns).

`3)` Rename columns to more reader-friendly format.
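Because `icow.info()` reports no nulls in any column, placeholder values such as -9 have to be counted explicitly. A rough sketch on a toy frame follows; the rows are invented for illustration and `sentinel_counts` is a hypothetical name.

```python
import pandas as pd

# Toy rows using the same -9 placeholder as the ICOW file (values are illustrative)
demo = pd.DataFrame({
    'ColRuler': [200, -9, 230],
    'SecFrom': [-9, -9, 100],
    'IntoDate': [-9, -9, -9],
})

# Count how many times -9 appears in each column
sentinel_counts = demo.eq(-9).sum()
print(sentinel_counts)
```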

<a id='clean1'></a>
## Clean, Part 1

### 1) Remove Uncolonized Countries
Delete all entries where the `ColRuler` column is -9, which indicates a country that was never colonized.

#### Code

In [8]:
icow.drop( icow.query( 'ColRuler == -9' ).index, inplace=True )

#### Test

In [9]:
print( icow['ColRuler'].nunique() )
icow['ColRuler'].unique()

16


array([200, 230, 220, 210, 235, 255, 300, 640, 365, 211, 325, 710, 740,
       900,   2, 920], dtype=int64)
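An equivalent way to express this filter is a boolean mask, which returns a new frame rather than modifying the existing one in place. This is a small sketch with made-up rows; `colonized` is a hypothetical name.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'ColRuler': [200, -9, 230],
})

# Keep only rows with a real colonizer code; the original index labels are preserved
colonized = demo[demo['ColRuler'] != -9]
print(colonized)
```

As with the `drop` call, the surviving rows keep their original index labels, so the index is no longer a contiguous range.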

### 2) Drop Unnecessary Columns
Drop columns related to secession from or merging into another country: `SecFrom`, `SecDate`, `SecViol`, `Into`, `IntoDate`

#### Code

In [10]:
# `columns=` already identifies the axis, so `axis=1` is not needed
icow.drop( columns=['SecFrom','SecDate','SecViol','Into','IntoDate'], inplace=True )

#### Test

In [11]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 216
Data columns (total 10 columns):
State       171 non-null int64
Name        171 non-null object
ColRuler    171 non-null int64
IndFrom     171 non-null int64
IndDate     171 non-null int64
IndViol     171 non-null int64
IndType     171 non-null int64
COWsys      171 non-null int64
GWsys       171 non-null int64
Notes       171 non-null object
dtypes: int64(8), object(2)
memory usage: 14.7+ KB
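If the list of columns to drop grows, it can be derived from the name prefixes instead of being typed out. This is only a sketch under the assumption that every column starting with "Sec" or "Into" should go; the empty frame and the names `to_drop` and `trimmed` are illustrative.

```python
import pandas as pd

# Empty frame with a subset of the ICOW column names
demo = pd.DataFrame(columns=['State', 'IndFrom', 'SecFrom', 'SecDate',
                             'SecViol', 'Into', 'IntoDate', 'Notes'])

# Collect every column whose name starts with "Sec" or "Into"
to_drop = [c for c in demo.columns if c.startswith(('Sec', 'Into'))]
trimmed = demo.drop(columns=to_drop)
print(list(trimmed.columns))
```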


### 3) Rename Columns
Create reader-friendly column names.

#### Code

In [12]:
icow.rename( columns={
    'State': 'country_code',
    'Name': 'country',
    'ColRuler': 'colonizer',
    'IndFrom': 'indep_from',
    'IndDate': 'indep_date',
    'IndViol': 'indep_violent',
    'IndType': 'indep_type',
    'COWsys': 'cow_system_ind_date',
    'GWsys': 'gw_system_ind_date',
    'Notes': 'notes'
}, inplace=True )

#### Test

In [13]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 216
Data columns (total 10 columns):
country_code           171 non-null int64
country                171 non-null object
colonizer              171 non-null int64
indep_from             171 non-null int64
indep_date             171 non-null int64
indep_violent          171 non-null int64
indep_type             171 non-null int64
cow_system_ind_date    171 non-null int64
gw_system_ind_date     171 non-null int64
notes                  171 non-null object
dtypes: int64(8), object(2)
memory usage: 14.7+ KB


<a id='assess2'></a>
## Assess, Part 2

In [14]:
print( icow['colonizer'].nunique() )
icow['colonizer'].unique()

16


array([200, 230, 220, 210, 235, 255, 300, 640, 365, 211, 325, 710, 740,
       900,   2, 920], dtype=int64)

In [15]:
print( icow['indep_from'].nunique() )
icow['indep_from'].unique()

32


array([200,   2, 220,  41, 230,  89, 100, 210, 235, 140, 255, 300, 315,
       640, 345, 365, 432, 211, 325, 530, 560,  -9, 625, 678, 710, 730,
       750, 770, 820, 850, 900, 920], dtype=int64)

### Issues Found
`1)` Merge country name into dataframe for columns `colonizer` and `indep_from`.

`2)` Convert `indep_date`, `cow_system_ind_date`, and `gw_system_ind_date` from int to datetime.

<a id='clean2'></a>
## Clean, Part 2

### 1) Merge Country Name for columns with Country Codes
Insert columns to hold country name based on country codes in `colonizer` and `indep_from` columns.

#### Code

In [16]:
columns = ['colonizer','indep_from']
for column in columns:
    # Join on country code, then rename the joined column
    icow = icow.join( country_codes['country_name'], on=column )
    new_column_name = column + '_name'
    icow.rename( columns={'country_name': new_column_name}, inplace=True )

#### Test

In [17]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 216
Data columns (total 12 columns):
country_code           171 non-null int64
country                171 non-null object
colonizer              171 non-null int64
indep_from             171 non-null int64
indep_date             171 non-null int64
indep_violent          171 non-null int64
indep_type             171 non-null int64
cow_system_ind_date    171 non-null int64
gw_system_ind_date     171 non-null int64
notes                  171 non-null object
colonizer_name         171 non-null object
indep_from_name        164 non-null object
dtypes: int64(8), object(4)
memory usage: 17.4+ KB


In [18]:
icow['colonizer_name'].unique(), icow['indep_from_name'].unique()

(array(['United Kingdom', 'Spain', 'France', 'Netherlands', 'Portugal',
        'Germany', 'Austria-Hungary', 'Turkey', 'Russia', 'Belgium',
        'Italy', 'China', 'Japan', 'Australia', 'United States of America',
        'New Zealand'], dtype=object),
 array(['United Kingdom', 'United States of America', 'France', 'Haiti',
        'Spain', nan, 'Colombia', 'Netherlands', 'Portugal', 'Brazil',
        'Germany', 'Austria-Hungary', 'Czechoslovakia', 'Turkey',
        'Yugoslavia', 'Russia', 'Mali', 'Belgium', 'Italy', 'Ethiopia',
        'South Africa', 'Sudan', 'Yemen Arab Republic', 'China', 'Korea',
        'India', 'Pakistan', 'Malaysia', 'Indonesia', 'Australia',
        'New Zealand'], dtype=object))
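The `info()` output shows only 164 of 171 `indep_from_name` values filled, so some `indep_from` codes have no match in the country code table. One way to list them is to look at the codes on the rows where the joined name is missing. This is a sketch on toy data; the rows and the names `codes_demo`, `icow_demo`, and `unmatched` are illustrative assumptions.

```python
import pandas as pd

# Toy lookup table indexed by country code
codes_demo = pd.DataFrame(
    {'country_name': ['United States of America', 'United Kingdom']},
    index=pd.Index([2, 200], name='country_code'),
)
# Toy main table; -9 has no entry in the lookup table
icow_demo = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'indep_from': [200, -9, 2],
})

# Same kind of join as in the cell above, applied to the toy frames
joined = icow_demo.join(codes_demo['country_name'], on='indep_from')

# Codes absent from the lookup table end up as NaN after the join
unmatched = joined.loc[joined['country_name'].isna(), 'indep_from'].unique()
print(unmatched)
```

Running the same two lines against `icow` and `indep_from_name` would show which codes account for the seven missing names.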

### 2) Convert date columns to datetime

Convert `indep_date`, `cow_system_ind_date`, and `gw_system_ind_date` from int to pandas `Period` values with daily frequency.

Note: I originally attempted a conversion to datetime, but pandas timestamps (at the default nanosecond resolution) only cover dates between 1677-09-21 and 2262-04-11. A few values fall outside this range, so the workaround described <a href='https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob' target='_new'>here</a> was used instead.
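The limits can be inspected directly. The short sketch below prints the bounds of the nanosecond-resolution `Timestamp` and shows that a `Period` can hold a date before the lower bound; the 1591 date is the Tunisia value from this dataset, and `early` is a hypothetical name.

```python
import pandas as pd

# Limits of the nanosecond-resolution Timestamp
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# A Period can represent a date before that lower limit
early = pd.Period(year=1591, month=1, day=1, freq='D')
print(early)
```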

#### Code

In [19]:
# Conversion function to create Time Period, copied without modification from Pandas documentation linked in the cell above
def conv(x):
    return pd.Period( year=x // 10000, month=x // 100 % 100, day=x % 100, freq='D' )

In [20]:
# Dates contain only year and month in form YYYYMM
# First, append an '01' to create format YYYYMMDD, then apply conversion function
columns = ['indep_date', 'cow_system_ind_date', 'gw_system_ind_date']
for column in columns:
    icow[column] = ( icow[column].astype('str') + '01' ).astype('int').apply(conv)

#### Test

In [21]:
icow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 216
Data columns (total 12 columns):
country_code           171 non-null int64
country                171 non-null object
colonizer              171 non-null int64
indep_from             171 non-null int64
indep_date             171 non-null period[D]
indep_violent          171 non-null int64
indep_type             171 non-null int64
cow_system_ind_date    171 non-null period[D]
gw_system_ind_date     171 non-null period[D]
notes                  171 non-null object
colonizer_name         171 non-null object
indep_from_name        164 non-null object
dtypes: int64(5), object(4), period[D](3)
memory usage: 17.4+ KB


In [22]:
# Spot check three entries that failed when attempting conversion to datetime
icow[ icow.index.isin( [148,150,170] ) ][['country','indep_date','cow_system_ind_date','gw_system_ind_date']]

Unnamed: 0,country,indep_date,cow_system_ind_date,gw_system_ind_date
148,Morocco,1666-06-01,1847-01-01,1816-01-01
150,Tunisia,1591-01-01,1825-01-01,1816-01-01
170,Oman,1741-01-01,1971-10-01,1816-01-01
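Since the source values carry only a year and a month, an alternative (not what this notebook does) is to build monthly periods directly from the `YYYYMM` integers. That avoids the string round trip and does not imply a specific day of the month. This is a sketch; `conv_month` is a hypothetical helper and the three sample values are taken from dates shown above.

```python
import pandas as pd

def conv_month(x):
    # x is an integer in YYYYMM form, e.g. 183001
    return pd.Period(year=x // 100, month=x % 100, freq='M')

dates = pd.Series([183001, 199112, 159101])
periods = dates.apply(conv_month)
print(periods)
```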


<a id='references'></a>
## References

<ul>
    <li>Paul R. Hensel (2018). "ICOW Colonial History Data Set, version 1.1." Available at <a href='http://www.paulhensel.org/icowcol.html' target='_new'>http://www.paulhensel.org/icowcol.html</a></li>
    <li><a href='http://www.correlatesofwar.org/data-sets/downloadable-files/cow-country-codes' target='_new'>Correlates of War country codes</a></li>
</ul>