# Wrangling Former Colonies
## By: Scott Kustes

### Objective:
Wrangle a list of former colonies, their colonizers, year of colonization, and year of independence.

#### Dataset:
The dataset was gathered by scraping Wikipedia with BeautifulSoup. The following page was scraped:
- British Colonies: https://en.wikipedia.org/wiki/List_of_countries_that_have_gained_independence_from_the_United_Kingdom

#### Contents
<ul>
    <li><a href='#gather'>Data Gathering</a>
        <ul>
            <li><a href='#gather-uk'>United Kingdom</a></li>
        </ul>
    </li>
    <li><a href='#assess-uk'>Assess - United Kingdom</a></li>
    <li><a href='#clean-uk'>Clean - United Kingdom</a></li>
    <li><a href='#final'>Finished Dataframes</a></li>
</ul>

In [87]:
# Import necessary packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os.path as os_path

# Last Tested On
> September 12, 2019

The BeautifulSoup code may need updating if the markup of the source pages has changed since this date.

<a id='gather'></a>
## Gather

<a id='gather-uk'></a>
### United Kingdom

There are 9 tables on the Wikipedia page. The first, second, and last tables will be used in the current wrangling effort.

The first table contains former colonies. The second table contains former British dominions. The final table contains regions that were relinquished from British control without a vote for independence (e.g., Hong Kong).

The remaining tables contain areas of the world still considered part of the UK for one reason or another.
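
Because the tables are picked by position, a change in the page layout could silently shift which table is read. The following is a minimal sketch of a guard against that, using a small inline HTML string in place of the live page. The `select_tables` helper and the sample HTML are assumptions made for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: three tiny 'wikitable' tables
sample_html = """
<table class="wikitable"><tr><th>Country</th></tr><tr><td>Barbados</td></tr></table>
<table class="wikitable"><tr><th>Country</th></tr><tr><td>Canada</td></tr></table>
<table class="wikitable"><tr><th>Country</th></tr><tr><td>Hong Kong</td></tr></table>
"""

def select_tables(html, expected_count):
    """Return the first, second, and last 'wikitable' tables.

    Raises ValueError if the number of tables differs from expected_count,
    which would suggest the page layout has changed.
    """
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all(class_='wikitable')
    if len(tables) != expected_count:
        raise ValueError(f'Expected {expected_count} tables, found {len(tables)}')
    return tables[0], tables[1], tables[-1]

first, second, last = select_tables(sample_html, expected_count=3)
print(last.find('td').text)
```

On the real page, `expected_count` would be 9, matching the count noted above on the last tested date.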

In [88]:
# Import the page
url = 'https://en.wikipedia.org/wiki/List_of_countries_that_have_gained_independence_from_the_United_Kingdom'
page = requests.get( url )
soup = BeautifulSoup( page.text, 'html.parser' )

# Get all of the tables on this page
tables = soup.find_all( class_='wikitable' )

#### Get data from `Colonies` table

In [99]:
# Get the first table on the page, Colonies
# Additional tables in indices 1-8
colonies = tables[0].find( 'tbody' ).find_all( 'tr' )
colonies_header = []
colonies_data = []

# Loop through the colonies, gathering the data held in each table cell ('td') into an array
# Save the values in the first row to colonies_header
# Save remaining values to colonies_data
for index, row in enumerate( colonies ):
    # Get the values in the columns
    columns = row.find_all( 'td' ) if index > 0 else row.find_all( 'th' )

    # If it isn't the first row, it's data
    if index > 0:
        # Append to colonies_data
        colonies_data.append( [element.text.strip() for element in columns ] )
    # If it is the first row, it's headers
    else:
        # Save the header names as a flat list
        colonies_header = [element.text.strip() for element in columns ]

colonies_df = pd.DataFrame( data=colonies_data, columns=colonies_header )
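
The same header-then-rows loop is needed for every table on the page, so it could be wrapped in a small function. This is a sketch only: `table_to_dataframe` is a hypothetical helper, it searches for `tr` elements directly on the table rather than going through `tbody`, and the sample HTML is a hand-written stand-in built from two rows of the Colonies table.

```python
from bs4 import BeautifulSoup
import pandas as pd

def table_to_dataframe(table):
    """Convert a bs4 table Tag into a DataFrame, treating the first row as the header."""
    rows = table.find_all('tr')
    # The first row holds the column names in 'th' cells
    header = [cell.text.strip() for cell in rows[0].find_all('th')]
    # Every later row holds data in 'td' cells
    data = [[cell.text.strip() for cell in row.find_all('td')] for row in rows[1:]]
    return pd.DataFrame(data=data, columns=header)

# Hypothetical miniature of the Colonies table
sample_html = """
<table class="wikitable"><tbody>
<tr><th>Country</th><th>Date</th><th>Year of Independence</th><th>Notes</th></tr>
<tr><td>Barbados</td><td>30 November</td><td>1966</td><td>Barbados Independence Act 1966</td></tr>
<tr><td>Dominica</td><td>3 November</td><td>1978</td><td></td></tr>
</tbody></table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find(class_='wikitable')
sample_df = table_to_dataframe(sample_table)
print(sample_df)
```

With a helper like this, each of the three tables would reduce to a single call, for example `table_to_dataframe( tables[0] )`.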

In [103]:
colonies_df.sample(5)

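|    | Country | Date | Year of Independence | Notes |
|----|---------|------|----------------------|-------|
| 11 | Dominica | 3 November | 1978 | |
| 5 | Barbados | 30 November | 1966 | Barbados Independence Act 1966 |
| 32 | Maldives | 26 July | 1965 | |
| 46 | Solomon Islands | 7 July | 1978 | |
| 21 | Iraq | 3 October | 1932 | |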


#### Get data from `Evolution of Dominions to Independence` table

In [101]:
# Get the second table on the page, Evolution of Dominions to Independence
dominions = tables[1].find( 'tbody' ).find_all( 'tr' )
dominions_header = []
dominions_data = []

# Loop through the dominions, gathering the text held in each table cell into a list
# Save the values in the first row ('th' cells) to dominions_header
# Save the remaining rows ('td' cells) to dominions_data
for index, row in enumerate( dominions ):
    # Get the values in the columns
    columns = row.find_all( 'td' ) if index > 0 else row.find_all( 'th' )

    # If it isn't the first row, it's data
    if index > 0:
        # Append to dominions_data
        dominions_data.append( [element.text.strip() for element in columns ] )
    # If it is the first row, it's headers
    else:
        # Save the header names as a flat list
        dominions_header = [element.text.strip() for element in columns ]

dominions_df = pd.DataFrame( data=dominions_data, columns=dominions_header )

In [102]:
dominions_df

|   | Country | Date of Dominion Status | Date of adoption of the Statute of Westminster | Date of final relinquishment of British powers | Final Event in question. | Other important Dates |
|---|---------|-------------------------|------------------------------------------------|------------------------------------------------|--------------------------|-----------------------|
| 0 | Australia | 1 January 1901 | 9 October 1942 (effective from 1939) | 3 March 1986 | Australia Act 1986 | |
| 1 | Canada | 1 July 1867 | 11 December 1931 | 17 April 1982 | Canada Act 1982 | |
| 2 | Ireland | 6 December 1922 | 11 December 1931 | 18 April 1949 | Republic of Ireland Act and Ireland Act 1949 | The 1916 Proclamation of the Irish Republic an... |
| 3 | Dominion of Newfoundland | 26 September 1907 | — | 17 April 1982 | Canada Act 1982 | Newfoundland voted to join Canada in 1948 in a... |
| 4 | South Africa | 31 May 1910 | 11 December 1931 | 21 May 1961 | South African Constitution of 1961 | |
| 5 | New Zealand | 26 September 1907 | 25 November 1947 | 13 December 1986 | Constitution Act 1986 | Declaration of Independence of New Zealand 183... |


#### Get data from `Countries or region which did not vote to terminate British rule yet were relinquished` table

In [104]:
# Get the final table on the page, Countries or region which did not vote to terminate British rule yet were relinquished
relinquished = tables[8].find( 'tbody' ).find_all( 'tr' )
relinquished_header = []
relinquished_data = []

# Loop through the relinquished regions, gathering the text held in each table cell into a list
# Save the values in the first row ('th' cells) to relinquished_header
# Save the remaining rows ('td' cells) to relinquished_data
for index, row in enumerate( relinquished ):
    # Get the values in the columns
    columns = row.find_all( 'td' ) if index > 0 else row.find_all( 'th' )

    # If it isn't the first row, it's data
    if index > 0:
        # Append to relinquished_data
        relinquished_data.append( [element.text.strip() for element in columns ] )
    # If it is the first row, it's headers
    else:
        # Save the header names as a flat list
        relinquished_header = [element.text.strip() for element in columns ]

relinquished_df = pd.DataFrame( data=relinquished_data, columns=relinquished_header )

In [106]:
relinquished_df

|   | Country | Date | Year | Notes |
|---|---------|------|------|-------|
| 0 | Hong Kong | 30 June | 1997 | In 1984 the British government signed the Sino... |


#### Export list of countries to csv
If the file `colonization_date_uk.csv` doesn't exist, get a list of all of the countries in the 3 dataframes and export them to that file. This csv will be manually updated with dates of colonization since the data pulled from Wikipedia only contains dates of independence.

If the file exists, read it in.

In [128]:
filename = 'colonization_date_uk.csv'
if os_path.isfile( filename ):
    colonization_date_uk = pd.read_csv( filename, index_col=0 )
else:
    countries = pd.concat( [colonies_df['Country'], dominions_df['Country'], relinquished_df['Country']], ignore_index=True )
    countries.to_csv( filename, index=False, header=['country'] )
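
The read-if-present, otherwise-create pattern can also be written so that a dataframe is returned on both paths. The sketch below is an assumption-laden variant, not the notebook's method: `load_or_create_country_list` is a hypothetical name, the file is read without `index_col`, and a temporary directory with a three-country list stands in for the real file and dataframes.

```python
import os
import tempfile
import pandas as pd

def load_or_create_country_list(filename, countries):
    """Read the CSV if it exists; otherwise write the country list and return it."""
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    frame = pd.DataFrame({'country': list(countries)})
    frame.to_csv(filename, index=False)
    return frame

# Use a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'colonization_date_uk.csv')
    # First call: the file is missing, so it is created
    created = load_or_create_country_list(path, ['Barbados', 'Canada', 'Hong Kong'])
    # Second call: the file exists, so it is read back and the list is ignored
    reloaded = load_or_create_country_list(path, [])
    print(list(reloaded['country']))
```

Returning a frame in both branches means later cells can rely on the variable existing even on the first run, before any colonization dates have been filled in by hand.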

<a id='assess-uk'></a>
## Assess - United Kingdom

In [107]:
colonies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 4 columns):
Country                 62 non-null object
Date                    62 non-null object
Year of Independence    62 non-null object
Notes                   62 non-null object
dtypes: object(4)
memory usage: 2.0+ KB


In [108]:
dominions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
Country                                           6 non-null object
Date of Dominion Status                           6 non-null object
Date of adoption of the Statute of Westminster    6 non-null object
Date of final relinquishment of British powers    6 non-null object
Final Event in question.                          6 non-null object
Other important Dates                             6 non-null object
dtypes: object(6)
memory usage: 368.0+ bytes


In [109]:
relinquished_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 4 columns):
Country    1 non-null object
Date       1 non-null object
Year       1 non-null object
Notes      1 non-null object
dtypes: object(4)
memory usage: 112.0+ bytes


In [119]:
former_british_colonies = pd.concat( [colonies_df, relinquished_df], ignore_index=True, sort=False )
former_british_colonies.sample(5)

|    | Country | Date | Year of Independence | Notes | Year |
|----|---------|------|----------------------|-------|------|
| 4 | Bahrain | 15 August | 1971 | | |
| 10 | Cyprus | 1 October | 1960 | 16 August 1960, but Cyprus Independence Day is... | |
| 21 | Iraq | 3 October | 1932 | | |
| 27 | Kuwait | 19 June | 1961 | | |
| 23 | Jamaica | 6 August | 1962 | Independence Day (6 August) | |
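
The concatenated frame carries both a `Year of Independence` and a `Year` column because the two source tables name that column differently. One possible way to line them up is to rename the column before concatenating. This is a sketch on hand-typed sample rows; treating Hong Kong's 1997 handover year as a "Year of Independence" is an assumption made only for this example.

```python
import pandas as pd

# Small stand-ins for colonies_df and relinquished_df
colonies_sample = pd.DataFrame({
    'Country': ['Bahrain', 'Jamaica'],
    'Date': ['15 August', '6 August'],
    'Year of Independence': ['1971', '1962'],
})
relinquished_sample = pd.DataFrame({
    'Country': ['Hong Kong'],
    'Date': ['30 June'],
    'Year': ['1997'],
})

# Rename 'Year' so both frames share one column name, then stack them
aligned = pd.concat(
    [colonies_sample, relinquished_sample.rename(columns={'Year': 'Year of Independence'})],
    ignore_index=True,
    sort=False,
)
print(aligned)
```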


### Issues Found:
1) Drop columns: `Global Code` and `Global Name` - only 1 unique value

2) Rename columns: replace spaces with underscores, replace uppercase with lowercase
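
As an illustration of the renaming described in issue 2, the column names can be normalized with pandas string methods. This is a minimal sketch on an empty frame that only borrows the Colonies column names; `sample_df` is a placeholder name.

```python
import pandas as pd

# Empty frame with the same column names as the Colonies table
sample_df = pd.DataFrame(columns=['Country', 'Date', 'Year of Independence', 'Notes'])

# Lowercase every name, then swap spaces for underscores (plain string match, no regex)
sample_df.columns = sample_df.columns.str.lower().str.replace(' ', '_', regex=False)

print(list(sample_df.columns))
# → ['country', 'date', 'year_of_independence', 'notes']
```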



<a id='clean-uk'></a>
## Clean - United Kingdom
### 1) Drop columns `Global Code` and `Global Name`

Drop `Global Code` and `Global Name` columns due to each having only 1 unique value (1 and World, respectively).
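
Columns with a single unique value can also be found programmatically rather than by name. The sketch below shows one generic way to do that and is not necessarily how the notebook drops them; `sample_df` is made-up data that reuses the column names and values mentioned above.

```python
import pandas as pd

# Made-up frame: two constant columns and one that varies
sample_df = pd.DataFrame({
    'Global Code': [1, 1, 1],
    'Global Name': ['World', 'World', 'World'],
    'Country': ['Barbados', 'Dominica', 'Iraq'],
})

# Collect every column whose values are all identical (counting NaN as a value)
single_valued = [col for col in sample_df.columns if sample_df[col].nunique(dropna=False) == 1]

# Drop those columns, keeping the rest untouched
cleaned = sample_df.drop(columns=single_valued)

print(single_valued)
print(list(cleaned.columns))
```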

#### Code