# The Hypothesis

“Since the UK was one of the main countries that colonised the USA, and the UK is on the east side of
the USA there are more towns/cities with UK names on the east coast of the US rather than the west
coast”

## Data

https://github.com/apache/commons-csv/raw/master/src/test/resources/perf/worldcitiespop.txt.gz

Let's get the imports out of the way.

In [4]:
import pandas as pd

In [5]:
dataset = "/home/stormfield/scratch/DLG/worldcitiespop.txt"

In [11]:
dataset_df = pd.read_csv(dataset,encoding = "ISO-8859-1", low_memory=False)

The file wasn't UTF-8 encoded. It is 'ISO-8859-1' (Latin-1) encoded.

(I was about to seek clarification regarding the encoding, but on closer examination of how the dataset is actually used in the Apache Commons CSV benchmarking test class on GitHub, it seems the file is in fact 'ISO-8859-1'. Please refer to [this](https://github.com/apache/commons-csv/blob/master/src/test/java/org/apache/commons/csv/CSVBenchmark.java) Java file, lines 64 and 66.)
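
The encoding problem can also be reproduced on a few bytes without loading the whole file. The following is a minimal sketch, not part of the original analysis: the helper `sniff_encoding` and the sample line are hypothetical, the sample being built from the first Andorran row shown further down.

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "ISO-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# Hypothetical sample line: "Aixàs" stored as Latin-1 bytes.
# The byte 0xE0 followed by a plain "s" is not valid UTF-8.
sample = "ad,aixas,Aixàs,6,,42.483333,1.466667\n".encode("ISO-8859-1")
detected = sniff_encoding(sample)
print(detected)
```

Note that ISO-8859-1 can decode any byte sequence, so it acts as a fallback here rather than as proof of the real encoding; the benchmark class linked above remains the better evidence.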

## Data Exploration

Let's see what the data looks like.

In [12]:
dataset_df.head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
0,ad,aixas,Aixàs,6,,42.483333,1.466667
1,ad,aixirivali,Aixirivali,6,,42.466667,1.5
2,ad,aixirivall,Aixirivall,6,,42.466667,1.5
3,ad,aixirvall,Aixirvall,6,,42.466667,1.5
4,ad,aixovall,Aixovall,6,,42.466667,1.483333


Let's verify the assumption about the country codes, i.e. GB for the United Kingdom and US for the United States of America (stored in lowercase in this dataset). [source](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes)

In [17]:
# united kingdom
'gb' in dataset_df['Country'].unique()

True

In [18]:
# USA
'us' in dataset_df['Country'].unique()

True

Let's see what the rows for each country look like.

In [21]:
#uk_cities  
dataset_df.loc[dataset_df['Country'] == 'gb' ].head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
826690,gb,abberley,Abberley,Q4,,52.3,-2.366667
826691,gb,abberton,Abberton,F2,,51.833333,0.916667
826692,gb,abberton,Abberton,F7,,52.183333,-2.016667
826693,gb,abbess roding,Abbess Roding,E4,,51.783333,0.266667
826694,gb,abbey-cwmhir,Abbey-Cwmhir,Y8,,52.333333,-3.4


In [30]:
#us_cities
dataset_df.loc[dataset_df['Country'] == 'us' ].head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2532482,us,abanda,Abanda,AL,,33.100833,-85.529722
2532483,us,abbeville,Abbeville,AL,,31.571667,-85.250556
2532484,us,abbot springs,Abbot Springs,AL,,33.360833,-86.481667
2532485,us,abel,Abel,AL,,33.548611,-85.7125
2532486,us,abercrombie,Abercrombie,AL,,32.848611,-87.165


In [116]:
us = dataset_df.loc[dataset_df['Country'] == 'us' ].copy()

In [117]:
uk=  dataset_df.loc[dataset_df['Country'] == 'gb' ].copy()

In [118]:
us.duplicated('City').sum()

56025

In [119]:
uk.duplicated('City').sum()

1437

In [120]:
us.loc[us['City'] == 'london' ].head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2534705,us,london,London,AL,,31.2975,-87.087778
2541129,us,london,London,AR,,35.328889,-93.252778
2545390,us,london,London,CA,,36.476111,-119.442222
2567874,us,london,London,IN,,39.625556,-85.920278
2575484,us,london,London,KY,,37.128889,-84.083333


Since the task is to see whether UK city names appear in the US, we only need the distinct city names in the UK.

For the US, however, we need to keep all the cities, since the same city name can occur in more than one region. For example, the UK city name `London` appears in more than one US state, and each of those occurrences is valid for our analysis.

So, as the next step, we keep only the distinct city names in the UK data.

In [123]:
uk.drop_duplicates(subset='City', inplace=True)
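
The asymmetry between the two countries can be illustrated on toy data. This is only a sketch: the frames `uk_toy` and `us_toy` and their rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only.
uk_toy = pd.DataFrame({"City": ["abberton", "abberton", "london"],
                       "Region": ["F2", "F7", "H9"]})
us_toy = pd.DataFrame({"City": ["london", "london", "abel"],
                       "Region": ["AL", "KY", "AL"]})

# One row per distinct UK name; the first occurrence is kept.
uk_names = uk_toy.drop_duplicates(subset="City")

# US rows are left untouched, so every place called "london" is counted.
london_count = int((us_toy["City"] == "london").sum())

print(len(uk_names), london_count)
```

Deduplicating the UK side prevents a name such as `abberton` from being credited twice, while keeping the US duplicates preserves the information about how often a name was reused.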

**Thoughts before further Analysis**

The map of USA with state names looks like the following:

![Image of USA with State names](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Map_of_USA_with_state_names.svg/1000px-Map_of_USA_with_state_names.svg.png)

To proceed further we need to define which cities fall on the east coast and which on the west coast. One way is to use the latitude and longitude of each city, but that would involve more analysis and data crunching to work out which coast each city lies on. Going down that route, we would also need to account for the irregular shape of the country, which would make the analysis harder.

An easier approach is to use the more widely recognized definition from Wikipedia, in which:
- East coast of the USA means the coastal states that have a shoreline on the Atlantic Ocean
- West coast of the USA means the coastal states that have a shoreline on the Pacific Ocean
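
For comparison, the rejected coordinate-based route could look roughly like the sketch below. The two longitude cut-offs and the helper `coast_by_longitude` are assumptions chosen purely for illustration; they are not a recognized definition of either coast. The rows are copied from dataset samples in this notebook.

```python
import pandas as pd

# Assumed cut-offs in degrees of longitude; arbitrary illustrative values.
WEST_MAX_LON = -115.0
EAST_MIN_LON = -82.0


def coast_by_longitude(lon: float) -> str:
    """Label a place by longitude alone (a deliberately naive rule)."""
    if lon <= WEST_MAX_LON:
        return "west"
    if lon >= EAST_MIN_LON:
        return "east"
    return "neither"


# Rows taken from the dataset samples in this notebook.
places = pd.DataFrame({
    "City": ["london", "london", "lake view", "mason"],
    "Region": ["CA", "AL", "CT", "PA"],
    "Longitude": [-119.442222, -87.087778, -72.511944, -79.847222],
})
places["coast"] = places["Longitude"].apply(coast_by_longitude)
print(places)
```

A straight cut-off like this ignores the coastline entirely: it would label inland places in states such as Nevada as "west", and it would miss the western parts of Georgia and Florida. That is the kind of extra work the state-based definition avoids.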

**Notes**

 - **Alaska & Hawaii** were never colonized by the British, but I will still be considering them. The rationale is given [here](#another_cell) with data to support it.
 - **Fun Fact**: Alaska was [purchased](https://en.wikipedia.org/wiki/Alaska_Purchase) from Russia by the then United States. Russia didn't want to sell it to the UK, so that the UK wouldn't get a stronghold there, but both the UK and the US were approached in the hope of starting a bidding war. The then British Prime Minister, Lord Palmerston, steadfastly rejected the offer, arguing that Canada had enough uncharted wilderness to deal with and that Britain would overstretch its resources in maintaining Alaska. Hence Alaska was purchased by the US for today's equivalent of a little over 100 million dollars.
 
 - **Pennsylvania**  
 While Pennsylvania is not directly on the Atlantic shoreline, it borders the tidal portion of the Delaware River, and the city of Philadelphia was a major seaport. Hence we will include it in our analysis. [Read more here](https://en.wikipedia.org/wiki/East_Coast_of_the_United_States#cite_ref-3)
 
 - **Fun Fact** : The original [thirteen colonies](https://en.wikipedia.org/wiki/Thirteen_Colonies) of Great Britain in North America all lay along the East Coast. [see citation](https://en.wikipedia.org/wiki/East_Coast_of_the_United_States#cite_ref-3)


**Fun fact**: According to the infographic in this [article](https://en.wikipedia.org/wiki/European_colonization_of_the_Americas#English_and_(after_1707)_British), the UK never colonized the lower part, i.e. the south-western US. The GIF below shows the same.

<img src="https://upload.wikimedia.org/wikipedia/commons/4/40/Non-Native-American-Nations-Territorial-Claims-over-NAFTA-countries-1750-2008.gif" width="500">

 **West coast states** are: California, Oregon, Washington, and Alaska. [reference](https://en.wikipedia.org/wiki/West_Coast_of_the_United_States)

 **East coast states** are: Maine, New Hampshire, Massachusetts, Rhode Island, Connecticut, New York, New Jersey, Delaware, Maryland, Virginia, North Carolina, South Carolina, Georgia, and Florida. [reference](https://en.wikipedia.org/wiki/East_Coast_of_the_United_States#cite_note-East_Coast_States-1)

In [131]:
us['Region'].unique()

array(['AL', 'AK', 'AS', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
       'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME',
       'MH', 'MD', 'MA', 'MI', 'FM', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV',
       'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'MP', 'OH', 'OK', 'OR', 'PW',
       'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VI', 'VA',
       'WA', 'WV', 'WI', 'WY'], dtype=object)

So the abbreviations are in the [ANSI](https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations) two-letter format.

Let's make lists for easy handling.

In [317]:
west_coast_state_list = ['CA', 'WA', 'OR']  # 'AK', 'HI' left out here
east_coast_state_list = ['ME', 'NH', 'MA', 'RI', 'CT', 'NY', 'PA', 'NJ',
                         'DE', 'MD', 'VA', 'NC', 'SC', 'GA', 'FL']

In [318]:
us_west_coast_df = us[us['Region'].isin(west_coast_state_list)].copy()
us_east_coast_df = us[us['Region'].isin(east_coast_state_list)].copy()
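
An alternative way to express the same split is a single `coast` label per row. The sketch below is only one possible formulation, not the approach used in this notebook: the lookup `coast_of` and the frame `us_toy` are hypothetical names, and the four rows are a small hand-picked subset of the dataset.

```python
import pandas as pd

west = ["CA", "WA", "OR"]
east = ["ME", "NH", "MA", "RI", "CT", "NY", "PA", "NJ",
        "DE", "MD", "VA", "NC", "SC", "GA", "FL"]

# Region code -> coast label; states on neither list are simply absent.
coast_of = {**{s: "west" for s in west}, **{s: "east" for s in east}}

us_toy = pd.DataFrame({
    "City": ["reston", "lake view", "abanda", "mason"],
    "Region": ["OR", "CT", "AL", "PA"],
})

# Regions missing from the lookup become NaN, i.e. "not coastal".
us_toy["coast"] = us_toy["Region"].map(coast_of)
counts = us_toy["coast"].value_counts()
print(us_toy)
print(counts)
```

Two separate dataframes, as used above, keep the later steps simpler to read; a label column would be handier if more groupings were added.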

Let's see what the data looks like.

In [319]:
us_west_coast_df.sample(2)

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2630710,us,reston,Reston,OR,,43.130278,-123.618889
2630411,us,lena,Lena,OR,,45.4,-119.280833


In [320]:
us_east_coast_df.sample(2)

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2549966,us,lake view,Lake View,CT,,41.356111,-72.511944
2635791,us,mason,Mason,PA,,41.473611,-79.847222


Nothing out of the ordinary so far.
For our analysis, though, the city names alone are enough,
so let's take just the city names into a new dataframe.

In [200]:
us_west_coast_cities = us_west_coast_df.filter(['City'],axis=1)

In [201]:
us_east_coast_cities = us_east_coast_df.filter(['City'],axis=1)

Now that we have the relevant US cities, we can easily add their counts to our UK dataframe.

In [234]:
uk['east_coast_name_count'] = uk['City'].map(us_east_coast_cities['City'].value_counts())
uk['west_coast_name_count'] = uk['City'].map(us_west_coast_cities['City'].value_counts())
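
How `map` combined with `value_counts` produces these columns is easiest to see on toy data. This is a minimal sketch with made-up rows; `uk_toy` and `east_toy` are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Made-up rows for illustration only.
uk_toy = pd.DataFrame({"City": ["selma", "crambe"]})
east_toy = pd.DataFrame({"City": ["selma", "selma", "selma", "mason"]})

# value_counts() returns a Series indexed by city name.
name_counts = east_toy["City"].value_counts()

# map() looks each UK name up in that index; names without a match give NaN.
uk_toy["east_coast_name_count"] = uk_toy["City"].map(name_counts)
print(uk_toy)
```

Because missing names come back as `NaN`, the resulting column has a float dtype, which is why the counts in the outputs appear as `5.0` rather than `5`.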

In [209]:
uk.sample(5)

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude,east_coast_name_count,west_coast_name_count
839147,gb,redberth,Redberth,Y7,,51.7025,-4.775278,,
831169,gb,diddington,Diddington,C3,,52.266667,-0.25,,
826756,gb,abergorloch,Abergorloch,X7,,51.983056,-4.060278,,
838353,gb,parkside,Parkside,C7,,52.4,-1.5,5.0,
830603,gb,crambe,Crambe,Q5,,54.066667,-0.866667,,


As expected, there are UK city names that aren't present in the US. Let's fill those with 0 instead of `NaN`.

In [235]:
uk.fillna({'east_coast_name_count':0, 'west_coast_name_count':0},inplace=True)

In [220]:
uk.sample(5)

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude,east_coast_name_count,west_coast_name_count
840116,gb,shefford,Shefford,A5,,51.466667,-1.45,0.0,0.0
840030,gb,selma,Selma,T8,,56.483333,-5.4,3.0,2.0
828310,gb,bishopton,Bishopton,W2,5029.0,55.9,-4.5,0.0,0.0
839264,gb,rickinghall,Rickinghall,N5,,52.333333,1.0,0.0,0.0
836195,gb,llansteffan,Llansteffan,X7,,51.772222,-4.391389,0.0,0.0


In [315]:
uk.loc[uk['City']=='thompson']#.sample(5)

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
841502,gb,thompson,Thompson,I9,,52.533333,0.833333


Sanity check: see whether the city `thompson` (which has a count of 7) actually appears that many times in our original east coast list.

In [316]:

us_east_coast_cities[us_east_coast_cities['City']=='thompson']

Unnamed: 0,City
2550395,thompson
2551906,thompson
2555073,thompson
2582172,thompson
2588823,thompson
2616980,thompson
2638610,thompson
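
The same kind of check can be automated instead of inspecting one name by eye. The sketch below assumes toy data; `east_toy`, `uk_toy` and the helper `count_rows` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
east_toy = pd.DataFrame({"City": ["thompson", "thompson", "mason"]})
uk_toy = pd.DataFrame({"City": ["thompson", "crambe"]})
uk_toy["east_coast_name_count"] = (
    uk_toy["City"].map(east_toy["City"].value_counts()).fillna(0)
)


def count_rows(name: str) -> int:
    """Count matching rows directly, independently of value_counts()."""
    return int((east_toy["City"] == name).sum())


# Collect every UK name whose stored count disagrees with a direct count.
mismatches = [
    row.City
    for row in uk_toy.itertuples(index=False)
    if row.east_coast_name_count != count_rows(row.City)
]
print(mismatches)
```

An empty list means every stored count agrees with a direct row count. On the full UK dataframe a loop like this would be slow, so checking a random sample of names is probably enough.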


<a id='another_cell'></a>
**Thoughts on including Hawaii & Alaska**

Case in point: [Wales, AK, USA](https://en.wikipedia.org/wiki/Wales,_Alaska). Even though this Wales was never colonized by the UK, its name is derived from Wales in the UK. This happened around the 1890s. So there could be city names in Hawaii or Alaska that have some very distant relation to English names.

While drilling through the data below I found only 23 such English names, most of them from Alaska. Regardless, this wouldn't statistically change the result by a large factor, since the total number of cities for AK & HI is 1145.
So it's debatable how granular you want the analysis to be.

In [None]:
# Places in Alaska and Hawaii only
us_west_coast_AK_HI = us[us['Region'].isin(['AK', 'HI'])].copy()
us_west_coast_AK_HI_cities = us_west_coast_AK_HI.filter(['City'], axis=1)

# Count AK/HI matches for each distinct UK name on a scratch copy of uk
uk_temp = uk.copy()
uk_temp['west_coast_name_count'] = (
    uk_temp['City']
    .map(us_west_coast_AK_HI_cities['City'].value_counts())
    .fillna(0)
)
test = us[us['Region'].isin(['AK', 'HI'])].copy()


In [279]:
test.loc[test['City'] =='wales' ]

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2537388,us,wales,Wales,AK,,65.609167,-168.0875


In [280]:
uk_temp.loc[( uk_temp['City'] =='wales') ]

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude,west_coast_name_count
842273,gb,wales,Wales,C9,,53.333333,-1.283333,1.0


In [307]:
test.loc[test['City'] =='westgate' ]

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2537394,us,westgate,Westgate,AK,,64.837222,-147.795833


In [312]:
uk_temp.loc[( uk_temp['west_coast_name_count'] > 0) ].shape

(23, 8)
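
The two figures quoted above, the number of matching UK names and the total number of AK & HI places, could be put side by side as in the sketch below. All rows and the names `us_toy`, `uk_names`, `matched` and `share` are made up for illustration, so the toy result says nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
us_toy = pd.DataFrame({
    "City": ["wales", "westgate", "anchorage", "hilo", "london"],
    "Region": ["AK", "AK", "AK", "HI", "KY"],
})
uk_names = pd.Series(["wales", "westgate", "crambe"], name="City")

# Restrict to Alaska and Hawaii, then see which UK names occur there.
ak_hi = us_toy[us_toy["Region"].isin(["AK", "HI"])]
matched = uk_names[uk_names.isin(ak_hi["City"])]

# Matching UK names relative to the number of AK/HI places.
share = len(matched) / len(ak_hi)
print(len(matched), len(ak_hi), share)
```

With the real figures, 23 matching names against 1145 places is about 2%, which is consistent with the view that including or excluding AK & HI changes little.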