# Creating a Country-Nationality Crosswalk

I needed a modified country-nationality crosswalk for my project, so I adapted a list put together by `Dinu`, which can be found [here](https://github.com/Dinu/country-nationality-list/blob/master/countries.csv).

I made some initial changes to the raw `countries.csv` before using it to build my crosswalk:
* Added "Ivory Coast"
* Added "Burma"
* Added "Great Britain"
* Added "North Korea"
* Added "South Korea"
* Added "Laos" and "Lao"
* Added "Russia"
* Added "Republic of Congo" and "Democratic Republic of Congo"
* Added duplicates for all islands with commas in their names (e.g., Bonaire)
* Added "Tajik"
* Added "Turkmenistani"
* Added "U.S."
* Added "DPRK"
* Added "ROK"
* Added "USA"
* Added "UAE"
* Added "Abu Dhabi"
* Added "WPK"
* Added "DR Congo"
* Removed all parenthetical parts and everything after a comma from country names
* Modified "United States of America" to "United States"
* Modified "Vatican City State" to "Vatican City"
* Modified "Taiwan" entry to solely "Taiwanese" (previously contained "Chinese")
* Modified "Macau" entry to solely "Macanese" (previously contained "Chinese")
* Modified "Brunei Darussalam" to "Brunei"
* Removed entry for "United States Minor Outlying Islands", so "American" will only return "United States"
* Removed "Russian Federation"
* Removed "Lao" and "Lao PDR" from countries list
* Removed entry for "China" (because I want to match on all *other* countries, excluding China)
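
I made these edits by hand in the CSV, but a few of them could also be scripted. The following is only a minimal sketch on a toy frame, not what was actually run: the sample rows, the `clean_name` helper, and the `renames`, `removals`, and `additions` names are all hypothetical.

```python
import re

import pandas as pd

# Toy rows imitating the raw file's layout (hypothetical sample, not the real data).
raw = pd.DataFrame({
    "en_short_name": [
        "Bolivia (Plurinational State of)",
        "Bonaire, Sint Eustatius and Saba",
        "United States of America",
        "Russian Federation",
    ],
    "nationality": ["Bolivian", "Bonaire", "American", "Russian"],
})


def clean_name(name: str) -> str:
    """Drop any parenthesised part, then anything after the first comma."""
    no_parens = re.sub(r"\s*\(.*?\)", "", name)
    return no_parens.split(",")[0].strip()


# Hypothetical edit tables mirroring three kinds of change from the list above.
renames = {"United States of America": "United States"}
removals = ["Russian Federation"]
additions = pd.DataFrame({"en_short_name": ["Russia"], "nationality": ["Russian"]})

cleaned = raw[~raw["en_short_name"].isin(removals)].copy()
cleaned["en_short_name"] = cleaned["en_short_name"].replace(renames).map(clean_name)
cleaned = pd.concat([cleaned, additions], ignore_index=True)
```

Keeping the edits in small tables like `renames` and `removals` would make them easier to re-apply if the upstream list changes.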

#### Exploding rows that have multiple entries under `nationality`

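Some rows list several nationalities in a single `nationality` cell, separated by either a comma or the word 'or'. Each of these needs to become its own row.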
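
For comparison, here is a compact alternative to the approach used in this notebook. It is a sketch under assumptions: it relies on `DataFrame.explode` (available in pandas 0.25 and later) rather than the `stack()` idiom, and the `toy` frame and `SEPARATORS` pattern are hypothetical names introduced for illustration.

```python
import re

import pandas as pd

# Hypothetical sample covering the three cases: comma, ' or ', and both.
toy = pd.DataFrame({
    "en_short_name": ["Azerbaijan", "Antigua and Barbuda", "Wallis and Futuna", "Albania"],
    "nationality": [
        "Azerbaijani, Azeri",
        "Antiguan or Barbudan",
        "Wallis and Futuna, Wallisian or Futunan",
        "Albanian",
    ],
})

# A comma, or the standalone word "or", together with any surrounding whitespace.
SEPARATORS = re.compile(r"\s*,\s*|\s+or\s+")

# Turn each cell into a list of nationalities, then give each list item its own row.
parts = toy["nationality"].map(SEPARATORS.split)
exploded = toy.assign(nationality=parts).explode("nationality").reset_index(drop=True)
```

Because the pattern requires whitespace around "or", a name such as "Wallis and Futuna" is left intact, and both separators are handled in one pass.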

In [21]:
import pandas as pd

countries = pd.read_csv('C:data/countries.csv')
countries.head()

Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
0,4,AF,AFG,Afghanistan,Afghan
1,248,AX,ALA,Åland Islands,Åland Island
2,8,AL,ALB,Albania,Albanian
3,12,DZ,DZA,Algeria,Algerian
4,16,AS,ASM,American Samoa,American Samoan


In [22]:
c = countries.copy()
print(c.shape)  # (255, 5)
c[c.nationality.str.contains(' or ')]  # rows whose nationality contains ' or '

(255, 5)


Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
9,28,AG,ATG,Antigua and Barbuda,Antiguan or Barbudan
28,70,BA,BIH,Bosnia and Herzegovina,Bosnian or Herzegovinian
97,334,HM,HMD,Heard Island and McDonald Islands,Heard Island or McDonald Islands
191,659,KN,KNA,Saint Kitts and Nevis,Kittitian or Nevisian
194,666,PM,SPM,Saint Pierre and Miquelon,Saint-Pierrais or Miquelonnais
211,239,GS,SGS,South Georgia and the South Sandwich Islands,South Georgia or South Sandwich Islands
230,780,TT,TTO,Trinidad and Tobago,Trinidadian or Tobagonian
250,876,WF,WLF,Wallis and Futuna,"Wallis and Futuna, Wallisian or Futunan"


8 records contain 'or' and 35 contain a comma; one record (Wallis and Futuna) contains both.
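
Counts like these can be checked with boolean masks. The snippet below is a minimal sketch on a hypothetical four-row `toy` frame, so its numbers are not the counts for the real file.

```python
import pandas as pd

# Hypothetical sample; the real file is read into `countries` in the notebook.
toy = pd.DataFrame({"nationality": [
    "Azerbaijani, Azeri",
    "Antiguan or Barbudan",
    "Wallis and Futuna, Wallisian or Futunan",
    "Albanian",
]})

# regex=False makes these plain substring tests.
has_comma = toy["nationality"].str.contains(",", regex=False)
has_or = toy["nationality"].str.contains(" or ", regex=False)

counts = {
    "comma": int(has_comma.sum()),
    "or": int(has_or.sum()),
    "both": int((has_comma & has_or).sum()),
}
```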

In [23]:
nc = c[c.nationality.str.contains(',')]
nc.head()

Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
15,31,AZ,AZE,Azerbaijan,"Azerbaijani, Azeri"
23,204,BJ,BEN,Benin,"Beninese, Beninois"
24,60,BM,BMU,Bermuda,"Bermudian, Bermudan"
29,72,BW,BWA,Botswana,"Motswana, Botswanan"
48,174,KM,COM,Comoros,"Comoran, Comorian"


In [24]:
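# Split each comma-separated nationality into its own row, keyed by country name
# (leading spaces left by the split are stripped later, before saving)
d = pd.DataFrame(nc.nationality.str.split(',').tolist(), index=nc.en_short_name).stack()
d = d.reset_index()[[0, 'en_short_name']]
d.columns = ['nationality','en_short_name']
# Split again on ' or ' to handle the record that contains both separators
d = pd.DataFrame(d.nationality.str.split(' or ').tolist(), index=d.en_short_name).stack().reset_index()[[0, 'en_short_name']]
d.columns = ['nationality','en_short_name']
d.head()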

Unnamed: 0,nationality,en_short_name
0,Azerbaijani,Azerbaijan
1,Azeri,Azerbaijan
2,Beninese,Benin
3,Beninois,Benin
4,Bermudian,Bermuda


In [25]:
nc2 = c[c.nationality.str.contains(' or ')]
nc2

Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
9,28,AG,ATG,Antigua and Barbuda,Antiguan or Barbudan
28,70,BA,BIH,Bosnia and Herzegovina,Bosnian or Herzegovinian
97,334,HM,HMD,Heard Island and McDonald Islands,Heard Island or McDonald Islands
191,659,KN,KNA,Saint Kitts and Nevis,Kittitian or Nevisian
194,666,PM,SPM,Saint Pierre and Miquelon,Saint-Pierrais or Miquelonnais
211,239,GS,SGS,South Georgia and the South Sandwich Islands,South Georgia or South Sandwich Islands
230,780,TT,TTO,Trinidad and Tobago,Trinidadian or Tobagonian
250,876,WF,WLF,Wallis and Futuna,"Wallis and Futuna, Wallisian or Futunan"


In [26]:
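# Split each ' or '-separated nationality into its own row, keyed by country name
e = pd.DataFrame(nc2.nationality.str.split(' or ').tolist(), index=nc2.en_short_name).stack()
e = e.reset_index()[[0, 'en_short_name']]
e.columns = ['nationality','en_short_name']
# Keep the first 14 rows (7 records x 2); this drops the rows from the last record,
# Wallis and Futuna, which also contains a comma and is already handled in d
e = e[:14]
e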

Unnamed: 0,nationality,en_short_name
0,Antiguan,Antigua and Barbuda
1,Barbudan,Antigua and Barbuda
2,Bosnian,Bosnia and Herzegovina
3,Herzegovinian,Bosnia and Herzegovina
4,Heard Island,Heard Island and McDonald Islands
5,McDonald Islands,Heard Island and McDonald Islands
6,Kittitian,Saint Kitts and Nevis
7,Nevisian,Saint Kitts and Nevis
8,Saint-Pierrais,Saint Pierre and Miquelon
9,Miquelonnais,Saint Pierre and Miquelon
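
The positional slice `e[:14]` appears to exist to drop the rows produced by the Wallis and Futuna record, which also contains a comma and is already covered by `d`. It would silently break if the number of 'or' records changed. A sketch of a less position-dependent alternative, shown on toy data (the `toy` frame and the names `or_only` and `e_alt` are hypothetical), filters out comma-containing records before splitting:

```python
import pandas as pd

# Hypothetical sample: two ' or ' records plus the one record with both separators.
toy = pd.DataFrame({
    "en_short_name": ["Antigua and Barbuda", "Trinidad and Tobago", "Wallis and Futuna"],
    "nationality": [
        "Antiguan or Barbudan",
        "Trinidadian or Tobagonian",
        "Wallis and Futuna, Wallisian or Futunan",
    ],
})

has_or = toy["nationality"].str.contains(" or ", regex=False)
has_comma = toy["nationality"].str.contains(",", regex=False)

# Keep only records that use ' or ' and no comma.
or_only = toy[has_or & ~has_comma]

# Same split-and-stack idiom as in the notebook, applied to the filtered rows.
e_alt = pd.DataFrame(
    or_only["nationality"].str.split(" or ").tolist(),
    index=or_only["en_short_name"],
).stack().reset_index()[[0, "en_short_name"]]
e_alt.columns = ["nationality", "en_short_name"]
```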


#### Merging the two tables with "exploded" records back into the main country table

In [27]:
c = countries.copy()
# d holds the comma-split rows; e holds the ' or '-split rows
# First merge adds nationality_x (original value) and nationality_y (value from d)
c = pd.merge(c, d, how='left', on='en_short_name')
c.nationality_y = c.nationality_y.fillna(c.nationality_x)

# Second merge adds nationality (value from e); fall back to nationality_y where e has no match
c = pd.merge(c, e, how='left', on='en_short_name')
c.nationality = c.nationality.fillna(c.nationality_y)
c = c.drop(columns=['nationality_x','nationality_y'])
c = c.drop_duplicates()
# Normalize to lowercase and strip stray whitespace left by the splits
c.en_short_name = c.en_short_name.str.lower()
c.nationality = c.nationality.str.lower().str.strip()
c = c.rename(columns={'en_short_name':'country'})
c.to_csv('C:data/countries_clean.csv', index=False)
c.head()

Unnamed: 0,num_code,alpha_2_code,alpha_3_code,country,nationality
0,4,AF,AFG,afghanistan,afghan
1,248,AX,ALA,åland islands,åland island
2,8,AL,ALB,albania,albanian
3,12,DZ,DZA,algeria,algerian
4,16,AS,ASM,american samoa,american samoan
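
Since the goal is for a nationality such as "American" to return a single country, one way to use the cleaned table is as a lookup. This is a minimal sketch, assuming a small in-memory stand-in for `countries_clean.csv`; the `crosswalk` frame, the `nationality_to_country` dict, and the `country_for` helper are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned table (lowercase, one nationality per row).
crosswalk = pd.DataFrame({
    "country": ["afghanistan", "albania", "azerbaijan", "azerbaijan"],
    "nationality": ["afghan", "albanian", "azerbaijani", "azeri"],
})

# Map each nationality to its country. If a nationality appeared under two
# countries, the later row would win, so such duplicates should be removed first.
nationality_to_country = dict(zip(crosswalk["nationality"], crosswalk["country"]))


def country_for(term: str):
    """Return the country for a nationality term, or None if it is not listed."""
    return nationality_to_country.get(term.strip().lower())


match = country_for(" Azeri ")
missing = country_for("Martian")
```

Lowercasing and stripping the search term mirrors the normalization applied to the table before it was saved.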
