In [2]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import os
import glob

## Immigration data:
- There are 29 columns.
- Only some of the columns should be in the fact tables and extracted during the ETL process.
- One column has all null values in the dataset, and some others are mostly null. These should be excluded from the dataset.
- Columns that are integers are being read in as floats. When performing ETL, ensure that these are read in as integers.
- "arrdate" represents the arrival date. It is currently a float.
 - Since the data was originally in SAS format, the SAS date convention is needed to transform this column into a date data type.
- A few of these columns use numeric codes to represent values. Interesting ones of note are:
 - i94cit and i94res are numeric codes representing countries. Refer to the data dictionary for clarification and details. The corresponding names for these codes may need to be added somewhere in the data if none of the other datasets contain this information.
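
The two cleanup notes above (dropping all-null columns and reading whole-number floats as integers) could be sketched roughly as follows. This is only an illustration on a tiny stand-in frame: `empty_col` is a hypothetical name for the all-null column, and the missing `i94mode` value is invented to show that pandas' nullable `Int64` dtype tolerates nulls.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the immigration sample (not the real file)
sample = pd.DataFrame({
    "cicid": [4084316.0, 4422636.0, 1195600.0],
    "i94yr": [2016.0, 2016.0, 2016.0],
    "i94mode": [1.0, 1.0, np.nan],             # hypothetical missing value
    "empty_col": [np.nan, np.nan, np.nan],     # hypothetical all-null column
})

# Drop every column in which all values are null
cleaned = sample.dropna(axis="columns", how="all")

# Cast whole-number float columns to the nullable integer dtype,
# which keeps missing values as <NA> instead of forcing floats
int_columns = ["cicid", "i94yr", "i94mode"]
cleaned = cleaned.astype({col: "Int64" for col in int_columns})

print(cleaned.dtypes)
```

Which columns belong in `int_columns` for the real dataset is a judgment call that depends on the data dictionary.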

In [2]:
# Read in data
data = pd.read_csv(os.getcwd() + '/immigration_data_sample.csv')

In [1]:
# See column names, number of rows and columns, and a high level overview of the nulls in the sample
data.info()

In [26]:
# See first five rows
data.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [5]:
# See columns that were not shown above
data.loc[:, 'i94addr':'entdepu'].head()

Unnamed: 0,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu
0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,
1,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,
2,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,
3,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,
4,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,


In [22]:
# How many immigrated using each mode category
data['i94mode'].value_counts()

1.0    962
3.0     26
2.0     10
9.0      2
Name: i94mode, dtype: int64

In [28]:
# Unclear what i94addr, i94res and i94cit represent.
# See some data to get a better understanding
data[['i94addr','i94res','i94cit']].head()

Unnamed: 0,i94addr,i94res,i94cit
0,HI,209.0,209.0
1,TX,582.0,582.0
2,FL,112.0,148.0
3,CA,297.0,297.0
4,NY,111.0,111.0


Looking at a few rows in the data, i94res and i94cit appear similar in nature. Some rows have matching values for these columns and some don't.

Both of these columns hold three-digit codes that, according to the data dictionary, represent a country other than the United States. The difference between them is unclear. I will use i94res as the feature that represents where someone is originally from.

__Explore the i94addr field:__

This column is for US states. Assume that this is the state where immigrants are staying.

In [30]:
data['i94addr'].value_counts()

FL    188
CA    163
NY    161
HI     53
TX     42
NV     34
IL     31
GU     27
MA     26
NJ     20
WA     19
GA     19
VA     13
NE     12
DC     12
MD     11
PA     10
MI      9
NC      9
LA      8
TN      7
IN      7
CT      6
AL      5
OH      5
AZ      5
CO      5
MP      3
SC      3
MN      3
VT      2
OR      2
MO      2
UN      2
PR      1
NH      1
ME      1
IA      1
NM      1
MS      1
TE      1
OK      1
SW      1
RI      1
WI      1
UT      1
VQ      1
ID      1
KS      1
KY      1
AR      1
Name: i94addr, dtype: int64

__Explore "arrdate" column:__

This column is a float, but represents a date. Since this is from a SAS file, test the date conversion from float. The same conversion can be used for other date columns in this file that are stored in a similar way.

In [32]:
# SAS stores dates as the number of days from January 1, 1960. Appears to be the correct conversion
# pd.datetime was removed from newer pandas versions, so pd.Timestamp is used for the epoch
print((pd.to_timedelta(data['arrdate'], unit = 'D') + pd.Timestamp(1960, 1, 1)).head())
# change column type
data['arrdate'] = pd.to_timedelta(data['arrdate'], unit = 'D') + pd.Timestamp(1960, 1, 1)

0   2016-04-22
1   2016-04-23
2   2016-04-07
3   2016-04-28
4   2016-04-06
Name: arrdate, dtype: datetime64[ns]
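
Since the same conversion may be needed for more than one column, it could be wrapped in a small helper. The sketch below is an assumption about how that might look: `sas_days_to_date` is a hypothetical name, and treating `depdate` as a SAS day count is a guess based on its values resembling `arrdate`. The two numeric values are taken from the `depdate` rows shown earlier, and the missing value is added to show that nulls pass through as `NaT`.

```python
import pandas as pd

# SAS counts dates as days elapsed since this epoch
SAS_EPOCH = pd.Timestamp("1960-01-01")

def sas_days_to_date(days):
    """Convert a Series of SAS day counts to pandas datetimes (nulls become NaT)."""
    return pd.to_timedelta(days, unit="D") + SAS_EPOCH

# Two depdate values from the sample plus a missing value
depdate = pd.Series([20573.0, 20568.0, None], name="depdate")
converted = sas_days_to_date(depdate)
print(converted)
```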


## Demographics data:
- Each city has multiple rows of data. For example, Birmingham, Alabama has five rows. Most columns repeat the same values across these rows; the only columns that change are "Race" and "Count". May want to pivot the data so that each "Race" value becomes a column holding its corresponding "Count" value.
 - There are 5 categories in the "Race" column. Since there are not many categories, it should be fine to pivot the data so each city has its own row (City & State become the unique identifier).
- Some columns are floats when they should be integers.
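
A rough sketch of the reshaping idea in the notes above, using `pivot_table` on a few rows copied from this dataset. Only two of the five race categories and one of the repeated columns are included to keep it short; the real ETL would list every repeated column in `index`.

```python
import pandas as pd

# A handful of rows copied from the demographics data (subset of columns)
rows = [
    ("Birmingham", "Alabama", 214911, "Asian", 1500),
    ("Birmingham", "Alabama", 214911, "White", 51728),
    ("Dothan", "Alabama", 67536, "Asian", 1175),
    ("Dothan", "Alabama", 67536, "White", 43516),
]
demo_sample = pd.DataFrame(
    rows, columns=["City", "State", "Total Population", "Race", "Count"]
)

# One row per city: each Race value becomes its own column holding the Count
wide = (
    demo_sample
    .pivot_table(
        index=["City", "State", "Total Population"],
        columns="Race",
        values="Count",
        aggfunc="sum",
    )
    .reset_index()
)
wide.columns.name = None  # drop the leftover "Race" label on the column axis

print(wide)
```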

In [3]:
# import data
demo = pd.read_csv('./us-cities-demographics.csv', sep = ';')

In [34]:
demo.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [35]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 12 columns):
City                      2891 non-null object
State                     2891 non-null object
Median Age                2891 non-null float64
Male Population           2888 non-null float64
Female Population         2888 non-null float64
Total Population          2891 non-null int64
Number of Veterans        2878 non-null float64
Foreign-born              2878 non-null float64
Average Household Size    2875 non-null float64
State Code                2891 non-null object
Race                      2891 non-null object
Count                     2891 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 271.1+ KB


Sort by cities to see what exactly makes up the unique identifier for each row.

In [36]:
demo =  demo.sort_values(by=['State', 'City'])
demo.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
212,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Asian,1500
1063,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,American Indian and Alaska Native,1319
2025,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,White,51728
2231,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Black or African-American,157985
2627,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Hispanic or Latino,8940


In [37]:
demo.head(10)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
212,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Asian,1500
1063,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,American Indian and Alaska Native,1319
2025,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,White,51728
2231,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Black or African-American,157985
2627,Birmingham,Alabama,35.6,102122.0,112789.0,214911,13212.0,8258.0,2.21,AL,Hispanic or Latino,8940
479,Dothan,Alabama,38.9,32172.0,35364.0,67536,6334.0,1699.0,2.59,AL,Asian,1175
500,Dothan,Alabama,38.9,32172.0,35364.0,67536,6334.0,1699.0,2.59,AL,Hispanic or Latino,1704
1064,Dothan,Alabama,38.9,32172.0,35364.0,67536,6334.0,1699.0,2.59,AL,Black or African-American,23243
1580,Dothan,Alabama,38.9,32172.0,35364.0,67536,6334.0,1699.0,2.59,AL,White,43516
1925,Dothan,Alabama,38.9,32172.0,35364.0,67536,6334.0,1699.0,2.59,AL,American Indian and Alaska Native,656


In [8]:
demo['Race'].value_counts()

Hispanic or Latino                   596
White                                589
Black or African-American            584
Asian                                583
American Indian and Alaska Native    539
Name: Race, dtype: int64

One noticeable thing is that the sum of the Count column is greater than the Total Population value. For Birmingham, Alabama:
- Sum of Count column = 221,472
- Total Population = 214,911
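
The Birmingham figures can be reproduced from the five rows shown above. One possible explanation for the gap, which is an assumption and not confirmed by this notebook, is that a person can be counted under more than one "Race" category.

```python
import pandas as pd

# The five Birmingham, Alabama rows from the demographics data
birmingham = pd.DataFrame({
    "Race": [
        "Asian",
        "American Indian and Alaska Native",
        "White",
        "Black or African-American",
        "Hispanic or Latino",
    ],
    "Count": [1500, 1319, 51728, 157985, 8940],
})
total_population = 214911

count_sum = birmingham["Count"].sum()
difference = count_sum - total_population

print(f"Sum of Count: {count_sum:,}")
print(f"Excess over Total Population: {difference:,}")
```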

## Airport data:
- There are many nulls in the continent column. All come from "NA" values being read in as nulls.
- There are also nulls in the iso_country column.
- iso_region does not have any nulls. After performing EDA, determined that it is possible to use iso_region to derive the missing values in the iso_country column.
 - Many rows have a null continent value. This is because the values are supposed to be "NA", but are read in as nulls. The file needs to be read so that "NA" is kept as a literal string rather than converted to a null.
- There is one error in the data: a US airport with an incorrect continent.
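
One way to avoid the "NA" problem at read time, sketched here on an in-memory CSV and not on the real file, is to turn off pandas' default null markers and treat only empty fields as null. The `keep_default_na` and `na_values` parameters are part of `pd.read_csv`; the `XNAM` row is a hypothetical airport with the country code "NA", added to show the same issue in iso_country.

```python
import io
import pandas as pd

# Small stand-in for the airport file; the XNAM row is hypothetical
csv_text = (
    "ident,continent,iso_country,iata_code\n"
    "00A,NA,US,\n"
    "KSEA,NA,US,SEA\n"
    "XNAM,AF,NA,\n"
)

# Default behaviour: the string "NA" is silently converted to a null
default = pd.read_csv(io.StringIO(csv_text))

# Only treat empty fields as null, so "NA" survives as a literal string
fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default[["continent", "iso_country"]].isnull().sum())
print(fixed[["continent", "iso_country"]].isnull().sum())
```

With this approach the manual continent fixes later in this notebook would mostly become unnecessary, although other columns that rely on default null markers would need checking.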

In [6]:
# Read in data
airport = pd.read_csv('./airport-codes_csv.csv')

In [40]:
# See first five rows
airport.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [41]:
# See high level overview of data
airport.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 12 columns):
ident           55075 non-null object
type            55075 non-null object
name            55075 non-null object
elevation_ft    48069 non-null float64
continent       27356 non-null object
iso_country     54828 non-null object
iso_region      55075 non-null object
municipality    49399 non-null object
gps_code        41030 non-null object
iata_code       9189 non-null object
local_code      28686 non-null object
coordinates     55075 non-null object
dtypes: float64(1), object(11)
memory usage: 5.0+ MB


In [14]:
# ISO country nulls all have NA as their country code.
airport[airport['iso_country'].isnull()]['iso_region'].value_counts()

NA-HA    91
NA-OH    34
NA-KH    27
NA-KA    26
NA-ER    18
NA-OD    15
NA-KU    14
NA-CA     5
NA-OT     5
NA-KE     4
NA-OS     3
NA-OW     2
NA-ON     2
NA-KW     1
Name: iso_region, dtype: int64

Look at the types column and how many categories there are.

In [42]:
airport['type'].value_counts()

small_airport     33965
heliport          11287
medium_airport     4550
closed             3606
seaplane_base      1016
large_airport       627
balloonport          24
Name: type, dtype: int64

Look at continent since there are more null values in this column than in others like iso_country and iso_region.

In [43]:
airport['continent'].value_counts()

EU    7840
SA    7709
AS    5350
AF    3362
OC    3067
AN      28
Name: continent, dtype: int64

There are 6 groups in the continents column. For the rows with missing values, what are the iso_country and iso_region column values?

In [44]:
#first, see how many are nulls
airport['continent'].isnull().sum()

27719

In [45]:
airport[airport['continent'].isnull()][["iso_country","iso_region"]].head()

Unnamed: 0,iso_country,iso_region
0,US,US-PA
1,US,US-KS
2,US,US-AK
3,US,US-AL
4,US,US-AR


In [46]:
airport[airport['continent'].isnull()]["iso_country"].value_counts().head()

US    22756
CA     2784
MX     1181
HN      158
CU      134
Name: iso_country, dtype: int64

In [47]:
airport[airport['continent'].isnull()]["iso_region"].value_counts().head(10)

US-TX    2277
US-CA    1088
US-FL     967
US-PA     918
US-IL     902
US-AK     829
US-OH     798
US-IN     697
CA-ON     695
US-NY     668
Name: iso_region, dtype: int64

Most nulls in the continent column have an iso_country value of US, CA, or MX. The iso_region column also appears to confirm this.


Since these have a large count, will investigate all rows with US, CA, and MX as values for iso_country to see if any rows have values in the continent column:

In [48]:
name_list = ['US', 'CA', 'MX']
for name in name_list:
    print('Where continents field is null for '+ name + ': ' + str("{:,}".format(airport[airport['iso_country'] == name]['continent'].isnull().sum())))
    print('Where continents field is not null for '+ name + ': ' + str(airport[airport['iso_country'] == name]['continent'].notnull().sum()))
    print('Proportion of nulls: '+ str("{:.1%}".format(airport[airport['iso_country'] == name]['continent'].isnull().sum() / len(airport[airport['iso_country'] == name]))))
    print()

Where continents field is null for US: 22,756
Where continents field is not null for US: 1
Proportion of nulls: 100.0%

Where continents field is null for CA: 2,784
Where continents field is not null for CA: 0
Proportion of nulls: 100.0%

Where continents field is null for MX: 1,181
Where continents field is not null for MX: 0
Proportion of nulls: 100.0%



Among all of the rows in question, only one has a continent value. That row has US as its country value and AS as its continent value.

In [49]:
print(airport[(airport['iso_country'] == 'US')]['continent'].value_counts())
airport[(airport['iso_country'] == 'US')&(airport['continent'] == 'AS')]

AS    1
Name: continent, dtype: int64


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
37387,OI05,heliport,Dana Heliport,625.0,AS,US,US-OH,Toledo,OI05,,OI05,"-83.6465988159, 41.65589904789999"


A value of "AS" represents Asia. This row is for a heliport in Toledo, Ohio, so it has an incorrect continent value. The continent value should be updated to "NA" for North America.

In [50]:
# Get row by filtering for continent == AS and country == US
airport.loc[((airport['continent'] == 'AS')&(airport['iso_country'] == 'US')),'continent'] = "NA"

#verify it is now value of "NA"
airport[(airport['ident'] == 'OI05')]

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
37387,OI05,heliport,Dana Heliport,625.0,,US,US-OH,Toledo,OI05,,OI05,"-83.6465988159, 41.65589904789999"


For these three country values, the continent should be "NA" for North America. However, when the file was read in, the string "NA" was interpreted as a null value. When reading in the airport file, make sure "NA" is kept as a literal string rather than converted to a null.

Set the continent to "NA" for all rows with iso_country values of US, CA, and MX:

In [51]:
airport.loc[airport['iso_country'].isin(['US', 'MX', 'CA']), 'continent'] = "NA"

Verify values are no longer null:

In [52]:
print(airport[(airport['iso_country'] == 'US')]['continent'].isnull().sum())
print(airport[(airport['iso_country'] == 'MX')]['continent'].isnull().sum())
print(airport[(airport['iso_country'] == 'CA')]['continent'].isnull().sum())

0
0
0


See remaining values

In [53]:
airport[airport['continent'].isnull()]["iso_country"].value_counts()

HN    158
CU    134
CR    133
GL     84
PA     72
BS     66
PR     64
GT     53
DO     39
NI     32
BZ     27
SV     27
JM     24
TC     10
VI      9
HT      8
GP      6
VC      6
BB      6
GD      3
KY      3
BM      3
TT      3
BQ      3
AG      3
VG      3
KN      2
MF      2
LC      2
MS      2
DM      2
PM      2
SX      1
CW      1
MQ      1
GB      1
AW      1
AI      1
BL      1
Name: iso_country, dtype: int64

A lot of nulls still remain. Based on the examination of the US, CA, and MX rows, this is likely the same issue of "NA" being read in as a null value.

Used [this link](https://dev.maxmind.com/geoip/legacy/codes/country_continent/) to determine a few of these countries' continent codes

After reviewing some of the continent values, it appears that nearly all of these should be NA for North America (GB is a European exception). Use the pycountry_convert package to look up the continent for each country code in the rows with a null continent.

In [54]:
# Set to add country values to without duplicates
na_set = set()

# Dataframe with filtered master data where continent is null. Want to perform test only on this subset of data
na_data = airport[airport['continent'].isnull()]['iso_country']

# Error list created to see any countries where the pycountry package couldn't get a continent
error_list = []

In [55]:
#if need to install
!pip install pycountry_convert



In [57]:
import pycountry_convert as pc

# Look through all country values in the filtered dataset
for item in na_data:
    try:
        # Use pycountry_convert to look up a continent code for the country code
        pc.country_alpha2_to_continent_code(item)
        # Add to set. A duplicate country value is not added since sets only hold unique values
        na_set.add(item)
    except Exception:
        # If the lookup raises an error, add the country code to the error list
        # for further examination, then continue with the next value
        error_list.append(item)
        continue

# Create a new dataframe with a country column. Using the set gives one row per country value
na_df = pd.DataFrame(list(na_set), columns = ["country"])

# Fill in the looked-up continent code, but only on rows where the continent is still null,
# so airports in countries outside North America (e.g. GB) are not overwritten with "NA"
for country in na_df['country']:
    mask = (airport['iso_country'] == country) & (airport['continent'].isnull())
    airport.loc[mask, 'continent'] = pc.country_alpha2_to_continent_code(country)

In [59]:
#check data to see remaining nulls
print(str(airport['continent'].isnull().sum())+ ' nulls remaining')

#see the country values that were not able to be processed using pycountry
print(error_list)

#SX only one that didn't work on the automated continent add, so add the continent code manually
airport.loc[(airport['iso_country'] == 'SX'),'continent'] = "NA"

1 nulls remaining
['SX', 'SX']


In [60]:
#verify no nulls after processing "SX"
airport['continent'].isnull().sum()

0

Now no nulls in continent column.

__iso_country:__

Looking at the first five rows of the iso_country column, it appears that it may only contain values that are 2 characters long. Determine if this is true for all values.

In [61]:
airport['iso_country_length'] = airport['iso_country'].map(str).apply(len)
airport['iso_country_length'].value_counts()

2    54828
3      247
Name: iso_country_length, dtype: int64

Most values in the iso_country column have a length of 2, but there are some iso_country values with a length of 3. Look closer at all these rows.

In [62]:
#looks like all rows with a value length of 3 are NaNs
airport[airport['iso_country_length'] == 3]['iso_country']
airport[airport['iso_country_length'] == 3]['iso_country'].value_counts()

Series([], Name: iso_country, dtype: int64)

All with lengths of three are NaNs. 

In [63]:
#check using notnull(). Are all counts now 2?
airport[airport['iso_country'].notnull()]['iso_country_length'].value_counts() #all have counts of two when run during testing

2    54828
Name: iso_country_length, dtype: int64

All non-nulls for this column have a length of 2.

To investigate the nulls, look at iso_region since this column had no nulls. Can we use it to derive the missing values in the iso_country column?

In [64]:
airport['iso_region'].value_counts()

US-TX     2277
US-CA     1088
US-FL      967
US-PA      918
BR-SP      907
US-IL      902
US-AK      829
US-OH      799
GB-ENG     726
US-IN      697
CA-ON      695
US-NY      668
BR-MT      635
US-WI      624
US-LA      592
US-WA      578
US-MO      578
US-MN      569
US-MI      549
US-OK      537
BR-MS      527
US-GA      522
AU-QLD     511
US-VA      505
US-CO      505
US-OR      492
US-NC      473
CA-BC      467
US-NJ      442
US-KS      439
          ... 
UG-401       1
MH-UJA       1
NI-JI        1
UA-56        1
IR-19        1
GR-16        1
SI-188       1
TN-52        1
JP-26        1
PH-ZSI       1
UY-FD        1
LR-RI        1
TH-50        1
SY-HM        1
HR-20        1
LV-LM        1
GH-EP        1
MK-003       1
PW-010       1
MT-01        1
TR-51        1
SZ-MA        1
TR-13        1
SZ-HH        1
BS-MG        1
RU-KC        1
TM-D         1
SI-073       1
AG-03        1
UG-304       1
Name: iso_region, Length: 2810, dtype: int64

Test whether the characters to the left of the dash in the iso_region column match the iso_country value. Do this by:
- Creating a filtered dataframe so only rows with iso_country values are being tested, since we cannot test against null values
- Deriving an iso_country value by taking iso_region and keeping the characters left of the dash symbol
- Comparing the derived value with the actual iso_country value
- Counting the number of matches and mismatches
 - If all match, we can use iso_region to derive a country value

In [65]:
#filter for iso_countries that are not null
test = airport[airport['iso_country'].notnull()][["iso_country","iso_region"]]

#get the characters left of the dash in the iso_region column to compare against iso_country
test['iso_region_leftdash'] = test.iso_region.str.split('-').str[0]
test.head()


Unnamed: 0,iso_country,iso_region,iso_region_leftdash
0,US,US-PA,US
1,US,US-KS,US
2,US,US-AK,US
3,US,US-AL,US
4,US,US-AR,US


In [66]:
# Compare the derived value against the actual iso_country value for every row
test['test'] = test['iso_region_leftdash'] == test['iso_country']
test['test'].value_counts()

True    54828
Name: test, dtype: int64

The above code block shows that we can derive the country from iso_region, since all test values were true.

Now in the airport dataset, use iso_region to derive the country code. A new column is created so that the original iso_country column stays intact.

In [67]:
#create a new column with the derived country column
airport['iso_region_leftdash'] = airport.iso_region.str.split('-').str[0]
print(airport[airport['iso_region_leftdash'].isnull()]['iso_region_leftdash'].value_counts()) #test to see if any nulls in newly calculated field
# Resulted in no values. This shows that the new calculation did not create any new null values

print() # Space in output to make it look nicer

print(airport.info()) # See high level overview of dataframe with new column

Series([], Name: iso_region_leftdash, dtype: int64)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 14 columns):
ident                  55075 non-null object
type                   55075 non-null object
name                   55075 non-null object
elevation_ft           48069 non-null float64
continent              55075 non-null object
iso_country            54828 non-null object
iso_region             55075 non-null object
municipality           49399 non-null object
gps_code               41030 non-null object
iata_code              9189 non-null object
local_code             28686 non-null object
coordinates            55075 non-null object
iso_country_length     55075 non-null int64
iso_region_leftdash    55075 non-null object
dtypes: float64(1), int64(1), object(12)
memory usage: 5.9+ MB
None
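
If only the missing iso_country values should be filled and the existing ones left alone, `fillna` with the derived series is one option. The sketch below uses a three-row stand-in frame; the `NA-HA` region comes from the null-country rows shown earlier, and `iso_country_filled` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Stand-in rows: one has a null iso_country because "NA" was read as a null
airports = pd.DataFrame({
    "iso_country": ["US", np.nan, "CA"],
    "iso_region": ["US-PA", "NA-HA", "CA-ON"],
})

# Characters left of the dash in iso_region
derived = airports["iso_region"].str.split("-").str[0]

# Keep existing country codes and fill only the missing ones
airports["iso_country_filled"] = airports["iso_country"].fillna(derived)

print(airports)
```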


__Explore iata_code__

In [68]:
airport['iata_code'].isnull().sum() / len(airport.index)

0.83315478892419426

83% of rows have a null value in iata_code. Although there are a lot of nulls, this column will be kept since IATA codes are standardized airport codes and provide useful information. We can use it to see how many people in the immigration data traveled through certain airports.

Also, not all people travel by airplane, and even those who do may not use an airport that has an IATA code if it is small or a military base. So a null in this column may still carry information.

In [11]:
airport[airport['iata_code'].isnull()].head(2)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"


In [19]:
airport[airport['iata_code'] == 'SEA']
#airport['iata_code'].value_counts()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
29927,KSEA,large_airport,Seattle Tacoma International Airport,433.0,,US,US-WA,Seattle,KSEA,SEA,SEA,"-122.308998, 47.449001"


## Temperature data:
 - For the AverageTemperature column, rows with a date of 9/1/2013 have a lot of nulls, so this date should be scoped out when pulling data in for ETL.

In [2]:
# Read in temp data
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
temp = pd.read_csv(fname)

In [70]:
temp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


The temperature data set only goes through September 1st, 2013. We need to scope the fact table to only include data with dates before 9/1/2013.

In [71]:
temp.sort_values('dt').tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
7330815,2013-09-01,,,Tabriz,Iran,37.78N,46.78E
2684940,2013-09-01,,,Guangyuan,China,32.95N,106.28E
2682622,2013-09-01,,,Guangshui,China,31.35N,113.09E
2707002,2013-09-01,21.815,1.233,Guatemala City,Guatemala,15.27N,90.83W
8599211,2013-09-01,,,Zwolle,Netherlands,52.24N,5.26E


In [72]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [73]:
temp['AverageTemperature'].isnull().sum()

364130

.info() didn't give non-null counts for this dataset, so loop through all columns to get that information.

In [74]:
for column in temp.columns:
    print(column + ' column contains:')
    print('Total records: '+ str('{:,}'.format(len(temp.index))))
    print('Nulls: ' + str('{:,}'.format(temp[column].isnull().sum())))
    print('Proportion of Nulls in Data: ' + str("{:.2%}".format(temp[column].isnull().sum() / len(temp.index))))
    print()

dt column contains:
Total records: 8,599,212
Nulls: 0
Proportion of Nulls in Data: 0.00%

AverageTemperature column contains:
Total records: 8,599,212
Nulls: 364,130
Proportion of Nulls in Data: 4.23%

AverageTemperatureUncertainty column contains:
Total records: 8,599,212
Nulls: 364,130
Proportion of Nulls in Data: 4.23%

City column contains:
Total records: 8,599,212
Nulls: 0
Proportion of Nulls in Data: 0.00%

Country column contains:
Total records: 8,599,212
Nulls: 0
Proportion of Nulls in Data: 0.00%

Latitude column contains:
Total records: 8,599,212
Nulls: 0
Proportion of Nulls in Data: 0.00%

Longitude column contains:
Total records: 8,599,212
Nulls: 0
Proportion of Nulls in Data: 0.00%
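
The same per-column summary can also be built without a loop, which may be easier to sort or filter. This is a sketch on a few rows copied from the head of the city data, not the full file. In recent pandas versions, `temp.info(show_counts=True)` should also force the non-null counts to be printed, though that has not been checked against the pandas version used in this notebook.

```python
import numpy as np
import pandas as pd

# First four rows of the city temperature data (subset of columns)
temp_sample = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-01-01", "1744-02-01"],
    "AverageTemperature": [6.068, np.nan, np.nan, np.nan],
    "City": ["Århus"] * 4,
})

# isnull().sum() counts nulls per column; isnull().mean() gives the proportion
summary = pd.DataFrame({
    "nulls": temp_sample.isnull().sum(),
    "proportion": temp_sample.isnull().mean(),
})

print(summary)
```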



The columns of interest are AverageTemperature and AverageTemperatureUncertainty. These have nulls, so look at those rows.

In [3]:
temp[temp['AverageTemperature'].isnull()]['dt'].value_counts().sort_index(ascending = False).head(10)

2013-09-01    3070
1898-03-01      21
1898-02-01      24
1893-06-01      11
1893-05-01      11
1893-04-01      11
1893-03-01      11
1893-02-01       8
1893-01-01       8
1892-12-01       8
Name: dt, dtype: int64

In [76]:
temp[temp['AverageTemperatureUncertainty'].isnull()]['dt'].value_counts().sort_index(ascending = False).head(10)

2013-09-01    3070
1898-03-01      21
1898-02-01      24
1893-06-01      11
1893-05-01      11
1893-04-01      11
1893-03-01      11
1893-02-01       8
1893-01-01       8
1892-12-01       8
Name: dt, dtype: int64

It looks like there are a lot of nulls with a date of 9/1/2013. These should be excluded from our data import.

All other nulls are from 1898 and earlier.
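
Excluding the 9/1/2013 rows could look roughly like this: parse `dt` as a date and keep only rows strictly before the cutoff. The two 2013-09-01 rows are taken from the tail shown above; the 2013-08-01 row and its temperature value are hypothetical and only there so that something remains after filtering.

```python
import numpy as np
import pandas as pd

temp_sample = pd.DataFrame({
    "dt": ["2013-08-01", "2013-09-01", "2013-09-01"],
    "AverageTemperature": [20.0, np.nan, 21.815],  # first value is hypothetical
    "City": ["Guatemala City", "Tabriz", "Guatemala City"],
})

# Parse the string column into datetimes so it can be compared to a cutoff
temp_sample["dt"] = pd.to_datetime(temp_sample["dt"])

# Keep only rows dated before September 1st, 2013
cutoff = pd.Timestamp("2013-09-01")
in_scope = temp_sample[temp_sample["dt"] < cutoff]

print(in_scope)
```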

Country is another column to look at. This is one of the columns that we will need to use to join our temperature dimension table with the fact table.

In [77]:
print(str(len(temp['Country'].value_counts())) + ' countries')
temp['Country'].value_counts().head(10)

159 countries


India             1014906
China              827802
United States      687289
Brazil             475580
Russia             461234
Japan              358669
Indonesia          323255
Germany            262359
United Kingdom     220252
Mexico             209560
Name: Country, dtype: int64

In [17]:
# Use pycountry to get ISO country codes
import pycountry

# create a dictionary with key = country name and value = country code
countries = {}
for country in pycountry.countries:
    countries[country.name] = country.alpha_2

In [79]:
#add country code to temp df
temp['country_code'] = temp['Country'].map(countries)

In [80]:
temp[['Country','country_code']].head()

Unnamed: 0,Country,country_code
0,Denmark,DK
1,Denmark,DK
2,Denmark,DK
3,Denmark,DK
4,Denmark,DK


In [81]:
# See if there are any nulls in the country_code column and what their associated country values are
print('Number of countries without derived country code: ' + str(len(temp[temp['country_code'].isnull()]['Country'].value_counts())))
temp[temp['country_code'].isnull()]['Country'].value_counts()

Countries without derived country code: 20


Russia                                461234
Iran                                  151651
Venezuela                              91080
Vietnam                                66330
Taiwan                                 62190
Burma                                  50566
Congo (Democratic Republic Of The)     44547
Tanzania                               31440
Côte D'Ivoire                          25701
Syria                                  18056
Bosnia And Herzegovina                 16195
Czech Republic                         16195
Bolivia                                11406
Moldova                                 9717
Macedonia                               6478
Reunion                                 2721
Laos                                    2371
South Korea                             2097
Guinea Bissau                           1977
Swaziland                               1881
Name: Country, dtype: int64

These countries did not have an associated country code from the pycountry module.

Will use another method to get the country codes of these.

In [82]:
# Collect the distinct country names that still have no country code
country_set = set(temp.loc[temp['country_code'].isnull(), 'Country'])

In [83]:
country_set

{'Bolivia',
 'Bosnia And Herzegovina',
 'Burma',
 'Congo (Democratic Republic Of The)',
 'Czech Republic',
 "Côte D'Ivoire",
 'Guinea Bissau',
 'Iran',
 'Laos',
 'Macedonia',
 'Moldova',
 'Reunion',
 'Russia',
 'South Korea',
 'Swaziland',
 'Syria',
 'Taiwan',
 'Tanzania',
 'Venezuela',
 'Vietnam'}

Manually assign each of these countries its ISO 3166-1 alpha-2 country code. [This link](https://stackoverflow.com/questions/16253060/how-to-convert-country-names-to-iso-3166-1-alpha-2-values-using-python) was used to determine a few of the codes.


In [84]:
extra_codes = {'Bolivia' : 'BO',
 'Bosnia And Herzegovina' : 'BA',
 'Burma' : 'MM',
 'Congo (Democratic Republic Of The)' : 'CD',
 'Czech Republic' : 'CZ',
 "Côte D'Ivoire" : 'CI',
 'Guinea Bissau' : 'GW',
 'Iran' : 'IR',
 'Laos' : 'LA',
 'Macedonia' : 'MK',
 'Moldova' : 'MD',
 'Reunion' : 'RE',
 'Russia' : 'RU',
 'South Korea' : 'KR',
 'Swaziland' : 'SZ',
 'Syria' : 'SY',
 'Taiwan' : 'TW',
 'Tanzania' : 'TZ',
 'Venezuela' : 'VE',
 'Vietnam' : 'VN'}

In [85]:
# Add above dictionary onto previous dictionary created from pycountry
countries.update(extra_codes)
temp['country_code'] = temp['Country'].map(countries)

In [86]:
# Verify nulls changed
print(temp['country_code'].isnull().sum())
temp[temp['country_code'].isnull()]['Country'].value_counts()

0


Series([], Name: Country, dtype: int64)

This was difficult because many country names in the dataset follow different naming conventions than the pycountry module. For example, pycountry stores Bolivia as "Bolivia, Plurinational State of", so a direct name match against our data fails.
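One way to reduce the amount of manual matching is to normalize names before the lookup and keep the hand-made fixes in a separate alias table. The sketch below is only an illustration in plain Python: `normalize`, `build_code_lookup`, `to_code`, and the small sample dictionaries are hypothetical names and stand-ins, not part of pycountry or of the notebook above.

```python
def normalize(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.casefold().split())


def build_code_lookup(official_codes, aliases):
    """Merge official names and manual aliases into one normalized lookup."""
    lookup = {normalize(n): code for n, code in official_codes.items()}
    # Aliases are added last so a manual fix wins over an official entry.
    lookup.update({normalize(n): code for n, code in aliases.items()})
    return lookup


# Stand-in for the dictionary built from pycountry (official names).
official_codes = {
    "Denmark": "DK",
    "Bolivia, Plurinational State of": "BO",
    "Bosnia and Herzegovina": "BA",
}

# Stand-in for the manually maintained names used in the temperature data.
aliases = {"Bolivia": "BO", "Burma": "MM"}

lookup = build_code_lookup(official_codes, aliases)


def to_code(name):
    """Return the two-letter code, or None when the name is unknown."""
    return lookup.get(normalize(name))


print(to_code("Bosnia And Herzegovina"))  # matched after case normalization
print(to_code("Bolivia"))                 # matched through the alias table
```

With this approach, names that differ only in capitalization (such as "Bosnia And Herzegovina") no longer need a manual entry, and only true naming differences remain in the alias table.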

__Now look at state data__

Using the same code as in the city section.

In [87]:
# Read in temp data
fname = './GlobalLandTemperaturesByState.csv'
temp = pd.read_csv(fname)

In [88]:
temp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,State,Country
0,1855-05-01,25.544,1.171,Acre,Brazil
1,1855-06-01,24.228,1.103,Acre,Brazil
2,1855-07-01,24.371,1.044,Acre,Brazil
3,1855-08-01,25.427,1.073,Acre,Brazil
4,1855-09-01,25.675,1.014,Acre,Brazil


In [89]:
temp.sort_values('dt').tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,State,Country
552051,2013-09-01,26.408,1.112,Texas,United States
594124,2013-09-01,,,Victoria,Australia
567074,2013-09-01,,,Tuva,Russia
261547,2013-09-01,,,Krasnoyarsk,Russia
645674,2013-09-01,,,Zhejiang,China


In [90]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645675 entries, 0 to 645674
Data columns (total 5 columns):
dt                               645675 non-null object
AverageTemperature               620027 non-null float64
AverageTemperatureUncertainty    620027 non-null float64
State                            645675 non-null object
Country                          645675 non-null object
dtypes: float64(2), object(3)
memory usage: 24.6+ MB


In [91]:
temp['AverageTemperature'].isnull().sum()

25648

In [92]:
for column in temp.columns:
    print(column + ' column contains:')
    print('Total records: '+ str('{:,}'.format(len(temp.index))))
    print('Nulls: ' + str('{:,}'.format(temp[column].isnull().sum())))
    print('Proportion of Nulls in Data: ' + str("{:.2%}".format(temp[column].isnull().sum() / len(temp.index))))
    print()

dt column contains:
Total records: 645,675
Nulls: 0
Proportion of Nulls in Data: 0.00%

AverageTemperature column contains:
Total records: 645,675
Nulls: 25,648
Proportion of Nulls in Data: 3.97%

AverageTemperatureUncertainty column contains:
Total records: 645,675
Nulls: 25,648
Proportion of Nulls in Data: 3.97%

State column contains:
Total records: 645,675
Nulls: 0
Proportion of Nulls in Data: 0.00%

Country column contains:
Total records: 645,675
Nulls: 0
Proportion of Nulls in Data: 0.00%



In [93]:
temp[temp['AverageTemperature'].isnull()]['dt'].value_counts().sort_index(ascending = False).head(10)

2013-09-01    181
1893-06-01      1
1893-05-01      1
1893-04-01      1
1893-03-01      1
1893-02-01      1
1893-01-01      1
1892-12-01      1
1892-11-01      1
1892-10-01      1
Name: dt, dtype: int64

In [94]:
temp[temp['AverageTemperatureUncertainty'].isnull()]['dt'].value_counts().sort_index(ascending = False).head(10)

2013-09-01    181
1893-06-01      1
1893-05-01      1
1893-04-01      1
1893-03-01      1
1893-02-01      1
1893-01-01      1
1892-12-01      1
1892-11-01      1
1892-10-01      1
Name: dt, dtype: int64

In [95]:
print(str(len(temp['Country'].value_counts())) + ' countries')
temp['Country'].value_counts().head(10)

7 countries


Russia           254972
United States    149745
India             86664
China             68506
Canada            35358
Brazil            34328
Australia         16102
Name: Country, dtype: int64

In [96]:
# Add country code to temp dataframe
temp['country_code'] = temp['Country'].map(countries)

In [97]:
temp[['Country','country_code']].head()

Unnamed: 0,Country,country_code
0,Brazil,BR
1,Brazil,BR
2,Brazil,BR
3,Brazil,BR
4,Brazil,BR


In [98]:
# See if there are any nulls in the country_code column and what their associated country values are
print('Countries without derived country code: ' + str(len(temp[temp['country_code'].isnull()]['Country'].value_counts())))
temp[temp['country_code'].isnull()]['Country'].value_counts()

Countries without derived country code: 0


Series([], Name: Country, dtype: int64)

__Now look at country data__

Using the same code as in the city section.

In [30]:
# Read in temp data
fname = './GlobalLandTemperaturesByCountry.csv'
temp = pd.read_csv(fname)

In [38]:
temp['year']  = pd.DatetimeIndex(temp['dt']).year

In [39]:
temp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country,country_code,year
0,1743-11-01,4.384,2.294,Åland,,1743
1,1743-12-01,,,Åland,,1743
2,1744-01-01,,,Åland,,1744
3,1744-02-01,,,Åland,,1744
4,1744-03-01,,,Åland,,1744


In [5]:
temp.sort_values('dt').tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country
176113,2013-09-01,,,Federated States Of Micronesia
505395,2013-09-01,,,Swaziland
174686,2013-09-01,,,Faroe Islands
409790,2013-09-01,,,Paraguay
577461,2013-09-01,,,Zimbabwe


In [6]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577462 entries, 0 to 577461
Data columns (total 4 columns):
dt                               577462 non-null object
AverageTemperature               544811 non-null float64
AverageTemperatureUncertainty    545550 non-null float64
Country                          577462 non-null object
dtypes: float64(2), object(2)
memory usage: 17.6+ MB


In [7]:
temp['AverageTemperature'].isnull().sum()

32651

In [8]:
for column in temp.columns:
    print(column + ' column contains:')
    print('Total records: '+ str('{:,}'.format(len(temp.index))))
    print('Nulls: ' + str('{:,}'.format(temp[column].isnull().sum())))
    print('Proportion of Nulls in Data: ' + str("{:.2%}".format(temp[column].isnull().sum() / len(temp.index))))
    print()

dt column contains:
Total records: 577,462
Nulls: 0
Proportion of Nulls in Data: 0.00%

AverageTemperature column contains:
Total records: 577,462
Nulls: 32,651
Proportion of Nulls in Data: 5.65%

AverageTemperatureUncertainty column contains:
Total records: 577,462
Nulls: 31,912
Proportion of Nulls in Data: 5.53%

Country column contains:
Total records: 577,462
Nulls: 0
Proportion of Nulls in Data: 0.00%



In [9]:
temp[temp['AverageTemperature'].isnull()]['dt'].value_counts().sort_index(ascending = False).head(10)

2013-09-01    222
2013-08-01      1
2013-07-01      1
2013-06-01      1
2013-05-01      1
2013-04-01      1
2013-03-01      1
2013-02-01      1
2013-01-01      1
2012-12-01      1
Name: dt, dtype: int64

In [14]:
temp[(temp['AverageTemperature'].isnull()) & (temp['dt'] == '2013-07-01')]

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country
23186,2013-07-01,,0.673,Antarctica


In [15]:
# See total number of countries and first 10
print(str(len(temp['Country'].value_counts())) + ' countries')
temp['Country'].value_counts().head(10)

243 countries


Poland             3239
Finland            3239
Macedonia          3239
Faroe Islands      3239
Austria            3239
United Kingdom     3239
Europe             3239
Liechtenstein      3239
Monaco             3239
France (Europe)    3239
Name: Country, dtype: int64

In [32]:
# Add country code to dataframe. Names are mapped as-is (no case normalization),
# so any name that differs from the dictionary keys comes back as NaN
temp['country_code'] = temp['Country'].map(countries)

In [33]:
temp[['Country','country_code']].head()

Unnamed: 0,Country,country_code
0,Åland,
1,Åland,
2,Åland,
3,Åland,
4,Åland,


In [43]:
# Count the countries without a country code (2013 records only), then list record counts for all unmatched country values
print('Countries without derived country code: ' + str(len(temp[(temp['country_code'].isnull()) & (temp['year'] == 2013)]['Country'].value_counts())))
temp[temp['country_code'].isnull()]['Country'].value_counts()

Countries without derived country code: 59


Europe                                       3239
Denmark (Europe)                             3239
Czech Republic                               3239
Saint Pierre And Miquelon                    3239
Åland                                        3239
France (Europe)                              3239
Moldova                                      3239
Isle Of Man                                  3239
Macedonia                                    3239
Bosnia And Herzegovina                       3239
United Kingdom (Europe)                      3239
Netherlands (Europe)                         3239
Svalbard And Jan Mayen                       3033
North America                                2941
Reunion                                      2721
Syria                                        2460
Gaza Strip                                   2460
Russia                                       2421
Burma                                        2371
Laos                                         2371


#### For temperature data, use the state and country datasets.
- State data can be joined to the immigration data on the U.S. state where the traveler arrives.
- Country data can be joined to the immigration data on the country the traveler is coming from. State-level data for other countries is hard to join reliably, so the country level still gives us temperature data about where the person is arriving from.
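
The state-level join could be sketched as follows. This is a minimal illustration under stated assumptions, not the final ETL: the two small dataframes are made-up stand-ins for the real tables, `state_abbrev` is a hypothetical lookup from state name to the two-letter code, and it assumes `i94addr` holds the arrival state. Because the temperature data ends in 2013, the sketch averages each state's temperature by calendar month and joins on state and month rather than on exact date.

```python
import pandas as pd

# Made-up stand-ins for the state temperature and immigration tables.
state_temp = pd.DataFrame({
    "dt": ["2013-04-01", "2013-04-01", "2012-04-01"],
    "AverageTemperature": [20.0, 10.0, 22.0],
    "State": ["Texas", "New York", "Texas"],
    "Country": ["United States"] * 3,
})
immigration = pd.DataFrame({
    "cicid": [1, 2, 3],
    "i94mon": [4, 4, 4],
    "i94addr": ["TX", "NY", "HI"],
})

# Hypothetical lookup from state name to the code assumed to be in i94addr.
state_abbrev = {"Texas": "TX", "New York": "NY"}

# Keep U.S. rows only and derive the two join keys.
us = state_temp[state_temp["Country"] == "United States"].copy()
us["i94addr"] = us["State"].map(state_abbrev)
us["i94mon"] = pd.to_datetime(us["dt"]).dt.month

# Average temperature per state and calendar month across all years.
monthly = us.groupby(["i94addr", "i94mon"], as_index=False)["AverageTemperature"].mean()

# Left join keeps every arrival, even when no temperature is available.
joined = immigration.merge(monthly, on=["i94addr", "i94mon"], how="left")
print(joined)
```

A left join is used so that arrivals to states missing from the temperature lookup (here `HI`) are kept with a null temperature instead of being dropped. The country-level join would follow the same pattern, using the derived `country_code` column as the key.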