<a href="https://colab.research.google.com/github/thedanindanger/yaads-examples/blob/dev/zipCodeImpute/ZIPCodeImpute.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary
BigQuery has several datasets from the census based on ZIP Code Tabulation Areas (ZCTAs) and ZIP Codes.

The problem is that census ZCTAs and ZIP Codes are not a perfect match. ZCTAs only exist for higher-population areas, so some ZIP Codes have no census record.

In another example, I created a notebook that pulls census population by ZIP code, then uses the state average ZIP population to impute missing values: https://github.com/thedanindanger/yaads-examples/tree/main/ColabIntro 

However, there is likely a much better solution. First I will try a hierarchical outer join*, then I will try a nearest neighbor imputation given the cartesian distance between ZIP code centroid locations.

*Note: As far as I know, I made this term up; the explanation is at the end.

# Connect to BigQuery


In [2]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [7]:
##@title GCP Project Name
project_id = "" #@param {type:"string"}
%load_ext google.colab.data_table

In [8]:
%%bigquery --project $project_id zip_pop_df
SELECT 
  --Forces leading zeros, e.g. ZIP 34 would first concat to 0000034,
  --  Then the right five are 00034
  right(concat('00000' ,cast(zipcode as string)),5) as zip_5, 
  sum(population) as population 
 FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010` 
 GROUP BY 1
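
The padding trick in the query is easy to check outside of SQL. Below is a minimal sketch of the same logic in Python; the helper names `pad_zip_sql_style` and `pad_zip_zfill` are made up for this illustration and are not part of the notebook.

```python
def pad_zip_sql_style(zipcode) -> str:
    """Mimic RIGHT(CONCAT('00000', CAST(zipcode AS STRING)), 5)."""
    return ("00000" + str(zipcode))[-5:]


def pad_zip_zfill(zipcode) -> str:
    """Left-pad with zeros to five characters using str.zfill."""
    return str(zipcode).zfill(5)


# Hypothetical inputs, chosen only to show the padding behavior.
for raw in (34, 501, "2134", 79699):
    print(raw, pad_zip_sql_style(raw), pad_zip_zfill(raw))
```

The two approaches agree for anything up to five characters. They differ on longer input: the SQL-style version keeps only the right five characters, while `zfill` leaves the value untouched.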

# Explore missing values
Now we have the same set of ZIPs as our previous example.

Let's see how many ZIPs we are missing.

In [13]:
missing_zips = zip_pop_df[zip_pop_df['population'].isnull()]

In [14]:
missing_zips

Unnamed: 0,zip_5,population


Do you see the problem? We don't know which ZIPs are missing because we have nothing to compare them against.
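
Here is a small sketch of why the null filter comes back empty. The data is invented for illustration, and the names `population`, `wanted`, and `absent` are placeholders rather than anything from the real tables.

```python
import pandas as pd

# Toy data, invented for illustration only.
population = pd.DataFrame(
    {"zip_5": ["00001", "00002"], "population": [100, 250]}
)
wanted = pd.DataFrame({"Zip": ["00001", "00002", "00003"]})

# Filtering the population table for nulls finds nothing:
# an absent ZIP has no row at all, so there is no null to detect.
nulls_in_table = population[population["population"].isnull()]

# A left join from a reference list exposes the gap.
# indicator=True adds a "_merge" column marking rows with no match.
merged = wanted.merge(
    population, left_on="Zip", right_on="zip_5", how="left", indicator=True
)
absent = merged.loc[merged["_merge"] == "left_only", "Zip"].tolist()

print(len(nulls_in_table))
print(absent)
```

A missing ZIP only becomes visible once there is a reference list on the left side of the join.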

In the previous example, there was a theoretical list of customers. 

I do have a list of ZIP codes to which I need to assign population; however, I will be aggregating the population by DMA, which is a proprietary aggregation measure and one I am not at liberty to disclose. Therefore, I will only include the list of ZIPs I need.



In [15]:
import pandas as pd

zip_url = 'https://raw.githubusercontent.com/thedanindanger/yaads-examples/main/zipCodeImpute/target_zips.csv'

target_zips = pd.read_csv(zip_url)

In [17]:
target_zips.head()

Unnamed: 0,Zip
0,79699
1,79698
2,79697
3,79606
4,79605


In [32]:
# Zero-pad (zero-fill) to five characters, e.g. '00313' instead of '313'
zip_pop_df['zip_5'] = zip_pop_df['zip_5'].str.zfill(5)

# target_zips was read as integers, so convert to string before padding
target_zips['Zip'] = target_zips['Zip'].astype(str).str.zfill(5)

In [33]:
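# Left join keeps every target ZIP; unmatched rows get NaN in zip_5 and population
test_zips = pd.merge(target_zips, zip_pop_df, left_on='Zip', right_on='zip_5', how='left')
# Target ZIPs with no match in the census data
missing_zips = test_zips[test_zips['zip_5'].isnull()]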

In [35]:
missing_zips

Unnamed: 0,Zip,zip_5,population
1,79698,,
2,79697,,
89,31704,,
111,31760,,
417,87131,,
...,...,...,...
29479,67476,,
29784,17877,,
29929,00073,,
29940,00074,,


In [41]:
missing_count = missing_zips['Zip'].count()
total_count = target_zips['Zip'].count()

print(f'Missing {missing_count} of {total_count} ZIP codes; or {round((missing_count / total_count) *100,2)}% of total target ZIPs')


Missing 447 of 30056 ZIP codes; or 1.49% of total target ZIPs


# Hierarchical backfill of missing ZIPs
Now we know there are still ZIPs missing. That means we can process our data a little more rigorously.

At this point, I will load the data into BigQuery for further evaluation, mostly because SQL is generally considered easier to read and more widely understood than Python.

I created a table in BigQuery through the cloud console. It's very easy to do. I plan to make a video on it one day. If this still isn't updated feel free to make a comment on the repo to remind me.

You can use whatever name you like, but the region needs to be 'us' since the Google public datasets are in that region as well: https://cloud.google.com/bigquery/docs/locations 

In [64]:
#@title BigQuery Dataset and Table
bq_dataset = "sample_data" #@param {type:"string"}
bq_table = "target_zips" #@param {type:"string"}
bq_region = "us" #@param {type:"string"}



In [67]:
import pandas_gbq
pandas_gbq.to_gbq(
    target_zips, f'{bq_dataset}.{bq_table}', project_id=project_id, if_exists='replace',location=bq_region
)

1it [00:02,  2.96s/it]


In [68]:
%%bigquery --project $project_id 
select * from `yaads-articles.sample_data.target_zips`
limit 5 

Unnamed: 0,Zip
0,79699
1,79698
2,79697
3,79606
4,79605


With that loaded, we can get down to business.

The census public data has tons of ZIP datasets, so we can try several of the most recent.

In [71]:
%%bigquery population_zips_multi_census --project $project_id 
select 
  distinct
  t.zip,
  --first non-null value, in priority order
  coalesce(
    z10.population,
    z18.total_pop,
    z17.total_pop,
    z16.total_pop,
    z15.total_pop,
    z14.total_pop
  ) as population
from
  `yaads-articles.sample_data.target_zips` t
  left join
  `bigquery-public-data.census_bureau_usa.population_by_zip_2010` z10
  on t.zip = right(concat('00000',z10.zipcode),5) --census 2010 zipcode is not zero padded
  left join
  `bigquery-public-data.census_bureau_acs.zip_codes_2018_5yr` z18
  on t.zip = z18.geo_id --'geo_id' is ZIP code in zip acs tables
  left join
  `bigquery-public-data.census_bureau_acs.zip_codes_2017_5yr` z17
  on t.zip = z17.geo_id
  left join
  `bigquery-public-data.census_bureau_acs.zip_codes_2016_5yr` z16
  on t.zip = z16.geo_id
  left join
  `bigquery-public-data.census_bureau_acs.zip_codes_2015_5yr` z15
  on t.zip = z15.geo_id
  left join
  `bigquery-public-data.census_bureau_acs.zip_codes_2014_5yr` z14
  on t.zip = z14.geo_id
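
For anyone who would rather see the fallback order in pandas, here is a rough sketch of the same idea. The data is invented, and the column names (`pop_2010`, `pop_2018`, `pop_2017`) are placeholders that assume the sources have already been joined side by side.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; one column per source.
sources = pd.DataFrame(
    {
        "zip": ["00001", "00002", "00003", "00004"],
        "pop_2010": [100.0, np.nan, np.nan, np.nan],
        "pop_2018": [110.0, 210.0, np.nan, np.nan],
        "pop_2017": [105.0, 205.0, 305.0, np.nan],
    }
)

# Highest-priority source first.
priority = ["pop_2010", "pop_2018", "pop_2017"]

# bfill(axis=1) pulls the first non-null value in each row into the
# left-most column, so taking that column gives the fallback result.
sources["population"] = sources[priority].bfill(axis=1).iloc[:, 0]

print(sources[["zip", "population"]])
```

A ZIP that is missing from every source, like the last row here, still ends up null, which is exactly the case the query cannot fix.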

In [74]:
missing_zips_multi_census = population_zips_multi_census[population_zips_multi_census['population'].isnull()]

In [76]:
missing_zips_multi_census.count()

zip           447
population      0
dtype: int64

# Imputing by distance
Well that was completely useless...

One last thing to try. There is a 'shape' file in BigQuery containing all the ZIP code boundaries in the US. Perhaps we can find a match there, then impute based on close neighboring ZIP populations.
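
To make the idea concrete, here is a minimal sketch of neighbor-based imputation. It assumes a DataFrame with `lat`, `lon`, and `population` columns where every row has coordinates. The function name `impute_from_neighbors`, the choice of `k`, and the toy data are all made up for illustration. Distance is plain Euclidean distance on degrees, which is only a rough stand-in for real distance on the ground.

```python
import numpy as np
import pandas as pd


def impute_from_neighbors(df: pd.DataFrame, k: int = 3) -> pd.Series:
    """Fill missing populations with the mean of the k nearest known ZIPs.

    Uses Euclidean distance on (lat, lon) in degrees, a rough approximation.
    Assumes every row has lat and lon.
    """
    known = df[df["population"].notna()]
    unknown = df[df["population"].isna()]
    filled = df["population"].copy()
    if known.empty or unknown.empty:
        return filled

    known_xy = known[["lat", "lon"]].to_numpy(dtype=float)
    known_pop = known["population"].to_numpy(dtype=float)
    k = min(k, len(known))

    for idx, row in unknown.iterrows():
        target = np.array([row["lat"], row["lon"]], dtype=float)
        # Distance from this ZIP to every ZIP with a known population
        dist = np.sqrt(((known_xy - target) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        filled.loc[idx] = known_pop[nearest].mean()
    return filled


# Toy data, invented for illustration only.
zips = pd.DataFrame(
    {
        "zip": ["A", "B", "C", "D"],
        "lat": [0.0, 0.0, 10.0, 0.1],
        "lon": [0.0, 1.0, 10.0, 0.1],
        "population": [100.0, 300.0, 5000.0, np.nan],
    }
)
zips["population_imputed"] = impute_from_neighbors(zips, k=2)
print(zips)
```

The mean of the nearest neighbors is just one choice; a median or a distance-weighted average would slot into the same loop.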

In [79]:
%%bigquery zips_geo_join --project $project_id
select
zip,
internal_point_lat as lat,
internal_point_lon as lon
from 
  `yaads-articles.sample_data.target_zips` t
left join 
  `bigquery-public-data.geo_us_boundaries.zip_codes` g
on t.zip = g.zip_code

In [81]:
missing_zips_geo_join = zips_geo_join[zips_geo_join['lat'].isnull()]

In [82]:
missing_zips_geo_join.count()

zip    475
lat      0
lon      0
dtype: int64

Google is still using census data for this, so now for the big guns:
http://download.geonames.org/export/zip/

A repo of all postal codes in the world.

Credit to www.geonames.org for maintaining the repo
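
As a starting point, here is a hedged sketch of reading one of those files with pandas. It assumes the layout described in the readme on that page: tab-separated text with no header row and twelve fields per line. The snake_case column names and the function `read_geonames_postal` are my own labels, and the two sample rows are made up so the sketch runs without a download.

```python
import io

import pandas as pd

# Field order assumed from the GeoNames postal code readme;
# the snake_case names are labels chosen for this sketch.
GEONAMES_COLUMNS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1", "admin_name2", "admin_code2",
    "admin_name3", "admin_code3", "latitude", "longitude", "accuracy",
]


def read_geonames_postal(source) -> pd.DataFrame:
    """Read a GeoNames postal code text file (tab-separated, no header)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=GEONAMES_COLUMNS,
        # Keep codes as strings so leading zeros survive
        dtype={
            "postal_code": str,
            "admin_code1": str,
            "admin_code2": str,
            "admin_code3": str,
        },
    )


# Two made-up rows in the assumed layout, not real GeoNames records.
sample = (
    "US\t00501\tExampleville\tNew York\tNY\tSuffolk\t103\t\t\t40.8\t-73.0\t4\n"
    "US\t79699\tSampletown\tTexas\tTX\tTaylor\t441\t\t\t32.4\t-99.7\t4\n"
)
geo = read_geonames_postal(io.StringIO(sample))
print(geo[["postal_code", "latitude", "longitude"]])
```

With a real download, the path to the extracted text file would take the place of the `io.StringIO` sample.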