# About

This kernel shows how to attach population and zip code information to the intersections.

The results are written to the output file **pop_zipcode_intersec.csv**, so you can use them in Python right away without BigQuery access.
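
As a minimal sketch of reading the export back in, assuming pandas is available: the helper name `load_pop_zipcode` and the sample rows below are hypothetical, and only the column names come from the query in this kernel. Reading `zipcode` as text keeps leading zeros (Boston zip codes start with `0`), and switching off pandas' default NA markers keeps the literal `'N/A'` placeholder from being turned into `NaN`.

```python
import io

import pandas as pd


def load_pop_zipcode(source):
    """Read the exported CSV (hypothetical helper, not part of the kernel)."""
    return pd.read_csv(
        source,
        dtype={"zipcode": str},   # keep leading zeros such as '02108'
        keep_default_na=False,    # otherwise 'N/A' would be parsed as NaN
        na_values=[""],           # only empty fields count as missing
    )


# Hypothetical sample rows in the same column layout as the export.
sample = io.StringIO(
    "city,intersectionId,minimum_age,maximum_age,gender,population,zipcode,zip_code_na\n"
    "Boston,0,20,24,male,1234,02108,0\n"
    "Boston,1,,,,,N/A,1\n"
)
pop = load_pop_zipcode(sample)
print(pop[["intersectionId", "zipcode", "zip_code_na"]])
```

For the real file, pass `'pop_zipcode_intersec.csv'` (or its path in the kernel output) instead of the in-memory sample.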

## Credits
Some of the ideas are inspired by the following kernel. Please visit it and give it an upvote if you like it.
- This kernel is forked from [BigQuery Machine Learning Tutorial](https://www.kaggle.com/rtatman/bigquery-machine-learning-tutorial).

In [None]:
# Replace the value below with YOUR OWN project id
PROJECT_ID = 'kaggle-bq-geotag'
#PROJECT_ID = 'kaggle-competitions-project'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('bqml_example', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

import seaborn as sns
import matplotlib.pyplot as plt

# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()

In [None]:
%load_ext google.cloud.bigquery

# Loading population and zip code information

Let's have a look at the 2010 population per zip code area:

In [None]:
%%bigquery
SELECT
    SUM(pop.population) AS population,
    pop.minimum_age, 
    pop.maximum_age,
    pop.gender,
    zipcd.zipcode,
    CASE zipcd.state_code
      WHEN 'MA' THEN 'Boston'
      WHEN 'IL' THEN 'Chicago'
      WHEN 'GA' THEN 'Atlanta'
      WHEN 'PA' THEN 'Philadelphia'
  END
    city,
    zipcd.zipcode_geom
  FROM
    `bigquery-public-data.utility_us.zipcode_area` zipcd,
    `bigquery-public-data.census_bureau_usa.population_by_zip_2010` pop
  WHERE
    zipcd.state_code IN ('MA',
      'IL',
      'PA',
      'GA')
    AND ( zipcd.city LIKE '%Atlanta%'
      OR zipcd.city LIKE '%Boston%'
      OR zipcd.city LIKE '%Chicago%'
      OR zipcd.city LIKE '%Philadelphia%' )
    AND SUBSTR(CONCAT('000000', pop.zipcode),-5) = zipcd.zipcode
  GROUP BY
    pop.minimum_age, 
    pop.maximum_age,
    pop.gender,
    zipcd.zipcode,
    CASE zipcd.state_code
      WHEN 'MA' THEN 'Boston'
      WHEN 'IL' THEN 'Chicago'
      WHEN 'GA' THEN 'Atlanta'
      WHEN 'PA' THEN 'Philadelphia'
  END
    ,
    zipcd.zipcode_geom
    limit 100

The population is broken down by age range and gender. The zip code dataset provides the geometry of each area as a polygon.
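
The join condition in the query above left-pads the census zip code with zeros before comparing it to the zip code of the area table. A small Python sketch of the same string trick, with a hypothetical helper name `normalize_zip`:

```python
def normalize_zip(raw):
    """Left-pad a zip code with zeros to five characters."""
    # Same idea as SUBSTR(CONCAT('000000', zipcode), -5) in the query:
    # prepend zeros, then keep the last five characters.
    return ("000000" + str(raw))[-5:]


print(normalize_zip("2108"))   # 02108
print(normalize_zip("60601"))  # 60601
```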

Next we check which polygon contains the coordinates of each intersection, so we can match every intersection to a zip code.
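
To illustrate what a point-in-polygon test does, here is a planar ray-casting sketch in plain Python. It is not how BigQuery implements `ST_CONTAINS` (which works on spherical geography); the function name `point_in_polygon`, the square polygon and the two points are hypothetical and only meant to show the idea.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring.

    `ring` is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square "zip code area" and two hypothetical intersections
square = [(-71.10, 42.30), (-71.00, 42.30), (-71.00, 42.40), (-71.10, 42.40)]
print(point_in_polygon(-71.05, 42.35, square))  # True, point is inside
print(point_in_polygon(-70.90, 42.35, square))  # False, point is outside
```

An odd number of edge crossings means the point is inside. Doing this test in BigQuery avoids downloading the zip code polygons at all.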

In [None]:
%%bigquery df
WITH

  # population per zipcode, broken down by age range and gender

  zip_info AS(
  SELECT
    pop.minimum_age, 
    pop.maximum_age,
    pop.gender,
    SUM(pop.population) AS population,
    zipcd.zipcode,
    CASE zipcd.state_code
      WHEN 'MA' THEN 'Boston'
      WHEN 'IL' THEN 'Chicago'
      WHEN 'GA' THEN 'Atlanta'
      WHEN 'PA' THEN 'Philadelphia'
  END
    city,
    zipcd.zipcode_geom
  FROM
    `bigquery-public-data.utility_us.zipcode_area` zipcd,
    `bigquery-public-data.census_bureau_usa.population_by_zip_2010` pop
  WHERE
    zipcd.state_code IN ('MA',
      'IL',
      'PA',
      'GA')
    AND ( zipcd.city LIKE '%Atlanta%'
      OR zipcd.city LIKE '%Boston%'
      OR zipcd.city LIKE '%Chicago%'
      OR zipcd.city LIKE '%Philadelphia%' )
    AND SUBSTR(CONCAT('000000', pop.zipcode),-5) = zipcd.zipcode
  GROUP BY
    pop.minimum_age, 
    pop.maximum_age,
    pop.gender,
    zipcd.zipcode,
    CASE zipcd.state_code
      WHEN 'MA' THEN 'Boston'
      WHEN 'IL' THEN 'Chicago'
      WHEN 'GA' THEN 'Atlanta'
      WHEN 'PA' THEN 'Philadelphia'
  END
    ,
    zipcd.zipcode_geom),
  
  # spatial test and train data
  
  train_and_test AS (
  SELECT
    tr.intersectionId,
    tr.longitude,
    tr.latitude,
    tr.city
  FROM
    `kaggle-competition-datasets.geotab_intersection_congestion.train` tr
  UNION DISTINCT
  SELECT
    ts.intersectionId,
    ts.longitude,
    ts.latitude,
    ts.city
  FROM
    `kaggle-competition-datasets.geotab_intersection_congestion.test` ts),
  
  # Zipcode and Population per Intersection
  
  pop_per_intersection AS (
  SELECT
    t.intersectionId,
    zi.population,
    zi.zipcode,
    t.city,
    zi.minimum_age, 
    zi.maximum_age,
    zi.gender,
    zi.zipcode_geom
  FROM
    train_and_test t,
    zip_info zi
  WHERE
    t.city = zi.city
    AND ST_CONTAINS( ST_GEOGFROMTEXT(zi.zipcode_geom),
      ST_GeogPoint(longitude,
        latitude)))
  
# fill empty/missing zipcodes and population

SELECT
  t.city,
  t.intersectionId, 
  p.minimum_age, 
  p.maximum_age,
  p.gender,
  coalesce(p.population,
    round(AVG(p.population) OVER(PARTITION BY t.city, p.minimum_age, p.maximum_age, p.gender))) AS population,
  coalesce(p.zipcode, 'N/A') AS zipcode,
  CASE
    WHEN p.zipcode IS NULL THEN 1
  ELSE
  0
END AS zip_code_na
#--,
#--ST_GeogPoint(t.longitude,
#--        t.latitude) intersection_gp,
#p.zipcode_geom
FROM
  train_and_test t
LEFT OUTER JOIN
  pop_per_intersection p
ON
  (t.city = p.city
    AND t.intersectionId = p.intersectionId);

In [None]:
df.head()

Check the number of unique intersections:

In [None]:
print('Number of unique intersections as expected:', df.groupby(['city']).intersectionId.nunique().sum()==6381)

# Export to CSV
Missing population (where `zipcode == 'N/A'`) is imputed with the mean over gender, age and city. Imputed rows are flagged in `zip_code_na`.
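
The idea of group-mean imputation can be sketched in pandas as follows. This is an illustration on hypothetical rows, not the query itself: the frame `demo` and the columns `population_imputed` and `was_imputed` are made up for the example, and the flag here is derived from the missing population rather than from the zip code.

```python
import pandas as pd

# Hypothetical rows: one Boston row has no population value.
demo = pd.DataFrame({
    "city": ["Boston", "Boston", "Boston", "Chicago"],
    "minimum_age": [20, 20, 20, 20],
    "maximum_age": [24, 24, 24, 24],
    "gender": ["male", "male", "male", "male"],
    "population": [100.0, 300.0, None, 500.0],
})

keys = ["city", "minimum_age", "maximum_age", "gender"]

# Mean of the known values within each group, aligned to every row
group_mean = demo.groupby(keys)["population"].transform("mean").round()

# Flag the gaps first, then fill them with the group mean
demo["was_imputed"] = demo["population"].isna().astype(int)
demo["population_imputed"] = demo["population"].fillna(group_mean)
print(demo)
```

A group mean can only fill a gap if the grouping keys are known for that row. In the exported data, rows without a zip code match may also lack the age and gender keys, so it is worth checking `df.population.isna().sum()` before relying on the imputed values.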

In [None]:
df.to_csv('pop_zipcode_intersec.csv', index=False)