Data source: https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr

# Feature: Number of Nearby Business Nonrenewals

Spatial: within 1 mile, within same community area, within same ward, within same census tract

Time: in same year, in previous year, in past 2 years total

## 1. Setup

In [1]:
# Setup autoreload
%load_ext autoreload
%autoreload 2

In [2]:
# Import libraries
import itertools
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

# Tweak display settings for tables
pd.options.display.max_columns = 999

In [3]:
# Code-done alert
from IPython.display import Audio
sound_file = 'applause2.wav'
# Audio(sound_file, autoplay=True)

## 2. Read data

In [4]:
DATA_PATH = "../../data/Business_Licenses.csv"
DTYPE_DICT = {
    'ZIP CODE': str,
    'BUSINESS ACTIVITY ID': str,
    'BUSINESS ACTIVITY': str,
}
DATE_COLS = ['LICENSE TERM START DATE', 'LICENSE TERM EXPIRATION DATE', 'DATE ISSUED']

df = pd.read_csv(DATA_PATH,
                 dtype=DTYPE_DICT,
                 parse_dates=DATE_COLS)
df.shape

(970564, 34)

### 2.1 Check that no account-site has two different addresses

Rows with missing latitude or longitude are excluded from the check.

In [18]:
LOCATION_COLS = ['ACCOUNT NUMBER', 'SITE NUMBER', 'ADDRESS', 'CITY',
                 'STATE', 'ZIP CODE', 'LATITUDE', 'LONGITUDE']

df[LOCATION_COLS] \
    .dropna(subset=['LATITUDE', 'LONGITUDE']) \
    .drop_duplicates() \
    .groupby(['ACCOUNT NUMBER', 'SITE NUMBER']) \
    .size() \
    .reset_index() \
    .sort_values(by=0, ascending=False) \
    .head()

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,0
0,1,1,1
138670,329530,2,1
138660,329520,1,1
138661,329522,1,1
138662,329524,1,1


In [17]:
df[['ACCOUNT NUMBER', 'SITE NUMBER']].drop_duplicates().shape  

(232496, 2)

The largest group has size 1, so no account-site maps to more than one address. We're ok!
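
The same check can be written as an assertion so that it fails loudly if the data changes. This is a minimal sketch on toy data: the helper name `max_locations_per_site` and the toy values are my own illustration, not part of the pipeline above.

```python
import pandas as pd

# Toy license-level data (hypothetical values, same column names as above)
toy = pd.DataFrame({
    'ACCOUNT NUMBER': [1, 1, 1, 2],
    'SITE NUMBER':    [1, 1, 2, 1],
    'ADDRESS':   ['17 W ADAMS ST', '17 W ADAMS ST',
                  '17 W ADAMS ST BSMT', '1028 W DIVERSEY PKWY'],
    'LATITUDE':  [41.8793, 41.8793, 41.8793, 41.9327],
    'LONGITUDE': [-87.6284, -87.6284, -87.6284, -87.6550],
})
toy_cols = ['ACCOUNT NUMBER', 'SITE NUMBER', 'ADDRESS', 'LATITUDE', 'LONGITUDE']


def max_locations_per_site(license_df, location_cols):
    """Largest number of distinct location rows found for any account-site."""
    return (license_df[location_cols]
            .dropna(subset=['LATITUDE', 'LONGITUDE'])
            .drop_duplicates()
            .groupby(['ACCOUNT NUMBER', 'SITE NUMBER'])
            .size()
            .max())


# Raises AssertionError if any account-site has two different locations
assert max_locations_per_site(toy, toy_cols) == 1
```

On the real data, the call would presumably be `max_locations_per_site(df, LOCATION_COLS)`.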

## 3. Preprocess supporting data for feature construction

### 3.1 Extract location features for each account-site

In [98]:
def get_locations(input_df):
    '''
    Takes license-level data and returns a dataframe with location attributes
        for each account-site.
    '''
    # Columns to return
    LOCATION_COLS = ['ACCOUNT NUMBER', 'SITE NUMBER', 'ADDRESS', 'CITY',
                     'STATE', 'ZIP CODE', 'WARD', 'POLICE DISTRICT',
                     'LATITUDE', 'LONGITUDE', 'LOCATION']

    # Drop rows if these columns have NA
    NA_COLS = ['LATITUDE', 'LONGITUDE', 'LOCATION']

    df = input_df.copy(deep=True)[LOCATION_COLS] \
        .dropna(subset=NA_COLS) \
        .drop_duplicates() \
        .sort_values(by=['ACCOUNT NUMBER', 'SITE NUMBER'])

    return df

addresses = get_locations(df)

addresses.head()

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,ADDRESS,CITY,STATE,ZIP CODE,WARD,POLICE DISTRICT,LATITUDE,LONGITUDE,LOCATION
156224,1,1,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
26538,1,2,17 W ADAMS ST BSMT & 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
363441,2,2,11601 W TOUHY AVE T1 CO,CHICAGO,IL,60666,41.0,16.0,42.008536,-87.914428,"(42.008536400868735, -87.91442843927047)"
701027,4,1,1028 W DIVERSEY PKWY,CHICAGO,IL,60614,44.0,19.0,41.932727,-87.655042,"(41.93272677149699, -87.65504177558735)"
714842,6,1,3714 S HALSTED ST 1ST #,CHICAGO,IL,60609,11.0,9.0,41.827185,-87.64617,"(41.82718501563474, -87.64617045635079)"


## 4. Construct features

Now that we have address data for each account-site, we can merge it onto the business nonrenewals dataset and aggregate nonrenewals by any location attribute in the data.

Below, I demonstrate two ways of aggregating the number of nonrenewals by location:
1. By some categorical location feature (e.g. zip code, census tract number)
2. By some distance measure (e.g. within 1 mile)

### 4.1 Number of nonrenewals in the same zip code, same year

In [100]:
def count_by_zip_year(input_df, license_data):
    '''
    Takes business-year-level data and license-level data. Returns a dataframe
        with, for each business-year, the number of nonrenewals in the same
        zip code and year.
    '''

    # Get locations from license data and merge onto business-year data
    addresses = get_locations(license_data)
    df = input_df.copy(deep=True) \
        .merge(addresses, how='left', on=['ACCOUNT NUMBER', 'SITE NUMBER'])

    # Setting and resetting index serves the purpose of expanding rows to all
    #   years for each zipcode in the data, then filling the "missing" rows
    #   with a count of 0. This lets us handle implicit imputation here, then
    #   merge it onto the original data without introducing NAs at that stage.
    counts_by_zip = df.loc[df['not_renewed_2yrs'] == 1] \
        .groupby(['ZIP CODE', 'YEAR']).size().reset_index() \
        .set_index(['ZIP CODE', 'YEAR']) \
        .reindex(pd.MultiIndex.from_tuples(
            itertools.product(df['ZIP CODE'].unique(), df['YEAR'].unique()))) \
        .reset_index() \
        .rename(columns={'level_0': 'ZIP CODE',
                         'level_1': 'YEAR',
                         0: 'num_not_renewed_zip'}) \
        .fillna(0) \
        .sort_values(by=['ZIP CODE', 'YEAR'])

    # Merge zip-year level data onto base
    result_df = df[['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR', 'ZIP CODE']] \
        .merge(counts_by_zip, how='left', on=['ZIP CODE', 'YEAR']) \
        .drop(labels=['ZIP CODE'], axis=1) \
        .sort_values(by=['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR'])

    return result_df

In [101]:
base = pd.read_csv('../../data/not_renewed_2yrs.csv')
count_by_zip_year(base, df)

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,YEAR,num_not_renewed_zip
0,1,1,2002,38.0
1,1,1,2003,149.0
2,1,1,2004,132.0
3,1,1,2005,189.0
4,1,1,2006,185.0
5,1,2,2016,1.0
6,1,2,2017,0.0
7,2,2,2002,6.0
8,2,2,2003,16.0
9,2,2,2004,20.0
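
The set-index / reindex / fill step inside `count_by_zip_year` is the least obvious part, so here is a small sketch of the same idea on toy data. It uses `pd.MultiIndex.from_product` with `names=` and `reindex(..., fill_value=0)` rather than the `from_tuples` / `rename` / `fillna` chain above; that is an alternative spelling I am assuming is equivalent for this purpose, and the toy values are hypothetical.

```python
import pandas as pd

# Toy nonrenewal events (hypothetical values): one row per failed business-year
events = pd.DataFrame({
    'ZIP CODE': ['60603', '60603', '60614'],
    'YEAR':     [2002, 2002, 2003],
})

zips = ['60603', '60614']
years = [2002, 2003]

# Full zip-year grid; names= labels the index levels so no renaming is needed
grid = pd.MultiIndex.from_product([zips, years], names=['ZIP CODE', 'YEAR'])

counts = (events.groupby(['ZIP CODE', 'YEAR']).size()
          .reindex(grid, fill_value=0)   # zip-years with no events get a count of 0
          .rename('num_not_renewed_zip')
          .reset_index())
```

Here `counts` has four rows, one per zip-year, including the two combinations that never appear in `events`.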


### 4.2 Number of nonrenewals within a distance radius, same year

Strategy:
1. Get the cartesian product of all failure event locations, giving pairs of lat/long points
2. Keep only pairs that occur in the same year (alternatively, combine steps 1 and 2 by building the pairs within each year, i.e. blocking by year)
3. Apply the Haversine formula to get the distance between the two points in each pair
4. Filter for pairs under a given distance (e.g. 1 km)
5. Aggregate by business-year to get a count of failures in the same year within that distance
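
To make steps 3 to 5 concrete, here is a minimal standard-library sketch of the Haversine formula and the within-radius count for a single year. The helper names (`haversine_km`, `count_within`) and the toy points are hypothetical; the notebook itself uses `sklearn.metrics.pairwise.haversine_distances` below.

```python
import math

EARTH_RADIUS_KM = 6371  # mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Toy failure events in one year (hypothetical ids and coordinates)
failures = {
    'A': (41.8793, -87.6284),
    'B': (41.8800, -87.6290),  # roughly 0.1 km from A
    'C': (41.9327, -87.6550),  # several km from A and B
}


def count_within(point, others, radius_km=1.0):
    """Number of points in `others` within radius_km of `point`."""
    return sum(haversine_km(*point, *other) <= radius_km for other in others)


# Each failure is compared against all failures, so it also counts itself
counts = {name: count_within(pt, failures.values())
          for name, pt in failures.items()}
```

Note that a failed business is at distance 0 from itself, so it is included in its own count.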

In [102]:
# Load dataframe of nonrenewals
fails = pd.read_csv('../../data/not_renewed_2yrs.csv') \
    .merge(addresses, how='left', on=['ACCOUNT NUMBER', 'SITE NUMBER'])
fails.head()

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,YEAR,not_renewed_2yrs,ADDRESS,CITY,STATE,ZIP CODE,WARD,POLICE DISTRICT,LATITUDE,LONGITUDE,LOCATION
0,1,1,2002,0.0,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
1,1,1,2003,0.0,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
2,1,1,2004,0.0,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
3,1,1,2005,0.0,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"
4,1,1,2006,1.0,17 W ADAMS ST # 1ST,CHICAGO,IL,60603,42.0,1.0,41.879342,-87.628412,"(41.879341938770445, -87.62841188861722)"


In [134]:
def count_by_dist_radius(input_df, radius_km=1):
    '''
    Counts the number of business nonrenewals within radius_km kilometers of
        each business, for each business-year.
    
    Input: input_df - df of business-year level data with binary feature for "not renewed in 2 years".
        Dataframe must also have these cols: ACCOUNT NUMBER, SITE NUMBER, YEAR, LATITUDE, LONGITUDE
        radius_km - distance threshold in km (default 1)
        
    Output: df with the business-year id cols (ACCOUNT NUMBER, SITE NUMBER, YEAR) and the count column.
    '''
    
    # Select columns (as a copy, so input_df is not modified),
    #  then convert lat/long from degrees to radians
    df = input_df[['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR', 'LATITUDE', 'LONGITUDE', 'not_renewed_2yrs']].copy()
    df['LATITUDE_rad'] = np.radians(df['LATITUDE'])
    df['LONGITUDE_rad'] = np.radians(df['LONGITUDE'])
    R = 6371  # mean radius of the Earth in km
    count_col = f'num_not_renewed_{radius_km}km'
    
    year_dfs = []

    for year in sorted(df['YEAR'].unique()):
        year_df = df.loc[df['YEAR'] == year]
        fails_only = year_df.loc[year_df['not_renewed_2yrs'] == 1]
        
        # Get pairwise distance between all businesses that year and all nonrenewals that year
        # Then count number of nonrenewals within threshold distance (using row-wise sum)
        #  and join back on year_df
        # Note: a business that was not renewed is at distance 0 from itself,
        #  so it is included in its own count
        print("Crunching", year)
        dist_km = haversine_distances(year_df[['LATITUDE_rad', 'LONGITUDE_rad']],
                                      fails_only[['LATITUDE_rad', 'LONGITUDE_rad']]) * R
        counts = pd.DataFrame({count_col: (dist_km <= radius_km).sum(axis=1)})
        year_df = year_df \
            .reset_index(drop=True) \
            .join(counts) \
            .drop(labels=['LATITUDE', 'LONGITUDE', 'LATITUDE_rad', 'LONGITUDE_rad', 'not_renewed_2yrs'], axis=1)
        
        year_dfs.append(year_df)
    
    # Concatenate all year-specific dfs to get counts for all business-years
    # Then merge onto original df by business-year id cols
    all_years_df = pd.concat(year_dfs)
    result = input_df.merge(all_years_df, how='left', on=['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR']) \
        [['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR', count_col]]
    
    return result

In [136]:
radius_test = count_by_dist_radius(fails, 1)

Crunching 2002
Crunching 2003
Crunching 2004
Crunching 2005
Crunching 2006
Crunching 2007
Crunching 2008
Crunching 2009
Crunching 2010
Crunching 2011
Crunching 2012
Crunching 2013
Crunching 2014
Crunching 2015
Crunching 2016
Crunching 2017


In [139]:
radius_test.head(10)

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,YEAR,num_not_renewed_1km
0,1,1,2002,182
1,1,1,2003,946
2,1,1,2004,838
3,1,1,2005,1017
4,1,1,2006,1070
5,1,2,2016,7
6,1,2,2017,2
7,2,2,2002,6
8,2,2,2003,13
9,2,2,2004,22
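
The time windows listed at the top (previous year, past 2 years total) can be derived from these same-year counts. The sketch below is one possible approach, under two assumptions of mine: "past 2 years total" means the two years before the current one, and a business's earlier count is only available if that business has a row for the earlier year. The helper `lagged` and the toy values are hypothetical. Re-keying by `YEAR` (rather than shifting rows) keeps the lags correct when an account-site has gaps between years.

```python
import pandas as pd

ID_COLS = ['ACCOUNT NUMBER', 'SITE NUMBER']
COL = 'num_not_renewed_1km'

# Toy same-year counts (hypothetical values), one row per business-year
counts = pd.DataFrame({
    'ACCOUNT NUMBER': [1, 1, 1, 2, 2],
    'SITE NUMBER':    [1, 1, 1, 1, 1],
    'YEAR':           [2002, 2003, 2004, 2002, 2003],
    COL:              [5, 7, 2, 1, 4],
})


def lagged(df, lag):
    """Same-year counts re-keyed to the year `lag` years later."""
    out = df[ID_COLS + ['YEAR', COL]].copy()
    out['YEAR'] = out['YEAR'] + lag
    return out.rename(columns={COL: f'{COL}_lag{lag}'})


features = counts \
    .merge(lagged(counts, 1), how='left', on=ID_COLS + ['YEAR']) \
    .merge(lagged(counts, 2), how='left', on=ID_COLS + ['YEAR'])

# NaN whenever either of the two earlier years is missing for that account-site
features[f'{COL}_past2'] = features[f'{COL}_lag1'] + features[f'{COL}_lag2']
```

Whether missing earlier years should stay as NaN or be imputed is a modeling decision that this sketch leaves open.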


### 4.3 Number of businesses within a distance radius, same year

Since `num_not_renewed_1km` counts the businesses within a given distance that failed in that year, the same function can be modified to count the businesses within that distance that did *not* fail that year. That gives us `num_businesses_1km` (or the equivalent for some other distance radius).
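
A rough sketch of that modification for a single year is below. The only change in logic is the filter (`not_renewed_2yrs == 0` instead of `== 1`). To keep the example dependency-light, I use a hypothetical NumPy helper, `pairwise_haversine_km`, as a stand-in for `haversine_distances`; the toy coordinates are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # mean radius of the Earth


def pairwise_haversine_km(a_deg, b_deg):
    """Distances in km between every row of a_deg and every row of b_deg.

    Both inputs hold [latitude, longitude] in degrees.
    """
    a = np.radians(np.asarray(a_deg, dtype=float))[:, None, :]
    b = np.radians(np.asarray(b_deg, dtype=float))[None, :, :]
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


# Toy business-year rows for one year (hypothetical values)
year_df = pd.DataFrame({
    'LATITUDE':  [41.8793, 41.8800, 41.8805, 41.9327],
    'LONGITUDE': [-87.6284, -87.6290, -87.6280, -87.6550],
    'not_renewed_2yrs': [1, 0, 0, 0],
})

# Compare every business against the businesses that did NOT fail that year
survivors = year_df.loc[year_df['not_renewed_2yrs'] == 0]
dist_km = pairwise_haversine_km(year_df[['LATITUDE', 'LONGITUDE']],
                                survivors[['LATITUDE', 'LONGITUDE']])
year_df['num_businesses_1km'] = (dist_km <= 1).sum(axis=1)
```

As with the nonrenewal count, a business that did not fail is within 0 km of itself, so it is included in its own `num_businesses_1km`.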

In [None]:
Audio(sound_file, autoplay=True)