https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr

- Business failed? (0/1)
    - Failed in 1 year? 2 years?
- Number of nearby businesses
    - Within 1 block, 1 mile, community area, census tract
- Number of non-renewals for other businesses
    - Within 1 block, 1 mile, community area, census tract
    - Within 1 quarter, 1 year, 2 years
- Proximity to CTA station
    - Binary (within constant radius?)
    - Continuous (Euclidean, Manhattan, transit distance)
    - Count of stations within given radius
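
A rough sketch of how the three CTA proximity variants could be computed for one business. Everything here is an assumption for illustration: the helper names, the 0.5 mile radius, and the station coordinates are made up, and great-circle (haversine) distance stands in for the Euclidean/Manhattan/transit distances listed above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def station_features(biz_lat, biz_lon, station_lats, station_lons, radius_miles=0.5):
    """Return (nearest distance, within-radius flag, count within radius)."""
    dists = haversine_miles(biz_lat, biz_lon,
                            np.asarray(station_lats), np.asarray(station_lons))
    in_radius = dists <= radius_miles
    return float(dists.min()), bool(in_radius.any()), int(in_radius.sum())


# Hypothetical station coordinates, NOT real CTA data
station_lats = [41.867339, 42.867339]
station_lons = [-87.64159, -87.64159]
nearest, within, count = station_features(41.867339, -87.64159,
                                          station_lats, station_lons)
```

For the full dataset a spatial join in geopandas would likely scale better than looping over businesses, but the per-business logic would stay the same.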

## Setup

In [182]:
# Setup autoreload
%load_ext autoreload
%autoreload 2

# Setup chart display
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [183]:
# Import libraries
import datetime
import math
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize, scale

# Import pipeline library, hardcoded config file values
import pipeline_library as library
import pipeline_config as config

# Tweak display settings for tables
pd.options.display.max_columns = 999

In [184]:
# Code-done alert
from IPython.display import Audio
sound_file = 'applause2.wav'
# Audio(sound_file, autoplay=True)

## Read data

In [185]:
DATA_PATH = "../../data/Business_Licenses.csv"
DTYPE_DICT = {
    'ZIP CODE': str,
    'BUSINESS ACTIVITY ID': str,
    'BUSINESS ACTIVITY': str,
}
DATE_COLS = ['LICENSE TERM START DATE', 'LICENSE TERM EXPIRATION DATE', 'DATE ISSUED']

# DATE_COLS = ['APPLICATION CREATED DATE', 'APPLICATION REQUIREMENTS COMPLETE', 'PAYMENT DATE', 
#              'LICENSE TERM START DATE', 'LICENSE TERM EXPIRATION DATE', 'LICENSE APPROVED FOR ISSUANCE', 
#              'DATE ISSUED', 'LICENSE STATUS CHANGE DATE']

df = pd.read_csv(DATA_PATH,
                 dtype=DTYPE_DICT,
                 parse_dates=DATE_COLS)
df.shape

(970564, 34)

In [186]:
df.head()

Unnamed: 0,ID,LICENSE ID,ACCOUNT NUMBER,SITE NUMBER,LEGAL NAME,DOING BUSINESS AS NAME,ADDRESS,CITY,STATE,ZIP CODE,WARD,PRECINCT,WARD PRECINCT,POLICE DISTRICT,LICENSE CODE,LICENSE DESCRIPTION,BUSINESS ACTIVITY ID,BUSINESS ACTIVITY,LICENSE NUMBER,APPLICATION TYPE,APPLICATION CREATED DATE,APPLICATION REQUIREMENTS COMPLETE,PAYMENT DATE,CONDITIONAL APPROVAL,LICENSE TERM START DATE,LICENSE TERM EXPIRATION DATE,LICENSE APPROVED FOR ISSUANCE,DATE ISSUED,LICENSE STATUS,LICENSE STATUS CHANGE DATE,SSA,LATITUDE,LONGITUDE,LOCATION
0,22308-20060816,1723393,29481,1,BELL OIL TERMINAL INC,Bell Oil Terminal LLC,3741 S PULASKI RD 1,CHICAGO,IL,60623,14.0,,14-,8.0,1010,Limited Business License,,,22308.0,RENEW,,06/21/2006,08/10/2006,N,2006-08-16,2007-08-15,08/10/2006,2006-08-11,AAI,,,41.82532,-87.72396,"(41.82531992987547, -87.72395999659746)"
1,1620668-20160516,2455262,295026,1,BUCCI BIG & TALL INC.,BUCCI BIG & TALL INC.,558 W ROOSEVELT RD,CHICAGO,IL,60607,25.0,28.0,25-28,1.0,1010,Limited Business License,911.0,Retail Sales of Clothing / Accessories / Shoes,1620668.0,RENEW,,03/15/2016,05/18/2016,N,2016-05-16,2018-05-15,05/18/2016,2016-08-30,AAI,,,41.867339,-87.64159,"(41.86733856638269, -87.64159005699716)"
2,2368602-20160616,2460909,291461,3,"PROJECT: VISION , INC.","PROJECT : VISION , INC",2301 S ARCHER AVE 1 1,CHICAGO,IL,60616,25.0,18.0,25-18,9.0,1625,Raffles,720.0,Not-For-Profit Selling Raffles for Prizes of $...,2368602.0,RENEW,,04/15/2016,06/21/2016,N,2016-06-16,2017-06-15,06/21/2016,2016-06-22,AAC,08/30/2016,,41.850843,-87.638734,"(41.85084294374687, -87.63873424399071)"
3,2060891-20141016,2353257,357247,1,FOLASHADE'S CLEANING SERVICE INC.,FOLASHADE'S CLEANING SERVICE INC.,1965 BERNICE RD 1 1SW,LANSING,IL,60438,,,,,1010,Limited Business License,,,2060891.0,RENEW,,08/15/2014,04/01/2016,N,2014-10-16,2016-10-15,04/01/2016,2016-04-01,AAI,,38.0,41.951316,-87.678586,"(41.95131555606832, -87.67858578019546)"
4,1144216-20070516,1804790,147,63,WALGREEN CO.,Walgreens # 05192,9148 S COMMERCIAL AVE 1ST,CHICAGO,IL,60617,10.0,25.0,10-25,4.0,1010,Limited Business License,,,1144216.0,RENEW,,03/23/2007,05/10/2007,N,2007-05-16,2008-05-15,05/10/2007,2007-05-11,AAI,,5.0,41.728622,-87.551366,"(41.72862173556932, -87.55136646594693)"


In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 970564 entries, 0 to 970563
Data columns (total 34 columns):
ID                                   970564 non-null object
LICENSE ID                           970564 non-null int64
ACCOUNT NUMBER                       970564 non-null int64
SITE NUMBER                          970564 non-null int64
LEGAL NAME                           970560 non-null object
DOING BUSINESS AS NAME               970505 non-null object
ADDRESS                              970564 non-null object
CITY                                 970559 non-null object
STATE                                970552 non-null object
ZIP CODE                             970151 non-null object
WARD                                 898992 non-null float64
PRECINCT                             868992 non-null float64
WARD PRECINCT                        899165 non-null object
POLICE DISTRICT                      874368 non-null float64
LICENSE CODE                         970564 non-n

## Clean data

In [188]:
# Drop all licenses with a non-Chicago address
df = df.loc[df['CITY'] == 'CHICAGO']

# Drop all licenses whose term expiration date is not after the term start date
df = df.loc[df['LICENSE TERM EXPIRATION DATE'] > df['LICENSE TERM START DATE']]

In [189]:
# Extract year from DATE ISSUED column
df['YEAR'] = df["DATE ISSUED"].dt.year.astype('int')

## Generate Features

Prediction will be done at the account-site-year level. Specifically, given data for a particular business-year, the model predicts if a business will not renew their license within 2 years of their last renewal.

### LABEL: Business did not renew license in 2 years?

For all licenses that exist as of 2 years before the end of the test/train data, does a subsequent license renewal exist in the dataset? 

For example, we have data from 2002-2018, with training data 2002-2014 and test data 2015-2018. 
- In the training data (2002-2014), businesses that had licenses issued through 12/31/2014 are marked as "failed in 2 years" if no subsequent license renewal exists in the 2015-2016 data.
- In the test data (2015-2018), businesses with licenses issued through 12/31/2016 are marked as "failed in 2 years" if no subsequent license renewal exists in the 2017-2018 data.
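
The labeling rule above can be sketched on a toy table. This is a simplified illustration, not the notebook's actual labeling code: `label_not_renewed` is a hypothetical helper, the toy rows are invented, and license expiry dates are ignored here.

```python
import pandas as pd


def label_not_renewed(licenses, cutoff, buffer_years=2):
    """Label each account-site seen on or before `cutoff`.

    1 if no license is issued in the buffer window after the cutoff, else 0.
    """
    cutoff = pd.Timestamp(cutoff)
    buffer_end = cutoff + pd.DateOffset(years=buffer_years)
    issued = licenses['DATE ISSUED']

    # Account-sites observed in the training window
    observed = licenses.loc[issued <= cutoff, 'account_site'].unique()
    # Account-sites with at least one license issued in the forward buffer
    renewed = set(licenses.loc[(issued > cutoff) & (issued <= buffer_end),
                               'account_site'])

    return pd.Series([0 if a in renewed else 1 for a in observed],
                     index=observed, name='not_renewed_2yrs')


# Invented rows: '1-1' renews in 2015, '2-1' never appears again
toy = pd.DataFrame({
    'account_site': ['1-1', '1-1', '2-1'],
    'DATE ISSUED': pd.to_datetime(['2014-03-01', '2015-03-01', '2013-06-01']),
})
labels = label_not_renewed(toy, '2014-12-31')
```

Here `1-1` gets label 0 because it has a license issued inside the 2015-2016 buffer, while `2-1` gets label 1.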

We can apply this to sequentially shorter training data periods, giving 6 possible temporal validation splits.

| Data exists as of | Train duration | Test duration | Forward buffer
| ------- | ------- | ------- | -------
| 12/31/2018 | 01/01/2002 - 12/31/2014 | 01/01/2015 - 12/31/2016 | 01/01/2017 - 12/31/2018
| 12/31/2016 | 01/01/2002 - 12/31/2012 | 01/01/2013 - 12/31/2014 | 01/01/2015 - 12/31/2016
| 12/31/2014 | 01/01/2002 - 12/31/2010 | 01/01/2011 - 12/31/2012 | 01/01/2013 - 12/31/2014
| 12/31/2012 | 01/01/2002 - 12/31/2008 | 01/01/2009 - 12/31/2010 | 01/01/2011 - 12/31/2012
| 12/31/2010 | 01/01/2002 - 12/31/2006 | 01/01/2007 - 12/31/2008 | 01/01/2009 - 12/31/2010
| 12/31/2008 | 01/01/2002 - 12/31/2004 | 01/01/2005 - 12/31/2006 | 01/01/2007 - 12/31/2008
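
The schedule in the table follows a regular pattern (each split steps back by 2 years), so it could be generated rather than typed out. A minimal sketch, assuming a fixed 2-year test window and 2-year forward buffer; `temporal_splits` and its parameter names are hypothetical.

```python
import pandas as pd


def temporal_splits(data_start='2002-01-01', last_data_year=2018,
                    window_years=2, n_splits=6):
    """Build one row per temporal validation split, most recent first."""
    rows = []
    for i in range(n_splits):
        buffer_end_year = last_data_year - i * window_years
        test_end_year = buffer_end_year - window_years
        train_end_year = test_end_year - window_years
        rows.append({
            'data_as_of': pd.Timestamp(year=buffer_end_year, month=12, day=31),
            'train_start': pd.Timestamp(data_start),
            'train_end': pd.Timestamp(year=train_end_year, month=12, day=31),
            'test_start': pd.Timestamp(year=train_end_year + 1, month=1, day=1),
            'test_end': pd.Timestamp(year=test_end_year, month=12, day=31),
            'buffer_start': pd.Timestamp(year=test_end_year + 1, month=1, day=1),
            'buffer_end': pd.Timestamp(year=buffer_end_year, month=12, day=31),
        })
    return pd.DataFrame(rows)


splits = temporal_splits()
```

With the defaults this should reproduce the six rows of the table, from train-through-2014 down to train-through-2004.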

Below, I manually create test-train splits for the full dataset.

In [190]:
# Manual test-train split for now, write into a function later

train = df.loc[df['DATE ISSUED'] <= pd.to_datetime('12/31/2014')]
train_buffer = df.loc[
    (df['DATE ISSUED'] > pd.to_datetime('12/31/2014')) &
    (df['DATE ISSUED'] <= pd.to_datetime('12/31/2016'))
]

test = df.loc[df['DATE ISSUED'] <= pd.to_datetime('12/31/2016')]
test_buffer = df.loc[df['DATE ISSUED'] > pd.to_datetime('12/31/2016')]
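
One possible shape for the "write into a function later" step. This is a sketch under assumptions: `split_by_issue_date` is a hypothetical name, and unlike the manual `test_buffer` above it caps the buffer at `cutoff + buffer_years`, which makes no difference while the data ends in 2018.

```python
import pandas as pd


def split_by_issue_date(df, cutoff, buffer_years=2, date_col='DATE ISSUED'):
    """Return (observed, buffer): rows issued on or before the cutoff,
    and rows issued in the buffer window right after it."""
    cutoff = pd.Timestamp(cutoff)
    buffer_end = cutoff + pd.DateOffset(years=buffer_years)
    observed = df.loc[df[date_col] <= cutoff]
    buffer = df.loc[(df[date_col] > cutoff) & (df[date_col] <= buffer_end)]
    return observed, buffer


# Toy dates sitting on the window boundaries
toy = pd.DataFrame({'DATE ISSUED': pd.to_datetime(
    ['2014-12-31', '2015-01-01', '2016-12-31', '2017-01-01'])})
toy_train, toy_train_buffer = split_by_issue_date(toy, '2014-12-31')
toy_test, toy_test_buffer = split_by_issue_date(toy, '2016-12-31')
```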

In [216]:
def reshape_df(input_df):
    '''
    Processes raw business license-level dataframe into account-site-year level dataframe. 
    - Extracts years from min/max year and expands dataframe into account-site-year level
    
    Returns transformed dataframe.
    '''
    
    df = input_df.copy(deep=True)
    
    # Aggregate by account, site and get min/max issue + expiry dates for licenses
    df = df.groupby(['ACCOUNT NUMBER', 'SITE NUMBER']) \
        .agg({'DATE ISSUED': ['min', 'max'],
              'LICENSE TERM EXPIRATION DATE': 'max'}) \
        .reset_index(col_level=1)
    
    # Flatten column names into something usable
    df.columns = df.columns.to_flat_index()
    df = df.rename(columns={
        ('', 'ACCOUNT NUMBER'): "account",
        ('' , 'SITE NUMBER'): 'site',
        ('DATE ISSUED', 'min'): 'min_license_date',
        ('DATE ISSUED', 'max'): 'max_license_date',
        ('LICENSE TERM EXPIRATION DATE', 'max'): 'expiry'})
    
    # Extract min/max license dates into list of years_open
    df['min_year'] = df['min_license_date'].dt.year.astype('int')
    df['max_year'] = df['max_license_date'].dt.year.astype('int')
    df['years_open'] = pd.Series(map(lambda x, y: [z for z in range(x, y+2)], 
                                      df.min_year, 
                                      df.max_year))
    df = df.drop(labels=['min_year', 'max_year'], axis=1)

    # make account-site id var
    # melt step below doesn't work well without merging these two cols
    df['account_site'] = df['account'].astype('str') + "-" + df['site'].astype('str')
    df = df[df.columns.tolist()[-1:] + df.columns.tolist()[:-1]]
    df = df.drop(labels=['account', 'site'], axis=1)
    
    # Expand list of years_open into one row for each account-site-year
    # See https://mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas-9849d8ff2401
    df = df \
        .years_open \
        .apply(pd.Series) \
        .merge(df, left_index=True, right_index=True) \
        .drop(labels=['years_open'], axis=1) \
        .melt(id_vars=['account_site', 'min_license_date', 'max_license_date', 'expiry'],
              value_name='YEAR') \
        .drop(labels=['variable'], axis=1) \
        .dropna() \
        .sort_values(by=['account_site', 'YEAR'])

    # Split account_site back into ACCOUNT NUMBER, SITE NUMBER
    split_ids = df['account_site'].str.split('-', n=1, expand=True)
    df['ACCOUNT NUMBER'] = split_ids[0].astype('int')
    df['SITE NUMBER'] = split_ids[1].astype('int')

    # reorder columns
    df['YEAR'] = df['YEAR'].astype('int')
    df = df[['ACCOUNT NUMBER', 'SITE NUMBER', 'account_site', 'YEAR', 
             'min_license_date', 'max_license_date', 'expiry']] \
        .sort_values(by=['ACCOUNT NUMBER', 'SITE NUMBER'])
    
    return df
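
The `apply(pd.Series)` + `melt` expansion in `reshape_df` could probably be replaced by `DataFrame.explode`, assuming pandas 0.25 or newer is available. A small sketch on invented rows, not a drop-in replacement for the function above:

```python
import pandas as pd

# Invented account-sites with first/last license years
toy = pd.DataFrame({
    'account_site': ['1-1', '2-2'],
    'min_year': [2002, 2013],
    'max_year': [2005, 2014],
})

# Same convention as reshape_df: include the year after the last license
toy['YEAR'] = [list(range(lo, hi + 2))
               for lo, hi in zip(toy['min_year'], toy['max_year'])]

# One row per account-site-year
expanded = toy.explode('YEAR').reset_index(drop=True)
expanded['YEAR'] = expanded['YEAR'].astype(int)
```

Because `explode` keeps the other columns as they are, the temporary `account_site` merge/split would likely not be needed with this approach.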

In [220]:
train_df = reshape_df(train)

train_df.head()

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,account_site,YEAR,min_license_date,max_license_date,expiry
0,1,1,1-1,2002,2002-05-08,2005-05-20,2006-05-15
175561,1,1,1-1,2003,2002-05-08,2005-05-20,2006-05-15
351122,1,1,1-1,2004,2002-05-08,2005-05-20,2006-05-15
526683,1,1,1-1,2005,2002-05-08,2005-05-20,2006-05-15
702244,1,1,1-1,2006,2002-05-08,2005-05-20,2006-05-15


In [221]:
# For each account-site-year, generate label not_renewed_2yrs

# TODO: parameterize hardcoded dates for test/train date bounds
# train: 2002 - end-2014, test: 2002 - end-2016, buffer: 2017-2018

# 1. if last expiry date is within training data bounds, 
#     label = 1 if business-year is 1+ years after the business's last issued license date in train_df, else 0
# 2. if last expiry date is within test data bounds,
#     label = 1 if a new license for account-site is not found in test_df, else 0
# 3. if last expiry date is beyond test data bounds (e.g. license duration > 2 years), label is NaN
#     We have no good way to tell if business is dead or alive.

train_buffer_ids = train_buffer['ACCOUNT NUMBER'].astype('str') + '-' \
    + train_buffer['SITE NUMBER'].astype('str')

train_df['not_renewed_2yrs'] = np.where(
    train_df['expiry'] <= pd.to_datetime('12/31/2014'),
    np.where(train_df['YEAR'] >= train_df['max_license_date'].dt.year.astype('int') + 1, 1, 0),
    np.where(
        (train_df['expiry'] > pd.to_datetime('12/31/2014')) & (train_df['expiry'] <= pd.to_datetime('12/31/2016')),
        ~train_df['account_site'].isin(train_buffer_ids),
        np.nan))

# Drop unnecessary columns
train_df = train_df \
    .drop(labels=['account_site'], axis=1) \
    .reset_index(drop=True)

train_df.head(30)

Unnamed: 0,ACCOUNT NUMBER,SITE NUMBER,YEAR,min_license_date,max_license_date,expiry,not_renewed_2yrs
0,1,1,2002,2002-05-08,2005-05-20,2006-05-15,0.0
1,1,1,2003,2002-05-08,2005-05-20,2006-05-15,0.0
2,1,1,2004,2002-05-08,2005-05-20,2006-05-15,0.0
3,1,1,2005,2002-05-08,2005-05-20,2006-05-15,0.0
4,1,1,2006,2002-05-08,2005-05-20,2006-05-15,1.0
5,2,2,2002,2002-04-29,2014-04-02,2016-04-15,0.0
6,2,2,2003,2002-04-29,2014-04-02,2016-04-15,0.0
7,2,2,2004,2002-04-29,2014-04-02,2016-04-15,0.0
8,2,2,2005,2002-04-29,2014-04-02,2016-04-15,0.0
9,2,2,2006,2002-04-29,2014-04-02,2016-04-15,0.0
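
The nested `np.where` above could also be written with `np.select`, which may be easier to parameterize for other splits. A sketch under assumptions: `label_rows` is a hypothetical helper and the toy rows are invented, though the three cases mirror the numbered comments in the cell above.

```python
import numpy as np
import pandas as pd


def label_rows(df, renewed_ids, train_end, buffer_end):
    """Apply the three labeling cases to an account-site-year dataframe."""
    train_end, buffer_end = pd.Timestamp(train_end), pd.Timestamp(buffer_end)

    # Case 1: last expiry falls inside the training window
    expired_in_train = df['expiry'] <= train_end
    # Case 2: last expiry falls inside the buffer window
    expired_in_buffer = (df['expiry'] > train_end) & (df['expiry'] <= buffer_end)

    past_last_license = df['YEAR'] >= df['max_license_date'].dt.year + 1
    not_renewed = ~df['account_site'].isin(renewed_ids)

    # Case 3 (expiry beyond the buffer) falls through to NaN
    labels = np.select(
        [expired_in_train.to_numpy(), expired_in_buffer.to_numpy()],
        [past_last_license.astype(float).to_numpy(),
         not_renewed.astype(float).to_numpy()],
        default=np.nan)
    return pd.Series(labels, index=df.index, name='not_renewed_2yrs')


toy = pd.DataFrame({
    'account_site': ['1-1', '1-1', '2-2', '3-1', '4-1'],
    'YEAR': [2005, 2006, 2014, 2014, 2014],
    'max_license_date': pd.to_datetime(
        ['2005-05-20', '2005-05-20', '2014-04-02', '2014-06-01', '2014-02-01']),
    'expiry': pd.to_datetime(
        ['2006-05-15', '2006-05-15', '2016-04-15', '2015-06-01', '2017-02-01']),
})
toy_labels = label_rows(toy, renewed_ids={'2-2'},
                        train_end='2014-12-31', buffer_end='2016-12-31')
```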


In [223]:
# TODO: write a function that splits account-site back into original vars, exports CSV

def export_features(input_df, filepath):
    '''
    Writes the account-site-year identifiers and label from input_df to a CSV at filepath.
    '''
    df = input_df.copy(deep=True)
    
    cols=['ACCOUNT NUMBER', 'SITE NUMBER', 'YEAR', 'not_renewed_2yrs']
    df = df[cols]
    
    df.to_csv(filepath, index=False)
    
    return None
    
FILEPATH = '../../data/not_renewed_2yrs.csv'
export_features(train_df, FILEPATH)

In [199]:
# TODO: make this easy to run for different test-train splits

In [200]:
Audio(sound_file, autoplay=True)