# 1.1. Baseline - The Last Value is All You Need

Forecasting microbusiness density for 3,135 counties over 8 months is a challenging task. Inspired by [Chris Deotte](https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/418287)'s simple yet effective approach, I start with a straightforward method: predict the last known value for each county. This baseline, later adjusted with open census data, is strong even though it needs no model fitting or training step. All it requires is the latest recorded microbusiness density for each county.

In [1]:
import os, datetime
from tqdm import tqdm

import numpy as np
import pandas as pd

In [2]:
data_dir = '../data/'

# train data
df_train = pd.read_csv(os.path.join(data_dir, 'train.csv'))
df_revealed_test = pd.read_csv(os.path.join(data_dir, 'revealed_test.csv'))
df_submission = pd.read_csv(os.path.join(data_dir, 'sample_submission.csv'))
df_census = pd.read_csv(os.path.join(data_dir, 'census_starter.csv'))

# test data
df_test = pd.read_csv(os.path.join(data_dir, 'all_revealed_test.csv'))

## 1. Take the last known value of each county

In [3]:
# type casting
df_train.first_day_of_month = pd.to_datetime(df_train.first_day_of_month)

# sort data by cfips and first_day_of_month
df_train = df_train.sort_values(by=['cfips', 'first_day_of_month'])

df_train.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


In [4]:
# select last value only of each county
df_lastvalues = df_train.drop_duplicates(subset='cfips', keep='last')
cfips2lastvalues = {
	cfips: md for cfips, md in zip(
		df_lastvalues['cfips'], df_lastvalues['microbusiness_density'])}
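
The same lookup can be built with a `groupby`, which makes the "last row per county" intent explicit. The sketch below uses a small made-up frame (the densities are illustrative, not taken from `train.csv`) and checks that both routes agree. One difference worth knowing: `GroupBy.last()` skips missing values, while `drop_duplicates(keep='last')` keeps whatever is in the final row.

```python
import pandas as pd

# Toy stand-in for df_train; the values are illustrative only.
toy = pd.DataFrame({
    "cfips": [1001, 1001, 1003, 1003],
    "first_day_of_month": pd.to_datetime(
        ["2022-09-01", "2022-10-01", "2022-09-01", "2022-10-01"]),
    "microbusiness_density": [3.44, 3.46, 8.34, 8.36],
})

# Both approaches rely on the rows being sorted by county and month.
toy = toy.sort_values(["cfips", "first_day_of_month"])

# Route used in the notebook: keep the last row of each county.
via_drop = (toy.drop_duplicates(subset="cfips", keep="last")
               .set_index("cfips")["microbusiness_density"])

# Alternative: last non-missing value within each county group.
via_groupby = toy.groupby("cfips")["microbusiness_density"].last()

last_values = via_groupby.to_dict()
print(last_values)
```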

In [6]:
# extract cfips and first_day_of_month from each row_id
df_submission['cfips'] = df_submission.row_id.apply(lambda x: int(x.split('_')[0]))
df_submission['first_day_of_month'] = pd.to_datetime(df_submission.row_id.apply(lambda x: x.split('_')[1]))
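
As a sketch of an alternative, the two `apply` calls can be replaced by a single vectorized split, assuming every `row_id` has the form `<cfips>_<YYYY-MM-DD>` as in the rows shown in this notebook. The toy frame below is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_submission with two row_ids in the expected format.
toy = pd.DataFrame({"row_id": ["1001_2022-11-01", "56045_2023-06-01"]})

# Split once on the first underscore into two columns (0 and 1).
parts = toy["row_id"].str.split("_", n=1, expand=True)

toy["cfips"] = parts[0].astype(int)
toy["first_day_of_month"] = pd.to_datetime(parts[1])
print(toy)
```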

In [7]:
# fill the submission with each county's last known value
df_submission.microbusiness_density = df_submission.cfips.map(cfips2lastvalues)

In [22]:
from utils import SMAPE

smape_value = SMAPE(
	df_test.microbusiness_density.values, 
	df_submission.microbusiness_density.values).item()
smape_value

5.083047455230122
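
`SMAPE` is imported from a local `utils` module whose source is not shown here. The sketch below is an assumption about what it computes: the symmetric mean absolute percentage error on a 0 to 200 scale, with a term counted as 0 when both the actual and the predicted value are 0. The function name `smape` and its handling of zeros are my choices, not the author's code.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (0 to 200); assumed equivalent to utils.SMAPE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where both values are zero the denominator is zero; count that term as 0.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * ratio.mean()

print(smape([100.0, 50.0], [110.0, 50.0]))
```

Because the error is relative to the average magnitude of the two values, the same absolute miss weighs more heavily on counties with a low density.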

Public score: evaluate only the 2022 months, against `revealed_test.csv`.

In [23]:
# take an explicit copy so that adding a column below does not modify a slice of df_submission
submission = df_submission[df_submission.first_day_of_month.dt.year == 2022].copy()

In [24]:
# sort the predictions into the same row order as revealed_test
sortmap = {val:idx for idx, val in enumerate(df_revealed_test.row_id.values)}
submission['sortmap'] = submission.row_id.map(sortmap)
submission = submission.sort_values('sortmap').drop(columns='sortmap')


In [None]:
smape_value = SMAPE(
	df_revealed_test.microbusiness_density.values, 
	submission.microbusiness_density.values).item()
smape_value

1.7862393745117036
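
The score above is only meaningful if predictions and ground truth are in the same row order, which is what the `sortmap` step takes care of. A minimal sketch of another way to get that alignment is a key-based merge on `row_id`; the two toy frames and their values are made up for illustration.

```python
import pandas as pd

# Ground truth and predictions in different row orders (illustrative values).
truth = pd.DataFrame({
    "row_id": ["1003_2022-11-01", "1001_2022-11-01"],
    "microbusiness_density": [8.40, 3.50],
})
preds = pd.DataFrame({
    "row_id": ["1001_2022-11-01", "1003_2022-11-01"],
    "microbusiness_density": [3.46, 8.36],
})

# A left merge keeps the row order of `truth`; validate guards against
# duplicated row_ids on either side.
aligned = truth.merge(
    preds, on="row_id", how="left",
    suffixes=("_true", "_pred"), validate="one_to_one",
)
print(aligned)
```

With the merge, a `row_id` that is missing from the predictions shows up as `NaN` in the `_pred` column rather than silently shifting rows.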

The plain last-value forecast does well over a short horizon (SMAPE 1.79 on the 2022 months alone) but degrades over the full test period from November 2022 to June 2023 (SMAPE 5.08).

# Adjusting microbusiness density with 2021 census data

According to this notebook, microbusiness density for 2023 is calculated from the 2021 census figures.


Given that `microbusiness_density` is the number of `active` microbusinesses per 100 adults in the county, we can estimate the number of adults in the county from the `active` values.

> `microbusiness_density` = `active` / (the number of adults) * 100

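Furthermore, since the adult population in the denominator lags by two years, 2023 densities are based on the 2021 adult counts rather than the 2020 counts. The last known density, which was computed with the 2020 counts, can therefore be rescaled by `adult2020 / adult2021`, as in the notebook referenced above.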
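
As a worked example of the two relations above, the sketch below uses the Autauga County (cfips 1001) figures that appear in the tables of this notebook. It assumes the `active` count stays the same when the denominator changes, which is the assumption behind the rescaling.

```python
# Figures for Autauga County copied from the tables in this notebook.
active_2019_08 = 1249
density_2019_08 = 3.007682

# Invert density = active / adults * 100 to get the implied adult count.
implied_adults = active_2019_08 / density_2019_08 * 100

# Rescaling the last known density: same active count, new denominator.
#   density_new = active / adult2021 * 100
#               = density_old * adult2020 / adult2021
last_density = 3.463856
adult2020, adult2021 = 42496, 44438
adjusted_density = last_density * adult2020 / adult2021

print(round(implied_adults), round(adjusted_density, 5))
```

Since the adult count grew between the two census tables, the ratio is below 1 and the adjusted density is lower than the last known value.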


In [None]:
COLS = ['GEO_ID','NAME','S0101_C01_026E']
df2020 = pd.read_csv(
	os.path.join(data_dir, 'census', 'ACSST5Y2020.S0101-Data.csv'),usecols=COLS
	).iloc[1:]
df2020['S0101_C01_026E'] = df2020['S0101_C01_026E'].astype('int')
print(df2020.shape)
df2020.head()

(3221, 3)




| | GEO_ID | NAME | S0101_C01_026E |
|---|---|---|---|
| 1 | 0500000US01001 | Autauga County, Alabama | 42496 |
| 2 | 0500000US01003 | Baldwin County, Alabama | 171296 |
| 3 | 0500000US01005 | Barbour County, Alabama | 19804 |
| 4 | 0500000US01007 | Bibb County, Alabama | 17790 |
| 5 | 0500000US01009 | Blount County, Alabama | 44383 |


In [None]:
df2021 = pd.read_csv(
	os.path.join(data_dir, 'census', 'ACSST5Y2021.S0101-Data.csv'), usecols=COLS
	).iloc[1:]
df2021['S0101_C01_026E'] = df2021['S0101_C01_026E'].astype('int')
print(df2021.shape)
df2021.head()

(3221, 3)




| | GEO_ID | NAME | S0101_C01_026E |
|---|---|---|---|
| 1 | 0500000US01001 | Autauga County, Alabama | 44438 |
| 2 | 0500000US01003 | Baldwin County, Alabama | 178105 |
| 3 | 0500000US01005 | Barbour County, Alabama | 19995 |
| 4 | 0500000US01007 | Bibb County, Alabama | 17800 |
| 5 | 0500000US01009 | Blount County, Alabama | 45201 |


In [31]:
df2020['cfips'] = df2020.GEO_ID.apply(lambda x: int(x.split('US')[-1]) )
adult2020 = df2020.set_index('cfips').S0101_C01_026E.to_dict()

df2021['cfips'] = df2021.GEO_ID.apply(lambda x: int(x.split('US')[-1]) )
adult2021 = df2021.set_index('cfips').S0101_C01_026E.to_dict()
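
The census tables have 3,221 rows, more than the number of counties being forecast, so it is worth checking that every county of interest actually receives an adult count. The sketch below parses `GEO_ID` the same way and lists any county without a match. The third census row and the `wanted` list are made up for illustration.

```python
import pandas as pd

# Toy census table; the first two rows mirror values shown in this notebook,
# the third is a made-up extra area that no forecast row refers to.
census = pd.DataFrame({
    "GEO_ID": ["0500000US01001", "0500000US01003", "0500000US72001"],
    "S0101_C01_026E": [44438, 178105, 14000],
})

# The FIPS code is the part after "US"; int() drops the leading zero.
census["cfips"] = census["GEO_ID"].str.split("US").str[-1].astype(int)
adults = census.set_index("cfips")["S0101_C01_026E"]

# Hypothetical counties to forecast; 1005 is deliberately absent from the toy census.
wanted = pd.Series([1001, 1003, 1005], name="cfips")
mapped = wanted.map(adults)

# Counties without a census match come back as NaN.
missing = wanted[mapped.isna()].tolist()
print(missing)
```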

In [33]:
df_submission['adult2020'] = df_submission.cfips.map(adult2020)
df_submission['adult2021'] = df_submission.cfips.map(adult2021)
df_submission.head()

| | row_id | microbusiness_density | cfips | first_day_of_month | adult2020 | adult2021 | adjusted_microbusiness_density |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2022-11-01 | 3.463856 | 1001 | 2022-11-01 | 42496 | 44438 | 3.31248 |
| 1 | 1003_2022-11-01 | 8.359798 | 1003 | 2022-11-01 | 171296 | 178105 | 8.040201 |
| 2 | 1005_2022-11-01 | 1.232074 | 1005 | 2022-11-01 | 19804 | 19995 | 1.220305 |
| 3 | 1007_2022-11-01 | 1.28724 | 1007 | 2022-11-01 | 17790 | 17800 | 1.286517 |
| 4 | 1009_2022-11-01 | 1.831783 | 1009 | 2022-11-01 | 44383 | 45201 | 1.798633 |


In [34]:
df_submission['adjusted_microbusiness_density'] = \
	df_submission['microbusiness_density'] * df_submission['adult2020'] / df_submission['adult2021']
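
The cell above rescales every row, including the November and December 2022 rows. If, as the text suggests, only the 2023 months switch to the 2021 adult counts, a variant that leaves the 2022 rows untouched might look like the sketch below. This is an untested idea rather than the method used in this notebook; the column name `adjusted_2023_only` is hypothetical, and no score is claimed for it.

```python
import pandas as pd

# Toy stand-in for df_submission: one 2022 row and one 2023 row for the same county.
toy = pd.DataFrame({
    "first_day_of_month": pd.to_datetime(["2022-12-01", "2023-01-01"]),
    "microbusiness_density": [3.463856, 3.463856],
    "adult2020": [42496, 42496],
    "adult2021": [44438, 44438],
})

ratio = toy["adult2020"] / toy["adult2021"]
is_2023 = toy["first_day_of_month"].dt.year >= 2023

# Series.where keeps the original value where the condition is True
# (the 2022 rows) and takes the rescaled value elsewhere (the 2023 rows).
toy["adjusted_2023_only"] = toy["microbusiness_density"].where(
    ~is_2023, toy["microbusiness_density"] * ratio)
print(toy[["first_day_of_month", "adjusted_2023_only"]])
```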

In [35]:
df_submission['adjusted_microbusiness_density']

0         3.312480
1         8.040201
2         1.220305
3         1.286517
4         1.798633
           ...    
25075     2.871739
25076    26.266367
25077     3.975138
25078     3.150000
25079     1.818512
Name: adjusted_microbusiness_density, Length: 25080, dtype: float64

In [36]:
smape_value = SMAPE(df_test.microbusiness_density.values, df_submission.adjusted_microbusiness_density.values)
smape_value

np.float64(3.8673410842826885)