# New Last Value Baseline on New LB
In this notebook we create and submit a new "last value" baseline to the new public LB. The old "last value" baseline used October 2022 data to predict the old November 2022 public LB and achieved SMAPE = 1.093. Previously, October 2022 and November 2022 fell in the same year and used the same census data file. Now our last value (December 2022) and our prediction target (January 2023) fall in different years and use different census data files, so we adjust the last value for the change in census data. Without adjustment LB = 3.2776; with adjustment it is much better! Discussion about this notebook is [here][2].
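
For readers who want to score a baseline locally, here is a minimal sketch of SMAPE on the 0 to 200 scale. The zero-denominator convention and the toy values are assumptions for illustration, not taken from the competition data or the host's scoring code.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error on a 0-200 scale."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        # Assumed convention: a term contributes 0 when both values are 0.
        if denom > 0:
            total += 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)

# Toy values (hypothetical, not competition data).
actual = [3.0, 8.0, 1.25]
last_value = [3.0, 8.4, 1.20]
print('SMAPE:', round(smape(actual, last_value), 4))
```

A perfect forecast scores 0, and the score grows as forecasts drift away from the actual values.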

# Public LB is now January 2023 
On Monday, Feb 20th, Kaggle released additional train data in the file `revealed_test.csv`. This file contains data for November 2022 and December 2022. Discussion about this new file is [here][1]. Kaggle also updated the public LB to score January 2023 (and not November 2022 anymore). Old public LB scores are not being updated, but from now on when we submit to the public LB, our submission.csv will be scored against January 2023.

# Private LB is March, April, May 2023
The private LB will be the 3 months of March, April, and May 2023. Therefore our models need to use data from August 2019 through December 2022 inclusive (41 months) to predict the 3 future months of March, April, and May 2023. Discussion about this is [here][3].

[1]: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/389138
[2]: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/389215
[3]: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/388154#2148695
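
The month arithmetic above can be checked with a short sketch. The helper name `months_inclusive` is hypothetical and only used for this illustration.

```python
from datetime import date

def months_inclusive(start, end):
    """Count calendar months from start to end, including both endpoints."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

# Training window: August 2019 through December 2022.
train_months = months_inclusive(date(2019, 8, 1), date(2022, 12, 1))
print('Train months:', train_months)

# Forecast horizon in months ahead of the last train month (December 2022).
last_train = date(2022, 12, 1)
targets = {'Jan 2023': date(2023, 1, 1), 'Mar 2023': date(2023, 3, 1),
           'Apr 2023': date(2023, 4, 1), 'May 2023': date(2023, 5, 1)}
horizons = {name: months_inclusive(last_train, d) - 1 for name, d in targets.items()}
print(horizons)
```

The public LB target is 1 month ahead of the last train month, while the private LB targets are 3 to 5 months ahead.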

# Load New Train Data

In [1]:
import pandas as pd

new_data = pd.read_csv('/kaggle/input/godaddy-microbusiness-density-forecasting/revealed_test.csv')
print('New data shape:', new_data.shape )
new_data.head()

New data shape: (6270, 7)


Unnamed: 0,row_id,cfips,county,state,first_day_of_month,microbusiness_density,active
0,1001_2022-11-01,1001,Autauga County,Alabama,2022-11-01,3.442677,1463
1,1001_2022-12-01,1001,Autauga County,Alabama,2022-12-01,3.470915,1475
2,1003_2022-11-01,1003,Baldwin County,Alabama,2022-11-01,8.257636,14145
3,1003_2022-12-01,1003,Baldwin County,Alabama,2022-12-01,8.25063,14133
4,1005_2022-11-01,1005,Barbour County,Alabama,2022-11-01,1.247223,247


# New Last Value Prediction
We will generate a new "last value" prediction from each county's December 2022 microbusiness density.

In [2]:
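# Sort by month so that keep='last' reliably selects each county's most recent row (December 2022)
new_data = new_data.sort_values(['cfips','first_day_of_month']).drop_duplicates('cfips',keep='last')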
print('Generating predictions using',new_data.first_day_of_month.unique(),'data...')
pred = new_data.set_index('cfips').microbusiness_density.to_dict()

Generating predictions using ['2022-12-01'] data...


# Create Submission.CSV
We will use each county's December 2022 microbusiness density as the prediction for its January 2023 (public LB) and March, April, May 2023 (private LB) forecasts. The submission file also contains forecasts for November 2022, December 2022, February 2023, and June 2023, but Kaggle ignores the values for these 4 months.
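
As a rough illustration of how the submission rows divide into scored and ignored months, here is a small sketch. The helper names and the sample `row_id` values are hypothetical. The county count of 3135 is inferred from the shapes printed in this notebook, not stated by the host.

```python
PUBLIC_MONTHS = {'2023-01-01'}
PRIVATE_MONTHS = {'2023-03-01', '2023-04-01', '2023-05-01'}

def parse_row_id(row_id):
    """Split a row_id such as '1001_2022-11-01' into (cfips, month)."""
    cfips, month = row_id.split('_')
    return int(cfips), month

def scoring_role(row_id):
    """Return which leaderboard, if any, uses this row."""
    _, month = parse_row_id(row_id)
    if month in PUBLIC_MONTHS:
        return 'public'
    if month in PRIVATE_MONTHS:
        return 'private'
    return 'ignored'

for rid in ['1001_2022-11-01', '1001_2023-01-01', '1003_2023-03-01', '1003_2023-06-01']:
    print(rid, '->', scoring_role(rid))

# Shapes printed in this notebook: 6270 revealed rows and 25080 submission rows.
n_counties = 6270 // 2            # two revealed months per county
print(n_counties, n_counties * 8) # eight submission months per county
```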

In [3]:
sub = pd.read_csv('/kaggle/input/godaddy-microbusiness-density-forecasting/sample_submission.csv')
print('Sample submission shape:', sub.shape )
sub.head()

Sample submission shape: (25080, 2)


Unnamed: 0,row_id,microbusiness_density
0,1001_2022-11-01,3.817671
1,1003_2022-11-01,3.817671
2,1005_2022-11-01,3.817671
3,1007_2022-11-01,3.817671
4,1009_2022-11-01,3.817671


In [4]:
sub['cfips'] = sub.row_id.apply(lambda x: int(x.split('_')[0]))
sub.microbusiness_density = sub.cfips.map(pred)
#sub = sub.drop('cfips',axis=1)
#sub.to_csv('submission.csv',index=False)
sub.head()

Unnamed: 0,row_id,microbusiness_density,cfips
0,1001_2022-11-01,3.470915,1001
1,1003_2022-11-01,8.25063,1003
2,1005_2022-11-01,1.252272,1005
3,1007_2022-11-01,1.28724,1007
4,1009_2022-11-01,1.85206,1009


# Adjust Microbusiness Density using New Census Data
The formula for microbusiness density is `microbusiness_density = 100 * active / adult_population`. Therefore, even if the number of active microbusinesses stays the same, a change in adult population changes microbusiness density. All microbusiness density values in the year 2022 use the same census adult population file from 2020, but the values in the year 2023 use a different census adult population file from 2021. You can find the data online at `https://data.census.gov/table?q=S0101+by+county&tid=ACSST5Y2021.S0101`. I downloaded the files from 2020 and 2021 and uploaded them to a Kaggle dataset [here][1]. Don't forget to upvote the dataset if you find it helpful :-) The host has confirmed in the discussion [here][2] that these are the files they use.

We will now adjust the last value baseline for the ratio change in adult population between the 2020 and 2021 census files. Without adjustment, our "last value" baseline achieves public LB SMAPE = 3.2776. Let's see how much the public LB improves after adjustment. According to the metadata file `/kaggle/input/census-data-for-godaddy/ACSST5Y2020.S0101-Column-Metadata.csv`, the adult population (18 years and older) is column `S0101_C01_026E`. The host also confirms [here][2] that this is the column they use.

[1]: https://www.kaggle.com/datasets/cdeotte/census-data-for-godaddy
[2]: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/discussion/372604#2078710
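
One way to sanity check the formula is to invert it and recover the adult population implied by a row of `revealed_test.csv`. The sketch below uses the December 2022 values for Autauga and Baldwin counties printed earlier in this notebook. The helper name `implied_adult_population` is hypothetical, and the result is only approximate because the printed densities are rounded.

```python
def implied_adult_population(active, density):
    """Invert density = 100 * active / adult_population."""
    return 100.0 * active / density

# December 2022 rows shown in the revealed_test.csv preview above.
autauga = implied_adult_population(active=1475, density=3.470915)
baldwin = implied_adult_population(active=14133, density=8.25063)
print(round(autauga), round(baldwin))
```

Under these assumptions the implied values should agree with the 2020 census column `S0101_C01_026E` for the same counties, which is consistent with 2022 densities being computed from the 2020 census file.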

# Load Census 2020 and 2021

In [5]:
COLS = ['GEO_ID','NAME','S0101_C01_026E']
df2020 = pd.read_csv('/kaggle/input/census-data-for-godaddy/ACSST5Y2020.S0101-Data.csv',usecols=COLS)
df2020 = df2020.iloc[1:]
df2020['S0101_C01_026E'] = df2020['S0101_C01_026E'].astype('int')
print( df2020.shape )
df2020.head()

(3221, 3)



Unnamed: 0,GEO_ID,NAME,S0101_C01_026E
1,0500000US01001,"Autauga County, Alabama",42496
2,0500000US01003,"Baldwin County, Alabama",171296
3,0500000US01005,"Barbour County, Alabama",19804
4,0500000US01007,"Bibb County, Alabama",17790
5,0500000US01009,"Blount County, Alabama",44383


In [6]:
df2021 = pd.read_csv('/kaggle/input/census-data-for-godaddy/ACSST5Y2021.S0101-Data.csv',usecols=COLS)
df2021 = df2021.iloc[1:]
df2021['S0101_C01_026E'] = df2021['S0101_C01_026E'].astype('int')
print( df2021.shape )
df2021.head()

(3221, 3)



Unnamed: 0,GEO_ID,NAME,S0101_C01_026E
1,0500000US01001,"Autauga County, Alabama",44438
2,0500000US01003,"Baldwin County, Alabama",178105
3,0500000US01005,"Barbour County, Alabama",19995
4,0500000US01007,"Bibb County, Alabama",17800
5,0500000US01009,"Blount County, Alabama",45201


# Merge Census 2020 and 2021

In [7]:
df2020['cfips'] = df2020.GEO_ID.apply(lambda x: int(x.split('US')[-1]) )
adult2020 = df2020.set_index('cfips').S0101_C01_026E.to_dict()

df2021['cfips'] = df2021.GEO_ID.apply(lambda x: int(x.split('US')[-1]) )
adult2021 = df2021.set_index('cfips').S0101_C01_026E.to_dict()

sub['adult2020'] = sub.cfips.map(adult2020)
sub['adult2021'] = sub.cfips.map(adult2021)
sub.head()

Unnamed: 0,row_id,microbusiness_density,cfips,adult2020,adult2021
0,1001_2022-11-01,3.470915,1001,42496,44438
1,1003_2022-11-01,8.25063,1003,171296,178105
2,1005_2022-11-01,1.252272,1005,19804,19995
3,1007_2022-11-01,1.28724,1007,17790,17800
4,1009_2022-11-01,1.85206,1009,44383,45201


# Adjust Submission Microbusiness Density
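Since the formula is `microbusiness_density = 100 * active / adult_population`, holding `active` fixed while the adult population changes from the 2020 value to the 2021 value means we need to divide microbusiness density by the ratio `population_2021 / population_2020`, which is the same as multiplying by `adult2020 / adult2021`.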
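
A quick worked example for Autauga County (cfips 1001), using only numbers printed in this notebook, shows that rescaling the density by the population ratio agrees with recomputing it from the active count. The variable names are for illustration only.

```python
# Values printed earlier in this notebook for Autauga County (cfips 1001).
density_dec_2022 = 3.470915   # December 2022 microbusiness density
active_dec_2022 = 1475        # December 2022 active microbusinesses
adult2020 = 42496             # census 2020 adults, column S0101_C01_026E
adult2021 = 44438             # census 2021 adults, column S0101_C01_026E

# Route 1: rescale the last value by the population ratio.
from_ratio = density_dec_2022 * adult2020 / adult2021

# Route 2: recompute density directly from the active count and the new population.
from_active = 100.0 * active_dec_2022 / adult2021

print(round(from_ratio, 6), round(from_active, 6))
```

Both routes assume the number of active microbusinesses stays at its December 2022 level, which is what a "last value" baseline means here.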

In [8]:
sub.microbusiness_density = sub.microbusiness_density * sub.adult2020 / sub.adult2021
sub = sub.drop(['adult2020','adult2021','cfips'],axis=1)
sub.to_csv('submission.csv',index=False)
sub.head()

Unnamed: 0,row_id,microbusiness_density
0,1001_2022-11-01,3.319231
1,1003_2022-11-01,7.935207
2,1005_2022-11-01,1.24031
3,1007_2022-11-01,1.286517
4,1009_2022-11-01,1.818544
