### Exploring Cleaned Series Results
Here we inspect the cleaned series table.
We find the set of counties that have a target series, drop every series that belongs to a county without a target, then re-aggregate and write out the results.
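
As a rough sketch of that plan, here is the same filter-then-reaggregate flow on a tiny made-up table. The titles, ids, and the `toy_*` names are illustrative assumptions, not values from the real data.

```python
import pandas as pd

# Toy stand-in for the series table (values are made up for illustration).
toy_series = pd.DataFrame({
    "title": ["Target", "Feature A", "Feature A", "Feature B", "Target"],
    "county_id": [1, 1, 2, 2, 3],
})

# Counties that have at least one target series.
target_counties = set(toy_series.loc[toy_series["title"] == "Target", "county_id"])

# Keep only the series belonging to those counties.
toy_clipped = toy_series[toy_series["county_id"].isin(target_counties)]

# Re-aggregate: number of distinct counties covered by each title.
toy_counts = toy_clipped.groupby("title")["county_id"].nunique()
print(toy_counts)
```

County 2 has no target here, so its series are dropped and "Feature B" disappears from the aggregate.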

In [1]:
import pandas as pd
import numpy as np
import time
import requests
import json

In [2]:
feat_agg = pd.read_csv('aggregated_feature_info.csv')
cleaned_series_table = pd.read_csv('cleaned_series_table.csv')

In [3]:
print(feat_agg.shape)
print(cleaned_series_table.shape)

(153, 8)
(307778, 8)


In [4]:
print(feat_agg[feat_agg.title == 'All-Transactions House Price Index'])

                                 title  frequency    id  observation_end  \
71  All-Transactions House Price Index       2402  2402             2402   

    observation_start  seasonal_adjustment  units  county_id  
71               2402                 2402   2402       2402  


It looks like we only have the target variable for 2402 counties.<br>
Let's find the set of counties for which we have the target variable.

In [5]:
counties_with_target = cleaned_series_table[cleaned_series_table.title == 'All-Transactions House Price Index'].county_id.values
ndx, counts = np.unique(counties_with_target, return_counts=True)

In [6]:
dup_counties = ndx[counts > 1]
print(dup_counties[0])

27899
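
The same duplicate check could probably be done with pandas alone. This is only a sketch on made-up ids (the `toy_*` names are hypothetical), using `Series.value_counts` in place of `np.unique`.

```python
import pandas as pd

# Made-up county ids; 27899 appears twice on purpose.
toy_target_ids = pd.Series([27889, 27899, 27899, 27901])

# Count occurrences of each id and keep those seen more than once.
id_counts = toy_target_ids.value_counts()
toy_dups = sorted(id_counts[id_counts > 1].index.tolist())
print(toy_dups)
```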


In [7]:
print(cleaned_series_table[cleaned_series_table.county_id == 27899].head())

      frequency                id observation_end observation_start  \
53234    Annual   2020RATIO016011      2018-01-01        2010-01-01   
53235    Annual    ATNHPIUS15003A      2018-01-01        1975-01-01   
53236    Annual    ATNHPIUS16011A      2018-01-01        1986-01-01   
53237    Annual  B01002001E016011      2018-01-01        2009-01-01   
53238    Annual  B03002001E016011      2018-01-01        2009-01-01   

           seasonal_adjustment  \
53234  Not Seasonally Adjusted   
53235  Not Seasonally Adjusted   
53236  Not Seasonally Adjusted   
53237  Not Seasonally Adjusted   
53238  Not Seasonally Adjusted   

                                                   title           units  \
53234                                  Income Inequality           Ratio   
53235                 All-Transactions House Price Index  Index 2000=100   
53236                 All-Transactions House Price Index  Index 2000=100   
53237  Estimate, Median Age by Sex, Total Population ...    Year

The targets ATNHPIUS16011A and ATNHPIUS15003A have the same county id. Judging by the FIPS codes embedded in the series ids (15003 and 16011), ATNHPIUS15003A belongs to Honolulu County and ATNHPIUS16011A belongs to Bingham County.

In [8]:
county_table = pd.read_csv('county_table_dedup.csv')
print(county_table[county_table.name.str.match('.*Honolulu')])
print(county_table[county_table.name.str.match('.*Bingham')])

     county_id                      name  state_id
518      27889  Honolulu County/city, HI     27887
     county_id                name  state_id
528      27899  Bingham County, ID     27893


So the series ATNHPIUS15003A should have county id 27889 (Honolulu), not 27899 (Bingham).

In [9]:
# Row 53235 is ATNHPIUS15003A (see the output of In [7]); reassign it to Honolulu (27889)
cleaned_series_table.loc[53235, 'county_id'] = 27889

counties_with_target = cleaned_series_table[cleaned_series_table.title == 'All-Transactions House Price Index'].county_id.values
print(len(counties_with_target))
counties_with_target = set(counties_with_target)
print(len(counties_with_target))

2402
2402
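
A row label like 53235 depends on the row order of the CSV, so a fix keyed on the series id might be more robust if the table is ever regenerated. Below is a minimal sketch on a toy table. The two target ids and county ids mirror the ones discussed above, and everything else (the `toy` frame, the `mask` and `targets` names) is an illustrative assumption.

```python
import pandas as pd

# Toy table; both target series start out with the same (wrong) county id.
toy = pd.DataFrame({
    "id": ["ATNHPIUS15003A", "ATNHPIUS16011A", "B01002001E016011"],
    "county_id": [27899, 27899, 27899],
})

# Select the row by series id instead of by row label.
mask = toy["id"] == "ATNHPIUS15003A"
assert mask.sum() == 1, "expected exactly one matching series"
toy.loc[mask, "county_id"] = 27889

# After the fix, each target series should map to a distinct county.
targets = toy[toy["id"].str.startswith("ATNHPI")]
assert targets["county_id"].is_unique
```

The two assertions act as guards: the first catches a missing or duplicated series id, the second confirms the duplicate is gone.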


OK, so now we have the set of all counties with targets. Let's remove the series belonging to counties for which we do not have targets.

In [10]:
clipped_series_table = cleaned_series_table[cleaned_series_table['county_id'].isin(counties_with_target)]

In [11]:
print(clipped_series_table.shape)

(244506, 8)


In [12]:
clipped_series_table.to_csv('clipped_series_table.csv', index=False)

Let's remove the counties without targets from the county table as well.

In [13]:
clipped_county_table = county_table[county_table['county_id'].isin(counties_with_target)]

In [14]:
clipped_county_table.to_csv('clipped_county_table.csv', index=False)

#### Reaggregate

In [26]:
# Count the unique counties for each feature title
clipped_feat_info_agg = clipped_series_table.groupby('title')['county_id'].nunique()
print(clipped_feat_info_agg)

title
90% Confidence Interval Lower Bound of Estimate of Median Household Income                     2402
90% Confidence Interval Lower Bound of Estimate of People Age 0-17 in Poverty                  2402
90% Confidence Interval Lower Bound of Estimate of People of All Ages in Poverty               2402
90% Confidence Interval Lower Bound of Estimate of Percent of People Age 0-17 in Poverty       2402
90% Confidence Interval Lower Bound of Estimate of Percent of People of All Ages in Poverty    2402
                                                                                               ... 
SNAP Benefits Recipients                                                                       2402
Single-parent Households with Children as a Percentage of Households with Children             2401
Unemployed Persons                                                                             2394
Unemployment Rate                                                                             

In [27]:
feats_for_all_counties = clipped_feat_info_agg[clipped_feat_info_agg >= 2402]
print(feats_for_all_counties.shape)

(54,)


So we have 2402 counties (examples), each with at least 54 county-specific features. We can trade off the number of counties against additional features if there are ones that we particularly want.
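
To get a feel for that trade-off, one could count how many counties remain once a given set of features is required. The sketch below uses a made-up table, and `counties_with_all` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy clipped table (made-up values): which county has which feature title.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "county_id": [1, 2, 3, 1, 2, 1],
})

def counties_with_all(table, wanted_titles):
    """Return the county ids that have every title in wanted_titles."""
    subset = table[table["title"].isin(wanted_titles)]
    # Number of distinct wanted titles present for each county.
    per_county = subset.groupby("county_id")["title"].nunique()
    return set(per_county[per_county == len(set(wanted_titles))].index)

# Each extra required feature can only shrink the set of usable counties.
print(len(counties_with_all(toy, ["A"])))
print(len(counties_with_all(toy, ["A", "B"])))
print(len(counties_with_all(toy, ["A", "B", "C"])))
```

On the real data the same idea would apply to `clipped_series_table`, with the wanted titles chosen from the features that fall just short of full coverage.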

In [28]:
clipped_feat_info_agg.to_csv('agg_feat_info_clipped.csv')