### Clean Series Table
The purpose of this notebook is to clean up the series table, primarily by doing the following:
- Make sure there are series for each county
- Remove the county name from series titles
- Deduplicate features
- Count the number of counties for each feature and aggregate

In [1]:
import pandas as pd
import numpy as np
import time
import requests
import json

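def get_series_name(county_name, series_title):
    # Return the part of the title that precedes the county name.
    # str.find returns -1 when the name is absent; slicing with -1 would
    # silently drop the last character, so return the title unchanged instead.
    ndx = series_title.find(county_name)
    return series_title[:ndx] if ndx != -1 else series_title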

In [2]:
df_county_table = pd.read_csv('county_table_dedup.csv')
print(df_county_table.shape)
print(df_county_table.head())

df_series_table = pd.read_csv('series_table_dedup.csv')
print(df_series_table.shape)
print(df_series_table.head())

(3176, 3)
   county_id                name  state_id
0      27336  Autauga County, AL     27335
1      27337  Baldwin County, AL     27335
2      27338  Barbour County, AL     27335
3      27339     Bibb County, AL     27335
4      27340   Blount County, AL     27335
(307833, 8)
  frequency               id observation_end observation_start  \
0    Annual  2020RATIO001001      2018-01-01        2010-01-01   
1   Monthly    ACTLISCOU1001      2020-03-01        2016-07-01   
2   Monthly  ACTLISCOUMM1001      2020-03-01        2017-07-01   
3   Monthly  ACTLISCOUYY1001      2020-03-01        2017-07-01   
4   Monthly       ALAUTA1LFN      2020-02-01        1990-01-01   

       seasonal_adjustment                                              title  \
0  Not Seasonally Adjusted                                  Income Inequality   
1  Not Seasonally Adjusted            Housing Inventory: Active Listing Count   
2  Not Seasonally Adjusted  Housing Inventory: Active Listing Count Month-...   

In [3]:
# Make sure there are series for each county
all_counties = df_county_table.county_id.unique()
unique_counties = df_series_table.county_id.unique()
print(len(unique_counties))
print(len(all_counties))

3176
3176
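
Equal counts do not by themselves show that the two tables hold the same set of county ids. A minimal sketch of a stricter check, using small made-up frames (the ids below are illustrative, not from the real tables):

```python
import pandas as pd

# Toy stand-ins for the county and series tables
county_table = pd.DataFrame({'county_id': [1, 2, 3]})
series_table = pd.DataFrame({'county_id': [1, 1, 2, 4]})

county_ids = set(county_table['county_id'])
series_ids = set(series_table['county_id'])

# Counties with no series, and series that point at unknown counties
missing_series = county_ids - series_ids
unknown_counties = series_ids - county_ids
print(missing_series, unknown_counties)
```

Both toy tables have three unique ids, yet each set difference is non-empty, which is the case a count comparison would miss.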


In [4]:
# Get number of unique features in series table
unique_features = df_series_table.title.unique()
print(len(unique_features))

91215


91215 unique titles seems really high for 3176 counties.

In [5]:
print(unique_features)

['Income Inequality' 'Housing Inventory: Active Listing Count'
 'Housing Inventory: Active Listing Count Month-Over-Month' ...
 'Poverty Universe, Age 5-17 related for Winn Parish, L'
 'Poverty Universe, All Ages for Winn Parish, L'
 'White to Non-White Racial Dissimilarity Index for Winn Parish, L']


Looking at the titles, many of them end with 'for <county name>', so we should strip the county name from the title.

#### Remove county from column titles

In [6]:
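# Substring test: 'for' also matches inside words such as 'Enforcement',
# so this count is an upper bound on the titles containing ' for '
features_with_for = [('for' in feat) for feat in unique_features]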
print(np.sum(features_with_for))

88299


So after consolidating the 88299 features that have county names in them, we should be left with roughly 2900 features (91215 - 88299 = 2916).

In [7]:
def rename_for_features(title):
    # Cut the title at the first ' for ', which drops the county suffix
    ndx = title.find(' for ')
    if ndx != -1:
        return title[:ndx]
    return title

In [8]:
df_series_table_decounty = df_series_table.copy(deep=True)
df_series_table_decounty['title'] = df_series_table_decounty['title'].apply(rename_for_features)
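
As an alternative to the row-wise `apply`, the same cut can be written with pandas string methods. This is only a sketch on two sample titles taken from the output above, not a replacement for the cell:

```python
import pandas as pd

titles = pd.Series([
    'Poverty Universe, All Ages for Winn Parish, L',
    'Income Inequality',
])

# Keep everything before the first ' for '; titles without it are unchanged
cleaned = titles.str.split(' for ', n=1).str[0]
print(cleaned.tolist())
```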

In [9]:
print(len(df_series_table_decounty.title.unique()))

3018


Looks like we have removed many of the duplicates caused by the county names. <br>
Let's aggregate the features to see how many counties' worth of data we have for each feature.

In [10]:
initial_agg = df_series_table_decounty.groupby('title').count()
initial_agg_results = initial_agg.sort_values(by='county_id', axis=0, ascending=False)
agg_out = initial_agg_results[['county_id']]
print(agg_out)

                                                    county_id
title                                                        
Unemployment Rate                                        6202
Civilian Labor Force                                     6188
Unemployed Persons                                       6134
Employed Persons                                         6134
Estimate of Related Children Age 5-17 in Famili...       3150
...                                                       ...
Housing Inventory: Pending Ratio in Mclean Coun...          1
Housing Inventory: Pending Ratio in Mclennan Co...          1
Housing Inventory: Pending Ratio in Mcminn Coun...          1
Housing Inventory: Price Increased Count Month-...          1
Percent of Population Below the Poverty Level i...          1

[3018 rows x 1 columns]


There are only 3176 counties, so there can't be 6202 distinct counties with an unemployment rate. Something is fishy.
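
One thing to keep in mind when reading the table above: `count()` after `groupby('title')` tallies rows, not distinct counties, so it can exceed the number of counties whenever a county carries the same title more than once. A minimal sketch on made-up data (the ids are illustrative) that contrasts it with `nunique()`:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Unemployment Rate'] * 3,
    'county_id': [101, 101, 102],
    'frequency': ['Monthly', 'Annual', 'Monthly'],
})

# Rows per title versus distinct counties per title
rows_per_title = toy.groupby('title')['county_id'].count()
counties_per_title = toy.groupby('title')['county_id'].nunique()
print(rows_per_title['Unemployment Rate'], counties_per_title['Unemployment Rate'])
```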

In [11]:
sec_agg = df_series_table_decounty.groupby(['title','county_id']).size()
print(sec_agg[sec_agg > 1])

title                                                                             county_id
90% Confidence Interval Lower Bound of Estimate of Median Household Income        32078        2
90% Confidence Interval Lower Bound of Estimate of People Age 0-17 in Poverty     27423        2
                                                                                  32078        2
90% Confidence Interval Lower Bound of Estimate of People of All Ages in Poverty  27423        2
                                                                                  32078        2
                                                                                              ..
Unemployment Rate in LaPorte County, I                                            28087        2
Unemployment Rate in Lac qui Parle County, M                                      28704        2
Unemployment Rate in Lafourche Parish, L                                          28490        2
Unemployment Rate in Nantucket Coun

Let's pick one of these titles and see why there are two options

In [12]:
df_series_table_decounty[df_series_table_decounty.title == 'Unemployment Rate in LaPorte County, I']

Unnamed: 0,frequency,id,observation_end,observation_start,seasonal_adjustment,title,units,county_id
71316,Monthly,INLAPO0URN,2020-02-01,1990-01-01,Not Seasonally Adjusted,"Unemployment Rate in LaPorte County, I",Percent,28087
71317,Annual,LAUCN180910000000003A,2018-01-01,1990-01-01,Not Seasonally Adjusted,"Unemployment Rate in LaPorte County, I",Percent,28087


So some features are available at more than one frequency for the same county. We can leave those in for now.
Let's check the features that had very few counties.

In [13]:
print(agg_out[agg_out.county_id < 10])

                                                    county_id
title                                                        
Net Migration Flow                                          5
All Employees: Total Nonfarm                                4
Civilian Labor Force in DeKalb County, I                    4
Employed Persons in DeKalb County, I                        4
Unemployed Persons in DeKalb County, I                      4
...                                                       ...
Housing Inventory: Pending Ratio in Mclean Coun...          1
Housing Inventory: Pending Ratio in Mclennan Co...          1
Housing Inventory: Pending Ratio in Mcminn Coun...          1
Housing Inventory: Price Increased Count Month-...          1
Percent of Population Below the Poverty Level i...          1

[2898 rows x 1 columns]


Looks like there are still features with the county name in them, minus the last letter (from an error I made earlier when querying the data). Let's get the county ids for those and check them out: [28087, 28704, 28490, 28578]

In [14]:
print(df_county_table[df_county_table.county_id == 28087])

     county_id                 name  state_id
714      28087  La Porte County, IN     28041


Looks like the name in the county table has a space between "La" and "Porte", but the titles don't. Annoying.

In [15]:
df_county_table.at[714, 'name'] = 'LaPorte County, IN'

In [16]:
print(df_county_table[df_county_table.county_id == 28704])
print(df_county_table[df_county_table.county_id == 28490])

      county_id                      name  state_id
1258      28704  Lac Qui Parle County, MN     28667
      county_id                  name  state_id
3140      28490  LaFourche Parish, LA     28461


Looks like for these two the capitalization in the county table differs from the titles.

In [17]:
# Reframe county_table to index on county_id
county_index = df_county_table.copy(deep=True)
county_index.set_index('county_id', drop=True, inplace=True)

# Define function to rename titles
def rename_in_features(row):
    county_name = county_index.at[row.county_id, 'name']
    # Case insensitive matching
    lower_county_name = county_name.lower()
    lower_title = row.title.lower()
    # Titles are missing their last character, so drop it from the name too
    ndx = lower_title.find(lower_county_name[:-1])
    if ndx != -1:
        return row.title[:ndx]
    else:
        return row.title
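
The function above handles differences in case but not in spacing, such as 'La Porte' in the county table versus 'LaPorte' in the titles. A hedged sketch of a matcher that tolerates both, using a regular expression built from the county name. The helper name `strip_county` is hypothetical and not part of the notebook:

```python
import re

def strip_county(title, county_name):
    """Cut `title` at the county name, ignoring case and spacing differences."""
    # Drop the last character because the titles in this table are missing it
    tokens = county_name[:-1].split()
    # Allow zero or more whitespace characters between the words of the name
    pattern = r'\s*'.join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, title, flags=re.IGNORECASE)
    return title[:match.start()] if match else title

print(repr(strip_county('Unemployment Rate in LaPorte County, I', 'La Porte County, IN')))
print(repr(strip_county('Unemployment Rate in Lac qui Parle County, M', 'Lac Qui Parle County, MN')))
```

This would not cover names that differ in wording rather than spacing, so manual fixes would still be needed for some counties.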

In [18]:
df_series_table_decounty.title = df_series_table_decounty.apply(rename_in_features, axis=1)

In [19]:
print(len(df_series_table_decounty.title.unique()))

1734


OK, now we're down from 3018 unique features to 1734. Let's aggregate again to check whether any other feature titles appear only once.

In [20]:
third_agg = df_series_table_decounty.groupby('title').count()
print(third_agg[third_agg.county_id == 1].index)

Index(['All Employees: Administrative and Support and Waste Management and Remediation Services in Baltimore City, M',
       'Bachelor's Degree or Higher (5-year estimate) in DeKalb County, A',
       'Bachelor's Degree or Higher (5-year estimate) in DeKalb County, G',
       'Bachelor's Degree or Higher (5-year estimate) in DeKalb County, M',
       'Bachelor's Degree or Higher (5-year estimate) in DeKalb County, T',
       'Bachelor's Degree or Higher (5-year estimate) in DeSoto County, F',
       'Bachelor's Degree or Higher (5-year estimate) in DeSoto County, M',
       'Bachelor's Degree or Higher (5-year estimate) in DeWitt County, T',
       'Bachelor's Degree or Higher (5-year estimate) in DuPage County, I',
       'Bachelor's Degree or Higher (5-year estimate) in LaMoure County, N',
       ...
       'Single-parent Households with Children as a Percentage of Households with Children in Wrangell City and Borough, A',
       'Unemployment Rate in DeBaca County, N',
       'Unem

Let's look for the DeKalb, DeSoto, DeWitt, DuPage, and LaMoure counties.

In [21]:
print(county_index[county_index.name.str.match('De')])

                            name  state_id
county_id                                 
27360         De Kalb County, AL     27335
758             Desha County, AR     27445
27529       Del Norte County, CA     27521
27595           Delta County, CO     27580
27596          Denver County, CO     27580
27673         De Soto County, FL     27659
27770         Decatur County, GA     27727
27771         De Kalb County, GA     27727
27957         De Kalb County, IL     27938
27958         De Witt County, IL     27938
28056        Dearborn County, IN     28041
28057         Decatur County, IN     28041
28058         De Kalb County, IN     28041
28059        Delaware County, IN     28041
28161         Decatur County, IA     28134
28162        Delaware County, IA     28134
28163      Des Moines County, IA     28134
28254         Decatur County, KS     28234
28604           Delta County, MI     28583
596           De Soto County, MS     28755
28870         De Kalb County, MO     28838
1047       

Our counties of interest have codes 27360, 27771, 28870, 29823, 27673, 596, 29960. Let's fix those names so they match the titles and see what's left.

In [22]:
county_index.at[27360, 'name'] = 'DeKalb County, AL'
county_index.at[27771, 'name'] = 'DeKalb County, GA'
county_index.at[28870, 'name'] = 'DeKalb County, MO'
county_index.at[29823, 'name'] = 'DeKalb County, TN'
county_index.at[27673, 'name'] = 'DeSoto County, FL'
county_index.at[596, 'name'] = 'DeSoto County, MS'
county_index.at[29960, 'name'] = 'DeWitt County, TX'
county_index.at[29164, 'name'] = 'DeBaca County, NM'
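
The individual assignments above could also be driven from a single mapping, which keeps the corrections in one reviewable place. A minimal sketch on a three-row stand-in for `county_index`; the dictionary name `name_fixes` is an assumption, not something the notebook defines:

```python
import pandas as pd

# Small stand-in for county_index, indexed on county_id
county_index = pd.DataFrame(
    {'name': ['De Kalb County, AL', 'De Soto County, FL', 'Delta County, CO']},
    index=pd.Index([27360, 27673, 27595], name='county_id'),
)

# county_id -> spelling used in the series titles
name_fixes = {
    27360: 'DeKalb County, AL',
    27673: 'DeSoto County, FL',
}

for county_id, fixed_name in name_fixes.items():
    county_index.at[county_id, 'name'] = fixed_name
print(county_index)
```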

In [23]:
df_series_table_decounty.title = df_series_table_decounty.apply(rename_in_features, axis=1)

In [24]:
print(len(df_series_table_decounty.title.unique()))

1307
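
Because the rename-then-count step repeats after every batch of name fixes, the singleton check could be wrapped in a small helper. This is a sketch only; `singleton_titles` is a hypothetical name, and it counts rows per title, which matches the `groupby('title').count()` approach only if `county_id` is never null:

```python
import pandas as pd

def singleton_titles(df):
    """Return the titles that occur in exactly one row of `df`."""
    counts = df['title'].value_counts()
    return sorted(counts[counts == 1].index)

# Made-up titles for illustration
toy = pd.DataFrame({'title': ['A', 'A', 'B in Some County, X', 'C']})
print(singleton_titles(toy))
```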


In [25]:
four_agg = df_series_table_decounty.groupby('title').count()
print(four_agg[four_agg.county_id == 1].index)

Index(['All Employees: Administrative and Support and Waste Management and Remediation Services in Baltimore City, M',
       'Bachelor's Degree or Higher (5-year estimate) in DuPage County, I',
       'Bachelor's Degree or Higher (5-year estimate) in LaMoure County, N',
       'Bachelor's Degree or Higher (5-year estimate) in LaSalle County, I',
       'Bachelor's Degree or Higher (5-year estimate) in Shannon County, S',
       'Bachelor's Degree or Higher (5-year estimate) in Wade Hampton Census Area, A',
       'Burdened Households in Broomfield County, C',
       'Burdened Households in DuPage County, I',
       'Burdened Households in Honolulu County, H',
       'Burdened Households in Juneau City and Borough, A',
       ...
       'Single-parent Households with Children as a Percentage of Households with Children in Sitka City and Borough, A',
       'Single-parent Households with Children as a Percentage of Households with Children in Wade Hampton Census Area, AK (DISCONTINUED',

So there are still 1000 titles with specific county names in them. Next, let's look for DuPage, LaMoure, LaSalle, and Shannon.

In [26]:
print(county_index[county_index.name.str.match('Du.Page')])
print(county_index[county_index.name.str.match('La.Moure')])
print(county_index[county_index.name.str.match('La.Salle')])
print(county_index[county_index.name.str.match('Shannon')])

                         name  state_id
county_id                              
27960      Du Page County, IL     27938
                          name  state_id
county_id                               
29378      La Moure County, ND     29355
                          name  state_id
county_id                               
27988      La Salle County, IL     27938
30040      La Salle County, TX     29898
28491      La Salle Parish, LA     28461
                         name  state_id
county_id                              
966        Shannon County, MO     28838
29791      Shannon County, SD     29735


In [27]:
county_index.at[27960, 'name'] = 'DuPage County, IL'
county_index.at[29378, 'name'] = 'LaMoure County, ND'
county_index.at[27988, 'name'] = 'LaSalle County, IL'
county_index.at[28491, 'name'] = 'LaSalle Parish, LA'

Not sure what's going on with Shannon; the county name in the table seems to be the same as the county name in the title.

In [28]:
df_series_table_decounty.title = df_series_table_decounty.apply(rename_in_features, axis=1)
print(len(df_series_table_decounty.title.unique()))

1150


In [29]:
fifth_agg = df_series_table_decounty.groupby('title').count()
print(fifth_agg[fifth_agg.county_id == 1].index.values[:20])

['All Employees: Administrative and Support and Waste Management and Remediation Services in Baltimore City, M'
 "Bachelor's Degree or Higher (5-year estimate) in Shannon County, S"
 "Bachelor's Degree or Higher (5-year estimate) in Wade Hampton Census Area, A"
 'Burdened Households in Broomfield County, C'
 'Burdened Households in Honolulu County, H'
 'Burdened Households in Juneau City and Borough, A'
 'Burdened Households in Philadelphia County, P'
 'Burdened Households in San Francisco County, C'
 'Burdened Households in Shannon County, SD (DISCONTINUED'
 'Burdened Households in Sitka City and Borough, A'
 'Burdened Households in Wade Hampton Census Area, AK (DISCONTINUED'
 'Burdened Households in Wrangell City and Borough, A'
 'Civilian Labor Force in Shannon County, SD (DISCONTINUED'
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in Wade Hampton Census Area, AK (DISCONTINUED'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime 

855 left. Let's look at Broomfield, Honolulu, Juneau City and Borough, and Philadelphia County.

In [30]:
print(county_index[county_index.name.str.match('Broomfield')])
print(county_index[county_index.name.str.match('Honolulu')])
print(county_index[county_index.name.str.match('Juneau')])
print(county_index[county_index.name.str.match('Philadelphia')])

                                 name  state_id
county_id                                      
32077      Broomfield County/city, CO     27580
                               name  state_id
county_id                                    
27889      Honolulu County/city, HI     27887
                              name  state_id
county_id                                   
30461            Juneau County, WI     30432
27412      Juneau Borough/city, AK     27403
                                   name  state_id
county_id                                        
29664      Philadelphia County/city, PA     29613


In [31]:
county_index.at[32077, 'name'] = 'Broomfield County, CO'
county_index.at[27889, 'name'] = 'Honolulu County, HI'
county_index.at[27412, 'name'] = 'Juneau City and Borough, AK'
county_index.at[29664, 'name'] = 'Philadelphia County, PA'
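
Several of these names differ from the titles only by a 'County/city' suffix. A sketch of a bulk fix for that one pattern, shown on three names from the output above. It deliberately leaves 'Borough/city' alone, since those titles use a different wording ('City and Borough') and would still need manual mapping:

```python
import pandas as pd

names = pd.Series([
    'Broomfield County/city, CO',
    'Honolulu County/city, HI',
    'Juneau Borough/city, AK',
])

# Literal replacement; names without 'County/city' are left unchanged
fixed = names.str.replace('County/city', 'County', regex=False)
print(fixed.tolist())
```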

In [32]:
df_series_table_decounty.title = df_series_table_decounty.apply(rename_in_features, axis=1)
print(len(df_series_table_decounty.title.unique()))

1010


In [33]:
sixth_agg = df_series_table_decounty.groupby('title').count()
print(len(sixth_agg[sixth_agg.county_id == 1].index.values))
print(sixth_agg[sixth_agg.county_id == 1].index.values[:20])

715
['All Employees: Administrative and Support and Waste Management and Remediation Services in Baltimore City, M'
 "Bachelor's Degree or Higher (5-year estimate) in Shannon County, S"
 "Bachelor's Degree or Higher (5-year estimate) in Wade Hampton Census Area, A"
 'Burdened Households in San Francisco County, C'
 'Burdened Households in Shannon County, SD (DISCONTINUED'
 'Burdened Households in Sitka City and Borough, A'
 'Burdened Households in Wade Hampton Census Area, AK (DISCONTINUED'
 'Burdened Households in Wrangell City and Borough, A'
 'Civilian Labor Force in Shannon County, SD (DISCONTINUED'
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in Wade Hampton Census Area, AK (DISCONTINUED'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Shannon County, S'
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Pove

In [34]:
test = sixth_agg[sixth_agg.county_id == 1].index.values[3]
print(test)

Burdened Households in San Francisco County, C


In [35]:
print(county_index[county_index.name.str.match('San Francisco')])
print(county_index[county_index.name.str.match('Baltimore City')])
print(county_index[county_index.name.str.match('Sitka')])
print(county_index[county_index.name.str.match('Wrangell')])
print(df_series_table_decounty[df_series_table_decounty.title.str.match('All Employees: Admin')])

                                    name  state_id
county_id                                         
27559      San Francisco County/city, CA     27521
                         name  state_id
county_id                              
28547      Baltimore City, MD     28543
                             name  state_id
county_id                                  
27422      Sitka Borough/city, AK     27403
                                          name  state_id
county_id                                               
33518                Wrangell Borough/City, AK     27403
27427      Wrangell-Petersburg Census Area, AK     27403
       frequency                      id observation_end observation_start  \
107348   Monthly    SMU24925816056000001      2020-02-01        1990-01-01   
107533    Annual   SMU24925816056000001A      2019-01-01        1990-01-01   
107534   Monthly  SMU24925816056000001SA      2020-02-01        1990-01-01   

            seasonal_adjustment  \
107348  Not Seasona

In [36]:
print(df_series_table_decounty.iloc[107348])
print(county_index.loc[28546])

frequency                                                        Monthly
id                                                  SMU24925816056000001
observation_end                                               2020-02-01
observation_start                                             1990-01-01
seasonal_adjustment                              Not Seasonally Adjusted
title                  All Employees: Administrative and Support and ...
units                                               Thousands of Persons
county_id                                                          28546
Name: 107348, dtype: object
name        Baltimore County, MD
state_id                   28543
Name: 28546, dtype: object


Looks like the first series is just linked to the wrong county id: it points at Baltimore County (28546) instead of Baltimore City (28547).

In [37]:
df_series_table_decounty.at[107348, 'county_id'] = 28547
county_index.at[27559, 'name'] = 'San Francisco County, CA'
county_index.at[27422, 'name'] = 'Sitka City and Borough, AK'
county_index.at[33518, 'name'] = 'Wrangell City and Borough, AK'

In [38]:
df_series_table_decounty.title = df_series_table_decounty.apply(rename_in_features, axis=1)
print(len(df_series_table_decounty.title.unique()))

918


In [39]:
sev_agg = df_series_table_decounty.groupby('title').count()
print(len(sev_agg[sev_agg.county_id == 1].index.values))
print(sev_agg[sev_agg.county_id == 1].index.values[:20])

623
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 "Bachelor's Degree or Higher (5-year estimate) in Shannon County, S"
 "Bachelor's Degree or Higher (5-year estimate) in Wade Hampton Census Area, A"
 'Burdened Households in Shannon County, SD (DISCONTINUED'
 'Burdened Households in Wade Hampton Census Area, AK (DISCONTINUED'
 'Civilian Labor Force in Shannon County, SD (DISCONTINUED'
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in Wade Hampton Census Area, AK (DISCONTINUED'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Shannon County, S'
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in Dade County, FL (DISCONTINUED'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People Age 0-17 in Poverty in Shannon County, SD (DISCONTINU

In [40]:
print(df_series_table_decounty[df_series_table_decounty.title.str.match('.*Shannon')][['title','county_id']])

                                                    title  county_id
228481  Income Inequality in Shannon County, SD (DISCO...      33805
228482  Estimate, Median Age by Sex, Total Population ...      33805
228483  Population Estimate, Total, Not Hispanic or La...      33805
228484  Population Estimate, Total, Not Hispanic or La...      33805
228485  Population Estimate, Total, Not Hispanic or La...      33805
228486  Population Estimate, Total, Not Hispanic or La...      33805
228487  Population Estimate, Total, Hispanic or Latino...      33805
228490      SNAP Benefits Recipients in Shannon County, S      33805
228493  Rate of Preventable Hospital Admissions in Sha...      33805
228494  Burdened Households in Shannon County, SD (DIS...      33805
228497  Combined Violent and Property Crime Incidents ...      33805
228498  High School Graduate or Higher (5-year estimat...      33805
228499  Bachelor's Degree or Higher (5-year estimate) ...      33805
228507  Per Capita Personal Income

In [41]:
print(county_index.loc[33805])

name        Oglala Lakota County, SD
state_id                       29735
Name: 33805, dtype: object


So it looks like these are discontinued series that are labeled with Shannon County but associated with Oglala Lakota County. Let's remove the discontinued series.

In [42]:
df_series_table_no_disc = df_series_table_decounty[df_series_table_decounty.title.str.match('.*DISCONTINUED') == False]
print(df_series_table_no_disc.head())

  frequency               id observation_end observation_start  \
0    Annual  2020RATIO001001      2018-01-01        2010-01-01   
1   Monthly    ACTLISCOU1001      2020-03-01        2016-07-01   
2   Monthly  ACTLISCOUMM1001      2020-03-01        2017-07-01   
3   Monthly  ACTLISCOUYY1001      2020-03-01        2017-07-01   
4   Monthly       ALAUTA1LFN      2020-02-01        1990-01-01   

       seasonal_adjustment                                              title  \
0  Not Seasonally Adjusted                                  Income Inequality   
1  Not Seasonally Adjusted            Housing Inventory: Active Listing Count   
2  Not Seasonally Adjusted  Housing Inventory: Active Listing Count Month-...   
3  Not Seasonally Adjusted  Housing Inventory: Active Listing Count Year-O...   
4  Not Seasonally Adjusted                               Civilian Labor Force   

     units  county_id  
0    Ratio      27336  
1    Level      27336  
2  Percent      27336  
3  Percent      2733

In [49]:
df_series_table_no_disc.reset_index(inplace=True, drop=True)
print(df_series_table_no_disc.shape)
eigh_agg = df_series_table_no_disc.groupby('title').count()
print(len(eigh_agg[eigh_agg.county_id == 1].index.values))
print(eigh_agg[eigh_agg.county_id == 1].index.values[:20])

(307792, 8)
582
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 "Bachelor's Degree or Higher (5-year estimate) in Shannon County, S"
 "Bachelor's Degree or Higher (5-year estimate) in Wade Hampton Census Area, A"
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Shannon County, S'
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Albemarle + Charlottesville County, V'
 'Gross Domestic Product: All Industries in Aleutians East Borough County, A'
 'Gross Domestic Product: All Industries in Aleutians West Census Area County, A'
 'Gross Domestic Product: All Industries in Alexandria (Independent City) County, V'
 '

For the remaining 582 series, the county in the title seems to be mismatched with the county id, so they will have to be dealt with one by one.
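
To make a one-by-one review easier, the single-occurrence titles can be listed next to the name of the county they are linked to. A minimal sketch on made-up frames; the county names below are illustrative stand-ins, not values read from the county table:

```python
import pandas as pd

series = pd.DataFrame({
    'title': [
        'Personal Income in Wade Hampton Census Area, A',
        'Income Inequality',
        'Income Inequality',
    ],
    'county_id': [33897, 27336, 27337],
})
counties = pd.DataFrame(
    {'name': ['Kusilvak Census Area, AK', 'Autauga County, AL', 'Baldwin County, AL']},
    index=pd.Index([33897, 27336, 27337], name='county_id'),
)

# Titles that occur once, joined to the name of the county they point at
is_singleton = ~series['title'].duplicated(keep=False)
review = series[is_singleton].join(counties, on='county_id')
print(review[['title', 'county_id', 'name']])
```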

In [55]:
print(county_index[county_index.name.str.match('.*Wade')])
print(df_series_table_no_disc[df_series_table_no_disc.title.str.match('.*Wade')][['title','county_id']])

                                   name  state_id
county_id                                        
28747                 Wadena County, MN     28667
27426      Wade Hampton Census Area, AK     27403
                                                    title  county_id
300053  SNAP Benefits Recipients in Wade Hampton Censu...      33897
300056  Rate of Preventable Hospital Admissions in Wad...      33897
300063  Bachelor's Degree or Higher (5-year estimate) ...      33897
300070  Per Capita Personal Income in Wade Hampton Cen...      33897
300078     Personal Income in Wade Hampton Census Area, A      33897


We already have all 5 of those series under the correct county, so we can delete these rows.

In [56]:
df_series_table_no_disc.drop([300053, 300056, 300063, 300070, 300078], inplace=True)
print(df_series_table_no_disc.shape)

(307787, 8)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
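
This warning typically appears when a DataFrame that was produced by filtering another one is then modified in place. A common way to avoid it is to take an explicit copy when filtering and to reassign rather than use `inplace=True`. A sketch on toy data (the frame and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['A', 'B (DISCONTINUED)', 'C'],
    'county_id': [1, 2, 3],
})

# .copy() makes the filtered frame independent of `df`
kept = df[~df['title'].str.contains('DISCONTINUED')].copy()

# Reassign the result instead of modifying in place
kept = kept.drop(index=[0]).reset_index(drop=True)
print(kept)
```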


In [58]:
print(df_series_table_no_disc[df_series_table_no_disc.title.str.match('.*Shannon')][['title','county_id']])

                                                    title  county_id
228480      SNAP Benefits Recipients in Shannon County, S      33805
228483  Rate of Preventable Hospital Admissions in Sha...      33805
228486  Combined Violent and Property Crime Incidents ...      33805
228487  High School Graduate or Higher (5-year estimat...      33805
228488  Bachelor's Degree or Higher (5-year estimate) ...      33805


In [60]:
df_series_table_no_disc.drop([228480, 228483, 228486, 228487, 228488], inplace=True)
print(df_series_table_no_disc.shape)

KeyError: '[228480 228483 228486 228487 228488] not found in axis'
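
A `KeyError` like this is what `drop` raises when the requested labels are not in the index, for example when a drop cell is run again after the rows are already gone (a guess; the notebook does not record the cause). If that is the situation, passing `errors='ignore'` makes the call safe to repeat. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C']}, index=[10, 11, 12])

# errors='ignore' skips labels that are not present in the index
df = df.drop(index=[11], errors='ignore')
df = df.drop(index=[11], errors='ignore')  # second call is a no-op
print(df.index.tolist())
```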

In [99]:
county_index.at[33806, 'name'] = 'Albemarle + Charlottesville County, VA'
county_index.at[27404, 'name'] = 'Aleutians East Borough County, AK'
county_index.at[30202, 'name'] = 'Alexandria (Independent City) County, VA'
county_index.at[33807, 'name'] = 'Alleghany + Covington County, VA'
county_index.at[27406, 'name'] = 'Anchorage Municipality, AK'
county_index.at[28547, 'name'] = 'Baltimore (Independent City) County, MD'
county_index.at[27407, 'name'] = 'Bethel Census Area County, AK'
county_index.at[27408, 'name'] = 'Bristol Bay Borough County, AK'
county_index.at[33808, 'name'] = 'Campbell + Lynchburg County, VA'

In [101]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

787


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [102]:
nine_agg = df_series_table_no_disc.groupby('title').count()
print(len(nine_agg[nine_agg.county_id == 1].index.values))
print(nine_agg[nine_agg.county_id == 1].index.values[:20])

492
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Aleutians West Census Area County, A'
 'Gross Domestic Product: All Industries in Anchorage Municipality County, A'
 'Gross Domestic Product: All Industries in Carroll + Galax County, V'
 'Gross Domestic Product: All Industries in Carson City (Independent City) County, N'
 'Gross Domestic Product: All Industries in Chesapeake (Independent City) County, V'
 'Gross Domestic Product: All Industries in Denali Borough County, A'
 'Gross Domestic Product: All Industries in Dillingham Census Area County, A'
 'Gross Domestic Product: All Industr

In [135]:
county_index.at[27405, 'name'] = 'Aleutians West Census Area County, AK'
county_index.at[27406, 'name'] = 'Anchorage Municipality County, AK'
county_index.at[33809, 'name'] = 'Carroll + Galax County, VA'
county_index.at[29107, 'name'] = 'Carson City (Independent City) County, NV'
county_index.at[30225, 'name'] = 'Chesapeake (Independent City) County, VA'
county_index.at[32079, 'name'] = 'Denali Borough County, AK'
county_index.at[27409, 'name'] = 'Dillingham Census Area County, AK'
county_index.at[27410, 'name'] = 'Fairbanks North Star Borough County, AK'
county_index.at[33810, 'name'] = 'Frederick + Winchester County, VA'

In [137]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

715


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [138]:
ten_agg = df_series_table_no_disc.groupby('title').count()
print(len(ten_agg[ten_agg.county_id == 1].index.values))
print(ten_agg[ten_agg.county_id == 1].index.values[:20])

420
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Greensville + Emporia County, V'
 'Gross Domestic Product: All Industries in Haines Borough County, A'
 'Gross Domestic Product: All Industries in Hampton (Independent City) County, V'
 'Gross Domestic Product: All Industries in Henry + Martinsville County, V'
 'Gross Domestic Product: All Industries in Hoonah-Angoon Census Area County, A'
 'Gross Domestic Product: All Industries in James City + Williamsburg County, V'
 'Gross Domestic Product: All Industries in Juneau City and Borough County, A'
 'Gross Domestic Product: All Industries 
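
The same "titles that occur only once" check is repeated after every round of corrections with a new variable name each time. A small helper could replace those cells. The sketch below mirrors the `groupby('title').count()` logic on toy data; `singleton_titles` is a hypothetical name.

```python
import pandas as pd


def singleton_titles(df):
    # Count rows per title, as groupby('title').count() does for county_id,
    # and keep the titles that appear exactly once.
    counts = df.groupby('title')['county_id'].count()
    return counts[counts == 1].index.tolist()


toy = pd.DataFrame({
    'title': [
        'Civilian Labor Force in ',
        'Civilian Labor Force in ',
        'Gross Domestic Product: All Industries in Denali Borough County, A',
    ],
    'county_id': [27336, 27337, 32079],
})

leftovers = singleton_titles(toy)
print(len(leftovers))
print(leftovers[:20])
```

Note that `count` tallies rows, not distinct counties; `nunique` would be the choice if a county can carry the same title more than once.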

In [174]:
county_index.at[33811, 'name'] = 'Greensville + Emporia County, VA'
county_index.at[27411, 'name'] = 'Haines Borough County, AK'
county_index.at[30257, 'name'] = 'Hampton (Independent City) County, VA'
county_index.at[33812, 'name'] = 'Henry + Martinsville County, VA'
county_index.at[33517, 'name'] = 'Hoonah-Angoon Census Area County, AK'
county_index.at[33813, 'name'] = 'James City + Williamsburg County, VA'
county_index.at[27412, 'name'] = 'Juneau City and Borough County, AK'
county_index.at[27413, 'name'] = 'Kenai Peninsula Borough County, AK'
county_index.at[27414, 'name'] = 'Ketchikan Gateway Borough County, AK'
county_index.at[27415, 'name'] = 'Kodiak Island Borough County, AK'
county_index.at[33897, 'name'] = 'Kusilvak Census Area County, AK'
county_index.at[27416, 'name'] = 'Lake and Peninsula Borough County, AK'
county_index.at[27417, 'name'] = 'Matanuska-Susitna Borough County, AK'
county_index.at[33804, 'name'] = 'Maui + Kalawao County, HI'

In [175]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

603


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [176]:
elev_agg = df_series_table_no_disc.groupby('title').count()
print(len(elev_agg[elev_agg.county_id == 1].index.values))
print(elev_agg[elev_agg.county_id == 1].index.values[:20])

308
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Montgomery + Radford County, V'
 'Gross Domestic Product: All Industries in Newport News (Independent City) County, V'
 'Gross Domestic Product: All Industries in Nome Census Area County, A'
 'Gross Domestic Product: All Industries in Norfolk (Independent City) County, V'
 'Gross Domestic Product: All Industries in North Slope Borough County, A'
 'Gross Domestic Product: All Industries in Northwest Arctic Borough County, A'
 'Gross Domestic Product: All Industries in Oglala Lakota County, S'
 'Gross Domestic Product: All Industries in Pe

In [194]:
df_series_table_no_disc.drop([229008, 229023], inplace=True)
print(df_series_table_no_disc.shape)

(307780, 8)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [214]:
county_index.at[33814, 'name'] = 'Montgomery + Radford County, VA'
county_index.at[30286, 'name'] = 'Newport News (Independent City) County, VA'
county_index.at[27418, 'name'] = 'Nome Census Area County, AK'
county_index.at[30287, 'name'] = 'Norfolk (Independent City) County, VA'
county_index.at[27419, 'name'] = 'North Slope Borough County, AK'
county_index.at[27420, 'name'] = 'Northwest Arctic Borough County, AK'
county_index.at[33519, 'name'] = 'Petersburg Borough County, AK'
county_index.at[33815, 'name'] = 'Pittsylvania + Danville County, VA'
county_index.at[30298, 'name'] = 'Portsmouth (Independent City) County, VA'
county_index.at[33816, 'name'] = 'Prince George + Hopewell County, VA'
county_index.at[33520, 'name'] = 'Prince of Wales-Hyder Census Area County, AK'
county_index.at[30307, 'name'] = 'Richmond (Independent City) County, VA'
county_index.at[30309, 'name'] = 'Roanoke (Independent City) County, VA'
county_index.at[33817, 'name'] = 'Roanoke + Salem County, VA'
df_series_table_no_disc.at[229001, 'county_id'] = 33805
df_series_table_no_disc.at[229002, 'county_id'] = 33805
df_series_table_no_disc.at[229003, 'county_id'] = 33805
df_series_table_no_disc.at[229004, 'county_id'] = 33805
df_series_table_no_disc.at[229038, 'county_id'] = 33805
df_series_table_no_disc.at[229039, 'county_id'] = 33805
df_series_table_no_disc.at[229040, 'county_id'] = 33805
df_series_table_no_disc.at[229041, 'county_id'] = 33805
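
The eight row-by-row `county_id` assignments above could be written as one `.loc` call with a list of row labels. A minimal sketch on a toy frame, using a subset of the same labels:

```python
import pandas as pd

# Toy frame whose index reuses a few of the row labels from the cell above.
toy = pd.DataFrame(
    {'county_id': [0, 0, 0, 0]},
    index=[229001, 229002, 229038, 229050],
)

rows_to_fix = [229001, 229002, 229038]

# Unlike setting with .at, .loc with a list raises KeyError if a label is missing.
toy.loc[rows_to_fix, 'county_id'] = 33805

print(toy)
```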

In [215]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

481


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [219]:
twelv_agg = df_series_table_no_disc.groupby('title').count()
print(len(twelv_agg[twelv_agg.county_id == 1].index.values))
print(twelv_agg[twelv_agg.county_id == 1].index.values[:20])

186
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Rockingham + Harrisonburg County, V'
 'Gross Domestic Product: All Industries in Sitka City and Borough County, A'
 'Gross Domestic Product: All Industries in Skagway Municipality County, A'
 'Gross Domestic Product: All Industries in Southampton + Franklin County, V'
 'Gross Domestic Product: All Industries in Southeast Fairbanks Census Area County, A'
 'Gross Domestic Product: All Industries in Spotsylvania + Fredericksburg County, V'
 'Gross Domestic Product: All Industries in St. Louis (Independent City) County, M'
 'Gross Domestic P

In [251]:
county_index.at[33818, 'name'] = 'Rockingham + Harrisonburg County, VA'
county_index.at[27422, 'name'] = 'Sitka City and Borough County, AK'
county_index.at[33516, 'name'] = 'Skagway Municipality County, AK'
county_index.at[33819, 'name'] = 'Southampton + Franklin County, VA'
county_index.at[27424, 'name'] = 'Southeast Fairbanks Census Area County, AK'
county_index.at[33820, 'name'] = 'Spotsylvania + Fredericksburg County, VA'
county_index.at[28941, 'name'] = 'St. Louis (Independent City) County, MO'
county_index.at[30322, 'name'] = 'Suffolk (Independent City) County, VA'
county_index.at[27425, 'name'] = 'Valdez-Cordova Census Area County, AK'
county_index.at[30326, 'name'] = 'Virginia Beach (Independent City) County, VA'
county_index.at[33821, 'name'] = 'Washington + Bristol County, VA'
county_index.at[33822, 'name'] = 'Wise + Norton County, VA'
county_index.at[33518, 'name'] = 'Wrangell City and Borough County, AK'
county_index.at[33823, 'name'] = 'York + Poquoson County, VA'

In [252]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

369


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [253]:
thirt_agg = df_series_table_no_disc.groupby('title').count()
print(len(thirt_agg[thirt_agg.county_id == 1].index.values))
print(thirt_agg[thirt_agg.county_id == 1].index.values[:20])

74
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Estimate of People Age 0-17 in Poverty in DeKalb County, I'
 'Estimate of People of All Ages in Poverty in DeKalb County, I'
 'Gross Domestic Product: All Industries in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Government and Government Enterprises in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Private Goods-Producing Industries in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Private Services-Providing Industries in Yukon-Koyukuk Census Area County, A'
 'High School Graduate or Higher (5-year estimate) in the '
 'Housing Inventory: Active Listing Count Month-Over-Month in DeKalb County, I'
 'Housing Inventory: Active Listing Count Year-Over-Year i

In [262]:
county_index.at[27957, 'name'] = 'DeKalb County, IL'
county_index.at[28058, 'name'] = 'DeKalb County, IN'

In [263]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

283


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [264]:
fourt_agg = df_series_table_no_disc.groupby('title').count()
print(len(fourt_agg[fourt_agg.county_id == 1].index.values))
print(fourt_agg[fourt_agg.county_id == 1].index.values[:20])

35
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'Gross Domestic Product: All Industries in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Government and Government Enterprises in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Private Goods-Producing Industries in Yukon-Koyukuk Census Area County, A'
 'Gross Domestic Product: Private Services-Providing Industries in Yukon-Koyukuk Census Area County, A'
 'High School Graduate or Higher (5-year estimate) in the '
 'Per Capita Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Per Capita Personal Income in the '
 'Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Personal Income in the '
 'Real Gross Domestic Product: All Industries in Dinwiddie, Colonial 

In [271]:
county_index.at[27428, 'name'] = 'Yukon-Koyukuk Census Area County, AK'
county_index.at[33925, 'name'] = 'Dinwiddie, Colonial Heights + Petersburg Virginia County, VA'
county_index.at[33929, 'name'] = 'Prince William, Manassas + Manassas Park Virginia County, VA'

In [272]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

267


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [273]:
fiftee_agg = df_series_table_no_disc.groupby('title').count()
print(len(fiftee_agg[fiftee_agg.county_id == 1].index.values))
print(fiftee_agg[fiftee_agg.county_id == 1].index.values[:20])

19
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in St. Louis [Independent City], M'
 'Civilian Labor Force in the '
 'Combined Violent and Property Crime Incidents Known to Law Enforcement in Trousdale County,T'
 'High School Graduate or Higher (5-year estimate) in the '
 'Per Capita Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Per Capita Personal Income in the '
 'Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Personal Income in the ' 'Resident Population in Denver County/city, C'
 'Resident Population in Elliot County, K'
 'Resident Population in McCreary, K'
 'Resident Population in Nantucket County/town, M'
 'Resident Population in Prince of Wales-Hyder Census Area, A'
 'Resident Population in Skagway Municipality, A'
 'Resident Population in Wrangell City and Borough, A'
 'SNAP Benefits Recipients in Petersburg Borough, A'
 'Unemployment Rate in Kusilvak Census Area, A'
 'Unemploym

In [295]:
county_index.at[33519, 'name'] = 'Petersburg Borough, AK'
county_index.at[28941, 'name'] = 'St. Louis [Independent City], MO'
county_index.at[29887, 'name'] = 'Trousdale County,TN'
county_index.at[27596, 'name'] = 'Denver County/city, CO'
county_index.at[28372, 'name'] = 'Elliot County, KY'
county_index.at[28420, 'name'] = 'McCreary, KY'
county_index.at[28578, 'name'] = 'Nantucket County/town, MA'
df_series_table_no_disc.at[300706, 'county_id'] = 33520

In [296]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

251


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [304]:
sixtee_agg = df_series_table_no_disc.groupby('title').count()
print(len(sixtee_agg[sixtee_agg.county_id == 1].index.values))
print(sixtee_agg[sixtee_agg.county_id == 1].index.values[:20])

11
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in the '
 'High School Graduate or Higher (5-year estimate) in the '
 'Per Capita Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Per Capita Personal Income in the '
 'Personal Income in Skagway-Yakutat-Angoon Census Area, A'
 'Personal Income in the '
 'Resident Population in Prince of Wales-Hyder Census Area, A'
 'Resident Population in Skagway Municipality, A'
 'Resident Population in Wrangell City and Borough, A'
 'Unemployment Rate in Kusilvak Census Area, A']


In [320]:
df_series_table_no_disc.at[301120, 'county_id'] = 33897
county_index.at[33897, 'name'] = 'Kusilvak Census Area, AK'
df_series_table_no_disc.at[301227, 'county_id'] = 33518
county_index.at[33518, 'name'] = 'Wrangell City and Borough, AK'
df_series_table_no_disc.at[300814, 'county_id'] = 33516
county_index.at[33516, 'name'] = 'Skagway Municipality, AK'
df_series_table_no_disc.drop([300821, 300839], inplace=True)
county_index.at[33520, 'name'] = 'Prince of Wales-Hyder Census Area, AK'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [321]:
df_series_table_no_disc.title = df_series_table_no_disc.apply(rename_in_features, axis=1)
print(len(df_series_table_no_disc.title.unique()))

245


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [322]:
sevtee_agg = df_series_table_no_disc.groupby('title').count()
print(len(sevtee_agg[sevtee_agg.county_id == 1].index.values))
print(sevtee_agg[sevtee_agg.county_id == 1].index.values[:20])

5
['All Employees: Administrative and Support and Waste Management and Remediation Services in '
 'Civilian Labor Force in the '
 'High School Graduate or Higher (5-year estimate) in the '
 'Per Capita Personal Income in the ' 'Personal Income in the ']


In [336]:
df_series_table_no_disc.loc[df_series_table_no_disc.title == 'Civilian Labor Force in the ', 'title'] = 'Civilian Labor Force'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [349]:
df_series_table_no_disc.loc[df_series_table_no_disc.title == 'High School Graduate or Higher (5-year estimate) in the ', 'title'] = 'High School Graduate or Higher (5-year estimate)'
df_series_table_no_disc.loc[df_series_table_no_disc.title == 'Personal Income in the ', 'title'] = 'Personal Income'
df_series_table_no_disc.loc[df_series_table_no_disc.title == 'Per Capita Personal Income in the ', 'title'] = 'Per Capita Personal Income'
df_series_table_no_disc.loc[df_series_table_no_disc.title == 'All Employees: Administrative and Support and Waste Management and Remediation Services in ', 'title'] = 'All Employees: Administrative and Support and Waste Management and Remediation Services'
df_series_table_no_disc.loc[df_series_table_no_disc.title.str.match('.*Unemployment Rate in the'),'title'] = 'Unemployment Rate'
df_series_table_no_disc.loc[df_series_table_no_disc.title.str.match('.*Unemployment Rate in '),'title'] = 'Unemployment Rate'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
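
The exact-match renames above could also be expressed as one mapping passed to `Series.replace`, which swaps whole values rather than substrings. A minimal sketch on toy titles; `title_fixes` is a hypothetical name and lists only some of the renames.

```python
import pandas as pd

titles = pd.Series([
    'Civilian Labor Force in the ',
    'Personal Income in the ',
    'Per Capita Personal Income in the ',
    'Employed Persons',
])

# Old title -> cleaned title; only exact matches are replaced.
title_fixes = {
    'Civilian Labor Force in the ': 'Civilian Labor Force',
    'Personal Income in the ': 'Personal Income',
    'Per Capita Personal Income in the ': 'Per Capita Personal Income',
}

cleaned = titles.replace(title_fixes)
print(cleaned.tolist())
```

The two `Unemployment Rate` lines use pattern matching rather than exact equality, so they would stay as separate `.loc` assignments.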


Now let's check out the remaining features with low counts

In [386]:
df_series_table_no_disc.groupby('title').count().sort_values('frequency')

title,frequency,id,observation_end,observation_start,seasonal_adjustment,units,county_id
"All Employees: Transportation and Utilities: Transportation, Warehousing, and Utilities",3,3,3,3,3,3,3
All Employees: Health Care: Hospitals,3,3,3,3,3,3,3
All Employees: Information,3,3,3,3,3,3,3
All Employees: Leisure and Hospitality,3,3,3,3,3,3,3
All Employees: Manufacturing,3,3,3,3,3,3,3
...,...,...,...,...,...,...,...
90% Confidence Interval Upper Bound of Estimate of Percent of Related Children Age 5-17 in Families in Poverty,3150,3150,3150,3150,3150,3150,3150
Employed Persons,6134,6134,6134,6134,6134,6134,6134
Unemployed Persons,6134,6134,6134,6134,6134,6134,6134
Civilian Labor Force,6189,6189,6189,6189,6189,6189,6189


In [398]:
def remove_in_at_end(row):
    # Drop a dangling ' in ' left behind once the county name has been stripped.
    if row.title.endswith(' in '):
        return row.title[:-4]
    return row.title

In [399]:
df_series_table_no_disc = df_series_table_no_disc.copy(deep=True)
df_series_table_no_disc.title = df_series_table_no_disc.apply(remove_in_at_end, axis=1)
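
A vectorized alternative to applying `remove_in_at_end` row by row is a regular-expression replace on the whole column. The sketch below, on toy titles, should give the same result while avoiding a Python-level call per row.

```python
import pandas as pd

titles = pd.Series([
    'Employed Persons in ',
    'Unemployment Rate',
    'Homeownership Rate in the region',
])

# ' in $' matches the four characters ' in ' only at the very end of a title.
stripped = titles.str.replace(r' in $', '', regex=True)
print(stripped.tolist())
```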

In [421]:
corrected_series_table = df_series_table_no_disc
# corrected_series_table.to_csv('cleaned_series_table.csv', index=False)

#### Generate Aggregated Table

In [412]:
print(len(df_series_table_no_disc.title.unique()))
output_agg = df_series_table_no_disc.groupby('title').count().sort_values('frequency')

153


In [414]:
output_agg.to_csv('aggregated_features.csv')
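
In the In [386] output, `count()` reports the same number in every column, which is what happens when no column has missing values. If only the per-title tally is needed, `size()` gives a single column. A minimal sketch on toy data; the column name `n_series` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Employed Persons', 'Employed Persons', 'Income Inequality'],
    'county_id': [27336, 27337, 27336],
})

# One row per title with the number of series carrying that title, ascending.
agg = toy.groupby('title').size().rename('n_series').sort_values()
print(agg)
```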