In [1]:
import pandas as pd
import numpy as np
import os, fnmatch
import matplotlib as mplt
from scipy import stats 


## Census Data

### The United States Census Bureau (USCB) conducts the decennial 'census' required under the United States Constitution, but also conducts many other information gathering and dissemination programs.  Indeed, nearly $100 billion of the federal budget is allocated among local geographical units, such as census tracts, based upon those surveys and estimates.  One of the most important such surveys is the American Community Survey (ACS), which is conducted annually and analyzed at different geographic levels both on an annual basis and on a five-year basis that pools the annual surveys taken over the period.  The five-year ACS Surveys take the analysis down to the level of census tracts and, in most cases, block groups:  these surveys form the nucleus of the data for our analysis here.

###  In this research, I am taking data from the five-year ACS Surveys for 2012 (i.e., providing data gathered from 2008 through 2012) through the 2017 five-year survey that was released in stages starting in November, 2018:  the 2017 ACS survey is the most recent comprehensive data for our purposes.  Of the thousands of tables included in the survey, I have winnowed the selection down to what I believe to be six particularly relevant features for our study of households, in each case evaluated at the tract level.  There are roughly 220 census tracts covering the island of Manhattan, of roughly equal population density: in midtown and downtown, each tract typically includes 15-20 blocks. 

### In its tables, the USCB distinguishes between family-based data and household-based data.  The latter looks particularly to the individuals living in a single housing unit, without necessary regard to any relationship among them.  More detail regarding this concept and its consequences can be found in the survey information published by the USCB.  In general, for some survey purposes that will be relevant here, the attributes of a "householder" are studied: this term refers to a leader in the housing unit, without regard to whether that unit is a single family home or, as will be nearly universally the situation in the cases of interest here, an owned or leased condominium or coop unit. We will look in detail at that data, with regard to the number of occupants [Table B11001], the highest educational level achieved by the householder [Table B15003], the twelve-month household income [Table B19001], the receipt of any passive income (interest, dividends, rent) by the household [Table B19054], the age of the structure in which the household unit is situated [Table B25034], and the monthly rent asked of the household unit [Table B25061].



### Following my review of literature and experience in financing of economic development, my hypothesis is that successful rapid development of a restaurant-friendly neighborhood may be correlated with upward momentum in the education, income, rents and investments of an upwardly mobile residential community.  The study of the census data is intended to provide a resource of features for our cluster analysis.  In this module, we will upload the data sets into Pandas dataframes, clean and normalize them, and then use a linear regression method to evaluate the upward (or downward) trend (or what I refer to as 'momentum') for each feature on a census tract basis.

## Uploading the Data

#### We will be uploading five tables from each of six five-year surveys, referred to here as 'ACSnn_5YR' (referring to the American Community Survey five-year survey for the year 20nn), into Pandas dataframes.  The raw data is located in the 'Data_for_NYCHR' directory that is part of the project repository.  The nomenclature also will refer to the specific tables (described above) from each survey.  All census data used here has been obtained from the USCB on its open-data website at: https://www.census.gov/acs/www/data/data-tables-and-tools/american-factfinder/.  

### Current Rent Asked (Table B25061)

### Unique to Table B25061, the table fields were modified during the period of our study, to provide a more detailed differentiation of rents above $2,000/month.  That modification is reflected in the module below.  The fields of each of the other tables have remained consistent throughout the 2012-2017 releases.

In [2]:
ACS_list = [('ACS17_5YR_FULL_N',True),('ACS16_5YR_FULL_N',True),('ACS15_5YR_FULL_N',True),('ACS14_5YR_FULL_N',False),
            ('ACS13_5YR_FULL_N',False),('ACS12_5YR_FULL_N',False)]

In [3]:
B25061_current_fields = ['GEO.id2','HD01_VD01','HD01_VD22','HD01_VD23','HD01_VD24','HD01_VD25']
current_names = ['Total','>$2,000','<$2,500','<$3,000','>$3,500']
B25061_original_fields = ['GEO.id2','HD01_VD01','HD01_VD22']
original_names = ['Total','>$2,000']

In [4]:
pds={}

for i in range(len(ACS_list)):
    subdir=ACS_list[i]
    tr = '/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/'+subdir[0]+'/'
    print(tr)
    csv_file = fnmatch.filter(os.listdir(tr),'*61*with_ann.csv')
    full_path = tr+csv_file[0]
    if ACS_list[i][1]:
        cols = B25061_current_fields
        col_names = current_names
    else:
        cols = B25061_original_fields
        col_names = original_names
    df_t = pd.read_csv(full_path, index_col='GEO.id2',usecols=cols)
    df_t.columns = [ACS_list[i][0][0:5]+" "+col_names[j] for j in range(len(col_names))]
    df_t.drop(labels='Id2', inplace=True)
    pds[ACS_list[i][0]]=df_t


/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS17_5YR_FULL_N/
/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS16_5YR_FULL_N/
/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS15_5YR_FULL_N/
/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS14_5YR_FULL_N/
/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS13_5YR_FULL_N/
/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/ACS12_5YR_FULL_N/
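
#### The loading pattern used in the cell above recurs for every table in this module.  As a minimal sketch, and not the code actually run here, the same steps could be wrapped in one helper.  The function name `load_acs_table`, its parameters, and the tiny demonstration file are assumptions made for illustration only.

```python
import fnmatch
import os
import tempfile

import pandas as pd


def load_acs_table(data_root, survey_dir, pattern, fields, names):
    """Load one table for one survey and label its columns by survey year.

    Hypothetical helper: `data_root` stands in for the local
    'Data_for_NYCHR' directory used in the cell above.
    """
    folder = os.path.join(data_root, survey_dir)
    matches = fnmatch.filter(sorted(os.listdir(folder)), pattern)
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {folder}")
    df = pd.read_csv(os.path.join(folder, matches[0]),
                     index_col='GEO.id2', usecols=fields, dtype=str)
    # read_csv keeps the file's column order, so re-select in the requested order
    df = df[[f for f in fields if f != 'GEO.id2']]
    df.columns = [f"{survey_dir[:5]} {name}" for name in names]
    # The descriptive second header row appears as a data row labelled 'Id2'
    return df.drop(index='Id2', errors='ignore')


# Demonstration on a throwaway file that imitates the layout of the downloads
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'ACS17_5YR_FULL_N'))
    demo_path = os.path.join(root, 'ACS17_5YR_FULL_N', 'demo_B25061_with_ann.csv')
    with open(demo_path, 'w') as fh:
        fh.write('GEO.id,GEO.id2,HD01_VD01,HD01_VD22\n'
                 'Id,Id2,Total,Rent\n'
                 'x,36061000100,10,4\n')
    demo = load_acs_table(root, 'ACS17_5YR_FULL_N', '*61*with_ann.csv',
                          ['GEO.id2', 'HD01_VD01', 'HD01_VD22'],
                          ['Total', '>$2,000'])

print(demo)
```

#### Raising an error when no file matches makes a missing download easier to diagnose than the `IndexError` that `csv_file[0]` would otherwise produce.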


#### Because of the shift in the rent survey fields in 2015, described above, we align the information for 2015, 2016 and 2017 with that from prior years.  We do that by summing the counts of all households with rents above $2,000/month in the later years and comparing those aggregates to the counts of households in the highest range for the earlier years.  This step will not be required for the tables addressing the remaining features.

In [5]:
rent=pd.concat([pds[ACS_list[0][0]],pds[ACS_list[1][0]],pds[ACS_list[2][0]],pds[ACS_list[3][0]],pds[ACS_list[4][0]],pds[ACS_list[5][0]]],axis=1)
rent_int=rent.astype(int)

agg17=(rent_int['ACS17 >$2,000']+rent_int['ACS17 <$2,500']+rent_int['ACS17 <$3,000']+rent_int['ACS17 >$3,500'])
agg16 = (rent_int['ACS16 >$2,000']+rent_int['ACS16 <$2,500']+rent_int['ACS16 <$3,000']+rent_int['ACS16 >$3,500'])
agg15 = (rent_int['ACS15 >$2,000']+rent_int['ACS15 <$2,500']+rent_int['ACS15 <$3,000']+rent_int['ACS15 >$3,500'])

rent_int['ACS17 >$2,000 Agg'] = agg17
rent_int['ACS16 >$2,000 Agg'] = agg16
rent_int['ACS15 >$2,000 Agg'] = agg15
print(rent_int.columns)
rent_int.head()


Index(['ACS17 Total', 'ACS17 >$2,000', 'ACS17 <$2,500', 'ACS17 <$3,000',
       'ACS17 >$3,500', 'ACS16 Total', 'ACS16 >$2,000', 'ACS16 <$2,500',
       'ACS16 <$3,000', 'ACS16 >$3,500', 'ACS15 Total', 'ACS15 >$2,000',
       'ACS15 <$2,500', 'ACS15 <$3,000', 'ACS15 >$3,500', 'ACS14 Total',
       'ACS14 >$2,000', 'ACS13 Total', 'ACS13 >$2,000', 'ACS12 Total',
       'ACS12 >$2,000', 'ACS17 >$2,000 Agg', 'ACS16 >$2,000 Agg',
       'ACS15 >$2,000 Agg'],
      dtype='object')


GEO.id2,ACS17 Total,"ACS17 >$2,000","ACS17 <$2,500","ACS17 <$3,000","ACS17 >$3,500",ACS16 Total,"ACS16 >$2,000","ACS16 <$2,500","ACS16 <$3,000","ACS16 >$3,500",...,"ACS15 >$3,500",ACS14 Total,"ACS14 >$2,000",ACS13 Total,"ACS13 >$2,000",ACS12 Total,"ACS12 >$2,000","ACS17 >$2,000 Agg","ACS16 >$2,000 Agg","ACS15 >$2,000 Agg"
36061000100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36061000201,82,0,0,0,0,59,0,0,0,0,...,0,96,0,86,0,45,0,0,0,0
36061000202,98,0,0,0,0,44,0,0,0,0,...,0,44,0,46,0,32,0,0,0,0
36061000500,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36061000600,79,0,0,0,0,0,0,0,0,0,...,0,65,0,105,0,106,0,0,0,0
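
#### The aggregation step can also be written by selecting the high-rent columns by name rather than spelling out each sum.  The following is a sketch on made-up counts; the frame `toy` and its tract labels are illustrative assumptions, not values from the survey.

```python
import pandas as pd

# Made-up counts for two tracts, using the column labels from the cells above
toy = pd.DataFrame({
    'ACS17 Total': [100, 0],
    'ACS17 >$2,000': [5, 0],
    'ACS17 <$2,500': [4, 0],
    'ACS17 <$3,000': [3, 0],
    'ACS17 >$3,500': [2, 0],
}, index=['tract_a', 'tract_b'])

# Every ACS17 column other than the total is one of the high-rent brackets
high_rent_cols = [c for c in toy.columns
                  if c.startswith('ACS17 ') and not c.endswith('Total')]
toy['ACS17 >$2,000 Agg'] = toy[high_rent_cols].sum(axis=1)

print(toy['ACS17 >$2,000 Agg'])
```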


In [6]:
### Now we normalize the data by dividing the number of households reporting rent in excess of $2,000/month 
# by the total number of households reporting rent in the respective surveys.  Where no households reported in
# a particular survey, the division by zero yields a missing-data 'NaN' entry; those entries are addressed below.

rent_int['2017 Values']= rent_int['ACS17 >$2,000 Agg'].div(rent_int['ACS17 Total'],axis=0)
rent_int['2016 Values']= rent_int['ACS16 >$2,000 Agg'].div(rent_int['ACS16 Total'],axis=0)
rent_int['2015 Values']= rent_int['ACS15 >$2,000 Agg'].div(rent_int['ACS15 Total'],axis=0)
rent_int['2014 Values']= rent_int['ACS14 >$2,000'].div(rent_int['ACS14 Total'],axis=0)
rent_int['2013 Values']= rent_int['ACS13 >$2,000'].div(rent_int['ACS13 Total'],axis=0)
rent_int['2012 Values']= rent_int['ACS12 >$2,000'].div(rent_int['ACS12 Total'],axis=0)
rent_int_idx = rent_int.reset_index(level=0, inplace=False)
# rent_int_idx.info()
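
#### To make the division-by-zero behavior concrete, here is a small sketch on made-up counts.  In pandas, 0/0 yields NaN while a nonzero count divided by zero yields `inf`; the second case should not arise when the numerator is a subset of the total, but replacing `inf` with NaN is a cheap safeguard.  The names `counts` and `share` are illustrative.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame({'high_rent': [14, 0, 3],
                       'total': [100, 0, 0]})

share = counts['high_rent'].div(counts['total'])
# Row 0: ordinary share; row 1: 0/0 -> NaN; row 2: 3/0 -> inf
share = share.replace([np.inf, -np.inf], np.nan)

print(share)
```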


In [7]:
##### Now we will address the missing data. We first will interpolate between valid data entries and after that we will try backfilling.
# Any remaining instances that have missing data will be dropped.  
# As will be seen, relatively few relevant census tracts are lost through this procedure:
 

rent_int_clean= pd.DataFrame(rent_int_idx[['2017 Values','2016 Values','2015 Values','2014 Values','2013 Values','2012 Values']])
y=rent_int_clean.interpolate(method='linear',axis=1, inplace=False)
z=y.bfill(axis=1)  # same backfill; fillna(method=...) is deprecated in recent pandas
z['Tract']=rent_int_idx.loc[:,'GEO.id2']
zz=z.dropna(axis=0)
rent_moment = zz.reset_index(drop=True, inplace=False)
# rent_moment.head(30)
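
#### The effect of the three cleaning steps is easiest to see on a toy frame.  The rows below are invented to show an interior gap, a leading gap, and a tract with no data at all; only the column labels come from the cell above.

```python
import numpy as np
import pandas as pd

cols = ['2017 Values', '2016 Values', '2015 Values',
        '2014 Values', '2013 Values', '2012 Values']
toy = pd.DataFrame(
    [[0.30, np.nan, 0.20, 0.15, 0.10, 0.05],    # interior gap
     [np.nan, np.nan, 0.20, 0.15, 0.10, 0.05],  # leading gap
     [np.nan] * 6],                             # no data in any survey
    columns=cols)

# Step 1: linear interpolation along each row fills interior gaps
# Step 2: backfill copies the first valid value into leading gaps
filled = toy.interpolate(method='linear', axis=1).bfill(axis=1)
# Step 3: rows that are still incomplete are dropped
kept = filled.dropna(axis=0)

print(kept)
```

#### Note that backfilling repeats a neighboring value, which flattens the fitted trend for that tract toward zero.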

#### We will determine the momentum of the population of households on each feature by fitting a straight-line linear regression to the six years of data.  As an aside, there almost certainly are other, more complex, methods available for measuring this momentum: however, the scipy stats.linregress method should provide adequate trend information and statistical analysis for our purposes.  Here, we create a dictionary of results, one entry for each census tract that had valid data: the values in each entry are the slope (m) and the intercept (b) of the line, and the correlation coefficient (r), p-value (p) and standard error (std_err) of the fit to the data.  A cursory review of the results reveals a wide dispersion in slopes (momentum), with remarkably strong statistics. This is reassuring, as the principal reason for including as many as six data points for each tract was to reduce the impact of statistical noise.

In [8]:
numbs=pd.Series([0,1,2,3,4,5])
results = {}
for i in range(len(rent_moment)):
    testdata = pd.Series(rent_moment.loc[i,['2012 Values','2013 Values','2014 Values','2015 Values','2016 Values','2017 Values']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results[i]=(m,b,r,p,std_err)

len(results)  

272
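
#### For later use in clustering, the per-tract results may be easier to handle as a dataframe than as a dictionary of tuples.  The sketch below runs the same `stats.linregress` fit on two invented rows, collects the results in a frame, and cross-checks the slopes against `numpy.polyfit`.  The names `toy` and `summary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

years = ['2012 Values', '2013 Values', '2014 Values',
         '2015 Values', '2016 Values', '2017 Values']
toy = pd.DataFrame([[0.00, 0.02, 0.04, 0.06, 0.08, 0.10],   # steady rise
                    [0.30, 0.28, 0.31, 0.27, 0.29, 0.30]],  # no clear trend
                   columns=years)
x = np.arange(len(years), dtype=float)

rows = []
for _, row in toy.iterrows():
    fit = stats.linregress(x, row[years].astype(float))
    rows.append({'slope': fit.slope, 'intercept': fit.intercept,
                 'r': fit.rvalue, 'p': fit.pvalue, 'std_err': fit.stderr})
summary = pd.DataFrame(rows, index=toy.index)

# Cross-check: polyfit fits every row at once when given one column per row
slopes, intercepts = np.polyfit(x, toy[years].to_numpy().T, deg=1)

print(summary)
```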

## Uploading and Analyzing the Remaining Feature Tables

### Table B11001 (Household Type)

#### Table B11001 addresses the composition of the members of a household unit.  It discriminates between households comprising two or more members of a 'family' and those comprising one or more persons who are not of the same family; within the latter group, separate features cover households of a person living alone, on the one hand, and those of two or more persons, on the other. For purposes of our study, I have included only the latter two features since (i) the category of family households is the complement of the category of total nonfamily households (given the total number of responses); (ii) in the case of units occupied by families, no distinction is made between those with and those without children of a young age that might be relevant to decisions regarding dining at restaurants of the type contemplated in this study; but (iii) it is reasonable to expect that the restaurant behavior of households comprising only one person would be different from that of households comprising two or more unrelated persons.

In [9]:
# For the avoidance of confusion, we will continue to use the ACS_list of tuples, notwithstanding that the second member of each tuple was
# introduced solely to accommodate the change in fields on B25061, discussed above.

ACS_list = [('ACS17_5YR_FULL_N',True),('ACS16_5YR_FULL_N',True),('ACS15_5YR_FULL_N',True),('ACS14_5YR_FULL_N',False),
            ('ACS13_5YR_FULL_N',False),('ACS12_5YR_FULL_N',False)]

In [10]:
B11001_current_fields = ['GEO.id2','HD01_VD01','HD01_VD08','HD01_VD09']
current_names = ['Total','HH Living Alone','HH Not Living Alone']


In [11]:

pds={}

for i in range(len(ACS_list)):
    subdir=ACS_list[i]
    tr = '/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/'+subdir[0]+'/'
    csv_file = fnmatch.filter(os.listdir(tr),'*01*with_ann.csv')
    full_path = tr+csv_file[0]
    cols = B11001_current_fields
    col_names = current_names
    df_t = pd.read_csv(full_path, index_col='GEO.id2',usecols=cols)
    df_t.columns = [ACS_list[i][0][0:5]+" "+col_names[j] for j in range(len(col_names))]
    df_t.drop(labels='Id2', inplace=True)
    pds[ACS_list[i][0]]=df_t
    
# pds


In [12]:
hhstat=pd.concat([pds[ACS_list[0][0]],pds[ACS_list[1][0]],pds[ACS_list[2][0]],pds[ACS_list[3][0]],pds[ACS_list[4][0]],pds[ACS_list[5][0]]],axis=1)
hhstat_int=hhstat.astype(int)
# hhstat_int.head(10)


In [13]:
## Now we normalize the data by dividing the number of households reporting each occupancy status  
# by the total number of households reporting household type in the respective surveys.  Where no households reported in
# a particular survey, the division by zero yields a missing-data 'NaN' entry; those entries are addressed below.

hhstat_int['ACS17 HH Living Alone']= hhstat_int['ACS17 HH Living Alone'].div(hhstat_int['ACS17 Total'],axis=0)
hhstat_int['ACS17 HH Not Living Alone']= hhstat_int['ACS17 HH Not Living Alone'].div(hhstat_int['ACS17 Total'],axis=0)
hhstat_int['ACS16 HH Living Alone']= hhstat_int['ACS16 HH Living Alone'].div(hhstat_int['ACS16 Total'],axis=0)
hhstat_int['ACS16 HH Not Living Alone']= hhstat_int['ACS16 HH Not Living Alone'].div(hhstat_int['ACS16 Total'],axis=0)
hhstat_int['ACS15 HH Living Alone']= hhstat_int['ACS15 HH Living Alone'].div(hhstat_int['ACS15 Total'],axis=0)
hhstat_int['ACS15 HH Not Living Alone']= hhstat_int['ACS15 HH Not Living Alone'].div(hhstat_int['ACS15 Total'],axis=0)
hhstat_int['ACS14 HH Living Alone']= hhstat_int['ACS14 HH Living Alone'].div(hhstat_int['ACS14 Total'],axis=0)
hhstat_int['ACS14 HH Not Living Alone']= hhstat_int['ACS14 HH Not Living Alone'].div(hhstat_int['ACS14 Total'],axis=0)                                                                             
hhstat_int['ACS13 HH Living Alone']= hhstat_int['ACS13 HH Living Alone'].div(hhstat_int['ACS13 Total'],axis=0)
hhstat_int['ACS13 HH Not Living Alone']= hhstat_int['ACS13 HH Not Living Alone'].div(hhstat_int['ACS13 Total'],axis=0)                                                                             
hhstat_int['ACS12 HH Living Alone']= hhstat_int['ACS12 HH Living Alone'].div(hhstat_int['ACS12 Total'],axis=0)
hhstat_int['ACS12 HH Not Living Alone']= hhstat_int['ACS12 HH Not Living Alone'].div(hhstat_int['ACS12 Total'],axis=0)
hhstat_int_idx = hhstat_int.reset_index(level=0, inplace=False)
# hhstat_int_idx.info()
hhstat_int.info()


<class 'pandas.core.frame.DataFrame'>
Index: 288 entries, 36061000100 to 36061031900
Data columns (total 18 columns):
ACS17 Total                  288 non-null int64
ACS17 HH Living Alone        280 non-null float64
ACS17 HH Not Living Alone    280 non-null float64
ACS16 Total                  288 non-null int64
ACS16 HH Living Alone        280 non-null float64
ACS16 HH Not Living Alone    280 non-null float64
ACS15 Total                  288 non-null int64
ACS15 HH Living Alone        280 non-null float64
ACS15 HH Not Living Alone    280 non-null float64
ACS14 Total                  288 non-null int64
ACS14 HH Living Alone        279 non-null float64
ACS14 HH Not Living Alone    279 non-null float64
ACS13 Total                  288 non-null int64
ACS13 HH Living Alone        279 non-null float64
ACS13 HH Not Living Alone    279 non-null float64
ACS12 Total                  288 non-null int64
ACS12 HH Living Alone        279 non-null float64
ACS12 HH Not Living Alone    279 non-null float64

In [14]:
##### Now we will address the missing data. We first will interpolate between valid data entries and after that we will try backfilling.
# Any remaining instances that have missing data will be dropped.  
# As will be seen, relatively few relevant census tracts are lost through this procedure:
 

hhstat_int_clean= pd.DataFrame(hhstat_int_idx[['ACS17 HH Living Alone', 'ACS17 HH Not Living Alone',
       'ACS16 HH Living Alone', 'ACS16 HH Not Living Alone',
       'ACS15 HH Living Alone', 'ACS15 HH Not Living Alone',
       'ACS14 HH Living Alone', 'ACS14 HH Not Living Alone',
       'ACS13 HH Living Alone', 'ACS13 HH Not Living Alone',
       'ACS12 HH Living Alone', 'ACS12 HH Not Living Alone']])
y=hhstat_int_clean.interpolate(method='linear',axis=1, inplace=False)
z=y.bfill(axis=1)  # same backfill; fillna(method=...) is deprecated in recent pandas
z['Tract']=hhstat_int_idx.loc[:,'GEO.id2']
zz=z.dropna(axis=0)
hhstat_moment = zz.reset_index(drop=True, inplace=False)
hhstat_moment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 13 columns):
ACS17 HH Living Alone        280 non-null float64
ACS17 HH Not Living Alone    280 non-null float64
ACS16 HH Living Alone        280 non-null float64
ACS16 HH Not Living Alone    280 non-null float64
ACS15 HH Living Alone        280 non-null float64
ACS15 HH Not Living Alone    280 non-null float64
ACS14 HH Living Alone        280 non-null float64
ACS14 HH Not Living Alone    280 non-null float64
ACS13 HH Living Alone        280 non-null float64
ACS13 HH Not Living Alone    280 non-null float64
ACS12 HH Living Alone        280 non-null float64
ACS12 HH Not Living Alone    280 non-null float64
Tract                        280 non-null object
dtypes: float64(12), object(1)
memory usage: 28.5+ KB
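
#### One point to watch in the cell above: the columns passed to `interpolate` alternate between the two features, so a gap in one survey year is bridged using neighboring values that belong to a different feature.  A possible alternative, sketched here on invented shares, is to fill each feature's six columns separately.  The helper name `fill_by_feature` is hypothetical.

```python
import numpy as np
import pandas as pd

survey_tags = ['ACS17', 'ACS16', 'ACS15', 'ACS14', 'ACS13', 'ACS12']
features = ['HH Living Alone', 'HH Not Living Alone']


def fill_by_feature(shares, tags, feats):
    """Interpolate and backfill each feature across survey years on its own."""
    filled = {}
    for feat in feats:
        cols = [f"{tag} {feat}" for tag in tags]
        block = shares[cols].interpolate(method='linear', axis=1).bfill(axis=1)
        for c in cols:
            filled[c] = block[c]
    return pd.DataFrame(filled, index=shares.index)


# One invented tract with no ACS15 data for either feature
alone = [0.50, 0.48, np.nan, 0.44, 0.42, 0.40]
not_alone = [0.10, 0.12, np.nan, 0.16, 0.18, 0.20]
toy = pd.DataFrame({
    **{f"{t} HH Living Alone": [v] for t, v in zip(survey_tags, alone)},
    **{f"{t} HH Not Living Alone": [v] for t, v in zip(survey_tags, not_alone)},
})

result = fill_by_feature(toy, survey_tags, features)
print(result[['ACS15 HH Living Alone', 'ACS15 HH Not Living Alone']])
```

#### With the features separated, each ACS15 gap is filled with the midpoint of that feature's own ACS16 and ACS14 values.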


In [15]:
numbs=pd.Series([0,1,2,3,4,5])
# Columns are listed oldest first (2012 -> 2017) so that a positive slope means an
# upward trend over time, consistent with the rent regression above.
# First, for Households Living Alone
results_HHLA = {}
for i in range(len(hhstat_moment)):
    testdata = pd.Series(hhstat_moment.loc[i,['ACS12 HH Living Alone','ACS13 HH Living Alone', 
       'ACS14 HH Living Alone', 'ACS15 HH Living Alone','ACS16 HH Living Alone', 'ACS17 HH Living Alone']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHLA[i]=(m,b,r,p,std_err)
len(results_HHLA)


# Now, for Households Not Living Alone (using the 'Not Living Alone' columns)
results_HHNLA = {}
for i in range(len(hhstat_moment)):
    testdata = pd.Series(hhstat_moment.loc[i,['ACS12 HH Not Living Alone','ACS13 HH Not Living Alone', 
       'ACS14 HH Not Living Alone', 'ACS15 HH Not Living Alone','ACS16 HH Not Living Alone', 'ACS17 HH Not Living Alone']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHNLA[i]=(m,b,r,p,std_err)
len(results_HHNLA)

280
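
#### Because the regression loop is repeated for every feature, building the column names from the feature label is one way to guard against copying the wrong columns or listing the years in the wrong order.  This is a sketch only: `feature_momentum` is a hypothetical helper, and the two-row frame is invented.

```python
import pandas as pd
from scipy import stats

CHRONOLOGICAL_TAGS = ('ACS12', 'ACS13', 'ACS14', 'ACS15', 'ACS16', 'ACS17')


def feature_momentum(frame, feature, tags=CHRONOLOGICAL_TAGS):
    """Fit a line through one feature's yearly shares for every row of `frame`.

    Returns {row label: (slope, intercept, r, p, std_err)}; a positive slope
    means the share rose from the oldest survey to the newest.
    """
    cols = [f"{tag} {feature}" for tag in tags]
    x = [float(k) for k in range(len(cols))]
    out = {}
    for label, row in frame[cols].astype(float).iterrows():
        fit = stats.linregress(x, row.to_numpy())
        out[label] = (fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)
    return out


# Invented shares: the first tract rises, the second falls
rising = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
toy = pd.DataFrame({f"{t} HH Living Alone": [up, 0.70 - up]
                    for t, up in zip(CHRONOLOGICAL_TAGS, rising)})

momentum = feature_momentum(toy, 'HH Living Alone')
print({k: round(v[0], 3) for k, v in momentum.items()})
```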

### Table B15003 (Educational Attainment)

#### Table B15003 addresses the highest level of educational attainment in the household. It provides very detailed classifications of such attainment, from lower school through the Ph.D. level.  In my experience (having both a J.D. and a Ph.D.), the restaurant choices made by persons holding a Masters or lower level of attainment are quite different from those of a holder of a professional degree, and each of those is very different from the decisions made by a holder of a Ph.D. This study will utilize as features only the three highest levels.

In [16]:
ACS_list = [('ACS17_5YR_FULL_N',True),('ACS16_5YR_FULL_N',True),('ACS15_5YR_FULL_N',True),('ACS14_5YR_FULL_N',False),
            ('ACS13_5YR_FULL_N',False),('ACS12_5YR_FULL_N',False)]

In [17]:
B15003_current_fields = ['GEO.id2','HD01_VD01','HD01_VD23','HD01_VD24','HD01_VD25']
current_names = ['Total','Masters','Professional', 'PhD']
B15003_current_fields

['GEO.id2', 'HD01_VD01', 'HD01_VD23', 'HD01_VD24', 'HD01_VD25']

In [18]:

pds={}

for i in range(len(ACS_list)):
    subdir=ACS_list[i]
    tr = '/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/'+subdir[0]+'/'
    csv_file = fnmatch.filter(os.listdir(tr),'*003*with_ann.csv')
    full_path = tr+csv_file[0]
    cols = B15003_current_fields
    col_names = current_names
    df_t = pd.read_csv(full_path, index_col='GEO.id2',usecols=cols)
    df_t.columns = [ACS_list[i][0][0:5]+" "+col_names[j] for j in range(len(col_names))]
    df_t.drop(labels='Id2', inplace=True)
    pds[ACS_list[i][0]]=df_t
    
# pds


In [19]:
edstat=pd.concat([pds[ACS_list[0][0]],pds[ACS_list[1][0]],pds[ACS_list[2][0]],pds[ACS_list[3][0]],pds[ACS_list[4][0]],pds[ACS_list[5][0]]],axis=1)
edstat_int=edstat.astype(int)
edstat_int.head(10)
edstat_int.columns
edstat_int.info()

<class 'pandas.core.frame.DataFrame'>
Index: 288 entries, 36061000100 to 36061031900
Data columns (total 24 columns):
ACS17 Total           288 non-null int64
ACS17 Masters         288 non-null int64
ACS17 Professional    288 non-null int64
ACS17 PhD             288 non-null int64
ACS16 Total           288 non-null int64
ACS16 Masters         288 non-null int64
ACS16 Professional    288 non-null int64
ACS16 PhD             288 non-null int64
ACS15 Total           288 non-null int64
ACS15 Masters         288 non-null int64
ACS15 Professional    288 non-null int64
ACS15 PhD             288 non-null int64
ACS14 Total           288 non-null int64
ACS14 Masters         288 non-null int64
ACS14 Professional    288 non-null int64
ACS14 PhD             288 non-null int64
ACS13 Total           288 non-null int64
ACS13 Masters         288 non-null int64
ACS13 Professional    288 non-null int64
ACS13 PhD             288 non-null int64
ACS12 Total           288 non-null int64
ACS12 Masters        

In [20]:
### Now we normalize the data by dividing the number of households reporting a level of educational attainment of interest here 
# by the total number of households reporting educational attainment at any level in the respective surveys.  Where no
# households reported in a particular survey, the division by zero yields a missing-data 'NaN' entry; those entries are
# addressed below.

edstat_int['ACS17 Masters']= edstat_int['ACS17 Masters'].div(edstat_int['ACS17 Total'],axis=0)
edstat_int['ACS17 Professional']= edstat_int['ACS17 Professional'].div(edstat_int['ACS17 Total'],axis=0)
edstat_int['ACS17 PhD']= edstat_int['ACS17 PhD'].div(edstat_int['ACS17 Total'],axis=0)
edstat_int['ACS16 Masters']= edstat_int['ACS16 Masters'].div(edstat_int['ACS16 Total'],axis=0)
edstat_int['ACS16 Professional']= edstat_int['ACS16 Professional'].div(edstat_int['ACS16 Total'],axis=0)
edstat_int['ACS16 PhD']= edstat_int['ACS16 PhD'].div(edstat_int['ACS16 Total'],axis=0)
edstat_int['ACS15 Masters']= edstat_int['ACS15 Masters'].div(edstat_int['ACS15 Total'],axis=0)
edstat_int['ACS15 Professional']= edstat_int['ACS15 Professional'].div(edstat_int['ACS15 Total'],axis=0)
edstat_int['ACS15 PhD']= edstat_int['ACS15 PhD'].div(edstat_int['ACS15 Total'],axis=0)
edstat_int['ACS14 Masters']= edstat_int['ACS14 Masters'].div(edstat_int['ACS14 Total'],axis=0)
edstat_int['ACS14 Professional']= edstat_int['ACS14 Professional'].div(edstat_int['ACS14 Total'],axis=0)
edstat_int['ACS14 PhD']= edstat_int['ACS14 PhD'].div(edstat_int['ACS14 Total'],axis=0)
edstat_int['ACS13 Masters']= edstat_int['ACS13 Masters'].div(edstat_int['ACS13 Total'],axis=0)
edstat_int['ACS13 Professional']= edstat_int['ACS13 Professional'].div(edstat_int['ACS13 Total'],axis=0)
edstat_int['ACS13 PhD']= edstat_int['ACS13 PhD'].div(edstat_int['ACS13 Total'],axis=0)
edstat_int['ACS12 Masters']= edstat_int['ACS12 Masters'].div(edstat_int['ACS12 Total'],axis=0)
edstat_int['ACS12 Professional']= edstat_int['ACS12 Professional'].div(edstat_int['ACS12 Total'],axis=0)
edstat_int['ACS12 PhD']= edstat_int['ACS12 PhD'].div(edstat_int['ACS12 Total'],axis=0)
edstat_int_idx = edstat_int.reset_index(level=0, inplace=False)
# edstat_int_idx.info()
# edstat_int.head(10)


In [21]:
##### Now we will address the missing data. We first will interpolate between valid data entries and after that we will try backfilling.
# Any remaining instances that have missing data will be dropped.  
# As will be seen, relatively few relevant census tracts are lost through this procedure:
 

edstat_int_clean= pd.DataFrame(edstat_int_idx[['ACS17 Masters', 'ACS17 Professional', 'ACS17 PhD',
       'ACS16 Masters', 'ACS16 Professional', 'ACS16 PhD',
       'ACS15 Masters', 'ACS15 Professional', 'ACS15 PhD',
       'ACS14 Masters', 'ACS14 Professional', 'ACS14 PhD',
       'ACS13 Masters', 'ACS13 Professional', 'ACS13 PhD',
       'ACS12 Masters', 'ACS12 Professional', 'ACS12 PhD']])
y=edstat_int_clean.interpolate(method='linear',axis=1, inplace=False)
z=y.bfill(axis=1)  # same backfill; fillna(method=...) is deprecated in recent pandas
z['Tract']=edstat_int_idx.loc[:,'GEO.id2']
zz=z.dropna(axis=0)
edstat_moment = zz.reset_index(drop=True, inplace=False)
edstat_moment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283 entries, 0 to 282
Data columns (total 19 columns):
ACS17 Masters         283 non-null float64
ACS17 Professional    283 non-null float64
ACS17 PhD             283 non-null float64
ACS16 Masters         283 non-null float64
ACS16 Professional    283 non-null float64
ACS16 PhD             283 non-null float64
ACS15 Masters         283 non-null float64
ACS15 Professional    283 non-null float64
ACS15 PhD             283 non-null float64
ACS14 Masters         283 non-null float64
ACS14 Professional    283 non-null float64
ACS14 PhD             283 non-null float64
ACS13 Masters         283 non-null float64
ACS13 Professional    283 non-null float64
ACS13 PhD             283 non-null float64
ACS12 Masters         283 non-null float64
ACS12 Professional    283 non-null float64
ACS12 PhD             283 non-null float64
Tract                 283 non-null object
dtypes: float64(18), object(1)
memory usage: 42.1+ KB


In [22]:
numbs=pd.Series([0,1,2,3,4,5])
# Columns are listed oldest first (2012 -> 2017) so that a positive slope means an
# upward trend over time, consistent with the rent regression above.
# First, for Households Achieving Masters Degree
results_HHMSTR = {}
for i in range(len(edstat_moment)):
    testdata = pd.Series(edstat_moment.loc[i,['ACS12 Masters','ACS13 Masters','ACS14 Masters', 'ACS15 Masters','ACS16 Masters', 'ACS17 Masters']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHMSTR[i]=(m,b,r,p,std_err)
len(results_HHMSTR)


# Now, for Households Achieving Professional Degree (using the 'Professional' columns)
results_HHPRD = {}
for i in range(len(edstat_moment)):
    testdata = pd.Series(edstat_moment.loc[i,['ACS12 Professional','ACS13 Professional', 
       'ACS14 Professional', 'ACS15 Professional','ACS16 Professional', 'ACS17 Professional']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHPRD[i]=(m,b,r,p,std_err)
len(results_HHPRD)

# Finally, for Households Achieving a Ph.D.

results_HHPHD = {}
for i in range(len(edstat_moment)):
    testdata = pd.Series(edstat_moment.loc[i,['ACS12 PhD','ACS13 PhD', 
       'ACS14 PhD', 'ACS15 PhD','ACS16 PhD', 'ACS17 PhD']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHPHD[i]=(m,b,r,p,std_err)

len(results_HHPHD)


283

### Table B19054 (Interest, Dividends or Net Rental Income)

#### Income from interest, dividends or net rental income constitutes the bulk of noncapital returns from investment, that is, 'passive income'.  Table B19054 estimates the number of households that had, or did not have, any passive income during the preceding twelve-month period.  The receipt of passive income may be a relevant factor in restaurant usage to the extent that it indicates that the household has invested wealth that might also support such usage.  That is, it is reasonable to assume that a household with investment income may partake of restaurants of the type considered here, while a household with no investment income probably would not partake.

In [23]:
ACS_list = [('ACS17_5YR_FULL_N',True),('ACS16_5YR_FULL_N',True),('ACS15_5YR_FULL_N',True),('ACS14_5YR_FULL_N',False),
            ('ACS13_5YR_FULL_N',False),('ACS12_5YR_FULL_N',False)]

In [24]:
B19054_current_fields = ['GEO.id2','HD01_VD01','HD01_VD02','HD01_VD03']
current_names = ['Total','Passive Income','No Passive Iincome']


In [25]:

pds={}

for i in range(len(ACS_list)):
    subdir=ACS_list[i]
    tr = '/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/'+subdir[0]+'/'
    csv_file = fnmatch.filter(os.listdir(tr),'*054*with_ann.csv')
    full_path = tr+csv_file[0]
    cols = B19054_current_fields
    col_names = current_names
    df_t = pd.read_csv(full_path, index_col='GEO.id2',usecols=cols)
    df_t.columns = [ACS_list[i][0][0:5]+" "+col_names[j] for j in range(len(col_names))]
    df_t.drop(labels='Id2', inplace=True)
    pds[ACS_list[i][0]]=df_t
    
# pds.items()


In [26]:
pincstat=pd.concat([pds[ACS_list[0][0]],pds[ACS_list[1][0]],pds[ACS_list[2][0]],pds[ACS_list[3][0]],pds[ACS_list[4][0]],pds[ACS_list[5][0]]],axis=1)
pincstat_int=pincstat.astype(int)
pincstat_int.head(10)
pincstat_int.columns

Index(['ACS17 Total', 'ACS17 Passive Income', 'ACS17 No Passive Iincome',
       'ACS16 Total', 'ACS16 Passive Income', 'ACS16 No Passive Iincome',
       'ACS15 Total', 'ACS15 Passive Income', 'ACS15 No Passive Iincome',
       'ACS14 Total', 'ACS14 Passive Income', 'ACS14 No Passive Iincome',
       'ACS13 Total', 'ACS13 Passive Income', 'ACS13 No Passive Iincome',
       'ACS12 Total', 'ACS12 Passive Income', 'ACS12 No Passive Iincome'],
      dtype='object')

In [27]:
### Now we normalize the data by dividing the number of households reporting (or not reporting) passive income 
# by the total number of households covered by the passive income table in the respective surveys.  Where no
# households reported in a particular survey, the division by zero yields a missing-data 'NaN' entry; those entries are
# addressed below.

pincstat_int['ACS17 Passive Income']= pincstat_int['ACS17 Passive Income'].div(pincstat_int['ACS17 Total'],axis=0)
pincstat_int['ACS17 No Passive Iincome']= pincstat_int['ACS17 No Passive Iincome'].div(pincstat_int['ACS17 Total'],axis=0)
pincstat_int['ACS16 Passive Income']= pincstat_int['ACS16 Passive Income'].div(pincstat_int['ACS16 Total'],axis=0)
pincstat_int['ACS16 No Passive Iincome']= pincstat_int['ACS16 No Passive Iincome'].div(pincstat_int['ACS16 Total'],axis=0)
pincstat_int['ACS15 Passive Income']= pincstat_int['ACS15 Passive Income'].div(pincstat_int['ACS15 Total'],axis=0)
pincstat_int['ACS15 No Passive Iincome']= pincstat_int['ACS15 No Passive Iincome'].div(pincstat_int['ACS15 Total'],axis=0)
pincstat_int['ACS14 Passive Income']= pincstat_int['ACS14 Passive Income'].div(pincstat_int['ACS14 Total'],axis=0)
pincstat_int['ACS14 No Passive Iincome']= pincstat_int['ACS14 No Passive Iincome'].div(pincstat_int['ACS14 Total'],axis=0)
pincstat_int['ACS13 Passive Income']= pincstat_int['ACS13 Passive Income'].div(pincstat_int['ACS13 Total'],axis=0)
pincstat_int['ACS13 No Passive Iincome']= pincstat_int['ACS13 No Passive Iincome'].div(pincstat_int['ACS13 Total'],axis=0)
pincstat_int['ACS12 Passive Income']= pincstat_int['ACS12 Passive Income'].div(pincstat_int['ACS12 Total'],axis=0)
pincstat_int['ACS12 No Passive Iincome']= pincstat_int['ACS12 No Passive Iincome'].div(pincstat_int['ACS12 Total'],axis=0)

pincstat_int_idx = pincstat_int.reset_index(level=0, inplace=False)
# pincstat_int.head(10)
# pincstat_int_idx.columns
pincstat_int_idx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 19 columns):
GEO.id2                     288 non-null object
ACS17 Total                 288 non-null int64
ACS17 Passive Income        280 non-null float64
ACS17 No Passive Iincome    280 non-null float64
ACS16 Total                 288 non-null int64
ACS16 Passive Income        280 non-null float64
ACS16 No Passive Iincome    280 non-null float64
ACS15 Total                 288 non-null int64
ACS15 Passive Income        280 non-null float64
ACS15 No Passive Iincome    280 non-null float64
ACS14 Total                 288 non-null int64
ACS14 Passive Income        279 non-null float64
ACS14 No Passive Iincome    279 non-null float64
ACS13 Total                 288 non-null int64
ACS13 Passive Income        279 non-null float64
ACS13 No Passive Iincome    279 non-null float64
ACS12 Total                 288 non-null int64
ACS12 Passive Income        279 non-null float64
ACS12 No Passive Iincome  
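
The twelve assignments in the normalization cell all follow one pattern, so they could also be generated in a loop. The following is a minimal sketch on a made-up two-survey frame; the `toy` data and the `normalize_shares` helper are hypothetical and not part of the notebook. It also shows that a tract with a zero total produces NaN rather than raising an error.

```python
import numpy as np
import pandas as pd

def normalize_shares(df, surveys, categories):
    """Return a copy of df with each '<survey> <category>' count
    divided by the matching '<survey> Total' column."""
    out = df.copy()
    for survey in surveys:
        total = out[survey + ' Total']
        for cat in categories:
            col = survey + ' ' + cat
            out[col] = out[col].div(total)
    return out

# Made-up counts for two tracts; tract_b has no households in ACS17.
toy = pd.DataFrame(
    {'ACS17 Total': [100, 0],
     'ACS17 Passive Income': [25, 0],
     'ACS17 No Passive Iincome': [75, 0],
     'ACS16 Total': [80, 40],
     'ACS16 Passive Income': [20, 10],
     'ACS16 No Passive Iincome': [60, 30]},
    index=['tract_a', 'tract_b'])

shares = normalize_shares(toy, ['ACS17', 'ACS16'],
                          ['Passive Income', 'No Passive Iincome'])
print(shares)
```

The column spelling `No Passive Iincome` is kept exactly as it appears in the notebook so that the labels still match.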

In [28]:
##### Now we address the missing data.  We first interpolate between valid entries within each row and then backfill any gaps that remain.
# Any rows that still have missing data are dropped.
# Note that the 'Total' columns are part of this selection, so they take part in the row-wise interpolation.
# As will be seen, few if any census tracts are lost through this procedure:
 

pincstat_int_clean= pd.DataFrame(pincstat_int_idx[['ACS17 Total', 'ACS17 Passive Income', 'ACS17 No Passive Iincome',
       'ACS16 Total', 'ACS16 Passive Income', 'ACS16 No Passive Iincome',
       'ACS15 Total', 'ACS15 Passive Income', 'ACS15 No Passive Iincome',
       'ACS14 Total', 'ACS14 Passive Income', 'ACS14 No Passive Iincome',
       'ACS13 Total', 'ACS13 Passive Income', 'ACS13 No Passive Iincome',
       'ACS12 Total', 'ACS12 Passive Income', 'ACS12 No Passive Iincome']])

In [29]:
y=pincstat_int_clean.interpolate(method='linear',axis=1, inplace=False)
z=y.bfill(axis=1)
z['Tract']=pincstat_int_idx.loc[:,'GEO.id2']
zz=z.dropna(axis=0)
pincstat_moment = zz.reset_index(drop=True, inplace=False)
pincstat_moment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 19 columns):
ACS17 Total                 288 non-null float64
ACS17 Passive Income        288 non-null float64
ACS17 No Passive Iincome    288 non-null float64
ACS16 Total                 288 non-null float64
ACS16 Passive Income        288 non-null float64
ACS16 No Passive Iincome    288 non-null float64
ACS15 Total                 288 non-null float64
ACS15 Passive Income        288 non-null float64
ACS15 No Passive Iincome    288 non-null float64
ACS14 Total                 288 non-null float64
ACS14 Passive Income        288 non-null float64
ACS14 No Passive Iincome    288 non-null float64
ACS13 Total                 288 non-null float64
ACS13 Passive Income        288 non-null float64
ACS13 No Passive Iincome    288 non-null float64
ACS12 Total                 288 non-null float64
ACS12 Passive Income        288 non-null float64
ACS12 No Passive Iincome    288 non-null float64
Tract        
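
To make the three cleaning steps concrete, here is a small sketch on invented data (the `toy` frame and its values are assumptions for illustration only). With pandas' default settings, row-wise linear interpolation fills interior gaps and carries the last valid value forward into trailing gaps, backfilling then covers leading gaps, and only rows with no valid value at all are dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[0.10, np.nan, 0.30, 0.40],          # interior gap
     [np.nan, 0.20, 0.25, 0.30],          # leading gap
     [0.50, 0.40, np.nan, np.nan],        # trailing gap
     [np.nan, np.nan, np.nan, np.nan]],   # no valid data at all
    columns=['ACS17', 'ACS16', 'ACS15', 'ACS14'],
    index=['a', 'b', 'c', 'd'])

step1 = toy.interpolate(method='linear', axis=1)  # interior and trailing gaps
step2 = step1.bfill(axis=1)                       # leading gaps
cleaned = step2.dropna(axis=0)                    # rows that are still incomplete

print(cleaned)
```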

In [30]:
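# The x-values 0..5 are paired with the surveys in the order ACS17, ACS16, ..., ACS12,
# so x=0 is the most recent survey and the fitted slope m is measured per step back in time.
numbs=pd.Series([0,1,2,3,4,5])
# First, for Households Receiving Passive Income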
results_HHPINC = {}
for i in range(len(pincstat_moment)):
    testdata = pd.Series(pincstat_moment.loc[i,['ACS17 Passive Income','ACS16 Passive Income','ACS15 Passive Income',
                                                'ACS14 Passive Income','ACS13 Passive Income','ACS12 Passive Income']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHPINC[i]=(m,b,r,p,std_err)
len(results_HHPINC)


# Now, for Households Receiving No Passive Income
results_HHNPINC = {}
for i in range(len(pincstat_moment)):
    testdata = pd.Series(pincstat_moment.loc[i,['ACS17 No Passive Iincome','ACS16 No Passive Iincome', 
       'ACS15 No Passive Iincome', 'ACS14 No Passive Iincome','ACS13 No Passive Iincome', 'ACS12 No Passive Iincome']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHNPINC[i]=(m,b,r,p,std_err)
len(results_HHNPINC)



288
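
As an alternative to looping over tracts, the slopes for every row can be fitted in one call. The sketch below uses `numpy.polyfit` on invented shares; it is an assumption-laden illustration rather than the notebook's method, and it returns only slope and intercept, not the r, p and standard-error values that `stats.linregress` reports. It also shows how the sign of the slope depends on whether x counts back from ACS17 or forward from ACS12.

```python
import numpy as np
import pandas as pd

cols = ['ACS17', 'ACS16', 'ACS15', 'ACS14', 'ACS13', 'ACS12']
toy = pd.DataFrame(
    [[0.30, 0.28, 0.26, 0.24, 0.22, 0.20],   # share rose between ACS12 and ACS17
     [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]],  # share fell between ACS12 and ACS17
    columns=cols)

# 0 = ACS17 ... 5 = ACS12, the same pairing the notebook uses
x = np.arange(6, dtype=float)

# polyfit accepts a 2-D y with one data set per column, hence the transpose
slope, intercept = np.polyfit(x, toy.to_numpy().T, 1)

# Counting x forward from ACS12 instead flips the sign of every slope
slope_forward, _ = np.polyfit(5 - x, toy.to_numpy().T, 1)

print(slope, slope_forward)
```

Under the notebook's pairing, a negative slope therefore corresponds to a share that grew over the survey period.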

### Table B19001 (Household Income in Past Twelve Months)

#### Table B19001 estimates twelve-month household income by housing unit in income brackets ranging from less than $10,000 to $200,000 or more.  Income level is likely relevant to a household's decision whether to patronize restaurants of the type we envision.  For purposes of this study, we look at momentum in two brackets:  households with income between $150,000 and $199,999, and households with income of $200,000 or more.


In [31]:
ACS_list = [('ACS17_5YR_FULL_N',True),('ACS16_5YR_FULL_N',True),('ACS15_5YR_FULL_N',True),('ACS14_5YR_FULL_N',False),
            ('ACS13_5YR_FULL_N',False),('ACS12_5YR_FULL_N',False)]

In [32]:
B19001_current_fields = ['GEO.id2','HD01_VD01','HD01_VD16','HD01_VD17']
current_names = ['Total','150k-199k','200k and Over']
current_names

['Total', '150k-199k', '200k and Over']

In [33]:

pds={}

for i in range(len(ACS_list)):
    subdir=ACS_list[i]
    tr = '/users/richardkornblith/Data_Science/NYCHR/Data_for_NYCHR/'+subdir[0]+'/'
    # Locate this survey's B19001 file by filename pattern
    csv_file = fnmatch.filter(os.listdir(tr),'*9001*with_ann.csv')
    full_path = tr+csv_file[0]
    cols = B19001_current_fields
    col_names = current_names
    df_t = pd.read_csv(full_path, index_col='GEO.id2',usecols=cols)
    # Prefix each column with the survey label, e.g. 'ACS17 Total'
    df_t.columns = [ACS_list[i][0][0:5]+" "+col_names[j] for j in range(len(col_names))]
    # Drop the row labelled 'Id2', which holds descriptive labels rather than data
    df_t.drop(labels='Id2', inplace=True)
    pds[ACS_list[i][0]]=df_t
    
# pds
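
The loading loop depends on two details of the input files: the first data row carries descriptive labels (index label `Id2`) and must be dropped, and the remaining values arrive as strings. The sketch below reproduces this with an in-memory stand-in; the CSV text, its simplified labels, and all counts are invented for illustration, and only the field codes are taken from the notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for one '*with_ann.csv' file: a row of field codes,
# a row of descriptive labels, then the data rows.
raw = io.StringIO(
    "GEO.id,GEO.id2,HD01_VD01,HD01_VD16,HD01_VD17\n"
    "Id,Id2,Estimate; Total:,Estimate; 150k to 199k,Estimate; 200k or more\n"
    "1400000US36061000100,36061000100,0,0,0\n"
    "1400000US36061000201,36061000201,1200,60,90\n"
)

fields = ['GEO.id2', 'HD01_VD01', 'HD01_VD16', 'HD01_VD17']
names = ['Total', '150k-199k', '200k and Over']

df = pd.read_csv(raw, index_col='GEO.id2', usecols=fields)
df.columns = ['ACS17 ' + n for n in names]   # rename by position
df = df.drop(labels='Id2')                   # remove the label row
df = df.astype(int)                          # strings -> integer counts

print(df)
```

`read_csv` returns the selected columns in file order regardless of the order given in `usecols`, so renaming by position relies on the codes appearing in the file in the same order as the names list.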


In [34]:
gincstat=pd.concat([pds[name] for name, _ in ACS_list],axis=1)
gincstat_int=gincstat.astype(int)
gincstat_int.head(10)
gincstat_int.columns
gincstat_int.info()

<class 'pandas.core.frame.DataFrame'>
Index: 288 entries, 36061000100 to 36061031900
Data columns (total 18 columns):
ACS17 Total            288 non-null int64
ACS17 150k-199k        288 non-null int64
ACS17 200k and Over    288 non-null int64
ACS16 Total            288 non-null int64
ACS16 150k-199k        288 non-null int64
ACS16 200k and Over    288 non-null int64
ACS15 Total            288 non-null int64
ACS15 150k-199k        288 non-null int64
ACS15 200k and Over    288 non-null int64
ACS14 Total            288 non-null int64
ACS14 150k-199k        288 non-null int64
ACS14 200k and Over    288 non-null int64
ACS13 Total            288 non-null int64
ACS13 150k-199k        288 non-null int64
ACS13 200k and Over    288 non-null int64
ACS12 Total            288 non-null int64
ACS12 150k-199k        288 non-null int64
ACS12 200k and Over    288 non-null int64
dtypes: int64(18)
memory usage: 42.8+ KB


In [35]:
### Now we normalize the data by dividing the number of households in each income bracket of interest here
# by the total number of households reported for the tract in the respective survey.  Where a tract
# reports no households in a particular survey, the division by zero yields a missing-data 'NaN' entry,
# which is addressed below.

gincstat_int['ACS17 150k-199k']= gincstat_int['ACS17 150k-199k'].div(gincstat_int['ACS17 Total'],axis=0)
gincstat_int['ACS17 200k and Over']= gincstat_int['ACS17 200k and Over'].div(gincstat_int['ACS17 Total'],axis=0)
gincstat_int['ACS16 150k-199k']= gincstat_int['ACS16 150k-199k'].div(gincstat_int['ACS16 Total'],axis=0)
gincstat_int['ACS16 200k and Over']= gincstat_int['ACS16 200k and Over'].div(gincstat_int['ACS16 Total'],axis=0)
gincstat_int['ACS15 150k-199k']= gincstat_int['ACS15 150k-199k'].div(gincstat_int['ACS15 Total'],axis=0)
gincstat_int['ACS15 200k and Over']= gincstat_int['ACS15 200k and Over'].div(gincstat_int['ACS15 Total'],axis=0)
gincstat_int['ACS14 150k-199k']= gincstat_int['ACS14 150k-199k'].div(gincstat_int['ACS14 Total'],axis=0)
gincstat_int['ACS14 200k and Over']= gincstat_int['ACS14 200k and Over'].div(gincstat_int['ACS14 Total'],axis=0)
gincstat_int['ACS13 150k-199k']= gincstat_int['ACS13 150k-199k'].div(gincstat_int['ACS13 Total'],axis=0)
gincstat_int['ACS13 200k and Over']= gincstat_int['ACS13 200k and Over'].div(gincstat_int['ACS13 Total'],axis=0)
gincstat_int['ACS12 150k-199k']= gincstat_int['ACS12 150k-199k'].div(gincstat_int['ACS12 Total'],axis=0)
gincstat_int['ACS12 200k and Over']= gincstat_int['ACS12 200k and Over'].div(gincstat_int['ACS12 Total'],axis=0)

gincstat_int_idx = gincstat_int.reset_index(level=0, inplace=False)
# gincstat_int.head(10)
# gincstat_int_idx.columns
gincstat_int_idx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 19 columns):
GEO.id2                288 non-null object
ACS17 Total            288 non-null int64
ACS17 150k-199k        280 non-null float64
ACS17 200k and Over    280 non-null float64
ACS16 Total            288 non-null int64
ACS16 150k-199k        280 non-null float64
ACS16 200k and Over    280 non-null float64
ACS15 Total            288 non-null int64
ACS15 150k-199k        280 non-null float64
ACS15 200k and Over    280 non-null float64
ACS14 Total            288 non-null int64
ACS14 150k-199k        279 non-null float64
ACS14 200k and Over    279 non-null float64
ACS13 Total            288 non-null int64
ACS13 150k-199k        279 non-null float64
ACS13 200k and Over    279 non-null float64
ACS12 Total            288 non-null int64
ACS12 150k-199k        279 non-null float64
ACS12 200k and Over    279 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 42.8+ KB


In [38]:
##### Now we address the missing data.  We first interpolate between valid entries within each row and then backfill any gaps that remain.
# Any rows that still have missing data are dropped.
# As will be seen, relatively few census tracts are lost through this procedure:
 

gincstat_int_clean= pd.DataFrame(gincstat_int_idx[['ACS17 150k-199k', 'ACS17 200k and Over','ACS16 150k-199k', 'ACS16 200k and Over',
                                               'ACS15 150k-199k', 'ACS15 200k and Over','ACS14 150k-199k', 'ACS14 200k and Over',
                                               'ACS13 150k-199k', 'ACS13 200k and Over','ACS12 150k-199k', 'ACS12 200k and Over']])
y=gincstat_int_clean.interpolate(method='linear',axis=1, inplace=False)
z=y.bfill(axis=1)
z['Tract']=gincstat_int_idx.loc[:,'GEO.id2']
zz=z.dropna(axis=0)
gincstat_moment = zz.reset_index(drop=True, inplace=False)
gincstat_moment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 13 columns):
ACS17 150k-199k        280 non-null float64
ACS17 200k and Over    280 non-null float64
ACS16 150k-199k        280 non-null float64
ACS16 200k and Over    280 non-null float64
ACS15 150k-199k        280 non-null float64
ACS15 200k and Over    280 non-null float64
ACS14 150k-199k        280 non-null float64
ACS14 200k and Over    280 non-null float64
ACS13 150k-199k        280 non-null float64
ACS13 200k and Over    280 non-null float64
ACS12 150k-199k        280 non-null float64
ACS12 200k and Over    280 non-null float64
Tract                  280 non-null object
dtypes: float64(12), object(1)
memory usage: 28.5+ KB
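
The `info()` output shows that 280 of the 288 tracts survive the cleaning step. One way to see which tracts were dropped is to compare the `Tract` identifiers before and after. The sketch below does this on an invented three-tract frame; the shares and the variable names `before`, `after` and `dropped` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

before = pd.DataFrame({
    'Tract': ['36061000100', '36061000201', '36061000202'],
    'ACS17 150k-199k': [np.nan, 0.05, 0.07],
    'ACS16 150k-199k': [np.nan, 0.04, np.nan],
})
value_cols = ['ACS17 150k-199k', 'ACS16 150k-199k']

# Same sequence as the notebook: interpolate, backfill, reattach Tract, drop
filled = before[value_cols].interpolate(method='linear', axis=1).bfill(axis=1)
filled['Tract'] = before['Tract']
after = filled.dropna(axis=0).reset_index(drop=True)

# Tracts present before cleaning but absent afterwards
dropped = sorted(set(before['Tract']) - set(after['Tract']))
print(dropped)
```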


In [39]:
numbs=pd.Series([0,1,2,3,4,5])
# First, for Households of 150k-199k Income
results_HHMIDI = {}
for i in range(len(gincstat_moment)):
    testdata = pd.Series(gincstat_moment.loc[i,['ACS17 150k-199k','ACS16 150k-199k','ACS15 150k-199k', 'ACS14 150k-199k','ACS13 150k-199k',
                                              'ACS12 150k-199k']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHMIDI[i]=(m,b,r,p,std_err)
len(results_HHMIDI)


# Now, for Households of 200k and over Income
results_HHHII = {}
for i in range(len(gincstat_moment)):
    testdata = pd.Series(gincstat_moment.loc[i,['ACS17 200k and Over','ACS16 200k and Over', 
       'ACS15 200k and Over', 'ACS14 200k and Over','ACS13 200k and Over','ACS12 200k and Over']])
    m,b,r,p,std_err = stats.linregress(numbs.astype(float), testdata.astype(float))
    results_HHHII[i]=(m,b,r,p,std_err)
len(results_HHHII)



280
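
The regression results are held in dictionaries keyed by row position. For inspection or sorting it may be convenient to turn such a dictionary into a table alongside the tract identifiers. The sketch below assumes the same (m, b, r, p, std_err) ordering as the notebook; the numbers, the column names and the `summary` frame are hypothetical.

```python
import pandas as pd

# Invented regression results keyed by row position, stored in the
# (m, b, r, p, std_err) order used in the notebook.
results = {0: (0.012, 0.05, 0.91, 0.011, 0.003),
           1: (-0.004, 0.10, -0.40, 0.430, 0.005)}
tracts = pd.Series(['36061000201', '36061000202'], name='Tract')

summary = pd.DataFrame.from_dict(
    results, orient='index',
    columns=['slope', 'intercept', 'r', 'p', 'std_err'])
summary.insert(0, 'Tract', tracts)   # aligned on the shared 0..n-1 index

print(summary.sort_values('slope'))
```

Because the cleaned frames are re-indexed from zero with `reset_index(drop=True)`, the dictionary keys line up with the rows of the `Tract` column.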