# Data Science with Python, Part 1

Data science is a broad term.  This is the definition on Wikipedia:

"Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD)."

We do not go deeply into statistics and machine learning in this course, but focus more on the initial part of this definition -- extracting insights from structured and unstructured data.

Chapter 9 in Python for Data Analysis demonstrates a variety of methods to analyze data via data aggregation and grouping operations. Those are the focus of this session.

We start by experimenting with the datasets we used in the data visualization session. For convenience we include below the data dictionary for sf1.

In [3]:
import pandas as pd

sf1store = pd.HDFStore('bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100
sf1['pop_sqmi'] = (sf1['P0010001'] / (sf1['arealand'] / 2589988))
sf1 = sf1[sf1['P0030001']>0]
sf1[:5]

,logrecno,blockfips,state,county,tract,blkgrp,block,arealand,P0010001,P0020001,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
1,26,60014271001001,6,1,427100,1,1001,79696,113,113,...,0,0,0,80.0,13.274336,5.309735,78.761062,1.769912,0.0,3672.312839
3,28,60014271001003,6,1,427100,1,1003,19546,29,29,...,0,0,2,70.0,13.793103,27.586207,37.931034,24.137931,23.076923,3842.712166
4,29,60014271001004,6,1,427100,1,1004,14364,26,26,...,0,0,0,75.0,0.0,38.461538,34.615385,0.0,0.0,4688.087441
6,31,60014271001006,6,1,427100,1,1006,1281,2,2,...,0,0,0,100.0,0.0,50.0,0.0,100.0,0.0,4043.697112
7,32,60014271001007,6,1,427100,1,1007,19020,30,30,...,0,0,0,33.333333,0.0,43.333333,50.0,0.0,10.0,4085.154574


## Groupby and Aggregation Operations

Groupby is a powerful method in pandas that follows the split-apply-combine approach to data.  As shown in Figure 9-1 in the context of a sum operation, the data is first split into groups that share the same key values.  Then an operation, in this case a sum, is applied to each group.  Then the results are combined.

The built-in aggregation methods available for groupby operations include:
* count
* sum
* mean
* median
* std, var
* min, max
* first, last

You can also apply your own functions as aggregation methods.

![Groupby Operations](groupby.png "Groupby")
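
To make the three steps concrete, here is a minimal sketch on a made-up six-row table (the `toy` frame and the `value_range` function are illustrative, not part of sf1). It performs split, apply and combine by hand, then shows that a built-in aggregation gives the same answer, and finishes with a custom aggregation function.

```python
import pandas as pd

# Toy data (hypothetical values, not from sf1)
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': [0, 5, 10, 5, 10, 15],
})

# Split: one sub-Series per key value
pieces = {key: group['data'] for key, group in toy.groupby('key')}

# Apply: sum each piece. Combine: collect the results into one Series.
by_hand = pd.Series({key: piece.sum() for key, piece in pieces.items()})

# The same thing in one step with a built-in aggregation
built_in = toy.groupby('key')['data'].sum()

# A custom aggregation: the range (max minus min) within each group
def value_range(values):
    return values.max() - values.min()

custom = toy.groupby('key')['data'].agg(value_range)

print(by_hand)
print(built_in)
print(custom)
```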

Let's apply this approach to computing total population in each county in our dataset.  We can do this in two steps to help explain what is happening.  First we create a groupby object, using county codes to group all the census blocks in sf1 into groups that share the same county code.

In [4]:
grouped = sf1['P0010001'].groupby(sf1['county'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x114d03f98>

Now that we have this grouping object that represents the **split** part of the workflow in the figure above, we can **apply** operations and **combine** the results using methods like sum:

In [5]:
grouped.sum()

county
001    1510271
013    1049025
041     252409
055     136484
075     805235
081     718451
085    1781642
095     413344
097     483878
Name: P0010001, dtype: int64
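
The grouping above selects a column and then groups it by another Series. A common alternative spelling groups the DataFrame by column name and selects the column afterwards. The sketch below, on a small made-up `blocks` frame, suggests the two forms give the same result; which one to use is mostly a matter of taste.

```python
import pandas as pd

# Hypothetical stand-in for a few sf1 rows
blocks = pd.DataFrame({
    'county': ['001', '001', '013', '013', '013'],
    'P0010001': [113, 29, 26, 2, 30],
})

# Style used above: group a Series by another Series
a = blocks['P0010001'].groupby(blocks['county']).sum()

# Equivalent: group the DataFrame by column name, then select the column
b = blocks.groupby('county')['P0010001'].sum()

print(a)
print(b)
```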

Let's add county names to the dataframe so we get more readable output, and rerun this aggregation.

In [6]:
county_names = {'001': 'Alameda', '013': 'Contra Costa', '041': 'Marin', '055': 'Napa', '075': 'San Francisco',
                '081': 'San Mateo', '085': 'Santa Clara', '095': 'Solano', '097': 'Sonoma'}

Let's add county_name as a column in the dataframe.  It would be easy to append it as the last column with a merge, but let's see how to insert it at a specified location so that it is easier to read when we browse the data.  The first argument to insert is a zero-based position, so position 4 places the new column immediately after county and before tract:

In [7]:
sf1.insert(4, 'county_name', sf1['county'].replace(county_names))
sf1[:5]

,logrecno,blockfips,state,county,county_name,tract,blkgrp,block,arealand,P0010001,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
1,26,60014271001001,6,1,Alameda,427100,1,1001,79696,113,...,0,0,0,80.0,13.274336,5.309735,78.761062,1.769912,0.0,3672.312839
3,28,60014271001003,6,1,Alameda,427100,1,1003,19546,29,...,0,0,2,70.0,13.793103,27.586207,37.931034,24.137931,23.076923,3842.712166
4,29,60014271001004,6,1,Alameda,427100,1,1004,14364,26,...,0,0,0,75.0,0.0,38.461538,34.615385,0.0,0.0,4688.087441
6,31,60014271001006,6,1,Alameda,427100,1,1006,1281,2,...,0,0,0,100.0,0.0,50.0,0.0,100.0,0.0,4043.697112
7,32,60014271001007,6,1,Alameda,427100,1,1007,19020,30,...,0,0,0,33.333333,0.0,43.333333,50.0,0.0,10.0,4085.154574
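
As a small illustration of how `insert` counts positions, the sketch below uses a made-up three-column frame and a deliberately incomplete lookup dictionary. It also contrasts `replace`, which leaves unmatched codes untouched, with `map`, which turns them into NaN. The code `'999'` is a hypothetical value chosen to have no match.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['06', '06', '06'],
    'county': ['001', '013', '999'],   # '999' is a made-up code with no name
    'tract': [427100, 334004, 100],
})
names = {'001': 'Alameda', '013': 'Contra Costa'}

# replace keeps values that are not in the dictionary
replaced = toy['county'].replace(names)

# map returns NaN for values that are not in the dictionary
mapped = toy['county'].map(names)

# loc=2 is a zero-based position: the new column becomes the third column
toy.insert(2, 'county_name', mapped)

print(list(toy.columns))
print(replaced.tolist())
```

Using `map` can make unmatched codes easier to spot, since they show up as missing values instead of passing through silently.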


Now we can print the results of summing population by county_name:

In [8]:
print('Total Population by County:')
print(sf1['P0010001'].groupby(sf1['county_name']).sum())

Total Population by County:
county_name
Alameda          1510271
Contra Costa     1049025
Marin             252409
Napa              136484
San Francisco     805235
San Mateo         718451
Santa Clara      1781642
Solano            413344
Sonoma            483878
Name: P0010001, dtype: int64


We might want to capture the result in a DataFrame if we want to use it in other processing, like merging the results to the original DataFrame.

In [9]:
county_pop = sf1['P0010001'].groupby(sf1['county_name']).sum().to_frame(name='total_population')
county_pop

county_name,total_population
Alameda,1510271
Contra Costa,1049025
Marin,252409
Napa,136484
San Francisco,805235
San Mateo,718451
Santa Clara,1781642
Solano,413344
Sonoma,483878


Here we merge the county total population with sf1 and create a new DataFrame.

In [10]:
sf2 = pd.merge(sf1,county_pop, left_on='county_name', right_index=True)
sf2[:5]

,logrecno,blockfips,state,county,county_name,tract,blkgrp,block,arealand,P0010001,...,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi,total_population
1,26,60014271001001,6,1,Alameda,427100,1,1001,79696,113,...,0,0,80.0,13.274336,5.309735,78.761062,1.769912,0.0,3672.312839,1510271
3,28,60014271001003,6,1,Alameda,427100,1,1003,19546,29,...,0,2,70.0,13.793103,27.586207,37.931034,24.137931,23.076923,3842.712166,1510271
4,29,60014271001004,6,1,Alameda,427100,1,1004,14364,26,...,0,0,75.0,0.0,38.461538,34.615385,0.0,0.0,4688.087441,1510271
6,31,60014271001006,6,1,Alameda,427100,1,1006,1281,2,...,0,0,100.0,0.0,50.0,0.0,100.0,0.0,4043.697112,1510271
7,32,60014271001007,6,1,Alameda,427100,1,1007,19020,30,...,0,0,33.333333,0.0,43.333333,50.0,0.0,10.0,4085.154574,1510271
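
The aggregate-then-merge pattern above attaches each county total to every block in that county. A sketch on made-up data shows the same pattern at small scale, next to an alternative that uses `transform('sum')` to get the group total aligned with the original rows in one step. The `blocks` frame and its values are hypothetical.

```python
import pandas as pd

blocks = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda', 'Marin'],
    'P0010001': [113, 29, 40],
})

# Route 1 (as above): aggregate, then merge the totals back on the group key
totals = (blocks.groupby('county_name')['P0010001']
                .sum()
                .to_frame(name='total_population'))
merged = pd.merge(blocks, totals, left_on='county_name', right_index=True)

# Route 2: transform('sum') returns one value per original row
blocks['total_population'] = (blocks.groupby('county_name')['P0010001']
                                    .transform('sum'))

print(merged)
print(blocks)
```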


Let's say we wanted to compute the population per square mile by county using the groupby method.  We could create another dataframe with total land area by county and then divide total population by total area.  Since arealand is measured in square meters, we multiply by 2589988.11, the number of square meters in a square mile.

In [11]:
county_land = sf1['arealand'].groupby(sf1['county_name']).sum().to_frame(name='total_area')
county_land

county_name,total_area
Alameda,1190434861
Contra Costa,1095085515
Marin,1046029032
Napa,1556005658
San Francisco,95535946
San Mateo,884654868
Santa Clara,2378681334
Solano,1224964331
Sonoma,3206326062


In [12]:
county_pop_per_sqmi = county_pop['total_population'] / county_land['total_area'] * 2589988.11
county_pop_per_sqmi

county_name
Alameda           3285.844577
Contra Costa      2481.050329
Marin              624.969565
Napa               227.179082
San Francisco    21829.993454
San Mateo         2103.396042
Santa Clara       1939.911635
Solano             873.948750
Sonoma             390.864261
dtype: float64

Or of course we could have done this whole thing in one line:

In [13]:
sf1['P0010001'].groupby(sf1['county_name']).sum() / sf1['arealand'].groupby(sf1['county_name']).sum() * 2589988.11

county_name
Alameda           3285.844577
Contra Costa      2481.050329
Marin              624.969565
Napa               227.179082
San Francisco    21829.993454
San Mateo         2103.396042
Santa Clara       1939.911635
Solano             873.948750
Sonoma             390.864261
dtype: float64
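
Note that a county-wide density computed from summed population and summed area is not the same quantity as the mean of block-level densities. The following sketch uses two made-up blocks in a hypothetical county `'X'` to show the difference, and also shows how both sums can come from a single groupby. The constant name `SQM_PER_SQMI` is an illustrative choice.

```python
import pandas as pd

SQM_PER_SQMI = 2589988.11  # square meters in one square mile

# Two hypothetical blocks: a small dense one and a large sparse one
toy = pd.DataFrame({
    'county_name': ['X', 'X'],
    'P0010001': [100, 100],                        # population
    'arealand': [SQM_PER_SQMI, 9 * SQM_PER_SQMI],  # 1 sq mi and 9 sq mi
})
toy['pop_sqmi'] = toy['P0010001'] / (toy['arealand'] / SQM_PER_SQMI)

# One groupby, two sums, then divide: county-wide density
sums = toy.groupby('county_name')[['P0010001', 'arealand']].sum()
county_density = sums['P0010001'] / (sums['arealand'] / SQM_PER_SQMI)

# Mean of the block-level densities: a different quantity
mean_block_density = toy.groupby('county_name')['pop_sqmi'].mean()

print(county_density)       # 200 people over 10 sq mi
print(mean_block_density)   # average of 100 and about 11.1
```

The same distinction applies to any ratio, such as a percentage, that is computed at the block level and then summarized by county.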

## Your turn to practice:

Count the number of census blocks per county.

Calculate total households per county.

Calculate percent renters by county. (Careful not to calculate the mean percent rental across blocks in a county)

Calculate percent vacant by county.

Calculate mean, min and max pop_sqmi (at the block level) by county.

Calculate the 90th percentile of pop_sqmi (at the block level) by county.

## Transforming Data with Groupby

In some cases you may want to apply a function to your data, by group.  An example would be to normalize a column by the mean of each group.  Say we wanted to subtract the mean population density of each county from the population density of each census block. We could write a function to subtract the mean from each value, and then use the transform operation to apply this to each group:

In [14]:
def demean(arr):
    return arr - arr.mean()

Now we can apply this transformation to columns in our dataframe.  As examples, let's 'demean' the pop_sqmi and pct_rent columns, subtracting the county-wide mean of these values from the block-specific values, so that the result is transformed to have a mean of zero within each county.

To check the results, we print the means per county, then the original values for the first 5 rows, then the transformed results.  Each transformed value should equal the block value minus the corresponding county mean.

In [71]:
normalized = sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).transform(demean)
print(sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).mean())
print(sf1[['county_name','pop_sqmi', 'pct_rent']][:5])
print(normalized[:5])

                   pop_sqmi   pct_rent
county_name                           
Alameda        13753.632044  37.484398
Contra Costa    8081.846244  27.223329
Marin           6338.936151  33.169454
Napa            6245.985021  32.903209
San Francisco  28395.093537  51.927943
San Mateo      11011.638488  30.274104
Santa Clara    10598.597545  29.811290
Solano          7203.793038  34.598761
Sonoma          5415.876988  35.121698
  county_name     pop_sqmi    pct_rent
1     Alameda  3672.312839   80.000000
3     Alameda  3842.712166   70.000000
4     Alameda  4688.087441   75.000000
6     Alameda  4043.697112  100.000000
7     Alameda  4085.154574   33.333333
       pop_sqmi   pct_rent
1 -10081.319205  42.515602
3  -9910.919878  32.515602
4  -9065.544603  37.515602
6  -9709.934932  62.515602
7  -9668.477470  -4.151064
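
The key property of transform is that it returns one value per input row, in the original row order, instead of one value per group. The sketch below checks this on a made-up five-row frame and shows an alternative that subtracts a broadcast group mean obtained with `transform('mean')`. The `toy` data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'county_name': ['A', 'A', 'B', 'B', 'B'],
    'pop_sqmi': [10.0, 30.0, 1.0, 2.0, 6.0],
})

def demean(arr):
    return arr - arr.mean()

# One output value per input row, aligned with the original index
demeaned = toy.groupby('county_name')['pop_sqmi'].transform(demean)

# Alternative: broadcast each group's mean to its rows, then subtract
group_mean = toy.groupby('county_name')['pop_sqmi'].transform('mean')
alt = toy['pop_sqmi'] - group_mean

print(demeaned.tolist())
print(alt.tolist())
```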


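We can merge these transformed results onto the original DataFrame and compare the county means of the original and transformed variables.  Because both DataFrames contain columns named pop_sqmi and pct_rent, merge appends _x to the original columns and _y to the transformed ones.  The transformed means should be zero up to floating-point rounding error.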

In [88]:
sf2 = pd.merge(sf1,normalized, left_index=True, right_index=True)

sf2.groupby('county_name')[['pop_sqmi_x', 'pop_sqmi_y', 'pct_rent_x', 'pct_rent_y']].mean()

county_name,pop_sqmi_x,pop_sqmi_y,pct_rent_x,pct_rent_y
Alameda,13753.632044,-7.944529e-11,37.484398,6.856168e-14
Contra Costa,8081.846244,-7.040712e-12,27.223329,-3.800049e-13
Marin,6338.936151,1.461176e-12,33.169454,9.143583e-15
Napa,6245.985021,-1.304992e-13,32.903209,-1.685373e-14
San Francisco,28395.093537,6.833338e-12,51.927943,-1.349699e-15
San Mateo,11011.638488,8.19233e-12,30.274104,-7.197617e-14
Santa Clara,10598.597545,5.34788e-11,29.81129,-1.21129e-13
Solano,7203.793038,9.815285e-12,34.598761,-5.029091e-14
Sonoma,5415.876988,1.308518e-11,35.121698,-5.9376e-14
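
The `_x` and `_y` endings in the column names come from the default suffixes that merge adds when both inputs share a column name. As a hedged sketch on made-up data, the `suffixes` argument can be used to choose more descriptive names. The suffix `'_demeaned'` is an illustrative choice, not something used elsewhere in this notebook.

```python
import pandas as pd

left = pd.DataFrame({'pop_sqmi': [10.0, 30.0]}, index=[1, 3])
right = pd.DataFrame({'pop_sqmi': [-10.0, 10.0]}, index=[1, 3])

# Default: overlapping names get '_x' (left) and '_y' (right)
default = pd.merge(left, right, left_index=True, right_index=True)

# Explicit suffixes make the result self-describing
named = pd.merge(left, right, left_index=True, right_index=True,
                 suffixes=('', '_demeaned'))

print(list(default.columns))
print(list(named.columns))
```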


Apply is a method we have learned previously, which allows us to apply a function to each row in a DataFrame.  We can also combine apply with groupby to apply a function to each group.  For example, the function 'top' sorts a DataFrame by one column and returns its last n rows, that is, the rows that sort highest on that column.  We provide defaults for the number of rows and for the column to sort on:

In [96]:
def top(df, n=5, column='pop_sqmi'):
    return df.sort_values(by=column)[-n:]

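Calling top on the full dataset with n=10 and column='pct_rent' overrides both defaults and is meant to return the 10 blocks in the region with the highest percentage of renters.  Note, however, that sort_values places missing values (NaN) last by default, so the rows returned below are blocks where pct_rent is missing rather than the blocks with the highest rental share.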

In [99]:
top(sf1, n=10, column='pct_rent')

,logrecno,blockfips,state,county,county_name,tract,blkgrp,block,arealand,P0010001,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
106284,718740,60971530013004,6,97,Sonoma,153001,3,3004,67428,2,...,0,0,0,,0.0,0.0,100.0,50.0,,76.822329
106359,718820,60971530024009,6,97,Sonoma,153002,4,4009,114793,1,...,0,0,0,,0.0,0.0,100.0,0.0,,22.562247
106628,719120,60971531041005,6,97,Sonoma,153104,1,1005,29617,3,...,0,0,0,,0.0,33.333333,66.666667,33.333333,,262.348111
106651,719144,60971531042008,6,97,Sonoma,153104,2,2008,16533,3,...,0,0,0,,0.0,0.0,66.666667,66.666667,,469.966975
106715,719215,60971533001000,6,97,Sonoma,153300,1,1000,105712,1,...,0,0,0,,0.0,0.0,100.0,0.0,,24.500416
107509,720093,60971530023027,6,97,Sonoma,153002,3,3027,4076,1,...,0,0,0,,0.0,0.0,100.0,0.0,,635.423945
108689,721415,60971502042007,6,97,Sonoma,150204,2,2007,18214,38,...,0,0,0,,2.631579,0.0,97.368421,5.263158,,5403.510706
108848,721586,60971501003012,6,97,Sonoma,150100,3,3012,457268,3,...,0,0,1,,0.0,0.0,100.0,100.0,100.0,16.992145
109112,721873,60971505001009,6,97,Sonoma,150500,1,1009,44620,80,...,0,0,0,,6.25,8.75,82.5,3.75,,4643.636038
109122,721883,60971505001019,6,97,Sonoma,150500,1,1019,26054,551,...,0,0,0,,5.989111,3.266788,89.110708,4.718693,,54774.061104
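
In the output above, pct_rent is empty in every returned row, which is what slicing the tail of a sort produces when missing values are sorted last. The sketch below reproduces this on a made-up four-row frame and shows two possible ways around it: a hypothetical variant called `top_valid` that drops missing values first, and the built-in `nlargest` method, which skips missing values and returns rows in descending order.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'block': ['a', 'b', 'c', 'd'],
    'pct_rent': [80.0, np.nan, 100.0, 33.3],
})

# sort_values puts NaN last by default, so slicing the tail picks it up
naive = toy.sort_values(by='pct_rent')[-2:]

def top_valid(df, n=5, column='pop_sqmi'):
    """Hypothetical variant of top() that ignores rows where `column` is missing."""
    return df.dropna(subset=[column]).sort_values(by=column)[-n:]

fixed = top_valid(toy, n=2, column='pct_rent')

# Built-in alternative: largest values first, NaN excluded
largest = toy.nlargest(2, 'pct_rent')

print(naive)
print(fixed)
print(largest)
```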


Below we pass top to apply on a groupby object, keeping the defaults for n and column.  pandas calls the function once per county and concatenates the results, producing the 5 blocks with the highest pop_sqmi in each county (only the first two counties are shown here).

In [105]:
sf1.groupby('county_name').apply(top)

county_name,,logrecno,blockfips,state,county,county_name,tract,blkgrp,block,arealand,P0010001,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
Alameda,9702,10388,60014369001004,6,1,Alameda,436900,1,1004,2148,431,...,0,0,0,98.876404,3.712297,0.0,24.593968,91.87935,1.111111,519685.7
Alameda,9130,9766,60014351042022,6,1,Alameda,435104,2,2022,1013,246,...,0,0,0,100.0,2.03252,25.609756,22.357724,42.682927,3.225806,628960.6
Alameda,19003,20241,60014028002013,6,1,Alameda,402800,2,2013,941,370,...,5,0,1,100.0,39.72973,14.054054,36.486486,8.378378,17.307692,1018380.0
Alameda,4786,5117,60014419251004,6,1,Alameda,441925,1,1004,1240,801,...,5,1,3,100.0,4.11985,64.54432,22.721598,5.617978,4.591837,1673049.0
Alameda,8494,9073,60014311002003,6,1,Alameda,431100,2,2003,320,392,...,0,0,1,100.0,20.663265,10.714286,34.438776,37.244898,8.72093,3172735.0
Contra Costa,27904,45921,60133340042011,6,13,Contra Costa,334004,2,2011,799,60,...,0,0,0,38.461538,5.0,21.666667,61.666667,11.666667,13.333333,194492.2
Contra Costa,27062,45016,60133150001170,6,13,Contra Costa,315000,1,1170,932,115,...,0,0,1,18.421053,1.73913,11.304348,77.391304,7.826087,7.317073,319580.1
Contra Costa,25628,43440,60133131021012,6,13,Contra Costa,313102,1,1012,4525,568,...,0,0,2,100.0,43.661972,3.34507,28.34507,31.338028,3.252033,325107.9
Contra Costa,27063,45017,60133150001190,6,13,Contra Costa,315000,1,1190,900,126,...,1,0,0,12.5,3.174603,15.079365,65.873016,19.84127,5.882353,362598.3
Contra Costa,37338,56116,60133551151034,6,13,Contra Costa,355115,1,1034,2386,336,...,0,0,0,100.0,12.797619,36.607143,36.904762,9.22619,3.726708,364725.9


Here we pass arguments to the function to set n and the column to select the top value from.

In [111]:
sf1.groupby('county_name').apply(top, n=1, column='arealand')

county_name,,logrecno,blockfips,state,county,county_name,tract,blkgrp,block,arealand,P0010001,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
Alameda,12544,13434,60014301021000,6,1,Alameda,430102,1,1000,31296322,166,...,0,0,3,45.283019,1.204819,3.614458,84.337349,6.024096,7.017544,13.737653
Contra Costa,37789,56579,60133551121158,6,13,Contra Costa,355112,1,1158,19441925,141,...,2,0,3,23.809524,2.12766,0.70922,75.886525,24.822695,12.5,18.783547
Marin,42773,311828,60411322003008,6,41,Marin,132200,3,3008,48884156,112,...,14,6,0,91.666667,0.0,0.892857,64.285714,66.071429,36.842105,5.934002
Napa,47066,355191,60552018001000,6,55,Napa,201800,1,1000,93228090,13,...,4,1,2,83.333333,0.0,0.0,92.307692,0.0,53.846154,0.361156
San Francisco,56139,593833,60750604001013,6,75,San Francisco,60400,1,1013,1036262,3,...,0,0,0,,0.0,33.333333,66.666667,0.0,,7.498069
San Mateo,57660,622919,60816138001035,6,81,San Mateo,613800,1,1035,28976148,105,...,0,0,4,55.172414,0.952381,0.0,47.619048,60.952381,19.444444,9.385262
Santa Clara,66451,644610,60855135001202,6,85,Santa Clara,513500,1,1202,277483160,62,...,3,0,1,16.666667,3.225806,0.0,82.258065,1.612903,14.285714,0.578699
Solano,92499,703629,60952527026009,6,95,Solano,252702,6,6009,30318073,14,...,1,0,0,33.333333,0.0,35.714286,64.285714,0.0,14.285714,1.195981
Sonoma,99632,711527,60971542022000,6,97,Sonoma,154202,2,2000,35658559,106,...,14,0,1,36.585366,0.0,0.0,89.622642,14.150943,28.070175,7.699098
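
The results above are indexed by two levels: the county name and the original row label of each selected block. A minimal sketch on made-up data shows how keyword arguments given to apply are forwarded to the function and what the two-level index looks like. As an assumption made for portability across pandas versions, this sketch selects the data columns before calling apply, so the grouping column is not part of each piece.

```python
import pandas as pd

toy = pd.DataFrame({
    'county_name': ['A', 'A', 'A', 'B', 'B'],
    'block': [1, 2, 3, 4, 5],
    'pop_sqmi': [10.0, 30.0, 20.0, 5.0, 7.0],
})

def top(df, n=5, column='pop_sqmi'):
    return df.sort_values(by=column)[-n:]

# Keyword arguments after the function name are passed through to top()
result = (toy.groupby('county_name')[['block', 'pop_sqmi']]
             .apply(top, n=1, column='pop_sqmi'))

print(result)
print(result.index.tolist())
```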


## Experimenting with Rental Listings Merged with SF1

Let's read the geocoded rental listings for the Bay Area to begin.  We will make sure the fips_block column is read as a string dtype so we can merge properly with the census data.  It has leading zeros and is a string in the census data.

In [125]:
rentals = pd.read_csv(
    'sfbay_geocoded.csv',
    usecols=['rent', 'bedrooms', 'sqft', 'fips_block', 'longitude', 'latitude'],
    dtype={'fips_block': str},  # keep the leading zeros in the block code
)
rentals[:5]

,rent,bedrooms,sqft,longitude,latitude,fips_block
0,4500.0,2.0,1200.0,-122.4383,37.745,60750216002015
1,2650.0,2.0,1040.0,-122.008131,37.353699,60855085053008
2,3100.0,2.0,1000.0,-122.439743,37.731584,60750311005011
3,1850.0,1.0,792.0,-122.234294,37.491715,60816101001026
4,1325.0,1.0,642.0,-122.087751,37.923448,60133400021004
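
To see why the dtype matters, the sketch below reads the same two-row CSV text twice, once with default type inference and once with `fips_block` forced to a string. The CSV content is a made-up in-memory stand-in for the real file.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file (hypothetical rows)
csv_text = "rent,fips_block\n4500,060750216002015\n2650,060855085053008\n"

# Default inference parses the code as an integer and drops the leading zero
as_number = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the full 15-character code
as_string = pd.read_csv(io.StringIO(csv_text), dtype={'fips_block': str})

print(as_number['fips_block'].iloc[0])
print(as_string['fips_block'].iloc[0])
```

A key read as an integer would not match a string key such as `'060750216002015'` in a later merge.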


Next we merge the listings with the census data on the FIPS block code, which is named fips_block in rentals and blockfips in sf1.

In [126]:
rentals_sf1 = pd.merge(rentals, sf1, left_on='fips_block', right_on='blockfips')
rentals_sf1[:10]

,rent,bedrooms,sqft,longitude,latitude,fips_block,logrecno,blockfips,state,county,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
0,4500.0,2.0,1200.0,-122.4383,37.745,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
1,6250.0,3.0,1215.0,-122.4383,37.745,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
2,6250.0,3.0,1215.0,-122.4383,37.745,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
3,6650.0,3.0,2900.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
4,2600.0,1.0,615.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
5,2615.0,1.0,615.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
6,2615.0,1.0,615.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
7,2600.0,1.0,615.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
8,3200.0,,900.0,-122.4383,37.745,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
9,2600.0,1.0,615.0,-122.440088,37.745296,60750216002015,589261,60750216002015,6,75,...,3,0,5,88.888889,23.445826,10.301954,46.358792,23.268206,5.882353,24936.524053
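
The merged rows above repeat the same census values because many listings fall in the same block. The sketch below, on made-up frames named `listings` and `census`, illustrates this many-to-one behaviour. It also shows that the default inner join silently drops listings with no matching block, and one possible way to count them using `how='left'` with `indicator=True`. The block code ending in all nines is a hypothetical non-matching value.

```python
import pandas as pd

listings = pd.DataFrame({
    'rent': [4500.0, 6250.0, 1850.0],
    'fips_block': ['060750216002015', '060750216002015', '069999999999999'],
})
census = pd.DataFrame({
    'blockfips': ['060750216002015', '060816101001026'],
    'pct_rent': [88.9, 40.0],
})

# Default how='inner': listings whose block is not in census are dropped
inner = pd.merge(listings, census, left_on='fips_block', right_on='blockfips')

# how='left' keeps every listing; indicator=True records where each row matched
checked = pd.merge(listings, census, left_on='fips_block', right_on='blockfips',
                   how='left', indicator=True)
unmatched = (checked['_merge'] == 'left_only').sum()

print(inner)
print(unmatched)
```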


In [143]:
rentals_sf1.groupby(rentals_sf1['county_name'])[['rent']].mean()

county_name,rent
Alameda,2235.584293
Contra Costa,1955.075348
Marin,3277.287562
Napa,2117.797398
San Francisco,3746.737974
San Mateo,2857.011248
Santa Clara,2665.584276
Solano,1359.965551
Sonoma,1805.14076


In [150]:
rentals_sf1.groupby(['county_name', 'bedrooms'])[['rent']].mean()

county_name,bedrooms,rent
Alameda,1.0,1862.396037
Alameda,2.0,2284.538409
Alameda,3.0,2715.874074
Alameda,4.0,3315.346895
Alameda,5.0,4820.488152
Alameda,6.0,5898.392857
Alameda,7.0,7362.500000
Alameda,8.0,8400.000000
Contra Costa,1.0,1596.991079
Contra Costa,2.0,1874.641106
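
Grouping by a list of columns produces a result with a two-level index, one level per key. The sketch below, on made-up rental data, shows how a single cell can be selected with a tuple of key values, and how `unstack` can pivot the inner level into columns for a more compact table.

```python
import pandas as pd

toy = pd.DataFrame({
    'county_name': ['A', 'A', 'A', 'B', 'B'],
    'bedrooms': [1.0, 1.0, 2.0, 1.0, 2.0],
    'rent': [1800.0, 2000.0, 2400.0, 1500.0, 1900.0],
})

by_two_keys = toy.groupby(['county_name', 'bedrooms'])[['rent']].mean()

# Select one cell with a tuple of key values
a_one_bed = by_two_keys.loc[('A', 1.0), 'rent']

# Pivot the inner level into columns: counties as rows, bedrooms as columns
wide = by_two_keys['rent'].unstack('bedrooms')

print(by_two_keys)
print(wide)
```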


In [151]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean()

county_name,bedrooms,rent
Alameda,1.0,1862.396037
Alameda,2.0,2284.538409
Alameda,3.0,2715.874074
Contra Costa,1.0,1596.991079
Contra Costa,2.0,1874.641106
Contra Costa,3.0,2390.508197
Marin,1.0,2209.910211
Marin,2.0,2998.448575
Marin,3.0,4476.835979
Napa,1.0,1387.362745


In [154]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].agg(['mean', 'std', 'min', 'max'])

county_name,bedrooms,rent mean,rent std,rent min,rent max
Alameda,1.0,1862.396037,393.300923,500.0,4950.0
Alameda,2.0,2284.538409,534.999766,800.0,4998.0
Alameda,3.0,2715.874074,806.606556,350.0,7500.0
Contra Costa,1.0,1596.991079,375.069796,496.0,3125.0
Contra Costa,2.0,1874.641106,505.719679,689.0,3895.0
Contra Costa,3.0,2390.508197,775.568225,793.0,5995.0
Marin,1.0,2209.910211,618.305211,430.0,5600.0
Marin,2.0,2998.448575,950.448793,1325.0,9500.0
Marin,3.0,4476.835979,1423.768448,1657.0,9000.0
Napa,1.0,1387.362745,224.556583,750.0,1800.0
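
Passing a list of function names to agg yields two-level column labels, with the source column on top and the statistic below. The sketch below shows that structure on made-up data, together with named aggregation as one way to get flat column names instead. The output names `mean_rent` and `max_rent` are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    'county_name': ['A', 'A', 'B', 'B'],
    'rent': [1800.0, 2200.0, 1500.0, 1900.0],
})

# A list of function names on a one-column selection gives two-level columns
stats = toy.groupby('county_name')[['rent']].agg(['mean', 'min', 'max'])
a_mean = stats.loc['A', ('rent', 'mean')]

# Named aggregation (keyword = (column, function)) gives flat column names
flat = toy.groupby('county_name').agg(
    mean_rent=('rent', 'mean'),
    max_rent=('rent', 'max'),
)

print(stats)
print(flat)
```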


In [156]:
rentals_sf1.groupby(['county_name', 'bedrooms']).apply(top, n=1, column='rent')

county_name,bedrooms,,rent,bedrooms,sqft,longitude,latitude,fips_block,logrecno,blockfips,state,county,...,H0050006,H0050007,H0050008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,pop_sqmi
Alameda,1.0,29483,4950.0,1.0,800.0,-122.280400,37.836500,060014251041007,18519,060014251041007,06,001,...,0,0,0,66.666667,57.142857,14.285714,28.571429,0.000000,0.000000,1965.089530
Alameda,2.0,36889,4998.0,2.0,1400.0,-122.264292,37.879443,060014216004005,2094,060014216004005,06,001,...,6,0,1,62.264151,0.000000,15.841584,76.237624,8.910891,14.516129,13043.569584
Alameda,3.0,58544,7500.0,3.0,2500.0,-122.226934,37.830734,060014261001000,25365,060014261001000,06,001,...,0,0,0,10.000000,1.298701,23.376623,74.025974,1.298701,6.250000,2159.258077
Alameda,4.0,43448,8200.0,4.0,3247.0,-122.229400,37.828817,060014261001006,25371,060014261001006,06,001,...,0,0,1,6.000000,1.351351,29.054054,58.783784,6.081081,5.660377,10261.222401
Alameda,5.0,22134,7800.0,5.0,1856.0,-122.266723,37.863312,060014235001001,3192,060014235001001,06,001,...,0,0,0,,0.000000,0.000000,100.000000,0.000000,,907.176182
Alameda,6.0,45415,10200.0,6.0,2600.0,-122.255180,37.860126,060014236011001,3257,060014236011001,06,001,...,0,0,1,68.493151,4.787234,12.765957,72.340426,5.851064,3.947368,21795.780842
Alameda,7.0,44877,9200.0,7.0,2600.0,-122.250000,37.857100,060014238003015,3446,060014238003015,06,001,...,0,0,0,10.000000,0.000000,5.882353,94.117647,0.000000,0.000000,10746.838174
Alameda,8.0,59696,8400.0,8.0,3000.0,-122.266181,37.853644,060014239011007,3469,060014239011007,06,001,...,0,0,2,58.536585,7.228916,7.228916,79.518072,6.024096,6.818182,14735.006100
Contra Costa,1.0,27811,3125.0,1.0,539.0,-122.073700,37.954000,060133230004016,49621,060133230004016,06,013,...,0,0,1,0.000000,0.000000,20.689655,79.310345,6.896552,10.000000,6952.022584
Contra Costa,2.0,21614,3895.0,2.0,1050.0,-122.128766,37.856406,060133522011000,48681,060133522011000,06,013,...,1,0,26,59.819121,2.034525,21.331689,66.399507,9.556104,5.953827,3136.152191
