# Session 12: Data Wrangling with Pandas

Data wrangling is a technical term... or maybe it is more of a metaphor...

In any case, it is what we've been doing already in various examples: taking messy data in some inconvenient format, then transforming it, cleaning it up, and merging it with other data so that we can do meaningful analysis with it.  Getting that Craigslist rental data into a form we can analyze and map is one example.

This session goes more methodically through the tools in Pandas to handle these operations, with an emphasis on data merging and transformations.  It follows the content in Chapter 7 of Python for Data Analysis pretty closely.

## Merging DataFrames in Pandas: an alternative to Databases for joining tables

Python and Numpy do not provide convenient ways to merge two or more datasets (tables) easily.  This is what database platforms are designed to do. Microsoft Access is a lightweight database. More industrial strength databases include popular open source platforms like Postgres and MySQL, and commercial platforms like Oracle, Microsoft SQL Server, IBM DB2 and IBM Informix.

The database platforms use Structured Query Language (SQL) to do database queries and table joins, or views.  The syntax varies from one platform to another in subtle ways, making the queries generally not interchangeable.  Further, most databases require a heavyweight installation and administration of user accounts, etc.  There are exceptions like SQLite3 which comes with Python, and is a very lightweight database platform with a subset of the features in a full database platform.

For Python developers, one solution to the multiple SQL syntax variants is to use a database abstraction library like SQLAlchemy (www.sqlalchemy.org) to interface with many of these database platforms, while using Python syntax to do database queries.  For those of you who want to work extensively with databases, this is a very popular library and worth learning how to use.  Here is a tutorial on it to give a flavor of what it is like: http://www.pythoncentral.io/introductory-tutorial-python-sqlalchemy/

An alternative solution is to use pandas instead of a database.  This is what we focus on here.

#### Table 7-1. merge function arguments ####
| Argument | Description |
|---|--------------------------|
| left | DataFrame to be merged on the left side
| right | DataFrame to be merged on the right side
| how | One of 'inner', 'outer', 'left' or 'right'. 'inner' by default
| on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys
| left_on | Columns in left DataFrame to use as join keys
| right_on | Analogous to left_on for right DataFrame
| left_index | Use row index in left as its join key (or keys, if a MultiIndex)
| right_index | Analogous to left_index
| sort | Sort merged data lexicographically by join keys. True by default in the pandas version the book describes; recent pandas versions default to False. Disable to get better performance in some cases on large datasets
| suffixes | Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y'). For example, if 'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result
| copy | If False, avoid copying data into resulting data structure in some exceptional cases. By default always copies
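
To give a feel for the `how`, `on` and `suffixes` arguments in the table above, here is a minimal sketch on two tiny made-up tables. The frames `left` and `right` and their values are hypothetical and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy tables, not the course data.
left = pd.DataFrame({"key": ["a", "b", "c"], "data": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "data": [20, 30, 40]})

# Inner join (the default): only keys found in both tables survive.
# The overlapping 'data' column gets the default '_x' / '_y' suffixes.
inner = pd.merge(left, right, on="key")

# Outer join: every key from either table, with NaN where one side has no match.
outer = pd.merge(left, right, on="key", how="outer",
                 suffixes=("_left", "_right"))

print(inner)
print(outer)
```

The inner result keeps only keys `b` and `c`, while the outer result keeps all four keys and fills the gaps with missing values.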

Let's look at our rents collected for the Bay Area, and consider how to merge on some attributes from the census data.

In [1]:
import pandas as pd
store = pd.HDFStore('data/bay.h5')
df = store['rents']
df['tractfips'] = df['blockfips'].map(lambda x: x[:11])
bayarea = ['Alameda','Contra Costa','Marin','Napa','San Francisco','San Mateo','Santa Clara','Solano','Sonoma']
df_bay = df[df['county'].isin(bayarea)]
df_bay[:5]

Unnamed: 0,neighborhood,title,price,bedrooms,pid,longitude,subregion,link,latitude,sqft,month,day,year,blockfips,countyfips,county,tractfips
0,bayview,Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY,950,1,4076905111,-122.396965,SF,/sfc/apa/4076905111.html,37.761216,,Sep,18,2013,60750227022008,6075,San Francisco,6075022702
1,bayview,Only walking distance to major shopping centers.,950,1,4076901755,-122.396793,SF,/sfc/apa/4076901755.html,37.76108,,Sep,18,2013,60750227022008,6075,San Francisco,6075022702
2,bayview,"furnished - 1 Bedroom(s), 1 Bath(s), Air Condi...",950,1,4076899340,-122.3971,SF,/sfc/apa/4076899340.html,37.7621,,Sep,18,2013,60750227022005,6075,San Francisco,6075022702
3,financial district,"*NEW* Beautiful, Upscale Condo in Historic Jac...",3300,1,4067393707,-122.399747,SF,/sfc/apa/4067393707.html,37.798108,830.0,Sep,18,2013,60750105002005,6075,San Francisco,6075010500
8,bayview,"We accept either 6, 12 month or month-to-month...",950,1,4076896866,-122.397137,SF,/sfc/apa/4076896866.html,37.76203,,Sep,18,2013,60750227022005,6075,San Francisco,6075022702


In [2]:
df_bay.describe()

Unnamed: 0,price,bedrooms,pid,longitude,latitude,sqft,day,year
count,3073.0,2852.0,3089.0,3089.0,3089.0,2030.0,3089.0,3089
mean,2704.504068,2.0554,4067689000.0,-122.269204,37.747187,1177.346305,17.520881,2013
std,1911.996404,1.006337,13695040.0,0.254461,0.331229,738.347407,0.74695,0
min,1.0,1.0,4008227000.0,-123.1965,37.00578,180.0,14.0,2013
25%,1700.0,1.0,4065168000.0,-122.440428,37.469299,750.0,17.0,2013
50%,2260.0,2.0,4074161000.0,-122.283714,37.759692,1000.0,18.0,2013
75%,3000.0,3.0,4075908000.0,-122.054572,37.894979,1373.0,18.0,2013
max,35000.0,8.0,4076905000.0,-121.56828,38.813554,11685.0,18.0,2013


In [3]:
store = pd.HDFStore('data/sf1_small.h5')
h1 = store['h1']

In [4]:
h1.describe()

Unnamed: 0,logrecno,blkgrp,arealand,H00010001
count,109228.0,109228.0,109228.0,109228.0
mean,395469.92079,2.078881,163768.0,25.505804
std,303602.730347,1.150206,1518068.0,50.771956
min,25.0,0.0,0.0,0.0
25%,45280.75,1.0,6847.75,0.0
50%,592132.5,2.0,16655.0,10.0
75%,661501.25,3.0,38794.5,31.0
max,721990.0,7.0,277483200.0,1455.0


In [5]:
h1[:5]

Unnamed: 0,logrecno,fipsblock,state,county,tract,blkgrp,block,arealand,H00010001
0,25,60014271001000,6,1,427100,1,1000,0,0
1,26,60014271001001,6,1,427100,1,1001,79696,5
2,27,60014271001002,6,1,427100,1,1002,739,0
3,28,60014271001003,6,1,427100,1,1003,19546,13
4,29,60014271001004,6,1,427100,1,1004,14364,8


In [6]:
p1 = store['p1']
p1[:5]

Unnamed: 0,logrecno,fipsblock,state,county,tract,blkgrp,block,arealand,P0010001
0,25,60014271001000,6,1,427100,1,1000,0,0
1,26,60014271001001,6,1,427100,1,1001,79696,113
2,27,60014271001002,6,1,427100,1,1002,739,0
3,28,60014271001003,6,1,427100,1,1003,19546,29
4,29,60014271001004,6,1,427100,1,1004,14364,26


To merge these two tables, we can anticipate that the merging should be very simple since they have the same file structure, with one row per census block (fipsblock), and only one column different.  This is the simplest kind of merge: a one-to-one merge.  One row in h1 uniquely matches one and only one row in p1.

The default merge options here work just fine and produce a joined result that is correct.

In [7]:
ph = pd.merge(p1,h1)
ph[:5]

Unnamed: 0,logrecno,fipsblock,state,county,tract,blkgrp,block,arealand,P0010001,H00010001
0,25,60014271001000,6,1,427100,1,1000,0,0,0
1,26,60014271001001,6,1,427100,1,1001,79696,113,5
2,27,60014271001002,6,1,427100,1,1002,739,0,0
3,28,60014271001003,6,1,427100,1,1003,19546,29,13
4,29,60014271001004,6,1,427100,1,1004,14364,26,8


Notice what happens if we try to be a bit more explicit here and specify a key to merge on:

In [8]:
ph = pd.merge(p1,h1, on='fipsblock')
ph[:5]

Unnamed: 0,logrecno_x,fipsblock,state_x,county_x,tract_x,blkgrp_x,block_x,arealand_x,P0010001,logrecno_y,state_y,county_y,tract_y,blkgrp_y,block_y,arealand_y,H00010001
0,25,60014271001000,6,1,427100,1,1000,0,0,25,6,1,427100,1,1000,0,0
1,26,60014271001001,6,1,427100,1,1001,79696,113,26,6,1,427100,1,1001,79696,5
2,27,60014271001002,6,1,427100,1,1002,739,0,27,6,1,427100,1,1002,739,0
3,28,60014271001003,6,1,427100,1,1003,19546,29,28,6,1,427100,1,1003,19546,13
4,29,60014271001004,6,1,427100,1,1004,14364,26,29,6,1,427100,1,1004,14364,8


Suddenly we end up with a lot of columns that are in both tables, duplicated, with a suffix of _x or _y depending on which dataframe they came from.  Why? Because pandas now thinks it is ONLY joining on fipsblock, and these other columns are 'overlapping', so it keeps each, but differentiates the name.

Why didn't this happen the first time? Because pandas figured out that the two dataframes actually had multiple columns that matched, and did a multi-key join on all of them, leaving only P0010001 and H00010001 as the columns unique to each table.
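
The difference between the two merges can be reproduced on a small scale. The sketch below uses two hypothetical stand-ins for p1 and h1 (`pop` and `hu`, with made-up rows) that share the identifier columns and differ in one value column.

```python
import pandas as pd

# Hypothetical stand-ins for p1 and h1: same id columns, one differing column.
pop = pd.DataFrame({"fipsblock": ["001", "002"], "tract": [10, 10],
                    "P0010001": [113, 29]})
hu = pd.DataFrame({"fipsblock": ["001", "002"], "tract": [10, 10],
                   "H00010001": [5, 13]})

# No `on`: pandas joins on every shared column name (fipsblock and tract).
auto = pd.merge(pop, hu)

# on='fipsblock': tract is no longer a key, so both copies are kept with suffixes.
single = pd.merge(pop, hu, on="fipsblock")

print(list(auto.columns))
# ['fipsblock', 'tract', 'P0010001', 'H00010001']
print(list(single.columns))
# ['fipsblock', 'tract_x', 'P0010001', 'tract_y', 'H00010001']
```

Passing a list such as `on=['fipsblock', 'tract']` makes the multi-key join explicit instead of relying on the default.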

Now if we merge census data on to the cleaned up, geocoded rental listings, we can expect this to be more complicated.  First, the field names don't quite match.  Second, there are many rental listings that could be in the same census block. So this is referred to as a many-to-one merge.

In a many-to-one merge, the expected behavior is that a row in the 'one' table is repeated for each row in the other table that it matches.  It 'fans out' to fill the values for each of the matching rows.  So in this case, all the rental listings in the same census block receive the values from the census table for that census block.

Since the names don't quite match, we'll use optional arguments to set the 'left_on' and 'right_on' key arguments for the merge.
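
As a small illustration of the fan-out, here is a sketch with made-up listings and blocks. The frames and values are hypothetical, and the `validate` argument assumes a pandas version recent enough to support it; it asks pandas to raise an error if the right-hand keys turn out not to be unique.

```python
import pandas as pd

# Hypothetical data: two listings share block '001'.
listings = pd.DataFrame({"blockfips": ["001", "001", "002"],
                         "price": [950, 1200, 3300]})
blocks = pd.DataFrame({"fipsblock": ["001", "002", "003"],
                       "P0010001": [110, 75, 0]})

# The key columns have different names, so name each side explicitly.
# validate='many_to_one' checks that fipsblock is unique in `blocks`.
merged = pd.merge(listings, blocks,
                  left_on="blockfips", right_on="fipsblock",
                  validate="many_to_one")

print(merged)
```

Both listings in block `001` receive the same population value, and block `003`, which has no listings, drops out of this default inner join.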

In [9]:
#The version with suffixes is less useful than the original merge, so let's go back to that.
ph = pd.merge(p1,h1)
ph[:5]

Unnamed: 0,logrecno,fipsblock,state,county,tract,blkgrp,block,arealand,P0010001,H00010001
0,25,60014271001000,6,1,427100,1,1000,0,0,0
1,26,60014271001001,6,1,427100,1,1001,79696,113,5
2,27,60014271001002,6,1,427100,1,1002,739,0,0
3,28,60014271001003,6,1,427100,1,1003,19546,29,13
4,29,60014271001004,6,1,427100,1,1004,14364,26,8


In [10]:
rent_ph = pd.merge(df_bay,ph, left_on='blockfips', right_on='fipsblock')

In [11]:
rent_ph[:5]

Unnamed: 0,neighborhood,title,price,bedrooms,pid,longitude,subregion,link,latitude,sqft,...,logrecno,fipsblock,state,county_y,tract,blkgrp,block,arealand,P0010001,H00010001
0,bayview,Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY,950,1,4076905111,-122.396965,SF,/sfc/apa/4076905111.html,37.761216,,...,589451,60750227022008,6,75,22702,2,2008,12111,110,47
1,bayview,Only walking distance to major shopping centers.,950,1,4076901755,-122.396793,SF,/sfc/apa/4076901755.html,37.76108,,...,589451,60750227022008,6,75,22702,2,2008,12111,110,47
2,bayview,"furnished - 1 Bedroom(s), 1 Bath(s), Air Condi...",950,1,4076899340,-122.3971,SF,/sfc/apa/4076899340.html,37.7621,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37
3,bayview,"We accept either 6, 12 month or month-to-month...",950,1,4076896866,-122.397137,SF,/sfc/apa/4076896866.html,37.76203,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37
4,SOMA / south beach,Within Walking Distance of Public Transportation,2930,1,4075785618,-122.3971,SF,/sfc/apa/4075785618.html,37.7621,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37


So, how many rows should we expect this merged file to contain? The 3,089 that were in the df_bay rental listings (of which 3,073 have a price), or the 109,228 that were in the census block file?  Or something else?

In [12]:
rent_ph.describe()

Unnamed: 0,price,bedrooms,pid,longitude,latitude,sqft,day,year,logrecno,blkgrp,arealand,P0010001,H00010001
count,3073.0,2852.0,3089.0,3089.0,3089.0,2030.0,3089.0,3089,3089.0,3089.0,3089.0,3089.0,3089.0
mean,2704.504068,2.0554,4067689000.0,-122.269204,37.747187,1177.346305,17.520881,2013,516922.108449,2.139851,196940.697961,258.510845,124.745225
std,1911.996404,1.006337,13695040.0,0.254461,0.331229,738.347407,0.74695,0,230037.324094,1.170612,1466542.30535,315.801926,154.58982
min,1.0,1.0,4008227000.0,-123.1965,37.00578,180.0,14.0,2013,397.0,1.0,0.0,0.0,0.0
25%,1700.0,1.0,4065168000.0,-122.440428,37.469299,750.0,17.0,2013,586607.0,1.0,15358.0,58.0,23.0
50%,2260.0,2.0,4074161000.0,-122.283714,37.759692,1000.0,18.0,2013,624095.0,2.0,28914.0,157.0,72.0
75%,3000.0,3.0,4075908000.0,-122.054572,37.894979,1373.0,18.0,2013,653188.0,3.0,80848.0,347.0,174.0
max,35000.0,8.0,4076905000.0,-121.56828,38.813554,11685.0,18.0,2013,721924.0,7.0,34928405.0,2515.0,1155.0


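This merge did an inner join (the default mode of merging), keeping only records that showed up in df_bay AND in ph.  The result has 3,089 rows, the same as df_bay, so every listing found its census block; a left join would have returned the same rows here.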

What if we had done a right join?

In [13]:
rent_ph_right = pd.merge(df_bay,ph, how='right', left_on='blockfips', right_on='fipsblock')

In [14]:
rent_ph_right.describe()

Unnamed: 0,price,bedrooms,pid,longitude,latitude,sqft,day,year,logrecno,blkgrp,arealand,P0010001,H00010001
count,3073.0,2852.0,3089.0,3089.0,3089.0,2030.0,3089.0,3089,110354.0,110354.0,110354.0,110354.0,110354.0
mean,2704.504068,2.0554,4067689000.0,-122.269204,37.747187,1177.346305,17.520881,2013,397098.696966,2.07949,164356.7,67.87126,26.810483
std,1911.996404,1.006337,13695040.0,0.254461,0.331229,738.347407,0.74695,0,303116.792183,1.150768,1521330.0,130.131644,55.15117
min,1.0,1.0,4008227000.0,-123.1965,37.00578,180.0,14.0,2013,25.0,0.0,0.0,0.0,0.0
25%,1700.0,1.0,4065168000.0,-122.440428,37.469299,750.0,17.0,2013,45490.25,1.0,6937.25,0.0,0.0
50%,2260.0,2.0,4074161000.0,-122.283714,37.759692,1000.0,18.0,2013,592326.5,2.0,16775.0,25.0,10.0
75%,3000.0,3.0,4075908000.0,-122.054572,37.894979,1373.0,18.0,2013,661431.75,3.0,39203.5,86.0,31.0
max,35000.0,8.0,4076905000.0,-121.56828,38.813554,11685.0,18.0,2013,721990.0,7.0,277483200.0,5115.0,1455.0


What happened, exactly?

In [15]:
rent_ph_right.head()

Unnamed: 0,neighborhood,title,price,bedrooms,pid,longitude,subregion,link,latitude,sqft,...,logrecno,fipsblock,state,county_y,tract,blkgrp,block,arealand,P0010001,H00010001
0,bayview,Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY,950,1,4076905111,-122.396965,SF,/sfc/apa/4076905111.html,37.761216,,...,589451,60750227022008,6,75,22702,2,2008,12111,110,47
1,bayview,Only walking distance to major shopping centers.,950,1,4076901755,-122.396793,SF,/sfc/apa/4076901755.html,37.76108,,...,589451,60750227022008,6,75,22702,2,2008,12111,110,47
2,bayview,"furnished - 1 Bedroom(s), 1 Bath(s), Air Condi...",950,1,4076899340,-122.3971,SF,/sfc/apa/4076899340.html,37.7621,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37
3,bayview,"We accept either 6, 12 month or month-to-month...",950,1,4076896866,-122.397137,SF,/sfc/apa/4076896866.html,37.76203,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37
4,SOMA / south beach,Within Walking Distance of Public Transportation,2930,1,4075785618,-122.3971,SF,/sfc/apa/4075785618.html,37.7621,,...,589448,60750227022005,6,75,22702,2,2005,12177,75,37


In [16]:
rent_ph_right.tail()

Unnamed: 0,neighborhood,title,price,bedrooms,pid,longitude,subregion,link,latitude,sqft,...,logrecno,fipsblock,state,county_y,tract,blkgrp,block,arealand,P0010001,H00010001
110349,,,,,,,,,,,...,721986,60971505004042,6,97,150500,4,4042,4562,0,0
110350,,,,,,,,,,,...,721987,60971505004043,6,97,150500,4,4043,1089804,9,6
110351,,,,,,,,,,,...,721988,60971505004044,6,97,150500,4,4044,1412940,64,42
110352,,,,,,,,,,,...,721989,60971505004045,6,97,150500,4,4045,931534,5,9
110353,,,,,,,,,,,...,721990,60971505004046,6,97,150500,4,4046,617,0,0
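
A right join keeps every row of the right-hand table, so census blocks without any listing appear with missing values in the listing columns, as the tail above shows. One way to count such rows is the `indicator` argument of `pd.merge`. The sketch below uses hypothetical toy frames rather than the course data.

```python
import pandas as pd

# Hypothetical data: blocks '003' and '004' have no listings.
listings = pd.DataFrame({"blockfips": ["001", "001", "002"],
                         "price": [950, 1200, 3300]})
blocks = pd.DataFrame({"fipsblock": ["001", "002", "003", "004"],
                       "P0010001": [110, 75, 0, 9]})

# indicator=True adds a '_merge' column saying where each row came from.
right = pd.merge(listings, blocks, how="right",
                 left_on="blockfips", right_on="fipsblock",
                 indicator=True)

counts = right["_merge"].value_counts()
print(len(right))   # 3 matched listing rows + 2 unmatched blocks = 5
print(counts)
```

The row count of a right join is therefore the number of matched pairs plus the number of right-hand rows that matched nothing.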


## Concatenating DataFrames

#### Table 7-2. concat function arguments ####
| Argument | Description |
|---|--------------------------|
| objs | List or dict of pandas objects to be concatenated. The only required argument
| axis | Axis to concatenate along; defaults to 0
| join | One of 'inner' or 'outer', defaulting to 'outer'; whether to take the intersection (inner) or union (outer) of the indexes along the other axes
| join_axes | Specific indexes to use for the other n-1 axes instead of performing union/intersection logic
| keys | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis. Can either be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple level arrays passed in levels)
| levels | Specific indexes to use as hierarchical index level or levels if keys passed
| names | Names for created hierarchical levels if keys and / or levels passed
| verify_integrity | Check new axis in concatenated object for duplicates and raise exception if so. By default (False) allows duplicates
| ignore_index | Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index

A different kind of merge is needed when you want to add rows to a table rather than columns.  In Pandas this is called concatenating.
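
Here is a minimal sketch of row-wise concatenation on two made-up tables, showing what the `ignore_index` and `keys` arguments from Table 7-2 do to the resulting index. The frames `sf` and `other` are hypothetical.

```python
import pandas as pd

# Hypothetical per-county tables with the same columns.
sf = pd.DataFrame({"county": ["San Francisco", "San Francisco"],
                   "price": [950, 3300]})
other = pd.DataFrame({"county": ["Alameda", "Marin"],
                      "price": [1800, 2500]})

# Default: rows are stacked and each piece keeps its original index labels.
stacked = pd.concat([sf, other])

# ignore_index=True: discard the old labels and number the rows 0..n-1.
renumbered = pd.concat([sf, other], ignore_index=True)

# keys=...: add an outer index level recording which piece each row came from.
labeled = pd.concat([sf, other], keys=["sf", "other"])

print(stacked.index.tolist())      # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
print(labeled.loc["other"])
```

Duplicate index labels, as in `stacked`, are allowed by default; passing `verify_integrity=True` would raise an error instead.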

A motivating example would be if we had obtained rental listings separately by county, and wanted to merge them into a single Bay Area table.  Let's create a split table and put it back together to demonstrate how this works.

In [17]:
rent_ph['county_x'].unique()

array([u'San Francisco', u'San Mateo', u'Santa Clara', u'Contra Costa',
       u'Alameda', u'Solano', u'Sonoma', u'Napa', u'Marin'], dtype=object)

Let's create a dataframe with only San Francisco, and another with the remaining counties.

In [18]:
rent_sf = rent_ph[rent_ph['county_x']=='San Francisco']
rent_sf['county_x'].unique()

array([u'San Francisco'], dtype=object)

In [19]:
rent_not_sf = rent_ph[rent_ph['county_x']!='San Francisco']

In [20]:
rent_not_sf['county_x'].unique()

array([u'San Mateo', u'Santa Clara', u'Contra Costa', u'Alameda',
       u'Solano', u'Sonoma', u'Napa', u'Marin'], dtype=object)

OK, now we have two dataframes, and can concatenate them to see how this works.

In [21]:
rent_bay = pd.concat([rent_sf, rent_not_sf])

In [22]:
rent_bay['county_x'].unique()

array([u'San Francisco', u'San Mateo', u'Santa Clara', u'Contra Costa',
       u'Alameda', u'Solano', u'Sonoma', u'Napa', u'Marin'], dtype=object)

In [23]:
rent_bay.describe()

Unnamed: 0,price,bedrooms,pid,longitude,latitude,sqft,day,year,logrecno,blkgrp,arealand,P0010001,H00010001
count,3073.0,2852.0,3089.0,3089.0,3089.0,2030.0,3089.0,3089,3089.0,3089.0,3089.0,3089.0,3089.0
mean,2704.504068,2.0554,4067689000.0,-122.269204,37.747187,1177.346305,17.520881,2013,516922.108449,2.139851,196940.697961,258.510845,124.745225
std,1911.996404,1.006337,13695040.0,0.254461,0.331229,738.347407,0.74695,0,230037.324094,1.170612,1466542.30535,315.801926,154.58982
min,1.0,1.0,4008227000.0,-123.1965,37.00578,180.0,14.0,2013,397.0,1.0,0.0,0.0,0.0
25%,1700.0,1.0,4065168000.0,-122.440428,37.469299,750.0,17.0,2013,586607.0,1.0,15358.0,58.0,23.0
50%,2260.0,2.0,4074161000.0,-122.283714,37.759692,1000.0,18.0,2013,624095.0,2.0,28914.0,157.0,72.0
75%,3000.0,3.0,4075908000.0,-122.054572,37.894979,1373.0,18.0,2013,653188.0,3.0,80848.0,347.0,174.0
max,35000.0,8.0,4076905000.0,-121.56828,38.813554,11685.0,18.0,2013,721924.0,7.0,34928405.0,2515.0,1155.0


## Finding and Filtering Outliers

In [24]:
rent_bay['price'].quantile(.001)

486.08000000000004

In [25]:
rent_bay['price'].quantile(.999)

17232.000000000084

In [26]:
rent_bay['bedrooms'].unique()

array([  1.,   2.,  nan,   3.,   4.,   5.,   6.,   8.,   7.])

In [27]:
rent_bay['bedrooms'] = rent_bay['bedrooms'].fillna(0)

In [28]:
rent_bay['bedrooms'].unique()

array([ 1.,  2.,  0.,  3.,  4.,  5.,  6.,  8.,  7.])

In [29]:
rent_bay['outlier'] = (rent_bay['price'] < rent_bay['price'].quantile(.001)) | (rent_bay['price'] > rent_bay['price'].quantile(.999))

In [30]:
# Keep the non-outlier rows; .copy() lets us add columns to the result later
rent_bay_filtered = rent_bay[~rent_bay['outlier']].copy()

In [31]:
rent_bay_filtered['price'].min()

500.0

In [32]:
rent_bay_filtered['price'].max()

17000.0
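
The quantile-based filter used in the cells above can be wrapped in a small helper. This is only a sketch: the function name `filter_quantiles` and the toy prices are hypothetical, and it uses `Series.between` rather than the explicit outlier flag from the notebook.

```python
import pandas as pd

def filter_quantiles(frame, column, lower=0.001, upper=0.999):
    """Keep rows whose `column` value lies within the given quantile range."""
    low_cut = frame[column].quantile(lower)
    high_cut = frame[column].quantile(upper)
    # between() is inclusive on both ends and is False for missing values.
    return frame[frame[column].between(low_cut, high_cut)].copy()

# Hypothetical prices with one implausibly low and one implausibly high value.
prices = pd.DataFrame({"price": [1, 900, 1000, 1100, 1200, 35000]})

# Wide cutoffs are used here only because the toy table is so small.
trimmed = filter_quantiles(prices, "price", lower=0.1, upper=0.9)
print(trimmed["price"].tolist())   # [900, 1000, 1100, 1200]
```

One difference from the notebook's approach: rows with a missing price are dropped by `between`, whereas the boolean outlier flag leaves them in the filtered table.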

#### Table 7-3. Python built-in string methods ####
| Method | Description |
|---|--------------------------|
| count | Return the number of non-overlapping occurrences of substring in the string.
| endswith, startswith | Returns True if string ends with suffix (starts with prefix).
| join | Use string as delimiter for concatenating a sequence of other strings.
| index | Return position of first character in substring if found in the string. Raises ValueError if not found.
| find  | Return position of first character of first occurrence of substring in the string. Like index, but returns -1 if not found.
| rfind | Return position of first character of last occurrence of substring in the string. Returns -1 if not found.
| replace | Replace occurrences of string with another string.
| strip, rstrip, lstrip | Trim whitespace, including newlines, from both ends (strip), the right end (rstrip), or the left end (lstrip).
| split | Break string into list of substrings using passed delimiter.
| lower, upper | Convert alphabet characters to lowercase or uppercase, respectively.
| ljust, rjust | Left justify or right justify, respectively. Pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.
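
A few of the built-in methods from Table 7-3 in action, as a small sketch. The string is modeled on a neighborhood value from the listings, with extra whitespace added for illustration.

```python
# Hypothetical raw value with stray whitespace and mixed case.
raw = "  SOMA / south beach \n"

cleaned = raw.strip().lower()                     # 'soma / south beach'
parts = [p.strip() for p in cleaned.split("/")]   # ['soma', 'south beach']
rejoined = " & ".join(parts)                      # 'soma & south beach'

position = cleaned.find("beach")    # 13: index where 'beach' starts
missing = cleaned.find("marina")    # -1: find() does not raise when absent

print(cleaned, parts, rejoined, position, missing)
```

Using `index` instead of `find` for the missing substring would raise a `ValueError`, which is sometimes preferable when an absent value indicates a data problem.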

#### Table 7-4. Regular expression methods ####
| Method | Description |
|---|--------------------------|
| findall, finditer | Return all non-overlapping matching patterns in a string. findall returns a list of all patterns while finditer returns them one by one from an iterator.
| match | Match pattern at start of string and optionally segment pattern components into groups. If the pattern matches, returns a match object, otherwise None.
| search | Scan string for match to pattern; returning a match object if so. Unlike match, the match can be anywhere in the string as opposed to only at the beginning.
| split |Break string into pieces at each occurrence of pattern.
| sub, subn |Replace occurrences of pattern in string with replacement expression; subn also returns the number of substitutions made. Use symbols \1, \2, ... to refer to match group elements in the replacement string.
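
The regular expression methods in Table 7-4 can be tried on a listing title. This is a minimal sketch: the title string is a made-up example in the style of the listings, and the pattern is just one possible way to pull out bedroom and bath counts.

```python
import re

# Hypothetical title in the style of the rental listings.
title = "furnished - 1 Bedroom(s), 1 Bath(s), Air Conditioning"

# Two groups: a number, then the word 'bedroom' or 'bath' (any case).
pattern = re.compile(r"(\d+)\s+(bedroom|bath)", flags=re.IGNORECASE)

pairs = pattern.findall(title)      # [('1', 'Bedroom'), ('1', 'Bath')]
first = pattern.search(title)       # match object for '1 Bedroom'
at_start = pattern.match(title)     # None: the title does not start with a digit

# subn returns the new string and the number of replacements made.
cleaned, n_subs = re.subn(r"\(s\)", "s", title)

print(pairs, first.group(2), at_start, cleaned, n_subs)
```

The contrast between `search` and `match` is the main thing to notice: `match` only succeeds when the pattern fits at the very start of the string.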

#### Table 7-5. Vectorized string methods ####
| Method | Description |
|---|--------------------------|
| cat | Concatenate strings element-wise with optional delimiter
| contains |Return boolean array if each string contains pattern/regex
| count |Count occurrences of pattern
| endswith, startswith |Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element.
| findall |Compute list of all occurrences of pattern/regex for each string 
| get | Index into each element (retrieve i-th element)
| join |Join strings in each element of the Series with passed separator
| len |Compute length of each string
| lower, upper |Convert cases; equivalent to x.lower() or x.upper() for each element.
| match |Use re.match with the passed regular expression on each element, returning matched groups as list.
| pad |Add whitespace to left, right, or both sides of strings
| center |Equivalent to pad(side='both')
| repeat |Duplicate values; for example s.str.repeat(3) equivalent to x * 3 for each string.
| replace |Replace occurrences of pattern/regex with some other string
| slice |Slice each string in the Series.
| split |Split strings on delimiter or regular expression
| strip, rstrip, lstrip |Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element.
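
The vectorized methods apply the same operation to every element of a Series through the `.str` accessor. The sketch below uses a hypothetical Series of titles, including a missing one, to show a case-insensitive keyword flag.

```python
import pandas as pd

# Hypothetical titles; None stands in for a listing with no title.
titles = pd.Series(["Bay VIEW studio", "Walk to BART", None,
                    "Ocean view, walkable"])

# case=False ignores case; na=False treats missing titles as "no match",
# so the result is a clean boolean Series usable as a filter.
has_view = titles.str.contains("view", case=False, na=False)

print(has_view.tolist())   # [True, False, False, True]
print(has_view.sum())      # 2
```

Without `na=False`, a missing title generally yields a missing value in the result rather than False, which can cause trouble when the result is used to select rows.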

In [33]:
rent_bay_filtered['title'].str.lower().str.contains('view').sum()

266

In [34]:
rent_bay_filtered['view'] = rent_bay_filtered['title'].str.lower().str.contains('view')
rent_bay_filtered.groupby('view')['price'].mean()

view
False    2537.763487
True     4139.229323
Name: price, dtype: float64

In [35]:
rent_bay_filtered['title'].str.lower().str.contains('walk').sum()

103

In [36]:
rent_bay_filtered['walk'] = rent_bay_filtered['title'].str.lower().str.contains('walk')
rent_bay_filtered.groupby('walk')['price'].mean()

walk
False    2679.012829
True     2611.640777
Name: price, dtype: float64

## Jumping Ahead: Creating a Hedonic Regression of Rental Prices using StatsModels

OK, this is a bit ahead of schedule, but now that we have the rent data somewhat cleaned up, and some attributes of the listings, wouldn't it be tempting to explore how those attributes contribute to the variation in rents?

Hedonic regression is a great tool to do this, and statsmodels is a Python library that works very nicely with pandas, and supports multiple regression using various methods, including Ordinary Least Squares regression - the most common flavor.  Hedonic regression is just multiple regression in which prices (or rents) are regressed on attributes of the properties, to 'decompose' the price into its component parts, like figuring out what the price would be if you could rent an apartment 'a la carte'.  It shows how market participants (all those sellers / landlords / buyers / renters) interact on average to negotiate market values for different structural qualities like sqft and number of bedrooms, as well as locational attributes, like whether a place is walkable or has views.
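
To make the 'a la carte' idea concrete, here is a sketch that builds synthetic rents from known attribute prices and then recovers those prices by least squares. It uses `numpy.linalg.lstsq` rather than statsmodels, and every number in it (the base rent of 1000, 1.5 per sqft, a 500 premium for a view) is an assumption chosen for illustration, not an estimate from the rental data.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free listings built from assumed attribute prices.
listings = pd.DataFrame({
    "sqft": [500.0, 750.0, 1000.0, 1250.0, 1500.0, 900.0],
    "view": [False, True, False, True, False, True],
})
listings["price"] = 1000.0 + 1.5 * listings["sqft"] + 500.0 * listings["view"]

# Design matrix: a column of ones for the intercept, sqft, and view as 0/1.
X = np.column_stack([
    np.ones(len(listings)),
    listings["sqft"].to_numpy(),
    listings["view"].to_numpy(dtype=float),
])
y = listings["price"].to_numpy()

# Ordinary least squares: coefficients minimizing the sum of squared residuals.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, per_sqft, view_premium = coef

print(intercept, per_sqft, view_premium)
```

Because the synthetic rents contain no noise, the fit returns the assumed prices up to floating-point error. With real listings the coefficients are estimates, and a library like statsmodels adds the standard errors and fit statistics needed to judge them.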

In [37]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices

In [38]:
y, X = dmatrices('price ~ sqft + view', data=rent_bay_filtered, return_type='dataframe')

In [39]:
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.398
Model:                            OLS   Adj. R-squared:                  0.397
Method:                 Least Squares   F-statistic:                     666.9
Date:                Tue, 07 Oct 2014   Prob (F-statistic):          4.08e-223
Time:                        11:24:27   Log-Likelihood:                -17288.
No. Observations:                2023   AIC:                         3.458e+04
Df Residuals:                    2020   BIC:                         3.460e+04
Df Model:                           2                                         
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept     1035.5038     54.465     19.012      0.000       928.691  1142.316
view[T.True]  1376.3213     99.919     13.774 