# DS-SF-36 | 03 | `pandas` | Codealong | Answer Key

(http://pandas.pydata.org/pandas-docs/stable)

## Part A | Introduction to `pandas`

In [1]:
import os

import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

> ## `pd.read_csv()`: load datasets from files (or even over the Internet)

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow-properties.csv'))
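
As a minimal sketch of what `pd.read_csv()` accepts: besides a file path or a URL, it also takes any file-like object.  The CSV text and the name `toy_df` below are made up for illustration; they are not part of the Zillow dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL
csv_text = """ID,Address,Size
1,"1 Main St, Springfield",800
2,"2 Oak Ave, Springfield",
"""

# pd.read_csv accepts a path, a URL, or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)
print(toy_df.dtypes)   # the empty Size cell is read as NaN
```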

> ## `DataFrame`

Let's check `df`'s type:

In [3]:
type(df)

pandas.core.frame.DataFrame

`df` is a `DataFrame`.  (http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

A `DataFrame` stores tabular data.  Let's have a look at its content:

In [4]:
df

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,2121978635,"829 Folsom St UNIT 906, San Francisco, CA",37781429,-122401860,False,...,557.0,sqft,,,2010.0
1,89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,...,1050.0,sqft,,,2007.0
2,15131782,"401 Grand View Ave APT 3, San Francisco, CA",37752157,-122442356,False,...,937.0,sqft,,,1983.0
3,15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,...,1574.0,sqft,1947.0,sqft,1959.0
4,52266124,"88 King St APT 317, San Francisco, CA",37780630,-122389635,False,...,1205.0,sqft,,,2000.0
...,...,...,...,...,...,...,...,...,...,...,...
995,82786211,"310 Townsend St APT 311, San Francisco, CA",37777027,-122395736,False,...,853.0,sqft,,,2006.0
996,15103435,"1343 31st Ave, San Francisco, CA",37762152,-122490254,False,...,1886.0,sqft,3000.0,sqft,1934.0
997,15195183,"3916 Alemany Blvd, San Francisco, CA",37711527,-122467755,False,...,1300.0,sqft,2553.0,sqft,1941.0
998,15180783,"430 Fair Oaks St, San Francisco, CA",37749725,-122424094,False,...,2678.0,sqft,,,1911.0


> ## `.head()`: first 5 (default) rows

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.head.html)

In [5]:
df.head()

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,2121978635,"829 Folsom St UNIT 906, San Francisco, CA",37781429,-122401860,False,...,557.0,sqft,,,2010.0
1,89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,...,1050.0,sqft,,,2007.0
2,15131782,"401 Grand View Ave APT 3, San Francisco, CA",37752157,-122442356,False,...,937.0,sqft,,,1983.0
3,15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,...,1574.0,sqft,1947.0,sqft,1959.0
4,52266124,"88 King St APT 317, San Francisco, CA",37780630,-122389635,False,...,1205.0,sqft,,,2000.0


> ## `.tail()`: last 5 (default) rows

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.tail.html)

In [6]:
df.tail()

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
995,82786211,"310 Townsend St APT 311, San Francisco, CA",37777027,-122395736,False,...,853.0,sqft,,,2006.0
996,15103435,"1343 31st Ave, San Francisco, CA",37762152,-122490254,False,...,1886.0,sqft,3000.0,sqft,1934.0
997,15195183,"3916 Alemany Blvd, San Francisco, CA",37711527,-122467755,False,...,1300.0,sqft,2553.0,sqft,1941.0
998,15180783,"430 Fair Oaks St, San Francisco, CA",37749725,-122424094,False,...,2678.0,sqft,,,1911.0
999,54854296,"720 Stockton St APT 3, San Francisco, CA",37792578,-122407366,False,...,886.0,sqft,,,2001.0


> ## `.shape`: shape (i.e., number of rows and columns)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.shape.html)

In [7]:
df.shape

(1000, 12)

The first value (at index 0) is the number of rows, the second (at index 1), the number of columns:

In [8]:
df.shape[0]

1000

In [9]:
df.shape[1]

12

You can also use the idiomatic Python `len` function to get the number of rows:

In [10]:
len(df)

1000
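
The relationship between `.shape` and `len` can be checked on a small, made-up `DataFrame` (the name `toy_df` and its values are assumptions for illustration only):

```python
import pandas as pd

toy_df = pd.DataFrame({'Beds': [1, 2, 3], 'Baths': [1.0, 1.5, 2.0]})

# .shape is a plain tuple, so it can be unpacked
n_rows, n_columns = toy_df.shape

print(n_rows, n_columns)
print(len(toy_df) == n_rows)               # len counts rows
print(toy_df.size == n_rows * n_columns)   # .size counts cells
```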

> ## `.dtypes`: column types

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html)
- (http://pandas.pydata.org/pandas-docs/stable/basics.html)

In [11]:
df.dtypes

ID               int64
Address         object
Latitude         int64
Longitude        int64
IsAStudio       object
                ...   
Size           float64
SizeUnit        object
LotSize        float64
LotSizeUnit     object
BuiltInYear    float64
dtype: object

> ## `.isnull()` and `.notnull()`: NaN (Not-a-Number)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.notnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html)

As data scientists, we will have to decide what to do when encountering missing values (a.k.a., `NaN`s, for Not-a-Number).  We might decide to drop the rows containing them, drop the whole column, or impute them.  Today, let's focus on finding these `NaN`s.

In [12]:
df.isnull()

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,False,False,False,False,False,...,False,False,True,True,False
1,False,False,False,False,False,...,False,False,True,True,False
2,False,False,False,False,False,...,False,False,True,True,False
3,False,False,False,False,False,...,False,False,False,False,False
4,False,False,False,False,False,...,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,...,False,False,True,True,False
996,False,False,False,False,False,...,False,False,False,False,False
997,False,False,False,False,False,...,False,False,False,False,False
998,False,False,False,False,False,...,False,False,True,True,False


In return, we get a new `DataFrame` of Boolean values: `True` where the value is `NaN`, `False` otherwise.

We can also get the count per column:

In [13]:
df.isnull().sum()

ID               0
Address          0
Latitude         0
Longitude        0
IsAStudio       14
              ... 
Size            33
SizeUnit        33
LotSize        444
LotSizeUnit    444
BuiltInYear     25
dtype: int64

Summing again will return the number of cells in the `DataFrame` with missing values.

In [14]:
df.isnull().sum().sum()

1215

Equivalently, we can also use the top-level `pd.isnull()` function:

In [15]:
pd.isnull(df)

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,False,False,False,False,False,...,False,False,True,True,False
1,False,False,False,False,False,...,False,False,True,True,False
2,False,False,False,False,False,...,False,False,True,True,False
3,False,False,False,False,False,...,False,False,False,False,False
4,False,False,False,False,False,...,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,...,False,False,True,True,False
996,False,False,False,False,False,...,False,False,False,False,False
997,False,False,False,False,False,...,False,False,False,False,False
998,False,False,False,False,False,...,False,False,True,True,False


We can also use `.notnull()`, its complement method:

In [16]:
df.notnull()

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,True,True,True,True,True,...,True,True,False,False,True
1,True,True,True,True,True,...,True,True,False,False,True
2,True,True,True,True,True,...,True,True,False,False,True
3,True,True,True,True,True,...,True,True,True,True,True
4,True,True,True,True,True,...,True,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
995,True,True,True,True,True,...,True,True,False,False,True
996,True,True,True,True,True,...,True,True,True,True,True
997,True,True,True,True,True,...,True,True,True,True,True
998,True,True,True,True,True,...,True,True,False,False,True


In [17]:
pd.notnull(df)

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,True,True,True,True,True,...,True,True,False,False,True
1,True,True,True,True,True,...,True,True,False,False,True
2,True,True,True,True,True,...,True,True,False,False,True
3,True,True,True,True,True,...,True,True,True,True,True
4,True,True,True,True,True,...,True,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
995,True,True,True,True,True,...,True,True,False,False,True
996,True,True,True,True,True,...,True,True,True,True,True
997,True,True,True,True,True,...,True,True,True,True,True
998,True,True,True,True,True,...,True,True,False,False,True
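
The counting idioms from this section can be tried on a small, made-up `DataFrame`.  This is only a sketch: `toy_df` and its values are assumptions, and the per-row and any-row variants are additions that the codealong itself does not use.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'Size': [557.0, np.nan, 937.0],
    'LotSize': [np.nan, np.nan, 1947.0],
})

per_column = toy_df.isnull().sum()                  # missing cells per column
per_row = toy_df.isnull().sum(axis=1)               # missing cells per row
total = toy_df.isnull().sum().sum()                 # missing cells overall
rows_with_any = toy_df.isnull().any(axis=1).sum()   # rows with at least one NaN

print(per_column)
print(total, rows_with_any)
```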


> ## `.index` and `.columns`: row and column labels

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html)

Use the `.index` property to get the row labels.  For the column labels, use the `.columns` property.

In [18]:
df.index

RangeIndex(start=0, stop=1000, step=1)

In [19]:
type(df.index)

pandas.indexes.range.RangeIndex

In this specific case, rows are simply numbered from 0 to 999.  Note that, like Python's standard `range` function, a `RangeIndex` excludes its `stop` value (here, 1000).

In [20]:
df.columns

Index([u'ID', u'Address', u'Latitude', u'Longitude', u'IsAStudio', u'Beds',
       u'Baths', u'Size', u'SizeUnit', u'LotSize', u'LotSizeUnit',
       u'BuiltInYear'],
      dtype='object')

In [21]:
type(df.columns)

pandas.indexes.base.Index
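
A quick way to see the default row labels is to build a tiny `DataFrame` without specifying an index.  The frame below is made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({'Beds': [1, 2, 3]})

print(toy_df.index)           # a RangeIndex whose stop value is excluded
print(list(toy_df.index))     # the actual row labels
print(list(toy_df.columns))   # the column labels
```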

> ## `[ [] ]` and `[]`: subsetting on columns

Selecting specific columns is performed by using the `[]` operator.

If the values passed to `[]` are non-integers, the `DataFrame` will attempt to match them to those in the `columns` property.

> Let's subset the `DataFrame` on columns `Size` and `SizeUnit`:

In [22]:
df[ ['Size', 'SizeUnit'] ]

,Size,SizeUnit
0,557.0,sqft
1,1050.0,sqft
2,937.0,sqft
3,1574.0,sqft
4,1205.0,sqft
...,...,...
995,853.0,sqft
996,1886.0,sqft
997,1300.0,sqft
998,2678.0,sqft


> How about just on `Address`?

In [23]:
df[ ['Address'] ]

,Address
0,"829 Folsom St UNIT 906, San Francisco, CA"
1,"690 Market St UNIT 1705, San Francisco, CA"
2,"401 Grand View Ave APT 3, San Francisco, CA"
3,"250 Concord St, San Francisco, CA"
4,"88 King St APT 317, San Francisco, CA"
...,...
995,"310 Townsend St APT 311, San Francisco, CA"
996,"1343 31st Ave, San Francisco, CA"
997,"3916 Alemany Blvd, San Francisco, CA"
998,"430 Fair Oaks St, San Francisco, CA"


> ## `Series`

(http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

Passing a single column name instead of a list returns a `Series`:

In [24]:
df['Address']

0        829 Folsom St UNIT 906, San Francisco, CA
1       690 Market St UNIT 1705, San Francisco, CA
2      401 Grand View Ave APT 3, San Francisco, CA
3                250 Concord St, San Francisco, CA
4            88 King St APT 317, San Francisco, CA
                          ...                     
995     310 Townsend St APT 311, San Francisco, CA
996               1343 31st Ave, San Francisco, CA
997           3916 Alemany Blvd, San Francisco, CA
998            430 Fair Oaks St, San Francisco, CA
999       720 Stockton St APT 3, San Francisco, CA
Name: Address, dtype: object

> Let's check the result type:

In [25]:
type(df['Address'])

pandas.core.series.Series

Columns can also be retrieved using "attribute" access: a `DataFrame` exposes each column as a property named after that column.  However, this won't work for columns that have spaces or dots in their name.

> Let's check the value of `df`'s `.Address` property:

In [26]:
df.Address

0        829 Folsom St UNIT 906, San Francisco, CA
1       690 Market St UNIT 1705, San Francisco, CA
2      401 Grand View Ave APT 3, San Francisco, CA
3                250 Concord St, San Francisco, CA
4            88 King St APT 317, San Francisco, CA
                          ...                     
995     310 Townsend St APT 311, San Francisco, CA
996               1343 31st Ave, San Francisco, CA
997           3916 Alemany Blvd, San Francisco, CA
998            430 Fair Oaks St, San Francisco, CA
999       720 Stockton St APT 3, San Francisco, CA
Name: Address, dtype: object

> Use the `.name` property of a `Series` (not `.columns`, that's for a `DataFrame`) to get the name of the variable it stores.

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.name.html)

In [27]:
df.Address.name

'Address'
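
The sketch below contrasts bracket and attribute access on a made-up `DataFrame`.  The column names `Lot Size` and `size` are hypothetical, chosen to show two cases where attribute access is not usable: a name with a space (not valid Python syntax after a dot) and a name that clashes with an existing `DataFrame` attribute.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Address': ['1 Main St'],
    'Lot Size': [2500.0],   # hypothetical column with a space in its name
    'size': [800.0],        # hypothetical column clashing with DataFrame.size
})

# Both forms return the same Series
print(toy_df.Address.equals(toy_df['Address']))

# Brackets are the only option for a name containing a space
print(toy_df['Lot Size'])

# toy_df.size is the number of cells, not the 'size' column
print(toy_df.size)
print(toy_df['size'])
```

Bracket access always works, which is one reason to prefer it in code meant to be reused.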

> ## `[]`: slicing on rows

> E.g., on the first five rows:

In [28]:
df[:5]

,ID,Address,Latitude,Longitude,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,2121978635,"829 Folsom St UNIT 906, San Francisco, CA",37781429,-122401860,False,...,557.0,sqft,,,2010.0
1,89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,...,1050.0,sqft,,,2007.0
2,15131782,"401 Grand View Ave APT 3, San Francisco, CA",37752157,-122442356,False,...,937.0,sqft,,,1983.0
3,15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,...,1574.0,sqft,1947.0,sqft,1959.0
4,52266124,"88 King St APT 317, San Francisco, CA",37780630,-122389635,False,...,1205.0,sqft,,,2000.0


> ## `.loc[]` and `.iloc[]`: subsetting rows by index label and location

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.loc.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iloc.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.set_index.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reset_index.html)

So far, the index of the `DataFrame` has been a numerical range starting from 0, but you can specify which column(s) should serve as the index.

> E.g., `ID`:

In [29]:
df = df.set_index('ID')

In [30]:
df.index

Int64Index([2121978635,   89239580,   15131782,   15179502,   52266124,
            2100994004,   15067755,   15112556,   15133321,   61288341,
            ...
              63197318,   15064669,   15142024,   61288364,   69819412,
              82786211,   15103435,   15195183,   15180783,   54854296],
           dtype='int64', name=u'ID', length=1000)

In [31]:
df

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
2121978635,"829 Folsom St UNIT 906, San Francisco, CA",37781429,-122401860,False,1.0,...,557.0,sqft,,,2010.0
89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,2.0,...,1050.0,sqft,,,2007.0
15131782,"401 Grand View Ave APT 3, San Francisco, CA",37752157,-122442356,False,2.0,...,937.0,sqft,,,1983.0
15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,4.0,...,1574.0,sqft,1947.0,sqft,1959.0
52266124,"88 King St APT 317, San Francisco, CA",37780630,-122389635,False,2.0,...,1205.0,sqft,,,2000.0
...,...,...,...,...,...,...,...,...,...,...,...
82786211,"310 Townsend St APT 311, San Francisco, CA",37777027,-122395736,False,1.0,...,853.0,sqft,,,2006.0
15103435,"1343 31st Ave, San Francisco, CA",37762152,-122490254,False,3.0,...,1886.0,sqft,3000.0,sqft,1934.0
15195183,"3916 Alemany Blvd, San Francisco, CA",37711527,-122467755,False,3.0,...,1300.0,sqft,2553.0,sqft,1941.0
15180783,"430 Fair Oaks St, San Francisco, CA",37749725,-122424094,False,4.0,...,2678.0,sqft,,,1911.0


> E.g., row with index 15063505:

In [32]:
df.loc[15063505]

Address        740 Francisco St, San Francisco, CA
Latitude                                  37804420
Longitude                               -122417389
IsAStudio                                    False
Beds                                           NaN
                              ...                 
Size                                          1430
SizeUnit                                      sqft
LotSize                                       2435
LotSizeUnit                                   sqft
BuiltInYear                                   1948
Name: 15063505, dtype: object

In [33]:
type(df.loc[15063505])

pandas.core.series.Series

A single row is also a `Series`.

In [34]:
df.loc[15063505].name

15063505

Its name is its value in the index.

> E.g., rows with indices 15063505 and 15064044:

In [35]:
df.loc[ [15063505, 15064044] ]

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,False,,...,1430.0,sqft,2435.0,sqft,1948.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,False,1.0,...,1060.0,sqft,,,1930.0


> E.g., rows at positions 1 and 3 (i.e., the second and fourth rows, since positions start at 0):

In [36]:
df.iloc[ [1, 3] ]

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,2.0,...,1050.0,sqft,,,2007.0
15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,4.0,...,1574.0,sqft,1947.0,sqft,1959.0
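
The difference between label-based and position-based lookups is easiest to see when the labels are not in order.  The following is a minimal sketch on made-up data (`toy_df` and its IDs are assumptions):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'ID': [30, 10, 20],
    'Beds': [1, 2, 3],
}).set_index('ID')

by_label = toy_df.loc[10]       # the row whose index label is 10
by_position = toy_df.iloc[0]    # the first row, whatever its label

print(by_label.Beds)
print(by_position.Beds, by_position.name)

# .reset_index() turns the index back into a regular column
restored = toy_df.reset_index()
print(list(restored.columns))
```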


> ## Subsetting rows by Boolean selection (a.k.a., masking)

Rows can also be selected by using Boolean selection, using an array calculated from the result of applying a logical condition on the values in any of the columns.  This allows us to build more complicated selections than those based simply upon index labels or positions.

> E.g., which homes were built before 1900?

In [37]:
df.BuiltInYear < 1900

ID
2121978635    False
89239580      False
15131782      False
15179502      False
52266124      False
              ...  
82786211      False
15103435      False
15195183      False
15180783      False
54854296      False
Name: BuiltInYear, dtype: bool

This results in a Boolean `Series` that can be used to select the rows whose values are `True`.

> Let's subset on that `Series`:

In [38]:
df[df.BuiltInYear < 1900]

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
123597388,"667 Shotwell St # A, San Francisco, CA",37757851,-122415629,False,1.0,...,1212.0,sqft,,,1890.0
63197592,"3021 20th St, San Francisco, CA",37758878,-122411147,False,1.0,...,1267.0,sqft,,,1890.0
15145720,"956 S Van Ness Ave, San Francisco, CA",37757832,-122417139,False,4.0,...,3500.0,sqft,4165.0,sqft,1872.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,False,4.0,...,3006.0,sqft,,,1885.0
15181209,"1001 Diamond St # 1001A, San Francisco, CA",37749461,-122435844,False,3.0,...,2032.0,sqft,1913.0,sqft,1892.0
...,...,...,...,...,...,...,...,...,...,...,...
15065140,"1407 Montgomery St APT 2, San Francisco, CA",37802299,-122404941,False,1.0,...,1000.0,sqft,,,1870.0
15084954,"1954 Golden Gate Ave, San Francisco, CA",37778420,-122443073,False,2.0,...,1515.0,sqft,,,1895.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,False,2.0,...,1593.0,sqft,,,1895.0
15082108,"3016 Sacramento St, San Francisco, CA",37788970,-122442995,False,2.0,...,1408.0,sqft,,,1890.0


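Multiple conditions can be combined, either by chaining two selections or, preferably, by joining parenthesized conditions with the `&` operator.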

> E.g., subset for `BuiltInYear` below 1900 and `Size` over 1500:

In [39]:
df[df.BuiltInYear < 1900][df.Size > 1500]

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15145720,"956 S Van Ness Ave, San Francisco, CA",37757832,-122417139,False,4.0,...,3500.0,sqft,4165.0,sqft,1872.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,False,4.0,...,3006.0,sqft,,,1885.0
15181209,"1001 Diamond St # 1001A, San Francisco, CA",37749461,-122435844,False,3.0,...,2032.0,sqft,1913.0,sqft,1892.0
82785514,"1394 Mcallister St, San Francisco, CA",37778463,-122434933,False,3.0,...,2300.0,sqft,,,1890.0
15078866,"753-755 Oak St, San Francisco, CA",37773576,-122431663,False,,...,2430.0,sqft,3781.0,sqft,1890.0
2122992200,"129 Octavia St, San Francisco, CA",37773192,-122424037,True,,...,3655.0,sqft,,,1883.0
15084954,"1954 Golden Gate Ave, San Francisco, CA",37778420,-122443073,False,2.0,...,1515.0,sqft,,,1895.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,False,2.0,...,1593.0,sqft,,,1895.0
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,False,6.0,...,7375.0,sqft,2748.0,sqft,1890.0


(or)

In [40]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)]

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15145720,"956 S Van Ness Ave, San Francisco, CA",37757832,-122417139,False,4.0,...,3500.0,sqft,4165.0,sqft,1872.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,False,4.0,...,3006.0,sqft,,,1885.0
15181209,"1001 Diamond St # 1001A, San Francisco, CA",37749461,-122435844,False,3.0,...,2032.0,sqft,1913.0,sqft,1892.0
82785514,"1394 Mcallister St, San Francisco, CA",37778463,-122434933,False,3.0,...,2300.0,sqft,,,1890.0
15078866,"753-755 Oak St, San Francisco, CA",37773576,-122431663,False,,...,2430.0,sqft,3781.0,sqft,1890.0
2122992200,"129 Octavia St, San Francisco, CA",37773192,-122424037,True,,...,3655.0,sqft,,,1883.0
15084954,"1954 Golden Gate Ave, San Francisco, CA",37778420,-122443073,False,2.0,...,1515.0,sqft,,,1895.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,False,2.0,...,1593.0,sqft,,,1895.0
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,False,6.0,...,7375.0,sqft,2748.0,sqft,1890.0
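
Boolean masks can be stored in variables and combined with `&` (and), `|` (or), and `~` (not).  The sketch below uses made-up values; note that a comparison against `NaN` evaluates to `False`, so rows with a missing `BuiltInYear` are never selected by `old`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'BuiltInYear': [1890.0, 1950.0, 1885.0, np.nan],
    'Size': [1200.0, 2000.0, 3000.0, 1800.0],
})

old = toy_df.BuiltInYear < 1900   # NaN < 1900 is False
large = toy_df.Size > 1500

both = toy_df[old & large]     # old AND large
either = toy_df[old | large]   # old OR large
not_old = toy_df[~old]         # NOT old (includes the NaN row)

print(both)
print(not_old)
```

When the conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than comparison operators.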


Rows and columns can be subset in the same expression.

> E.g., subset (a `DataFrame`) on `Address` for `BuiltInYear` below 1900 and `Size` over 1500:

In [41]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)][ ['Address'] ]

ID,Address
15145720,"956 S Van Ness Ave, San Francisco, CA"
119684777,"967 Hayes St, San Francisco, CA"
15181209,"1001 Diamond St # 1001A, San Francisco, CA"
82785514,"1394 Mcallister St, San Francisco, CA"
15078866,"753-755 Oak St, San Francisco, CA"
2122992200,"129 Octavia St, San Francisco, CA"
15084954,"1954 Golden Gate Ave, San Francisco, CA"
15078536,"640 Steiner St, San Francisco, CA"
15076156,"1533 Sutter St, San Francisco, CA"


> To get a `Series` instead of a `DataFrame`:

In [42]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)]['Address']

ID
15145720           956 S Van Ness Ave, San Francisco, CA
119684777                967 Hayes St, San Francisco, CA
15181209      1001 Diamond St # 1001A, San Francisco, CA
82785514           1394 Mcallister St, San Francisco, CA
15078866               753-755 Oak St, San Francisco, CA
2122992200             129 Octavia St, San Francisco, CA
15084954         1954 Golden Gate Ave, San Francisco, CA
15078536               640 Steiner St, San Francisco, CA
15076156               1533 Sutter St, San Francisco, CA
Name: Address, dtype: object

(or)

In [43]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)].Address

ID
15145720           956 S Van Ness Ave, San Francisco, CA
119684777                967 Hayes St, San Francisco, CA
15181209      1001 Diamond St # 1001A, San Francisco, CA
82785514           1394 Mcallister St, San Francisco, CA
15078866               753-755 Oak St, San Francisco, CA
2122992200             129 Octavia St, San Francisco, CA
15084954         1954 Golden Gate Ave, San Francisco, CA
15078536               640 Steiner St, San Francisco, CA
15076156               1533 Sutter St, San Francisco, CA
Name: Address, dtype: object

## Part B | Wrangling the SF Housing dataset (take 2) with `pandas`

In [44]:
properties_df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow-properties.csv'), index_col = 'ID')
transactions_df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow-transactions.csv'), index_col = 'ID')

(`pd.read_csv` can load the dataset and set the index column for the `DataFrame` at the same time)

In [45]:
properties_df.head()

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
2121978635,"829 Folsom St UNIT 906, San Francisco, CA",37781429,-122401860,False,1.0,...,557.0,sqft,,,2010.0
89239580,"690 Market St UNIT 1705, San Francisco, CA",37788246,-122403198,False,2.0,...,1050.0,sqft,,,2007.0
15131782,"401 Grand View Ave APT 3, San Francisco, CA",37752157,-122442356,False,2.0,...,937.0,sqft,,,1983.0
15179502,"250 Concord St, San Francisco, CA",37710141,-122442063,False,4.0,...,1574.0,sqft,1947.0,sqft,1959.0
52266124,"88 King St APT 317, San Francisco, CA",37780630,-122389635,False,2.0,...,1205.0,sqft,,,2000.0


In [46]:
transactions_df.head()

ID,DateOfSale,SalePrice,SalePriceUnit
15165953,1/5/16,650000.0,$
80749447,11/10/15,1.15,$M
15155751,11/10/15,665000.0,$
15143887,12/31/15,2.1,$M
15117639,12/23/15,1.35,$M


> ### Merge both `DataFrames`

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)

In [47]:
df = properties_df.merge(transactions_df,  left_index = True, right_index = True)

In [48]:
df.columns

Index([u'Address', u'Latitude', u'Longitude', u'IsAStudio', u'Beds', u'Baths',
       u'Size', u'SizeUnit', u'LotSize', u'LotSizeUnit', u'BuiltInYear',
       u'DateOfSale', u'SalePrice', u'SalePriceUnit'],
      dtype='object')
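
With `left_index = True` and `right_index = True`, `.merge()` matches rows on their index labels, and by default it keeps only the labels present in both `DataFrame`s (an inner join).  The sketch below uses two made-up frames to show this, along with the `how = 'left'` variant, which the codealong does not use.

```python
import pandas as pd

properties = pd.DataFrame(
    {'Beds': [1, 2, 3]},
    index=pd.Index([10, 20, 30], name='ID'),
)
transactions = pd.DataFrame(
    {'SalePrice': [1.5, 0.9]},
    index=pd.Index([20, 40], name='ID'),
)

# Default (inner) merge: only ID 20 appears in both frames
inner = properties.merge(transactions, left_index=True, right_index=True)

# Left merge: every property is kept; unmatched sale prices become NaN
left = properties.merge(transactions, left_index=True, right_index=True, how='left')

print(inner)
print(left)
```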

In [49]:
df

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,LotSizeUnit,BuiltInYear,DateOfSale,SalePrice,SalePriceUnit
15165953,"14 Stoneyford Ave, San Francisco, CA",37730923,-122421186,False,3.0,...,sqft,1942.0,1/5/16,650000.00,$
80749447,"199 New Montgomery St UNIT 1201, San Francisco...",37786632,-122399101,False,2.0,...,,2004.0,11/10/15,1.15,$M
15155751,"1546 Innes Ave, San Francisco, CA",37739068,-122387681,False,2.0,...,sqft,1938.0,11/10/15,665000.00,$
15143887,"3065-3069 16TH St, San Francisco, CA",37764699,-122420993,False,,...,sqft,1909.0,12/31/15,2.10,$M
15117639,"2195 28th Ave, San Francisco, CA",37746487,-122485890,False,4.0,...,sqft,1939.0,12/23/15,1.35,$M
...,...,...,...,...,...,...,...,...,...,...,...
69819708,"260 King St UNIT 421, San Francisco, CA",37777641,-122393417,False,1.0,...,,2004.0,12/15/15,731000.00,$
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,False,6.0,...,sqft,1890.0,11/12/15,5.53,$M
119685619,"23 Rodgers St, San Francisco, CA",37775309,-122409205,False,1.0,...,,1908.0,12/21/15,625000.00,$
15113584,"1858 47th Ave, San Francisco, CA",37751825,-122506072,False,3.0,...,sqft,1958.0,12/21/15,895000.00,$


> ### Sort rows by increasing ID

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)

In [50]:
df.sort_index(inplace = True)

In [51]:
df

ID,Address,Latitude,Longitude,IsAStudio,Beds,...,LotSizeUnit,BuiltInYear,DateOfSale,SalePrice,SalePriceUnit
15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,False,1.0,...,,1980.0,12/4/15,710000.00,$
15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,False,,...,sqft,1948.0,11/30/15,2.15,$M
15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,False,2.0,...,sqft,1976.0,11/12/15,5.60,$M
15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,False,1.0,...,,1930.0,12/11/15,1.50,$M
15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,False,2.0,...,,1993.0,1/15/16,970000.00,$
...,...,...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",37800040,-122406100,True,,...,,2012.0,1/15/16,390000.00,$
2126960082,"355 1st St UNIT 1905, San Francisco, CA",37787029,-122393638,False,1.0,...,,2004.0,11/20/15,860000.00,$
2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,False,3.0,...,sqft,1976.0,12/10/15,830000.00,$
2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,False,2.0,...,,1975.0,12/15/15,835000.00,$


> ### Remove the `Latitude` and `Longitude` columns

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)

In [52]:
df.drop(['Latitude', 'Longitude'], axis = 1, inplace = True)

In [53]:
df

ID,Address,IsAStudio,Beds,Baths,Size,...,LotSizeUnit,BuiltInYear,DateOfSale,SalePrice,SalePriceUnit
15063471,"55 Vandewater St APT 9, San Francisco, CA",False,1.0,,550.0,...,,1980.0,12/4/15,710000.00,$
15063505,"740 Francisco St, San Francisco, CA",False,,2.0,1430.0,...,sqft,1948.0,11/30/15,2.15,$M
15063609,"819 Francisco St, San Francisco, CA",False,2.0,3.5,2040.0,...,sqft,1976.0,11/12/15,5.60,$M
15064044,"199 Chestnut St APT 5, San Francisco, CA",False,1.0,1.0,1060.0,...,,1930.0,12/11/15,1.50,$M
15064257,"111 Chestnut St APT 403, San Francisco, CA",False,2.0,2.0,1299.0,...,,1993.0,1/15/16,970000.00,$
...,...,...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",True,,1.0,264.0,...,,2012.0,1/15/16,390000.00,$
2126960082,"355 1st St UNIT 1905, San Francisco, CA",False,1.0,1.0,691.0,...,,2004.0,11/20/15,860000.00,$
2128308939,"33 Santa Cruz Ave, San Francisco, CA",False,3.0,3.0,1738.0,...,sqft,1976.0,12/10/15,830000.00,$
2131957929,"1821 Grant Ave, San Francisco, CA",False,2.0,2.0,1048.0,...,,1975.0,12/15/15,835000.00,$


> ### `SalePrice`: scale all amounts to `$M`

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)

In [54]:
df.SalePriceUnit.unique()

array(['$', '$M'], dtype=object)

In [55]:
df_1 = df[df.SalePriceUnit == '$']
df_1 = df_1.drop('SalePriceUnit', axis = 1)

# Scaling sale price to $M
df_1.SalePrice /= 10 ** 6

df_6 = df[df.SalePriceUnit == '$M']
df_6 = df_6.drop('SalePriceUnit', axis = 1)

In [56]:
# Concatenate the two DataFrames by rows
df = pd.concat([df_1, df_6])

In [57]:
# Re-sort the new DataFrame by its index (ID)
df.sort_index(inplace = True)
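
To see why the final `sort_index()` matters, here is a minimal sketch of the same split, scale, concatenate pattern on a toy `DataFrame`. The IDs and prices are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy data (made up), indexed by a hypothetical ID
toy = pd.DataFrame(
    {'SalePrice': [710000.0, 2.15, 970000.0],
     'SalePriceUnit': ['$', '$M', '$']},
    index=pd.Index([101, 102, 103], name='ID'))

# Split by unit and drop the unit column from each piece
in_dollars = toy[toy.SalePriceUnit == '$'].drop('SalePriceUnit', axis=1)
in_millions = toy[toy.SalePriceUnit == '$M'].drop('SalePriceUnit', axis=1)

# Scale the '$' rows to $M
in_dollars['SalePrice'] = in_dollars['SalePrice'] / 10 ** 6

# pd.concat stacks the pieces one after the other, so rows come out grouped by unit
stacked = pd.concat([in_dollars, in_millions])
print(stacked.index.tolist())               # [101, 103, 102]

# Sorting by index restores the original ID order
print(stacked.sort_index().index.tolist())  # [101, 102, 103]
```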

> ### `IsAStudio`: convert from a Boolean to a binary variable (i.e., 0 or 1)

In [58]:
df.IsAStudio *= 1

In [59]:
df.IsAStudio

ID
15063471      0
15063505      0
15063609      0
15064044      0
15064257      0
             ..
2124214951    1
2126960082    0
2128308939    0
2131957929    0
2136213970    0
Name: IsAStudio, dtype: object
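
Note that the output above still reports `dtype: object`. One possible explanation (an assumption, since the notebook does not check it) is that the column mixes Booleans with missing values. Under that assumption, an explicit `.astype(float)` is one way to get a numeric column while keeping the missing values. The toy `Series` below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy column mixing Booleans and a missing value (assumed shape of the data)
mixed = pd.Series([False, True, np.nan], dtype=object, name='IsAStudio')

# float can represent NaN, so the missing value survives the conversion
as_float = mixed.astype(float)

print(as_float.tolist())  # [0.0, 1.0, nan]
print(as_float.dtype)     # float64
```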

> ### `Size`

In [60]:
df.SizeUnit.unique()

array(['sqft', nan], dtype=object)

`Size` is either in square feet or missing, so no conversion is needed.  We only have to remove the `SizeUnit` column.

In [61]:
df.drop('SizeUnit', axis = 1, inplace = True)

> ### `LotSize`: scale all values to square feet

In [62]:
df.LotSizeUnit.unique()

array([nan, 'sqft', 'ac'], dtype=object)

Lot sizes are either missing, in square feet, or in acres.  Let's convert them all to square feet.

> Group #1: the `na` values:

In [63]:
df_na = df[df.LotSizeUnit.isnull()]
df_na = df_na.drop('LotSizeUnit', axis = 1)

df_na.shape[0]

444

> Group #2: the `sqft` values:

In [64]:
df_sqft = df[df.LotSizeUnit == 'sqft']
df_sqft = df_sqft.drop('LotSizeUnit', axis = 1)

df_sqft.shape[0]

552

> Group #3: the `ac` values:

In [65]:
df_ac = df[df.LotSizeUnit == 'ac']
df_ac = df_ac.drop('LotSizeUnit', axis = 1)

df_ac.shape[0]

4

> Let's scale these `acre` values into `sqft`:

In [66]:
# (1 acre = 43,560 sqft)

df_ac.LotSize *= 43560.

Let's now put everything back together...

In [67]:
df = pd.concat([df_na, df_sqft, df_ac]).sort_index()
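
The three group sizes above (444 + 552 + 4 = 1000) add up to the full dataset, which is worth checking before concatenating. A minimal sketch of the same steps on made-up toy data, with that check made explicit (the variable names here are illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing lot size, one in sqft, one in acres
toy = pd.DataFrame(
    {'LotSize': [np.nan, 2435.0, 0.5],
     'LotSizeUnit': [np.nan, 'sqft', 'ac']},
    index=pd.Index([101, 102, 103], name='ID'))

na_rows = toy[toy.LotSizeUnit.isnull()].drop('LotSizeUnit', axis=1)
sqft_rows = toy[toy.LotSizeUnit == 'sqft'].drop('LotSizeUnit', axis=1)
ac_rows = toy[toy.LotSizeUnit == 'ac'].drop('LotSizeUnit', axis=1)

# 1 acre = 43,560 sqft
ac_rows['LotSize'] = ac_rows['LotSize'] * 43560.

# The three groups should account for every row exactly once
assert len(na_rows) + len(sqft_rows) + len(ac_rows) == len(toy)

rebuilt = pd.concat([na_rows, sqft_rows, ac_rows]).sort_index()
print(rebuilt.LotSize.tolist())  # [nan, 2435.0, 21780.0]
```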

In [68]:
df

ID,Address,IsAStudio,Beds,Baths,Size,LotSize,BuiltInYear,DateOfSale,SalePrice
15063471,"55 Vandewater St APT 9, San Francisco, CA",0,1.0,,550.0,,1980.0,12/4/15,0.710
15063505,"740 Francisco St, San Francisco, CA",0,,2.0,1430.0,2435.0,1948.0,11/30/15,2.150
15063609,"819 Francisco St, San Francisco, CA",0,2.0,3.5,2040.0,3920.0,1976.0,11/12/15,5.600
15064044,"199 Chestnut St APT 5, San Francisco, CA",0,1.0,1.0,1060.0,,1930.0,12/11/15,1.500
15064257,"111 Chestnut St APT 403, San Francisco, CA",0,2.0,2.0,1299.0,,1993.0,1/15/16,0.970
...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",1,,1.0,264.0,,2012.0,1/15/16,0.390
2126960082,"355 1st St UNIT 1905, San Francisco, CA",0,1.0,1.0,691.0,,2004.0,11/20/15,0.860
2128308939,"33 Santa Cruz Ave, San Francisco, CA",0,3.0,3.0,1738.0,2299.0,1976.0,12/10/15,0.830
2131957929,"1821 Grant Ave, San Francisco, CA",0,2.0,2.0,1048.0,,1975.0,12/15/15,0.835


> ## `.to_csv()`: save the `DataFrame` into a `.csv` file

At the end of each phase (e.g., wrangling) of your data science project, it is a good idea to save your dataset to disk.  Then, for the next step, create a new Jupyter notebook and load your updated dataset.

In [69]:
df.to_csv(os.path.join('..', 'datasets', 'dataset-03-zillow.csv'), index_label = 'ID')
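
When the file is loaded again, `ID` comes back as an ordinary column unless `read_csv` is told to use it as the index. The sketch below illustrates this round trip with a made-up toy `DataFrame` and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Toy data (made up), indexed by a hypothetical ID
toy = pd.DataFrame({'SalePrice': [0.71, 2.15]},
                   index=pd.Index([101, 102], name='ID'))

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index_label='ID')
csv_text = buffer.getvalue()

# Default read: ID is a regular column and a fresh 0..n-1 index is created
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col, ID becomes the index again
indexed = pd.read_csv(io.StringIO(csv_text), index_col='ID')

print(plain.columns.tolist())    # ['ID', 'SalePrice']
print(indexed.columns.tolist())  # ['SalePrice']
print(indexed.index.name)        # ID
```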

## Part C | Advanced topics

### `.groupby()`

(http://pandas.pydata.org/pandas-docs/stable/groupby.html)

> What is the mean price of houses by number of bathrooms?

In [70]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow.csv'))

In [71]:
df[ ['Baths', 'SalePrice'] ].groupby('Baths').mean()

Baths,SalePrice
1.00,0.987656
1.10,1.420000
1.25,1.600000
1.50,1.223378
1.75,0.928000
...,...
6.50,16.000000
7.00,0.999000
7.50,5.530000
8.00,13.100000
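
Some of these means may rest on very few sales, so it can help to show the group size next to the mean. A minimal sketch on made-up toy data, using `.agg()` to compute both at once:

```python
import pandas as pd

# Toy data (made up): two sales with 1 bath, one with 2, one with 7
toy = pd.DataFrame({'Baths': [1.0, 1.0, 2.0, 7.0],
                    'SalePrice': [0.8, 1.2, 1.5, 0.999]})

# One row per Baths value, with the mean price and the number of sales behind it
summary = toy.groupby('Baths')['SalePrice'].agg(['mean', 'count'])
print(summary)
```

A mean backed by a `count` of 1 is just a single sale price and says little about that group.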


### `.map()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html)

Instead of splitting the `DataFrame` by unit and concatenating the pieces, we could have converted `SalePrice` to `$M` (and `Size` and `LotSize` to sqft) by mapping each unit to a scaling factor:

In [72]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow-transactions.csv'))

In [73]:
df.SalePriceUnit.unique()

array(['$', '$M'], dtype=object)

In [74]:
df.SalePriceUnit.map({'$': 1. / (10 ** 6), '$M': 1.})

0      0.000001
1      1.000000
2      0.000001
3      1.000000
4      1.000000
         ...   
995    0.000001
996    1.000000
997    0.000001
998    0.000001
999    0.000001
Name: SalePriceUnit, dtype: float64

In [75]:
df.SalePrice *= df.SalePriceUnit.map({'$': 1. / (10 ** 6), '$M': 1.})

In [76]:
df.SalePrice

0      0.650
1      1.150
2      0.665
3      2.100
4      1.350
       ...  
995    0.731
996    5.530
997    0.625
998    0.895
999    0.650
Name: SalePrice, dtype: float64

In [77]:
df.drop('SalePriceUnit', axis = 1, inplace = True)

In [78]:
df

Unnamed: 0,ID,DateOfSale,SalePrice
0,15165953,1/5/16,0.650
1,80749447,11/10/15,1.150
2,15155751,11/10/15,0.665
3,15143887,12/31/15,2.100
4,15117639,12/23/15,1.350
...,...,...,...
995,69819708,12/15/15,0.731
996,15076156,11/12/15,5.530
997,119685619,12/21/15,0.625
998,15113584,12/21/15,0.895


> ### Activity:  Using `.map()`, convert `Size` and `LotSize` to sqft.

In [79]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow-properties.csv'))

In [80]:
df.SizeUnit.unique()

array(['sqft', nan], dtype=object)

In [81]:
df.drop('SizeUnit', axis = 1, inplace = True)

In [82]:
df.LotSizeUnit.unique()

array([nan, 'sqft', 'ac'], dtype=object)

In [83]:
df.LotSize *= df.LotSizeUnit.map({'sqft': 1., 'ac': 43560.})

In [84]:
df.drop('LotSizeUnit', axis = 1, inplace = True)
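
One thing to keep in mind with a dictionary-based `.map()`: any value that is not a key of the dictionary, including a missing unit, maps to `NaN`, and the multiplication then turns the corresponding size into `NaN` as well. The toy values below are made up, and the `'hectare'` unit is hypothetical (it does not appear in the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up); 'hectare' is a hypothetical unit absent from the mapping
sizes = pd.Series([3000.0, 0.5, np.nan, 2.0])
units = pd.Series(['sqft', 'ac', np.nan, 'hectare'])

factors = units.map({'sqft': 1., 'ac': 43560.})
print(factors.tolist())            # [1.0, 43560.0, nan, nan]

# The last size is silently lost because its unit has no scaling factor
print((sizes * factors).tolist())  # [3000.0, 21780.0, nan, nan]
```

This is harmless here because `.unique()` showed only `nan`, `'sqft'`, and `'ac'`, but it is a reason to check the unique values before mapping.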

### `pd.to_datetime()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)

In [85]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-03-zillow.csv'))

In [86]:
df.DateOfSale

0       12/4/15
1      11/30/15
2      11/12/15
3      12/11/15
4       1/15/16
         ...   
995     1/15/16
996    11/20/15
997    12/10/15
998    12/15/15
999     1/10/16
Name: DateOfSale, dtype: object

So far, the dates stored in the `DataFrame` are just strings, so we cannot easily extract the day, month, or year.  Thankfully, `pandas` provides some facilities to do so.

In [87]:
pd.to_datetime(df.DateOfSale)

0     2015-12-04
1     2015-11-30
2     2015-11-12
3     2015-12-11
4     2016-01-15
         ...    
995   2016-01-15
996   2015-11-20
997   2015-12-10
998   2015-12-15
999   2016-01-10
Name: DateOfSale, dtype: datetime64[ns]

In [88]:
df.DateOfSale = pd.to_datetime(df.DateOfSale)
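
Strings such as `12/4/15` are ambiguous in principle (month first or day first). A minimal sketch, assuming the dates are always month/day/two-digit year as they appear above, that states the format explicitly instead of letting `pandas` infer it:

```python
import pandas as pd

# A few date strings copied from the output above
dates = pd.Series(['12/4/15', '11/30/15', '1/15/16'])

# %m = month, %d = day, %y = two-digit year
parsed = pd.to_datetime(dates, format='%m/%d/%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2015-12-04', '2015-11-30', '2016-01-15']
```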

### `.apply()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)

In [89]:
df.DateOfSale.apply(lambda date_of_sale: date_of_sale.year)

0      2015
1      2015
2      2015
3      2015
4      2016
       ... 
995    2016
996    2015
997    2015
998    2015
999    2016
Name: DateOfSale, dtype: int64

In [90]:
df['YearOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.year)
df['MonthOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.month)
df['DayOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.day)
# `weekday_name` was removed in pandas 1.0; `day_name()` returns the same strings
df['WeekDayOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.day_name())

df.drop('DateOfSale', axis = 1, inplace = True)
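
For datetime columns specifically, the `.dt` accessor offers a vectorized alternative to `.apply()` with a lambda. The sketch below rebuilds the same four features on three dates taken from the output above (the `parts` name is just for illustration):

```python
import pandas as pd

# Three dates copied from the DateOfSale output above
dates = pd.to_datetime(pd.Series(['12/4/15', '11/30/15', '1/15/16']),
                       format='%m/%d/%y')

# Each .dt attribute works on the whole Series at once
parts = pd.DataFrame({
    'YearOfSale': dates.dt.year,
    'MonthOfSale': dates.dt.month,
    'DayOfSale': dates.dt.day,
    'WeekDayOfSale': dates.dt.day_name(),
})

print(parts.WeekDayOfSale.tolist())  # ['Friday', 'Monday', 'Friday']
```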

Now, we have the day, day of the week, month, and year of the sale as features in our dataset.

In [91]:
df

Unnamed: 0,ID,Address,IsAStudio,Beds,Baths,...,SalePrice,YearOfSale,MonthOfSale,DayOfSale,WeekDayOfSale
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",0.0,1.0,,...,0.710,2015,12,4,Friday
1,15063505,"740 Francisco St, San Francisco, CA",0.0,,2.0,...,2.150,2015,11,30,Monday
2,15063609,"819 Francisco St, San Francisco, CA",0.0,2.0,3.5,...,5.600,2015,11,12,Thursday
3,15064044,"199 Chestnut St APT 5, San Francisco, CA",0.0,1.0,1.0,...,1.500,2015,12,11,Friday
4,15064257,"111 Chestnut St APT 403, San Francisco, CA",0.0,2.0,2.0,...,0.970,2016,1,15,Friday
...,...,...,...,...,...,...,...,...,...,...,...
995,2124214951,"412 Green St APT A, San Francisco, CA",1.0,,1.0,...,0.390,2016,1,15,Friday
996,2126960082,"355 1st St UNIT 1905, San Francisco, CA",0.0,1.0,1.0,...,0.860,2015,11,20,Friday
997,2128308939,"33 Santa Cruz Ave, San Francisco, CA",0.0,3.0,3.0,...,0.830,2015,12,10,Thursday
998,2131957929,"1821 Grant Ave, San Francisco, CA",0.0,2.0,2.0,...,0.835,2015,12,15,Tuesday
