# DS-SF-27 | Codealong 02 | Research Design and `pandas`

## Part A - Introduction to `pandas` with the SF Housing dataset

> ### Set up the environment

In [None]:
import os

import pandas as pd
pd.set_option('display.max_rows', 6)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

> ### Load data from files and the Web

`pandas` provides powerful facilities for easy retrieval of data from a variety of data sources.  In particular, it provides built-in support for loading data in `.csv` format, a common means of storing structured data in text files.

In [None]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-02-starter.csv'))
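As a minimal sketch of what `pd.read_csv` does, the example below parses a small in-memory CSV instead of a file on disk. The three rows and their values are hypothetical and only loosely mimic the housing dataset.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the real file has many more columns.
csv_text = """ID,Address,SalePrice
1,"1 Main St",710000
2,"2 Oak Ave",2150000
3,"3 Pine Ct",5600000
"""

# pd.read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))   # the DataFrame class
print(toy.shape)   # (rows, columns)
```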

> Let's verify the type of `df`:

In [None]:
# TODO

The result is a `DataFrame`.  A `DataFrame` stores tabular data:

In [None]:
df

> ### Shape of the `DataFrame`: number of rows and columns

In [None]:
df.shape

The first value is the number of rows, the second the number of columns.

In [None]:
df.shape[0]

In [None]:
df.shape[1]

(You can also use the idiomatic `len()` to get the number of rows.)

In [None]:
len(df)

> We can get the "names" of the rows of the `DataFrame` with the `index` property.

In [None]:
df.index

Here, rows are simply numbered from 0 up to, but not including, 1,000.

> We can also get the names of the different columns of the `DataFrame` with the `columns` property.

In [None]:
df.columns
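The `shape`, `index`, and `columns` properties can be tried out together on a small frame. This is a sketch on a hypothetical four-row `DataFrame`, not on the housing data.

```python
import pandas as pd

toy = pd.DataFrame({
    'BedCount': [1, 2, 3, 4],
    'BathCount': [1.0, 1.5, 2.0, 2.5],
})

n_rows, n_columns = toy.shape   # shape is a (rows, columns) tuple
assert len(toy) == n_rows       # len() counts rows, not cells

print(n_rows, n_columns)
print(list(toy.index))          # default row labels: 0, 1, 2, ...
print(list(toy.columns))        # column names, in order
```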

> ### Subsetting on columns of a `DataFrame`

Selecting data in specific columns of a `DataFrame` is performed by using the `[]` operator.

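In the version of `pandas` this codealong was written for, passing a list of integers (even a single-element one) to `[]` performs a location-based lookup of the columns. Current `pandas` releases instead match the integers against the column labels, so with string column names such a lookup raises a `KeyError`; to select by position there, use `df.iloc[:, [...]]` or `df[df.columns[[...]]]`.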
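How `[]` treats integers depends on the `pandas` version, so a position-explicit spelling can be safer. The sketch below, on a hypothetical toy frame, selects columns by position in two ways that do not rely on `[]` interpreting integers as locations.

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1 Main St', '2 Oak Ave'],
    'SalePrice': [0.71, 2.15],
    'SalePriceUnit': ['$M', '$M'],
})

# Columns at positions 1 and 2, selected by location.
by_iloc = toy.iloc[:, [1, 2]]

# Same selection: translate positions to labels, then use [].
by_label = toy[toy.columns[[1, 2]]]

print(by_iloc.equals(by_label))
```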

> E.g., columns #5 and #6:

In [None]:
# TODO

> Let's check that the column subsetting returns a `DataFrame`:

In [None]:
# TODO

A `DataFrame` can be made of a single column.

> E.g., column #7 only:

In [None]:
# TODO

If the values passed to `[]` are non-integers, the `DataFrame` will attempt to match them to those in the `columns` property.

> Let's subset the `DataFrame` on columns `SalePrice` and `SalePriceUnit`:

In [None]:
# TODO

However, you cannot mix integers and non-integers.  E.g.,

In [None]:
# "df[ ['SalePrice', 6] ]" errors out...  Try it!

Passing a single value rather than a list always results in a label-based lookup of the column, and returns a `Series`:

In [None]:
df['Address']

> Let's double check that the result is a `Series`:

In [None]:
# TODO

Columns can also be retrieved using "attribute access": a `DataFrame` exposes each column as a property named after that column.  However, this won't work for columns that have spaces or dots in their names.

> Let's check the value of `df`'s `Address` property:

In [None]:
# TODO

> Use the `name` property of a `Series` (not `columns`, which belongs to a `DataFrame`) to get the name of the variable stored inside it.

In [None]:
# TODO
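The difference between a list of labels, a single label, and attribute access can be seen side by side. This is a sketch on a hypothetical toy frame; the `Sale Price` column is invented to show a name that attribute access cannot reach.

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1 Main St', '2 Oak Ave'],
    'Sale Price': [710000, 2150000],   # hypothetical name containing a space
})

as_frame = toy[['Address']]    # list of labels -> DataFrame
as_series = toy['Address']     # single label   -> Series
via_attribute = toy.Address    # attribute access, same Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series.name)

# toy.Sale Price is a syntax error, so fall back to [] for such names.
print(toy['Sale Price'].sum())
```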

To find the zero-based location of a column, use the `.get_loc()` method of the `columns` property.  E.g.,

In [None]:
df.columns.get_loc('BedCount')

In [None]:
# Positional selection via .iloc works regardless of the pandas version
df.iloc[:, [df.columns.get_loc('BedCount')]]

> We should get the same output as subsetting a `DataFrame` on `BedCount`:

In [None]:
# TODO
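The round trip between a column's label and its position can be checked directly. A small sketch on a hypothetical toy frame, using `.iloc` for the positional side:

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1 Main St', '2 Oak Ave'],
    'BedCount': [2, 3],
    'BathCount': [1.0, 2.0],
})

position = toy.columns.get_loc('BedCount')   # zero-based position of the column
by_position = toy.iloc[:, [position]]
by_label = toy[['BedCount']]

print(position, by_position.equals(by_label))
```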

> ### Subsetting on rows and values of a `DataFrame` using the index

#### Slice using the `[]` operator

> E.g., the first five rows:

In [None]:
# TODO

> ### Select rows by index label and location: `.loc[]` and `.iloc[]`

Until now, the index of the `DataFrame` has been a numerical one starting from 0, but you can specify which column(s) should make up the index.  E.g., `ID`:

In [None]:
df = df.set_index('ID')

In [None]:
df.index

In [None]:
df

> E.g., row with index 15063505:

In [None]:
# TODO

The result is a `Series` whose `name` is the row's value in the index.

> E.g., rows with indices 15063505 and 15064044:

In [None]:
# TODO

> E.g., rows #1 and #3:

In [None]:
# TODO
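To make the contrast between label-based and location-based row selection concrete, here is a sketch on a hypothetical toy frame whose `ID` values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': [101, 102, 103, 104],          # hypothetical identifiers
    'Address': ['1 Main St', '2 Oak Ave', '3 Pine Ct', '4 Elm Rd'],
    'BedCount': [2, 3, 1, 4],
}).set_index('ID')

one_row = toy.loc[102]             # single label -> Series
two_rows = toy.loc[[102, 104]]     # list of labels -> DataFrame
by_position = toy.iloc[[1, 3]]     # zero-based positions -> DataFrame

print(one_row.name)                # the row's index label
print(two_rows.equals(by_position))
```

Labels 102 and 104 sit at positions 1 and 3, which is why both selections return the same rows.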

> ### Scalar lookup by label or location using `.at[]` and `.iat[]`

Scalar values can be looked up by label using `.at[]`, by passing the row label followed by the column name.

> E.g., row with index 15064044 and column `DateOfSale`.

In [None]:
# TODO

Scalar values can also be looked up by location using `.iat[]`, by passing the row location followed by the column location.

> E.g., row #3 and column #3:

In [None]:
# TODO
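Both scalar accessors can reach the same cell, one by label and one by position. A sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': [101, 102, 103],               # hypothetical identifiers
    'Address': ['1 Main St', '2 Oak Ave', '3 Pine Ct'],
    'BedCount': [2, 3, 1],
}).set_index('ID')

by_label = toy.at[103, 'BedCount']   # row label, then column name
by_location = toy.iat[2, 1]          # row position, then column position

print(by_label, by_location)
```

Once `ID` has moved into the index it no longer counts as a column, so `BedCount` is at column position 1.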

> ### Select rows of a `DataFrame` by Boolean selection

Rows can also be selected by using Boolean selection, using an array calculated from the result of applying a logical condition on the values in any of the columns.  This allows us to build more complicated selections than those based simply upon index labels or positions.

> E.g., which homes were built before 1900?

In [None]:
# TODO

This results in a Boolean `Series` that can be used to select the rows whose values are `True`.

> Let's subset on that `Series`:

In [None]:
# TODO

Multiple conditions can be combined with `&` (and) and `|` (or), wrapping each condition in parentheses.

> E.g., subset for `BuiltInYear` below 1900 and `Size` over 1500:

In [None]:
# TODO
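Boolean selection can be sketched on a hypothetical toy frame that borrows the `BuiltInYear` and `Size` column names; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1 Main St', '2 Oak Ave', '3 Pine Ct', '4 Elm Rd'],
    'BuiltInYear': [1895, 1950, 1888, 2004],
    'Size': [1200, 1800, 2400, 900],
})

is_old = toy.BuiltInYear < 1900      # Boolean Series, one value per row
old_homes = toy[is_old]              # keeps only the rows marked True

# Each condition needs its own parentheses because & binds tighter than <.
old_and_large = toy[(toy.BuiltInYear < 1900) & (toy.Size > 1500)]

print(is_old.tolist())
print(old_and_large.Address.tolist())
```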

It is possible to subset on rows and columns simultaneously.

> E.g., subset (a `DataFrame`) on `Address` for `BuiltInYear` below 1900 and `Size` over 1500:

In [None]:
# TODO

> To get a `Series` instead of a `DataFrame`:

In [None]:
# TODO
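One way to filter rows and pick columns in a single step is `.loc[]` with a Boolean mask. The sketch below uses a hypothetical toy frame, and `.loc[]` is only one of several possible spellings.

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1 Main St', '2 Oak Ave', '3 Pine Ct', '4 Elm Rd'],
    'BuiltInYear': [1895, 1950, 1888, 2004],
    'Size': [1200, 1800, 2400, 900],
})

mask = (toy.BuiltInYear < 1900) & (toy.Size > 1500)

as_frame = toy.loc[mask, ['Address']]   # list of columns -> DataFrame
as_series = toy.loc[mask, 'Address']    # single column   -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```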

## Part B - Wrangling the SF Housing dataset (take 2) with `pandas`

In [None]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-02-starter.csv'), index_col = 'ID')

(`pd.read_csv` can load the dataset and set the index column for the `DataFrame` at the same time)
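The equivalence between `index_col` and a separate `set_index` call can be checked on a small in-memory CSV. The two rows below are hypothetical.

```python
import io

import pandas as pd

csv_text = "ID,Address\n101,1 Main St\n102,2 Oak Ave\n"

# Load and set the index in one call...
in_one_step = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# ...or load first, then move the ID column into the index.
in_two_steps = pd.read_csv(io.StringIO(csv_text)).set_index('ID')

print(in_one_step.equals(in_two_steps))
```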

> ### Remove the `Latitude` and `Longitude` columns

In [None]:
df.drop(['Latitude', 'Longitude'], axis = 1, inplace = True)

In [None]:
df

> ### `SalePrice`: scale all values to dollars

In [None]:
df.SalePriceUnit.unique()

In [None]:
df_1 = df[df.SalePriceUnit == '$']

df_1 = df_1.drop('SalePriceUnit', axis = 1)
# Really a workaround, as DataFrame.drop() with inplace = True issues warnings...

df_6 = df[df.SalePriceUnit == '$M']

df_6 = df_6.drop('SalePriceUnit', axis = 1)

# Replacing the content of a column
df_6.SalePrice = df_6.SalePrice * (10 ** 6)

# Adding rows to a DataFrame
# Concatenation of two DataFrame objects
df = pd.concat([df_1, df_6])

In [None]:
df.sort_index()
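The split, scale, and concatenate pattern used above can be tried on a hypothetical three-row frame. The prices are invented, and the variable names `in_dollars` and `in_millions` are illustrative stand-ins for `df_1` and `df_6`.

```python
import pandas as pd

toy = pd.DataFrame(
    {'SalePrice': [710000.0, 2.5, 1.25], 'SalePriceUnit': ['$', '$M', '$M']},
    index=pd.Index([101, 102, 103], name='ID'),
)

# One sub-frame per unit; the unit column is no longer needed afterwards.
in_dollars = toy[toy.SalePriceUnit == '$'].drop('SalePriceUnit', axis=1)
in_millions = toy[toy.SalePriceUnit == '$M'].drop('SalePriceUnit', axis=1)

# Scale millions of dollars to dollars.
in_millions['SalePrice'] = in_millions['SalePrice'] * 10 ** 6

# Stack the pieces and restore the original row order.
result = pd.concat([in_dollars, in_millions]).sort_index()
print(result.SalePrice.tolist())
```

Note that `sort_index()` returns a new `DataFrame`, so its result has to be assigned if the sorted order should be kept.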

> ### `IsAStudio`: convert from a Boolean to a binary variable (i.e., 0 or 1)

In [None]:
# TODO

In [None]:
df
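One possible way to turn a Boolean column into 0/1 values is `astype(int)`. The sketch below assumes the column holds only `True` and `False`; a column with missing values would need separate handling first.

```python
import pandas as pd

# Hypothetical toy column; assumes no missing values.
toy = pd.DataFrame({'IsAStudio': [True, False, False]})

# True becomes 1 and False becomes 0.
toy['IsAStudio'] = toy['IsAStudio'].astype(int)

print(toy.IsAStudio.tolist())
```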

> ### `Size`

In [None]:
df.SizeUnit.unique()

Size is either in square feet or missing.  Almost no work is needed except to remove the size unit.

In [None]:
df.drop('SizeUnit', axis = 1, inplace = True)

> ### `LotSize`: scale all values to square feet

In [None]:
df.LotSizeUnit.unique()

Lot sizes are either in square feet or in acres.  Let's convert them all to square feet.

> Group #1: the `na` values:

In [None]:
df_na = df[df.LotSizeUnit.isnull()]
df_na = df_na.drop('LotSizeUnit', axis = 1)

df_na.shape[0]

> Group #2: the `sqft` values:

In [None]:
# TODO (use df_sqft)

> Group #3: the `ac` values:

In [None]:
# TODO (use df_ac)

> Let's scale these `acre` values into `sqft`:

In [None]:
# (1 acre = 43,560 sqft)

# TODO

Let's now put everything back together...

In [None]:
df = pd.concat([df_na, df_sqft, df_ac]).sort_index()

In [None]:
df
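The three-group treatment of lot sizes can be sketched end to end on a hypothetical four-row frame. The values are invented, and the names `missing`, `sqft`, and `acres` are illustrative stand-ins for `df_na`, `df_sqft`, and `df_ac`.

```python
import pandas as pd

SQFT_PER_ACRE = 43560   # 1 acre = 43,560 sqft

toy = pd.DataFrame(
    {'LotSize': [2500.0, 0.5, None, 2.0],
     'LotSizeUnit': ['sqft', 'ac', None, 'ac']},
    index=pd.Index([101, 102, 103, 104], name='ID'),
)

# Split by unit, dropping the unit column from each piece.
missing = toy[toy.LotSizeUnit.isnull()].drop('LotSizeUnit', axis=1)
sqft = toy[toy.LotSizeUnit == 'sqft'].drop('LotSizeUnit', axis=1)
acres = toy[toy.LotSizeUnit == 'ac'].drop('LotSizeUnit', axis=1)

# Convert acres to square feet.
acres['LotSize'] = acres['LotSize'] * SQFT_PER_ACRE

# The three groups should account for every row exactly once.
assert len(missing) + len(sqft) + len(acres) == len(toy)

result = pd.concat([missing, sqft, acres]).sort_index()
print(result)
```

Checking that the group sizes add up to the original row count is a cheap guard against a unit value that none of the three filters matched.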

> ### Save the `pandas` `DataFrame` to a `.csv` file

In [None]:
df.to_csv(os.path.join('..', 'datasets', 'zillow-02.csv'), index_label = 'ID')
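To see what `index_label` does, the sketch below writes a hypothetical two-row frame to an in-memory buffer rather than a file, then reads it back.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {'SalePrice': [710000.0, 2500000.0]},
    index=pd.Index([101, 102], name='ID'),
)

buffer = io.StringIO()
toy.to_csv(buffer, index_label='ID')   # index_label names the index column in the header

print(buffer.getvalue().splitlines()[0])   # header row of the written CSV

# Reading the text back with the same index column restores the frame.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='ID')
print(restored.equals(toy))
```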