# Even More `pandas`

In [None]:
import pandas as pd
import numpy as np
import requests as rq
from sklearn.preprocessing import OneHotEncoder
from zipfile import ZipFile

## Agenda

SWBAT:

- Use `pandas.set_option()` to adjust display options;
- Use `.pivot_table()`, `.join()`, `.merge()`, `pd.concat()`, and `pd.melt()` to reshape and combine DataFrames;
- Perform one-hot encoding on categorical columns of a DataFrame.

We'll work with the Austin Animal Center dataset and with data from King County's Department of Assessments (Seattle housing data).

## `pandas.set_option()`

We can adjust how `pandas` works by setting options in advance.

### Block Scientific Notation

For example, suppose we want to prevent numbers from being displayed in scientific notation.

In [None]:
df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])
df

Then we can use:

In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
df

### See More Rows

Or suppose we want `pandas` to show more rows.

In [None]:
df2 = pd.DataFrame(np.arange(100))
df2

In that case we can use:

In [None]:
pd.set_option('display.max_rows', 100)

In [None]:
df2

For complete documentation, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).
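Options set with `pd.set_option()` stay in effect for the rest of the session. As a minimal sketch of two ways to limit or undo a change, the cell below uses `pd.option_context()` for a temporary override and `pd.reset_option()` to return to the pandas default. The variable names are only illustrative.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards.
rows_before = pd.get_option('display.max_rows')

# Temporary override: the old value comes back when the block exits.
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')
rows_after = pd.get_option('display.max_rows')

# Permanent change, then a reset to the library default.
pd.set_option('display.max_rows', 100)
pd.reset_option('display.max_rows')
rows_reset = pd.get_option('display.max_rows')
```

The context manager is handy when only a single cell needs the wider display.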

## Austin Animal Center

In [None]:
data = rq.get('https://data.austintexas.gov/resource/9t4d-g238.json').text

In [None]:
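# Recent pandas versions warn when a raw JSON string is passed directly,
# so wrap the text in a file-like object first.
from io import StringIO

animals = pd.read_json(StringIO(data))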

In [None]:
animals.head()

### Reshaping a DataFrame

#### .pivot_table()

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

Grouping by two different columns can be very helpful.

In [None]:
animals.groupby(by=['outcome_type', 'sex_upon_outcome']).agg(len)

But it has the unsavory side effect of creating a two-level index. This can be a good time to use `.pivot_table()`.

(There is also a `.pivot()`. For the somewhat subtle differences, see [here](https://stackoverflow.com/questions/30960338/pandas-difference-between-pivot-and-pivot-table-why-is-only-pivot-table-workin).)

In [None]:
animals.pivot_table(index='outcome_type', columns='sex_upon_outcome', aggfunc=len)
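To make the difference between `.pivot()` and `.pivot_table()` concrete, here is a small sketch on a made-up table (the column names and values are invented for illustration, not taken from the Austin data). `.pivot()` only reshapes, so it fails when an index/column pair occurs more than once, while `.pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Toy data: the pair ('Adoption', 'Male') appears twice.
toy = pd.DataFrame({
    'outcome': ['Adoption', 'Adoption', 'Transfer', 'Adoption'],
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'age_years': [2, 3, 1, 4],
})

# .pivot() cannot decide which duplicate to keep, so it raises.
try:
    toy.pivot(index='outcome', columns='sex', values='age_years')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# .pivot_table() resolves duplicates with an aggregation function.
table = toy.pivot_table(index='outcome', columns='sex',
                        values='age_years', aggfunc='mean')

# A count table can also be built from groupby by unstacking one index level.
counts = toy.groupby(['outcome', 'sex']).size().unstack(fill_value=0)
```

Combinations that never occur show up as `NaN` in `table` and as `0` in `counts`, because `fill_value=0` was passed to `.unstack()`.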

### Methods for Combining and Reshaping DataFrames: .join(), .merge(), pd.concat(), pd.melt()

#### .join()

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

toy1

In [None]:
toy2

In [None]:
toy1.set_index('age').join(toy2.set_index('age'))

For more on this method, check out the [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)!
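`.join()` matches rows on the index and performs a left join by default. The sketch below extends the toy idea with one extra row (age 21, invented here) that has no partner in the second table, to show how the `how` parameter changes the result.

```python
import pandas as pd

left = pd.DataFrame({'age': [63, 33, 21], 'HP': [142, 47, 80]}).set_index('age')
right = pd.DataFrame({'age': [63, 33], 'MP': [100, 200]}).set_index('age')

# Default how='left': every row of `left` is kept; missing matches become NaN.
left_joined = left.join(right)

# how='inner': only index values present in both tables survive.
inner_joined = left.join(right, how='inner')
```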

#### .merge()

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

In [None]:
states = pd.read_csv('data/states.csv', index_col=0)
states

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='outer')
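When comparing inner and outer merges, it can help to see where each row came from. The following sketch uses two invented tables (the names and values are hypothetical stand-ins, not the contents of `ds_chars.csv` or `states.csv`) together with the `indicator=True` argument of `.merge()`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
people = pd.DataFrame({'name': ['Ana', 'Ben', 'Cal'],
                       'home_state': ['WA', 'TX', 'OR']})
regions = pd.DataFrame({'state': ['WA', 'TX', 'NY'],
                        'region': ['West', 'South', 'Northeast']})

# An outer merge keeps unmatched rows from both sides;
# `_merge` records 'both', 'left_only', or 'right_only' for each row.
merged = people.merge(regions,
                      left_on='home_state',
                      right_on='state',
                      how='outer',
                      indicator=True)
```

Rows marked `both` are exactly the ones an inner merge would return.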

#### pd.concat()

This function takes a *list* of pandas objects as its first argument.

In [None]:
ds_full = pd.concat([ds_chars, states])
ds_full

`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this particular function I need to specify whether I want to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`, so let's override that!

In [None]:
ds_full = pd.concat([ds_chars, states], axis=1)
ds_full
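The effect of `axis` is easiest to see from the resulting shapes. This is a small sketch with invented DataFrames; it also shows why stacking tables with different columns produces missing values.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})
side = pd.DataFrame({'c': [9, 10]})

# axis=0 (the default): rows are stacked; ignore_index renumbers them 0..3.
stacked = pd.concat([top, bottom], ignore_index=True)

# axis=1: columns are placed side by side, aligned on the index.
widened = pd.concat([top, side], axis=1)

# Stacking tables whose columns differ fills the gaps with NaN.
mismatched = pd.concat([top, side])
```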

#### pd.melt()

Melting reshapes a DataFrame from wide to long format: each column name goes into a 'variable' column and each cell's contents go into a 'value' column.

In [None]:
pd.melt(ds_full)

[Here](https://towardsdatascience.com/transforming-data-in-python-with-pandas-melt-854221daf507) is a use case for `pd.melt()`.
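In practice, melting is usually done with an identifier column held fixed. Here is a minimal sketch on an invented wide table; `id_vars`, `var_name`, and `value_name` are real `pd.melt()` parameters, while the data and column names are made up.

```python
import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'],
                     'HP': [142, 47],
                     'MP': [100, 200]})

# Keep `name` as an identifier; turn the remaining columns into rows.
long = pd.melt(wide, id_vars='name', var_name='stat', value_name='points')
```

Each original row now contributes one row per melted column, so two rows and two stat columns yield four rows.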

## Making Use of Categories: One-Hot Encoding

Pandas has a one-hot encoder called `get_dummies()`, which is good for exploratory data analysis (EDA).

This might be good to use if we're in the **data-understanding** stage (Stage 2) of our CRISP-DM process.

We can call it on a DataFrame as a whole or on a Series (column).

In [None]:
pd.get_dummies(animals['animal_type'])
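The difference between calling `get_dummies()` on a Series and on a whole DataFrame can be sketched with a small invented table. With a DataFrame, the `columns` argument selects what to encode, the remaining columns pass through unchanged, and the new column names are prefixed with the original column name.

```python
import pandas as pd

pets = pd.DataFrame({'animal_type': ['Dog', 'Cat', 'Dog'],
                     'age_years': [2, 5, 1]})

# On a Series: one indicator column per category, named after the category.
from_series = pd.get_dummies(pets['animal_type'])

# On a DataFrame: encode only the listed columns and keep the rest.
from_frame = pd.get_dummies(pets, columns=['animal_type'])
```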

If however we're in a later stage of the process and we're interested, say, in preparing a data pipeline, `pandas.get_dummies()` will prove inferior to other tools.

In practice, we will **not** use `pandas.get_dummies()`. The library Scikit-Learn (`sklearn`, included with your Anaconda installation) has a `OneHotEncoder` class that creates an object that persists. This makes it much more apt for production environments, and so it's good to get in the habit of using it.

Ultimately, we will use **many** tools from `sklearn`.

In [None]:
ohe = OneHotEncoder()

ohe.fit(animals[['animal_type']])

Now that the `OneHotEncoder` object has been fitted to our data, it has newly available attributes and methods. In particular, it has access to the different categories that we're replacing:

In [None]:
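# get_feature_names() was removed from recent scikit-learn releases;
# get_feature_names_out() is its replacement.
ohe.get_feature_names_out()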

We'll have much more to say about `sklearn` syntax and about Python's object structure. But let's now transform our data to see what the new table looks like:

In [None]:
ohe.transform(animals[['animal_type']])

To save storage space, the result is returned as a **sparse matrix**, but we can "re-inflate" it if we want to see it in tabular form:

In [None]:
types_encoded = ohe.transform(animals[['animal_type']]).toarray()
types_encoded

Let's put it into a DataFrame:

In [None]:
pd.DataFrame(types_encoded, columns=ohe.get_feature_names_out()).head()

## King County Assessments

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2020, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

Go [here](https://info.kingcounty.gov/assessor/DataDownload/default.aspx) and download two files: "Real Property Sales" and "Residential Building". Then unzip them. (Or you can run the cells below if you prefer.)

In [None]:
# %%bash
# cd data
# curl -o property_sales.zip https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip 

In [None]:
# %%bash
# cd data
# curl -o res_bldg.zip https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip 

In [None]:
# zf = ZipFile('data/property_sales.zip', 'r')
# zf.extractall('data')
# zf.close()

In [None]:
# zf = ZipFile('data/res_bldg.zip', 'r')
# zf.extractall('data')
# zf.close()

In [None]:
# You'll need to use a new encoding here. List of
# all encodings here:
# https://docs.python.org/3/library/codecs.html#standard-encodings

sales_df = pd.read_csv('/Users/gdamico/Downloads/EXTR_RPSale.csv',
                       encoding='mac_roman')

### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.

In [None]:
sales_df.head().T
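One possible response to a mixed-type warning is to tell `read_csv` which type to use for the affected columns. The sketch below is only an illustration on a tiny in-memory CSV with invented values (it does not read the King County file). Passing `dtype` keeps identifier-like columns as text, which also preserves leading zeros.

```python
from io import StringIO

import pandas as pd

# Invented sample: identifier columns that look partly numeric.
csv_text = "row_id,Major,Minor\n1,000123,0045\n2,ABC456,0007\n"

# Without guidance, pandas infers a type per column;
# here Minor is read as integers and loses its leading zeros.
inferred = pd.read_csv(StringIO(csv_text))

# With an explicit dtype, both columns stay as strings.
as_text = pd.read_csv(StringIO(csv_text), dtype={'Major': str, 'Minor': str})
```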

### Data overload?

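That's a lot of columns. From this table we need only the date and sale price of each sale, plus the `Major` and `Minor` columns that we'll use to match each sale to its building record (the square footage lives in the building data). What can we do?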

In [None]:
sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']]

In [None]:
sales_df.info()

In [None]:
bldg_df = pd.read_csv('/Users/gdamico/Downloads/EXTR_ResBldg.csv')

### Another warning! Which column has index 11?

In [None]:
bldg_df.columns[11]

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [None]:
bldg_df.head().T

### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [None]:
bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']]

In [None]:
bldg_df.info()

In [None]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [None]:
sales_data.head()
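Merging on a pair of key columns works the same way as merging on one, but if a key pair repeats in one table, matching rows are duplicated in the result. The sketch below uses invented numbers; whether the real building file contains repeated `Major`/`Minor` pairs is something to check rather than assume. The `validate` argument of `pd.merge()` is one way to check.

```python
import pandas as pd

sales = pd.DataFrame({'Major': [1, 1],
                      'Minor': [10, 20],
                      'SalePrice': [500000, 650000]})

# The pair (1, 20) appears twice in the building table.
buildings = pd.DataFrame({'Major': [1, 1, 1],
                          'Minor': [10, 20, 20],
                          'SqFtTotLiving': [1200, 900, 1100]})

# The sale for (1, 20) is matched to both building rows.
merged = pd.merge(sales, buildings, on=['Major', 'Minor'])

# validate='many_to_one' raises if the right-hand keys are not unique.
try:
    pd.merge(sales, buildings, on=['Major', 'Minor'], validate='many_to_one')
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
```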

In [None]:
sales_data.info()

We can see right away that we're missing zip codes for many of the sales transactions.

In [None]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Because we are interested in finding houses in Seattle ZIP codes, we will need to drop the rows with missing zip codes.

In [None]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

sales_data.head()
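The boolean-mask approach above has a more compact equivalent in `.dropna()` with its `subset` argument. Here is a quick sketch on invented data showing that the two give the same result.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'SalePrice': [500000, 650000, 700000],
                    'ZipCode': ['98101', np.nan, '98115']})

# Keep rows where ZipCode is present, using a boolean mask.
by_mask = toy.loc[~toy['ZipCode'].isna(), :]

# Same filter: drop rows with a missing value in the listed column(s).
by_dropna = toy.dropna(subset=['ZipCode'])
```

Both keep the original index labels, so the surviving rows here are labeled 0 and 2.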

## Time Permitting: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [None]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
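def is_integer(x):
    """Return True if x can be converted to an int, False otherwise."""
    try:
        int(x)
    except (ValueError, TypeError):
        return False
    return True

# Show the ZipCode values that cannot be converted to integers.
sales_data.loc[~sales_data['ZipCode'].apply(is_integer), 'ZipCode'].head()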

### 3. Add a column for PricePerSqFt



### 4. Subset the data to 2020 sales only.

We can assume that the DocumentDate is approximately the sale date.

### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

### 6. What is the mean price per square foot for a house sold in Seattle in 2020?

Don't just type the answer. Type code that generates the answer as output!