# DS-SF-25 | Codealong 03 | Descriptive Statistics for Exploratory Data Analysis

In [None]:
import os
import math
import numpy as np

import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [None]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-03-starter.csv'), index_col = 'ID')

In [None]:
df

## Part A

- `.mean()`
- `.var()`, `.std()`

### `Series.mean()` - Compute the `Series` mean value

In [None]:
df.SalePrice.mean()

> What's `Size`'s mean?

In [None]:
# TODO

> What fraction of the properties sold in the dataset are studios?

In [None]:
# TODO

### `DataFrame.mean()` - Compute the `DataFrame` mean value

In [None]:
# TODO

`DataFrame.mean()` applies only to numerical columns.  The address and date of sale columns aren't included.
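To illustrate the point about numerical columns, here is a minimal sketch on a small made-up frame (the values are hypothetical and not taken from the Zillow file). Depending on the pandas version, text columns are either skipped silently or cause an error, so passing `numeric_only = True` makes the intent explicit.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Address': ['1 A St', '2 B St', '3 C St'],
    'SalePrice': [1.0, 2.0, 6.0],
    'Size': [500.0, 1000.0, 1500.0],
})

# Restrict the mean to numerical columns explicitly
toy_means = toy.mean(numeric_only = True)
print(toy_means)
```

The result is a `Series` indexed by column name, with no entry for `Address`.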

### `.var()` - Compute the unbiased variance (normalized by `N-1` by default)

In [None]:
# TODO
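As a sketch of what "normalized by `N-1`" means, the example below compares the default `.var()` with a manual computation and with the population variance (`ddof = 0`). The series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the dataset
beds = pd.Series([1, 2, 3, 4])

# Default: ddof = 1, i.e. the sum of squared deviations is divided by N - 1
sample_var = beds.var()

# Same computation written out by hand
manual_var = ((beds - beds.mean()) ** 2).sum() / (len(beds) - 1)

# ddof = 0 divides by N instead; this matches NumPy's default
population_var = beds.var(ddof = 0)
numpy_var = np.var(beds.to_numpy())

print(sample_var, manual_var, population_var, numpy_var)
```

Note that NumPy's `np.var` defaults to dividing by `N`, while pandas defaults to `N-1`, so the two libraries give different answers on the same data unless `ddof` is set.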

> What's the variance for the number of beds in the dataset?

In [None]:
# TODO

### `.std()` - Compute the unbiased standard deviation (normalized by `N-1` by default)

In [None]:
# TODO
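The standard deviation is the square root of the variance, with the same `N-1` default. A minimal check on made-up values:

```python
import math
import pandas as pd

# Hypothetical values, not taken from the dataset
beds = pd.Series([1, 2, 3, 4])

sample_std = beds.std()               # ddof = 1 by default
population_std = beds.std(ddof = 0)   # divide by N instead of N - 1

# .std() should agree with the square root of .var()
matches_var = math.isclose(sample_std, math.sqrt(beds.var()))
print(sample_std, population_std, matches_var)
```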

> What's the standard deviation for the number of beds in the dataset?

In [None]:
# TODO

## Part B

- `.median()`
- `.count()`, `.dropna()`, `.isnull()`
- `.min()`, `.max()`
- `.quantile()`
- `.describe()`

### `.median()` - Compute the median value

In [None]:
# TODO

> What's the median sale price for properties in the dataset?

In [None]:
# TODO

### `.count()` - Compute the number of rows/observations without `NaN` and `.sum()` - Compute the sum of the values

In [None]:
df.count()

In [None]:
df.IsAStudio.count()

`count()` counts the number of non-`NaN` values:

In [None]:
df.IsAStudio.dropna().shape[0]

In [None]:
df.IsAStudio.isnull().sum()

That leaves 14 properties for which we don't know whether they are studios.

In [None]:
df.IsAStudio.dropna().shape[0] + df.IsAStudio.isnull().sum()

In [None]:
df.IsAStudio.sum()

29 properties are studios.
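The relationship between `.count()`, `.isnull()`, and `.sum()` can be sketched on a small made-up 0/1 column with missing entries (the values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 indicator with two unknown entries
is_studio = pd.Series([1.0, 0.0, np.nan, 0.0, 1.0, np.nan])

known = is_studio.count()            # number of non-NaN values
missing = is_studio.isnull().sum()   # number of NaN values
studios = is_studio.sum()            # NaN values are skipped by default

print(known, missing, studios)
```

The known and missing counts always add up to the length of the series, which is what the `dropna().shape[0] + isnull().sum()` cell above checks.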

### `.min()` and `.max()` - Compute the minimum and maximum values

In [None]:
df.min()

> Which properties were sold at the lowest price?  At what price?

In [None]:
# TODO

In [None]:
df.max()

> Which properties were sold at the highest price?  At what price?

In [None]:
# TODO

### `.quantile()` - Compute values at the given quantile

In [None]:
df.quantile(.5)

In [None]:
df.median()

In [None]:
df.quantile(.25)

In [None]:
df.quantile(.75)
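As a small sketch of how the quartiles relate to the median, the example below uses a made-up series of five values. With pandas' default linear interpolation, the 0.5 quantile is the median, and the gap between the 0.75 and 0.25 quantiles is the interquartile range.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q1 = prices.quantile(.25)
q2 = prices.quantile(.5)
q3 = prices.quantile(.75)

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1

print(q1, q2, q3, iqr)
```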

### `.describe()` - Generate various summary statistics

In [None]:
df.describe()

In [None]:
df.SalePrice.describe()

## Part C

- Boxplots

In [None]:
df.SalePrice.plot(kind = 'box', figsize = (8, 8))

> In the same plot, plot the boxplots of `BedCount` and `BathCount`

In [None]:
# TODO

## Part D

- Histograms

In [None]:
df.BedCount.plot(kind = 'hist', figsize = (8, 8))

> In the same plot, plot the histograms of `BedCount` and `BathCount`

In [None]:
# TODO

## Part E

- `.mode()`

### `.mode()` - Compute the mode value(s)

In [None]:
df.mode()

From the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html): Gets the mode(s) of each element along the columns.  Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with `NaN`.  Note that there could be multiple values returned in the columns (when more than one value share the maximum frequency), which is the reason why a dataframe is returned.
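To see why a dataframe is returned, consider this sketch on made-up data: one column has two values tied for the highest frequency, while the other has a single mode, so the shorter column is padded with `NaN`.

```python
import pandas as pd

# Hypothetical toy data: BedCount has two modes (2 and 3), BathCount has one (1)
toy = pd.DataFrame({
    'BedCount': [1, 2, 2, 3, 3],
    'BathCount': [1, 1, 1, 2, 3],
})

modes = toy.mode()
print(modes)
```

The result has one row per mode; the second row of `BathCount` is `NaN` because that column has only one mode.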

In [None]:
df.Address[df.Address == '1 Mono St # B, San Francisco, CA']

In [None]:
df.Address[df.Address == '829 Folsom St UNIT 906, San Francisco, CA']

In [None]:
df[df.DateOfSale == '11/20/15'].shape[0]

In [None]:
(df.DateOfSale == '11/20/15').sum()

## Part F

- `.corr()`
- Heatmaps
- Scatter plots
- Scatter matrices

In [None]:
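# Keep only numerical columns; the numeric_only argument requires pandas 1.5 or later
corr = df.corr(numeric_only = True)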

corr
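As a reading aid for the correlation matrix, here is a sketch on made-up columns: two that rise together, and one that falls as the others rise. `.corr()` computes Pearson correlations by default, so perfectly linear relationships show up as values very close to 1 or -1.

```python
import pandas as pd

# Hypothetical toy data with exact linear relationships
toy = pd.DataFrame({
    'Size': [1.0, 2.0, 3.0, 4.0],
    'SalePrice': [2.0, 4.0, 6.0, 8.0],   # rises with Size
    'BedCount': [4.0, 3.0, 2.0, 1.0],    # falls as Size rises (made up)
})

toy_corr = toy.corr()
print(toy_corr)
```

The matrix is symmetric, and every column correlates perfectly with itself, which is why the diagonal is all ones.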

### Heatmaps

In [None]:
# TODO

Let's pretty this up.

In [None]:
list(corr.columns)

In [None]:
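figure = plt.figure()
subplot = figure.add_subplot(1, 1, 1)
figure.colorbar(subplot.matshow(corr))
# Pin one tick per column so that every label lines up with its row/column
subplot.set_xticks(range(len(corr.columns)))
subplot.set_yticks(range(len(corr.columns)))
subplot.set_xticklabels(list(corr.columns), rotation = 90)
subplot.set_yticklabels(list(corr.columns))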

### Scatter plots

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'scatter', x = 'BedCount', y = 'BathCount', s = 100, figsize = (8, 8))

### Scatter matrices

In [None]:
pd.plotting.scatter_matrix(df[ ['BedCount', 'BathCount'] ], diagonal = 'kde', s = 500, figsize = (8, 8))

In [None]:
pd.plotting.scatter_matrix(df[ ['SalePrice', 'Size'] ], s = 200, figsize = (8, 8))

## Part G

- `.value_counts()`
- `pd.crosstab()`

> Reproduce the `BedCount` histogram above.  For each possible bed count, how many properties share that bed count?

In [None]:
# TODO

> Be careful to check for `NaN` values

In [None]:
# TODO
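One way to see why `NaN` values matter here, sketched on a made-up series: `.value_counts()` drops missing values by default, so the counts may not add up to the number of rows unless `dropna = False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical bed counts with two missing entries
beds = pd.Series([1, 2, 2, np.nan, 3, np.nan])

counts = beds.value_counts()                          # NaN excluded by default
counts_with_nan = beds.value_counts(dropna = False)   # NaN gets its own row

# Sorting by index orders the table by bed count, like the bins of a histogram
ordered = counts.sort_index()

print(ordered)
print(counts_with_nan)
```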

> Create a frequency table for `BathCount` over `BedCount`.

In [None]:
# TODO
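A frequency table of two columns can be built with `pd.crosstab()`. The sketch below uses made-up data and assumes one reading of "over", with `BathCount` as rows and `BedCount` as columns; swap the arguments for the other orientation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'BedCount': [1, 1, 2, 2, 2],
    'BathCount': [1, 1, 1, 2, 2],
})

# Rows: BathCount values; columns: BedCount values; cells: number of properties
table = pd.crosstab(toy.BathCount, toy.BedCount)
print(table)
```

Combinations that never occur appear as zeros, and the cells sum to the number of rows that have both values present.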