# DS-SF-27 | Codealong 03 | Exploratory Data Analysis

In [1]:
import os

import math

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')



## Part A - Review and Activity | Subsetting with pandas

In [2]:
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank'],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [24, 34, 44, 41, 52, 43],
    'marital_status': [0, 2, 1, 2, 0, 1]}).\
        set_index('name')

In [3]:
df

name,age,gender,marital_status
Alice,24,Female,0
Bob,34,Male,2
Carol,44,Female,1
Dave,41,Male,2
Eve,52,Female,0
Frank,43,Male,1


> Question 1.  Subset the dataframe on the age and gender columns

In [6]:
df_sub = df[['age','gender']]
df_sub

name,age,gender
Alice,24,Female
Bob,34,Male
Carol,44,Female
Dave,41,Male
Eve,52,Female
Frank,43,Male


> Question 2.  Subset the dataframe on the age column alone, first as a `DataFrame`, then as a `Series`

In [9]:
df[['age']]

name,age
Alice,24
Bob,34
Carol,44
Dave,41
Eve,52
Frank,43


In [10]:
df['age']

name
Alice    24
Bob      34
Carol    44
Dave     41
Eve      52
Frank    43
Name: age, dtype: int64

> Question 3.  Subset the dataframe on the rows Bob and Carol

In [25]:
# df[(df.index == 'Bob') | (df.index == 'Carol')]  # 'name' is the index; masks combine with |, not or
# df[1:3]
df.loc[['Bob','Carol']]

name,age,gender,marital_status
Bob,34,Male,2
Carol,44,Female,1
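
The alternatives in the cell above can be compared side by side. The sketch below rebuilds the six-person frame from In [2] under the name `people` (a name chosen here for illustration) and selects the same two rows by label, by position, and by boolean mask. Because `name` is the index rather than a column, the mask has to test `people.index`, and element-wise conditions are combined with `|` rather than Python's `or`.

```python
import pandas as pd

# Same data as In [2]; 'people' is an illustrative name
people = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank'],
                       'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
                       'age': [24, 34, 44, 41, 52, 43],
                       'marital_status': [0, 2, 1, 2, 0, 1]}).set_index('name')

by_label = people.loc[['Bob', 'Carol']]      # label-based lookup on the index
by_position = people.iloc[1:3]               # rows 1 and 2, end-exclusive
by_mask = people[(people.index == 'Bob') | (people.index == 'Carol')]
by_isin = people[people.index.isin(['Bob', 'Carol'])]  # scales better to many labels
```

All four expressions return the same two-row `DataFrame`; `.loc` is usually the most readable when the labels are known in advance.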


> Question 4.  Subset the dataframe on the row Eve alone, first as a `DataFrame`, then as a `Series`

In [21]:
df.loc[['Eve']]

name,age,gender,marital_status
Eve,52,Female,0


In [23]:
df.loc['Eve']

age                   52
gender            Female
marital_status         0
Name: Eve, dtype: object

> Question 5.  How old is Frank?

In [26]:
# df.at['Frank','age']
df.age.Frank

43

## Part B

- `.mean()`
- `.var()`, `.std()`

In [27]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-03-starter.csv'), index_col = 'ID')

In [28]:
df

ID,Address,DateOfSale,SalePrice,IsAStudio,BedCount,BathCount,Size,LotSize,BuiltInYear
15063471,"55 Vandewater St APT 9, San Francisco, CA",12/4/15,710000.0,0.0,1.0,,550.0,,1980.0
15063505,"740 Francisco St, San Francisco, CA",11/30/15,2150000.0,0.0,,2.0,1430.0,2435.0,1948.0
15063609,"819 Francisco St, San Francisco, CA",11/12/15,5600000.0,0.0,2.0,3.5,2040.0,3920.0,1976.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",12/11/15,1500000.0,0.0,1.0,1.0,1060.0,,1930.0
15064257,"111 Chestnut St APT 403, San Francisco, CA",1/15/16,970000.0,0.0,2.0,2.0,1299.0,,1993.0
15064295,"111 Chestnut St APT 702, San Francisco, CA",12/17/15,940000.0,0.0,2.0,2.0,1033.0,,1993.0
15064391,"1821 Grant Ave APT 101, San Francisco, CA",12/15/15,835000.0,0.0,1.0,1.0,1048.0,,1975.0
15064536,"2300 Leavenworth St, San Francisco, CA",12/4/15,2830000.0,0.0,3.0,2.0,2115.0,1271.0,1913.0
15064640,"1047-1049 Lombard St, San Francisco, CA",1/14/16,4050000.0,1.0,,,4102.0,3049.0,1948.0
15064669,"1055 Lombard St # C, San Francisco, CA",12/31/15,2190000.0,0.0,2.0,3.0,1182.0,,1986.0


### `Series.mean()` - Compute the `Series` mean value

In [29]:
df.SalePrice.mean()

1397422.943

> What's `Size`'s mean?

In [30]:
df.Size.mean()

1641.3009307135471

> What fraction of the properties sold in the dataset are studios?

In [32]:
1. * len(df[df.IsAStudio > 0])/df.shape[0]

0.029
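
One thing worth checking with a ratio like this is the denominator. Dividing by `df.shape[0]` counts every row, including rows where the studio flag is missing, whereas `Series.mean()` skips `NaN`. The toy series below (made-up values, not the Zillow data) shows how the two can differ.

```python
import numpy as np
import pandas as pd

# Toy 0/1 flag with one missing value; illustrative data only
is_studio = pd.Series([0.0, 1.0, 0.0, np.nan, 0.0])

# Studios as a share of all rows (NaN rows stay in the denominator)
share_of_all_rows = (is_studio > 0).sum() / len(is_studio)   # 1 / 5

# Studios as a share of rows where the flag is known (NaN rows skipped)
share_of_known_rows = is_studio.mean()                       # 1 / 4
```

Which denominator is appropriate depends on the question being asked; the point is only that the two are not interchangeable when values are missing.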

In [33]:
df.count()

Address        1000
DateOfSale     1000
SalePrice      1000
IsAStudio       986
BedCount        836
BathCount       942
Size            967
LotSize         556
BuiltInYear     975
dtype: int64

In [36]:
df.isnull().sum().sum()

738

### `DataFrame.mean()` - Compute the `DataFrame` mean value

In [None]:
# TODO
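
As a rough illustration on made-up data (the `homes` frame and its addresses are hypothetical), `DataFrame.mean()` returns one mean per column as a `Series`. Passing `numeric_only=True` leaves out text columns such as an address, and `NaN` entries are skipped within each column.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame mixing text and numeric columns
homes = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St', '3 Example St'],
    'SalePrice': [700000.0, 900000.0, 1100000.0],
    'BedCount': [1.0, np.nan, 3.0],
})

# One mean per numeric column; the NaN in BedCount is ignored
column_means = homes.mean(numeric_only=True)
```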

### `.var()` - Compute the unbiased variance (normalized by `N-1` by default)

In [None]:
# TODO
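
To make the `N-1` normalization concrete, here is a small sketch on a toy series (values invented for illustration). It computes the sum of squared deviations by hand and compares it with `.var()`, whose `ddof` parameter controls the divisor `N - ddof`.

```python
import pandas as pd

beds = pd.Series([1.0, 2.0, 2.0, 3.0, 7.0])  # toy values, mean is 3.0

n = len(beds)
squared_deviations = ((beds - beds.mean()) ** 2).sum()  # 4 + 1 + 1 + 0 + 16 = 22

sample_variance = beds.var()            # default ddof=1: 22 / (5 - 1)
population_variance = beds.var(ddof=0)  # ddof=0: 22 / 5
```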

> What's the variance for the number of beds in the dataset?

In [None]:
# TODO

### `.std()` - Compute the unbiased standard deviation (normalized by `N-1` by default)

In [None]:
# TODO
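
The standard deviation is the square root of the variance, with the same `N-1` default. The sketch below checks that relationship on toy values and also shows that a `NaN` entry is skipped rather than propagated.

```python
import math

import numpy as np
import pandas as pd

beds = pd.Series([1.0, 2.0, np.nan, 2.0, 3.0, 7.0])  # toy values with one NaN

sample_std = beds.std()               # NaN is skipped by default
sample_var = beds.var()
std_without_nan = beds.dropna().std() # same result after dropping NaN explicitly
```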

> What's the standard deviation for the number of beds in the dataset?

In [None]:
# TODO

## Part C

- `.median()`
- `.count()`, `.dropna()`, `.isnull()`
- `.min()`, `.max()`
- `.quantile()`
- `.describe()`

### `.median()` - Compute the median value

In [None]:
# TODO

> What's the median sale price for properties in the dataset?

In [None]:
# TODO

### `.count()` - Compute the number of rows/observations without `NaN` and `.sum()` - Compute the sum of the values

In [None]:
df.count()

In [None]:
df.IsAStudio.count()

`count()` counts the number of non-`NaN` values:

In [None]:
df.IsAStudio.dropna().shape[0]

In [None]:
df.IsAStudio.isnull().sum()

That leaves 14 properties for which we don't know whether they are studios.

In [None]:
df.IsAStudio.dropna().shape[0] + df.IsAStudio.isnull().sum()

In [None]:
df.IsAStudio.sum()

29 properties are studios.
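
The bookkeeping in this section can be summarized on a toy flag column (invented values): the non-missing count plus the missing count equals the number of rows, and summing a 0/1 column counts the ones.

```python
import numpy as np
import pandas as pd

is_studio = pd.Series([0.0, 1.0, np.nan, 0.0, 1.0, np.nan])  # toy 0/1 flag

n_known = is_studio.count()           # non-NaN entries
n_missing = is_studio.isnull().sum()  # NaN entries
n_studios = is_studio.sum()           # NaN is skipped, so this counts the 1.0 values
```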

### `.min()` and `.max()` - Compute the minimum and maximum values

In [None]:
df.min()

> Which properties were sold at the lowest price?  At what price?

In [None]:
# TODO
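
One possible approach, sketched on a hypothetical four-row frame: take the minimum of the column, then use a boolean mask to keep every row that matches it. `idxmin()` is a shortcut, but it returns only the first matching index label, so ties are lost.

```python
import pandas as pd

# Hypothetical data; IDs, addresses, and prices are invented
homes = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St', '3 Example St', '4 Example St'],
    'SalePrice': [500000.0, 1200000.0, 500000.0, 900000.0],
}, index=pd.Index([101, 102, 103, 104], name='ID'))

lowest_price = homes.SalePrice.min()
cheapest = homes[homes.SalePrice == lowest_price]  # keeps all tied rows
first_cheapest_id = homes.SalePrice.idxmin()       # first label only
```

The same pattern with `.max()` and `.idxmax()` applies to the highest price.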

In [None]:
df.max()

> Which properties were sold at the highest price?  At what price?

In [None]:
# TODO

### `.quantile()` - Compute values at the given quantile

In [None]:
df.quantile(.5, numeric_only = True)

In [None]:
df.median(numeric_only = True)

In [None]:
df.quantile(.25, numeric_only = True)

In [None]:
df.quantile(.75, numeric_only = True)
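
As a small worked example on toy values, the 0.5 quantile coincides with the median, and the gap between the 0.75 and 0.25 quantiles is the interquartile range (the height of the box in a boxplot). `.quantile()` uses linear interpolation by default.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # toy values

q1 = prices.quantile(.25)
q2 = prices.quantile(.5)
q3 = prices.quantile(.75)

iqr = q3 - q1  # interquartile range
```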

### `.describe()` - Generate various summary statistics

In [None]:
df.describe()

In [None]:
df.SalePrice.describe()

## Part D

- Boxplots

In [None]:
df.SalePrice.plot(kind = 'box', figsize = (8, 8))

> In the same plot, plot the boxplots of `BedCount` and `BathCount`

In [None]:
# TODO

## Part E

- Histograms

In [None]:
df.BedCount.plot(kind = 'hist', figsize = (8, 8))

> In the same plot, plot the histograms of `BedCount` and `BathCount`

In [None]:
# TODO

## Part F

- `.mode()`

### `.mode()` - Compute the mode value(s)

In [None]:
df.mode()

From the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html): Gets the mode(s) of each element along the columns.  Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with `NaN`.  Note that there could be multiple values returned in the columns (when more than one value share the maximum frequency), which is the reason why a dataframe is returned.
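
A toy frame (invented values) makes the shape of the result easier to see: a column with two tied modes produces two rows, and a column with a single mode is padded with `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    'beds': [1, 2, 2, 3, 3],    # 2 and 3 both appear twice: two modes
    'baths': [1, 1, 1, 2, 3],   # 1 appears three times: one mode
})

modes = toy.mode()
# Row 0 holds the first mode of each column;
# row 1 holds the second mode of 'beds' and NaN for 'baths'.
```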

In [None]:
df.Address[df.Address == '1 Mono St # B, San Francisco, CA']

In [None]:
df.Address[df.Address == '829 Folsom St UNIT 906, San Francisco, CA']

In [None]:
df[df.DateOfSale == '11/20/15'].shape[0]

In [None]:
(df.DateOfSale == '11/20/15').sum()

## Part G

- `.corr()`
- Heatmaps
- Scatter plots
- Scatter matrices

In [None]:
df.select_dtypes(include = [np.number]).corr()
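
For intuition about what the entries mean, here is a sketch on a toy frame with hypothetical column names. `.corr()` computes pairwise Pearson correlations by default: columns that move together in a perfectly linear way score 1, and columns that move in exactly opposite directions score -1.

```python
import pandas as pd

toy = pd.DataFrame({
    'size': [1.0, 2.0, 3.0, 4.0],
    'price': [2.0, 4.0, 6.0, 8.0],  # exactly 2 * size
    'age': [4.0, 3.0, 2.0, 1.0],    # exactly 5 - size
})

toy_corr = toy.corr()  # symmetric matrix with 1.0 on the diagonal
```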

### Heatmaps

In [None]:
corr = df.select_dtypes(include = [np.number]).corr()

corr

In [None]:
# TODO

Let's pretty this up.

In [None]:
list(corr.columns)

In [None]:
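figure = plt.figure()
subplot = figure.add_subplot(1, 1, 1)
figure.colorbar(subplot.matshow(corr))
# Pin one tick per column so each label lines up with its matrix cell
subplot.set_xticks(range(len(corr.columns)))
subplot.set_yticks(range(len(corr.columns)))
subplot.set_xticklabels(list(corr.columns), rotation = 90)
subplot.set_yticklabels(list(corr.columns))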

### Scatter plots

In [None]:
df[ ['BedCount', 'BathCount'] ].plot(kind = 'scatter', x = 'BedCount', y = 'BathCount', s = 100, figsize = (8, 8))

### Scatter matrices

In [None]:
pd.plotting.scatter_matrix(df[ ['BedCount', 'BathCount'] ], diagonal = 'kde', s = 500, figsize = (8, 8))

In [None]:
pd.plotting.scatter_matrix(df[ ['SalePrice', 'Size'] ], s = 200, figsize = (8, 8))

## Part H

- `.value_counts()`
- `.crosstab()`

> Reproduce the `BedCount` histogram above.  For each possible bed count, how many properties share that bed count?

In [None]:
# TODO

> Be careful to check for `NaN` values!

In [None]:
# TODO
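
The reason for the warning can be seen on a toy series (invented values): `.value_counts()` drops `NaN` by default, so the counts do not add up to the number of rows unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

beds = pd.Series([1.0, 2.0, 2.0, np.nan, 3.0, np.nan, 2.0])  # toy values

counts = beds.value_counts().sort_index()              # NaN silently excluded
counts_with_missing = beds.value_counts(dropna=False)  # NaN gets its own row

n_missing = counts_with_missing.sum() - counts.sum()
```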

> Create a frequency table for `BathCount` over `BedCount`.

In [None]:
# TODO
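
A frequency table of one column against another can be built with `pd.crosstab`. The sketch below uses a hypothetical six-row frame and assumes that "`BathCount` over `BedCount`" means bath counts as rows and bed counts as columns; swap the arguments for the other orientation.

```python
import pandas as pd

# Hypothetical data; column names match the dataset, values are invented
homes = pd.DataFrame({
    'BedCount': [1, 1, 2, 2, 2, 3],
    'BathCount': [1, 1, 1, 2, 2, 2],
})

# Rows: BathCount values; columns: BedCount values; cells: number of properties
table = pd.crosstab(homes.BathCount, homes.BedCount)
```

Combinations that never occur are filled with 0, and rows with a `NaN` in either column are left out of the table.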