# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Boston Housing dataset, which contains housing values in the suburbs of Boston. The Boston housing data is commonly used by aspiring Data Scientists.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (which is stored in the file `'train.csv'`)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split. Examples of potential splits:
    - Create two new DataFrames based on your existing data, where one contains all the properties next to the Charles river, and the other one contains properties that aren't 
    - Create two new DataFrames based on a certain split for crime rate 
* Next, use histograms and scatter plots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
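
The two-way split described above can be sketched as follows. This is a minimal illustration rather than a model answer: the tiny `toy` DataFrame and its values are made up for the example (only the column names `chas` and `medv` come from the dataset), and in the project you would apply the same selection to the DataFrame loaded from `train.csv`.

```python
import pandas as pd

# Made-up illustrative rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "chas": [1, 0, 0, 1, 0, 0],
    "medv": [30.0, 20.0, 22.0, 34.0, 18.0, 24.0],
})

# Boolean masks with .loc give two complementary subsets.
by_river = toy.loc[toy["chas"] == 1]
not_by_river = toy.loc[toy["chas"] == 0]

# Centrality (mean, median) and dispersion (std) of medv for each subset.
summary = pd.DataFrame({
    "by_river": by_river["medv"].agg(["mean", "median", "std"]),
    "not_by_river": not_by_river["medv"].agg(["mean", "median", "std"]),
})
print(summary)
```

Because the two masks are exact complements, every row lands in exactly one subset, which is worth checking with `len()` whenever you build a split by hand.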

## Variable Descriptions

This DataFrame contains the following columns:

- `crim`: per capita crime rate by town  
- `zn`: proportion of residential land zoned for lots over 25,000 sq.ft  
- `indus`: proportion of non-retail business acres per town   
- `chas`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)  
- `nox`: nitrogen oxide concentration (parts per 10 million)   
- `rm`: average number of rooms per dwelling   
- `age`: proportion of owner-occupied units built prior to 1940  
- `dis`: weighted mean of distances to five Boston employment centers   
- `rad`: index of accessibility to radial highways   
- `tax`: full-value property-tax rate per \$10,000   
- `ptratio`: pupil-teacher ratio by town    
- `b`: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town   
- `lstat`: lower status of the population (percent)   
- `medv`: median value of owner-occupied homes in \$1000s
  
    
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.


In [1]:
#Everything in this box is purely exploratory. Sandbox. Dinosaurs and sandcastles.

%matplotlib notebook
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Min-max normalize every column of a DataFrame to the range [0, 1]
def normalize(df):
    result = df.copy()
    for feature in result.columns:
        max_value = result[feature].max()
        min_value = result[feature].min()
        result[feature] = (result[feature] - min_value) / (max_value - min_value)
    return result

df = pd.read_csv('train.csv')
# df.info()
# pd.plotting.scatter_matrix(df)
selected_df = df.loc[:, ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv']]
normalized_selected = normalize(selected_df)
# normalized_selected.plot('zn', 'crim', kind='scatter')
# df_crime.plot(kind='hist')
pd.plotting.scatter_matrix(normalized_selected)
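
One caveat with min-max scaling as written in `normalize`: a column whose minimum equals its maximum produces 0/0 and becomes all `NaN`. A hedged sketch of one way to guard against that is below. `normalize_safe` is a hypothetical name, the values are made up, and mapping constant columns to 0.0 is an assumption rather than something the project prescribes.

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each numeric column to [0, 1].

    Assumption: constant columns (max == min) are mapped to 0.0
    instead of NaN.
    """
    col_min = df.min()
    col_range = df.max() - col_min
    # Replace zero ranges with 1 so the division yields 0.0 rather than NaN.
    safe_range = col_range.where(col_range != 0, 1)
    return (df - col_min) / safe_range

# Made-up illustrative values.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "constant": [5.0, 5.0, 5.0]})
scaled = normalize_safe(demo)
print(scaled)
```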


In [2]:
# Summary, plus measures of centrality (mean, median) and dispersion (std), for all variables
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 15 columns):
ID         333 non-null int64
crim       333 non-null float64
zn         333 non-null float64
indus      333 non-null float64
chas       333 non-null int64
nox        333 non-null float64
rm         333 non-null float64
age        333 non-null float64
dis        333 non-null float64
rad        333 non-null int64
tax        333 non-null int64
ptratio    333 non-null float64
b          333 non-null float64
lstat      333 non-null float64
medv       333 non-null float64
dtypes: float64(11), int64(4)
memory usage: 39.1 KB


| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 250.951952 | 3.360341 | 10.689189 | 11.293483 | 0.06006 | 0.557144 | 6.265619 | 68.226426 | 3.709934 | 9.633634 | 409.279279 | 18.448048 | 359.466096 | 12.515435 | 22.768769 |
| std | 147.859438 | 7.352272 | 22.674762 | 6.998123 | 0.237956 | 0.114955 | 0.703952 | 28.133344 | 1.981123 | 8.742174 | 170.841988 | 2.151821 | 86.584567 | 7.067781 | 9.173468 |
| min | 1.0 | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 3.561 | 6.0 | 1.1296 | 1.0 | 188.0 | 12.6 | 3.5 | 1.73 | 5.0 |
| 25% | 123.0 | 0.07896 | 0.0 | 5.13 | 0.0 | 0.453 | 5.884 | 45.4 | 2.1224 | 4.0 | 279.0 | 17.4 | 376.73 | 7.18 | 17.4 |
| 50% | 244.0 | 0.26169 | 0.0 | 9.9 | 0.0 | 0.538 | 6.202 | 76.7 | 3.0923 | 5.0 | 330.0 | 19.0 | 392.05 | 10.97 | 21.6 |
| 75% | 377.0 | 3.67822 | 12.5 | 18.1 | 0.0 | 0.631 | 6.595 | 93.8 | 5.1167 | 24.0 | 666.0 | 20.2 | 396.24 | 16.42 | 25.0 |
| max | 506.0 | 73.5341 | 100.0 | 27.74 | 1.0 | 0.871 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 37.97 | 50.0 |
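
`describe()` reports the mean and the median (the `50%` row) as measures of centrality and the standard deviation as a measure of dispersion. As a sketch of how to build a more focused table for three variables, including the interquartile range, something like the following could work. The `toy` values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up illustrative values; column names match the dataset.
toy = pd.DataFrame({
    "crim": [0.1, 0.2, 0.3, 0.4, 9.0],
    "rm": [5.0, 6.0, 6.0, 7.0, 8.0],
    "medv": [10.0, 20.0, 22.0, 24.0, 50.0],
})

cols = ["crim", "rm", "medv"]

# Centrality (mean, median) and dispersion (std) in one call.
stats = toy[cols].agg(["mean", "median", "std"])

# Interquartile range: 75th percentile minus 25th percentile.
stats.loc["iqr"] = toy[cols].quantile(0.75) - toy[cols].quantile(0.25)

print(stats)
```

The interquartile range is a useful companion to the standard deviation for skewed columns, because a single extreme value barely moves it.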


In [8]:
# A DataFrame, df_high_crime, that captures towns at or above the 75th percentile for crime rate,
# and its counterpart, df_low_crime, for everything else.
crime_rate_75 = df['crim'].quantile(.75)
df_high_crime = df.loc[df['crim'] >= crime_rate_75]
df_low_crime = df.loc[df['crim'] < crime_rate_75]
df_high_crime.describe()

| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 |
| mean | 423.27381 | 12.027551 | 0.0 | 18.1 | 0.059524 | 0.677988 | 5.920393 | 90.807143 | 2.072092 | 24.0 | 666.0 | 20.2 | 297.187976 | 18.610714 | 16.85119 |
| std | 39.359435 | 10.634535 | 0.0 | 7.148103e-15 | 0.238024 | 0.059953 | 0.701382 | 10.392198 | 0.569271 | 0.0 | 0.0 | 0.0 | 141.369261 | 6.984341 | 9.094517 |
| min | 357.0 | 3.67822 | 0.0 | 18.1 | 0.0 | 0.532 | 3.561 | 53.2 | 1.1296 | 24.0 | 666.0 | 20.2 | 3.5 | 2.96 | 5.0 |
| 25% | 387.75 | 5.685405 | 0.0 | 18.1 | 0.0 | 0.631 | 5.55775 | 87.55 | 1.638925 | 24.0 | 666.0 | 20.2 | 233.1325 | 13.9975 | 11.45 |
| 50% | 428.5 | 9.08499 | 0.0 | 18.1 | 0.0 | 0.693 | 6.1205 | 94.6 | 1.98315 | 24.0 | 666.0 | 20.2 | 379.04 | 18.08 | 14.75 |
| 75% | 458.25 | 13.9941 | 0.0 | 18.1 | 0.0 | 0.713 | 6.38375 | 98.825 | 2.465825 | 24.0 | 666.0 | 20.2 | 396.9 | 22.97 | 20.025 |
| max | 488.0 | 73.5341 | 0.0 | 18.1 | 1.0 | 0.77 | 7.393 | 100.0 | 3.5459 | 24.0 | 666.0 | 20.2 | 396.9 | 37.97 | 50.0 |


In [9]:
df_low_crime.describe()

| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 |
| mean | 192.819277 | 0.436463 | 14.295181 | 8.997309 | 0.060241 | 0.516378 | 6.38208 | 60.608835 | 4.262459 | 4.787149 | 322.674699 | 17.857028 | 380.475582 | 10.459197 | 24.76506 |
| std | 123.685066 | 0.698417 | 25.229635 | 6.676523 | 0.238412 | 0.099361 | 0.66691 | 28.150636 | 1.98259 | 2.970092 | 96.018031 | 2.192794 | 39.882979 | 5.806747 | 8.313389 |
| min | 1.0 | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 4.926 | 6.0 | 1.3216 | 1.0 | 188.0 | 12.6 | 70.8 | 1.73 | 8.1 |
| 25% | 90.0 | 0.05735 | 0.0 | 4.05 | 0.0 | 0.439 | 5.949 | 36.8 | 2.5979 | 4.0 | 264.0 | 16.4 | 383.37 | 6.21 | 19.4 |
| 50% | 179.0 | 0.1403 | 0.0 | 6.91 | 0.0 | 0.504 | 6.245 | 62.2 | 3.9454 | 4.0 | 305.0 | 18.0 | 392.8 | 9.45 | 22.9 |
| 75% | 280.0 | 0.40771 | 21.0 | 10.59 | 0.0 | 0.547 | 6.718 | 86.5 | 5.5027 | 5.0 | 391.0 | 19.2 | 396.21 | 13.44 | 28.0 |
| max | 506.0 | 3.56868 | 100.0 | 27.74 | 1.0 | 0.871 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 34.41 | 50.0 |
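
To compare the two subsets without scrolling between separate `describe()` outputs, one option is to place the summaries side by side. The sketch below uses a hypothetical helper, `split_at_quantile`, and made-up toy values; it mirrors the `>=` / `<` split on the 0.75 quantile used above.

```python
import pandas as pd

def split_at_quantile(df, column, q=0.75):
    """Return (high, low): rows at or above the q-quantile of `column`, and the rest."""
    threshold = df[column].quantile(q)
    high = df.loc[df[column] >= threshold]
    low = df.loc[df[column] < threshold]
    return high, low

# Made-up illustrative values; column names match the dataset.
toy = pd.DataFrame({
    "crim": [0.1, 0.2, 0.3, 0.4, 9.0],
    "medv": [30.0, 28.0, 25.0, 24.0, 12.0],
})

high, low = split_at_quantile(toy, "crim")

# Stack the two summaries under labeled column groups.
side_by_side = pd.concat(
    {"high_crime": high.describe(), "low_crime": low.describe()}, axis=1
)
print(side_by_side.loc[["mean", "50%", "std"]])
```

The resulting columns form a two-level index, so a single statistic can be read with, for example, `side_by_side.loc["mean", ("high_crime", "medv")]`.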


In [10]:
# Two DataFrames split on nitrogen oxide concentration (parts per 10 million) at the 75th percentile,
# to investigate the correlation of pollution with crime rate
nox_75 = df['nox'].quantile(.75)
df_high_nox = df.loc[df['nox'] >= nox_75]
df_low_nox = df.loc[df['nox'] < nox_75]
df_high_nox.describe()

| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 | 84.0 |
| mean | 369.797619 | 10.323496 | 1.666667 | 17.09869 | 0.083333 | 0.718488 | 5.97831 | 92.241667 | 1.93287 | 20.154762 | 601.190476 | 18.945238 | 303.215595 | 18.416905 | 18.252381 |
| std | 96.237254 | 11.171964 | 5.560908 | 4.010952 | 0.278045 | 0.066358 | 0.812132 | 9.186294 | 0.477017 | 7.679568 | 133.086391 | 2.534409 | 136.244664 | 7.564317 | 10.646041 |
| min | 143.0 | 0.52014 | 0.0 | 3.97 | 0.0 | 0.631 | 3.561 | 48.2 | 1.1296 | 5.0 | 264.0 | 13.0 | 3.5 | 2.96 | 5.0 |
| 25% | 360.75 | 3.755472 | 0.0 | 18.1 | 0.0 | 0.679 | 5.532 | 88.925 | 1.578825 | 24.0 | 666.0 | 20.2 | 260.27 | 13.825 | 11.45 |
| 50% | 394.0 | 7.795775 | 0.0 | 18.1 | 0.0 | 0.7065 | 6.128 | 94.95 | 1.86455 | 24.0 | 666.0 | 20.2 | 379.585 | 17.21 | 15.05 |
| 75% | 441.25 | 12.1639 | 0.0 | 18.1 | 0.0 | 0.74 | 6.40525 | 99.475 | 2.23205 | 24.0 | 666.0 | 20.2 | 395.9925 | 23.6475 | 21.55 |
| max | 467.0 | 73.5341 | 20.0 | 19.58 | 1.0 | 0.871 | 8.398 | 100.0 | 3.0665 | 24.0 | 666.0 | 20.2 | 396.9 | 37.97 | 50.0 |


In [34]:
df_low_nox.describe()

| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 | 249.0 |
| mean | 210.859438 | 1.011325 | 13.732932 | 9.3351 | 0.052209 | 0.502715 | 6.362542 | 60.1249 | 4.309425 | 6.084337 | 344.538153 | 18.280321 | 378.442169 | 10.524578 | 24.292369 |
| std | 140.605288 | 2.937771 | 25.319291 | 6.701642 | 0.222896 | 0.06652 | 0.636603 | 27.750349 | 1.935874 | 5.694943 | 128.304858 | 1.984185 | 48.891805 | 5.652216 | 8.092244 |
| min | 1.0 | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 4.973 | 6.0 | 1.4394 | 1.0 | 188.0 | 12.6 | 3.65 | 1.73 | 8.1 |
| 25% | 90.0 | 0.05735 | 0.0 | 4.49 | 0.0 | 0.439 | 5.949 | 36.8 | 2.7147 | 4.0 | 270.0 | 17.0 | 383.23 | 6.29 | 19.3 |
| 50% | 192.0 | 0.1403 | 0.0 | 6.96 | 0.0 | 0.504 | 6.23 | 62.0 | 3.9454 | 4.0 | 307.0 | 18.4 | 392.9 | 9.52 | 22.6 |
| 75% | 306.0 | 0.40771 | 21.0 | 11.93 | 0.0 | 0.547 | 6.635 | 85.2 | 5.5027 | 6.0 | 398.0 | 20.2 | 396.24 | 13.45 | 26.6 |
| max | 506.0 | 28.6558 | 100.0 | 27.74 | 1.0 | 0.624 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 34.41 | 50.0 |
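
The comment on the split above mentions the relationship between pollution and crime rate. One direct way to quantify it, shown here as a sketch on made-up values, is `DataFrame.corr`. Pearson measures linear association, while Spearman works on ranks and is less sensitive to the heavy right tail that `crim` shows in the full summary table (median 0.26, maximum 73.5).

```python
import pandas as pd

# Made-up illustrative values; column names match the dataset.
toy = pd.DataFrame({
    "nox": [0.40, 0.45, 0.50, 0.60, 0.70],
    "crim": [0.05, 0.10, 0.30, 4.00, 20.00],
})

# Linear association between the two columns.
pearson = toy.corr(method="pearson").loc["nox", "crim"]

# Rank-based association: only the ordering of the values matters.
spearman = toy.corr(method="spearman").loc["nox", "crim"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

In this toy example the relationship is perfectly monotonic but not linear, so the Spearman coefficient is 1 while the Pearson coefficient is lower.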


In [71]:
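# Min-max normalize all columns, then z-score 'crim' so it can drive the color scale.
standardized_crime = normalized_selected.copy()
standardized_crime['crim'] = preprocessing.scale(standardized_crime['crim'])
# Same treatment for the high-crime subset. Caution: zn, indus, rad, tax and ptratio
# are constant in df_high_crime, so normalize() divides 0 by 0 and fills them with NaN.
standardized_high_crime = normalize(df_high_crime)
standardized_high_crime['crim'] = preprocessing.scale(standardized_high_crime['crim'])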
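
Note that the cell above combines two different rescalings: `normalize` maps each column to the range [0, 1], while `preprocessing.scale` standardizes to zero mean and unit variance (a z-score). A minimal pandas-only sketch of the z-score is below. `zscore` is a hypothetical helper, the values are made up, and it uses the population standard deviation (`ddof=0`), which is what scikit-learn's `scale` computes.

```python
import pandas as pd

def zscore(series):
    """Standardize to mean 0 and population standard deviation 1."""
    return (series - series.mean()) / series.std(ddof=0)

# Made-up illustrative values.
crim = pd.Series([0.1, 0.2, 0.3, 0.4, 9.0], name="crim")
z = zscore(crim)

print(z.round(3))
```

Because a z-score is unchanged by a prior min-max rescaling, standardizing `crim` after `normalize` gives the same values as standardizing the raw column.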

In [83]:
ax = standardized_crime.plot.scatter('dis', 'medv', c='crim', colormap='viridis', figsize=(10,6))
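
The goals ask for subplots so that subsets can be compared directly. As a sketch under stated assumptions (made-up toy values and a hypothetical `plot_side_by_side` helper), shared axes make the two panels comparable at a glance:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_side_by_side(high, low, x="dis", y="medv"):
    """Scatter `y` against `x` for two subsets on panels with shared axes."""
    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
    for ax, subset, title in zip(axes, (high, low), ("High crime", "Low crime")):
        ax.scatter(subset[x], subset[y])
        ax.set_title(title)
        ax.set_xlabel(x)
    axes[0].set_ylabel(y)
    fig.tight_layout()
    return fig, axes

# Made-up illustrative values; column names match the dataset.
high = pd.DataFrame({"dis": [1.5, 2.0, 2.5], "medv": [12.0, 15.0, 14.0]})
low = pd.DataFrame({"dis": [3.0, 5.0, 8.0], "medv": [22.0, 25.0, 30.0]})

fig, axes = plot_side_by_side(high, low)
```

In the project, the same helper could be called with `df_high_crime` and `df_low_crime`. Swapping `ax.scatter(...)` for `ax.hist(subset[y])` gives the matching histogram comparison.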


## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!