# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analysing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Boston Housing data set, which contains housing values in suburbs of Boston. The Boston Housing data set is commonly used by aspiring data scientists.

## Objectives

You will be able to:

* Load csv files using Pandas
* Find variables with high correlation
* Create box plots

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At minimum, this should include:

* Load the data (stored in the file `train.csv`)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using `.loc`, `.iloc`, or related selection operations. Explain why you chose each subset, and do this for 3 possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ for each subset of the data. Examples of potential splits:
    - Create 2 new dataframes based on your existing data, where one contains all the properties next to the Charles River, and the other contains the properties that aren't.
    - Create 2 new dataframes based on a certain split for crime rate.
* Next, use histograms and scatterplots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
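
The last goal, comparing subsets side by side with subplots, could look roughly like the sketch below. It is only an illustration: `demo_df` holds randomly generated stand-in values (an assumption, not data from `train.csv`), and the column names `chas`, `rm`, and `age` are borrowed from the variable descriptions. With the real data, `demo_df` would be replaced by the loaded dataframe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Boston values), only to make the sketch runnable
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "chas": rng.integers(0, 2, size=200),   # 0/1 river dummy
    "rm": rng.normal(6.2, 0.7, size=200),   # rooms per dwelling
    "age": rng.uniform(0, 100, size=200),   # share of older units
})

# 2-way split on the river dummy
river = demo_df.loc[demo_df["chas"] == 1]
no_river = demo_df.loc[demo_df["chas"] == 0]

# One row per plot type, one column per subset; x-axes are shared within each row
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8), sharex="row")

axes[0, 0].hist(river["rm"], bins=15)
axes[0, 0].set_title("rm (chas == 1)")
axes[0, 1].hist(no_river["rm"], bins=15)
axes[0, 1].set_title("rm (chas == 0)")

axes[1, 0].scatter(river["age"], river["rm"], s=10)
axes[1, 0].set_title("rm vs. age (chas == 1)")
axes[1, 1].scatter(no_river["age"], no_river["rm"], s=10)
axes[1, 1].set_title("rm vs. age (chas == 0)")

fig.tight_layout()
```

Sharing the x-axis within a row keeps the two subsets on the same scale, which makes differences in spread easier to see. In a notebook, `plt.show()` would display the figure.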

In [100]:
import pandas as pd

In [101]:
train_df = pd.read_csv('train.csv')
train_df = train_df[['crim', 'indus', "chas", 'rm', 'age', 'tax', 'black']]   
train_df.head()


Unnamed: 0,crim,indus,chas,rm,age,tax,black
0,0.00632,2.31,0,6.575,65.2,296,396.9
1,0.02731,7.07,0,6.421,78.9,242,396.9
2,0.03237,2.18,0,6.998,45.8,222,394.63
3,0.06905,2.18,0,7.147,54.2,222,396.9
4,0.08829,7.87,0,6.012,66.6,311,395.6


In [102]:
# Use built-in python functions to explore measures of centrality 
# and dispersion for at least 3 variables

print(train_df["age"].mean())
print(train_df["tax"].mean())
print(train_df["rm"].median())

68.22642642642641
409.27927927927925
6.202000000000001
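
The cell above only reports measures of centrality. A possible way to add measures of dispersion is sketched below. To stay self-contained, it uses a small `sample_df` built from the five rows shown in the `train_df.head()` output (an assumption for illustration, so the numbers differ from those for the full data set); on the real data, the same calls would be applied to `train_df`.

```python
import pandas as pd

# Five rows copied from the train_df.head() output, not the full data set
sample_df = pd.DataFrame({
    "rm": [6.575, 6.421, 6.998, 7.147, 6.012],
    "age": [65.2, 78.9, 45.8, 54.2, 66.6],
    "tax": [296, 242, 222, 222, 311],
})

# One row per statistic, one column per variable
summary = sample_df.agg(["mean", "median", "std", "var", "min", "max"])

# Interquartile range: distance between the 75th and 25th percentiles
iqr = sample_df.quantile(0.75) - sample_df.quantile(0.25)

print(summary)
print(iqr)
```

The standard deviation and variance are sensitive to outliers, while the interquartile range is not, so reporting both gives a fuller picture of the spread.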


In [103]:
#Create meaningful subsets of the data using selection operations 
#using .loc, .iloc or related operations. 

In [104]:
#Create 2 new dataframes based on a certain split for crime rate.
train_df["crim"].head()

train_num_crim_df = pd.to_numeric(train_df['crim'])

def crime_rate(rate): 
    if rate >= 0.01: 
        return "high"
    else: 
        return "low"

high_crime_rate = train_num_crim_df.apply(crime_rate)
type(high_crime_rate)

train_df['high_crime_rate']=high_crime_rate
train_df.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
0,0.00632,2.31,0,6.575,65.2,296,396.9,low
1,0.02731,7.07,0,6.421,78.9,242,396.9,high
2,0.03237,2.18,0,6.998,45.8,222,394.63,high
3,0.06905,2.18,0,7.147,54.2,222,396.9,high
4,0.08829,7.87,0,6.012,66.6,311,395.6,high


In [105]:
# Select the high-crime rows with a boolean mask
high_crime = train_df.loc[train_df['high_crime_rate'] == 'high', :]
high_crime.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
1,0.02731,7.07,0,6.421,78.9,242,396.9,high
2,0.03237,2.18,0,6.998,45.8,222,394.63,high
3,0.06905,2.18,0,7.147,54.2,222,396.9,high
4,0.08829,7.87,0,6.012,66.6,311,395.6,high
5,0.22489,7.87,0,6.377,94.3,311,392.52,high


In [106]:
# Select the low-crime rows with a boolean mask
low_crime = train_df.loc[train_df['high_crime_rate'] == 'low', :]
low_crime.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
0,0.00632,2.31,0,6.575,65.2,296,396.9,low
190,0.00906,2.97,0,7.088,20.8,285,394.72,low
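
The output above shows only two rows in the "low" group, so the 0.01 cutoff produces very unbalanced subsets. One alternative, sketched here as an assumption rather than a prescribed method, is to split at the median crime rate so that the two groups are roughly equal in size. The `sample_df` below is built from the seven rows visible in the outputs above (not the full data set), and the `cutoff` and `is_high` names are illustrative.

```python
import pandas as pd

# Rows copied from the outputs above (index labels 0-5 and 190), not the full data
sample_df = pd.DataFrame(
    {
        "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829, 0.22489, 0.00906],
        "rm": [6.575, 6.421, 6.998, 7.147, 6.012, 6.377, 7.088],
    },
    index=[0, 1, 2, 3, 4, 5, 190],
)

# Split at the median so both subsets have a similar number of rows
cutoff = sample_df["crim"].median()
is_high = sample_df["crim"] > cutoff

high_crime_df = sample_df.loc[is_high]
low_crime_df = sample_df.loc[~is_high]

# Compare centrality and dispersion of `rm` across the two subsets
print(len(high_crime_df), len(low_crime_df))
print(high_crime_df["rm"].agg(["mean", "std"]))
print(low_crime_df["rm"].agg(["mean", "std"]))
```

A boolean mask and its negation (`~is_high`) guarantee that every row lands in exactly one of the two subsets.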


In [107]:
#Create a 2 new dataframes based on your existing data, where one 
#contains all the properties next to the Charles river, and the other 
#one contains properties that aren't.

In [112]:
train_df.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
0,0.00632,2.31,0,6.575,65.2,296,396.9,low
1,0.02731,7.07,0,6.421,78.9,242,396.9,high
2,0.03237,2.18,0,6.998,45.8,222,394.63,high
3,0.06905,2.18,0,7.147,54.2,222,396.9,high
4,0.08829,7.87,0,6.012,66.6,311,395.6,high


In [115]:
#The properties next to the Charles river 
type(train_df['chas'])
yes_charles = [1]

charles_river_data = train_df.loc[train_df.chas.isin(yes_charles)]
charles_river_data.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
97,3.32105,19.58,1,5.403,100.0,403,396.9,high
104,1.41385,19.58,1,6.129,96.0,403,321.02,high
108,1.27346,19.58,1,6.25,92.6,403,338.92,high
110,1.51902,19.58,1,8.375,93.9,403,388.45,high
145,0.13587,10.59,1,6.064,59.1,277,381.32,high


In [116]:
# numeric_only=True skips the text column high_crime_rate
charles_river_data.mean(numeric_only=True)

crim       2.163972
indus     12.330000
chas       1.000000
rm         6.577750
age       75.815000
tax      394.550000
black    380.681000
dtype: float64

In [117]:
#The properties not next to the Charles River 
type(train_df['chas'])
no_charles = [0]

outside_charles_river_data = train_df.loc[train_df.chas.isin(no_charles)]
outside_charles_river_data.head()

Unnamed: 0,crim,indus,chas,rm,age,tax,black,high_crime_rate
0,0.00632,2.31,0,6.575,65.2,296,396.9,low
1,0.02731,7.07,0,6.421,78.9,242,396.9,high
2,0.03237,2.18,0,6.998,45.8,222,394.63,high
3,0.06905,2.18,0,7.147,54.2,222,396.9,high
4,0.08829,7.87,0,6.012,66.6,311,395.6,high


In [118]:
# numeric_only=True skips the text column high_crime_rate
outside_charles_river_data.mean(numeric_only=True)

crim       3.436787
indus     11.227252
chas       0.000000
rm         6.245674
age       67.741534
tax      410.220447
black    358.110511
dtype: float64
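
One of the objectives is to find variables with high correlation. A minimal sketch of one way to do that follows. It assumes a small `sample_df` built from the five rows of the `train_df.head()` output (so the coefficients are illustrative only), and it leaves out `chas` because that column is constant in those rows. On the real data, `train_df.corr(numeric_only=True)` would be the starting point.

```python
import numpy as np
import pandas as pd

# Five rows copied from the train_df.head() output, not the full data set
sample_df = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829],
    "indus": [2.31, 7.07, 2.18, 2.18, 7.87],
    "rm": [6.575, 6.421, 6.998, 7.147, 6.012],
    "age": [65.2, 78.9, 45.8, 54.2, 66.6],
    "tax": [296, 242, 222, 222, 311],
})

# Pairwise Pearson correlation (the pandas default)
corr = sample_df.corr()

# Keep only the upper triangle so each pair of variables appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest.head(3))
```

Ranking by absolute value treats strong negative correlations as just as interesting as strong positive ones.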

## Variable Descriptions

This data frame contains the following columns:

#### crim  
per capita crime rate by town.

#### zn  
proportion of residential land zoned for lots over 25,000 sq.ft.

#### indus  
proportion of non-retail business acres per town.

#### chas  
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

#### nox  
nitrogen oxides concentration (parts per 10 million).

#### rm  
average number of rooms per dwelling.

#### age  
proportion of owner-occupied units built prior to 1940.

#### dis  
weighted mean of distances to five Boston employment centres.

#### rad  
index of accessibility to radial highways.

#### tax  
full-value property-tax rate per $10,000.

#### ptratio  
pupil-teacher ratio by town.

#### black  
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

#### lstat  
lower status of the population (percent).

#### medv  
median value of owner-occupied homes in $1000s.
  
  
  
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

## Summary

Congratulations, you've completed your first "freeform" exploratory data analysis of a popular data set!