# Project - EDA with Pandas Using the Ames Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information on residential home sales in Ames, Iowa.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (which is stored in the file ``ames_train.csv``)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split.
* Next, use histograms and scatter plots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
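
The first goals above can be sketched as follows. This is only a minimal illustration: the `toy` DataFrame is hypothetical data standing in for a few columns of `ames_train.csv`, and the 1990 cutoff is an arbitrary assumption chosen to show a 2-way split, not a recommended threshold.

```python
import pandas as pd

# Hypothetical toy data standing in for a few columns of ames_train.csv
toy = pd.DataFrame({
    "SalePrice": [100000, 150000, 200000, 250000, 300000, 500000],
    "YearBuilt": [1950, 1965, 1980, 1995, 2005, 2008],
    "OverallCond": [5, 6, 5, 7, 5, 8],
})

# Centrality (mean, median) and dispersion (std, min, max) for every column
summary = toy.agg(["mean", "median", "std", "min", "max"])
print(summary)

# One possible 2-way split: homes built before 1990 versus 1990 or later.
# The cutoff year is an assumption made purely for illustration.
older = toy.loc[toy["YearBuilt"] < 1990]
newer = toy.loc[toy["YearBuilt"] >= 1990]

print("Median price, older homes:", older["SalePrice"].median())
print("Median price, newer homes:", newer["SalePrice"].median())
```

With the real data, the same pattern applies after `df = pd.read_csv("ames_train.csv")`. Before computing anything, it helps to write down which subset you expect to have the higher median and the wider spread.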

## Variable Descriptions
Look in ``data_description.txt`` for a full description of all variables.

A preview of some of the columns:

**MSZoning**: Identifies the general zoning classification of the sale.
		
       A	 Agriculture
       C	 Commercial
       FV	Floating Village Residential
       I	 Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

**OverallCond**: Rates the overall condition of the house

       10	Very Excellent
       9	 Excellent
       8	 Very Good
       7	 Good
       6	 Above Average	
       5	 Average
       4	 Below Average	
       3	 Fair
       2	 Poor
       1	 Very Poor

**KitchenQual**: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

**YrSold**: Year Sold (YYYY)

**SalePrice**: Sale price of the house in dollars

In [54]:
# Let's get started importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

In [55]:
# Loading the data
df = pd.read_csv("ames_train.csv")

In [56]:
# Investigate the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [22]:
df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7f603cb198>

In [39]:
# Plot the sale price against the year built (one bar per house)
testdf = df[["YearBuilt", "SalePrice"]]
bar = testdf.plot.bar(x="YearBuilt", y="SalePrice")

In [65]:
# Prepare the overall quality and condition ratings, indexed by year built
testdf2 = df[["YearBuilt","OverallQual","OverallCond"]]
testdf2 = testdf2.set_index("YearBuilt")
testdf2 = testdf2.sort_index()
testdf2.head()



| YearBuilt | OverallQual | OverallCond |
|-----------|-------------|-------------|
| 1872      | 8           | 5           |
| 1875      | 5           | 8           |
| 1880      | 7           | 7           |
| 1880      | 7           | 9           |
| 1880      | 6           | 4           |


In [61]:
testdf2["OverallQual"].unique()

array([ 7,  6,  8,  5,  9,  4, 10,  3,  1,  2])

In [69]:
sns.heatmap(testdf2)

<matplotlib.axes._subplots.AxesSubplot at 0x7f7f1cef6518>

In [70]:
df3 = df[["GarageArea","SalePrice"]]
df3.head()

|   | GarageArea | SalePrice |
|---|------------|-----------|
| 0 | 548        | 208500    |
| 1 | 460        | 181500    |
| 2 | 608        | 223500    |
| 3 | 642        | 140000    |
| 4 | 836        | 250000    |


In [72]:
sns.lmplot(x = "GarageArea", y = "SalePrice", data = df3)

<seaborn.axisgrid.FacetGrid at 0x7f7f1e115f98>

In [None]:
# Create a plot that shows the SalePrice Distribution
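
One possible way to approach this cell is a histogram with the mean and median marked. The sketch below uses a hypothetical `prices` Series rather than the real column. The `Agg` backend is an assumption so that the code runs outside a notebook, and that line can be dropped when working in the notebook itself.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sale prices standing in for df["SalePrice"]
prices = pd.Series(
    [100000, 120000, 135000, 150000, 160000, 175000, 190000, 240000, 310000, 500000],
    name="SalePrice",
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(prices, bins=5, edgecolor="black")

# Mark the mean and median so any skew in the distribution is easy to see
ax.axvline(prices.mean(), color="red", linestyle="--", label="mean")
ax.axvline(prices.median(), color="green", linestyle="-", label="median")

ax.set_title("Distribution of sale prices")
ax.set_xlabel("Sale price (USD)")
ax.set_ylabel("Number of houses")
ax.legend()
```

When the mean sits to the right of the median, as in this toy data, the distribution has a long right tail.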

In [None]:
# Create a plot that shows the LotArea Distribution

In [None]:
# Create a plot that shows the Distribution of the overall house condition
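
Because `OverallCond` is a discrete 1 to 10 rating, counting each rating and drawing a bar chart may read more clearly than a histogram. The following is a sketch on a hypothetical `cond` Series, not the real column. The `Agg` backend is again an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit inside the notebook
import pandas as pd

# Hypothetical condition ratings standing in for df["OverallCond"]
cond = pd.Series([5, 5, 6, 7, 5, 8, 4, 5, 6, 7], name="OverallCond")

# Count how many houses received each rating, ordered by rating
counts = cond.value_counts().sort_index()
print(counts)

ax = counts.plot.bar(rot=0)
ax.set_title("Distribution of overall condition")
ax.set_xlabel("Overall condition rating")
ax.set_ylabel("Number of houses")
```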

In [None]:
# Create a Box Plot for SalePrice
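
A box plot can be paired with the quartiles it is built from. The sketch below uses hypothetical prices and the common 1.5 × IQR rule for flagging outliers, which is also matplotlib's default whisker length. The `Agg` backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sale prices standing in for df["SalePrice"]
prices = pd.Series([100000, 150000, 200000, 250000, 300000, 900000], name="SalePrice")

# Quartiles and interquartile range, the numbers the box itself represents
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers.tolist())

fig, ax = plt.subplots(figsize=(4, 5))
ax.boxplot(prices.to_numpy())
ax.set_title("Sale price")
ax.set_ylabel("Sale price (USD)")
```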

In [None]:
# Perform an Exploration of home values by age
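
One way to define a home's age is the year it was sold minus the year it was built, using the `YrSold` and `YearBuilt` columns. The sketch below applies that definition to a hypothetical `homes` DataFrame. The definition of age is an assumption, as is the `Agg` backend used to run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for df[["YearBuilt", "YrSold", "SalePrice"]]
homes = pd.DataFrame({
    "YearBuilt": [1950, 1975, 1990, 2000, 2005, 2007],
    "YrSold": [2008, 2007, 2009, 2006, 2010, 2008],
    "SalePrice": [110000, 140000, 180000, 210000, 260000, 320000],
})

# Age at the time of sale (an assumed definition of "age")
homes["Age"] = homes["YrSold"] - homes["YearBuilt"]

# Pearson correlation between age and price as a first numeric summary
corr = homes["Age"].corr(homes["SalePrice"])
print("Correlation between age and sale price:", round(corr, 3))

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(homes["Age"], homes["SalePrice"])
ax.set_title("Sale price by age of home")
ax.set_xlabel("Age at sale (years)")
ax.set_ylabel("Sale price (USD)")
```

In this toy data the correlation is negative by construction. Whether the real dataset shows the same pattern is what this cell is meant to explore.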

## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!