# Project - EDA with Pandas Using the Ames Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about residential home sales in Ames, Iowa.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (which is stored in the file ``ames_train.csv``)
* Use built-in pandas methods (such as `.mean()`, `.median()`, and `.std()`) to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Do this for three possible 2-way splits, and explain why you chose each one. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split.
* Use histograms and scatter plots to check whether those differences actually show up in the subsets. Use subplots so the subsets are easy to compare.
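
The first three goals can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the `toy` DataFrame and its values are made up so the snippet runs without `ames_train.csv`, and the split on `OverallCond > 5` is just one possible choice of 2-way split.

```python
import pandas as pd

# Toy stand-in for the real data; all values here are invented for illustration.
toy = pd.DataFrame({
    "SalePrice":   [100000, 150000, 200000, 250000, 300000, 400000],
    "LotArea":     [5000, 7000, 8000, 9000, 12000, 20000],
    "OverallCond": [4, 5, 5, 6, 7, 8],
})

# Centrality (mean, median) and dispersion (std) for three variables at once.
summary = toy[["SalePrice", "LotArea", "OverallCond"]].agg(["mean", "median", "std"])
print(summary)

# One possible 2-way split: above-average condition versus average or below.
above = toy.loc[toy["OverallCond"] > 5]
rest = toy.loc[toy["OverallCond"] <= 5]

# Compare the same statistics across the two subsets.
print(above["SalePrice"].agg(["median", "std"]))
print(rest["SalePrice"].agg(["median", "std"]))
```

With the real data, you would replace `toy` with the DataFrame returned by `pd.read_csv('ames_train.csv')` and justify each split in a sentence or two.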

## Variable Descriptions
Look in ``data_description.txt`` for a full description of all variables.

A preview of some of the columns:

**MSZoning**: Identifies the general zoning classification of the sale.
		
       A    Agriculture
       C    Commercial
       FV   Floating Village Residential
       I    Industrial
       RH   Residential High Density
       RL   Residential Low Density
       RP   Residential Low Density Park
       RM   Residential Medium Density

**OverallCond**: Rates the overall condition of the house

       10   Very Excellent
       9    Excellent
       8    Very Good
       7    Good
       6    Above Average
       5    Average
       4    Below Average
       3    Fair
       2    Poor
       1    Very Poor

**KitchenQual**: Kitchen quality

       Ex   Excellent
       Gd   Good
       TA   Typical/Average
       Fa   Fair
       Po   Poor

**YrSold**: Year Sold (YYYY)

**SalePrice**: Sale price of the house in dollars
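
Codes such as those for `KitchenQual` have a natural order, which plain string columns do not capture. One way to make that order usable for subsetting is an ordered categorical. The sketch below assumes the code list shown above and uses a small made-up Series in place of `df['KitchenQual']`.

```python
import pandas as pd

# Quality codes from the description above, ordered from worst to best.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Toy column standing in for df["KitchenQual"]; values are invented.
kitchen = pd.Series(["TA", "Gd", "Ex", "TA", "Fa"], name="KitchenQual")
kitchen = kitchen.astype(pd.CategoricalDtype(categories=quality_order, ordered=True))

# Ordered categories support comparisons, which makes a 2-way split straightforward.
good_or_better = kitchen >= "Gd"
print(good_or_better)
```

The resulting boolean mask can be passed to `.loc` to select the two halves of a split.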

In [62]:
# Let's get started importing the necessary libraries
import numpy as np
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt

In [63]:
# Set the plot style and load the data
plt.style.use('ggplot')

df = pd.read_csv('ames_train.csv')

In [64]:
# Investigate the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
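
The `info()` output above lists 1201 non-null entries for `LotFrontage` and only 91 for `Alley`, out of 1460 rows, so some columns have missing values. A quick way to quantify this is sketched below. The `toy` DataFrame is a made-up stand-in so the snippet runs on its own; with the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in with gaps in the same columns; values are invented.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley":       [np.nan, np.nan, "Grvl", np.nan],
    "LotArea":     [8450, 9600, 11250, 9550],
})

# Count of missing values per column.
missing = toy.isna().sum()

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```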

In [11]:
# Investigating Distributions using scatter_matrix
pd.plotting.scatter_matrix(df);

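
A scatter matrix over every numeric column of an 81-column DataFrame tends to be too dense to read. One option is to restrict it to a few columns of interest, as in the sketch below. The column choice and the `toy` data are assumptions for illustration, and the `Agg` backend line is there only so the snippet runs outside a notebook; drop it when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import pandas as pd

# Toy stand-in for the real data; values are invented.
toy = pd.DataFrame({
    "SalePrice":   [100000, 150000, 200000, 250000, 300000, 400000],
    "LotArea":     [5000, 7000, 8000, 9000, 12000, 20000],
    "OverallCond": [4, 5, 5, 6, 7, 8],
})

# Plot pairwise relationships for a handful of columns only.
cols = ["SalePrice", "LotArea", "OverallCond"]
axes = pd.plotting.scatter_matrix(toy[cols], figsize=(8, 8), diagonal="hist")
```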

In [65]:
# Create a plot that shows the SalePrice distribution
df.hist(column='SalePrice', bins='auto');
plt.ylabel('Number of Houses');
plt.xlabel('Sale Price');
plt.title('Distribution of Sale Prices');


In [67]:
# Create a plot that shows the LotArea distribution
df.hist(column='LotArea', bins='auto');
plt.xlabel('Lot Area');
plt.ylabel('Number of Properties');
plt.title('Distribution of Lot Area');

In [69]:
# Create a plot that shows the distribution of the overall house condition
df.hist(column='OverallCond', bins='auto');
plt.xlabel('Overall Condition');
plt.ylabel('Number of Houses');
plt.title('Overall Condition of Houses');

In [72]:
# Create a box plot for SalePrice
df.boxplot(column='SalePrice');
plt.ylabel('Price');

In [56]:
# Explore home values by age
# Age of the house at the time of sale, in years
df['age'] = df['YrSold'] - df['YearBuilt']
# Bucket ages into decades (0 = under 10 years, 1 = 10 to 19 years, ...)
df['decades'] = df['age'] // 10
to_plot = df.groupby('decades')['SalePrice'].mean()
to_plot.plot(kind='barh', figsize=(10,8))
plt.ylabel('House Age in Decades')
plt.xlabel('Average Sale Price of Homes')
plt.title('Average Home Values by Home Age');
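
The goals also ask for subplots that compare subsets side by side. A minimal sketch of that idea, splitting on house age, might look like the following. The 10-year cutoff, the subset labels, and the `toy` data are all assumptions for illustration, and the `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the 'age' and 'SalePrice' columns; values are invented.
toy = pd.DataFrame({
    "age":       [2, 5, 8, 15, 30, 45, 60, 75],
    "SalePrice": [320000, 280000, 260000, 210000, 180000, 150000, 140000, 120000],
})

# A 2-way split on age at time of sale.
newer = toy.loc[toy["age"] < 10]
older = toy.loc[toy["age"] >= 10]

# Shared axes make the two histograms directly comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
subsets = [("Under 10 years", newer), ("10 years or older", older)]
for ax, (label, subset) in zip(axes, subsets):
    ax.hist(subset["SalePrice"], bins=5)
    ax.set_title(label)
    ax.set_xlabel("Sale Price")
axes[0].set_ylabel("Number of Houses")
fig.tight_layout()
```

Sharing the x and y axes is a deliberate choice here: it keeps the scales identical, so differences in center and spread between the subsets are visible at a glance.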


## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!