# Course Data

One of the main objectives of this course is to teach a rigorous and robust workflow that can be applied to many machine learning problems. To do so, we concentrate on just a few datasets so that we can explore this workflow in depth. Working with the same datasets throughout lets you get to know them well and build more easily on what you learned in each previous chapter.

## Introducing the Ames, Iowa Housing Dataset

A popular dataset for getting started with machine learning is the [Ames, Iowa Housing Dataset found on Kaggle][1]. We will be using it for all of our supervised regression coverage in this book. It contains residential property sales data for Ames, Iowa, from 2006 to 2010. The data has 79 features (columns) describing 1,460 homes, along with the final sale price of each home. You can read more about [how the data was collected][2].

### Begin with a sample of the housing dataset

Instead of using the full dataset with all 79 features, we begin by narrowing our focus to a subset of ten of them. We eventually use the entire dataset, but trying to understand this many features when first being introduced to machine learning can be overwhelming. The smaller subset lets us concentrate on the techniques and tools of machine learning without being overloaded with data.
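The sample file used in this chapter is already prepared, so the following is only an illustrative sketch of how a column subset might be pulled from a wider table with the `usecols` parameter of `pd.read_csv`. The in-memory CSV and the names `wide_csv`, `keep`, and `subset` are assumptions made for the example. The column names and values are taken from the sample rows shown in this section.

```python
import io

import pandas as pd

# Hypothetical stand-in for a wider file: six columns, two rows,
# with values copied from the sample rows shown in this section.
wide_csv = io.StringIO(
    "Neighborhood,Exterior1st,YearBuilt,GrLivArea,FullBath,SalePrice\n"
    "CollgCr,VinylSd,2003,1710,2,208500\n"
    "Veenker,MetalSd,1976,1262,2,181500\n"
)

# Columns to keep: two features plus the target, SalePrice.
keep = ["Neighborhood", "GrLivArea", "SalePrice"]

# usecols tells pandas to load only the listed columns.
subset = pd.read_csv(wide_csv, usecols=keep)
print(subset.columns.tolist())
```

Loading only the needed columns keeps the DataFrame small and the first exploration focused. The columns come back in the order they appear in the file, not the order given in `usecols`.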

### First look at the data

Let's read in the dataset with pandas and output the first few rows.

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
[2]: http://jse.amstat.org/v19n3/decock.pdf

In [1]:
import pandas as pd
housing = pd.read_csv('../data/housing_sample.csv')
housing.head()

|   | Neighborhood | Exterior1st | YearBuilt | LotFrontage | GrLivArea | GarageArea | BedroomAbvGr | FullBath | OverallQual | HeatingQC | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CollgCr | VinylSd | 2003 | 65.0 | 1710 | 548 | 3 | 2 | 7 | Ex | 208500 |
| 1 | Veenker | MetalSd | 1976 | 80.0 | 1262 | 460 | 3 | 2 | 6 | Ex | 181500 |
| 2 | CollgCr | VinylSd | 2001 | 68.0 | 1786 | 608 | 3 | 2 | 7 | Ex | 223500 |
| 3 | Crawfor | Wd Sdng | 1915 | 60.0 | 1717 | 642 | 3 | 1 | 7 | Gd | 140000 |
| 4 | NoRidge | VinylSd | 2000 | 84.0 | 2198 | 836 | 4 | 2 | 8 | Ex | 250000 |


### Dimensions

Let's get the exact number of rows and columns with the `shape` attribute.

In [2]:
housing.shape

(1460, 11)
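The `shape` attribute is a tuple of `(rows, columns)`, so it can be unpacked directly. For the housing sample, the 11 columns are the ten features plus `SalePrice`. As a minimal sketch, the block below builds a small stand-in DataFrame from a few of the columns printed above (the names `sample`, `n_rows`, `n_cols`, and `n_features` are assumptions for illustration) and counts the features by excluding the target column.

```python
import pandas as pd

# Stand-in for the course file: the five rows printed above,
# restricted to three features and the target for brevity.
sample = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# shape is a (rows, columns) tuple, so it unpacks into two integers.
n_rows, n_cols = sample.shape

# Every column except the target, SalePrice, is a feature.
n_features = n_cols - 1
print(n_rows, n_cols, n_features)  # 5 4 3
```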

## The data dictionary

Before doing any analysis, you should have a description of each column in the dataset. This collection of descriptions is usually referred to as a **data dictionary**. This dataset has its data dictionary stored as a text file. We can print its contents directly in the notebook as follows:

In [3]:
# Use a context manager so the file is closed after it is read
with open('../data/housing_sample_data_dictionary.txt') as f:
    print(f.read())

Neighborhood: Physical locations within Ames city limits

       Blmngtn	Bloomington Heights
       Blueste	Bluestem
       BrDale	Briardale
       BrkSide	Brookside
       ClearCr	Clear Creek
       CollgCr	College Creek
       Crawfor	Crawford
       Edwards	Edwards
       Gilbert	Gilbert
       IDOTRR	Iowa DOT and Rail Road
       MeadowV	Meadow Village
       Mitchel	Mitchell
       Names	North Ames
       NoRidge	Northridge
       NPkVill	Northpark Villa
       NridgHt	Northridge Heights
       NWAmes	Northwest Ames
       OldTown	Old Town
       SWISU	South & West of Iowa State University
       Sawyer	Sawyer
       SawyerW	Sawyer West
       Somerst	Somerset
       StoneBr	Stone Brook
       Timber	Timberland
       Veenker	Veenker

Exterior1st: Exterior covering on house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       Im