<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Working-with-Real-Data" data-toc-modified-id="Working-with-Real-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Working with Real Data</a></span><ul class="toc-item"><li><span><a href="#Popular-open-data-repositories" data-toc-modified-id="Popular-open-data-repositories-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Popular open data repositories</a></span></li><li><span><a href="#A-Housing-Pricing-Problem" data-toc-modified-id="A-Housing-Pricing-Problem-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>A Housing Pricing Problem</a></span><ul class="toc-item"><li><span><a href="#Frame-the-problem" data-toc-modified-id="Frame-the-problem-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Frame the problem</a></span></li><li><span><a href="#Select-a-Performance-Measure" data-toc-modified-id="Select-a-Performance-Measure-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Select a Performance Measure</a></span></li><li><span><a href="#Check-the-Assumptions" data-toc-modified-id="Check-the-Assumptions-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Check the Assumptions</a></span></li><li><span><a href="#Get-the-Data" data-toc-modified-id="Get-the-Data-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Get the Data</a></span><ul class="toc-item"><li><span><a href="#Download-the-data" data-toc-modified-id="Download-the-data-1.2.4.1"><span class="toc-item-num">1.2.4.1&nbsp;&nbsp;</span>Download the data</a></span></li><li><span><a href="#Load-the-data-into-Pandas" data-toc-modified-id="Load-the-data-into-Pandas-1.2.4.2"><span class="toc-item-num">1.2.4.2&nbsp;&nbsp;</span>Load the data into Pandas</a></span></li><li><span><a href="#Take-a-Quick-Look-at-the-Data-Structure" data-toc-modified-id="Take-a-Quick-Look-at-the-Data-Structure-1.2.4.3"><span class="toc-item-num">1.2.4.3&nbsp;&nbsp;</span>Take a Quick Look at the Data Structure</a></span></li><li><span><a 
href="#Get-the-Summary-of-the-Data" data-toc-modified-id="Get-the-Summary-of-the-Data-1.2.4.4"><span class="toc-item-num">1.2.4.4&nbsp;&nbsp;</span>Get the Summary of the Data</a></span></li><li><span><a href="#Get-a-Feel-of-the-Data" data-toc-modified-id="Get-a-Feel-of-the-Data-1.2.4.5"><span class="toc-item-num">1.2.4.5&nbsp;&nbsp;</span>Get a Feel of the Data</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Working with Real Data

## Popular open data repositories
* [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php)
* [Kaggle Datasets](https://www.kaggle.com/datasets)
* [AWS Public Datasets](https://aws.amazon.com/fr/datasets/)
* [http://dataportals.org](http://dataportals.org)
* [http://opendatamonitor.eu/](http://opendatamonitor.eu/)
* [http://quandl.com/](http://quandl.com/)
* [Wikipedia’s list of Machine Learning datasets](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)
* [Datasets Subreddit](https://www.reddit.com/r/datasets/)
* [Quora Suggestion](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)

## A Housing Pricing Problem

### Frame the problem 

* Determine whether it is worth investing in a given area or not. 
* Is it supervised, unsupervised, or Reinforcement Learning?
* Is it a classification task, a regression task, or something else?
* Should you use batch learning or online learning techniques? 

### Select a Performance Measure

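* Root Mean Square Error (RMSE):

$RMSE(X,h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2}$

where $m$ is the number of instances, $x^{(i)}$ and $y^{(i)}$ are the features and label of the $i$-th instance, and $h$ is the *hypothesis* (the prediction function).

* When there are many outlier districts, consider the *Mean Absolute Error* (MAE) instead, the loss used in median regression, which is less sensitive to outliers:

$MAE(X,h)=\frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)})-y^{(i)}\right|$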
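
To see how the two measures react to an outlier, here is a minimal sketch in plain Python. The function names `rmse` and `mae` and the toy labels and predictions are illustrative assumptions, not part of the original notebook.

```python
import math


def rmse(predictions, labels):
    """Root Mean Square Error: square root of the mean squared error."""
    m = len(labels)
    return math.sqrt(sum((h - y) ** 2 for h, y in zip(predictions, labels)) / m)


def mae(predictions, labels):
    """Mean Absolute Error: mean of the absolute errors."""
    m = len(labels)
    return sum(abs(h - y) for h, y in zip(predictions, labels)) / m


# Made-up labels and predictions (illustrative values only).
labels = [200.0, 300.0, 400.0, 500.0]
close_predictions = [210.0, 290.0, 410.0, 490.0]    # every error has size 10
outlier_predictions = [210.0, 290.0, 410.0, 400.0]  # the last error has size 100

print(rmse(close_predictions, labels), mae(close_predictions, labels))
print(rmse(outlier_predictions, labels), mae(outlier_predictions, labels))
```

When all errors have the same size the two measures agree. A single large error pulls the RMSE up more than the MAE, because errors are squared before they are averaged.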

### Check the Assumptions
No particular assumptions need to be checked for this problem.

### Get the Data

#### Download the data

In [3]:
import os
import tarfile
import urllib.request  # Python 3 standard library; replaces six.moves.urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then unpack its contents next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()
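
The download-and-extract step can be tried without network access. The sketch below is a hypothetical variant, not the notebook's function: `fetch_archive`, `archive_name`, and the sample file are made-up names, and the skip-if-present check is an added assumption. It builds a tiny archive in a temporary directory and fetches it through a `file://` URL.

```python
import os
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_archive(url, dest_dir, archive_name="data.tgz"):
    """Download `url` into `dest_dir` (unless already present) and unpack it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name)
    if not os.path.exists(archive_path):  # skip the download on reruns
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest_dir)
    return archive_path


# Offline demo: create a small .tgz holding one made-up CSV file.
work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, "sample.csv")
with open(csv_path, "w") as f:
    f.write("longitude,latitude\n-122.23,37.88\n")
source_tgz = os.path.join(work_dir, "sample.tgz")
with tarfile.open(source_tgz, "w:gz") as archive:
    archive.add(csv_path, arcname="sample.csv")

# Fetch the archive through a file:// URL and list what was extracted.
dest_dir = os.path.join(work_dir, "datasets")
fetch_archive(Path(source_tgz).as_uri(), dest_dir)
extracted = sorted(os.listdir(dest_dir))
print(extracted)
```

Skipping the download when the archive already exists keeps notebook reruns from hitting the server again; whether that is desirable depends on how often the remote data changes.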

#### Load the data into Pandas

In [5]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    # Read the extracted CSV file into a DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

#### Take a Quick Look at the Data Structure

In [13]:
housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


#### Get the Summary of the Data

In [14]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


* The ```total_bedrooms``` attribute has only 20,433 non-null values, so 207 of the 20,640 districts are missing this feature. We will need to take care of this later.
* The ```ocean_proximity``` attribute has type ```object```, which could hold any kind of Python object. Since the data was loaded from a CSV file, it must be a text attribute, and the repeated values in the first five rows suggest that it is categorical.
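
As a rough sketch of how the missing values could be counted and handled, the snippet below uses a toy DataFrame with made-up numbers in place of `housing`. The notebook does not say which strategy it will use later, so dropping rows and filling with the median are shown only as two common options.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
})

# Number of missing entries in each column.
n_missing = toy.isnull().sum()
print(n_missing)

# Option 1: drop the rows where total_bedrooms is missing.
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option 2: fill the gaps with the column median (median() skips NaN).
median = toy["total_bedrooms"].median()
filled = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))
print(filled)
```

Dropping rows loses the other attributes of those districts, while filling keeps every row at the cost of inventing a plausible value.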

In [15]:
housing["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
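
The raw counts are easier to compare as shares of the total. The sketch below copies the five counts from the output above into a plain dictionary; on the real DataFrame, `housing["ocean_proximity"].value_counts(normalize=True)` should give the same proportions.

```python
# Counts copied from the value_counts() output above.
counts = {
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
}

total = sum(counts.values())  # should match the number of districts
shares = {category: count / total for category, count in counts.items()}

for category, share in shares.items():
    print(f"{category:<12}{share:8.2%}")
```

The ```ISLAND``` category covers only 5 districts, which is worth keeping in mind if the data is later split into training and test sets.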

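* ```describe()``` shows a statistical summary of the numerical attributes. Null values are ignored, which is why the ```total_bedrooms``` count is 20,433 rather than 20,640: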

In [16]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
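
The rows of the summary can be reproduced one statistic at a time. The sketch below uses a toy Series with made-up values and one missing entry; it is meant only to illustrate what each row means, in particular that the `25%`, `50%`, and `75%` rows are percentiles and that missing values are skipped.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

summary = values.describe()
print(summary)

# The same statistics computed one by one; each skips the NaN entry.
count = values.count()
mean = values.mean()
std = values.std()  # sample standard deviation (ddof=1)
q25, q50, q75 = values.quantile([0.25, 0.5, 0.75])
print(count, mean, std, q25, q50, q75)
```

Read this way, the `25%` row for ```housing_median_age``` in the table above says that a quarter of the districts have a median age below 18.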


#### Get a Feel of the Data

In [20]:
%matplotlib inline
import matplotlib.pyplot as plt
# Plot one histogram per numerical attribute
housing.hist(bins=50, figsize=(20,15))
plt.show()

NameError: name '_converter' is not defined