#### Machine Learning Project
1. [Look at the big picture](#ch1)
2. [Get the data](#ch2)
3. [Discover and visualize the data to gain insights](#ch3)
4. [Prepare the data for Machine Learning algorithms](#ch4)
5. [Select a model and train it](#ch5)
6. [Fine-tune your model](#ch6)
7. [Present your solution](#ch7)
8. [Launch, monitor, and maintain your system](#ch8)
---

#### Look at the Big Picture<a id='ch1'></a>
* Frame the Problem

| Question | Answer |
|:------|:------|
| What is the business objective? | The model's output will be fed to another ML system |
| What does the current solution look like? | Estimated manually by experts; costly, time-consuming, and not accurate |
| Is it supervised, unsupervised, or reinforcement learning? | Supervised |
| Is it a classification task or a regression task? | Regression (multiple input features, a single predicted value) |
| Should I use batch learning or online learning? | Batch learning |

* Select a Performance Measure
    * Performance Measures
        * Root Mean Square Error (RMSE), which corresponds to the $\ell_2$ norm of the error vector: $$RMSE(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(h(x^i)-y^i)^2}$$

        * Mean Absolute Error (MAE), which corresponds to the $\ell_1$ norm of the error vector: $$MAE(X, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^i)-y^i\right|$$
    * Notations

| Notation | What it means |
|:------:|:------|
| m | the number of instances in the dataset |
| $$x^i$$| vector of all the feature values of $i^{th}$ instance |
| $$y^i$$| label of $i^{th}$ instance |
| $$X$$| - a matrix containing all the feature values of all instances <br> - one row per instance, <br> - the $i^{th}$ row is equal to the transpose of $x^i$, noted $(x^i)^T$ |
| $$h$$ | - the system's prediction function, called a hypothesis <br> - $\hat{y}^i=h(x^i)$|
| $RMSE(X,h)$| cost function measured on the set of examples $X$ using hypothesis $h$ |

---
$$x^1 = \left\lgroup \matrix{-118.29 \cr 33.91 \cr 1,416 \cr 38,372} \right\rgroup$$

$$y^1 = 156,400$$

$$ X = \left\lgroup \matrix{(x^1)^T \cr (x^2)^T \cr \vdots \cr (x^{1999})^T \cr (x^{2000})^T} \right\rgroup
 = \left\lgroup \matrix{-118.29 & 33.91 & 1,416 & 38,372 \cr \vdots & \vdots & \vdots & \vdots} \right\rgroup $$
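
As a quick numerical check of the two measures above, here is a minimal NumPy sketch. The arrays `y_true` and `y_pred` are made-up values (not taken from the housing data), and the helper names `rmse` and `mae` are only illustrative.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Square root of the mean squared error (l2-style measure)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_pred, y_true):
    """Mean of the absolute errors (l1-style measure)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Made-up labels and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 440.0])

rmse_value = rmse(y_pred, y_true)
mae_value = mae(y_pred, y_true)
print(rmse_value, mae_value)
```

The single large error (40) pulls RMSE above MAE, which is why RMSE is described as more sensitive to outliers.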

#### Get the Data<a id='ch2'></a>

* Download the Data

In [None]:
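import os
import tarfile
import urllib.request  # Python 3 standard library; replaces six.moves.urllib
import hashlib
import numpy as np  # needed below for np.random.permutation, np.int64, np.ceil
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix  # pandas.tools.plotting no longer exists in recent pandas
from skimage import io
%matplotlib inline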

In [None]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing") # 'datasets/housing'
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz" # DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

In [None]:
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    '''
    1. Create the directory (if it does not exist)
    2. Download housing.tgz
    3. Extract housing.csv
    '''
    if not os.path.isdir(housing_path): # check whether housing_path (HOUSING_PATH => datasets/housing) exists under the current directory
        os.makedirs(housing_path) # create the directory if housing_path is missing
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path) # url => housing_url, filename => tgz_path
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [None]:
fetch_housing_data()

In [None]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv") # the directory was created above, so the existence check is omitted
    return pd.read_csv(csv_path) # dataframe

* Take a Quick Look at the Data Structure

In [None]:
housing = load_housing_data() # store the DataFrame in the variable housing
housing.head() # head shows the top n rows; the default is 5, but n can be set explicitly

In [None]:
housing.info() # the textbook omits the ()

In [None]:
housing["ocean_proximity"].value_counts() # applied to a single column; similar to SQL COUNT with GROUP BY

In [None]:
housing.describe() # summarizes the numerical columns only (nulls are ignored)
                   # for ocean_proximity, use housing['ocean_proximity'].describe()
                   # it reports a different set of statistics

In [None]:
housing.hist(bins=50, figsize=(15, 10)) # the bins setting affects how clearly each distribution shows up
plt.show()

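* Raw data capped at lower and upper bounds
    * feature
        * median income, housing median age => adjusted (capped)
    * target
        * Problem: the model may learn that the target never goes beyond the capped value
        * Options (for values at or above the upper cap)
            * Single out the capped instances (so that proper labels can be collected for them)
            * Remove them from the dataset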
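
The two options above can be sketched with pandas as follows. This is only an illustration: the DataFrame `toy` holds made-up values, and treating the largest observed target value as the cap is an assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical miniature dataset; the numbers are made up for illustration.
toy = pd.DataFrame({
    "median_income": [2.5, 8.1, 15.0, 3.9, 15.0],
    "median_house_value": [150000.0, 500001.0, 500001.0, 210000.0, 320000.0],
})

# Assumption: the cap equals the largest value observed in the target column.
cap = toy["median_house_value"].max()
is_capped = toy["median_house_value"] >= cap

capped_rows = toy[is_capped]   # option 1: single these out, e.g. for relabeling
uncapped = toy[~is_capped]     # option 2: drop them from the dataset
n_capped = int(is_capped.sum())
print(n_capped, "capped rows,", len(uncapped), "rows kept")
```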

In [None]:
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data)) # np.random.permutation(k) => the integers 0 to k-1 in random order
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices] # returns (train_set, test_set) as a tuple

In [None]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

In [None]:
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

* loc :  works on labels in the index.
* iloc : works on the positions in the index (so it only takes integers).
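
A small sketch of the difference, using a made-up DataFrame whose index labels are deliberately out of order so that label and position disagree:

```python
import pandas as pd

# Index labels (2, 0, 1) differ from row positions (0, 1, 2) on purpose.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = df.loc[0, "value"]    # row whose index label is 0
by_position = df.iloc[0, 0]      # first row, first column
print(by_label, by_position)
```

This is why `split_train_test` uses `iloc` (the permutation yields positions), while `split_train_test_by_id` uses `loc` with a boolean mask.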

In [None]:
housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index') # use the index column as the id
print(len(train_set), "train +", len(test_set), "test") # caveats when the row index serves as the id:
                                                        # new rows must be appended at the end
                                                        # no row may ever be deleted
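
To see why the hash-based split stays stable when rows are appended, here is a minimal sketch. The helper `in_test_set` is a hypothetical stand-in that mirrors `test_set_check`, the ids are synthetic, and `tobytes()` is used to hand the id's raw bytes to MD5 explicitly.

```python
import hashlib
import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio=0.2):
    # Hash the id and keep the instance for testing if the last byte is small enough.
    digest = hashlib.md5(np.int64(identifier).tobytes()).digest()
    return digest[-1] < 256 * test_ratio

old_ids = pd.Series(range(1000))   # synthetic ids of the original dataset
new_ids = pd.Series(range(1200))   # same ids plus 200 appended rows

old_mask = old_ids.apply(in_test_set)
new_mask = new_ids.apply(in_test_set)

# The first 1000 instances land on the same side of the split both times.
unchanged = bool((old_mask == new_mask.iloc[:1000]).all())
test_fraction = float(new_mask.mean())
print(unchanged, test_fraction)
```

Because the decision depends only on each instance's id, an instance never migrates between the train and test sets as the dataset grows.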

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42) # train/test split using scikit-learn

In [None]:
housing["median_income"].hist();

In [None]:
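housing["income_cat"] = np.ceil(housing["median_income"] / 1.5) # np.ceil(k) => smallest integer greater than or equal to k (counterpart of np.floor(k))
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0) # cap the category at 5; assigning the result avoids an in-place update on a column selection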
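
A small worked example of the binning above, using made-up income values (in the same units as `median_income`):

```python
import numpy as np
import pandas as pd

# Made-up median incomes, for illustration only
income = pd.Series([0.5, 1.5, 1.6, 4.4, 7.5, 15.0])

income_cat = np.ceil(income / 1.5)                  # 1.5-wide bins, rounded up
income_cat = income_cat.where(income_cat < 5, 5.0)  # merge everything from 5 upward into category 5
print(income_cat.tolist())
```

Dividing by 1.5 limits the number of categories, and the cap keeps the sparse high-income tail from producing tiny strata.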

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) # n_splits: how many train/test splits to generate
                                                                           # n_splits=1, so the for loop runs once, i.e. a single split
for train_index, test_index in split.split(housing, housing["income_cat"]): # stratify on housing['income_cat']
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [None]:
housing["income_cat"].value_counts() / len(housing)
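
To illustrate what stratification guarantees, the sketch below compares category proportions in the full data and in a stratified test set. It is not the `StratifiedShuffleSplit` call used above: it swaps in pandas' `groupby(...).sample(...)` to draw 20% from each category, and the category counts in `toy` are made up.

```python
import pandas as pd

# Made-up category counts; not the real housing distribution.
toy = pd.DataFrame({
    "income_cat": [1.0] * 50 + [2.0] * 300 + [3.0] * 350 + [4.0] * 200 + [5.0] * 100
})

# Draw 20% of each category for the test set; the rest is the training set.
strat_test = toy.groupby("income_cat").sample(frac=0.2, random_state=42)
strat_train = toy.drop(strat_test.index)

overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = float((overall - in_test).abs().max())
print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
```

With these counts the test-set proportions match the overall ones exactly; a purely random split would only match them approximately.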

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

#### Discover and Visualize the Data to Gain Insights <a id='ch3'></a>

In [None]:
housing = strat_train_set.copy()

* Visualizing Geographical Data

In [None]:
housing.head(3)

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude");

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1);

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend();

* Looking for Correlations

In [None]:
link = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1200px-Correlation_examples2.svg.png"
io.imshow(io.imread(link))
io.show()

In [None]:
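# Method 1
corr_matrix = housing.drop("ocean_proximity", axis=1).corr() # drop the text column first: recent pandas versions raise an error on non-numeric columns
corr_matrix["median_house_value"].sort_values(ascending=False) # measures linear correlation only
                                                               # the coefficient says nothing about the slope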
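
The two remarks in the comments can be checked on synthetic data. In this sketch the column names and slopes are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
df = pd.DataFrame({
    "x": x,
    "steep": 10.0 * x,     # perfectly linear, large slope
    "shallow": 0.1 * x,    # perfectly linear, small slope
    "falling": -3.0 * x,   # perfectly linear, negative slope
    "parabola": x ** 2,    # strong but nonlinear dependence
})

corr_with_x = df.corr()["x"]
print(corr_with_x)
```

Both `steep` and `shallow` reach a coefficient of 1 despite very different slopes, while `parabola` comes out near 0 even though it is fully determined by `x`.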

In [None]:
# Method 2
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8)); # the main diagonal would be full of straight lines
                                                      # if pandas plotted each variable against itself,
                                                      # so by default it shows a histogram of each attribute instead (other options exist)

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1);

* Experimenting with Attribute Combinations

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
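corr_matrix = housing.drop("ocean_proximity", axis=1).corr() # drop the text column first: recent pandas versions raise an error on non-numeric columns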
corr_matrix["median_house_value"].sort_values(ascending=False)