Create a *fetch_housing_data()* function to download and extract the dataset

In [None]:
import os
import tarfile # for unzipping the dataset
import urllib.request # a bare `import urllib` does not reliably load the request submodule
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    print("Data downloaded")


Fetch the data by calling the function

In [None]:
fetch_housing_data()

Import pandas and create a *load_housing_data()* function to load the data as a pandas DataFrame

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

Create the *housing* DataFrame and display its first five rows using the *head()* method

In [None]:
housing = load_housing_data()

housing.head()


Get a quick description of the data using the *info()* method

In [None]:
housing.info()

Find out what categories exist and how many districts belong to each category using the *value_counts()* method

In [None]:
housing["ocean_proximity"].value_counts()

View a summary of the numerical attributes using the *describe()* method

In [None]:
housing.describe()

Import matplotlib and plot a histogram of each numerical attribute

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

Add a *split_train_test()* function that splits the data into a train set and a test set, given the data and a *test_ratio*


In [None]:
import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

Save test data to *test_set* and train data to *train_set*

In [None]:
train_set, test_set = split_train_test(housing, 0.2)
len(train_set), len(test_set)  # show both sizes; only the last expression in a cell is displayed

To keep the train and test sets consistent when the dataset is updated, we use a hash of each instance's identifier to decide which set it belongs to

In [None]:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
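To see why a hash-based check keeps the split stable, here is a minimal sketch that uses only the standard library. It assumes each identifier is encoded as an 8-byte little-endian signed integer (intended to mirror the bytes of an `np.int64` on a typical little-endian machine), and the helper name `in_test_set` is made up for this illustration.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    # Assumption: encode the id as 8 little-endian bytes, like an int64.
    raw = int(identifier).to_bytes(8, "little", signed=True)
    # The instance goes to the test set if its hash falls in the lowest
    # test_ratio fraction of the 32-bit hash range.
    return (crc32(raw) & 0xffffffff) < test_ratio * 2**32

old_ids = range(1000)
before = {i: in_test_set(i, 0.2) for i in old_ids}

# Simulate a dataset update that appends 500 new instances.
new_ids = range(1500)
after = {i: in_test_set(i, 0.2) for i in new_ids}

# Every old instance keeps its assignment after the update.
unchanged = all(before[i] == after[i] for i in old_ids)
print("Assignments unchanged:", unchanged)
```

Because the decision depends only on the identifier, appending rows never moves an existing instance from one set to the other, which a reshuffle with a fixed random seed cannot guarantee.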

Create a *split_train_test_by_id()* function to split the data by id using the hash function

In [None]:
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

Add an index column to the housing data so it can serve as the identifier for the *split_train_test_by_id()* function

In [None]:
housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

Combine housing latitude and longitude to create a unique identifier "id"

In [None]:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Use sklearn's train_test_split function to split the data into train and test set

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

The *pd.cut()* function is used to divide the median income into five categories labeled from 1 to 5: category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 from 1.5 to 3, and so on.

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                                bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                labels=[1, 2, 3, 4, 5])
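The binning rule can also be written out by hand. The sketch below is a pure-Python illustration, assuming the default right-inclusive intervals of *pd.cut()* (so 1.5 falls in category 1 and 1.6 in category 2); the helper `income_category` is hypothetical and only valid for incomes greater than 0.

```python
from bisect import bisect_left
import math

# Same bin edges as in the pd.cut() call above.
bin_edges = [0.0, 1.5, 3.0, 4.5, 6.0, math.inf]

def income_category(median_income):
    # Intervals are right-inclusive: (0, 1.5] -> 1, (1.5, 3.0] -> 2, ...
    # bisect_left returns the index of the first edge >= median_income.
    return bisect_left(bin_edges, median_income)

for income in (1.5, 1.6, 4.0, 6.0, 8.3):
    print(income, "->", income_category(income))
```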

Plot the histogram of *income_cat* column

In [None]:
housing["income_cat"].hist()
plt.show()

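Import *StratifiedShuffleSplit* from sklearn.model_selection to split the data with stratified sampling based on the income category.
Stratification reduces sampling bias by keeping the income category proportions in the train and test sets close to those of the full dataset.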

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Take a look at income category proportions in the test set

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
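The idea behind stratified sampling can be sketched in plain Python. This is not how *StratifiedShuffleSplit* is implemented; it is a simplified illustration on synthetic labels, and the function name `stratified_split` is an assumption made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_ratio, seed=42):
    rng = random.Random(seed)
    # Group the row positions by their category label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    # Sample the same fraction from every category.
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic category labels (made-up counts, not the housing data).
labels = [1] * 50 + [2] * 300 + [3] * 350 + [4] * 200 + [5] * 100
train_idx, test_idx = stratified_split(labels, 0.2)

test_counts = Counter(labels[i] for i in test_idx)
for category in sorted(test_counts):
    print(category, test_counts[category] / len(test_idx))
```

Each category makes up the same share of the test set as of the full list, which is the property checked by the *value_counts()* cell above.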

Remove the *income_cat* attribute so the data is back to original state

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Copy the *strat_train_set* dataframe to visualize the train set

In [None]:
housing = strat_train_set.copy()

Visualize the geographical distribution of the train set

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")

Use the *alpha* parameter to visualize the density of the data points

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

The radius of each circle represents the district’s population (option s), and the color represents the price (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low values) to red (high prices).

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                s=housing["population"]/100, label="population",figsize=(10,7),
                c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the *corr()* method.

In [None]:
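# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity,
# which would otherwise raise an error on pandas >= 2.0
corr_matrix = housing.corr(numeric_only=True)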
print(corr_matrix)
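For reference, Pearson's r for a single pair of attributes can be computed by hand as the covariance divided by the product of the standard deviations. The following is a minimal sketch on made-up numbers; the function name `pearson_r` is chosen for this illustration only.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The constant factors cancel out in the ratio.
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # perfectly linear, increasing
print(pearson_r(xs, [-3 * x for x in xs]))     # perfectly linear, decreasing
print(pearson_r(xs, [x ** 2 for x in xs]))     # monotonic but nonlinear
```

The coefficient only measures linear relationships, so the last case scores below 1 even though y is fully determined by x.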

Now let’s look at how much each attribute correlates with the median house value



In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

Import *scatter_matrix* from pandas to visualize the correlations between attributes.
This scatter matrix plots every numerical attribute against every other numerical attribute, plus a histogram of each numerical attribute on the main diagonal.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
                "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))


Plot *median_house_value* against *median_income* to look at their correlation

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
                alpha=0.1)

Create new attributes by combining the available ones; the combinations are more likely to have a significant effect on the required output

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

And now let's look at the correlation matrix again

In [None]:
corr_matrix = housing.corr(numeric_only=True)  # skip the text column (see pandas >= 2.0)
corr_matrix["median_house_value"].sort_values(ascending=False)

Let’s revert to a clean training set (by copying strat_train_set once again). Let’s also separate the predictors and the labels, since we don’t necessarily want to apply the same transformations to the predictors and the target values.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

print(housing_labels)


## Data Cleaning

The *total_bedrooms* attribute has some missing values. To fix this, we can either:
1. Get rid of the corresponding districts
1. Get rid of the whole attribute
1. Set the missing values to some value (zero, the mean, the median, etc.)

In [None]:
housing.dropna(subset=["total_bedrooms"])   # option 1
housing.drop("total_bedrooms", axis=1)      # option 2
median = housing["total_bedrooms"].median() # option 3
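# assign the result back; in-place fillna on a single column is deprecated in recent pandas
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)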

Using *SimpleImputer* from *sklearn*, create a *SimpleImputer* instance, specifying that you want to replace each attribute’s missing values with the median of that attribute

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Since the median can only be computed on numerical attributes, you need to create a copy of the data without the text attribute *ocean_proximity*. The *fit()* method then computes the median of each attribute and stores the results in the imputer's *statistics_* instance variable.

In [None]:
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

Now you can use this “trained” imputer to transform the training set by replacing missing values with the learned medians. The *transform()* method fills the missing values in *housing_num* using the medians that *fit()* computed earlier from *housing_num*.

In [None]:
X = imputer.transform(housing_num)
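The split between learning the medians and applying them can be pictured with a small pure-Python sketch. It is not Scikit-Learn's implementation; the helper names `fit_medians` and `transform` and the tiny data set are made up for illustration.

```python
import math
from statistics import median

def fit_medians(rows):
    # Learn one median per column, ignoring missing values (NaN).
    columns = zip(*rows)
    return [median(v for v in col if not math.isnan(v)) for col in columns]

def transform(rows, medians):
    # Replace each missing value with the median learned for its column.
    return [[m if math.isnan(v) else v for v, m in zip(row, medians)]
            for row in rows]

nan = float("nan")
train = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]

medians = fit_medians(train)        # learned once, from the training data
filled = transform(train, medians)  # reusable on any later data
new_rows = transform([[nan, nan]], medians)

print(medians)
print(filled)
print(new_rows)
```

Keeping the learned medians separate from the data is what lets the same values be reused later on the test set and on new data.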

Convert the plain NumPy array *X* back into a pandas DataFrame

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

## Handling Text and Categorical Attributes

So far we have only dealt with numerical attributes, but now let’s look at text attributes. In this dataset, there is just one: the ocean_proximity attribute. Let’s look at its value for the first few instances:


In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head()

Let's convert these categories from text to numbers. For this we can use Scikit-Learn's *OrdinalEncoder* class.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

Get the list of categories using the *categories_* instance variable


In [None]:
ordinal_encoder.categories_

Use one-hot encoding to convert the five *ocean_proximity* categories into binary attributes, one per category

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

Convert the SciPy sparse matrix to a dense NumPy array (this step is optional). A sparse matrix stores only the locations of the nonzero elements, so it takes up less memory than the equivalent NumPy array.

In [None]:
housing_cat_1hot.toarray()
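What one-hot encoding produces can be shown with a short pure-Python sketch. It is only an illustration of the idea (one binary column per category, in sorted order), not the *OneHotEncoder* implementation; the function `one_hot_encode` and the four sample values are assumptions for this example.

```python
def one_hot_encode(values):
    # One column per distinct category, in sorted order.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows

sample = ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]
categories, encoded = one_hot_encode(sample)

print(categories)
for row in encoded:
    print(row)
```

Every row contains a single 1 and zeros elsewhere, which is why a sparse representation saves memory when there are many categories.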

## Custom Transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. Because Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. If you add BaseEstimator as a base class (and avoid &ast;args and &ast;&ast;kwargs in your constructor), you will also get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning.


For example, here is a small transformer class that adds the combined attributes we discussed above.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3,4,5,6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

## Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

**Min-max scaling** (many people call this normalization) is the simplest: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if, for some reason, you don’t want 0–1.

**Standardization** is different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.

As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
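The two formulas can be tried on a handful of made-up income values. The sketch below is a plain-Python illustration rather than the Scikit-Learn transformers; the function names are chosen for this example, and the population standard deviation is used on the assumption that it matches what StandardScaler computes.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    lo, hi = min(values), max(values)
    # Shift by the minimum, then divide by the range.
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    mu, sigma = mean(values), pstdev(values)
    # Subtract the mean, then divide by the standard deviation.
    return [(v - mu) / sigma for v in values]

incomes = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0]  # made-up values in the 0-15 range
with_outlier = incomes + [100.0]            # one erroneous entry

scaled_clean = min_max_scale(incomes)
scaled_outlier = min_max_scale(with_outlier)
standardized = standardize(with_outlier)

print(scaled_clean)
print(scaled_outlier)
print(standardized)
```

With the outlier present, min-max scaling squeezes the six legitimate values into the range 0 to 0.15, as described in the text above.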


## Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)


The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like (as long as they are unique and don’t contain double underscores); they will come in handy later for hyperparameter tuning.

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call until it reaches the final estimator, for which it calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (and of course also a fit_transform() method, which is the one we used).
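The calling sequence of a pipeline can be pictured with a toy version written in plain Python. This is a simplified sketch, not Scikit-Learn's Pipeline class; `ToyPipeline`, `AddOne`, and `MeanCenter` are hypothetical names that only mimic the fit/transform convention.

```python
class AddOne:
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return [x + 1 for x in X]
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class MeanCenter:
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)  # learned from the data it is given
        return self
    def transform(self, X):
        return [x - self.mean_ for x in X]
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        # fit_transform every step but the last, feeding each output forward
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        # the final estimator is only fitted
        self.steps[-1][1].fit(X, y)
        return self

    def transform(self, X):
        # apply every step's transform in order
        for _, step in self.steps:
            X = step.transform(X)
        return X

pipe = ToyPipeline([("add_one", AddOne()), ("center", MeanCenter())])
pipe.fit([1.0, 2.0, 3.0])
result = pipe.transform([1.0, 2.0, 3.0])
print(result)
```

Note that the second step learns its mean from the output of the first step, not from the raw input, which is why the order of the steps matters.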

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. In version 0.20, Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news is that it works great with pandas DataFrames. Let’s use it to apply all the transformations to the housing data.

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)

## Select and Train a Model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.

### Training and Evaluating on the Training Set

Thanks to all the previous steps, things are now going to be much simpler. Let's first train a Linear Regression model.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Done! Now we have a working Linear Regression model. Let's try it out on a few instances from the training set.

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))


It works, although the predictions are not exactly accurate (e.g., the first prediction is off by close to 40%!). Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s mean_squared_error() function.

In [None]:
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

The above *lin_rmse* value is not satisfactory, so we now try a more powerful algorithm to predict the housing prices. Let’s train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data.

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

Now that the model is trained, let's evaluate it on the training set.

In [None]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

The value of *tree_rmse* is zero, which most likely means that the model has badly overfit the data. For a better evaluation, we will use cross-validation.

### Better Evaluation Using Cross-Validation

One way to evaluate the Decision Tree model would be to use the train_test_split() function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It’s a bit of work, but nothing too difficult, and it would work fairly well.

A great alternative is to use Scikit-Learn’s K-fold cross-validation feature. The following code splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared,
                         housing_labels,
                         scoring="neg_mean_squared_error",
                         cv=10
                        )
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores

Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.

Create a function to display the RMSE scores along with their mean and standard deviation.

In [None]:
def display_scores(scores):
    print("Scores:", scores, "\n")
    print("Mean:", scores.mean(), "\n")
    print("Standard Deviation:", scores.std(), "\n")
    
display_scores(tree_rmse_scores)
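The fold rotation behind K-fold cross-validation can be pictured with a small index generator. This is a simplified, unshuffled sketch in plain Python, not Scikit-Learn's implementation; the function name `k_fold_indices` and the sample count of 23 are arbitrary choices for this illustration.

```python
def k_fold_indices(n_samples, k):
    # Distribute the samples as evenly as possible across k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        # One fold is held out for validation, the rest is used for training.
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(23, 10))
for train_idx, val_idx in folds[:3]:
    print(len(train_idx), val_idx)
```

Each sample ends up in the validation fold exactly once, so every score in the resulting array is computed on data the corresponding model never saw during training.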

Let’s compute the same scores for the Linear Regression model.

In [None]:
lin_scores = cross_val_score(lin_reg, housing_prepared,
                            housing_labels,
                            scoring="neg_mean_squared_error",
                            cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

display_scores(lin_rmse_scores)

The Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.

Let's try one last model now: **The RandomForestRegressor**. Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

forest_reg_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_reg_predictions)
forest_rmse = np.sqrt(forest_mse)
print("rmse without cv:", forest_rmse, "\n")

forest_mse_scores = cross_val_score(forest_reg, housing_prepared,
                                housing_labels,
                                scoring="neg_mean_squared_error",
                                cv=10
                               )

forest_rmse_scores = np.sqrt(-forest_mse_scores)
display_scores(forest_rmse_scores)


This model performs better than the previous ones, but the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it (i.e., regularize it), or get a lot more training data.

### Saving The Model

We can easily save Scikit-Learn models by using Python’s **pickle module** or by using the **joblib library**, which is more efficient at serializing large NumPy arrays (you can install this library using pip).

In [None]:
import joblib

# Save the model
#joblib.dump(forest_reg, "RandomForestRegression.pkl")

# Load the saved model
#my_model_loaded = joblib.load("RandomForestRegression.pkl")
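The save-and-reload round trip can be tried without a trained model by using the standard-library pickle module on a stand-in object. This is a minimal sketch: the dictionary `toy_model` and the file name `toy_model.pkl` are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works here.
toy_model = {"coef": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Save the object to disk.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # Load it back as a new, independent object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

The commented-out joblib calls above follow the same dump-then-load pattern, taking a file name instead of an open file object. Only load pickle or joblib files from sources you trust, since loading can execute arbitrary code.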