### Your name:

<pre> SV</pre>

### Collaborators:

<pre> Enter the name of the people you worked with if any</pre>


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Open the housing data


In [2]:
import os
import tarfile
import urllib.request  # standard library in Python 3; no need for six

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### Build full pipeline for the data analysis following the example of the notebook.
 Hint: the main change requested is the algorithm used (Lasso regression).

If you want to learn more about the Lasso regression, see resources below:
- http://scikit-learn.org/stable/modules/linear_model.html#lasso
- https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

#### Considerations for building pipeline:

- Split the data into training and test sets.
- Convert all categorical data to one-hot vectors.
- Normalize all non-categorical data.
- Perform Lasso regression over a range of values of $\alpha$ between 0 and 1 via a grid search, where *housing_labels* is the output and all other features are the inputs (similar to the approach seen in lecture two).
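
For reference, the objective that scikit-learn's `Lasso` minimizes can be written as follows, where $X$ is the prepared feature matrix, $y$ the vector of labels, $w$ the coefficient vector and $n$ the number of training samples (these symbols are chosen here for illustration and are not defined in the assignment text):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Larger values of $\alpha$ penalize the coefficients more strongly, and $\alpha = 0$ reduces the problem to ordinary least squares.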

In [3]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import Lasso
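# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer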
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer



# Write your code here:

# STEP 1: split the data into training (80%) and test (20%) sets - create a stratified sample based on income 
# category
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
# Do the split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# Drop the income category column as we don't need it any more.    
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# STEP 2: prepare the data for machine learning algorithms
# STEP 2A: drop target values from training set and copy to a new list
housing = strat_train_set.drop("median_house_value", axis=1) 
housing_labels = strat_train_set["median_house_value"].copy()


# STEP 3: Add new calculated fields: 
# create a custom transformer to add extra attributes:
# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"])


# STEP 4: build a pipeline for preprocessing the numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

# STEP 5: create a mechanism to separate numerical and categorical features
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

# STEP 6: view housing_prepared dataset
housing_prepared



array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

In [4]:
from sklearn.model_selection import GridSearchCV

# STEP 7: fit a lasso regression model
# Perform hyper-parameter tuning by doing a grid search with varying values of alpha
lassoRegressionModel = Lasso(random_state=999, tol=0.1)
param_grid = [
    # try alpha values from 0.1 to 9.9 in steps of 0.1
    {'alpha': np.arange(0.1, 10, 0.1) }
  ]

# Fit our training set to this model
lasso_grid_search = GridSearchCV(lassoRegressionModel, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
lasso_grid_search.fit(housing_prepared, housing_labels)

# Print details of the hyper parameters that worked best
print(lasso_grid_search.best_params_)
print(lasso_grid_search.best_estimator_)

{'alpha': 2.6}
Lasso(alpha=2.6, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=999,
   selection='cyclic', tol=0.1, warm_start=False)
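
In the cells above, the imputer and scaler are fitted on the whole training set before the grid search runs, so each cross-validation fold is preprocessed with statistics that include its own validation rows. One way to avoid this is to put the preprocessing and the regressor in a single `Pipeline` and search over `lasso__alpha`. The following is a minimal sketch on a small made-up table, not the housing data: the frame `toy`, its columns and the alpha values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(42)
n = 200

# Hypothetical toy data standing in for the housing table
toy = pd.DataFrame({
    "rooms": rng.uniform(1, 10, size=n),
    "income": rng.uniform(1, 15, size=n),
    "region": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], size=n),
})
toy_labels = 20.0 * toy["income"] + 5.0 * toy["rooms"] + rng.normal(scale=3.0, size=n)

# Blank out a few values so the imputer has something to fill
toy.loc[toy.index[::25], "rooms"] = np.nan

# Numerical columns: impute then scale; categorical column: one-hot encode
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# Preprocessing and model in one estimator, so each CV fold refits both
model = Pipeline([
    ("preprocess", preprocess),
    ("lasso", Lasso(random_state=999)),
])

search = GridSearchCV(model, {"lasso__alpha": [0.01, 0.1, 1.0]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(toy, toy_labels)
print(search.best_params_)
```

The `step__parameter` naming (`lasso__alpha`) is how `GridSearchCV` addresses a parameter of a named pipeline step.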


In [18]:
cvres = pd.DataFrame(lasso_grid_search.cv_results_)
# Scores are negative MSE, so negate them before taking the square root
cvres['rmse'] = np.sqrt(-cvres['mean_test_score'])
# Display grid search results ordered by mean_test_score (best first)
cvres.sort_values('mean_test_score', ascending=False)[['param_alpha', 'mean_test_score', 'rmse']]

| | param_alpha | mean_test_score | rmse |
|---|---|---|---|
| 25 | 2.6 | -4.781230e+09 | 69146.436445 |
| 28 | 2.9 | -4.781230e+09 | 69146.441790 |
| 24 | 2.5 | -4.781231e+09 | 69146.446315 |
| 27 | 2.8 | -4.781232e+09 | 69146.453846 |
| 21 | 2.2 | -4.781234e+09 | 69146.468209 |
| 31 | 3.2 | -4.781234e+09 | 69146.469634 |
| 23 | 2.4 | -4.781235e+09 | 69146.471407 |
| 30 | 3.1 | -4.781235e+09 | 69146.476498 |
| 22 | 2.3 | -4.781235e+09 | 69146.478130 |
| 20 | 2.1 | -4.781237e+09 | 69146.490576 |


In [21]:
from sklearn.metrics import mean_squared_error

best_model = lasso_grid_search.best_estimator_
# Note: these predictions are made on the training set, so this is a training RMSE
housing_predictions = best_model.predict(housing_prepared)
lasso_mse = mean_squared_error(housing_labels, housing_predictions)
lasso_rmse = np.sqrt(lasso_mse)
lasso_rmse

68628.6226919338


Why is it necessary to normalize all continuous variables before performing Lasso? (OPTIONAL)

<p>Lasso regression minimizes the prediction error plus a penalty proportional to the sum of the absolute values of the coefficients. The size of a coefficient depends on the units of its feature: a feature with a small numeric range needs a large coefficient to have the same effect on the prediction, so it is penalized more heavily than a feature with a large numeric range. If the features are left on different scales, the penalty therefore falls unevenly for reasons unrelated to how informative each feature is. Normalizing puts all features on comparable scales, so their coefficients have comparable magnitudes and are penalized equitably.</p>
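
The effect of feature scale on the penalty can be illustrated with a small synthetic example. This is a sketch under stated assumptions, not part of the housing analysis: the data, the factor of 0.001 and the value `alpha = 0.5` are made up for illustration. Two features contribute equally to the target, but one is stored with numeric values 1000 times smaller.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 500

# Two equally informative features (hypothetical data)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Store the second feature with values 1000 times smaller, so that it
# would need a coefficient of about 3000 to have the same effect on y
X_raw = np.column_stack([x1, x2 * 0.001])
X_scaled = StandardScaler().fit_transform(X_raw)

alpha = 0.5
coef_raw = Lasso(alpha=alpha).fit(X_raw, y).coef_
coef_scaled = Lasso(alpha=alpha).fit(X_scaled, y).coef_

print("unscaled features:", coef_raw)
print("scaled features:  ", coef_scaled)
```

On the unscaled data the penalty removes the small-scale feature entirely, even though it is as informative as the other one; after standardization both features keep coefficients of similar size.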

### Conclusions
For what values of $\alpha$ does Lasso perform best? Does it perform as well on the housing data as the linear regressor from the lectures? Why do you think this is?

<p>Lasso performed best for an $\alpha$ value of 2.6, although the cross-validated RMSE differs by less than 0.1 across the ten best values of $\alpha$ shown above.
The linear regressor seems to perform slightly better than the Lasso regressor based on root mean squared error values.
Lasso regression forces the coefficients of some features to zero and, among correlated (collinear) features, tends to keep one and drop the others in order to minimize the objective function. This can result in underfitting, especially for larger values of $\alpha$, which might explain the difference in performance compared to linear regression.</p>
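
The trade-off described above can be sketched on synthetic data. This is an illustration only: the dataset comes from `make_regression` rather than the housing table, and the alpha values are chosen arbitrarily. It counts how many coefficients Lasso sets to zero as $\alpha$ grows and compares the training RMSE with that of plain linear regression.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic problem: 10 features, only 4 of which influence the target
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# Ordinary least squares as the unpenalized baseline
ols = LinearRegression().fit(X, y)
ols_rmse = np.sqrt(mean_squared_error(y, ols.predict(X)))
print("OLS training RMSE:", ols_rmse)

results = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    rmse = np.sqrt(mean_squared_error(y, lasso.predict(X)))
    n_zero = int(np.sum(lasso.coef_ == 0))  # coefficients removed by the penalty
    results.append((alpha, n_zero, rmse))
    print("alpha =", alpha, "zeroed coefficients =", n_zero, "training RMSE =", rmse)
```

On the training set, least squares can never have a higher error than Lasso, because Lasso gives up some fit in exchange for smaller coefficients. Whether that exchange helps on unseen data depends on the dataset.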

### Read Appendix B

- Read Appendix B, then reflect on your last data project. Write down a few of the checklist items that your last data project could have used. If you have not yet done a data project, write down a few of the items that you found most interesting.


<p>My last data project (as part of this course) involved the analysis of BikeShare ridership data for the City of Toronto.
It involved most of the items in the checklist, except for No. 8, since it was a one-off analysis initiative and did not involve deploying the solution in a production environment:<br><br>

<b>1. Frame the problem and look at the big picture.</b><br>
The objective of the project was to analyze the data for geospatial/temporal patterns and to predict the number of bicycle rides likely to occur on a given day.<br><br>
<b>2. Get the data.</b><br>
The data was downloaded from the City of Toronto open data website as Excel files.<br><br>
<b>3. Explore the data to gain insights.</b><br>
Exploratory data analysis was carried out. A few ambiguities were found regarding the timestamp values in the data. An email was sent out to the City of Toronto open data team to seek clarifications. When no reply was forthcoming, these were noted as assumptions and potential limitations to the validity of the analysis. <br><br>
<b>4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.</b><br>
The data was cleaned by dropping incomplete rows, categorical values were one-hot encoded, and additional calculated fields were added in order to prepare the data for ML algorithms. <br><br>
<b>5. Explore many different models and short-list the best ones.</b><br>
<b>6. Fine-tune your models and combine them into a great solution.</b><br>
Only two models were applied and compared: Linear Regression and Random Forest Regression. The project could have done better on this item: a few more models could have been applied and a grid search performed to also include hyper-parameter tuning.<br><br> 
<b>7. Present your solution.</b><br>
An animated graphical representation of the total rides on a given day was built and presented to the audience. This really helped identify intra-day patterns within the data. Wherever possible, generic terms were used instead of more arcane mathematical or statistical terms, and graphical comparisons of the two models were presented along with the rationale behind the final model selection.

</p>