In this exercise, the task is to **train a regression model** to predict housing prices. To do this, you will use the *Boston Housing Dataset*, which contains information collected by the U.S. Census Service about housing in the area of Boston, Massachusetts.

In [29]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Input, Model

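**Load the dataset** either via [sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset) or directly from its [original source](http://lib.stat.cmu.edu/datasets/boston) into two *pandas* DataFrames, `X` (the features) and `Y` (the target). Note that `load_boston` was removed in scikit-learn 1.2, so with recent versions you need to load the data from the original source.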
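
Loading from the original source could look roughly like the sketch below. It assumes the file layout at that URL (22 lines of header text, then each record split over two physical lines with 11 and 3 values). The function name `load_boston_from_source` and the target column name `MEDV` are illustrative choices, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Feature names in the order in which they appear in the raw file.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def load_boston_from_source(source="http://lib.stat.cmu.edu/datasets/boston"):
    """Parse the raw Boston file (URL, path or file-like object) into X and Y."""
    # Assumption: 22 header lines precede the data.
    raw = pd.read_csv(source, sep=r"\s+", skiprows=22, header=None)
    values = raw.values
    # Even rows hold the first 11 features, odd rows hold B, LSTAT and the target.
    features = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    X = pd.DataFrame(features, columns=COLUMNS)
    Y = pd.DataFrame({"MEDV": target})
    return X, Y
```

Calling `X, Y = load_boston_from_source()` would then fetch the file over the network; a local copy of the file can be passed as `source` instead.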

Provide a **description** of the dataset in *pandas*. This will help you to get a first overview.

In [1]:
# 'describe()' the dataset

**Clean up the data** by checking for missing values and removing them.

*Hint: DataFrames have a built-in method for doing that.*

In [2]:
# remove missing entries (if there are any)
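
The description and clean-up steps can be sketched on a toy example as follows. The values below are hypothetical stand-ins for `X` and `Y`; the point is that rows should be dropped from both DataFrames together so that features and targets stay aligned.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and Y (hypothetical values) with one missing entry.
X = pd.DataFrame({"RM": [6.5, np.nan, 5.9, 7.1],
                  "LSTAT": [4.9, 9.1, 12.4, 3.0]})
Y = pd.DataFrame({"MEDV": [24.0, 21.6, 18.9, 34.7]})

print(X.describe())    # count, mean, std, min, quartiles and max per column
print(X.isna().sum())  # number of missing values per column

# Drop incomplete rows from both frames at once so that they stay aligned.
complete = pd.concat([X, Y], axis=1).dropna()
X_clean = complete[X.columns]
Y_clean = complete[Y.columns]
```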

To better understand the dataset, **visualize** correlations between feature values and target values. This should highlight which features are most important for our models.

In [3]:
# create a scatter plot for each feature against the target variable
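
One possible way to draw one scatter plot per feature is a row of subplots that share the y-axis. The sketch below uses synthetic data with made-up column names (`f0`, `f1`, `f2`); with the real dataset you would loop over the columns of `X` instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: only f0 is actually related to the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])
y = 2.0 * X["f0"] + rng.normal(scale=0.1, size=50)

# One panel per feature, target on the shared y-axis.
fig, axes = plt.subplots(1, X.shape[1], figsize=(4 * X.shape[1], 3), sharey=True)
for ax, name in zip(axes, X.columns):
    ax.scatter(X[name], y, s=10)
    ax.set_xlabel(name)
axes[0].set_ylabel("target")
fig.tight_layout()
```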

As you should be able to see from the scatter plots, the features 'CHAS' and 'RAD' appear to be categorical: their points fall on a small number of straight lines (vertical lines, if the target value is on the y-axis). This agrees with the attribute description of the dataset.

**Visualize** the distribution of each feature using a plot of your choice. Additionally, **plot the cross-correlations** of all features. This will highlight redundancies in our dataset.

*Hint: DataFrames have built-in methods for simple visualizations. You may use them.*

In [4]:
# visualize the distribution of features

In [5]:
# calculate and print the cross-correlation matrix for X

# visualize the cross-correlations between all features
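
As a minimal sketch of the cross-correlation step, `DataFrame.corr()` returns the Pearson correlation matrix, and strongly correlated pairs can be read off its upper triangle. The data and the threshold of 0.9 below are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic data: column b is a linear function of a, column c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": 2.0 * a + 1.0, "c": rng.normal(size=200)})

corr = X.corr()  # Pearson correlation between all pairs of columns
print(corr.round(2))

# Keep each pair once (upper triangle) and list the strongly correlated ones.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
redundant = pairs[pairs.abs() > 0.9]
print(redundant)
```

For the visualization, a heatmap such as `sns.heatmap(corr)` is one common option.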

Now, **preprocess the data and targets** by normalizing the continuous features using mean normalization and one-hot-encoding the categorical features (CHAS is binary and therefore needs no further encoding).

In [9]:
# normalize continuous features
cont_features = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
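
Mean normalization is commonly defined as subtracting the column mean and dividing by the column range (maximum minus minimum). A minimal sketch under that assumption, with a hypothetical helper `mean_normalize` and toy values:

```python
import pandas as pd


def mean_normalize(df, columns):
    """Return a copy of df with the given columns scaled to (x - mean) / (max - min)."""
    out = df.copy()
    col = out[columns]
    out[columns] = (col - col.mean()) / (col.max() - col.min())
    return out


# Toy values; CHAS is left untouched because it is not a continuous feature.
X = pd.DataFrame({"RM": [5.0, 6.0, 7.0],
                  "TAX": [200.0, 300.0, 700.0],
                  "CHAS": [0, 1, 0]})
X_norm = mean_normalize(X, ["RM", "TAX"])
print(X_norm)
```

If standardization (division by the standard deviation) is what you want instead, the imported `StandardScaler` does that.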

In [6]:
# one-hot-encode the 'RAD' feature
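
One way to one-hot-encode a single column is `pd.get_dummies`, which replaces the column with one indicator column per distinct value. The toy values below are hypothetical.

```python
import pandas as pd

# Toy frame with a categorical RAD column and one continuous column.
X = pd.DataFrame({"RAD": [1, 4, 24, 4], "RM": [6.5, 6.4, 5.9, 7.1]})

# Replace RAD by indicator columns RAD_1, RAD_4, RAD_24.
X_encoded = pd.get_dummies(X, columns=["RAD"], prefix="RAD", dtype=float)
print(X_encoded.columns.tolist())
```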

**Split the data** into training set and test set (80%/20%).


In [7]:
# split data and target DataFrames into data train, data test, target train and target test datasets
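
The 80%/20% split can be sketched with the already imported `train_test_split`. Passing both DataFrames in one call keeps features and targets aligned; the data and `random_state=42` are arbitrary choices for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the preprocessed features and targets.
X = pd.DataFrame({"f0": np.arange(10.0), "f1": np.arange(10.0) ** 2})
Y = pd.DataFrame({"MEDV": np.arange(10.0) * 3.0})

# test_size=0.2 puts 20% of the rows into the test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```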

**Define a total of 4 linear regression models** by combining each of the two weight regularizers, Ridge ($L^2$) and Lasso ($L^1$), with each of the two loss functions, MSE and MAE. **Train all four models**. You can use an optimizer of your choice.

In [8]:
# define linear regression models in TensorFlow
# with L2 weight regularization (Ridge regression)
# with L1 regularization (Lasso regression)
# test MSE and MAE loss functions, what is the difference?
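
For reference, the objectives of the regularized linear models can be written as follows. The notation is an assumption made for this sketch: $w$ and $b$ are the weights and bias, $N$ is the number of training samples, and $\lambda$ is the regularization strength.

$$
\begin{aligned}
\hat{y}_i &= w^\top x_i + b \\
\mathrm{MSE}(w, b) &= \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \\
\mathrm{MAE}(w, b) &= \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \\
J_{\text{Ridge}}(w, b) &= \mathrm{loss}(w, b) + \lambda \lVert w \rVert_2^2 \\
J_{\text{Lasso}}(w, b) &= \mathrm{loss}(w, b) + \lambda \lVert w \rVert_1
\end{aligned}
$$

Here $\mathrm{loss}$ stands for either MSE or MAE. MSE penalizes errors quadratically, whereas MAE penalizes them linearly and is therefore less sensitive to outliers.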


Now, additionally **train a non-linear neural network** consisting of three dense layers (64/32/1 units). Use ReLU activations after each layer except the last.
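
A minimal sketch of such a network with the Keras functional API is shown below. The helper `build_mlp`, the input size of 13, and the choice of the Adam optimizer with MSE loss are assumptions for illustration; use the number of columns of your preprocessed training data as the input size.

```python
from tensorflow.keras import Input, Model, layers


def build_mlp(n_features):
    """Three dense layers (64/32/1 units) with ReLU on all but the last."""
    inputs = Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(1)(x)  # linear output for regression
    return Model(inputs=inputs, outputs=outputs)


# 13 is a placeholder; after one-hot encoding the feature count is larger.
model = build_mlp(n_features=13)
model.compile(optimizer="adam", loss="mse")
model.summary()
```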

It's time to evaluate the performance of your models. For regression tasks, it is useful to compare the values that a model predicts for a given set of features with the actual ground truth. Start by **creating a scatter plot** that has the predicted values on one axis and the true values on the other axis. Also **calculate the $R^2$ score** for each model. You may use the [`r2_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn-metrics-r2-score) function from scikit-learn.

If most of the plotted points lie close to the line $y = x$, the model is performing well.

In [9]:
# plot true against predicted values for each model
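
To make the $R^2$ score concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `r2_score`. The numbers are toy values, not model outputs.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy ground truth and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Note the argument order: ground truth first, predictions second.
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```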

Compare the $R^2$ scores of the models. You should be able to recognize a large difference between the neural network and the linear regression models. Why does the neural network perform so much better on this task?

Plot the error (also called 'residual') distribution for each model by plotting histograms of the differences between predicted and ground truth values.

Ideally, the residuals are normally distributed and centered at 0. 

In [10]:
# plot the error (residual) distribution for each model
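
The residuals themselves are just the element-wise differences between predictions and ground truth. A minimal sketch with toy values, using `np.histogram` to obtain the bin counts that a histogram plot would show:

```python
import numpy as np

# Toy ground truth and predictions.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.6, 33.7, 35.4, 35.2])

# Residual = predicted value minus ground truth.
residuals = y_pred - y_true

# Bin the residuals; the mean shows whether they are centered at 0.
counts, bin_edges = np.histogram(residuals, bins=4)
print(counts, bin_edges)
print(residuals.mean(), residuals.std())
```

For the actual plot, `plt.hist(residuals, bins=...)` draws the same histogram directly.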

In [11]:
# print the absolute correlation of each feature with the target
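
One way to rank the features is `DataFrame.corrwith`, which correlates every column with a given Series. The sketch uses synthetic data; if your target is stored in a DataFrame, pass its single column (for example `Y["MEDV"]`, where the column name is an assumption).

```python
import numpy as np
import pandas as pd

# Synthetic data in which f1 has the strongest (negative) link to the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y = -3.0 * X["f1"] + 0.5 * X["f0"] + rng.normal(scale=0.1, size=200)

# Absolute Pearson correlation of each feature with the target, largest first.
abs_corr = X.corrwith(y).abs().sort_values(ascending=False)
print(abs_corr)

best_feature = abs_corr.index[0]
```

Taking the absolute value matters because a strongly negative correlation is just as informative as a strongly positive one.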

Train a linear regression model and a non-linear regression model using only the feature that has the highest absolute correlation with the target.
Create a scatter plot of the data points and draw the regression curve of each model into it.
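
The single-feature comparison can be sketched as below. The data are synthetic, and the non-linear model here is a quadratic polynomial fit chosen purely as a simple stand-in; in the exercise, the neural network would take its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single feature with a curved relationship to the target.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 80)
y = 1.0 + 0.5 * x ** 2 + rng.normal(scale=0.05, size=x.size)

# Linear model: scikit-learn expects a 2-D feature array.
linear = LinearRegression().fit(x.reshape(-1, 1), y)

# Non-linear stand-in: quadratic polynomial, coefficients highest power first.
quad_coeffs = np.polyfit(x, y, deg=2)

# Evaluate both models on a dense grid to draw smooth regression curves.
x_grid = np.linspace(x.min(), x.max(), 100)
y_linear = linear.predict(x_grid.reshape(-1, 1))
y_quad = np.polyval(quad_coeffs, x_grid)
```

To plot the result, draw the data with `plt.scatter(x, y)` and each curve with `plt.plot(x_grid, y_linear)` and `plt.plot(x_grid, y_quad)`.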