# Linear Regression with a Real Dataset

This notebook uses a real dataset to predict the prices of houses in California.   

## The Dataset
  
The [dataset for this exercise](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) is based on 1990 census data from California. The dataset is old but still provides a great opportunity to learn about machine learning programming.

In [1]:
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt

# Adjust the granularity of reporting.
# pd.options.display.max_rows = 10
# pd.options.display.float_format = "{:.2f}".format

The following code imports the .csv file into a pandas DataFrame and scales the values in the label (`median_house_value`):

Scaling `median_house_value` puts the value of each house in units of thousands. Scaling will keep loss values and learning rates in a friendlier range.  

Although scaling a label is usually *not* essential, scaling features in a multi-feature model usually *is* essential.

In [2]:
# Import the dataset.
training_df = pd.read_csv(filepath_or_buffer="https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")

# Scale the label.
training_df["median_house_value"] /= 1000.0

# Print the first rows of the pandas DataFrame.
training_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66.9 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80.1 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85.7 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73.4 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65.5 |
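
The remark above that scaling features is usually essential in a multi-feature model can be illustrated with a minimal sketch. It applies Z-score normalization to two columns with very different ranges. The name `toy_df` is hypothetical, its values are copied from the five rows printed above, and Z-score scaling is only one common choice, not necessarily the method this exercise intends.

```python
import pandas as pd

# Hypothetical mini-DataFrame; values copied from the five rows shown above.
toy_df = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0],
    "median_income": [1.4936, 1.82, 1.6509, 3.1917, 1.925],
})

# Z-score: subtract each column's mean, then divide by its standard deviation.
scaled_df = (toy_df - toy_df.mean()) / toy_df.std()

# After scaling, both columns have mean 0 and standard deviation 1,
# so neither dominates the other purely because of its units.
print(scaled_df.round(2))
```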


## Examine the dataset

A large part of most machine learning projects is getting to know your data. The pandas API provides a `describe` function that outputs the following statistics about every column in the DataFrame:

* `count`, which is the number of non-missing values in that column. Ideally, `count` contains the same value for every column.

* `mean` and `std`, which contain the mean and standard deviation of the values in each column. 

* `min` and `max`, which contain the lowest and highest values in each column.

* `25%`, `50%`, `75%`, which contain various [quantiles](https://developers.google.com/machine-learning/glossary/#quantile).

In [3]:
# Get statistics on the dataset.
training_df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207.300912 |
| std | 2.005166 | 2.13734 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115.983764 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14.999 |
| 25% | -121.79 | 33.93 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566375 | 119.4 |
| 50% | -118.49 | 34.25 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5446 | 180.4 |
| 75% | -118.0 | 37.72 | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265.0 |
| max | -114.31 | 41.95 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500.001 |
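
One way to put these statistics to work is to look for columns whose maximum sits far above the 75% quantile, which can hint at a long right tail or at outliers. The sketch below does this with the `75%` and `max` values copied from the output above. The `stats` DataFrame and the `max_over_q75` column are names made up for illustration.

```python
import pandas as pd

# 75% quantile and max values copied from the describe() output above.
stats = pd.DataFrame(
    {
        "75%": [3151.25, 648.25, 1721.0, 605.25, 4.767, 265.0],
        "max": [37937.0, 6445.0, 35682.0, 6082.0, 15.0001, 500.001],
    },
    index=["total_rooms", "total_bedrooms", "population",
           "households", "median_income", "median_house_value"],
)

# A large max-to-75% ratio suggests a few values far above the bulk of the data.
stats["max_over_q75"] = stats["max"] / stats["75%"]
print(stats.sort_values("max_over_q75", ascending=False))
```

With the full DataFrame loaded, the same ratio could be computed directly from `training_df.describe()` by dividing its `max` row by its `75%` row.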


In [5]:
from linear_regression import build_model, train_model_real_dataset
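
The `linear_regression` module imported here is a local helper whose source is not shown in this notebook. As a rough idea of what training a one-feature linear model involves, the following is a minimal NumPy sketch of mini-batch gradient descent on mean squared error. It is not the helper's actual implementation: the function `train_linear_model`, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def train_linear_model(x, y, learning_rate=0.1, epochs=300, batch_size=4, seed=0):
    """Fit y ~ weight * x + bias with mini-batch gradient descent (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    weight, bias = 0.0, 0.0
    rmse_per_epoch = []
    for _ in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the examples every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            error = weight * x[idx] + bias - y[idx]
            # Gradients of the batch mean squared error w.r.t. weight and bias.
            weight -= learning_rate * 2.0 * np.mean(error * x[idx])
            bias -= learning_rate * 2.0 * np.mean(error)
        # Track the root mean squared error over the whole dataset.
        rmse_per_epoch.append(float(np.sqrt(np.mean((weight * x + bias - y) ** 2))))
    return weight, bias, rmse_per_epoch

# Hypothetical toy data lying exactly on the line y = 3x + 1.
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x + 1.0
weight, bias, rmse = train_linear_model(x, y)
print(weight, bias, rmse[-1])
```

Plain gradient descent like this is sensitive to feature scale. With an unscaled feature in the thousands, such as `population`, it would likely need a much smaller learning rate, a scaled feature, or an adaptive optimizer to stay stable.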

In [18]:
# Hyperparameters
learning_rate = 0.05
epochs = 18
batch_size = 3

# Specify feature and label:
# The model aims to predict the house value based solely on population.
my_feature = "population"
my_label = "median_house_value"

my_model = None
my_model = build_model(learning_rate)
weight, bias, epochs, rmse = train_model_real_dataset(my_model,
                                         training_df,
                                         my_feature,
                                         my_label,
                                         epochs, batch_size)
print("\nThe learned weight for your model is %.4f" % weight)
print("The learned bias for your model is %.4f\n" % bias )

Epoch 1/18
Epoch 2/18
Epoch 3/18
Epoch 4/18
Epoch 5/18
Epoch 6/18
Epoch 7/18
Epoch 8/18
Epoch 9/18
Epoch 10/18
Epoch 11/18
Epoch 12/18
Epoch 13/18
Epoch 14/18
Epoch 15/18
Epoch 16/18
Epoch 17/18
Epoch 18/18

The learned weight for your model is 0.0201
The learned bias for your model is 210.4003



## Use the model to make predictions
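What not to do: make predictions on the same data that was used for training. The code below does exactly that, so treat its output only as a rough look at the model's behavior.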

In [16]:
def predict_house_values(n, feature, label):
  """Predict house values based on a feature."""

  batch = training_df[feature][10000:10000 + n]
  predicted_values = my_model.predict_on_batch(x=batch)

  print("feature   label          predicted")
  print("  value   value          value")
  print("          in thousand$   in thousand$")
  print("--------------------------------------")
  for i in range(n):
    print ("%5.0f %6.0f %15.0f" % (training_df[feature][10000 + i],
                                   training_df[label][10000 + i],
                                   predicted_values[i][0] ))

In [17]:
predict_house_values(10, my_feature, my_label)

feature   label          predicted
  value   value          value
          in thousand$   in thousand$
--------------------------------------
 1960     53             239
 3400     92             280
 3677     69             288
 2202     62             246
 2403     80             252
 5652    295             344
 3318    500             278
 2552    342             256
 1364    118             223
 3468    128             282
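
To put a number on how far off these predictions are, the sketch below computes the root mean squared error and the mean absolute error for the ten rows printed above. The label and predicted values are copied from that table. Ten rows are far too few for a reliable estimate, so this is only a rough check.

```python
import numpy as np

# Label and predicted values (in thousands of dollars) copied from the table above.
labels = np.array([53, 92, 69, 62, 80, 295, 500, 342, 118, 128], dtype=float)
predicted = np.array([239, 280, 288, 246, 252, 344, 278, 256, 223, 282], dtype=float)

errors = predicted - labels
rmse = np.sqrt(np.mean(errors ** 2))       # penalizes large misses more heavily
mean_abs_error = np.mean(np.abs(errors))   # average size of a miss

print(f"RMSE on these 10 rows: {rmse:.1f} thousand dollars")
print(f"Mean absolute error:   {mean_abs_error:.1f} thousand dollars")
```

For comparison, the `describe()` output earlier reports a standard deviation of about 116 for `median_house_value`. An error larger than that on this small sample suggests the model does little better than guessing a constant.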


## Define a synthetic feature
Neither `total_rooms` nor `population` turned out to be a useful feature. Perhaps, though, the *ratio* of `total_rooms` to `population` has some predictive power. That is, perhaps block density relates to median house value.

To explore this hypothesis, do the following: 

1. Create a [synthetic feature](https://developers.google.com/machine-learning/glossary/#synthetic_feature) that's a ratio of `total_rooms` to `population`. (If you are new to pandas DataFrames, please study the [Pandas DataFrame Ultraquick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=linearregressionreal-colab&utm_medium=colab&utm_campaign=colab-external&utm_content=pandas_tf2-colab&hl=en).)
2. Tune the three hyperparameters.
3. Determine whether this synthetic feature produces 
   a lower loss value than any of the single features you 
   tried earlier in this exercise.
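
Step 3 asks for a comparison of loss across features. One quick sanity check, separate from the training helper used in this notebook, is to fit an ordinary least-squares line in closed form with `np.polyfit` and compare the resulting RMSE per feature. The sketch below swaps in that technique. The function `least_squares_rmse` and `toy_df` are hypothetical, and the data is just the five rows printed by `training_df.head()` earlier, so the numbers illustrate the mechanics only.

```python
import numpy as np
import pandas as pd

def least_squares_rmse(df, feature, label):
    """RMSE of the best-fit line label ~ w * feature + b (closed-form fit)."""
    w, b = np.polyfit(df[feature], df[label], deg=1)
    residuals = df[label] - (w * df[feature] + b)
    return float(np.sqrt(np.mean(residuals ** 2)))

# The five rows shown by training_df.head(); far too few for real conclusions.
toy_df = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0],
    "population": [1015.0, 1129.0, 333.0, 515.0, 624.0],
    "median_house_value": [66.9, 80.1, 85.7, 73.4, 65.5],
})

# Synthetic feature: ratio of total_rooms to population.
toy_df["rooms_per_person"] = toy_df["total_rooms"] / toy_df["population"]

for feature in ["total_rooms", "population", "rooms_per_person"]:
    print(feature, round(least_squares_rmse(toy_df, feature, "median_house_value"), 2))
```

Because the closed-form fit needs no hyperparameters, it gives a reference RMSE that a tuned gradient-descent run on the same feature should be able to approach.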

In [None]:
training_df["rooms_per_person"] = training_df["total_rooms"] / training_df["population"]

my_feature = "rooms_per_person"

learning_rate = 0.1
epochs = 30
batch_size = 3

my_model = build_model(learning_rate)
wei