<a href="https://colab.research.google.com/github/rahiakela/mlops-research-and-practice/blob/main/MLOps-Specialization/course-3-machine-learning-modeling-pipelines-in-production/week-2-model-resource-management-techniques/01_manual_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Manual Feature Engineering

Welcome! In this ungraded lab you will perform feature engineering using TensorFlow and Keras. By developing a deeper understanding of the problem and proposing transformations of the raw features, you will see how the predictive power of your model increases. In particular, you will:


1. Define the model using feature columns.
2. Use Lambda layers to perform feature engineering on some of these features.
3. Compare the training history and predictions of the model before and after feature engineering.

**Note**: This lab has some tweaks compared to the code you just saw in the lectures. The main one is that time-related variables are not used in the feature-engineered model.

Let's get started!

## Setup

First, import the necessary packages. You will then set up the working paths and download the dataset.

In [1]:
# Utilities
import os
import logging

# For visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

# For modelling
import tensorflow as tf
from tensorflow import feature_column as fc
from tensorflow.keras import layers, models

# Set TF logger to only print errors (dismiss warnings)
logging.getLogger("tensorflow").setLevel(logging.ERROR)

## Load taxifare dataset

For this lab you are going to use a tweaked version of the [Taxi Fare dataset](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data), which has been pre-processed and split beforehand. 

First, create the directory where the data is going to be saved.

In [2]:
data_folder = os.path.join('.', 'data')

# Create the folder if it does not already exist
os.makedirs(data_folder, exist_ok=True)

Now download the data in csv format from a cloud storage bucket.

In [3]:
!gsutil cp gs://cloud-training-demos/feat_eng/data/taxi*.csv $data_folder

Copying gs://cloud-training-demos/feat_eng/data/taxi-test.csv...
Copying gs://cloud-training-demos/feat_eng/data/taxi-train.csv...
Copying gs://cloud-training-demos/feat_eng/data/taxi-valid.csv...
/ [3 files][  5.3 MiB/  5.3 MiB]                                                
Operation completed over 3 objects/5.3 MiB.                                      


Let's check that the files were copied correctly and look like we expect them to.

In [4]:
os.listdir(data_folder)

['taxi-train.csv', 'taxi-test.csv', 'taxi-valid.csv']

Everything looks fine. Notice that there are three files, one for each split of `training`, `testing` and `validation`.
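
If you prefer an explicit check over reading the listing by eye, a minimal sketch along these lines could work. It assumes the three file names shown in the listing above; `missing_splits` and `EXPECTED_FILES` are hypothetical names introduced for illustration and are not part of the lab.

```python
import os

# File names taken from the directory listing above; adjust if yours differ.
EXPECTED_FILES = ("taxi-train.csv", "taxi-valid.csv", "taxi-test.csv")


def missing_splits(folder, expected=EXPECTED_FILES):
    """Return the expected split files that are absent from `folder`."""
    # A folder that does not exist is treated as containing no files.
    present = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    return [name for name in expected if name not in present]
```

Calling `missing_splits(data_folder)` should return an empty list when all three splits were downloaded.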

## Inspect the data

Now take a look at the training data.

In [5]:
pd.read_csv(os.path.join(data_folder, "taxi-train.csv")).head()

| Unnamed: 0 | fare_amount | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | hourofday | dayofweek |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.1 | 1 | -73.973731 | 40.79191 | -73.962737 | 40.767318 | 14 | 4 |
| 1 | 4.5 | 2 | -73.986495 | 40.739278 | -73.986083 | 40.730933 | 10 | 6 |
| 2 | 2.9 | 1 | -73.956043 | 40.772026 | -73.956245 | 40.773934 | 22 | 3 |
| 3 | 7.0 | 1 | -74.006557 | 40.705797 | -73.980017 | 40.713617 | 6 | 3 |
| 4 | 6.5 | 1 | -73.986443 | 40.741612 | -73.990215 | 40.746467 | 10 | 2 |


The data contains a total of 8 variables.

The `fare_amount` is the target, the continuous value we’ll train a model to predict. This leaves you with 7 features. 

However, this lab focuses on transforming the geospatial features, so the time features `hourofday` and `dayofweek` will be ignored.
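
The column bookkeeping above can be written out as a small sketch. The column names are copied from the table header shown earlier (leading index column omitted); `COLUMNS`, `TARGET` and `TIME_FEATURES` are illustrative names, not variables defined by the lab.

```python
# Column names copied from the header of the training data shown above.
COLUMNS = [
    "fare_amount", "passenger_count",
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "hourofday", "dayofweek",
]
TARGET = "fare_amount"
TIME_FEATURES = {"hourofday", "dayofweek"}

# Everything except the target is a candidate feature.
features = [c for c in COLUMNS if c != TARGET]

# Drop the time features that the feature-engineered model ignores.
kept_features = [c for c in features if c not in TIME_FEATURES]

print(len(COLUMNS), len(features), len(kept_features))  # 8 7 5
```

This leaves the four coordinate columns plus `passenger_count` once the time features are set aside.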

## Create an input pipeline

To load the data for the model, you are going to use an experimental TensorFlow feature that lets you load data directly from a CSV file.

For this, you need to define some lists containing relevant information about the dataset, such as the type of the columns.

In [6]:
# Specify which column is the target
LABEL_COLUMN = "fare_amount"

# Specify numerical columns
# Note: if you had text data you would create another list, STRING_COLS,
# but in this case all features are numerical
NUMERIC_COLS = [
   "pickup_longitude", "pickup_latitude",
   "dropoff_longitude", "dropoff_latitude",
   "passenger_count", "hourofday", "dayofweek"             
]

In [7]:
# A function to separate features and labels
def features_and_labels(row_data):
  label = row_data.pop(LABEL_COLUMN)
  return row_data, label
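
To see what this function does in isolation, the sketch below applies it to a plain Python dictionary standing in for one parsed row. The dictionary is a hand-made illustration using values from the first row of the training data; in the lab itself the function receives a dictionary of tensors from the dataset.

```python
LABEL_COLUMN = "fare_amount"


def features_and_labels(row_data):
    # Remove the label from the row and return it separately.
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label


# Hand-made stand-in for one row (a subset of the columns, for brevity).
row = {
    "fare_amount": 8.1,
    "passenger_count": 1,
    "pickup_longitude": -73.973731,
    "pickup_latitude": 40.79191,
}

# Pass a copy, because pop() mutates the dictionary it is called on.
features, label = features_and_labels(dict(row))

print(label)                      # 8.1
print("fare_amount" in features)  # False
```

Note that `pop` modifies its argument in place, which is why the example passes a copy of `row`.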

In [8]:
# A utility method to create a tf.data dataset from a CSV file
def load_dataset(pattern, batch_size=1, mode="eval"):
  dataset = tf.data.experimental.make_csv_dataset(pattern, batch_size)
  # features, label
  dataset = dataset.map(features_and_labels)
  if mode == "train":
    # Notice the repeat method is used so this dataset will loop infinitely
    dataset = dataset.shuffle(1000).repeat()
    # Prefetch one batch so that input preparation overlaps with training
    # (tf.data.AUTOTUNE would let TensorFlow choose the buffer size instead)
    dataset = dataset.prefetch(1)
  return dataset
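
As a quick sanity check of this loader, the following sketch writes a tiny CSV file to a temporary folder and pulls a single batch from it. The file name `toy.csv` and its three columns are assumptions made for illustration (the real files have more columns), and the two helper functions are repeated only so the snippet runs on its own.

```python
import os
import tempfile

import tensorflow as tf

LABEL_COLUMN = "fare_amount"


def features_and_labels(row_data):
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label


def load_dataset(pattern, batch_size=1, mode="eval"):
    dataset = tf.data.experimental.make_csv_dataset(pattern, batch_size)
    dataset = dataset.map(features_and_labels)
    if mode == "train":
        dataset = dataset.shuffle(1000).repeat()
        dataset = dataset.prefetch(1)
    return dataset


# Write a two-row toy CSV (hypothetical file, values from the first rows above).
csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(csv_path, "w") as f:
    f.write("fare_amount,passenger_count,pickup_longitude\n")
    f.write("8.1,1,-73.973731\n")
    f.write("4.5,2,-73.986495\n")

# Pull one batch: a dictionary of feature tensors plus a label tensor.
features, label = next(iter(load_dataset(csv_path, batch_size=2)))

print(sorted(features.keys()))
print(label.shape)
```

Each element of the dataset is a `(features, label)` pair in which `features` maps column names to batched tensors and the label column has been removed.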

## Create a DNN Model in Keras

Now you will build a simple neural network that takes the numerical features as input through a [`DenseFeatures`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/DenseFeatures) layer (which produces a dense tensor based on the given features), followed by two dense layers with ReLU activation functions and an output layer with a linear activation function (since this is a regression problem).

Since the model is defined using `feature columns`, the first layer might look different from what you are used to. This is done by declaring two dictionaries, one for the inputs (defined as Input layers) and one for the features (defined as feature columns).

The `DenseFeatures` tensor is then computed by passing the feature columns to the constructor of the `DenseFeatures` layer and calling the resulting layer on the inputs (this is easier to understand with code):