## Squeezing the most out of data

- Making data useful before training a model
- Representing data in forms that help models learn
- Increasing predictive quality
- Reducing dimensionality with feature engineering

-> Benefit: reduces the compute resources required, and therefore the cost

**Feature engineering applied during training must also be applied identically during serving**

## Main preprocessing operations

* **Data cleansing**: eliminating or correcting erroneous data

* **Feature tuning**: scaling or normalizing 

* **Dimensionality reduction**: reducing the number of features by creating lower-dimensional, more robust data representations

* **Feature construction**: creating new features from existing ones using several different techniques

## Empirical knowledge of data

* **Text**: stemming, lemmatization, TF-IDF, n-grams, embedding lookup

* **Image**: clipping, resizing, cropping, blur, Canny filters, Sobel filters, photometric distortions
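
As a small illustration of two of the text techniques listed above, the sketch below builds n-grams and computes TF-IDF weights in plain Python. The toy documents and the helper names `ngrams` and `tf_idf` are hypothetical, and the weighting used here (term frequency divided by document length, times `log(N / df)`) is only one of several common TF-IDF variants.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Compute one TF-IDF variant for a list of tokenized documents.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

A term that appears in every document (here `"the"`) gets weight zero under this variant, while a term unique to one document gets the largest weight.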

## Feature engineering techniques

Numerical range:

1. Feature scaling
* Grayscale image pixel intensities in $[0, 255]$ are usually rescaled to $[-1, 1]$: `image = (image - 127.5) / 127.5`

2. Normalization and standardization
* Normalization $X_{norm} = \frac{X - X_{min}}{X_{max}  -  X_{min}}$, $X_{norm}\in[0,1]$
* Standardization (Z-score): $X_{std} = \frac{X - \mu}{\sigma}$, centered on zero with unit standard deviation <br>
(try both and compare the results)
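
The three numerical-range operations above can be sketched with NumPy as follows. This is a minimal illustration, not a production pipeline: the function names and sample arrays are hypothetical, and the helpers assume the input is not constant (so the denominators are nonzero).

```python
import numpy as np


def rescale_pixels(image):
    """Map pixel intensities from [0, 255] to [-1, 1]."""
    return (image.astype(np.float64) - 127.5) / 127.5


def min_max_normalize(x):
    """Min-max normalization: map values to [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)  # assumes x_max > x_min


def standardize(x):
    """Z-score: subtract the mean, divide by the (population) std."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / x.std()  # assumes std > 0


# Hypothetical sample data.
pixels = np.array([0, 51, 255], dtype=np.uint8)
values = np.array([2.0, 4.0, 6.0, 8.0])

scaled = rescale_pixels(pixels)
normalized = min_max_normalize(values)
standardized = standardize(values)
```

Note that the min, max, mean, and standard deviation are computed here from the array being transformed; in a real pipeline they would be computed once on the training data and reused at serving time.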

Grouping:

3. Bucketizing / Binning
* One-hot encoding of value ranges, like the bins of a histogram (bin boundaries can be explored with Facets)
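
A minimal sketch of bucketizing followed by one-hot encoding, using NumPy. The feature (ages) and the bucket boundaries are hypothetical and chosen only for illustration.

```python
import numpy as np


def bucketize_one_hot(values, boundaries):
    """Assign each value to a bucket, then one-hot encode the bucket index.

    With k boundaries there are k + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_{k-1}, +inf).
    """
    boundaries = np.asarray(boundaries)
    # np.digitize returns, for each value, the index of the bucket it falls in.
    bucket_ids = np.digitize(values, boundaries)
    num_buckets = len(boundaries) + 1
    # Row i of the identity matrix is the one-hot vector for bucket i.
    one_hot = np.eye(num_buckets, dtype=np.int64)[bucket_ids]
    return bucket_ids, one_hot


# Hypothetical feature and boundaries.
ages = np.array([5, 18, 42, 70])
age_boundaries = [18, 40, 65]

bucket_ids, one_hot = bucketize_one_hot(ages, age_boundaries)
```

Each row of `one_hot` has exactly one nonzero entry, at the position of the bucket that the value falls into.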

4. Other techniques

Dimensionality reduction:
* PCA: project data along the principal components
* t-SNE: t-Distributed stochastic neighbor embedding
* UMAP: uniform manifold approximation and projection
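
To make the PCA bullet concrete, here is a minimal sketch that projects data onto its leading principal components using NumPy's SVD. The helper name `pca_project` and the synthetic data are assumptions for illustration; in practice a library implementation would typically be used instead.

```python
import numpy as np


def pca_project(data, num_components):
    """Project data onto its first `num_components` principal components."""
    data = np.asarray(data, dtype=np.float64)
    # PCA operates on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_components]
    projected = centered @ components.T
    # Sample variance captured along each kept component.
    explained_variance = singular_values[:num_components] ** 2 / (len(data) - 1)
    return projected, components, explained_variance


# Hypothetical toy data: the third feature is almost a copy of the first,
# so two components capture nearly all of the variance.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = rng.normal(size=(200, 1))
data = np.hstack([x, y, x + 0.01 * rng.normal(size=(200, 1))])

projected, components, explained_variance = pca_project(data, num_components=2)
```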

Feature crossing:
* combine multiple features together into a new feature
* encode nonlinearity in the feature space, or encode the same information in fewer features<br>
(e.g. [Day of week, Hour] => [Hour of week])
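
The day-of-week and hour example can be sketched in plain Python as below. The encoding conventions (days numbered 0 to 6, hours 0 to 23) and the helper names are assumptions made for this illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def hour_of_week(day_of_week, hour):
    """Cross two features into one: (day 0..6, hour 0..23) -> 0..167."""
    if not 0 <= day_of_week < DAYS_PER_WEEK:
        raise ValueError("day_of_week must be in 0..6")
    if not 0 <= hour < HOURS_PER_DAY:
        raise ValueError("hour must be in 0..23")
    return day_of_week * HOURS_PER_DAY + hour


def one_hot(index, size):
    """One-hot encode an integer index as a list of 0s and 1s."""
    return [1 if i == index else 0 for i in range(size)]


# Day 2 at 05:00 becomes a single categorical value.
crossed = hour_of_week(2, 5)
encoded = one_hot(crossed, DAYS_PER_WEEK * HOURS_PER_DAY)
```

With the crossed feature, a linear model can learn a separate weight for each of the 168 day-and-hour combinations, which it cannot do from the two features taken independently.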

## TensorFlow embedding projector

* Intuitive exploration of high-dimensional data

<img src="../Figs/TF_projector.png" width="800"/>

Link: projector.tensorflow.org

## tf.Transform

<img src="../Figs/tfTransform.png" width="800"/>

**Transform graph**: the graph expresses all of the transformations on data

<img src="../Figs/tfTransform2.png" width="800"/>

## tf.Transform Analyzers

Analyzers make a full pass over the data to compute constants such as the min and max.

<img src="../Figs/tfana.png" width="700"/>
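
The idea behind analyzers can be sketched conceptually in plain Python: one full pass over the training data computes constants, and a separate per-example transform reuses those constants during both training and serving. This sketch deliberately does not use the tf.Transform API; the function names `analyze` and `transform` and the sample values are hypothetical.

```python
def analyze(training_values):
    """Analysis phase: a full pass over the training data to find constants."""
    return {"min": min(training_values), "max": max(training_values)}


def transform(value, constants):
    """Transform phase: per-example scaling using the stored constants."""
    span = constants["max"] - constants["min"]  # assumes max > min
    return (value - constants["min"]) / span


# Hypothetical training data for a single numeric feature.
training_values = [10.0, 20.0, 30.0, 50.0]

# Computed once, on the training data only.
constants = analyze(training_values)

# The same transform and the same constants are applied in both settings.
train_features = [transform(v, constants) for v in training_values]
serving_feature = transform(40.0, constants)
```

Because the constants are fixed after the analysis pass, a serving request is scaled exactly as the training examples were, which avoids training/serving skew.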

The Apache Beam DirectRunner executes the pipeline locally on a single machine.

```python
import tensorflow as tf
import apache_beam as beam
import apache_beam.io.iobase

import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
```
