<a href="https://colab.research.google.com/github/satyanarayanaallam/tensorflow_practice/blob/main/Dataset_Preprocessing_with_TensorFlow_Transform_(TFT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Stage 1: Installing dependencies and setting up the environment

In [15]:
!pip install tensorflow-transform



## Stage 2: Import project dependencies

In [17]:
from __future__ import print_function  # must come first; a no-op on Python 3

import tempfile

import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

## Stage 3: Dataset preprocessing

### Loading the Pollution dataset

In [18]:
dataset = pd.read_csv("pollution_small.csv")

In [None]:
dataset.head()

### Dropping the Date column

In [19]:
features = dataset.drop("Date", axis=1)

In [20]:
features.head()

|   | pm10  | no2   | so2   | soot  |
|---|-------|-------|-------|-------|
| 0 | 98.67 | 14.1  | 44.38 | 34.81 |
| 1 | 52.33 | 14.1  | 29.75 | 33.06 |
| 2 | 74.67 | 20.5  | 36.25 | 39.25 |
| 3 | 72.0  | 17.3  | 46.44 | 34.38 |
| 4 | 81.0  | 25.64 | 56.56 | 45.59 |


### Converting the dataset from a DataFrame to a list of Python dictionaries

In [21]:
dict_features = list(features.to_dict("index").values())

In [22]:
dict_features[:2]

[{'pm10': 98.67, 'no2': 14.1, 'so2': 44.38, 'soot': 34.81},
 {'pm10': 52.33, 'no2': 14.1, 'so2': 29.75, 'soot': 33.06}]
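To see why the conversion goes through `.values()`, it helps to look at the intermediate shape. `features.to_dict("index")` returns a dictionary keyed by row index, with one inner dictionary per row. The sketch below mimics that shape with a hand-written literal for the first two rows (the variable names `indexed` and `records` are illustrative and do not appear in the notebook):

```python
# Hand-written stand-in for what features.to_dict("index") returns
# for the first two rows: {row_index: {column_name: value}}
indexed = {
    0: {"pm10": 98.67, "no2": 14.1, "so2": 44.38, "soot": 34.81},
    1: {"pm10": 52.33, "no2": 14.1, "so2": 29.75, "soot": 33.06},
}

# Dropping the row-index keys leaves one dictionary per row,
# in row order (dicts preserve insertion order)
records = list(indexed.values())

print(records)
```

As far as the resulting list is concerned, `features.to_dict("records")` should give the same result in a single call.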

### Defining the dataset metadata

In [24]:
data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        "no2":tf.io.FixedLenFeature([], tf.float32),
        "so2":tf.io.FixedLenFeature([], tf.float32),
        "pm10":tf.io.FixedLenFeature([], tf.float32),
        "soot":tf.io.FixedLenFeature([], tf.float32),
    }
    )
)

In [25]:
data_metadata

{'_schema': feature {
  name: "no2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "pm10"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "so2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "soot"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
, '_output_record_batches': True}
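The schema above declares four scalar float features, each of which must be present in every record (`min_fraction: 1.0`). As a rough sanity check before running the pipeline, the records can be compared against those names in plain Python. This is only a sketch: `EXPECTED_FEATURES` and `check_record` are hypothetical helpers, not part of TensorFlow Transform.

```python
# Feature names copied from the feature spec above
EXPECTED_FEATURES = ("no2", "so2", "pm10", "soot")

def check_record(record):
    """Return (missing, non_numeric) feature names for one record.

    Hypothetical helper for illustration; TFT performs its own validation.
    """
    missing = [name for name in EXPECTED_FEATURES if name not in record]
    non_numeric = [
        name
        for name in EXPECTED_FEATURES
        if name in record and not isinstance(record[name], (int, float))
    ]
    return missing, non_numeric

good = {"pm10": 98.67, "no2": 14.1, "so2": 44.38, "soot": 34.81}
bad = {"pm10": "98.67", "no2": 14.1}

print(check_record(good))
print(check_record(bad))
```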

## Stage 4: The preprocessing function

In [26]:
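def preprocessing_fn(inputs):
    """Transform each input column using statistics computed over the full dataset."""

    no2 = inputs['no2']
    pm10 = inputs['pm10']
    so2 = inputs['so2']
    soot = inputs['soot']

    # Center no2 and so2 by subtracting their dataset-wide means
    no2_normalized = no2 - tft.mean(no2)
    so2_normalized = so2 - tft.mean(so2)

    # Rescale pm10 and soot to the [0, 1] range; with its default arguments
    # (output_min=0.0, output_max=1.0) scale_by_min_max matches scale_to_0_1
    pm10_normalized = tft.scale_to_0_1(pm10)
    soot_normalized = tft.scale_by_min_max(soot)

    return {
        "no2_normalized": no2_normalized,
        "so2_normalized": so2_normalized,
        "pm10_normalized": pm10_normalized,
        "soot_normalized": soot_normalized
    }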
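The arithmetic behind `preprocessing_fn` can be reproduced in plain Python. The sketch below uses only the five rows shown earlier, so its numbers will not match the notebook output, where the mean, minimum, and maximum are computed over the whole dataset. The helper names `center` and `scale_to_unit_range` are illustrative and are not TFT functions.

```python
# Sample values: the first five rows shown earlier, not the full dataset
no2 = [14.1, 14.1, 20.5, 17.3, 25.64]
pm10 = [98.67, 52.33, 74.67, 72.0, 81.0]

def center(values):
    """Subtract the mean from every value, like `x - tft.mean(x)`."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def scale_to_unit_range(values):
    """Map the minimum to 0 and the maximum to 1, like `tft.scale_to_0_1(x)`."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

no2_centered = center(no2)
pm10_scaled = scale_to_unit_range(pm10)

print(no2_centered)
print(pm10_scaled)
```

The key difference in TFT is that the statistics are gathered in a separate analysis pass over all records, then applied to each record as constants.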

## Stage 5: Putting everything together

TensorFlow Transform uses **Apache Beam** in the background to perform scalable data transforms. In the function below we use Beam's direct runner, which executes the pipeline locally.

Arguments to provide to the runner:

    dict_features - Our dataset, converted into a list of Python dictionaries.
    data_metadata - The metadata (schema) we created for the dataset.
    preprocessing_fn - The main preprocessing function. It defines the preprocessing operation applied to each column.


Apache Beam uses a special syntax, built on the `|` (pipe) operator, to stack operations and apply transforms to data:

```
result = data_to_pass | where_to_pass_the_data
```

Let's break down our case:

**result**  -> `transformed_dataset, transform_fn`

**data_to_pass** -> `(dict_features, data_metadata)`

**where_to_pass_the_data** -> `tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)`

```
transformed_dataset, transform_fn = ((dict_features, data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

```
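The pipe syntax is ordinary Python operator overloading. The toy sketch below shows the idea with a class that implements `__ror__`, which Python calls for `left | right` when the left operand (here a tuple) does not support `|` itself. `ToyTransform` and `toy_analyze_and_transform` are hypothetical stand-ins written for this illustration; Beam's real transforms are considerably more involved.

```python
class ToyTransform:
    """Hypothetical stand-in for a Beam transform; not part of Apache Beam."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, data):
        # Called for `data | transform` because tuples do not define `|`
        return self.fn(data)

def toy_analyze_and_transform(preprocess):
    """Build a toy transform that mirrors the (dataset, transform_fn) result shape."""
    def apply(inputs):
        rows, metadata = inputs
        transformed_rows = [preprocess(row) for row in rows]
        return (transformed_rows, metadata), preprocess
    return ToyTransform(apply)

def double_pm10(row):
    # Illustrative per-row preprocessing step
    return {"pm10_doubled": row["pm10"] * 2}

rows = [{"pm10": 98.67}, {"pm10": 52.33}]
metadata = {"pm10": "float"}

# Same shape as the TFT call: (data, metadata) | transform
transformed_dataset, transform_fn = (rows, metadata) | toy_analyze_and_transform(double_pm10)
transformed_rows, out_metadata = transformed_dataset

print(transformed_rows)
```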

If you want to learn more about the syntax, we recommend this link:
https://beam.apache.org/documentation/programming-guide/#applying-transforms

LINKS:
> more about Apache Beam: https://beam.apache.org/

In [27]:
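def data_transform():

    # Run the analyze and transform phases, using a temporary working directory
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (dict_features, data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )

    # The transformed dataset is a (data, metadata) pair
    transformed_data, transformed_metadata = transformed_dataset

    # Print each raw record next to its transformed counterpart
    for raw, transformed in zip(dict_features, transformed_data):
        print("Raw: ", raw)
        print("Transformed:", transformed)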

In [28]:
data_transform()





Raw:  {'pm10': 98.67, 'no2': 14.1, 'so2': 44.38, 'soot': 34.81}
Transformed: {'no2_normalized': -18.57798194885254, 'pm10_normalized': 0.3407169580459595, 'so2_normalized': 28.85540771484375, 'soot_normalized': 0.283423513174057}
Raw:  {'pm10': 52.33, 'no2': 14.1, 'so2': 29.75, 'soot': 33.06}
Transformed: {'no2_normalized': -18.57798194885254, 'pm10_normalized': 0.16963857412338257, 'so2_normalized': 14.225406646728516, 'soot_normalized': 0.26620757579803467}
Raw:  {'pm10': 74.67, 'no2': 20.5, 'so2': 36.25, 'soot': 39.25}
Transformed: {'no2_normalized': -12.177982330322266, 'pm10_normalized': 0.25211358070373535, 'so2_normalized': 20.725406646728516, 'soot_normalized': 0.32710281014442444}
Raw:  {'pm10': 72.0, 'no2': 17.3, 'so2': 46.44, 'soot': 34.38}
Transformed: {'no2_normalized': -15.377983093261719, 'pm10_normalized': 0.24225644767284393, 'so2_normalized': 30.9154052734375, 'soot_normalized': 0.27919331192970276}
Raw:  {'pm10': 81.0, 'no2': 25.64, 'so2': 56.56, 'soot': 45.59}
Trans