# Preprocessing

What is the motivation for preprocessing?

1. Compatibility

    * Preprocessing makes the data compatible with the library we use. For example, TensorFlow works with `Tensor` objects, not with Excel or CSV files directly.
    * Data can arrive in any format, so we need to make it compatible with whatever tools we use.

## Standardization

* Standardization is the process of transforming data onto a standard scale.
* It is a common form of `Feature Scaling`.

```
standardized variable = (original variable - mean of original variable) / standard deviation of original variable
```

Consider an algorithm that has 2 input variables:

1. Exchange rate
2. Daily trading volume

And we have 3 days' worth of observations, as below:

|Day| Exchange rate | Daily trading volume|
|:---|:---|:---|
|1|1.3|110000|
|2|1.34|98700|
|3|1.25|135000|

Here,

* The mean of the exchange rate is approximately `1.3`

* The (sample) standard deviation of the exchange rate is approximately `0.045`
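
The arithmetic for this example can be checked with a minimal sketch like the one below. It uses only Python's standard `statistics` module; the helper name `standardize` is made up for illustration, and the sketch assumes the sample standard deviation (dividing by `n - 1`), which is what yields roughly `0.045` for the exchange rate.

```python
from statistics import mean, stdev

# Observations from the table above.
exchange_rate = [1.3, 1.34, 1.25]
trading_volume = [110000, 98700, 135000]


def standardize(values):
    """Return (x - mean) / standard deviation for each x in values."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mu) / sigma for x in values]


print(round(mean(exchange_rate), 2))   # 1.3
print(round(stdev(exchange_rate), 3))  # 0.045

# After standardization both variables are on a comparable scale,
# even though the raw values differ by several orders of magnitude.
print([round(z, 2) for z in standardize(exchange_rate)])
print([round(z, 2) for z in standardize(trading_volume)])
```

Each standardized variable ends up with a mean of 0 and a standard deviation of 1, so neither variable dominates just because of its units.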





## One-hot encoding

* One-hot encoding is an encoding technique that transforms data into a numerical form that a model can understand.

* This technique is applied to categorical data when there are only a few categories.

### Categorical data

* Categorical variables contain label values rather than numeric values.
* Categorical variables with no inherent order are also called `Nominal`.

For example:

1. A "pet" variable with the values "dog", "cat" etc.
2. A "color" variable with the values "red", "green" and "blue".

**Notes**

* Some algorithms can work with categorical data directly. For example, a decision tree can be learned directly from categorical data with no data transformation.

* Many algorithms cannot operate on label data directly. They require all input and output variables to be in numeric form, so encoding is required.

### How to transform categorical data to numerical data?

There are 2 steps involved:

1. Label/Integer encoding
2. One-hot encoding

#### Integer encoding

* Each unique category value is assigned an integer value.

For example

|Food name|Integer code|Calories|
|:---|:---|:---|
|Apple|1|95|
|Orange|2|100|
|Broccoli|3|50|

* There is a problem with the above encoding:

   1. Integer values have a natural ordered relationship with each other, which the categories themselves do not have. If your model internally needs to calculate the average across categories, it might compute `(1 + 3) / 2 = 2`. According to your model, the average of Apple and Broccoli is then Orange.
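
The averaging problem can be reproduced in a few lines. This is a small illustrative sketch in plain Python, not how any particular library encodes labels; the dictionary names `category_to_int` and `int_to_category` are made up for the example, and the codes start at 1 only to match the table.

```python
# Food names from the table above.
foods = ["Apple", "Orange", "Broccoli"]

# Assign each unique category an integer, starting at 1 as in the table.
category_to_int = {name: index for index, name in enumerate(foods, start=1)}
int_to_category = {index: name for name, index in category_to_int.items()}

print(category_to_int)  # {'Apple': 1, 'Orange': 2, 'Broccoli': 3}

# A model that treats the codes as numbers can "average" two categories.
average_code = (category_to_int["Apple"] + category_to_int["Broccoli"]) / 2
print(average_code)                        # 2.0
print(int_to_category[int(average_code)])  # Orange
```

The result is meaningless for food names, which is why integer codes alone are risky for categories that have no order.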

#### One-hot encoding

* For categorical variables where no relationship exists, the integer encoding is not enough.

* In fact, using integer encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results.

* In this case, a one-hot encoding can be applied to the integer representation.

For example:

|Apple|Orange|Broccoli|Calories|
|:---|:---|:---|:---|
|1|0|0|95|
|0|1|0|100|
|0|0|1|50|
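
To make the mechanics concrete, here is a minimal from-scratch sketch of one-hot encoding in plain Python. The function name `one_hot_encode` is hypothetical and exists only for this illustration; it assumes columns are ordered by first appearance so that they line up with the table above.

```python
def one_hot_encode(values):
    """One-hot encode a list of category labels.

    Returns the column order and one row of 0/1 flags per input value.
    """
    # Keep categories in order of first appearance so columns match the table.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, rows = one_hot_encode(["Apple", "Orange", "Broccoli"])
print(categories)  # ['Apple', 'Orange', 'Broccoli']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

Every row contains exactly one `1`, so no category is numerically "larger" than or "between" the others.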

### One-hot encoding using TensorFlow 2.0.0/Keras

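The `one_hot` method in TensorFlow converts a set of sparse labels to a dense one-hot representation.

```python
import tensorflow.compat.v1 as tf

# tf.Session only works in graph mode, so turn off TensorFlow 2's eager execution.
tf.disable_eager_execution()

# Build a graph node that one-hot encodes the indices 0, 1 and 2.
output = tf.one_hot(indices=[0, 1, 2], depth=3)
print(output)

# Evaluate the node in a session to get the actual values.
with tf.Session() as sess:
    result = sess.run(output)
print(result)
```

Output (the numeric suffix in the tensor name depends on how many times the cell has been run):

```
Tensor("one_hot_23:0", shape=(3, 3), dtype=float32)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
```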


### One-hot encoding using scikit-learn

```python
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = ["Apple", "Orange", "Broccoli", "Apple", "Grape"]

docs1 = array(data)
print(docs1)

# Step 1: integer-encode the labels (classes are numbered in sorted order).
label_encoding = LabelEncoder()
integer_encoded = label_encoding.fit_transform(docs1)
print(integer_encoded)

# Step 2: one-hot encode the integers; the encoder expects a 2-D column.
# `sparse_output` was named `sparse` before scikit-learn 1.2.
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(-1, 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
```

Output:

```
['Apple' 'Orange' 'Broccoli' 'Apple' 'Grape']
[0 3 1 0 2]
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]]
```


Recent versions of scikit-learn let `OneHotEncoder` encode string categories directly. If you used a `LabelEncoder` only to convert the categories to integers first, you can drop that step and pass the labels to `OneHotEncoder` directly.
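
A minimal sketch of using `OneHotEncoder` directly on string labels, assuming a scikit-learn version that accepts string categories. It calls `.toarray()` on the encoder's default sparse output rather than passing a `sparse`/`sparse_output` flag, since the name of that flag differs between versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array(["Apple", "Orange", "Broccoli", "Apple", "Grape"])

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(data.reshape(-1, 1)).toarray()

# The learned categories, in sorted order, define the column order.
print(encoder.categories_[0])
print(onehot)
```

The columns follow the sorted category order (Apple, Broccoli, Grape, Orange), so the matrix is the same as the one produced by the two-step `LabelEncoder` approach.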


## References

* [Nominal Category](https://en.wikipedia.org/wiki/Nominal_category)

* [Categorical Variable](https://en.wikipedia.org/wiki/Categorical_variable)

* [One-hot Encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)

* [One-hot Tensor](https://www.tensorflow.org/api_docs/python/tf/one_hot)

* [tensorflow.one_hot examples](https://www.programcreek.com/python/example/90553/tensorflow.one_hot)