# Data Transformation

You won't always get data in a convenient format. Often you will have to deal with data that is non-numerical, such as customer names, zip codes, or country names.

Spark has several built-in methods for dealing with these transformations: http://spark.apache.org/docs/latest/ml-features.html

## StringIndexer

We often have to convert string information into numerical information so that it can be used as a categorical feature. This is easily done with the **StringIndexer** estimator, which is first fit on the data to learn the labels and then used to transform it.

Let's create a DataFrame with a **non-numerical** categorical feature.

In [13]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('data').getOrCreate()

In [14]:
from pyspark.ml.feature import StringIndexer
df = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
                           ["user_id", "category"])
df.show()

+-------+--------+
|user_id|category|
+-------+--------+
|      0|       a|
|      1|       b|
|      2|       c|
|      3|       a|
|      4|       a|
|      5|       c|
+-------+--------+



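Now create another column named *categoryIndex* that holds each category as a **numerical** index. By default, StringIndexer orders labels by descending frequency, so the most frequent label gets index 0.0. Here "a" appears three times, "c" twice, and "b" once, which is why "c" is indexed as 1.0 and "b" as 2.0.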

In [15]:
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
df_indexed = indexer.fit(df).transform(df)
df_indexed.show()

+-------+--------+-------------+
|user_id|category|categoryIndex|
+-------+--------+-------------+
|      0|       a|          0.0|
|      1|       b|          2.0|
|      2|       c|          1.0|
|      3|       a|          0.0|
|      4|       a|          0.0|
|      5|       c|          1.0|
+-------+--------+-------------+
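
The index values in the table above come from StringIndexer's default ordering, which ranks labels from most to least frequent. The following is a minimal pure-Python sketch of that idea, not Spark's implementation. The helper name `frequency_index` is hypothetical, and breaking ties alphabetically is an assumption that matches the behaviour documented for recent Spark versions.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Hypothetical helper for illustration only. Ties are broken
    alphabetically, which is an assumption about recent Spark versions.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Same values as the "category" column in the DataFrame above.
categories = ["a", "b", "c", "a", "a", "c"]

mapping = frequency_index(categories)
indexed = [mapping[c] for c in categories]

print(mapping)
print(indexed)
```

The resulting list lines up row by row with the *categoryIndex* column shown above.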



## VectorAssembler

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for **combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models** like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order. 

Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:

     id | hour | mobile | userFeatures     | clicked
    ----|------|--------|------------------|---------
     0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0
     
userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked or not. If we set VectorAssembler’s **input columns** to hour, mobile, and userFeatures and **output column** to features, after transformation we should get the following DataFrame:

     id | hour | mobile | userFeatures     | clicked | features
    ----|------|--------|------------------|---------|-----------------------------
     0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]
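
The concatenation described above can be sketched in plain Python to show where each value in the features vector comes from. This is only an illustration of the idea, not how Spark implements it. The helper name `assemble` and the use of a dict to stand in for a row are assumptions made for the example.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a flat list of floats.

    Hypothetical helper for illustration only. Scalars (numeric or boolean)
    contribute one value, vector-like columns contribute all their values,
    and the order follows input_cols.
    """
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


# One row with the same values as the example table above.
row = {
    "id": 0,
    "hour": 18,
    "mobile": 1.0,
    "userFeatures": [0.0, 10.0, 0.5],
    "clicked": 1.0,
}

features = assemble(row, ["hour", "mobile", "userFeatures"])
print(features)
```

Changing the order of the column names passed in changes the order of the values in the result, which is why the input column order matters when assembling features.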

In [16]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Vectors.dense stores every value explicitly, which suits vectors where most values are non-zero.

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
dataset.show()

+---+----+------+--------------+-------+
| id|hour|mobile|  userFeatures|clicked|
+---+----+------+--------------+-------+
|  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|
+---+----+------+--------------+-------+



A VectorAssembler builds the **features** column from the input columns (hour, mobile, userFeatures) and returns the result as the **output** DataFrame.

In [17]:
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
#output.select("features", "clicked").show()
output.show()

Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+---+----+------+--------------+-------+--------------------+
| id|hour|mobile|  userFeatures|clicked|            features|
+---+----+------+--------------+-------+--------------------+
|  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|[18.0,1.0,0.0,10....|
+---+----+------+--------------+-------+--------------------+

