### Useful links
#### MLLib docs:
https://spark.apache.org/docs/2.3.2/ml-guide.html

#### Check that a SparkSession is active

In [None]:
spark

#### Most important facts about Spark MLlib

* It works only with numeric data.
* Models take the training data and the labels as two columns. The label column is an ordinary numeric column, but the training data must be a single column holding a vector of the variables you want to use. Below you will learn how to build such vectors.
* The API is very similar to scikit-learn, with .fit and .transform methods.
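
To make the .fit / .transform pattern concrete, here is a minimal plain-Python sketch (not Spark code). The `MeanCenterer` and `MeanCentererModel` classes are hypothetical and exist only for illustration: an estimator learns something from the data in fit and returns a model, and the model applies it in transform.

```python
class MeanCentererModel:
    """Toy fitted model: remembers the mean learned during fit()."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Apply the learned parameter to any data, seen or unseen.
        return [v - self.mean for v in values]


class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a model."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))


model = MeanCenterer().fit([1.0, 2.0, 3.0])   # learned mean is 2.0
centered = model.transform([2.0, 4.0])
print(centered)
```

Stateless transformers such as VectorAssembler or Binarizer skip the fit step and expose transform directly.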

#### For a sample run we will load the original data. Once you have seen how it works, please load the data prepared in the previous step and build a model on it

In [None]:
taxi = spark.sql("""SELECT taxi_id,
                        trip_start_timestamp,
                        trip_end_timestamp,
                        trip_seconds,
                        trip_miles,
                        pickup_census_tract,
                        dropoff_census_tract,
                        pickup_community_area,
                        dropoff_community_area,
                        fare,
                        tips,
                        tolls,
                        extras,
                        trip_total,
                        company,
                        IF(payment_type='Credit Card',1,0) target
                    FROM
                        tomek.taxi_cleaned
                    WHERE
                        yyyymm BETWEEN 201601 AND 201612
                    """)

In [None]:
# keep only the columns we need and show the first 4 rows for readability
taxi = taxi.select('target','company','fare','trip_seconds','pickup_community_area','dropoff_community_area')
taxi.show(4)

#### Basic statistics in MLLib

Check that all numeric features are stored as numeric types, because Spark ML accepts only numeric data

Hint: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.printSchema

In [None]:
taxi.printSchema()

Looks like there is nothing to convert

#### Get rid of null values

As you can see in the output of the initial show statement, there are some null values in our data. We have to get rid of them, because some Spark methods do not accept null values.

In [None]:
from pyspark.sql.functions import col

In [None]:
taxi.filter(col("fare").isNull()).count()

In [None]:
taxi.filter(col("pickup_community_area").isNull()).count()

In [None]:
taxi.filter(col("dropoff_community_area").isNull()).count()

In [None]:
taxi.count()

In [None]:
taxi.printSchema()

In [None]:
taxi = taxi.na.fill(99999999)
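
A plain-Python sketch of what filling with a numeric value does, assuming (as in Spark's na.fill) that only columns whose type matches the fill value are touched, so string columns keep their nulls. The `fill_numeric_nulls` helper and the sample row are made up for illustration.

```python
def fill_numeric_nulls(row, numeric_cols, value):
    """Replace None with `value`, but only in the columns listed as numeric."""
    return {
        name: (value if current is None and name in numeric_cols else current)
        for name, current in row.items()
    }


row = {"fare": None, "trip_seconds": 420, "company": None}
filled = fill_numeric_nulls(row, {"fare", "trip_seconds"}, 99999999)
print(filled)
```

Keep in mind that a sentinel such as 99999999 becomes an ordinary number for the model, so it will also influence statistics like the average.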

In [None]:
taxi.filter(col("pickup_community_area").isNull()).count()

In [None]:
taxi.filter(col("dropoff_community_area").isNull()).count()

#### Transformation of columns

##### Vector assembler

VectorAssembler builds a column of vectors from the variables you select. Most MLlib algorithms accept only this kind of input.
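
As a rough mental model (plain Python, not Spark code), the assembler concatenates the chosen columns of each row into one numeric vector. The `assemble_row` helper and the sample values are hypothetical; the real transformer returns a Spark vector column.

```python
def assemble_row(row, input_cols):
    """Concatenate the selected columns of one row into a flat list of floats."""
    return [float(row[name]) for name in input_cols]


row = {
    "fare": 12.5,
    "trip_seconds": 600,
    "pickup_community_area": 8,
    "dropoff_community_area": 32,
}
features = assemble_row(
    row,
    ["fare", "trip_seconds", "pickup_community_area", "dropoff_community_area"],
)
print(features)
```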

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
assembler = VectorAssembler(
    inputCols=['fare','trip_seconds','pickup_community_area','dropoff_community_area'],
    outputCol="numeric_features")

In [None]:
assembler\
    .transform(taxi)\
    .select('fare',
            'trip_seconds',
            'pickup_community_area',
            'dropoff_community_area',
            'numeric_features')\
    .show(10)

Remember that transform returns a new DataFrame; to keep the result you have to assign it back to the original df

In [None]:
taxi = assembler.transform(taxi)

##### Binarizer

Binarization is the process of thresholding numerical features to binary (0/1) features.
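
The rule itself is simple: values strictly greater than the threshold become 1.0, and values equal to or below it become 0.0. A plain-Python sketch with a hypothetical `binarize` helper and made-up trip times:

```python
def binarize(value, threshold):
    """Return 1.0 when value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


trip_seconds = [300.0, 780.0, 1200.0]
threshold = 780.0  # a value equal to the threshold maps to 0.0
flags = [binarize(v, threshold) for v in trip_seconds]
print(flags)
```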

In [None]:
from pyspark.ml.feature import Binarizer

In [None]:
from pyspark.sql.functions import avg

Let's add column indicating if trip time is higher or lower than average

In [None]:
# let's calculate average trip time
taxi.select(avg(col("trip_seconds"))).show()

In [None]:
# and now extract the average as a plain Python number
taxi.select(avg(col("trip_seconds"))).collect()[0][0]

In [None]:
taxi = taxi.withColumn("trip_seconds_double",col("trip_seconds").cast("double"))

In [None]:
binarizer = Binarizer(threshold=taxi.select(avg(col("trip_seconds"))).collect()[0][0],
                      inputCol="trip_seconds_double",
                      outputCol="binarized_features")

In [None]:
binarizer.transform(taxi).select("trip_seconds","binarized_features").show(10)

And again save results!

In [None]:
taxi = binarizer.transform(taxi)

You can also binarize a whole vector at once. For this case we will use the features vector created in the previous step.

In [None]:
binarizer = Binarizer(threshold=0.5, inputCol="numeric_features", outputCol="binarized_features_2")

In [None]:
binarizer.transform(taxi).select("numeric_features","binarized_features_2").show(10)

This time we will not save the result, because binarizing fares, trip times and area codes against a single threshold of 0.5 does not make any sense

##### StringIndexer

MLlib accepts only numeric variables as input, so we have to convert any categorical columns to numeric ones. We will use StringIndexer for this.

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. The unseen labels will be put at index numLabels if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
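
The frequency ordering can be sketched in plain Python. The `fit_string_index` helper and the labels are made up for illustration, and breaking ties alphabetically is an assumption of this sketch rather than something the text above guarantees.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


labels = ["beta", "alpha", "beta", "gamma", "beta", "alpha"]
index = fit_string_index(labels)
print(index)
```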

In [None]:
from pyspark.ml.feature import StringIndexer

Check the schema again for string columns

In [None]:
taxi.printSchema()

In [None]:
taxi.select('company').show(4)

Check whether we have empty strings. If so, we have to replace them with some placeholder value, as it is needed for further usage. Some string methods in Spark do not accept empty strings!

In [None]:
taxi.groupBy("company").count().sort(col("count").desc()).show()

The empty string is the most frequent value. Let's replace it with a placeholder.

In [None]:
from pyspark.sql.functions import when

In [None]:
taxi = taxi.withColumn("company",when(col("company")=="","empty").otherwise(col("company")))

In [None]:
taxi.groupBy("company").count().sort(col("count").desc()).show()

In [None]:
string_indexer = StringIndexer(inputCol="company",outputCol="company_indexed")
taxi = string_indexer.fit(taxi).transform(taxi)

Index 0 is assigned to the most frequent value, index 1 to the next one, and so on.

Additionally, there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
* throw an exception (which is the default)
* skip the row containing the unseen label entirely
* put unseen labels in a special additional bucket, at index numLabels
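
A plain-Python sketch of the three strategies, using the names of the handleInvalid options ('error', 'skip', 'keep'). The `transform_labels` helper and the sample labels are hypothetical.

```python
def transform_labels(values, index, handle_invalid="error"):
    """Apply a fitted label index, treating unseen labels per handle_invalid."""
    result = []
    for value in values:
        if value in index:
            result.append(index[value])
        elif handle_invalid == "error":
            raise ValueError("Unseen label: %s" % value)
        elif handle_invalid == "skip":
            continue  # drop the whole row
        elif handle_invalid == "keep":
            result.append(float(len(index)))  # extra bucket at numLabels
        else:
            raise ValueError("Unknown strategy: %s" % handle_invalid)
    return result


fitted_index = {"beta": 0.0, "alpha": 1.0}
new_values = ["alpha", "delta", "beta"]  # "delta" was not seen during fit

skipped = transform_labels(new_values, fitted_index, "skip")
kept = transform_labels(new_values, fitted_index, "keep")
print(skipped, kept)
```

With the default 'error' strategy the same call raises an exception, which is usually what you want while developing, so that data problems do not pass silently.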

##### OneHotEncoder

Some algorithms, such as logistic regression, expect continuous features. We have to encode our categorical features.

One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using StringIndexer first.
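
The phrase "at most a single one-value" comes from the dropLast option, which is on by default: the last category is encoded as an all-zero vector, so the vector has one element fewer than there are categories. A plain-Python sketch with a hypothetical `one_hot` helper (Spark returns a sparse vector; a dense list is used here for readability):

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 vector, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


first = one_hot(0, 3)                         # most frequent category
last = one_hot(2, 3)                          # last category: all zeros
last_full = one_hot(2, 3, drop_last=False)    # keep every slot
print(first, last, last_full)
```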

In [None]:
from pyspark.ml.feature import OneHotEncoderEstimator

In [None]:
# Modify this to transform the column(s) you have used. It can transform multiple columns; just put them in the lists.
# Note: in Spark 3.0+ this class is named OneHotEncoder.
encoder = OneHotEncoderEstimator(inputCols=["company_indexed"],
                                 outputCols=["company_indexed_vec"],
                                 handleInvalid='keep')

In [None]:
taxi = encoder.fit(taxi).transform(taxi)

In [None]:
taxi.select("company_indexed","company_indexed_vec").show(10)

#### Let's prepare label now

In [None]:
taxi = taxi \
    .withColumn("target",taxi.target.cast("double"))

### Now we will build pipeline

#### The first part of it will be ChiSqSelector, which will allow us to reduce the number of categorical variables

In [None]:
taxi.printSchema()

In [None]:
from pyspark.ml.feature import ChiSqSelector

In [None]:
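# selectorType="fpr" is needed for the fpr threshold to take effect (the default is "numTopFeatures")
css = ChiSqSelector(selectorType="fpr", fpr=0.05, featuresCol="company_indexed_vec",
                    outputCol="selectedFeatures", labelCol="target")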
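
The selector scores each feature with a chi-squared test of independence against the label. As a rough illustration in plain Python (not Spark code), here is the test for one binary feature against a binary label. The counts are made up, the `chi_square_2x2` helper is hypothetical, and the p-value formula used here is valid only for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# rows: feature = 1 / feature = 0, columns: target = 1 / target = 0 (made-up counts)
table = [[30, 10], [20, 40]]
statistic, p_value = chi_square_2x2(table)
keep_feature = p_value < 0.05  # same idea as an fpr threshold of 0.05
print(statistic, p_value, keep_feature)
```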



#### Now we will gather all features, as MLlib accepts only a single column of feature vectors

In [None]:
feature_assembler = VectorAssembler(inputCols=["numeric_features",
                                               "binarized_features",
                                               "selectedFeatures"],
                                    outputCol="training_features")

#### And model now!

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
lr = LogisticRegression(labelCol="target", featuresCol="training_features")

#### And finally pipeline!

In [None]:
from pyspark.ml import Pipeline

In [None]:
pipeline = Pipeline(stages=[css,feature_assembler,lr])

#### Let's train our pipeline now

In [None]:
(trainingData, testData) = taxi.randomSplit([0.7, 0.3],seed=1254129345)

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [None]:
paramGrid = ParamGridBuilder() \
    .addGrid(css.fpr, [0.05, 0.1]) \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .build()
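
The grid is the Cartesian product of the listed values, so two values for each of two parameters give four combinations. A plain-Python sketch of that expansion and of the resulting training cost, assuming 2 folds as used in this notebook:

```python
from itertools import product

grid = {"fpr": [0.05, 0.1], "regParam": [0.0, 0.1]}

names = sorted(grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

num_folds = 2
# Every combination is trained once per fold; the best one is then
# refitted on the full training set, which adds one more fit.
fits_during_search = len(combinations) * num_folds
print(len(combinations), fits_during_search)
```

This is why adding values to the grid or raising numFolds quickly increases the run time.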

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="target", predictionCol="prediction", metricName="accuracy")

In [None]:
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)  # use 3+ folds in practice

In [None]:
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainingData)

In [None]:
predictions = cvModel.transform(testData)

In [None]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
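
For reference, accuracy is the share of rows where the prediction equals the label, and the test error printed above is one minus that share. A plain-Python sketch with made-up labels and predictions (the `accuracy_score` helper is hypothetical):

```python
def accuracy_score(labels, predictions):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


labels = [1.0, 0.0, 1.0, 1.0]
predicted = [1.0, 0.0, 0.0, 1.0]
acc = accuracy_score(labels, predicted)
test_error = 1.0 - acc
print("Test Error = %g" % test_error)
```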

### Now I would like you to experiment with your data and other models.