# Cover Type Prediction using ensembles

## Dataset Description
The dataset describes trees planted in the US and contains information about 500000 trees. Your aim is to build a Random Forest ensemble that predicts the cover type of the trees. To complete this assignment, follow these steps:
* Load the training data
* Transform categorical features into vector representations
* Split the dataset into training and validation parts
* Fit the Random Forest ensemble to the training set
* Compare the accuracy of the fitted model with that of the Logistic Regression model, which is about 0.67 on this dataset


If you have enough time, these further steps are worth exploring:
* Determine which features are valuable for your model (calculate the feature importances of your model).
* Try reducing the number of trees and observe the results.
* Understand why linear models perform poorly on this dataset.


## Loading data

Initialize the PySpark session.

In [3]:
from __future__ import division, print_function, unicode_literals # For the compatibility with Python 2

In [4]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder\
                            .enableHiveSupport()\
                            .appName("spark sql")\
                            .master("local[4]")\
                            .getOrCreate()

Load the training dataset located at /data/covertype2 with at least 60 partitions (use the `repartition` function for this). Use the `inferSchema` option, or pass an explicit schema as in the cells below, so that numerical features keep their numeric types.

In [5]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("Elevation", IntegerType(), False),
    StructField("Aspect", IntegerType(), False),
    StructField("Slope", IntegerType(), False),
    StructField("Horizontal_Distance_To_Hydrology", IntegerType(), False),
    StructField("Vertical_Distance_To_Hydrology", IntegerType(), False),
    StructField("Horizontal_Distance_To_Roadways", IntegerType(), False),
    StructField("Hillshade_9am", IntegerType(), False),
    StructField("Hillshade_Noon", IntegerType(), False),
    StructField("Hillshade_3pm", IntegerType(), False),
    StructField("Horizontal_Distance_To_Fire_Points", IntegerType(), False),
    StructField("Wild_Type", StringType(), False),
    StructField("Soil_Type", StringType(), False),
    StructField("Target", IntegerType(), False)
])

In [6]:
trees = spark_session.read.csv("/data/covertype2/train.csv", header=True, schema=schema).repartition(60)

## Transforming data

The dataset has two categorical features: 'Soil_Type' and 'Wild_Type'. They must be transformed into vector representations.

First, use StringIndexer to map the category values to numeric indexes.

In [7]:
from pyspark.ml.feature import StringIndexer

In [8]:
cat_cols = ['Soil_Type', 'Wild_Type']
cat_cols_index = {'Soil_Type': 'Soil_Index', 'Wild_Type': 'Wild_Index'}
cat_cols_encoder = {'Soil_Index': 'SoilEncoder', 'Wild_Index': 'WildEncoder'}

In [9]:
cat_indexers = [ StringIndexer(inputCol=col, outputCol=cat_cols_index[col])
                 for col in cat_cols ]
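
To make the indexing step concrete, here is a minimal pure-Python sketch of the mapping StringIndexer produces with its default ordering, where the most frequent label receives index 0. The function name `string_index` and the sample labels are hypothetical, and breaking frequency ties alphabetically is an assumption made for determinism; this illustrates the idea and is not Spark's implementation.

```python
from collections import Counter

def string_index(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption, chosen for determinism).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of a categorical column
sample = ["soil_a", "soil_b", "soil_a", "soil_c", "soil_a", "soil_b"]
mapping = string_index(sample)
indexed = [mapping[v] for v in sample]

print(mapping)
print(indexed)
```

The indexes are only identifiers, so the ordering carries no numeric meaning by itself, which is why the next step encodes them as vectors.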

Apply OneHotEncoder to the indexed columns to obtain vectors for Random Forest classification.

In [10]:
from pyspark.ml.feature import OneHotEncoder

In [11]:
encoders = [ OneHotEncoder(dropLast=False, inputCol=indexer.getOutputCol(),
                           outputCol=cat_cols_encoder[indexer.getOutputCol()])
             for indexer in cat_indexers ]
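
The effect of the `dropLast` option used above can be sketched in plain Python. The helper `one_hot` below is hypothetical and returns a dense list, whereas Spark stores the result as a sparse vector; it is meant only to show how the vector length changes.

```python
def one_hot(index, num_categories, drop_last=False):
    """Return a dense one-hot list for a category index.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 slots.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

kept = one_hot(2, 4)                     # one slot per category
dropped = one_hot(3, 4, drop_last=True)  # last category becomes all zeros

print(kept)
print(dropped)
```

Keeping every slot (`dropLast=False`) means each vector position corresponds to exactly one category, which can make feature importances easier to read later.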

Use VectorAssembler to combine all features into a single vector. Don't forget to include the features you have generated.

In [12]:
from pyspark.ml.feature import VectorAssembler

In [13]:
vector_assembler = VectorAssembler(inputCols=['SoilEncoder', # feature name of Soil type encoded
                                              'WildEncoder', # feature name of Wild type encoded
                                              'Elevation',
                                              'Aspect',
                                              'Slope',
                                              'Horizontal_Distance_To_Hydrology',
                                              'Vertical_Distance_To_Hydrology',
                                              'Horizontal_Distance_To_Roadways',
                                              'Hillshade_9am',
                                              'Hillshade_Noon',
                                              'Hillshade_3pm',
                                              'Horizontal_Distance_To_Fire_Points'
                                              ], outputCol='features')

## Training

Fit the Random Forest model to the training dataset. Don't forget to split the dataset into two parts so that you can check your trained model. About 100 trees with a depth of about 7 is a reasonable choice that avoids a long wait while the model is fitted. Try adjusting the 'subsamplingRate' and 'featureSubsetStrategy' options to get better results.
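
As a rough guide to what 'featureSubsetStrategy' controls, the sketch below estimates how many features each split would consider for a few of the accepted values. The helper `features_per_split` is hypothetical, the rounding is an assumption that may differ between Spark versions, and the total of 54 features (10 numeric plus assumed one-hot sizes of 40 and 4) is an illustration rather than a property confirmed for this dataset.

```python
import math

def features_per_split(strategy, num_features):
    """Approximate number of features considered at each tree split."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    if strategy.isdigit():
        # An integer string such as "12" asks for that many features
        return min(int(strategy), num_features)
    # Otherwise treat the string as a fraction such as "0.5"
    return int(math.ceil(float(strategy) * num_features))

num_features = 54  # assumed size of the assembled feature vector
for strategy in ["sqrt", "log2", "onethird", "12", "0.5", "all"]:
    print(strategy, features_per_split(strategy, num_features))
```

Smaller subsets tend to decorrelate the trees at the cost of weaker individual trees, so it is worth trying several values rather than relying on one.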

<b>Extra task.</b> Use cross-validation to check your model.

In [14]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
rf = RandomForestClassifier(labelCol='Target', featuresCol='features', numTrees=100, maxDepth=9)

In [15]:
# List concatenation also works on Python 2, where unpacking inside a list literal is a syntax error
pipeline = Pipeline(stages=cat_indexers + encoders + [vector_assembler, rf])

In [16]:
trainingData, testData = trees.randomSplit([0.8, 0.2], seed = 123)

In [17]:
model = pipeline.fit(trainingData)

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Grid of random forest options to search over (2 x 3 = 6 combinations)
paramGrid = ParamGridBuilder()\
        .addGrid(rf.subsamplingRate, [1.0, 0.9])\
        .addGrid(rf.featureSubsetStrategy, ["12", "11", "10"])\
        .build()

evaluator = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction", metricName="accuracy")

# 5-fold cross-validation over the whole pipeline
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

cvModel = crossval.fit(trees)
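
Before running the cross-validation above, it helps to estimate its cost. The short sketch below counts how many pipeline fits the grid and fold settings imply; the variable names are illustrative and simply mirror the values in the cell.

```python
from itertools import product

subsampling_rates = [1.0, 0.9]
feature_subset_strategies = ["12", "11", "10"]
num_folds = 5

# Every combination of the two option lists is one grid entry
grid = list(product(subsampling_rates, feature_subset_strategies))

# Each grid entry is fitted once per fold
fits_during_search = len(grid) * num_folds

# The best setting is then refitted once on the full dataset
total_fits = fits_during_search + 1

print(len(grid), fits_during_search, total_fits)
```

With 100 trees per forest this is a substantial amount of work, so a smaller grid or fewer folds may be a sensible first run. After fitting, `cvModel.avgMetrics` holds the mean metric for each grid entry and `cvModel.bestModel` is the refitted pipeline.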


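Get the feature importances of the trained model. Which 5 features are the most important in the dataset?

# The fitted pipeline is a PipelineModel; the importances live on its last stage, the random forest model
model.stages[-1].featureImportances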
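
The importances come back as a vector indexed by position in the assembled feature vector, so they need to be mapped back to names. The sketch below shows one way to do that under stated assumptions: the helpers `expand_feature_names` and `top_k` are hypothetical, and the column list, vector sizes and importance values are made-up illustration data, not results from this dataset.

```python
def expand_feature_names(columns, vector_sizes):
    """Expand assembler input columns into one name per vector slot."""
    names = []
    for col in columns:
        size = vector_sizes.get(col, 1)
        if size == 1:
            names.append(col)
        else:
            names.extend("{}[{}]".format(col, i) for i in range(size))
    return names

def top_k(names, importances, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical example: two one-hot columns followed by two numeric columns
columns = ["SoilEncoder", "WildEncoder", "Elevation", "Slope"]
vector_sizes = {"SoilEncoder": 3, "WildEncoder": 2}
importances = [0.07, 0.10, 0.00, 0.20, 0.05, 0.50, 0.08]

names = expand_feature_names(columns, vector_sizes)
best = top_k(names, importances, k=3)
print(best)
```

With the real model, the importance values could be taken from `model.stages[-1].featureImportances.toArray()`, and the column order must match the `inputCols` passed to the VectorAssembler.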

Apply the model to the validation part of the dataset and compute the accuracy score. Use the MulticlassClassificationEvaluator class from the ml.evaluation module.

In [19]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
predictions = model.transform(testData)
evaluator = MulticlassClassificationEvaluator(labelCol = "Target", predictionCol = "prediction", metricName = "accuracy")
accuracy = evaluator.evaluate(predictions)
print(accuracy)

0.7287621228157332
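
For reference, the accuracy metric is the fraction of rows whose prediction equals the label. The sketch below computes it on a small made-up sample and then compares the score printed above with the Logistic Regression figure of about 0.67 quoted in the task description. The function name and sample labels are hypothetical, and the comparison assumes both scores were measured on comparable data.

```python
def accuracy_score(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up labels and predictions, for illustration only
labels = [1, 2, 2, 3, 1, 7, 2, 5]
predictions = [1, 2, 1, 3, 1, 7, 2, 2]
toy_accuracy = accuracy_score(labels, predictions)

# Scores taken from the notebook text and output
baseline = 0.67
forest = 0.7287621228157332
gain = forest - baseline
error_reduction = ((1 - baseline) - (1 - forest)) / (1 - baseline)

print(toy_accuracy)
print(round(gain, 4), round(error_reduction, 3))
```

Looking at the reduction in error rate as well as the raw accuracy difference gives a slightly better sense of how much the ensemble improves on the linear baseline.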


Are your results better than the results from the Logistic Regression model?

## Performing test submission

Apply the models to the test dataset.

<b>Note!</b> The dataset will be changed during the test phase. The output of your last cell must be the accuracy score.

In [20]:
# Load dataset, transform dataset
test = spark_session.read.csv("/data/covertype2/test.csv", header=True, schema=schema).repartition(60)

In [21]:
# Apply the fitted pipeline to the test set (accuracy is computed in the next cell)
prediction = model.transform(test)

In [22]:
# This is the last cell. The evaluator result must be printed here
evaluator2 = MulticlassClassificationEvaluator(labelCol = "Target", predictionCol = "prediction", metricName = "accuracy")
accuracy = evaluator2.evaluate(prediction)
print(round(accuracy,2))

0.73
