# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Palaniappan Muthukkaruppan
## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Any code is allowed from the class text books or class provided code.__
- Please do not change the file names. The FAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and FAs will use __additional__ tests on your answers. Think about cases your code should handle beyond the tests you can see, even if it passes all of them.**
- Before submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).

In [2]:
# load these packages
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyspark
from pyspark.ml import Pipeline, classification, evaluation, feature, regression
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

The following cell determines whether the environment is Databricks, the grading environment, or a personal computer, and loads the CSV file from the corresponding location.

In [4]:
def get_training_dataframe(data_file_name):
    # environment variables that identify Databricks and the grading environment
    db_env = os.getenv("DATABRICKS_RUNTIME_VERSION")
    grading_env = os.getenv("GRADING_RUNTIME_ENV")

    if db_env is not None:
        # running on Databricks: read from the FileStore
        full_path_name = "/FileStore/tables/%s" % data_file_name
    elif grading_env is not None:
        # running in the grading environment
        full_path_name = "C:/Users/Will/Desktop/SU/datasets/%s" % data_file_name
    else:
        # personal computer: read relative to the working directory
        full_path_name = data_file_name

    return spark.read.csv(full_path_name, inferSchema=True, header=True)
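
The path selection above can be exercised without a Spark session. The following is a minimal sketch, not part of the assignment code: `resolve_data_path` is a hypothetical helper that takes the environment as a dictionary so that each branch can be checked in isolation, and the environment values are made up for illustration.

```python
def resolve_data_path(data_file_name, env):
    """Mirror the branch logic of get_training_dataframe (hypothetical helper).

    `env` is any mapping of environment variable names to values,
    for example os.environ or a plain dict in a test.
    """
    if env.get("DATABRICKS_RUNTIME_VERSION") is not None:
        return "/FileStore/tables/%s" % data_file_name
    if env.get("GRADING_RUNTIME_ENV") is not None:
        return "C:/Users/Will/Desktop/SU/datasets/%s" % data_file_name
    return data_file_name


# Example environments; the variable values are made up for illustration.
databricks_path = resolve_data_path(
    "titanic_original.csv", {"DATABRICKS_RUNTIME_VERSION": "7.3"}
)
local_path = resolve_data_path("titanic_original.csv", {})
print(databricks_path)
print(local_path)
```

Passing the environment in as an argument, rather than calling `os.getenv` inside the function, is only a convenience for testing; the notebook's own function reads the real environment.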

# Random Forest

This assignment gives you some practice using the Spark documentation to figure out how to do some common tasks on your own.  Note that there is a subjective component to the grading of this assignment.  Graders reserve the right to take points off for lack of effort or poorly presented explanations.

In these questions, we will examine the famous Titanic dataset.

[Column Descriptions](https://data.world/nrippner/titanic-disaster-dataset): <br>
survival - Survival (0 = No; 1 = Yes) <br>
class - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) <br>
name - Name <br>
sex - Sex <br>
age - Age <br>
sibsp - Number of Siblings/Spouses Aboard <br>
parch - Number of Parents/Children Aboard <br>
ticket - Ticket Number <br>
fare - Passenger Fare <br>
cabin - Cabin <br>
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) <br>
boat - Lifeboat (if survived) <br>
body - Body number (if did not survive and body was recovered) <br>

In [9]:
# read-only
titanic_df = get_training_dataframe("titanic_original.csv")

drop_cols = ['boat', 'body']
titanic_df = titanic_df.\
    drop(*drop_cols).\
    fillna('O').\
    dropna(subset=['pclass', 'age', 'sibsp', 'parch', 'fare', 'survived']).\
    select((fn.col('sex') == 'male').alias('is_male').cast('float'),           
           'pclass',
           'age',
           'sibsp',
           'parch',
           'fare',
           'survived')
training_df, testing_df = titanic_df.randomSplit([0.7, 0.3], seed=0)
titanic_df.printSchema()

In [10]:
titanic_df.show(10)

# Question 1: (10 pts)
Create a spark RandomForestClassifier using all default parameters.  Train the model and calculate the AUC using a BinaryClassificationEvaluator.

In [12]:
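# Your code here
# assemble the six predictor columns into a single feature vector
rf_assembler = VectorAssembler(inputCols=training_df.columns[0:6], outputCol="features")
# random forest with default hyperparameters
rf = classification.RandomForestClassifier(featuresCol="features", labelCol="survived")
rf_pipeline = Pipeline(stages=[rf_assembler, rf]).fit(training_df)
# the evaluator's default metric is area under the ROC curve
bce = BinaryClassificationEvaluator(labelCol='survived')
# note: despite its name, default_rf_model holds the test AUC, not a model
default_rf_model = bce.evaluate(rf_pipeline.transform(testing_df))
print(default_rf_model)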
#raise NotImplementedError()
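
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. For intuition, that area equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on a few made-up labels and scores; `pairwise_auc` is a hypothetical helper, and Spark derives the metric from a ROC curve instead, so small numerical differences from this definition are possible.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 6 positive/negative pairs are ordered correctly -> 0.5
```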

# Question 2: (10 pts)
Use spark RandomForestClassifier, ParamGridBuilder, and CrossValidator objects to perform a random forest grid search.  Use 3 fold cross validation and a BinaryClassificationEvaluator to evaluate the results.  The goal is to see if you can improve upon the AUC score produced by the default random forest model above.  Obvious things that you might consider varying in the grid search include the number of randomly selected columns for each split point, the number of trees in the forest, and the impurity measurement (gini / entropy).  You are free to choose any hyper parameters you want in your grid search.

In [14]:
# Your grid search code here
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [20, 24])
             .addGrid(rf.maxDepth, [6, 10])
             .addGrid(rf.impurity, ['gini', 'entropy'])
             .build())

cv_rf = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=bce, numFolds=3)
# Run cross validation
cv_rf_Model = Pipeline(stages=[rf_assembler, cv_rf]).fit(training_df)
#raise NotImplementedError()
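
Grid search cost grows multiplicatively with each parameter added to the grid. The sketch below, in plain Python rather than Spark, enumerates the same combinations as the grid above and counts the model fits implied by 3-fold cross validation, assuming one fit per combination per fold plus one final refit of the best combination on the full training data.

```python
from itertools import product

# Same values as the ParamGridBuilder grid above.
grid = {
    "numTrees": [20, 24],
    "maxDepth": [6, 10],
    "impurity": ["gini", "entropy"],
}

# Every combination of one value per parameter.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

num_folds = 3
# One fit per combination per fold, plus a final refit of the winner.
total_fits = len(param_maps) * num_folds + 1

print(len(param_maps))  # 2 * 2 * 2 = 8 combinations
print(total_fits)       # 8 * 3 + 1 = 25 fits
```

Adding a fourth parameter with three candidate values would triple both numbers, which is worth keeping in mind before widening the grid.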

In [15]:
print(bce.evaluate(cv_rf_Model.transform(testing_df)))

In [16]:
rf_prediction = cv_rf_Model.transform(testing_df)
rf_prediction.show()

# Question 3 (10 pts)
Print the AUC and hyper parameters of the best random forest model in the code cell below.  Describe in words in the markdown cell below what specific parameters you used in your grid and what those parameters do in the model.  For example, if you specified impurity in your grid, describe what impurity does in the random forest.  Keep the descriptions brief and at a high level, I'm just trying to see if you understand the high level concept of what the tuning parameter does.

In [18]:
# Your AUC code here
best_model_rf = cv_rf_Model.stages[-1].bestModel
best_model_rf.extractParamMap()
#raise NotImplementedError()

Write your grid search parameter descriptions here:

```
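numTrees (20): The number of trees in the forest. Increasing the number of trees decreases the variance in predictions, improving the model’s test-time accuracy, at the cost of longer training.
maxDepth (6): The maximum depth of each tree. Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
impurity (entropy): The criterion used to choose splits. Node impurity measures how mixed the labels at a node are (it is 0 when all labels are the same), and each split is chosen to reduce it as much as possible. For classification the options are gini and entropy.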
```
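
To make the impurity description concrete, here is a small sketch of the two measures for a list of class labels. The function names are illustrative, and the entropy is computed in bits (base-2 logarithm), which is an assumption of this sketch rather than something stated in the assignment.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


mixed_node = [0, 0, 1, 1]  # evenly mixed labels: maximum impurity for two classes
pure_node = [1, 1, 1]      # a single class: zero impurity

print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0
print(gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why they often lead to similar splits.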

# Question 4: 10 pts
Create a pandas dataframe `feature_importance` with the columns `feature` and `importance` which contains the names of the features (`is_male`, `pclass`, etc.) and their feature importance as determined by the random forest model. Sort the dataframe by `importance` in descending order. In the markdown cell, add comments on the importance that random forest has given to each feature. Are they reasonable? Do they tell you anything valuable about the titanic dataset?

In [22]:
# Your code here
importances = rf_pipeline.stages[-1].featureImportances.toArray()
features = training_df.columns[0:6]

feature_importance = pd.DataFrame(list(zip(features,importances)),
                                 columns = ['feature', 'importance']).sort_values('importance',ascending = False)
feature_importance
#raise NotImplementedError()

| | feature | importance |
|---|---|---|
| 0 | is_male | 0.500845 |
| 1 | pclass | 0.154806 |
| 5 | fare | 0.133774 |
| 2 | age | 0.128699 |
| 3 | sibsp | 0.055356 |
| 4 | parch | 0.026521 |


### Your feature importance comments here.
According to the default random forest model, the most important feature is 'is_male', so sex was the strongest predictor of survival. One possible explanation is that men, being physically stronger on average, had a better chance of surviving, although importance alone does not show the direction of the effect.

The second most important feature is 'pclass', which makes sense to me because the higher the passenger class, the better the services provided, including safety measures and lifeboats.

The third most important feature is 'fare', which also makes sense: passengers who paid more for a better class may have had first priority in getting off the ship.

Age could also be a factor, since it is easier to save your life when you are young, and a younger body withstands shock better than an older one.

The low importance of the last two features, 'sibsp' and 'parch', does not make much sense to me, because logically first preference would be given to children, followed by the elderly and women, and then men.

# Question 5:  10 pts.
Print any of the trees in the forest from the final model.  Copy the printed text to the tree printout markdown cell below.  Add comments to the markdown cell below describing how the root node is split:  what variable is being split and what value determines the left / right split.  It is important to copy the printed tree output to the tree printout markdown cell because the trees are grown with random parameters and the graders will get a totally different tree when we run your code.

In [25]:
# your code here
print(rf_pipeline.stages[-1].trees[3].toDebugString)
#raise NotImplementedError()

## Paste the tree printout in this cell:

DecisionTreeClassificationModel (uid=dtc_ab3151614376) of depth 5 with 51 nodes

```
  If (feature 0 <= 0.5)
   If (feature 2 <= 49.5)
    If (feature 1 <= 2.5)
     Predict: 1.0
    Else (feature 1 > 2.5)
     If (feature 2 <= 26.25)
      Predict: 1.0
     Else (feature 2 > 26.25)
      Predict: 0.0
   Else (feature 2 > 49.5)
    If (feature 1 <= 1.5)
     Predict: 1.0
    Else (feature 1 > 1.5)
     If (feature 5 <= 24.075)
      Predict: 1.0
     Else (feature 5 > 24.075)
      If (feature 2 <= 54.5)
       Predict: 1.0
      Else (feature 2 > 54.5)
       Predict: 0.0
  Else (feature 0 > 0.5)
   If (feature 5 <= 27.825)
    If (feature 5 <= 18.375)
     Predict: 0.0
    Else (feature 5 > 18.375)
     If (feature 2 <= 2.5)
      Predict: 1.0
     Else (feature 2 > 2.5)
      Predict: 0.0
   Else (feature 5 > 27.825)
    If (feature 3 <= 2.5)
     If (feature 2 <= 17.5)
      If (feature 1 <= 2.5)
       Predict: 1.0
      Else (feature 1 > 2.5)
       Predict: 0.0
     Else (feature 2 > 17.5)
      If (feature 5 <= 30.5979)
       Predict: 1.0
      Else (feature 5 > 30.5979)
       Predict: 0.0
    Else (feature 3 > 2.5)
     Predict: 0.0
```

#### Comment on the tree top level split in this cell:

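Variable being split: feature 0, which is `is_male` (features are numbered from 0, in the order they were passed to the VectorAssembler).

Value determining the left / right split: 0.5. Rows go left if feature 0 <= 0.5 and right if feature 0 > 0.5.

In the example tree, the root node compares each row's `is_male` value against 0.5. If `is_male` <= 0.5 (a female passenger), the row follows the `If` branch; otherwise (`is_male` > 0.5, a male passenger) it follows the `Else` branch.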
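
The `feature N` indices in the printout are zero-based positions in the VectorAssembler's `inputCols`. The sketch below maps the root split back to a column name, assuming the column order used in this notebook (`training_df.columns[0:6]`). `describe_split` is a hypothetical helper and only handles continuous splits of the form `feature N <= value`.

```python
import re

# Column order passed to the VectorAssembler in this notebook.
feature_names = ["is_male", "pclass", "age", "sibsp", "parch", "fare"]


def describe_split(debug_line, names):
    """Turn a line such as 'If (feature 0 <= 0.5)' into (column name, threshold)."""
    match = re.search(r"feature (\d+) <= ([0-9.]+)", debug_line)
    if match is None:
        raise ValueError("not a continuous split line: %r" % debug_line)
    return names[int(match.group(1))], float(match.group(2))


# First line of the tree printout above.
root_column, root_threshold = describe_split("If (feature 0 <= 0.5)", feature_names)
print(root_column, root_threshold)  # is_male 0.5
```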

# Question 6:  10 pts.
Create a spark GBTClassifier using all default parameters.  Train the model and calculate the AUC using a BinaryClassificationEvaluator.

In [29]:
# Your Code Here
gbt_assembler = VectorAssembler(inputCols=training_df.columns[0:6], outputCol="features")
gbt = GBTClassifier(labelCol="survived", featuresCol="features", seed = 1234)
gbt_pipeline = Pipeline(stages=[gbt_assembler, gbt]).fit(training_df)

bce = BinaryClassificationEvaluator(labelCol='survived')
default_gbt_model = bce.evaluate(gbt_pipeline.transform(testing_df))
print(default_gbt_model)
#raise NotImplementedError()

# Question 7:  10 pts.
Use spark GBTClassifier, ParamGridBuilder, and CrossValidator objects to perform a GBT grid search.  Use 3 fold cross validation and a BinaryClassificationEvaluator to evaluate the results.  The goal is to see if you can improve upon the AUC score produced by the default GBT model above.  Obvious choices for the grid search include maximum tree depth, maxIter, and stepSize.  Note that maxIter and stepSize are not very well explained in the documentation.  maxIter indicates the number of trees to grow in the series and stepSize is the multiplier for each tree.  If stepSize is reduced then maxIter probably needs to increase to create an equivalent number of trees. You are free to experiment with any search parameters you wish.  It is also okay to experiment with maxIter and stepSize outside of the grid if you wish as long as you comment what you are doing.

In [31]:
# Your Code Here
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 30])
             .addGrid(gbt.maxIter, [10, 15])
             .addGrid(gbt.stepSize, [0.4, 0.6, 0.8])
             .build())

cv_gbt = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=bce, numFolds=3)
# Run cross validation
cv_gbt_Model = Pipeline(stages=[gbt_assembler, cv_gbt]).fit(training_df)
#raise NotImplementedError()

In [32]:
gbt_prediction = cv_gbt_Model.transform(testing_df)
gbt_prediction.show()

# Question 8: 10 pts
Print the AUC and hyper parameters of the best GBT model.  Add comments to the markdown cell which indicate the specific search parameters you used and how they relate to the GBT training process.

In [34]:
print(bce.evaluate(cv_gbt_Model.transform(testing_df)))

In [35]:
# Your Code Here
best_model_gbt = cv_gbt_Model.stages[-1].bestModel
best_model_gbt.extractParamMap()
#raise NotImplementedError()

Add search parameter comments here

```
maxDepth (2): The maximum depth of each tree. Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
stepSize (0.4): The learning rate, i.e., the multiplier applied to each tree's contribution. It should be chosen so that training converges to a minimum of the loss without taking too many iterations; smaller values usually need a larger maxIter.
maxBins (30): The number of bins used to discretize continuous features. Increasing maxBins allows the algorithm to consider more split candidates and make finer-grained split decisions.
maxIter (15): The number of boosting iterations. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy.
```
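
The trade-off between stepSize and maxIter described in Question 7 can be illustrated with a toy model. This is a simplified sketch, not Spark's GBT implementation: it assumes a single target value and that each "tree" predicts the current residual exactly, so the only thing slowing convergence is the step size.

```python
def iterations_to_converge(step_size, target=1.0, tolerance=0.01, max_iter=1000):
    """Count boosting rounds until the prediction is within `tolerance` of `target`.

    Each round adds step_size * residual, mimicking a shrunken tree update.
    """
    prediction = 0.0
    for iteration in range(1, max_iter + 1):
        residual = target - prediction
        prediction += step_size * residual
        if abs(target - prediction) <= tolerance:
            return iteration
    return max_iter


for step in (0.8, 0.4, 0.1):
    print(step, iterations_to_converge(step))
```

In this toy setting the remaining error after `m` rounds is `(1 - step_size) ** m`, so halving the step size from 0.8 to 0.4 raises the number of rounds needed from 3 to 10. Real boosted trees do not fit residuals perfectly, so the numbers differ, but the direction of the effect is the same.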

# Question 9: 10 pts
Create a pandas dataframe `feature_importance` with the columns `feature` and `importance` which contains the names of the features (`is_male`, `pclass`, etc.) and their feature importance as determined by the GBT algorithm. Sort the dataframe by `importance` in descending order.  Add comments about the feature importances similar to what you did for random forest above.  Also, compare the feature importances to random forest - how different or similar are the feature importances between GBT and random forest.

In [38]:
# Your code here
importances = gbt_pipeline.stages[-1].featureImportances.toArray()
features = training_df.columns[0:6]

feature_importance = pd.DataFrame(list(zip(features,importances)),
                                 columns = ['feature', 'importance']).sort_values('importance',ascending = False)
feature_importance
#raise NotImplementedError()

| | feature | importance |
|---|---|---|
| 2 | age | 0.358323 |
| 5 | fare | 0.320546 |
| 0 | is_male | 0.093107 |
| 3 | sibsp | 0.089562 |
| 4 | parch | 0.070254 |
| 1 | pclass | 0.068207 |


### Add feature importance comments here:

The GBT model gives the most importance to 'age', which makes sense to me because a younger body withstands shock better than an older one.

The second most important feature, 'fare', also makes sense: first-class passengers who paid higher fares may have been given priority over economy-fare passengers, or may even have had more lifeboats reserved for them.

The third most important feature is 'is_male', which shows that gender affects the probability of surviving.

The model also suggests that the number of siblings/spouses and parents/children had little significance for survival, just like in the random forest model.

Lastly, the GBT model ranks 'pclass' as the least important feature. Passenger class still matters based on the factors discussed above, but 'fare' is closely related to class and may already capture most of that information.

Compared to the random forest, the rankings are quite different: the random forest put 'is_male' far ahead of every other feature (about 0.50) and 'pclass' second, while GBT concentrates most of the importance on 'age' and 'fare'.
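
The comparison between the two models can be made explicit by ranking the features side by side. The sketch below copies the importance values from the two tables printed above into plain dictionaries; `ranks` is a hypothetical helper introduced only for this illustration.

```python
# Importance values copied from the random forest and GBT tables above.
rf_importance = {"is_male": 0.500845, "pclass": 0.154806, "fare": 0.133774,
                 "age": 0.128699, "sibsp": 0.055356, "parch": 0.026521}
gbt_importance = {"age": 0.358323, "fare": 0.320546, "is_male": 0.093107,
                  "sibsp": 0.089562, "parch": 0.070254, "pclass": 0.068207}


def ranks(importance):
    """Map each feature name to its rank (1 = most important)."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


rf_rank = ranks(rf_importance)
gbt_rank = ranks(gbt_importance)

# List features in random forest order with both ranks.
for name in sorted(rf_importance, key=rf_rank.get):
    print("%-8s RF rank %d, GBT rank %d" % (name, rf_rank[name], gbt_rank[name]))
```

The largest shifts are 'pclass' (second in the random forest, last in GBT) and 'age' (fourth in the random forest, first in GBT), while 'sibsp' and 'parch' stay near the bottom in both.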

# Question 10: 10 pts
Calculate the AUC of the best random forest and GBT models using the held out test data.  Which model produces the better AUC?

In [41]:
# Your code here
print("Best Model on Testing DF with RF: ",bce.evaluate(cv_rf_Model.transform(testing_df)))
print("Best Model on Testing DF with GBT: ",bce.evaluate(cv_gbt_Model.transform(testing_df)))
#raise NotImplementedError()

### Which model produced the best results?
For me, the best random forest model gave the higher AUC on the test data, as shown by the output above.