# DATA SCIENTIST NANODEGREE - SPARKIFY CAPSTONE PROJECT
## Part 2 - Modelling

<img src="images/udacity-logo.png" alt="udacity logo" width="800"/>

## Introduction

In Part 1 of this project, we analysed the user event log dataset of the fictitious Sparkify service. The outcome was a train and a test dataset containing engineered features, which can be used to train Machine Learning models.

<img src='images/sparkify.png' alt='sparkify logo'/>

In Part 2, we will do just that. Using the train and test data, we will fit ML algorithms supported by PySpark and use the resulting model to predict which users are likely to churn in the near future.

To train a supervised Machine Learning algorithm, the data must have one row per individual object (a user in this case), with the features as columns and an additional label column. Let's investigate whether the data in hand conforms to this structure.

## Structure of this Notebook
- Importing and inspecting the train and test datasets
- Applying the preprocessing required for numerical and categoric features and creating a pipeline using the preprocessing steps
- Model selection
- Conclusion

In [4]:
# importing necessary libraries
from pyspark.sql import SparkSession
from utility_functions import *
import pandas as pd
import seaborn as sns
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer,MinMaxScaler, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import OneHotEncoder
import numpy as np

pd.set_option('max_colwidth', None)
pd.set_option("display.max_rows", None)

## Importing and inspecting the train and test datasets

In [2]:
# importing train and test dataset
train_file_path = "train_data.parquet"
test_file_path = "test_data.parquet"

train_data = import_data_into_dataframe(train_file_path, 'parquet')
test_data = import_data_into_dataframe(test_file_path, 'parquet')

# caching the datasets
train_data.persist()
test_data.persist()

DataFrame[userId: string, churned: double, avg_num_of_add_to_playlist_per_session: double, avg_num_of_addfriends_per_session: double, avg_num_of_adverts_per_session: double, avg_num_of_artists_per_session: double, avg_num_of_songs_per_session: double, avg_num_of_thumbs_down_per_session: double, avg_num_of_thumbs_up_per_session: double, avg_num_of_times_settings_changed_per_session: double, average_number_of_visits_to_the_about_page_per_session: double, average_number_of_visits_to_the_help_page_per_session: double, avg_num_of_visits_to_home_per_session: double, avg_num_of_visits_to_the_settings_page_per_session: double, avg_num_of_visits_to_upgrade_page: double, avg_number_of_errors_per_session: double, avg_number_of_visits_to_downgrade_page: double, num_times_user_changed_levels: bigint, num_of_downgrades_submitted: bigint, num_of_upgrades_submitted: bigint, gender: string]

In [3]:
# using a utility function to summarize the data sets
print_data_summary(train_data, test_data)

There are 21 columns and 357 rows in the train data set
The train data set has information about 357 users
Out of which 276 are non churners and 81 are churners
There are 21 columns and 91 rows in the test data set
The test data set has information about 91 users
Out of which 73 are non churners and 18 are churners


From the summary, we can see that this is an <b>imbalanced dataset</b>. There are about three times as many non-churners as churners. This is a problem for ML algorithms because they have little information about churners to learn from. As a result, the model can end up performing badly.

We also have to be careful about which metric we use to measure performance. Accuracy is misleading here: a model that labels every user as a non-churner would be right for roughly 77% of the training users (276 of 357) without identifying a single churner. We will therefore use the <b>AUC-ROC score</b> as our metric.
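The point about accuracy can be checked with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: it uses the class counts from the train-set summary and a hypothetical helper, `auc_roc`, that computes AUC-ROC as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner (ties count as one half).

```python
# Class counts taken from the train-set summary above.
n_non_churners, n_churners = 276, 81
labels = [0] * n_non_churners + [1] * n_churners

# A trivial "model" that gives every user the same churn score
# and therefore predicts "non-churner" for everyone.
scores = [0.0] * len(labels)
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(f"accuracy: {accuracy:.3f}")                     # accuracy: 0.773
print(f"AUC-ROC:  {auc_roc(labels, scores):.3f}")      # AUC-ROC:  0.500
```

The trivial model looks respectable on accuracy but scores exactly 0.5 on AUC-ROC, which is why the latter is the more honest metric for this dataset.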

In [6]:
# Schema of the train data
train_data.printSchema()

root
 |-- userId: string (nullable = true)
 |-- churned: double (nullable = true)
 |-- avg_num_of_add_to_playlist_per_session: double (nullable = true)
 |-- avg_num_of_addfriends_per_session: double (nullable = true)
 |-- avg_num_of_adverts_per_session: double (nullable = true)
 |-- avg_num_of_artists_per_session: double (nullable = true)
 |-- avg_num_of_songs_per_session: double (nullable = true)
 |-- avg_num_of_thumbs_down_per_session: double (nullable = true)
 |-- avg_num_of_thumbs_up_per_session: double (nullable = true)
 |-- avg_num_of_times_settings_changed_per_session: double (nullable = true)
 |-- average_number_of_visits_to_the_about_page_per_session: double (nullable = true)
 |-- average_number_of_visits_to_the_help_page_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_home_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_the_settings_page_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_upgrade_page: double (nullable = true)
 |-- 

In [7]:
# Schema of the test data
test_data.printSchema()

root
 |-- userId: string (nullable = true)
 |-- churned: double (nullable = true)
 |-- avg_num_of_add_to_playlist_per_session: double (nullable = true)
 |-- avg_num_of_addfriends_per_session: double (nullable = true)
 |-- avg_num_of_adverts_per_session: double (nullable = true)
 |-- avg_num_of_artists_per_session: double (nullable = true)
 |-- avg_num_of_songs_per_session: double (nullable = true)
 |-- avg_num_of_thumbs_down_per_session: double (nullable = true)
 |-- avg_num_of_thumbs_up_per_session: double (nullable = true)
 |-- avg_num_of_times_settings_changed_per_session: double (nullable = true)
 |-- average_number_of_visits_to_the_about_page_per_session: double (nullable = true)
 |-- average_number_of_visits_to_the_help_page_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_home_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_the_settings_page_per_session: double (nullable = true)
 |-- avg_num_of_visits_to_upgrade_page: double (nullable = true)
 |-- 

<b>Both train and test have the same schema. The churned column indicates whether the user churned from the Sparkify service. All columns after the churned column are engineered features.</b>

In [9]:
# looking for null values in the train data
count_null_values_for_each_column(train_data)

{'userId': 0,
 'churned': 0,
 'avg_num_of_add_to_playlist_per_session': 0,
 'avg_num_of_addfriends_per_session': 0,
 'avg_num_of_adverts_per_session': 0,
 'avg_num_of_artists_per_session': 0,
 'avg_num_of_songs_per_session': 0,
 'avg_num_of_thumbs_down_per_session': 0,
 'avg_num_of_thumbs_up_per_session': 0,
 'avg_num_of_times_settings_changed_per_session': 0,
 'average_number_of_visits_to_the_about_page_per_session': 0,
 'average_number_of_visits_to_the_help_page_per_session': 0,
 'avg_num_of_visits_to_home_per_session': 0,
 'avg_num_of_visits_to_the_settings_page_per_session': 0,
 'avg_num_of_visits_to_upgrade_page': 0,
 'avg_number_of_errors_per_session': 0,
 'avg_number_of_visits_to_downgrade_page': 0,
 'num_times_user_changed_levels': 0,
 'num_of_downgrades_submitted': 0,
 'num_of_upgrades_submitted': 0,
 'gender': 0}

In [10]:
# looking for null values in the test data
count_null_values_for_each_column(test_data)

{'userId': 0,
 'churned': 0,
 'avg_num_of_add_to_playlist_per_session': 0,
 'avg_num_of_addfriends_per_session': 0,
 'avg_num_of_adverts_per_session': 0,
 'avg_num_of_artists_per_session': 0,
 'avg_num_of_songs_per_session': 0,
 'avg_num_of_thumbs_down_per_session': 0,
 'avg_num_of_thumbs_up_per_session': 0,
 'avg_num_of_times_settings_changed_per_session': 0,
 'average_number_of_visits_to_the_about_page_per_session': 0,
 'average_number_of_visits_to_the_help_page_per_session': 0,
 'avg_num_of_visits_to_home_per_session': 0,
 'avg_num_of_visits_to_the_settings_page_per_session': 0,
 'avg_num_of_visits_to_upgrade_page': 0,
 'avg_number_of_errors_per_session': 0,
 'avg_number_of_visits_to_downgrade_page': 0,
 'num_times_user_changed_levels': 0,
 'num_of_downgrades_submitted': 0,
 'num_of_upgrades_submitted': 0,
 'gender': 0}

<b>There are no null values. This is a result of the way the dataset was created where missing values of all features were filled with zeros.</b>

## Apply preprocessing required for numerical and categoric features and creating a pipeline using the preprocessing steps

In [3]:
print("Checking Train data set for correctness:")
print(count_column_types(train_data).iloc[:, :2])
print('-'*40)
print("Checking Test data set for correctness:")
print(count_column_types(test_data).iloc[:, :2])

Checking Train data set for correctness:
     type  count
0  bigint      3
1  double     16
2  string      2
----------------------------------------
Checking Test data set for correctness:
     type  count
0  bigint      3
1  double     16
2  string      2


In [59]:
# create list of numeric column names, have to remove the churned column itself
numeric_column_names = get_columns_of_type(train_data, "bigint")
numeric_column_names.extend(get_columns_of_type(train_data, "double"))
numeric_column_names.remove('churned') #remove churned column
numeric_column_names

['num_times_user_changed_levels',
 'num_of_downgrades_submitted',
 'num_of_upgrades_submitted',
 'avg_num_of_add_to_playlist_per_session',
 'avg_num_of_addfriends_per_session',
 'avg_num_of_adverts_per_session',
 'avg_num_of_artists_per_session',
 'avg_num_of_songs_per_session',
 'avg_num_of_thumbs_down_per_session',
 'avg_num_of_thumbs_up_per_session',
 'avg_num_of_times_settings_changed_per_session',
 'average_number_of_visits_to_the_about_page_per_session',
 'average_number_of_visits_to_the_help_page_per_session',
 'avg_num_of_visits_to_home_per_session',
 'avg_num_of_visits_to_the_settings_page_per_session',
 'avg_num_of_visits_to_upgrade_page',
 'avg_number_of_errors_per_session',
 'avg_number_of_visits_to_downgrade_page']

In [5]:
# create a list of categoric column names and remove userId from it 
categoric_column_names = get_columns_of_type(train_data, "string")
categoric_column_names.remove('userId')
categoric_column_names

['gender']

In [11]:
# as we perform the necessary preprocessing, we will add each step as a stage of a pipeline
pipeline_stages = []

In [7]:
# scaling numeric columns
numeric_vec_assembler = VectorAssembler(inputCols=numeric_column_names, outputCol="numeric_features") # create a vector of all the numeric columns
pipeline_stages.append(numeric_vec_assembler)

minmax_scaler = MinMaxScaler(inputCol="numeric_features", outputCol="numeric_features_scaled") # minmax scale all numeric columns
pipeline_stages.append(minmax_scaler)
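For readers unfamiliar with min-max scaling, the idea behind the scaler stage can be sketched in plain Python. This is only an illustration of the formula `(x - min) / (max - min)` applied to one column; the function name `min_max_scale` and the sample values are hypothetical, and the handling of a constant column (mapping it to 0.5) is an assumption of this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so map every value to 0.5
        # (an assumption made for this sketch).
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-session averages for three users.
print(min_max_scale([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```

Because every numeric feature ends up in the same range, the coefficients of a linear model fitted on the scaled features are easier to compare with each other.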

In [8]:
# Label encoding the categorical column as gender is binary in this case
str_indexer = StringIndexer(inputCols=categoric_column_names, outputCols=[name+'_indexed' for name in categoric_column_names], handleInvalid='skip') # encode all categorical features
pipeline_stages.append(str_indexer)
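The indexing step can be pictured as building a lookup table from category to number. The sketch below is a plain-Python approximation, assuming the indexer's default ordering (most frequent label gets index 0) and the `handleInvalid='skip'` behaviour of dropping rows with unseen labels. The helper names and the sample gender values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; this is an assumption of the sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_skip_invalid(values, mapping):
    """Encode labels, silently dropping any label not seen during fitting."""
    return [mapping[v] for v in values if v in mapping]


# Hypothetical gender column.
mapping = fit_string_index(["M", "F", "M", "M", "F"])
print(mapping)                                            # {'M': 0.0, 'F': 1.0}
print(transform_skip_invalid(["F", "M", "X"], mapping))   # [1.0, 0.0]
```

With only two categories, a single 0/1 column carries all the information, which is why no one-hot encoding is needed for gender.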

In [9]:
# creating a vector of all the features
feature_columns = ["numeric_features_scaled", "gender_indexed"]
feature_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features") # create the combined features vector
pipeline_stages.append(feature_assembler)

output_label_indexer = StringIndexer(inputCol='churned', outputCol='label') # encode the churned column
pipeline_stages.append(output_label_indexer)

In [14]:
# transforming the train data set using the pipeline
# we will use this transformed data to train the model before adding the model itself as the final stage
data_pipeline = Pipeline(stages=pipeline_stages)
data_pipeline_model = data_pipeline.fit(train_data)
transformed_data = data_pipeline_model.transform(train_data)  # use the fitted pipeline model from the line above
transformed_data.select("churned", "features", "label").show(5, truncate=False)
transformed_data.persist()

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|churned|features                                                                                                                                                                                                                                                                                                   |label|
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|1.0    |[0.0,0.0,0.0,0.25,0.3213793103448276,0.6769

DataFrame[userId: string, churned: double, avg_num_of_add_to_playlist_per_session: double, avg_num_of_addfriends_per_session: double, avg_num_of_adverts_per_session: double, avg_num_of_artists_per_session: double, avg_num_of_songs_per_session: double, avg_num_of_thumbs_down_per_session: double, avg_num_of_thumbs_up_per_session: double, avg_num_of_times_settings_changed_per_session: double, average_number_of_visits_to_the_about_page_per_session: double, average_number_of_visits_to_the_help_page_per_session: double, avg_num_of_visits_to_home_per_session: double, avg_num_of_visits_to_the_settings_page_per_session: double, avg_num_of_visits_to_upgrade_page: double, avg_number_of_errors_per_session: double, avg_number_of_visits_to_downgrade_page: double, num_times_user_changed_levels: bigint, num_of_downgrades_submitted: bigint, num_of_upgrades_submitted: bigint, gender: string, numeric_features: vector, numeric_features_scaled: vector, gender_indexed: double, features: vector, label: doubl

## Model selection

In [17]:
# Instantiate different algorithms
model_names = ['logistic regression', 'random forest', 'gradient-boosted tree', 'linear svc',
               'decision tree', 'naive bayes']
estimators = []

lr = LogisticRegression()
estimators.append(lr)

rf = RandomForestClassifier(seed = 42)
estimators.append(rf)

gbt = GBTClassifier(seed = 42)
estimators.append(gbt)

svc = LinearSVC()
estimators.append(svc)

dt = DecisionTreeClassifier(seed = 42)
estimators.append(dt)

nb = NaiveBayes()
estimators.append(nb)

In [20]:
# to evaluate baseline models, we will use cross validation with the default number of folds
# the evaluation metric used by BinaryClassificationEvaluator is AUCROC, a number ranging from 0 to 1
# a value of 1 means the model has great classification separation capacity
# a value of 0 also means the model has great classification separation capacity, however the predicted labels are flipped.
# a value near 0.5 means the model has no separation capacity at all

best_metric_value = -99
best_model = None

for model_name, est in zip(model_names,estimators):
    
    evaluator = BinaryClassificationEvaluator() # using a binary classification evaluator with AUCROC as the metric
    
    grid = ParamGridBuilder().build() # using an empty grid
    
    crossval = CrossValidator(estimator = est,
                            estimatorParamMaps=grid,
                            evaluator = evaluator) # using the default value for number of folds: 3
    
    cvmodel = crossval.fit(transformed_data)
    
    metric_val = evaluator.evaluate(cvmodel.transform(transformed_data))  # AUCROC on the same data the model was trained on
    
    print(f"Algorithm: {model_name}")
    print(f"Cross validation score: {cvmodel.avgMetrics[0]}")
    print(f"AUCROC: {metric_val}")
    print()
    
    if metric_val > best_metric_value:
        best_metric_value = metric_val
        best_model = cvmodel.bestModel

Algorithm: logistic regression
Cross validation score: 0.6065959019718983
AUCROC: 0.7153784219001608

Algorithm: random forest
Cross validation score: 0.5990755120479732
AUCROC: 0.9315172660583285

Algorithm: gradient-boosted tree
Cross validation score: 0.5207247125103985
AUCROC: 0.9997316156736447

Algorithm: linear svc
Cross validation score: 0.6033980599829254
AUCROC: 0.6823224190373949

Algorithm: decision tree
Cross validation score: 0.5244351974131821
AUCROC: 0.40915190552871716

Algorithm: naive bayes
Cross validation score: 0.500353935878279
AUCROC: 0.5078278761853641



<b>As can be seen, logistic regression had the best average cross validation score, followed by Linear SVC.
We will tune the hyperparameters of the logistic regression model.</b>
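The ranking can be reproduced from the numbers printed above. In this small sketch the first value of each pair is the cross validation score and the second is the AUCROC that the loop computed on the training data itself. A large gap between the two is usually a sign of overfitting, although that interpretation is a judgment on my part rather than something the notebook measures directly.

```python
# (cross validation score, AUCROC on the training data), copied from the output above.
results = {
    "logistic regression": (0.6065959019718983, 0.7153784219001608),
    "random forest": (0.5990755120479732, 0.9315172660583285),
    "gradient-boosted tree": (0.5207247125103985, 0.9997316156736447),
    "linear svc": (0.6033980599829254, 0.6823224190373949),
    "decision tree": (0.5244351974131821, 0.40915190552871716),
    "naive bayes": (0.500353935878279, 0.5078278761853641),
}

# Rank by cross validation score, best first.
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)
for name, (cv_score, train_auc) in ranked:
    print(f"{name:<22} cv={cv_score:.3f} train={train_auc:.3f} gap={train_auc - cv_score:+.3f}")

# The model with the largest train-vs-CV gap.
largest_gap = max(results, key=lambda name: results[name][1] - results[name][0])
print("largest gap:", largest_gap)  # largest gap: gradient-boosted tree
```

The gradient-boosted tree is nearly perfect on the training data but close to chance under cross validation, which is why the choice is based on the cross validation score and not on the training AUCROC.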

In [32]:
def train_classifier(estimator, evaluator, paramGrid, data):
    # use the function's own arguments rather than relying on global variables
    crossval = CrossValidator(estimator = estimator,
                            estimatorParamMaps=paramGrid,
                            evaluator = evaluator) # using the default value for number of folds: 3
    
    cvmodel = crossval.fit(data)
    
    return cvmodel

In [33]:
# we will tune the maximum iterations, regularization parameter and the classification threshold parameters of the logistic regression
# algorithm
# note: the evaluator's AUCROC is computed from the raw prediction scores, so the threshold values are not expected to change the score

est = LogisticRegression()

grid = (ParamGridBuilder()
        .addGrid(est.maxIter, [100, 200, 300])
        .addGrid(est.regParam, [0.001, 0.01, 0.1, 1, 3, 5])
        .addGrid(est.threshold, [0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
        .build())

evaluator = BinaryClassificationEvaluator()

lr = train_classifier(est, evaluator, grid, transformed_data)

In [34]:
print(f"Cross validation score: {lr.avgMetrics[0]}")

Cross validation score: 0.6184790292341781


<b>Even after fine tuning, the average score improved only slightly.</b>
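When reading tuning results, it helps to remember that `avgMetrics` holds one averaged score per parameter combination, in the same order as the grid, so index 0 is the score of the first combination and not necessarily the best one. The sketch below shows one way to pair the two lists and sort them. The helper `rank_param_maps` is hypothetical and works on plain lists; with the fitted cross validator above, the inputs would presumably be `lr.getEstimatorParamMaps()` and `lr.avgMetrics`. The example values are invented for illustration and are not results from this notebook.

```python
def rank_param_maps(param_maps, avg_metrics):
    """Pair each parameter combination with its mean CV metric, best first."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("expected one metric per parameter map")
    return sorted(zip(param_maps, avg_metrics), key=lambda pair: pair[1], reverse=True)


# Hypothetical values for illustration only; not results from this notebook.
example_maps = [{"regParam": 0.001}, {"regParam": 0.01}, {"regParam": 0.1}]
example_metrics = [0.61, 0.63, 0.58]

best_params, best_score = rank_param_maps(example_maps, example_metrics)[0]
print(best_params, best_score)  # {'regParam': 0.01} 0.63
```

Printing the full ranking, rather than a single entry, also shows how sensitive the score is to each parameter.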

In [74]:
# the parameters of the best performing model
print("maxIter: ",lr.bestModel.getMaxIter())
print("threshold: ", lr.bestModel.getThreshold())
print("regParam: ", lr.bestModel.getRegParam())

maxIter:  100
threshold:  0.3
regParam:  0.01


In [49]:
lr.bestModel.coefficients.values

array([-0.24864945, -0.10793773, -0.25570171,  1.03269487, -0.49079552,
        2.30837552,  0.46266292,  0.79770161,  1.70916452, -1.33829473,
       -1.29602326, -0.72130076,  0.92615866, -0.20009427,  0.05294991,
       -0.04719589, -1.50679953,  1.85935604,  0.29654608])

In [60]:
all_features = numeric_column_names + categoric_column_names

In [62]:
feature_coefficient = pd.DataFrame.from_dict({'feature':all_features,'LR coefficient': lr.bestModel.coefficients.values})
feature_coefficient

| | feature | LR coefficient |
|---|---|---|
| 0 | num_times_user_changed_levels | -0.248649 |
| 1 | num_of_downgrades_submitted | -0.107938 |
| 2 | num_of_upgrades_submitted | -0.255702 |
| 3 | avg_num_of_add_to_playlist_per_session | 1.032695 |
| 4 | avg_num_of_addfriends_per_session | -0.490796 |
| 5 | avg_num_of_adverts_per_session | 2.308376 |
| 6 | avg_num_of_artists_per_session | 0.462663 |
| 7 | avg_num_of_songs_per_session | 0.797702 |
| 8 | avg_num_of_thumbs_down_per_session | 1.709165 |
| 9 | avg_num_of_thumbs_up_per_session | -1.338295 |


The table above gives an idea of how each feature contributed to the decision of the algorithm.
The 5 most influential features, judged by the absolute value of their coefficients, were:
- avg_num_of_adverts_per_session
- avg_number_of_visits_to_downgrade_page
- avg_num_of_thumbs_down_per_session
- avg_number_of_errors_per_session
- avg_num_of_thumbs_up_per_session
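The list above can be reproduced by sorting the coefficients by absolute value. The sketch below copies the feature names and the coefficient array from the cell outputs above. It assumes the coefficients follow the order of `numeric_column_names` followed by `gender`, which is how `all_features` was built.

```python
feature_names = [
    "num_times_user_changed_levels", "num_of_downgrades_submitted",
    "num_of_upgrades_submitted", "avg_num_of_add_to_playlist_per_session",
    "avg_num_of_addfriends_per_session", "avg_num_of_adverts_per_session",
    "avg_num_of_artists_per_session", "avg_num_of_songs_per_session",
    "avg_num_of_thumbs_down_per_session", "avg_num_of_thumbs_up_per_session",
    "avg_num_of_times_settings_changed_per_session",
    "average_number_of_visits_to_the_about_page_per_session",
    "average_number_of_visits_to_the_help_page_per_session",
    "avg_num_of_visits_to_home_per_session",
    "avg_num_of_visits_to_the_settings_page_per_session",
    "avg_num_of_visits_to_upgrade_page", "avg_number_of_errors_per_session",
    "avg_number_of_visits_to_downgrade_page", "gender",
]
coefficients = [
    -0.24864945, -0.10793773, -0.25570171, 1.03269487, -0.49079552,
    2.30837552, 0.46266292, 0.79770161, 1.70916452, -1.33829473,
    -1.29602326, -0.72130076, 0.92615866, -0.20009427, 0.05294991,
    -0.04719589, -1.50679953, 1.85935604, 0.29654608,
]

# Sort features by the magnitude of their coefficient, largest first.
ranked_features = sorted(zip(feature_names, coefficients), key=lambda fc: abs(fc[1]), reverse=True)
top_five = [name for name, _ in ranked_features[:5]]
for name, coef in ranked_features[:5]:
    print(f"{name:<45} {coef:+.3f}")
```

The sign shows the direction of the effect: a positive coefficient pushes the prediction towards churn and a negative one away from it. Comparing magnitudes is reasonable here because all features were scaled to a common range before training.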

In [75]:
# Add the logistic regression model as the final stage
pipeline_stages.append(LogisticRegression(maxIter=100, regParam=0.01, threshold=0.3))

Transform and evaluate performance on Test data

In [77]:
evaluator = BinaryClassificationEvaluator()

training_pipeline = Pipeline(stages=pipeline_stages)
training_pipeline_model = training_pipeline.fit(train_data)

transformed_test_data = training_pipeline_model.transform(test_data)

print("AUCROC score on test data: ", evaluator.evaluate(transformed_test_data))

AUCROC score on test data:  0.5144596651445966


The performance of the model is poor. With an AUCROC close to 0.5, the model has almost no separation capacity.

There are a few reasons for this:
- The imbalanced data means that the model does not have enough information to learn the characteristics of churners and non-churners. There are several ways to improve this situation, such as upsampling, downsampling and SMOTE.
- The features we engineered were not very helpful to the algorithm. We would have to go back to the drawing board and think of other features. For example, there could be features engineered from the timestamp, which we haven't done in this project.
- This subset of the full dataset is too small for the model to learn anything useful. We would have to train the model on the full set before we can draw any conclusions.
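Of the remedies listed for class imbalance, random upsampling is the simplest to sketch. The plain-Python example below resamples the minority class with replacement until both classes have the same number of rows. The function `upsample_minority` and the placeholder rows are hypothetical; only the class counts (276 and 81) come from the train-set summary. In a real workflow this would be applied to the training data only, never to the test data.

```python
import random


def upsample_minority(rows, label_key="churned", seed=42):
    """Resample smaller classes with replacement until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows at random from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Placeholder rows with the class counts from the train-set summary.
rows = [{"userId": str(i), "churned": 0.0} for i in range(276)]
rows += [{"userId": str(276 + i), "churned": 1.0} for i in range(81)]

balanced = upsample_minority(rows)
n_churners = sum(1 for r in balanced if r["churned"] == 1.0)
print(len(balanced), n_churners)  # 552 276
```

Upsampling adds no new information about churners, so it is best seen as a way to stop the model from ignoring the minority class rather than a substitute for more data.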

## Conclusion

In part 2 of this project, we attempted to use ML algorithms to model the data in hand which was gathered in part 1. The aim was to use the model to predict user churn using the various features engineered from the user event logs data.

We trained different base models on the training data and evaluated their performance using cross validation based on the AUC-ROC score. The logistic regression algorithm had the best score and as a result we tried to tune parameters to improve the performance. 

The tuned model performed only slightly better. Finally, we evaluated the performance of the tuned model on the test dataset. The model performed poorly, and we discussed possible reasons and actions that can be taken to improve it.

After all, we can conclude that Data Science is an iterative process and not all projects are successful. However, with each project we learn to make better decisions and improve our skills. The bigger picture of this project was to learn PySpark while executing the data science process on a real business problem, and I believe I have learned a great deal about it.