Sparkify is a fictional music streaming service invented by Udacity. Users can listen to music for free (with ads between songs) or pay a flat fee. They can upgrade, downgrade, or cancel their subscription. My task is to predict which users are about to leave, so that they can be offered a discount before they cancel.
I need to build a binary classifier based on the data that Udacity provided: a large dataset of simulated Sparkify user activity logs. Because of the size of the dataset, the project was implemented with Apache Spark and its Python API, PySpark.
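To make the binary target concrete, here is a minimal sketch of turning raw log events into one churn label per user. It assumes that a cancellation event marks a user as churned; the field names `userId` and `page` and the event name `Cancellation Confirmation` are assumptions for illustration and may not match the actual log schema. Plain Python is used to keep the example self-contained; in the project itself the same aggregation would be done with PySpark.

```python
from collections import defaultdict

# Assumed event name; the real logs may use a different value.
CANCEL_PAGE = "Cancellation Confirmation"


def label_users(events):
    """Return {user_id: 1 if the user has a cancellation event, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user_id = event["userId"]  # assumed field name
        # OR-ing with 0 still creates the key, so non-churned users get label 0.
        labels[user_id] |= int(event["page"] == CANCEL_PAGE)
    return dict(labels)


# Tiny made-up sample, not taken from the Sparkify dataset.
sample_events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Upgrade"},
]
print(label_users(sample_events))
```

The resulting labels are what a binary classifier would be trained to predict from per-user features.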
The full Sparkify dataset is 12 GB. Because it is too large to process locally, an Elastic MapReduce (EMR) cluster was deployed to run the tasks in the AWS cloud.
Release label: emr-5.30.1
Applications: Spark 2.4.5, Zeppelin 0.8.2
Instance type: m5.xlarge
Number of instances: 5 (1 master and 4 core nodes)
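The configuration above could be expressed as a request for boto3's `run_job_flow` call, as sketched below. This README does not say how the cluster was actually created (console, CLI, or API), so this is only one possible way to reproduce it. The cluster name, the instance group names, and the use of the default EMR roles are assumptions.

```python
def emr_cluster_request(name="sparkify-cluster"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster name and the IAM role names are assumptions; the release
    label, applications, instance type, and counts come from the list above.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.1",
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 4,
                },
            ],
            # Keep the cluster running for interactive notebook work.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role
        "ServiceRole": "EMR_DefaultRole",  # assumed default role
    }


request = emr_cluster_request()
# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS credentials; the actual call is left as a comment.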
A post for this project is on Medium.
- Sparkify_small.ipynb includes data preparation, analysis, visualization, and machine learning models for the small dataset
- Sparkify_full.ipynb includes data preparation, analysis, visualization, and machine learning models for the full dataset
Accuracy: 0.773, F-1 Score: 0.674, Total training time: 0.0 minutes
Accuracy: 0.778, F-1 Score: 0.691, Total training time: 41.2 minutes
Accuracy: 0.875, F-1 Score: 0.861, Total training time: 54.6 minutes
Accuracy: 0.857, F-1 Score: 0.84, Total training time: 133.4 minutes
Accuracy: 0.773, F-1 Score: 0.674, Total training time: 40.5 minutes
Available memory: 11171M, Total training time: 5.26 hours
The best model is Random Forest.
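For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy and the F1 score for the positive (churn) class from a list of true labels and predictions. The README does not state which F1 variant was used; Spark's `MulticlassClassificationEvaluator`, for example, reports a weighted F1 by default, which can differ from the positive-class F1 shown here. The labels in the example are made up and are not project results.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two equal-length lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f} F-1 Score: {f1:.3f}")
```

When churned users are a minority, accuracy alone can look good for a model that rarely predicts churn, which is why the F1 score is reported alongside it.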
Credit to Udacity for the data.
Apache License 2.0
See the LICENSE file for details