This project shows how to build a scalable model with Spark to predict customer churn. The data was made available by Udacity and consists of user logs from a fictional music streaming app called Sparkify.
The libraries needed to run the code with Python 3.*:
- Anaconda (Pandas, NumPy, Matplotlib, Datetime) and PySpark.
Note: the data must be unzipped before running the project.
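As a minimal sketch, the archive can be extracted with Python's standard `zipfile` module instead of a desktop tool. The helper name `unzip_data` and the choice of target directory are assumptions made for illustration; they are not part of the notebook.

```python
import zipfile
from pathlib import Path


def unzip_data(archive_path, target_dir="."):
    """Extract every member of the archive into target_dir.

    Hypothetical helper, not taken from the notebook.
    Returns the paths of the extracted members.
    """
    target = Path(target_dir)
    # Create the destination folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return [target / name for name in archive.namelist()]


# Example call, assuming the archive sits next to the notebook:
# unzip_data("mini_sparkify_event_data.json.zip")
```

Running the example call from the repository folder should place the extracted file next to the notebook, which is presumably where the notebook looks for it.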
In the business world, churn occurs when a customer cancels or abandons a service. Predicting which customers are likely to churn can be very profitable for companies, since they can offer discounts and incentives to those customers and thereby increase the retention rate.
- Sparkify.ipynb: a notebook containing the full process of building a scalable model with Spark to predict customer churn for Sparkify, a fictional music streaming app.
- mini_sparkify_event_data.json.zip: the zipped dataset
- Sparkify.html: an HTML version of the notebook
The full process and the results of this project can be found in the post available here.
Thanks to Udacity for providing such an amazing project.
You must give credit to Udacity for the data. Otherwise, feel free to use the code here as you like!