This project hones the following skills:
- Loading large datasets into Spark and manipulating them using Spark SQL and Spark DataFrames
- Using the machine learning APIs within Spark ML to build and tune models
- Integrating the skills I've learned in the Spark course and the Data Scientist Nanodegree program
Our primary task is to predict churned users from the event logs of a music streaming app. The original dataset is 12 GB; due to the limited compute available on the free tier of IBM Cloud, a medium-sized subset of the data is used.
- Pyspark SQL and Pyspark ML
- Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Modeling
- Evaluation
LogisticRegression was implemented to predict customer churn.
Prediction on the test set (after hyperparameter tuning): area under ROC 0.9333, accuracy 83.87%.
sparkify.ipynb
- Analysis in Jupyter Notebook
- Dataset by Udacity
- Jupyter Notebook instructions by Udacity
Copyright (c) 2019 Rohit Swami
This project is licensed under the MIT License - see the LICENSE file for details