Contains Jupyter Notebook of Capstone Project of Data Scientist Nanodegree

rowhitswami/Sparkify-App

Sparkify-App

Overview

Motivation

This project was a chance to hone skills in:

  1. Loading large datasets into Spark and manipulating them using Spark SQL and Spark DataFrames
  2. Using the machine learning APIs within Spark ML to build and tune models
  3. Integrating the skills I've learned in the Spark course and the Data Scientist Nanodegree program

Task and Datasets

Our primary task is to predict which users will churn, based on the event logs of a music app. The original dataset is 12 GB. Because of the limited compute available on the free version of IBM Cloud, a medium-sized subset of the data is used instead.
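To make the prediction target concrete, here is a minimal sketch of how a per-user churn label could be derived from raw log events. It uses plain Python rather than Spark so that it runs anywhere. The field names `userId` and `page`, the helper `label_churn`, and the choice of a "Cancellation Confirmation" event as the churn marker are assumptions for illustration, not details taken from the notebook.

```python
from collections import defaultdict

# Assumption: each log record is a dict holding a user id and the page visited.
# The field names and the churn-marking page below are hypothetical.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {user_id: 1 if the user ever reached CHURN_PAGE, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        # The in-place OR also creates the key, so users who never churn get 0.
        labels[user] |= int(event["page"] == CHURN_PAGE)
    return dict(labels)


# Toy events, made up for illustration only.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]
labels = label_churn(events)
print(labels)
```

In Spark the same idea would typically be expressed as a group-by on the user column with a max over a 0/1 flag, but the exact columns depend on the dataset schema.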

Frameworks & Libraries

  • PySpark SQL and PySpark ML

Summary of Project

  1. Data Preprocessing
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Modeling
  5. Evaluation

Methodology

A LogisticRegression model was trained to predict whether a customer will churn.

Prediction on the test set (after hyperparameter tuning): area under ROC 0.9333, accuracy 83.87%
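For readers unfamiliar with the two reported metrics, the following is a small standard-library sketch of how they are defined. It is not the evaluation code from the notebook; the function names, the 0.5 decision threshold, and the toy data are assumptions for illustration. Area under ROC is computed here through its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted churn probabilities, made up for illustration only.
y_true = [1, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.6, 0.8]

acc = accuracy(y_true, y_score)
auc = area_under_roc(y_true, y_score)
print(f"accuracy={acc:.2f}, area under ROC={auc:.4f}")
```

The two numbers can differ noticeably, as in the reported results, because accuracy depends on one fixed threshold while area under ROC only measures how well the scores rank churned users above retained ones.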

View a detailed analysis report on Medium

Medium Post

Files in the repo

  • sparkify.ipynb - Analysis in Jupyter Notebook

Acknowledgement

  • Dataset by Udacity
  • Jupyter Notebook instructions by Udacity

License

MIT license

Copyright (c) 2019 Rohit Swami

This project is licensed under the MIT License. See the LICENSE file for details.
