# Sparkify Project - Part 2

If you didn't read the first part, you should! This one picks up where that one left off and moves to the full 12GB dataset. We will first verify our assumptions from the first notebook, then run the feature engineering script and train a few basic models to see how they perform. First we'll create the Spark session and import some libraries.

In [1]:
# Starter code
from pyspark.sql import SparkSession

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,application_1549211192169_0004,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
import pyspark.sql.functions as sf
import pyspark.sql.types as st

from pyspark.ml.feature import VectorAssembler

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import LogisticRegression

from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier

In [3]:
# Create spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

Next read in the full dataset from the public s3 bucket!

In [4]:
# Read in full sparkify dataset
event_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
df_raw = spark.read.json(event_data)

Next check the head and the schema to see if everything looks similar:

In [6]:
df_raw.head()

Row(artist=u'Popol Vuh', auth=u'Logged In', firstName=u'Shlok', gender=u'M', itemInSession=278, lastName=u'Johnson', length=524.32934, level=u'paid', location=u'Dallas-Fort Worth-Arlington, TX', method=u'PUT', page=u'NextSong', registration=1533734541000, sessionId=22683, song=u'Ich mache einen Spiegel - Dream Part 4', status=200, ts=1538352001000, userAgent=u'"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId=u'1749042')

In [12]:
df_raw.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)

## Exploration

In [9]:
df_raw.count()

26259199

Holy cow, that's a big increase from the subset, although we expected it.

In [16]:
df_raw.select([sf.count(sf.when(sf.isnull(c), c)).alias(c) for c in df_raw.columns]).show()

+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+
| artist|auth|firstName|gender|itemInSession|lastName| length|level|location|method|page|registration|sessionId|   song|status| ts|userAgent|userId|
+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+
|5408927|   0|   778479|778479|            0|  778479|5408927|    0|  778479|     0|   0|      778479|        0|5408927|     0|  0|   778479|     0|
+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+

After looking at some missing values, it looks like the pattern is the same as the subset. Let's just check for the `""` user like before.

In [10]:
df_raw.filter(df_raw.userId == "").count()

0

Hmmmm, that's weird, there aren't any. Let's see which user shows up instead.

In [17]:
df_raw.where(sf.isnull(df_raw.registration)).select('userId').distinct().show()

+-------+
| userId|
+-------+
|1261737|
+-------+

In [18]:
df_raw.where(sf.isnull(df_raw.gender)).select('userId').distinct().show()

+-------+
| userId|
+-------+
|1261737|
+-------+

Wow, looks like `1261737` is the null user instead of empty string!

In [21]:
df_raw.filter(df_raw.userId == 1261737).count()

778479

In [20]:
df_raw.where(sf.isnull(df_raw.registration)).select('sessionId').distinct().show()

+---------+
|sessionId|
+---------+
|    23116|
|    33760|
|    30428|
|    35323|
|    35484|
|    13248|
|    38878|
|    38543|
|    27919|
|    48763|
|    52001|
|    17048|
|    52051|
|    48899|
|    47492|
|    49983|
|    52611|
|    50049|
|    50329|
|    50287|
+---------+
only showing top 20 rows

And it looks like the problem isn't specific to particular sessions.

## Cleaning

We will use the same functions from the first notebook.

In [5]:
def convert_ms(x):
    """Converts a timestamp in milliseconds to whole seconds"""
    if x is None:
        return None
    
    return x//1000

convert_ms_udf = sf.udf(convert_ms, st.LongType())
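
To make the unit conversion concrete, here is a minimal pure-Python sketch of what `convert_ms` followed by the timestamp cast amounts to: the `ts` and `registration` columns hold epoch milliseconds, and dividing by 1000 gives the epoch seconds that a timestamp cast expects. The helper name `ms_to_utc` is hypothetical, and the sketch assumes UTC, which may differ from the cluster's session time zone.

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    if ms is None:
        return None
    # Integer-divide by 1000 to drop the millisecond part, as convert_ms does
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


# ts value taken from the first row shown by df_raw.head()
first_event = ms_to_utc(1538352001000)
print(first_event.isoformat())
```

This prints `2018-10-01T00:00:01+00:00`, which lines up with the earliest event in the data. In Spark itself the same conversion can probably be written without a Python UDF, for example as `(sf.col('ts') / 1000).cast('timestamp')`; that variant is untested here.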

In [6]:
def clean_df(df_raw):
    """
    Takes in a raw events dataframe, makes a few extra columns and cleans it.
    """
    df = df_raw.filter(df_raw.userId != 1261737)
    
    df = df.withColumn('timestamp', convert_ms_udf(df.ts).cast('timestamp'))
    df = df.withColumn('registration_ts', convert_ms_udf(df.registration).cast('timestamp'))
    
    return df

In [7]:
# after reading in the df just running this cell catches up with the exploration
df = clean_df(df_raw)

In [12]:
df.select([sf.count(sf.when(sf.isnull(c), c)).alias(c) for c in df_raw.columns]).show()

+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+
| artist|auth|firstName|gender|itemInSession|lastName| length|level|location|method|page|registration|sessionId|   song|status| ts|userAgent|userId|
+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+
|4630448|   0|        0|     0|            0|       0|4630448|    0|       0|     0|   0|           0|        0|4630448|     0|  0|        0|     0|
+-------+----+---------+------+-------------+--------+-------+-----+--------+------+----+------------+---------+-------+------+---+---------+------+

Now we can check out the number of users and sessions in the full data. It turns out there is still a sizeable number of each.

In [25]:
df.agg(sf.countDistinct('userId'), sf.countDistinct('sessionId')).show()

+----------------------+-------------------------+
|count(DISTINCT userId)|count(DISTINCT sessionId)|
+----------------------+-------------------------+
|                 22277|                   223096|
+----------------------+-------------------------+

It also turns out the timestamp range is similar to the other data, although the registration range is longer as we would expect from a larger user group.

In [30]:
df.agg(sf.min('timestamp'), sf.max('timestamp')).show()

+-------------------+-------------------+
|     min(timestamp)|     max(timestamp)|
+-------------------+-------------------+
|2018-10-01 00:00:01|2018-12-01 00:00:02|
+-------------------+-------------------+

In [29]:
df.agg(sf.min('registration_ts'), sf.max('registration_ts')).show()

+--------------------+--------------------+
|min(registration_ts)|max(registration_ts)|
+--------------------+--------------------+
| 2017-10-14 22:05:25| 2018-12-03 07:23:42|
+--------------------+--------------------+

Interesting that the last registration is after the end of the events data. It seems like a possible error, but it probably won't be a big deal.

## Moving On

In [8]:
def get_feature_df(df):
    """Takes in a cleaned event dataframe and returns a feature dataframe"""
    
    # Column names for the vector
    vector_cols = []
    
    # Get session counts
    session_counts = df.groupby('userId').agg(sf.countDistinct('sessionId').alias('session_count'))
    vector_cols.append('session_count')
    
    # Get pages and ignore events
    pages = df.select('page').distinct().sort('page')
    pages_list = [r.page for r in pages.collect()]
    drop_events = ['Cancel']
    
    # Get event counts
    feat_df = df.groupby('userId').pivot('page', pages_list).count()
    feat_df = feat_df.withColumnRenamed('Cancellation Confirmation', 'label')
    feat_df = feat_df.drop(*drop_events).fillna(0)
    
    feat_df = feat_df.join(session_counts, on='userId')
    
    # Normalize by session counts
    ignore_cols = {'userId', 'session_count', 'label'}
    remaining_cols = sorted(list(set(feat_df.columns) - ignore_cols))
    for column in remaining_cols:
        feat_df = feat_df.withColumn(column, sf.col(column) / feat_df.session_count)
    vector_cols.extend(remaining_cols)
    
    # Get account ages
    max_timestamp = df.agg(sf.max('timestamp')).first()[0]
    account_ages = df.select('userId', 
                             sf.datediff(sf.lit(max_timestamp), df.registration_ts).alias('account_age')).distinct()
    
    vector_cols.append('account_age')
    feat_df = feat_df.join(account_ages, on='userId')
    
    # Get weekly song counts
    week_counts = df.where(df.page=='NextSong') \
                .groupby('userId', sf.date_trunc('week', 'timestamp').cast('date').alias('week')).count() \
                .groupby('userId').pivot('week').sum().fillna(0)
    
    vector_cols.extend(week_counts.columns[1:])
    feat_df = feat_df.join(week_counts, on='userId')
    
    # Get genders
    genders = df.select('userId', sf.when(sf.col('gender')=='F', 0).otherwise(1).alias('genders')).distinct()
    
    vector_cols.append('genders')
    feat_df = feat_df.join(genders, on='userId')
    
    # Assemble the vector
    assembler = VectorAssembler(inputCols=vector_cols, outputCol='features')
    
    return assembler.transform(feat_df)
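
The weekly song counts rely on `date_trunc('week', ...)`, which maps every timestamp to the Monday that starts its week. A small pure-Python sketch of the same bucketing (the `week_start` helper is hypothetical, not part of the notebook) shows how the two months of events fall into week buckets:

```python
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    # date.weekday() is 0 for Monday, so subtracting it lands on Monday
    return day - timedelta(days=day.weekday())


# First and last event dates in the full dataset
first_week = week_start(date(2018, 10, 1))
last_week = week_start(date(2018, 12, 1))

# Enumerate every week bucket between the two, inclusive
weeks = []
current = first_week
while current <= last_week:
    weeks.append(current.isoformat())
    current += timedelta(days=7)

print(weeks)
```

The list runs from `2018-10-01` to `2018-11-26`, nine buckets in total, so each user gets nine weekly song count features.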

Now we set up the feature dataframe and make sure everything is ok.

In [None]:
feature_df = get_feature_df(df)

In [10]:
feature_df.printSchema()

root
 |-- userId: string (nullable = true)
 |-- About: double (nullable = true)
 |-- Add Friend: double (nullable = true)
 |-- Add to Playlist: double (nullable = true)
 |-- label: long (nullable = true)
 |-- Downgrade: double (nullable = true)
 |-- Error: double (nullable = true)
 |-- Help: double (nullable = true)
 |-- Home: double (nullable = true)
 |-- Logout: double (nullable = true)
 |-- NextSong: double (nullable = true)
 |-- Roll Advert: double (nullable = true)
 |-- Save Settings: double (nullable = true)
 |-- Settings: double (nullable = true)
 |-- Submit Downgrade: double (nullable = true)
 |-- Submit Upgrade: double (nullable = true)
 |-- Thumbs Down: double (nullable = true)
 |-- Thumbs Up: double (nullable = true)
 |-- Upgrade: double (nullable = true)
 |-- session_count: long (nullable = false)
 |-- account_age: integer (nullable = true)
 |-- 2018-10-01: long (nullable = true)
 |-- 2018-10-08: long (nullable = true)
 |-- 2018-10-15: long (nullable = true)
 |-- 2018-10-22

In [11]:
feature_df.persist()

DataFrame[userId: string, About: double, Add Friend: double, Add to Playlist: double, label: bigint, Downgrade: double, Error: double, Help: double, Home: double, Logout: double, NextSong: double, Roll Advert: double, Save Settings: double, Settings: double, Submit Downgrade: double, Submit Upgrade: double, Thumbs Down: double, Thumbs Up: double, Upgrade: double, session_count: bigint, account_age: int, 2018-10-01: bigint, 2018-10-08: bigint, 2018-10-15: bigint, 2018-10-22: bigint, 2018-10-29: bigint, 2018-11-05: bigint, 2018-11-12: bigint, 2018-11-19: bigint, 2018-11-26: bigint, genders: int, features: vector]

In [12]:
feature_df.head(1)

[Row(userId=u'1000280', About=0.0, Add Friend=0.6363636363636364, Add to Playlist=1.1363636363636365, label=1, Downgrade=0.13636363636363635, Error=0.13636363636363635, Help=0.36363636363636365, Home=2.0, Logout=0.6818181818181818, NextSong=46.45454545454545, Roll Advert=3.3636363636363638, Save Settings=0.045454545454545456, Settings=0.4090909090909091, Submit Downgrade=0.045454545454545456, Submit Upgrade=0.045454545454545456, Thumbs Down=1.5, Thumbs Up=2.409090909090909, Upgrade=0.4090909090909091, session_count=22, account_age=95, 2018-10-01=308, 2018-10-08=34, 2018-10-15=291, 2018-10-22=57, 2018-10-29=187, 2018-11-05=38, 2018-11-12=107, 2018-11-19=0, 2018-11-26=0, genders=1, features=DenseVector([22.0, 0.0, 0.6364, 1.1364, 0.1364, 0.1364, 0.3636, 2.0, 0.6818, 46.4545, 3.3636, 0.0455, 0.4091, 0.0455, 0.0455, 1.5, 2.4091, 0.4091, 95.0, 308.0, 34.0, 291.0, 57.0, 187.0, 38.0, 107.0, 0.0, 0.0, 1.0]))]

Next let's check the number of users in each class:

In [15]:
feature_df.groupby('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0|17259|
|    1| 5002|
+-----+-----+

Looks like a decent balance after all!
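
To put "decent balance" into numbers, the counts above work out to roughly 22% churned users, so a model that always predicts "no churn" would already be about 78% accurate. A quick check in plain Python (variable names are illustrative):

```python
# User counts per label, copied from the groupby output above
non_churned = 17259
churned = 5002
total = non_churned + churned

# float() keeps the division exact under both Python 2 and Python 3
churn_rate = float(churned) / total
# Accuracy of a trivial model that always predicts "no churn"
baseline_accuracy = float(non_churned) / total

print(round(churn_rate, 3))
print(round(baseline_accuracy, 3))
```

That baseline is a useful yardstick when reading the accuracy figures in the modeling section.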

## Modeling

First we'll split the data into train and test sets, then we'll run through a few different models to see which is the most effective. My goal for this project is not to use a complicated algorithm, but rather to try out a few basic algorithms available in PySpark and see how they do. There are quite a few decision tree varieties available, so this seems like a good opportunity to compare and contrast them.

In [13]:
train, test = feature_df.randomSplit([0.8, 0.2], seed=42)

We can define this useful function to print out some evaluation metrics from the results vector.

In [14]:
def get_metrics(res):
    """Given a results dataframe, prints accuracy, precision, recall and f1-score"""
    total = res.count()
    
    tp = res.where((res.label==1) & (res.prediction==1)).count()
    tn = res.where((res.label==0) & (res.prediction==0)).count()
    
    fp = res.where((res.label==0) & (res.prediction==1)).count()
    fn = res.where((res.label==1) & (res.prediction==0)).count()
        
    accuracy = (1.0*tp + tn) / total
    precision = 1.0*tp / (tp + fp)
    recall = 1.0*tp / (tp + fn)
    f1 = 2.0 * (precision * recall) / (precision + recall)
    
    print('Accuracy: ', round(accuracy, 2))
    print('Precision: ', round(precision, 2))
    print('Recall: ', round(recall, 2))
    print('F1-Score: ', round(f1, 2))

I want to start with a logistic regression just to get a feel for some code, but then I'm going to move on to decision trees. Luckily, decision trees don't require feature scaling so it should be ok in the current state of the feature vector. First we will create the model.

In [14]:
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

Then, train on the training dataset.

In [15]:
lr_model = lr.fit(train)

Finally, generate the results vector from the test set.

In [18]:
results_lr = lr_model.transform(test)

In [19]:
get_metrics(results_lr)

('Accuracy: ', 0.85)
('Precision: ', 0.78)
('Recall: ', 0.46)
('F1-Score: ', 0.58)
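
Since recall is well below precision here, it helps to see how the F1-score combines the two. The sketch below is a standalone helper (hypothetical name `f1_score`, not part of the notebook) that recomputes F1 from the rounded precision and recall printed above. Because the inputs are rounded, the same check on other models can differ from the notebook's value in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded precision and recall of the logistic regression model
lr_f1 = f1_score(0.78, 0.46)
print(round(lr_f1, 2))
```

The harmonic mean is pulled toward the smaller of the two inputs, which is why a recall of 0.46 drags the score down to 0.58 despite a precision of 0.78.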

Looks like the logistic regression model actually performed fairly well! Not great, but not terrible. Next we can move on to a basic decision tree.

In [25]:
dt = DecisionTreeClassifier()
dt_model = dt.fit(train)
results_dt = dt_model.transform(test)
get_metrics(results_dt)

('Accuracy: ', 0.9)
('Precision: ', 0.8)
('Recall: ', 0.71)
('F1-Score: ', 0.76)

Well, the decision tree actually performed quite well! An F1-score of 0.76 is respectable, and 0.9 accuracy is great. Next let's see how a random forest does.

In [16]:
rf = RandomForestClassifier()

In [27]:
rf_model = rf.fit(train)
results_rf = rf_model.transform(test)
get_metrics(results_rf)

('Accuracy: ', 0.89)
('Precision: ', 0.93)
('Recall: ', 0.57)
('F1-Score: ', 0.71)

Hmmmm, it's weird that the random forest wouldn't perform as well as a basic decision tree. There does seem to be potential for improvement over a plain decision tree, though, because the precision is much higher here. Maybe some parameter tuning could help, although I definitely found this surprising. The result was similar for gradient boosted trees, which I decided to cut from the project in favor of random forests. Model training and testing also take quite a long time in PySpark, even with a relatively small dataset. Both of these were complications that showed up during this section.

So even though random forests had a lower F1-score right off the bat, I will try tuning them to see if we can get higher performance than the plain decision tree.

## Refinement

It's really advantageous to use the `CrossValidator` here because it means we can try out a variety of parameters and automatically select the best ones using a validation set. First we will check out random forests and see if we can improve the performance to above what we had with the base decision tree. For parameters I chose the max depth of the tree, the min number of samples per node, and the number of trees:

In [17]:
rf_paramGrid = ParamGridBuilder() \
                .addGrid(rf.maxDepth, [5, 7]) \
                .addGrid(rf.minInstancesPerNode, [1, 3]) \
                .addGrid(rf.numTrees, [20, 40]) \
                .build()
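
Grid search cost grows multiplicatively, which matters given how slow training is on this cluster. Here is a rough count in plain Python (a sketch that mirrors the grid rather than calling Spark) of how many parameter combinations and model fits this setup implies, assuming 3-fold cross validation and one final refit of the best combination on the full training set:

```python
from itertools import product

# Same candidate values as rf_paramGrid above
grid = {
    "maxDepth": [5, 7],
    "minInstancesPerNode": [1, 3],
    "numTrees": [20, 40],
}

# Every combination of one value per parameter
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3  # assumed fold count for the cross validation
# One fit per combination per fold, plus a final refit of the best combination
total_fits = len(combos) * num_folds + 1

print(len(combos), total_fits)
```

Eight combinations and 25 random forest fits in total, so adding even one more value per parameter would noticeably lengthen an already slow run.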

Next we have to build the cross validator. By default, `BinaryClassificationEvaluator` uses area under ROC as its optimization metric, which is pretty good in this case.

In [18]:
rf_crossval = CrossValidator(estimator = rf,
                             estimatorParamMaps=rf_paramGrid,
                             evaluator = BinaryClassificationEvaluator(),
                             numFolds=3)

Next up, train the model and predict just as before.

In [19]:
rf_cross_model = rf_crossval.fit(train)

In [20]:
results_rf_cross = rf_cross_model.transform(test)

An added bonus of using `CrossValidator` is that we can peek at the average metrics of each sub-model to see how the overall performance varies. Looks like we do pretty well, with all models between 0.91 and 0.93 area under ROC.

In [21]:
rf_cross_model.avgMetrics

[0.9169346891040968, 0.9173023225091668, 0.9254453792776469, 0.9257542682113922, 0.9173122897441244, 0.9169262158854977, 0.9253361009197547, 0.9255381851780136]
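
A short plain-Python sketch for reading this list: it finds the position of the best score and the gap between the best and worst. The position indexes into the parameter grid (`rf_paramGrid` in this notebook), so pairing the two lists with `zip(rf_paramGrid, rf_cross_model.avgMetrics)` should show which settings produced which score.

```python
# Average area under ROC per parameter combination, copied from avgMetrics
avg_metrics = [
    0.9169346891040968, 0.9173023225091668,
    0.9254453792776469, 0.9257542682113922,
    0.9173122897441244, 0.9169262158854977,
    0.9253361009197547, 0.9255381851780136,
]

# Position of the highest average score
best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
# Gap between the best and the worst combination
spread = max(avg_metrics) - min(avg_metrics)

print(best_index, round(avg_metrics[best_index], 4), round(spread, 4))
```

The best and worst combinations differ by less than 0.01 area under ROC, so none of the tuned parameters changes the picture dramatically.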

Now we can check out the final metrics.

In [22]:
get_metrics(results_rf_cross)

('Accuracy: ', 0.91)
('Precision: ', 0.91)
('Recall: ', 0.69)
('F1-Score: ', 0.79)

Wow it seems like it does perform better than the plain old decision tree, that's great! I wonder what the parameter set was:

In [23]:
print('Num Trees', rf_cross_model.bestModel.getNumTrees)
print('Max Depth', rf_cross_model.bestModel._java_obj.getMaxDepth())
print('Min Samples Node', rf_cross_model.bestModel._java_obj.getMinInstancesPerNode())

('Num Trees', 20)
('Max Depth', 7)
('Min Samples Node', 3)

Interesting that it's only 20 trees, I expected 40 to have better performance. It seems like growing deeper trees (depth 7) with lower granularity (at least 3 samples per node instead of 1) gives the best performance with random forests. We can also look at the most important features.

In [30]:
rf_importances = list(rf_cross_model.bestModel.featureImportances)

In [28]:
# The assembler put session_count first, so list the names in that same order
cols = ['session_count'] + feature_df.columns[1:4] + feature_df.columns[5:19] + feature_df.columns[20:-1]

In [33]:
from pprint import pprint

In [34]:
pprint(sorted(zip(cols, rf_importances), key=lambda x: x[1], reverse=True))

[('2018-11-26', 0.24872878836907866),
 ('2018-11-19', 0.17520050382643623),
 ('2018-10-01', 0.1438442430274558),
 ('2018-11-12', 0.11298791939802035),
 ('2018-11-05', 0.08173734301357599),
 ('session_count', 0.05818187911482335),
 ('2018-10-29', 0.03639303510577493),
 ('2018-10-08', 0.03188444021589177),
 ('Roll Advert', 0.01682809675748398),
 ('Downgrade', 0.016804971016558194),
 ('Thumbs Down', 0.01662328184502604),
 ('2018-10-22', 0.016306980101822275),
 ('2018-10-15', 0.008291761522428883),
 ('Submit Upgrade', 0.005079927902946381),
 ('Thumbs Up', 0.004928414662012405),
 ('NextSong', 0.003517665888227861),
 ('Settings', 0.0035063281160905573),
 ('Logout', 0.0027751695854539874),
 ('Home', 0.002392298084843223),
 ('Add Friend', 0.00235308900223043),
 ('Upgrade', 0.002241110387822691),
 ('account_age', 0.0017137474992622945),
 ('Add to Playlist', 0.001537621796301837),
 ('About', 0.0015071196473681932),
 ('Help', 0.0014890784435418571),
 ('Save Settings', 0.0013405297050881988),
 

Well, it seems like the song counts per week are contributing way more than any other feature, with some events per session also contributing a fair amount. Account age was less predictive than I had hoped, and gender doesn't seem to contribute at all, which makes intuitive sense.
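
To quantify "way more", the weekly song count importances printed above can be summed. This is plain arithmetic on the listed values (the dictionary name is illustrative):

```python
# Importances of the nine weekly song count features, copied from the output
week_importances = {
    "2018-10-01": 0.1438442430274558,
    "2018-10-08": 0.03188444021589177,
    "2018-10-15": 0.008291761522428883,
    "2018-10-22": 0.016306980101822275,
    "2018-10-29": 0.03639303510577493,
    "2018-11-05": 0.08173734301357599,
    "2018-11-12": 0.11298791939802035,
    "2018-11-19": 0.17520050382643623,
    "2018-11-26": 0.24872878836907866,
}

# Share of the total importance carried by the weekly counts
week_share = sum(week_importances.values())
print(round(week_share, 3))
```

Together the nine weekly counts carry a bit over 85% of the total importance, with the last two weeks alone accounting for more than 40%.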

All in all, the tuned random forest is the best model that we have, with an F1-score of 0.79.

The refinement section with random forests gave me a ton of trouble: it would routinely crash my EMR notebook. This was by far the hardest part of the entire project for me and probably ate up 6-8 hours of my time. In the end, the problem was that the 'Spark Job Progress' box would take too much memory and Chrome would kill the tab. If you want to run the cells, be sure to **CLEAR THE CELL OUTPUT WHILE IT IS RUNNING**.

Other than that, the hardest part of the project was probably feature engineering using spark. Everything takes quite a bit longer than with local datasets! Be prepared to wait.

## Conclusion

To sum it up, we found and engineered a useful set of features, tried a few different models, and ended up with a fairly well performing model with an F1-score of 0.79. A few alternatives were tried, even though we weren't going for the highest performance so much as a proof of concept. The random forest ended up being the highest performing model, which isn't too surprising compared to decision trees and logistic regression.

It's also important to consider the robustness of the model. We gain some insight into this from using a 3-fold cross validator and looking at the average scores of the different models. We saw earlier that even with varied parameters each model had around the same score, which indicates that the model is pretty robust to deviations in the training data.

Also, the type of data that is entered into this model is unlikely to have large scale changes unless users actually change their behavior. Events data lends itself to features that are abstracted from the density and sequence of events, and small changes in the number of events are unlikely to cause big swings in any particular feature. We even saw that the most impactful features for distinguishing churned from non-churned users were the numbers of songs they listened to in a given week.

This method has proven to be pretty effective, and can clearly distinguish between churned and non-churned users when provided the entire timeframe of data. This could be useful in a production service to provide some sort of promotion or treatment to users that are identified as likely to churn.

I think both the hardest part and the most interesting part of this project was identifying and extracting features from the events data. It's much less intuitive than many projects we had in the class, so it provided a nice challenge. I would say my main goal for this project was to learn the basics of PySpark, and I think I accomplished that nicely. I have a few other interesting ideas that I didn't have time to investigate in this project.

## Improvements

In this project I found a lot of interesting similarities to the recommendation engines section of the nanodegree, even though it wasn't implemented in the same way. A really interesting improvement to this project could be to approach the problem as a time range problem rather than a random split problem: training on a few weeks of data and testing to see if you can predict the users who churn in the following few weeks. This would be a much more realistic match for how we'd like to use the model in the real world. It does present a bunch of additional challenges and might require some custom algorithms/evaluators, which makes it a longer term project.
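
A minimal sketch of what such a time range split could look like, in plain Python on made-up events rather than on the Spark dataframe. Everything here is hypothetical: the toy events, the `time_split` helper, the cutoff date, and the choice to use song counts as the only feature. Features come only from events before the cutoff, labels only from cancellations on or after it, and users who cancelled before the cutoff are dropped.

```python
from datetime import date

# Hypothetical toy events: (userId, page, event date)
events = [
    ("a", "NextSong", date(2018, 10, 3)),
    ("a", "NextSong", date(2018, 10, 20)),
    ("a", "Cancellation Confirmation", date(2018, 11, 10)),
    ("b", "NextSong", date(2018, 10, 5)),
    ("b", "NextSong", date(2018, 11, 15)),
    ("c", "NextSong", date(2018, 10, 2)),
    ("c", "Cancellation Confirmation", date(2018, 10, 9)),
]

CHURN_PAGE = "Cancellation Confirmation"


def time_split(events, cutoff):
    """Build per-user song counts before `cutoff` and churn labels after it."""
    # Users who already churned before the cutoff have nothing left to predict
    gone = {u for u, page, day in events if page == CHURN_PAGE and day < cutoff}

    features, labels = {}, {}
    for user, page, day in events:
        if user in gone:
            continue
        features.setdefault(user, 0)
        labels.setdefault(user, 0)
        if day < cutoff and page == "NextSong":
            features[user] += 1  # observation window
        elif day >= cutoff and page == CHURN_PAGE:
            labels[user] = 1  # prediction window
    return features, labels


features, labels = time_split(events, cutoff=date(2018, 11, 1))
print(features, labels)
```

Here user `a` gets two songs and a churn label, user `b` gets one song and no churn, and user `c` is excluded. Keeping the two windows disjoint is the point of the exercise: the model never sees activity from the period it is asked to predict.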

Another obvious extension would be to look for instances of `Submit Downgrade` instead of the cancellation events. This would track users who are going to downgrade from paid to free. It would require a different set of features, likely more difficult ones, since different user periods would have to be separated by level. It would also benefit from the time range approach mentioned above, and combining the two would be very interesting.

That about wraps up my project, thanks for reading!