# Users analysis

```python
playlog = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv")
playlog.printSchema()
```

1. Compute a new column `datetime` by converting the `timestamp` column to a datetime, drop the `timestamp` column, and order by `datetime`. Save the result as a new DataFrame `df` and show its first 5 rows.

> TIP: use the function `from_unixtime(...)` from `pyspark.sql.functions`; it converts integer Unix timestamps (seconds since the epoch) into datetime strings.
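
To make the conversion concrete, here is a minimal plain-Python sketch of what turning a Unix timestamp into a datetime means. The value of `ts` is hypothetical and the sketch interprets it in UTC, whereas Spark's `from_unixtime` uses the session time zone, so results on the real data may be shifted.

```python
from datetime import datetime, timezone

# Hypothetical Unix timestamp: seconds elapsed since 1970-01-01 00:00:00 UTC.
ts = 1392387533

# Interpret the integer as a point in time, here explicitly in UTC.
dt = datetime.fromtimestamp(ts, tz=timezone.utc)

# Render it in the same "yyyy-MM-dd HH:mm:ss" layout used in the notebook output.
formatted = dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```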

| | user | song | datetime | year | month | dayofmonth | dayofyear | weekofyear |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | nRa-eGzpT6o | 1965-07-26 03:21:43 | 1965 | 7 | 26 | 207 | 30 |
| 1 | 0 | t1l8Z6gLPzo | 2014-02-14 14:18:53 | 2014 | 2 | 14 | 45 | 7 |
| 2 | 22 | Q24VZL8wpOM | 2014-02-14 14:18:57 | 2014 | 2 | 14 | 45 | 7 |
| 3 | 70 | VJ6ofd0pB_c | 2014-02-14 14:18:57 | 2014 | 2 | 14 | 45 | 7 |
| 4 | 1 | t1l8Z6gLPzo | 2014-02-14 14:18:58 | 2014 | 2 | 14 | 45 | 7 |


Now that we have a datetime column, we can compute new columns, namely:
- [year](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.year.html#pyspark.sql.functions.year)
- [month](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.month.html#pyspark.sql.functions.month)
- [dayofmonth](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofmonth.html#pyspark.sql.functions.dayofmonth)
- [dayofweek](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofweek.html#pyspark.sql.functions.dayofweek)
- [dayofyear](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.dayofyear.html#pyspark.sql.functions.dayofyear)
- [weekofyear](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.weekofyear.html#pyspark.sql.functions.weekofyear)

We will put the resulting DataFrame in a variable called `df_enriched`.

2. Follow the instructions above: add these columns to `df` and save the result as `df_enriched`.

*Tip: you can use the `reduce` function from the `functools` module to produce all the columns automatically; otherwise, you can create them manually one by one.*
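
The `reduce` idea can be sketched in plain Python, with a dict standing in for a DataFrame row. The `extractors` mapping and the `add_column` helper are hypothetical names chosen for illustration; in PySpark the folding step would presumably call something like `acc.withColumn(name, fn("datetime"))` instead. `dayofweek` is left out because Spark and Python number weekdays differently.

```python
from datetime import datetime
from functools import reduce

# Hypothetical mapping: new column name -> function deriving it from a datetime.
extractors = {
    "year": lambda d: d.year,
    "month": lambda d: d.month,
    "dayofmonth": lambda d: d.day,
    "dayofyear": lambda d: d.timetuple().tm_yday,
    "weekofyear": lambda d: d.isocalendar()[1],  # ISO 8601 week number
}

def add_column(row, item):
    """Return a copy of `row` with one extra derived column."""
    name, fn = item
    return {**row, name: fn(row["datetime"])}

# One row taken from the sample output above.
row = {"user": 0, "song": "t1l8Z6gLPzo", "datetime": datetime(2014, 2, 14, 14, 18, 53)}

# Fold over the extractors: each step adds one column to the accumulated row.
enriched = reduce(add_column, extractors.items(), row)
print(enriched)
```

The accumulator starts as the original row and gains one key per step, which is the same shape as chaining one `withColumn` call per function.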

### Aggregates

#### `firstPlay`, `lastPlay`, `playCount`, `uniquePlayCount`
For each user, we will compute these metrics:
- `firstPlay`: datetime of the user's first play
- `lastPlay`: datetime of the user's last play
- `playCount`: total number of plays
- `uniquePlayCount`: number of distinct songs played

We'll save all these in a new DataFrame: `users`.  
When you're done, print out the first 5 rows of `users` ordered by descending `playCount`.
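
As a rough illustration of what these four metrics mean, here is a plain-Python sketch on a toy play log. The rows are made up, and reading `uniquePlayCount` as the number of distinct songs is an assumption. In PySpark the equivalent would likely be a `groupBy("user")` followed by `agg` with `min`, `max`, `count` and `countDistinct`.

```python
from datetime import datetime

# Toy play log (hypothetical rows): (user, song, datetime).
plays = [
    (1, "a", datetime(2014, 2, 14, 14, 0, 0)),
    (1, "a", datetime(2014, 2, 15, 9, 30, 0)),
    (1, "b", datetime(2014, 3, 1, 8, 0, 0)),
    (2, "c", datetime(2014, 2, 20, 12, 0, 0)),
]

users = {}
for user, song, played_at in plays:
    stats = users.setdefault(
        user,
        {"firstPlay": played_at, "lastPlay": played_at, "playCount": 0, "songs": set()},
    )
    stats["firstPlay"] = min(stats["firstPlay"], played_at)  # earliest play
    stats["lastPlay"] = max(stats["lastPlay"], played_at)    # latest play
    stats["playCount"] += 1                                  # every play counts
    stats["songs"].add(song)                                 # distinct songs only

# Assumption: uniquePlayCount is the number of distinct songs per user.
for stats in users.values():
    stats["uniquePlayCount"] = len(stats.pop("songs"))

# Users ordered by descending playCount.
ranking = sorted(users, key=lambda u: users[u]["playCount"], reverse=True)
print(ranking, users)
```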

3. Compute, for each user:
- `firstPlay`
- `lastPlay`
- `playCount`
- `uniquePlayCount`

Save the results in a DataFrame named `users`.

| | user | firstPlay | lastPlay | playCount | uniquePlayCount |
|---|---|---|---|---|---|
| 0 | 213 | 2014-02-14 15:34:17 | 2019-04-02 06:04:08 | 278749 | 161406 |
| 1 | 7290 | 2014-04-30 20:12:41 | 2019-04-03 06:50:05 | 151513 | 83831 |
| 2 | 435 | 2014-02-14 19:51:09 | 2019-04-03 19:36:28 | 144711 | 20055 |
| 3 | 21950 | 2014-10-23 09:09:36 | 2019-02-06 00:54:54 | 126285 | 15075 |
| 4 | 6270 | 2014-04-13 18:45:54 | 2018-08-11 20:46:08 | 125056 | 9247 |


4. Run a sanity check that no `firstPlay` is later than the corresponding `lastPlay`.

5. Another sanity check: we grouped on the `user` column, so each user should appear in exactly one row. Make sure all users are unique in the DataFrame.
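
Both checks boil down to counting violations and expecting zero. A minimal plain-Python sketch on hypothetical rows follows; on the real `users` DataFrame the same idea would be expressed with a filter and a count.

```python
from datetime import datetime

# Hypothetical per-user rows, shaped like the `users` DataFrame.
users = [
    {"user": 1, "firstPlay": datetime(2014, 2, 14, 14, 0), "lastPlay": datetime(2014, 3, 1, 8, 0)},
    {"user": 2, "firstPlay": datetime(2014, 2, 20, 12, 0), "lastPlay": datetime(2014, 2, 20, 12, 0)},
]

# Check 1: rows where firstPlay comes after lastPlay (should be empty).
# A user with a single play has firstPlay == lastPlay, which is valid.
bad_order = [u for u in users if u["firstPlay"] > u["lastPlay"]]

# Check 2: number of rows beyond the first for any repeated user id (should be 0).
ids = [u["user"] for u in users]
duplicate_count = len(ids) - len(set(ids))

print(len(bad_order), duplicate_count)
```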

### `timespan`
We will compute `timespan`: the overall span of a user's activity in days, rounded down. For example:
- if a user was active on the service for 23 hours, we will say they were active for 0 days
- for 53 hours, that would be 2 days of activity
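
The rounding rule can be checked with a short plain-Python sketch. `timespan_days` is a hypothetical helper, not part of the notebook. One thing to watch for in Spark: `datediff` compares calendar dates rather than elapsed 24-hour periods, so it may not match this definition for sessions that cross midnight.

```python
from datetime import datetime, timedelta

def timespan_days(first_play, last_play):
    """Number of whole 24-hour periods between two datetimes, rounded down."""
    return (last_play - first_play) // timedelta(days=1)

start = datetime(2014, 2, 14, 14, 0, 0)

# The two examples from the text: 23 hours and 53 hours of activity.
span_23h = timespan_days(start, start + timedelta(hours=23))
span_53h = timespan_days(start, start + timedelta(hours=53))
print(span_23h, span_53h)
```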

We **will not** transform the `users` DataFrame in place, but instead save the result as a new DataFrame: `users_with_timespan`.

6. Compute `timespan` and save the result as a new DataFrame: `users_with_timespan`

Let's see what this looks like. We will use Databricks' `display` to plot a histogram of `timespan`.

7. Plot a histogram of `timespan`

This looks like a power law, so let's try a log transform.

8. Use `describe` on the `timespan` column

9. Plot a histogram of the log-transformed `timespan`
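
Since `timespan` can be 0 and the logarithm of 0 is undefined, some care is needed with the transform. The sketch below uses `log1p`, that is log(1 + x), which is an assumption on my part rather than the notebook's stated choice; filtering out zeros first is another option. The sample values and the bin width are hypothetical.

```python
import math
from collections import Counter

# Hypothetical timespans in days; many users have a timespan of 0.
timespans = [0, 0, 0, 1, 2, 7, 30, 365, 1800]

# log1p(x) = log(1 + x) is defined at 0, unlike a plain log.
log_timespans = [math.log1p(t) for t in timespans]

# Crude histogram: count how many values fall in each bin of width 1.
bin_width = 1.0
histogram = Counter(int(v // bin_width) for v in log_timespans)

for b in sorted(histogram):
    print(f"[{b * bin_width:.0f}, {(b + 1) * bin_width:.0f}): {'#' * histogram[b]}")
```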

10. Plot a QQ-plot of the log-transformed `timespan`
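
A QQ-plot pairs each sorted sample value with the quantile a reference distribution would have at the same rank. Here is a minimal sketch that computes those pairs against a fitted normal distribution, using only the standard library. The sample values, the `log1p` transform and the `(i + 0.5) / n` plotting positions are all assumptions; other conventions exist.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical log-transformed timespans.
log_timespans = [math.log1p(t) for t in [0, 1, 2, 7, 30, 365, 1800]]

sample = sorted(log_timespans)
n = len(sample)

# Normal distribution with the sample's own mean and standard deviation.
fitted = NormalDist(mean(sample), stdev(sample))

# Theoretical quantile for each rank, using (i + 0.5) / n plotting positions.
theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]

# Each pair is one point of the QQ-plot: (theoretical quantile, observed value).
qq_points = list(zip(theoretical, sample))
for t, s in qq_points:
    print(f"{t:8.3f} {s:8.3f}")
```

If the points fall close to the line y = x, the log-transformed data is roughly normal; systematic curvature suggests otherwise.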

We'll filter out users who stayed for less than a day and plot a histogram of the filtered data.

11. Plot a histogram of the log-transformed `timespan` of users who stayed at least one day

### `isSingleDayUser`
What percentage of users used the service for less than one day?

12. Compute the percentage of users who used the service for less than a day

Wow, that's a lot! We will flag this as its own column.  
That means we will create a new Boolean column `isSingleDayUser` that is `True` if the user used the service for less than a day and `False` otherwise.

13. Create a new column `isSingleDayUser` that flags whether a user used the service for less than a day
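
The flag and the percentage are two views of the same condition, `timespan < 1`. A small plain-Python sketch with hypothetical values:

```python
# Hypothetical mapping: user id -> timespan in whole days.
timespans = {1: 15, 2: 0, 3: 0, 4: 120}

# Flag: True when the user's activity spans less than one day.
is_single_day_user = {user: days < 1 for user, days in timespans.items()}

# Percentage of flagged users (True counts as 1 when summed).
percentage = 100 * sum(is_single_day_user.values()) / len(is_single_day_user)
print(f"{percentage:.1f}% of users stayed less than a day")
```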

### Measure of activity: `activeDaysCount` and `dailyAvgPlayCount`
This one is a bit harder. We want to compute:
- the number of active days for each user (not the `timespan`)
- the average play count on these active days for each user

14. Create 2 new columns:
- `activeDaysCount`: the number of days on which each user was active
- `dailyAvgPlayCount`: the average daily play count per user (active days only)
- `activeDay`
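
One way to think about this is as two successive aggregations: first count plays per user and calendar day, then summarize those daily counts per user. The plain-Python sketch below follows that idea on made-up rows. It is only an illustration of the logic, not the notebook's solution.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Toy play log (hypothetical rows): (user, datetime).
plays = [
    (1, datetime(2014, 2, 14, 14, 0)),
    (1, datetime(2014, 2, 14, 18, 0)),
    (1, datetime(2014, 2, 14, 23, 0)),
    (1, datetime(2014, 3, 1, 8, 0)),
    (2, datetime(2014, 2, 20, 12, 0)),
]

# Step 1: number of plays per (user, calendar day).
plays_per_day = Counter((user, played_at.date()) for user, played_at in plays)

# Step 2: collect the daily counts of each user.
daily_counts = defaultdict(list)
for (user, _day), count in plays_per_day.items():
    daily_counts[user].append(count)

# Step 3: one row per user with the two metrics.
activity = {
    user: {
        "activeDaysCount": len(counts),
        "dailyAvgPlayCount": sum(counts) / len(counts),
    }
    for user, counts in daily_counts.items()
}
print(activity)
```

Days without any play never appear in `plays_per_day`, so the average is taken over active days only.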

15. Plot a histogram of the log of `activeDaysCount`

16. Plot a histogram of the log of `dailyAvgPlayCount`

## Going further
What else do you think would be interesting to compute?
What about the ratio of activity, e.g. the ratio between `timespan` and `activeDaysCount`?