<hr style="border:2px solid black"></hr>

# Initialization

Imports:

In [2]:
from toolbox.config import Config
from toolbox.imports import *

Create a Spark Session:

In [3]:
spark = t.spark.create_session('Music_Activity_Jupyter')

Creating a Spark session.
	Execution time: 68.34380 s.


Get the paths of data files and folders:

In [4]:
data_root, data_subfolders, data_files = t.get_data_paths()

Getting the data paths.
	Number of folders: 62
	Number of files: 275621


Load two sample dataframes, one with 1e5 records and one with 1e6 records:

In [5]:
path_df_1E5 = Config.Path.project_data_root / 'df_sample_1E5.parquet'
path_df_1E6 = Config.Path.project_data_root / 'df_sample_1E6.parquet'

df_1E5 = t.spark.load_dataframe_from_parquet(path_df_1E5, spark)
df_1E6 = t.spark.load_dataframe_from_parquet(path_df_1E6, spark)

Loading dataframe from "/data/work/shared/s001284/Music_Project/resources/data/df_sample_1E5.parquet".
	Execution time: 2.81771 s.
Loading dataframe from "/data/work/shared/s001284/Music_Project/resources/data/df_sample_1E6.parquet".
	Execution time: 0.81694 s.


Cache the dataframes (the calls are currently commented out):

In [6]:
# df_1E5.cache()
# df_1E6.cache()

# number_of_rows_df_1E5 = t.spark.count_rows(df_1E5)
# number_of_rows_df_1E6 = t.spark.count_rows(df_1E6)

<hr style="border:2px solid black"></hr>

# Removing irrelevant columns

Take the first 100 rows of the smaller dataframe and convert them to Pandas for easier viewing:

In [7]:
df_sample_pd = df_1E5.limit(100).toPandas()

Get the descriptions of the columns from the writer schema in the Avro files:

In [8]:
column_docs = t.get_avro_docs(data_files[0])
for column in column_docs:
    if len(column[1]) > 0:
        w.printmd(f'**{column[0]}**: *{column[1]}*')
    else:
        w.printmd(f'**{column[0]}**: *-*')            

**id**: *Activity UUID*

**deleted_time**: *If this activity has been deleted, this is the timestamp when it occurred*

**useruuid**: *User UUID*

**start_time**: *Start time*

**end_time**: *End time*

**devices**: *Devices used during this activity*

**devices_name**: *Name of the device*

**devices_type**: *Device type (PHONE, SMARTBAND, SMARTCAMERA, UNKNOWN, ...)*

**devices_id**: *Unique identifier of the device*

**tracks**: *-*

**tracks_start_time**: *-*

**tracks_end_time**: *-*

**tracks_artist**: *-*

**tracks_album**: *-*

**tracks_title**: *-*

**tracks_uri**: *-*

**tracks_player**: *-*

**tracks_id**: *-*

Show the sample rows (the display below is truncated in the middle):

In [9]:
df_sample_pd.head(100)

Unnamed: 0,id,deleted_time,useruuid,start_time,end_time,devices_name,devices_type,devices_id,tracks_start_time,tracks_end_time,tracks_artist,tracks_album,tracks_title,tracks_uri,tracks_player,tracks_id,yearmonth
0,98a18948-40f8-4aa0-8f88-b1b0cef08d0f-2015-08-17,2015-08-18T04:10:52.168Z,c43b75c6-6390-4331-a997-5b8008d50e77,2015-08-17T21:48:05.419+02:00,2015-08-18T05:05:38.077+02:00,D6603,PHONE,fa07eea4645b4be3,2015-08-18T03:53:55.138+02:00,2015-08-18T03:58:29.741+02:00,Anima Mia,ClearMusicDownloader,Anima Mia-I Cugini Di Campagna-Balla Italia - ...,content://media/external/audio/media/7373,Walkman,,201508
1,61cc3be9-45bb-4081-afbe-9bcc97e029e8-2015-08-17,2015-08-18T04:11:01.192Z,a9043d17-0ef9-4ea4-84b6-5be303ceca98,2015-08-18T01:27:20.543+08:00,2015-08-18T11:09:06.673+08:00,D5833,PHONE,5b49f4072ac55033,2015-08-18T04:11:06.517+08:00,2015-08-18T04:16:08.392+08:00,<unknown>,Android,���������[hd]������������������ - ������������...,content://media/external/audio/media/10437,Walkman,,201508
2,63c184bb-45e8-479c-991a-b2f6d83e9023-2015-10-30,,408a1a32-32cc-47a8-8e99-4a9b9b679007,2015-10-30T13:55:39.218-06:00,2015-10-30T13:58:36.668-06:00,D5803,PHONE,c3a31b5e9ba8244,2015-10-30T13:55:39.218-06:00,2015-10-30T13:58:36.668-06:00,<unknown>,Unknown album,Drake - Energy,content://media/external/audio/media/16147,Walkman,,201510
3,52c04991-7a7c-48f4-965f-0727596fc6f1-2016-11-17,,40e1310c-81a9-4589-b2bb-6562c35282e8,2016-11-17T20:14:03.535+09:00,2016-11-17T21:08:41.226+09:00,SO-03H,PHONE,e0a17178d44f7077,2016-11-17T21:04:56.520+09:00,2016-11-17T21:08:41.226+09:00,ももいろクローバーＺ,白金の夜明け MOMOIRO CLOVER Z DOME TRECK 2016 DAY2 L...,MOON PRIDE,content://media/external/audio/media/52759,Walkman,,201611
4,c806f877-0628-4848-8a3b-33e0741bda09-2016-11-01,,f81f5012-56f2-4fbd-ae97-242e3ebdb88d,2016-11-01T18:05:56.191+04:00,2016-11-01T18:06:50.327+04:00,D6503,PHONE,23b839a2bafb76a4,2016-11-01T18:05:56.191+04:00,2016-11-01T18:06:50.327+04:00,Rahat Fateh Ali Khan,Sultan (2016),Jag Ghoomeya - Songspk.LIVE,content://media/external/audio/media/151650,Walkman,,201611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,a2c785b4-45e5-4210-8764-597319616dfd-2016-09-06,,e59f7b77-567a-4c1e-8c77-1a2c4153d877,2016-09-06T23:38:55.033+02:00,2016-09-07T02:03:35.104+02:00,D6603,PHONE,d6d3405db458a079,2016-09-07T00:17:00.880+02:00,2016-09-07T00:19:54.866+02:00,Drake,Views,One Dance,,Spotify,spotify:track:1xznGGDReH1oQq0xzbwXa3,201609
96,ef585d05-b5e0-485f-beb7-7b403d0dc374-2016-11-22,,6ebeccd0-1aa0-4f49-ae11-102bd7af021e,2016-11-22T10:25:23.520+01:00,2016-11-22T13:33:41.925+01:00,C6903,PHONE,1ec0aa2e17d5312a,2016-11-22T10:29:46.876+01:00,2016-11-22T10:30:17.414+01:00,Prison Break,Zik,Opening Theme,content://media/external/audio/media/11301,Walkman,,201611
97,ac6a2cf7-b5b3-4c53-8b18-6a06f2e6ceed-2016-10-08,,ab0fd133-7b51-4718-84ab-c7507f3948f9,2016-10-08T14:23:24.458+02:00,2016-10-08T17:38:50.116+02:00,D6603,PHONE,7bf20912c13a1fc8,2016-10-08T15:52:06.826+02:00,2016-10-08T15:58:33.042+02:00,<unknown>,Download,Athom's & Nadège : Adonai,content://media/external/audio/media/51054,Walkman,,201610
98,e144173b-626d-4e9f-9e12-4b5d4f55be5f-2016-10-09,,4dd99265-f840-4ef8-aa14-88921bdb4184,2016-10-09T14:10:47.266+02:00,2016-10-09T14:12:10.722+02:00,E5823,PHONE,7d1c651f413b4eb6,2016-10-09T14:10:47.266+02:00,2016-10-09T14:12:10.722+02:00,State Of Shock,"Life, Love & Lies",Best I Ever Had,,Spotify,spotify:track:4iRqXAa6ICmxC67JWR5z0e,201610


### Activity Deleted

In [23]:
(df_1E6
 .where(
     (f.col('deleted_time').isNotNull())
     & (f.col('deleted_time') != '')
 )
 .count()
)

121841

Around 12% of the activities (121,841 of the rows in `df_1E6`) have the `deleted_time` property set. Options for handling them:
 * Nothing
 * Delete the column
 * Delete the rows where the property is set
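
As a rough illustration of the last two options, here is a minimal plain-Python sketch on three of the sample rows shown earlier. The helper names `is_blank`, `drop_deleted_column`, and `drop_deleted_rows` are hypothetical and not part of `toolbox`, and representing blank values as both `''` and `None` is an assumption that mirrors the two cases checked in the count above. On the Spark dataframes, the same ideas would presumably map to `DataFrame.drop('deleted_time')` and to a `where` filter that inverts that condition.

```python
def is_blank(value):
    """Return True when a string field is missing (None) or empty."""
    return value is None or value == ''


def drop_deleted_column(rows):
    """Option 2: keep every row but remove the 'deleted_time' field."""
    return [
        {key: value for key, value in row.items() if key != 'deleted_time'}
        for row in rows
    ]


def drop_deleted_rows(rows):
    """Option 3: keep only rows whose 'deleted_time' is not set."""
    return [row for row in rows if is_blank(row.get('deleted_time'))]


# Three records from the sample table, reduced to two fields for brevity.
# Assumption: blank values may appear either as '' or as None.
sample_rows = [
    {'id': '98a18948-40f8-4aa0-8f88-b1b0cef08d0f-2015-08-17',
     'deleted_time': '2015-08-18T04:10:52.168Z'},
    {'id': '63c184bb-45e8-479c-991a-b2f6d83e9023-2015-10-30',
     'deleted_time': ''},
    {'id': '52c04991-7a7c-48f4-965f-0727596fc6f1-2016-11-17',
     'deleted_time': None},
]

without_column = drop_deleted_column(sample_rows)
without_deleted = drop_deleted_rows(sample_rows)

print(len(without_column), 'rows remain after dropping the column')
print(len(without_deleted), 'rows remain after dropping deleted activities')
```

Dropping the column keeps all rows but loses the deletion information, while dropping the rows shrinks the dataset by the share reported above.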

### Device types

In [10]:
(df_1E6
 .groupBy(f.col('devices_type'))
 .count()
 .orderBy(f.desc('count'))                    
 .show()
)

+------------+------+
|devices_type| count|
+------------+------+
|       PHONE|996704|
|        null|  2218|
|     UNKNOWN|   587|
+------------+------+



Almost all the devices are phones (996,704 of 999,509 rows, about 99.7%). It could be reasonable to remove the column.
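
The share of each device type follows directly from the counts printed above. A small sketch of the arithmetic, with the counts copied from the `show()` output (the variable names are illustrative only):

```python
# Counts copied from the groupBy output above; None stands for Spark's null.
device_type_counts = {'PHONE': 996704, None: 2218, 'UNKNOWN': 587}

# Every row falls into exactly one group, so the sum is the dataframe's row count.
total_rows = sum(device_type_counts.values())

# Fraction of all rows that each device type accounts for.
shares = {
    device_type: count / total_rows
    for device_type, count in device_type_counts.items()
}

for device_type, share in shares.items():
    print(f'{device_type}: {share:.2%}')
```

If the column is dropped, the small remainder of rows with a null or `UNKNOWN` type can no longer be told apart from phones, which may or may not matter for later steps.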

### Track players

In [11]:
(df_1E6
 .groupBy(f.col('tracks_player'))
 .count()
 .orderBy(f.desc('count'))                    
 .show()
)

+-------------+------+
|tracks_player| count|
+-------------+------+
|      Walkman|679703|
|      Spotify|198657|
|             |121149|
+-------------+------+



### Track IDs

In [12]:
(df_1E6
 .groupBy(f.col('tracks_id'))
 .count()
 .orderBy(f.desc('count'))                    
 .show()
)

+--------------------+------+
|           tracks_id| count|
+--------------------+------+
|                    |800955|
|spotify:track:3CR...|   143|
|spotify:track:1i1...|   142|
|spotify:track:1br...|   141|
|spotify:track:7BK...|   138|
|spotify:track:27S...|   122|
|spotify:track:1HN...|   116|
|spotify:track:6JV...|   114|
|spotify:track:0r4...|   107|
|spotify:track:2CP...|   106|
|spotify:track:7vf...|    98|
|spotify:track:0az...|    98|
|spotify:track:5kn...|    97|
|spotify:track:34g...|    93|
|spotify:track:4pd...|    93|
|spotify:track:23L...|    92|
|spotify:track:6i0...|    92|
|spotify:track:2Z8...|    91|
|spotify:track:0gb...|    89|
|spotify:track:3hB...|    84|
+--------------------+------+
only showing top 20 rows



### Track ID vs Track URI

In [21]:
(df_1E6
 .where(
     (f.col('tracks_uri') != '')
     & (f.col('tracks_uri').isNotNull())
     & (f.col('tracks_id') != '')
     & (f.col('tracks_id').isNotNull())
 )
 .count()
)

0

The Track ID and Track URI are never both populated in the same row. Therefore, the two columns could be merged into one.
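
One possible way to merge the two columns is to keep whichever value is populated. The sketch below shows the row-level rule in plain Python on two rows from the sample table. `is_populated` and `merge_track_reference` are hypothetical names, and raising an error when both values are set is an assumption based on the zero count above. On the Spark side, an `f.when(...).otherwise(...)` expression over the two columns would presumably express the same rule.

```python
def is_populated(value):
    """Return True when a string field is neither None nor empty."""
    return value is not None and value != ''


def merge_track_reference(tracks_id, tracks_uri):
    """Return the single populated value of the two, or None if both are blank."""
    if is_populated(tracks_id) and is_populated(tracks_uri):
        # The count above found no such rows, so treat this as a data error.
        raise ValueError('both tracks_id and tracks_uri are set')
    if is_populated(tracks_id):
        return tracks_id
    if is_populated(tracks_uri):
        return tracks_uri
    return None


# A Walkman row (URI only) and a Spotify row (ID only) from the sample table.
walkman_ref = merge_track_reference('', 'content://media/external/audio/media/7373')
spotify_ref = merge_track_reference('spotify:track:1xznGGDReH1oQq0xzbwXa3', '')

print(walkman_ref)
print(spotify_ref)
```

Rows where both fields are blank map to `None`, so the merged column would still need a decision on how to treat tracks with no reference at all.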

Also, the information might be redundant. 

We could also split the dataset into several tables, but it is not clear that this would make sense.
 

In [17]:
print(df_1E6.dtypes)

[('id', 'string'), ('deleted_time', 'string'), ('useruuid', 'string'), ('start_time', 'string'), ('end_time', 'string'), ('devices_name', 'string'), ('devices_type', 'string'), ('devices_id', 'string'), ('tracks_start_time', 'string'), ('tracks_end_time', 'string'), ('tracks_artist', 'string'), ('tracks_album', 'string'), ('tracks_title', 'string'), ('tracks_uri', 'string'), ('tracks_player', 'string'), ('tracks_id', 'string'), ('yearmonth', 'int')]
