<hr style="border:2px solid black"></hr>

# Initialization

Imports:

In [1]:
from IPython.core.interactiveshell import InteractiveShell
from toolbox.config import Config
from toolbox.imports import *

Options:

In [2]:
# Display all outputs in cells.
InteractiveShell.ast_node_interactivity = 'all'

# In Pandas tables, display no more than 100 rows.
pd.set_option('display.max_rows', 100)

Create a Spark Session:

In [3]:
spark = t.spark.create_session('Music_Activity_Jupyter')

Creating a Spark session.
	Execution time: 80.46912 s.


Get the paths of data files and folders:

In [4]:
data_root, data_subfolders, data_files = t.get_data_paths()

Getting the data paths.
	Number of folders: 62
	Number of files: 275621


Load the full data:

In [5]:
df_raw = t.load_data_from_files(data_root, spark, method='avro')
df_full = t.format_dataframe(df_raw)

Loading data from path: "/data/work/src/musicactivity".
	Execution time: 8.11790 s.
Formatting dataframe:
	Exploding columns containing lists.
	Flattening the dataframe schema.
	Execution time: 0.22957


Load a sample dataframe containing 1e6 records:

In [6]:
path_df_1E6 = Config.Path.project_data_root / 'df_sample_1E6.parquet'

df_1E6 = t.load_data_from_files(path_df_1E6, spark)

Loading data from path: "/data/work/shared/s001284/Music_Project/resources/data/df_sample_1E6.parquet".
	Execution time: 1.51409 s.


Cache the dataframes (currently disabled):

In [7]:
# df_1E6.cache()
# number_of_rows_df_1E6 = t.spark.count_rows(df_1E6)

<hr style="border:2px solid black"></hr>

# Removing irrelevant columns

Take the first 100 rows of the sample dataframe and convert them to Pandas for easier viewing:

In [8]:
df_sample_pd = df_1E6.limit(100).toPandas()

Get the descriptions of the columns from the writer schema in the Avro files:

In [12]:
column_docs = t.get_avro_docs(data_files[0])
for column in column_docs:
    if len(column[1]) > 0:
        w.printmd(f'**{column[0]}**: *{column[1]}*')
    else:
        w.printmd(f'**{column[0]}**: *-*')

**id**: *Activity UUID*

**deleted_time**: *If this activity has been deleted, this is the timestamp when it occurred*

**useruuid**: *User UUID*

**start_time**: *Start time*

**end_time**: *End time*

**devices**: *Devices used during this activity*

**devices_name**: *Name of the device*

**devices_type**: *Device type (PHONE, SMARTBAND, SMARTCAMERA, UNKNOWN, ...)*

**devices_id**: *Unique identifier of the device*

**tracks**: *-*

**tracks_start_time**: *-*

**tracks_end_time**: *-*

**tracks_artist**: *-*

**tracks_album**: *-*

**tracks_title**: *-*

**tracks_uri**: *-*

**tracks_player**: *-*

**tracks_id**: *-*

Show the top rows:

In [13]:
df_sample_pd.head(30)

Unnamed: 0,id,deleted_time,useruuid,start_time,end_time,devices_name,devices_type,devices_id,tracks_start_time,tracks_end_time,tracks_artist,tracks_album,tracks_title,tracks_uri,tracks_player,tracks_id,yearmonth
0,96c4958e-781b-477c-9823-6a6fc5cdfed0-2016-10-16,,217abb4b-cf26-4b28-a1af-bfa0565f64d8,2016-10-16T11:14:24.281+02:00,2016-10-16T11:58:41.070+02:00,D6503,PHONE,30575b739121b7a7,2016-10-16T11:42:33.218+02:00,2016-10-16T11:45:42.689+02:00,<unknown>,mp3,Maroon 5 - Maps (Lyric Video),content://media/external/audio/media/87943,Walkman,,201610
1,a8eec483-caa5-4297-ab53-12dfae409b44-2016-10-16,,d98eb3db-d1b5-47a3-a6cd-7f8b651913e0,2016-10-16T12:05:13.901+02:00,2016-10-16T12:15:11.685+02:00,D5503,PHONE,c9d2ad0a5e3b17a,2016-10-16T12:08:19.172+02:00,2016-10-16T12:09:07.304+02:00,ROMAIN VIRGO,CORNER SHOP RIDDIM [FULL PROMO] - 21ST HAPILOS...,BEAT YOU DOWN,content://media/external/audio/media/18753,Walkman,,201610
2,741a6351-cd04-47ce-bfa0-c906848ca52f-2016-10-16,,07b500bf-b975-4fc1-b203-d25973809a8a,2016-10-16T12:07:44.666+05:00,2016-10-16T14:45:58.478+05:00,E6633,PHONE,84b6ce0d67c6ab0c,2016-10-16T13:10:43.036+05:00,2016-10-16T13:14:48.150+05:00,Ma muzik,Music,Claudio Ismael - A tua escolha 2013,content://media/external/audio/media/2549,Walkman,,201610
3,2720ff80-e5f3-4a92-a8cd-5f7840b2743a-2016-10-16,,0f0742d4-454f-4234-a7aa-6980eeedf5fa,2016-10-16T15:37:42.058+09:00,2016-10-16T19:30:59.929+09:00,SO-03F,PHONE,5e0957ef6094f22f,2016-10-16T16:38:21.892+09:00,2016-10-16T16:44:09.193+09:00,<unknown>,audio,DaizyStripper ー Brilliant Days．,content://media/external/audio/media/4816,Walkman,,201610
4,2868bc0a-1fb3-4020-8706-811ae31f9f92-2016-10-16,,5f500365-af05-4ff0-ae64-abacd71eac06,2016-10-16T10:59:54.106+02:00,2016-10-16T11:21:26.814+02:00,E6653,PHONE,a4b2701122758f84,2016-10-16T11:03:12.722+02:00,2016-10-16T11:08:09.438+02:00,<unknown>,Music,Kizomba - Philipe Monteiro - Alta Segurança,content://media/external/audio/media/220,Walkman,,201610
5,65290fff-d70c-41c7-9ca9-f2b22cfa5fdc-2016-10-16,,62a8ea8f-9a27-4116-b880-a4050bad5926,2016-10-15T23:50:06.296-04:00,2016-10-16T01:14:11.333-04:00,D6616,PHONE,1b4054332c02838f,2016-10-16T00:21:04.592-04:00,2016-10-16T00:25:13.417-04:00,Cali y El Dandee Ft. Juan Magan y Sebastian Yatra,www.planetaexitos.com,Por Fin Te Encontré,content://media/external/audio/media/73063,Walkman,,201610
6,2abfb220-1b29-4df0-a395-9490947ccae6-2016-10-16,2016-10-16T11:55:22.114Z,35389166-37c2-4cd8-8ff1-152deeb0c11a,2016-10-16T17:26:08.565+09:00,2016-10-16T19:28:11.968+09:00,SO-02G,PHONE,b53e0801692fedf8,2016-10-16T17:42:59.463+09:00,2016-10-16T17:46:37.396+09:00,Sum 41,"The Best of Sum 41: 8 Years of Blood, Sake and...",Handle This,content://media/external/audio/media/8402,Walkman,,201610
7,694dfcc0-5406-44ca-8739-c0c57b365dd5-2016-10-16,,cd35971e-fdef-4b3e-909e-7c10a92e9718,2016-10-16T13:22:24.687+02:00,2016-10-16T13:39:22.476+02:00,D6603,PHONE,39d083c10ff8583f,2016-10-16T13:38:49.599+02:00,2016-10-16T13:39:22.476+02:00,A$AP Rocky,AT.LONG.LAST.A$AP,Lord Pretty Flacko Jodye 2 (LPFJ2),,Spotify,spotify:track:1j6kDJttn6wbVyMaM42Nxm,201610
8,9dda8f1a-4409-40ef-b569-e7dcfdd45a89-2016-10-16,,9c73877f-0277-4a73-bc25-f7910dd3bc7e,2016-10-16T02:52:18.382+02:00,2016-10-16T03:38:40.497+02:00,D6503,PHONE,29ff78f4b611dee3,2016-10-16T03:32:16.784+02:00,2016-10-16T03:36:07.264+02:00,Panet.co.il_Rabih-Baroud,www.panet.co.il,Taibo,content://media/external/audio/media/338585,Walkman,,201610
9,1c06a1a4-11e5-479d-b4eb-d430e56e8b41-2016-10-16,,77d2f9e5-68f0-4182-9ca9-dd8fdf128989,2016-10-16T17:06:48.356+05:00,2016-10-16T17:07:24.935+05:00,E6553,PHONE,8bbf450ec7eceee9,2016-10-16T17:06:48.356+05:00,2016-10-16T17:07:24.935+05:00,<unknown>,Download,TairoJ'étais prêt,content://media/external/audio/media/27003,Walkman,,201610


### Number of rows

Raw Database:

In [14]:
number_of_rows_raw = df_raw.count()
print(f'Number of rows in the raw database: {number_of_rows_raw}')

Number of rows in the raw database: 627101296


Full Database:

In [15]:
number_of_rows_full = df_full.count()
print(f'Number of rows in the full database: {number_of_rows_full}')

Number of rows in the full database: 3571278161


In [14]:
# For execution of the notebook without counting
number_of_rows_raw = 627101296
number_of_rows_full = 3571278161
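
As a rough sanity check, the two counts can be related to each other. The sketch below only does arithmetic on the numbers printed above. Reading the ratio as the average number of exploded rows per raw activity is an assumption based on the formatting log ("Exploding columns containing lists"), not something the notebook verifies.

```python
# Row counts copied from the outputs above.
number_of_rows_raw = 627101296
number_of_rows_full = 3571278161

# Average number of rows in the formatted dataframe per row in the raw one.
rows_per_raw_row = number_of_rows_full / number_of_rows_raw
print(f'Formatted rows per raw row: {rows_per_raw_row:.2f}')
```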

### Activity ID

Detecting empty cells:

In [17]:
start_time = time.time()

n_empty_id_cells = (
    df_raw
    .where(
        (f.col('id').isNull())
        | (f.col('id') == '')
        | (f.col('id') == '<unknown>')
    )
    .count()
)

print(f'Number of empty "id" cells: {n_empty_id_cells}.')
print(f'Execution time: {time.time() - start_time:.5f} s.')

Number of empty id cells: 0.
Execution time: 338.16046 s.


Detecting duplicates:

In [19]:
start_time = time.time()

(df_raw
 .groupBy(f.col('id'))
 .count()
 .orderBy(f.desc('count'))                    
 .show(5)
)

print(f'Execution time: {time.time() - start_time:.5f} s.')

+--------------------+-----+
|                  id|count|
+--------------------+-----+
|8c71293b-fe42-432...|  410|
|ff29d186-0614-42d...|  386|
|934c3a82-a73c-4ec...|  314|
|a2ca4895-8074-426...|  301|
|ae2fedfb-ef75-4ec...|  283|
+--------------------+-----+
only showing top 5 rows

Execution time: 585.76851 s.


Number of duplicate activity IDs:

In [15]:
start_time = time.time()

number_of_duplicate_activity_ids = (
    df_raw
    .groupBy(f.col('id'))
    .count()
    .where(f.col('count') > 1)
    .count()
)

print(f"Number of duplicated activity ID's: {number_of_duplicate_activity_ids}.")
print(f'That constitutes '
      f'{number_of_duplicate_activity_ids / number_of_rows_raw * 100:.5f} % '
      f'of the total number of cells.')
print(f'Execution time: {time.time() - start_time:.5f} s.')

Number of duplicated activity ID's: 46360934.
That constitutes 7.39289 % of the total number of cells.
Execution time: 580.64515 s.
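
Note that this count is the number of distinct IDs that occur more than once, which is not the same as the number of rows that deduplication would remove. A toy sketch in plain Python shows the difference. The IDs are hypothetical, and `collections.Counter` stands in for the Spark `groupBy`/`count`:

```python
from collections import Counter

# Hypothetical toy list of activity IDs, one entry per row.
ids = ['a', 'b', 'b', 'c', 'c', 'c', 'd']

# Equivalent of groupBy('id').count(): number of rows per ID.
counts = Counter(ids)

# Equivalent of where(count > 1): IDs that appear in more than one row.
duplicated_ids = [activity_id for activity_id, n in counts.items() if n > 1]

# Rows beyond the first occurrence of each ID.
surplus_rows = sum(n - 1 for n in counts.values())

print(f'Distinct IDs occurring more than once: {len(duplicated_ids)}')
print(f'Rows that deduplication would remove: {surplus_rows}')
```

Dividing the first number by the total row count, as done above, gives the share of duplicated IDs relative to rows, not the share of duplicated rows.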


### Activity Deleted

In [16]:
start_time = time.time()

n_nonempty_deleted_time_cells = (
    df_raw
    .where(
        (f.col('deleted_time').isNotNull())
        & (f.col('deleted_time') != '')
    )
    .count()
)

print(f'Number of non-empty "deleted_time" cells: {n_nonempty_deleted_time_cells}.')
print(f'That constitutes '
      f'{n_nonempty_deleted_time_cells / number_of_rows_raw * 100:.5f} % '
      f'of the total number of cells.')
print(f'Execution time: {time.time() - start_time:.5f} s.')

Number of non-empty "deleted_time" cells: 52764881.
That constitutes 8.41409 % of the total number of cells.
Execution time: 331.54952 s.


About 8.4 % of all activities have the `deleted_time` property set. Options for handling them:
 * Nothing
 * Delete the column
 * Delete the rows where the property is set
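
The three options can be compared on a toy example. This is a minimal sketch in plain Python on hypothetical in-memory records, not on the Spark dataframe. The field names follow the schema above, and `is_deleted` is an illustrative helper that mirrors the non-null, non-empty check used in the cell above.

```python
# Hypothetical toy records; only the fields needed for the illustration.
activities = [
    {'id': 'a1', 'deleted_time': None},
    {'id': 'a2', 'deleted_time': '2016-10-16T11:55:22.114Z'},
    {'id': 'a3', 'deleted_time': ''},
]


def is_deleted(row):
    """Treat an activity as deleted if 'deleted_time' is neither None nor empty."""
    return row['deleted_time'] not in (None, '')


# Option 1: do nothing, keep all rows and columns.
kept_all = list(activities)

# Option 2: drop the column, keep all rows.
without_column = [
    {key: value for key, value in row.items() if key != 'deleted_time'}
    for row in activities
]

# Option 3: drop the rows where the property is set.
not_deleted = [row for row in activities if not is_deleted(row)]

print(len(kept_all), len(without_column), len(not_deleted))
```

On a Spark dataframe, option 2 would correspond to `df.drop('deleted_time')` and option 3 to a `where` filter that keeps rows whose `deleted_time` is null or empty.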

### Device types

In [17]:
start_time = time.time()

(df_full
 .groupBy(f.col('devices_type'))
 .count()
 .orderBy(f.desc('count'))                    
 .show()
)

print(f'Execution time: {time.time() - start_time:.5f} s.')

+------------+----------+
|devices_type|     count|
+------------+----------+
|       PHONE|3561473696|
|        null|   7633807|
|     UNKNOWN|   2170658|
+------------+----------+

Execution time: 953.80290 s.


Almost all devices (about 99.7 %) are phones, so it could be reasonable to remove the column.
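
The shares behind this observation can be recomputed from the counts in the output above. A minimal sketch in plain Python, with the counts copied by hand:

```python
# Counts copied from the groupBy output above ('null' stands for missing values).
device_type_counts = {
    'PHONE': 3561473696,
    'null': 7633807,
    'UNKNOWN': 2170658,
}

total = sum(device_type_counts.values())

# Share of each device type as a percentage of all rows.
device_type_shares = {
    device_type: count / total * 100
    for device_type, count in device_type_counts.items()
}

for device_type, share in device_type_shares.items():
    print(f'{device_type}: {share:.3f} %')
```

The total of the three counts equals the number of rows in the full database, so no device type is missing from the table.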

### Track players

In [18]:
start_time = time.time()

(df_full
 .groupBy(f.col('tracks_player'))
 .count()
 .orderBy(f.desc('count'))                    
 .show()
)

print(f'Execution time: {time.time() - start_time:.5f} s.')

+-------------+----------+
|tracks_player|     count|
+-------------+----------+
|      Walkman|2428561686|
|      Spotify| 710807090|
|             | 431909382|
|        Music|         3|
+-------------+----------+

Execution time: 997.08188 s.


### Track ID vs Track URI

In [None]:
start_time = time.time()

n_track_id_and_uri_nonempty = (
    df_full
    .where(
        (f.col('tracks_uri') != '')
        & (f.col('tracks_uri').isNotNull())
        & (f.col('tracks_id') != '')     
        & (f.col('tracks_id').isNotNull())
    )
    .count()
)

print(f'Number of rows where both "tracks_id" and '
      f'"tracks_uri" are non-empty: {n_track_id_and_uri_nonempty}.')
print(f'Execution time: {time.time() - start_time:.5f} s.')

The Track ID and Track URI are never both non-empty in the same row. Therefore, the two columns could be merged into one.

The information in the two columns might also be redundant.

Another option is to split the dataset into a couple of separate databases, though it is doubtful that this makes sense.
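
The merge of `tracks_id` and `tracks_uri` suggested above could look roughly like this. It is a minimal sketch in plain Python, assuming that at most one of the two values is set per row. `merge_track_reference` is a hypothetical helper, not part of the toolbox, and the example values are taken from the sample rows shown earlier.

```python
def merge_track_reference(tracks_id, tracks_uri):
    """Return the single non-empty value of the two, or None if both are empty.

    Raises ValueError if both are set, because the merge assumes this never happens.
    """
    has_id = tracks_id not in (None, '')
    has_uri = tracks_uri not in (None, '')
    if has_id and has_uri:
        raise ValueError('Both tracks_id and tracks_uri are set.')
    if has_id:
        return tracks_id
    if has_uri:
        return tracks_uri
    return None


# A Spotify row (only tracks_id set) and a Walkman row (only tracks_uri set).
spotify_ref = merge_track_reference('spotify:track:1j6kDJttn6wbVyMaM42Nxm', '')
walkman_ref = merge_track_reference(None, 'content://media/external/audio/media/87943')

print(spotify_ref)
print(walkman_ref)
```

On the Spark dataframe, a comparable result could be obtained with `f.when` and `f.coalesce` after mapping empty strings to null. Raising on conflicts is a deliberate choice here, so that a violated assumption is noticed instead of silently losing one of the values.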
 