
<p style="border:2px solid black"> </p>
<span style="font-family:Lucida Bright;">
<p style="margin-bottom:0.8cm"></p>
<center>
<font size="6"><b>Understanding Music Listening Habits</b></font>
<p style="margin-bottom:-0.1cm"></p>
<font size="6"><b>Using Large-scale Smartphone Data</b>  </font>

<p style="margin-bottom:0.5cm"></p>
<font size="3"><b>Wojciech Mazurkiewicz, DTU, 14 May 2021</b></font>
<p style="margin-bottom:1cm"></p>
<font size="5"><b>Overview</b></font>
<br>
<font size="3"><b></b></font>
</center>
<p style="margin-bottom:0.4cm"></p>
<p style="border:2px solid black"> </p>

    

# Initialization
<p style="border:2px solid black"> </p>


The initialization procedure is defined in the notebook [Initialization](initialization.ipynb).

In [1]:
%run initialization.ipynb

# The first look at the data
<p style="border:2px solid black"> </p>


The analysis described in this section is performed in the notebook [Preliminary data analysis](preliminary_data_analysis.ipynb). Main findings:

1. The raw data has a multilevel column index, which we convert to a flat index. This results in the following 18 columns, listed with their descriptions (as stated in the provided avro files):
    1. `id`: Activity UUID
    1. `deleted_time`: If this activity has been deleted, this is the timestamp when it occurred
    1. `useruuid`: User UUID
    1. `start_time`: Start time
    1. `end_time`: End time
    1. `devices`: Devices used during this activity
    1. `devices_name`: Name of the device
    1. `devices_type`: Device type (PHONE, SMARTBAND, SMARTCAMERA, UNKNOWN, ...)
    1. `devices_id`: Unique identifier of the device
    1. `tracks`: -
    1. `tracks_start_time`: -
    1. `tracks_end_time`: -
    1. `tracks_artist`: -
    1. `tracks_album`: -
    1. `tracks_title`: -
    1. `tracks_uri`: -
    1. `tracks_player`: -
    1. `tracks_id`: -


2. The raw data has **627,101,296** rows, each corresponding to an "activity". 


3. Two columns contain lists: `devices` and `tracks`. We unpack these columns so that each row corresponds to a playback of one song on one device. This increases the number of rows in the database to **3,571,278,161**.


4. Activity UUID (column `id`):
    1. No empty cells
    1. **46,360,934** IDs have at least one duplicate (this amounts to at least 7.4% of all rows in the raw database, i.e. where each row represents an activity).
    1. It might be reasonable to remove the duplicates.
 
 
5. Activity deleted (column `deleted_time`):
    1. 8.4% of all the rows (activities) in the raw database have the "Deleted time" property set. Options for handling them:
        1. Do nothing.
        1. Delete the property (column).
        1. Delete the rows where the property is set.


6. Device types (column `devices_type`):
    1. All the defined devices are of type: "PHONE".
    1. 0.275 % of the devices in the unpacked database (where each row represents a playback of one song on one device) are undefined (null or UNKNOWN)
    1. Therefore, it could be reasonable to remove the column.
  
  
7. Track players (column `tracks_player`):
    1. The following track players are represented in the database (percentages are relative to the number of entries in the unpacked database, i.e. where each row represents a playback of one song on one device):
        1. Walkman: 68%
        1. Spotify: 20%
        1. Music: 8.42e-8%
        1. Undefined: 12%


8. Track ID and Track URI (columns `tracks_id` and `tracks_uri`):
    1. The Track ID and Track URI are never both populated in the same row. Therefore, the columns could be merged into one column.
    
   
9. Activity and track start and stop times (columns `start_time`, `end_time`, `tracks_start_time` and  `tracks_end_time`):
    1. The data should be converted to timestamps.
    1. The data can be expressed in terms of start time and duration.
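
The flattening and unpacking described in findings 1 and 3 can be sketched on a toy pandas dataframe. This is a minimal illustration, not the notebook's actual code: the helper `flatten_columns`, the toy values and the choice of pandas are all assumptions, and the real data is far larger.

```python
import pandas as pd

# Hypothetical toy activities: two list columns under a two-level column index.
raw = pd.DataFrame({
    "id": ["a1", "a2"],
    "devices": [["phone-1", "phone-2"], ["phone-3"]],
    "tracks": [["song-1", "song-2", "song-3"], ["song-4"]],
})
raw.columns = pd.MultiIndex.from_tuples(
    [("id", ""), ("devices", "name"), ("tracks", "title")])


def flatten_columns(df):
    """Join the levels of a column MultiIndex with '_', skipping empty levels."""
    flat = df.copy()
    flat.columns = ["_".join(part for part in col if part) for col in df.columns]
    return flat


flat = flatten_columns(raw)

# Exploding both list columns in turn yields one row per (device, track) pair.
# The index is reset between the two steps so that it stays unique.
unpacked = (flat.explode("devices_name")
                .reset_index(drop=True)
                .explode("tracks_title")
                .reset_index(drop=True))
```

In this toy example, activity `a1` (two devices, three tracks) expands to six rows and `a2` to one row, which mirrors how the row count grows when the real data is unpacked.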




    

    
    
    
    

# Data cleaning
<p style="border:2px solid black"> </p>


The analysis described in this section is performed in the notebook [Data Cleaning](data_cleaning.ipynb).

The cleaning procedure:

1. Remove the rows where the activity has been deleted.
1. Drop duplicate activity IDs, keeping the most recent activity for each ID.
1. Flatten the schema (make the column index flat).
1. Explode the columns `devices` and `tracks` so that each row represents a playback of one track on one device.
1. Ensure that all the data representing timestamps are formatted as such.
1. Replace the start and end time of each activity and track playback with start time and duration.
1. Drop the following columns:
    1. `deleted_time`, as we have removed the rows where this property is set.
    1. `devices_type`, as all defined devices are phones.
    1. `yearmonth`, as it doesn't contain any information that is not present in other columns.
1. Rename columns as follows:
    1. Column `id` is renamed to `activity_id`.
    1. Column `useruuid` is renamed to `user_id`.
    1. Column `start_time` is renamed to `activity_start_time`.
    1. Column `devices_name` is renamed to `device_name`.
    1. Column `devices_id` is renamed to `device_id`.
    1. Column `tracks_start_time` is renamed to `track_start_time`.
    1. Column `tracks_artist` is renamed to `track_artist`.
    1. Column `tracks_album` is renamed to `track_album`.
    1. Column `tracks_title` is renamed to `track_title`.
    1. Column `tracks_player` is renamed to `track_player`.
    1. Column `tracks_id` is renamed to `track_spotify_uri`.
    1. Column `tracks_uri` is renamed to `track_sony_uri`.
1. Order the columns so that the ones belonging to the same group (e.g. devices) are next to each other.
1. Replace undefined cell values with null:
    1. Empty cells.
    1. Cells whose value is: "\<unknown\>"
    1. Cells containing the character "�"
1. The resulting database has 2,530,475,843 rows, each representing a playback of one audio track on one playback device.
1. Save the dataframe in parquet format, with the files sorted into folders by the first two letters of the column `user_id`.
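
A few of the steps above can be sketched in pandas on a toy dataframe. This is only an illustration under stated assumptions: the functions `clean` and `nullify_undefined` and the toy values are hypothetical, "most recent" is assumed to mean the latest `start_time`, and the notebook itself may use a different engine for data of this size.

```python
import pandas as pd

# Hypothetical toy activities using the flattened raw column names.
activities = pd.DataFrame({
    "id": ["a1", "a1", "a2", "a3"],
    "deleted_time": [None, None, "2020-01-03T00:00:00", None],
    "start_time": ["2020-01-01T10:00:00", "2020-01-02T10:00:00",
                   "2020-01-01T11:00:00", "2020-01-01T12:00:00"],
    "end_time": ["2020-01-01T10:03:00", "2020-01-02T10:04:00",
                 "2020-01-01T11:05:00", "2020-01-01T12:02:30"],
    "tracks_artist": ["Artist A", "Artist A", "Artist B", "<unknown>"],
    "tracks_title": ["Song", "Song", "Other", "Caf\ufffd"],
})


def nullify_undefined(col):
    """Null out empty strings, "<unknown>" and values containing U+FFFD."""
    undefined = (col.isin(["", "<unknown>"])
                 | col.str.contains("\ufffd", regex=False, na=False))
    return col.mask(undefined)


def clean(df):
    # Remove the rows where the activity has been deleted.
    out = df[df["deleted_time"].isna()].copy()
    # Make sure the timestamp columns really are timestamps.
    for name in ["start_time", "end_time"]:
        out[name] = pd.to_datetime(out[name])
    # Keep one row per activity ID (assumption: the latest start time wins).
    out = out.sort_values("start_time").drop_duplicates("id", keep="last")
    # Replace start/end time with start time and duration.
    out["activity_duration"] = out["end_time"] - out["start_time"]
    out = out.drop(columns=["deleted_time", "end_time"])
    out = out.rename(columns={
        "id": "activity_id",
        "start_time": "activity_start_time",
        "tracks_artist": "track_artist",
        "tracks_title": "track_title",
    })
    # Replace undefined cell values with null.
    for name in ["track_artist", "track_title"]:
        out[name] = nullify_undefined(out[name])
    return out.reset_index(drop=True)


cleaned = clean(activities)
```

The sketch skips the explode, column ordering and parquet steps; it only shows how the filtering, de-duplication, duration and null-replacement steps fit together.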

# Initial statistics
<p style="border:2px solid black"> </p>


The analysis described in this section is performed in the notebook [Initial Statistics](initial_statistics.ipynb).

Main findings:

In [2]:
display(pd.read_pickle(Config.Path.initial_stats_df_combined_stats))

| Column | Undefined: number of cells | Undefined: % | Number of distinct values |
|---|---|---|---|
| activity_id | 0 | 0.0 | 520747470 |
| activity_start_time | 0 | 0.0 | 519487179 |
| activity_duration | 0 | 0.0 | 235387 |
| device_id | 7253534 | 0.29 | 4742108 |
| device_name | 7258178 | 0.29 | 4188 |
| track_artist | 710429434 | 28.07 | 9931077 |
| track_title | 180364595 | 7.13 | 73003873 |
| track_player | 338135480 | 13.36 | 4 |
| track_start_time | 0 | 0.0 | 2498319684 |
| track_playback_duration | 0 | 0.0 | 226103 |


## Number of undefined values

In [3]:
# If saved before, it is also possible to load the
# dataframe from pickle.
dfp_undefined_entries = pd.read_pickle(
    Config.Path.initial_stats_df_n_undefined)

# Show the result
display(dfp_undefined_entries)

| Column | Undefined: number of cells | Undefined: % |
|---|---|---|
| activity_id | 0 | 0.0 |
| activity_start_time | 0 | 0.0 |
| activity_duration | 0 | 0.0 |
| device_id | 7253534 | 0.29 |
| device_name | 7258178 | 0.29 |
| track_artist | 710429434 | 28.07 |
| track_title | 180364595 | 7.13 |
| track_player | 338135480 | 13.36 |
| track_start_time | 0 | 0.0 |
| track_playback_duration | 0 | 0.0 |
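
Per-column statistics of this kind could be computed along the following lines. This is a hedged sketch on a toy dataframe, not the code from the Initial Statistics notebook: the function `undefined_stats` and the column labels are hypothetical, and it assumes that nulls are excluded from the distinct count.

```python
import pandas as pd


def undefined_stats(df):
    """Count undefined (null) cells and distinct values for every column."""
    n_undefined = df.isna().sum()
    return pd.DataFrame({
        "Number of undefined cells": n_undefined,
        "Undefined (%)": (100 * n_undefined / len(df)).round(2),
        # nunique() ignores nulls by default.
        "Number of distinct values": df.nunique(),
    })


# Hypothetical toy sample with the cleaned column names.
sample = pd.DataFrame({
    "activity_id": ["a1", "a1", "a2", "a3"],
    "track_artist": ["Artist A", None, None, "Artist B"],
})

stats = undefined_stats(sample)
```

Here `track_artist` has two undefined cells out of four (50%) and two distinct values, while `activity_id` has no undefined cells and three distinct values.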
