In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data Visualization and Exploratory Data Analysis Lab
## Visualizing and exploring data. The data mining process

In this lab, you'll get acquainted with the most streamed songs on Spotify in 2024. The dataset and its associated metadata can be found [here](https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024). The version you'll need is provided in the `data/` folder.

You know the drill. Do what you can / want / need to answer the questions to the best of your ability. Answers do not need to be trivial, or even the same among different people.

### Problem 1. Read the dataset (1 point)
Read the file without unzipping it first. You can try a different character encoding, like `unicode_escape`. Don't worry too much about weird characters.

In [19]:
spotify = pd.read_csv("data/spotify_most_streamed_2024.zip", compression="zip", encoding="unicode_escape")
spotify

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179,457017,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,For the Last Time,For the Last Time,$uicideboy$,9/5/2017,QM8DG1703420,4585,19.4,305049963,65770,5103054,...,,2.0,14217,,20104066,13184,50633006,656337,,1
4596,Dil Meri Na Sune,"Dil Meri Na Sune (From ""Genius"")",Atif Aslam,7/27/2018,INT101800122,4575,19.4,52282360,4602,1449767,...,,1.0,927,,,,,193590,,0
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,189972685,72066,6704802,...,,1.0,74,6.0,84426740,28999,,1135998,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,145467020,14037,7387064,...,,,,7.0,6817840,,,448292,,0


### Problem 2. Perform some cleaning (1 point)
Ensure all data has been read correctly; check the data types. Give the columns better names (e.g. `all_time_rank`, `track_score`, etc.). To do so, try to use `apply()` instead of a manual mapping between old and new names. Get rid of any unnecessary columns.

In [20]:
spotify.columns

Index(['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track'],
      dtype='object')

In [21]:
# Lowercase every column name and replace spaces with underscores in one vectorized step
spotify.columns = spotify.columns.str.lower().str.replace(" ", "_")
spotify.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'tidal_popularity',
       'explicit_track'],
      dtype='object')

In [22]:
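# sum() counts the missing values; count() would return the row count whether or not anything is missing
spotify['tidal_popularity'].isna().sum()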

4600

In [23]:
spotify = spotify.drop(columns=['tidal_popularity'])
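
Several count columns are stored as comma-formatted strings (for example `spotify_streams` and `youtube_views`), so checking the data types usually means converting them too. Below is a minimal sketch of doing that with `apply()`. The helper `to_number`, the `demo` frame and the `count_columns` list are hypothetical names with made-up values; on the real table you would pass the affected columns of `spotify`.

```python
import pandas as pd


def to_number(column: pd.Series) -> pd.Series:
    """Parse strings such as '1,234,567' into numbers; unparseable values become NaN."""
    cleaned = column.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Tiny stand-in for the real table (hypothetical values)
demo = pd.DataFrame({
    "track": ["a", "b", "c"],
    "spotify_streams": ["390,470,936", None, "1,000"],
    "youtube_views": ["84,274,754", "12", None],
})

# Convert every comma-formatted column in one pass
count_columns = ["spotify_streams", "youtube_views"]
demo[count_columns] = demo[count_columns].apply(to_number)
print(demo.dtypes)
```

Using `errors="coerce"` turns anything that still cannot be parsed into `NaN`, so it is worth checking `isna().sum()` before and after the conversion.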

### Problem 3. Most productive artists (1 point)
Who are the five artists with the most songs in the dataset?

Who are the five "clean-mouthed" artists (i.e., with no explicit songs)? **Note:** We're not going into details but we can start a discussion about whether a song needs swearing to be popular.

In [24]:
songs_count = spotify['artist'].value_counts()
artist_list_5 = songs_count.head(5)
artist_list_5

artist
Drake           63
Taylor Swift    63
Bad Bunny       60
KAROL G         32
The Weeknd      31
Name: count, dtype: int64

In [25]:
# Count non-explicit songs per artist (an artist listed here may still have explicit songs elsewhere)
clean_songs = spotify[spotify['explicit_track'] == 0]
clean_songs_count = clean_songs['artist'].value_counts()
clean_artist_list_5 = clean_songs_count.head(5)
clean_artist_list_5

artist
Taylor Swift     50
Billie Eilish    25
Bad Bunny        18
KAROL G          18
Morgan Wallen    17
Name: count, dtype: int64
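
A stricter reading of "no explicit songs" keeps only artists whose every track in the dataset has `explicit_track == 0`. The sketch below shows one way to express that; the function name `clean_mouthed_artists` and the `demo` frame are hypothetical, and the sample values are made up.

```python
import pandas as pd


def clean_mouthed_artists(songs: pd.DataFrame) -> pd.Series:
    """Song counts for artists whose every listed track has explicit_track == 0."""
    per_artist = songs.groupby("artist")["explicit_track"].agg(["max", "size"])
    # max == 0 means the artist has no explicit track at all
    clean = per_artist[per_artist["max"] == 0]
    return clean["size"].sort_values(ascending=False)


# Tiny stand-in for the real table (hypothetical values)
demo = pd.DataFrame({
    "artist": ["A", "A", "B", "B", "B", "C"],
    "explicit_track": [0, 1, 0, 0, 0, 0],
})
print(clean_mouthed_artists(demo).head(5))
```

On the real data this would be called as `clean_mouthed_artists(spotify).head(5)`. The result can differ from the per-song count above, because an artist with many clean songs and a single explicit one is excluded.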

### Problem 4. Most streamed artists (1 point)
And who are the top five most streamed (as measured by Spotify streams) artists?

In [26]:
spotify['spotify_streams']

0         390,470,936
1         323,703,884
2         601,309,283
3       2,031,280,633
4         107,034,922
            ...      
4595      305,049,963
4596       52,282,360
4597      189,972,685
4598      145,467,020
4599      255,740,653
Name: spotify_streams, Length: 4600, dtype: object

The `spotify_streams` column has dtype `object` (comma-formatted strings); it should be numeric.

In [27]:
spotify['spotify_streams'] = spotify['spotify_streams'].astype(str)
spotify['spotify_streams'] = spotify['spotify_streams'].str.replace(',', '')
spotify['spotify_streams'] = spotify['spotify_streams'].astype(float)
spotify['spotify_streams']

0       3.904709e+08
1       3.237039e+08
2       6.013093e+08
3       2.031281e+09
4       1.070349e+08
            ...     
4595    3.050500e+08
4596    5.228236e+07
4597    1.899727e+08
4598    1.454670e+08
4599    2.557407e+08
Name: spotify_streams, Length: 4600, dtype: float64

In [28]:
spotify['spotify_streams'].isna().any()

True

In [29]:
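# Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
spotify['spotify_streams'] = spotify['spotify_streams'].fillna(0)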
spotify_streams = spotify.groupby('artist')['spotify_streams'].sum()
spotify_streams = spotify_streams.sort_values(ascending=False)
top_five_artists_ss = spotify_streams.head(5)
top_five_artists_ss

artist
Bad Bunny       3.705483e+10
The Weeknd      3.694854e+10
Drake           3.496216e+10
Taylor Swift    3.447077e+10
Post Malone     2.613747e+10
Name: spotify_streams, dtype: float64

### Problem 5. Songs by year and month (1 point)
How many songs have been released each year? Present an appropriate plot. Can you explain the behavior of the plot for 2024?

How about months? Is / Are there (a) popular month(s) to release music?

In [30]:
spotify['release_date']

0       4/26/2024
1        5/4/2024
2       3/19/2024
3       1/12/2023
4       5/31/2024
          ...    
4595     9/5/2017
4596    7/27/2018
4597    2/28/2020
4598    11/8/2016
4599    4/11/2017
Name: release_date, Length: 4600, dtype: object
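
The dates are stored as `month/day/year` strings, so they need parsing before they can be grouped. Here is a minimal sketch, assuming that format holds for every row; the `demo` frame and its values are hypothetical, and on the real table you would use `spotify['release_date']` instead.

```python
import pandas as pd

# Tiny stand-in for the real column (hypothetical values)
demo = pd.DataFrame({
    "release_date": ["4/26/2024", "5/4/2024", "1/12/2023", "9/5/2017", "not a date"],
})

# Unparseable entries become NaT and are dropped before counting
released = pd.to_datetime(demo["release_date"], format="%m/%d/%Y", errors="coerce").dropna()

songs_per_year = released.dt.year.value_counts().sort_index()
songs_per_month = released.dt.month.value_counts().sort_index()

# Bar chart of songs per release year; the same call works for songs_per_month
ax = songs_per_year.plot(kind="bar")
ax.set_xlabel("Release year")
ax.set_ylabel("Number of songs")
```

When reading the yearly plot, keep in mind that 2024 was presumably still in progress when the data was collected, so its bar covers only part of a year.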

### Problem 6. Playlists (2 points)
Is there any connection (correlation) between users adding a song to playlists in one service, or another? Only Spotify, Apple, Deezer, and Amazon offer the ability to add a song to a playlist. Find a way to plot all these relationships at the same time, and analyze them. Experiment with different types of correlations.
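
One possible approach is to compute a correlation matrix over the four playlist-count columns and compare Pearson (linear) with Spearman (rank-based) coefficients. This is a sketch under assumptions: the `demo` values are made up, and the real columns may need the commas stripped first, which the parsing step below handles.

```python
import pandas as pd

playlist_columns = [
    "spotify_playlist_count",
    "apple_music_playlist_count",
    "deezer_playlist_count",
    "amazon_playlist_count",
]

# Tiny stand-in for the real table (hypothetical values)
demo = pd.DataFrame({
    "spotify_playlist_count": ["10", "20", "30", "40", "50"],
    "apple_music_playlist_count": [1.0, 4.0, 9.0, 16.0, 25.0],
    "deezer_playlist_count": [5.0, 3.0, None, 8.0, 9.0],
    "amazon_playlist_count": [2.0, 1.0, 4.0, 3.0, 6.0],
})

# Make sure every column is numeric, whatever it was read as
playlists = demo[playlist_columns].apply(
    lambda c: pd.to_numeric(c.astype(str).str.replace(",", "", regex=False), errors="coerce")
)

pearson = playlists.corr(method="pearson")    # sensitive to linear relationships
spearman = playlists.corr(method="spearman")  # sensitive to any monotonic relationship
print(pearson.round(2))
print(spearman.round(2))

# All pairwise scatter plots at once, with histograms on the diagonal
axes = pd.plotting.scatter_matrix(playlists, figsize=(8, 8))
```

Playlist counts tend to be heavily skewed, so the two coefficients may disagree noticeably; a large gap between them hints at a monotonic but non-linear relationship.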

### Problem 7. YouTube views and likes (1 point)
What is the relationship between YouTube views and likes? Present an appropriate plot. 

What is the mean YouTube views-to-likes ratio? What is its distribution? Find a way to plot it and describe it.

In [31]:
spotify['youtube_views']

0          84,274,754
1         116,347,040
2         122,599,116
3       1,096,100,899
4          77,373,957
            ...      
4595      149,247,747
4596      943,920,245
4597      201,027,333
4598    1,118,595,159
4599      866,300,755
Name: youtube_views, Length: 4600, dtype: object

In [32]:
# The view counts are comma-formatted strings, so strip the separators before converting
spotify['youtube_views'] = pd.to_numeric(spotify['youtube_views'].str.replace(',', ''))
spotify['youtube_views']

In [33]:
spotify['youtube_views'].isna().any()

True

In [34]:
spotify['youtube_likes'] = pd.to_numeric(spotify['youtube_likes'].str.replace(',', ''))
spotify['youtube_likes']

0        1713126.0
1        3486739.0
2        2228730.0
3       10629796.0
4        3670188.0
           ...    
4595     1397590.0
4596     5347766.0
4597     1081402.0
4598     3868828.0
4599     3826829.0
Name: youtube_likes, Length: 4600, dtype: float64

In [35]:
spotify['youtube_likes'].isna().any()

True

In [36]:
youtube = spotify.dropna(subset=['youtube_views', 'youtube_likes'])
youtube

Unnamed: 0,track,album_name,artist,release_date,isrc,all_time_rank,track_score,spotify_streams,spotify_playlist_count,spotify_playlist_reach,...,airplay_spins,siriusxm_spins,deezer_playlist_count,deezer_playlist_reach,amazon_playlist_count,pandora_streams,pandora_track_stations,soundcloud_streams,shazam_counts,explicit_track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,3.904709e+08,30716,196631588,...,40975,684,62.0,17598718,114.0,18004655,22931,4818457,2669262,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,3.237039e+08,28113,174597137,...,40778,3,67.0,10422430,111.0,7780028,28444,6623075,1118279,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,6.013093e+08,54331,211607669,...,74333,536,136.0,36321847,172.0,5022621,5639,7208651,5285340,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2.031281e+09,269802,136569078,...,1474799,2182,264.0,24684248,210.0,190260277,203384,,11822942,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,1.070349e+08,7223,151469874,...,12185,1,82.0,17660624,105.0,4493884,7006,207179,457017,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,For the Last Time,For the Last Time,$uicideboy$,9/5/2017,QM8DG1703420,4585,19.4,3.050500e+08,65770,5103054,...,6,,2.0,14217,,20104066,13184,50633006,656337,1
4596,Dil Meri Na Sune,"Dil Meri Na Sune (From ""Genius"")",Atif Aslam,7/27/2018,INT101800122,4575,19.4,5.228236e+07,4602,1449767,...,412,,1.0,927,,,,,193590,0
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,1.899727e+08,72066,6704802,...,204,,1.0,74,6.0,84426740,28999,,1135998,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,1.454670e+08,14037,7387064,...,1200,,,,7.0,6817840,,,448292,0


In [37]:
import seaborn as sns

# youtube_views may still hold comma-formatted strings; parse it so the x-axis is numeric
views = pd.to_numeric(youtube['youtube_views'].astype(str).str.replace(',', ''), errors='coerce')

plt.figure(figsize=(12, 8))
sns.scatterplot(x=views, y=youtube['youtube_likes'])
plt.xlabel('YouTube Views')
plt.ylabel('YouTube Likes')
plt.title('YouTube Views and Likes Relationship')
plt.show()

In [38]:
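# youtube_views may still hold comma-formatted strings, so parse it before dividing
views = pd.to_numeric(youtube['youtube_views'].astype(str).str.replace(',', ''), errors='coerce')
youtube = youtube.assign(views_to_likes_ratio=views / youtube['youtube_likes'])
youtube['views_to_likes_ratio'].mean()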
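
To describe the distribution of the ratio rather than only its mean, a histogram on a logarithmic scale is one reasonable choice, since ratios of counts are often right-skewed. The sketch below assumes comma-formatted view counts and numeric like counts; the `demo` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real table (hypothetical values)
demo = pd.DataFrame({
    "youtube_views": ["84,274,754", "116,347,040", None, "1,000", "500"],
    "youtube_likes": [1713126.0, 3486739.0, 10.0, 0.0, 25.0],
})

views = pd.to_numeric(
    demo["youtube_views"].astype(str).str.replace(",", "", regex=False), errors="coerce"
)
likes = demo["youtube_likes"].replace(0, np.nan)  # zero likes would give an infinite ratio
ratio = (views / likes).dropna()

# The mean is pulled up by extreme values, so the median is a useful companion
print(ratio.mean(), ratio.median())

# Histogram of log10(ratio) to make a skewed distribution readable
ax = np.log10(ratio).plot(kind="hist", bins=30)
ax.set_xlabel("log10(views / likes)")
```

If the mean and median differ a lot on the real data, that is a sign that a few songs with very high ratios dominate the mean.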

### Problem 8. TikTok stuff (2 points)
The most popular songs on TikTok, grouped by release year, show... interesting behavior. Which release years have the highest peak TikTok views? Show an appropriate chart. Can you explain this behavior? For a bit of context, TikTok was created in 2016.

Now, how much more popular is the most popular song for each release year than the mean popularity for that year? Analyze the results.

In both parts, it would be helpful to see the actual songs.
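
A possible starting point for both parts is sketched below: find the most-viewed song per release year, then divide its views by that year's mean. The `demo` frame, its values and the names `ranked` and `summary` are hypothetical; the column names follow the renamed `spotify` table, and `tiktok_views` is assumed to be comma-formatted like the other count columns.

```python
import pandas as pd

# Tiny stand-in for the real table (hypothetical values)
demo = pd.DataFrame({
    "track": ["a", "b", "c", "d", "e"],
    "release_date": ["1/1/2019", "6/1/2019", "3/3/2020", "4/4/2020", "5/5/2020"],
    "tiktok_views": ["1,000", "3,000", "10", "20", "600"],
})

demo["year"] = pd.to_datetime(demo["release_date"], format="%m/%d/%Y", errors="coerce").dt.year
demo["tiktok_views"] = pd.to_numeric(
    demo["tiktok_views"].astype(str).str.replace(",", "", regex=False), errors="coerce"
)
ranked = demo.dropna(subset=["year", "tiktok_views"])

# Row of the most-viewed song for each release year, so the actual track is visible
top_rows = ranked.loc[ranked.groupby("year")["tiktok_views"].idxmax()]
summary = top_rows.set_index("year")[["track", "tiktok_views"]].copy()

# Compare each year's top song with the mean views of all songs released that year
summary["mean_views"] = ranked.groupby("year")["tiktok_views"].mean()
summary["max_to_mean"] = summary["tiktok_views"] / summary["mean_views"]
print(summary)

# Bar chart of the peak TikTok views per release year
ax = summary["tiktok_views"].plot(kind="bar")
ax.set_ylabel("Peak TikTok views")
```

Keeping the `track` column in `summary` makes it easy to see which songs drive the peaks.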

### * Problem 9. Explore (and clean) at will
There is a lot to look for here. For example, you can easily link a song to its genres, and lyrics. You may also try to link artists and albums to more info about them. Or you can compare and contrast a song's performance across different platforms, in a similar manner to what you already did above; maybe even assign a better song ranking system (across platforms with different popularity metrics, and different requirements) than the one provided in the dataset.