In [None]:
#Overview: Is it possible to predict which songs will be hits?
#The purpose of this project is to explore the Billboard Top 40 data set and compare its
#songs with Spotify's music metrics to determine whether there is a winning formula in music.
#
#This project maps Spotify's metrics (danceability, tempo, popularity, and others) onto
#the Billboard Top 40 songs. The two data sets are connected by song title, which takes some prep
#to match up.


In [None]:
#Data Profile:

#What entities/terms/features need to be extracted?
    #I'm interested in merging the data sets on artist name and song title (the merge below matches on song title only) to understand which Spotify metrics apply to the Billboard Top 40 worldwide in 2019

#Are there restrictions or limitations to using it?
    #None, these are public Kaggle data sets

#If it is an API, how might you use its functions to retrieve useful data? That is, how could you use it or combine it with something to produce a useful report/answer a question? Why is it interesting to you?
    #not an API
    
#If it is a dataset, what would you need or not need from it to explore your question?
    #for the Billboard data, I would like to keep the song name, artist, weeks at #1, and weeks on the chart
    #I then want to merge this data with Spotify to understand each top song's Spotify metrics, including:
        #danceability, tempo, energy, and artist popularity
    #I'll also only be using the 2019 data from Spotify, since that's the year I have for Billboard

#Write a short project plan describing

#The questions you hope to answer
    #I hope to answer the question "What makes a song go to the top of the Billboard list?"
    #I'm curious whether I can identify patterns in the types of songs that reach the Top 40
    #Given this data, can I predict from a song's metrics and artist popularity whether or not it will be a Top 40 hit?


#How will this data be analyzed? Will you simply do aggregations and comparisons or will you use techniques such as descriptive statistics, correlation, regression?
    #I'll start with some basic comparisons between the data sets, using descriptive stats to understand them better
    #then I'll run correlations to see whether I can begin building a model that predicts a Top 40 hit from its Spotify stats
    
#How are you using the entities/terms/features?

    #for the Billboard data, I would like to keep the song name, artist, weeks at #1, and weeks on the chart
    #I then want to merge this data with Spotify to understand each top song's Spotify metrics, including:
        #danceability, tempo, energy, and artist popularity

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#load the Billboard chart data (pandas was already imported as pd in the cell above)
Billboard = pd.read_csv("../input/billboard-charts/Billboard.csv")

In [None]:
Billboard.head(20)

In [None]:
spotify = pd.read_csv("../input/spotify-global-2019-moststreamed-tracks/spotify_global_2019_most_streamed_tracks_audio_features.csv")

spotify.head(20)

In [None]:
#Step 1: link the two data sets with an inner join on the 'Track Name' column of the Spotify
#data set and the 'Song' column of the Billboard data set

popsongs = spotify.merge(Billboard, left_on='Track Name', right_on='Song', how='inner')

popsongs

#There are 211 matches between the Spotify and Top 40 data sets! Let's play with them to see some correlations.
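
In [None]:
#A minimal sketch of how the title match above could be tightened, on made-up rows rather than the real files.
#It trims and lower-cases titles, collapses repeated Billboard entries to one row per song (in case the chart
#file lists a song more than once), and uses merge's indicator to show which Spotify tracks found no match.
#Assumptions: the toy frames, the helper normalize_title, and the title_key column are hypothetical;
#only 'Track Name', 'Song', 'Peak Rank', 'Weeks On Chart', and 'Streams' come from the cells above.

```python
import pandas as pd


def normalize_title(titles: pd.Series) -> pd.Series:
    """Trim whitespace and lower-case titles so 'Song A ' and 'song a' compare equal."""
    return titles.str.strip().str.lower()


# Toy stand-ins for the two tables (made-up rows, illustration only)
spotify_demo = pd.DataFrame({
    "Track Name": ["Song A ", "Song B", "Song C"],
    "Streams": [300, 250, 100],
})
billboard_demo = pd.DataFrame({
    "Song": ["song a", "song a", "Song B"],
    "Peak Rank": [2, 1, 1],
    "Weeks On Chart": [10, 11, 5],
})

# Collapse repeated chart entries to one row per song: best peak, longest run
billboard_one = (
    billboard_demo
    .assign(title_key=normalize_title(billboard_demo["Song"]))
    .groupby("title_key", as_index=False)
    .agg({"Peak Rank": "min", "Weeks On Chart": "max"})
)
spotify_keyed = spotify_demo.assign(
    title_key=normalize_title(spotify_demo["Track Name"])
)

# how='left' with indicator=True keeps unmatched Spotify tracks visible;
# validate='m:1' raises if the Billboard side still has duplicate keys
merged_demo = spotify_keyed.merge(
    billboard_one, on="title_key", how="left", indicator=True, validate="m:1"
)
matched_demo = merged_demo[merged_demo["_merge"] == "both"]
print(merged_demo["_merge"].value_counts())
```

#Matching on title alone can still pair different songs that share a name, so adding an artist key
#in the same way would be the natural next refinement.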

In [None]:
#let's get matplotlib in here so we can make some nice visualizations of our new merged data set
import matplotlib.pyplot as plt

#Peak Rank and Weeks On Chart are the two Billboard measures available here. Peak Rank is better when it's a lower number,
#so a combined score (built in the next cell) puts it in the denominator.

#first, let's take a look at whether there is some correlation between Streams and the number of weeks on the chart
plt.style.use('ggplot')
popsongs.plot(kind='scatter', x='Streams', y='Weeks On Chart')

#the scatterplot looks like there may be some correlation between Weeks On Chart and Streams.

popsongs['Streams'].corr(popsongs['Weeks On Chart'])

#after checking the Pearson coefficient for this correlation, it looks like it's pretty low.


In [None]:
#Perhaps there is a better variable we can use that combines Peak Rank and Weeks On Chart. I'll call it 'bill_rating'
#I need Weeks On Chart x 1/Peak Rank because the lower the Peak Rank the better, i.e. #1 is the best peak rank


popsongs['bill_rating'] = popsongs['Weeks On Chart']/popsongs['Peak Rank']

#now I'll plot this correlation
plt.style.use('ggplot')
popsongs.plot(kind='scatter', x='Streams', y='bill_rating')

#and get the correlation coefficient to make sure

popsongs['Streams'].corr(popsongs['bill_rating'])

#after checking the Pearson coefficient for this correlation, it looks like it's even lower than the previous one.
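
In [None]:
#A small worked example of the bill_rating formula (Weeks On Chart / Peak Rank) on made-up rows,
#just to show how the score behaves. The frame rating_demo and its values are hypothetical.

```python
import pandas as pd

# Made-up rows, purely to show how the score behaves
rating_demo = pd.DataFrame({
    "Song": ["A", "B", "C"],
    "Peak Rank": [1, 10, 2],
    "Weeks On Chart": [10, 10, 30],
})

# Same formula as above: more weeks raises the score, a worse (higher) peak rank lowers it
rating_demo["bill_rating"] = rating_demo["Weeks On Chart"] / rating_demo["Peak Rank"]

print(rating_demo.sort_values("bill_rating", ascending=False))
```

#Song C peaks at #2 but outscores song A (a #1) because it stayed on the chart three times as long,
#so this score rewards longevity heavily. Whether that weighting is desirable is a judgment call.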


In [None]:
#Now that I know there isn't much correlation between Streams and the Billboard measures, I'll revisit this topic in my conclusion.
#For now, I'll play around with the Spotify columns to identify potential correlations between song stats and popularity rating

#Let's first see if artist popularity is correlated to # of streams. 
popsongs.plot(kind='scatter', x='Streams', y='Artist_popularity')

#It appears that there are a lot of entries with very few streams. At this scale the graph makes them look like they have 0
#streams, so I want to zoom in and see what's happening on a more granular level with songs
#below 1e8 Streams.



In [None]:
#To zoom in, split the merged data at 1e8 Streams: songs below 1e8 go in popsongs_small,
#and songs with 1e8 or more Streams go in popsongs_big.

popsongs_small = popsongs[popsongs.Streams<1e8]
popsongs_big = popsongs[popsongs.Streams>=1e8]
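
In [None]:
#A hedged sketch of the threshold split above on made-up stream counts, plus one alternative to splitting:
#a log10 column that spreads out the crowded low end. The frame streams_demo, the variable threshold,
#and the log10_streams column are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stream counts spanning several orders of magnitude
streams_demo = pd.DataFrame({"Streams": [5e6, 4e7, 9e7, 1e8, 2.5e9]})

threshold = 1e8
demo_small = streams_demo[streams_demo["Streams"] < threshold]
demo_big = streams_demo[streams_demo["Streams"] >= threshold]

# '<' and '>=' on the same threshold are complementary, so every
# non-missing row lands in exactly one of the two subsets
print(len(demo_small), len(demo_big), len(streams_demo))

# Alternative to splitting: a log10 column keeps all rows in one view
streams_demo["log10_streams"] = np.log10(streams_demo["Streams"])
print(streams_demo)
```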




In [None]:
#Now, using the smaller data set with <1e8 streams, I want to see if there is a correlation between
#Artist_popularity and Streams
popsongs_small.plot(kind='scatter', x='Streams', y='Artist_popularity')

#yeah, that doesn't look like there is any correlation.
#let's get the correlation coefficient to make sure

popsongs_small['Streams'].corr(popsongs_small['Artist_popularity'])


In [None]:
#I'm curious to see if there is some correlation between artist popularity and artist followers in the Spotify data set.
#my hypothesis is that the number of artist followers is involved somehow in the artist popularity rating

spotify.plot(kind='scatter', x='Artist_follower', y='Artist_popularity')
#that looks promising! Let's double check the correlation value

#both columns must come from the same DataFrame: correlating against popsongs_small['Artist_popularity']
#would line rows up by index label across two different tables and give a misleading value
spotify['Artist_follower'].corr(spotify['Artist_popularity'])

#I would expect the number of artist followers to be part of the artist popularity equation,
#so if this value comes out lower than the plot suggests, let's zoom in on this data
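
In [None]:
#A minimal sketch of why a correlation should use two columns from the same DataFrame.
#Series.corr aligns the two Series on their index labels before computing the coefficient, so a column
#from a filtered or merged table gets paired with whatever rows happen to share an index label.
#The frames full_demo and other_demo and their values are made up for illustration.

```python
import pandas as pd

# Made-up numbers: popularity is exactly followers / 10 in this full table
full_demo = pd.DataFrame({
    "Artist_follower": [10, 20, 30, 40, 50, 60],
    "Artist_popularity": [1, 2, 3, 4, 5, 6],
})

# Stand-in for a merged/filtered table: fewer rows, its own 0..2 index, different row order
other_demo = pd.DataFrame({"Artist_popularity": [6, 1, 4]})

# Same frame: rows pair up correctly, so the perfect linear relationship shows
same_frame = full_demo["Artist_follower"].corr(full_demo["Artist_popularity"])

# Cross frame: only index labels 0, 1, 2 are shared, and they pair unrelated rows
cross_frame = full_demo["Artist_follower"].corr(other_demo["Artist_popularity"])

print(same_frame, cross_frame)
```

#In this toy case the same-frame value is 1 while the cross-frame value is negative, even though
#the underlying relationship never changed.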

In [None]:
# I'm going to try zooming in on this data and make the Spotify data more manageable by splitting it at 1e7 followers

spotify_small = spotify[spotify.Artist_follower<1e7]
spotify_big = spotify[spotify.Artist_follower>=1e7]



In [None]:
#Let's try playing with the smaller data set of less than 1e7 followers

spotify_small.plot(kind='scatter', x='Artist_follower', y='Artist_popularity')
#that looks promising! Let's double check the correlation value

spotify_small['Artist_follower'].corr(spotify_small['Artist_popularity'])

#Awesome! A moderate correlation. This definitely leaves an open question of what's going on with this 
#correlation for the larger follower data set. 

In [None]:
#Let's try playing with the bigger data set of more than 1e7 followers

spotify_big.plot(kind='scatter', x='Artist_follower', y='Artist_popularity')
#that looks promising! Let's double check the correlation value

spotify_big['Artist_follower'].corr(spotify_big['Artist_popularity'])

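#Awesome! A moderate correlation. I can't yet fully explain why the combined data set
#gives a smaller correlation value for two variables that seem related. One thing worth checking: Pearson only
#measures a straight-line relationship, so if popularity rises quickly at low follower counts and then levels off,
#the coefficient over the full range can understate the link. A rank-based (Spearman) correlation or a log scale
#for followers may be a better fit.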
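
In [None]:
#A hedged sketch of one possible reason a pooled Pearson value can look low for two clearly related variables.
#The data here is synthetic: popularity is ASSUMED to grow with log10(followers), which is an illustration only,
#not Spotify's actual formula. Spearman is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic followers from 1e3 to 1e8, with popularity tied to their logarithm
followers = 10 ** np.linspace(3, 8, 50)
popularity = 10 * np.log10(followers)
curve_demo = pd.DataFrame({
    "Artist_follower": followers,
    "Artist_popularity": popularity,
})

# Pearson on the raw values: penalized because the relationship is curved
pearson_raw = curve_demo["Artist_follower"].corr(curve_demo["Artist_popularity"])

# Spearman (Pearson on ranks): only asks whether the ordering agrees
spearman = curve_demo["Artist_follower"].rank().corr(
    curve_demo["Artist_popularity"].rank()
)

# Pearson after a log transform: the relationship becomes a straight line
pearson_log = np.log10(curve_demo["Artist_follower"]).corr(
    curve_demo["Artist_popularity"]
)

print(pearson_raw, spearman, pearson_log)
```

#If the real data behaves similarly, the rank-based or log-scale value would be the fairer summary.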

In [None]:
#Conclusion:
#Unfortunately, I wasn't able to answer the questions I set out to answer. Streams showed little correlation with
#Weeks On Chart or with the combined bill_rating, and I did not get as far as testing the audio features
#(danceability, tempo, energy) against the Billboard measures.
#I then pivoted to explore a bit of the Spotify data to get a better understanding of how the 'Artist Popularity'
#rating is calculated, and came to the conclusion that artist followers may have some influence on, or be part of, that calculation.

#Next steps would be to explore other data sets that may play better with the Spotify data set to build a better model for predicting hits!
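
In [None]:
#As a possible starting point for those next steps, here is a sketch that ranks candidate features by their
#correlation with a target score. Everything in it is synthetic: the column names danceability, energy, and tempo
#are assumed (the real Spotify file may label them differently), and the target is built to depend on
#danceability only, so the ranking says nothing about real songs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features with assumed column names
features_demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "tempo": rng.uniform(60, 180, n),
})

# Synthetic target that depends on danceability plus noise
features_demo["bill_rating"] = (
    5 * features_demo["danceability"] + rng.normal(0, 0.5, n)
)

# Correlate every column with the target, drop the target itself,
# and sort by absolute strength so negative links are not buried
feature_corr = (
    features_demo.corr()["bill_rating"]
    .drop("bill_rating")
    .sort_values(key=abs, ascending=False)
)
print(feature_corr)
```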
