# Analysis of the popular music of the last 55 years

### Data Wrangling

##### Imports - API settings - Constants definition

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import scipy.stats as sstats

# EchoNest API
from pyechonest import config
from pyechonest import song
from pyechonest import artist

# LastFM API
import pylast

# Geopy
from geopy.geocoders import Nominatim

# Functions used in this notebook
import dataStoryFunctions as dsf

In [2]:
# Loads the credentials from the yaml file
secrets = dsf.load_secrets()

# Set ECHO_NEST_API_KEY value
config.ECHO_NEST_API_KEY = secrets["echonest_api_key"]

# Set the Last.fm API_KEY and API_SECRET
# Obtain yours from http://www.last.fm/api/account
API_KEY = secrets["lastfm_api_key"]
API_SECRET = secrets["lastfm_api_secret"]

# In order to perform a write operation you need to authenticate yourself
username = secrets["lastfm_username"]
password_hash = pylast.md5(secrets["lastfm_password_hash"])

last_fm_network = pylast.LastFMNetwork(api_key=API_KEY, api_secret=API_SECRET,
                                       username=username, password_hash=password_hash)

In [3]:
# Define the starting and ending years 
start_year = 1960
end_year = 2015

##### Original dataframe creation

The next steps take a long time to run. They were executed once, during the data processing and cleaning phase.

The values still missing from the resulting dataframe were then filled in manually.

In [6]:
# Creation of the global dataframe

# billboard_df = dsf.create_billboard_df_from_CSV(start_year, years)
# s = billboard_df['Title'].str.split('" / "').apply(pd.Series, 1).stack()
# s.index = s.index.droplevel(-1)
# s.name = 'Title'
# del billboard_df['Title']
# billboard_df = billboard_df.join(s)
# billboard_df = billboard_df[['Num', 'Artist(s)', 'Title', 'Year']] 
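
As a small illustration of what the commented-out title split is for, the sketch below turns an entry that lists two titles into one row per title. It is not the notebook's code: it uses `DataFrame.explode` (pandas 0.25 or later) instead of the split/stack/join sequence above, and the toy rows and the names `toy_df` and `split_df` are made up for the example.

```python
import pandas as pd

# Hypothetical toy rows; the second entry lists two titles joined by '" / "'
toy_df = pd.DataFrame({
    'Num': [1, 2],
    'Artist(s)': ['Artist A', 'Artist B'],
    'Title': ['Song One', 'Side A" / "Side B'],
    'Year': [1960, 1960],
})

# Split each title on the separator, then give every resulting title its own row
split_df = (toy_df.assign(Title=toy_df['Title'].str.split('" / "'))
                  .explode('Title')
                  .reset_index(drop=True))

print(split_df[['Num', 'Artist(s)', 'Title', 'Year']])
```

The rank, artist and year are repeated for each title, so every track can later be looked up individually.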

In [7]:
# Addition of new characteristics to the dataframe (artist location, audio summary...)

# billboard_df_final = dsf.add_songs_characteristics_to_df(billboard_df, 
#                                                        'CSV_data/billboard_df-final.csv')

The final dataframe was built with the commands above and completed manually in Excel. The result was saved to a CSV file, which is loaded into a pandas dataframe below.

In [10]:
billboard_df_final = pd.read_csv('CSV_data/billboard_df-final.csv', sep=';')
del billboard_df_final['Colonne1']

The country of origin of each artist is added as a new column of the dataframe. It is used in one of the charts of the study.

In [11]:
billboard_df_final = dsf.add_Track_Country_Of_Origin_To_DF(billboard_df_final)

As the data has already been generated, you can skip all the previous steps and build the final dataframe directly from the CSV file.

In [8]:
#billboard_df_final = pd.read_csv('CSV_data/billboard_df-final.csv')
#del billboard_df_final['Unnamed: 0']

##### Number of songs per artist in the Billboard Hot 100 year-end charts

The methodology used to create this dataframe is explained in the article related to the project.

In [56]:
unique_artist_df = dsf.create_entries_by_unique_artist(billboard_df_final,
                                                       start_year, end_year)

Finally, we add one last feature to the dataframe that groups the number of tracks by artist: the dominance of the artist over a given period. It is calculated by summing the number of tracks that one particular artist ranked in the charts during n years, and dividing that number by the total number of tracks in the Billboard Hot 100 during those n years.

In this study we have chosen to use rolling periods of 3 years.

In [57]:
unique_artist_df = dsf.get_most_dominant_artist_per_years(unique_artist_df, start_year,
                                                          end_year, 3, 1)
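
The dominance ratio described above can be sketched in plain Python. This is only an illustration of the formula, not the implementation in `dataStoryFunctions`: the function name `dominance`, the `(artist, year)` tuple layout and the toy entries are assumptions made for the example.

```python
def dominance(entries, artist, first_year, n_years):
    """Share of chart entries held by `artist` over `n_years` starting at `first_year`.

    `entries` is a list of (artist, year) tuples, one per ranked track.
    """
    window = range(first_year, first_year + n_years)
    artists_in_window = [a for (a, year) in entries if year in window]
    if not artists_in_window:
        return 0.0
    # Tracks by this artist divided by all tracks ranked in the window
    return artists_in_window.count(artist) / float(len(artists_in_window))


# Hypothetical toy chart: one tuple per ranked track
toy_entries = [
    ('A', 1960), ('B', 1960),
    ('A', 1961), ('C', 1961),
    ('A', 1962), ('B', 1962),
    ('C', 1963),
]

# Rolling 3-year windows starting in 1960 and 1961
rolling = [dominance(toy_entries, 'A', year, 3) for year in (1960, 1961)]
print(rolling)
```

In the first window artist `A` holds 3 of the 6 ranked tracks, and in the second 2 of 5, which is why the ratio drops as the window moves forward.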

For visualization purposes, we add an image of each artist to the dataframe (a URL returned by the Last.fm API).

In [29]:
unique_artist_df = dsf.add_image_url_to_artist_count_df(unique_artist_df, last_fm_network)

Two additional features, retrieved for each artist, are also added to the previous dataframe:
- the artist 'hotttnesss'
- the artist 'familiarity'

In [32]:
unique_artist_df = dsf.add_items_to_billboard_df_artist_count(unique_artist_df,
                                                              ["familiarity", "hotttnesss"])

As the data has already been generated, you can skip all the previous steps and build the final artist count dataframe directly from the CSV file.

In [66]:
#unique_artist_df = pd.read_csv('CSV_data/billboard_df-artist_count-test.csv', sep=';')
#del unique_artist_df['Unnamed: 0']

To analyze the relationship between the number of tracks an artist placed in the Billboard Hot 100 and the number of years that artist was present in the charts, we fit a linear regression. Only the top 100 performers are considered in the regression.

To do that we need to import scikit-learn and numpy.

In [12]:
import numpy as np
from sklearn import linear_model

In [45]:
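# Keep the first rows of the artist dataframe.
# Note: the slice [:101] selects 101 rows (positions 0 to 100).
artists_top_100 = unique_artist_df[:101]
# scikit-learn expects 2D arrays, hence the double brackets
artists_X_train = artists_top_100[['Counts']].values
artists_Y_train = artists_top_100[['Years of presence']].values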

In [46]:
# Create linear regression object
regr = linear_model.LinearRegression()
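
For a single feature, an ordinary least-squares fit like this one has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below shows that computation on made-up numbers. The helper name `fit_line` and the toy data are assumptions for illustration, and scikit-learn itself solves the problem with a general least-squares routine rather than this formula.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations around the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1
toy_counts = [1, 2, 3, 4]
toy_years = [3, 5, 7, 9]

slope, intercept = fit_line(toy_counts, toy_years)
print('slope = {}, intercept = {}'.format(slope, intercept))
```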

In [50]:
# Train the model using the training sets
regr.fit(artists_X_train, artists_Y_train)

# The coefficients
print('Coefficients: \n{}'.format(regr.coef_))
print('Intercept: \n{}'.format(regr.intercept_))

Coefficients: 
[[ 0.31592062]]
Intercept: 
[ 3.32566998]
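
Read as a line, the fitted model is roughly: years of presence ≈ 0.316 × number of tracks + 3.33. The sketch below plugs the printed values into that equation. The function name `predicted_years_of_presence` is an assumption for illustration, and the prediction only describes the trend among the artists used for the fit.

```python
# Values copied from the regression output above
SLOPE = 0.31592062
INTERCEPT = 3.32566998


def predicted_years_of_presence(track_count):
    """Years of presence predicted by the fitted line for a given track count."""
    return SLOPE * track_count + INTERCEPT


# Example: an artist with 10 tracks in the year-end charts
print(round(predicted_years_of_presence(10), 2))
```

Under this fit, each additional ranked track is associated with about 0.32 more years of presence in the charts.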
