# Exposition
This is a detective story in which the Twenty Week Outlier and the Epoch of 1991 rear their ugly heads.


<img src=https://i.imgur.com/93XpAU7.png width=300></img>
<img src=https://i.imgur.com/xbCSBt4.png width=300></img>


It is a tale of the incredibly charting Dionne Warwick. And the betrayal of Missy Elliott by Eminem.

It is a tale of Nielsen's SoundScan and Broadcast Data Systems.

And it is a tale of befuddlement at just what makes the top and bottom ranges of the chart so darn special.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np


all_data=pd.read_csv("../input/billboard-the-hot-100-songs/charts.csv")
all_data['date'] = pd.to_datetime(all_data.date)
all_data['year'] = all_data.date.dt.year
all_data['month'] = all_data.date.dt.month
all_data['day-of-year'] = all_data.date.dt.dayofyear

# Top Songs and Artists

## \#2 Hits

One of the most popular Hot 100 metrics is how many weeks a song is ranked number one. However, it would be just as simple to track how many weeks a song is ranked number two.

### Songs with most time at \#2

In [None]:
weeks_at_two = pd.DataFrame(all_data.loc[all_data['rank']==2, 'rank':'artist'].value_counts()).reset_index()
weeks_at_two.columns=['rank', 'song', 'artist', 'weeks-at-two']
weeks_at_two.head(10)

As it turns out, Whitney Houston holds the current record for song with most weeks at \#2 with "Exhale (Shoop Shoop)" having accumulated 11 weeks at second place. Let's see which artists have spent the most time at \#2 throughout their career.

### Artists with most time at \#2

In [None]:
artists_at_two = all_data.loc[all_data['rank']==2]
artists_at_two = pd.DataFrame(artists_at_two['artist'].value_counts()).reset_index()
artists_at_two.columns = ['artist', 'weeks-at-two']

artists_at_two.head(10)

These metrics count total time at number two, regardless of what other positions the songs held. However, I'm also curious about which songs spent the longest at number two without ever reaching number one.

### Songs with most weeks peaking at \#2

In [None]:
top_spot_two = pd.DataFrame(all_data.groupby(['song', 'artist'])['peak-rank'].min()).reset_index()
top_spot_two = top_spot_two.loc[top_spot_two['peak-rank']==2]
top_spot_two = pd.merge(left=top_spot_two, right=weeks_at_two, on=['song', 'artist'])

top_spot_two.sort_values('weeks-at-two', ascending=False).head(10)

Let's see which songs were at \#1 and kept these \#2 hits from reaching the top.

In [None]:
for _, row in top_spot_two.sort_values('weeks-at-two', ascending=False).head().iterrows():
    print("\n" + row['song'] + ",", row['artist'])
    # Weeks in which this song sat at #2
    at_two = all_data[(all_data['song'] == row['song'])
                      & (all_data['artist'] == row['artist'])
                      & (all_data['rank'] == 2)]
    # Look up the #1 song on those same chart dates instead of relying on row order
    top_blocker = all_data[(all_data['rank'] == 1) & all_data['date'].isin(at_two['date'])]
    # Deduplicate on (song, artist) pairs so titles and artists stay aligned
    for _, blocker in top_blocker[['song', 'artist']].drop_duplicates().iterrows():
        print("\tBlocked by " + blocker['song'] + " by " + blocker['artist'])

In general, the list of artists whose songs peaked at \#2 seems quite different from the artists whose songs accumulated total time at \#2.

### Artists with most weeks peaking at \#2

In [None]:
artist_peaked_two = pd.DataFrame(top_spot_two.groupby('artist')['weeks-at-two'].sum()).reset_index()
artist_peaked_two.columns = ['artist', 'weeks-peaked-at-two']
artist_peaked_two.sort_values('weeks-peaked-at-two', ascending=False).head(10)

This seems to show that most of the artists that accumulated time at number two also spent time at number one (though less so with Madonna).

### Artists with most songs peaking at \#2

It looks like Madonna is also leading the pack with six separate songs that peaked at \#2 on the Hot 100 chart.

In [None]:
song_peaked_two = pd.DataFrame(top_spot_two.value_counts('artist')).reset_index()
song_peaked_two.columns = ['artist', 'songs-peaked-at-two']
print(song_peaked_two.head(5), '\n')
top_spot_two[top_spot_two['artist']=='Madonna']

However, Madonna *has* spent time at \#1. What artist spent the most time at \#2 and never made it to \#1 in their career?

### Most charting artists peaking at \#2

In [None]:
artist_weeks = pd.DataFrame(all_data.value_counts('artist')).reset_index()
artist_two_weeks = pd.DataFrame(all_data[all_data['rank']==2].value_counts('artist')).reset_index()
peak_two = all_data.groupby('artist')['peak-rank'].min()==2
peak_two = pd.merge(left=artist_weeks, right=peak_two, on='artist')
peak_two = pd.merge(left=peak_two, right=artist_two_weeks, on='artist')
peak_two.columns = ['artist', 'weeks-on-chart', 'peak-rank-two', 'weeks-at-two']

print('Artists that spent the most total time on the chart and never made it to #1:')
print(peak_two.sort_values(['peak-rank-two', 'weeks-on-chart'], ascending=False).head(10)[['artist', 'weeks-on-chart']])
print('\nArtists that spent the most time at #2 and never made it to #1:')
print(peak_two.sort_values(['peak-rank-two', 'weeks-at-two'], ascending=False).head(10)[['artist', 'weeks-at-two']])

We can also investigate which artists have spent the most time at \#2 in comparison to time at \#1.

### Artists with highest \#2 to \#1 ratio

In [None]:
rank_one = all_data.loc[all_data['rank']==1]
artists_at_one = pd.DataFrame(rank_one['artist'].value_counts()).reset_index()
artists_at_one.columns = ['artist', 'weeks-at-one']

one_v_two = pd.merge(left=artists_at_two, right=artists_at_one, on='artist')
one_v_two['two-per-one'] = one_v_two['weeks-at-two'] / one_v_two['weeks-at-one']

one_v_two.sort_values('two-per-one', ascending=False).head(10)
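
One thing worth keeping in mind about this ratio: `pd.merge` defaults to an inner join, so any artist with weeks at \#2 but none at \#1 drops out of `one_v_two` instead of producing a division by zero. Here is a minimal sketch of that behavior, using made-up artists and week counts rather than the real chart data:

```python
import pandas as pd

# Made-up weekly totals for three hypothetical artists
at_two = pd.DataFrame({'artist': ['a', 'b', 'c'], 'weeks-at-two': [6, 4, 9]})
at_one = pd.DataFrame({'artist': ['a', 'b'], 'weeks-at-one': [3, 8]})

# Default how='inner' keeps only artists present in both frames
ratio = pd.merge(left=at_two, right=at_one, on='artist')
ratio['two-per-one'] = ratio['weeks-at-two'] / ratio['weeks-at-one']

# how='left' with indicator=True reveals who was left out of the ratio
check = pd.merge(at_two, at_one, on='artist', how='left', indicator=True)
dropped = check.loc[check['_merge'] == 'left_only', 'artist'].tolist()

print(ratio)
print('Dropped from the ratio:', dropped)
```

Artist `c` has the most weeks at \#2 in this toy data but never appears in the ratio table, which is the intended reading here: the ratio only ranks artists who have reached \#1 at least once.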

## \#100 Hits

We can just as well look at similar metrics for \#100 on the Hot 100 chart.

### Songs with most time at \#100

In [None]:
weeks_at_hundred = pd.DataFrame(all_data.loc[all_data['rank']==100, 'rank':'artist'].value_counts()).reset_index()
weeks_at_hundred.columns = ['rank', 'song', 'artist', 'weeks-at-hundred']
weeks_at_hundred.head(10)

### Artists with most time at \#100

In [None]:
artists_at_hundred = all_data.loc[all_data['rank']==100]
artists_at_hundred = pd.DataFrame(artists_at_hundred['artist'].value_counts()).reset_index()
artists_at_hundred.columns = ['artist', 'weeks-at-hundred']

artists_at_hundred.head()

### Songs with most weeks peaking at \#100

In [None]:
weeks_at_hundred = pd.DataFrame(all_data.loc[all_data['rank']==100, 'rank':'artist'].value_counts()).reset_index()
weeks_at_hundred.columns=['rank', 'song', 'artist', 'weeks-at-hundred']

top_spot_hundred = pd.DataFrame(all_data.groupby(['song', 'artist'])['peak-rank'].min()).reset_index()
top_spot_hundred = top_spot_hundred.loc[top_spot_hundred['peak-rank']==100]
top_spot_hundred = pd.merge(left=top_spot_hundred, right=weeks_at_hundred, on=['song', 'artist'])

top_spot_hundred.sort_values('weeks-at-hundred', ascending=False).head(10)

### Artists with most songs peaking at \#100

In [None]:
songs_peaked_hundred = pd.DataFrame(top_spot_hundred.value_counts('artist').head(10)).reset_index()
songs_peaked_hundred.columns = ['artist', 'songs-peaked-at-hundred']
songs_peaked_hundred.head()

# Position Persistence

Interestingly enough, the songs with the most accumulated weeks at \#100 have spent much less time at the 100th spot than the \#2 songs spent at their ranking. Here's a chart of the most accumulated weeks at each ranking on the Hot 100 chart.

In [None]:
time_at_place = pd.DataFrame()
time_at_place['rank']=range(1,101)
time_at_place['max_length']=[all_data.loc[all_data['rank']==r, 'rank':'artist'].value_counts().iloc[0] for r in range(1, 101)]

sns.set_theme(style="whitegrid", palette='colorblind')
length_chart = sns.scatterplot(x='rank', y='max_length', data=time_at_place)
length_chart.set_ylabel("Max weeks at position")
length_chart.set_xlabel("Chart position")
length_chart.set_title("Max weeks at each position on the Hot 100");

This plot seems to show that the top 10 positions in the chart have historically logged more weeks with the same songs than the lower 90 positions have.
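
The same "max weeks at each position" figure can also be computed with a single `groupby` instead of filtering the data once per rank. This is only a sketch on a tiny made-up frame (the rows below are hypothetical, not real chart history), assuming one row per song per week as in `all_data`:

```python
import pandas as pd

# Toy stand-in for the chart data (hypothetical rows)
toy = pd.DataFrame({
    'rank':   [1, 1, 1, 2, 2, 2, 100, 100],
    'song':   ['A', 'A', 'B', 'C', 'C', 'C', 'D', 'E'],
    'artist': ['x', 'x', 'y', 'z', 'z', 'z', 'w', 'v'],
})

# Count weeks per (rank, song, artist), then keep the largest count per rank
weeks_per_song = toy.groupby(['rank', 'song', 'artist']).size()
max_weeks = weeks_per_song.groupby(level='rank').max()

print(max_weeks)
```

On the full data this should give the same values as the list comprehension, since both take the largest per-song week count at each rank.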

As above, let's see what happens when this is constrained to peak ranking and not just accumulated time.

In [None]:
songs_by_persistence = pd.DataFrame(all_data.loc[:,'song':'artist'].value_counts()).reset_index()
songs_by_persistence.columns=['song', 'artist', 'weeks-on-chart']

songs_by_peak = pd.DataFrame(all_data.groupby(['song', 'artist'])['peak-rank'].min()).reset_index()
songs_by_peak = pd.merge(left=songs_by_peak, right=songs_by_persistence, on=['song', 'artist'])

print(songs_by_peak.sort_values('weeks-on-chart', ascending=False).head(10), '\n')
weeks_per_peak = sns.scatterplot(x='peak-rank', y='weeks-on-chart', alpha=0.03,
                data=songs_by_peak)
weeks_per_peak.set_title("Weeks on chart for each song's peak ranking");

Plotting peak rank against weeks on the chart seems to show a few interesting things. As before, there is a correlation between ranking on the chart and length of time on the chart. And again, the highest ranked songs can spend disproportionately more time on the chart compared to lower ranking songs.

## Distribution of Weeks on Chart

While the median length of time on the chart is 10 weeks, and each quartile in the ranking gives about 5 more weeks on the chart, the outliers at the top of the rankings skew the average chart time up just past 11 weeks.

In [None]:
print("Metrics for number of weeks that songs have spent on the Hot 100:")
songs_by_peak['weeks-on-chart'].describe()
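
To see how a handful of long-running hits can pull the mean above the median, here is a small illustration with made-up durations (the values are hypothetical and chosen only to exaggerate the effect):

```python
import pandas as pd

# Hypothetical weeks-on-chart values: mostly short runs plus two long-lived hits
weeks = pd.Series([1, 3, 5, 8, 10, 10, 12, 15, 20, 52, 87])

median_weeks = weeks.median()
mean_weeks = weeks.mean()

# Mean again after leaving out the two long-lived outliers
trimmed_mean = weeks[weeks <= 20].mean()

print('median:', median_weeks)
print('mean:', round(mean_weeks, 2))
print('mean without outliers:', round(trimmed_mean, 2))
```

The median barely notices the two long runs, while the mean moves a lot, which is the same kind of skew the `describe()` output hints at.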

The "weeks per peak-rank" chart also seems to show that there are two "outlier modes" around 20 weeks and 1 week. That is, more songs land on these two durations than their peak ranking alone would suggest.

In fact, if you rank chart durations by how many songs share them, the ten most common durations are one through nine weeks plus the twenty-week duration, which comes in second place. Very interesting.

In [None]:
print(songs_by_peak.value_counts('weeks-on-chart').head(10), '\n')
sns.displot(songs_by_peak['weeks-on-chart'], discrete=True)
plt.title('Number of weeks songs spend on the Hot 100');

To dig more deeply into this, let's make a dataframe with metrics about general chart performance per year.

In [None]:
# Add true-peak data to all_data dataframe,
#    select subset of columns,
#    remove duplicate rows for multiple weeks

all_duration = pd.merge(left=all_data, right=songs_by_peak, on=['song', 'artist'])
all_duration = all_duration[['song', 'artist', 'year', 'peak-rank_y', 'weeks-on-chart']].drop_duplicates()
all_duration = all_duration.rename(columns={'peak-rank_y':'true-peak-rank'})

# Make dataframe for number of songs on chart each year
songs_by_year = pd.DataFrame(all_data.groupby('year')['song'].unique()).reset_index()
songs_by_year['songs'] = songs_by_year['song'].apply(lambda x: len(x))
songs_by_year.pop('song')

# Make dataframe for number of artists on chart each year
artists_by_year = pd.DataFrame(all_data.groupby('year')['artist'].unique()).reset_index()
artists_by_year['artists'] = artists_by_year['artist'].apply(lambda x: len(x))
artists_by_year.pop('artist')

chart_per_year = pd.merge(left=artists_by_year, right=songs_by_year, on='year')

# Save descriptive metrics for songs on chart each year
year_mode = pd.DataFrame(all_duration.groupby('year')['weeks-on-chart'].agg(lambda x:x.value_counts().index[0]))
year_mean = pd.DataFrame(all_duration.groupby('year')['weeks-on-chart'].mean())
year_max = pd.DataFrame(all_duration.groupby('year')['weeks-on-chart'].max())
year_stdev = pd.DataFrame(all_duration.groupby('year')['weeks-on-chart'].std())

# Add metrics as columns to dataframe
performance_by_year = pd.merge(left=year_mode, right=year_mean, on='year').reset_index()
performance_by_year = performance_by_year.rename(columns={'weeks-on-chart_x': 'mode-weeks',
                                                          'weeks-on-chart_y': 'mean-weeks'})

performance_by_year = pd.merge(left=performance_by_year, right=year_max, on='year')
performance_by_year = pd.merge(left=performance_by_year, right=year_stdev, on='year')
performance_by_year = performance_by_year.rename(columns={'weeks-on-chart_x': 'max-weeks',
                                                          'weeks-on-chart_y': 'stdev-weeks'})

performance_by_year = pd.merge(left=performance_by_year, right=chart_per_year, on='year')

# Add column for number of #1 peaks for the year
one_peaks = pd.DataFrame(all_duration[all_duration['true-peak-rank']==1].value_counts('year').reset_index())
one_peaks.columns = ['year', 'number-one-peaks']
performance_by_year = pd.merge(left=performance_by_year, right=one_peaks, on='year')

sns.scatterplot(data=performance_by_year, x='year', y='mode-weeks', legend='full')
sns.scatterplot(data=performance_by_year, x='year', y='mean-weeks', legend='full')
sns.scatterplot(data=performance_by_year, x='year', y='stdev-weeks', legend='full')
plt.legend(labels=['Mode', 'Mean', 'StDev'])
plt.title('Weeks on chart per year');

It looks like something happened in the early 1990s that caused a disproportionate number of songs to spend exactly 20 weeks on the chart.

In [None]:
duration_subset = all_duration[all_duration.year>1984]
duration_subset = duration_subset[duration_subset.year<1995]
g = sns.FacetGrid(duration_subset, col="year", col_wrap=5)
g.map(sns.histplot, "weeks-on-chart", discrete=True);

As it turns out, late in 1991, Billboard instituted a ["recurrent rule"](https://www.billboard.com/p/billboard-charts-legend) which currently states that songs that have been on the chart for 20 weeks will be removed if they rank below 50th place.
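
As a rough sketch of how such a rule produces a pile-up at exactly 20 weeks, the snippet below applies a simplified version of it to one made-up song. The function name `is_recurrent`, its parameters, and the weekly ranks are all hypothetical, and the assumption that the 20th week itself still counts as a charting week is mine, not Billboard's wording:

```python
def is_recurrent(weeks_on_chart, rank, week_limit=20, rank_cutoff=50):
    """Simplified rule: a song is dropped once it has charted for
    `week_limit` weeks and ranks below `rank_cutoff`."""
    return weeks_on_chart >= week_limit and rank > rank_cutoff

# Hypothetical song: weekly ranks for a slow fade (first entry = week 1)
ranks = [40, 30, 25, 22, 20, 21, 24, 28, 31, 35,
         38, 41, 44, 47, 50, 53, 56, 58, 60, 62, 65, 70]

weeks_charted = 0
for week, rank in enumerate(ranks, start=1):
    weeks_charted = week            # the song charts this week...
    if is_recurrent(week, rank):    # ...and is removed afterwards
        break

print('Weeks without the rule:', len(ranks))
print('Weeks with the rule:', weeks_charted)
```

A song that would otherwise have drifted down for 22 weeks is cut off at 20, so many slow faders end up sharing the same duration.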

## Distribution of Chart Ranking

Plotting the distribution of peak-rank also shows an interesting pattern.

In [None]:
print(songs_by_peak.value_counts('peak-rank').head(10), '\n')
sns.displot(songs_by_peak['peak-rank'], discrete=True)
plt.title('Distribution of song peak-rankings');

This chart seems to show that songs on the Hot 100 are roughly equally likely to peak anywhere between \#90 and \#20, but are more likely to peak at \#91 and below and even more likely to peak between \#1 and \#10. And songs are more than twice as likely to peak at \#1 as at \#2. Curious.
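
One way to put numbers on "more likely" is to bin peak ranks into bands and divide by the number of positions in each band. The sketch below does that on made-up peak ranks; the band edges (top 10, middle, bottom 10) are my own choice for illustration, and `pd.cut` treats each interval as right-inclusive by default:

```python
import pandas as pd

# Hypothetical peak ranks for a handful of songs (not real chart data)
peaks = pd.Series([1, 1, 4, 9, 15, 33, 47, 62, 78, 88, 93, 97, 100])

# Bands: #1-10, #11-90, #91-100
band_names = ['top 10', 'middle', 'bottom 10']
bands = pd.cut(peaks, bins=[0, 10, 90, 100], labels=band_names)

# Songs per band, divided by how many chart positions the band covers
counts = bands.astype(str).value_counts().reindex(band_names)
positions = pd.Series({'top 10': 10, 'middle': 80, 'bottom 10': 10})
songs_per_position = counts / positions

print(songs_per_position)
```

Running the same steps on `songs_by_peak['peak-rank']` would give a per-position density for each band that can be compared directly.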

### Trends in chart ranking

Plotting weeks-on-chart for \#1 hits over the years shows yet another change happening around 1991. Duration on the chart is fairly tightly grouped for songs until just past 1990, when a lot more variation appears.

In [None]:
all_ones = all_duration[all_duration['true-peak-rank']==1]
sns.scatterplot(data=all_ones, x='year', y='weeks-on-chart', alpha=0.2)
plt.title('Weeks on chart for #1 hits per year');

In [None]:
duration_subset = all_duration[all_duration['year']%5==0]
g=sns.FacetGrid(data=duration_subset, col='year', col_wrap=5)
g.map(sns.scatterplot, 'true-peak-rank', 'weeks-on-chart', alpha=0.2);

# Movement Metrics

Now that we've got that all sorted out, I'm interested in how rankings change week to week.

## Chart movement per ranking

Luckily, the original data frame has a column that shows what rank the song held in the previous week. After filling in the missing values with a "rank 101" to account for songs not being on the chart in the previous week, we can calculate metrics for movement from week to week.

In [None]:
# Only 'last-week' is filled: a missing value means the song was off the chart
all_filled = all_data.fillna({'last-week': 101})

movement_mean = pd.DataFrame(all_filled.groupby('rank')['last-week'].mean()).reset_index()
movement_mean.columns=['rank', 'mean-previous']

movement_median = pd.DataFrame(all_filled.groupby('rank')['last-week'].median()).reset_index()
movement_median.columns=['rank', 'median-previous']

movement_metrics = pd.merge(left=movement_mean, right=movement_median, on='rank')

movement_metrics['change-mean'] = movement_metrics['mean-previous']-movement_metrics['rank']
movement_metrics['change-median'] = movement_metrics['median-previous']-movement_metrics['rank']

sns.scatterplot(data=movement_metrics, x='rank', y='change-mean')
plt.title("Average change from previous rank to each future rank");

Plotting the average change from the previous week's rank against the current rank suggests that songs in the bottom 20 of the Hot 100 are more often former higher-ranked hits on their way down than songs creeping up from off the chart.

To get more insight, let's plot a chart with the distribution of each previous-future ranking pair.

In [None]:
sns.displot(data=all_filled, x='rank', y='last-week')
plt.title('Distribution of each previous-future rank pair');

In contrast to the impression given by the average movement alone, a song that is not on the chart one week but will be on it the next is most likely to enter near the bottom of the chart rather than in the middle.

And after entering the chart, there is fairly little motion week to week--especially at the very top of the chart.
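
The two views can disagree because the mean is lopsided near the bottom of the chart: with the "rank 101" filler, a debut at \#95 can only contribute a change of +6, while a song falling from \#60 contributes -35. A toy example with made-up rows (not taken from the data) shows how the average can point to "fallers" even when half the rows are debuts:

```python
import numpy as np
import pandas as pd

# Hypothetical rows at rank 95: two fallers and two debuts (NaN = not on chart)
toy = pd.DataFrame({
    'rank':      [95, 95, 95, 95],
    'last-week': [60, 70, np.nan, np.nan],
})
toy['last-week'] = toy['last-week'].fillna(101)

# Mean change as in the movement plot: previous rank minus current rank
mean_change = (toy['last-week'] - toy['rank']).mean()

# Share of rows at this rank that are debuts
debut_share = (toy['last-week'] == 101).mean()

print('mean change:', mean_change)
print('debut share:', debut_share)
```

So a negative average change at a low rank does not by itself say that most songs there arrived from above. The joint distribution is the safer guide.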

## Likelihood of entering at a given rank

Below is a plot of the likelihood of entering the Hot 100 at every rank.

In [None]:
entry_points = pd.DataFrame(all_filled[all_filled['last-week']==101].value_counts('rank')).reset_index()
entry_points.columns=['entry-rank', 'count']

print('Most likely to enter at:\n', entry_points.head())
print('\nLeast likely to enter at:\n', entry_points.tail())
print('\nNumber of songs entering at #1:\n', entry_points.loc[entry_points['entry-rank']==1])
print('')

sns.scatterplot(data=entry_points, x='entry-rank', y='count')
plt.title('Likelihood of entering at each rank');

The discontinuity in this chart as well as the distribution of peak-rankings seems to suggest something special about the bottom ten places on the Hot 100.
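
The plot above shows raw counts. To talk about likelihood in the literal sense, the counts can be normalized into shares. A minimal sketch on made-up rows follows, assuming (as the notebook does) that every `last-week` value of 101 marks an entry onto the chart; whether re-entries are also recorded that way is not something I have checked:

```python
import pandas as pd

# Hypothetical rows; last-week == 101 marks a week in which a song entered
toy = pd.DataFrame({
    'rank':      [98, 98, 90, 45, 1, 97, 50],
    'last-week': [101, 101, 101, 101, 101, 99, 60],
})

# Share of entries at each rank, ordered by rank
debuts = toy.loc[toy['last-week'] == 101, 'rank']
entry_share = debuts.value_counts(normalize=True).sort_index()

# Combined share of entries landing in the bottom ten positions (#91-#100)
bottom_ten_share = entry_share[entry_share.index > 90].sum()

print(entry_share)
print('Bottom-ten share:', bottom_ten_share)
```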

# Data by Date

I'm also curious as to how these numbers might have changed throughout the years as the music industry has changed and the way Billboard metrics have changed.

## Chart metrics per year

In [None]:
data_by_year = pd.DataFrame()
songs_by_year = pd.DataFrame(all_data.groupby('year')['song'].unique()).reset_index()
songs_by_year['songs'] = songs_by_year['song'].apply(lambda x: len(x))
songs_by_year.pop('song')

artists_by_year = pd.DataFrame(all_data.groupby('year')['artist'].unique()).reset_index()
artists_by_year['artists'] = artists_by_year['artist'].apply(lambda x: len(x))
artists_by_year.pop('artist')

data_by_year = pd.merge(left=songs_by_year, right=artists_by_year, on='year')

Let's see what plotting the number of songs and artists on the Hot 100 each year looks like.

In [None]:
sns.scatterplot(data=data_by_year, y='songs', x='year')
sns.scatterplot(data=data_by_year, y='artists', x='year')
plt.title('Number of charting songs and artists per year')
plt.ylim(0)
plt.legend(labels=['Songs', 'Artists']);

These metrics seem to roughly track each other in terms of chart turnover, but the gap between them also shows that, on average, an artist on the chart has more than one song charting in a given year. Just to confirm, let's plot songs-per-artist through the years.

In [None]:
data_by_year['songs-per-artist'] = data_by_year['songs'] / data_by_year['artists']
sns.scatterplot(data=data_by_year, y='songs-per-artist', x='year')
plt.title('Songs per artist on the Hot 100 per year');

So it looks like the 1960s had more song turnover **and** more artist turnover, but song turnover was so high, even for individual artists, that artists charted more songs per year than in the following decades. However, even though song turnover and artist turnover have both risen again in recent years, artist turnover has risen more, so songs-per-artist has not seen a spike of the same size as either of the other two metrics.
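
A songs-per-artist ratio is an average, so it cannot say how many artists actually chart more than one song in a year. The sketch below computes both the average and the share of multi-song artists on a small made-up frame (the years, artists, and songs are hypothetical):

```python
import pandas as pd

# Hypothetical chart rows (one row per song per week), not real data
toy = pd.DataFrame({
    'year':   [1965, 1965, 1965, 1965, 1965, 2020, 2020, 2020],
    'artist': ['a', 'a', 'a', 'b', 'c', 'd', 'e', 'e'],
    'song':   ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's7'],
})

# Distinct songs per artist within each year
songs_per_artist = toy.groupby(['year', 'artist'])['song'].nunique()

# Average songs per artist versus the share of artists with more than one song
summary = pd.DataFrame({
    'mean-songs-per-artist': songs_per_artist.groupby(level='year').mean(),
    'share-multi-song': (songs_per_artist > 1).groupby(level='year').mean(),
})

print(summary)
```

In the toy 1965 data a single prolific artist lifts the average well above one even though only a third of the artists charted more than one song, which is the kind of pattern the ratio alone would hide.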

# Conclusion and Further Work
In sum:
* Song ranking is correlated with total time on the chart
* Songs tend to enter at the bottom of the chart and not move around that quickly
* Since 1991, Billboard has removed low-charting songs after 20 weeks
* Billboard ranking and/or music industry has changed since the 90s so that songs' rankings and durations vary much more than in previous decades

While there was a conclusive answer for the mystery of the 20-week chart duration, it seems like there is still a mystery as to why the bottom 10 and top 20 look different from the rest of the chart.