# I have the following hypothesis:

* In the mid to late 1980s, there was a big push to put a warning label on records and CDs that were deemed 'explicit' or had suggestive or 'satanic' lyrics. The label was supposed to help parents monitor the music their children were buying. I propose it had the opposite effect: albums and music with explicit lyrics and images became more popular after the mid to late 1980s.

I will test this hypothesis by creating different groups of data separated by genre, by explicit vs non-explicit, and by a combination of both. I will also separate out the top songs so that I can test popularity. The graphs and statistical tests will compare these groups before and after an arbitrary date (sometime between 1985 and 1990). If songs and albums with explicit lyrics and images (rap, hip-hop, heavy metal, etc.) were more popular than non-explicit songs after that year, it would suggest my hypothesis is correct.
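
The before/after comparison described above can be sketched on toy data. This is only an illustration: the function name `explicit_share_by_period`, the cutoff of 1985, and the six toy rows are assumptions made for the example, not values from the dataset.

```python
import pandas as pd


def explicit_share_by_period(songs: pd.DataFrame, cutoff_year: int) -> pd.Series:
    """Return the share of explicit songs before the cutoff year and from it onward."""
    # Label each row "before" or "after" depending on its year
    period = songs["year"].ge(cutoff_year).map({False: "before", True: "after"})
    # The mean of a 0/1 column is the share of explicit songs in each period
    return songs.groupby(period)["explicit"].mean()


# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "year": [1980, 1983, 1984, 1988, 1992, 1995],
        "explicit": [0, 0, 0, 1, 1, 0],
    }
)

shares = explicit_share_by_period(toy, cutoff_year=1985)
print(shares)
```

Because the cutoff is arbitrary, it may be worth repeating the comparison for several years between 1985 and 1990 to see whether the result depends on the choice.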

In [None]:
%reload_ext nb_black

In [None]:
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.multitest import multipletests
import ast
from datetime import datetime

%matplotlib inline

# Read dataframes, drop duplicate genres

In [None]:
hot_100 = pd.read_csv("data/Hot_Stuff.csv")
spotify = pd.read_excel("data/Hot_100_Audio_Features.xlsx")

In [None]:
spotify["spotify_genre"].explode().value_counts()

In [None]:
spotify = spotify.drop_duplicates()

# Drop columns that are not needed

In [None]:
hot_100 = hot_100.drop(columns=["url"])

In [None]:
spotify = spotify.drop(
    columns=[
        "spotify_track_id",
        "spotify_track_preview_url",
        "spotify_track_duration_ms",
        "spotify_track_popularity",
        "danceability",
        "energy",
        "key",
        "loudness",
        "mode",
        "acousticness",
        "speechiness",
        "liveness",
        "instrumentalness",
        "valence",
        "tempo",
        "time_signature",
    ]
)

# Basic table description data

In [None]:
hot_100.shape

In [None]:
hot_100.isna().sum()

In [None]:
hot_100.head()

In [None]:
hot_100.dtypes

In [None]:
spotify.shape

In [None]:
spotify.isna().mean()

In [None]:
spotify.head()

In [None]:
spotify.dtypes

# Fill null values, convert previous week position to an integer

In [None]:
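# Assign the result back instead of using inplace=True on a column selection,
# which may not update the original dataframe in recent pandas versions
spotify["spotify_track_album"] = spotify["spotify_track_album"].fillna(" ")
hot_100["Previous Week Position"] = hot_100["Previous Week Position"].fillna(0)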

In [None]:
hot_100 = hot_100.astype({"Previous Week Position": int})

In [None]:
# Missing genres become an empty list literal so ast.literal_eval can parse them
spotify["spotify_genre"] = spotify["spotify_genre"].fillna("[]")
spotify["spotify_genre_list"] = spotify["spotify_genre"].apply(ast.literal_eval)

# Join the 2 tables by SongID (song and performer)

In [None]:
full_table = hot_100.merge(spotify, left_on="SongID", right_on="SongID")
full_table
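
An inner join silently drops chart rows that have no matching Spotify row, and it multiplies rows if a `SongID` appears more than once on the Spotify side. A minimal sketch of how this could be checked, using the `validate` and `indicator` parameters of `DataFrame.merge`; the two toy frames below are invented for illustration and only borrow column names from the tables above.

```python
import pandas as pd

# Toy stand-ins for the chart and audio-feature tables (invented values)
chart = pd.DataFrame(
    {"SongID": ["a1", "a1", "b2", "c3"], "Week Position": [5, 7, 40, 90]}
)
features = pd.DataFrame(
    {"SongID": ["a1", "b2"], "spotify_track_explicit": [False, True]}
)

# validate raises MergeError if SongID is duplicated in the right-hand table;
# indicator adds a "_merge" column showing where each row came from
checked = chart.merge(
    features, on="SongID", how="left", validate="many_to_one", indicator=True
)

# Chart rows without any audio features
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "chart rows have no Spotify match")
```

If `validate` raises on the real data, the Spotify table still contains more than one row per song and would need further de-duplication before the join.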

# Remove unnecessary columns, re-label columns

In [None]:
full_table = full_table.drop(columns=["Performer_y", "Song_y"])

In [None]:
full_table.rename(
    columns={
        "WeekID": "week",
        "Week Position": "week_pos",
        "Song_x": "song",
        "Performer_x": "artist",
        "SongID": "song_id",
        "Instance": "instance",
        "Previous Week Position": "prev_week_pos",
        "Peak Position": "peak_pos",
        "Weeks on Chart": "weeks_on_chart",
        "spotify_genre": "genre_str",
        "spotify_genre_list": "genre_list",
        "spotify_track_album": "album",
        "spotify_track_explicit": "explicit",
    },
    inplace=True,
)

# Format week to a date format

In [None]:
# Parse the month/day/year strings directly into pandas datetimes
full_table["week"] = pd.to_datetime(full_table["week"], format="%m/%d/%Y")

# Put month and year in new columns

In [None]:
full_table["year"], full_table["month"] = (
    full_table["week"].dt.year,
    full_table["week"].dt.month,
)

# Check explicit vs not explicit counts for reference

In [None]:
full_table["explicit"].value_counts()

# Create a table with means, only use songs in the top 75 to make the dataframe a little smaller.  This is the main table used in analysis.

In [None]:
top_song_limit = 75
top_songs = full_table[full_table["week_pos"] <= top_song_limit]
top_songs = top_songs.groupby(["song_id", "genre_str"]).mean()
top_songs = top_songs.reset_index()

In [None]:
top_songs["year"] = round(top_songs["year"])
top_songs["month"] = round(top_songs["month"])
top_songs

# Add a genre label column.  Evaluate all of the genres in the list, and make a single genre decision.  Because there are lots of songs with a 'pop' genre, I override those with the other genres.

In [None]:
genre_list = ["rap", "hip hop", "metal", "country", "pop"]

In [None]:
for genre in genre_list:
    top_songs[genre] = top_songs["genre_str"].str.contains(fr"\b{genre}\b")

In [None]:
top_songs.loc[top_songs["pop"], "genre_label"] = "pop"
top_songs.loc[top_songs["rap"] | top_songs["hip hop"], "genre_label"] = "rap_hiphop"
top_songs.loc[top_songs["metal"], "genre_label"] = "metal"
top_songs.loc[top_songs["country"], "genre_label"] = "country"

top_songs["genre_label"].value_counts()
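
The order of the `.loc` assignments above decides which label wins when a song matches several genres: later assignments overwrite earlier ones, so the effective priority is country, then metal, then rap/hip-hop, then pop. A small sketch that makes this priority explicit; the names `PRIORITY` and `label_genre` are hypothetical and not part of the notebook.

```python
import re
from typing import Optional

# Highest priority first, mirroring the effect of the assignment order above
PRIORITY = [
    ("country", ["country"]),
    ("metal", ["metal"]),
    ("rap_hiphop", ["rap", "hip hop"]),
    ("pop", ["pop"]),
]


def label_genre(genre_str: str) -> Optional[str]:
    """Return the first label whose keyword appears as a whole word, else None."""
    for label, keywords in PRIORITY:
        if any(re.search(rf"\b{re.escape(word)}\b", genre_str) for word in keywords):
            return label
    return None


print(label_genre("['dance pop', 'pop rap']"))
print(label_genre("['country pop']"))
print(label_genre("['soul']"))
```

The word boundaries matter here: a genre such as 'trap' does not count as 'rap', which matches the behavior of the `\b` pattern used in the loop above.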

# Split the top songs by genre.  These violin plots are interesting because they show:

* There were very few 'explicit' songs before 1990
* Rap and hip-hop became popular right after the PMRC was formed in 1985
* Rap has many more explicit songs than not explicit songs

In [None]:
pop = top_songs[top_songs["genre_label"] == "pop"]
rap_hiphop = top_songs[top_songs["genre_label"] == "rap_hiphop"]
country = top_songs[top_songs["genre_label"] == "country"]
metal = top_songs[top_songs["genre_label"] == "metal"]

In [None]:
ax = sns.violinplot(x="explicit", y="year", data=top_songs)
ax.set_xticklabels(["Not Explicit", "Explicit"])
ax.set_xlabel("")
plt.savefig("violin_expl.png")
plt.show()

In [None]:
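# Fix the category order so the tick labels below are guaranteed to match the violins
genre_order = ["pop", "rap_hiphop", "country", "metal"]
ax = sns.violinplot(x="genre_label", y="year", data=top_songs, order=genre_order)
ax.set_xticklabels(["Pop", "Rap/Hip-Hop", "Country", "Heavy Metal"])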
ax.set_xlabel("")
plt.savefig("violin_genre.png")
plt.show()

In [None]:
plt.hist(pop["explicit"], label="pop")
plt.hist(rap_hiphop["explicit"], label="rap/hip hop")
plt.hist(country["explicit"], label="country")
plt.hist(metal["explicit"], label="metal")
plt.legend()
plt.show()

## ANOVA Assumption Check - Popularity by genre 
### The week position is the mean of all the weekly Hot 100 positions of each song.

In [None]:
qqplot(rap_hiphop["week_pos"], line="s")
plt.show()

In [None]:
qqplot(metal["week_pos"], line="s")
plt.show()

In [None]:
qqplot(country["week_pos"], line="s")
plt.show()

In [None]:
qqplot(pop["week_pos"], line="s")
plt.show()

# These plots show the data is not normally distributed, so I will perform non-parametric tests
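
A Q-Q plot is a visual check. A Shapiro-Wilk test could serve as a numeric complement. The sketch below runs `scipy.stats.shapiro` on synthetic, uniformly distributed values bounded like chart positions; the sample, the seed, and the 0.05 threshold are assumptions made for the example, not results from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic sample: bounded and flat, loosely like averaged chart positions
rng = np.random.default_rng(0)
toy_positions = rng.uniform(1, 75, size=500)

# Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution
stat, p_value = stats.shapiro(toy_positions)
looks_normal = p_value > 0.05

print(f"W = {stat:.3f}, p = {p_value:.3g}, looks normal: {looks_normal}")
```

With large samples the test rejects normality for even small deviations, so it is best read alongside the Q-Q plots rather than instead of them.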


## Kruskal Wallis test - Popularity by genre

In [None]:
plt.hist(pop["week_pos"], label="pop")
plt.hist(rap_hiphop["week_pos"], label="rap/hip hop")
plt.hist(country["week_pos"], label="country")
plt.hist(metal["week_pos"], label="metal")
plt.xlabel("Hot 100 Ranking")
plt.ylabel("# of Songs")
plt.legend()
plt.savefig("hist_genre.png")
plt.show()

# This plot further shows the data is not normally distributed, so I will perform non-parametric tests

In [None]:
_, p = stats.kruskal(
    pop["week_pos"], rap_hiphop["week_pos"], country["week_pos"], metal["week_pos"]
)
p

## At least one genre's median is different from the others

In [None]:
_, p1 = stats.mannwhitneyu(pop["week_pos"], rap_hiphop["week_pos"])
_, p2 = stats.mannwhitneyu(pop["week_pos"], country["week_pos"])
_, p3 = stats.mannwhitneyu(pop["week_pos"], metal["week_pos"])
_, p4 = stats.mannwhitneyu(rap_hiphop["week_pos"], country["week_pos"])
_, p5 = stats.mannwhitneyu(rap_hiphop["week_pos"], metal["week_pos"])
_, p6 = stats.mannwhitneyu(country["week_pos"], metal["week_pos"])


p_values = [p1, p2, p3, p4, p5, p6]
reject, corr_p, sidak, bonf = multipletests(p_values, alpha=0.05)

In [None]:
reject

In [None]:
corr_p

In [None]:
sidak

In [None]:
bonf
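
When reading the four outputs above, it helps to know that `multipletests` uses the Holm-Sidak method (`method="hs"`) by default, and that its third and fourth return values are corrected alpha levels (Sidak and Bonferroni), not p-values. A small sketch on invented p-values, with the method named explicitly; `"holm"` is chosen here only because its arithmetic is easy to follow by hand.

```python
from statsmodels.stats.multitest import multipletests

# Invented p-values for illustration only
toy_p_values = [0.001, 0.02, 0.04, 0.30]

reject, corrected, alpha_sidak, alpha_bonf = multipletests(
    toy_p_values, alpha=0.05, method="holm"
)

print(reject)       # one boolean per test: reject the null hypothesis or not
print(corrected)    # adjusted p-values, compared against alpha
print(alpha_sidak)  # Sidak-corrected alpha for a single test
print(alpha_bonf)   # Bonferroni-corrected alpha: alpha / number of tests
```

Only `reject` and `corrected` say anything about the individual comparisons; the two alpha values depend only on `alpha` and the number of tests.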

## Based on our analysis we found:

* The mean rank of at least one of our samples is significantly different than the others
* There is a significant difference in mean rank between the popularity of pop and rap/hip-hop
* There is a significant difference in mean rank between the popularity of rap/hip-hop and metal

## ANOVA Assumption Check - Explicit vs not explicit by year

In [None]:
sns.violinplot(x="explicit", y="year", data=top_songs)
plt.show()

In [None]:
top_songs["explicit"].value_counts()

### I added rap and hip-hop to the tests because it became popular right after the PMRC was formed in 1985

In [None]:
top_expl = top_songs[top_songs["explicit"] == 1]
top_not_expl = top_songs[top_songs["explicit"] == 0]

rap_hiphop_expl = rap_hiphop[rap_hiphop["explicit"] == 1]
rap_hiphop_not_expl = rap_hiphop[rap_hiphop["explicit"] == 0]

In [None]:
top_expl.sort_values("year").head()

In [None]:
top_not_expl.sort_values("year").head()

# I'm only analyzing explicit songs after 1985 because there were so few of them before 1985

In [None]:
top_expl = top_expl[top_expl["year"] > 1985]
top_expl

In [None]:
qqplot(top_songs["year"], line="s")
plt.show()

In [None]:
qqplot(rap_hiphop["year"], line="s")
plt.show()

# These plots show the data is not normally distributed, so I will perform non-parametric tests

## Kruskal Wallis - Explicit vs not explicit by year

In [None]:
_, p = stats.kruskal(
    top_expl["year"],
    top_not_expl["year"],
    rap_hiphop_expl["year"],
    rap_hiphop_not_expl["year"],
)
p

In [None]:
_, p1 = stats.mannwhitneyu(top_expl["year"], top_not_expl["year"])
_, p2 = stats.mannwhitneyu(top_expl["year"], rap_hiphop_expl["year"])
_, p3 = stats.mannwhitneyu(top_expl["year"], rap_hiphop_not_expl["year"])

_, p4 = stats.mannwhitneyu(top_not_expl["year"], rap_hiphop_expl["year"])
_, p5 = stats.mannwhitneyu(top_not_expl["year"], rap_hiphop_not_expl["year"])
_, p6 = stats.mannwhitneyu(rap_hiphop_expl["year"], rap_hiphop_not_expl["year"])


p_values = [p1, p2, p3, p4, p5, p6]
reject, corr_p, sidak, bonf = multipletests(p_values, alpha=0.05)

In [None]:
reject

In [None]:
corr_p

In [None]:
sidak

In [None]:
bonf
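
Both test sections write out six pairwise Mann-Whitney comparisons by hand. The same pattern could be generated with `itertools.combinations`, which also keeps each p-value attached to the pair it belongs to. The group names and values below are invented for illustration. `alternative="two-sided"` is passed explicitly because the default has differed between SciPy versions.

```python
from itertools import combinations

from scipy import stats

# Toy groups, invented for illustration only
groups = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [11, 12, 13, 14, 15, 16, 17, 18],
    "c": [2, 3, 4, 5, 6, 7, 8, 9],
}

# One two-sided Mann-Whitney U test per unordered pair of groups
pairwise_p = {}
for (name_1, values_1), (name_2, values_2) in combinations(groups.items(), 2):
    _, p = stats.mannwhitneyu(values_1, values_2, alternative="two-sided")
    pairwise_p[(name_1, name_2)] = p

for pair, p in pairwise_p.items():
    print(pair, round(p, 4))
```

The values of `pairwise_p` can be passed to `multipletests` as a list, and the keys then tell which corrected p-value belongs to which comparison.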

# Conclusions:
## 1)
* There were very few 'explicit' songs before 1990
* Rap and hip-hop became popular right after the PMRC was formed in 1985
* Rap has many more explicit songs than not explicit songs

## 2)
* There is a significant difference in mean rank between the popularity of pop songs and rap/hip-hop songs
* There is a significant difference in mean rank between the popularity of rap/hip-hop songs and metal songs

## 3)
* There is a significant difference in mean rank between the chart years of explicit songs and non-explicit songs
* There is a significant difference in mean rank between the chart years of explicit rap songs and all non-explicit songs