## Predicting song popularity

As both a machine learning expert and a member of a Taylor Swift fan club, you are developing a machine learning model that predicts the popularity of a Taylor Swift song on Spotify from information about its audio characteristics.

("Popularity" on Spotify is a score from 0-100 that includes factors like the number of times a song is streamed, how recently it is streamed, and how often users save the song to their libraries.)

Here are some `scikit-learn` docs that you may find useful:

- [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
- [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [GroupShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html)
- [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)

Here are some `pandas` docs that you may find useful:

- [assign](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html)
- [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
- [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
- [reset_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)
- [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
- [idxmax](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html)

Use the following random state in your workspace:

> random_state = 5

| Name | Type | Description |
| ---- | ---- | ---- |
|`X`	|pandas dataframe	|1. Feature data.|
|`y`	|pandas dataframe or dataseries	|1. Target variable.|
|`X_tr`	|pandas dataframe	|1. Feature data (training).|
|`X_ts`	|pandas dataframe	|1. Feature data (test).|
|`y_tr`	|pandas dataframe or dataseries	|1. Target variable (training).|
|`y_ts`	|pandas dataframe or dataseries	|1. Target variable (test).|
|`y_pred_rf`	|1d numpy array	|2. Prediction of Random Forest on test data.|
|`rsq_rf`	|float	|2. R2 score of Random Forest on test data.|
|`X_tr_grp`	|pandas dataframe	|3. Feature data (training, grouped by base name).|
|`X_ts_grp`	|pandas dataframe	|3. Feature data (test, grouped by base name).|
|`y_tr_grp`	|pandas dataframe or dataseries	|3. Target variable (training, grouped by base name).|
|`y_ts_grp`	|pandas dataframe or dataseries	|3. Target variable (test, grouped by base name).|
|`y_pred_grp`	|1d numpy array	|3. Prediction of Random Forest on test data (grouped by base name).|
|`rsq_grp`	|float	|3. R2 score of Random Forest on test data (grouped by base name).|
|`X_tr_srt`	|pandas dataframe	|4. Feature data (training, chronological order).|
|`X_ts_srt`	|pandas dataframe	|4. Feature data (test, chronological order).|
|`y_tr_srt`	|pandas dataframe or dataseries	|4. Target variable (training, chronological order).|
|`y_ts_srt`	|pandas dataframe or dataseries	|4. Target variable (test, chronological order).|
|`y_pred_srt`	|1d numpy array	|4. Prediction of Random Forest on test data (chronological order).|
|`rsq_srt`	|float	|4. R2 score of Random Forest on test data (chronological order).|
|`X_tr_max`	|pandas dataframe	|5. (Bonus) Feature data (training, max popularity sample).|
|`X_ts_max`	|pandas dataframe	|5. (Bonus) Feature data (test, max popularity sample).|
|`y_tr_max`	|pandas dataframe or dataseries	|5. (Bonus) Target variable (training, max popularity sample).|
|`y_ts_max`	|pandas dataframe or dataseries	|5. (Bonus) Target variable (test, max popularity sample).|
|`y_pred_max`	|1d numpy array	|5. (Bonus) Prediction of Random Forest on test data (max popularity sample).|
|`rsq_max`	|float	|5. (Bonus) R2 score of Random Forest on test data (max popularity sample).|

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit, TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

Assign the value specified in the question page to the `random_state` variable. Use this variable throughout your work on this question.

In [2]:
random_state = 5

You are working with a dataset of Taylor Swift songs:

In [3]:
df = pd.read_csv("taylor_swift_spotify.csv", index_col = 0)
df.head()

Unnamed: 0,name,album,release_date,track_number,id,uri,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
0,Fortnight (feat. Post Malone),THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,1,6dODwocEuGzHAavXqTbwHv,spotify:track:6dODwocEuGzHAavXqTbwHv,0.502,0.504,0.386,1.5e-05,0.0961,-10.976,0.0308,192.004,0.281,82,228965
1,The Tortured Poets Department,THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,2,4PdLaGZubp4lghChqp8erB,spotify:track:4PdLaGZubp4lghChqp8erB,0.0483,0.604,0.428,0.0,0.126,-8.441,0.0255,110.259,0.292,79,293048
2,My Boy Only Breaks His Favorite Toys,THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,3,7uGYWMwRy24dm7RUDDhUlD,spotify:track:7uGYWMwRy24dm7RUDDhUlD,0.137,0.596,0.563,0.0,0.302,-7.362,0.0269,97.073,0.481,80,203801
3,Down Bad,THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,4,1kbEbBdEgQdQeLXCJh28pJ,spotify:track:1kbEbBdEgQdQeLXCJh28pJ,0.56,0.541,0.366,1e-06,0.0946,-10.412,0.0748,159.707,0.168,82,261228
4,"So Long, London",THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,5,7wAkQFShJ27V8362MqevQr,spotify:track:7wAkQFShJ27V8362MqevQr,0.73,0.423,0.533,0.00264,0.0816,-11.388,0.322,160.218,0.248,80,262974


You may add some additional code here to explore the data (the following cell is not graded).

#### 1. Prepare data

To prepare data for use by your model, create a data frame of features, `X`, including the following columns:

* `acousticness`
* `danceability`
* `energy`
* `instrumentalness`
* `liveness`
* `loudness`
* `speechiness`
* `tempo`
* `valence`
* `duration_ms`

and a data series `y` with the song's popularity (`popularity`).


In [4]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X = df[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'duration_ms']]
y = df['popularity']

Then, split `X` and `y` into a training and test set. Reserve 100 samples for the test set, and put the rest in the training set. Shuffle the data, and use the random state specified in the question page.

In [5]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=100, shuffle=True, random_state=random_state)
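
As a minimal sketch of how an integer `test_size` behaves, the toy example below (the names `X_demo`, `y_demo`, and the `Xd_*`/`yd_*` variables are hypothetical and not part of the exam workspace) reserves an absolute number of rows for the test set rather than a fraction:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y
X_demo = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2.0})
y_demo = pd.Series(np.arange(10), name="target")

# An integer test_size is an absolute number of samples, not a fraction
Xd_tr, Xd_ts, yd_tr, yd_ts = train_test_split(
    X_demo, y_demo, test_size=3, shuffle=True, random_state=5
)

print(Xd_tr.shape, Xd_ts.shape)  # 7 training rows, 3 test rows

# Features and target keep matching row labels after the split
print((Xd_ts.index == yd_ts.index).all())
```

Checking `len(X_ts)` in your own workspace in the same way should confirm that exactly 100 rows were held out.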

#### 2. Predict song popularity

Next, train a `RandomForestRegressor` to predict the popularity of each song.

* Use the random state specified in the question page.
* Put 100 trees in your random forest.
* Leave other settings at their default values.

Save the predictions of the model on the test set in `y_pred_rf`, and its R2 score in `rsq_rf`.

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
rf.fit(X_tr, y_tr)

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
y_pred_rf = rf.predict(X_ts)
rsq_rf = r2_score(y_ts, y_pred_rf)
print(rsq_rf)

0.26509064471639


#### 3. Group split

You notice that some songs are represented more than once in the data (e.g. in different versions). You are concerned that this duplication may cause data leakage, since a song that is popular in one version is likely to also be popular in another version of the same song.

For example, both versions of the song "Fortnight" in the data are currently very popular:

In [8]:
df[df['name'].str.contains("Fortnight", case=False, na=False)]

Unnamed: 0,name,album,release_date,track_number,id,uri,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
0,Fortnight (feat. Post Malone),THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY,2024-04-19,1,6dODwocEuGzHAavXqTbwHv,spotify:track:6dODwocEuGzHAavXqTbwHv,0.502,0.504,0.386,1.5e-05,0.0961,-10.976,0.0308,192.004,0.281,82,228965
31,Fortnight (feat. Post Malone),THE TORTURED POETS DEPARTMENT,2024-04-18,1,2OzhQlSqBEmt7hmkYxfT6m,spotify:track:2OzhQlSqBEmt7hmkYxfT6m,0.502,0.504,0.386,1.5e-05,0.0961,-10.976,0.0308,192.004,0.281,91,228965


Given the following `clean_title` function (which you will not change):

In [9]:
import re
def clean_title(title):
    title_no_parentheticals = re.sub(r'\s*\(.*?\)', '', title)
    return re.sub(r'\s*-\s*.*$', '', title_no_parentheticals).strip()

create a `numpy` array `base_name` which has the "cleaned" title for each song title in the data. 

Different versions of the same song, e.g. "Shake It Off" and "Shake It Off (Taylor's Version)", will have the same base name.
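
To see what `clean_title` does, here is a small sketch that repeats the function definition from above and applies it to a few titles. Only "Shake It Off", "Shake It Off (Taylor's Version)", and "Fortnight (feat. Post Malone)" come from this page; the other strings are illustrative assumptions and may not appear in the data.

```python
import re

# Same definition as in the cell above, repeated so this sketch runs on its own
def clean_title(title):
    title_no_parentheticals = re.sub(r'\s*\(.*?\)', '', title)
    return re.sub(r'\s*-\s*.*$', '', title_no_parentheticals).strip()

print(clean_title("Shake It Off"))                    # -> Shake It Off
print(clean_title("Shake It Off (Taylor's Version)")) # -> Shake It Off
print(clean_title("Fortnight (feat. Post Malone)"))   # -> Fortnight

# Illustrative title (assumption): a " - suffix" is stripped as well
print(clean_title("Example Song - Live"))             # -> Example Song

# Illustrative title (assumption): a hyphen inside the title is also treated
# as a separator, so everything from the hyphen onward is dropped
print(clean_title("Anti-Hero"))                       # -> Anti
```

The last case shortens the title itself, but every version of that song still maps to the same base name, which is all the group split needs.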


In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
base_name = df['name'].apply(clean_title).values

Now, you will use a `GroupShuffleSplit` to create a training and test set where samples with the same `base_name` will either all be in the training set together, or will all be in the test set together.

* Use 1 split.
* Reserve 100 samples for the test set, and put the rest in the training set.
* Use the random state specified in the question page

and save the training and test data in `X_tr_grp`, `X_ts_grp`, `y_tr_grp`, `y_ts_grp`.

In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
gss = GroupShuffleSplit(n_splits=1, test_size=100, random_state=random_state)
for train_idx, test_idx in gss.split(X, y, groups=base_name):
    X_tr_grp = X.iloc[train_idx]
    X_ts_grp = X.iloc[test_idx]
    y_tr_grp = y.iloc[train_idx]
    y_ts_grp = y.iloc[test_idx]
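
The toy sketch below (with hypothetical `groups`, `X_demo`, and `y_demo`) illustrates two properties of `GroupShuffleSplit` that are worth checking on the real split: no group appears on both sides, and, according to the scikit-learn documentation, an integer `test_size` counts *groups* rather than individual samples, so the number of test rows can differ from the integer that was passed.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical group labels: 5 distinct groups spread over 8 samples
groups = np.array(["a", "a", "b", "c", "c", "c", "d", "e"])
X_demo = np.arange(len(groups)).reshape(-1, 1)
y_demo = np.arange(len(groups))

# test_size=2 asks for 2 test groups, not 2 test samples
gss_demo = GroupShuffleSplit(n_splits=1, test_size=2, random_state=5)
train_idx, test_idx = next(gss_demo.split(X_demo, y_demo, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])

print("test groups:", len(test_groups))   # 2
print("test samples:", len(test_idx))     # depends on which groups were drawn
print("shared groups:", train_groups & test_groups)  # empty set
```

In your workspace, comparing `len(X_ts_grp)` with the number of unique base names in the test indices shows how many rows the group split actually held out.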

Train a random forest (using the same configuration as before) on *this* data, and report the results in `y_pred_grp` and `rsq_grp`.

In [14]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
rf_grp = RandomForestRegressor(n_estimators=100, random_state=random_state)
rf_grp.fit(X_tr_grp, y_tr_grp)
y_pred_grp = rf_grp.predict(X_ts_grp)
rsq_grp = r2_score(y_ts_grp, y_pred_grp)
print(rsq_grp)

0.14595588803194648


#### 4. Time split

You ultimately want to predict the popularity of *newly* released songs based on *past* data, but you divided data into training and test sets without consideration of time.

You want to make sure that the evaluation above is not an "overly optimistic" one, due to temporal data leakage.

In the next cell,

* Make a copy of the data frame in `df_srt`.
* In `df_srt`, convert the release date to a `datetime` field.
* Sort by this field, so that the oldest songs are the first rows in `df_srt`, then by track number (i.e. songs released on the same date are ordered by track number). (You may want to use `sort_values`)
* Reset the index of the data frame (and drop the old index) so that rows are re-numbered starting at zero.  (You may want to use `reset_index`)

In [17]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
df_srt = df.copy()
df_srt['release_date'] = pd.to_datetime(df_srt['release_date'])
df_srt = df_srt.sort_values(by=['release_date', 'track_number'])
df_srt = df_srt.reset_index(drop=True)
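
The same steps can be sketched on a tiny, made-up data frame (the `toy` frame and its values are hypothetical) to confirm that rows end up oldest first, with ties on the date broken by track number, and that the index is renumbered from zero:

```python
import pandas as pd

# Hypothetical rows, deliberately out of order
toy = pd.DataFrame({
    "name": ["B2", "A1", "B1", "A2"],
    "release_date": ["2020-07-24", "2006-10-24", "2020-07-24", "2006-10-24"],
    "track_number": [2, 1, 1, 2],
})

toy_srt = toy.copy()
toy_srt["release_date"] = pd.to_datetime(toy_srt["release_date"])

# Primary key: release date (ascending); secondary key: track number
toy_srt = toy_srt.sort_values(by=["release_date", "track_number"])

# drop=True discards the old index instead of keeping it as a column
toy_srt = toy_srt.reset_index(drop=True)

print(toy_srt["name"].tolist())  # ['A1', 'A2', 'B1', 'B2']
print(toy_srt.index.tolist())    # [0, 1, 2, 3]
```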

Now, create `X_srt` and `y_srt` using the same columns as before, but they will now be sorted.

In [18]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X_srt = df_srt[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'duration_ms']]
y_srt = df_srt['popularity']

Use them to divide the data into a training and test set *without* shuffling. (As before, leave 100 samples for the test set.)

In [23]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X_tr_srt, X_ts_srt, y_tr_srt, y_ts_srt = train_test_split(X_srt, y_srt, test_size=100, shuffle=False, random_state=random_state)
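
As a quick sketch of what `shuffle=False` does, the toy example below (hypothetical `X_demo` and `y_demo`) shows that the test set is simply the last rows of the input, which is why sorting by date first yields a "train on the past, test on the most recent" split. Without shuffling there is no randomness involved, so the random state does not affect the result.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data already in chronological order (row 0 is the oldest)
X_demo = pd.DataFrame({"feature": np.arange(10)})
y_demo = pd.Series(np.arange(10))

Xd_tr, Xd_ts, yd_tr, yd_ts = train_test_split(
    X_demo, y_demo, test_size=3, shuffle=False
)

print(Xd_tr.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
print(Xd_ts.index.tolist())  # [7, 8, 9]
```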

Train a random forest (using the same configuration as before) on *this* data, and report the results in `y_pred_srt` and `rsq_srt`.

In [24]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
rf_srt = RandomForestRegressor(n_estimators=100, random_state=random_state)
rf_srt.fit(X_tr_srt, y_tr_srt)
y_pred_srt = rf_srt.predict(X_ts_srt)
rsq_srt = r2_score(y_ts_srt, y_pred_srt)
print(rsq_srt)

-4.530578742011937
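
A negative R2 score can look surprising. R2 is computed as one minus the ratio of the model's squared error to the squared error of always predicting the mean of the true values, so it drops below zero whenever the model does worse than that constant baseline. The sketch below uses made-up numbers (not the song data) to illustrate this:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of y_true gives exactly 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that run in the wrong direction are worse than the mean:
# SS_res = 20, SS_tot = 5, so R2 = 1 - 20/5 = -3
r2_bad = r2_score(y_true, y_true[::-1])

# Perfect predictions give 1
r2_perfect = r2_score(y_true, y_true)

print(r2_mean, r2_bad, r2_perfect)  # 0.0 -3.0 1.0
```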


#### 5. (Bonus) Predict max popularity

**This is a bonus section. If you complete this part correctly, you can earn above 100% on the overall exam. However, do not spend time on this section until you have answered the rest of the questions on the exam.**

You decide that instead of predicting the popularity of *every* version of a song, you just want to predict the popularity of the most popular version of each song.

So, you create a new data frame in which:

* for each unique value of `base_name`, you only include the sample in the original data that is the most popular.
* this data is also sorted by release date, then by track number.

From this data frame, you create `X_max` and `y_max`, then `X_tr_max`, `X_ts_max`, `y_tr_max`, `y_ts_max`. Since this is a smaller sample, you reserve only 50 samples for the test set. Once again, you do not shuffle the data.

In [25]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
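# For each base name, find the index label of the most popular version
idx_max = df.groupby(df['name'].apply(clean_title))['popularity'].idxmax()
df_max = df.loc[idx_max]
# release_date is still a string here; the YYYY-MM-DD values shown above sort chronologically as text
df_max = df_max.sort_values(by=['release_date', 'track_number'])
df_max = df_max.reset_index(drop=True)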
X_max = df_max[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'duration_ms']]
y_max = df_max['popularity']
X_tr_max, X_ts_max, y_tr_max, y_ts_max = train_test_split(X_max, y_max, test_size=50, shuffle=False, random_state=random_state)
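
The `groupby` plus `idxmax` pattern used above can be sketched on a small, made-up data frame. The `toy` frame and its song names are hypothetical, and the base name is derived here with a simplified regular expression that only strips parentheticals, rather than with `clean_title`:

```python
import pandas as pd

# Hypothetical songs: two versions each of two base titles
toy = pd.DataFrame({
    "name": ["Song A", "Song A (Live)", "Song B", "Song B (Remix)"],
    "popularity": [70, 55, 40, 65],
})

# Simplified base name (assumption): drop any "(...)" suffix
base = toy["name"].str.replace(r"\s*\(.*?\)", "", regex=True)

# idxmax returns, per group, the index label of the row with the highest popularity
idx_max = toy.groupby(base)["popularity"].idxmax()

# .loc with those labels keeps exactly one row per base name
toy_max = toy.loc[idx_max]

print(idx_max.tolist())          # [0, 3]
print(toy_max["name"].tolist())  # ['Song A', 'Song B (Remix)']
```

Because `idxmax` returns index labels, this pattern relies on the data frame having a unique index when `.loc` is applied.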

Train a random forest (using the same configuration as before) on *this* data, and report the results in `y_pred_max` and `rsq_max`.

In [26]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
rf_max = RandomForestRegressor(n_estimators=100, random_state=random_state)
rf_max.fit(X_tr_max, y_tr_max)
y_pred_max = rf_max.predict(X_ts_max)
rsq_max = r2_score(y_ts_max, y_pred_max)
print(rsq_max)

-2.635623223535182
