## Questions I aim to ask in this project:

* Can we predict the number of streams a song will get (on Spotify) based on its audio features (danceability, energy, loudness, etc.) and metadata (artist, album type)?

* What factors contribute to a higher number of views, likes, and comments on the YouTube videos of these songs and can we predict them?

* Is there a correlation between the number of streams on Spotify and the number of views/likes on YouTube for the same song?


__Tools I intend to use:__

Linear Regression, Random Forest, Neural Networks, Decision Trees, SVM, k-NN, AdaBoost.

## About the Spotify - YouTube Dataset

A dataset of songs by various artists from around the world. For each song it includes:

* Several statistics for the track's version on Spotify, including the number of streams;
* The number of views of the song's official music video on YouTube.

### Content
It includes 26 variables for each of the songs collected from Spotify. These variables are briefly described below:

* __Track__: name of the song, as visible on the Spotify platform.

* __Artist__: name of the artist.

* __Url_spotify__: the URL of the artist.

* __Album__: the album in which the song is contained on Spotify.

* __Album_type__: indicates whether the song was released on Spotify as a single or as part of an album.

* __Uri__: a Spotify link used to find the song through the API.

* __Danceability__: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

* __Energy__: is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

* __Key__: the key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

* __Loudness__: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

* __Speechiness__: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

* __Acousticness__: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

* __Instrumentalness__: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

* __Liveness__: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

* __Valence__: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

* __Tempo__: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

* __Duration_ms__: the duration of the track in milliseconds.

* __Stream__: number of streams of the song on Spotify.

* __Url_youtube__: URL of the video linked to the song on YouTube, if it has one.

* __Title__: title of the video clip on YouTube.

* __Channel__: name of the channel that published the video.

* __Views__: number of views.

* __Likes__: number of likes.

* __Comments__: number of comments.

* __Description__: description of the video on YouTube.

* __Licensed__: Indicates whether the video represents licensed content, which means that the content was uploaded to a channel linked to a YouTube content partner and then claimed by that partner.

* __official_video__: boolean value that indicates if the video found is the official video of the song.
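
The `Key` and `Speechiness` descriptions above translate directly into small lookups. Below is a minimal sketch; `key_name` and `speechiness_band` are hypothetical helper names (not part of the dataset or of any Spotify library), and the band labels are my own shorthand for the three ranges described above.

```python
# Pitch Class notation: 0 = C, 1 = C♯/D♭, ..., 11 = B; -1 means no key was detected.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_name(key):
    """Return the pitch name for a `Key` value (hypothetical helper)."""
    key = int(key)  # the column is stored as float64 in this notebook
    if key == -1:
        return "unknown"
    if not 0 <= key <= 11:
        raise ValueError(f"Key must be -1 or in 0..11, got {key}")
    return PITCH_CLASSES[key]


def speechiness_band(value):
    """Bucket a `Speechiness` value using the thresholds quoted above."""
    if value > 0.66:
        return "speech"  # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"   # may contain both music and speech (e.g. rap)
    return "music"       # most likely music or other non-speech audio


print(key_name(6.0), speechiness_band(0.177))
```

With the data loaded, these could be applied column-wise, for example `data['Key'].map(key_name)`.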

In [97]:
import pandas as pd

# Load the dataset
file_path = 'Spotify_Youtube.csv'
data = pd.read_csv(file_path)

# Handle missing values
data = data.dropna()

# Drop duplicate rows
data = data[~data.duplicated()]

# Drop unnecessary columns 
data = data.drop(['Unnamed: 0', "Url_spotify", "Uri", "Url_youtube", "Licensed", "official_video"], axis=1)

# Set the display option to show all columns
pd.set_option('display.max_columns', None)

# Display the dataset
data

Unnamed: 0,Artist,Track,Album,Album_type,Danceability,Energy,Key,Loudness,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,Title,Channel,Views,Likes,Comments,Description,Stream
0,Gorillaz,Feel Good Inc.,Demon Days,album,0.818,0.705,6.0,-6.679,0.1770,0.008360,0.002330,0.6130,0.7720,138.559,222640.0,Gorillaz - Feel Good Inc. (Official Video),Gorillaz,693555221.0,6220896.0,169907.0,Official HD Video for Gorillaz' fantastic trac...,1.040235e+09
1,Gorillaz,Rhinestone Eyes,Plastic Beach,album,0.676,0.703,8.0,-5.815,0.0302,0.086900,0.000687,0.0463,0.8520,92.761,200173.0,Gorillaz - Rhinestone Eyes [Storyboard Film] (...,Gorillaz,72011645.0,1079128.0,31003.0,The official video for Gorillaz - Rhinestone E...,3.100837e+08
2,Gorillaz,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,0.695,0.923,1.0,-3.930,0.0522,0.042500,0.046900,0.1160,0.5510,108.014,215150.0,Gorillaz - New Gold ft. Tame Impala & Bootie B...,Gorillaz,8435055.0,282142.0,7399.0,Gorillaz - New Gold ft. Tame Impala & Bootie B...,6.306347e+07
3,Gorillaz,On Melancholy Hill,Plastic Beach,album,0.689,0.739,2.0,-5.810,0.0260,0.000015,0.509000,0.0640,0.5780,120.423,233867.0,Gorillaz - On Melancholy Hill (Official Video),Gorillaz,211754952.0,1788577.0,55229.0,Follow Gorillaz online:\nhttp://gorillaz.com \...,4.346636e+08
4,Gorillaz,Clint Eastwood,Gorillaz,album,0.663,0.694,10.0,-8.627,0.1710,0.025300,0.000000,0.0698,0.5250,167.953,340920.0,Gorillaz - Clint Eastwood (Official Video),Gorillaz,618480958.0,6197318.0,155930.0,The official music video for Gorillaz - Clint ...,6.172597e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20713,SICK LEGEND,JUST DANCE HARDSTYLE,JUST DANCE HARDSTYLE,single,0.582,0.926,5.0,-6.344,0.0328,0.448000,0.000000,0.0839,0.6580,90.002,94667.0,JUST DANCE HARDSTYLE,SICK LEGEND - Topic,71678.0,1113.0,0.0,Provided to YouTube by Routenote\n\nJUST DANCE...,9.227144e+06
20714,SICK LEGEND,SET FIRE TO THE RAIN HARDSTYLE,SET FIRE TO THE RAIN HARDSTYLE,single,0.531,0.936,4.0,-1.786,0.1370,0.028000,0.000000,0.0923,0.6570,174.869,150857.0,SET FIRE TO THE RAIN HARDSTYLE,SICK LEGEND - Topic,164741.0,2019.0,0.0,Provided to YouTube by Routenote\n\nSET FIRE T...,1.089818e+07
20715,SICK LEGEND,OUTSIDE HARDSTYLE SPED UP,OUTSIDE HARDSTYLE SPED UP,single,0.443,0.830,4.0,-4.679,0.0647,0.024300,0.000000,0.1540,0.4190,168.388,136842.0,OUTSIDE HARDSTYLE SPED UP,SICK LEGEND - Topic,35646.0,329.0,0.0,Provided to YouTube by Routenote\n\nOUTSIDE HA...,6.226110e+06
20716,SICK LEGEND,ONLY GIRL HARDSTYLE,ONLY GIRL HARDSTYLE,single,0.417,0.767,9.0,-4.004,0.4190,0.356000,0.018400,0.1080,0.5390,155.378,108387.0,ONLY GIRL HARDSTYLE,SICK LEGEND - Topic,6533.0,88.0,0.0,Provided to YouTube by Routenote\n\nONLY GIRL ...,6.873961e+06


In [98]:
data.info()
data.columns


<class 'pandas.core.frame.DataFrame'>
Index: 19170 entries, 0 to 20717
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Artist            19170 non-null  object 
 1   Track             19170 non-null  object 
 2   Album             19170 non-null  object 
 3   Album_type        19170 non-null  object 
 4   Danceability      19170 non-null  float64
 5   Energy            19170 non-null  float64
 6   Key               19170 non-null  float64
 7   Loudness          19170 non-null  float64
 8   Speechiness       19170 non-null  float64
 9   Acousticness      19170 non-null  float64
 10  Instrumentalness  19170 non-null  float64
 11  Liveness          19170 non-null  float64
 12  Valence           19170 non-null  float64
 13  Tempo             19170 non-null  float64
 14  Duration_ms       19170 non-null  float64
 15  Title             19170 non-null  object 
 16  Channel           19170 non-null  object 
 17

Index(['Artist', 'Track', 'Album', 'Album_type', 'Danceability', 'Energy',
       'Key', 'Loudness', 'Speechiness', 'Acousticness', 'Instrumentalness',
       'Liveness', 'Valence', 'Tempo', 'Duration_ms', 'Title', 'Channel',
       'Views', 'Likes', 'Comments', 'Description', 'Stream'],
      dtype='object')

In [99]:
data.describe()

Unnamed: 0,Danceability,Energy,Key,Loudness,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,Views,Likes,Comments,Stream
count,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0,19170.0
mean,0.621178,0.63615,5.292645,-7.615305,0.094944,0.287817,0.055476,0.191322,0.528267,120.607345,224761.2,97197680.0,682353.1,28386.56,138274600.0
std,0.165533,0.213439,3.579947,4.617605,0.104931,0.28563,0.192768,0.165217,0.244996,29.588308,127846.8,279999700.0,1820550.0,197797.9,247730900.0
min,0.0,2e-05,0.0,-46.251,0.0,1e-06,0.0,0.0145,0.0,0.0,30985.0,26.0,0.0,0.0,6574.0
25%,0.52,0.51,2.0,-8.745,0.0357,0.0436,0.0,0.0941,0.338,96.9975,180267.0,2070213.0,24473.5,583.0,17869370.0
50%,0.639,0.667,5.0,-6.504,0.0506,0.188,2e-06,0.125,0.535,119.969,213321.0,15689590.0,133277.0,3515.5,50379380.0
75%,0.742,0.798,8.0,-4.9185,0.104,0.469,0.000436,0.234,0.724,139.946,251963.0,73690400.0,542346.2,14941.0,140757900.0
max,0.975,1.0,11.0,0.92,0.964,0.996,1.0,1.0,0.993,243.372,4676058.0,8079649000.0,50788650.0,16083140.0,3386520000.0
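
For the third question (Spotify streams versus YouTube views/likes), the summary above shows that `Stream`, `Views` and `Likes` are strongly right-skewed: each mean is several times the median. A plain Pearson correlation on raw counts would likely be dominated by a handful of huge hits, so a rank-based (Spearman) correlation or a Pearson correlation on log-transformed counts may be more informative. The sketch below only illustrates the mechanics: `sample` holds the five Gorillaz rows displayed earlier, which is far too few rows (and a single artist) to draw any conclusion from.

```python
import numpy as np
import pandas as pd

# First five rows of the cleaned table shown earlier (all Gorillaz tracks).
sample = pd.DataFrame({
    "Stream": [1.040235e9, 3.100837e8, 6.306347e7, 4.346636e8, 6.172597e8],
    "Views": [693555221.0, 72011645.0, 8435055.0, 211754952.0, 618480958.0],
    "Likes": [6220896.0, 1079128.0, 282142.0, 1788577.0, 6197318.0],
})

# Pearson correlation on log1p-transformed counts reduces the effect of the skew.
log_pearson = np.log1p(sample).corr(method="pearson")

# Spearman correlation is the Pearson correlation of the ranks.
spearman = sample.rank().corr(method="pearson")

print(log_pearson.round(3))
print(spearman.round(3))
```

On the full dataset, `sample` would be replaced by `data[['Stream', 'Views', 'Likes']]`.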


In [100]:
# from sklearn.preprocessing import LabelEncoder

# # Encode categorical variables
# label_encoders = {}
# categorical_columns = ['Artist', 'Track', 'Album', 'Album_type', 'Title', 'Channel', 'Description']
# for col in categorical_columns:
#     le = LabelEncoder()
#     data[col] = le.fit_transform(data[col])
#     label_encoders[col] = le


# data

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encode categorical variables
label_encoders = {}
categorical_columns = ['Artist', 'Track', 'Album', 'Album_type', 'Title', 'Channel', 'Description']
for col in categorical_columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Features and target variable for different questions
features_streams = ['Danceability', 'Energy', 'Key', 'Loudness', 'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration_ms', 'Artist', 'Album_type']
target_streams = 'Stream'

features_youtube = ['Danceability', 'Energy', 'Key', 'Loudness', 'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration_ms', 'Artist', 'Album_type']
target_youtube = ['Views', 'Likes', 'Comments']

X_streams = data[features_streams]
y_streams = data[target_streams]

X_youtube = data[features_youtube]
y_youtube = data[target_youtube]

# Split the data
X_train_streams, X_test_streams, y_train_streams, y_test_streams = train_test_split(X_streams, y_streams, test_size=0.2, random_state=42)
X_train_youtube, X_test_youtube, y_train_youtube, y_test_youtube = train_test_split(X_youtube, y_youtube, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_streams_scaled = scaler.fit_transform(X_train_streams)
X_test_streams_scaled = scaler.transform(X_test_streams)

X_train_youtube_scaled = scaler.fit_transform(X_train_youtube)
X_test_youtube_scaled = scaler.transform(X_test_youtube)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Knn': KNeighborsRegressor(n_neighbors=5),
    'SVM': SVR()
}

In [102]:
print('Spotify')
for name, model in models.items():
    model.fit(X_train_streams_scaled, y_train_streams)
    y_pred = model.predict(X_test_streams_scaled)
    print(f"{name} - MSE:", mean_squared_error(y_test_streams, y_pred))
    print(f"{name} - R2 Score:", r2_score(y_test_streams, y_pred))
    print()

Spotify
Linear Regression - MSE: 5.354714424117997e+16
Linear Regression - R2 Score: 0.049454094487035416

Random Forest - MSE: 4.643636177641062e+16
Random Forest - R2 Score: 0.17568165064645114

Decision Tree - MSE: 1.0432487573793966e+17
Decision Tree - R2 Score: -0.8519303854786144

Knn - MSE: 6.237396052018465e+16
Knn - R2 Score: -0.10723575688828046

SVM - MSE: 6.361830098553859e+16
SVM - R2 Score: -0.1293247543720466



In [103]:
print('Youtube')
for target in target_youtube:
    print(f'Models for {target}')
    for name, model in models.items():
        model.fit(X_train_youtube_scaled, y_train_youtube[target])
        y_pred = model.predict(X_test_youtube_scaled)
        print(f"{name} - MSE:", mean_squared_error(y_test_youtube[target], y_pred))
        print(f"{name} - R2 Score:", r2_score(y_test_youtube[target], y_pred))
        print()


Youtube
Models for Views
Linear Regression - MSE: 6.3543864209907624e+16
Linear Regression - R2 Score: 0.024260640465305405

Random Forest - MSE: 6.002753509632842e+16
Random Forest - R2 Score: 0.07825516471808158

Decision Tree - MSE: 1.3559077481167621e+17
Decision Tree - R2 Score: -1.082046117568812

Knn - MSE: 7.341660133311141e+16
Knn - R2 Score: -0.1273388619764626

SVM - MSE: 7.1426906980950904e+16
SVM - R2 Score: -0.09678637485616903

Models for Likes
Linear Regression - MSE: 2330722627715.32
Linear Regression - R2 Score: 0.030355197637960685

Random Forest - MSE: 2119460134915.0703
Random Forest - R2 Score: 0.11824621291445991

Decision Tree - MSE: 5097933828237.845
Decision Tree - R2 Score: -1.1208808721190362

Knn - MSE: 2863064951809.634
Knn - R2 Score: -0.19111387015127068

SVM - MSE: 2681339054075.73
SVM - R2 Score: -0.11551089187453578

Models for Comments
Linear Regression - MSE: 13958213192.27945
Linear Regression - R2 Score: 0.005949766237423604

Random Forest - MSE: 

In [109]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for AdaBoost
param_grid = {
    'n_estimators': [12, 25, 50],
    'learning_rate': [0.0001, 0.001, 0.01]
}

# Function to perform grid search and print the best parameters and scores
def grid_search_adaboost(X_train, y_train, X_test, y_test):
    ada = AdaBoostRegressor()
    grid_search = GridSearchCV(estimator=ada, param_grid=param_grid, cv=5, scoring='r2')
    grid_search.fit(X_train, y_train)
    
    # Print best parameters and score
    print("Best parameters found: ", grid_search.best_params_)
    print("Best R2 score: ", grid_search.best_score_)
    print()

    # Predict with best model
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    print("Test MSE:", mean_squared_error(y_test, y_pred))
    print("Test R2 Score:", r2_score(y_test, y_pred))
    print()

# Perform grid search for Spotify streams
print('Spotify AdaBoost Hyperparameter Tuning')
grid_search_adaboost(X_train_streams_scaled, y_train_streams, X_test_streams_scaled, y_test_streams)

# Perform grid search for YouTube metrics
print('Youtube AdaBoost Hyperparameter Tuning')
for target in target_youtube:
    print(f'AdaBoost Tuning for {target}')
    grid_search_adaboost(X_train_youtube_scaled, y_train_youtube[target], X_test_youtube_scaled, y_test_youtube[target])


Spotify AdaBoost Hyperparameter Tuning
Best parameters found:  {'learning_rate': 0.01, 'n_estimators': 12}
Best R2 score:  0.0287061339192042

Test MSE: 5.4438390122203624e+16
Test R2 Score: 0.03363308040649682

Youtube AdaBoost Hyperparameter Tuning
AdaBoost Tuning for Views
Best parameters found:  {'learning_rate': 0.001, 'n_estimators': 12}
Best R2 score:  0.03554458879419016

Test MSE: 6.3227560786701224e+16
Test R2 Score: 0.029117595631873883

AdaBoost Tuning for Likes
Best parameters found:  {'learning_rate': 0.0001, 'n_estimators': 25}
Best R2 score:  0.0293566953655112

Test MSE: 2324043271074.429
Test R2 Score: 0.03313399395329519

AdaBoost Tuning for Comments


In [107]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for KNN
param_grid_knn = {
    'n_neighbors': list(range(5, 50))  # Search n_neighbors from 5 to 49
}

# Function to perform grid search and print the best parameters and scores for KNN
def grid_search_knn(X_train, y_train, X_test, y_test):
    knn = KNeighborsRegressor()
    grid_search = GridSearchCV(estimator=knn, param_grid=param_grid_knn, cv=5, scoring='r2')
    grid_search.fit(X_train, y_train)
    
    # Print best parameters and score
    print("Best n_neighbors found: ", grid_search.best_params_)
    print("Best R2 score: ", grid_search.best_score_)
    
    # Predict with best model
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    print("Test MSE:", mean_squared_error(y_test, y_pred))
    print("Test R2 Score:", r2_score(y_test, y_pred))
    print()

# Perform grid search for Spotify streams
print('Spotify KNN Hyperparameter Tuning')
grid_search_knn(X_train_streams_scaled, y_train_streams, X_test_streams_scaled, y_test_streams)

# Perform grid search for YouTube metrics
print('YouTube KNN Hyperparameter Tuning')
for target in target_youtube:
    print(f'KNN Tuning for {target}')
    grid_search_knn(X_train_youtube_scaled, y_train_youtube[target], X_test_youtube_scaled, y_test_youtube[target])


Spotify KNN Hyperparameter Tuning
Best n_neighbors found:  {'n_neighbors': 49}
Best R2 score:  0.02998813546466326
Test MSE: 5.373537247690176e+16
Test R2 Score: 0.04611274769249085

YouTube KNN Hyperparameter Tuning
KNN Tuning for Views
Best n_neighbors found:  {'n_neighbors': 38}
Best R2 score:  0.018086498160014085
Test MSE: 6.4286854872794424e+16
Test R2 Score: 0.012851746112412243

KNN Tuning for Likes
Best n_neighbors found:  {'n_neighbors': 48}
Best R2 score:  0.021211927173711766
Test MSE: 2373227246344.7925
Test R2 Score: 0.012672105690271995

KNN Tuning for Comments
Best n_neighbors found:  {'n_neighbors': 48}
Best R2 score:  -0.006027832396328159
Test MSE: 15530041073.709303
Test R2 Score: -0.10598976724341735



In [105]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import StandardScaler

# The features were already standardized in cell 101, so only the target is scaled here.
# Assumption: the previously undefined `y_train_streams_scaled` was meant to be a
# standardized copy of the stream counts, which keeps the MSE loss in a workable range.
y_scaler = StandardScaler()
y_train_streams_scaled = y_scaler.fit_transform(y_train_streams.to_numpy().reshape(-1, 1)).ravel()
y_test_streams_scaled = y_scaler.transform(y_test_streams.to_numpy().reshape(-1, 1)).ravel()

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_streams_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_streams_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_streams_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_streams_scaled, dtype=torch.float32)

# Define the neural network model: two hidden layers with ReLU, one linear output
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(X_train_streams_scaled.shape[1], 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (full-batch updates: one optimizer step per epoch)
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    loss = criterion(outputs.squeeze(), y_train_tensor)
    loss.backward()
    optimizer.step()

# Evaluate the model
model.eval()
with torch.no_grad():
    y_pred_tensor = model(X_test_tensor).squeeze()

# MSE is reported in standardized target units; R2 is unaffected by the scaling
print('Neural Network for Streams:')
print('MSE:', mean_squared_error(y_test_tensor.numpy(), y_pred_tensor.numpy()))
print('R2 Score:', r2_score(y_test_tensor.numpy(), y_pred_tensor.numpy()))