### Using Random Forest to perform Machine Learning
---

In [1]:
# importing libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np

In [5]:
# Load the dataset into a pandas dataframe
cleanedMusicData = pd.read_csv('../Data/tracks_cleaned.csv')
cleanedMusicData.drop('release_date',axis=1,inplace=True)
cleanedMusicData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   popularity        469281 non-null  int64  
 1   duration_ms       469281 non-null  int64  
 2   explicit          469281 non-null  int64  
 3   danceability      469281 non-null  float64
 4   energy            469281 non-null  float64
 5   key               469281 non-null  int64  
 6   loudness          469281 non-null  float64
 7   mode              469281 non-null  int64  
 8   speechiness       469281 non-null  float64
 9   acousticness      469281 non-null  float64
 10  instrumentalness  469281 non-null  float64
 11  liveness          469281 non-null  float64
 12  valence           469281 non-null  float64
 13  tempo             469281 non-null  float64
 14  time_signature    469281 non-null  int64  
 15  num_artists       469281 non-null  int64  
 16  year              46

Since `key` is an estimated value rather than a measured one, we drop it from the frame. We also drop `release_date` because the data is inconsistent: the release month is available for only some of the tracks. Because we already created a `year` column during data cleaning, we can still use it and take advantage of the strong relationship between year and popularity.
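
The cleaning notebook is not shown here, so the following is only a sketch of how a year could be recovered from release dates of mixed precision. It assumes every `release_date` string starts with a four-digit year; the helper name and the sample values are hypothetical.

```python
def year_from_release_date(release_date: str) -> int:
    """Return the leading four-digit year of a release-date string."""
    return int(release_date[:4])


# Hypothetical samples with year, year-month and full-date precision
samples = ["1922", "1985-07", "2020-03-27"]
years = [year_from_release_date(s) for s in samples]
print(years)
```

Keeping only the year means that tracks with and without a release month end up on the same scale.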

In [6]:
# Remove unnecessary variables
cleanedMusicData.drop(['key'],axis=1, inplace=True)

In [7]:
# Data cleaning and arrangement
time_signature_df=pd.get_dummies(cleanedMusicData["time_signature"])
cleanedMusicData = pd.concat([cleanedMusicData,time_signature_df],axis=1)
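# 'mode' is already stored as int64 (see the info() output above), so it is left unchanged;
# comparing it with the string 'Major' would match no rows and set the whole column to 0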
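
As a side note on the one-hot encoding step, here is a small sketch on a toy frame (the values are made up). It shows one possible variant, not what the cell above does: it passes `prefix=` so the dummy columns get descriptive string names, and it drops the original `time_signature` column so the same information is not stored twice.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names follow the notebook
toy = pd.DataFrame({"time_signature": [4, 3, 4, 5], "energy": [0.8, 0.3, 0.6, 0.9]})

# One-hot encode time_signature with a readable prefix
dummies = pd.get_dummies(toy["time_signature"], prefix="time_signature")

# Replace the original column with its dummy columns
encoded = pd.concat([toy.drop(columns="time_signature"), dummies], axis=1)
print(list(encoded.columns))
```

String column names also avoid mixing integer and string labels in one frame, which matters if the frame is passed to scikit-learn without converting it to a NumPy array first.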

In [8]:
# data modelling
X = cleanedMusicData.loc[:,cleanedMusicData.columns !="popularity"] # all the features except popularity
y = cleanedMusicData["popularity"] # the popularity

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   duration_ms       469281 non-null  int64  
 1   explicit          469281 non-null  int64  
 2   danceability      469281 non-null  float64
 3   energy            469281 non-null  float64
 4   loudness          469281 non-null  float64
 5   mode              469281 non-null  int64  
 6   speechiness       469281 non-null  float64
 7   acousticness      469281 non-null  float64
 8   instrumentalness  469281 non-null  float64
 9   liveness          469281 non-null  float64
 10  valence           469281 non-null  float64
 11  tempo             469281 non-null  float64
 12  time_signature    469281 non-null  int64  
 13  num_artists       469281 non-null  int64  
 14  year              469281 non-null  int64  
 15  0                 469281 non-null  uint8  
 16  1                 46

In [10]:
# separate the data to training and testing
X_train, X_test, y_train,y_test=train_test_split(X,y,test_size=0.2)

# save as np.array
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train) 
y_test = np.array(y_test)

In [533]:
# random forest regressor
from sklearn.ensemble import RandomForestRegressor

model_random_forest = RandomForestRegressor(n_estimators=100, max_depth=10,)
model_random_forest.fit(X_train,y_train)


In [534]:
print("Train Set Performance")
print("R^2: " + str(model_random_forest.score(X_train,y_train)))

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Predict on the train set
y_train_pred = model_random_forest.predict(X_train)

# Calculate MSE and RMSE on train set
mse_train = mean_squared_error(y_train, y_train_pred)

rmse_train = np.sqrt(mse_train)
print("MSE: ", mse_train)
print("RMSE:", rmse_train)

Train Set Performance
R^2: 0.49331159747127684
MSE:  171.03817164957607
RMSE: 13.07815627867996


In [535]:
print("Test Set Performance")
print("R^2: " + str(model_random_forest.score(X_test,y_test)))

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Predict on the test set
y_test_pred = model_random_forest.predict(X_test)

# Calculate MSE and RMSE on the test set
mse_test = mean_squared_error(y_test, y_test_pred)

rmse_test = np.sqrt(mse_test)
print("MSE: ", mse_test)
print("RMSE:", rmse_test)

Test Set Performance
R^2: 0.47486637863325565
MSE:  176.6339797766907
RMSE: 13.290371694451991
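
To make the three reported metrics concrete, the sketch below computes them by hand on a tiny made-up example. The function name and the toy values are ours; the formulas are the standard definitions of R^2, MSE and RMSE for a regressor.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MSE and RMSE from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}


# Toy popularity values (hypothetical)
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# RMSE is the square root of MSE, so the reported test values are consistent
reported_rmse = math.sqrt(176.6339797766907)
print(reported_rmse)
```

Because RMSE is in the same units as the target, a test RMSE of about 13.3 means the predictions are typically off by roughly 13 popularity points.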


In [526]:
from sklearn.metrics import mean_squared_error
import numpy as np

y_pred = model_random_forest.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


RMSE: 13.271554001544486


---  
We used an arbitrary `max_depth` and number of trees (`n_estimators`) for our Random Forest model. Let's explore whether we can fine tune the model to improve its performance.

----
## Fine Tuning Model

In [558]:
# Dropping non-essential factors
# data modelling

refinedMusicData = pd.read_csv('../Data/tracks_cleaned.csv')
refinedMusicData.drop('release_date',axis=1,inplace=True)

x_top_6 = refinedMusicData.loc[:,refinedMusicData.columns !="popularity"] # all the features except popularity
y_top_6 = refinedMusicData["popularity"] # the popularity

In [559]:
x_top_6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   duration_ms       469281 non-null  int64  
 1   explicit          469281 non-null  int64  
 2   danceability      469281 non-null  float64
 3   energy            469281 non-null  float64
 4   key               469281 non-null  int64  
 5   loudness          469281 non-null  float64
 6   mode              469281 non-null  int64  
 7   speechiness       469281 non-null  float64
 8   acousticness      469281 non-null  float64
 9   instrumentalness  469281 non-null  float64
 10  liveness          469281 non-null  float64
 11  valence           469281 non-null  float64
 12  tempo             469281 non-null  float64
 13  time_signature    469281 non-null  int64  
 14  num_artists       469281 non-null  int64  
 15  year              469281 non-null  int64  
dtypes: float64(9), int64

In [560]:
# drop() returns a new frame here, which avoids pandas' SettingWithCopyWarning
x_top_6 = x_top_6.drop(columns=['key','year', 'num_artists', 'mode', 'speechiness', 'liveness', 'valence', 'tempo', 'time_signature','instrumentalness'])
x_top_6.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   duration_ms   469281 non-null  int64  
 1   explicit      469281 non-null  int64  
 2   danceability  469281 non-null  float64
 3   energy        469281 non-null  float64
 4   loudness      469281 non-null  float64
 5   acousticness  469281 non-null  float64
dtypes: float64(4), int64(2)
memory usage: 21.5 MB


In [561]:
y_top_6.value_counts()

0     35632
35     9755
23     9710
1      9630
36     9484
      ...  
97        2
93        2
95        1
98        1
96        1
Name: popularity, Length: 99, dtype: int64

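We notice that the popularity values are heavily imbalanced: a popularity of 0 occurs 35,632 times, while the rarest values (for example 95, 96 and 98) occur only once. We can upsample the rare values so that every popularity value is equally represented in the training set, which should reduce the model's bias toward the common values.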

In [562]:
# Split the Dataset into Train and Test
X_train_top_6, X_test_top_6, y_train_top_6, y_test_top_6 = train_test_split(x_top_6, y_top_6, test_size = 0.2)

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_train_ros, y_train_ros = ros.fit_resample(X_train_top_6,y_train_top_6)

print('Resampled shapes:')
print(X_train_ros.shape,y_train_ros.shape)

Resampled shapes:
(2769284, 6) (2769284,)
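
To picture what random oversampling does, here is a from-scratch sketch in plain Python. It is an illustration of the idea only, not imblearn's implementation; the function name, the seed and the toy labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each label until all labels are equally frequent."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())  # size of the largest group
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Sample with replacement to fill the gap up to the target size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: popularity 0 is common, 35 is less common, 97 appears once
labels = [0, 0, 0, 0, 35, 35, 97]
rows = [[i] for i in range(len(labels))]
new_rows, new_labels = random_oversample(rows, labels)
counts = dict(Counter(new_labels))
print(counts)
```

Note that the single row with popularity 97 is simply copied until it matches the largest group, so oversampling balances the counts without adding any new information about rare values.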


In [563]:
y_train_ros.value_counts()

39    28258
76    28258
78    28258
70    28258
68    28258
      ...  
29    28258
5     28258
17    28258
11    28258
95    28258
Name: popularity, Length: 98, dtype: int64

In [573]:
# Dataset is large. To reduce grid search time, we can use a subset of the data

# Split the Dataset into Train and Test
X_train_top_6_CV, X_test_top_6_CV, y_train_top_6_CV, y_test_top_6_CV = train_test_split(X_train_ros, y_train_ros, test_size = 0.995 ,stratify=y_train_ros)

In [574]:
y_train_top_6_CV.value_counts()

40    142
15    142
68    142
94    142
72    142
     ... 
11    141
35    141
91    141
54    141
36    141
Name: popularity, Length: 98, dtype: int64

In [575]:
X_train_top_6_CV.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13846 entries, 2470266 to 2619945
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   duration_ms   13846 non-null  int64  
 1   explicit      13846 non-null  int64  
 2   danceability  13846 non-null  float64
 3   energy        13846 non-null  float64
 4   loudness      13846 non-null  float64
 5   acousticness  13846 non-null  float64
dtypes: float64(4), int64(2)
memory usage: 757.2 KB
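
The sizes printed above can be checked with a little arithmetic. The sketch assumes that scikit-learn rounds the held-out part up to a whole number of rows when only `test_size` is given, which is our understanding of `train_test_split`; the variable names are ours.

```python
import math

n_classes = 98          # distinct popularity values in the oversampled training set
rows_per_class = 28258  # rows per value after oversampling
n_oversampled = n_classes * rows_per_class

# With test_size=0.995, the held-out part is rounded up and the rest is kept
n_held_out = math.ceil(0.995 * n_oversampled)
n_kept = n_oversampled - n_held_out

print(n_oversampled, n_kept, n_kept / n_classes)
```

The result agrees with the 13846 rows reported by `info()`, and 13846 / 98 is a little over 141, which is why the stratified subset holds either 141 or 142 rows per popularity value.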


In [581]:
X_train_top_6_CV=np.array(X_train_top_6_CV)
y_train_top_6_CV=np.array(y_train_top_6_CV)

# Import GridSearch for hyperparameter tuning using Cross-Validation (CV)
from sklearn.model_selection import GridSearchCV

# Define the Hyper-parameter Grid to search on, in case of Random Forest
param_grid = {'n_estimators': np.arange(100,1001,100),   # number of trees 100, 200, ..., 1000
              'max_depth': [10,12,15],                    # depth of trees 10, 12, 15
              }

# Create the Hyper-parameter Grid
hpGrid = GridSearchCV(RandomForestRegressor(),   # the model family
                      param_grid,                 # the search grid
                      cv = 5,                     # 5-fold cross-validation
                      scoring = 'r2',             # score to evaluate
                      n_jobs=-1,                  # use all available CPU cores
                      verbose=10)                 # print progress for every fit
  
# Train the models using Cross-Validation
hpGrid.fit(X_train_top_6_CV, y_train_top_6_CV.ravel())

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV 1/5; 1/30] START max_depth=10, n_estimators=100.............................
[CV 2/5; 1/30] START max_depth=10, n_estimators=100.............................
[CV 3/5; 1/30] START max_depth=10, n_estimators=100.............................
[CV 4/5; 1/30] START max_depth=10, n_estimators=100.............................
[CV 5/5; 1/30] START max_depth=10, n_estimators=100.............................
[CV 1/5; 2/30] START max_depth=10, n_estimators=200.............................
[CV 2/5; 2/30] START max_depth=10, n_estimators=200.............................
[CV 3/5; 2/30] START max_depth=10, n_estimators=200.............................
[CV 2/5; 1/30] END max_depth=10, n_estimators=100;, score=0.552 total time=   3.3s
[CV 4/5; 2/30] START max_depth=10, n_estimators=200.............................
[CV 4/5; 1/30] END max_depth=10, n_estimators=100;, score=0.551 total time=   3.4s
[CV 5/5; 1/30] END max_depth=10, n_estimato

In [583]:
# Fetch the best Model or the best set of Hyper-parameters
print(hpGrid.best_estimator_)

# Print the mean cross-validated R^2 of the best Model
print(np.abs(hpGrid.best_score_))

RandomForestRegressor(max_depth=15, n_estimators=1000)
0.611477370912186


Now that grid search has given us a set of hyperparameters (`max_depth=15`, `n_estimators=1000`), let's build our improved model.

----
## Improved Model
---

In [11]:
# Load the dataset into a pandas dataframe
RefinedMusicData = pd.read_csv('../Data/tracks_cleaned.csv')
RefinedMusicData.drop('release_date',axis=1,inplace=True)
RefinedMusicData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   popularity        469281 non-null  int64  
 1   duration_ms       469281 non-null  int64  
 2   explicit          469281 non-null  int64  
 3   danceability      469281 non-null  float64
 4   energy            469281 non-null  float64
 5   key               469281 non-null  int64  
 6   loudness          469281 non-null  float64
 7   mode              469281 non-null  int64  
 8   speechiness       469281 non-null  float64
 9   acousticness      469281 non-null  float64
 10  instrumentalness  469281 non-null  float64
 11  liveness          469281 non-null  float64
 12  valence           469281 non-null  float64
 13  tempo             469281 non-null  float64
 14  time_signature    469281 non-null  int64  
 15  num_artists       469281 non-null  int64  
 16  year              46

In [12]:
# Data cleaning and arrangement
RefinedMusicData.drop(['key'],axis=1, inplace=True)
time_signature_df=pd.get_dummies(RefinedMusicData["time_signature"])
time_signature_df.columns = time_signature_df.columns.astype(str)
RefinedMusicData = pd.concat([RefinedMusicData,time_signature_df],axis=1)
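# 'mode' is already stored as int64 (see the info() output above), so it is left unchanged;
# comparing it with the string 'Major' would match no rows and set the whole column to 0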

In [13]:
# data modelling
X_improved = RefinedMusicData.loc[:,RefinedMusicData.columns !="popularity"] # all the features except popularity
y_improved = RefinedMusicData["popularity"] # the popularity
X_improved.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469281 entries, 0 to 469280
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   duration_ms       469281 non-null  int64  
 1   explicit          469281 non-null  int64  
 2   danceability      469281 non-null  float64
 3   energy            469281 non-null  float64
 4   loudness          469281 non-null  float64
 5   mode              469281 non-null  int64  
 6   speechiness       469281 non-null  float64
 7   acousticness      469281 non-null  float64
 8   instrumentalness  469281 non-null  float64
 9   liveness          469281 non-null  float64
 10  valence           469281 non-null  float64
 11  tempo             469281 non-null  float64
 12  time_signature    469281 non-null  int64  
 13  num_artists       469281 non-null  int64  
 14  year              469281 non-null  int64  
 15  0                 469281 non-null  uint8  
 16  1                 46

In [14]:
# splitting

# separate the data to training and testing
X_train_improved, X_test_improved, y_train_improved,y_test_improved=train_test_split(X_improved,y_improved,test_size=0.2)

# save as np.array
X_test_improved = np.array(X_test_improved)
y_test_improved = np.array(y_test_improved)
X_train_improved=np.array(X_train_improved)
y_train_improved=np.array(y_train_improved)

In [15]:
# random forest regressor
from sklearn.ensemble import RandomForestRegressor

model_random_forest_improved = RandomForestRegressor(n_estimators=1000, max_depth=15,n_jobs=-1)
model_random_forest_improved.fit(X_train_improved,y_train_improved)


In [23]:
# to import to another jupyter notebook
%store model_random_forest_improved

Stored 'model_random_forest_improved' (RandomForestRegressor)


In [17]:
print("Train Set Performance")
print("R^2: " + str(model_random_forest_improved.score(X_train_improved,y_train_improved)))

from sklearn.metrics import mean_squared_error

# Predict on the train set
y_train_pred = model_random_forest_improved.predict(X_train_improved)

# Calculate MSE and RMSE on train set
mse_train = mean_squared_error(y_train_improved, y_train_pred)

rmse_train = np.sqrt(mse_train)
print("MSE: ", mse_train)
print("RMSE:", rmse_train)

Train Set Performance
R^2: 0.5981622323851397
MSE:  135.29050696919228
RMSE: 11.631444749866299


In [18]:
print("Test Set Performance")
print("R^2: " + str(model_random_forest_improved.score(X_test_improved,y_test_improved)))


# Predict on the test set
y_test_pred = model_random_forest_improved.predict(X_test_improved)

# Calculate MSE and RMSE on the test set
mse_test = mean_squared_error(y_test_improved, y_test_pred)

rmse_test = np.sqrt(mse_test)
print("MSE: ", mse_test)
print("RMSE:", rmse_test)

Test Set Performance
R^2: 0.5103836840658821
MSE:  166.41431142653403
RMSE: 12.900167108473209


---
## Conclusion


In conclusion, fine tuning the model improved its performance on the test set: R^2 rose from 0.475 to 0.510, MSE fell from 176.63 to 166.41, and RMSE fell from 13.29 to 12.90.

The model can be improved further, but doing so requires much more computational power. During the grid search we considered only the top 6 non-negligible factors and a 0.5% subset of the oversampled training data. With more computational power we could search over all the features, so that the random forest could determine the importance of each feature more accurately.

In addition, we could sample our data better to reduce the class imbalance, which would also be more computationally expensive.

Nonetheless, the improvement in test R^2 from `0.47486637863325565` to `0.5103836840658821` suggests that the model could be optimized further.
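
One more way to read the reported scores is to compare the gap between train and test R^2 for the two models. The sketch below uses only the numbers printed earlier in this notebook; the dictionary and function names are ours.

```python
# R^2 values copied from the outputs above
base = {"train_r2": 0.49331159747127684, "test_r2": 0.47486637863325565}
improved = {"train_r2": 0.5981622323851397, "test_r2": 0.5103836840658821}


def r2_gap(scores):
    """Difference between train and test R^2; a larger gap hints at more overfitting."""
    return scores["train_r2"] - scores["test_r2"]


test_gain = improved["test_r2"] - base["test_r2"]
print(round(test_gain, 4), round(r2_gap(base), 4), round(r2_gap(improved), 4))
```

The test R^2 improves by about 0.036, while the train/test gap grows from about 0.018 to about 0.088, so the deeper and larger forest fits the training data more closely than it generalizes. Limiting tree depth or using more varied training data are possible ways to narrow that gap.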