<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">Table Of Contents</h3>



* [1. Introduction](#1)
    * [1.1 Import Libraries](#1.1)
    * [1.2 Load data](#1.2)
* [2. Exploratory Data Analysis (EDA)](#2)
    * [2.1 Overview](#2.1)
* [3. Feature Engineering](#3)
    * [3.1 Creating The Major Genre Feature](#3.1)
* [4. Model Selection, Fitting And Prediction](#4)
    * [4.1 Selecting The Target Feature And The Predictor Features](#4.1)
    * [4.2 Linear Regression Cross Validation Test](#4.2)
    * [4.3 Decision Tree Cross Validation Test](#4.3)
    * [4.4 Random Forest Cross Validation Test](#4.4)
    * [4.5 KNN Cross Validation Test](#4.5)
    * [4.6 Model Blending](#4.6)    
    

<a id="1.1"></a>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.subplots import make_subplots
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
def RMSE(y,yhat):
    return np.sqrt(mean_squared_error(y,yhat))
from nltk.sentiment.vader import SentimentIntensityAnalyzer

plt.rc('figure',figsize=(20,11))


<a id="1.2"></a>


In [None]:
s_data =pd.read_csv('/kaggle/input/top50spotify2019/top50.csv',encoding='ISO-8859-1')
s_data.head(5)

In [None]:
s_data.drop(s_data.columns[0] ,axis=1,inplace=True)

In [None]:
s_data.info()

<a id="2"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h3>


<a id="2.1"></a>
<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Overview </span></p>

In [None]:
info = s_data.describe()
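# Restrict to numeric columns; the text columns (track, artist, genre) have no skew/kurtosis/median
info.loc['skew'] = s_data.skew(numeric_only=True)
info.loc['kurtosis'] = s_data.kurt(numeric_only=True)
info.loc['median'] = s_data.median(numeric_only=True)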
info.loc['iqr'] = info.loc['75%']-info.loc['25%']
info


<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>We have no missing values, which is great, and all of our features are correctly labeled. We can also already see that Ed Sheeran has the most songs in the top 50 list! </span></p>

In [None]:
number_of_unique_artists = s_data['Artist.Name'].nunique()
number_of_unique_genres = s_data['Genre'].nunique()
print("Number Of Unique Artists: ",number_of_unique_artists,' | ',' Number Of Unique Genres: ',number_of_unique_genres)



<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>In terms of encoding our artists and genres, one-hot encoding the raw columns is a bad idea here: their cardinality is high relative to our 50 rows. Therefore we will dummy encode the genres once they are grouped into main genres, and replace the artist names with numerical values closer to the model selection stage. </span></p>
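The two encodings mentioned above can be sketched on a toy frame. This is only an illustration under assumptions: the four rows are hypothetical, the column name `Artist.Code` is made up for the example, and `drop_first=True` is one common way to choose a baseline level, not necessarily the choice made later in this kernel.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Artist.Name': ['Ed Sheeran', 'Lizzo', 'Ed Sheeran', 'Drake'],
    'Genre': ['pop', 'escape room', 'pop', 'canadian hip hop'],
})

# Dummy encoding: one indicator column per genre, with the first
# (alphabetical) level dropped to serve as the baseline.
genre_dummies = pd.get_dummies(toy['Genre'], prefix='Genre', drop_first=True)

# Integer codes: a single column no matter how many artists exist.
# The codes follow alphabetical order, so they carry no real ranking.
artist_codes = toy['Artist.Name'].astype('category').cat.codes

encoded = pd.concat([artist_codes.rename('Artist.Code'), genre_dummies], axis=1)
print(encoded)
```

Integer codes keep the feature count small, but tree models cope with the arbitrary ordering better than linear models do.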

In [None]:
fig = make_subplots(rows=2, cols=1,shared_xaxes=True,subplot_titles=('Pearson Correlation',  'Spearman Correlation'))
colorscale=[[0.0, "rgb(165,0,38)"],
                [0.1111111111111111, "rgb(215,48,39)"],
                [0.2222222222222222, "rgb(244,109,67)"],
                [0.3333333333333333, "rgb(253,174,97)"],
                [0.4444444444444444, "rgb(254,224,144)"],
                [0.5555555555555556, "rgb(224,243,248)"],
                [0.6666666666666666, "rgb(171,217,233)"],
                [0.7777777777777778, "rgb(116,173,209)"],
                [0.8888888888888888, "rgb(69,117,180)"],
                [1.0, "rgb(49,54,149)"]]

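# Correlate numeric columns only; the text columns would raise an error in recent pandas versions
s_val = s_data.select_dtypes('number').corr('pearson')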
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,name='pearson',showscale=False,xgap=1,ygap=1,colorscale=colorscale),
    row=1, col=1
)


s_val = s_data.select_dtypes('number').corr('spearman')
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,xgap=1,ygap=1,colorscale=colorscale),
    row=2, col=1
)

fig.update_layout(height=700, width=900, title_text="Pearson And Spearman Correlation Between Numeric Features")
fig.show()


<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Let us take a side note here and remember that Popularity, whose prediction is one of our goals, is correlated mostly with beats per minute, valence, and speechiness. </span></p>
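To read such a ranking off a correlation matrix without eyeballing the heatmap, one can sort the target's column by absolute value. The sketch below uses synthetic data, and the helper name `rank_by_correlation` is an assumption for illustration only; it is not part of this kernel.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (values are made up).
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    'Beats.Per.Minute': rng.normal(120, 30, n),
    'Valence.': rng.normal(55, 20, n),
})
demo['Popularity'] = 87 + 0.05 * demo['Beats.Per.Minute'] + rng.normal(0, 4, n)

def rank_by_correlation(frame, target, method='pearson'):
    """Correlation of every numeric column with `target`, strongest first."""
    corr = frame.select_dtypes('number').corr(method=method)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

pearson_rank = rank_by_correlation(demo, 'Popularity', 'pearson')
spearman_rank = rank_by_correlation(demo, 'Popularity', 'spearman')
print(pearson_rank)
print(spearman_rank)
```

With only 50 rows, small correlations can easily arise by chance, so the ordering is a hint rather than evidence.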

<h2 style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>What music genres dominate the top 50 songs? Extracting The Main Genre From Our Sub Genres</span></h2>

In [None]:
ax = sns.countplot(x=s_data['Genre'],palette='Greens')
ax.set_xticklabels(ax.get_xticklabels(),rotation=90,size=13)
ax.set_title('Distribution Of Genres Across Our Data',fontsize=16)
ax.patches[2].set_fc('r')
plt.show()



<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Interesting: dance-pop is the genre with the most songs in our data. If we take a closer look, however, there is an interesting feature hidden in our data: most of the genres are sub-genres of major music genres. For example, pop is a main genre with several sub-genres. Let's create a new feature for the main genre and observe our findings. </span></p>

<a id="3"></a>
<a id="3.1"></a>


<h2 style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>Extracting The Main Genre From Our Sub-Genres</span></h2>

In [None]:
main_genres = ['rock','pop','blues','hip hop','jazz','reggae','techno','trap','regga','rap','r&b']
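def check_genre(sir):
    # Return the first main genre whose name appears in the sub-genre string.
    # 'trap' is listed before 'rap' so trap sub-genres are not folded into hip hop.
    for word in main_genres:
        if word in sir:
            # Rap sub-genres are grouped under hip hop.
            if word == 'rap':
                return 'hip hop'
            else:
                return word
    # No main genre matched: keep the original genre label.
    return sir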

s_data['Main.Genre'] = s_data['Genre'].apply(check_genre)


<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Now that we have our new feature, which shows the main genre of each song, let's take another quick look at the distribution and draw some conclusions. </span></p>

In [None]:
ax = sns.countplot(x=s_data['Main.Genre'],palette='Greens',order=s_data['Main.Genre'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90,size=13)
ax.set_title('Distribution Of Genres Across Our Data',fontsize=16)
ax.patches[0].set_fc('r')
ax.patches[1].set_fc((0.75,0,0))
ax.patches[2].set_fc((0.50,0,0))
plt.legend({'Most Frequent Music Genre':0},prop={'size':'16'})
plt.show()


<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>As we can see, the top 3 major music genres are&nbsp;</span></p>
<ol>
    <li style="text-align: center;"><span style="font-family: 'Times New Roman', Times, serif;"><span style="font-size: 24px;">Pop</span></span></li>
    <li style="text-align: center;"><span style="font-family: 'Times New Roman', Times, serif;"><span style="font-size: 24px;">Hip Hop</span></span></li>
    <li style="text-align: center;"><span style="font-family: 'Times New Roman', Times, serif;"><span style="font-size: 24px;">Latin</span></span></li>
</ol>
<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>where the most popular sub-genre is dance-pop</span></p>

<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Another important point to observe is which artist dominates the top 50 songs? </span></p>

In [None]:
ax = sns.countplot(x=s_data['Artist.Name'],palette='Greens',order = s_data['Artist.Name'].value_counts().index,label='Top Artist')
ax.set_xticklabels(ax.get_xticklabels(),rotation=90,size=13)
ax.set_title('Distribution Of Artists Across Our Data',fontsize=16)
ax.patches[0].set_fc('r')
plt.legend(prop={'size':'16'})
plt.show()

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>It's quite clear that our top artist is Ed Sheeran, with 4 of his songs in the top 50 list, followed by The Chainsmokers, Shawn Mendes, Post Malone, Sech, Marshmello, Billie Eilish, J Balvin, Lil Nas X, and Ariana Grande with 2 songs each; every other artist has only 1 song. </span></p>
<p><br></p>

In [None]:
#Our Top 10 Artist And Top 10 Genres 
top_10_artist = s_data['Artist.Name'].value_counts()[:10]
top_10_genres = s_data['Genre'].value_counts()[:10]
top_10_songs = s_data.iloc[s_data['Popularity'].nlargest(10).index,:]
top_10_artist.to_frame()
top_10_songs

In [None]:
ax = sns.distplot(s_data['Popularity'],hist_kws={'color':'r'},kde_kws={'color':'g','lw':'6'})
textstr = '\n'.join(
    
        (   r'$\mu=%.2f$' % (s_data['Popularity'].mean(),)
          , r'$\mathrm{median}=%.2f$' % (s_data['Popularity'].median(),)
          , r'$\sigma=%.2f$' % (s_data['Popularity'].std(),)
          , r'Skew=%.2f' % (s_data['Popularity'].skew(),)
          , r'Kurtosis=%.2f' % (s_data['Popularity'].kurt(),)

        )
    
                  )

props = dict(boxstyle='round', facecolor='red', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=17,
        verticalalignment='top', bbox=props)
ax.set_title('Distribution Of Popularity Scores In Our Data',fontsize=16)
ax.set_xlabel('Popularity',fontsize=16)
plt.legend()
plt.show()

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>We see a negative skew in our popularity scores; the mean score is 87.5 with a standard deviation of 4.49. In the following steps of our kernel, we would like to standardize our data with the z transformation, centering it around the mean with unit variance. Note that this rescales the values without changing the shape of the distribution, so the skew remains; still, putting the features on a common scale will hopefully make our model better.</span></p>
<p><br></p>
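As a small worked example of the z transformation, the sketch below standardizes a hypothetical, left-skewed list of scores (the numbers are made up, not taken from the dataset) and compares the skew before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical popularity-like scores with a long left tail.
scores = pd.Series([70, 78, 82, 84, 86, 88, 89, 90, 91, 92, 95], dtype=float)

# z transformation: subtract the mean, divide by the standard deviation.
z = (scores - scores.mean()) / scores.std()

print('mean after :', round(z.mean(), 6))
print('std after  :', round(z.std(), 6))
print('skew before:', round(scores.skew(), 4))
print('skew after :', round(z.skew(), 4))
```

The mean becomes 0 and the standard deviation 1, while the skew stays the same. If a more symmetric target were needed, a nonlinear transform would be required instead.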

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>Next, let us take a closer look at the artists in our data. We have a minimal data set of just 50 samples, so my current goal is to collect more basic information on the artists, create new features from it, and look at all the data concerning the artists, hopefully uncovering an underlying connection between the top 10 genres and the top 10 artists.</span></p>
<p><br></p>

In [None]:
gender = {'Ed Sheeran':'M','The Chainsmokers':'Group','Shawn Mendes':'M','Post Malone':'M','Sech':'M','Marshmello':'M','Billie Eilish':'F','J Balvin':'M',
         'Lil Nas X':'M','Ariana Grande':'F','Daddy Yankee':'M','Y2K':'M','DJ Snake':'M','Lewis Capaldi':'M','Chris Brown':'M','Khalid':'M','Lizzo':'F','Lauv':'M',
         'Kygo':'M','Ali Gatie':'M','Lady Gaga':'F','Bad Bunny':'M','Lunay':'M','Sam Smith':'M','Anuel AA':'M','Nicky Jam':'M','Lil Tecca':'M','ROSAL√çA':'F','Young Thug':'M',
         'Martin Garrix':'M','Katy Perry':'F','Jhay Cortez':'M','Drake':'M','Tones and I':'F','Taylor Swift':'F','Jonas Brothers':'Group','MEDUZA':'M','Maluma':'M'}
s_data['Artist.Gender'] = s_data['Artist.Name'].apply(lambda x: gender[x])

In [None]:
# Copy the slice so the new sentiment column below does not trigger SettingWithCopyWarning
top_10_arist_df = s_data[s_data['Artist.Name'].isin(top_10_artist.index)].copy()
sia = SentimentIntensityAnalyzer()
top_10_arist_df['Track_Name_Sentiment.c'] = top_10_arist_df['Track.Name'].apply(lambda x: sia.polarity_scores(x)['compound'])

In [None]:
plt.figure(figsize=(20,11))
ax = sns.countplot(x=top_10_arist_df['Artist.Gender'])
ax.set_title('Distribution Of Gender Among The Top 10 Artists',fontsize=17)
plt.show()

<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>In our top 50 data set, most of the top 10 artists are male, and the same holds across all 50 songs in our dataset.</span></p>
<p style="text-align: center;"><span style="font-family: 'Times New Roman', Times, serif;"><span style="font-size: 24px;"><br></span></span></p>
<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>Let us try to find out what the top 10 songs have in common. </span></p>

In [None]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list([f'<b>{x}</b>' for x in top_10_songs.columns]),
                fill_color='royalblue',
                font_color='white',
                font_size=13,
                align='left'),
    cells=dict(values=[top_10_songs[col] for col in top_10_songs.columns],
               fill_color='azure',
               align='left'))
])
fig.update_layout(title='Top 10 Songs')
fig.show()

In [None]:
ex.scatter_polar(s_data,theta='Main.Genre',r='Beats.Per.Minute',color ='Popularity',title='Spread of different genre popularity according to beats per minute')

In [None]:
ex.density_heatmap(s_data,x='Beats.Per.Minute',y='Popularity',title='Popularity counts according to BPM ')

In [None]:
numeric_f = top_10_songs.columns[3:13]
# Correlate numeric columns only; track, artist and genre are text
cor = top_10_songs.select_dtypes('number').corr()
ax = sns.distplot((top_10_songs['Danceability']-top_10_songs['Danceability'].mean())/top_10_songs['Danceability'].std(),hist=False,label='Danceability')
ax = sns.distplot((top_10_songs['Energy']-top_10_songs['Energy'].mean())/top_10_songs['Energy'].std(),hist=False,label='Energy')
ax = sns.distplot((top_10_songs['Valence.']-top_10_songs['Valence.'].mean())/top_10_songs['Valence.'].std(),hist=False,label='Valence')
ax = sns.distplot((top_10_songs['Length.']-top_10_songs['Length.'].mean())/top_10_songs['Length.'].std(),hist=False,label='Length')
ax = sns.distplot((top_10_songs['Beats.Per.Minute']-top_10_songs['Beats.Per.Minute'].mean())/top_10_songs['Beats.Per.Minute'].std(),hist=False,label='Beats.Per.Minute')
ax.set_xlabel('Transformed Distribution',fontsize=16)
ax.set_title('Normalized Distributions Of The Most Significant Features In Our Top 10 Songs',fontsize=16)
plt.legend(prop={'size':'20'})
plt.show()


<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>We can observe that, for the features we tested above, the standardized distributions are all close to bell-shaped and strongly resemble one another across our top 10 songs. </span></p>

In [None]:
sd_data = s_data.copy()
geners_one = pd.get_dummies(sd_data['Main.Genre'],prefix='Genre')
geners_one = geners_one[geners_one.columns[1:]]
sd_data = pd.concat([sd_data,geners_one],axis=1)
sd_data = sd_data.drop(columns='Main.Genre')

In [None]:
sd_data['Track_Name_Sentiment'] = sd_data['Track.Name'].apply(lambda x: sia.polarity_scores(x)['compound'])
sd_data['Track_Name_Length'] = sd_data['Track.Name'].apply(lambda x: len(x))
sd_data['Genre'] =sd_data['Genre'].astype('category').cat.codes
sd_data['Artist.Gender'] =sd_data['Artist.Gender'].astype('category').cat.codes

fig = make_subplots(rows=2, cols=1,shared_xaxes=True,subplot_titles=('Pearson Correlation',  'Spearman Correlation'))
colorscale=[[0.0, "rgb(165,0,38)"],
                [0.1111111111111111, "rgb(215,48,39)"],
                [0.2222222222222222, "rgb(244,109,67)"],
                [0.3333333333333333, "rgb(253,174,97)"],
                [0.4444444444444444, "rgb(254,224,144)"],
                [0.5555555555555556, "rgb(224,243,248)"],
                [0.6666666666666666, "rgb(171,217,233)"],
                [0.7777777777777778, "rgb(116,173,209)"],
                [0.8888888888888888, "rgb(69,117,180)"],
                [1.0, "rgb(49,54,149)"]]

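# Keep numeric and boolean (dummy) columns; the remaining text columns cannot be correlated
s_val = sd_data.select_dtypes(include=['number','bool']).corr('pearson')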
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,name='pearson',showscale=False,xgap=1,ygap=1,colorscale=colorscale),
    row=1, col=1
)


s_val = sd_data.select_dtypes(include=['number','bool']).corr('spearman')
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,xgap=1,ygap=1,colorscale=colorscale),
    row=2, col=1
)

fig.update_layout(height=700, width=900, title_text="Pearson And Spearman Correlation Including The New Features")
fig.show()

<p style="text-align: center;"><span style='font-size: 24px; font-family: "Times New Roman", Times, serif;'>This heatmap includes the new features we created. We can see that some of the features we added have a significantly higher correlation with popularity than the original features in the data. Let's check how those features are distributed against popularity to decide which type of regression to use.</span></p>

In [None]:
p_correaltion=['Speechiness.','Beats.Per.Minute','Valence.','Genre','Genre_hip hop']
fig,axs = plt.subplots(2,2)
fig.set_figheight(15)
fig.set_figwidth(15)
sns.regplot(y=sd_data['Popularity'],x=sd_data[p_correaltion[1]],ax=axs[0,1],color='r')
sns.regplot(y=sd_data['Popularity'],x=sd_data[p_correaltion[0]],ax=axs[0,0])
sns.regplot(y=sd_data['Popularity'],x=sd_data[p_correaltion[2]],ax=axs[1,0],color='g')
sns.regplot(y=sd_data['Popularity'],x=sd_data[p_correaltion[3]],ax=axs[1,1],color='c')
plt.show()

<a id="4"></a>

<h3 style="background-color:orange;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">Model Selection And Evaluation</h3>


<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>Let us select a model to predict the popularity of a song. Our data set is very small, so we use cross-validation to get a more honest error estimate and, hopefully, to detect overfitting.</span></p>
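With only 50 rows, the RMSE from a single 5-fold split can swing noticeably depending on how the folds fall. One possible way to steady the estimate, shown here on synthetic data, is to repeat the split several times with `RepeatedKFold`. This is a sketch under assumptions, not the procedure used in this kernel; `X_demo`, `y_demo` and the number of repeats are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 rows, 3 features, noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = 87 + 2.0 * X_demo[:, 0] + rng.normal(scale=4.0, size=50)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and no information from the held-out fold leaks into it.
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])

# 5 folds repeated 10 times with different shuffles gives 50 held-out scores.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
neg_mse = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='neg_mean_squared_error')
rmse_per_fold = np.sqrt(-neg_mse)

print('mean RMSE:', rmse_per_fold.mean())
print('std  RMSE:', rmse_per_fold.std())
```

The spread across folds is worth reporting next to the mean, since on a dataset this small it can be as large as the gap between models.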

<a id="4.1"></a>

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;'>Selecting The Target Feature And The Predictor Features</span></p>

In [None]:
y = sd_data['Popularity']
X = sd_data[['Speechiness.','Beats.Per.Minute','Valence.','Genre','Genre_hip hop','Genre_escape room','Genre_r&b']]


In [None]:
#Knn Pipe
Knn_Pipe =  Pipeline(steps=[('scaler',StandardScaler()),('KNN',KNeighborsRegressor(n_neighbors=5))])
Knn_scores = np.sqrt(-1*cross_val_score(Knn_Pipe,X,y,cv=5,scoring='neg_mean_squared_error'))
Knn_Pipe.fit(X,y)

#Random Forest Pipe
RF_Pipe =  Pipeline(steps=[('scaler',StandardScaler()),('RF',RandomForestRegressor(max_leaf_nodes=14,n_estimators=20,random_state=42))])
RF_scores = np.sqrt(-1*cross_val_score(RF_Pipe,X,y,cv=5,scoring='neg_mean_squared_error'))
RF_Pipe.fit(X,y)

#Decision Tree Pipe
DT_Pipe =  Pipeline(steps=[('scaler',StandardScaler()),('DT',DecisionTreeRegressor(max_leaf_nodes=10))])
DT_scores = np.sqrt(-1*cross_val_score(DT_Pipe,X,y,cv=5,scoring='neg_mean_squared_error'))
DT_Pipe.fit(X,y)

#Linear Regression Pipe
LR_pipe = Pipeline(steps=[('scaler',StandardScaler()),('poly',PolynomialFeatures(degree=1)),('LinearRegression',LinearRegression())])
LR_scores = np.sqrt(-1*cross_val_score(LR_pipe,X,y,cv=5,scoring='neg_mean_squared_error'))
LR_pipe.fit(X,y)

pred = LR_pipe.predict(X)
mse = np.sqrt(mean_squared_error(pred,y))


In [None]:
fig = make_subplots(rows=4, cols=1,shared_xaxes=True,subplot_titles=('Random Forest Cross Val Scores',
                                                                     'Decision Tree Cross Val Scores',
                                                                    'Linear Regression Cross Val Scores',
                                                                    'KNN Cross Val Scores'))

fig.add_trace(
    go.Scatter(x = np.arange(1,len(RF_scores)+1),y=RF_scores,mode='lines+markers',name='Random Forest'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x = np.arange(1,len(DT_scores)+1),y=DT_scores,mode='lines+markers',name='Decision Tree'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x = np.arange(1,len(LR_scores)+1),y=LR_scores,mode='lines+markers',name='Linear Regression'),
    row=3, col=1
)
fig.add_trace(
    go.Scatter(x = np.arange(1,len(Knn_scores)+1),y=Knn_scores,mode='lines+markers',name='KNN'),
    row=4, col=1
)


#Fold Means
fig.add_shape(type="line",
    x0=1, y0=np.mean(RF_scores), x1=5, y1=np.mean(RF_scores),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean',
        xref='x1', 
        yref='y1'
)

fig.add_shape(type="line",
    x0=1, y0=np.mean(DT_scores), x1=5, y1=np.mean(DT_scores),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean',
        xref='x2', 
        yref='y2'
)

fig.add_shape(type="line",
    x0=1, y0=np.mean(LR_scores), x1=5, y1=np.mean(LR_scores),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean',
        xref='x3', 
        yref='y3'
)

fig.add_shape(type="line",
    x0=1, y0=np.mean(Knn_scores), x1=5, y1=np.mean(Knn_scores),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean',
        xref='x4', 
        yref='y4'
)


fig.update_layout(height=700, width=900, title_text="Different Model 5 Fold Cross Validation")
fig.update_yaxes(title_text="RMSE")
fig.update_xaxes(title_text="Fold #")

fig.show()

In [None]:
fig = go.Figure(data=[go.Table(header=dict(values=['<b>Model</b>', '<b>Root Mean Squared Error</b>'],
                                           line_color='darkslategray',
    fill_color='whitesmoke',
    align=['center','center'],
    font=dict(color='black', size=18),
    height=40),
                               
                 cells=dict(values=[['<b>Random Forest</b>', '<b>Decision Tree</b>','<b>Linear Regression</b>','<b>KNN</b>'],
                                    [np.round(RF_scores.mean(),2), 
                                     np.round(DT_scores.mean(),2),
                                     np.round(LR_scores.mean(),2),  # cross-validated RMSE, comparable with the other rows
                                     np.round(Knn_scores.mean(),2) 
                                    ]]))
                     ])

fig.update_layout(title='Model Result On Original Data (Without Upsampling)')
fig.show()

<a id="4.6"></a>

<h3 style="background-color:orange;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Model Blending</h3>

In [None]:
pred = LR_pipe.predict(X)*0.2 + RF_Pipe.predict(X)*0.3 + 0.4* DT_Pipe.predict(X) + Knn_Pipe.predict(X)*0.1
rmse = RMSE(pred,y)
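
The blended RMSE above is measured on the same rows the models were fitted on, so it is likely to look better than it would on unseen songs. A hedged sketch of an out-of-fold alternative follows. It reuses the blend weights from the cell above but runs on synthetic data; the names `X_demo`, `y_demo` and `blend` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (values are made up).
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 4))
y_demo = 87 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=4.0, size=50)

# (model, weight) pairs; the weights sum to 1.
blend = [
    (LinearRegression(), 0.2),
    (RandomForestRegressor(max_leaf_nodes=14, n_estimators=20, random_state=42), 0.3),
    (DecisionTreeRegressor(max_leaf_nodes=10, random_state=0), 0.4),
    (KNeighborsRegressor(n_neighbors=5), 0.1),
]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = np.zeros(len(y_demo))        # each row predicted by models that never saw it
in_sample_pred = np.zeros(len(y_demo))  # each row predicted by models fitted on it

for model, weight in blend:
    oof_pred += weight * cross_val_predict(model, X_demo, y_demo, cv=cv)
    in_sample_pred += weight * model.fit(X_demo, y_demo).predict(X_demo)

oof_rmse = np.sqrt(mean_squared_error(y_demo, oof_pred))
in_sample_rmse = np.sqrt(mean_squared_error(y_demo, in_sample_pred))
print('in-sample RMSE  :', in_sample_rmse)
print('out-of-fold RMSE:', oof_rmse)
```

The gap between the two numbers gives a rough sense of how much of the in-sample fit comes from memorizing the training rows.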

In [None]:
ax=sns.lineplot(x=np.arange(0,len(y)),y=y,label = 'Actual Popularity Value')
ax = sns.lineplot(x=np.arange(0,len(y)),y=pred,label = 'Predicted Popularity Value')
ax.set_xlabel('Song Index',fontsize=16)
ax.set_ylabel('Popularity',fontsize=16)
prop3 = dict(boxstyle='round',facecolor='orange',alpha=0.5)
ax.text(0.05, 0.25, 'RMSE : {:.2f}'.format(rmse), transform=ax.transAxes, fontsize=27,
        verticalalignment='top', bbox=prop3)
plt.legend(prop={'size':'20'})
plt.show()

In [None]:
output=pd.DataFrame({'Prediction':pred,'Actual':y})
fig = make_subplots(
    rows=3, cols=2,subplot_titles=('','Actual','Predictions','Residuals'),
    vertical_spacing=0.09,
    specs=[[{"type": "table","rowspan": 3}     ,{"type": "scatter"}] ,
           [None                               ,{"type": "scatter"}]            ,           
           [None                               ,{"type": "scatter"}]                           
          ]
)

fig.add_trace(
    go.Scatter(
        x=np.arange(0,len(output["Actual"])),
        y=output["Actual"],
        mode="markers",
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=np.arange(0,len(output["Prediction"])),
        y=output["Prediction"],
        mode="markers",
    ),
    row=2, col=2
)

fig.add_trace(
    go.Scatter(
        x=np.arange(0,len(output["Prediction"])),
        y=output["Prediction"]-output["Actual"],
        mode="markers",
    ),
    row=3, col=2
)

fig.add_trace(
    go.Table(
        header=dict(
            values=['Prediction','Actual'],
            font=dict(size=10),
            align="left"
        ),
        cells=dict(
            values=[output[k].tolist() for k in output.columns],
            align = "left")
    ),
    row=1, col=1
)



fig.add_shape(type="line",
    x0=0, y0=(output["Prediction"]-output["Actual"]).mean(), x1=len(output["Prediction"]), y1=(output["Prediction"]-output["Actual"]).mean(),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean',
        xref='x3', 
        yref='y3'
)

fig.update_layout(
    height=800,
    showlegend=False,
    title_text="Prediction Evaluation",
)

fig.show()

<h3 style="background-color:orange;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Visible Heteroscedasticity Evaluation</h3>

In [None]:
plt.title('Residuals of predicted values',fontsize=(18))
sns.residplot(x=pred,y=y)
plt.show()
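
A visual check can be complemented with a rough numeric one. One simple option, sketched below on synthetic data, is the Spearman correlation between the predictions and the absolute residuals: a clearly positive value suggests the error spread grows with the predicted value. This is an informal heuristic rather than a formal test, and `pred_demo` and `actual_demo` are made-up stand-ins for `pred` and `y`.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic example where the noise scale grows with the prediction.
rng = np.random.default_rng(1)
pred_demo = rng.uniform(1, 100, size=200)
actual_demo = pred_demo + rng.normal(scale=0.1 * pred_demo)

residuals = actual_demo - pred_demo

# Rank correlation between prediction size and residual magnitude.
rho, p_value = spearmanr(pred_demo, np.abs(residuals))
print('Spearman rho:', rho)
print('p-value     :', p_value)
```

With only 50 songs, such a statistic is noisy, so it is best read together with the residual plot.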