
# Spotify - Popularity Classification 
### All Time Top 2000s Mega Dataset


TABLE OF CONTENTS

0. INTRODUCTION & PROJECT GOAL
1. IMPORTING LIBRARIES
2. DATA DESCRIPTION & CLEANING
3. EXPLORATORY ANALYSIS & VISUALISATIONS
4. MODELLING DATA
5. FINAL CONCLUSIONS

### 0 INTRODUCTION

#### Data Collection:

This Dataset was collected from Kaggle.com

- Link: https://www.kaggle.com/iamsumat/spotify-top-2000s-mega-dataset

#### Context

- This dataset contains audio statistics for roughly the top 2000 tracks on Spotify, released between 1956 and 2019. The data has 15 columns, each describing the track and its qualities.

#### Acknowledgements

This data is extracted from the Spotify playlist - Top 2000s on PlaylistMachinery(@plamere) using Selenium with Python. More specifically, it was scraped from http://sortyourmusic.playlistmachinery.com/. 

#### Content Variables

1. Index: ID
2. Title: Name of the Track
3. Artist: Name of the Artist
4. Top Genre: Genre of the track
5. Year: Release Year of the track
6. Beats per Minute(BPM): The tempo of the song
7. Energy: The energy of a song - the higher the value, the more energetic the song.
8. Danceability: The higher the value, the easier it is to dance to this song.
9. Loudness: The higher the value, the louder the song.
10. Liveness: The higher the value, the more likely the song is a live recording.
11. Valence: The higher the value, the more positive mood for the song.
12. Length: The duration of the song.
13. Acoustic: The higher the value the more acoustic the song is.
14. Speechiness: The higher the value the more spoken words the song contains.
15. Popularity: The higher the value the more popular the song is.

# PROJECT GOAL & INTERESTS:

1. The goal is to build classification models using: **Logistic Regression, Decision Tree Classifier & Naive Bayes.**
- We will look to classify a song's level of popularity based on the feature metrics mentioned above.

Along the way we will look at other interests such as:

2. Which genres and artists are the most popular of all time, from the 1950s to the 2000s?
3. Is there a trend in the genres preferred in the past vs now?
4. What other variables have an impact on the popularity metric? 

# 1 IMPORTING LIBRARIES

In [None]:
import pandas as pd
import numpy as np
from scipy import stats, special

import seaborn as sns
from seaborn import pairplot, heatmap
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)

from sklearn import model_selection, metrics, linear_model, tree, datasets, feature_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import LabelEncoder

# 2 DATA DESCRIPTION & CLEANING

#### 2.1 DATA DESCRIPTION

In [None]:
# Loading the dataset & viewing the first rows

spotify_df = pd.read_csv('../input/spotify-top-2000s-mega-dataset/Spotify-2000.csv') 
spotify_df.head()

In [None]:
# Overview of Dataset information and data types

spotify_df.info() 

In [None]:
# Overview of Dataset numerical data

spotify_df.describe()

In [None]:
# Number of genres that have featured in the all time top 2000.

len(spotify_df["Top Genre"].unique())

In [None]:
# Number of times each genre features in the all time top 2000.

spotify_df["Top Genre"].value_counts()

### Raw Dataset  Summary:

#### The Dataset contains:
- 1994 entries
- 1994 non-null entries
- 15 total variable columns
- 149 Genre entries

#### Data Types:
- 4 categorical columns
- 11 numerical columns

#### Numerical Data:
- The dataset spans the years 1956 - 2019: about 63 years' worth of the most popular songs as classified by Spotify.
- min Popularity of a song is 11 and max is 100.


### Initial Analysis & Progression:

1. It's clear that Rock music seems to be the all-time favourite genre, with the most entries. That said, the data is lopsided towards pre-2000s music, and music taste does tend to change over the years, so this can be investigated further in the EDA.

### 2.2 Data Cleaning

#### Action:

1. Convert column data types.
2. Remove unnecessary columns.
3. Adjust column titles.
4. Consolidate the genre column, as there are many variations of a single genre, e.g. 'dutch pop' and 'dance pop', or 'album rock' and 'alternative rock'. We will relabel these as just 'pop' or just 'rock' to provide a more accurate, summarised representation of each genre.
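
As a minimal sketch of the first action, the snippet below converts text durations that use a comma as thousands separator into integers. The three values are made up for illustration and are not taken from the dataset; only the one column is touched, so commas elsewhere (for example in song titles) would be left alone.

```python
import pandas as pd

# Toy stand-in for the 'Length (Duration)' column (hypothetical values):
# durations in seconds stored as text, with a comma as thousands separator.
raw_length = pd.Series(["201", "1,412", "256"])

# Strip the separator from this column only, then convert to a numeric dtype.
# errors="coerce" turns anything that still cannot be parsed into NaN.
length_seconds = pd.to_numeric(
    raw_length.str.replace(",", "", regex=False), errors="coerce"
)

print(length_seconds.tolist())
print(length_seconds.dtype)
```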

In [None]:
#Converting Length (Duration) to an integer data type

# Remove the thousands separator in this column only, so that commas in titles and artist names are kept.
spotify_df['Length (Duration)'] = pd.to_numeric(
    spotify_df['Length (Duration)'].astype(str).str.replace(',', '', regex=False), errors='coerce')
print('Length (Duration) is now a -->',spotify_df['Length (Duration)'].dtype, 'data type')

In [None]:
#Removing the Index column.
spotify_df.drop(columns = ['Index'], inplace = True)

#Converting all column titles to lowercase.
spotify_df.columns = spotify_df.columns.str.lower()

#Shortening the longer column names and replacing spaces with an underscore "_" where needed.
spotify_df.rename(columns = {'top genre' : 'genre', 'beats per minute (bpm)':'beats_per_minute','loudness (db)': 'loudness','length (duration)': 'duration'}, inplace = True)

In [None]:
spotify_df.info()

In [None]:
spotify_df.head(3)

### Consolidating genre column

In [None]:
# Function to reduce each genre label to its last word,
# e.g. 'dutch pop' -> 'pop' and 'album rock' -> 'rock'.

def genre_splitter(genre):
    return genre.str.split(' ').str[-1]

# Count the genres before and after consolidation instead of hard-coding the totals.

n_genres_before = spotify_df['genre'].nunique()
new_genre = genre_splitter(spotify_df['genre'])

print('New Total of Genres from', n_genres_before, 'to -->', new_genre.nunique())

In [None]:
new_genre.value_counts()

### Analysis:
- The above shows the genre column consolidated into single, more general genres such as rock, pop etc.
- There is also an expected increase in the counts per genre due to the consolidation.
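
To make the effect of the consolidation concrete, here is a small sketch on a handful of example labels. It assumes that keeping only the last word of each label is the intended rule; the five labels are chosen for illustration.

```python
import pandas as pd

# Illustrative genre labels (a tiny hand-picked sample, not the full column).
genres = pd.Series(
    ["dutch pop", "dance pop", "album rock", "alternative rock", "adult standards"]
)

# Keep only the last word of every label.
consolidated = genres.str.split(" ").str[-1]

# Counts per consolidated genre: the variants now add up under one label.
counts = consolidated.value_counts()

print(consolidated.tolist())
print(counts.to_dict())
```

Note that this rule also maps labels such as 'adult standards' to 'standards', so the result is a coarse grouping rather than a curated genre taxonomy.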

In [None]:
#inputting new column values from new_genre to genre in dataframe.

spotify_df['genre'] = new_genre
spotify_df['genre']

# 3 EXPLORATORY ANALYSIS & VISUALISATIONS


- In this section we will investigate the data, taking a particular look at our target features, "Popularity" & "Genre", and the variables that correlate with them within the dataset.

### 3.1 Most popular Genres & Artists from the 1950s to the 2000s

In [None]:
# Create a function top_10, which takes a single parameter for a column.
# Group the data by the desired column, sum the 'popularity' values per group, sort the sums
# from highest to lowest, then return the top 10 rows.

def top_10(column):
    top_10_songs = spotify_df.groupby(column)['popularity'].sum().sort_values(ascending=False).head(10)
    return top_10_songs.to_frame()  # Only show 'popularity' column.

top_10('genre')

In [None]:
#Use the same function for the Artists

top_10('artist')

#### 3.1 Conclusion:

- The above shows the accumulated popularity of the most popular Genres & Artists of all time. We can also see there is a significantly larger number of Rock entries compared to the rest of the genres, so there was always a high possibility that Rock would be the most popular overall. Michael Jackson would be an outlier here due to his popularity.

- But in terms of 'pure' popularity as seen below, with values from 0-100, the story is a little different. We can see that pop music and its many variations holds the majority of the top popularity scores in more recent years. This may indicate a shift in popularity over the years, as well as in listening trends due to advancements in technology.

In [None]:
pure_popularity = spotify_df.sort_values('popularity', ascending=False).head(10)
pure_popularity[['genre', 'year', 'popularity']]

## 3.2 Is there a trend/shift in the genres preferred pre-2000s vs now?

#### - To approach this, I will **split the dataset into quarters (n/4) and track the popularity change** in genre, as well as the genre entry count over the years.
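
A quick sketch of how such a split behaves, using `pd.qcut` on twelve made-up years. The quartile boundaries are derived from the data, so each block holds about the same number of songs rather than the same number of calendar years.

```python
import pandas as pd

# Twelve hypothetical release years, for illustration only.
years = pd.Series([1956, 1967, 1975, 1979, 1984, 1991, 1999, 2004, 2010, 2013, 2016, 2019])

# q=4 cuts at the quartiles; retbins=True also returns the bin edges that were used.
blocks, edges = pd.qcut(years, q=4, labels=[1, 2, 3, 4], retbins=True)

print(blocks.tolist())   # block number assigned to each year
print(edges)             # the five edges delimiting the four blocks
```

On skewed data the blocks can cover very different time spans, which is worth keeping in mind when reading the year blocks as "eras".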

In [None]:
# Split the 'year' column into 4 quantile-based buckets of roughly equal size, in ascending order from 1956 - 2019:

spotify_df['year'] = pd.qcut(spotify_df['year'], q=4, labels=[1, 2, 3, 4]) 
spotify_df['year'].value_counts()

In [None]:
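x=['YB-1', 'YB-2', 'YB-3', 'YB-4']
# sort_index() orders the counts by year block (1-4) so that they line up with the labels in x
y=spotify_df['year'].value_counts().sort_index()

ax = sns.barplot(x=x, y=y.values, palette="BrBG")
ax.set(xlabel='Year Block (YB) divided into 4/4', ylabel='Song Count')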

In [None]:
# Function for creating year block sets (1 - 4):

def year_block(year_no):
    block = spotify_df.loc[spotify_df['year'] == year_no]
    return block[['genre', 'year', 'popularity']].sort_values('popularity', ascending=False)

# Function for creating Top 5 genre value counts for Pie Chart visual:

def genre_count(year_block):
    return year_block['genre'].value_counts().head()

## Pie Chart Visuals of Year Blocks showing trend/shift in genres between (1956 - 2019)

In [None]:
#Creation of Year blocks 1-4:
year_block_1 = year_block(1)
year_block_2 = year_block(2)
year_block_3 = year_block(3)
year_block_4 = year_block(4)

#Colours for the pie slices; the first two are swapped for Year Block 4.
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen', 'AntiqueWhite']
colors_block_4 = ['mediumturquoise', 'gold', 'darkorange', 'lightgreen', 'AntiqueWhite']

#Function drawing a pie chart of the Top 5 genre counts for one year block:
def genre_pie(block, title, colors):
    counts = genre_count(block)
    fig = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, pull=[0.1, 0, 0, 0, 0])])
    fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                      marker=dict(colors=colors, line=dict(color='#000000', width=1.5)))
    fig.update_layout(title_text=title)
    fig.show()

#Pie charts for year blocks 1-4:
genre_pie(year_block_1, 'Year Block 1', colors)
genre_pie(year_block_2, 'Year Block 2', colors)
genre_pie(year_block_3, 'Year Block 3', colors)
genre_pie(year_block_4, 'Year Block 4', colors_block_4)

#### 3.2 Conclusion: Pie Chart Analysis:
- As per above, we can clearly see a shift in genre preference, from rock leaning into pop music by the final quarter (4/4). However, pop music reaches a much higher level of popularity than rock music did at its peak, as we shall see below.

## Genre Popularity Count over the years

In [None]:
# Joining the year_block heads for top genre count:

frames = [year_block_1.head(), year_block_2.head(), year_block_3.head(), year_block_4.head()]
top_genre_df = pd.concat(frames)
top_genre_df

## 3.3 What other features have an impact on the popularity of a song?

- We shall look to explore numerical variables such as the audio features and look for correlation between each feature, in order to help define usable features to build an accurate model. First let's remove the 'title' and 'year' columns: titles are close to unique per row, and 'year' has already been reduced to four blocks above, so neither is used as a feature in our classification model.

In [None]:
spotify_df.drop(columns = ['title','year'], inplace = True)
spotify_df.head()

#### Note:
- Music genres with only a handful of entries would make our model inefficient, since there is not enough data for it to work from, so these genres and their corresponding rows in the dataframe will be removed. Therefore:

#### - Genres with a value count of less than 20 shall be removed

In [None]:
# genres with fewer than 20 instances will be placed within the to_remove list
genre_counts = spotify_df['genre'].value_counts()
to_remove = genre_counts[genre_counts < 20].index.tolist()

print('Genre Values to be removed from data set =', len(to_remove))

#### - Now to remove the rows belonging to those genres from the dataframe

In [None]:
# Keep only the rows whose genre is not in the to_remove list
spotify_df = spotify_df[~spotify_df['genre'].isin(to_remove)]

spotify_df.head()

In [None]:
spotify_df = spotify_df.reset_index(drop=True)
spotify_df.head()

#### - As you can see, genres with fewer than 20 instances have been removed.

In [None]:
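plt.figure(figsize=(20,10))
# Correlations are only defined for numeric columns, so select those first.
sns.heatmap(spotify_df.select_dtypes(include='number').corr(),annot=True,cmap='BrBG')
plt.show()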

### 3.3 Conclusion:

#### Target Feature: 'Popularity'
From the above heat map we can see that the features correlating most with Popularity, though none of them strongly, are:
- loudness (r ≈ 0.17)
- danceability (r ≈ 0.13)
- energy (r ≈ 0.12)
- valence (r ≈ 0.10)

#### Variable relationships: 
We can see that the strongest relationships between the variables, excluding popularity, are:
- loudness & energy (r ≈ 0.74)
- valence & danceability (r ≈ 0.53)
- valence & energy (r ≈ 0.43)
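
Reading these values off a heat map by eye is error-prone, so here is a hedged sketch of how the same ranking could be extracted programmatically. It runs on synthetic data generated for the example (the column names mirror the notebook, the numbers do not), ranking features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 'loudness' is built to track 'energy',
# and 'popularity' depends weakly on 'loudness'. 'valence' is pure noise.
rng = np.random.default_rng(0)
n = 500
energy = rng.normal(size=n)
loudness = 0.8 * energy + 0.6 * rng.normal(size=n)
valence = rng.normal(size=n)
popularity = 0.5 * loudness + rng.normal(size=n)

toy = pd.DataFrame(
    {"energy": energy, "loudness": loudness, "valence": valence, "popularity": popularity}
)

# Pearson correlation matrix, then the target column without its self-correlation,
# sorted by absolute strength.
corr = toy.corr()
target_corr = corr["popularity"].drop("popularity").abs().sort_values(ascending=False)

print(target_corr.round(2))
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a non-linear model such as a decision tree.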

# 4 MODELLING DATA

## --> Model (Popularity Classification)

**Popularity predictor Model:** We will look to classify a song's level of popularity based on the feature metrics mentioned above.

**Steps:**
1. One-hot encode (get dummy variables for) the genre & artist features, and bin the popularity feature into classes.
2. Split & Scale Data
3. Model & Train Data

### Step 1: One hot-encode Genre & Artist Features:

#### - Create dummy variables for the genre column.
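
For readers new to one-hot encoding, this is a minimal sketch of what `pd.get_dummies` produces on a three-row toy frame. The frame and its values are invented for the example.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values).
toy = pd.DataFrame({"genre": ["rock", "pop", "rock"], "energy": [80, 65, 72]})

# Each distinct genre becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["genre"], prefix="genre", dtype=int)

print(encoded)
```

Because one column is created per distinct value, encoding a high-cardinality feature such as the artist name adds a very large number of sparse columns, which can make it easier for models to memorise the training data.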

In [None]:
# Function for creating dummy variables

def dummies(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

spotify_df = dummies(spotify_df, 'genre', 'genre')
spotify_df = dummies(spotify_df, 'artist', 'artist')
spotify_df.head()

#### - Classify the target feature 'popularity' into 2 bins, turning it into a binary label for our predictive modelling.

- 1 = 'most popular' 
- 0 = 'least popular'  

In [None]:
spotify_df['popularity'] = pd.qcut(spotify_df['popularity'], q=2, labels=[0, 1]) 
spotify_df[['popularity']].head()

In [None]:
target_v = spotify_df['popularity'].value_counts(normalize=True).round(3)
target_v

In [None]:
sns.barplot(x=target_v.index, y=target_v, palette="BrBG")

### Analysis

- As we see above, we have a very good split for our target variable. The two classes are close to balanced, so accuracy is a meaningful metric and the models are not biased towards a majority class.
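
A balanced target also fixes the bar that any model has to clear. As a sketch, the snippet below uses scikit-learn's `DummyClassifier` on a made-up 52/48 label split to compute the accuracy of always predicting the majority class; the split is an assumption for illustration, not the notebook's actual proportions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical, nearly balanced binary target: 52 zeros and 48 ones.
y_toy = np.array([0] * 52 + [1] * 48)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(round(baseline_acc, 2))
```

Any model accuracy should be read against this kind of baseline: on a target split close to 50/50, a score in the low fifties is only marginally better than guessing.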

### Step 2: Splitting & Scaling Data

In [None]:
# Choosing independent and dependent variables:

y = spotify_df.loc[:,'popularity'] #dependent
X = spotify_df[['loudness', 'danceability', 'energy', 'valence']] #independent

# Split data into training and test sets: 80% training, 20% test split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

In [None]:
print('Shape of Data:')
print('-')
print('X_train:', X_train.shape, 'y_train:', y_train.shape)
print('X_test:', X_test.shape,'y_test:', y_test.shape)

In [None]:
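# Standardise the independent variables (zero mean, unit variance).
# The scaler is fitted on the training data only and then applied to the test data,
# so that both sets are on the same scale when the models are scored.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)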

## Modeling & Training

Due to the nature of the dataset, we will be implementing classification type machine learning with the following:
- Logistic Regression
- Decision Tree Classifier
- Naive Bayes 
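
Since scaling and modelling always go together here, one option is to bundle them in a scikit-learn `Pipeline`, so the same fitted scaler is applied automatically during training, cross-validation and testing. The sketch below runs on synthetic data from `make_classification`; the variable names (`X_demo`, `pipe`, `kf_demo`) are placeholders and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-feature binary classification problem (stand-in for the audio features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=20
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=20)

# The pipeline fits the scaler on whatever data it is trained on,
# and reuses those statistics when predicting or scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# Inside cross-validation the scaler is refitted on each training fold only.
kf_demo = KFold(n_splits=10, shuffle=True, random_state=24)
cv_acc = cross_val_score(pipe, X_tr, y_tr, scoring="accuracy", cv=kf_demo).mean()

# Final fit on the full training set, then one evaluation on the untouched test set.
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)

print("CV accuracy:", round(cv_acc, 3))
print("Test accuracy:", round(test_acc, 3))
```

This design also avoids a subtle leak in cross-validation: scaling the whole training set up front lets each validation fold influence the scaling statistics, whereas the pipeline keeps every fold's validation data out of the fit.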

## Training Model Accuracy Outputs:

In [None]:
# Creating Models

logr_model = linear_model.LogisticRegression(solver='liblinear')
dtree_model = tree.DecisionTreeClassifier()
nb_model = GaussianNB()

# Training the models

logr_model.fit(X_train, y_train)
dtree_model.fit(X_train, y_train)
nb_model.fit(X_train, y_train)

# Accuracy of trained models with training data:

logr_train_acc = logr_model.score(X_train, y_train)
dtree_train_acc = dtree_model.score(X_train, y_train)
nb_train_acc = nb_model.score(X_train, y_train)

print('Training Model Accuracy Outputs:')
print('-')
print('Logistic Regression:', round(logr_train_acc*100,2),'%')
print('Decision Tree:', round(dtree_train_acc*100,2),'%')
print('Naive Bayes:', round(nb_train_acc*100,2),'%')

In [None]:
fig = px.bar(x=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'], 
             y=[logr_train_acc, dtree_train_acc, nb_train_acc], 
            color=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'],
             labels={'x': 'Model', 'y': 'Accuracy'},
            title='Accuracy of trained models with training data')
fig.show()

### Analysis:
- As seen above, the Decision Tree model has a really high accuracy rate against the training data. This seems a bit too high and I suspect overfitting. I will validate this through k-fold cross validation.
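
To illustrate why a very high training accuracy alone is not evidence of a good model, the sketch below fits an unrestricted decision tree to random labels that have no relationship to the features. The data is synthetic and the `max_depth=3` setting is just one illustrative way of limiting tree size.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Random features and random binary labels: there is nothing real to learn.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 4))
y_noise = rng.integers(0, 2, size=300)

# A fully grown tree can memorise every training row...
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_noise, y_noise)
train_acc = deep_tree.score(X_noise, y_noise)

# ...but cross-validation scores it on rows it has not seen.
cv_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_noise, y_noise, cv=10
).mean()

# A depth-limited tree for comparison.
shallow_cv_acc = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_noise, y_noise, cv=10
).mean()

print("Train accuracy (deep tree):", train_acc)
print("CV accuracy (deep tree):", round(cv_acc, 3))
print("CV accuracy (max_depth=3):", round(shallow_cv_acc, 3))
```

The gap between the training score and the cross-validated score is the signal to look for; a large gap suggests the tree is fitting noise.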

## Validating models with k-fold cross validation:

In [None]:
# Validating models with k-fold cross validation method

kf = model_selection.KFold(n_splits=10, shuffle=True, random_state=24)

accuracy_logr = cross_val_score(logr_model, X_train, y_train, scoring="accuracy", cv=kf)
accuracy_dtree = cross_val_score(dtree_model, X_train, y_train, scoring="accuracy", cv=kf)
accuracy_nb = cross_val_score(nb_model, X_train, y_train, scoring="accuracy", cv=kf)

accuracy_logr = accuracy_logr.mean()
accuracy_dtree = accuracy_dtree.mean()
accuracy_nb = accuracy_nb.mean()

print('k-fold Cross Validation Accuracy Outputs:')
print('-')
print("Logistic Regression:", round(accuracy_logr*100,2),"%")
print("Decision Tree:", round(accuracy_dtree*100,2),"%")
print("Naive Bayes:", round(accuracy_nb*100,2),"%")

In [None]:
fig = px.bar(x=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'], 
             y=[accuracy_logr, accuracy_dtree, accuracy_nb], 
            color=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'],
             labels={'x': 'Model', 'y': 'Accuracy'},
            title='Validating models with k-fold cross validation')
fig.show()

### Analysis:
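- As seen above, these figures look more realistic, with the best accuracy coming from the Logistic Regression model. The Decision Tree's drop from near-perfect training accuracy to the lowest cross-validated accuracy supports the overfitting suspected earlier: it fits the training data very closely but does not generalise to data it has not seen.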

## Test Data Output from trained models:

In [None]:
# using the test data for our trained models:

print('Test data - Model Accuracy Outputs:')
print('-')
print('Logistic Regression:', round(logr_model.score(X_test, y_test)*100,2),'%')
print('Decision Tree:', round(dtree_model.score(X_test, y_test)*100,2),'%')
print('Naive Bayes:', round(nb_model.score(X_test, y_test)*100,2),'%')

### Analysis and Next Steps:
- With the test data, we have an identical match with Logistic Regression and Decision Tree.
- Accuracy ratings are very low: on a roughly balanced two-class target they are only slightly better than guessing.
- I will now look to include all variables to help improve the accuracy of the models.

# --> Model (Popularity Classification) Pt.2 

- Using all independent variables to try and achieve a greater accuracy

### Splitting & Scaling Data.2

In [None]:
# Choosing independent and dependent variables:

y_1 = spotify_df.loc[:,'popularity'] #dependent/target
X_1 = spotify_df.drop('popularity', axis=1) #independent

# Split data into training and test sets: 80% training, 20% test split.

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_1, test_size=0.2, random_state=20)

In [None]:
print('Shape of Data.2:')
print('-')
print('X_train_1:', X_train_1.shape, 'y_train_1:', y_train_1.shape)
print('X_test_1:',X_test_1.shape,'y_test_1:', y_test_1.shape)

In [None]:
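# Standardise the independent variables (zero mean, unit variance).
# The scaler is fitted on the training data only and then applied to the test data,
# so that both sets are on the same scale when the models are scored.

scaler_1 = StandardScaler()
X_train_1 = scaler_1.fit_transform(X_train_1)
X_test_1 = scaler_1.transform(X_test_1)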

## Modeling & Training.2

- Same as previously done with the 3 classification models.

#### Training Model Accuracy Outputs.2

In [None]:
# Creating Models

logr_model_1 = linear_model.LogisticRegression(solver='liblinear')
dtree_model_1 = tree.DecisionTreeClassifier()
nb_model_1 = GaussianNB()

# Training the models

logr_model_1.fit(X_train_1, y_train_1)
dtree_model_1.fit(X_train_1, y_train_1)
nb_model_1.fit(X_train_1, y_train_1)

# Accuracy of trained models with training data:

logr_train_acc_1 = logr_model_1.score(X_train_1, y_train_1)
dtree_train_acc_1 = dtree_model_1.score(X_train_1, y_train_1)
nb_train_acc_1 = nb_model_1.score(X_train_1, y_train_1)

print('Training Model Accuracy Outputs.2:')
print('-')
print('Logistic Regression:', round(logr_train_acc_1*100,2),'%')
print('Decision Tree:', round(dtree_train_acc_1*100,2),'%')
print('Naive Bayes:', round(nb_train_acc_1*100,2),'%')

In [None]:
fig = px.bar(x=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'], 
             y=[logr_train_acc_1, dtree_train_acc_1, nb_train_acc_1], 
            color=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'],
             labels={'x': 'Model', 'y': 'Accuracy'},
            title='Accuracy of trained models using all independent variables on training data')
fig.show()

### Analysis:
- As seen above, the accuracy has increased quite a bit for all models. Again, the Decision Tree model has a really high accuracy rate against the training data, so I will validate this through k-fold cross validation.

#### Validating models with k-fold cross validation.2

In [None]:
# Validating new models with k-fold cross validation method

accuracy_logr_1 = cross_val_score(logr_model_1, X_train_1, y_train_1, scoring="accuracy", cv=kf)
accuracy_dtree_1 = cross_val_score(dtree_model_1, X_train_1, y_train_1, scoring="accuracy", cv=kf)
accuracy_nb_1 = cross_val_score(nb_model_1, X_train_1, y_train_1, scoring="accuracy", cv=kf)

accuracy_logr_1 = accuracy_logr_1.mean()
accuracy_dtree_1 = accuracy_dtree_1.mean()
accuracy_nb_1 = accuracy_nb_1.mean()

print('k-fold Cross Validation Accuracy Outputs.2:')
print('-')
print("Logistic Regression:", round(accuracy_logr_1*100,2),"%")
print("Decision Tree:", round(accuracy_dtree_1*100,2),"%")
print("Naive Bayes:", round(accuracy_nb_1*100,2),"%")

In [None]:
fig = px.bar(x=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'], 
             y=[accuracy_logr_1, accuracy_dtree_1, accuracy_nb_1], 
            color=['Logistic Regression', 'Decision Tree model', 'Naive Bayes'],
             labels={'x': 'Model', 'y': 'Accuracy'},
            title='Validating new models with k-fold cross validation')
fig.show()

### Analysis:
- With the additional feature variables, the cross-validated accuracy of all models has increased, with logistic regression coming out on top and decision trees coming in as least favourable.

#### Test Data Output from trained models.2

In [None]:
# using the test data for our trained models:

print('Test data - Model Accuracy Outputs.2:')
print('-')
print('Logistic Regression:', round(logr_model_1.score(X_test_1, y_test_1)*100,2),'%')
print('Decision Tree:', round(dtree_model_1.score(X_test_1, y_test_1)*100,2),'%')
print('Naive Bayes:', round(nb_model_1.score(X_test_1, y_test_1)*100,2),'%')

## Summary of Model Accuracy:

#### Model (Popularity Classification) 
- With 'loudness', 'danceability', 'energy', 'valence' as independent variables due to highest correlation.

Trained Model:
- LR Accuracy: 56%
- DT Accuracy: 99%
- NB Accuracy: 55%

K-fold Cross Val:
- LR Accuracy: 56%
- DT Accuracy: 54%
- NB Accuracy: 55%

Test Data Result:
- LR Accuracy: 53%
- DT Accuracy: 53%
- NB Accuracy: 53%


#### Model (Popularity Classification) pt.2
- With all independent data variables included.

Trained Model.2:
- LR Accuracy: 85%
- DT Accuracy: 100%
- NB Accuracy: 77%

K-fold Cross Val.2:
- LR Accuracy: 69%
- DT Accuracy: 62%
- NB Accuracy: 63%

Test Data Result.2:
- LR Accuracy: 55%
- DT Accuracy: 50%
- NB Accuracy: 62%
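
One way to compare these figures side by side is to put them in a small table and compute the gap between training and test accuracy per model. The sketch below uses the Pt.2 percentages exactly as listed above; the column name `train_minus_test` is made up for this example.

```python
import pandas as pd

# Pt.2 accuracies (in %) as reported in the summary above.
reported = pd.DataFrame(
    {
        "train": [85, 100, 77],
        "cv": [69, 62, 63],
        "test": [55, 50, 62],
    },
    index=["LR", "DT", "NB"],
)

# A large gap between training and test accuracy points to overfitting.
reported["train_minus_test"] = reported["train"] - reported["test"]

print(reported)
print("Highest test accuracy:", reported["test"].idxmax())
```

Read this way, the Decision Tree shows the largest gap (50 points) and Naive Bayes the smallest (15 points).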

# 5 FINAL CONCLUSIONS

1. The accuracy is quite varied throughout the models used.
2. Overall, it seems the models perform better with more feature data than less. My assumption is that this is due to the lack of strong correlation between the individual audio features and what makes a song popular, as well as a shift in music trends over the years, which makes prediction more challenging. In retrospect I could have split the modelling over the year blocks created. This may produce a more accurate and relevant model.
3. In conclusion, two models performed best in general: **Logistic Regression** should be used with a larger dataset (+1000), where it reached 85% accuracy on the training data, and **Naive Bayes** should be used with a smaller dataset, where it reached 62% accuracy on the test data.