<table align="center" width=100%>
    <tr>
        <td width="15%">
            <img src="Spotify_1.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=24px>
                    <b>Spotify Popularity Prediction
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

Spotify is one of the leading music streaming platforms, with active users in 178 countries, 158 million Premium subscribers and 356 million monthly active users. Unlike physical or download sales, which pay artists a fixed price per song or album sold, Spotify pays royalties based on an artist's streams as a proportion of all songs streamed. The popularity rating is based on a track's total number of plays compared with other tracks, as well as how recent those plays are.

### Problem Statement:
**Being able to predict whether a song is going to be popular can benefit digital music platforms such as Spotify, Apple Music and Last.fm in many ways. They can learn which genres are popular among which age groups and which songs people like listening to these days, and recommend songs of similar genres to them. This can help applications like Spotify stay a step ahead of their competitors.**

**Here we will try to look into the features important for deciding the popularity ratings.** 

## Data Definition

**duration_ms**: The duration of the track in milliseconds.

**key**: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

**audio_mode**: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

**time_signature**: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

**acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

**danceability**: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

**energy**: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

**instrumentalness**: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks.

**loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

**speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

**audio_valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

**song_popularity**: The popularity rating of the song among the Spotify audience, on a scale from 0 to 100. This is the target variable.

**liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
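The encodings described in the definitions above can be decoded with a few small helpers. This is a minimal sketch: the function names (`key_name`, `mode_name`, `speechiness_band`, `tempo_from_beat_duration`) are illustrative and are not part of the dataset or of any Spotify API, and the thresholds are simply the ones quoted in the definitions.

```python
# Illustrative helpers (hypothetical names) for the encodings described above.

PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_name(key):
    """Map a pitch-class integer (0-11) to a note name; -1 means no key was detected."""
    if key == -1:
        return "no key detected"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    return PITCH_CLASSES[key]


def mode_name(audio_mode):
    """audio_mode is 1 for major and 0 for minor."""
    return "major" if audio_mode == 1 else "minor"


def speechiness_band(speechiness):
    """Bucket speechiness using the 0.33 and 0.66 thresholds from the definition."""
    if speechiness > 0.66:
        return "mostly spoken word"
    if speechiness >= 0.33:
        return "music and speech"
    return "mostly music"


def tempo_from_beat_duration(seconds_per_beat):
    """Tempo in BPM is 60 divided by the average beat duration in seconds."""
    return 60.0 / seconds_per_beat


print(key_name(0), key_name(1), key_name(7))
print(mode_name(1), speechiness_band(0.05), tempo_from_beat_duration(0.5))
```

For example, an average beat duration of half a second corresponds to a tempo of 120 BPM.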

## Table of Contents

1. **[Read Data](#Read_Data)**
2. **[Data Analysis and Preparation](#data_preparation)**
   - 2.1 - [Understand the Data](#Data_Understanding)
       - 2.1.1 - [Data Dimension](#Data_Shape)
       - 2.1.2 - [Data Types](#Data_types)
       - 2.1.3 - [Missing Values](#Data_missing)
       - 2.1.4 - [Checking for unique values](#duplicate)
       - 2.1.5 - [Class creation for Target variable](#class_creation)
   - 2.2 - [Outliers](#outliers)
   - 2.3 - [Variable Analysis](#variable_analysis)
       - 2.3.1 - [Categorical Variable Analysis](#cat_var_analysis)
       - 2.3.2 - [Basic Data Analysis](#b_d_a)
   - 2.3 - [Popularity Analysis](#p_a)
   - 2.4 - [Features Distribution](#f_d)
3. **[Feature Engineering](#f_e)**
4. **[Model building](#m_b)**
   - 4.1 - [Logistic Regression](#l_r)
   - 4.2 - [KNN Algorithm](#knn)
   - 4.3 - [Naive Bayes](#navie)
   - 4.4 - [Decision Tree Classifier](#dtc)
   - 4.5 - [Random Forest Classifier](#rfc)
5. **[Feature Importance](#feat_imp)**
6. **[Comparison Of Performance](#cop)**
7. **[Conclusion](#conclusion)**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import statsmodels
import statsmodels.api as sm

In [None]:
plt.rcParams['figure.figsize'] = [12, 9]

<a id='Read_Data'></a>
## 1. Read Data

In [None]:
import os
# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/19000-spotify-songs/song_data.csv')
df.head()

<a id='data_preparation'></a>
## 2. Data Analysis and Preparation

<a id='Data_Understanding'></a>
### 2.1 Understand the Dataset

<a id='Data_Shape'></a>
### 2.1.1 Data Dimension

In [None]:
df.shape

Dataset has 18835 rows and 15 columns

<a id='Data_types'></a>
### 2.1.2 Data Types

In [None]:
df.dtypes

Of the 15 columns, 14 are numeric (5 integer and 9 float) and 1 is categorical.

<a id='Data_missing'></a>
### 2.1.3 Missing Values

In [None]:
df.isnull().sum()

The dataset has no null values.

<a id='duplicate'></a>
### 2.1.4 Checking for unique values

In [None]:
for i in df.columns:
    print(f'{i}: {df[i].nunique()}')

- song_name : the counts above show that some song names are repeated
- audio_mode : 1 = Major, 0 = Minor
- acousticness : [0, 1], where 1 is high confidence that the track is acoustic (no electrical amplification)
- instrumentalness : [0, 1], where values close to 1 indicate that the track likely contains no vocals
- loudness : typically within [-60, 0] dB
- speechiness : [0, 1], where 1 = speech, 0 = music
- audio_valence : [0, 1], where 1 = happy, cheerful, 0 = sad
- tempo : BPM
- liveness : [0, 1], where a value above 0.8 provides strong likelihood that the track is live
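The ranges listed above can be turned into a quick sanity check. The sketch below is an assumption-laden illustration: `EXPECTED_RANGES` and `out_of_range` are hypothetical names, the example rows are made up, and real loudness values may occasionally fall slightly outside the typical range, so a flagged value is a prompt to look closer rather than proof of an error.

```python
# Documented ranges (inclusive) from the notes above; treat them as assumptions.
EXPECTED_RANGES = {
    "acousticness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "audio_valence": (0.0, 1.0),
    "loudness": (-60.0, 0.0),
    "audio_mode": (0, 1),
}


def out_of_range(row):
    """Return the names of the features in `row` that fall outside their documented range."""
    problems = []
    for name, (low, high) in EXPECTED_RANGES.items():
        if name in row and not (low <= row[name] <= high):
            problems.append(name)
    return problems


# Hypothetical rows, made up for illustration (not taken from the dataset).
typical_row = {"acousticness": 0.12, "speechiness": 0.04, "audio_valence": 0.6,
               "loudness": -5.3, "audio_mode": 1}
suspect_row = {"acousticness": 1.4, "speechiness": 0.04, "audio_valence": 0.6,
               "loudness": 2.0, "audio_mode": 1}

print(out_of_range(typical_row))
print(out_of_range(suspect_row))
```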

<a id='class_creation'></a>
### 2.1.5 Class creation for Target variable

In [None]:
df['song_popularity'].describe()

In [None]:
# Histogram with a KDE overlay, normalised to a density
sns.histplot(df['song_popularity'], kde = True, stat = 'density')

plt.axvline(df['song_popularity'].mean(), linewidth = 2, color = 'r')
plt.axvline(df['song_popularity'].median(), linewidth = 2, color = 'k')
plt.xticks(ticks = np.arange(0, 100, 5))

plt.show()

In [None]:
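def func(x):
    # Label a song as popular (1) only when its rating is above 66.5.
    # Everything else, including a rating of exactly 0, is not popular (0).
    return 1 if x > 66.5 else 0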

In [None]:
df['song_popularity'] = df['song_popularity'].apply(func)

In [None]:
df['song_popularity'].value_counts().sort_index()

We checked the popularity ratings of songs that have been popular on Spotify over the last 10 years and took their mean value (66.5). Songs rated above this value could stay on the top lists for a long time. If song_popularity is higher than 66.5 (about 30% of the data) we label it "1", otherwise we label it "0". So "1" marks the popular songs and "0" the unpopular ones.
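The labeling rule described above can be sketched in plain Python, together with the share of songs that end up in the popular class. The scores below are made up for illustration, and `POPULARITY_THRESHOLD` and `label_popularity` are hypothetical names rather than names from the notebook.

```python
from collections import Counter

POPULARITY_THRESHOLD = 66.5  # cutoff quoted in the text above


def label_popularity(score):
    """Return 1 for a popular song (rating above the cutoff), otherwise 0."""
    return 1 if score > POPULARITY_THRESHOLD else 0


# Made-up ratings on the 0-100 scale, not taken from the dataset.
scores = [0, 35, 52, 66, 67, 70, 88, 41, 59, 73]
labels = [label_popularity(s) for s in scores]

counts = Counter(labels)
share_popular = counts[1] / len(labels)

print(labels)
print(counts[0], counts[1], share_popular)
```

Checking the class shares this way is also the first step in deciding whether the target is imbalanced.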

### Checking for imbalanced data

In [None]:
print(f'% for Not popular songs: {(df["song_popularity"].value_counts()[0]/len(df["song_popularity"]))*100}')

In [None]:
print(f'% for Popular songs: {(df["song_popularity"].value_counts()[1]/len(df["song_popularity"]))*100}')

**Our data is not severely imbalanced.**

<a id='outliers'></a>
### 2.2 Outliers Detection

In [None]:
df.boxplot()
plt.xticks(rotation = 90)

plt.tight_layout()

In [None]:
# Checking outliers after removing song_duration_ms

In [None]:
df.drop('song_duration_ms', axis = 1).boxplot()
plt.xticks(rotation = 90)

plt.tight_layout()

In [None]:
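# Quartiles of the numeric columns only (song_name is text)
Q1 = df.quantile(0.25, numeric_only = True)
Q3 = df.quantile(0.75, numeric_only = True)

IQR = Q3 - Q1
IQR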

In [None]:
Q1 = df['tempo'].quantile(0.25)
Q3 = df['tempo'].quantile(0.75)

IQR = Q3 - Q1
IQR

In [None]:
df1 = df.copy()

In [None]:
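# Keep tempo values inside the 1.5 * IQR fences; values outside become NaN
# and the affected rows are dropped in the next step
df['tempo'] = df['tempo'][~((df['tempo'] < (Q1 - 1.5 * IQR)) | (df['tempo'] > (Q3 + 1.5 * IQR)))]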

In [None]:
df.shape

In [None]:
# Checking outliers after removing song_duration_ms

df.drop('song_duration_ms', axis = 1).boxplot()
plt.xticks(rotation = 90)

plt.tight_layout()

#### Outliers in tempo have been replaced with NaN.

### Removing NaN values

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

After treating the outliers, the rows with NaN values in the tempo column are removed.
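The outlier treatment above follows the usual 1.5 × IQR rule. The sketch below reproduces that rule in plain Python on made-up tempo values; `quantile` and `iqr_bounds` are hypothetical helpers, and unlike the notebook (which sets outliers to NaN and drops them afterwards) this version simply filters the list.

```python
import math


def quantile(values, q):
    """Linearly interpolated quantile, the convention pandas uses by default."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = quantile(values, 0.25)
    q3 = quantile(values, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up tempo values in BPM, not taken from the dataset.
tempos = [60, 90, 100, 110, 120, 130, 240]

low, high = iqr_bounds(tempos)
kept = [t for t in tempos if low <= t <= high]

print(low, high)
print(kept)
```

With these values Q1 is 95 and Q3 is 125, so the fences are 50 and 170 and only the 240 BPM entry is treated as an outlier.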

<a id='variable_analysis'></a>
### 2.3 Variable Analysis

<a id='cat_var_analysis'></a>
### 2.3.1 Categorical Variable Analysis

In [None]:
sns.set_theme(style = 'darkgrid',)

sns.countplot(x = df['key'], palette = "pastel")

plt.title('Countplot of Key variable', fontsize = 18)
plt.xlabel('key', fontsize = 15)
plt.ylabel('count', fontsize = 15)

plt.grid(True)

plt.show()

**• As we can see from the graph, keys 0, 7 and 1 are used in most of the songs.**

**• On the other hand, key 3 is the least used.**

**• Keys 4, 6, 8 and 10 are used with roughly the same frequency.**

In [None]:
df['audio_mode'].value_counts()

In [None]:
sns.set_theme(style = 'darkgrid')

sns.countplot(x = df['audio_mode'], palette = "Set2")

plt.text(x = -0.05, y = 5490, s = str(round(df['audio_mode'].value_counts()[0]/len(df['audio_mode'])*100, 2))+'%')
plt.text(x = 0.95, y = 9399, s = str(round(df['audio_mode'].value_counts()[1]/len(df['audio_mode'])*100, 2))+'%')

plt.title('Countplot audio_mode variable', fontsize = 18)
plt.xlabel('audio_mode', fontsize = 15)
plt.ylabel('count', fontsize = 15)

plt.grid(True)

plt.show()

**• Minor (0) accounts for 37.19% of the songs, while Major (1) accounts for 62.81%.**

In [None]:
sns.set_theme(style = 'darkgrid')

sns.countplot(x = df['time_signature'], palette = "pastel")

plt.title('Countplot time_signature variable', fontsize = 18)
plt.xlabel('time signature', fontsize = 15)
plt.ylabel('count', fontsize = 15)

plt.grid(True)

plt.show()

**• A time signature of 4 (four beats per bar) is the most common in this dataset.**


<a id='b_d_a'></a>
### 2.3.2 Basic Data Analysis

In [None]:
# Median audio_valence for each key, split by popularity class
pd.crosstab(columns = df['song_popularity'], index = df['key'], values = df['audio_valence'], aggfunc = 'median')

**• For popularity 0, the median audio_valence per key stays within [0.52, 0.58].**

**• For popularity 1, the range is [0.46, 0.56].**

**• Artists are not afraid to express sad feelings, and people like those songs.**

In [None]:
df.groupby(['time_signature']).count()['song_popularity']

**• time_signature gives the number of beats in each bar, and 4 is the most used.**

In [None]:
pd.crosstab(index = df['audio_mode'], columns = df['song_popularity'])

**• audio_mode 1 is the most preferred in both popularity classes.**

In [None]:
# Mean danceability for each time signature, split by popularity class
pd.crosstab(index = df['time_signature'], columns = df['song_popularity'], values = df['danceability'], aggfunc = 'mean')

**•danceability is highest at time_signature 4.**

<a id='p_a'></a>
### 2.3 Popularity analysis w.r.t. categorical variables

In [None]:
sns.set_theme(style = 'darkgrid')

# Bar height is the mean of the 0/1 label, i.e. the share of popular songs per key
sns.catplot(y = "song_popularity", x = "key", data = df, kind = "bar", height = 8)

plt.xlabel("key", fontsize = 15)
plt.ylabel("popularity_probability", fontsize = 15)

plt.grid(True)
plt.show()

The graph above shows the probability of each key being popular. Key 1 has the highest probability of being popular and key 3 the lowest.
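The "probability" read off the bar plot is the mean of the 0/1 popularity label within each key, which is the share of popular songs for that key. A minimal sketch of that calculation on made-up data follows; `popularity_rate_by_group` is a hypothetical helper name.

```python
from collections import defaultdict


def popularity_rate_by_group(groups, labels):
    """Return the share of popular songs (mean of the 0/1 label) for each group."""
    totals = defaultdict(int)
    populars = defaultdict(int)
    for group, label in zip(groups, labels):
        totals[group] += 1
        populars[group] += label
    return {group: populars[group] / totals[group] for group in totals}


# Made-up keys and popularity labels, not taken from the dataset.
keys = [0, 0, 1, 1, 1, 3, 3, 3, 3]
labels = [0, 1, 1, 1, 0, 0, 0, 1, 0]

rates = popularity_rate_by_group(keys, labels)
for key, rate in sorted(rates.items()):
    print(key, round(rate, 2))
```

Because each rate is estimated from the songs in that group only, rates for rarely used keys or time signatures rest on few songs and should be read with caution.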

In [None]:
sns.set_theme(style = 'darkgrid')

sns.catplot(x = "time_signature", y = "song_popularity", data = df, kind = "bar", height = 8)

plt.xlabel("time_signature", fontsize = 15)
plt.ylabel("Popularity Probability", fontsize = 15)

plt.grid(True)
plt.show()

The plot of time_signature against popularity shows that a time signature of 5 has a higher probability of being popular than the other time signatures.

In [None]:
g = sns.FacetGrid(data = df, col='song_popularity', height = 5)
g.map(sns.scatterplot, 'key', 'time_signature')

plt.tight_layout()
plt.show()

Neither popularity class uses time_signature 2, irrespective of key. The same holds for time_signature 0, except for key 0. Key 2 also has no highly popular songs with time_signature 1, and the same goes for keys 4, 9 and 11.


In [None]:
f = sns.FacetGrid(data = df, col = 'audio_mode', height = 5)
f.map(sns.barplot, 'song_popularity', 'acousticness')

plt.tight_layout()
plt.show()

Acousticness is higher for the less popular songs in both audio modes.

Since acousticness measures the absence of electrical amplification, this suggests that listeners may prefer amplified or electronically produced tracks over acoustic ones.

In [None]:
g = sns.FacetGrid(data = df, row = "audio_mode", col = "song_popularity", height = 4)
g.map(sns.barplot, "key", "instrumentalness")

g.add_legend()
plt.show()

Less popular songs have higher instrumentalness than popular songs for audio_mode = 0, and the same holds for audio_mode = 1.

This suggests that people prefer songs with vocals over instrumental tracks.

In [None]:
q = sns.FacetGrid(data = df, col = 'song_popularity')
q.map(sns.histplot, 'loudness')

q.add_legend()
plt.show()

The distribution of loudness by song_popularity shows that loudness values are lower for popular songs than for less popular songs.

This suggests that people prefer songs with less loudness.

In [None]:
q = sns.FacetGrid(data = df, col = 'song_popularity')
q.map(sns.histplot, 'audio_valence')

q.add_legend()
plt.show()

<a id='f_d'></a>
### 2.4 Features Distribution

In [None]:
f, axes = plt.subplots(3, 5, figsize = (12, 12))

# Features to plot, in row-major order across the 3 x 5 grid
features = ['song_popularity', 'song_duration_ms', 'acousticness', 'danceability', 'energy',
            'instrumentalness', 'key', 'liveness', 'loudness', 'audio_mode',
            'speechiness', 'time_signature', 'audio_valence', 'tempo']

# One histogram with a KDE overlay per feature
for ax, feature in zip(axes.flat, features):
    sns.histplot(df[feature], color = 'teal', ax = ax, kde = True)

# Remove the one axis that is left unused
f.delaxes(axes[2][4])
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix of the numeric columns (song_name is text and is excluded)
sns.heatmap(df.corr(numeric_only = True), annot = True)

The correlation matrix above shows strong correlation between a few variables, whereas Naive Bayes assumes that the variables are independent of each other.
This violated assumption is a likely reason for the lower accuracy of Naive Bayes.
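One way to make the observation above concrete is to list the feature pairs whose absolute Pearson correlation exceeds a chosen threshold. The sketch below does this in plain Python on made-up values; the 0.6 threshold and the names `pearson` and `correlated_pairs` are assumptions for illustration, and the toy numbers say nothing about which features are actually correlated in the dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.6):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Made-up feature values, not taken from the dataset.
features = {
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-14.0, -10.0, -7.0, -5.0, -4.0],
    "tempo": [120.0, 95.0, 130.0, 100.0, 118.0],
}

pairs = correlated_pairs(features)
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting a model that assumes independent features.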

<a id='f_e'></a>
## 3. Feature Engineering

In [None]:
df['key'] = df['key'].astype('category')
df['audio_mode'] = df['audio_mode'].astype('category')
df['time_signature'] = df['time_signature'].astype('category')

In [None]:
df = pd.get_dummies(df, columns=['key'])
df = pd.get_dummies(df, columns=['audio_mode'])
df = pd.get_dummies(df, columns=['time_signature'])

In [None]:
df.head()

In [None]:
def change_type(var):
    df[var] = df[var].astype(int)

In [None]:
# Collect every dummy column created above instead of hard-coding the names,
# so the loop still works if a category is absent from the data
column = [c for c in df.columns if c.startswith(('key_', 'audio_mode_', 'time_signature_'))]
for i in column:
    change_type(i)

In [None]:
df.info()

In [None]:
# Numeric predictors only; the target is dropped without modifying a slice of df in place
df_num = df.select_dtypes(include = np.number).drop(columns = 'song_popularity')

In [None]:
from sklearn.preprocessing import MinMaxScaler

df_target = df['song_popularity']

mm = MinMaxScaler()
df_mm = mm.fit_transform(df_num)

# Keep the original row index so that X stays aligned with df_target
X = pd.DataFrame(df_mm, columns = df_num.columns, index = df_num.index)
X.head()

In [None]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, df_target, test_size=0.2, random_state=100)

In [None]:
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)

<a id='m_b'></a>
## 4. Model building

In [None]:
import statsmodels.api as sm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from warnings import filterwarnings
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
filterwarnings('ignore')

In [None]:
def get_test_report(model):
    
    y_pred = model.predict(X_test)
    
    return(classification_report(y_test, y_pred))

In [None]:
def get_train_report(model):
    
    train_pred = model.predict(X_train)

    return(classification_report(y_train, train_pred))

In [None]:
def con_matrix(model):
    
    y_pred = model.predict(X_test)
    
    con = confusion_matrix(y_test, y_pred)
    
    c = pd.DataFrame(con, columns = ['Predicted:0', 'Predicted:1'], index = ['Actual:0', 'Actual:1'])
    
    sns.heatmap(c, annot = True)

In [None]:
from sklearn.metrics import roc_auc_score

def plot_roc(model):
    
    y_pred_prob = model.predict_proba(X_test)[:,1]
    
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

    plt.plot(fpr, tpr)

    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])

    plt.plot([0, 1], [0, 1],'r--')

    plt.title('ROC curve for Popular songs prediction', fontsize = 15)
    plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
    plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

    # Show the AUC as a formatted string rather than a tuple
    plt.text(x = 0.02, y = 0.9, s = f'AUC Score: {round(roc_auc_score(y_test, y_pred_prob), 4)}')

    plt.grid(True)

<a id='l_r'></a>
### 4.1 Logistic Regression

In [None]:
y_train = list(y_train)

In [None]:
logreg = sm.Logit(y_train, X_train).fit()
print(logreg.summary())

In [None]:
df_odds = pd.DataFrame(np.exp(logreg.params), columns= ['Odds']) 
df_odds

**Do predictions on the test set.**

In [None]:
y_pred_prob = logreg.predict(X_test)
y_pred_prob.head()

In [None]:
y_pred = [ 0 if x < 0.5 else 1 for x in y_pred_prob]

In [None]:
y_pred[0:5]

In [None]:
l = metrics.accuracy_score(y_test, y_pred)
print(l)

In [None]:
print(classification_report(y_test, y_pred))

#### Plot the confusion matrix.

In [None]:
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cbar = False, 
            linewidths = 0.1, annot_kws = {'size':25})
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.show()

In [None]:
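score_card = pd.DataFrame(columns=['Probability Cutoff', 'AUC Score', 'Precision Score', 'Recall Score','Accuracy Score', 'Kappa Score', 'f1-score'])

def update_score_card(model, cutoff):
    # Predicted probabilities from the model passed in
    # (a fitted statsmodels Logit returns probabilities from predict)
    y_pred_prob = model.predict(X_test)
    
    # Convert the probabilities to class labels using the cutoff
    y_pred = [0 if x < cutoff else 1 for x in y_pred_prob]
    
    global score_card
    
    # AUC does not depend on the cutoff, so it is computed from the probabilities
    new_row = pd.DataFrame([{'Probability Cutoff': cutoff,
                             'AUC Score' : metrics.roc_auc_score(y_test, y_pred_prob),
                             'Precision Score': metrics.precision_score(y_test, y_pred),
                             'Recall Score': metrics.recall_score(y_test, y_pred),
                             'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                             'Kappa Score':metrics.cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred)}])
    
    # pd.concat is used because DataFrame.append is not available in pandas 2.x
    score_card = pd.concat([score_card, new_row], ignore_index = True)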

In [None]:
update_score_card(logreg, 0.5)

In [None]:
score_card = score_card.sort_values('Probability Cutoff').reset_index(drop = True)
score_card

**Roc curve**

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.plot([0, 1], [0, 1],'r--')
plt.title('ROC curve for Popularity prediction', fontsize = 15)
plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

# Show the AUC as a formatted string rather than a tuple
plt.text(x = 0.02, y = 0.9, s = f'AUC Score: {round(metrics.roc_auc_score(y_test, y_pred_prob), 4)}')
                               
plt.grid(True)

### GridSearchCV

In [None]:
param_grid = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}
# The liblinear solver supports both the l1 and l2 penalties in the grid
logreg = LogisticRegression(solver = 'liblinear')
logreg_cv = GridSearchCV(logreg,param_grid,cv=3)
logreg_cv.fit(X_train,y_train)

In [None]:
logit_accuracy= (logreg_cv.best_score_)

<a id='knn'></a>
### 4.2 KNN Algorithm

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_class = KNeighborsClassifier(n_neighbors = 3)

knn_model = knn_class.fit(X_train, y_train)

In [None]:
con_matrix(knn_model)

In [None]:
train_report = get_train_report(knn_model)

print(train_report)

In [None]:
test_report = get_test_report(knn_model)

print(test_report)

Why is accuracy on the training data sometimes lower than accuracy on the test data?

The most likely culprit is the train/test split percentage. If 99% of the data is used for training and only 1% for testing, the test set is so small that its accuracy is a noisy estimate and can exceed the training accuracy purely by chance.
A more balanced split, or cross-validation, gives a more reliable estimate of how the model performs.
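The effect of the test set size on the reliability of an accuracy estimate can be sketched with the binomial standard error, sqrt(p(1 - p) / n). This is a rough illustration under the assumption that test predictions are independent; the accuracy of 0.75 and the sample sizes are made up, and `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples test rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Made-up accuracy and test set sizes, for illustration only.
for n in (50, 500, 5000):
    se = accuracy_standard_error(0.75, n)
    print(n, round(se, 4))
```

The standard error shrinks with the square root of the test set size, so a tiny test set can easily report an accuracy several points above or below the true value.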

In [None]:
plot_roc(knn_model)

### Feature Selection

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

##### k_features = (1, 10) with backward selection

In [None]:
sfs_knn = sfs(knn_model, k_features = (1, 10), forward = False, cv = 10, 
              scoring = 'accuracy',
             n_jobs = -1)

selected_knn1 = sfs_knn.fit(X_train, y_train)

In [None]:
# Generate the new subsets based on the selected features
X_train_sfs = selected_knn1.transform(X_train)
X_test_sfs = selected_knn1.transform(X_test)

# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn_model.fit(X_train_sfs, y_train)
y_pred = knn_model.predict(X_test_sfs)

# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))

### GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score 

In [None]:
# Metric names must be lowercase to be recognised by scikit-learn
tuned_paramaters = {'n_neighbors': np.arange(1, 50, 3),'metric': ['hamming','euclidean','manhattan','chebyshev']}


knn_classification = KNeighborsClassifier()

knn_grid = GridSearchCV(estimator = knn_classification, param_grid = tuned_paramaters, cv = 10, scoring = 'accuracy')


knn_grid.fit(X_train, y_train)

In [None]:
print('Best parameters for KNN Classifier: ', knn_grid.best_params_, '\n')

In [None]:
test_report = get_test_report(knn_grid)

print(test_report)

In [None]:
knn_accuracy=knn_grid.score(X_test, y_test)
knn_accuracy

In [None]:
error_rate = []

for i in np.arange(1, 50, 2):
    
     
    knn = KNeighborsClassifier(i, metric = 'euclidean')
   
    score = cross_val_score(knn, X_train, y_train, cv = 10)
    
    score = score.mean()
    
    error_rate.append(1 - score)

plt.plot(range(1, 50, 2), error_rate)

plt.title('Error Rate', fontsize = 15)
plt.xlabel('K', fontsize = 15)
plt.ylabel('Error Rate', fontsize = 15)

plt.xticks(np.arange(1, 50, step = 2))

plt.axvline(x = 1, color = 'red')

plt.show()

In [None]:
plot_roc(knn_grid)

<a id='navie'></a>
### 4.3 Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb_model = gnb.fit(X_train, y_train)

In [None]:
con_matrix(gnb_model)

In [None]:
print(get_train_report(gnb_model))

In [None]:
print(get_test_report(gnb_model))

In [None]:
naive_accuracy=gnb_model.score(X_test,y_test)

<a id='dtc'></a>
### 4.4 Decision Tree Classifier

In [None]:
from sklearn.metrics import accuracy_score,recall_score,precision_score,confusion_matrix,f1_score
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(criterion = 'gini',
                                  max_depth = 5,
                                  min_samples_split = 4,
                                  max_leaf_nodes = 6,
                                  random_state = 10)

decision_tree = dt_model.fit(X_train, y_train)

In [None]:
train_report = get_train_report(decision_tree)

print('Train data:\n', train_report)

In [None]:
test_report = get_test_report(decision_tree)

print('Test data:\n', test_report)

### Feature Selection 

In [None]:
sfs_dt = sfs(decision_tree, k_features = (1, 10), forward = True, cv = 10, 
              scoring = 'accuracy',
             n_jobs = -1)

selected_dt = sfs_dt.fit(X_train, y_train)

In [None]:
selected_dt.k_feature_names_

In [None]:
# Generate the new subsets based on the selected features
X_train_sfs = selected_dt.transform(X_train)
X_test_sfs = selected_dt.transform(X_test)

# Fit the estimator using the new feature subset
# and make a prediction on the test data
decision_tree.fit(X_train_sfs, y_train)
y_pred = decision_tree.predict(X_test_sfs)

# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))

### Grid Search CV

In [None]:
tuned_paramaters = [{'criterion': ['entropy', 'gini'], 
                     'max_depth': range(2, 10),
                     'max_features': ["sqrt", "log2"],
                     'min_samples_split': range(2,10),
                     'min_samples_leaf': range(1,10),
                     # max_leaf_nodes must be at least 2
                     'max_leaf_nodes': range(2, 10)}]

decision_tree_classification = DecisionTreeClassifier(random_state = 10)

tree_grid = GridSearchCV(estimator = decision_tree_classification, 
                         param_grid = tuned_paramaters, 
                         cv = 10)

In [None]:
tree_grid_model = tree_grid.fit(X_train, y_train)

In [None]:
print('Best parameters for decision tree classifier: ', tree_grid_model.best_params_, '\n')

In [None]:
dt_model = DecisionTreeClassifier(criterion = tree_grid_model.best_params_.get('criterion'),
                                  max_depth = tree_grid_model.best_params_.get('max_depth'),
                                  max_features = tree_grid_model.best_params_.get('max_features'),
                                  max_leaf_nodes = tree_grid_model.best_params_.get('max_leaf_nodes'),
                                  min_samples_leaf = tree_grid_model.best_params_.get('min_samples_leaf'),
                                  min_samples_split = tree_grid_model.best_params_.get('min_samples_split'),
                                  random_state = 10)

dt_model = dt_model.fit(X_train, y_train)

In [None]:
print('Classification Report for test set: \n', get_test_report(dt_model))

### Randomized Search

In [None]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"max_depth": [3, None],"max_features": randint(1, 9),"min_samples_leaf": randint(1, 9),"criterion": ["gini", "entropy"]}

tree = DecisionTreeClassifier()

tree_cv = RandomizedSearchCV(tree, param_dist, cv = 10)

In [None]:
tree_cv.fit(X_train,y_train)

In [None]:
DT_prob=tree_cv.predict_proba(X_test)

In [None]:
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print('\n')
print("Best score is {}".format(tree_cv.best_score_))

In [None]:
DT_accuracy=tree_cv.best_score_

<a id='rfc'></a>
### 4.5 Random Forest Classifier

In [None]:
rf_classification = RandomForestClassifier(n_estimators = 10, random_state = 10)

rf_model = rf_classification.fit(X_train, y_train)

In [None]:
train_report = get_train_report(rf_model)

print(train_report)

In [None]:
test_report = get_test_report(rf_model)

print(test_report)

In [None]:
# Possibility of overfitting here.

### Feature Selection

In [None]:
sfs_rf = sfs(rf_model, k_features = (1, 10), forward = True, cv = 10, 
              scoring = 'accuracy',
             n_jobs = -1)

selected_rf = sfs_rf.fit(X_train, y_train)

In [None]:
selected_rf.k_feature_names_

In [None]:
X_train_sfs = selected_rf.transform(X_train)
X_test_sfs = selected_rf.transform(X_test)


rf_model.fit(X_train_sfs, y_train)
y_pred = rf_model.predict(X_test_sfs)

acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))

### Grid Search Cv

In [None]:
tuned_paramaters = [{'criterion': ['entropy', 'gini'],
                     'n_estimators': [10, 30, 50, 70, 90],
                     'max_depth': [10, 15, 20],
                     'max_features': ['sqrt', 'log2'],
                     'min_samples_split': [2, 5, 8, 11],
                     'min_samples_leaf': [1, 5, 9],
                     'max_leaf_nodes': [2, 5, 8, 11]}]
 

random_forest_classification = RandomForestClassifier(random_state = 10)

rf_grid = GridSearchCV(estimator = random_forest_classification, 
                       param_grid = tuned_paramaters, 
                       cv = 10)

rf_grid_model = rf_grid.fit(X_train, y_train)

In [None]:
print('Best parameters for random forest classifier: ', rf_grid_model.best_params_, '\n')

In [None]:
# Rebuild the forest with every tuned parameter, including n_estimators
rf_model = RandomForestClassifier(criterion = rf_grid_model.best_params_.get('criterion'),
                                  n_estimators = rf_grid_model.best_params_.get('n_estimators'),
                                  max_depth = rf_grid_model.best_params_.get('max_depth'),
                                  max_features = rf_grid_model.best_params_.get('max_features'),
                                  max_leaf_nodes = rf_grid_model.best_params_.get('max_leaf_nodes'),
                                  min_samples_leaf = rf_grid_model.best_params_.get('min_samples_leaf'),
                                  min_samples_split = rf_grid_model.best_params_.get('min_samples_split'),
                                  random_state = 10)

rf_model = rf_model.fit(X_train, y_train)

In [None]:
print('Classification Report for test set: \n', get_test_report(rf_model))

### Randomized Search CV

In [None]:
from scipy.stats import randint

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

forest = RandomForestClassifier()

forest_cv = RandomizedSearchCV(forest, param_dist, cv = 10)

In [None]:
forest_cv.fit(X_train, y_train)

In [None]:
print("Tuned Random Forest Parameters: {}".format(forest_cv.best_params_))
print('\n')
print("Best score is {}".format(forest_cv.best_score_))

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from matplotlib.colors import ListedColormap
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
ada_model = AdaBoostClassifier(n_estimators = 40, random_state = 10)
ada_model.fit(X_train, y_train)

In [None]:
def plot_confusion_matrix(model):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])
    sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = ListedColormap(['lightskyblue']), cbar = False, 
                linewidths = 0.1, annot_kws = {'size':25})
    plt.xticks(fontsize = 20)
    plt.yticks(fontsize = 20)
    plt.show()

In [None]:
plot_confusion_matrix(ada_model)

In [None]:
#calculating for test data

In [None]:
test_report = get_test_report(ada_model)
print(test_report)

#### Plot the ROC curve.

In [None]:
plot_roc(ada_model)

In [None]:
RF_accuracy= forest_cv.best_score_

### Gradient Boosting

#### Build a gradient boosting model on a training dataset.

In [None]:
gboost_model = GradientBoostingClassifier(n_estimators = 150, max_depth = 10, random_state = 10)
gboost_model.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(gboost_model)

### calculating for test data

In [None]:
test_report = get_test_report(gboost_model)
print(test_report)

In [None]:
plot_roc(gboost_model)

### XGBoost

In [None]:
xgb_model = XGBClassifier(max_depth = 10, gamma = 1)
xgb_model.fit(X_train, y_train)


In [None]:
plot_confusion_matrix(xgb_model)

In [None]:
test_report = get_test_report(xgb_model)
print(test_report)

In [None]:
plot_roc(xgb_model)

#### Tune the Hyperparameters (GridSearchCV)

In [None]:
tuning_parameters = {'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],'max_depth': range(3,10),'gamma': [0, 1, 2, 3, 4]}

xgb_model = XGBClassifier()
xgb_grid = GridSearchCV(estimator = xgb_model, param_grid = tuning_parameters, cv = 3, scoring = 'roc_auc')
xgb_grid.fit(X_train, y_train)

print('Best parameters for XGBoost classifier: ', xgb_grid.best_params_, '\n')

In [None]:
xgb_grid_model = XGBClassifier(learning_rate = xgb_grid.best_params_.get('learning_rate'),
                               max_depth = xgb_grid.best_params_.get('max_depth'),
                              gamma = xgb_grid.best_params_.get('gamma'))

xgb_model = xgb_grid_model.fit(X_train, y_train)

print('Classification Report for test set:\n', get_test_report(xgb_model))

In [None]:
plot_roc(xgb_model)

### Identify the Important Features using XGBoost

In [None]:
important_features = pd.DataFrame({'Features': X_train.columns, 
                                   'Importance': xgb_model.feature_importances_})

important_features = important_features.sort_values('Importance', ascending = False)

sns.barplot(x = 'Importance', y = 'Features', data = important_features)

plt.title('Feature Importance', fontsize = 15)
plt.xlabel('Importance', fontsize = 15)
plt.ylabel('Features', fontsize = 15)

plt.show()

In [None]:
# Accuracy of each tuned model, collected in one table for comparison
model_performances=pd.DataFrame({'Model':['RandomForest','DecisionTreeClassifier','K-NearestNeighbors','LogisticRegression','NaiveBayes'],
                                 'Accuracy':[RF_accuracy,DT_accuracy,knn_accuracy,logit_accuracy,naive_accuracy]})
model_performances