# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Set-up](#2.-Set-up)
    * [2.1. Installs and imports of libraries](#2.1.-Installs-and-imports-of-libraries)
    * [2.2. Import dataset](#2.2.-Import-dataset)
    * [2.3. Clean the data](#2.3.-Clean-the-data)
* [3. Helper Function and Key Variables](#3.-Helper-Function-and-Key-Variables)
    * [3.1. Add Dependent Variable](#3.1.-Add-Dependent-Variable)
    * [3.2. Unique Artist Names, etc, etc,](#3.2.-Unique-Artist-Names,-etc,-etc,)
    * [3.3. Connection to **`Spotify API`**](#3.3.-Connection-to-Spotify-API)
* [4. Exploratory Data Analysis](#4.-Exploratory-Data-Analysis)
    * [4.1. Initial Overview of Data Load](#4.1.-INITIAL-OVERVIEW-OF-DATA-LOAD)
    * [4.2 Age, Stream Length, Hour, Weekday Distribution](#4.2-AGE,-STREAM-LENGTH,-HOUR,-WEEKDAY-DISTRIBUTION)
    * [4.3 User-based Features Distribution](#4.3-USER-BASED-FEATURES-DISTRIBUTION)
    * [4.4 Time Series Plot](#4.4-TIME-SERIES-PLOT)
    * [4.5 Artists and Playlists Plots](#4.5-ARTISTS-AND-PLAYLISTS-PLOTS)
    * [4.6 Key Playlists Visualizations](#4.6-KEY-PLAYLISTS-VISUALIZATIONS)
* [5. Feature Engineering](#5.-Feature-Engineering)
    * [5.1 Artist Features](#5.1-Artist-Features)
    * [5.2 Playlist Features](#5.2-Playlist-Features)
    * [5.3 User Features](#5.3-User-Features)
    * [5.4 Merge the features into the analytics-ready DataFrame](#5.4-Merge-the-features-into-the-analytics-ready-DataFrame)
* [6. Data Preparation](#6.-Data-Preparation)
    * [6.1 Dealing with Missing Data](#6.1-Deal-with-Missing-Data)
    * [6.2 Feature Correlations](#6.2-Feature-Correlations)  
    * [6.3 Class Imbalance](#6.3-Class-Imbalance)
    * [6.4 Train-Test Split](#6.4-Train-Test-Split)
    * [6.5 Feature Scaling](#6.5-Feature-Scaling)
* [7. Model Generation (Classical)](#7.-Model-Generation-(Classical))
    * [7.1 Class Imbalance Problem](#7.1-Class-Imbalance-Problem)
    * [7.2 Logistic Regression](#7.2-Logistic-Regression)
    * [7.3 Naive Bayes](#7.3-Naive-Bayes)
    * [7.4 Support Vector Machine](#7.4-Support-Vector-Machine)
    * [7.5 Decision Tree](#7.5-Decision-Tree)  
    * [7.6 Ensemble Learning](#7.6-Ensemble-Learning)
        * [7.6.1 Random Forest](#7.6.1-Random-Forest)
        * [7.6.2 Adaboost](#7.6.2-Adaboost)
        * [7.6.3 Soft Voting Ensemble Model](#7.6.3-Soft-Voting-Ensemble-Model)
* [8. Extra Models (Deep Learning)](#8.-Extra-Models-(Deep-Learning))
    * [8.1 Neural Network](#8.1-Neural-Network)
    * [8.2 Knowledge Transfer](#8.2-Knowledge-Transfer)
* [9. Model Evaluation and Selection](#9.-Model-Evaluation-and-Selection)
    * [9.1 Confusion Matrix, Accuracy, Recall, etc](#9.1-Confusion-Matrix,-Accuracy,-Recall,-etc)
    * [9.2 ROC Curve](#9.2-ROC-Curve)
    * [9.3 Feature Importance](#9.3-Feature-Importance)
* [10. Conclusions and Recommendations](#10.-Conclusions-and-Recommendations)

# 1. Introduction

# 2. Set up
## 2.1. Installs and imports of libraries

In [1]:
# Import all required libraries
import random 
import pandas as pd
import numpy as np
import seaborn as sns
!pip install missingno
import missingno as msno

# To make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
!pip install matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Import custom functions from library, named 'spotfunc'
import spotfunc as spotfunc_v2

# Ignore useless warnings 
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# Various other things needed for data preparation
from sklearn.preprocessing import scale
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
from scipy.stats import norm

# Imports for Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

# For the Spotify API
!pip install spotipy --upgrade
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


Requirement already up-to-date: spotipy in d:\anaconda\lib\site-packages (2.10.0)


## 2.2. Import dataset

In [2]:
%%time
# Read in sampled data
data = pd.read_csv('cleaned_data.csv', low_memory=False)
print('rows data:',len(data))

# Keep a copy of original data in case of changes made to dataframe
all_artists = data.copy()

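# Load playlist data, silently skipping malformed rows
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False, warn_bad_lines=False)
playlist_ids_and_titles = pd.read_csv('playlists_ids_and_titles.csv', encoding='latin-1', on_bad_lines='skip')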

# Drop duplicates
playlist_mapper = playlist_ids_and_titles.drop_duplicates(['id'])


ParserError: Error tokenizing data. C error: out of memory

## 2.3. Clean the data

In [3]:
# DATA CLEANING
# Convert artist and track names to lower case
data['artist_name'] = data['artist_name'].astype(str).str.lower()
data['track_name'] = data['track_name'].astype(str).str.lower()

# Age is more intuitive to interpret than birth year, so we convert it using 2016 as the reference year
data["user_age"] = 2016 - data["birth_year"]
# Ioana: cap ages above 100 at 100 (missing ages also become 100, because NaN <= 100 is False)
data['user_age'] = data['user_age'].apply(lambda x: x if x <= 100 else 100)
# TODO: instead of capping ages above 100 at 100, consider replacing them with the median, mean or mode

# drop_duplicates() returns a new DataFrame, so the result has to be assigned back
data = data.drop_duplicates()

NameError: name 'data' is not defined

In [None]:
# DataFrame containing only the 4 popular playlists and all their associated IDs
filtered_mapper = playlist_mapper[playlist_mapper["name"].isin(["Hot Hits UK", "Massive Dance Hits", "The Indie List", "New Music Friday"])]

In [None]:
# Dropping redundant columns
data.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'], axis=1, inplace = True) # not bringing any value
data.drop(['referral_code' , 'offline_timestamp', 'stream_cached', 'source'], axis=1, inplace = True) # null columns


# 3. Helper Functions

## 3.1. Add Dependent Variable

In [None]:
# select relevant playlists 
successid=filtered_mapper["id"].unique().tolist()

# Define variable which looks at "if artist && top playlist ==> successful, otherwise unsuccessful"
success = data["playlist_id"].isin(successid)
data["successful"] = success

# TZ - Just added, need to figure out where to put these. Needed for feature engineering
list_fourkeyplaylists = ['Hot Hits UK', 'Massive Dance Hits', 'The Indie List', 'New Music Friday']
list_uniqueartists = list(data['artist_name'].unique())
df_withoutkeyplaylists = data[~data['playlist_name'].isin(list_fourkeyplaylists)]

#============second variable==============
# Function that retrieves an array of all successful artists (those who showed up on a top playlist at least once)
def get_successful_artists(all_artists):
    records_successful_playlists = all_artists[all_artists['playlist_id'].isin(successid)]
    return records_successful_playlists['artist_name'].unique()

successful_artists = get_successful_artists(data)

# Define variable which looks at "if successful artist ==> successful, otherwise unsuccessful"
# Create column with dependent variable "success", which takes the value 1 if the artist is in the successful list
data['success'] = np.where(data['artist_name'].isin(successful_artists), 1, 0)


In [None]:
# Successful artists can appear in both buckets, but only the first bucket holds their streams from the successful playlists
print('successful artists:', data.loc[data['successful'] == 1]['artist_name'].nunique())
print('unsuccessful artists:', data.loc[data['successful'] == 0]['artist_name'].nunique())
data['successful'].value_counts()

In [None]:
# If the artist is successful, the value is 1 regardless of the playlist
print('successful artists:', data.loc[data['success'] == 1]['artist_name'].nunique())
print('unsuccessful artists:', data.loc[data['success'] == 0]['artist_name'].nunique())
data['success'].value_counts(normalize=True)
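
The two target columns above are easy to confuse, so a toy illustration may help. The sketch below assumes an invented stream log and an invented key-playlist id (`p_key` stands in for the contents of `successid`); it only mirrors the logic of the cells above and is not run against the real data.

```python
import numpy as np
import pandas as pd

# Toy stream log; artist names and playlist ids are invented for illustration
streams = pd.DataFrame({
    "artist_name": ["a", "a", "b", "c", "c"],
    "playlist_id": ["p_key", "p_other", "p_other", "p_key", "p_key"],
})
key_playlist_ids = ["p_key"]  # hypothetical stand-in for `successid`

# Row-level flag: was this particular stream played from a key playlist?
streams["successful"] = streams["playlist_id"].isin(key_playlist_ids)

# Artist-level flag: did the artist appear on a key playlist at least once?
key_artists = streams.loc[streams["successful"], "artist_name"].unique()
streams["success"] = np.where(streams["artist_name"].isin(key_artists), 1, 0)

print(streams)
```

Artist `a` has one stream outside the key playlist, so that row is `successful == False` while still carrying `success == 1`. This is why the artist counts printed for the two columns differ.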

In [None]:
# FOR PLOTTING CLUSTER MATRIX MAPPING
def plot_cluster(t, figsize=(10, 8)):
    # Pass figsize through to clustermap instead of ignoring the argument
    cg = sns.clustermap(t, figsize=figsize, cmap="mako", vmin=1)
    plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
    return cg

## 3.2. Unique Artist Names, etc, etc,

In [None]:
# TZ - Takes the main Pandas dataframe and returns a list of the unique artist names
def get_list_unique_artists(data):
    return data['artist_name'].unique().tolist()

In [None]:
# TZ - Flattens contents of a nested list inside a Pandas dataframe column and creates a new row with each of the
# list values
def expand_list(df, list_column, new_column): 
    lens_of_lists = df[list_column].apply(len)
    origin_rows = range(df.shape[0])
    destination_rows = np.repeat(origin_rows, lens_of_lists)
    non_list_cols = (
      [idx for idx, col in enumerate(df.columns)
       if col != list_column]
    )
    expanded_df = df.iloc[destination_rows, non_list_cols].copy()
    expanded_df[new_column] = (
      [item for items in df[list_column] for item in items]
      )
    expanded_df.reset_index(inplace=True, drop=True)
    return expanded_df
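
For comparison, recent pandas versions (0.25 and later) ship `DataFrame.explode`, which performs a similar flattening. The sketch below uses invented artists and genres and is only an illustration of the idea, not a drop-in replacement: `explode` turns an empty list into a single `NaN` row, whereas `expand_list` drops such rows, so the `dropna` call is added here to mimic that behaviour.

```python
import pandas as pd

# Hypothetical artists and genres, for illustration only
df = pd.DataFrame({
    "artist_name": ["a", "b", "c"],
    "genres": [["pop", "dance pop"], ["indie"], []],
})

# explode() emits one row per list element; an empty list becomes one NaN row,
# which is dropped here so that artists without genres disappear
exploded = (
    df.explode("genres")
      .dropna(subset=["genres"])
      .reset_index(drop=True)
)
print(exploded)
```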


## 3.3. Connection to Spotify API

In [None]:

# Connect to the Spotify API, which is used to get the genre information for artists.
# Credentials are read from environment variables rather than hard-coded in the notebook
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

# Create Spotify object to access API
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


# 4. Exploratory Data Analysis (DON'T RUN THIS)

## 4.1. INITIAL OVERVIEW OF DATA LOAD

In [None]:
data.head()

In [None]:
# First we look at what data type Pandas attributed to each column
# We need two options to force Pandas to count non-null instances in each column
# (show_counts replaces the deprecated null_counts argument from pandas 1.2 onwards)
data.info(verbose=True, show_counts=True)

<div class="alert alert-success">
    <b> Initial Overview of data load </b>

<p>The first thing we observe is that the majority (28/45) of columns are non-numeric (neither int nor float dtypes) and were classified as object instead. While an object column could, in theory, hold values of different types, we know that these hold text values because they came from a .csv file (which can only hold text or numbers).</p>

<p>We also see that not all columns contain 3805499 non-null values (e.g. postal_code only has 2453318 and stream_source_uri only 1043871). This will matter later when we prepare the dataset for the various ML tasks, where we will need to deal with these null values. Finally, 4 columns (e.g. referral_code or stream_cached) contain no non-null values at all, so we drop them, as they carry no information.</p>

</div>

In [None]:
# Take a look at descriptive summary statistics of each column 
data.describe()

In [None]:
# Initially this block was run at the very beginning, to decide which empty columns can be dropped
# It now shows that the columns kept are not empty and contain a fair amount of information
for col in data.columns:
    pct_missing = np.mean(data[col].isnull())
    if pct_missing != 0:
        print('{} - {}%'.format(col, round(pct_missing*100))) # look into financial product
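
The same check can be written without an explicit loop. The following is a minimal sketch on an invented three-column frame (the column names are borrowed from the dataset, the values are made up); it relies only on `DataFrame.isnull` and `mean`.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "postal_code": ["N1", None, None, "E2"],
    "hour": [1, 2, 3, 4],
    "stream_source_uri": [None, None, None, "x"],
})

# Share of missing values per column, keeping only columns with any gaps
missing_share = toy.isnull().mean()
missing_pct = (missing_share[missing_share > 0] * 100).round()
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

Sorting the result puts the sparsest columns first, which makes it easier to decide which ones need imputation and which could be dropped.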

## 4.2 AGE, STREAM LENGTH, HOUR, WEEKDAY DISTRIBUTION 

In [None]:
# PLOTTING AGE, STREAM LENGTH, HOUR, WEEKDAY DISTRIBUTION  
sns.set(style="white", palette="muted", color_codes=True)
rs = np.random.RandomState(10)

f, axes = plt.subplots(2, 2, figsize=(15, 7))
sns.despine(left=True)

sns.distplot(data['user_age'], bins=50, fit=norm, ax=axes[0, 0])
sns.distplot(data["stream_length"], bins=50,fit=norm, ax=axes[0, 1])
sns.distplot(data["hour"], bins=24,fit=norm, ax=axes[1, 0])
sns.distplot(data["weekday"], bins=7, fit=norm, ax=axes[1, 1])

axes[0,0].set_xlim(10, 100)
axes[0,1].set_xlim(0,500)

axes[0,0].set_title("Age Distribution")
axes[0,1].set_title("Stream Length Distribution")
axes[1,0].set_title("Hour Distribution")
axes[1,1].set_title("Weekday Distribution")

axes[0,0].set_xlabel('')
axes[0,1].set_xlabel('')
axes[1,0].set_xlabel('')
axes[1,1].set_xlabel('')

plt.setp(axes, yticks=[])
plt.tight_layout()

  <div class="alert alert-success">
<b> AGE, STREAM LENGTH, HOUR, WEEKDAY DISTRIBUTION </b>

<p> The age distribution is heavily right-skewed, with a mean around 29 and a mode of 24. This is to be expected: Spotify requires users to be at least 13, and young people are more adept at adopting digital trends such as music streaming and smartphones. The long tail of older users produces the right skew. </p>
<p> Most streams are between 150 and 250 seconds long in this left-skewed distribution, which matches the typical duration of a song. The black curve is a normal distribution fitted to the data, shown for comparison. </p> 
<p> Users stream more on Monday and Saturday, and mostly in the afternoon and evening between 14:00 and 20:00. Broken down by our key playlists, we see a continuation of this pattern in the graph below. </p>

</div>

## 4.3 USER-BASED FEATURES DISTRIBUTION

In [None]:
# GENDER SPLIT
mpl.style.use('seaborn-white')

labels = 'Male', 'Female'
sizes = [sum(data["gender"]!="female"), sum(data["gender"]=="female")]
colors = ["#1aa64b","#363837"]

fig = plt.figure(figsize=(8,5))
plt.title('Gender Distribution')
plt.pie(sizes,  explode=(0.04,0), labels=labels, colors=colors, autopct='%1.1f%%', startangle=90, pctdistance=1.2,labeldistance=1.4)

plt.axis('equal')
plt.show()


In [None]:
# USER-BASED FEATURES DISTRIBUTION
f, axes = plt.subplots(4, 2, figsize=(20, 35))

prob = data.stream_os.value_counts(normalize=True)
threshold = 0.05
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar', color=["#1aa64b","#363837"], ax=axes[0, 0])
axes[0,0].set_title("Operating System Distribution")
axes[0,0].set_xticklabels(prob.index, rotation=0)

data_plot_1 = data.stream_device.value_counts(normalize=True)
data_plot_1.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[0,1])
axes[0,1].set_title("Device Distribution")
axes[0,1].set_xticklabels(data_plot_1.index, rotation=0)

data_plot_2 = data.mobile.value_counts(normalize=True)
data_plot_2.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[1,0])
axes[1,0].set_title("Mobile Distribution")
axes[1,0].set_xticklabels(data_plot_2.index, rotation=0)

data_plot_3 = data.partner_name.value_counts(normalize=True)
data_plot_3.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[1,1])
axes[1,1].set_title("Provider Distribution")
axes[1,1].set_xticklabels(data_plot_3.index, rotation=0)

data_plot_4 = data.user_product_type.value_counts()
data_plot_4.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[2,0])
axes[2,0].set_title("User Product Type distribution")
axes[2,0].set_xticklabels(data_plot_4.index, rotation=0)

data_plot_5 = data.stream_source.value_counts(normalize=True)
data_plot_5.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[2,1])
axes[2,1].set_title("Stream Source Distribution")
axes[2,1].set_xticklabels(data_plot_5.index, rotation=0)

data_plot_6 = data.financial_product.value_counts(normalize=True)
data_plot_6.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[3,0])
axes[3,0].set_title("Product Distribution")
axes[3,0].set_xticklabels(data_plot_6.index, rotation=45)

data_plot_7 = data.access.value_counts()
data_plot_7.plot(kind="bar", color=["#1aa64b","#363837"], ax=axes[3,1])
axes[3,1].set_title("User Subscription Type Distribution")
axes[3,1].set_xticklabels(data_plot_7.index, rotation=0)

plt.xticks(rotation=0)
plt.show()


<div class="alert alert-success">
<b>User Demographics</b>

<p>Our user base contains slightly more females than males. The majority use an iOS mobile device, have either Vodafone or Boku as their mobile broadband provider and belong to the user category "paid". Most streaming is done from collections. Users mostly fall under the financial product category "student" and have a "premium" type subscription. This subscription profile corresponds to the age profile of Spotify users. It is also potential evidence that users are price sensitive. However, it can also be seen that there is still room to increase revenue. </p> 

</div>

## 4.4 TIME SERIES PLOT

In [None]:
# TIME SERIES PLOT

# Create a DataFrame of the total number of streams per date
# (rename_axis and reset_index(name=...) give the same column names on old and new pandas versions)
df_date = data['date'].value_counts().sort_index().rename_axis('Date').reset_index(name='Number Of Streams')

# Create the bar graph to better visualize time series
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(6, 15))
sns.barplot(x=(df_date['Number Of Streams']), y=df_date['Date'],
            data=df_date,            
            label="Number of streams by month",
            orient = "h")


<div class="alert alert-success">
<b> Time series distribution of streams </b>

<p> We notice that the number of streams follows an upward trend. This indicates that demand for Spotify has, as expected, risen from year to year, with a significant increase in the last few years. However, there are declines at certain dates, which suggests a degree of seasonality in the time series.</p> 

</div>

## 4.5 ARTISTS AND PLAYLISTS PLOTS 

In [None]:
# Next up, we will show how often various playlists come up in the data
# Because we have so many playlists, we only display those who make up more than 2 % of the total data
# https://stackoverflow.com/questions/37598665/how-to-plot-a-value-counts-in-pandas-that-has-a-huge-number-of-different-counts

f, axes = plt.subplots(2, 2, figsize=(20, 20))

prob = data.playlist_name.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar', color=["#1aa64b","#363837"], ax=axes[0, 0])
axes[0,0].set_title("Playlists Distribution")

# For artists we only show those who make up more than 5 % of total streams - a 2 % threshold would lead to an overcrowded graph
prob = data.artist_name.value_counts(normalize=True)
threshold = 0.05
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar', color=["#1aa64b","#363837"], ax=axes[0, 1])
axes[0,1].set_title("Artists Distribution")


<div class="alert alert-success">
<b>Artist and Playlist distribution</b>

<p>The playlists suggested by the Warner analysts (Hot Hits UK, Massive Dance Hits, The Indie List, New Music Friday) are not very popular, with the exception of Hot Hits UK, which is in fact the most popular playlist of all. The others do not reach the 2% threshold at all; instead, we see that the majority of playlists are small ones.</p> 

<p>For artists, we see that only 6 have more than 5 % of total streams each, with Charlie Puth being the most popular artist by far, followed by Dua Lipa and Lukas Graham.</p> 

</div>
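
The "keep the big categories, lump the rest into `other`" step is repeated for operating systems, playlists and artists, so it could be factored into a small helper. The sketch below is one possible shape for it; the name `lump_tail` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def lump_tail(values: pd.Series, threshold: float) -> pd.Series:
    """Relative frequencies, with every category at or below `threshold` summed into 'other'."""
    shares = values.value_counts(normalize=True)
    keep = shares > threshold
    lumped = shares[keep].copy()
    lumped["other"] = shares[~keep].sum()
    return lumped

# Toy data: 'x' has 60 % of the plays, 'y' 30 % and 'z' 10 %
plays = pd.Series(["x"] * 6 + ["y"] * 3 + ["z"])
result = lump_tail(plays, threshold=0.2)
print(result)
```

The returned Series can be passed straight to `.plot(kind='bar', ...)` in the same way as `prob` in the cells above.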



In [None]:
# TOP 10 MOST STREAMED SUCCESSFUL ARTISTS BAR CHART 

mpl.style.use('seaborn-white')
top_artists_df = data.loc[data['success'] == 1]['artist_name'].value_counts()[:10].sort_values(ascending=True)
objects = top_artists_df.keys()
y_pos = np.arange(len(objects))
performance = top_artists_df.values
plt.barh(y_pos, performance, align='center', alpha=0.5, color = "#1aa64b")
plt.yticks(y_pos, objects)
plt.xlabel('Number of Occurrences')
plt.title('Top 10 most streamed successful artists')

plt.show()


<div class="alert alert-success">
<b>Top 10 most streamed successful artists distribution</b>

<p>The most streamed successful artist is Charlie Puth by far, followed by Dua Lipa and Lukas Graham.</p>
</div>


In [None]:
# TOP PLAYLISTS WHERE SUCCESSFUL ARTISTS SHOW UP

# All streams of artists flagged as successful (defined here so this cell runs in order)
successful_artists_df = data.loc[data['artist_name'].isin(successful_artists)]

# stronger recommendation based on other playlists - key playlists may not be relevant any more, many other playlists have way more streams
top_unpopular_playlists =  successful_artists_df.loc[successful_artists_df['success']==1]['playlist_name'].value_counts()[:10].sort_values(ascending=True)
objects = top_unpopular_playlists.keys()
y_pos = np.arange(len(objects))
performance = top_unpopular_playlists.values

plt.barh(y_pos, performance, align='center', alpha=0.5, color = "#1aa64b")
plt.yticks(y_pos, objects)
plt.xlabel('Number of Occurrences')
plt.title('Top Playlists of Successful Artists')
plt.show()


In [None]:
# PLAYLISTS CONTAINING THE HIGHEST NUMBER OF DISTINCT SUCCESSFUL ARTISTS! --> HHUK stands out

successful_artists_frequency_playlists = successful_artists_df.groupby('playlist_name')['artist_name'].nunique().sort_values(ascending=False).reset_index(name='successful_artists')[:15]
objects = successful_artists_frequency_playlists.playlist_name
y_pos = np.arange(len(objects))
performance = successful_artists_frequency_playlists.successful_artists
plt.barh(y_pos, performance, align='center', alpha=0.5, color = "#1aa64b")
plt.yticks(y_pos, objects)
plt.xlabel('Number of Unique Successful Artists')
plt.title('Playlists which contain the highest number of distinct successful artists')
plt.show()


<div class="alert alert-success">
<b>Playlists with successful artists</b>

<p> Looking at which playlists the artists flagged as successful feature on, it becomes clear that Hot Hits UK has by far the most successful artists. However, the other 3 key playlists do not show up at all. It must be questioned whether they are a good indicator of success or whether the set of key playlists should be updated. </p> 

</div>

In [None]:
# TOP 10 PLAYLISTS WHERE MOST ARTISTS SHOW UP, EXCLUDING THE 4 "SUCCESS" RELATED ONES

successful_artists_df = data.loc[data['artist_name'].isin(successful_artists)]
print('Total successful artists:', successful_artists_df['artist_name'].nunique())
top_unpopular_playlists = successful_artists_df.loc[successful_artists_df['successful']==False]['playlist_name'].value_counts()[:10].sort_values(ascending=True)
objects = top_unpopular_playlists.keys()
y_pos = np.arange(len(objects))
performance = top_unpopular_playlists.values
plt.barh(y_pos, performance, align='center', alpha=0.5, color = "#1aa64b")
plt.yticks(y_pos, objects)
plt.xlabel('Number of Occurrences')
plt.title('Top 10 playlists most popular artists show up, different than "The 4"')
plt.show()


<div class="alert alert-success">
<b>Playlists with successful artists - excluding the key playlists</b>

<p> 'Today's Top Hits' is the playlist with most successful artists when we exclude the key playlists. </p>
</div>

## 4.6 KEY PLAYLISTS VISUALIZATIONS

In [None]:
#"Hot Hits UK", "Massive Dance Hits", "The Indie List", "New Music Friday"
print('From a total of {} successful artists'.format(len(successful_artists)))
print('Total successful artists showing up on Hot Hits UK:', successful_artists_df.loc[successful_artists_df['playlist_name'] == 'Hot Hits UK']['artist_name'].nunique())
print('Total successful artists showing up on Massive Dance Hits:', successful_artists_df.loc[successful_artists_df['playlist_name'] == 'Massive Dance Hits']['artist_name'].nunique())
print('Total successful artists showing up on The Indie List:', successful_artists_df.loc[successful_artists_df['playlist_name'] == 'The Indie List']['artist_name'].nunique())
print('Total successful artists showing up on New Music Friday:', successful_artists_df.loc[successful_artists_df['playlist_name'] == 'New Music Friday']['artist_name'].nunique())


In [None]:
# DISTRIBUTION OF STREAMS PER HOUR FOR THE KEY PLAYLISTS - Line chart showing that HHUK is significantly more streamed than the others

# 4 key playlists
key_playlists = playlist_mapper[playlist_mapper["name"].isin(["Hot Hits UK", "Massive Dance Hits", "The Indie List", "New Music Friday"])]
data_key_playlists = data[data.stream_source_uri.astype(str).str[-22:].isin(key_playlists.id)]

# Raw number of streams per hour for each key playlist
key_playlists_streams_hour = data_key_playlists.groupby(['playlist_name','hour']).size().unstack().fillna(0).transpose()
key_playlists_streams_hour.plot(kind='line',figsize=(12,6))

# Standardise each playlist's hourly counts to compare hourly trends (normalized)
key_playlists_streams_hour_norm = pd.DataFrame(scale(key_playlists_streams_hour))
key_playlists_streams_hour_norm.columns = key_playlists_streams_hour.columns
key_playlists_streams_hour_norm.plot(title='Distribution Of Streams Per Hour (Normalized)',kind='line',figsize=(12,6))


<div class="alert alert-success">
<b>Distribution of streams per hour for the key playlists </b>

<p> We observe that overall users tend to stream the key playlists mostly from 10 a.m. to 3 p.m. From 5 a.m. to 10 a.m. the number of streams starts to rise, while from 3 p.m. to 8 p.m. it falls. During the late hours of the night there are not many streams. This is to be expected if we think of a normal person's day-to-day routine.</p> 
<p> We should also look at each key playlist's hourly trend. The Hot Hits UK playlist has its highest number of streams somewhere between 3 p.m. and 8 p.m., while it sees a small decrease from 10 a.m. to 3 p.m. Other than that, it follows the overall hourly trend discussed before. The Massive Dance Hits playlist peaks at 3 p.m. and generally follows the overall hourly trend. The New Music Friday playlist also peaks at 3 p.m., but shows an increase in streams after 8 p.m. which does not happen for the other key playlists. Other than that, it tends to follow the overall hourly trend, although with the most pronounced ups and downs. The Indie List playlist peaks at 10 a.m. and otherwise follows the overall hourly trend.</p>
</div>
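
To make the effect of the normalisation step concrete: `sklearn.preprocessing.scale` standardises each column (here, each playlist) to zero mean and unit variance, using the population standard deviation. The sketch below reproduces that per-column z-score with plain pandas on invented hourly counts; the playlist names and numbers are made up for illustration.

```python
import pandas as pd

# Invented hourly stream counts for a large and a small playlist
hourly = pd.DataFrame(
    {"Big Playlist": [100, 300, 200], "Small Playlist": [1, 3, 2]},
    index=pd.Index([9, 12, 15], name="hour"),
)

# Column-wise z-score; ddof=0 matches the population standard deviation
standardised = (hourly - hourly.mean()) / hourly.std(ddof=0)
print(standardised)
```

Both columns end up with identical standardised curves even though their raw volumes differ by a factor of 100, which is why the normalised plot compares the shape of the hourly pattern rather than playlist size. Unlike `scale`, this pandas version keeps the index and column labels.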

In [None]:
# PLOT MATRIX WITH ARTISTS DISTRIBUTION ACROSS TOP PLAYLISTS  

playlists_filtered = data.stream_source_uri.value_counts().head(20).keys().tolist()
artists_filtered = data.artist_name.value_counts().head(20).keys().tolist()
df = data.copy()  # work on an explicit copy so adding a column does not touch `data`
df['playlist_name'] = df.stream_source_uri.astype(str).str[-22:].map(playlist_mapper.set_index('id')['name'])
df = df.dropna(subset=['playlist_name','artist_name'])
df = df[(df.stream_source_uri.isin(playlists_filtered))&df.artist_name.isin(artists_filtered)]
df = df.groupby(['artist_name','playlist_name']).size().unstack().fillna(0)
plot_cluster(df)


In [None]:
# PLOT MATRIX WITH ARTISTS DISTRIBUTION ACROSS TOP PLAYLISTS EXCLUDING HHUK 

# CREATE A NEW DATAFRAME WHICH EXCLUDES HHUK
successful_artists_df_no_HHUK = successful_artists_df.loc[successful_artists_df.playlist_name != 'Hot Hits UK']
successful_artists_df_no_HHUK.artist_name.nunique()

filter_playlists = successful_artists_df_no_HHUK.stream_source_uri.value_counts().head(20).keys().tolist()
filter_artists = successful_artists_df_no_HHUK.artist_name.value_counts().head(20).keys().tolist()
t = successful_artists_df_no_HHUK.copy()  # explicit copy, avoids SettingWithCopyWarning below
t['playlist_name'] = t.stream_source_uri.astype(str).str[-22:].map(playlist_mapper.set_index('id')['name'])
t = t.dropna(subset=['playlist_name','artist_name'])
t = t[(t.stream_source_uri.isin(filter_playlists))&t.artist_name.isin(filter_artists)]
t = t.groupby(['artist_name','playlist_name']).size().unstack().fillna(0)

plot_cluster(t)


<div class="alert alert-success">
<b>Cluster the playlists and artists by number of streams</b>

<p> The above two heatmaps show the difference when the dominant Hot Hits UK playlist is removed from the analysis – while the artists stay mostly the same, the distribution across the other playlists becomes much clearer. A separation into two distinct success categories, one without Hot Hits UK, is a worthwhile experiment. </p> 

</div>

# 5. Feature Engineering

## 5.1 Artist Features

In [None]:
#TZ - Artist Feature 1 -> TOTAL NUMBER OF STREAMS BY ARTIST
def artist_total_streams(data):
    # rename_axis and reset_index(name=...) fix the column names independently of the pandas version
    ar_stream_df = data['artist_name'].value_counts().rename_axis('artist_name').reset_index(name='ar_stream_count')
    return ar_stream_df


In [None]:
## graph
# call function and generate the temp feature for viz
# code for viz
'''
...
...
...

'''
# what we can see after this function.....

## each feature the same a.m. a. p.


In [None]:
#TZ - Artist Feature 2 -> TOTAL NUMBER OF USERS BY ARTIST
def artist_total_users(data):
    ar_users_df = data.groupby('artist_name')['customer_id'].nunique().sort_values(ascending=False).reset_index(name='ar_users')
    return ar_users_df

In [None]:
#TZ - Artist Feature 3 -> PASSION SCORE BY ARTIST
def artist_passion_score(data):
    ar_passion_df = artist_total_streams(data)
    ar_passion_df = ar_passion_df.merge(artist_total_users(data), on='artist_name', how='left', sort=False)
    ar_passion_df['ar_passion_score'] = ar_passion_df['ar_stream_count']/ar_passion_df['ar_users']
    ar_passion_df.drop(columns=['ar_stream_count','ar_users'], inplace=True)
    return ar_passion_df
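
As a quick sanity check of the passion score (streams per unique listener), the sketch below computes it on an invented log with a single `groupby`. The toy data and the intermediate frame `per_artist` are illustrative assumptions; the column names follow the functions above.

```python
import pandas as pd

# Invented log: artist 'a' has 4 streams from 2 users, artist 'b' has 2 streams from 2 users
toy = pd.DataFrame({
    "artist_name": ["a", "a", "a", "a", "b", "b"],
    "customer_id": [1, 1, 1, 2, 1, 2],
})

per_artist = toy.groupby("artist_name").agg(
    ar_stream_count=("customer_id", "size"),   # total streams
    ar_users=("customer_id", "nunique"),       # distinct listeners
)
per_artist["ar_passion_score"] = per_artist["ar_stream_count"] / per_artist["ar_users"]
print(per_artist)
```

A score of 1.0 means every listener streamed the artist exactly once; higher values indicate repeat listening.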

In [None]:
#TZ - Artist Feature 4 -> GENRE PCA BY ARTIST ** NEW FEATURE **
def artist_genres_with_pca(data, no_components_or_preserved_variance):
    artists_list = get_list_unique_artists(data)
    artist_names = []
    artist_ids = []
    artist_genres = []

    for i in artists_list:
        try:
            result = sp.search(i) #search query
            artist_ids.append(result['tracks']['items'][0]['artists'][0]['id'])
            artist_names.append(i)

            result = sp.artist(result['tracks']['items'][0]['artists'][0]['uri'])
            artist_genres.append(result['genres'])
        except IndexError as error:
            print(i) 

    temp_df = pd.DataFrame(zip(artist_names, artist_genres), columns=["artist_name", "genres"])

    # One row per (artist, genre) pair, then pivot into a 0/1 indicator matrix
    expanded_dataframe = expand_list(temp_df, "genres", "genres")
    expanded_dataframe['value'] = 1

    pvit = expanded_dataframe.pivot(index='artist_name', columns='genres', values='value')
    pvit.fillna(0, inplace=True)
    pvit.reset_index(inplace=True)
    
    # Run PCA on the genre indicator columns
    Y = pvit['artist_name']
    X = pvit.iloc[:,1:]
    
    # The argument is either an integer number of components or a float in (0, 1)
    # giving the share of variance to preserve (40 components preserved roughly 50%)
    pca = PCA(n_components=no_components_or_preserved_variance)
    principalComponents = pca.fit_transform(X)
    
    # Name the columns after the number of components actually kept,
    # which also works when a variance share was passed
    n_components_column_names = ['genre_pc_' + str(i) for i in range(1, principalComponents.shape[1] + 1)]
    
    principalComponents = pd.DataFrame(data=principalComponents, columns=n_components_column_names)
    principalComponents = pd.concat([Y, principalComponents], axis=1)
    
    return principalComponents
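
The trade-off between the number of components and the variance preserved can be inspected directly. The sketch below is a NumPy-only illustration that stands in for scikit-learn's `PCA`: it uses a random, invented genre indicator matrix and an SVD of the centred data to find how many components are needed to reach a target share of variance. The matrix size and the 50% target are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical one-hot genre matrix: 50 artists x 8 genres
X = (rng.random((50, 8)) < 0.3).astype(float)

# PCA via SVD of the centred data: squared singular values are proportional
# to the variance explained by each principal component
Xc = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(Xc, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches the target
target = 0.5
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative.round(3))
```

Plotting `cumulative` against the component index gives the usual "variance explained" curve, which is one way to justify a choice such as 40 components.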

In [None]:
## graphs for pca
# code for viz
'''
distribution
feature importance 
variance explained
...

'''
# 

In [None]:
#TZ - Artist Feature 5 -> Region PCA BY ARTIST ** OLD PCA Analysis Rolled into a function **
def artist_regions_with_pca(data, no_components_or_preserved_variance):
    # Count streams per artist and region, with one column per region
    artist_region = data.groupby(['artist_name','region_code']).agg({'region_code': ['count']})
    feature_region = artist_region.unstack().fillna(0)

    # Standardise the region counts before PCA
    scaler = StandardScaler()
    artist_region_scaled = scaler.fit_transform(feature_region)

    pca = decomposition.PCA(no_components_or_preserved_variance)
    # One row of principal-component scores per artist
    artist_region_pca_features = pca.fit_transform(artist_region_scaled)

    # Name the columns after the number of components actually kept
    pc_name = ['region_pc_' + str(i) for i in range(1, artist_region_pca_features.shape[1] + 1)]

    # Use the per-artist scores rather than pca.components_ (the per-region loadings),
    # so that the rows line up with the artist names
    region_pcaDF = pd.DataFrame(data=artist_region_pca_features, columns=pc_name)
    artist_pca = pd.Series(feature_region.index.values, name='artist_name')

    region_pcaDF = pd.concat([artist_pca, region_pcaDF],axis=1)

    return region_pcaDF

In [None]:
## graphs for pca
# code for viz
'''
distribution
feature importance 
variance explained
...

'''
# 

In [None]:
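#TZ - Artist Feature 6 -> AVERAGE TRACK STREAMING TIME BY ARTIST ** NEW FEATURE **
def artist_avg_track_streaming_time(data):
    # Assumption: mirrors Playlist Feature 4, grouped by artist instead of playlist.
    # The output column name 'ar_avg_tracks_streaming' is a placeholder.
    ar_avg_track_streaming_df = (data.groupby(['artist_name', 'track_name'])['log_time'].nunique()
                                     .groupby(['artist_name']).mean()
                                     .to_frame().reset_index()
                                     .rename(columns={"log_time" : "ar_avg_tracks_streaming"}))
    return ar_avg_track_streaming_df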


## 5.2 Playlist Features

In [None]:
#TZ - Playlist Feature 1 -> TOTAL NUMBER OF STREAMS ON TOP 20 PLAYLISTS BY ARTIST
def playlist_top20_total_streams(data):
    pasc = data.groupby('playlist_name')['playlist_name'].count()
    df_playlist = pasc.to_frame().rename(columns={'playlist_name': 'pl_stream_count'})
    df_playlist.reset_index(inplace=True)
    df_ar_pl = data.groupby(['artist_name','playlist_name']).size()
    df_ar_pl = df_ar_pl.groupby(level=0, group_keys=False).nlargest(20)
    df_ar_p1_reset = df_ar_pl.to_frame().reset_index()
    df_ar_p1_reset.rename(columns={0:'count'}, inplace=True)
    df_ar_p1_merge = df_ar_p1_reset.merge(df_playlist, on='playlist_name', how='inner')
    df_ar_p1_merge = df_ar_p1_merge.sort_values(by=['artist_name', 'count'], ascending=[1,0])
    df_ar_p1_merge.drop(columns=['pl_stream_count'], inplace=True)
    artist_sum = df_ar_p1_merge.groupby(['artist_name']).agg({'count':sum})
    df_ar_p1_merge = df_ar_p1_merge.merge(artist_sum,on='artist_name', how='inner', suffixes=('','_sum'))
    df_ar_p1_merge.drop(columns=['playlist_name','count'], inplace=True)
    df_ar_p1_merge.rename(columns={"count_sum" : "pl_total_streams_top20"}, inplace=True)
    df_ar_p1_merge.drop_duplicates(inplace=True)
    df_ar_p1_merge.reset_index(drop=True, inplace=True)
    return df_ar_p1_merge

In [None]:
#TZ - Playlist Feature 2 ->  TOTAL NUMBER OF UNIQUE USERS ON TOP 20 PLAYLISTS BY ARTIST
def playlist_top20_user_counts(data):
    df_temp = pd.DataFrame(columns=['artist_name', 'total_users_top20playlists'])

    for i in list_uniqueartists:
        df_artist = df_withoutkeyplaylists[df_withoutkeyplaylists['artist_name']==i]
        df_artist = pd.DataFrame(df_artist.groupby(['playlist_name'])['customer_id'].nunique())
        df_artist.sort_values(by='customer_id', ascending=False, inplace=True)
        df_artist = df_artist.head(20)
        df_artist.insert(0, 'artist_name',i)
        df_artist.reset_index(inplace=True)
        df_artist.drop(columns='playlist_name', inplace=True)
        df_artist = pd.DataFrame(df_artist.groupby('artist_name')['customer_id'].sum())
        df_artist.rename(columns={'customer_id':'total_users_top20playlists'},inplace=True)
        df_temp = pd.concat([df_temp, df_artist], axis=0)

    df_temp.reset_index(inplace=True)
    df_temp.drop(columns='artist_name', inplace=True)
    df_temp.rename(columns={'index':'artist_name'},inplace=True)

    return df_temp

In [None]:
#TZ - Playlist Feature 3 ->  PASSION SCORE ON TOP 20 PLAYLISTS BY ARTIST
def playlist_top20_passion_score(data):
    temp_df = playlist_top20_total_streams(data)
    pl_passion_df = temp_df.merge(playlist_top20_user_counts(data), on='artist_name', how='left', sort=False)
    pl_passion_df['pl_passion_score_top20'] = pl_passion_df['pl_total_streams_top20']/pl_passion_df['total_users_top20playlists']
    pl_passion_df.drop(columns=['pl_total_streams_top20','total_users_top20playlists'], inplace=True)
    return pl_passion_df

In [None]:
#TZ - Playlist Feature 4 -> AVERAGE TRACK STREAMING TIME BY PLAYLIST ** NEW FEATURE **
def playlist_average_track_streaming_time(data):
    avg_streamed_tracks_per_playlist = data.groupby(['playlist_name', 'track_name'])['log_time'].nunique().groupby(['playlist_name']).mean().sort_values(ascending = False).to_frame().reset_index().rename(columns={"log_time" : "avg_tracks_streaming"})
    return avg_streamed_tracks_per_playlist

## 5.3 User Features

In [None]:
#TZ - User Feature 1 -> TOTAL FEMALE USERS BY ARTIST
def user_total_female(data):
    df_genderbreakdownbyartist = data[['artist_name','gender']].copy()
    df_genderbreakdownbyartist['female']=0
    df_genderbreakdownbyartist.loc[df_genderbreakdownbyartist['gender'] == 'female', 'female'] = 1
    df_genderbreakdownbyartist = df_genderbreakdownbyartist.groupby(['artist_name']).agg({'female':'sum'})
    df_genderbreakdownbyartist.sort_values(by='female', ascending=False, inplace=True)
    return df_genderbreakdownbyartist

In [None]:
#TZ - User Feature 2 -> BREAKDOWN OF USER AGES BY ARTIST
def user_age_breakdown(data):
    
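    # Work on a copy of the needed columns so the caller's DataFrame is not modified
    df_temp = data[['artist_name', 'user_age']].copy()

    bins = [0, 18, 30, 40, 50, 60, 70, 200]
    labels = ['<18','18-29', '30-39', '40-49', '50-59', '60-69', '70<']
    # right=False makes the bins left-closed, so that e.g. age 18 falls into '18-29'
    df_temp['user_age_range'] = pd.cut(df_temp.user_age, bins, labels = labels, right = False)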

    df_temp = pd.concat([df_temp['artist_name'], df_temp['user_age_range']], axis=1)
    df_temp['count_dummy'] = 1
    df_temp = pd.pivot_table(df_temp, values='count_dummy', index=['artist_name'],
                          columns=['user_age_range'], aggfunc=np.sum)

    df_temp.fillna(0,inplace=True)
    return df_temp

## 5.4 Merge the features into the analytics-ready DataFrame

In [None]:
featurefunctions = [artist_total_streams(data), 
                    artist_total_users(data),
                    artist_passion_score(data),
                    artist_genres_with_pca(data, 40), # This one in particular takes ages as it needs to connect to the Spotify API
                    artist_regions_with_pca(data, 10),
                    playlist_top20_total_streams(data),
                    playlist_top20_user_counts(data),
                    playlist_top20_passion_score(data),
#                     playlist_average_track_streaming_time(data), TBC 
                    user_total_female(data),
                    user_age_breakdown(data)
                   ]
                    

In [None]:
# This bit takes a while, sit back, chill, have a coffee
final_df = pd.DataFrame()

for i in featurefunctions:
    if final_df.empty:
        final_df = i
    else: 
        final_df = final_df.merge(i, on='artist_name', how='left', sort=False)

In [None]:
#final_df = final_df.set_index('artist_name')
final_df

# 6. Data Preparation

## 6.1 Dealing with Missing Data

## 6.2 Feature Correlations

In [None]:
final_df_temp = final_df

In [None]:
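correlation = final_df_temp.corr()
plt.figure(figsize=(16, 10))
# Highlight feature pairs whose correlation exceeds 0.95 (1 = above the threshold, 0 = below)
heatmap = sns.heatmap((correlation > 0.95).astype(int), linewidths=0, vmin=0, vmax=1, cmap="RdBu_r")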

In [None]:
# DROPPING THE COLUMNS THAT CORRELATE

# Create correlation matrix
corr_matrix = final_df_temp.corr().abs()

# Select upper triangle of correlation matrix (np.bool is deprecated, so use the built-in bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with an absolute correlation greater than 0.50
to_drop = [column for column in upper.columns if any(upper[column] > 0.50)]

# final_df_drop is the DataFrame used in the following sections
final_df_drop = final_df_temp.drop(columns=to_drop)

print("Correlated features dropped: ")
print(*to_drop, sep = ", ") 

In [None]:
final_df_temp.columns

## 6.3 Class Imbalance

The ratio of successful to unsuccessful artists in the dataset is 1:5, which is a fairly major class imbalance. 
In general, class imbalance can lead to strong predictive bias, because a model can optimize the training objective by predicting the majority label most of the time. Over-sampling, on the other hand, can lead to over-fitting on the minority class. (Source: Python paper, ask Ioana.)

We will therefore adopt random under-sampling, which brings the class proportions closer together. We under-sample the unsuccessful artists enough to balance the classes without losing too much information; a 60:40 ratio should be appropriate. The remaining imbalance will be less significant for the models. 
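
As a rough illustration of the sizing arithmetic, the sketch below computes how many majority-class rows to keep so that the minority class ends up with a target share of the balanced set. The helper name `majority_sample_size` and the example counts are hypothetical and are not taken from the dataset.

```python
import random

def majority_sample_size(n_minority, minority_pct=40):
    """Majority rows to keep so that the minority class makes up `minority_pct` percent."""
    # n_minority / (n_minority + n_majority) = minority_pct / 100
    # => n_majority = n_minority * (100 - minority_pct) / minority_pct
    return n_minority * (100 - minority_pct) // minority_pct

# Hypothetical counts with a 1:5 imbalance
n_success, n_failure = 200, 1000
n_keep = majority_sample_size(n_success, minority_pct=40)

# Draw the majority-class rows to keep, without replacement and with a fixed seed
rng = random.Random(42)
kept_failures = rng.sample(range(n_failure), n_keep)

print(n_keep, n_success / (n_keep + n_success))
```

Fixing the seed keeps the under-sampled set reproducible between runs, which makes later model comparisons easier to interpret.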

In [None]:
success_count = final_df_drop.success.value_counts()
print('failure :', success_count[0])
print('success :', success_count[1])
print('Proportion:', round(success_count[0] / success_count[1], 2), ': 1')

success_count.plot(kind='bar', title='Count (success)');

In [None]:
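# Class count, looked up by label rather than by position in the sorted counts
class_counts = final_df_drop.success.value_counts()
count_fail, count_suc = class_counts[0], class_counts[1]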

# Divide by class
df_suc = final_df_drop[final_df_drop['success'] == 1]
df_fail = final_df_drop[final_df_drop['success'] == 0]
print(df_suc.shape)
print(df_fail.shape)

In [None]:
# Random under-sampling (fixed seed for reproducibility)
df_fail_under = df_fail.sample(int(count_suc/40*60), random_state=42)  # achieve a class balance closer to 60-40

df_class_balance = pd.concat([df_fail_under, df_suc], axis=0)

print('Random under-sampling:')
print(df_class_balance.success.value_counts())

df_class_balance.success.value_counts().plot(kind='bar', title='Count (success)');

## 6.4 Train-Test Split

## 6.5 Feature Scaling

# 7. Model Generation (Classical)

## 7.1 Logistic Regression

In [None]:
# Standard Logistic Regression with Grid Search Tuning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
lr_clf_grd = LogisticRegression(random_state=42, solver = 'liblinear')
params = {'penalty': ('l1', 'l2'),  # the liblinear solver supports only the l1 and l2 penalties
          'max_iter': [10,100,1000,10000],
          'tol': [1e-4, 1e-3, 1e-2, 1e-1, 1], 
          'C':[0.001,.009,0.01,.09,1,5,10,25] 
         }
best_lr_clf = GridSearchCV(lr_clf_grd, param_grid=params, cv=5, scoring = 'recall')
best_lr_clf.fit(X_train,y_train)
print(best_lr_clf.best_params_)
print(best_lr_clf.best_score_)

lr_clf_optimized = LogisticRegression(random_state=42, solver = 'liblinear', **best_lr_clf.best_params_)
lr_clf_optimized.fit(X_train, y_train)
y_pred_test_lr = lr_clf_optimized.predict(X_test)

print("MSE Optimised Logistic Regression test error:", mean_squared_error(y_test, y_pred_test_lr))
print("Accuracy score:", lr_clf_optimized.score(X_test, y_test)) # accuracy_score

## 7.2 Naive Bayes

Naive Bayes is a probabilistic classifier built on the strong (naive) assumption that every feature is independent of the others given the class. It learns p(y|x) indirectly, by first estimating p(x|y) and p(y) and then applying Bayes' rule. 

As it takes only two parameters and is based strictly on probabilities and likelihoods, performing a Grid Search adds little here. 
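
To make the indirect route from p(x|y) and p(y) to p(y|x) concrete, here is a minimal standard-library sketch of a Gaussian Naive Bayes posterior. The class priors, means and variances are made-up illustration values, and `naive_bayes_posterior` is a hypothetical helper, not part of scikit-learn.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_posterior(x, priors, means, variances):
    """Return p(y | x) per class, treating the features as independent Gaussians."""
    joint = {}
    for label, prior in priors.items():
        likelihood = 1.0
        # Naive assumption: p(x | y) is the product of the per-feature densities
        for j, value in enumerate(x):
            likelihood *= gaussian_pdf(value, means[label][j], variances[label][j])
        joint[label] = prior * likelihood      # p(x | y) * p(y)
    evidence = sum(joint.values())             # p(x)
    return {label: p / evidence for label, p in joint.items()}

# Hypothetical per-class statistics for two features
priors = {0: 0.6, 1: 0.4}
means = {0: [0.0, 0.0], 1: [2.0, 2.0]}
variances = {0: [1.0, 1.0], 1: [1.0, 1.0]}

posterior = naive_bayes_posterior([2.0, 2.0], priors, means, variances)
print(posterior)
```

Even though class 0 has the larger prior, the sample sits on the mean of class 1, so the likelihood term dominates and class 1 receives most of the posterior mass.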

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

y_pred_test_nb = nb_clf.predict(X_test)
print(classification_report(y_test, y_pred_test_nb, target_names = label_list))

print("MSE of Naive Bayes test:", mean_squared_error(y_test, y_pred_test_nb))

## 7.3 Support Vector Machine

Feature importance is only available with a linear kernel. With any other kernel, the kernel method maps the data into another space that is not directly related to the input space. For a linear kernel, the weights assigned to the features (the coefficients of the primal problem) can be read as the weights of the input's "dimensions".
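
The following standard-library sketch illustrates why this works for the linear kernel: the dual decision function collapses into a single weight vector with one entry per input feature. The support vectors, dual coefficients and bias are invented illustration values, not output from the model in this notebook.

```python
def dot(u, v):
    """Inner product, i.e. the linear kernel K(u, v)."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical fitted quantities: support vectors, dual coefficients (alpha_i * y_i) and bias
support_vectors = [[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]]
dual_coefs = [0.5, -0.25, -0.25]
bias = 0.1

# Primal weights: w = sum_i dual_coef_i * x_i, one weight per input feature
n_features = len(support_vectors[0])
weights = [sum(c * sv[j] for c, sv in zip(dual_coefs, support_vectors))
           for j in range(n_features)]

def decision_dual(x):
    """Decision value computed through the kernel, as for any SVM."""
    return sum(c * dot(sv, x) for c, sv in zip(dual_coefs, support_vectors)) + bias

def decision_primal(x):
    """Same decision value computed directly from the feature weights."""
    return dot(weights, x) + bias

x_new = [2.0, -1.0]
print(weights)
print(decision_dual(x_new), decision_primal(x_new))
```

In scikit-learn, a fitted `SVC` exposes these weights as `coef_` only when `kernel='linear'`; for other kernels no such per-feature vector exists.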

In [None]:
# SVM with Grid Search Tuning
from sklearn.svm import SVC

params = {
    'kernel' : ('linear', 'poly', 'rbf', 'sigmoid'),
    'degree' : [1, 2, 3, 4, 5],  # only used by the 'poly' kernel
    'tol': [1e-4, 1e-3, 1e-2, 1e-1, 1]
         }

# probability=True is fixed rather than tuned: it does not affect predict(), and it
# provides predict_proba, which the soft-voting ensemble needs
svc_grid_clf = SVC(random_state=42, probability=True)
best_svm_clf = GridSearchCV(svc_grid_clf, param_grid=params, cv=5, scoring = 'recall')
best_svm_clf.fit(X_train,y_train)
print(best_svm_clf.best_params_)
print(best_svm_clf.best_score_)

svm_clf_optimized = SVC(random_state=42, probability=True, **best_svm_clf.best_params_)
svm_clf_optimized.fit(X_train, y_train)
y_pred_test_svm = svm_clf_optimized.predict(X_test)

print("MSE of SVC test:", mean_squared_error(y_test, y_pred_test_svm))
print("Accuracy score:", svm_clf_optimized.score(X_test, y_test)) # accuracy_score

## 7.4 Decision Tree

Gini impurity is used as the split criterion instead of entropy because:

* In most cases the two criteria lead to very similar trees, so performance is largely unaffected by the choice.
* Entropy can be slightly slower to compute (because it makes use of the logarithm).
* The choice reportedly matters in only about 2% of cases.
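
For intuition, the small standard-library sketch below evaluates both impurity measures for a binary node at a few class proportions. The helper names `gini` and `entropy` are ours for illustration and are not scikit-learn functions.

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class share p: 1 - p^2 - (1 - p)^2."""
    return 2 * p * (1 - p)

def entropy(p):
    """Shannon entropy (in bits) of a binary node with positive-class share p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero entropy; avoids log2(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both measures are zero for a pure node, peak at p = 0.5 and are symmetric around it, which is why they tend to rank candidate splits similarly; only entropy needs a logarithm.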

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Decision Tree with Grid Search Tuning

params = {
    'splitter': ('best', 'random'),
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': range(10,500,20),
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ('sqrt', 'log2', None)  # 'int' and 'float' are parameter types, not valid values
         }

tree_grid_clf = DecisionTreeClassifier(criterion='gini', random_state=42)
best_tree_clf= GridSearchCV(tree_grid_clf, param_grid=params, cv=5, scoring = 'recall')
best_tree_clf.fit(X_train,y_train)
print(best_tree_clf.best_params_)
print(best_tree_clf.best_score_)

tree_clf_optimized = DecisionTreeClassifier(criterion='gini', random_state=42, **best_tree_clf.best_params_)
tree_clf_optimized.fit(X_train, y_train)

y_pred_test_tree = tree_clf_optimized.predict(X_test)

# Probabilities for each class
y_tree_clf_probs = tree_clf_optimized.predict_proba(X_test)[:, 1]

print("MSE of Decision Tree test:", mean_squared_error(y_test, y_pred_test_tree))
print("Accuracy score:", tree_clf_optimized.score(X_test, y_test)) # accuracy_score

## 7.5 Stochastic Gradient Descent Classifier

In [None]:
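from sklearn.linear_model import SGDClassifier
# loss='modified_huber' (instead of the default hinge loss) provides predict_proba,
# which the soft-voting ensemble requires
SGD_clf=SGDClassifier(loss='modified_huber', random_state=42)
SGD_clf.fit(X_train, y_train)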

# produce prediction
y_pred_test_SGD = SGD_clf.predict(X_test)

print("MSE of SGD:", mean_squared_error(y_test, y_pred_test_SGD))
print("Accuracy score:", SGD_clf.score(X_test, y_test)) # accuracy_score

## 7.6 K-Nearest Neighbours

In [None]:
# Run KNN for multiple values of n_neighbors to find the value that works best on both training and test data
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.legend_handler import HandlerLine2D

neighbors = list(range(1,30))
train_results = []
test_results = []

for n in neighbors:
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(X_train, y_train)
    
    # Score with the predicted probability of the positive class, so that the
    # ROC curve is built from more than a single threshold
    train_pred = model.predict_proba(X_train)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_train, train_pred)
    roc_auc = metrics.auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    y_pred = model.predict_proba(X_test)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_test, y_pred)
    roc_auc = metrics.auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)
    
line1, = plt.plot(neighbors, train_results, 'b', label="Train AUC")
line2, = plt.plot(neighbors, test_results, 'r', label="Test AUC")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('n_neighbors')
plt.show()

In [None]:
# Based on visual interpretation of the AUC graph we chose n = 5
KKN_clf = KNeighborsClassifier(n_neighbors=5)
KKN_clf.fit(X_train, y_train)
y_KKN_clf = KKN_clf.predict(X_test)

## 7.6 Ensemble Learning

### 7.6.1 Random Forest

In [None]:
# RANDOM FOREST WITH PARAMETERS TUNED
from sklearn.ensemble import RandomForestClassifier

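params = {
    'n_estimators': [10, 50, 100, 200, 500],
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4],  # an integer min_samples_split must be at least 2
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ('sqrt', 'log2', None)  # 'int', 'float' and 'string' are parameter types, not valid values
         }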

rnd_grid_clf = RandomForestClassifier(random_state=42)
best_rf_clf = GridSearchCV(rnd_grid_clf, param_grid=params, cv=5, scoring = 'recall')
best_rf_clf.fit(X_train,y_train)
print(best_rf_clf.best_params_)
print(best_rf_clf.best_score_)

rnd_clf_otimized = RandomForestClassifier(random_state=42, **best_rf_clf.best_params_)
rnd_clf_otimized.fit(X_train, y_train)
y_pred_test_rf = rnd_clf_otimized.predict(X_test)

print("MSE of Optimized Random Forest test:", mean_squared_error(y_test, y_pred_test_rf))

# Probabilities for each class
y_random_forest_clf_prob = rnd_clf_otimized.predict_proba(X_test)[:, 1]

### 7.6.2 AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

adaboost_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
adaboost_clf.fit(X_train, y_train)
y_adaboost_clf_pred = adaboost_clf.predict(X_test)

print("MSE of Adaboost test:", mean_squared_error(y_test, y_adaboost_clf_pred))

### 7.6.3 Soft Voting Ensemble Model

In [None]:
from sklearn.ensemble import VotingClassifier

#TBD
estimators=[('Logistic Regression', lr_clf_optimized), 
            ('Naive Bayes', nb_clf), 
            ('SVM', svm_clf_optimized), 
            ('KNN', KKN_clf), 
            ('Decision Tree', tree_clf_optimized), 
            ('SGD', SGD_clf)]

voting_clf = VotingClassifier(
    estimators=estimators,
    voting='soft')
voting_clf.fit(X_train, y_train)

y_pred_test_voting = voting_clf.predict(X_test)

print("MSE of Voting test:", mean_squared_error(y_test, y_pred_test_voting))

# 8. Extra Models (Deep Learning)

## 8.1 Neural Network

## 8.2 Knowledge Transfer

# 9. Model Evaluation and Selection

## 9.1 Confusion Matrix, Accuracy, Recall, etc.

### Confusion Matrix

In [None]:
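# MAKE SUBPLOTS: one confusion matrix per fitted model
from sklearn.metrics import confusion_matrix

fig, axes = plt.subplots(3, 3, figsize=(16, 16))
axes = axes.ravel()

for ax, (name, clf) in zip(axes, estimators):
    y_predicted = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_predicted) 
    # Draw the matrix as an annotated heatmap on this model's subplot
    sns.heatmap(cm, annot=True, fmt='d', cbar=False, ax=ax)
    ax.set_title(name)
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')

# Hide the subplots that are not used
for ax in axes[len(estimators):]:
    ax.axis('off')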

### Accuracy Plot

In [None]:
# PLOT CROSS VALIDATION SCORES ACROSS ALL MODELS
from sklearn.model_selection import cross_val_score

model_names = []
scores = []
for name, clf in estimators:
    # 5-fold cross-validation on recall, consistent with the grid searches above
    score = cross_val_score(clf, X_train, y_train, cv=5, verbose=0, scoring = 'recall')
    model_names.append(name)
    scores.append(score.mean())
    print(name + ': ' + str(score.mean()))

performances = np.array(scores, dtype=np.float32).round(3)  
plt.barh(model_names, performances, align='edge', alpha=0.6, color='deepskyblue')
plt.xlabel('cross-validation score')
plt.title('comparison of different classifiers')
plt.show()

In [None]:
# PLOT CROSS VALIDATION RECALL ACROSS ALL MODELS
## graph
# code for viz
'''
recall

'''

## 9.2 ROC Curve

## 9.3 Feature-Importance

In [None]:
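#FEATURE IMPORTANCE FOR LOGISTIC REGRESSION
from sklearn.feature_selection import RFE

rfe = RFE(lr_clf_optimized, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)
mask = rfe.support_
print('Top 10 Most Important Features:')
# X_train_1 is assumed to be the DataFrame version of X_train that still carries the column names
print(X_train_1.columns[mask])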

In [None]:
## graph
# code for viz
'''
for loop over all models:
    plot ROC curve for each model

'''
# what we can see after this function.....

In [None]:
# Some requirements
'''
1. all graphs consistent
2. have titles
3. similar color palette (green and black)
4. ..
'''



# 10. Conclusions and Recommendations