# Recommender Systems




### The Data
The data set comes from Kaggle: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db

Spotify's audio features are explained here: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

## Import Libraries & Read in Data
<hr/>

In [1]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer
## analysis
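# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error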

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 100

In [3]:
### read in data
data = pd.read_csv('./data/hiphop.csv')
data.head()

| | genre | track_id | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | is_popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hip-Hop | 2JvzF1RMd7lE3KmFlsyZD8 | 0.149 | 0.837 | 0.364 | 0.0 | 0.271 | -11.713 | 0.276 | 123.984 | 0.463 | 1 |
| 1 | Hip-Hop | 2IRZnDFmlqMuOrYOLnZZyc | 0.259 | 0.889 | 0.496 | 0.0 | 0.252 | -6.365 | 0.0905 | 86.003 | 0.544 | 1 |
| 2 | Hip-Hop | 2t8yVaLvJ0RenpXUIAC52d | 0.0395 | 0.837 | 0.636 | 0.00125 | 0.342 | -7.643 | 0.086 | 145.972 | 0.274 | 1 |
| 3 | Hip-Hop | 79OEIr4J4FHV0O3KrhaXRb | 0.00195 | 0.942 | 0.383 | 0.0 | 0.0922 | -8.099 | 0.565 | 100.021 | 0.38 | 1 |
| 4 | Hip-Hop | 1xzBco0xcoJEDXktl7Jxrr | 0.194 | 0.729 | 0.625 | 0.00986 | 0.248 | -5.266 | 0.0315 | 146.034 | 0.261 | 1 |


In [8]:
data_red = data[['track_id', 'danceability', 'energy', 'valence']].copy()

In [9]:
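# average only the numeric feature columns; track_id is a string, which pandas 2.x no longer drops silently
data_red['score'] = data_red[['danceability', 'energy', 'valence']].mean(axis=1)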

In [10]:
data_red.head()

| | track_id | danceability | energy | valence | score |
|---|---|---|---|---|---|
| 0 | 2JvzF1RMd7lE3KmFlsyZD8 | 0.837 | 0.364 | 0.463 | 0.554667 |
| 1 | 2IRZnDFmlqMuOrYOLnZZyc | 0.889 | 0.496 | 0.544 | 0.643 |
| 2 | 2t8yVaLvJ0RenpXUIAC52d | 0.837 | 0.636 | 0.274 | 0.582333 |
| 3 | 79OEIr4J4FHV0O3KrhaXRb | 0.942 | 0.383 | 0.38 | 0.568333 |
| 4 | 1xzBco0xcoJEDXktl7Jxrr | 0.729 | 0.625 | 0.261 | 0.538333 |


In [11]:
data_red.describe()

| | danceability | energy | valence | score |
|---|---|---|---|---|
| count | 9295.0 | 9295.0 | 9295.0 | 9295.0 |
| mean | 0.718808 | 0.643275 | 0.473381 | 0.611821 |
| std | 0.130642 | 0.150037 | 0.222325 | 0.11252 |
| min | 0.201 | 0.000243 | 0.0336 | 0.197833 |
| 25% | 0.639 | 0.539 | 0.3 | 0.5305 |
| 50% | 0.735 | 0.646 | 0.469 | 0.610667 |
| 75% | 0.816 | 0.752 | 0.6425 | 0.694333 |
| max | 0.986 | 0.995 | 0.979 | 0.914333 |


In [15]:
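# one column per track_id; each row is a single track, so every column ends up with exactly one non-null score
piv_df = pd.pivot_table(data_red, index=data_red.index, columns='track_id', values='score')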

In [17]:
piv_df.isna().sum()

track_id
002QT7AS6h1LAF5dla8D92    9294
007PPvZtGDYHSEhYPxqIfC    9294
009MFLQ8i6P2VeJp4ivex6    9294
00BNT97AtJ5aB8SSsE5xGH    9294
00BnfL75e8vHSGCmwUWbEk    9294
                          ... 
7znMNt4SWNyheMtwgzvUzK    9294
7znZvX0Mt6NBmaI8VCPurT    9294
7zrH5Yxm0GYeQAKyy4ctp5    9294
7ztcWZ0EB6akSzlb8BUaqG    9294
7zxRMhXxJMQCeDDg0rKAVo    9294
Length: 9295, dtype: int64
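
The counts above show why this pivot cannot drive recommendations yet: with the row index as the pivot index, each track's column holds its own score and 9,294 NaNs, so no two tracks share a row to compare. One possible next step, offered as an assumption and not as this notebook's method, is a content-based approach. It computes pairwise cosine similarity between tracks on their audio features and returns the nearest neighbours. The sketch below uses only the five rows shown by `data.head()`; the names `sample`, `sim_df`, and `recommend` are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# the five tracks shown by data.head() above (illustrative subset only)
sample = pd.DataFrame({
    'track_id': ['2JvzF1RMd7lE3KmFlsyZD8', '2IRZnDFmlqMuOrYOLnZZyc',
                 '2t8yVaLvJ0RenpXUIAC52d', '79OEIr4J4FHV0O3KrhaXRb',
                 '1xzBco0xcoJEDXktl7Jxrr'],
    'danceability': [0.837, 0.889, 0.837, 0.942, 0.729],
    'energy': [0.364, 0.496, 0.636, 0.383, 0.625],
    'valence': [0.463, 0.544, 0.274, 0.380, 0.261],
})

# one row per track, one column per audio feature
features = sample.set_index('track_id')[['danceability', 'energy', 'valence']]

# item-item similarity matrix: rows and columns are both track_id
sim_df = pd.DataFrame(cosine_similarity(features),
                      index=features.index, columns=features.index)

def recommend(track_id, n=2):
    """Hypothetical helper: the n tracks most similar to track_id."""
    return sim_df[track_id].drop(track_id).sort_values(ascending=False).head(n)

recs = recommend('2JvzF1RMd7lE3KmFlsyZD8')
print(recs)
```

Because all three features lie between 0 and 1, raw cosine similarities tend to cluster near 1, so standardizing the features first (for example with the `StandardScaler` imported earlier) may separate tracks better. On the full 9,295 tracks the dense similarity matrix would have about 86 million entries, so a nearest-neighbour search may be a lighter alternative.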