# Please **DO NOT** run any cells, as everything (all final images, graphs, etc.) is preloaded

# Section 0 Preface for Imports, Data Handling, & Methodologies 

## Section 0.1 Preface for Write-Up Interpretation & Acknowledgements

For the remainder of this notebook, parts 1) and 2) below are answered inline, next to the block of code they describe:

1. A brief statement (~paragraph) of what was done to answer the question (narratively explaining what you did in code to answer the question, at a high level).

2. A brief statement (~paragraph) as to why this was done (why the question was answered in this way, not by doing something else. Some kind of rationale as to why you did x and not y or z to answer the question – why is what you did a suitable approach?).

For parts 3) and 4) below, the findings and interpretations are provided under Results & Discussions for each question (and each model) we ran.

3. A brief statement (~paragraph) as to what was found. This should be as objective and specific as possible – just the results/facts. Do make sure to include numbers and a figure (=a graph or plot) in your statement, to substantiate and illustrate it, respectively. As the unsupervised methods often yield visualizable results, be sure to include a figure. 

4. A brief statement (~paragraph) as to what you think the findings mean. This is your interpretation of your findings and should answer the original question.

Code was reused from my own GitHub repository, found at `www.github.com/sunnydigital/IDS_F21`, including code derived from Stephen Spivak's material for Introduction to Data Science, Fall 2021. Most of the code falling under these two categories concerns the `PCA` and `k-means` analysis plots.

Code used in this analysis that is attributable to Introduction to Machine Learning is not cited, as we feel it is fair to use code from the course.

The author refers to all analysis performed in the first-person plural 'we,' as the author believes it to be weird to say *'I'* did anything, given all *I* did was learn from the amazing instructors & TAs :) Thank you.

## Section 0.2 Imports & Installation of Packages and Libraries, Seaborn Settings

Below, we set the random seed to the numeric portion of my NYU ID (`N12664675`), import packages & libraries, and configure the settings for `seaborn` plots.

In [4]:
!pip install xgboost
!pip install impyute
!pip install missingno

Collecting missingno
  Downloading missingno-0.5.1-py3-none-any.whl (8.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.5.1


In [10]:
import random
random.seed(12664675)
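Note that `random.seed` only seeds Python's built-in `random` module; NumPy and PyTorch keep their own generators, and scikit-learn estimators draw from NumPy's global state unless given a `random_state`. The following is a minimal sketch of seeding all of them at once. The helper name `seed_everything` is hypothetical and not part of this notebook, and the optional imports are guarded so the sketch runs even where NumPy or PyTorch is not installed.

```python
import random

SEED = 12664675  # numeric portion of the ID used in this notebook


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed every generator that is available."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch generators
    except ImportError:
        pass


# Re-seeding reproduces the same draw
seed_everything(SEED)
first = random.random()
seed_everything(SEED)
second = random.random()
print(first == second)
```

For scikit-learn and XGBoost models, passing the same value explicitly (for example `random_state=SEED`) is usually the more robust choice, since it does not depend on global state.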

In [11]:
import sys

import pandas as pd
import numpy as np

import missingno as msno
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

from scipy import stats as st
from statsmodels import api as sm

from impyute.imputation.cs import fast_knn

from tqdm import tqdm

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

import xgboost as xgb

In [13]:
import os
import time
import re

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.modules.activation import Softmax
import torch.optim as optim

from statsmodels.distributions.empirical_distribution import ECDF
import statsmodels.api as sm ## Need revision for Windows use

import scipy.stats as st
from scipy.stats import zscore
from scipy.spatial.distance import squareform

from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression, Perceptron
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_validate, KFold, RepeatedKFold, GridSearchCV
from sklearn.metrics import mean_squared_error as mse
from sklearn import tree, ensemble, metrics, calibration
from sklearn.svm import SVC, LinearSVC
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_samples

from IPython import display

from eli5.sklearn import PermutationImportance
from eli5 import show_prediction
import eli5

from skopt.space import Real, Categorical, Integer
from skopt.plots import plot_objective
from skopt import BayesSearchCV

from tune_sklearn import TuneSearchCV, TuneGridSearchCV
import ray.tune as tune

from tqdm import tqdm

from matplotlib import pyplot as plt
from matplotlib import mlab
from mpl_toolkits.mplot3d import Axes3D
import scikitplot as skplt
import seaborn as sns; sns.set_theme(color_codes=True); sns.set_style("whitegrid")
import graphviz
import colorcet as cc

In [14]:
data = 'musicData.csv'
df = pd.read_csv(data)

df_map = df[['instance_id', 'artist_name', 'track_name', 'obtained_date']]
df = df.loc[:, ~df.columns.isin(['instance_id', 'artist_name','track_name','obtained_date'])]

# print(np.count_nonzero(df['artist_name'].unique()))
# print(np.count_nonzero(df['track_name'].unique()))
df.head(20)

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,music_genre
0,27.0,0.00468,0.652,-1.0,0.941,0.792,A#,0.115,-5.201,Minor,0.0748,100.889,0.759,Electronic
1,31.0,0.0127,0.622,218293.0,0.89,0.95,D,0.124,-7.043,Minor,0.03,115.00200000000001,0.531,Electronic
2,28.0,0.00306,0.62,215613.0,0.755,0.0118,G#,0.534,-4.617,Major,0.0345,127.994,0.333,Electronic
3,34.0,0.0254,0.774,166875.0,0.7,0.00253,C#,0.157,-4.498,Major,0.239,128.014,0.27,Electronic
4,32.0,0.00465,0.638,222369.0,0.587,0.909,F#,0.157,-6.266,Major,0.0413,145.036,0.323,Electronic
5,47.0,0.00523,0.755,519468.0,0.731,0.854,D,0.216,-10.517,Minor,0.0412,?,0.614,Electronic
6,46.0,0.0289,0.572,214408.0,0.803,8e-06,B,0.106,-4.294,Major,0.351,149.995,0.23,Electronic
7,43.0,0.0297,0.809,416132.0,0.706,0.903,G,0.0635,-9.339,Minor,0.0484,120.008,0.761,Electronic
8,39.0,0.00299,0.509,292800.0,0.921,0.000276,F,0.178,-3.175,Minor,0.268,149.94799999999998,0.273,Electronic
9,22.0,0.00934,0.578,204800.0,0.731,0.0112,A,0.111,-7.091,Minor,0.173,139.933,0.203,Electronic


In [15]:
df.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,valence
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,44.22042,0.306383,0.558241,221252.6,0.599755,0.181601,0.193896,-9.133761,0.093586,0.456264
std,15.542008,0.34134,0.178632,128672.0,0.264559,0.325409,0.161637,6.16299,0.101373,0.247119
min,0.0,0.0,0.0596,-1.0,0.000792,0.0,0.00967,-47.046,0.0223,0.0
25%,34.0,0.02,0.442,174800.0,0.433,0.0,0.0969,-10.86,0.0361,0.257
50%,45.0,0.144,0.568,219281.0,0.643,0.000158,0.126,-7.2765,0.0489,0.448
75%,56.0,0.552,0.687,268612.2,0.815,0.155,0.244,-5.173,0.098525,0.648
max,99.0,0.996,0.986,4830606.0,0.999,0.996,1.0,3.744,0.942,0.992


## Section 0.4 Data Handling

Below, we drop all rows containing `NA` values, then remove the rows in which `duration_ms` equals `-1` or `tempo` equals `'?'`, values that appear to act as placeholders for missing data (see rows 0 and 5 of the preview above).

In [143]:
df = df.dropna()
# Drop rows carrying the missing-value placeholders: -1 in duration_ms, '?' in tempo
df = df.drop(index=df[(df['duration_ms'] == -1) | (df['tempo'] == '?')].index).reset_index(drop=True)
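The cleaning step can be illustrated on a small toy frame. This is only a sketch: the toy values are made up for illustration, and the final conversion of `tempo` to a numeric dtype with `pd.to_numeric` is an assumption about a sensible follow-up step (a column read with `'?'` entries stays string-typed after filtering), not something the notebook does at this point.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only) with the same two placeholder conventions
toy = pd.DataFrame({
    "duration_ms": [-1.0, 218293.0, 215613.0, np.nan],
    "tempo": ["100.889", "115.002", "?", "128.014"],
})

# 1) Drop rows with real NaN values
cleaned = toy.dropna()

# 2) Keep only rows without the -1 / '?' placeholders
mask = (cleaned["duration_ms"] != -1) & (cleaned["tempo"] != "?")
cleaned = cleaned[mask].reset_index(drop=True)

# 3) Assumed follow-up: turn the string-typed tempo column into floats
cleaned["tempo"] = pd.to_numeric(cleaned["tempo"])

print(cleaned)
print(cleaned.dtypes)
```

An alternative, under the same assumptions, is to declare the placeholder at load time with `pd.read_csv(data, na_values=['?'])`, so that `dropna` handles `tempo` as well.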

In this section we use the `LabelEncoder` class from `sklearn.preprocessing` and the `get_dummies` function from `pandas` to obtain integer codes for features with string values and dummy (one-hot) features for categorical data.

In [144]:
string_data = ['key', 'obtained_date']
categorical_data = ['artist_name', 'track_name', 'mode', 'music_genre']

# Raw strings keep '\w' from being read as an (invalid) string escape
smap = lambda x: re.search(r'\w+(?!_)', x)[0]
cmap = lambda x: re.search(r'\w+(?=_name)|mode|((?<=music_)\w+)', x)[0]
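To make the naming scheme concrete, the short sketch below applies the same two patterns to the column names listed above and prints the dictionary keys they would produce. It is only an illustration; the variable names `string_keys` and `categorical_keys` are hypothetical.

```python
import re

# Same patterns as in the cell above, written as raw strings
smap = lambda x: re.search(r'\w+(?!_)', x)[0]
cmap = lambda x: re.search(r'\w+(?=_name)|mode|((?<=music_)\w+)', x)[0]

string_keys = [f'encode_{smap(col)}' for col in ['key', 'obtained_date']]
categorical_keys = [
    f'encode_{cmap(col)}'
    for col in ['artist_name', 'track_name', 'mode', 'music_genre']
]

print(string_keys)       # ['encode_key', 'encode_obtained_date']
print(categorical_keys)  # ['encode_artist', 'encode_track', 'encode_mode', 'encode_genre']
```

Because `\w` also matches the underscore, `smap` returns `obtained_date` unchanged rather than `obtained`.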

In [146]:
# Fit one LabelEncoder per string-valued column, keyed by a short name
encode_string = {}
for col in string_data:
    # Pass a 1-D Series (df[col]); the 2-D df[[col]] triggers a DataConversionWarning
    encode_string[f'encode_{smap(col)}'] = LabelEncoder().fit(df[col])

# Fit one LabelEncoder per categorical column
encode_categorical = {}
for col in categorical_data:
    encode_categorical[f'encode_{cmap(col)}'] = LabelEncoder().fit(df[col])
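As a rough illustration of the two encoding approaches named in this section, the sketch below applies `LabelEncoder` and `pd.get_dummies` to a toy frame. The toy values and the new column name `key_encoded` are assumptions made for the example, not part of the notebook's data handling.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (illustrative only) mirroring the 'key' and 'mode' columns
toy = pd.DataFrame({
    "key": ["A#", "D", "G#", "D"],
    "mode": ["Minor", "Minor", "Major", "Major"],
})

# LabelEncoder takes no constructor arguments; the data goes to fit/transform
key_encoder = LabelEncoder().fit(toy["key"])
toy["key_encoded"] = key_encoder.transform(toy["key"])

# Classes are stored in sorted order, so codes can be mapped back to labels
recovered = key_encoder.inverse_transform(toy["key_encoded"])

# One dummy column per category of 'mode'
dummies = pd.get_dummies(toy["mode"], prefix="mode")

print(list(key_encoder.classes_))
print(toy["key_encoded"].tolist())
print(list(dummies.columns))
```

Integer codes from `LabelEncoder` impose an arbitrary ordering on the categories, which is why dummy columns are often preferred for unordered features fed to linear models.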


# Appendix

## Appendix 1.0 Deprecated Code

In [125]:
encode_artist = LabelEncoder(df[['artist_name']]).fit_transform()
encode_track = LabelEncoder(df[['track_name']]).fit_transform()
encode_key = LabelEncoder(df[['artist_name']]).fit_transform()
encode_artist = LabelEncoder(df[['artist_name']]).fit_transform()
encode_artist = LabelEncoder(df[['artist_name']]).fit_transform()
encode_artist = LabelEncoder(df[['artist_name']]).fit_transform()
encode_artist = LabelEncoder(df[['artist_name']]).fit_transform()

TypeError: LabelEncoder() takes no arguments