## Music Genre Recommender
PROBLEM: We want to recommend music genres to users, based on their profiles, right after they sign up, in order to improve revenue and retention. Based on the given data, we're assuming that men aged 20-25 like Hip-hop, men aged 26-30 like Jazz, and men over 30 like Classical music. We're also assuming that women aged 20-25 like Dance music, women aged 26-30 like Acoustic, and women over 30 like Classical.

Tutorial link: https://www.youtube.com/watch?v=7eh4d6sabA0

In [21]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
import joblib

music_data = pd.read_csv('music.csv')
music_data.describe()

| | age | gender |
|---|---|---|
| count | 18.0 | 18.0 |
| mean | 27.944444 | 0.5 |
| std | 5.12746 | 0.514496 |
| min | 20.0 | 0.0 |
| 25% | 25.0 | 0.0 |
| 50% | 28.0 | 0.5 |
| 75% | 31.0 | 1.0 |
| max | 37.0 | 1.0 |


### Preparing the data
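Since we don't have any missing or duplicate records, we don't need to clean the data. We're splitting the data into an input set and an output set: the input set is everything except the 'genre' column, and the output set is the 'genre' column itself. The output set contains the answers (the genres) that the model will learn to predict.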

In [4]:
# By convention, we use 'X' to represent our input data set
X = music_data.drop(columns=['genre'])

# By convention, we use 'y' to represent our output data set
y = music_data['genre']
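
The claim that the data has no missing or duplicate records can be checked rather than taken on trust. The following is a minimal sketch, assuming a small hypothetical table (`toy`) whose rows follow the rules in the problem statement; it stands in for music.csv, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for music.csv: rows follow the rules in the
# problem statement (gender 1 = male, 0 = female is an assumption).
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# Count rows that are exact copies of an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

If either count is non-zero on the real data, `dropna()` and `drop_duplicates()` are the usual pandas tools for cleaning it before splitting.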

### Building Our Model Using a ML Algorithm
We're going to use a decision tree classifier from the scikit-learn library.

In [6]:
model = DecisionTreeClassifier()

# The .fit() method takes the input data and the output data, then trains the model
model.fit(X.values, y)

# We're asking our model to make 2 predictions: a 21-year-old man and a 22-year-old woman.
# Expecting 'HipHop' and 'Dance'
predictions = model.predict([[21, 1], [22, 0]])
predictions

array(['HipHop', 'Dance'], dtype=object)

### Measuring Model Accuracy
We're going to split our data into two sets: one for training and one for testing. A rule of thumb is to allocate 70-80% for training and 20-30% for testing. We're using accuracy_score to compare the predictions for the test set against the actual genres in y_test. Because train_test_split shuffles the rows randomly, the two sets change every time we run the cell, so the score does too. Changing the test size also affects the accuracy, and with only 18 records even one wrong prediction moves the score noticeably.

In [16]:
# The function below returns four splits in a fixed order, so we're unpacking them into four variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train)
predictions_new = model.predict(X_test) 
score = accuracy_score(y_test, predictions_new)
score

1.0
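
A single score of 1.0 says little on its own, since the split is random. One way to see the variation is to repeat the split with several seeds and compare the scores. This is a sketch under stated assumptions: `toy` is a hypothetical stand-in for music.csv, and the seeds 0 to 9 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv (assumed encoding: 1 = male, 0 = female)
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})
X = toy[["age", "gender"]]
y = toy["genre"]

scores = []
for seed in range(10):
    # random_state makes each shuffle reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print(scores)
print("mean accuracy:", sum(scores) / len(scores))
```

With 18 rows and test_size=0.2, four rows land in the test set, so each score can only move in steps of 0.25. Averaging over several splits gives a steadier picture than any single run.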

### Model Persistence
Training a model can be very time-consuming, so we don't want to do it every time we need a prediction. Instead, we can train the model once and save it to a file. After that, when we want to make predictions, we load the file and use the stored model. We're doing this with joblib.

In [17]:
joblib.dump(model, 'music-recommender.joblib')

['music-recommender.joblib']

In [18]:
model = joblib.load('music-recommender.joblib')

In [19]:
predictions = model.predict([[21,1]])
predictions



array(['HipHop'], dtype=object)
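
To confirm that a saved model behaves like the original, the round trip can be tested end to end. The sketch below assumes a hypothetical `toy` table in place of music.csv and writes to a temporary directory so that no file is left behind; the file name simply mirrors the one used above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv (assumed encoding: 1 = male, 0 = female)
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})
X = toy[["age", "gender"]].values
y = toy["genre"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "music-recommender.joblib")
    joblib.dump(model, path)       # save the trained model
    restored = joblib.load(path)   # load it back as a new object

# The restored model should give the same answers as the original
same = bool((model.predict(X) == restored.predict(X)).all())
print("predictions match after reload:", same)
```

A saved file is generally best loaded with the same scikit-learn version that created it, since the on-disk format is not guaranteed to be stable across versions.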

### Visualizing Decision Trees
We're using the 'tree' module from scikit-learn to export the trained model as a Graphviz .dot file. Once rendered, it shows a binary tree of the rules our model uses to make decisions.

In [22]:
# filled = colours every node according to its majority class
# rounded = draws every node with rounded corners
# label='all' = labels the values shown on every node
# class_names = shows the class name on every node
# feature_names = shows the feature names in the split rules
tree.export_graphviz(
    model, out_file='music-recommender.dot',
    feature_names=['age', 'gender'], class_names=sorted(y.unique()),
    label='all', rounded=True, filled=True,
)
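
When Graphviz is not available, the same rules can be read as plain text. This sketch uses scikit-learn's `export_text` as an alternative to the .dot export above, not a replacement for it; `toy` is again a hypothetical stand-in for music.csv.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for music.csv (assumed encoding: 1 = male, 0 = female)
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})

model = DecisionTreeClassifier(random_state=0)
model.fit(toy[["age", "gender"]].values, toy["genre"])

# Render the learned splits as an indented text outline
rules = export_text(model, feature_names=["age", "gender"])
print(rules)
```

Each indented line is one test on a feature, and each line starting with "class:" is a leaf, so the outline can be read top to bottom like the rendered diagram.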