In [6]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [7]:
import grader

# Classifying Music by Genre

Music offers an extremely rich and interesting playing field. The objective of this miniproject is to develop models that are able to recognize the genre of a musical piece, first from pre-computed features and then working from the raw waveform. This is a typical example of a classification problem on time series data.

Each piece has been classified to belong to one of the following genres:
- electronic
- folkcountry
- jazz
- raphiphop
- rock

Your model will be assessed on the accuracy score of your classifier, measured against a reference solution. The reference solution has a score of 1. *(Note that this doesn't mean that the accuracy of the reference solution is 1.)* Keeping this in mind...

## A note on scoring
It **is** possible to score >1 on these questions. This indicates that you've beaten our reference model - we compare our model's score on a test set to your score on a test set. See how high you can go!


# Questions


## Question 1: All Features Model
Download a set of pre-computed features from Amazon S3:

In [1]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'df_train_anon.csv'

download: s3://dataincubator-course/mldata/df_train_anon.csv to ./df_train_anon.csv


This file contains 549 pre-computed features for the training set. The last column contains the genre.

Build a model to generate predictions from this feature set. Steps in the pipeline could include:

- a normalization step (not all features have the same size or distribution)
- a dimensionality reduction or feature selection step
- ... any other transformer you may find relevant ...
- an estimator
- a label encoder inverse transform to return the genre as a string

Use GridSearchCV to find the scikit learn estimator with the best cross-validated performance.
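The steps listed above can be chained in a scikit-learn `Pipeline` and tuned as a whole. The following is a minimal sketch, not the reference solution: it runs on synthetic data from `make_classification` rather than on `df_train_anon.csv`, and the step names (`scale`, `reduce`, `clf`) and the grid values are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature table (assumption: 5 classes, 30 features).
X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)

# Normalization -> dimensionality reduction -> estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "reduce__n_components": [5, 15],
    "clf__min_samples_leaf": [1, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the scaler and PCA sit inside the pipeline, they are refit on each cross-validation training fold, so the held-out fold does not leak into the preprocessing.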

*Hints:*
- Scikit Learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) can center the data and/or scale by the standard deviation.
- Use a dimensionality reduction technique (e.g. [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)) or a feature selection criterion when possible.
- Use [GridSearchCV](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV) to improve the score.
- Use a [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to generate an encoding for the labels.
- The model needs to return the genre as a string. You may need to create a wrapper class around scikit-learn estimators in order to do that.

Submit a function that takes a list of records, each a list of the 549 features, and returns a list of genre predictions, one for each record.
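One way to return genres as strings is a thin wrapper that owns a `LabelEncoder`. This is a sketch under assumptions: the class name `StringLabelClassifier`, the function name `all_features_sketch` and the two-genre toy data are hypothetical and only illustrate the shape of the solution.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


class StringLabelClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: encode string labels in fit, decode them in predict."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.encoder_ = LabelEncoder().fit(y)
        self.estimator.fit(X, self.encoder_.transform(y))
        return self

    def predict(self, X):
        codes = self.estimator.predict(X)
        return self.encoder_.inverse_transform(codes)


# Toy data (assumption): two well-separated clusters standing in for two genres.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(10, 1, (20, 4))])
y_toy = ["jazz"] * 20 + ["rock"] * 20

model = StringLabelClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
model.fit(X_toy, y_toy)


def all_features_sketch(records):
    """Take a list of feature lists and return one genre string per record."""
    return [str(genre) for genre in model.predict(records)]


print(all_features_sketch([[0, 0, 0, 0], [10, 10, 10, 10]]))
```

Keeping the encoder inside the wrapper means the integer-to-genre mapping is always the one that was used during training.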

In [14]:
import pandas as pd
df = pd.read_csv("df_train_anon.csv",header=None)

In [165]:
df.head(5)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.008931 | 8.9e-05 | 6.483444 | 57.936259 | 0.006836 | 0.001953 | 0.120605 | 0.008816 | 8.4e-05 | 3.814679 | ... | 0.292633 | 0.235038 | 0.210596 | 0.189853 | 0.172447 | 0.172212 | 0.174138 | 0.173343 | 99.384014 | jazz |
| 1 | 0.009044 | 0.000141 | 7.253759 | 66.717846 | 0.005859 | 0.002441 | 0.147949 | 0.008619 | 6.8e-05 | 3.203744 | ... | 0.561459 | 0.551483 | 0.544794 | 0.552135 | 0.576745 | 0.581951 | 0.578048 | 0.578411 | 123.046875 | jazz |
| 2 | 0.009094 | 8.2e-05 | 7.845424 | 71.890004 | 0.006836 | 0.00293 | 0.106445 | 0.008814 | 5.6e-05 | 3.803823 | ... | 0.325219 | 0.347016 | 0.365914 | 0.369968 | 0.348162 | 0.334928 | 0.352339 | 0.350118 | 109.956782 | electronic |
| 3 | 0.009234 | 8.2e-05 | 5.243707 | 43.529571 | 0.007324 | 0.001953 | 0.105469 | 0.009056 | 5.3e-05 | 2.560825 | ... | 0.109804 | 0.102941 | 0.137016 | 0.185714 | 0.202237 | 0.14224 | 0.102227 | 0.143689 | 178.205819 | jazz |
| 4 | 0.009895 | 9.2e-05 | 9.996839 | 110.425882 | 0.008789 | 0.001953 | 0.136719 | 0.009625 | 6.9e-05 | 3.137464 | ... | 0.087319 | 0.062016 | 0.051776 | 0.046376 | 0.039522 | 0.046094 | 0.063596 | 0.207312 | 120.18532 | electronic |


In [26]:
len(df.loc[0:549])

550

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.loc[:,0:548],df[549])

In [33]:
labels = pd.factorize(y_train)[0]
labels[0:5]

array([0, 1, 2, 3, 3])

In [41]:
y_train[:10]

1141           rock
20       electronic
723       raphiphop
174     folkcountry
409     folkcountry
961            rock
998            jazz
398            jazz
119       raphiphop
1031           rock
Name: 549, dtype: object

In [43]:
dict_genre={0:"rock",1:"electronic",2:"raphiphop",3:"folkcountry",4:"jazz"}
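A hard-coded mapping like this is only valid for one particular ordering of `y_train`, because `pd.factorize` assigns codes in order of first appearance and `train_test_split` shuffles the rows. As a sketch (the five-element series below is made up to mirror the first labels shown above), the mapping can instead be derived from the second value that `pd.factorize` returns:

```python
import pandas as pd

# Made-up labels in the same order as the first five training labels shown above.
genres = pd.Series(["rock", "electronic", "raphiphop", "folkcountry", "folkcountry"])

# factorize returns the integer codes and the unique values those codes index into.
codes, uniques = pd.factorize(genres)

# Build the code -> genre mapping from the data instead of typing it by hand.
code_to_genre = dict(enumerate(uniques))
decoded = [code_to_genre[c] for c in codes]

print(list(codes))
print(code_to_genre)
```

With this approach the dictionary stays consistent with the codes even if the train/test split is re-run.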

In [66]:
from sklearn.ensemble import RandomForestClassifier

In [94]:
clf = RandomForestClassifier(min_samples_leaf=5)
clf.fit(X_train, labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [124]:
type(labels)

numpy.ndarray

In [95]:
def all_features(records):
    labels = clf.predict(records)
    return [dict_genre[r] for r in labels]

In [289]:
t = all_features(X_test)

In [290]:
t

['jazz',
 'jazz',
 'rock',
 'rock',
 'jazz',
 'jazz',
 'folkcountry',
 'rock',
 'jazz',
 'rock',
 'jazz',
 'raphiphop',
 'raphiphop',
 'jazz',
 'raphiphop',
 'rock',
 'rock',
 'raphiphop',
 'raphiphop',
 'jazz',
 'rock',
 'jazz',
 'jazz',
 'rock',
 'rock',
 'raphiphop',
 'jazz',
 'raphiphop',
 'rock',
 'rock',
 'jazz',
 'rock',
 'rock',
 'jazz',
 'rock',
 'jazz',
 'jazz',
 'rock',
 'rock',
 'raphiphop',
 'rock',
 'jazz',
 'jazz',
 'jazz',
 'raphiphop',
 'folkcountry',
 'raphiphop',
 'rock',
 'rock',
 'raphiphop',
 'folkcountry',
 'folkcountry',
 'raphiphop',
 'raphiphop',
 'folkcountry',
 'raphiphop',
 'rock',
 'rock',
 'rock',
 'jazz',
 'jazz',
 'jazz',
 'raphiphop',
 'rock',
 'rock',
 'folkcountry',
 'electronic',
 'rock',
 'folkcountry',
 'rock',
 'raphiphop',
 'rock',
 'rock',
 'rock',
 'raphiphop',
 'rock',
 'raphiphop',
 'rock',
 'jazz',
 'jazz',
 'raphiphop',
 'rock',
 'rock',
 'jazz',
 'jazz',
 'rock',
 'jazz',
 'rock',
 'rock',
 'rock',
 'folkcountry',
 'jazz',
 'jazz',
 'jazz',
 ...]

In [61]:
dict_genre[3]

'folkcountry'

In [96]:
# def all_features_est(records):
#     return ['blues' for r in records]

# grader.score('music__all_features_model', all_features)

Your score:  0.908333333355


## Question 2: Raw Features Predictions

For questions 2 and 3, you will need to extract features from raw audio.  Because this extraction can be rather time-consuming, you will not conduct the feature extraction of the test set in real time during the grading.

Instead, you will download a set of test files.  After you have trained your model, you will run it on the test files, to make a prediction for each.  Then submit to the grader a dictionary of the form

```python
{
  "fe_test_0001.mp3": "electronic",
  "fe_test_0002.mp3": "rock",
  ...
}
```

Sets of files for training and testing are available on Amazon S3:

In [2]:
# Training files
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' \
    --include 'music_train.tar.gz' \
    --include 'music_train_labels.csv' \
    --include 'music_feature_extraction_test.tar.gz'

download: s3://dataincubator-course/mldata/music_feature_extraction_test.tar.gz to ./music_feature_extraction_test.tar.gz
download: s3://dataincubator-course/mldata/music_train_labels.csv to ./music_train_labels.csv
download: s3://dataincubator-course/mldata/music_train.tar.gz to ./music_train.tar.gz


In [101]:
import librosa
# import librosa #.feature.zero_crossing_rate
# import librosa #.feature.rmse

In [294]:
import tarfile

t = tarfile.open('music_train.tar', 'r')
# print t.getnames()

In [351]:
mono, fs = librosa.load('data/train/train_0001.mp3', sr = 44100)

In [352]:
zcr = librosa.feature.zero_crossing_rate(mono)

In [353]:
rmse = librosa.feature.rmse(mono)
len(rmse)

1

In [354]:
# mono, fs = librosa.load(f, sr = None)
zcr = librosa.feature.zero_crossing_rate(mono)
rmse = librosa.feature.rmse(mono)
tempo, beat = librosa.beat.beat_track(mono,hop_length=256)
mfcc = librosa.feature.mfcc(mono,sr=44100)
delta_mfcc = librosa.feature.delta(mfcc)

In [360]:
len(np.mean(delta_mfcc.mean(axis=1)))

TypeError: object of type 'numpy.float64' has no len()

In [363]:
np.std(rmse)

0.04212356

In [348]:
mfcc = librosa.feature.mfcc(mono,sr=44100)

In [271]:
l[19]

-4.8529844026803239

In [279]:
l = [np.mean(mfcc[i]) for i in range(0,20)]
l

[-176.16666883807429,
 152.82696025947527,
 -40.404001542526224,
 44.473942579295418,
 -6.9547758979535486,
 24.830904429872433,
 -19.696804668467674,
 17.167506412655509,
 -9.8134965801073921,
 3.4701912407945863,
 -0.80063359334406448,
 -5.954915128951022,
 -8.3172745830374399,
 -1.9527064181899951,
 -4.5052767654051689,
 -11.227108012843939,
 3.4384566398038321,
 -3.8638007366205063,
 2.2215932970622188,
 -4.8529844026803239]

In [281]:
d={'MFCC__%d' % i: value for (i, value) in enumerate(l)}

In [282]:
d

{'MFCC__0': -176.16666883807429,
 'MFCC__1': 152.82696025947527,
 'MFCC__10': -0.80063359334406448,
 'MFCC__11': -5.954915128951022,
 'MFCC__12': -8.3172745830374399,
 'MFCC__13': -1.9527064181899951,
 'MFCC__14': -4.5052767654051689,
 'MFCC__15': -11.227108012843939,
 'MFCC__16': 3.4384566398038321,
 'MFCC__17': -3.8638007366205063,
 'MFCC__18': 2.2215932970622188,
 'MFCC__19': -4.8529844026803239,
 'MFCC__2': -40.404001542526224,
 'MFCC__3': 44.473942579295418,
 'MFCC__4': -6.9547758979535486,
 'MFCC__5': 24.830904429872433,
 'MFCC__6': -19.696804668467674,
 'MFCC__7': 17.167506412655509,
 'MFCC__8': -9.8134965801073921,
 'MFCC__9': 3.4701912407945863}

In [115]:
df_labels = pd.read_csv("music_train_labels.csv")

In [139]:
g_labels = [dict_genre[r] for r in df_labels['genre']]

In [148]:
df_labels_fac = pd.factorize(df_labels['genre'])[0]

In [208]:
df_labels_fac[0:10]

array([0, 1, 1, 1, 0, 1, 0, 2, 3, 4])

In [209]:
dict2_genre={0:"folkcountry",1:"jazz",2:"rock",3:"raphiphop",4:"electronic"}

In [190]:
import numpy as np

def feature_eng(t):
    rmse_list_s=[]
    zcr_list_s=[]

    rmse_list_m=[]
    zcr_list_m=[]

    for f in t.getnames():
        mono, fs = librosa.load(f, sr = None)
        zcr = librosa.feature.zero_crossing_rate(mono)
        rmse = librosa.feature.rmse(mono)

        rmse_m = np.mean(rmse)
        zcr_m = np.mean(zcr)

        rmse_s = np.std(rmse)
        zcr_s = np.std(zcr)

        rmse_list_m.append(rmse_m)
        zcr_list_m.append(zcr_m)

        rmse_list_s.append(rmse_s)
        zcr_list_s.append(zcr_s)

    features_pd = pd.DataFrame(
        {'RMSE_M': rmse_list_m,
         'ZCR_M': zcr_list_m,
         'RMSE_S': rmse_list_s,
         'ZCR_S': zcr_list_s
         })
    return features_pd

In [168]:
rmse_list_s[0]

0.04212356

In [202]:
clf_q2 = RandomForestClassifier(min_samples_leaf=20)
clf_q2.fit(feature_eng(t), df_labels_fac)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=20,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [175]:
t1 = tarfile.open('music_feature_extraction_test.tar', 'r')
#print t1.getnames()

In [203]:
test_labels = clf_q2.predict(feature_eng(t1))

In [204]:
test_labels[144]

2

In [206]:
# raw_features_predictions()

{'fe_test_0001.mp3': 'raphiphop',
 'fe_test_0002.mp3': 'raphiphop',
 'fe_test_0003.mp3': 'electronic',
 'fe_test_0004.mp3': 'raphiphop',
 'fe_test_0005.mp3': 'raphiphop',
 'fe_test_0006.mp3': 'raphiphop',
 'fe_test_0007.mp3': 'raphiphop',
 'fe_test_0008.mp3': 'raphiphop',
 'fe_test_0009.mp3': 'raphiphop',
 'fe_test_0010.mp3': 'folkcountry',
 'fe_test_0011.mp3': 'electronic',
 'fe_test_0012.mp3': 'raphiphop',
 'fe_test_0013.mp3': 'raphiphop',
 'fe_test_0014.mp3': 'raphiphop',
 'fe_test_0015.mp3': 'raphiphop',
 'fe_test_0016.mp3': 'raphiphop',
 'fe_test_0017.mp3': 'folkcountry',
 'fe_test_0018.mp3': 'folkcountry',
 'fe_test_0019.mp3': 'electronic',
 'fe_test_0020.mp3': 'raphiphop',
 'fe_test_0021.mp3': 'electronic',
 'fe_test_0022.mp3': 'raphiphop',
 'fe_test_0023.mp3': 'raphiphop',
 'fe_test_0024.mp3': 'electronic',
 'fe_test_0025.mp3': 'raphiphop',
 'fe_test_0026.mp3': 'raphiphop',
 'fe_test_0027.mp3': 'raphiphop',
 'fe_test_0028.mp3': 'folkcountry',
 'fe_test_0029.mp3': 'electronic',


All songs are sampled at 44100 Hz.

The simplest features that can be extracted from a music time series are the [zero crossing rate](https://en.wikipedia.org/wiki/Zero-crossing_rate) and the [root mean square energy](https://en.wikipedia.org/wiki/Root_mean_square).

1. Build a function or a transformer that calculates these two features starting from a raw file input.  In order to go from a music file of arbitrary length to a fixed set of features you will need to use a sliding window approach, which implies making the following choices:

 1. what window size are you going to use?
 2. what's the overlap between windows?

 Besides that, you will need to decide how you are going to summarize the values of such features for the whole song. Several strategies are possible:
 -  you could decide to describe their statistics over the whole song by using descriptors like mean, std and higher order moments
 -  you could decide to split the song into sections, calculate statistical descriptors for each section and then average them
 -  you could decide to look at the rate of change of features from one window to the next (deltas).
 -  you could use any combination of the above.

 Your goal is to build a transformer that will output a "song fingerprint" feature vector that is based on the 2 raw features mentioned above. This vector has to have the same size, regardless of the duration of the song clip it receives.

2. Train an estimator that receives the features extracted by the transformer and predicts the genre of a song.  Your solution to Question 1 should be a good starting point.

Use this pipeline to predict the genres for the 145 files in the `music_feature_extraction_test.tar.gz` set and submit your predictions as a dictionary.
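To make the sliding-window idea concrete, here is a minimal NumPy sketch of the two features and of a four-number fingerprint. It is not librosa's implementation: the function names are hypothetical, the window size (2048 samples) and hop (512 samples, i.e. 75% overlap) are assumed values, and the input is a synthetic 440 Hz tone rather than a decoded MP3.

```python
import numpy as np


def frame_signal(signal, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    return signal[starts + np.arange(frame_length)[None, :]]


def zcr_and_rms(signal, frame_length=2048, hop_length=512):
    """Per-frame zero crossing rate and root mean square energy."""
    frames = frame_signal(signal, frame_length, hop_length)
    signs = np.signbit(frames)
    # Fraction of adjacent sample pairs in each frame whose sign differs.
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return zcr, rms


def fingerprint(signal, frame_length=2048, hop_length=512):
    """Summarize the per-frame features as [zcr mean, zcr std, rms mean, rms std]."""
    zcr, rms = zcr_and_rms(signal, frame_length, hop_length)
    return np.array([zcr.mean(), zcr.std(), rms.mean(), rms.std()])


sr = 44100
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440 * t)   # 2 seconds of a 440 Hz sine

short_fp = fingerprint(tone[:sr])    # 1-second clip
long_fp = fingerprint(tone)          # 2-second clip
print(short_fp.shape, long_fp.shape)
```

Both clips yield a vector of the same length, which is the property the estimator needs; only the number of frames being averaged changes with the duration.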

*Hints*
- Extracting features from time series can be computationally intensive. Make sure you choose wisely which features to calculate.
- You can use MRJob or PySpark to distribute the feature extraction part of your model and then train an estimator on the extracted features.

In [210]:
def raw_features_predictions():
    return {("fe_test_%04d.mp3" % i): dict2_genre[test_labels[i-1]] for i in range(1, 146)}
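The dictionary above assumes that the i-th prediction belongs to the i-th numbered file, which holds only if the archive lists its members in numeric order. A more defensive sketch keys each prediction by the file name it was computed from. The helper name `predictions_by_filename` and the example paths are hypothetical.

```python
import os


def predictions_by_filename(paths, codes, code_to_genre):
    """Pair every file path with its own predicted code and return {basename: genre}."""
    if len(paths) != len(codes):
        raise ValueError("need exactly one prediction per file")
    return {
        os.path.basename(path): code_to_genre[int(code)]
        for path, code in zip(paths, codes)
    }


# Made-up example: the archive order differs from the numeric order.
example_paths = ["test/fe_test_0002.mp3", "test/fe_test_0001.mp3"]
example_codes = [2, 0]
example_mapping = {0: "folkcountry", 2: "rock"}

result = predictions_by_filename(example_paths, example_codes, example_mapping)
print(result)
```

Here `paths` would be the same list of names that was passed to the feature extraction step, so the pairing cannot drift.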


In [211]:
# def raw_features_predictions():
# #     return {("fe_test_%04d.mp3" % i): 'blues' for i in xrange(1, 146)}
#     return {("fe_test_%04d.mp3" % i): dict_genre[test_labels[i-1]] for i in xrange(1, 146)}

# grader.score('music__raw_features_predictions', raw_features_predictions)

Your score:  0.952380952426


## Question 3: All Features Predictions
The approach of Question 2 can be generalized to any number and kind of features extracted from a sliding window. Use the [librosa library](https://github.com/librosa/librosa) to extract features that could better represent the genre content of a musical piece.
You could use:
- spectral features to capture the kind of instruments contained in the piece
- MFCCs to capture the variations in frequencies along the piece
- Temporal features like tempo and autocorrelation to capture the rhythmic information of the piece
- features based on psychoacoustic scales that emphasize certain frequency bands.
- any combination of the above

As in Question 2, you'll need to summarize the time series containing the features using some sort of aggregation. This could be as simple as statistical descriptors or something more involved; the choice is yours.
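As an illustration of the simplest kind of aggregation, the sketch below collapses a feature matrix with one row per feature and one column per frame (the layout librosa uses for MFCCs) into per-row statistics. The function name `summarize` is hypothetical, and `np.diff` is used as a plain frame-to-frame difference, which is a simpler stand-in for `librosa.feature.delta`.

```python
import numpy as np


def summarize(feature_matrix):
    """Collapse an (n_features, n_frames) matrix into a fixed-length vector.

    The output holds, for every row: mean, std, mean of the frame-to-frame
    differences and std of those differences, so its length is 4 * n_features.
    """
    feature_matrix = np.atleast_2d(feature_matrix)
    deltas = np.diff(feature_matrix, axis=1)
    return np.concatenate([
        feature_matrix.mean(axis=1),
        feature_matrix.std(axis=1),
        deltas.mean(axis=1),
        deltas.std(axis=1),
    ])


rng = np.random.RandomState(0)
short_clip = rng.normal(size=(20, 100))   # stand-in for 20 MFCCs over 100 frames
long_clip = rng.normal(size=(20, 250))    # same features over 250 frames

print(summarize(short_clip).shape, summarize(long_clip).shape)
```

The vector length depends only on the number of feature rows, not on the number of frames, so clips of different durations can be stacked into one training matrix.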

As a general rule, build your model gradually. Choose a few features that seem interesting, calculate the descriptors and generate predictions.

Make sure you `GridSearchCV` the estimators to find the best combination of parameters.

Use this pipeline to predict the genres for the 145 files in the `music_feature_extraction_test.tar.gz` set and submit your predictions as a dictionary.

**Questions for Consideration:**
1. Does your transformer make any assumption about the duration of the music piece? If so, how could that affect your predictions if you receive longer or shorter pieces?

2. This model works very well on one of the classes. Which one? Why do you think that is?
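To see which class a model handles best, you can compare per-class recall on held-out data. The sketch below uses invented labels purely to show the calculation; it says nothing about how the models in this notebook actually behave.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

genres = ["electronic", "folkcountry", "jazz", "raphiphop", "rock"]

# Invented ground truth and predictions, for illustration only.
y_true = ["jazz", "jazz", "rock", "rock", "rock",
          "electronic", "electronic", "folkcountry", "raphiphop", "raphiphop"]
y_pred = ["jazz", "rock", "rock", "rock", "rock",
          "electronic", "rock", "folkcountry", "raphiphop", "jazz"]

# Rows are true genres, columns are predicted genres, both in the order of `genres`.
cm = confusion_matrix(y_true, y_pred, labels=genres)

# Recall per genre: correct predictions divided by the number of true examples.
recall = cm.diagonal() / cm.sum(axis=1)

for genre, value in zip(genres, recall):
    print(genre, value)
```

The off-diagonal entries of `cm` also show which genres get confused with each other, which helps when reasoning about why one class is easier than the rest.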

In [327]:
a = 1
b = 2
c = 3
ll = a,b,c
st = []
st.append(ll)

In [324]:
st.extend(i for i in range(1,10))

In [325]:
st

[(1, 2, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [370]:
def all_feature_eng(t):
    """Return one fixed-length feature list per file named in the tar archive `t`.

    Each list holds the means, then the standard deviations, of: RMSE, zero
    crossing rate, tempo, beat frames, every MFCC row and every delta-MFCC row.
    """
    features_list = []

    for f in t.getnames():
        mono, fs = librosa.load(f, sr=None)
        zcr = librosa.feature.zero_crossing_rate(mono)
        rmse = librosa.feature.rmse(mono)
        # No sr is passed here, so beat_track uses its default sampling rate.
        tempo, beat = librosa.beat.beat_track(mono, hop_length=256)
        mfcc = librosa.feature.mfcc(mono, sr=44100)
        delta_mfcc = librosa.feature.delta(mfcc)

        features = []
        # First pass appends the means, second pass the standard deviations.
        for stat in (np.mean, np.std):
            features.append(stat(rmse))
            features.append(stat(zcr))
            features.append(stat(tempo))  # tempo is a single number, so its std is 0
            features.append(stat(beat))
            features.extend(stat(mfcc, axis=1))        # one value per MFCC row
            features.extend(stat(delta_mfcc, axis=1))  # one value per delta-MFCC row

        features_list.append(features)

    return features_list

In [287]:
#         {'MFCC__%d' % i: value for (i, value) in enumerate(mfcc_l),


In [None]:
df_train = all_feature_eng(t)

In [373]:
len(df_train)

1167

In [308]:
df_train.loc[0]

BEAT_M                                               740.312
BEAT_S                                               474.105
DELTA_M    [-0.0355509506339, -0.0375985567771, 0.0183213...
DELTA_S                                               2.4488
MFCC_M                                              -2.30404
MFCC_S                                               57.0175
RMSE_M                                           (0.104415,)
RMSE_S                                             0.0421236
TEMPO_M                                               49.692
TEMPO_S                                                    0
ZCR_M                                     (0.0650754442402,)
ZCR_S                                              0.0282038
Name: 0, dtype: object

In [367]:
clf_q3 = RandomForestClassifier(min_samples_leaf=20)
clf_q3.fit(df_train, df_labels_fac)

ValueError: setting an array element with a sequence.

In [376]:
df_test=all_feature_eng(t1)

In [232]:
test_labels_q3 = clf_q3.predict(df_test)

In [379]:
len(df_test)

145

In [244]:
from sklearn.linear_model import RidgeClassifier


In [374]:
ridge = RidgeClassifier(alpha=0.2,normalize=True,class_weight='balanced',fit_intercept=True)


In [None]:
print(len(df_train))
print(len(df_labels_fac))
print(len(df_train[0]))

In [375]:
ridge.fit(df_train, df_labels_fac)

RidgeClassifier(alpha=0.2, class_weight='balanced', copy_X=True,
        fit_intercept=True, max_iter=None, normalize=True,
        random_state=None, solver='auto', tol=0.001)

In [251]:
df_train.shape

(1167, 12)

In [380]:
test_labels_ridgeclf = ridge.predict(df_test)

In [381]:
len(test_labels_ridgeclf)

145

In [382]:
def all_features_predictions():
#     return {("fe_test_%04d.mp3" % i): 'blues' for i in xrange(1, 146)}
    return {("fe_test_%04d.mp3" % i): dict2_genre[test_labels_ridgeclf[i-1]] for i in range(1, 146)}


# grader.score('music__all_features_predictions', all_features_predictions)

Your score:  0.982608695622


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*