In [72]:
import pandas as pd
from os import path

# Edit path if need be (shouldn't need to b/c we all have the same folder structure)
CSV_PATH = '../VidAnalysis/all_data'
FILE_EXTENSION = '_all.csv'
GENRES = ['country', 'edm', 'pop', 'rap', 'rock']

# Containers for the data frames
genre_dfs = {}
all_genres = None


# Read in the CSV for each of the 5 genres
for genre in GENRES:
    genre_csv_path = path.join(CSV_PATH, genre) + FILE_EXTENSION
    genre_dfs[genre] = pd.read_csv(genre_csv_path)

all_genres = pd.concat(genre_dfs.values())
len(all_genres)

# genre_dfs is now a dictionary that contains the 5 different data frames
# all_genres is a dataframe that contains all of the data

414

In [3]:
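def gen_new_headers(old_headers):
    # Column prefixes for the 10 most common colors: 'colors_1_' ... 'colors_10_'
    headers = ['colors_' + str(x+1) + '_' for x in range(10)]
    h = []
    # Each color contributes three columns, in the order red, blue, green
    for x in headers:
        h.append(x + 'red')
        h.append(x + 'blue')
        h.append(x + 'green')
    # Final layout: the original headers, the 30 color columns, then 'genre'
    return old_headers + h + ['genre']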

### Ordinal Genres
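Below, we encode each genre as a numeric code so it can serve as the target of the random forest classifiers. We write a function that maps a genre name to its code and apply it across the dataframe to populate a new genre_ordinal column. The codes are labels only; their order carries no meaning.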

In [75]:
def genre_to_ordinal(genre_in):
    if(genre_in == "country"):
        return "0"
    elif(genre_in == "pop"):
        return "1"
    elif(genre_in == "rock"):
        return "2"
    elif(genre_in == "edm"):
        return "3"
    elif(genre_in == "rap"):
        return "4"
    else:
        return genre_in
    
all_genres['genre_ordinal'] = all_genres.genre.apply(genre_to_ordinal)
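
The same mapping can also be written as a dictionary lookup with `Series.map`. This is a minimal sketch on a made-up toy frame (`toy` and `GENRE_CODES` are illustrative names, not part of the notebook); the codes are kept as strings to match the function above.

```python
import pandas as pd

# Same genre -> code mapping as genre_to_ordinal, kept as strings
GENRE_CODES = {"country": "0", "pop": "1", "rock": "2", "edm": "3", "rap": "4"}

# Hypothetical toy data standing in for all_genres
toy = pd.DataFrame({"genre": ["rap", "country", "edm", "pop", "rock"]})

# map() looks each genre up in the dict. Unlike genre_to_ordinal, which
# returns unknown genres unchanged, map() would turn them into NaN.
toy["genre_ordinal"] = toy["genre"].map(GENRE_CODES)
print(toy)
```

One difference worth keeping in mind: an unexpected genre shows up as a missing value here instead of passing through silently.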

We add in some boolean genre flags to make our analysis more fine-grained. Rather than saying "we predict this video is country with 50% confidence", we could say "this video is not edm with 90% confidence" and so on.

In [76]:
# Adding is_country flag
def is_country(genre_in):
    if(genre_in == "country"):
        return "1"
    else:
        return "0"
    
all_genres['is_country'] = all_genres.genre.apply(is_country)

# Adding is_rock flag
def is_rock(genre_in):
    if(genre_in == "rock"):
        return "1"
    else:
        return "0"
    
all_genres['is_rock'] = all_genres.genre.apply(is_rock)

# Adding is_edm flag
def is_edm(genre_in):
    if(genre_in == "edm"):
        return "1"
    else:
        return "0"
    
all_genres['is_edm'] = all_genres.genre.apply(is_edm)

# Adding is_rap flag
def is_rap(genre_in):
    if(genre_in == "rap"):
        return "1"
    else:
        return "0"
    
all_genres['is_rap'] = all_genres.genre.apply(is_rap)

# Adding is_pop flag
def is_pop(genre_in):
    if(genre_in == "pop"):
        return "1"
    else:
        return "0"
    
all_genres['is_pop'] = all_genres.genre.apply(is_pop)
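
The five flag functions differ only in the genre name, so they could be generated in a loop. This is a sketch on made-up toy data (`toy` is hypothetical). Note one deliberate difference: it stores integer 0/1 flags, whereas the cell above stores the strings "0" and "1".

```python
import pandas as pd

GENRES = ["country", "edm", "pop", "rap", "rock"]

# Hypothetical toy data standing in for all_genres
toy = pd.DataFrame({"genre": ["rap", "country", "edm", "pop", "rock", "rap"]})

for genre in GENRES:
    # The comparison yields booleans; astype(int) turns them into 0/1
    toy["is_" + genre] = (toy["genre"] == genre).astype(int)

print(toy)
```

Every row ends up with exactly one flag set, which is a quick sanity check on the genre column.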

### Test and Train Sets
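We create our training and test sets by splitting all_genres by genre, taking the first n records of each genre for training and the last n for testing (n ranges from 35 to 44 depending on the genre). We then concatenate the per-genre pieces into our full training and full test sets, each containing 203 records of various genres. Note that head(n) and tail(n) overlap whenever a genre has fewer than 2n records.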

In [82]:
import pandas as pd

# Subset all_genres to group by individual genres
country_records  = all_genres[all_genres["genre"] == "country"]
rock_records     = all_genres[all_genres["genre"] == "rock"]
pop_records      = all_genres[all_genres["genre"] == "pop"]
edm_records      = all_genres[all_genres["genre"] == "edm"]
rap_records      = all_genres[all_genres["genre"] == "rap"]

# From the subsets above, create train and test sets from each
country_train = country_records.head(43)
country_test  = country_records.tail(43)
rock_train    = rock_records.head(35)
rock_test     = rock_records.tail(35)
pop_train     = pop_records.head(41)
pop_test      = pop_records.tail(41)
edm_train     = edm_records.head(40)
edm_test      = edm_records.tail(40)
rap_train     = rap_records.head(44)
rap_test      = rap_records.tail(44)

# Create big training and big test set for analysis
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

print("Training:\t%d" % len(training_set))
print("Test:\t\t%d" % len(test_set))

Training:	203
Test:		203
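
Whether the `head`/`tail` split above overlaps depends on the per-genre record counts, which the notebook does not show. The sketch below uses made-up toy data to illustrate the overlap and one way to avoid it; `split_genre` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def split_genre(records, n_train):
    """Return (train, test): the first n_train rows, then all remaining rows."""
    train = records.iloc[:n_train]
    test = records.iloc[n_train:]
    return train, test

# Hypothetical genre with 7 records
toy = pd.DataFrame({"genre": ["rock"] * 7, "likes": range(7)})

# head(4) and tail(4) share row 3 because 7 < 2 * 4
shared = set(toy.head(4).index) & set(toy.tail(4).index)

# Slicing at a single cut point cannot overlap
train, test = split_genre(toy, 4)
overlap = set(train.index) & set(test.index)
print(shared, overlap)
```

With a single cut point the test set size varies by genre, so the two sets are no longer the same size; that is a trade-off rather than a strict improvement.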


### Generating Random Forests
We start generating our random forests, and output a mean accuracy score and a confusion matrix for each. In this first one, we use only the non-color variables (rating, likes, dislikes, length and viewcount), fit on the training records, and predict an ordinal genre value for every test record.

In [86]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Predicting based solely on non-color features
clf = RandomForestClassifier(n_estimators=11)
meta_data_features = ['rating', 'likes','dislikes','length','viewcount']
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[meta_data_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print(clf.score(test_set[meta_data_features], z))
# Rows are genre_ordinal values; columns are the pd.factorize codes returned by predict()
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[meta_data_features]), rownames=["Actual"], colnames=["Predicted"])

0.566502463054


Predicted   0   1   2   3   4
Actual
0          32   0   0   7   4
1           2   5  28   6   0
2           5  15   5   6   4
3           8   5   1  14  12
4           2   7   2   7  26


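As shown above, this method yields relatively poor results: about 57% accuracy, against roughly 20% for guessing among five genres. Simple viewer statistics carry only limited information about what kind of video we're watching. Note that the rows and columns of the table use different codes. The rows are genre_ordinal values (0 country, 1 pop, 2 rock, 3 edm, 4 rap), while the columns are the codes that pd.factorize assigned in order of appearance in the training set (0 country, 1 rock, 2 pop, 3 edm, 4 rap). Correct predictions for pop and rock therefore sit off the diagonal, in cells (1, 2) and (2, 1), and the reported score matches this reading: (32 + 28 + 15 + 14 + 26) / 203 ≈ 0.567. Read this way, country (32 of 43), pop (28 of 41) and rap (26 of 44) are somewhat distinct, while rock (15 of 35) and edm (14 of 40) are the hardest to pick out. Let's see if we can't make something of this.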
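
How `pd.factorize` numbers the labels matters when reading the table. The sketch below reproduces the genre order of the concatenated sets (country, rock, pop, edm, rap) with one made-up row per genre and shows how predicted codes could be decoded back to `genre_ordinal` values before building a crosstab. The variable names are illustrative.

```python
import pandas as pd

# One row per genre, in the order the sets are concatenated above
genres = pd.Series(["country", "rock", "pop", "edm", "rap"])
ordinal = genres.map(
    {"country": "0", "pop": "1", "rock": "2", "edm": "3", "rap": "4"}
)

# factorize numbers labels by first appearance, not by their value
codes, uniques = pd.factorize(ordinal)
print(list(codes))    # codes follow row order
print(list(uniques))  # the label each code stands for

# Decoding codes through `uniques` recovers the genre_ordinal labels,
# so rows and columns of a crosstab would share one coding
decoded = list(uniques[codes])
print(decoded)
```

Applying the same decoding to the output of `clf.predict` would put correct predictions back on the diagonal.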

### Random Forest Only Color
Below, we do the same random forest as above, but going strictly off of the most common frame colors for the video.

We found the most commonly appearing color in each frame and called it the 'frame mode'. We then took all of the frame modes and found the 10 most common of them. Those became the 'color data' we use to analyze videos.
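
The extraction code is not part of this notebook, so the following is only a sketch of the idea under stated assumptions: each frame is a NumPy array of shape (height, width, 3), and colors are compared exactly with no quantization. `frame_mode` and `top_colors` are hypothetical names.

```python
from collections import Counter

import numpy as np

def frame_mode(frame):
    """Most common (r, g, b) color in one frame of shape (height, width, 3)."""
    pixels = [tuple(int(c) for c in px) for px in frame.reshape(-1, 3)]
    return Counter(pixels).most_common(1)[0][0]

def top_colors(frames, k=10):
    """The k most common frame modes across all frames of a video."""
    modes = [frame_mode(f) for f in frames]
    return [color for color, _ in Counter(modes).most_common(k)]

# Toy video: two all-red frames and one all-black frame
red = np.zeros((2, 2, 3), dtype=np.uint8)
red[..., 0] = 255
black = np.zeros((2, 2, 3), dtype=np.uint8)

result = top_colors([red, red, black], k=2)
print(result)
```

A real pipeline would likely bucket similar colors together first, since exact matches are rare in natural video.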

In [36]:
clf = RandomForestClassifier(n_estimators=11)
color_features = gen_new_headers([])[:-1]

# Predicting based solely on colors
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[color_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print(clf.score(test_set[color_features], z))
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[color_features]),rownames=["Actual"], colnames=["Predicted"])

0.246305418719


Predicted   0   1   2   3   4
Actual
0          19   6   3   6   9
1           8  11   8   7   7
2          16   7   3   7   2
3          15   5   4   8   8
4           8   6   9  13   8


This actually yields worse results than just the viewer statistics: about 25% accuracy, barely above the roughly 20% expected from guessing among five genres. The color of a video by itself does not determine the genre. If rappers only had red in their videos and rockers only had black this might be somewhat accurate, but that's just not the case. But what if we pair these findings with our initial viewer statistics?

In [37]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print(clf.score(test_set[all_features], z))
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])

0.467980295567


Predicted   0   1   2   3   4
Actual
0          29   0   0   6   8
1           1   7  26   6   1
2          12   7   9   4   3
3          10   7   2   7  14
4           7   4   1   6  26


### Singling Out Pop and Rap
Scores are expectedly low; adding the color features even lowers accuracy from about 0.57 to about 0.47. It seems as if we're trying to make the classifier do way too much work, and are giving it very mediocre data to go off of. Recall that the code above tries to determine WHICH genre a video belongs to, not whether or not a video is of ONE specific genre. This brings back the boolean genre flags that we created above. Let's put those to use to see if we can improve these scores.

We try pop and rap first, since they seem to be the most distinct by what we've gathered above.

In [87]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['is_pop'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['is_pop'])
print(clf.score(test_set[all_features], z))
pd.crosstab(test_set.is_pop, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])

0.871921182266


Predicted    0   1
Actual
0          156   6
1           20  21


In [39]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['is_rap'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['is_rap'])
print(clf.score(test_set[all_features], z))
pd.crosstab(test_set.is_rap, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])

0.743842364532


Predicted    0   1
Actual
0          138  21
1           31  13


The first confusion matrix above shows, for a model fit on our training data, whether each video in the test set was predicted to be a pop video. In the "Predicted" columns, 0 means the model predicts the video is not pop, and 1 means it predicts pop. Likewise for the "Actual" rows, 0 means the video actually wasn't a pop video, and 1 means that it was. The second matrix reads the same way for rap.

These confusion matrices are our first effort at utilizing the boolean flags. Most of our test videos aren't pop videos (162 aren't, 41 are), and the model did a good job of picking out those that aren't pop. However, we could use some improvement in the realm of "false negatives", where the model classified a video as not pop when it actually was: 20 of the 41 pop videos were missed.
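
To put numbers on this, the counts from the is_pop confusion matrix above (156, 6, 20, 21) can be turned into accuracy, recall and precision. This is plain arithmetic on the table, not output from the notebook.

```python
# Counts read off the is_pop confusion matrix (rows = actual, columns = predicted)
tn, fp = 156, 6   # actual not-pop: predicted not-pop / predicted pop
fn, tp = 20, 21   # actual pop:     predicted not-pop / predicted pop

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all videos classified correctly
recall = tp / (tp + fn)        # share of pop videos that were found
precision = tp / (tp + fp)     # share of "pop" predictions that were right

print("accuracy:  %.3f" % accuracy)
print("recall:    %.3f" % recall)
print("precision: %.3f" % precision)
```

The accuracy matches the score printed for that cell, while the recall of about one half makes the false-negative problem explicit.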

We recreated the tests above with each genre, and the results are below:

##### Ranking performance of boolean classifiers, train and test sets of 50, respectively.
- 1 - is_pop (.84 avg) 
- 2 - is_rap (.82 avg, fewer true negatives)
- 3 - is_rock (.78 avg, too many true negatives)
- 4 - is_edm (.--, DO NOT USE. Rarely predicts a positive edm value)
- 5 - is_country (.--, DO NOT USE. Way too many false positives)

Because each random forest is built with random sampling, a single score is noisy. We repeat the fit 50 times and average the scores.

In [44]:
clf = RandomForestClassifier(n_estimators=11)

# Average score over many iterations calculation
loop_indices = range(0,50)
cumsum = 0

for i in loop_indices:
    y, _ = pd.factorize(training_set['is_pop'])
    clf = clf.fit(training_set[all_features], y)

    z, _ = pd.factorize(test_set['is_pop'])
    # print(clf.score(test_set[all_features], z))
    cumsum = cumsum + clf.score(test_set[all_features], z)
    # print(pd.crosstab(test_set.is_pop, clf.predict(test_set[all_features]), rownames=["Actual"], colnames=["Predicted"]))
    
print("Average Score for %d is_pop iterations: %s" % (len(loop_indices), cumsum / len(loop_indices)))

Predicted    0   1
Actual            
0          154   8
1           16  25
Average Score for 50 is_pop iterations: 0.870443349754


In [45]:
# Average score over many iterations calculation
loop_indices = range(0,50)
cumsum = 0

for i in loop_indices:
    y, _ = pd.factorize(training_set['is_rap'])
    clf = clf.fit(training_set[all_features], y)

    z, _ = pd.factorize(test_set['is_rap'])
    # print(clf.score(test_set[all_features], z))
    cumsum = cumsum + clf.score(test_set[all_features], z)
    # print(pd.crosstab(test_set.is_rap, clf.predict(test_set[all_features]), rownames=["Actual"], colnames=["Predicted"]))
    
print("Average Score for %d is_rap iterations: %s" % (len(loop_indices), cumsum / len(loop_indices)))

Average Score for 50 is_rap iterations: 0.794975369458


Rather than hard-coding the loop each time we want an average, we wrote a function that does it for us. All we have to do is pass in the name of the boolean flag column in quotes ("is_rock", etc.), and the number of iterations that we want. Results are displayed below.

In [46]:
def multi_RF_averages(is_genre,num_iterations):
    clf = RandomForestClassifier(n_estimators=11)
    loop_indices = range(0,num_iterations)
    cumsum = 0

    for i in loop_indices:
        y, _ = pd.factorize(training_set[is_genre])
        clf = clf.fit(training_set[all_features], y)

        z, _ = pd.factorize(test_set[is_genre])
        cumsum = cumsum + clf.score(test_set[all_features],z)
    
    print("Average Score for %d %s iterations: %s" % (len(loop_indices), is_genre, cumsum / len(loop_indices)))
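
A variant of this idea that returns the numbers instead of printing them is sketched below. It is an illustration rather than the notebook's method: it builds a fresh classifier with a different `random_state` on every iteration, reports the standard deviation alongside the mean, and runs on synthetic data. `average_rf_score` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def average_rf_score(X_train, y_train, X_test, y_test, num_iterations=50):
    """Fit num_iterations forests with different seeds; return (mean, std) accuracy."""
    scores = []
    for seed in range(num_iterations):
        clf = RandomForestClassifier(n_estimators=11, random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return np.mean(scores), np.std(scores)

# Synthetic data: the label depends only on the first of five features
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

mean_score, std_score = average_rf_score(
    X[:100], y[:100], X[100:], y[100:], num_iterations=5
)
print("mean %.3f, std %.3f" % (mean_score, std_score))
```

Reporting the spread helps judge whether a difference between two flags is larger than the run-to-run noise.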

In [47]:
multi_RF_averages("is_pop",50)
multi_RF_averages("is_rap",50)
multi_RF_averages("is_rock",50)
multi_RF_averages("is_edm",50)
multi_RF_averages("is_country",50)

Average Score for 50 is_pop iterations: 0.872413793103
Average Score for 50 is_rap iterations: 0.794679802956
Average Score for 50 is_rock iterations: 0.828177339901
Average Score for 50 is_edm iterations: 0.76315270936
Average Score for 50 is_country iterations: 0.786206896552
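
A useful yardstick for these averages is the accuracy of always answering "not this genre". The sketch below derives it from the test-set sizes used in the split above (43, 35, 41, 40, 44), assuming each genre had at least that many records; the total of 203 agrees with the printed test-set size.

```python
# Test records per genre, taken from the tail(n) sizes in the split cell
test_counts = {"country": 43, "rock": 35, "pop": 41, "edm": 40, "rap": 44}
total = sum(test_counts.values())

# Accuracy of a model that always predicts 0 ("not this genre")
baseline = {g: (total - n) / float(total) for g, n in test_counts.items()}

for genre in sorted(baseline):
    print("is_%-8s always-0 baseline: %.3f" % (genre, baseline[genre]))
```

Against these baselines (about 0.80 for pop, 0.78 for rap, 0.83 for rock, 0.80 for edm and 0.79 for country), is_pop stands out clearly, is_rap is slightly ahead, is_rock and is_country are roughly level, and is_edm falls below.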


We ran the above test with all genres. The country and edm flags score lowest (about 0.79 and 0.76). In the first confusion matrix, edm was also the genre recognized least reliably (14 of 40 correct), so we suspect that edm videos overlap with the other genres in ways that blur the classifier. We take out the edm records from our training and test datasets, hoping to improve accuracy.

In [48]:
# Removing EDM to see whether the remaining flags become easier to predict
training_set = pd.concat([country_train,rock_train,pop_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,rap_test])

multi_RF_averages("is_pop",50)
multi_RF_averages("is_rap",50)
multi_RF_averages("is_rock",50)
multi_RF_averages("is_edm",50)
multi_RF_averages("is_country",50)

Average Score for 50 is_pop iterations: 0.865398773006
Average Score for 50 is_rap iterations: 0.810674846626
Average Score for 50 is_rock iterations: 0.78282208589
Average Score for 50 is_edm iterations: 1.0
Average Score for 50 is_country iterations: 0.755950920245


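So, what does this tell us? Based on our training data, we have the best chance of accurately classifying something as pop or not pop (under these conditions). Removing edm did not improve the scores overall: is_rap rose slightly while is_pop, is_rock and is_country fell. The 1.0 for is_edm is trivial, because with no edm records left every label is 0 and the classifier only ever sees one class.

We want to find out which 2 genres are the most distinct, so we can build our model based on that classification.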

In [69]:
training_set = pd.concat([country_train,rock_train,edm_train,rap_train,pop_train])

test_set     = pd.concat([rock_test])
multi_RF_averages("is_rock",50)

test_set     = pd.concat([rap_test])
multi_RF_averages("is_rap",50)

test_set     = pd.concat([country_test])
multi_RF_averages("is_country",50)

test_set     = pd.concat([pop_test])
multi_RF_averages("is_pop",50)

test_set     = pd.concat([edm_test])
multi_RF_averages("is_edm",50)

Average Score for 50 is_rock iterations: 0.906285714286
Average Score for 50 is_rap iterations: 0.6
Average Score for 50 is_country iterations: 0.298604651163
Average Score for 50 is_pop iterations: 0.445365853659
Average Score for 50 is_edm iterations: 0.9115


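Rock and EDM appear to have surprisingly distinct classifiers, but these scores need care. Each test set here contains a single genre, so every true flag is "1" and pd.factorize assigns it code 0. In the training set, code 0 means "0" (not this genre) for every flag except is_country, because the country records come first. For rock, rap, pop and edm the score is therefore the share of videos predicted as not belonging to their genre: roughly 91% of rock and edm videos are missed, in line with the earlier note that is_edm rarely predicts a positive value. We should dive into the videos and see what this means.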
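
The effect of factorizing a single-genre test set can be checked on a toy example. The predictions below are made up; only the behavior of `pd.factorize` and the arithmetic are the point.

```python
import numpy as np
import pandas as pd

# Rock-only test set: every is_rock flag is the string "1"
true_flags = pd.Series(["1"] * 5)

# Hypothetical predictions in the training coding (1 = rock, 0 = not rock)
predicted = np.array([0, 0, 0, 0, 1])

# factorize sees only one label, so "1" becomes code 0
z, uniques = pd.factorize(true_flags)
factorized_score = (predicted == z).mean()

# Converting the flags to integers keeps 1 = rock in both sets
recall = (predicted == true_flags.astype(int).to_numpy()).mean()

print("score against factorized labels: %.1f" % factorized_score)
print("share of rock videos recognized: %.1f" % recall)
```

The two numbers add up to one, so a high score against the factorized labels corresponds to a low share of videos recognized as their own genre.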

In [67]:
test_set     = pd.concat([edm_test,rock_test])
multi_RF_averages("is_edm",50)
multi_RF_averages("is_rock",50)

Average Score for 50 is_edm iterations: 0.5152
Average Score for 50 is_rock iterations: 0.579466666667
