# My Favorite Things
#### A Tour of Unstructured Data

 
I scraped the profiles of several thousand users on a popular website. Each profile lists the user's favorite books, movies, TV shows, and even food. But this data is hard to make sense of; it's just a list of words.

Using some natural language processing, however, we can tame this data and discover some interesting insights.

In [3]:
import numpy as np
import pandas as pd

from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation, digits

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score, accuracy_score

from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [16]:
# I've pre-wrangled and saved the data, but we'll still need to do a bit more work

data = pd.read_csv('favorites_data.csv')
del data['Unnamed: 0']

# Dataframe info
print('Shape:',data.shape)
print('Columns:', data.columns)

# Sample entries
data.sample(5)

Shape: (7116, 2)
Columns: Index(['favorites', 'length'], dtype='object')


| | favorites | length |
|---|---|---|
| 3 | I read about anything! At the momentI am readi... | 280 |
| 6018 | Books: anything by Mochael Crichton, Jules Ver... | 580 |
| 6497 | I have read like 50 books in my life time that... | 185 |
| 5473 | Drunk History, Shameless, Everything is Illumi... | 154 |
| 3067 | Shows: Prison BreakMusic: RNB, KpopFood: Spagh... | 198 |


In [17]:
for i in punctuation:
    # regex=False: treat characters such as '.' and '(' literally
    data['favorites'] = data['favorites'].str.replace(i, ' ', regex=False)
    
data['favorites'] = data['favorites'].str.lower()

data = data[data.length > 2]

data.shape

(7116, 2)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(max_df=.15, min_df=2)

tf.fit(data.favorites)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.15, max_features=None, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [19]:
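# Build the NLTK stop-word set once; calling stopwords.words() for every word is slow
nltk_stop_words = set(stopwords.words('english'))

def text_process(text, tf_stop_words):
    
    # Tfidf stopwords (terms dropped by max_df/min_df)
    cleaned = [word for word in text.split() if word not in tf_stop_words]
    
    # NLTK stopwords
    cleaned = [word for word in cleaned if word not in nltk_stop_words]
    
    # Drop single-character leftovers
    cleaned = [word for word in cleaned if len(word) > 1]
    return cleaned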


data['favorites'] = data['favorites'].apply(text_process, tf_stop_words=tf.stop_words_)

In [20]:
data['favorites'].sample(5)

5755    [recently, symphonic, metal, authors, laurel, ...
4844    [happiness, intelligent, people, thing, know, ...
2860    [basically, thai, indian, mexican, vietnamese,...
19      [walking, dead, girl, modern, family, orange, ...
4551    [huge, book, nerd, mostly, fiction, biography,...
Name: favorites, dtype: object

In [21]:
things_liked = list(data['favorites'])

everything_liked = [item for user in things_liked for item in user]

results = Counter(everything_liked)
results.most_common(40)

[('things', 1250),
 ('big', 1164),
 ('one', 1152),
 ('book', 1129),
 ('many', 1128),
 ('lot', 1102),
 ('enjoy', 1100),
 ('series', 1092),
 ('everything', 1046),
 ('black', 1046),
 ('pretty', 1046),
 ('reading', 1045),
 ('movie', 1032),
 ('dead', 1003),
 ('thrones', 992),
 ('life', 988),
 ('american', 961),
 ('fiction', 959),
 ('house', 938),
 ('always', 933),
 ('show', 924),
 ('horror', 908),
 ('listen', 891),
 ('list', 857),
 ('eat', 817),
 ('country', 812),
 ('get', 795),
 ('etc', 792),
 ('bad', 775),
 ('harry', 771),
 ('favorites', 737),
 ('star', 730),
 ('go', 714),
 ('potter', 711),
 ('know', 704),
 ('stuff', 694),
 ('sushi', 688),
 ('old', 657),
 ('walking', 650),
 ('try', 641)]

This is a bag-of-words approach, so it requires some pop-culture domain knowledge to interpret. Here's what I notice in the list:

1. Rock music
2. Game of Thrones
3. The Walking Dead
4. Harry Potter
5. Sushi
6. Italian food
7. Pop music
8. Thai food
9. Mexican food
10. Hip-hop music
 
Things that "might" be mentioned (in no particular order):
1. This American Life
2. Black Mirror
3. Orange is the New Black
4. Clockwork Orange
5. American Horror Story
6. House, M.D.
7. House music
8. Star Wars
9. Star Trek
10. Family Guy
11. Modern Family

These are my educated guesses. People might be huge fans of *Little House on the Prairie*, but I doubt it.
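One way to firm up guesses like these is to count adjacent word pairs instead of single words. The sketch below is not part of the original analysis: the three sample strings are made up, and it assumes scikit-learn's `CountVectorizer` with `ngram_range=(2, 2)` is an acceptable stand-in for the cleaning pipeline above.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up profiles standing in for the cleaned 'favorites' text.
toy_profiles = [
    "star wars and star trek",
    "black mirror star wars",
    "orange is the new black",
]

# ngram_range=(2, 2) keeps only two-word sequences (bigrams).
vec = CountVectorizer(ngram_range=(2, 2))
counts = vec.fit_transform(toy_profiles)

# Total occurrences of each bigram across all profiles.
totals = np.asarray(counts.sum(axis=0)).ravel()
bigram_counts = Counter(
    {term: int(totals[idx]) for term, idx in vec.vocabulary_.items()}
)

print(bigram_counts.most_common(3))
```

With bigrams, "star wars" and "star trek" are counted separately, so there is less guessing about which "star" people meant.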

# Creating the DataFrame
This is where the real magic happens. I iterate over each word used, then (if necessary) create a new column with that word and assign it a value of 1. We can then look at correlations and create a simple recommender system.

I saved the results to a csv, then put the code in a function. I only needed to run it once.

In [27]:
df = pd.DataFrame()

for i in range(len(data)):

    for j in things_liked[i]:
        if results[j] >= 5: #set a threshold
            df.at[i, j] = 1

df.fillna(0, inplace=True)
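
Filling the frame one cell at a time works, but it is slow on thousands of users. As a rough alternative sketch (not what I actually ran), scikit-learn's `MultiLabelBinarizer` can build the same kind of 0/1 matrix in one pass. The token lists and the `min_mentions` threshold here are made up for illustration.

```python
from collections import Counter

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up token lists standing in for `things_liked`.
toy_liked = [
    ["thrones", "sushi", "potter"],
    ["sushi", "thrones"],
    ["potter", "jazz"],
]

# Count mentions so rare words can be dropped, as in the loop above.
toy_counts = Counter(word for user in toy_liked for word in user)
min_mentions = 2  # hypothetical threshold for this toy data
keep = sorted(word for word, n in toy_counts.items() if n >= min_mentions)

# Remove rare words up front so the binarizer only sees known classes.
filtered = [[word for word in user if word in keep] for user in toy_liked]

# One row per user, one 0/1 column per kept word.
mlb = MultiLabelBinarizer(classes=keep)
toy_df = pd.DataFrame(mlb.fit_transform(filtered), columns=mlb.classes_)

print(toy_df)
```

One difference from the loop: this keeps a row for every user, including users who mention nothing above the threshold.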



In [28]:
df.to_csv('favorites_processed_2.csv', encoding='utf-8')

In [None]:
def favorites_df(mentions=20, save=True):
    df = pd.DataFrame()

    for i in range(len(data)):
        # data['favorites'] already holds the token lists from text_process
        for j in data.iloc[i]['favorites']:
            if results[j] >= mentions: #set a threshold
                # df.set_value was removed from pandas; df.at does the same job
                df.at[i, j] = 1

    df.fillna(0, inplace=True)
    
    if save:
        df.to_csv('favorites_processed.csv', encoding='utf-8')
    
    return df

In [None]:
# Retrieving the csv I saved with the previous function
# index_col=0 keeps the saved index from coming back as an 'Unnamed: 0' column
df = pd.read_csv('favorites_processed.csv', index_col=0)

df.head(3)

In [29]:
def corrs(corr_item=df['thrones'], df=df):
    '''
    Finds correlated words, then sorts by absolute value to account for
    strong negative correlations.
    '''
    cor = df.select_dtypes(include=[np.number]).corrwith(corr_item)
    df = pd.DataFrame(cor.sort_values(ascending=False),columns=['corr'])
    df['absol'] = np.abs(df['corr'])
    return df[df.absol < 1].sort_values('absol', ascending=False)['corr']         
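
To see why sorting by absolute value matters, here is a toy version of the same idea. The four columns and their values are invented; the only assumption is that `DataFrame.corrwith` is applied to 0/1 columns, as in `corrs`. Unlike `corrs`, this sketch drops the target column up front instead of filtering out the perfect self-correlation afterwards.

```python
import pandas as pd

# Invented 0/1 preferences for six users.
toy = pd.DataFrame({
    "inception":    [1, 1, 1, 0, 0, 0],
    "interstellar": [1, 1, 0, 0, 0, 0],
    "country":      [0, 0, 0, 1, 1, 0],
    "sushi":        [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of every other column with 'inception'.
corr = toy.drop(columns="inception").corrwith(toy["inception"])

# Order by strength, keeping the sign so negatives stay visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Here "country" is as strongly related to "inception" as "interstellar" is, just in the opposite direction; a plain descending sort would push it to the bottom.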

In [32]:
# Titles that were ambiguous from our top results, or might
# otherwise be hard to find. Not perfect, but works well enough.

df['always_sunny'] = np.where((df.always == 1) & (df.sunny == 1), 1, 0)
df['american_dad'] = np.where((df.american == 1) & (df.dad == 1) & (df.horror == 0), 1, 0)
df['american_horror_story'] = np.where((df.american == 1) & (df.dad == 0) & (df.horror == 1), 1, 0)
df['big_bang_theory'] = np.where((df.bang == 1) & (df.theory == 1), 1, 0)
df['black_mirror'] = np.where((df.black == 1) & (df.mirror == 1), 1, 0)
df['clockwork_orange'] = np.where((df.clockwork == 1) & (df.orange == 1), 1, 0)

# People love to point out that they hate country music.
df['country'] = np.where((df.country == 1) & (df['except'] == 0), 1, 0) 

df['crazy_ex_girlfriend'] = np.where((df.crazy == 1) & (df.ex == 1), 1, 0)
df['criminal_minds'] = np.where((df.criminal == 1) & (df.minds == 1), 1, 0)
df['family_guy'] = np.where((df.family == 1) & (df.guy == 1), 1, 0)
df['final_fantasy'] = np.where((df.final == 1) & (df.fantasy == 1), 1, 0)
df['lord_of_the_rings'] = np.where((df.lord == 1) & (df.rings == 1), 1, 0)
df['modern_family'] = np.where((df.modern == 1) & (df.family == 1), 1, 0)
df['name_of_the_wind'] = np.where((df.wind == 1) & (df.name == 1), 1, 0) # My favorite book!
df['orange_is_the_new_black'] = np.where((df.black == 1) & (df.orange == 1), 1, 0)
df['pans_labyrinth'] = np.where((df.pan == 1) & (df.labyrinth == 1), 1, 0)
df['princess_bride'] = np.where((df.princess == 1) & (df.bride == 1), 1, 0)
df['south_park'] = np.where((df.south == 1) & (df.park == 1), 1, 0)
df['star_trek'] = np.where((df.star == 1) & (df.trek == 1), 1, 0) # Really, "trek" would be enough.
df['star_wars'] = np.where((df.star == 1) & (df.wars == 1), 1, 0)
df['the_office'] = np.where((df.office == 1) & (df.space == 0), 1, 0)
df['this_american_life'] = np.where((df.american == 1) & (df.life == 1), 1, 0)
df['whose_line'] = np.where((df.whose == 1) & (df.line == 1), 1, 0)
df['game_of_thrones'] = df.thrones
del df['thrones']

# One of my favorite shows, and a huge pain to search for!
df['lost_tv_show'] = np.where((df.lost == 1) & (df.translation == 0) &\
                              (df.boys == 0) & (df.paradise == 0) &\
                              (df.getting == 0) & (df.world == 0) &\
                              (df.souls == 0) & (df.children == 0) &\
                              (df.girl == 0), 1, 0)


In [33]:
corrs(df.inception).head(10)

matrix          0.131429
interstellar    0.126217
memento         0.111756
500             0.104910
linkin          0.102768
knight          0.099501
julian          0.089933
fight           0.089708
assassin        0.088461
club            0.088178
Name: corr, dtype: float64

People who like Inception also tend to like:
- The Matrix
- Interstellar
- Memento
- 500 Days of Summer
- Linkin Park
- The Dark Knight
- Fight Club
- Hunting Nemo

Wait, that's not the name of the movie! Does "hunting" refer to *Good Will Hunting*? Let's make sure.

In [34]:
corrs(df['hunting']).head(10)

goodwill     0.209285
bourne       0.151121
beautiful    0.109954
bergerac     0.103301
cyrano       0.103301
gran         0.103301
scotland     0.103301
nutshell     0.103301
raid         0.103301
thirteen     0.103301
Name: corr, dtype: float64

It would appear so. It's interesting to note that it's pretty common to simply be a "Matt Damon" fan (e.g., Jason Bourne, The Departed).

So, what about *Good Will Hunting*?

In [None]:
#corrs(df['will'])

I won't run it, but this raises a KeyError: "will" is one of NLTK's English stopwords, so it never became a column. The problem with my stopwords is that they sometimes filter out words that are part of titles, such as "Finding" Nemo or "Good" Will Hunting.

However, I'd argue this is more a problem with unstructured data than with the use of stopwords.
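
If I had a list of titles to begin with, one workaround would be to glue each title into a single token before the stopwords are removed. This is only a sketch: `known_titles`, `toy_stop_words`, and `protect_titles` are hypothetical names, and nothing like this exists in the analysis above.

```python
# Hypothetical title list and stop-word set, for illustration only.
known_titles = ["good will hunting", "finding nemo"]
toy_stop_words = {"will", "i", "and", "the"}


def protect_titles(text, titles):
    """Join each known title into one underscore-separated token."""
    for title in titles:
        text = text.replace(title, title.replace(" ", "_"))
    return text


raw = "i love good will hunting and finding nemo"
protected = protect_titles(raw, known_titles)

# Stop-word filtering now sees 'good_will_hunting' as a single word.
tokens = [word for word in protected.split() if word not in toy_stop_words]

print(tokens)  # ['love', 'good_will_hunting', 'finding_nemo']
```

Of course, this just moves the problem: free-text profiles don't come with a title list, which is the real difficulty with unstructured data.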

# Clusters
What if there are certain "types" of people in terms of personal tastes? A cluster analysis could uncover this.

In [None]:
# Create a new dataframe. We need to increase the popularity
# threshold to 50 to reduce dimensionality.

df_popular = favorites_df(mentions=50, save=False)

In [None]:
df_popular.shape

There's no avoiding it: We'll have to reduce the number of dimensions in order to perform a cluster analysis. Let's see if we can find a point of diminishing returns.

In [None]:
from sklearn.decomposition import PCA

variance_dict = {}

for i in range(1,1100,100):
    pca = PCA(n_components=i, whiten=True).fit(df_popular)
    variance_dict[i] = sum(pca.explained_variance_ratio_)

In [None]:
pd.DataFrame([variance_dict]).T.plot()

There's an elbow at approximately 100, but that's still too many features for this data set.
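
The explained-variance curve can also be read off a single fit: with an exact (full SVD) solver, the variance captured by the first k components is the running total of `explained_variance_ratio_`. The sketch below uses a random 0/1 matrix as a stand-in for `df_popular`, so the numbers mean nothing; it only shows the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0/1 matrix standing in for df_popular (200 users, 50 words).
rng = np.random.default_rng(0)
toy = (rng.random((200, 50)) < 0.1).astype(float)

# Fit once with every component, then accumulate the variance ratios.
pca_full = PCA(svd_solver="full").fit(toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Variance captured by the first 10 components, without refitting.
print(round(float(cumulative[9]), 3))
```

Plotting `cumulative` gives the whole curve at once, so the elbow can be inspected at any resolution without a loop of refits.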

In [None]:
variance_dict = {}

for i in range(1,210,20):
    pca = PCA(n_components=i, whiten=True).fit(df_popular)
    variance_dict[i] = sum(pca.explained_variance_ratio_)
    
pd.DataFrame([variance_dict]).T.plot()

20 appears to be another good option. However, after much trial and error, I discovered that it's most effective to capture about 10% of the variance with only 10 features. It makes our clustering decisions much easier.

In [None]:
pca = PCA(n_components=10, whiten=True).fit(df_popular)

dframe = pca.transform(df_popular)

print('PCA explained variance:', sum(pca.explained_variance_ratio_))
print('Shape:', dframe.shape)

We're only catching a tiny fraction of the variance, but it'll have to do.

Next question: How many clusters should we have?

In [None]:
# Inertia is the sum of squared distances from each point to its
# nearest centroid. We want it to be low, but adding more clusters
# will always reduce it, so we look for a point of diminishing returns.

kdict = {}

for i in range(2,20):
    clf = KMeans(n_clusters=i)
    clf.fit(dframe)
    kdict[i] = clf.inertia_

kframe = pd.DataFrame([kdict]).T
kframe.columns = ['inertia']

# Calculate the improvement as a percentage
kframe['improvement'] = (kframe['inertia'].shift(1) - kframe['inertia'])/kframe['inertia'].shift(1)

kframe

In [None]:
kframe.improvement.plot()

12 appears to be a good number of clusters.
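
For anyone unsure what inertia measures, it can be recomputed by hand: take each point's distance to its nearest centroid, square it, and add everything up. The blobs below are synthetic and the variable names are mine; the sketch just checks that the manual number matches `KMeans.inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated toy blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# transform() returns each point's distance to every centroid.
nearest = km.transform(toy).min(axis=1)
manual_inertia = float((nearest ** 2).sum())

print(manual_inertia, km.inertia_)
```

Because it is a sum rather than an average, inertia also grows with the number of rows, which is one more reason to compare relative improvements instead of raw values.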

In [None]:
clf = KMeans(n_clusters=12)
clf.fit(dframe)  # cluster on the 10 PCA components chosen above

In [None]:
df_popular['cluster'] = clf.labels_ + 1

In [None]:
groups = df_popular.groupby('cluster').mean()

groups = groups[(groups > .1)]
groups.dropna(how='all', axis=1, inplace=True)

In [None]:
groups = groups.dropna(how='all',axis=1).fillna(0)

In [None]:
corrs(df.black).head(10)

In [None]:
df.columns

In [None]:
for i in range(len(groups)):
    print('Cluster', str(i+1))
    for j in groups.columns:
        if groups.iloc[i][j] > .2:
            print(j, end=', ')
    print()
    print()

# Machine Learning Your Tastes

I like Game of Thrones, Breaking Bad, Lost, Ferris Bueller, and The Walking Dead. Would I like Margaret Atwood?

In [None]:
my_tastes = {'game_of_thrones': 1, 'breaking': 1, 'bad': 1, 'walking': 1, 'dead': 1,
             'lost_tv_show': 1, 'name_of_the_wind': 1, 'ferris': 1, 'bueller': 1}

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, pd.DataFrame([my_tastes])], ignore_index=True).fillna(0)

In [None]:
x = df.copy()
y = x.pop('atwood')

# Drop words from her name and book titles so they can't leak the answer
del x['handmaid']
del x['margaret']
del x['tale']
del x['heart']
del x['goes']
del x['last']
del x['year']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hold out the final row (my own tastes) so the model never trains on it
xtrain, xtest, ytrain, ytest = train_test_split(x.iloc[:-1], y.iloc[:-1], test_size=.20)

clf = RandomForestClassifier(n_estimators=500) # We need a lot of estimators to create precise predictions here.
clf.fit(xtrain, ytrain)
pred = clf.predict(xtest)
pred_proba = clf.predict_proba(xtest)

accuracy_score(ytest, pred)

In [None]:
pred.sum()

We have a problem! There's a huge class imbalance, and the model is going to be 99% accurate just by predicting that no one likes Margaret Atwood.

I'm quite willing to have some false positives, so let's create a custom set of predictions.

In [None]:
atwood_average = pred_proba[:,1].mean()
atwood_std = pred_proba[:,1].std()

print("Average probability of liking Margaret Atwood:", atwood_average)
print("Standard deviation:", atwood_std)

Let's combine them and say that if your probability is at least one standard deviation above the average, you're predicted to like her.

In [None]:
predicted_to_like = atwood_average + atwood_std

# Flag everyone whose probability is at or above the custom threshold
pred_custom = (pred_proba[:,1] >= predicted_to_like).astype(int)

pred_custom.sum()

Woohoo! Now many people are predicted to like her. Let's evaluate this using recall rather than accuracy. Recall is the more appropriate metric here, since I'm okay with false positives.
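
To make the accuracy-versus-recall point concrete, here is a toy example with invented labels and probabilities (2 fans out of 100 users). It is only a sketch of the thresholding idea, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: only 2 of 100 users like the author.
y_true = np.array([1, 1] + [0] * 98)

# Made-up predicted probabilities of the positive class.
proba = np.array([0.30, 0.10] + [0.20] * 3 + [0.01] * 95)

# Default behaviour: predict 1 only when the probability reaches 0.5.
default_pred = (proba >= 0.5).astype(int)

# Custom rule from the text: mean plus one standard deviation.
threshold = proba.mean() + proba.std()
custom_pred = (proba >= threshold).astype(int)

for name, pred in [("default", default_pred), ("custom", custom_pred)]:
    print(name, accuracy_score(y_true, pred), recall_score(y_true, pred))
```

On this toy data the default rule scores 98% accuracy while finding neither fan, and the custom rule gives up one point of accuracy to find both.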

In [None]:
recall_score(ytest, pred_custom)

In [None]:
# Compare it to my original predictions. 
recall_score(ytest, pred)

Much better!

In [None]:
# Use x.columns: the model was trained on x, which has fewer columns than df
atwood = pd.DataFrame({'feature': x.columns, 'importance': clf.feature_importances_})

atwood.sort_values('importance', ascending=False).head(10)

We went off on a tangent. The original question was whether I'd like Margaret Atwood.

In [None]:
pred_me = clf.predict_proba(x.iloc[[-1]])  # double brackets keep the row as a one-row DataFrame

In [None]:
pred_me[0,1]

Apparently Margaret Atwood isn't for me. In fact, the model was bold enough to assign a 0% probability! Which makes me wonder -- is it assigning binary labels to everyone?

In [None]:
clf.predict_proba(x)[:,1][:5]

Nope, it just really thinks I won't like her. Thanks, model!

In [None]:
# This doesn't work very well.
# It just identifies things that are popular

def similar(item, df=df):
    similarity_dict = {}
    cnt = df[item].sum()
    
    for i in df.columns:
        cnt_i = df[i].sum()
        similarity = cnt_i/cnt
        similarity_dict[i] = similarity
        
    df_sim = pd.DataFrame([similarity_dict]).transpose()
    df_sim.columns = ['similarity']
    return df_sim.sort_values('similarity', ascending=False)
        
similar('kanye')
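
A measure based on overlap, rather than raw counts, would avoid rewarding things that are simply popular. Below is a rough sketch using Jaccard similarity (users who like both, divided by users who like either). The function name `jaccard_similar` and the toy columns are made up; I haven't run this on the real data.

```python
import pandas as pd


def jaccard_similar(item, frame):
    """Rank columns by Jaccard overlap with `item` (toy sketch)."""
    liked_item = frame[item] == 1
    scores = {}
    for col in frame.columns:
        if col == item:
            continue
        liked_col = frame[col] == 1
        both = (liked_item & liked_col).sum()
        either = (liked_item | liked_col).sum()
        scores[col] = both / either if either else 0.0
    return pd.Series(scores).sort_values(ascending=False)


# Invented 0/1 preferences for five users.
toy = pd.DataFrame({
    "kanye": [1, 1, 0, 0, 1],
    "jay":   [1, 1, 0, 0, 0],
    "sushi": [1, 1, 1, 1, 1],
    "banjo": [0, 0, 1, 1, 0],
})

result = jaccard_similar("kanye", toy)
print(result)
```

In this toy frame everyone likes sushi, yet "jay" still ranks first because its fans overlap most tightly with "kanye" fans.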