# Lab Assignment 2 for CSE 7324 Fall 2017

___Members___: Hongning Yu, Hui Jiang, Hao Pan

## 1. Business Understanding


## 2. Data Encoding
First, let's load the data into a dataframe. The data is already in a CSV file, but the lyrics are raw text in inconsistent formats. Our goal is to predict genre from lyrics, so we still need to clean the lyrics.

In [35]:
import pandas as pd
import nltk
import numpy as np
import string

pd.set_option('display.max_columns', 60)

In [36]:
df = pd.read_csv("./lyrics.csv", encoding="utf-8")
df.head()

| | index | song | year | artist | genre | lyrics |
|---|---|---|---|---|---|---|
| 0 | 0 | ego-remix | 2009 | beyonce-knowles | Pop | Oh baby, how you doing?\nYou know I'm gonna cu... |
| 1 | 1 | then-tell-me | 2009 | beyonce-knowles | Pop | playin' everything so easy,\nit's like you see... |
| 2 | 2 | honesty | 2009 | beyonce-knowles | Pop | If you search\nFor tenderness\nIt isn't hard t... |
| 3 | 3 | you-are-my-rock | 2009 | beyonce-knowles | Pop | Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote... |
| 4 | 4 | black-culture | 2009 | beyonce-knowles | Pop | Party the people, the people the party it's po... |


### check null values in dataset.

In [37]:
df.isnull().sum()

index         0
song          2
year          0
artist        0
genre         0
lyrics    95680
dtype: int64

Looks like there are null values in lyrics and song. Just drop them.

In [38]:
df.dropna(inplace=True)
df.isnull().sum()

index     0
song      0
year      0
artist    0
genre     0
lyrics    0
dtype: int64

### check genre

In [39]:
df.genre.value_counts()

Rock             109235
Pop               40466
Hip-Hop           24850
Not Available     23941
Metal             23759
Country           14387
Jazz               7970
Electronic         7966
Other              5189
R&B                3401
Indie              3149
Folk               2243
Name: genre, dtype: int64

As we can see, some genres have way more records than others. For our genre-predicting classification problem, we could sample the dataset and choose subsets of some genres to avoid bias. But let's now keep it as it is and deal with this later.

Check the dataframe summary after dropping nulls:

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 266556 entries, 0 to 362236
Data columns (total 6 columns):
index     266556 non-null int64
song      266556 non-null object
year      266556 non-null int64
artist    266556 non-null object
genre     266556 non-null object
lyrics    266556 non-null object
dtypes: int64(2), object(4)
memory usage: 14.2+ MB


### 2.1 Read in data and check data quality

### Change to ASCII
First, let's try to strip all non-ASCII characters, since we only want English characters.

**Takes too much time** when done row by row, so the cell below is commented out.

In [41]:
# %%time
# import re
# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = df.loc[row, 'lyrics'].encode('ascii', errors='ignore').decode()

# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = re.sub(r'[^\x00-\x7f]',
#                                    r'', 
#                                    df.loc[row, 'lyrics']) 
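The row-by-row `.loc` assignment above is the slow part. A minimal sketch of the same cleanup as a plain function, assuming it is applied to the whole column at once (the helper name `strip_non_ascii` is hypothetical and the sample string is made up):

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range from a string."""
    return text.encode('ascii', errors='ignore').decode('ascii')


sample = "Beyoncé - Intro:\nBienvenue là où l'air est pur"
print(strip_non_ascii(sample))
```

With the dataframe from this notebook, something like `df['lyrics'] = df['lyrics'].map(strip_non_ascii)` should avoid the per-row `.loc` lookups, although this has not been timed on the full dataset.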

### English Filter
We want to focus on songs with English lyrics, so let's delete all non-English records if they exist.

I tried to build an English-ratio detector to eliminate all non-English songs.
Reference: https://github.com/rasbt/musicmood/blob/master/code/collect_data/data_collection.ipynb

But the loop of set calculations **takes too much time**, partly because the English vocabulary set is rebuilt on every call. Needs improvement.

In [42]:
# %%time
# def eng_ratio(text):
#     ''' Returns the ratio of non-English to English words from a text '''

#     english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 
#     text_vocab = set(w.lower() for w in text.split('-') if w.lower().isalpha()) 
#     unusual = text_vocab.difference(english_vocab)
#     diff = len(unusual)/(len(text_vocab)+1)
#     return diff

    
# # first let's eliminate non-english songs by their names
# before = df.shape[0]
# for row_id in range(100):
#     text = df.loc[row_id]['song']
#     diff = eng_ratio(text)
#     if diff >= 0.5:
#         df = df[df.index != row_id]
# after = df.shape[0]
# rem = before - after
# print('%s have been removed.' %rem)
# print('%s songs remain in the dataset.' %after)
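One likely cause of the slowdown is that `english_vocab` is rebuilt inside `eng_ratio` on every call. A sketch that builds the vocabulary once and passes it in. The tiny vocabulary below is a stand-in; in the notebook it would presumably be `set(w.lower() for w in nltk.corpus.words.words())`.

```python
def eng_ratio(text, english_vocab, sep='-'):
    """Return the share of alphabetic tokens in `text` that are not in `english_vocab`."""
    tokens = {w.lower() for w in text.split(sep) if w.isalpha()}
    unusual = tokens - english_vocab
    # +1 in the denominator mirrors the original cell and avoids division by zero
    return len(unusual) / (len(tokens) + 1)


# Built once and reused for every song name (stand-in vocabulary, see lead-in).
english_vocab = {'then', 'tell', 'me', 'you', 'are', 'my', 'rock'}

print(eng_ratio('then-tell-me', english_vocab))   # every token is known
print(eng_ratio('que-facil-es', english_vocab))   # no token is known
```

Song names in this dataset are hyphen-separated, which is why the separator defaults to `'-'`; for lyrics a whitespace split would be the natural choice.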

Another approach uses the langid package from https://github.com/saffsd/langid.py, which detects the language considerably faster. Still, 260k records take around 50 minutes.

In [43]:
%%time
# package from https://github.com/saffsd/langid.py
import langid

before = df.shape[0]
for row in df.index:
    lang = langid.classify(df.loc[row]['lyrics'])[0]
    if lang != 'en':
        df = df[df.index != row]
after = df.shape[0]

rem = before - after
print('%s have been removed.' %rem)
print('%s songs remain in the dataset.' %after)

KeyboardInterrupt: 
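Part of the cost in the loop above is that `df = df[df.index != row]` rebuilds the whole dataframe for every rejected song. A sketch that classifies each lyric once and builds a boolean mask instead. `english_mask` and the stub classifier are hypothetical; the classifier is only assumed to return a `(language, score)` pair, as `langid.classify` does.

```python
def english_mask(texts, classify, target='en'):
    """Return one boolean per text: True when `classify` labels it as `target`."""
    return [classify(text)[0] == target for text in texts]


def stub_classify(text):
    """Toy stand-in for langid.classify so the sketch runs without the package."""
    return ('es', -1.0) if 'quiero' in text else ('en', -1.0)


lyrics = ['Oh baby, how you doing?', 'Que facil es decir te quiero']
mask = english_mask(lyrics, stub_classify)
print(mask)
```

In the notebook this would look roughly like `df = df[english_mask(df['lyrics'], langid.classify)]`, which filters the dataframe in a single pass. The classification itself still has to run on every record.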

**Final Solution**: restrict the vocabulary to my own English dictionary when creating the bag-of-words model, so that words outside the dictionary are ignored.

### Resampling
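The roughly 300k records easily run out of memory, so I resampled the dataset, drawing an equal number of songs from each genre. Sampling is done with replacement, so small genres can contribute the same song more than once.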

In [10]:
size =  2000       # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df = df.groupby('genre', as_index=False).apply(fn)

df = df.set_index('index').sort_index()
df.shape

(24000, 5)
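The sampling cell uses the global NumPy random state, so each run draws a different subset. A plain-Python sketch of the same per-genre sampling with a fixed seed (the helper `sample_per_group` and the toy genre list are illustrative, not part of the notebook):

```python
import random
from collections import defaultdict


def sample_per_group(labels, size, replace=True, seed=0):
    """Pick `size` positions per distinct label; returns a sorted list of positions."""
    rng = random.Random(seed)          # fixed seed so the sample is repeatable
    by_label = defaultdict(list)
    for pos, label in enumerate(labels):
        by_label[label].append(pos)

    chosen = []
    for positions in by_label.values():
        if replace:
            chosen.extend(rng.choices(positions, k=size))
        else:
            # sampling without replacement needs size <= len(positions)
            chosen.extend(rng.sample(positions, k=size))
    return sorted(chosen)


genres = ['Rock'] * 6 + ['Folk'] * 3
picked = sample_per_group(genres, size=2, replace=False)
print([genres[i] for i in picked])
```

With the dataframe, `df.iloc[picked]` would select the sampled rows. pandas 1.1 and later also offer `df.groupby('genre').sample(n=size, replace=True, random_state=0)` for the same purpose.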

### Add word_count

In [12]:
df['word_count'] = df['lyrics'].str.split().str.len()
df['word_count'].groupby(df['genre']).describe()

genre            
Country     count    2000.000000
            mean      187.739000
            std        84.260814
            min         1.000000
            25%       128.750000
            50%       172.000000
            75%       233.000000
            max       895.000000
Electronic  count    2000.000000
            mean      193.201000
            std       140.251937
            min         1.000000
            25%        99.000000
            50%       169.000000
            75%       256.000000
            max      1068.000000
Folk        count    2000.000000
            mean      178.983500
            std       113.399775
            min         1.000000
            25%       109.000000
            50%       165.000000
            75%       235.000000
            max      1274.000000
Hip-Hop     count    2000.000000
            mean      490.846000
            std       231.598001
            min         1.000000
            25%       333.750000
            50%       493

### Check the lyrics' quality

In [34]:
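# check lyrics with length of at most 20 characters
# note: the index contains duplicate labels after sampling with replacement, so
# df.loc[row]['lyrics'] can be a Series, and len() then counts rows, not characters
less_than_20 = 0
for row in df.index[:1000]:
    if len(df.loc[row]['lyrics'])<=20:
        print(df.loc[row]['lyrics'])
        less_than_20 += 1
print("Num of lyrics with length less than 20 in first 1000: {}".format(less_than_20))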

index
166    Tryna rain, tryna rain on the thunder\nTell th...
166    Tryna rain, tryna rain on the thunder\nTell th...
Name: lyrics, dtype: object
index
166    Tryna rain, tryna rain on the thunder\nTell th...
166    Tryna rain, tryna rain on the thunder\nTell th...
Name: lyrics, dtype: object
index
483    life me why preposessing\nsince very sin,game ...
483    life me why preposessing\nsince very sin,game ...
Name: lyrics, dtype: object
index
483    life me why preposessing\nsince very sin,game ...
483    life me why preposessing\nsince very sin,game ...
Name: lyrics, dtype: object
index
510    erased Ä± threw Ä± my destiny from\nyou from e...
510    erased Ä± threw Ä± my destiny from\nyou from e...
Name: lyrics, dtype: object
index
510    erased Ä± threw Ä± my destiny from\nyou from e...
510    erased Ä± threw Ä± my destiny from\nyou from e...
Name: lyrics, dtype: object
index
513    white sea evenings one different being\nespeci...
513    white sea evenings one different being\nes
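The printed Series above come from duplicated index labels: for those rows `len()` measures the number of matching rows rather than the lyric text. A minimal sketch of a check that looks at the text itself (`short_lyrics` is a hypothetical helper and the sample strings are made up):

```python
def short_lyrics(lyrics, max_chars=20):
    """Return the lyrics whose text has at most `max_chars` characters."""
    return [text for text in lyrics if len(text) <= max_chars]


sample = ['INSTRUMENTAL', 'la la la ' * 10, '']
found = short_lyrics(sample)
print("Num of lyrics with at most 20 characters: {}".format(len(found)))
```

On the dataframe, `df['lyrics'].str.len() <= 20` gives the equivalent boolean mask without a Python loop.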

It looks like many songs don't have meaningful lyrics (instrumental music, or something went wrong during crawling).

So we drop song records based on word count, aiming to keep lyrics with between 100 and 1000 words.

In [14]:
print("Deleting records with lyric length < 100 and > 1000")


len_before = df.shape[0]
df_clean = df[df['word_count'] >= 100]
# note: the next line filters `df` again, not `df_clean`, so only the <= 1000 bound takes effect
df_clean = df[df['word_count'] <= 1000]
len_after = df_clean.shape[0]

print("Before: {}\nAfter : {}\nDeleted: {}".format(len_before, len_after, len_before-len_after))

Deleting records with lyric length < 100 and > 1000
Before: 24000
After : 23959
Deleted: 41
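Only 41 records are removed, which is consistent with the second assignment overwriting the first so that the lower bound is never applied. A sketch that applies both bounds together (the helper `keep_by_word_count` and the sample counts are illustrative):

```python
def keep_by_word_count(word_counts, low=100, high=1000):
    """Return one boolean per count: True when low <= count <= high."""
    return [low <= n <= high for n in word_counts]


counts = [1, 99, 100, 493, 1000, 1274]
mask = keep_by_word_count(counts)
print(mask)
```

`df_clean = df[df['word_count'].between(100, 1000)]` is the pandas equivalent; `Series.between` includes both endpoints by default. Applying it would change the record counts reported in the rest of this notebook.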


### transfer lyrics to list 

In [15]:
x = df_clean['lyrics'].values
y = df_clean['genre'].values
print('Size of x: {}\nSize of y: {}'.format(x.size, y.size))

x = x.tolist()

x[1]

Size of x: 23959
Size of y: 23959


"Beyonc - Intro:\nBeyonce\nIAM\nWelcome\nWelcome\nHey\nOhhhh\nBeyonc - Chorus:\nWelcome to a place\nWhere people lie to your face\nJust to get it done.\nWelcome to the human race\nWhere if you ain't got money, then you'll pay in pain\nWelcome to this world of ours\nAnd if you had the chance would you come back again ?\nCause now you're here there ain't no turning back,\nYou got tears in your eyes\nAnd the monkey on your back (?)*\nAkhenaton - Verse 1:\nBienvenue l o l'fort tue le faible, o la faim tue de fait\nO la socit t'congratule, et t'accepte une fois la fortune faite,\nO les dettes cumulent, o l'crdit accule, des tas d'foyers,\nO les hommes en perte d'idaux, jurent qu'par les putes dvoyes en vido,\nChaque soir c'est la fte, les salons cossus attirent les fesses,\nSur les gros cads, qui arrosent les guridons de pure cocane,\nIci, les gosses rvent d't'Pirs, beaucoup partent et peu qui restent,\nNombreux sont ceux qui ds 13 ans connaissent leur premire ivresse,\nBienvenue l o l'air 

### average sentence length for each lyric
Compute the average line length (in words) of each song. Lines are split on `\n`, so this has to happen before the newlines are removed.

reference: https://stackoverflow.com/questions/13970203/how-to-count-average-sentence-length-in-words-from-a-text-file-contains-100-se

In [16]:
def count_sentence_len(lyric):
    """count average sentence len for a lyric"""
    sents_list = lyric.split('\n')
    avg_len = sum(len(x.split()) for x in sents_list) / len(sents_list)
    return avg_len


sentence_length_avg = []
x_clean = []

# compute the average line length of every lyric, store in sentence_length_avg
# remove all punctuation and \n, save to x_clean
translator = str.maketrans('', '', string.punctuation)
for l in x:
    l = l.translate(translator)
    sentence_len = count_sentence_len(l)
    
    l = l.replace('\n', ' ')
    
    sentence_length_avg.append(sentence_len)
    x_clean.append(l)

In [17]:
# randomly print 10 lyrics
import random
for i in random.sample(range(len(x_clean)), 10):
    print(x_clean[i])
    print("=============================")

Que facil es decir te quiero cuando nada es cierto Dificil es cuando recibes un fuerte desprecio Una ilusion es la que tu le haz dejado a mi corazon y mis palabras se las ah llevado el viento Tambien dije que aunque tu nunca nunca me quisiste Yo te lleve hacia el camino que nunca emprendiste Yo te ensee para que hoy digas que otro fue elque te enseo el camino del amor que conmigo nunca lo hiciste tarde entendi tu me engaaste y me traisionate pero yo no voy a llorar por que tu me dejaste mi corazon esta feliz por esas noche de placer y amor que con pasion tu a mi me entregaste
Feel the adrenaline moving under my skin Its an addiction such an eruption Sound is my remedy feeding me energy Music is all I need Baby I just wanna dance I dont really care I just wanna dance I dont really care care care Feel it in the air yeah Shes been a crazy dita disco fever and you wonder Whos that chick Whos that chick Too cold for you to keep her Too hot for you to leave her Whos that chick Whos that chic

### 2.2 Removing stop words
The nltk package has a built-in list of stop words. However, when we convert the text to a bag-of-words model, stop words can be removed automatically by the vectorizer.

In [18]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

### 2.3 Bag-of-words representation

Here I used the dictionary from https://github.com/eclarson/MachineLearningNotebooks/tree/master/data

In [19]:
with open('./ospd.txt', encoding='utf-8', errors='ignore') as f1:
    vocab1 = f1.read().split("\n")

print(len(vocab1))

79340


In [20]:
from sklearn.feature_extraction.text import CountVectorizer



# CountVectorizer can automatically convert words to lower case
cv = CountVectorizer(stop_words='english',
                    encoding='utf-8',
                    lowercase=True,
                    vocabulary=vocab1)

bag_words = cv.fit_transform(x_clean)

print('Shape of bag words: {}'.format(bag_words.shape))
print("Length of Vocabulary: {}".format(len(cv.vocabulary_)))

Shape of bag words: (23959, 79340)
Length of Vocabulary: 79340
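To make the effect of the fixed vocabulary concrete, here is a plain-Python sketch of the counting step. It is not how `CountVectorizer` is implemented: whitespace tokenization is a simplification, and the tiny vocabulary, stop list, documents, and `bag_of_words` helper are made up for illustration.

```python
from collections import Counter


def bag_of_words(doc, vocabulary, stop_words=frozenset()):
    """Count lower-cased tokens of `doc` that are in `vocabulary` and not stop words."""
    tokens = doc.lower().split()
    return Counter(t for t in tokens if t in vocabulary and t not in stop_words)


vocabulary = {'baby', 'dance', 'heart', 'the'}
stop_words = {'the'}
docs = ['Baby I just wanna dance', 'Que facil es decir te quiero', 'The heart the dance']

counts = [bag_of_words(d, vocabulary, stop_words) for d in docs]
totals = sum(counts, Counter())   # corpus-wide word frequencies

print(counts[1])   # the Spanish line has no in-vocabulary tokens
print(totals)
```

Documents with no in-vocabulary words end up as all-zero rows, which is how the dictionary sidesteps the language filter. For the real matrix, `np.asarray(bag_words.sum(axis=0)).ravel()` yields the corpus-wide totals directly from the sparse result, which should need far less memory than a dense 23959 x 79340 array.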


Let's create a pandas dataframe containing the bag-of-words (BoW) model.

In [21]:
df_bow = pd.DataFrame(data=bag_words.toarray(),columns=cv.get_feature_names())
df_bow

Unnamed: 0,aa,aah,aahed,aahing,aahs,aal,aalii,aaliis,aals,aardvark,aardwolf,aargh,aarrgh,aarrghh,aas,aasvogel,aba,abaca,abacas,abaci,aback,abacus,abacuses,abaft,abaka,abakas,abalone,abalones,abamp,abampere,...,zygoid,zygoma,zygomas,zygomata,zygose,zygoses,zygosis,zygosity,zygote,zygotene,zygotes,zygotic,zymase,zymases,zyme,zymes,zymogen,zymogene,zymogens,zymogram,zymology,zymosan,zymosans,zymoses,zymosis,zymotic,zymurgy,zyzzyva,zyzzyvas,Unnamed: 61
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
word_freq = df_bow.sum().sort_values()

https://github.com/dwyl/english-words

In [23]:
word_freq[-100:]

big         2780
sun         2817
thought     2838
wrong       2881
hand        2918
walk        2947
high        2952
dream       2983
end         3005
hes         3032
el          3059
lost        3074
god         3109
boy         3225
place       3246
going       3269
fall        3293
fuck        3327
real        3334
people      3347
em          3362
true        3386
believe     3445
money       3465
inside      3468
id          3475
hard        3521
turn        3530
shes        3579
stay        3624
           ...  
world       7019
girl        7780
man         7985
tell        8042
night       8203
day         8577
need        8768
right       8964
away        9149
la          9359
feel        9558
heart       9698
life        9703
yeah       11337
ill        11765
say        11817
cause      12018
make       12077
way        12344
let        12396
want       13125
come       13499
baby       14688
time       15713
got        18504
oh         19006
just       25019
like       265

### 2.4 Tf-idf representation

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english',
                             encoding='utf-8',
                             lowercase=True,
                             vocabulary=vocab1)

tfidf_mat = tfidf_vect.fit_transform(x_clean)

print('Shape of bag words: {}'.format(tfidf_mat.shape))
print("Length of Vocabulary: {}".format(len(tfidf_vect.vocabulary_)))

Shape of bag words: (23959, 79340)
Length of Vocabulary: 79340
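As a reminder of what the weights mean, the sketch below computes tf-idf by hand using a smoothed idf and L2 row normalization, which should correspond to the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). The `tfidf` helper and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs_tokens):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs_tokens)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in docs_tokens:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


docs = [['baby', 'baby', 'dance'], ['heart', 'dance']]
rows = tfidf(docs)
for row in rows:
    print(row)
```

In the second toy document, `heart` receives a higher weight than `dance` because `dance` occurs in both documents. The same effect down-weights words such as `like` or `just` that appear in most songs.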


In [25]:
del df_bow

df_tfidf = pd.DataFrame(data=tfidf_mat.toarray(),columns=tfidf_vect.get_feature_names())

In [26]:
df_tfidf

Unnamed: 0,aa,aah,aahed,aahing,aahs,aal,aalii,aaliis,aals,aardvark,aardwolf,aargh,aarrgh,aarrghh,aas,aasvogel,aba,abaca,abacas,abaci,aback,abacus,abacuses,abaft,abaka,abakas,abalone,abalones,abamp,abampere,...,zygoid,zygoma,zygomas,zygomata,zygose,zygoses,zygosis,zygosity,zygote,zygotene,zygotes,zygotic,zymase,zymases,zyme,zymes,zymogen,zymogene,zymogens,zymogram,zymology,zymosan,zymosans,zymoses,zymosis,zymotic,zymurgy,zyzzyva,zyzzyvas,Unnamed: 61
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# word_score = df_tfidf.sum().sort_values()