In [96]:
# Import all of the things you need to import!
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the **TF** and **IDF** parts make a little sense. Kind of. Somewhat.

No, just kidding, we're *professionals* now.

## Investigating the Congressional Record

The [Congressional Record](https://en.wikipedia.org/wiki/Congressional_Record) is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from [this page here](http://www.cs.cornell.edu/home/llee/data/convote.html).

In [1]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0  8339k      0  0:00:01  0:00:01 --:--:-- 8346k


In [2]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz
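
If shell commands are not available (on some Windows setups, for instance), the extraction step can be done with Python's standard library instead. This is only a sketch: `extract_archive` is a made-up helper name, and the demo runs on a tiny archive built on the fly rather than on the real `convote_v1.1.tar.gz`.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Build a tiny throwaway archive so the demo does not download anything.
work_dir = tempfile.mkdtemp()
sample_path = os.path.join(work_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("mr. speaker , i yield .\n")
archive_path = os.path.join(work_dir, "sample.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(sample_path, arcname="sample.txt")

# Extract it into a fresh folder and list what came out.
out_dir = os.path.join(work_dir, "out")
os.makedirs(out_dir)
extracted = extract_archive(archive_path, out_dir)
print(extracted)
```

For the real data, the equivalent call would be `extract_archive('convote_v1.1.tar.gz')`.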

You can explore the files if you'd like, but we're going to get the ones from `convote_v1.1/data_stage_one/development_set/`. It's a bunch of text files.

In [3]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]

['convote_v1.1/data_stage_one/development_set/052_400095_1479080_ROY.txt',
 'convote_v1.1/data_stage_one/development_set/493_400189_2243032_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/421_400333_2010010_DON.txt',
 'convote_v1.1/data_stage_one/development_set/199_400300_2013031_DON.txt']
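
One caveat worth flagging: `glob.glob` returns paths in arbitrary order, so the row numbers quoted in this notebook (speech 375, speech 294 and so on) may come out differently on another machine. A minimal sketch, assuming you want stable row numbers, is to sort the paths first. The temporary folder and file names below are invented for the demo.

```python
import glob
import os
import tempfile

# Build a throwaway folder with a few fake speech files (hypothetical names).
demo_dir = tempfile.mkdtemp()
for name in ["b_speech.txt", "a_speech.txt", "c_speech.txt"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("mr. chairman , i yield .\n")

# sorted() makes the order deterministic, whatever the filesystem hands back.
demo_paths = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))
demo_names = [os.path.basename(p) for p in demo_paths]
print(demo_names)
```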

In [4]:
len(paths)

702

So great, we have 702 of them. Now let's import them.

In [6]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()

Unnamed: 0,content,filename,pathname
0,"mr. chairman , i yield myself such time as i m...",052_400095_1479080_ROY.txt,convote_v1.1/data_stage_one/development_set/05...
1,i yield to the gentleman from texas . \n,493_400189_2243032_DON.txt,convote_v1.1/data_stage_one/development_set/49...
2,"mr. chairman , i do not have it on the top of ...",052_400011_1479046_DON.txt,convote_v1.1/data_stage_one/development_set/05...
3,"mr. speaker , i yield 3 minutes to the gentlem...",421_400333_2010010_DON.txt,convote_v1.1/data_stage_one/development_set/42...
4,"mr. speaker , let me conclude on this side by ...",199_400300_2013031_DON.txt,convote_v1.1/data_stage_one/development_set/19...


In class we had the `texts` variable. For the homework you can just use `speeches_df['content']` to get the same sort of list of texts.

**Take a look at the contents of the first 5 speeches**

In [7]:
speeches_df['content'].head(5)

0    mr. chairman , i yield myself such time as i m...
1             i yield to the gentleman from texas . \n
2    mr. chairman , i do not have it on the top of ...
3    mr. speaker , i yield 3 minutes to the gentlem...
4    mr. speaker , let me conclude on this side by ...
Name: content, dtype: object

# Doing our analysis

Use the `sklearn` package and a plain boring `CountVectorizer` to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

**Be sure to filter out English-language stopwords**
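
To make the stopword step less magical, here is a rough pure-Python approximation of what a counting vectorizer does to one speech: lowercase it, pull out tokens of two or more word characters (the regex is scikit-learn's default `token_pattern`), drop the stopwords, and count what is left. The stopword set here is a tiny made-up sample; the real `'english'` list is much longer.

```python
import re
from collections import Counter

# A tiny, made-up stopword list for illustration only.
TOY_STOP_WORDS = {"the", "to", "from", "such", "as", "may", "myself"}

# Tokens are runs of two or more word characters, so a lone "i" never counts.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, tokenize, drop stopwords, and count the remaining tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(t for t in tokens if t not in stop_words)

counts = count_tokens("mr. chairman , i yield myself such time as i may consume .")
print(counts)
```

Each surviving token becomes one column of the vectorizer's output, and each speech becomes one row of counts.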

In [34]:
count_vectorizer = CountVectorizer(stop_words='english')

In [35]:
X=count_vectorizer.fit_transform(speeches_df['content'])

In [36]:
len(count_vectorizer.get_feature_names())

9106

In [25]:
tokens_df=pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())
tokens_df

Unnamed: 0,000,00007,018,050,092,10,100,106,107,108,...,youngsters,youth,yuan,zero,zeroing,zeros,zigler,zirkin,zoe,zoellick
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Okay, it's **far** too big to even look at. Let's try to get a list of features from a new `CountVectorizer` that only takes the top 100 words.

In [40]:
count_vectorizer = CountVectorizer(stop_words='english', max_features=100)

In [41]:
X=count_vectorizer.fit_transform(speeches_df['content'])
len(count_vectorizer.get_feature_names())

100

Now let's push all of that into a dataframe with nicely named columns.

In [42]:
tokens_df=pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())
tokens_df

Unnamed: 0,000,11,act,allow,amendment,america,american,amp,association,balance,...,trade,united,urge,vote,want,way,work,year,years,yield
0,2,0,0,0,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,1,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0,0,0,0,1,0,2,0,0,0,...,0,2,1,1,0,0,0,4,0,0
5,0,0,0,0,2,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,0
6,0,1,2,0,0,0,0,0,0,0,...,0,0,1,2,2,2,2,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,2,0,3,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
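
A note for anyone rerunning this on a recent scikit-learn: the cells here call `get_feature_names()`, which matches the older version this notebook was run on. As far as I know, that method was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of `get_feature_names_out()`. The sketch below uses a made-up helper, `feature_names`, that works with either, and a toy corpus to show that `max_features` keeps the most frequent terms across the whole corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the vocabulary as a list on both old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()

# Toy stand-in for speeches_df['content'].
toy_speeches = [
    "mr chairman i yield time",
    "mr speaker i yield",
    "mr chairman trade with china",
]

# Keep only the 3 terms with the highest total count across all speeches.
toy_vectorizer = CountVectorizer(stop_words="english", max_features=3)
toy_X = toy_vectorizer.fit_transform(toy_speeches)
toy_df = pd.DataFrame(toy_X.toarray(), columns=feature_names(toy_vectorizer))
print(toy_df)
```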


Everyone seems to start their speeches with "mr chairman". How many speeches are there in total? How many don't mention "chairman"? And how many mention neither "mr" nor "chairman"?

In [44]:
# 702 rows means 702 speeches, since each speech is a single string
len(tokens_df)

702

In [53]:
# if the speech doesn't mention chairman, the column entry will be 0. so, 250 no-chairman speeches. granted,
# we have no idea if they started the speech with chairman or just mentioned him somewhere
len(tokens_df[tokens_df['chairman']==0])

250

In [66]:
# 76 speeches mention neither 'mr' nor 'chairman'. that leaves 250 - 76 = 174 that say 'mr' without
# 'chairman' (think 'mr. speaker'), so skipping the chairman isn't necessarily rude!
len(tokens_df[(tokens_df['mr']==0) & (tokens_df['chairman']==0)])

76
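
The same counts can be read straight off the boolean masks, since `True` sums as 1. A small sketch on made-up numbers (the toy frame stands in for `tokens_df`) also shows how the three counts relate: speeches without "chairman" split into those with "mr" and those with neither.

```python
import pandas as pd

# Toy stand-in for tokens_df with just the two columns of interest.
toy_tokens_df = pd.DataFrame({
    "mr":       [2, 1, 0, 1, 0],
    "chairman": [1, 0, 0, 2, 0],
})

# Summing a boolean mask counts the rows where the condition holds.
no_chairman = (toy_tokens_df["chairman"] == 0).sum()
neither = ((toy_tokens_df["mr"] == 0) & (toy_tokens_df["chairman"] == 0)).sum()
mr_without_chairman = ((toy_tokens_df["mr"] > 0) & (toy_tokens_df["chairman"] == 0)).sum()

print(no_chairman, neither, mr_without_chairman)
```

On the real data the same bookkeeping gives 250 - 76 = 174 speeches that say "mr" but never "chairman".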

What is the index of the speech which is the most thankful, a.k.a. includes the word 'thank' the most times?

In [80]:
# so speech index 375
tokens_df[tokens_df['thank']==tokens_df['thank'].max()].index

Int64Index([375], dtype='int64')
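
A shorter route to the same answer is `idxmax()`, which returns the index label of the first maximum. The filter-on-max approach used above has one advantage: it also reveals ties. A quick sketch on a made-up column:

```python
import pandas as pd

# Toy stand-in for tokens_df['thank'], with a deliberate tie at the top.
toy_tokens_df = pd.DataFrame({"thank": [0, 3, 1, 3]})

# idxmax() gives only the first row that reaches the maximum.
first_max = toy_tokens_df["thank"].idxmax()

# Comparing against max() keeps every row that ties for the maximum.
is_max = toy_tokens_df["thank"] == toy_tokens_df["thank"].max()
all_max = toy_tokens_df.index[is_max].tolist()

print(first_max, all_max)
```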

In [82]:
# lets look at the speech
speeches_df['content'][375]

"mr. chairman , i just wanted to remind the house that faith-based organizations can and do sponsor federally funded head start programs . \nany sponsor who will agree not to discriminate in employment , if they can sponsor a program with the discrimination amendment , they can sponsor the program without that amendment if they would agree not to discriminate . \nwhat we are talking about is discrimination . \nsome people want to discriminate against catholics , jews , muslims , african americans . \nwe had this discussion in the 1960s , and the consensus back then was that discrimination in employment was so offensive that we made it illegal . \nthe victim needs to be protected and the weight of the federal government will fall down on the side of the victim . \nthe vote was not unanimous . \nsome people did not like it then ; they do not like it now . \nand we are discussing where should the weight of the government be , with the victim or with somebody trying to discriminate . \nthi

In [None]:
# wow that was long. but its the most thankful one, so whatever.

If I'm searching for `China` and `trade`, what are the top 3 speeches to read according to the `CountVectorizer`?

In [94]:
# i sorted by china here, on a lark. let's see if it holds true if i sort by trade
tokens_df[(tokens_df['china']>0) & (tokens_df['trade']>0)].sort_values(by='china', ascending=False)[['china', 'trade']].head(3)

Unnamed: 0,china,trade
294,29,63
27,27,9
267,16,5


In [95]:
tokens_df[(tokens_df['china']>0) & (tokens_df['trade']>0)].sort_values(by='trade', ascending=False)[['china', 'trade']].head(3)

Unnamed: 0,china,trade
294,29,63
136,5,21
45,1,18
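
Sorting by one term at a time gives two different top-3 lists. One possible way to get a single ranking (an assumption on my part, not the only reasonable choice) is to sort by the combined count. The first three rows below are copied from the outputs above; the row labelled 999 is invented to show a speech being filtered out for lacking one of the terms.

```python
import pandas as pd

# Rows 294, 27 and 136 come from the outputs above; row 999 is made up.
toy_counts = pd.DataFrame(
    {"china": [29, 27, 5, 0], "trade": [63, 9, 21, 40]},
    index=[294, 27, 136, 999],
)

# Keep speeches that mention both terms, then rank by the combined count.
both = toy_counts[(toy_counts["china"] > 0) & (toy_counts["trade"] > 0)]
ranked = both.assign(total=both["china"] + both["trade"])
ranked = ranked.sort_values(by="total", ascending=False)
print(ranked)
```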


In [110]:
# kind of! at any rate, speech 294 seems to be the most china- and trade-related. let's look at it!
speeches_df['content'][294]

"mr. speaker , i yield myself such time as i may consume . \nmr. speaker , this is a huge week for the congress , a big week for the house of representatives . \nwe are passing out major postal reform for the first time in years , a highway bill that has been in the making for over 2 congresses now , an energy conference report that has also been in the making for over 2 congresses now ; the opportunity to have at least one and perhaps as many as three appropriations conference reports behind us as we enter the august district work period ; and a central american free trade agreement , as well as a bill that gets tough with china , that finally holds our administration 's feet and the feet of , either party 's feet to the fire , and requires that they monitor and enforce the existing trade agreements that have been enacted by this congress . \nthis bill has been called a smoke screen , it has been called a fig leaf , it has been called a number of demeaning terms . \nbut at the end of 

In [None]:
# thats another super long speech, but it does seem to mostly be about trade and china.

Now what if I'm using a `TfidfVectorizer`?

In [108]:
l2_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = l2_vectorizer.fit_transform(speeches_df['content'])
tfidf_tokens_df = pd.DataFrame(X.toarray(), columns=l2_vectorizer.get_feature_names())
china_trade_df = tfidf_tokens_df[['china', 'trade']].copy()
china_trade_df['china + trade'] = china_trade_df['china'] + china_trade_df['trade']
china_trade_df[china_trade_df.any(axis=1)].sort_values(by='china + trade', ascending=False).head(3)

Unnamed: 0,china,trade,china + trade
636,0.438664,0.470697,0.909362
447,0.516963,0.346696,0.863658
690,0.418276,0.439404,0.85768


In [None]:
# wow, that comes up with a totally different list of speeches. let's look at speech 636

In [109]:
speeches_df['content'][636]

"madam speaker , i rise today in opposition to h.r. 3283 , the so-called united states trade rights enforcement act . \nthis bill purports to address china 's lax enforcement of its international trade obligations . \nin fact , this bill does little to address serious trade issues with china , and it is on the house floor for only one reason : to garner votes for cafta later this week . \nthere is no question that congress should do everything in its power to enforce trade rights worldwide . \nhowever , giving lip service to an issue that deserves our careful consideration and strong action is a grave disservice to the american people . \nwhat we should be talking about today is the bush administration 's continued failure to decrease our trade deficits and promote labor rights , environmental standards and public health protections with our trading partners . \nlet 's look at the facts : in 2004 , the u.s. trade deficit with china grew to a record $ 162 billion . \nthis despite the fa

In [None]:
# that one is very short by comparison and this time, its really only about china and trade
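
Why does a short speech beat the long one that says "china" and "trade" far more often? By default `TfidfVectorizer` scales every row to unit length (L2 norm), so a term's weight reflects its share of the speech rather than its raw count. Below is a pure-Python sketch of the default formula as I understand it, `tf * (ln((1 + n) / (1 + df)) + 1)` followed by row normalisation, on an invented three-document corpus.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised per row."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = [
    "china trade china trade",                                   # short, all on topic
    "china trade " * 3 + "highway energy postal budget " * 10,   # long, mostly other things
    "horses park police",
]
rows = tfidf_rows(toy_docs)

# Raw counts favour the long document...
raw_short = toy_docs[0].split().count("china") + toy_docs[0].split().count("trade")
raw_long = toy_docs[1].split().count("china") + toy_docs[1].split().count("trade")

# ...but the normalised tf-idf weights favour the short, focused one.
short_score = rows[0]["china"] + rows[0]["trade"]
long_score = rows[1]["china"] + rows[1]["trade"]
print(raw_short, raw_long, short_score, long_score)
```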

**What's the content of the speeches?** Here's a way to get them:

In [111]:
# index 0 is the first speech, which was the first one imported.
paths[0]

'convote_v1.1/data_stage_one/development_set/052_400095_1479080_ROY.txt'

In [112]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}

mr. chairman , i yield myself such time as i may consume . 
mr. chairman , i heard my colleague from virginia say the cost is now up to three quarters of a million dollars . 
i do not think we are getting rid of the police officers ; i think we are just moving the five horses . 
their salaries , i think , would be fungible . 
so i do not think you can count that . 
as far as being something we do not need because the park police are already out there with their horses , let me state that the capitol grounds are statutorily defined , and because of that the park police do not have jurisdictions over the capitol grounds , it is my understanding . 
this program has only been in existence and operational since may of 2004 . 
the gao study , as the chairman stated , said that it is hard for them to quantify the benefits of the horse patrol because the performance measures are evolving , he failed to say the rest of it , and that data is still being collected on these measures . 
so 

In [None]:
# i guess i probably should have read ahead to this part. oh well, dumping the index of the speeches_df 
# was still mostly readable

**Now search for something else!** Another two terms that might show up. `elections` and `chaos`? Whatever you think might be interesting.

In [128]:
election_chaos_df = tfidf_tokens_df[['election', 'chaos']].copy()
election_chaos_df['election + chaos'] = election_chaos_df['election'] + election_chaos_df['chaos']
election_chaos_df[election_chaos_df.any(axis=1)].sort_values(by='chaos', ascending=False).head(10)

Unnamed: 0,election,chaos,election + chaos
257,0.051012,0.078108,0.12912
382,0.044475,0.068098,0.112573
701,0.148767,0.045557,0.194324
467,0.065667,0.0,0.065667
352,0.076376,0.0,0.076376
424,0.179072,0.0,0.179072
426,0.114181,0.0,0.114181
459,0.220375,0.0,0.220375
469,0.105802,0.0,0.105802
302,0.032576,0.0,0.032576


In [129]:
# i did the sort that way because i guess they dont talk about chaos much. lets look at that speech
!cat {paths[257]}

mr. chairman , i yield myself 45 seconds . 
mr. chairman , this is about chaos and confusion . 
there is no definition of how the announcement will go out to the people beyond the beltway . 
a mere extending from 2 days to 5 days to make sure that americans , even in crisis , have due process and democracy and justice is not too much to ask . 
i would indulge and beg my colleagues to realize all this does is simply allow for the people of america in crisis to be represented and to be responded to . 
mr. chairman , i yield 30 seconds to the gentlewoman from california ( ms. millender-mcdonald ) xz4002750 , the ranking member of the committee on house administration . 
ms. millender-mcdonald . 
mr. chairman , i rise in strong support of the jackson-lee amendment . 
a portion of the gentlewoman 's amendment seeks to provide an expedited appeals process to the united states district court for matters arising out of the special election process . 
we have been talking about this 44

In [146]:
# that's pretty weak. i tried to come up with something spicier but i can't tell what years these are from.
# how about this?
clinton_welfare_df = tfidf_tokens_df[['clinton', 'welfare']].copy()
clinton_welfare_df['clinton + welfare'] = clinton_welfare_df['clinton'] + clinton_welfare_df['welfare']
clinton_welfare_df[clinton_welfare_df.any(axis=1)].sort_values(by='clinton + welfare', ascending=False).head(10)

Unnamed: 0,clinton,welfare,clinton + welfare
346,0.498457,0.18248,0.680938
214,0.418044,0.229563,0.647607
107,0.314363,0.115085,0.429447
356,0.172899,0.047472,0.220371
67,0.177291,0.0,0.177291
96,0.129783,0.0,0.129783
31,0.053955,0.059258,0.113213
612,0.107158,0.0,0.107158
402,0.076199,0.0,0.076199
560,0.0,0.050511,0.050511


In [139]:
!cat {paths[346]}

mr. chairman , in the original welfare reform bill by president clinton , this provision was never in it . 
second , it was unconstitutional , and it was never promulgated by president clinton in the rulemaking . 
he does not support that provision . 
if you want to support something that president clinton believed in , then try fiscal responsibility and start balancing the budget . 
this is not what he believes , and the gentleman from ohio knows that , mr. chairman . 


In [None]:
# still pretty dull. oh well.
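
The same build-sum-filter-sort pattern shows up in every search above, so it could be wrapped in a function. This is a sketch under my own naming (`rank_by_terms` is hypothetical), with one extra guard: a term that never made it into the vocabulary raises a readable error instead of a bare `KeyError` from pandas. The toy weights stand in for `tfidf_tokens_df`.

```python
import pandas as pd

def rank_by_terms(weights_df, terms, n=3):
    """Rank rows of a term-weight table by the summed weight of `terms`."""
    missing = [t for t in terms if t not in weights_df.columns]
    if missing:
        raise KeyError("not in the vocabulary: {}".format(missing))
    total_name = " + ".join(terms)
    scores = weights_df[list(terms)].copy()
    scores[total_name] = scores.sum(axis=1)
    # Drop rows where none of the terms appear at all.
    scores = scores[scores.any(axis=1)]
    return scores.sort_values(by=total_name, ascending=False).head(n)

# Made-up weights, one row per document.
toy_weights = pd.DataFrame({
    "clinton": [0.5, 0.0, 0.1],
    "welfare": [0.2, 0.0, 0.3],
    "horses":  [0.0, 0.9, 0.0],
})
top = rank_by_terms(toy_weights, ["clinton", "welfare"], n=2)
print(top)
```

With the real table this would read `rank_by_terms(tfidf_tokens_df, ['china', 'trade'])`.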

# Enough of this garbage, let's cluster

Using a **simple counting vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency inverse document frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

In [150]:
from sklearn.cluster import KMeans

In [151]:
count_vectorizer = CountVectorizer(stop_words='english')
X=count_vectorizer.fit_transform(speeches_df['content'])
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=8, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [153]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = count_vectorizer.get_feature_names()
for i in range(number_of_clusters):
    # the five highest-weighted terms for this cluster's centroid
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_terms)))

Top terms per cluster:
Cluster 0: mr chairman time amendment gentleman
Cluster 1: head start religious rights civil
Cluster 2: nbsp amp lt gt trade
Cluster 3: association national restaurant contractors chamber
Cluster 4: rule 11 rules federal 420
Cluster 5: mr house people time trade
Cluster 6: start head children program amendment
Cluster 7: environmental justice agency executive epa
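
The `argsort()[:, ::-1]` line is doing the heavy lifting in that cell. Each row of `cluster_centers_` holds one weight per vocabulary term; `argsort` returns the column indices from smallest to largest weight, and `[:, ::-1]` reverses every row so the heaviest terms come first. A small sketch with invented centroids and a four-word vocabulary:

```python
import numpy as np

# Made-up vocabulary and centroids: one row per cluster, one column per term.
terms = ["chairman", "china", "horses", "trade"]
centers = np.array([
    [0.9, 0.0, 0.1, 0.2],
    [0.1, 0.8, 0.0, 0.7],
])

# Column indices sorted by weight, then flipped to run largest-first.
order = centers.argsort()[:, ::-1]

# Take the two heaviest terms for each cluster.
top_terms = [[terms[j] for j in row[:2]] for row in order]
print(top_terms)
```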


In [154]:
tf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=False)
X = tf_vectorizer.fit_transform(speeches_df['content'])
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=8, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [155]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tf_vectorizer.get_feature_names()
for i in range(number_of_clusters):
    # the five highest-weighted terms for this cluster's centroid
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_terms)))

Top terms per cluster:
Cluster 0: yield gentleman texas illinois wisconsin
Cluster 1: mr chairman yield gentleman minutes
Cluster 2: mr chairman amendment time gentleman
Cluster 3: start head children amendment program
Cluster 4: time mr chairman balance yield
Cluster 5: china trade speaker mr legislation
Cluster 6: mr speaker yield gentleman time
Cluster 7: horses wild mr chairman amendment


In [156]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=8, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [157]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names()
for i in range(number_of_clusters):
    # the five highest-weighted terms for this cluster's centroid
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_terms)))

Top terms per cluster:
Cluster 0: demand recorded vote mr speaker
Cluster 1: yield gentleman mr chairman minutes
Cluster 2: start head children program amendment
Cluster 3: mr amendment chairman time gentleman
Cluster 4: frivolous lawsuits civil religious federal
Cluster 5: claim consent opposition ask unanimous
Cluster 6: china trade speaker madam cafta
Cluster 7: balance time chairman reserve mr


**Which one do you think works the best?**

In [None]:
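# well, it's between the term-frequency results (wild horses) and the tf-idf results (frivolous lawsuits).
# i prefer wild horses, but tf-idf is probably the best: it pulls out real topics (china/cafta, head start,
# lawsuits) and shoves the procedural 'mr chairman i yield' boilerplate into clusters of its own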
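
A caution when comparing the three runs: the `KMeans` reprs above show `random_state=None`, so every rerun starts from different random centroids and the cluster numbers (and sometimes the contents) can change. A minimal sketch, assuming reproducible results are wanted, is to fix the seed. The toy speeches and the seed value are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for speeches_df['content'].
toy_speeches = [
    "china trade deficit china",
    "trade agreement with china",
    "head start children program",
    "children in head start",
    "wild horses amendment",
    "horses on public land",
]
toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_speeches)

def cluster_labels(seed):
    """Fit KMeans with a fixed seed and return the cluster label per speech."""
    km = KMeans(n_clusters=3, random_state=seed, n_init=10)
    return km.fit(toy_X).labels_.tolist()

# Same seed, same data: the two runs should agree label for label.
first_run = cluster_labels(seed=0)
second_run = cluster_labels(seed=0)
print(first_run, second_run)
```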

# Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out **the two types of Harry Potter fanfiction**. What is your hypothesis?

In [164]:
paths = glob.glob('hp/hp/*')
paths[:5]

['hp/hp/9586935.txt',
 'hp/hp/10608415.txt',
 'hp/hp/10608060.txt',
 'hp/hp/9973627.txt',
 'hp/hp/10602965.txt']

In [165]:
hp_fics = []
for path in paths:
    with open(path) as hp_file:
        hp_fic = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': hp_file.read()
        }
    hp_fics.append(hp_fic)
hp_fics_df = pd.DataFrame(hp_fics)
hp_fics_df.head()

Unnamed: 0,content,filename,pathname
0,"Hello, my name is Malcolm Hargreaves, but most...",9586935.txt,hp/hp/9586935.txt
1,I do not own Harry Potter or the Darren Shan s...,10608415.txt,hp/hp/10608415.txt
2,"This is my entry for the ""Three Prompts"" compe...",10608060.txt,hp/hp/10608060.txt
3,"Author's Notes: In my own, happy little world,...",9973627.txt,hp/hp/9973627.txt
4,"The roommates.Harry, Ron, Seamus, Dean, and ev...",10602965.txt,hp/hp/10602965.txt


In [166]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(hp_fics_df['content'])
number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [167]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names()
for i in range(number_of_clusters):
    # the five highest-weighted terms for this cluster's centroid
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_terms)))

Top terms per cluster:
Cluster 0: harry hermione draco said just
Cluster 1: lily james sirius remus said


In [None]:
# i would say people are either writing about harry and hermione or lily and james
# i've never read harry potter, but i think that means either harry's generation or his parents' generation?
# seems legit
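
One way to check a hypothesis like this is to see how many stories land in each cluster and then read a few from each. The fitted model exposes one label per document as `km.labels_`; the sketch below uses an invented label list and invented file names in its place, so it shows only the bookkeeping, not real results.

```python
import pandas as pd

# Toy stand-in for hp_fics_df; file names are made up.
toy_fics_df = pd.DataFrame({
    "filename": ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"],
})

# Hypothetical labels standing in for km.labels_ (one per document).
toy_labels = [0, 0, 1, 0, 1]
toy_fics_df["cluster"] = toy_labels

# How big is each cluster, and which files belong to it?
sizes = toy_fics_df["cluster"].value_counts().sort_index()
members = toy_fics_df.groupby("cluster")["filename"].apply(list)
print(sizes)
print(members)
```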