In [7]:
# Import all of the things you need to import!
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the **TF** and **IDF** parts make a little sense. Kind of. Somewhat.

No, just kidding, we're *professionals* now.
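
As a quick refresher, here is a rough by-hand sketch of what the **TF** and **IDF** parts compute. The three-sentence `docs` list is made up, and the plain `log(N / df)` formula is an assumption for illustration: sklearn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up mini corpus (not from the Congressional Record)
docs = [
    "trade with china",
    "trade policy and trade deals",
    "head start program",
]
tokenized = [doc.split() for doc in docs]

def term_frequency(term, tokens):
    # TF: share of a document's words that are this term
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, all_tokens):
    # IDF: log of (number of documents / documents containing the term)
    containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / containing)

scores = {}
for term in ["trade", "china"]:
    tf = term_frequency(term, tokenized[0])
    idf = inverse_document_frequency(term, tokenized)
    scores[term] = tf * idf
    print(term, round(tf, 3), round(idf, 3), round(tf * idf, 3))
```

Both words make up the same share of the first document, but `china` appears in fewer documents overall, so it ends up with the higher TF-IDF score.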

## Investigating the Congressional Record

The [Congressional Record](https://en.wikipedia.org/wiki/Congressional_Record) is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from [this page here](http://www.cs.cornell.edu/home/llee/data/convote.html).

In [2]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0  6136k      0  0:00:01  0:00:01 --:--:-- 6142k


In [3]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz

You can explore the files if you'd like, but we're going to get the ones from `convote_v1.1/data_stage_one/development_set/`. It's a bunch of text files.

In [4]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]

['convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327025_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327044_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479036_DON.txt']

In [5]:
len(paths)

702

So great, we have 702 of them. Now let's import them.

In [8]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()

| | content | filename | pathname |
|---|---|---|---|
| 0 | mr. chairman , i thank the gentlewoman for yie... | 052_400011_0327014_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 1 | mr. chairman , i want to thank my good friend ... | 052_400011_0327025_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 2 | mr. chairman , i rise to make two fundamental ... | 052_400011_0327044_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 3 | mr. chairman , reclaiming my time , let me mak... | 052_400011_0327046_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 4 | mr. chairman , i thank my distinguished collea... | 052_400011_1479036_DON.txt | convote_v1.1/data_stage_one/development_set/05... |


In class we had the `texts` variable. For the homework, you can just use `speeches_df['content']` to get the same sort of list of texts.

**Take a look at the contents of the first 5 speeches**

In [12]:
speech_num = 0
for speech in speeches_df['content'].head(5):
    speech_num += 1
    print(speech_num)
    print(speech)
    print('')

1
mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled by an un

# Doing our analysis

Use the `sklearn` package and a plain boring `CountVectorizer` to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

**Be sure to filter out English-language stopwords**

In [13]:
count_vectorizer = CountVectorizer(stop_words = 'english')

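In [17]:
# Run this after the fit_transform cell below: the vocabulary only exists once the vectorizer has been fitted
count_vectorizer.get_feature_names() 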

['000',
 '00007',
 '018',
 '050',
 '092',
 '10',
 '100',
 '106',
 '107',
 '108',
 '108th',
 '109th',
 '10th',
 '11',
 '110',
 '114',
 '117',
 '118',
 '11th',
 '12',
 '120',
 '121',
 '122',
 '123',
 '125',
 '128',
 '12898',
 '13',
 '13279',
 '1332',
 '1335',
 '1344',
 '135',
 '138',
 '14',
 '140',
 '143',
 '144',
 '145',
 '149',
 '1498',
 '14th',
 '15',
 '150',
 '1520',
 '153',
 '155',
 '159',
 '16',
 '160',
 '162',
 '163',
 '165',
 '1671',
 '1675',
 '17',
 '170',
 '1700',
 '174',
 '178',
 '1787',
 '17th',
 '18',
 '180',
 '1800',
 '1800s',
 '181',
 '1812',
 '1855',
 '186',
 '1868',
 '18th',
 '19',
 '190',
 '1907',
 '1922',
 '1927',
 '1930',
 '1940s',
 '1950s',
 '196',
 '1960',
 '1960s',
 '1964',
 '1965',
 '1967',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1976',
 '1979',
 '198',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '19th',
 '1st'

In [14]:
X = count_vectorizer.fit_transform(speeches_df['content'])

In [18]:
pd.DataFrame(X.toarray(), columns = count_vectorizer.get_feature_names()) 

Unnamed: 0,000,00007,018,050,092,10,100,106,107,108,...,youngsters,youth,yuan,zero,zeroing,zeros,zigler,zirkin,zoe,zoellick
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Okay, it's **far** too big to even look at. Let's try to get a list of features from a new `CountVectorizer` that only takes the top 100 words.

Now let's push all of that into a dataframe with nicely named columns.

In [21]:
from nltk.stem.porter import PorterStemmer
import re

porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

count_vectorizer = CountVectorizer(stop_words = 'english', tokenizer = stemming_tokenizer, max_features = 100)
X = count_vectorizer.fit_transform(speeches_df['content'])
pd.DataFrame(X.toarray(), columns = count_vectorizer.get_feature_names()) 

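| | 1 | 2 | act | allow | amend | american | amp | ani | appropri | associ | ... | urg | use | veri | vote | wa | want | way | work | year | yield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 2 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 3 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |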
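
To see what `max_features` does in isolation, here is a tiny sketch on made-up text: it keeps only the terms with the highest total counts across the whole corpus. The texts and the `small_vectorizer` name are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts: "vote" appears 5 times, "amend" twice, the rest once
toy_texts = ["vote vote vote amend", "vote amend budget", "vote yield"]

# Keep only the 2 most frequent terms across the corpus
small_vectorizer = CountVectorizer(max_features=2)
small_vectorizer.fit(toy_texts)

kept_terms = sorted(small_vectorizer.vocabulary_)
print(kept_terms)
```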


Everyone seems to start their speeches with "mr chairman". How many speeches are there in total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?

In [23]:
# how many speeches are there total
len(speeches_df['content'])

702

In [26]:
# how many speeches don't mention "chairman"
len(speeches_df[speeches_df['content'].str.contains('chairman') == False])

249

In [43]:
# how many speeches mention neither "chairman" nor "mr."
# (str.contains treats 'mr.' as a regular expression, so the '.' matches any character)
len(speeches_df[(speeches_df['content'].str.contains('chairman') == False) & (speeches_df['content'].str.contains('mr.') == False)])

75
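
One thing to watch with `str.contains`: by default the pattern is a regular expression, so the `.` in `'mr.'` matches any character. A small sketch on made-up snippets (the `toy` series is hypothetical) shows the difference; whether it changes the count on the real speeches is not something this sketch checks.

```python
import pandas as pd

# Made-up snippets, not real speeches
toy = pd.Series(["mr. chairman , i rise", "mrs. smith yields", "madam speaker"])

as_regex = toy.str.contains('mr.')                  # '.' matches any character
as_literal = toy.str.contains('mr.', regex=False)   # matches the exact text "mr."

print(as_regex.tolist())
print(as_literal.tolist())
```

The regex version also flags the "mrs." line, while `regex=False` only flags the literal "mr.".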

What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, max_features = 100, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
term_freq = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

In [52]:
term_freq['thank'].sort_values(ascending = False).head(1)

179    0.25
Name: thank, dtype: float64
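
With `norm='l1'`, each value is the share of a speech's (top-100) words that are "thank", so a very short speech can come out on top. If "the most times" is read as a raw count instead, one alternative is to count occurrences directly. This is a sketch on made-up text, and note that `str.count('thank')` also counts "thanks" or "thankful".

```python
import pandas as pd

# Made-up speeches: a short one and a longer one
toy_speeches = pd.Series([
    "thank you",
    "i thank the chair , i thank the staff , and i thank my colleagues for their work",
])

# Raw number of times "thank" appears in each speech
thank_counts = toy_speeches.str.count('thank')

# Share of whitespace-separated tokens that contain "thank"
thank_share = thank_counts / toy_speeches.str.split().str.len()

print(thank_counts.idxmax())  # speech with the most occurrences
print(thank_share.idxmax())   # speech with the highest proportion
```

The two rankings disagree here: the long speech has more occurrences, the short one has the higher share.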

If I'm searching for `China` and `trade`, what are the top 3 speeches to read according to the `CountVectorizer`?

In [58]:
(term_freq['china'] + term_freq['trade']).sort_values(ascending = False).head(3)

345    0.397059
336    0.281250
402    0.250000
dtype: float64

Now what if I'm using a `TfidfVectorizer`?

In [53]:
l2_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, max_features = 100)
X = l2_vectorizer.fit_transform(speeches_df['content'])
l2_df = pd.DataFrame(X.toarray(), columns=l2_vectorizer.get_feature_names())

| | 1 | 2 | act | allow | amend | american | amp | ani | appropri | associ | ... | urg | use | veri | vote | wa | want | way | work | year | yield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.096449 | 0.031370 | 0.045459 | 0.102570 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.059449 | 0.028406 | 0.027954 | 0.030172 | 0.064842 | 0.000000 | 0.000000 | 0.037749 |
| 1 | 0.000000 | 0.000000 | 0.072345 | 0.070591 | 0.051147 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.066888 | 0.063921 | 0.000000 | 0.067896 | 0.218866 | 0.000000 | 0.066236 | 0.000000 |
| 2 | 0.000000 | 0.000000 | 0.087605 | 0.000000 | 0.000000 | 0.093165 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.080997 | 0.077405 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.051431 |
| 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.135191 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.104059 | 0.000000 | 0.110529 | 0.118766 | 0.000000 | 0.000000 | 0.000000 |
| 4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.101489 | 0.000000 | 0.0 | 0.000000 | 0.164781 | 0.000000 | ... | 0.000000 | 0.000000 | 0.132722 | 0.253671 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.168549 |
| 5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693326 |
| 6 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.410171 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 7 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693326 |
| 8 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 9 | 0.000000 | 0.000000 | 0.000000 | 0.145293 | 0.000000 | 0.475061 | 0.0 | 0.000000 | 0.170927 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.263132 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |


In [59]:
(l2_df['china'] + l2_df['trade']).sort_values(ascending = False).head(3)

345    1.296847
402    1.207024
317    1.202438
dtype: float64

**What's the content of the speeches?** Here's a way to get them:

In [60]:
# index 0 is the first speech, which was the first one imported.
paths[0]

'convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt'

In [61]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}

mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled by an unel

**Now search for something else!** Another two terms that might show up. `elections` and `chaos`? Whatever you think might be interesting.

In [65]:
(l2_df['elect'] + l2_df['children']).sort_values(ascending = False).head(3)

124    0.797181
25     0.760251
142    0.723351
dtype: float64
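
Since the same "add two columns, sort, take the top few" pattern keeps coming up, it could be wrapped in a small helper. This is only a sketch: `top_matches` is a hypothetical name, the toy weights are made up, and it assumes the search terms are already in their stemmed column form (`elect`, not `elections`). A term that did not make the `max_features` cut raises a `KeyError` with a readable message.

```python
import pandas as pd

def top_matches(weights_df, terms, n=3):
    # Fail clearly if a search term is not one of the vectorizer's columns
    missing = [term for term in terms if term not in weights_df.columns]
    if missing:
        raise KeyError("not in vocabulary: {}".format(missing))
    # Add up the weights of the search terms and rank documents by the total
    return weights_df[terms].sum(axis=1).sort_values(ascending=False).head(n)

# Made-up weights standing in for term_freq or l2_df
toy_weights = pd.DataFrame({
    'china': [0.0, 0.6, 0.2],
    'trade': [0.1, 0.5, 0.0],
    'vote':  [0.9, 0.0, 0.3],
})

result = top_matches(toy_weights, ['china', 'trade'], n=2)
print(result)
```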

# Enough of this garbage, let's cluster

Using a **simple counting vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency inverse document frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

In [84]:
# Simple counting vectorizer
vectorizer = CountVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: thi mr chairman amend gentleman
Cluster 1: head start program right religi
Cluster 2: amp nbsp p gt lt
Cluster 3: thi mr state s time
Cluster 4: associ nation restaur contractor chamber
Cluster 5: start head program children thi
Cluster 6: church wa s embezzl financi
Cluster 7: rule 11 state feder court
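
Tokens like `thi` and `wa` survive because sklearn applies the stop-word list *after* the custom tokenizer runs, and by then "this" and "was" have already been stemmed into forms that are not on the list. One possible workaround is to drop stop words inside the tokenizer, before stemming. The sketch below uses `crude_stem`, a hypothetical stand-in that only strips a trailing "s" (so it runs without nltk); swap in `porter_stemmer.stem` for real use.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def crude_stem(word):
    # Hypothetical stand-in for PorterStemmer: just strips a trailing "s"
    return word[:-1] if word.endswith('s') and len(word) > 2 else word

def split_words(str_input):
    return re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()

def stem_only_tokenizer(str_input):
    # Same shape as stemming_tokenizer: stem everything, filter nothing
    return [crude_stem(word) for word in split_words(str_input)]

def stop_then_stem_tokenizer(str_input):
    # Drop stop words first, so "this" never gets the chance to become "thi"
    return [crude_stem(word) for word in split_words(str_input)
            if word not in ENGLISH_STOP_WORDS]

toy_texts = ["this was the program for children", "this program helps schools"]

leaky = CountVectorizer(tokenizer=stem_only_tokenizer, stop_words='english').fit(toy_texts)
clean = CountVectorizer(tokenizer=stop_then_stem_tokenizer).fit(toy_texts)

print(sorted(leaky.vocabulary_))
print(sorted(clean.vocabulary_))
```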


In [85]:
# Term frequency vectorizer
vectorizer = TfidfVectorizer(use_idf=False, tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000, norm = 'l1')
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: speaker mr time balanc reserv
Cluster 1: thi mr chairman gentleman amend
Cluster 2: mr yield chairman gentleman minut
Cluster 3: chairman mr time balanc yield
Cluster 4: yield mr gentleman chairman speaker
Cluster 5: mr demand vote record chairman
Cluster 6: yield gentleman texa wisconsin illinoi
Cluster 7: amend chairman mr opposit time


In [86]:
# Term frequency inverse document frequency vectorizer
vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(speeches_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: start head program children thi
Cluster 1: balanc time chairman mr reserv
Cluster 2: china trade thi s enforc
Cluster 3: demand record vote mr chairman
Cluster 4: consent claim opposit unanim ask
Cluster 5: gentleman yield mr chairman texa
Cluster 6: thi mr amend chairman time
Cluster 7: mr minut yield chairman gentleman


**Which one do you think works the best?**

One of the latter two (term frequency or TF-IDF), but it's hard to tell which.
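
Eyeballing the top terms is one way to judge. A rough numeric complement, offered here only as a sketch, is the silhouette score from `sklearn.metrics`, which is higher when documents sit closer to their own cluster than to the others. The toy texts are made up, and scores from different vectorizers live in different feature spaces, so treat any comparison between them as a loose signal rather than a verdict.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up texts with two obvious topics
toy_texts = [
    "china trade tariff", "trade china export", "china export tariff",
    "head start children", "children head program", "start program children",
]

X = TfidfVectorizer().fit_transform(toy_texts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Ranges from -1 (poorly separated) to 1 (tight, well-separated clusters)
score = silhouette_score(X, km.labels_)
print(round(score, 3))
```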

# Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out **the two types of Harry Potter fanfiction**. What is your hypothesis?
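
If you would rather not unzip to disk, the stories can be read straight out of the archive with the standard-library `zipfile` module. This is a sketch under assumptions: `read_zip_texts` is a hypothetical helper, the files are assumed to be UTF-8 `.txt` files, and a tiny in-memory zip stands in for the real `hp.zip` so the example runs without the download.

```python
import io
import zipfile
import pandas as pd

def read_zip_texts(zip_source):
    # zip_source can be a path such as 'hp.zip' or an open binary file object
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith('.txt'):
                # Assumes UTF-8; undecodable bytes are replaced, not fatal
                content = archive.read(name).decode('utf-8', errors='replace')
                rows.append({'filename': name.split('/')[-1], 'content': content})
    return pd.DataFrame(rows)

# Build a tiny in-memory zip so the sketch runs without the real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hp/1.txt', 'Harry looked at Hermione.')
    archive.writestr('hp/2.txt', 'Lily smiled at James.')
buffer.seek(0)

demo_df = read_zip_texts(buffer)
print(demo_df)
```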

In [78]:
paths = glob.glob('hp/*')
paths[:5]

['hp/10001898.txt',
 'hp/10004131.txt',
 'hp/10004927.txt',
 'hp/10007980.txt',
 'hp/10010343.txt']

In [79]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
hp_df = pd.DataFrame(speeches)
hp_df.head()

| | content | filename | pathname |
|---|---|---|---|
| 0 | Prologue: The MissionDisclaimer: All character... | 10001898.txt | hp/10001898.txt |
| 1 | BlackDisclaimer: I do not own Harry PotterAuth... | 10004131.txt | hp/10004131.txt |
| 2 | Chapter 1"I'm pregnant.""""Mum please say some... | 10004927.txt | hp/10004927.txt |
| 3 | Author's Note: Hey, just so you know, this is ... | 10007980.txt | hp/10007980.txt |
| 4 | Disclaimer: I do not own Harry Potter and frie... | 10010343.txt | hp/10010343.txt |


In [82]:
# # The two clusters for this are 
# # Top terms per cluster:
# # Cluster 0: wa hi harri hermion t
# # Cluster 1: wa hi t s lili
# # Which is unintelligible to me

# vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words='english', max_features = 10000)
# X = vectorizer.fit_transform(hp_df['content'])

# from sklearn.cluster import KMeans

# number_of_clusters = 2
# km = KMeans(n_clusters=number_of_clusters)
# km.fit(X)

# print("Top terms per cluster:")
# order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# terms = vectorizer.get_feature_names()
# for i in range(number_of_clusters):
#     top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
#     print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: wa hi harri hermion t
Cluster 1: wa hi t s lili


In [83]:
vectorizer = TfidfVectorizer(stop_words='english', max_features = 10000)
X = vectorizer.fit_transform(hp_df['content'])

from sklearn.cluster import KMeans

number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: lily james sirius remus said
Cluster 1: harry hermione draco said just
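
To test a hypothesis about the two types, it helps to attach each story's cluster label back onto the dataframe, check how big each cluster is, and read a few stories from each. A minimal sketch on made-up text (the `toy_stories` dataframe is hypothetical; on the real data the same lines would use `hp_df` and the fitted `km`):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the fanfiction texts
toy_stories = pd.DataFrame({'content': [
    "lily james sirius remus", "james sirius lily", "remus sirius james lily",
    "harry hermione draco", "draco harry hermione ron",
]})

X = TfidfVectorizer().fit_transform(toy_stories['content'])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# km.labels_ holds one cluster number per row, in the same order as the input
toy_stories['cluster'] = km.labels_

print(toy_stories['cluster'].value_counts())
print(toy_stories.groupby('cluster')['content'].first())
```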
