# Homework #5: Topic modeling

Instead of topic modeling newsgroup data, let's look at fiction and see what we can do with it.


In [1]:
import re
import numpy as np
import pandas as pd

from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
pyLDAvis.enable_notebook()
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

### Load fiction data

We load a dataset of fiction that's in your repository.

In [3]:
ficpath = Path('../data/HWfiction/HWfiction.tsv')
fic = pd.read_csv(ficpath, sep = '\t')
fic.head()

| Unnamed: 0 | chunkid | decade | text |
|---|---|---|---|
| 0 | 1880_Adams_1 | 1880 | young and fresh from that hot-bed of abolition... |
| 1 | 1880_Adams_1 | 1880 | we dine at half-past six." Senator Ratcliffe h... |
| 2 | 1880_Adams_3 | 1880 | The story is this, Mrs. Lee; and it is well-kn... |
| 3 | 1880_Adams_4 | 1880 | tell you," said he drily, "you will be wiser t... |
| 4 | 1880_Aldrich_1 | 1880 | suggested somebody. "Three on 'em snaked in to... |


Each row contains an id, a date rounded off at the decade level, and the text of a chunk of fiction.

Let's also load a stopword list.

In [19]:
stoppath = Path('../data/HWfiction/HWfictionstopwords.txt')

# Read one stopword per line; the with-block closes the file when done.
with open(stoppath, encoding = 'utf-8') as f:
    stopwords = [x.strip() for x in f]

len(stopwords)

6437

### Assignment 1

Vectorize the fiction using the list of stopwords we just loaded, and other settings parallel to our lab.

    CountVectorizer(strip_accents = 'unicode',
                                stop_words = stopwords,
                                 token_pattern = r'\b[a-zA-Z]{3,}\b',
                                lowercase = True,
                                max_df = 0.5, 
                                min_df = 10)

Then train a 20-topic model of the data, using `random_state = 0`.
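As a reminder of how the two steps fit together, here is a minimal sketch on a made-up four-sentence corpus. The names `toy_docs` and `toy_stopwords` are hypothetical stand-ins for the fiction text and the stopword list, and the `max_df`, `min_df`, and `n_components` values are loosened only because the toy corpus is tiny; the assignment's own settings are the ones given above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus and stopword list (not the homework data).
toy_docs = [
    "the king rode to the castle with his knights",
    "her eyes met his face and her hand trembled",
    "the king and the queen ruled for many years",
    "he raised his hand to his face and closed his eyes",
]
toy_stopwords = ["the", "and", "his", "her", "for", "with"]

# Thresholds are relaxed here: four documents cannot satisfy min_df = 10.
vectorizer = CountVectorizer(strip_accents = 'unicode',
                             stop_words = toy_stopwords,
                             token_pattern = r'\b[a-zA-Z]{3,}\b',
                             lowercase = True,
                             max_df = 1.0,
                             min_df = 1)
dtm = vectorizer.fit_transform(toy_docs)   # sparse document-term matrix

# Two topics for the toy corpus; random_state fixes the initialization.
lda = LatentDirichletAllocation(n_components = 2, random_state = 0)
doc_topics = lda.fit_transform(dtm)        # shape: (n_documents, n_topics)

# Rebuild the vocabulary in column order from the fitted vectorizer.
vocab = sorted(vectorizer.vocabulary_, key = vectorizer.vocabulary_.get)

# Print the five highest-weighted words in each topic.
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(topic_idx, top_words)
```

With only four sentences the topics themselves are not meaningful; the point is the shape of the pipeline, where `fit_transform` on the vectorizer produces the matrix that the topic model is then fit on.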

Explore the model using pyLDAvis. (This may not appear in your .pdf when you print, but that's okay.)

Then:

A) Create the doc-topics matrix and turn it into a Pandas data frame so you can associate "decade" with each document. Use groupby() and mean() to summarize this matrix so it has one row for each decade, and the row contains mean topic probabilities for that decade.

B) Choose a topic that's rising across time; there will probably be a topic that features body parts like "face," "eyes," "hand" that makes a good example, but you can choose something else if you like.

Create a line chart that shows the topic's average frequency in different decades; the rise should be visible.

C) Choose a topic that's falling across time; there will probably be one with words like "sir," "king," "years" that makes a good example, but if not, you can poke around and find something else. Again, create a line chart.

D) Offer a two- or three-sentence speculative hypothesis to explain either the rising topic or the falling one. (You don't have to do both.) I know you don't really have evidence for the hypothesis yet. The point is not to be right but simply to stretch your hypothesis-forming muscles. Think about how you might test the hypothesis if you needed to.
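The grouping and plotting pattern in parts A through C can be sketched on synthetic numbers. Everything below is an illustration under assumptions: `toy_doc_topics` is a randomly generated matrix standing in for a real doc-topics matrix, `decades` is a made-up label column, and the `topic_0`-style column names are just one possible naming choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical labels: five chunks in each of three decades.
decades = np.repeat([1880, 1890, 1900], 5)

# Each row of a doc-topics matrix sums to 1; a Dirichlet draw mimics that.
toy_doc_topics = rng.dirichlet(np.ones(4), size = len(decades))

# Wrap the matrix in a data frame and attach the decade of each document.
topic_cols = ['topic_' + str(i) for i in range(toy_doc_topics.shape[1])]
doc_topic_df = pd.DataFrame(toy_doc_topics, columns = topic_cols)
doc_topic_df['decade'] = decades

# One row per decade, holding the mean probability of each topic.
by_decade = doc_topic_df.groupby('decade').mean()

# Line chart of a single topic across decades.
ax = by_decade['topic_0'].plot(marker = 'o')
ax.set_xlabel('decade')
ax.set_ylabel('mean topic probability')
ax.set_title('topic_0 by decade')
```

Because the numbers here are random, the line will not show a real trend. On the fiction data, the same `groupby('decade').mean()` call is what makes a rise or fall visible.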

In [25]:
# your code goes here

### Assignment 2

Go back to the original document-topic matrix (before you grouped it by decade), and perform principal component analysis (PCA) to compress it down to two dimensions. Select about 20% of the rows, which you can do by

    .sample(frac = .2)

and visualize them in the space created by PCA, colored by decade (if you use a continuous palette like 'viridis,' this may be easier to understand).
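A rough sketch of the compress, sample, and plot sequence, again on synthetic data. The matrix `toy_doc_topics`, the `decades` labels, and the `PC1`/`PC2` column names are assumptions made for illustration, and the fixed `random_state` values are only there so the sketch is repeatable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 decades with 20 chunks each, 6 topics per chunk.
decades = np.repeat(np.arange(1880, 1980, 10), 20)
toy_doc_topics = rng.dirichlet(np.ones(6), size = len(decades))

# Compress the topic dimensions down to two principal components.
pca = PCA(n_components = 2, random_state = 0)
coords = pca.fit_transform(toy_doc_topics)

pca_df = pd.DataFrame(coords, columns = ['PC1', 'PC2'])
pca_df['decade'] = decades

# Keep roughly 20% of the rows so the scatterplot stays readable.
subset = pca_df.sample(frac = .2, random_state = 0)

# Color each point by decade using a continuous palette.
fig, ax = plt.subplots()
points = ax.scatter(subset['PC1'], subset['PC2'],
                    c = subset['decade'], cmap = 'viridis')
fig.colorbar(points, ax = ax, label = 'decade')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
```

In this random example the colors will be scattered with no pattern. If chronology organizes a real topic model, nearby decades should cluster or form a gradient along one of the components.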

Would you say chronology is or isn't a pattern organizing this topic model?
Write a sentence expressing your opinion.