In [1]:
import pandas as pd
import requests
import json
import csv
import time
import datetime
import re
import string
import gzip
import os

import numpy as np
import pickle #for saving output files, pickles

def datetime_to_unix_time(d):
    return int(time.mktime(datetime.datetime.strptime(d, "%m/%d/%Y").timetuple()))

## Import trailers from my pickle file!

In [2]:
with open('/home/russell/Documents/DataScience/DonorsChoose/Data/trailers.pickle', 'rb') as handle:
    trailers = pickle.load(handle)

## Process trailers from DonorsChoose and create data files

In [3]:
data = pd.DataFrame(trailers.items(), columns=['id', 'trailer'])


In [4]:
data.columns = ['id','text']
pd.set_option('max_colwidth', 400)
data.head()

Unnamed: 0,id,text
0,4956615,Help me give my students a summer reading boost with a very engaging digital reading program.
1,4904116,"Help me give my students various dice and a starting dice game to use in endless ways at home to practice, reinforce, and learn math concepts."
2,4333875,"Help me give my students a safe, welcoming, comfortable classroom space where they can feel at home and ready to learn!"
3,4948217,Help me give my students basic school supplies! Every child needs a pencil and now even disinfectant wipes to keep everyone safe and learning!
4,4946464,Help me give my students the hands on materials they need during Remote learning! I have included hands on materials as well as a math workbook and snack.


### Clean up text

In [5]:
# not using re.sub(r'[^a-zA-Z\s]', '', t) to avoid losing emojis
text = [re.sub(r'[0-9]+', ' ', t).lower() for t in data['text']] # replace digits with spaces, then lowercase
text = [re.sub(r'(!|"|#|\$|%|&|\'|\(|\)|\*|\+|,|-|\.|/|:|;|<|=|>|\?|@|\[|\\|\]|\^|_|`|{|\||}|~)+', ' ', t) for t in text] # replace ASCII punctuation with spaces
data['text'] = [re.sub(r'\s+', ' ', t).strip() for t in text] # collapse runs of whitespace into single spaces

data = data.loc[data['text'].map(len) > 5].reset_index(drop = True) # keep only strings longer than 5 characters
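
The substitutions above can be wrapped into a single helper, which makes them easy to try on one string. This is a minimal sketch: `clean_text` and `PUNCT_RE` are hypothetical names, and the punctuation pattern is built from `string.punctuation`, which contains the same ASCII symbols that the pattern above spells out one by one.

```python
import re
import string

# Character class matching any run of ASCII punctuation symbols
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']+')

def clean_text(t):
    """Lowercase a string and strip digits and ASCII punctuation; emojis are kept."""
    t = re.sub(r'[0-9]+', ' ', t).lower()   # digits -> space
    t = PUNCT_RE.sub(' ', t)                # punctuation -> space
    return re.sub(r'\s+', ' ', t).strip()   # collapse whitespace

print(clean_text('Help me give my students 3 books, now!'))
```

Because only digits and ASCII punctuation are targeted, any other character (including emojis) passes through unchanged.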

# Discovering and Visualizing Topics in Texts

Most typical supervised NLP tasks (text classification, named entity recognition, question answering, etc.) require training datasets where each piece of text is associated with a label. However, in real-life scenarios, text collections rarely come with metadata labels that tell you what the texts are about. When people answer open-ended survey questions, for example, they don't tag their answers with keywords for the topics they discuss.

**Topic modeling** is an unsupervised classification technique that is able to discover the topics in a collection of texts by looking at their commonalities. In this context, "topics" refers to groups of related words that often occur together in the same text. For example, in a collection of newspaper articles a topic model may identify one topic that is made up of words such as "politician", "law", and "parliament", and another characterized by words such as "player", "match" and "penalty". Topic models only go as far as identifying clusters of related words; a human is still needed to interpret these clusters and give them labels such as "politics" and "football". 

One of the most popular topic models is Latent Dirichlet Allocation (LDA). LDA is a generative model that sees every text as a mixture of topics and each topic as a mixture of words. For example, the "football" topic will generate the word "penalty" with a high probability, while the "politics" topic will have a much higher probability for "politician" than for "penalty". Other words, such as "the" and "an", will have similar probabilities in all topics. LDA takes its name from the Dirichlet probability distribution, which is the prior distribution it assumes for the topics in a text.
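
To make the generative story concrete, here is a toy sketch of how LDA imagines a document being written: draw a topic mixture from a Dirichlet prior, then for every word pick a topic from that mixture and a word from that topic. The two topics, their word probabilities, the `alpha` value and all function names are invented for illustration; a real model learns the topic-word distributions from data rather than having them written down.

```python
import random

# Hypothetical topics: each is a probability distribution over words
TOPICS = {
    'football': {'penalty': 0.4, 'player': 0.3, 'match': 0.2, 'the': 0.1},
    'politics': {'politician': 0.4, 'law': 0.3, 'parliament': 0.2, 'the': 0.1},
}

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Return (topic mixture, words) following LDA's generative story."""
    rng = random.Random(seed)
    names = list(TOPICS)
    mixture = sample_dirichlet([alpha] * len(names), rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]     # pick a topic for this word
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return dict(zip(names, mixture)), words

mixture, words = generate_document(8)
print(mixture)
print(words)
```

Training a topic model runs this story in reverse: given only the words, it infers plausible topic-word distributions and per-document mixtures.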

Modified from https://github.com/nlptown/nlp-notebooks/blob/master/Discovering%20and%20Visualizing%20Topics%20in%20Texts%20with%20LDA.ipynb

## Data

Insight fellows frequently come up with project ideas that revolve around topic modeling of online reviews. Here, we'll use a dataset of project 'trailers' from the website DonorsChoose. Each trailer is a brief description of the project for which a teacher is requesting funding.

In [6]:
data.head(2)

Unnamed: 0,id,text
0,4956615,help me give my students a summer reading boost with a very engaging digital reading program
1,4904116,help me give my students various dice and a starting dice game to use in endless ways at home to practice reinforce and learn math concepts


## Preprocessing

Before we train a topic model, we need to tokenize our texts. Let's do this with the [spaCy](https://spacy.io/) NLP library. We need to load a statistical model for English and use spaCy to perform our first preprocessing pass:

In [7]:
import spacy

# If you haven't installed the spaCy language model, uncomment the following line and run this cell
# ! python -m spacy download en_core_web_sm

# You will need to restart the notebook (go to the menu Kernel -> Restart) and re-run cells up to this point

In [8]:
nlp = spacy.load('en_core_web_sm')

texts = data['text'].tolist()
%time spacy_docs = list(nlp.pipe(texts))

CPU times: user 2.14 s, sys: 385 ms, total: 2.52 s
Wall time: 2.53 s


The text of each trailer is now a spaCy Doc that we can transform into a list of tokens. Instead of the original tokens, we're going to work with their **lemmas**. This will allow our model to generalize and understand that different forms of a word should be treated as one.

Stemming and lemmatization both reduce words to a root form. Lemmatization uses knowledge about the language, and the resulting tokens are all actual words. For example, the word "thought" becomes the lemma "think". Stemming is a crude heuristic that chops off the ends of words, so the resulting tokens may not be actual words. Stemming is faster but only works well for regular forms, such as reducing "toys" to "toy".
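
The difference can be sketched with two toy functions. Neither is spaCy's lemmatizer nor a real stemming algorithm such as Porter's: the lookup table, the suffix list and the function names are all made up for illustration.

```python
# Toy lookup table standing in for a lemmatizer's language knowledge (hypothetical)
LEMMAS = {'thought': 'think', 'toys': 'toy', 'families': 'family'}
SUFFIXES = ('ing', 'ies', 'es', 's')

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

def toy_stem(word):
    """Chop off the first matching suffix, with no check that a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ['toys', 'thought', 'families']:
    print(w, '->', toy_stem(w), '|', toy_lemmatize(w))
```

The stemmer handles "toys" but leaves "thought" untouched and turns "families" into the non-word "famil", while the lookup returns "think" and "family".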

This is the full list of our initial preprocessing steps: 
 
- remove all words shorter than 3 characters (these are often fairly uninteresting from a topical point of view)
- drop all stopwords
- lowercase remaining lemmas

In [9]:
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 2 and not t.is_stop] for doc in spacy_docs]
for i in range(5):
    print(docs[i])
    print('\n')

['help', 'student', 'summer', 'reading', 'boost', 'engaging', 'digital', 'reading', 'program']


['help', 'student', 'dice', 'start', 'dice', 'game', 'use', 'endless', 'way', 'home', 'practice', 'reinforce', 'learn', 'math', 'concept']


['help', 'student', 'safe', 'welcoming', 'comfortable', 'classroom', 'space', 'feel', 'home', 'ready', 'learn']


['help', 'student', 'basic', 'school', 'supply', 'child', 'need', 'pencil', 'disinfectant', 'wipe', 'safe', 'learning']


['help', 'student', 'hand', 'material', 'need', 'remote', 'learning', 'include', 'hand', 'material', 'math', 'workbook', 'snack']




Next, we also want to take frequent bigrams into account. **Bigrams** are pairs of adjacent words; frequent ones, such as "colored pencil", often behave as a single multiword unit rather than two separate words. We'll use Gensim to first identify the frequent bigrams in the corpus, then append them to the list of tokens for the documents in which they appear. This means the bigrams will not be in their correct position in the text, but that's fine: topic models are bag-of-words models that ignore word position anyway.

In [10]:
import re
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10)
tokens = []

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:  # bigrams can be recognized by the "_" that joins the individual words
            docs[idx].append(token)
            tokens.append(token)
            
print(list(set(tokens))[:10])

['high_interest', 'classroom_library', 'distance_learning', 'comic_strip', 'seating_option', 'flexible_seating', 'document_camera', 'dry_erase', 'social_emotional', 'school_year']
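
The idea of appending frequent bigrams can be sketched without Gensim. This is a deliberate simplification: the hypothetical helper below keeps every adjacent pair that reaches a raw count, whereas Gensim's `Phrases` also scores candidate pairs against a threshold before accepting them.

```python
from collections import Counter

def add_frequent_bigrams(token_lists, min_count=2):
    """Append 'a_b' to each document for every adjacent pair seen at least min_count times."""
    pair_counts = Counter(
        pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    return [
        tokens + ['_'.join(pair) for pair in zip(tokens, tokens[1:]) if pair in frequent]
        for tokens in token_lists
    ]

toy_docs = [
    ['flexible', 'seating', 'classroom'],
    ['need', 'flexible', 'seating'],
    ['classroom', 'library'],
]
print(add_frequent_bigrams(toy_docs))
```

Only "flexible seating" occurs twice in the toy corpus, so `flexible_seating` is appended to the first two documents and the third is left unchanged.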


In [11]:

# save the preprocessed documents; the with-block closes the file automatically
with open('/home/russell/Documents/DataScience/DonorsChoose/Data/cleantrailer.pickle', 'wb') as pickle_out:
    pickle.dump(docs, pickle_out)




Next, we move on to the final Gensim-specific preprocessing steps. First, we create a dictionary representation of the documents. This dictionary will map each word to a unique ID and help us create bag-of-words representations of each document. These bag-of-words representations contain the IDs of the words in the document, together with their frequency. Additionally, we can remove the least and most frequent words from the vocabulary. This improves the quality of our topic model and speeds up its training. The minimum frequency of a word is expressed as an absolute number of documents (`no_below`); the maximum frequency is the proportion of documents a word is allowed to occur in (`no_above`).

In [12]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print('Number of unique words in original documents:', len(dictionary))

dictionary.filter_extremes(no_below=3, no_above=0.25)
print('Number of unique words after removing rare and common words:', len(dictionary))

print("Example representation of document 3:", dictionary.doc2bow(docs[2]))

Number of unique words in original documents: 2035
Number of unique words after removing rare and common words: 668
Example representation of document 3: [(10, 1), (11, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)]
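
What the dictionary and the bag-of-words conversion do can be sketched in plain Python. This is an illustration of the concept, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the `no_below` and `no_above` arguments only mimic the meaning of the `filter_extremes` parameters used above.

```python
from collections import Counter

def build_vocab(token_lists, no_below=2, no_above=0.8):
    """Map each kept word to an integer id, dropping rare and very common words."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    kept = sorted(w for w, df in doc_freq.items()
                  if df >= no_below and df / n_docs <= no_above)
    return {word: idx for idx, word in enumerate(kept)}

def to_bow(tokens, vocab):
    """Return sorted (word id, count) pairs for the words that are in the vocabulary."""
    counts = Counter(vocab[w] for w in tokens if w in vocab)
    return sorted(counts.items())

toy_docs = [
    ['help', 'student', 'math', 'game', 'math'],
    ['help', 'student', 'book'],
    ['help', 'book', 'math'],
]
vocab = build_vocab(toy_docs)
print(vocab)
print(to_bow(toy_docs[0], vocab))
```

In this toy corpus "game" is dropped as too rare and "help" as too common, which is the same effect that shrinks the real vocabulary from 2035 to 668 words.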


Then we create bag-of-words representations for each document in the corpus:

In [13]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

## Training

Now it's time to train our topic model. We do this with the following parameters: 

- `corpus`: the bag-of-words representations of our documents
- `id2word`: the mapping from indices to words
- `num_topics`: the number of topics we want the model to identify
- `chunksize`: the number of documents the model sees for every update
- `passes`: the number of times we show the total corpus to the model during training
- `random_state`: we use a seed to ensure reproducibility.

On a corpus of this size, training typically takes about a second.

In [14]:
from gensim.models import LdaModel

%time model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=6, chunksize=500, passes=3, random_state=1)

CPU times: user 745 ms, sys: 0 ns, total: 745 ms
Wall time: 748 ms


## Results

Let's take a look at what the model has learnt. We do this by printing out the ten words that are most characteristic of each topic. Several topics share common words like "learn", "book" and "opportunity", and it's hard to identify clear patterns in the data.

In [15]:
for (topic, words) in model.print_topics():
    print(topic+1, ":", words, '\n')

1 : 0.043*"fun" + 0.039*"learn" + 0.031*"seating" + 0.030*"classroom" + 0.024*"play" + 0.021*"table" + 0.020*"environment" + 0.020*"increase" + 0.019*"emotional" + 0.019*"skill" 

2 : 0.057*"learning" + 0.028*"pencil" + 0.028*"activity" + 0.023*"item" + 0.021*"summer" + 0.019*"material" + 0.018*"reading" + 0.017*"home" + 0.016*"experience" + 0.016*"classroom" 

3 : 0.038*"book" + 0.031*"read" + 0.031*"lesson" + 0.024*"distance" + 0.023*"laptop" + 0.023*"hand" + 0.022*"tile" + 0.021*"set" + 0.020*"language" + 0.020*"provide" 

4 : 0.066*"science" + 0.046*"material" + 0.044*"art" + 0.044*"opportunity" + 0.030*"book" + 0.030*"meet" + 0.024*"high" + 0.021*"class" + 0.021*"interest" + 0.020*"learn" 

5 : 0.050*"learn" + 0.042*"supply" + 0.036*"book" + 0.036*"home" + 0.028*"teach" + 0.021*"life" + 0.021*"continue" + 0.019*"new" + 0.018*"like" + 0.017*"opportunity" 

6 : 0.049*"school" + 0.043*"year" + 0.035*"opportunity" + 0.026*"supply" + 0.025*"day" + 0.023*"college" + 0.021*"need" + 0.021

Another way of inspecting the topics is by visualizing them. This can be done with the [pyLDAvis](https://github.com/bmabey/pyLDAvis) library. PyLDAvis will show us how popular the topics are in our corpus, how similar the topics are, and which are the most salient words for each topic. Note that it's important to set `sort_topics=False` on the call to pyLDAvis. If you don't, it will order the topics differently than Gensim.

In [71]:
import pyLDAvis.gensim  # this module is named pyLDAvis.gensim_models in pyLDAvis 3.x
import warnings

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)

Finally, let's inspect the topics the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, which makes its results easy to interpret.

In [72]:
for (text, doc) in zip(texts[:20], docs[:20]):
    print(text)
    print('-'*10)
    print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.3])
    print('\n')

help me give my students a summer reading boost with a very engaging digital reading program
----------
[(2, 0.8947776)]


help me give my students various dice and a starting dice game to use in endless ways at home to practice reinforce and learn math concepts
----------
[(5, 0.9402185)]


help me give my students a safe welcoming comfortable classroom space where they can feel at home and ready to learn
----------
[(1, 0.5075563), (5, 0.40881738)]


help me give my students basic school supplies every child needs a pencil and now even disinfectant wipes to keep everyone safe and learning
----------
[(2, 0.39377218), (5, 0.43039736)]


help me give my students the hands on materials they need during remote learning i have included hands on materials as well as a math workbook and snack
----------
[(2, 0.9354828)]


help me give my students a comfortable social distance reading corner with mobile seats and individual book bins
----------
[(1, 0.31220227), (2, 0.62686527)]


help me gi

In [42]:
docs

[['help',
  'student',
  'summer',
  'reading',
  'boost',
  'engaging',
  'digital',
  'reading',
  'program'],
 ['help',
  'student',
  'dice',
  'start',
  'dice',
  'game',
  'use',
  'endless',
  'way',
  'home',
  'practice',
  'reinforce',
  'learn',
  'math',
  'concept'],
 ['help',
  'student',
  'safe',
  'welcoming',
  'comfortable',
  'classroom',
  'space',
  'feel',
  'home',
  'ready',
  'learn'],
 ['help',
  'student',
  'basic',
  'school',
  'supply',
  'child',
  'need',
  'pencil',
  'disinfectant',
  'wipe',
  'safe',
  'learning'],
 ['help',
  'student',
  'hand',
  'material',
  'need',
  'remote',
  'learning',
  'include',
  'hand',
  'material',
  'math',
  'workbook',
  'snack',
  'remote_learning'],
 ['help',
  'student',
  'comfortable',
  'social',
  'distance',
  'reading',
  'corner',
  'mobile',
  'seat',
  'individual',
  'book',
  'bin'],
 ['help',
  'student',
  'summer',
  'skill',
  'workbook',
  'play',
  'card',
  'math',
  'game',
  'continue',


Looping through all texts, let's save the most likely topic number.

In [73]:
docs[0]

['help',
 'student',
 'summer',
 'reading',
 'boost',
 'engaging',
 'digital',
 'reading',
 'program']

In [74]:
texts[0]

'help me give my students a summer reading boost with a very engaging digital reading program'

In [75]:
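topic_nums = []
for (text, doc) in zip(texts, docs):
    # model[bow] yields (topic_id, probability) pairs for this document
    probs = np.array(model[dictionary.doc2bow(doc)])
    # sort by probability and keep the id of the most likely topic;
    # ids are 0-based here (unlike the 1-based numbers printed above) and become floats in the array
    topic_nums.append(probs[np.argsort(probs[:,-1])][-1,0])
    
data['topic'] = topic_nums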

# share of each id's trailers per topic, as row percentages
product_vs_topic = pd.crosstab(data['id'], data['topic'], normalize='index') * 100
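
Picking the most likely topic can also be written without the NumPy round trip. The sketch below uses a hypothetical helper, `most_likely_topic`, on pairs shaped like the model output printed earlier for the third document; it keeps the topic id as an integer instead of a float.

```python
def most_likely_topic(topic_probs):
    """Return the id of the highest-probability topic from (topic_id, prob) pairs."""
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

# Pairs taken from the third document's output above, shifted back to 0-based topic ids
example = [(0, 0.5075563), (4, 0.40881738)]
print(most_likely_topic(example))
```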

In [107]:
pd.set_option('display.max_columns', None)
#pd.set_option("max_rows", None) #undo by resetting --- 
#pd.reset_option("display.max_rows")
#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_rows', 50)

In [108]:
data.head()

Unnamed: 0,id,text
0,4956615,help me give my students a summer reading boost with a very engaging digital reading program
1,4904116,help me give my students various dice and a starting dice game to use in endless ways at home to practice reinforce and learn math concepts
2,4333875,help me give my students a safe welcoming comfortable classroom space where they can feel at home and ready to learn
3,4948217,help me give my students basic school supplies every child needs a pencil and now even disinfectant wipes to keep everyone safe and learning
4,4946464,help me give my students the hands on materials they need during remote learning i have included hands on materials as well as a math workbook and snack


In [109]:
import matplotlib.pyplot as plt
from matplotlib import colors
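def background_gradient(s, m, M, cmap='PuBu', low=0, high=0):
    """Color each cell of the column `s` on a shared scale running from m to M."""
    rng = M - m
    # widen the color range so extreme values don't map to the lightest/darkest shades
    norm = colors.Normalize(m - (rng * low),
                            M + (rng * high))
    normed = norm(s.values)
    # plt.get_cmap replaces plt.cm.get_cmap, which is deprecated in recent Matplotlib
    c = [colors.rgb2hex(x) for x in plt.get_cmap(cmap)(normed)]
    return ['background-color: %s' % color for color in c]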

product_vs_topic.round(2).style.apply(background_gradient,
               cmap='YlGnBu',
               m=product_vs_topic.min().min(),
               M=product_vs_topic.max().max(),
               low=0.5,
               high=0.8)

id \ topic,0.0,1.0,2.0,3.0,4.0,5.0
2670527,0.0,0.0,0.0,0.0,0.0,100.0
3423945,0.0,0.0,0.0,0.0,0.0,100.0
3437251,0.0,100.0,0.0,0.0,0.0,0.0
3442661,0.0,0.0,0.0,0.0,100.0,0.0
3479767,100.0,0.0,0.0,0.0,0.0,0.0
3534334,0.0,0.0,0.0,100.0,0.0,0.0
3543249,0.0,0.0,0.0,100.0,0.0,0.0
3687922,0.0,0.0,0.0,0.0,100.0,0.0
3854967,0.0,100.0,0.0,0.0,0.0,0.0
3935924,0.0,0.0,100.0,0.0,0.0,0.0


In [80]:
data.loc[data['id'].isin(['4957430','4957502','4957562'])]

Unnamed: 0,id,text,topic
70,4957562,help me give my students books that showcase diversity and inclusion in our classroom,4.0
262,4957502,help me give my students books featuring diverse characters that reflect the diverse population of our school,4.0
287,4957430,help me give my students stories that create more representation with characters from marginalized backgrounds,4.0


## Conclusions

Many collections of unstructured texts don't come with any labels. Topic models such as Latent Dirichlet Allocation are a useful technique to discover the most prominent topics in such documents. Gensim makes training these topic models easy, and pyLDAvis presents the results in a visually attractive way. Together they form a powerful toolkit to better understand what's inside large sets of documents and to explore subsets of related texts. However, these methods can perform poorly on short texts with vague or unspecified subjects. Although traditional topic models lack richer semantic information (they don't use word embeddings, for instance), they can be a really quick way of getting insights into large collections of documents.