<a href="https://colab.research.google.com/github/steve-wilson/ds32019/blob/master/04_Content_Analysis_DS3Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Text Analysis for User Generated Content @ [DS3](https://www.ds3-datascience-polytechnique.fr/)

# Part 4: Content Analysis

[<- Previous: Data Collection](https://colab.research.google.com/drive/1Jjx7t3cAkNTtCcKP4Qkkp5uOqd8wQTB3)

[-> Next: Text Embeddings](https://colab.research.google.com/drive/1kXX_ifuY5cnHqUt9Xt0cmzAp_NiVho9n)

Dates: June 27-28, 2019

Facilitator: [Steve Wilson](https://steverw.com)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.



In [0]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------
import collections
import re
import string
import warnings
warnings.filterwarnings('ignore')

# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')

# numpy: matrix library for Python
import numpy as np

# Gensim for topic modeling
import gensim
# for loading data
import sklearn.datasets
# for LDA visualization
!pip install pyLDAvis
import pyLDAvis
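# note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models;
# this notebook assumes an older release in which pyLDAvis.gensim exists
import pyLDAvis.gensim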

# for uploading data files
from google.colab import files

# downloading values lexicon
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/lexicon_1_0/values_lexicon.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/christian_500.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/business_500.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/college_500.txt

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set([])
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens and \
                      re.match(r"^\w+$",lemma)]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
def docs2matrix(document_list):
    
    # use the vocab2index idea from before
    vocab2index = {}
    
    # load the stopword list for English
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stop_words |= set(['from', 'subject', 're', 'edu', 'use'])
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # next free column index (vocab2index was initialized above)
    latest_index = 0

    lfs = []
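    # compute lemma frequencies for each document and grow the vocabulary
    for doc in document_list:
        # note: the second argument is the boolean remove_stop_words flag, so
        # passing this (non-empty, truthy) set only switches on the default NLTK
        # stop word list; the custom additions above are not applied
        lf = text_to_lemma_frequencies(doc,all_removal_tokens)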
        for token in lf.keys():
            if token not in vocab2index:
                vocab2index[token] = latest_index
                latest_index += 1
                
        lfs.append(lf)
    
    # create the zeros matrix
    corpus_matrix = np.zeros((len(lfs), len(vocab2index)))
    
    for row, lf in enumerate(lfs):
        for token, frequency in lf.items():
            column = vocab2index[token]
            corpus_matrix[row][column] = frequency
    
    return corpus_matrix, vocab2index

    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'


            
print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---

## Content Analysis

- Now that we have some real data, what are some ways that we can explore what's in it?
    - How can we answer the basic question: *What are people talking about in this corpus?*

### Topic Modeling

- Load a corpus matrix, like the ones we created earlier, into gensim's corpus object:

In [0]:
# this time, let's load all documents in the 20news dataset from these categories
categories = ['soc.religion.christian', 'rec.autos', 'talk.politics.misc', \
              'rec.sport.baseball', 'comp.sys.ibm.pc.hardware']
newsgroups_train_all = sklearn.datasets.fetch_20newsgroups(subset='train', \
                                              categories=categories).data
# using the function we wrote before, but modified to also return the vocab2index
corpus_matrix, word2id = docs2matrix(newsgroups_train_all)
# reverse this dictionary
id2word = {v:k for k,v in word2id.items()}

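# Dense2Corpus treats each column as a document by default;
# our matrix has one document per row, so we set documents_columns=False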
corpus = gensim.matutils.Dense2Corpus(corpus_matrix, documents_columns=False)
print("Loaded",len(corpus),"documents into a Gensim corpus.")
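
- To make the corpus format concrete, here is a minimal pure-Python sketch (not gensim itself) of what a row-oriented count matrix looks like once it is expressed as bag-of-words documents, i.e. lists of `(token_id, count)` pairs for the nonzero entries. The names `toy_matrix`, `toy_id2word`, and `rows_to_bow` are made up for this illustration.

```python
# Pure-Python illustration of the bag-of-words format streamed by a gensim
# corpus: one list of (token_id, count) pairs per document, zeros omitted.

# toy count matrix: 2 documents (rows) x 4 vocabulary entries (columns)
toy_matrix = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
]
toy_id2word = {0: "car", 1: "game", 2: "engine", 3: "team"}

def rows_to_bow(matrix):
    """Convert each row of a dense count matrix into sparse (id, count) pairs."""
    return [[(col, float(count)) for col, count in enumerate(row) if count != 0]
            for row in matrix]

toy_bow = rows_to_bow(toy_matrix)
for doc in toy_bow:
    # map ids back to words to make the result readable
    print([(toy_id2word[token_id], count) for token_id, count in doc])
```

- Iterating over `corpus` from the cell above should yield documents in this same shape, with ids that can be looked up in `id2word`.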

- Given this, we can run LDA right out of the box:

In [0]:
# As of July 2019, gensim calls a deprecated numpy function and gives lots of warning messages
# Let's suppress these.
warnings.filterwarnings('ignore')

# run LDA on our corpus, using our dictionary (k=6)
lda = gensim.models.LdaModel(corpus, id2word=id2word, num_topics=6)
lda.print_topics()

- There is still quite a bit of noise in this list because the documents are full of very common words like "write", "subject", and "from".
- One common approach is to remove the most (and possibly least) common words before running LDA.


In [0]:
total_counts = np.sum(corpus_matrix, axis=0)
sorted_words = sorted( zip( range(len(total_counts)) ,total_counts), \
                       key=lambda x:x[1], reverse=True )
N = 100
M = 50
top_N_ids = [item[0] for item in sorted_words[:N]]
appears_less_than_M_times = [item[0] for item in sorted_words if item[1] < M]
vocab_dense = [id2word[idx] for idx in range(len(id2word))]

print("Top words to remove:", ' '.join([id2word[idx] for idx in top_N_ids]))

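# de-duplicate in case a word appears in both lists, so that the
# vocabulary deletion below never removes the same index twice
remove_indexes = sorted(set(top_N_ids+appears_less_than_M_times))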
corpus_matrix_filtered = np.delete(corpus_matrix,remove_indexes,1)

for index in sorted(remove_indexes, reverse=True):
    del vocab_dense[index]

id2word_filtered = {}
word2id_filtered = {}

for i,word in enumerate(vocab_dense):
    id2word_filtered[i] = word
    word2id_filtered[word] = i
    
corpus_filtered = gensim.matutils.Dense2Corpus(corpus_matrix_filtered, documents_columns=False)

print("Original matrix shape:",corpus_matrix.shape)
print("New matrix shape:",corpus_matrix_filtered.shape)

- Now, run LDA again using this new, filtered matrix:

In [0]:
lda = gensim.models.LdaModel(corpus_filtered, id2word=id2word_filtered, num_topics=6)
lda.print_topics()

- We can also use this model to get topic probabilities for unseen documents:

In [0]:
unseen_doc = "I went to the baseball game and saw the player hit a homerun !"
unseen_doc_bow = [word2id_filtered.get(word.lower(),-1) for word in unseen_doc.split()]
unseen_doc_vec = np.zeros(len(word2id_filtered))
for word in unseen_doc_bow:
    if word >= 0:
        unseen_doc_vec[word] += 1
unseen_doc_vec = unseen_doc_vec[np.newaxis]
unseen_doc_corpus = gensim.matutils.Dense2Corpus(unseen_doc_vec, documents_columns=False)
vector = lda[unseen_doc_corpus]  # get topic probability distribution for a document
for item in vector:
    print(item)
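
- Each item printed above is a list of `(topic_id, probability)` pairs for one document. As a small sketch of how such a list might be used, the code below picks the most probable topic. The numbers in `example_doc_topics` are invented for illustration, and `dominant_topic` is a hypothetical helper, not part of gensim.

```python
# Hypothetical output for one document: (topic_id, probability) pairs.
# These numbers are made up; real values depend on the trained model.
example_doc_topics = [(0, 0.05), (2, 0.71), (4, 0.24)]

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

best_topic, best_prob = dominant_topic(example_doc_topics)
print("Most likely topic:", best_topic, "with probability", best_prob)
```

- Note that topic ids are arbitrary labels: to interpret the winning topic, look up its top words in the `print_topics()` output.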

- pyLDAvis is a nice tool for visualizing our topics:

In [0]:
pyLDAvis.enable_notebook()

# need to create a gensim dictionary object instead of our
# lightweight dict object - this is what pyLDAvis expects as input
dictionary = gensim.corpora.Dictionary()
dictionary.token2id = word2id_filtered

# visualize the LDA model
vis = pyLDAvis.gensim.prepare(lda, corpus_filtered, dictionary)
vis

### Lexical Resources

- For a fast and easy content analysis, use one of the many available prebuilt dictionaries/lexicons.
    - These map words or stems to semantic categories.
- We have discussed several lexicons in the slides.
- As an example, let's load the lexicon for measuring personal values in text:

In [0]:
def load_lexicon(lexicon_file_path):
    word2cat = collections.defaultdict(list)
    with open(lexicon_file_path,'r') as lexicon_file:
        for line in lexicon_file:
            # strip first so that blank lines (which still contain "\n") are skipped
            line = line.strip()
            if line:
                word, cat = line.split(", ")
                word2cat[word].append(cat)
    return word2cat
            
values_lexicon = load_lexicon("values_lexicon.txt")
print("Loaded lexicon with",len(values_lexicon),"entries.")
print("The categories for 'mother' are:",values_lexicon['mother'])

It's very easy to score a document for each category:


In [0]:
file_list = ["christian_500.txt", "business_500.txt", "college_500.txt"]

for file_name in file_list:
    category_counts = collections.defaultdict(int)
    
    # just look at the first 25K characters
    # this way, we don't need to normalize based on the length of the document
    # and we'll save some time since this is just for demonstration purposes
    with open(file_name) as text_file:
        text = text_file.read().lower()[:25000]
    for pattern, categories in values_lexicon.items():
        # find every whole-word occurrence of this lexicon entry
        matches = re.findall(r'\b' + pattern + r'\b', text)
        if matches:
            for category in categories:
                category_counts[category] += len(matches)
    print(file_name,sorted(category_counts.items(),key=lambda x:x[1], reverse=True))
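
- The cell above avoids length normalization by truncating every file to the same number of characters. When documents differ in length, one alternative is to scale the counts by document length. The sketch below reports matches per 1,000 tokens; `toy_lexicon`, its category names, and `score_per_1000_tokens` are invented for this example and are not taken from the values lexicon.

```python
import collections
import re

# Hypothetical two-entry lexicon in the same word -> [categories] shape
# that load_lexicon() returns; the entries are invented for this example.
toy_lexicon = {"mother": ["Family"], "job": ["Work"]}

def score_per_1000_tokens(text, lexicon):
    """Count lexicon matches per category, scaled to matches per 1,000 tokens."""
    text = text.lower()
    n_tokens = len(re.findall(r"\w+", text))
    scores = collections.defaultdict(float)
    if n_tokens == 0:
        return dict(scores)
    for pattern, categories in lexicon.items():
        n_matches = len(re.findall(r"\b" + pattern + r"\b", text))
        for category in categories:
            scores[category] += 1000.0 * n_matches / n_tokens
    return dict(scores)

toy_scores = score_per_1000_tokens(
    "My mother started a new job . Her job is great", toy_lexicon)
print(toy_scores)
```

- With scores expressed as rates, documents of different lengths can be compared directly, at the cost of noisier values for very short texts.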

### Putting it together: analyzing social media content

- Let's try out some content analysis on the corpus that you created.
    - Keep in mind any characteristics of the data that may pollute the content, such as "RT" at the beginning of a tweet, that you may want to filter out.
- Here is some sample code to load a file from your computer. Go ahead and run it to upload your data.

In [0]:
# Load your social media data that you created in the previous section.
uploaded = files.upload()
print("Uploaded",len(uploaded),"files.")
for filename in uploaded:
    print("File:",filename)
    # access file data here
    with open(filename) as file_handle:
        data = file_handle.read()

**Exercise 5:** Social Media Analysis

- Run LDA on your own data
    - Write some code to convert your data into a gensim corpus
    - This should involve:
        - parsing the input based on the format that you decided upon when you saved the file in the previous section.
        - preprocessing the data (you can do this however you like, perhaps differently than we did before -- it's up to you), which might include:
            - punctuation removal
            - sentence and word tokenization
            - cleaning the data (e.g., spelling correction, converting emoji, removing links)
            - lemmatization or stemming (if you choose)
        - generating a vocab and a count matrix
        - creating the gensim corpus object
- This one may take a bit more code than the previous exercises.
    - It may be a good idea to split your code into separate functions.
    - Feel free to search around for documentation or examples. You can use different functions or approaches than we used in the other sections if you find a methodology that you prefer.
    - Of course, you can write and run all the code on your own machine if you don't want to do it all in colab.
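
- As one possible starting point for the cleaning step, here is a small sketch that strips a leading "RT" marker, links, and @mentions using regular expressions. `clean_tweet` is a hypothetical helper and the patterns are deliberately simple assumptions; for example, they would also mangle email addresses, so adjust them to your data.

```python
import re

def clean_tweet(text):
    """Strip a leading retweet marker, links, and @mentions from one tweet."""
    text = re.sub(r"^RT\s+", "", text)          # "RT" marker at the start
    text = re.sub(r"https?://\S+", "", text)    # links
    text = re.sub(r"@\w+:?", "", text)          # user mentions (and a trailing colon)
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

cleaned = clean_tweet(
    "RT @someuser: Great game tonight! https://example.com/abc #baseball")
print(cleaned)
```

- Applying a function like this to each document before tokenization keeps platform-specific noise out of the vocabulary.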

In [0]:
# load the data into a list of documents

# preprocess each document

# create a gensim corpus object and id2word dictionary

corpus = None
id2word = None

In [0]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

# This is based on the format used to store the output in the 
# previous sample solution
docs = open("mytweets.txt").read().split("</TWEET>\n<TWEET>")
docs[0] = docs[0].replace("<TWEET>","")
# the closing tag is left over on the last document, not the second one
docs[-1] = docs[-1].replace("</TWEET>","")

# preprocess each document

corpus_matrix, word2id = docs2matrix(docs)

# create a gensim corpus object and id2word dictionary

id2word = {v:k for k,v in word2id.items()}

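# Dense2Corpus treats each column as a document by default;
# our matrix has one document per row, so we set documents_columns=False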
corpus = gensim.matutils.Dense2Corpus(corpus_matrix, documents_columns=False)
print("Loaded",len(corpus),"documents into a Gensim corpus.")

- Now that you have the corpus object, let's fit an LDA model and visualize the results:

In [0]:
# Change this, if you like
k = 10

lda = gensim.models.LdaModel(corpus, id2word=id2word, num_topics=k)
lda.print_topics()

# need to create a gensim dictionary object instead of our
# lightweight dict object - this is what pyLDAvis expects as input
dictionary = gensim.corpora.Dictionary()
dictionary.token2id = word2id

# visualize the LDA model
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
vis

- If you still have time, try running the values lexicon on your data.
    - Think about how you can use the corpus_matrix that you (may) have already created to easily get the counts of each category for each document.
    - Which document has the highest score for each value category?
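
- One way to approach this, sketched on toy data: build a words-by-categories indicator matrix and multiply the count matrix by it, which gives a documents-by-categories score matrix in one step. All names and values below (`toy_corpus_matrix`, `toy_word2id`, `toy_lexicon`, the category labels) are invented stand-ins. The sketch also assumes lexicon entries can be looked up as exact vocabulary words, which differs from the regular-expression matching used earlier.

```python
import numpy as np

# Toy stand-ins for corpus_matrix, word2id, and the lexicon; all values invented.
toy_corpus_matrix = np.array([[2, 0, 1],
                              [0, 3, 1]])          # 2 documents x 3 words
toy_word2id = {"mother": 0, "job": 1, "money": 2}
toy_lexicon = {"mother": ["Family"], "job": ["Work"], "money": ["Work", "Wealth"]}

# one column per category, with a 1 wherever a vocabulary word belongs to it
toy_categories = sorted({cat for cats in toy_lexicon.values() for cat in cats})
word_by_category = np.zeros((len(toy_word2id), len(toy_categories)))
for word, cats in toy_lexicon.items():
    if word in toy_word2id:                 # exact-match lookup only
        for cat in cats:
            word_by_category[toy_word2id[word], toy_categories.index(cat)] = 1

# (documents x words) @ (words x categories) -> (documents x categories)
doc_by_category = toy_corpus_matrix @ word_by_category

# index of the highest-scoring document for each category (first one on ties)
top_docs = doc_by_category.argmax(axis=0)
for cat, doc_index in zip(toy_categories, top_docs):
    print(cat, "-> document", doc_index)
```

- Because the counts here are raw, longer documents tend to win; dividing each row by its total count first would compare documents by rate instead.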

In [0]:
# Bonus activity workspace


- [-> Next: Text Embeddings](https://colab.research.google.com/drive/1kXX_ifuY5cnHqUt9Xt0cmzAp_NiVho9n)