# Topic Modeling with LDA and NMF

*Note: Much of this code is adapted from Aneesha Bakharia's blog post on topic modeling:*  
https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

In [1]:
# Standard libraries
import pandas as pd
import numpy as np

# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [2]:
# Things to try
# - different amounts of topics
# - different corpora (full_text, be_here_now, quotes)

### Load in data

In [3]:
# Choose path to data
path = 'full_text.txt'
#path = 'quotes.txt'
#path = 'be_here_now.txt'

In [4]:
# Instantiate documents list
documents = []

# Open the file (closed automatically by the with block), iterate through
# each line (document), strip trailing whitespace, and append to the documents list
with open(path, encoding='utf-8') as f:
    for line in f:
        documents.append(line.rstrip())

In [5]:
# Get rid of extra newlines that resulted in empty strings
documents = [doc for doc in documents if doc != '']

In [6]:
# View documents
documents

['There are three stages in this journey that I have been on! The first, the social science stage; the second, the psychedelic stage; and the third, the yogi stage. They are summating—that is, each is contributing to the next. It’s like the unfolding of a lotus flower. Now, as I look back, I realize that many of the experiences that made little sense to me at the time they occurred, were prerequisites for what was to come later. I want to share with you the parts of the Internal Journey that never get written up in the mass media. I’m not interested in the political parts of the story; I’m not interested in what you read in the Saturday Evening Post about LSD. This is the story of what goes on inside a human being who is undergoing all these experiences.',
 'And as a therapist I felt caught in the drama of my own theories.',
 'Now my own experiences were horrible and beautiful and I kept working in different environments and settings and whenever anybody that I trusted brought along so

In [7]:
# Check length of documents
print('Number of documents:', len(documents))

Number of documents: 1195


### Clean + preprocess text

In [8]:
def clean_text(document_string):
    """
    Function that takes in a document in
    the form of a string, and preprocesses
    it, returning a clean string ready
    to be used to fit a CountVectorizer.
    
    Preprocessing includes:
    - lowercasing text
    - eliminating punctuation
    - dealing with edge case punctuation
      and formatting
    - replacing contractions with
      the proper full words
      
    :param: document_string: str
    
    :returns: cleaned_text: str
    """
    # Make text lowercase
    raw_text = document_string.lower()

    # Replace encoding error with a space
    raw_text = raw_text.replace('\xa0', ' ')
    
    # Make hyphenated versions non-hyphenated
    raw_text = raw_text.replace('mahara-ji', 'maharaji')
    raw_text = raw_text.replace('maharaj-ji', 'maharaji')

    # Remove periods
    raw_text = raw_text.replace('.', '')

    # Replace exclamation point with a space
    raw_text = raw_text.replace('!', ' ')

    # Replace slashes with empty
    raw_text = raw_text.replace('/', '')

    # Replace question marks with a space
    raw_text = raw_text.replace('??', ' ')
    raw_text = raw_text.replace('?', ' ')

    # Replace dashes with space
    raw_text = raw_text.replace('-', ' ')
    raw_text = raw_text.replace('—', ' ')

    # Replace ... with empty
    raw_text = raw_text.replace('…', '')
    raw_text = raw_text.replace('...', '')
    
    # Replace = with 'equals'
    raw_text = raw_text.replace('=', 'equals')

    # Replace commas with empty
    raw_text = raw_text.replace(',', '')
    
    # Replace ampersand with and
    raw_text = raw_text.replace('&', 'and')

    # Replace semi-colon with empty
    raw_text = raw_text.replace(';', '')
    
    # Replace colon with empty
    raw_text = raw_text.replace(':', '')

    # Get rid of brackets
    raw_text = raw_text.replace('[', '')
    raw_text = raw_text.replace(']', '')
    
    # Replace parentheses with empty
    raw_text = raw_text.replace('(', '')
    raw_text = raw_text.replace(')', '')
    
    # Replace symbols with letters
    raw_text = raw_text.replace('$', 's')
    raw_text = raw_text.replace('¢', 'c')

    # Replace quotes with nothing
    raw_text = raw_text.replace('“', '')
    raw_text = raw_text.replace('”', '')
    raw_text = raw_text.replace('"', '')
    raw_text = raw_text.replace("‘", "")

    # Get rid of backslashes indicating contractions
    # ('\\' is one literal backslash; the raw string r'\\' would only match two in a row)
    raw_text = raw_text.replace('\\', '')

    # Replace extra spaces with single space
    raw_text = raw_text.replace('   ', ' ')
    raw_text = raw_text.replace('  ', ' ')

    # Some apostrophes are of a different type --> ’ instead of '
    raw_text = raw_text.replace("’", "'")

    # Replace contractions with full words, organized alphabetically
    raw_text = raw_text.replace("can't", 'cannot')
    raw_text = raw_text.replace("didn't", 'did not')
    raw_text = raw_text.replace("doesn't", 'does not')
    raw_text = raw_text.replace("don't", 'do not')
    raw_text = raw_text.replace("hasn't", 'has not')
    raw_text = raw_text.replace("he's", 'he is')
    raw_text = raw_text.replace("i'd", 'i would')
    raw_text = raw_text.replace("i'll", 'i will')
    raw_text = raw_text.replace("i'm", 'i am')
    raw_text = raw_text.replace("isn't", 'is not')
    raw_text = raw_text.replace("it's", 'it is')
    raw_text = raw_text.replace("nobody's", 'nobody is')
    raw_text = raw_text.replace("she's", 'she is')
    raw_text = raw_text.replace("shouldn't", 'should not')
    raw_text = raw_text.replace("that'll", 'that will')
    raw_text = raw_text.replace("that's", 'that is')
    raw_text = raw_text.replace("there'd", 'there would')
    raw_text = raw_text.replace("there's", 'there is')
    raw_text = raw_text.replace("they're", 'they are')
    raw_text = raw_text.replace("wasn't", 'was not')
    raw_text = raw_text.replace("we'd", 'we would')
    raw_text = raw_text.replace("we'll", 'we will')
    raw_text = raw_text.replace("we're", 'we are')
    raw_text = raw_text.replace("we've", 'we have')
    raw_text = raw_text.replace("you'll", 'you will')
    raw_text = raw_text.replace("you're", 'you are')
    raw_text = raw_text.replace("you've", 'you have')

    # Expand any remaining 's as ' is'
    # (note: this also rewrites possessives, e.g. "guru's" becomes "guru is")
    raw_text = raw_text.replace("'s", ' is')
    
    cleaned_text = raw_text
    
    return cleaned_text
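
The long chain of contraction replacements above could also be driven by a lookup table. The following is a minimal sketch under that assumption, not the notebook's method: `CONTRACTIONS` and `expand_contractions` are hypothetical names, and the map holds only a small subset of the contractions handled in `clean_text`.

```python
import re

# Hypothetical subset of the contraction map; extend it as needed
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "we're": "we are",
}

# Build one alternation pattern, longest keys first so that a longer
# contraction is preferred over any shorter key it might overlap with
_pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
)

def expand_contractions(text):
    """Lowercase text, normalize curly apostrophes, and expand known contractions."""
    text = text.lower().replace("\u2019", "'")
    return _pattern.sub(lambda match: CONTRACTIONS[match.group(0)], text)

print(expand_contractions("I\u2019m sure it\u2019s fine, but we\u2019re late"))
```

As in `clean_text`, the curly apostrophe has to be normalized before the contractions are matched, otherwise none of the keys would be found.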

In [9]:
# Clean all documents
cleaned_documents = [clean_text(doc) for doc in documents]

In [10]:
# View cleaned documents
cleaned_documents

['there are three stages in this journey that i have been on the first the social science stage the second the psychedelic stage and the third the yogi stage they are summating that is each is contributing to the next it is like the unfolding of a lotus flower now as i look back i realize that many of the experiences that made little sense to me at the time they occurred were prerequisites for what was to come later i want to share with you the parts of the internal journey that never get written up in the mass media i am not interested in the political parts of the story i am not interested in what you read in the saturday evening post about lsd this is the story of what goes on inside a human being who is undergoing all these experiences',
 'and as a therapist i felt caught in the drama of my own theories',
 'now my own experiences were horrible and beautiful and i kept working in different environments and settings and whenever anybody that i trusted brought along some new chemical 

In [11]:
# Find average length of documents by words and by characters
# Initialize count lists
char_length = []
word_length = []

# Iterate through each document and find lengths
for doc in cleaned_documents:
    char_length.append(len(doc))
    word_length.append(len(doc.split(' ')))
    
# Calculate means
char_mean = int(round(np.mean(char_length)))
word_mean = int(round(np.mean(word_length)))

# View averages
print('The average number of characters in a document is:', char_mean)
print('The average number of words in a document is:', word_mean)

The average number of characters in a document is: 313
The average number of words in a document is: 62
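
One caveat on the word counts above: `doc.split(' ')` splits on every single space, so leftover double spaces or trailing spaces produce empty strings that are counted as words. A small sketch with a made-up string (not from the corpus) shows the difference from `split()` with no argument:

```python
# Hypothetical example string with a double space and a trailing space
text = "be  here now "

with_single_space = text.split(' ')   # splits on each individual space
with_whitespace = text.split()        # splits on runs of whitespace

print(with_single_space, len(with_single_space))
print(with_whitespace, len(with_whitespace))
```

If the cleaned documents still contain repeated spaces, the reported word average may be slightly inflated.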


In [12]:
def display_topics(model, feature_names, no_top_words):
    """
    Function that takes in a model, feature_names,
    and no_top_words and displays topics and top
    words in a readable fashion.
    
    :param: model: sklearn.decomposition
    :param: feature_names: list
    :param: no_top_words: int
    
    :returns: printed topics and top words
    """
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print()
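
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is compact, so here is a small sketch of what it does. The weights and feature names below are made up for illustration.

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
feature_names = ['ego', 'love', 'mind', 'heart', 'soul']
no_top_words = 3

# argsort() returns indices in ascending order of weight;
# the slice walks backwards from the end and keeps no_top_words indices,
# giving the highest-weighted features first
top_indices = weights.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(top_indices)
print(top_words)
```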

In [13]:
# Set number of features
no_features = 1000

### Non-negative Matrix Factorization (NMF)

In [14]:
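# NMF using tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(cleaned_documents)
# Note: in scikit-learn 1.2+, get_feature_names() was removed; use get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# Set number of topics
no_topics = 20

# Run NMF
# Note: in scikit-learn 1.2+, alpha was removed in favor of alpha_W and alpha_H,
# which are scaled differently, so the value is not directly interchangeable
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)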

In [15]:
# View results
no_top_words = 10
display_topics(nmf, tfidf_feature_names, no_top_words)

Topic 0:
just say like come really people soul look awareness ego

Topic 1:
love person beloved everybody fear getting power consciously goes consciousness

Topic 2:
suffering relieve compassion root close deal trip pain situation joy

Topic 3:
moment comes death affecting deeply fully experience present creative mind

Topic 4:
going simple mantra like left begin process meet oh instead

Topic 5:
heart mind open spiritual emotional use human witness soul comes

Topic 6:
thought think breath thinking god mind home hear completely day

Topic 7:
way saying works seeing got attachment set experience action jesus

Topic 8:
life world live living awaken experiences transformation people unfolding deeply

Topic 9:
place energy form universe doing inside pure exists got atman

Topic 10:
thoughts senses content identification awareness identify just reality got set

Topic 11:
guru god man separate meet seeker speak point met form

Topic 12:
right doing inside message dance righteousness need ce
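
The listing above shows the top words per topic, taken from `components_`. The other half of the factorization is the document-topic matrix, which can be used to find the strongest topic for each document. A minimal sketch on a made-up toy corpus (the `toy_*` names and sentences are assumptions, not the notebook's data):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the notebook's data
toy_docs = [
    "love opens the heart",
    "the heart is full of love",
    "breath calms the thinking mind",
    "watch the breath and the mind",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# fit_transform returns the document-topic matrix W,
# with one row per document and one column per topic
toy_nmf = NMF(n_components=2, random_state=1, init='nndsvd')
doc_topic = toy_nmf.fit_transform(toy_tfidf)

# The column with the largest weight is the document's dominant topic
dominant_topic = doc_topic.argmax(axis=1)
for doc, topic_idx in zip(toy_docs, dominant_topic):
    print(topic_idx, doc)
```

With the fitted model from the notebook, the same matrix would come from `nmf.transform(tfidf)`.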

### Latent Dirichlet Allocation (LDA)

In [16]:
# LDA uses raw term counts (not tf-idf) because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(cleaned_documents)
# Note: in scikit-learn 1.2+, get_feature_names() was removed; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()

# Run LDA
# Note: n_topics was renamed to n_components in scikit-learn 0.19 and later removed
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
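
In scikit-learn, the rows of a fitted LDA model's `components_` are unnormalized pseudocounts rather than probabilities. Dividing each row by its sum gives a per-topic distribution over words. A minimal sketch on a made-up toy corpus (the `toy_*` names and sentences are assumptions, not the notebook's data):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the notebook's data
toy_docs = [
    "love opens the heart",
    "the heart is full of love",
    "breath calms the thinking mind",
    "watch the breath and the mind",
]

# LDA is fit on raw term counts
toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# Normalize each topic's pseudocounts so that every row sums to 1
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

# Document-topic matrix: one row per document, one column per topic
doc_topic = toy_lda.transform(toy_counts)

print(topic_word.sum(axis=1))
print(doc_topic.shape)
```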



In [17]:
# View results
no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
sound fact structure training identified works slowly evening heaven maharaji

Topic 1:
awareness mind consciousness thing loving energy thinking humor mantra sacrifice

Topic 2:
plane let physical astral idea causal pure channel consciousness yes

Topic 3:
fact pick telephone want power life minute long suffering possible

Topic 4:
listening chakra tuning beauty fourth starting butterfly energy caterpillar instrument

Topic 5:
surrender fear line golf end chakra certain fine interested forces

Topic 6:
desire says christ guru father ceremony ye lord ouspensky thou

Topic 7:
clock place hearing mother ear looked dying illusion watch saw

Topic 8:
pain day grace new stop thing wow york want say

Topic 9:
moment suffering heart way hear universe place comes open form

Topic 10:
drama qualities said man answer guy future butterfly came know

Topic 11:
life helping niyamas yamas try audience takes purification karma hold

Topic 12:
happened paradox vibrations exquisite perfection 