# NLP Topic Modeling in Python with Non-negative Matrix Factorization 

## Using Jupyter Notebooks 

Jupyter Notebook is an interactive Python environment for data science.  Cells are separated into Markdown (i.e., text) cells and code cells.  In this notebook, you should not need to edit code (unless you really want to!), so you can just run each cell by highlighting it and pressing "Cmd + Return" or using the "> Run" button at the top.

Note: A best practice is to import packages in the first cell of the notebook.  However, given that this is a tutorial, I will import them in the first cell in which they are used to more closely associate each package with its use.  

## Installing Packages 
First, we'll install some packages we'll use today for our text manipulation and topic modeling. This may produce a lot of output so please be patient. 

In [None]:
!pip install nltk

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

## 1. Import our Data

We're going to use pandas to import and inspect our data.  Notice that our text column is already in lower case and contains the article text from Wikipedia.  

In [1]:
# Import Data 
import pandas as pd 

url = 'https://raw.githubusercontent.com/team-evolytics/data_science_party_nlp_tutorial/main/people_wiki.csv'

df = pd.read_csv(url)

#Inspect our dataframe
df.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...



## 2. Pre-process our Text 

Before we analyze our text we need to clean it.  Cleaning text involves standardizing it and removing terms that are non-informative.  Terms that occur in most documents or, alternatively, in very few are unlikely to help us decide how to group documents. 

In [None]:
import re
import string
from nltk.tokenize.regexp import WordPunctTokenizer
from nltk.corpus import stopwords


# These tables determine what punctuation and digits are replaced with: punctuation becomes a space, digits are deleted.
punct_table = str.maketrans({ch: ' ' for ch in string.punctuation})  
digit_table = str.maketrans('', '', string.digits)

# Text cleaning functions
remove_punctuation = lambda x: x.translate(punct_table)
remove_numbers = lambda x: x.translate(digit_table)
remove_urls = lambda x: re.sub(r"http\S+", "", x)


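# Tokenize texts.  Note: you can comment out individual cleaning steps with a # to change how tokenization occurs. 
def tokenize(text):
    """
    Takes a single string (one document) and returns a list of lowercase
    tokens with URLs, punctuation, digits, and stopwords removed.
    """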
    # Creates stopword set from NLTK (a set makes membership checks fast).
    sw = set(stopwords.words("english") + [''])
    
    # Text cleaning. 
    text = remove_urls(text) # removes urls 
    text = remove_punctuation(text) # replaces punctuation with spaces
    text = remove_numbers(text) # removes digits; comment this line out to keep dates, which may be informative
    text = text.lower() # sets to lowercase

    # Tokenization 
    tokenizer = WordPunctTokenizer()
    tkns = tokenizer.tokenize(text) # tokenizes text

    # Remove stopwords 
    tokenized_text = [tkn for tkn in tkns if tkn not in sw]

    return tokenized_text 

Let's see a demonstration of what the above function is doing! 

In [None]:
text = "I moved to Kansas City in 2020 during the pandemic!  It was hard to find housing " \
       "with social distancing but I was eventually able to find one on Zillow (http://www.zillow.com)."

tokenize(text)
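
To see what each cleaning step contributes on its own, here is a minimal sketch that applies the same regular expression and translation tables one step at a time. It uses only the standard library: the sample sentence is made up for illustration, plain `str.split()` stands in for `WordPunctTokenizer`, and stopword removal is skipped, so the final list is not identical to what `tokenize` returns.

```python
import re
import string

# Same translation tables as in the cleaning cell above.
punct_table = str.maketrans({ch: ' ' for ch in string.punctuation})
digit_table = str.maketrans('', '', string.digits)

# Hypothetical sample sentence, for illustration only.
sample = "Moved in 2020! See http://www.zillow.com for listings."

no_urls = re.sub(r"http\S+", "", sample)      # drop anything starting with "http"
no_punct = no_urls.translate(punct_table)     # punctuation -> spaces
no_digits = no_punct.translate(digit_table)   # digits deleted
tokens = no_digits.lower().split()            # whitespace split as a simple stand-in tokenizer

for label, value in [("original", sample),
                     ("no urls", no_urls),
                     ("no punctuation", no_punct),
                     ("no digits", no_digits),
                     ("tokens", tokens)]:
    print("{:>15}: {!r}".format(label, value))
```

Note that the URL is removed before punctuation. If the order were reversed, the URL would already be broken into pieces and the `http\S+` pattern would only catch its first fragment.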

### Feature Extraction (Vectorization) 

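Computers only understand numbers, so we need to convert the tokenized documents into vectors.  To do this we'll use the term frequency - inverse document frequency (TF-IDF) metric, which gives a term more weight the more often it appears in a document and the fewer documents it appears in overall. 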
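
To build some intuition for TF-IDF before handing the work to scikit-learn, here is a small hand-rolled sketch on a made-up three-document corpus. The smoothed IDF formula and the length normalization are, to my knowledge, what `TfidfVectorizer` does with its default settings, but treat this as an illustration rather than a drop-in replacement. The toy documents and the helper names `idf` and `tfidf` are assumptions for this example.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
docs = [
    ["football", "player", "league"],
    ["football", "coach", "league"],
    ["singer", "album", "tour"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Term frequency (raw count) times IDF, then scaled to unit length.
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(docs[0])
for term, weight in sorted(first.items(), key=lambda kv: -kv[1]):
    print("{:<10} {:.3f}".format(term, weight))
```

In the first document, "player" ends up with a higher weight than "football" or "league" because it appears in only one document, while the other two terms are shared with the second document.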

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize, max_df=.95, min_df=.0001, ngram_range=(1,2))
doc_term_matrix = vectorizer.fit_transform(list(df['text']))

print("Our matrix has {} documents and {} vocabulary terms.".format(doc_term_matrix.shape[0], 
                                                                    doc_term_matrix.shape[1]))

# Store our vocab for later use. 
model_features = vectorizer.get_feature_names_out()

print("Note that our model features list has the same length as the vocabulary (i.e., they are the same): {} \n".format(len(model_features)))
print("Sample Feature Names: ", model_features[100:110])
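
The `max_df`, `min_df`, and `ngram_range` arguments passed to the vectorizer above control which terms make it into the vocabulary. The sketch below imitates that filtering by hand on a made-up corpus: it adds bigrams to each document (as `ngram_range=(1,2)` does), counts the share of documents each term appears in, and keeps only terms inside the allowed band. The toy documents, the helper `with_bigrams`, and the `min_df` value of 0.5 are assumptions chosen so the effect is visible on four documents.

```python
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
docs = [
    ["born", "football", "player"],
    ["born", "football", "coach"],
    ["born", "jazz", "singer"],
    ["born", "jazz", "pianist"],
]

def with_bigrams(tokens):
    # Unigrams plus adjacent word pairs, similar in spirit to ngram_range=(1, 2).
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

expanded = [with_bigrams(doc) for doc in docs]
n_docs = len(expanded)

# Share of documents containing each term.
doc_freq = Counter(term for doc in expanded for term in set(doc))

max_df, min_df = 0.95, 0.5  # illustrative thresholds, as proportions of documents
vocab = sorted(term for term, df in doc_freq.items()
               if min_df <= df / n_docs <= max_df)

print(vocab)
```

Here "born" is dropped because it appears in every document, and one-off terms such as "pianist" are dropped because they appear in too few. What survives are the terms that actually separate the football biographies from the jazz ones.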

## 3. Model Creation: Non-Negative Matrix Factorization (NMF)

Shout out to Hui for teaching me that this could be used for topic modeling!  We are going to keep things very simple, but I want to provide at least a superficial explanation of the logic behind this technique.  

Non-negative matrix factorization (NMF) seems like an intimidating technique, but the basic logic is very simple. You may remember from elementary school that factorization is breaking a number down into numbers that, when multiplied together, equal the initial value (e.g., 30 = (2x3)x5).  We can do something similar with matrices.  However, matrix multiplication has a special requirement: the number of columns in the first matrix (we'll call it matrix W) must equal the number of rows in the second matrix (matrix H).  Because that shared dimension can take on many different values, we have to select one for the factorization; in topic modeling it is the number of topics.  Matrix multiplication follows slightly different rules from traditional multiplication which you can read about here.    
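
The shape requirement is easiest to see with a tiny example. The sketch below uses NumPy with made-up sizes (4 documents, 6 terms, 2 topics) and random values, so the numbers themselves mean nothing. The point is only how the dimensions line up when a documents x topics matrix is multiplied by a topics x terms matrix.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_docs, n_terms, n_topics = 4, 6, 2

rng = np.random.default_rng(0)
W = rng.random((n_docs, n_topics))   # documents x topics
H = rng.random((n_topics, n_terms))  # topics x terms

# The inner dimensions (n_topics) match, so the product is documents x terms.
V_approx = W @ H
print("W:", W.shape, "H:", H.shape, "W @ H:", V_approx.shape)

# If the inner dimensions do not match, NumPy refuses to multiply.
mismatch_raised = False
try:
    W @ H.T  # (4, 2) times (6, 2): 2 != 6
except ValueError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)
```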

Just like we can multiply a simple factorization back together to get the original value (e.g., (2x3)x5 = 30), we can multiply Matrix W X Matrix H to reconstruct our original Document X Term matrix.  However, our reconstructed matrix is unlikely to exactly match our original matrix and the degree to which the values differ tells us how well our proposed model fits.  

One last thing!  *Non-negative Matrix Factorization is exactly that - non-negative*.  No value in the matrix can be below zero.  This makes it a good fit for behavioral data (you can't have negative clicks) but a poor fit for things like financial data where there may be negative values.  
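
To make the ideas of reconstruction and non-negativity concrete, here is a minimal NumPy sketch of one classic way to fit an NMF: the multiplicative update rules. This is an illustration on a small random matrix, not what the tutorial's model runs (scikit-learn's `NMF` uses a different solver by default), and the matrix sizes, iteration count, and helper name `reconstruction_error` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "document x term" matrix: 6 documents, 8 terms.
V = rng.random((6, 8))

k = 3                      # number of topics (the shared inner dimension)
W = rng.random((6, k))     # documents x topics, non-negative start
H = rng.random((k, 8))     # topics x terms, non-negative start
eps = 1e-10                # avoids division by zero

def reconstruction_error(V, W, H):
    # Frobenius norm of the difference between V and its reconstruction W @ H.
    return np.linalg.norm(V - W @ H)

initial_error = reconstruction_error(V, W, H)

for _ in range(200):
    # Multiplicative updates: every factor is non-negative, so W and H stay non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = reconstruction_error(V, W, H)

print("Error before fitting: {:.3f}".format(initial_error))
print("Error after fitting:  {:.3f}".format(final_error))
print("Smallest value in W and H:", min(W.min(), H.min()))
```

The error does not reach zero because three topics cannot fully describe a random 6 x 8 matrix, which is the "unlikely to exactly match" caveat from above. Increasing `k` lowers the error at the cost of more, and usually less interpretable, topics.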

In [None]:
# Enter the number of topics to model. 
num_topics = 100

from sklearn.decomposition import NMF

# Initialize model
model = NMF(init='nndsvd', n_components=num_topics)

# Fit the model to our corpus 
model.fit(doc_term_matrix)

# Get document weights for each component. 
doc_weights = model.transform(doc_term_matrix)

## 4. Model Inspection 

### Retrieve Top Terms for each Topic

Here we are retrieving the terms that most characterize each topic.  Note that every term receives a weight for every topic, but we are interested in the terms with the largest weights for a given topic.  Generally we inspect the top N terms to get an idea of what the topic is about.  I've provided an easy way for you to vary the number of terms returned below. 

In [None]:
# How many terms do you want to retrieve? 
n_terms = 10

def get_nmf_topics(model, n_top_words, num_topics, feat_names):
    """Returns a DataFrame with one column per topic holding its top terms."""
    word_dict = {}
    for i in range(num_topics):
        # For each topic, take the indices of the largest weights and map them back to terms.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i)] = words

    return pd.DataFrame(word_dict)

nmf_topics = get_nmf_topics(model, n_terms, model.n_components, model_features)
nmf_topics
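
The slice `argsort()[:-n_top_words - 1:-1]` used in `get_nmf_topics` is compact but a little cryptic, so here is a small sketch of what it does on a single made-up topic. The feature names and weights are invented for illustration. In the real model the weights come from one row of `model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights over it.
feat_names = np.array(["album", "coach", "football", "league", "singer"])
topic_weights = np.array([0.0, 0.7, 1.2, 0.9, 0.1])
n_top_words = 3

# argsort returns the indices that would sort the weights in ascending order.
order = topic_weights.argsort()

# Walk backwards from the end, stopping after n_top_words items:
# the result is the indices of the largest weights, largest first.
top_ids = order[:-n_top_words - 1:-1]
top_terms = [str(feat_names[i]) for i in top_ids]

print("ascending order of indices:", order)
print("top term indices:          ", top_ids)
print("top terms:                 ", top_terms)
```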

### Inspecting Representative Documents

We typically want to be able to inspect the original data.  When we fit the model above, we saved the document weights, which measure how strongly each component (topic) characterizes each document.  We're going to select the component with the max weight for each document and assign the document to that topic.  Then we'll filter our dataframe by topic to see if our topic modeling worked! 
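
As a quick illustration of the "pick the component with the max weight" step, the sketch below applies `argmax` to a small made-up weights matrix with three documents and three topics. The document names and weights are assumptions for this example. The real `doc_weights` matrix has one row per biography and one column per topic.

```python
import numpy as np

# Hypothetical documents and their weights on three topics.
names = ["Doc A", "Doc B", "Doc C"]
doc_weights = np.array([
    [0.05, 0.80, 0.10],
    [0.60, 0.00, 0.30],
    [0.20, 0.25, 0.90],
])

# For each row (document), the column index of its largest weight.
topic_idx = doc_weights.argmax(axis=1)

for name, topic in zip(names, topic_idx):
    print("{} -> topic {}".format(name, topic))

# Filtering to a single topic, as we will do with the dataframe.
topic_id = 1
selected = [name for name, topic in zip(names, topic_idx) if topic == topic_id]
print("Documents assigned to topic {}: {}".format(topic_id, selected))
```

One caveat of this approach: `argmax` always returns exactly one topic per document, even when two topics have nearly equal weights, and a document whose weights are all zero is assigned to topic 0. It is worth keeping that in mind when a bio looks out of place in its topic.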

In [None]:
print("Note that the document weights matrix has the same number of rows as documents: {}".format(doc_weights.shape[0]))
print("Additionally, observe that it has the same number of columns as topics: {}".format(doc_weights.shape[1]))

Now let's print some bios to inspect! Do they make sense to you?  What themes do they have? 

In [None]:
# Which topic's documents do you want to inspect? 
topic_id = 47

# Reformatting topic number 
topic_col = 'Topic # ' + '{:02d}'.format(topic_id)

# Get topic terms 
print("These words characterize this topic: ", "\n")
print(nmf_topics[topic_col], "\n\n")

# Assign topics to biographies in the Dataframe
df["Topic_idx"] = doc_weights.argmax(axis=1)

# Filtering our dataframe. 
df_topic = df.loc[df['Topic_idx'] == topic_id] 
bios = zip(df_topic['name'], df_topic['text'])

# Displaying the selected bios. 
print("Here are the biographies for individuals who scored highly on this topic: ", '\n')

for bio in bios:
    print("Name: ", bio[0])
    print("Biography: ", bio[1], "\n\n")
    

## 5. Conclusion

Hopefully you found the tutorial above interesting!  If you want to learn more about cleaning and preprocessing text, as well as a different technique for topic modeling, check out the Evolytics blog series here:  

- [Part II. Preparing Text for Analysis with Natural Language Toolkit (NLTK)](https://evolytics.com/blog/open-ended-survey-questions-for-computational-analysis-part-ii/)
- [Part III. How to Find Near Duplicate Text and Recognize Name Entities in Survey Responses](https://evolytics.com/blog/survey-responses-duplicate-text-and-named-entities/)

Please feel free to use the above code for your own projects.  
