# Text analysis basics in Python
*Bigram/trigram, sentiment analysis, and topic modeling*

![](text_basics.png)
*[source](https://unsplash.com/photos/uGP_6CAD-14)*


This article covers the most basic text analysis tools in Python. We are not going into fancy NLP models, just the basics. Sometimes all you need is the basics :)

Let’s first get some text data. Here we have a list of course reviews that I made up. What can we do with this data? The first question that comes to mind: can we tell which reviews are positive and which are negative? In other words, can we run some sentiment analysis on these reviews?


In [None]:
corpus = [
'Great course. Love the professor.',
'Great content. Textbook was great',
'This course has very hard assignments. Great content.',
'Love the professor.',
'Hard assignments though',
'Hard to understand.'
]

## Sentiment analysis
Great, let’s look at the overall sentiment analysis. I like to work with a pandas data frame. So let’s create a pandas data frame from the list.

In [None]:
import pandas as pd
df = pd.DataFrame(corpus)
df.columns = ['reviews']

Next, let’s install the library `textblob` (`conda install textblob -c conda-forge`) and import the library.

In [None]:
from textblob import TextBlob
df['polarity'] = df['reviews'].apply(lambda x: TextBlob(x).polarity)
df['subjective'] = df['reviews'].apply(lambda x: TextBlob(x).subjectivity)

In the code above, we read the sentiment from TextBlob's `polarity` property. Polarity ranges from -1 to 1, with -1 being the most negative and 1 the most positive. TextBlob also exposes a `subjectivity` property, which ranges from 0 to 1, with 0 being objective and 1 being subjective.

In [None]:
df
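
Polarity is a continuous score, so to answer the positive-or-negative question directly we still need a cutoff. Here is a minimal sketch of one way to do that. The helper `label_polarity` and the 0.1 threshold are my own illustrative choices, not part of TextBlob, so adjust them to your data.

```python
def label_polarity(score, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    `threshold` is a hypothetical cutoff: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Made-up scores, just to show the three possible outcomes
for score in (0.8, 0.0, -0.3):
    print(score, label_polarity(score))
```

With the data frame above, `df['polarity'].apply(label_polarity)` would give one label per review.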

## Sentiment analysis of Bigram/Trigram
Next, we can explore some word associations. N-gram analyses are often used to see which words tend to show up together. I often like to investigate combinations of two or three words, i.e., bigrams and trigrams.
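
Concretely, an n-gram is just a window of n neighboring words. Here is a small plain-Python sketch that slides such a window over a list of tokens. The helper name `ngrams` and the whitespace tokenization are simplifications for illustration only.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # zip over n shifted copies of the token list: tokens, tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "this course has very hard assignments".split()

bigrams = ngrams(tokens, 2)   # 5 pairs, starting with ('this', 'course')
trigrams = ngrams(tokens, 3)  # 4 triples, ending with ('very', 'hard', 'assignments')

print(bigrams)
print(trigrams)
```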

"An n-gram is a contiguous sequence of n items from a given sample of text or speech."

In text analysis, it is often good practice to filter out stop words, which are very common words that carry little contextual meaning in a sentence (e.g., “a”, “the”, “and”, “but”, and so on). nltk provides a list of such stop words (it needs a one-time download with `nltk.download('stopwords')`). We can also add custom stop words to the list. For example, here we add the word “though”.

In [None]:
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + ['though']

Now we can remove the stop words and work with some bigrams/trigrams. The `CountVectorizer` class will “convert a collection of text documents to a matrix of token counts”. The `stop_words` parameter has a built-in option, “english”, but we can also pass our own list of stop words, as I am showing here. The `ngram_range` parameter sets the lower and upper bounds on n: `(2, 2)` keeps only bigrams, `(3, 3)` keeps only trigrams, and `(2, 3)`, as used here, keeps both. The other parameter worth mentioning is `lowercase`, which defaults to True and converts all characters to lowercase for us. With the following code, we can get all the bigrams/trigrams and sort them by frequency.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3))
# matrix of ngrams
ngrams = c_vec.fit_transform(df['reviews'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})
df_ngram
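
If you want to see what is happening under the hood, the sketch below walks through the same steps in plain Python: lowercase the text, split it into tokens of two or more characters (which mirrors `CountVectorizer`'s default token pattern), drop the stop words, and count the bigrams and trigrams. The `stop_words` set is a small hand-written stand-in for the nltk list, and `ngram_counts` is a hypothetical helper, so treat this as an approximation rather than a replacement.

```python
import re
from collections import Counter

corpus = [
    'Great course. Love the professor.',
    'Great content. Textbook was great',
    'This course has very hard assignments. Great content.',
    'Love the professor.',
    'Hard assignments though',
    'Hard to understand.',
]

# Assumption: a tiny stand-in for the nltk stop word list plus 'though'
stop_words = {'the', 'was', 'this', 'has', 'very', 'to', 'though'}

# Tokens are runs of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")


def ngram_counts(docs, stop_words, n_min=2, n_max=3):
    """Count n-grams (n_min..n_max) across all documents."""
    counts = Counter()
    for doc in docs:
        # Lowercase, tokenize, then drop stop words before forming n-grams
        tokens = [t for t in token_pattern.findall(doc.lower())
                  if t not in stop_words]
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts(corpus, stop_words)
for gram, freq in counts.most_common(5):
    print(freq, gram)
```

In this sketch, stop words are removed before the n-grams are formed, and n-grams can span sentence boundaries within a review, which is why “course love” shows up as a bigram.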

Similar to the sentiment analysis before, we can calculate the polarity and subjectivity for each bigram/trigram.

In [None]:
df_ngram['polarity'] = df_ngram['bigram/trigram'].apply(lambda x: TextBlob(x).polarity)
df_ngram['subjective'] = df_ngram['bigram/trigram'].apply(lambda x: TextBlob(x).subjectivity)
df_ngram

## Topic modeling
We can also do some topic modeling with text data. Two common approaches are NMF models and LDA models. We will show examples using both methods next.
### NMF models
Non-Negative Matrix Factorization (NMF) is a matrix decomposition method that approximates a non-negative matrix by the product of two matrices, W and H, whose elements are all non-negative. By default, it minimizes the distance between the original matrix and WH as measured by the Frobenius norm. Below is an example where we use NMF to produce 3 topics and show the top 3 bigrams/trigrams in each topic.
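
Before the scikit-learn example, here is a tiny plain-Python sketch to give a feel for that objective. It computes the Frobenius distance between a matrix and a product WH. The numbers are made up (think 3 documents, 2 topics, 3 terms), and `matmul` and `frobenius_distance` are helper names I introduce for illustration. NMF searches for non-negative W and H that make this distance small; the sketch only evaluates it.

```python
import math


def matmul(W, H):
    """Multiply two matrices given as lists of rows."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)]
            for row in W]


def frobenius_distance(X, Y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2
                         for row_x, row_y in zip(X, Y)
                         for x, y in zip(row_x, row_y)))


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # documents x topics
H = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]    # topics x terms

WH = matmul(W, H)

# A matrix that WH reproduces exactly, and one that it misses in two cells
X_exact = [row[:] for row in WH]
X_off = [row[:] for row in WH]
X_off[0][1] += 3.0
X_off[2][0] += 4.0

print(frobenius_distance(X_exact, WH))  # perfect reconstruction
print(frobenius_distance(X_off, WH))    # sqrt(3**2 + 4**2)
```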

In [None]:
#Source: https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
tfidf_vectorizer = TfidfVectorizer(stop_words=stoplist, ngram_range=(2,3))
nmf = NMF(n_components=3)
pipe = make_pipeline(tfidf_vectorizer, nmf)
pipe.fit(df['reviews'])
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += ", ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
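# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print_top_words(nmf, tfidf_vectorizer.get_feature_names_out(), n_top_words=3)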

Here is the result. It looks like topic 0 is about the professor and courses, topic 1 is about the assignments, and topic 2 is about the textbook. Note that we do not know the best number of topics here. We used 3 only because our sample is very small. In practice, you might need to do a grid search to find the optimal number of topics.
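
One detail in `print_top_words` that is easy to miss: the slice `topic.argsort()[:-n_top_words - 1:-1]` picks the indices of the `n_top_words` largest weights, largest first. Here is a plain-Python sketch of the same selection. The weights are made up, and `top_indices` is a hypothetical helper that sorts indices in ascending order of weight, as `argsort` does, before slicing.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Ascending order of weight, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items
    return order[:-n_top - 1:-1]


weights = [0.1, 0.7, 0.0, 0.4, 0.2]
print(top_indices(weights, 3))  # indices of 0.7, 0.4, 0.2
```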

### LDA models
"Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete dataset such as text corpora. It is also a topic model that is used for discovering abstract topics from a collection of documents."

Here in our example, we use the `LatentDirichletAllocation` class, which “implements the online variational Bayes algorithm and supports both online and batch update methods”. We leave the learning method at its default value, which is “batch” in current scikit-learn versions; pass `learning_method='online'` to use the online update instead.

In [None]:
# Source: https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
from sklearn.decomposition import LatentDirichletAllocation
tfidf_vectorizer = TfidfVectorizer(stop_words=stoplist, ngram_range=(2,3))
lda = LatentDirichletAllocation(n_components=3)
pipe = make_pipeline(tfidf_vectorizer, lda)
pipe.fit(df['reviews'])
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += ", ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
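# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print_top_words(lda, tfidf_vectorizer.get_feature_names_out(), n_top_words=3)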

Now you know how to do some basic text analysis in Python. Our example uses a very small dataset for demonstration purposes. Real-world text analysis will be a lot more challenging and fun. Hope you enjoyed this article. Thanks!

References:  
https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 
https://stackoverflow.com/questions/11763613/python-list-of-ngrams-with-frequencies/11834518 

By Sophia Yang on [October 19, 2020](https://towardsdatascience.com/text-analysis-basics-in-python-443282942ec5)