

# ___Text Analytics in Python___

#### Introduction

Text is one of the most important data sources in many fields. In this notebook, we will use a toy dataset to go through a common text analytics procedure in Python.

## Let's take a simple example 

In [1]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm to check cancer status.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

,text
0,Monday: The doctor's appointment is at 2:45pm ...
1,Tuesday: The dentist's appointment is at 11:30...
2,"Wednesday: At 7:00pm, there is a basketball game!"
3,Thursday: Be back home by 11:15 pm at the latest.
4,"Friday: Take the train at 08:10 am, arrive at ..."


In [2]:
# find the number of characters for each string in df['text']
df['text'].str.len()

0    69
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [3]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0    11
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [4]:
# find which entries contain the word 'cancer'
df['text'].str.contains('cancer')

0     True
1    False
2    False
3    False
4    False
Name: text, dtype: bool

In [5]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [6]:
# find all occurrences of the digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [7]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [8]:
# replace weekdays with '???' (regex=True is required in recent pandas versions)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

0    ???: The doctor's appointment is at 2:45pm to ...
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [9]:
# replace weekdays with 3-letter abbreviations (a callable replacement needs regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

0    Mon: The doctor's appointment is at 2:45pm to ...
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [10]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

,0,1
0,2,45
1,11,30
2,7,00
3,11,15
4,08,10


In [11]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

,match,0,1,2,3
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,00,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,08,10,am
4,1,09:00am,09,00,am


In [12]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

,match,time,hour,minute,period
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,00,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,08,10,am
4,1,09:00am,09,00,am


## Let's talk more on real data

In [13]:
import pandas as pd

In [14]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [15]:
import os
os.chdir('/content/gdrive/My Drive/Teaching-task/python-DS')
!pwd

/content/gdrive/My Drive/Teaching-task/python-DS


In [16]:
df = pd.read_csv('./sample-data/ag_news.csv')
df.head()

,class_index,title,description,class_name
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall...,Business
1,4,The Race is On: Second Private Team Sets Launc...,"SPACE.com - TORONTO, Canada -- A second\team o...",Sci/Tech
2,4,Ky. Company Wins Grant to Study Peptides (AP),AP - A company founded by a chemistry research...,Sci/Tech
3,4,Prediction Unit Helps Forecast Wildfires (AP),AP - It's barely dawn when Mike Fitzpatrick st...,Sci/Tech
4,4,Calif. Aims to Limit Farm-Related Smog (AP),AP - Southern California's smog-fighting agenc...,Sci/Tech


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7600 entries, 0 to 7599
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class_index  7600 non-null   int64 
 1   title        7600 non-null   object
 2   description  7600 non-null   object
 3   class_name   7600 non-null   object
dtypes: int64(1), object(3)
memory usage: 237.6+ KB


In [18]:
set(df['class_name'].values)

{'Business', 'Sci/Tech', 'Sports', 'World'}

There are four columns: 
- class index and class names that annotate the content type
- title and description of each news piece

Let's combine the title and description into a column called `content` and drop the unnecessary columns.

In [19]:
df['content'] = df['title'] + '. ' + df['description']
df.drop(columns=['title', 'description'], inplace=True)
df.head(5)

,class_index,class_name,content
0,3,Business,Fears for T N pension after talks. Unions repr...
1,4,Sci/Tech,The Race is On: Second Private Team Sets Launc...
2,4,Sci/Tech,Ky. Company Wins Grant to Study Peptides (AP)....
3,4,Sci/Tech,Prediction Unit Helps Forecast Wildfires (AP)....
4,4,Sci/Tech,Calif. Aims to Limit Farm-Related Smog (AP). A...


---

#### Document Representation

While humans can read text directly, machines cannot work with raw text as it is. A classic way to represent text mathematically is the [vector space model](https://en.wikipedia.org/wiki/Vector_space_model). Specifically, each document is stored as a vector in which each element is a ___weight___ for one term. A simple and commonly used approach is to tokenize documents and use term frequencies or [term frequency-inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) as weights.

##### Bag-of-Words (BOW)

While more sophisticated approaches are available, the BOW representation is still very popular due to its simplicity. In this representation, we ignore word order and treat each document as ___a bag of words___. While this is a very strong assumption, it still makes sense: we can often understand a sentence even if its words are randomly ordered. For example, we can easily understand the following sentence:

> sitting a chair is cat there on

In Python, we can use a `list` or a `numpy.ndarray` to store this information:

In [20]:
sentence = ['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']
sentence

['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']
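
As a minimal sketch of why word order does not matter in this representation, the standard-library `collections.Counter` can play the role of the bag. The second word ordering and the variable names below are made up for illustration.

```python
from collections import Counter

# Two orderings of the same words (the second ordering is a made-up illustration)
shuffled = ['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']
ordered = ['there', 'is', 'a', 'cat', 'sitting', 'on', 'chair']

# A bag of words keeps each word and its count, but not its position
bag_shuffled = Counter(shuffled)
bag_ordered = Counter(ordered)

print(bag_shuffled == bag_ordered)  # True: both orderings give the same bag
print(bag_shuffled['cat'])          # 1
```

Because only the counts are kept, any two documents with the same words in a different order become indistinguishable, which is exactly the simplifying assumption of the bag-of-words model.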

###### Tokenize

Let's now apply this to our toy dataset. In English text (and in many other languages), we can simply split each document on spaces. In addition, we usually want to remove punctuation because it rarely provides valuable information. Last but not least, we may want to exclude some very common words (e.g., `the`, `we`, `you`, `is`), called [___stop words___](https://en.wikipedia.org/wiki/Stop_words).

In [21]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


If there are no special requirements, we can simply use a common stop word list from `nltk`.

In [22]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [23]:
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
eng_stopwords[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

Now we can start to convert the contents into bags of words! To do this, we can use [`nltk`'s tokenize package](http://www.nltk.org/api/nltk.tokenize.html).

In [24]:
import nltk

In [25]:
df.head(8)

,class_index,class_name,content
0,3,Business,Fears for T N pension after talks. Unions repr...
1,4,Sci/Tech,The Race is On: Second Private Team Sets Launc...
2,4,Sci/Tech,Ky. Company Wins Grant to Study Peptides (AP)....
3,4,Sci/Tech,Prediction Unit Helps Forecast Wildfires (AP)....
4,4,Sci/Tech,Calif. Aims to Limit Farm-Related Smog (AP). A...
5,4,Sci/Tech,Open Letter Against British Copyright Indoctri...
6,4,Sci/Tech,"Loosing the War on Terrorism. \\""Sven Jaschan,..."
7,4,Sci/Tech,"FOAFKey: FOAF, PGP, Key Distribution, and Bloo..."


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7600 entries, 0 to 7599
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class_index  7600 non-null   int64 
 1   class_name   7600 non-null   object
 2   content      7600 non-null   object
dtypes: int64(1), object(2)
memory usage: 178.2+ KB


In [27]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [28]:
# also convert them to lower case
bow = [nltk.word_tokenize(content.lower()) for content in df['content'].values]
# show the first document
print('; '.join(bow[0]))

fears; for; t; n; pension; after; talks; .; unions; representing; workers; at; turner; newall; say; they; are; 'disappointed; '; after; talks; with; stricken; parent; firm; federal; mogul; .


In [29]:
len(bow)

7600

In [30]:
bow[0]

['fears',
 'for',
 't',
 'n',
 'pension',
 'after',
 'talks',
 '.',
 'unions',
 'representing',
 'workers',
 'at',
 'turner',
 'newall',
 'say',
 'they',
 'are',
 "'disappointed",
 "'",
 'after',
 'talks',
 'with',
 'stricken',
 'parent',
 'firm',
 'federal',
 'mogul',
 '.']

In [31]:
bow[1]

['the',
 'race',
 'is',
 'on',
 ':',
 'second',
 'private',
 'team',
 'sets',
 'launch',
 'date',
 'for',
 'human',
 'spaceflight',
 '(',
 'space.com',
 ')',
 '.',
 'space.com',
 '-',
 'toronto',
 ',',
 'canada',
 '--',
 'a',
 'second\\team',
 'of',
 'rocketeers',
 'competing',
 'for',
 'the',
 '#',
 '36',
 ';',
 '10',
 'million',
 'ansari',
 'x',
 'prize',
 ',',
 'a',
 'contest',
 'for\\privately',
 'funded',
 'suborbital',
 'space',
 'flight',
 ',',
 'has',
 'officially',
 'announced',
 'the',
 'first\\launch',
 'date',
 'for',
 'its',
 'manned',
 'rocket',
 '.']

Let's remove the stop words and punctuation. Note that some tokens are very short (for example, single letters such as `t` and `n`). I typically remove them as well if they carry no special meaning; here we will drop tokens shorter than `min_length` characters. I will also remove pure numbers with `str.isdigit`. Note that `str.isdigit` only recognizes strings made up entirely of digits, so it misses tokens such as decimals or negative numbers, but for simplicity we will just go with it. See more discussion on this issue [here](https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float?page=1&tab=votes#tab-top).
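
To make the limitation of `str.isdigit` concrete, here is a small sketch that compares it with a `float`-based check. The token list and the helper `is_number` are hypothetical and not part of the pipeline used in this notebook.

```python
# Tokens that look like numbers to a human reader (hypothetical examples)
candidates = ['36', '10', '3.14', '-5', '1,000', 'x']

def is_number(token):
    """Return True if the token parses as a float once thousands separators are dropped.

    Caveat: float() also accepts words such as 'nan' and 'inf', so this check
    is an illustration rather than a complete solution.
    """
    try:
        float(token.replace(',', ''))
        return True
    except ValueError:
        return False

# Print each token with the result of both checks
for token in candidates:
    print(token, token.isdigit(), is_number(token))
```

Only `'36'` and `'10'` pass `str.isdigit`, while the `float`-based check also accepts `'3.14'`, `'-5'`, and `'1,000'`.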

In [32]:
min_length = 3 # define the customized minimum length

In [33]:
bow = [[word for word in content if word not in punctuation and word not in eng_stopwords and not word.isdigit()] for content in bow]
print('; '.join(bow[0]))

fears; n; pension; talks; unions; representing; workers; turner; newall; say; 'disappointed; talks; stricken; parent; firm; federal; mogul


In [34]:
print('; '.join(bow[1]))

race; second; private; team; sets; launch; date; human; spaceflight; space.com; space.com; toronto; canada; --; second\team; rocketeers; competing; million; ansari; x; prize; contest; for\privately; funded; suborbital; space; flight; officially; announced; first\launch; date; manned; rocket


As we can see, sometimes further cleaning is needed to remove punctuation inside words. In this example, we want to remove the quote in the token `'disappointed`. For this, we can use [`str.translate`](https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate):

In [35]:
# map no characters to other characters; only delete all punctuation characters
trans = str.maketrans('', '', punctuation)
bow = [[w.translate(trans).strip() for w in d] for d in bow]
bow = [[w for w in d if len(w) >= min_length] for d in bow]
print('; '.join(bow[0]))

fears; pension; talks; unions; representing; workers; turner; newall; say; disappointed; talks; stricken; parent; firm; federal; mogul


In [36]:
for i in range(1, 5):
    print('; '.join(bow[i]))
    print('---------------'*10)

race; second; private; team; sets; launch; date; human; spaceflight; spacecom; spacecom; toronto; canada; secondteam; rocketeers; competing; million; ansari; prize; contest; forprivately; funded; suborbital; space; flight; officially; announced; firstlaunch; date; manned; rocket
------------------------------------------------------------------------------------------------------------------------------------------------------
company; wins; grant; study; peptides; company; founded; chemistry; researcher; university; louisville; grant; develop; method; producing; better; peptides; short; chains; amino; acids; building; blocks; proteins
------------------------------------------------------------------------------------------------------------------------------------------------------
prediction; unit; helps; forecast; wildfires; barely; dawn; mike; fitzpatrick; starts; shift; blur; colorful; maps; figures; endless; charts; already; knows; day; bring; lightning; strike; places; expects;

###### Stemming/Lemmatization

Finally, it is worth noting that one word can take different forms. For example, the verb run can appear as ___run___, ___runs___, ___ran___, and ___running___. While the forms differ, they share the same core meaning. There are two common methods to reduce a word to its root form.

The first one is called ___stemming___, where each word is reduced to its "stem". For example, the stem of the word ___fly___ is ___fli___:

In [37]:
from nltk import PorterStemmer
stemmer = PorterStemmer()
for w in ['fly', 'papers', 'communication', 'community']:
    print(w, ': ', stemmer.stem(w))

fly :  fli
papers :  paper
communication :  commun
community :  commun


While this seems okay, a stem is not always a real word, and it is hard to ___reverse___ the stemming result (e.g., both "communication" and "community" are reduced to "commun", although they mean very different things). A second choice is ___lemmatization___, which reduces a word to its [___lemma___](https://en.wikipedia.org/wiki/Lemma_(morphology)), that is, its canonical or dictionary form.

In [40]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [43]:
from nltk import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for w in ['flies', 'papers', 'communication', 'communities']:
    print(w, ': ', lemmatizer.lemmatize(w))

flies :  fly
papers :  paper
communication :  communication
communities :  community


For simplicity, let's just use stemming.

In [44]:
bow = [[stemmer.stem(w) for w in d] for d in bow]

In [45]:
for i in range(1, 5):
    print('; '.join(bow[i]))
    print('---------------'*10)

race; second; privat; team; set; launch; date; human; spaceflight; spacecom; spacecom; toronto; canada; secondteam; rocket; compet; million; ansari; prize; contest; forpriv; fund; suborbit; space; flight; offici; announc; firstlaunch; date; man; rocket
------------------------------------------------------------------------------------------------------------------------------------------------------
compani; win; grant; studi; peptid; compani; found; chemistri; research; univers; louisvil; grant; develop; method; produc; better; peptid; short; chain; amino; acid; build; block; protein
------------------------------------------------------------------------------------------------------------------------------------------------------
predict; unit; help; forecast; wildfir; bare; dawn; mike; fitzpatrick; start; shift; blur; color; map; figur; endless; chart; alreadi; know; day; bring; lightn; strike; place; expect; wind; pick; moist; place; dri; flame; roar
-----------------------------

##### Vector Space Model (VSM)

Now that we have bags of words, we can create vectors from them. Instead of working with the tokens themselves, it is easier and more efficient to use integer indices: each distinct token, such as `race`, is assigned a unique integer ID, and documents store these IDs instead of strings.
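
The idea of mapping tokens to integers can be sketched in plain Python before turning to a library. The two tiny documents and the names `token2id` and `corpus_sketch` are hypothetical, and the order in which IDs are assigned here (first appearance, starting at 0) is an assumption of this sketch that need not match what a library does.

```python
from collections import Counter

# Two tiny hypothetical documents, already tokenized
docs = [['race', 'team', 'race'], ['team', 'win']]

# Assign each new token the next free integer ID, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Represent each document as sorted (token_id, count) pairs
corpus_sketch = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

print(token2id)       # {'race': 0, 'team': 1, 'win': 2}
print(corpus_sketch)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document is now a short list of (ID, count) pairs, so tokens that do not occur in a document take up no space at all.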

For this task, a good choice is [`gensim`](https://radimrehurek.com/gensim/index.html), a library with well-written and convenient APIs, especially for [topic modeling](https://en.wikipedia.org/wiki/Topic_model) and [word2vec](https://rare-technologies.com/word2vec-tutorial/) algorithms:

In [46]:
import gensim
dictionary = gensim.corpora.Dictionary(bow)
print(dictionary)

Dictionary(17957 unique tokens: ['disappoint', 'fear', 'feder', 'firm', 'mogul']...)


In [47]:
gensim.__version__

'3.6.0'

In [48]:
dictionary.token2id

{'disappoint': 0,
 'fear': 1,
 'feder': 2,
 'firm': 3,
 'mogul': 4,
 'newal': 5,
 'parent': 6,
 'pension': 7,
 'repres': 8,
 'say': 9,
 'stricken': 10,
 'talk': 11,
 'turner': 12,
 'union': 13,
 'worker': 14,
 'announc': 15,
 'ansari': 16,
 'canada': 17,
 'compet': 18,
 'contest': 19,
 'date': 20,
 'firstlaunch': 21,
 'flight': 22,
 'forpriv': 23,
 'fund': 24,
 'human': 25,
 'launch': 26,
 'man': 27,
 'million': 28,
 'offici': 29,
 'privat': 30,
 'prize': 31,
 'race': 32,
 'rocket': 33,
 'second': 34,
 'secondteam': 35,
 'set': 36,
 'space': 37,
 'spacecom': 38,
 'spaceflight': 39,
 'suborbit': 40,
 'team': 41,
 'toronto': 42,
 'acid': 43,
 'amino': 44,
 'better': 45,
 'block': 46,
 'build': 47,
 'chain': 48,
 'chemistri': 49,
 'compani': 50,
 'develop': 51,
 'found': 52,
 'grant': 53,
 'louisvil': 54,
 'method': 55,
 'peptid': 56,
 'produc': 57,
 'protein': 58,
 'research': 59,
 'short': 60,
 'studi': 61,
 'univers': 62,
 'win': 63,
 'alreadi': 64,
 'bare': 65,
 'blur': 66,
 'bring': 

We can look up the IDs of individual tokens:

In [49]:
dictionary.token2id['disappoint'], dictionary.token2id['california']

(0, 101)

Once we have a dictionary that maps tokens to integers (and vice versa), we can transform our bags of words. Each document becomes a list of tuples, each containing a token index and the frequency of that token in the document.

In [50]:
corpus = [dictionary.doc2bow(d) for d in bow]
corpus[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 2),
 (12, 1),
 (13, 1),
 (14, 1)]

###### TFIDF

Sometimes term frequencies can be misleading. For the same reason that we remove stop words, words that occur frequently across the whole corpus may not be informative. On the other hand, a word that shows up often in only a small portion of the documents can provide valuable information about the contents of those texts.

One way to address this is term frequency-inverse document frequency (TFIDF), which is the product of TF and IDF. A common weighting scheme is $\mathrm{TFIDF}(t, d, D) = \mathrm{freq}_{t,d} \times \log_2 \dfrac{N_D}{N_t}$, where $\mathrm{freq}_{t,d}$ is the number of times term $t$ occurs in document $d$, $N_D$ is the number of documents in the corpus $D$, and $N_t$ is the number of documents that contain the term $t$. See [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency) for more details.
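
As a worked example of the weighting scheme above, the following sketch computes the raw formula by hand. The three tiny documents and the helper `tfidf` are hypothetical, and the helper assumes the term occurs in at least one document.

```python
import math

# Three tiny hypothetical documents, already tokenized
docs = [['oil', 'price', 'oil'], ['oil', 'game'], ['game', 'team']]

n_docs = len(docs)  # N_D

def tfidf(term, doc):
    """freq_{t,d} * log2(N_D / N_t) for one term in one document."""
    freq = doc.count(term)                   # freq_{t,d}
    n_t = sum(1 for d in docs if term in d)  # N_t, assumed to be at least 1
    return freq * math.log2(n_docs / n_t)

print(tfidf('oil', docs[0]))    # 2 * log2(3/2)
print(tfidf('price', docs[0]))  # 1 * log2(3/1)
```

Although `oil` occurs twice in the first document and `price` only once, `price` receives the larger weight because it appears in fewer documents. Note that `gensim`'s `TfidfModel` additionally normalizes each document vector to unit length by default, so its weights will not equal the raw values of this formula.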

We can try this on our data with [`gensim.models.tfidfmodel`](https://radimrehurek.com/gensim/models/tfidfmodel.html):

In [51]:
from gensim.models import TfidfModel
model = TfidfModel(corpus)

Apply the model to one document:

In [52]:
model[corpus[0]]

[(0, 0.23464966537701473),
 (1, 0.21340476186532015),
 (2, 0.16098999782417597),
 (3, 0.17617388045540608),
 (4, 0.3380922768382177),
 (5, 0.35558360355081436),
 (6, 0.25022339107653035),
 (7, 0.2521409920896115),
 (8, 0.22971412643239364),
 (9, 0.1150431833134092),
 (10, 0.38548522406671576),
 (11, 0.31781564552907887),
 (12, 0.3081906563223163),
 (13, 0.17230810874356683),
 (14, 0.18769468048518562)]

---

#### Topic Modeling

##### Latent Dirichlet Allocation (LDA)

Topic modeling is a very commonly used family of dimensionality reduction techniques. It assumes that each document is a mixture of topics, where each topic is a mixture of terms. One of the most successful algorithms is [___latent Dirichlet allocation___](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA), introduced in:
> Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

LDA is a generative model: it reverse engineers the process by which documents are assumed to be generated. It can be represented as a probabilistic graphical model:
![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)

The generative process can be described as follows:
- For each topic $k$, sample a multinomial distribution $\phi_k$ over words from the Dirichlet prior with parameter $\beta$
- For each document $m$, sample a multinomial distribution $\theta_m$ over topics from the Dirichlet prior with parameter $\alpha$
    - For each word $n$ in $m$:
        - Sample a topic $z_{m,n}$ from the corresponding topic distribution parameterized by $\theta_m$
        - Sample a word $w_{m,n}$ from the corresponding topic $z_{m,n}$'s word distribution parameterized by $\phi_{z_{m,n}}$
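
The generative process above can be sketched with the standard library alone. This is only an illustration of the sampling steps, not an inference algorithm; the vocabulary, the prior values, and the helper names are hypothetical, and the Dirichlet draws are obtained by normalizing independent Gamma draws.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_dirichlet(concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

vocab = ['game', 'team', 'oil', 'share', 'linux', 'iraq']  # hypothetical vocabulary
num_topics, alpha, beta = 2, 0.5, 0.5                        # hypothetical settings

# For each topic k: phi_k ~ Dirichlet(beta), a distribution over words
phi = [sample_dirichlet(beta, len(vocab)) for _ in range(num_topics)]

def generate_document(num_words):
    # For the document m: theta_m ~ Dirichlet(alpha), a distribution over topics
    theta = sample_dirichlet(alpha, num_topics)
    words = []
    for _ in range(num_words):
        z = random.choices(range(num_topics), weights=theta)[0]  # topic z_{m,n}
        w = random.choices(vocab, weights=phi[z])[0]             # word w_{m,n}
        words.append(w)
    return theta, words

theta, words = generate_document(8)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document topic mixtures that most plausibly produced them.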

##### Parameters in LDA

Generally, we need to control two hyperparameters of an LDA model:
- Topic-word Dirichlet prior $\beta$
- Document-topic Dirichlet prior $\alpha$

The selection of these parameters is application dependent. Heuristically, people often choose $\alpha=\dfrac{50}{K}$ and $\beta=0.01$, as described in

> Griffiths, T. L., and Steyvers, M. 2004. “Finding Scientific Topics,” Proceedings of the National Academy of Sciences (101:Supplement 1), National Academy of Sciences, pp. 5228–5235.

It is also possible to infer these two hyperparameters given the data.

The selection of $K$ totally depends on the context. It is also possible to select a topic number based on quantitative measures of topic modeling quality, but this is beyond the scope of this tutorial.

For our toy sample set, we will just select $K=4$ because there are 4 labels: 

In [53]:
df.class_name.unique()

array(['Business', 'Sci/Tech', 'Sports', 'World'], dtype=object)

##### Run LDA!

Thanks to the convenient APIs by `gensim`, we can easily run [LDA in Python](https://radimrehurek.com/gensim/models/ldamodel.html):

In [54]:
from gensim.models import LdaModel
lda = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10, 
               minimum_probability=0)

##### Analysis on LDA results

Let's take a look at the output of LDA. First, we can check whether the topics make sense:

In [55]:
for _, topic_str in lda.show_topics():
    print(topic_str)
    print('------------'*10)

0.008*"win" + 0.007*"game" + 0.006*"first" + 0.006*"team" + 0.005*"season" + 0.005*"last" + 0.004*"two" + 0.004*"final" + 0.004*"one" + 0.004*"night"
------------------------------------------------------------------------------------------------------------------------
0.010*"said" + 0.009*"reuter" + 0.007*"kill" + 0.007*"oil" + 0.006*"presid" + 0.006*"iraq" + 0.005*"govern" + 0.005*"afp" + 0.005*"offici" + 0.005*"say"
------------------------------------------------------------------------------------------------------------------------
0.012*"new" + 0.007*"microsoft" + 0.006*"search" + 0.005*"googl" + 0.005*"quot" + 0.004*"announc" + 0.004*"intel" + 0.004*"linux" + 0.004*"secur" + 0.003*"softwar"
------------------------------------------------------------------------------------------------------------------------
0.010*"said" + 0.010*"compani" + 0.008*"new" + 0.007*"inc" + 0.006*"reuter" + 0.006*"year" + 0.005*"million" + 0.005*"share" + 0.005*"servic" + 0.005*"sale"
-------------

While the topics are not perfect, they are reasonable. In the order printed above, we can interpret them as sports, world, sci/tech, and business.

For each document, we can check their topic distributions:

In [56]:
i = 1000
lda.get_document_topics(corpus[i])

[(0, 0.012327542), (1, 0.16224293), (2, 0.012031073), (3, 0.8133985)]

We can see that topic 3, which we interpreted as the "business" topic, dominates this document with a probability of about 0.81. We can check whether this makes sense:

In [57]:
df.loc[i]

class_index                                                    3
class_name                                              Business
content        Albertsons #39; 2Q Profit Falls 36 Percent. Pe...
Name: 1000, dtype: object
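
To pick the dominant topic programmatically rather than by eye, one minimal sketch is to take the maximum over the (topic, probability) pairs. The list below is copied from the `get_document_topics` output shown earlier, and `doc_topics` is an illustrative name.

```python
# Topic distribution copied from the get_document_topics output shown earlier
doc_topics = [(0, 0.012327542), (1, 0.16224293), (2, 0.012031073), (3, 0.8133985)]

# Pick the (topic_id, probability) pair with the largest probability
dominant_topic, probability = max(doc_topics, key=lambda pair: pair[1])

print(dominant_topic, probability)  # 3 0.8133985
```

Applying the same step to every document gives one topic label per document, which can then be compared against `class_name`.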

In fact, LDA can be used in many situations, such as information retrieval, document clustering and labeling, and even image analysis! Here we only cover the simplest use case.

---

#### Conclusion

In this tutorial, we went through a simple procedure, from preprocessing raw text to modeling topics in the resulting bag-of-words corpus. Many terms were introduced, such as ___stop words___, ___bag of words___, and ___stemming___. However, these are only a small part of text analytics, and there is a lot more to explore. Below I list some materials on text analytics in Python that may be useful:

- [Gensim tutorial](https://radimrehurek.com/gensim/tutorial.html)
- [NLTK book](http://www.nltk.org/book/)
- [Coursera text mining course](https://www.coursera.org/learn/python-text-mining)
- [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html) (a package, not a tutorial)