## Lighthouse Labs
### W08D4 NLP II
Instructor: Socorro Dominguez  
November 05, 2020

**Agenda:**
* Introduction to NLP modeling

* Sentiment analysis
    * Supervised learning sentiment analysis

* Topic modeling
    * LDA (Latent Dirichlet Allocation)

In [1]:
import os.path
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt

import gensim 
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
# Note: gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim 3.x
from gensim.models.wrappers import LdaMallet

import gensim.corpora as corpora
from gensim.corpora import Dictionary

from gensim import matutils, models
# Note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
import string

%matplotlib inline


from sklearn.model_selection import train_test_split



## Sentiment Analysis

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

## Using Supervised Learning Algorithms for Sentiment Analysis

Naive Bayes is popular in text classification tasks. 

You have used NB before. Today, we will use it for sentiment analysis, which is the problem of assigning a positive or negative label to a text based on the sentiment or attitude expressed in it. 

For this example, we will use the [IMDB movie review data set](https://www.kaggle.com/utathya/imdb-review-dataset). If you want to reproduce this example, you will need to download the data on your own.

### Loading data and preprocessing

1. We need to load the data CSV as a pandas DataFrame.

2. There are three possible labels in the dataset: `pos`, `neg`, and `unsup`. For now, let's discard rows with `unsup`.

In [2]:
imdb_df = pd.read_csv('data/imdb_master.csv', encoding = "ISO-8859-1")
imdb_df.head()

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [3]:
imdb_df['label'].value_counts()

unsup    50000
pos      25000
neg      25000
Name: label, dtype: int64

In [4]:
# only consider positive and negative reviews (drops the 'unsup' rows)
imdb_df = imdb_df[imdb_df['label'].isin(['pos', 'neg'])]

### Feature extraction

The current data is in the form of movie reviews (text paragraphs) and their targets (`pos` or `neg`). 
We need to encode movie reviews into feature vectors so that we can train supervised machine learning models with `scikit-learn`. 

How can we do this?



#### Create word frequency counts (`X_counts`)
Turn the text into a sparse vector of word frequency counts using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `scikit-learn`. 

When you reproduce this, explore the arguments of `CountVectorizer` (e.g., [`stop_words`](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words), `ngram_range`, `max_features`, `min_df`, and `tokenizer`).  

#### Create binarized representation of words (`X_binary`)
Create binarized encoding (`X_binary`) of `X_counts`, where you replace word frequencies $\geq$ 1 by 1.    
The intuition behind using binarized representation is that for sentiment analysis word occurrence may matter more than word frequency. For instance, the occurrence of the word _excellent_ tells us a lot and the fact that it occurs four times may not tell us much more. 
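
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch. It is not how `CountVectorizer` works internally; the two toy reviews and the helper names (`count_vector`, `binary_vector`) are made up for illustration.

```python
from collections import Counter

# Two toy reviews (hypothetical, for illustration only).
docs = [
    "excellent excellent excellent movie",
    "boring movie",
]

# Build a sorted vocabulary from whitespace-separated tokens.
vocab = sorted({token for doc in docs for token in doc.split()})

def count_vector(doc):
    """Return word frequency counts in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def binary_vector(doc):
    """Return 1 if the word occurs at least once, else 0."""
    return [1 if c >= 1 else 0 for c in count_vector(doc)]

X_counts_toy = [count_vector(doc) for doc in docs]
X_binary_toy = [binary_vector(doc) for doc in docs]

print(vocab)
print(X_counts_toy)
print(X_binary_toy)
```

In the binarized version, the repeated _excellent_ contributes the same feature value as a single occurrence would.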

In [5]:
# For tokenization
import nltk
# For converting words into frequency counts
from sklearn.feature_extraction.text import CountVectorizer



In [6]:
# Initialize the vectorizer; it is fit on all reviews below to produce a sparse matrix of word counts
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, stop_words='english')

# use top 5000 words only
# movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features = 5000) 
X_counts = movie_vec.fit_transform(imdb_df['review'])

# Convert raw frequency counts into binarized representation. 
X_binary = X_counts > 0

### Train Naive Bayes classifier

1. Split (`X_counts`, `imdb_df.label`) into train (80%) and test (20%).
2. Train [multinomial Naive Bayes algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) on the train set. 
3. Report train and test accuracies.
4. Now repeat steps 1, 2, and 3 with (`X_binary`, `imdb_df.label`). 
5. Compare your results for `X_counts` and `X_binary` and note your observations. 
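
Before handing the work to `scikit-learn`, it may help to see roughly what a multinomial Naive Bayes classifier computes. The sketch below is a from-scratch illustration on a tiny made-up training set, assuming Laplace smoothing with `alpha=1.0`; the function names are hypothetical and this is not `MultinomialNB`'s actual implementation.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set of (tokens, label) pairs.
train = [
    ("great excellent movie".split(), "pos"),
    ("excellent acting great plot".split(), "pos"),
    ("boring awful movie".split(), "neg"),
    ("awful plot boring acting".split(), "neg"),
]

def fit_multinomial_nb(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_multinomial_nb(train)
print(predict("great plot".split(), log_prior, log_likelihood))
print(predict("boring movie".split(), log_prior, log_likelihood))
```

Each word contributes independently to the score, which is the "naive" assumption and also the reason the model cannot capture context.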

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [8]:
def get_NB_train_test_accuracies(X, y, classifier = 'multinomial'):
    """
    Given X, y, and the classifier name ('multinomial' or 'bernoulli'), 
    this function splits the data into train and test splits, 
    prints the train and test accuracies, and returns the fitted model.     
    """
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y, 
                                                        test_size = 0.20, 
                                                        random_state = 12)
    if classifier.startswith('multinomial'):
        model = MultinomialNB().fit(X_train, y_train)
    elif classifier.startswith('bernoulli'):
        model = BernoulliNB().fit(X_train, y_train)
    else:
        # Fail early instead of hitting an undefined `model` below
        raise ValueError("classifier must be 'multinomial' or 'bernoulli'")
    print('Training accuracy:', model.score(X_train, y_train))
    print('Test accuracy: ', model.score(X_test, y_test))
    print('---------')
    return model

In [9]:
print('Evaluation on binarized encoding ')
model_binary = get_NB_train_test_accuracies(X_binary, imdb_df.label, classifier = 'bernoulli')

print('Evaluation on counts encoding ')
model_counts = get_NB_train_test_accuracies(X_counts, imdb_df.label)

Evaluation on binarized encoding 
Training accuracy: 0.90135
Test accuracy:  0.8567
---------
Evaluation on counts encoding 
Training accuracy: 0.89905
Test accuracy:  0.8558
---------



What do you observe?

### Let's play with fake reviews 

Let's see how the model performs on fake movie reviews. Some examples are given below.

In [10]:
fake_reviews = ['This movie was excellent! The performances were oscar-worthy!',
               'Unbelievably disappointing.', 
               'Full of zany characters and richly applied satire, and some great plot twists',
               'This is the greatest screwball comedy ever filmed',
               'It was pathetic. The worst part about it was the boxing scenes.', 
               '''It could have been a great movie. It could have been excellent, 
                and to all the people who have forgotten about the older, 
                greater movies before it, will think that as well. 
                It does have beautiful scenery, some of the best since Lord of the Rings. 
                The acting is well done, and I really liked the son of the leader of the Samurai.
                He was a likeable chap, and I hated to see him die...
                But, other than all that, this movie is nothing more than hidden rip-offs.
                '''
              ]
gold_labels = ['pos', 'neg', 'pos', 'pos', 'neg', 'neg']

In [11]:
# Create word count encoding of the reviews.  
fake_reviews_counts = movie_vec.transform(fake_reviews)
fake_reviews_binary = fake_reviews_counts > 0

In [12]:
# Predict using the Naive Bayes classifier
predictions = model_binary.predict(fake_reviews_binary)

In [13]:
print(predictions.tolist())

['pos', 'neg', 'pos', 'pos', 'neg', 'pos']


In [14]:
pd.set_option('display.max_colwidth', 0)
d = {'Review':fake_reviews, 'Gold labels':gold_labels, 'NB labels':predictions}
df = pd.DataFrame(d)
df

Unnamed: 0,Review,Gold labels,NB labels
0,This movie was excellent! The performances were oscar-worthy!,pos,pos
1,Unbelievably disappointing.,neg,neg
2,"Full of zany characters and richly applied satire, and some great plot twists",pos,pos
3,This is the greatest screwball comedy ever filmed,pos,pos
4,It was pathetic. The worst part about it was the boxing scenes.,neg,neg
5,"It could have been a great movie. It could have been excellent, \n and to all the people who have forgotten about the older, \n greater movies before it, will think that as well. \n It does have beautiful scenery, some of the best since Lord of the Rings. \n The acting is well done, and I really liked the son of the leader of the Samurai.\n He was a likeable chap, and I hated to see him die...\n But, other than all that, this movie is nothing more than hidden rip-offs.\n",neg,pos


1. Our model works well when there are clear words indicating whether the review is positive or negative, because the features we are using are word features.
2. It fails for more complex examples, where understanding the context and overall text is essential to correctly classify reviews. The last example has many positive words in the beginning, but the last sentence negates all the positivity in the previous text. We need to incorporate deeper linguistic knowledge to correctly classify such cases. 
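
One way to see this limitation: a bag-of-words encoding throws away word order, so two reviews with opposite meaning can end up with identical features. The two reviews below are made up for illustration.

```python
from collections import Counter

# Two hypothetical reviews with opposite sentiment but identical words.
review_a = "the plot was good not bad"
review_b = "the plot was bad not good"

bag_a = Counter(review_a.split())
bag_b = Counter(review_b.split())

# A bag-of-words model sees exactly the same features for both reviews,
# so it must assign them the same label.
print(bag_a == bag_b)
```

Options such as `ngram_range` in `CountVectorizer` recover a little local word order, but they do not capture the overall structure of a long review.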

### Sentiment Analysis with Vader

In [15]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/seiryu8808/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

The `polarity_scores()` method of VADER's `SentimentIntensityAnalyzer` takes in a string and returns a dictionary of scores in each of four categories:

* negative
* neutral
* positive
* compound (the sum of the word-level valence scores, normalized to lie between -1 and 1)
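
A minimal sketch of how a compound score can be squashed into the range [-1, 1]. It assumes the normalization used in the reference VADER implementation, $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$, where $x$ is the sum of the word-level valence scores. The valences below are made up, and the helper name `normalize_compound` is hypothetical.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Hypothetical per-word valences for a short negative sentence.
valences = [-2.5, -1.2]
compound = normalize_compound(sum(valences))
print(round(compound, 4))
```

Because of this squashing, adding more strongly negative words pushes the score toward -1 without ever going past it.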

In [17]:
a = 'The weather today is horrible. I dont feel like getting out'
sid.polarity_scores(a)

{'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.6818}

In [18]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

In [19]:
df['Vader_scores'] = df['Review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['Vader_scores'].apply(lambda score_dict: score_dict['compound'])

df['Vader labels'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df_labels = df[['Review', 'Gold labels', 'NB labels', 'Vader labels']]

df_labels.head()

Unnamed: 0,Review,Gold labels,NB labels,Vader labels
0,This movie was excellent! The performances were oscar-worthy!,pos,pos,pos
1,Unbelievably disappointing.,neg,neg,neg
2,"Full of zany characters and richly applied satire, and some great plot twists",pos,pos,pos
3,This is the greatest screwball comedy ever filmed,pos,pos,pos
4,It was pathetic. The worst part about it was the boxing scenes.,neg,neg,neg
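
The cell above labels every review with `compound >= 0` as `pos`, so a text with no sentiment-bearing words at all (compound of exactly 0) would also be called positive. A common convention, suggested in the VADER project's documentation, is to leave a neutral band around zero. The sketch below assumes a band of ±0.05; the helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

# Compound scores taken from the two examples above, plus a neutral 0.0.
for score in (-0.6818, -0.8074, 0.0):
    print(score, label_from_compound(score))
```

For a strictly two-class comparison against the gold labels, the single cutoff used in the cell above is still a reasonable choice.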


10 min Break

## Topic modeling 

- Suppose your company has a large collection of documents on a variety of topics

### Example: A corpus of food magazines 
<center>
<img src="images/00_TM_food_magazines.png" height="2000" width="2000"> 
</center>

### Example: A corpus of news articles 
<center>
<img src="images/01_TM_NYT_articles.png" height="2000" width="2000"> 
</center>

### Topic modeling 

- Suppose your company has a large collection of documents on a variety of topics
- Suppose they ask you to 
    - infer different topics in the documents
    - pull all documents about a certain topic    

### Topic modeling motivation

- Humans are pretty good at reading and understanding documents and answering questions such as 
    - What is it about?  
    - What is it related to in terms of content?     
- Labeling by hand? 
    - Probably not
- Use topic modeling which automates this process of inferring underlying structure in a large corpus of text documents

### Topic modeling: Input 

<center>
<img src="images/02_TM_science_articles.png" height="2000" width="2000"> 
</center>
(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output
<center>
<img src="files/images/TM_topics.png" height="900" width="900"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output with interpretation

- The labels are assigned manually.  
<center>
<img src="images/03_TM_topics_with_labels.png" height="800" width="800"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))


## Topic modeling pipeline 

- Feed knowledge into the machine; let it read a large amount of text
    * E.g., Wikipedia or News articles     
- Preprocess your corpus 
    - Be careful with the features (i.e., words)
- Train ML models
    - For now Latent Dirichlet Allocation (LDA)
- Interpret your topics     
- Evaluate
    - How well does your model do on unseen documents? 

### Bayesian approach: Latent Dirichlet Allocation (LDA)

- Developed by [David Blei](http://www.cs.columbia.edu/~blei/) and colleagues. 
    * One of the most cited papers in the last 15 years.
    
- Insight: 
    - Each document is a random mixture of corpus-wide topics
        - Every document is a discrete probability distribution of topics

    - Every topic is a mixture of words
        - Every topic is a discrete probability distribution of words 

### LDA: insight
- Each document is a random mixture of corpus-wide topics
- Every topic is a mixture of words
<center>
<img src="images/04_TM_dist_topics_words_blei.png" height="1000" width="1000"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Example: Every document is a discrete probability distribution of topics

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Document 1: 100% topic models
- Document 4: 100% fashion models
- Document 7: 60% topic models + 40% fashion model

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))

### Example: Every topic is a discrete probability distribution of words

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Topic 1: _model_ (0.33), _probabilistic_ (0.32), _topic_ (0.32), ...    
- Topic 2: _model_ (0.33), _famous_ (0.32), _fashion_ (0.32), ...    

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))
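
Putting the two distributions together: the probability of seeing a word in a document is the sum, over topics, of the topic's share in the document times the word's probability in that topic. A minimal sketch using the numbers from the example above (the dictionary layout and the helper name `word_probability` are just one hypothetical way to write it down):

```python
# Topic-word distributions from the example (remaining mass omitted).
topic_word = {
    "topic_model":   {"model": 0.33, "probabilistic": 0.32, "topic": 0.32},
    "fashion_model": {"model": 0.33, "famous": 0.32, "fashion": 0.32},
}

# Document 7 is 60% topic models and 40% fashion models.
doc7_topics = {"topic_model": 0.6, "fashion_model": 0.4}

def word_probability(word, doc_topics, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topics.items()
    )

for word in ("model", "probabilistic", "famous"):
    print(word, round(word_probability(word, doc7_topics, topic_word), 3))
```

The word _model_ is equally likely under both topics, so its probability in Document 7 does not depend on the topic mixture, while _probabilistic_ and _famous_ are weighted by the 60% and 40% shares.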

### LDA model

- Observable features: words
- All other parameters are hidden or latent

<center>
<img src="images/05_TM_topic_model_blei.png" height="700" width="700"> 
</center>

(Adapted from [David Blei's paper](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf))

### LDA: Hyperparameters

- $\alpha$ 
   - High alpha &rarr; every document contains a mixture of most of the topics
   - Low alpha &rarr; every document is representative of only a few topics
- $\beta$
    - High beta &rarr; Every topic contains a mixture of most of the words
    - Low beta &rarr; Every topic contains a mixture of only a few words

<center>
<img src="images/05_TM_topic_model_blei.png" height="600" width="600"> 
</center>

(Adapted from [David Blei's paper](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf))
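
The effect of $\alpha$ can be checked numerically by sampling document-topic mixtures from a symmetric Dirichlet distribution. The sketch below uses only the standard library (a Dirichlet sample is a set of normalized Gamma draws); the values `0.1` and `10.0`, the number of topics, and the helper names are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n_docs=2000, seed=0):
    """Average weight of the dominant topic across sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs

low = mean_largest_share(alpha=0.1)
high = mean_largest_share(alpha=10.0)
print(round(low, 2), round(high, 2))
```

With a low alpha the dominant topic takes most of the weight in a typical document; with a high alpha the weight is spread more evenly across topics. The same reasoning applies to $\beta$ and the words within a topic.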

### LDA learning: goals

Infer the underlying topic structure in the documents. In particular, 
- Learn the probability distribution of topics in each document
- Learn the discrete probability distribution of words in each topic

### LDA learning: intuition

Intuition: A word in a document is likely to belong to the same topic as the other words in that document. 

<blockquote>
Document 1: probabilistic (Topic 1) topic (Topic 1) model (Topic ?) <br>
Document 1: probabilistic (Topic 1) topic (Topic 1) model (Topic 1)
</blockquote>

<blockquote>
Document 8: famous (Topic 2) fashion (Topic 2) model (Topic ?)
    <br>
Document 8: famous (Topic 2) fashion (Topic 2) model (Topic 2)    
</blockquote>


### LDA algorithm 

- Choose the number of topics you think there are in your corpus
    * Example: k = 2

### LDA algorithm

- Randomly assign each word in each document to one of the topics
    * Example: The word _probabilistic_ is randomly assigned to topic 2 (fashion).
- Repeat the following step until the topics make sense:     
- Go through every word and its topic assignment in each document, looking at
    * How often does the topic occur in the document?
    * How often does the word occur with the topic overall? 
    * Example: It seems that topic 2 does not occur in Document 1 and the word _probabilistic_ doesn't occur much in topic 2 (fashion). So the word _probabilistic_ should probably be reassigned to topic 1. 
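
The procedure above can be sketched as a collapsed Gibbs sampler, which is one classical way to fit LDA. This is an illustration only: `gensim`'s `LdaModel` uses online variational Bayes rather than this sampler, and the function name, the `alpha`/`beta` values, and the iteration count are assumptions.

```python
import random

def gibbs_lda(docs, num_topics=2, alpha=0.1, beta=0.1, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables: topics per document, words per topic, totals per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Step 1: randomly assign every word occurrence to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each assignment given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word

toy_docs = [["famous", "fashion", "model"]] * 3 + [["probabilistic", "topic", "model"]] * 3
assignments, doc_topic, topic_word = gibbs_lda(toy_docs)
for doc, counts in zip(toy_docs, doc_topic):
    print(doc, counts)
```

The two factors in `weights` correspond directly to the two questions in the list: how often the topic occurs in the document, and how often the word occurs with the topic overall.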


### LDA topics in Science journal

<center>
<img src="images/03_TM_topics_with_labels.png" height="800" width="800"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Training LDA with [Gensim](https://radimrehurek.com/gensim/models/ldamodel.html)

You need

- Document-term matrix 
- Pick number of topics: `num_topics`
- Pick number of passes: `passes`



* *Note: You can also check out scikit-learn's LDA model. However, Gensim is more widely used in NLP.*

In [20]:
toy_df = pd.read_csv('data/toy_lda_data.csv')
toy_df

Unnamed: 0,doc_id,text
0,1,famous fashion model
1,2,famous fashion model
2,3,famous fashion model
3,4,famous fashion model
4,5,famous fashion model
5,6,famous fashion model
6,7,famous fashion model
7,8,famous fashion model
8,9,famous fashion model
9,10,famous fashion model


In [21]:
corpus = [doc.split() for doc in toy_df['text'].tolist()]
corpus

[['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model']]

In [22]:
# Create a vocabulary for the lda model and 
# convert our corpus into document-term matrix for Lda
dictionary = corpora.Dictionary(corpus)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)]]
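
Each document in the output above is a list of `(token_id, count)` pairs. The following pure-Python sketch mimics that bag-of-words format on two toy documents. It is not `gensim`'s implementation; in particular, the way ids are assigned here (sorted tokens within each document, in order of first appearance) is an assumption chosen so that the ids match the output above.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to every distinct token."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for known tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [
    ["famous", "fashion", "model"],
    ["probabilistic", "topic", "model"],
]
token2id = build_token2id(toy_corpus)
print(token2id)
print([to_bow(doc, token2id) for doc in toy_corpus])
```

Note that _model_ gets the same id in both documents, which is how the shared word shows up in both topics later.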

In [23]:
lda = models.LdaModel(corpus=doc_term_matrix, 
                      id2word=dictionary, 
                      num_topics=2, 
                      passes=10)

In [24]:
lda.print_topics()

[(0,
  '0.326*"model" + 0.320*"fashion" + 0.320*"famous" + 0.017*"probabilistic" + 0.016*"topic"'),
 (1,
  '0.325*"topic" + 0.325*"probabilistic" + 0.324*"model" + 0.013*"famous" + 0.013*"fashion"')]

In [25]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
vis


### Tips when you build an LDA model on a large corpus 

- Preprocessing is crucial!! 
    - Tokenize, remove punctuation, convert text to lower case
    - Discard words with length < threshold or word frequency < threshold        
    - Stoplist: Remove the most commonly used words in English 
    - Lemmatization: Consider the root form of the word. 
    - Restrict to specific parts of speech
        * Only consider nouns, verbs, and adjectives
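
A few of these steps can be sketched with the standard library alone. The stoplist below is a tiny made-up sample, and the function name `simple_preprocess_text` is hypothetical. Lemmatization and part-of-speech filtering are left out because they need an NLP library.

```python
import re

# A tiny illustrative stoplist; real stoplists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "and", "to", "in", "it"}

def simple_preprocess_text(text, min_token_len=3):
    """Lowercase, tokenize, and drop punctuation, stopwords, and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_token_len]

print(simple_preprocess_text("The Shuttle is in orbit, and it's FAST!"))
```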

## Complex Topic modeling with LDA

Let's explore the topics in `scikit-learn`'s [20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) using [`gensim`'s `ldamodel`](https://radimrehurek.com/gensim/models/ldamodel.html). 

Usually, topic modeling is used for discovering the abstract "topics" that occur in a collection of documents when you do not know the actual topics present in the documents. But since the 20 newsgroups text dataset is labeled with categories (e.g., sports, hardware, religion), you will be able to cross-check the topics discovered by your model with the actual topics. 

In [None]:
### Load data

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

In [27]:
data = {'text':[], 'target_name':[], 'target':[]}
data['text'] = newsgroups_train.data
data['target_name'] = [newsgroups_train.target_names[target] for target in newsgroups_train.target]
data['target'] = [target for target in newsgroups_train.target]
df = pd.DataFrame(data)
df.head()

Unnamed: 0,text,target_name,target
0,"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",rec.autos,7
1,"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n",comp.sys.mac.hardware,4
2,"From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdue University Engineering Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... 
:( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering\n---------------------------------------------------------------------------\n""Convictions are more dangerous enemies of truth than lies."" - F. W.\nNietzsche\n",comp.sys.mac.hardware,4
3,"From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert J.C. Kyanko (rob@rjck.UUCP) wrote:\n> abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu>:\n> > Anyone know about the Weitek P9000 graphics chip?\n> As far as the low-level stuff goes, it looks pretty nice. It's got this\n> quadrilateral fill command that requires just the four points.\n\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n""The only thing that really scares me is a person with no sense of humor.""\n\t\t\t\t\t\t-- Jonathan Winters\n",comp.graphics,1
4,"From: jcm@head-cfa.harvard.edu (Jonathan McDowell)\nSubject: Re: Shuttle Launch Question\nOrganization: Smithsonian Astrophysical Observatory, Cambridge, MA, USA\nDistribution: sci\nLines: 23\n\nFrom article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n>>In article <C5JLwx.4H9.1@cs.cmu.edu>, ETRAT@ttacs1.ttu.edu (Pack Rat) writes...\n>>>""Clear caution & warning memory. Verify no unexpected\n>>>errors. ..."". I am wondering what an ""expected error"" might\n>>>be. Sorry if this is a really dumb question, but\n> \n> Parity errors in memory or previously known conditions that were waivered.\n> ""Yes that is an error, but we already knew about it""\n> I'd be curious as to what the real meaning of the quote is.\n> \n> tom\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.\n\n - Jonathan\n\n\n",sci.space,14


### Preprocessing

We want our topic model to identify interesting and important patterns instead of noise. For that we need to "normalize" our text. Preprocessing is a crucial step in topic modeling and it markedly affects the results. For this example, let's use a popular Python NLP library called [spaCy](https://spacy.io/). It is fast, easy to use, and built for industrial-strength applications. 

spaCy can do many other things, but we'll be using it only for preprocessing. With this library, you can run the NLP pipeline by simply calling the `nlp` object on your text. You can then access information about each token in a `for` loop. 

```
doc = nlp(text)
for token in doc:
    print(token.pos_)
    print(token.lemma_)
```

In [29]:
import spacy
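# Load the small English model for spaCy
# (the "en" shortcut was removed in spaCy 3; install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")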

In [30]:
def preprocess(text, 
               min_token_len = 2 , 
               irrelevant_pos = ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE']): 
    """
    Given text, min_token_len, and irrelevant_pos carry out preprocessing of the text 
    and return a preprocessed string. 
    
    Keyword arguments:
    text -- (str) the text to be preprocessed
    min_token_len -- (int) tokens must be longer than this many characters
    irrelevant_pos -- (list) a list of irrelevant pos tags
    
    Returns: (str) the preprocessed text
    """
    # Remove Emails
    text = re.sub(r'\S*@\S*\s?', '', text)
    
    text = re.sub(r'\s+', ' ', text)
    
    # Remove distracting characters
    text = re.sub(r'''[\*\~]+''', "", text)

    doc = nlp(text)
    clean_text = []
    
    for token in doc:
        if (not token.is_stop # Keep only non-stopwords
            and token.is_alpha # Keep only tokens made of alphabetic characters
            and len(token) > min_token_len # Keep only tokens longer than min_token_len
            and token.pos_ not in irrelevant_pos): # Drop tokens with an irrelevant POS tag
            lemma = token.lemma_ # Take the lemma of the word
            clean_text.append(lemma)  
    return " ".join(clean_text)

In [31]:
df['clean_text'] = df['text'].apply(preprocess)
#df.to_csv('data/clean_text.csv', index=False)


### Build dictionary and document-term co-occurrence matrix

We need two things to build `gensim`'s `LdaModel`: a dictionary and a document-term co-occurrence matrix. Let's

1. Create a dictionary using `gensim`'s [`corpora.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html) method. Optionally, you can exclude extremes using the [`filter_extremes`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes) method of your `corpora.Dictionary` object. 
2. Create the document-term co-occurrence matrix using `corpora.Dictionary`'s `doc2bow` method. 
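
To see what excluding extremes means, here is a rough pure-Python sketch of filtering a vocabulary by document frequency. It only illustrates the idea behind the `no_below` and `no_above` arguments and leaves out `keep_n`. The function name `filter_vocabulary` and the toy documents are hypothetical, and `gensim`'s exact handling of boundary cases may differ.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=2, no_above=0.5):
    """Keep tokens that appear in at least `no_below` documents and in
    at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(tokenized_docs)
    return {
        token for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

toy_docs = [
    ["game", "team", "win"],
    ["game", "team", "season"],
    ["game", "key", "chip"],
    ["game", "key", "typo"],
]
print(sorted(filter_vocabulary(toy_docs)))
```

Very rare tokens (often typos or names) and very common tokens (which appear in almost every document) carry little information about topics, which is why both ends are removed.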

In [38]:
preprocessed_corpus = [doc.split() for doc in df['clean_text'].tolist()]
# Create a vocabulary for the lda model and 
# convert our corpus into document-term matrix for Lda
dictionary = corpora.Dictionary(preprocessed_corpus)

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in preprocessed_corpus]


### Build a topic model

Let's actually build a topic model.  

1. Build an LDA model using `gensim`'s [`models.LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html) with `num_topics` = 10. Note: If you get many warnings when you build your model, update your gensim installation.  See [here](https://github.com/RaRe-Technologies/gensim/pull/2296).
2. Print the LDA topics with the `model.print_topics()` method, where `model` is your LDA model. 
3. Experiment with a few choices of `num_topics` hyperparameter. 
4. Experiment with a few choices of `passes` hyperparameter. 
5. Settle on the hyperparameters, where the topics make sense to you. 

In [33]:
lda = models.LdaModel(corpus=doc_term_matrix, 
                      id2word=dictionary, 
                      num_topics=10, 
                      passes=10)

In [34]:
lda.print_topics()

[(0,
  '0.023*"game" + 0.019*"team" + 0.015*"play" + 0.012*"player" + 0.011*"win" + 0.008*"season" + 0.006*"score" + 0.006*"hockey" + 0.006*"NHL" + 0.006*"fan"'),
 (1,
  '0.029*"God" + 0.013*"Jesus" + 0.007*"Bible" + 0.006*"life" + 0.006*"Christ" + 0.006*"faith" + 0.006*"Christians" + 0.005*"book" + 0.005*"belief" + 0.005*"word"'),
 (2,
  '0.010*"car" + 0.007*"buy" + 0.007*"gun" + 0.006*"power" + 0.005*"drive" + 0.005*"little" + 0.004*"USA" + 0.004*"wire" + 0.004*"high" + 0.004*"ground"'),
 (3,
  '0.035*"key" + 0.014*"chip" + 0.012*"encryption" + 0.008*"science" + 0.007*"Clipper" + 0.007*"algorithm" + 0.006*"government" + 0.006*"escrow" + 0.006*"secret" + 0.005*"bit"'),
 (4,
  '0.014*"information" + 0.013*"list" + 0.011*"send" + 0.009*"available" + 0.008*"address" + 0.008*"group" + 0.008*"file" + 0.007*"public" + 0.007*"internet" + 0.007*"email"'),
 (5,
  '0.010*"law" + 0.007*"state" + 0.006*"government" + 0.005*"man" + 0.005*"child" + 0.005*"fact" + 0.004*"opinion" + 0.004*"person" + 

### Visualization and interpretation

Once we have settled on the number of topics and passes, let's visualize the topics and interpret them. In particular,  

1. Visualize the topics using [pyLDAvis](https://github.com/bmabey/pyLDAvis), which is a Python library for interactive topic model visualization. Note: Use `sort_topics=False`. Otherwise the topic ids in the previous exercise won't match with the topics here.
2. Using the words in each topic and their corresponding weights, manually assign a label (e.g., sports, politics, religion) to each topic.
3. Create a dictionary whose keys are the topic ids and whose values are your manually assigned topic labels. 

In [35]:
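# LDA training is stochastic: unless `random_state` is fixed, topic ids can
# change between runs, so re-check these labels after retraining.
topic_labels = {0: 'Automobiles',
                1: 'Hardware, graphic cards',
                2: 'Politics and guns',
                3: 'Medicine',
                4: 'Windows, graphic cards, technology',
                5: 'Sports',
                6: 'Religion',
                7: 'Windows, technology',
                8: 'Space',
                9: 'Security, cryptography'}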

In [36]:
# Visualize the topics
# (in pyLDAvis >= 3.3 this module was renamed to `pyLDAvis.gensim_models`)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
vis


### Test on unseen documents 

In this dataset, we already know the label for each article. In this exercise, you will examine whether your LDA model is able to assign reasonable topics to unseen documents. 

In [37]:

# Collect the raw test posts with their numeric and named newsgroup labels
data = {
    'text': newsgroups_test.data,
    'target': list(newsgroups_test.target),
    'target_name': [newsgroups_test.target_names[target] for target in newsgroups_test.target],
}
test_df = pd.DataFrame(data)
test_df.head()

Unnamed: 0,text,target,target_name
0,From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)\nSubject: Need info on 88-89 Bonneville\nOrganization: University at Buffalo\nLines: 10\nNews-Software: VAX/VMS VNEWS 1.41\nNntp-Posting-Host: ubvmsd.cc.buffalo.edu\n\n\n I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.\n\n\t\t\tNeil Gandler\n,7,rec.autos
1,"From: Rick Miller <rick@ee.uwm.edu>\nSubject: X-Face?\nOrganization: Just me.\nLines: 17\nDistribution: world\nNNTP-Posting-Host: 129.89.2.33\nSummary: Go ahead... swamp me. <EEP!>\n\nI'm not familiar at all with the format of these ""X-Face:"" thingies, but\nafter seeing them in some folks' headers, I've *got* to *see* them (and\nmaybe make one of my own)!\n\nI've got ""dpg-view"" on my Linux box (which displays ""uncompressed X-Faces"")\nand I've managed to compile [un]compface too... but now that I'm *looking*\nfor them, I can't seem to find any X-Face:'s in anyones news headers! :-(\n\nCould you, would you, please send me your ""X-Face:"" header?\n\nI *know* I'll probably get a little swamped, but I can handle it.\n\n\t...I hope.\n\nRick Miller <rick@ee.uwm.edu> | <ricxjo@discus.mil.wi.us> Ricxjo Muelisto\nSend a postcard, get one back! | Enposxtigu bildkarton kaj vi ricevos alion!\n RICK MILLER // 16203 WOODS // MUSKEGO, WIS. 53150 // USA\n",5,comp.windows.x
2,"From: mathew <mathew@mantis.co.uk>\nSubject: Re: STRONG & weak Atheism\nOrganization: Mantis Consultants, Cambridge. UK.\nX-Newsreader: rusnews v1.02\nLines: 9\n\nacooper@mac.cc.macalstr.edu (Turin Turambar, ME Department of Utter Misery) writes:\n> Did that FAQ ever got modified to re-define strong atheists as not those who\n> assert the nonexistence of God, but as those who assert that they BELIEVE in \n> the nonexistence of God?\n\nIn a word, yes.\n\n\nmathew\n",0,alt.atheism
3,"From: bakken@cs.arizona.edu (Dave Bakken)\nSubject: Re: Saudi clergy condemns debut of human rights group!\nKeywords: international, non-usa government, government, civil rights, \tsocial issues, politics\nOrganization: U of Arizona CS Dept, Tucson\nLines: 101\n\nIn article <benali.737307554@alcor> benali@alcor.concordia.ca ( ILYESS B. BDIRA ) writes:\n>It looks like Ben Baz's mind and heart are also blind, not only his eyes.\n>I used to respect him, today I lost the minimal amount of respect that\n>I struggled to keep for him.\n>To All Muslim netters: This is the same guy who gave a ""Fatwah"" that\n>Saudi Arabia can be used by the United Ststes to attack Iraq . \n\nThey were attacking the Iraqis to drive them out of Kuwait,\na country whose citizens have close blood and business ties\nto Saudi citizens. And me thinks if the US had not helped out\nthe Iraqis would have swallowed Saudi Arabia, too (or at \nleast the eastern oilfields). And no Muslim country was doing\nmuch of anything to help liberate Kuwait and protect Saudi\nArabia; indeed, in some masses of citizens were demonstrating\nin favor of that butcher Saddam (who killed lotsa Muslims),\njust because he was killing, raping, and looting relatively\nrich Muslims and also thumbing his nose at the West.\n\nSo how would have *you* defended Saudi Arabia and rolled\nback the Iraqi invasion, were you in charge of Saudi Arabia???\n\n>Fatwah is as legitimate as this one. With that kind of ""Clergy"", it might\n>be an Islamic duty to separate religion and politics, if religion\n>means ""official Clergy"".\n\nI think that it is a very good idea to not have governments have an\nofficial religion (de facto or de jure), because with human nature\nlike it is, the ambitious and not the pious will always be the\nones who rise to power. 
There are just too many people in this\nworld (or any country) for the citizens to really know if a \nleader is really devout or if he is just a slick operator.\n\n>\n> \tCAIRO, Egypt (UPI) -- The Cairo-based Arab Organization for Human\n> Rights (AOHR) Thursday welcomed the establishement last week of the\n> Committee for Defense of Legal Rights in Saudi Arabia and said it was\n> necessary to have such groups operating in all Arab countries.\n\nYou make it sound like these guys are angels, Ilyess. (In your\nclarinet posting you edited out some stuff; was it the following???)\nFriday's New York Times reported that this group definitely is\nmore conservative than even Sheikh Baz and his followers (who\nthink that the House of Saud does not rule the country conservatively\nenough). The NYT reported that, besides complaining that the\ngovernment was not conservative enough, they have:\n\n\t- asserted that the (approx. 500,000) Shiites in the Kingdom\n\t are apostates, a charge that under Saudi (and Islamic) law\n\t brings the death penalty. \n\n\t Diplomatic guy (Sheikh bin Jibrin), isn't he Ilyess?\n\n\t- called for severe punishment of the 40 or so women who\n\t drove in public a while back to protest the ban on\n\t women driving. The guy from the group who said this,\n\t Abdelhamoud al-Toweijri, said that these women should\n\t be fired from their jobs, jailed, and branded as\n\t prostitutes.\n\n\t Is this what you want to see happen, Ilyess? I've\n\t heard many Muslims say that the ban on women driving\n\t has no basis in the Qur'an, the ahadith, etc.\n\t Yet these folks not only like the ban, they want\n\t these women falsely called prostitutes? 
\n\n\t If I were you, I'd choose my heroes wisely,\n\t Ilyess, not just reflexively rally behind\n\t anyone who hates anyone you hate.\n\n\t- say that women should not be allowed to work.\n\n\t- say that TV and radio are too immoral in the Kingdom.\n\nNow, the House of Saud is neither my least nor my most favorite government\non earth; I think they restrict religious and political reedom a lot, among\nother things. I just think that the most likely replacements\nfor them are going to be a lot worse for the citizens of the country.\nBut I think the House of Saud is feeling the heat lately. In the\nlast six months or so I've read there have been stepped up harassing\nby the muttawain (religious police---*not* government) of Western women\nnot fully veiled (something stupid for women to do, IMO, because it\nsends the wrong signals about your morality). And I've read that\nthey've cracked down on the few, home-based expartiate religious\ngatherings, and even posted rewards in (government-owned) newspapers\noffering money for anyone who turns in a group of expartiates who\ndare worship in their homes or any other secret place. So the\ngovernment has grown even more intolerant to try to take some of\nthe wind out of the sails of the more-conservative opposition.\nAs unislamic as some of these things are, they're just a small\ntaste of what would happen if these guys overthrow the House of\nSaud, like they're trying to in the long run.\n\nIs this really what you (and Rached and others in the general\nwest-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)\nwant, Ilyess?\n\n--\nDave Bakken\n==>""the President is doing a fine job, but the problem is we don't know what\n to do with her husband."" James Carville (Clinton campaign strategist),2/93\n==>""Oh, please call Daddy. Mom's far too busy."" Chelsea to nurse, CSPAN, 2/93\n",17,talk.politics.mideast
4,"From: livesey@solntze.wpd.sgi.com (Jon Livesey)\nSubject: Re: After 2000 years, can we say that Christian Morality is\nOrganization: sgi\nLines: 22\nDistribution: world\nNNTP-Posting-Host: solntze.wpd.sgi.com\n\nIn article <1993Apr21.141259.12012@st-andrews.ac.uk>, nrp@st-andrews.ac.uk (Norman R. Paterson) writes:\n|> In article <1r2m21$8mo@fido.asd.sgi.com> livesey@solntze.wpd.sgi.com (Jon Livesey) writes:\n|> >In article <1993Apr19.151902.21216@st-andrews.ac.uk>, nrp@st-andrews.ac.uk (Norman R. Paterson) writes:\n> >Just as well, then, that I'm not claiming that my own moral system is\n> >absolute.\n> >\n> >jon.\n> >\n> >[list of references stretching from here to Alpha Centauri deleted.]\n>\n> Jon-\n>\n> [and I thought to impress with my references!]\n>\n> Ok, so you don't claim to have an absolute moral system. Do you claim\n> to have an objective one? I'll assume your answer is ""yes,"" apologies\n> if not.\n\nI've just spent two solid months arguing that no such thing as an\nobjective moral system exists.\n\njon.\n",19,talk.religion.misc
