## NLP Topic Modeling Exercise

[Guide](https://blog.mlreview.com/topic-modeling-with-scikit-learn-e80d33668730)

In [1]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

* create a variable called `'no_features'` and set its value to 100.

In [3]:
no_features = 100

* create a variable `'no_topics'` and set its value to 100

In [4]:
no_topics = 100

## NMF

* instantiate a TfidfVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [5]:
tfidf_vector = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use fit_transform method of TfidfVectorizer to transform the documents

In [6]:
tfidf = tfidf_vector.fit_transform(documents)

* get the features names from TfidfVectorizer

In [7]:
tfidf_feature_names = tfidf_vector.get_feature_names()

* instantiate NMF and fit transformed data

In [8]:
nmf = NMF().fit(tfidf)



## LDA w/ Sklearn

* instantiate a CountVectorizer with following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [9]:
tf_vector = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use fit_transform method of CountVectorizer to transform documents

In [10]:
tf = tf_vector.fit_transform(documents)

* get the features names from TfidfVectorizer

In [11]:
tf_feature_names = tf_vector.get_feature_names()

* instantiate LatentDirichletAllocation and fit transformed data 

In [12]:
lda = LatentDirichletAllocation().fit(tf)

* create a function `display_topics` that is able to display the top words in a topic for different models

In [13]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f'Topic {topic_idx}:')
        print(' '.join([feature_names[i] for i in topic.argsort()[:no_top_words -1:1]]))

* display top 10 words from each topic from NMF model

In [14]:
no_top_words = 10
display_topics(nmf, tfidf_feature_names, no_top_words)

Topic 0:
00 run right really read question program problem probably
Topic 1:
00 right really read question program problem probably power
Topic 2:
00 run right really read question program problem probably
Topic 3:
00 run right really read question program problem probably
Topic 4:
00 really read question program problem probably power point
Topic 5:
00 run right really read question program problem probably
Topic 6:
00 right really read question program problem probably power
Topic 7:
00 really read question program problem probably power point
Topic 8:
00 run right really read question program problem probably
Topic 9:
00 run right really read question program problem probably
Topic 10:
00 right really read question program problem probably power
Topic 11:
00 right really read question program problem probably power
Topic 12:
00 really read question program problem probably power point
Topic 13:
00 really read question program problem probably power point
Topic 14:
00 run right reall

* display top 10 words from each topic from LDA model

In [16]:
display_topics(lda, tfidf_feature_names, no_top_words)

Topic 0:
ax g9v a86 b8f file government drive windows data
Topic 1:
ax b8f g9v a86 00 windows jesus space file
Topic 2:
g9v b8f a86 ax jesus god law government said
Topic 3:
b8f ax g9v a86 jesus god mr 00 said
Topic 4:
b8f ax g9v a86 god jesus drive space power
Topic 5:
jesus course law com probably drive god does edu
Topic 6:
ax a86 g9v b8f jesus god space file software
Topic 7:
g9v b8f a86 ax jesus god drive mr windows
Topic 8:
ax g9v b8f a86 max 00 file data drive
Topic 9:
a86 g9v ax b8f jesus god file data windows


### Stretch: Use LDA w/ Gensim to do the same thing.

[Guide](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)