
# Topic Modeling on Quora Questions

Welcome to your Topic Modeling Assessment! For this project you will work with a dataset of over 400,000 Quora questions that have no labeled category, and attempt to find 20 categories to assign these questions to. The .csv file of these questions can be found in the Topic-Modeling folder.

In [3]:
import pandas as pd
import os
os.chdir(r'D:\Data Science Projects\NLP\Quora')

In [47]:
quora = pd.read_csv('quora_questions.csv')

In [48]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [49]:
quora.shape

(404289, 1)

In [50]:
quora['Question'][7]

'How can I be a good geologist?'

# Preprocessing


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [53]:
dtm = tfidf.fit_transform(quora['Question'])

In [54]:
#TF-IDF Dimensions
dtm.shape

(404289, 38669)
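
The vectorizer arguments used above decide which words make it into the vocabulary. As a rough illustration on a made-up four-question corpus (the `toy_questions` list and the variable names are assumptions for this sketch, not part of the dataset), `min_df=2` drops words that occur in only one document, `max_df=0.95` drops words that occur in more than 95% of the documents, and `stop_words='english'` removes common function words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only to illustrate the filtering.
toy_questions = [
    "best movies of 2016",
    "best books of 2016",
    "best laptop to buy",
    "make money online",
]

# Default settings keep every token of two or more characters.
default_vec = TfidfVectorizer().fit(toy_questions)

# Same settings as the notebook's vectorizer.
filtered_vec = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english').fit(toy_questions)

default_vocab = sorted(default_vec.vocabulary_)
filtered_vocab = sorted(filtered_vec.vocabulary_)

print(len(default_vocab), default_vocab)
print(len(filtered_vocab), filtered_vocab)
```

On this toy corpus only `2016` and `best` should survive the filtering, because every other non-stop word appears in a single question. On the real dataset the same mechanism is what reduces the vocabulary to the 38,669 columns reported above.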

# Non-negative Matrix Factorization


In [55]:
from sklearn.decomposition import NMF

In [56]:
# Create an NMF instance: model
# the 20 components will be the topics
nmf_model = NMF(n_components=20, random_state=42)

In [57]:
# Fit the model to TF-IDF
nmf_model.fit(dtm)



NMF(n_components=20, random_state=42)
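
NMF approximates the non-negative document-term matrix as the product of two smaller non-negative matrices: a document-topic matrix (returned by `transform` or `fit_transform`) and a topic-term matrix (stored in `components_`). A minimal sketch on random data, where the matrix sizes and the names `V`, `W`, `H` and `toy_nmf` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TF-IDF matrix: 30 documents, 12 terms, all non-negative.
V = rng.random((30, 12))

toy_nmf = NMF(n_components=3, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (30, 3)
H = toy_nmf.components_        # topic-term weights, shape (3, 12)

print(W.shape, H.shape)
print((W >= 0).all(), (H >= 0).all())
# Relative reconstruction error of the approximation V ~ W @ H
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

In the notebook the same roles are played by the `(404289, 20)` output of `transform` and the `(20, 38669)` `components_` matrix.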

In [58]:
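# Look up the vocabulary once instead of on every iteration.
# Newer scikit-learn releases provide get_feature_names_out(); older ones only have get_feature_names().
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    print(f"The top 15 words for TOPIC is: {index}")
    # argsort() is ascending, so the strongest word is printed last
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print()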

The top 15 words for TOPIC is: 0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']

The top 15 words for TOPIC is: 1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']

The top 15 words for TOPIC is: 2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']

The top 15 words for TOPIC is: 3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']

The top 15 words for TOPIC is: 4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']

The top 15 words for TOPIC is: 5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business'
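
Note that `argsort()` sorts in ascending order, so in the lists above the strongest word of each topic comes last (for example `best` in topic 0). A small sketch of the two orderings, using a made-up topic row and vocabulary (both hypothetical):

```python
import numpy as np

# Hypothetical topic row and vocabulary, for illustration only.
vocab = np.array(["book", "best", "phone", "movie", "buy"])
topic_row = np.array([0.64, 8.25, 0.30, 0.39, 0.34])

# Last three indices of an ascending sort: weakest of the top 3 first.
ascending = vocab[topic_row.argsort()[-3:]].tolist()
# Reverse the sort first to list the strongest word first.
descending = vocab[topic_row.argsort()[::-1][:3]].tolist()

print(ascending)
print(descending)
```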

#### Add a new column to the original quora DataFrame that assigns each question to one of the 20 topic categories.

In [59]:
# Transform the TF-IDF: nmf_features (document & topic matrix)
nmf_features = nmf_model.transform(dtm)  

In [60]:
#Features Dimensions
nmf_features.shape

(404289, 20)

In [61]:
#Components Dimensions (topic and word matrix)
nmf_model.components_.shape

(20, 38669)

In [62]:
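# Create a DataFrame: components_df (rows are topics, columns are vocabulary terms)
# get_feature_names_out() replaces get_feature_names() in newer scikit-learn releases
components_df = pd.DataFrame(nmf_model.components_, columns=tfidf.get_feature_names_out())
components_df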

Unnamed: 0,00,000,0000000000,000rs,001,0019,002,00am,01,012,...,竟然,親しみやすい,譬如朝露,还能靠什么支撑自己走下去,這是什麽,骂人,그런데,근데,심하잖아,하지만
0,0.0,0.056304,5.4e-05,0.002323,0.0,0.0,0.0,7e-06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.001239,0.0,3.5e-05,0.0,3.534633e-05,1.6e-05,0.0,1.8e-05,0.000959,0.0,...,0.0,0.0,0.000308,3e-06,1.1e-05,0.0,0.0,0.0,0.00365,0.0
2,0.0,0.0,0.0,0.0,1.786137e-05,1.5e-05,9e-06,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000125,0.0,0.0,0.0,0.0,0.0
3,0.000455,0.05402,0.0,0.0,0.0,0.0,0.0,0.0,0.000369,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002177,0.000237,...,0.0,0.0,0.000145,0.0,3.6e-05,0.0,0.0,0.0,0.0,0.0
5,0.000921,0.003782,0.0,5.6e-05,0.0,0.0,3e-06,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,2.349743e-06,0.0,4.5e-05,0.0,0.0,1.4e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.000796,0.0,0.0,0.0,5e-06,0.0,1e-06,0.0,1.7e-05,...,0.0,0.0,0.0,0.0,8.6e-05,0.0,0.0,0.0,0.0,0.0
8,0.000247,0.008207,0.0,0.0,1.296839e-05,0.0,0.0,9e-06,0.001442,0.0001,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.000378,0.002413,0.0,0.0,0.0,0.0,0.0,5e-06,0.000857,0.00023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
#Get the Words of the Highest Value for each Topic

for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1}, the top 10 words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 1, the top 10 words with the highest value are:
best      8.248757
movies    0.674862
book      0.640512
books     0.639019
2016      0.483959
ways      0.397412
movie     0.394834
laptop    0.369424
buy       0.339786
phone     0.304055
Name: 0, dtype: float64


For topic 2, the top 10 words with the highest value are:
does       8.591282
mean       2.921088
work       1.125444
feel       0.872357
long       0.632791
cost       0.347555
compare    0.320279
really     0.287291
exist      0.257838
use        0.220598
Name: 1, dtype: float64


For topic 3, the top 10 words with the highest value are:
quora          4.893323
questions      1.759927
question       1.351294
ask            0.984636
answer         0.646882
answers        0.624923
google         0.423363
asked          0.392752
delete         0.323093
improvement    0.292727
Name: 2, dtype: float64


For topic 4, the top 10 words with the highest value are:
money       4.049779
make        3.389326
online      2.2693

In [64]:
quora['Question'][55]

'How difficult is it get into RSI?'

In [65]:
# Topic weights from the feature matrix for the document at index 55
pd.DataFrame(nmf_features).loc[55]

0     0.000000
1     0.000000
2     0.000012
3     0.000049
4     0.000320
5     0.000098
6     0.000338
7     0.000000
8     0.000075
9     0.000000
10    0.000004
11    0.000000
12    0.000000
13    0.000104
14    0.000038
15    0.000077
16    0.000199
17    0.000000
18    0.000026
19    0.000261
Name: 55, dtype: float64

In [66]:
# Get the index of the dominant topic in one step
pd.DataFrame(nmf_features).loc[55].idxmax()

6

In [67]:
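# idxmax() defaults to axis=0, so this returns, for each topic, the index of the
# document with the highest weight for that topic (not the number of documents per topic)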
pd.DataFrame(nmf_features).idxmax()

0       5847
1      11337
2     112269
3       4060
4      20385
5      17314
6        779
7      75244
8     357256
9      20152
10      7875
11    326915
12      2246
13      1979
14      1806
15     53162
16       838
17     63372
18      2210
19     26624
dtype: int64
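
The values above are row indices, not counts: for instance, the document at index 5847 has the highest weight for topic 0. To count how many documents fall under each topic, one option is to take the dominant topic per row and tally the results. A minimal sketch on a made-up weight matrix (the `weights` DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 5 documents, 3 topics.
weights = pd.DataFrame(np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.2],
    [0.1, 0.6, 0.3],
    [0.3, 0.2, 0.5],
]))

# Default axis=0: for each topic (column), the row index of its strongest document.
strongest_doc = weights.idxmax()

# axis=1: for each document (row), its dominant topic; value_counts tallies them.
docs_per_topic = weights.idxmax(axis=1).value_counts().sort_index()

print(strongest_doc.tolist())
print(docs_per_topic.to_dict())
```

Applied to the notebook's data, the second form would be `pd.DataFrame(nmf_features).idxmax(axis=1).value_counts()`.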

In [71]:
# Get dominant topic for each document
quora['Topic'] = nmf_features.argmax(axis=1)

In [72]:
quora.head(10)

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
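
One caveat with `argmax`: when all weights in a row are tied, it returns the first index, so a question whose weights are all zero (which can happen if none of its words are in the TF-IDF vocabulary) would be labeled as topic 0. A possible way to flag such rows, sketched on made-up weights (the `weights` array and the `-1` marker are assumptions for illustration):

```python
import numpy as np

# Hypothetical document-topic weights; the second row has no topic weight at all.
weights = np.array([
    [0.0, 0.4, 0.1],
    [0.0, 0.0, 0.0],
    [0.2, 0.0, 0.7],
])

dominant = weights.argmax(axis=1)
# Mark rows with no topic weight at all as -1 instead of topic 0.
dominant_or_missing = np.where(weights.sum(axis=1) > 0, dominant, -1)

print(dominant.tolist())
print(dominant_or_missing.tolist())
```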


In [73]:
# Human-readable label for each topic index
topic_themes = {
    0: 'American/Car/Marriage/Story/Life in general',
    1: 'Education/Business/Money',
    2: 'American Medicare/Trump',
    3: 'State/Social/Rights',
    4: 'Build new life',
    5: 'Highly educated Indian engineers in America',
    6: 'Tips on improving work day efficiency',
    7: 'College/Service/Power',
    8: 'Company/Human/Invest',
    9: 'Bank account/Charge',
    10: 'Book/Indian/App/Technology',
    11: 'War/Future/Family/USA/Race/Political',
    12: 'Government/President/Society',
    13: 'Relationship/China/Parent/Japan',
    14: 'Application/Energy/Machine/Economic/Art/Europe',
    15: 'Earth/Marketing/Culture',
    16: 'Air/Rate/Sleep/Blood/Email',
    17: 'Student/Internet/Computer/Science/Research',
    18: 'University/Engineering/Language/Software',
    19: 'Job/Learn/Skill improvement',
}

# Map each dominant topic index to its label
quora['Topic_theme'] = quora['Topic'].map(topic_themes)
quora.head(15)

Unnamed: 0,Question,Topic,Topic_theme
0,What is the step by step guide to invest in sh...,5,Highly educated Indian engineers in America
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16,Air/Rate/Sleep/Blood/Email
2,How can I increase the speed of my internet co...,17,Student/Internet/Computer/Science/Research
3,Why am I mentally very lonely? How can I solve...,11,War/Future/Family/USA/Race/Political
4,"Which one dissolve in water quikly sugar, salt...",14,Application/Energy/Machine/Economic/Art/Europe
5,Astrology: I am a Capricorn Sun Cap moon and c...,1,Education/Business/Money
6,Should I buy tiago?,0,American/Car/Marriage/Story/Life in general
7,How can I be a good geologist?,10,Book/Indian/App/Technology
8,When do you use シ instead of し?,19,Job/Learn/Skill improvement
9,Motorola (company): Can I hack my Charter Moto...,17,Student/Internet/Computer/Science/Research


# How to Predict the Topic of a New Document

#### To assign a topic to a new, unseen document, first transform the document with the fitted TF-IDF vectorizer, then transform the resulting vector with the fitted NMF model.

In [87]:
my_news = """Why i am mentally lonely"""
 
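# Vectorize the new document with the already-fitted TF-IDF vectorizer
X = tfidf.transform([my_news])
# Project it onto the fitted NMF topics (note: this overwrites the earlier nmf_features matrix)
nmf_features = nmf_model.transform(X)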
 
pd.DataFrame(nmf_features)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.000109,0.0,8.668374e-07,3.7e-05,0.0,0.0,8.7e-05,0.0,0.000578,0.0,0.0,0.0,8e-06,0.0,1e-06,0.00028,9.7e-05,8.7e-05,0.000468


In [94]:
# if we want to get the index of the topic with the highest score:

topic = pd.DataFrame(nmf_features).idxmax(axis=1)
print(f'The given document belongs to Topic {topic[0]}.')

The given document belongs to Topic 9.
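
The two transform steps can be wrapped in a small helper so that new documents are labeled without overwriting any existing variables. This is only a sketch: `predict_topic` is a hypothetical helper name, and the toy corpus and the `toy_tfidf` / `toy_nmf` objects exist solely so that the example runs on its own. With the notebook's objects the call would be `predict_topic(my_news, tfidf, nmf_model)`.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def predict_topic(text, vectorizer, model):
    """Return the index of the highest-weighted topic for one document."""
    features = model.transform(vectorizer.transform([text]))
    return int(features.argmax(axis=1)[0])

# Hypothetical mini-corpus so the sketch is self-contained.
toy_questions = [
    "what are the best movies of 2016",
    "what are the best books to read",
    "how can I make money online",
    "how can I earn money from home",
    "what is the purpose of life",
    "what is the meaning of life",
]
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_dtm = toy_tfidf.fit_transform(toy_questions)
toy_nmf = NMF(n_components=3, random_state=42, max_iter=500).fit(toy_dtm)

predicted = predict_topic("how do I make money", toy_tfidf, toy_nmf)
print(predicted)
```

The helper only calls `transform`, never `fit`, so the vocabulary and topics learned from the training questions stay unchanged.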
