
# NLP - Topic Modeling Assignment



Renad Alahmadi

## Dataset 
- For this assignment you will be working with a dataset of over 400,000 Quora questions that have no labeled categories.

## Main Objective 
- You are attempting to find 20 categories to assign to the questions in the CSV file.


#### Task: Import pandas and read in the quora_questions.csv file.

In [1]:
# Importing packages 

import pandas as pd
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
# Loading dataset 
dataset = pd.read_excel('/content/1664163408__Quora Questions.xlsx')

In [3]:
# Exploring the dataset 
dataset.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [4]:
# Dataset dimension
dataset.shape

(404289, 1)

In [5]:
# To print the first Question
dataset['Question'][0]

'What is the step by step guide to invest in share market in india?'

In [7]:
# Cast every question to str, since some entries appear to be numbers or other non-string values
dataset['Question'] = dataset['Question'].apply(str)



# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. 
Note: You may want to explore the max_df and min_df parameters.

In [8]:
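# Note: important step to ignore uninformative words
# Ignore words that occur in more than 95% of the questions (too common) or in fewer than 2 questions (too rare)
# CountVectorizer produces raw counts (not TF-IDF), which is the input LDA expects
# With lowercase=False, capitalized stop words such as 'What' and 'How' are not removed
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english', lowercase=False)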
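
To see what `max_df` and `min_df` do, here is a minimal sketch on a made-up four-question corpus. The `docs` list and the 0.75 threshold are assumptions chosen so the effect is visible on tiny data; they are not values from this assignment. Terms that appear in more than 75% of the documents, or in fewer than 2 documents, are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "How do I learn Python",
    "How do I learn cooking",
    "How do I invest money",
    "How do I save money quickly",
]

# No pruning: every token of two or more characters is kept
full = CountVectorizer(lowercase=False).fit(docs)

# Pruned: drop terms found in more than 75% of documents or in fewer than 2 documents
pruned = CountVectorizer(max_df=0.75, min_df=2, lowercase=False).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)

print(full_vocab)
print(pruned_vocab)  # only 'learn' and 'money' survive both thresholds
```

"How" and "do" occur in every document, so `max_df` removes them. Words that occur only once, such as "Python", are removed by `min_df`.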

In [9]:
# Create the document-term matrix (rows: questions, columns: vocabulary words)
dtm = cv.fit_transform(dataset['Question'])

In [10]:
# Note: 
    # number of questions: 404289
    # number of words: 46943 --> occur in at least 2 questions and in at most 95% of them
dtm

<404289x46943 sparse matrix of type '<class 'numpy.int64'>'
	with 2421187 stored elements in Compressed Sparse Row format>
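
The numbers in this output also show how sparse the matrix is. The short calculation below uses only the three figures printed above; the variable names are illustrative.

```python
# Figures taken from the sparse matrix summary above
n_docs, n_terms, nnz = 404289, 46943, 2421187

# Fraction of cells that hold a non-zero count
density = nnz / (n_docs * n_terms)

# Average number of distinct vocabulary words kept per question
avg_terms_per_question = nnz / n_docs

print(f"density: {density:.6f}")
print(f"average distinct terms per question: {avg_terms_per_question:.2f}")
```

On average each question keeps roughly six distinct vocabulary words after stop-word removal and pruning, so almost all cells are zero. This is why the matrix is stored in a sparse format.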

# LDA - Latent Dirichlet Allocation

#### TASK: Using Scikit-Learn create an instance of LDA with 20 expected components. 
Note: Use random_state = 42

In [11]:
# Creating an LDA model with 20 types of questions (topics)
# number of types --> n_components
# important: keep random_state fixed so the results are reproducible
LDA = LatentDirichletAllocation(n_components=20,random_state=42)

In [12]:
# Note: This can take a while, we're dealing with a large amount of documents!

LDA.fit(dtm)

LatentDirichletAllocation(n_components=20, random_state=42)

In [13]:
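# This is the number of distinct words in the vocabulary
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(cv.get_feature_names_out())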



46943

In [17]:
# Print out 10 random words from our vocabulary

feature_names = cv.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])

lappymaster
Shayari
mongering
Galil
witchcraft
hull
viscose
Cerberus
lone
ennikkumâ


In [18]:
# Verifying that LDA produced 20 topics (types of questions)

len(LDA.components_)

20

In [19]:
LDA.components_.shape

(20, 46943)
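
The shape `(20, 46943)` means one row per topic and one column per vocabulary word. A minimal sketch on a tiny made-up corpus shows the same layout at a size that is easy to inspect. The `toy_docs` list and the choice of 2 topics are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (hypothetical)
toy_docs = [
    "invest money in shares",
    "earn money online",
    "learn python programming",
    "learn programming online",
]
toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_dtm)

# components_: one row per topic, one column per vocabulary word
n_topics, n_words = toy_lda.components_.shape

# transform(): one row per document, one column per topic; each row sums to 1
doc_topic = toy_lda.transform(toy_dtm)
row_sums = doc_topic.sum(axis=1)

print(n_topics, n_words)
print(doc_topic.shape)
print(np.round(row_sums, 6))
```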

## Showing Stored Words

In [24]:
# Let's check the first type of question (topic 0)
single_Ques = LDA.components_[0]
# Note: argsort() returns indices ordered from the lowest value to the highest value
single_Ques.argsort()
# Note: this returns the index positions, NOT the words
# The last 10 entries are the indices of the 10 highest-weighted words
single_Ques.argsort()[-10:]

array([30432, 29577, 38355, 27097, 19482, 30001, 33609, 28874, 19433,
        8880])
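
The `argsort()[-k:]` idiom is easier to follow on a small array. The sketch below uses made-up weights (an assumption, not values from the model).

```python
import numpy as np

# Hypothetical topic-word weights for a 5-word vocabulary
weights = np.array([0.1, 2.5, 0.7, 3.2, 1.9])

# Indices ordered from the smallest weight to the largest
order = weights.argsort()

# Last 3 indices: the 3 largest weights, still in ascending order
top3 = order[-3:]

# Reverse first to list the largest weight first
top3_desc = order[::-1][:3]

print(order)      # [0 2 4 1 3]
print(top3)       # [4 1 3]
print(top3_desc)  # [3 1 4]
```

Because the slice keeps ascending order, the most heavily weighted word is printed last in the lists that follow.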

In [25]:
# To return the words
top_word_indices = single_Ques.argsort()[-15:]

# Return the top 15 words (highest topic-word weights) using their index
feature_names = cv.get_feature_names_out()
for index in top_word_indices:
    print(feature_names[index])

culture
tell
making
girlfriend
My
guy
friend
process
does
Why
girl
like
feel
What
How
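
Capitalized question words such as "What", "How" and "Why" rank at the top of this list even though `stop_words='english'` was set. A likely cause is `lowercase=False`: scikit-learn's built-in English stop list is all lowercase, so capitalized tokens do not match it. The sketch below checks this on a single made-up sentence (the `sample` text is an assumption).

```python
from sklearn.feature_extraction.text import CountVectorizer

# One hypothetical question
sample = ["What is the best way to learn Python"]

# Same stop-word list, with and without lowercasing
cased = CountVectorizer(stop_words='english', lowercase=False).fit(sample)
lowered = CountVectorizer(stop_words='english', lowercase=True).fit(sample)

cased_vocab = sorted(cased.vocabulary_)
lowered_vocab = sorted(lowered.vocabulary_)

print(cased_vocab)    # 'What' is kept because it does not match 'what'
print(lowered_vocab)  # 'what' is removed as a stop word
```

Keeping the default `lowercase=True` would probably remove these question words and give topics that are easier to interpret, at the cost of merging words such as "Apple" and "apple".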


#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [27]:
# Printing the top 15 words for each of the 20 LDA topics (types of questions)

feature_names = cv.get_feature_names_out()
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR QUESTION #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR QUESTION #0
['culture', 'tell', 'making', 'girlfriend', 'My', 'guy', 'friend', 'process', 'does', 'Why', 'girl', 'like', 'feel', 'What', 'How']


THE TOP 15 WORDS FOR QUESTION #1
['true', 'English', 'War', 'hard', 'World', 'What', 'Why', 'Chinese', 'learning', 'learn', 'possible', 'stop', 'start', 'How', 'Is']


THE TOP 15 WORDS FOR QUESTION #2
['famous', 'project', 'used', 'experience', 'terms', 'difference', 'worst', 'good', 'happened', 'life', 'web', 've', 'business', 'thing', 'What']


THE TOP 15 WORDS FOR QUESTION #3
['purpose', 'effects', 'class', 'movies', 'watch', 'math', 'study', 'exam', 'science', 'school', 'computer', 'prepare', 'How', 'difference', 'What']


THE TOP 15 WORDS FOR QUESTION #4
['month', 'writing', 'skills', 'does', 'earn', 'What', 'English', 'online', 'love', 'lose', 'improve', 'weight', 'money', 'make', 'How']


THE TOP 15 WORDS FOR QUESTION #5
['universities', 'good', 'difference', 'makes', 'meaning', 'happens', 'differences', 'favorite'

I can't really tell what each topic is about, but my best guess is, for example:
- QUESTION #15 --> they're asking about social media accounts
- QUESTION #3 --> young people asking about school and movies at the same time :)


In [28]:
# Applying LDA to DTM 

type_results = LDA.transform(dtm)


In [30]:
type_results.argmax(axis=1)
# Creating a new column in the dataset that gives the index of the most probable topic from LDA
dataset['TypeOfQues'] = type_results.argmax(axis=1)
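
`argmax(axis=1)` picks, for each row, the column with the largest value. A small sketch with a made-up document-topic matrix (an assumption, not model output) shows the assignment step, plus the winning probability as a rough confidence score.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 3 topics
doc_topic = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Index of the most probable topic for each document
assigned = doc_topic.argmax(axis=1)

# Probability of that topic, useful for spotting weak assignments
confidence = doc_topic.max(axis=1)

print(assigned)    # [1 0 2]
print(confidence)  # [0.7 0.6 0.6]
```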

In [31]:
dataset

Unnamed: 0,Question,TypeOfQues
0,What is the step by step guide to invest in sh...,18
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,6
2,How can I increase the speed of my internet co...,16
3,Why am I mentally very lonely? How can I solve...,17
4,"Which one dissolve in water quikly sugar, salt...",10
...,...,...
404284,How many keywords are there in the Racket prog...,15
404285,Do you believe there is life after death?,17
404286,What is one coin?,2
404287,What is the approx annual cost of living while...,14


#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.

In [32]:
# Map each topic index (0-19) from LDA to a label (type_1 ... type_20)

Ques_dict = {i: f'type_{i + 1}' for i in range(20)}
dataset["Ques Label"] = dataset["TypeOfQues"].map(Ques_dict)

In [33]:
dataset

Unnamed: 0,Question,TypeOfQues,Ques Label
0,What is the step by step guide to invest in sh...,18,type_19
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,6,type_7
2,How can I increase the speed of my internet co...,16,type_17
3,Why am I mentally very lonely? How can I solve...,17,type_18
4,"Which one dissolve in water quikly sugar, salt...",10,type_11
...,...,...,...
404284,How many keywords are there in the Racket prog...,15,type_16
404285,Do you believe there is life after death?,17,type_18
404286,What is one coin?,2,type_3
404287,What is the approx annual cost of living while...,14,type_15
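
One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`. The sketch below uses a made-up series of topic indices (an assumption) to show a quick completeness check.

```python
import pandas as pd

# Hypothetical topic indices for four questions, with 3 topics in total
topic_ids = pd.Series([0, 2, 1, 2])

# Complete mapping: every index from 0 to 2 has a label
labels = {i: f"type_{i + 1}" for i in range(3)}
mapped = topic_ids.map(labels)

# Incomplete mapping: index 2 is missing, so those rows become NaN
incomplete = {0: "type_1", 1: "type_2"}
mapped_bad = topic_ids.map(incomplete)

print(mapped.tolist())
print(mapped_bad.isna().sum())  # number of rows left without a label
```

Checking `isna().sum() == 0` after mapping is a cheap way to confirm that every topic index received a label.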


# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn create an instance of NMF with 20 expected components.
Note: Use random_state = 42

In [35]:
# Importing packages 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [36]:
# Reload the original file to work with a fresh copy of the dataset
dataset2 = pd.read_excel('/content/1664163408__Quora Questions.xlsx')


In [37]:
dataset2

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."
...,...
404284,How many keywords are there in the Racket prog...
404285,Do you believe there is life after death?
404286,What is one coin?
404287,What is the approx annual cost of living while...


In [41]:
# Cast every question to str, since some entries appear to be numbers or other non-string values
dataset2['Question'] = dataset2['Question'].apply(str)

# Applying term frequency-inverse document frequency (TF-IDF) vectorization
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english', lowercase=False)

dtm = tfidf.fit_transform(dataset2['Question'])
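
Unlike raw counts, TF-IDF gives less weight to words that appear in many documents. A minimal sketch on a made-up three-document corpus (the `toy_docs` list is an assumption) compares a word found in every document with a word found in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'learn' and 'fast' occur everywhere, 'python' only once
toy_docs = ["learn python fast", "learn cooking fast", "learn chess fast"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)
vocab = toy_tfidf.vocabulary_

# Weights of a common word and a rare word inside the first document
w_common = toy_matrix[0, vocab["learn"]]
w_rare = toy_matrix[0, vocab["python"]]

print(f"learn:  {w_common:.3f}")
print(f"python: {w_rare:.3f}")
```

Both words occur once in the first document, yet "python" receives the larger weight because it is rarer across the corpus.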

In [42]:
# Choosing to have 20 types of questions and fixing random_state so the NMF initialization is reproducible

ques_model = NMF(n_components=20,random_state=42)
ques_model.fit(dtm)



NMF(n_components=20, random_state=42)
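
NMF approximates the document-term matrix as a product of two non-negative matrices: `W` (documents by topics) and `H` (topics by words, stored in `components_`). The sketch below runs it on a small made-up matrix. The data, the 2 components, `init="nndsvda"` and `max_iter=500` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4 x 4 document-term matrix with two obvious word groups
X = np.array([
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
W = model.fit_transform(X)   # document-topic weights
H = model.components_        # topic-word weights

# How far the product W @ H is from the original matrix
error = np.linalg.norm(X - W @ H)

print(W.shape, H.shape)
print(f"reconstruction error: {error:.4f}")
```

Because all entries of `W` and `H` are non-negative, each row of `H` can be read as a list of word weights for one topic. This is what the top-15-words listing for NMF relies on.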

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [44]:
# Printing the top 15 words for each of the 20 NMF topics

feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(ques_model.components_):
    print(f'THE TOP 15 WORDS FOR QUESTION #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR QUESTION #0
['happen', 'mean', 'books', 'things', 'think', 'favorite', 'meaning', 'way', 'ways', 'thing', 'best', 'examples', 'good', 'difference', 'What']


THE TOP 15 WORDS FOR QUESTION #1
['job', 'password', 'website', 'Instagram', 'work', 'long', 'rid', 'increase', 'Facebook', 'did', 'account', 'prepare', 'stop', 'start', 'How']


THE TOP 15 WORDS FOR QUESTION #2
['places', 'site', 'visit', 'place', 'phone', 'movie', 'buy', '2016', 'laptop', 'books', 'movies', 'book', 'way', 'Which', 'best']


THE TOP 15 WORDS FOR QUESTION #3
['girls', 'love', 'want', 'questions', 'doesn', 'use', 'bad', 'need', 'don', 'men', 'hate', 'women', 'important', 'did', 'Why']


THE TOP 15 WORDS FOR QUESTION #4
['friends', 'investment', 'internet', 'free', 'black', 'easiest', 'home', 'easy', 'YouTube', 'ways', 'way', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR QUESTION #5
['old', 'year', 'com', 'world', 'real', 'better', 'safe', 'really', 'way', 'worth', 'true', 'bad', 'po

In [45]:
topic_results = ques_model.transform(dtm)

#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.

In [46]:
dataset2['TypeOfQues'] = topic_results.argmax(axis=1)
dataset2.head(10)

Unnamed: 0,Question,TypeOfQues
0,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0
2,How can I increase the speed of my internet co...,1
3,Why am I mentally very lonely? How can I solve...,3
4,"Which one dissolve in water quikly sugar, salt...",2
5,Astrology: I am a Capricorn Sun Cap moon and c...,6
6,Should I buy tiago?,9
7,How can I be a good geologist?,1
8,When do you use シ instead of し?,15
9,Motorola (company): Can I hack my Charter Moto...,15


In [47]:
# Map each topic index (0-19) from NMF to a label (type_1 ... type_20)

Ques_dict = {i: f'type_{i + 1}' for i in range(20)}
dataset2["Ques Label"] = dataset2["TypeOfQues"].map(Ques_dict)

In [48]:
dataset2

Unnamed: 0,Question,TypeOfQues,Ques Label
0,What is the step by step guide to invest in sh...,0,type_1
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,0,type_1
2,How can I increase the speed of my internet co...,1,type_2
3,Why am I mentally very lonely? How can I solve...,3,type_4
4,"Which one dissolve in water quikly sugar, salt...",2,type_3
...,...,...,...
404284,How many keywords are there in the Racket prog...,12,type_13
404285,Do you believe there is life after death?,8,type_9
404286,What is one coin?,0,type_1
404287,What is the approx annual cost of living while...,0,type_1
