
# Topic Modeling on Articles



## Data

We will be using articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).

In [30]:
import pandas as pd
import os
os.chdir(r'D:\Data Science Projects\NLP\Articles')

In [32]:
npr = pd.read_csv('npr.csv')

In [33]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [34]:
npr.shape

(11992, 1)

In [36]:
npr['Article'][7]

'I was standing by the airport exit, debating whether to get a snack, when a young man with a round face approached me. I focused hard to decipher his words. In a thick accent, he asked me to help him find his suitcase. As we walked to baggage claim, I learned his name: Edward Murinzi. This was his very first plane trip. A refugee from the Democratic Republic of Congo, he’d just arrived to begin his American life. Beside the luggage carousel at Washington’s Reagan Airport, he looked out at the two lanes of traffic and the concrete wall beyond. ”So this is America?” he said. From finding his bag to finding his apartment and finding a job, there was a lot for Edward to learn. Later, he acknowledged that while he was standing in the airport looking for his luggage, he felt the magnitude of the task before him. He says questions were zipping around his head: ”How will I start? You get scared. How will I manage?” After he found his bag and I called his caseworker to come and pick him up, we

# Preprocessing

#### Task: Use count vectorization (`CountVectorizer`) to create a document-term matrix. LDA works on raw term counts rather than TF-IDF weights. You may want to explore the `max_df` and `min_df` parameters.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

In [39]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
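
To make the two thresholds concrete, here is a minimal sketch on a hypothetical four-document corpus (the sentences and the variable names `toy_docs`, `unfiltered`, and `filtered` are made up for illustration and are not part of the NPR data). It compares the vocabulary learned with and without `max_df`/`min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus, used only to illustrate the thresholds.
toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]

# No filtering: every token of two or more characters is kept.
unfiltered = CountVectorizer().fit(toy_docs)

# max_df=0.75 drops terms found in more than 75% of documents ("the": 4 of 4).
# min_df=2 drops terms found in fewer than 2 documents ("dog", "bird").
filtered = CountVectorizer(max_df=0.75, min_df=2).fit(toy_docs)

print(sorted(unfiltered.vocabulary_))  # ['bird', 'cat', 'dog', 'flew', 'sat', 'the']
print(sorted(filtered.vocabulary_))    # ['cat', 'flew', 'sat']
```

A float is read as a proportion of documents and an integer as an absolute document count, which is why `max_df=0.75` and `min_df=2` can be mixed in the same call.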

In [41]:
dtm = cv.fit_transform(npr['Article'])

In [42]:
# Document-term matrix dimensions (raw counts, not TF-IDF)
dtm.shape

(11992, 54777)

# Latent Dirichlet Allocation (LDA)

#### TASK: Using Scikit-Learn, create an instance of LDA with 7 expected components. (Use random_state=42.)

In [43]:
from sklearn.decomposition import LatentDirichletAllocation

In [44]:
# Create an LDA instance: lda_model
# the 7 components will be the topics
lda_model = LatentDirichletAllocation(n_components=7, random_state=42)

In [45]:
# Fit the model to the count-based document-term matrix
lda_model.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [46]:
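# Look up the vocabulary once; get_feature_names() was removed in scikit-learn 1.2
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(lda_model.components_):
    print(f"The top 15 words for TOPIC is: {index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print()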

The top 15 words for TOPIC is: 0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']

The top 15 words for TOPIC is: 1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']

The top 15 words for TOPIC is: 2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']

The top 15 words for TOPIC is: 3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']

The top 15 words for TOPIC is: 4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']

The top 15 words for TOPIC is: 5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'kn
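
Note that the lists above are in ascending order of weight: `argsort()` sorts from smallest to largest, so the strongest word of each topic appears last. The sketch below shows one way to list the strongest words first. The vocabulary and weights are hypothetical stand-ins for the vectorizer's feature names and one row of `lda_model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and one row of topic-word weights (illustration only).
feature_names = np.array(["care", "health", "music", "trump", "vote"])
topic_weights = np.array([2760.2, 3699.3, 0.14, 5.1, 12.7])

top_n = 3
# argsort() sorts ascending, so the last top_n positions hold the largest weights;
# [::-1] reverses them so the strongest word comes first.
top_idx = topic_weights.argsort()[-top_n:][::-1]
top_words = feature_names[top_idx].tolist()
print(top_words)  # ['health', 'care', 'vote']
```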

In [47]:
# Transform the document-term matrix: lda_features (document-topic matrix)
lda_features = lda_model.transform(dtm)  

In [48]:
#Features Dimensions
lda_features.shape

(11992, 7)
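
Each row of this matrix can be read as a topic distribution for one document: in current scikit-learn versions, `transform` returns normalized weights, so every row should sum to roughly 1. The following sketch checks this on a hypothetical toy count matrix (`toy_counts`, `toy_lda`, and `toy_doc_topics` are illustrative names, not objects from this notebook).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical 4-document, 5-term count matrix (illustration only).
toy_counts = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # one row per document, one column per topic
print(toy_doc_topics.sum(axis=1))  # each row sums to approximately 1
```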

In [49]:
#Components Dimensions (topic and word matrix)
lda_model.components_.shape

(7, 54777)

In [50]:
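# Create a DataFrame: components_df (one row per topic, one column per word)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
components_df = pd.DataFrame(lda_model.components_, columns=cv.get_feature_names_out())
components_df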

Unnamed: 0,00,000,00000,000s,000th,002,004,007,009,00s,...,zulu,zuma,zumba,zuraw,zurich,zwicky,zydeco,ángel,émigrés,überfunky
0,8.643328,2380.143327,0.142901,3.142641,0.142857,0.143743,0.143003,0.142878,0.143104,0.1429,...,0.142857,0.159712,0.143003,0.156732,0.987469,0.142857,0.142858,0.143007,0.142902,0.142862
1,27.619175,536.394437,0.142857,0.142861,0.143092,1.168792,0.142862,0.144935,1.896723,0.142862,...,0.142881,0.143611,0.143911,0.142857,0.142967,0.142857,0.142858,0.142862,0.142857,0.142907
2,7.227839,824.033986,0.142857,0.142928,0.143214,0.142867,0.14301,0.142902,0.142857,0.142922,...,0.14301,0.143709,0.144234,0.142857,2.373677,3.124346,0.142966,6.142362,2.140614,0.142924
3,1.752141,900.736692,0.142857,0.142857,1.768809,2.984565,0.142857,1.159362,1.387008,0.142864,...,0.142883,0.142897,3.712762,0.147033,4.066328,0.142857,0.14287,0.142944,0.143107,0.142857
4,3.114887,350.409655,0.142857,0.142915,0.143387,0.142943,2.316872,1.140668,0.144594,0.142869,...,0.142857,6.928944,0.142926,0.142857,0.142872,0.142989,0.142863,0.14286,0.142857,0.142867
5,46.148639,51.44086,3.142814,0.142886,0.143158,0.143412,0.968539,3.126244,0.142857,15.142685,...,2.142655,4.337526,0.57027,2.124807,0.143321,0.143914,13.142728,0.143108,0.143902,2.142718
6,0.493991,418.841042,0.142857,0.142911,2.515483,1.273678,0.142857,0.143011,0.142857,0.142897,...,0.142857,0.143601,0.142894,0.142857,0.143367,0.16018,0.142857,0.142857,0.14376,0.142866
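
The values in `components_` are not probabilities. The scikit-learn documentation describes them as pseudo-counts of how often a word was assigned to a topic, and the many entries close to 0.142857 (that is, 1/7) appear to match the default topic-word prior of `1 / n_components`, meaning the word was essentially never assigned to that topic. Dividing each row by its sum turns a topic into a distribution over words. The sketch below does this on a hypothetical 2-topic, 4-word matrix rather than on the real model.

```python
import numpy as np

# Hypothetical 2-topic, 4-word pseudo-count matrix standing in for
# lda_model.components_ (illustration only).
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 6.0, 3.0],
])

# Divide each row by its sum so that every topic becomes a distribution over words.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs)
```

Normalizing does not change the ranking of words within a topic, so the top-word lists stay the same; it only makes the weights comparable across topics.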


In [51]:
#Get the Words of the Highest Value for each Topic

for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1}, the top 10 words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 1, the top 10 words with the highest value are:
says          6247.245511
said          4608.957060
health        3699.339794
people        3643.826184
care          2760.197441
million       2628.992411
company       2626.815541
government    2533.274225
percent       2529.319134
new           2454.825196
Name: 0, dtype: float64


For topic 2, the top 10 words with the highest value are:
said         10102.177169
trump         5131.084582
president     4136.107708
police        3486.606931
told          2866.730952
people        2775.498131
news          2708.201875
says          2611.376404
reports       2541.147184
npr           2520.698546
Name: 1, dtype: float64


For topic 3, the top 10 words with the highest value are:
says      13636.686911
like       4427.556101
people     4281.840795
just       3476.564642
food       3145.001238
years      2868.878582
new        2757.696699
city       2625.646160
water      2372.209237
time       2325.818241
Name: 2, dtype: float64


In [52]:
npr['Article'][55]

'Almost a million elephants roamed Africa 25 years ago. Assessments of their population now vary but suggest there are fewer than half that many. The main reason for the decline is ivory. Despite a 1989 ban on ivory trade, poachers continue to kill elephants for their tusks. Now China, the destination for most of that ivory, has announced it will shut down its domestic ivory market. Wildlife experts had thought that the international ban on ivory trade would slow or even stop the killing of elephants for their tusks. It didn’t. In fact, the killing got worse. That’s mostly because the ban didn’t cover older ivory, that is, ivory taken from elephants before the 1989 ban. So people are still killing elephants but passing off their ivory as old, and therefore legal to trade. John Robinson, with the Wildlife Conservation Society, says efforts to stop the supply of ivory at the source, in Africa, have not been very successful. ”Addressing the demand is absolutely essential if we are going t

In [53]:
# Get the topic distribution of the document at index 55 from the feature matrix
pd.DataFrame(lda_features).loc[55]

0    0.444300
1    0.000693
2    0.552238
3    0.000692
4    0.000693
5    0.000692
6    0.000692
Name: 55, dtype: float64
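
This document is split almost evenly between topics 0 and 2, so a single dominant topic hides part of the picture. One possible way to keep every topic that carries real weight is to apply a cut-off; the sketch below uses the weights printed above and a hypothetical threshold of 0.2, which is chosen only for illustration.

```python
import numpy as np

# Topic weights for document 55, copied from the output above.
doc_topics = np.array([0.444300, 0.000693, 0.552238, 0.000692,
                       0.000693, 0.000692, 0.000692])

threshold = 0.2  # hypothetical cut-off, chosen only for illustration

# Topic indices ordered from highest to lowest weight.
order = np.argsort(doc_topics)[::-1]
# Keep only the topics whose weight reaches the threshold.
significant = [int(i) for i in order if doc_topics[i] >= threshold]
print(significant)  # [2, 0]
```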

In [54]:
# Get the index of the dominant topic directly
pd.DataFrame(lda_features).loc[55].idxmax()

2

In [55]:
# For each topic (column), find the index of the document with the highest weight for that topic
pd.DataFrame(lda_features).idxmax()

0     2759
1     2331
2     8956
3    10184
4     7966
5    10958
6     8253
dtype: int64
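
Called without `axis`, `idxmax()` works column by column, so the output above lists, for each topic, the row index of the document that scores highest on it; it is not a count. To count how many documents fall under each topic, one option is to take the row-wise argmax and tally the results. The sketch below uses a hypothetical 5-document, 3-topic matrix in place of `lda_features`.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix: 5 documents, 3 topics (illustration only).
toy_features = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

# Dominant topic of each document (row-wise argmax), then a count per topic.
dominant = toy_features.argmax(axis=1)
docs_per_topic = pd.Series(dominant).value_counts().sort_index()
print(docs_per_topic)
```

Here topic 0 is dominant for three documents and topics 1 and 2 for one document each.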

In [56]:
# Get dominant topic for each document
npr['Topic'] = lda_features.argmax(axis=1)

In [57]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2


In [58]:
# Map each topic index to a hand-written theme label
theme_labels = {
    0: 'American/Car/Marriage/Story/Life in general',
    1: 'Education/Business/Money',
    2: 'American Medicare/Trump',
    3: 'State/Social/Rights',
    4: 'Build new life',
    5: 'Highly educated Indian engineers in America',
    6: 'Tips on improving work day efficiency',
}

npr['Topic_theme'] = npr['Topic'].map(theme_labels)
npr.head(15)

Unnamed: 0,Article,Topic,Topic_theme
0,"In the Washington of 2016, even when the polic...",1,Education/Business/Money
1,Donald Trump has used Twitter — his prefe...,1,Education/Business/Money
2,Donald Trump is unabashedly praising Russian...,1,Education/Business/Money
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,Education/Business/Money
4,"From photography, illustration and video, to d...",2,American Medicare/Trump
5,I did not want to join yoga class. I hated tho...,3,State/Social/Rights
6,With a who has publicly supported the debunk...,3,State/Social/Rights
7,"I was standing by the airport exit, debating w...",2,American Medicare/Trump
8,"If movies were trying to be more realistic, pe...",3,State/Social/Rights
9,"Eighteen years ago, on New Year’s Eve, David F...",2,American Medicare/Trump


# How to Predict the Topic of a New Document

#### Let’s say that we want to assign a topic to a new, unseen document. We first transform the document with the fitted `CountVectorizer`, then pass the resulting counts to the fitted LDA model's `transform` method.

In [66]:
my_news = """I will vote to Modi goverment in this Kolkata election for their work towards INDIA"""
 
# Vectorize the new document with the fitted CountVectorizer
X = cv.transform([my_news])
# Infer the topic distribution of the new document: lda_features
lda_features = lda_model.transform(X)
 
pd.DataFrame(lda_features)

Unnamed: 0,0,1,2,3,4,5,6
0,0.236983,0.020464,0.347512,0.020513,0.333635,0.020442,0.020451


In [67]:
# if we want to get the index of the topic with the highest score:

topic = pd.DataFrame(lda_features).idxmax(axis=1)
print(f'The given document belongs to Topic {topic[0]}.')

The given document belongs to Topic 2.
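
The two steps can be wrapped in a small helper so that any new text is handled the same way. The sketch below is only one possible arrangement: `predict_topic` is a hypothetical function, not part of scikit-learn, and the toy corpus, `toy_cv`, and `toy_lda` exist only so that the example runs on its own. In the notebook, the fitted `cv` and `lda_model` would be passed instead.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def predict_topic(text, vectorizer, model):
    """Return (dominant_topic_index, topic_distribution) for one raw text.

    Hypothetical helper, not part of scikit-learn.
    """
    counts = vectorizer.transform([text])      # reuse the fitted vocabulary
    distribution = model.transform(counts)[0]  # topic weights for this document
    return int(distribution.argmax()), distribution


# Hypothetical toy corpus so that the sketch runs on its own.
toy_docs = [
    "election vote party campaign",
    "vote election president party",
    "health care patients doctors",
    "patients health study disease",
]
toy_cv = CountVectorizer().fit(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_cv.transform(toy_docs))

topic, dist = predict_topic("party election vote", toy_cv, toy_lda)
print(topic, dist)
```

Words that are absent from the fitted vocabulary are ignored by `vectorizer.transform`, so a short text with few known words can produce a fairly flat distribution, and its dominant topic should then be read with caution.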
