# Topic Modelling for news articles

## Overview

The aim of this project is to categorise news headlines into distinct topics. The data is taken from https://www.kaggle.com/aaron7sun/stocknews. Topic modelling provides methods to organise, understand and summarise large collections of text. Grouping the headlines by topic narrows a large dataset down to the subset that is relevant to a given field of study.

### Steps:
1. Data cleaning and processing
2. Creating document term matrix
3. Applying non-negative matrix factorisation
4. Retrieving top 15 words for each topic
5. Determining appropriate number of topics
6. Attaching discovered topic labels to original articles
7. Interpreting the topics based on keywords

## 1. Data cleaning and processing

In [173]:
import pandas as pd
import numpy as np

In [174]:
data = pd.read_csv("news.csv")

In [175]:
len(data)

73608

In [176]:
data.head()

|   | Date | News |
|---|------|------|
| 0 | 2016-07-01 | A 117-year-old woman in Mexico City finally re... |
| 1 | 2016-07-01 | IMF chief backs Athens as permanent Olympic host |
| 2 | 2016-07-01 | The president of France says if Brexit won, so... |
| 3 | 2016-07-01 | British Man Who Must Give Police 24 Hours' Not... |
| 4 | 2016-07-01 | 100+ Nobel laureates urge Greenpeace to stop o... |


In [177]:
# Count the entries in the News column that are not of type str
count = sum(not isinstance(news, str) for news in data['News'])

print(count)

0


In [178]:
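# Lemmatize words, i.e. reduce each word to its base form so that variants such as 'articles' and 'article' are counted as the same word.
# Note: lemmatize() treats every word as a noun unless a pos argument is passed, so verb forms such as 'running' or 'ran' are left unchanged,
# and words ending in 's' such as 'has' and 'was' are shortened to 'ha' and 'wa' (both appear in the topic keywords further down).
# The WordNet corpus must be available, e.g. via nltk.download('wordnet').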
from nltk.stem import WordNetLemmatizer
import string

# Function to remove punctuation
def remove_punc(text):
    no_punc = ''.join([w for w in text if w not in string.punctuation])
    return no_punc

lem = WordNetLemmatizer()
def word_lemmatizer(text):
    lemmas = [lem.lemmatize(w) for w in text]
    return lemmas

data['News'] = data['News'].apply(remove_punc)        # strip punctuation characters
data['News'] = data['News'].str.split(' ')            # split each headline into words
data['News'] = data['News'].apply(word_lemmatizer)    # lemmatize word by word
data['News'] = data['News'].str.join(' ')             # join the words back into one string
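
Tokens such as 'bthe', 'bnorth' and 'bisrael' show up in the topic keywords further down. A likely cause, stated here as an assumption rather than something verified against the CSV, is that some raw headlines are stored as byte-string literals (`b'...'` or `b"..."`), so that removing the quote fuses the leading `b` onto the first word. The sketch below uses two hypothetical helpers, `strip_bytes_prefix` and `clean_headline`, to drop that marker before punctuation is removed.

```python
import re
import string

# Hypothetical helper: matches a headline wrapped as b'...' or b"..."
_BYTES_LITERAL = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)

# Translation table that deletes every ASCII punctuation character
_PUNC_TABLE = str.maketrans('', '', string.punctuation)

def strip_bytes_prefix(text):
    """Return the inner text of a b'...' literal, or the text unchanged."""
    match = _BYTES_LITERAL.match(text.strip())
    return match.group(2) if match else text

def clean_headline(text):
    """Drop a byte-string marker if present, then remove punctuation."""
    return strip_bytes_prefix(text).translate(_PUNC_TABLE)

# Assumed example of a raw headline; not taken from the dataset
sample = 'b"North Korea launches missile"'

naive = sample.translate(_PUNC_TABLE)   # punctuation removal only
cleaned = clean_headline(sample)        # marker stripped first

print(naive)
print(cleaned)
```

If the assumption holds, applying `clean_headline` in place of `remove_punc` would keep those fused tokens out of the vocabulary.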

## 2. Creating document term matrix

In [179]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [180]:
# max_df = 0.9 : terms that appear in more than 90% of the documents are ignored
# min_df = 2   : terms that appear in fewer than 2 documents are ignored

tfidf = TfidfVectorizer(max_df=0.9, min_df=2, stop_words='english')

In [181]:
dtm = tfidf.fit_transform(data['News'])

In [182]:
dtm.shape

(73608, 25526)
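
To make the effect of `max_df` and `min_df` concrete, here is a minimal pure-Python sketch of the document-frequency rule they apply. It is an illustration only, not scikit-learn's implementation: the function name `filter_vocabulary` and the toy documents are made up, tokenisation is a plain whitespace split, and stop words are not handled.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.9, min_df=2):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs].

    max_df is a proportion of documents, min_df an absolute document count,
    matching how the two arguments are used in the cell above.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )

# Toy corpus (hypothetical headlines)
toy_docs = [
    "korea missile test",
    "korea nuclear test",
    "russia sanction test",
    "russia ukraine test",
]

kept = filter_vocabulary(toy_docs)
print(kept)
```

In the toy corpus, 'test' occurs in every document and is dropped by `max_df`, while terms that occur in a single document are dropped by `min_df`.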

## 3. Applying non-negative matrix factorisation

In [183]:
from sklearn.decomposition import NMF

In [184]:
nmf_model = NMF(n_components=20,random_state=21)     # n_components represents number of topics 
nmf_model.fit(dtm)



NMF(n_components=20, random_state=21)

In [185]:
nmf_model.components_.shape

(20, 25526)
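
As a reminder of what the fitted model represents: NMF approximates the document term matrix by the product of two non-negative matrices. The symbols $V$, $W$ and $H$ are notation introduced here, not names from the code, and the objective shown assumes scikit-learn's default settings (Frobenius loss, no regularisation).

$$
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert V - WH \rVert_F^2,
\qquad
V \in \mathbb{R}^{73608 \times 25526},\quad
W \in \mathbb{R}^{73608 \times 20},\quad
H \in \mathbb{R}^{20 \times 25526}
$$

Here $V$ corresponds to `dtm`, $H$ to `nmf_model.components_` (one row of word weights per topic, which matches the shape printed above), and $W$ holds one row of topic weights per article, which is what `nmf_model.transform(dtm)` returns.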

## 4. Retrieving top 15 words for each topic

In [186]:
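# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
feature_names = tfidf.get_feature_names()

for index,topic in enumerate(nmf_model.components_):
    print(f'Top 15 words for topic #{index + 1}')
    # argsort() sorts in ascending order, so the strongest word is printed last
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')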

Top 15 words for topic #1
['german', 'just', 'leader', 'called', 'international', 'time', 'month', 'states', 'ba', 'bthe', 'united', 'president', 'said', 'country', 'ha']


Top 15 words for topic #2
['rocket', 'border', 'military', 'threatens', 'koreas', 'test', 'nuclear', 'launch', 'bnorth', 'kim', 'missile', 'korean', 'south', 'north', 'korea']


Top 15 words for topic #3
['west', 'palestine', 'lebanon', 'egypt', 'jerusalem', 'jews', 'jewish', 'gaza', 'rocket', 'palestinian', 'settlement', 'peace', 'hamas', 'palestinians', 'israel']


Top 15 words for topic #4
['afghanistan', 'civilian', 'rebel', 'soldier', 'dead', 'pakistan', 'air', 'force', 'syrian', 'kill', 'isis', 'strike', 'people', 'syria', 'killed']


Top 15 words for topic #5
['arm', 'gas', 'europe', 'eu', 'snowden', 'ban', 'missile', 'nato', 'crimea', 'warns', 'sanction', 'putin', 'ukraine', 'syria', 'russia']


Top 15 words for topic #6
['change', 'expert', 'wont', 'pope', 'climate', 'prime', 'refugee', 'snowden', 'pm', 'ch
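
The lists above are printed with the strongest word last. The following sketch shows the same selection on a toy weight vector, returning the words strongest first; `top_words`, the vocabulary and the weights are hypothetical and only illustrate the ranking step.

```python
def top_words(weights, vocabulary, k=15):
    """Return the k words with the largest weights, strongest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]

# Toy topic row: one weight per vocabulary entry (made-up values)
vocabulary = ['korea', 'missile', 'israel', 'gaza']
weights = [0.9, 0.4, 0.0, 0.1]

print(top_words(weights, vocabulary, k=2))
```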

## 5. Determining appropriate number of topics
    

We started with 20 topics. However, several topics share many of their top words, for example topics 4, 15, 18 and 19. Since we want a general categorisation of the news articles, we reduce the number of topics.
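
Overlap between topics can also be measured rather than judged by eye. One simple option, which is an addition here and not the method used in this notebook, is the Jaccard similarity between the top-word lists of each pair of topics. The function names, the threshold of 0.2 and the toy lists below are all assumptions for illustration.

```python
def jaccard(words_a, words_b):
    """Share of words that two keyword lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlapping_pairs(topics, threshold=0.2):
    """Return (topic_i, topic_j, score) for pairs at or above the threshold.

    Topic numbers are 1-based to match the printed output.
    """
    pairs = []
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            score = jaccard(topics[i], topics[j])
            if score >= threshold:
                pairs.append((i + 1, j + 1, round(score, 2)))
    return pairs

# Toy keyword lists (hypothetical, shortened to four words each)
toy_topics = [
    ['korea', 'north', 'missile', 'nuclear'],
    ['israel', 'gaza', 'hamas', 'rocket'],
    ['korea', 'missile', 'launch', 'rocket'],
]

print(overlapping_pairs(toy_topics))
```

Pairs with a high score are candidates for merging, which would support lowering `n_components`.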

In [187]:
# Using 13 topics
nmf_model2 = NMF(n_components=13,random_state=21)     
nmf_model2.fit(dtm)

for index,topic in enumerate(nmf_model2.components_):
    print(f'Top 15 words for topic #{index + 1}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')



Top 15 words for topic #1
['human', 'internet', 'law', 'people', 'ban', 'court', 'said', 'state', 'right', 'minister', 'president', 'uk', 'country', 'government', 'ha']


Top 15 words for topic #2
['rocket', 'border', 'threatens', 'military', 'koreas', 'test', 'nuclear', 'launch', 'bnorth', 'kim', 'missile', 'korean', 'south', 'north', 'korea']


Top 15 words for topic #3
['jerusalem', 'peace', 'aid', 'jewish', 'bisrael', 'rocket', 'settlement', 'west', 'bank', 'palestinians', 'hamas', 'palestinian', 'israeli', 'gaza', 'israel']


Top 15 words for topic #4
['soldier', 'dead', 'military', 'air', 'iraq', 'people', 'force', 'syrian', 'pakistan', 'strike', 'isis', 'kill', 'syria', 'killed', 'attack']


Top 15 words for topic #5
['president', 'moscow', 'crisis', 'troop', 'warns', 'sanction', 'vladimir', 'crimea', 'nato', 'military', 'syria', 'putin', 'russian', 'ukraine', 'russia']


Top 15 words for topic #6
['cartel', 'bthe', 'warns', 'cold', 'bisrael', 'end', 'mexican', 'gaza', 'afghanis

Again, topics 11 to 13 seem vague and are difficult to categorise, so we narrow the model further to 10 topics.

In [188]:
# Using 10 topics
nmf_model3 = NMF(n_components=10,random_state=21)     
nmf_model3.fit(dtm)

for index,topic in enumerate(nmf_model3.components_):
    print(f'Top 15 words for topic #{index + 1}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')



Top 15 words for topic #1
['million', 'woman', 'right', 'president', 'court', 'people', 'law', 'uk', 'wa', 'country', 'government', 'world', 'new', 'year', 'ha']


Top 15 words for topic #2
['rocket', 'border', 'threatens', 'military', 'koreas', 'test', 'nuclear', 'launch', 'bnorth', 'kim', 'missile', 'korean', 'south', 'north', 'korea']


Top 15 words for topic #3
['jerusalem', 'peace', 'aid', 'jewish', 'bisrael', 'rocket', 'settlement', 'west', 'bank', 'palestinians', 'hamas', 'palestinian', 'israeli', 'gaza', 'israel']


Top 15 words for topic #4
['soldier', 'dead', 'air', 'military', 'iraq', 'people', 'force', 'syrian', 'strike', 'pakistan', 'isis', 'kill', 'syria', 'killed', 'attack']


Top 15 words for topic #5
['moscow', 'crisis', 'troop', 'warns', 'sanction', 'president', 'crimea', 'vladimir', 'nato', 'military', 'syria', 'putin', 'russian', 'ukraine', 'russia']


Top 15 words for topic #6
['eu', 'nsa', 'wont', 'refugee', 'pm', 'human', 'chief', 'prime', 'snowden', 'right', 'pr

## 6. Attaching discovered topic labels to original articles
   

In [190]:
topic_results = nmf_model3.transform(dtm)

In [191]:
topic_results.argmax(axis=1)

array([0, 9, 5, ..., 4, 0, 0])
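
Each row of the transformed matrix holds one weight per topic, and the article is assigned to the topic with the largest weight. The sketch below reproduces that rule on a toy matrix in plain Python; `dominant_topic` and the weights are hypothetical.

```python
def dominant_topic(doc_topic_weights):
    """Index of the largest weight in each row (first index wins on ties)."""
    return [
        max(range(len(row)), key=row.__getitem__)
        for row in doc_topic_weights
    ]

# Toy document-topic weights: 3 articles, 3 topics (made-up values)
toy_weights = [
    [0.7, 0.1, 0.0],    # clearly topic 0
    [0.0, 0.05, 0.3],   # clearly topic 2
    [0.0, 0.0, 0.0],    # no weight on any topic
]

labels = dominant_topic(toy_weights)
print(labels)
```

Note the last row: an article with no weight on any topic still receives index 0, which is also what `argmax` returns for an all-zero row. Such articles would end up under the first label, so it may be worth checking for them before interpreting the results.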

## 7. Interpreting the topics based on keywords

In [192]:
data['Topic'] = topic_results.argmax(axis=1)

In [193]:
# Map each topic index to a label chosen by hand from its top keywords
data['Topic'] = data['Topic'].map({0:'Politics', 
                               1:'Nuclear tension in North Korea', 
                               2:'Israel-Palestine Conflict', 
                               3:'Terrorism attacks', 
                               4:'Russia-Ukraine Conflict', 
                               5:'War on drugs', 
                               6:'South China Sea Tensions', 
                               7:'Nuclear power, oil', 
                               8:'Climate and refugees problem', 
                               9:'Riots and protests'})

In [233]:
for i in range(363,370):
    print("topic: " + data['Topic'][i])
    print("news: " + data['News'][i])
    print()
    

topic: Riots and protests
news: Taliban use honey trap boy to kill Afghan police  The Taliban are using child sex slave to mount crippling insider attack on police in southern Afghanistan

topic: Israel-Palestine Conflict
news: COGAT Israel water supply to Palestinians increased not decreased

topic: Riots and protests
news: Deaths arrest a looting erupts in Venezuela

topic: Politics
news: Elderly Japanese among the world richest retiree are flocking to inheritance adviser tackling historical taboo on discussing death and providing a rare avenue of growth for the country brokerage and bank

topic: Terrorism attacks
news: Russian hooligan attack Spanish tourist outside cathedral  Independentie

topic: Terrorism attacks
news: Boko Haram shoot dead 18 woman at funeral in northern Nigeria

topic: Terrorism attacks
news: ISIS Committed Genocide Against Yazidis in Syria and Iraq UN Panel Says

