 
# Text Mining of BBC news articles

Using NLP techniques, we will analyze BBC news articles. First, the NLTK library is used to preprocess and clean the text. Next, TF-IDF turns the cleaned text into a numeric feature matrix that machine learning models can consume. Finally, unsupervised topic modeling groups similar articles together.

BBC Dataset: All rights, including copyright, in the content of the original articles are owned by the BBC.
The dataset consists of 2410 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Class Labels: 5 (business, entertainment, politics, sport, tech)


## In this notebook

 - Work with Text data
 - Preprocessing Text using NLTK
 - Generating TF-IDF matrix
 - Topic Modeling

In [1]:
# machine learning library
!pip install scikit-learn 

# Natural language processing library
!pip install nltk

# data manipulation library
!pip install pandas





In [2]:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [3]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# load data
file_path = 'bbc.csv'
raw_df = pd.read_csv(file_path)

In [5]:
raw_df.shape

(2410, 3)

In [6]:
raw_df.columns

Index(['Unnamed: 0', 'description', 'tags'], dtype='object')
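
A column named `Unnamed: 0` is usually what a saved DataFrame index looks like when the CSV is read back. A minimal sketch, assuming that is also the case for `bbc.csv`, of reading such a file with `index_col=0`. The inline CSV text is made up for illustration, and the rest of this notebook keeps the column as loaded:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for bbc.csv (not the real data).
csv_text = ',description,tags\n0,first article,"sports, football"\n1,second article,\n'

# index_col=0 treats the first (unnamed) column as the row index
# instead of loading it as a regular 'Unnamed: 0' column.
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(demo_df.columns))
print(demo_df.shape)
```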

In [7]:
raw_df

Unnamed: 0.1,Unnamed: 0,description,tags
0,0,chelsea sack mutu chelsea have sacked adrian ...,"sports, stamford bridge, football association,..."
1,1,record fails to lift lacklustre meet yelena i...,"sports, madrid, birmingham, france, scotland, ..."
2,2,edu describes tunnel fracas arsenals edu has ...,"sports, derby, brazil, tunnel fracasedu, food,..."
3,3,ogara revels in ireland victory ireland flyha...,"sports, bbc, united kingdom, ireland, brian o'..."
4,4,unclear future for striker baros liverpool fo...,"sports, liverpool, daily sport, millennium sta..."
...,...,...,...
2405,2405,gm in crunch talks on fiat future fiat will m...,"business, zurich, fiat, reuters, the financial..."
2406,2406,uk firm faces venezuelan land row venezuelan ...,"agroflora, reuters, vestey group, venezuela, u..."
2407,2407,winndixie files for bankruptcy us supermarket...,"business, jacksonville, kraft foods, winn-dixi..."
2408,2408,yangtze electrics profits double yangtze elec...,"environment, business, yangtze electric power,..."


In [8]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2410 entries, 0 to 2409
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   2410 non-null   int64 
 1   description  2410 non-null   object
 2   tags         2392 non-null   object
dtypes: int64(1), object(2)
memory usage: 56.6+ KB


In [9]:
# some tags are missing, replace with unknown
no_nulls_df = raw_df.fillna('Unknown')

In [10]:
no_nulls_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2410 entries, 0 to 2409
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   2410 non-null   int64 
 1   description  2410 non-null   object
 2   tags         2410 non-null   object
dtypes: int64(1), object(2)
memory usage: 56.6+ KB


In [11]:
# get article type from the tags column (the article type is the first tag, i.e. the text before the first comma)

no_nulls_df['article_type'] = no_nulls_df['tags'].map(lambda x:(x.split(','))[0])
no_nulls_df

Unnamed: 0.1,Unnamed: 0,description,tags,article_type
0,0,chelsea sack mutu chelsea have sacked adrian ...,"sports, stamford bridge, football association,...",sports
1,1,record fails to lift lacklustre meet yelena i...,"sports, madrid, birmingham, france, scotland, ...",sports
2,2,edu describes tunnel fracas arsenals edu has ...,"sports, derby, brazil, tunnel fracasedu, food,...",sports
3,3,ogara revels in ireland victory ireland flyha...,"sports, bbc, united kingdom, ireland, brian o'...",sports
4,4,unclear future for striker baros liverpool fo...,"sports, liverpool, daily sport, millennium sta...",sports
...,...,...,...,...
2405,2405,gm in crunch talks on fiat future fiat will m...,"business, zurich, fiat, reuters, the financial...",business
2406,2406,uk firm faces venezuelan land row venezuelan ...,"agroflora, reuters, vestey group, venezuela, u...",agroflora
2407,2407,winndixie files for bankruptcy us supermarket...,"business, jacksonville, kraft foods, winn-dixi...",business
2408,2408,yangtze electrics profits double yangtze elec...,"environment, business, yangtze electric power,...",environment


In [12]:
no_nulls_df["article_type"].value_counts().head(n=20)

sports             473
entertainment      413
business           399
technology         391
politics           294
law                 61
social issues       56
human interest      31
disaster            27
labor               19
Unknown             18
education           17
health              17
environment         13
war                 12
london              10
hospitality          8
bbc                  6
iraq                 4
hewlett packard      4
Name: article_type, dtype: int64

In [13]:
no_nulls_df['article_type'] = no_nulls_df['article_type'].map(lambda x:"Unknown" if x not in ['sports', 'entertainment', 'business', 'technology', 'politics'] else x)
no_nulls_df

Unnamed: 0.1,Unnamed: 0,description,tags,article_type
0,0,chelsea sack mutu chelsea have sacked adrian ...,"sports, stamford bridge, football association,...",sports
1,1,record fails to lift lacklustre meet yelena i...,"sports, madrid, birmingham, france, scotland, ...",sports
2,2,edu describes tunnel fracas arsenals edu has ...,"sports, derby, brazil, tunnel fracasedu, food,...",sports
3,3,ogara revels in ireland victory ireland flyha...,"sports, bbc, united kingdom, ireland, brian o'...",sports
4,4,unclear future for striker baros liverpool fo...,"sports, liverpool, daily sport, millennium sta...",sports
...,...,...,...,...
2405,2405,gm in crunch talks on fiat future fiat will m...,"business, zurich, fiat, reuters, the financial...",business
2406,2406,uk firm faces venezuelan land row venezuelan ...,"agroflora, reuters, vestey group, venezuela, u...",Unknown
2407,2407,winndixie files for bankruptcy us supermarket...,"business, jacksonville, kraft foods, winn-dixi...",business
2408,2408,yangtze electrics profits double yangtze elec...,"environment, business, yangtze electric power,...",Unknown


In [14]:
# check the number of articles per article type

no_nulls_df['article_type'].value_counts().head(n=20)

sports           473
Unknown          440
entertainment    413
business         399
technology       391
politics         294
Name: article_type, dtype: int64
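
The lambda used to collapse rare article types can also be written with `Series.isin` and `Series.where`, which avoids calling a Python function per row. A small sketch on a toy Series (the four values below are only examples taken from the tables above, not the full column):

```python
import pandas as pd

# The five classes kept in this notebook.
known_types = ['sports', 'entertainment', 'business', 'technology', 'politics']

# Toy stand-in for no_nulls_df['article_type'].
article_type = pd.Series(['sports', 'agroflora', 'business', 'environment'])

# Keep values that are in known_types; replace everything else with 'Unknown'.
cleaned = article_type.where(article_type.isin(known_types), 'Unknown')

print(cleaned.tolist())
```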

## Text Processing

In [15]:
# look at example of text
no_nulls_df['description'][0]

'chelsea sack mutu  chelsea have sacked adrian mutu after he failed a drugs test  the yearold tested positive for a banned substance  which he later denied was cocaine  in october chelsea have decided to write off a possible transfer fee for mutu a m signing from parma last season who may face a twoyear suspension a statement from chelsea explaining the decision readwe want to make clear that chelsea has a zero tolerance policy towards drugs mutu scored six goals in his first five games after arriving at stamford bridge but his form went into decline and he was frozen out by coach jose mourinho chelseas statement added this applies to both performanceenhancing drugs or socalled recreational drugs they have no place at our club or in sport in coming to a decision on this case chelsea believed the clubs social responsibility to its fans players employees and other stakeholders in football regarding drugs was more important than the major financial considerations to the company any player

In [16]:
# lowercase all articles text
no_nulls_df['description'] = no_nulls_df['description'].map(lambda x: x.lower())

# check the first article
no_nulls_df['description'][0]

'chelsea sack mutu  chelsea have sacked adrian mutu after he failed a drugs test  the yearold tested positive for a banned substance  which he later denied was cocaine  in october chelsea have decided to write off a possible transfer fee for mutu a m signing from parma last season who may face a twoyear suspension a statement from chelsea explaining the decision readwe want to make clear that chelsea has a zero tolerance policy towards drugs mutu scored six goals in his first five games after arriving at stamford bridge but his form went into decline and he was frozen out by coach jose mourinho chelseas statement added this applies to both performanceenhancing drugs or socalled recreational drugs they have no place at our club or in sport in coming to a decision on this case chelsea believed the clubs social responsibility to its fans players employees and other stakeholders in football regarding drugs was more important than the major financial considerations to the company any player

In [17]:
# Remove all non-alphanumeric characters using a regular expression.
# [] defines a character class and a leading ^ negates it.
# A-Za-z covers the English letters, 0-9 the digits, and ' ' is the space character.
# Any run of one or more characters that are NOT letters, digits or spaces is replaced with the empty string.
# The cleaned string is written back to the 'description' column.

import re

no_nulls_df['description'] = no_nulls_df["description"].map(lambda x: re.sub(r'[^A-Za-z0-9 ]+', '', x))

# check for changes
no_nulls_df['description'][0]

'chelsea sack mutu  chelsea have sacked adrian mutu after he failed a drugs test  the yearold tested positive for a banned substance  which he later denied was cocaine  in october chelsea have decided to write off a possible transfer fee for mutu a m signing from parma last season who may face a twoyear suspension a statement from chelsea explaining the decision readwe want to make clear that chelsea has a zero tolerance policy towards drugs mutu scored six goals in his first five games after arriving at stamford bridge but his form went into decline and he was frozen out by coach jose mourinho chelseas statement added this applies to both performanceenhancing drugs or socalled recreational drugs they have no place at our club or in sport in coming to a decision on this case chelsea believed the clubs social responsibility to its fans players employees and other stakeholders in football regarding drugs was more important than the major financial considerations to the company any player
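
The visible part of the first article is identical before and after this step, which suggests the text had already been stripped of punctuation before it reached this notebook. Tokens such as `readwe` and `twoyear` hint at a side effect of deleting characters outright: the words on either side get glued together. A small sketch on a made-up sentence comparing deletion with replacing the match by a space (the second variant is an alternative, not what the notebook does):

```python
import re

# Same character class as in the cell above: anything that is not a letter, digit or space.
pattern = re.compile(r'[^A-Za-z0-9 ]+')

sample = "a statement read:we want a zero-tolerance policy"   # made-up text

# Variant 1 (what the notebook does): delete the matched characters.
glued = pattern.sub('', sample)

# Variant 2 (an alternative): replace them with a space, then collapse repeated spaces.
spaced = " ".join(pattern.sub(' ', sample).split())

print(glued)    # a statement readwe want a zerotolerance policy
print(spaced)   # a statement read we want a zero tolerance policy
```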

In [18]:
# drop stop words

# get NLTK stopwords
from nltk.corpus import stopwords

# download the stopwords corpus from nltk
nltk.download('stopwords')
 
# load the list of stop words
stop_words = stopwords.words('english')

# for word in x.split(" ") => Split the article text by space to words
# for each word check if it is not in the list of stop words
# glue/join all non stop_word words back together using spaces
no_nulls_df["description"] = no_nulls_df["description"].map(lambda x: " ".join([word for word in x.split(" ") if word not in stop_words]))

# check for change
no_nulls_df['description'][0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'chelsea sack mutu  chelsea sacked adrian mutu failed drugs test  yearold tested positive banned substance  later denied cocaine  october chelsea decided write possible transfer fee mutu signing parma last season may face twoyear suspension statement chelsea explaining decision readwe want make clear chelsea zero tolerance policy towards drugs mutu scored six goals first five games arriving stamford bridge form went decline frozen coach jose mourinho chelseas statement added applies performanceenhancing drugs socalled recreational drugs place club sport coming decision case chelsea believed clubs social responsibility fans players employees stakeholders football regarding drugs important major financial considerations company player takes drugs breaches contract club well football association rules club totally supports fa strong action drugs cases fifas disciplinary code stipulates first doping offence followed sixmonth ban sports world governing body reiterated stance mutus failed dr
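
Two small refinements are possible here. Looking words up in a `set` is faster than scanning a list, and `str.split()` without an argument splits on any run of whitespace, so the double spaces visible in the output above would not survive. A sketch with a hand-picked stop-word set (an assumption for illustration; the notebook uses `stopwords.words('english')`):

```python
# Small hand-picked stop-word set for illustration only;
# the notebook itself uses the NLTK English stop-word list.
demo_stop_words = {'have', 'after', 'he', 'a', 'the', 'for', 'which', 'was'}

def drop_stop_words(text, stop_words):
    # split() with no argument ignores repeated spaces,
    # so no empty tokens are produced.
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "chelsea have sacked adrian mutu  after he failed a drugs test"
print(drop_stop_words(sample, demo_stop_words))
```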

In [19]:
# generate the part of speech tagging (PoS)

from nltk import word_tokenize, pos_tag

# download the tokenizer, tagger and tagset resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# You can filter words based on their PoS tag (e.g. keep only verbs or adjectives),
# but we do not use that here, so we only tag the first article to see what the output looks like

tokens = word_tokenize(no_nulls_df['description'][0])
tagged = pos_tag(tokens, tagset='universal')
tagged

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Vishal\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('chelsea', 'NOUN'),
 ('sack', 'NOUN'),
 ('mutu', 'NOUN'),
 ('chelsea', 'NOUN'),
 ('sacked', 'VERB'),
 ('adrian', 'ADJ'),
 ('mutu', 'NOUN'),
 ('failed', 'VERB'),
 ('drugs', 'NOUN'),
 ('test', 'VERB'),
 ('yearold', 'ADV'),
 ('tested', 'VERB'),
 ('positive', 'ADJ'),
 ('banned', 'VERB'),
 ('substance', 'NOUN'),
 ('later', 'ADV'),
 ('denied', 'VERB'),
 ('cocaine', 'ADJ'),
 ('october', 'NOUN'),
 ('chelsea', 'NOUN'),
 ('decided', 'VERB'),
 ('write', 'ADV'),
 ('possible', 'ADJ'),
 ('transfer', 'NOUN'),
 ('fee', 'NOUN'),
 ('mutu', 'NOUN'),
 ('signing', 'VERB'),
 ('parma', 'NOUN'),
 ('last', 'ADJ'),
 ('season', 'NOUN'),
 ('may', 'VERB'),
 ('face', 'VERB'),
 ('twoyear', 'ADJ'),
 ('suspension', 'NOUN'),
 ('statement', 'NOUN'),
 ('chelsea', 'NOUN'),
 ('explaining', 'VERB'),
 ('decision', 'NOUN'),
 ('readwe', 'NOUN'),
 ('want', 'VERB'),
 ('make', 'NOUN'),
 ('clear', 'ADJ'),
 ('chelsea', 'NOUN'),
 ('zero', 'NUM'),
 ('tolerance', 'NOUN'),
 ('policy', 'NOUN'),
 ('towards', 'NOUN'),
 ('drugs', 'NOUN')
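
Although the notebook does not filter on part of speech, the idea is simple once the `(word, tag)` pairs exist. A minimal sketch using a few pairs copied from the output above; `keep_tags` is a hypothetical helper name, not an NLTK function:

```python
# A few (word, tag) pairs copied from the tagger output above.
tagged_sample = [
    ('chelsea', 'NOUN'), ('sacked', 'VERB'), ('adrian', 'ADJ'),
    ('mutu', 'NOUN'), ('failed', 'VERB'), ('drugs', 'NOUN'),
]

def keep_tags(tagged_tokens, wanted_tags):
    # Keep only the words whose universal PoS tag is in wanted_tags.
    return [word for word, tag in tagged_tokens if tag in wanted_tags]

print(keep_tags(tagged_sample, {'NOUN'}))          # nouns only
print(keep_tags(tagged_sample, {'VERB', 'ADJ'}))   # verbs and adjectives
```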

In [20]:
# stem words to reduce inflected forms (tenses, plurals) to a common root

from nltk.stem import PorterStemmer

porter = PorterStemmer()
no_nulls_df['description'] = no_nulls_df['description'].map(lambda x: " ".join([porter.stem(word) for word in x.split(" ")]))

# check for change
no_nulls_df['description'][0]

'chelsea sack mutu  chelsea sack adrian mutu fail drug test  yearold test posit ban substanc  later deni cocain  octob chelsea decid write possibl transfer fee mutu sign parma last season may face twoyear suspens statement chelsea explain decis readw want make clear chelsea zero toler polici toward drug mutu score six goal first five game arriv stamford bridg form went declin frozen coach jose mourinho chelsea statement ad appli performanceenhanc drug socal recreat drug place club sport come decis case chelsea believ club social respons fan player employe stakehold footbal regard drug import major financi consider compani player take drug breach contract club well footbal associ rule club total support fa strong action drug case fifa disciplinari code stipul first dope offenc follow sixmonth ban sport world govern bodi reiter stanc mutu fail drug test maintain matter domest sport author fifa posit make comment matter english fa inform us disciplinari decis relev inform associ said fifa

In [21]:
# remove words shorter than 3 characters

no_nulls_df['description'] = no_nulls_df['description'].map(lambda x: " ".join([word for word in x.split(" ") if len(word)>2]))

#check for change
no_nulls_df["description"][0]

'chelsea sack mutu chelsea sack adrian mutu fail drug test yearold test posit ban substanc later deni cocain octob chelsea decid write possibl transfer fee mutu sign parma last season may face twoyear suspens statement chelsea explain decis readw want make clear chelsea zero toler polici toward drug mutu score six goal first five game arriv stamford bridg form went declin frozen coach jose mourinho chelsea statement appli performanceenhanc drug socal recreat drug place club sport come decis case chelsea believ club social respons fan player employe stakehold footbal regard drug import major financi consider compani player take drug breach contract club well footbal associ rule club total support strong action drug case fifa disciplinari code stipul first dope offenc follow sixmonth ban sport world govern bodi reiter stanc mutu fail drug test maintain matter domest sport author fifa posit make comment matter english inform disciplinari decis relev inform associ said fifa spokesman chels

### Generating features matrix
* 4 approaches
    * approach 1: creating a count vectorizer for the text column in the pd dataframe
    * approach 2: word level tf-idf
    * approach 3: Ngram level tf-idf (2 and 3 words per column)
    * approach 4: character level tf-idf (3, 4 and 5 characters per column)

In [22]:
# approach 1: count vectorizer for text column

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_array = count_vect.fit_transform(no_nulls_df['description'])

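# make count_array into a df so it's easier to manage
# note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out() here and in the cells below
count_df = pd.DataFrame(data=count_array.todense(), columns=count_vect.get_feature_names())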
count_df

Unnamed: 0,aaa,aac,aadc,aaliyah,aaltra,aamir,aara,aarhu,aaron,aashar,...,zoom,zooropa,zornotza,zorro,zubair,zuluaga,zurich,zuton,zvonareva,zvyagintsev
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2406,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2407,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2408,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# approach 2: word level tf-idf

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
word_tfidf_array = tfidf_vect.fit_transform(no_nulls_df['description'])

# convert array to df
word_tfidf_df = pd.DataFrame(data=word_tfidf_array.todense(), columns=tfidf_vect.get_feature_names())
word_tfidf_df

Unnamed: 0,aaa,abandon,abba,abbott,abc,abil,abl,abn,abolish,abort,...,yugansk,yuganskneftega,yuko,yushchenko,zealand,zen,zero,zombi,zone,zurich
0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.052317,0.0,0.0,0.000000
1,0.067583,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
3,0.000000,0.0,0.0,0.0,0.0,0.0,0.038881,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
4,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.055853
2406,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
2407,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
2408,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000
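
To make the numbers in this table less opaque, here is a worked TF-IDF computation in plain Python on a toy corpus. It assumes the `TfidfVectorizer` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the three documents are made up:

```python
import math
from collections import Counter

docs = ["chelsea sack mutu", "chelsea win", "film award"]   # toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]     # L2-normalised row

rows = [tfidf_row(tokens) for tokens in tokenized]
for word, weight in zip(vocab, rows[0]):
    print(f"{word:8s} {weight:.4f}")
```

In the first document, `chelsea` ends up with a lower weight than `mutu` or `sack` because it also occurs in the second document.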


In [24]:
# approach 3: Ngram level tfidf (2 and 3 words per column)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',ngram_range=(2,3), max_features=5000)
ngram_tfidf_array = tfidf_vect_ngram.fit_transform(no_nulls_df["description"])

# convert array to df (stored under a new name so the sparse array is not overwritten)
ngram_tfidf_df = pd.DataFrame(data=ngram_tfidf_array.todense(), columns=tfidf_vect_ngram.get_feature_names())
ngram_tfidf_df

Unnamed: 0,aaa titl,abl access,abl get,abl make,abl play,abl see,abl take,abl watch,abn amro,academi award,...,york marathon,york time,young man,young peopl,young player,youv got,yugansk sold,yuko claim,yuko file,zurich premiership
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.143813,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.136352,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2406,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2407,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2408,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
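
Column names such as `young player` or `zurich premiership` are word n-grams. A minimal sketch of how 2- and 3-word n-grams can be generated from one stemmed text; `word_ngrams` is a hypothetical helper, not the scikit-learn implementation:

```python
def word_ngrams(text, min_n, max_n):
    # Slide a window of n tokens over the text for every n in [min_n, max_n].
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("young player score goal", 2, 3))
```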


In [25]:
# approach 4: character level tf-idf (n-grams of 3, 4 and 5 characters)

from sklearn.feature_extraction.text import TfidfVectorizer

# token_pattern is only used when analyzer='word', so it is omitted here
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(3,5), max_features=5000)
char_tfidf_array = tfidf_vect_ngram_chars.fit_transform(no_nulls_df["description"])

# convert the resulting tf-idf array to a pandas dataframe so it's easier to manage
char_tfidf_df = pd.DataFrame(data=char_tfidf_array.todense(), columns=tfidf_vect_ngram_chars.get_feature_names())
char_tfidf_df

Unnamed: 0,ab,ac,acc,acce,acco,act,acti,ad,add,adv,...,year,yer,yer.1,yon,you,yst,yst.1,yste,ystem,yth
0,0.000000,0.014814,0.000000,0.00000,0.000000,0.022641,0.014451,0.009784,0.000000,0.000000,...,0.007892,0.038114,0.038415,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
1,0.000000,0.006825,0.000000,0.00000,0.000000,0.000000,0.000000,0.009016,0.000000,0.013591,...,0.007272,0.000000,0.000000,0.000000,0.014394,0.000000,0.000000,0.0,0.0,0.015412
2,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.025470,0.025672,0.031286,0.000000,0.000000,0.000000,0.0,0.0,0.000000
3,0.012032,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.020235,0.016595,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.016153,0.000000,0.000000,0.0,0.0,0.000000
4,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.044765,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.027676,0.000000,0.0,0.0,0.038261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0.000000,0.019800,0.026253,0.01997,0.016790,0.000000,0.000000,0.013078,0.000000,0.000000,...,0.010549,0.000000,0.000000,0.000000,0.000000,0.016171,0.021429,0.0,0.0,0.000000
2406,0.012659,0.048349,0.010684,0.00000,0.013666,0.036948,0.031442,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
2407,0.000000,0.022936,0.000000,0.00000,0.000000,0.017528,0.022374,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.056196,0.074468,0.0,0.0,0.000000
2408,0.000000,0.033059,0.000000,0.00000,0.000000,0.016842,0.021499,0.000000,0.000000,0.000000,...,0.023484,0.000000,0.000000,0.000000,0.000000,0.035999,0.047703,0.0,0.0,0.000000
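
With a character analyzer the window slides over the whole string, spaces included, so an n-gram can start or end with a space or span two words. That is a likely reason some column names above look shorter than three characters. A simplified sketch (`char_ngrams` is a hypothetical helper that only approximates what the vectorizer does):

```python
def char_ngrams(text, min_n, max_n):
    # Collapse runs of whitespace to a single space, then slide a window
    # of n characters over the whole string, spaces included.
    text = " ".join(text.split())
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams

print(char_ngrams("fa cup", 3, 4))
```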


In [26]:
# add the article type to the approach 2 (word level tf-idf) dataframe

word_tfidf_df['article_type'] = no_nulls_df['article_type']
word_tfidf_df

Unnamed: 0,aaa,abandon,abba,abbott,abc,abil,abl,abn,abolish,abort,...,yuganskneftega,yuko,yushchenko,zealand,zen,zero,zombi,zone,zurich,article_type
0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.052317,0.0,0.0,0.000000,sports
1,0.067583,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,sports
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,sports
3,0.000000,0.0,0.0,0.0,0.0,0.0,0.038881,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,sports
4,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,sports
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.055853,business
2406,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,Unknown
2407,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,business
2408,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,Unknown


#### Now we have a matrix where rows represent articles and columns represent words (features), plus a last column that holds the article type.

#### Topic Modelling

- We can use topic modelling to group similar articles together. If an "Unknown" article falls into the same group as "sports" articles, we predict that it is a "sports" article.

In [27]:
# training NMF model

from sklearn.decomposition import NMF

# n_components is the number of topics; we use 10 rather than 5 so that a broad class can split into several narrower topics

nmf_model = NMF(n_components=10, init='random', random_state=0)
nmf_doc2topic_array = nmf_model.fit_transform(word_tfidf_array)
nmf_topic2words_array = nmf_model.components_
nmf_vocab = tfidf_vect.get_feature_names()
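
NMF approximates the non-negative document-term matrix `V` by a product `W @ H`, where `W` plays the role of `nmf_doc2topic_array` and `H` the role of `nmf_topic2words_array`. The sketch below illustrates that factorisation with the classic multiplicative update rules on a random toy matrix. It is not how the cell above computes it (scikit-learn's `NMF` uses a coordinate descent solver by default), and all sizes and names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 5))            # toy non-negative "document x term" matrix
n_topics = 2
W = rng.random((6, n_topics))     # document x topic weights
H = rng.random((n_topics, 5))     # topic x term weights
eps = 1e-10                       # avoids division by zero

def reconstruction_error(V, W, H):
    # Frobenius norm of the difference between V and its approximation.
    return np.linalg.norm(V - W @ H)

error_before = reconstruction_error(V, W, H)
for _ in range(200):
    # Multiplicative updates keep W and H non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
error_after = reconstruction_error(V, W, H)

print(error_before, error_after)
```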

In [28]:
# convert array to df 
nmf_doc2topic_df = pd.DataFrame(data=nmf_doc2topic_array, columns=[f"Topic {i}" for i in range(nmf_doc2topic_array.shape[1])])
nmf_doc2topic_df

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
0,0.000000,0.000000,0.000000,0.000000,0.106366,0.000000,0.037619,0.000000,0.000000,0.000000
1,0.003382,0.007838,0.005461,0.006720,0.016234,0.013914,0.005781,0.151411,0.006773,0.013271
2,0.030451,0.000000,0.000000,0.000000,0.110409,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.013592,0.002242,0.000000,0.000300,0.009616,0.000000,0.000000,0.236395,0.000000,0.000000
4,0.000000,0.001511,0.000000,0.000000,0.091734,0.000000,0.002909,0.014297,0.005061,0.000000
...,...,...,...,...,...,...,...,...,...,...
2405,0.005792,0.000000,0.000000,0.001517,0.015617,0.000000,0.045743,0.000000,0.000000,0.034390
2406,0.001791,0.004165,0.000000,0.001151,0.000000,0.000000,0.085570,0.000000,0.006420,0.030035
2407,0.003090,0.009919,0.016068,0.019590,0.003827,0.000264,0.057154,0.000000,0.000000,0.044921
2408,0.000164,0.000000,0.003181,0.004234,0.000000,0.002476,0.009526,0.000000,0.000000,0.091161


In [29]:
# add article_type to df to figure out what each topic represents

nmf_doc2topic_df['article_type'] = no_nulls_df['article_type']

nmf_doc2topic_df

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,article_type
0,0.000000,0.000000,0.000000,0.000000,0.106366,0.000000,0.037619,0.000000,0.000000,0.000000,sports
1,0.003382,0.007838,0.005461,0.006720,0.016234,0.013914,0.005781,0.151411,0.006773,0.013271,sports
2,0.030451,0.000000,0.000000,0.000000,0.110409,0.000000,0.000000,0.000000,0.000000,0.000000,sports
3,0.013592,0.002242,0.000000,0.000300,0.009616,0.000000,0.000000,0.236395,0.000000,0.000000,sports
4,0.000000,0.001511,0.000000,0.000000,0.091734,0.000000,0.002909,0.014297,0.005061,0.000000,sports
...,...,...,...,...,...,...,...,...,...,...,...
2405,0.005792,0.000000,0.000000,0.001517,0.015617,0.000000,0.045743,0.000000,0.000000,0.034390,business
2406,0.001791,0.004165,0.000000,0.001151,0.000000,0.000000,0.085570,0.000000,0.006420,0.030035,Unknown
2407,0.003090,0.009919,0.016068,0.019590,0.003827,0.000264,0.057154,0.000000,0.000000,0.044921,business
2408,0.000164,0.000000,0.003181,0.004234,0.000000,0.002476,0.009526,0.000000,0.000000,0.091161,Unknown


In [30]:
# get top predicted value for each article

import numpy

# numpy.argsort(document)[::-1][0]:
# numpy.argsort() returns the column indices sorted by ascending value
# [::-1] reverses the order so that the topic with the highest value comes first
# [0] selects that first (top) topic
nmf_doc2topic_df['predicted_topic'] = [numpy.argsort(document)[::-1][0] for document in nmf_doc2topic_array]

nmf_doc2topic_df

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,article_type,predicted_topic
0,0.000000,0.000000,0.000000,0.000000,0.106366,0.000000,0.037619,0.000000,0.000000,0.000000,sports,4
1,0.003382,0.007838,0.005461,0.006720,0.016234,0.013914,0.005781,0.151411,0.006773,0.013271,sports,7
2,0.030451,0.000000,0.000000,0.000000,0.110409,0.000000,0.000000,0.000000,0.000000,0.000000,sports,4
3,0.013592,0.002242,0.000000,0.000300,0.009616,0.000000,0.000000,0.236395,0.000000,0.000000,sports,7
4,0.000000,0.001511,0.000000,0.000000,0.091734,0.000000,0.002909,0.014297,0.005061,0.000000,sports,4
...,...,...,...,...,...,...,...,...,...,...,...,...
2405,0.005792,0.000000,0.000000,0.001517,0.015617,0.000000,0.045743,0.000000,0.000000,0.034390,business,6
2406,0.001791,0.004165,0.000000,0.001151,0.000000,0.000000,0.085570,0.000000,0.006420,0.030035,Unknown,6
2407,0.003090,0.009919,0.016068,0.019590,0.003827,0.000264,0.057154,0.000000,0.000000,0.044921,business,6
2408,0.000164,0.000000,0.003181,0.004234,0.000000,0.002476,0.009526,0.000000,0.000000,0.091161,Unknown,9
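
Picking the highest-scoring topic per row is what `argmax` does directly. A short sketch comparing both forms on three rows copied from the table above (articles 0, 1 and 2408); when two topics tie for the maximum the two forms can pick different columns, which does not happen in these rows:

```python
import numpy as np

# Three rows copied from the document-topic table above.
doc2topic = np.array([
    [0.000000, 0.000000, 0.000000, 0.000000, 0.106366, 0.000000, 0.037619, 0.000000, 0.000000, 0.000000],
    [0.003382, 0.007838, 0.005461, 0.006720, 0.016234, 0.013914, 0.005781, 0.151411, 0.006773, 0.013271],
    [0.000164, 0.000000, 0.003181, 0.004234, 0.000000, 0.002476, 0.009526, 0.000000, 0.000000, 0.091161],
])

# Form used in the notebook: sort, reverse, take the first index.
via_argsort = [int(np.argsort(row)[::-1][0]) for row in doc2topic]

# Equivalent here: index of the maximum along each row.
via_argmax = doc2topic.argmax(axis=1).tolist()

print(via_argsort)
print(via_argmax)
```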


In [31]:
# Convert the Words per topic array to a dataframe for better manipulation 
nmf_topic2words_df = pd.DataFrame(data=nmf_topic2words_array,columns=nmf_vocab)
nmf_topic2words_df

Unnamed: 0,aaa,abandon,abba,abbott,abc,abil,abl,abn,abolish,abort,...,yugansk,yuganskneftega,yuko,yushchenko,zealand,zen,zero,zombi,zone,zurich
0,0.0,0.007105,0.0,0.0,0.0,0.021501,0.045838,0.0,0.0,0.0,...,0.0,0.0,0.0,0.001008,0.0,0.0,9.9e-05,0.012305,0.000994,0.0
1,0.0,0.00904,0.0,0.000179,0.0,0.01066,0.032863,0.0,0.034425,0.010513,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000823,0.0,0.0,0.0
2,0.002506,0.0,0.007469,0.0,0.001628,0.021601,0.087596,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.034613,0.001228,0.004578,6.2e-05,0.00146
3,0.0,0.0,0.0,0.000541,0.0,0.052508,0.052843,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000197,0.009946,0.007815,0.0
4,0.0,0.009316,0.0,0.0,0.0,0.0,0.010407,0.000192,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.010676,0.0,0.011475,0.003176
5,0.0,0.005682,0.015346,0.005141,0.022827,0.0,0.012639,0.0,0.0,0.020428,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002465,0.0006,0.0
6,0.000225,0.013987,0.002862,0.001226,0.014625,0.016351,0.034955,0.001365,0.00085,0.004667,...,0.141339,0.080418,0.45019,0.021944,0.0,0.0,0.002509,0.0,0.0,0.0
7,0.032615,0.001559,0.002559,0.010814,0.0,0.007678,0.025477,0.000881,0.0,0.000379,...,0.0,0.0,0.0,0.0,0.102697,0.0,0.005416,0.0,0.003918,0.016771
8,0.0,0.001275,0.020256,0.0,0.0,0.014836,0.0,0.0,0.0,0.005115,...,0.0,0.0,0.0,0.008293,0.002199,0.0,0.001811,0.0,0.005512,0.0
9,0.0,0.005393,0.000781,0.0,0.0,0.000224,0.0,0.026659,0.0,0.0,...,0.0,0.0,0.0,0.005437,0.0,0.0,0.006041,0.000287,0.016779,0.001777


In [32]:
# View the 10 highest-weighted words for each NMF topic

import numpy

n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(nmf_topic2words_array):
    topic_words = numpy.array(nmf_vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    
topic_summaries

['game nintendo consol soni gamer psp handheld video xbox releas',
 'parti tori tax elect labour howard lib dem conserv would',
 'mobil phone music technolog digit servic peopl use camera broadband',
 'user search softwar microsoft site program net comput email use',
 'club chelsea arsen leagu liverpool unit mourinho play player goal',
 'film award best star oscar nomin actor festiv actress director',
 'said law yuko court lord govern compani case firm rule',
 'england win wale ireland play match franc rugbi world coach',
 'brown blair labour chancellor minist prime elect said gordon toni',
 'growth economi rate bank price econom year rise market said']
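
The slice `[:-(n_top_words+1):-1]` used above is compact but easy to misread. A minimal sketch of the same idea on toy data may help; the vocabulary, the weights, and the helper name `top_words` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy data: one topic's weights over a five-word vocabulary
toy_vocab = np.array(["bank", "film", "goal", "phone", "vote"])
toy_topic = np.array([0.10, 0.70, 0.05, 0.40, 0.20])

def top_words(topic_dist, vocab, n_top_words):
    """Return the n_top_words highest-weighted words, largest weight first."""
    # argsort gives the indices ordered from the smallest to the largest weight
    ascending = np.argsort(topic_dist)
    # Walk backwards from the end of the sorted words and stop after n_top_words items
    selected = np.asarray(vocab)[ascending][:-(n_top_words + 1):-1]
    return [str(word) for word in selected]

toy_top = top_words(toy_topic, toy_vocab, 3)
print(toy_top)
```

The same helper could be called once per row of `nmf_topic2words_array` instead of repeating the loop for each model.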

#### Assign one of sports, entertainment, business, technology, politics to each of the 10 topics based on the most common words shown previously
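
One possible way to record the assignment is a plain dictionary from topic index to label. The labels below are only one reading of the top words printed above (topic 6 in particular mixes legal and company vocabulary), and the names `nmf_topic_labels` and `sample_predicted_topics` are hypothetical.

```python
import pandas as pd

# Hypothetical labels, chosen by reading the top words of each NMF topic;
# this is an interpretation, not a result produced by the model
nmf_topic_labels = {
    0: "technology",     # game nintendo consol soni gamer ...
    1: "politics",       # parti tori tax elect labour ...
    2: "technology",     # mobil phone music technolog digit ...
    3: "technology",     # user search softwar microsoft ...
    4: "sports",         # club chelsea arsen leagu liverpool ...
    5: "entertainment",  # film award best star oscar ...
    6: "business",       # said law yuko court ... (ambiguous)
    7: "sports",         # england win wale ireland rugbi ...
    8: "politics",       # brown blair labour chancellor ...
    9: "business",       # growth economi rate bank price ...
}

# Sample predicted topic indices, used here only to demonstrate the lookup
sample_predicted_topics = pd.Series([4, 6, 6, 6, 9])
sample_predicted_labels = sample_predicted_topics.map(nmf_topic_labels)
print(sample_predicted_labels.tolist())
```

Because several topics share a label, ten topics collapse into the five categories, which makes the predictions directly comparable with the dataset's class labels.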

In [33]:
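# Visualize the NMF topics
# Note: recent pyLDAvis releases (3.4 and later) expose this module as
# pyLDAvis.lda_model instead of pyLDAvis.sklearn
!pip install --user pyLDAvis
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(nmf_model, word_tfidf_array, tfidf_vect)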





#### Train the LDA model

In [34]:
# LDA models word counts, so train it on the CountVectorizer output rather than the TF-IDF matrix

from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=10, learning_method='online', max_iter=20)
lda_doc2topic_array = lda_model.fit_transform(count_array)
lda_topic2word_array = lda_model.components_
# On scikit-learn 1.0+ use count_vect.get_feature_names_out(); get_feature_names() was removed in 1.2
lda_vocab = count_vect.get_feature_names()
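
To make the shapes involved concrete, here is a minimal sketch of the same pipeline on a toy corpus. The four documents and every `toy_*` name are invented for illustration. It also sets `random_state`, which the cell above does not, so the toy run is repeatable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the preprocessed articles
toy_docs = [
    "goal match player goal league",
    "match player coach league win",
    "bank market growth price rate",
    "market bank economy rate rise",
]

# Raw term counts, the input LDA is designed for
toy_count_vect = CountVectorizer()
toy_counts = toy_count_vect.fit_transform(toy_docs)

# Two topics for the toy corpus; random_state fixes the otherwise random initialisation
toy_lda = LatentDirichletAllocation(
    n_components=2, learning_method="online", max_iter=20, random_state=0
)
toy_doc2topic = toy_lda.fit_transform(toy_counts)   # shape: (documents, topics)
toy_topic2word = toy_lda.components_                # shape: (topics, vocabulary size)

print(toy_doc2topic.shape, toy_topic2word.shape)
print(toy_doc2topic.sum(axis=1))  # each row is a topic distribution
```

Each row of the document-topic array sums to 1, which is why the values in the tables below can be read as per-document topic proportions.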




In [35]:
# Convert the document-topic array to a DataFrame with one column per topic

lda_doc2topic_df = pd.DataFrame(data=lda_doc2topic_array, columns=[f"Topic {i}" for i in range(10)])
lda_doc2topic_df



Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
0,0.105517,0.000426,0.372493,0.195232,0.000426,0.230508,0.094122,0.000426,0.000426,0.000426
1,0.000309,0.000309,0.173538,0.823993,0.000309,0.000309,0.000309,0.000309,0.000309,0.000309
2,0.346992,0.000800,0.393273,0.196843,0.000800,0.000800,0.000801,0.000800,0.000800,0.058091
3,0.028159,0.000441,0.422725,0.429882,0.000441,0.000441,0.000441,0.116591,0.000441,0.000441
4,0.317606,0.001031,0.566880,0.096552,0.001031,0.001031,0.001031,0.001031,0.012776,0.001031
...,...,...,...,...,...,...,...,...,...,...
2405,0.000505,0.000505,0.030152,0.000505,0.000505,0.821871,0.038398,0.000505,0.106548,0.000505
2406,0.000439,0.000439,0.000439,0.000439,0.000439,0.827656,0.000439,0.155018,0.014255,0.000439
2407,0.033336,0.000676,0.000676,0.099330,0.019082,0.712460,0.021868,0.000676,0.021619,0.090277
2408,0.000658,0.000658,0.000658,0.000658,0.000658,0.844621,0.116737,0.000658,0.034037,0.000658


In [36]:
# Get the top predicted topic for each document

import numpy

# numpy.argsort(document)[::-1][0]:
# numpy.argsort() returns the column indices ordered from the lowest to the highest value
# [::-1] reverses that order so the topic with the highest value comes first
# [0] selects that first (most probable) topic
lda_doc2topic_df["predicted_topic"] = [numpy.argsort(document)[::-1][0] for document in lda_doc2topic_array]

lda_doc2topic_df



Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,predicted_topic
0,0.105517,0.000426,0.372493,0.195232,0.000426,0.230508,0.094122,0.000426,0.000426,0.000426,2
1,0.000309,0.000309,0.173538,0.823993,0.000309,0.000309,0.000309,0.000309,0.000309,0.000309,3
2,0.346992,0.000800,0.393273,0.196843,0.000800,0.000800,0.000801,0.000800,0.000800,0.058091,2
3,0.028159,0.000441,0.422725,0.429882,0.000441,0.000441,0.000441,0.116591,0.000441,0.000441,3
4,0.317606,0.001031,0.566880,0.096552,0.001031,0.001031,0.001031,0.001031,0.012776,0.001031,2
...,...,...,...,...,...,...,...,...,...,...,...
2405,0.000505,0.000505,0.030152,0.000505,0.000505,0.821871,0.038398,0.000505,0.106548,0.000505,5
2406,0.000439,0.000439,0.000439,0.000439,0.000439,0.827656,0.000439,0.155018,0.014255,0.000439,5
2407,0.033336,0.000676,0.000676,0.099330,0.019082,0.712460,0.021868,0.000676,0.021619,0.090277,5
2408,0.000658,0.000658,0.000658,0.000658,0.000658,0.844621,0.116737,0.000658,0.034037,0.000658,5
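
The argsort-reverse-first pattern above selects the column with the largest value, which is what `argmax` does directly. A short sketch using the first two rows of the table above as sample data (the variable names are hypothetical):

```python
import numpy as np

# First two rows of the LDA document-topic table shown above
sample_doc2topic = np.array([
    [0.105517, 0.000426, 0.372493, 0.195232, 0.000426,
     0.230508, 0.094122, 0.000426, 0.000426, 0.000426],
    [0.000309, 0.000309, 0.173538, 0.823993, 0.000309,
     0.000309, 0.000309, 0.000309, 0.000309, 0.000309],
])

# The notebook's approach: sort, reverse, take the first index
via_argsort = [int(np.argsort(row)[::-1][0]) for row in sample_doc2topic]

# Vectorised alternative: index of the maximum in each row
via_argmax = sample_doc2topic.argmax(axis=1)

# The maximum itself shows how dominant the chosen topic is
top_topic_weight = sample_doc2topic.max(axis=1)

print(via_argsort, via_argmax.tolist(), top_topic_weight)
```

The two approaches agree whenever the maximum is unique. If two topics tie exactly for the top value they may pick different columns, since `argmax` returns the first of the tied indices.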


In [37]:
# view the top words per topic

import numpy

n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(lda_topic2word_array):
    topic_words = numpy.array(lda_vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    
    
topic_summaries



['game chelsea arsen goal unit minut leagu player side manag',
 'user softwar search site program microsoft email secur blog use',
 'said would labour parti elect say blair england minist new',
 'year said win play star first world last one game',
 'film award best music includ band nomin song actor album',
 'said year compani would firm also govern new could market',
 'bid deutsch takeov lse sharehold boers worldcom ebber stake light',
 'yuko oil russian ireland russia itali tri gazprom irish yugansk',
 'fiat drug franc marsh forsyth stade walmart yuan laport copi',
 'said peopl game use technolog mobil phone new music like']

In [38]:
# Visualize the results

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(lda_model, count_array, count_vect)

