For this project, we'll use the full text of about 12,000 articles (11,992 rows in npr.csv) to discover the topics they cover. Since we do not already know which topic each article belongs to, we'll apply Latent Dirichlet Allocation (LDA), an unsupervised learning technique.

We will load the data and use scikit-learn's CountVectorizer to count how often each vocabulary word occurs in each article.

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [3]:
npr.shape[0]

11992

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.9, 
                     min_df=2,
                     stop_words='english')
# max_df=0.9 drops words that appear in more than 90% of the documents
# min_df=2 drops extremely rare words (those found in fewer than 2 documents)

# create the document-term matrix (DTM) from the article text
dtm = cv.fit_transform(npr['Article'])
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
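
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a made-up four-document corpus. The names `toy_docs`, `toy_cv` and `toy_dtm` are illustrative and not part of the notebook above, and `stop_words` is left out so that the frequency filters alone decide what survives.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the two filters.
toy_docs = [
    "the economy and the budget",
    "the budget vote in congress",
    "the economy grows as congress debates",
    "the rare zebra",
]

# Same thresholds as above: drop terms found in more than 90% of documents
# and terms found in fewer than 2 documents.
toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index; sorting the keys
# gives the terms in column order.
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_dtm.toarray())
```

Here "the" is dropped because it occurs in every document, and words such as "zebra" are dropped because they occur in only one, leaving just the terms shared by exactly two documents.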

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

# fit LDA to learn the topics from the document-term matrix
# n_components is the number of topics, chosen by the user
LDA = LatentDirichletAllocation(n_components=7,
                               random_state=42)

LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [6]:
# get the vocabulary of words in articles
len(cv.get_feature_names())

54777

In [7]:
cv.get_feature_names()[7000]

'bridal'
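
The output above comes from an older scikit-learn release. `get_feature_names()` was deprecated in scikit-learn 1.0 and removed in 1.2 in favor of `get_feature_names_out()`, so on a recent installation the calls in this notebook would need adapting. A minimal sketch of a helper that works on both, where `feature_names` and `toy_vectorizer` are hypothetical names introduced here:

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return the vocabulary terms in column order on old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer releases: returns a NumPy array of strings.
        return list(vectorizer.get_feature_names_out())
    # Older releases (before 1.2): returns a plain list.
    return vectorizer.get_feature_names()


# Hypothetical two-document corpus, only to show the helper in use.
toy_vectorizer = CountVectorizer().fit(["banana apple", "apple cherry"])
toy_names = feature_names(toy_vectorizer)
print(toy_names)
```

Calling the helper once and indexing the resulting list also avoids rebuilding the vocabulary list on every lookup.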

In [8]:
# get the first individual topic
first_topic = LDA.components_[0]

# sort the words by their weight in this topic and keep the top 20
# (argsort is ascending, so the highest-weighted word is printed last)
first_top_20_words = first_topic.argsort()[-20:]
for i in first_top_20_words:
    print(cv.get_feature_names()[i])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says
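
Because `argsort()` sorts in ascending order, the slice `[-20:]` lists the strongest word last. The following sketch shows the same indexing on a small made-up weight vector, along with one way to reverse it so the strongest word comes first. `toy_weights` and `toy_words` are invented for illustration and are not values from the fitted model.

```python
import numpy as np

# Hypothetical topic weights and the words they belong to.
toy_weights = np.array([0.5, 3.0, 1.2, 8.0, 0.1])
toy_words = np.array(["tax", "health", "care", "says", "zebra"])

# Indices of the three largest weights, weakest of the three first.
top3_ascending = toy_weights.argsort()[-3:]

# Reverse the slice to list the strongest word first.
top3_descending = top3_ascending[::-1]

print(list(toy_words[top3_ascending]))
print(list(toy_words[top3_descending]))
```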


In [9]:
# look at top 15 words for each topic
for i, topic in enumerate(LDA.components_):
    print(f"Top 15 words for topic no. {i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
    print('\n')

Top 15 words for topic no. 0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


Top 15 words for topic no. 1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


Top 15 words for topic no. 2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


Top 15 words for topic no. 3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


Top 15 words for topic no. 4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


Top 15 words for topic no. 5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'peop

In [10]:
# transform returns one row of topic probabilities per article;
# label each article in the original dataframe with its most probable topic
pred_results = LDA.transform(dtm)
npr['Topic'] = pred_results.argmax(axis=1)
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
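
Taking `argmax` keeps only the single most probable topic and discards the rest of the distribution. As a rough sketch of what `transform` returns, the example below fits a two-topic model on a made-up corpus and keeps the winning probability alongside the label. All names here (`toy_docs`, `toy_lda`, `doc_topic`, `hard_labels`, `confidence`) are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two loosely separated themes.
toy_docs = [
    "health care insurance patients doctors",
    "insurance health patients hospital care",
    "election vote campaign voters party",
    "party campaign election voters vote",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic; each row sums to 1.
doc_topic = toy_lda.fit_transform(toy_dtm)

hard_labels = doc_topic.argmax(axis=1)  # most probable topic per document
confidence = doc_topic.max(axis=1)      # probability of that topic

print(np.round(doc_topic, 3))
print(hard_labels, np.round(confidence, 3))
```

A low winning probability can flag articles that mix several topics, which the hard label alone would hide.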


Based on the words with the highest weight in each topic, we can give each topic a generic description. For example, topic 0 above is related to healthcare and government, topic 1 to government and security, topic 2 to local news and food, and so on. We could lower the number of topics to keep them more generic, or increase it to make them more specific.

Unfortunately, since we have no ground-truth topic labels, we cannot directly measure how "good" the model is, but going through a few samples and their assigned topics suggests that the assignments are quite reasonable.
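
Short of labeled data, one rough, label-free signal that scikit-learn exposes is `perplexity`, where lower values indicate a better fit to the word counts. The sketch below compares two topic counts on a made-up corpus. The names `toy_docs`, `toy_dtm` and `perplexity_by_k` are hypothetical. In practice the score would ideally be computed on held-out articles, and it does not always agree with human judgment of topic quality.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the real notebook would use dtm instead.
toy_docs = [
    "health care insurance patients doctors",
    "insurance health patients hospital care",
    "election vote campaign voters party",
    "party campaign election voters vote",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

# Fit one model per candidate topic count and record its perplexity.
perplexity_by_k = {}
for k in (2, 3):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    perplexity_by_k[k] = model.perplexity(toy_dtm)

for k, score in perplexity_by_k.items():
    print(f"n_components={k}: perplexity={score:.2f}")
```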