In [2]:
import pandas as pd
import nltk
import gensim
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
from pprint import pprint
import string

In [3]:
news = pd.read_csv("news.csv")
news_titles = news.title
news_articles = news.content

The first step is to tokenize the news article titles into individual words.

In [4]:
titles_tokenized = []
for i in range (len(news_titles.index)):
    titles_tokenized.append(nltk.word_tokenize(news_titles.iloc[i]))

Next, I remove stop words from the news titles using the stopwords corpus from nltk. Through trial and error with the LDA model, I found that a number of additional words (and some punctuation tokens) also had to be removed, because they overpowered the topics and resulted in very weak topics. Namely, the words "new", "york", and "times" were removed, as the New York Times name appeared in many of the titles but did not add any value to the actual topics. Similarly, "breitbart" (another news outlet) was removed. The names "Donald", "Trump", "Hillary", "Bill", and "Clinton" were removed because they also overpowered the topics: if they are left in, almost every topic puts these names as its strongest indicators. At this stage I also remove any word of three characters or fewer (the code keeps only tokens with len(word) > 3), because these words do not contribute to topics and typically just pollute them with useless information.

In [5]:
stop_words = stopwords.words('english')
stop_words.append(":")
stop_words.append(",")
stop_words.append("-")
stop_words.append("$")
stop_words.append(".")
stop_words.append("?")
stop_words.append("trump")
stop_words.append("breitbart")
stop_words.append("donald")
stop_words.append("new")
stop_words.append("york")
stop_words.append("times")
stop_words.append("clinton")
stop_words.append("bill")
stop_words.append("hillary")
stop_words.append("says")
stop_words.append("evening")
stop_words.append("briefing")
stop_words.append("news")
titles_no_stopwords = []
for i in range(len(news_titles.index)):
    titles_no_stopwords.append([word.lower() for word in titles_tokenized[i] if word.lower() not in stop_words and len(word) > 3])
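
As a side note, the filtering rule in the cell above can be sketched more compactly. This is only a minimal illustration, not the code that produced the results: `base_stop_words` is a hypothetical stand-in for `stopwords.words('english')` so the snippet runs without NLTK data, and `filter_tokens` is a helper name I made up. It applies the same rule: lowercase, drop stop words, keep only tokens longer than three characters.

```python
# Hypothetical stand-in for stopwords.words('english'); far from complete.
base_stop_words = ["the", "a", "in", "for", "of", "to"]

# A few of the corpus-specific additions from the cell above.
custom_stop_words = ["trump", "donald", "new", "york", "times", "breitbart"]

# A set gives constant-time membership checks, unlike a list.
stop_set = set(base_stop_words) | set(custom_stop_words)


def filter_tokens(tokens, stop_set, min_len=4):
    """Lowercase tokens, then drop stop words and tokens shorter than min_len."""
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok not in stop_set and len(tok) >= min_len]


example = ["Border", "Patrol", "Union", ":", "Trump", "Plan", "for", "the", "New", "Wall"]
filtered = filter_tokens(example, stop_set)
print(filtered)
```

Punctuation tokens such as ":" fall out through the length check alone, so listing them as stop words is harmless but not strictly needed.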

The following iterates through each news article title and stems the words that were not removed during the stop words stage. I also experimented with a lemmatizer, but after trial and error and observing the resulting topics, stemming produced better results.

In [6]:
ps = PorterStemmer()
lm = WordNetLemmatizer()
titles_stemmed = []

for i in range(len(news_titles.index)):
    sentence = []
    for j in range(len(titles_no_stopwords[i])):
        sentence.append(ps.stem(titles_no_stopwords[i][j]))
    titles_stemmed.append(sentence)

The following creates a dictionary of the words in the corpus. Here I can filter words based on the extremes: words that appear in fewer than X documents, or in more than Y% of the documents. Through repeated experimentation and hyperparameter tuning, I decided to filter out words that appear in fewer than 140 documents or in more than 60% of the documents. Of the words that survive this filter, keep_n=15000 keeps only the 15,000 most frequent.

In [7]:
news_dictionary = gensim.corpora.Dictionary(titles_stemmed)
news_dictionary.filter_extremes(no_below=140, no_above=0.60, keep_n=15000)
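
To make the three parameters concrete, here is a rough pure-Python sketch of the filtering rule as I understand it from the gensim documentation. It is not gensim's implementation: `filter_extremes_sketch` is a hypothetical helper, the toy documents are made up, and the alphabetical tie-break is my own assumption.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: number of documents containing the token at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # no_above is a fraction of the corpus; convert it to an absolute count.
    max_docs = int(no_above * num_docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically,
    # which is an assumption of this sketch).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]


docs = [
    ["border", "immigr", "plan"],
    ["border", "attack", "plan"],
    ["border", "court", "plan"],
    ["border", "court", "rare"],
]
vocab = filter_extremes_sketch(docs, no_below=2, no_above=0.75, keep_n=2)
print(vocab)
```

In this toy corpus "border" is dropped for being too common (4 of 4 documents), while "immigr", "attack" and "rare" are dropped for being too rare (1 document each).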

The following creates a bag-of-words representation of each title using the dictionary, then trains the LDA model on that representation. I decided to create 8 topics. With more, the topics seem to spread too thin and overlap heavily with each other. With fewer, they seem very mixed and no single topic can be discerned in the data.
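
For readers unfamiliar with the bag-of-words format, the idea can be sketched in plain Python. This is an illustration under my own assumptions, not gensim code: `build_token2id` and `doc2bow_sketch` are hypothetical helpers, and the ids gensim assigns will generally differ. Each document becomes a sorted list of (token id, count) pairs, and tokens missing from the dictionary (for example, those removed by filtering) are skipped.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


docs = [["border", "plan", "border"], ["court", "case"]]
token2id = build_token2id(docs)
bow = doc2bow_sketch(["border", "border", "case", "unseen"], token2id)
print(bow)
```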

In [8]:
bow_news_corpus = [news_dictionary.doc2bow(doc) for doc in titles_stemmed]
lda_model = gensim.models.LdaMulticore(bow_news_corpus, num_topics=8, id2word=news_dictionary, passes=4, workers=4)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.330*"border" + 0.298*"immigr" + 0.222*"nation" + 0.021*"plan" + 0.019*"democrat" + 0.017*"call" + 0.012*"state" + 0.012*"russia" + 0.011*"presid" + 0.007*"leader"
Topic: 1 
Words: 0.424*"u.s." + 0.220*"russia" + 0.196*"polit" + 0.042*"syria" + 0.020*"obama" + 0.013*"state" + 0.012*"plan" + 0.010*"take" + 0.009*"women" + 0.009*"immigr"
Topic: 2 
Words: 0.257*"presid" + 0.188*"call" + 0.138*"back" + 0.115*"women" + 0.110*"china" + 0.085*"syria" + 0.025*"obama" + 0.012*"state" + 0.012*"hous" + 0.011*"republican"
Topic: 3 
Words: 0.236*"attack" + 0.209*"state" + 0.144*"first" + 0.135*"protest" + 0.119*"year" + 0.094*"leader" + 0.011*"polic" + 0.009*"media" + 0.006*"kill" + 0.005*"hous"
Topic: 4 
Words: 0.420*"report" + 0.236*"media" + 0.152*"take" + 0.087*"hous" + 0.014*"white" + 0.012*"presid" + 0.011*"offic" + 0.010*"women" + 0.007*"american" + 0.007*"china"
Topic: 5 
Words: 0.232*"white" + 0.217*"hous" + 0.137*"american" + 0.120*"court" + 0.101*"die" + 0.097*"case" + 

The first thing that is apparent is that all of the topics are dominated by political terminology. This indicates to me that the bulk of the articles in this dataset focus on politics and/or international affairs. Throughout the topics there are mentions of the American political parties and the present and former presidents, but also of China, Russia, Syria, immigration, and the "state". Although related words recur across the topics, there are still discernible differences.

The following are my predictions about the above topics.

Topic 0 is about Immigration, with "border" and "immigr" being the most important. "immigr" is the stemmed version of "immigration", "immigrate" or "immigrant", any of which indicates that these articles often talk about immigrants along with immigration policies.

Topic 1 is about International Relations and Conflict. Some of the words include "russia", "syria" as well as "u.s.", so the articles likely talk about America's relation to these countries and others.

Topic 2 is very hard to pin down. Through many iterations of the LDA model there always seems to be one topic that is very jumbled and hard to classify, and this is it. My best guess is that it is about Global Issues, as various countries are mentioned in the topic very often ("china", "syria"). "women" is also an important word here, so maybe many of the articles talk about feminism. Therefore, this topic seems to cover different kinds of issues all over the world.

Topic 3 is about Issues in America, as "protest" and "state" are mentioned often, possibly indicating that these articles talk about tension between the population and the government. "media" and "kill" are also mentioned here, maybe reflecting the media depiction of these events and some of the more serious events behind these issues.

Topic 4 is about Journalism, because "media" and "report" are the most important words here. The articles probably talk about the different issues with the media in the country, possibly also about propaganda and similar issues.

Topic 5 is about the American Administration. The words "white" and "hous" (the stem of "house") indicate frequent talk about the White House and the president, while "court" and "case" point to the judicial system and the other matters it deals with beyond the president himself. Many of these articles likely talk about prominent cases and the laws in America.

Topic 6 is about the Presidency. "obama" is mentioned here, as well as "democrat", "republican", and "polic". Note that the Porter stemmer produces "polic" from "police", whereas "policy" and "policies" both stem to "polici", so this word most likely refers to the police rather than to policies. The party names and "obama" all indicate entities commonly associated with the president of the U.S. I am sure that if I had not removed "trump" from the data, it would be the most important word in this topic. "plan" is also here, probably from articles mentioning the President's plans for different things.

Topic 7 is about Safety/Terrorism. "health" and "kill" are important in this topic, which suggests articles about protecting the health and safety of the American people and preventing attacks. Moreover, "syria" is also represented in this topic, which likely indicates that many of the articles talk about past terrorist attacks linked to that area of the world.


Most representative news title for each topic:

In [11]:
num_topics = 8
scores = [0 for _ in range(num_topics)]
best_titles = ["" for _ in range(num_topics)]

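for i in range(len(news_titles)):
    # lda_model[bow] returns sparse (topic_id, probability) pairs and omits
    # low-probability topics, so match on topic_id rather than list position
    for topic_id, score in lda_model[bow_news_corpus[i]]:
        if score > scores[topic_id]:
            scores[topic_id] = score
            best_titles[topic_id] = news_titles.iloc[i]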
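
One detail worth spelling out: `lda_model[bow]` returns a sparse list of (topic id, probability) pairs that leaves out topics below a probability threshold, so the pairs have to be matched by topic id rather than by their position in the list. The following is a minimal sketch of that selection logic on made-up data; `most_representative` is a hypothetical helper and `doc_topics` stands in for the per-document model outputs.

```python
def most_representative(doc_topics, num_topics):
    """For each topic, return (best_doc_index, best_probability)."""
    best = [(None, 0.0)] * num_topics
    for doc_idx, pairs in enumerate(doc_topics):
        # Each pair carries its own topic id; low-probability topics are absent.
        for topic_id, prob in pairs:
            if prob > best[topic_id][1]:
                best[topic_id] = (doc_idx, prob)
    return best


# Hypothetical sparse outputs for three documents and three topics.
doc_topics = [
    [(0, 0.90), (2, 0.10)],
    [(1, 0.55), (2, 0.45)],
    [(2, 0.97)],
]
best = most_representative(doc_topics, num_topics=3)
print(best)
```

A topic that never appears in any document's output keeps its `(None, 0.0)` placeholder, which makes an empty result easy to spot.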

In [12]:
best_titles

['Border Patrol Union: Trump’s Border Plan ’Gives Us the Tools We Need’',
 'Friday Mailbag: Politics, Politics, Politics - The New York Times',
 'China: Sean Spicer ’Not in a Position’ to Call South China Sea ’International Territory’ - Breitbart',
 'ESPN’s Sage Steele Under Attack for Criticizing Airport Protests After She Missed a Flight - Breitbart',
 'Report: DePaul University Banned ’Gay Lives Matter’ Poster for Gay Reporter’s Lecture on Radical Islam - Breitbart',
 'Case Study in Chaos: How Management Experts Grade a Trump White House - The New York Times',
 'Democrats Facing Elections Refusing to Hold Town Hall Meetings',
 'ISIS Claims Responsibility for Killing of French Police Officer - The New York Times']

Immigration -  'Border Patrol Union: Trump’s Border Plan ’Gives Us the Tools We Need’'

International Relations - 'Friday Mailbag: Politics, Politics, Politics - The New York Times'

Various Issues - 'China: Sean Spicer ’Not in a Position’ to Call South China Sea ’International Territory’ - Breitbart'

Issues in America - 'ESPN’s Sage Steele Under Attack for Criticizing Airport Protests After She Missed a Flight - Breitbart'

Journalism - 'Report: DePaul University Banned ’Gay Lives Matter’ Poster for Gay Reporter’s Lecture on Radical Islam - Breitbart'

American Administration - 'Case Study in Chaos: How Management Experts Grade a Trump White House - The New York Times'

Presidency - 'Democrats Facing Elections Refusing to Hold Town Hall Meetings'

Safety/Terrorism - 'ISIS Claims Responsibility for Killing of French Police Officer - The New York Times'

Now I repeat the same process with the content of the news articles.

In [None]:
articles_tokenized = []
for i in range (len(news_titles.index)):
    articles_tokenized.append(nltk.word_tokenize(news_articles.iloc[i]))

In [330]:
stop_words = stopwords.words('english')
stop_words.append(":")
stop_words.append(",")
stop_words.append("-")
stop_words.append("$")
stop_words.append(".")
stop_words.append("?")
stop_words.append("trump")
stop_words.append("breitbart")
stop_words.append("donald")
stop_words.append("clinton")
stop_words.append("bill")
stop_words.append("hillary")
stop_words.append("american")
stop_words.append("report")
stop_words.append("obama")

articles_no_stopwords = []
for i in range(len(news_titles.index)):
    articles_no_stopwords.append([word.lower() for word in articles_tokenized[i] if word.lower() not in stop_words and len(word) > 3])
                                
ps = PorterStemmer()
articles_stemmed = []
for i in range(len(news_titles.index)):
    sentence = []
    for j in range(len(articles_no_stopwords[i])):
        sentence.append(ps.stem(articles_no_stopwords[i][j]))
    articles_stemmed.append(sentence)
                                
news_dictionary = gensim.corpora.Dictionary(articles_stemmed)
news_dictionary.filter_extremes(no_below=100, no_above=0.25, keep_n=100000)

For the article contents, I decided to use fewer topics (only 5), because with more the topics seem too jumbled and I could not distinguish any real topics.

In [331]:
bow_news_corpus = [news_dictionary.doc2bow(doc) for doc in articles_stemmed]
lda_model = gensim.models.LdaMulticore(bow_news_corpus, num_topics=5, id2word=news_dictionary, passes=4, workers=4)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.005*"media" + 0.005*"vote" + 0.004*"republican" + 0.004*"student" + 0.004*"parti" + 0.004*"protest" + 0.004*"voter" + 0.004*"democrat" + 0.003*"women" + 0.003*"school"
Topic: 1 
Words: 0.004*"game" + 0.004*"play" + 0.003*"team" + 0.003*"compani" + 0.002*"life" + 0.002*"school" + 0.002*"women" + 0.002*"realli" + 0.002*"feel" + 0.002*"someth"
Topic: 2 
Words: 0.006*"democrat" + 0.004*"senat" + 0.004*"republican" + 0.004*"parti" + 0.004*"attack" + 0.003*"percent" + 0.003*"russia" + 0.003*"islam" + 0.003*"leader" + 0.003*"mrs."
Topic: 3 
Words: 0.010*"polic" + 0.006*"attack" + 0.005*"investig" + 0.005*"kill" + 0.003*"charg" + 0.003*"fire" + 0.003*"2017" + 0.003*"comey" + 0.003*"death" + 0.003*"russian"
Topic: 4 
Words: 0.006*"compani" + 0.005*"immigr" + 0.005*"republican" + 0.005*"court" + 0.005*"health" + 0.005*"feder" + 0.004*"percent" + 0.004*"china" + 0.004*"polici" + 0.004*"care"


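The preprocessing of the data for the news articles was similar to the process for the news titles, but some different words were excluded: "report", "american", and "obama" were added to the stop words, while title-specific words such as "new", "york", and "times" were no longer removed.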

Topic 0 is about the Elections. "vote", "voter", "republican" and "democrat" are all in this topic, as well as "student" and "protest", which can be indicative of the large volumes of protests during elections because of disagreements between the American people. Students are also commonly protesting during elections. 

Topic 1 is about Sports. We see "game", "play", and "team", so these articles are talking about the players who play in the different games and sports, and all of the teams in the different leagues.

Topic 2 is about the American Government. "democrat", "republican", "senate" and "party" are all present in this topic which clearly indicates the two parties of the American government and how they cooperate within the senate. 

Topic 3 is about Crime. The top words include "polic" (police), "attack", "investig" (investigation/investigating), "kill" and "charg" (charge). Almost every one of these words is used when a crime is reported in the news (though "kill" perhaps only for a murder). This is one of the best represented topics.

Topic 4 is hard to distinguish, but my best guess is that it is about American Policies. It mentions companies, immigration, courts, and health. All of these are common American issues, so maybe the topic is about policies established or being discussed by the American government.

Most representative news article for each topic

In [14]:
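num_topics = 5  # the article model above was trained with 5 topics, not 8
scores = [0 for _ in range(num_topics)]
best_articles = ["" for _ in range(num_topics)]

for i in range(len(news_titles)):
    # lda_model[bow] returns sparse (topic_id, probability) pairs and omits
    # low-probability topics, so match on topic_id rather than list position
    for topic_id, score in lda_model[bow_news_corpus[i]]:
        if score > scores[topic_id]:
            scores[topic_id] = score
            best_articles[topic_id] = news_articles.iloc[i]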

In [15]:
best_articles

['As President Donald J. Trump prepares to kick off his new border security plan, various news outlets have begun to criticize the effort by focusing on the border wall. However, members from the union representing the men and women from the U. S. Border Patrol stated that the proposal comes from listening to agents instead of politicians. [Various outlets have continued to question the notion of building a border wall and have focused on the perceived challenges of such an enterprise. Other outlets have criticized the effectiveness of the measure claiming that it does not address the current immigration crisis. The various news organizations have failed to mention the complete control that Mexican drug cartels have over human smuggling, narcotics trafficking, and other illicit activities along both sides of the border.  The executive orders that President Trump will be signing provides border security agents with the tools that they have been denied for too long, said Hector Garza, a 