<h1><center>The PoliBot: Keeping Us Current & Unbiased on American Politics</center></h1>
<img src="images/robot-voter-dreamstime-image.png" alt="Drawing" style="width: 400px;"/>
<figcaption><center>Photo Credit: Dreamstime</center></figcaption>

## 0. Summary

The PoliBot (@thePoliBot) is an NLP-based chatbot that recommends current political news articles based on interactive user input. For example, a user can give thePoliBot queries such as "Why did Donald Trump fire Comey?" or "Facebook's role in Russian meddling in the 2016 election", and it will respond with recent, relevant news articles from 15 major news sources. This chatbot can be seen as a domain-specific search engine whose suggested results are based purely on relevance to the user query, exposing users to a broader set of media sources than they may normally follow. It also has a general dialogue component, and detects whether the user is asking for political information or simply wants to chat. The dialogue component is trained on the Cornell movie dialogues dataset, with the PoliBot responding in simple, playful dialogue when chatting. The bot was deployed in the Telegram app, running in a Docker container on an AWS EC2 server instance (Ubuntu).

## 1. Motivation

With the current unpredictability in the U.S. political system, it can be nearly impossible to stay ahead of the curve on the present state of things. Furthermore, the tendency for most of us to follow only a select few news sources has contributed to an increasingly polarized political environment, as we often reinforce our previous views in an attempt to flee from fake news. These joint trends fueled my motivation to create the PoliBot: a fun, interactive tool that provides news updates from a variety of different viewpoints to broaden the context in which we digest current events.

## 2. Data

The bot is trained using two different datasets: a set of general dialogues and a set of political article texts. Each dataset is used to train two different models. The general dialogue dataset is used to: (1) train responses to apolitical user input (general chit-chat), and (2) classify user intent as either a political query or chit-chat. The political article dataset is used for: (1) the same intent classifier, and (2) a model that creates article embeddings for generating article recommendations in response to queries. Sections 3 and 4 describe these models in detail.

The dialogue dataset is pulled directly from the Cornell movie dialogues dataset, which includes 220,579 utterances across a variety of movie genres and characters. The dataset was preprocessed as part of another project and is imported as a clean TSV file.

The political article dataset is hand accumulated using Selenium, BeautifulSoup, and Lynx to web scrape a final set of 44,193 articles across 15 major media sources. Those represented include: CNN, Fox News, Vox, New York Times, Washington Post, NPR, Los Angeles Times, Wall Street Journal, MSNBC, BBC, Politico, Reuters, USA Today, Chicago Tribune, and Newsweek. Articles are heavily concentrated within the last six months (late 2017 - April 2018), with rare exceptions extending as far back as October 2016. These are restricted to the domain of U.S. politics, given by the search specifications in *source\_urls.csv*. 

### 2a. Data Accumulation

We begin by importing a set of utility functions that facilitate data extraction, cleaning, manipulation, and analysis.

In [12]:
from utils import *

Article links are gathered top down using search filters on each media source's homepage. There are two primary methods for pulling in URLs: XPath and Lynx. The extraction method for each source is determined by the format of its website. For those that expose their embedded URLs as visible links, Lynx proves more effective.

In [9]:
exec(open('./extract_urls.py').read())

### 2b. Data Scrubbing

The raw set of potential URLs is massive and contains many irrelevant articles and other links that must be removed from the final dataset. To account for this, a separate script compiles a clean dataset of only relevant article links from the raw input. Each news source has its own regular expression that identifies pertinent text article URLs. All links are matched against these regex patterns to obtain the final set of articles.

In [13]:
exec(open('./clean_urls.py').read())
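
The per-source regex filter described above can be sketched roughly as follows. This is only an illustration: the patterns, the `filter_article_urls` helper, and the example URLs are hypothetical, and the actual patterns in `clean_urls.py` are not shown here and may differ.

```python
import re

# Hypothetical per-source patterns; the real ones in clean_urls.py may differ.
ARTICLE_PATTERNS = {
    "cnn": re.compile(
        r"^https?://(www\.)?cnn\.com/\d{4}/\d{2}/\d{2}/politics/[\w-]+/index\.html$"
    ),
    "npr": re.compile(r"^https?://(www\.)?npr\.org/\d{4}/\d{2}/\d{2}/\d+/[\w-]+$"),
}


def filter_article_urls(source, raw_urls):
    """Keep only URLs matching the source's article pattern, dropping duplicates."""
    pattern = ARTICLE_PATTERNS[source]
    seen = set()
    kept = []
    for url in raw_urls:
        url = url.strip()
        if pattern.match(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Made-up URLs: one article (listed twice) and one non-article link.
raw = [
    "https://www.cnn.com/2018/04/10/politics/example-story/index.html",
    "https://www.cnn.com/videos/politics/example-clip",
    "https://www.cnn.com/2018/04/10/politics/example-story/index.html",
]
clean = filter_article_urls("cnn", raw)
```

Keeping one compiled pattern per source in a dictionary means a site redesign only requires updating that source's entry.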

With a clean set of article URLs, it is time to extract the body text from each article for analysis. This process involves the extraction of HTML tags specific to each news source. Through a deep dive into the HTML structure of each site, I created a dictionary detailing the corresponding paths to extract from each link to form the resulting full dataset with text.

The text from this full dataset is then cleaned and pre-processed for easy modeling use. The final dataset is reformatted to remove duplicates and facilitate data frame exportation to TSV.

In [35]:
exec(open('./scrape_url_text.py').read())

           num_articles  median_word_count
bbc                 346                293
chicago            4849                537
cnn                6317                330
fox_news           2035                330
la_times           1418                535
msnbc              3996                389
newsweek           4084                323
npr                3963                484
ny_times           2189                632
politico           2566                975
reuters            5000                242
usa_today          2393                456
vox                1062                605
wapo                536                487
wsj                3386                468


## 3. Modeling

Two primary models are trained for chatbot functioning. The first determines which articles are recommended to the user given her input query. The second, which is actually called first during bot interaction, classifies user input as either a political query or general dialogue to guide the bot's response toward article recommendations or chit-chat, respectively.

### 3a. Starspace Embeddings Model

When the user inputs a political query that warrants article recommendations, the bot will need to assess the most relevant articles to recommend. This task is accomplished using the ultra-convenient Starspace package distributed by Facebook's research group. The model takes in the set of compiled article texts as input and returns embeddings for each article as output. 

For the purposes of this project, I use training mode 2 to learn word embeddings across all articles. This training mode effectively transforms a set of unsupervised training data (the body text of each article) into supervised training examples. For each observation (article), one sentence is chosen at random as the input, the remaining sentences of that article serve as the positive label, and text from other articles is randomly sampled to provide negative examples. Ultimately, embeddings for the article vocabulary are trained to maximize cosine similarity between sentences within the same article. These embeddings are then used at test time to rank articles by similarity to the user's query and so produce article recommendations.
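
To make the construction of training examples concrete, here is a small sketch of how one (input, positive, negatives) triple could be drawn from a set of articles. It only illustrates the sampling idea described above and is not Starspace's internal implementation; the function name, the toy sentences, and the choice to sample negatives at the sentence level are assumptions.

```python
import random


def make_training_example(articles, index, num_negatives, rng):
    """Build one (input, positive, negatives) triple from a list of articles.

    Each article is a list of sentences. One sentence is drawn as the input,
    the article's remaining sentences form the positive label, and sentences
    from other articles serve as negatives.
    """
    sentences = articles[index]
    pick = rng.randrange(len(sentences))
    input_sentence = sentences[pick]
    positive = sentences[:pick] + sentences[pick + 1:]

    # Pool every sentence that does not belong to the chosen article.
    other_sentences = [
        s for i, art in enumerate(articles) if i != index for s in art
    ]
    negatives = rng.sample(other_sentences, min(num_negatives, len(other_sentences)))
    return input_sentence, positive, negatives


# Toy articles, for illustration only.
articles = [
    ["senate passes budget bill", "the vote was close", "the house votes next"],
    ["fbi director testifies", "lawmakers press for details"],
    ["governor signs education law", "teachers respond"],
]
rng = random.Random(0)
inp, pos, neg = make_training_example(articles, 0, 2, rng)
```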

For more in-depth information on Starspace embeddings, see the [GitHub](https://github.com/facebookresearch/StarSpace) repository and the [arXiv paper](https://arxiv.org/pdf/1709.03856.pdf), specifically the ArticleSpace sections.

The script below prepares the Starspace training and test data, then calls a bash script to train, test, and evaluate the model on Linux. Final document-level embeddings are computed as the mean of all word embeddings within each article's text. Article recommendations would definitely be improved by using a more sophisticated aggregation method than this simple mean.

The output is stored together with article URL and source information for bot use. 

In [34]:
exec(open('./build_starspace.py').read())
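
As an illustration of the simple mean aggregation mentioned above, the following sketch averages the vectors of an article's words. The function name, the handling of out-of-vocabulary words, and the toy two-dimensional vectors are assumptions; `build_starspace.py` may do this differently.

```python
def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Toy 2-dimensional vectors, for illustration only.
word_vectors = {"senate": [1.0, 0.0], "vote": [0.0, 1.0], "budget": [1.0, 1.0]}
doc_vec = mean_embedding(
    ["senate", "vote", "budget", "unknownword"], word_vectors, dim=2
)
```

Because every word contributes equally, frequent but uninformative words can dominate the average, which is one reason a weighted scheme might recommend better.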

### 3b. Intent Classification Model

The second, heftier model distinguishes political queries from general dialogue in user input. This is structured as a binary text classification problem. I choose a bidirectional LSTM neural network with dropout here for its ability to track the long and convoluted sequences that may arise in user input. The LSTM is trained with the Adam optimizer and a sigmoid cross-entropy loss function.
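
For readers unfamiliar with the loss, here is a small worked sketch of sigmoid cross-entropy on a single raw logit, written in the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))`. The function name and the example values are illustrative only. A classifier that outputs a logit of 0 (probability 0.5) incurs a loss of log 2, roughly 0.693, which is in line with the first logged training step.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Binary cross-entropy on a raw logit, in a numerically stable form.

    Equivalent to -[z*log(s) + (1-z)*log(1-s)] with s = sigmoid(logit), z = label.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


uninformed = sigmoid_cross_entropy(0.0, 1.0)       # equals log(2)
confident_right = sigmoid_cross_entropy(5.0, 1.0)  # small loss
confident_wrong = sigmoid_cross_entropy(5.0, 0.0)  # large loss
```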

The intents dataset is first prepared for modeling. Political text samples are taken directly from the article text dataset used in the embeddings model, splitting text into one sentence per observation. These samples are then combined with the Cornell movie dialogues dataset to represent general dialogue. While these dialogue data do not exhaustively represent potential user input, they span a wide enough range of topics and contexts to be effective. The dataset is then prepared for analysis by pre-processing text, capping input at 200 words for outlier removal and efficiency, and adding 'political' and 'dialogue' labels. 

In [16]:
exec(open('./intents_prep.py').read())
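
The preparation steps above can be sketched as follows. This is a simplified illustration, not the contents of `intents_prep.py`: the naive sentence splitter and the function names are assumptions, and the text pre-processing step is omitted.

```python
import re

MAX_WORDS = 200  # cap on input length stated in the text


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (a simplification)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_intent_samples(article_texts, dialogue_lines):
    """Return (text, label) pairs: article sentences are 'political',
    movie utterances are 'dialogue'. Each sample is capped at MAX_WORDS words."""
    samples = []
    for text in article_texts:
        for sentence in split_sentences(text):
            words = sentence.split()[:MAX_WORDS]
            samples.append((" ".join(words), "political"))
    for line in dialogue_lines:
        words = line.split()[:MAX_WORDS]
        samples.append((" ".join(words), "dialogue"))
    return samples


# Toy inputs, for illustration only.
samples = build_intent_samples(
    ["The Senate passed the bill. The House votes next week."],
    ["Where are you going tonight?"],
)
```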

The full dataset size comes to approximately 400,000 observations, balanced evenly between 'political' and 'dialogue' samples. These are split into 80/10/10 training, validation, and test sets for modeling. 
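
A minimal sketch of an 80/10/10 split is shown below. The shuffling seed and the function name are assumptions, and the project's actual split may be implemented differently (for example, with a library helper).

```python
import random


def split_dataset(samples, seed=0):
    """Shuffle and split into 80% train, 10% validation, and the rest as test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
```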

Dictionaries mapping between words and integer values are created, enabling each input batch of sentences to be converted to numerical LSTM inputs. The model is then implemented in TensorFlow and embedded within an IntentClassifier class object. After 10 epochs, the model achieves high performance with nearly 93% test accuracy. 
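
The word-to-integer mapping can be sketched as below. The `#` padding symbol matches the padded samples in the training log, while the `<unk>` token, the function names, and padding to the longest sentence in the batch are assumptions about details not shown in this write-up.

```python
PAD = "#"      # padding symbol, as seen in the logged samples
UNK = "<unk>"  # hypothetical token for words outside the vocabulary


def build_vocab(sentences):
    """Create word -> id and id -> word dictionaries from whitespace tokens."""
    word2id = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2id:
                word2id[word] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def batch_to_ids(sentences, word2id):
    """Convert sentences to equal-length id lists, padding to the longest one."""
    tokenized = [s.split() for s in sentences]
    max_len = max(len(t) for t in tokenized)
    batch = []
    for tokens in tokenized:
        ids = [word2id.get(w, word2id[UNK]) for w in tokens]
        ids += [word2id[PAD]] * (max_len - len(tokens))
        batch.append(ids)
    return batch


word2id, id2word = build_vocab(["everyone wants legalize weed", "copyright 2017 npr"])
ids = batch_to_ids(["copyright 2017 npr", "everyone wants", "tacos"], word2id)
```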

Note: The training and validation losses shown below derive from randomly compiled input batches, and therefore will not decrease monotonically. LSTM dropout also yields a slightly different network with each iteration, leading to further fluctuation in loss.

In [36]:
exec(open('./intent_classifier.py').read())

Begin training: 

Training epoch 1
Epoch: [1/10], step: [1/2486], loss: 0.696802
Epoch: [1/10], step: [401/2486], loss: 0.365114
Epoch: [1/10], step: [801/2486], loss: 0.366471
Epoch: [1/10], step: [1201/2486], loss: 0.356616
Epoch: [1/10], step: [1601/2486], loss: 0.360487
Epoch: [1/10], step: [2001/2486], loss: 0.310928
Epoch: [1/10], step: [2401/2486], loss: 0.297060
Validation epoch 1 loss: 0.21858135 

X: copyright 2017 npr # # # # # # # # # # # # # # # # # # #
Y: True
P: 0.9978624 

X: everyone wants legalize weed # # # # # # # # # # # # # # # # # #
Y: True
P: 0.39922687 

X: wrote first book used carry around looking publisher good book marcia writer # # # # # # # # # #
Y: False
P: 0.44563374 

Training epoch 2
Epoch: [2/10], step: [401/2486], loss: 0.263141
Epoch: [2/10], step: [801/2486], loss: 0.281445
Epoch: [2/10], step: [1201/2486], loss: 0.283057
Epoch: [2/10], step: [1601/2486], loss: 0.219097
Epoch: [2/10], step: [2001/2486], loss: 0.294121
Epoch: [2/10], step: [2401/24

<img src="images/loss_accuracy.png" alt="Drawing" style="width: 600px;"/>

<img src="images/confusion_matrix.png" alt="Drawing" style="width: 600px;"/>

## 4. Bot Implementation and Use

Finally, all pieces to implement the PoliBot are complete! The final component is the actual creation and implementation of the bot. This includes two programs: the core script to run the bot, and a response manager to guide its responses. 

The response manager incorporates the output of both models described above. The user's query is first passed through the intent classifier to determine either political or chitchat intents. In the case of political intent, the query is then measured against all article embeddings to provide the URLs of its top 3 article recommendations. Otherwise, it is passed to a baseline dialogue model provided by Gunther Cox's ChatterBot package. See the [GitHub](https://github.com/gunthercox/ChatterBot) for details on the ChatterBot's logic and training data.

In [18]:
exec(open('./response_manager.py').read())
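
The routing logic described above might look roughly like the sketch below. The `classify_intent`, `embed_query`, and `chit_chat` callables stand in for the intent classifier, the Starspace query embedding, and the ChatterBot model; their names and signatures, along with the toy articles, are hypothetical and do not reflect the actual `response_manager.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_articles(query_vec, articles, k=3):
    """Return URLs of the k articles whose embeddings are most similar to the query."""
    ranked = sorted(
        articles, key=lambda a: cosine(query_vec, a["embedding"]), reverse=True
    )
    return [a["url"] for a in ranked[:k]]


def respond(message, classify_intent, embed_query, chit_chat, articles):
    """Route a message: political queries get article URLs, the rest get chit-chat."""
    if classify_intent(message) == "political":
        return top_articles(embed_query(message), articles)
    return chit_chat(message)


# Toy articles and stub models, for illustration only.
articles = [
    {"url": "https://example.com/a", "embedding": [1.0, 0.0]},
    {"url": "https://example.com/b", "embedding": [0.0, 1.0]},
    {"url": "https://example.com/c", "embedding": [0.7, 0.7]},
    {"url": "https://example.com/d", "embedding": [-1.0, 0.0]},
]
reply = respond(
    "senate vote",
    classify_intent=lambda m: "political",
    embed_query=lambda m: [1.0, 0.1],
    chit_chat=lambda m: "Hello!",
    articles=articles,
)
```

Passing the models in as callables keeps the routing testable without loading the trained classifier or embeddings.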

The PoliBot's core script imports the response manager and establishes a BotHandler class to manage its functions. It is initialized using Telegram's API and provided token from BotFather. Once initialized, we're in business!

In [19]:
import subprocess
subprocess.call(['python3', './run_bot.py', '--token=<YOUR_TELEGRAM_BOT_TOKEN>'])

When the user begins a chat session, the PoliBot sends a welcome message. Importantly, it also allows the user to remove any subset of sources from the recommendation pool at any point during the conversation by sending the corresponding source numbers, separated by commas. For example, as Wall Street Journal articles are only accessible with a subscription, users without a WSJ membership may choose to exclude those articles. The mapping of numbers to sources can be resent at any time with the user input "sources". This also means that users cannot chat or search for articles by sending solely comma-separated numbers, as such a message will be interpreted as a search filter!
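
One way to picture this message handling is the sketch below. The source numbering, the regular expression, the function name, and the choice to accumulate exclusions across messages are all assumptions made for illustration; the bot's actual behavior may differ.

```python
import re

# Hypothetical numbering; the bot's actual mapping is not shown in this write-up.
SOURCES = {1: "cnn", 2: "fox_news", 3: "wsj"}

# A message made up solely of comma-separated numbers is treated as a filter.
FILTER_RE = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")


def parse_message(text, excluded):
    """Classify a message as a 'sources' request, a source filter, or a normal query."""
    if text.strip().lower() == "sources":
        return "show_sources", excluded
    if FILTER_RE.match(text):
        numbers = {int(n) for n in text.split(",")}
        # Unknown numbers are ignored; known ones are added to the exclusions.
        return "filter", excluded | {SOURCES[n] for n in numbers if n in SOURCES}
    return "query", excluded


action, excluded = parse_message("3", set())
```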

## 5. Discussion and Future Improvements

This was a fun project to create! While a significant portion of the grunt work is complete, there are a number of improvements that can be made in the future for more effective user-bot interaction:

First, and most importantly, right now the article recommendations leave much to be desired. With a limited search domain (some sources pulling from narrow filters such as "Trump") and only slight hyperparameter tuning in the Starspace model, the current embeddings yield bot recs that would not stand up in the real world. The current PoliBot should be seen more as a prototype than a product. :)

Beyond corpus size and embeddings, I'd like to spend more time on the dialogue interaction component to improve UX. One obvious next step is to obtain a broader set of dialogues outside of the pre-trained ChatterBot conversations. Training the bot on the Cornell movie dialogues would be helpful. Even better would be data from a substantial Twitter scrape. 

For longevity's sake, it would be useful to create a bash script that maintains a rolling addition of new articles over time (perhaps once a week).  The current infrastructure makes this straightforward, barring format changes to source websites and their embedded HTML paths.

Lastly, the PoliBot could cater more to the idea of debiasing users' political news consumption. There is a growing literature on media bias (e.g. [Budak et al. 2014](http://dx.doi.org/10.2139/ssrn.2526461)). If articles can be accurately classified on the political spectrum, users could potentially specify their political leanings and open-mindedness levels to drive article recommendations. Instead of responding solely based on relevance, the PoliBot could take user profiles into account to more precisely target their news preferences. This would require the daunting task of classifying articles by political affiliation, which has proven quite difficult. Still, the potential for this feature to provide more meaningful news consumption is an exciting prospect.