## Design Interview of a Fake News Detector - Notes

By Robin Forsberg (2022)

*The following notes simulate a client commission of designing a fake news detector system.*

# 1. Client Expectations

Scrape news articles and fake news datasets from the web

-Needs to support: HTTPS, JavaScript, login prompts, paywalls, CAPTCHAs

-ScrapeBox, Scrapestack, Scrapy

-Download Fake News datasets from their source

-Information Security and Object Technology (ISOT)

 

Wrangle the data into a consistent structure for creating features.

-Date and time format manipulations

-Changing data types to match consistent schema

-Replacing missing data with NAs

-Joining data

-Exporting processed data to storage layer 

-These functionalities can be built from scratch using the PySpark library.

-The storage layer is PostgreSQL. We will have a table there in which each row is keyed on a unique article ID. The columns will hold the article contents and article metadata such as the publication date, the publisher, the article length and perhaps the author. The storage will be partitioned by date, so that we can set aside one partition for training. The articles then need to be labelled as either ‘reliable’ or ‘unreliable’. If the data came from a fake news dataset, we won’t need to label them because they are already labelled.
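The wrangling steps above (date normalization, consistent types, missing values, one row per article ID) can be sketched roughly as follows. This is a minimal plain-Python illustration rather than PySpark code; the column names, the `DATE_FORMATS` list and the function names are assumptions made for the example, not part of the commissioned design.

```python
from datetime import datetime

# Hypothetical input formats; a real scraper would need a longer list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y-%m-%dT%H:%M:%S")


def parse_date(value):
    """Return an ISO 8601 date string, or None when the value cannot be parsed."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_article(raw, article_id):
    """Map one scraped record (a dict) onto the assumed storage schema."""
    content = (raw.get("content") or "").strip()
    return {
        "article_id": article_id,
        "content": content or None,          # missing text becomes None (NA)
        "published_on": parse_date(raw.get("date")),
        "publisher": (raw.get("publisher") or "").strip() or None,
        "author": (raw.get("author") or "").strip() or None,
        "length": len(content.split()),      # article length in words
        "label": raw.get("label"),           # pre-filled only for fake news datasets
    }
```

Each returned dict corresponds to one row of the table; in the real system the same logic would run as PySpark transformations before the export to PostgreSQL.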



# 2. System specifications

-An ML solution for social media platforms.

-How is fake news defined? ‘Let’s define fake news as news articles that contain intentionally misleading information from what appears to be a reliable source.’

-Why are our customers paying us? ‘We want to limit the negative financial, health, and social impacts of misinformation on the platforms.’

-The service is API-based: the clients, the social media companies, will call our API.

-The platforms call our service as users post news articles and, depending on our service’s response, they will:

*   Allow the post
*   Limit exposure to the post
*   Disallow the post
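One way the platform-facing decision could look is sketched below. The thresholds `limit_at` and `block_at` and the action names are purely illustrative assumptions; the notes do not specify how the model score maps to the three actions.

```python
def decide(unreliable_probability, limit_at=0.5, block_at=0.9):
    """Translate a model score into one of the three platform actions.

    The thresholds are hypothetical and would be agreed with each client.
    """
    if not 0.0 <= unreliable_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if unreliable_probability >= block_at:
        return "disallow"
    if unreliable_probability >= limit_at:
        return "limit_exposure"
    return "allow"
```

Keeping the thresholds as parameters lets each platform choose how aggressive the filtering should be without retraining the model.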

# 3. Design

Once you've scraped the data, figure out how to label the newly scraped news articles. Here the labelling must be done manually.

Use an internal workforce or a third-party service such as Mechanical Turk. Then:
-Establish comprehensive labelling guidelines
-Extract linguistic features from the articles

Linguistic Inquiry and Word Count (LIWC) is our primary source. It will extract over 90 features for us, e.g.:
-Summary Language Variables (words per sentence, the tone of the article, words with more than 6 letters, and similar counts)
-Linguistic Dimensions (total pronouns)
-Psychological Processes (positive or negative emotions, social references)
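To make the feature categories concrete, here is a rough stand-in that computes a handful of LIWC-style counts. It is not LIWC: the real dictionaries are proprietary, and the tiny `PRONOUNS` set, the feature names and the tokenization rules are assumptions for illustration only.

```python
import re

# Tiny illustrative pronoun list; LIWC's real dictionaries are proprietary.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def summary_features(text):
    """Compute a few simple counts in the spirit of LIWC's summary variables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "word_count": n_words,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        # share of words with more than 6 letters
        "long_word_ratio": sum(len(w) > 6 for w in words) / n_words if n_words else 0.0,
        # share of words found in the pronoun list
        "pronoun_ratio": sum(w in PRONOUNS for w in words) / n_words if n_words else 0.0,
    }
```

The output is a flat dict per article, which is the shape a tree-based model expects once the dicts are stacked into a feature table.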

Then proceed to modelling. In this case, we'll train a random forest algorithm with the following specifications:

-128 trees of max depth 6
-200,000 total news articles (3 MB of memory)

At this stage it's good to consider that there are typically far more reliable articles than unreliable ones. Will our model be able to handle this imbalance?

The answer is no, which is why we need to implement class weighting. Class weighting can be used to mitigate imbalanced data: all we have to do is give the minority class a stronger weight when calculating the Gini impurity for a particular split. Class weighting typically results in a higher rate of false positives, so let’s also look at another technique, the Synthetic Minority Oversampling Technique (SMOTE).
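Both ideas can be illustrated in a few lines. The sketch below shows (a) a Gini impurity in which each sample counts with its class weight and (b) the interpolation step at the heart of SMOTE. The function names and the example weight of 9 are assumptions; the SMOTE part leaves out the nearest-neighbour search that picks `neighbour` in the full algorithm.

```python
import random
from collections import Counter


def weighted_gini(labels, class_weight):
    """Gini impurity where each sample counts as class_weight[label] (default 1)."""
    totals = Counter()
    for label in labels:
        totals[label] += class_weight.get(label, 1.0)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def smote_point(sample, neighbour, rng=random):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]


node = ["reliable"] * 9 + ["unreliable"]
unweighted = weighted_gini(node, {})                  # about 0.18
reweighted = weighted_gini(node, {"unreliable": 9.0})  # about 0.5
```

With the minority class weighted up, the same node looks far less pure, so the tree is pushed to find splits that separate the unreliable articles instead of ignoring them.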

The core steps here are:

1. Tune the model
2. Tune the max tree depth
3. Tune the number of trees
4. Max number of leaf nodes
5. Minimum impurity decrease
6. Minimum samples in a leaf
7. Minimum samples required to split
8. Max features to evaluate
9. Evaluate with the F1 score, which will be a better metric than accuracy here. Cohen’s kappa is another option.
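Why accuracy misleads on imbalanced data becomes clear when the metrics are computed side by side. The helper below is a sketch that derives accuracy, F1 and Cohen's kappa from the four cells of a binary confusion matrix, with "unreliable" treated as the positive class; the function name and the example counts are made up for illustration.

```python
def f1_and_kappa(tp, fp, fn, tn):
    """Accuracy, F1 and Cohen's kappa from a binary confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    observed = (tp + tn) / total  # plain accuracy
    # agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": observed, "f1": f1, "kappa": kappa}


# A model that calls every article reliable, on 98 reliable and 2 unreliable:
always_reliable = f1_and_kappa(tp=0, fp=0, fn=2, tn=98)
```

Here accuracy is 0.98 while both F1 and kappa are 0, which is the behaviour we want from a metric when the model never catches an unreliable article.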

# 4. Initial model evaluations

I. At this stage, we have completed the model. Now we're faced with a question by the client: ‘Let’s say your model achieves a 95% accuracy. If your model predicts an article to be unreliable, what’s the probability that it actually is?’

To calculate this, we'll use Bayesian inference:

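$$
P(\text{unreliable} \mid \text{model}+) = \frac{P(\text{unreliable})\, P(\text{model}+ \mid \text{unreliable})}{P(\text{model}+)} = \frac{0.02 \times 0.92}{(0.02 \times 0.92) + (0.98 \times 0.03)} \approx 38\%
$$

Here 0.02 is the share of unreliable articles, 0.92 is the probability that the model flags an unreliable article, and 0.03 is the probability that it flags a reliable one.

“The probability of an article actually being unreliable is 38% given that our model says it is unreliable.”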
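The same calculation as a small function, useful for trying other base rates. The parameter names are my own; the three input values are the ones used in the formula above.

```python
def posterior_unreliable(prior, true_positive_rate, false_positive_rate):
    """P(unreliable | model flags the article), by Bayes' rule."""
    flagged_and_unreliable = prior * true_positive_rate
    flagged_and_reliable = (1 - prior) * false_positive_rate
    return flagged_and_unreliable / (flagged_and_unreliable + flagged_and_reliable)


print(round(posterior_unreliable(0.02, 0.92, 0.03), 3))  # 0.385
```

The result is driven mostly by the low prior: with only 2% of articles unreliable, even a 3% false positive rate produces more false alarms than true detections.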

II. Second question to consider: ‘Okay, so is this any good? Does it actually help us?’

In this case it actually is good, given that the average human accuracy in detecting a fake news article lies between 55 and 60 percent. If we hold the same ratio of false positive rate to true positive rate under that new assumption of 55 percent accuracy, then for an average human the probability that an article is unreliable, given that they think it is unreliable, is 18 percent. Our model is roughly twice as good as that.

# 5. System architecture and deployment



1.   Collect articles via web scraper.
2.   Feed articles to the data wrangler.
3.   After processing the data, the data wrangler will send it to the PostgreSQL table. This table will be keyed on a unique article ID, and its columns will contain the article contents as well as the article metadata.
4. Next we'll send this to a labelling queue, and the labelling interface will pull from that queue so that labellers who are logged in to the interface can start manually labelling these news articles as either reliable or unreliable. For the labelling queue we can use AWS SQS (Simple Queue Service), and for the labelling interface we can use a simple web app built on a lightweight web framework such as Flask. The Flask server can add labels directly to the label column in the same PostgreSQL table used by the data wrangler.
5. Once articles have been added to the database and consumed by the labelling interface, we can extract features from them with a feature extractor. The feature extractor is going to be activated the same way as the labelling queue, by using change data capture in the database. The difference is that here we will be using RabbitMQ as the broker, or queue, and the feature extractor will be a cluster of Celery workers that pull tasks from RabbitMQ and perform the LIWC feature extraction. Since the LIWC lexicon is proprietary, we will need either to integrate with the Receptiviti API, which holds the exclusive rights to the LIWC lexicons, or to work with them on a custom solution that gives us a local LIWC library, so that we don’t need to make an API call for each article. In our case we are going to use Feast.
6. Once we have the summary data, we send it to ISOT (the fake news dataset), where we commence the model development.
7. Split the data into training and test sets
8. Conduct model training
9. Model tuning
10. Evaluation
11. Deployment to the cloud (AWS / Heroku / other cloud provider).
12. Create an interface for the technical account manager to query the data easily. Elasticsearch helps to search the Logstash logs. Kibana and PagerDuty function as the interface.
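Steps 3 and 4 of the list above can be sketched with in-process stand-ins to show how the pieces hand off to each other. Everything here is an assumption for illustration: `sqlite3` stands in for PostgreSQL, `queue.Queue` stands in for SQS, and the article is enqueued directly on insert rather than through change data capture.

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL table
db.execute(
    "CREATE TABLE articles ("
    "article_id INTEGER PRIMARY KEY, content TEXT, label TEXT)"
)
labelling_queue = queue.Queue()  # stand-in for the SQS labelling queue


def store_article(article_id, content):
    """Step 3: write the wrangled article, then enqueue it for labelling."""
    db.execute(
        "INSERT INTO articles (article_id, content) VALUES (?, ?)",
        (article_id, content),
    )
    labelling_queue.put(article_id)


def label_next(label):
    """Step 4: the labelling interface pulls one article and writes its label."""
    if label not in ("reliable", "unreliable"):
        raise ValueError("unknown label")
    article_id = labelling_queue.get_nowait()
    db.execute(
        "UPDATE articles SET label = ? WHERE article_id = ?",
        (label, article_id),
    )
    return article_id
```

In the real deployment the Flask handler would play the role of `label_next`, and the feature extractor would subscribe to the same table changes through its own RabbitMQ queue.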