This project was developed to contribute to the fight against fake news, with a particular focus on the European Portuguese language. If you find this repository useful, please cite it in your work, alongside our paper.
The diagram below depicts the project's pipeline, covering both the English and the European Portuguese approaches:
To the best of our knowledge, this is the first publicly available dataset with fake and real news in European Portuguese.
It contains over 60 000 rows with news articles and statements extracted through web scraping. The web scrapers were automated using Beautiful Soup and Selenium.
The dataset comprises 4 columns: Text (news title and body merged together), Label (0 for fake, 1 for real), Source, and URL.
The Source column was added because many fake news websites frequently promote articles from other fake news websites, which means that not every article found on a given website actually belongs to it.
Each fact-check also records the source behind the statement being checked, ranging from individuals such as politicians or celebrities to social media as a whole.
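For illustration, here is a minimal sketch of loading the dataset with pandas (the file name matches the CSV shipped in this repository; the column names are assumed to match the description above):

```python
import pandas as pd

# Load the European Portuguese dataset (the English one has the same layout)
df = pd.read_csv("Final_dataset_portuguese.csv")

# Text = title + body merged, Label = 0 (fake) / 1 (real), plus Source and URL
print(df[["Text", "Label", "Source", "URL"]].head())
print(df["Label"].value_counts())
```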
The web scrapers used to gather the data are also available, alongside several Python notebooks with different classification models and techniques. A hedged sketch of the general scraping approach is shown below.
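The URL and HTML selectors in this sketch are hypothetical placeholders, not the ones used by the actual scrapers in this repository:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: fetch one article page and merge title + body,
# mirroring how the "Text" column of the dataset is built
response = requests.get("https://example.com/news/some-article")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

title = soup.find("h1").get_text(strip=True)  # selector is an assumption
body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

row = {"Text": f"{title} {body}", "Label": 1,  # 1 = real, set per source
       "Source": "example.com", "URL": response.url}
```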
Many models were trained using different packages and technologies, including:
- Scikit-learn for Machine Learning
- Tensorflow, Keras, Transformers, and PyTorch for Deep Learning
- NLTK and Spacy for Natural Language Processing
- Pandas, Numpy, and Matplotlib for Exploratory Data Analysis
However, the best-performing models were, respectively, BERT for English (0.96 F1-score), trained on tokenized text, and XGBoost for European Portuguese (0.957 F1-score), trained on pre-processed text (lemmatization and stopword removal) combined with Sentiment Analysis, POS tagging, and TF-IDF features.
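The sketch below shows the core idea behind the Portuguese pipeline (TF-IDF features feeding an XGBoost classifier); the actual notebooks also add sentiment and POS features, and the hyperparameters here are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# TF-IDF over already lemmatized, stopword-free text, then gradient boosting
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000)),              # placeholder size
    ("clf", XGBClassifier(n_estimators=300, eval_metric="logloss")),
])

# X_train: iterable of pre-processed article texts; y_train: 0/1 labels
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
```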
The distilled version of the English BERT model is available here.
The European Portuguese XGBoost model is available here. Since the Free Tier AWS EC2 instance used in the project only has 1 GB of RAM, this model could not be deployed there. To work around this limitation, another distilBERT model was trained, this time with the European Portuguese data, reaching an F1-score of 0.92. The model is available here.
To put the ML and DL models into action, the following system was developed:
A Chrome extension and an Android application communicate with a Flask app served by Gunicorn in a Docker container on an AWS EC2 instance, allowing users to check whether a given text is real or fake through POST and GET requests.
Users can also report fake or real news articles, which are then processed by a script running on a local computer with a dedicated Graphics Processing Unit (GPU).
The models are fine-tuned with the feedback data and then sent to the cloud instance through Secure Shell (SSH) and Secure File Transfer Protocol (SFTP) commands, followed by a POST request that tells the Flask app to replace the old models with the improved ones.
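A stripped-down sketch of what such a Flask service can look like (the /predict and /feedback routes match the requests used later in this guide; the model file name is one of the TFLite files from this repository, while run_model and store_feedback are hypothetical helpers standing in for the tokenization, inference, and storage details):

```python
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load one of the TFLite models copied to the instance later in this guide
interpreter = tf.lite.Interpreter(model_path="NEW_mobile_distilBERT_optimized.tflite")
interpreter.allocate_tensors()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    # hypothetical helper that tokenizes the text and feeds the interpreter
    label = run_model(data["text"], data["language"])
    return jsonify({"prediction": label})

@app.route("/feedback", methods=["POST"])
def feedback():
    data = request.get_json()
    # hypothetical helper that stores the report until the local script fetches it
    store_feedback(data["text"], data["label"], data["language"])
    return jsonify({"status": "received"})
```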
- VS Code or a similar code editor
- AWS Account (1 Year Free Tier)
- Anaconda
- Android Studio
- GitHub Desktop
Follow the steps below to set up and deploy the system. The video demos visually guide you through each step, highlighting the system's features and functionality:
fake-news-flask-cloud.mp4
- After creating your AWS account, navigate to the EC2 dashboard
- Create a key pair, save its pem file as "fake-news-demo.pem", and place it in the "Flask Cloud and Local RESTful Script" local folder
- Launch a new EC2 instance using the following free tier eligible offers and specifications:
- Select the "Amazon Linux 2023 AMI" as the AMI
- Select either "t3.micro" or "t3.nano" as the instance type (varies with region)
- Select the pem file created earlier as the key pair
- Allow SSH, HTTP, and HTTPS traffic
- Select 15 GB of gp3 storage
- After launching the EC2 instance, you need to ensure the key is not publicly viewable. To do this, open a terminal on your local PC (on Windows, use a shell that provides chmod, such as Git Bash or WSL), navigate to the "Flask Cloud and Local RESTful Script" folder, and run the following command:
chmod 400 fake-news-demo.pem
- Connect to your EC2 instance using your Public IPv4 DNS and the command below:
ssh -i fake-news-demo.pem ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS
- Once connected, run the following commands to install Docker on the EC2 instance:
sudo yum install docker
sudo service docker start
sudo usermod -a -G docker ec2-user
- Disconnect from EC2 with
exit
and reconnect using the ssh command so that the new docker group membership takes effect. Confirm that Docker is installed by running
docker info
- Disconnect from EC2 with
exit
and open the "docker-compose.yml" file
- This file will be used alongside the Dockerfile to mount volumes so that the Flask app can access the ML and DL models within the Docker container
- Change the domain used in the first two directories of the "volumes" section. You can either use a paid domain if you own one or create a domain for free with your EC2 IP address and nip.io
- For example, if your EC2 IP is 01.23.456.789 then your domain would be 01-23-456-789.nip.io (only for testing purposes, not recommended in production)
- Send a copy of all Docker, Python, and TFLite files inside the "Flask Cloud and Local RESTful Script" folder using the following commands:
scp -i fake-news-demo.pem NEW_mobile_distilBERT_optimized.tflite ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS:/home/ec2-user
scp -i fake-news-demo.pem mobile_portuguese_distilBERT_optimized.tflite ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS:/home/ec2-user
scp -i fake-news-demo.pem Dockerfile ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS:/home/ec2-user
scp -i fake-news-demo.pem docker-compose.yml ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS:/home/ec2-user
scp -i fake-news-demo.pem Flask_app_optimized.py ec2-user@YOUR-EC2-PUBLIC-IPv4-DNS:/home/ec2-user
- Connect to EC2 and run
ls
to confirm that all files were transferred successfully
- It is advised to use HTTPS instead of the default, less secure HTTP connection in your Flask app. To do this, install certbot by running the command below:
sudo yum install certbot
- Once installed, run the following command to create your free SSL certificate with certbot:
sudo certbot certonly --standalone -d YOUR-DOMAIN
- As mentioned before, use a paid domain or your EC2 IP with nip.io (for example, 01-23-456-789.nip.io)
- Follow the instructions on the terminal to create your certificate. Once finished, run the following command to check if the certificate was created successfully:
sudo certbot certificates
- The certificate is only valid for 90 days, but it can be renewed. Start by running
sudo yum install cronie
- Open the editor with Vim by running
sudo crontab -e
- Type "i" to enter "insert mode" and paste the following command:
0 */12 * * * certbot renew --quiet --post-hook "docker restart fake-news-cont"
- This cron job runs twice a day, at midday and at midnight, and renews the certificate when its expiration date is close. The Docker container is then restarted automatically to apply the new certificate
- Press "Esc", type ":wq" and hit "enter" to save and exit the editor
- Install docker-compose by running the following commands:
sudo curl -L "https://github.com/docker/compose/releases/download/v2.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo mv /usr/local/bin/docker-compose /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
- Confirm that docker-compose is installed by running
docker-compose version
- Build the Docker image using the "docker-compose.yml" file by running the command below:
docker-compose build
- Create and deploy your Docker container with Gunicorn by running the following command:
docker-compose up -d
- With the container running, the Flask app is ready to use. The app will also display a message when you visit your domain in a web browser
- To get a prediction for a given news article, send a POST request using the following PowerShell command in VS Code:
Invoke-RestMethod -Method POST -Uri "YOUR-FULL-DOMAIN/predict" -Headers @{"Content-Type" = "application/json"} -Body '{"text": "Your news article to predict here", "language" : "english"}'
- Change the text, language, and full domain (for example, https://01-23-456-789.nip.io)
- You can also report news articles in the feedback mode by running the following PowerShell command:
Invoke-RestMethod -Method POST -Uri "YOUR-FULL-DOMAIN/feedback" -Headers @{"Content-Type" = "application/json"} -Body '{"text": "Report news article here" , "label" : "0", "language" : "english"}'
- Change the text, label, language, and full domain (for example, https://01-23-456-789.nip.io)
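If you prefer Python over PowerShell, the equivalent requests look like this (same JSON fields as above; replace the domain placeholder with your own):

```python
import requests

BASE = "https://01-23-456-789.nip.io"  # replace with your full domain

# Prediction request
response = requests.post(f"{BASE}/predict",
                         json={"text": "Your news article to predict here",
                               "language": "english"})
print(response.text)

# Feedback request (label: "0" = fake, "1" = real)
response = requests.post(f"{BASE}/feedback",
                         json={"text": "Report news article here",
                               "label": "0", "language": "english"})
print(response.text)
```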
fake-news-chrome-extension.mp4
- Open the "script.js" file located inside the "News Detector Chrome Extension" folder and change the IP address to your full domain (for example, https://01-23-456-789.nip.io).
- Access extension settings in Google Chrome, click on "Load Unpacked" and select the "News Detector Chrome Extension" folder to load it
- Fill in the input fields of the Chrome extension and experiment with both prediction and feedback modes
fake-news-android-app.mp4
- Open the "Fake News Android App" folder using Android Studio and locate the "RetrofitClient.kt" file under "app/java/com/example/fakenewsapp/RetrofitClient.kt"
- Open the "local.properties" file under "Gradle Scripts" and adapt the Android SDK path with your user
- Change the "BASE_URL" value to your full domain (for example, https://01-23-456-789.nip.io)
- Run the app to install and test it on the simulator or on your own device connected via USB (the latter requires developer options to be enabled)
fake-news-model-improvement.mp4
- The datasets were pushed via Git LFS given their size. This requires an initial step for data retrieval:
- Open CMD in the main folder where the datasets are located
- Fetch all the pointer files from the LFS remote server by running
git lfs fetch
- Replace the resulting pointer files with the actual datasets by running
git lfs checkout
- Notice how the size of each dataset has increased considerably after these steps
- Move the "Final_Dataset_English.csv" and "Final_dataset_portuguese.csv" files to the "Flask Cloud and Local RESTful Script" folder
- Download the distilBERT models from here and here, and move them to the "Flask Cloud and Local RESTful Script" folder
- Open the terminal in Anaconda, set the path to the "Flask Cloud and Local RESTful Script" folder, and create the environment by running the following command:
conda env create -f environment.yml
- Once the "news-feedback-fetch" environment is created, open VS Code using the Anaconda launcher with the new environment
- Open the "Flask Cloud and Local RESTful Script" folder and modify the "data_fetch_websocket.py" script as follows:
- Change the IP address in the "send_model_to_ec2" function to your Public IPv4 DNS
- Change the IP address in the "fetch_user_feedback_data" function to your full domain (for example, https://01-23-456-789.nip.io)
- Run the "data_fetch_websocket.py" script using your Anaconda environment. Make sure "news-feedback-fetch:conda" is shown in the lower right corner of VS Code to use the environment
- The script will start fetching the feedback data from the Flask app, with an interval of 30 seconds between each GET request
- Test the feedback functionality by reporting news articles that are incorrectly predicted by the models
- The script requires at least two different news articles to improve the models. If only one article is received, the feedback data is discarded
- Reporting many similar news articles can trigger the similarity check of the script, which converts them into a single news article
- This is done to group similar topics or events together, as a way to reduce model bias and decide between contradicting stances
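A simplified sketch of the fetch-and-retrain loop in "data_fetch_websocket.py" (the endpoint, response format, and fine_tune_models helper are assumptions for illustration; send_model_to_ec2 is named after the function mentioned above):

```python
import time
import requests

FEEDBACK_URL = "https://01-23-456-789.nip.io/feedback"  # your full domain

while True:
    # Poll the Flask app for newly reported articles (assumed GET endpoint)
    articles = requests.get(FEEDBACK_URL).json()  # assumed list of {"text", "label", "language"}

    if len(articles) >= 2:
        # at least two different articles are required; near-duplicates may
        # first be merged into a single article by the similarity check
        fine_tune_models(articles)  # hypothetical helper
        send_model_to_ec2()         # function name taken from the script
    time.sleep(30)  # 30-second interval between GET requests, as described above
```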
- Once the program finishes improving the models with the feedback data, connect to your EC2 instance. Run
ls
to confirm that the new, improved TFLite files were received over SFTP and SSH
- The news articles reported to the Flask app should now return the right predictions, according to the feedback data sent earlier
- Be aware that the transfer learning technique used to improve the models retains all the knowledge acquired during the initial training phase with the datasets
- The most significant factor in this phase is the complexity of the data patterns, which depends on how much the reported news articles differ from those in the initial datasets
- As a consequence, reporting just a few news articles might not be enough to improve the predictions of the new models
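For reference, here is a hedged sketch of how such fine-tuning on feedback data can look with the Hugging Face transformers API (the checkpoint name and hyperparameters are placeholders, not the exact setup of this project, which starts from its own trained distilBERT weights):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Placeholder checkpoint; the project fine-tunes its own distilBERT models
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

texts = ["reported article 1", "reported article 2"]  # feedback data
labels = [0, 1]                                       # 0 = fake, 1 = real

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="tf")
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# A few epochs at a low learning rate nudge the model toward the feedback
# while keeping most of the knowledge from the initial training phase
model.fit(dict(encodings), tf.constant(labels), epochs=2)
```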