Fake News Classifier (Dataset from Kaggle)

About

Our project is all about fighting fake news by building a system that can automatically tell if a news article is real or fake. In today's world of social media and online news, it's becoming harder to know what to believe.

To address this problem, we have built a machine learning model that uses natural language processing techniques to analyse the content of news articles and determine their credibility. We trained our model on a large dataset of labeled news articles, which we carefully curated to include examples of both real and fake news.

Our model uses a variety of features, such as the article text and title, the length of the article, and the overall tone and sentiment, to make its classification decision. We also experimented with different machine learning algorithms and with deep learning frameworks such as PyTorch and Keras to improve the accuracy and robustness of our system.

We hope our fake news classification system will be useful for anyone who wants to stay on top of what's happening in the world without getting duped by fake news. It should be a big help for journalists, fact-checkers, and anyone else who wants to know what's really going on out there.

Problem Statement

"Is this news article real or fake?"

Cleaning Methods

Identify information leaks

Text leaks
Date leaks
URL leaks
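
As a rough illustration of the leak removal listed above, here is a minimal sketch that strips URLs and a leading agency dateline from an article body. Both regular expressions and the name `strip_leaks` are assumptions made for this example, not the patterns used in the notebook; the leaks actually found in the dataset may look different.

```python
import re

# Hypothetical patterns; the actual leaks found in the dataset may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumed example of a text leak: a leading "CITY (Agency) - " dateline.
DATELINE_PATTERN = re.compile(r"^[A-Z][A-Za-z .,/]*\([A-Za-z ]+\)\s*-\s*")


def strip_leaks(text: str) -> str:
    """Remove URLs and a leading agency dateline from one article body."""
    text = DATELINE_PATTERN.sub("", text)
    text = URL_PATTERN.sub("", text)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

The idea is that a classifier should not be able to separate real from fake articles using a source tag or link that only one class contains.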

Basic NLP data cleaning

Contractions
Punctuation
Spaces
Lowercase
Duplicates
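
The basic cleaning steps above could be sketched roughly as follows, using only the standard library. The contraction map is a small made-up sample and the function names are hypothetical; the notebook may apply the steps in a different order or with a dedicated library.

```python
import re
import string

# Small illustrative contraction map; a real pipeline would use a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
}


def clean_text(text: str) -> str:
    """Lowercase, expand contractions, drop punctuation, normalise spaces."""
    text = text.lower()
    # Replace longer keys first so "can't" wins over the generic "n't".
    for short in sorted(CONTRACTIONS, key=len, reverse=True):
        text = text.replace(short, CONTRACTIONS[short])
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, preserving order."""
    return list(dict.fromkeys(texts))
```

Contractions are expanded before punctuation is removed, since stripping the apostrophe first would leave fragments like "cant" that no longer match the map.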

Analysis Done

NLTK and WordCloud analysis
Distribution of news article length (log transformation)
Sentiment analysis using TextBlob
Subject analysis
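
To show what the log transformation of article length refers to, here is a minimal sketch. It assumes length is measured in whitespace-separated words and uses log(1 + n) so that empty articles do not break the transform; the notebook may measure length or apply the log differently. The helper names are hypothetical.

```python
import math
from collections import Counter


def log_lengths(articles):
    """Return log(1 + word count) for each article."""
    return [math.log1p(len(a.split())) for a in articles]


def histogram(values, bin_width=1.0):
    """Count values per bin; keys are the lower edge of each bin."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)
```

Article lengths tend to be heavily right-skewed, so binning the log-transformed values makes the distributions of the two classes easier to compare.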

Models Used

Machine Learning Models:

Logistic Regression (Sentiment Score and Bag of Words)
Decision Tree Classification
Random Forest with Cross Validation
XGBoost using TF-IDF to vectorise text (model included) with Cross Validation
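
For readers unfamiliar with TF-IDF, the sketch below shows the weighting in pure Python. It is not the project's code: the project presumably used a library vectoriser, and this example uses the plain textbook variant (tf = count / document length, idf = log(N / df)) without the smoothing or normalisation that libraries often add. The function name `tfidf` is made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / doc_length and idf = log(N / df), the plain textbook
    variant; library implementations often add smoothing and normalisation.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that appears in every document gets a weight of zero, so words shared by real and fake articles alike contribute nothing, while terms concentrated in a few articles are weighted up.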

Deep Learning Models:

PyTorch using the BERT Base Uncased model (model not included on GitHub due to large file size)
Keras ANN using Tokenizer for preprocessing text (model included)

Further Testing (on an unseen dataset)

We tested our models (Keras and XGBoost) on a completely new dataset to assess their performance on real-world news.

Conclusion

We concluded that deep learning models are best suited to this problem since they excel at handling large amounts of data and can find nuanced patterns and complex features that are not immediately visible. Additionally, the ability to automatically extract hierarchical representations makes these models excel at NLP applications.

Takeaways

Text outperformed title in most cases.
NLP features like sentiment analysis help refine our models.
Deep learning models require a lot of computational power for large datasets.
Overfitting is a common issue, and there is a tradeoff between "too much capacity" (overfitting) and "too little capacity" (not converging).

Future Improvements

Reduce overfitting by starting with a few layers and parameters and increasing them until we see diminishing returns with regard to validation loss.
Include models that can detect sarcasm and irony from context. Models that capture and analyse context better could greatly improve the accuracy of our news classifier.
Build a working website with HTML, CSS and JS that uses our model to detect fake news from an article's text. This would make the deep learning model we trained usable in practice.
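
The first improvement above (grow the model until validation loss stops improving) could be sketched as a simple selection rule. This is only an illustration under assumed inputs: `pick_capacity`, the `min_improvement` threshold, and the idea of passing in one validation loss per model size are all hypothetical and not part of the existing notebooks.

```python
def pick_capacity(val_losses, min_improvement=0.01):
    """Pick the smallest capacity after which validation loss stops improving.

    val_losses maps a capacity setting (e.g. number of layers) to the
    validation loss measured for a model of that size.
    """
    sizes = sorted(val_losses)
    best = sizes[0]
    for size in sizes[1:]:
        # Stop growing once the gain is below the threshold (or negative).
        if val_losses[best] - val_losses[size] < min_improvement:
            break
        best = size
    return best
```

In practice each loss would come from training a model of that size and evaluating it on held-out data; the rule then returns the last size that still gave a meaningful improvement.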

Presentation Video

Fake News Classifier

Contributors

Name	Github Account
Timothy Lee	@timooo-thy
Jain Amitbikram	@spinelessknave8
Vivian Kho	@svftbuns

