The spread of disinformation across the web has become a critical problem for democratic societies. Disseminating fake news has become a profitable business and a common practice among politicians and content producers. A recent study, 'Regulating disinformation with artificial intelligence', examines the trade-offs involved in using automated technology to limit the spread of disinformation online. Although AI and Natural Language Generation have advanced considerably in the last decade, there are still a few shortcomings that must be better understood to build stronger solutions. Students will dive deeper into Natural Language Processing; therefore, a strong knowledge of Python and AI is necessary.
Understand and explore the vulnerabilities and limitations of automatic fact-checking and fake news detection, to support the development of regulations and better technology.
https://www.qu.tu-berlin.de/menue/team/senior_researchers/vinicius_woloszyn/
https://github.com/untruenews/ss2021/blob/main/slides/course.pdf
GROUP B starts at 14:00 and GROUP A at 15:00.
- 1# 27/04/2021 - Introduction and definition of groups
- 2# 04/05/2021 - Flair
- 3# 11/05/2021 - Flair
- 4# 18/05/2021 - Flair
- 5# 25/05/2021 - Experiments with textattack: (GROUP A) ATTACKS, (GROUP B) DATA Augmentation
GROUP A & B: present the evolution of the work
- 6# 01/06/2021 - Experiments with textattack: (GROUP A) ATTACKS, (GROUP B) DATA Augmentation
GROUP A: Perform attacks on the Flair models using https://github.com/QData/TextAttack . Compare the resilience of "RoBERTa", "BERTweet", and "FlairEmbeddings" to Synonym Substitution, Character Substitution, Word Insertion or Removal, and General Paraphrase attacks.
GROUP B: Train a German detector for fake news: 1) Extract data in different languages from the raw dataset (e.g., Portuguese, Italian). 2) Experiment with Google Translate (https://pypi.org/project/googletrans/) to translate the data to German. 3) Use the translated data to fine-tune a pre-trained German language model. 4) Train a multilingual model (e.g., tf-xlm-roberta-base). 5) Compare and present the results.
- 7# 08/06/2021 - Presentation:
GROUP A will present: https://www.europarl.europa.eu/RegData/etudes/STUD/2019/624279/EPRS_STU(2019)624279_EN.pdf
GROUP B will present: https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence-artificial-intelligence
- 8# 15/06/2021 - experiments / writing the paper
- 9# 22/06/2021 - experiments / writing the paper
- 10# 29/06/2021 - experiments / writing the paper
- 11# 06/07/2021 - experiments / writing the paper
- 12# 13/07/2021 - experiments / writing the paper
- 13# 17/07/2021 - submitting the paper
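
To make the GROUP A comparison concrete, here is a minimal sketch of how resilience to an attack can be measured: the attack success rate is the fraction of correctly classified samples whose prediction flips after the text is perturbed. Everything in it is an assumption made for illustration: `toy_classifier`, the trigger words, the synonym table, and the sample texts are hypothetical, and the perturbation functions are hand-written stand-ins, not the TextAttack API. In the actual experiments the classifier would be a trained Flair model and the perturbations would come from TextAttack.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a trained detector: a text is "fake"
# if it contains any of these trigger words (exact token match).
TRIGGERS = {"shocking", "miracle", "hoax"}

# Hypothetical, hand-written synonym table for the toy attack.
SYNONYMS = {"shocking": "astonishing", "miracle": "wonder", "hoax": "fraud"}

# Made-up labelled samples, used only for this illustration.
SAMPLES: List[Tuple[str, str]] = [
    ("shocking miracle cure found", "fake"),
    ("this story is a hoax", "fake"),
    ("parliament passed the budget today", "real"),
    ("the council opened a new library", "real"),
]


def toy_classifier(text: str) -> str:
    """Label a text as 'fake' if any token is a trigger word."""
    return "fake" if any(t in TRIGGERS for t in text.lower().split()) else "real"


def char_swap(text: str) -> str:
    """Character substitution: swap the 2nd and 3rd letters of words longer than 4."""
    def swap(word: str) -> str:
        return word[0] + word[2] + word[1] + word[3:] if len(word) > 4 else word
    return " ".join(swap(w) for w in text.split())


def synonym_substitution(text: str) -> str:
    """Synonym substitution: replace words found in the synonym table."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())


def word_insertion(text: str) -> str:
    """Word insertion: add a filler word after the first token."""
    tokens = text.split()
    tokens.insert(1, "really")
    return " ".join(tokens)


def attack_success_rate(
    classify: Callable[[str], str],
    attack: Callable[[str], str],
    samples: List[Tuple[str, str]],
) -> float:
    """Fraction of correctly classified samples whose prediction flips under attack."""
    attempted = 0
    flipped = 0
    for text, label in samples:
        if classify(text) != label:
            continue  # skip samples the model already gets wrong
        attempted += 1
        if classify(attack(text)) != label:
            flipped += 1
    return flipped / attempted if attempted else 0.0


if __name__ == "__main__":
    for attack in (char_swap, synonym_substitution, word_insertion):
        rate = attack_success_rate(toy_classifier, attack, SAMPLES)
        print(f"{attack.__name__}: {rate:.2f}")
```

A lower success rate means the model is more resilient to that attack. Comparing "RoBERTa", "BERTweet", and "FlairEmbeddings" would then amount to computing the same rate per model and per attack type on a shared test set.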
- Dive into Deep Learning, https://d2l.ai/d2l-en.pdf
- Speech and Language Processing, https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf
- https://github.com/QData/TextAttack
- https://github.com/flairNLP/flair
- https://huggingface.co/
- Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- Woloszyn, Vinicius, et al. "Untrue. News: A New Search Engine For Fake Stories." arXiv preprint arXiv:2002.06585 (2020).
- Zhou, Zhixuan, et al. "Fake news detection via NLP is vulnerable to adversarial attacks." arXiv preprint arXiv:1901.09657 (2019).
- Sinha, Abhishek, et al. "Negative Data Augmentation." arXiv preprint arXiv:2102.05113 (2021).
- Morris, John, et al. "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020.
- https://github.com/afshinea/stanford-cs-229-machine-learning/
- http://cs229.stanford.edu/syllabus-spring2021.html