# Part 0: Form Study Groups [End of Week 6]
---
If you haven't already done so, the first task is to form a study group of exactly four people. There is an announcement on Absalon describing how to do this. If you need to form a group with fewer than four people, you must write to the Course Responsible to explain why you need dispensation. Make sure that you list the names of all members of the group at the top of the final report, along with your group number. You should also make the contributions of each group member clear.

# Part 1: Data Processing (~1 page) [End of Week 8]
---

In the first part of the project, you should work on retrieving, structuring, and cleaning data.

You will be using a subset of the FakeNewsCorpus dataset in your project, which is available from Absalon. You can also find more information about the full dataset, including how the data was collected and which fields are available.

### Task 1:

Your first task is to retrieve a sample of the FakeNewsCorpus from https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv and structure, process, and clean it. You should follow the methodology you developed in Exercise 1. When you have finished cleaning, you can start to process the text. NLTK (https://www.nltk.org/) has built-in support for many common operations. Try the following:

-   Tokenize the text.

-   Remove stopwords and compute the size of the vocabulary. Compute the reduction rate of the vocabulary size after removing stopwords.

-   Remove word variations with stemming and compute the size of the vocabulary. Compute the reduction rate of the vocabulary size after stemming.

Describe which procedures (and which libraries) you used and why they are appropriate.
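The vocabulary sizes and reduction rates asked for above can be computed along the lines of the following minimal sketch. The stopword set and the `toy_stem` function are hypothetical stand-ins so that the example runs without downloads; in your own pipeline you would typically replace them with NLTK's English stopword list (`nltk.corpus.stopwords.words("english")`) and a real stemmer such as `nltk.stem.PorterStemmer().stem`.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def reduction_rate(size_before, size_after):
    """Fraction of the vocabulary removed by a processing step."""
    if size_before == 0:
        return 0.0
    return (size_before - size_after) / size_before

# Assumption: a tiny illustrative stopword set, not a real stopword list.
TOY_STOPWORDS = {"the", "a", "is", "and", "of"}

def toy_stem(word):
    """Crude suffix stripper for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The senator claimed and the senators claiming a claim is a claim"

tokens = tokenize(text)
vocab_raw = set(tokens)

# Step 1: remove stopwords and measure the vocabulary again.
tokens_no_stop = [t for t in tokens if t not in TOY_STOPWORDS]
vocab_no_stop = set(tokens_no_stop)

# Step 2: stem the remaining tokens and measure once more.
tokens_stemmed = [toy_stem(t) for t in tokens_no_stop]
vocab_stemmed = set(tokens_stemmed)

stopword_reduction = reduction_rate(len(vocab_raw), len(vocab_no_stop))
stemming_reduction = reduction_rate(len(vocab_no_stop), len(vocab_stemmed))

print(len(vocab_raw), len(vocab_no_stop), len(vocab_stemmed))
print(f"{stopword_reduction:.2%} {stemming_reduction:.2%}")
```

Note that the reduction rate for stemming is measured here relative to the vocabulary after stopword removal; if you measure it relative to the raw vocabulary instead, state that choice in your report.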

### Task 2:

Now try to explore the 995K FakeNewsCorpus subset. Make at least three non-trivial observations/discoveries about the data. These observations could be related to outliers, artefacts, or even better: genuinely interesting patterns in the data that could potentially be used for fake-news detection. Examples of simple observations could be how many missing values there are in particular columns, or what the distribution over domains is. Be creative!

1.  Describe how you ended up representing the FakeNewsCorpus dataset (for instance with a Pandas dataframe). Argue for why you chose this design.

2.  Did you discover any inherent problems with the data while working with it?

3.  Report key properties of the data set - for instance through statistics or visualization.

The exploration can include (but need not be limited to):

1.  counting the number of URLs in the content

2.  counting the number of dates in the content

3.  counting the number of numeric values in the content

4.  determining the 100 most frequent words that appear in the content

5.  plotting the frequency of the 10,000 most frequent words (any interesting patterns?)

6.  running the analysis in points 4 and 5 both before and after removing stopwords and applying stemming: do you see any difference?
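As a starting point for the counts listed above, here is a minimal sketch using only the standard library. The regular expressions are assumptions chosen for illustration: they only recognise `http(s)` URLs, ISO-style and slash-separated dates, and plain digit sequences, so you will likely want to extend them for the formats you actually find in the corpus.

```python
import re
from collections import Counter

# Assumed, deliberately simple patterns; real content needs broader ones.
URL_RE = re.compile(r"https?://\S+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def content_stats(text):
    """Count URLs, dates, and remaining numeric values in one article."""
    n_urls = len(URL_RE.findall(text))
    # Blank out URLs first so digits inside them are not counted again.
    text = URL_RE.sub(" ", text)
    n_dates = len(DATE_RE.findall(text))
    # Blank out dates so their digits are not counted as numbers.
    text = DATE_RE.sub(" ", text)
    n_numbers = len(NUM_RE.findall(text))
    return {"urls": n_urls, "dates": n_dates, "numbers": n_numbers}

def top_words(texts, n):
    """Return the n most frequent lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(n)

sample = ("On 2018-01-25 the site https://example.com/a1 reported "
          "3,500 cases and 12 deaths on 1/2/2018.")
stats = content_stats(sample)
print(stats)
print(top_words(["the cat and the hat", "the end"], 1))
```

The `(word, count)` pairs returned by `top_words` are already sorted by frequency, so they can be passed directly to a plotting library for point 5; a log-log scale is often a useful first view of word-frequency data.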

### Task 3: 

Apply your data preprocessing pipeline to the 995,000 rows sampled from the FakeNewsCorpus.

### Task 4: 

Split the resulting dataset into training, validation, and test splits. A common strategy is to split the data uniformly at random into 80% / 10% / 10%. You will use the training data to train your baseline and advanced models; the validation data can be used for model selection and hyperparameter tuning, while the test data should only be used in Part 4.
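A uniform random 80% / 10% / 10% split could be sketched as follows. The function name `train_val_test_split` and the fixed seed are illustrative assumptions; fixing the seed simply makes the split reproducible, which helps when you report results. scikit-learn's `train_test_split`, applied twice, is a common alternative.

```python
import random

def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle row indices with a fixed seed and cut them into three parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything that is left

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)

rows = list(range(1000))  # stand-in for the preprocessed articles
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```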

# Part 2: Simple Logistic Regression Model (~1 page) [End of Week 10]
---

### Task 0:

You should create one or more reasonable baselines for your Fake News predictor. These should be simple models that you can use to benchmark your more advanced models against. You should aim to train a binary classification model that can predict whether an article is reliable or fake.

### Task 1: 

Start by implementing and training a simple logistic regression classifier that uses a fixed vocabulary of the 10,000 most frequent words extracted from the content field as the input features. You do not need to apply TF-IDF weighting to the features. It should take no more than five minutes to fit this model on a modern laptop, and you should expect to achieve an F1 score of ~94% on your test split. Write in your report the performance that you achieve with your implementation of this model, and remember to report any hyperparameters used for the training process.
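One possible way to set up such a model is sketched below with scikit-learn, which is an assumption here since the task does not prescribe a library. `CountVectorizer(max_features=10_000)` keeps the terms with the highest frequency across the corpus, and no TF-IDF weighting is applied. The six toy training texts, their labels, and the `max_iter` value are made up for illustration; in the project you would fit on the preprocessed `content` field of your training split and evaluate on held-out data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy data for illustration only: 1 = fake, 0 = reliable.
train_texts = [
    "shocking secret hoax exposed",
    "secret cure hoax shocking",
    "exposed shocking miracle cure",
    "council approves annual budget",
    "minister presents budget report",
    "court publishes annual report",
]
train_labels = [1, 1, 1, 0, 0, 0]

val_texts = ["shocking miracle hoax", "council publishes budget report"]
val_labels = [1, 0]

# Raw term counts over a fixed vocabulary, followed by logistic regression.
model = make_pipeline(
    CountVectorizer(max_features=10_000),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

predictions = model.predict(val_texts)
score = f1_score(val_labels, predictions)
print(list(predictions), score)
```

Whatever settings you end up using (for example the regularisation strength `C` or `max_iter`), list them in the report so the result can be reproduced.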

### Task 2: 

Consider whether it would make sense to include meta-data features as well. If so, which ones, and why? If relevant, report the performance when including these additional features and compare it to the first baselines. Discuss whether these results match your expectations.

### Task 3: 

Apply your data preprocessing pipeline to the extra reliable data you scraped during Graded Exercise 2 and add this to the training data and observe how this changes the performance of your simple model. Discuss whether you will continue to use this extra reliable data for the Advanced Model.

# Part 3: Advanced Model (~1 page) [End of Week 11]
---

Create the best Fake News predictor that you can come up with. This should be a more complex model than the simple logistic regression model, either in the sense that it uses a more advanced method, or because it uses a more elaborate set of features. For example, you might consider using a Support Vector Machine, a Naive Bayes Classifier, or a neural network. The input features might use more complex text representations, such as TF-IDF weights or continuous word embeddings. Report necessary details about your models ensuring full reproducibility. This could include, for example, the choice of relevant parameters and how you chose them. Make sure to argue for why you chose this approach over potential alternatives.

*Optional: If you want to go even further, you might want to try training your models on even more data. The full FakeNewsCorpus is a total of 9GB of source material available for training your model. You will need to use a multi-part decompression tool, e.g. `7z`. Given all the files, execute the following command: `7z x news.csv.zip`. This should create a 27GB file on disk (`29.322.513.705` bytes). You may find it challenging to run your data processing pipeline on the entire FakeNewsCorpus, so take care if you attempt this step.*
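If the full file does not fit in memory, one option is to stream it in batches instead of loading it at once. The sketch below uses only the standard `csv` module; the function name `stream_batches`, the batch size, and the four-column toy sample are assumptions for illustration. The field size limit is raised because article bodies can exceed the module's default limit.

```python
import csv
import io

# Article bodies can be longer than the csv module's default field limit.
csv.field_size_limit(10**9)

def stream_batches(handle, batch_size):
    """Yield lists of row dictionaries without reading the whole file."""
    reader = csv.DictReader(handle)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch

# Toy stand-in for the real file; with real data you would pass
# open("news.csv", newline="", encoding="utf-8") instead.
sample = (
    "id,domain,type,content\n"
    "1,a.example,fake,first text\n"
    "2,b.example,reliable,second text\n"
    "3,a.example,fake,third text\n"
    "4,c.example,reliable,fourth text\n"
    "5,b.example,fake,fifth text\n"
)
batch_sizes = [len(b) for b in stream_batches(io.StringIO(sample), 2)]
print(batch_sizes)
```

Each batch can be cleaned and written out before the next one is read, which keeps memory use roughly constant regardless of file size.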


# Part 4: Evaluation (~1 page) [End of Week 12]
---

You should now evaluate your models on the FakeNewsCorpus and LIAR datasets. Arrange all these results in a table to facilitate a comparison between them. You should evaluate each model on how well it classifies articles, using the F-score. You may want to include a confusion matrix to visualize the types of classification errors made by your models.
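For reference, the confusion-matrix counts and the F1 score (the harmonic mean of precision and recall) can be computed as in the sketch below. Treating the label `1` (fake) as the positive class is an assumption; state which class you treat as positive in your report. scikit-learn's `f1_score` and `confusion_matrix` provide the same quantities.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, tn, f1)
```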


### Task 1: 

Evaluate the performance of your Simple and Advanced Models on your FakeNewsCorpus test set. It should be possible to achieve > 80% accuracy but you will not fail the project if your model cannot reach this performance.

### Task 2: 

In order to allow you to play around with cross-domain performance, try the same exercise on the LIAR dataset, where you know the labels, and can thus immediately calculate the performance. You are expected to directly evaluate the model you trained on the FakeNewsCorpus. In other words, you do not need to retrain the model on the LIAR dataset.
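LIAR uses six truthfulness labels rather than a binary one, so they need to be mapped onto your model's two classes before scoring. The sketch below assumes the files are tab-separated without a header row, with the label in the second column and the statement in the third. The particular grouping of labels into fake and reliable is an assumption made for illustration; other groupings are defensible, and you should justify your own choice in the report.

```python
import csv
import io

# Assumed grouping of LIAR's six labels into two classes.
FAKE_LABELS = {"pants-fire", "false", "barely-true"}
RELIABLE_LABELS = {"half-true", "mostly-true", "true"}

def load_liar(handle):
    """Read LIAR-style TSV rows and return (statements, binary_labels)."""
    statements, labels = [], []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        label, statement = row[1], row[2]
        if label in FAKE_LABELS:
            binary = 1
        elif label in RELIABLE_LABELS:
            binary = 0
        else:
            continue  # skip rows with an unexpected label
        statements.append(statement)
        labels.append(binary)
    return statements, labels

# Hypothetical rows in the assumed layout; real data comes from the LIAR files.
sample = (
    "1.json\tfalse\tStatement A\tsubject\n"
    "2.json\ttrue\tStatement B\tsubject\n"
    "3.json\tunknown\tStatement C\tsubject\n"
    "4.json\tpants-fire\tStatement D\tsubject\n"
)
statements, labels = load_liar(io.StringIO(sample))
print(statements, labels)
```

The returned statements can then be passed through the same preprocessing and feature extraction as the FakeNewsCorpus articles before calling the already-trained model.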

### Task 3: 

Compare the results of this experiment to the results you obtained in Task 1. Report your LIAR results as part of your report. *Remember to test the performance of both your Simple and Advanced Model on the LIAR dataset.*


# Part 5: Conclusions (~0.5 page) [End of Week 13]
---

Conclude your report by discussing the results you obtained. Explain the discrepancy between the performance on your test set and on the LIAR set. If relevant, use visualizations or report relevant statistics to point out differences in the datasets. Discuss the issues of sample bias that arise when evaluating on a different distribution of data than the training data. Conclude by describing overall lessons learned during the project, for instance considering questions like: Does the discrepancy between performance on different data sets surprise you? What can be done to improve the performance of Fake News prediction? Will further progress be driven primarily by better models or by better data?

Please note that the general discussion is not merely a summary of what you have done in the other questions. We expect to see some non-trivial reflection in this section.