# Research on Sources of Data

## Introduction
In this document, we perform a review of sources of data that are suitable for use in this project. In particular, we look at the following sources:

- Past Data Science Toolbox projects
- [Kaggle Datasets](https://www.kaggle.com/)
- [Dataset List](https://www.datasetlist.com/) - Contains a wide variety of image classification datasets.
- [QuantumStat](https://index.quantumstat.com/) - A resource for datasets suitable for Natural Language Processing (NLP).
- [Google Dataset Search](https://datasetsearch.research.google.com/) - A useful tool for finding datasets.

The types of data we tried to look for are:
- Datasets with images for classification tasks with convolutional neural networks (CNNs)
- Datasets suitable for use with a feed-forward neural network (FFNN)
- Datasets suitable for use with a recurrent neural network (RNN)
- Datasets suitable for topic modelling

We present our findings below.

## Review on Past Data Science Toolbox Projects

In this section, we perform a review of recently completed Data Science Toolbox projects. Learning from past projects is a very fruitful task [1], as it can lead us to new insights and help us understand the ways people have approached their projects in the past. A detailed review is provided below, with links to the repositories and to the data provided where available.

### Neural networks

There are several projects that used neural networks. We provide a summary below.

- [A project](https://github.com/hhplnt/DST-assessment-3/tree/main/Data/PlantVillage) [6] used the `PlantVillage` [dataset](https://www.kaggle.com/datasets/emmarex/plantdisease) [3] from Kaggle [2], which poses a multi-class classification problem: identifying diseases from images of plant leaves. They considered convolutional neural networks, autoencoders and parallelism with GPUs, and studied the properties of neural networks by comparing different activation functions. They also investigated pre-trained models and then tried to build their own neural network. An HPC environment was used.
- [Another project](https://github.com/AdamEiffert/DST-Assessment-4/blob/main/Report/01-Introduction.ipynb) [5] used the [KDD Cup 1999 dataset](https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data) [4], a cybersecurity dataset from a competition whose goal was to build a predictive model capable of distinguishing between attacks and normal connections. The project explored the performance and fitting time of models with different depths (ranging from 1 to 10) and 4 activation functions: `tanh`, `ReLU`, `Swish` and `sigmoid`. They used an HPC environment. The second part of the project used adversarial training to guard against adversarial attacks: adversarial examples were generated with the Fast Gradient Sign Method and then used in training. The effect of the activation function in adversarial training was also studied.
- Another project [[10]](https://github.com/xiaozhang-github/DST-Assessment-4/blob/main/Report/03%20-%20FFNN.ipynb) used the [CIC-IDS-2017 dataset](https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset) [11] to explore a feed-forward neural network (FFNN). It explored scaling and the effects that the number of epochs, number of layers, number of neurons, batch size and learning rate have on training. It also explored dropout to combat overfitting, an autoencoder, and different optimizers. Extensions such as regularisation are mentioned. Performance metrics (accuracy, precision, recall and $F_1$ score) were compared between the FFNN and classifiers such as random forests.
- A [project](https://github.com/shreyashah24/DST-Assessment-3/blob/main/Report/-01-Introduction.ipynb) [12] used [Brain Tumour data](https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset) from Kaggle. A CNN was used. PCA was performed for visualisation. The `ResNet50` pretrained neural network was also explored and PyTorch was used for exploring scalability.
- A [project](https://github.com/erinp0/DST-NN-Project/blob/main/Report/01-Introduction.ipynb) [13] used the [German Traffic Sign Recognition Benchmark (GTSRB) data](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign) from Kaggle to explore the use of robust neural networks. There is detail on how adversarial examples are created (for example by adding Gaussian noise to the images).
- A [project](https://github.com/billnunn/Assessment-4-Bill-Mo-Oliver/blob/main/meeting/Meeting%20Summary.ipynb) [14] used the KDD Cup 1999 dataset [4] and compared different widths, depths and optimizers. Stochastic gradient descent (SGD) was used as a baseline and compared to Nesterov's Accelerated Gradient (NAG) and Adam.
- A [project](https://github.com/gabejg/DST-Assessment-04/blob/main/Report/01%20-%20Introduction.md) [15] used the `UNSW-NB15 Dataset`, which is no longer publicly available. This involved FFNNs built with `Neuralnet`, a single layer perceptron (SLP), an autoencoder, the Naive Bayes classifier, an extreme learning machine (ELM) and a random forest. The use of dropout and of no hidden layers (NHL) were also explored.
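
To make the Fast Gradient Sign Method mentioned above more concrete, here is a minimal sketch. It is not the code from the cited project: a logistic regression model stands in for the neural network so that the input gradient can be written by hand, and the weights `w`, bias `b`, input `x` and step size `eps` are all hypothetical toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Return x + eps * sign(grad_x loss) for the binary cross-entropy loss."""
    p = predict_proba(w, b, x)
    # For logistic regression, d(loss)/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input (not taken from any project above).
w, b = [2.0, -1.0], 0.0
x, y = [1.0, 0.5], 1

x_adv = fgsm(w, b, x, y, eps=0.5)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
```

Each feature moves by at most `eps`, yet the model's confidence in the true class drops. In adversarial training, examples such as `x_adv` would be added to the training set with their original labels.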

### Text Processing Tasks

There are several projects that involved text processing. We provide a summary below.

- A project (see [[7]](https://github.com/xiaozhang-github) and [[8]](https://github.com/mosefaq/dst_assessment_3/blob/main/report/01%20-%20Introduction.ipynb)) applied topic modelling to the Enron email [dataset](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset). They used approaches such as TF-IDF (term frequency-inverse document frequency), latent Dirichlet allocation (LDA) and the Hierarchical Dirichlet Process (HDP). They compared how these can be used to classify spam. One group also timed the modelling process for LDA. The non-negative matrix factorisation (NMF) topic model was also explored.
- A [project](https://github.com/gabejg/DST-Assessment-05/tree/main) [16] scraped [NCSC](https://www.ncsc.gov.uk/) weekly threat reports using Selenium. The use of PySpark to parallelise a topic model (using TF-IDF) was then explored. There were detailed reflections on the results of the parallelisation.
- A [project](https://github.com/gabejg/DST-Assessment-03) [17] applied topic modelling to SSH logs. The data is available in the data folder of the GitHub repository. This used the TF-IDF and LDA approaches to topic modelling, along with t-SNE for dimensionality reduction.
- A [project](https://github.com/billnunn/Assessment-3-Bill-Adam/blob/main/Report/01-Introduction.ipynb) [18] applied topic modelling with LDA to the Common Vulnerabilities and Exposures (CVE) database.
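
Several of the projects above rely on TF-IDF weighting, so a small sketch may help show what it computes. This is an illustration under stated assumptions, not the code used in any of the projects: it uses the plain unsmoothed definition (term frequency times `log(N / df)`), whereas library implementations often apply smoothing and normalisation. The function name `tf_idf` and the toy log lines are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised, non-empty document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, loosely in the style of SSH log lines.
docs = [
    "failed password for root".split(),
    "failed password for admin".split(),
    "accepted publickey for root".split(),
]
weights = tf_idf(docs)
```

A term that appears in every document (here `for`) gets weight zero, while a term unique to one document (here `admin`) gets the highest weight in that document. This is why TF-IDF is a common preprocessing step before fitting a topic model.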

### Other

- The [CIC-IDS-2017 dataset](https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset) [11] was used in a [project](https://github.com/xiaozhang-github/DST-Assessment-5/blob/main/Report/01%20-%20Introduction.ipynb) where Spark was used together with random forests.
- A [project](https://github.com/billnunn/Assessment-5-Bill-Naz-Adam/tree/main/Report) used the [email data of CMU’s synthetic insider threat dataset](https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1) to explore the use of Apache Spark’s GraphFrames package, giving consideration to parallelisation.


# An Excursion Through Data Sources

In this section, we present datasets that we found ourselves.

## Datasets Suitable for Convolutional Neural Networks

Through our review of past `Data Science Toolbox` projects, we saw that small datasets with "easy" problems do not properly showcase the power of neural networks, since it is easy to get good performance (high accuracy and high AUC) with simpler and less computationally intensive classifiers. For this reason, we focus on large datasets in searching for data.

1. [MNIST](https://www.kaggle.com/oddrationale/mnist-in-csv): 70,000 grayscale images of handwritten digits for image classification tasks.

2. [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html): 60,000 32x32 color images in 10 classes for image classification.

3. [ImageNet](http://www.image-net.org/): About 1.3 million training images in 1,000 categories (the ILSVRC subset; the full database contains over 14 million images) for visual recognition and benchmarking.

4. [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist): 70,000 grayscale images of fashion products in 10 categories. The linked GitHub repository explains why this dataset may be more suitable for analysis than the MNIST digits dataset.

5. [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html): Nearly 9 million annotated images for object detection tasks.

6. [MS-COCO](http://cocodataset.org/): 330,000 images with 1.5 million object instances for object detection, segmentation, and captioning.

7. [SVHN (Street View House Numbers)](http://ufldl.stanford.edu/housenumbers/): Over 600,000 labeled digits from real-world street numbers for digit recognition.

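8. [Stanford Large Network Dataset Collection](http://snap.stanford.edu/data/): Large network datasets across domains, featuring millions of nodes and edges. Note that this is graph data rather than image data, so it is less directly suited to CNNs than the other entries.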

A lot of these are too large for analysis on our personal computers. This will give us a chance to use downsampling or to reduce the amount of information (for example converting images to greyscale).
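
The two reduction strategies mentioned above can be sketched as follows. This is a minimal illustration that assumes "downsampling" means keeping a random subset of the images, and that images are stored as nested lists of `(r, g, b)` tuples; the helper names `to_greyscale` and `subsample` and the toy dataset are hypothetical. The greyscale conversion uses the standard ITU-R BT.601 luminance weights.

```python
import random


def to_greyscale(image):
    """Convert an RGB image (rows of (r, g, b) tuples) to rows of luminance values."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in image
    ]


def subsample(items, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of the items."""
    k = max(1, int(len(items) * fraction))
    return random.Random(seed).sample(items, k)


# Hypothetical toy dataset: ten 2x2 RGB images.
dataset = [
    [[(i, i, i), (255, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
    for i in range(10)
]

small = [to_greyscale(img) for img in subsample(dataset, fraction=0.5)]
```

Subsampling halves the number of images and greyscale conversion cuts the three colour channels to one, so the result here holds about one sixth of the original values. Fixing the seed keeps the subset reproducible between runs.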

## Datasets for Recurrent Neural Networks

These datasets provide a variety of sequential data types to experiment with recurrent neural networks. 

1. [IMDB Reviews](https://ai.stanford.edu/~amaas/data/sentiment/): A dataset of 50,000 movie reviews labeled as positive or negative, commonly used for sentiment analysis tasks in NLP.

2. [Amazon Reviews](https://nijianmo.github.io/amazon/index.html): Millions of Amazon product reviews and metadata, useful for training models on sentiment analysis and recommendation systems.

3. [Google's Billion Words Corpus](https://code.google.com/archive/p/1-billion-word-language-modeling-benchmark/): A large dataset for training language models, with over a billion words of English text, ideal for language modeling and word prediction.

4. [Jena Climate Dataset](https://www.kaggle.com/datasets/mnassrib/jena-climate): Weather time series dataset recorded at the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany.
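
For time series such as the Jena Climate data, a recurrent network is usually trained on fixed-length windows of past values paired with a future value to predict. The sketch below shows one common way to build such pairs. It is an assumption about how we might prepare the data, not a step taken from any source above; the function name `make_windows`, its parameters and the toy readings are hypothetical.

```python
def make_windows(series, window, horizon=1):
    """Split a sequence into (input window, target) pairs.

    The target is the value `horizon` steps after the end of the window.
    """
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        pairs.append((inputs, target))
    return pairs


# Hypothetical temperature readings, not real Jena data.
temps = [10.0, 10.5, 11.0, 10.8, 10.2, 9.9]
pairs = make_windows(temps, window=3)
```

With six readings and a window of three, this yields three training pairs, each using three consecutive readings to predict the next one.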

## Other

Another possibility could be to explore the use of [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), a large language model, as described in a Data Science Toolbox (2024/25) Workshop.

We could also use the following datasets in *topic modelling*:
1. [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/): A dataset consisting of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups, suitable for text classification and topic modeling tasks.

2. [BBC News Classification Dataset](https://www.kaggle.com/c/learn-bbc-news): Around 2,000 articles from BBC news categorized into five topics: business, entertainment, politics, sports, and technology, useful for training models to identify news topics.


# Conclusion

The data we found ourselves, together with the list of datasets considered in past projects, provide a rich source of problems to explore. We will be selecting a dataset and considering a scientific question to analyse in the following documents.

# References

[1] **Importance of Learning from the Past**: A comprehensive article that highlights the significance of understanding the past to inform our future decisions. [Read on Medium](https://medium.com/@chizinduudum/learning-from-the-past-what-can-history-teach-and-tell-us-c9caa6001d99).

[2] **Kaggle**: An essential platform for data science and machine learning projects, offering competitions, datasets, and educational resources. [Visit Kaggle](https://www.kaggle.com/).

**Datasets**: Various datasets used in machine learning and data science projects:

- [3] **PlantVillage Dataset**: A valuable resource for plant disease research and classification, hosted on Kaggle. [Access the Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease)
- [4] **KDD 1999 Cup Dataset**: A widely used dataset for network intrusion detection and machine learning experiments. [Access the Dataset](https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data)
- [9] **Pre-processed Enron Email Dataset**: An email dataset prepared for spam detection research. [Access the Dataset](https://www2.aueb.gr/users/ion/data/enron-spam/)
- [11] **CIC-IDS-2017 Dataset**: A detailed dataset for cybersecurity and intrusion detection, available on Kaggle. [Access the Dataset](https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset)

**Data Science Toolbox Projects**: A compilation of data science and machine learning projects from various contributors:

- [5] A machine learning project focused on network intrusion detection using neural networks. [GitHub](https://github.com/AdamEiffert/DST-Assessment-4)
- [6] A GitHub project showcasing deep learning methods to identify plant diseases. [GitHub](https://github.com/hhplnt/DST-assessment-3/tree/main/Data/PlantVillage)
- [7] A Project on Topic Modelling using the Enron email [dataset](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset). [GitHub](https://github.com/xiaozhang-github)
- [8] A project on Topic Modelling using the Enron email [dataset](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset), inspired by [7]. [GitHub](https://github.com/mosefaq/dst_assessment_3/blob/main/report/01%20-%20Introduction.ipynb)
- [10] A project on network intrusion detection using deep learning. [GitHub](https://github.com/xiaozhang-github)
- [12] A project using NNs in classification of brain tumours. [GitHub](https://github.com/shreyashah24/DST-Assessment-3/blob/main/Report/-01-Introduction.ipynb)
- [13] A project using NNs in classification of road signs. [GitHub](https://github.com/erinp0/DST-NN-Project/blob/main/Report/01-Introduction.ipynb)
- [14] A project investigating the use of different optimisers for NNs. [GitHub](https://github.com/billnunn/Assessment-4-Bill-Mo-Oliver/blob/main/meeting/Meeting%20Summary.ipynb)
- [15] A project investigating the use of different architectures in NNs. [GitHub](https://github.com/gabejg/DST-Assessment-04/blob/main/Report/01%20-%20Introduction.md)
- [16] A project on Topic Modelling investigating the use of PySpark. [GitHub](https://github.com/gabejg/DST-Assessment-05/tree/main)
- [17] A project on Topic Modelling investigating various techniques like TF-IDF and LDA. [GitHub](https://github.com/gabejg/DST-Assessment-03)
- [18] A project on Topic Modelling using LDA on data containing a temporal variable. [GitHub](https://github.com/billnunn/Assessment-3-Bill-Adam/blob/main/Report/01-Introduction.ipynb)
