# Research on Sources of Data

In this document, we review sources of data that are suitable for use in this project. In particular, we look for the following kinds of dataset:
- Datasets with images for classification tasks with convolutional neural networks (CNNs)
- Datasets suitable for use with a feed-forward neural network (FFNN)
- Datasets suitable for topic modelling

The sources of data we use are as follows:

- Past Data Science Toolbox Projects
- Kaggle
- UCI Machine Learning Repository
- Other Sources


## Review on Past Data Science Toolbox Projects

Learning from past projects is worthwhile [1], as it can lead us to new insights and show how people have approached their projects in the past. A detailed review is given below, with links to the repositories and, where available, to the data.

### Neural networks

There are several projects that used neural networks. We provide a summary below.

- [A project](https://github.com/hhplnt/DST-assessment-3/tree/main/Data/PlantVillage) [6] used the `PlantVillage` [Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease) from Kaggle [2], which poses a multi-class classification problem: identifying diseases from images of plant leaves. The group considered convolutional neural networks, autoencoders and parallelism with GPUs. They investigated the properties of neural networks by trying different activation functions, and they evaluated pre-trained models before building their own neural network. An HPC environment was used.
- [Another project](https://github.com/AdamEiffert/DST-Assessment-4/blob/main/Report/01-Introduction.ipynb) [5] used the [KDD1999 Cup Dataset](https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data) [4], a cybersecurity dataset from a competition that is intended for building a predictive model capable of distinguishing between attacks and normal connections. The first part of the project explored the performance and fitting time of models with different depths (ranging from 1 to 10) and 4 activation functions: `tanh`, `ReLU`, `Swish` and `sigmoid`. An HPC environment was used. The second part used adversarial training to guard against adversarial attacks: the Fast Gradient Sign Method was used to generate adversarial examples, which were then included in training. The effect of the activation function on adversarial training was also studied.
- Another project [[10]](https://github.com/xiaozhang-github) used the [CIC-IDS-2017 dataset](https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset) to explore a feed-forward neural network (FFNN). It explored scaling; the effects of the number of epochs, number of layers, number of neurons, batch size and learning rate on training; dropout to combat overfitting; an autoencoder; and different optimizers. Extensions such as regularisation are mentioned. Performance metrics (accuracy, precision, recall and $F_1$ score) were compared between the FFNN and classifiers such as random forests.
- A [project](https://github.com/shreyashah24/DST-Assessment-3/blob/main/Report/-01-Introduction.ipynb) used [Brain Tumour data](https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset) from Kaggle. A CNN was used, and PCA was performed for visualisation. The `ResNet50` pretrained neural network was also explored, and PyTorch was used for exploring scalability.
- A [project](https://github.com/erinp0/DST-NN-Project/blob/main/Report/01-Introduction.ipynb) used [German Traffic Road Sign data](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign) from Kaggle to explore the use of robust neural networks. There is detail on how adversarial examples are created (for example by adding Gaussian noise to the images).
- A [project](https://github.com/billnunn/Assessment-4-Bill-Mo-Oliver/blob/main/meeting/Meeting%20Summary.ipynb) used the KDD1999 Cup dataset and compared different optimizers. Stochastic Gradient Descent was used as a baseline and compared to Nesterov's Accelerated Gradient (NAG) and Adam.
- A [project](https://github.com/gabejg/DST-Assessment-04/blob/main/Report/01%20-%20Introduction.md) used the `UNSW-NB15 Dataset`, which is no longer publicly available. It involved FFNNs (using `Neuralnet`), a single-layer perceptron (SLP), an autoencoder, the Naive Bayes classifier, an extreme learning machine (ELM) and a random forest. Dropout and networks with no hidden layers (NHL) were also explored.
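
Several of the projects above rely on adversarial examples generated with the Fast Gradient Sign Method (FGSM). As a rough illustration of the idea, and not a reconstruction of any group's code, the sketch below applies FGSM to a tiny logistic-regression model in plain Python. The weights, input and `epsilon` are hypothetical values chosen for the example; a real project would compute the gradient through a neural network with a deep learning framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_example(weights, bias, x, y, epsilon):
    """Return x + epsilon * sign(dLoss/dx) for binary cross-entropy loss."""
    p = predict_proba(weights, bias, x)
    # For logistic regression with cross-entropy loss, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions for illustration only).
weights = [2.0, -1.0]
bias = 0.0
x = [0.5, 0.2]
y = 1  # true label

x_adv = fgsm_example(weights, bias, x, y, epsilon=0.3)
print("clean probability:      ", predict_proba(weights, bias, x))
print("adversarial probability:", predict_proba(weights, bias, x_adv))
```

In adversarial training, perturbed inputs such as `x_adv` are added to the training data with their original labels, so that the model learns to classify them correctly.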

### Text Processing Tasks

There are several projects that involved text processing. We provide a summary below.

- A project (see [[7]](https://github.com/xiaozhang-github) and [[8]](https://github.com/mosefaq/dst_assessment_3/blob/main/report/01%20-%20Introduction.ipynb)) applied topic modelling to the Enron email [dataset](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset). They used TF-IDF (term frequency-inverse document frequency), latent Dirichlet allocation (LDA) and the Hierarchical Dirichlet Process (HDP). They compared how these approaches can be used to classify spam. One group also timed the modelling process for LDA. The non-negative matrix factorisation (NMF) topic model was also explored.
- A [project](https://github.com/gabejg/DST-Assessment-05/tree/main) used Selenium to scrape the [NCSC](https://www.ncsc.gov.uk/) weekly threat reports. The use of PySpark to parallelise a topic model (using TF-IDF) was then explored. There were detailed reflections on the results of the parallelisation.
- A [project](https://github.com/gabejg/DST-Assessment-03) applied topic modelling to SSH logs. The data is available in the data folder of the GitHub repository. The project used the TF-IDF and LDA approaches to topic modelling, along with t-SNE for dimensionality reduction.
- A [project](https://github.com/billnunn/Assessment-3-Bill-Adam/blob/main/Report/01-Introduction.ipynb) used LDA to perform topic modelling on the Common Vulnerabilities and Exposures (CVE) database.
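
TF-IDF appears in most of the text projects above, so a small worked example may help. The sketch below is a minimal plain-Python version that assumes one common variant of the weighting (term frequency divided by document length, multiplied by `log(N / df)`); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The toy corpus is hypothetical and only loosely modelled on SSH log lines.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus (an assumption for illustration only).
corpus = [
    "failed password for root".split(),
    "accepted password for admin".split(),
    "failed login for guest".split(),
]

scores = tf_idf(corpus)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A term that occurs in every document (here `for`) receives a weight of zero, while a term unique to one document (here `root`) receives the highest weight, which is why TF-IDF is a useful first step before fitting a topic model such as LDA or NMF.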

### Other

- The [CIC-IDS-2017 dataset](https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset) was used in a different [project](https://github.com/xiaozhang-github/DST-Assessment-5/blob/main/Report/01%20-%20Introduction.ipynb) where Spark was used together with random forests.
- A [project](https://github.com/billnunn/Assessment-5-Bill-Naz-Adam/tree/main/Report) used the [email data of CMU’s synthetic insider threat dataset](https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1) to explore the use of Apache Spark’s GraphFrames package, giving consideration to parallelisation.


# References

[1] Importance of learning from the past: https://medium.com/@chizinduudum/learning-from-the-past-what-can-history-teach-and-tell-us-c9caa6001d99

[2] Kaggle: https://www.kaggle.com/

[3] `PlantVillage` [Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease)

[4] [KDD1999 Cup Dataset](https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data)

[5] Neural network project on KDD1999 Cup dataset: https://github.com/AdamEiffert/DST-Assessment-4

[6] Neural network project on Plant Disease identification: https://github.com/hhplnt/DST-assessment-3/tree/main/Data/PlantVillage

[7, 8] Projects on Topic Modelling: https://github.com/xiaozhang-github, https://github.com/mosefaq/dst_assessment_3/blob/main/report/01%20-%20Introduction.ipynb

[9] Pre-processed Enron email dataset: https://www2.aueb.gr/users/ion/data/enron-spam/

[10] Neural network project on CIC-IDS-2017 dataset: https://github.com/xiaozhang-github