# List of Resources

Welcome to the `Resources` folder, which collects resources that may be helpful for the project. This document covers setup and the project brief. `ForConsistency.ipynb` is the shared template we will use to keep our modelling consistent. `Tensorflow-Setup.ipynb` walks through setting up TensorFlow.

---

The following links may be helpful.

- [TensorFlow installation](https://www.tensorflow.org/install/pip): How to install TensorFlow with pip. Setup is more involved if you want to use it with a GPU.
- [Google Colaboratory](https://colab.research.google.com/): Google Colab has TensorFlow pre-installed and offers free GPUs, subject to availability.

    - About GPU Availability on Colab: https://research.google.com/colaboratory/faq.html#gpu-availability
- [Project description](https://dsbristol.github.io/dst/assets/assessments/Assessment2.pdf): The official brief for Assessment 2, reproduced at the end of this document.

---

Notes transcribed from the Data Science Toolbox lecture on 31 October 2024 (tips and tricks from the lecturer):

- The goal of the next project is to do Massively Parallel Data Science.
- Training a NN is massively parallel.
- We have lots of freedom: we choose which question to work on and how to assess computational performance, then apply a model to answer the question. It doesn't need to be a classification task, but classification is recommended as an easy option if we have no other ideas.
- Typically suggested:
  - Evaluating an autoencoder: how to compress what the NN is doing down to low dimensions, and how to know whether it is working and doing a good job.
  - Evaluating the difference between running something in Keras and in PyTorch: set up the same architecture in both, then use a not particularly interesting scientific question to answer a technical one: how do the platforms differ?
- Deep learning gives mysterious outputs. Lots can be done on interpretable AI. It is a massive topic.
- NLP: Learning how, e.g., LDA works. Bag of words treats words as independent draws. Lots of more sophisticated models are context-aware. NNs can do it, but they are not the key.
- Image processing: Classification task. Pug or Muffin game.
- Graph Neural Networks: Getting them to work is a challenge. Can use in classification.
- We can dig into the specifics that interest us (e.g., parallelism).
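
To make the bag-of-words remark above concrete, here is a minimal sketch using only the Python standard library. The helper name `bag_of_words` and the two example sentences are our own illustrations, not something from the lecture. The point is that word order is discarded, so two sentences with opposite meanings can get the same representation.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring order and context (hypothetical helper)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two illustrative sentences with the same words in a different order.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a["the"])    # number of times "the" appears in doc_a
print(bow_a == bow_b)  # True: the representation cannot tell the sentences apart
```

Context-aware models exist precisely because this representation loses the distinction between the two sentences.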

**Assessment**
- How did you set things up, and how did you go about it? Off-the-shelf tools are encouraged. Try things out, and look into why or how they work.
- Critical thing to think about: if this goes to production, what changes?

---

For ease of reference, here is the project description.

# Data Science Toolbox Assessed Coursework 2: Data at Scale

**Deadline:** Wednesday Noon, Week 11

## Group Project Description

You will choose an application domain that your group will work with for Assessment 2. Your challenge is to apply Massively Parallel Data Science technology to that data.

### Requirements

You should:

- Choose an appropriate scientific/analysis question;  
- Use an appropriate strategy to learn about the computational performance of the model(s);  
- Apply the model(s) to achieve your question.
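
One possible way to approach the second requirement is to time a workload at increasing input sizes and look at how the runtime grows. The sketch below is only an illustration and is not part of the official brief: `time_once` and `scaling_profile` are hypothetical helper names, and Python's built-in `sorted` stands in for whatever model-fitting step the group actually uses.

```python
import random
import time


def time_once(fn, data):
    """Return the wall-clock seconds taken by a single call fn(data)."""
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start


def scaling_profile(fn, sizes, repeats=3, seed=0):
    """Time fn on random inputs of each size, keeping the best of `repeats` runs."""
    rng = random.Random(seed)
    results = []
    for n in sizes:
        data = [rng.random() for _ in range(n)]
        # Pass a fresh copy each time so every run sees identical input.
        best = min(time_once(fn, list(data)) for _ in range(repeats))
        results.append((n, best))
    return results


# `sorted` is a placeholder workload; swap in a training or inference step.
profile = scaling_profile(sorted, sizes=[1_000, 10_000, 100_000])
for n, seconds in profile:
    print(f"n={n:>7}  best of 3: {seconds:.6f} s")
```

Taking the minimum over repeats reduces noise from other processes. Plotting the resulting pairs on log-log axes is one simple way to estimate how runtime might extrapolate to larger data.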

### Appropriate Methods

- **Classifiers**
- **Neural Networks**
  - Auto-encoders
  - Choice of Deep Learning platform
  - Choice of architecture
  - Interpreting Deep Learning decisions
- **Text Processing Systems**
  - Latent Dirichlet Allocation
- **Image Processing Systems**
- **Recommender Systems**
- **Exploratory Data Analysis**
  - Graph visualisation and algorithms on graphs

### Appropriate Technologies

You may use this opportunity to delve further into:

- **Algorithmic Approaches**
- **Parallelism via GPUs**
- **Map/Reduce**
- **(Py)Spark**
- **Pregel/GraphX for graph-based computation**
- **Distributed Data vs Distributed Computation Paradigms**
- **Comparison Between Parallel and Single-Machine Problems**
- **Scaling Performance in Terms of Data Volume and Resource Allocated**
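
To make the Map/Reduce item above concrete, here is a minimal single-machine sketch of the pattern as a word count. It is not part of the official brief: `map_chunk`, `reduce_counts`, and the example `chunks` are hypothetical. The built-in `map` runs sequentially here, whereas a real framework would distribute the map step across workers.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one chunk of the data, independently of other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Illustrative data, already split into chunks as a distributed store might hold it.
chunks = [
    ["the cat sat", "the dog sat"],
    ["the cat ran"],
]

partials = list(map(map_chunk, chunks))             # could run in parallel, one chunk per worker
total = reduce(reduce_counts, partials, Counter())  # combine the partial results

print(total["the"], total["cat"], total["sat"])  # 3 2 2
```

Because each chunk is processed independently and the merge is associative, the map step could be handed to something like `multiprocessing.Pool.map` or a Spark job without changing the logic.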

### Advice on Assessment

You will be assessed on:

a) The implementation of the model, that is, you can be awarded credit for:
   - Additional implementation if an off-the-shelf implementation falls short.
   - Exploring multiple implementations.
   - Examining the mathematical details of choices.

b) The application of the model to your chosen domain, that is, you can be awarded credit for:
   - Identifying an appropriate dataset.
   - Using your understanding of the structure of datasets to make arguments comparing the dataset you chose to one that you might encounter in a “real” data-science setting.
   - Plotting or otherwise describing various inputs, outputs, or parameters.

c) The correctness of the methods used to achieve their stated goals.

d) The robustness of the results in supporting the conclusions.

In order to be attributed credit for your efforts to choose appropriate data, ensure that you document the data exploration process. You should aim to demonstrate diligence that there is no more appropriate data source in your chosen category. You do not need to excel in all areas in order to get a high mark. Instead, you need to perform robustly in all areas and additionally demonstrate insight somewhere to score highly. You are not expected to work in a genuine high-volume environment, but you should demonstrate how you expect your method would perform at scale.

### Individual Reflection Description

- Discuss the rationale behind the inference goal that you selected.
- Discuss what changes you might have to make were the volume of data to be increased by a factor of 1000.
- Relate your data source to those you might encounter in a real-world setting.
- Discuss a mathematical issue raised in the project, different from those of your group.