# Introduction

This notebook contains more advanced tasks, or mini-projects, which employ the methods and algorithms we are learning in this course. The tasks describe problems that can be handled by learning predictors and classifiers, building mathematical models, manipulating images, etc.

The tasks are not rigid: you are free to propose your own variation of a proposed task, or even to come up with your own mini-project.

These pet projects are expected to be the independent work of a student, although we will guide and help when necessary.

The mini-project has to be returned in Jupyter notebook format (see the instructions below).

**Note!** There is no guarantee that these tasks have perfect solutions. Therefore, we will grade notebooks based not just on the result, but also on the ways the student tried to solve the specific problem.

## Problem 1. Predicting infectious disease spread with SIR model

### (300 points)


"Mathematical models can project how infectious diseases progress to show the likely outcome of an epidemic and help inform public health interventions. Models use basic assumptions or collected statistics along with mathematics to find parameters for various infectious diseases and use those parameters to calculate the effects of different interventions, like mass vaccination programmes. The modelling can help decide which intervention/s to avoid and which to trial, or can predict future growth patterns, etc." [link](https://en.wikipedia.org/wiki/Mathematical_modelling_of_infectious_disease)

One of the simplest models for describing the spread of an infectious disease is the SIR model, which stands for Susceptible, Infected and Removed. SIR is a compartmental model, meaning that the whole population is divided into compartments (three in the case of SIR).

<img src="SIR_Flow_Diagram.svg" alt="Drawing" style="width: 600px;"/>

The aim of the project is to formulate, implement and compare SIR models of the COVID-19 spread in several different countries. As an option, one can compare SIR models of two different infectious diseases (e.g. Ebola and COVID-19). It is advisable to choose the most complete datasets available.

The [basic reproduction number](https://en.wikipedia.org/wiki/Basic_reproduction_number) should also be estimated.
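
The SIR equations are not spelled out in this notebook, so the sketch below assumes the standard textbook form with a transmission rate `beta` and a removal rate `gamma` (both names are assumptions of this example), for which the basic reproduction number is `beta / gamma`. It fits the two rates to an infected curve by least squares. The data are synthetic, and how reported case counts map onto S, I and R is a modelling choice left to the project.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares


def sir_rhs(t, y, beta, gamma):
    """Right-hand side of the assumed SIR equations.

    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
    """
    s, i, r = y
    n = s + i + r
    new_infections = beta * s * i / n
    removals = gamma * i
    return [-new_infections, new_infections - removals, removals]


def simulate_sir(beta, gamma, y0, t_eval):
    """Integrate the model; returns an array of shape (3, len(t_eval))."""
    sol = solve_ivp(sir_rhs, (t_eval[0], t_eval[-1]), y0, t_eval=t_eval,
                    args=(beta, gamma), rtol=1e-8, atol=1e-8)
    return sol.y


def fit_sir(t, infected_observed, y0, initial_guess=(0.5, 0.2)):
    """Least-squares estimate of (beta, gamma) from an infected curve."""
    def residuals(params):
        beta, gamma = params
        return simulate_sir(beta, gamma, y0, t)[1] - infected_observed

    result = least_squares(residuals, initial_guess,
                           bounds=(1e-6, 5.0), diff_step=1e-4)
    return result.x


# Synthetic example: hypothetical population of 1000 with one initial case.
t = np.linspace(0, 60, 61)
y0 = [999.0, 1.0, 0.0]
infected = simulate_sir(0.4, 0.1, y0, t)[1]

beta_hat, gamma_hat = fit_sir(t, infected, y0)
r0_hat = beta_hat / gamma_hat  # basic reproduction number in this model
```

With real data the fit is usually less clean, so it is worth plotting the fitted curve against the observations and trying several initial guesses.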

## Problem 2. Analyses of microarray gene expression data

### (300 points)

"A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome." [link](https://en.wikipedia.org/wiki/DNA_microarray)

<img src="microarray.png" alt="Drawing" style="width: 500px;"/>


[image source](https://www.researchgate.net/figure/Schematic-representation-of-the-DNA-microarray-methodology-DNA-microarrays-are-commonly_fig3_235431026)

**DATA**

The microarray data for this problem consist of normalized relative expression levels of certain genes measured in different tissues.
There are 22283 gene probes and 5373 samples. 
The full dataset can be downloaded from https://www.ebi.ac.uk/arrayexpress/ (accession number E-MTAB-62). Download the processed data and the 'Sample and data relationship' file. The 'Sample and data relationship' file contains annotations for all samples. Annotations are based on the relevance of the sample to a certain type of tissue. There are bigger meta-groups (e.g. 'cell line', 'disease', 'neoplasm', 'normal') and finer divisions (6, 15, 96, 369 groups). The gene expression data and the annotations of a sample can be accessed by sample id (e.g. '1102960533.CEL').

**Note!** Use the expression data of the first 10 000 probes, as using the full dataset might be slow and require more computational resources.


**Related articles**
- https://www.nature.com/articles/srep25696#Sec8
- https://www.nature.com/articles/nbt0410-322


**Choose one of the two tasks (each 300 points):**

### Subproblem 2a. PCA and clustering of microarray gene expression data 

The task is to cluster the array data after reducing its dimensionality with PCA.
Check whether the clusters you obtain represent any biologically defined groups.
For example, you can obtain 6 clusters and check whether these clusters correspond to the meta-group division containing 6 tissue types.
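
A minimal sketch of this workflow, assuming scikit-learn is available: standardize, reduce with PCA, cluster with k-means, and compare the clusters with the annotations using the adjusted Rand index. The data are synthetic stand-ins, and k-means, the number of components and the agreement score are illustrative choices rather than requirements of the task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix: 300 samples x 50 probes,
# drawn from 3 hypothetical groups with shifted means.
n_groups, n_per_group, n_probes = 3, 100, 50
centers = rng.normal(scale=5.0, size=(n_groups, n_probes))
X = np.vstack([c + rng.normal(size=(n_per_group, n_probes)) for c in centers])
annotations = np.repeat(np.arange(n_groups), n_per_group)

# Standardize each probe, then keep the first 10 principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

# Cluster in the reduced space.
kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_reduced)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(annotations, cluster_ids)
```

The adjusted Rand index ignores how clusters are numbered, which is convenient because cluster ids returned by k-means are arbitrary.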


### Subproblem 2b. PCA and classification of microarray gene expression data

Alternatively, we can define this problem as a classification problem, where the annotations serve as sample labels. The task is to implement PCA and classification, and to perform model validation.
In the simplest scenario, you can divide the dataset into two parts, a training set and a test set.
Alternatively, you can perform cross-validation with e.g. K folds.

**Note!** Think about when to transform the data with PCA: before or after splitting the data? If PCA is performed after splitting, should it be fitted only on the training set, with the test data projected onto the principal components of the training set, or should PCA be performed separately on the training and test sets?

**Note!!** The groups defined by the annotations are not of equal size (unbalanced). Take this into account when splitting the data or performing any kind of analysis.
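
One common way to keep held-out data from influencing the principal components is to put PCA inside a scikit-learn pipeline, so that it is refitted on the training part of every split and the held-out part is only projected. The sketch below also uses stratified folds and balanced class weights as one possible response to unbalanced groups. The data are synthetic, and logistic regression with 10 components is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, unbalanced stand-in data: 200 vs 40 samples, 100 features.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 100)),
               rng.normal(0.7, 1.0, size=(40, 100))])
y = np.array([0] * 200 + [1] * 40)

# Scaling and PCA are fitted on the training folds only, because they are
# steps of the pipeline that cross_val_score refits for every split.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
```

Balanced accuracy averages the recall of each class, so a model that always predicts the largest group does not look deceptively good.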

## Problem 3. Sleep stages classification of rodent EEG signal

### (400 points)

"Electroencephalography (EEG) is an electrophysiological monitoring method to record electrical activity of the brain. EEG measures voltage fluctuations resulting from ionic current within the neurons of the brain" [link](https://en.wikipedia.org/wiki/Electroencephalography)


In the current task you will have to build a classifier for automatically assigning the brain activity state based on the EEG recording. 

<img src="eeg.png" alt="Drawing" style="width: 600px;"/>

[image source](https://www.nature.com/articles/s41598-019-51269-8)

**DATA**

Approximately 4 hours of EEG recording from each of 34 animals. The sampling rate of the EEG signal is 200 Hz. 
EEG signals were manually analyzed in 4-second time intervals, and a label was assigned to each of these intervals (which are also called epochs).

Labels are codes (0, 1, 2, 3, 8) for different classes: 0 - Wake state, 1 - NREM sleep, 2 - REM sleep, 3 - Artefact, 8 - not analyzed.

For example, if you have a recording with 1600 data points at a sampling rate of 200 Hz, the recording lasts 8 seconds, or 2 epochs; therefore 2 labels will be assigned to this recording.

The EEG data is in the 'EEG.npy' file, which contains the recordings of 34 animals. Each EEG recording is divided into 4000 intervals of 4 s (4 s = 800 data points at a sampling rate of 200 Hz), therefore the shape of the whole numpy array is (34, 4000, 800). 
The file 'EEG_codes.npy' contains the labels of the 4 s epochs (the shape of the numpy array is (34, 4000)).

The epochs with label '8' (and associated EEG signal) should be removed from classification.
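
A short sketch of this filtering step with a boolean mask. Small random arrays stand in for the contents of 'EEG.npy' and 'EEG_codes.npy' (which would normally be read with `np.load`); only their layout follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins with the documented layout:
# (animals, epochs, samples) and (animals, epochs).
eeg = rng.normal(size=(3, 50, 800))
codes = rng.choice([0, 1, 2, 3, 8], size=(3, 50))

keep = codes != 8                   # boolean mask, shape (3, 50)
epochs = eeg[keep]                  # kept epochs, shape (n_kept, 800)
labels = codes[keep]                # matching labels, shape (n_kept,)
animal_ids = np.nonzero(keep)[0]    # animal index of every kept epoch
```

Keeping `animal_ids` makes it possible to normalize per animal later, or to split the data so that one animal never appears in both the training and the test set.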

The classifier should be validated with validation and/or test sets. In addition, you can build several classifiers and compare them with cross-validation methods.

**Hint!** You have to think about what to use as features for building a classifier. One option is to extract the power spectrum of each epoch and use it as a feature vector. One of the common ways to get power-frequency information from a signal is the Fourier transform. In Python you can check `spectrogram` from `scipy.signal` and `hann` from `scipy.signal.windows` to perform a fast Fourier transform with a Hann (Hanning) window (typically of size 256 for an EEG signal).
It is advisable to filter out the lower frequencies (below 1.5-2 Hz), as they usually represent movement artefacts.
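
A minimal sketch of such a feature extractor, assuming epochs are stored as rows of 800 samples. Averaging the spectrogram over the windows inside an epoch, taking the logarithm of the power, and the 2 Hz cut-off are choices made for this example, not requirements.

```python
import numpy as np
from scipy.signal import spectrogram


def power_spectrum_features(epochs, fs=200, nperseg=256, min_freq=2.0):
    """Return (freqs, features) for epochs of shape (n_epochs, n_samples).

    features[i] is the log10 power spectrum of epoch i, restricted to
    frequencies >= min_freq.
    """
    freqs, _, sxx = spectrogram(epochs, fs=fs, window="hann",
                                nperseg=nperseg, axis=-1)
    power = sxx.mean(axis=-1)          # average over windows within an epoch
    keep = freqs >= min_freq           # drop the lowest frequencies
    return freqs[keep], np.log10(power[:, keep] + 1e-12)


# Synthetic check: five noisy epochs dominated by a 10 Hz oscillation.
rng = np.random.default_rng(0)
t = np.arange(800) / 200
demo_epochs = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.normal(size=(5, 800))

freqs, features = power_spectrum_features(demo_epochs)
peak_freq = freqs[features.mean(axis=0).argmax()]
```

With a window of 256 samples at 200 Hz the frequency resolution is about 0.78 Hz, so the peak lands on the bin closest to 10 Hz rather than exactly on it.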

You can also use other feature extraction/dimensionality reduction methods such as PCA.

**Note!** The EEG signal collected from each animal is unique, and the quality of the recording depends on many factors. Therefore, even the value range of the EEG signal may differ from animal to animal, and you might want to use some sort of normalization to account for this effect.
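
One simple option, shown here only as a sketch, is to z-score each animal's recording with its own mean and standard deviation. The function name is hypothetical and the input is synthetic; other schemes (for example normalizing the spectral features instead of the raw signal) may work better.

```python
import numpy as np


def zscore_per_animal(eeg):
    """Standardize each animal separately.

    eeg has shape (animals, epochs, samples); the mean and standard
    deviation are computed over all epochs and samples of one animal.
    """
    mean = eeg.mean(axis=(1, 2), keepdims=True)
    std = eeg.std(axis=(1, 2), keepdims=True)
    return (eeg - mean) / std


# Synthetic recordings of three animals with very different amplitudes.
rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 20.0]).reshape(3, 1, 1)
eeg = scales * rng.normal(size=(3, 40, 800)) + 3.0

eeg_norm = zscore_per_animal(eeg)
```

If unanalyzed or artefact epochs are extreme, it may be preferable to compute the statistics after removing them.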

**Note!!** The data is highly unbalanced: 'Wake' and 'NREM' make up the majority of the recording, 'REM' is much rarer, and 'Artefact' is an anomaly of the recording. Therefore, you might want to filter out the 'Artefact' data and select a part of the data with equal representation of 'Wake', 'NREM' and 'REM'. Alternatively, you can try to build an artefact detection method to detect faulty EEG recording segments.
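
Selecting an equally represented subset can be sketched as random undersampling: every class is reduced to the size of the rarest one. The helper name and the label counts below are made up for illustration.

```python
import numpy as np


def balanced_subsample(labels, classes=(0, 1, 2), seed=0):
    """Return sorted indices with the same number of epochs per class.

    Labels not listed in `classes` (e.g. 3 for 'Artefact') are dropped.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_per_class = min(int(np.sum(labels == c)) for c in classes)
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(chosen))


# Made-up label vector: many Wake/NREM epochs, few REM, some artefacts.
labels = np.array([0] * 500 + [1] * 400 + [2] * 60 + [3] * 40)
idx = balanced_subsample(labels)
counts = np.bincount(labels[idx], minlength=4)
```

Undersampling is usually applied to the training data only, so that the validation and test sets still reflect the real class proportions.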


## Problem 4. Spam detection 

### (200 points)

Build a simple spam detection classifier based on these data:

https://www.kaggle.com/uciml/sms-spam-collection-dataset
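
As a starting point, a simple baseline could combine TF-IDF word features with a naive Bayes classifier. The sketch below uses a handful of made-up messages instead of the Kaggle data; loading the downloaded file (file name, column names, encoding) is left out and should be checked against the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set standing in for the real SMS collection.
messages = [
    "are we still meeting for lunch today",
    "can you call me when you get home",
    "see you at the lecture tomorrow",
    "thanks for the notes, very helpful",
    "WINNER! claim your free prize now",
    "free entry in a weekly cash prize draw",
    "urgent: you have won a free holiday, call now",
    "txt WIN to claim your cash reward now",
]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

# The vectorizer turns text into TF-IDF weighted word counts;
# the classifier is fitted on those features.
spam_model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
spam_model.fit(messages, labels)

predictions = spam_model.predict([
    "claim your free cash prize now",
    "see you for lunch tomorrow",
])
```

On the real data the model should be evaluated on held-out messages, and because spam is typically the minority class, precision and recall are more informative than plain accuracy.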



## Problem 5. Foreground separation from video

### (200 points)

Perform background and foreground separation of the video (https://www.youtube.com/watch?v=8IVMo9lvJQk) and compile a new video containing only the moving object.

**Hint!** Use PCA to extract background.
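
One way to read this hint, sketched under assumptions: flatten every grayscale frame into a row, take the leading component of an SVD of the frame matrix (an uncentered relative of PCA) as the static background, and treat the residual as foreground. Reading and writing the video itself is omitted; the frames below are synthetic and the function name is hypothetical.

```python
import numpy as np


def split_background(frames):
    """Split frames of shape (n_frames, height, width) into two parts.

    The background is the rank-1 approximation of the flattened frame
    matrix; the foreground is whatever that approximation does not explain.
    """
    n_frames = frames.shape[0]
    flat = frames.reshape(n_frames, -1).astype(float)
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    background = (s[0] * np.outer(u[:, 0], vt[0])).reshape(frames.shape)
    return background, frames - background


# Synthetic video: a static horizontal gradient with a bright 3x3 square
# that moves one pixel to the right in every frame.
height, width, n_frames = 32, 32, 20
static = np.tile(np.linspace(0.2, 1.0, width), (height, 1))
frames = np.repeat(static[None], n_frames, axis=0)
for k in range(n_frames):
    frames[k, 10:13, k:k + 3] += 1.0

background, foreground = split_background(frames)
moving_mask = np.abs(foreground) > 0.5   # threshold chosen for this toy data
```

A real video has many more pixels per frame, so it may help to downscale the frames or process the video in chunks, and the threshold for the foreground mask has to be tuned.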

# How to prepare mini-project notebook 

The notebook should consist of the following sections:



### Introduction

Provide short general background on the problem and formulate the aim of the project.
Formulate the task as a machine learning problem: what are the data points, the features of the data points, and the labels (if available)? Describe the format and shape of the data.

Describe the predictor/model of choice and its hypothesis space. Describe how the model was validated (loss function, cross-validation, model metrics).

### Implementation

Provide implementation of the task with comprehensive code comments.


### Results 

Provide the final results and an evaluation of the model. Discuss the problems you encountered while solving the task.

### Conclusions

Describe in general terms whether the problem was solved successfully, what the conclusions are, and what other extensions of the specific problem or other solutions are possible.