<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# 🧫 Finding Outliers in Mass Spectrometer Data

<img src="https://drive.google.com/uc?id=1eIB1nVKS3u6GLEHxvAob9-XjW84PENJx" style="width:600px">

Welcome to our group's **<a href="http://www.carbonateresearch.org/clumpedLab">mass spectrometer lab!</a>** 

When we are not busy on a `machine learning` project, we work on reconstructing temperature from carbonate minerals using a technique known as <a href="http://www.carbonateresearch.org/clumped_isotope"> Clumped Isotope Paleo-thermometry</a>. This is achieved by using a highly sensitive mass spectrometer, and measuring samples and `standards`.

One of the issues with sensitive measurements is that many things can go wrong, and this can impact the quality of the data. This is especially problematic because the data needs to be corrected by the use of `external standards`: carbonate material with known values.

In this exercise, you will use a range of `anomaly detection` methods to determine whether or not there are outliers in a set of standards. This is actual data, from our actual lab: in fact, it comes from the work of one of my PhD students. If you find an outlier, they will want to know!

First, create a new DataFrame called `data` by loading the `mass_spectrometer.csv` file, and inspect it.

In [None]:
from nbta.utils import download_data
download_data(id='1iRkWXEUM8tnZkvP80RcueJLj7_6M8soJ')

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

## Features

The data contains 6 different types of carbonate standards (*ETH1-4*, *IOL* and *Carrara Marble*), and we measured the following features:
* `first_MX44_Ref_Gaz`: the signal (in *mV*) measured on the reference side of the mass spectrometer: a low value could indicate that the reference side contained too little gas
* `first_MX44_Sample_Gaz`: Same as above, but for the sample side
* `D47`: The clumped isotope value of isotopologues of mass 47 (this is the main isotopologue to be used to estimate the temperature of precipitation of the mineral)
* `d13C`: The carbon isotope ratio of the standard
* `d18O`: The oxygen isotope ratio of the standard
* `49_param`: The 49 parameter, a measure of contamination
* `name`: The name of the standard - essentially, a label

## Expected values

Because we are dealing with standards, we know what values to expect. This information is in the `standards.csv` file: load it into a `standards` DataFrame and have a look at the values.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Deviation from expected values

One of the challenges we have is that we want to estimate whether any given standard deviates from its expected values. But if we look at the raw values in `d13C`, `d18O`, and `D47`, each standard will have different values (the other features, `first_MZ44_Ref_Gas`, `first_MZ44_Sample_name` and `49_param`, are expected to be the same for all measured material). In order to determine whether a measurement deviates from what we expect, we will need to calculate the difference between each measured value and the expected value for that standard.

Using the `data` DataFrame for the measured samples, and the expected standard values (the `standards` DataFrame) that I have given you above, create three new columns that reflect the difference between the measured values (`data`) and the expected values (`standards`): `d13C_diff`, `d18O_diff`, and `D47_diff`.
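One possible way to line up each measurement with the expected values of its own standard is a left merge on `name`. The sketch below uses made-up standard names (`STD_A`, `STD_B`) and made-up numbers, and it assumes that both tables share a `name` column and identically named isotope columns; the real files may be organised differently.

```python
import pandas as pd

# Toy stand-ins for the two CSV files: names and values are invented for illustration.
data = pd.DataFrame({
    "name": ["STD_A", "STD_B", "STD_A"],
    "d13C": [1.5, -10.0, 1.0],
    "d18O": [-2.0, -17.0, -2.5],
    "D47": [0.25, 0.60, 0.20],
})
standards = pd.DataFrame({
    "name": ["STD_A", "STD_B"],
    "d13C": [1.0, -10.0],
    "d18O": [-2.0, -18.0],
    "D47": [0.20, 0.60],
})

# For every measured row, look up the expected values of the matching standard.
# A left merge keeps one row per measurement, in the original order of `data`.
expected = data[["name"]].merge(standards, on="name", how="left")

for col in ["d13C", "d18O", "D47"]:
    # Subtract positionally (NumPy arrays) so that index labels cannot misalign.
    data[f"{col}_diff"] = data[col].to_numpy() - expected[col].to_numpy()

print(data)
```

If a measured name is missing from the standards table, the merge leaves `NaN` in the corresponding `_diff` columns, which makes such rows easy to spot.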

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('check_df',
                         df = data
)

result.write()
print(result.check())

# Preparing the dataset

We are nearly ready to do our investigation and see if we find any outliers. But first, we want to do a few things:

* Create a `y` label based on the `name` column, and use a `LabelEncoder` to encode it. This will allow us later to plot the data with a unique color for each type of sample
* Create a new `X` feature matrix that will contain the following features: `first_MZ44_Ref_Gas`, `first_MZ44_Sample`, `49_param`,  `d13C_diff`, `d18O_diff`, `D47_diff`. As you can see, we have dropped the original isotope features and kept only their differences (the other three features are kept as they are)
* Because many of our algorithms are sensitive to scale, we want to use a scaler that transforms our values into a range from 0 to 1 (choose well)

Create the `y` label and the `X` feature as suggested above.
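As a rough illustration of the two preprocessing steps, the sketch below encodes a label column and rescales two numeric columns into the range 0 to 1. The column names `feature_1` and `feature_2` and the standard names are placeholders, and `MinMaxScaler` is only one scaler that fits the description above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

rng = np.random.default_rng(0)

# Toy table with a label column and two numeric features on very different scales.
toy = pd.DataFrame({
    "name": ["STD_A", "STD_B", "STD_A", "STD_C"],
    "feature_1": rng.normal(size=4),
    "feature_2": rng.normal(loc=100, scale=20, size=4),
})

# Encode the text labels as integers (classes are numbered in sorted order).
y = LabelEncoder().fit_transform(toy["name"])

# Rescale each feature column independently so its minimum is 0 and its maximum is 1.
X = MinMaxScaler().fit_transform(toy[["feature_1", "feature_2"]])

print(y)
print(X.min(axis=0), X.max(axis=0))
```

Note that `fit_transform` returns a NumPy array; wrap it in a `DataFrame` again if you want to keep the column names.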

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('data_prepared',
                         data = X,
                         label = y
)

result.write()
print(result.check())

# Modelling

We are now in a position where we can model potential outliers. Outlier detection can be subjective, so our strategy will be to use three different algorithms (`IsolationForest`, `OneClassSVM`, and `HDBSCAN`) and combine their outputs to decide what counts as an outlier.

## `IsolationForest`

Create an `IsolationForest` model, fit it, and predict on your `X` features (save the prediction into a variable called `ifor_pred`). Then, create a scatter plot of `D47_diff` versus `d13C_diff`, and use `ifor_pred` as the color of each datapoint (look at the `c` parameter of `plt.scatter`).

Do you think that the amount of outliers suggested by the `IsolationForest` is reasonable?
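To get a feel for the output format before running this on the real data, here is a minimal sketch on synthetic points: a Gaussian cloud plus two points placed far away on purpose. The data and the variable `X_toy` are invented; only the `IsolationForest` calls carry over.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 100 "normal" points around the origin, plus two deliberately extreme points.
X_toy = np.vstack([
    rng.normal(0, 1, size=(100, 2)),
    [[8.0, 8.0], [-9.0, 7.0]],
])

ifor = IsolationForest(random_state=0)
ifor_pred = ifor.fit_predict(X_toy)  # 1 = inlier, -1 = outlier

n_outliers = int((ifor_pred == -1).sum())
print(f"{n_outliers} of {len(X_toy)} points flagged as outliers")
```

With the default settings the model typically flags more than just the two extreme points, which is worth keeping in mind when judging the number of outliers it reports on real data. The resulting array can be passed directly as `c=ifor_pred` to `plt.scatter`.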

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### A better visualization thanks to PCA

The original space is not ideal for telling outliers from normal samples, because we used 6 features for outlier detection but can only plot 2 of them at a time. A better approach is to perform a PCA, and plot the data along its first two principal components.

Do the following:
* create a PCA model, and `fit_transform` your `X` features into a new variable (`data_pca`)
* Replot the `ifor_pred`, but this time use the `first principal component` and the `second principal component` of your data as your x and y for the scatter plot
* Still use the `ifor_pred` as the `c` parameter

Are things clearer this way?
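The projection step could look like the sketch below, shown here on random 6-feature data (`X_toy` is a stand-in for your scaled feature matrix).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for a scaled feature matrix: 50 samples, 6 features.
X_toy = rng.random((50, 6))

# Keep only the first two principal components for plotting.
pca = PCA(n_components=2)
data_pca = pca.fit_transform(X_toy)

pc1, pc2 = data_pca[:, 0], data_pca[:, 1]  # x and y for the scatter plot
print(data_pca.shape, pca.explained_variance_ratio_)
```

`explained_variance_ratio_` tells you how much of the total variance the two plotted axes capture; if that share is small, the 2D picture may hide structure that the outlier detector can see in 6 dimensions.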

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details><summary>Observations</summary><br>
    We can indeed see a better separation between the outliers and the normal samples. However, we seem to have many outliers!</details>
    
### Ensuring that our outliers are not representing different samples

One worry is that, given that we have different types of samples, the algorithm may simply be picking up on valid differences between them. To ensure that this is not the case, redraw the same plot as above, but this time using `y` (effectively, the sample names) as your `c` parameter. Comparing the two, do you see any evidence that the outliers are simply different types of samples?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details><summary>Observations</summary><br>
    Although there is a little bit of separation of different samples along <code>principal component 2</code> we can clearly see that outliers belong to different samples. We do not seem to have an issue.</details>
    
## One-Class SVM

Now, use a One-Class SVM to predict outliers, and save the results into `ocsvm_pred`. Plot the data in the `PCA` space, using `ocsvm_pred` as the color for your label. What do you observe?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details><summary>Observations</summary><br>
    Again, <code>OneClassSVM</code> seems to pick too many outliers. This could be sorted by setting the <code>nu</code> hyperparameter (effectively, an upper bound on the fraction of outliers) to something more reasonable, for instance 0.05 (5%). The default is 0.5 (50%)! Go ahead, change that and replot your data. </details>
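The effect of `nu` can be checked quickly on synthetic data. The sketch below fits a `OneClassSVM` twice on the same random matrix (`X_toy`, an invented stand-in for your features) and compares the fraction of points flagged.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Stand-in for a scaled feature matrix: 200 samples, 3 features in [0, 1].
X_toy = rng.random((200, 3))

fractions = {}
for nu in (0.5, 0.05):
    ocsvm_pred = OneClassSVM(nu=nu).fit_predict(X_toy)  # 1 = inlier, -1 = outlier
    fractions[nu] = float((ocsvm_pred == -1).mean())

print(fractions)
```

The flagged fraction should roughly track `nu`, so the value you choose is in practice a statement about how many outliers you expect rather than something the model discovers on its own.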
    
## HDBSCAN

Finally, we will use `HDBSCAN`. Remember that we can use the distance from the clusters detected by `HDBSCAN` as a potential measure of how much of an outlier a sample is. Create an `HDBSCAN` model, and fit it with your `X` dataset. Then, plot your data in the `PCA` space, and use the `outlier_scores_` attribute of your fitted `HDBSCAN` model as your `c` parameter.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### From outlier score to classification

You will notice that `HDBSCAN` is different from the previous two models: the other two output either `1` (not an outlier) or `-1` (outlier), whereas `HDBSCAN` outputs a continuous value from 0 to 1 that can be interpreted as a probability of the sample being an outlier.

To be compatible with the other two models, we will need to convert this score into a class. Go ahead and create a new variable named `dbscan_pred`: use a threshold of 0.7 to label any sample above it as an outlier (`-1`), and all the others as normal samples (`1`).
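The conversion itself is a one-liner with `np.where`. In the sketch below, `scores` is a hand-written array standing in for the fitted model's `outlier_scores_`.

```python
import numpy as np

# Hand-written stand-in for the outlier scores of a fitted model.
scores = np.array([0.05, 0.92, 0.40, 0.71, 0.70])

# Strictly above the threshold -> outlier (-1); everything else -> normal (1).
threshold = 0.7
dbscan_pred = np.where(scores > threshold, -1, 1)

print(dbscan_pred)  # [ 1 -1  1 -1  1]
```

A score exactly equal to the threshold is treated as normal here, since the comparison is strict.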

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Combining the output of all of our outlier detectors

Combine the three outlier predictions into one: only classify a sample as an outlier **if all three** models flag it as an outlier. Plot the data once again, this time using this final prediction as the `c` parameter.

Does the final prediction make sense?

<details><summary><strong>💡 Tip</strong></summary><br>
    It would help if your prediction was in <code>Boolean</code> format (with the outliers labelled as <code>True</code>), at least later!</details>
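A unanimous vote can be written as an element-wise logical AND. The three arrays below are invented examples using the same `1` / `-1` convention as above.

```python
import numpy as np

# Invented predictions from three detectors (1 = normal, -1 = outlier).
ifor_pred = np.array([1, -1, -1, 1, -1])
ocsvm_pred = np.array([1, -1, 1, -1, -1])
dbscan_pred = np.array([1, -1, -1, -1, 1])

# True only where all three detectors agree that the sample is an outlier.
is_outlier = (ifor_pred == -1) & (ocsvm_pred == -1) & (dbscan_pred == -1)

print(is_outlier)  # [False  True False False False]
```

Requiring agreement from all three models is a conservative choice: it reduces false alarms at the cost of possibly missing outliers that only one or two models detect.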

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### Identifying the outliers

Create a new DataFrame called `outliers` that contains only the outliers: now you know which samples are suspicious! (This is straightforward if you followed my advice and turned your final prediction into a Boolean array.)
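With a Boolean array of the same length as the DataFrame, the selection is plain Boolean indexing. The table and the mask below are toy placeholders.

```python
import numpy as np
import pandas as pd

# Toy table and a Boolean mask marking rows 1 and 3 as outliers.
data = pd.DataFrame({
    "name": ["STD_A", "STD_B", "STD_A", "STD_C"],
    "D47": [0.20, 0.61, 0.21, 0.45],
})
is_outlier = np.array([False, True, False, True])

# Keep only the rows where the mask is True; the original index is preserved.
outliers = data[is_outlier]

print(outliers)
```

Because the original index is preserved, each row in `outliers` can be traced back to its position in `data`.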

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Making sense of the outliers

Can we make sense of the outliers? Let's do a scatter plot of the data, using the following criteria:
* Plot the `49_param` vs `D47`. `49_param` (the 49 parameter) is a measure of contamination, and `D47` is the clumped isotope composition we are interested in
* Plot your data on a large enough figure: I suggest a `figsize=(10,10)`
* Plot the entire dataset with a symbol size of 120, and make it almost transparent (`alpha=.1`)
* On the same axis, plot the `outliers`, but fully opaque (`alpha=1`) and with a square marker
* For both datasets, use the encoded `name` as your color (you might need to encode it for your `outliers`)

If you follow the instructions above, you should be able to see the outliers right on top of the original data, and plotted with the same colors.
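One detail that is easy to miss: two separate `scatter` calls only share colors if they share the same encoder and the same color range. The sketch below illustrates this on random data; the column names `x_feature` and `y_feature`, the standard names, and the choice of outlier rows are all made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(3)

# Toy dataset: three sample types, two numeric columns, three arbitrary "outliers".
toy = pd.DataFrame({
    "name": np.tile(["STD_A", "STD_B", "STD_C"], 20),
    "x_feature": rng.random(60),
    "y_feature": rng.random(60),
})
is_outlier = np.zeros(60, dtype=bool)
is_outlier[[5, 17, 42]] = True
outliers = toy[is_outlier]

# Fit the encoder once on the full dataset and reuse it for the outliers,
# so that a given name always maps to the same integer.
encoder = LabelEncoder().fit(toy["name"])
codes = encoder.transform(toy["name"])
outlier_codes = encoder.transform(outliers["name"])

# Fix the color range explicitly; otherwise each scatter call normalises
# colors to the values it happens to receive.
color_range = dict(cmap="viridis", vmin=0, vmax=len(encoder.classes_) - 1)

fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(toy["x_feature"], toy["y_feature"],
           c=codes, s=120, alpha=0.1, **color_range)
ax.scatter(outliers["x_feature"], outliers["y_feature"],
           c=outlier_codes, s=120, marker="s", alpha=1, **color_range)
ax.set_xlabel("x_feature")
ax.set_ylabel("y_feature")
```

Fitting a fresh `LabelEncoder` on the outliers alone would renumber the names starting from 0, and the colors of the squares would then no longer match the faint points underneath.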

What do you conclude?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details><summary><strong>Observations</strong></summary><br>
    This requires some experience with the details of clumped isotopes, but if you have done this right, we can make sense of the outliers that are detected here:
    <li> Two outliers plot with high <code>49_param</code>, indicating that they are probably contaminated and should be ignored</li>
    <li> One outlier plots at around 0.2 for <code>D47</code>, and clearly has the wrong label: it has probably been mislabelled, and should be renamed in the database. This happens relatively often!</li>
    <li> The last outlier plots at around 0.5 for <code>D47</code>: it could be mislabelled, it could be a little bit contaminated, or it could be fine and picked up as an outlier by accident. This requires further investigation.</li>
</details>

# Conclusions
 
Hopefully this has convinced you that outlier detection is not that hard, and very helpful to check your data. Keep in mind that this could be an initial step of a wider machine learning project.
    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('check_outliers',
                         outliers = outliers
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.