# Data Cleaning
This notebook cleans and prepares the data. The dataset provided by {cite}`fitzgerald_morrin_holland_2021` contains gas chromatography-mass spectrometry (GCMS) analyses of volatile organic compounds (VOCs) from pure cultures of bacteria. The data is semi-structured and presents some challenges, such as missing values. In the Excel file, the GCMS data is presented in two formats:
1. Long
2. Wide

Both sheets represent the same data. We will work with the '**Wide**' sheet because Google's AutoML Tables works better with features represented as columns. The Excel file contains several other sheets, but they serve no purpose for our analysis.
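
To illustrate how the two layouts relate, here is a minimal sketch that pivots a long table into a wide one. The column names (`sample`, `compound`, `abundance`) and the values are made up for illustration and are not taken from the Excel file.

```python
import pandas as pd

# Hypothetical long-format data: one row per (sample, compound) measurement.
long_toy = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2"],
        "compound": ["ethanol", "acetone", "ethanol"],
        "abundance": [1.0, 2.0, 3.0],
    }
)

# Wide format: one row per sample, one column per compound.
wide_toy = long_toy.pivot(index="sample", columns="compound", values="abundance")
print(wide_toy)
```

Any sample-compound pair that has no row in the long table becomes `NaN` in the wide table, which is one way missing cells can arise in a wide layout.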

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the 'Wide' sheet, skipping the first three rows of the sheet
raw = pd.read_excel("data/FrontiersDS.xlsx", sheet_name="Wide", skiprows=3)

In [5]:
raw.shape

(84, 70)

## Null-Values
In this dataset, rows represent **species and strains** of bacteria. The columns represent individual chemical compounds commonly found among the volatile organic compounds (VOCs). {cite:p}`fitzgerald2021` informs us that:
* Cells with missing data represent a species-media combination in which that particular compound was never recorded.
* Cells with the value 0 represent a species-media combination in which the compound was found in some samples, but not in this particular sample.
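
The distinction between the two kinds of cells can be checked per column. The following is a minimal sketch, assuming the compound columns are numeric; the helper name `summarise_cells` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd


def summarise_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing, zero and positive cells for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            # NaN: compound never recorded for this species-media combination
            "never_recorded": numeric.isna().sum(),
            # 0: compound seen in other samples, but not in this one
            "absent_in_sample": (numeric == 0).sum(),
            # > 0: compound detected in this sample
            "detected": (numeric > 0).sum(),
        }
    )


# Toy frame with made-up compound names and values
toy = pd.DataFrame(
    {"compound_a": [np.nan, 0.0, 1.5], "compound_b": [0.0, 0.0, 2.0]}
)
summary = summarise_cells(toy)
print(summary)
```

In the notebook, the same helper could be applied to the loaded sheet with `summarise_cells(raw)`.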

Because missing cells and zeros carry different meanings, it is not obvious how the missing values should be handled. According to the Google Cloud Platform documentation, ['Best Practices for creating training data'](https://cloud.google.com/automl-tables/docs/data-best-practices#avoid_missing_values_where_possible), it is best to avoid missing values where possible. Values can be left missing if the column is set to be nullable.

[**TPOT**](http://epistasislab.github.io/tpot/) is an automated machine learning (AutoML) package for Python. In this case, TPOT is the more useful choice because it gives us more control. As of *version 0.9*, TPOT supports sparse matrices through a built-in configuration, "TPOT sparse". To keep the missing values, we must therefore use this configuration.
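
As a rough sketch of how the "TPOT sparse" configuration might be selected, assuming a classification task, the classic TPOT API (`TPOTClassifier` with a `config_dict` argument) and SciPy. The helper names `to_sparse` and `build_tpot_sparse` and the search settings are illustrative and not part of the original analysis. In a SciPy sparse matrix, zeros become implicit while `NaN` cells remain explicitly stored entries; whether every operator in the configuration accepts `NaN` entries is not verified here.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_sparse(features: pd.DataFrame) -> sparse.csr_matrix:
    """Convert numeric features to CSR format.

    Zeros are dropped (implicit); NaN cells are kept as stored entries.
    """
    return sparse.csr_matrix(features.to_numpy(dtype=float))


def build_tpot_sparse(generations: int = 5, population_size: int = 20):
    """Create a TPOT classifier restricted to the built-in sparse configuration."""
    # Imported lazily so the helper above works without TPOT installed.
    from tpot import TPOTClassifier

    return TPOTClassifier(
        config_dict="TPOT sparse",  # built-in configuration for sparse input
        generations=generations,  # illustrative search settings
        population_size=population_size,
        random_state=0,
        verbosity=2,
    )


# Toy frame with made-up values: one NaN, three zeros, two positive cells
toy = pd.DataFrame({"a": [1.0, 0.0, np.nan], "b": [0.0, 2.0, 0.0]})
X_sparse = to_sparse(toy)
print(X_sparse.shape, X_sparse.nnz)
```

Fitting would then follow the usual pattern, for example `build_tpot_sparse().fit(X_sparse, y)`, where `y` is a label vector that this notebook has not yet defined.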

In [3]:
# Save the loaded Wide sheet as CSV, without the DataFrame index
raw.to_csv('data/cleaned/long.csv', index=False)