# 2. Data cleaning and initial exploration

In this notebook we will load the dataset from the Protocols track of the Hercules challenge and perform an initial exploration of it. This dataset consists of 100 experimental protocols from the [Bio-Protocols journal](https://bio-protocol.org/Default.aspx).

## Setup
As always, we will begin by loading the logging system and setting up the path to import additional code from the _src_ folder.

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
from bokeh.io import output_notebook

output_notebook()

In [3]:
from herc_common import BokehHistogram

hist = BokehHistogram(color_fill="mediumslateblue", color_hover="slateblue")

In [4]:
def print_empty_cols(df):
    for col in df.columns:
        print(col)
        print('-' * len(col))
        res = df[df[col] == ''].index
        print(f"{len(res)} articles have no value for column {col}")
        print(res)
        print('\n')


## Parsing the data
Since the _html_ file of each protocol has been scraped in the previous phase, we will begin by collecting the paths of all the files that will be parsed later on:

In [5]:
import glob

files_to_read = glob.glob(f"{PROTOCOLS_DIR}/*.html")
len(files_to_read)

100
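
The order in which `glob.glob` returns paths is not guaranteed, so the row indices of the DataFrame built later may differ between machines. The following is a minimal sketch of one way to get a reproducible order, assuming that is desirable here. `list_protocol_files` is a hypothetical helper that is not part of the notebook, and the file names are made up:

```python
import tempfile
from pathlib import Path


def list_protocol_files(protocols_dir):
    """Return the .html files found in protocols_dir, sorted by path."""
    return sorted(Path(protocols_dir).glob("*.html"))


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("e834.html", "e1467.html", "notes.txt"):
        (Path(tmp_dir) / name).write_text("<html></html>", encoding="utf-8")

    found = list_protocol_files(tmp_dir)
    # Path.stem drops the extension, which gives the protocol identifier.
    protocol_ids = [path.stem for path in found]

print(protocol_ids)
```

Note that the sort is lexicographic, so `e1467` comes before `e834`.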

We will also import a simple `Protocol` class that stores the information parsed from every HTML file and provides some utility methods for loading it into a pandas DataFrame:

In [6]:
from src.protocol import Protocol
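
The source of `src.protocol` is not shown in this notebook. As a rough illustration only, a container of this kind could look like the sketch below. The class name `ProtocolSketch`, its reduced set of fields and the choice of a dataclass are all assumptions; the only detail taken from the data shown later is that multi-valued fields appear as `|`-separated strings:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProtocolSketch:
    """Hypothetical stand-in for src.protocol.Protocol (not the real class)."""
    pr_id: str
    title: str
    abstract: str
    procedure: List[str] = field(default_factory=list)
    materials: List[str] = field(default_factory=list)
    background: str = ""  # optional field: empty string when missing

    def to_dict(self):
        # Multi-valued fields are flattened into '|'-separated strings so
        # that every DataFrame cell holds a plain string.
        return {
            "pr_id": self.pr_id,
            "title": self.title,
            "abstract": self.abstract,
            "materials": "|".join(self.materials),
            "procedure": "|".join(self.procedure),
            "background": self.background,
        }


example = ProtocolSketch(
    pr_id="e0000",  # made-up identifier
    title="Example protocol",
    abstract="A made-up abstract.",
    procedure=["Step one", "Step two"],
    materials=["Methanol", "Water"],
)
print(example.to_dict())
```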

Since not every protocol has a value for every defined field (for example, some protocols do not have a background section), a decorator is used to specify which fields are optional. These fields will be an empty string if no value can be found for them. We can now parse every file, using the file name without its extension as the protocol id:

In [7]:
import os

from src.data_reader import parse_protocol

protocols = []
for file in files_to_read:
    protocol_id = os.path.basename(file).split('.')[0]
    with open(file, 'r', encoding='utf-8') as f:
        protocols.append(parse_protocol(f.read(), protocol_id))
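
The decorator for optional fields lives in the project code and is not shown here. One possible shape for it is sketched below, under the assumption that a missing section surfaces as an exception or as `None`. The names `optional_field` and `extract_background`, and the dictionary passed to it, are hypothetical:

```python
import functools


def optional_field(func):
    """Make an extractor return '' when it cannot find a value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            value = func(*args, **kwargs)
        except (AttributeError, IndexError, KeyError):
            # The section is missing from the document.
            return ""
        return value if value is not None else ""
    return wrapper


@optional_field
def extract_background(sections):
    # `sections` is a made-up dict mapping section names to their text.
    return sections["background"]


print(repr(extract_background({"background": "Some context"})))
print(repr(extract_background({})))
```

Returning an empty string rather than `None` keeps every column a plain string column, which matches the `== ''` check used by `print_empty_cols`.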

## Creating a dataframe

Now that every protocol has been parsed, we can load them into a pandas DataFrame:

In [8]:
import pandas as pd

df = pd.DataFrame([protocol.to_dict() for protocol in protocols])
df.head()

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors
0,e1467,Measurement of Chlorophyll a and Carotenoids C...,This is a protocol for precise measurement of ...,Cyanobacteria culture (Note 1)|Methanol ≥99.9%...,Work under modest irradiance [up to 5 µmol (ph...,Eppendorf safe-lock tubes (1.5 ml)|Centrifuge ...,,Microbiology|Microbial biochemistry|Other comp...,Tomáš Zavřel|Maria A. Sinetova|Jan Červený
1,e1308,Minimal Inhibitory Concentration (MIC) Assay f...,Minimal inhibition concentration (MIC) is the ...,"A. baumannii (ATCC, catalog number: 17978 )|...",Preparation of antibiotic stock solution and d...,50 ml polystyrene culture tubes (sterile)|Spec...,,Microbiology|Antimicrobial assay|Antibacterial...,Ming-Feng Lin|Yun-You Lin|Chung-Yu Lan
2,e1471,Murine Liver Myeloid Cell Isolation Protocol,"In homeostasis, the liver is critical for the ...",7-8 weeks old female C57Black/6 mice (Janvier ...,Preparation of a liver single cell suspension ...,"Polyester filters cut in 10 x 10 cm squares, t...",,Immunology|Immune cell isolation|Myeloid cell|...,Benoit Stijlemans|Amanda Sparkes|Chloé Abels|J...
3,e834,Whole Spleen Flow Cytometry Assay,"In the Whole Spleen Flow Cytometry Assay, we u...","PerCP/Cy5.5 anti-mouse CD11b (Biolegend, catal...",Splenocyte Isolation\n\t\t\n\n\t\t\t\tExtract ...,"15 ml conical tubes (BD Biosciences, Falcon®, ...",,Immunology|Immune cell function|Cytokine|Cell ...,Cathy S. Yam|Adeline M. Hajjar
4,e1236,Dimethylmethylene Blue Assay (DMMB),Glycosaminoglycans (GAGs) are long unbranched ...,"Dimethylmethylene blue (DMMB) (Sigma-Aldrich, ...",Prepare DMMB reagent and paper filter using Wh...,"Plate mixer (VWR International, catalog number...",,Biochemistry|Carbohydrate|Glycoprotein|Biochem...,Vivien Jane Coulson- Thomas|Tarsis Ferreira...


We can already see that none of the first five protocols has a value for the "background" field. We will analyse the data in more depth in the following sections.

## Cleaning and feature engineering

In [9]:
df.describe()

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors
count,100,100,100,100.0,100,100,100.0,100,100
unique,100,100,100,99.0,100,100,38.0,99,95
top,e263,Cell Isolation of Spleen Mononuclear Cells,Subcellular localization is crucial for the pr...,,Preparation of total soluble extracts\n\t\t\n\...,60 mm culture dishes (Thermo Fisher Scientific...,,Immunology|Immune cell isolation|Lymphocyte|Ce...,Santosh K Panda|Balachandran Ravindran
freq,1,1,1,2.0,1,1,63.0,2,2


From the output above we can see that there is a unique id, title, abstract, procedure and equipment for each different protocol. Other fields have some values which are not unique. This could mean either that they have the same value or that they are empty. We will check later on whether they are empty or not.

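For now, we will start by concatenating the title, abstract and procedure of each protocol into a new column called '_full\_text_'. We will then join all the steps of the procedure and collapse the whitespace, storing the result in a second column called '_full\_text\_cleaned_':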

In [10]:
import re

def join_procedure_steps(procedure):
    # Steps are stored as a single '|'-separated string
    return ' '.join(procedure.split('|'))

def clean(procedure):
    merged_procedure = join_procedure_steps(procedure)
    # Raw string so that '\s' is not treated as an invalid string escape
    return re.sub(r'\s+', ' ', merged_procedure).strip()


df['full_text'] = df['title'] + '. ' + df['abstract'] + '. ' + df['procedure']
df['full_text_cleaned'] = df['full_text'].apply(clean)
df['full_text_cleaned'].loc[0][:500]

INFO:numexpr.utils:NumExpr defaulting to 6 threads.


'Measurement of Chlorophyll a and Carotenoids Concentration in Cyanobacteria. This is a protocol for precise measurement of chlorophyll a and total carotenoid concentrations in cyanobacteria cells. Cellular chlorophyll concentration is one of the central physiological parameters, routinely followed in many research areas ranging from stress physiology to biotechnology. Carotenoids concentration is often related to cellular stress level; combined pigments assessment provides useful insight into ce'
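
To make the effect of the cleaning step concrete, here is a small self-contained sketch applied to a made-up string that mimics the raw procedure format (steps separated by `|`, with stray newlines and tabs). `clean_text` mirrors the logic of `clean` above. `join_fields` is a hypothetical variant, not used in this notebook, that avoids a doubled period when a field already ends with one:

```python
import re


def clean_text(text):
    """Join '|'-separated steps and collapse runs of whitespace."""
    merged = " ".join(text.split("|"))
    return re.sub(r"\s+", " ", merged).strip()


def join_fields(*fields):
    """Join non-empty fields with '. ' without producing '..'."""
    parts = [f.strip().rstrip(".") for f in fields if f.strip()]
    return ". ".join(parts)


raw = "Splenocyte Isolation\n\t\t\n\n\t\t\t\tExtract spleen|Mash  through filter "
cleaned = clean_text(raw)
print(cleaned)

joined = join_fields("Title", "Abstract ends here.", "Step one")
print(joined)
```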

We will finally add another column with the number of characters of each cleaned text:

In [11]:
df['num_chars_text'] = df['full_text_cleaned'].str.len()

## Initial exploration

As we have seen before, some of the values for the background, materials, categories and authors columns are not unique. We will now see whether they contain empty values or not:

In [12]:
df.head()

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors,full_text,full_text_cleaned,num_chars_text
0,e1467,Measurement of Chlorophyll a and Carotenoids C...,This is a protocol for precise measurement of ...,Cyanobacteria culture (Note 1)|Methanol ≥99.9%...,Work under modest irradiance [up to 5 µmol (ph...,Eppendorf safe-lock tubes (1.5 ml)|Centrifuge ...,,Microbiology|Microbial biochemistry|Other comp...,Tomáš Zavřel|Maria A. Sinetova|Jan Červený,Measurement of Chlorophyll a and Carotenoids C...,Measurement of Chlorophyll a and Carotenoids C...,2653
1,e1308,Minimal Inhibitory Concentration (MIC) Assay f...,Minimal inhibition concentration (MIC) is the ...,"A. baumannii (ATCC, catalog number: 17978 )|...",Preparation of antibiotic stock solution and d...,50 ml polystyrene culture tubes (sterile)|Spec...,,Microbiology|Antimicrobial assay|Antibacterial...,Ming-Feng Lin|Yun-You Lin|Chung-Yu Lan,Minimal Inhibitory Concentration (MIC) Assay f...,Minimal Inhibitory Concentration (MIC) Assay f...,3329
2,e1471,Murine Liver Myeloid Cell Isolation Protocol,"In homeostasis, the liver is critical for the ...",7-8 weeks old female C57Black/6 mice (Janvier ...,Preparation of a liver single cell suspension ...,"Polyester filters cut in 10 x 10 cm squares, t...",,Immunology|Immune cell isolation|Myeloid cell|...,Benoit Stijlemans|Amanda Sparkes|Chloé Abels|J...,Murine Liver Myeloid Cell Isolation Protocol. ...,Murine Liver Myeloid Cell Isolation Protocol. ...,12779
3,e834,Whole Spleen Flow Cytometry Assay,"In the Whole Spleen Flow Cytometry Assay, we u...","PerCP/Cy5.5 anti-mouse CD11b (Biolegend, catal...",Splenocyte Isolation\n\t\t\n\n\t\t\t\tExtract ...,"15 ml conical tubes (BD Biosciences, Falcon®, ...",,Immunology|Immune cell function|Cytokine|Cell ...,Cathy S. Yam|Adeline M. Hajjar,Whole Spleen Flow Cytometry Assay. In the Whol...,Whole Spleen Flow Cytometry Assay. In the Whol...,5237
4,e1236,Dimethylmethylene Blue Assay (DMMB),Glycosaminoglycans (GAGs) are long unbranched ...,"Dimethylmethylene blue (DMMB) (Sigma-Aldrich, ...",Prepare DMMB reagent and paper filter using Wh...,"Plate mixer (VWR International, catalog number...",,Biochemistry|Carbohydrate|Glycoprotein|Biochem...,Vivien Jane Coulson- Thomas|Tarsis Ferreira...,Dimethylmethylene Blue Assay (DMMB). Glycosami...,Dimethylmethylene Blue Assay (DMMB). Glycosami...,1911


In [13]:
print_empty_cols(df)

pr_id
-----
0 articles have no value for column pr_id
Int64Index([], dtype='int64')


title
-----
0 articles have no value for column title
Int64Index([], dtype='int64')


abstract
--------
0 articles have no value for column abstract
Int64Index([], dtype='int64')


materials
---------
2 articles have no value for column materials
Int64Index([21, 76], dtype='int64')


procedure
---------
0 articles have no value for column procedure
Int64Index([], dtype='int64')


equipment
---------
0 articles have no value for column equipment
Int64Index([], dtype='int64')


background
----------
63 articles have no value for column background
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  9, 10, 13, 14, 16, 20, 22, 27, 28, 29,
            30, 33, 34, 36, 37, 38, 40, 42, 43, 47, 48, 49, 50, 51, 54, 55, 56,
            57, 58, 60, 61, 62, 63, 65, 67, 68, 69, 71, 72, 73, 74, 75, 79, 80,
            81, 83, 84, 85, 87, 88, 90, 92, 94, 95, 97, 98],
           dtype='int64')


categories
----------
0 articles h



As we can see, this is the case for the materials and background columns: all of their non-unique values are empty ones. However, there are repeated non-empty values for the categories and authors fields, which is expected.

If we are going to use the background column in future steps, we have to keep in mind that most of the protocols (63 out of 100) do not have a value for it.
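
The per-column check done by `print_empty_cols` can also be condensed into a single summary. The following sketch runs on a made-up three-row DataFrame, not on the real data, and counts empty strings per column with `DataFrame.eq`:

```python
import pandas as pd

# Made-up data that imitates the structure of the protocols DataFrame.
toy = pd.DataFrame({
    "pr_id": ["e1", "e2", "e3"],
    "materials": ["Methanol", "", "Water"],
    "background": ["", "", "Some context"],
})

is_empty = toy.eq("")            # boolean DataFrame, True where cell == ''
empty_counts = is_empty.sum()    # number of empty cells per column
empty_share = is_empty.mean()    # fraction of empty cells per column

print(empty_counts)
print(empty_share.round(2))
```

On the real DataFrame the same two lines would be applied to `df` instead of `toy`.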

We are going to see how the length of each cleaned text is distributed across the dataset:

In [14]:
df['num_chars_text'].describe()

count      100.000000
mean      6740.970000
std       7026.485712
min        849.000000
25%       3096.750000
50%       5177.500000
75%       8126.750000
max      57180.000000
Name: num_chars_text, dtype: float64

In [16]:
HIST_COLUMN = 'num_chars_text'
HIST_TITLE = "Procedure length distribution for the Bio-Protocols dataset"
HIST_XLABEL = "Procedure length (# of characters)"
HIST_YLABEL = "Number of protocols"

hist.load_plot(df, HIST_COLUMN, HIST_TITLE,
               HIST_XLABEL, HIST_YLABEL, True)

In [17]:
df.loc[0].full_text_cleaned

'Measurement of Chlorophyll a and Carotenoids Concentration in Cyanobacteria. This is a protocol for precise measurement of chlorophyll a and total carotenoid concentrations in cyanobacteria cells. Cellular chlorophyll concentration is one of the central physiological parameters, routinely followed in many research areas ranging from stress physiology to biotechnology. Carotenoids concentration is often related to cellular stress level; combined pigments assessment provides useful insight into cellular physiological state. The current protocol was established to minimize time and equipment requirements for the routine pigments analysis. It is important to note that this protocol is suitable only for cyanobacteria containing chlorophyll a, and is not designed for species containing other chlorophyll molecules.. Work under modest irradiance [up to 5 µmol (photons) m-2 s-1 of white light or 10 µmol (photons) m-2 s-1 of green light] in order to prevent degradation of extracted pigments. Ha

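From the cells above we can see that the mean length of each text (title, abstract and procedure) is about 6,700 characters and the median is about 5,200. Half of the protocols have between roughly 3,100 and 8,100 characters. The distribution is right-skewed: the longest text has 57,180 characters.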
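
Since the maximum in the `describe()` output (57,180) is far above the 75th percentile, the median and interquartile range are more robust summaries than the mean. The sketch below shows how unusually long texts could be flagged with the common 1.5 × IQR rule. The choice of that rule is an assumption, and the series is made up except for its minimum and maximum, which are taken from the output above:

```python
import pandas as pd

# Made-up lengths; only 849 and 57180 come from the real summary.
lengths = pd.Series([849, 3000, 5000, 8000, 57180], name="num_chars_text")

q1, median, q3 = lengths.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Texts longer than the upper fence are candidates for manual inspection.
long_texts = lengths[lengths > upper_fence]

print(f"median={median}, IQR={iqr}, upper fence={upper_fence}")
print(long_texts)
```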

In [18]:
hist.save_plot(os.path.join(NOTEBOOK_2_RESULTS_DIR, '1_Protocol_procedure_length.svg'))

There was an error exporting the plot. Please verify that both Selenium and Geckodriver are installed: Neither firefox and geckodriver nor a variant of chromium browser and chromedriver are available on system PATH. You can install the former with 'conda install -c conda-forge firefox geckodriver'.


## Saving the dataframe

Finally, we are going to save the dataframe so that we can use it in the following steps:

In [19]:
DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df.to_pickle(DF_FILE_PATH)
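
A later notebook can restore the DataFrame with `pd.read_pickle`. The sketch below checks the round trip on a made-up DataFrame written to a temporary directory, so it does not touch `NOTEBOOK_2_RESULTS_DIR`:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the protocols DataFrame.
toy = pd.DataFrame({"pr_id": ["e1", "e2"], "num_chars_text": [849, 57180]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "protocols_dataframe.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values, dtypes and index.
print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.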