# ***Data Engineering Part***

# Paper: **Discrimination of essential oils exposed and non-exposed to gamma rays using Raman spectroscopy and machine learning**

### **Authors:** *Paul Vargas Jentzsch (a), Sebastián Sarasti Zambonino (a), Daniela Ramirez (a), Gonzalo Jácome Camacho (a), Marco Sinche Serra (a), Edwin Vera (b), Roque Santos (a), Luis Ramos Guerrero (c), Valerian Ciobotă (d)*

### **Notebook created by:** **Sebastián Sarasti Zambonino**

### **Institutions:**
a) Departamento de Ciencias Nucleares, Facultad de Ingeniería
Química y Agroindustria, Escuela Politécnica Nacional,
Ladrón de Guevara E11-253, 170525 Quito, Ecuador

b) Departamento de Ciencias de Alimentos y Biotecnología, Facultad de Ingeniería
Química y Agroindustria, Escuela Politécnica Nacional,
Ladrón de Guevara E11-253, 170525 Quito, Ecuador

c) Centro de Investigación de Alimentos, CIAL, Universidad UTE, 
Av. Mariscal Sucre y Mariana de Jesús, 170527 Quito, Ecuador

d) Rigaku Analytical Devices, Inc.,
30 Upton Drive, Suite 2
01887 Wilmington, USA

## Important announcement

This notebook is a guide to the data engineering work. Its goal is to gather the Raman spectra recorded in separate files into a single dataframe. If you have any questions, please contact Sebastián Sarasti Zambonino by e-mail (sebastian.sarasti@epn.edu.ec).

# Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Mount drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load library to work with files

In [None]:
import os
import itertools

Load libraries to work with regular expressions and datetime objects

In [None]:
import re
from datetime import datetime

# Transformation part

1. Create a function to list the paths of all txt files inside the folder

In [None]:
def get_txt_files(root_dir):
  # first-level folders inside the root directory
  first_level = [root_dir+'/'+i for i in os.listdir(root_dir)]
  # second-level folders inside each first-level folder
  second_level = [i+'/'+j for i in first_level for j in os.listdir(i)]
  # keep only the .txt files stored in the second-level folders
  k = []
  for i in second_level:
    for filename in os.listdir(i):
      if filename.endswith(".txt"):
        k.append(i+'/'+filename)
  return k
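As a possible alternative, the same listing can be sketched with `os.walk`. Note the difference in behavior: the function above only looks inside folders exactly two levels below the root, while this sketch collects `.txt` files at any depth. The helper name `find_txt_files` and the folder `rep1` are hypothetical and are used only for the demonstration.

```python
import os
import tempfile

def find_txt_files(root_dir):
    """Hypothetical helper: collect every .txt path under root_dir, at any depth."""
    found = []
    for folder, _subfolders, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(".txt"):
                found.append(os.path.join(folder, filename))
    # sort so the result does not depend on the file system ordering
    return sorted(found)

# Demonstration on a temporary, made-up folder tree
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "Chilca", "SinIrradiar", "rep1")
    os.makedirs(leaf)
    for name in ("a.txt", "b.txt", "notes.csv"):
        open(os.path.join(leaf, name), "w").close()
    demo_files = [os.path.basename(p) for p in find_txt_files(root)]

print(demo_files)  # ['a.txt', 'b.txt']
```

Sorting the result is a design choice of this sketch: `os.listdir` and `os.walk` return entries in arbitrary order, so an explicit sort keeps the row order of the final DF reproducible.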

2. Create a function to get the datetime of the measurement

In [None]:
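def get_date(directions):
  dates = []
  for i in directions:
      # the timestamp is the run of digits between "RS" and "_P" in the path
      matches = re.search(r"RS(\d+)_P", i)
      # paths without a match are skipped
      if matches:
          timestamp = matches.group(1)
          # parse the digits as year, month, day, hour, minute, second
          dates.append(datetime.strptime(timestamp, '%Y%m%d%H%M%S'))
  return dates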
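A minimal sketch of how the pattern used in `get_date` behaves on a single path. The file name below is hypothetical; it only assumes that the real files carry a 14-digit timestamp between `RS` and `_P`, as the pattern and the format string suggest.

```python
import re
from datetime import datetime

# Hypothetical file name, made up for illustration
example_path = "some_folder/RS20230115103000_P1.txt"

match = re.search(r"RS(\d+)_P", example_path)
stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")

print(match.group(1))  # 20230115103000
print(stamp)           # 2023-01-15 10:30:00
```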

3. Create a function to read the txt files and get the useful data

In [None]:
def read_data_raman(path):
  # read the file and skip the first 101 lines; the numeric data starts after them
  with open(path, "r") as file:
      lines = file.readlines()[101:]
  # split each line on whitespace (tabs included) and convert every value to float
  a = [float(j) for line in lines for j in line.split()]
  # arrange the values in rows of 5 columns
  a = np.array(a).reshape(-1, 5)
  # keep only the columns at positions 1 and 2 (zero-based)
  a = a[:,1:3]
  return a
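The parsing steps can be checked on a small synthetic file. This is only a sketch: the 101 filler lines and the three data rows below are made up, and the real instrument files may differ in content; only the layout (101 leading lines, then rows of 5 whitespace-separated numbers) is taken from the function above.

```python
import os
import tempfile
import numpy as np

# Synthetic file: 101 filler lines followed by 3 rows of 5 tab-separated numbers
rows = ["0\t100.0\t5.0\t0\t0", "1\t102.0\t7.5\t0\t0", "2\t104.0\t6.0\t0\t0"]
content = "\n".join(["filler"] * 101 + rows) + "\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

# Same steps as read_data_raman: skip 101 lines, parse floats, reshape, slice
with open(tmp_path) as fh:
    data_lines = fh.readlines()[101:]
os.remove(tmp_path)

values = [float(tok) for line in data_lines for tok in line.split()]
table = np.array(values).reshape(-1, 5)
selected = table[:, 1:3]

print(selected.shape)     # (3, 2)
print(selected.tolist())  # [[100.0, 5.0], [102.0, 7.5], [104.0, 6.0]]
```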

4. Create a function to get the dose

In [None]:
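def get_dosis(paths):
  dosis = []
  for i in paths:
    # the folder at position 6 of the path carries the dose in its name
    folder = i.split('/')[6]
    # the first run of digits in the folder name is taken as the dose
    b = re.findall(r'\d+', folder)
    if len(b) > 0:
      dosis.append(float(b[0]))
    else:
      # no digits in the folder name: the dose is set to 0
      dosis.append(0)
  return dosis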
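The sketch below illustrates what the index 6 and the digit pattern select. The folder names `AI 5 kGy`, `AI 2.5 kGy`, and `rep1`, the file name, and the helper `first_number` are hypothetical; the root path is the one used later in this notebook. Two points are worth noting: the index 6 only works for a root path of this exact depth, and `\d+` matches whole digits only, so a decimal dose in a folder name would be truncated.

```python
import re

# Hypothetical path: only the root part comes from this notebook
example_path = ("/content/drive/MyDrive/#6 Aceites irradiados/"
                "Essential oils 1st part/AI 5 kGy/rep1/RS20230115103000_P1.txt")

parts = example_path.split('/')
print(parts[6])  # AI 5 kGy

def first_number(folder_name):
    """Hypothetical helper mirroring get_dosis for a single folder name."""
    digits = re.findall(r'\d+', folder_name)
    return float(digits[0]) if digits else 0

dose_integer = first_number(parts[6])      # 5.0
dose_decimal = first_number("AI 2.5 kGy")  # 2.0, the decimal part is lost
dose_missing = first_number("SinIrradiar") # 0
print(dose_integer, dose_decimal, dose_missing)
```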

5. Create a function to get the plant name

In [None]:
def get_name(paths):
  names = []
  paths = [i.replace("Muña", "Muna") for i in paths] 
  for i in paths:
    if "Muna" in i:
      names.append("Muna")
    elif "Chilca" in i:
      names.append("Chilca")
    # else:
    #   names.append("NA")
  return names

6. Create a function to determine when the sample was irradiated

In [None]:
def get_irradiation_stage(paths):
  stage = []
  for i in paths:
    a = i.split('/')
    a = a[6]
    if 'AI' in a:
      stage.append("Oil")
    elif 'MI' in a:
      stage.append("Sample")
    else:
      stage.append('Not irradiated')
  return stage

7. Create a function to build the whole dataset directly. It returns a DF with one column per Raman value plus columns for the other features: dose, irradiation stage, irradiation label, date, and plant.

In [None]:
def dataset_raman(root_dir):
  # get the paths of the txt files
  at = get_txt_files(root_dir)
  # label each file: 0 if "SinIrradiar" (not irradiated) is in the path, 1 otherwise
  labels = [0 if "SinIrradiar" in i else 1 for i in at]
  # get the date
  date = get_date(at)
  # read the txt files
  data = [read_data_raman(i) for i in at]
  # flatten each 511 x 2 array into a vector of 1022 values, column by column
  features = [i.reshape(511*2, order = 'F') for i in data]
  # get the dose of irradiation
  dosis = get_dosis(at)
  # get the stage at which the irradiation was carried out
  irradiation_stages = get_irradiation_stage(at)
  # create the names of the feature columns (F1 to F1022)
  fea = ['F'+str(i+1) for i in range(1022)]
  # create the DF with the features
  df = pd.DataFrame(data = features, columns = fea)
  # add the remaining columns to the DF
  df['Dosis'] = dosis
  df['Irradiation Stage'] = irradiation_stages
  df['Irradiation'] = labels
  df['Date'] = date
  df['Plant'] = get_name(at)
  return df
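The flattening with `order = 'F'` decides which feature columns hold which values, so a small sketch may help. It uses a made-up 3 x 2 array instead of the 511 x 2 arrays of the notebook. With column-major order, all values of the first column come first, followed by all values of the second column; in the real DF this means F1 to F511 hold the first selected column and F512 to F1022 hold the second.

```python
import numpy as np

# Made-up 3 x 2 array standing in for one 511 x 2 spectrum
spectrum = np.array([[100.0, 5.0],
                     [102.0, 7.5],
                     [104.0, 6.0]])

# Column-major flattening, as in dataset_raman
flat_f = spectrum.reshape(3 * 2, order="F")
# Default row-major flattening, shown for comparison
flat_c = spectrum.reshape(3 * 2)

print(flat_f.tolist())  # [100.0, 102.0, 104.0, 5.0, 7.5, 6.0]
print(flat_c.tolist())  # [100.0, 5.0, 102.0, 7.5, 104.0, 6.0]
```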

# Data Transformation

Create the DF for the first measurements

In [None]:
df = dataset_raman('/content/drive/MyDrive/#6 Aceites irradiados/Essential oils 1st part')

Create the DF for the second measurements

In [None]:
df_2 = dataset_raman('/content/drive/MyDrive/#6 Aceites irradiados/Essential oils 2nd part')

Create the DF for the third measurements

In [None]:
df_3 = dataset_raman('/content/drive/MyDrive/#6 Aceites irradiados/Essential oils 3rd part')

Create the DF for the fourth measurements

In [None]:
df_4 = dataset_raman('/content/drive/MyDrive/#6 Aceites irradiados/Essential oils 4th part')

Export results

In [None]:
df.to_csv('data_irradiation_1.csv')
df_2.to_csv('data_irradiation_2.csv')
df_3.to_csv('data_irradiation_3.csv')
df_4.to_csv('data_irradiation_4.csv')

All the results are concatenated into a single DF.

In [None]:
df_final = pd.concat([df, df_2, df_3, df_4], axis = 0)
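One detail of `pd.concat` that may matter downstream: by default each part keeps its own row index, so the stacked DF contains repeated index values. The sketch below shows this on two made-up frames, together with the optional `ignore_index=True` argument, which renumbers the rows. Whether renumbering is wanted here is a choice for the analysis; the notebook itself uses the default.

```python
import pandas as pd

# Two made-up frames standing in for the per-part DFs
part_a = pd.DataFrame({"F1": [1.0, 2.0]})
part_b = pd.DataFrame({"F1": [3.0]})

stacked = pd.concat([part_a, part_b], axis=0)
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```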

In [None]:
df_final.to_csv('data_final.csv')