# Data Wrangling with Python using Jupyter Notebooks

### Geek Meet, March 13, 2019
### Tom Madsen

---

# SAFETY MOMENT
## IS THERE A REPRODUCIBILITY CRISIS?
[1,500 Scientists Lift the Lid on Reproducibility (Nature, 2016)](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970)

<img src="../assets/is_there_reproducibility_crisis.jpeg" width=600>__________<img src="../assets/reproducibility_by_field.jpg" width=400>

---

## "...I know I did some really useful analysis but I can’t find it..."

[Building a Repeatable Data Analysis Process with Jupyter Notebooks (Practical Business Python, 2018)](https://pbpython.com/notebook-process.html)

<img src="../assets/maze.jpg" width=600>

---

# MY EXAMPLE

## First, we were awarded a new project: a contaminated site with a long investigative history and lots of data.

## We received Excel data tables from the previous consultant in a typical wide, untidy format.

## The client asked us to evaluate and implement the cleanup at the site, which involves excavating and disposing of over 35,000 cubic yards of waste and contaminated soil.

## We wanted soil quality data in a format that could be used to

    1) Estimate excavation volumes using 3D modeling
    2) Examine correlations in constituent concentrations and establish cleanup levels
    
---

# HERE WE GO!

<img src='../assets/never_do_live_demo.png' width=800>

## 1. Setting up the Project

- folders
- notes file
- locking down raw data (and a notes file)
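
A minimal sketch of this setup step. The `data/raw` and `data/processed` folders match the paths used in the cells below; the `notebooks` and `assets` folder names, the `notes.md` file name, and the `make_project` and `lock_raw_files` helpers are assumptions for illustration. Raw files are marked read-only so they are harder to overwrite by accident.

```python
from pathlib import Path
import stat

def make_project(root):
    """Create the project folders and a notes file; return the root Path."""
    root = Path(root)
    for sub in ("data/raw", "data/processed", "notebooks", "assets"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    notes = root / "notes.md"  # hypothetical file name
    if not notes.exists():
        notes.write_text("# Project notes\n")
    return root

def lock_raw_files(raw_dir):
    """Remove write permission from every file in raw_dir; return the files locked."""
    locked = []
    for path in sorted(Path(raw_dir).iterdir()):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            locked.append(path)
    return locked
```

Calling `lock_raw_files('../data/raw')` after copying in the consultant's spreadsheets would leave them readable by `pd.read_excel` but not writable.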

## 2. Python Imports

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import re

### Read in Raw Data Files

In [None]:
tph = pd.read_excel('../data/raw/Hydrocarbons Detected in Soils - 2013-2017.xlsx', sheet_name='TPH')[:367]
tph.columns = ['sample_id','depth_ft', 'sample_date', 'dro','gro','oilgrease','trph']
tph = tph.dropna(subset=['depth_ft','sample_date']).set_index('sample_id')
tph.head(2)

In [None]:
voc = pd.read_excel('../data/raw/VOCs Detected in Soils - 2013-2017.xlsx', sheet_name='VOC')[:252]
firstv3 = ['sample_id','depth_ft','sample_date']
constitv = list(voc.iloc[0,3:])
voc.columns = firstv3+constitv
voc = voc.dropna(subset=['depth_ft','sample_date']).set_index('sample_id')
voc.head(2)

In [None]:
svoc = pd.read_excel('../data/raw/SemiVOCs Detected in Soils - 2013-2017.xlsx', 
                     sheet_name='Soil Data SVOCs')[:258]
firsts3 = ['sample_id','depth_ft','sample_date']
constits = list(svoc.iloc[0,3:])
svoc.columns = firsts3+constits
svoc = svoc.dropna(subset=['depth_ft','sample_date']).set_index('sample_id')
svoc.head(2)

In [None]:
svoc.loc['DC-B51  2.9-4.2', 'Acenaphthene':]

In [None]:
print(tph.shape)
print(voc.shape)
print(svoc.shape)
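
The VOC and SVOC cells above follow the same pattern: the first three columns are renamed by hand, the constituent names are taken from the first row of the sheet, and rows without a depth and date are dropped. A sketch of that pattern as one reusable function, run on a made-up stand-in for a sheet (the `promote_constituent_header` name and the toy values are assumptions, not part of the project data):

```python
import numpy as np
import pandas as pd

def promote_constituent_header(df, id_cols=("sample_id", "depth_ft", "sample_date")):
    """Rename columns using row 0 for constituent names, then drop non-sample rows."""
    id_cols = list(id_cols)
    constituents = list(df.iloc[0, len(id_cols):])
    out = df.copy()
    out.columns = id_cols + constituents
    # Rows without a depth and date (the header row, blank spacer rows) are not samples.
    out = out.dropna(subset=["depth_ft", "sample_date"])
    return out.set_index("sample_id")

# Toy stand-in for a sheet as returned by pd.read_excel (values are made up).
raw = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan, "Benzene", "Toluene"],
        ["B-1 0-2", 2.0, "2015-06-01", "ND", "<6.1"],
        ["B-2 2-4", 4.0, "2015-06-02", "12 J", "1,200"],
    ]
)
tidy = promote_constituent_header(raw)
print(tidy)
```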

### Merge All Data Files by the Index (i.e., Sample_id)

In [None]:
tph_voc = tph.merge(voc.iloc[:,2:], how='left', left_index=True, right_index=True)

In [None]:
tph_voc.shape

In [None]:
tph_voc.head(2)

In [None]:
tph_voc_svoc = tph_voc.merge(svoc.iloc[:,2:], how='left', left_index=True, right_index=True, suffixes=('_voc', '_svoc'))

In [None]:
tph_voc_svoc.shape

In [None]:
tph_voc_svoc.head(2)
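
A small illustration of what the merges above do, using made-up values. With `how='left'` every row of the left table is kept, and `suffixes` is applied only to column names that appear in both tables, which is why a constituent reported in both the VOC and SVOC sheets ends up as two columns such as `Naphthalene_voc` and `Naphthalene_svoc`. The `validate` argument is an optional extra, not used in the cells above, that raises an error if a sample ID is duplicated on either side.

```python
import pandas as pd

left = pd.DataFrame({"dro": [120, 45], "Naphthalene": [3.1, 0.4]},
                    index=pd.Index(["S-1", "S-2"], name="sample_id"))
right = pd.DataFrame({"Naphthalene": [2.8], "Phenol": [0.9]},
                     index=pd.Index(["S-1"], name="sample_id"))

# Left merge on the index: S-2 has no match on the right and gets NaN there.
merged = left.merge(right, how="left", left_index=True, right_index=True,
                    suffixes=("_voc", "_svoc"), validate="one_to_one")
print(merged)
```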

### Save File with All Data Combined

In [None]:
# save processed file
now = dt.datetime.now()
date = str(now)[:10]
time = now.strftime('%H%M')  # zero-padded, so 09:05 gives '0905' rather than '95'
proc_name ='../data/processed/all_data_uncleaned_{}_{}.xlsx'.format(date, time)
tph_voc_svoc.to_excel(proc_name)

### Select COPCs for Residential and Industrial
##### (based on comparison to RSLs and ISLs)

In [None]:
list(tph_voc_svoc.columns)

### From the draft RAWP and data summary:

The ISL constituents are:  **benzene**, **toluene**, **ethylbenzene**, **xylenes**, **naphthalene**, **MTBE**, **gro**, **dro**, **O&G or TRPH**  

As a point of reference and to confirm TPH-DRO is an indicator compound for remediation of the Site, the detected constituents were conservatively compared to the USEPA Regional Screening Levels (RSLs) for industrial land use.  The analysis showed that concentrations on-site are within, or more conservative than, a risk factor based on $10^{-6}$ and are protective of the environment.  The only VOCs detected above the USEPA Industrial RSLs were **naphthalene** and a **single detection of 1,2-dibromo-3-chloropropane** at depths of less than 8.5 feet.  The only SVOC detected above the USEPA Industrial RSLs was **benzo(a)pyrene** at depths of less than 8.5 feet.  The base of the contaminated soil zone is predominantly located within the wet sand layer, which is located above clean native silty clay soil.

### Generate list of COPCs that have exceeded industrial and residential screening levels
(also added sample depth and date columns)

In [None]:
copcs_resid = ['depth_ft','sample_date',
               'dro','gro','oilgrease','trph','1,2-Dibromo-3-chloropropane','Benzene','Ethylbenzene','Toluene','Xylene (Total)','Methyl-tert-butyl ether','Naphthalene_voc',
               '2-Methylnaphthalene',
               'Benzo(a)anthracene','Benzo(a)pyrene','Benzo(b)fluoranthene','Benzyl alcohol','Indeno(1,2,3-cd)pyrene','Phenol','bis(2-Ethylhexyl)phthalate']
copcs_indus = ['depth_ft','sample_date',
               'dro','gro','oilgrease','trph','1,2-Dibromo-3-chloropropane','Benzene','Ethylbenzene','Toluene','Xylene (Total)','Methyl-tert-butyl ether','Naphthalene_voc',
               'Benzo(a)anthracene','Benzo(a)pyrene']

In [None]:
copc_data_resid = tph_voc_svoc[copcs_resid]
copc_data_indus = tph_voc_svoc[copcs_indus]
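
Selecting columns with a list raises a `KeyError` if any name is missing, and the names must match exactly, including case and any `_voc` or `_svoc` suffix added by the merge. A small check that could be run before the selection (the `missing_columns` helper and the toy column names are hypothetical):

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df`, in order."""
    return [name for name in wanted if name not in df.columns]

demo = pd.DataFrame(columns=["depth_ft", "sample_date", "dro", "Benzene"])
print(missing_columns(demo, ["depth_ft", "dro", "benzene", "Toluene"]))
# prints ['benzene', 'Toluene']
```

An empty list from `missing_columns(tph_voc_svoc, copcs_resid)` would mean the selection is safe to run.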

In [None]:
copc_data_resid.loc['DC-B51  2.9-4.2',:]

### Create and Implement Helper Functions to Clean Data

In [None]:
# removes commas in values
def no_comma(value):
    if ',' in str(value):
        return value.replace(',','')
    else:
        return value

In [None]:
# converts "ND" entries to 1.0 (ppm for TPH's and ppb for VOCs/SVOCs)
def nd_to_1(value):
    if value == 'ND':
        return 1
    else:
        return value

In [None]:
# converts "<###.# XX" entries to 1/2 the reporting limit
def half_nd(value):
    if '<' in str(value):
        dl = value.split('<')[1].split(' ')[0] # works for '<430 3' and for '<6.1'
        # dl is still a string at this point, so convert it before halving
        return float(dl.replace(',', '')) / 2
    else:
        return value

In [None]:
# deletes "J" flags
def no_j(value):
    if 'J' in str(value):
        # the text before the flag is the number; it is a string, so always convert
        val = value.split('J')[0]
        return float(val.split(',')[0])
    else:
        return value

In [None]:
# deletes "U" flags
def no_u(value):
    if 'U' in str(value):
        # the text before the flag is the number; it is a string, so always convert
        val = value.split('U')[0]
        return float(val.split(',')[0])
    else:
        return value

In [None]:
# removes most superscripts, where there is a space between the value and the superscript
def no_ss(value):
    if len(str(value).split(' ')) > 1:
        # keep only the text before the first space and convert it to a number
        val = str(value).split(' ')[0]
        return float(val.replace(',', ''))
    else:
        return value

In [None]:
# removes superscript at the end of the value string - only occurs in one row for VOCs
def no_ss1(value):
    if chr(185) in str(value):  # chr(185) is the superscript one character
        val = str(value).split(chr(185))[0]
        return float(val.replace(',', ''))
    else:
        return value

In [None]:
# test for no_ss1 function for finding superscript
print(chr(185))
value = '12,200' + chr(185)
print(value)
no_ss1(value)
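
The notebook imports `re` but the helpers above rely on string splitting. As an alternative sketch, not the approach used in this notebook, the same rules could be folded into one regex-based function: commas removed, `ND` set to 1, `<RL` entries set to half the reporting limit, and trailing flags or superscripts ignored. The `parse_result` name is hypothetical, and the sketch assumes every entry contains at most one meaningful number written in plain decimal notation.

```python
import re

# First plain decimal number in the text; [0-9] deliberately excludes superscript digits.
_NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?|\.[0-9]+")

def parse_result(value):
    """Convert one lab-table entry to a float using the rules described above."""
    if not isinstance(value, str):
        return value                      # already numeric (or NaN): leave as is
    text = value.strip().replace(",", "")
    if text == "ND":
        return 1.0                        # same placeholder value as nd_to_1
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"unrecognised entry: {value!r}")
    number = float(match.group())
    if text.startswith("<"):
        return number / 2                 # half the reporting limit
    return number

for entry in ["ND", "<430 3", "35 J", "0.5U", "12,200" + chr(185)]:
    print(repr(entry), "->", parse_result(entry))
```

Raising on unrecognised text, rather than passing it through, makes any new kind of flag visible the first time it appears.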

## Run all the helper functions and clean the data, and convert column data to numeric

<img src='../assets/crossed_fingers.jpg'>

In [None]:
def clean_data(df):
    # work on a copy so the selection passed in is not modified (avoids SettingWithCopyWarning)
    df = df.copy()
    for col in df.columns[2:]:
        cleaned = df[col]
        for func in (no_comma, nd_to_1, half_nd, no_j, no_u, no_ss, no_ss1):
            cleaned = cleaned.apply(func)
        # assigning with df[col] replaces the column, so the numeric dtype is kept
        df[col] = pd.to_numeric(cleaned, errors='raise')
    return df

In [None]:
copc_indus_clean = clean_data(copc_data_indus)

In [None]:
copc_resid_clean = clean_data(copc_data_resid)
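
Because `clean_data` calls `pd.to_numeric` with `errors='raise'`, the first entry that the helpers did not handle stops the run. A sketch for listing every such entry at once, using `errors='coerce'` to find values that fail to convert (the `unparsed_entries` helper and the toy table are hypothetical). Run on raw data it shows which kinds of flags the helpers need to handle; run on partly cleaned data it shows what is left over.

```python
import pandas as pd

def unparsed_entries(df, skip=2):
    """Map column name -> sorted list of entries that pd.to_numeric cannot convert."""
    problems = {}
    for col in df.columns[skip:]:  # skip depth_ft and sample_date, as clean_data does
        converted = pd.to_numeric(df[col], errors="coerce")
        bad = df.loc[converted.isna() & df[col].notna(), col]
        if not bad.empty:
            problems[col] = sorted(set(bad.astype(str)))
    return problems

demo = pd.DataFrame({
    "depth_ft": [2.0, 4.0, 6.0],
    "sample_date": ["2015-06-01"] * 3,
    "Benzene": ["12", "ND", "<6.1"],
    "Toluene": [1.5, None, "3.2"],
})
print(unparsed_entries(demo))
# prints {'Benzene': ['<6.1', 'ND']}
```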

### Save Cleaned Data to Processed Data

In [None]:
# save copc_indus_clean processed file
now = dt.datetime.now()
date = str(now)[:10]
time = now.strftime('%H%M')  # zero-padded, so 09:05 gives '0905' rather than '95'
proc_name ='../data/processed/copc_data_indus_cleaned_{}_{}.xlsx'.format(date, time)
copc_indus_clean.to_excel(proc_name)

In [None]:
# save copc_resid_clean processed file
now = dt.datetime.now()
date = str(now)[:10]
time = now.strftime('%H%M')  # zero-padded, so 09:05 gives '0905' rather than '95'
proc_name ='../data/processed/copc_data_resid_cleaned_{}_{}.xlsx'.format(date, time)
copc_resid_clean.to_excel(proc_name)
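
The three save cells in this notebook repeat the same timestamp logic. A sketch of a small helper that builds the file name in one place, with the time zero-padded so that names sort in order (the `timestamped_path` name and its parameters are hypothetical):

```python
import datetime as dt
from pathlib import Path

def timestamped_path(stem, folder="../data/processed", ext=".xlsx", now=None):
    """Build folder/stem_YYYY-MM-DD_HHMM.ext; `now` can be fixed for testing."""
    now = now or dt.datetime.now()
    return Path(folder) / f"{stem}_{now:%Y-%m-%d_%H%M}{ext}"

example = timestamped_path("copc_data_resid_cleaned", now=dt.datetime(2019, 3, 13, 9, 5))
print(example.name)
# prints copc_data_resid_cleaned_2019-03-13_0905.xlsx
```

With this in place a save cell could reduce to `copc_resid_clean.to_excel(timestamped_path('copc_data_resid_cleaned'))`, since `to_excel` accepts a path object.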

# Success!

<img src='../assets/thumbs_up2.jpg' width=600>