# Data Cleaning with Pandas and Jupyter Notebooks

Import the required packages.

[pathlib](https://docs.python.org/3/library/pathlib.html) - Python module to handle file system paths

In [None]:
import pandas as pd
from pathlib import Path 

Use `sys.path.append` to add the parent directory to the system path so the notebook can import from the `scripts` directory.
https://stackoverflow.com/a/64562179

In [None]:
import sys
sys.path.append(str(Path.cwd().parent))

from scripts.normalize_data import (
    normalize_columns, 
    normalize_expedition_section_cols,
    remove_bracket_text,
    remove_whitespace,
    print_df
)

In [None]:
normalized_nontaxa_path = Path('..', 'processed_data', 'normalized_nontaxa_list.csv')
normalized_taxa_search_path = Path('..', 'processed_data',  'taxa_list_search.csv')


## Working with multiple files

To process multiple files, we need to get the paths for all the files. 

Use `Path` and `rglob` to get all the CSV files in the `clean_data` directory, including those in subdirectories.

In [None]:
paths = list(Path('..', 'processed_data', 'clean_data').rglob('*.csv'))
paths

In [None]:
len(paths)
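
As a small illustration of why `rglob` is used here rather than `glob`, the sketch below builds a throwaway directory tree and compares the two. The directory and file names are made up for the example and are not part of the project data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # hypothetical layout: one CSV at the top level, one in a subdirectory
    (root / 'Micropal_CSV_1').mkdir()
    (root / 'top_level.csv').write_text('a,b\n1,2\n')
    (root / 'Micropal_CSV_1' / 'nested.csv').write_text('a,b\n3,4\n')
    (root / 'Micropal_CSV_1' / 'notes.txt').write_text('not a csv')

    # glob only matches files directly inside root
    shallow = sorted(p.name for p in root.glob('*.csv'))
    # rglob also searches every subdirectory
    deep = sorted(p.name for p in root.rglob('*.csv'))

print(shallow)
print(deep)
```

Because the data files live in several subdirectories of `clean_data`, the recursive version is the one that finds all of them.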

## read files

We used `pandas.read_csv(path, dtype=str)` to read each CSV and treat all columns as strings. Without `dtype=str`, `pandas.read_csv(path)` infers a type for each column (for example integers, floats, booleans, or strings). This automatic conversion can change values in unexpected ways, such as turning a column of integers with missing values into floats and NaN.

In [None]:
path = Path('..', 'processed_data', 'clean_data', 'Micropal_CSV_2', '362_U1480E_planktic_forams.csv')

With `dtype=str`, the integer values are kept exactly as written in the file.

In [None]:
df = pd.read_csv(path, nrows=5 , dtype=str)
df['Pulleniatina coiling (dextral)']

Without `dtype=str`, pandas automatically converts the integers to floats because the column contains NaNs.

In [None]:
df = pd.read_csv(path, nrows=5)
df['Pulleniatina coiling (dextral)']
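
The same effect can be reproduced without the project files. The sketch below reads a tiny in-memory CSV twice; the column names and values are invented for the example.

```python
import io

import pandas as pd

# made-up CSV with one missing value in the Count column
csv_text = "Sample,Count\nA,3\nB,\nC,12\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# type inference turns the integers into floats because of the missing value
print(inferred['Count'].tolist())
# dtype=str keeps the values as the original text
print(as_text['Count'].tolist())
```

Note that the empty cell is still read as NaN in both cases; `dtype=str` only prevents the other values from being converted.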

## viewing changes

When cleaning data, we found it helpful to see a preview of the dataframe together with its total number of rows and columns.

`print_df` is a custom function that uses `pd.DataFrame.shape` and `pd.DataFrame.head()`.

In [None]:
path = Path('..', 'processed_data', 'clean_data', 'Micropal_CSV_2', '362_U1480E_planktic_forams.csv')
df = pd.read_csv(path, dtype=str)

print_df(df)
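
The real `print_df` lives in `scripts/normalize_data.py` and is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it prints the shape and returns the first rows, is given below; the name `print_df_sketch` and the `num_rows` parameter are hypothetical.

```python
import pandas as pd


def print_df_sketch(df, num_rows=5):
    """Print the (rows, columns) shape, then return the first rows for display."""
    print(df.shape)
    return df.head(num_rows)


# made-up data for the example
example = pd.DataFrame({'Sample': ['a', 'b', 'c'], 'Count': ['1', '2', '3']})
preview = print_df_sketch(example, num_rows=2)
```

Returning `head()` as the last expression of a cell lets Jupyter render it as a table.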

## Basic cleanup pattern


In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)
    
    # code to change file   
    
    df.to_csv(path, index=False)
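
If the same read, change, write loop is repeated many times, it could be wrapped in a helper. The sketch below is one possible way to do that; `clean_files` and `transform` are hypothetical names, and the notebook itself keeps the explicit loops.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_files(paths, transform):
    """Read each CSV as text, apply transform(df) -> df, and overwrite the file."""
    for path in paths:
        df = pd.read_csv(path, dtype=str)
        df = transform(df)
        df.to_csv(path, index=False)


with tempfile.TemporaryDirectory() as tmp:
    # throwaway file with one duplicated row
    path = Path(tmp, 'example.csv')
    path.write_text('Sample,Count\nA,1\nA,1\nB,2\n')

    clean_files([path], lambda df: df.drop_duplicates())
    result = path.read_text()

print(result)
```

Since each pass overwrites the input files, it is worth keeping the original data under version control or in a separate directory.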

## Basic file cleanup

pandas has methods that can be used to do some basic file cleanup.

- delete dataframe column if all values are NA 

  `dropna(axis='columns', how='all', inplace=True)` - [pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

- delete dataframe row if all values are NA 

  `dropna(axis='index', how='all', inplace=True)` - [pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

- remove duplicate rows in dataframe 

  `drop_duplicates(inplace=True)` - [pandas.DataFrame.drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

before cleanup

In [None]:
path = Path('..', 'processed_data', 'clean_data', 'Micropal_CSV_3', '341_planktic_forams_U1417B.csv')

df = pd.read_csv(path, dtype=str)
print_df(df)

after cleanup

In [None]:
df.dropna(axis='columns', how='all', inplace=True)  
df.dropna(axis='index', how='all', inplace=True)
df.drop_duplicates(inplace=True)

print_df(df)
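
To see what each of the three calls removes, here is the same cleanup applied to a small made-up dataframe with one empty column, one empty row, and one duplicated row.

```python
import pandas as pd

# made-up data: 'Empty' has no values, row 2 has no values, rows 0 and 1 are identical
df = pd.DataFrame({
    'Sample': ['A', 'A', None, 'B'],
    'Count': ['1', '1', None, '2'],
    'Empty': [None, None, None, None],
})

df.dropna(axis='columns', how='all', inplace=True)  # removes the 'Empty' column
df.dropna(axis='index', how='all', inplace=True)    # removes the all-NA row
df.drop_duplicates(inplace=True)                    # removes the repeated 'A' row

print(df.shape)
print(df)
```

The surviving rows keep their original index labels, which is why the notebook writes files with `index=False`.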

Use a `for` loop to run the basic cleanup on all files.

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)
    
    df.dropna(axis='columns', how='all', inplace=True)  
    df.dropna(axis='index', how='all', inplace=True)
    df.drop_duplicates(inplace=True)
    
    df.to_csv(path, index=False)

## remove leading and trailing white spaces

We created a custom function `remove_whitespace` to remove all leading and trailing white spaces from a dataframe. 

Since we wanted to remove white spaces from the headers as well, we used `read_csv(header=None)` and `to_csv(header=False)` so that pandas treats the first row like any other row.

In [None]:
df = pd.read_csv(paths[0], dtype=str, header=None)

remove_whitespace(df)

print_df(df)

remove white space from all files

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str, header=None)
    
    remove_whitespace(df)
    
    df.to_csv(path, index=False, header=False)
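
The implementation of `remove_whitespace` is in the `scripts` directory and is not reproduced here. The sketch below shows one way such a function could work, assuming it modifies the dataframe in place (as the calls above suggest); `remove_whitespace_sketch` is a hypothetical name.

```python
import io

import pandas as pd


def remove_whitespace_sketch(df):
    """Strip leading and trailing whitespace from every string cell, in place."""
    for col in df.columns:
        # leave NaN and other non-string values untouched
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)


# made-up CSV whose header and values carry stray spaces
raw = " Sample ,Count \n A1 , 3\n"

# header=None makes the header row an ordinary data row, so it is stripped too
df = pd.read_csv(io.StringIO(raw), dtype=str, header=None)
remove_whitespace_sketch(df)

print(df.values.tolist())
```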

## Normalizing columns names

For expedition 312 and later, the researchers on each expedition determined the format of their own data files. This resulted in a lot of variability in the file columns. Another challenge in parsing the files is that each taxon is stored as a separate column.

### get all unique column names

In order to normalize the header names, we needed to get all the headers for all the files. 

Since we only need the header names, use `nrows=0` with `read_csv`. 

In [None]:
pd.read_csv(paths[1], dtype=str, nrows=0)


I used the `pandas.DataFrame.columns` attribute and a Python `set` to get all the unique columns for all the files.

In [None]:
all_columns = set()
for path in paths:
    df = pd.read_csv(path, dtype=str, nrows=0)
    
    all_columns.update(df.columns)
    
len(all_columns)

In [None]:
all_columns
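
The header-collection step can be tried on in-memory data. In the sketch below the two CSV strings stand in for files and are invented for the example.

```python
import io

import pandas as pd

# two made-up "files" that share only the Sample column
files = [
    "Sample,Top [cm],Globigerina bulloides\nA,0,3\n",
    "Sample,Bottom [cm],Orbulina universa\nB,5,1\n",
]

example_columns = set()
for text in files:
    # nrows=0 parses the header line and skips all data rows
    header_only = pd.read_csv(io.StringIO(text), dtype=str, nrows=0)
    example_columns.update(header_only.columns)

print(header_only.shape)
print(sorted(example_columns))
```

Because a set stores each value once, `Sample` appears only once in the result even though both inputs contain it.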

I then manually separated the taxa names from the other headers so that we could do some more processing on the taxa.

In [None]:
taxa_columns = {
    'Candeina nitida',
    'Dentoglobigerina altispira _T_ _PL5',
    'Dentoglobigerina altispira _T_ _PL5_',
    'Dextral:Sinistral _P. obliquiloculata_',
    'Dextral:Sinistral _P. praecursor_',
    'Dextral:Sinistral _P. primalis_',
    'Globigerina bulloides',
    'Globigerina cf. woodi',
    'Globigerina falconensis',
    'Globigerina umbilicata',
    'Globigerinella aequilateralis',
    'Globigerinella calida',
    'Globigerinella calida _B',
    'Globigerinella calida _B_',
    'Globigerinella obesa',
    'Globigerinita glutinata',
    'Globigerinita parkerae',
    'Globigerinita uvula',
    'Globigerinoides bulloideus',
    'Globigerinoides conglobatus',
    'Globigerinoides extremus _T and B',
    'Globigerinoides extremus _T and B_',
    'Globigerinoides fistulosus',
    'Globigerinoides obliquus _T',
    'Globigerinoides obliquus _T_',
    'Globigerinoides quadrilobatus',
    'Globigerinoides ruber',
    'Globigerinoides ruber (pink)',
    'Globigerinoides ruber (white)',
    'Globigerinoides ruber _pink_ T',
    'Globigerinoides ruber _pink_ _T_',
    'Globigerinoides sacculifer',
    'Globigerinoides sacculifer (without sack)',
    'Globigerinoides tenellus',
    'Globigerinoides trilobus',
    'Globigerinoidesella fistulosa _T and B_ _Pt1a',
    'Globigerinoidesella fistulosa _T and B_ _Pt1a_',
    'Globoconella miozea',
    'Globorotalia (Globoconella) inflata',
    'Globorotalia (Globorotalia) tumida tumida',
    'Globorotalia (Hirsutella) hirsuta',
    'Globorotalia (Hirsutella) scitula',
    'Globorotalia (Truncorotalia) crossaformis',
    'Globorotalia (Truncorotalia) truncatulinoides',
    'Globorotalia anfracta',
    'Globorotalia crassaformis',
    'Globorotalia crassaformis sensu lato',
    'Globorotalia flexuosa',
    'Globorotalia flexuosa _T and B_',
    'Globorotalia hessi',
    'Globorotalia hessi _B_',
    'Globorotalia hirsuta',
    'Globorotalia inflata',
    'Globorotalia limbata _B',
    'Globorotalia limbata _B_',
    'Globorotalia limbata _T_',
    'Globorotalia margaritae _T and B_ _PL3',
    'Globorotalia margaritae _T and B_ _PL3_',
    'Globorotalia menardii',
    'Globorotalia multicamerata _T',
    'Globorotalia multicamerata _T_',
    'Globorotalia plesiotumida _B_ _M13b_',
    'Globorotalia plesiotumida _T',
    'Globorotalia plesiotumida _T_',
    'Globorotalia pseudomiocenica _T_ _PL6',
    'Globorotalia pseudomiocenica _T_ _PL6_',
    'Globorotalia scitula',
    'Globorotalia tosaensis',
    'Globorotalia tosaensis _T and B_ _Pt1b',
    'Globorotalia tosaensis _T and B_ _Pt1b_',
    'Globorotalia truncatulinoides',
    'Globorotalia truncatulinoides _B',
    'Globorotalia truncatulinoides _B_',
    'Globorotalia tumida',
    'Globorotalia tumida _B_ _PL1a_',
    'Globoturborotalita apertura _T and B',
    'Globoturborotalita apertura _T and B_',
    'Globoturborotalita decoraperta _T and B',
    'Globoturborotalita decoraperta _T and B_',
    'Globoturborotalita rubescens',
    'Neogloboquadrina acostaensis',
    'Neogloboquadrina acostaensis (dextral)',
    'Neogloboquadrina cf. pachyderma',
    'Neogloboquadrina dutertrei',
    'Neogloboquadrina humerosa',
    'Neogloboquadrina incompta (dextral)',
    'Neogloboquadrina inglei',
    'Neogloboquadrina kagaensis',
    'Neogloboquadrina nympha',
    'Neogloboquadrina pachyderma (dextral)',
    'Neogloboquadrina pachyderma (sin)',
    'Neogloboquadrina pachyderma (sinistral)',
    'Neogloboquadrina pachyderma B (sinistral, inflated form)',
    'Neogloboquadrina pachyderma(dex)',
    'Orbulina universa',
    'Pulleniatina coiling (dextral)',
    'Pulleniatina coiling (sinistral)',
    'Pulleniatina finalis',
    'Pulleniatina finalis _B',
    'Pulleniatina finalis _B_',
    'Pulleniatina obliquiloculata',
    'Pulleniatina obliquiloculata (D)',
    'Pulleniatina praecursor',
    'Pulleniatina praespectabilis',
    'Pulleniatina primalis  _Tand B',
    'Pulleniatina primalis  _Tand B_',
    'Sphaeroidinella dahiscens sensu lato',
    'Sphaeroidinella dehiscens',
    'Sphaeroidinella dehiscens s.l.',
    'Sphaeroidinella dehiscens sensu lato _B_',
    'Sphaeroidinellopsis kochi _T',
    'Sphaeroidinellopsis kochi _T_',
    'Sphaeroidinellopsis seminulina _T_ _PL4',
    'Sphaeroidinellopsis seminulina _T_ _PL4_',
}

In [None]:
len(taxa_columns)

Since both `all_columns` and `taxa_columns` are sets, we can subtract them to get the nontaxa headers.

In [None]:
nontaxa_columns = all_columns - taxa_columns

nontaxa_columns

In [None]:
len(nontaxa_columns)
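
Because the taxa set is typed by hand, the same set subtraction can also be used in the other direction as a sanity check. The sketch below uses small made-up sets, one of which contains a deliberate typo.

```python
# made-up stand-ins for all_columns and taxa_columns
all_cols = {'Sample', 'Top [cm]', 'Globigerina bulloides', 'Orbulina universa'}
taxa_cols = {'Globigerina bulloides', 'Orbulina universa', 'Orbulina univrsa'}

# headers that are not taxa
nontaxa = all_cols - taxa_cols

# hand-typed taxa names that match no file header (likely typos)
unknown = taxa_cols - all_cols

print(sorted(nontaxa))
print(sorted(unknown))
```

An empty `unknown` set would mean every name in the manual list matches a real header.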

### create taxa and non-taxa file

I saved the taxa and nontaxa headers to CSV files so that I can access them later.

In [None]:
taxa_df = pd.DataFrame(taxa_columns, columns=['verbatim_name'])
taxa_df.sort_values('verbatim_name', inplace=True)

print_df(taxa_df)

In [None]:
path = Path('..', 'processed_data', 'drafts', 'taxa_list.csv')
taxa_df.to_csv(path, index=False)

In [None]:
non_taxa_df = pd.DataFrame(nontaxa_columns, columns=['field'])
non_taxa_df.sort_values('field', inplace=True)

print_df(non_taxa_df)

In [None]:
path = Path('..', 'processed_data', 'drafts', 'nontaxa_list.csv')
non_taxa_df.to_csv(path, index=False)

### normalize headers

After the project PIs manually normalized the columns, we needed to update the data files with the normalized columns.

In [None]:
nontaxa_df = pd.read_csv(normalized_nontaxa_path, dtype=str)
print_df(nontaxa_df)

Create a dictionary that maps each original field name to its normalized field name.

In [None]:
nontaxa_mapping = nontaxa_df.set_index('field').to_dict()['normalized_field']
nontaxa_mapping

`normalize_columns` updates the column names for a data frame

In [None]:
df = pd.read_csv(paths[0], dtype=str)    
df.columns


In [None]:
df = pd.read_csv(paths[0], dtype=str) 
normalize_columns(df, nontaxa_mapping)
df.columns
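
`normalize_columns` itself is defined in the `scripts` directory. A minimal sketch of the idea, assuming it renames columns in place using the mapping, could be built on `DataFrame.rename`; the function name `normalize_columns_sketch` and the sample field names are hypothetical.

```python
import pandas as pd

# made-up mapping table with the same two columns as the normalized nontaxa file
mapping_df = pd.DataFrame({
    'field': ['Top[cm]', 'Top Depth [m]'],
    'normalized_field': ['Top [cm]', 'Top Depth [m]'],
})
mapping = mapping_df.set_index('field')['normalized_field'].to_dict()


def normalize_columns_sketch(df, mapping):
    """Rename columns found in mapping, in place; other columns are left as-is."""
    df.rename(columns=mapping, inplace=True)


df = pd.DataFrame({'Top[cm]': ['0'], 'Globigerina bulloides': ['3']})
normalize_columns_sketch(df, mapping)

print(list(df.columns))
```

Columns that are missing from the mapping, such as the taxa columns, keep their original names.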

normalize columns for all files

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)    
    
    normalize_columns(df, nontaxa_mapping)
    
    df.to_csv(path, index=False)


## Clean up row values

`remove_bracket_text` removes the [text] values at the end of some taxa columns.

In [None]:
df = pd.read_csv(paths[0], dtype=str)    
print_df(df)

In [None]:
df = pd.read_csv(paths[0], dtype=str) 
df = remove_bracket_text(df)
print_df(df)
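
The actual `remove_bracket_text` is not shown in this notebook. As a rough sketch of the idea, assuming the bracketed text sits at the end of cell values and that the function returns a new dataframe, a regular expression can strip it. The name `remove_bracket_text_sketch` and the value `'R [rare]'` are invented for the example.

```python
import re

import pandas as pd

# optional whitespace, then one [ ... ] group at the very end of the value
BRACKET_SUFFIX = re.compile(r'\s*\[[^\]]*\]\s*$')


def remove_bracket_text_sketch(df):
    """Return a copy of df with a trailing [text] removed from string cells."""
    def strip_suffix(value):
        if isinstance(value, str):
            return BRACKET_SUFFIX.sub('', value)
        return value  # leave NaN and other non-string values untouched

    return df.apply(lambda col: col.map(strip_suffix))


df = pd.DataFrame({
    'Sample': ['A', 'B'],
    'Globigerina bulloides': ['R [rare]', None],
})
cleaned = remove_bracket_text_sketch(df)

print(cleaned['Globigerina bulloides'].iloc[0])
```

This version touches only cell values, so bracketed units in header names such as `Top [cm]` are not affected.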

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)
    
    df = remove_bracket_text(df)
    
    df.to_csv(path, index=False)

## Turn one column into multiple columns 

For some files, the `Sample` column was given, but the `Exp, Site, Hole, Core, Type, Section, A/W` columns were not.

Sample: 363-U1483A-1H-2-W 75/77-FORAM  
Exp: 363, Site: U1483, Hole: A, Core: 1, Type: H, Section: 2, A/W: W

We created `normalize_expedition_section_cols` to convert `Sample` into separate `Exp, Site, Hole, Core, Type, Section, A/W` columns.

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)   
    
    df = normalize_expedition_section_cols(df)
    
    df.to_csv(path, index=False) 
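
The splitting step can be sketched with `Series.str.extract` and a regular expression written for the example sample name above. The pattern is an assumption based on that single example and does not attempt to handle other `Sample` formats; the real logic is in `normalize_expedition_section_cols`.

```python
import pandas as pd

# groups: Exp - Site + Hole - Core + Type - Section - A/W
SAMPLE_PATTERN = r'^(\d+)-([A-Z]?\d+)([A-Z])-(\d+)([A-Z])-(\w+)-([A-Z]+)'
PARTS = ['Exp', 'Site', 'Hole', 'Core', 'Type', 'Section', 'A/W']

df = pd.DataFrame({'Sample': ['363-U1483A-1H-2-W 75/77-FORAM']})

# one output column per capture group; rows that do not match become NaN
parts = df['Sample'].str.extract(SAMPLE_PATTERN)
parts.columns = PARTS

df = pd.concat([df, parts], axis=1)
print(df.loc[0, PARTS].tolist())
```

Positional groups are used instead of named groups because `A/W` is not a valid group name in a Python regular expression.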

## check if mandatory columns exist

In [None]:
required_columns = {
 'A/W',
 'Bottom [cm]',
 'Bottom Depth [m]',
 'Core',
 'Exp',
 'Hole',
 'Sample',
 'Section',
 'Site',
 'Top [cm]',
 'Top Depth [m]',
 'Type'
}

In [None]:
for path in paths:
    df = pd.read_csv(path, dtype=str)
    # required columns that are missing from this file
    diff = required_columns - set(df.columns)

    if diff:
        print(path)
        print(diff)
    