# Common Data Cleaning and Structuring Tasks 

This notebook is a space to test out some common data cleaning tasks. The goal is to turn these into functions and scripts that could then be fed to a GUI (for example, [Gooey](https://github.com/chriskiehl/Gooey)).

## Double Spaces
Consecutive spaces are very difficult to detect visually, and we always want to get rid of them, so they make a good first test for our data cleaning tasks.  
We will need regular expressions, so let's start with an `import` statement for the regex module `re`. We will also be using DataFrames heavily, so we will import `pandas` as well.

In [27]:
import re
import pandas as pd

We need to have `pandas` read the test (csv) file that we _know_ has double spaces, which is in our `/test` directory: `double_spaces.csv`. This will create our DataFrame:

In [28]:
data_file = pd.read_csv("../test/double_spaces.csv")

Now that we have our DataFrame (called `data_file`), we can call its `.head()` method to inspect the first 5 rows:

In [30]:
data_file.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,
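Because double spaces are so hard to spot in a preview like this, it can help to count the affected cells per column before cleaning anything. The following is a minimal sketch on made-up values; `sample` and `count_double_spaces` are hypothetical names introduced only for illustration, not part of the test file.

```python
import pandas as pd

# Made-up sample data standing in for the real CSV
sample = pd.DataFrame({
    "Title": ["St.  Francis", "Program", "Image  1"],
    "Level": ["Object", "Component", "Component"],
})

def count_double_spaces(data_frame):
    """Count, per column, the cells that contain two consecutive spaces."""
    def has_run(value):
        # Non-string cells (numbers, missing values) never count
        return isinstance(value, str) and "  " in value
    return {col: int(data_frame[col].map(has_run).sum()) for col in data_frame.columns}

counts = count_double_spaces(sample)
print(counts)
```

A column with a non-zero count is a good candidate for a closer look before and after the cleanup.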


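Now we can use `DataFrame.replace()` (a `pandas` method that, with `regex=True`, behaves much like `re.sub` from the regular expression module).  
We want to replace every run of two or more consecutive spaces with a single space. A pattern like `\s\s` only replaces pairs, so three spaces in a row would still leave two behind, and `\s` also matches tabs and newlines:

In [5]:
data_file.replace(r' {2,}', ' ', regex=True, inplace=True)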
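Runs of three or more spaces are worth checking separately, because a pattern that matches exactly two whitespace characters replaces each pair independently. The sketch below compares `\s\s` with ` {2,}` on made-up values; `sample`, `pairs_only` and `whole_runs` are hypothetical names used only for this comparison.

```python
import pandas as pd

# Made-up values with runs of two, three and four spaces
sample = pd.DataFrame({"Title": ["a  b", "a   b", "a    b"]})

# Replaces each pair of whitespace characters with one space
pairs_only = sample.replace(r"\s\s", " ", regex=True)

# Replaces each whole run of two or more spaces with one space
whole_runs = sample.replace(r" {2,}", " ", regex=True)

print(pairs_only["Title"].tolist())
print(whole_runs["Title"].tolist())
```

With the pair-based pattern, the three- and four-space values still contain a double space after one pass, while the run-based pattern leaves single spaces everywhere.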

Let's define this in a function so we can reuse it if we want in the future:

In [24]:
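def remove_double_spaces(data_input):
    """
    Take the DataFrame and return a copy with runs of consecutive spaces collapsed
    """
    # Without inplace=True, replace() returns a new DataFrame
    # (with inplace=True it returns None, so there would be nothing to return)
    return data_input.replace(to_replace=r' {2,}', value=' ', regex=True)

In [25]:
data_file = remove_double_spaces(data_file)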

In [26]:
data_file.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,


Upon inspection of the result, we see that all the spaces are now single. And since we made a function, we can add it to our overall program and write a test to make sure the function is actually doing what we think it should be doing.
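As a rough idea of what such a test could look like, here is a minimal sketch using plain `assert` statements on a small in-memory DataFrame. It assumes a version of `remove_double_spaces` that returns a new DataFrame rather than modifying its input; the sample values and the test function name are made up for illustration.

```python
import pandas as pd

def remove_double_spaces(data_input):
    """Assumed variant: return a copy with runs of spaces collapsed to one."""
    return data_input.replace(to_replace=r" {2,}", value=" ", regex=True)

def test_remove_double_spaces():
    dirty = pd.DataFrame({
        "Title": ["Image  1", "St.   Francis", "Program"],
        "Level": [1, 2, 3],
    })
    cleaned = remove_double_spaces(dirty)
    # Runs of two or three spaces both collapse to a single space
    assert cleaned["Title"].tolist() == ["Image 1", "St. Francis", "Program"]
    # Non-string columns pass through unchanged
    assert cleaned["Level"].tolist() == [1, 2, 3]
    # The input DataFrame itself is not modified
    assert dirty["Title"].tolist()[0] == "Image  1"

test_remove_double_spaces()
```

The same function body could be picked up by a test runner such as `pytest` once the helpers move out of the notebook and into a script.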

## Trimming whitespace

Similar to double spacing, cell values with leading or trailing whitespace are a pain and, again, very hard to detect visually. Python strings have the built-in `.strip()` method (there is no `trim()`), and `pandas` offers the same operation for a column of strings as `.str.strip()`.  

Where this gets tricky is that `.str.strip()` works on a single column (a Series), and `DataFrame.apply()` works on whole rows or columns, whereas we need `.applymap()` to apply a function to every cell of the _entire_ DataFrame (newer `pandas` releases, 2.1 and later, rename this method to `DataFrame.map()`). We must also take into account that not every value in our DataFrame is a string, so we will use a conditional to apply `.strip()` _only_ to strings; calling it on a number or a missing value would raise an error:

In [33]:
data_file2 = pd.read_csv("../test/untrimmed_spaces.csv")

In [34]:
def trim_spaces(data_frame):
    """
    Take the DataFrame and remove surrounding spaces of values
    """
    # Only strings have .strip(); leave numbers and missing values unchanged
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return data_frame.applymap(trim_strings)

In [35]:
result2 = trim_spaces(data_file2)

In [36]:
result2.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,


That's another function for our program done!
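Since `.applymap()` is deprecated in `pandas` 2.1 and later in favor of `DataFrame.map()`, a version-tolerant variant might be useful if the script has to run on different installations. This is only a sketch: `trim_spaces_compat` and `sample` are hypothetical names, and the data is made up.

```python
import pandas as pd

def trim_spaces_compat(data_frame):
    """Strip leading/trailing whitespace from every string cell."""
    def trim_strings(value):
        return value.strip() if isinstance(value, str) else value
    # Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions
    elementwise = getattr(data_frame, "map", None) or data_frame.applymap
    return elementwise(trim_strings)

# Made-up sample with untrimmed strings and a numeric column
sample = pd.DataFrame({"Title": ["  Program ", "Image 1"], "Level": [1, 2]})
trimmed = trim_spaces_compat(sample)
print(trimmed["Title"].tolist())
```

The numeric `Level` column passes through untouched because the helper only calls `.strip()` on strings.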

### Stripping strings based on a character
One use case we have: in the report we get exported, each entity is placed in a cell as its value followed by an '@ bdxxxxx' ARK number. We don't want the ARK values in the cell, so we'll try to slice those out. Let's try this using just base (built-in) Python functions first:

In [4]:
string = "tha - Thai @ bd21038465"
stripped = string.rsplit('@')
print(stripped[0].strip())

tha - Thai


In [14]:
strings = ["tha - Thai @ bd21038465", "tgl - Tagalog @ bd07385470", "tam - Tamil @ bd11482017", "syr - Syriac @ bd2445142b"]
for s in strings:
    s = s.rsplit('@')
    print(s[0].strip())

tha - Thai
tgl - Tagalog
tam - Tamil
syr - Syriac
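The same idea can be wrapped in a small reusable helper. The sketch below uses `str.partition`, which splits on the first `@` only and returns the whole string when there is no `@` at all; `strip_ark` is a hypothetical name, and the second sample value is included only to show the no-`@` case.

```python
def strip_ark(value):
    """Return the text before the first '@', trimmed; values without '@' pass through."""
    label, _separator, _ark = value.partition("@")
    return label.strip()

values = [
    "tha - Thai @ bd21038465",
    "zxx - No linguistic content; Not applicable",  # no ARK attached
]
labels = [strip_ark(v) for v in values]
print(labels)
```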


Now, let's try to do that in `pandas`:

In [42]:
df = pd.DataFrame({'Language': ['tha - Thai @ bd21038465', 
                                'tgl - Tagalog @ bd07385470', 
                                'tam - Tamil @ bd11482017', 
                                'syr - Syriac @ bd2445142b']})
df

Unnamed: 0,Language
0,tha - Thai @ bd21038465
1,tgl - Tagalog @ bd07385470
2,tam - Tamil @ bd11482017
3,syr - Syriac @ bd2445142b


In [43]:
# Keep only the part before the '@' and trim the space left in front of it
df['Language'] = df['Language'].str.split('@').str[0].str.strip()

In [44]:
df

Unnamed: 0,Language
0,tha - Thai
1,tgl - Tagalog
2,tam - Tamil
3,syr - Syriac
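Splitting on every `@` would also cut values where the `@` introduces something other than an ARK, such as a URL. A more targeted option is to remove only a trailing ARK with a regular expression. This is a sketch under an assumption: that an ARK always looks like `bd` followed by lowercase letters and digits at the end of the cell, as in the examples above. `ARK_SUFFIX` and `df_example` are hypothetical names, and the URL row is made up.

```python
import pandas as pd

# Assumed ARK shape: "@", then "bd" plus lowercase letters/digits, at the end of the cell
ARK_SUFFIX = r"\s*@\s*bd[0-9a-z]+\s*$"

df_example = pd.DataFrame({"Language": [
    "tha - Thai @ bd21038465",
    "eng - English",                        # nothing to remove
    "Some resource @ https://example.org",  # '@' followed by a URL, not an ARK
]})

df_example["Language"] = df_example["Language"].str.replace(ARK_SUFFIX, "", regex=True)
print(df_example["Language"].tolist())
```

Only the first value changes; the other two are left exactly as they were.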


### Dates
Ah, the true bane of our existence: dates. We will attempt to do some date parsing. Also, where we can, we should take clean dates and use them to populate the "Begin date" and "End date" columns.

In [37]:
df_dates = pd.read_excel("../test/catalhoyuk_glimpse_v2_delimiters.xlsx")

In [38]:
columns = df_dates.columns
print(columns)

Index(['Object Unique ID', 'Level', 'File name', 'File use',
       'Type of Resource', 'Language', 'Title',
       'Person:Principal investigator', 'Person:Author',
       'Person:Research team member', 'Corporate:Contributor', 'Date:creation',
       'Begin date', 'End date', 'Note:description', 'Note:technical details',
       'Note:scope and content', 'Note:preferred citation',
       'Note:related publications', 'Subject:topic', 'Geographic:point',
       'Related resource:related', 'Related resource:related.1'],
      dtype='object')


In [39]:
for column in columns:
    if column.lower().startswith("date"):
        print(column)

Date:creation


In [40]:
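date_begin = []
date_end = []

# Only one column name starts with "date" here ("Date:creation"), so take the first match;
# looping over several date columns would make the lists longer than the DataFrame
date_column = next(c for c in columns if c.lower().startswith("date"))

for row in df_dates[date_column]:
    year = str(row).strip()
    # A bare four-digit year becomes a full-year range; anything else is left blank
    if len(year) == 4 and year.isdigit():
        date_begin.append(year + "-01-01")
        date_end.append(year + "-12-31")
    else:
        date_begin.append("")
        date_end.append("")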

In [41]:
df_dates['Begin date'] = date_begin
df_dates['End date'] = date_end
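To make this step reusable and testable like the earlier functions, the year check could be pulled into a small helper. The following is a minimal sketch on made-up values, not the real spreadsheet; `year_to_range`, `YEAR_ONLY` and `df_example` are hypothetical names, and only bare four-digit years are treated as clean dates.

```python
import re
import pandas as pd

YEAR_ONLY = re.compile(r"\d{4}")

def year_to_range(value):
    """Return (begin, end) dates for a bare four-digit year, else two empty strings."""
    text = str(value).strip()
    if YEAR_ONLY.fullmatch(text):
        return f"{text}-01-01", f"{text}-12-31"
    return "", ""

# Made-up sample: a string year, an integer year, free text, and a missing value
df_example = pd.DataFrame({"Date:creation": ["2018", 2019, "circa 1990", None]})

ranges = [year_to_range(v) for v in df_example["Date:creation"]]
df_example["Begin date"] = [begin for begin, _ in ranges]
df_example["End date"] = [end for _, end in ranges]
print(df_example)
```

Values that are not a plain year, such as free text or missing cells, get empty strings so they can be reviewed by hand later.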

In [42]:
df_dates.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Principal investigator,Person:Author,Person:Research team member,...,End date,Note:description,Note:technical details,Note:scope and content,Note:preferred citation,Note:related publications,Subject:topic,Geographic:point,Related resource:related,Related resource:related.1
0,1.0,Object,,,data | still image | software,zxx - No linguistic content; Not applicable,"Çatalhöyük, South Area, 'Shrine' 10 Virtual Re...","Lercari, Nicola",,"Cox, Grant; Busacca, Gesualdo; Campiani, Arian...",...,2018-12-31,3-D reconstruction of the entire Çatalhöyük 'S...,,Updates to the preceding (2018) version: Win 1...,"Lercari, Nicola; Cox, Grant; Busacca, Gesualdo...","Hodder, I. 2007. Çatalhöyük in the Context of ...","Neolithic settlement, Neolithic building","37.672122,32.823508",,2018 version (pre-peer review) of this item @ ...
1,1.0,Component,S.10_Sequence_Levels_Render.PNG,image-source,,,3-D render of all superimposed buildings in th...,,"Lercari, Nicola",,...,2018-12-31,3-D render displaying 'Shrine' 10 Sequence lev...,Snapshot captured in Unity 3D. File format is PNG,,,,,,,
2,1.0,Component,S.10_Sequence_App_Win64.zip,data-service,,,Unity 3D app including Win64 Executable and al...,,"Aboulhosn, Jad; Lercari, Nicola",,...,2019-12-31,This archive includes a fully functional demo ...,This app was created in Unity 3D version 2018....,,,,,,,
3,1.0,Component,S.10_Sequence_App_Project.zip,data-service,,,Unity 3D app project folder,,"Aboulhosn, Jad; Lercari, Nicola",,...,2019-12-31,This file includes all assets needed to recrea...,This Unity project was created in Unity 2018.2...,,,,,,Unity User Manual @ https://docs.unity3d.com/M...,
4,2.0,Object,,,data | still image | software | three dimensio...,eng - English,"Çatalhöyük, South Area, Building 17.2.2","Lercari, Nicola",,"Cox, Grant; Busacca, Gesualdo; Campiani, Arian...",...,2018-12-31,"3-D reconstruction of Çatalhöyük, South Area, ...",,Updates to the preceding (2018) version: metad...,"Lercari, Nicola; Cox, Grant; Busacca, Gesualdo...","Farid, S. 2007a. Level IX Relative Heights, Bu...","Neolithic settlement, Neolithic building","37.672122,32.823508",Building 17 excavation data provided by Çatalh...,2018 version (pre-peer review) of this item @ ...
