# Common Data Cleaning and Structuring Tasks 

This notebook is a space to test out some common data cleaning tasks. The goal is to turn these into functions and scripts that could then be fed to a GUI tool (for example, [Gooey](https://github.com/chriskiehl/Gooey)).
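As a rough sketch of what "a script that could be fed to a GUI" might look like: Gooey is built to wrap `argparse`-based command-line programs, so a cleaning script could start from a parser like the one below. The argument names (`input_csv`, `output_csv`, `--collapse-spaces`, `--trim`) are hypothetical and only illustrate the shape of such a script.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical CSV cleaning script."""
    parser = argparse.ArgumentParser(description="Clean a CSV file")
    parser.add_argument("input_csv", help="path to the CSV file to clean")
    parser.add_argument("output_csv", help="path for the cleaned CSV file")
    # Each cleaning task in this notebook could become an optional flag
    parser.add_argument("--collapse-spaces", action="store_true",
                        help="collapse consecutive spaces into one")
    parser.add_argument("--trim", action="store_true",
                        help="strip leading and trailing whitespace")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["in.csv", "out.csv", "--trim"])
print(args.input_csv, args.output_csv, args.trim, args.collapse_spaces)
```

Keeping the parser in its own function makes it easy to reuse from both a plain command-line entry point and a GUI wrapper.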

## Double Spaces
Consecutive spaces are very difficult to detect visually, and in all cases we want to get rid of them, so they make a good first test for our data cleaning tasks. We will need regular expressions, so let's first get some `import` statements going: the regex module `re`, the `csv` module, and `pandas` for good measure.

In [1]:
import re
import csv
import pandas as pd

We need to have Python read the test file (a .csv) that we know has double spaces, `double_spaces.csv`:

In [45]:
data_file = pd.read_csv("test/double_spaces.csv")

In [46]:
data_file.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,


Now we can use the DataFrame's `replace` method with `regex=True` (which behaves much like the `re.sub` function) to replace the pattern of two spaces with one:

In [47]:
# Raw string avoids an invalid-escape warning; {2,} also collapses runs of three or more
data_file.replace(r'\s{2,}', ' ', regex=True, inplace=True)
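To see how the choice of pattern matters, here is a small self-contained comparison on made-up data (the `Title` values are hypothetical and not taken from `double_spaces.csv`). It contrasts a pattern that matches exactly two whitespace characters with one that matches two or more.

```python
import pandas as pd

# Hypothetical sample data: one double space, one triple space, one clean value
sample = pd.DataFrame({"Title": ["St.  Francis", "Image   1", "Program"]})

# Exactly two whitespace characters: a run of three is reduced to two, not one
pairs = sample.replace(r"\s\s", " ", regex=True)

# Two or more whitespace characters: any run collapses to a single space
runs = sample.replace(r"\s{2,}", " ", regex=True)

print(pairs["Title"].tolist())
print(runs["Title"].tolist())
```

Note that `\s` also matches tabs and newlines. If only literal space characters should be collapsed, a pattern such as `r" {2,}"` would be a narrower alternative.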

Let's wrap this in a function so we can reuse it, and confirm that it works:

In [48]:
def remove_double_spaces(data_frame):
    """
    Take the DataFrame and return a copy with consecutive spaces collapsed
    """
    # No inplace=True here: with inplace=True, replace() returns None,
    # which would make the function return None instead of a DataFrame
    despaced = data_frame.replace(to_replace=r'\s{2,}', value=' ', regex=True)
    return despaced

In [51]:
result1 = remove_double_spaces(data_file)

In [52]:
result1.head()

In [53]:
# Uncomment to write the result to a file
#result1.to_csv('output_test1.csv')

Upon inspection of the result file, we see that all the spaces are now single. And since we made a function, we can add it to our overall program, and write a test for it if we want.
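A minimal sketch of such a test follows. It assumes a version of `remove_double_spaces` that returns a new DataFrame rather than modifying its input, and it uses a made-up fixture; the `test_` prefix follows the pytest naming convention, which is an assumption about the test runner.

```python
import pandas as pd


def remove_double_spaces(data_frame):
    """Return a copy of the DataFrame with runs of whitespace collapsed."""
    return data_frame.replace(to_replace=r"\s{2,}", value=" ", regex=True)


def test_remove_double_spaces():
    # Hypothetical fixture, not taken from the test CSV files
    messy = pd.DataFrame({
        "Title": ["Image  1", "Image 2"],
        "Level": ["Object", "Component"],
    })
    cleaned = remove_double_spaces(messy)

    # The double space is collapsed; clean values are untouched
    assert cleaned["Title"].tolist() == ["Image 1", "Image 2"]
    # The input DataFrame is not modified
    assert messy["Title"].tolist() == ["Image  1", "Image 2"]
    # No cell in the result contains two consecutive whitespace characters
    leftover = cleaned.apply(lambda col: col.str.contains(r"\s\s", na=False))
    assert not leftover.any().any()


# Run directly here; under pytest the function would be discovered by name
test_remove_double_spaces()
```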

## Trimming whitespace

Similar to double spacing, cell values with leading or trailing whitespace are a pain and, again, very hard to detect visually. Python strings have a built-in `strip()` method (what many other languages call `trim()`), so perhaps that can be applied to our data?
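For reference, Python's string type exposes this as `str.strip()`, with `lstrip()` and `rstrip()` for one side only. A quick illustration on a made-up value:

```python
# Hypothetical cell value with leading spaces and a trailing space and tab
raw = "  Image 1 \t"

both = raw.strip()    # removes whitespace from both ends
left = raw.lstrip()   # removes leading whitespace only
right = raw.rstrip()  # removes trailing whitespace only

# repr() makes the remaining whitespace visible
print(repr(both), repr(left), repr(right))
```

Whitespace inside the value (the space between `Image` and `1`) is left alone, which is what we want for cell contents.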

In [54]:
data_file2 = pd.read_csv("test/untrimmed_spaces.csv")

In [55]:
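def trim_spaces(data_frame):
    """
    Take the DataFrame and remove leading and trailing whitespace from string values
    """
    # Only strings have strip(); numbers and missing values pass through unchanged
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    # Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map
    return data_frame.applymap(trim_strings)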

In [56]:
result2 = trim_spaces(data_file2)

In [57]:
result2.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,
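Since leading and trailing whitespace is invisible in printed output like the table above, one way to check whether trimming is needed (or whether it worked) is to count the cells that change under `strip()`. The helper name `count_untrimmed` and the sample data below are hypothetical; this is a sketch, not part of the original notebook.

```python
import pandas as pd


def count_untrimmed(data_frame):
    """Count, per column, the string cells with leading or trailing whitespace."""
    is_untrimmed = lambda x: isinstance(x, str) and x != x.strip()
    # Map every cell to True/False, then sum the True values in each column
    return data_frame.apply(lambda col: col.map(is_untrimmed).sum())


# Hypothetical sample with a leading space, a trailing space, and non-string values
sample = pd.DataFrame({
    "Title": [" Image 1", "Image 2 ", "Program"],
    "Level": ["Object", None, 1],
})

counts = count_untrimmed(sample)
print(counts)
```

Running the same check on the output of `trim_spaces` should give zero for every column.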


In [41]:
# Uncomment to write the result to a file
# result2.to_csv('output_test2.csv')