# Common Data Cleaning and Structuring Tasks 

This notebook is a space for testing some common data cleaning tasks. The goal is to turn them into functions and scripts that could later be driven from a GUI built with a tool such as Gooey.

## Double Spaces
Double spaces are very hard to spot visually, and we almost always want to get rid of them, so they make a good first test for our data cleaning tasks. We will need regular expressions, so let's first get some `import` statements going: the regex module `re`, the `csv` module, and `pandas` for good measure.

In [80]:
import re
import csv
import pandas as pd

Next, Python needs to read the test file, `test/double_spaces.csv`, a .csv that we know contains double spaces.

In [81]:
data_file = pd.read_csv("test/double_spaces.csv")

In [82]:
data_file.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,


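Now we can use the DataFrame's `replace` method with `regex=True` (which works much like the `re.sub` function) to replace every run of two or more spaces with a single space:

In [74]:
# The raw string keeps the regex escapes intact; ' {2,}' also collapses runs of three or more spaces.
data_file.replace(r' {2,}', ' ', regex=True, inplace=True)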
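
The choice of pattern matters for runs longer than two spaces. The following is a minimal sketch on a toy DataFrame (the values are made up for illustration and are not taken from `double_spaces.csv`) that compares a pattern matching exactly two whitespace characters with one matching any run of two or more spaces:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Title": ["St.  Francis", "Image   1"],  # two spaces, then three spaces
    "Level": ["Object", "Component"],
})

# Matches exactly two whitespace characters at a time.
pairwise = toy.replace(r"\s\s", " ", regex=True)

# Matches any run of two or more spaces.
collapsed = toy.replace(r" {2,}", " ", regex=True)

print(repr(pairwise.loc[1, "Title"]))   # 'Image  1' (two spaces remain)
print(repr(collapsed.loc[1, "Title"]))  # 'Image 1'
```

Note that `\s` also matches tabs and newlines, so a `\s`-based pattern would flatten line breaks inside a cell as well, which may or may not be wanted.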

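To make this step reusable, let's wrap it in functions, one for a DataFrame and one for a plain string, then run the DataFrame version and confirm it worked: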

In [83]:
def remove_double_spaces(data_frame):
    """
    Take the DataFrame and collapse consecutive spaces in place
    """
    # With inplace=True, replace() modifies data_frame and returns None,
    # so there is no new object to return.
    data_frame.replace(to_replace=r' {2,}', value=' ', regex=True, inplace=True)

In [78]:
# Named separately so that it does not shadow the DataFrame version above.
def remove_double_spaces_in_text(data_input):
    """
    Take a string and collapse consecutive spaces
    """
    result = re.sub(r' {2,}', ' ', data_input)
    return result
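
The string-based approach can also be applied without pandas. The sketch below is an assumption about how that might look, not part of the notebook: a hypothetical helper `clean_rows` runs `re.sub` on each field produced by `csv.reader`, so quoted fields that contain commas are handled as single values. The sample data is made up and read from an in-memory buffer.

```python
import csv
import io
import re

def clean_rows(rows):
    """Yield each row with runs of spaces inside every field collapsed to one."""
    for row in rows:
        yield [re.sub(r" {2,}", " ", field) for field in row]

# Made-up sample standing in for a CSV file on disk.
sample = io.StringIO('Title,Note\n"St.  Francis","rock  opera,  live"\n')

cleaned = list(clean_rows(csv.reader(sample)))
print(cleaned)
```

Cleaning field by field, rather than running the regex over the raw file text, leaves the CSV quoting and delimiters to the `csv` module.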

In [84]:
remove_double_spaces(data_file)

In [88]:
data_file.head()

Unnamed: 0,Object Unique ID,Level,File name,File use,Type of Resource,Language,Title,Person:Creator,Person:Choreographer,Person:Composer,...,End date,Note:description,Subject:topic,Subject:personal name,Subject:geographic,Related resource:related,Access granted,Copyright status,Copyright holder,CC license
0,1,Object,,,moving image | still image,eng - English,St. Francis de los Barrios,,,"Waters, Joseph",...,2017-12-06,"An indie-class, trans-genre rock opera based o...",Opera | Transgender sex workers,Saint Francis of Assisi | Pope Francis,"Tijuana (Baja California, Mexico) | Zona Norte...",IDEAS Performance Photographs via Flickr @ htt...,,,,
1,1,Component,IdeasStFrancis_2017_12_06_HQ.mp4,video-source,,,Program,,,,...,,,,,,,,,,
2,1,Component,IDEAS_St_Francis_de_los_Barrios_01_of_88.jpg,image-source,,,Image 1,,,,...,,,,,,,,,,
3,1,Component,IDEAS_St_Francis_de_los_Barrios_02_of_88.jpg,image-source,,,Image 2,,,,...,,,,,,,,,,
4,1,Component,IDEAS_St_Francis_de_los_Barrios_03_of_88.jpg,image-source,,,Image 3,,,,...,,,,,,,,,,
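
Because `head()` only shows the first five rows and truncates long cells, a programmatic check may be more reliable than reading the preview. Here is a minimal sketch, assuming a hypothetical helper named `count_double_spaces` that counts the cells still containing two or more consecutive spaces. The toy DataFrame is made up for illustration.

```python
import pandas as pd

def count_double_spaces(data_frame):
    """Count the cells that contain two or more consecutive spaces."""
    # Convert everything to text so that numeric and missing cells can be searched too.
    as_text = data_frame.astype(str)
    flags = as_text.apply(lambda column: column.str.contains(r" {2,}", regex=True))
    return int(flags.to_numpy().sum())

# Made-up data: one cell with a double space, plus a number and a missing value.
toy = pd.DataFrame({
    "Title": ["Image  1", "Image 2"],
    "Object Unique ID": [1, 1],
    "Language": ["eng - English", None],
})

print(count_double_spaces(toy))
print(count_double_spaces(toy.replace(r" {2,}", " ", regex=True)))
```

A count of zero after cleaning suggests that no double spaces are left anywhere in the data, not just in the rows that `head()` happens to show.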


In [45]:
with open('output.csv', 'w', newline='') as output_file:
    # to_csv accepts an open file handle; index=False leaves the row index out of the file
    data_file.to_csv(output_file, index=False)
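
The `Unnamed: 0` column in the previews above is typically what pandas produces when a CSV was saved together with its row index and then read back without `index_col`. The sketch below illustrates that round trip on a small made-up DataFrame, using in-memory buffers instead of real files. It is an assumption about where the column came from, not something the test file confirms.

```python
import io
import pandas as pd

# Made-up data for illustration.
frame = pd.DataFrame({
    "Level": ["Object", "Component"],
    "Title": ["St. Francis de los Barrios", "Program"],
})

# Saving with the default index=True adds an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# Saving with index=False keeps only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)     # ['Unnamed: 0', 'Level', 'Title']
print(columns_without_index)  # ['Level', 'Title']
```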