[Dataquest](https://app.dataquest.io/m/354/regular-expression-basics/1/introduction)
[RegularEx](https://app.dataquest.io/m/351/cleaning-and-preparing-data-in-python/1/introducing-data-cleaning)

[Good tutorial](https://www.dataquest.io/blog/regular-expressions-data-scientists/)
[Ref](https://www.tutorialspoint.com/python/python_reg_expressions.htm)

# List of lists

In [2]:
from csv import reader

In [11]:
# Read the CSV into a list of lists; the with block closes the file afterwards
with open("artworks.csv") as opened_file:
    moma = list(reader(opened_file))
moma = moma[1:]  # drop the header row

## Regular expression

### [String methods](https://www.datacamp.com/community/tutorials/python-string-tutorial)

### str.replace - remove symbols (tidy data)

In [16]:
moma[1]

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
 'Pablo Palazuelo',
 'Spanish',
 '(1916)',
 '(2007)',
 'Male',
 '1978',
 'Prints & Illustrated Books']

In [17]:
for row in moma:
    nationality=row[2]
    nationality=nationality.replace("(","")
    nationality=nationality.replace(")","")
    row[2]=nationality
    
    gender=row[5]
    gender=gender.replace("(","")
    gender=gender.replace(")","")
    row[5]=gender

In [18]:
moma[1]

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
 'Pablo Palazuelo',
 'Spanish',
 '(1916)',
 '(2007)',
 'Male',
 '1978',
 'Prints & Illustrated Books']

----

#### Symbol at head/tail

In [24]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]

def strip_characters(string):
    for char in bad_chars:
        string=string.replace(char,"")
    return string

stripped_test_data=[]

for row in test_data:
    row=strip_characters(row)
    stripped_test_data.append(row)
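
Note that str.replace() removes a character wherever it occurs, not only at the head or tail of the string. When the unwanted symbols sit only at the ends, str.strip() with a set of characters may be a closer fit. A minimal sketch on a few values from test_data (the name edge_chars is made up for this example):

```python
# Characters to trim from either end of a string; edge_chars is a hypothetical name
edge_chars = "()cC.s' "

samples = ["(1951)", "c. 1915", "c. 1955.", "c. 1970's", "C. 1990-1999"]

# str.strip(chars) removes any of the given characters from both ends only;
# characters in the middle of the string (such as "-") are left alone
stripped_samples = [s.strip(edge_chars) for s in samples]
print(stripped_samples)  # ['1951', '1915', '1955', '1970', '1990-1999']
```
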

#### Symbol in between: str.split()

In [26]:
def process_date(date):
    if "-" in date:
        split_date = date.split("-")
        date_one = split_date[0]
        date_two = split_date[1]       
        date = (int(date_one) + int(date_two)) / 2
        date = round(date)
    else:
        date = int(date)
    return date

In [27]:
stripped_test_data = ['1912', '1929', '1913-1923',
                      '1951', '1994', '1934',
                      '1915', '1995', '1912',
                      '1988', '2002', '1957-1959',
                      '1955', '1970', '1990-1999']


processed_test_data = []

for d in stripped_test_data:
    date = process_date(d)
    processed_test_data.append(date)
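
For reference, a self-contained sketch of what this processing gives for a few of the values. One detail worth knowing: Python 3's round() rounds halves to the nearest even integer, so the midpoint 1994.5 becomes 1994, not 1995. The function is repeated in condensed form only so the snippet runs on its own.

```python
def process_date(date):
    # Same logic as above: average the two years of a range, else convert directly
    if "-" in date:
        date_one, date_two = date.split("-")
        return round((int(date_one) + int(date_two)) / 2)
    return int(date)

results = {d: process_date(d)
           for d in ["1912", "1913-1923", "1957-1959", "1990-1999"]}
print(results)
# {'1912': 1912, '1913-1923': 1918, '1957-1959': 1958, '1990-1999': 1994}
```
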


In [28]:
for row in moma:
    date = row[6]
    date = strip_characters(date)
    date = process_date(date)
    row[6] = date

### str.title - capitalization

The Gender column in our data set contains four unique values:

- "" (an empty string)
- "Male"
- "Female"
- "male"

These values are inconsistent: "male" and "Male" mean the same thing. Two obvious fixes with str.replace() both fail:

- We could use str.replace() to replace m with M, but then we'd end up with instances of FeMale.
- We could use str.replace() to replace male with Male. This would also give us instances of FeMale.

Even if the word "male" weren't contained in the word "female," neither technique would be a good option for a column with many different values.

The str.title() method returns a copy of the string with the first letter of each word transformed to uppercase (also known as title case).
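
A small sketch of the difference, using the four Gender values listed above (the variable names are only for illustration):

```python
values = ["", "Male", "Female", "male"]

# str.replace() matches substrings, so it also hits the "male" inside "Female"
replaced = [v.replace("male", "Male") for v in values]
print(replaced)  # ['', 'Male', 'FeMale', 'Male']

# str.title() uppercases the first letter of each word and lowercases the rest
titled = [v.title() for v in values]
print(titled)    # ['', 'Male', 'Female', 'Male']
```
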

In [19]:
for row in moma:
    gender=row[5]
    gender=gender.title()
    if not gender:
        gender="Gender Unknown/Other"
    row[5]=gender
    
    nationality=row[2]
    nationality=nationality.title()
    if not nationality:
        nationality="Nationality Unknown"
        
    row[2]=nationality

In [20]:
moma[1]

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
 'Pablo Palazuelo',
 'Spanish',
 '(1916)',
 '(2007)',
 'Male',
 '1978',
 'Prints & Illustrated Books']

### str2num

The function below:

- Takes a single argument
- Leaves an empty string unchanged
- Uses str.replace() to remove the "(" character
- Uses str.replace() to remove the ")" character
- Uses the int() function to convert the string to an integer

In [21]:
def clean_and_convert(date):
    # check that we don't have an empty string
    if date != "":

        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date


In [22]:

for row in moma:
    BeginDate=row[3]
    EndDate=row[4]
    BeginDate=clean_and_convert(BeginDate)
    EndDate=clean_and_convert(EndDate)
    
    row[3]=BeginDate
    row[4]=EndDate


In [23]:
moma[1]

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
 'Pablo Palazuelo',
 'Spanish',
 1916,
 2007,
 'Male',
 '1978',
 'Prints & Illustrated Books']

### re module

#### Find and match

- re.findall() finds **all instances** of a pattern in a string and returns them in **a list**.
- re.search() finds the **first instance** of a pattern in a string and returns it as a re **match object**, or None if there is no match.
  - group() returns the matched text from the match object as a string.
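
A minimal sketch of the two functions side by side. The sample string and the four-digit pattern are illustrative assumptions, not taken from the data set:

```python
import re

text = "c. 1913-1923"

# A four-digit year; this pattern is only an example
year_pattern = r"\d{4}"

# findall() returns every match as a list of strings
all_years = re.findall(year_pattern, text)
print(all_years)   # ['1913', '1923']

# search() returns a match object for the first match only
first = re.search(year_pattern, text)
# search() returns None when nothing matches, so check before calling group()
first_year = first.group() if first else None
print(first_year)  # 1913

# No digits here, so search() gives back None
no_match = re.search(year_pattern, "unknown")
print(no_match)    # None
```
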