# Cleaning and Preparing Data in Python

A lot of what data scientists do is cleaning data. In this notebook, you will go over some basic steps for doing this.
***

## 1. Reading our MoMA Data Set

This time we will work with data from the Museum of Modern Art (MoMA), a museum with one of the largest collections of modern art in the world in the center of New York City. Each row in this table represents a unique piece of art from the Museum of Modern Art. Let's take a look at the first five rows:



| Title |  Artist |  Nationality | BeginDate|EndDate|Gender|Date|Department|
|------------|:------:|----------:|---------------------:|-----------:|----------:|---------------------:|-----------:|
| Dress MacLeod from Tartan Sets |Sarah Charlesworth |(American) | (1947) | (2013)|(Female)|1986|Prints & Illustrated Books|
|Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA|   Pablo Palazuelo |  (Spanish)|(1916) |(2007)|(Male)|1978|Prints & Illustrated Books|
|Tailpiece (page 55) from SAGESSE | Maurice Denis| (French) | (1870)|(1943)|(Male)|1889-1911|Prints & Illustrated Books|
|Headpiece (page 129) from LIVRET DE FOLASTRIES, À JANOT PARISIEN |Aristide Maillol| (French) |	(1861) |(1944)|(Male)|1927-1940|Prints & Illustrated Books|
|97 rue du Bac| Eugène Atget| (French) |(1857) |(1927)|(Male)|1903|	Photography|

The MoMA data is in a ``CSV`` file called ``Artworks.csv``. Below is a short explanation of some of the column names that you encountered above.

- `Title:` The title of the artwork.
- `Artist:` The name of the artist who created the artwork.
- `Nationality:` The nationality of the artist.
- `BeginDate:` The year in which the artist was born.
- `EndDate:` The year in which the artist died.
- `Gender:` The gender of the artist.
- `Date:` The date that the artwork was created.
- `Department:` The department inside MoMA to which the artwork belongs.

How do we access the CSV file using Python?
Just as we learned in the first course, **Python has a built-in csv module that can handle the work of reading a CSV for us**.

In [0]:
#First, import the reader() function from the csv module:
from csv import reader

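#Second, use the Python built-in function open() to open the Artworks.csv file.
#The encoding is passed explicitly because some titles contain non-ASCII characters:
opened_file = open('Artworks.csv', encoding='utf-8')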

#Then use reader() to parse (or interpret) the opened_file:
read_file = reader(opened_file)

#Use the list() function to convert the read_file into a list of lists format:
artworks = list(read_file)

#Finally, remove the first row of the data, which contains the column names:
artworks = artworks[1:]
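
The same steps can also be written with a `with` block, which closes the file automatically once the data has been read. The sketch below is a variant, not the notebook's own code. It parses a small in-memory sample (the hypothetical `sample_csv` string) through `io.StringIO` so that it runs without `Artworks.csv` on disk. With the real file, you would pass `open('Artworks.csv', encoding='utf-8')` to `with` instead.

```python
from csv import reader
from io import StringIO

# Hypothetical two-row sample standing in for Artworks.csv (assumption).
sample_csv = (
    "Title,Artist,Nationality\n"
    '"97 rue du Bac",Eugène Atget,(French)\n'
    "Tailpiece (page 55) from SAGESSE,Maurice Denis,(French)\n"
)

# The with block closes the file-like object when the block ends.
with StringIO(sample_csv) as opened_file:
    rows = list(reader(opened_file))

header = rows[0]            # column names
artworks_sample = rows[1:]  # data rows only

print(header)
print(artworks_sample[0])
```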

## 2. Replacing Substrings with the replace Method

Sometimes when we're cleaning data, parts of strings need to be replaced to make our data look clean and consistent.

The technique we will learn in this section is called <b>substring replacement</b>. 
>For example, if we have a string "Swimming is my favorite activity" and we want to change "Swimming" to "Running", substring replacement gives us: "Running is my favorite activity".

To do this, we'll use the `str.replace()` method. It performs two steps:
1.  It finds all instances of the old substring, which in our example is "Swimming".
2.  It replaces each of those instances with the new substring, "Running".

`str.replace()` takes two arguments:
1. old: The substring we want to find and replace.
2. new: The substring we want to replace old with.

When we use `str.replace()`, we write the name of the variable holding the string we want to modify in place of `str`. Let's look at an example in code:

In [0]:
fav_activity = "Swimming is my favorite activity."
print(fav_activity) 
fav_activity = fav_activity.replace("Swimming", "Running")
print(fav_activity)

In the code above, we:

- Created the original string and assigned it to the variable name ``fav_activity``.
- Replaced the substring "Swimming" with the substring "Running" by calling `fav_activity.replace()`.
- Assigned the result back to the original variable name using the `=` sign.

One thing to pay attention to: `str.replace()` replaces every instance of the substring it finds. See the following example:

In [0]:
fav_activity = fav_activity.replace("i", "I")
print(fav_activity)

In the code above, every lowercase "i" in `fav_activity` is replaced with "I". Keep this in mind when you use `str.replace()`.
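
If replacing every instance is not what you want, `str.replace()` also accepts an optional third argument that limits how many replacements are made, counting from the left. The short sketch below uses a hypothetical variable, `sentence`, to show this, and also shows that the original string stays unchanged unless the result is assigned back.

```python
sentence = "Swimming is my favorite activity."

# Optional third argument: replace at most this many occurrences,
# starting from the left.
first_only = sentence.replace("i", "I", 1)
print(first_only)

# Strings are immutable, so the original is unchanged
# because we did not assign the result back to it.
print(sentence)
```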

### Task 2.1.2:

In the text editor below, we have created a string variable `string1` containing the string `"I am awesome."`.
Now use the `str.replace()` method to create a new string, `string2`:
- The new string should have the value `"I am amazing."`.

In [0]:
string1 = "I am awesome."

# Start your code below:


## 3. Cleaning the Nationality and Gender Columns

Now let's see how we can use the `str.replace()` method on a bigger data set. I have a shortened version of our data set below:

| Title |  Artist |  Nationality |Gender|
|------------|:------:|----------:|---------------------:|
| Dress MacLeod from Tartan Sets |Sarah Charlesworth |(American) |(Female)|
|Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA|   Pablo Palazuelo |  (Spanish)|(Male)|
|Tailpiece (page 55) from SAGESSE | Maurice Denis| (French) |(Male)|
|Headpiece (page 129) from LIVRET DE FOLASTRIES, À JANOT PARISIEN |Aristide Maillol| (French) |(Male)|
|97 rue du Bac| Eugène Atget| (French) |(Male)|

Do you see that the values in the Nationality and Gender columns have parentheses (`(` and `)`) at the start and the end? In this section, we want to learn how to remove those characters.

In the previous section, we learned how to use `str.replace()` to replace one substring with another. What we want, however, is to remove a substring, not replace it. **In order to remove a substring, all we need to do is to replace the substring with an empty string: `""`**.

We need to perform this action many times in order to replace all unwanted characters in our whole moma data set. We can do this with a for loop. Let's see an example using a small sample of our data:

In [0]:
nationalities = ['(American)', '(Spanish)', '(French)']

for n in nationalities:
    clean_open = n.replace("(","")
    print(clean_open)

We removed the `(` character from the start of each string. In order to remove both parentheses, we'll have to call `str.replace()` twice:

In [0]:
nationalities = ['(American)', '(Spanish)', '(French)']

for n in nationalities:
    clean_open = n.replace("(","")
    clean_both = clean_open.replace(")","")
    print(clean_both)
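
Because `str.replace()` returns a new string, the two calls can also be chained on one line. The sketch below is an equivalent variant of the loop above. The `cleaned` list is a hypothetical name introduced here to collect the results instead of printing them one by one.

```python
nationalities = ['(American)', '(Spanish)', '(French)']

cleaned = []
for n in nationalities:
    # The second replace() runs on the string returned by the first.
    cleaned.append(n.replace("(", "").replace(")", ""))

print(cleaned)
```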

How can we adapt this code to work on the whole data set? We'll start by printing the value from the Nationality column (at column index `4`) for three rows in our moma data set. We'll print the same rows after our loop so we can see how the values changed:

In [0]:
# Read in csv file
from csv import reader
opened_file = open('Artworks.csv',encoding="utf-8")
read_file = reader(opened_file)
moma = list(read_file)
moma = moma[1:]

print(moma[200][4])
print(moma[400][4])
print(moma[800][4])

Next, we'll loop over each row in the moma data set. In each row, we'll:

- Assign the Nationality value from index `4` to a variable name.
- Use `nationality.replace()` to remove all instances of the open parentheses.
- Use `nationality.replace()` to remove all instances of the close parentheses.
- Assign the cleaned nationality back to row index `4`.

In [0]:
for row in moma:
    nationality = row[4]
    nationality = nationality.replace("(","")
    nationality = nationality.replace(")","")
    row[4] = nationality

In [0]:
# Let's look at the values of the same three rows after our code:

print(moma[200][4])
print(moma[400][4])
print(moma[800][4])

### Task 2.1.3:
Now it's your turn: you'll remove the parentheses from both the `Nationality` and `Gender` columns.
You can find the gender information at index `7` of each row.

In [0]:
# Variables you created in previous cells
# are available to you, so you don't need
# to read the CSV again.

# Start your code here:


## 4. String Capitalization

The Gender column in our data set contains four unique values:

- (an empty string)
- "Male"
- "Female"
- "male"

There are two different capitalizations used in our data set for "male." This could be caused by manual data entry: different people may use different capitalizations when they enter words.

There are a few ways we could handle this using what we know so far:

1. We could use ``str.replace()`` to replace ``m`` with ``M``, but then we'd end up with instances of ``FeMale``.
2. We could use ``str.replace()`` to replace ``male`` with ``Male``. This would also give us instances of ``FeMale``.

However, even if the word "male" weren't contained in the word "female," neither of these techniques would be a good option if we had a column with many different values, like our Nationality column. Instead, we should use the <b>str.title()</b> method.
> ``str.title()``: a Python string method designed specifically for handling capitalization. The method returns a copy of the string with the first letter of each word transformed to uppercase (also known as <b>title case</b>).

Let's take a look at an example:

In [0]:
my_string = "The cool thing about this string is that it has a CoMbInAtIoN of UPPERCASE and lowercase letters!"

my_string_title = my_string.title()
print(my_string_title)

Using title case will give us consistent capitalization for all values in the Gender column.
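
As a quick check of that claim, the sketch below applies `str.title()` to the four Gender values listed at the start of this section (the `genders` and `titled` lists are hypothetical names used only for this illustration). It also shows one caveat worth knowing: `str.title()` capitalizes the letter that follows an apostrophe.

```python
genders = ["Male", "male", "Female", ""]

titled = []
for g in genders:
    titled.append(g.title())
print(titled)

# Caveat: title() also capitalizes the letter after an apostrophe.
caveat = "they're bill's friends".title()
print(caveat)
```

Note that the empty string stays empty, so missing values still need to be handled separately.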

We have a number of rows containing an empty string (`""`) for the Gender column. This could be a result of:

- The person entering the data has no information about the gender of the artist.
- The artist is unknown and so is the gender.
- The artist's gender is non-binary.

Now let's try to use this technique to make the capitalization of both the Nationality and Gender columns uniform. The Nationality column has 84 unique values, so to help you, we'll provide you with the values that aren't already in title case:

- `''`
- `'Nationality unknown'`

### Task 2.1.4:

Use a loop to iterate over all rows in the moma list of lists. For each row:

1. Clean the Gender column.
    - Assign the value from the Gender column, at index ``7``, to a variable.
    - Make the changes to the value of that variable.
        - Use the `str.title()` method to make the capitalization uniform.
        - Use an if statement to check if the value is an empty string. If the value is an empty string, give it the value `"Gender Unknown/Other"`.
    - Assign the modified variable back to list index `7` of the row.
2. Clean the Nationality column of the data set (found at index `4`) by repeating the same technique you used for the Gender column.
    - For missing values in the `Nationality` column, use the string `"Nationality Unknown"`.

In [0]:
# Start your code below:


## 5. Errors During Data Cleaning

We have cleaned the artist nationalities. Now let's have a look at the <b>BeginDate</b> and <b>EndDate</b> columns.

These two columns contain the birth and death years of the artist who created the work. Let's take a look at the columns:

In [0]:
for row in moma[:5]:
    birth_date = row[5]
    death_date = row[6]
    print([birth_date, death_date])

These values are wrapped in parentheses as four-digit strings. How can we clean these columns? We need to:
- Remove the parentheses from the start and the end of each value.
- Convert the values from the string type to an integer type. This will help us perform calculations with them later.

In the previous two sections, we had to repeat code twice: first when we cleaned the Gender column, and then when we cleaned the Nationality column. If we don't want to keep repeating code, we can create a function that performs these operations, then use that function to clean each column.

In [0]:
def clean_and_convert(date):  # Takes a single string argument
    date = date.replace("(", "")  # Remove the "(" character
    date = date.replace(")", "")  # Remove the ")" character
    date = int(date)  # Convert the string to an integer
    return date

If we have a ``BeginDate`` value of '(1958)':

In [0]:
birth_date = '(1958)'
cleaned_date = clean_and_convert(birth_date)
print(cleaned_date)
print(type(cleaned_date))

Our function successfully removes the parentheses and converts the value to an integer type. Unfortunately, our function won't work for every value in our data set. Some cells contain several values at once:

In [0]:
row_43 = moma[42] # row 43
print(row_43)

In [0]:
#This will throw an error

birth_date = '(1936) (0) (1936) (1931) (1931) (1944)'
cleaned_date = clean_and_convert(birth_date)

Our code did not complete successfully; it raised a `ValueError` instead. As we learned in the previous course, the full error report is called a traceback. The final line of the traceback tells us the underlying error:

ValueError: invalid literal for int() with base 10: '1936 0 1936 1931 1931 1944'

One way to handle these scenarios is to use an if statement to check that the value can be converted (for example, that it isn't an empty string or a group of several years) before we pass it to `int()`.
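
A minimal sketch of such a check is shown below. The function name `clean_and_convert_safe()` and the choice to return `None` for values that cannot be converted are assumptions made for illustration, not part of the notebook. The check relies on `str.isdigit()`, which is enough for plain years like the ones in this data set.

```python
def clean_and_convert_safe(date):
    # Hypothetical variant of clean_and_convert(); returning None for
    # unconvertible values is an assumption made for this sketch.
    date = date.replace("(", "").replace(")", "")
    # isdigit() is False for "" and for strings containing spaces,
    # so those values never reach int().
    if date.isdigit():
        return int(date)
    return None

single = clean_and_convert_safe("(1958)")
empty = clean_and_convert_safe("")
several = clean_and_convert_safe("(1936) (0) (1936) (1931) (1931) (1944)")
print(single, empty, several)
```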

## 6. Parsing Numbers from Complex Strings, Part One

Once the ``BeginDate`` and ``EndDate`` columns are numeric, we can combine the data from the BeginDate column (the artist's year of birth) with the data in the Date column (the year of creation) to calculate the age at which the artist produced the artwork.

To perform that calculation, we first need to clean the data in the `Date` column.

Let's examine a sample of the types of values contained in this column:

````python
1912
1929
1913-1923
(1951)
1994
1934
c. 1915
1995
c. 1912
(1988)
2002
1957-1959
c. 1955.
c. 1970's
C. 1990-1999
````

This column contains data in many different formats:

- Some years are in parentheses.
- Some years have c. or C. before them, indicating that the year is approximate.
- Some have year ranges, indicated with a dash.
- Some have 's to indicate a decade.

In this section, we want to remove all the extra characters and be left with either a range or a single year. We will finish processing the data in the section that follows. For the two special cases listed above:

- Where there is an 's that signifies a decade, we'll use the year without the 's.
- Where c. or similar indicates an approximate year, we'll remove the c. but keep the year.

### Task 2.1.6 (OPTIONAL):
1. Create a function called `strip_characters()`, which accepts a string argument and:
    - Iterates over the `bad_chars` list, using `str.replace()` to remove each character.
    - Returns the cleaned string.
2. Create an empty list, `stripped_test_data`.
3. Iterate over the strings in `test_data`, and on each iteration:
    - Use the function you created earlier to clean the string.
    - Append the cleaned string to the `stripped_test_data` list.

In [0]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]


bad_chars = ["(",")","c","C",".","s","'", " "]

# Start your code here:


## 7. Parsing Numbers from Complex Strings, Part Two

Now let's take a look at the result from your previous task:

1912
1929
1913-1923
1951
1994
1934
1915
1995
1912
1988
2002
1957-1959
1955
1970
1990-1999

There are two different scenarios that we need to pay attention to when converting these values into integers:
- Some values are a single year, like 1912.
- Some values are a range of years, like 1913-1923.

As a data scientist, you need to make decisions on how you will structure your code. One option is to discard all approximate years so we know that our calculations are exact. Another is to keep them: when we're calculating an artist's age, an approximate age is also acceptable (the difference between 30 and 33 years old is more nuanced than we need).

Whichever way you decide to proceed isn't as important as thinking about your analysis and having a valid reason for your decision.

So this is what we will do:
- When a value is a single year, like 1912, we'll keep it as it is.
- When a value is a range of years, like 1913-1923, we'll average the two years.

How do we proceed with our above decision? We can do the following:
1. Have an if statement to check if there is a dash character ``-`` in the string, so we know if it's a range or not.
2. If the date is a range:
    - Split the string into two strings, take the first part (before the dash), and the second part (after the dash)
    - Convert the two numbers into integer type
    - Take the average of those two numbers
    - Use the round() function to round the average
3.  If the date isn't a range:
    - Convert the value to an integer type.
    
To check whether a substring exists in a string (here, whether the year is a range), we use the `in` operator. See the examples below:

In [0]:
if "male" in "female":
    print("The substring was found.")
else:
    print("The substring was not found.")

In [0]:
if "love" in "loving":
    print("The substring was found.")
else:
    print("The substring was not found.")

The second step is to split a string into two parts, which we can do with the `str.split()` method. This method splits a string into a list of strings. See a quick example below:

In [0]:
year_in_range = "1995-1998"
print(year_in_range.split("-"))
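
The pieces returned by `str.split()` are still strings, so they need to be converted before any arithmetic. The sketch below illustrates that, together with one detail of `round()` that is easy to miss: in Python 3, a value exactly halfway between two integers is rounded to the nearest even integer. The variable names (`parts`, `start`, `end`, `average`) are hypothetical and only used for this illustration.

```python
year_in_range = "1995-1998"

# split() returns a list of strings; convert each part before doing math.
parts = year_in_range.split("-")
start = int(parts[0])
end = int(parts[1])
average = (start + end) / 2
print(average)

# Python 3 rounds exact halves to the nearest even integer.
print(round(1964.5))
print(round(1965.5))
print(round(average))
```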

### Task 2.1.7 (HARD):

The `test_data` list and a `strip_characters()` function are provided for you in the editor below. This version of `strip_characters()` keeps only digits and the dash character, so it does not need a `bad_chars` list.

1. Create a function called `process_date()` which accepts a string, and follows the logic we outlined above:
    - Checks if the dash character ``-`` is in the string so we know if it's a range or not.
    - If it is a range:
        * Splits the string into two strings, before and after the dash character.
        * Converts the two numbers to the integer type and then averages them by adding them together and dividing by two.
        * Uses the `round()` function to round the average, so values like 1964.5 become 1964.
    - If it isn't a range:
        - Converts the value to an `integer` type.
    - Finally, returns the value.
2. Create an empty list `processed_test_data`.
3. Loop over the `test_data` list. Clean each string with the `strip_characters()` function, convert it with your `process_date()` function, and append the result to the `processed_test_data` list.


4. (OPTIONAL) Once your code works with the test data, you can then iterate over the moma list of lists. This list contains several date formats that we have not discussed so far. Try to deal with them, and any errors you might get, in a way that seems sensible to you:
    - Create an empty list called `moma_dates`.
    - Loop over the rows of the `moma` list of lists. In each iteration:
        - Assign the value from the Date column (index `8`) to a variable.
        - Use the `strip_characters()` function to remove any bad characters.
        - Use `process_date()` to convert the date.
        - Perform any other processing that you see fit to get a clean, single date.
        - Append the processed value to `moma_dates`.

 

In [0]:
from csv import reader
opened_file = open('Artworks.csv', encoding='utf8')
read_file = reader(opened_file)
moma = list(read_file)
moma = moma[1:]  # Remove the header row, as in the earlier cells

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

def strip_characters(string):
    # Remove every character that is not a digit or a dash
    for char in string:
        if char not in '0123456789-':
            string = string.replace(char,"")
    return string



# Start your code here:
