### Python Intermediate Practice Code

#### Intro to Data Cleaning

Data with a consistent format is often described as "clean." As data scientists, we find that not all the data we encounter is clean, so we often need to prepare it in a process called __data cleaning__.

We'll perform data cleaning on a real-world data set of artworks contained in the Museum of Modern Art (MoMA).

#### Reading our MoMA Data Set

In [1]:
from csv import reader

# Open the file in a with block so it is closed once the rows are read
with open('data/artworks.csv', newline='') as opened_file:
    read_file = reader(opened_file)
    moma = list(read_file)
moma_header = moma[0]
moma_data = moma[1:]

#### Replacing Substrings with the str.replace() Method

Often when we're cleaning data, we need to replace parts of strings so our data is consistent. When we want to refer to part of a string, we use the term __substring__.

The __str.replace()__ method is like a "find and replace" tool for strings. It takes two arguments:

1. __old__: The substring we want to find and replace.
2. __new__: The substring we want to replace old with.

When we call __str.replace()__, we substitute the variable name of the string we want to modify for _str_.
Because strings are immutable, the method returns a new string, so we need to use _=_ to assign the result to a variable name.

In [2]:
age1 = "I am twenty-six years old"

age2 = age1.replace('six','seven')
age2

'I am twenty-seven years old'
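
Two details of str.replace() are easy to miss: it replaces every occurrence of old, not just the first, and it never changes the original string. The sketch below illustrates both with a made-up string (not taken from the MoMA data); the optional third argument, count, limits how many replacements are made.

```python
# Made-up example string, not taken from the MoMA data
original = "(American) (French)"

# Every "(" and ")" is replaced, not just the first one
no_parens = original.replace("(", "").replace(")", "")

# The optional count argument limits how many replacements are made
first_only = original.replace("(", "", 1)

print(no_parens)   # American French
print(first_only)  # American) (French)
print(original)    # (American) (French), the original is unchanged
```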

#### Cleaning the _Nationality_ and _Gender_ Columns

In [3]:
for row in moma_data:
    nationality = row[2] # Index 2 for Nationality
    nationality = nationality.replace('(', '')
    nationality = nationality.replace(')', '')
    row[2] = nationality
    
    gender = row[5] # Index 5 for Gender
    gender = gender.replace('(', '')
    gender = gender.replace(')', '')
    row[5] = gender

#### String Capitalization

The Gender column in our data set contains four unique values:

- "" (an empty string)
- "Male"
- "Female"
- "male"

There are a few ways we could handle this using what we know so far:

1. We could use str.replace() to replace m with M, but then we'd end up with instances of FeMale.
2. We could use str.replace() to replace male with Male. This would also give us instances of FeMale.

We can use a Python string method designed specifically for handling capitalization: the __str.title()__ method. The str.title() method returns a copy of the string with the first letter of each word transformed to uppercase (also known as __title__ case).
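
As a quick illustration, here is a small sketch that applies both approaches to the four Gender values listed above. The variable names are only for this example.

```python
values = ["", "Male", "Female", "male"]

# Replacing "male" with "Male" also hits the end of "Female"
replaced = [v.replace("male", "Male") for v in values]

# str.title() capitalizes the first letter of each word and lowercases the rest
titled = [v.title() for v in values]

print(replaced)  # ['', 'Male', 'FeMale', 'Male']
print(titled)    # ['', 'Male', 'Female', 'Male']
```

Note that the empty string stays empty, which is why it needs separate handling.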

We have a number of rows containing an empty string ("") for the Gender column. This could mean:

- That the person entering the data didn't know the gender of the artist.
- That the artist is unknown and so the gender is also unknown.
- That the artist's gender is non-binary.

When we correct the capitalization, we'll also take the opportunity to label these with the string "Gender Unknown/Other". We'll treat the Nationality column the same way, labeling empty values "Nationality Unknown".

In [4]:
for row in moma_data:
    # Cleaning the Gender column
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = "Gender Unknown/Other"
    
    row[5] = gender
    
    # Cleaning the Nationality column
    nationality = row[2]
    nationality = nationality.title()
    if not nationality:
        nationality = "Nationality Unknown"
    row[2] = nationality
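
One way to check the result of a cleaning pass like this is to count the unique values that remain in each column. The sketch below does so on a few made-up rows that only fill in the Nationality (index 2) and Gender (index 5) positions. It uses `or` as a shorthand for the `if not ...` checks above, and `collections.Counter` to tally the values.

```python
from collections import Counter

# Made-up rows; only index 2 (Nationality) and index 5 (Gender) matter here
sample_rows = [
    ["", "", "American", "", "", "Male"],
    ["", "", "", "", "", "male"],
    ["", "", "French", "", "", "Female"],
    ["", "", "american", "", "", ""],
]

for row in sample_rows:
    # An empty string is falsy, so `or` supplies the fallback label
    row[5] = row[5].title() or "Gender Unknown/Other"
    row[2] = row[2].title() or "Nationality Unknown"

gender_counts = Counter(row[5] for row in sample_rows)
nationality_counts = Counter(row[2] for row in sample_rows)
print(gender_counts)
print(nationality_counts)
```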

#### Errors during data cleaning

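To clean up the BeginDate and EndDate columns, let's write a function that removes the parentheses and converts the string to an integer for later use. Empty strings are returned unchanged, because calling int() on an empty string raises a ValueError.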

In [5]:
def clean_and_convert(date):
    # check that we don't have any empty strings
    if date != '':
        date = date.replace('(', '')
        date = date.replace(')', '')
        date = int(date)
    return date

In [6]:
for row in moma_data:
    BeginDate = clean_and_convert(row[3]) # Index 3 for BeginDate
    EndDate = clean_and_convert(row[4])   # Index 4 for EndDate
    row[3], row[4] = BeginDate, EndDate

In [9]:
moma_data[254][3]

1944
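
To see why the empty-string check matters, the following self-contained sketch restates the function in the form described above (stripping both parentheses) and tries it on two made-up values. It also shows the error that int() raises on an empty string.

```python
def clean_and_convert(date):
    # Only convert non-empty strings; int("") would raise a ValueError
    if date != "":
        date = date.replace("(", "").replace(")", "")
        date = int(date)
    return date

# Without the guard, an empty string cannot be converted
try:
    int("")
except ValueError as error:
    print("int('') fails:", error)

print(clean_and_convert("(1944)"))  # 1944
print(repr(clean_and_convert("")))  # ''
```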

If we combine the BeginDate column, which represents the artist's year of birth, with the Date column, which represents the year in which the piece of art was created, we can calculate the age at which the artist produced the work.
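
The calculation itself is a simple subtraction, sketched below. The helper name `age_at_creation` and the sample years are hypothetical. The check for integers is an assumption: it covers the case where a missing year is still an empty string after cleaning.

```python
def age_at_creation(birth_year, creation_year):
    """Hypothetical helper: return the artist's age, or None if a year is missing."""
    # Missing years may still be empty strings rather than integers
    if not isinstance(birth_year, int) or not isinstance(creation_year, int):
        return None
    return creation_year - birth_year

print(age_at_creation(1944, 1976))  # 32
print(age_at_creation("", 1976))    # None
```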

The Date column contains data in many different formats:

- Some years are in parentheses.
- Some years have c. or C. before them, indicating that the year is approximate.
- Some have year ranges, indicated with a dash.
- Some have 's to indicate a decade.

In [10]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]
stripped_test_data = []

def strip_characters(date):
    for char in bad_chars:
        date = date.replace(char,"")
    return date

for entry in test_data:
    date = strip_characters(entry)
    stripped_test_data.append(date)

In our test data, we successfully removed the bad characters, and we are left with two types of dates:

- Some are a single year, e.g. 1912.
- Some are ranges of years, e.g. 1913-1923.

When we encounter data like this, we need to decide how to proceed. One option might be to discard all approximate years so we know that our calculations are exact. Because we're calculating an artist's age, however, approximate data is fine: the difference between an artist being 42 and 44 years old, for instance, is a finer distinction than we need.

In [11]:
stripped_test_data

['1912',
 '1929',
 '1913-1923',
 '1951',
 '1994',
 '1934',
 '1915',
 '1995',
 '1912',
 '1988',
 '2002',
 '1957-1959',
 '1955',
 '1970',
 '1990-1999']
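
As an aside, the same character stripping can be done in a single pass with `str.translate()` instead of a loop of `replace()` calls. This is an alternative sketch, not the approach used in this notebook; the name `strip_characters_translate` is made up for the example.

```python
# Same characters as bad_chars, collected into one string
BAD_CHARS = "()cC.s' "

# The third argument to str.maketrans lists the characters to delete
delete_table = str.maketrans("", "", BAD_CHARS)

def strip_characters_translate(date):
    # Remove every character listed in BAD_CHARS in one pass
    return date.translate(delete_table)

samples = ["(1951)", "c. 1970's", "C. 1990-1999"]
stripped = [strip_characters_translate(d) for d in samples]
print(stripped)  # ['1951', '1970', '1990-1999']
```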

Here are the ways we'll treat the various cases:

- Where there is a single year, we'll keep it.
- Where there is a year range, we'll average the two years.

We want to write code that does the following for each value:

- Checks if the dash character (-) is in the string so we know if it's a range or not.
- If the __date is a range__:
    - Splits the string into two strings, before and after the dash character.
    - Converts the two numbers to the integer type and then averages them by adding them together and dividing by two.
    - Uses the round() function to round the average to a whole year, so values like 1964.5 become 1964 (Python rounds halves to the nearest even integer).
- If the __date isn't a range__:
    - Converts the value to an integer type.

In [12]:
processed_test_date = []

def process_date(date):
    if '-' in date:
        date = date.split('-')
        date = round((int(date[0]) + int(date[1])) / 2)
    elif not date:
        date = 0
    else:
        date = int(date)
    return date

for entry in stripped_test_data:
    date = process_date(entry)
    processed_test_date.append(date)

In [13]:
processed_test_date

[1912,
 1929,
 1918,
 1951,
 1994,
 1934,
 1915,
 1995,
 1912,
 1988,
 2002,
 1958,
 1955,
 1970,
 1994]
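
The last value, 1994, comes from averaging 1990 and 1999 to 1994.5. Python's built-in `round()` sends an exact .5 to the nearest even integer, so the result can go down or up depending on the years. The sketch below shows this, along with a hypothetical alternative that always rounds halves up, in case that behavior is preferred. Both function names are made up for this example.

```python
def average_year(first, second):
    # Built-in round() sends exact .5 values to the nearest even integer
    return round((first + second) / 2)

def average_year_half_up(first, second):
    # Hypothetical alternative: always round .5 up (fine for positive years)
    return int((first + second) / 2 + 0.5)

print(average_year(1990, 1999))          # 1994 (1994.5 rounds to even)
print(average_year(1991, 2000))          # 1996 (1995.5 rounds to even)
print(average_year_half_up(1990, 1999))  # 1995
```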

Let's test the functions on the actual data. The Date column is at index 8 in our data set. We won't run the code on the original data set yet, since it is very noisy. I am currently cleaning it and will add the new code accordingly.

__Note:__ For now, let's focus on the lessons from Dataquest and just have the code written. Since this is my own copy, I will post the code that works after testing it on their console.

In [None]:
# for row in moma_data:
#     date = row[8] # Index 8 for Date
#     strip_date = strip_characters(date)
#     clean_date = process_date(strip_date)
#     row[8] = clean_date

    
# moma_data[7][8]
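
Until the noisy values are dealt with, one possible way to run the functions without crashing is to catch the ValueError that int() raises and keep the raw value. The sketch below is only an illustration under that assumption: `safe_process_date` is a hypothetical wrapper, the sample values are made up, and the two helpers are restated so the block runs on its own.

```python
BAD_CHARS = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(date):
    for char in BAD_CHARS:
        date = date.replace(char, "")
    return date

def process_date(date):
    if "-" in date:
        parts = date.split("-")
        return round((int(parts[0]) + int(parts[1])) / 2)
    if not date:
        return 0
    return int(date)

def safe_process_date(raw):
    """Hypothetical wrapper: return the cleaned year, or the raw value if parsing fails."""
    try:
        return process_date(strip_characters(raw))
    except ValueError:
        # int() could not parse what was left, so keep the original string
        return raw

# Made-up sample values, including two that cannot be parsed
sample_dates = ["(1951)", "c. 1915", "n.d.", "1913-1923", "Unknown"]
cleaned = [safe_process_date(d) for d in sample_dates]
print(cleaned)  # [1951, 1915, 'n.d.', 1918, 'Unknown']
```

Keeping the raw value makes the unparsed entries easy to find and inspect later.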

In [15]:
import csv

output_file = "data/artworks_clean.csv"
with open(output_file, "w", newline='') as result:
    writer = csv.writer(result)
    writer.writerow(moma_header)
    for row in moma_data:
        writer.writerow(row)
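
One thing to keep in mind when writing cleaned data out: `csv.writer` stores every value as text, so integers such as the converted years come back as strings when the file is read again. The sketch below demonstrates this with an in-memory buffer and made-up rows, so it does not touch any files.

```python
import csv
import io

header = ["Title", "BeginDate"]  # made-up subset of columns
rows = [["Example Work", 1944], ["Another Work", ""]]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Read the same text back; every value is now a string
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1])  # ['Example Work', '1944']
```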
    