## In this notebook, we will convert dates stored as strings in various formats to a single format with the integer data type.

In [1]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

This test_data list contains data in many different formats:

* Some years are in parentheses.
* Some years have c. or C. before them, indicating that the year is approximate.
* Some have year ranges, indicated with a dash.
* Some have 's to indicate a decade.

To complete our task, there are many characters we'll need to remove. One option is to write a separate str.replace() call for each character. A more maintainable option is to write a function that iterates over a list of "bad" characters, so our code is shorter and easier to understand.

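**bad_chars**: This is the list of characters we need to remove. Keeping them in one list lets you spend your time thinking about the structure of your code, rather than checking that every character has been handled. Removing the letters c, C, and s is safe here only because the years themselves contain no letters.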

In [2]:
bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

Next, we create a function called strip_characters(), which accepts a string argument and does the following:
* Iterates over the bad_chars list, using str.replace() to remove each character
* Returns the cleaned string



In [3]:
def strip_characters(string):
    # Remove every character listed in bad_chars from the string
    for char in bad_chars:
        string = string.replace(char, "")
    return string

# Clean each raw value and collect the results
stripped_test_data = []
for value in test_data:
    stripped_test_data.append(strip_characters(value))

print(stripped_test_data)

['1912', '1929', '1913-1923', '1951', '1994', '1934', '1915', '1995', '1912', '1988', '2002', '1957-1959', '1955', '1970', '1990-1999']
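
As a possible alternative to calling str.replace() once per character, the same removal can be sketched with str.maketrans() and str.translate(), which delete all listed characters in a single pass over each string. The names removal_table and strip_characters_translate are hypothetical and are not part of the notebook above.

```python
# Same bad characters as the notebook's bad_chars list
bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

# Three-argument maketrans: every character in the third string maps to None,
# which tells translate() to delete it
removal_table = str.maketrans("", "", "".join(bad_chars))

def strip_characters_translate(string):
    # Hypothetical alternative to strip_characters()
    return string.translate(removal_table)

sample = ["(1951)", "c. 1970's", "C. 1990-1999"]
print([strip_characters_translate(value) for value in sample])
```

For a list this small, either approach works; the loop over bad_chars is arguably easier to read, while the translation table avoids rescanning the string for each character.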


There are two different scenarios that we need to prepare for when converting these to integers:

* Some are a single year (e.g., 1912).
* Some are ranges of years (e.g., 1913-1923).

We want to write code that does the following for each value:

* Checks if the dash character (-) is in the string so we know if it's a range or not.
* If the date is a range:
  1. Splits the string into two strings, before and after the dash character
  2. Converts the two numbers to the integer type and then averages them by adding them together and dividing by two
  3. Uses the round() function to round the average to an integer. Python 3 rounds halves to the nearest even integer, so 1964.5 becomes 1964
* If the date isn't a range:
  1. Converts the value to an integer type

In [4]:
def process_date(string):
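    if '-' in string:
        # Range: split into start and end years, then average them
        parts = string.split('-')
        date_1 = int(parts[0])
        date_2 = int(parts[1])
        avg_date = (date_1 + date_2) / 2
        # round() with one argument returns an int
        avg_date = round(avg_date)
    else:
        # Single year: convert directly
        avg_date = int(string)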
    return avg_date
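
The rounding step in process_date() depends on how Python 3's built-in round() treats values that end in .5: it rounds to the nearest even integer rather than always rounding up. The short sketch below compares that behavior with a hypothetical round_half_up() helper, which is not part of the notebook and is shown only for contrast.

```python
import math

def round_half_up(value):
    # Hypothetical helper: rounds .5 upward (intended for non-negative values)
    return math.floor(value + 0.5)

# Each row prints: the average, round(), and round_half_up()
for average in (1964.5, 1965.5, 1994.5):
    print(average, round(average), round_half_up(average))
```

With round(), a range such as 1990-1999 (average 1994.5) resolves to the even year 1994, whereas a half-up rule would give 1995. Either choice is defensible for approximate dates, as long as it is applied consistently.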

Now let's use the function to obtain the processed test data.

In [5]:
processed_test_data = []

for data in stripped_test_data:
    processed = process_date(data)
    processed_test_data.append(processed)
    
print(processed_test_data)

[1912, 1929, 1918, 1951, 1994, 1934, 1915, 1995, 1912, 1988, 2002, 1958, 1955, 1970, 1994]
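
The two steps above (strip unwanted characters, then convert or average) could also be sketched as a single pass with a regular expression that pulls out every four-digit number. This is an alternative approach, not the notebook's method; year_from_text is a hypothetical name, and the sketch assumes every year has exactly four digits.

```python
import re

def year_from_text(text):
    # Hypothetical helper: collect every run of four digits as a year
    years = [int(match) for match in re.findall(r"\d{4}", text)]
    if not years:
        raise ValueError(f"no four-digit year found in {text!r}")
    # A single year averages to itself; a range averages its endpoints
    return round(sum(years) / len(years))

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

print([year_from_text(value) for value in test_data])
```

Because this version extracts what it wants instead of listing what to remove, it does not need a bad_chars list, but it would silently misread values whose years are not four digits long.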
