## Data cleaning project

This mini project focuses on data cleaning. The dataset used can be found [here](https://www.kaggle.com/momanyc/museum-collection). As data scientists, we rarely encounter data in a consistent format, so we often need to prepare it for analysis, a step called **data cleaning**. (This project is a work in progress.)

Description of the dataset:

`Title`: The title of the artwork.

`Artist`: The name of the artist who created the artwork.

`Nationality`: The nationality of the artist.

`BeginDate`: The year in which the artist was born.

`EndDate`: The year in which the artist died.

`Gender`: The gender of the artist.

`Date`: The date that the artwork was created.

`Department`: The department inside MoMA to which the artwork belongs.

Helper Functions:

`parenthesis()`: Removes `(` and `)` from every value in a given column.

`frequency_table()`: Builds a dictionary that maps each value in a given column to its frequency.


In [180]:
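from csv import reader

# Read the whole CSV into a list of rows; the first row is the header.
# newline='' is the setting the csv module documentation recommends for file objects.
with open('artworks.csv', encoding='utf8', newline='') as opened_file:
    moma = list(reader(opened_file))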


In [181]:
# Number of rows in the dataset (this count includes the header row)
print("Number of records : ",len(moma))

# Understanding the data
print("Header of the dataset", moma[0], "\n")
print(moma[1],"\n")

Number of records :  16726
Header of the dataset ['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department'] 

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)', '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books'] 



Looking at the dataset, we can see that the values in **Nationality**, **BeginDate**, **EndDate** and **Gender** are wrapped in parentheses. Let's clean the columns whose strings are wrapped in `(` and `)`.

In [182]:
# Function to remove parentheses from one column of every data row
def parenthesis(column_index):
    for row in moma[1:]:  # skip the header row
        value = row[column_index]
        value = value.replace("(", "")
        value = value.replace(")", "")
        row[column_index] = value
    

In [183]:
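# parenthesis() modifies moma in place and returns None,
# so there is nothing useful to assign.
for column_index in (2, 3, 4, 5):  # Nationality, BeginDate, EndDate, Gender
    parenthesis(column_index)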

for data in moma[:5]:
    print(data, "\n")

['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department'] 

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', '1947', '2013', 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', '1916', '2007', 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '1870', '1943', 'Male', '1889-1911', 'Prints & Illustrated Books'] 

['Headpiece (page 129) from LIVRET DE FOLASTRIES, Ã€ JANOT PARISIEN', 'Aristide Maillol', 'French', '1861', '1944', 'Male', '1927-1940', 'Prints & Illustrated Books'] 
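
The cleanup above can also be sketched without relying on the global `moma` list. The helper below, `strip_parentheses`, is a hypothetical name that does not appear in the notebook. It takes the rows as an argument and uses `str.strip("()")`, which removes parentheses only at the two ends of a value, whereas `replace` removes them anywhere in the string. The sample row is copied from the output shown earlier.

```python
def strip_parentheses(rows, column_index):
    """Strip leading/trailing parentheses from one column, in place."""
    for row in rows:
        row[column_index] = row[column_index].strip("()")


# One data row copied from the raw output above (header omitted).
sample = [
    ["Dress MacLeod from Tartan Sets", "Sarah Charlesworth",
     "(American)", "(1947)", "(2013)", "(Female)", "1986",
     "Prints & Illustrated Books"],
]

# Nationality, BeginDate, EndDate, Gender
for column_index in (2, 3, 4, 5):
    strip_parentheses(sample, column_index)

print(sample[0][2:6])  # ['American', '1947', '2013', 'Female']
```

Stripping only at the ends would matter if the helper were ever applied to a column such as `Title`, where parentheses inside the text are part of the value.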



In [184]:
# Function to create a frequency table for one column
def frequency_table(column_index):
    counts = {}
    for row in moma[1:]:  # skip the header row
        value = row[column_index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
    return counts
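
As an alternative sketch, the standard library's `collections.Counter` does the same counting in one expression. The function name `frequency_table_counter` and the three sample rows are made up for illustration; the rows only mimic the first six columns of the dataset.

```python
from collections import Counter


def frequency_table_counter(rows, column_index):
    """Count how often each value appears in one column."""
    return Counter(row[column_index] for row in rows)


# Hypothetical rows with the same column order as the dataset.
sample_rows = [
    ["t1", "a1", "American", "1947", "2013", "Female"],
    ["t2", "a2", "Spanish", "1916", "2007", "Male"],
    ["t3", "a3", "French", "1870", "1943", "Male"],
]

gender_counts = frequency_table_counter(sample_rows, 5)
print(gender_counts.most_common())  # [('Male', 2), ('Female', 1)]
```

A `Counter` behaves like a dictionary, and `most_common()` returns the values sorted by frequency, which is handy for long tables such as the nationality counts.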



Understanding the distribution of gender and nationality in the dataset

In [185]:
gender_counts = frequency_table(5)
print(gender_counts)

print(frequency_table(2))

{'Female': 2443, 'Male': 13490, '': 791, 'male': 1}
{'American': 7444, 'Spanish': 343, 'French': 3042, 'South African': 45, 'Canadian': 113, 'Czech': 115, 'Belgian': 190, 'Russian': 274, 'British': 748, 'German': 1141, '': 491, 'Swiss': 270, 'Polish': 79, 'Japanese': 299, 'Portuguese': 27, 'Austrian': 100, 'Australian': 46, 'Italian': 405, 'Chilean': 77, 'Colombian': 107, 'Mexican': 169, 'Brazilian': 102, 'Dutch': 203, 'Romanian': 10, 'Venezuelan': 57, 'Korean': 17, 'Israeli': 52, 'Argentine': 82, 'Indian': 34, 'Nationality unknown': 56, 'Swedish': 32, 'Yugoslav': 15, 'Cuban': 36, 'Nationality Unknown': 80, 'Various': 70, 'Luxembourgish': 5, 'Croatian': 27, 'Bulgarian': 2, 'Hungarian': 24, 'Georgian': 6, 'Puerto Rican': 1, 'Danish': 67, 'Serbian': 1, 'Pakistani': 5, 'Ecuadorian': 3, 'Chinese': 26, 'Iranian': 4, 'Finnish': 32, 'Lebanese': 1, 'Thai': 5, 'Cambodian': 1, 'Scottish': 16, 'Kenyan': 1, 'Latvian': 5, 'Sudanese': 3, 'Uruguayan': 8, 'Peruvian': 15, 'New Zealander': 3, 'Moroccan'

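The frequency tables above show that **Gender** has missing values and that one entry is written in lowercase (`male`), which makes it count as a separate category. **Nationality** has the same two problems: blank values, plus `Nationality unknown` and `Nationality Unknown`, which differ only in case. We fix this by converting each value to title case with `.title()` and replacing blank values with **Gender Unknown/Other** or **Nationality Unknown**.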

In [186]:
for row in moma[1:]:
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = "Gender Unknown/Other"
    row[5] = gender

    nationality = row[2]
    nationality = nationality.title()
    if not nationality:
        nationality = "Nationality Unknown"
    row[2] = nationality
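
The two near-identical blocks in the loop above could be folded into one small helper. The sketch below uses a hypothetical function, `normalize_label`, that is not part of the notebook. It also strips surrounding whitespace, which is an extra assumption; the original code does not do this.

```python
def normalize_label(value, default):
    """Title-case a label; fall back to `default` when it is blank."""
    value = value.strip().title()  # .strip() is an added assumption
    return value if value else default


raw_genders = ["Female", "male", "", "Male"]
cleaned = [normalize_label(g, "Gender Unknown/Other") for g in raw_genders]
print(cleaned)  # ['Female', 'Male', 'Gender Unknown/Other', 'Male']
```

Inside the loop, this would be called once for `row[5]` and once for `row[2]` with the matching default label.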
    

In [187]:
print(frequency_table(2))

{'American': 7444, 'Spanish': 343, 'French': 3042, 'South African': 45, 'Canadian': 113, 'Czech': 115, 'Belgian': 190, 'Russian': 274, 'British': 748, 'German': 1141, 'Nationality Unknown': 627, 'Swiss': 270, 'Polish': 79, 'Japanese': 299, 'Portuguese': 27, 'Austrian': 100, 'Australian': 46, 'Italian': 405, 'Chilean': 77, 'Colombian': 107, 'Mexican': 169, 'Brazilian': 102, 'Dutch': 203, 'Romanian': 10, 'Venezuelan': 57, 'Korean': 17, 'Israeli': 52, 'Argentine': 82, 'Indian': 34, 'Swedish': 32, 'Yugoslav': 15, 'Cuban': 36, 'Various': 70, 'Luxembourgish': 5, 'Croatian': 27, 'Bulgarian': 2, 'Hungarian': 24, 'Georgian': 6, 'Puerto Rican': 1, 'Danish': 67, 'Serbian': 1, 'Pakistani': 5, 'Ecuadorian': 3, 'Chinese': 26, 'Iranian': 4, 'Finnish': 32, 'Lebanese': 1, 'Thai': 5, 'Cambodian': 1, 'Scottish': 16, 'Kenyan': 1, 'Latvian': 5, 'Sudanese': 3, 'Uruguayan': 8, 'Peruvian': 15, 'New Zealander': 3, 'Moroccan': 2, 'Guatemalan': 11, 'Cameroonian': 3, 'Egyptian': 5, 'Nigerian': 2, 'Icelandic': 2, 

In [188]:
print(frequency_table(5))

{'Female': 2443, 'Male': 13491, 'Gender Unknown/Other': 791}
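
A quick sanity check is to confirm that normalizing the labels only moved rows between categories and did not drop any. The sketch below hard-codes the gender counts printed before and after the cleanup; the variable names are chosen for illustration.

```python
# Counts copied from the frequency tables printed above.
gender_before = {"Female": 2443, "Male": 13490, "": 791, "male": 1}
gender_after = {"Female": 2443, "Male": 13491, "Gender Unknown/Other": 791}

total_before = sum(gender_before.values())
total_after = sum(gender_after.values())

# Both totals should match the number of data rows (header excluded).
assert total_before == total_after
print(total_before)  # 16725
```

The total of 16725 is one less than the 16726 reported by `len(moma)`, because that figure includes the header row.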


In [189]:
print(moma[1])

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', '1947', '2013', 'Female', '1986', 'Prints & Illustrated Books']


In [190]:
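# Convert a year string to an integer; blank values are left as empty strings
def clean_and_convert(date):
    if date != "":
        date = int(date)
    return date

# Convert BeginDate and EndDate of every data row to integers
for row in moma[1:]:
    birth_date = row[3]
    death_date = row[4]

    birth_date = clean_and_convert(birth_date)
    death_date = clean_and_convert(death_date)

    row[3] = birth_date
    row[4] = death_date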

In [191]:
print(moma[1])

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books']
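
`clean_and_convert` leaves blank values as empty strings, so the column ends up holding a mix of `int` and `str`, and `int()` would raise a `ValueError` on any value that is not a plain number. A more defensive variant is sketched below. The name `to_year` and the choice of `None` for unparseable values are assumptions, not something the notebook does.

```python
def to_year(text):
    """Return the year as an int, or None when the text is not a plain integer."""
    try:
        return int(text)
    except ValueError:
        return None


print([to_year(value) for value in ["1947", "2013", "", "c. 1925"]])
# [1947, 2013, None, None]
```

Using `None` for missing years keeps the column free of strings, at the cost of having to skip `None` explicitly in later arithmetic.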


In [193]:
print(frequency_table(6))

{'1986': 99, '1978': 112, '1889-1911': 14, '1927-1940': 4, '1903': 30, '1957': 94, '1924': 134, '1978-1983': 1, '2001': 147, '1941': 72, '1949-1950': 33, '1963': 205, '1908-1911': 16, '1934': 126, '1997': 95, '1931-1933': 11, '1972': 180, '1967': 297, '1923-1924': 15, '1979': 125, '1925-1927': 33, '1929': 99, '1974': 155, '1920-1930': 30, '1915': 45, '1912': 61, '1988-1990': 7, '1925': 118, 'c. 1925': 20, '1980': 141, '1964': 212, '1968': 229, '1969': 183, '1953': 91, '1971': 184, '1988': 127, '1818': 49, '2002': 130, '1926': 113, '1914': 53, '1966': 222, '1904': 24, '1949': 95, '1981': 114, '1970': 211, '1923-1925': 8, '1932': 90, '1928': 168, '1965': 249, '2003': 207, '1971-1974': 9, '1983': 114, '1930': 166, '1946': 82, '1984': 151, '2006': 88, 'c. 1885': 1, '1942': 91, '1913': 97, '2015': 31, '1987': 108, '1947-1949': 15, '1961': 143, '1916-1932': 22, '1962': 180, '1944': 54, '1991': 129, '1991-1994': 33, '2008': 93, '1975': 114, '1893': 40, '1968-1969': 16, '1964-1968': 9, '1928-1
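
The `Date` values above come in at least three shapes: a single year (`1986`), an approximate year (`c. 1925`) and a range (`1889-1911`). The notebook does not show how this column is cleaned, so the following is only one possible sketch. The function name `parse_artwork_year` and the decision to reduce a range to its midpoint (rounded down) are assumptions.

```python
def parse_artwork_year(date_text):
    """Return one int year for 'YYYY', 'c. YYYY' or 'YYYY-YYYY', else None."""
    text = date_text.replace("c.", "").strip()
    parts = [part.strip() for part in text.split("-")]
    if not all(part.isdecimal() for part in parts):
        return None  # blank or unrecognized format
    years = [int(part) for part in parts]
    if len(years) == 1:
        return years[0]
    if len(years) == 2:
        # Assumption: represent a range by its midpoint, rounded down.
        return (years[0] + years[1]) // 2
    return None


for raw in ["1986", "c. 1925", "1889-1911", ""]:
    print(repr(raw), "->", parse_artwork_year(raw))
```

For the sample values this yields 1986, 1925, 1900 and `None`. Other formats that may exist in the full column would also fall through to `None` and need separate handling.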