# Getting rid of punctuation and accents in the Catalogue Raisonne Dataset

## 1. Import the required libraries

In [None]:
import pandas as pd
import numpy as np
from unidecode import unidecode

If you do not have `unidecode` installed, run the following command to install it. On a Mac you might have to use `pip3` instead of `pip` for the command to work.

In [None]:
!pip install unidecode

## 2. Read the data and clean it

Read the dataset and assign it to the variable `df` (short for dataframe).

In [None]:
# The Excel spreadsheet provided had the data in "Sheet2"; update sheet_name as necessary
df = pd.read_excel('Your/Path/catalogue_raisonne_data.xlsx', sheet_name="Sheet2")

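The following lines of code use `.apply` with a `lambda` function to clean each pandas Series. They follow the steps below.
1. Take the pandas Series we would like to modify.
2. Convert each element to a string (missing values are `np.nan`, which is a float, so `str` turns them into the text "nan").
3. Use `unidecode` to replace accented characters and non-ASCII punctuation in each element with their closest ASCII equivalents.
4. Convert the elements that became the string "nan" back to `np.nan`.
5. Store the result in a new dataframe column named "new_...".

Note that step 4 is not applied to the `artist` column, so a missing artist would remain as the string "nan".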

In [None]:
df['new_artist'] = df['artist'].apply(lambda x: unidecode(str(x)))
df['new_author'] = df['author'].apply(lambda x: unidecode(str(x))).apply(lambda x: np.nan if x == 'nan' else x)
df['new_author_s'] = df['author_s'].apply(lambda x: unidecode(str(x))).apply(lambda x: np.nan if x == 'nan' else x)
df['new_imprint'] = df['imprint'].apply(lambda x: unidecode(str(x))).apply(lambda x: np.nan if x == 'nan' else x)
df['new_public_note'] = df['public_note'].apply(lambda x: unidecode(str(x))).apply(lambda x: np.nan if x == 'nan' else x)
df.head()  # observe your data to make sure the columns were successfully created
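
The steps listed above can be tried on a small made-up Series before running them on the full dataset. This is a minimal sketch and not part of the original notebook: the helper name `clean_text` and the sample values are hypothetical. It checks for missing values before converting to a string, so `np.nan` never becomes the text "nan" in the first place.

```python
import numpy as np
import pandas as pd
from unidecode import unidecode


def clean_text(value):
    """Transliterate one cell to ASCII, leaving missing values as np.nan."""
    # hypothetical helper, not from the original notebook
    if pd.isna(value):
        return np.nan
    return unidecode(str(value))


# Made-up sample values, for illustration only
sample = pd.Series(["José Martí", "Müller", np.nan, "“Quoted” title"])
cleaned = sample.apply(clean_text)
print(cleaned.tolist())
```

If this behaves as expected, the same helper could be applied to a real column, for example `df['new_author'] = df['author'].apply(clean_text)`.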

## 3. Save the data

In [None]:
df.to_csv('new_catalogue_raisonne.csv', index=False, encoding='utf-8')

Read the file back to make sure it was saved successfully.

In [None]:
pd.read_csv('new_catalogue_raisonne.csv')
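
Beyond looking at the output, one could also check that the cleaned columns contain only ASCII text after the round trip through CSV. The following is a minimal sketch, assuming the `new_` column prefix used in section 2. The helper `count_non_ascii` and the sample frame are hypothetical, and an in-memory buffer stands in for the CSV file.

```python
import io

import numpy as np
import pandas as pd


def count_non_ascii(frame, prefix="new_"):
    """Count non-ASCII values in every column whose name starts with prefix."""
    # hypothetical helper, not from the original notebook
    counts = {}
    for col in frame.columns:
        if col.startswith(prefix):
            # Missing values are skipped; everything else is checked as text
            counts[col] = sum(not str(v).isascii() for v in frame[col].dropna())
    return counts


# Made-up frame standing in for the saved dataset
demo = pd.DataFrame({"artist": ["José"], "new_artist": ["Jose"], "new_author": [np.nan]})
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(count_non_ascii(reloaded))
```

A count of zero for every `new_` column suggests the cleaning survived the save. With the real file, `reloaded` would be the result of `pd.read_csv('new_catalogue_raisonne.csv')`.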