# Data Cleaning

We located a dataset in the Harvard Dataverse Repository containing data on the characteristics of individual Nobel Prize laureates. We downloaded the initial .csv file and then followed the process below to clean the data and turn it into a dataframe suitable for our research:

**Step 1:** Imported libraries for our data analysis.

In [1]:
import pandas as pd 
import numpy as np
import seaborn 
from matplotlib import pyplot
from datetime import datetime, date

**Step 2:** Loaded the CSV file, laureate.csv, into a pandas dataframe called nobel_data_raw and displayed its first rows with .head() to check that laureate.csv was loaded correctly.

In [None]:
nobel_data_raw = pd.read_csv("laureate.csv")
nobel_data_raw.head()

**Step 3:** Printed out the number of rows and columns in nobel_data_raw to make sure that they match the number of rows and columns of laureate.csv from the original source.

In [None]:
nobel_data_raw.shape
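
One way to make this comparison explicit is to count the rows and columns of the raw file independently and compare them with .shape. The following is only a sketch: it uses a small made-up CSV string in place of laureate.csv, and the names raw_text, expected_shape, and frame are ours for illustration.

```python
import csv
import io

import pandas as pd

# Stand-in for the contents of laureate.csv (hypothetical rows and columns).
raw_text = "id,firstname,year\n1,Marie,1903\n2,Albert,1921\n"

# Count rows and columns without pandas; the first row is the header.
rows = list(csv.reader(io.StringIO(raw_text)))
expected_shape = (len(rows) - 1, len(rows[0]))

# Load the same text with pandas and compare.
frame = pd.read_csv(io.StringIO(raw_text))
print(frame.shape == expected_shape)
```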

**Step 4:** Printed out the type of each column so that we know whether any columns need to be converted to another type for our upcoming data analysis.

In [None]:
nobel_data_raw.dtypes

**Step 5:** We converted the values in the 'born' and 'died' columns to datetime. Values that could not be parsed as dates, such as "0000-00-00", were coerced to NaT (the pandas marker for a missing datetime) by passing errors = 'coerce'. We also printed out the head of the dataframe to check that the columns were displaying datetime and NaT values properly.

In [None]:
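# Plain assignment does not copy: nobel_data and nobel_data_raw refer to the
# same DataFrame, so the changes below are visible under both names.
nobel_data = nobel_data_raw
# Parse the date strings; values that do not match the format become NaT.
nobel_data['born'] = pd.to_datetime(nobel_data_raw['born'], format = '%Y-%m-%d', errors = 'coerce')
nobel_data['died'] = pd.to_datetime(nobel_data_raw['died'], format = '%Y-%m-%d', errors = 'coerce')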
nobel_data.head()
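
To illustrate what errors = 'coerce' does, here is a minimal sketch on a small made-up sample rather than on laureate.csv. The names sample and parsed are ours for illustration, and the explicit .copy() is an optional variation that keeps the original strings untouched.

```python
import pandas as pd

# Hypothetical sample values; the real values come from laureate.csv.
sample = pd.DataFrame({"born": ["1879-03-14", "0000-00-00", "1867-11-07"]})

# Work on an explicit copy so the original strings are left as they were.
parsed = sample.copy()
parsed["born"] = pd.to_datetime(parsed["born"], format="%Y-%m-%d", errors="coerce")

# Count how many values could not be parsed and were coerced to NaT.
print(parsed["born"].isna().sum())
print(parsed.dtypes)
```

Counting the NaT values after the conversion is a quick way to see how many dates were unusable.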

**Step 6:** We found the age in years of each laureate at the time of their win by subtracting the birth year, extracted from the datetime values in the 'born' column, from the 'year' column. We stored the result in a new column called 'age' because it will help us answer a few of our research questions. The values in the 'age' column were rounded down to whole years using np.floor. Because only the years are compared, the result can be one year higher than the laureate's actual age if the award came before their birthday that year. We also used .fillna() to fill missing 'age' values with 0 so that we can create models in the future without any errors involving NaN. Lastly, we used .head() to check that our dataframe includes the new column.

In [None]:
nobel_data['age'] = nobel_data['year'] - nobel_data['born'].dt.year
nobel_data['age'] = nobel_data['age'].apply(np.floor)
nobel_data['age'] = nobel_data['age'].fillna(0)
nobel_data['age'] = nobel_data['age'].astype(int)
nobel_data.head()
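
The sketch below walks through the same calculation on a small made-up frame, so the effect of a missing birth date is easy to see. The frame toy and the column age_nullable are ours for illustration. The nullable Int64 variant is an alternative we did not use in this notebook; it keeps unknown ages as missing instead of storing 0.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the ones used above.
toy = pd.DataFrame({
    "year": [1921, 1917, 1903],
    "born": pd.to_datetime(["1879-03-14", None, "1867-11-07"]),
})

# Award year minus birth year; a missing birth date (NaT) yields NaN.
age_float = toy["year"] - toy["born"].dt.year

# Approach used in this notebook: 0 stands in for "unknown".
toy["age"] = age_float.fillna(0).astype(int)

# Alternative (not used in this notebook): keep the gap as <NA>.
toy["age_nullable"] = age_float.astype("Int64")

print(toy)
```

With the 0 placeholder, later steps have to filter those rows out before modelling; with the nullable type, methods such as .mean() skip the missing values by default.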

**Step 7:** During this early analysis, we discovered 6 rows in our dataframe that were empty of all data (except for gender: male) and did not correspond to any potential collection gaps that we could ascertain. As such, we concluded that these rows were erroneously included, and dropped them from the dataframe.

In [None]:
nobel_data = nobel_data.drop([932, 933, 934, 935, 936, 937])
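
Rows like these can also be located programmatically instead of by index label. The following is a sketch on a made-up frame, assuming the empty cells are stored as NaN; the column names and the variables toy, mostly_empty, and cleaned are hypothetical. On our actual dataframe, the 'age' column would also have to be excluded from the check, because Step 6 filled its missing values with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only 'gender' is filled in on the last two rows.
toy = pd.DataFrame({
    "firstname": ["Marie", "Albert", np.nan, np.nan],
    "year": [1903, 1921, np.nan, np.nan],
    "gender": ["female", "male", "male", "male"],
})

# True for rows where every column other than 'gender' is missing.
mostly_empty = toy.drop(columns="gender").isna().all(axis=1)

# Index labels of the rows that would be dropped.
print(toy.index[mostly_empty].tolist())

# Keep only the rows that carry real data.
cleaned = toy.loc[~mostly_empty]
```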

**Step 8:** Created a new dataframe, nobel_data_valid, that excludes rows whose 'age' value is 0, so that we can create models using the 'age' column without laureates whose ages are missing. Most of these rows represent organizational winners, rather than individuals, so the age data for these instances would not be relevant to our analysis.

In [None]:
# Build the mask from nobel_data itself so it matches the rows kept after Step 7.
nobel_data_valid = nobel_data.loc[nobel_data['age'] != 0]
nobel_data_valid.head()
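
As a small illustration of this kind of filter, the sketch below applies a boolean mask to a made-up frame. The names toy and valid and the column over_40 are ours for illustration. The trailing .copy() is an optional addition that makes the filtered result an independent dataframe, so columns can be added to it later without affecting the frame it was filtered from.

```python
import pandas as pd

# Hypothetical sample; an age of 0 marks a row with no usable age.
toy = pd.DataFrame({
    "firstname": ["Marie", "Some Organization", "Albert"],
    "age": [36, 0, 42],
})

# Keep only rows with a real age; copy so later edits stay separate from toy.
valid = toy.loc[toy["age"] != 0].copy()

# Adding a column to the filtered frame leaves the original unchanged.
valid["over_40"] = valid["age"] > 40

print(valid)
```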

**Step 9:** Printed the column types again to confirm that the conversions made above are reflected in our dataframe.

In [None]:
nobel_data.dtypes