# Data Scientist Technical Test

Ross Guthery

2 November 2022

In [1]:
# Import packages.
import pandas as pd
from toolbox import Data

### 1. Data Collection and Cleaning

To read in and clean the data needed for this test, I wrote a class called ```Data```. Reading the data in is straightforward; cleaning it is not. The class's private ```clean_data``` function renames the columns so they are easier to understand, drops columns I either can't use or don't understand, and removes unnecessary characters like "_z", "<", "$", and ",". It also casts each column to the appropriate data type and replaces boolean values like "yes" and "no" with ones and zeros. During this step I also drop the one outlier I found, a negative car age, and the only duplicate row I came across. (I carry out these last two steps on the training data only.)
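The cleaning steps described above could be sketched roughly as follows. This is only an illustration, not the actual ```clean_data``` implementation: the function name `clean_frame`, the raw column names (`INCOME`, `MSTATUS`, `EDUCATION`, `CAR_AGE`), and the regular expression are all assumptions made for the example.

```python
import pandas as pd

# Characters the text says are removed; the exact pattern is an assumption.
UNWANTED = r"_z|<|\$|,"


def clean_frame(raw: pd.DataFrame, train: bool) -> pd.DataFrame:
    """Hypothetical stand-in for the cleaning steps described in the text."""
    # Rename columns to easier-to-read names (raw names are assumed).
    df = raw.rename(
        columns={
            "INCOME": "income",
            "MSTATUS": "is_married",
            "EDUCATION": "education",
            "CAR_AGE": "car_age",
        }
    )
    # Strip the unwanted characters from the text columns.
    for col in ["income", "is_married", "education"]:
        df[col] = df[col].str.replace(UNWANTED, "", regex=True).str.strip()
    # Cast income to a numeric type and map yes/no to ones and zeros.
    df["income"] = pd.to_numeric(df["income"])
    df["is_married"] = df["is_married"].str.lower().map({"yes": 1, "no": 0})
    if train:
        # Training data only: drop negative car ages, then duplicate rows.
        df = df[~(df["car_age"] < 0)]
        df = df.drop_duplicates()
    return df


# Made-up rows: one duplicate pair and one negative car age.
demo_raw = pd.DataFrame(
    {
        "INCOME": ["$67,349", "$91,449", "$91,449", None],
        "MSTATUS": ["Yes", "No", "No", "Yes"],
        "EDUCATION": ["<High School", "PhD", "PhD", "Masters"],
        "CAR_AGE": [18, 1, 1, -3],
    }
)
demo_train = clean_frame(demo_raw, train=True)
demo_test = clean_frame(demo_raw, train=False)
```

Passing `train=False` skips the outlier and duplicate removal, mirroring the note that those two steps apply to the training data only.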

In [2]:
# Instantiate a Data object for the train and test sets.
train_data: Data = Data(
    train=True, file_path="data/train_auto.csv", index_col=0,
)
test_data: Data = Data(
    train=False, file_path="data/test_auto.csv", index_col=0,
)

In [3]:
# Call the train and test set Data objects.
train_data()
test_data()

In [4]:
# Display where the null values are in the train set.
pd.concat(
    objs=[train_data.data.isnull().sum(), train_data.data.eq('').sum()],
    keys=['Nulls', 'Empty'],
    axis=1,
)

| Column | Nulls | Empty |
|---|---|---|
| target_flag | 0 | 0 |
| num_kids_driving | 0 | 0 |
| age | 6 | 0 |
| num_kids_home | 0 | 0 |
| income | 413 | 0 |
| is_single_parent | 0 | 0 |
| home_value | 428 | 0 |
| is_married | 0 | 0 |
| is_female | 0 | 0 |
| education | 0 | 0 |


I spent a fair amount of time considering how best to handle the null values that appear in the age, job, income, car age, and home value columns of the data set. In the end, I proceeded as follows. First, I replaced the null values in the age and car age columns with the mean of each column, since the mean and median of each are less than one year apart. Next, I removed rows with null values in the job column, because the average income of blue collar workers, the most common occupation, does not match the average income of those whose occupation isn't listed ($59,282 vs. $118,457), so filling in the most common occupation would be misleading. Moreover, the standard deviation of the latter group's income distribution is $58,834, which means that five of the eight possible occupations are within one standard deviation of that group's average income, so income alone does not point to a single likely occupation. Finally, I removed rows with null values in either the income or home value columns because, in my opinion, inferring a value for either risks a gross misjudgement. An income of zero could mean that someone is unemployed, just as a home value of zero could mean that someone is a renter. In other words, any assumption here would be unjustified.

In [5]:
# Handle null values in the train and test sets according to the strategy above.
train_data.deal_with_nulls()
test_data.deal_with_nulls()
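
For illustration, the null-handling strategy described above might look something like the sketch below. It is not the actual ```deal_with_nulls``` method: the function name `handle_nulls`, the `job` and `car_age` column names, and the sample rows are assumptions made for the example.

```python
import pandas as pd


def handle_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the null-handling strategy from the text."""
    out = df.copy()
    # Fill age and car age with the column mean (mean and median are close).
    for col in ["age", "car_age"]:
        out[col] = out[col].fillna(out[col].mean())
    # Drop rows where job, income, or home value is missing rather than guess.
    return out.dropna(subset=["job", "income", "home_value"])


# Made-up rows; a home value of 0.0 is a real value, not a null.
demo_nulls = pd.DataFrame(
    {
        "age": [40.0, None, 50.0, 60.0, 30.0],
        "car_age": [10.0, 2.0, None, 6.0, 2.0],
        "job": ["Clerical", "Lawyer", None, "Manager", "Doctor"],
        "income": [50000.0, 80000.0, 70000.0, 90000.0, None],
        "home_value": [0.0, 250000.0, 300000.0, None, 100000.0],
    }
)
demo_handled = handle_nulls(demo_nulls)
```

Note that a zero in the income or home value column is kept as a legitimate value; only true nulls cause a row to be dropped.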