In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Remaining data-cleaning tasks

As mentioned in [the last notebook](./1_data_preparing.ipynb), we've now removed all legitimately-missing values and all outliers.  What remains is to fill in the "missing" values with "DNE" (Does Not Exist) or with 0 as appropriate.

How can we tell that a missing value means "DNE" or "0" rather than actually being missing?  The [data description](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) sheds light on this.  For example, it says that the variable `Alley` is supposed to take on 3 values: 'Grvl', 'Pave', and 'NA'.  However, the vast majority of observations have missing values in this column, and among all observations in the data set, *only the values 'Grvl' and 'Pave' appear in the column `Alley`*!  This implies that what the data dictionary indicates as "NA" means "We left this value blank."

So to eliminate missing data problems, we'll want to automatically fill in with "DNE" (Does Not Exist) any missing values that have this property.  That is, if a categorical variable is listed in the data dictionary as being able to take on the value "NA" and furthermore has missing values, then we'll check whether that variable has achieved the full number of categories it's supposed to (according to the data dictionary).  If not, then we'll automatically fill its missing values with "DNE".

Below, we construct a dictionary of all the categorical variables that the data description lists as being able to take on the value "NA" or "None", and for each such variable we list the number of different values (including "NA") that the variable is supposed to be able to take on.  Since each of these variables can take on only a few values, in a data set of 2000+ observations we should expect to find examples of every one of them, *unless* the value "NA" was encoded as a missing value.  So to test whether one of these categorical variables (likely) has *missing values only because of the encoding method and not because they were legitimately missing*, we will simply compare the number of values that the variable takes on in our data set to the number it's *supposed to* be able to take on according to the data description.  If the number of values taken on is less than the number claimed by the data description, then we'll assume that any missing values should properly be understood as "DNE" (Does Not Exist).
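
The comparison described above can be sketched on a toy column.  This is only an illustration, not the notebook's actual data: the frame `toy` and the variable `expected_levels` are made up for the example.  It uses `nunique()`, which ignores missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Alley' is documented as taking 3 values
# ('Grvl', 'Pave', 'NA'), but the "NA" level appears as a missing value.
toy = pd.DataFrame({'Alley': ['Grvl', np.nan, 'Pave', np.nan, np.nan]})
expected_levels = 3  # assumed count taken from a data dictionary

# Count the distinct non-missing values actually observed
observed_levels = toy['Alley'].nunique()

# Fewer observed levels than documented suggests "NA" was stored as a blank
looks_like_encoding = observed_levels < expected_levels

print(observed_levels, looks_like_encoding)
```

Here two levels are observed against three documented ones, so the missing entries would be treated as "DNE".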

For the numeric variables, it's impossible to use this method to check whether a value is "legitimately" missing: the possible values that the numeric variables can take on are not listed in the data description, as there are infinitely many of them!  Instead, if a variable represents a quantity that could possibly be 0 in some homes (e.g. 'Wood Deck SF', if the home doesn't have a wood deck at all), then we'll just assume that any missing data corresponds to a value of 0.  We construct a list of all such numeric variables below.
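
As a minimal sketch of the zero-fill rule for numeric variables, assuming a toy frame with a single column (`toy` and `zero_fill_columns` are hypothetical names used only here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one home has no recorded deck area
toy = pd.DataFrame({'Wood Deck SF': [120.0, np.nan, 0.0, 85.0]})
zero_fill_columns = ['Wood Deck SF']  # columns where "missing" is read as 0

# Work on a copy so the original toy frame stays untouched
filled = toy.copy()
filled[zero_fill_columns] = filled[zero_fill_columns].fillna(0)

print(filled['Wood Deck SF'].tolist())
```

Note that after filling, a home whose value was missing can no longer be told apart from one recorded as 0, which is exactly the assumption being made.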

This method seems reasonable since (a) the previous notebook uncovered only 11 data points (out of more than 2000) that could not fit with the idea of missing data points simply being a shorthand way of recording "Does Not Exist", and (b) this data set has a lot of categorical variables with missing values, but we will find (using the test described two paragraphs above) that *none* of these values are legitimately missing (other than those 11 removed in the last notebook).

Finally, there are a few variables representing the year in which something took place (e.g. `Year Built`, `Garage Yr Blt` and `Yr Sold`).  Since this is a data set of *homes that were built and then sold*, there is no reason for us to assume that a missing value in `Year Built`, `Yr Sold`, or similar columns is anything other than legitimately missing.  The one exception is `Garage Yr Blt`, where there is strong reason to believe that a missing value is not "truly missing" but is instead a way of encoding "Does Not Exist".  Indeed, some homes do not have garages!  For this reason, we will fill any missing `Garage Yr Blt` values with "DNE".
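
One consequence worth keeping in mind: once a year column holds the string "DNE", it is no longer a purely numeric column.  The sketch below shows this on made-up values (the frame `toy` is hypothetical); it casts to `object` first so that strings and numbers can sit side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second home has no garage
toy = pd.DataFrame({'Garage Yr Blt': [1998.0, np.nan, 2005.0]})

# Cast to object, then keep existing years and put 'DNE' where the value is missing
has_year = toy['Garage Yr Blt'].notnull()
years = toy['Garage Yr Blt'].astype(object).where(has_year, 'DNE')

print(years.tolist())
print(years.dtype)
```

Any later step that treats this column as a number would need to handle the "DNE" entries first.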


## Variables to fill in since their values may not be legitimately missing

In [2]:
#Create a dictionary listing how many distinct values each of the categorical variables
#is supposed to be able to take on.  If fewer than this many are achieved, then that's
#a sign that this variable's "missing values" are just instances of "NA".
categorical_nums = {'Alley': 3, 'Mas Vnr Type': 5, 'Bsmt Qual': 6, 'Bsmt Cond': 6, 'Bsmt Exposure': 5,
                'BsmtFin Type 1': 7, 'BsmtFin Type 2': 7, 'Fireplace Qu': 6, 'Garage Type': 7,
                'Garage Finish': 4, 'Garage Qual': 6, 'Garage Cond': 6, 'Pool QC': 5, 'Fence': 5,
                'Misc Feature': 6,
                'Garage Yr Blt': np.inf,
                }
#Some homes may not have a garage, so its "year built" may be missing for this reason, even though
#this variable is numeric (not categorical).  But we want to fill these with "DNE" rather than 0,
#which is why we list 'Garage Yr Blt' here rather than with the numeric variables.  We gave it a
#threshold number of infinity so that when we check whether the number of values it takes on
#is less than the threshold number, we will necessarily get the answer "yes" (so it is always filled).

categorical_names = categorical_nums.keys()

In [3]:
#Also, we'll want to create a list of those numeric variable names that are likely to be ones
#where a missing value would be equivalent to a "zero."  For example, "Lot Frontage" being
#missing probably means that there are just 0 feet of street connected to the property.
numeric_names = ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
                 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF',
                 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
                 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars',
                 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
                 'Screen Porch', 'Pool Area', 'Misc Val']
#We don't include SalePrice in this list since it's the target variable we're predicting.
#It doesn't appear in the test set, and none of its values were missing in the training set.

## Data Cleaning Functions

In [4]:
def check_missing(df):
    '''
    Input: a Pandas dataframe df
    
    Prints out how many values are missing from each column.
    Only prints out columns that are missing at least one value.
    '''
    #Count the missing values per column once, then keep only the nonzero counts
    missing_counts = df.isnull().sum()
    missing_counts = missing_counts[missing_counts != 0]
    
    if missing_counts.empty:
        print('No missing values')

    for name, count in missing_counts.items():
        print(f'{count} missing values from variable {name}')
            

In [5]:
def fill_categoricals(df, nums_dict):
    '''
    Inputs:
    df: A Pandas dataframe
    nums_dict: A dictionary.  Each item in the dictionary should be of the form key:x,
        where key is the name of a column in df representing a categorical variable and
        where x is the number of different values that that categorical variable is
        supposed to be able to take on (according to the source of the data).
        
    Output:
    The dataframe df with the missing values filled with 'DNE' (Does Not Exist)
    in all columns meeting the following conditions:
        1. The column's name appears as a key in nums_dict;
        2. The number of different values taken on by that variable in df is LESS THAN
           the number x that is keyed to by that variable name in nums_dict
    Any missing values in columns not meeting these criteria will remain as they are.
    Note that df is modified in place and also returned.
    '''
    for key, x in nums_dict.items():
        
        #If that variable doesn't have as many unique values as it should...
        #(nunique() ignores missing values, so they don't count as a category)
        if df[key].nunique() < x:
            
            #Boolean mask of the rows where this variable is missing.  Using a mask
            #with .loc avoids mixing up index labels and integer positions.
            missings = df[key].isnull()
            
            #Make sure the column can hold strings (e.g. 'Garage Yr Blt' is numeric),
            #then fill the missing rows with 'DNE'
            if missings.any():
                df[key] = df[key].astype(object)
                df.loc[missings, key] = 'DNE'
            
    return df
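
A note on row selection when filling values: index labels and integer positions coincide only when a dataframe has the default 0, 1, 2, ... index, as it does right after `pd.read_csv`.  The sketch below uses made-up data (the frame `toy` and its index are hypothetical) to show that a boolean mask with `.loc` fills the right rows even when the index is something else.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index (e.g. after filtering rows)
toy = pd.DataFrame({'Fence': ['MnPrv', np.nan, 'GdWo', np.nan]},
                   index=[10, 20, 30, 40])

# Boolean mask: True exactly where the value is missing
mask = toy['Fence'].isnull()

# .loc with a mask selects rows by alignment on the index, not by position
toy.loc[mask, 'Fence'] = 'DNE'

print(toy['Fence'].tolist())
```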


In [6]:
def fill_numerics(df, names_list):
    '''
    Inputs:
    df: A Pandas dataframe
    names_list: A list of names of columns in df that are each numeric variables.
    
    Output:
    The dataframe df with the missing values in the columns in names_list
    filled with 0.  Note that df is modified in place and also returned.
    '''
    
    for col in names_list:
        
        #Fill all missing values in this column with 0
        df[col] = df[col].fillna(0)
            
    return df

In [7]:
def fill_missings(df, categorical_dict, numeric_list):
    '''
    Carries out the function fill_categoricals(df, categorical_dict)
    followed by the function fill_numerics(df, numeric_list),
    then returns the resulting dataframe.  Prints out what it's doing
    along the way, and checks for missing values at each step.
    '''

    print('Checking missing values... \n')
    check_missing(df)
    
    print('\nFilling missing categorical values with "DNE", if we have reason\nto believe they are not legitimately missing...')
    
    df = fill_categoricals(df, categorical_dict)
    
    print('\nChecking missing values again... \n')
    check_missing(df)
    
    print('\nFilling missing numeric values with 0... ')
    df = fill_numerics(df, numeric_list)
    
    print('\nChecking missing values again... \n')
    check_missing(df)
    
    return df

## Data Cleaning: Training Set

In [8]:
df = pd.read_csv('../datasets/train_prepared.csv')

In [9]:
df = fill_missings(df, categorical_nums, numeric_names)

Checking missing values... 

329 missing values from variable Lot Frontage
1900 missing values from variable Alley
1234 missing values from variable Mas Vnr Type
22 missing values from variable Mas Vnr Area
53 missing values from variable Bsmt Qual
53 missing values from variable Bsmt Cond
53 missing values from variable Bsmt Exposure
53 missing values from variable BsmtFin Type 1
53 missing values from variable BsmtFin Type 2
997 missing values from variable Fireplace Qu
113 missing values from variable Garage Type
113 missing values from variable Garage Yr Blt
113 missing values from variable Garage Finish
113 missing values from variable Garage Qual
113 missing values from variable Garage Cond
2032 missing values from variable Pool QC
1642 missing values from variable Fence
1976 missing values from variable Misc Feature

Filling missing categorical values with "DNE", if we have reason
to believe they are not legitimately missing...

Checking missing values again... 

329 missing val

## Data Export: Training Set

In [10]:
df.to_csv('../datasets/train_cleaned.csv', index=False)
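
One practical point about the placeholder when exporting to CSV: by default, `pd.read_csv` interprets the string "NA" as a missing value, whereas "DNE" is read back as an ordinary string.  The round trip below is a small sketch on made-up data (the frame `toy` is hypothetical) using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical toy data: one placeholder written as 'DNE', one as 'NA'
toy = pd.DataFrame({'Alley': ['Grvl', 'DNE', 'NA']})

# Write to an in-memory CSV and read it back with default settings
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# 'DNE' survives the round trip; 'NA' comes back as a missing value
print(reloaded['Alley'].isnull().tolist())
```

So the filled values should still be present when the cleaned file is loaded again with default `read_csv` settings.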

## Data Cleaning: Test Set

In [11]:
df = pd.read_csv('../datasets/test_prepared.csv')

In [12]:
df = fill_missings(df, categorical_nums, numeric_names)

Checking missing values... 

160 missing values from variable Lot Frontage
820 missing values from variable Alley
533 missing values from variable Mas Vnr Type
1 missing values from variable Mas Vnr Area
25 missing values from variable Bsmt Qual
25 missing values from variable Bsmt Cond
25 missing values from variable Bsmt Exposure
25 missing values from variable BsmtFin Type 1
25 missing values from variable BsmtFin Type 2
422 missing values from variable Fireplace Qu
45 missing values from variable Garage Type
45 missing values from variable Garage Yr Blt
45 missing values from variable Garage Finish
45 missing values from variable Garage Qual
45 missing values from variable Garage Cond
874 missing values from variable Pool QC
706 missing values from variable Fence
837 missing values from variable Misc Feature

Filling missing categorical values with "DNE", if we have reason
to believe they are not legitimately missing...

Checking missing values again... 

160 missing values from va

## Data Export: Test Set

In [13]:
df.to_csv('../datasets/test_cleaned.csv', index=False)

## What's next?

We've cleaned the data and filled in missing values.  All that remains is to convert the data types to convenient formats and then do some feature engineering (e.g., turning categorical variables into dummies or numeric variables).  We'll do so in the [next notebook](./3_data_processing.ipynb).