# Summary of this notebook

Although missing values have been dealt with in the [last notebook](./2_data_cleaning.ipynb), we still have some feature engineering to do!  In this notebook, we accomplish two tasks:

1. We convert the data types of variables to the appropriate formats, including encoding ordinal variables with numeric values.
2. We impute missing numerical values in the variable `Garage Yr Blt` with the mean of the non-missing `Garage Yr Blt` values.

### Imports

In [1]:
#Package Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

## Data Import: Training Set

In [2]:
#Data import
df = pd.read_csv('../datasets/train_cleaned.csv')

## Fixing Data Types

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2040 entries, 0 to 2039
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2040 non-null   int64  
 1   PID              2040 non-null   int64  
 2   MS SubClass      2040 non-null   int64  
 3   MS Zoning        2040 non-null   object 
 4   Lot Frontage     2040 non-null   float64
 5   Lot Area         2040 non-null   int64  
 6   Street           2040 non-null   object 
 7   Alley            2040 non-null   object 
 8   Lot Shape        2040 non-null   object 
 9   Land Contour     2040 non-null   object 
 10  Utilities        2040 non-null   object 
 11  Lot Config       2040 non-null   object 
 12  Land Slope       2040 non-null   object 
 13  Neighborhood     2040 non-null   object 
 14  Condition 1      2040 non-null   object 
 15  Condition 2      2040 non-null   object 
 16  Bldg Type        2040 non-null   object 
 17  House Style   

Some of these variables, such as `Id` and `PID`, are assigned essentially at random and hence won't be useful for predictive modeling.  We should keep `Id` in the data set just so we can keep track of which observations are which, but other such "useless" variables will be dropped.  We'll make a list called `useless` of these variables' names so we can drop them later.  We'll also include in this list variables that are likely to be multicollinear with other variables that we're not including in our `useless` list.

Some other variables, such as `MS SubClass`, are encoded as numbers (codes) even though they're categorical variables.  We'll make a list called `categorical` of such variables.

Other variables are listed in the data description as "ordinal", indicating that their values have a natural order to them.  We'll add these variables' names to a dictionary (of dictionaries) called `ordinal_dict`; each name will key to a dictionary encoding the various values that variable can take on as numbers (according to the data description).

NOTE: We will *exclude* from our `ordinal` list those variables, such as `Bsmt Qual`, that can take on the value 'NA' or 'None' (which we encoded as 'DNE' in the last notebook).  This is because it's not obvious that "having no basement is worse than having a basement in poor condition."  Instead of deciding ourselves whether it's worse to have a bad basement or no basement, we'll let our categorical variables encoder do so for us by checking which values are associated with higher/lower sale prices!  We'll do the same with variables like `Electrical`, where it's not clear that a "mixed" electrical system is necessarily worse than a "poor" one.
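
To make the idea concrete, here is a minimal sketch of how sale prices could suggest an ordering for a category like 'DNE'. The toy data and the use of a simple group mean are assumptions for illustration only; the encoder actually used in the modeling notebook may work differently.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the Ames data set)
toy = pd.DataFrame({
    'Bsmt Qual': ['Gd', 'DNE', 'Fa', 'Gd', 'DNE', 'Fa'],
    'SalePrice': [200000, 120000, 110000, 220000, 130000, 100000],
})

# Average sale price for each category, sorted from lowest to highest
category_means = toy.groupby('Bsmt Qual')['SalePrice'].mean().sort_values()
print(category_means)
```

In this made-up sample, 'DNE' lands between 'Fa' and 'Gd', an ordering we would have missed had we assumed that having no basement is always the worst case.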

Lastly, the variable `Garage Yr Blt` currently has data type `object` because it contains the non-numeric entries 'DNE' that we added in place of missing values.  We'll put it in a list called `numeric` and handle it later.

In [4]:
useless = ['PID', 'BsmtFin SF 2']
#'BsmtFin SF 2' can be solved for using the other values provided.
#It also seems less informative than these other values.

categorical = ['MS SubClass']
           
numeric = ['Garage Yr Blt']
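
As a rough illustration of why `BsmtFin SF 2` is redundant, the sketch below assumes that `Total Bsmt SF` is the sum of `BsmtFin SF 1`, `BsmtFin SF 2`, and `Bsmt Unf SF`. That identity is our reading of the data description, not something verified in this notebook, and the rows are made up.

```python
import pandas as pd

# Toy rows (made up) that satisfy the assumed identity
toy = pd.DataFrame({
    'BsmtFin SF 1':  [500, 0, 300],
    'BsmtFin SF 2':  [100, 0, 0],
    'Bsmt Unf SF':   [200, 800, 400],
    'Total Bsmt SF': [800, 800, 700],
})

# Recover 'BsmtFin SF 2' from the other three basement columns
recovered = toy['Total Bsmt SF'] - toy['BsmtFin SF 1'] - toy['Bsmt Unf SF']
print((recovered == toy['BsmtFin SF 2']).all())
```

Running the same comparison on the real `df` before dropping the column would be one way to confirm that the assumption holds for every row.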

In [5]:
#Dictionary of ordinal variables
ordinal_dict = {
 'Lot Shape' : {'Reg':0, 'IR1':1, 'IR2':2, 'IR3':3},
 'Utilities' : {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1},
 'Land Slope': {'Gtl':0, 'Mod':1, 'Sev':2},
 'Exter Qual': {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
 'Exter Cond': {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
 'Heating QC': {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
 'Kitchen Qual':{'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1},
 'Functional': {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1},
 'Paved Drive':{'Y':3, 'P':2, 'N':1}
}

ordinal = [x for x in ordinal_dict.keys()]
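
One thing to watch for with a dictionary like this: `Series.map` returns `NaN` for any value that is not a key of the dictionary. The sketch below uses a made-up column with a deliberately bad value ('XX') to show how such gaps could be spotted; `quality_map` and `unmapped` are names chosen just for this example.

```python
import pandas as pd

# Toy column with one value ('XX') that the mapping does not cover
toy = pd.Series(['Ex', 'TA', 'XX', 'Po'], name='Exter Qual')
quality_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}

mapped = toy.map(quality_map)

# Values missing from the dictionary become NaN; list them so they can be fixed
unmapped = sorted(toy[mapped.isna()].unique())
print(unmapped)
```

An empty list here would mean every value in the column has an encoding.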

Now that we know how we want everything to be encoded, let's write a function that does this all for us.  Specifically, we want it to do the following:
- For each `useless` variable, drop that variable.
- For each `categorical` variable, convert its values to strings.
- For each `numeric` variable, convert any non-numeric values to numbers.  To do so, we'll just assign all such non-numeric entries a numeric value equal to the *mean of the numeric values in that column*.
- For each `ordinal` variable, map its values to numbers according to the `ordinal_dict`.
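
Before writing the full function, here is a small sketch of the mean-imputation step on its own. The four values are made up, and `fillna` is just one convenient way to do the replacement.

```python
import numpy as np
import pandas as pd

# Toy column (made up) mixing year strings with the 'DNE' placeholder
toy = pd.Series(['1990', 'DNE', '2000', '2010'], name='Garage Yr Blt')

# Turn 'DNE' into NaN and everything else into a float
as_float = toy.map(lambda x: np.nan if x == 'DNE' else float(x))

# mean() skips NaN, so it averages only 1990, 2000 and 2010
imputed = as_float.fillna(as_float.mean())
print(imputed.tolist())
```

The 'DNE' entry ends up as 2000.0, the mean of the three real years.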

In [6]:
def process_data(df, useless, categorical, numeric, ordinal_dict):
    """Apply the processing steps described above, modifying df in place."""
    
    #Drop the useless columns
    df.drop(columns=useless, inplace=True)
    
    #Fix variables that appear numeric but should be categorical
    for col in categorical:
        df[col] = df[col].map(lambda x: str(x)+'*')
        #Add a '*' so that Pandas doesn't treat this as a number
    
    #Fix columns that are supposed to be numeric
    #This ONLY works if all non-numeric items are 'DNE'
    for col in numeric:
        df[col] = df[col].map(lambda x: np.nan if x == 'DNE' else float(x))
        #mean() skips NaN, so this is the mean of the non-missing values
        mean = df[col].mean()
        #Replace all NaN values with this mean
        df[col] = df[col].fillna(mean)
        
        
    #Give numeric values to ordinal variables, according to ordinal_dict
    for col in ordinal_dict.keys():
        df[col] = df[col].map(ordinal_dict[col])

In [7]:
process_data(df, useless=useless, categorical=categorical, numeric=numeric, ordinal_dict=ordinal_dict)
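
The `'*'` suffix added to categorical codes in `process_data` may look odd, so here is a sketch of the problem it avoids. It assumes the processed data is later reloaded with `pd.read_csv`; the in-memory buffer and the `round_trip` helper are stand-ins invented for this example.

```python
import io
import pandas as pd

# Toy frames (made up): the same codes without and with the '*' suffix
plain = pd.DataFrame({'MS SubClass': ['20', '60', '120']})
starred = pd.DataFrame({'MS SubClass': ['20*', '60*', '120*']})

def round_trip(frame):
    # Write to an in-memory CSV and read it back, mimicking a save and reload
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)

print(round_trip(plain)['MS SubClass'].dtype)    # parsed back as integers
print(round_trip(starred)['MS SubClass'].dtype)  # stays a string column
```

Without the suffix, the string codes are silently turned back into numbers on reload, which would undo the conversion to a categorical variable.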

## Checking data types

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2040 entries, 0 to 2039
Data columns (total 79 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2040 non-null   int64  
 1   MS SubClass      2040 non-null   object 
 2   MS Zoning        2040 non-null   object 
 3   Lot Frontage     2040 non-null   float64
 4   Lot Area         2040 non-null   int64  
 5   Street           2040 non-null   object 
 6   Alley            2040 non-null   object 
 7   Lot Shape        2040 non-null   int64  
 8   Land Contour     2040 non-null   object 
 9   Utilities        2040 non-null   int64  
 10  Lot Config       2040 non-null   object 
 11  Land Slope       2040 non-null   int64  
 12  Neighborhood     2040 non-null   object 
 13  Condition 1      2040 non-null   object 
 14  Condition 2      2040 non-null   object 
 15  Bldg Type        2040 non-null   object 
 16  House Style      2040 non-null   object 
 17  Overall Qual  

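The data types now look right: `PID` is gone, `MS SubClass` is an `object` column, and the ordinal variables `Lot Shape`, `Utilities`, and `Land Slope` are stored as `int64`.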

## Data Export: Training Set

Now that our training data is fully prepared, in the next notebook we'll use it for modeling.  However, we want to make sure to separate out a portion of our data to be used as a test set.  We'll do so now.

NOTE: We already have a "test" data set, but that data set doesn't have any values for our target variable `SalePrice`!  So if we want to be able to evaluate our models prior to using them "in the real world" (i.e. on the test data set given to us), we'll have to reserve a portion of our training set to be used for model evaluation.  We'll call this `train_processed_reserved.csv`, and the rest of our training data set will be exported as simply `train_processed.csv`.

In [9]:
df_train, df_test = train_test_split(df, test_size=.15, random_state=1234)

In [10]:
print(f"Number of observations in main (training) set: {len(df_train)}")
print(f"Number of observations in reserved (testing) set: {len(df_test)}")

Number of observations in main (training) set: 1734
Number of observations in reserved (testing) set: 306


In [11]:
df_train.to_csv('../datasets/train_processed.csv', index=False)
df_test.to_csv('../datasets/train_processed_reserved.csv', index=False)

## Data Import: Test Set

In [12]:
df = pd.read_csv('../datasets/test_cleaned.csv')

### Process the data

In [13]:
process_data(df, useless=useless, categorical=categorical, numeric=numeric, ordinal_dict=ordinal_dict)

## Data Export: Test Set

In [14]:
df.to_csv('../datasets/test_processed.csv', index=False)

## What's next?

In the [next notebook](./4_modeling.ipynb), we do some exploratory data analysis and then delve into modeling sale prices with linear regressions.