# Melbourne Housing Prices - Missing Values
In this Kaggle course on intermediate machine learning, I learned three approaches to dealing with missing values and compared their effectiveness on a real-world dataset.
 1) A Simple Option: Drop Columns with Missing Values
 2) A Better Option: Imputation
 3) An Extension To Imputation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('melb_data.csv')

# Select target
y = data.Price

# To keep things simple, use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

### Explanation of the code above
- melb_predictors = data.drop(['Price'], axis=1) : 
'Price' is the target variable, so it is removed and the remaining columns serve as predictor variables. The axis=1 argument means a column (as opposed to a row) is dropped, and the result is a new DataFrame without the 'Price' column.
- X = melb_predictors.select_dtypes(exclude=['object']) : 
This excludes columns with the data type 'object' (commonly used for text or categorical data) and stores the remaining numerical columns in a new DataFrame called X.
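The two steps above can be tried on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from melb_data.csv. The text column is given dtype `object` explicitly so the example behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy data mixing numeric and text columns (not the real dataset)
toy = pd.DataFrame({
    "Price": [1_000_000, 850_000, 1_200_000],
    "Rooms": [3, 2, 4],
    "Landsize": [450.0, 300.0, 610.0],
    # Force dtype 'object' so select_dtypes(exclude=['object']) applies to it
    "Suburb": pd.Series(["Abbotsford", "Richmond", "Kew"], dtype="object"),
})

# Step 1: remove the target column; axis=1 means "drop a column"
toy_predictors = toy.drop(["Price"], axis=1)

# Step 2: keep only the columns whose dtype is not 'object'
toy_numeric = toy_predictors.select_dtypes(exclude=["object"])

print(list(toy_predictors.columns))  # ['Rooms', 'Landsize', 'Suburb']
print(list(toy_numeric.columns))     # ['Rooms', 'Landsize']
```

Note that `drop` returns a new DataFrame, so `toy` itself still contains the 'Price' column afterwards.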

### Define Function to Measure Quality of Each Approach
Define a function "score_dataset()" to compare different approaches to dealing with missing values. This function reports the mean absolute error (MAE) from a random forest model. 

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

#### Explanation of n_estimators=10 above
In score_dataset(), the n_estimators parameter of RandomForestRegressor is set to 10. This parameter specifies the number of decision trees in the random forest ensemble, so the forest here consists of 10 trees.

n_estimators is a hyperparameter. Adding trees typically makes the model more robust and accurate, with diminishing returns, but training time grows roughly in proportion to the number of trees.

A value of 10 is relatively low (the scikit-learn default is 100). It suits quick experimentation and provides a baseline without requiring much computation. For more accurate and robust results, it is common to use a larger number of trees, often in the hundreds or even thousands, depending on the dataset and problem.

The best value depends on the specific problem and dataset. It is common practice to tune it, for example with cross-validation, to strike a balance between model accuracy and computational cost.
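One way to compare a few values of n_estimators with cross-validation is sketched below. It is an illustration only: it uses synthetic data from `make_regression` rather than the Melbourne dataset, and the candidate values 10, 50 and 100 are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Melbourne housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

cv_mae = {}
for n_trees in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    # scikit-learn returns negated MAE (higher is better), so flip the sign
    scores = -cross_val_score(model, X_demo, y_demo, cv=5,
                              scoring="neg_mean_absolute_error")
    cv_mae[n_trees] = scores.mean()
    print(f"n_estimators={n_trees}: mean MAE over 5 folds = {scores.mean():.2f}")
```

Which value wins, and by how much, depends on the data, so the same loop would need to be run on the real training set before drawing any conclusion.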

## Score from Approach 1 (Drop Columns with Missing Values)
Since I am working with both training and validation sets, I am careful to drop the same columns from both DataFrames.

In [4]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                    if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
183550.22137772635


## Score from Approach 2 (Imputation)
Next, use "SimpleImputer" to replace missing values with the mean value along each column.
Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.
*Imputation: the process of filling in or estimating missing values in a dataset using various methods.

In [6]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation):
178166.46269899711
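
As a small-scale illustration of the mean imputation used above, the sketch below fits the imputer on a made-up training frame and reuses the learned means on a made-up validation frame. The frames and their values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames (values invented for illustration)
train = pd.DataFrame({"Car": [1.0, 2.0, np.nan, 3.0],
                      "YearBuilt": [1990.0, np.nan, 2000.0, 2010.0]})
valid = pd.DataFrame({"Car": [np.nan, 2.0],
                      "YearBuilt": [1985.0, np.nan]})

imputer = SimpleImputer(strategy="mean")     # "mean" is also the default
train_filled = imputer.fit_transform(train)  # learns column means from train
valid_filled = imputer.transform(valid)      # reuses the training means

# Learned means: 2.0 for Car, 2000.0 for YearBuilt
print(imputer.statistics_)
# Missing validation entries are filled with those training means
print(valid_filled)
```

Calling `transform` (not `fit_transform`) on the validation data matters: the validation rows are filled with statistics computed from the training rows only, which is what the cell above does as well.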


#### Why the column names are put back after imputation
- Imputation with SimpleImputer: By default, SimpleImputer returns a NumPy array, which carries no column names. Wrapping that array in pd.DataFrame() therefore gives imputed_X_train and imputed_X_valid default integer column labels (0, 1, 2, ...) instead of the original feature names.

- Putting Back Column Names: Column names keep the dataset interpretable. Without them, it is hard to tell which feature each column represents, and any later step that selects or inspects columns by name would fail or mislead.

Assigning imputed_X_train.columns = X_train.columns and imputed_X_valid.columns = X_valid.columns restores the original feature names, so the imputed data keeps the same structure as the input and stays readable in the following steps of the analysis.
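
The labels can also be restored in the same call that builds the DataFrame. The sketch below is one possible variant, not the notebook's own code: it passes both `columns` and `index` to `pd.DataFrame`, so the row labels are kept as well. The `demo` frame and its index values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with a shuffled index, like one from train_test_split
demo = pd.DataFrame({"Car": [1.0, np.nan, 3.0],
                     "BuildingArea": [120.0, 95.0, np.nan]},
                    index=[42, 7, 19])

filled = pd.DataFrame(SimpleImputer().fit_transform(demo),
                      columns=demo.columns,  # restore feature names
                      index=demo.index)      # restore original row labels
print(filled)
```

Restoring the index is not required for score_dataset(), since scikit-learn works with row positions, but it keeps the imputed frame aligned with the target Series if the two are later joined or compared by label.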

## Score from Approach 3 (An Extension to Imputation)
Impute the missing values, while also keeping track of which values were imputed.

In [9]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
    
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
178927.503183954
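
For comparison, SimpleImputer has an add_indicator parameter that appends missing-value indicator columns automatically. The sketch below shows that built-in option on a hypothetical toy frame. It is a related alternative, not what the cell above does, and the indicator columns come back unnamed at the end of the array rather than as '_was_missing' columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only BuildingArea has a missing value
demo_train = pd.DataFrame({"Rooms": [2.0, 3.0, 4.0],
                           "BuildingArea": [100.0, np.nan, 140.0]})

# add_indicator=True appends one indicator column per feature that had
# missing values when the imputer was fitted
indicator_imputer = SimpleImputer(add_indicator=True)
out = indicator_imputer.fit_transform(demo_train)

print(out.shape)  # (3, 3): two imputed features plus one indicator column
print(out)
```

Here the missing BuildingArea entry is filled with the column mean (120.0), and the third column is 1 for that row and 0 elsewhere.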


So, why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, three of which contain missing data. In each of those columns, fewer than half of the entries are missing. Dropping the columns therefore removes a lot of useful information, so it makes sense that imputation performs better. Approach 3 scored slightly worse than Approach 2 here (an MAE of about 178,928 versus 178,166), so the extra indicator columns did not help on this dataset.

In [10]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
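
The "less than half" claim can be checked by turning the counts above into fractions. The sketch below re-enters the printed counts by hand so it runs on its own; on the real data, `X_train.isnull().mean()` should give the per-column fractions directly.

```python
import pandas as pd

n_rows = 10864  # number of training rows, from X_train.shape above
# Missing-value counts copied from the output above
missing_counts = pd.Series({"Car": 49, "BuildingArea": 5156, "YearBuilt": 4307})

# Fraction of rows with a missing value, per column
missing_fraction = missing_counts / n_rows
print(missing_fraction.round(4))
```

This gives roughly 0.5% missing for Car, 47% for BuildingArea and 40% for YearBuilt, so each of the three columns still holds values for more than half of the rows.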
