# Table of Contents:
* [Important Libraries](#important_libraries)
* [How to import data](#import_data)
* [How to filter rows with missing values](#filter_row_with_missing_value)
* [How to choose target and predictors](#choose_target_and_predictors)
* [How to split training and testing data](#split_training_and_testing)
* [How to define and fit model](#define_and_fit_model)
* [How to get predictions from model](#get_predictions)
* [How to validate model](#validate_model)
* [How to compare MAE at different max leaf nodes of DecisionTreeRegressor](#max_leaf_nodes)
* [How to use Random Forest Regressor Model](#random_forest)
* [How to make a submission on Kaggle](#make_submission)
* [How to Handle Missing Values](#handle_missing_value)
* [Using Categorical Data with One Hot Encoding](#one_hot_encoding)
* [XGBRegressor Model](#XGBRegressor)

# Important Libraries <a id="important_libraries"></a>

In [1]:
import pandas as pd # Pandas
from sklearn.model_selection import train_test_split # For Splitting testing and training data
from sklearn.metrics import mean_absolute_error # For calculating mean absolute error
from sklearn.tree import DecisionTreeRegressor # Model builder

# How to import data <a id = "import_data"></a>

In [2]:
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# How to filter rows with missing values <a id = "filter_row_with_missing_value"></a>

In [3]:
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# How to choose target and predictors <a id="choose_target_and_predictors"></a>

In [4]:
# How to choose target and predictors
y = filtered_melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_predictors]

# How to split training and testing data <a id="split_training_and_testing"></a>

In [5]:
# Split data for training and testing
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# How to define and fit model <a id="define_and_fit_model"></a>

In [6]:
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

# Get Predictions from model <a id="get_predictions"></a>

In [7]:
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)

# Validate Model <a id = "validate_model"></a>

In [9]:
# get mean absolute error
# scikit-learn's argument order is (y_true, y_pred)
melbourne_mean_absolute_error = mean_absolute_error(val_y, val_predictions)
print(melbourne_mean_absolute_error)

260745.79664299547


# DecisionTreeRegressor at different max leaf nodes <a id="max_leaf_nodes"></a>

In [10]:
# Mean Absolute Error function
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  257829
Max leaf nodes: 500  		 Mean Absolute Error:  243176
Max leaf nodes: 5000  		 Mean Absolute Error:  254915
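
Once each candidate size has a validation MAE, the best one can be picked programmatically rather than by reading the printout. The sketch below is one way to do that. It runs on synthetic data from `make_regression` as a stand-in for the Melbourne data (an assumption), and the helper name `score_leaf_sizes` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def score_leaf_sizes(candidates, train_X, val_X, train_y, val_y):
    """Return {max_leaf_nodes: validation MAE} for each candidate size."""
    scores = {}
    for n in candidates:
        model = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
        model.fit(train_X, train_y)
        scores[n] = mean_absolute_error(val_y, model.predict(val_X))
    return scores


# Synthetic stand-in data (assumption: not the Melbourne dataset)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

scores = score_leaf_sizes([5, 50, 500], train_X, val_X, train_y, val_y)
# The size with the lowest validation MAE is the one to keep
best_size = min(scores, key=scores.get)
print(scores, best_size)
```

After choosing `best_size`, a common next step is to refit the tree with that size on all available data before predicting.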


# Random Forest Regressor Model <a id="random_forest"></a>

In [17]:
# Random forest model
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

201579.4136432107


# How to make submission at kaggle <a id="make_submission"></a>

In [16]:
# create a data frame for submission; the price column holds the model's predictions, not the true targets
my_submission = pd.DataFrame({'Id': val_X.index, 'SalePrice': melb_preds})
# any filename works; we use submission.csv here
my_submission.to_csv('submission.csv', index=False)

# Handling missing values <a id="handle_missing_value"></a>

## 1) A Simple Option: Drop Columns with Missing Values
If your data is in a DataFrame called `original_data`, you can drop columns with missing values. One way to do that is
```
data_without_missing_values = original_data.dropna(axis=1)
```

In many cases, you'll have both a training dataset and a test dataset.  You will want to drop the same columns in both DataFrames. In that case, you would write

```
cols_with_missing = [col for col in original_data.columns 
                                 if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
```
If those columns had useful information (in the places that were not missing), your model loses access to this information when the column is dropped. Also, if your test data has missing values in places where your training data did not, this will result in an error.  

So this is usually not the best solution. However, it can be useful when most values in a column are missing.
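
One way to guard against the mismatch described above is to drop every column that has a missing value in either DataFrame. This is a minimal sketch on hypothetical toy data; the frames and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical): 'b' has a gap only in train, 'c' only in test
train_data = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                           'b': [np.nan, 5.0, 6.0],
                           'c': [7.0, 8.0, 9.0]})
test_data = pd.DataFrame({'a': [4.0, 5.0],
                          'b': [1.0, 2.0],
                          'c': [np.nan, 3.0]})

# Collect columns with missing values from each frame, then take the union
cols_missing_in_train = {col for col in train_data.columns
                         if train_data[col].isnull().any()}
cols_missing_in_test = {col for col in test_data.columns
                        if test_data[col].isnull().any()}
cols_to_drop = sorted(cols_missing_in_train | cols_missing_in_test)

reduced_train_data = train_data.drop(columns=cols_to_drop)
reduced_test_data = test_data.drop(columns=cols_to_drop)
print(cols_to_drop)
```

The cost is that even more information is thrown away, which is another reason to prefer imputation.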



## 2) A Better Option: Imputation
Imputation fills in the missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.

This is done with
```
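# SimpleImputer replaces the old sklearn.preprocessing.Imputer,
# which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)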
```
The default behavior fills in the mean value for imputation.  Statisticians have researched more complex strategies, but those complex strategies typically give no benefit once you plug the results into sophisticated machine learning models.
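
To make the default concrete, here is a small sketch on a hypothetical two-column DataFrame. It assumes a scikit-learn version that provides `SimpleImputer` in `sklearn.impute`. Note that `fit_transform` returns a NumPy array, so the column names are restored by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one gap in each column
toy = pd.DataFrame({'rooms': [2.0, np.nan, 4.0],
                    'area': [100.0, 150.0, np.nan]})

imputer = SimpleImputer()  # strategy='mean' is the default
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The gap in 'rooms' becomes mean(2, 4) = 3.0,
# the gap in 'area' becomes mean(100, 150) = 125.0
print(imputed)
```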

One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.
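
As a rough illustration of the Pipeline point, the sketch below chains an imputer and a random forest with `make_pipeline` and cross-validates the two steps as a single estimator. The data is randomly generated (an assumption, not the Melbourne data), and the model settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): three numeric features, target driven by 'f1'
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(60, 3), columns=['f1', 'f2', 'f3'])
y = 3 * X['f1'] + rng.rand(60)

# Knock out every 7th value of 'f2' so the imputer has something to fill
X.iloc[::7, 1] = np.nan

# The imputer is fit on each training fold only, then applied to the held-out fold
my_pipeline = make_pipeline(SimpleImputer(),
                            RandomForestRegressor(n_estimators=20, random_state=0))
scores = cross_val_score(my_pipeline, X, y,
                         scoring='neg_mean_absolute_error', cv=3)
print(-scores.mean())
```

Because imputation happens inside the pipeline, the validation folds never leak into the imputed means.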

## 3) An Extension To Imputation
Imputation is the standard approach, and it usually works well.  However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.  Here's how it might look:
```
# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
# (a list, not a generator, since new_data gains columns in the loop below)
cols_with_missing = [col for col in new_data.columns 
                                 if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

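# Imputation (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()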
new_data = my_imputer.fit_transform(new_data)
```

In some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.
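
A shorter route to the same idea, assuming a scikit-learn version where `SimpleImputer` accepts `add_indicator` (0.21 or later), is to let the imputer append the "was missing" columns itself. The toy data below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): only 'rooms' has a missing value
toy = pd.DataFrame({'rooms': [2.0, np.nan, 4.0],
                    'area': [100.0, 150.0, 200.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fit
imputer = SimpleImputer(add_indicator=True)
result = imputer.fit_transform(toy)

# Columns: imputed 'rooms', imputed 'area', indicator for 'rooms'
print(result)
```

The output is a NumPy array, so the indicator columns carry no names unless you add them back yourself.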

# Using Categorical Data with One Hot Encoding <a id="one_hot_encoding"></a>

In [None]:
# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]
# Use built-in function
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
# Apply it and compare against a model with categorical data removed
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to turn sklearn's negated MAE score into a positive MAE
    return -1 * cross_val_score(RandomForestRegressor(50), 
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))
# Aligning train and test data
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left', 
                                                                    axis=1)
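
To see what the alignment step does, here is a small sketch on hypothetical toy data in which the test set contains a category (`green`) that the training set lacks, and vice versa (`blue`). Passing `fill_value=0` is an addition not used in the cell above; it fills the columns created by the alignment with 0 instead of NaN.

```python
import pandas as pd

# Toy predictors (hypothetical): the category sets differ between train and test
train_predictors = pd.DataFrame({'color': ['red', 'blue', 'red'],
                                 'size': [1, 2, 3]})
test_predictors = pd.DataFrame({'color': ['red', 'green'],
                                'size': [4, 5]})

encoded_train = pd.get_dummies(train_predictors)
encoded_test = pd.get_dummies(test_predictors)

# join='left' keeps exactly the training columns, in training order:
# 'color_green' is dropped from test, 'color_blue' is added to test
final_train, final_test = encoded_train.align(encoded_test,
                                              join='left',
                                              axis=1,
                                              fill_value=0)
print(list(final_train.columns))
print(list(final_test.columns))
```

Without this step, a model fit on the training columns would fail or silently misread the test columns.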

# XGBRegressor Model <a id="XGBRegressor"></a>

In [None]:
# Without Tuning
from xgboost import XGBRegressor
my_model = XGBRegressor()
# verbose=False avoids printing out updates with each cycle
my_model.fit(train_X, train_y, verbose=False)
# make predictions
predictions = my_model.predict(test_X)
from sklearn.metrics import mean_absolute_error
# scikit-learn's argument order is (y_true, y_pred)
print("Mean Absolute Error : " + str(mean_absolute_error(test_y, predictions)))

# With Tuning
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
             eval_set=[(test_X, test_y)], verbose=False)

# With Learning Rate
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
             eval_set=[(test_X, test_y)], verbose=False)