*This tutorial is part of the [Learn Machine Learning](https://www.kaggle.com/learn/machine-learning) educational track.*

# Starting Your Project

You are about to build a simple model and then continually improve it. It is easiest to keep one browser tab (or window) for the tutorials you are reading, and a separate browser window with the code you are writing. You will continue writing code in the same place even as you progress through the sequence of tutorials.

**The starting point for your project is at [THIS LINK](https://www.kaggle.com/dansbecker/my-model/).  Open that link in a new tab. Then hit the "Fork Notebook" button towards the top of the screen.**

![Imgur](https://i.imgur.com/GRtMTWw.png)

**You will see examples predicting home prices using data from Melbourne, Australia. You will then write code to build a model predicting prices in the US state of Iowa. The Iowa data is pre-loaded in your coding notebook.**

### Working in Kaggle Notebooks
You will be coding in a "notebook" environment. Notebooks let you easily see your code and its output in one place.  A couple of tips on the Kaggle notebook environment:

1) It is composed of "cells."  You will write code in the cells. Add a new cell by clicking on a cell, and then using the buttons that look like this: ![Imgur](https://i.imgur.com/Lscji3d.png) The arrows indicate whether the new cell goes above or below your current location. <br><br>
2) Execute the code in the current cell with the keyboard shortcut Control-Enter.


---
# Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data.  You'll use the Pandas library for this.  Pandas is the primary tool that modern data scientists use for exploring and manipulating data.  Most people abbreviate pandas in their code as `pd`.  We do this with the command

In [1]:
import pandas as pd

The most important part of the Pandas library is the DataFrame.  A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.  Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa.

The example will use data at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**.  Your data will be available in your notebook at `../input/train.csv` (which is already typed into the sample code for you). The cells below load the Iowa data from `../input/house-prices-advanced-regression-techniques/train.csv`.

We load and explore the data with the following:

In [2]:
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

In [3]:
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
train.describe()

# Interpreting Data Description
The results show 8 numbers for each numeric column in your original dataset. The first number, the **count**, shows how many rows have non-missing values.  

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

The second value is the **mean**, which is the average.  Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value.  The first (smallest) value is the min.  If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  That is the **25%** value (pronounced "25th percentile").  The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.
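
As a small illustration of these statistics, the sketch below calls `describe()` on a hand-made column of five values, one of which is missing. The column name `rooms` and its values are made up for this example and do not come from either dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: five rows, one missing value
toy = pd.DataFrame({"rooms": [1.0, 2.0, 3.0, 4.0, np.nan]})

summary = toy.describe()
print(summary)

# count ignores the missing row, so it is 4 rather than 5
print(summary.loc["count", "rooms"])
# the 50% value is the median of the non-missing values 1, 2, 3, 4
print(summary.loc["50%", "rooms"])
```

When a percentile falls between two rows, pandas interpolates between the neighboring values by default, which is why the median here is 2.5 rather than one of the listed values.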

--- 
# Your Turn
**Remember, the notebook you want to "fork" is [here](https://www.kaggle.com/dansbecker/my-model/).**

Run the equivalent commands (to read the data and print the summary) in the code cell below.  The file path for your data is already shown in your coding notebook. Look at the mean, minimum and maximum values for the first few fields. Are any of the values so crazy that it makes you think you've misinterpreted the data?

There are a lot of fields in this data.  You don't need to look at it all quite yet.

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset.  This is from the **min** value of **LotArea**, and you can see the **max** size too.  You should notice that it's a big range of lot sizes! 

You'll also see some columns filled with `...`.  That indicates that we had too many columns of data to print, so the middle ones were omitted from printing.

We'll take care of both issues in the next step.

# Continue
Move on to the next [page](https://www.kaggle.com/dansbecker/Selecting-And-Filtering-In-Pandas/) where you will focus in on the most relevant columns.

In [4]:
print(melbourne_data.columns)

# Selecting and Filtering Data

## Selecting a Single Column
You can pull out any variable (or column) with **dot-notation**. This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data. Here's an example:

In [5]:
melbourne_price_data = melbourne_data.Price
print(melbourne_price_data.head())

## Selecting Multiple Columns
You can select multiple columns from a DataFrame by providing a list of column names inside brackets. Remember, each item in that list should be a string (with quotes).

In [6]:
column_of_interest = ['Landsize','BuildingArea']
two_column_of_data = melbourne_data[column_of_interest]
two_column_of_data.describe()

In [7]:
# Print the train dataset columns
train.columns

In [8]:
house_sales_price = train.SalePrice
house_sales_price.head()

In [9]:
column_of_interest1 = ['SaleCondition','SaleType']
two_columns_of_data_train = train[column_of_interest1]
two_columns_of_data_train.describe()

In [10]:
y = train.SalePrice
# These are columns of the Iowa data, so the list is named accordingly
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                   'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
x = train[iowa_predictors]
x.head()
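
Dot-notation only works when the column name is a valid Python identifier. A name such as `1stFlrSF` in the Iowa data starts with a digit, so it has to be selected with brackets. The sketch below uses a made-up two-row table (the values are hypothetical) to show both forms.

```python
import pandas as pd

# Hypothetical two-row table reusing two Iowa column names
toy = pd.DataFrame({"LotArea": [8450, 9600], "1stFlrSF": [856, 1262]})

# Dot-notation and bracket-notation return the same Series
same = toy.LotArea.equals(toy["LotArea"])
print(same)

# toy.1stFlrSF would be a SyntaxError, so use brackets instead
first_floor = toy["1stFlrSF"]
print(first_floor.head())

# A list of names inside the brackets returns a DataFrame, not a Series
both = toy[["LotArea", "1stFlrSF"]]
print(type(both).__name__)
```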

# Building Your Model
You will use the **scikit-learn** library to create your models. When coding, this library is written as `sklearn`, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

**Define:** What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.  
**Fit:** Capture patterns from provided data. This is the heart of modeling.  
**Predict:** Just what it sounds like.  
**Evaluate:** Determine how accurate the model's predictions are.

Here is the example for defining and fitting the model.

In [11]:
from sklearn.tree import DecisionTreeRegressor

# Define model
my_model  = DecisionTreeRegressor()

# Fit model
my_model.fit(x, y)

In [12]:
print("Making Prediction for following 5 houses:")
print(x.head())
print("The Predictions are")
print(my_model.predict(x.head()))

In [13]:
from sklearn.metrics import mean_absolute_error

# In-sample error: the model is scored on the same rows it was fitted on
predicted_home_sale_prices = my_model.predict(x)
mean_absolute_error(y, predicted_home_sale_prices)

In [14]:
from sklearn.model_selection import train_test_split

train_X,val_X, train_y,val_y =  train_test_split(x, y, random_state = 0)
my_model1 = DecisionTreeRegressor()

my_model1.fit(train_X, train_y)
val_predictions = my_model1.predict(val_X)
print(mean_absolute_error(val_y,val_predictions))
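
The gap between the two error figures above comes from scoring on data the model has already seen. A minimal sketch on synthetic data (the feature, noise level, and sizes are invented for illustration) shows the effect: an unconstrained decision tree reproduces its training targets almost exactly, so its in-sample error says little about new data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature plus noise (hypothetical, not housing data)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_tr, y_tr)

# Score once on the rows used for fitting and once on held-out rows
in_sample_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
validation_mae = mean_absolute_error(y_val, tree.predict(X_val))
print("In-sample MAE: ", in_sample_mae)
print("Validation MAE:", validation_mae)
```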

# Checking Underfitting and Overfitting

In [15]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return(mae)

In [16]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Of the options listed, 50 is the optimal number of leaves. Apply the function to your Iowa data to find the best decision tree.
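
One way to pick the best size programmatically is to store each score in a dictionary and take the key with the smallest error. The sketch below does this on synthetic data; the data, the helper name `get_mae_demo`, and the candidate sizes are assumptions made for illustration, so the winning value here says nothing about the Iowa data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae_demo(max_leaf_nodes, X_tr, X_val, y_tr, y_val):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

# Synthetic regression data (hypothetical)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] + rng.normal(0, 1, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

# Map each candidate tree size to its validation MAE, then take the minimum
candidates = [5, 50, 500]
scores = {n: get_mae_demo(n, X_tr, X_val, y_tr, y_val) for n in candidates}
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_size)
```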

# Model Training With Random Forest

In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(n_estimators=1000)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
# The metric computed here is the mean absolute error, so label it as such
print("Mean Absolute Error is:", mean_absolute_error(val_y, melb_preds))

# Reducing the Mean Absolute Error by Using Imputation

In [20]:
import pandas as pd

# Load data
melb_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

y_target = melb_data.SalePrice
y_predictors = melb_data.drop(['SalePrice'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors. 
y_numeric_predictors = y_predictors.select_dtypes(exclude=['object'])

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(y_numeric_predictors, 
                                                    y_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

In [22]:
cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

In [24]:
from sklearn.impute import SimpleImputer

# Work on copies so the original train/test splits stay unchanged
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# Add an indicator column for every column that has missing values
cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
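
To see what the imputer does to individual values, here is a minimal sketch on a hypothetical three-row table. It assumes a scikit-learn version that provides `SimpleImputer` in `sklearn.impute` (0.20 or later); the numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with one missing value in "LotFrontage"
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                    "LotArea": [8000.0, 9000.0, 10000.0]})

# Record which rows were missing before the values are filled in
toy["LotFrontage_was_missing"] = toy["LotFrontage"].isnull()

# The default strategy replaces each missing value with the column mean
imputer = SimpleImputer()
filled = imputer.fit_transform(toy)

# fit_transform returns a NumPy array; the gap is now (60 + 80) / 2 = 70
print(filled)
```

The indicator column survives imputation as 0.0 or 1.0, so the model can still tell which values were originally missing.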

# Using Categorical Data with One Hot Encoding


In [25]:
# Reading the dataset
import pandas as pd
train_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")


#Drop houses where the target is missing
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Since missing values aren't the focus of this tutorial, we use the simplest
# possible approach, which drops these columns. 
# For more detail (and a better approach) to missing values, see
# https://www.kaggle.com/dansbecker/handling-missing-values
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]                                  
print(cols_with_missing)
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

In [26]:
train_predictors.dtypes.sample(10)

In [27]:
one_hot_encoded_training_predictors =  pd.get_dummies(train_predictors)
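
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a hypothetical three-row table. Numeric columns pass through unchanged, and each distinct text value becomes its own indicator column.

```python
import pandas as pd

# Hypothetical table: one numeric column and one text column
toy = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                    "Street": ["Pave", "Grvl", "Pave"]})

encoded = pd.get_dummies(toy)

# The text column is replaced by one column per distinct value
print(list(encoded.columns))
print(encoded)
```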

In [28]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to get a positive MAE; sklearn's convention returns the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50), 
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))
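
The sign flip in `get_mae` exists because scikit-learn scorers follow a higher-is-better convention, so error metrics are reported as negative numbers. A minimal sketch on synthetic data (invented for this example, with an arbitrary choice of 3 folds and 20 trees) makes the raw scores visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(120, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, size=120)

raw_scores = cross_val_score(RandomForestRegressor(n_estimators=20, random_state=0),
                             X_demo, y_demo,
                             cv=3,
                             scoring="neg_mean_absolute_error")
# One negative value per fold
print(raw_scores)

# Negate the mean to recover an ordinary, positive mean absolute error
mae = -1 * raw_scores.mean()
print(mae)
```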

In [30]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left', 
                                                                    axis=1)
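
The effect of `align` with `join='left'` is easiest to see on tiny frames. In this sketch (both tables are hypothetical) the test data contains a category the training data never saw and lacks one the training data has.

```python
import pandas as pd

# Hypothetical encoded frames with partly different categories
train_toy = pd.get_dummies(pd.DataFrame({"Street": ["Pave", "Grvl"]}))
test_toy = pd.get_dummies(pd.DataFrame({"Street": ["Pave", "Dirt"]}))

# join='left' keeps exactly the training columns, in training order
aligned_train, aligned_test = train_toy.align(test_toy, join="left", axis=1)

print(list(aligned_train.columns))
print(list(aligned_test.columns))
# Street_Dirt is dropped from the test frame; Street_Grvl is added as NaN
print(aligned_test)
```

Depending on the model, the columns that `align` fills with NaN may need to be filled (for example with 0) before predicting.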