## My First Jupyter Notebook
This is my first Jupyter notebook.
I want to reproduce the class work from **Kaggle's Intermediate** course for *Machine Learning*.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

Read the Melbourne Housing data, which is available in CSV format. Since we will be predicting the price, we keep the 'Price' column as the target (y) and drop it from the features (X).

In [2]:
data = pd.read_csv("./input/melb_data.csv")
y = data.Price
X = data.drop(['Price'], axis=1)
X.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


Divide the data into training (80%) and validation (20%) subsets.

In [3]:
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
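
As a small check of what the split does, here is a sketch on toy data. The names `toy_X` and `toy_y` and their values are made up for illustration and are not part of the Melbourne dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, one feature and one target
toy_X = pd.DataFrame({"rooms": range(10)})
toy_y = pd.Series(range(100, 110), name="price")

# Same arguments as in the cell above
tX_train, tX_valid, ty_train, ty_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

# 8 rows go to training and 2 to validation
print(tX_train.shape, tX_valid.shape)

# Features and targets are split with the same row indices, so they stay aligned
print(list(tX_train.index) == list(ty_train.index))
```

Fixing `random_state` makes the split reproducible, so the same rows land in the validation set on every run.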

Retain only the categorical columns with low cardinality, plus all numerical columns. *Cardinality* means the number of unique values in a column. Here we select text (object) columns with fewer than 10 unique values and every numerical column.

In [4]:
# Text (object) columns with fewer than 10 unique values
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]
# All integer and float columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
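
To see how the cardinality filter behaves, here is a minimal sketch on a toy frame. The values and the threshold of 4 are made up for illustration. It also swaps the dtype string comparison for `pandas.api.types.is_numeric_dtype`, which is an alternative to the check used above, not the notebook's method.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (hypothetical values): one low-cardinality text column,
# one high-cardinality text column, and one numeric column
toy = pd.DataFrame({
    "Type": ["h", "u", "t", "h", "u", "h"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St", "5 E St", "6 F St"],
    "Rooms": [2, 3, 1, 4, 2, 3],
})

max_cardinality = 4  # illustrative threshold; the notebook uses 10

# Non-numeric columns with fewer unique values than the threshold
low_card_cols = [
    c for c in toy.columns
    if not is_numeric_dtype(toy[c]) and toy[c].nunique() < max_cardinality
]
# All numeric columns, regardless of how many unique values they have
numeric_cols = [c for c in toy.columns if is_numeric_dtype(toy[c])]

print(toy.nunique())   # unique values per column
print(low_card_cols)   # "Address" is dropped: 6 unique values
print(numeric_cols)
```

High-cardinality text columns such as addresses are left out because one-hot encoding them would add roughly one column per unique value.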

We construct the full pipeline in three steps.

**Step 1: Define Preprocessing Steps** <br>
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:<br>

>* imputes missing values in *numerical* data, and
>* imputes missing values and applies a one-hot encoding to categorical data.

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
]) # preprocessing for categorical data

# bundle pre-processing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)
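
The effect of this preprocessor can be inspected on a toy frame. The sketch below rebuilds the same two transformers for a hypothetical two-column dataset (the column values are made up) and prints the transformed array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values); NaN marks a missing entry
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0, 3.0],
    "Type": pd.Series(["h", "h", "u", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Missing numbers are replaced by a constant (0 by default)
        ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
        # Missing categories are replaced by the most frequent value, then one-hot encoded
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Type"]),
    ]
)

out = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on its density; convert if needed
dense = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Columns: Rooms, Type == 'h', Type == 'u'
print(dense)
```

The missing `Rooms` value becomes 0, and the missing `Type` value is filled with `'h'` before encoding, so the last row is encoded as an `'h'` property.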

**Step 2: Define the model**<br>
Next, we define a random forest model with the familiar RandomForestRegressor class.

In [6]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

**Step 3: Create and Evaluate the Pipeline**<br>
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to note about the Pipeline class:
<br>
>* With the pipeline, we preprocess the training data and fit the model in a single line of code.
>* With the pipeline, we supply the unprocessed features in *X_valid* to the *predict()* command, and the pipeline automatically preprocesses the features before generating predictions.
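
The two points above can be checked on toy data. The sketch below compares a pipeline with the equivalent manual steps. All data values are made up, and a single `SimpleImputer` stands in for the full preprocessor to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical values); NaN marks missing entries
toy_train = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0, 1.0, 3.0],
    "Landsize": [200.0, np.nan, 150.0, 400.0, 90.0, 310.0],
})
toy_price = pd.Series([500.0, 650.0, 480.0, 900.0, 300.0, 700.0])
toy_valid = pd.DataFrame({"Rooms": [np.nan, 2.0], "Landsize": [250.0, np.nan]})

imputer = SimpleImputer(strategy="constant")
forest = RandomForestRegressor(n_estimators=10, random_state=0)

# Pipeline route: raw features go in, preprocessing happens inside fit() and predict()
pipe = Pipeline(steps=[("preprocessor", clone(imputer)), ("model", clone(forest))])
pipe.fit(toy_train, toy_price)
pipe_preds = pipe.predict(toy_valid)

# Manual route: preprocess explicitly, then fit and predict
manual_imputer = clone(imputer)
manual_forest = clone(forest)
manual_forest.fit(manual_imputer.fit_transform(toy_train), toy_price)
manual_preds = manual_forest.predict(manual_imputer.transform(toy_valid))

# Both routes give the same predictions
print(np.allclose(pipe_preds, manual_preds))
```

The pipeline route removes the risk of forgetting to apply the fitted imputer to the validation data, which is an easy mistake to make in the manual route.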

In [7]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 160679.18917034855
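
The MAE is the average absolute difference between predicted and actual values, so the score above means the predictions are off by roughly 160,679 on average, in the units of Price. A minimal sketch with made-up prices shows that the metric matches a manual calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, for illustration only
actual = np.array([500_000.0, 750_000.0, 1_200_000.0])
predicted = np.array([520_000.0, 700_000.0, 1_100_000.0])

# Mean of the absolute errors: (20,000 + 50,000 + 100,000) / 3
manual_mae = np.mean(np.abs(actual - predicted))
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```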
