# Machine Learning Stages Lab

### Intro and objectives


### In this lab you will learn:
1. An overview of how machine learning models work and how they are used


## 1. Business Understanding Stage


You work for a real estate company and are tasked with developing an ML-based model to predict real estate values.

Based on interviews with real estate agents, you learn that house prices depend on characteristics such as size, number of rooms, location, and age.


## 2. Data Understanding Stage
### We conduct some basic Exploratory Data Analysis (EDA)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:

# Load data

melbourne_data = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/ML4DS301/main/data/melb_data.csv')
melbourne_data.head(4)

In [None]:
melbourne_data.head(4).T

In [None]:
melbourne_data.info()

In [None]:
melbourne_data.describe()

In [None]:
melbourne_data.groupby('Type').count()

In [None]:
melbourne_data.groupby('Regionname').count()

In [None]:
melbourne_data.columns

In [None]:
qualitative_features=['Rooms','Type','Bedroom2', 'Bathroom', 'Car','Regionname']

In [None]:
# prompt: suppress warnings

import warnings
warnings.filterwarnings('ignore')


In [None]:
for feature in qualitative_features:

  # Create the boxplot without outliers
  sns.boxplot(x=feature, y='Price', data=melbourne_data, showfliers=False)
  plt.xticks(rotation=45)
  plt.show()


In [None]:
# prompt: seaborn scatterplot x-axis landsize y-axis price, log scale for both

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Landsize', y='Price', data=melbourne_data)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Landsize (log scale)')
plt.ylabel('Price (log scale)')
plt.title('Landsize vs. Price (Log Scale)')
plt.show()


## 3. Model Training
### We need to select the features (factors) to be used to train the model
### We follow the Train-Test-Evaluation Method

You will use the scikit-learn library to create your models.

### 3.1. Data Split

In [None]:
# Drop rows with a missing value in any column (not only a missing price)
filtered_melbourne_data = melbourne_data.dropna(axis=0)
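
Calling `dropna(axis=0)` removes every row that has a missing value in any column, which can discard far more rows than filtering on the target alone. The sketch below illustrates the difference on a made-up toy DataFrame; its values and the names `toy`, `all_complete` and `price_known` are illustrative assumptions, not part of the Melbourne dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in different columns
toy = pd.DataFrame({
    'Price': [500000.0, 750000.0, np.nan, 620000.0],
    'BuildingArea': [120.0, np.nan, 95.0, 80.0],
    'YearBuilt': [1990.0, 2005.0, 1975.0, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes rows with a missing value in ANY column
all_complete = toy.dropna(axis=0)

# dropna(subset=[...]) removes only rows missing one of the listed columns
price_known = toy.dropna(subset=['Price'])

print(len(toy), len(all_complete), len(price_known))
```

Running the same `isna().sum()` check on `melbourne_data` before dropping rows would show which columns are responsible for most of the discarded rows.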

In [None]:
# Choose target and features
y = filtered_melbourne_data['Price']
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

In [None]:
# split data into training and test data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)
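
The behavior described in the comments can be checked on a small example. This is a minimal sketch using made-up arrays (`X_toy` and `y_toy` are hypothetical stand-ins for the features and target). It shows that `train_test_split` holds out 25% of the rows when no `test_size` is given, and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target (hypothetical values): 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# With no test_size given, scikit-learn holds out 25% of the rows for testing
a_train_X, a_test_X, a_train_y, a_test_y = train_test_split(
    X_toy, y_toy, random_state=0)
print(a_train_X.shape, a_test_X.shape)

# Reusing the same random_state reproduces the same split
b_train_X, b_test_X, b_train_y, b_test_y = train_test_split(
    X_toy, y_toy, random_state=0)
print(np.array_equal(a_test_y, b_test_y))

# The held-out fraction can also be set explicitly, here 20%
c_train_X, c_test_X, c_train_y, c_test_y = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
print(c_test_X.shape)
```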

In [None]:
train_X

In [None]:
train_y

In [None]:
test_X

In [None]:
test_y

### 3.2. Model Fit
#### We fit our model using the training dataset ONLY




In [None]:
# Define model, in this case we choose a DecisionTreeRegressor()
melbourne_model = DecisionTreeRegressor()

In [None]:
# Fit model
melbourne_model.fit(train_X, train_y)

## 4. Model Evaluation
#### We evaluate the performance of our model using the test dataset ONLY
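
An unconstrained decision tree can memorize its training rows, so its error on the training data says little about how it will do on new houses. The sketch below illustrates this on synthetic data generated with `make_regression`; the data and the `demo_` names are assumptions standing in for the housing features and prices.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): stands in for the housing features and prices
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, noise=20.0, random_state=1)
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    X_demo, y_demo, random_state=1)

# Fit an unconstrained tree on the training split only
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(demo_train_X, demo_train_y)

# Compare the error on rows the model has seen with rows it has not
train_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))
test_mae = mean_absolute_error(demo_test_y, demo_model.predict(demo_test_X))
print(f'MAE on training data: {train_mae:.2f}')
print(f'MAE on test data:     {test_mae:.2f}')
```

The training error is much lower than the test error, which is why only the test error is a fair estimate of performance on unseen data.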

In [None]:
# Get predicted prices on the test data
val_predictions = melbourne_model.predict(test_X)
print(mean_absolute_error(test_y, val_predictions))

The mean absolute error is more than 260K dollars. The average home value in the test data is 1.1 million dollars, so the error on new data is about a quarter of the average home value.
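
The "about a quarter" figure comes from dividing the error by the average price: 260,000 / 1,100,000 ≈ 0.24. As a minimal sketch of how both quantities are computed, the example below uses a handful of made-up prices (the `actual` and `predicted` arrays are hypothetical, not model output) and checks a hand-computed MAE against `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy actual and predicted prices (hypothetical values)
actual = np.array([1000000.0, 800000.0, 1500000.0, 1100000.0])
predicted = np.array([900000.0, 1000000.0, 1400000.0, 1500000.0])

# MAE is the mean of the absolute differences between actual and predicted
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

# Relative error: MAE as a fraction of the mean actual price
relative_error = mae_manual / actual.mean()
print(mae_manual, mae_sklearn, round(relative_error, 3))
```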

There are many ways to improve this model, such as experimenting to find better features or different model types.
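
One way to run such an experiment is to fit several candidate models on the same training split and compare their test errors. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing data. The candidate list, including `max_leaf_nodes=50` and the random forest, is an illustrative assumption and not a recommendation for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features (assumption)
X_syn, y_syn = make_regression(
    n_samples=500, n_features=7, noise=10.0, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(X_syn, y_syn, random_state=0)

# Candidate models: trees of different sizes and a random forest
candidates = {
    'tree (unlimited)': DecisionTreeRegressor(random_state=0),
    'tree (50 leaves)': DecisionTreeRegressor(max_leaf_nodes=50, random_state=0),
    'random forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the test split
scores = {}
for name, model in candidates.items():
    model.fit(tr_X, tr_y)
    scores[name] = mean_absolute_error(te_y, model.predict(te_X))

for name, mae in scores.items():
    print(f'{name}: MAE = {mae:.1f}')
```

The same loop could be applied to `train_X`, `train_y`, `test_X` and `test_y` from the lab; which candidate performs best there has to be measured and cannot be assumed from this synthetic example.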

## 5. Deployment
#### Once we are satisfied with the performance of our model we are ready to use it to predict house prices

Let's assume that we have three new houses.

In [48]:
new_houses=X.sample(3)
new_houses

| | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| 3396 | 5 | 3.0 | 1006.0 | 411.0 | 1920.0 | -37.7765 | 145.0427 |
| 10414 | 3 | 1.0 | 396.0 | 108.0 | 1955.0 | -37.71665 | 145.11682 |
| 8249 | 2 | 1.0 | 0.0 | 59.0 | 1970.0 | -37.8538 | 145.0088 |


In [49]:
print("Making predictions for houses:")
print("The predictions are")
melbourne_model.predict(new_houses)

Making predictions for houses:
The predictions are


array([2600000.,  806000.,  630000.])