In [1]:
import pandas as pd
# save filepath to variable for easier access
melbourne_file_path = './input/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns


Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [2]:
melbourne_data = melbourne_data.dropna(axis=0)

Selecting The Prediction Target

We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:

In [7]:
y = melbourne_data.Price
# shape of the whole DataFrame (rows, columns), not of the target alone
data_shape = melbourne_data.shape
print(y)
print(data_shape)

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64
(6196, 21)


As we can see above, the full DataFrame has shape (6196, 21), meaning 6196 rows and 21 columns. The target y is a single column of that table: a Series with 6196 values.
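
Dot notation is shorthand for selecting a column with brackets, and the selected target is a one-dimensional Series rather than a table. The sketch below uses a small made-up DataFrame (the name toy_data and its values are illustrative assumptions, not taken from the Melbourne file) to show both points.

```python
import pandas as pd

# Hypothetical toy data standing in for the Melbourne DataFrame
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_dot = toy_data.Price         # dot notation
y_bracket = toy_data["Price"]  # equivalent bracket notation

print(y_dot.equals(y_bracket))  # the two selections hold the same values
print(toy_data.shape)           # (rows, columns) of the whole table
print(y_dot.shape)              # (rows,) for the single target column
```

Bracket notation is the one to use when a column name contains spaces or clashes with an existing DataFrame attribute.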


Choosing "Features"
The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

X is the feature matrix, with 5 features (columns) and 6196 rows.
To preview it, we call X.head(), which returns the first 5 rows when no number is given.

In [13]:
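melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X_shape = X.shape
# A notebook cell displays only its last expression, so the describe()
# summary is computed but not shown; only the head() preview appears
X.describe()
X.head()
# print(X_shape)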

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


Building a model and making a prediction using scikit-learn

scikit-learn estimators implement two core methods, fit() and predict().

fit()
 - Learns the model's parameters from the training data (for example, the coefficients exposed through the coef_ attribute of a linear regression, or the split rules of a decision tree) and stores them on the model object.

predict()
 - Applies the parameters learned by fit() to new data, such as a test set, to make predictions.
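
The two steps can be sketched on a tiny made-up data set. The names X_train, y_train, X_new and toy_model, and all values, are assumptions chosen for illustration; only the scikit-learn calls themselves come from the library.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy training data (values are made up for illustration)
X_train = pd.DataFrame({"Rooms": [2, 3, 4, 5], "Bathroom": [1.0, 1.0, 2.0, 3.0]})
y_train = pd.Series([500000.0, 650000.0, 900000.0, 1200000.0])

toy_model = DecisionTreeRegressor(random_state=0)

# fit() learns from the training data and stores the result on the model;
# scikit-learn marks learned attributes with a trailing underscore
toy_model.fit(X_train, y_train)
print(toy_model.n_features_in_)  # number of feature columns seen by fit()
print(toy_model.get_n_leaves())  # number of leaves in the learned tree

# predict() applies the learned tree to rows that were not in the training data
X_new = pd.DataFrame({"Rooms": [3, 6], "Bathroom": [2.0, 3.0]})
predictions = toy_model.predict(X_new)
print(predictions)
```

Calling predict() before fit() raises an error, because there is no learned state to apply yet.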

In [23]:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)


Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
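
A minimal sketch of the reproducibility point, using synthetic data generated only for this illustration (X_demo, y_demo, X_unseen and the two model names are assumptions, not part of the Melbourne example): two models trained on the same data with the same random_state produce the same predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, generated only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = rng.random(50)
X_unseen = rng.random((20, 3))

# Same data and same random_state for both models
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

# The two trees make identical predictions on rows neither has seen
same = np.array_equal(model_a.predict(X_unseen), model_b.predict(X_unseen))
print(same)
```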

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [28]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
print("The target values are")
melbourne_features_price = melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude', 'Price']]
print(melbourne_features_price.head())

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
The target values are
   Rooms  Bathroom  Landsize  Lattitude  Longtitude      Price
1      2       1.0     156.0   -37.8079    144.9934  1035000.0
2      3       2.0     134.0   -37.8093    144.9944  1465000.0
4      4       1.0     120.0   -37.8072    144.9941  1600000.0
6      3       2.0     245.0   -37.8024    144.9993  1876000.0
7      2       1.0     256.0   -37.8060    144.9954  1636000.0


As we can see above, the model's predictions match the target values exactly. For example, for the first row, the target value is 1035000.0 and the prediction is also 1035000.

This does not show that the model is accurate; it points to an overfitting problem. Because we trained on the entire data set and then predicted on rows from that same data, the tree can simply reproduce prices it has already seen. We should have held some data out of training to check whether the model generalizes to houses it has not seen.
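
One common way to hold data out is scikit-learn's train_test_split. The sketch below is not part of the original notebook: it uses synthetic data in place of X and y, and mean absolute error is chosen here only as one possible way to compare the two sets. All variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for X and y
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.5, size=200)

# Hold out a quarter of the rows (train_test_split's default) for validation
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on rows the model was trained on versus rows it has never seen
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print("Training MAE:", train_mae)
print("Validation MAE:", val_mae)
```

A training error near zero next to a clearly larger validation error is the pattern that the matching predictions above were hiding.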