# Basic data exploration
## Pandas and the Melbourne housing prices dataset
The original tutorial can be found at [kaggle.com](https://www.kaggle.com/learn/intro-to-machine-learning):
- [Part 1: Basic Data Exploration](https://www.kaggle.com/dansbecker/basic-data-exploration) by [Dan Becker](https://www.kaggle.com/dansbecker)
- [Part 2: Your First Machine Learning Model](https://www.kaggle.com/dansbecker/your-first-machine-learning-model) by [Dan Becker](https://www.kaggle.com/dansbecker)


In [None]:
# import pandas and alias it as `pd`
import pandas as pd

In [None]:
# the path of a CSV file relative to the executed notebook
file_path = 'melbourne-housing-market/Melbourne_housing_FULL.csv'

In [None]:
# load the CSV data into a `pandas.core.frame.DataFrame`
housing_data = pd.read_csv(file_path)

### DataFrame#describe

[API: pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

In [None]:
housing_data.describe()

### DataFrame#columns
[API: pandas.DataFrame.columns](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html#pandas.DataFrame.columns)

In [None]:
housing_data.columns

The Melbourne data has some missing values: for some houses, some variables weren't recorded.
We'll learn to handle missing values in a later tutorial.  
(The Iowa data used in the original Kaggle exercises has no missing values in the columns used there.)
So we will take the simplest option for now and drop every house that has a missing value.

### DataFrame#dropna

[API: pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [None]:
housing_data = housing_data.dropna(axis=0)

In [None]:
housing_data.describe()
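
To see what `dropna(axis=0)` removes, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are hypothetical and not taken from the Melbourne file; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from the Melbourne CSV
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, np.nan, 900000.0, 750000.0],
    "Landsize": [150.0, 200.0, np.nan, 320.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every ROW that contains at least one missing value
toy_clean = toy.dropna(axis=0)
print(len(toy), "rows before,", len(toy_clean), "rows after")
# prints: 4 rows before, 2 rows after
```

Running `isna().sum()` before dropping is a cheap way to check how much data the simple option throws away.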

### Selecting The Prediction Target

We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called ***y***. So the code we need to save the house prices in the Melbourne data is:

In [None]:
y = housing_data.Price

### Choosing "Features"

The columns that are fed into our model (and later used to make predictions) are called "features." In our case, those are the columns used to determine the home price. Sometimes you will use all columns except the target as features. Other times you'll be better off with fewer features.

By convention, this data is called ***X***.

In [None]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

In [None]:
X = housing_data[features]
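
The two selection styles used above return different types. The sketch below illustrates this on a made-up frame; `toy` and its values are hypothetical. Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so bracket notation is the safer general form.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Melbourne CSV
toy = pd.DataFrame({
    "Rooms": [2, 3],
    "Bathroom": [1, 2],
    "Price": [500000, 750000],
})

# A single column selected by dot or bracket notation is a Series
y_dot = toy.Price
y_bracket = toy["Price"]

# A list of column names selects a DataFrame, even for one column
X_toy = toy[["Rooms", "Bathroom"]]

print(type(y_dot).__name__, type(X_toy).__name__)
# prints: Series DataFrame
```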

Let's quickly review the data we'll be using to predict house prices using the [`pandas.DataFrame.describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method and the [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method, which shows the top few rows.

In [None]:
X.describe()

In [None]:
X.head()

Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.

### Building Your Model

We'll use [scikit-learn](https://scikit-learn.org/stable/) to create models from the data stored in our DataFrames. The steps to build and use a model are:
- ***Define***: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model type are specified here too.
- ***Fit***: Capture patterns from the provided data. This is the heart of modeling.
- ***Predict***: Just what it sounds like.
- ***Evaluate***: Determine how accurate the model's predictions are.
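
The four steps can be sketched end to end on a tiny made-up dataset. The names `X_toy` and `y_toy` and their values are hypothetical. The model is evaluated on its own training rows here only to keep the sketch short, which is why the error comes out as zero.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, four houses
X_toy = pd.DataFrame({"Rooms": [1, 2, 3, 4]})
y_toy = pd.Series([100.0, 200.0, 300.0, 400.0])

# Define: choose the model type and its parameters
model = DecisionTreeRegressor(random_state=1)

# Fit: learn patterns from the features and the target
model.fit(X_toy, y_toy)

# Predict: ask the fitted model for prices
predictions = model.predict(X_toy)

# Evaluate: compare the predictions with the known prices
mae = mean_absolute_error(y_toy, predictions)
print(mae)  # 0.0, because an unrestricted tree memorizes its training rows
```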

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
housing_model = DecisionTreeRegressor(random_state=1)

# Fit model
housing_model.fit(X, y)

Making predictions for the following 5 houses:

In [None]:
X.head()

Running the prediction on those rows:

In [None]:
housing_model.predict(X.head())

### Model Validation

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:

```
error = actual − predicted
```
So, if a house cost \\$150,000 and you predicted it would cost \\$100,000, the error is \\$50,000.
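
Mean Absolute Error takes the absolute value of each error and averages them. A minimal sketch with made-up prices (the `actual` and `predicted` arrays are hypothetical, with the first house matching the example above) shows that computing it by hand agrees with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices for three houses
actual = np.array([150000.0, 200000.0, 320000.0])
predicted = np.array([100000.0, 230000.0, 300000.0])

# error = actual - predicted, one value per house
errors = actual - predicted  # 50000, -30000, 20000

# Absolute values stop positive and negative errors from cancelling out
mae_by_hand = np.abs(errors).mean()  # (50000 + 30000 + 20000) / 3

mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)
```

Read in plain English, an MAE of about 33,333 means the predictions are off by roughly that many dollars on average.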

### sklearn.metrics.mean_absolute_error

[API: sklearn.metrics.mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html)

In [85]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, housing_model.predict(X))

897.2844229398746

In [86]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
training_data_X, validation_data_X, training_data_y, validation_data_y = train_test_split(X, y, random_state = 0)

# Define model. A fixed random_state keeps the fitted tree the same each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(training_data_X, training_data_y)

# get predicted prices on validation data
predictions = melbourne_model.predict(validation_data_X)

In [None]:
mean_absolute_error(validation_data_y, predictions)
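
The reason for holding out validation data is that an error measured on the training rows themselves tends to be far too optimistic. The sketch below illustrates this on synthetic data. The data-generating formula, the noise level and all variable names are assumptions made for illustration; nothing here comes from the Melbourne file.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a noisy linear relationship
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=200)

# Hold out part of the data; the model never sees it during fitting
train_X, val_X, train_y, val_y = train_test_split(X_syn, y_syn, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# In-sample error: measured on the rows the tree was fitted on
train_mae = mean_absolute_error(train_y, model.predict(train_X))

# Validation error: measured on rows the tree has never seen
val_mae = mean_absolute_error(val_y, model.predict(val_X))

print("training MAE:  ", train_mae)
print("validation MAE:", val_mae)
```

An unrestricted decision tree can memorize its training rows, noise included, so the training MAE is close to zero while the validation MAE reflects how the model behaves on new houses.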