# Basic Workflow of ML

This document walks through a basic machine learning workflow.  
It follows the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) on Kaggle.  

---

## Data processing

More details are shown in [Pandas_tutorial.ipynb](Pandas_tutorial.ipynb).

- import data  

In [None]:
import pandas as pd
# save filepath to variable for easier access
file_path = 'train.csv'
# read the data and store it in a DataFrame named data
data = pd.read_csv(file_path)

- view data

In [None]:
# print a summary of the data
data.describe()

In [None]:
# print first 5 rows of the data
data.head()

In [None]:
# print last 3 rows of the data
data.tail(3)

In [None]:
# print the index
data.index

In [None]:
# print the column labels
data.columns

- transform

In [None]:
# transpose the DataFrame (swap rows and columns)
data.T

In [None]:
# convert the DataFrame to a NumPy array
data.to_numpy()

In [None]:
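# axis=1 sorts by the column labels, axis=0 sorts by the row index
# (in a notebook only the last expression of a cell is displayed)
data.sort_index(axis=1, ascending=False)
data.sort_index(axis=0, ascending=False)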

In [None]:
# sort the rows by the values in the price_range column, largest first
data.sort_values(by="price_range", axis=0, ascending=False)

- slice data

In [None]:
X = data[3:5]   # rows at positions 3 and 4 (the 4th and 5th rows)
y = data["price_range"]   # a single column, selected by label

In [None]:
# one column
# usually for prediction target
y = data.price_range

In [None]:
# several columns
# usually for features
features = ['battery_power', 'clock_speed', 'four_g', 'ram']
X = data[features]
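
The slicing forms above behave differently: a plain `[3:5]` slice selects rows by position, a single label selects one column as a Series, and a list of labels selects a DataFrame. A minimal sketch on a made-up frame (the `toy` data is an illustrative assumption, not the contents of `train.csv`):

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the ones used above.
toy = pd.DataFrame({
    "ram": [10, 20, 30, 40, 50, 60],
    "price_range": [0, 1, 2, 3, 0, 1],
})

# An integer slice selects rows by position, like .iloc.
rows = toy[3:5]
same_rows = toy.iloc[3:5]

# A single label returns a Series; a list of labels returns a DataFrame.
target = toy["price_range"]
subset = toy[["ram"]]

print(type(target).__name__, type(subset).__name__)
```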

- data cleaning

In [None]:
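# drop every row that contains a missing value
# note: run this before selecting X and y so that they come from the cleaned data
data = data.dropna(axis=0)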
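
Before dropping rows, it can help to check how many values are missing in each column. A minimal sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not the contents of `train.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values.
toy = pd.DataFrame({
    "battery_power": [842, np.nan, 563, 615],
    "ram": [2549, 2631, np.nan, 2769],
    "price_range": [1, 2, 2, 2],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
cleaned = toy.dropna(axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```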

## Training model

A decision tree is used here as an example.

- split into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
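
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. The sketch below shows the resulting sizes on made-up arrays (`X_toy` and `y_toy` are illustrative assumptions) and how an explicit `test_size` changes them:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 8 samples with 2 features each.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# Default split: 75% of the rows for training, 25% for validation.
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=1)
print(tr_X.shape, va_X.shape)

# An explicit test_size keeps half of the rows for validation.
tr_X2, va_X2, tr_y2, va_y2 = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=1
)
print(tr_X2.shape, va_X2.shape)
```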

- train the model

In [None]:
from sklearn.tree import DecisionTreeRegressor
# Specify the model.
# For reproducibility, set a numeric value for random_state when specifying the model
model = DecisionTreeRegressor(random_state=3)

# Fit the model
model.fit(train_X, train_y)

- validation

In [None]:
from sklearn.metrics import mean_absolute_error
# Make validation predictions and calculate mean absolute error
val_predictions = model.predict(val_X)
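# scikit-learn metrics take the true values first, then the predictions
val_mae = mean_absolute_error(val_y, val_predictions)
# two decimals, since price_range values are small
print("Validation MAE: {:,.2f}".format(val_mae))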
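
Mean absolute error is the average of the absolute differences between the true values and the predictions. A minimal sketch in plain Python, using made-up numbers purely for illustration:

```python
# Hypothetical true values and predictions, chosen only for illustration.
y_true = [1, 2, 3, 0]
y_pred = [1, 3, 1, 0]

# MAE is the mean of |y_true - y_pred| over all samples.
absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(absolute_errors) / len(absolute_errors)
print(mae)
```

A lower value means the predictions are closer to the true values on average.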

- tune parameters

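For a decision tree, max_leaf_nodes is a key hyperparameter: it caps the number of leaves, so a small value tends to underfit and a large value tends to overfit.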

In [None]:
# fit a tree with the given max_leaf_nodes and return its validation MAE
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [None]:
# compare tree sizes to avoid underfitting and overfitting
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500, 750, 1000]
# loop over the candidates and keep the tree size with the lowest validation MAE
min_mae = float("inf")
best_tree_size = None
for cmln in candidate_max_leaf_nodes:
    temp_mae = get_mae(cmln, train_X, val_X, train_y, val_y)
    if temp_mae < min_mae:
        min_mae = temp_mae
        best_tree_size = cmln
best_tree_size
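
The same selection can be written more compactly by collecting one score per candidate and taking the minimum. A sketch with made-up MAE values standing in for the results of `get_mae` (the numbers are illustrative assumptions, not measured results):

```python
# Hypothetical validation MAE per max_leaf_nodes value (illustrative only).
# In the notebook each value would come from
# get_mae(size, train_X, val_X, train_y, val_y).
scores = {5: 0.61, 25: 0.42, 50: 0.38, 100: 0.40, 250: 0.45}

# Pick the key (tree size) whose score (MAE) is smallest.
best_size = min(scores, key=scores.get)
print(best_size)
```

Keeping all scores in a dictionary also makes it easy to inspect how the error changes with tree size.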

- fit model using all data

In [None]:
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# fit the final model on all of the data
final_model.fit(X, y)
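
Once fitted, the model can predict the target for new rows that have the same feature columns. A minimal sketch on made-up arrays (`X_toy`, `y_toy`, `new_rows`, and `toy_model` are illustrative assumptions and stand in for the real data and `final_model`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: one feature, four samples.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# Fit a small tree on all of the toy data.
toy_model = DecisionTreeRegressor(max_leaf_nodes=4, random_state=1)
toy_model.fit(X_toy, y_toy)

# Predict for new rows; the input must be 2-D with the same number of features.
new_rows = np.array([[1.0], [4.0]])
preds = toy_model.predict(new_rows)
print(preds)
```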