# Classification

- Inputs: numerical, categorical
- Targets: categorical values only (class labels)
- Intuition: The model builds "category boxes" from the training data and places each new sample into one of them.
- Decision boundary: The boundary in feature space that separates one class from another.
- Common models:
    - `from sklearn.tree import DecisionTreeClassifier`
    - `from sklearn.linear_model import LogisticRegression`
    - `from sklearn.svm import SVC`
    - `from sklearn.ensemble import RandomForestClassifier`
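
All four classifiers listed above share the same `fit` / `predict` interface, so they can be swapped freely. The following is a minimal sketch on a made-up toy dataset (the feature values, the labels `"small"` / `"large"`, and the hyper-parameters are assumptions chosen for illustration only):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [height_cm, weight_kg] -> size label
X = [[150, 50], [160, 55], [180, 85], [190, 95]]
y = ["small", "small", "large", "large"]
X_new = [[155, 52], [185, 90]]

# Illustrative settings, not tuned values
models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
    "svc": SVC(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                                  # learn the "category boxes"
    predictions[name] = list(model.predict(X_new))   # place new samples in a box
    print(name, predictions[name])
```

Each model draws its decision boundary differently, so predictions can disagree near the boundary even though the calling code is identical.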

# Regression

- Inputs: numerical, categorical
- Targets: numerical (continuous) values only
- Intuition: Estimating the best continuous value for a new sample based on training data.
- Line / curve fitting: The model fits a line (or curve) that passes as close to the data points as possible.
- Common models:
    - `from sklearn.linear_model import LinearRegression`
    - `from sklearn.linear_model import Lasso`
    - `from sklearn.linear_model import Ridge`
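
As a rough illustration of line fitting, the sketch below fits the three models above to made-up points lying exactly on `y = 2x + 1`. The data and the `alpha` values are assumptions for the example; `Ridge` and `Lasso` add a penalty that shrinks the slope slightly compared with plain `LinearRegression`.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Hypothetical data on the line y = 2x + 1
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty on the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty on the coefficients

print("linear slope:", linear.coef_[0], "intercept:", linear.intercept_)
print("ridge slope: ", ridge.coef_[0])
print("lasso slope: ", lasso.coef_[0])

# Predict a continuous value for an unseen input
prediction = linear.predict([[5.0]])[0]
print("prediction at x=5:", prediction)
```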

# Training and evaluating a Classification Model

- Never judge a model on its own training data.
- Hold-out method: before training, set aside a portion of the dataset (40% here).
- Train the model on the remaining 60% of the data.
- Measure the quality of the model on the held-out 40%, which it has never seen.
- Use `from sklearn.model_selection import train_test_split` for splitting.
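
The hold-out steps above can be sketched as follows. The synthetic dataset from `make_classification`, the sample count, and the `random_state` values are assumptions used only to make the example self-contained. An unrestricted decision tree memorises its training data, which is why the training score says little about real performance.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=200, random_state=0)

# Hold out 40% of the data before any training happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score on data the model has seen vs. data it has not seen
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print("train size:", len(X_train), "test size:", len(X_test))
print("accuracy on training data:", train_accuracy)
print("accuracy on held-out data:", test_accuracy)
```

Only the held-out accuracy is an honest estimate of how the model behaves on new data.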

In [2]:
# # Import necessary packages
# from sklearn.ensemble import RandomForestClassifier

# # Use a custom model configuration/hyper-parameters
# model = RandomForestClassifier(n_estimators=500, max_depth=20)

# # Start the training procedure
# model.fit(X_train, y_train)

# # Generate Predictions
# y_predicted = model.predict(X_test)

# # Compare predictions with the true labels: is y_predicted == y_test?
# from sklearn.metrics import confusion_matrix
# confusion_matrix(y_test, y_predicted)

# # Compute each metric (for multiclass targets, precision and recall also need an `average=` argument)
# from sklearn.metrics import accuracy_score, precision_score, recall_score
# accuracy_score(y_test, y_predicted)
# precision_score(y_test, y_predicted)
# recall_score(y_test, y_predicted)

# Confusion Matrix for Classification

<center><img src="images/01.jpg"  style="width: 200px; height: 200px;"/></center>
<center><img src="images/02.jpg"  style="width: 200px; height: 200px;"/></center>
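
To make the four cells of a confusion matrix concrete, here is a small hand computation in plain Python. It assumes a binary problem where `1` is the positive class; the label lists are made up for illustration. For labels `0` and `1`, `sklearn.metrics.confusion_matrix` lays out the same counts as `[[TN, FP], [FN, TP]]`.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```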


# Training and evaluating a Regression Model


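- Regression fits a line or surface closely to the data instead of separating it into regions. It can capture non-linear relationships by expanding the inputs into polynomial terms:
    - Input features: (a, b)
    - Expanded features (degree 2): (1, a, b, a^2, a*b, b^2)
    ```
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree=2)  # (a, b) -> (1, a, b, a^2, a*b, b^2)
    polynomial_X_train = poly.fit_transform(X_train)
    model = LinearRegression()  # assumed here; any regressor works
    model.fit(polynomial_X_train, y_train)
    ```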
- Hold-out method: before training, set aside a portion of the dataset (40% here).
- Train the model on the remaining 60% of the data.
- Measure the quality of the model on the held-out 40%, which it has never seen, using Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).
- Use `from sklearn.model_selection import train_test_split` for splitting.
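
Putting the regression steps together, the sketch below generates a made-up quadratic target (the coefficients, sample count, and seeds are assumptions), holds out 40% of it, and compares a plain linear fit with a fit on degree-2 polynomial features. Note that the polynomial expansion is fitted on the training set and only applied (`transform`) to the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data: y is a degree-2 polynomial of (a, b)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
a, b = X[:, 0], X[:, 1]
y = 1.0 + 2.0 * a - b + 0.5 * a**2 + a * b

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Expand (a, b) into (1, a, b, a^2, a*b, b^2)
poly = PolynomialFeatures(degree=2)
polynomial_X_train = poly.fit_transform(X_train)
polynomial_X_test = poly.transform(X_test)  # reuse the fitted expansion

model = LinearRegression()
model.fit(polynomial_X_train, y_train)
y_predicted = model.predict(polynomial_X_test)

mae = mean_absolute_error(y_test, y_predicted)
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))

# Baseline: a straight fit without polynomial features
baseline = LinearRegression().fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

print("polynomial MAE:", mae, "RMSE:", rmse)
print("linear-only MAE:", baseline_mae)
```

Because the target here is exactly quadratic, the polynomial model's held-out error is close to zero, while the linear-only baseline cannot follow the curvature.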

# Error Metric for Regression (MAE or RMSE)


- The error is a continuous value.
- There will always be some error; what we measure is how much.
- Averaging the absolute values of the errors gives the Mean Absolute Error (MAE). Without the absolute value, positive and negative errors would cancel each other out.
- If the errors contain spikes (outliers), Median Absolute Error is a more robust choice.
- Alternatively, square every error, take the mean, and then take the square root. This is the Root Mean Squared Error (RMSE).
- The R^2 score is a simple, unitless measure of goodness of fit.


<center><img src="images/03.jpg"  style="width: 200px; height: 200px;"/></center>
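
The metrics described above can be worked through by hand. The sketch below uses only the Python standard library and four made-up value pairs, chosen so that the raw errors cancel to zero while the absolute and squared errors do not.

```python
import math
import statistics

# Hypothetical true values and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [1.0, 7.0, 1.0, 8.0]

errors = [p - t for t, p in zip(y_true, y_pred)]   # -2, 2, -1, 1
abs_errors = [abs(e) for e in errors]

mean_error = statistics.mean(errors)               # signs cancel out
mae = statistics.mean(abs_errors)                  # Mean Absolute Error
median_ae = statistics.median(abs_errors)          # Median Absolute Error
rmse = math.sqrt(statistics.mean(e * e for e in errors))  # square, mean, root

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_true = statistics.mean(y_true)
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1.0 - ss_res / ss_tot

print("mean error:", mean_error)
print("MAE:", mae, "median AE:", median_ae)
print("RMSE:", rmse)
print("R^2:", r2)
```

MAE, median absolute error, and RMSE are in the same units as the target and cannot be negative; R^2 is unitless, is at most 1, and can be negative for a model that does worse than always predicting the mean.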


In [4]:
# # Import necessary packages
# from sklearn.preprocessing import PolynomialFeatures

# # Use a custom model configuration/hyper-parameters
# poly = PolynomialFeatures(degree=2)
# polynomial_X_train = poly.fit_transform(X_train)

# # Start the training procedure
# model.fit(polynomial_X_train, y_train)

# # Generate Predictions (reuse the expansion fitted on the training data; do not refit on the test set)
# polynomial_X_test = poly.transform(X_test)
# y_predicted = model.predict(polynomial_X_test)

# # Compare predictions with original data: how far is each point from the regression line?
# # Mean absolute error; range: [0..+Inf), lower is better
# from sklearn.metrics import mean_absolute_error
# # Median absolute error; range: [0..+Inf), lower is better
# from sklearn.metrics import median_absolute_error
# # R^2 (coefficient of determination); range: (-Inf..1], higher is better
# from sklearn.metrics import r2_score

# # Estimate each metric
# mean_absolute_error(y_test, y_predicted)
# median_absolute_error(y_test, y_predicted)
# r2_score(y_test, y_predicted)