# Classification

Classification algorithms are ML algorithms used to predict a categorical response column. When the response column has more than two categories, we use the term 'Multiclass Classification'. Examples include predicting a flower's species from its characteristics or predicting telco customer churn.

To understand how to create a classification model, let's predict the flower species using the Iris dataset. 

First, let's import the Random Forest Classifier.

In [53]:
from vertica_ml_python.learn.ensemble import RandomForestClassifier

Next, we create a model object. As Vertica has its own model management system, we need to choose a model name in addition to the other parameters.

In [54]:
model = RandomForestClassifier("RF_Iris")

We can then fit the model on the iris table, using two predictors and Species as the response column.

In [55]:
model.fit("iris", ["PetalLengthCm", "SepalLengthCm"], "Species")



call_string
SELECT rf_classifier('public.RF_Iris', 'iris', '"species"', '"PetalLengthCm", "SepalLengthCm"' USING PARAMETERS exclude_columns='', ntree=10, mtry=1, sampling_size=0.632, max_depth=5, max_breadth=1000000000, min_leaf_size=1, min_info_gain=0, nbins=32);

details
  predictor  |      type      
-------------+----------------
petallengthcm|float or numeric
sepallengthcm|float or numeric


Additional Info
       Name       |Value
------------------+-----
    tree_count    | 10  
rejected_row_count|  0  
accepted_row_count| 150 

To evaluate the model, we can use different metrics.

In [56]:
model.classification_report()

   metric   |   Iris-setosa    | Iris-versicolor  |  Iris-virginica  
------------+------------------+------------------+------------------
auc         |1.0               |0.9956000000000004|0.9968000000000006
prc_auc     |1.0               |0.992143797360809 |0.9936004843289119
accuracy    |1.0               |0.9666666666666667|0.9733333333333334
log_loss    |0.0126536049390048|0.0404264391589519|0.0352071123419837
precision   |1.0               |0.96              |0.98
recall      |1.0               |0.9411764705882353|0.9423076923076923
f1_score    |1.0               |0.9600989792762139|0.9654682104407882
mcc         |1.0               |0.9254762227411247|0.9410092614535137
informedness|1.0               |0.9209744503862152|0.9321036106750391



We did not split the data into training and test sets, which would give a more relevant evaluation; the purpose here is only to show the metrics available for evaluating a classification. The best known is accuracy: the closer it is to 1, the better. However, a poorly chosen metric can lead to a wrong interpretation.
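To make the per-class metrics more concrete, here is a minimal sketch in plain Python that derives several of them from one-vs-rest confusion counts. It uses the standard textbook definitions, which are not necessarily identical to what the library computes internally. The labels and the helper names (`one_vs_rest_counts`, `binary_metrics`) are hypothetical and unrelated to the report above.

```python
from math import sqrt

def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN when `positive` is treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def binary_metrics(tp, fp, fn, tn):
    """Textbook metric definitions; a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "mcc": mcc,
        "informedness": recall + specificity - 1,
    }

# Hypothetical labels: 4 flowers per species, 2 of them misclassified.
y_true = ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4
y_pred = (["setosa"] * 4
          + ["versicolor"] * 3 + ["virginica"]
          + ["virginica"] * 3 + ["versicolor"])

for species in ("setosa", "versicolor", "virginica"):
    counts = one_vs_rest_counts(y_true, y_pred, species)
    print(species, binary_metrics(*counts))
```

Each class is scored as its own binary problem (that class against all the others), which is why a multiclass report has one column per category.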

Consider the example of detecting bank fraud. Because fraud is rare, it may represent less than 1% of the data. A model predicting that no transaction is fraudulent would then reach more than 99% accuracy while detecting no fraud at all. That's why ROC AUC and PRC AUC are more robust metrics.
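The fraud scenario can be checked with a few lines of plain Python. This is a sketch on made-up data (1,000 transactions, 5 of them fraudulent); the `roc_auc` helper is a hypothetical, brute-force implementation of the pairwise definition of ROC AUC, not a library function.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count for one half). Brute force, fine for small examples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = fraud, 0 = legitimate (0.5% fraud).
y_true = [1] * 5 + [0] * 995

# A "model" that always answers "not fraud" with the same score.
y_pred = [0] * len(y_true)
scores = [0.0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_recall = frauds_caught / sum(y_true)
auc = roc_auc(y_true, scores)

print(f"accuracy={accuracy}, fraud recall={fraud_recall}, ROC AUC={auc}")
```

The accuracy looks excellent, yet the recall on the fraud class is zero and the ROC AUC is no better than that of a random model, which is what makes AUC the more informative metric here.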

Besides, a good model is one that solves the business problem. Most of the time, we consider any model that performs better than a random model to be good.

In the next lesson, you'll learn how to build unsupervised models.