# Building a Classification Model for the Breast Cancer data set

Chanin Nantasenamat

<i>Data Professor YouTube channel, http://youtube.com/dataprofessor </i>

In this Jupyter notebook, we will build a classification model for the breast cancer data set that ships with scikit-learn, using the random forest algorithm.

## 1. Import libraries

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

## 2. Load the *breast_cancer* data set [https://scikit-learn.org/1.6/datasets/toy_dataset.html#]

In [14]:
breast_cancer = datasets.load_breast_cancer()

In [15]:
breast_cancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

## 3. Input and output features
The ***breast_cancer*** data set contains 30 input features and 1 output variable (the class label).

### 3.1. Input features

In [22]:
print(breast_cancer.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


### 3.2. Output variable

In [23]:
print(breast_cancer.target_names)

['malignant' 'benign']


In [24]:
print(breast_cancer.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

## 4. Glimpse of the data

### 4.1. Input features

In [25]:
breast_cancer.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

### 4.2. Output variable (the class label)

In [26]:
breast_cancer.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

### 4.3. Assigning *input* and *output* variables
Let's assign the 30 input variables to X and the output variable (the class label) to Y.

In [27]:
X = breast_cancer.data
Y = breast_cancer.target

### 4.4. Let's examine the data dimensions

In [34]:
X.shape

(569, 30)

In [35]:
Y.shape

(569,)

## 5. Build Classification Model using Random Forest

In [36]:
clf = RandomForestClassifier()

In [37]:
clf.fit(X, Y)

## 6. Feature Importance

In [38]:
print(clf.feature_importances_)

[0.03513216 0.01585052 0.0186254  0.02670663 0.00605895 0.00789894
 0.04515099 0.1049147  0.00337675 0.00486425 0.00570774 0.00325288
 0.01496999 0.01766458 0.00356438 0.00399239 0.00960841 0.00326919
 0.00411958 0.00443326 0.11656498 0.02730529 0.10295829 0.16711184
 0.0147452  0.00857458 0.04993573 0.15610176 0.01194578 0.00559486]
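The raw array is easier to read once each value is paired with its feature name. The sketch below ranks the features using the names and importances copied from the outputs printed in this notebook; with a live model, the same ranking could be obtained from `breast_cancer.feature_names` and `clf.feature_importances_`. Because `RandomForestClassifier()` was created without a `random_state`, the exact values will differ from run to run.

```python
# Feature names and importances copied from the outputs printed above.
feature_names = [
    "mean radius", "mean texture", "mean perimeter", "mean area",
    "mean smoothness", "mean compactness", "mean concavity",
    "mean concave points", "mean symmetry", "mean fractal dimension",
    "radius error", "texture error", "perimeter error", "area error",
    "smoothness error", "compactness error", "concavity error",
    "concave points error", "symmetry error", "fractal dimension error",
    "worst radius", "worst texture", "worst perimeter", "worst area",
    "worst smoothness", "worst compactness", "worst concavity",
    "worst concave points", "worst symmetry", "worst fractal dimension",
]
importances = [
    0.03513216, 0.01585052, 0.0186254, 0.02670663, 0.00605895, 0.00789894,
    0.04515099, 0.1049147, 0.00337675, 0.00486425, 0.00570774, 0.00325288,
    0.01496999, 0.01766458, 0.00356438, 0.00399239, 0.00960841, 0.00326919,
    0.00411958, 0.00443326, 0.11656498, 0.02730529, 0.10295829, 0.16711184,
    0.0147452, 0.00857458, 0.04993573, 0.15610176, 0.01194578, 0.00559486,
]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

# Show the five most important features.
for name, score in ranked[:5]:
    print(f"{name:<22} {score:.4f}")
```

In this particular run, the "worst" size and concavity measurements dominate the ranking.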


## 7. Make Prediction

In [74]:
X[100]

array([1.361e+01, 2.498e+01, 8.805e+01, 5.827e+02, 9.488e-02, 8.511e-02,
       8.625e-02, 4.489e-02, 1.609e-01, 5.871e-02, 4.565e-01, 1.290e+00,
       2.861e+00, 4.314e+01, 5.872e-03, 1.488e-02, 2.647e-02, 9.921e-03,
       1.465e-02, 2.355e-03, 1.699e+01, 3.527e+01, 1.086e+02, 9.065e+02,
       1.265e-01, 1.943e-01, 3.169e-01, 1.184e-01, 2.651e-01, 7.397e-02])

In [45]:
print(clf.predict([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01]]))

[0]


In [50]:
print(clf.predict(X[[3]]))

[0]


In [51]:
print(clf.predict_proba(X[[0]]))

[[0.97 0.03]]
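Each column of the `predict_proba` output corresponds to one entry of `clf.classes_`, in the same order. The sketch below, which hard-codes the row printed above and assumes the model was fitted on the integer target (so the classes are 0 and 1), shows how the predicted label follows from the largest probability.

```python
# Values copied from this notebook; `classes` stands in for clf.classes_.
classes = [0, 1]
target_names = ["malignant", "benign"]
proba = [0.97, 0.03]  # the row printed above

# The predicted class is the one with the highest estimated probability.
best = max(range(len(proba)), key=proba.__getitem__)
predicted_class = classes[best]

print(predicted_class, target_names[predicted_class])
```

Here the model assigns 97% of its estimated probability to class 0, which is "malignant".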


In [52]:
clf.fit(breast_cancer.data, breast_cancer.target_names[breast_cancer.target])
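The expression `breast_cancer.target_names[breast_cancer.target]` replaces every integer code with its class name, so a model fitted on it returns names from `predict`. A plain-Python equivalent of that lookup, using a short made-up sample rather than the full target array, might look like this:

```python
target_names = ["malignant", "benign"]

# Short illustrative sample of integer codes, not the full target array.
target = [0, 0, 1, 0, 1, 1]

# Look up the class name for each integer code.
labels = [target_names[code] for code in target]
print(labels)
```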

## 8. Data split (70/30 ratio)

In [82]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

In [83]:
X_train.shape, Y_train.shape

((398, 30), (398,))

In [84]:
X_test.shape, Y_test.shape

((171, 30), (171,))
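The shapes above follow from the 70/30 ratio. The sketch below reproduces the arithmetic, assuming the fractional test share is rounded up and the remainder goes to the training set, which matches the numbers printed here. Note that `train_test_split` was called without a `random_state`, so the rows that land in each part change between runs.

```python
import math

n_samples = 569   # number of rows in X
test_size = 0.3   # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```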

## 9. Rebuild the Random Forest Model

In [94]:
clf.fit(X_train, Y_train)

### 9.1. Perform prediction on a single sample from the data set

In [86]:
print(clf.predict([[1.176e+01, 2.160e+01, 7.472e+01, 4.279e+02, 8.637e-02, 4.966e-02,
       1.657e-02, 1.115e-02, 1.495e-01, 5.888e-02, 4.062e-01, 1.210e+00,
       2.635e+00, 2.847e+01, 5.857e-03, 9.758e-03, 1.168e-02, 7.445e-03,
       2.406e-02, 1.769e-03, 1.298e+01, 2.572e+01, 8.298e+01, 5.165e+02,
       1.085e-01, 8.615e-02, 5.523e-02, 3.715e-02, 2.433e-01, 6.563e-02]]))

[1]


In [87]:
print(clf.predict_proba([[1.361e+01, 2.498e+01, 8.805e+01, 5.827e+02, 9.488e-02, 8.511e-02,
       8.625e-02, 4.489e-02, 1.609e-01, 5.871e-02, 4.565e-01, 1.290e+00,
       2.861e+00, 4.314e+01, 5.872e-03, 1.488e-02, 2.647e-02, 9.921e-03,
       1.465e-02, 2.355e-03, 1.699e+01, 3.527e+01, 1.086e+02, 9.065e+02,
       1.265e-01, 1.943e-01, 3.169e-01, 1.184e-01, 2.651e-01, 7.397e-02]]))

[[0.91 0.09]]


### 9.2. Perform prediction on the test set

#### *Predicted class labels*

In [95]:
print(clf.predict(X_test))

[1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1
 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1
 1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1
 1 0 1 1 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1
 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 1 1]


*Optional: copy the line `clf.fit(breast_cancer.data, breast_cancer.target_names[breast_cancer.target])` from Section 7 (Make Prediction) and run it, then re-run `print(clf.predict(X_test))` from Section 9.2; the predictions are then returned as class names instead of integers. Afterwards, re-run `clf.fit(X_train, Y_train)` from Section 9 (Rebuild the Random Forest Model) before running `print(clf.score(X_test, Y_test))` in Section 10 (Model Performance), so that the model is scored against the integer labels and on samples it was not trained on.*

In [92]:
clf.fit(breast_cancer.data, breast_cancer.target_names[breast_cancer.target])

#### *Actual class labels*

In [96]:
print(Y_test)

[1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1
 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1 1
 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1
 1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1
 0 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1]


## 10. Model Performance

In [97]:
print(clf.score(X_test, Y_test))

0.9707602339181286
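For a classifier, `clf.score` reports accuracy: the fraction of test samples whose predicted label matches the actual label. The sketch below recomputes that figure by hand from the predicted and actual labels printed in Section 9.2; the variable names are chosen for illustration only.

```python
# Predicted and actual test labels, copied from the outputs in Section 9.2.
predicted = [int(t) for t in """
1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1
0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1
1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1
1 0 1 1 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1
0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 1 1
""".split()]

actual = [int(t) for t in """
1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1
0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1 1
1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1
1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1
0 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1
""".split()]

# Accuracy is the share of positions where prediction and truth agree.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(correct, len(actual), accuracy)

# Collect the misclassified samples as (predicted, actual) pairs.
errors = [(p, a) for p, a in zip(predicted, actual) if p != a]
print(errors)
```

The count of matching labels divided by the 171 test samples should reproduce the score printed above, and the list of mismatches shows which kind of error the model made on each misclassified sample.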
