#### Use the iris flower dataset to build a classification model that predicts the species of each plant. The dataset contains three classes: Iris-setosa, Iris-versicolor and Iris-virginica. Build the model with a Random Forest classifier and evaluate its performance. Also analyse the features and identify the most important ones in the iris dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/rahul96rajan/sample_datasets/master/iris.csv')
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
X = data.drop('species', axis=1)
y = data['species']

In [5]:
y.value_counts()

setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
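
The split above is purely random, so the three species need not appear in equal proportions in the test set. A minimal sketch of a stratified alternative follows. It assumes scikit-learn's bundled copy of iris (`load_iris`) in place of the CSV, so it runs without network access; `stratify=y` is a variation on the notebook, not what the cells above use.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the CSV loaded in the notebook.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# stratify=y keeps the class proportions (roughly) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Per-class counts in the held-out set
test_counts = y_te.value_counts()
print(test_counts)
```

With a balanced dataset like this one the effect is small, but stratification removes one source of variation between the train and test scores.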

In [7]:
rfc = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {'n_estimators': [400, 800, 1200], 'max_features': [2, 3, 4],
          'max_leaf_nodes': [4, 5, 6]}

gs = GridSearchCV(estimator=rfc, param_grid=params, scoring='f1_micro', cv=10)
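
To get a feel for the cost of this search, the number of candidate settings and model fits can be counted before calling `fit`. The sketch below uses `ParameterGrid` from scikit-learn with the same grid and fold count; the variable names `n_candidates` and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as the GridSearchCV call above
params = {'n_estimators': [400, 800, 1200], 'max_features': [2, 3, 4],
          'max_leaf_nodes': [4, 5, 6]}
cv = 10

# Every combination of the three lists is one candidate
n_candidates = len(ParameterGrid(params))

# Each candidate is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 27 270
```

Since every fit trains at least 400 trees, narrowing the grid is the easiest way to shorten the search.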

In [8]:
gs.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             param_grid={'max_features': [2, 3, 4], 'max_leaf_nodes': [4, 5, 6],
                         'n_estimators': [400, 800, 1200]},
             scoring='f1_micro')

In [9]:
print(gs.best_params_)
print(gs.best_estimator_)
print(gs.best_score_)

{'max_features': 3, 'max_leaf_nodes': 4, 'n_estimators': 400}
RandomForestClassifier(max_features=3, max_leaf_nodes=4, n_estimators=400,
                       n_jobs=-1, random_state=42)
0.95


In [10]:
y_pred_test = gs.predict(X_test)
y_pred_train = gs.predict(X_train)
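
A single aggregate score hides which species get confused with each other. As a rough sketch, a confusion matrix and a per-class report can be produced as shown below. The example is self-contained: it assumes scikit-learn's bundled iris data rather than the CSV, and it refits a forest with the hyperparameters reported by `gs.best_params_` instead of reusing `gs`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the CSV loaded in the notebook.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Hyperparameters copied from the grid search result printed above
model = RandomForestClassifier(n_estimators=400, max_features=3,
                               max_leaf_nodes=4, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Inside the notebook the same two calls should work directly on `y_test` and `y_pred_test`.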

In [11]:
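# For single-label multiclass problems, micro-averaged F1 equals accuracy,
# so the F1 and accuracy scores printed below match by construction.
print('F1 Score(Train): {:.4f}'.format(f1_score(y_train, y_pred_train,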
                                                average='micro')))
print('F1 Score(Test): {:.4f}'.format(f1_score(y_test, y_pred_test,
                                               average='micro')))

print('\nAccuracy Score(Train): {:.4f}'.format(accuracy_score(y_train,
                                                              y_pred_train)))
print('Accuracy Score(Test): {:.4f}'.format(accuracy_score(y_test,
                                                           y_pred_test)))

F1 Score(Train): 0.9600
F1 Score(Test): 0.9800

Accuracy Score(Train): 0.9600
Accuracy Score(Test): 0.9800
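
The task also asks for a feature analysis, which the cells above do not cover. One common approach is the impurity-based `feature_importances_` attribute of a fitted random forest. The sketch below is self-contained: it assumes scikit-learn's bundled iris data (so the column names carry a `(cm)` suffix) and refits a forest with the hyperparameters from `gs.best_params_`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the CSV loaded in the notebook.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Hyperparameters copied from the grid search result printed above
model = RandomForestClassifier(n_estimators=400, max_features=3,
                               max_leaf_nodes=4, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

In the notebook itself, `gs.best_estimator_.feature_importances_` paired with `X.columns` should give the same kind of ranking without refitting. On iris the petal measurements typically dominate, though the exact values depend on the seed and hyperparameters.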
