# Rice type classification

This notebook takes the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict the rice type?

# 2. Data

Data from: https://www.kaggle.com/mssmartypants/rice-type-classification

## Context

This is a set of data created for rice classification. I recommend using this dataset for educational purposes, for practice, and to acquire the necessary knowledge. It is a modified dataset from this resource: https://www.kaggle.com/seymasa/rice-dataset-gonenjasmine

## Content

What's inside is more than just rows and columns. You can see rice details listed as column names. 

# 3. Evaluation

As this is a classification problem, we will use classification metrics to evaluate the model.
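
To make these metrics concrete, here is a minimal, dependency-free sketch of how accuracy, precision, recall and F1 follow from the confusion counts of a binary problem. The function name `binary_metrics` and the toy labels are assumptions made for illustration only; the notebook itself relies on scikit-learn for these numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented purely for illustration (not taken from the rice data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision penalises false positives and recall penalises false negatives, so looking at both (or at F1, their harmonic mean) says more than accuracy alone.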

# 4. Features

## Features / inputs


    1. id
    2. Area
    3. MajorAxisLength
    4. MinorAxisLength
    5. Eccentricity
    6. ConvexArea
    7. EquivDiameter
    8. Extent
    9. Perimeter
    10. Roundness
    11. AspectRation
    
## Label / output

    12. Class


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the DataSet

In [None]:
# Local
# df = pd.read_csv('riceClassification.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/rice-type-classification/riceClassification.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

There are no null values.

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Class')
sns.countplot(data=df, x='Class');

We can treat the classes as balanced.
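
A count plot gives a visual impression; a quick numeric check can back it up. The sketch below is a plain-Python illustration with made-up labels (not the real class counts), and the helper names `class_proportions` and `imbalance_ratio` are hypothetical. With pandas, `df['Class'].value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter


def class_proportions(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by the smallest (1.0 means perfectly balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up labels for illustration only; not the counts from the rice dataset.
labels = [0] * 45 + [1] * 55
proportions = class_proportions(labels)
ratio = imbalance_ratio(labels)
print(proportions)
print(round(ratio, 3))
```

A ratio close to 1 supports treating the classes as balanced; how far from 1 is still acceptable is a judgment call.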

In [None]:
df.describe().transpose()

We can drop the `id` column, as it is just a unique identifier for each row.

In [None]:
df = df.drop('id', axis=1)

In [None]:
plt.figure(figsize=(20,20))
plt.title('Correlation heatmap')
sns.heatmap(data=pd.get_dummies(df).corr(), annot=True);

In [None]:
sns.pairplot(data=df, hue='Class');

In [None]:
len(df['Area'].unique())

In [None]:
df.info()

# 5. Modelling

In [None]:
X = df.drop('Class', axis=1)
y = df['Class']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
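
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into the preprocessing. As a rough sketch of what that means for a single column, the plain-Python code below mimics the idea with hypothetical helpers `fit_scaler` and `transform` and invented values; it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Return (mean, std) learned from a training column (population std, ddof=0)."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize a column with statistics learned elsewhere."""
    # A constant column has std 0; keep its scale rather than divide by zero.
    scale = std if std > 0 else 1.0
    return [(value - mean) / scale for value in column]


# Invented values standing in for one feature; not real rows from the dataset.
train_area = [4000.0, 5000.0, 6000.0, 7000.0]
test_area = [5500.0, 8000.0]

mean, std = fit_scaler(train_area)                 # statistics come from train only
train_scaled = transform(train_area, mean, std)
test_scaled = transform(test_area, mean, std)      # test reuses the train statistics
print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, and that is expected.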

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

## Baseline Model Scores

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        model_scores[name] = model.score(X_test,y_test)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
        
    return model_scores

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier(),
          'CatBoostClassifier': CatBoostClassifier(),
          'LGBMClassifier':LGBMClassifier()}

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
# model.score() returns mean accuracy for classifiers
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

The best-performing models are LogisticRegression at 0.990836 and SVC at 0.990286.
Let's try to improve them with a randomized search CV.

## Random Search CV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                     param_distributions=params[name],
                                      cv=5,
                                     n_iter=20,n_jobs=-1,
                                     verbose=2)        
        rs_model.fit(X_train,y_train)
        model_rs_scores[name] = rs_model.score(X_test,y_test)
        model_rs_best_param[name] = rs_model.best_params_
        
    return model_rs_scores, model_rs_best_param

### RS Model 1

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
         'SVC': SVC()}
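# l1_ratio is only used with penalty='elasticnet' and must lie between 0 and 1.
# Not every solver supports every penalty, so some sampled combinations will fail to fit.
params = {'LogisticRegression':{'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                                'penalty':['none','l1','l2','elasticnet'],
                                'C': [0.01,0.1,1,10,100],
                                'l1_ratio':[0,0.25,0.5,0.75,1]},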
          'SVC':{'C': [0.1,0.5,1, 10,100,500], 
              'kernel':['linear', 'poly', 'rbf','sigmoid'],
              'gamma':['scale','auto'],
              'degree':[2,3,4]}
         }
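
Not every `LogisticRegression` solver supports every penalty, so a single flat grid contains combinations that cannot be fitted and that use up some of the `n_iter` draws. One possible way around this, sketched below under the assumption that `param_distributions` is given as a list of dicts (which `RandomizedSearchCV` accepts in recent scikit-learn versions), is to group each solver with the penalties it supports. The names `logreg_param_space` and `expand` are hypothetical, and `expand` only enumerates the combinations to show what the space contains.

```python
from itertools import product

C_VALUES = [0.01, 0.1, 1, 10, 100]

# Each dict pairs solvers only with penalties they support.
# 'none' matches the string used in this notebook; newer scikit-learn versions use None.
logreg_param_space = [
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2", "none"], "C": C_VALUES},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_VALUES},
    {"solver": ["saga"], "penalty": ["l1", "l2", "none"], "C": C_VALUES},
    # l1_ratio is only read for elasticnet and must be a fraction between 0 and 1.
    {"solver": ["saga"], "penalty": ["elasticnet"], "C": C_VALUES,
     "l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0]},
]


def expand(space):
    """List every concrete parameter combination in a list-of-dicts search space."""
    combos = []
    for grid in space:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            combos.append(dict(zip(keys, values)))
    return combos


combos = expand(logreg_param_space)
print(len(combos))  # 80
```

Passing `{'LogisticRegression': logreg_param_space, ...}` as `params` would then keep the search within valid combinations.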

In [None]:
model_rs_scores_1, model_rs_best_param_1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_1

In [None]:
model_rs_best_param_1

### RS Model 2

In [None]:
# l1_ratio is left out: it is only used with penalty='elasticnet'
params = {'LogisticRegression':{'solver': ['saga'],
                                'penalty':['l1'],
                                'C': [5,10,20,30,40]},
          'SVC': {'C': [80,90,100,120,150], 
              'kernel':['rbf'],
              'gamma':['auto'],
              'degree':[1,2,3]}
         }

In [None]:
model_rs_scores_2, model_rs_best_param_2 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_2

In [None]:
model_rs_best_param_2

### RS Model 3

In [None]:
# l1_ratio is left out: it is only used with penalty='elasticnet'
params = {'LogisticRegression':{'solver': ['saga'],
                                'penalty':['l1'],
                                'C': [3,4,5,6,7,8,9]},
          'SVC': {'C': [95,100,105,110], 
              'kernel':['rbf'],
              'gamma':['auto'],
              'degree':[1,2]}
         }

In [None]:
model_rs_scores_3, model_rs_best_param_3 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_3

In [None]:
model_rs_scores_2

In [None]:
model_rs_best_param_3

The model does not seem to improve with the given hyperparameters.
We will use the SVC to build the model and evaluate it.

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, plot_confusion_matrix, plot_roc_curve 
from sklearn.model_selection import cross_val_score

In [None]:
model = SVC(kernel='rbf', gamma='auto', degree=1,C=105)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## SVC

### Classification Report

In [None]:
print(classification_report(y_test,y_preds))

### Confusion matrix

In [None]:
# Pass the true labels; the function computes the predictions itself
plot_confusion_matrix(model, X_test, y_test)

### ROC Curve

In [None]:
plot_roc_curve(model, X_test, y_test);
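
The area under the ROC curve can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python. The function `roc_auc` and the toy scores are illustrative assumptions; the pairwise loop is meant to show the idea, not to be efficient.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores for illustration; they stand in for a classifier's decision scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

An AUC of 1.0 means every positive is scored above every negative, and 0.5 corresponds to random ranking.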

### Evaluation using cross-validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    # Score the model with k-fold cross-validation on four metrics,
    # using the cv argument for the number of folds.
    
    cv_accuracy = cross_val_score(model,X,y,cv=cv,
                         scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')
    
    cv_precision = cross_val_score(model,X,y,cv=cv,
                         scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')
    
    cv_recall = cross_val_score(model,X,y,cv=cv,
                         scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')
    
    cv_f1 = cross_val_score(model,X,y,cv=cv,
                         scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')   
    
    # Collect the mean of each metric in a single-row DataFrame
    cv_merics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])
    
    return cv_merics


In [None]:
cv_merics = get_cv_score(model, X, y, cv=5)

In [None]:
cv_merics

In [None]:
plt.figure(figsize=(20,10))
plt.title('CV Scores')
sns.barplot(data=cv_merics);

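After cross-validation, it seems that the SVC model is either overfitted or not suitable. Note, however, that the cross-validation here is run on the unscaled `X`, while the model above was fit on standardized features, which may account for part of the gap. We will try the Logistic Regression model instead.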
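
The cross-validation calls in this section receive the raw `X`, whereas the models were fitted on standardized features. To keep the two comparable, the scaling would have to be repeated inside every fold, using only that fold's training part. In scikit-learn this is usually done by wrapping the scaler and the estimator with `sklearn.pipeline.make_pipeline` and passing the pipeline to `cross_val_score`. The plain-Python sketch below only illustrates the per-fold scaling idea on one column. The helper names are hypothetical, and it uses simple contiguous folds rather than the stratified folds scikit-learn applies to classifiers.

```python
from statistics import fmean, pstdev


def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def scale_within_folds(column, n_splits=5):
    """Standardize each held-out fold with statistics from its training part only."""
    scaled_folds = []
    for train_idx, test_idx in kfold_indices(len(column), n_splits):
        train = [column[i] for i in train_idx]
        mean, std = fmean(train), pstdev(train)
        scale = std if std > 0 else 1.0   # avoid dividing by zero
        scaled_folds.append([(column[i] - mean) / scale for i in test_idx])
    return scaled_folds


# Invented values standing in for one feature column.
column = [float(v) for v in range(10)]
folds = scale_within_folds(column, n_splits=5)
print(folds[0])
```

Each fold is scaled with slightly different statistics, which is exactly what happens when a pipeline refits its scaler on every training split.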

## Logistic Regression

In [None]:
model = LogisticRegression(solver='saga', penalty='l1',C=5,max_iter=10000)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

### Classification Report

In [None]:
print(classification_report(y_test,y_preds))

### Confusion matrix

In [None]:
# Pass the true labels; the function computes the predictions itself
plot_confusion_matrix(model, X_test, y_test)

### ROC Curve

In [None]:
plot_roc_curve(model, X_test, y_test);

### Evaluation using cross-validation

In [None]:
cv_merics = get_cv_score(model, X, y, cv=5)

In [None]:
cv_merics

In [None]:
plt.figure(figsize=(20,10))
plt.title('CV Scores')
sns.barplot(data=cv_merics);

Given the failure of the SVC model in the CV evaluation, we will build the model using Logistic Regression, which has the following mean CV scores:

    * Accuracy: 0.980368
    * Precision: 0.977688
    * Recall: 0.987982
    * f1: 0.987982