# Classification using pycaret

https://pycaret.org/

This notebook follows a basic, default workflow for developing a machine learning model with pycaret.

The workflow has the following steps:

* Identify files (default in kaggle)
* Get data
* Setup an experiment
* Run several models over training set and compare
* Analyse best model
* Tune the best model
* Ensemble tuned model
* Analyse final model
* Predict over test data
* Analyse results
* Predict over submit data
* Submit
* Finalize and save model

## Identify files (default in kaggle)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Get data

In [None]:
raw = pd.read_csv('/kaggle/input/av-healthcare-analytics-ii/healthcare/train_data.csv')

raw = raw.dropna()

raw.head()
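
Because `dropna()` silently discards every row that has at least one missing value, it can help to count the affected rows first. The sketch below runs on a small made-up frame, not on the real file: the column names are borrowed from this dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `raw`; all values are made up.
toy = pd.DataFrame({
    'case_id': [1, 2, 3, 4],
    'Bed Grade': [2.0, np.nan, 3.0, 1.0],
    'City_Code_Patient': [7.0, 7.0, np.nan, 8.0],
    'Stay': ['0-10', '11-20', '0-10', '21-30'],
})

# Missing values per column, before anything is dropped
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that dropna() would remove
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f'dropna() removes {rows_before - rows_after} of {rows_before} rows')
```

The same two checks could be applied to `raw` right after `pd.read_csv`, before the call to `dropna()`.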

### Get raw data shape

In [None]:
raw.shape

### Get some data for tests

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(raw, test_size=0.05)
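
The split above is not seeded, so `train` and `test` differ on every run. A minimal sketch of a reproducible, stratified variant is shown below on made-up data. Only the `Stay` column name comes from this notebook; the seed value and the 20% test size are arbitrary assumptions chosen to keep the toy example readable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with two classes in a 60/40 proportion
toy = pd.DataFrame({
    'feature': range(100),
    'Stay': ['0-10'] * 60 + ['11-20'] * 40,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    random_state=1945,     # arbitrary seed, fixed so the split is repeatable
    stratify=toy['Stay'],  # keep the class proportions equal in both parts
)

# Both parts should show the same 60/40 class proportions
print(toy_train['Stay'].value_counts(normalize=True))
print(toy_test['Stay'].value_counts(normalize=True))
```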

### Get train data shape

In [None]:
train.shape

### Get test data shape

In [None]:
test.shape

## Setup an experiment

### First install pycaret

In [None]:
!pip install pycaret

### Setup

Ignore the features that are assumed to have no relation to the number of days each patient will stay in the hospital.

The features to ignore were chosen empirically, based on my personal opinion:

* case_id
* patientid
* Visitors with Patient

In [None]:
from pycaret.classification import *

clf1 = setup(
    train, 
    target = 'Stay',
    ignore_features = ['case_id', 'patientid', 'Visitors with Patient'],
    session_id=1945,
    # normalize = True, 
    # transform_target = True, 
    # polynomial_features = True, 
    # feature_selection = True, 
    # train_size=0.7,
    categorical_features=['City_Code_Patient', 'Hospital_code', 'Bed Grade'], 
    # log_experiment=True,
    # log_plots=True,
    use_gpu=True,
    # experiment_name='av-healthcare-analytics-ii-ex-v1'
    silent = True
)
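
The ignored columns were picked by judgment. One simple check that can support dropping an identifier such as `case_id` is its cardinality: a column with a distinct value for nearly every row behaves like a row label and is unlikely to generalise. The sketch below uses a hypothetical frame, and the 0.9 threshold is an arbitrary assumption. It says nothing about columns like `Visitors with Patient`, which would need a different kind of check.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    'case_id': [101, 102, 103, 104, 105, 106],
    'Hospital_code': [8, 8, 2, 2, 8, 2],
    'Stay': ['0-10', '11-20', '0-10', '0-10', '11-20', '0-10'],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = toy.nunique() / len(toy)

# Arbitrary threshold (assumption): flag columns that are almost all unique
id_like = unique_ratio[unique_ratio > 0.9].index.tolist()
print(id_like)
```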

## Run several models over training set and compare

Comparing models is very time-consuming, so `compare_models` is not executed in this Kaggle notebook.

In an earlier run of `compare_models`, the best model was 'lightgbm' (Light Gradient Boosting).

So only this model is created here.

In [None]:
# best = compare_models(fold = 5)

best = create_model('lightgbm')
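
If the full comparison is too slow, it can be restricted to a short list of candidates. The sketch below assumes PyCaret 2.1 or later, where `compare_models` accepts an `include` argument, and assumes `setup()` has already been run. The helper name `compare_shortlist` and the three model ids are illustrative choices, not the set that produced the result quoted above.

```python
def compare_shortlist(model_ids=('lightgbm', 'rf', 'lr'), folds=5):
    """Cross-validate only the listed models and return the best one.

    Hypothetical helper: it must be called after setup().
    """
    # Deferred import: pycaret is only needed when the helper is called
    from pycaret.classification import compare_models

    return compare_models(include=list(model_ids), fold=folds, sort='Accuracy')
```

It would be called as `best = compare_shortlist()` in place of `create_model('lightgbm')`.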

## Analyse best model

Check ROC curves

In [None]:
plot_model(best)

Check confusion matrix

In [None]:
plot_model(best, plot='confusion_matrix')

Evaluate model

In [None]:
evaluate_model(best)

## Tune best model

Tune the model by searching for a better hyperparameter setup

In [None]:
tunned = tune_model(best)

## Ensemble tuned model

In [None]:
ensembled = ensemble_model(tunned)

## Analyse final model

Check ROC curves

In [None]:
plot_model(ensembled)

Check confusion matrix

In [None]:
plot_model(ensembled, plot='confusion_matrix')

Evaluate model

In [None]:
evaluate_model(ensembled)

## Predict over test data

In [None]:
predict_test = predict_model(ensembled, test)
predict_test = predict_test.dropna()
predict_test.to_csv('predict_test.csv', index=False)
predict_test.head()

## Analyse results

In [None]:
predict_test['comp'] = np.where(predict_test['Stay'] == predict_test['Label'], 'Correct', 'Incorrect')
predict_test.groupby('comp').count()['Label']

Test accuracy

In [None]:
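# Accuracy = correct predictions / all predictions
# (dividing the 'Correct' count by the 'Incorrect' count would give a ratio, not an accuracy)
print((predict_test['Stay'] == predict_test['Label']).mean())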

The test accuracy is almost the same as the training accuracy, which is a good sign: it suggests that the model is able to generalize and is not overfitting.
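
Accuracy is the share of correct predictions among all predictions, that is correct / (correct + incorrect). A single number can hide weak classes, so a per-class breakdown is a useful complement. A minimal sketch with made-up labels, where `Stay` stands for the true class and `Label` for the predicted class as in the cells above:

```python
import pandas as pd

# Made-up true ('Stay') and predicted ('Label') classes
toy = pd.DataFrame({
    'Stay':  ['0-10', '0-10', '11-20', '11-20', '21-30', '21-30'],
    'Label': ['0-10', '11-20', '11-20', '11-20', '21-30', '0-10'],
})

correct = toy['Stay'] == toy['Label']

# Overall accuracy: correct / (correct + incorrect)
accuracy = correct.mean()
print(f'accuracy = {accuracy:.3f}')

# Share of correct predictions within each true class
per_class = correct.groupby(toy['Stay']).mean()
print(per_class)

# Counts of true class (rows) against predicted class (columns)
print(pd.crosstab(toy['Stay'], toy['Label']))
```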

## Predict over submit data

Get data and predict with model

In [None]:
submit = pd.read_csv('/kaggle/input/av-healthcare-analytics-ii/healthcare/test_data.csv')
predict_submit = predict_model(ensembled, submit)

Take a look at the results

In [None]:
predict_submit

## Submit

In [None]:
predict_submit_format = pd.DataFrame({ 'case_id': predict_submit['case_id'], 'Stay': predict_submit['Label']})
predict_submit_format.to_csv('Submission.csv', index=False)
predict_submit_format
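
Before uploading, a few sanity checks on the submission frame can catch formatting problems early. The helper below is a hypothetical sketch: it assumes the expected layout is exactly the two columns built above (`case_id`, `Stay`) with one row per case, which is an assumption about the competition format rather than something stated in this notebook.

```python
import pandas as pd

def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(frame.columns) != ['case_id', 'Stay']:
        problems.append(f'unexpected columns: {list(frame.columns)}')
    if len(frame) != expected_rows:
        problems.append(f'expected {expected_rows} rows, got {len(frame)}')
    if frame.isna().any().any():
        problems.append('missing values present')
    if 'case_id' in frame.columns and frame['case_id'].duplicated().any():
        problems.append('duplicate case_id values')
    return problems

# Made-up frame with one missing prediction and one repeated id
toy = pd.DataFrame({'case_id': [1, 2, 2], 'Stay': ['0-10', None, '11-20']})
print(check_submission(toy, expected_rows=3))
```

On the real data this could be called as `check_submission(predict_submit_format, expected_rows=len(submit))`.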

## Finalize and save model

Finalize model

In [None]:
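# finalize_model returns a new model refit on the whole dataset, so keep the result
final_model = finalize_model(ensembled)

Save model

In [None]:
save_model(final_model, 'model')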
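
A saved pipeline can later be reloaded to score new data without repeating the experiment. The sketch below assumes the PyCaret 2.x API used in this notebook and a file `model.pkl` in the working directory, which is what `save_model(..., 'model')` writes. The helper name `score_with_saved_model` is hypothetical.

```python
def score_with_saved_model(frame, model_name='model'):
    """Load a pipeline written by save_model() and predict on `frame`."""
    # Deferred import: pycaret is only needed when the helper is called
    from pycaret.classification import load_model, predict_model

    pipeline = load_model(model_name)  # reads '<model_name>.pkl'
    return predict_model(pipeline, data=frame)
```

For example, `score_with_saved_model(submit)` should return `submit` with the prediction columns appended, as `predict_model` did earlier in the notebook.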