# **1. Introduction**

**Objective:** multi-class classification

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv')
sub_df = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv')

In [None]:
train_df.head()

In [None]:
train_df.shape

# **2. EDA**

In [None]:
train_df.isnull().sum().sum()

In [None]:
train_df.info()

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the narrowest dtype that holds its min and max."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

**PS:** I took this function from [this comment](https://www.kaggle.com/questions-and-answers/134639#767486); it avoids the error described there.
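To make the range check concrete, here is a small sketch of the integer branch on a toy frame. The helper name `smallest_int_dtype` and the toy columns are illustrative assumptions, not part of the original function. Unlike the function above, it uses inclusive bounds (`<=`), so boundary values such as 127 still fit `int8`.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range covers [c_min, c_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return dtype
    return np.int64

# Toy columns (made up): both start as int64, as pd.read_csv would load them.
toy = pd.DataFrame({
    'binary_flag': np.array([0, 1, 1, 0], dtype=np.int64),
    'elevation': np.array([1800, 2500, 3200, 4000], dtype=np.int64),
})

# Downcast each column based on its observed min and max.
for col in toy.columns:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col].min(), toy[col].max()))

print(toy.dtypes)
```

A 0/1 indicator column only needs one byte per value, which is where most of the savings on a wide table of binary features would come from.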

In [None]:
# reduce memory usage
train_df = reduce_mem_usage(train_df)

In [None]:
train_df['Cover_Type'].value_counts()

In [None]:
import seaborn as sns
sns.histplot(data=train_df, x='Cover_Type', binwidth=1, stat='percent', discrete=True)

**Observation:** The training data is *imbalanced*. Imbalanced data makes a classifier **biased** toward the one or two classes that hold most of the rows. Here, Cover_Type 2 and 1 are the majority classes.
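A quick way to put a number on the imbalance is to look at class proportions rather than raw counts. The sketch below uses made-up labels standing in for `train_df['Cover_Type']`; the counts are assumptions for illustration only.

```python
import pandas as pd

# Toy labels standing in for train_df['Cover_Type']; the counts are made up.
labels = pd.Series([2] * 6 + [1] * 3 + [3] * 1, name='Cover_Type')

# Fraction of rows per class, largest first.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Ratio between the most and the least frequent class.
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print('imbalance ratio:', imbalance_ratio)
```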

I tried several data balancing techniques, such as SMOTE and cost-sensitive training, but I either ran into memory issues or training took too long. I then found [this](https://rdcu.be/cDRy9) paper, which discusses various boosting methods for multi-class imbalanced data classification. The original *Forest Cover Type Prediction* dataset is also part of that paper. The authors report that the **LogitBoost** algorithm performs better than the other algorithms on big datasets (i.e., datasets with > 10k instances), so I chose this algorithm to build the model.
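For reference, cost-sensitive training usually starts from per-class weights that are inversely proportional to class frequency. The sketch below computes such weights by hand on made-up labels, using the heuristic that scikit-learn calls `'balanced'` (`n_samples / (n_classes * count)`). It only illustrates the idea and is not what this notebook ended up using.

```python
import numpy as np

# Made-up labels: class 2 dominates, class 3 is rare.
y_toy = np.array([1] * 30 + [2] * 60 + [3] * 10)

classes, counts = np.unique(y_toy, return_counts=True)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
weights = len(y_toy) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Per-row weights, usable with estimators whose fit() accepts sample_weight.
sample_weight = np.array([class_weight[label] for label in y_toy.tolist()])
```

With these weights every class contributes the same total weight, so errors on the rare class cost as much in aggregate as errors on the majority class.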

Cover_Type 5 has only one row, so we can remove it safely.

In [None]:
train_df = train_df[train_df['Cover_Type'] != 5]
train_df.shape
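The same filter can be written without hard-coding the label. The sketch below, on a made-up frame, keeps only the classes with at least `min_count` rows; the threshold name and value are assumptions. A threshold of 2 matters later because a stratified split needs at least two rows per class.

```python
import pandas as pd

# Made-up frame: class 5 appears only once.
toy = pd.DataFrame({
    'Id': range(7),
    'Cover_Type': [1, 1, 2, 2, 3, 3, 5],
})

min_count = 2  # assumed threshold
counts = toy['Cover_Type'].value_counts()
keep = counts[counts >= min_count].index

# Keep only rows whose class is frequent enough.
filtered = toy[toy['Cover_Type'].isin(keep)]
print(filtered['Cover_Type'].value_counts())
```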

In [None]:
y = train_df['Cover_Type']
X = train_df.drop(columns=['Id', 'Cover_Type'])

In [None]:
X.shape, y.shape

# **3. Modeling**

In [None]:
%pip install logitboost

In [None]:
from logitboost import LogitBoost

models ={'LB': LogitBoost(random_state=0)}

In [None]:
for key, value in models.items():
    print(key)
    print(value)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

I chose test_size=0.3 because the dataset is very large (~4 million instances). For the same reason, I skipped cross-validation (I tried StratifiedKFold and a few others, but they took a long time and/or hit memory issues) and instead split the data in a stratified fashion.

Feel free to let me know in the comments how I can use CV with LogitBoost.
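One possible way to make cross-validation affordable is to run it on a stratified subsample instead of all rows. The sketch below uses synthetic data and a `DecisionTreeClassifier` as a stand-in estimator, both assumptions for illustration. Since `LogitBoost` exposes the same `fit`/`predict` interface used in this notebook, it could presumably be swapped in, though that is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data standing in for X, y.
X_toy, y_toy = make_classification(
    n_samples=3000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Keep a stratified 20% subsample to cut the cost of each fold.
X_sub, _, y_sub, _ = train_test_split(
    X_toy, y_toy, train_size=0.2, stratify=y_toy, random_state=0
)

# Stratified folds preserve the class proportions in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_sub, y_sub, cv=cv, scoring='accuracy'
)
print(scores.mean(), scores.std())
```

The subsample fraction and the number of folds trade runtime against the variance of the estimate, so both would need tuning for the real data.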

**PS:** LogitBoost runs only on CPU.

In [None]:
%%time

lb = models.get('LB')
lb.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score,matthews_corrcoef

y_test_pred = lb.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_test_pred)))
print("Matthews correlation coefficient={}".format(matthews_corrcoef(y_test, y_test_pred)))

IMHO, along with the accuracy score (which the competition requires), the Matthews correlation coefficient should also be reported. According to the scikit-learn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html), it remains usable even when the classes are of very different sizes (i.e., imbalanced data). Since I only built the default model and could not apply any technique to handle the imbalance, I feel this metric is appropriate for this scenario.

> The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used ***even if the classes are of very different sizes***.
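A small made-up example shows why accuracy alone can mislead on imbalanced data. The labels and the "always predict the majority class" classifier below are assumptions for illustration: accuracy comes out at 0.9, while the Matthews correlation coefficient is 0 because the predictions carry no information about the minority class.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Made-up labels: 90 rows of class 0 and 10 rows of class 1.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 100

acc = accuracy_score(y_true, y_majority)
mcc = matthews_corrcoef(y_true, y_majority)
print('accuracy =', acc)
print('MCC =', mcc)
```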

# **4. Submission**

In [None]:
test_df.drop('Id', axis=1, inplace=True)

test_df.shape

In [None]:
y_pred = lb.predict(test_df)

In [None]:
sub_df['Cover_Type'] = y_pred
sub_df.head()

In [None]:
sub_df.to_csv('submission.csv', index=False)