<h1 style="color: #ff5733"><strong>Introduction</strong></h1>

<p style="font-size:120%">
<strong>Kaggle</strong> competitions are fun and rewarding, but they can also be intimidating for people who are relatively new to data science. Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.
</p>
    
<p style="font-size:120%">
The dataset used for this competition, <strong><a href="https://www.kaggle.com/c/tabular-playground-series-dec-2021">Tabular Playground Series - Dec 2021</a></strong>, is synthetic: it was generated with a <a href="https://github.com/sdv-dev/CTGAN">CTGAN</a> from the real data of the original <a href="https://www.kaggle.com/c/forest-cover-type-prediction/overview">Forest Cover Type Prediction competition</a>. The task is to predict a categorical target from the feature columns given in the data. Submissions are evaluated on <strong>multi-class classification accuracy</strong>.
</p>
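<p style="font-size:120%">
As a minimal illustration of the metric, accuracy is the fraction of rows whose predicted class exactly matches the true class. The labels below are hypothetical and are not taken from the competition data.
</p>

```python
import numpy as np

# Hypothetical labels, for illustration only (not competition data)
y_true = np.array([1, 2, 2, 3, 1, 2])
y_pred = np.array([1, 2, 1, 3, 1, 3])

# Accuracy = share of predictions that exactly match the true class
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)
```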

<h1 style="color: #ff5733"><strong>Setup</strong></h1>

In [None]:
# import libraries

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
# reduce memory usage function
# credits : Guillaume Martin (https://www.kaggle.com/gemartin/load-data-reduce-memory-usage/notebook)

def reduce_memory_usage(df):
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                # inclusive bounds, so values equal to a dtype's limits still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    pass
        else:
            df[col] = df[col].astype('category')
    
    return df
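
<p style="font-size:120%">
To see why downcasting helps, the sketch below measures the memory of a single integer column before and after converting it from <code>int64</code> to <code>int8</code>. The toy frame and the column name <code>a</code> are hypothetical and stand in for the competition data.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 100 small integers stored as int64 (8 bytes each)
df = pd.DataFrame({'a': np.arange(100, dtype=np.int64) % 50})

before = df.memory_usage(deep=True).sum()

# Values range from 0 to 49, so they fit in int8 (1 byte each)
df['a'] = df['a'].astype(np.int8)

after = df.memory_usage(deep=True).sum()
print(before, after)
```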

In [None]:
# read data into dataframe

train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
reduce_memory_usage(train)

test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')
reduce_memory_usage(test);

<h1 style="color:#ff5733"><strong>Exploratory Data Analysis</strong></h1>

In [None]:
# concise summary of dataset
train.info()

In [None]:
# first five rows
train.head()

<p style="font-size:120%">
<strong>Data description</strong> can be found <a href="https://www.kaggle.com/c/forest-cover-type-prediction/data">here</a>.
</p>

In [None]:
# shape of data
print(train.shape)
print(test.shape)

In [None]:
# descriptive statistics
train.describe().T.sort_values(by='std', ascending=False)

<p style="font-size:120%">
    Columns <strong>Soil_Type7</strong> and <strong>Soil_Type15</strong> contain only zeros, so they will be dropped later.
</p>
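
<p style="font-size:120%">
One way to find such columns programmatically is to look for features with a single unique value. This is a minimal sketch on a hypothetical toy frame; on the real data the same list comprehension could be applied to <code>train</code>.
</p>

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    'Soil_Type1': [0, 1, 0, 1],
    'Soil_Type7': [0, 0, 0, 0],
    'Soil_Type15': [0, 0, 0, 0],
})

# A column with only one unique value carries no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```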

In [None]:
# distribution of labels

plt.figure(figsize=(10,8))
sns.countplot(x='Cover_Type', data=train, palette='icefire_r');

In [None]:
train['Cover_Type'].value_counts(ascending=False)

<p style="font-size:120%">The label distribution is highly imbalanced.</p>
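
<p style="font-size:120%">
The imbalance can be quantified as the share of each class rather than raw counts. The labels below are hypothetical; on the real data the same call could be applied to <code>train['Cover_Type']</code>.
</p>

```python
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 3, 4], name='Cover_Type')

# normalize=True returns each class's share of the rows instead of its count
shares = labels.value_counts(normalize=True)
print(shares)
```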

<h1 style="color:#ff5733"><strong>Preprocessing</strong></h1>

In [None]:
# predictor
X = train.drop(columns=['Id','Cover_Type','Soil_Type7','Soil_Type15'])

# target
y = train['Cover_Type']

# test data 
test_df = test.drop(columns=['Id','Soil_Type7','Soil_Type15'])

In [None]:
# train-test split

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=123, shuffle=True)

<h1 style="color:#ff5733"><strong>CatBoostClassifier</strong></h1>


In [None]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(task_type='GPU')
model.fit(X_train,y_train)

In [None]:
# validation prediction
y_pred = model.predict(X_val)

In [None]:
# validation accuracy
from sklearn.metrics import accuracy_score
print('Accuracy Score : ',accuracy_score(y_val, y_pred))

In [None]:
# test prediction
y_pred = model.predict(test_df)

In [None]:
# submission
submission = pd.read_csv('../input/tabular-playground-series-dec-2021/sample_submission.csv')
# flatten in case predict returns a column vector of shape (n, 1)
submission['Cover_Type'] = np.ravel(y_pred)
submission.to_csv("submission.csv",index=False)
submission.head()

<p style="font-size:120%"><strong>Thank You</strong></p>