This notebook is a continuation of my previous notebook, [Part-1](https://www.kaggle.com/vishnukarthiklu/tps-12-part-1-data-visualization-and-eda). If you haven't visited the previous notebook yet, do have a look at it too!

From [Part-1](https://www.kaggle.com/vishnukarthiklu/tps-12-part-1-data-visualization-and-eda), we gained the following insights:
* There are no null values in either the training set or the test set.
* Among the target classes, most samples belong to Cover_Type 2, while Cover_Type 4 and Cover_Type 5 have the fewest samples.
* We went through the descriptions of the training and test data.
* We looked at the percentage of zeros in each feature and found that Soil_Type7 and Soil_Type15 consist entirely of zeros.
* According to the correlation map, most of the features have a correlation between 0 and 0.2.
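For anyone who skipped Part-1, here is a minimal sketch of how the null check and the percentage of zeros can be computed. The `toy` DataFrame and its values are made up for illustration; in the notebook the same idea would be applied to `train` and `test`.

```python
import pandas as pd

# Hypothetical toy data standing in for the competition's train set
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785],
    "Soil_Type1": [0, 1, 0, 0],
    "Soil_Type7": [0, 0, 0, 0],
})

# Total number of null values across the whole frame
n_nulls = int(toy.isnull().sum().sum())

# Percentage of zeros per column
zero_pct = (toy == 0).mean() * 100

# Columns that are made up entirely of zeros
all_zero_cols = zero_pct[zero_pct == 100].index.tolist()

print(n_nulls)
print(zero_pct)
print(all_zero_cols)
```

A column that is 100% zeros is constant, so it cannot help a model separate the classes.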

Before jumping straight into the problem, let's try to understand what each feature means.

* Elevation - Elevation in meters
* Aspect - Aspect is the compass direction or azimuth that a terrain surface faces.
* Slope - Slope in degrees
* Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water body
* Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water body
* Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway
* Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
* Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
* Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
* Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points
* Wilderness_Area (4 Binary columns, 0 -> absence or 1 -> presence) - Type of wilderness area 
* Soil_Type (40 Binary columns, 0 -> absence or 1 -> presence) - Soil Type designation
* Cover_Type (Categorical, 7 types, integers 1 to 7) - Forest Cover Type designation
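Since Wilderness_Area and Soil_Type are spread over many binary columns, it can be handy to collect them by name prefix. The sketch below does that on a made-up `toy` frame (only a couple of columns per group, with invented values) and checks that the grouped columns really contain only 0 and 1.

```python
import pandas as pd

# Hypothetical toy frame with a few of the column families described above
toy = pd.DataFrame({
    "Elevation": [2596, 2804],
    "Wilderness_Area1": [1, 0],
    "Wilderness_Area2": [0, 1],
    "Soil_Type1": [0, 0],
    "Soil_Type2": [1, 0],
})

# Group the one-hot style columns by their name prefix
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
wild_cols = [c for c in toy.columns if c.startswith("Wilderness_Area")]

# Check that every value in those columns is either 0 or 1
is_binary = bool(toy[soil_cols + wild_cols].isin([0, 1]).all().all())

print(soil_cols, wild_cols, is_binary)
```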

Now let's import the packages.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import optuna
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

Loading the data

In [None]:
train = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-dec-2021/test.csv")

In [None]:
train.head()

In [None]:
test.head()

Dropping columns that carry no information: the Id column and the two all-zero columns, Soil_Type7 and Soil_Type15.

In [None]:
train.drop("Id", axis=1, inplace=True)
test.drop("Id", axis=1, inplace=True)

In [None]:
cols = ["Soil_Type7", "Soil_Type15"]

train.drop(cols, axis=1, inplace=True)
test.drop(cols, axis=1, inplace=True)

In [None]:
# Drop the rows labelled Cover_Type 5, one of the rarest classes found in Part-1
ign = train[train["Cover_Type"] == 5].index
train.drop(ign, axis=0, inplace=True)
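Before dropping a class like this, it is worth looking at how many rows each class actually has. Here is a small sketch on made-up labels; the `min_count` threshold of 2 is my own assumption for illustration, not something taken from the competition.

```python
import pandas as pd

# Hypothetical labels: class 5 appears only once
toy = pd.DataFrame({"Cover_Type": [2, 2, 2, 1, 1, 3, 3, 5]})

# Number of rows per class
counts = toy["Cover_Type"].value_counts()

# Classes with fewer rows than the (assumed) threshold
min_count = 2
rare = counts[counts < min_count].index

# Keep only the rows whose class is not rare
filtered = toy[~toy["Cover_Type"].isin(rare)]

print(counts)
print(filtered["Cover_Type"].unique())
```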

Reducing the memory usage

Borrowed from this [link](https://www.kaggle.com/gulshanmishra/tps-dec-21-tensorflow-nn-feature-engineering)

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to a smaller dtype that can still hold their values."""
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers [c_min, c_max]
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats go to float32 when the range fits (some precision may be lost)
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
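To see what the downcasting does, here is a tiny sketch on a made-up frame. Instead of the function above, it uses pandas' built-in `pd.to_numeric` with `downcast="integer"`, which is a different but related way of shrinking integer columns; the column values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame stored as int64, like a freshly read CSV
toy = pd.DataFrame({
    "Elevation": np.array([2596, 2590, 2804], dtype=np.int64),
    "Soil_Type1": np.array([0, 1, 0], dtype=np.int64),
})
before = toy.memory_usage().sum()

# Downcast each column to the smallest integer dtype that holds its values
small = pd.DataFrame(
    {col: pd.to_numeric(toy[col], downcast="integer") for col in toy.columns}
)
after = small.memory_usage().sum()

print(small.dtypes)
print(before, after)
```

The binary column ends up as int8 and Elevation as int16, so the frame takes less memory while holding the same values.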

In [None]:
train_X = train.drop('Cover_Type', axis=1)
train_y = train['Cover_Type']

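Splitting the training data into a training part and a hold-out validation part (named X_test and y_test below).

Using the XGBoost and CatBoost algorithms.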

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_X, train_y, test_size=0.22, random_state=2021)
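The split above is purely random. With imbalanced classes, one optional variation is a stratified split, which keeps the class proportions roughly equal in both parts. This is only a sketch on made-up arrays (`X_toy`, `y_toy`), not what this notebook actually does. Note that `stratify` needs at least two samples per class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 150 / 40 / 10 samples per class
rng = np.random.default_rng(2021)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 150 + [2] * 40 + [3] * 10)

# stratify=y_toy keeps the class ratios similar in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.22, random_state=2021, stratify=y_toy
)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_val, return_counts=True))
```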

In [None]:
xgb = XGBClassifier(objective = 'multi:softmax', tree_method = 'gpu_hist', eval_metric = 'mlogloss', 
                    subsample = 0.6,gamma = 0.5,max_depth = 7,alpha = 4,learning_rate = .03,
                    n_estimators = 2400,predictor = 'gpu_predictor')
xgb.fit(X_train, y_train,
          early_stopping_rounds=200,
          eval_set=[(X_test,y_test)],
          verbose=True)
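One thing to keep in mind: after dropping Cover_Type 5, the remaining labels are 1, 2, 3, 4, 6 and 7, which are not consecutive. As far as I am aware, recent XGBoost releases expect class labels to run from 0 to n_classes - 1, so if the fit above complains about the classes, a remapping like the sketch below may help. The arrays here are made up for illustration.

```python
import numpy as np

# Hypothetical labels with a gap where class 5 used to be
y = np.array([1, 2, 2, 3, 4, 6, 7, 2])

# classes holds the sorted original labels,
# y_encoded holds each label's position in classes (0 .. n_classes - 1)
classes, y_encoded = np.unique(y, return_inverse=True)

# Pretend these are predictions from a model trained on y_encoded
encoded_preds = np.array([0, 4, 5])

# Map the predictions back to the original Cover_Type values
decoded_preds = classes[encoded_preds]

print(classes)
print(y_encoded)
print(decoded_preds)
```

The model would then be trained on the encoded labels, and its predictions decoded before writing the submission file.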

In [None]:
preds_valid = xgb.predict(X_test).astype('int')
acc = accuracy_score(y_test, preds_valid)
print("accuracy score:", acc)

In [None]:
model = CatBoostClassifier(task_type = 'GPU', devices = '0')
model.fit(X_train, y_train)

In [None]:
preds_valid = model.predict(X_test).astype('int')
acc = accuracy_score(y_test, preds_valid)
print("accuracy score:", acc)
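The accuracy score is simply the fraction of correct predictions, so on imbalanced data it can hide how the rare classes are doing. Below is a minimal sketch, on made-up labels, of the overall accuracy next to a per-class breakdown.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels and predictions
y_true = np.array([1, 1, 2, 2, 2, 3])
y_pred = np.array([1, 2, 2, 2, 2, 1])

# Overall accuracy: share of predictions that match the true label
correct = y_true == y_pred
overall = correct.mean()

# Per-class accuracy: share of correct predictions within each true class
per_class = pd.Series(correct).groupby(y_true).mean()

print(overall)
print(per_class)
```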

In [None]:
sub = pd.read_csv('../input/tabular-playground-series-dec-2021/sample_submission.csv')
sub['Cover_Type'] = xgb.predict(test).astype('int')
sub.to_csv("xgb_submission.csv", index=False)
sub.head()

In [None]:
sub['Cover_Type'] = model.predict(test).astype('int')
sub.to_csv("cat_submission.csv", index=False)
sub.head()

With no feature engineering and without any hyperparameter tuning, we are still able to reach an accuracy of nearly 96%. If you find this notebook useful, please do upvote. Thanks for your time, Kaggler!