# Goal of this notebook

* Analyse the data.
* Perform under-sampling to mitigate bias caused by the imbalanced dataset.
* Train an XGBoost classifier (tuned with Optuna), which achieved 94% accuracy on the held-out split of the under-sampled data.

# Data Analysis

* Import the required libraries and load the CSV files.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from imblearn.under_sampling import RandomUnderSampler  # used for under-sampling, explained later in the notebook
import collections
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import optuna

In [None]:
df_train_og = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
df_test_og  = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')
submission  = pd.read_csv('../input/tabular-playground-series-dec-2021/sample_submission.csv')

In [None]:
df_train_og.shape

The dataset has 56 columns, of which "Cover_Type" is the target variable that we have to predict.

In [None]:
df_train_og.head()

* The snippet below shows the number of unique values in each column.
* Apart from the first few columns, the remaining feature columns are binary indicators that take the value 0 or 1.
* "Soil_Type15" and "Soil_Type7" have only one unique value (0), so we can drop these two columns.

In [None]:
df_train_og.nunique()
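
As a minimal sketch, the constant columns noted above could also be found programmatically instead of by reading the output. The helper name `constant_columns` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single unique value."""
    counts = df.nunique()
    return counts[counts == 1].index.tolist()

# Toy stand-in for the real training data.
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 2800],
    "Soil_Type7": [0, 0, 0],
    "Soil_Type15": [0, 0, 0],
    "Soil_Type1": [0, 1, 0],
})
const_cols = constant_columns(toy)
print(const_cols)
```

On the real data, the returned list could be passed straight to `drop(columns=...)`.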

* The data has almost 4 million records, so to speed up computation let's reduce its memory footprint using the following function.

In [None]:
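def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns of df in place to a smaller dtype that still holds their value range."""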
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

In [None]:
df_train = reduce_mem_usage(df_train_og)
df_test = reduce_mem_usage(df_test_og)
del df_train_og
del df_test_og
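
For integer columns, a similar effect can probably be had with pandas' own `pd.to_numeric(..., downcast="integer")`. The sketch below is an alternative shown on toy data, not the function used in this notebook. The helper name `downcast_ints` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def downcast_ints(df):
    """Return a copy with integer columns downcast to the smallest signed integer dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy stand-in: one binary column and one column that needs more than 16 bits.
toy = pd.DataFrame({
    "Soil_Type1": np.array([0, 1, 0, 1], dtype=np.int64),
    "Elevation": np.array([2500, 3100, 2800, 40000], dtype=np.int64),
})
small = downcast_ints(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Unlike `reduce_mem_usage`, this version leaves the input frame untouched and returns a copy.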

# Dealing with Imbalanced dataset

* The graph below shows that the target variable is highly imbalanced. In fact, category 4 has only 377 records and category 5 has only one.
* Note: the y-axis is scaled by 10^6, so 0.1 on the y-axis corresponds to 100,000 records.

In [None]:
cat_count = collections.Counter(df_train['Cover_Type'])
cat_freq = cat_count.values()
cat = cat_count.keys()
plt.bar(cat , cat_freq)

print(cat_count)

* Let's drop all rows belonging to categories 4 and 5.

In [None]:
df_train = df_train[(df_train['Cover_Type'] != 4) & (df_train['Cover_Type'] != 5)]
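
The same filtering could be written without hard-coding the class labels by dropping every class below a row-count threshold. This is only a sketch on toy data. The helper name `drop_rare_classes` and the threshold `min_rows` are assumptions chosen for illustration.

```python
import pandas as pd

def drop_rare_classes(df, target, min_rows):
    """Keep only rows whose target class occurs at least `min_rows` times."""
    counts = df[target].value_counts()
    keep = counts[counts >= min_rows].index
    return df[df[target].isin(keep)]

# Toy stand-in: classes 4 and 5 appear only once each.
toy = pd.DataFrame({
    "Cover_Type": [1, 1, 1, 2, 2, 2, 4, 5],
    "Elevation": [10, 11, 12, 20, 21, 22, 40, 50],
})
filtered = drop_rare_classes(toy, "Cover_Type", min_rows=2)
print(sorted(filtered["Cover_Type"].unique()))
```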

* To balance the dataset, let's reduce the number of rows in each target category to the number of rows in the least frequent category.
* Here category 6 has 11426 rows. For each category, we randomly select 11426 rows so that every category has the same number of rows.
* The Id column is unique for each row and does not help predict the target variable, so it is dropped.
* The target variable is separated from the features below.

In [None]:
rus = RandomUnderSampler(sampling_strategy = "not minority")
X  = df_train.drop(columns = ['Id' , 'Cover_Type','Soil_Type7' , 'Soil_Type15'])
y = df_train['Cover_Type']
X_res,y_res = rus.fit_resample(X,y)
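
To illustrate what under-sampling to the minority class amounts to, here is a rough pandas-only sketch on toy data. It is not a drop-in replacement for `RandomUnderSampler`. The helper name `undersample` is an assumption, and `groupby(...).sample` needs pandas 1.1 or newer.

```python
import pandas as pd

def undersample(df, target, random_state=0):
    """Randomly keep as many rows per class as the rarest class has."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=random_state)

# Toy stand-in: class 6 is the rarest with 2 rows.
toy = pd.DataFrame({
    "Cover_Type": [1] * 6 + [2] * 4 + [6] * 2,
    "Elevation": range(12),
})
balanced = undersample(toy, "Cover_Type")
print(balanced["Cover_Type"].value_counts().to_dict())
```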

* After under-sampling, all categories have the same number of rows. (Categories 4 and 5 appear blank because we removed them from the dataset.)

In [None]:
cat_count = collections.Counter(y_res)
cat_freq = cat_count.values()
cat = cat_count.keys()
plt.bar(cat , cat_freq)

print(cat_count)

<!-- # One way Anova test for finding important features.

* Now we will try to find important features using ANOVA test.
* Anova test is generally carried out between numerical features and categorical features.
* Here categorical feature is target variable i.e "Cover_Type" and numerical features are - 

        Elevation                                
        Aspect                                    
        Slope                                     
        Horizontal_Distance_To_Hydrology         
        Vertical_Distance_To_Hydrology           
        Horizontal_Distance_To_Roadways        
        Hillshade_9am                             
        Hillshade_Noon                            
        Hillshade_3pm                             
        Horizontal_Distance_To_Fire_Points
        
 * For scope of this notebook we will be considering only numerical features and not other categorical features.
         -->

# Pipeline and model building 

* Split the under-sampled data into training and test sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X_res,y_res,test_size = 0.2)
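
One possible refinement, not used in this notebook, is to make the split reproducible and keep the classes equally represented in both parts by passing `random_state` and `stratify` to `train_test_split`. The sketch below uses toy arrays, and the seed value is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced data: 5 classes with 20 rows each.
y_toy = np.repeat([1, 2, 3, 6, 7], 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
# Each class should contribute the same number of rows to the test part.
print(np.unique(y_te, return_counts=True))
```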

In [None]:
def objective_xgb(trial):
    xgb_params = {
        'learning_rate': 0.03,
        'tree_method': 'gpu_hist',
        'booster': 'gbtree',
        'eval_metric' : 'mlogloss',
        'objective' : 'multi:softmax',
        'n_estimators': trial.suggest_int('n_estimators', 500, 1000, step=100),
#         'reg_lambda': trial.suggest_int('reg_lambda', 1, 100),
#         'reg_alpha': trial.suggest_int('reg_alpha', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.2, 0.8, step=0.1),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 1.0, step=0.1),
        'max_depth': trial.suggest_int('max_depth', 3, 10), 
#         'min_child_weight': trial.suggest_int('min_child_weight', 2, 10),
        'gamma': trial.suggest_float('gamma', 0, 1.0),
        'predictor' : 'gpu_predictor'
    }
    
    pipe = Pipeline(steps = [
    
    ('step1' , StandardScaler()),
    ('step2' , XGBClassifier(**xgb_params))
     ])
    
    pipe.fit(x_train,y_train)
    y_pred = pipe.predict(x_test)
    return accuracy_score(y_test,y_pred)
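
Depending on the installed XGBoost version, the classifier may reject labels that are not consecutive integers starting at 0, which is the case here once categories 4 and 5 are gone (the remaining labels are 1, 2, 3, 6 and 7). If that happens, a minimal workaround is to encode the labels before fitting and decode the predictions afterwards. The toy array below is an assumption for illustration.

```python
import numpy as np

# Toy labels with the same gaps as the filtered target.
y_toy = np.array([1, 2, 3, 6, 7, 6, 1])

# classes: sorted unique labels; y_encoded: position of each label in `classes`.
classes, y_encoded = np.unique(y_toy, return_inverse=True)
print(classes)
print(y_encoded)

# After predicting encoded labels, map them back to the original classes.
y_decoded = classes[y_encoded]
```

On the real data, the model would be fitted on the encoded labels and its predictions mapped back through `classes` before writing the submission.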

In [None]:
study_xgb= optuna.create_study(direction = 'maximize')
study_xgb.optimize(objective_xgb, n_trials=50)

In [None]:
best_params_xgb = study_xgb.best_params

* The pipeline below first standardizes the input features and then uses XGBClassifier to predict the target variable.

In [None]:
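# best_params holds only the tuned values, so re-apply the fixed settings used during tuning.
fixed_params = {'learning_rate': 0.03, 'tree_method': 'gpu_hist', 'booster': 'gbtree',
                'eval_metric': 'mlogloss', 'objective': 'multi:softmax', 'predictor': 'gpu_predictor'}
pipe = Pipeline(steps = [
    ('step1' , StandardScaler()),
    ('step2' , XGBClassifier(**fixed_params, **best_params_xgb))
     ])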

In [None]:
pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)
print(accuracy_score(y_test,y_pred))

In [None]:
df_test = df_test.drop(columns = ['Id' , 'Soil_Type7' , 'Soil_Type15'])
Final_pred = pipe.predict(df_test)
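
Before predicting, it can be worth checking that the test features have the same columns, in the same order, as the training features. The sketch below shows one way to do that on toy frames. The names `train_features`, `test_features` and `test_aligned` are assumptions and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins with the same columns in a different order.
train_features = pd.DataFrame({"Elevation": [1, 2], "Slope": [3, 4]})
test_features = pd.DataFrame({"Slope": [5, 6], "Elevation": [7, 8]})

# Fail early if the test data lacks a training column, then match the order.
missing = set(train_features.columns) - set(test_features.columns)
assert not missing, f"test data is missing columns: {missing}"
test_aligned = test_features[train_features.columns]
print(list(test_aligned.columns))
```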

In [None]:
submission['Cover_Type'] = Final_pred
submission.to_csv('Submission.csv' , index=False)

Thanks for reading this far. If you found it helpful or interesting, please consider leaving a comment.

If you have any corrections or suggestions, you can let me know in the comments as well.

Thanks.