# Data Description

For this competition, you will be predicting a categorical target based on a number of feature columns given in the data. The data is synthetically generated by a GAN that was trained on the data from the Forest Cover Type Prediction competition. This dataset is (a) much larger, and (b) may or may not have the same relationship to the target as the original data.

Please refer to this data page for a detailed explanation of the features.

Files
* train.csv - the training data with the target Cover_Type column
* test.csv - the test set; you will be predicting the Cover_Type for each row in this file (the target integer class)
* sample_submission.csv - a sample submission file in the correct format


From the competition data page.
***************************************************************************


# Background of this notebook

According to [the discussion board](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/295617), deep neural networks work well on this competition's data, so this is a good opportunity to practice writing a multiclass classification NN model.

My best score so far was 0.95516 with XGBClassifier in [this notebook](https://www.kaggle.com/satoshiss/tps-december-xgbclassifier).

Let's see how it goes.
****************************************************************

# Import libraries and Load Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from functools import partial
import optuna
import warnings
warnings.filterwarnings('ignore')


from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

from sklearn.model_selection import StratifiedKFold

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import gc
from scipy import stats

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv',index_col = 'Id')
df_test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv',index_col='Id')

In [None]:
# The pseudolabels file was made with this notebook: https://www.kaggle.com/remekkinas/tps-12-pseudolabels-for-classification-tutorial/notebook

pseudo_df = pd.read_csv('../input/tbsdexxgbclassifierprediction/tps12-pseudolabels.csv',index_col ="Id")

new_df_train = pd.concat([df_train,pseudo_df],axis =0)
# reset_index returns a new DataFrame, so assign the result back
new_df_train = new_df_train.reset_index(drop=True)

In [None]:
del df_train

In [None]:
# Reduce memory usage by downcasting numeric columns to smaller dtypes
# from the discussion board (https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/291844)
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
new_df_train = reduce_mem_usage(new_df_train)
df_test = reduce_mem_usage(df_test)
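
To see what downcasting buys on a small scale, here is a minimal sketch on a made-up two-column frame. It uses `pd.to_numeric(..., downcast="integer")` rather than the `reduce_mem_usage` helper above, and the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): both columns start as 64-bit integers.
toy = pd.DataFrame({
    "binary_flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "distance": np.array([-200, 0, 150, 4000], dtype=np.int64),
})
bytes_before = int(toy.memory_usage(index=False).sum())

# downcast="integer" picks the smallest signed integer dtype
# that can hold every value in the column.
for col in toy.columns:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

bytes_after = int(toy.memory_usage(index=False).sum())
print(toy.dtypes)
print(f"{bytes_before} bytes -> {bytes_after} bytes")
```

One-hot columns such as the soil types only hold 0 and 1, so they shrink the most; wide-ranging distance columns usually end up as 16-bit integers.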

**************************************************************************
# 1. Exploratory Data Analysis and Cleaning

This is my third time joining the Tabular Playground Series. I did not spend much time on exploratory data analysis in the last two competitions, so this time I will put more effort into it in order to do effective feature engineering later on. To do so, I referred to the [Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability) course on Kaggle.


In [None]:
#Soil_Type7 and Soil_Type15 contain only zeros, so they carry no information. Drop both columns.

new_df_train = new_df_train.drop(['Soil_Type7','Soil_Type15'],axis=1)
df_test = df_test.drop(['Soil_Type7','Soil_Type15'],axis=1)


In [None]:
# Cover_Type (Target) distribution. 
new_df_train.Cover_Type.value_counts()

In [None]:
# Cover_Type 5 has only one sample in this data, so drop it.
new_df_train = new_df_train[new_df_train.Cover_Type != 5]

In [None]:
#separate targets and features

targets = new_df_train.Cover_Type
features = new_df_train.drop(['Cover_Type'],axis=1)
features = reduce_mem_usage(features)



In [None]:
del new_df_train

In [None]:
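# Map the remaining class labels to consecutive integers starting at 0
encoder = LabelEncoder()
targets = pd.Series(encoder.fit_transform(targets), index=targets.index, name=targets.name)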
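
The encoding step matters because the network's output layer indexes classes from 0, while the labels left after dropping class 5 have a gap. The sketch below mimics the mapping with plain NumPy (`np.unique` standing in for `LabelEncoder`); the label values are made up for illustration.

```python
import numpy as np

# Hypothetical labels: the original classes 1-7 with class 5 removed.
cover_type = np.array([1, 2, 2, 7, 3, 6, 4, 1])

# np.unique returns the sorted distinct classes and, with return_inverse=True,
# the position of each label in that sorted list (0-based codes).
classes, encoded = np.unique(cover_type, return_inverse=True)

# Indexing the class list with the codes restores the original labels,
# which is what inverse_transform does before the submission is written.
decoded = classes[encoded]

print(classes)   # the six classes that remain
print(encoded)   # codes in the range 0..5
```

Six distinct codes remain, which is why the final `Dense` layer of the model needs six units.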

In [None]:
#Wrap Aspect values into the 0 to 359 degree range
#Extra feature engineering from the discussion board https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373

# Use .loc with a boolean mask; chained indexing (df[col][mask] += ...) may modify a temporary copy
features.loc[features["Aspect"] < 0, "Aspect"] += 360
features.loc[features["Aspect"] > 359, "Aspect"] -= 360

df_test.loc[df_test["Aspect"] < 0, "Aspect"] += 360
df_test.loc[df_test["Aspect"] > 359, "Aspect"] -= 360


features.loc[features["Hillshade_9am"] < 0, "Hillshade_9am"] = 0
df_test.loc[df_test["Hillshade_9am"] < 0, "Hillshade_9am"] = 0

features.loc[features["Hillshade_Noon"] < 0, "Hillshade_Noon"] = 0
df_test.loc[df_test["Hillshade_Noon"] < 0, "Hillshade_Noon"] = 0

features.loc[features["Hillshade_3pm"] < 0, "Hillshade_3pm"] = 0
df_test.loc[df_test["Hillshade_3pm"] < 0, "Hillshade_3pm"] = 0

features.loc[features["Hillshade_9am"] > 255, "Hillshade_9am"] = 255
df_test.loc[df_test["Hillshade_9am"] > 255, "Hillshade_9am"] = 255

features.loc[features["Hillshade_Noon"] > 255, "Hillshade_Noon"] = 255
df_test.loc[df_test["Hillshade_Noon"] > 255, "Hillshade_Noon"] = 255

features.loc[features["Hillshade_3pm"] > 255, "Hillshade_3pm"] = 255
df_test.loc[df_test["Hillshade_3pm"] > 255, "Hillshade_3pm"] = 255
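
The same two clean-up ideas, wrapping an angle and capping a bounded value, can be written more compactly. This is a small sketch on made-up arrays, not a drop-in replacement for the cell above.

```python
import numpy as np

# Hypothetical raw values, including some outside the valid ranges.
aspect = np.array([-33, 0, 359, 360, 400, -400])
hillshade = np.array([-5, 100, 255, 260])

# Modulo wraps any angle into [0, 360), including values that are
# more than one full turn out of range.
aspect_wrapped = np.mod(aspect, 360)

# np.clip caps values at both ends in a single call.
hillshade_clipped = np.clip(hillshade, 0, 255)

print(aspect_wrapped)
print(hillshade_clipped)
```

A single shift by 360 only repairs values that are less than one turn out of range (for example, -400 + 360 is still negative), whereas the modulo handles every case.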



In [None]:
#some more feature engineering: straight-line and Manhattan distance to the nearest water
# from this discussion https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293612


features['Euclidean_Distance_to_Hydrology'] =  ((features['Horizontal_Distance_To_Hydrology']).astype(np.int32)**2 + (features['Vertical_Distance_To_Hydrology']).astype(np.int32)**2)**0.5


features['Manhattan_Distance_to_Hydrology'] = np.abs(features['Horizontal_Distance_To_Hydrology']) + np.abs(features['Vertical_Distance_To_Hydrology'])


df_test['Euclidean_Distance_to_Hydrology'] =  ((df_test['Horizontal_Distance_To_Hydrology']).astype(np.int32)**2 + (df_test['Vertical_Distance_To_Hydrology']).astype(np.int32)**2)**0.5

df_test['Manhattan_Distance_to_Hydrology'] = np.abs(df_test['Horizontal_Distance_To_Hydrology']) + np.abs(df_test['Vertical_Distance_To_Hydrology'])
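
A short worked example of the two distance features, and of why the columns are cast to `int32` before squaring. The numbers are invented; the point is that `reduce_mem_usage` may have stored these columns as 16-bit integers, which cannot hold a square such as 300² = 90000.

```python
import numpy as np

# Hypothetical horizontal and vertical offsets stored as 16-bit integers.
horizontal = np.array([3, 300], dtype=np.int16)
vertical = np.array([-4, -400], dtype=np.int16)

# Widen to int32 before squaring so the squares cannot overflow.
h32 = horizontal.astype(np.int32)
v32 = vertical.astype(np.int32)

euclidean = np.sqrt(h32**2 + v32**2)   # straight-line distance
manhattan = np.abs(h32) + np.abs(v32)  # sum of the absolute offsets

# For comparison: squaring directly in int16 wraps around silently.
wrapped = horizontal**2

print(euclidean, manhattan)
print(wrapped[1], "instead of", 300**2)
```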

In [None]:
features

In [None]:
#extra feature engineering from the discussion board.
#count of active soil_type and wilderness_area flags per row https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/292823


feature_list = features.columns
soil_features = [x for x in feature_list if x.startswith("Soil_Type")]
features['soil_type_count'] = features[soil_features].sum(axis=1)
df_test['soil_type_count'] =df_test[soil_features].sum(axis=1)

wilderness_features= [x for x in feature_list if x.startswith('Wilderness')]
features['wilderness_area_count']=features[wilderness_features].sum(axis=1)
df_test['wilderness_area_count'] = df_test[wilderness_features].sum(axis=1)
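
In a strict one-hot encoding every row would have exactly one active flag, so the row sum would always be 1. The toy frame below (hypothetical values) shows the kind of rows where the count becomes informative: none or several flags set.

```python
import pandas as pd

# Hypothetical rows with 1, 0, 1 and 2 active soil flags respectively.
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0, 1],
    "Soil_Type2": [0, 0, 1, 1],
    "Wilderness_Area1": [1, 1, 0, 0],
})

# Select the one-hot group by column prefix, then count active flags per row.
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
toy["soil_type_count"] = toy[soil_cols].sum(axis=1)

print(toy)
```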

In [None]:
#Scaling the values.


from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
#scaler = preprocessing.StandardScaler()

numeric_features = features.columns[0:11].to_list() + features.columns[-4:].to_list()

features[numeric_features] = scaler.fit_transform(features[numeric_features])
df_test[numeric_features] = scaler.transform(df_test[numeric_features])
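
The scaler is fitted on the training features only and then reused on the test set. The sketch below spells out the min-max formula with NumPy on made-up numbers, assuming the default output range of 0 to 1.

```python
import numpy as np

# Hypothetical values of one feature in train and test.
train_col = np.array([0.0, 5.0, 10.0])
test_col = np.array([-5.0, 2.5, 15.0])

# "Fit": remember the minimum and maximum of the training column only.
col_min, col_max = train_col.min(), train_col.max()

def min_max(x):
    """Rescale x with the statistics learned from the training column."""
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max(train_col)
test_scaled = min_max(test_col)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1]. That is expected and is preferable to fitting a second scaler on the test set, which would put the two sets on different scales.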

In [None]:
features

In [None]:
#Separate data into train and validation data
#NN model input should be array. 
#X_train,X_val,y_train,y_val = train_test_split(features.values,targets.values, random_state=15)

# Building the Model and Making Predictions


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import plot_model

In [None]:
def build_model():
    model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=[features.shape[1]]),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dense(256, kernel_initializer="lecun_normal", activation='selu'),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dense(128,kernel_initializer="lecun_normal" ,activation='selu'),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dense(64,kernel_initializer="lecun_normal", activation='selu'),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dense(units=6,activation='softmax')
       ])
      
    model.compile(loss='sparse_categorical_crossentropy',
             optimizer= 'adam',
             metrics=['accuracy'])
    
    return model
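
To make the output layer and the loss concrete, here is a NumPy sketch of what a six-way softmax and the sparse categorical cross-entropy compute for individual samples. The logits and labels are invented, and this is only the per-sample arithmetic, not Keras's implementation (which also averages over the batch).

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Negative log-probability of the true class, given integer labels."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked)

# Two hypothetical samples with six class scores each.
logits = np.array([[2.0, 1.0, 0.1, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 3])  # integer-encoded classes, as LabelEncoder produces

probs = softmax(logits)
losses = sparse_categorical_crossentropy(probs, labels)
print(probs.round(3))
print(losses.round(3))
```

The "sparse" variant takes the integer labels directly, so the targets do not have to be one-hot encoded first. An uninformed model that spreads its probability evenly over six classes scores ln 6 per sample.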

      

In [None]:
# setting is from  this notebook https://www.kaggle.com/balamurugan1603/tps-dec-21-nn-feature-engg-tf

from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping


reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=5
)

early_stop = EarlyStopping(
    monitor="val_accuracy",
    patience=20,
    restore_best_weights=True
)

callbacks = [reduce_lr, early_stop]
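
The learning-rate schedule can be illustrated without TensorFlow. The function below is a simplified, hypothetical simulation of the plateau rule (it ignores options such as `min_delta`, `cooldown` and `min_lr`), and it assumes a starting learning rate of 0.001.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and cut the rate once
            # `patience` stagnant epochs have accumulated.
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses: two improving epochs, then a plateau.
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
lr_history = simulate_reduce_on_plateau(val_losses)
print(lr_history)
```

Note that early stopping follows the same counting idea: with `patience=20` it can only trigger when training runs for more than 20 epochs.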

In [None]:
# referred to this notebook  https://www.kaggle.com/hamzaghanmi/tps-dec-step-by-step/notebook

preds = []

kf = StratifiedKFold(n_splits=23, random_state=2,shuffle=True)
acc = []
n=0

for trn_idx, test_idx in kf.split(features,targets):
    X_tr,X_val = features.iloc[trn_idx].values, features.iloc[test_idx].values
    y_tr,y_val = targets.iloc[trn_idx].values, targets.iloc[test_idx].values
    
    model = build_model()
    model.fit(X_tr,y_tr,
                    epochs=10,
                    batch_size=2021,
                    verbose=False,
                    callbacks=callbacks,
                    validation_data=(X_val,y_val))
    
    preds.append(model.predict(df_test))
    
    pre = np.argmax(model.predict(X_val),axis=1)
    
    acc.append(accuracy_score(y_val,pre))
                    
    print(f"fold: {n+1} , accuracy: {round(acc[n]*100,3)}")
    n+=1
                                                    
    del X_tr,X_val,y_tr, y_val
    gc.collect()
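
What stratification does is easiest to see on a tiny imbalanced target. The sketch below uses made-up data with a 2:1 class ratio and checks the class counts in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 8 samples of class 0 and 4 samples of class 1.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)

fold_class_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each class landed in this validation fold.
    fold_class_counts.append(np.bincount(y[val_idx], minlength=2).tolist())

print(fold_class_counts)
```

Every validation fold keeps the 2:1 ratio of the full data, which matters for this competition because some cover types are far rarer than others.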
                                                    
    
    

In [None]:
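# Average the class probabilities predicted by each fold's model (soft voting).
# Taking stats.mode over raw float probabilities does not give a meaningful vote.
predictions = np.mean(preds, axis=0)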


In [None]:
preds = np.argmax(predictions,axis=1)
preds = encoder.inverse_transform(preds)
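
Per-fold probability arrays are usually combined in one of two ways: averaging the probabilities (soft voting) or taking a majority vote over each fold's predicted class (hard voting). The sketch below compares both on invented numbers for three folds, two test rows and three classes.

```python
import numpy as np

# Hypothetical per-fold probabilities, shape (n_folds, n_samples, n_classes).
fold_probs = np.array([
    [[0.60, 0.30, 0.10], [0.2, 0.5, 0.3]],
    [[0.40, 0.50, 0.10], [0.1, 0.3, 0.6]],
    [[0.45, 0.50, 0.05], [0.2, 0.2, 0.6]],
])
n_classes = fold_probs.shape[2]

# Soft voting: average the probabilities over folds, then pick the best class.
soft_vote = np.argmax(fold_probs.mean(axis=0), axis=1)

# Hard voting: pick each fold's class first, then take the most frequent one.
fold_labels = np.argmax(fold_probs, axis=2)  # shape (n_folds, n_samples)
hard_vote = np.apply_along_axis(
    lambda votes: np.bincount(votes, minlength=n_classes).argmax(),
    0,
    fold_labels,
)

print(soft_vote, hard_vote)
```

The two schemes disagree on the first row: two folds narrowly prefer class 1, but one fold is confident enough in class 0 to win the average. In both cases the resulting integer codes still have to go through `inverse_transform` to get back the original `Cover_Type` labels.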


In [None]:
index = pd.read_csv("../input/tabular-playground-series-dec-2021/sample_submission.csv")
index['Cover_Type'] = preds
index.to_csv('submission.csv',index=False)

In [None]:
index.head(10)