**Tabular Playground Series by Kaggle**

***

December 2021

***

**Created by Berkay Alan**

## Case

**Dataset:**

https://www.kaggle.com/c/tabular-playground-series-dec-2021

***

**Description**

For this competition, you will be predicting a multi-class target, `Cover_Type` (the forest cover type), from the feature columns listed below.

The data is synthetically generated by a GAN that was trained on a real-world dataset, the one used in the original Forest Cover Type Prediction competition.

**Files**

train.csv - the training data with the target `Cover_Type` column

test.csv - the test set; you will be predicting the `Cover_Type` for each row in this file (the target integer class)

sample_submission.csv - a sample submission file in the correct format

***

**Columns**

Elevation - Elevation in meters

Aspect - Aspect in degrees azimuth

Slope - Slope in degrees

Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features

Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features

Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway

Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice

Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice

Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice

Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points

Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation

Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation

Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

1 - Rawah Wilderness Area

2 - Neota Wilderness Area

3 - Comanche Peak Wilderness Area

4 - Cache la Poudre Wilderness Area

The soil types are:

1 Cathedral family - Rock outcrop complex, extremely stony.

2 Vanet - Ratake families complex, very stony.

3 Haploborolis - Rock outcrop complex, rubbly.

4 Ratake family - Rock outcrop complex, rubbly.

5 Vanet family - Rock outcrop complex, rubbly.

6 Vanet - Wetmore families - Rock outcrop complex, stony.

7 Gothic family.

8 Supervisor - Limber families complex.

9 Troutville family, very stony.

10 Bullwark - Catamount families - Rock outcrop complex, rubbly.

11 Bullwark - Catamount families - Rock land complex, rubbly.

12 Legault family - Rock land complex, stony.

13 Catamount family - Rock land - Bullwark family complex, rubbly.

14 Pachic Argiborolis - Aquolis complex.

15 unspecified in the USFS Soil and ELU Survey.

16 Cryaquolis - Cryoborolis complex.

17 Gateview family - Cryaquolis complex.

18 Rogert family, very stony.

19 Typic Cryaquolis - Borohemists complex.

20 Typic Cryaquepts - Typic Cryaquolls complex.

21 Typic Cryaquolls - Leighcan family, till substratum complex.

22 Leighcan family, till substratum, extremely bouldery.

23 Leighcan family, till substratum - Typic Cryaquolls complex.

24 Leighcan family, extremely stony.

25 Leighcan family, warm, extremely stony.

26 Granile - Catamount families complex, very stony.

27 Leighcan family, warm - Rock outcrop complex, extremely stony.

28 Leighcan family - Rock outcrop complex, extremely stony.

29 Como - Legault families complex, extremely stony.

30 Como family - Rock land - Legault family complex, extremely stony.

31 Leighcan - Catamount families complex, extremely stony.

32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.

33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.

34 Cryorthents - Rock land complex, extremely stony.

35 Cryumbrepts - Rock outcrop - Cryaquepts complex.

36 Bross family - Rock land - Cryumbrepts complex, extremely stony.

37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.

38 Leighcan - Moran families - Cryaquolls complex, extremely stony.

39 Moran family - Cryorthents - Leighcan family complex, extremely stony.

40 Moran family - Cryorthents - Rock land complex, extremely stony.


## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from matplotlib import font_manager as fm
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler,RobustScaler
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
import time
from joblib import dump, load
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

## Functions

In [None]:
def label_encoder(dataframe,column):
    le = preprocessing.LabelEncoder()
    dataframe[column] = le.fit_transform(list(dataframe[column].values))
    return dataframe[column]

In [None]:
def addFeature(X):
    """Add row-wise aggregate features to X in place.

    Relies on the global lists soil_features, wilderness_features and
    features_Hillshade, which must be defined before this function is called.
    """
    # Thanks @mpwolke : https://www.kaggle.com/mpwolke/tooezy-where-are-you-no-camping-here
    X["Soil_Count"] = X[soil_features].sum(axis=1)

    # Thanks @yannbarthelemy : https://www.kaggle.com/yannbarthelemy/tps-december-first-simple-feature-engineering
    X["Wilderness_Area_Count"] = X[wilderness_features].sum(axis=1)
    X["Hillshade_mean"] = X[features_Hillshade].mean(axis=1)
    X['amp_Hillshade'] = X[features_Hillshade].max(axis=1) - X[features_Hillshade].min(axis=1)

## Reading Files

In [None]:
train = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-dec-2021/test.csv")
sample_submission = pd.read_csv("../input/tabular-playground-series-dec-2021/sample_submission.csv")

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
train.shape

In [None]:
train["Id"].nunique()

In [None]:
train.isna().sum()

In [None]:
train["Elevation"].value_counts()

In [None]:
train["Cover_Type"].value_counts()

In [None]:
train["Cover_Type"].hist();

In [None]:
plt.figure(figsize=(10,10))

plt.rcParams['font.size'] = 20

plt.pie(train["Cover_Type"].value_counts().values,labels=train["Cover_Type"].value_counts().index, autopct="%1.1f%%")

plt.legend(title="Cover Types")

plt.title("Distribution of Cover Types")

plt.show()

In [None]:
test.shape

In [None]:
test.isna().sum()

## Exploratory Data Analysis

In [None]:
train.head()

In [None]:
plt.bar(train.Cover_Type.value_counts().keys(),train.Cover_Type.value_counts().values,color="r")

plt.title("Cover Type Distribution")
plt.xlabel("Cover Type")
plt.ylabel("Number of Observations")

plt.tight_layout()
plt.grid(False)

plt.show()

## Feature Engineering

### Aspect

Aspect is the compass direction that a terrain faces. Here it is expressed in degrees (the sexagesimal system), so the angle should lie in the range [0, 359]. In this feature, however, some values are less than 0 and some are greater than 359. It is better to fix those values so that they lie in the expected range. This is fairly easy to do in this case because, on closer inspection, all the values in this column lie in the range (-360, 720). So adding 360 to angles smaller than 0 and subtracting 360 from angles greater than 359 does the job. This is how it should be:

[Credit](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373)

In [None]:
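# Use .loc with a boolean mask instead of chained indexing,
# which may fail to modify the original DataFrame.
train.loc[train["Aspect"] < 0, "Aspect"] += 360
train.loc[train["Aspect"] > 359, "Aspect"] -= 360

test.loc[test["Aspect"] < 0, "Aspect"] += 360
test.loc[test["Aspect"] > 359, "Aspect"] -= 360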
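
The same wrap-around can also be written with modulo arithmetic. The sketch below uses made-up values rather than the competition data; for integer angles in the range (-360, 720) it should give the same result as the two-step correction above.

```python
import pandas as pd

# Toy values, not taken from the competition data.
aspect = pd.Series([-10, 0, 45, 359, 360, 400, 719])

# The remainder modulo 360 wraps every angle into [0, 359];
# with a positive divisor the result of % is never negative.
wrapped = aspect % 360

print(wrapped.tolist())
```

On the real data this would be applied as `train["Aspect"] % 360`, assigned back to the column.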

### Hillshade

The next three features are the Hillshade features. Hillshade is, basically, a 3D representation of a surface. It is created by measuring the luminosity of patches of terrain when a light source is cast at a particular angle. It is a shade of grey, so all the values must lie in the range [0, 255], which is also what the data description in the original competition says. However, in both the train and test datasets, there are certain rows with a hillshade value greater than 255 or less than 0. This may be the result of a recording error. It seems that the negative values refer to the darkest shade, which has the value 0, and the values greater than 255 refer to the brightest shade, which has the value 255. Hence, it is better to replace all the negative values with 0 and all values greater than 255 with 255. Here is how it should be:

[Credit](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373)

In [None]:
train.loc[train["Hillshade_9am"] < 0, "Hillshade_9am"] = 0
test.loc[test["Hillshade_9am"] < 0, "Hillshade_9am"] = 0

train.loc[train["Hillshade_Noon"] < 0, "Hillshade_Noon"] = 0
test.loc[test["Hillshade_Noon"] < 0, "Hillshade_Noon"] = 0

train.loc[train["Hillshade_3pm"] < 0, "Hillshade_3pm"] = 0
test.loc[test["Hillshade_3pm"] < 0, "Hillshade_3pm"] = 0

train.loc[train["Hillshade_9am"] > 255, "Hillshade_9am"] = 255
test.loc[test["Hillshade_9am"] > 255, "Hillshade_9am"] = 255

train.loc[train["Hillshade_Noon"] > 255, "Hillshade_Noon"] = 255
test.loc[test["Hillshade_Noon"] > 255, "Hillshade_Noon"] = 255

train.loc[train["Hillshade_3pm"] > 255, "Hillshade_3pm"] = 255
test.loc[test["Hillshade_3pm"] > 255, "Hillshade_3pm"] = 255
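
The same clamping can be expressed more compactly with the pandas `clip` method. This is a minimal sketch on a made-up frame (the rows are illustrative, not real competition data); the column names match the ones used above.

```python
import pandas as pd

# Toy frame with a few out-of-range values (not real competition rows).
df = pd.DataFrame({
    "Hillshade_9am": [-5, 120, 260],
    "Hillshade_Noon": [0, 255, 300],
    "Hillshade_3pm": [-1, 99, 254],
})

hillshade_cols = ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]

# clip() replaces values below 0 with 0 and values above 255 with 255,
# leaving everything already inside the range untouched.
df[hillshade_cols] = df[hillshade_cols].clip(lower=0, upper=255)

print(df)
```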

In [None]:
features_Hillshade = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
soil_features = [x for x in train.columns if x.startswith("Soil_Type")]
wilderness_features = [x for x in train.columns if x.startswith("Wilderness_Area")]

In [None]:
addFeature(train)
addFeature(test)

[Credit](https://www.kaggle.com/chryzal/features-engineering-for-you)

### Soil

In [None]:
train.drop(["Soil_Type7", "Id", "Soil_Type15"], axis=1, inplace=True)
test.drop(["Soil_Type7", "Id", "Soil_Type15"], axis=1, inplace=True)

In [None]:
train = train[train.Cover_Type != 5]

### Creating distance based features

In [None]:
# Manhattan distance to Hydrology
train["mnhttn_dist_hydrlgy"] = np.abs(train["Horizontal_Distance_To_Hydrology"]) + np.abs(train["Vertical_Distance_To_Hydrology"])
test["mnhttn_dist_hydrlgy"] = np.abs(test["Horizontal_Distance_To_Hydrology"]) + np.abs(test["Vertical_Distance_To_Hydrology"])

# Euclidean distance to Hydrology
train["ecldn_dist_hydrlgy"] = (train["Horizontal_Distance_To_Hydrology"]**2 + train["Vertical_Distance_To_Hydrology"]**2)**0.5
test["ecldn_dist_hydrlgy"] = (test["Horizontal_Distance_To_Hydrology"]**2 + test["Vertical_Distance_To_Hydrology"]**2)**0.5
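
To make the two distance formulas concrete, here is a small worked example with NumPy. The numbers are chosen for easy arithmetic and are not taken from the dataset; `horizontal` and `vertical` are illustrative stand-ins for the two hydrology distance columns.

```python
import numpy as np

# Illustrative horizontal and vertical distances (vertical can be negative).
horizontal = np.array([3.0, 0.0, 30.0])
vertical = np.array([-4.0, 5.0, 40.0])

# Manhattan distance: sum of the absolute components.
manhattan = np.abs(horizontal) + np.abs(vertical)

# Euclidean distance: straight-line length of the (horizontal, vertical) vector.
euclidean = np.sqrt(horizontal**2 + vertical**2)

print(manhattan)
print(euclidean)
```

The Manhattan distance is never smaller than the Euclidean one, and the two coincide when one of the components is zero.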

## Scaling the Data

In [None]:
X_train = train.drop("Cover_Type",axis=1).values
y_train = train.Cover_Type.values

In [None]:
X_train

In [None]:
sc = RobustScaler()
train_scaled = sc.fit_transform(X_train)

In [None]:
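# The scaler was fitted on a NumPy array, so pass an array here as well
# (test has the same columns, in the same order, as train without Cover_Type).
test_scaled = sc.transform(test.values)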
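
With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range, both learned from the data passed to `fit`. The sketch below reproduces that arithmetic by hand with NumPy on toy numbers (not the competition data) and reuses the training statistics for the new rows, which is what fitting on the training set and then transforming the test set amounts to.

```python
import numpy as np

# Toy training column containing one large outlier.
X_fit = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
# Toy "test" rows that are scaled with the training statistics.
X_new = np.array([[3.0], [5.0]])

# Statistics are computed per column, on the training data only.
median = np.median(X_fit, axis=0)
q25, q75 = np.percentile(X_fit, [25, 75], axis=0)
iqr = q75 - q25

fit_scaled = (X_fit - median) / iqr
new_scaled = (X_new - median) / iqr

print(fit_scaled.ravel())
print(new_scaled.ravel())
```

Because the median and the interquartile range are barely affected by the outlier, the ordinary values stay in a narrow range around zero while the outlier remains clearly separated.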

## Logistic Regression

In [None]:
logistic_regression = LogisticRegression(random_state=0,solver="liblinear").fit(train_scaled,y_train)

In [None]:
logistic_regression

In [None]:
#saving the model
#dump(logistic_regression,"logistic_Regression_model.joblib")

In [None]:
logistic_regression.intercept_

In [None]:
logistic_regression.coef_[:1]

In [None]:
y_pred = logistic_regression.predict(test_scaled)

In [None]:
pd.DataFrame(y_pred).value_counts()

In [None]:
sample_submission.Cover_Type = y_pred

## Neural Networks

In [None]:
clf = MLPClassifier(solver='adam', alpha=0.001,hidden_layer_sizes=(10, 3),
                    max_iter=150,activation="tanh", random_state=1)

In [None]:
start_time = time.time()

clf.fit(train_scaled,y_train)

elapsed_time = time.time() - start_time

print(f"Elapsed time for Neural Networks: "
      f"{elapsed_time:.3f} seconds")

In [None]:
y_pred = clf.predict(test_scaled)

In [None]:
pd.DataFrame(y_pred).value_counts()

In [None]:
sample_submission.Cover_Type = y_pred

## Random Forests

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)

In [None]:
start_time = time.time()

clf.fit(train_scaled,y_train)

elapsed_time = time.time() - start_time

print(f"Elapsed time for Random Forests: "
      f"{elapsed_time:.3f} seconds")

In [None]:
y_pred = clf.predict(test_scaled)

In [None]:
pd.DataFrame(y_pred).value_counts()

In [None]:
sample_submission.Cover_Type = y_pred

## XGBoost - The Best Score

In [None]:
train.head()

In [None]:
xgb_model = XGBClassifier()

In [None]:
start_time = time.time()

xgb_model.fit(train_scaled,y_train)

elapsed_time = time.time() - start_time

print(f"Elapsed time for XGBoost: "
      f"{elapsed_time:.3f} seconds")

In [None]:
y_pred = xgb_model.predict(test_scaled)
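
Depending on the installed XGBoost version, `XGBClassifier` may reject class labels that are not consecutive integers starting at 0, and `y_train` here contains 1, 2, 3, 4, 6 and 7. If that happens, one option is to encode the labels before fitting and decode the predictions afterwards. The following is a minimal NumPy sketch on toy labels; `encoded_predictions` is a hypothetical stand-in for the model's output, not something produced by this notebook.

```python
import numpy as np

# Toy labels with the same gap as y_train after Cover_Type 5 was removed.
y = np.array([1, 2, 3, 4, 6, 7, 2, 6])

# classes holds the sorted unique labels; y_encoded holds, for each label,
# its position in classes, so the encoded values run from 0 to 5.
classes, y_encoded = np.unique(y, return_inverse=True)

# A model trained on y_encoded predicts encoded classes;
# indexing into classes maps them back to the original Cover_Type values.
encoded_predictions = np.array([0, 4, 5])  # hypothetical model output
decoded = classes[encoded_predictions]

print(y_encoded)
print(decoded)
```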

In [None]:
pd.DataFrame(y_pred).value_counts()

In [None]:
sample_submission.Cover_Type = y_pred

In [None]:
sample_submission.to_csv("xgboost_submission.csv",index=False)