# Introduction
This study examines how the number of amphibian species at a site varies with human impact on its habitat.

<font color = 'blue'>
Content:

1. [Load and Check Data](#1)
1. [Variable Description](#2)
    * [Variable Analysis](#3)
1. [Random Forest With One-Hot Encoding](#4) 
1. [Random Forest With Label Encoding](#5)


In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.python.data import Dataset
import keras
from keras.utils import to_categorical
from keras import models
from keras import layers
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

<a id = "1" ></a><br>
# Load and Check Data

In [None]:
df=pd.read_csv('../input/vericsv/veri.csv',delimiter=',')
df2 = df.copy()
df2.head()

<a id = "2" ></a><br>
# Variable Description
1. ID: Integer value, not used in calculations
1. Motorway: Categorical value, not used in calculations
1. SR: Surface of water reservoir numeric [m2]
1. NR: Number of water reservoirs in habitat
1. TR: Type of water reservoirs:
   * a. reservoirs with natural features that are natural or anthropogenic water reservoirs
   * b. recently formed reservoirs
   * c. settling ponds
   * d. water reservoirs located near houses
   * e. technological water reservoirs
   * f. water reservoirs in allotment gardens
   * g. trenches
   * h. wet meadows, flood plains, marshes
   * i. river valleys
   * j. streams and very small watercourses
   
1. VR: Presence of vegetation within the reservoirs:
   * a. no vegetation
   * b. narrow patches at the edges
   * c. areas heavily overgrown
   * d. lush vegetation within the reservoir with some part devoid of vegetation
   * e. reservoirs completely overgrown with a disappearing water table
   
1. SUR1: Dominant types of land cover surrounding the water reservoir
1. SUR2: Second most dominant types of land cover surrounding the water reservoir
1. SUR3: Third most dominant types of land cover surrounding the water reservoir
   * a. forest areas (with meadows) and densely wooded areas
   * b. areas of wasteland and meadows
   * c. allotment gardens
   * d. parks and green areas
   * e. dense building development, industrial areas
   * f. dispersed habitation, orchards, gardens
   * g. river valleys
   * h. roads, streets
   * i. agricultural land
   
1. UR: Use of water reservoirs:
   * a. unused by man (very attractive for amphibians)
   * b. recreational and scenic (care work is performed)
   * c. used economically (often fish farming)
   * d. technological
   
1. FR: The presence of fishing:
   * a. lack of or occasional fishing
   * b. intense fishing
   * c. breeding reservoirs
   
1. OR: Percentage access from the edges of the reservoir to undeveloped areas:
   * a. 0-25%: lack of access or poor access
   * b. 25-50%: low access
   * c. 50-75%: medium access
   * d. 75-100%: large access to terrestrial habitats
   
1. RR: Minimum distance from the water reservoir to roads:
   * a. <50 m
   * b. 50-100 m
   * c. 100-200 m
   * d. 200-500 m
   * e. 500-1000 m
   * f. >1000 m
   
1. BR: Minimum distance to buildings:
   * a. <50 m
   * b. 50-100 m
   * c. 100-200 m
   * d. 200-500 m
   * e. 500-1000 m
   * f. >1000 m
   
1. MR: Maintenance status of the reservoir:
   * a. Clean
   * b. slightly littered
   * c. reservoirs heavily or very heavily littered
   
1. CR: Type of shore
   * a. Natural
   * b. Concrete
   
1. Green frogs: the presence of Green frogs
1. Brown frogs: the presence of Brown frogs
1. Common toad: the presence of Common toad
1. Fire-bellied toad: the presence of Fire-bellied toad
1. Tree frog: the presence of Tree frog
1. Common newt: the presence of Common newt
1. Great crested newt: the presence of Great crested newt

<a id="3" ></a><br>
# Variable Analysis
* Categorical Variable: Motorway, TR, VR, SUR1, SUR2, SUR3, UR, FR, OR, RR, BR, MR, CR, Green frogs, Brown frogs, Common toad, Fire-bellied toad, Tree frog, Common newt, Great crested newt

* Numerical Variable: ID, SR, NR

In this section we can see the frequency of all categorical values.

In [None]:
def bar_plot(variable):
    # get feature
    var = df2[variable]
    # count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (6,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category1 = ["TR","VR", "UR", "FR", "OR", "RR", "BR", "MR", "CR","SUR1", "SUR2", "SUR3"]
for c in category1:
    bar_plot(c)

* SUR1: {1:'forest areas', 2:'meadows', 4:'gardens', 6:'industrial areas', 7:'orchards', 9:'roads', 10:'river valleys', 14:'agricultural'}
* SUR2: {1:'forest areas', 2:'meadows', 6:'industrial areas', 7:'orchards', 9:'roads', 10:'river valleys', 11:'agricultural'}
* SUR3: {1:'forest areas', 2:'meadows', 5:'parks', 6:'industrial areas', 7:'orchards', 9:'roads', 10:'river valleys', 11:'agricultural'}

In [None]:
category2 = ["SUR1", "SUR2", "SUR3"]
for c in category2:
    print("{} \n".format(df2[c].value_counts()))

To estimate the number of species at each site, we sum the presence columns of the seven amphibian taxa (frogs, toads and newts).
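
As a small illustration of this step, the sketch below sums presence flags row by row with `DataFrame.sum(axis=1)`. The `toy` frame and its values are made up for the example and only three of the seven taxa are included; the real columns come from `veri.csv`.

```python
import pandas as pd

# Hypothetical toy rows; 1 = taxon present at the site, 0 = absent.
species_cols = ["Green frogs", "Brown frogs", "Common toad"]
toy = pd.DataFrame({
    "Green frogs": [1, 0, 1],
    "Brown frogs": [1, 0, 0],
    "Common toad": [1, 0, 1],
})

# Row-wise sum of the presence flags gives the number of taxa per site.
toy["Species count"] = toy[species_cols].sum(axis=1)
print(toy["Species count"].tolist())
```

Keeping the column names in a list also makes it easy to drop exactly the same columns afterwards.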

In [None]:
df2['Species count'] = df2['Green frogs']+ df2['Brown frogs'] + df2['Common toad']+df2['Fire-bellied toad']+df2['Tree frog'] + df2['Common newt'] + df2['Great crested newt']

In [None]:
# drop the seven presence columns now that they are summarized in 'Species count'
df2=df2.drop(['Green frogs','Brown frogs','Common toad','Fire-bellied toad','Tree frog','Common newt','Great crested newt'], axis=1)

In [None]:
df2.head()

The relationship between the number of species and all other parameters.

In [None]:
df2[["SR","Species count"]].groupby(["SR"], as_index = False).mean().sort_values(by="Species count",ascending = False)

In [None]:
df2[["NR","Species count"]].groupby(["NR"], as_index = False).mean().sort_values(by="Species count",ascending = False)

TR code 7 (garden reservoirs) has a mean species count of 7, but the dataset contains only one garden reservoir, so we cannot conclude that garden reservoirs are the best reservoir type.
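
A group mean based on a single row is easy to over-read, so it may help to print the group size next to it. The sketch below does this with `groupby(...).agg(["mean", "count"])`; the `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: TR is a reservoir-type code, as in the notebook.
toy = pd.DataFrame({
    "TR": [1, 1, 1, 7, 12, 12],
    "Species count": [2, 4, 3, 7, 1, 3],
})

# Show the number of rows behind each mean so one-row groups stand out.
summary = (
    toy.groupby("TR")["Species count"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In this toy table the top-ranked group has a count of 1, which is the situation described above.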

In [None]:
df2[["TR","Species count"]].groupby(["TR"], as_index = False).mean().sort_values(by="Species count",ascending = False)

In [None]:
df2[["RR","Species count"]].groupby(["RR"], as_index = False).mean().sort_values(by="Species count",ascending = False)

In [None]:
df2[["BR","Species count"]].groupby(["BR"], as_index = False).mean().sort_values(by="Species count",ascending = False)

In [None]:
def detect_outliers(df2,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df2[c],25)
        # 3rd quartile
        Q3 = np.percentile(df2[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df2[(df2[c] < Q1 - outlier_step) | (df2[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers

The dataset contains no rows flagged as outliers and no missing values, as the next two cells show.
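
To make the rule inside `detect_outliers` concrete, here is a minimal sketch of the 1.5 × IQR test on a single made-up feature (`values` is hypothetical). The function above applies the same test to every listed column and only returns rows that fall outside the limits in more than two columns.

```python
import numpy as np

# Hypothetical values for one feature.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 50])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged by the IQR rule.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```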

In [None]:
df2.loc[detect_outliers(df2,["TR","VR","UR","FR","OR","RR","BR","CR"])]

In [None]:
df2.isnull().sum()

Here df2 is copied to df4, which will be used later for label encoding.

In [None]:
df4=df2.copy()

Next, if you take a look at the values contained within df2['TR'], you'll notice that it contains numerically-encoded categorical data. If we head over to the column descriptions on the Amphibian Dataset page, it says that:

1 = "natural reservoirs", 2 = "recently formed", 5 = "technological", 7 = "garden", 11 = "trenches", 12 = "wet meadows", 14 = "river valleys", 15 = "small watercourses".

We'll relabel the data so that the values in df2['TR'] describe what they actually represent, and then do the same for every other categorical column.

In [None]:
df2["TR"].value_counts()

In [None]:
# assign the result back; inplace=True on a selected column is not reliable in recent pandas
df2['TR'] = df2['TR'].replace({1:'natural reservoirs', 2:'recently formed', 5:'technological',
                               7:'garden', 11:'trenches', 12:'wet meadows', 14:'river valleys',
                               15:'small watercourses'})

In [None]:
df2['VR'] = df2['VR'].replace({0:'no vegetation', 1:'patches at the edges', 2:'heavily overgrown',
                               3:'some part devoid of vegetation', 4:'reservoirs completely overgrown'})

In [None]:
df2["SUR3"].value_counts()

In [None]:
df2['SUR1'] = df2['SUR1'].replace({1:'forest areas', 2:'meadows', 4:'gardens',
                                   6:'industrial areas', 10:'river valleys', 7:'orchards', 9:'roads',
                                   14:'agricultural'})

In [None]:
df2['SUR2'] = df2['SUR2'].replace({1:'forest areas', 2:'meadows', 6:'industrial areas',
                                   10:'river valleys', 7:'orchards', 9:'roads',
                                   11:'agricultural'})

In [None]:
df2['SUR3'] = df2['SUR3'].replace({1:'forest areas', 2:'meadows', 5:'parks',
                                   6:'industrial areas', 10:'river valleys', 7:'orchards', 9:'roads',
                                   11:'agricultural'})

In [None]:
df2["UR"].value_counts()

In [None]:
df2['UR'] = df2['UR'].replace({0:'unused', 1:'scenic', 3:'technological'})

In [None]:
df2["FR"].value_counts()

In [None]:
df2['FR'] = df2['FR'].replace({0:'lack', 1:'intense fishing', 2:'breeding reservoirs', 3:'remove', 4:'remove'})

In [None]:
df2["OR"].value_counts()

In [None]:
df2['OR'] = df2['OR'].replace({25:'poor access', 50:'low access', 75:'medium access', 100:'large access', 99:'remove', 80:'remove'})

In [None]:
df2["RR"].value_counts()

In [None]:
df2['RR'] = df2['RR'].replace({0:'<50 m', 1:'50-100 m', 2:'100-200 m', 5:'200-500 m', 9:'500-1000 m', 10:'>1000 m'})

In [None]:
df2["BR"].value_counts()

In [None]:
df2['BR'] = df2['BR'].replace({0:'<50 m', 1:'50-100 m', 2:'100-200 m', 5:'200-500 m', 9:'500-1000 m', 10:'>1000 m'})

In [None]:
df2["MR"].value_counts()

In [None]:
df2['MR'] = df2['MR'].replace({0:'Clean', 1:'slightly littered', 2:'heavily littered'})

In [None]:
df2["CR"].value_counts()

In [None]:
df2['CR'] = df2['CR'].replace({1:'Natural', 2:'Concrete'})

"get dummies" converts categorical data into columns containing values 0 and 1. (One-Hot Encoding)

In [None]:
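# dtype=int keeps the indicator columns as 0/1 integers (pandas 2.x defaults to booleans)
df2 = pd.get_dummies(df2, dtype=int)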

In [None]:
df2.head()
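
The relabel-then-encode pattern used above can be seen in isolation on a tiny frame. This is only a sketch: `toy` and its values are hypothetical, and `dtype=int` is chosen here so the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical toy frame: CR is a coded category, SR a numeric column.
toy = pd.DataFrame({"CR": [1, 2, 1], "SR": [600, 200, 700]})

# Relabel the codes first so the dummy columns get readable names.
toy["CR"] = toy["CR"].replace({1: "Natural", 2: "Concrete"})

# Only text/category columns are expanded; numeric columns pass through.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each dummy column is named `<variable>_<level>`, which is why the relabeled frame produces columns such as `CR_Natural`.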

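StandardScaler standardizes the numerical variables to zero mean and unit variance, which puts them on roughly the same scale as the 0/1 categorical columns. We'll split our dataframe into two other dataframes: numerical and categorical. Note that the target, Species count, is scaled as well, so the error metrics of the one-hot models below are in standardized units rather than in numbers of species.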
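
As a quick check of what the scaler does, the sketch below standardizes a made-up frame (`toy` and its values are hypothetical) and maps it back with `inverse_transform`. Passing `index` and `columns` keeps the labels that `fit_transform` would otherwise discard.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns on very different scales.
toy = pd.DataFrame({"SR": [200.0, 600.0, 700.0, 500.0], "NR": [1.0, 1.0, 2.0, 4.0]})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), index=toy.index, columns=toy.columns)

# Each column now has mean ~0 and population standard deviation ~1.
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())

# inverse_transform converts scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
```

The same `inverse_transform` call is one way to report predictions of a scaled target in its original units.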

In [None]:
numerical = df2[["SR","NR","Species count"]]

In [None]:
categorical=df2.drop(["SR","ID","NR","Species count"],axis=1)

In [None]:
scaler = StandardScaler()
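# keep the original index and column names so the later concat aligns row by row
numerical = pd.DataFrame(scaler.fit_transform(numerical), index=numerical.index, columns=numerical.columns)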

In [None]:
numerical.columns = ["SR","NR","Species count"]

In [None]:
df3 = pd.concat([numerical, categorical], axis=1, join='inner')

In [None]:
df3.head()

In [None]:
binary_data=df3.loc[:,'TR_garden':'CR_Natural']

In [None]:
%%time
for i, col in enumerate(binary_data.columns):
    plt.figure(i,figsize=(6,4))
    sns.countplot(x=col, hue=df3['Species count'] ,data=df3, palette="rainbow")
    plt.show()

In [None]:
data=df4.drop(["ID"],axis=1)

In [None]:
plt.figure(figsize=(15,8))
# numeric_only=True skips the text column Motorway, which corr() cannot handle
sns.heatmap(data.corr(numeric_only=True),cmap='magma',linecolor='white',linewidths=1,annot=False)

In [None]:
df3=df3.drop(["TR_garden","Motorway_A1","Motorway_S52","FR_remove","OR_remove"],axis=1)

In [None]:
y=df3['Species count']
x=df3.drop(["Species count"],axis=1)
X=pd.DataFrame(x)

<a id = "4" ></a><br>
# Random Forest With One-Hot Encoding

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=42)

In [None]:
rf=RandomForestRegressor(n_estimators=100,max_depth=3,random_state=42)
rf.fit(X_train,y_train)

In [None]:
mae = mean_absolute_error(rf.predict(X_test),y_test)
mse = mean_squared_error(rf.predict(X_test),y_test)
rmse = np.sqrt(mse)

print("mean absolute error: %.2f" % mae)
print("mean squared error: %.2f" % mse)
print("root mean squared error: %.2f" % rmse)

In [None]:
pip install pydotplus

In [None]:
import pydotplus

In [None]:
from ipywidgets import Image
from io import StringIO
import graphviz
from sklearn.tree import export_graphviz

In [None]:
d_tree99 = rf.estimators_[99]
dot_data = StringIO()
export_graphviz(d_tree99, feature_names = X.columns,
               out_file = dot_data, filled = True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(value = graph.create_png())

Model tuning
* This section searches for the hyperparameters with the best cross-validated score (R², the default for regressors in GridSearchCV, rather than accuracy) and then refits the model with them.
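
The search can be sketched end to end on synthetic data. Everything named `*_demo` or `Xd_*`/`yd_*` below is a stand-in invented for the example, the grid is deliberately tiny, and `scoring="neg_mean_squared_error"` is one possible choice, not necessarily the metric this notebook should use.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook uses X_train / y_train instead.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [2, 5], "n_estimators": [10, 50]},
    scoring="neg_mean_squared_error",  # explicit metric instead of the default R^2
    cv=3,
)
search.fit(Xd_train, yd_train)

# With refit=True (the default), best_estimator_ is already refit on the
# whole training split, so the best parameters need not be copied by hand.
best_rf = search.best_estimator_
print(search.best_params_, best_rf.score(Xd_test, yd_test))
```

Using `best_estimator_` avoids a mismatch between the printed best parameters and values typed into a later cell.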

In [None]:
rf_params = {"max_depth":[2,5,8,10],
             "max_features":[2,5,8],
             "n_estimators":[10,500,1000],
             "min_samples_split":[2,5,10]}

In [None]:
rf_model = RandomForestRegressor()
rf_cv_model = GridSearchCV(rf_model,
                        rf_params,
                        cv=10,
                        n_jobs=-1,
                        verbose=2)

In [None]:
rf_cv_model.fit(X_train,y_train)
print("Best parameters: " + str(rf_cv_model.best_params_))

Fit process is repeated with the best parameters

In [None]:
rf_tuned = RandomForestRegressor(max_depth=8,
                                  max_features=5,
                                  min_samples_split=5,
                                  n_estimators=10)
rf_tuned.fit(X_train,y_train)

In [None]:
# evaluate the tuned model (rf_tuned), not the initial rf
mae = mean_absolute_error(rf_tuned.predict(X_test),y_test)
mse = mean_squared_error(rf_tuned.predict(X_test),y_test)
rmse = np.sqrt(mse)

print("mean absolute error: %.2f" % mae)
print("mean squared error: %.2f" % mse)
print("root mean squared error: %.2f" % rmse)

The features are ranked by importance, that is, by how strongly each one influences the predicted number of species.
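
With one-hot encoding, the importance of an original variable is split across its dummy columns. One way to compare whole variables, sketched here on made-up importance values, is to sum the dummies by the text before the first underscore. This assumes the `<variable>_<level>` naming that `pd.get_dummies` produces and that no original variable name contains an underscore.

```python
import pandas as pd

# Hypothetical importances (in percent) for one-hot encoded columns.
importance = pd.Series({
    "SR": 30.0,
    "NR": 10.0,
    "TR_trenches": 15.0,
    "TR_wet meadows": 5.0,
    "CR_Natural": 25.0,
    "CR_Concrete": 15.0,
})

# Group by the prefix before the first underscore and add the parts up.
by_variable = (
    importance.groupby(lambda name: name.split("_")[0])
    .sum()
    .sort_values(ascending=False)
)
print(by_variable)
```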

In [None]:
Importance=pd.DataFrame({"Importance":rf_tuned.feature_importances_*100},
                       index=X_train.columns)

In [None]:
Importance.sort_values(by="Importance",
                      axis=0,
                      ascending=True).plot(kind="barh",color="green")
plt.xlabel("Importance level of values")

In [None]:
## from sklearn.metrics import r2_score
print(r2_score(y_test,rf_tuned.predict(X_test)))

In [None]:
import statsmodels.api as sm

In [None]:
model=sm.OLS(rf_tuned.predict(X_test),X_test)
model.fit().summary()

<a id = "5" ></a><br>
# Random Forest Regressor With Label Encoding

In [None]:
df4.head()

In [None]:
k=df4["Species count"]

In [None]:
m=df4.drop(["Species count","ID","Motorway"],axis=1)

In [None]:
m.head()

In [None]:
plt.figure(figsize=(12,8))
# histplot replaces the deprecated distplot
sns.histplot(k, kde=True)
plt.show()

In [None]:
m_train, m_test, k_train, k_test=train_test_split(m,k,test_size=0.3,random_state=42)

In [None]:
rf2=RandomForestRegressor(n_estimators=100,max_depth=3,random_state=42)
rf2.fit(m_train,k_train)

In [None]:
mae = mean_absolute_error(rf2.predict(m_test),k_test)
mse = mean_squared_error(rf2.predict(m_test),k_test)
rmse = np.sqrt(mse)

print("mean absolute error: %.2f" % mae)
print("mean squared error: %.2f" % mse)
print("root mean squared error: %.2f" % rmse)

In [None]:
d_tree99 = rf2.estimators_[99]
dot_data = StringIO()
export_graphviz(d_tree99, feature_names = m.columns,
               out_file = dot_data, filled = True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(value = graph.create_png())

In [None]:
rf2_params = {"max_depth":[2,5,8,10],
             "max_features":[2,5,8],
             "n_estimators":[10,500,1000],
             "min_samples_split":[2,5,10]}

In [None]:
rf2_model = RandomForestRegressor()
rf2_cv_model = GridSearchCV(rf2_model,
                        rf2_params,
                        cv=10,
                        n_jobs=-1,
                        verbose=2)

In [None]:
rf2_cv_model.fit(m_train,k_train)
print("Best parameters: " + str(rf2_cv_model.best_params_))

In [None]:
rf2_tuned = RandomForestRegressor(max_depth=10,
                                  max_features=8,
                                  min_samples_split=10,
                                  n_estimators=1000)
rf2_tuned.fit(m_train,k_train)

In [None]:
mae=mean_absolute_error(rf2_tuned.predict(m_test),k_test)
mse=mean_squared_error(rf2_tuned.predict(m_test),k_test)
rmse=np.sqrt(mse)

print("mean_absolute_error: %.2f"%mae)
print("mean squared error: %.2f" %mse)
print("root mean squared error: %.2f" %rmse)

In [None]:
m_test.head()

In [None]:
from sklearn.metrics import r2_score
print(r2_score(k_test,rf2_tuned.predict(m_test)))

In [None]:
Importance=pd.DataFrame({"Importance":rf2_tuned.feature_importances_*100},
                       index=m_train.columns)

In [None]:
Importance.sort_values(by="Importance",
                      axis=0,
                      ascending=True).plot(kind="barh",color="green")
plt.xlabel("Importance level of values")

In [None]:
model2=sm.OLS(rf2_tuned.predict(m_test),m_test)
model2.fit().summary()

## Gradient Boosting With One-Hot Encoding

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbm_model = GradientBoostingRegressor()
gbm_model.fit(X_train, y_train)

In [None]:
y_pred = gbm_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
gbm_params = {
    "learning_rate": [0.01, 0.1, 0.2],
    'max_depth': [3,5,8,50,100],
    'n_estimators': [200, 500, 1000, 2000],
    'subsample': [1, 0.5, 0.75],
}

In [None]:
gbm = GradientBoostingRegressor()
gbm_cv_model = GridSearchCV(gbm, gbm_params, cv=10, n_jobs = -1, verbose =2)
gbm_cv_model.fit(X_train, y_train)

In [None]:
gbm_cv_model.best_params_

In [None]:
gbm_tuned = GradientBoostingRegressor(learning_rate = 0.01,
                                      max_depth = 5,
                                      n_estimators = 200,
                                      subsample = 0.5)
gbm_tuned = gbm_tuned.fit(X_train, y_train)

In [None]:
y_pred = gbm_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
Importance=pd.DataFrame({"Importance":gbm_tuned.feature_importances_*100},
                       index= X_train.columns)

In [None]:
Importance.sort_values(by="Importance",
                      axis=0,
                      ascending=True).plot(kind="barh",color="green")
plt.xlabel("Importance level of values")
plt.show()
df5=pd.DataFrame(Importance)
df5.head(64)

In [None]:
df5.iloc[40:64]

In [None]:
model=sm.OLS(gbm_tuned.predict(X_test),X_test)
model.fit().summary()

## Gradient Boosting With Label Encoding

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbm_model2 = GradientBoostingRegressor()
gbm_model2.fit(m_train, k_train)

In [None]:
k_pred2 = gbm_model2.predict(m_test)
np.sqrt(mean_squared_error(k_test, k_pred2))

In [None]:
gbm_params = {
    "learning_rate": [0.01, 0.1, 0.2],
    'max_depth': [3,5,8,50,100],
    'n_estimators': [200, 500, 1000, 2000],
    'subsample': [1, 0.5, 0.75],
}

In [None]:
gbm2 = GradientBoostingRegressor()
gbm_cv_model2 = GridSearchCV(gbm2, gbm_params, cv=10, n_jobs = -1, verbose =2)
gbm_cv_model2.fit(m_train, k_train)

In [None]:
gbm_cv_model2.best_params_

In [None]:
gbm_tuned2 = GradientBoostingRegressor(learning_rate = 0.01,
                                      max_depth = 3,
                                      n_estimators = 200,
                                      subsample = 0.5)
gbm_tuned2 = gbm_tuned2.fit(m_train, k_train)

In [None]:
k_pred2 = gbm_tuned2.predict(m_test)
np.sqrt(mean_squared_error(k_test, k_pred2))

In [None]:
Importance=pd.DataFrame({"Importance":gbm_tuned2.feature_importances_*100},
                       index=m_train.columns)

In [None]:
Importance.sort_values(by="Importance",
                      axis=0,
                      ascending=True).plot(kind="barh",color="green")
plt.xlabel("Importance level of values")
plt.show()

In [None]:
model2=sm.OLS(gbm_tuned2.predict(m_test), m_test)
model2.fit().summary()

## XGBoost With Label Encoding

In [None]:
!pip install xgboost

In [None]:
import xgboost as xgb

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb = XGBRegressor().fit(m_train, k_train)

In [None]:
k_pred = xgb.predict(m_test)
np.sqrt(mean_squared_error(k_test, k_pred))

In [None]:
xgb

In [None]:
xgb_grid = {
    'colsample_bytree': [0.4, 0.5, 0.6, 0.9 ,1],
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [2, 3, 4, 5, 6],
    'learning_rate': [0.1, 0.01, 0.5]
}

In [None]:
xgb = XGBRegressor()
xgb_cv = GridSearchCV(xgb, 
                     param_grid = xgb_grid,
                     cv=10,
                     n_jobs = -1,
                     verbose = 2)
xgb_cv.fit(m_train, k_train)

In [None]:
xgb_cv.best_params_

In [None]:
xgb_tuned = XGBRegressor(colsample_bytree = 0.4,
                         learning_rate = 0.01,
                         max_depth = 2,
                         n_estimators = 500)
xgb_tuned = xgb_tuned.fit(m_train, k_train)

In [None]:
k_pred = xgb_tuned.predict(m_test)
np.sqrt(mean_squared_error(k_test, k_pred))

In [None]:
Importance=pd.DataFrame({"Importance":xgb_tuned.feature_importances_*100},
                       index = m_train.columns)

In [None]:
Importance.sort_values(by="Importance",
                      axis=0,
                      ascending=True).plot(kind="barh",color="green")
plt.xlabel("Importance level of values")

In [None]:
model=sm.OLS(xgb_tuned.predict(m_test), m_test)
model.fit().summary()