In [1]:
import pandas as pd

# Forest cover

**Goal:** to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data).

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/forest_cover.csv'
forest_cover = pd.read_csv(url)
forest_cover.tail()

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
15115,15116,2607,243,23,258,7,660,170,251,214,...,0,0,0,0,0,0,0,0,0,3
15116,15117,2603,121,19,633,195,618,249,221,91,...,0,0,0,0,0,0,0,0,0,3
15117,15118,2492,134,25,365,117,335,250,220,83,...,0,0,0,0,0,0,0,0,0,3
15118,15119,2487,167,28,218,101,242,229,237,119,...,0,0,0,0,0,0,0,0,0,3
15119,15120,2475,197,34,319,78,270,189,244,164,...,0,0,0,0,0,0,0,0,0,3


The data cover four wilderness areas in the Roosevelt National Forest of northern Colorado.

**Data Description**

| Feature | Description |
| :- | -: |
| Elevation | Elevation in meters
| Aspect | Aspect in degrees azimuth
| Slope | Slope in degrees
| Horizontal_Distance_To_Hydrology | Horizontal distance to nearest surface water feature (in meters)
| Vertical_Distance_To_Hydrology | Vertical distance to nearest surface water feature (in meters)
| Horizontal_Distance_To_Roadways | Horizontal distance to nearest roadway (in meters)
| Hillshade_9am | Hillshade index at 9am, summer solstice
| Hillshade_Noon | Hillshade index at noon, summer solstice
| Hillshade_3pm | Hillshade index at 3pm, summer solstice
| Horizontal_Distance_To_Fire_Points | Horizontal distance to nearest wildfire ignition point (in meters)
| Wilderness_Area (4 binary columns) | 0 (absence) or 1 (presence) / Wilderness area designation
| Soil_Type (40 binary columns) | 0 (absence) or 1 (presence) / Soil Type designation
| Cover_Type (target vector) | Forest Cover Type designation

The seven **cover types** are:

 - Spruce/Fir
 - Lodgepole Pine
 - Ponderosa Pine
 - Cottonwood/Willow
 - Aspen
 - Douglas-fir
 - Krummholz
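To read predictions as tree names rather than integers, the codes can be mapped to the names above. This is a minimal sketch that assumes `Cover_Type` stores integer codes 1 to 7 in the order of the list (the usual convention for this dataset; the rows shown earlier carry code 3). The `COVER_TYPE_NAMES` dictionary and the toy `codes` series are illustrative names, not part of the notebook.

```python
import pandas as pd

# Assumption: codes 1-7 follow the order of the cover type list above.
COVER_TYPE_NAMES = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}

# Toy stand-in for forest_cover.Cover_Type (hypothetical values).
codes = pd.Series([3, 3, 1, 7], name="Cover_Type")

# Series.map looks each code up in the dictionary.
names = codes.map(COVER_TYPE_NAMES)
print(names.tolist())
# ['Ponderosa Pine', 'Ponderosa Pine', 'Spruce/Fir', 'Krummholz']
```

On the real data, the same call would be `forest_cover.Cover_Type.map(COVER_TYPE_NAMES)`.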

The **wilderness areas** are:

 - Rawah Wilderness Area
 - Neota Wilderness Area
 - Comanche Peak Wilderness Area
 - Cache la Poudre Wilderness Area

The **soil types** are:

- Cathedral family - Rock outcrop complex, extremely stony.
- Vanet - Ratake families complex, very stony.
- Haploborolis - Rock outcrop complex, rubbly.
- Ratake family - Rock outcrop complex, rubbly.
- Vanet family - Rock outcrop complex, rubbly.
- Vanet - Wetmore families - Rock outcrop complex, stony.
- Gothic family.
- Supervisor - Limber families complex.
- Troutville family, very stony.
- Bullwark - Catamount families - Rock outcrop complex, rubbly.
- Bullwark - Catamount families - Rock land complex, rubbly.
- Legault family - Rock land complex, stony.
- Catamount family - Rock land - Bullwark family complex, rubbly.
- Pachic Argiborolis - Aquolis complex.
- unspecified in the USFS Soil and ELU Survey.
- Cryaquolis - Cryoborolis complex.
- Gateview family - Cryaquolis complex.
- Rogert family, very stony.
- Typic Cryaquolis - Borohemists complex.
- Typic Cryaquepts - Typic Cryaquolls complex.
- Typic Cryaquolls - Leighcan family, till substratum complex.
- Leighcan family, till substratum, extremely bouldery.
- Leighcan family, till substratum - Typic Cryaquolls complex.
- Leighcan family, extremely stony.
- Leighcan family, warm, extremely stony.
- Granile - Catamount families complex, very stony.
- Leighcan family, warm - Rock outcrop complex, extremely stony.
- Leighcan family - Rock outcrop complex, extremely stony.
- Como - Legault families complex, extremely stony.
- Como family - Rock land - Legault family complex, extremely stony.
- Leighcan - Catamount families complex, extremely stony.
- Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
- Leighcan - Catamount families - Rock outcrop complex, extremely stony.
- Cryorthents - Rock land complex, extremely stony.
- Cryumbrepts - Rock outcrop - Cryaquepts complex.
- Bross family - Rock land - Cryumbrepts complex, extremely stony.
- Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
- Leighcan - Moran families - Cryaquolls complex, extremely stony.
- Moran family - Cryorthents - Leighcan family complex, extremely stony.
- Moran family - Cryorthents - Rock land complex, extremely stony.
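Because the soil type is stored as 40 binary columns, it can help during exploration to collapse them into a single soil index. The sketch below uses a toy frame with three hypothetical rows and only three of the columns; it assumes the columns are named `Soil_Type1` to `Soil_Type40` (the suffix pattern visible in the table header above) and that the n-th entry of the list corresponds to `Soil_TypeN`.

```python
import pandas as pd

# Toy frame with three of the forty one-hot soil columns (hypothetical rows).
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 0, 1],
    "Soil_Type3": [0, 1, 0],
})

soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]

# Sanity check: a valid one-hot encoding has exactly one 1 per row.
is_one_hot = bool((toy[soil_cols].sum(axis=1) == 1).all())

# idxmax returns, for each row, the name of the column holding the 1;
# stripping the prefix leaves the soil index as an integer.
soil_index = (
    toy[soil_cols]
    .idxmax(axis=1)
    .str.replace("Soil_Type", "", regex=False)
    .astype(int)
)
print(is_one_hot, soil_index.tolist())  # True [1, 3, 2]
```

Tree-based models handle the one-hot columns directly, so this conversion is mainly useful for plots and group-by summaries.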

**Goal:** train a Random Forest classifier that predicts the target column (`Cover_Type`), tune its hyperparameters, and evaluate the tuned model on held-out data, using `accuracy` as the performance metric.

In [19]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [20]:
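# features: every column except the target (note that this keeps the Id column)
x = forest_cover.drop('Cover_Type', axis=1)
# target: the cover type codes
y = forest_cover.Cover_Type
# default split: 75% train, 25% test; no random_state, so the split changes between runs
x_train, x_test, y_train, y_test = train_test_split(x, y)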
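The split above keeps the `Id` column as a feature and is neither seeded nor stratified. One possible variant, sketched here on a synthetic stand-in frame (the values and the `toy` name are hypothetical), drops the identifier, fixes `random_state` for reproducibility, and stratifies on the target so each cover type appears in both parts. This is an alternative to consider, not the split that produced the results reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for forest_cover (hypothetical values, same column roles).
rng = np.random.default_rng(0)
n = 210
toy = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "Elevation": rng.integers(2000, 3500, size=n),
    "Slope": rng.integers(0, 40, size=n),
    "Cover_Type": np.repeat(np.arange(1, 8), n // 7),  # 30 rows per class
})

# Drop the target and the row identifier, which is not a cartographic variable.
X = toy.drop(columns=["Id", "Cover_Type"])
y = toy["Cover_Type"]

# stratify=y keeps class proportions equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
print(y_test.value_counts().sort_index().tolist())
```

If the rows of the real file are ordered by cover type, an identifier column can leak the target, which is one reason to drop it before training.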

In [25]:
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# base estimator with default hyperparameters; the grid search below tunes it
randFor = RandomForestClassifier()

In [26]:
param_dic ={'n_estimators':[5,10,25,50,100,200],
            'max_depth':[2,5,10,20],
            'min_samples_split':[2,4,8,16,32],
            'min_samples_leaf':[2,4,8,16,32]}

In [27]:
grid=GridSearchCV(randFor,
                  param_dic,
                  cv=10,
                  scoring='accuracy',
                  n_jobs=-1,verbose=1)

In [28]:
grid.fit(x_train,y_train)

Fitting 10 folds for each of 600 candidates, totalling 6000 fits


GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [2, 5, 10, 20],
                         'min_samples_leaf': [2, 4, 8, 16, 32],
                         'min_samples_split': [2, 4, 8, 16, 32],
                         'n_estimators': [5, 10, 25, 50, 100, 200]},
             scoring='accuracy', verbose=1)
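The "600 candidates, totalling 6000 fits" in the log follows directly from the grid: 6 × 4 × 5 × 5 combinations, each evaluated on 10 folds. The sketch below reproduces that count with `ParameterGrid`. The second part rests on an assumption about scikit-learn's tree builder, namely that the effective split threshold is `max(min_samples_split, 2 * min_samples_leaf)`; if that holds, many grid points describe the same model and the search could be made cheaper.

```python
from sklearn.model_selection import ParameterGrid

param_dic = {
    'n_estimators': [5, 10, 25, 50, 100, 200],
    'max_depth': [2, 5, 10, 20],
    'min_samples_split': [2, 4, 8, 16, 32],
    'min_samples_leaf': [2, 4, 8, 16, 32],
}

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(param_dic))
n_fits = n_candidates * 10  # cv=10 folds per candidate
print(n_candidates, n_fits)  # 600 6000

# Assumption: a node needs at least 2 * min_samples_leaf samples to be split,
# so smaller min_samples_split values have no additional effect.
effective = {
    (p['n_estimators'], p['max_depth'], p['min_samples_leaf'],
     max(p['min_samples_split'], 2 * p['min_samples_leaf']))
    for p in ParameterGrid(param_dic)
}
print(len(effective))  # 264
```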

In [29]:
grid.best_params_

{'max_depth': 20,
 'min_samples_leaf': 2,
 'min_samples_split': 4,
 'n_estimators': 200}
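`best_params_` shows only the winner. To see how close the other candidates came, the fitted search object also exposes `cv_results_` and `best_score_`. The sketch below runs a deliberately small search on synthetic data from `make_classification` (the data, the `small_grid` values, and the `search` name are illustrative, not the notebook's) and ranks the candidates by mean cross-validated accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the forest cover data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

small_grid = {"n_estimators": [10, 25], "max_depth": [2, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

# One row per candidate, best first.
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print(search.best_params_, round(search.best_score_, 3))
```

In this notebook the same table would come from `pd.DataFrame(grid.cv_results_)`. Comparing `std_test_score` with the gaps between candidates gives a sense of whether the top-ranked setting is clearly better or within fold-to-fold noise.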

In [30]:
best_clf=grid.best_estimator_

In [31]:
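# Redundant: with the default refit=True, GridSearchCV has already refit
# best_estimator_ on all of x_train; this call simply trains it again.
best_clf.fit(x_train,y_train)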

RandomForestClassifier(max_depth=20, min_samples_leaf=2, min_samples_split=4,
                       n_estimators=200)
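With the default `refit=True`, `GridSearchCV` retrains the best candidate on the whole training set once the search finishes, so `best_estimator_` can predict without a further call to `fit`. A small check on synthetic data (the data and the `search` name are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# refit=True is the default, written out here for emphasis.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), {"max_depth": [2, 5]}, cv=3, refit=True
)
search.fit(X_train, y_train)

# The search object delegates predict to the already-fitted best estimator.
pred_via_search = search.predict(X_test)
pred_via_best = search.best_estimator_.predict(X_test)
same = bool(np.array_equal(pred_via_search, pred_via_best))
print(same)  # True
```

Calling `fit` again is harmless, but for an unseeded forest it builds a different set of trees from the one the search refit.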

In [33]:
# accuracy on the held-out test set
y_test_pred = best_clf.predict(x_test)
accuracy_score(y_test, y_test_pred)

0.8513227513227514
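Accuracy is a single number and hides which cover types get confused with each other. The `recall_score` and `confusion_matrix` functions imported earlier can break it down per class. The sketch below uses small hand-made label vectors (hypothetical, not the notebook's predictions) so that the output is easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels for three classes.
y_true = [1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 1]
labels = [1, 2, 3]

acc = accuracy_score(y_true, y_pred)  # 5 of 7 correct

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# average=None returns one recall value per class instead of a single average.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(round(acc, 3))              # 0.714
print(cm.tolist())                # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
print(per_class_recall.round(3))  # [0.5   1.    0.667]
```

In the notebook the equivalent calls would be `confusion_matrix(y_test, y_test_pred)` and `recall_score(y_test, y_test_pred, average=None)`.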