##### On the [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data), we find the following descriptions of the **Soil Types**:

In [None]:
Soil_Types = [
'1 Cathedral family - Rock outcrop complex, extremely stony.',
'2 Vanet - Ratake families complex, very stony.',
'3 Haploborolis - Rock outcrop complex, rubbly.',
'4 Ratake family - Rock outcrop complex, rubbly.',
'5 Vanet family - Rock outcrop complex complex, rubbly.',
'6 Vanet - Wetmore families - Rock outcrop complex, stony.',
'7 Gothic family.',
'8 Supervisor - Limber families complex.',
'9 Troutville family, very stony.',
'10 Bullwark - Catamount families - Rock outcrop complex, rubbly.',
'11 Bullwark - Catamount families - Rock land complex, rubbly.',
'12 Legault family - Rock land complex, stony.',
'13 Catamount family - Rock land - Bullwark family complex, rubbly.',
'14 Pachic Argiborolis - Aquolis complex.',
'15 unspecified in the USFS Soil and ELU Survey.',
'16 Cryaquolis - Cryoborolis complex.',
'17 Gateview family - Cryaquolis complex.',
'18 Rogert family, very stony.',
'19 Typic Cryaquolis - Borohemists complex.',
'20 Typic Cryaquepts - Typic Cryaquolls complex.',
'21 Typic Cryaquolls - Leighcan family, till substratum complex.',
'22 Leighcan family, till substratum, extremely bouldery.',
'23 Leighcan family, till substratum - Typic Cryaquolls complex.',
'24 Leighcan family, extremely stony.',
'25 Leighcan family, warm, extremely stony.',
'26 Granile - Catamount families complex, very stony.',
'27 Leighcan family, warm - Rock outcrop complex, extremely stony.',
'28 Leighcan family - Rock outcrop complex, extremely stony.',
'29 Como - Legault families complex, extremely stony.',
'30 Como family - Rock land - Legault family complex, extremely stony.',
'31 Leighcan - Catamount families complex, extremely stony.',
'32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.',
'33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.',
'34 Cryorthents - Rock land complex, extremely stony.',
'35 Cryumbrepts - Rock outcrop - Cryaquepts complex.',
'36 Bross family - Rock land - Cryumbrepts complex, extremely stony.',
'37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.',
'38 Leighcan - Moran families - Cryaquolls complex, extremely stony.',
'39 Moran family - Cryorthents - Leighcan family complex, extremely stony.',
'40 Moran family - Cryorthents - Rock land complex, extremely stony.'
]

We can see that the Soil Type descriptions contain words and phrases such as **'extremely stony', 'Leighcan', 'Rock'**
that are shared by several Soil Types, and therefore by several Soil Type columns.

In the following, one **binary feature** is created for each **word/phrase**:
it is 1 if the row has at least one Soil Type whose description
contains that word/phrase, and 0 otherwise.
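
To make the idea concrete before running it on the full data, here is a minimal sketch on a made-up frame with only three Soil Type columns. The helper `keyword_flag`, the frame `toy`, and the list `toy_descriptions` are names I introduce for illustration only; the actual cell uses a loop over all 40 descriptions.

```python
import pandas as pd

# Toy stand-ins (assumptions): three real descriptions and four made-up rows.
toy_descriptions = [
    '1 Cathedral family - Rock outcrop complex, extremely stony.',
    '2 Vanet - Ratake families complex, very stony.',
    '7 Gothic family.',
]
toy = pd.DataFrame({
    'Soil_Type1': [1, 0, 0, 0],
    'Soil_Type2': [0, 1, 0, 1],
    'Soil_Type7': [0, 0, 1, 1],
})

def keyword_flag(frame, descriptions, keyword):
    """Return 1 where any Soil Type whose description contains `keyword` is present."""
    # The leading number of each description is the Soil Type number.
    cols = ['Soil_Type' + d.split(' ', 1)[0] for d in descriptions if keyword in d]
    return frame[cols].any(axis=1).astype(int)

toy['stony'] = keyword_flag(toy, toy_descriptions, 'stony')
toy['family'] = keyword_flag(toy, toy_descriptions, 'family')
print(toy)
```

Note that the match is a plain substring test, so 'family' does not match 'families', while a short key such as 'Cry' matches every description containing a word that starts with those letters.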

In [None]:
import pandas as pd

data = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
data = data.drop(3403875, axis = 0) #Drop the row whose target value (Cover_Type) is 5

X = data.iloc[:,1:-1] #data without the id (first) and target (last) columns; 1:-2 would also drop the last Soil Type column

Soil = [col for col in X.columns if 'Soil' in col] #Soil Type Columns
Non_Soil = [col for col in X.columns if 'Soil' not in col] #Other Columns

Y = data['Cover_Type'] #Target Column

Words_Phrases = ['Cry', 'Crya', 'Cryaquo','extremely stony', 'rubbly', 'very stony','Rock outcrop complex',
          'Typic Cryaquolls', 'Leighcan family' ,'Leighcan', 'Cryaquolls','warm', 'Rock',
          'Cryaquepts', 'stony', 'Moran', 'Como','Catamount', 'Vanet' ,'Typic', 'land',
          'outcrop', 'complex', 'Granile', 'bouldery', 'Bullwark' , 'Legault', 'family']
#Words/Phrases that will be searched in the Soil Type description

for Information in Words_Phrases:
    #Numbers of all the Soil Types whose description contains the Word/Phrase
    #(+1 because list indices start at 0 while Soil Type numbers start at 1)
    Type_Number = [i + 1 for i in range(len(Soil_Types)) if Information in Soil_Types[i]]

    #Add up the presence indicators of those Soil Types
    data[Information] = 0
    for i in Type_Number:
        data[Information] = data[Information] + data['Soil_Type' + str(i)]

    #I think the Soil Types are supposed to be mutually exclusive, but that does not hold in this generated dataset.
    #A row can therefore sum to more than 1, so every positive sum is set to 1 to keep the feature binary.
    data[Information] = 1 * (data[Information] > 0)
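
Whether the Soil Types really overlap can be checked by counting how many Soil Type columns are set in each row. The following is a minimal sketch on a made-up frame (`toy_soil` is a hypothetical stand-in; on the real data the same two lines would be applied to `data[Soil]`).

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be data[Soil].
toy_soil = pd.DataFrame({
    'Soil_Type1': [1, 0, 0, 1],
    'Soil_Type2': [0, 1, 0, 1],
    'Soil_Type3': [0, 0, 0, 0],
})

# Number of Soil Types set in each row (exactly 1 everywhere if they were mutually exclusive).
active_per_row = toy_soil.sum(axis=1)

# Share of rows with 0, 1, 2, ... Soil Types set.
share_by_count = active_per_row.value_counts(normalize=True).sort_index()
print(share_by_count)
```

Any rows counted under 0 or under 2 and above are the cases where one-hot encoding does not hold, which is why the feature above is clipped to 1.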

Now, we can check whether each of the information columns improves the validation score.

In this notebook, I'll use XGBClassifier with learning rate = 0.5, n_estimators = 20, and CV-folds = 2.

In [None]:
from xgboost import XGBClassifier

Classifier = XGBClassifier(learning_rate = 0.5, n_estimators = 20,  #XGB Classifier
            tree_method = 'gpu_hist', predictor =  'gpu_predictor',
              objective = 'multi:softmax', eval_metric = 'mlogloss')

In the following, validation scores are computed for each of these feature sets:

**(1) Just the Non_Soil columns**

**(2) Non_Soil and Soil columns (basically the given data)**

**(3) Non_Soil columns plus one information column at a time, for each of these 10:**

'Leighcan', 'Rock', 'stony', 'Moran', 'Catamount',
'Bullwark', 'Cry', 'rubbly', 'family', 'complex'

You can try out other columns as well. In this notebook, I will use the above 10.

**(4) Non_Soil columns plus all 10 of the information columns together**
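
The four comparisons all follow the same pattern: pick a list of columns, cross-validate, and average the fold scores. As a rough sketch of that pattern, the helper below (`evaluate_feature_sets`, a name I made up) does this for a dictionary of feature sets. To keep it runnable without a GPU it uses synthetic data and a small `DecisionTreeClassifier` as a stand-in for the XGBClassifier, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_feature_sets(frame, target, feature_sets, model, cv=2):
    """Return the mean cross-validated accuracy for each named list of columns."""
    results = {}
    for name, cols in feature_sets.items():
        fold_scores = cross_val_score(model, frame[cols], target, cv=cv)
        results[name] = round(fold_scores.mean(), 5)
    return pd.Series(results, name='Validation Scores')

# Synthetic stand-in data (assumption): one numeric column and two binary columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Elevation': rng.normal(size=200),
    'Soil_Type1': rng.integers(0, 2, size=200),
    'stony': rng.integers(0, 2, size=200),
})
toy_target = (toy['Elevation'] > 0).astype(int)

toy_scores = evaluate_feature_sets(
    toy,
    toy_target,
    {
        'Just the Non-Soil': ['Elevation'],
        'Non_Soil and Soil': ['Elevation', 'Soil_Type1'],
        'Non_Soil + stony': ['Elevation', 'stony'],
    },
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
print(toy_scores)
```

The cells below do the same thing step by step, which makes it easier to rerun a single comparison.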

In [None]:
from sklearn.model_selection import cross_val_score

scores = []
columns = []

#(1) Just the Non-Soil Columns
X_1 = data[Non_Soil]
score_1 = cross_val_score(Classifier, X_1, Y, cv = 2)
score_1 = round(sum(score_1)/2,5)
scores.append(score_1)
columns.append('Just the Non-Soil')

In [None]:
#(2) Non_Soil and Soil Columns (basically the given data)
X_2 = data[Non_Soil + Soil]
score_2 = cross_val_score(Classifier, X_2, Y, cv = 2)
score_2 = round(sum(score_2)/2,5)
scores.append(score_2)
columns.append('Non_Soil and Soil')

In [None]:
#(3) Non_Soil plus one of the 10 information columns at a time
Words_Phrases_10 = ['Leighcan','Rock','stony','Moran','Catamount','Bullwark','Cry','rubbly','family','complex']

for Information in Words_Phrases_10:
    X_3 = data[Non_Soil + [Information]]
    score_3 = cross_val_score(Classifier, X_3, Y, cv = 2)
    score_3 = round(sum(score_3)/2,5)
    scores.append(score_3)
    columns.append('Non_Soil + ' + Information)

In [None]:
#(4) Non_Soil columns plus all 10 of the information columns together
X_4 = data[Non_Soil + Words_Phrases_10] #Words_Phrases_10, not the full Words_Phrases list
score_4 = cross_val_score(Classifier, X_4, Y, cv = 2)
score_4 = round(sum(score_4)/2,5)
scores.append(score_4)
columns.append('Non_Soil and 10')

In [None]:
pd.DataFrame({'Validation Scores':scores}, index = columns)

I didn't run it in this notebook, but **Non_Soil + Soil + the 10 information columns** gave me a higher score than any of (1), (2), (3), or (4).