# Beginner's Random Forests example

This is a very simple Random Forests example meant for beginners. It is not meant to achieve a high score; it is merely a starting point to build on, without complicated techniques.

It's recommended that you finish the Kaggle Learn courses (Intro to Machine Learning and Intermediate Machine Learning) first.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/test_identity.csv
/kaggle/input/sample_submission.csv
/kaggle/input/train_identity.csv
/kaggle/input/train_transaction.csv
/kaggle/input/test_transaction.csv


## Files

As the listing above shows, the training data is split across two files, `train_transaction.csv` and `train_identity.csv`. These two tables are related to each other via the column `TransactionID`.

In [2]:
train_transaction_df = pd.read_csv('/kaggle/input/train_transaction.csv')
train_identity_df = pd.read_csv('/kaggle/input/train_identity.csv')

In [3]:
train_transaction_df.head()

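| | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |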


In [4]:
train_identity_df.head()

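| | TransactionID | id_01 | id_02 | id_03 | id_04 | id_05 | id_06 | id_07 | id_08 | id_09 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987004 | 0.0 | 70787.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | samsung browser 6.2 | 32.0 | 2220x1080 | match_status:2 | T | F | T | T | mobile | SAMSUNG SM-G892A Build/NRD90M |
| 1 | 2987008 | -5.0 | 98945.0 | NaN | NaN | 0.0 | -5.0 | NaN | NaN | NaN | ... | mobile safari 11.0 | 32.0 | 1334x750 | match_status:1 | T | F | F | T | mobile | iOS Device |
| 2 | 2987010 | -5.0 | 191631.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 62.0 | NaN | NaN | NaN | F | F | T | T | desktop | Windows |
| 3 | 2987011 | -5.0 | 221832.0 | NaN | NaN | 0.0 | -6.0 | NaN | NaN | NaN | ... | chrome 62.0 | NaN | NaN | NaN | F | F | T | T | desktop | NaN |
| 4 | 2987016 | 0.0 | 7460.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 62.0 | 24.0 | 1280x800 | match_status:2 | T | F | T | T | desktop | MacOS |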


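In order to use these files for training, we'll need to do what's sometimes called denormalising the data. We can do this with a left join from the transaction table to the identity table, using the DataFrame's `merge()` method, so that every transaction is kept even when it has no identity record.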

In [5]:
df_train = train_transaction_df.merge(train_identity_df, on='TransactionID', how='left')

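It's worth doing a sanity check whenever we do something like this. The result should have the same number of rows as the transaction table, and its column count should be the two tables' columns combined, with the shared `TransactionID` column counted once: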

In [6]:
df_train.shape

(590540, 434)
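
The same sanity check can be written as assertions so that it fails loudly instead of relying on eyeballing the shape. The sketch below uses two tiny made-up tables (the values are hypothetical) and pandas' `validate` argument to `merge()`, which raises an error if `TransactionID` turns out not to be unique on either side.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (hypothetical data).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})
identities = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# validate="one_to_one" makes pandas raise MergeError if TransactionID
# is duplicated on either side, which would otherwise multiply rows.
merged = transactions.merge(
    identities, on="TransactionID", how="left", validate="one_to_one"
)

# A left join must keep exactly the rows of the left table ...
assert len(merged) == len(transactions)
# ... and the shared key column is counted once in the combined width.
assert merged.shape[1] == transactions.shape[1] + identities.shape[1] - 1

print(merged)
```

Transactions without an identity record simply get `NaN` in the identity columns, which is why so many missing values show up after the merge.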

In [7]:
df_train.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| TransactionID | 2987000 | 2987001 | 2987002 | 2987003 | 2987004 |
| isFraud | 0 | 0 | 0 | 0 | 0 |
| TransactionDT | 86400 | 86401 | 86469 | 86499 | 86506 |
| TransactionAmt | 68.5 | 29 | 59 | 50 | 50 |
| ProductCD | W | W | W | W | H |
| card1 | 13926 | 2755 | 4663 | 18132 | 4497 |
| card2 | NaN | 404 | 490 | 567 | 514 |
| card3 | 150 | 150 | 150 | 150 | 150 |
| card4 | discover | mastercard | visa | mastercard | mastercard |
| card5 | 142 | 102 | 166 | 117 | 102 |


Looks like that worked! But we're now using a lot of RAM (look at the sidebar of your kernel). My kernel currently says I'm at 5 gigabytes, and we haven't even read the test set yet!

While cleaning the data, we may need to make copies of (some sections of) the data, so this is obviously not ideal.

In [8]:
df_train.memory_usage().sum()

2055079200

## Memory reduction

As discussed in [my other kernel](https://www.kaggle.com/yoongkang/beginner-memory-reduction-techniques), parsing the training dataset with default settings could take up to 2 GB of memory unnecessarily. With a few techniques (also discussed in the linked kernel) we can cut this down by about a gigabyte.

In this kernel we'll use similar techniques. If you want to see my reasoning for them, please refer to the other kernel.

First we need to determine which numeric columns we have so that we can downcast them (cast from float64 to another type that requires less memory).

In [9]:
# Get the categorical and numeric columns
cat_cols = [
    'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
    'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'DeviceType', 'DeviceInfo',
] + [f'M{n}' for n in range(1, 10)] + [f'id_{n}' for n in range(12, 39)]
num_cols = list(set(df_train.columns) - set(cat_cols))

As described in the other kernel, some of these columns are `float64` by default due to the presence of some `NaN` values.

The fact that they are `NaN` might be meaningful in this context, so simply replacing them (this is called imputation) would throw that information away. Instead, alongside imputing, we can add a boolean column that marks which values were replaced, and hope the training algorithm is smart enough to make use of it. There's no guarantee that it will, though! So you should always challenge your assumptions (e.g. maybe just dropping the columns could give you similar results, and train faster).

Also bear in mind that whatever values we impute with on the training set must be reused on the test set.

In [10]:
a = df_train[num_cols].isnull().any()
train_null_num_cols = a[a].index

In [11]:
nas = {}
for n in train_null_num_cols:
    df_train[f'{n}_isna'] = df_train[n].isnull()
    median = df_train[n].median()
    df_train[n] = df_train[n].fillna(median)  # assign back; in-place fillna on a selected column is unreliable
    nas[n] = median
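
The fit-on-train, apply-everywhere pattern described above can be packaged as two small helpers. This is only a sketch: the function names `fit_median_imputer` and `apply_median_imputer` are made up for illustration, and the data is a toy example rather than the competition data.

```python
import numpy as np
import pandas as pd

def fit_median_imputer(train, columns):
    """Return {column: training median} for the columns that contain NaN."""
    return {c: train[c].median() for c in columns if train[c].isnull().any()}

def apply_median_imputer(df, medians):
    """Add a `<col>_isna` flag and fill NaN with the stored training median."""
    df = df.copy()
    for col, median in medians.items():
        df[f"{col}_isna"] = df[col].isnull()
        df[col] = df[col].fillna(median)
    return df

# Toy data (hypothetical values): D1 has a gap in both sets, C1 has none.
train = pd.DataFrame({"D1": [1.0, np.nan, 3.0, 5.0], "C1": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"D1": [np.nan, 2.0], "C1": [5.0, 6.0]})

medians = fit_median_imputer(train, ["D1", "C1"])   # learned from train only
train_filled = apply_median_imputer(train, medians)
test_filled = apply_median_imputer(test, medians)   # reuses the train medians

print(test_filled)
```

Because the medians live in a plain dictionary, the same values can be applied to any later batch of data, which mirrors what the `nas` dictionary is used for in this kernel.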

Now that we've removed all the `NaN` values, we can downcast the columns to the smallest type that can hold their values.

First we'll need to know which columns are integers, though! The following snippet does just that.

In [12]:
integer_cols = []
for c in num_cols:
    try:
        if df_train[c].fillna(-1.0).apply(float.is_integer).all():
            integer_cols += [c]
    except Exception as e:
        print("error: ", c, e)

error:  TransactionID descriptor 'is_integer' requires a 'float' object but received a 'int'
error:  TransactionDT descriptor 'is_integer' requires a 'float' object but received a 'int'
error:  isFraud descriptor 'is_integer' requires a 'float' object but received a 'int'


Some errors are printed, but they're harmless: `float.is_integer` only accepts floats, and these columns are already stored as integers, so they are simply skipped. I haven't bothered to fix that here.
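
For reference, the check can be written so that it doesn't raise on columns that are already integers. The helper name `is_integer_valued` below is hypothetical and the data is a toy example. Unlike the cell above, this version also reports integer-typed columns such as `TransactionID`, so it isn't a drop-in replacement if you want to skip those.

```python
import numpy as np
import pandas as pd

def is_integer_valued(series):
    """True if every non-missing value in a numeric series is a whole number."""
    values = series.dropna()
    # Integer dtypes are whole numbers by definition; no element-wise check needed.
    if pd.api.types.is_integer_dtype(values):
        return True
    # For floats, a value is whole exactly when it equals its rounded self.
    return bool((values == np.round(values)).all())

df = pd.DataFrame({
    "TransactionID": [2987000, 2987001, 2987002],  # already int64
    "V1": [1.0, 0.0, np.nan],                      # whole-valued floats with a gap
    "TransactionAmt": [68.5, 29.0, 59.0],          # genuine floats
})

integer_cols = [c for c in df.columns if is_integer_valued(df[c])]
print(integer_cols)
```

The comparison is vectorised, so it should also be considerably faster than calling `float.is_integer` on every element with `apply`.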

Let's look at some stats.

In [13]:
stats = df_train[integer_cols].describe().transpose()
stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| V117 | 590540.0 | 1.000391 | 0.035229 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| C7 | 590540.0 | 2.848478 | 61.727304 | 0.0 | 0.0 | 0.0 | 0.0 | 2255.0 |
| V54 | 590540.0 | 0.669594 | 0.514694 | 0.0 | 0.0 | 1.0 | 1.0 | 6.0 |
| V34 | 590540.0 | 0.121228 | 0.336966 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
| V44 | 590540.0 | 1.059888 | 0.541348 | 0.0 | 1.0 | 1.0 | 1.0 | 48.0 |
| V225 | 590540.0 | 0.042353 | 0.611830 | 0.0 | 0.0 | 0.0 | 0.0 | 51.0 |
| V111 | 590540.0 | 1.002562 | 0.070812 | 0.0 | 1.0 | 1.0 | 1.0 | 9.0 |
| V27 | 590540.0 | 0.000676 | 0.026692 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| V256 | 590540.0 | 0.955686 | 0.481175 | 0.0 | 1.0 | 1.0 | 1.0 | 87.0 |
| V244 | 590540.0 | 1.026186 | 0.332075 | 0.0 | 1.0 | 1.0 | 1.0 | 22.0 |


Many of these columns span a very small range, so not all of them need to be `float64`. Let's downcast them.

In [14]:
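# int8 holds -128..127 and int16 holds -32768..32767, so check both ends of each range
int8columns = stats[(stats['min'] >= -128) & (stats['max'] <= 127)].index
print(int8columns.shape)
print(int8columns)
int16columns = stats[
    ((stats['min'] < -128) | (stats['max'] > 127))
    & (stats['min'] >= -32768) & (stats['max'] <= 32767)
].index
print(int16columns.shape)
print(int16columns)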

(239,)
Index(['V117', 'V54', 'V34', 'V44', 'V225', 'V111', 'V27', 'V256', 'V244',
       'V5',
       ...
       'V87', 'V243', 'V109', 'V289', 'V157', 'V183', 'V193', 'V64', 'V152',
       'id_05'],
      dtype='object', length=239)
(62,)
Index(['C7', 'V167', 'V259', 'V322', 'D5', 'C6', 'D12', 'C8', 'D13', 'V323',
       'C13', 'V145', 'C10', 'D2', 'D1', 'V178', 'D7', 'D15', 'D3', 'V95',
       'V245', 'V103', 'V324', 'D10', 'V218', 'V279', 'D11', 'V293', 'V96',
       'dist2', 'V295', 'V292', 'V221', 'V219', 'V177', 'C11', 'V143', 'C5',
       'D4', 'C1', 'dist1', 'C2', 'V294', 'V280', 'D14', 'V102', 'V179',
       'V233', 'V227', 'V232', 'V217', 'V168', 'C4', 'D6', 'V97', 'C12',
       'V291', 'V222', 'V150', 'C14', 'V101', 'V231'],
      dtype='object')


In [15]:
for c in int8columns:
    df_train[c] = df_train[c].astype('int8')
    
for c in int16columns:
    df_train[c] = df_train[c].astype('int16')
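
When downcasting, the thing to be careful about is the range of the target type: a signed `int8` only holds -128 to 127, so a value such as 200 would silently wrap around if it were cast to `int8`. A minimal sketch of a range-checked downcast, using NumPy's `np.iinfo` to look up the limits (the helper name `smallest_int_dtype` and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Pick the narrowest signed integer dtype whose range covers the series."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise ValueError("values do not fit in int64")

# Toy whole-valued float columns (hypothetical values).
df = pd.DataFrame({
    "V27": [0.0, 1.0, 4.0],        # fits int8 (-128..127)
    "V200": [0.0, 200.0, 255.0],   # too large for int8, needs int16
    "C7": [0.0, 10.0, 2255.0],     # needs int16
})

for col in df.columns:
    df[col] = df[col].astype(smallest_int_dtype(df[col]))

print(df.dtypes)
```

pandas also offers `pd.to_numeric(series, downcast='integer')`, which performs a similar range-aware downcast and may be more convenient in practice.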

In [16]:
df_train.memory_usage().sum()

1064153080

Looks like we shaved off roughly a gigabyte (from about 2.06 GB down to about 1.06 GB).

We're not done yet: we need to make sure we do the same thing on the test set. Let's read it in now, merge the tables, and impute it the same way.

In [17]:
test_transaction_df = pd.read_csv('/kaggle/input/test_transaction.csv')
test_identity_df = pd.read_csv('/kaggle/input/test_identity.csv')
df_test = test_transaction_df.merge(test_identity_df, on='TransactionID', how='left')

We added some columns in our training set and replaced missing values with the medians. We need to add those same columns to the test set, and fill the missing values in the same columns with the medians from the training set.

In [18]:
for k, v in nas.items():
    df_test[f'{k}_isna'] = df_test[k].isnull()
    df_test[k] = df_test[k].fillna(v)  # assign back; in-place fillna on a selected column is unreliable

Unfortunately, we're not done yet. We might have some missing values on other numeric columns we didn't anticipate. 

In practice, we don't always have a "test set". The test set might be new observations that arrive in the future, either as a batch or one at a time, so we can't really use statistics from the test set to impute the missing values. Instead we need to use values from the training set.

We'll use the median as well.

In [19]:
test_num_cols = list(set(num_cols) - set(['isFraud']))
a = df_test[test_num_cols].isnull().any()
test_null_num_cols = a[a].index

In [20]:
for n in test_null_num_cols:
    df_test[n] = df_test[n].fillna(df_train[n].median())  # use the training set's median!

Now we can downcast the numeric columns in the same way.

In [21]:
# copied from above cells

integer_cols = []
for c in test_num_cols:
    try:
        if df_test[c].fillna(-1.0).apply(float.is_integer).all():
            integer_cols += [c]
    except Exception as e:
        print("error: ", c, e)
stats = df_test[integer_cols].describe().transpose()
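# int8 holds -128..127 and int16 holds -32768..32767, so check both ends of each range
int8columns = stats[(stats['min'] >= -128) & (stats['max'] <= 127)].index
int16columns = stats[
    ((stats['min'] < -128) | (stats['max'] > 127))
    & (stats['min'] >= -32768) & (stats['max'] <= 32767)
].index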
for c in int8columns:
    df_test[c] = df_test[c].astype('int8')
    
for c in int16columns:
    df_test[c] = df_test[c].astype('int16')

error:  TransactionID descriptor 'is_integer' requires a 'float' object but received a 'int'
error:  TransactionDT descriptor 'is_integer' requires a 'float' object but received a 'int'


## Categorical values

Okay, we've reduced memory, but we still need to deal with categorical values. Most machine learning algorithms can't work with strings directly, so we need to convert them into numbers. This is called encoding.

We could use either label encoding, which replaces each category with an integer, or one-hot encoding, which creates a separate column for each category. One-hot encoding often performs better because it doesn't impose an arbitrary order on the categories. However, in this case we have columns with very high cardinality, and since the dataset is large, one-hot encoding would add a huge number of columns. Label encoding is more practical here, so that's what we'll use.

Two things we need to deal with for label encoding are missing values and unknown values. Missing values are entries where the data is simply not there, whereas unknown values are values that appear in the test set but not in the training set.

For missing values, we'll just replace them with a label, e.g. the string `"missing"`. That's pretty straightforward.

Unknown values require a bit more thought. The main question is whether or not we have all the categories a priori. If we know all the possible categories beforehand (i.e. fixed categories like gender, state, postcodes) then we can go ahead and devise a mapping beforehand for all possible values. However, sometimes new categories only appear in the future, like mobile phone models. In that case we have no way of knowing all the possible future values, and thus we can't map them, so we'll need another strategy: replacing them with a dedicated label such as the string `"unknown"`. We'll be doing that.

First, we'll replace missing values with the string `"missing"` (strictly speaking we don't have to, since pandas leaves `NaN` out of the categories and gives it the code `-1`, but I like to give it an explicit label because it's easier to see).

In [22]:
for c in cat_cols:
    df_train[c] = df_train[c].fillna("missing")
    
for c in cat_cols:   
    df_test[c] = df_test[c].fillna("missing")

Next we'll convert the columns in the training set to categorical.

In [23]:
cats = {}
for c in cat_cols:
    df_train[c] = df_train[c].astype("category")
    df_train[c] = df_train[c].cat.add_categories('unknown')  # assign back; the inplace option was removed in pandas 2.0
    cats[c] = df_train[c].cat.categories

Then we'll convert the test set.

In [24]:
for k, v in cats.items():
    # .loc avoids chained indexing, which triggers SettingWithCopyWarning
    df_test.loc[~df_test[k].isin(v), k] = 'unknown'


In [25]:
from pandas.api.types import CategoricalDtype

for k, v in cats.items():
    new_dtype = CategoricalDtype(categories=v, ordered=True)
    df_test[k] = df_test[k].astype(new_dtype)

In [26]:
for c in cat_cols:
    df_train[c] = df_train[c].cat.codes
    df_test[c] = df_test[c].cat.codes
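
To make the handling of missing and unknown values concrete, here is the whole encoding flow on a single toy column. The helper names `fit_categories` and `encode` and the device values are hypothetical; the idea is the same as in the cells above: learn the categories from the training set, reserve an `'unknown'` slot, and map everything else onto that fixed list.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

def fit_categories(train_col):
    """Learn the category list from training data, plus reserved labels."""
    filled = train_col.fillna("missing")
    categories = list(filled.unique())
    for extra in ("missing", "unknown"):
        if extra not in categories:
            categories.append(extra)
    return CategoricalDtype(categories=categories)

def encode(col, dtype):
    """Map a column to integer codes using a fixed, previously learned dtype."""
    filled = col.fillna("missing")
    # Anything the training set never saw is mapped to the 'unknown' label.
    filled = filled.where(filled.isin(dtype.categories), "unknown")
    return filled.astype(dtype).cat.codes

train_device = pd.Series(["mobile", "desktop", np.nan, "mobile"])
test_device = pd.Series(["desktop", "tablet", np.nan])  # 'tablet' is unseen

device_dtype = fit_categories(train_device)
train_codes = encode(train_device, device_dtype)
test_codes = encode(test_device, device_dtype)

print(list(device_dtype.categories))
print(train_codes.tolist(), test_codes.tolist())
```

Because both sets are encoded against the same `CategoricalDtype`, a given string always gets the same integer, and unseen values land on the `'unknown'` code rather than `-1`.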
    

Now we're more or less done with the minimum preprocessing required. Let's save our progress to a feather file, so that we don't have to go through it again!

In [27]:
df_train.to_feather('df_train')

In [28]:
df_test.to_feather('df_test')

## Validation set

Now we can start training our model. But how do we know whether a model is good or not? We commonly use a validation set, which is separate from the test set: rows held out of training and used to compare models. Because we choose our model based on its validation score, that score ends up optimistic (even though we never train on those rows), which is why the test set is kept as a further, untouched holdout; without it we couldn't tell whether we had overfitted. If you're unfamiliar with this, I suggest reading up on overfitting and underfitting.

The data description seems to indicate that the data is time ordered, so we don't really want a random split. So let's hold out a portion of the bottom rows to use as our validation set, and the rest as our training set.

In [29]:
idx = int(len(df_train) * 0.8)
training_set, validation_set = df_train[:idx], df_train[idx:]
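
The slice above relies on the rows already being in time order. A minimal sketch of the same split that sorts explicitly first, assuming `TransactionDT` is the time column (the helper name `time_ordered_split` and the toy rows are made up for illustration):

```python
import pandas as pd

def time_ordered_split(df, time_col, train_fraction=0.8):
    """Split so every validation row is at least as late as every training row."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy rows, deliberately shuffled in time (hypothetical values).
df = pd.DataFrame({
    "TransactionDT": [86506, 86400, 86499, 86401, 86469],
    "isFraud": [0, 0, 1, 0, 0],
})

train_part, valid_part = time_ordered_split(df, "TransactionDT")

# No validation row is earlier than any training row.
assert train_part["TransactionDT"].max() <= valid_part["TransactionDT"].min()
print(len(train_part), len(valid_part))
```

Validating on the latest rows mimics how the model will be used: it is trained on the past and scored on transactions that come afterwards.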

In [30]:
y_train = training_set['isFraud']
X_train = training_set.drop('isFraud', axis=1)
y_valid = validation_set['isFraud']
X_valid = validation_set.drop('isFraud', axis=1)

In [31]:
print(X_train.shape, y_train.shape)

(472432, 800) (472432,)


In [32]:
print(X_valid.shape, y_valid.shape)

(118108, 800) (118108,)


## Training the model

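Now we can finally train a model. You can iterate on this part. Note that we use `RandomForestRegressor` rather than a classifier: trained on the 0/1 `isFraud` target, it outputs values between 0 and 1 that we can treat as fraud scores, which is all `roc_auc_score` needs.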

In [33]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score

Using the whole training set is too time-consuming for quick iteration, so we can use a sample. 

We could use a random sample, but since this is time-ordered, I'm guessing the more recent rows would give us better predictive value. So let's just grab the bottom rows.

In [34]:
training_sample = training_set[-100000:]
y_train_sample = training_sample['isFraud']
X_train_sample = training_sample.drop('isFraud', axis=1)

In [35]:
model = RandomForestRegressor(
    n_estimators=400, max_features=0.3,
    min_samples_leaf=20, n_jobs=-1, verbose=1)

In [36]:
model.fit(X_train_sample, y_train_sample)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed: 12.3min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.3, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=20, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=-1,
                      oob_score=False, random_state=None, verbose=1,
                      warm_start=False)

In [37]:
preds_valid = model.predict(X_valid)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    2.8s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:    5.6s finished


In [38]:
roc_auc_score(y_valid, preds_valid)

0.8799528483981317
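
It may help to see why raw regression outputs can be scored this way: ROC AUC depends only on how the predictions rank the rows, not on their scale. The sketch below is a hand-rolled, rank-based AUC (the function name `rank_auc` is hypothetical, and it is not how scikit-learn implements `roc_auc_score`), applied to a tiny made-up example.

```python
import numpy as np
import pandas as pd

def rank_auc(y_true, scores):
    """ROC AUC from ranks: the chance a random positive outscores a random negative."""
    y_true = np.asarray(y_true)
    # Average ranks handle tied scores by giving them half credit each way.
    ranks = pd.Series(scores).rank(method="average").to_numpy()
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    pos_rank_sum = ranks[y_true == 1].sum()
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = rank_auc(y, scores)
# Rescaling the scores keeps their order, so the AUC does not change.
auc_scaled = rank_auc(y, [10 * s + 3 for s in scores])

print(auc, auc_scaled)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so the 0.88 above means a randomly chosen fraudulent transaction outranks a randomly chosen legitimate one about 88% of the time.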

## Submission

Now that we have a decent model, we can actually train on the whole dataset, including the validation set.

In [39]:
model = RandomForestRegressor(
    n_estimators=400, max_features=0.3,
    min_samples_leaf=20, n_jobs=-1, verbose=1)

In [40]:
y = df_train['isFraud']
X = df_train.drop('isFraud', axis=1)

In [41]:
model.fit(X, y)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 16.5min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 73.2min
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed: 151.8min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.3, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=20, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=-1,
                      oob_score=False, random_state=None, verbose=1,
                      warm_start=False)

In [42]:
y_preds = model.predict(df_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.9s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   18.1s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   37.4s finished


In [43]:
submission = pd.read_csv('/kaggle/input/sample_submission.csv')
submission['isFraud'] = y_preds
submission.to_csv('submission.csv', index=False)
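
Assigning `y_preds` straight into the sample submission assumes that the test rows are in the same order as the rows of `sample_submission.csv`. A few cheap checks before writing the file can catch a mismatch. The sketch below uses small made-up stand-ins for the real objects.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for sample_submission, df_test and y_preds.
sample = pd.DataFrame({"TransactionID": [10, 11, 12], "isFraud": [0.5, 0.5, 0.5]})
test_ids = pd.Series([10, 11, 12], name="TransactionID")
preds = np.array([0.02, 0.91, 0.10])

# One prediction per submission row.
assert len(preds) == len(sample)
# Rows line up: the i-th prediction belongs to the i-th TransactionID.
assert (sample["TransactionID"].to_numpy() == test_ids.to_numpy()).all()
# Predictions are finite scores within [0, 1].
assert np.isfinite(preds).all()
assert ((preds >= 0) & (preds <= 1)).all()

submission = sample.assign(isFraud=preds)
print(submission)
```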