
This implementation is intended for porting to other platforms and for discussion. It is not an exploratory analysis; the aim is to consider the APIs and technologies involved. It is not intended to be a good or reference solution to this problem.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np

Obtain the data from Google Cloud Storage buckets

In [2]:
! wget https://storage.googleapis.com/bdt-spark-store/external_sources.csv -O gcs_external_sources.csv

--2025-10-02 18:31:28--  https://storage.googleapis.com/bdt-spark-store/external_sources.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.216.207, 108.177.11.207, 192.178.219.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.216.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15503836 (15M) [text/csv]
Saving to: ‘gcs_external_sources.csv’


2025-10-02 18:31:30 (13.5 MB/s) - ‘gcs_external_sources.csv’ saved [15503836/15503836]



In [3]:
! wget https://storage.googleapis.com/bdt-spark-store/internal_data.csv -O gcs_internal_data.csv

--2025-10-02 18:31:35--  https://storage.googleapis.com/bdt-spark-store/internal_data.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.216.207, 108.177.11.207, 192.178.219.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.216.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152978396 (146M) [text/csv]
Saving to: ‘gcs_internal_data.csv’


2025-10-02 18:31:40 (31.3 MB/s) - ‘gcs_internal_data.csv’ saved [152978396/152978396]



Read in data sources

In [4]:
df_data = pd.read_csv('gcs_internal_data.csv')
df_ext = pd.read_csv('gcs_external_sources.csv')

Join them on their common identifier key

In [5]:
df_full = df_data.merge(df_ext, on='SK_ID_CURR', how='inner')
df_full.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0.083037,0.262949,0.139376
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.311267,0.622246,
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,,0.555912,0.729567
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,,,,,,,,0.650442,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0.0,0.0,0.0,0.0,0.0,0.0,,0.322738,
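
An inner join silently drops any row whose key is missing from the other table. As a minimal sketch on made-up toy frames (only the `SK_ID_CURR` column name comes from this notebook), `validate` and `indicator` can be used to check that the key is unique and to count the rows that would be lost:

```python
import pandas as pd

# Toy stand-ins for the internal and external tables (values are made up).
internal = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]})
external = pd.DataFrame({"SK_ID_CURR": [2, 3, 4], "EXT_SOURCE_1": [0.1, 0.2, 0.3]})

# An inner join keeps only the keys present in both frames.
# validate="one_to_one" raises MergeError if a key is duplicated on either side.
joined = internal.merge(external, on="SK_ID_CURR", how="inner", validate="one_to_one")

# An outer join with indicator=True records where each row came from,
# which makes it easy to count the rows an inner join would drop.
audit = internal.merge(external, on="SK_ID_CURR", how="outer", indicator=True)
dropped = (audit["_merge"] != "both").sum()

print(joined)
print("rows dropped by the inner join:", dropped)
```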


We will keep only a subset of the features for the sake of this example

In [6]:
columns_extract = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_EDUCATION_TYPE',
                  'DAYS_ID_PUBLISH', 'CODE_GENDER', 'AMT_ANNUITY',
                  'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT',
                  'ORGANIZATION_TYPE', 'DAYS_LAST_PHONE_CHANGE',
                  'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE', 'TARGET']
df = df_full[columns_extract]

In [7]:
df.head(3)

Unnamed: 0,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_BIRTH,DAYS_EMPLOYED,NAME_EDUCATION_TYPE,DAYS_ID_PUBLISH,CODE_GENDER,AMT_ANNUITY,DAYS_REGISTRATION,AMT_GOODS_PRICE,AMT_CREDIT,ORGANIZATION_TYPE,DAYS_LAST_PHONE_CHANGE,NAME_INCOME_TYPE,AMT_INCOME_TOTAL,OWN_CAR_AGE,TARGET
0,0.083037,0.262949,0.139376,-9461,-637,Secondary / secondary special,-2120,M,24700.5,-3648.0,351000.0,406597.5,Business Entity Type 3,-1134.0,Working,202500.0,,1
1,0.311267,0.622246,,-16765,-1188,Higher education,-291,F,35698.5,-1186.0,1129500.0,1293502.5,School,-828.0,State servant,270000.0,,0
2,,0.555912,0.729567,-19046,-225,Secondary / secondary special,-2531,M,6750.0,-4260.0,135000.0,135000.0,Government,-815.0,Working,67500.0,26.0,0


Let's obtain a train and test split

In [8]:
# Set the seed for reproducibility: df.sample() in the next cell draws from NumPy's
# global generator. np.random.RandomState(101) alone only builds an unused generator.
np.random.seed(101)

In [9]:
shuffled = df.sample(frac=1)  # shuffle the rows, then split 80/20 by position
train, test = shuffled.iloc[:int(.8*len(df))], shuffled.iloc[int(.8*len(df)):]


In [10]:
print(train.TARGET.value_counts()/len(train.index))
print(test.TARGET.value_counts()/len(test.index))

TARGET
0    0.919299
1    0.080701
Name: count, dtype: float64
TARGET
0    0.919158
1    0.080842
Name: count, dtype: float64
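
The class proportions above match between the two parts only approximately, because the shuffle is purely random. As a sketch of one alternative, `train_test_split` (already imported in the first cell) can stratify on the label so both parts get the same proportions by construction. The toy frame and the variable names below are assumptions for illustration, not the loan data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the class imbalance seen above (values are made up).
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "TARGET": [1] * 80 + [0] * 920,
})

# stratify keeps the TARGET proportions the same in both parts;
# random_state makes the shuffle repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["TARGET"], random_state=101
)

print(toy_train["TARGET"].mean(), toy_test["TARGET"].mean())
```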


Handle the categorical variables

In [11]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)

print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)

Training Features shape:  (246008, 88)
Testing Features shape:  (61503, 88)


Align the training and test data (as the test data may not have the same columns in the encoding)

In [12]:
# Align the training and testing data, keep only columns present in both dataframes
train, test = train.align(test, join = 'inner', axis = 1)

print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)

Training Features shape:  (246008, 88)
Testing Features shape:  (61503, 88)
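
Here the shapes are unchanged because every category occurs in both parts. A minimal sketch on made-up toy frames shows what the inner alignment does when that is not the case. It also shows one possible alternative, `reindex`, which keeps all training columns instead of dropping them:

```python
import pandas as pd

# Toy frames: the category "c" only occurs in the training part (values are made up).
toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "cat": ["a", "b", "c"]})
toy_test = pd.DataFrame({"x": [4.0, 5.0], "cat": ["a", "b"]})

train_enc = pd.get_dummies(toy_train)  # columns: x, cat_a, cat_b, cat_c
test_enc = pd.get_dummies(toy_test)    # columns: x, cat_a, cat_b

# join="inner" keeps only the columns present in both frames, so cat_c is dropped.
train_aligned, test_aligned = train_enc.align(test_enc, join="inner", axis=1)

# Alternative: keep every training column and fill the ones missing from test with 0.
test_reindexed = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_aligned.columns))
print(list(test_reindexed.columns))
```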


Get labels from data

In [13]:
train_labels = train['TARGET']
test_labels = test['TARGET']

Fill in missing data and scale

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer as Imputer

# Drop the target from the training and testing data
if 'TARGET' in train:
    train = train.drop(columns = ['TARGET'])
    test = test.drop(columns = ['TARGET'])
else:
    train = train.copy()
    test = test.copy()

# Feature names
features = list(train.columns)

# Median imputation of missing values
imputer = Imputer(strategy = 'median')

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)

Training data shape:  (246008, 87)
Testing data shape:  (61503, 87)
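
The same fit-on-train, transform-both pattern can be expressed as a single scikit-learn `Pipeline`, which may be easier to carry over when porting. This is only a sketch on made-up toy arrays; the step names and variable names are assumptions:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy matrices with missing entries (values are made up).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the training median
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

# fit_transform learns medians, means and variances from the training data only;
# transform reuses them on the test data.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.mean(axis=0))  # close to 0 for every column
print(X_test_prepared)
```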


Fit random forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100,
                                       random_state = 50,
                                       verbose = 1, n_jobs = -1)
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict(test)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   51.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.7min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    1.5s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    2.9s finished


Evaluate on test

In [16]:
from sklearn.metrics import accuracy_score, roc_auc_score

print(accuracy_score(test_labels, predictions))

0.9191909337755881
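
This accuracy is close to the share of class 0 in the test split shown earlier (0.919158), which is what always predicting 0 would score, so accuracy alone says little on data this imbalanced. The sketch below compares a model against that baseline and computes ROC AUC (imported above but not used) from predicted probabilities. It runs on synthetic data, not the loan data, and every name and parameter in it is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 8% positives (not the loan data).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Baseline: accuracy of always predicting the majority class.
majority = np.bincount(y_tr).argmax()
baseline_acc = accuracy_score(y_te, np.full_like(y_te, majority))

# Accuracy uses hard labels; ROC AUC uses the predicted probability of class 1.
model_acc = accuracy_score(y_te, model.predict(X_te))
model_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
print(f"model ROC AUC:     {model_auc:.3f}")
```

In the notebook itself the equivalent call would pass `random_forest.predict_proba(test)[:, 1]` together with `test_labels`.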


In [17]:
feature_importances.sort_values('importance', ascending=False).head(10)

Unnamed: 0,feature,importance
1,EXT_SOURCE_2,0.098683
2,EXT_SOURCE_3,0.08692
3,DAYS_BIRTH,0.076777
5,DAYS_ID_PUBLISH,0.076086
7,DAYS_REGISTRATION,0.075531
6,AMT_ANNUITY,0.071283
10,DAYS_LAST_PHONE_CHANGE,0.068226
4,DAYS_EMPLOYED,0.065257
9,AMT_CREDIT,0.064658
11,AMT_INCOME_TOTAL,0.056638
