# Lab | Random Forest

<br>

<details><summary>▶ Instructions:</summary>
<p>

* Apply the Random Forest algorithm, but this time only upsample the data.
* Discuss the output and its impact on the business scenario. Is the cost of a false positive equal to the cost of a false negative? How would you change your algorithm or data in order to maximize the return of the business?

</p>
</details>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
categorical = pd.read_csv("files_for_lab/categorical.csv")
numerical = pd.read_csv('files_for_lab/numerical.csv')
target = pd.read_csv('files_for_lab/target.csv')
target.head(1)   # TARGET_B = boolean response to the previous mailing, TARGET_D = amount donated in dollars

Unnamed: 0,TARGET_B,TARGET_D
0,0,0.0


In [2]:
print('cat:', categorical.shape)
print('num:', numerical.shape)
print('target_B:\n' + str(target['TARGET_B'].value_counts()))

cat: (95412, 22)
num: (95412, 315)
target_B:
0    90569
1     4843
Name: TARGET_B, dtype: int64


In [3]:
target_b = target['TARGET_B']   # separate target series, possibly needed like that in later labs
target_d = target['TARGET_D']
data = pd.concat([categorical, numerical, target_d, target_b], axis=1) 

In [4]:
nans = pd.DataFrame(data.isna().sum()*100/len(data), columns=['percentage'])
nans.sort_values('percentage', ascending = False).head()
data[data.isna().any(axis=1)]   # returns an empty frame: no row contains a NaN

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B,...,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_D,TARGET_B


In [5]:
# 0    90569
# 1     4843
# upsampling from 4843 --> 90569
from sklearn.utils import resample

category_0 = data[data['TARGET_B'] == 0]
category_1 = data[data['TARGET_B'] == 1]

# random_state added so the resampled rows are the same on every run
category_1_oversampled = resample(category_1, replace=True, n_samples = len(category_0), random_state=12)

print(category_0.shape)
print(category_1_oversampled.shape)

(90569, 339)
(90569, 339)


In [6]:
# re-joining
data_upsampled = pd.concat([category_0, category_1_oversampled], axis=0).reset_index(drop=True)
data_upsampled.shape

(181138, 339)
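
One caveat with the cells above: the minority class is upsampled before the train/test split, so copies of the same donor row can end up in both the training and the test set, which may make the test score look better than it is. A minimal sketch of the alternative order (split first, then upsample only the training part) is shown below. The `toy` frame and its `feature` column are made up for illustration; only the `TARGET_B` name and the `resample` call come from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the lab data: 5% positives, similar to TARGET_B.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "TARGET_B": [1] * 50 + [0] * 950,
})

# 1) Split first, stratified so both parts keep the original class ratio.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["TARGET_B"], random_state=12
)

# 2) Upsample the minority class in the training part only.
train_0 = train[train["TARGET_B"] == 0]
train_1 = train[train["TARGET_B"] == 1]
train_1_up = resample(train_1, replace=True, n_samples=len(train_0), random_state=12)
train_balanced = pd.concat([train_0, train_1_up], axis=0).reset_index(drop=True)

# The training part is now balanced; the test part keeps the real imbalance.
print(train_balanced["TARGET_B"].value_counts())
print(test["TARGET_B"].value_counts())
```

With this order the test set contains only rows the model has never seen and still reflects the real class ratio, so its metrics are closer to what a live campaign would show.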

In [7]:
# X/y split and OneHotEncoding
y = data_upsampled['TARGET_B']
X = data_upsampled.drop(['TARGET_B'], axis = 1)

X_num = X.select_dtypes(np.number)
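X_cat = X.select_dtypes(object)   # np.object was removed in NumPy 1.24; the builtin object works everywhere

encoder = OneHotEncoder(drop='first').fit(X_cat)
encoded_categorical = encoder.transform(X_cat).toarray()
# Name the encoded columns so X does not mix string and integer column labels
encoded_categorical = pd.DataFrame(encoded_categorical,
                                   columns=encoder.get_feature_names_out())
X = pd.concat([X_num, encoded_categorical], axis = 1)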

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [8]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

y_train_regression = X_train['TARGET_D']
y_test_regression = X_test['TARGET_D']

# Now we can remove the column target d from the set of features
X_train = X_train.drop(['TARGET_D'], axis = 1)
X_test = X_test.drop(['TARGET_D'], axis = 1)

In [9]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5,
                             min_samples_split=20,
                             min_samples_leaf =20)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.6168242357325237
0.6146902947996025


In [10]:
# For cross validation
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(max_depth=5,
                             min_samples_split=20,
                             min_samples_leaf =20)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=10)
print(np.mean(cross_val_scores))

0.6126561313918983


In [12]:
cross_val_scores

array([0.61741771, 0.61037886, 0.61120696, 0.61707267, 0.60775654,
       0.61562349, 0.61224208, 0.60761852, 0.61679663, 0.61044786])

In [13]:
# Is the cost of a false positive equal to the cost of a false negative?
# No, the two errors cost different things.
# A false positive means sending a mail pack to someone who is not likely to donate,
# so marketing and mailing costs are unnecessarily high.
# A false negative is a potential donor who is left out of the campaign and
# will probably not donate spontaneously, so the donation is lost.
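
To put numbers on the two kinds of error, one option is to count them with a confusion matrix and multiply by a cost per error. The sketch below does this on toy labels. `COST_PER_MAIL` and `AVG_DONATION` are hypothetical figures, not values from the lab data; with the real data the average gift could be estimated from `TARGET_D` of the donors.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (1 = donor, 0 = non-donor), for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Hypothetical business figures, not taken from the lab data.
COST_PER_MAIL = 0.70    # cost of sending one mail pack
AVG_DONATION = 15.00    # average gift from a responding donor

# A false positive wastes one mail pack.
cost_false_positives = fp * COST_PER_MAIL
# A false negative loses a donation (net of the mail pack that was not sent).
cost_false_negatives = fn * (AVG_DONATION - COST_PER_MAIL)
# Return of the campaign: donations received minus all mail packs sent.
campaign_return = tp * AVG_DONATION - (tp + fp) * COST_PER_MAIL

print(f"FP: {fp}, cost {cost_false_positives:.2f}")
print(f"FN: {fn}, cost {cost_false_negatives:.2f}")
print(f"campaign return: {campaign_return:.2f}")
```

Under these assumed figures a single missed donor costs far more than a single wasted mail pack, which is the asymmetry described in the comments above.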

In [None]:
# How would you change your algorithm or data in order to maximize the return of the business?
# I don't know yet. Upsampling versus downsampling didn't change much in the last lab.
# Let's see whether the next lab provides more insight.
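
One possible direction, offered only as a sketch: keep the original rows, let the forest reweight the classes with `class_weight="balanced"`, and then choose the probability threshold that gives the highest campaign return instead of the default 0.5. The data below is synthetic (`make_classification`), and `COST_PER_MAIL` and `AVG_DONATION` are hypothetical figures, not values from the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the lab features.
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=12
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=12
)

# class_weight="balanced" reweights the classes instead of resampling the rows.
clf_toy = RandomForestClassifier(
    max_depth=5, min_samples_leaf=20, class_weight="balanced", random_state=12
)
clf_toy.fit(X_tr, y_tr)
proba_donor = clf_toy.predict_proba(X_te)[:, 1]   # probability of class 1 (donor)

# Hypothetical business figures, not taken from the lab data.
COST_PER_MAIL = 0.70
AVG_DONATION = 15.00

def campaign_return(threshold):
    """Return of mailing everyone whose predicted donor probability reaches the threshold."""
    mailed = proba_donor >= threshold
    donors_reached = np.sum(mailed & (y_te == 1))
    return donors_reached * AVG_DONATION - np.sum(mailed) * COST_PER_MAIL

# Try a grid of thresholds and keep the one with the highest return.
thresholds = np.linspace(0.05, 0.95, 19)
returns = [campaign_return(t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(returns))]
print(f"best threshold: {best_threshold:.2f}, return: {max(returns):.2f}")
```

In a real project the threshold would be chosen on a separate validation set rather than on the test set, otherwise the reported return is optimistic.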