Instructions

Apply the Random Forests algorithm but this time only by upsampling the data.

Discuss the output and its impact in the business scenario. 
Is the cost of a false positive equal to the cost of the false negative? 
How would you change your algorithm or data in order to maximize the return of the business?

In [4]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

In [2]:
categorical = pd.read_csv('categorical.csv')
numerical = pd.read_csv('numerical.csv')
target = pd.read_csv('target.csv')

In [12]:
target

Unnamed: 0,TARGET_B,TARGET_D
0,0,0.0
1,0,0.0
2,0,0.0
3,0,0.0
4,0,0.0
...,...,...
95407,0,0.0
95408,0,0.0
95409,0,0.0
95410,1,18.0


In [None]:
# TARGET_B Target Variable: Binary Indicator for Response to 97NK Mailing
# TARGET_D Target Variable: Donation Amount (in $) associated with the Response to 97NK Mailing

In [13]:
print(target['TARGET_B'].value_counts())
print(target['TARGET_D'].value_counts())

0    90569
1     4843
Name: TARGET_B, dtype: int64
0.00     90569
10.00      941
15.00      591
20.00      577
5.00       503
         ...  
4.50         1
55.00        1
18.25        1
16.87        1
48.00        1
Name: TARGET_D, Length: 71, dtype: int64


In [3]:
combined = pd.concat([categorical, numerical, target], axis=1)

In [6]:
combined

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B,...,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B,TARGET_D
0,IL,36,H,F,3,L,E,C,T,2,...,12.0,10.0,4,7.741935,95515,0,4,39,0,0.0
1,CA,14,H,M,3,L,G,A,S,1,...,25.0,25.0,18,15.666667,148535,0,2,1,0,0.0
2,NC,43,U,M,3,L,E,C,R,2,...,16.0,5.0,12,7.481481,15078,1,4,60,0,0.0
3,CA,44,U,F,3,L,E,C,R,2,...,11.0,10.0,9,6.812500,172556,1,4,41,0,0.0
4,FL,16,H,F,3,L,F,A,S,2,...,15.0,15.0,14,6.864865,7112,1,2,26,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,other,27,H,M,3,L,G,C,C,2,...,25.0,25.0,9,25.000000,184568,0,1,12,0,0.0
95408,TX,24,H,M,3,L,F,A,C,1,...,20.0,20.0,9,20.000000,122706,1,1,2,0,0.0
95409,MI,30,H,M,3,L,E,B,C,3,...,10.0,10.0,3,8.285714,189641,1,3,34,0,0.0
95410,CA,24,H,F,2,L,F,A,C,1,...,21.0,18.0,4,12.146341,4693,1,4,11,1,18.0


In [15]:
category_0 = combined[combined['TARGET_B'] == 0]
category_1 = combined[combined['TARGET_B'] == 1]

In [None]:
# Upsampling

In [16]:
cat1_oversampled = resample(category_1,
                            replace=True,
                            n_samples=len(category_0))

In [18]:
print(cat1_oversampled.shape)
print(category_0.shape)

(90569, 339)
(90569, 339)


In [28]:
combined_upsampled = pd.concat([cat1_oversampled, category_0], axis=0)

In [29]:
combined_upsampled.shape

(181138, 339)

In [37]:
# Separate the binary target from the features
y = combined_upsampled['TARGET_B']
X = combined_upsampled.drop(['TARGET_B'], axis=1)

# Reset the index so the numeric columns line up with the encoded columns below
numericalX = X.select_dtypes(np.number).reset_index().drop(['index'], axis=1)
categoricalX = X.select_dtypes(object)

# One-hot encode the categorical columns, dropping the first level of each
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first').fit(categoricalX)
encoded_categorical = encoder.transform(categoricalX).toarray()
encoded_categorical = pd.DataFrame(encoded_categorical)
X = pd.concat([numericalX, encoded_categorical], axis=1)

# TARGET_D is still in X here; it is split off after the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
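
One caveat with this cell: the minority rows were duplicated before the split, so copies of the same donor can end up in both the training and the test set, which may make the test score look better than it is. A minimal sketch of the alternative order (split first, then upsample only the training part) is shown below. The `toy` frame and its `feature` column are made up for illustration and are not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 4 donors and 16 non-donors
toy = pd.DataFrame({
    "feature": range(20),
    "TARGET_B": [1, 1, 1, 1] + [0] * 16,
})

# 1) Split first, keeping the class ratio in both parts
train, test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["TARGET_B"]
)

# 2) Upsample the minority class inside the training part only
train_0 = train[train["TARGET_B"] == 0]
train_1 = train[train["TARGET_B"] == 1]
train_1_up = resample(train_1, replace=True,
                      n_samples=len(train_0), random_state=0)
train_balanced = pd.concat([train_0, train_1_up], axis=0)

# No original row appears on both sides of the split
overlap = set(train_balanced.index) & set(test.index)
print(len(overlap))
print(train_balanced["TARGET_B"].value_counts())
```

The test part keeps the original class imbalance, so a score measured on it is closer to what the model would see on a real mailing list.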

In [38]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

y_train_regression = X_train['TARGET_D']
y_test_regression = X_test['TARGET_D']

# Now we can remove the TARGET_D column from the set of features
X_train = X_train.drop(['TARGET_D'], axis = 1)
X_test = X_test.drop(['TARGET_D'], axis = 1)

In [39]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5,
                             min_samples_split=20,
                             min_samples_leaf=20)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.6172727900075909
0.6077067461631886


In [40]:
# For cross validation
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(max_depth=5,
                             min_samples_split=20,
                             min_samples_leaf=20)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=10)
print(np.mean(cross_val_scores))

0.6131253881719688


In [None]:
# Is the cost of a false positive equal to the cost of the false negative? 

No, the two costs are not equal.

A false positive means targeting someone who is not likely to donate. The cost is the unnecessary marketing spend on that 
person, e.g. postage fees.

A false negative means leaving a potential donor out of the marketing campaign and potentially losing that person's 
donation. The cost depends on the size of the donation that was lost. The most common donation amounts in TARGET_D are 
$10, $15, $20 and $5, so a missed donor is likely to cost more than one wasted mailing.
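
The comparison between the two error types can be made concrete by putting a dollar value on each outcome. The sketch below does this for a handful of toy labels. Both `COST_PER_MAILING` and `AVG_DONATION` are assumed figures chosen for illustration, not values taken from this notebook, and `campaign_profit` is a hypothetical helper.

```python
import numpy as np

# Assumed figures for illustration only; replace with the real campaign numbers
COST_PER_MAILING = 0.68   # cost of sending one mail piece, in $
AVG_DONATION = 15.00      # average gift from a responder, in $

def campaign_profit(y_true, y_pred, cost=COST_PER_MAILING, gift=AVG_DONATION):
    """Net return of mailing everyone the model predicts as a donor."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # mailed, donated
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # mailed, no donation
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # not mailed, would have donated
    profit = tp * (gift - cost) - fp * cost          # money actually made
    missed = fn * (gift - cost)                      # money left on the table
    return {"tp": tp, "fp": fp, "fn": fn, "profit": profit, "missed": missed}

# Toy labels, not model output
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
result = campaign_profit(y_true, y_pred)
print(result)
```

Under these assumed numbers a single false negative costs about 21 times as much as a single false positive, which is why accuracy alone is a weak guide for this problem.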

In [None]:
# How would you change your algorithm or data in order to maximize the return of the business?

I would probably use downsampling instead of upsampling. With upsampling we keep all the rows for people who did not 
donate (the majority class) and create copies of the rows for people who did donate. 
I would rather keep only the real rows for donors and downsample the majority class to match.
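
A minimal sketch of that downsampling step, using the same `resample` helper as the upsampling cell but drawing without replacement. The `toy` frame is a made-up stand-in for `combined`, and `random_state=0` is an arbitrary choice to make the draw repeatable.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in for `combined`: 6 non-donors and 2 donors
toy = pd.DataFrame({
    "AVGGIFT": [7.7, 15.6, 7.4, 6.8, 6.9, 25.0, 20.0, 12.1],
    "TARGET_B": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["TARGET_B"] == 0]
minority = toy[toy["TARGET_B"] == 1]

# replace=False means no majority row is drawn twice
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)

balanced = pd.concat([majority_down, minority], axis=0)
print(balanced["TARGET_B"].value_counts())
```

The trade-off is that most of the majority-class rows are discarded, so the model trains on far fewer examples than with upsampling.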

To maximize the return of the business, the focus should probably be on how much a potential donor would donate, which is 
a regression problem rather than a classification problem.
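
One way to combine the two views is to multiply the predicted probability of a donation by the predicted donation amount and mail only when that expected gift exceeds the mailing cost. The sketch below illustrates the idea on synthetic data. The features, the way `donated` and `amount` are generated, and `COST_PER_MAILING` are all assumptions made for the example, and it scores the same rows it was trained on purely to keep it short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the features, TARGET_B and TARGET_D
X = rng.normal(size=(n, 3))
donated = (X[:, 0] + rng.normal(size=n) > 1.0).astype(int)
amount = np.where(donated == 1, 10 + 5 * np.abs(X[:, 1]), 0.0)

# Stage 1: probability that a person donates at all
clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
clf.fit(X, donated)

# Stage 2: donation size, learned from donors only
reg = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
reg.fit(X[donated == 1], amount[donated == 1])

# Expected gift = P(donate) * predicted amount if they donate
p_donate = clf.predict_proba(X)[:, 1]
expected_gift = p_donate * reg.predict(X)

COST_PER_MAILING = 0.68  # assumed cost per mail piece, in $
mail_to = expected_gift > COST_PER_MAILING
print(mail_to.sum(), "of", n, "people would be mailed")
```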