# Task 2 (Unsupervised Learning) - Characterizing Donors and Donation Type

In this task you should **use unsupervised learning algorithms and try to characterize donors (people who really did a donation) and their donation type**. You can use:
* **Association rule mining** to find **associations between the features and the target Donation/DonationTYPE**.
* **Clustering algorithms to find similar groups of donors**. Is it possible to find groups of donors with the same/similar DonationTYPE?
* **Be creative and define your own unsupervised analysis!** What would it be interesting to find out ?

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
from IPython.display import HTML

## Preprocessing Data for Association Rule Mining

In [2]:
df_clean = pd.read_csv('donors_dataset_clean.csv') 
df_clean.head()

Unnamed: 0,TARGET_B,TARGET_D,MONTHS_SINCE_ORIGIN,DONOR_AGE,IN_HOUSE,SES,CLUSTER_CODE,INCOME_GROUP,MOR_HIT_RATE,MEDIAN_HOME_VALUE,...,LIFETIME_GIFT_RANGE_BIN,LAST_GIFT_AMT_BIN,CARD_PROM_12_BIN,NUMBER_PROM_12_BIN,MONTHS_SINCE_LAST_GIFT_BIN,MONTHS_SINCE_FIRST_GIFT_BIN,FILE_CARD_GIFT_BIN,INCOME_GROUP_BIN,RECENCY_STATUS_96NK_e,DONATION_TYPE
0,1.0,10.0,137.0,79.0,0.0,2.0,45.0,7.0,0.0,334.0,...,"(15.0, 30.0]","(15.0, 30.0]","(3.0, 9.0]","(20.0, 30.0]","(-0.1, 8.0]","(120.0, 260.0]","(10.0, 20.0]","(6.4, 8.337]",4.0,D
1,0.0,0.0,113.0,75.0,0.0,1.0,11.0,5.0,0.0,2388.0,...,"(15.0, 30.0]","(15.0, 30.0]","(9.0, 14.0]","(30.0, 64.0]","(-0.1, 8.0]","(80.0, 120.0]","(10.0, 20.0]","(4.4, 5.4]",4.0,
2,0.0,0.0,92.0,62.695162,0.0,2.0,4.0,6.0,0.0,1688.0,...,"(-0.1, 15.0]","(-0.1, 15.0]","(9.0, 14.0]","(30.0, 64.0]","(-0.1, 8.0]","(80.0, 120.0]","(10.0, 20.0]","(5.4, 6.4]",0.0,
3,0.0,0.0,101.0,74.0,0.0,2.0,49.0,2.0,8.0,514.0,...,"(15.0, 30.0]","(15.0, 30.0]","(3.0, 9.0]","(10.0, 20.0]","(16.0, 24.0]","(80.0, 120.0]","(-0.1, 10.0]","(1.4, 2.4]",0.0,
4,0.0,0.0,101.0,63.0,0.0,3.0,8.0,3.0,0.0,452.0,...,"(-0.1, 15.0]","(-0.1, 15.0]","(3.0, 9.0]","(10.0, 20.0]","(16.0, 24.0]","(80.0, 120.0]","(-0.1, 10.0]","(2.4, 3.4]",0.0,


In this task, we only need to find rules for the existance of donation. Thus we will use a subset of the dataset without considering the columns: DONATION_TYPE, TARGET_D.

In [3]:
df_clean.drop(columns = ['DONATION_TYPE','TARGET_D'], inplace=True)

In [4]:
# Transform these columns into categorical data
df_targetb = pd.get_dummies(df_clean, columns = ['TARGET_B','IN_HOUSE','SES','CLUSTER_CODE','RECENT_STAR_STATUS','PEP_STAR','FREQUENCY_STATUS_97NK','URBANICITY_e', 'DONOR_GENDER_e', 'OVERLAY_SOURCE_e',
       'RECENCY_STATUS_96NK_e','DONOR_AGE_BIN', 'MONTHS_SINCE_ORIGIN_BIN', 'MOR_HIT_RATE_BIN', 'MEDIAN_HOME_VALUE_BIN', 'MEDIAN_HOME_VALUE_BIN', 'MEDIAN_HOUSEHOLD_INCOME_BIN', 
             'PCT_OWNER_OCCUPIED_BIN', 'PER_CAPITA_INCOME_BIN','PCT_ATTRIBUTE1_BIN','PCT_ATTRIBUTE2_BIN', 'PCT_ATTRIBUTE3_BIN', 'PCT_ATTRIBUTE4_BIN',
                  'RECENT_RESPONSE_PROP_BIN', 'RECENT_AVG_GIFT_AMT_BIN', 'RECENT_CARD_RESPONSE_PROP_BIN', 'RECENT_AVG_CARD_GIFT_AMT_BIN', 'RECENT_RESPONSE_COUNT_BIN','RECENT_CARD_RESPONSE_COUNT_BIN', 'MONTHS_SINCE_LAST_PROM_RESP_BIN', 'LIFETIME_CARD_PROM_BIN',
                        'LIFETIME_PROM_BIN', 'INCOME_GROUP_BIN' ,'LIFETIME_GIFT_AMOUNT_BIN', 'LIFETIME_GIFT_COUNT_BIN', 'LIFETIME_AVG_GIFT_AMT_BIN', 'LIFETIME_GIFT_RANGE_BIN', 'LAST_GIFT_AMT_BIN', 'CARD_PROM_12_BIN', 'NUMBER_PROM_12_BIN', 'MONTHS_SINCE_LAST_GIFT_BIN', 'MONTHS_SINCE_FIRST_GIFT_BIN', 'FILE_CARD_GIFT_BIN' ])

In [5]:
# Final dataset without non-categorical data
df_enc = df_targetb.drop(columns = ['MONTHS_SINCE_ORIGIN', 'DONOR_AGE', 'MOR_HIT_RATE', 'MEDIAN_HOME_VALUE',
       'MEDIAN_HOUSEHOLD_INCOME', 'PCT_OWNER_OCCUPIED', 'PER_CAPITA_INCOME','INCOME_GROUP',
       'PCT_ATTRIBUTE1', 'PCT_ATTRIBUTE2', 'PCT_ATTRIBUTE3', 'PCT_ATTRIBUTE4',
       'RECENT_RESPONSE_PROP', 'RECENT_AVG_GIFT_AMT',
       'RECENT_CARD_RESPONSE_PROP', 'RECENT_AVG_CARD_GIFT_AMT',
       'RECENT_RESPONSE_COUNT', 'RECENT_CARD_RESPONSE_COUNT',
       'MONTHS_SINCE_LAST_PROM_RESP', 'LIFETIME_CARD_PROM', 'LIFETIME_PROM',
       'LIFETIME_GIFT_AMOUNT', 'LIFETIME_GIFT_COUNT', 'LIFETIME_AVG_GIFT_AMT',
       'LIFETIME_GIFT_RANGE', 'LAST_GIFT_AMT', 'CARD_PROM_12',
       'NUMBER_PROM_12', 'MONTHS_SINCE_LAST_GIFT', 'MONTHS_SINCE_FIRST_GIFT',
       'FILE_CARD_GIFT'])

In [6]:
df_enc.head()

Unnamed: 0,TARGET_B_0.0,TARGET_B_1.0,IN_HOUSE_0.0,IN_HOUSE_1.0,SES_1.0,SES_2.0,SES_3.0,SES_4.0,CLUSTER_CODE_1.0,CLUSTER_CODE_2.0,...,"MONTHS_SINCE_LAST_GIFT_BIN_(24.0, 27.0]","MONTHS_SINCE_LAST_GIFT_BIN_(8.0, 16.0]","MONTHS_SINCE_FIRST_GIFT_BIN_(-0.1, 40.0]","MONTHS_SINCE_FIRST_GIFT_BIN_(120.0, 260.0]","MONTHS_SINCE_FIRST_GIFT_BIN_(40.0, 80.0]","MONTHS_SINCE_FIRST_GIFT_BIN_(80.0, 120.0]","FILE_CARD_GIFT_BIN_(-0.1, 10.0]","FILE_CARD_GIFT_BIN_(10.0, 20.0]","FILE_CARD_GIFT_BIN_(20.0, 30.0]","FILE_CARD_GIFT_BIN_(30.0, 32.0]"
0,0,1,1,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
2,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
3,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
4,1,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,1,1,0,0,0


To find frequent items for each donation class (donors vs. non-donors) we can create one subset for each. 

In [7]:
# Donors
donors = df_enc.loc[df_enc['TARGET_B_1.0'] == 1]
#NoDonors
no_donors = df_enc.loc[df_enc['TARGET_B_0.0'] == 1]

### Finding Frequent Items

The Apriori algorithm calculates rules that express probabilistic relationships between items in frequent itemsets.

#### Donors 

In [8]:
# Finding frequent items for donors
frequent_itemsets_d = apriori(donors, min_support=0.6, use_colnames=True)
## We can filter the data by adding a length
frequent_itemsets_d['length'] = frequent_itemsets_d['itemsets'].apply(lambda x: len(x))
frequent_itemsets_d = frequent_itemsets_d[frequent_itemsets_d['length'] >= 3]
frequent_itemsets_d

Unnamed: 0,support,itemsets,length
118,0.904708,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_1.0, ...",3
119,0.652035,"(TARGET_B_1.0, MEDIAN_HOUSEHOLD_INCOME_BIN_(20...",3
120,0.909711,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0], TARGET_B_1.0...",3
121,0.642711,"(TARGET_B_1.0, PCT_ATTRIBUTE2_BIN_(25.0, 50.0]...",3
122,0.644076,"(RECENT_RESPONSE_PROP_BIN_(-0.1, 0.25], TARGET...",3
...,...,...,...
766,0.656811,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_1.0, ...",7
767,0.638617,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_1.0, ...",7
768,0.629065,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_1.0, ...",7
769,0.704344,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_1.0, ...",7


#### Not Donors 

In [9]:
# Finding frequent items for non donors
frequent_itemsets_nd = apriori(no_donors, min_support=0.6, use_colnames=True)
## We can filter the data by adding a length
frequent_itemsets_nd['length'] = frequent_itemsets_nd['itemsets'].apply(lambda x: len(x))
frequent_itemsets_nd = frequent_itemsets_nd[frequent_itemsets_nd['length'] >= 3]
frequent_itemsets_nd

Unnamed: 0,support,itemsets,length
139,0.672943,"(RECENT_STAR_STATUS_0.0, TARGET_B_0.0, IN_HOUS...",3
140,0.927556,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_0.0, ...",3
141,0.647308,"(MEDIAN_HOME_VALUE_BIN_(-0.1, 1000.0], TARGET_...",3
142,0.647308,"(MEDIAN_HOME_VALUE_BIN_(-0.1, 1000.0], TARGET_...",3
143,0.675287,"(TARGET_B_0.0, MEDIAN_HOUSEHOLD_INCOME_BIN_(20...",3
...,...,...,...
1520,0.620614,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], FILE_CARD_GIFT...",8
1521,0.623336,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0], TARGET_B_0.0...",8
1522,0.649123,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",8
1523,0.616379,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",8


### Finding Associations

An association rule states that an item (or group of items) implies the presence of another item with some probability. Unlike decision tree rules, which predict a target, association rules simply express correlation.

In [10]:
# Frequent items for the complete subset (donor and non-donors)
frequent_itemsets = apriori(df_enc, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets[frequent_itemsets['length'] >= 3]
frequent_itemsets

Unnamed: 0,support,itemsets,length
123,0.696101,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_0.0, ...",3
124,0.698996,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0], TARGET_B_0.0...",3
125,0.675444,"(LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0], TARGET_B...",3
126,0.659951,"(TARGET_B_0.0, CARD_PROM_12_BIN_(3.0, 9.0], IN...",3
127,0.625958,"(FILE_CARD_GIFT_BIN_(-0.1, 10.0], TARGET_B_0.0...",3
...,...,...,...
709,0.613700,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",7
710,0.631803,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",7
711,0.613416,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",7
712,0.603995,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",7


In [11]:
# Generate association rules with confidence >= 80% for the dataset
frequent_itemsets = apriori(df_enc, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules['length']= rules['antecedents'].apply(lambda x: len(x))
rules['Consequents']= rules['consequents'].apply(lambda x: ','.join(list(x))).astype("unicode")
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length,Consequents
0,(TARGET_B_0.0),(IN_HOUSE_0.0),0.750468,0.93207,0.703819,0.93784,1.006191,0.004331,1.092835,1,IN_HOUSE_0.0
1,(IN_HOUSE_0.0),(TARGET_B_0.0),0.93207,0.750468,0.703819,0.755114,1.006191,0.004331,1.018973,1,TARGET_B_0.0
2,"(MOR_HIT_RATE_BIN_(-0.1, 30.0])",(TARGET_B_0.0),0.987118,0.750468,0.740764,0.750431,0.999951,-3.7e-05,0.999852,1,TARGET_B_0.0
3,(TARGET_B_0.0),"(MOR_HIT_RATE_BIN_(-0.1, 30.0])",0.750468,0.987118,0.740764,0.987069,0.999951,-3.7e-05,0.996236,1,"MOR_HIT_RATE_BIN_(-0.1, 30.0]"
4,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0])",(TARGET_B_0.0),0.993814,0.750468,0.745588,0.750228,0.99968,-0.000238,0.99904,1,TARGET_B_0.0


In [12]:
rules[(rules['length']>=2) & (rules['Consequents'].str.contains('TARGET_B', regex=False))]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length,Consequents
164,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], IN_HOUSE_0.0)",(TARGET_B_0.0),0.921855,0.750468,0.696101,0.755110,1.006185,0.004279,1.018953,2,TARGET_B_0.0
170,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0], IN_HOUSE_0.0)",(TARGET_B_0.0),0.925997,0.750468,0.698996,0.754857,1.005848,0.004064,1.017903,2,TARGET_B_0.0
176,"(LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0], IN_HOUSE...",(TARGET_B_0.0),0.889450,0.750468,0.675444,0.759395,1.011895,0.007940,1.037102,2,TARGET_B_0.0
183,"(CARD_PROM_12_BIN_(3.0, 9.0], IN_HOUSE_0.0)",(TARGET_B_0.0),0.870836,0.750468,0.659951,0.757836,1.009818,0.006417,1.030427,2,TARGET_B_0.0
188,"(FILE_CARD_GIFT_BIN_(-0.1, 10.0], IN_HOUSE_0.0)",(TARGET_B_0.0),0.818512,0.750468,0.625958,0.764751,1.019032,0.011690,1.060712,2,TARGET_B_0.0
...,...,...,...,...,...,...,...,...,...,...,...
6685,"(LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0], FILE_CAR...","(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",0.808240,0.735883,0.605641,0.749333,1.018277,0.010871,1.053655,3,"MOR_HIT_RATE_BIN_(-0.1, 30.0],PCT_ATTRIBUTE1_B..."
6687,"(MOR_HIT_RATE_BIN_(-0.1, 30.0], FILE_CARD_GIFT...","(LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0], PCT_ATTR...",0.858010,0.664491,0.605641,0.705867,1.062267,0.035501,1.140669,2,"LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0],PCT_ATTRIB..."
6689,"(PCT_ATTRIBUTE1_BIN_(-0.1, 25.0], FILE_CARD_GI...","(MOR_HIT_RATE_BIN_(-0.1, 30.0], TARGET_B_0.0, ...",0.862664,0.660292,0.605641,0.702059,1.063256,0.036031,1.140187,2,"MOR_HIT_RATE_BIN_(-0.1, 30.0],TARGET_B_0.0,CAR..."
6693,"(LIFETIME_GIFT_COUNT_BIN_(0.1, 25.0], FILE_CAR...","(MOR_HIT_RATE_BIN_(-0.1, 30.0], PCT_ATTRIBUTE1...",0.864366,0.689915,0.605641,0.700676,1.015597,0.009301,1.035950,2,"MOR_HIT_RATE_BIN_(-0.1, 30.0],PCT_ATTRIBUTE1_B..."


### Association Rules - Results and Discussion 

The IF component of an association rule is known as the Antecedent. The THEN component is known as the Consequent. The minimum support indicates how frequently the itemset appears in the dataset. From the results we can see not donating could is a consequence, 75% of times, of not donating to a In House Program and having a low response to a mailed solicitation from a group other than the charitable organization. Not donating is a consequence, 75% of times, of having never donated to the charitable organization's In House program and living in a neighbourhood with the lowest amount of militaries currently serving.
These rules suggest not donating is related to the level of involvement of the individual in the society. This is the first time, so far, we were able to withdraw cultural conclusions from the variables we have been using in models. Due to the small sample size for actual donations, the model was not able to establish any rules related TARGET_B_1.0. It would have been more interesting to achieve rules for TARGET_B_1.0, but unfortunatly, even with low parameters, we did not achieve that. Again, we belive this is due to the reduced amount of examples for TARGET_B_1.0 when compared to TARGET_B_0.0.