General description: These are the primary donation data analyzed in the paper "Examining charitable giving in real-world online donations" by Sisco and Weber,
published in Nature Communications in 2019. Each row represents one donation.

-------Variables:

amount_donated - the amount donated in US dollars.


campaign_ID - an encrypted identifier for the campaign. You can use it to tell which donations were made to the same campaign, but not to trace the data back to the unencrypted campaign ID. The encryption protects the privacy of the campaign creators and donors.


category - the category of each campaign.


anonymous - whether the donation was made anonymously, that is, without the donor's name shown to the public.


gender - the estimated gender of the donor. "F" represents female and "M" represents male. The genders were estimated from the donors' publicly displayed names. The algorithm used is provided in the supplementary materials for the paper.

same_last_name - "1" means that the donor and the recipient had the same last name. "0" means that they did not (from what we could infer based on publicly displayed names). 

empathy - "1" means that an expression of empathy was detected in the message left by the current donor. "0" means that no expression of empathy was detected. The algorithm used is provided in the supplementary materials for the paper.
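
As the campaign_ID description notes, the encrypted identifier can still be used to group donations that belong to the same campaign. A minimal sketch of that grouping follows; the identifiers and amounts are toy values invented for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy data only: these campaign IDs and amounts are made up for illustration.
toy = pd.DataFrame({
    "campaign_ID": ["a1", "a1", "b2", "b2", "b2"],
    "amount_donated": [50, 100, 10, 25, 25],
})

# Group donations that share the same (encrypted) campaign identifier
# and summarize each campaign.
per_campaign = toy.groupby("campaign_ID")["amount_donated"].agg(
    n_donations="count",
    total="sum",
    mean_amount="mean",
)
print(per_campaign)
```

The same call should work on the real frame by replacing `toy` with `df`, assuming the column names match the variable list above.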


In [1]:
# Import the libraries
import pandas as pd

In [41]:
# read csv
df = pd.read_csv('donor_data.csv')
print(f'Dimension of the dataset: number of rows: {df.shape[0]}, number of columns: {df.shape[1]}')
df.head(3)

Dimension of the dataset: number of rows: 11999, number of columns: 7


Unnamed: 0,amount_donated,campaign_ID,category,anonymous,gender,same_last_name,empathy
0,50,f3F1j8SbZZZIR/7Y9r8XUS2JwH4JNnU9zxa967TSlCAz,Medical,0,F,0.0,0
1,100,f3F1j8SbZZZIR/7Y9r8XUS2JwH4JNnU9zxa967TSlCAz,Medical,0,F,0.0,1
2,10,f3F1j8SbZZZIR/7Y9r8XUS2JwH4JNnU9zxa967TSlCAz,Medical,0,M,0.0,0


In [42]:
df.tail(3)

Unnamed: 0,amount_donated,campaign_ID,category,anonymous,gender,same_last_name,empathy
11996,25,O6D6pB+cgIR2I4snh43BNi6WyGYPNHU=,Community,0,F,0.0,0
11997,50,O6D6pB+cgIR2I4snh43BNi6WyGYPNHU=,Community,1,,0.0,0
11998,25,O6D6pB+cgIR2I4snh43BNi6WyGYPNHU=,Community,0,F,0.0,0


In [38]:
# Number of unique values in each column
df.nunique()

amount_donated    169
campaign_ID       299
category           20
anonymous           2
gender              2
same_last_name      2
empathy             2
dtype: int64

In [10]:
# Number of missing values in each column
df.isnull().sum()

amount_donated       0
campaign_ID          0
category           105
anonymous            0
gender            4866
same_last_name     174
empathy              0
dtype: int64
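
Raw counts are easier to judge next to the share of rows they represent: for example, 4866 missing gender values out of 11999 rows is roughly 40% of the data. The sketch below shows one way to report both figures side by side. It uses a small toy frame, so the printed numbers are not those of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data only: values are invented to illustrate the calculation.
toy = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "same_last_name": [0.0, 0.0, np.nan, 1.0],
    "empathy": [0, 1, 0, 0],
})

# isnull().sum() counts missing cells; isnull().mean() gives their fraction.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(missing.sort_values("pct_missing", ascending=False))
```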

In [37]:
# check duplication in the dataset and show duplicated rows
print(f'Number of duplicated rows: {df.duplicated().sum()}')
# show duplicated rows
df[df.duplicated()]

Number of duplicated rows: 0


Unnamed: 0,amount_donated,campaign_ID,category,anonymous,gender,same_last_name,empathy


In [39]:
# drop duplicated rows
df.drop_duplicates(inplace=True)
# check the number of rows after dropping duplicated rows
print(f'Number of rows after dropping duplicated rows: {df.shape[0]}')

Number of rows after dropping duplicated rows: 5063
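
The duplicate check above reports 0 duplicated rows, yet dropping duplicates leaves 5063 rows. The cell numbers are also out of order, so the two cells were probably run against different states of df. One possible explanation, which is an assumption and not confirmed by the source, is that a unique row-number column was present in one run and absent in the other. The sketch below uses a toy frame with a hypothetical `row_id` column to show how such a column hides duplicates from `duplicated()`.

```python
import pandas as pd

# Toy data only. "row_id" is a hypothetical stand-in for an exported index column.
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "amount_donated": [50, 50, 25, 50],
    "campaign_ID": ["a1", "a1", "a1", "a1"],
})

# With a unique row_id, no two rows are identical, so nothing is flagged.
with_id = int(toy.duplicated().sum())

# Without it, repeated (amount, campaign) pairs are flagged as duplicates.
content = toy.drop(columns="row_id")
without_id = int(content.duplicated().sum())

deduped = content.drop_duplicates()
print(with_id, without_id, len(deduped))
```

Because each row represents one donation, two donations of the same amount to the same campaign can legitimately look identical, so dropping such rows may discard real donations instead of true duplicates.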


In [24]:
# check the category column for spelling variants of the same label
print(f'Number of unique values in the category column: {df.category.nunique()}')
print(f'Frequency of each category in descending order:\n{df.category.value_counts()}')

Number of unique values in the category column: 20
Frequency of each category in descending order:
Sandy                6273
Medical              2520
Alberta Fires         906
Charity               430
Sports                341
Family                258
Business              229
Events                213
Creative              136
Education             127
Nepal                 121
Volunteer              68
Community              62
Philippine Relief      54
Newlyweds              52
US Storms              37
Travel                 27
Faith                  19
Ecuador                15
Emergencies             6
Name: category, dtype: int64
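
Reading the frequency table by eye works for 20 labels. A programmatic check can also flag labels that differ only in case or surrounding whitespace. The following is a minimal sketch on toy labels (the misspelled variant is invented for illustration). It also passes `dropna=False` so that missing categories appear in the counts, which the default `value_counts()` call omits.

```python
import pandas as pd

# Toy labels only: "medical " is an invented variant used to exercise the check.
toy = pd.Series(
    ["Medical", "medical ", "Sandy", "Sandy", None, "Alberta Fires"],
    name="category",
)

# Normalize case and surrounding whitespace before comparing labels.
normalized = toy.str.strip().str.casefold()

# For each normalized label, count how many distinct raw spellings map to it.
spellings_per_label = toy.groupby(normalized).nunique()
suspect = spellings_per_label[spellings_per_label > 1]
print(suspect)

# Include missing values in the frequency table.
counts = toy.value_counts(dropna=False)
print(counts)
```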


In [30]:
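# correlation between the empathy column and the other numeric columns
# (numeric_only=True skips string columns such as category and gender; pandas >= 2.0 raises an error without it)
print(f'Correlation between the empathy column and the other columns:\n{df.corr(numeric_only=True)["empathy"].sort_values(ascending=False)}')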

Correlation between the empathy column and the other columns:
empathy           1.000000
amount_donated    0.004352
same_last_name    0.003049
anonymous        -0.099264
Name: empathy, dtype: float64
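
The correlation output lists only numeric columns, so gender and category are absent from it. One way to bring gender into the comparison is to encode it as a 0/1 indicator first. This is a sketch on toy data; the `is_female` column name is an assumption introduced here, and rows with missing gender are left as NaN and excluded pairwise by `corr`.

```python
import pandas as pd

# Toy data only: values are invented to illustrate the encoding step.
toy = pd.DataFrame({
    "empathy":   [1, 0, 1, 0, 1, 0],
    "anonymous": [0, 1, 0, 1, 0, 0],
    "gender":    ["F", "M", "F", None, "F", "M"],
})

# Encode gender as a numeric indicator; unmapped or missing values become NaN.
toy["is_female"] = toy["gender"].map({"F": 1, "M": 0})

# Pearson correlation of empathy with every other numeric column.
corr = toy.corr(numeric_only=True)["empathy"].drop("empathy")
print(corr.sort_values(ascending=False))
```

With two binary variables, the Pearson coefficient is equivalent to the phi coefficient, so values near zero like those in the output above indicate little linear association.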
