
# Segmentation of Donors
For a non-profit organization, I had a CSV file with payments, their details, and the donors' e-mail addresses. I wanted to perform a simple segmentation of this data, so I looked up RFM (Recency, Frequency, Monetary) segmentation.
With the help of ChatGPT, I was able to do this by myself within 3 hours.

## 0. Get data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Increase the default figure size
sns.set(rc={'figure.figsize':(12,8)})

%matplotlib inline

In [23]:
data = pd.read_csv('/content/payments.csv')

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132096 entries, 0 to 132095
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   ID           132096 non-null  int64 
 1   Campaign ID  132096 non-null  object
 2   Status       132096 non-null  object
 3   Banka        132096 non-null  object
 4   Test         132096 non-null  int64 
 5   Is donation  132096 non-null  int64 
 6   Transparent  132096 non-null  object
 7   Příspěvek    132096 non-null  object
 8   Poplatek     132096 non-null  object
 9   E-mail       132096 non-null  object
 10  Comgate ID   132055 non-null  object
 11  VS           132047 non-null  object
 12  Vytvořeno    132096 non-null  object
 13  Bank Date    132096 non-null  object
 14  Bank Month   132096 non-null  object
dtypes: int64(3), object(12)
memory usage: 15.1+ MB


## 1. Feature and data preparation
Let's drop the columns we do not need, and then the rows whose payment status is CANCELLED, PENDING, or REFUNDED.

In [25]:
processed = data.drop(['ID', 'Campaign ID', 'VS', 'Test', 'Poplatek', 'Banka', 'Comgate ID', 'Vytvořeno'], axis=1)

In [26]:
# Checkpoint the trimmed frame (the original call pickled the untouched `data`)
processed.to_pickle("processed.pkl")

In [27]:
processed.drop(processed[processed['Status'] == 'CANCELLED'].index, inplace = True)

In [28]:
processed.drop(processed[processed['Status'] == 'PENDING'].index, inplace = True)

In [29]:
processed.drop(processed[processed['Status'] == 'REFUNDED'].index, inplace = True)
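
The three status filters above could also be written as a single boolean mask. The following is a minimal sketch on a toy frame, not the notebook's actual data; the `PAID` status value and the example addresses are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the payments table.
# 'PAID' is a hypothetical status value used only for this illustration.
toy = pd.DataFrame({
    'Status': ['PAID', 'CANCELLED', 'PENDING', 'PAID', 'REFUNDED'],
    'E-mail': ['a@example.com', 'b@example.com', 'c@example.com',
               'd@example.com', 'e@example.com'],
})

# The statuses removed in the cells above
excluded = ['CANCELLED', 'PENDING', 'REFUNDED']

# Keep every row whose status is NOT in the excluded list
kept = toy[~toy['Status'].isin(excluded)]
print(kept)
```

A mask like this returns a new frame instead of modifying `processed` in place, which makes it easier to rerun a cell without side effects.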

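Next, remove the 10377 payments from anonymous donors (E-mail = \N). The same placeholder also appears in 'Bank Date', so those rows are dropped as well.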

In [30]:
processed['E-mail'].describe()

count     105238
unique     43708
top           \N
freq       10377
Name: E-mail, dtype: object

In [31]:
processed.drop(processed[processed['E-mail'] == '\\N'].index, inplace=True)

In [32]:
processed.drop(processed[processed['Bank Date'] == '\\N'].index, inplace=True)

In [33]:
# Checkpoint the cleaned frame (the original call pickled the untouched `data`)
processed.to_pickle("processed1.pkl")
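
The `\N` placeholder could also be turned into a missing value when the file is loaded, so that `dropna` handles both columns at once. This is a sketch under assumptions: the inline CSV text, the example addresses, and the date format are made up for illustration, and only `na_values` and `dropna` are the point.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has 15 columns
csv_text = (
    "E-mail,Bank Date\n"
    "a@example.com,01/02/2023\n"
    "\\N,03/02/2023\n"
    "b@example.com,\\N\n"
)

# Treat the literal string \N as NaN while reading
toy = pd.read_csv(io.StringIO(csv_text), na_values=['\\N'])

# Drop rows that lack either the donor e-mail or the bank date
cleaned = toy.dropna(subset=['E-mail', 'Bank Date'])
print(cleaned)
```

Note that `'\\N'` needs the doubled backslash (or a raw string), because a bare `'\N'` is not a valid Python string literal.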

## 2. RFM analysis
First, we need to convert 'Bank Date' to a datetime type and 'Příspěvek' (the contribution amount) to an integer. Then we can perform the segmentation.



In [34]:
# Convert 'Bank Date' column to datetime
processed['Bank Date'] = pd.to_datetime(processed['Bank Date'], dayfirst=True)

In [35]:
processed['Příspěvek'] = processed['Příspěvek'].astype(int)
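
If some values might not parse, the conversions above could be made tolerant with `errors='coerce'`, and the failing rows inspected afterwards. A minimal sketch on toy data follows; the `dd/mm/yyyy` format and the sample values are assumptions, since the notebook does not show the raw strings.

```python
import pandas as pd

# Toy rows: two valid payments and one deliberately broken row
toy = pd.DataFrame({
    'Bank Date': ['03/02/2023', '14/04/2024', 'not a date'],
    'Příspěvek': ['500', '1200', 'abc'],
})

# dayfirst=True reads 03/02/2023 as 3 February, not 2 March;
# errors='coerce' turns unparseable values into NaT / NaN instead of raising
toy['Bank Date'] = pd.to_datetime(toy['Bank Date'], dayfirst=True, errors='coerce')
toy['Příspěvek'] = pd.to_numeric(toy['Příspěvek'], errors='coerce')

# Rows where either conversion failed
bad_rows = toy[toy['Bank Date'].isna() | toy['Příspěvek'].isna()]
print(bad_rows)
```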

In [36]:
processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 94701 entries, 1 to 132045
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Status       94701 non-null  object        
 1   Is donation  94701 non-null  int64         
 2   Transparent  94701 non-null  object        
 3   Příspěvek    94701 non-null  int64         
 4   E-mail       94701 non-null  object        
 5   Bank Date    94701 non-null  datetime64[ns]
 6   Bank Month   94701 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 5.8+ MB


In [37]:
# Recency: number of days between the reference date and a donor's most recent payment
def calculate_recency(x):
    current_date = pd.to_datetime('2024-04-14')  # Reference date of the analysis
    max_date = pd.to_datetime(x.max(), errors='coerce')  # Latest payment date; invalid values become NaT
    return (current_date - max_date).days if not pd.isnull(max_date) else pd.NaT

In [38]:
# Group by 'E-mail' and calculate Recency, Frequency, and Monetary values
rfm = processed.groupby('E-mail').agg({
    'Bank Date': calculate_recency,  # Recency
    'E-mail': 'size',  # Frequency
    'Příspěvek': 'sum'  # Monetary
})

# Rename columns
rfm.columns = ['Recency', 'Frequency', 'Monetary']

# Print RFM data
print(rfm)

                       Recency  Frequency  Monetary
E-mail                                             
001938@GMAIL.COM           243         37     31574
007adela@seznam.cz        1412          1       100
007haky@gmail.com          285          1       500
007jaba@gmail.com          461          2       800
007tommy@centrum.cz       1560          1       500
...                        ...        ...       ...
zzurek@seznam.cz           205          2      1000
zzvv@volny.cz              381          1       200
zzz@centrum.cz            1251          3      1500
zzzuzannnka@seznam.cz      383          1      1800
zzzuzzzkaaa@seznam.cz     1277          1       200

[43652 rows x 3 columns]


In [39]:
rfm.to_csv('rfm_scores.csv')
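
To check how the three metrics are derived, here is a small worked example using pandas named aggregation, which names the output columns directly. It is a sketch, not the notebook's exact code: the toy payments and the intermediate `LastDate` column are hypothetical, and the reference date is the same 2024-04-14 used above.

```python
import pandas as pd

reference_date = pd.Timestamp('2024-04-14')  # same reference date as calculate_recency

# Toy payments: donor "a" gave twice, donor "b" once
toy = pd.DataFrame({
    'E-mail': ['a@example.com', 'a@example.com', 'b@example.com'],
    'Bank Date': pd.to_datetime(['2024-04-01', '2024-04-10', '2024-03-15']),
    'Příspěvek': [100, 250, 500],
})

toy_rfm = toy.groupby('E-mail').agg(
    LastDate=('Bank Date', 'max'),    # most recent payment per donor
    Frequency=('Bank Date', 'size'),  # number of payments
    Monetary=('Příspěvek', 'sum'),    # total amount donated
)

# Recency in days since the most recent payment
toy_rfm['Recency'] = (reference_date - toy_rfm['LastDate']).dt.days
toy_rfm = toy_rfm[['Recency', 'Frequency', 'Monetary']]
print(toy_rfm)
```

Donor "a" ends up with Recency 4, Frequency 2, and Monetary 350; donor "b" with Recency 30, Frequency 1, and Monetary 500.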

In [40]:
# RFM segmentation
# Define quartiles for each RFM metric
quartiles = rfm.quantile(q=[0.25, 0.5, 0.75])

# Function to create RFM segments: a three-character code (R, F, M),
# where each character runs from 1 (best quartile) to 4 (worst quartile)
def rfm_segment(row, metrics, quartiles):
    segment = ''
    for metric in metrics:
        if metric == 'Recency':
            # Fewer days since the last payment is better, so low values get '1'
            if row[metric] <= quartiles[metric][0.25]:
                segment += '1'
            elif row[metric] <= quartiles[metric][0.5]:
                segment += '2'
            elif row[metric] <= quartiles[metric][0.75]:
                segment += '3'
            else:
                segment += '4'
        else:  # For Frequency and Monetary, high values are better and get '1'
            if row[metric] <= quartiles[metric][0.25]:
                segment += '4'
            elif row[metric] <= quartiles[metric][0.5]:
                segment += '3'
            elif row[metric] <= quartiles[metric][0.75]:
                segment += '2'
            else:
                segment += '1'
    return segment
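
The same quartile rule can be expressed column-wise, which avoids a Python-level loop over all donors. The sketch below is an assumed equivalent of the function above, shown on toy values; `quartile_rank` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy RFM table with five donors
toy_rfm = pd.DataFrame({
    'Recency':   [10, 200, 400, 900, 1500],
    'Frequency': [12, 3, 2, 1, 1],
    'Monetary':  [9000, 1500, 800, 300, 100],
})
toy_q = toy_rfm.quantile(q=[0.25, 0.5, 0.75])

def quartile_rank(values, cuts):
    """Return 1 if value <= Q1, 2 if <= Q2, 3 if <= Q3, else 4."""
    # Each cut point that a value exceeds pushes it one rank higher
    return 1 + sum((values > cuts[q]).astype(int) for q in (0.25, 0.5, 0.75))

# Recency: low values are best, so the rank is used as-is
r = quartile_rank(toy_rfm['Recency'], toy_q['Recency'])
# Frequency and Monetary: high values are best, so the rank is reversed
f = 5 - quartile_rank(toy_rfm['Frequency'], toy_q['Frequency'])
m = 5 - quartile_rank(toy_rfm['Monetary'], toy_q['Monetary'])

toy_rfm['RFM_Segment'] = r.astype(str) + f.astype(str) + m.astype(str)
print(toy_rfm)
```

Because both versions use `<=` against the same cut points, values that tie with a quartile boundary land in the same bucket either way.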

In [41]:
# Apply segmentation function to each row
rfm['RFM_Segment'] = rfm.apply(lambda x: rfm_segment(x, ['Recency', 'Frequency', 'Monetary'], quartiles), axis=1)

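# Calculate RFM Score as the plain sum of the raw Recency, Frequency, and Monetary values
# (mixed units, so large Monetary values dominate; this is not a sum of the quartile ranks)
rfm['RFM_Score'] = rfm[['Recency', 'Frequency', 'Monetary']].sum(axis=1)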

# Output RFM data
print(rfm)

                       Recency  Frequency  Monetary RFM_Segment  RFM_Score
E-mail                                                                    
001938@GMAIL.COM           243         37     31574         111      31854
007adela@seznam.cz        1412          1       100         344       1513
007haky@gmail.com          285          1       500         143        786
007jaba@gmail.com          461          2       800         112       1263
007tommy@centrum.cz       1560          1       500         343       2061
...                        ...        ...       ...         ...        ...
zzurek@seznam.cz           205          2      1000         112       1207
zzvv@volny.cz              381          1       200         144        582
zzz@centrum.cz            1251          3      1500         311       2754
zzzuzannnka@seznam.cz      383          1      1800         141       2184
zzzuzzzkaaa@seznam.cz     1277          1       200         344       1478

[43652 rows x 5 columns]
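
The `RFM_Score` column above adds days, counts, and amounts together, so it mostly tracks `Monetary`. A common alternative is to sum the three quartile digits of `RFM_Segment`, giving a score from 3 (best) to 12 (worst). The sketch below assumes that convention; the `Rank_Score` column name and the example addresses are hypothetical, while the segment codes are taken from the output above.

```python
import pandas as pd

# Three segment codes from the printed output, with placeholder addresses
toy = pd.DataFrame(
    {'RFM_Segment': ['111', '344', '143']},
    index=['a@example.com', 'b@example.com', 'c@example.com'],
)

# Sum the R, F, and M digits of each code
toy['Rank_Score'] = toy['RFM_Segment'].apply(lambda s: sum(int(ch) for ch in s))
print(toy)
```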

## 3. Download the results

In [45]:
rfm.to_csv('rfm.csv')