## Stripe: Data Science Intern - Written Project
#### Frank Yue Ying | 2022-03-07

### Raw Data
The data records merchant transaction activity for merchants that started processing during a two-year period (2033-2034). It spans 1/1/33 through 12/31/34. Each observation is a single transaction, with its amount given in cents.
#### Columns
1. ID: integer in [1, 1513719], the natural order of rows in the raw data
2. merchant: unique merchant ID, a 10-character string of digits and letters
3. time: transaction timestamp formatted as YYYY-MM-DD HH:mm:ss; all timestamps are assumed to be in the same timezone
4. amount_usd_in_cents: transaction amount as an integer number of cents

In [14]:
import pandas as pd
import numpy as np

In [6]:
#Load the data
raw_data = pd.read_csv("takehome_ds_written.csv",header = 0,index_col =0)
raw_data.head()

In [17]:
#Transform the data: break time into year, month, day
data = raw_data.copy()
# "YYYY-MM-DD HH:mm:ss" -> date part -> [year, month, day], kept as zero-padded strings
date_parts = data['time'].str.split(" ").str[0].str.split("-", expand=True)
data['year'] = date_parts[0]
data['month'] = date_parts[1]
data['day'] = date_parts[2]
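
As an alternative to string splitting, the timestamp column could be parsed once with `pd.to_datetime` and the parts read from the `.dt` accessor. The sketch below runs on a small made-up frame (`toy` is a hypothetical name, not part of the notebook) and assumes the `YYYY-MM-DD HH:mm:ss` layout described above. Note that it yields integer year, month and day columns rather than zero-padded strings.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as the raw data (values are made up)
toy = pd.DataFrame({
    "merchant": ["aaaaaaaaaa", "bbbbbbbbbb"],
    "time": ["2033-01-05 10:15:00", "2034-12-31 23:59:59"],
    "amount_usd_in_cents": [1250, 99900],
})

# Parse the timestamps once, then derive the calendar parts from the .dt accessor
parsed = pd.to_datetime(toy["time"], format="%Y-%m-%d %H:%M:%S")
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day

print(toy[["year", "month", "day"]])
```

Keeping a parsed datetime column around also makes later steps such as sorting by time or computing gaps between transactions straightforward.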

In [44]:
data.dtypes

merchant               object
time                   object
amount_usd_in_cents     int64
year                   object
month                  object
day                    object
dtype: object

In [23]:
#Count unique active merchants per year & month
data_ByYear = data.groupby(["year","month"])["merchant"].nunique()
data_ByYear

year  month
2033  01        390
      02        656
      03        964
      04       1164
      05       1402
      06       1639
      07       1864
      08       2149
      09       2324
      10       2562
      11       2785
      12       3107
2034  01       3227
      02       3490
      03       3931
      04       4079
      05       4437
      06       4630
      07       4789
      08       5175
      09       5273
      10       5740
      11       5955
      12       6126
Name: merchant, dtype: int64
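
The same monthly count of active merchants can also be sketched with a single monthly `Period` key instead of separate year and month columns. The example below uses made-up rows (`toy`, `active` and the merchant labels are hypothetical) and is only meant to illustrate the grouping, not to reproduce the figures above.

```python
import pandas as pd

# Hypothetical toy transactions (merchant labels and timestamps are made up)
toy = pd.DataFrame({
    "merchant": ["m1", "m1", "m2", "m2", "m3", "m4"],
    "time": ["2033-01-03 09:00:00", "2033-02-10 12:30:00",
             "2033-01-20 18:45:00", "2033-01-21 08:00:00",
             "2033-02-01 00:00:01", "2033-02-15 14:00:00"],
})

# Map each transaction to its calendar month as a pandas Period
toy["period"] = pd.to_datetime(toy["time"]).dt.to_period("M")

# Count distinct merchants with at least one transaction in each month
active = toy.groupby("period")["merchant"].nunique()

print(active)
```

A merchant with several transactions in one month (here `m2` in January) is counted once for that month, which is the behaviour `nunique` gives in the cell above as well.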

In [24]:
# set up unique merchant-based tracking table 
merchants = data['merchant'].unique().tolist()

In [68]:
len(merchants)
# Template of monthly columns (transaction count and volume), all initialised to 0
merchant_data_columns_dict = {}
for year in ['count-2033','count-2034','volume-2033','volume-2034']:
    for month in range(1,13):
        merchant_data_columns_dict[year+"-"+str(month)] = 0

def query_merchant(dt,merchant,merchant_data_columns_dict):
    # Fill a fresh copy of the template so values never carry over between merchants
    merchant_data_values = merchant_data_columns_dict.copy()
    monthly = dt.loc[dt['merchant'] == merchant]
    # count_nonzero counts transactions with a nonzero amount; sum is the monthly volume in cents
    monthly = monthly.groupby(["year","month"]).agg({"amount_usd_in_cents": [np.count_nonzero, "sum"]}).reset_index()
    for index, row in monthly.iterrows():
        suffix = str(row['year'].values[0])+"-"+str(int(row['month'].values[0]))
        merchant_data_values['count-'+suffix] = int(row[('amount_usd_in_cents','count_nonzero')])
        merchant_data_values['volume-'+suffix] = int(row[('amount_usd_in_cents','sum')])
    return pd.DataFrame([merchant_data_values], columns=merchant_data_values.keys())

# Build one row per merchant, then concatenate once instead of growing the frame in the loop
merchant_rows = []
for merchant in merchants:
    merchant_dt = query_merchant(data,merchant,merchant_data_columns_dict)
    merchant_dt.insert(0, 'id', merchant)
    merchant_rows.append(merchant_dt)
merchant_data = pd.concat(merchant_rows, axis=0, ignore_index=True)

merchant_data.to_csv("merchant_data.csv", index = False)

    year month amount_usd_in_cents         
                     count_nonzero      sum
0   2033    08                  11   218133
1   2033    09                  38   454400
2   2033    10                  20   309162
3   2033    11                  11   170170
4   2033    12                  45   766358
5   2034    01                  28   483214
6   2034    02                  34   667433
7   2034    03                  36   539391
8   2034    04                  29   618490
9   2034    05                  19   401317
10  2034    06                  25   422276
11  2034    07                  28   639421
12  2034    08                  38   488288
13  2034    09                  34   707890
14  2034    10                  30   557971
15  2034    11                  83  1071722
16  2034    12                  35   585480
   count-2033-08  volume-2033-08  count-2033-09  volume-2033-09  \
0             11          218133             38          454400   

   count-2033-10  volume-2033
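
Because the per-merchant loop filters the full table once for every merchant, a single grouped aggregation followed by `unstack` may be a faster way to build the same kind of wide table. The sketch below is an assumption-laden illustration on made-up rows: `toy`, `monthly` and `wide` are hypothetical names, it counts all transactions with `"count"` rather than only nonzero amounts, and it creates columns only for months that appear in the data.

```python
import pandas as pd

# Hypothetical toy transactions with the year/month string columns used above
toy = pd.DataFrame({
    "merchant": ["m1", "m1", "m1", "m2"],
    "year": ["2033", "2033", "2033", "2034"],
    "month": ["08", "08", "09", "01"],
    "amount_usd_in_cents": [100, 250, 400, 999],
})

# One pass over the data: transaction count and total volume per merchant-month
monthly = (
    toy.groupby(["merchant", "year", "month"])["amount_usd_in_cents"]
       .agg(count="count", volume="sum")
)

# Move (year, month) into the columns; merchant-months without activity become 0
wide = monthly.unstack(["year", "month"], fill_value=0)

# Flatten the column MultiIndex into names such as "count-2033-8"
wide.columns = [f"{metric}-{year}-{int(month)}" for metric, year, month in wide.columns]
wide = wide.reset_index().rename(columns={"merchant": "id"})

print(wide)
```

If all 48 month columns are needed regardless of activity, the result could be reindexed against the full list of expected column names with a fill value of 0.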

### References
1. https://chaotic-flow.com/saas-metrics-faqs-what-is-churn/
2. https://dataconomy.com/2017/07/churn-predictive-analytics/