# Venmo Transactional Data

Recently I came across an interesting dataset of public Venmo transactions, collected by scraping through the Venmo API from July to October 2018 and again in January and February 2019. A more complete description of the dataset can be found [at sa7mon's GitHub](https://github.com/sa7mon/venmo-data). Some of it has been redacted for privacy, but having recently read some customer segmentation literature, I wanted to see how feature engineering and clustering could be leveraged for some hypothetical use cases.

### Reading in the data

The data is stored as binary JSON, or BSON. To get started, I will read in only a subset of the full dataset. I'd like to store it in a pandas dataframe and, ultimately, export some aggregations as CSV files.

In [68]:
import pandas as pd
import bson  # provided by pymongo: use `pip install pymongo`, not `pip install bson`

### Creating a one-time approach

First, I write the code line by line, testing along the way. Once I have a working piece of code, I will turn it into a function.

In [69]:
# Simple data processing

# create empty dict to store items of interest
conversion_dict = dict()

stop_at = 50000  # number of records to process

# open the source file; the with-block closes it when the loop ends
with open('F:/Datasets/venmo/venmo.bson', 'rb') as bson_file:
    # loop through transactions and store info of interest
    for c, d in enumerate(bson.decode_file_iter(bson_file)):
        if c == stop_at:  # exit once stop_at records are stored
            break

        # default every payment field so a record without a payment does
        # not raise a NameError or reuse the previous record's values
        target_username = target_user_id = target_type = None
        actor_username = actor_user_id = note = transaction_id = None

        # these two fields sit at the top level of the document
        date_created = d['date_created']
        overall_type = d['type']

        if d['payment'] is not None:
            # NOTE: as written, this branch runs only when the target is
            # None (and would then fail on the lookup), so target_username
            # and target_user_id stay None for every record; the intended
            # test is probably `is not None`
            if d['payment']['target'] is None:
                target_username = d['payment']['target']['user']['username']
                target_user_id = d['payment']['target']['user']['id']
            target_type = d['payment']['target']['type']
            actor_username = d['payment']['actor']['username']
            actor_user_id = d['payment']['actor']['id']
            note = d['payment']['note']
            transaction_id = d['payment']['id']

        conversion_dict[c] = {
            'transaction_id': transaction_id,
            'actor_user_id': actor_user_id,
            'actor_username': actor_username,
            'target_user_id': target_user_id,
            'target_username': target_username,
            'target_type': target_type,
            'overall_type': overall_type,
            'transaction_note': note,
            'date_created': date_created
        }

#create a dataframe from the dictionary
generated_df = pd.DataFrame.from_dict(conversion_dict, orient='index')

#export dataframe as csv
generated_df.to_csv(
    'C:/Users/Stuart/Documents/GitHub/venmo/data/output/smallerdf.csv')
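
As a possible alternative to the manual counter, `itertools.islice` caps any iterator at a fixed number of items, so the loop needs no `break`. This is only a sketch: the `fake_transactions` generator and its fields are made up to stand in for the real BSON iterator.

```python
from itertools import islice


def fake_transactions():
    """Hypothetical stand-in for bson.decode_file_iter: yields dicts forever."""
    i = 0
    while True:
        yield {'type': 'payment', 'payment': {'id': i}}
        i += 1


stop_at = 5  # number of records to process

# islice stops after stop_at items, so exactly stop_at records are kept
limited = {
    c: {'transaction_id': d['payment']['id'], 'overall_type': d['type']}
    for c, d in enumerate(islice(fake_transactions(), stop_at))
}

print(len(limited))
```

The same wrapper should work around any iterator of documents, since `islice` never reads past the requested count.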

### Defining the read/export as a function

Hooray! The single-use approach worked, but I don't want to change constants throughout the code whenever I want an export of a different size. So I'll define a function that does the same thing. It won't be highly generalized, as navigating the JSON through nested Python dictionaries is pretty specific to this dataset. I'm not sure how one would get around that in a flexible way.

In [70]:
def read_export_venmo_bson(filepath='',
                           exportpath='',
                           filename='venmo_export',
                           records=1000):
    """Read BSON Venmo data from the local file at filepath, capture
    transaction details, and export them as a CSV at exportpath with
    the given filename plus '.csv'."""

    # create empty dict to store items of interest
    conversion_dict = dict()

    with open(filepath, 'rb') as bson_file:
        # loop through transactions and store info of interest
        for c, d in enumerate(bson.decode_file_iter(bson_file)):
            if c == records:  # exit once `records` documents are stored
                break

            # default every payment field so a record without a payment
            # does not fail or reuse the previous record's values
            target_username = target_user_id = target_type = None
            actor_username = actor_user_id = note = transaction_id = None

            # these two fields sit at the top level of the document
            date_created = d['date_created']
            overall_type = d['type']

            if d['payment'] is not None:
                # NOTE: same check as in the one-time version; it never
                # passes on this data, so the target user fields stay None
                if d['payment']['target'] is None:
                    target_username = d['payment']['target']['user'][
                        'username']
                    target_user_id = d['payment']['target']['user']['id']
                target_type = d['payment']['target']['type']
                actor_username = d['payment']['actor']['username']
                actor_user_id = d['payment']['actor']['id']
                note = d['payment']['note']
                transaction_id = d['payment']['id']

            conversion_dict[c] = {
                'transaction_id': transaction_id,
                'actor_user_id': actor_user_id,
                'actor_username': actor_username,
                'target_user_id': target_user_id,
                'target_username': target_username,
                'target_type': target_type,
                'overall_type': overall_type,
                'transaction_note': note,
                'date_created': date_created
            }

    # generate dataframe from the collected records; doing this after the
    # loop also covers files that hold fewer than `records` documents
    generated_df = pd.DataFrame.from_dict(conversion_dict, orient='index')

    # export to exportpath as csv
    export_file = str(exportpath) + str(filename) + '.csv'
    generated_df.to_csv(export_file)
    print('Function ran successfully.', len(conversion_dict),
          'records exported into table at:', export_file)
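
On the question of flexibility, one option is to describe each output column as a path of keys and walk the nested dictionary with a small helper. The sketch below is not part of the function above: the helper `dig`, the `FIELDS` mapping, and the sample document are all hypothetical, and the key names simply mirror the ones accessed in the function.

```python
from functools import reduce


def dig(document, path, default=None):
    """Follow a sequence of keys through nested dicts; return default on a miss."""
    def step(node, key):
        # stop descending as soon as a level is missing or not a dict
        return node.get(key) if isinstance(node, dict) else None

    value = reduce(step, path, document)
    return default if value is None else value


# output column -> path into the nested transaction document
FIELDS = {
    'transaction_id': ('payment', 'id'),
    'actor_username': ('payment', 'actor', 'username'),
    'target_username': ('payment', 'target', 'user', 'username'),
    'overall_type': ('type',),
}

# made-up document with the same shape as the keys accessed above
sample = {
    'type': 'payment',
    'payment': {
        'id': 1,
        'actor': {'username': 'alice'},
        'target': {'type': 'user', 'user': None},
    },
}

record = {column: dig(sample, path) for column, path in FIELDS.items()}
print(record)
```

With this layout, adding or removing a column means editing `FIELDS` rather than the loop body, and a missing level yields `None` instead of an exception.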

### Running the function

Time to see how it does!

In [71]:
bson_filepath = 'F:/Datasets/venmo/venmo.bson'
export_filepath = 'C:/Users/Stuart/Documents/GitHub/venmo/data/output/'
filename = 'transactions'

#run function given presets above
read_export_venmo_bson(bson_filepath, export_filepath, filename, 50000)

Function ran successfully. 50000 records exported into table at: C:/Users/Stuart/Documents/GitHub/venmo/data/output/transactions.csv


# Exploratory Data Analysis

In [72]:
transactions = pd.read_csv(export_filepath + filename + '.csv')
transactions.dtypes

Unnamed: 0            int64
transaction_id        int64
actor_user_id         int64
actor_username       object
target_user_id      float64
target_username     float64
target_type          object
overall_type         object
transaction_note     object
date_created         object
dtype: object
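
The `Unnamed: 0` column is the dataframe index that `to_csv` wrote out, and `date_created` comes back as plain `object`. As a sketch, `read_csv` can take care of both on load through `index_col` and `parse_dates`; the two-row CSV below is made up to keep the example self-contained.

```python
import io

import pandas as pd

# made-up CSV in the same layout as the export (first column is the old index)
csv_text = (
    ",transaction_id,actor_username,date_created\n"
    "0,101,alice,2018-08-07 02:11:16\n"
    "1,102,bob,2018-08-07 02:11:15\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                   # reuse the first column as the index
    parse_dates=['date_created'],  # parse timestamps while reading
)

print(sample.dtypes)
```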

### Checking for null values

In [73]:
print('Field and Proportion of null values:\n\n',
      (transactions.isnull().sum() /
       transactions.isnull().count()).sort_values(ascending=False))

Field and Proportion of null values:

 target_username     1.00000
target_user_id      1.00000
transaction_note    0.00004
date_created        0.00000
overall_type        0.00000
target_type         0.00000
actor_username      0.00000
actor_user_id       0.00000
transaction_id      0.00000
Unnamed: 0          0.00000
dtype: float64
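
Because `True` counts as 1 when averaged, the same proportions can also be obtained with `isnull().mean()`. A small sketch on a made-up frame:

```python
import pandas as pd

# made-up frame: one fully empty column, one partially empty, one complete
sample = pd.DataFrame({
    'target_username': [None, None, None, None],
    'transaction_note': ['rent', None, 'pizza', 'gas'],
    'transaction_id': [1, 2, 3, 4],
})

# mean of a boolean mask = share of True values = share of nulls per column
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```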


### Removing columns with no or 'low-value' data

In [74]:
transactions = transactions.drop(['Unnamed: 0','target_username', 'target_user_id'], axis=1)

### Fixing datetime fields and re-indexing

In [75]:
transactions['date_created'] = pd.to_datetime(transactions['date_created'])
# displayed only: the result is not assigned back, so transactions keeps its integer index
transactions.set_index('date_created', drop=False)

| date_created (index) | transaction_id | actor_user_id | actor_username | target_type | overall_type | transaction_note | date_created |
|---|---|---|---|---|---|---|---|
| 2018-08-07 02:11:16 | 2540405007077868184 | 2482900494712832556 | Vitna-Kim | user | payment | fuk ya | 2018-08-07 02:11:16 |
| 2018-08-07 02:11:16 | 2540405006884930468 | 2457721903251456771 | mekanik915 | user | payment | 🚗 | 2018-08-07 02:11:16 |
| 2018-08-07 02:11:16 | 2540405007379857710 | 2363395470786560486 | Brian-Joel-1 | user | payment | :venmo_dollar: | 2018-08-07 02:11:16 |
| 2018-08-07 02:11:15 | 2540404998227886310 | 1988829997170688939 | mikeinglese | user | payment | Gatorade | 2018-08-07 02:11:15 |
| 2018-08-07 02:11:15 | 2540404998613762676 | 2278060275531776951 | Savannah-Landry-4 | user | payment | 🎉 | 2018-08-07 02:11:15 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-07-27 07:25:41 | 2532590732674334806 | 2153475571974144200 | Trickkster | user | payment | Lyft and fuuuuun | 2018-07-27 07:25:41 |
| 2018-07-27 07:25:41 | 2532590727095911178 | 1883160900009984660 | Stephen-Castellana | user | payment | August rent | 2018-07-27 07:25:41 |
| 2018-07-27 07:25:40 | 2532590719109955646 | 1843985320509440293 | janetzzzyy | user | payment | 4hunnid | 2018-07-27 07:25:40 |
| 2018-07-27 07:25:39 | 2532590710964617980 | 2120882776440832561 | Juan-Dominguez-24 | user | payment | Hard | 2018-07-27 07:25:39 |
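
`set_index` returns a new dataframe, so unless the result is assigned back the original keeps its default index. A sketch on made-up rows showing the difference:

```python
import pandas as pd

sample = pd.DataFrame({
    'transaction_id': [1, 2],
    'date_created': pd.to_datetime(['2018-08-07 02:11:16',
                                    '2018-07-27 07:25:41']),
})

# not assigned: sample keeps its default integer index
sample.set_index('date_created')
print(type(sample.index).__name__)

# assigned: the copy is indexed by timestamp; drop=False keeps the column too
indexed = sample.set_index('date_created', drop=False)
print(type(indexed.index).__name__)
```

A timestamp index is mainly useful for time-based slicing and resampling; for the column-based steps that follow, the integer index works just as well.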


### Appending differing date views

In [89]:
# append month_year, month, year, and day to dataframe
# (date_created is already a datetime column, so the .dt accessor is enough)
transactions['month_year'] = transactions['date_created'].dt.to_period('M')
transactions['month'] = transactions['date_created'].dt.month
transactions['year'] = transactions['date_created'].dt.year
transactions['day'] = transactions['date_created'].dt.day

transactions.head()

| | transaction_id | actor_user_id | actor_username | target_type | overall_type | transaction_note | date_created | month_year | month | year | day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2540405007077868184 | 2482900494712832556 | Vitna-Kim | user | payment | fuk ya | 2018-08-07 02:11:16 | 2018-08 | 8 | 2018 | 7 |
| 1 | 2540405006884930468 | 2457721903251456771 | mekanik915 | user | payment | 🚗 | 2018-08-07 02:11:16 | 2018-08 | 8 | 2018 | 7 |
| 2 | 2540405007379857710 | 2363395470786560486 | Brian-Joel-1 | user | payment | :venmo_dollar: | 2018-08-07 02:11:16 | 2018-08 | 8 | 2018 | 7 |
| 3 | 2540404998227886310 | 1988829997170688939 | mikeinglese | user | payment | Gatorade | 2018-08-07 02:11:15 | 2018-08 | 8 | 2018 | 7 |
| 4 | 2540404998613762676 | 2278060275531776951 | Savannah-Landry-4 | user | payment | 🎉 | 2018-08-07 02:11:15 | 2018-08 | 8 | 2018 | 7 |


### Grouping data to more helpful views

First, I check how many transactions the top 10 users generate in this dataset.

In [115]:
# count transactions per user and keep the ten largest counts
top_10_frequencies = (transactions.groupby('actor_user_id')['transaction_id']
                      .count()
                      .nlargest(10)
                      .tolist())

print(top_10_frequencies)
print('The most frequently appearing user in the dataset is associated with', top_10_frequencies[0], 'transactions.')

[9, 7, 5, 5, 5, 4, 4, 4, 4, 4]
The most frequently appearing user in the dataset is associated with 9 transactions.
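
To see which users those counts belong to, `value_counts` returns the same numbers with the user IDs attached, already sorted from most to least frequent. A sketch on made-up IDs:

```python
import pandas as pd

sample = pd.DataFrame({
    'actor_user_id': [11, 22, 11, 33, 11, 22],
    'transaction_id': [1, 2, 3, 4, 5, 6],
})

# counts per user, sorted from most to least frequent
top_users = sample['actor_user_id'].value_counts().head(2)
print(top_users)
```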


In [103]:
pd.pivot_table(transactions, index = 'actor_user_id', columns = 'year', aggfunc='count').sort_values(by='actor_user_id', ascending=False)

| actor_user_id | actor_username (2018) | date_created (2018) | day (2018) | month (2018) | month_year (2018) | overall_type (2018) | target_type (2018) | transaction_id (2018) | transaction_note (2018) |
|---|---|---|---|---|---|---|---|---|---|
| 2532588468043776121 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2532583762034688573 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2532581044125696550 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2532580180099072703 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2532579626450944659 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 731288847777792823 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 726979233972224206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 690722277687296206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 577079506632704432 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
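
Sorting by `actor_user_id` orders the rows by ID rather than by activity, and counting every remaining column repeats the same number across the row. If the goal is to see the busiest users per year, one option is to count a single column and sort on the totals. The sketch below uses made-up rows, and the `total` column is a hypothetical addition.

```python
import pandas as pd

sample = pd.DataFrame({
    'actor_user_id': [11, 22, 11, 33, 11, 22],
    'transaction_id': [1, 2, 3, 4, 5, 6],
    'year': [2018, 2018, 2018, 2019, 2019, 2019],
})

# one count per user and year instead of one per remaining column
per_year = pd.pivot_table(sample,
                          index='actor_user_id',
                          columns='year',
                          values='transaction_id',
                          aggfunc='count',
                          fill_value=0)

# rank users by total transactions across years
per_year['total'] = per_year.sum(axis=1)
per_year = per_year.sort_values(by='total', ascending=False)
print(per_year)
```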
