# Fraudulent User Detection Using Amazon Dataset
### Penghao Xu, Yuan Chen, Jiawei Wu, Haojing Lu

## Part 1. Dataset preprocessing

This script cleans the Amazon review dataset (http://jmcauley.ucsd.edu/data/amazon/links.html) and generates the input data for the baseline model and the newly proposed model.


In [1]:
import json
import pandas as pd
import gzip
import os
from collections import Counter

Download data if needed

In [2]:
# Uncomment to download data
# !wget http://snap.stanford.edu/data/amazon/productGraph/kcore_5.json.gz

In [3]:
## 5-core data is used in this study
# DO NOT extract the dataset. gzip format is required
filename = 'kcore_5.json.gz'
assert filename.endswith('gz'), 'Gzipped dataset is required!'

# set output folder
folder = 'dataset'
os.makedirs(folder, exist_ok=True)
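
The archive has to stay gzipped because the parsing code in this notebook opens it with `gzip.open` and reads it line by line, so the decompressed file never needs to exist on disk. A minimal sketch of that streaming pattern follows. The file name `sample.json.gz`, the helper `iter_records`, and the two records are made up for illustration, and the sketch assumes every line is a Python dict literal so that `ast.literal_eval` can stand in for `eval`.

```python
import ast
import gzip
import os
import tempfile

def iter_records(path):
    """Yield one parsed record per line of a gzipped text file."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            # Assumption: each line is a Python dict literal.
            yield ast.literal_eval(line)

# Build a tiny gzipped sample with made-up records.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as handle:
    handle.write("{'reviewerID': 'U1', 'asin': 'P1', 'overall': 5.0, 'helpful': [2, 3]}\n")
    handle.write("{'reviewerID': 'U2', 'asin': 'P1', 'overall': 1.0, 'helpful': [0, 0]}\n")

# Records are produced lazily; list() is only used here because the sample is tiny.
records = list(iter_records(sample_path))
print(len(records), records[0]['reviewerID'])
```

Because `iter_records` is a generator, memory use stays flat no matter how large the archive is.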

## 1. Generate rating-only dataset
The rating-only dataset has 3 columns: user, item, and rating. This dataset is used for the baseline model, REV2.

In [4]:
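# Process data and generate helpfulness score
def parse(path):
    # Each line is parsed with eval(), so only run this on the trusted dataset file.
    # The context manager closes the file once the generator is exhausted.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)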

def get_df(path, benign=0.8, fraudulent=0.2):
    i = 0
    df = {}
    for d in parse(path):
        i += 1
        # report every 5m
        if not i % 5000000:
            print(f'{i} reviews processed!')
        # skip reviews that received no helpfulness votes
        if not d['helpful'][1]:
            continue
        # extract useful features
        df[i] = {}
        for k in ['reviewerID', 'asin']:
            df[i][k] = d[k]
        # rescale the 1-5 star rating to [-1, 1]
        df[i]['rating'] = (d['overall'] - 3) / 2
        # fraction of votes that marked the review as helpful
        df[i]['helpfulness'] = d['helpful'][0]/d['helpful'][1]
    df = pd.DataFrame.from_dict(df, orient='index')
    return df
df = get_df(filename)
df

5000000 reviews processed!
10000000 reviews processed!
15000000 reviews processed!
20000000 reviews processed!
25000000 reviews processed!
30000000 reviews processed!
35000000 reviews processed!
40000000 reviews processed!


Unnamed: 0,reviewerID,asin,rating,helpfulness
2,A2SUAM1J3GNN3B,0000013714,1.0,0.666667
6,A14A5Q8VJK5NLR,0000029831,0.5,1.000000
7,A3W2PX96K1BA3M,0000029831,1.0,1.000000
8,A2GKR2Q7MD8DG4,0000029831,1.0,1.000000
9,A1MC4E00RO5E9T,0000029831,1.0,1.000000
...,...,...,...,...
41135696,A3PLMYQCFRHU24,BT00DDVMVQ,0.5,0.333333
41135697,A2TNQ87GWTKOON,BT00DDVMVQ,-1.0,0.166667
41135698,A6D3XGIXKU5HQ,BT00DDVMVQ,1.0,1.000000
41135699,A1CC2HQ8XCA28P,BT00DDVMVQ,1.0,1.000000
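
The two derived columns are simple rescalings: the star rating is mapped linearly from [1, 5] to [-1, 1], and helpfulness is the share of helpful votes among all votes on a review. A small sketch with made-up inputs is shown here. The helper names `scale_rating` and `helpfulness_ratio` are hypothetical and do not appear in the notebook.

```python
def scale_rating(overall):
    """Map a 1-5 star rating linearly onto [-1, 1]."""
    return (overall - 3) / 2

def helpfulness_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted."""
    up_votes, total_votes = helpful
    if not total_votes:
        # Mirrors the notebook's choice to skip reviews without votes.
        return None
    return up_votes / total_votes

print([scale_rating(stars) for stars in (1, 2, 3, 4, 5)])
print(helpfulness_ratio([2, 3]))
print(helpfulness_ratio([0, 0]))
```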


Check the benign and fraudulent user counts in the original dataset.

In [5]:
users = df.groupby('reviewerID').helpfulness.mean()
benign = users[users > 0.8]
fraudulent = users[users < 0.2]
print(f'Benign users: {len(benign)}')
print(f'Fraudulent users: {len(fraudulent)}')

Benign users: 1139382
Fraudulent users: 151789
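
Users are labeled by their mean helpfulness: a mean above 0.8 counts as benign, a mean below 0.2 counts as fraudulent, and everyone in between stays unlabeled. The sketch below applies the same thresholds to a made-up five-review table. The function `label_users` and the `'Unlabeled'` placeholder are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def label_users(reviews, benign_min=0.8, fraudulent_max=0.2):
    """Assign a label to each user from the mean helpfulness of their reviews."""
    mean_help = reviews.groupby('reviewerID').helpfulness.mean()
    labels = pd.Series('Unlabeled', index=mean_help.index)
    labels[mean_help > benign_min] = 'Benign'
    labels[mean_help < fraudulent_max] = 'Fraudulent'
    return labels

# Made-up reviews: U1 is mostly helpful, U2 mostly unhelpful, U3 in between.
reviews = pd.DataFrame({
    'reviewerID': ['U1', 'U1', 'U2', 'U2', 'U3'],
    'helpfulness': [1.0, 0.9, 0.0, 0.1, 0.5],
})
labels = label_users(reviews)
print(labels.to_dict())
```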


Keep only the benign and fraudulent users, and discard all users that received no label.

In [6]:
df_benign = df[df.reviewerID.isin(set(benign.index))].copy()
df_benign['label'] = 'Benign'
df_fra = df[df.reviewerID.isin(set(fraudulent.index))].copy()
df_fra['label'] = 'Fraudulent'
df = pd.concat([df_benign, df_fra])
df

Unnamed: 0,reviewerID,asin,rating,helpfulness,label
6,A14A5Q8VJK5NLR,0000029831,0.5,1.000000,Benign
7,A3W2PX96K1BA3M,0000029831,1.0,1.000000,Benign
9,A1MC4E00RO5E9T,0000029831,1.0,1.000000,Benign
49,A2RAGC7VLO78QG,0000031887,0.5,1.000000,Benign
52,A12OFS8WQP86O5,0000031887,1.0,0.869565,Benign
...,...,...,...,...,...
41132360,A32PHKD604WRG7,B00LTFG8EC,0.0,0.000000,Fraudulent
41132943,A1KBQ2GO5TN1VH,B00LUXND82,-1.0,0.000000,Fraudulent
41133093,A1PQH8Z7XBRGD8,B00LVEZYOQ,-1.0,0.000000,Fraudulent
41133297,A08001923S5BQH48HJ5FF,B00LWRN8SQ,1.0,0.000000,Fraudulent


Check the number of reviews from benign and fraudulent users.

In [7]:
counts = Counter(df.label)
print(f'Reviews from benign users: {counts["Benign"]}')
print(f'Reviews from fraudulent users: {counts["Fraudulent"]}')

Reviews from benign users: 6828872
Reviews from fraudulent users: 278376


Generate k-core dataset

In [8]:
k = 3
output = f'{folder}/processed_{k}-core_80_20.csv'
# Repeatedly remove the users with fewer than k reviews,
# then remove the items with fewer than k reviews, until
# nothing more is removed.
diff = 1
while diff:
    cache = len(df)
    counts = df.groupby('reviewerID').asin.count()
    counts = counts[counts >= k]
    df = df[df.reviewerID.isin(set(counts.index))]
    counts = df.groupby('asin').reviewerID.count()
    counts = counts[counts >= k]
    df = df[df.asin.isin(set(counts.index))]
    diff = cache - len(df)
df_out = df[['reviewerID', 'asin', 'rating']]
# write into the output folder defined above
df_out.to_csv(output, index=False)
df_out

Unnamed: 0,reviewerID,asin,rating
52,A12OFS8WQP86O5,0000031887,1.0
76,A2Y0ZD9CYGAS1S,0000031887,0.5
80,A3EERSWHAI6SO,0000031887,1.0
121,A2NXWW4QCL1EU1,0000031887,1.0
170,A5Y15SAOMX6XA,0000589012,-0.5
...,...,...,...
41123081,AD9JN9OJUQOJU,B00LFRBSVM,1.0
41123379,AZ0WVYQHWSMVU,B00LG0DKBO,0.0
41123923,A2AEMXSQ3WKIPF,B00LGABA7U,-0.5
41126046,AF8K7HXE8Y1TO,B00LJKGSZQ,1.0
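
The filter has to loop because removing an item can push a user below k reviews, and the reverse is also true. The sketch below wraps the same idea in a reusable function and runs it on a made-up table where one user only drops out on the second pass. The function name `k_core` and its column-name parameters are hypothetical.

```python
import pandas as pd

def k_core(reviews, k, user_col='reviewerID', item_col='asin'):
    """Drop users and items with fewer than k reviews until nothing changes."""
    while True:
        before = len(reviews)
        user_counts = reviews[user_col].value_counts()
        reviews = reviews[reviews[user_col].isin(user_counts[user_counts >= k].index)]
        item_counts = reviews[item_col].value_counts()
        reviews = reviews[reviews[item_col].isin(item_counts[item_counts >= k].index)]
        if len(reviews) == before:
            return reviews

# Made-up reviews. U4 starts with two reviews, but one of them is on P3,
# which only U4 reviewed, so U4 falls below k=2 after the first pass.
toy = pd.DataFrame({
    'reviewerID': ['U1', 'U1', 'U2', 'U2', 'U3', 'U4', 'U4'],
    'asin':       ['P1', 'P2', 'P1', 'P2', 'P1', 'P1', 'P3'],
})
core = k_core(toy, k=2)
print(sorted(core.reviewerID.unique()), sorted(core.asin.unique()))
```

Each pass either removes at least one row or ends the loop, so the function always terminates.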


Check the number of reviews from benign and fraudulent users again

In [9]:
benign = set(df[df.label == 'Benign'].reviewerID.unique())
fraudulent = set(df[df.label == 'Fraudulent'].reviewerID.unique())
print(f'Benign users: {len(benign)}')
print(f'Fraudulent users: {len(fraudulent)}')
counts = Counter(df.label)
print(f'Reviews from benign users: {counts["Benign"]}')
print(f'Reviews from fraudulent users: {counts["Fraudulent"]}')

Benign users: 601030
Fraudulent users: 22752
Reviews from benign users: 5071608
Reviews from fraudulent users: 92922


Output user labels

In [10]:
userfile = f'{folder}/user_label.csv'
with open(userfile, 'w') as fw:
    fw.write('reviewerID,label\n')
    for u in benign:
        fw.write(f'{u},Benign\n')
    for u in fraudulent:
        fw.write(f'{u},Fraudulent\n')

## 2. Generate toy datasets for coding
Here, toy datasets are generated to speed up model design and debugging.

In [34]:
# select users
n_benign = 50000
n_fraudulent = 5000
toy_users = set(sorted(list(benign))[:n_benign] + sorted(list(fraudulent))[:n_fraudulent])
df_toy = df[df.reviewerID.isin(toy_users)].copy()

# k-core dataset
k = 3
toy_out = f'{folder}/toy_{k}-core_80_20.csv'
diff = 1
while diff:
    cache = len(df_toy)
    counts = df_toy.groupby('reviewerID').asin.count()
    counts = counts[counts >= k]
    df_toy = df_toy[df_toy.reviewerID.isin(set(counts.index))]
    counts = df_toy.groupby('asin').reviewerID.count()
    counts = counts[counts >= k]
    df_toy = df_toy[df_toy.asin.isin(set(counts.index))]
    diff = cache - len(df_toy)
df_toy_out = df_toy[['reviewerID', 'asin', 'rating']]
df_toy_out.to_csv(toy_out, index=False)
df_toy

                     reviewerID        asin  rating  helpfulness   label
2424667   A06120993HKK17SIDKLPM  0373778287     0.5          1.0  Benign
4179841   A06120993HKK17SIDKLPM  0615731317     0.5          1.0  Benign
4785773   A06120993HKK17SIDKLPM  0751552615     1.0          0.0  Benign
5200091   A06120993HKK17SIDKLPM  0778316149     1.0          1.0  Benign
6649649   A06120993HKK17SIDKLPM  0984887873     1.0          1.0  Benign
...                         ...         ...     ...          ...     ...
41092900  A06120993HKK17SIDKLPM  B00KVNGPIM     1.0          1.0  Benign
41117851  A06120993HKK17SIDKLPM  B00LBITZPQ     1.0          1.0  Benign
41117893  A06120993HKK17SIDKLPM  B00LBJUA5E     0.5          1.0  Benign
41131652  A06120993HKK17SIDKLPM  B00LS5PYG6     0.5          1.0  Benign
41133564  A06120993HKK17SIDKLPM  B00LYH0JD6     1.0          1.0  Benign

[82 rows x 5 columns]

Unnamed: 0,reviewerID,asin,rating,helpfulness,label
525,A14CC5FIPR5YVF,0001055178,0.5,1.0,Benign
528,A19D816DMGI44L,0001055178,1.0,0.0,Benign
530,A12RI2CLUGCZ26,0001055178,0.5,1.0,Benign
781,A14FER0RLJLB5Z,0002007770,1.0,1.0,Benign
808,A15OS6GWJ26FB1,0002007770,1.0,1.0,Benign
...,...,...,...,...,...
40801549,A1M0LXJIZOAE0Y,B00IK5AGAQ,0.5,0.0,Fraudulent
40851230,A1SFGNFF4HMCKW,B00IT1WJZQ,0.5,0.0,Fraudulent
40924525,A1LMX0QT4PVPHZ,B00JAQJMJ0,0.0,0.3,Fraudulent
40981061,A006458827ALF2J23JJTO,B00JQR8D6G,1.0,0.0,Fraudulent


Statistics for toy dataset.

In [30]:
toy_benign = set(df_toy[df_toy.label == 'Benign'].reviewerID.unique())
toy_fraudulent = set(df_toy[df_toy.label == 'Fraudulent'].reviewerID.unique())
print(f'Benign users: {len(toy_benign)}')
print(f'Fraudulent users: {len(toy_fraudulent)}')
counts = Counter(df_toy.label)
print(f'Reviews from benign users: {counts["Benign"]}')
print(f'Reviews from fraudulent users: {counts["Fraudulent"]}')

Benign users: 11908
Fraudulent users: 826
Reviews from benign users: 79687
Reviews from fraudulent users: 3146
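
The user counts printed in this notebook imply a strongly imbalanced label distribution, with fraudulent users making up well under 10% of the labeled users in both the full and the toy 3-core datasets. A quick calculation from those printed counts is shown here; the helper name `fraudulent_share` is hypothetical.

```python
# User counts copied from the notebook outputs after 3-core filtering.
full_users = {'Benign': 601030, 'Fraudulent': 22752}
toy_users_count = {'Benign': 11908, 'Fraudulent': 826}

def fraudulent_share(user_counts):
    """Fraction of labeled users that carry the Fraudulent label."""
    return user_counts['Fraudulent'] / sum(user_counts.values())

full_share = fraudulent_share(full_users)
toy_share = fraudulent_share(toy_users_count)
print(f'Full 3-core dataset: {full_share:.2%} fraudulent users')
print(f'Toy 3-core dataset: {toy_share:.2%} fraudulent users')
```

An imbalance of this size is worth keeping in mind when choosing evaluation metrics, since plain accuracy would reward predicting "Benign" for everyone.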


Output user labels

In [31]:
toy_userfile = f'{folder}/toy_user_label.csv'
with open(toy_userfile, 'w') as fw:
    fw.write('reviewerID,label\n')
    for u in toy_benign:
        fw.write(f'{u},Benign\n')
    for u in toy_fraudulent:
        fw.write(f'{u},Fraudulent\n')

## 3. Generate dataset with text reviews

Here, we generate datasets that also contain the review text of each selected review, so that text embeddings can be incorporated into the model.

In [38]:
# output name
output = f'{folder}/processed_{k}-core_80_20_with_text.csv'
output_toy = f'{folder}/toy_{k}-core_80_20_with_text.csv'
# Only output the reviews from benign or fraudulent users
products = set(df.asin.unique())
users = set(df.reviewerID.unique())
toy_products = set(df_toy.asin.unique())
toy_users = set(df_toy.reviewerID.unique())
with open(output, 'w') as fw, open(output_toy, 'w') as ftoy:
    # header
    fw.write('reviewerID,asin,rating,reviewText\n')
    ftoy.write('reviewerID,asin,rating,reviewText\n')
    i=0
    for d in parse(filename):
        i += 1
        # report every 5m
        if not i % 5000000:
            print(f'{i} reviews processed!')
        if not d['helpful'][1]:
            continue
        # write only the selected users and items; newlines and commas are
        # stripped from the text so that each review stays in one CSV row
        if d['reviewerID'] in users and d['asin'] in products:
            fw.write(','.join([d['reviewerID'], d['asin'], str((d['overall']-3)/2),
                               d['reviewText'].replace('\n', ' ').replace(',', ' ')]) + '\n')
        if d['reviewerID'] in toy_users and d['asin'] in toy_products:
            # replace every comma (not only ', ') so the toy file parses like the full one
            ftoy.write(','.join([d['reviewerID'], d['asin'], str((d['overall']-3)/2),
                                 d['reviewText'].replace('\n', ' ').replace(',', ' ')]) + '\n')

5000000 reviews processed!
10000000 reviews processed!
15000000 reviews processed!
20000000 reviews processed!
25000000 reviews processed!
30000000 reviews processed!
35000000 reviews processed!
40000000 reviews processed!
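
Stripping commas keeps the file parseable but also removes punctuation from the review text. One possible alternative, which is not what this notebook does, is to let Python's `csv` module quote the text field so that commas survive a round trip. A minimal sketch with made-up rows written to an in-memory buffer:

```python
import csv
import io

# Made-up reviews; the first one contains a comma and a line break.
rows = [
    ('U1', 'P1', 1.0, 'Great book, would read again.\nHighly recommended'),
    ('U2', 'P1', -1.0, 'Arrived damaged'),
]

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['reviewerID', 'asin', 'rating', 'reviewText'])
for reviewer, asin, rating, text in rows:
    # Collapse all whitespace runs (including line breaks) into single spaces,
    # so each review occupies one physical line; commas are left untouched.
    writer.writerow([reviewer, asin, rating, ' '.join(text.split())])

# Read the buffer back to confirm the text field survived intact.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][3])
```

A file written this way can be loaded with `pd.read_csv` without extra options, since pandas understands the same quoting convention by default.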


In [15]:
len(toy_products)

18863