# Suptech Framework for Detecting Illegal Payment Service Provider Apps Listed on the Google Play Store in Indonesia

[By: Rafi Salman](https://www.linkedin.com/in/rafisalman/)

### Scrape payment-related apps from the Google Play Store using google_play_scraper

First, we need a list of payment-related apps on the Google Play Store that are available for download in Indonesia.

In [1]:
from google_play_scraper import search
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

In [2]:
# Example search with a single keyword ("wallet") to find payment-related apps available in Indonesia
result = search('wallet',
                lang="id",  # Indonesian-language listings (id)
                country="id",  # Apps available for download in Indonesia (id)
                n_hits=30  # limited to 30 (= Google's maximum)
)

In [3]:
# Store the search results in a dataframe
playstore = pd.DataFrame(result)
# Display the scraped results
playstore[:30]

Unnamed: 0,appId,icon,screenshots,title,score,genre,price,free,currency,video,videoImage,description,descriptionHTML,developer,installs
0,com.droid4you.application.wallet,https://play-lh.googleusercontent.com/DqAKT8mJ...,[https://play-lh.googleusercontent.com/nC8X24D...,Wallet - Pelacak Anggaran,4.894325,Keuangan,0,True,IDR,https://www.youtube.com/embed/kAI9yu0U7Gw?ps=p...,https://i.ytimg.com/vi/kAI9yu0U7Gw/hqdefault.jpg,<b>Wallet membantu Anda merencanakan anggaran ...,<b>Wallet membantu Anda merencanakan anggaran ...,BudgetBakers.com,5.000.000+
1,com.wallet.crypto.trustapp,https://play-lh.googleusercontent.com/-3uTwEsZ...,[https://play-lh.googleusercontent.com/HOAibhB...,Trust - Dompet Kripto,4.710865,Keuangan,0,True,IDR,,,Trust Wallet adalah dompet kripto resmi Binanc...,Trust Wallet adalah dompet kripto resmi Binanc...,"DApps Platform, Inc.",10.000.000+
2,io.metamask,https://play-lh.googleusercontent.com/8rzHJpfk...,[https://play-lh.googleusercontent.com/_bBbQOO...,MetaMask - Blockchain Wallet,4.471292,Keuangan,0,True,IDR,https://www.youtube.com/embed/YVgfHZMFFFQ?ps=p...,https://i.ytimg.com/vi/YVgfHZMFFFQ/hqdefault.jpg,Whether you are an experienced user or brand n...,Whether you are an experienced user or brand n...,MetaMask Web3 Wallet,10.000.000+
3,com.airtm.android,https://play-lh.googleusercontent.com/sTc5uAZ8...,[https://play-lh.googleusercontent.com/e1YARiO...,Airtm,3.8,Keuangan,0,True,IDR,https://www.youtube.com/embed/8u8WPL0acz0?ps=p...,https://i.ytimg.com/vi/8u8WPL0acz0/hqdefault.jpg,"Mueve y usa tu dinero como y cuando quieras, e...","Mueve y usa tu dinero como y cuando quieras, e...","Airtm, Inc.",1.000.000+
4,org.toshi,https://play-lh.googleusercontent.com/wrgUujbq...,[https://play-lh.googleusercontent.com/DxQ03kq...,Coinbase Wallet: Simpan Kripto,3.235714,Keuangan,0,True,IDR,,,Coinbase Wallet adalah kunci Anda menuju hal b...,Coinbase Wallet adalah kunci Anda menuju hal b...,Coinbase Wallet,5.000.000+
5,com.bitkeep.wallet,https://play-lh.googleusercontent.com/9ejeZqAJ...,[https://play-lh.googleusercontent.com/m-c5wLt...,BitKeep: Dompet Crypto DeFi,4.67052,Keuangan,0,True,IDR,https://www.youtube.com/embed/FskRmYjXqbY?ps=p...,https://i.ytimg.com/vi/FskRmYjXqbY/hqdefault.jpg,"Didirikan di Singapura pada Mei 2018, BitKeep ...","Didirikan di Singapura pada Mei 2018, BitKeep ...",BitKeep Global Inc,100.000+
6,com.wemadetree.wemixwallet,https://play-lh.googleusercontent.com/e4Vq3dIh...,[https://play-lh.googleusercontent.com/UezCteC...,WEMIX Wallet,3.932367,Keuangan,0,True,IDR,,,"<font color=""#813ccc""><b>Integration with Vari...","<font color=""#813ccc""><b>Integration with Vari...",WEMIX PTE. LTD.,1.000.000+
7,com.dokuwallet.android,https://play-lh.googleusercontent.com/yHAWkzRY...,[https://play-lh.googleusercontent.com/qzAkuJP...,DOKU,3.27084,Keuangan,0,True,IDR,,,"DOKU, layanan dompet digital yang membantu sip...","DOKU, layanan dompet digital yang membantu sip...",PT. Nusa Satu Inti Artha (DOKU),1.000.000+
8,com.algorand.android,https://play-lh.googleusercontent.com/InLGSL6V...,[https://play-lh.googleusercontent.com/UIMLGzf...,Pera Algo Wallet,4.635514,Keuangan,0,True,IDR,,,"Pera Algo Wallet is the simple, fast way to se...","Pera Algo Wallet is the simple, fast way to se...",Pera Wallet,100.000+
9,io.walletcards.android,https://play-lh.googleusercontent.com/IKrNtdJI...,[https://play-lh.googleusercontent.com/FwPwGkA...,Wallet Cards | Digital Wallet,0.0,Perjalanan & Lokal,0,True,IDR,,,"Across all wallet apps on Play Store, \r\nWall...","Across all wallet apps on Play Store, <br>Wall...",Wallet Cards Alliance,1.000.000+


### Load the List of Legal Payment Service Providers
Next, we load the list of legal payment service providers (PSPs) into a dataframe.

In [4]:
legal_psp = pd.read_csv('legal_PSP.csv')
# Sample rows from the legal PSP dataset; the data was preprocessed before being loaded here
legal_psp[:5]

Unnamed: 0,organization,product,address,telephone,category,decision_number,decision_date,permit_place,operational_date,description,status,barcode,email,website
0,"PT Pan Indonesia Bank, Tbk (PT Bank Panin)",,"Gedung Panin Centre Lt. 1-2, Jl. Jend. Sudirma...",,QRIS,24/303/DKSP/Srt/B,8/8/2022,Departemen Kebijakan Sistem Pembayaran,,QRIS Operators,Berizin (Belum Beroperasinal),,,
1,Lembaga Pelatihan Kerja Alfamart Learning Center,,"Alfa Tower Lantai 19, Alamat Sutera, Jl. Jalur...",,Job Training Institute,24/159/DPSP-GESK/Srt/B,7/27/2022,Jakarta,,Job Training Institute,Berizin (Telah Operasional),,,
2,PT Citra Abdi Valasindo,,"Wisma Bumiputera Lt. 2/M, Jl. Jendral Sudirman...",,Non Bank Money Changer Operators,24/37/KEP.PBI/Jkt/2022,7/8/2022,KPwBI Provinsi DKI Jakarta,7/11/2022,Non Bank Money Changer Operators,Berizin (Telah Operasional),1424.48-000/Jkt,,
3,PT Dewata Inter Valasindo,,"Jl. Bulungan I No. 64, Gedung Ayam Bulungan Lt...",,Non Bank Money Changer Operators,24/39/KEP.PBI/Jkt/2022,7/8/2022,KPwBI Provinsi DKI Jakarta,7/18/2022,Non Bank Money Changer Operators,Berizin (Telah Operasional),1423.47-000/Jkt,,
4,PT Luxury Valuta Perkasa,,"Jalan Panglima Polim Raya No. 105, Kel. Melawa...",,Non Bank Money Changer Operators,24/40/KEP.PBI/Jkt/2022,7/8/2022,KPwBI Provinsi DKI Jakarta,7/12/2022,Non Bank Money Changer Operators,Berizin (Telah Operasional),1422.46-001/Jkt,,


### Using NLP Fuzzy String Matching between playstore["developer"] and legal_psp['organization']

We will use the Fuzzywuzzy library for this part. Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences, in a simple-to-use package.
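
For intuition, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal pure-Python sketch of that idea. The function name `levenshtein` is our own illustration and is not part of Fuzzywuzzy, whose internal scoring may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


# The two spellings differ only by the dot after "PT", so one edit is enough
print(levenshtein('PT. Sprint Asia Technology', 'PT Sprint Asia Technology'))  # prints 1
```

A small distance relative to the string lengths corresponds to a high similarity score.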

We use Fuzzywuzzy to match records between the two data sources, and we start by testing how each of its scoring methods behaves on a few sample name pairs.

Fuzzy matching is needed because the publisher or developer name in playstore['developer'] may be written differently from the name in legal_psp['organization'].

There are several ways to compare two strings in Fuzzywuzzy; let's try them one by one.

`ratio` compares the similarity of the two entire strings, in order.

In [5]:
# fuzz.ratio(playstore['developer'], legal_psp['organization'])
fuzz.ratio('PT. Sprint Asia Technology', 'PT Sprint Asia Technology')

98

In [6]:
fuzz.ratio('PT. Nusa Satu Inti Artha (DOKU)', 'PT Nusa Satu Inti Artha')

85

In [7]:
fuzz.ratio('PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK', 'PT Bank Jabar dan Banten')

24

Although we know that 'PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK' and 'PT Bank Jabar dan Banten' are the same entity, `ratio` scores them at only 24 out of 100. This naive approach compares the raw strings character by character, so it is far too sensitive to differences in letter case, abbreviations, word order, and missing or extra words.
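
To see where a whole-string score of this kind comes from, here is a rough sketch built on the standard library's `difflib.SequenceMatcher`. To our knowledge Fuzzywuzzy falls back to `difflib` when the optional python-Levenshtein package is not installed, but treat this as an approximation rather than the library's exact code. The helper name `simple_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two raw strings on a 0-100 scale (hypothetical helper)."""
    # SequenceMatcher.ratio() is 2 * matched_characters / total_characters
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# 25 characters match out of 26 + 25, so 2 * 25 / 51 rounds to 98
print(simple_ratio('PT. Sprint Asia Technology', 'PT Sprint Asia Technology'))
```

Because nothing is normalized before the comparison, upper-case and lower-case letters count as different characters.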

Next we try `partial_ratio`, which compares partial string similarity, on the same data pairs.

In [8]:
# fuzz.partial_ratio(playstore['developer'], legal_psp['organization'])
fuzz.partial_ratio('PT. Sprint Asia Technology', 'PT Sprint Asia Technology')

96

In [9]:
fuzz.partial_ratio('PT. Nusa Satu Inti Artha (DOKU)', 'PT Nusa Satu Inti Artha')

96

In [10]:
fuzz.partial_ratio('PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK', 'PT Bank Jabar dan Banten')

21

For this data set, comparing partial strings improves the score for one pair (from 85 to 96) but slightly lowers it for the other two, so it is not a clear improvement.
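
The idea behind partial matching can be sketched as scoring the shorter string against every equally long window of the longer string and keeping the best result. This is a simplified illustration under that assumption. Fuzzywuzzy selects its candidate windows differently, so scores may not match exactly, and `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0.0
    # Slide a window of the shorter string's length across the longer string
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# 'DOKU' appears verbatim inside the developer name, so the best window is a perfect match
print(partial_ratio_sketch('DOKU', 'PT. Nusa Satu Inti Artha (DOKU)'))  # prints 100
```

This helps when one name is contained in the other, but not when the two names abbreviate or reorder words.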

The next method is `token_sort_ratio`, which ignores word order.

In [11]:
# fuzz.token_sort_ratio(playstore['developer'], legal_psp['organization'])
fuzz.token_sort_ratio('PT. Sprint Asia Technology', 'PT Sprint Asia Technology')

100

In [12]:
fuzz.token_sort_ratio('PT. Nusa Satu Inti Artha (DOKU)', 'PT Nusa Satu Inti Artha')

90

In [13]:
fuzz.token_sort_ratio('PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK', 'PT Bank Jabar dan Banten')

61

This method gives the best results so far. Let's take a look at the final method.

`token_set_ratio` ignores duplicated words. It is similar to `token_sort_ratio`, but a little more flexible.

In [14]:
# fuzz.token_set_ratio(playstore['developer'], legal_psp['organization'])
fuzz.token_set_ratio('PT. Sprint Asia Technology', 'PT Sprint Asia Technology')

100

In [15]:
fuzz.token_set_ratio('PT. Nusa Satu Inti Artha (DOKU)', 'PT Nusa Satu Inti Artha')

100

In [16]:
fuzz.token_set_ratio('PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK', 'PT Bank Jabar dan Banten')

74

`token_set_ratio` looks like the best fit for our dataset, so we apply it to the developer/organization matching.
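
The difference between the two token-based methods can be sketched in plain Python. The version below is a simplified approximation, not Fuzzywuzzy's actual implementation: it assumes tokens are lower-cased runs of ASCII letters and digits, and it uses `difflib` for the underlying comparison. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def _tokens(text: str) -> list:
    # Lowercase and treat every non-alphanumeric character as a separator
    return re.sub(r'[^0-9a-z]+', ' ', text.lower()).split()


def token_sort_sketch(a: str, b: str) -> int:
    """Compare the strings after sorting their tokens alphabetically."""
    return _ratio(' '.join(sorted(_tokens(a))), ' '.join(sorted(_tokens(b))))


def token_set_sketch(a: str, b: str) -> int:
    """Compare the shared tokens with shared-plus-remaining tokens; keep the best score."""
    set_a, set_b = set(_tokens(a)), set(_tokens(b))
    shared = ' '.join(sorted(set_a & set_b))
    with_a = (shared + ' ' + ' '.join(sorted(set_a - set_b))).strip()
    with_b = (shared + ' ' + ' '.join(sorted(set_b - set_a))).strip()
    return max(_ratio(shared, with_a), _ratio(shared, with_b), _ratio(with_a, with_b))


dev = 'PT. Nusa Satu Inti Artha (DOKU)'
org = 'PT Nusa Satu Inti Artha'
print(token_sort_sketch(dev, org), token_set_sketch(dev, org))
```

One consequence of the set-based construction is that the score reaches 100 whenever every token of one name also appears in the other, no matter how many extra tokens the longer name carries. That is helpful for company names with suffixes such as "(DOKU)", but it can also produce false matches between short, generic names.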

In [17]:
dev_test = ['PT. Sprint Asia Technology', 'PT. Nusa Satu Inti Artha (DOKU)', 'PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK']
org_test = ['PT Sprint Asia Technology', 'PT Nusa Satu Inti Artha', 'PT Bank Jabar dan Banten']

# For each developer name, print the best-matching organization and its score
for dev in dev_test:
    print(dev, process.extract(dev, org_test, scorer=fuzz.token_set_ratio, limit=1))

PT. Sprint Asia Technology [('PT Sprint Asia Technology', 100)]
PT. Nusa Satu Inti Artha (DOKU) [('PT Nusa Satu Inti Artha', 100)]
PT.BANK PEMBANGUNAN DAERAH JAWA BARAT & BANTEN,TBK [('PT Bank Jabar dan Banten', 74)]


### Checking if the developer company is listed on the legal_PSP list

In [18]:
# Copy the relevant columns into a new dataframe so the original scraped data stays untouched
result_df = playstore[['title', 'developer', 'description']].copy()

In [20]:
listed_developer = []
unlisted_developer = []

# Find the best-matching licensed organization for each developer;
# a score below 70 is treated as "not listed"
for dev in result_df['developer']:
    result = process.extractOne(dev, legal_psp['organization'], scorer=fuzz.token_set_ratio)
    ratio = int(result[1])
    if ratio < 70:
        unlisted_developer.append(dev)
    else:
        listed_developer.append(dev)
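
When reviewing borderline cases, it can help to keep the matched organization and its score next to each developer, rather than only the final listed/unlisted split. The sketch below shows one way to do that with a pluggable scorer. The names `classify_names` and `difflib_scorer` are hypothetical, and the `difflib`-based scorer is only a stand-in so the example runs without Fuzzywuzzy; in this notebook the scorer would be `fuzz.token_set_ratio`.

```python
from difflib import SequenceMatcher


def difflib_scorer(a: str, b: str) -> int:
    # Stand-in scorer for this sketch; the notebook uses fuzz.token_set_ratio
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def classify_names(names, choices, scorer=difflib_scorer, threshold=70):
    """Return one record per name with its best-scoring choice and a listed flag."""
    records = []
    for name in names:
        # Pick the (choice, score) pair with the highest score
        best_choice, best_score = max(
            ((choice, scorer(name, choice)) for choice in choices),
            key=lambda pair: pair[1],
        )
        records.append({
            'name': name,
            'best_match': best_choice,
            'score': best_score,
            'listed': best_score >= threshold,  # same cut-off as the loop above
        })
    return records


for record in classify_names(['PT. Sprint Asia Technology', 'BudgetBakers.com'],
                             ['PT Sprint Asia Technology', 'PT Nusa Satu Inti Artha']):
    print(record)
```

Keeping the score also makes it easy to revisit the threshold of 70 later without rerunning the matching. Note that this sketch raises a `ValueError` if `choices` is empty.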

In [21]:
# Sample of developers from the scraped dataset that are listed in legal_psp['organization']
listed_developer[:5]

['PT. Nusa Satu Inti Artha (DOKU)']

In [22]:
# Sample of developers from the scraped dataset that are NOT listed in legal_psp['organization']
unlisted_developer[:5]

['BudgetBakers.com',
 'DApps Platform, Inc.',
 'MetaMask Web3 Wallet',
 'Airtm, Inc.',
 'Coinbase Wallet']

In [23]:
# Create a new column 'dev_status' that is True where the developer is listed in legal_psp['organization']
result_df['dev_status'] = result_df['developer'].isin(listed_developer)

In [24]:
# Current state of result_df: dev_status == False means the app's developer is not
# a listed organization in the legal_psp dataset
result_df

Unnamed: 0,title,developer,description,dev_status
0,Wallet - Pelacak Anggaran,BudgetBakers.com,<b>Wallet membantu Anda merencanakan anggaran ...,False
1,Trust - Dompet Kripto,"DApps Platform, Inc.",Trust Wallet adalah dompet kripto resmi Binanc...,False
2,MetaMask - Blockchain Wallet,MetaMask Web3 Wallet,Whether you are an experienced user or brand n...,False
3,Airtm,"Airtm, Inc.","Mueve y usa tu dinero como y cuando quieras, e...",False
4,Coinbase Wallet: Simpan Kripto,Coinbase Wallet,Coinbase Wallet adalah kunci Anda menuju hal b...,False
5,BitKeep: Dompet Crypto DeFi,BitKeep Global Inc,"Didirikan di Singapura pada Mei 2018, BitKeep ...",False
6,WEMIX Wallet,WEMIX PTE. LTD.,"<font color=""#813ccc""><b>Integration with Vari...",False
7,DOKU,PT. Nusa Satu Inti Artha (DOKU),"DOKU, layanan dompet digital yang membantu sip...",True
8,Pera Algo Wallet,Pera Wallet,"Pera Algo Wallet is the simple, fast way to se...",False
9,Wallet Cards | Digital Wallet,Wallet Cards Alliance,"Across all wallet apps on Play Store, \r\nWall...",False


### Checking if the app is listed on the legal_PSP dataset

After checking whether the developer is listed in the legal_psp dataset, we need to investigate further whether the product/app itself is legally listed.

We need to determine which fuzzy string matching method is best suited to the app names/titles in our dataset.

Based on the previous findings, we narrow the choice down to two methods: `token_sort_ratio` and `token_set_ratio`.

In [25]:
listed_app = []
unlisted_app = []

# We settled on token_sort_ratio because many app names are similar to each other
for app in result_df['title']:
    result = process.extractOne(app, legal_psp['product'], scorer=fuzz.token_sort_ratio)
    ratio = int(result[1])
    if ratio < 70:
        unlisted_app.append(app)
    else:
        listed_app.append(app)

In [26]:
# Sample of apps from the scraped dataset that are listed in legal_psp['product']
listed_app[:5]

['DOKU']

In [27]:
# Sample of apps from the scraped dataset that are NOT listed in legal_psp['product']
unlisted_app[:5]

['Wallet - Pelacak Anggaran',
 'Trust - Dompet Kripto',
 'MetaMask - Blockchain Wallet',
 'Airtm',
 'Coinbase Wallet: Simpan Kripto']

In [28]:
# Create a new column 'app_status' that is True where the app is listed in legal_psp['product']
result_df['app_status'] = result_df['title'].isin(listed_app)

In [29]:
result_df

Unnamed: 0,title,developer,description,dev_status,app_status
0,Wallet - Pelacak Anggaran,BudgetBakers.com,<b>Wallet membantu Anda merencanakan anggaran ...,False,False
1,Trust - Dompet Kripto,"DApps Platform, Inc.",Trust Wallet adalah dompet kripto resmi Binanc...,False,False
2,MetaMask - Blockchain Wallet,MetaMask Web3 Wallet,Whether you are an experienced user or brand n...,False,False
3,Airtm,"Airtm, Inc.","Mueve y usa tu dinero como y cuando quieras, e...",False,False
4,Coinbase Wallet: Simpan Kripto,Coinbase Wallet,Coinbase Wallet adalah kunci Anda menuju hal b...,False,False
5,BitKeep: Dompet Crypto DeFi,BitKeep Global Inc,"Didirikan di Singapura pada Mei 2018, BitKeep ...",False,False
6,WEMIX Wallet,WEMIX PTE. LTD.,"<font color=""#813ccc""><b>Integration with Vari...",False,False
7,DOKU,PT. Nusa Satu Inti Artha (DOKU),"DOKU, layanan dompet digital yang membantu sip...",True,True
8,Pera Algo Wallet,Pera Wallet,"Pera Algo Wallet is the simple, fast way to se...",False,False
9,Wallet Cards | Digital Wallet,Wallet Cards Alliance,"Across all wallet apps on Play Store, \r\nWall...",False,False


### Final Result

In the final step, we list the apps that may warrant further investigation.

Based on our method, we can sort the apps into three categories:
   * Immediate Investigation = neither the developer nor the app is found in the legal_psp dataset
   * Further Checking = only one of the two (developer OR app) is found in the legal_psp dataset
   * Legally Listed = both the developer and the app are found in the legal_psp dataset

*Note that some of these applications might be listed with other regulators (e.g. OJK, Bappebti).*
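
The three categories above follow directly from the two boolean checks, which the small sketch below makes explicit. The function name `triage_category` is hypothetical and not part of the original notebook.

```python
def triage_category(dev_listed: bool, app_listed: bool) -> str:
    """Map the two boolean checks to one of the three categories."""
    if dev_listed and app_listed:
        return 'Legally Listed'
    if dev_listed or app_listed:
        return 'Further Checking'
    return 'Immediate Investigation'


for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, '->', triage_category(*flags))
```

With pandas, something like `result_df.apply(lambda row: triage_category(row['dev_status'], row['app_status']), axis=1)` could store the category in a single column.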

#### Immediate Investigation:

In [34]:
print('Apps that need immediate investigation:')
print('\n')
# Neither the developer nor the app was matched in the legal_psp dataset
immediate_investigation = result_df.loc[~result_df['dev_status'] & ~result_df['app_status']]
print(immediate_investigation)

Apps that need immediate investigation:


                                          title                     developer  \
0                     Wallet - Pelacak Anggaran              BudgetBakers.com   
1                         Trust - Dompet Kripto          DApps Platform, Inc.   
2                  MetaMask - Blockchain Wallet          MetaMask Web3 Wallet   
3                                         Airtm                   Airtm, Inc.   
4                Coinbase Wallet: Simpan Kripto               Coinbase Wallet   
5                   BitKeep: Dompet Crypto DeFi            BitKeep Global Inc   
6                                  WEMIX Wallet               WEMIX PTE. LTD.   
8                              Pera Algo Wallet                   Pera Wallet   
9                 Wallet Cards | Digital Wallet         Wallet Cards Alliance   
10               Blockchain.com Wallet: Buy BTC    Blockchain Luxembourg S.A.   
11                Exodus: Crypto Bitcoin Wallet        

#### Further Checking:

In [31]:
print('Apps that need further checking:')
print('\n')
# Exactly one of the two checks passed (developer XOR app)
further_checking = result_df.loc[result_df['dev_status'] != result_df['app_status']]
print(further_checking)

Apps that need further checking:


Empty DataFrame
Columns: [title, developer, description, dev_status, app_status]
Index: []


#### Legally Listed:

In [32]:
print('Apps that are legally listed:')
print('\n')
# Both the developer and the app were matched in the legal_psp dataset
legally_listed = result_df.loc[result_df['dev_status'] & result_df['app_status']]
print(legally_listed)

Apps that are legally listed:


  title                        developer  \
7  DOKU  PT. Nusa Satu Inti Artha (DOKU)   

                                         description  dev_status  app_status  
7  DOKU, layanan dompet digital yang membantu sip...        True        True  


### Further Development

Several aspects can be improved in further development:
   * The data scraper, which needs to retrieve more app data from the Play Store than the current limit of 30 apps
   * The list of keywords used to find payment-related apps
   * Cross-checking against the lists of legal apps and organizations kept by other regulators, such as OJK and Bappebti
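
As a starting point for the first two items, one option is to run the search once per keyword and merge the results by `appId`. The sketch below takes the search function as a parameter so that it runs offline. `collect_apps` and `fake_search` are hypothetical names, and the extra keyword 'dompet' (Indonesian for "wallet") is only an assumed example.

```python
from typing import Callable, Dict, Iterable, List


def collect_apps(keywords: Iterable[str],
                 search_fn: Callable[[str], List[dict]]) -> List[dict]:
    """Run search_fn for every keyword and keep one record per appId."""
    unique: Dict[str, dict] = {}
    for keyword in keywords:
        for app in search_fn(keyword):
            # setdefault keeps the first record seen for each appId
            unique.setdefault(app['appId'], app)
    return list(unique.values())


def fake_search(keyword: str) -> List[dict]:
    # Stand-in for a real Play Store query so the sketch runs offline
    canned = {
        'wallet': [{'appId': 'com.dokuwallet.android'}, {'appId': 'io.metamask'}],
        'dompet': [{'appId': 'com.dokuwallet.android'}],
    }
    return canned.get(keyword, [])


apps = collect_apps(['wallet', 'dompet'], fake_search)
print([app['appId'] for app in apps])  # prints ['com.dokuwallet.android', 'io.metamask']
```

In this notebook, `search_fn` could be a small wrapper such as `lambda kw: search(kw, lang='id', country='id', n_hits=30)`, and the merged list can be passed to `pd.DataFrame` as before.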