# Experimental simple performance testing notebook for Turi Create
- testing and comparing simple dataframe / SQL operations for common data (pre-)processing tasks
- various available single-machine Python solutions are to be tested: Pandas, PySpark, Turi Create and Dask.
- execution times, CPU load and maximal memory use should be tracked
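The tracking goal above could be approached with a small helper like the one below. It is only a sketch under stated assumptions: the helper name `track` and the `results` dictionary are hypothetical, and `tracemalloc` sees only memory allocated through the Python allocator, so memory used inside a native engine (for example the Turi Create backend or a PySpark JVM) would need a process-level tool instead. CPU load is not covered here.

```python
import timeit
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track(label, results):
    """Record wall-clock time and peak Python-heap memory of the enclosed block.

    `track` and `results` are hypothetical names used for illustration only.
    """
    tracemalloc.start()
    start = timeit.default_timer()
    try:
        yield
    finally:
        elapsed = timeit.default_timer() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {'seconds': elapsed, 'peak_bytes': peak}


results = {}
with track('build_list', results):
    # stand-in workload; in the notebook this would be a read, join or group-by step
    data = [str(i) for i in range(100_000)]

print(results)
```

Collecting the measurements in one dictionary makes it easier to compare the same step across libraries later on.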

## Kiva dataset 
- [Kiva](https://www.kaggle.com/gaborfodor/additional-kiva-snapshot): crowdfunding data with lenders and loans, with additional geographic data
- download the related CSV files and move them to a folder where the kernel can read them

## imports, setup

In [1]:
import turicreate
from turicreate import SFrame
import turicreate.aggregate as agg

import timeit



## read files to dataframes: loans and lenders

In [2]:
full_start = timeit.default_timer()

lenders_sf = SFrame(data='../../kiva/lenders.csv')  # 130 MB file, 797.279 lines
loans_sf = SFrame(data='../../kiva/loans.csv')      # 2.1 GB file, 1.419.607 lines

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,int,str,str,float,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,str,float,float,str,str,str,str,str,str,str,str,float,str,float,str,str,str,str,float,int,int,int,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


## read, transform and count loan_lenders 
string enumeration to rows: split tuple strings to array, then explode the array to rows
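The split-and-explode step described above can be illustrated on toy data in plain Python. This is a sketch for checking the intended semantics on a small sample, not the Turi Create implementation: the function name `explode_lenders` and the sample rows are made up, and unlike `SFrame.unique()` this version keeps first-seen order.

```python
def explode_lenders(rows):
    """Turn (loan_id, 'a, b, c') pairs into unique (loan_id, lender) pairs."""
    seen = set()
    out = []
    for loan_id, lenders in rows:
        # remove whitespace, then split the comma-separated string into a list
        for lender in lenders.replace(' ', '').split(','):
            pair = (loan_id, lender)
            if pair not in seen:  # drop duplicate (loan_id, lender) rows
                seen.add(pair)
                out.append(pair)
    return out


# made-up sample rows, not taken from the Kiva files
sample = [(1, 'alice, bob'), (2, 'bob, carol, bob')]
exploded = explode_lenders(sample)
print(exploded)
```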

In [3]:
start = timeit.default_timer()

llsf = SFrame.read_csv('../../kiva/loans_lenders.csv', header=True) #, nrows=20) 

# transform the comma-separated string to a list; whitespace has to be removed too
llsf['lenders_list'] = llsf.apply( lambda row: row['lenders'].replace(' ', '').split(',') )
llsf = llsf.remove_column('lenders')

# stacking list elements to rows: 
llsf = llsf.stack('lenders_list', new_column_name='lender').select_columns(['loan_id', 'lender']).unique() 

loans_lenders_sf = llsf 

print('elapsed time: ', timeit.default_timer() - start)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


elapsed time:  63.05687742300506


## join, filter and sort loan and lender data
get distinct joined lines with renamed columns, then write to an output file (for fully materialized results)
- filtering on lenders.country_code: 
  - 'US': 25% of lenders
  - 'CA': 3% of lenders --> 3.5 GB joined file, 1.971.548 rows
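The filter-then-join logic outlined above can be cross-checked on a small sample with a plain-Python reference like the following. It is a sketch under assumptions: `filter_and_join` is a hypothetical helper, the `amount` column and all sample rows are invented for illustration, and it assumes `permanent_name` and `loan_id` are unique in their tables (duplicates would be collapsed by the dictionaries).

```python
def filter_and_join(loans_lenders, lenders, loans, country_code):
    """Inner-join (loan_id, lender) pairs with filtered lenders and with loans."""
    # keep only lenders from the requested country, indexed by permanent_name
    lender_by_name = {row['permanent_name']: row for row in lenders
                      if row['country_code'] == country_code}
    loan_by_id = {row['loan_id']: row for row in loans}

    joined = []
    for loan_id, lender in loans_lenders:
        # inner join: both keys must have a match
        if lender in lender_by_name and loan_id in loan_by_id:
            merged = {'loan_id': loan_id, 'lender': lender}
            merged.update({k: v for k, v in lender_by_name[lender].items()
                           if k != 'permanent_name'})
            merged.update({k: v for k, v in loan_by_id[loan_id].items()
                           if k != 'loan_id'})
            joined.append(merged)
    return joined


# invented sample data; 'amount' is a hypothetical column
lenders = [{'permanent_name': 'alice', 'country_code': 'CA'},
           {'permanent_name': 'bob', 'country_code': 'US'}]
loans = [{'loan_id': 1, 'amount': 500}, {'loan_id': 2, 'amount': 250}]
pairs = [(1, 'alice'), (1, 'bob'), (2, 'alice'), (3, 'alice')]

joined = filter_and_join(pairs, lenders, loans, 'CA')
print(joined)
```

Filtering the lenders before joining keeps the lookup table small, which mirrors the order of operations used in the cell that follows.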

In [4]:
start = timeit.default_timer()

# filter unique lenders: CA: 67.970
lenders_sf_filtered = lenders_sf.filter_by(['CA'], 'country_code').unique()

# join: 
joined_sf = loans_lenders_sf.join(lenders_sf_filtered, on={'lender':'permanent_name'}, how='inner') \
    .join(loans_sf, on={'loan_id':'loan_id'}, how='inner')

print('elapsed time: ', timeit.default_timer() - start)

elapsed time:  26.369552541000303


## grouping and sorting
* group by on the exploded loans_lenders table (6 GB): count distinct loan_id by lender
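The count-distinct aggregation named above can be expressed in plain Python as a reference for small samples. This is a sketch, not the Turi Create implementation: `count_distinct_loans` is a hypothetical name, the sample pairs are made up, and the descending sort is an assumption about the intended ordering.

```python
from collections import defaultdict


def count_distinct_loans(pairs):
    """Count distinct loan_id values per lender, largest count first."""
    loans_by_lender = defaultdict(set)
    for loan_id, lender in pairs:
        loans_by_lender[lender].add(loan_id)  # the set drops repeated loan_ids
    counts = {lender: len(ids) for lender, ids in loans_by_lender.items()}
    # sort by count descending, then by lender name for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# made-up sample pairs of (loan_id, lender)
pairs = [(1, 'alice'), (2, 'alice'), (2, 'alice'), (1, 'bob')]
counts = count_distinct_loans(pairs)
print(counts)
```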


In [5]:
start = timeit.default_timer()

lender_loan_count_sf = joined_sf.groupby(key_column_names='lender', operations={'loan_id_ct': agg.COUNT_DISTINCT('loan_id')})
# optional sort on the aggregated column (currently disabled):
#                               .sort('loan_id_ct', ascending=False)

lender_loan_count_sf.export_csv('../../kiva/turi-result-lender_loan_count_sf.csv', header=True)

print('elapsed time: ', timeit.default_timer() - start)

print('full elapsed time: ', timeit.default_timer() - full_start)

elapsed time:  2.106998919000034
full elapsed time:  114.90185756399296
