# Experimental simple performance testing notebook for Turi Create
- testing and comparing simple dataframe / SQL operations for common data (pre-)processing tasks
- various available single-machine Python solutions are to be tested: Pandas, PySpark, Turi Create and Dask
- execution time, CPU load and peak memory use should be tracked
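
The three metrics listed above could be collected with a small helper along these lines. This is only a sketch: `measure` and the `results` dict are hypothetical names that do not appear in the notebook, and peak memory relies on the Unix-only `resource` module (the value is `None` elsewhere). `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS, and it is a high-water mark for the whole process, not for the individual block. Work done in separate worker processes is not covered by these numbers.

```python
import time
import timeit
from contextlib import contextmanager

try:
    import resource  # Unix-only; used for peak resident memory
except ImportError:
    resource = None


@contextmanager
def measure(label, results):
    """Record wall time, CPU time and peak memory around the enclosed block."""
    wall_start = timeit.default_timer()
    cpu_start = time.process_time()  # user + system CPU time of this process
    try:
        yield
    finally:
        peak = None
        if resource is not None:
            # process-wide high-water mark, not specific to this block
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        results[label] = {
            'wall_s': timeit.default_timer() - wall_start,
            'cpu_s': time.process_time() - cpu_start,
            'peak_rss': peak,
        }


results = {}
with measure('toy sum', results):
    total = sum(range(1_000_000))  # stand-in for a dataframe operation

print(results)
```

Comparing `cpu_s` with `wall_s` gives a rough indication of how many cores an operation kept busy.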

## Kiva dataset 
- [Kiva](https://www.kaggle.com/gaborfodor/additional-kiva-snapshot): crowdfunding data with lenders and loans, with additional geographic data
- download the related CSV files and move them to a folder where the kernel can read them

## imports, setup

In [None]:
import turicreate
from turicreate import SFrame
import timeit

## read files to dataframes: loans and lenders

In [None]:
lenders_sf = SFrame(data='../../kiva/lenders.csv')  # 130 MB file, 797.279 lines
loans_sf = SFrame(data='../../kiva/loans.csv')      # 2.1 GB file, 1.419.607 lines

In [None]:
lenders_sf.show()

In [None]:
loans_sf.show()

In [None]:
lenders_sf.num_rows()

## read, transform and count loan_lenders 
comma-separated lender strings to rows: split each string into a list, then stack (explode) the list so that every lender gets its own row
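
The split-and-stack step can be illustrated on a few rows in plain Python, which is also handy for checking the output of any of the tested libraries on a small sample. This is a sketch, not the notebook's implementation: `explode_lenders` and the toy rows are hypothetical; only the column names `loan_id` and `lenders` come from the notebook.

```python
def explode_lenders(rows):
    """Split each 'lenders' string and return distinct (loan_id, lender) pairs."""
    pairs = set()
    for row in rows:
        # remove whitespace, then split the comma-separated string into a list
        for lender in row['lenders'].replace(' ', '').split(','):
            # one output row per list element; the set removes duplicates
            pairs.add((row['loan_id'], lender))
    return sorted(pairs)


# hypothetical toy rows in the shape of loans_lenders.csv
toy_rows = [
    {'loan_id': 1, 'lenders': 'anna, bob, anna'},
    {'loan_id': 2, 'lenders': 'bob'},
]
pairs = explode_lenders(toy_rows)
print(pairs)  # [(1, 'anna'), (1, 'bob'), (2, 'bob')]
```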

In [None]:
start = timeit.default_timer()

llsf = SFrame.read_csv('../../kiva/loans_lenders.csv', header=True) #, nrows=20) 
# 339 MB file, 1.387.433 lines -> 27.459.086 distinct lines normalized, 8.4 - 6.4GB mem
# 200.000 heading lines --> 3.994.263 distinct output lines
print('read lines: ', llsf.num_rows() )

# transform the comma-separated string to a list; whitespace has to be removed too
llsf['lenders_list'] = llsf.apply( lambda row: row['lenders'].replace(' ', '').split(',') )
llsf = llsf.remove_column('lenders')

# stacking list elements to rows: 
llsf = llsf.stack('lenders_list', new_column_name='lender').select_columns(['loan_id', 'lender']).unique() 

loans_lenders_sf = llsf 

print('elapsed time: ', timeit.default_timer() - start)
print('loans_lenders_sf line count: ', loans_lenders_sf.num_rows() )

#loans_lenders_sf.export_csv('../../kiva/turi-loans_lenders_sf-20.csv', header=True)

In [None]:
loans_lenders_sf.head(5)

In [None]:
loans_lenders_sf.show() 

## join, filter and sort loan and lender data
get distinct joined rows with renamed columns, then write them to an output file (so that the result is fully materialized)
- filtering on lenders.country_code: 
  - 'US': 25% of lenders
  - 'CA': 3% of lenders --> 3.5 GB joined file, 1.971.548 rows
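
The intended semantics of the filter followed by two inner joins can be sketched on toy data in plain Python. All values below are hypothetical; `permanent_name`, `country_code`, `lender` and `loan_id` are taken from the notebook, while `country_name` is assumed to be a column of the loans table. The sketch indexes each side by its join key, which assumes the keys are unique.

```python
# hypothetical toy data; only the column names come from the notebook
pairs = [(1, 'anna'), (1, 'bob'), (2, 'bob'), (3, 'carl')]
lenders = [
    {'permanent_name': 'anna', 'country_code': 'CA'},
    {'permanent_name': 'bob', 'country_code': 'US'},
    {'permanent_name': 'carl', 'country_code': 'CA'},
]
loans = [
    {'loan_id': 1, 'country_name': 'Kenya'},
    {'loan_id': 2, 'country_name': 'Peru'},
]

# filter lenders on country_code, then index both tables by their join key
ca_lenders = {l['permanent_name']: l for l in lenders if l['country_code'] == 'CA'}
loans_by_id = {l['loan_id']: l for l in loans}

# inner join: keep a pair only if both the lender and the loan are found
joined = [
    {'loan_id': loan_id, 'lender': lender,
     **ca_lenders[lender], **loans_by_id[loan_id]}
    for loan_id, lender in pairs
    if lender in ca_lenders and loan_id in loans_by_id
]
print(joined)
```

Only the pair `(1, 'anna')` survives: `bob` is not a CA lender, and loan 3 has no matching row in the loans table.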

In [None]:
start = timeit.default_timer()

# filter unique lenders: CA: 67.970
lenders_sf_filtered = lenders_sf.filter_by(['CA'], 'country_code').unique()

# join: 
joined_sf = loans_lenders_sf.join(lenders_sf_filtered, on={'lender':'permanent_name'}, how='inner') \
    .join(loans_sf, on={'loan_id':'loan_id'}, how='inner')

joined_sf.export_csv('kiva/turi-result-joined.csv', header=True)

print('elapsed time: ', timeit.default_timer() - start)
print('line count: ', joined_sf.num_rows() )

In [None]:
joined_sf.head(5)

In [None]:
lenders_sf_filtered.show()

In [None]:
loans_sf.head(4)

## grouping and sorting
* group by lender on loans_lenders, count distinct loan_ids


In [None]:
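import turicreate.aggregate as agg
# group by the 'lender' column and count the distinct loans per lender
joined_agg_sf = joined_sf.groupby(key_column_names='lender', operations={'loan_count': agg.COUNT_DISTINCT('loan_id')})
#                               .sort('loan_count', ascending = False)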

* group by lenders.country_code on the joined table, get count(distinct loans.country): how many different countries the lenders donated to
* group by loans.country_name on the joined table, count(distinct loan_id): how many loans from CA lenders went to each country
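
The count-distinct groupings described above all follow the same pattern, shown here in plain Python as a reference for small samples. This is a sketch under assumptions: `count_distinct` is a hypothetical helper, the rows are made up, and `country_name` is assumed to be the loan country column of the joined table.

```python
from collections import defaultdict


def count_distinct(rows, key, value):
    """For each distinct rows[key], count the distinct rows[value] entries."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key]].add(row[value])
    return {k: len(v) for k, v in groups.items()}


# hypothetical rows in the shape of the joined table
toy_joined = [
    {'lender': 'anna', 'loan_id': 1, 'country_name': 'Kenya'},
    {'lender': 'anna', 'loan_id': 2, 'country_name': 'Peru'},
    {'lender': 'carl', 'loan_id': 2, 'country_name': 'Peru'},
    {'lender': 'carl', 'loan_id': 2, 'country_name': 'Peru'},  # duplicate row
]

loans_per_lender = count_distinct(toy_joined, 'lender', 'loan_id')
loans_per_country = count_distinct(toy_joined, 'country_name', 'loan_id')
print(loans_per_lender)   # {'anna': 2, 'carl': 1}
print(loans_per_country)  # {'Kenya': 1, 'Peru': 1}
```

Note that loan 2 is counted once for Peru even though two lenders funded it, which is the difference between a distinct count and a plain row count.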