# Project Solution - Primary

You have to work on the [Kiva](https://drive.google.com/file/d/1-tJtnIbo1Rt-F1XfoWGVkmBXiI-ciuRx/view) dataset. Some information on the dataset is available on the [Kaggle](https://www.kaggle.com/gaborfodor/additional-kiva-snapshot) web page.

## Basic tasks

**1.** Normalize the `loan_lenders` table. In the normalized table, each row must have one `loan_id` and one `lender`.

First, the CSV file is read and a data frame with the unnormalized data is created:

In [1]:
import pandas as pd
import numpy as np
import time as tm
FINAL = True # True: run on the complete data; False: run on a small sample
start_time = tm.time()
loans_lenders_un = pd.read_csv('../../Datasets/kiva/loans_lenders.csv')
if not FINAL:
    loans_lenders_un = loans_lenders_un.head(10000) 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_lenders_un

--- 4.1969993114471436 seconds ---


Unnamed: 0,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."
...,...,...
1387427,678999,"michael43411218, carol5987, gooddogg1, chris41..."
1387428,1207353,"rjhoward1986, jeffrey6870, trolltech4460, elys..."
1387429,1206220,"vicky7746, gooddogg1, fairspirit, craig9729960..."
1387430,1206425,"rich6705, sergiiy9766, angela7509, barbara5610..."


The data frame with the normalized data is built from an intermediate object: a list of dictionaries, one per (loan_id, lender) pair:

In [2]:
import time as tm
start_time = tm.time()
lis = []
for index, row in loans_lenders_un.iterrows(): 
    ls = row['lenders'].split(',')
    for l in ls:
        lis.append({ 'loan_id' : row['loan_id'], 'lender': l.strip() })
loans_lender = pd.DataFrame(lis) 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_lender 

--- 507.3094873428345 seconds ---


Unnamed: 0,loan_id,lender
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499
...,...,...
28293926,1206425,trogdorfamily7622
28293927,1206425,danny6470
28293928,1206425,don6118
28293929,1206486,alan5175
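
The `iterrows` loop above needed about 507 seconds on the complete data. As a possible alternative, the same normalization can be sketched with `str.split` and `DataFrame.explode`. The toy frame `raw` and the helper name `normalize_lenders` are assumptions made for illustration; they are not part of the original solution.

```python
import pandas as pd

# Toy stand-in for loans_lenders.csv (illustrative values only).
raw = pd.DataFrame({
    "loan_id": [483693, 483738],
    "lenders": ["muc888, sam4326", "muc888, nora3555, williammanashi"],
})

def normalize_lenders(df):
    """Return one row per (loan_id, lender) pair."""
    out = df.assign(lender=df["lenders"].str.split(","))  # one list per row
    out = out.explode("lender")                            # one row per list item
    out["lender"] = out["lender"].str.strip()              # drop padding spaces
    return out[["loan_id", "lender"]].reset_index(drop=True)

normalized = normalize_lenders(raw)
print(normalized)
```

On the real data the call would presumably be `normalize_lenders(loans_lenders_un)`; the timing has not been measured here.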


**2.** For each loan, add a column `duration` corresponding to the number of days between the disburse time and the planned expiration time. If either of those two dates is missing, the duration must also be missing.

First, the `loans` data frame is loaded and its structure is inspected:

In [3]:
import pandas as pd
import numpy as np
import time as tm
start_time = tm.time()
loans = pd.read_csv('../../Datasets/kiva/loans.csv')
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans.columns

--- 34.18983602523804 seconds ---


Index(['loan_id', 'loan_name', 'original_language', 'description',
       'description_translated', 'funded_amount', 'loan_amount', 'status',
       'activity_name', 'sector_name', 'loan_use', 'country_code',
       'country_name', 'town_name', 'currency_policy',
       'currency_exchange_coverage_rate', 'currency', 'partner_id',
       'posted_time', 'planned_expiration_time', 'disburse_time',
       'raised_time', 'lender_term', 'num_lenders_total',
       'num_journal_entries', 'num_bulk_entries', 'tags', 'borrower_genders',
       'borrower_pictured', 'repayment_interval', 'distribution_model'],
      dtype='object')

After that, the values of the data frame are displayed:

In [4]:
if not FINAL:
    loans = loans.head(20000)
loans

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,2014-01-15 04:48:22.000 +0000,7.0,3,2,1,,female,true,irregular,field_partner
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,2014-02-25 06:42:06.000 +0000,8.0,11,2,1,,female,true,monthly,field_partner
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,2014-01-24 23:06:18.000 +0000,14.0,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,true,monthly,field_partner
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,2014-01-22 05:29:28.000 +0000,14.0,21,2,1,user_favorite,female,true,monthly,field_partner
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,2014-01-14 17:29:27.000 +0000,7.0,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,true,bullet,field_partner
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1419602,988180,,,,,400.0,400.0,funded,Tailoring,Services,...,2015-12-28 15:44:18.000 +0000,14.0,16,4,2,"#Parent, #Repeat Borrower, #Woman Owned Biz",,,monthly,field_partner
1419603,988213,Perlita,English,"Perlita is 52 years old, married and has three...","Perlita is 52 years old, married and has three...",300.0,300.0,funded,Pigs,Agriculture,...,2015-12-22 10:37:06.000 +0000,14.0,12,1,1,"#Animals, #Elderly, #Repeat Borrower, #Woman O...",female,true,irregular,field_partner
1419604,989109,Okyeso Nyame Group,English,Okyeso Nyame group will begin its third cycle ...,Okyeso Nyame group will begin its third cycle ...,2425.0,2425.0,funded,Bakery,Food,...,2015-12-26 20:24:47.000 +0000,8.0,76,2,1,"user_favorite, #Parent, #Vegan, #Woman Owned B...","female, female, female, male, male, female","true, true, true, true, true, true",irregular,field_partner
1419605,989143,Exequila,English,"Exequila is from San Miguel, Bohol. She is in...","Exequila is from San Miguel, Bohol. She is in...",100.0,100.0,funded,Farming,Agriculture,...,2015-12-06 21:03:57.000 +0000,12.0,3,1,1,,female,true,irregular,field_partner


The columns of interest are now `planned_expiration_time` and `disburse_time`:

In [5]:
loans_sel = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time'] ]
loans_sel

Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000
...,...,...,...,...
1419602,988180,,2016-01-02 01:00:03.000 +0000,2015-11-23 08:00:00.000 +0000
1419603,988213,Perlita,2016-01-02 16:40:07.000 +0000,2015-11-24 08:00:00.000 +0000
1419604,989109,Okyeso Nyame Group,2016-01-03 22:20:04.000 +0000,2015-11-13 08:00:00.000 +0000
1419605,989143,Exequila,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000


In [6]:
from datetime import datetime as dt
import time as tm
start_time = tm.time()
loans['duration'] = None
for index, row in loans.iterrows(): 
    s2 = row['planned_expiration_time']
    s1 = row['disburse_time']
    try:
        d2 = dt.strptime(s2, "%Y-%m-%d %H:%M:%S.%f %z")
    except ValueError:
        continue
    except TypeError:
        continue
    try:
        d1 = dt.strptime(s1, "%Y-%m-%d %H:%M:%S.%f %z")
    except ValueError:
        continue
    except TypeError:
        continue
    loans.loc[index,'duration'] = (d2 - d1).days 
loans_sel = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time', 'duration'] ]
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_sel

--- 24575.554668426514 seconds ---


Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time,duration
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,53
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,96
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,37
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000,34
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000,57
...,...,...,...,...,...
1419602,988180,,2016-01-02 01:00:03.000 +0000,2015-11-23 08:00:00.000 +0000,39
1419603,988213,Perlita,2016-01-02 16:40:07.000 +0000,2015-11-24 08:00:00.000 +0000,39
1419604,989109,Okyeso Nyame Group,2016-01-03 22:20:04.000 +0000,2015-11-13 08:00:00.000 +0000,51
1419605,989143,Exequila,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000,63
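
The row-by-row loop above ran for about 24575 seconds. A vectorized variant could look like the following sketch, which assumes that every timestamp uses the format shown in the table. The names `sample` and `add_duration` are hypothetical. Unparseable or missing dates become `NaT`, so the duration is missing as the task requires.

```python
import pandas as pd

# Toy rows in the same timestamp format as loans.csv (the third row lacks a date).
sample = pd.DataFrame({
    "loan_id": [657307, 657259, 1],
    "planned_expiration_time": ["2014-02-14 03:30:06.000 +0000",
                                "2014-03-26 22:25:07.000 +0000",
                                None],
    "disburse_time": ["2013-12-22 08:00:00.000 +0000",
                      "2013-12-20 08:00:00.000 +0000",
                      "2014-01-09 08:00:00.000 +0000"],
})

def add_duration(df):
    """Add a 'duration' column holding whole days between the two timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f %z"
    planned = pd.to_datetime(df["planned_expiration_time"], format=fmt,
                             errors="coerce", utc=True)
    disbursed = pd.to_datetime(df["disburse_time"], format=fmt,
                               errors="coerce", utc=True)
    out = df.copy()
    out["duration"] = (planned - disbursed).dt.days  # NaN where a date is missing
    return out

with_duration = add_duration(sample)
print(with_duration[["loan_id", "duration"]])
```

The first two rows reproduce the durations 53 and 96 from the table above.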


**3.** Find the lenders that have funded at least twice.

- These are the lenders that appear more than once in the normalized `loans_lender` data frame.


In [7]:
import time as tm
start_time = tm.time()
loans_lender_2_more = loans_lender[loans_lender.duplicated(['lender'])]
lenders_funded_2_more = list(set(loans_lender_2_more['lender']))
print("--- %s seconds ---" % (tm.time()-start_time)) 
lenders_funded_2_more

--- 28.642706394195557 seconds ---


['aly7284',
 'nalini6364',
 'pieterwillem6242',
 'maya6627',
 'thomas3765',
 'gregorio9483',
 'brian4528',
 'charles5981',
 'johnandbonnie2312',
 'mauricio3668',
 'jayshrimepani',
 'tricia7795',
 'susie5136',
 'rue6928',
 'christy74573235',
 'gail2460',
 'alexandros5108',
 'chapulin',
 'jana6757',
 'erik8582',
 'nuno5385',
 'vaughn9760',
 'jeannette8546',
 'carolyn1002',
 'dimitris8183',
 'naoise2573',
 'deborah6858',
 'karsten7782',
 'samuele5855',
 'maureen8542',
 'basava4552',
 'jannette3875',
 'kathleen3209',
 'cita4809',
 'carol6049',
 'tohko8881',
 'jensolevigen4106',
 'chris32866274',
 'dean9185',
 'anna2526',
 'joshua5745',
 'ryan2941',
 'garvan5141',
 'tracey8145',
 'cascade5565',
 'margaret2040',
 'venita3714',
 'vijay8327',
 'marilyn6185',
 'jean5413',
 'kashay3779',
 'andrea9736',
 'henrietta8760',
 'jill5863',
 'lara2195',
 'jason4251',
 'karen56485081',
 'nancy8601',
 'mariette9011',
 'george3958',
 'heather1978',
 'sandy1452',
 'edith4108',
 'virginia2454',
 'george1497'
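
A slightly different way to express the same idea is to count the distinct loans per lender and keep those with a count of at least two. The sketch below assumes a frame shaped like `loans_lender`; the names `pairs` and `lenders_with_min_loans` are illustrative. Counting distinct `loan_id` values also guards against a lender that might be listed twice on the same loan, which the `duplicated` approach would count as two loans.

```python
import pandas as pd

# Toy (loan_id, lender) pairs in the shape of the normalized table.
pairs = pd.DataFrame({
    "loan_id": [1, 1, 2, 3],
    "lender": ["muc888", "sam4326", "muc888", "nora3555"],
})

def lenders_with_min_loans(df, min_loans=2):
    """Return lenders that funded at least `min_loans` distinct loans."""
    counts = df.groupby("lender")["loan_id"].nunique()
    return sorted(counts[counts >= min_loans].index)

repeat_lenders = lenders_with_min_loans(pairs)
print(repeat_lenders)
```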

**4.** For each country, compute how many loans have involved that country as borrowers.

- This solution is based on pandas grouping:

In [8]:
import time as tm
start_time = tm.time()
loans_count = loans.groupby(['country_code','country_name'], as_index=False)[['loan_id']].count()
loans_count.columns = ['country_code','country_name','loan_id_count']
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_count

--- 0.5570371150970459 seconds ---


Unnamed: 0,country_code,country_name,loan_id_count
0,AF,Afghanistan,2337
1,AL,Albania,3075
2,AM,Armenia,13952
3,AZ,Azerbaijan,10172
4,BA,Bosnia and Herzegovina,608
...,...,...,...
90,XK,Kosovo,2178
91,YE,Yemen,4206
92,ZA,South Africa,633
93,ZM,Zambia,1277


**5.** For each country, compute the overall amount of money borrowed.


- This solution is based on pandas grouping:

In [9]:
import time as tm
start_time = tm.time()
loans_sum = loans.groupby(['country_code','country_name'], as_index = False)[['loan_amount']].sum()
loans_sum.columns = ['country_code','country_name','loan_amount_sum']
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_sum

--- 0.35497236251831055 seconds ---


Unnamed: 0,country_code,country_name,loan_amount_sum
0,AF,Afghanistan,1967950.0
1,AL,Albania,4307350.0
2,AM,Armenia,22950475.0
3,AZ,Azerbaijan,14784625.0
4,BA,Bosnia and Herzegovina,477250.0
...,...,...,...
90,XK,Kosovo,3083025.0
91,YE,Yemen,3444000.0
92,ZA,South Africa,1006525.0
93,ZM,Zambia,1978975.0


**6.** Like the previous point, but expressed as a percentage of the overall amount lent.


- This solution is based on pandas functions:

In [10]:
import time as tm
start_time = tm.time()
loans_total = loans['loan_amount'].sum()
loans_percent = loans.groupby(['country_code','country_name'], as_index = False)[['loan_amount']].sum()
loans_percent.columns = ['country_code','country_name','loan_amount_sum']
loans_percent['percent'] = loans_percent['loan_amount_sum']/loans_total*100
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_percent

--- 0.6792972087860107 seconds ---


Unnamed: 0,country_code,country_name,loan_amount_sum,percent
0,AF,Afghanistan,1967950.0,0.166573
1,AL,Albania,4307350.0,0.364586
2,AM,Armenia,22950475.0,1.942589
3,AZ,Azerbaijan,14784625.0,1.251410
4,BA,Bosnia and Herzegovina,477250.0,0.040396
...,...,...,...,...
90,XK,Kosovo,3083025.0,0.260955
91,YE,Yemen,3444000.0,0.291509
92,ZA,South Africa,1006525.0,0.085195
93,ZM,Zambia,1978975.0,0.167506
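
Points 4, 5 and 6 group by the same keys, so they could also be computed in one pass with named aggregation. The following is a minimal sketch on toy data; `sample` and `summarize_by_country` are hypothetical names, and the output column names simply mirror the ones used above.

```python
import pandas as pd

# Toy loans with only the columns needed for points 4 to 6.
sample = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "country_code": ["PH", "PH", "KE", "KE"],
    "country_name": ["Philippines", "Philippines", "Kenya", "Kenya"],
    "loan_amount": [100.0, 300.0, 400.0, 200.0],
})

def summarize_by_country(df):
    """Count loans, sum amounts, and express each sum as a percentage of the total."""
    summary = df.groupby(["country_code", "country_name"], as_index=False).agg(
        loan_id_count=("loan_id", "count"),
        loan_amount_sum=("loan_amount", "sum"),
    )
    summary["percent"] = summary["loan_amount_sum"] / df["loan_amount"].sum() * 100
    return summary

by_country = summarize_by_country(sample)
print(by_country)
```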


**7.** Like the three previous points, but split for each year (with respect to `disburse time`).


- This solution is based on pandas functions:

In [11]:
import time as tm
start_time = tm.time()
from datetime import datetime as dt
loans_total = loans['loan_amount'].sum()
loans['disburse_time_year'] = None
for index, row in loans.iterrows():
    try:
        y = dt.strptime(row['disburse_time'], "%Y-%m-%d %H:%M:%S.%f %z").year
    except TypeError:
        y = None
    except ValueError:
        y = None
    loans.loc[index,'disburse_time_year'] = y 
loans_agg = loans.groupby(['country_code','country_name','disburse_time_year'\
                          ], as_index = False).agg({'loan_id':['count'],\
                                                    'loan_amount':['sum', lambda x: x.sum()/loans_total*100]})
loans_agg.columns = ['country_code','country_name','disburse_time_year','loan_id_count', \
                     'loan_amount_sum','loan_amount_percent']
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_agg

--- 48767.20305228233 seconds ---


Unnamed: 0,country_code,country_name,disburse_time_year,loan_id_count,loan_amount_sum,loan_amount_percent
0,AF,Afghanistan,2007,408,194975.0,0.016503
1,AF,Afghanistan,2008,370,365375.0,0.030926
2,AF,Afghanistan,2009,678,585125.0,0.049527
3,AF,Afghanistan,2010,632,563350.0,0.047683
4,AF,Afghanistan,2011,247,245125.0,0.020748
...,...,...,...,...,...,...
740,ZW,Zimbabwe,2013,426,678525.0,0.057432
741,ZW,Zimbabwe,2014,2078,1311575.0,0.111015
742,ZW,Zimbabwe,2015,600,723625.0,0.061250
743,ZW,Zimbabwe,2016,808,788600.0,0.066749
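
Most of the roughly 48767 seconds above is presumably spent in the row loop that extracts the year. The sketch below parses the whole column at once instead. Note one deliberate difference, which is an assumption about the intent of the task: the cell above divides by the grand total of all loans, while this sketch divides by the total of the same year. The names `sample`, `summarize_by_country_year` and `percent_of_year` are illustrative.

```python
import pandas as pd

# Toy loans spread over two disburse years.
sample = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "country_code": ["PH", "KE", "PH", "KE"],
    "loan_amount": [100.0, 300.0, 200.0, 200.0],
    "disburse_time": ["2013-12-22 08:00:00.000 +0000",
                      "2013-12-20 08:00:00.000 +0000",
                      "2014-01-09 08:00:00.000 +0000",
                      "2014-01-17 08:00:00.000 +0000"],
})

def summarize_by_country_year(df):
    """Aggregate loans per country and disburse year."""
    out = df.copy()
    parsed = pd.to_datetime(out["disburse_time"],
                            format="%Y-%m-%d %H:%M:%S.%f %z",
                            errors="coerce", utc=True)
    out["disburse_time_year"] = parsed.dt.year  # NaN where the date is missing
    summary = out.groupby(["country_code", "disburse_time_year"],
                          as_index=False).agg(
        loan_id_count=("loan_id", "count"),
        loan_amount_sum=("loan_amount", "sum"),
    )
    # Share of each country within its year (not within the grand total).
    year_total = summary.groupby("disburse_time_year")["loan_amount_sum"].transform("sum")
    summary["percent_of_year"] = summary["loan_amount_sum"] / year_total * 100
    return summary

by_year = summarize_by_country_year(sample)
print(by_year)
```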


**8.** For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must assume that all lenders contributed the same amount.


- This solution is based on pandas functions:

In [12]:
import time as tm
start_time = tm.time()
merged = pd.merge(left=loans_lender,right=loans, left_on='loan_id', right_on='loan_id')
merged = merged[['loan_id','lender', 'loan_amount']]
number_of_lenders_by_loan_id = merged.groupby(['loan_id'], as_index = False)[['lender']].count()
number_of_lenders_by_loan_id
merged2 = pd.merge(left=merged,right=number_of_lenders_by_loan_id, left_on='loan_id', right_on='loan_id')
merged2['precise_amount'] = merged2.loan_amount / merged2.lender_y
merged2
lent_by_lender = merged2.groupby(['lender_x'], as_index = False)[['precise_amount']].sum()
print("--- %s seconds ---" % (tm.time()-start_time)) 
lent_by_lender
# lent_by_lender.loc[lent_by_lender['lender_x'] == 'muc888'] # result for one lender

--- 4879.449104070663 seconds ---


Unnamed: 0,lender_x,precise_amount
0,000,1764.285078
1,00000,1380.693644
2,0002,2472.563566
3,00mike00,52.631579
4,0101craign0101,2623.565117
...,...,...
1383794,zzmcfate,66113.226325
1383795,zzpaghetti9994,51.020408
1383796,zzrvmf8538,576.978086
1383797,zzzsai,267.667370
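
The second merge above is only needed to attach the number of lenders to each row. A possible shortcut is `groupby(...).transform("count")`, which returns that number aligned with the original rows. The sketch below uses toy data; `pairs`, `loans_small`, `amount_lent_per_lender` and the column name `share` are assumptions for illustration.

```python
import pandas as pd

# Toy normalized pairs and loans: loan 1 has two lenders, loan 2 has one.
pairs = pd.DataFrame({"loan_id": [1, 1, 2], "lender": ["a", "b", "a"]})
loans_small = pd.DataFrame({"loan_id": [1, 2], "loan_amount": [100.0, 50.0]})

def amount_lent_per_lender(pairs, loans):
    """Split each loan equally among its lenders and sum the shares per lender."""
    merged = pairs.merge(loans[["loan_id", "loan_amount"]], on="loan_id", how="inner")
    n_lenders = merged.groupby("loan_id")["lender"].transform("count")
    merged["share"] = merged["loan_amount"] / n_lenders
    return merged.groupby("lender", as_index=False)["share"].sum()

lent = amount_lent_per_lender(pairs, loans_small)
print(lent)
```

Lender `a` receives half of loan 1 plus all of loan 2, and lender `b` receives the other half of loan 1.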


**9.** For each country, compute the difference between the overall amount of money lent and the overall amount of money borrowed. Since the country of the lender is often unknown, you can assume that the true distribution among the countries is the same as the one computed from the rows where the country is known.


In [13]:
import pandas as pd
import numpy as np
import time as tm
start_time = tm.time()
lenders = pd.read_csv('../../Datasets/kiva/lenders.csv')
if not FINAL:
    lenders = lenders.head(20000) 
lender_states = lenders[['permanent_name','country_code']]
lender_states

Unnamed: 0,permanent_name,country_code
0,qian3013,
1,reena6733,
2,mai5982,
3,andrew86079135,
4,nguyen6962,
...,...,...
2349169,janet7309,
2349170,pj4198,
2349171,maria2141,US
2349172,simone9846,


In [14]:
lenders_in_states = lender_states.loc[lender_states['country_code'].notnull()]
lenders_in_states

Unnamed: 0,permanent_name,country_code
16,naresh2074,US
31,christina27976796,US
37,vikas1098,IN
39,qian1385,US
42,xigg8769,US
...,...,...
2349158,rakhi,US
2349159,vicki5374,US
2349161,jennifer5879,CA
2349171,maria2141,US


In [15]:
lenders_without_states = lender_states.loc[lender_states['country_code'].isnull()]
lenders_without_states

Unnamed: 0,permanent_name,country_code
0,qian3013,
1,reena6733,
2,mai5982,
3,andrew86079135,
4,nguyen6962,
...,...,...
2349167,todd5695,
2349168,kate40761039,
2349169,janet7309,
2349170,pj4198,


In [16]:
lent_and_state = pd.merge(left=lent_by_lender, right=lenders_in_states, left_on='lender_x', right_on='permanent_name')
lent_total = lent_and_state['precise_amount'].sum()
lent_by_state = lent_and_state.groupby(['country_code'], as_index = False\
                                      ).agg({'precise_amount':['count', 'sum', \
                                                               lambda x: x.sum()/lent_total]})
lent_by_state.columns = ['country_code', 'lent_count', 'lent_sum', 'lent_factor']
lent_by_state

Unnamed: 0,country_code,lent_count,lent_sum,lent_factor
0,AD,9,4.912436e+03,5.295105e-06
1,AE,667,1.769248e+06,1.907069e-03
2,AF,124,1.074489e+05,1.158190e-04
3,AG,3,6.182085e+02,6.663656e-07
4,AI,1,3.350203e+02,3.611176e-07
...,...,...,...,...
220,YE,51,1.673080e+04,1.803409e-05
221,YT,1,5.970588e+01,6.435684e-08
222,ZA,659,5.338213e+05,5.754048e-04
223,ZM,17,3.317287e+04,3.575697e-05


In [17]:
lent_no_state = pd.merge(left=lent_by_lender, right=lenders_without_states, \
                         left_on='lender_x', right_on='permanent_name')
lent_no_state_total = lent_no_state['precise_amount'].sum()
lent_no_state_total

235702701.9289114

In [18]:
lent_by_state['additional_lent'] = lent_by_state.lent_factor * lent_no_state_total
lent_by_state

Unnamed: 0,country_code,lent_count,lent_sum,lent_factor,additional_lent
0,AD,9,4.912436e+03,5.295105e-06,1248.070449
1,AE,667,1.769248e+06,1.907069e-03,449501.211979
2,AF,124,1.074489e+05,1.158190e-04,27298.840916
3,AG,3,6.182085e+02,6.663656e-07,157.064178
4,AI,1,3.350203e+02,3.611176e-07,85.116403
...,...,...,...,...,...
220,YE,51,1.673080e+04,1.803409e-05,4250.683852
221,YT,1,5.970588e+01,6.435684e-08,15.169082
222,ZA,659,5.338213e+05,5.754048e-04,135624.477154
223,ZM,17,3.317287e+04,3.575697e-05,8428.014749


In [19]:
loan_and_state = loans[['loan_id','country_code','loan_amount']]
loan_and_state

Unnamed: 0,loan_id,country_code,loan_amount
0,657307,PH,125.0
1,657259,HN,400.0
2,658010,PK,400.0
3,659347,KG,625.0
4,656933,PH,425.0
...,...,...,...
1419602,988180,KE,400.0
1419603,988213,PH,300.0
1419604,989109,GH,2425.0
1419605,989143,PH,100.0


In [20]:
loan_by_state = loan_and_state.groupby(['country_code'], as_index = False).agg({'loan_amount':['sum']})
loan_by_state.columns = ['country_code','loan_amount_sum']
loan_by_state

Unnamed: 0,country_code,loan_amount_sum
0,AF,1967950.0
1,AL,4307350.0
2,AM,22950475.0
3,AZ,14784625.0
4,BA,477250.0
...,...,...
90,XK,3083025.0
91,YE,3444000.0
92,ZA,1006525.0
93,ZM,1978975.0


In [21]:
loan_and_lent = pd.merge(loan_by_state, lent_by_state, how='left', on=['country_code'])
loan_and_lent['difference'] = loan_and_lent.loan_amount_sum - loan_and_lent.lent_sum - loan_and_lent.additional_lent 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loan_and_lent[loan_and_lent.difference == loan_and_lent.difference.max()]

--- 21.760595083236694 seconds ---


Unnamed: 0,country_code,loan_amount_sum,lent_count,lent_sum,lent_factor,additional_lent,difference
61,PH,97984600.0,910.0,765759.181625,0.000825,194551.413275,97024290.0
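
The redistribution used in the cells above can be condensed into a small sketch: the amount lent by lenders without a country is spread over the countries in proportion to the amount lent where the country is known. The sign convention follows the cell above (borrowed minus lent). All names and numbers below are made up for illustration. Unlike the `how='left'` merge above, this sketch also keeps countries that only lend, which is a design choice rather than the original behavior.

```python
import pandas as pd

# Made-up totals: amounts lent per known country, the unknown-country total,
# and amounts borrowed per country.
known_lent = pd.Series({"US": 600.0, "CA": 300.0, "PH": 100.0})
unknown_total = 500.0
borrowed = pd.Series({"PH": 900.0, "KE": 400.0})

def net_borrowed(known_lent, unknown_total, borrowed):
    """Return borrowed, estimated lent, and their difference per country."""
    factor = known_lent / known_lent.sum()        # share of each known country
    lent = known_lent + factor * unknown_total    # add the redistributed part
    table = pd.DataFrame({"borrowed": borrowed, "lent": lent}).fillna(0.0)
    table["difference"] = table["borrowed"] - table["lent"]
    return table

balance = net_borrowed(known_lent, unknown_total, borrowed)
print(balance)
```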


**10.** Which country has the highest ratio between the difference computed at the previous point and the population?


In [31]:
import pandas as pd
import numpy as np
import time as tm
start_time = tm.time()
country_stat = pd.read_csv('../../Datasets/kiva/country_stats.csv')
country_stat = country_stat[['country_code', 'country_name', 'population', 'population_below_poverty_line']]
country_stat = country_stat.dropna(subset=['population_below_poverty_line'])
country_stat

Unnamed: 0,country_code,country_name,population,population_below_poverty_line
0,IN,India,1339180127,21.9
1,NG,Nigeria,190886311,70.0
2,MX,Mexico,129163276,46.2
3,PK,Pakistan,197015955,29.5
4,BD,Bangladesh,164669751,31.5
...,...,...,...,...
147,MT,Malta,430835,16.3
148,MV,Maldives,436330,16.0
149,ME,Montenegro,628960,8.6
150,TM,Turkmenistan,5758075,0.2


In [33]:
country_stat_loan_and_lent = pd.merge(left=country_stat, right=loan_and_lent, \
                                      left_on='country_code', right_on='country_code')
country_stat_loan_and_lent = country_stat_loan_and_lent[['country_code', 'country_name', 'difference', \
                                                         'population', 'population_below_poverty_line']]
country_stat_loan_and_lent['ratio_1'] = country_stat_loan_and_lent.difference/country_stat_loan_and_lent.population
print("--- %s seconds ---" % (tm.time()-start_time)) 
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_1 == country_stat_loan_and_lent.ratio_1.max()]

--- 59.55899930000305 seconds ---


Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_1
63,PY,Paraguay,53897900.0,6811297,22.2,7.913015


As we can see, Paraguay has the highest ratio between the difference computed at the previous point and the population.

The country with the lowest ratio between the difference computed at the previous point and the population can be found in the following way:

In [34]:
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_1 == country_stat_loan_and_lent.ratio_1.min()]

Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_1
48,CA,Canada,-98096740.0,36624199,9.4,-2.678468


**11.** Which country has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line?


In [35]:
import time as tm
start_time = tm.time()
country_stat_loan_and_lent = pd.merge(left=country_stat, right=loan_and_lent, \
                                      left_on='country_code', right_on='country_code')
country_stat_loan_and_lent = country_stat_loan_and_lent[['country_code', 'country_name', 'difference', \
                                                         'population', 'population_below_poverty_line']]
country_stat_loan_and_lent['ratio_2'] = country_stat_loan_and_lent.difference\
/(country_stat_loan_and_lent.population * ((100-country_stat_loan_and_lent.population_below_poverty_line)/100) )
print("--- %s seconds ---" % (tm.time()-start_time)) 
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_2 == country_stat_loan_and_lent.ratio_2.max()]

--- 0.009996175765991211 seconds ---


Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_2
71,AM,Armenia,22920730.0,2930450,32.0,11.502316


As we can see, Armenia has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line.

The country with the lowest ratio between the difference computed at point 9 and the population that is not below the poverty line can be found in the following way:

In [36]:
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_2 == country_stat_loan_and_lent.ratio_2.min()]

Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_2
48,CA,Canada,-98096740.0,36624199,9.4,-2.956366
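
Both ratios can be added to one frame, and the extreme rows can be picked with `idxmax` and `idxmin` instead of comparing against `max()` and `min()`. The sketch below runs on invented figures; the country codes `AA`, `BB`, `CC` and the helper `add_ratios` are hypothetical.

```python
import pandas as pd

# Invented statistics for three placeholder countries.
stats = pd.DataFrame({
    "country_code": ["AA", "BB", "CC"],
    "difference": [100.0, 300.0, -50.0],
    "population": [10, 100, 5],
    "population_below_poverty_line": [50.0, 20.0, 10.0],  # percent
})

def add_ratios(df):
    """Add difference per inhabitant and per inhabitant above the poverty line."""
    out = df.copy()
    out["ratio_1"] = out["difference"] / out["population"]
    not_poor = out["population"] * (100 - out["population_below_poverty_line"]) / 100
    out["ratio_2"] = out["difference"] / not_poor
    return out

ratios = add_ratios(stats)
highest_1 = ratios.loc[ratios["ratio_1"].idxmax(), "country_code"]
lowest_2 = ratios.loc[ratios["ratio_2"].idxmin(), "country_code"]
print(highest_1, lowest_2)
```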


**12.** For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year. For example, a loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 / (31+365+30) = 352.11 for 2018.

In [29]:
import time as tm
start_time = tm.time()

from datetime import datetime as dt, timezone as tz

now = dt.now(tz.utc)

def days_in_year(year):
    d1 = dt.strptime( str(year)+'-01-01', "%Y-%m-%d") 
    d1 = d1.replace(tzinfo=None)
    d2 = dt.strptime( str(year)+'-12-31', "%Y-%m-%d")
    d2 = d2.replace(tzinfo=None)
    return (d2-d1).days+1

def total_amount_of_loans(disb, plan_exp, amount):
    if( disb.year == plan_exp.year ):
        return [(disb.year,amount)]
    disb = disb.replace(tzinfo=None)
    plan_exp = plan_exp.replace(tzinfo=None) 
    total_days = (plan_exp-disb).days 
    total_days += 1
    total_amount = []
    curr = disb
    curr = curr.replace(tzinfo=None)
    for y in range(disb.year, plan_exp.year):
        dy = (dt.strptime(str(y)+'-12-31',"%Y-%m-%d").replace(tzinfo=None)-curr).days + 1
        if(total_days >0):
            total_amount.append( (y, dy * amount / total_days) )
        else:
            total_amount.append( (y, 0) )
        curr = dt.strptime(str(y+1)+'-01-01',"%Y-%m-%d")
        curr = curr.replace(tzinfo=None)
    dy = (plan_exp-curr).days + 1
    if(total_days >0):
        total_amount.append( (plan_exp.year, dy * amount / total_days) )
    else:
        total_amount.append( (plan_exp.year,0) )
    return total_amount

d_t = dt.strptime( '2016-12-01', "%Y-%m-%d")
p_e_t = dt.strptime( '2018-01-30', "%Y-%m-%d")
total_amount_of_loans(d_t, p_e_t, 5000)

[(2016, 363.84976525821594),
 (2017, 4284.037558685446),
 (2018, 352.11267605633805)]

In [26]:
import pandas as pd
import numpy as np
loans
loans_base = loans[['loan_id','loan_name','loan_amount','planned_expiration_time','disburse_time']]
loans_base

Unnamed: 0,loan_id,loan_name,loan_amount,planned_expiration_time,disburse_time
0,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000
1,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000
2,658010,Aasia,400.0,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000
3,659347,Gulmira,625.0,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000
4,656933,Ricky\t,425.0,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000
...,...,...,...,...,...
1419602,988180,,400.0,2016-01-02 01:00:03.000 +0000,2015-11-23 08:00:00.000 +0000
1419603,988213,Perlita,300.0,2016-01-02 16:40:07.000 +0000,2015-11-24 08:00:00.000 +0000
1419604,989109,Okyeso Nyame Group,2425.0,2016-01-03 22:20:04.000 +0000,2015-11-13 08:00:00.000 +0000
1419605,989143,Exequila,100.0,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000


In [30]:
lis = []
for index, row in loans_base.iterrows():
    s2 = row['planned_expiration_time']
    try:
        d2 = dt.strptime(s2, "%Y-%m-%d %H:%M:%S.%f %z")
    except ValueError:
        continue
    except TypeError:
        continue
    s1 = row['disburse_time']
    try:
        d1 = dt.strptime(s1, "%Y-%m-%d %H:%M:%S.%f %z")
    except ValueError:
        continue
    except TypeError:
        continue
    try:
        amt = float(row['loan_amount'])
    except ValueError:
        continue
    except TypeError:
        continue
    ls = total_amount_of_loans(d1, d2, amt)
    for l in ls:
        lis.append({ 'loan_id' : row['loan_id'], 'loan_name': row['loan_name'], 'loan_amount': row['loan_amount'], \
                    'planned_expiration_time': row['planned_expiration_time'], \
                    'disburse_time': row['disburse_time'],'year': l[0], 'ammount':l[1] })
loans_final = pd.DataFrame(lis) 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_final

--- 394.8900270462036 seconds ---


Unnamed: 0,loan_id,loan_name,loan_amount,planned_expiration_time,disburse_time,year,ammount
0,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,2013,20.833333
1,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,2014,104.166667
2,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,2013,45.360825
3,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,2014,350.515464
4,658010,Aasia,400.0,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,2014,400.000000
...,...,...,...,...,...,...,...
1196549,989109,Okyeso Nyame Group,2425.0,2016-01-03 22:20:04.000 +0000,2015-11-13 08:00:00.000 +0000,2016,139.903846
1196550,989143,Exequila,100.0,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000,2015,90.625000
1196551,989143,Exequila,100.0,2016-01-05 08:50:02.000 +0000,2015-11-03 08:00:00.000 +0000,2016,7.812500
1196552,989240,Lydia,175.0,2016-01-03 20:50:06.000 +0000,2015-11-03 08:00:00.000 +0000,2015,163.709677
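
In the rows above, the yearly shares of some loans do not add up to the loan amount: for loan 989143, 90.625 + 7.8125 is less than 100. This seems to come from the time of day being truncated when whole days are counted. A possible remedy, sketched below under the assumption that only calendar dates should matter, is to work with `date` objects and count both end days, as in the example in the task text. The function name `split_amount_by_year` is hypothetical.

```python
from datetime import date

def split_amount_by_year(start, end, amount):
    """Split `amount` over calendar years in proportion to the days in each year.

    Both `start` and `end` are counted as full days.
    """
    total_days = (end - start).days + 1
    if total_days <= 0:
        return []  # expiration before disbursement: nothing to distribute
    shares = []
    for year in range(start.year, end.year + 1):
        first = max(start, date(year, 1, 1))    # first counted day in this year
        last = min(end, date(year, 12, 31))     # last counted day in this year
        days = (last - first).days + 1
        shares.append((year, amount * days / total_days))
    return shares

# Example from the task text: 31, 365 and 30 days out of 426.
example = split_amount_by_year(date(2016, 12, 1), date(2018, 1, 30), 5000)
# Dates of loan 989143 from the table above.
check = split_amount_by_year(date(2015, 11, 3), date(2016, 1, 5), 100.0)
print(example)
print(check)
```

With timestamps parsed as in the cells above, the calendar dates could be obtained with `d1.date()` and `d2.date()` before calling the function.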
