# Project Solution

You have to work on the [Kiva](https://drive.google.com/file/d/1-tJtnIbo1Rt-F1XfoWGVkmBXiI-ciuRx/view) dataset. Some information about the dataset is available on the [Kaggle](https://www.kaggle.com/gaborfodor/additional-kiva-snapshot) web page.

## Basic tasks

All groups and individuals must do the following:

`1.` Normalize the `loan_lenders` table. In the normalized table, each row must have one `loan_id` and one `lender`.

First, the CSV file is read into a data frame that holds the unnormalized data:

In [1]:
import pandas as pd
import numpy as np
loans_lenders_un = pd.read_csv('../../Datasets/kiva/loans_lenders.csv')
loans_lenders_un = loans_lenders_un.head(5000) # Comment this out for the final calculation; keep it during development
loans_lenders_un

Unnamed: 0,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."
...,...,...
4995,108036,"bobharris, isabel2008, isabel2008, al4832, dan..."
4996,108913,"bobharris, johnf9634, fred8631, amelia3963, la..."
4997,120715,"bobharris, ruth7441, jonathan3668, r3922, thom..."
4998,126756,"bobharris, thomas7482, steve2025, peter8038, v..."


The normalized data frame is built from an intermediate object: a list of dictionaries, each holding one (loan_id, lender) pair:

In [2]:
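lis = []
for index, row in loans_lenders_un.iterrows():
    # Split the comma-separated lender list of this loan
    ls = row['lenders'].split(',')
    for l in ls:
        # One record per (loan_id, lender) pair, with surrounding spaces removed
        lis.append({'loan_id': row['loan_id'], 'lender': l.strip()})
loans_lender = pd.DataFrame(lis)
loans_lender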

Unnamed: 0,loan_id,lender
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499
...,...,...
135937,128485,xxAnonyMousexx
135938,128485,klassenkasse9334
135939,128485,merle2926
135940,128485,guillermo5352
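
The same normalization can also be written without an explicit loop. The following is a minimal sketch, assuming a pandas version that provides `DataFrame.explode` (0.25 or later); the `toy` frame is a hypothetical stand-in for `loans_lenders_un`, using lender names from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the unnormalized table
toy = pd.DataFrame({
    'loan_id': [483693, 483738],
    'lenders': ['muc888, sam4326, camaran3922', 'muc888, nora3555'],
})

normalized = (
    toy.assign(lender=toy['lenders'].str.split(','))  # list of lenders per loan
       .explode('lender')                             # one row per list element
       .drop(columns='lenders')
       .reset_index(drop=True)
)
# Remove the spaces that follow the commas in the original string
normalized['lender'] = normalized['lender'].str.strip()
print(normalized)
```

On the full table this avoids `iterrows`, which tends to be slow on large data frames.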


`2.` For each loan, add a column `duration` corresponding to the number of days between the disburse time and the planned expiration time. If either of the two dates is missing, the duration must also be missing.

First, the data frame is loaded and the structure of `loans` is inspected:

In [3]:
import pandas as pd
import numpy as np
loans = pd.read_csv('../../Datasets/kiva/loans.csv')
loans.columns

Index(['loan_id', 'loan_name', 'original_language', 'description',
       'description_translated', 'funded_amount', 'loan_amount', 'status',
       'activity_name', 'sector_name', 'loan_use', 'country_code',
       'country_name', 'town_name', 'currency_policy',
       'currency_exchange_coverage_rate', 'currency', 'partner_id',
       'posted_time', 'planned_expiration_time', 'disburse_time',
       'raised_time', 'lender_term', 'num_lenders_total',
       'num_journal_entries', 'num_bulk_entries', 'tags', 'borrower_genders',
       'borrower_pictured', 'repayment_interval', 'distribution_model'],
      dtype='object')

After that, the values of the data frame are displayed:

In [4]:
loans_head = loans.head()
loans = loans.head(5000) # Comment this out for the final calculation; keep it during development
loans

Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,2014-01-15 04:48:22.000 +0000,7.0,3,2,1,,female,true,irregular,field_partner
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,2014-02-25 06:42:06.000 +0000,8.0,11,2,1,,female,true,monthly,field_partner
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,2014-01-24 23:06:18.000 +0000,14.0,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,true,monthly,field_partner
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,2014-01-22 05:29:28.000 +0000,14.0,21,2,1,user_favorite,female,true,monthly,field_partner
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,2014-01-14 17:29:27.000 +0000,7.0,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,true,bullet,field_partner
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,695734,Luisa Gualupe,Spanish,Luisa se dedica a la venta de mates y jugos de...,Luisa sells herbal teas and juice on foot from...,550.0,550.0,funded,Beverages,Food,...,2014-04-20 03:36:16.000 +0000,6.0,19,1,1,,female,true,irregular,field_partner
4996,696387,Teresa De Jesús,Spanish,"Teresa de 45 años de edad, estudio segundo gra...",Teresa is 45 years old. She went to school up ...,500.0,500.0,funded,Agriculture,Agriculture,...,2014-04-24 12:52:51.000 +0000,14.0,15,4,2,,female,true,bullet,field_partner
4997,687099,Merlinda,English,Merlinda is married to a carpenter. She has b...,Merlinda is married to a carpenter. She has b...,125.0,125.0,funded,Manufacturing,Manufacturing,...,2014-03-24 00:37:09.000 +0000,8.0,4,1,1,,female,true,irregular,field_partner
4998,687242,Ndèye Fatou,French,"Âgée de 29 ans, cette emprunteuse appelée Ndèy...","This borrower is Ndèye Fatou, age 29, a resell...",125.0,125.0,funded,Food Stall,Food,...,2014-03-25 16:22:12.000 +0000,6.0,3,1,1,,female,true,irregular,field_partner


We are now mainly interested in the columns `planned_expiration_time` and `disburse_time`:

In [5]:
loans_sel = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time'] ]
loans_sel

Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000
...,...,...,...,...
4995,695734,Luisa Gualupe,2014-05-19 21:40:01.000 +0000,2014-03-17 07:00:00.000 +0000
4996,696387,Teresa De Jesús,2014-05-21 10:40:02.000 +0000,2014-04-07 07:00:00.000 +0000
4997,687099,Merlinda,2014-04-22 23:00:04.000 +0000,2014-03-18 07:00:00.000 +0000
4998,687242,Ndèye Fatou,2014-04-24 15:30:02.000 +0000,2014-03-18 07:00:00.000 +0000


In [6]:
from datetime import datetime
loans['duration'] = None
for index, row in loans.iterrows(): 
    s2 = row['planned_expiration_time']
    s1 = row['disburse_time']
    if( pd.notna(s1) and pd.notna(s2) and s1 != '' and s2 != ''):
        d2 = datetime.strptime(s2, "%Y-%m-%d %H:%M:%S.%f %z")
        d1 = datetime.strptime(s1, "%Y-%m-%d %H:%M:%S.%f %z")
        loans.loc[index,'duration'] = (d2 - d1).days 
loans_sel = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time', 'duration'] ]
loans_sel

Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time,duration
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,53
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,96
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,37
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000,34
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000,57
...,...,...,...,...,...
4995,695734,Luisa Gualupe,2014-05-19 21:40:01.000 +0000,2014-03-17 07:00:00.000 +0000,63
4996,696387,Teresa De Jesús,2014-05-21 10:40:02.000 +0000,2014-04-07 07:00:00.000 +0000,44
4997,687099,Merlinda,2014-04-22 23:00:04.000 +0000,2014-03-18 07:00:00.000 +0000,35
4998,687242,Ndèye Fatou,2014-04-24 15:30:02.000 +0000,2014-03-18 07:00:00.000 +0000,37
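
The duration can also be computed column-wise. This is a minimal sketch on a hypothetical two-row frame `toy`, assuming the timestamp format shown in the output above; missing timestamps become `NaT`, so the resulting duration is missing as the task requires.

```python
import pandas as pd

# Hypothetical sample: the second row has no planned expiration time
toy = pd.DataFrame({
    'planned_expiration_time': ['2014-02-14 03:30:06.000 +0000', None],
    'disburse_time': ['2013-12-22 08:00:00.000 +0000', '2013-12-20 08:00:00.000 +0000'],
})

fmt = '%Y-%m-%d %H:%M:%S.%f %z'
expiration = pd.to_datetime(toy['planned_expiration_time'], format=fmt, utc=True)
disburse = pd.to_datetime(toy['disburse_time'], format=fmt, utc=True)

# Whole days between the two timestamps; NaN where either one is missing
toy['duration'] = (expiration - disburse).dt.days
print(toy['duration'])
```

The first row reproduces the value 53 shown for loan 657307 above.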


`3.` Find the lenders that have funded at least twice.

Those are the lenders that appear more than once in the normalized `loans_lender` data frame.


- One possible solution counts occurrences with a dictionary object:

In [7]:
stat={}
for index, row in loans_lender.iterrows(): 
    lender = row['lender']
    stat[lender] = 1 + stat.get(lender, 0)
lenders_funded_2_more = []
for k in stat.keys():
    if( stat[k] >= 2):
        lenders_funded_2_more.append((k,stat[k]))
lenders_funded_2_more

[('muc888', 11),
 ('sam4326', 5),
 ('rebecca3499', 89),
 ('karlheinz4543', 8),
 ('paula8951', 3),
 ('gmct', 677),
 ('r3922', 54),
 ('brian9451', 3),
 ('shree8053', 57),
 ('alan5513', 31),
 ('oisin3389', 11),
 ('bo3186', 4),
 ('ric8947', 2),
 ('daniel98469874', 82),
 ('deborah12671549', 6),
 ('matthew9831', 2),
 ('john6330', 10),
 ('john9479', 4),
 ('mattiaslaven', 35),
 ('jason3883', 11),
 ('highgrovechurch', 153),
 ('dino5102', 8),
 ('jonathan7946', 3),
 ('ann8187', 2),
 ('bryan2669', 2),
 ('eddyphil', 14),
 ('don9212', 90),
 ('carolineandcolin9686', 2),
 ('raph8817', 4),
 ('danielle2350', 3),
 ('barbara5610', 202),
 ('danhostetler', 3),
 ('daniel1104', 5),
 ('amirali5409', 190),
 ('oceanwest', 7),
 ('trolltech4460', 758),
 ('thedragonflykeeper', 6),
 ('patrick8466', 2),
 ('terrystl', 42),
 ('sherri4341', 2),
 ('gooddogg1', 716),
 ('danny6470', 25),
 ('jacqueline4838', 6),
 ('diederik8163', 6),
 ('ryan54597608', 4),
 ('james5068', 2),
 ('rudi5955', 2),
 ('daniel9859', 9),
 ('robert228

- Another solution is based on the pandas `duplicated` method:

In [8]:
loans_lender_2_more = loans_lender[loans_lender.duplicated(['lender'])]
lenders_funded_2_more = list(set(loans_lender_2_more['lender']))
lenders_funded_2_more

['judi4141',
 'themockturtle',
 'colin20286727',
 'khadija5547',
 'thomas6252',
 'simon9510',
 'starrstarr',
 'david29512429',
 'morten1983',
 'gill1764',
 'desiree37',
 'jason47824779',
 'andreas2382',
 'derk9257',
 'nfolkert',
 'nigel4330',
 'manon9265',
 'karen4442',
 '11familymembers',
 'nicole3428',
 'kat4303',
 'pimenhannie1777',
 'diane5687',
 'mohammad5457',
 'valerie5339',
 'kennethandevelyn',
 'kent9427',
 'stianedvin3097',
 'angela9181',
 'james8617',
 'theresa1032',
 'ketty2969',
 'paul4667',
 'konnsyquence',
 'alan888',
 'yen3285',
 'bobharris',
 'eric5133',
 'clark9828',
 'samjones',
 'chrisandnatasha5215',
 'mary6278',
 'jeff8664',
 'gianna5909',
 'FightExpiry',
 'ann8187',
 'david55808081',
 'andyonkiva',
 'rich6247',
 'chrisandsarah5331',
 'sue3816',
 'scottandcherie5721',
 'robert62035531',
 'hartmut7176',
 'eva2329',
 'ming',
 'pat5984',
 'martin4951',
 'colum4732',
 'bill9744',
 'hussain2895',
 'bas2217',
 'robertsanek',
 'mary15102341',
 'franziska35021491',
 'chri
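
A third option, which also keeps the number of loans per lender, is `Series.value_counts`. The sketch below runs on a hypothetical frame `toy` with the same columns as `loans_lender`; like the two solutions above, it counts rows, so a lender listed twice on the same loan is counted twice.

```python
import pandas as pd

# Hypothetical stand-in for the normalized loans_lender table
toy = pd.DataFrame({
    'loan_id': [1, 1, 2, 3, 3],
    'lender': ['muc888', 'sam4326', 'muc888', 'gmct', 'muc888'],
})

counts = toy['lender'].value_counts()   # number of rows per lender
at_least_twice = counts[counts >= 2]    # keep lenders with two or more rows
print(at_least_twice)
```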

`4.` For each country, compute how many loans have involved that country as borrowers.

- One solution counts the loans with a dictionary object:

In [9]:
stat={}
for index, row in loans.iterrows(): 
    cn = row['country_name']
    cc = row['country_code']
    stat[(cc,cn)] = 1 + stat.get((cc,cn), 0)
stat

{('PH', 'Philippines'): 1394,
 ('HN', 'Honduras'): 52,
 ('PK', 'Pakistan'): 110,
 ('KG', 'Kyrgyzstan'): 51,
 ('SV', 'El Salvador'): 323,
 ('BI', 'Burundi'): 12,
 ('ML', 'Mali'): 33,
 ('MN', 'Mongolia'): 28,
 ('PE', 'Peru'): 217,
 ('GE', 'Georgia'): 29,
 ('AM', 'Armenia'): 41,
 ('GH', 'Ghana'): 42,
 ('TZ', 'Tanzania'): 21,
 ('KH', 'Cambodia'): 220,
 ('TG', 'Togo'): 26,
 ('GT', 'Guatemala'): 31,
 ('LB', 'Lebanon'): 43,
 ('PY', 'Paraguay'): 102,
 ('NI', 'Nicaragua'): 159,
 ('KE', 'Kenya'): 419,
 ('UG', 'Uganda'): 132,
 ('RW', 'Rwanda'): 51,
 ('MZ', 'Mozambique'): 15,
 ('AZ', 'Azerbaijan'): 24,
 ('IL', 'Israel'): 6,
 ('TJ', 'Tajikistan'): 153,
 ('BO', 'Bolivia'): 75,
 ('MX', 'Mexico'): 46,
 ('ID', 'Indonesia'): 22,
 ('NG', 'Nigeria'): 13,
 ('EC', 'Ecuador'): 107,
 ('MG', 'Madagascar'): 14,
 ('VN', 'Vietnam'): 60,
 ('CO', 'Colombia'): 98,
 ('JO', 'Jordan'): 97,
 ('YE', 'Yemen'): 20,
 ('IN', 'India'): 217,
 ('CL', 'Chile'): 6,
 ('AL', 'Albania'): 13,
 ('PS', 'Palestine'): 47,
 ('NP', 'Nepal'

- Another solution is based on pandas grouping:

In [10]:
loans_count = loans.groupby(['country_code','country_name'])['loan_id'].count()
loans_count

country_code  country_name
AF            Afghanistan      1
AL            Albania         13
AM            Armenia         41
AZ            Azerbaijan      24
BF            Burkina Faso     1
                              ..
XK            Kosovo          17
YE            Yemen           20
ZA            South Africa     1
ZM            Zambia           5
ZW            Zimbabwe        23
Name: loan_id, Length: 71, dtype: int64

`5.` For each country, compute the overall amount of money borrowed.


- One solution sums the amounts with a dictionary object:

In [11]:
stat={}
for index, row in loans.iterrows(): 
    cn = row['country_name']
    cc = row['country_code']
    stat[(cc,cn)] = stat.get((cc,cn), 0) + row['loan_amount']
stat

{('PH', 'Philippines'): 441625.0,
 ('HN', 'Honduras'): 45650.0,
 ('PK', 'Pakistan'): 60975.0,
 ('KG', 'Kyrgyzstan'): 70850.0,
 ('SV', 'El Salvador'): 204875.0,
 ('BI', 'Burundi'): 38350.0,
 ('ML', 'Mali'): 35875.0,
 ('MN', 'Mongolia'): 46475.0,
 ('PE', 'Peru'): 186200.0,
 ('GE', 'Georgia'): 40925.0,
 ('AM', 'Armenia'): 78775.0,
 ('GH', 'Ghana'): 29400.0,
 ('TZ', 'Tanzania'): 40250.0,
 ('KH', 'Cambodia'): 180200.0,
 ('TG', 'Togo'): 17300.0,
 ('GT', 'Guatemala'): 56025.0,
 ('LB', 'Lebanon'): 66175.0,
 ('PY', 'Paraguay'): 240725.0,
 ('NI', 'Nicaragua'): 134150.0,
 ('KE', 'Kenya'): 257875.0,
 ('UG', 'Uganda'): 106325.0,
 ('RW', 'Rwanda'): 55950.0,
 ('MZ', 'Mozambique'): 8975.0,
 ('AZ', 'Azerbaijan'): 47600.0,
 ('IL', 'Israel'): 21500.0,
 ('TJ', 'Tajikistan'): 150300.0,
 ('BO', 'Bolivia'): 137300.0,
 ('MX', 'Mexico'): 97775.0,
 ('ID', 'Indonesia'): 18700.0,
 ('NG', 'Nigeria'): 2525.0,
 ('EC', 'Ecuador'): 118675.0,
 ('MG', 'Madagascar'): 4475.0,
 ('VN', 'Vietnam'): 77375.0,
 ('CO', 'Colombia

- Another solution is based on pandas grouping:

In [12]:
loans_sum = loans.groupby(['country_code','country_name'])['loan_amount'].sum()
loans_sum

country_code  country_name
AF            Afghanistan      1175.0
AL            Albania         19350.0
AM            Armenia         78775.0
AZ            Azerbaijan      47600.0
BF            Burkina Faso      700.0
                               ...   
XK            Kosovo          22000.0
YE            Yemen           33375.0
ZA            South Africa     1700.0
ZM            Zambia          13150.0
ZW            Zimbabwe        20000.0
Name: loan_amount, Length: 71, dtype: float64

`6.` Like the previous point, but expressed as a percentage of the overall amount lent.


- One solution is based on a calculation with a dictionary object:

In [13]:
stat={}
total = 0
for index, row in loans.iterrows(): 
    cn = row['country_name']
    cc = row['country_code']
    stat[(cc,cn)] = stat.get((cc,cn), 0) + row['loan_amount']
    total += row['loan_amount']
for k in stat.keys():
    stat[k] = stat[k]/total*100
stat

{('PH', 'Philippines'): 10.789631204114292,
 ('HN', 'Honduras'): 1.11530521249435,
 ('PK', 'Pakistan'): 1.4897203796679737,
 ('KG', 'Kyrgyzstan'): 1.7309830077814832,
 ('SV', 'El Salvador'): 5.005436044025849,
 ('BI', 'Burundi'): 0.9369541051294268,
 ('ML', 'Mali'): 0.8764857502351546,
 ('MN', 'Mongolia'): 1.1354613307924408,
 ('PE', 'Peru'): 4.549174820732706,
 ('GE', 'Georgia'): 0.9998656258780128,
 ('AM', 'Armenia'): 1.9246039017358694,
 ('GH', 'Ghana'): 0.7182907611683219,
 ('TZ', 'Tanzania'): 0.9833742563613931,
 ('KH', 'Cambodia'): 4.4025848694738645,
 ('TG', 'Togo'): 0.42266769279632543,
 ('GT', 'Guatemala'): 1.3687836698794298,
 ('LB', 'Lebanon'): 1.6167650040923027,
 ('PY', 'Paraguay'): 5.881311002797425,
 ('NI', 'Nicaragua'): 3.2775069935622576,
 ('KE', 'Kenya'): 6.300313946812279,
 ('UG', 'Uganda'): 2.5976960945993817,
 ('RW', 'Rwanda'): 1.3669512954886942,
 ('MZ', 'Mozambique'): 0.2192741354246833,
 ('AZ', 'Azerbaijan'): 1.1629469466534736,
 ('IL', 'Israel'): 0.525280658677

- Another solution is based on pandas functions:

In [14]:
loans_total = loans['loan_amount'].sum()
loans_percent = loans.groupby(['country_code','country_name'])['loan_amount'].sum()/loans_total*100
loans_percent

country_code  country_name
AF            Afghanistan     0.028707
AL            Albania         0.472753
AM            Armenia         1.924604
AZ            Azerbaijan      1.162947
BF            Burkina Faso    0.017102
                                ...   
XK            Kosovo          0.537496
YE            Yemen           0.815407
ZA            South Africa    0.041534
ZM            Zambia          0.321276
ZW            Zimbabwe        0.488633
Name: loan_amount, Length: 71, dtype: float64

`7.` Like the three previous points, but split for each year (with respect to `disburse time`).


- One solution is based on a calculation with a dictionary object:

In [15]:
from datetime import datetime
stat={}
total = 0
for index, row in loans.iterrows(): 
    cn = row['country_name']
    cc = row['country_code']
    y = datetime.strptime(row['disburse_time'], "%Y-%m-%d %H:%M:%S.%f %z").year
    triplet = stat.get((cc,cn,y), (0,0,0))
    stat[(cc,cn,y)] =  (triplet[0] + 1, triplet[1] + row['loan_amount'], None)
    total += row['loan_amount']
for k in stat.keys():
    triplet = stat[k]
    stat[k]= (triplet[0], triplet[1], triplet[1]/total*100)
stat

{('PH', 'Philippines', 2013): (47, 21175.0, 0.5173403696509937),
 ('HN', 'Honduras', 2013): (3, 3000.0, 0.07329497562942061),
 ('PK', 'Pakistan', 2014): (57, 25625.0, 0.6260612501679677),
 ('KG', 'Kyrgyzstan', 2014): (15, 18400.0, 0.44954251719377974),
 ('PH', 'Philippines', 2014): (340, 104625.0, 2.5561622750760438),
 ('SV', 'El Salvador', 2014): (105, 68375.0, 1.6705146528872112),
 ('BI', 'Burundi', 2014): (5, 12525.0, 0.306006523252831),
 ('ML', 'Mali', 2014): (19, 19375.0, 0.47336338427334135),
 ('MN', 'Mongolia', 2013): (3, 3200.0, 0.07818130733804864),
 ('PE', 'Peru', 2013): (25, 14975.0, 0.3658640866835245),
 ('HN', 'Honduras', 2014): (29, 27475.0, 0.671259818472777),
 ('PH', 'Philippines', 2015): (88, 26600.0, 0.6498821172475294),
 ('PK', 'Pakistan', 2015): (6, 2500.0, 0.06107914635785051),
 ('GE', 'Georgia', 2015): (2, 1950.0, 0.04764173415912339),
 ('SV', 'El Salvador', 2015): (17, 7575.0, 0.18506981346428703),
 ('AM', 'Armenia', 2015): (6, 8350.0, 0.20400434883522067),
 ('KG

- Another solution is based on pandas functions:

In [16]:
from datetime import datetime
loans_total = loans['loan_amount'].sum()
loans['disburse_time_year'] = None
for index, row in loans.iterrows(): 
    y = datetime.strptime(row['disburse_time'], "%Y-%m-%d %H:%M:%S.%f %z").year
    loans.loc[index,'disburse_time_year'] = y 
# The percentage uses loans_total from this cell rather than `total` left over from In [15]
loans_agg = loans.groupby(['country_code','country_name','disburse_time_year']).agg({'loan_amount':['count', 'sum', lambda x: x.sum()/loans_total*100]})
loans_agg

country_code,country_name,disburse_time_year,loan_amount count,loan_amount sum,loan_amount <lambda_0>
AF,Afghanistan,2010,1,1175.0,0.028707
AL,Albania,2012,2,1750.0,0.042755
AL,Albania,2013,1,950.0,0.023210
AL,Albania,2014,6,11300.0,0.276078
AL,Albania,2015,1,825.0,0.020156
...,...,...,...,...,...
ZW,Zimbabwe,2012,1,1200.0,0.029318
ZW,Zimbabwe,2013,1,4000.0,0.097727
ZW,Zimbabwe,2014,16,8000.0,0.195453
ZW,Zimbabwe,2015,4,5550.0,0.135596
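
The year extraction and the three aggregates can also be expressed without `iterrows` or a lambda. This is a minimal sketch on a hypothetical frame `toy`; the output column names `loans`, `amount` and `percent` are chosen here for illustration, and named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample with the columns used in this task
toy = pd.DataFrame({
    'country_code': ['PH', 'PH', 'KE'],
    'disburse_time': ['2013-12-22 08:00:00.000 +0000',
                      '2014-01-09 08:00:00.000 +0000',
                      '2014-01-17 08:00:00.000 +0000'],
    'loan_amount': [125.0, 400.0, 475.0],
})

# Parse the whole column at once and take the year
toy['year'] = pd.to_datetime(
    toy['disburse_time'], format='%Y-%m-%d %H:%M:%S.%f %z', utc=True
).dt.year

total = toy['loan_amount'].sum()
by_year = toy.groupby(['country_code', 'year'])['loan_amount'].agg(
    loans='count',   # number of loans (task 4)
    amount='sum',    # amount borrowed (task 5)
)
# Share of the overall amount (task 6), computed from the summed column
by_year['percent'] = by_year['amount'] / total * 100
print(by_year)
```

Rows with a missing `disburse_time` would get a missing year and be left out of the groups, which should be kept in mind when running this on the full dataset.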


`8.` For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must assume that all lenders contributed the same amount.


- One solution is based on a calculation with dictionary objects:

In [17]:
loan_ids_by_lender = {}
number_of_lenders_by_loan_id = {}
for index, row in loans_lender.iterrows(): 
    loanId = row['loan_id']
    lender = row['lender']
    s = loan_ids_by_lender.get(lender,set())
    s.add(loanId)
    loan_ids_by_lender[lender] = s 
    number_of_lenders_by_loan_id[loanId] = number_of_lenders_by_loan_id.get(loanId, 0) + 1
amount_by_loan_id = {}
for index, row in loans.iterrows(): 
    loanId = row['loan_id']
    amount = row['loan_amount']
    amount_by_loan_id[loanId] = amount
#(loan_ids_by_lender, number_of_lenders_by_loan_id, amount_by_loan_id)  
money_lent = {}
for lender in loan_ids_by_lender:
    s = loan_ids_by_lender[lender]
    amm = 0
    for loanId in s:
        amm += amount_by_loan_id.get(loanId,0) / number_of_lenders_by_loan_id.get(loanId,1)
    money_lent[lender] = amm
money_lent


{'muc888': 0.0,
 'sam4326': 0.0,
 'camaran3922': 0.0,
 'lachheb1865': 0.0,
 'rebecca3499': 56.11635220125786,
 'karlheinz4543': 0.0,
 'jerrydb': 0.0,
 'paula8951': 0.0,
 'gmct': 299.15509624221056,
 'amra9383': 0.0,
 'r3922': 39.94082840236686,
 'brian9451': 0.0,
 'shree8053': 0.0,
 'alan5513': 0.0,
 'oisin3389': 0.0,
 'helle8622': 0.0,
 'bo3186': 53.55750487329435,
 'ric8947': 0.0,
 'daniel98469874': 119.14033526183059,
 'nick9464': 0.0,
 'deborah12671549': 0.0,
 'matthew9831': 0.0,
 'john6330': 0.0,
 'john9479': 0.0,
 'mattiaslaven': 66.77886576907385,
 'jonathan2867': 0.0,
 'jason3883': 0.0,
 'highgrovechurch': 571.3255214043226,
 'maria3124': 0.0,
 'dino5102': 67.85714285714286,
 'jonathan7946': 28.571428571428573,
 'ann8187': 0.0,
 'bryan2669': 0.0,
 'john88459657': 0.0,
 'eddyphil': 0.0,
 'don9212': 136.39866574649184,
 'carolineandcolin9686': 0.0,
 'bent8782': 0.0,
 'raph8817': 0.0,
 'danielle2350': 0.0,
 'nora3555': 0.0,
 'williammanashi': 0.0,
 'barbara5610': 258.0485131775364

- Another solution is based on pandas functions:

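In [22]:
# Left merge keeps lenders whose loans are absent from the `loans` sample
merged = pd.merge(left=loans_lender, right=loans[['loan_id', 'loan_amount']], on='loan_id', how='left')
# Number of lender entries per loan, counted on the normalized table
lenders_per_loan = loans_lender.groupby('loan_id')['lender'].count().rename('n_lenders')
merged = merged.drop_duplicates(['loan_id', 'lender']).join(lenders_per_loan, on='loan_id')
# Every lender of a loan is assumed to contribute an equal share
merged['share'] = merged['loan_amount'] / merged['n_lenders']
money_lent = merged.groupby('lender')['share'].sum()
money_lent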

`9.` For each country, compute the difference between the overall amount of money lent and the overall amount of money borrowed. Since the country of the lender is often unknown, you can assume that the true distribution among the countries is the same as the one computed from the rows where the country is known.
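
No solution is given for this point. The proportional assumption can be illustrated with a minimal sketch on made-up numbers; the frames `lent` and `borrowed`, their column names and all amounts are hypothetical and do not come from the Kiva files. The amounts lent by lenders with a known country are scaled up so that they add up to the overall amount lent, which spreads the unknown part according to the known distribution.

```python
import pandas as pd

# Hypothetical amount lent per lender; the country of lender 'c' is unknown
lent = pd.DataFrame({
    'lender': ['a', 'b', 'c', 'd'],
    'country_code': ['US', 'DE', None, 'US'],
    'amount_lent': [100.0, 50.0, 150.0, 100.0],
})
# Hypothetical amount borrowed per country
borrowed = pd.Series({'US': 0.0, 'DE': 20.0, 'KE': 380.0})

# Amount lent per country, using only rows where the country is known
lent_known = lent.dropna(subset=['country_code']).groupby('country_code')['amount_lent'].sum()

# Rescale so the per-country amounts sum to the overall amount lent
scale = lent['amount_lent'].sum() / lent_known.sum()
lent_estimated = lent_known * scale

# Lent minus borrowed; a country missing on one side counts as 0 there
difference = lent_estimated.sub(borrowed, fill_value=0.0)
print(difference)
```

Here 250 of the 400 lent has a known country, so the known amounts are multiplied by 1.6 before the borrowed amounts are subtracted.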


`10.` Which country has the highest ratio between the difference computed at the previous point and the population?


`11.` Which country has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line?


`12.` For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year. For example, a loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 / (31+365+30) = 352.11 for 2018.
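
The arithmetic of the example can be checked with a short sketch. The helper `split_amount_by_year` is a hypothetical name introduced here, and it assumes, as the example does, that both the disburse day and the planned expiration day are counted.

```python
from datetime import date

def split_amount_by_year(start, end, amount):
    """Distribute `amount` over calendar years in proportion to the days in each year."""
    total_days = (end - start).days + 1  # both end points are counted
    shares = {}
    for year in range(start.year, end.year + 1):
        first = max(start, date(year, 1, 1))   # first covered day in this year
        last = min(end, date(year, 12, 31))    # last covered day in this year
        days = (last - first).days + 1
        shares[year] = amount * days / total_days
    return shares

shares = split_amount_by_year(date(2016, 12, 1), date(2018, 1, 30), 5000)
for year, value in shares.items():
    print(year, round(value, 2))
```

For the loan in the example this gives 31, 365 and 30 days out of 426. Summing the shares of all loans per year then gives the yearly totals.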