# FoCS Project - Solution

*You have to work on the [Kiva](https://drive.google.com/file/d/1-tJtnIbo1Rt-F1XfoWGVkmBXiI-ciuRx/view) dataset. Some information about the dataset is available on the [Kaggle](https://www.kaggle.com/) web page.*

## Task 1

*Normalize the `loan_lenders` table. In the normalized table, each row must have one `loan_id` and one lender.*


The first step is to read the `loans_lenders.csv` file and create a dataframe with the unnormalized data. Since the file is very large, I introduced a boolean variable `FINAL`: if it is `False`, only a part of the data is processed in order to speed up the work. The libraries `pandas` and `numpy` are used for data processing, while the running time is measured with the `time` library.

In [1]:
import pandas as pd
import numpy as np
import time as tm
FINAL = False
start_time = tm.time()
loans_lenders_un = pd.read_csv('./data/loans_lenders.csv')
if not FINAL:
    loans_lenders_un = loans_lenders_un.head(10000) 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_lenders_un

--- 8.029874801635742 seconds ---


Unnamed: 0,loan_id,lenders
0,483693,"muc888, sam4326, camaran3922, lachheb1865, reb..."
1,483738,"muc888, nora3555, williammanashi, barbara5610,..."
2,485000,"muc888, terrystl, richardandsusan8352, sherri4..."
3,486087,"muc888, james5068, rudi5955, daniel9859, don92..."
4,534428,"muc888, niki3008, teresa9174, mike4896, david7..."
...,...,...
9995,45940,"helga4707, james6963, jimjams, andreas2382, si..."
9996,247491,"priyaram, christian9832, john9242, sandra1434,..."
9997,345274,"priyaram, nicola1093, bobby9744, simon7848, di..."
9998,125945,"joseph1859, matt5349, reese3555, stanley3312, ..."
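When only a sample is needed during development, `pd.read_csv` can stop early through its `nrows` parameter, so the rest of the file is never parsed. The following is a minimal sketch, not the approach used in this notebook: it reads from an in-memory CSV whose content is made up for illustration, and `SAMPLE_ROWS` is a hypothetical name.

```python
import io
import pandas as pd

FINAL = False
SAMPLE_ROWS = 2  # hypothetical sample size; the notebook keeps 10000 rows

# Toy stand-in for ./data/loans_lenders.csv; the rows are illustrative only.
csv_text = 'loan_id,lenders\n1,"a, b"\n2,"c"\n3,"a, d, e"\n'

# nrows=None reads the whole file; an integer stops after that many data rows.
sample = pd.read_csv(io.StringIO(csv_text), nrows=None if FINAL else SAMPLE_ROWS)
print(len(sample))
```

With a real path in place of the `StringIO` object, the same call should avoid most of the reading time when `FINAL` is `False`.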


In [2]:
start_time = tm.time()
pairs_loan_lander = []
for index, row in loans_lenders_un.iterrows(): 
    lenders = row['lenders'].split(',')
    for l in lenders:
        pairs_loan_lander.append({ 'loan_id' : row['loan_id'], 'lender': l.strip() })
loans_lenders = pd.DataFrame(pairs_loan_lander) 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_lenders 

--- 3.3072316646575928 seconds ---


Unnamed: 0,loan_id,lender
0,483693,muc888
1,483693,sam4326
2,483693,camaran3922
3,483693,lachheb1865
4,483693,rebecca3499
...,...,...
245000,225434,wongacom3393
245001,225434,marleneanddel8151
245002,225434,joanne4956
245003,225434,juddie7070
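As an alternative to the explicit loop, the same normalization can probably be expressed with `Series.str.split` and `DataFrame.explode`. This is a sketch on toy data (the names and ids below are invented), assuming the lenders are always separated by commas as in the file shown above.

```python
import pandas as pd

# Toy data in the same shape as loans_lenders_un (values are illustrative).
toy = pd.DataFrame({
    "loan_id": [1, 2],
    "lenders": ["anna, bob, carl", "bob"],
})

normalized = (
    toy.assign(lender=toy["lenders"].str.split(","))  # string -> list of names
       .explode("lender")                             # one row per list element
       .drop(columns="lenders")
)
normalized["lender"] = normalized["lender"].str.strip()  # remove padding spaces
normalized = normalized.reset_index(drop=True)
print(normalized)
```

Because the work is done column-wise rather than row by row, this variant tends to scale better to the full file, although the actual gain has not been measured here.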


## Task 2

*For each loan, add a column `duration` corresponding to the number of days between the disburse time and the planned expiration time. If any of those two dates is missing, also the duration must be missing.*

Firstly, I'll read the `loans.csv` file.


In [3]:
start_time = tm.time()
loans = pd.read_csv('./data/loans.csv')
if not FINAL:
    loans = loans.head(20000)
time = tm.time() -start_time
print("--- %s seconds ---" % time) 
loans

--- 232.84460973739624 seconds ---


Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,raised_time,lender_term,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,2014-01-15 04:48:22.000 +0000,7.0,3,2,1,,female,true,irregular,field_partner
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,2014-02-25 06:42:06.000 +0000,8.0,11,2,1,,female,true,monthly,field_partner
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,2014-01-24 23:06:18.000 +0000,14.0,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,true,monthly,field_partner
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,2014-01-22 05:29:28.000 +0000,14.0,21,2,1,user_favorite,female,true,monthly,field_partner
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,2014-01-14 17:29:27.000 +0000,7.0,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,true,bullet,field_partner
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,341198,Guillermo Antonio,Spanish,"El Señor Guillermo tiene 42 años, la ubicación...",Guillermo is 42 years old. His business is lo...,375.0,375.0,funded,Shoe Sales,Retail,...,2011-10-13 00:00:41.000 +0000,14.0,10,1,1,,male,true,monthly,field_partner
19996,341771,ERLINDA,English,Erlinda is from the village of Sillawit. She i...,,175.0,175.0,funded,General Store,Retail,...,2011-10-14 06:02:23.000 +0000,8.0,3,2,1,,female,true,irregular,field_partner
19997,344237,Ma Guadalupe,Spanish,La señora Ma Guadalupe es madre de 2 hijos de ...,\r\nMaría Guadalupe is the mother of two child...,625.0,625.0,funded,Property,Housing,...,2011-10-27 19:23:44.000 +0000,14.0,23,1,1,,female,true,monthly,field_partner
19998,344735,Blanca Lidia,Spanish,"Blanca, vive con su esposo, tiene tres hijos d...",Blanca lives with her husband and has three ch...,350.0,350.0,funded,Food Production/Sales,Food,...,2011-10-12 02:10:27.000 +0000,8.0,10,4,2,,female,true,monthly,field_partner


In order to see the exact titles of the columns and their values, I'll do as follows:

In [4]:
loans.columns

Index(['loan_id', 'loan_name', 'original_language', 'description',
       'description_translated', 'funded_amount', 'loan_amount', 'status',
       'activity_name', 'sector_name', 'loan_use', 'country_code',
       'country_name', 'town_name', 'currency_policy',
       'currency_exchange_coverage_rate', 'currency', 'partner_id',
       'posted_time', 'planned_expiration_time', 'disburse_time',
       'raised_time', 'lender_term', 'num_lenders_total',
       'num_journal_entries', 'num_bulk_entries', 'tags', 'borrower_genders',
       'borrower_pictured', 'repayment_interval', 'distribution_model'],
      dtype='object')

In [5]:
loans['planned_expiration_time']


0        2014-02-14 03:30:06.000 +0000
1        2014-03-26 22:25:07.000 +0000
2        2014-02-15 21:10:05.000 +0000
3        2014-02-21 03:10:02.000 +0000
4        2014-02-13 06:10:02.000 +0000
                     ...              
19995                              NaN
19996                              NaN
19997                              NaN
19998                              NaN
19999                              NaN
Name: planned_expiration_time, Length: 20000, dtype: object

In [6]:
loans['disburse_time']

0        2013-12-22 08:00:00.000 +0000
1        2013-12-20 08:00:00.000 +0000
2        2014-01-09 08:00:00.000 +0000
3        2014-01-17 08:00:00.000 +0000
4        2013-12-17 08:00:00.000 +0000
                     ...              
19995    2011-09-13 07:00:00.000 +0000
19996    2011-09-16 07:00:00.000 +0000
19997    2011-10-07 07:00:00.000 +0000
19998    2011-09-16 07:00:00.000 +0000
19999    2011-09-23 07:00:00.000 +0000
Name: disburse_time, Length: 20000, dtype: object

A `duration` column is created and filled with `None`. 
I then go through every row of the dataframe `loans` and pick the values of `planned_expiration_time` and `disburse_time`. 
To check whether any value is missing, I use the `pd.notna` function. 
Since both columns store the time as `str`, the `strptime` method of `datetime` is used to convert the strings into date-time objects. Subsequently, the value of `disburse_time` is subtracted from the value of `planned_expiration_time`, and the `days` attribute of the resulting `timedelta` gives the difference in whole days. 
Finally, a restricted view of the resulting dataframe is displayed. 

In [7]:
from datetime import datetime as dt
start_time = tm.time()
loans['duration'] = None
for index, row in loans.iterrows(): 
    s2 = row['planned_expiration_time']
    s1 = row['disburse_time']
    if( pd.notna(s1) and pd.notna(s2) and s1 != '' and s2 != ''):
        d2 = dt.strptime(s2, "%Y-%m-%d %H:%M:%S.%f %z")
        d1 = dt.strptime(s1, "%Y-%m-%d %H:%M:%S.%f %z")
        loans.loc[index,'duration'] = (d2 - d1).days 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_restricted = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time', 'duration'] ]
loans_restricted

--- 18.299906015396118 seconds ---


Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time,duration
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,53
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,96
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,37
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000,34
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000,57
...,...,...,...,...,...
19995,341198,Guillermo Antonio,,2011-09-13 07:00:00.000 +0000,
19996,341771,ERLINDA,,2011-09-16 07:00:00.000 +0000,
19997,344237,Ma Guadalupe,,2011-10-07 07:00:00.000 +0000,
19998,344735,Blanca Lidia,,2011-09-16 07:00:00.000 +0000,


However, a much more efficient way of solving this task is to create an auxiliary function `daysbetween`, which converts its input parameters from strings into date-time objects and calculates the difference in days. The function is then applied to each row of the `loans` dataframe. On the 20000-row sample this takes about 4 seconds instead of about 18.

In [8]:
from datetime import datetime as dt

def daysbetween(s1,s2):
    if( pd.notna(s1) and pd.notna(s2) and s1 != '' and s2 != ''):
        d2 = dt.strptime(s2, "%Y-%m-%d %H:%M:%S.%f %z")
        d1 = dt.strptime(s1, "%Y-%m-%d %H:%M:%S.%f %z")
        return (d2 - d1).days 
    else:
        return None

start_time = tm.time()
loans['duration'] = loans.apply(lambda x: daysbetween(x['disburse_time'],x['planned_expiration_time']), axis=1)
print("--- %s seconds ---" % (tm.time()-start_time)) 
loans_restricted = loans[ ['loan_id', 'loan_name','planned_expiration_time', 'disburse_time', 'duration'] ]
loans_restricted

--- 4.037983179092407 seconds ---


Unnamed: 0,loan_id,loan_name,planned_expiration_time,disburse_time,duration
0,657307,Aivy,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,53.0
1,657259,Idalia Marizza,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,96.0
2,658010,Aasia,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,37.0
3,659347,Gulmira,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000,34.0
4,656933,Ricky\t,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000,57.0
...,...,...,...,...,...
19995,341198,Guillermo Antonio,,2011-09-13 07:00:00.000 +0000,
19996,341771,ERLINDA,,2011-09-16 07:00:00.000 +0000,
19997,344237,Ma Guadalupe,,2011-10-07 07:00:00.000 +0000,
19998,344735,Blanca Lidia,,2011-09-16 07:00:00.000 +0000,
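A further option, sketched here under the assumption that every timestamp follows the format shown above, is to let `pd.to_datetime` parse the whole column at once and subtract the two resulting columns. The toy frame below is illustrative; only the first pair of timestamps is copied from the output above.

```python
import pandas as pd

# Toy rows; the second loan has no planned expiration time.
toy = pd.DataFrame({
    "disburse_time": ["2013-12-22 08:00:00.000 +0000", "2011-09-13 07:00:00.000 +0000"],
    "planned_expiration_time": ["2014-02-14 03:30:06.000 +0000", None],
})

fmt = "%Y-%m-%d %H:%M:%S.%f %z"
disbursed = pd.to_datetime(toy["disburse_time"], format=fmt, utc=True)
expires = pd.to_datetime(toy["planned_expiration_time"], format=fmt, utc=True)

# Missing dates become NaT, so the difference (and its day count) is missing too.
toy["duration"] = (expires - disbursed).dt.days
print(toy["duration"].tolist())
```

The missing-value rule of the task is satisfied without an explicit check, because arithmetic with `NaT` propagates the missing value.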


## Task 3

*Find the lenders that have funded at least twice.*

For this analysis I use the dataframe `loans_lenders` resulting from the first task. Lenders that correspond to two or more `loan_id` values have funded at least twice. 

In [9]:
start_time = tm.time()
groups = loans_lenders.groupby("lender").lender.count()
groups
lenders_2more = []
for x in groups.items():  # Series.iteritems() was removed in pandas 2.0
    if x[1]  >= 2:
        lenders_2more.append(x)
print("--- %s seconds ---" % (tm.time()-start_time)) 
lenders_2more

--- 3.07442569732666 seconds ---


[('0li', 2),
 ('100ofhumanity1199', 38),
 ('106xin', 2),
 ('10yearitch', 4),
 ('11220', 2),
 ('11familymembers', 3),
 ('123321', 75),
 ('1396', 4),
 ('1880company', 3),
 ('1961', 4),
 ('1968', 2),
 ('1987RDD', 2),
 ('1ghostorchid', 2),
 ('1nottus1', 3),
 ('2000world', 2),
 ('2007', 5),
 ('2008', 2),
 ('20226206', 3),
 ('25894', 3),
 ('25perday', 5),
 ('2checkout', 2),
 ('2composers', 2),
 ('2ndcenterolson7814', 2),
 ('2ndskiesforex', 10),
 ('369', 2),
 ('3Katz', 4),
 ('3beditions', 62),
 ('3ric', 3),
 ('404', 2),
 ('40NewHappy', 3),
 ('45cakes6304', 3),
 ('4729', 3),
 ('4feet2mouths', 6),
 ('4k6089', 2),
 ('70722017', 4),
 ('746416431', 2),
 ('7469', 3),
 ('7thgradeclass5175', 2),
 ('8024693', 2),
 ('82abhilash', 4),
 ('9112', 2),
 ('94704', 70),
 ('9678', 2),
 ('9999', 2),
 ('ACHAIAH', 3),
 ('AMfamilyfund', 4),
 ('AND', 3),
 ('ASVyas512', 2),
 ('AaronRoth', 5),
 ('AdamCohn', 2),
 ('Adjwilley', 8),
 ('Adrion', 3),
 ('AgniPakshi', 2),
 ('AlanMimms', 3),
 ('AmnestyInternationalMHS', 3),
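The selection can also be written without a Python loop by applying a boolean mask to the per-lender counts. A minimal sketch on invented data, assuming a normalized table with one row per (`loan_id`, `lender`) pair:

```python
import pandas as pd

# Toy normalized table (illustrative values).
toy = pd.DataFrame({
    "loan_id": [1, 1, 2, 3],
    "lender": ["anna", "bob", "bob", "bob"],
})

counts = toy.groupby("lender").size()   # number of rows (loans) per lender
at_least_twice = counts[counts >= 2]    # the mask keeps only repeat lenders
print(at_least_twice)
```

If a lender could appear more than once for the same loan, counting distinct loans with `toy.groupby("lender")["loan_id"].nunique()` would be the safer choice.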


## Task 4

*For each country, compute how many loans have involved that country as borrowers.*

I group the dataframe `loans` by `country_name` and count the rows (loans) in each group.

In [10]:
start_time = tm.time()
countries = loans.groupby("country_name").country_name.count()
print("--- %s seconds ---" % (tm.time()-start_time)) 
countries

--- 0.0050182342529296875 seconds ---


country_name
Afghanistan       12
Albania           50
Armenia          212
Azerbaijan       109
Belize             4
                ... 
United States     55
Vietnam          309
Yemen             88
Zambia            18
Zimbabwe          91
Name: country_name, Length: 78, dtype: int64

## Task 5

*For each country, compute the overall amount of money borrowed.*

Again, the dataframe `loans` is processed. After grouping by `country_name`, the `loan_amount` is summed.

In [11]:
start_time = tm.time()
amount_per_country = loans.groupby("country_name").loan_amount.sum()
print("--- %s seconds ---" % (tm.time()-start_time)) 
amount_per_country 

--- 1.3081724643707275 seconds ---


country_name
Afghanistan       11625.0
Albania           69800.0
Armenia          371075.0
Azerbaijan       205675.0
Belize             4850.0
                   ...   
United States    294625.0
Vietnam          377700.0
Yemen             92100.0
Zambia            32650.0
Zimbabwe          92700.0
Name: loan_amount, Length: 78, dtype: float64

## Task 6

*Like the previous point, but expressed as a percentage of the overall amount lent* 

First the total amount is calculated. A helper function `calculate_percent` is created in order to calculate the percentage `loan_amount/total_amount*100`, and it is then applied to each row of the series `amount_per_country`.

In [12]:
start_time = tm.time()
total_amount = loans.loan_amount.sum()

def calculate_percent(v):
    return (v / total_amount) * 100

percentage_per_country= amount_per_country.apply(lambda x: calculate_percent(x))
print("--- %s seconds ---" % (tm.time()-start_time)) 
percentage_per_country

--- 1.143559217453003 seconds ---


country_name
Afghanistan      0.069063
Albania          0.414677
Armenia          2.204531
Azerbaijan       1.221901
Belize           0.028814
                   ...   
United States    1.750347
Vietnam          2.243890
Yemen            0.547160
Zambia           0.193971
Zimbabwe         0.550724
Name: loan_amount, Length: 78, dtype: float64
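Since pandas arithmetic is element-wise, the percentage can also be obtained by dividing the whole series by its total, with no helper function. The sketch below uses a toy series with invented values. Note one assumption: it divides by the sum of the per-country amounts, which equals `loans.loan_amount.sum()` only if no loan lacks a `country_name`.

```python
import pandas as pd

# Toy per-country amounts (illustrative values).
toy_amounts = pd.Series({"A": 100.0, "B": 300.0}, name="loan_amount")

# Element-wise division by a scalar, then scaling to a percentage.
percentage = toy_amounts / toy_amounts.sum() * 100
print(percentage)
```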

## Task 7

*Like the three previous points, but split for each year (with respect to disburse time).*

First, an auxiliary function `year2` is defined to extract the year from a date string. This function is then applied to each value of the `disburse_time` column of `loans` to create the new column `disburse_year`.

In [13]:
start_time = tm.time()
from datetime import datetime as dt
def year2 (s):
    if( pd.notna(s) and s != ''):
        d = dt.strptime(s, "%Y-%m-%d %H:%M:%S.%f %z")
        return d.year
    return None
    
loans["disburse_year"] = loans.disburse_time.map(lambda x: year2(x))
loans



Unnamed: 0,loan_id,loan_name,original_language,description,description_translated,funded_amount,loan_amount,status,activity_name,sector_name,...,num_lenders_total,num_journal_entries,num_bulk_entries,tags,borrower_genders,borrower_pictured,repayment_interval,distribution_model,duration,disburse_year
0,657307,Aivy,English,"Aivy, 21 years of age, is single and lives in ...",,125.0,125.0,funded,General Store,Retail,...,3,2,1,,female,true,irregular,field_partner,53.0,2013
1,657259,Idalia Marizza,Spanish,"Doña Idalia, esta casada, tiene 57 años de eda...","Idalia, 57, is married and lives with her husb...",400.0,400.0,funded,Used Clothing,Clothing,...,11,2,1,,female,true,monthly,field_partner,96.0,2013
2,658010,Aasia,English,Aasia is a 45-year-old married lady and she ha...,,400.0,400.0,funded,General Store,Retail,...,16,2,1,"#Woman Owned Biz, #Supporting Family, user_fav...",female,true,monthly,field_partner,37.0,2014
3,659347,Gulmira,Russian,"Гулмире 36 лет, замужем, вместе с супругом вос...",Gulmira is 36 years old and married. She and ...,625.0,625.0,funded,Farming,Agriculture,...,21,2,1,user_favorite,female,true,monthly,field_partner,34.0,2014
4,656933,Ricky\t,English,Ricky is a farmer who currently cultivates his...,,425.0,425.0,funded,Farming,Agriculture,...,15,2,1,"#Animals, #Eco-friendly, #Sustainable Ag",male,true,bullet,field_partner,57.0,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,341198,Guillermo Antonio,Spanish,"El Señor Guillermo tiene 42 años, la ubicación...",Guillermo is 42 years old. His business is lo...,375.0,375.0,funded,Shoe Sales,Retail,...,10,1,1,,male,true,monthly,field_partner,,2011
19996,341771,ERLINDA,English,Erlinda is from the village of Sillawit. She i...,,175.0,175.0,funded,General Store,Retail,...,3,2,1,,female,true,irregular,field_partner,,2011
19997,344237,Ma Guadalupe,Spanish,La señora Ma Guadalupe es madre de 2 hijos de ...,\r\nMaría Guadalupe is the mother of two child...,625.0,625.0,funded,Property,Housing,...,23,1,1,,female,true,monthly,field_partner,,2011
19998,344735,Blanca Lidia,Spanish,"Blanca, vive con su esposo, tiene tres hijos d...",Blanca lives with her husband and has three ch...,350.0,350.0,funded,Food Production/Sales,Food,...,10,4,2,,female,true,monthly,field_partner,,2011


### Task 7.1

*For each country, compute how many loans have involved that country as borrowers per year.*

In [14]:
start_time = tm.time()
countries_per_year = loans.groupby(["country_name", "disburse_year"]).country_name.count()
countries_per_year
print("--- %s seconds ---" % (tm.time()-start_time)) 
countries_per_year

--- 1.4596352577209473 seconds ---


country_name  disburse_year
Afghanistan   2009             11
              2010              1
Albania       2012              5
              2013             15
              2014             16
                               ..
Zimbabwe      2013              9
              2014             47
              2015              9
              2016              7
              2017             12
Name: country_name, Length: 519, dtype: int64

### Task 7.2

*For each country, compute the overall amount of money borrowed per year.*

In [15]:
amount_per_country_per_year = loans.groupby(["country_name","disburse_year"]).loan_amount.sum()
amount_per_country_per_year                                    


country_name  disburse_year
Afghanistan   2009             10450.0
              2010              1175.0
Albania       2012              4250.0
              2013             19150.0
              2014             27225.0
                                ...   
Zimbabwe      2013             16450.0
              2014             31200.0
              2015              9150.0
              2016              8950.0
              2017             13300.0
Name: loan_amount, Length: 519, dtype: float64

### Task 7.3

*Like the previous point, but expressed as a percentage of the overall amount lent per year.* 

Firstly, `total_amount_per_year` is calculated.

In [16]:
total_amount_per_year = loans.groupby("disburse_year").loan_amount.sum()
total_amount_per_year

disburse_year
2007        200.0
2009     914050.0
2010     522125.0
2011     816025.0
2012    2115475.0
2013    4052325.0
2014    3427300.0
2015    1430500.0
2016    1318650.0
2017    2224975.0
2018      10750.0
Name: loan_amount, dtype: float64

An auxiliary function `find_total_per_year` is created in order to enable accessing the total amounts for each year, needed later to calculate the percentage. 
We iterate through `amount_per_country_per_year`, extract the loan amount for each country and year, and calculate the percentage. The result is a list of dictionaries, which is then converted into a dataframe. 

In [17]:
def find_total_per_year(y):
    for x in total_amount_per_year.items():
        if x[0] == y:
            return x[1]
    return 0

result = []
start_time = tm.time()
for x in amount_per_country_per_year.items():
    v = x[1]
    y = x[0][1]
    c = x[0][0]
    tot = find_total_per_year(y)
    perc = (v / tot) * 100
    result.append({"country_name":c, "disburse_year":y, "percent":perc })
result = pd.DataFrame(result)
print("--- %s seconds ---" % (tm.time()-start_time)) 
result

--- 0.5856130123138428 seconds ---


Unnamed: 0,country_name,disburse_year,percent
0,Afghanistan,2009,1.143263
1,Afghanistan,2010,0.225042
2,Albania,2012,0.200901
3,Albania,2013,0.472568
4,Albania,2014,0.794357
...,...,...,...
514,Zimbabwe,2013,0.405940
515,Zimbabwe,2014,0.910338
516,Zimbabwe,2015,0.639636
517,Zimbabwe,2016,0.678724
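The lookup of the yearly total can also be delegated to pandas: grouping the per-country series by its `disburse_year` level and calling `transform("sum")` yields the matching yearly total for every entry. This is a sketch on invented data, not the computation used above.

```python
import pandas as pd

# Toy loans (illustrative values).
toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "B"],
    "disburse_year": [2016, 2017, 2016, 2017],
    "loan_amount": [100.0, 50.0, 300.0, 150.0],
})

per_country_year = toy.groupby(["country_name", "disburse_year"])["loan_amount"].sum()
# transform("sum") returns the yearly total aligned with every (country, year) entry.
yearly_total = per_country_year.groupby(level="disburse_year").transform("sum")
percent = (per_country_year / yearly_total * 100).rename("percent").reset_index()
print(percent)
```

The resulting frame has the same three columns as `result` above (`country_name`, `disburse_year`, `percent`).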


## Task 8

*For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must assume that all lenders contributed the same amount*

Two dataframes, `loans` and `loans_lenders` are merged.
The resulting dataframe `number_of_lenders_by_loan_id` has the number of lenders per each loan.
The amount per lender is calculated assuming that each lender contributed the same share of the loan amount. 
Finally, in order to obtain the total sum per lender, the dataframe is grouped by lender and the amounts per lender are summed.

In [18]:
start_time = tm.time()
merged = pd.merge(left=loans_lenders,right=loans, left_on="loan_id", right_on="loan_id")
merged = merged[["loan_id","lender", "loan_amount"]]
number_of_lenders_by_loan_id = merged.groupby(["loan_id"], as_index = False)[["lender"]].count()
number_of_lenders_by_loan_id
merged2 = pd.merge(left=merged,right=number_of_lenders_by_loan_id, left_on="loan_id", right_on="loan_id")
merged2["amount_per_lender"] = merged2.loan_amount / merged2.lender_y
merged2
lent_by_lender = merged2.groupby(["lender_x"], as_index = False)[["amount_per_lender"]].sum()
print("--- %s seconds ---" % (tm.time()-start_time)) 
lent_by_lender

--- 20.798378229141235 seconds ---


Unnamed: 0,lender_x,amount_per_lender
0,100ofhumanity1199,64.583333
1,10yearitch,33.876812
2,123321,153.990193
3,1ghostorchid,31.097561
4,1nottus1,43.750000
...,...,...
19663,zubair1627,41.233766
19664,zubin,104.088235
19665,zuzu,25.714286
19666,zxdr,32.500000


## Task 9

*For each country, compute the difference between the overall amount of money lent and the overall amount of money borrowed. Since the country of the lender is often unknown, you can assume that the true distribution among the countries is the same as the one computed from the rows where the country is known.*

The file `lenders.csv` contains data related to lenders. `lender_states` is a newly created dataframe containing `permanent_name` and `country_code`.

In [19]:
start_time = tm.time()
lenders = pd.read_csv("./data/lenders.csv")
if not FINAL:
    lenders = lenders.head(20000) 
lender_states = lenders[["permanent_name","country_code"]]
print("--- %s seconds ---" % (tm.time()-start_time)) 
lender_states

--- 46.39687728881836 seconds ---


Unnamed: 0,permanent_name,country_code
0,qian3013,
1,reena6733,
2,mai5982,
3,andrew86079135,
4,nguyen6962,
...,...,...
19995,mohamed4338,
19996,vicrennaisance8567,
19997,mariz6787,
19998,julian2868,


The dataframe `lenders_in_states` contains data related to lenders whose country is known. 

In [20]:
lenders_in_states = lender_states.loc[lender_states['country_code'].notnull()]
lenders_in_states

Unnamed: 0,permanent_name,country_code
16,naresh2074,US
31,christina27976796,US
37,vikas1098,IN
39,qian1385,US
42,xigg8769,US
...,...,...
19979,manrike5051,LA
19981,janell3482,US
19984,sheryl2462,US
19989,amy5291,US


The dataframe `lenders_without_states` contains data related to lenders whose country is unknown or the data is missing. 

In [21]:
lenders_without_states = lender_states.loc[lender_states["country_code"].isnull()]
lenders_without_states

Unnamed: 0,permanent_name,country_code
0,qian3013,
1,reena6733,
2,mai5982,
3,andrew86079135,
4,nguyen6962,
...,...,...
19995,mohamed4338,
19996,vicrennaisance8567,
19997,mariz6787,
19998,julian2868,


To calculate each country's share of the total amount lent, the lenders with a known country are joined with the per-lender amounts and aggregated by country:

In [22]:
start_time = tm.time()
lent_and_state = pd.merge(left=lent_by_lender, right=lenders_in_states, left_on='lender_x', right_on='permanent_name')
lent_total = lent_and_state['amount_per_lender'].sum()
lent_by_state = lent_and_state.groupby(['country_code'], as_index = False\
                                      ).agg({'amount_per_lender':['count', 'sum', \
                                                               lambda x: x.sum()/lent_total]})
lent_by_state.columns = ['country_code', 'lent_count', 'lent_sum', 'lent_factor']
print("--- %s seconds ---" % (tm.time()-start_time)) 
lent_by_state

--- 1.1097071170806885 seconds ---


Unnamed: 0,country_code,lent_count,lent_sum,lent_factor
0,AU,1,33.333333,0.431034
1,US,1,44.0,0.568966


Next, the total amount lent by the lenders whose country is unknown or unavailable is calculated.

In [23]:
lent_no_state = pd.merge(left=lent_by_lender, right=lenders_without_states, \
                         left_on='lender_x', right_on='permanent_name')
lent_no_state_total = lent_no_state['amount_per_lender'].sum()
lent_no_state_total

305.0

This amount is distributed among the countries according to the previously calculated factors and added to the `lent_by_state` dataframe, as requested in the task.

In [24]:
lent_by_state['additional_lent'] = lent_by_state.lent_factor * lent_no_state_total
lent_by_state

Unnamed: 0,country_code,lent_count,lent_sum,lent_factor,additional_lent
0,AU,1,33.333333,0.431034,131.465517
1,US,1,44.0,0.568966,173.534483


`loan_and_state` is the dataframe obtained by reducing the `loans` dataframe to the following three columns:

In [25]:
loan_and_state = loans[['loan_id','country_code','loan_amount']]
loan_and_state

Unnamed: 0,loan_id,country_code,loan_amount
0,657307,PH,125.0
1,657259,HN,400.0
2,658010,PK,400.0
3,659347,KG,625.0
4,656933,PH,425.0
...,...,...,...
19995,341198,NI,375.0
19996,341771,PH,175.0
19997,344237,MX,625.0
19998,344735,SV,350.0


The loan amount per country is calculated.

In [26]:
start_time = tm.time()
loan_by_state = loan_and_state.groupby(['country_code'], as_index = False).loan_amount.sum()
loan_by_state.columns = ['country_code','loan_amount_sum']
print("--- %s seconds ---" % (tm.time()-start_time)) 
loan_by_state

--- 0.003984212875366211 seconds ---


Unnamed: 0,country_code,loan_amount_sum
0,AF,11625.0
1,AL,69800.0
2,AM,371075.0
3,AZ,205675.0
4,BA,575.0
...,...,...
73,XK,74950.0
74,YE,92100.0
75,ZA,29000.0
76,ZM,32650.0


The two dataframes with the borrowed and lent amounts per country are joined (a left join, since there are many more loans), the difference (borrowed minus lent) is calculated and, finally, the entry with the maximum difference is displayed.

In [27]:
start_time = tm.time()
loan_and_lent = pd.merge(loan_by_state, lent_by_state, how='left', on=['country_code'])
loan_and_lent['difference'] = loan_and_lent.loan_amount_sum - loan_and_lent.lent_sum - loan_and_lent.additional_lent 
print("--- %s seconds ---" % (tm.time()-start_time)) 
loan_and_lent[loan_and_lent.difference == loan_and_lent.difference.max()]


--- 0.16536235809326172 seconds ---


Unnamed: 0,country_code,loan_amount_sum,lent_count,lent_sum,lent_factor,additional_lent,difference
70,US,294625.0,1.0,44.0,0.568966,173.534483,294407.465517


## Task 10

*Which country has the highest ratio between the difference computed at the previous point and the population?*

The population data are contained in the file `country_stats.csv`.

In [28]:
start_time = tm.time()
country_stat = pd.read_csv('./data/country_stats.csv')
country_stat = country_stat[['country_code', 'country_name', 'population', 'population_below_poverty_line']]
country_stat = country_stat.dropna(subset=['population_below_poverty_line'])
print("--- %s seconds ---" % (tm.time()-start_time)) 
country_stat

--- 0.5335147380828857 seconds ---


Unnamed: 0,country_code,country_name,population,population_below_poverty_line
0,IN,India,1339180127,21.9
1,NG,Nigeria,190886311,70.0
2,MX,Mexico,129163276,46.2
3,PK,Pakistan,197015955,29.5
4,BD,Bangladesh,164669751,31.5
...,...,...,...,...
147,MT,Malta,430835,16.3
148,MV,Maldives,436330,16.0
149,ME,Montenegro,628960,8.6
150,TM,Turkmenistan,5758075,0.2


A new dataframe `country_stat_loan_and_lent` is created by joining the `country_stat` and `loan_and_lent` dataframes on the `country_code` column. `ratio_1` is calculated by dividing the previously obtained difference by the population. Subsequently, the country with the maximum ratio is selected. 

In [29]:
country_stat_loan_and_lent = pd.merge(left=country_stat, right=loan_and_lent, \
                                      left_on='country_code', right_on='country_code')
country_stat_loan_and_lent = country_stat_loan_and_lent[['country_code', 'country_name', 'difference', \
                                                         'population', 'population_below_poverty_line']]
country_stat_loan_and_lent['ratio_1'] = country_stat_loan_and_lent.difference/country_stat_loan_and_lent.population
print("--- %s seconds ---" % (tm.time()-start_time)) 
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_1 == country_stat_loan_and_lent.ratio_1.max()]

--- 0.6650853157043457 seconds ---


Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_1
5,US,United States,294407.465517,324459463,15.1,0.000907


## Task 11

*Which country has the highest ratio between the difference computed at point 9 and the population that is not below the poverty line?*

Firstly, the population that is not below the poverty line is calculated using the percentage of population below poverty line and the population. 
`ratio_2` is calculated by dividing the difference by the population that is not below the poverty line, and the country with the highest ratio is selected.

In [31]:
start_time = tm.time()
#country_stat_loan_and_lent = pd.merge(left=country_stat, right=loan_and_lent, \
                                      #left_on='country_code', right_on='country_code')
#country_stat_loan_and_lent = country_stat_loan_and_lent[['country_code', 'country_name', 'difference', \
                                                         #'population', 'population_below_poverty_line']]
country_stat_loan_and_lent['ratio_2'] = country_stat_loan_and_lent.difference\
/(country_stat_loan_and_lent.population * ((100-country_stat_loan_and_lent.population_below_poverty_line)/100) )
print("--- %s seconds ---" % (tm.time()-start_time)) 
country_stat_loan_and_lent[country_stat_loan_and_lent.ratio_2 == country_stat_loan_and_lent.ratio_2.max()]

--- 0.0009984970092773438 seconds ---


Unnamed: 0,country_code,country_name,difference,population,population_below_poverty_line,ratio_2
5,US,United States,294407.465517,324459463,15.1,0.001069
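The same ratio and selection can be sketched on a toy frame with `Series.idxmax`, which returns the index label of the largest value. All figures below are invented, and the country codes `XA` and `XB` are placeholders.

```python
import pandas as pd

# Toy country statistics (illustrative values only).
toy = pd.DataFrame({
    "country_code": ["XA", "XB"],
    "difference": [1000.0, 500.0],
    "population": [1_000_000, 100_000],
    "population_below_poverty_line": [20.0, 50.0],  # percent
})

# People not below the poverty line = population * (100 - percent) / 100.
not_poor = toy["population"] * (100 - toy["population_below_poverty_line"]) / 100
toy["ratio_2"] = toy["difference"] / not_poor
top = toy.loc[toy["ratio_2"].idxmax()]
print(top["country_code"])
```

Unlike the comparison with `max()` used above, `idxmax` returns only the first row when several countries share the maximum.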


## Task 12

*For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in different years must have its amount distributed proportionally to the number of days in each year. For example, a loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 / (31+365+30) = 352.11 for 2018.*

Two auxiliary functions are made: 
`days_in_year` calculates the number of days in a given year. 

`total_amount_of_loans` calculates the distribution of the amount of **one loan** over the years, as requested in the task.

In [32]:
start_time = tm.time()

from datetime import datetime as dt, timezone as tz

now = dt.now(tz.utc)

def days_in_year(year):
    # Number of days in the given calendar year (365, or 366 in a leap year).
    d1 = dt.strptime(str(year) + '-01-01', "%Y-%m-%d")
    d2 = dt.strptime(str(year) + '-12-31', "%Y-%m-%d")
    return (d2 - d1).days + 1

def total_amount_of_loans(disb, plan_exp, amount):
    # Split `amount` over the calendar years between disb and plan_exp,
    # proportionally to the number of days that fall in each year.
    # Returns a list of (year, share) tuples.
    if disb.year == plan_exp.year:
        return [(disb.year, amount)]
    # Drop the timezone so the timestamps can be subtracted from naive dates.
    disb = disb.replace(tzinfo=None)
    plan_exp = plan_exp.replace(tzinfo=None)
    total_days = (plan_exp - disb).days + 1
    total_amount = []
    curr = disb
    for y in range(disb.year, plan_exp.year):
        # Days from the current start point to December 31st of year y, inclusive.
        dy = (dt.strptime(str(y) + '-12-31', "%Y-%m-%d") - curr).days + 1
        total_amount.append((y, (dy * amount / total_days) if total_days > 0 else 0))
        # The next year starts counting from January 1st.
        curr = dt.strptime(str(y + 1) + '-01-01', "%Y-%m-%d")
    # Remaining days in the final year, up to the planned expiration.
    dy = (plan_exp - curr).days + 1
    total_amount.append((plan_exp.year, (dy * amount / total_days) if total_days > 0 else 0))
    return total_amount

d_t = dt.strptime( '2016-12-01', "%Y-%m-%d")
p_e_t = dt.strptime( '2018-01-30', "%Y-%m-%d")
total_amount_of_loans(d_t, p_e_t, 5000)

[(2016, 363.84976525821594),
 (2017, 4284.037558685446),
 (2018, 352.11267605633805)]
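
The test call above uses midnight timestamps, but the timestamps in the dataset carry a time of day, and `timedelta.days` drops any partial day. As a result the per-year day counts in `total_amount_of_loans` do not always add up to `total_days`: the two rows for loan 657259 in the output of In [34] sum to about 395.88 rather than 400. One possible variant, sketched here under the assumption that only calendar dates should count, compares `date` objects instead. The name `split_amount_by_year` is hypothetical and is not part of the notebook.

```python
from datetime import date, datetime

def split_amount_by_year(disb, plan_exp, amount):
    """Split `amount` over calendar years, counting whole calendar days only."""
    start, end = disb.date(), plan_exp.date()
    total_days = (end - start).days + 1  # both end points are included
    if total_days <= 0:
        return []  # expiration before disbursal: nothing to distribute
    shares = []
    for year in range(start.year, end.year + 1):
        # Clip the loan period to the current calendar year.
        first = max(start, date(year, 1, 1))
        last = min(end, date(year, 12, 31))
        days = (last - first).days + 1
        shares.append((year, amount * days / total_days))
    return shares

# Example from the task statement.
example = split_amount_by_year(datetime(2016, 12, 1), datetime(2018, 1, 30), 5000)

# Timestamps with a time of day, as for loan 657259 in the dataset.
with_times = split_amount_by_year(
    datetime(2013, 12, 20, 8, 0, 0), datetime(2014, 3, 26, 22, 25, 7), 400
)
print(example)
print(sum(share for _, share in with_times))
```

With this variant the yearly shares of a loan add up to its full amount, up to floating-point rounding, whatever the time of day of the two timestamps.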

A restricted loan dataframe is created containing only the necessary columns.

In [33]:
loans_base = loans[['loan_id','loan_name','loan_amount','planned_expiration_time','disburse_time']]
loans_base

Unnamed: 0,loan_id,loan_name,loan_amount,planned_expiration_time,disburse_time
0,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000
1,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000
2,658010,Aasia,400.0,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000
3,659347,Gulmira,625.0,2014-02-21 03:10:02.000 +0000,2014-01-17 08:00:00.000 +0000
4,656933,Ricky\t,425.0,2014-02-13 06:10:02.000 +0000,2013-12-17 08:00:00.000 +0000
...,...,...,...,...,...
19995,341198,Guillermo Antonio,375.0,,2011-09-13 07:00:00.000 +0000
19996,341771,ERLINDA,175.0,,2011-09-16 07:00:00.000 +0000
19997,344237,Ma Guadalupe,625.0,,2011-10-07 07:00:00.000 +0000
19998,344735,Blanca Lidia,350.0,,2011-09-16 07:00:00.000 +0000


The `loans_base` dataframe is iterated row by row, and `total_amount_of_loans` is applied to each loan to obtain its payment plan over the years. Rows whose dates or amount cannot be parsed are skipped. A new dataframe containing the payment plan per year (column `ammount`) is created.

In [34]:
lis = []
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"
for index, row in loans_base.iterrows():
    # Skip rows whose dates or amount are missing (NaN) or malformed.
    try:
        d2 = dt.strptime(row['planned_expiration_time'], TIME_FORMAT)
        d1 = dt.strptime(row['disburse_time'], TIME_FORMAT)
        amt = float(row['loan_amount'])
    except (ValueError, TypeError):
        continue
    # One output row per (loan, year) pair.
    for year, share in total_amount_of_loans(d1, d2, amt):
        lis.append({ 'loan_id' : row['loan_id'], 'loan_name': row['loan_name'], 'loan_amount': row['loan_amount'], \
                    'planned_expiration_time': row['planned_expiration_time'], \
                    'disburse_time': row['disburse_time'],'year': year, 'ammount': share })
loans_final = pd.DataFrame(lis)
# start_time was set in cell In [32], so this is the time elapsed since that cell started.
print("--- %s seconds ---" % (tm.time()-start_time))
loans_final

--- 638.949399471283 seconds ---


Unnamed: 0,loan_id,loan_name,loan_amount,planned_expiration_time,disburse_time,year,ammount
0,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,2013,20.833333
1,657307,Aivy,125.0,2014-02-14 03:30:06.000 +0000,2013-12-22 08:00:00.000 +0000,2014,104.166667
2,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,2013,45.360825
3,657259,Idalia Marizza,400.0,2014-03-26 22:25:07.000 +0000,2013-12-20 08:00:00.000 +0000,2014,350.515464
4,658010,Aasia,400.0,2014-02-15 21:10:05.000 +0000,2014-01-09 08:00:00.000 +0000,2014,400.000000
...,...,...,...,...,...,...,...
19962,971438,CHHON,1300.0,2015-12-01 02:00:03.000 +0000,2015-10-02 07:00:00.000 +0000,2015,1300.000000
19963,1267448,Khen's Group,100.0,2017-05-01 18:50:04.000 +0000,2017-03-21 07:00:00.000 +0000,2017,100.000000
19964,1268098,Hieu,900.0,2017-05-17 19:50:05.000 +0000,2017-03-22 07:00:00.000 +0000,2017,900.000000
19965,1268162,Hussain,2000.0,2017-05-02 22:20:02.000 +0000,2017-03-21 07:00:00.000 +0000,2017,2000.000000
