# Retrieving data from the Kiva API

We will be using [Kiva](https://www.kiva.org/) data for the entirety of this course. Kiva makes its data publicly available through its [API](http://build.kiva.org/).

Not sure what an API is? It stands for Application Programming Interface. Codecademy has a fantastic short course (link [here](https://www.codecademy.com/en/tracks/placekitten)) that introduces you to APIs and lets you pull images of kittens from a website by the end of the session.

Below, we import the packages we need in order to retrieve data from the API.

**Do we need to explain what packages are?**

In [119]:
from urllib.request import urlopen, Request
import json
# json_normalize has been available at the top level of pandas since version 1.0
from pandas import json_normalize
import pandas as pd
import requests as r
import os
import logging

In the cell below we set the maximum number of displayed columns to 100 so we can see our entire dataset. If we did not set this option, pandas would hide some of the columns whenever we print a wide dataset.

In [100]:
pd.set_option('display.max_columns', 100) 

Using % in the section below indicates a magic command. You can find out more about magic commands [here](https://ipython.org/ipython-doc/3/interactive/magics.html). The % commands in the block of code below let us run terminal commands from within Jupyter. We make a folder in our user directory where we will store the data we retrieve from the API.

In [None]:
%mkdir ~/_intro_machine_learning_course
%cd ~/_intro_machine_learning_course
%pwd
%ls
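
The same folder can also be created from plain Python, which helps when the code runs outside Jupyter and magic commands are not available. The sketch below is one possible approach; the helper name make_course_dir and its base argument are illustrative and not part of the original notebook.

```python
from pathlib import Path


def make_course_dir(base=None, name="_intro_machine_learning_course"):
    """Create the course folder under `base` (default: the home directory).

    `make_course_dir` is a hypothetical helper written for this example.
    """
    base = Path.home() if base is None else Path(base)
    course_dir = base / name
    # exist_ok=True means re-running the cell does not raise an error
    course_dir.mkdir(parents=True, exist_ok=True)
    return course_dir
```

Calling make_course_dir() with no arguments mirrors the mkdir command above, and os.chdir(course_dir) would then play the role of cd.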

In the line below, we create a data cache. A cache is a great way to store data when it is costly to retrieve it from scratch every time.

In [None]:
store = pd.HDFStore('kiva_cache.h5')
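
The idea behind a cache can be sketched in a few lines: look for a stored copy first, and only do the costly work when no copy exists. The example below is a minimal illustration that uses Python's built-in pickle module instead of HDFStore, purely to keep it dependency-free. The names load_or_fetch, cache_path and fetch are hypothetical.

```python
import pickle
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached data if `cache_path` exists, else call `fetch()` and cache it.

    Hypothetical helper for illustration; the notebook itself uses pd.HDFStore.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: read the stored object instead of redoing the costly work
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: do the costly work once, then store the result for next time
    data = fetch()
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

The first call pays the full cost of fetch; later calls with the same path read the file instead.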

There is great documentation on Kiva's API [here](http://build.kiva.org/api). The documentation explains what parameters (conditions) we need to pass in our request to Kiva's database in order to get the data we want.

We are trying to retrieve all Kiva data from Kenya, so we will use two main parameters: we set country_code=KE (KE is the two-letter [ISO](https://en.wikipedia.org/wiki/ISO_3166-2) code for Kenya), and we increase the results per page to 500 (this is the maximum Kiva's API appears to allow). You can see the HTML results of the API call by pasting the URL below into your browser. HTML is a format that is really easy to read.

http://api.kivaws.org/v1/loans/search/?country_code=KE&per_page=500

1) Go ahead and play with the url in order to retrieve different data. For example, how would you retrieve data from South Africa (ZA)?

2) How would you retrieve only 200 results?

Answers:

1) http://api.kivaws.org/v1/loans/search/?country_code=ZA&per_page=500

2)  http://api.kivaws.org/v1/loans/search/?country_code=ZA&per_page=200
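
Rather than editing the URL by hand, the query string can be assembled from a dictionary of parameters. The sketch below uses urlencode from Python's standard library; build_search_url is a hypothetical helper, and the base URL and parameter names are the ones used in the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.kivaws.org/v1/loans/search/"


def build_search_url(country_code, per_page=500):
    """Build a loan search URL (hypothetical helper for this example)."""
    params = {"country_code": country_code, "per_page": per_page}
    # urlencode turns the dictionary into country_code=...&per_page=...
    return BASE_URL + "?" + urlencode(params)


# Reproduces the answer to question 2
print(build_search_url("ZA", per_page=200))
```

Keeping the parameters in a dictionary makes it easy to change one value without retyping the rest of the URL.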



We want to request this data from the API and store it in a format that is more intuitive to us: a dataframe. Let's get started. The code below retrieves the first 500 results, which we then convert into a pandas dataframe. You will get to know a lot more about dataframes over the next few classes.

In [52]:
d = r.get('http://api.kivaws.org/v1/loans/search.json?country_code=KE&per_page=500')

Notice that in the request above we specify json as the type of text we want returned. This is easier to handle and to change into a pandas dataframe. You can paste the link into your browser to understand the difference between [JSON](https://en.wikipedia.org/wiki/JSON) and [HTML](https://en.wikipedia.org/wiki/HTML).

By running d.headers below we can see the metadata returned with the response. It shows the time of our request, the fact that the response is JSON text ('Content-Type': 'application/json; charset=UTF-8'), and other details.

In [53]:
d.headers

{'Date': 'Wed, 03 May 2017 23:56:20 GMT', 'Server': 'Apache/2.4.7 (Ubuntu)', 'Access-Control-Allow-Origin': '*', 'Expires': 'Tue, 03 Jul 2001 06:00:00 GMT', 'Last-Modified': 'Wed, 03 May 2017 23:56:20 GMT', 'Cache-Control': 'private, no-store, no-cache, must-revalidate, max-age=0, post-check=0, pre-check=0, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'X-RateLimit-Overall-Limit': '60', 'X-RateLimit-Overall-Remaining': '60', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '26111', 'Content-Type': 'application/json; charset=UTF-8'}
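
Individual headers can be read like dictionary entries. The sketch below pulls out the two X-RateLimit values from the output above. Reading them as a request quota is an assumption based on the header names, and rate_limit_status is a hypothetical helper.

```python
def rate_limit_status(headers):
    """Return (remaining, limit) as integers from the response headers.

    Hypothetical helper; the header names are copied from the output above.
    """
    limit = int(headers["X-RateLimit-Overall-Limit"])
    remaining = int(headers["X-RateLimit-Overall-Remaining"])
    return remaining, limit


# A subset of the headers shown above, copied into a plain dictionary
sample_headers = {
    "X-RateLimit-Overall-Limit": "60",
    "X-RateLimit-Overall-Remaining": "60",
    "Content-Type": "application/json; charset=UTF-8",
}
remaining, limit = rate_limit_status(sample_headers)
print(f"{remaining} of {limit} requests remaining")
```

In the notebook, d.headers could be passed in directly, since it behaves like a dictionary.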

By running the command d.json() below we can get an idea of what our data looks like before we change it into a pandas dataframe.

In [59]:
d.json()

{'loans': [{'activity': 'General Store',
   'bonus_credit_eligibility': False,
   'borrower_count': 1,
   'description': {'languages': ['en']},
   'funded_amount': 300,
   'id': 1281723,
   'image': {'id': 2497094, 'template_id': 1},
   'lender_count': 11,
   'loan_amount': 300,
   'location': {'country': 'Kenya',
    'country_code': 'KE',
    'geo': {'level': 'town', 'pairs': '1 38', 'type': 'point'},
    'town': 'Kapsabet'},
   'name': 'Patrick',
   'partner_id': 133,
   'planned_expiration_date': '2017-05-28T04:50:02Z',
   'posted_date': '2017-04-28T04:50:02Z',
   'sector': 'Retail',
   'status': 'funded',
   'tags': [{'name': '#Widowed'}, {'name': '#Elderly'}, {'name': '#Parent'}],
   'use': 'to purchase maize seeds, fertilizers, sugar and rice for his shop.'},
  {'activity': 'Farming',
   'bonus_credit_eligibility': False,
   'borrower_count': 1,
   'description': {'languages': ['en']},
   'funded_amount': 100,
   'id': 1281570,
   'image': {'id': 2502890, 'template_id': 1},
   'l

In [60]:
data = json.loads(d.text)

In [61]:
loans=json_normalize(data['loans'])

In [62]:
loans.head(3)

| | activity | basket_amount | bonus_credit_eligibility | borrower_count | description.languages | funded_amount | id | image.id | image.template_id | lender_count | ... | location.town | name | partner_id | planned_expiration_date | posted_date | sector | status | tags | themes | use |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | General Store | | False | 1 | [en] | 300 | 1281723 | 2497094 | 1 | 11 | ... | Kapsabet | Patrick | 133 | 2017-05-28T04:50:02Z | 2017-04-28T04:50:02Z | Retail | funded | [{'name': '#Widowed'}, {'name': '#Elderly'}, {... | | to purchase maize seeds, fertilizers, sugar an... |
| 1 | Farming | | False | 1 | [en] | 100 | 1281570 | 2502890 | 1 | 4 | ... | Kitale | Rabecca | 156 | 2017-05-28T02:20:02Z | 2017-04-28T02:20:02Z | Agriculture | funded | [{'name': '#Parent'}] | [Rural Exclusion] | to buy fertilizers, pesticides and herbicides. |
| 2 | Farming | 0.0 | False | 1 | [en] | 0 | 1281588 | 2502914 | 1 | 0 | ... | Kitale | Grace | 156 | 2017-05-27T23:40:05Z | 2017-04-27T23:40:05Z | Agriculture | fundraising | [{'name': '#Woman Owned Biz'}] | [Rural Exclusion] | to buy fertilizers, pesticides and herbicides. |
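
The dotted column names in the table, such as location.town and image.id, come from the nested dictionaries in the JSON. The sketch below is a simplified, pure-Python illustration of that flattening step. It is not how pandas implements json_normalize, and the flatten function is a hypothetical stand-in written for this example.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one dictionary with dotted keys.

    Simplified, hypothetical stand-in for what json_normalize does to
    nested dictionaries; it is not pandas' actual implementation.
    """
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the dotted prefix
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and plain values are kept as they are
            flat[full_key] = value
    return flat


# Part of the first loan from the d.json() output above
loan = {
    "id": 1281723,
    "image": {"id": 2497094, "template_id": 1},
    "location": {
        "country": "Kenya",
        "country_code": "KE",
        "geo": {"level": "town", "pairs": "1 38", "type": "point"},
        "town": "Kapsabet",
    },
}
for column, value in flatten(loan).items():
    print(column, "=", value)
```

Each nested level adds one more dotted segment, which is why the geo fields end up with names like location.geo.level.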


We have now extracted the first 500 rows of loans from the API. We can confirm how many rows a dataset has using the len() function below. Next, we need to extract this data more systematically for all Kenyan loan results. Kiva has a parameter called page but does not accept a range of pages, so we will have to write a Python loop that goes through each page of results and adds it to our dataset.

In [63]:
len(loans.index)

500

In [None]:
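    # What extract_loans (defined in the next cell) does:
    # - create an empty dataframe to fill in along the way
    # - go through each page of results
    # - load the response text as JSON
    # - flatten the loans into a dataframe and append it to the full dataframe
    # - return the full dataframe once every page has been read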

In [164]:
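def extract_loans(pages, country_iso_code):
    # Start with an empty dataframe and add one page of results at a time
    loans_full = pd.DataFrame()
    for n in range(1, pages + 1):
        s = str(n)
        print(s)
        # Request page n of the results (500 loans per page)
        d = r.get('http://api.kivaws.org/v1/loans/search.json?country_code=' + country_iso_code + '&per_page=500&page=' + s)
        data = json.loads(d.text)
        loans = json_normalize(data['loans'])
        # DataFrame.append was removed in pandas 2.0, so we combine with pd.concat
        loans_full = pd.concat([loans_full, loans], ignore_index=True)
        print(len(loans_full.index))

    return loans_full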

In [149]:
loans_full = extract_loans(10, 'KE')

1
500
2
1000
3
1500
4
2000
5
2500
6
3000
7
3500
8
4000
9
4500
10
5000


In [150]:
len(loans_full.index)

5000
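
The extract_loans function needs to be told how many pages to read. One alternative, sketched below, keeps requesting pages until one comes back empty. This assumes that asking for a page past the last one returns an empty list of loans, which has not been checked against the Kiva API. The names collect_all_pages and fetch_page are hypothetical, and a fake data source stands in for the API so the example runs offline.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page has no records.

    `fetch_page` is a hypothetical callable that returns the list of loans
    for one page number. `max_pages` is a safety cap against endless loops.
    """
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            # Assumption: an empty page means we have passed the last page
            break
        records.extend(batch)
    return records


# Stand-in for the API: 1,200 fake loans served 500 at a time
fake_loans = [{"id": i} for i in range(1200)]


def fake_fetch_page(page, per_page=500):
    start = (page - 1) * per_page
    return fake_loans[start:start + per_page]


all_loans = collect_all_pages(fake_fetch_page)
print(len(all_loans))
```

With the real API, fetch_page could wrap the r.get call from extract_loans and return data['loans'].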

In [157]:
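# Keep only the loans that have at least one lender
w_lender = loans_full.loc[loans_full['lender_count'] >= 1]
# Collect the ids of those loans in a Python list
w_lender_list = w_lender['id'].tolist()
# Preview the first two rows (Jupyter only displays the last expression in a cell)
w_lender.head(2)
# Show the second loan id in the list
w_lender_list[1]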

We want to cache loans_full locally so we do not have to pull this data again every lesson.

In [151]:
store['loans']=loans_full
loans_full=store.get('loans')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['activity', 'description.languages', 'location.country', 'location.country_code', 'location.geo.level', 'location.geo.pairs', 'location.geo.type', 'location.town', 'name', 'planned_expiration_date', 'posted_date', 'sector', 'status', 'tags', 'themes', 'use']]

  exec(code_obj, self.user_global_ns, self.user_ns)


In [152]:
len(loans_full.index)

5000