#Scikic API v0.2
 
##Overview

The Scikic API is an inference tool. It takes a set of question/answer items and queries a series of local and remote databases to generate conditional probability distributions over various features. The API is highly modular, and some modules don't use this probabilistic framework. For example, the music module simply contacts api.bandsintown.com to suggest local bands to go and see.

The conditional probabilities are combined in a Bayesian network using the pyMC module. Each module can provide pyMC 'features', which create functions that output the relevant probability distributions.

##Question/Answer dictionaries

The questions and answers are organised as four-value tuples, containing:

- dataset: tells the system which class to instantiate. Examples: postcode, census, movielens.
- dataitem: tells the class which aspect of the dataset is of interest. For example, in the movielens dataset one could be interested in whether the user has seen a film, or in what rating they gave it.
- detail: often unused by the classes. It could be, for example, the id of the film we want to know about.
- answer: the user's answer.
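
As a minimal sketch, one such tuple can be represented as a Python dictionary with the four keys above, which is the shape used in the examples throughout this document. The helper name `make_item` is hypothetical and not part of the API.

```python
def make_item(dataset, dataitem, detail='', answer=None):
    """Build one question/answer dictionary (hypothetical helper, not part of the API)."""
    item = {'dataset': dataset, 'dataitem': dataitem, 'detail': detail}
    if answer is not None:
        # A question that has not been answered yet simply has no 'answer' key.
        item['answer'] = answer
    return item

# The postcode question, together with the user's answer.
postcode_item = make_item('postal', 'postcode', answer='s63af')
print(postcode_item)
```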

##Available actions

Here are some examples of the API in action. Every query is sent as a POST request, in case the data we're sending is too large to fit in a GET request. Note that the API always uses POST and does not use the other HTTP methods.
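
Every call in this document sends the same four top-level fields: 'version', 'data', 'apikey' and 'action'. The sketch below only assembles and serialises that body, without contacting a server. The helper name `build_payload` is an assumption made for illustration; the field names and the version value 1 are taken from the examples in this document.

```python
import json

def build_payload(action, data, apikey='YOUR_API_KEY_HERE', version=1):
    """Assemble the request body for one API call (hypothetical helper)."""
    return {'version': version, 'data': data, 'apikey': apikey, 'action': action}

# An initial 'question' call with nothing asked yet.
payload = build_payload('question',
                        {'questions_asked': [], 'unprocessed_questions': [], 'facts': {}})

# The JSON string that would form the body of the POST request.
body = json.dumps(payload)
print(body)
```

With the `requests` library, passing `json=payload` to `requests.post` performs this serialisation itself, as the examples in this document do.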

###1. Get a suggestion for a question to answer *[action: question]*

####Parameters
Pass a dictionary in 'data' containing:
 - 'questions_asked'
 - 'unprocessed_questions'
 - 'facts'
 - 'target'

The 'questions_asked' list includes all the questions we've asked, so we don't ask the same question again.
The 'facts' dictionary contains information that we've found from earlier questions. It allows the results of calculations from earlier calls to the API to be cached.
The 'unprocessed_questions' list holds the question/answers that we've asked before but whose results haven't yet been added to the 'facts' dictionary.
The 'target' item is currently unused, but in future it will allow the question to be chosen to maximise the information about a particular feature.

####Returns

This call returns a dictionary containing two things:

 - a 'facts' dictionary - this you can pass back in future so that the method doesn't have to recalculate or generate earlier results.
 - a 'question' dictionary, containing the dictionary describing the question, e.g. {dataset,dataitem,detail}.

The 'data' passed in contains the list of previously asked (and answered) questions, which allows the best next question to be chosen.
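
As a small sketch of unpacking the returned dictionary, the snippet below parses a response body copied from the example output in this document. Only the standard `json` module is used; the variable names are illustrative.

```python
import json

# Raw response body, copied from the example output in this document.
raw = '\n{"facts": {}, "question": {"detail": "", "dataitem": "travel", "dataset": "lifestyle"}}\n'

# json.loads ignores the leading and trailing newlines.
result = json.loads(raw)

facts = result['facts']        # pass this back in the next call
question = result['question']  # the next question to ask the user

print(question['dataset'], question['dataitem'])
```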

####Usage example

1. One might initially call this method with all these fields being empty. The method will return an empty 'facts' dictionary and a question dictionary for the first question you want to ask. 
2. Once you have an answer from the user you would call the method again, this time with the question/answer tuple as both 'questions_asked' and 'unprocessed_questions'.
3. The method will return a facts dictionary now, potentially with some results from the processing of the last answer, and another question for you to ask. 
4. When you call the method a third time (with the user's second answer), you'll pass all the question/answer tuples that you've asked so far in 'questions_asked' and the last question/answer tuple in 'unprocessed_questions'. You'll also pass the new facts dictionary, that now has some content in it.
5. This process continues, with the facts dictionary growing each time, the 'questions_asked' growing too, and each time you just have one item in 'unprocessed_questions'.
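
The bookkeeping in the steps above can be sketched as follows. The helper names and the `state` dictionary are hypothetical. The sketch assumes, as the steps describe, that only the latest answer needs to go in 'unprocessed_questions', because earlier answers are already reflected in the facts returned by the previous call.

```python
def record_answer(state, question, answer):
    """Store the user's answer to a question (hypothetical helper)."""
    answered = dict(question, answer=answer)
    state['questions_asked'].append(answered)   # grows with every question
    state['unprocessed_questions'] = [answered] # only the latest answer
    return state

def absorb_facts(state, response):
    """Keep the facts dictionary returned by the API for the next call."""
    state['facts'] = response['facts']
    return state

# Start with everything empty, as in step 1.
state = {'questions_asked': [], 'unprocessed_questions': [], 'facts': {}}

record_answer(state, {'dataset': 'postal', 'dataitem': 'postcode', 'detail': ''}, 's63af')
record_answer(state, {'dataset': 'music', 'dataitem': 'favourite_artist', 'detail': ''}, 'Eels')

print(len(state['questions_asked']), len(state['unprocessed_questions']))
```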

To summarise:

Generating a question takes a dictionary with 'questions_asked', 'unprocessed_questions', 'facts' and 'target' (all optional). 'questions_asked' is a list of dictionaries describing previous questions, which you want to avoid asking again.
The 'unprocessed_questions' are questions that you've already asked but that haven't yet been incorporated into the 'facts' dictionary.

Below is an example:

In [1]:
#apiurl = 'http://scikic.org/api/api.cgi';
#apiurl = 'http://127.0.0.1/~lionfish/scikic/api.cgi';
#apiurl = 'http://52.18.184.63/~ubuntu/scikic/api.cgi';
apiurl = 'http://production-backend-lb-no-ssl-1389362950.eu-west-1.elb.amazonaws.com/~ubuntu/scikic/api.cgi';

In [3]:
import requests
#We provide data about previous questions etc:
#data consists of a dictionary of (all of these are optional):

#'questions_asked': An array of previous questions and answers we've asked, consists of a list of dictionaries.
questions_asked = [{'dataset':'postal','dataitem':'postcode','detail':''},{"detail": "", "dataitem": "favourite_artist", "dataset": "music"}]

#none of the questions that have been asked before have been processed.
unprocessed_questions = questions_asked

#'facts': If you've run the inference query and stored a copy of the facts dictionary you can pass it back.
#this makes it quicker. In this case we've not yet had any questions processed.
facts = {}

#'target': What feature we want to know more about (example: 'age', 'gender', 'location') NOT YET IMPLEMENTED
#not used. All these items are optional, so we just don't include it.

#Build the dictionary:
data = {'unprocessed_questions':unprocessed_questions,'questions_asked':questions_asked,'facts':facts}

#put it into the payload of the request. This also includes the version, the api key we're using and the action we want (in this case generate a question)
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'question'}
r = requests.post(apiurl,json=payload)
r.content

'\n{"facts": {}, "question": {"detail": "", "dataitem": "travel", "dataset": "lifestyle"}}\n'

Here it has output:

 - the facts dictionary: Empty
 - the question it suggests we ask, which in this case is from the 'lifestyle' dataset, and is the question on 'travel'.
 
Currently we don't really know what this question means; we need to get a text string for it.

###2. Get a text string of the question *[action: questionstring]*

Once you have a tuple, like the one generated above, you may want a human-readable string of the question. This method takes the tuple (in data) and returns a dictionary of:

- 'question' - the actual string of the question (e.g. "Who's your favourite band or artist?")
- 'type' - the type of question: 'text' if it just wants a free-text reply, or 'select' if it is a choice
- 'options' - the list of choices; only included if the type is 'select'.

In [4]:
import requests

data = {'dataset':'postal','dataitem':'postcode','detail':''}
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'questionstring'}
r = requests.post(apiurl,json=payload)
r.content

'\n{"type": "text", "question": "What\'s your postcode?"}\n'

In this example the dictionary `{'dataset':'postal','dataitem':'postcode','detail':''}` gets converted to the question "`What's your postcode?`" with type "`text`" (i.e. the user can type anything).

In [5]:
import json
data = {'dataset':'geoloc','dataitem':'nearcity','detail':json.dumps({'city':'Sheffield','country':'UK'})}
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'questionstring'}
r = requests.post(apiurl,json=payload)
r.content

'\n{"type": "select", "question": "Is your home in or near Sheffield, UK?", "options": ["yes", "no", "don\'t know"]}\n'

In this example the dictionary "`{'dataset':'geoloc','dataitem':'nearcity','detail':json.dumps({'city':'Sheffield','country':'UK'})}`" gets converted to "`Is your home in or near Sheffield, UK?`", which is a select type of question with the options yes, no or don't know.

Note that here the detail contained a JSON-encoded dictionary; what goes in detail varies by the dataset module involved.
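
A front end has to treat the two question types differently. The following sketch turns a questionstring response into a prompt, appending the choices for a 'select' question. The helper name `format_prompt` and the prompt layout are assumptions; the response body is copied from the example output in this document.

```python
import json

def format_prompt(response):
    """Turn a questionstring response into a prompt string (hypothetical helper)."""
    prompt = response['question']
    if response.get('type') == 'select':
        # 'options' is only present for 'select' questions.
        prompt += ' [' + ' / '.join(response.get('options', [])) + ']'
    return prompt

raw = '\n{"type": "select", "question": "Is your home in or near Sheffield, UK?", "options": ["yes", "no", "don\'t know"]}\n'

print(format_prompt(json.loads(raw)))
```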

###Why did the question string generation get separated from question selection?
The current frontend stores the question that needs asking next, so:

  1. We know which question any answer given is for.
  2. We can ask the same question when the user returns.
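
One way a front end might keep the pending question is to store the question tuple as a JSON string and attach the answer when the user replies. This is only a sketch: where the string is stored (session, database, file) is left open, and the variable names are illustrative.

```python
import json

# The question tuple returned by the 'question' action.
pending = {'dataset': 'postal', 'dataitem': 'postcode', 'detail': ''}

# Serialise it so it can be stored until the user answers or returns.
stored = json.dumps(pending)

# Later: restore the tuple, so we know which question the answer belongs to.
question = json.loads(stored)
question['answer'] = 's63af'
print(question)
```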

###3. Inference *[action: inference]*

####Parameters

The data dictionary should contain three things, similar to "action: question" above:

- questions_asked - list of question tuples that we've asked (with their answers)
- unprocessed_questions - list of question tuples that we've asked (with their answers), that have not yet been added to the facts dictionary.
- facts - the current 'facts' dictionary (possibly provided by earlier calls using action:question)

Returns a dictionary of:

#####features
This is a dictionary of things that have probabilities associated with them. For example, one of its items is 'household', with the following fields:
 
     {"distribution": [0.029, 0.058, 0.23, 0.034, 0.070, 0.026, 0.055, 0.036, 0.14, 0.024, 0.023, 0.035, 0.24], "quartiles": {"upper": 11, "lower": 2, "mean": 6.46}}
 
 where the distribution gives how likely the person is to be in each of the household categories (these categories can be found in the metadata from the module, or elsewhere). The quartiles don't mean much here, as this is categorical data; they make more sense for data such as age.
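
As a sketch of reading such a distribution, the snippet below finds the most probable category index for the household example above and computes the probability-weighted index. The weighted index comes out at roughly 6.44, near the reported 'mean' of 6.46; whether the API computes its 'mean' this way is an assumption, not something the source confirms.

```python
# The 'household' distribution from the example above.
household = [0.029, 0.058, 0.23, 0.034, 0.070, 0.026, 0.055,
             0.036, 0.14, 0.024, 0.023, 0.035, 0.24]

# Index of the most probable category (labels come from the module metadata).
most_likely = max(range(len(household)), key=lambda i: household[i])

# Probability-weighted category index.
expected_index = sum(i * p for i, p in enumerate(household))

print(most_likely, round(expected_index, 2))
```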
 
#####facts
As mentioned previously, this is a set of truths about the user, generated from processing their answers.

#####insights
This is a list of strings generated by each module. Here is an example:
 
     ["I can\'t tell which country you\'re in, just looking at your facebook likes, as I can\'t see your facebook likes!", "You are aged between 20 and 33.", "You don\'t have children living at home", " I think you are Christian or of no religion."]
 
Note regarding the distribution above: if the probabilities towards the end of a list are zero, the list is truncated. For example, if inference is certain the user is male, the output will be {"factor_gender":[1.0]}. If they are definitely female, it will be {"factor_gender":[0.0, 1.0]}.
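
Because of this truncation, code that indexes a distribution by category may want to restore the trailing zeros first. A minimal sketch follows, assuming the number of categories is known (for example from the module metadata); the helper name `pad_distribution` is hypothetical.

```python
def pad_distribution(dist, n_categories):
    """Restore trailing zeros dropped from a truncated distribution."""
    if len(dist) > n_categories:
        raise ValueError('distribution is longer than the number of categories')
    return list(dist) + [0.0] * (n_categories - len(dist))

# A user who is certainly male: the second entry was truncated away.
print(pad_distribution([1.0], 2))
```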

####Example call using action:inference

Below we set up the data dictionary with one question asked (and answered) regarding postcode. It also contains the same question/answer dictionary as an unprocessed question. The facts dictionary is empty.

The output is shown below the code.

In [12]:
import requests
import json
questions_asked = [{'dataset':'postal','dataitem':'postcode','detail':'','answer':'s63af'}]
unprocessed_questions = [{'dataset':'postal','dataitem':'postcode','detail':'','answer':'s63af'}]
facts = {}
        
data = {'questions_asked':questions_asked,'unprocessed_questions':unprocessed_questions,'facts':facts}
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'inference'}
r = requests.post('http://scikic.org/api/api.cgi',json=payload)
result = json.loads(r.content)
print("==Facts==")
print(result['facts'])
print("")
print("==Features==")
for feat in result['features']:
    print(feat)
    print(result['features'][feat])
print("")
print("==Text insights==")
print(result['insights'])

==Facts==
{u'guess_loc': {}, u'where': {u'ukcensus': [{u'item': u'E00172420', u'probability': 1.0, u'level': u'oa'}], u'country': [{u'item': u'gb', u'probability': 1.0}], u'city': [{u'item': [u'Sheffield', u'uk'], u'probability': 1.0}]}, u'where_history': {u'error': u'no_fb_likes'}}

==Features==
religion
{u'distribution': [0.276, 0.03422222222222222, 0.027555555555555555, 0.034666666666666665, 0.057777777777777775, 0.03244444444444444, 0.049777777777777775, 0.4875555555555556], u'quartiles': {u'upper': 7, u'lower': 0, u'mean': 4.298222222222222}}
household
{u'distribution': [0.022222222222222223, 0.05688888888888889, 0.23466666666666666, 0.03244444444444444, 0.08266666666666667, 0.028, 0.059111111111111114, 0.03955555555555555, 0.14222222222222222, 0.024, 0.024444444444444446, 0.028444444444444446, 0.22533333333333333], u'quartiles': {u'upper': 11, u'lower': 2, u'mean': 6.340888888888888}}
oa
{u'distribution': [1.0], u'quartiles': {u'upper': 0, u'lower': 0, u'mean': 0.0}}
factor_gende

For the example above one can see that after just processing the answer about postcode, quite a bit of new info has been generated.

First, the facts dictionary. Its contents are often dataset-specific, although I've tried to make things compatible between datasets.

 - 'guess_loc': {} - info on whether it guessed the location of the user from their IP address (I think)
 - 'where': - the dictionary describing the location of the user's home. This is quite tricky, as different sources of data have different resolutions, etc.
     - 'ukcensus': [{'item': 'E00172420', 'probability': 1.0, 'level': 'oa'}] - In terms of the UK census, we have a list of output areas for this person's home. This list only has one item in it. Each item in the list has an associated probability; in this case the probability is 1.0, so we are certain the person is in that output area.
     - 'country': [{'item': 'gb', 'probability': 1.0}] - which country they're in. A list of countries with associated probabilities.
     - 'city': [{'item': ['Sheffield', 'uk'], 'probability': 1.0}] - which city they're in (with probabilities).
 - 'where_history': {'error': 'no_fb_likes'} - if we have access to Facebook likes, it tries to generate a history of where the person has lived. The error item means that it hasn't managed to get hold of the likes to do this.
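
A sketch of reading these facts in client code is shown below. It picks the most probable country and city and collects any top-level entries that report an error. The helper name `best_guess` is hypothetical; the facts dictionary is the one from the output above.

```python
facts = {
    'guess_loc': {},
    'where': {
        'ukcensus': [{'item': 'E00172420', 'probability': 1.0, 'level': 'oa'}],
        'country': [{'item': 'gb', 'probability': 1.0}],
        'city': [{'item': ['Sheffield', 'uk'], 'probability': 1.0}],
    },
    'where_history': {'error': 'no_fb_likes'},
}

def best_guess(entries):
    """Return the 'item' with the highest probability, or None for an empty list."""
    if not entries:
        return None
    return max(entries, key=lambda e: e['probability'])['item']

country = best_guess(facts['where'].get('country', []))
city = best_guess(facts['where'].get('city', []))

# Top-level entries that carry an 'error' item instead of data.
errors = {k: v['error'] for k, v in facts.items()
          if isinstance(v, dict) and 'error' in v}

print(country, city, errors)
```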

Next, the features dictionary. Different datasets provide different conditional probability distributions. Each distribution has at least two features associated with it. If a feature is not already in the list, the module adds it, so by the end one has a list of all the features.

The value of these features is then estimated using MCMC with the pyMC module.

The example above has five features: religion, household, (census) output area, gender and age. They're all categorical (for now), although that's purely due to the type of data that we have about them.

Looking just at the household feature: [0.022222222222222223, 0.05688888888888889, 0.23466666666666666, 0.03244444444444444, 0.08266666666666667, 0.028, 0.059111111111111114, 0.03955555555555555, 0.14222222222222222, 0.024, 0.024444444444444446, 0.028444444444444446, 0.22533333333333333]

Each value refers to the probability of being of a given type of household (the labels associated are available via the metadata of the module).

Finally there are insights. These are simple strings of facts about the person:
 - "I can't tell which country you're in, just looking at your facebook likes, as I can't see your facebook likes!" - this first one is more of an error message, warning us that we couldn't get to their Facebook likes data.
 - "You are aged between 19 and 31"
 - "I think you are Christian or of no religion"
 
Here's another example, with an American ZIP code:

In [13]:
import requests
questions_asked = [{'dataset':'postal','dataitem':'zipcode','detail':'','answer':'86021'}]
unprocessed_questions = [{'dataset':'postal','dataitem':'zipcode','detail':'','answer':'86021'}]
facts = {}
        
data = {'questions_asked':questions_asked,'unprocessed_questions':unprocessed_questions,'facts':facts}

payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'inference'}
r = requests.post('http://scikic.org/api/api.cgi',json=payload)
r.content

'\n{"facts": {"guess_loc": {}, "where": {"uscensus": [{"item": ["04", "015", "950100", "1"], "probability": 0.514, "level": "blockgroup"}, {"item": ["04", "015", "950100", "3"], "probability": 0.486, "level": "blockgroup"}], "city": [{"item": ["Colorado City, AZ", "us"], "probability": 1.0}], "country": [{"item": "us", "probability": 1.0}]}, "where_history": {"error": "no_fb_likes"}}, "features": {"bg": {"distribution": [0.33644444444444443, 0.6635555555555556], "quartiles": {"upper": 1, "lower": 0, "mean": 0.6635555555555556}}, "factor_gender": {"distribution": [0.3328888888888889, 0.6671111111111111], "quartiles": {"upper": 1, "lower": 0, "mean": 0.6671111111111111}}, "factor_age": {"distribution": [0.015111111111111112, 0.032, 0.04666666666666667, 0.015555555555555555, 0.036, 0.042222222222222223, 0.041777777777777775, 0.030666666666666665, 0.028888888888888888, 0.030666666666666665, 0.02, 0.032, 0.02711111111111111, 0.036, 0.05644444444444444, 0.025333333333333333, 0.01466666666666

You can see this is similar, except the 'where' item in the dictionary has a 'uscensus' item within it. This contains two items:

    {"item": ["04", "015", "950100", "1"], "probability": 0.514, "level": "blockgroup"}
    {"item": ["04", "015", "950100", "3"], "probability": 0.486, "level": "blockgroup"}

Because zipcodes cover quite large areas, it doesn't know which blockgroup the person's home is in, as the zip code spans more than one blockgroup. It therefore gives the probability of being in the two.
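
As a sketch of working with these candidates, the snippet below maps each block group to its probability, checks that the probabilities sum to one, and ranks the block groups. The two entries are copied from the output above; the variable names are illustrative.

```python
uscensus = [
    {"item": ["04", "015", "950100", "1"], "probability": 0.514, "level": "blockgroup"},
    {"item": ["04", "015", "950100", "3"], "probability": 0.486, "level": "blockgroup"},
]

# Lists are not hashable, so use a tuple of the identifier parts as the key.
by_blockgroup = {tuple(e["item"]): e["probability"] for e in uscensus}

total = sum(by_blockgroup.values())
ranked = sorted(by_blockgroup, key=by_blockgroup.get, reverse=True)

print(ranked[0], round(total, 3))
```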

The US census module doesn't know about religion, so it has no conditional probability distribution for it and no religion feature is created. The features that are created are:

 - bg (block group)
 - gender
 - age

###5. Metadata *[action: metadata]*

Some of the classes provide metadata about the results. Use the 'metadata' action to retrieve these. Pass a dictionary in 'data' with the name of the dataset, or leave empty to get all the metadata of all the classes.

In this example we display the citation information for the 'babynames' dataset.

In [9]:
import requests
import json
data = {'dataset':'babynames'}
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'metadata'}
r = requests.post('http://scikic.org/api/api.cgi',json=payload)
for item in json.loads(r.content):
    if 'citation' in item:
        print(item['citation'])

The ONS provide statistics on the distribution of the names of baby's in the UK: <a href="http://www.ons.gov.uk/ons/about-ons/business-transparency/freedom-of-information/what-can-i-request/published-ad-hoc-data/pop/august-2014/baby-names-1996-2013.xls">1996-2013</a> and <a href="http://www.ons.gov.uk/ons/rel/vsob1/baby-names--england-and-wales/1904-1994/top-100-baby-names-historical-data.xls">1904-1994</a>.
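
The citation strings contain HTML anchors, which is convenient for a web front end but not for plain-text output. The sketch below strips the tags with a simple regular expression. It assumes, as the loop above suggests, that the metadata response is a list of dictionaries with an optional 'citation' key; the helper name `citation_text` is hypothetical, and the sample strings are two of the citations returned by the API in this document.

```python
import re

def citation_text(citation):
    """Remove HTML tags from a citation string, keeping the link text."""
    return re.sub(r'<[^>]+>', '', citation).strip()

sample_items = [
    {'citation': 'The <a href="facebook.com">facebook</a> graph API'},
    {'citation': 'The <a href="http://www.census.gov/developers/">US census bureau</a>'},
    {},  # an item without a 'citation' key is skipped
]

plain = [citation_text(item['citation']) for item in sample_items if 'citation' in item]
print(plain)
```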


In this example we get all citations:

In [None]:
import requests
import json
data = {}#no dataset specified (makes it output all metadata)
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'metadata'}
r = requests.post('http://scikic.org/api/api.cgi',json=payload)
for item in json.loads(r.content):
    if 'citation' in item:
        print(item['citation'])

The <a href="facebook.com">facebook</a> graph API
The <a href="http://www.census.gov/developers/">US census bureau</a>
The <a href="http://files.grouplens.org/datasets/movielens">movielens</a> database
The ONS provide statistics on the distribution of the names of baby's in the UK: <a href="http://www.ons.gov.uk/ons/about-ons/business-transparency/freedom-of-information/what-can-i-request/published-ad-hoc-data/pop/august-2014/baby-names-1996-2013.xls">1996-2013</a> and <a href="http://www.ons.gov.uk/ons/rel/vsob1/baby-names--england-and-wales/1904-1994/top-100-baby-names-historical-data.xls">1904-1994</a>.
The <a href="https://geoportal.statistics.gov.uk">UK office of national statistics</a> (see <a href="http://www.ons.gov.uk/ons/guide-method/geography/products/census/lookup/other/index.html">details</a> and <a href="https://geoportal.statistics.gov.uk/geoportal/catalog/search/resource/details.page?uuid={A33B0569-97E2-4F44-836C-B656A6D082B6} ">information</a>) and the US zipcode data 

###Typical API usage

The scikic front end may use the API in the following way.

In [19]:
import requests
import json

#We start with no questions asked, none unprocessed, and nothing in the facts dictionary.
questions_asked = []
unprocessed_questions = []
facts = {}

for loop in range(3): #we'll ask three questions
    
    #1. get the question (populate the data dictionary & send it off)
    data = {'unprocessed_questions':unprocessed_questions,'questions_asked':questions_asked,'facts':facts}
    payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'question'}
    r = requests.post(apiurl,json=payload) #>>>
    question_query_result = json.loads(r.content)
    
    #if processing was done then more items will be available for 'facts':
    facts = question_query_result['facts']
    
    #2. We want to get the question string itself (put the question tuple in data and send it off to the server)
    data = question_query_result['question']
    payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'questionstring'}
    r = requests.post(apiurl,json=payload) #>>>

    #we now have the question string
    question_string_result = json.loads(r.content)
    question = question_query_result['question']

    #ask the user this question    
    userinput = raw_input(question_string_result['question'])
    question['answer'] = userinput #add their answer

    #add this to the list of questions we've asked, and unprocessed questions
    questions_asked.append(question)
    unprocessed_questions.append(question)
    
#3. Once enough questions are asked we can do inference.
#   Populate the data dictionary with questions asked,
#   unprocessed questions and facts (as for the question query in step 1)
data = {'questions_asked':questions_asked,'unprocessed_questions':unprocessed_questions,'facts':facts}
payload = {"version":1, 'data': data, 'apikey': 'YOUR_API_KEY_HERE', 'action':'inference'}
r = requests.post('http://scikic.org/api/api.cgi',json=payload) #>>>
inference_results = json.loads(r.content)

#It generates insights from these questions, which are displayed below.
print("\nInsights\n")
for insight in inference_results['insights']:
    print(insight)

New
{}
What's your favourite band or artist? (be honest!)Eels
Old
{}
New
{u'guess_loc': {}, u'where': {}}
Which country are you in?uk
Old
{u'guess_loc': {}, u'where': {}}
New
{u'guess_loc': {}, u'where': {u'country': [{u'item': u'gb', u'probability': 1.0}], u'city': []}}
Have you seen Crocodile Dundee (1986)? (yes or no)no
Old
{u'guess_loc': {}, u'where': {u'country': [{u'item': u'gb', u'probability': 1.0}], u'city': []}}

Insights

Unfortunately there are no local events I think you'd like but why not give The Dandy Warhols a listen ?
I can't tell which country you're in, just looking at your facebook likes, as I can't see your facebook likes!
You are aged between 26 and 75.
 I think you are Christian or of no religion.
