###Introduction

The Online Psychic is a website, integrated as a Facebook app, which asks a series of questions and then, using various sources of data, infers facts about the user (currently their age, gender and religion).

We need a general framework for:

 - Selecting questions
 - Creating question text
 - Recording answers
 - Incorporating new datasets
 - Calculating conditional probabilities, e.g. p(age|gender, postcode, religion)
 - Performing inference on these conditional probabilities

To this end, we define a base class, Answer. It has the following methods:

 - question_to_text - produces a string with the text of the question (in future we might want to make it more interactive, like TypeForm, as suggested by Enoch).
 - get_pymc_function - generates the categorical conditional probability distribution used by pyMC.
 - append_facts - adds facts to the 'facts' dictionary. These are things known to be true (e.g. the person's name) that we don't do probabilistic inference on directly, but that are used by other modules.
 - append_features - adds features, which are the pyMC nodes in the Bayesian network.
 - pick_question - returns the question that the module would like to ask (in the future we might pass in the feature we would like an answer to, and return an estimate of how much information the question will provide, so we can select the most appropriate question from among the different classes).
 - process_answer - occasionally used to reformat or adjust the answer given by the user before putting it in the database. This is useful when the adjustment requires computation or API queries, as it avoids repeating that work every time the DB is read.
 
This class is inherited by each dataset (census, movielens, etc).
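
A minimal sketch of what such a base class might look like. The method names come from the list above; the constructor arguments, method signatures and return values are assumptions made for illustration, and `MovieAnswer` is a hypothetical subclass rather than the project's actual movielens class.

```python
class Answer(object):
    """Base class for one question/answer pair drawn from a dataset."""

    dataset = None  # dataset name, e.g. 'movielens' or 'census'

    def __init__(self, dataitem, detail, answer=None):
        # Assumed constructor: mirrors the dataitem/detail/answer columns of the qa table
        self.dataitem = dataitem
        self.detail = detail
        self.answer = answer

    def question_to_text(self):
        """Return the question as a string."""
        raise NotImplementedError

    def get_pymc_function(self, features):
        """Return the conditional probability node for pyMC (dataset-specific)."""
        raise NotImplementedError

    def append_facts(self, facts):
        """Add known-true items to the shared 'facts' dictionary (default: none)."""
        pass

    def append_features(self, features):
        """Add this answer's nodes to the shared features collection (default: none)."""
        pass

    @classmethod
    def pick_question(cls):
        """Return the (dataitem, detail) the class would like to ask about."""
        raise NotImplementedError

    @classmethod
    def process_answer(cls, dataitem, detail, answer):
        """Reformat the raw answer before it is stored (default: unchanged)."""
        return answer


class MovieAnswer(Answer):
    """Hypothetical subclass: asks whether the user has seen a film."""

    dataset = 'movielens'

    def question_to_text(self):
        return "Have you seen %s?" % self.detail

    def append_facts(self, facts):
        # Record the raw answer so other modules can read it without inference
        facts.setdefault('seen', {})[self.detail] = self.answer

    @classmethod
    def pick_question(cls):
        return 'seen', 'Toy Story'  # illustrative film title
```

Keeping the defaults harmless (no facts, no features, answer unchanged) means a new dataset class only has to override the methods it actually needs.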

When an answer is returned, it is added to the database.

Regarding the selection of questions: currently a random class is selected. The long-term plan is that, when a question needs asking, each class is given the feature we want to know about and asked for a question, together with an estimate of how much information that question will provide about that feature.
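
One way the "how much information" estimate could be computed is as the expected reduction in entropy of the target feature. The sketch below is an assumption about how this might be done, not the current implementation: `expected_information` and `select_question` are hypothetical helpers, and each candidate question is assumed to come with a table of p(answer | feature).

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing
    return float(-np.sum(p * np.log2(p)))


def expected_information(prior, likelihood):
    """Expected entropy reduction of the feature from asking one question.

    prior[f]         = p(feature = f)
    likelihood[f, a] = p(answer = a | feature = f)
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior[:, None] * likelihood  # p(feature, answer)
    p_answer = joint.sum(axis=0)         # p(answer)
    gain = entropy(prior)
    for a, pa in enumerate(p_answer):
        if pa > 0:
            # subtract the posterior entropy, weighted by how likely this answer is
            gain -= pa * entropy(joint[:, a] / pa)
    return gain


def select_question(candidates, prior):
    """candidates: list of (question_text, likelihood) pairs; returns the best pair."""
    return max(candidates, key=lambda c: expected_information(prior, c[1]))
```

A question whose answer is independent of the feature scores zero, while one whose answer determines the feature scores the full prior entropy, so scores from different dataset classes are directly comparable.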

####Hierarchy

Suppose gender is not known, but would be very useful for movielens' guess. We can get an improved estimate of gender from the census. How do we automatically ask the census class for an estimate of gender? As the design gets more complicated, these dependencies become more intricate. We also know which of 10 films the user has seen; combining the data from these will probably be more effective than assuming independence.

Possible ways to deal with this are either to use an undirected graph (the census info provides data to movielens and vice versa) or, ideally, to ensure the directed graph is acyclic by careful graph design.

Rather than write a new Bayesian inference engine ourselves, we can use one of the Python libraries that do this for us: BayesPy, pyMC, libpgm, scikit-learn (only does naive Bayes?). I chose pyMC as the most usable, practical and well-documented.

Books and other things to look at:

- http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Prologue/Prologue.ipynb
- Building Probabilistic Graphical Models with Python -- Kiran R Karkera
- See also: Bayesian Interchange Format (bif) and the asia network?
- What about population weighting within OA to help with P(LM|OA)

###Database

The table of greatest importance is the 'qa' (question-answer) table:

    CREATE TABLE qa (userid integer, dataset varchar(255), dataitem varchar(255), detail varchar(255), answered integer, asked_last integer, answer varchar(255), PRIMARY KEY (userid, dataset, dataitem, detail));

By column:

- dataset - the name of the dataset (e.g. movielens, census, where, ...); these names are held in the dataset class variable in each class.
- dataitem - the particular type of data from the dataset (e.g. whether the user has /seen/ a film)
- detail - e.g. the name of the film.
- asked_last - are we still waiting for an answer for this?
- answered - might be unused (TO CONFIRM)
- answer - the answer given
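
A minimal sketch of how a question and its answer might pass through this table, using Python's built-in `sqlite3` module. The helper names `ask` and `record_answer` are hypothetical, and the convention that `asked_last = 1` means "still waiting for an answer" and `answered = 1` means "answered" is an assumption based on the column descriptions above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE qa (userid integer, dataset varchar(255), dataitem varchar(255), "
    "detail varchar(255), answered integer, asked_last integer, answer varchar(255), "
    "PRIMARY KEY (userid, dataset, dataitem, detail))")


def ask(conn, userid, dataset, dataitem, detail):
    """Record that a question has been put to the user and is awaiting a reply."""
    conn.execute(
        "INSERT OR REPLACE INTO qa "
        "(userid, dataset, dataitem, detail, answered, asked_last, answer) "
        "VALUES (?, ?, ?, ?, 0, 1, NULL)",
        (userid, dataset, dataitem, detail))


def record_answer(conn, userid, answer):
    """Store the reply against whichever question is still pending for this user."""
    conn.execute(
        "UPDATE qa SET answer = ?, answered = 1, asked_last = 0 "
        "WHERE userid = ? AND asked_last = 1",
        (answer, userid))


ask(conn, 1, 'movielens', 'seen', 'Toy Story')  # illustrative values
record_answer(conn, 1, 'yes')
rows = conn.execute(
    "SELECT dataset, dataitem, detail, answer FROM qa WHERE userid = ?", (1,)).fetchall()
```

Using `?` placeholders rather than string formatting keeps user-supplied answers from being interpreted as SQL.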



A simple example of the Categorical type and pyMC (written against the pyMC 2 API):

    import numpy as np
    import pymc as pm

    # Religious or not religious? (index 1 = religious)
    religious = pm.Categorical('religious', np.array([0.1, 0.9]))

    # Young or old given not religious/religious:
    # row = value of 'religious', column = age category (index 1 = old)
    CPTLines = np.array([[0.9, 0.1], [0.1, 0.9]])

    # clues from http://stackoverflow.com/questions/22808110/how-to-make-conditional-probability-tables-cpts-for-bayesian-networks-with-pym
    @pm.deterministic
    def selectedCPTLine(CPTLines=CPTLines, religious=religious):
        # Pick the row of the CPT matching the current value of 'religious'
        return CPTLines[religious]

    age = pm.Categorical('age', selectedCPTLine)

    model = pm.Model([religious, age])

    mcmc = pm.MCMC(model)
    mcmc.sample(4000, 1000, 10)  # iterations, burn-in, thinning

    a = mcmc.trace('age')[:]
    r = mcmc.trace('religious')[:]

    print("\n")
    print("Prob of religious: %0.2f" % np.mean(r))
    print("Prob of Old: %0.2f" % np.mean(a))

Output:

     [-----------------100%-----------------] 4000 of 4000 complete in 0.4 sec

    Prob of religious: 0.89
    Prob of Old: 0.80
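
For a network this small, the sampled values can be checked against the exact answer. The sketch below uses only numpy and the same prior and conditional probability table as the example; the variable names are chosen here for illustration. The exact marginal probability of "old" is 0.1 × 0.1 + 0.9 × 0.9 = 0.82, close to the sampled 0.80.

```python
import numpy as np

p_religious = np.array([0.1, 0.9])          # same prior as the example
cpt = np.array([[0.9, 0.1], [0.1, 0.9]])    # cpt[r, a] = p(age=a | religious=r)

joint = p_religious[:, None] * cpt          # p(religious=r, age=a)
p_age = joint.sum(axis=0)                   # marginal p(age)
p_religious_given_old = joint[:, 1] / p_age[1]  # Bayes' rule, conditioning on age=1

print("Exact prob of Old: %0.2f" % p_age[1])
print("Prob of religious given Old: %0.3f" % p_religious_given_old[1])
```

The last line shows the direction of inference the project actually needs: observing an effect (here, age) and updating belief about another variable.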


###Designing the Bayesian Network

So far things seem to have worked OK. However, a key problem with this model becomes apparent when we try to add more data sources.

We initially imagined a network such as:
![alt text](files/images/bayes_1.png "Bayesian Network")

However, what if we wanted to add an additional source of data, say, the number of cats on the person's Facebook page?

![alt text](files/images/bayes2.png "Network with an additional data source")

We now have two very difficult probability distributions to create, of the form:

$p(age|postcode, facebook_{cats})$

To handle this we need to be more careful about the order and design of the network.

A principled approach would have causes: Age, Gender, Religion, Height, Political-position, etc...

and would have effects or observations: Facebook cat count, Movie Preferences, Music preferences, postcode, etc...

We then ensure that the network is of the form:

![alt text](files/images/bayes3.png "Restricted network")

Basically a form of factor analysis?

Finally, we can allow some connections between observations, but need to be careful to avoid loops:

![alt text](files/images/bayes4.png "Less Restricted network")

One could have nodes on the bottom layer that are only connected to other bottom-layer nodes, causing a hierarchical model, with the dangers and problems mentioned (1. inability of the current system to handle conditionals $P(factor|list\;of\; observations)$, 2. loops).

To enforce this general structure, there are some choices:

 - Have two feature vectors, one for factors and one for outcomes.
 - Have a flag associated with a feature to mark it as a factor.
 - Just being careful in defining the conditional probability distributions.
 
I quite like the simplicity of a single list.
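
A minimal sketch of the single-list-with-a-flag option, together with a check that the structure described above holds: no edge points into a factor, and the connections between observations contain no loops. `check_network` is a hypothetical helper, and the node names are taken from the examples in this section.

```python
def check_network(is_factor, edges):
    """Validate the restricted structure.

    is_factor: dict mapping node name -> True for factors, False for observations
    edges:     list of (parent, child) pairs
    Raises ValueError if an edge points into a factor or the graph has a loop.
    """
    for parent, child in edges:
        if is_factor[child]:
            raise ValueError("edge %s -> %s points into a factor" % (parent, child))

    # Topological sort (Kahn's algorithm): if some nodes are never reached,
    # they sit on a loop.
    children = {n: [] for n in is_factor}
    indegree = {n: 0 for n in is_factor}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = [n for n in is_factor if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if visited != len(is_factor):
        raise ValueError("network contains a loop")
    return True


features = {'age': True, 'gender': True, 'postcode': False, 'facebook_cats': False}
ok = check_network(features, [('age', 'postcode'), ('age', 'facebook_cats'),
                              ('gender', 'facebook_cats')])
```

Running a check like this whenever a dataset class registers its features would catch structural mistakes early, rather than relying only on care when defining the distributions.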

####Facts

To handle 'facts', even if these are 'facts' about probabilities, a new dictionary called 'facts' has been created. This allows information (e.g. from the 'where' class, about the names and likelihoods of each OA) to be passed to the census class without having to go via the probabilistic framework.

The latest version of the network.

![alt text](files/images/bayes5.png "Includes probabilistic output areas and religion.")

###Latest Version notes

####Changes

In this version I make it closer to what the webserver needs

 - glue on the question generation
 - move data to databases for quick access

####Name->Age

We can use their name to infer age. See the answer_babynames.py code for more detail.

 - http://www.ssa.gov/oact/babynames/limits.html
 - http://data.gov.uk/dataset/baby_names_england_and_wales
 - Note: We don't take into account migration
 - Currently only uses most popular 100 names for each decade, so less popular names aren't used (even for ascertaining gender).
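
The idea can be sketched as normalising, for a given name, the number of babies given that name in each decade. This is an illustration only and not necessarily how answer_babynames.py works: `decade_posterior` is a hypothetical helper, the counts are made-up toy numbers (not taken from the SSA or data.gov.uk files), and mortality, migration and who actually uses the site are ignored.

```python
def decade_posterior(name, counts):
    """Return p(decade of birth | name) as a dict, or None for unlisted names.

    counts: dict mapping lower-case name -> {decade: number of babies given that name}
    """
    row = counts.get(name.lower())
    if row is None:
        # Name not in the top-100 lists: no evidence, so the caller keeps its prior
        return None
    total = float(sum(row.values()))
    return dict((decade, n / total) for decade, n in row.items())


# Toy numbers for illustration only
toy_counts = {'alice': {1950: 300, 1980: 100}}

posterior = decade_posterior('Alice', toy_counts)
unknown = decade_posterior('Zaphod', toy_counts)
```

Returning `None` for unlisted names makes the limitation noted above explicit: a name outside the lists contributes no evidence, rather than evidence of being an unusual age.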
 
####Other data sets or questions
 
- Asbos/1000 people
- MMSE or similar cognitive tests?
- 23&me API.
- openGov.
- journey to work data? http://cider.census.ac.uk/cider/wicid/query.php
- Output area boundaries http://census.edina.ac.uk/easy_download_data.html?data=infuse_oa_lyr_2011
- OSM key: http://wiki.openstreetmap.org/wiki/Key:amenity
- Swedish Human Genome http://go.theregister.com/feed/www.theregister.co.uk/2015/04/28/sweden_releases_human_genome_under_cc/
- GDELT http://gdeltproject.org/#downloading
- google street view API: http://maps.googleapis.com/maps/api/streetview?size=400x400&location=%2051.507222,%20-0.1275&sensor=false
- http://aws.amazon.com/public-data-sets/
- http://data.gov.uk/ OBVIOUSLY!

- http://blog.echonest.com/post/11992136676/taste-profiles-get-added-to-the-million-song <<<

Other thoughts:

- nepal and open data: http://www.bond.org.uk/blog/44/nepal-earthquake-open-data-and-the-immediate-response?utm_source=Bond&utm_campaign=7d4354f72f-Your_Network_March_2015_Wk4&utm_medium=email&utm_term=0_9e0673822f-7d4354f72f-247672305

####Regarding differential privacy:

We sort of WANT to 'know too much' during inference. Maybe the inference happens locally, so that's OK. The associated API calls (e.g. to the Census) do reveal location though... can we hack a simple way to avoid this? E.g. make several queries and discard all but the one we need?
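
A rough sketch of the "several queries, discard all but one" idea. `query_with_decoys` and its parameters are hypothetical, and `fetch` stands in for whatever function calls the external API. Note that this is cover traffic rather than differential privacy in the formal sense: it hides the real location among the decoys but gives no formal guarantee.

```python
import random


def query_with_decoys(fetch, real_location, decoy_pool, n_decoys=4, rng=random):
    """Query the real location mixed in with randomly chosen decoys.

    fetch:         function taking a location and returning the API result
    decoy_pool:    list of plausible locations, assumed not to contain real_location
    Returns only the result for real_location; decoy results are discarded.
    """
    batch = rng.sample(decoy_pool, n_decoys) + [real_location]
    rng.shuffle(batch)  # so the real query is not always sent last
    results = dict((location, fetch(location)) for location in batch)
    return results[real_location]
```

The decoys only help if they are plausible and change between sessions in a way that does not let the service intersect batches to recover the one location that is always present.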

####Facebook

What to ask for? We need to ask permission to access these.

- friends_about_me
- friends_activities
- friends_birthday
- friends_education_history
- friends_groups
- friends_hometown
- friends_likes
- friends_location
- friends_photos
- friends_relationships

- user_about_me
- user_birthday <
- user_education_history
- user_events
- user_friends
- user_groups 
- user_hometown
- user_likes <
- user_interests
- user_location
- user_online_presence
- user_photos
- user_posts
- user_relationships
- user_status

- email

####From CitizenMe list

- Facebook: Profile
- Facebook: Likes
- Facebook: Location
- Facebook: Contacts
- Facebook: Posts
- Twitter: Profile
- Twitter: Tweets - text
- Twitter: Tweets - location stamp
- Twitter: Tweets - time stamp
- Google+: Profile
- Google+: Location
- LinkedIn: Profile

###Other notes

- Maybe need to use "DiscreteUniform" for age.

##Next Stage

So far all the questions have been chosen to help gain insight into the person's age and gender. However, we should expand the questions to these three types:

- Questions to help us gain insight (i.e. used immediately by the inference engine).
- Questions to collect data for later (i.e. they might be useful for future insights).
- Questions as answers: "I think you like watching EastEnders, am I right?"

###Todo list

- JS - get next Question Queued - Started work, but skipped for now: this turns out to be quite complicated, as some questions can be generated without knowing what's gone before, while others can't. We also need to keep track of which questions have already been asked, and which are duplicates or near-duplicates.

- Add keystroke collection - DONE
    - Timing of typing DONE 
- OS/browser DONE
- Add Facebook likes - needs to go through REVIEW
- Get inference running constantly to generate better questions
- Personality test questions
    - Psych tests (make them more concrete: "Are you going out tonight?")
    - MMSE?
- IMDB (suggest films - like/dislike)
- Employment/occupation
- Scaling
    - Move Geonames to a locally hosted API
    - Combine API code into a single wrapper function
    - Combine SQL code into a single function (check sqlite's lock behaviour).
    - Ask Morten for feedback about AWS and scaling
    - Transparent webcache
- Account control:
    - logging in (via openID & facebook)
    - delete
    - download
    - upload
    - Decide on licence/terms and conditions: MULTIPLE CHOICE
- Turn into loaded questions: "Am I right you like this movie?"
- Web interface for creating new datasets on backend
- ONS (Marital status, living arrangements, residence type, children, ethnicity, national identity, religion, household size, qualifications, hours worked, occupation, socio-economic class, method of travel to work)
- Mobile support
- "Provide me with a reading now"
- Twitter, geo-tagging, handle
- Differential Privacy
- Ask questions about trending #tag, or news, or TV programmes on
- Generate a list of other sources of questions (e.g. movies, books, music, visited-places, holidays, hobbies, TV programmes, twitter)
- Check values are numbers when we expect them to be.
    - Allow script to unset an answer and ask for it again.

- New Name: SciKick

###Purpose

- How comfortable are people with data being processed like this?
- Increase user understanding.
- Increase our own understanding of what's possible.
- Models of user-centric data modelling
- Public perception of AI
- Provide methods for future models (mental health->case studies?, requires models of normality).


##Implementation Notes

###index.cgi

By default, outputs the user interface, which is fairly basic.

###index.cgi?ajax=on

 - Receives ajax queries (from interface.js).
 - Instantiates the relevant class for the data of interest.
 - Each class has a method for turning the user's reply into a database entry.
 - Selects the next question:
     - ask each class how much information they will recover with a single question
     - call class genQuestion method
     - add to database
     - call the questionToText method and send back in ajax reply
     
###interface.js

Provides:

 - Facebook interface (sends messages to process.cgi in a similar way). TODO Could move this to a server-side interaction?
 - ajax: sends user replies to process.cgi and displays the computer's replies.