In [None]:
from problem import * # the code that largely constitutes these experiments.
#from bokeh.io import output_notebook
%matplotlib inline
#output_notebook()
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

This statistical investigation looks at how people use Wikipedia. Wikipedia publishes statistics on how readers arrive at each Wikipedia page, whether through a link from a Google search, a link on another Wikipedia page, and so on. This type of information is sometimes called a 'clickstream', and the next cell shows how it is stored in a SQL database. The exhaustive list of so-called 'referers' is:
- google, called 'other-google' in the table.
- bing, called 'other-bing' in the table
- yahoo, called 'other-yahoo' in the table
- wikipedia's own search feature, called 'other-wikipedia'
- other wikipedia pages. This information is not aggregated; each page which links to the 'current' page has its own count.
- other. 
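
As a rough sketch of how these referer groups could be told apart in code, the helper below maps a raw referer value to one of the groups listed above. The function name `referer_kind`, the group labels it returns, and the rule that any value not starting with `other` is the title of a linking Wikipedia page are assumptions of mine, not something the dataset documentation guarantees.

```python
# Referer values named in the list above, mapped to a short label.
SEARCH_ENGINES = {
    "other-google": "google",
    "other-bing": "bing",
    "other-yahoo": "yahoo",
}


def referer_kind(referer):
    """Classify a raw referer value into one of the groups listed above."""
    if referer in SEARCH_ENGINES:
        return SEARCH_ENGINES[referer]
    if referer == "other-wikipedia":
        return "wikipedia-search"
    # Assumption: any remaining value with the 'other' prefix is the catch-all group.
    if referer.startswith("other"):
        return "other"
    # Assumption: anything else is the title of a Wikipedia page linking here.
    return "wikipedia-link"
```

Grouping the referers this way makes it easy to aggregate the per-page link counts into a single 'from within Wikipedia' total when that is the quantity of interest.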

The raw data for this notebook counts Wikipedia traffic from February 2015. The data is hosted [on datahub](http://datahub.io/dataset/wikipedia-clickstream). The website there makes clear some of the finer points involved in the data collection, but I'll summarize the most relevant ones here:

In [None]:
conn = psycopg2.connect(dbname="wiki", user="shalom", host="localhost", password="")
pd.read_sql("SELECT * FROM wikithresh LIMIT 40",conn)

This information lends itself to many sorts of statistical investigations. The question I'm interested in here is: which categories of Wikipedia pages have characteristic access patterns? A distinct usage pattern might mean:

- a noteworthy proportion of traffic from within Wikipedia compared to traffic from search engines.
- a noteworthy distribution of which search engines people use
- a correlation between some statistic of a category and that of another category
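
To make the idea of a 'proportion of traffic' concrete, here is a minimal sketch that turns per-referer counts for one page into shares of total views. The function name `traffic_shares`, the group labels, and the counts are all made up for illustration; they are not taken from the dataset.

```python
from collections import Counter


def traffic_shares(rows):
    """Return each referer group's share of total views for one page.

    rows: iterable of (group, count) pairs; the group labels are hypothetical.
    """
    totals = Counter()
    for group, count in rows:
        totals[group] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: count / grand_total for group, count in totals.items()}


# Made-up counts, for illustration only.
example = [
    ("google", 600),
    ("bing", 100),
    ("wikipedia-link", 250),
    ("wikipedia-link", 50),
]
shares = traffic_shares(example)
```

Comparing such shares between a category of pages and a random sample is one way to judge whether a proportion is 'noteworthy'.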

Because I'm interested in proportions like these, Wikipedia pages without many views would contribute a prohibitive amount of noise. For instance, there are many pages like ' ' which have only been visited via ' ', and that does not point to interesting results so much as the coincidence of a few people happening to have done ' '. The exact choice of cutoff is. This number was not decided in any interesting, rigorous way. In fact, my first inclination was to make the threshold a parameter to queries by defining a thresholded SQL view. It turns out the queries here are massively sped up by storing an index on 'title', and such bookkeeping requires tables which are not dynamically formed.
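
The trade-off described above (a stored, indexed table rather than a parameterized view) can be sketched roughly as follows. This uses Python's built-in `sqlite3` as a stand-in for the PostgreSQL database used in this notebook; the base table name `wiki`, the column names `referer`, `title`, and `n`, the sample rows, and the cutoff value are all hypothetical. Only the table name `wikithresh` and the index on 'title' come from the notebook itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw clickstream table with made-up rows.
conn.execute("CREATE TABLE wiki (referer TEXT, title TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO wiki VALUES (?, ?, ?)",
    [
        ("other-google", "Page_A", 900),
        ("other-bing", "Page_A", 200),
        ("other-google", "Page_B", 3),
    ],
)

THRESHOLD = 1000  # hypothetical cutoff on total views per page

# Store the thresholded rows in a real table (not a view) so it can be indexed.
conn.execute("CREATE TABLE wikithresh (referer TEXT, title TEXT, n INTEGER)")
conn.execute(
    """
    INSERT INTO wikithresh
    SELECT referer, title, n FROM wiki
    WHERE title IN (
        SELECT title FROM wiki GROUP BY title HAVING SUM(n) >= ?
    )
    """,
    (THRESHOLD,),
)
conn.execute("CREATE INDEX wikithresh_title_idx ON wikithresh (title)")

kept_titles = [
    row[0]
    for row in conn.execute("SELECT DISTINCT title FROM wikithresh ORDER BY title")
]
row_count = conn.execute("SELECT COUNT(*) FROM wikithresh").fetchone()[0]
```

The cost of this design is that changing the cutoff means rebuilding the table and its index, which is why the threshold ends up fixed rather than being a query parameter.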

In order to have a feel for what is 'noteworthy', it is useful to look at some basic descriptive statistics on the data we're working with:

In [None]:
data_characteristics()

Let's first examine the question 'do people use the Microsoft Bing search engine notably often to look at Microsoft-related Wikipedia pages?' I would expect so, because 1) people who use Microsoft products may be the most likely to be reading about Microsoft products and 2) Microsoft makes Bing the default search engine in its products, and many people tend not to switch away from defaults.

In [None]:
import instances.ms
ms_problem = WikiProblem(instances.ms.inst)

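A violin plot shows the distribution of a numeric variable for each of several groups. Like a box plot, it marks summary statistics such as the median and quartiles, but it also draws a smoothed (kernel density) estimate of the distribution, mirrored on both sides of the axis, so the width of the 'violin' at a given value indicates how common that value is.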

In [None]:
ms_problem.discovery()

In [None]:
ms_problem.popularityOf()

It seems _. 

There is a word, 'clickbait', that people use to talk about provocatively titled content that people click on. Here, we consider the categories 'sex', 'drugs', 'danger', and 'politics' as potential clickbait. Let's look at what the numbers say.

In [None]:
import instances.clickbait
clickbait_problem = WikiProblem(instances.clickbait.inst)

In [None]:
clickbait_problem.discovery()

From looking at these plots, it seems sex is clickbait. Drugs are not clickbait. Danger is not clickbait.

I sometimes find myself reading about an author/artist I'm interested in, and while I'm on that page I click on their discography or filmography or whatever out of curiosity. However, I don't remember ever searching for a list of things done by a person. I wonder whether other people have a similar usage pattern. Let's see.

In [None]:
import instances.hierarchical
hierarchical_problem = WikiProblem(instances.hierarchical.inst)

In [None]:
hierarchical_problem.discovery()

I seldom see people post links to Wikipedia pages on Facebook or Twitter. Looking at what people share via social media, or more precisely what people click on via social media, is then more of an unsupervised clustering matter than a hypothesis testing matter.

The source of this data is the work of two people, partly myself and mostly my brother. He has this to say about his methods:

This means that a potentially confounding variable when comparing these categories against random pages is that these pages are ones a human thought of, whereas the random pages were by definition not. What exactly a human, namely Mark, thinks of may be influenced by a great many factors, but the easiest one to test is whether he thought of pages that are viewed more often than the average page. Let's examine that hypothesis.
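
One simple way to probe a hypothesis like this is a permutation test on the median view counts of the two groups. This is not necessarily what the `popularityOf()` method in this notebook does; it is a minimal sketch under my own assumptions, and the function name `permutation_p_value` and its parameters are hypothetical.

```python
import random
from statistics import median


def permutation_p_value(chosen, baseline, trials=2000, seed=0):
    """One-sided permutation test on the difference of medians.

    Estimates how often a random relabelling of the pooled view counts
    gives a median difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = median(chosen) - median(baseline)
    pooled = list(chosen) + list(baseline)
    k = len(chosen)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = median(pooled[:k]) - median(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

A small returned value would suggest that the hand-picked pages really are more popular than randomly sampled ones, which is the bias this section is worried about.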

Another potential source of bias is in parts of speech. Because it is easy to make samples, here we go...

In [None]:
import instances.nountype
nountype_problem = WikiProblem(instances.nountype.inst)

Another source of bias is the possibility that the sampling of non-random categories is skewed towards high-view pages. Such a bias would weaken the suggestion that the previous findings are related to their hypotheses rather than to page popularity.

In [None]:
ms_problem.popularityOf()
nountype_problem.popularityOf()
clickbait_problem.popularityOf()
hierarchical_problem.popularityOf()

In [None]:
import instances.popularity
popularity_q_problem = WikiProblem(instances.popularity.quantile_inst)
popularity_even_problem = WikiProblem(instances.popularity.even_inst)

Maybe the proportion of Wikipedia traffic and the proportion of engines/links cluster together, because they are both 'within the Wikipedia site'. Maybe that applies to the distinction between topics people research and topics people quickly search for and then leave.

In [None]:
# A tool for finding the exact Wikipedia page names corresponding to phrases.
candidate_phrases = ['tupac','David Foster Wallace', 'paternalism']
for phrase in candidate_phrases:
    most_like(phrase)

In [3]:
sns.violinplot??
