# In-lab data vs. Online data

### We want to understand whether data quality differs between in-lab and online data. Past research has shown that in-lab and online data are essentially equivalent in quality (Buhrmester et al., 2011; Gould et al., 2015; Reimers et al., 2015; Crump et al., 2013). To our knowledge, however, this is the first time that experiments framed in a free-operant setting have been conducted online. A free-operant setting is special in that nothing prompts the subjects to act: subjects may respond at a rate of their own choosing within a given period of time, in either a continuous or a discrete manner. Furthermore, we require continuous engagement from subjects.

### We use the following metrics to measure the quality of our data:
* latency per tap
* environment betas per subject
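
The two metrics above can be sketched on synthetic data. This is a minimal illustration, not the analysis used in this notebook: it assumes that an "environment beta" means the per-subject OLS slope of `latency` on `environment_binary`, and the data frame `taps`, the helper `environment_beta`, and all numeric values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the study's data).
rng = np.random.default_rng(0)
rows = []
for subject_id in range(1, 4):
    for env in (0, 1):
        # Assumption: taps are faster (lower latency) in the high environment.
        latency = rng.normal(loc=300 - 40 * env, scale=20, size=50)
        rows.append(pd.DataFrame({"id": subject_id,
                                  "environment_binary": env,
                                  "latency": latency}))
taps = pd.concat(rows, ignore_index=True)


def environment_beta(group):
    """OLS slope of latency on environment_binary for one subject."""
    slope, _intercept = np.polyfit(group["environment_binary"],
                                   group["latency"], deg=1)
    return slope


# Latency per tap is the `latency` column itself; summarise it per subject.
mean_latency = taps.groupby("id")["latency"].mean()

# One slope per subject.
betas = pd.Series({sid: environment_beta(g) for sid, g in taps.groupby("id")},
                  name="environment_beta")

print(pd.DataFrame({"mean_latency": mean_latency, "environment_beta": betas}))
```

With a binary predictor, the fitted slope equals the difference between a subject's mean latency in the two environments, so each beta can be read as "how much slower or faster this subject taps in the high environment".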

### Specifically, we want to answer these two questions:
First, is there evidence of a difference between in-lab data and online data?
<br>
Second, if so, which variables are responsible for these differences?

In [1]:
import scipy.io as sio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import json
import seaborn as sns
from sklearn import linear_model
import sklearn
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import scipy.stats as ss
%matplotlib inline

In [2]:
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', None)

In [3]:
# import in-lab data and convert to a pandas DataFrame
mat = sio.loadmat('untransformeddataforRN.mat', squeeze_me=True)
m = mat['untransformed_data']
in_lab = pd.DataFrame(m)
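in_lab.columns = ['latency', 'price_displayed',
                  'id', 'price_bin', 'apathy', 'environment_binary']
# keep only the columns shared with the online data; .copy() gives an
# independent frame so that later column assignments are unambiguous
in_lab = in_lab[['latency', 'price_displayed', 'id', 'environment_binary']].copy()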

In [4]:
# add a label for all in_lab data
in_lab['label'] = [1]*len(in_lab)
in_lab.head(1)

   latency  price_displayed   id  environment_binary  label
0    128.0              1.2  1.0                 1.0      1


In [5]:
# import online data and add labels for all online data
online = pd.read_csv('fish_100.csv', sep=',')
online['label'] = [0]*len(online)
online['environment_binary'] = np.where(online['environment']=='low', 0, online['environment_binary'])
online['environment_binary'] = np.where(online['environment']=='high', 1, online['environment_binary'])

In [6]:

online = online[['latency', 'price_displayed', 'id', 'environment_binary', 'label']]
                  

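To compare the two datasets, we pool them, label each record by its source (1 for in-lab, 0 for online), and train a binary classifier to predict that label from the other variables. Classifier quality measures such as the area under the ROC curve (AUC) then indicate how different the original datasets are: an AUC near 0.5 (chance level) suggests that the datasets are similar, while an AUC near 1 suggests substantial differences.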
If differences are detected, the random permutation strategy described in the companion vignette “Assessing Variable Importance for Predictive Models of Arbitrary Type” can be applied to determine which variables from the original datasets are most responsible for their differences.
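
The classifier-based comparison and the permutation strategy can be sketched together on synthetic data. This is a standalone illustration under stated assumptions, not the notebook's analysis: the two datasets, the feature names, the 1.5 standard deviation shift, and the choice of logistic regression are all hypothetical, and scikit-learn's `permutation_importance` stands in for the procedure described in the vignette.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
feature_names = ["shifted", "unshifted"]  # hypothetical feature names

# Two synthetic datasets drawn from the same distribution, except that
# the first feature of dataset B is shifted by 1.5 standard deviations.
X_a = rng.normal(0.0, 1.0, size=(n, 2))
X_b = rng.normal(0.0, 1.0, size=(n, 2))
X_b[:, 0] += 1.5

# Pool the datasets and label each record by its source.
X = np.vstack([X_a, X_b])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)
clf = LogisticRegression().fit(X_train, y_train)

# AUC near 0.5: the sources look alike; AUC near 1: the sources differ.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.2f}")

# Shuffle one feature at a time and record the resulting drop in AUC.
result = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0)
for name, drop in zip(feature_names, result.importances_mean):
    print(f"{name}: mean AUC drop {drop:.3f}")
```

In this toy setup only the shifted feature should show a sizeable AUC drop when permuted, which is the pattern one would look for when asking which variables drive a difference between in-lab and online data.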

In [7]:
data = pd.concat([in_lab, online])

In [8]:
data = data.dropna()

In [9]:
X = data[['latency', 'price_displayed', 'id', 'environment_binary']]
y = data['label']


In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [11]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 1.00
Accuracy of K-NN classifier on test set: 1.00
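
A perfect score on both the training and the test set is worth a sanity check. One possible explanation, which is an assumption here and not something this notebook verifies, is that the `id` column alone identifies the source when in-lab and online subject ids do not overlap. The sketch below uses hypothetical synthetic data in which both sources share the same latency distribution but have disjoint id ranges, and compares K-NN accuracy with and without `id`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: identical latency distributions, disjoint id ranges
# (1-20 for the source labelled 1, 1001-1020 for the source labelled 0).
lab = pd.DataFrame({"latency": rng.normal(300, 50, n),
                    "id": rng.integers(1, 21, n),
                    "label": 1})
web = pd.DataFrame({"latency": rng.normal(300, 50, n),
                    "id": rng.integers(1001, 1021, n),
                    "label": 0})
demo = pd.concat([lab, web], ignore_index=True)


def knn_test_accuracy(columns):
    """Held-out accuracy of a default K-NN classifier on the given columns."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[columns], demo["label"], random_state=0)
    return KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)


acc_with_id = knn_test_accuracy(["latency", "id"])
acc_without_id = knn_test_accuracy(["latency"])
print(f"with id:    {acc_with_id:.2f}")
print(f"without id: {acc_without_id:.2f}")
```

If the real data behave like this toy example, refitting the classifiers without `id` would show whether the behavioural variables themselves separate the two sources.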


In [12]:
from sklearn.svm import SVC


In [13]:
#Support Vector Classifier
s_clf = SVC()
s_clf.fit(X_train, y_train)




SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [15]:
s_prediction = s_clf.predict(X_test)
print(s_prediction)

[0 0 0 ... 0 0 0]


In [16]:
from sklearn.metrics import accuracy_score

In [17]:
np.sum(s_prediction)

17211

In [18]:
np.sum(data['label'])

68794

In [20]:
# accuracy_score expects (y_true, y_pred); the value is the same either way
s_acc = accuracy_score(y_test, s_prediction)
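
Accuracy on its own can mislead when the two classes differ in size, which is why it helps to look at label counts alongside it. As a hedged sketch on hypothetical synthetic data (an 80/20 class split with one informative feature), the block below compares an SVC against a majority-class baseline and prints a confusion matrix. `DummyClassifier` and `confusion_matrix` are standard scikit-learn tools, but their use here is a suggestion, not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 800 records of class 0 and 200 of class 1,
# separated along a single feature.
X = np.concatenate([rng.normal(0, 1, 800),
                    rng.normal(3, 1, 200)]).reshape(-1, 1)
y = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
svc = SVC().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
svc_pred = svc.predict(X_test)
svc_acc = accuracy_score(y_test, svc_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, svc_pred)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"SVC accuracy:      {svc_acc:.2f}")
print(cm)
```

A classifier is only informative about differences between the sources to the extent that it beats this baseline; the confusion matrix shows which class the errors fall on.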