https://scholarworks.montana.edu/xmlui/handle/1/3507

# Read.Me

What is this?
-------------

This folder contains A/B testing data that corresponds with 
the following Open Access academic research paper:

Young, Scott W.H. (2014) Improving Library User Experience with A/B Testing: Principles and Process. Weave: Journal of Library User Experience. University of Michigan Library. http://dx.doi.org/10.3998/weave.12535642.0001.101


Who created it?
-------------

This data was prepared by Scott W. H. Young, 
Digital Initiatives Librarian at Montana State University.


How was it created?
--------------

The folder contains data exported from Google Analytics and Crazy Egg.


How do I use it?
----------------

The subfolder named "GoogleAnalytics" contains 4 files:
	- Data from Google Analytics Users Flow, in PDF
	- Data from Google Analytics All Pages view, in PDF and XSL
	- Data from Google Analytics Experiments

The subfolder named "CrazyEgg" contains 5 subfolders, 
one for each A/B test variation 
used during the experiment described in the above paper.
Each subfolder contains 3 files:
	- Data visualization of user click behavior, in PDF and JPEG
	- Data for user click behavior, in XSL

Together these files may be used to reconstruct results and
to guide the design of additional A/B tests.


Licensing
----------

This data is licensed CC BY-SA, http://creativecommons.org/licenses/by-sa/4.0/


Contact
--------

For feedback and inquiry, 
	- write swyoung@montana.edu
	- tweet @hei_scott
	- visit http://hellolibrarian.com

# Data analysis

### Objective
To perform a brief analysis of the 5 datasets in preparation for `Multivariate-testing`

### Steps
- Load the required libraries and the 5 datasets
- Preview each dataset as a table and inspect its structure with `str`
- Summarize the findings

In [1]:
# Load the required libraries
library(dplyr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(scales)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ tibble  3.0.3     ✔ purrr   0.3.4
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union



Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard



In [2]:
# Load the click data for the 5 homepage versions
version_1 <- read.csv('CrazyEgg/Homepage Version 1 - Interact, 5-29-2013/Element list Homepage Version 1 - Interact, 5-29-2013.csv')
version_2 <- read.csv('CrazyEgg/Homepage Version 2 - Connect, 5-29-2013/Element list Homepage Version 2 - Connect, 5-29-2013.csv')
version_3 <- read.csv('CrazyEgg/Homepage Version 3 - Learn, 5-29-2013/Element list Homepage Version 3 - Learn, 5-29-2013.csv')
version_4 <- read.csv('CrazyEgg/Homepage Version 4 - Help, 5-29-2013/Element list Homepage Version 4 - Help, 5-29-2013.csv')
version_5 <- read.csv('CrazyEgg/Homepage Version 5 - Services, 5-29-2013/Element list Homepage Version 5 - Services, 5-29-2013.csv')

#### Preview the first rows of each of the 5 datasets

In [3]:
head(version_1,2)

|   | Element.ID | Tag.name | Name | No..clicks | Visible. | Snapshot.information |
|---|---|---|---|---|---|---|
|   | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | 128 | area | Montana State University - Home | 1291 | False | Homepage Version 1 - Interact • http://www.lib.montana.edu/index.php |
| 2 | 69 | a | FIND | 842 | True | created 5-29-2013 • 20 days 4 hours 21 mins • 10283 visits, 3714 clicks |


In [4]:
head(version_2,2)

|   | Element.ID | Tag.name | Name | No..clicks | Visible. | Snapshot.information |
|---|---|---|---|---|---|---|
|   | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | 74 | a | FIND | 502 | True | Homepage Version 2 - Connect • http://www.lib.montana.edu/index2.php |
| 2 | 66 | input | s.q | 357 | True | created 5-29-2013 • 20 days 7 hours 34 mins • 2742 visits, 1587 clicks |


In [5]:
head(version_3,2)

|   | Element.ID | Tag.name | Name | No..clicks | Visible. | Snapshot.information |
|---|---|---|---|---|---|---|
|   | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | 69 | a | FIND | 587 | True | Homepage Version 3 - Learn • http://www.lib.montana.edu/index3.php |
| 2 | 61 | input | s.q | 325 | True | created 5-29-2013 • 20 days 12 hours 21 mins • 2747 visits, 1652 clicks |


In [6]:
head(version_4,2)

|   | Element.ID | Tag.name | Name | No..clicks | Visible. | Snapshot.information |
|---|---|---|---|---|---|---|
|   | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | 74 | a | FIND | 631 | True | Homepage Version 4 - Help • http://www.lib.montana.edu/index4.php |
| 2 | 66 | input | s.q | 364 | True | created 5-29-2013 • 20 days 4 hours 59 mins • 3180 visits, 1717 clicks |


In [7]:
head(version_5,2)

|   | Element.ID | Tag.name | Name | No..clicks | Visible. | Snapshot.information |
|---|---|---|---|---|---|---|
|   | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | 69 | a | FIND | 397 | True | Homepage Version 5 - Services • http://www.lib.montana.edu/index5.php |
| 2 | 61 | input | s.q | 323 | True | created 5-29-2013 • 20 days 4 hours 59 mins • 2064 visits, 1348 clicks |
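
The second row of `Snapshot.information` in each preview carries the visit and click totals for that snapshot. The following is a minimal sketch, in base R, of pulling those totals out of the string so the versions can be compared by clicks per visit. The helper name `parse_snapshot` is hypothetical, and the pattern assumes the "N visits, M clicks" wording seen in the previews above.

```r
# Hypothetical helper: extract visit and click totals from a Crazy Egg
# snapshot string such as
# "created 5-29-2013 • 20 days 4 hours 21 mins • 10283 visits, 3714 clicks".
parse_snapshot <- function(info) {
  # Capture the number before "visits" and the number before "clicks".
  m <- regmatches(info, regexec("([0-9]+) visits, ([0-9]+) clicks", info))[[1]]
  if (length(m) != 3) stop("snapshot string does not match the expected pattern")
  visits <- as.numeric(m[2])
  clicks <- as.numeric(m[3])
  c(visits = visits, clicks = clicks, clicks_per_visit = clicks / visits)
}

# Snapshot strings copied from the previews of versions 1 and 2.
snapshots <- c(
  version_1 = "created 5-29-2013 • 20 days 4 hours 21 mins • 10283 visits, 3714 clicks",
  version_2 = "created 5-29-2013 • 20 days 7 hours 34 mins • 2742 visits, 1587 clicks"
)

# One row per version, one column per extracted value.
totals <- t(sapply(snapshots, parse_snapshot))
print(totals)
```

With the loaded data, the same helper could be called as `parse_snapshot(version_1$Snapshot.information[2])`, assuming the totals always sit in the second row of the export.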


#### Use `str` to inspect the structure of each dataset

In [8]:
# str() prints the structure itself and returns NULL invisibly,
# so it does not need to be wrapped in print()
str(version_1)
str(version_2)
str(version_3)
str(version_4)
str(version_5)

'data.frame':	69 obs. of  6 variables:
 $ Element.ID          : int  128 69 61 67 78 98 62 118 50 87 ...
 $ Tag.name            : chr  "area" "a" "input" "a" ...
 $ Name                : chr  "Montana State University - Home" "FIND" "s.q" "lib.montana.edu/find/" ...
 $ No..clicks          : int  1291 842 508 166 151 102 101 55 46 42 ...
 $ Visible.            : chr  "false" "true" "true" "true" ...
 $ Snapshot.information: chr  "Homepage Version 1 - Interact   •   http://www.lib.montana.edu/index.php" "created 5-29-2013   •   20 days 4 hours 21 mins   •   10283 visits, 3714 clicks" "" "" ...
'data.frame':	58 obs. of  6 variables:
 $ Element.ID          : int  74 66 72 133 103 83 92 67 81 101 ...
 $ Tag.name            : chr  "a" "input" "a" "area" ...
 $ Name                : chr  "FIND" "s.q" "lib.montana.edu/find/" "Montana State University Libraries - Home" ...
 $ No..clicks          : int  502 357 171 83 74 57 53 47 31 31 ...
 $ Visible.            : chr  "true" "true" "true" 

In [15]:
colnames(version_1)

## Data Analysis

- version_1: 69 rows and 6 columns
- version_2: 58 rows and 6 columns
- version_3: 62 rows and 6 columns
- version_4: 57 rows and 6 columns
- version_5: 53 rows and 6 columns

All datasets share the same column names: `Element.ID`, `Tag.name`, `Name`, `No..clicks`, `Visible.`, `Snapshot.information`

#### Data cleaning
- Extra columns
- Inconsistent column names
- Missing data in all datasets
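
The cleaning issues listed above can be sketched on a small example. This is an illustration in base R under stated assumptions, not the notebook's own method: `toy` is a hand-built stand-in that reuses a few values from the `str` output (with the snapshot strings abbreviated), and `clean_elements` is a hypothetical helper name.

```r
# Toy stand-in for one exported element list.
toy <- data.frame(
  Element.ID = c(128L, 69L, 61L),
  Tag.name = c("area", "a", "input"),
  Name = c("Montana State University - Home", "FIND", "s.q"),
  No..clicks = c(1291L, 842L, 508L),
  Visible. = c("false", "true", "true"),
  Snapshot.information = c("Homepage Version 1 - Interact", "created 5-29-2013", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper covering the three cleaning points.
clean_elements <- function(raw) {
  # Extra columns: keep only the element name and its click count.
  out <- raw[, c("Name", "No..clicks")]
  # Inconsistent names: replace the exported names with plain ones.
  names(out) <- c("name", "clicks")
  # Missing data: treat empty names as NA so they show up in NA counts.
  out$name[!is.na(out$name) & out$name == ""] <- NA
  out
}

cleaned <- clean_elements(toy)
# Number of missing values per remaining column.
print(colSums(is.na(cleaned)))
```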

In [10]:
# Align several data frames by row name and bind their columns side by side
multimerge <- function(mylist) {
  ## Mimics a full outer join keyed on row names.
  ## Note: as.matrix() coerces every column to character.

  # Collect the unique row names across all data frames
  unames <- unique(unlist(lapply(mylist, rownames)))

  # Number of unique row names
  n <- length(unames)

  # Pad each data frame to n rows; rows it lacks stay NA
  out <- lapply(mylist, function(df) {
    tmp <- matrix(nrow = n, ncol = ncol(df), dimnames = list(unames, colnames(df)))
    tmp[rownames(df), ] <- as.matrix(df)
    tmp
  })

  # Every padded matrix must carry the same row names in the same order
  stopifnot(all(sapply(out, function(x) identical(rownames(x), unames))))

  # Bind the columns and prefix each with the name of its list element
  bigout <- do.call(cbind, out)
  colnames(bigout) <- paste(rep(names(mylist), sapply(mylist, ncol)), unlist(sapply(mylist, colnames)), sep = "_")
  bigout
}

# Merge the 5 datasets into one wide table
df <- multimerge(list(one = version_1, two = version_2, three = version_3, four = version_4, five = version_5))
# Replace missing values (NA) with 0
df[is.na(df)] <- 0
# Convert the matrix back to a data frame
df <- data.frame(df)
str(df)

'data.frame':	69 obs. of  30 variables:
 $ one_Element.ID            : chr  "128" " 69" " 61" " 67" ...
 $ one_Tag.name              : chr  "area" "a" "input" "a" ...
 $ one_Name                  : chr  "Montana State University - Home" "FIND" "s.q" "lib.montana.edu/find/" ...
 $ one_No..clicks            : chr  "1291" " 842" " 508" " 166" ...
 $ one_Visible.              : chr  "false" "true" "true" "true" ...
 $ one_Snapshot.information  : chr  "Homepage Version 1 - Interact   •   http://www.lib.montana.edu/index.php" "created 5-29-2013   •   20 days 4 hours 21 mins   •   10283 visits, 3714 clicks" "" "" ...
 $ two_Element.ID            : chr  " 74" " 66" " 72" "133" ...
 $ two_Tag.name              : chr  "a" "input" "a" "area" ...
 $ two_Name                  : chr  "FIND" "s.q" "lib.montana.edu/find/" "Montana State University Libraries - Home" ...
 $ two_No..clicks            : chr  "502" "357" "171" " 83" ...
 $ two_Visible.              : chr  "true" "true" "true" "false" ...
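
Every column in the output above is `chr`, including the IDs and click counts, because the data frames pass through `as.matrix()` and a matrix holds a single type. As an alternative, the following is a minimal sketch of a full outer join that keeps the click counts numeric, using base R `merge(all = TRUE)` folded over a list with `Reduce`. It swaps in a different technique from `multimerge`: rows are matched on `Name` rather than on row position, which assumes element names are unique within a version. The toy data frames `v1` and `v2` reuse a few click counts from the exports.

```r
# Toy stand-ins for two versions.
v1 <- data.frame(Name = c("FIND", "s.q", "INTERACT"),
                 No..clicks = c(842L, 508L, 42L),
                 stringsAsFactors = FALSE)
v2 <- data.frame(Name = c("FIND", "s.q", "CONNECT"),
                 No..clicks = c(502L, 357L, 53L),
                 stringsAsFactors = FALSE)

versions <- list(version_1 = v1, version_2 = v2)

# Give each click column a version-specific name before joining.
prepared <- lapply(names(versions), function(v) {
  out <- versions[[v]][, c("Name", "No..clicks")]
  names(out)[2] <- paste0("clicks_", v)
  out
})

# merge(all = TRUE) is a full outer join; Reduce applies it pairwise.
joined <- Reduce(function(a, b) merge(a, b, by = "Name", all = TRUE), prepared)

# Elements absent from a version come back as NA; count them as 0 clicks.
joined[is.na(joined)] <- 0
str(joined)
```

Matching on `Name` puts the same element on one row for every version, which makes per-element comparisons possible. If names repeat within a version, `merge` would multiply rows, so that assumption should be checked on the real data first.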
 

In [11]:
# Checking columns names
colnames(df)

In [12]:
# Rename the click-count columns after their homepage version
df <- df %>%
  rename(
    no_clicks_version_1 = one_No..clicks,
    no_clicks_version_2 = two_No..clicks,
    no_clicks_version_3 = three_No..clicks,
    no_clicks_version_4 = four_No..clicks,
    no_clicks_version_5 = five_No..clicks
  )
colnames(df)

In [13]:
# Keep only the element-name and click-count columns for each version
df <- df %>%
  select(one_Name, no_clicks_version_1, two_Name, no_clicks_version_2, three_Name, no_clicks_version_3,
         four_Name, no_clicks_version_4, five_Name, no_clicks_version_5)
head(df, 10)

|   | one_Name | no_clicks_version_1 | two_Name | no_clicks_version_2 | three_Name | no_clicks_version_3 | four_Name | no_clicks_version_4 | five_Name | no_clicks_version_5 |
|---|---|---|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Montana State University - Home | 1291 | FIND | 502 | FIND | 587 | FIND | 631 | FIND | 397 |
| 2 | FIND | 842 | s.q | 357 | s.q | 325 | s.q | 364 | s.q | 323 |
| 3 | s.q | 508 | lib.montana.edu/find/ | 171 | lib.montana.edu/find/ | 142 | lib.montana.edu/find/ | 139 | lib.montana.edu/find/ | 106 |
| 4 | lib.montana.edu/find/ | 166 | Montana State University Libraries - Home | 83 | Montana State University - Home | 83 | Montana State University - Home | 122 | Search | 85 |
| 5 | REQUEST | 151 | Hours | 74 | Hours | 76 | REQUEST | 72 | Hours | 81 |
| 6 | Hours | 102 | REQUEST | 57 | REQUEST | 63 | Hours | 68 | REQUEST | 57 |
| 7 | Search | 101 | CONNECT | 53 | Search | 50 | Search | 59 | Montana State University - Home | 49 |
| 8 | MSU | 55 | Search | 47 | News | 45 | HELP | 38 | SERVICES | 45 |
| 9 | nav-item-dot | 46 | lib.montana.edu/request/ | 31 | slideshow-right | 35 | News | 26 | News | 24 |
| 10 | INTERACT | 42 | News | 31 | Advanced Search | 26 | Contact Us | 17 | lib.montana.edu/request/ | 22 |
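
The click-count columns in this table are typed `<chr>`, so they need converting before any arithmetic or statistical test. The following is a minimal sketch of that conversion in base R on a toy stand-in; `toy_df` is a hypothetical name and holds only the first two rows of versions 1 and 2, with the padded text form that appears in the `str(df)` output.

```r
# Toy stand-in for the selected columns, with counts stored as text.
toy_df <- data.frame(
  one_Name = c("Montana State University - Home", "FIND"),
  no_clicks_version_1 = c("1291", " 842"),
  two_Name = c("FIND", "s.q"),
  no_clicks_version_2 = c("502", "357"),
  stringsAsFactors = FALSE
)

# Find the click columns by their shared prefix.
click_cols <- grep("^no_clicks_", names(toy_df), value = TRUE)

# as.numeric() ignores the leading padding and returns plain numbers.
toy_df[click_cols] <- lapply(toy_df[click_cols], as.numeric)
str(toy_df)
```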


# A/B/C/D_testing

In [14]:
# df$version_3 and df$version_2 do not exist; use the renamed click columns, converted from text to numeric
t.test(as.numeric(df$no_clicks_version_3), as.numeric(df$no_clicks_version_2), alternative = 'two.sided', var.equal = TRUE)


`x`, `y`: numeric vectors.

`alternative`: the alternative hypothesis. Allowed values are "two.sided" (default), "greater", and "less".

`var.equal`: a logical value indicating whether to treat the two variances as equal. If `TRUE`, the pooled variance is used to estimate the variance; otherwise the Welch approximation is used.
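
The effect of `var.equal` can be seen on a small sample. The following is a minimal sketch that uses only the top ten click counts of versions 3 and 2 from the selected-columns table; it illustrates the arguments and is not an analysis of the full datasets. The vector names `v3` and `v2` are introduced here for illustration.

```r
# Top ten click counts for versions 3 and 2.
v3 <- c(587, 325, 142, 83, 76, 63, 50, 45, 35, 26)
v2 <- c(502, 357, 171, 83, 74, 57, 53, 47, 31, 31)

# Pooled-variance (Student) t-test: assumes equal variances.
pooled <- t.test(v3, v2, alternative = "two.sided", var.equal = TRUE)

# Welch t-test: no equal-variance assumption (the default in t.test).
welch <- t.test(v3, v2, alternative = "two.sided", var.equal = FALSE)

# The pooled test uses n1 + n2 - 2 degrees of freedom;
# the Welch approximation never uses more than that.
print(pooled$parameter)
print(welch$parameter)
```

Because both samples have the same size here, the two t statistics coincide; only the degrees of freedom, and therefore the p-value, can differ.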

In [None]:
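# The formula interface needs `value ~ group`; stack() reshapes the two click columns into `values` and `ind`
t.test(values ~ ind, data = stack(lapply(df[c("no_clicks_version_3", "no_clicks_version_2")], as.numeric)))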

In [None]:
colnames(df)