# What is the user experience for organic searchers?

This notebook estimates the percentage of users who arrive at pages within wellcomecollection.org/works after clicking a result from Google or a similar search engine. How do their average session length and bounce rate compare with those of all users?

In [None]:
import os

from weco_datascience.reporting import get_recent_data

First, get the data

In [None]:
df = get_recent_data(config=os.environ, n=100000, index="metrics-conversion-prod")

In [None]:
keepers = [
    "@timestamp",
    "anonymousId",
    "session.id",
    "page.name",
    "page.query.id",
    "source",
    "properties.event",
    "type",
    "page.query.query",
]
df2 = df[keepers]

Sessions initiated by organic searches fall into two profiles:
1. users who begin their sessions with a /works page view (aka works_users) and
2. users who begin their sessions with two identical searches (aka searchers)
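
To make "begins a session with" concrete, here is a minimal sketch on invented data. It takes the first event per session rather than per user, which is an assumption of this example and differs from the `anonymousId` deduplication used below; the column names mirror the notebook, the values are made up.

```python
import pandas as pd

# Toy event log; column names mirror the notebook, values are invented.
events = pd.DataFrame(
    {
        "session.id": ["s1", "s1", "s2", "s2", "s3"],
        "@timestamp": pd.to_datetime(
            [
                "2021-01-01 10:00:05",
                "2021-01-01 10:00:00",
                "2021-01-01 11:00:00",
                "2021-01-01 11:00:30",
                "2021-01-01 12:00:00",
            ]
        ),
        "page.name": ["works", "work", "works", "works", "work"],
    }
)

# Order events chronologically within each session, then keep the first one.
first_events = events.sort_values(["session.id", "@timestamp"]).drop_duplicates(
    subset="session.id", keep="first"
)

# Sessions whose first event is a single-work page view.
landed_on_work = first_events.loc[
    first_events["page.name"] == "work", "session.id"
].tolist()
print(landed_on_work)  # ['s1', 's3']
```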

In [None]:
sorted = df2.sort_values(["session.id", "@timestamp"], ascending=[True, True])

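# Keep the first row per user in the sorted order above.
# NOTE: `sorted` shadows the Python builtin. If the assignment above has not
# run, `sorted` is still the builtin function and this call raises an
# AttributeError.
deduped = sorted.drop_duplicates(subset="anonymousId", keep="first")
deduped.head()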

In [None]:
works_users = deduped.loc[
    (deduped["page.name"] == "work")
    & (deduped["page.query.id"].notnull())
    & (deduped["source"] == "unknown")
    & (deduped["properties.event"].isnull())
    & (deduped["type"] == "pageview")
]
works_users.head()

How many users begin their sessions with a works page (profile 1.)?

In [None]:
works_users2 = works_users["anonymousId"].unique()
len(works_users2)

How many users were there in the sample in total?

In [None]:
len(df2["anonymousId"].unique())

Profile 1 users as a percentage of all users

In [None]:
print(100 * len(works_users2) / len(df2["anonymousId"].unique()))
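
The same share can be computed with `nunique()`, which counts distinct users directly. A minimal sketch on invented IDs; the frame names `all_events` and `profile_events` are hypothetical stand-ins for `df2` and `works_users`.

```python
import pandas as pd

# Hypothetical stand-ins for df2 and works_users; the IDs are invented.
all_events = pd.DataFrame({"anonymousId": ["a", "a", "b", "c", "d", "d"]})
profile_events = pd.DataFrame({"anonymousId": ["a", "d", "d"]})

# Count distinct users, not rows, on both sides of the ratio.
total_users = all_events["anonymousId"].nunique()
profile_users = profile_events["anonymousId"].nunique()

share = 100 * profile_users / total_users
print(f"{share:.1f}% of users")
```

Counting distinct users in both the numerator and the denominator keeps the ratio from being inflated by users who generate many events.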

Searchers

In [None]:
searchers = sorted.loc[
    (sorted["page.name"] == "works") & (sorted["page.query.query"].notnull())
]

In [None]:
first_search = searchers.sort_values(
    ["session.id", "@timestamp"], ascending=[True, True]
)
first_search2 = first_search.groupby("session.id").head(2)

In [None]:
first_search2 = first_search2.sort_values(
    ["session.id", "@timestamp"], ascending=[True, False]
)
first_search2["rownum"] = first_search2.index
first_search2["consecutive"] = first_search2["rownum"].diff().eq(1)
first_search2["same_query"] = first_search2["page.query.query"] == first_search2[
    "page.query.query"
].shift(1)

In [None]:
first_search3 = first_search2
first_searchers = first_search3.loc[
    first_search3["same_query"] & first_search3["consecutive"]
]
first_searchers.head()
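
As a cross-check on the logic above, repeated opening searches can also be found by comparing each query with the previous one inside the same session. This is a sketch of an alternative under stated assumptions, not the method used in this notebook: it relies on `groupby(...).shift(1)` instead of index arithmetic, and the data is invented.

```python
import pandas as pd

# Toy search events; column names mirror the notebook, values are invented.
searches = pd.DataFrame(
    {
        "session.id": ["s1", "s1", "s1", "s2", "s2", "s3"],
        "@timestamp": pd.to_datetime(
            [
                "2021-01-01 10:00:00",
                "2021-01-01 10:00:01",
                "2021-01-01 10:03:00",
                "2021-01-01 11:00:00",
                "2021-01-01 11:02:00",
                "2021-01-01 12:00:00",
            ]
        ),
        "page.query.query": ["cholera", "cholera", "maps", "x-ray", "anatomy", "plague"],
    }
)

# First two searches of every session, in chronological order.
searches = searches.sort_values(["session.id", "@timestamp"])
first_two = searches.groupby("session.id").head(2)

# Previous query within the same session (NaN for a session's first search).
previous_query = first_two.groupby("session.id")["page.query.query"].shift(1)

# Sessions whose second search repeats the first.
repeated = first_two.loc[
    first_two["page.query.query"] == previous_query, "session.id"
].tolist()
print(repeated)  # ['s1']
```

Shifting within each group means a query is never compared with the last query of a different session.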

How many searchers are there (profile 2.)?

In [None]:
first_searchers2 = first_searchers["anonymousId"].unique()
len(first_searchers2)

Profile 2 users as a percentage of all users

In [None]:
# Use the unique user IDs (first_searchers2), not the row count of first_searchers.
print(100 * len(first_searchers2) / len(df2["anonymousId"].unique()))

What is the average session length for profile 1?

In [None]:
# All events belonging to sessions that started on a work page.
profile1 = df2[df2["session.id"].isin(works_users["session.id"])]
sortedp1 = profile1.sort_values(["session.id", "@timestamp"], ascending=[True, True])
# First and last event of each session.
firstp1 = sortedp1.drop_duplicates(subset="session.id", keep="first")
lastp1 = sortedp1.drop_duplicates(subset="session.id", keep="last")
keep = ["session.id", "@timestamp"]
lastp1b = lastp1[keep]

In [None]:
import pandas as pd
duration=pd.merge(firstp1, lastp1b, how='left', on="session.id")
duration["from"]=pd.to_datetime(duration['@timestamp_x'], dayfirst=True)
duration["to"]=pd.to_datetime(duration['@timestamp_y'], dayfirst=True)
duration["session_length"]=(abs(duration['to']-duration['from']))

Remove duplicate sessions

In [None]:
duration_dupout = duration.drop_duplicates(
    subset="session.id", keep="first") 
duration_dupout.head(2)

In [None]:
print(duration_dupout["session_length"].mean())
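
The first/last merge above can also be expressed as a single aggregation. A minimal sketch on invented timestamps, assuming session length is defined as last event minus first event, as in the cells above.

```python
import pandas as pd

# Toy event log; values are invented.
events = pd.DataFrame(
    {
        "session.id": ["s1", "s1", "s1", "s2", "s3", "s3"],
        "@timestamp": pd.to_datetime(
            [
                "2021-01-01 10:00:00",
                "2021-01-01 10:02:00",
                "2021-01-01 10:05:00",
                "2021-01-01 11:00:00",
                "2021-01-01 12:00:00",
                "2021-01-01 12:01:00",
            ]
        ),
    }
)

# Earliest and latest event per session in one pass.
bounds = events.groupby("session.id")["@timestamp"].agg(["min", "max"])

# Session length as a Timedelta; single-event sessions come out as zero.
session_length = bounds["max"] - bounds["min"]
print(session_length.mean())
```

Because the result has one row per session by construction, no separate duplicate-removal step is needed in this form.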

What is the average session length for all users?

In [None]:
sorted_all=df2.sort_values(["session.id", "@timestamp"], ascending=[True, True])
firstall=sorted_all.drop_duplicates(subset="session.id", keep="first")
lastall=sorted_all.drop_duplicates(subset="session.id", keep="last")
lastall2=lastall[keep]

duration_all=pd.merge(firstall, lastall2, how='left', on="session.id")
duration_all["from"]=pd.to_datetime(duration_all['@timestamp_x'], dayfirst=True)
duration_all["to"]=pd.to_datetime(duration_all['@timestamp_y'], dayfirst=True)

duration_all["session_length"]=(abs(duration_all['to']-duration_all['from']))

Remove duplicate sessions

In [None]:
duration_all_dupout = duration_all.drop_duplicates(
    subset="session.id", keep="first") 

In [None]:
print(duration_all_dupout["session_length"].mean())

What is the average session length for profile 2?

In [None]:
profile2=df2[df2["session.id"].isin(first_searchers["session.id"])]
sortedp2=profile2.sort_values(["session.id", "@timestamp"], ascending=[True, True])
firstp2=sortedp2.drop_duplicates(subset="session.id", keep="first")
lastp2=sortedp2.drop_duplicates(subset="session.id", keep="last")
lastp2b=lastp2[keep]

In [None]:
durationp2=pd.merge(firstp2, lastp2b, how='left', on="session.id")
durationp2["from"]=pd.to_datetime(durationp2['@timestamp_x'], dayfirst=True)
durationp2["to"]=pd.to_datetime(durationp2['@timestamp_y'], dayfirst=True)
durationp2["session_length"]=(abs(durationp2['to']-durationp2['from']))
#durationp2.head()

Remove duplicate sessions

In [None]:
durationp2_dupout = durationp2.drop_duplicates(
    subset="session.id", keep="first") 

In [None]:
print(durationp2_dupout["session_length"].mean())

How does the distribution of session length for all users compare with Profile 1 and 2 users?
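
Before plotting, a quantile table gives a quick numeric comparison. Session lengths are heavily skewed, so medians and quartiles can be more informative than means. A minimal sketch on invented durations; the `lengths` dictionary is a hypothetical stand-in for the `session_length` columns computed above.

```python
import pandas as pd

# Hypothetical session lengths in seconds for two groups; values are invented.
lengths = {
    "all": pd.Series(pd.to_timedelta([0, 30, 60, 600, 3600], unit="s")),
    "profile_1": pd.Series(pd.to_timedelta([0, 10, 20, 30, 40], unit="s")),
}

# Convert each group to float seconds and take the quartiles side by side.
summary = pd.DataFrame(
    {
        name: series.dt.total_seconds().quantile([0.25, 0.5, 0.75])
        for name, series in lengths.items()
    }
)
print(summary)
```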

In [None]:
#import numpy as np
#import pandas as pd

#%matplotlib inline


What is the distribution of session durations for all users?

In [None]:
# Convert timedeltas to float seconds; .dt.total_seconds() is numeric across pandas versions.
axes = duration_all_dupout["session_length"].dt.total_seconds().plot.hist(bins=100)
axes.set_xlim(0,10000)
axes.set_yscale('log')

What is the distribution of session durations for Profile 1 users?

In [None]:
# Convert timedeltas to float seconds; .dt.total_seconds() is numeric across pandas versions.
axes = duration_dupout["session_length"].dt.total_seconds().plot.hist(bins=100)
axes.set_xlim(0,10000)
axes.set_yscale('log')

What is the distribution of session durations for Profile 2 users?

In [None]:
# Convert timedeltas to float seconds; .dt.total_seconds() is numeric across pandas versions.
axes = durationp2_dupout["session_length"].dt.total_seconds().plot.hist(bins=100)
axes.set_xlim(0,10000)
axes.set_yscale('log')
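
The opening question also mentions bounce rate, which the cells above do not compute. A minimal sketch, assuming a bounce is a session containing exactly one recorded event; that definition is an assumption of this example, and the data is invented.

```python
import pandas as pd

# Toy event log with one row per event; session IDs are invented.
events = pd.DataFrame({"session.id": ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]})

# Number of events recorded in each session.
events_per_session = events.groupby("session.id").size()

# Assumed definition: a bounce is a session with a single event.
bounce_rate = (events_per_session == 1).mean()
print(f"{100 * bounce_rate:.1f}% of sessions bounced")
```

Applying the same calculation to the events of profile 1 or profile 2 sessions would give figures comparable with the all-users rate.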