# Data Analysis for the Paper: Persist: Persistent and Reusable Interactions in Computational Notebooks
*Kiran Gadhave, Zach Cutler, Alexander Lex*

Here we analyze the data for the user study conducted as part of the paper. The study was conducted as a lab study with participants who were knowledgeable about Python, Jupyter, and Pandas.

Participants completed two similar tasks either in the **Pandas** condition, where they had to write data wrangling code, or in the **Persist** condition, where they were asked to use our novel interactive system. The tasks were to analyze an avalanche and a video game dataset; the operations needed for each step were matched.

There were 11 participants; the order of conditions and the dataset used for each condition were randomized using a Latin square.
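As a rough illustration of Latin-square counterbalancing, the sketch below builds a cyclic Latin square for the two conditions and the two datasets and crosses them into four sequences. The helper `latin_square` and the round-robin assignment of participants to sequences are assumptions made for illustration; the notebook does not record how the actual assignment was generated.

```python
def latin_square(items):
    """Cyclic Latin square: row i is `items` rotated left by i positions."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


conditions = ["Persist", "Pandas"]
datasets = ["Avalanche", "Video Games"]

# Cross the two 2x2 squares: each sequence lists (condition, dataset)
# for the first and the second session of one participant.
sequences = [
    list(zip(cond_row, data_row))
    for cond_row in latin_square(conditions)
    for data_row in latin_square(datasets)
]

# Hypothetical round-robin assignment of PIDs 1..11 (not the study's records).
assignment = {pid: sequences[(pid - 1) % len(sequences)] for pid in range(1, 12)}

for pid, sequence in assignment.items():
    print(pid, sequence)
```

With 11 participants and four sequences, such a scheme cannot be perfectly balanced: three sequences are used three times and one is used twice.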

After each condition, participants were asked about their subjective workload.

The study began with a short survey about participants' backgrounds and ended with a survey and an interview about their impressions of the tools used.

In this notebook we analyze the following aspects:

* The subjective workload
* Task completion times and correctness
* Survey Data


In [192]:
# Install the pinned dependencies below, then restart the runtime
!pip install altair==5.*
!pip install pyarrow==11





## 1. Analyzing Subjective Workload Across Conditions

We use the [NASA TLX](https://humansystems.arc.nasa.gov/groups/tlx/) to assess the subjective workload of participants in the two conditions, *Pandas* and *Persist*. The test was administered via the TLX iPad Version.

The iPad Version uses a 100-point scale. The items and scales on the test are the following:

* **Mental Demand** - How mentally demanding was the task? (*Very Low* to *Very High*)
* **Physical Demand** - How physically demanding was the task? (*Very Low* to *Very High*)
* **Temporal Demand** - How hurried or rushed was the pace of the task? (*Very Low* to *Very High*)
* **Performance** - How successful were you in accomplishing what you were asked to do? (*Good* to *Poor*)
* **Effort** - How hard did you have to work to accomplish your level of performance? (*Very Low* to *Very High*)
* **Frustration** - How insecure, discouraged, irritated, stressed and annoyed were you? (*Very Low* to *Very High*)



In [193]:
import pandas as pd
import os
import altair as alt
from scipy import stats
import numpy as np
alt.__version__

'5.2.0'

In [194]:
data_path = "https://raw.githubusercontent.com/visdesignlab/persist_examples/main/study/analysis"

post_study_path = data_path + "/post-study-survey.csv"
pre_study_path = data_path + "/pre-study-survey.csv"
tasks_path = data_path + "/task-time-processed.csv"
tlx_path = data_path + "/tlx.csv"
reprod_path = data_path + "/reproducibility.csv"

In [195]:
# setting colors for conditions

PERSIST_COLOR = "#972D07"
PANDAS_COLOR = "#009FB7"

In [196]:
tlx_df = pd.read_csv(tlx_path)
tlx_df.head()
tlx_df.columns.to_list()

['PID',
 'Order',
 'Condition',
 'Dataset',
 'Mental Demand',
 'Physical Demand',
 'Temporal Demand',
 'Performance',
 'Effort',
 'Frustration']

**Note:** The performance score is inverted in the dataset because it's the only score that's reported from "good" to "bad". In the above table *Perfect* is 100, and *Failure* is 0.
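The inversion described in the note can be sketched as a subtraction from the top of the 100-point scale. We assume here that a raw rating of 0 corresponds to *Good* and 100 to *Poor*; the function name and the example ratings are hypothetical and are not taken from the study data.

```python
SCALE_MAX = 100  # the iPad version uses a 100-point scale


def invert_performance(raw_score):
    """Map a raw rating (assumed 0 = good, 100 = poor) so that 100 = perfect."""
    return SCALE_MAX - raw_score


raw_ratings = [0, 25, 100]  # hypothetical raw ratings, not study data
inverted = [invert_performance(r) for r in raw_ratings]
print(inverted)
```

After this step, higher is better for Performance, while higher remains worse for the other five items.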

We separate out the "Condition" column to have all of the relevant categories as values in a tidy dataset.

### Plotting Individual Values for TLX
Below we plot each participant's results for both conditions. Note that the performance scale was inverted.

In [197]:
dims=['Mental Demand','Physical Demand','Temporal Demand','Performance','Effort','Frustration']

base = alt.Chart().encode(
    x=alt.X("Condition:N", title=None).sort(["Persist", "Pandas"]),
    y=alt.Y(alt.repeat("row"), type="quantitative").scale(domain=(0, 100)),
    color=alt.Color("Condition:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])),
)

dot_chart = base.mark_point(thickness=3, opacity=1, tooltip=True).encode(
    shape="Dataset:N"
).properties(
    height=70,
)
text_overlay = base.mark_text(dy=-10, opacity=1).encode(
    text=alt.Text(alt.repeat("row"), type="quantitative"),
)

(dot_chart + text_overlay).facet(
    column=alt.Facet("PID:N", title=None).sort(alt.Sort(field="Mental Demand")),
    data=tlx_df
  ).repeat(
    row=dims,
)

### Check Condition, Sequence and Dataset Effects

We compare the data faceted by three categorical variables:

* Condition; we expect to see a difference
* Order; we expect to see no or a minor order-effect (tasks completed second perform better)
* Dataset; we expect to see no difference
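A numeric counterpart to these visual checks is the mean score per level of each variable, which is what the mean tick in the plots encodes. The sketch below uses plain Python and a handful of made-up rows; `mean_by` is a hypothetical helper and the values are not study data.

```python
from collections import defaultdict
from statistics import mean


def mean_by(rows, factor, score):
    """Mean of `score` for each level of the categorical column `factor`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[score])
    return {level: mean(values) for level, values in groups.items()}


# Hypothetical rows for illustration only.
rows = [
    {"Condition": "Persist", "Order": 1, "Dataset": "Avalanche", "Frustration": 10},
    {"Condition": "Pandas", "Order": 2, "Dataset": "Video Games", "Frustration": 60},
    {"Condition": "Pandas", "Order": 1, "Dataset": "Avalanche", "Frustration": 40},
    {"Condition": "Persist", "Order": 2, "Dataset": "Video Games", "Frustration": 20},
]

for factor in ("Condition", "Order", "Dataset"):
    print(factor, mean_by(rows, factor, "Frustration"))
```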

To see whether task order or dataset has a strong effect on the scores, we plot each of these splits below:

In [198]:
## Beeswarm helper
def split_beeswarm(data, split_by, shape=None):
  base = alt.Chart().encode(
      x=alt.X(f"{split_by}:N", title=None).sort(["Persist", "Pandas"]),
      y=alt.Y("mean(Score):Q", title="Score").axis(titleFontSize=15, labelFontSize=15),
      color=alt.Color(f"{split_by}:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])).legend(labelFontSize=15, titleFontSize=15),
  )

  points = base.mark_point(opacity=1).transform_calculate(
    jitter="sqrt(-16*log(random()))*cos(16*PI*random())"
  ).encode(
      xOffset="jitter:Q",
  )

  if shape:
    points = points.encode(
      shape=alt.Shape(f"{shape}:N").legend(labelFontSize=15, titleFontSize=15)
    )

  # b = alt.Chart().mark_rule(orient="horizontal", thickness=2, opacity=0.6).transform_fold(
  mean = base.mark_tick(orient="horizontal", thickness=3, opacity=0.6, color="red")


  return alt.layer(points, mean).transform_fold(
      dims,
      as_=["ScoreType", "Score"]
  ).properties(
      width=100,
      height=150
  ).facet(
        column=alt.Column("ScoreType:N", title=None, header=alt.Header(labelFontWeight="bold", labelFontSize=13)),
        data=data
      ).configure_tick(
      bandSize=30
  )

#### Condition: Persist vs. Pandas

In [199]:
split_beeswarm(tlx_df, "Condition")

We see a fairly strong trend for all dimensions, with one exception: **Physical Demand**, which we did not expect to be a relevant metric for this study. Participant statements on this question included "it's only moving a mouse and a keyboard". Hence, we omit Physical Demand from all further analysis.
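Omitting a dimension amounts to filtering the list of score columns before folding the wide table into one row per score. The following sketch shows that idea in plain Python; `fold` is a hypothetical helper that only imitates the wide-to-long reshaping, and the example row is made up.

```python
dims = ["Mental Demand", "Physical Demand", "Temporal Demand",
        "Performance", "Effort", "Frustration"]

# Keep every dimension except Physical Demand.
kept_dims = [d for d in dims if d != "Physical Demand"]


def fold(rows, columns, id_columns):
    """Wide to long: one output row per (input row, score column)."""
    return [
        {**{k: row[k] for k in id_columns}, "ScoreType": col, "Score": row[col]}
        for row in rows
        for col in columns
    ]


# Hypothetical participant row, not study data.
wide = [{"PID": 1, "Condition": "Persist", "Mental Demand": 20,
         "Physical Demand": 5, "Temporal Demand": 15,
         "Performance": 90, "Effort": 25, "Frustration": 10}]

long_rows = fold(wide, kept_dims, ["PID", "Condition"])
for row in long_rows:
    print(row)
```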

#### Datasets: Avalanche vs. Video Games

In [200]:
split_beeswarm(tlx_df, "Dataset", shape="Condition")

There are no obvious effects of the dataset used in most dimensions. In the Frustration and Performance dimensions, however, the means are separated by about 20 points, hinting that the video games dataset might have been slightly more difficult to work with.

#### Order

In [201]:
split_beeswarm(tlx_df, "Order", shape="Condition")

We can see a few order effects: Frustration and, to a lesser degree, Effort, Mental Demand, and Temporal Demand are rated worse in the second task. Looking at the condition, it seems that participants who first completed the Persist condition and then the Pandas condition were more negative about Pandas than those who completed Pandas first. We speculate that this is because those participants had already experienced a more efficient approach in the first condition.

### Workload: Plotting an Empirical CDF for the TLX Results (Paper Figure)

**Figure `workload`**

We are plotting an [Empirical Cumulative Distribution Function](https://en.wikipedia.org/wiki/Empirical_distribution_function) to show our results.
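For readers unfamiliar with the construction, an empirical CDF sorts the observations and pairs each with the fraction of observations at or below it. The sketch below uses hypothetical scores; note that the chart in this notebook plots the row number on the x-axis and the score on the y-axis, i.e. an unnormalized, transposed view of the same idea.

```python
def ecdf(values):
    """Return (value, cumulative fraction) pairs in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


scores = [35, 10, 20, 10]  # hypothetical scores, not study data
points = ecdf(scores)
for value, fraction in points:
    print(value, fraction)
```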


In [202]:
fontSize = 20

base = alt.Chart().mark_tick(orient="horizontal", thickness=3, opacity=1).encode(
    x=alt.X("Index:Q", title=None).sort('y').axis(titleFontSize=fontSize, labelFontSize=fontSize),
    y=alt.Y("Score:Q").axis(titleFontSize=fontSize, labelFontSize=fontSize),
    color=alt.Color("Condition:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])).legend(labelFontSize=fontSize, titleFontSize=fontSize, symbolStrokeWidth=3, symbolSize=200),
).properties(
    height=250,
    width=250
)

# base

line_chart = base.mark_line(
    interpolate="step-after",
  strokeWidth=3
)

mean_overlay = alt.Chart().mark_rule(strokeWidth=3, opacity=0.2).encode(
    y="mean(Score):Q",
    color=alt.Color("Condition:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])).legend(labelFontSize=fontSize, titleFontSize=fontSize, symbolStrokeWidth=3, symbolSize=200),
)

value_overlay = alt.Chart().mark_text(dx=20).encode(
    text=alt.Text("mean(Score):Q", format=".2f"),
    x=alt.datum(11),
    y="mean(Score):Q",
    color=alt.Color("Condition:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])).legend(labelFontSize=fontSize, titleFontSize=fontSize, symbolStrokeWidth=3, symbolSize=200),
)

(line_chart + value_overlay + mean_overlay).transform_fold(
    list(filter(lambda x: "Physical" not in x, dims)),
    as_=["ScoreType", "Score"]
).transform_window(
  sort=[alt.SortField('Score', order='ascending')],
  groupby=['Condition', 'ScoreType'],
  Index="row_number()"
).facet(
    column=alt.Column("ScoreType:N",  title=None, header=alt.Header(labelFontWeight="bold", labelFontSize=fontSize)),
    data=tlx_df
)

## Task Time and Correctness Analysis

The tasks that participants completed are the following:

* "Col Delete (1a)" - deleting certain columns
* "Col Rename (1b)" - renaming a series of columns
* "Col dtype (1c)" - changing the data type of columns
* "Filter (2a)" - filtering outliers
* "Filter (2b)" - filtering ranges
* "Categorization (3a)" - categorizing data values
* "Analysis (3b)" - analyzing results

Task 3b was identical between the two conditions, as it did not make use of either Persist or Pandas, but rather asked participants to interpret data. We included the task to give "closure" to participants, as they will have learned something from the study.


In [203]:
tasks_df = pd.read_csv(tasks_path)
tasks_df.fillna("-", inplace=True)

tasks_df['Duration'] = pd.to_timedelta(tasks_df['Duration'])
tasks_df['duration_seconds'] = tasks_df['Duration'].dt.total_seconds()
tasks_df['Duration'] = tasks_df['Duration'].astype(str) # for altair support

task_name_map = {
    "1a": "Delete Columns (1a)",
    "1b": "Rename Columns (1b)",
    "1c": "Change Column Type (1c)",
    "2a": "Filter Outliers (2a)",
    "2b": "Filter Ranges (2b)",
    "3a": "Assign Categories (3a)",
    "3b": "Analysis (3b)",
}

row_order = ["Delete Columns (1a)","Rename Columns (1b)","Change Column Type (1c)","Filter Outliers (2a)","Filter Ranges (2b)","Assign Categories (3a)", "Analysis (3b)",]

tasks_df["TaskID"] = tasks_df["TaskID"].map(task_name_map)
tasks_df.head()

Unnamed: 0,PID,Condition,Dataset,Order,Task,Sub Task,TaskID,Status,Duration,LLM,Search,Notes,duration_seconds
0,1,Persist,Avalanche,1,1,a,Delete Columns (1a),Correct,0 days 00:02:04,False,False,-,124.0
1,1,Persist,Avalanche,1,1,b,Rename Columns (1b),Correct,0 days 00:01:14,False,False,-,74.0
2,1,Persist,Avalanche,1,1,c,Change Column Type (1c),Correct,0 days 00:01:00,False,False,-,60.0
3,1,Persist,Avalanche,1,2,a,Filter Outliers (2a),Correct,0 days 00:02:30,False,False,-,150.0
4,1,Persist,Avalanche,1,2,b,Filter Ranges (2b),Correct,0 days 00:01:00,False,False,-,60.0


In [204]:
DURATION_CLAMP = 1200

base = alt.Chart().mark_bar(opacity=0.8).encode(
    x=alt.X("Condition:N").sort(["Persist", "Pandas"]),
    y=alt.Y("duration_seconds:Q").scale(domainMax=DURATION_CLAMP, clamp=True),
).properties(
    height=80,
    width=60
)

text = alt.Chart().mark_text(dy=-5, size=13).encode(
    text=alt.Text("duration_seconds:Q", format=""),
    x=alt.X("Condition:N", title=None).sort(["Persist", "Pandas"]),
    y=alt.Y("duration_seconds:Q", title=None).scale(domainMax=DURATION_CLAMP, clamp=True),
    color=alt.Color("Status:N").scale(alt.Scale(range=['green', 'orange', 'gray', 'red'])),
)

clamped = alt.Chart().mark_text(dy=-5, dx=-18, size=15, fontWeight="bold").transform_filter(
    alt.datum.duration_seconds > DURATION_CLAMP
).encode(
    text=alt.TextValue("*"),
    x=alt.X("Condition:N", title=None).sort(["Persist", "Pandas"]),
    y=alt.Y("duration_seconds:Q", title=None).scale(domainMax=DURATION_CLAMP, clamp=True),
    color=alt.Color("Status:N").scale(alt.Scale(range=['green', 'orange', 'gray', 'red'])),
)


(base + text + clamped).facet(
      column="PID:N",
      row=alt.Row("TaskID").sort(row_order),
      data=tasks_df
)

**TODO**

* [x] Make horizontal
* [x] compute statistical tests - mann-whitney(maybe wilcoxon) and cohen's d, but just for the condition example
* [x] include statistical info on chart
* [x] clamp to 1200sec.
* [ ] Try strip plot without jitter
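To make the two statistics named in the list concrete, here is a dependency-free sketch. `cohens_d` uses the root mean of the two sample variances as the pooled standard deviation, and `wilcoxon_w` computes the signed-rank statistic as the smaller of the positive and negative rank sums, dropping zero differences and averaging tied ranks. We believe this matches SciPy's default two-sided statistic, but treat that as an assumption; no p-value is computed, and the inputs are made up.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d with pooled SD = sqrt((s1^2 + s2^2) / 2), sample variances."""
    pooled_sd = sqrt((variance(group1) + variance(group2)) / 2)
    return (mean(group1) - mean(group2)) / pooled_sd


def wilcoxon_w(x, y):
    """Signed-rank statistic min(W+, W-) for paired samples x and y."""
    # Drop pairs with zero difference, then order by absolute difference.
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical durations in seconds, not study data.
persist = [10, 20, 30, 40]
pandas_times = [12, 18, 35, 33]
print(cohens_d([2, 4, 6], [1, 3, 5]))
print(wilcoxon_w(persist, pandas_times))
```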

In [205]:
def add_stats(df):
  df = df.copy()

  def calculate_ci(group):
      sem = group['duration_seconds'].sem()  # Standard Error of the Mean
      mean = group['mean'].iloc[0]  # Get the already calculated mean
      ci = stats.t.interval(0.95, len(group)-1, loc=mean, scale=sem)  # 95% CI
      group['ci0'] = ci[0]
      group['ci1'] = ci[1]
      return group

  df['mean'] = df.groupby(['TaskID', 'Condition',], group_keys=False)['duration_seconds'].transform('mean')
  df = df.groupby(['TaskID', 'Condition'], group_keys=False).apply(calculate_ci)
  df['mean_ci_formatted'] = df.apply(lambda row: f"{row['ci0']:.2f} — {row['mean']:.2f} — {row['ci1']:.2f}", axis=1)

  def cohens_d(group1, group2):
      mean1, mean2 = np.mean(group1), np.mean(group2)
      pooled_std = np.sqrt((np.std(group1, ddof=1) ** 2 + np.std(group2, ddof=1) ** 2) / 2)
      return (mean1 - mean2) / pooled_std

  grouped = df.groupby(['TaskID', 'Condition'], group_keys=False)

  cohens_d_values = {}
  for task, group in grouped:
      if task[0] not in cohens_d_values:
          cohens_d_values[task[0]] = cohens_d(grouped.get_group((task[0], 'Persist'))['duration_seconds'],
                                              grouped.get_group((task[0], 'Pandas'))['duration_seconds'])

  # Map Cohen's d values to the original DataFrame
  df['d'] = df['TaskID'].map(cohens_d_values)

  # Pivot DataFrame to get paired samples in separate columns
  pivot_df = df.pivot_table(index=['PID', 'TaskID'], columns='Condition', values='duration_seconds').reset_index()

  # Calculate Wilcoxon test for each task
  wilcoxon_results = pivot_df.groupby('TaskID').apply(lambda x: stats.wilcoxon(x['Persist'], x['Pandas'])).reset_index()

  # Extract W and p-values
  wilcoxon_results[['W', 'p']] = pd.DataFrame(wilcoxon_results[0].tolist(), index=wilcoxon_results.index)

  # Drop the old column containing the tuple of results
  wilcoxon_results = wilcoxon_results.drop(columns=[0])
  # Merge these results back to the original DataFrame
  df = df.merge(wilcoxon_results[['TaskID', 'W', 'p']], on='TaskID', how='left')
  df['n'] = df.groupby(['TaskID', 'Condition'])['PID'].transform('count')

  df['test_results'] = df.apply(lambda row: f"n={int(row['n'])}\nW={row['W']:.2f}\np={row['p']:.3f}\nd={row['d']:.3f}", axis=1)
  # df['test_results'] = df.apply(lambda row: f"n={int(row['n'])}, W={row['W']:.2f}, p={row['p']:.3f}, d={row['d']:.3f}", axis=1)

  return df

In [206]:
## Beeswarm helper (different from tlx - repeat vs facet column)
def split_beeswarm_facet(data, split_by, shape=None, exclude_tasks=[],clip=None, orient='vertical', stats=None, legend=True):
  data = data.copy()
  data = data[~data['TaskID'].isin(exclude_tasks)]

  base = alt.Chart().encode(
      y=alt.Y(f"{split_by}:N", title=None).sort(["Persist", "Pandas"]).axis(titleFontSize=15, labelFontSize=15),

  )
  if not legend:
    base = base.encode(
      color=alt.Color(f"{split_by}:N", legend=None).sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR]))
    )
  else:
      base = base.encode(
      color=alt.Color(f"{split_by}:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])).legend(labelFontSize=15, titleFontSize=15),
    )


  point = base.mark_point(opacity=0.5, tooltip=True).transform_calculate(
    jitter='sqrt(-2*log(random()))*cos(2*PI*random())'
  ).encode(
      x=alt.X("duration_seconds:Q", title=None).scale(domainMax=DURATION_CLAMP, clamp=True).axis(titleFontSize=15, labelFontSize=15),
      yOffset=alt.YOffset("jitter:Q"),
      shape=alt.condition(alt.datum.duration_seconds > DURATION_CLAMP, alt.ShapeValue('triangle'), alt.ShapeValue('circle'))
  )




  if shape:
    point = point.encode(
      shape=alt.Shape(f"{shape}:N").legend(labelFontSize=15, titleFontSize=15)
    )


  mean_point = base.mark_point(size=15).encode(
      x=alt.X("mean(duration_seconds):Q", title="Seconds").scale(domainMax=DURATION_CLAMP, clamp=True).axis(titleFontSize=15, labelFontSize=15)
  )

  ci_rule = base.mark_rule(thickness=2, opacity=1, color="red", strokeWidth=2).encode(
      x=alt.X("ci0(duration_seconds):Q", title=None).scale(domainMax=DURATION_CLAMP, clamp=True).axis(titleFontSize=15, labelFontSize=15),
      x2=alt.X2("ci1(duration_seconds):Q", title=None),
  )

  chart =  alt.layer(point, mean_point, ci_rule).properties(
        width=300,
        height=100,
    )
  if stats:
    data = add_stats(data)

    mean_ci_text = base.mark_text(dx=-80, dy=-13, size=13).encode(
        text=alt.Text("max(mean_ci_formatted):N"),
        x=alt.datum(1200)
    )

    stat_text = base.mark_text(dx=-95, dy=-82, size=13).transform_filter(
      alt.datum.Condition == "Pandas"
    ).encode(
        text=alt.Text("max(test_results):N"),
        color=alt.value('gray'),
        x=alt.datum(1200)
    )

    chart =  alt.layer(point, mean_point, ci_rule, mean_ci_text, stat_text).properties(
        width=300,
        height=100,
    )

  vchart = None

  for key, df in data.groupby('Task'):
    c = chart.transform_calculate(
        tid=alt.datum["Task"]+alt.datum["Sub Task"]
    ).facet(
        column=alt.Column("TaskID:N", title=None,
                          header=alt.Header(
                              labelFontWeight="bold",labelFontSize=13
                              )
                          ).sort(df["Sub Task"].tolist()
            ),
        data=df
      )

    if not vchart:
      vchart = c
    else:
      vchart &= c

  return vchart

split_beeswarm_facet(tasks_df, "Condition", stats=True, legend=False)

### Time: Task Completion Times by Condition (Paper Figure)

**Figure `time`**

TODOs:
* [x] the Jitter doesn't seem centered around the man plot. It seems to go down much further than up. **Fixed to add a gaussian jitter rather than just random**
* [x] Since we have three columns, this needs to get less tall. Reducing the jittered space by 50% could fix the problem. **Above + reduced height of each plot 100**
* [x] 3b needs to come after 3a. **Added sorting by subtask id instead of label**
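The Gaussian jitter mentioned in the first item corresponds to the Box-Muller transform, which turns two uniform random numbers into one standard-normal value; the Vega expression `sqrt(-2*log(random()))*cos(2*PI*random())` used in the helper has this form. Below is a plain-Python sketch of the same formula. The function name, the seed, and the sample size are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, pstdev


def gaussian_jitter(rng):
    """One standard-normal draw via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [gaussian_jitter(rng) for _ in range(20000)]

sample_mean = mean(samples)
sample_sd = pstdev(samples)
print(round(sample_mean, 3), round(sample_sd, 3))
```

Because the draws are symmetric around zero, the jittered points spread evenly above and below the row's baseline, unlike an offset drawn from a uniform distribution on a one-sided range.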

In [207]:
split_beeswarm_facet(tasks_df, "Condition", stats=True, legend=False)

### Dataset

In [208]:
split_beeswarm_facet(tasks_df, "Dataset", "Condition", orient="horizontal")

### Order

In [209]:
split_beeswarm_facet(tasks_df, "Order", "Condition", orient="horizontal")

## Correctness (Paper Figure)

* two bar charts, one per condition, showing correct, wrong, skipped, and partially wrong.
* also include count of number of reproducible notebooks
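The counts behind such bar charts are plain tallies of status values per condition. A minimal sketch with `collections.Counter` follows; the records and the status labels other than "Correct" are hypothetical stand-ins, since the actual values live in the task data.

```python
from collections import Counter

# Hypothetical (condition, status) records, not study data.
records = [
    ("Persist", "Correct"), ("Persist", "Correct"), ("Persist", "Skipped"),
    ("Pandas", "Correct"), ("Pandas", "Wrong"), ("Pandas", "Skipped"),
]

counts = Counter(records)

for condition in ("Persist", "Pandas"):
    # Collect the tallies belonging to this condition.
    per_status = {status: n for (cond, status), n in counts.items() if cond == condition}
    print(condition, per_status)
```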

**Figure `correctness`**

TODO:
* remove Condition legend

In [210]:
bars = alt.Chart().mark_bar(tooltip=True).encode(
    x=alt.X("count()", title="# of tasks"),
    y=alt.Y("Status:N", title=None),
    color=alt.Color("Condition:N").sort(['Persist', 'Pandas']).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR]))
    ).properties(
    width=100,
    height=90)

text_overlay = alt.Chart().mark_text(dx=12, size=13).encode(
    y="Status:N",
    x="count():Q",
    text=alt.Text("count():Q"),
    color=alt.Color("Condition:N").sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR]))
)

(bars + text_overlay).facet(
    column=alt.Column("Condition:N",
                      title=None,
                      header=alt.Header(
                          labelFontWeight="bold",
                          labelFontSize=13)
                      ).sort(['Persist', 'Pandas']),
    data=tasks_df
)

In [211]:
# Add failed to reproduce

# Horizontal

In [212]:
reprod_df = pd.read_csv(reprod_path)
reprod_df.loc[:, "Reproducible"] = reprod_df["Reproducible"].apply(lambda x: "Yes" if x else "No")
non_reprod = reprod_df[reprod_df['Reproducible'] == "No"]
non_reprod

Unnamed: 0,PID,Order,Condition,Dataset,Reproducible,Skipped,Reasons
6,4,1,Pandas,Video Games,No,True,3a - cannot setitem on a Categorical with new ...
9,5,2,Pandas,Video Games,No,False,"does not break, but has incorrect output for 2..."
13,7,2,Pandas,Avalanche,No,True,"1c - Unable to parse string - skipped, 3a - sy..."
16,9,1,Pandas,Video Games,No,True,"1a - Key error - skipped, 3a - TypeError - ski..."
19,10,2,Pandas,Video Games,No,True,3a - ValueError length does not match. Skipped


In [213]:
alt.Chart(reprod_df).mark_bar().encode(
    x=alt.X("Reproducible:N", title=None),
    y=alt.Y("count()", title=None),
    color=alt.Color("Condition:N", legend=None).sort(["Persist", "Pandas"]).scale(alt.Scale(range=[PERSIST_COLOR, PANDAS_COLOR])),
    column=alt.Column("Condition:N",
                      title=None,
                      header=alt.Header(
                          labelFontWeight="bold",
                          labelFontSize=13)
                      ).sort(['Persist', 'Pandas']),
).properties(
    title="Reproducible Notebooks"

)

## Survey Analysis

* plot time vs experience with pandas in scatterplot

In [214]:
pre_study_df = pd.read_csv(pre_study_path)
pre_study_df = pre_study_df.fillna("-")
pre_study_df.head()

Unnamed: 0,PID,Python Reported Experience,Data Analysis Language,Pandas Reported Exp,Data Wrangling Reported Exp,Vis Libraries Used,Degree Enrolled,Non Coursework Data Exp,Domain Exp
0,1,3,Python,3,3,"matplotlib, seaborn",Undergrad,N,-
1,2,4,Python,3,3,"matplotlib, seaborn",MS,Y,Unknown
2,3,3,Python,3,3,"matplotlib, seaborn",MS,N,-
3,4,2,Python,1,1,matplotlib,Undergrad,N,-
4,5,4,Python,4,4,"matplotlib, pyplot",MS,N,-


### Overview of self-reported scores & demographics

In [215]:
pre_study_df.describe()

Unnamed: 0,PID,Python Reported Experience,Pandas Reported Exp,Data Wrangling Reported Exp
count,11.0,11.0,11.0,11.0
mean,6.0,3.636364,3.0,3.181818
std,3.316625,1.120065,1.264911,0.873863
min,1.0,2.0,1.0,1.0
25%,3.5,3.0,2.5,3.0
50%,6.0,4.0,3.0,3.0
75%,8.5,4.5,4.0,4.0
max,11.0,5.0,5.0,4.0


In [216]:
alt.Chart(pre_study_df).mark_bar(tooltip=True).encode(
    x="Degree Enrolled:N",
    y="count():Q",
) |  alt.Chart(pre_study_df).mark_bar(tooltip=True).encode(
    x="Non Coursework Data Exp:N",
    y="count():Q",
)

Here we plot each participant's self-reported experience with Pandas against the time they took on each task in the Pandas condition.
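The data preparation for such a plot is a left join of the task rows with the experience ratings on `PID`, followed by a filter on the Pandas condition. The sketch below imitates this in plain Python and additionally averages the durations per experience level as a numeric summary. The durations are made up, and `left_join` is a hypothetical helper rather than part of the notebook.

```python
from collections import defaultdict
from statistics import mean


def left_join(rows, lookup, key, new_field):
    """Attach lookup[row[key]] to every row; missing keys yield None."""
    return [{**row, new_field: lookup.get(row[key])} for row in rows]


# Hypothetical task rows (durations are not study data).
task_rows = [
    {"PID": 1, "Condition": "Pandas", "duration_seconds": 200.0},
    {"PID": 1, "Condition": "Persist", "duration_seconds": 60.0},
    {"PID": 4, "Condition": "Pandas", "duration_seconds": 500.0},
    {"PID": 5, "Condition": "Pandas", "duration_seconds": 100.0},
    {"PID": 2, "Condition": "Pandas", "duration_seconds": 300.0},
]
experience = {1: 3, 2: 3, 4: 1, 5: 4}  # PID -> self-reported Pandas experience

joined = left_join(task_rows, experience, "PID", "Pandas Reported Exp")
pandas_only = [row for row in joined if row["Condition"] == "Pandas"]

by_experience = defaultdict(list)
for row in pandas_only:
    by_experience[row["Pandas Reported Exp"]].append(row["duration_seconds"])

mean_duration = {level: mean(values) for level, values in by_experience.items()}
print(mean_duration)
```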

In [217]:
pid_duration_df = pd.merge(tasks_df, pre_study_df[['PID', 'Pandas Reported Exp']], on='PID', how='left')
pid_duration_df.head()

Unnamed: 0,PID,Condition,Dataset,Order,Task,Sub Task,TaskID,Status,Duration,LLM,Search,Notes,duration_seconds,Pandas Reported Exp
0,1,Persist,Avalanche,1,1,a,Delete Columns (1a),Correct,0 days 00:02:04,False,False,-,124.0,3
1,1,Persist,Avalanche,1,1,b,Rename Columns (1b),Correct,0 days 00:01:14,False,False,-,74.0,3
2,1,Persist,Avalanche,1,1,c,Change Column Type (1c),Correct,0 days 00:01:00,False,False,-,60.0,3
3,1,Persist,Avalanche,1,2,a,Filter Outliers (2a),Correct,0 days 00:02:30,False,False,-,150.0,3
4,1,Persist,Avalanche,1,2,b,Filter Ranges (2b),Correct,0 days 00:01:00,False,False,-,60.0,3


In [218]:
alt.Chart(pid_duration_df).mark_point(
    tooltip=alt.TooltipContent("data")
    ).transform_filter(
        alt.datum.Condition == "Pandas"
    ).encode(
    x="Pandas Reported Exp:O",
    y="duration_seconds:Q",
    color="PID:N",
).properties(
    width=300,
    height=300
)

In [219]:
post_study_df = pd.read_csv(post_study_path)

post_study_df.head()

Unnamed: 0,PID,Rename Columns,Delete Columns,Change Column Data Type,Interactive Selections,Filter Selections,Assign Categories,History
0,1,5,5,5,5,5,5,5
1,2,5,5,5,5,5,5,5
2,3,5,5,5,4,5,5,5
3,4,5,5,5,5,4,5,4
4,5,5,5,5,5,5,5,5


In [220]:
post_study_df.columns.tolist()

['PID',
 'Rename Columns',
 'Delete Columns',
 'Change Column Data Type',
 'Interactive Selections',
 'Filter Selections',
 'Assign Categories',
 'History']

## Helpfulness: Analyzing How Helpful Participants Found Persist (Paper Figure)


The questions asked were "How helpful did you find Persist for ____" on a five-point Likert scale ranging from "Not Helpful" to "Very Helpful".

**Figure `helpfulness`**

TODO:
* add color scale from Pandas color to Persist color, so that the right-most bar is fully red, and then they get shaded.
* use consistent terms for operations
* add link to survey here
* list survey questions here



In [221]:
dims = ['Rename Columns',
 'Delete Columns',
 'Change Column Data Type',
 'Interactive Selections',
 'Filter Selections',
 'Assign Categories',
 'History']

sel = alt.selection_point(fields=["PID"],bind="legend")

alt.Chart(post_study_df).transform_fold(
    dims,
    as_=['Score Type', 'Score']
).mark_bar(tooltip=alt.TooltipContent('data')).encode(
    alt.X("Score:O").scale(domain=(1,2,3,4,5)),
    alt.Y("count():Q"),
    column=alt.Column("Score Type:N").sort(dims)#,
    #color=alt.condition(sel, "PID:N", alt.value("gray")),
).properties(
    width=100,
    height=70
).add_params(
    sel
)

In [191]:
post_study_df.describe()

Unnamed: 0,PID,Rename Columns,Delete Columns,Change Column Data Type,Interactive Selections,Filter Selections,Assign Categories,History
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,6.0,4.909091,4.909091,4.909091,4.727273,4.545455,4.909091,4.909091
std,3.316625,0.301511,0.301511,0.301511,0.467099,0.934199,0.301511,0.301511
min,1.0,4.0,4.0,4.0,4.0,2.0,4.0,4.0
25%,3.5,5.0,5.0,5.0,4.5,4.5,5.0,5.0
50%,6.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
75%,8.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0
max,11.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
