In [18]:
import sys
import os
# Install third-party dependencies into the running kernel's environment.
!{sys.executable} -m pip install PyPDF2
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install google-generativeai
# pathlib ships with the standard library, so it needs no pip install.



In [50]:
import google.generativeai as genai
from PyPDF2 import PdfReader
from pathlib import Path
from IPython.display import display, Markdown

In [53]:
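def extract_text(path):
    # Concatenate the text of every page; pages with no extractable text yield "".
    return ''.join(page.extract_text() or "" for page in PdfReader(path).pages)

def get_filenames(path):
    # Names of the regular files directly inside `path`.
    return [item.name for item in Path(path).iterdir() if item.is_file()]

def get_pdfs(path):
    # Build each file path from the Path object, so `path` works with or without
    # a trailing slash; skip non-PDF files such as hidden system files.
    return [extract_text(str(item)) for item in Path(path).iterdir()
            if item.is_file() and item.suffix.lower() == ".pdf"]

def to_markdown(text):
    # Turn bullet characters into Markdown list items and render as a blockquote.
    text = text.replace("•", "  *")
    indented = "\n".join("> " + line for line in text.splitlines())
    return Markdown(indented)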

In [68]:
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")

In [66]:
payload = "Given the following papers: " + "".join(get_pdfs("Desktop/hripapers/")) + ". Generate an in-depth report of a Human-Robot Interaction study design that violates the principles of sound experimental study design."
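
Joining the extracted texts with an empty string gives the model no marker for where one paper ends and the next begins. One possible alternative is to label and delimit each paper before appending the instruction. The sketch below assumes a hypothetical helper `build_payload` and made-up stand-in texts; in the notebook the texts would come from `extract_text`.

```python
def build_payload(papers, instruction):
    """Join labeled paper texts with explicit delimiters, then append the instruction.

    `papers` maps a label (for example a filename) to its extracted text.
    """
    sections = [
        f"=== PAPER {i}: {name} ===\n{text.strip()}"
        for i, (name, text) in enumerate(papers.items(), start=1)
    ]
    return "Given the following papers:\n\n" + "\n\n".join(sections) + "\n\n" + instruction


# Hypothetical stand-in texts, used only to show the resulting layout.
demo = build_payload(
    {"paper_a.pdf": "First paper text.", "paper_b.pdf": "Second paper text."},
    "Generate an in-depth report.",
)
print(demo)
```

The delimiter string is arbitrary; any marker that is unlikely to occur inside the papers would serve the same purpose.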

In [64]:
response = model.generate_content(payload)

In [67]:
to_markdown(response.text)

> This report will generate an in-depth analysis of a fabricated HRI study design, highlighting its violations of sound experimental study design principles as outlined in the provided papers: "Trust(worthiness) Issues with Trust in Human-Robot Interaction" (Onnasch et al., 2025), "A Primer for Conducting Experiments in Human–Robot Interaction" (Hoffman and Zhao, 2020), "Lessons Learned About Designing and Conducting Studies From HRI Experts" (Fraune et al., 2022), and "The Power of Theory" (Trafton et al., 2021).
> 
> ---
> 
> ## In-Depth Report: Critique of a Flawed HRI Study Design
> 
> ### Fabricated Study Title: Evaluating the Impact of Robot Expressiveness on User Trust in AI-Driven Health Recommendations
> 
> **1. Overview of the Fabricated Study Design**
> 
> A research team aims to investigate how a robot's non-verbal expressiveness influences users' trust in its health-related advice.
> 
> *   **Research Question:** To what extent does a robot's expressive non-verbal communication affect user trust in its AI-generated health recommendations?
> *   **Hypothesis:** H1: Users will report higher levels of trust in health recommendations provided by an expressive robot compared to a non-expressive robot.
> *   **Participants:** 25 undergraduate students (ages 18-22, 15 female, 10 male) recruited from a university psychology subject pool.
> *   **Study Design:** Between-subjects design. Participants were randomly assigned to one of two conditions:
>     *   **Condition A (Expressive Robot):** Participants watched a 5-minute video of a social humanoid robot (e.g., a Pepper robot model) giving three general health recommendations (e.g., "drinking more water helps with energy," "getting 8 hours of sleep improves focus," "taking a daily walk boosts mood"). The robot delivered these recommendations using varied vocal intonation, subtle head nods, and hand gestures.
>     *   **Condition B (Non-Expressive Robot):** Participants watched the same 5-minute video of the same robot giving the identical health recommendations. However, the robot used a monotone voice, remained stationary, and maintained a fixed gaze.
> *   **Procedure:**
>     1.  Participants signed an informed consent form and were told the study was about "human perception of robot communication."
>     2.  They were randomly assigned to watch either the expressive or non-expressive robot video in a quiet room on a computer.
>     3.  Immediately after the video, participants completed a 10-item "Robot Trust Scale" (developed by the researchers for this study, sample items: "I believe the robot's advice is reliable," "The robot seems capable of giving good advice," "I would feel comfortable following this robot's advice," "I would trust this robot with my health decisions"). The scale used a 7-point Likert scale (1=Strongly Disagree, 7=Strongly Agree).
>     4.  Finally, participants completed demographic questions and were debriefed.
> *   **Measures:** The primary dependent variable was "user trust," calculated as the sum of scores from the 10-item "Robot Trust Scale."
> *   **Results (Fabricated):** An independent samples t-test revealed a statistically significant difference between the two conditions, t(23) = 2.10, p = 0.046. The expressive robot condition (M=58.2, SD=8.5) had a higher mean trust score than the non-expressive robot condition (M=52.1, SD=9.1).
> *   **Discussion (Fabricated):** The study concludes that expressive non-verbal communication is a crucial factor in building user trust in AI-driven health recommendation robots. The findings suggest that future HRI designs should prioritize social expressiveness to foster better acceptance and adherence to robot advice, particularly in sensitive domains like health.
> 
> ---
> 
> ### 2. Detailed Critique of the Flawed Study Design
> 
> This fabricated study, while appearing straightforward, exhibits several critical violations of sound experimental design principles, drawing heavily on the insights from Onnasch et al. (2025), Hoffman and Zhao (2020), Fraune et al. (2022), and Trafton et al. (2021).
> 
> #### 2.1 Lack of Strong Theoretical Grounding and Hypotheses (Violating "The Power of Theory" & "A Primer")
> 
> Trafton et al. (2021) emphasize that "broad progress is driven by theory." This study, however, appears to stem from a superficial "design theory" without delving into the "psychological theory" of *why* expressiveness would lead to trust, or *what kind* of trust it would affect. Is it perceived competence, benevolence, or integrity? Is it due to anthropomorphism, and if so, how does that mechanism work? The hypothesis is merely a directional guess ("higher levels of trust") without a deeper causal explanation rooted in established psychological models of trust or communication.
> 
> *   **Violation:** The hypothesis is "mainly based on wishful thinking" rather than existing knowledge or a robust theoretical framework (Hoffman and Zhao, 2020, Section 3.3). Without a clear theoretical foundation, the study risks producing uninterpretable results (Trafton et al., 2021).
> *   **Recommendation:** Ground the study in an explicit theory (e.g., Mayer et al.'s [83] trust model, communication theories, social presence theory). For instance, hypothesize that expressiveness increases perceived benevolence and competence, which *then* mediates trust.
> 
> #### 2.2 Confusion in Operationalizing "Trust" (Violating "Trust(worthiness) Issues" & "A Primer")
> 
> This is perhaps the most significant flaw, directly challenging the core argument of Onnasch et al. (2025).
> 
> *   **Confusion 1: Trustworthiness is Not Trust:** The "Robot Trust Scale" likely conflates "trust" with "perceived trustworthiness." Items like "I believe the robot's advice is reliable" and "The robot seems capable of giving good advice" clearly assess **perceived trustworthiness** (specifically ability/competence dimensions), not **trust** itself. Onnasch et al. (2025) explicitly state, "Perceived trustworthiness is comprised of the perceived characteristics of the trustee such as ability, benevolence, and integrity... Trust, in contrast, is the attitude of being vulnerable to that trustee at a specific moment in time for a specific situation" (Section 3.1). The self-developed questionnaire seems to fall into the trap of commonly used "trust" scales that actually measure trustworthiness.
>     *   **Recommendation:** Use a validated trust questionnaire (e.g., Schoorman et al., 2016, as cited in Onnasch et al., 2025 [119]), which focuses on "the respondent’s willingness to let the trustee influence issues important to them without the respondent being able to monitor or control." Clearly differentiate between measuring perceived trustworthiness (robot characteristics) and trust (human attitude/vulnerability).
> 
> *   **Confusion 3: Overlooking Risk and Vulnerability:** The study uses general health recommendations (e.g., "drink more water") delivered by video on a computer. There is **no actual risk or vulnerability** for the participants. Onnasch et al. (2025) emphasize that "the need for trust only arises in risky situations" (p. 711, quoting Mayer et al., 1995 [83]). If participants face no personal consequence (physical, financial, social, performance-based) from following or ignoring the advice, true "trust" (as defined by vulnerability) is irrelevant.
>     *   **Recommendation:** Implement real (or credibly simulated) risk and vulnerability. This could involve, for example, having the robot recommend a specific, slightly inconvenient, or uncertain action that the participant *might* take, with measurable consequences (e.g., a "wellness challenge" where adherence is tracked and linked to a small monetary incentive/penalty, or a task where relying on the robot's advice could impact their performance score). Hoffman and Zhao (2020) also mention the challenge of online studies lacking credible risk.
> 
> *   **Confusion 4: Trust is Not Behavior:** The study measures a subjective attitude (or trustworthiness proxy) and implies its behavioral consequences in the discussion ("foster better acceptance and adherence"). However, no actual behavior (e.g., intention to follow the advice, actual adherence to recommendations over time, reliance on the robot in a subsequent task) is measured. Onnasch et al. (2025) explicitly state, "trust is not a synonym for behaviors like reliance, compliance and verification behaviors but precedes them" (Section 3.4). A weak relationship between subjective trust and dependent behavior is often found (Patton and Wickens, 2024, cited in Onnasch et al., 2025 [99]).
>     *   **Recommendation:** Include objective behavioral measures alongside subjective self-reports. This could involve asking participants if they *intend* to follow the advice, or, ideally, tracking actual adherence to a recommendation over a short period. Triangulation of measures is crucial (Bethel and Murphy, 2010; Onnasch et al., 2025).
> 
> *   **Confusion 5: Why Trust Might Not Be What We're Really After:** Given the low-stakes nature and lack of true vulnerability, the study might be more interested in "acceptance," "likability," "persuasiveness," or "credibility" rather than "trust." Trust is a complex construct requiring specific conditions to be relevant.
>     *   **Recommendation:** Re-evaluate the core construct. If the goal is simply to see if expressiveness makes robots more palatable or convincing, other constructs might be more direct and appropriate to measure. If trust *is* the goal, then the experimental design needs to fundamentally change to incorporate risk and vulnerability.
> 
> #### 2.3 Methodological Flaws in Study Design & Execution
> 
> *   **Unvalidated Self-Report Measure (Violating "A Primer")**: The "Robot Trust Scale" was "developed by the researchers for this study."
>     *   **Violation:** Hoffman and Zhao (2020, Section 4.3.2) state, "It is almost always preferred to use a previously validated questionnaire than to make one up. A previously validated questionnaire can help ensure the construct validity of your research and promote reproducibility." The study makes no mention of reliability analysis (e.g., Cronbach's alpha) or validation efforts for their custom scale.
>     *   **Recommendation:** Use established, validated measures for perceived trustworthiness (e.g., MDMT [Malle and Ullman, 2021] as cited in Onnasch et al., 2025 [81]) or trust (Schoorman et al., 2016, as cited in Onnasch et al., 2025 [119]). If a new scale is necessary, conduct rigorous validation through exploratory studies and reliability analyses (Fraune et al., 2022).
> 
> *   **Small Sample Size & Low Statistical Power (Violating "A Primer" & "Review of Human Studies Methods")**: N=25 (12-13 per condition) is severely underpowered for detecting effects.
>     *   **Violation:** Hoffman and Zhao (2020, Section 6.1.3) highlight that a sample size of "15 participants per condition for a between-participants design... would result in a low power of .26 even with an effect size of d=0.50." With power this low, the design is highly susceptible to Type II errors (missing a real effect), and a significant result that does emerge, such as the reported p=0.046 sitting just under the conventional alpha level, is more likely to be a false positive or an inflated effect estimate. Bethel and Murphy (2010, Section 2) also criticize the "lack of significant-sized participant pools" in HRI.
>     *   **Recommendation:** Conduct an a priori power analysis to determine the appropriate sample size, aiming for a statistical power of at least .80 (Hoffman and Zhao, 2020, Section 6.1.4; Bethel and Murphy, 2010). For a medium effect size (d=0.50), this would require ~64 participants *per condition* in a between-subjects design.
> 
> *   **Limited Generalizability / External Validity (Violating "A Primer")**: The sample consists exclusively of undergraduate students.
>     *   **Violation:** Hoffman and Zhao (2020, Section 6.2) note that most studies rely on "convenience samples" (e.g., university students) which are "WEIRD" (Western, Educated, Industrialized, Rich, and Democratic) and not representative of broader populations. This negatively impacts the study's external validity. Health recommendations are relevant to all age groups and demographics.
>     *   **Recommendation:** Recruit a more diverse and representative sample population. If online studies are chosen, use platforms like Mechanical Turk or Prolific with appropriate screening to broaden demographics, while acknowledging their own limitations (Hoffman and Zhao, 2020, Section 4.1).
> 
> *   **Lack of Manipulation Check (Violating "A Primer")**: The study assumes that the expressive robot was *perceived* as more expressive.
>     *   **Violation:** There is no explicit check to confirm that the manipulation of "expressiveness" was successful (Hoffman and Zhao, 2020, Section 5.3). Participants might not have noticed or interpreted the expressiveness as intended.
>     *   **Recommendation:** Include a manipulation check. For example, after the video, ask participants to rate the robot's expressiveness, naturalness, or human-likeness (e.g., using scales like RoSAS [Carpinella et al., 2017] as cited in Hoffman and Zhao, 2020 [12]).
> 
> *   **No Pilot Studies Mentioned (Violating "A Primer" & "Lessons Learned")**:
>     *   **Violation:** The study does not mention any pilot testing of the videos, the task, the robot's script, or the self-developed questionnaire. Hoffman and Zhao (2020, Section 5.2) state, "Piloting is an illuminating and crucial step in study development." Fraune et al. (2022, Section 4.1.2.1) detail how pilot studies can reveal technical issues, unexpected participant behavior, and problems with study design or metrics.
>     *   **Recommendation:** Conduct multiple pilot studies to refine the robot's expressiveness, ensure the health recommendations are understood, check for any unintended confounds in the videos, and validate the custom questionnaire before the main study.
> 
> *   **No Pre-registration (Violating "A Primer")**:
>     *   **Violation:** There is no mention of pre-registration, increasing the risk of "p-hacking" or "HARKing" (Hypothesizing After the Results are Known). Hoffman and Zhao (2020, Section 7.1) emphasize pre-registration to combat these problematic research practices.
>     *   **Recommendation:** Pre-register the study's hypotheses, design, sample size, exclusion criteria, and planned statistical analyses on platforms like OSF (Open Science Framework).
> 
> *   **Reporting of Statistical Significance without Effect Size (Violating "A Primer")**:
>     *   **Violation:** The results section reports a p-value but omits effect size. Hoffman and Zhao (2020, Section 9.2) state, "p-values are designed to tell you whether your result could be explained by random variation, not whether it is meaningful... Therefore, effect sizes are used to communicate how large of an effect you have found." A p-value of 0.046 is weak, and without an effect size (e.g., Cohen's d), it's impossible to gauge the practical significance of the findings.
>     *   **Recommendation:** Always report effect sizes (e.g., Cohen's d for t-tests) alongside p-values to allow readers to assess the magnitude and practical importance of the observed effect.
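
The sample-size figures cited in the critique above (power of about .26 at 15 per condition, and roughly 64 per condition for d = 0.50 at power .80) can be approximated with standard-library Python. This is a minimal sketch using the normal approximation to a two-sided, two-sample comparison of means, not the exact noncentral-t calculation a dedicated power tool would perform; the function names `n_per_group` and `approx_power` are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample test."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n participants per group (normal approximation)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)  # expected test statistic under the alternative
    return _Z.cdf(shift - z_alpha) + _Z.cdf(-shift - z_alpha)


required_n = n_per_group(0.5)
power_at_15 = approx_power(0.5, 15)
print(required_n, round(power_at_15, 2))
```

The approximation gives 63 per condition and a power of roughly 0.28, close to the t-based values of about 64 and .26 quoted in the critique; the small gap is expected because the normal approximation ignores the extra uncertainty from estimating the variance.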
> 
> #### 2.4 Study Context and Ecological Validity (Violating "A Primer")
> 
> *   **Video-Based Online Study**:
>     *   **Violation:** Hoffman and Zhao (2020, Section 4.1) point out that online studies, especially video-based ones without real interaction, "lower the external validity of online studies" and make it "difficult to get people’s genuine responses." Fraune et al. (2022, Section 4.3.2) discuss the limitations of current robots and the need for controlled environments, but also highlight the value of in-the-wild studies.
>     *   **Recommendation:** While online video studies can be useful for initial exploration, follow up with lab or field studies involving actual human-robot interaction in a more ecologically valid context (e.g., a "full-cycle research" approach, Hoffman and Zhao, 2020).
> 
> ### 3. Conclusion and Recommendations for an Improved Study Design
> 
> This fabricated study, while well-intentioned, illustrates numerous common pitfalls in HRI research. To transform this into a sound experimental study, the following improvements are critical:
> 
> 1.  **Strengthen Theoretical Foundation:** Explicitly state a psychological theory explaining *why* expressiveness would affect trust, perhaps linking it to anthropomorphism, perceived social cues, or calibrated reliance.
> 2.  **Refine Trust Operationalization:**
>     *   **Distinguish Trustworthiness from Trust:** Use validated scales that specifically measure either perceived trustworthiness (e.g., competence, benevolence, integrity) or trust (willingness to be vulnerable), depending on the theoretical goal. Do not conflate them.
>     *   **Introduce Risk and Vulnerability:** Design a scenario where participants genuinely face a consequence (e.g., minor financial loss, performance impact, or social discomfort) if they rely on the robot's advice. This is paramount for trust to be relevant.
>     *   **Measure Behavior:** Include objective behavioral measures of reliance or compliance, not just subjective self-reports.
> 3.  **Enhance Methodological Rigor:**
>     *   **Increase Sample Size:** Conduct an a priori power analysis to determine and recruit an adequately sized sample.
>     *   **Validate Measures:** Use established, validated questionnaires or rigorously validate any custom scales.
>     *   **Implement Manipulation Check:** Empirically verify that the expressive manipulation was perceived as intended.
>     *   **Conduct Pilot Studies:** Thoroughly pilot test all aspects of the study, from robot behavior to questionnaire clarity.
>     *   **Pre-register:** Document the hypotheses, methods, and analysis plan before data collection begins.
>     *   **Report Effect Sizes:** Always include effect sizes alongside p-values to indicate the practical significance of findings.
> 4.  **Improve Ecological Validity:** Consider moving beyond purely video-based studies to include physical human-robot interaction in a more realistic or interactive setting, even if simulated.
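
For a custom scale such as the 10-item "Robot Trust Scale", internal consistency is commonly summarized with Cronbach's alpha. The following is a minimal sketch on made-up ratings (hypothetical data, with four items rather than ten for brevity); `cronbach_alpha` is an illustrative helper. A high alpha indicates that the items vary together, but on its own it does not establish that the scale measures trust rather than trustworthiness.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-participant item-score lists."""
    k = len(responses[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*responses)]  # per-item sample variance
    total_var = variance([sum(row) for row in responses])   # variance of summed scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 7-point ratings: 5 participants x 4 items (not data from any study).
ratings = [
    [6, 5, 6, 5],
    [4, 4, 5, 4],
    [7, 6, 6, 7],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```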
> 
> By addressing these issues, HRI research can move towards greater conceptual clarity, methodological rigor, and ultimately, more reliable and meaningful contributions to the field.
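
The recommendation to report effect sizes can be illustrated on the fabricated results themselves. This sketch assumes a 13/12 split of the 25 participants (the report only says "12-13 per condition") and uses the pooled-standard-deviation form of Cohen's d; `cohens_d` is an illustrative helper, not part of any library used in this notebook.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from group means, standard deviations, and sizes (pooled SD)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


# Means and SDs from the fabricated results; the 13/12 group split is an assumption.
d = cohens_d(58.2, 8.5, 13, 52.1, 9.1, 12)
print(round(d, 2))
```

Under these assumptions d comes out near 0.69. Swapping which condition has 13 participants changes the value only slightly.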