## Let's Talk Digital: synergistic human-LLM deductive coding

_WIP - NOT FOR DISTRIBUTION_

_Proof-of-concept adaptation of the [CHALET](https://arxiv.org/abs/2405.05758) (**C**ollaborative **H**uman-LLM **A**na**L**ysis for **E**mpowering Conceptualization in Quali**T**ative Research) approach for Let's Talk Digital acceptability, feasibility, and usability interview data._

> ollama_scratchpad.ipynb<br> 
> Simone J. Skeen (01-02-2025)

### Prepare
***
Installs and imports requisite packages; customizes outputs.

In [None]:
%%capture

!pip install python-docx
!pip install ollama

# openpyxl: engine pandas uses to write .xlsx via df.to_excel
!pip install openpyxl

In [None]:
import docx
import json
import ollama
import os
import pandas as pd
import re
import requests
import warnings

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

pd.options.mode.copy_on_write = True

pd.set_option(
              'display.max_columns',
              None,
              )

pd.set_option(
              'display.max_rows',
              None,
              )

warnings.simplefilter(
                      action = 'ignore',
                      category = FutureWarning,
                      )

#from langchain_community.llms import Ollama 

### Write
***
Defines parse_interview_responses and code_interview_responses functions.

**_parse_interview_responses_**

In [None]:
        ### SJS 12/31: _NOTE_ manually removing interview questions; zero-shot Llama in future

def parse_interview_responses(docx_file):
    """
    Reads a .docx interview transcription, extracts each distinct response (denoted 'R:'),
    returns pandas df with columns: ['text', '<tag>_1...n'].
    """

    # load .docx file
    
    doc = docx.Document(docx_file)

    # concatenate responses into single string
    
    full_text = []
    for p in doc.paragraphs:
        full_text.append(p.text)
    all_text = '\n'.join(full_text)

    # regex to capture distinct responses starting 'R:' until the next 'R:' or EOF
    
        ### SJS 1/1: GPT-o1-derived regex; verify
 
    #  - (R:.*?): Capture 'R:' followed by any characters, non-greedily (.*?)
    #  - (?=\n\s*R:|\Z): Look ahead for a newline+whitespace+R: OR the end of the string (\Z)
    pattern = r'(R:.*?)(?=\n\s*R:|\Z)'

    responses = re.findall(pattern, all_text, flags = re.DOTALL)

    # strip leading/trailing whitespace and line breaks from each response
    
    responses = [resp.strip() for resp in responses]

    # build df
    
    df = pd.DataFrame({
        'text': responses,
        'comm_sjs': None ### 'comm_sjs' = enhancing communications tag; add as needed
        # <tag n> etc.
    })

    return df
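
The response-splitting regex above can be checked against a short made-up transcript before it is trusted on real data. The sketch below is a minimal, self-contained check, not part of the pipeline: `split_responses` and `sample_text` are hypothetical names introduced for illustration, and the sample assumes interviewer questions have already been removed by hand.

```python
import re

# same pattern as in parse_interview_responses
PATTERN = r'(R:.*?)(?=\n\s*R:|\Z)'

def split_responses(all_text):
    """Hypothetical helper: split text into stripped 'R:' responses."""
    responses = re.findall(PATTERN, all_text, flags=re.DOTALL)
    return [resp.strip() for resp in responses]

# made-up sample: two responses, the first spanning two lines,
# the second preceded by a blank line and leading whitespace
sample_text = (
    'R: We talk more openly now.\n'
    'It took some practice.\n'
    '\n'
    '  R: I listen before I react.\n'
)

for resp in split_responses(sample_text):
    print(repr(resp))
```

Because the pattern only treats a new line starting with `R:` as a boundary, any other lines left in the transcript (for example, interviewer questions) are absorbed into the preceding response, which is why the questions are removed manually for now.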

**_code_interview_responses_**

In [None]:
def code_interview_responses(df, text_column, endpoint_url, prompt_template, model_name):
    
        ### SJS 1/1: _NOTE_ enhancing communication tag used as proof of concept currently; translate with f-string TKTK
    
    """
    Classifies each row of the text column in the provided df according to a human-specified prompt;
    the prompt elicits chain-of-thought reasoning, returned as an explanation for each classification decision.
 
    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame containing the text to classify.
    text_column : str
        The column name in df containing the text to be analyzed.
    endpoint_url : str
        The URL where locally hosted Llama model runs.
    prompt_template : str
        The prompt text with a placeholder (e.g., '{text}') where the row's text will be inserted.
    model_name : str
        The name of the locally hosted model to query (e.g., 'llama3').

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with two new columns: 'comm_llm' (either "0" or "1")
        and 'comm_expl' (the explanation).
    """

    # create empty tag ['*_llm'] and reasoning ['*_expl'] columns
    
    df['comm_llm'] = None
    df['comm_expl'] = None

    for idx, row in df.iterrows():
        row_text = row[text_column]

        # replace '{text}' in prompt_template with df 'text' data
        
        prompt = prompt_template.format(text = row_text)

        # send request to local Llama endpoint.
        
        response = requests.post(
            endpoint_url,
            headers = {'Content-Type': 'application/json'},
            json = {
                'model': model_name,
                'prompt': prompt,
                'stream': False 
                },
        )

        # print statements for debugging
        
        print(response.status_code)
        print(response.text)      

        if response.status_code == 200:
            try:
                # Step 1: parse top-level JSON
                
                result_json = response.json()
                
                # Step 2: the 'response' field contains the JSON string with comm_llm & comm_expl
                
                raw_response_str = result_json.get('response', ' ')
                
                # Step 3: parse inner JSON
                
                parsed_output = json.loads(raw_response_str)
                
                # Step 4: extract tag and reasoning fields
                
                comm_llm_label = parsed_output.get('comm_llm')
                comm_llm_expl  = parsed_output.get('comm_expl')
                
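            # AttributeError covers valid JSON that is not an object (e.g., a bare string or list)
            except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e: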
                print("Parsing error:", e)
                comm_llm_label = None
                comm_llm_expl = None        
           
        #if response.status_code == 200:
        #    try:
        #        # Expecting the server's response to be valid JSON containing comm_llm and comm_expl
        #        result_json = response.json()
        #        comm_llm_label = result_json.get("comm_llm", None)
        #        comm_llm_expl = result_json.get("comm_expl", None)
        #    except ValueError:
        #        # If JSON parsing fails, set these fields to None or provide fallback values
        #        comm_llm_label = None
        #        comm_llm_expl = None
        
        # 'None' if lacking valid status code
        
        else:
            comm_llm_label = None
            comm_llm_expl = None

        # insert classification results into df
        
        df.at[idx, 'comm_llm'] = comm_llm_label
        df.at[idx, 'comm_expl'] = comm_llm_expl

    return df
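
The two-step JSON parsing in `code_interview_responses` assumes the model returns nothing but a JSON object in the `response` field. The sketch below shows one way that step might be hardened, using a mocked response body so that no running endpoint is needed. `build_generate_payload` and `extract_comm_fields` are hypothetical helpers, not part of the notebook; the `'format': 'json'` request option is an Ollama feature, but whether the installed version supports it, and whether it suits this workflow, are assumptions to verify.

```python
import json
import re

def build_generate_payload(model_name, prompt):
    """Hypothetical helper: request body for the local generate endpoint."""
    return {
        'model': model_name,
        'prompt': prompt,
        'stream': False,
        'format': 'json',  # assumption: asks Ollama to constrain output to JSON
    }

def extract_comm_fields(raw_response_str):
    """Hypothetical helper: return (comm_llm, comm_expl), or (None, None) on failure."""
    if not isinstance(raw_response_str, str):
        return None, None
    try:
        parsed = json.loads(raw_response_str)
    except json.JSONDecodeError:
        # fallback: the model may wrap the JSON object in extra prose
        match = re.search(r'\{.*\}', raw_response_str, flags=re.DOTALL)
        if match is None:
            return None, None
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None, None
    if not isinstance(parsed, dict):
        return None, None
    return parsed.get('comm_llm'), parsed.get('comm_expl')

# mocked top-level body, shaped like a non-streaming reply (made-up content)
mock_body = {
    'model': 'llama3',
    'response': 'Sure! {"comm_llm": "1", "comm_expl": "Describes listening. Mentions calm talks."}',
    'done': True,
}

label, expl = extract_comm_fields(mock_body.get('response', ''))
print(label, expl)
```

Returning `(None, None)` on every failure path mirrors the notebook's existing fallback, so unparseable rows stay visibly empty in the output for human review.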

### Transform
***
Imports raw interview data, parses via regex, transforms into structured pandas dataframe/Excel .xlsx file.

In [None]:
# set wd

#%pwd
os.chdir('C:/Users/sskee/OneDrive/Desktop/ltd_sham')

In [None]:
# import matrix - sham data

#d_sham = pd.read_excel(
#                       'ltd_qual_sham.xlsx',
                       #index_col = [0],
#                       )

# inspect

#print(d_sham.columns)
#d_sham.info()
#d_sham.head(3)

In [None]:
# import interview transcript - c10 as pilot

        ### SJS 1/1: for t in transcripts loop TKTK

c10 = parse_interview_responses("Caregiver_10.docx")
c10.head(5)

### Code
***
Enables human-LLM deductive coding: human-specified per-tag prompts, JSON-.xlsx structured outputs.

#### Enhancing communication skills (alias: `comm`): prompt formulation 

In [None]:
        ### SJS 1/2: _NOTE_ prelim sample

role = '''
You are tasked with applying pre-defined qualitative codes to segments of text excerpted from interviews with
graduates of a family-strengthening HIV prevention intervention. You will be provided a definition, instructions, 
and key exemplars of text to guide your coding decisions.
'''

definition = '''
Definition of "Enhancing communication skills": any description of the ability to speak and converse clearly, 
candidly, patiently, and receptively, without resorting to anger or assumptions, with family members.
'''

instruction = '''
You will be provided with a piece of text. For each piece of text:
- If it meets the definition of "Enhancing communication skills," output comm_llm as "1".
- Otherwise, output comm_llm as "0".
- Also provide a short explanation in exactly two sentences, stored in comm_expl.

Please respond in valid JSON with keys "comm_llm" and "comm_expl" only.

Text:
{text}
'''

examples = '''
Below are human-validated examples of "Enhancing communication skills"

- "<example_1>."
      
- "<example_2>."
      
- "<example_3>."
'''

In [None]:
# concatenate prompt as f-string

comm_prompt = f'{role}{definition}{instruction}{examples}'

# locally hosted Llama endpoint

llama_endpoint = 'http://localhost:11434/api/generate'

# classify texts and update df

df = code_interview_responses(
    c10,
    text_column = 'text',
    endpoint_url = llama_endpoint,
    prompt_template = comm_prompt,
    model_name = 'llama3',
)

#print(df)
df.head(10)
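
The prompt assembly above relies on a detail of Python string formatting: values interpolated into an f-string are not re-parsed, so the `{text}` placeholder inside `instruction` survives concatenation and is only filled later by `.format()`. The sketch below illustrates this with shortened, made-up prompt parts. It also shows a caveat: if a literal JSON example with braces were ever added to the template, `.format()` would fail unless the braces were doubled. `fill_prompt` is a hypothetical alternative, not the notebook's method.

```python
# shortened, made-up stand-ins for the notebook's prompt parts
role = 'You are applying qualitative codes.\n'
instruction = 'Respond in valid JSON.\n\nText:\n{text}\n'

# f-string concatenation leaves the '{text}' placeholder intact
template = f'{role}{instruction}'
prompt = template.format(text='R: We talk more calmly now.')
print(prompt)

# caveat: literal braces in a template break str.format
risky_template = 'Answer like {"comm_llm": "1"}.\n\nText:\n{text}\n'
try:
    risky_template.format(text='R: example')
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

def fill_prompt(prompt_template, row_text):
    """Hypothetical alternative: substitute only the '{text}' placeholder."""
    return prompt_template.replace('{text}', row_text)

safe_prompt = fill_prompt(risky_template, 'R: example')
print(safe_prompt)
```

Using `str.replace` on the single placeholder sidesteps brace escaping entirely, at the cost of supporting only one named slot.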

In [None]:
# export

df.to_excel('c10_prelim.xlsx')

> End of ollama_scratchpad.ipynb