## Linguistic markers of subtle discrimination among mental healthcare professionals

_WIP - NOT FOR DISTRIBUTION_

_Organizes iterative, independent co-annotation of audit correspondence field experiment responses received from mental health professionals. Samples parent study data by co-annotation cycle, computes Cohen's $\kappa$, and flags discrepant tagging decisions for in-person deliberation. Extracts .html profile attributes and background vars._

> mhp_annotate_iaa_append.ipynb<br>
> Simone J. Skeen (10-22-2024)

1. [Prepare](#scrollTo=_nwtco4XT0CL)
2. [Write](#scrollTo=EavEs0OkbeHT)
3. [Sample](#scrollTo=c7mqGlB3hHCc)
4. [Triangulate](#scrollTo=z5y8kU5C-FZJ)
5. [Extract](#scrollTo=S2aMoYZlA-k3)

### 1. Prepare
Installs, imports, and downloads requisite models and packages. Organizes RAP-consistent directory structure.
***

**Install**

In [None]:
%%capture

%pip install openai

#!python -m spacy download en_core_web_lg --user

**Import**

In [None]:
import numpy as np
import openai
import os
import pandas as pd
import re
import spacy
import time
import warnings

from bs4 import BeautifulSoup

from google.colab import drive

#spacy.cli.download('en_core_web_lg')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

pd.options.mode.copy_on_write = True

pd.set_option(
              'display.max_columns',
              None,
              )

pd.set_option(
              'display.max_rows',
              None,
              )

warnings.simplefilter(
                      action = 'ignore',
                      category = FutureWarning,
                      )

#!python -m prodigy stats

**Set env variables**

In [None]:
os.environ['OPENAI_API_KEY'] = '<my_key>'
'OPENAI_API_KEY' in os.environ ### confirms the key is set without echoing its value

**Mount gdrive**

In [None]:
drive.mount(
            '/content/drive',
            #force_remount = True,
            )

**Structure directories**

In [None]:
%cd /content/drive/My Drive/Colab/mhp_subtle_discrimination
#%cd /content/drive/My Drive/#<my_project_folder>

#%mkdir inputs

In [None]:
#%cd inputs
#%mkdir annotation data html

In [None]:
mhp_subtle_discrimination/
└── inputs/
    ├── annotation/
    │   └── #d_cycle_{0...n}.xlsx
    ├── data/
    │   └── d_pilot.xlsx
    └── html/

#### Housekeeping: $\mathcal{d}$<sub>pilot</sub>

In [None]:
%cd inputs/data

d_pilot = pd.read_excel(
                        'd_pilot.xlsx',
                        index_col = 'index',
                        )
# 'pilot' var

d_pilot['pilot'] = 1

# 'MHP ID#' var

d_pilot['MHP ID#'] = '.'

d_pilot.info()
d_pilot.head(3)

In [None]:
d_pilot.to_excel(
                 'd_pilot.xlsx',
                 index = True,
                 )

### 2. Write
Writes and imports custom modules.
***

In [None]:
%cd code

#### preprocess.py

**_ner_redact_response_texts_**

In [None]:
%%writefile preprocess.py

import spacy
nlp = spacy.load('en_core_web_lg')

def ner_redact_response_texts(mhp_text):
    """
    Redacts named entities of the types listed in `ne`, as recognized by the spaCy EntityRecognizer, and replaces each with the <|PII|> pseudo-word token.
    """
    ne = [
          'PERSON',   ### people, including fictional
          'NORP',     ### nationalities or religious or political groups
          'FAC',      ### buildings, airports, highways, bridges, etc.
          'ORG',      ### companies, agencies, institutions, etc.
          'GPE',      ### countries, cities, states
          'LOC',      ### non-GPE locations, mountain ranges, bodies of water
          'PRODUCT',  ### objects, vehicles, foods, etc. (not services)
          'EVENT',    ### named hurricanes, battles, wars, sports events, etc.
          ]

    doc = nlp(mhp_text)

    # collect the surface text of each entity whose label is in ne

    ne_to_remove = [ent.text for ent in doc.ents if ent.label_ in ne]

    # replace every occurrence of each collected entity string

    final_string = str(mhp_text)
    for ent_text in ne_to_remove:
        final_string = final_string.replace(
                                            ent_text,
                                            '<|PII|>',
                                            )
    return final_string
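
A note on the replacement step: `str.replace` substitutes every occurrence of an entity's surface text, including occurrences embedded in longer words. The sketch below illustrates an offset-based alternative; it is not the notebook's method. `redact_spans` is a hypothetical helper, and the `(start, end, label)` tuples stand in for the `start_char`, `end_char`, and `label_` attributes that spaCy exposes on each entity span, so the example runs without loading a model.

```python
# Hypothetical helper: redact by character offsets instead of by surface text.
# Each span is (start, end, label), mirroring spaCy's ent.start_char,
# ent.end_char, and ent.label_.

def redact_spans(text, spans, labels, token = '<|PII|>'):
    """Replace each span whose label is in `labels` with `token`."""
    # work right to left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse = True):
        if label in labels:
            text = text[:start] + token + text[end:]
    return text

text = 'Ann sees clients at Annex Clinic.'
spans = [(0, 3, 'PERSON')]  # illustrative span, not model output

by_offset = redact_spans(text, spans, {'PERSON'})
by_surface = text.replace('Ann', '<|PII|>')

print(by_offset)   # <|PII|> sees clients at Annex Clinic.
print(by_surface)  # <|PII|> sees clients at <|PII|>ex Clinic.
```

Whether the extra precision matters depends on how often short names recur inside other tokens in the response texts; the surface-text approach errs on the side of over-redaction.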

#### annotate.py

**_sample_by_cycle_**

In [None]:
%%writefile annotate.py

import pandas as pd

def sample_by_cycle(d_pilot, sample_size, cycle_number):
    """
    Creates a random subsample of d_pilot, blanks prior tags, drops unneeded columns,
    and exports to .xlsx for human annotation.

    Parameters:
    -----------

    d_pilot : pd.DataFrame
        The d_pilot df in memory.

    sample_size : int
        The number of rows to sample from d_pilot.

    cycle_number : int
        The cycle number used to name the output dataframe and the Excel file.

    Returns:
    --------
    pd.DataFrame
        A new dataframe called d_cycle_{cycle_number} after applying the operations.
    """

    d_cycle = d_pilot.sample(
                             n = sample_size,
                             #random_state = 56,
                             )

    # reset index

    d_cycle.reset_index(
                        drop = True,
                        inplace = True,
                        )
    # DAL themes

    d_cycle['brdn'] = ' '
    d_cycle['dmnd'] = ' '
    d_cycle['rbnd'] = ' '

    # excise prior tags

    tag_cols = [
                'prbl',
                'refl',
                'just',
                'afrm',
                'fitt',
                'agnt',
                'brdn',
                'dmnd',
                'rbnd',
                'rtnl',
                'note',
                ]

    d_cycle[tag_cols] = ' '

    # excise unneeded columns

    drop_cols = [
                 'EmailPairID',
                 'WithinPatientID',
                 'FirstInPair',
                 'pilot',
                 'MHP ID#',
                 ]

    d_cycle.drop(
                 columns = drop_cols,
                 axis = 1,
                 inplace = True,
                 )

    # export

    filename = f'd_cycle_{cycle_number}.xlsx'

    d_cycle.to_excel(
                     filename,
                     index = True,
                     )

    return d_cycle

#### calculate.py

**_calculate_kappa_by_cycle_**

In [None]:
%%writefile calculate.py

import pandas as pd
from sklearn.metrics import cohen_kappa_score

def calculate_kappa_by_cycle(cycle_num):
    """
    Calculate Cohen's Kappa and encode disagreements between independent annotators across multiple cycles.

    Parameters:
    --------
    cycle_num : int
        Annotation cycle number, used to load the corresponding Excel files (e.g., cycle 0, cycle 1).

    Returns:
    --------
    d : pd.DataFrame
        Processed df after merging, includes encoded disagreements in *_dis columns.

    kappa_results : dict
        A dictionary containing the Cohen's Kappa scores for each independently co-annotated target.
    """
    # read independently annotated files

    d_dal = pd.read_excel(f'd_cycle_{cycle_num}_dal.xlsx', index_col = [0])
    d_dal.columns = [f'{col}_dal' for col in d_dal.columns]

    d_sjs = pd.read_excel(f'd_cycle_{cycle_num}_sjs.xlsx', index_col = [0])
    d_sjs.columns = [f'{col}_sjs' for col in d_sjs.columns]

    # merge

    d = pd.merge(
                 d_dal,
                 d_sjs,
                 left_index = True,
                 right_index = True,
                 )

    # housekeeping

    targets = [
               'afrm_dal', 'afrm_sjs',
               'agnt_dal', 'agnt_sjs',
               'brdn_dal', 'brdn_sjs',
               'dmnd_dal', 'dmnd_sjs',
               'fitt_dal', 'fitt_sjs',
               'just_dal', 'just_sjs',
               'prbl_dal', 'prbl_sjs',
               'rbnd_dal', 'rbnd_sjs',
               'refl_dal', 'refl_sjs',
               ]

    texts = [
             'text_dal', 'text_sjs',
             'rtnl_dal', 'rtnl_sjs',
             'note_dal', 'note_sjs',
             ]

    d[targets] = d[targets].apply(
                                  pd.to_numeric,
                                  errors = 'coerce',
                                  )
    d[targets] = d[targets].fillna(0)
    d[texts] = d[texts].replace(' ', '.')

    d = d[[
           'text_dal',
           'afrm_dal', 'afrm_sjs',
           'agnt_dal', 'agnt_sjs',
           'brdn_dal', 'brdn_sjs',
           'dmnd_dal', 'dmnd_sjs',
           'fitt_dal', 'fitt_sjs',
           'just_dal', 'just_sjs',
           'prbl_dal', 'prbl_sjs',
           'rbnd_dal', 'rbnd_sjs',
           'refl_dal', 'refl_sjs',
           'rtnl_dal', 'rtnl_sjs',
           'note_dal', 'note_sjs',
           ]].copy()

    # kappa Fx

    def calculate_kappa(d, col_dal, col_sjs):
        return cohen_kappa_score(d[col_dal], d[col_sjs])

    col_pairs = [
                 ('afrm_dal', 'afrm_sjs'),
                 ('agnt_dal', 'agnt_sjs'),
                 ('brdn_dal', 'brdn_sjs'),
                 ('dmnd_dal', 'dmnd_sjs'),
                 ('fitt_dal', 'fitt_sjs'),
                 ('just_dal', 'just_sjs'),
                 ('prbl_dal', 'prbl_sjs'),
                 ('rbnd_dal', 'rbnd_sjs'),
                 ('refl_dal', 'refl_sjs'),
                 ]

    # initialize dict

    kappa_results = {}

    # kappa loop
    print("\n--------------------------------------------------------------------------------------")
    print(f"Cycle {cycle_num}: Cohen's Kappa by target")
    print("--------------------------------------------------------------------------------------")

    for col_dal, col_sjs in col_pairs:
        kappa = calculate_kappa(d, col_dal, col_sjs)
        kappa_results[f'{col_dal} and {col_sjs}'] = kappa

    for pair, kappa in kappa_results.items():
        print(f"{pair} Kappa = {kappa:.2f}")

    # dummy code disagreements Fx

    def encode_disagreements(row):
        # positional access via .iloc; row[0] on a label-indexed Series is deprecated
        return 1 if row.iloc[0] != row.iloc[1] else 0

    col_dis = [
               ('afrm_dal', 'afrm_sjs', 'afrm_dis'),
               ('agnt_dal', 'agnt_sjs', 'agnt_dis'),
               ('brdn_dal', 'brdn_sjs', 'brdn_dis'),
               ('dmnd_dal', 'dmnd_sjs', 'dmnd_dis'),
               ('fitt_dal', 'fitt_sjs', 'fitt_dis'),
               ('just_dal', 'just_sjs', 'just_dis'),
               ('prbl_dal', 'prbl_sjs', 'prbl_dis'),
               ('rbnd_dal', 'rbnd_sjs', 'rbnd_dis'),
               ('refl_dal', 'refl_sjs', 'refl_dis'),
               ]

    for col1, col2, dis_col in col_dis:
        d[dis_col] = d[[col1, col2]].apply(encode_disagreements, axis = 1)

    # display counts for targets

    print("\n--------------------------------------------------------------------------------------")
    print(f"Cycle {cycle_num}: Counts by target")
    print("--------------------------------------------------------------------------------------")
    print(d[targets].apply(pd.Series.value_counts))

    # drop target cols for readability + fillna

    d = d.drop(targets, axis = 1)
    d = d.fillna('.')

    # export: cycle-specific

    d.to_excel(f'd_cycle_{cycle_num}_dis.xlsx')

    return d, kappa_results

#### gpt_assist.py

**_transform_text_with_gpt_**

In [None]:
%%writefile gpt_assist.py

import pandas as pd
import openai
import time

def transform_text_with_gpt(df, input_column, output_column, system_prompt, prompt_template, model='gpt-4o'):
    """
    Transforms text data in a specified df column using GPT based on provided prompts.

    Args:
        df (pd.DataFrame): df containing the text to be transformed.
        input_column (str): name of the input column in the df that contains the text to transform.
        output_column (str): name of the output column where the transformed text will be stored.
        system_prompt (str): system prompt that sets up the assistant's behavior.
        prompt_template (str): template string describing the transformation to be applied to each entry.
          Use '{input_text}' as a placeholder for the input text.
        model (str, optional): The name of the OpenAI GPT model to use (default = 'gpt-4o').

    Returns:
        pd.DataFrame: df with new output column added, containing the transformed text.
    """

    # Fx to send row-wise API requests

    def call_gpt(input_text):
        if pd.isnull(input_text) or str(input_text).strip() == '': ### skip missing or whitespace-only cells
            return ' '

        prompt = prompt_template.format(input_text = input_text)

        try:
            response = openai.chat.completions.create(
                                                      model = model,
                                                      messages = [
                                                                  {'role': 'system',
                                                                   'content': system_prompt},
                                                                  {'role': 'user',
                                                                   'content': prompt},
                                                                  ],
                                                      #max_tokens = 500,
                                                      #n = 1,
                                                      #temperature = 0,
                                                      )

            # extract text from API response

            result = response.choices[0].message.content.strip()
            return result

        except Exception as e:
            print(f"Error processing input text: {input_text}\nError: {str(e)}")
            return input_text ### returns input string in case of error

        finally:

            # impose delay between API calls

            time.sleep(1)

    df[output_column] = df[input_column].apply(call_gpt)

    return df
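
`transform_text_with_gpt` fills the template with `str.format`, so `{input_text}` is the only placeholder it supplies and any literal braces in the template must be doubled. A small illustration follows; the prompt wording and input string are made up and are not taken from the study.

```python
# Illustrative template only; the wording is an assumption, not the study's prompt.
prompt_template = (
    'Rewrite the insurance carriers below as a comma-separated list. '
    'Reply in the form {{"carriers": "..."}}.\n\n'
    'Text: {input_text}'
)

# doubled braces come out as literal braces; {input_text} is substituted
prompt = prompt_template.format(input_text = 'AetnaCignaOut of Network')
print(prompt)

# single braces around literal text are parsed as a placeholder and fail
try:
    '{"carriers": "..."} {input_text}'.format(input_text = 'x')
    single_brace_failed = False
except (KeyError, ValueError, IndexError):
    single_brace_failed = True
```

Because `call_gpt` formats the template before entering its `try` block, a malformed template would surface as an exception from `.apply` rather than as a printed per-row error.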

#### Import

In [None]:
from annotate import(
                     sample_by_cycle
                     )

#from preprocess import(
#                       ner_redact_response_texts
#                       )

from calculate import(
                      calculate_kappa_by_cycle
                      )

from gpt_assist import(
                       transform_text_with_gpt
                       )

### 3. Sample
Randomly samples cycle-specific MHP response subsets for annotation.
***
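
Cycle 0 fixes `random_state = 56`, while `sample_by_cycle` leaves the seed commented out, so rerunning a later cycle draws a different subsample. If reproducible draws are wanted (an assumption; the source does not say), one option is a cycle-specific seed. The plain-Python sketch below shows the idea with the standard library only; `seed_for_cycle`, `sample_row_ids`, and `base_seed` are hypothetical names, and in pandas the resulting seed would be passed to `DataFrame.sample(random_state = ...)`.

```python
import random

def seed_for_cycle(cycle_number, base_seed = 56):
    """Hypothetical convention: one fixed, distinct seed per annotation cycle."""
    return base_seed + cycle_number

def sample_row_ids(row_ids, sample_size, cycle_number):
    """Draw a reproducible sample of row ids without replacement."""
    rng = random.Random(seed_for_cycle(cycle_number))
    return rng.sample(row_ids, sample_size)

row_ids = list(range(500))  # stand-in for the d_pilot index

first_draw = sample_row_ids(row_ids, 80, cycle_number = 1)
second_draw = sample_row_ids(row_ids, 80, cycle_number = 1)

print(first_draw == second_draw)  # True: same cycle, same rows
print(len(set(first_draw)))       # 80: no row drawn twice within a cycle
```

Draws for different cycles remain independent of one another, so they can still overlap; excluding rows annotated in earlier cycles would be a separate step.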

In [None]:
%pwd

In [None]:
%cd ../inputs/data

d_pilot = pd.read_excel(
                        'd_pilot.xlsx',
                        index_col = 'index',
                        )

#d_pilot.info()
#d_pilot.head(3)

#### Cycle 0

In [None]:
%cd ../annotation

In [None]:
# sample

d_cycle_0 = d_pilot.sample(
                           n = 50,
                           random_state = 56,
                           )

# reset index

d_cycle_0.reset_index(
                      drop = True,
                      inplace = True,
                      )

# excise prior tags

tag_cols = [
            'afrm',
            'agnt',
            'fitt',
            'just',
            'prbl',
            'refl',
            'rtnl',
            'note',
            ]

d_cycle_0[tag_cols] = ' '

# excise unneeded cols

drop_cols = [
             'EmailPairID',
             'WithinPatientID',
             'FirstInPair',
             'pilot',
             'MHP ID#',
             ]

    ### SJS 9/16: add DAL targets (for now): brdn, dmnd, rbnd

d_cycle_0.drop(
               columns = drop_cols,
               axis = 1,
               inplace = True,
               )

# export

d_cycle_0.head(3)

d_cycle_0.to_excel(
                   'd_cycle_0.xlsx',
                   index = True,
                   )

#### Cycle 1

In [None]:
%cd ../annotation

In [None]:
# call sample_by_cycle

d_cycle_1 = sample_by_cycle(
                            d_pilot,
                            80, # sample_size = 80
                            1, # cycle_number = 1
                            )

d_cycle_1.info()
d_cycle_1.head(3)

#### Cycle 2

In [None]:
%cd ../annotation

In [None]:
# call sample_by_cycle

d_cycle_2 = sample_by_cycle(
                            d_pilot,
                            80, # sample_size = 80
                            2, # cycle_number = 2
                            )

d_cycle_2.info()
d_cycle_2.head(3)

### 4. Triangulate
Computes Cohen's $\kappa$, dummy codes discrepant tags for in-person deliberation.
***
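
Cohen's $\kappa$ corrects observed agreement $p_o$ for the agreement $p_e$ expected from each annotator's marginal tag frequencies: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below works through that arithmetic on invented binary tags and builds the same kind of 0/1 disagreement flags stored in the `*_dis` columns. `cohen_kappa` is a hypothetical stand-in for `sklearn.metrics.cohen_kappa_score`, which is what the notebook actually calls.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(tags_a)
    # observed agreement: share of items both annotators tagged identically
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # chance agreement: product of marginal proportions, summed over labels
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1:
        return float('nan')  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)

# invented tags for one target, not study data
dal = [1, 0, 1, 1, 0, 0, 1, 0]
sjs = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa(dal, sjs)
disagreements = [int(a != b) for a, b in zip(dal, sjs)]

print(f'{kappa:.2f}')  # 0.50
print(disagreements)   # [0, 0, 1, 0, 0, 1, 0, 0]
```

Here the annotators agree on 6 of 8 items ($p_o = 0.75$) and each tags half the items as 1 ($p_e = 0.5$), giving $\kappa = 0.5$. Note that `calculate_kappa_by_cycle` fills blank tags with 0 before scoring, so an untagged cell counts as a negative tag.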

#### Cycle 0

In [None]:
%cd ../inputs/annotation

d, kappa_results = calculate_kappa_by_cycle(0)

#### Cycle 1

In [None]:
%cd ../inputs/annotation

d, kappa_results = calculate_kappa_by_cycle(1)

In [None]:
d.head(3)

### 5. Extract
Uses substring extraction, regex, and the GPT-4o API to restructure .htm and .html profile files into an MHP-indexed df of background attributes.
***
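
The extraction cell in this section repeats one pattern for each profile section: find a start marker, find the next end marker, and keep the text in between, falling back to `'.'` when either marker is missing. That pattern can be sketched as a single helper. `extract_between` is a hypothetical name, and the sample string is invented rather than taken from a real profile.

```python
def extract_between(full_text, start_marker, end_marker, default = '.'):
    """Return the stripped text between two markers, or `default` if either is absent."""
    start = full_text.find(start_marker)
    if start == -1:
        return default
    start += len(start_marker)
    # search for the end marker only after the start marker
    end = full_text.find(end_marker, start)
    if end == -1:
        return default
    return full_text[start:end].strip()

# invented page text for illustration
full_text = 'Client Focus Age Adults Participants Individuals Treatment Approach Types of Therapy'

client_text = extract_between(full_text, 'Client Focus', 'Treatment Approach')
missing_text = extract_between(full_text, 'Top Specialties', 'Do these issues')

print(client_text)   # Age Adults Participants Individuals
print(missing_text)  # .
```

One small difference from the inline version: the helper starts the end-marker search after the start marker rather than at its first character, which only matters if the two markers overlap.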

In [None]:
%cd /content/drive/My Drive/Colab/mhp_subtle_discrimination/inputs/html
#del d

In [None]:
%cd ../inputs/html

html = [file for file in os.listdir() if file.endswith(('.htm', '.html'))]

# initialize list

html_data = []

# load

for h in html:
    with open(h, 'r', encoding = 'utf-8') as file:
        content = file.read()

    # parse .html

    soup = BeautifulSoup(
                         content,
                         'html.parser',
                         )

    # extract attributes

    name_title = soup.find(
                           'meta',
                           property = 'og:title',
                           )

    profile = soup.find(
                        'meta',
                        property = 'og:url',
                        )
    image = soup.find(
                      'meta',
                      property = 'og:image',
                      )

    image_alt = soup.find(
                          'meta',
                          property = 'og:image:alt',
                          )

    place = soup.find(
                      'meta',
                      attrs = {'name': 'geo.placename'},
                      )

    # extract attribute contents

    practice_name_text = name_title['content'] if name_title else '.'
    profile_url = profile['content'] if profile else '.'
    image_url = image['content'] if image else '.'
    image_alt_text = image_alt['content'] if image_alt else '.'
    place_name = place['content'] if place else '.'

    # extract filename as MHP ID

    mhp_id = os.path.splitext(h)[0] ### drops the .htm / .html extension without leaving trailing whitespace

    # extract full text

    full_text = soup.get_text()

    # 'pronouns' str: extract text preceding "Verified"

    extracted_text = re.search(
                               r'^(.*?)Verified',
                               full_text,
                               re.DOTALL,
                               )

    if extracted_text:
        extracted_text = extracted_text.group(1).strip()
    else:
        extracted_text = ' '

    # extract pronouns from parens

    pronoun_text = re.findall(r'\(([^0-9]+?)\)', extracted_text)
    pronoun_text = ' '.join(pronoun_text).strip()

    # 'description' str: extract text between "Let's Connect" and "Call or Email"

    start_description = full_text.find("Let's Connect")
    end_description = full_text.find("Call or Email", start_description)
    description_text = full_text[start_description + len("Let's Connect"):end_description].strip() \
    if start_description != -1 \
    and end_description != -1 \
    else '.'

    # 'at_a_glance' str (incl 'finances'): extract text between "Practice at a Glance" and "Qualifications"

    start_glance = full_text.find("Practice at a Glance")
    end_glance = full_text.find("Qualifications", start_glance)
    glance_text = full_text[start_glance + len("Practice at a Glance"):end_glance].strip() \
    if start_glance != -1 \
    and end_glance != -1 \
    else '.'

    # 'qualifications' str: extract text between "Qualifications" and "Feel free to ask"

    start_qualifications = full_text.find("Qualifications")
    end_qualifications = full_text.find("Feel free to ask", start_qualifications)
    qualifications_text = full_text[start_qualifications + len("Qualifications"):end_qualifications].strip() \
    if start_qualifications != -1 \
    and end_qualifications != -1 \
    else '.'

    # 'specialties_raw' str: extract text between "Top Specialties" and "Do these issues"

    start_specialties = full_text.find("Top Specialties")
    end_specialties = full_text.find("Do these issues", start_specialties)
    specialties_text = full_text[start_specialties + len("Top Specialties"):end_specialties].strip() \
    if start_specialties != -1 \
    and end_specialties != -1 \
    else '.'

    # 'client_focus' str: extract text between "Client Focus" and "Treatment Approach"

    start_client = full_text.find("Client Focus")
    end_client = full_text.find("Treatment Approach", start_client)
    client_text = full_text[start_client + len("Client Focus"):end_client].strip() \
    if start_client != -1 \
    and end_client != -1 \
    else '.'

    # 'types_of_therapy' str: extract text between "Types of Therapy" and "Ask about what"

    start_therapy = full_text.find("Types of Therapy")
    end_therapy = full_text.find("Ask about what", start_therapy)
    therapy_text = full_text[start_therapy + len("Types of Therapy"):end_therapy].strip() \
    if start_therapy != -1 \
    and end_therapy != -1 \
    else '.'

    therapy_text_with_commas = re.sub(r'(?<=[a-z])(?=[A-Z])', ', ', therapy_text)
    therapy_text_with_commas = re.sub(r'(?<=\))', ', ', therapy_text_with_commas)

    # append to list

    html_data.append({
                      'MHP ID': mhp_id,
                      'practice_name': practice_name_text,
                      'pronouns': pronoun_text,
                      'description': description_text,
                      'profile_url': profile_url,
                      'image_url': image_url,
                      'image_alt_text': image_alt_text,
                      'at_a_glance': glance_text,
                      'qualifications': qualifications_text,
                      'specialties_raw': specialties_text,
                      'client_focus': client_text,
                      'types_of_therapy': therapy_text_with_commas,
                      'place_name': place_name,
                      })

# build df

d = pd.DataFrame(html_data)

# 'client_focus' clean + parse

d['client_focus'] = d['client_focus'].str.replace(
                                                  r'\s+,',
                                                  ',',
                                                  regex = True,
                                                  )

# 'ages' str: extract text following "Age" from 'client_focus'

d['ages'] = d['client_focus'].str.extract(
                                          r'Age\s*(.*?)\s*Participants',
                                          flags = re.I,
                                          )

d['client_focus'] = d['client_focus'].str.replace(
                                                  'Age',
                                                  ' ',
                                                  flags = re.I,
                                                  regex = True,
                                                  )


# 'participants' str: extract text following "Participants" from 'client_focus'

d['participants'] = d['client_focus'].str.extract(
                                                  r'Participants\s*(.*?)\s*Communities',
                                                  flags = re.I,
                                                  )

d['client_focus'] = d['client_focus'].str.replace(
                                                  'Participants',
                                                  ' ',
                                                  flags = re.I,
                                                  regex = True,
                                                  )

# 'communities' str: extract text following "Communities" from 'client_focus'

d['communities'] = d['client_focus'].str.extract(
                                                 r'Communities\s*(.*?)\s*Ethnicity',
                                                 flags = re.I,
                                                 )

d['client_focus'] = d['client_focus'].str.replace(
                                                  'Communities',
                                                  ' ',
                                                  flags = re.I,
                                                  regex = True,
                                                  )

# 'ethnicities' str: extract text following "Ethnicity" from 'client_focus'

d['ethnicities'] = d['client_focus'].str.extract(
                                                 r'Ethnicity\s*(.*?)\s*Religion',
                                                 flags = re.I,
                                                 )

d['client_focus'] = d['client_focus'].str.replace(
                                                  'Ethnicity',
                                                  ' ',
                                                  flags = re.I,
                                                  regex = True,
                                                  )


# 'religions' str: extract text following "Religion" from 'client_focus'

d['religions'] = d['client_focus'].str.extract(
                                              r'(Religion.*)',
                                              flags = re.I,
                                              expand = False,
                                              )

d['client_focus'] = d['client_focus'].str.replace(
                                                  r'Religion',
                                                  ' ',
                                                  regex = True,
                                                  )

d['religions'] = d['religions'].str.replace(
                                            r'Religion',
                                            ' ',
                                            regex = True,
                                            )

# 'finances' str: extract text following "Finances" from 'at_a_glance'

d['finances'] = d['at_a_glance'].str.extract(
                                             r'Finances(.*)',
                                             expand = False,
                                             )

d['finances'] = d['finances'].fillna('.').str.strip()

# delete "Finances" from 'at_a_glance'

d['at_a_glance'] = d['at_a_glance'].str.replace(
                                                r'Finances.*',
                                                ' ',
                                                regex = True,
                                                n = 1,
                                                ).str.strip()

# 'name' str: extract from 'practice_name'

d['name'] = d['practice_name']
d['name'] = d['name'].str.split(
                                ',',
                                n = 1,
                                ).str[0].str.strip()

# 'availability' str: extract (pre-specified) from 'at_a_glance'

availabilities = [
                  'Available both in-person and online',
                  'Available in-person',
                  'Available online',
                  ]

d['availability'] = d['at_a_glance'].str.extract(
                                                 f"({'|'.join(availabilities)})",
                                                 expand = False,
                                                 )

d['availability'] = d['availability'].fillna('.')

# 'years_in_practice' str: extract from 'qualifications'

d['years_in_practice'] = d['qualifications'].str.extract(
                                                         r'In Practice for (\d+) Years',
                                                         expand = False,
                                                         )

d['years_in_practice'] = pd.to_numeric(
                                       d['years_in_practice'],
                                       errors = 'coerce',
                                       )

d['years_in_practice'] = d['years_in_practice'].fillna('.')

# 'licensed_by_state' str: extract from 'qualifications'

d['licensed_by_state'] = d['qualifications'].str.extract(r'Licensed by State of ([A-Za-z]+(?: [A-Za-z]+)*)')

# 'license_number' int: extract after "State /"

d['license_number'] = d['qualifications'].str.extract(r'/\s*(\d+)')

# 'insurance_raw' str: extract from 'finances' - GPT-4o to polish

d['insurance_raw'] = d['finances'].str.extract(r'Insurance\s*(.*)')
d['insurance_raw'] = d['insurance_raw'].str.replace(
                                                    r'Check fees.*',
                                                    ' ',
                                                    regex = True,
                                                    )

# 'fees' str: extract from 'finances'

d['fees'] = d['finances'].str.extract(r'Fees\s*(.*?)\s*Insurance')

# 'individual_fee' int: extract from 'fees'

d['individual_fee'] = d['fees'].str.extract(r'Individual Sessions\s*\$?(\d+)')

# 'couple_fee' int: extract from 'fees'

d['couple_fee'] = d['fees'].str.extract(r'Couple Sessions\s*\$?(\d+)')
d['couple_fee'] = d['couple_fee'].fillna('.') ### assign back: inplace on a column selection does not update d under copy_on_write

# 'sliding_scale' bool: extract from 'fees'

d['sliding_scale'] = d['fees'].str.contains(
                                            'Sliding scale:',
                                            regex = False,
                                            ).fillna(False).astype(int)

# delete PT footer from 'practice_name'

d['practice_name'] = d['practice_name'].str.replace(
                                                    '| Psychology Today',
                                                    ' ',
                                                    regex = False,
                                                    )

# delete contact details from 'description'

tel_re = r'\(\d{3}\) \d{3}-\d{4}'

d['description'] = d['description'].str.replace(
                                                'Take the first step to help',
                                                ' ',
                                                regex = False,
                                                )

d['description'] = d['description'].str.replace(
                                                'Email me',
                                                ' ',
                                                regex = False,
                                                )

d['description'] = d['description'].str.replace(
                                                'Email us',
                                                ' ',
                                                regex = False,
                                                )

d['description'] = d['description'].str.replace(
                                                tel_re,
                                                ' ',
                                                regex = True,
                                                )

# excise artifacts, 'description'

artifact_re = r'^\s*[xX]\d+\s*'

d['description'] = d['description'].str.replace(
                                                artifact_re,
                                                ' ',
                                                regex = True,
                                                )

d['description'] = d['description'].str.replace(
                                                '\n',
                                                ' ',
                                                regex = True,
                                                )

# excise leading, excess spaces, 'description'

d['description'] = d['description'].str.strip().str.replace(
                                                            r'\s+',
                                                            ' ',
                                                            regex = True,
                                                            )

# delete duped text (follows "Let's Connect") from 'at_a_glance'

d['at_a_glance'] = d['at_a_glance'].str.replace(
                                                r"Let's Connect.*",
                                                ' ',
                                                regex = True,
                                                ).str.strip()

# add space: 'specialties_raw'

d['specialties_raw'] = d['specialties_raw'].str.replace(
                                                        r'([a-z])([A-Z])',
                                                        r'\1 \2',
                                                        regex = True,
                                                        )

d['specialties_raw'] = d['specialties_raw'].str.replace(
                                                        r'(\(BPD\)|OCD\)|ADHD|LGBTQ\+|PTSD)',
                                                        r'\1 ',
                                                        regex = True,
                                                        )

d['specialties_raw'] = d['specialties_raw'].str.strip()

# replace blank 'pronouns' with ".", strip surrounding whitespace

d['pronouns'] = d['pronouns'].replace(
                                      ' ',
                                      np.nan,
                                      ).fillna('.').str.strip()

d['pronouns'] = d['pronouns'].replace(
                                      r'^\s*$',
                                      '.',
                                      regex = True,
                                      )

# replace NaN, empty cells, with "."

d.fillna(
         '.',
         inplace = True,
         )

d.replace(
          r'^\s*$', '.',
          regex = True,
          inplace = True,
          )

# inspect

#d
#d.info()
#d.head(3)
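
A minimal sketch of what the cleaning steps above do, run on toy strings rather than study data. The strings and the `toy_*` names are assumptions for illustration only; the regular expressions are the ones used in the cell above.

```python
import pandas as pd

# hypothetical toy inputs; not drawn from the study data
toy_description = pd.Series([
    'x12  Hello\nthere   friend ',
    'Email me  today',
])

toy_specialties = pd.Series(['AnxietyDepressionADHDTrauma'])

# same order as above: artifact, boilerplate, newlines, then excess spaces
cleaned = (
    toy_description
    .str.replace(r'^\s*[xX]\d+\s*', ' ', regex = True)
    .str.replace('Email me', ' ', regex = False)
    .str.replace('\n', ' ', regex = False)
    .str.strip()
    .str.replace(r'\s+', ' ', regex = True)
)

# lower-to-upper boundaries get a space; listed acronyms get a trailing space
spaced = (
    toy_specialties
    .str.replace(r'([a-z])([A-Z])', r'\1 \2', regex = True)
    .str.replace(r'(\(BPD\)|OCD\)|ADHD|LGBTQ\+|PTSD)', r'\1 ', regex = True)
    .str.strip()
)

print(cleaned.tolist())
print(spaced.tolist())
```

The case-boundary rule alone cannot split `ADHDTrauma`, since both adjacent characters are uppercase; only the explicit acronym list handles it. Run-together terms outside that list stay fused, which is presumably why the specialties are later passed to the GPT-based step.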

**Dummy code populations**

In [None]:
# define Fx: convert str to snake_case

def to_snake_case(term):
    return term.strip().replace(" ", "_").lower()

# split 'client_focus' populations by commas into per-row lists of terms

d['ages_split'] = d['ages'].str.split(",")
d['communities_split'] = d['communities'].str.split(",")
d['ethnicities_split'] = d['ethnicities'].str.split(",")
d['religions_split'] = d['religions'].str.split(",")

# define all unique populations (whitespace-stripped; snake_cased in the loops below)

all_ages = set(term.strip() for sublist in d['ages_split'] for term in sublist if term.strip())
all_communities = set(term.strip() for sublist in d['communities_split'] for term in sublist if term.strip())
all_ethnicities = set(term.strip() for sublist in d['ethnicities_split'] for term in sublist if term.strip())
all_religions = set(term.strip() for sublist in d['religions_split'] for term in sublist if term.strip())

# dummy code each population

for a in all_ages:
    age_snake_case = to_snake_case(a)
    d[age_snake_case] = d['ages'].apply(lambda i: 1 if a in i else 0)

for c in all_communities:
    community_snake_case = to_snake_case(c)
    d[community_snake_case] = d['communities'].apply(lambda i: 1 if c in i else 0)

for e in all_ethnicities:
    ethnicity_snake_case = to_snake_case(e)
    d[ethnicity_snake_case] = d['ethnicities'].apply(lambda i: 1 if e in i else 0)

for r in all_religions:
    religion_snake_case = to_snake_case(r)
    d[religion_snake_case] = d['religions'].apply(lambda i: 1 if r in i else 0)

# drop '*_split' columns

d = d.drop([
            'ages_split',
            'communities_split',
            'ethnicities_split',
            'religions_split',
            ],
            axis = 1,
            #inplace = True,
            )
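
For comparison, a hedged sketch of the same dummy coding using pandas' built-in `Series.str.get_dummies`. This is an alternative, not the approach used above: it matches whole comma-separated terms rather than substrings, so results can differ when one term is contained in another. The `toy` frame is hypothetical.

```python
import pandas as pd

# hypothetical toy frame; "." marks an empty cell, as elsewhere in this notebook
toy = pd.DataFrame({'ages': ['Adults, Teen', 'Teen', '.']})

dummies = (
    toy['ages']
    .str.replace(r'\s*,\s*', ',', regex = True)  # drop spaces around commas
    .str.strip()
    .str.get_dummies(sep = ',')                  # one 0/1 column per unique term
)

# snake_case the column names, mirroring to_snake_case above
dummies.columns = [c.strip().replace(' ', '_').lower() for c in dummies.columns]

# discard the placeholder column created by "." cells, if present
dummies = dummies.drop(columns = '.', errors = 'ignore')

print(dummies)
```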


**Disambiguate insurers**

In [None]:
# retrieve OpenAI API key

openai.api_key = os.getenv('OPENAI_API_KEY')

# define system prompt

system_prompt = """
You are an expert at recognizing and separating insurer names with commas.
"""

# define prompt template

prompt_template = """
Please separate the following run-together insurer names with commas:

{input_text}

Ensure the insurers are properly separated by commas. Return only the comma-separated insurer names without additional text or characters such as newlines.

If there are no insurer names in the input text, return only a single period: '.'
"""

# transform, inspect

d = transform_text_with_gpt(
                            d,
                            'insurance_raw',
                            'insurance',
                            system_prompt,
                            prompt_template,
                            )

# drop 'raw' col

d = d.drop(
           'insurance_raw',
           axis = 1,
           )
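
Model output does not always honor formatting instructions, so a light post-processing pass may be worth considering. The sketch below is a hypothetical helper (`normalize_list_output` is not part of this notebook, and whether `transform_text_with_gpt` already does something similar is not shown here). It removes newlines, trims each item, and falls back to "." when nothing usable remains. The insurer strings are toy inputs.

```python
def normalize_list_output(text):
    """Tidy a comma-separated model response (hypothetical helper)."""
    # non-string values (e.g., NaN, None) are treated as empty
    if not isinstance(text, str):
        return '.'
    # flatten newlines, split on commas, trim each item
    items = [item.strip() for item in text.replace('\n', ' ').split(',')]
    # drop empty items and stray placeholder periods
    items = [item for item in items if item and item != '.']
    return ', '.join(items) if items else '.'

# toy responses, for illustration only
print(normalize_list_output('Aetna,Cigna ,\nUnited'))
print(normalize_list_output(' . '))
```

If adopted, it could be applied column-wise, e.g. `d['insurance'].apply(normalize_list_output)`, before the column is used downstream.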


**Disambiguate specialties**

In [None]:
# retrieve OpenAI API key

openai.api_key = os.getenv('OPENAI_API_KEY')

# define system prompt

system_prompt = """
You are an expert at recognizing and separating mental health counseling specialties (symptoms, special populations, etc.) with commas.
"""

# define prompt template

prompt_template = """
Please separate the following mental health counseling specialties with commas:

{input_text}

Ensure the specialties are properly separated by commas. Return only the comma-separated specialties without additional text or characters such as newlines.

If there are no specialties in the input text, return only a single period: '.'
"""

# transform, inspect

d = transform_text_with_gpt(
                            d,
                            'specialties_raw',
                            'specialties',
                            system_prompt,
                            prompt_template,
                            )

# drop 'raw' col

d = d.drop(
           'specialties_raw',
           axis = 1,
           )


**Extract accreditations**

In [None]:
# retrieve OpenAI API key

openai.api_key = os.getenv('OPENAI_API_KEY')

# define system prompt

system_prompt = """
You are an expert at recognizing the certificates, degrees, and other accreditations of mental health professionals.
"""

# define prompt template

prompt_template = """
Please review this text and gather all certificates, degrees, and other accreditations of mental health professionals:

{input_text}

Return all certificates, degrees, and other accreditations of mental health professionals separated by commas. Do not
add additional text or characters such as newlines.

If there are no accreditations in the input text, return only a single period: '.'
"""

# transform, inspect

d = transform_text_with_gpt(
                            d,
                            'image_alt_text',
                            'accreditations',
                            system_prompt,
                            prompt_template,
                            )


**Housekeeping**

In [None]:
d = d.reindex(
              columns = [
                         'MHP ID',
                         'name',
                         'pronouns',
                         'accreditations',
                         'practice_name',
                         'profile_url',
                         'image_url',
                         'image_alt_text',
                         'place_name',
                         'description',
                         'at_a_glance',
                         'qualifications',
                         #'client_focus',
                         #'ages',
                         #'communities',
                         #'ethnicities',
                         #'religions',
                        'specialties',
                        'types_of_therapy',
                        'insurance',
                        'fees',
                        'individual_fee',
                        'couple_fee',
                        'sliding_scale',
                        'availability',
                        'years_in_practice',
                        'licensed_by_state',
                        'license_number',
                        'toddler',
                        'children_(6_to_10)',
                        'preteen',
                        'teen',
                        'adults',
                        'elders_(65+)',
                        'single_mother',
                        'couples',
                        'family',
                        'racial_justice_allied',
                        'hispanic_and_latino',
                        'sex_worker_allied',
                        'hiv_/_aids_allied',
                        'immuno-disorders',
                        'gay_allied',
                        #'.',
                        'bisexual_allied',
                        'lesbian_allied',
                        'non-binary_allied',
                        'queer_allied',
                        'intersex_allied',
                        'transgender_allied',
                        #'black_and_african_american',
                        #'christian_____children_(6_to_10)',
                        #'elders_(65+)____individuals',
                        #'couples_____deaf_allied',
                        #'christian_____adults',

                        #'adults____individuals_____bisexual_allied',
                        #'transgender_allied____black_and_african_american',
                        #'immuno-disorders__i_also_speak_american_sign_langu__(asl)____christian',
                        #'transgender_allied____christian',
                        #'christian_____toddler',
                        #'elders_(65+)____individuals____christian',
                        #'group_____bisexual_allied',
                        #'group____christian',
                        #'family____christian',
                        #'adults____individuals',
                        #'christian_____adults____individuals_____single_mother____black_and_african_american____christian',
                        #'hispanic_and_latino____christian',

                        #'christian_____teen',
                        #'elders_(65+)____individuals_____bisexual_allied',
                          ])

d.rename(
         columns = {
                    'children_(6_to_10)': 'children_6_to_10',
                    'elders_(65+)': 'elders_65_plus',
                    'hiv_/_aids_allied': 'hiv_aids_allied',
                    'immuno-disorders': 'immuno_disorders',
                    },
         inplace = True,
         )

d.info()
d.head(3)
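
One caveat with `reindex`: any requested column name that is absent from the frame is created silently and filled with NaN. Because the dummy-coded column names depend on the terms found in the scraped data, a quick check before reindexing may help. A minimal sketch on a hypothetical toy frame (`toy`, `wanted`, and `missing` are illustrative names):

```python
import pandas as pd

# hypothetical toy frame and target column order
toy = pd.DataFrame({'name': ['a'], 'teen': [1]})
wanted = ['name', 'teen', 'adults']

# list requested columns that do not exist yet
missing = [c for c in wanted if c not in toy.columns]
print(missing)

# reindex still succeeds; the absent column is added as all-NaN
reindexed = toy.reindex(columns = wanted)
print(reindexed)
```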

In [None]:
#%pwd
%cd ../../outputs/tables

d.to_excel(
           'd_html.xlsx',
           index = False,
           )

> End of mhp_annotate_iaa_append.ipynb