This notebook determines the ***stance*** of each doctor's response. The possible stances are *entailment*, *neutral*, and *contradiction*, and each response is assigned a percentage for each of the three. Every response is scored as a premise against the fixed hypothesis "The health concerns are valid." Stance detection uses the pretrained natural language inference (NLI) model `DeBERTa-v3-large-mnli-fever-anli-ling-wanli`.
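
To illustrate how three raw scores become percentages, here is a minimal pure-Python sketch. It assumes the model emits one logit per stance and that a softmax turns those logits into probabilities, which is the usual setup for a classification head. The helper name `stance_percentages` and the logit values are made up for illustration and are not model output.

```python
import math

def stance_percentages(logits, labels=("entailment", "neutral", "contradiction")):
    """Convert one logit per stance into percentages via softmax."""
    # Subtracting the largest logit keeps exp() numerically stable
    # and does not change the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    # Express each probability as a percentage with one decimal place.
    return {label: round(100 * e / total, 1) for label, e in zip(labels, exps)}

# Hypothetical logits, chosen only for illustration.
example = stance_percentages([2.0, 1.0, 0.1])
print(example)  # {'entailment': 65.9, 'neutral': 24.2, 'contradiction': 9.9}
```

Because each value is rounded separately, the three percentages of a response may sum to slightly more or less than 100.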


In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Load libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy

from spacy.matcher import Matcher
from google.colab import drive

## Load data

In [None]:
%%time

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to the .feather file
file_path = "/content/drive/MyDrive/diagnose_en_dataset-patient_info-With_Gender_and_Reference-response_Bias_measures.feather"

# Load the .feather file as a Pandas DataFrame
df = pd.read_feather(file_path)

# Display the first five rows of the DataFrame
df.head()

Mounted at /content/drive
CPU times: user 1.85 s, sys: 1.26 s, total: 3.1 s
Wall time: 24.4 s


Unnamed: 0,id,Description,Doctor,Patient,Self,Gender,Gend_bias_freq,Response_len,Disagree_freq
0,0,Q. What does abutment of the nerve root mean?,Hi. I have gone through your query with dilige...,"Hi doctor,I am just wondering what is abutting...",1,X,0,35,0
1,1,"Q. Every time I eat spicy food, I poop blood. ...",Hello. I have gone through your information an...,"Hi doctor, I am a 26 year old male. I am 5 fee...",1,M,0,65,0
2,2,Q. Will Nano-Leo give permanent solution for e...,Hi. For further doubts consult a sexologist on...,"Hello doctor, I am 48 years old. I am experien...",1,M,0,13,0
3,3,Q. Will Kalarchikai cure multiple ovarian cyst...,Hello. I just read your query. See Kalarachi K...,"Hello doctor, I have multiple small cysts in b...",1,F,0,127,1
4,4,Q. I masturbate only by rubbing the tip of the...,Hi. For further doubts consult a sexologist on...,"Hi doctor, During masturbation I just rub the ...",1,M,0,21,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257469 entries, 0 to 257468
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   id              257469 non-null  int64 
 1   Description     257469 non-null  object
 2   Doctor          257469 non-null  object
 3   Patient         257469 non-null  object
 4   Self            257469 non-null  int64 
 5   Gender          257469 non-null  object
 6   Gend_bias_freq  257469 non-null  int64 
 7   Response_len    257469 non-null  int64 
 8   Disagree_freq   257469 non-null  int64 
dtypes: int64(5), object(4)
memory usage: 17.7+ MB


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
model_name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
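model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # keep the model on the same device as its inputs
model.eval()      # inference only: disable dropout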


In [None]:
def doctors_stance(premise):
  """Return (entailment, neutral, contradiction) percentages for one doctor response."""
  hypothesis = "The health concerns are valid."

  # Encode the response (premise) together with the fixed hypothesis.
  # Over-long inputs are truncated to the model's maximum length.
  inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(model.device)
  with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)
  probabilities = torch.softmax(output["logits"][0], -1).tolist()

  # Label order is assumed to match the model's logit order (see model.config.id2label)
  label_names = ["entailment", "neutral", "contradiction"]
  prediction = {name: round(float(prob) * 100, 1) for prob, name in zip(probabilities, label_names)}

  return prediction['entailment'], prediction['neutral'], prediction['contradiction']
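
`doctors_stance` runs one forward pass per response, which can be slow for a dataset of this size. A common alternative is to score several responses per call. The sketch below shows only the batching logic in plain Python. `score_batch` is a hypothetical callable (not part of this notebook) that would wrap the tokenizer and model and return one `(entailment, neutral, contradiction)` tuple per input text; `batched`, `score_in_batches`, and `dummy_score_batch` are likewise illustrative names.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Stance = Tuple[float, float, float]

def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])

def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], List[Stance]],
    batch_size: int = 32,
) -> List[Stance]:
    """Score all texts batch by batch, preserving the input order."""
    results: List[Stance] = []
    for batch in batched(texts, batch_size):
        scores = score_batch(batch)
        # Guard against a scorer that drops or duplicates items.
        if len(scores) != len(batch):
            raise ValueError("score_batch must return one result per input text")
        results.extend(scores)
    return results

# Stand-in scorer so the sketch runs without downloading the model.
def dummy_score_batch(batch: List[str]) -> List[Stance]:
    return [(100.0, 0.0, 0.0) for _ in batch]

demo = score_in_batches(["a", "b", "c", "d", "e"], dummy_score_batch, batch_size=2)
print(len(demo))  # 5
```

With the real model, `score_batch` would presumably call the tokenizer on a list of premises and an equally long list of hypotheses with `padding=True` and `truncation=True`, then apply the softmax along the last dimension of the logits. That wiring is an untested assumption here.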

## Apply the function *`doctors_stance`* to a small subset of the DataFrame at a time and save each subset's output as a separate .csv file

In [None]:
%%time

# Number of rows per chunk
chunk_size = 10000
# chunk_size = 10

# Iterate over the DataFrame in chunks of chunk_size rows
for i in range(0, len(df), chunk_size):
    # Work on an explicit copy: assigning new columns to a slice of df
    # triggers pandas' SettingWithCopyWarning
    chunk = df.iloc[i:i+chunk_size].copy()

    # Score every doctor response in the current chunk
    chunk[['entailment','neutral', 'contradiction']] = chunk['Doctor'].apply(lambda x: pd.Series(doctors_stance(x)))

    # One output file per chunk, named after the chunk's starting row i
    output_path = "/content/drive/MyDrive/Stance-files/diagnose_en_dataset-Stance-"+ str(i) + ".csv"
    print(f'No. of entries processed: {i}')

    chunk.to_csv(output_path)


No. of entries processed: 0
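
Because each chunk is written to its own file, an interrupted run could in principle resume where it stopped instead of recomputing every chunk. The following is a minimal sketch of that idea, assuming the same file-naming pattern as the loop above. The helper `pending_chunk_starts` and its parameters are hypothetical and not part of this notebook.

```python
import tempfile
from pathlib import Path

def pending_chunk_starts(n_rows, chunk_size, out_dir,
                         prefix="diagnose_en_dataset-Stance-"):
    """Return the start offsets of chunks whose CSV file does not exist yet."""
    out_dir = Path(out_dir)
    return [
        start
        for start in range(0, n_rows, chunk_size)
        if not (out_dir / f"{prefix}{start}.csv").exists()
    ]

# Demonstration in a temporary directory: pretend the first chunk is done.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "diagnose_en_dataset-Stance-0.csv").write_text("done")
    remaining = pending_chunk_starts(n_rows=25000, chunk_size=10000, out_dir=tmp)

print(remaining)  # [10000, 20000]
```

The main loop could then iterate over the returned offsets instead of `range(0, len(df), chunk_size)`. Note that this check treats any existing file as complete, so a file left half-written by an interrupted save would need to be deleted by hand before resuming.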
