# Classifying Ailments Using Medical Text and Speech Phrases

The dataset contains brief audio recordings and transcripts of issues related to 25 different ailments. In this kernel, I'll do some light pre-processing and then classify the ailments, first from the text and then from the speech recordings.

## Import Packages

In [1]:

!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main
!pip install huggingface_hub
!pip install -U datasets huggingface-hub


In [9]:
import os
import pandas as pd
from huggingface_hub import notebook_login
from datasets import Dataset, Audio


In [3]:
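# notebook_login() prompts for a Hugging Face token interactively;
# TOKEN is only a placeholder and is not used below.
TOKEN = "<YOUR_WRITE_TOKEN>"
notebook_login()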


In [4]:
df = pd.read_csv('/kaggle/input/medical-speech-transcription-and-intent/Medical Speech, Transcription, and Intent/overview-of-recordings.csv')
df = df[['file_name','phrase','prompt','overall_quality_of_the_audio','speaker_id']]
print(df.shape)
df.head()

(6661, 5)


| | file_name | phrase | prompt | overall_quality_of_the_audio | speaker_id |
|---|---|---|---|---|---|
| 0 | 1249120_43453425_58166571.wav | When I remember her I feel down | Emotional pain | 3.33 | 43453425 |
| 1 | 1249120_43719934_43347848.wav | When I carry heavy things I feel like breaking... | Hair falling out | 3.33 | 43719934 |
| 2 | 1249120_43719934_53187202.wav | there is too much pain when i move my arm | Heart hurts | 3.33 | 43719934 |
| 3 | 1249120_31349958_55816195.wav | My son had his lip pierced and it is swollen a... | Infected wound | 3.33 | 31349958 |
| 4 | 1249120_43719934_82524191.wav | My muscles in my lower back are aching | Infected wound | 4.67 | 43719934 |


In [5]:
# Extract each DataFrame column as a plain Python list
file_name_list = df["file_name"].tolist()
sentence_list = df["phrase"].tolist()
prompt_list = df["prompt"].tolist()
overall_quality_of_the_audio_list = df["overall_quality_of_the_audio"].tolist()
speaker_id_list = df["speaker_id"].tolist()

In [6]:
# All column lists must have the same length
assert (
    len(file_name_list)
    == len(sentence_list)
    == len(prompt_list)
    == len(overall_quality_of_the_audio_list)
    == len(speaker_id_list)
), "the column lists have different lengths"

In [7]:
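# Map each recording's file stem (the file name up to the first ".") to its metadata
medical_data_dict = {}

for file_name, sentence, prompt, speaker_id in zip(
    file_name_list, sentence_list, prompt_list, speaker_id_list
):
    idx = file_name.split(".")[0]
    medical_data_dict[idx] = {"sentence": sentence, "prompt": prompt, "speaker_id": speaker_id}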
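
The stem-based key above keeps only the text before the first dot and silently overwrites an entry if two rows share a stem. As a minimal sketch of a stricter lookup (the helper name `build_metadata_lookup` and the sample rows are hypothetical, not part of the original notebook), the same mapping can be built with `os.path.splitext` and a duplicate check:

```python
import os


def build_metadata_lookup(rows):
    """Map each file stem to its metadata, refusing duplicate stems.

    `rows` is an iterable of (file_name, sentence, prompt, speaker_id) tuples.
    """
    lookup = {}
    for file_name, sentence, prompt, speaker_id in rows:
        # splitext strips only the final extension, e.g. ".wav"
        stem, _ext = os.path.splitext(os.path.basename(file_name))
        if stem in lookup:
            raise ValueError(f"duplicate file stem: {stem}")
        lookup[stem] = {"sentence": sentence, "prompt": prompt, "speaker_id": speaker_id}
    return lookup


# Hypothetical rows in the same shape as the CSV columns used above
sample_rows = [
    ("a_1.wav", "My back aches", "Back pain", 1),
    ("b_2.wav", "I keep coughing", "Cough", 2),
]
lookup = build_metadata_lookup(sample_rows)
print(lookup["a_1"]["prompt"])
```

Raising on a duplicate is a design choice for this sketch; keeping the first or last row would be equally easy if duplicates turn out to be expected.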
    

In [10]:
base_dir = '/kaggle/input/medical-speech-transcription-and-intent/Medical Speech, Transcription, and Intent/recordings/'
train_files = [base_dir + 'train/' + i for i in os.listdir(base_dir + 'train')]
val_files = [base_dir + 'validate/' + i for i in os.listdir(base_dir + 'validate')]
test_files = [base_dir + 'test/' + i for i in os.listdir(base_dir + 'test')]

In [11]:
print(f"length of train_files: {len(train_files)}")
print(f"length of validation_files: {len(val_files)}")
print(f"length of test_files: {len(test_files)}")

length of train_files: 381
length of validation_files: 385
length of test_files: 5895
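
The three counts sum to 6661 (381 + 385 + 5895), which matches the number of rows in the CSV. Matching totals do not by themselves guarantee that the splits are disjoint or that every recording has a metadata row, so a quick check can help. The sketch below is only an illustration: `check_splits` and the toy inputs are hypothetical, and it reuses the same stem convention as the cells above.

```python
def check_splits(splits, known_ids):
    """Return (stems appearing in more than one split, stems without metadata).

    `splits` maps a split name to a list of file paths;
    `known_ids` is the set of stems that have a metadata entry.
    """
    seen, overlaps, missing = {}, set(), set()
    for name, paths in splits.items():
        for path in paths:
            stem = path.split("/")[-1].split(".")[0]
            if stem in seen and seen[stem] != name:
                overlaps.add(stem)
            seen[stem] = name
            if stem not in known_ids:
                missing.add(stem)
    return overlaps, missing


# Toy data: "b_2" is in two splits and "d_4" has no metadata
toy_splits = {
    "train": ["recordings/train/a_1.wav", "recordings/train/b_2.wav"],
    "validate": ["recordings/validate/c_3.wav"],
    "test": ["recordings/test/b_2.wav", "recordings/test/d_4.wav"],
}
overlaps, missing = check_splits(toy_splits, known_ids={"a_1", "b_2", "c_3"})
print(sorted(overlaps), sorted(missing))
```

In the notebook, the call would presumably take `train_files`, `val_files`, and `test_files` as the split lists and `set(medical_data_dict)` as `known_ids`.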


In [12]:
def prepare_split_dict(split):
    """Build a column-oriented metadata dict for one split from its list of file paths."""
    split_id_list, split_sentence_list, split_prompt_list = [], [], []
    split_speaker_id_list, file_paths = [], []

    for file_path in split:
        # The file stem is the key used in medical_data_dict
        idx = file_path.split("/")[-1].split(".")[0]
        metadata = medical_data_dict[idx]

        split_id_list.append(idx)
        split_sentence_list.append(metadata["sentence"])
        split_prompt_list.append(metadata["prompt"])
        split_speaker_id_list.append(metadata["speaker_id"])
        file_paths.append(file_path)

    # Metadata dictionary for the HF upload
    return {
        "id": split_id_list,
        "sentence": split_sentence_list,
        "prompt": split_prompt_list,
        "speaker_id": split_speaker_id_list,
        "path": file_paths,
    }
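
`prepare_split_dict` raises a `KeyError` as soon as one recording has no row in the CSV. If that is undesirable, one option is to skip such files and report them. The sketch below is a variant under that assumption, not the notebook's method: `prepare_split_dict_safe`, its explicit `metadata` argument, and the toy data are all hypothetical.

```python
def prepare_split_dict_safe(split, metadata):
    """Like prepare_split_dict, but skips files whose stem is not in `metadata`.

    Returns (column dict, list of skipped file paths).
    """
    columns = {"id": [], "sentence": [], "prompt": [], "speaker_id": [], "path": []}
    skipped = []
    for file_path in split:
        idx = file_path.split("/")[-1].split(".")[0]
        entry = metadata.get(idx)
        if entry is None:
            # No metadata row for this recording: record it and move on
            skipped.append(file_path)
            continue
        columns["id"].append(idx)
        columns["sentence"].append(entry["sentence"])
        columns["prompt"].append(entry["prompt"])
        columns["speaker_id"].append(entry["speaker_id"])
        columns["path"].append(file_path)
    return columns, skipped


# Toy data: the second file has no metadata entry
toy_metadata = {"a_1": {"sentence": "My back aches", "prompt": "Back pain", "speaker_id": 1}}
toy_columns, toy_skipped = prepare_split_dict_safe(
    ["train/a_1.wav", "train/zz_9.wav"], toy_metadata
)
print(toy_columns["id"], toy_skipped)
```

Passing the lookup as an argument instead of reading a global keeps the function easy to test on small inputs.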
    

In [13]:

# Build an Arrow dataset per split for the HF push; casting "path" to Audio() marks that column as audio
med_asr_train_dataset = Dataset.from_dict(prepare_split_dict(train_files)).cast_column("path", Audio())
med_asr_validation_dataset = Dataset.from_dict(prepare_split_dict(val_files)).cast_column("path", Audio())
med_asr_test_dataset = Dataset.from_dict(prepare_split_dict(test_files)).cast_column("path", Audio())


In [14]:
# pushing to HF hub
med_asr_train_dataset.push_to_hub("yashtiwari/PaulMooney-Medical-ASR-Data", private=False, split="train")
med_asr_validation_dataset.push_to_hub("yashtiwari/PaulMooney-Medical-ASR-Data", private=False, split="validation")
med_asr_test_dataset.push_to_hub("yashtiwari/PaulMooney-Medical-ASR-Data", private=False, split="test")


CommitInfo(commit_url='https://huggingface.co/datasets/yashtiwari/PaulMooney-Medical-ASR-Data/commit/f7ea01e0030ec2ec1420362bc090d7dd1fcd769b', commit_message='Upload dataset', commit_description='', oid='f7ea01e0030ec2ec1420362bc090d7dd1fcd769b', pr_url=None, pr_revision=None, pr_num=None)