# **Exploratory Data Analysis (EDA) of PHEE Data**

This EDA of the PHEE data has six aims:

1. Description of data attributes
2. Quantify data quality
3. Distribution of text length
4. Word Cloud of Text to understand most prevalent topics in the text
5. Distribution of drugs
6. Distribution of disorders

In [1]:
import pickle

# **Load Data**

In [2]:
def pkl_load_dict(filename):
    """
    Load pickled data from a file and return it

    @P:
    filename (str): Path to the pkl file

    @R:
    The unpickled object (here, the PHEE dataframe)
    """
    # Only unpickle files from a trusted source: pickle can run arbitrary code on load
    with open(filename, 'rb') as handle:
        df = pickle.load(handle)

    return df

df = pkl_load_dict('/content/drive/MyDrive/PHEE/output/data_df.pkl')
df.head()

Unnamed: 0,index,Text,Subject,Negation_cue,Potential_therapeutic_event,Drug,Effect,Adverse_event,Race,Age,...,Freq,Dosage,Combination.Drug,Treat-Disorder,Treatment,Severity_cue,Severity,Time_elapsed_y,Speculation_cue,Sub-Disorder
0,14766993_2,To report a case of possible interaction of sm...,,,,warfarin,possible interaction of smokeless tobacco with...,treated,,,...,,,,several thromboembolic events,warfarin,,,,,
1,11485141_2,CONCLUSIONS: We report the first case of gemci...,,,,gemcitabine,LABD,induced,,,...,,,,,gemcitabine,,,,,
2,11804071_3,Such anagen effluvium with lichenoid eruption ...,,,,INH,anagen effluvium with lichenoid eruption,following,,,...,,,,,INH,,,,,
3,11096051_1,Intensive high-flux hemodiafiltration is often...,,,management,,,,,,...,,,,vancomycin toxicity,Intensive high-flux hemodiafiltration,,,,,
4,1422497_3,Continuous bladder irrigation of a 1% alum sol...,,,treat,1% alum,,,,,...,Continuous,,,bleeding urothelium,Continuous bladder irrigation of a 1% alum sol...,,,,,


# **Description and Relationship of Medical/Data Terms**

*Note: when reporting frequencies, include the relative frequencies in a table.*

Description of select attributes

**Subject** Patients involved in the medical event

Unknown is the most common subject value (48%). Preprocessing the column to group similar terms, such as 'patient' and 'patients', may yield more meaningful categories.
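Grouping similar subject terms could be sketched as follows. This is a minimal illustration, not the preprocessing used for this dataset: the `normalize_subject` helper, the `CANONICAL` mapping, and the sample values are all hypothetical and would need to be extended after inspecting the real column.

```python
import re
from collections import Counter

# Hypothetical mapping of surface forms to one canonical label (assumption)
CANONICAL = {"patients": "patient", "women": "woman", "men": "man", "children": "child"}

def normalize_subject(value):
    """Lowercase, trim, drop a leading article, and map known plurals to singular."""
    # Missing values (None, NaN) and blank strings are grouped as 'unknown'
    if not isinstance(value, str) or not value.strip():
        return "unknown"
    text = re.sub(r"\s+", " ", value.strip().lower())
    text = re.sub(r"^(a|an|the)\s+", "", text)
    return CANONICAL.get(text, text)

# Made-up sample values, for illustration only
raw = ["patient", "Patients", " a patient", "", None, "the patients"]
counts = Counter(normalize_subject(v) for v in raw)
print(counts)
```

On the dataframe, the same helper could be applied with `df['Subject'].map(normalize_subject)` before computing frequencies.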

**Drug** A prescribed medication that causes a physiological effect 

The distribution of drugs in the dataset is roughly uniform, with 'methotrexate' appearing slightly more often than any other drug. It is unclear whether the column mixes brand and generic drug names.

**Adverse_event** Abbreviated as ADE. Denotes potentially harmful effects of medical therapies.

The most frequent adverse-event value is 'induced.' The events may need further grouping.

**Effect** Indicates the outcome of the treatment.

Terms may overlap substantially and may need to be normalized.

**Race** Indicates the subject’s race/nationality

Consistent with the large share of unknown subjects, most race values are also unknown (92%).
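The share of unknown values can be quantified for every column at once. The sketch below assumes that a cell counts as unknown when it is NaN/None or an empty or whitespace-only string; the `missing_share` helper and the toy dataframe are hypothetical and only illustrate the idea.

```python
import pandas as pd

def missing_share(df):
    """Percentage of missing cells per column, highest first."""
    # Assumption: NaN/None and empty or whitespace-only strings all count as missing
    is_missing = df.fillna("").astype(str).apply(lambda col: col.str.strip().eq(""))
    return (is_missing.mean() * 100).round(1).sort_values(ascending=False)

# Toy data for illustration only, not taken from the PHEE dataset
toy = pd.DataFrame({
    "Subject": ["a patient", "", None, "patients"],
    "Race": ["", "", "", "white"],
    "Drug": ["warfarin", "INH", "gemcitabine", "methotrexate"],
})
shares = missing_share(toy)
print(shares)
```

Calling `missing_share(df)` on the full dataframe would rank the attributes by how sparsely they are annotated.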

**Text** Sentences extracted from biomedical literature, MEDLINE case reports, annotated with information relevant to pharmacovigilance.


> **Ex** Diarrhoea, T-CD4+ lymphopenia and bilateral patchy pulmonary infiltrates developed in a male 60 yrs of age, who was treated with oxaliplatinum and 5-fluorouracil for unresectable rectum carcinoma.



In [3]:
print("An example sentence extracted from a MEDLINE case report: \n {} \n".format(df['Text'].iloc[5]))

An example sentence extracted from a MEDLINE case report: 
 Occasionally, despite good therapeutic response, clozapine must be stopped due to dangerous side effects such as agranulocytosis.
 



In [4]:
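def relative_freq(df, col):
    """
    Relative frequency (%) of each value in a column of a df

    @P:
    df (dataframe): Dataframe of data
    col (str): Name of the column of interest

    @R:
    dataframe: Percentage of rows per value
    """
    # Rounded to whole percentages, so rare values display as 0.0
    return round(df[col].value_counts(normalize=True)*100, 0).to_frame()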

In [10]:
relative_freq(df,'Race')

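| Race value | Relative frequency (%) |
|---|---|
| (blank) | 92.0 |
| High | 6.0 |
| Low | 0.0 |
| white | 0.0 |
| Japanese | 0.0 |
| Medium | 0.0 |
| Caucasian | 0.0 |
| Indian | 0.0 |
| black | 0.0 |
| Chinese | 0.0 |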
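A frequency table that reports raw counts alongside percentages, and labels blank values explicitly, could be sketched as follows. The `frequency_table` helper, its `unknown_label` parameter, and the toy series are hypothetical. Keeping one decimal place is a choice made here so that rare values are less likely to collapse to 0.0.

```python
import pandas as pd

def frequency_table(series, unknown_label="unknown"):
    """Counts and relative frequencies (%) for one column."""
    # Assumption: NaN and empty or whitespace-only strings form one 'unknown' category
    cleaned = series.fillna("").astype(str).str.strip().replace("", unknown_label)
    counts = cleaned.value_counts()
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / counts.sum() * 100).round(1),
    })

# Toy values for illustration only, not taken from the PHEE dataset
toy = pd.Series(["", "", "", "white", "Japanese", None, "white", " "])
result = frequency_table(toy)
print(result)
```

On the real data this would be called as `frequency_table(df['Race'])`, or with any other column of interest.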


# **Distribution of Text Data Tokens, Parts of Speech, Prevalent Terms**