## <img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
# GA Capstone - Classifying Adverse Event Seriousness using NLP

## Part 1 - Data Cleaning
- [Executive Summary](#Executive-Summary)
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Datasets](#Datasets)
- [Data Cleaning](#Data-Cleaning)

## Executive Summary

An adverse event (AE) is a harmful or negative outcome that occurs when a patient has been provided with medical care or treatment. This project aims to build a model that classifies AEs as serious or non-serious. In the data cleaning section, duplicates were removed, null values were either imputed or dropped, and the target column `serious` was created. Through EDA, we found `symptom_text` to be the most predictive text column, so it will be used for modelling.

A total of 5 models were evaluated: (i) Logistic Regression; (ii) Naive Bayes - Multinomial; (iii) Random Forest Classifier; (iv) Ada Boost Classifier; and (v) Support Vector Machine (SVM). Two train/test splits were tried: 10/90 and 80/20. Modelling was done with and without Synthetic Minority Oversampling Technique (SMOTE), and the model that met the target was Logistic Regression using a TF-IDF vectorizer with the 80/20 split and without SMOTE. This model yielded a test accuracy of 0.904, a train accuracy of 0.914 and an F1 score of 0.709.

In conclusion, the model met the requirements set out in the problem statement. Our recommendation is to implement the model as a preliminary screening tool for all incoming AE reports, to get an initial seriousness classification. This would allow serious reports to be expedited and processed more quickly, so that signal detection can occur more efficiently.




## Problem Statement

As data scientists in a consulting firm engaged by the Health Authority in Singapore, we have been tasked with creating a model that uses Natural Language Processing (NLP) to differentiate between serious and non-serious AEs in reports obtained from various sources.

The following models will be tested as potential candidates:
* Logistic Regression
* Naive Bayes - Multinomial
* Random Forest Classifier
* Ada Boost Classifier
* Support Vector Machine (SVM)

A successful model is defined as having an accuracy and F1 score of at least 0.7.
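
The success criterion can be expressed as a small check. The sketch below is illustrative only: the helper names (`accuracy_and_f1`, `meets_target`) and the toy labels are invented here, and it assumes the F1 score is computed on the serious class (label 1), which the problem statement does not specify.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the serious class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def meets_target(y_true, y_pred, threshold=0.7):
    """True only if both accuracy and F1 reach the threshold."""
    accuracy, f1 = accuracy_and_f1(y_true, y_pred)
    return accuracy >= threshold and f1 >= threshold


# Toy labels (1 = serious, 0 = non-serious), invented for illustration only
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, round(f1, 3), meets_target(y_true, y_pred))
```

In this toy case the accuracy clears 0.7 while the F1 score does not, which is why both metrics are required when serious reports are the minority class.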

## Background

An AE is a harmful or negative outcome that occurs when a patient has been provided with medical care or treatment ([source](https://www.ncbi.nlm.nih.gov/books/NBK558963/)).

With every new drug or vaccine that comes onto the market, health authorities worldwide need to be on the lookout for any signals indicating that the product is unsafe for the general population. A major source of these signals is the submission of spontaneous AE reports from any source (e.g. patients, healthcare professionals, social media) ([source](https://cioms.ch/wp-content/uploads/2018/03/WG8-Signal-Detection.pdf)).

Generally, companies and authorities want to detect a spike in serious AEs. In companies, this labelling of seriousness is generally done manually, which is time-consuming and increases the risk of human misclassification. An AE is classified as serious if the patient outcome is one of the following ([source](https://www.fda.gov/safety/reporting-serious-problems-fda/what-serious-adverse-event)):
* Death
* Life-threatening
* Hospitalisation (initial or prolonged)
* Disability or Permanent Damage
* Congenital Anomaly or Birth Defect
* Other Serious (Important Medical Events (IME))

Since not all reporters have the expertise to determine whether an AE is serious, there is a need for a system that quickly and accurately identifies potentially serious AEs, to ensure a timely response, which may include advisories or a drug recall by the health authority or pharmaceutical company.

The dataset is taken from the Vaccine Adverse Event Reporting System ([VAERS](https://vaers.hhs.gov/index.html)). VAERS is a national warning system in the US that detects possible safety problems in US-licensed vaccines, and anyone can report an AE to VAERS ([source](https://vaers.hhs.gov/about.html)).


## Datasets
* `VAERSDATA.CSV`: Patient information, general AE description and fields
* `VAERSVAX.CSV`: Information on vaccine 
* `VAERSSYMPTOMS.CSV`: Information on symptoms

In [1]:
# Import libraries

# Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Time
import time
from datetime import datetime

# Data Processing
import re
import string
import nltk

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

### Load Data

In [3]:
# Load files
vaers_data = pd.read_csv('../data/2021VAERSDATA1.csv')
vaers_sym = pd.read_csv('../data/2021VAERSSYMPTOMS.csv')
vaers_vax = pd.read_csv('../data/2021VAERSVAX1.csv')
ime = pd.read_excel('../data/ime_v240.xlsx')

In [4]:
vaers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 611413 entries, 0 to 611412
Data columns (total 35 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   VAERS_ID      611413 non-null  int64  
 1   RECVDATE      611413 non-null  object 
 2   STATE         535013 non-null  object 
 3   AGE_YRS       547517 non-null  float64
 4   CAGE_YR       490215 non-null  float64
 5   CAGE_MO       2328 non-null    float64
 6   SEX           611413 non-null  object 
 7   RPT_DATE      350 non-null     object 
 8   SYMPTOM_TEXT  611275 non-null  object 
 9   DIED          7896 non-null    object 
 10  DATEDIED      7047 non-null    object 
 11  L_THREAT      9388 non-null    object 
 12  ER_VISIT      52 non-null      object 
 13  HOSPITAL      36801 non-null   object 
 14  HOSPDAYS      25011 non-null   float64
 15  X_STAY        324 non-null     object 
 16  DISABLE       9576 non-null    object 
 17  RECOVD        555788 non-null  object 
 18  VAX_

In [5]:
vaers_sym.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816998 entries, 0 to 816997
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   VAERS_ID         816998 non-null  int64  
 1   SYMPTOM1         816998 non-null  object 
 2   SYMPTOMVERSION1  816998 non-null  float64
 3   SYMPTOM2         639352 non-null  object 
 4   SYMPTOMVERSION2  639352 non-null  float64
 5   SYMPTOM3         494077 non-null  object 
 6   SYMPTOMVERSION3  494077 non-null  float64
 7   SYMPTOM4         374761 non-null  object 
 8   SYMPTOMVERSION4  374761 non-null  float64
 9   SYMPTOM5         278654 non-null  object 
 10  SYMPTOMVERSION5  278654 non-null  float64
dtypes: float64(5), int64(1), object(5)
memory usage: 68.6+ MB


In [6]:
vaers_vax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640911 entries, 0 to 640910
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   VAERS_ID         640911 non-null  int64 
 1   VAX_TYPE         640911 non-null  object
 2   VAX_MANU         640911 non-null  object
 3   VAX_LOT          440447 non-null  object
 4   VAX_DOSE_SERIES  637937 non-null  object
 5   VAX_ROUTE        491738 non-null  object
 6   VAX_SITE         471116 non-null  object
 7   VAX_NAME         640911 non-null  object
dtypes: int64(1), object(7)
memory usage: 39.1+ MB


In [7]:
vaers = vaers_data.merge(vaers_sym, on='VAERS_ID',how='inner').merge(vaers_vax, on='VAERS_ID',how='inner')

In [8]:
vaers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 861433 entries, 0 to 861432
Data columns (total 52 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   VAERS_ID         861433 non-null  int64  
 1   RECVDATE         861433 non-null  object 
 2   STATE            766942 non-null  object 
 3   AGE_YRS          786777 non-null  float64
 4   CAGE_YR          706194 non-null  float64
 5   CAGE_MO          4685 non-null    float64
 6   SEX              861433 non-null  object 
 7   RPT_DATE         490 non-null     object 
 8   SYMPTOM_TEXT     861285 non-null  object 
 9   DIED             14536 non-null   object 
 10  DATEDIED         13439 non-null   object 
 11  L_THREAT         21764 non-null   object 
 12  ER_VISIT         79 non-null      object 
 13  HOSPITAL         83553 non-null   object 
 14  HOSPDAYS         60868 non-null   float64
 15  X_STAY           658 non-null     object 
 16  DISABLE          21370 non-null   obje

In [9]:
vaers.head()

Unnamed: 0,VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES,SYMPTOM1,SYMPTOMVERSION1,SYMPTOM2,SYMPTOMVERSION2,SYMPTOM3,SYMPTOMVERSION3,SYMPTOM4,SYMPTOMVERSION4,SYMPTOM5,SYMPTOMVERSION5,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME
0,916600,01/01/2021,TX,33.0,33.0,,F,,Right side of epiglottis swelled up and hinder swallowing pictures taken Benadryl Tylenol taken,,,,,,,,,Y,12/28/2020,12/30/2020,2.0,,PVT,,,,,,,2,01/01/2021,,Y,,Pcn and bee venom,Dysphagia,23.1,Epiglottitis,23.1,,,,,,,COVID19,MODERNA,037K20A,1,IM,LA,COVID19 (COVID19 (MODERNA))
1,916601,01/01/2021,CA,73.0,73.0,,F,,"Approximately 30 min post vaccination administration patient demonstrated SOB and anxiousness. Assessed at time of event: Heart sounds normal, Lung sounds clear. Vitals within normal limits for patient. O2 91% on 3 liters NC Continuous flow. 2 consecutive nebulized albuterol treatments were administered. At approximately 1.5 hours post reaction, patients' SOB and anxiousness had subsided and the patient stated that they were feel ""much better"".",,,,,,,,,Y,12/31/2020,12/31/2020,0.0,,SEN,,Patient residing at nursing facility. See patients chart.,Patient residing at nursing facility. See patients chart.,Patient residing at nursing facility. See patients chart.,,,2,01/01/2021,,Y,,"""Dairy""",Anxiety,23.1,Dyspnoea,23.1,,,,,,,COVID19,MODERNA,025L20A,1,IM,RA,COVID19 (COVID19 (MODERNA))
2,916602,01/01/2021,WA,23.0,23.0,,F,,"About 15 minutes after receiving the vaccine, the patient complained about her left arm hurting. She also complained of chest tightness and difficulty swallowing. Patient also had vision changes. We gave the patient 1 tablet of Benadryl 25 mg and called EMS services. EMS checked her out and we advised the patient to go to the ER to be observed and given more Benadryl. Patient was able to walk out of facility herself.",,,,,,,,,U,12/31/2020,12/31/2020,0.0,,SEN,,,,,,,2,01/01/2021,,,Y,Shellfish,Chest discomfort,23.1,Dysphagia,23.1,Pain in extremity,23.1,Visual impairment,23.1,,,COVID19,PFIZER\BIONTECH,EL1284,1,IM,LA,COVID19 (COVID19 (PFIZER-BIONTECH))
3,916603,01/01/2021,WA,58.0,58.0,,F,,"extreme fatigue, dizziness,. could not lift my left arm for 72 hours",,,,,,,,,Y,12/23/2020,12/23/2020,0.0,none,WRK,,none,kidney infection,"diverticulitis, mitral valve prolapse, osteoarthritis","got measles from measel shot, mums from mumps shot, headaches and nausea from flu shot",,2,01/01/2021,,,,"Diclofenac, novacaine, lidocaine, pickles, tomatoes, milk",Dizziness,23.1,Fatigue,23.1,Mobility decreased,23.1,,,,,COVID19,MODERNA,unknown,UNK,,,COVID19 (COVID19 (MODERNA))
4,916604,01/01/2021,TX,47.0,47.0,,F,,"Injection site swelling, redness, warm to the touch and itchy",,,,,,,,,N,12/22/2020,12/29/2020,7.0,,PUB,,Na,Na,,,,2,01/01/2021,,,,Na,Injection site erythema,23.1,Injection site pruritus,23.1,Injection site swelling,23.1,Injection site warmth,23.1,,,COVID19,MODERNA,,1,IM,LA,COVID19 (COVID19 (MODERNA))


In [10]:
# Convert column names into lowercase
vaers.columns = vaers.columns.str.lower() 

In [11]:
vaers['vax_type'].unique()

array(['COVID19', 'FLUC4', 'DTAPHEPBIP', 'HIBV', 'PNC13', 'RV1', 'UNK',
       'FLU4', 'PPV', 'FLUA3', 'VARZOS', 'MMR', 'DT', 'HPV9', 'DTAP',
       'MMRV', 'TDAP', 'FLUR4', 'DTAPIPVHIB', 'HEPA', 'MNQ', 'FLUX', 'YF',
       'HEP', 'FLUA4', 'FLUC3', 'HPV4', 'ANTH', 'VARCEL', 'RV5', 'MENB',
       'IPV', 'RAB', 'FLUN4', 'DTAPIPV', 'TYP', 'ADEN_4_7', 'CHOL',
       'TTOX', 'FLU3', 'HEPAB', 'TD', 'EBZR', 'PNC', 'DF', 'HPVX',
       'FLUX(H1N1)', 'RVX', 'MENHIB', 'DTP', 'MEN', 'JEV1', 'FLU(H1N1)',
       'MNQHIB', 'OPV', 'SMALL', 'TDAPIPV', 'FLUN3', 'DTPHEP', 'JEVX',
       'DTPPVHBHPB', '6VAX-F'], dtype=object)

In [12]:
# We only want COVID19 vaccine related AEs
# Create a new df with only rows that contain such values
df = vaers[vaers['vax_type'] == 'COVID19'].copy()

In [13]:
df.head()

Unnamed: 0,vaers_id,recvdate,state,age_yrs,cage_yr,cage_mo,sex,rpt_date,symptom_text,died,datedied,l_threat,er_visit,hospital,hospdays,x_stay,disable,recovd,vax_date,onset_date,numdays,lab_data,v_adminby,v_fundby,other_meds,cur_ill,history,prior_vax,splttype,form_vers,todays_date,birth_defect,ofc_visit,er_ed_visit,allergies,symptom1,symptomversion1,symptom2,symptomversion2,symptom3,symptomversion3,symptom4,symptomversion4,symptom5,symptomversion5,vax_type,vax_manu,vax_lot,vax_dose_series,vax_route,vax_site,vax_name
0,916600,01/01/2021,TX,33.0,33.0,,F,,Right side of epiglottis swelled up and hinder swallowing pictures taken Benadryl Tylenol taken,,,,,,,,,Y,12/28/2020,12/30/2020,2.0,,PVT,,,,,,,2,01/01/2021,,Y,,Pcn and bee venom,Dysphagia,23.1,Epiglottitis,23.1,,,,,,,COVID19,MODERNA,037K20A,1,IM,LA,COVID19 (COVID19 (MODERNA))
1,916601,01/01/2021,CA,73.0,73.0,,F,,"Approximately 30 min post vaccination administration patient demonstrated SOB and anxiousness. Assessed at time of event: Heart sounds normal, Lung sounds clear. Vitals within normal limits for patient. O2 91% on 3 liters NC Continuous flow. 2 consecutive nebulized albuterol treatments were administered. At approximately 1.5 hours post reaction, patients' SOB and anxiousness had subsided and the patient stated that they were feel ""much better"".",,,,,,,,,Y,12/31/2020,12/31/2020,0.0,,SEN,,Patient residing at nursing facility. See patients chart.,Patient residing at nursing facility. See patients chart.,Patient residing at nursing facility. See patients chart.,,,2,01/01/2021,,Y,,"""Dairy""",Anxiety,23.1,Dyspnoea,23.1,,,,,,,COVID19,MODERNA,025L20A,1,IM,RA,COVID19 (COVID19 (MODERNA))
2,916602,01/01/2021,WA,23.0,23.0,,F,,"About 15 minutes after receiving the vaccine, the patient complained about her left arm hurting. She also complained of chest tightness and difficulty swallowing. Patient also had vision changes. We gave the patient 1 tablet of Benadryl 25 mg and called EMS services. EMS checked her out and we advised the patient to go to the ER to be observed and given more Benadryl. Patient was able to walk out of facility herself.",,,,,,,,,U,12/31/2020,12/31/2020,0.0,,SEN,,,,,,,2,01/01/2021,,,Y,Shellfish,Chest discomfort,23.1,Dysphagia,23.1,Pain in extremity,23.1,Visual impairment,23.1,,,COVID19,PFIZER\BIONTECH,EL1284,1,IM,LA,COVID19 (COVID19 (PFIZER-BIONTECH))
3,916603,01/01/2021,WA,58.0,58.0,,F,,"extreme fatigue, dizziness,. could not lift my left arm for 72 hours",,,,,,,,,Y,12/23/2020,12/23/2020,0.0,none,WRK,,none,kidney infection,"diverticulitis, mitral valve prolapse, osteoarthritis","got measles from measel shot, mums from mumps shot, headaches and nausea from flu shot",,2,01/01/2021,,,,"Diclofenac, novacaine, lidocaine, pickles, tomatoes, milk",Dizziness,23.1,Fatigue,23.1,Mobility decreased,23.1,,,,,COVID19,MODERNA,unknown,UNK,,,COVID19 (COVID19 (MODERNA))
4,916604,01/01/2021,TX,47.0,47.0,,F,,"Injection site swelling, redness, warm to the touch and itchy",,,,,,,,,N,12/22/2020,12/29/2020,7.0,,PUB,,Na,Na,,,,2,01/01/2021,,,,Na,Injection site erythema,23.1,Injection site pruritus,23.1,Injection site swelling,23.1,Injection site warmth,23.1,,,COVID19,MODERNA,,1,IM,LA,COVID19 (COVID19 (MODERNA))


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 830256 entries, 0 to 861432
Data columns (total 52 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         830256 non-null  int64  
 1   recvdate         830256 non-null  object 
 2   state            741381 non-null  object 
 3   age_yrs          763255 non-null  float64
 4   cage_yr          684564 non-null  float64
 5   cage_mo          1236 non-null    float64
 6   sex              830256 non-null  object 
 7   rpt_date         316 non-null     object 
 8   symptom_text     830117 non-null  object 
 9   died             14041 non-null   object 
 10  datedied         13027 non-null   object 
 11  l_threat         20939 non-null   object 
 12  er_visit         51 non-null      object 
 13  hospital         80555 non-null   object 
 14  hospdays         58937 non-null   float64
 15  x_stay           580 non-null     object 
 16  disable          19797 non-null   obje

In [15]:
df.describe()

Unnamed: 0,vaers_id,age_yrs,cage_yr,cage_mo,hospdays,numdays,form_vers,symptomversion1,symptomversion2,symptomversion3,symptomversion4,symptomversion5
count,830256.0,763255.0,684564.0,1236.0,58937.0,746438.0,830256.0,830256.0,652444.0,506226.0,385859.0,288495.0
mean,1315408.0,49.895218,49.587898,0.058738,22.644095,26.248461,1.999559,23.928381,23.928405,23.927833,23.929738,23.929175
std,246505.6,18.540705,18.688149,0.155221,1302.428343,592.728382,0.020991,0.256573,0.256308,0.257123,0.255132,0.257114
min,916600.0,0.08,0.0,0.0,1.0,0.0,1.0,23.1,23.1,23.1,23.1,23.1
25%,1105591.0,35.0,35.0,0.0,2.0,0.0,2.0,24.0,24.0,24.0,24.0,24.0
50%,1288410.0,50.0,50.0,0.0,3.0,1.0,2.0,24.0,24.0,24.0,24.0,24.0
75%,1540335.0,64.0,64.0,0.0,6.0,6.0,2.0,24.0,24.0,24.0,24.0,24.0
max,1771204.0,119.0,120.0,1.0,99999.0,44224.0,2.0,24.1,24.1,24.1,24.1,24.1


## Data Cleaning

#### Removing Duplicates
Duplicate records may arise for a few reasons, such as multiple reporters (e.g. healthcare professionals, pharmaceutical companies) reporting the same case, or erroneous data entry (e.g. clicking submit twice). Since these duplicates could skew our predictions and lead to overfitting of our model, these rows will be removed.

In [16]:
# Check for duplicate records
df.duplicated().value_counts()

False    828665
True       1591
dtype: int64

In [17]:
# Drop duplicates, then confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().value_counts()

False    828665
dtype: int64

### Form Version

There are 2 different versions of the form for reporting AEs, VAERS 1 and VAERS 2 ([source](https://vaers.hhs.gov/docs/VAERSDataUseGuide_en_September2021.pdf)).

Most fields are identical in both forms, with the exception of a number of fields that are present in VAERS 1 but not in VAERS 2, such as `er_visit`, `v_fundby` and `rpt_date`. The fields that differ appear to contain administrative information or to have an equivalent field with a different name in VAERS 2.

Let's check how the records are split between the 2 versions.

In [18]:
# Look at distribution of form versions
df['form_vers'].value_counts()

2    828302
1       363
Name: form_vers, dtype: int64

Since the vast majority of reports use the VAERS 2 form, and VAERS 1 makes up less than 0.05% of the dataset (363 of 828,665 rows), we will drop VAERS 1 reports and use VAERS 2 fields for training the model.
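
The share of each form version can be checked directly from the counts printed above. This is a small sketch: the counts are copied from that output into a stand-in Series (`form_counts`, a name introduced here), rather than recomputed from `df`.

```python
import pandas as pd

# Counts copied from the value_counts() output above (index = form version)
form_counts = pd.Series({2: 828302, 1: 363}, name="form_vers")

# Percentage of rows per form version
share_pct = form_counts / form_counts.sum() * 100
vaers1_pct = round(share_pct.loc[1], 3)
print(share_pct.round(3))
```

On the real data, `df['form_vers'].value_counts(normalize=True)` should give the same proportions without copying the counts by hand.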

In [19]:
# Drop rows that are not VAERS 2
df.drop(df[df['form_vers'] != 2].index, inplace=True)
df['form_vers'].value_counts()

2    828302
Name: form_vers, dtype: int64

In [20]:
# Drop columns that are associated with VAERS 1 only
df.drop(columns=['rpt_date', 'er_visit', 'v_fundby'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 828302 entries, 0 to 861432
Data columns (total 49 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         828302 non-null  int64  
 1   recvdate         828302 non-null  object 
 2   state            739950 non-null  object 
 3   age_yrs          761370 non-null  float64
 4   cage_yr          682687 non-null  float64
 5   cage_mo          1234 non-null    float64
 6   sex              828302 non-null  object 
 7   symptom_text     828163 non-null  object 
 8   died             13959 non-null   object 
 9   datedied         12951 non-null   object 
 10  l_threat         20869 non-null   object 
 11  hospital         80298 non-null   object 
 12  hospdays         58810 non-null   float64
 13  x_stay           578 non-null     object 
 14  disable          19724 non-null   object 
 15  recovd           761919 non-null  object 
 16  vax_date         781749 non-null  obje

### MedDRA versioning

Adopted by the international community, the Medical Dictionary for Regulatory Activities (MedDRA) is a standardised medical terminology dictionary created by working parties of EU regulatory authorities and industry representatives, based on terminology belonging to the Medicines and Healthcare products Regulatory Agency (MHRA) of the UK ([source](https://www.meddra.org/about-meddra/history)).

MedDRA is updated biannually with new versions released in March and September each year ([source](https://www.meddra.org/faq)). 

Let's take a look at the MedDRA versions present in this dataset.

In [21]:
df['symptomversion1'].value_counts()

24.0    702811
23.1     71962
24.1     53529
Name: symptomversion1, dtype: int64

In [22]:
df['symptomversion2'].value_counts()

24.0    553171
23.1     56443
24.1     41470
Name: symptomversion2, dtype: int64

In [23]:
df['symptomversion3'].value_counts()

24.0    428969
23.1     44102
24.1     32174
Name: symptomversion3, dtype: int64

In [24]:
df['symptomversion4'].value_counts()

24.0    325954
23.1     32974
24.1     26159
Name: symptomversion4, dtype: int64

In [25]:
df['symptomversion5'].value_counts()

24.0    241635
23.1     25015
24.1     21249
Name: symptomversion5, dtype: int64

Since our dataset comprises AEs reported in 2021, three MedDRA versions are present: 23.1 (released in Sep 2020), 24.0 (Mar 2021) and 24.1 (Sep 2021). Not all reports list 5 symptoms, so the counts for symptoms 2-5 do not add up to the total number of records. Since most of the data uses MedDRA version 24.0 (about 85% of `symptomversion1`), these rows will be kept, while rows containing the other versions will be dropped.

In [26]:
# Drop rows where any symptom uses a MedDRA version other than 24.0
df.drop(df[(df['symptomversion1'] == 23.1)|(df['symptomversion1'] == 24.1)|
           (df['symptomversion2'] == 23.1)|(df['symptomversion2'] == 24.1)|
           (df['symptomversion3'] == 23.1)|(df['symptomversion3'] == 24.1)|
           (df['symptomversion4'] == 23.1)|(df['symptomversion4'] == 24.1)|
           (df['symptomversion5'] == 23.1)|(df['symptomversion5'] == 24.1)].index, inplace=True)
df['symptomversion1'].value_counts()

24.0    702795
Name: symptomversion1, dtype: int64
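
The row filter above can also be written with `isin`, which may be easier to extend if more versions appear. The sketch below runs on a small invented frame (`toy`) rather than the real data, and is intended to have the same effect: a row is dropped when any version column holds 23.1 or 24.1, while missing versions (NaN) never trigger a drop.

```python
import numpy as np
import pandas as pd

version_cols = [f"symptomversion{i}" for i in range(1, 6)]

# Invented example rows; the index values are arbitrary labels
toy = pd.DataFrame(
    {
        "symptomversion1": [24.0, 23.1, 24.0, 24.0],
        "symptomversion2": [24.0, 23.1, np.nan, 24.1],
        "symptomversion3": [np.nan] * 4,
        "symptomversion4": [np.nan] * 4,
        "symptomversion5": [np.nan] * 4,
    },
    index=[10, 11, 12, 13],
)

# True for rows where at least one symptom uses a version other than 24.0
other_versions = toy[version_cols].isin([23.1, 24.1]).any(axis=1)
kept = toy[~other_versions]
print(kept.index.tolist())
```

Row 13 is removed even though its first symptom uses 24.0, because its second symptom uses 24.1.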

In [27]:
# Drop the version columns as we no longer require them
df.drop(columns=['symptomversion1', 'symptomversion2', 'symptomversion3', 
                 'symptomversion4', 'symptomversion5'], inplace=True)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 702795 entries, 17 to 861395
Data columns (total 44 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         702795 non-null  int64  
 1   recvdate         702795 non-null  object 
 2   state            631674 non-null  object 
 3   age_yrs          655269 non-null  float64
 4   cage_yr          600845 non-null  float64
 5   cage_mo          1080 non-null    float64
 6   sex              702795 non-null  object 
 7   symptom_text     702665 non-null  object 
 8   died             9329 non-null    object 
 9   datedied         8663 non-null    object 
 10  l_threat         16234 non-null   object 
 11  hospital         58479 non-null   object 
 12  hospdays         43512 non-null   float64
 13  x_stay           385 non-null     object 
 14  disable          15574 non-null   object 
 15  recovd           647579 non-null  object 
 16  vax_date         671012 non-null  obj

### Missing Values

In [29]:
df_missing = df.isnull().sum()
df_missing = pd.DataFrame(df_missing, columns = ['number_missing'])
df_missing['percentage_missing'] = df_missing['number_missing']*100/df.shape[0]
df_missing.sort_values(by='number_missing', ascending=False)

Unnamed: 0,number_missing,percentage_missing
x_stay,702410,99.945219
birth_defect,702338,99.934974
cage_mo,701715,99.846328
datedied,694132,98.76735
died,693466,98.672586
disable,687221,97.783991
l_threat,686561,97.69008
prior_vax,663978,94.476768
hospdays,659283,93.808721
hospital,644316,91.679081


A few columns will be dropped, namely `recvdate`, `state`, `cage_yr`, `cage_mo`, `todays_date`, `splttype`, `vax_lot`, `datedied` and `ofc_visit`.

In [30]:
# Drop columns that do not contain any predictive power
df = df.drop(columns=['recvdate', 'state', 'cage_yr', 'cage_mo', 'todays_date', 'splttype', 'vax_lot', 'datedied', 'ofc_visit'])
df.head()

Unnamed: 0,vaers_id,age_yrs,sex,symptom_text,died,l_threat,hospital,hospdays,x_stay,disable,recovd,vax_date,onset_date,numdays,lab_data,v_adminby,other_meds,cur_ill,history,prior_vax,form_vers,birth_defect,er_ed_visit,allergies,symptom1,symptom2,symptom3,symptom4,symptom5,vax_type,vax_manu,vax_dose_series,vax_route,vax_site,vax_name
17,916612,71.0,F,"Left side of face became numb, including to behind the left ear. Happened within 10 minutes of injection. Subsided within 30 minutes. The next day, some numbness returned at about 9pm in the evening. Pain behind left ear.",,,,,,,U,12/30/2020,12/30/2020,0.0,None yet,PVT,"levothyroxine 100mcg/day, estradiol 1mg/day",none,Graves Disease,,2,,,"penicillin, toradol, methimazole",Ear pain,Hypoaesthesia,,,,COVID19,MODERNA,1,IM,LA,COVID19 (COVID19 (MODERNA))
57,916641,44.0,F,"Vertigo every evening when lying down and every morning when getting up. I have been lying in bed for 5-10 minutes with eyes open, then sitting up slowly. Next, I sit on the side of the bed for a few minutes. When I get up, I need to hold onto something so I don't fall down.",,,,,,,N,12/28/2020,12/28/2020,0.0,none,PVT,"multivitamin, D3, baby aspirin",none,none,,2,,,"latex, sulfa drugs",Vertigo,,,,,COVID19,MODERNA,1,IM,RA,COVID19 (COVID19 (MODERNA))
138,916702,70.0,F,body aches and stomach ache,,,,,,,N,12/01/2020,01/01/2021,31.0,,PVT,Triamterene HCTZ Montelukast Celecoxib Aller-Tec Multivitamin Vitamin D3 Magneseum,,asthma when I get a cold,,2,,,too much cordosone,Abdominal pain upper,Pain,,,,COVID19,MODERNA,1,SYR,,COVID19 (COVID19 (MODERNA))
171,916725,66.0,F,A large red rash around injection site. Area is also hard.,,,,,,,N,12/23/2020,01/01/2021,9.0,,PVT,"Calcium, Fish oil, levothyroxine, Fosamax",none,hypothyroid,,2,,,bee stings,Injection site induration,Injection site reaction,Rash erythematous,,,COVID19,MODERNA,UNK,SYR,LA,COVID19 (COVID19 (MODERNA))
346,916850,20.0,F,"20 year old female c/o possible adverse reaction to vaccine. States she received the 1st round of covid vaccine 2 days ago. Reports having increase difficulty swallowing, difficulty breathing, + diarrhea, chest tightness and increase difficulty ambulating ever since getting the vaccine. She is calling seeking guidance on what to do. She reports she had the pfizer vaccine.",,,,,,,U,12/26/2020,12/28/2020,2.0,sent to ER,UNK,Escitalopram 5 mg Venlafaxine 37.5 mg tablet ZyrTEC;,,none,,2,,Y,Seasonal,Chest discomfort,Diarrhoea,Dysphagia,Dyspnoea,,COVID19,UNKNOWN MANUFACTURER,UNK,,,COVID19 (COVID19 (UNKNOWN))


#### Age

Missing values in `age_yrs` make up about 7% of the remaining rows (47,526 of 702,795), hence rows with a missing age will be dropped.

In [31]:
# Drop rows with missing age_yrs
df = df.dropna(subset = ['age_yrs'])

#### Missing Symptom Text

The column `symptom_text` contains a description of the AE experienced by the patient. Since we plan to use the text in this column as a feature to predict seriousness, and only 130 rows have missing values, we will drop rows with missing `symptom_text`.

In [32]:
# Look at rows with no symptom text
df[df['symptom_text'].isnull()].head()

Unnamed: 0,vaers_id,age_yrs,sex,symptom_text,died,l_threat,hospital,hospdays,x_stay,disable,recovd,vax_date,onset_date,numdays,lab_data,v_adminby,other_meds,cur_ill,history,prior_vax,form_vers,birth_defect,er_ed_visit,allergies,symptom1,symptom2,symptom3,symptom4,symptom5,vax_type,vax_manu,vax_dose_series,vax_route,vax_site,vax_name
39566,945204,23.0,F,,,,,,,,,01/14/2021,01/14/2021,0.0,,PVT,,,,,2,,,,Unevaluable event,,,,,COVID19,PFIZER\BIONTECH,1,IM,AR,COVID19 (COVID19 (PFIZER-BIONTECH))
40390,1349930,93.0,M,,,,,,,,U,01/11/2020,01/13/2020,2.0,,OTH,,,,,2,,Y,,Unevaluable event,,,,,COVID19,MODERNA,1,IM,RA,COVID19 (COVID19 (MODERNA))
45054,950337,70.0,F,,,,,,,,N,01/16/2021,01/16/2021,0.0,,PHM,,,,,2,,,,Unevaluable event,,,,,COVID19,MODERNA,1,,LA,COVID19 (COVID19 (MODERNA))
45090,950370,70.0,M,,,,,,,,Y,01/16/2021,01/16/2021,0.0,,PHM,,,,,2,,,,Unevaluable event,,,,,COVID19,MODERNA,1,,LA,COVID19 (COVID19 (MODERNA))
45813,1349873,79.0,M,,,,,,,,,01/14/2021,01/14/2021,0.0,,OTH,,,,,2,,Y,,Unevaluable event,,,,,COVID19,MODERNA,1,IM,LA,COVID19 (COVID19 (MODERNA))


In [33]:
df = df.dropna(subset = ['symptom_text'])

In [34]:
# Some of the rows have symptoms stated as 'No adverse event'
# Take a closer look to see the rationale behind this
# df[df['symptom1'] == 'No adverse event']

#### Missing Serious Criteria

For the serious criterion columns, a null value indicates that the criterion was not selected. Hence, null values will be encoded as 0 and 'Y' will be encoded as 1.

In [35]:
# Replace 'Y' with 1 and null values as 0
df['died'] = df['died'].apply(lambda x: 1 if x == 'Y' else 0)
df['l_threat'] = df['l_threat'].apply(lambda x: 1 if x == 'Y' else 0)
df['hospital'] = df['hospital'].apply(lambda x: 1 if x == 'Y' else 0)
df['x_stay'] = df['x_stay'].apply(lambda x: 1 if x == 'Y' else 0)
df['disable'] = df['disable'].apply(lambda x: 1 if x == 'Y' else 0)
df['birth_defect'] = df['birth_defect'].apply(lambda x: 1 if x == 'Y' else 0)
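
Once the criteria are encoded as 0/1, one way they could be combined into a single `serious` target is to flag a report as serious when any criterion is set. This is a sketch under assumptions, not necessarily how the target is built later in the project: it runs on an invented frame (`toy`) and ignores the "Other Serious (IME)" criterion, which has no flag column of its own.

```python
import pandas as pd

serious_cols = ["died", "l_threat", "hospital", "x_stay", "disable", "birth_defect"]

# Invented 0/1 flags for three hypothetical reports
toy = pd.DataFrame(
    {
        "died": [0, 0, 1],
        "l_threat": [0, 0, 0],
        "hospital": [0, 1, 0],
        "x_stay": [0, 0, 0],
        "disable": [0, 0, 0],
        "birth_defect": [0, 0, 0],
    }
)

# A report is serious (1) if at least one criterion flag is 1
toy["serious"] = toy[serious_cols].max(axis=1)
print(toy["serious"].tolist())
```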

#### Missing hospital days
The column `hospdays` will only be filled if the patient was hospitalised, and blank spaces indicate that the patient was not hospitalised. Hence, missing values will be imputed with 0.

In [36]:
# Impute missing values with 0
df['hospdays'] = df['hospdays'].fillna(0)

#### Missing recovered status
`recovd` has 3 possible values: 'U', 'N', 'Y'. Missing values will be imputed as 'U', which indicates that the recovery status is unknown.

In [37]:
df['recovd'].unique()

array(['U', 'N', 'Y', nan], dtype=object)

In [38]:
# Impute null values with 'U'
df['recovd'] = df['recovd'].fillna('U')

#### Numdays
`numdays` will be recomputed as the difference in days between `onset_date` and `vax_date`.

In [39]:
# Remove null values from vax_date and onset_date
df = df[df['vax_date'].notna()]
df = df[df['onset_date'].notna()]

# Convert columns to pd datetime format
df['vax_date'] = pd.to_datetime(df['vax_date'])
df['onset_date'] = pd.to_datetime(df['onset_date'])

# Calculate 'numdays'
df['numdays'] = (df['onset_date'] - df['vax_date']).dt.days

# Interval between vaccination date and onset date should be non-negative, otherwise the AE occurred prior to vaccination
df = df[df['numdays'] >= 0]
df['numdays'] = df['numdays'].astype(int)
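
The same calculation can be made a little more defensive by stating the date format explicitly and turning unparseable dates into `NaT`. The sketch below uses invented rows in a stand-in frame (`toy`) and assumes the dates follow the MM/DD/YYYY pattern seen in the `head()` output; it illustrates the idea and does not replace the cell above.

```python
import pandas as pd

# Invented rows: a normal case, an onset before vaccination,
# a same-day onset, and a missing vaccination date
toy = pd.DataFrame(
    {
        "vax_date": ["12/28/2020", "01/14/2021", "01/16/2021", None],
        "onset_date": ["12/30/2020", "01/13/2021", "01/16/2021", "01/20/2021"],
    }
)

# Parse with an explicit format; anything unparseable becomes NaT
for col in ["vax_date", "onset_date"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

# Keep rows where both dates are known, then compute the interval in days
toy = toy.dropna(subset=["vax_date", "onset_date"]).copy()
toy["numdays"] = (toy["onset_date"] - toy["vax_date"]).dt.days

# Drop rows where the AE is recorded as starting before vaccination
toy = toy[toy["numdays"] >= 0]
print(toy["numdays"].tolist())
```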

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 625108 entries, 17 to 861395
Data columns (total 35 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   vaers_id         625108 non-null  int64         
 1   age_yrs          625108 non-null  float64       
 2   sex              625108 non-null  object        
 3   symptom_text     625108 non-null  object        
 4   died             625108 non-null  int64         
 5   l_threat         625108 non-null  int64         
 6   hospital         625108 non-null  int64         
 7   hospdays         625108 non-null  float64       
 8   x_stay           625108 non-null  int64         
 9   disable          625108 non-null  int64         
 10  recovd           625108 non-null  object        
 11  vax_date         625108 non-null  datetime64[ns]
 12  onset_date       625108 non-null  datetime64[ns]
 13  numdays          625108 non-null  int64         
 14  lab_data         30

In [41]:
# Keep only rows with 'vax_date' after 2020-04-01
df = df[df['vax_date'] > '2020-04-01']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 624073 entries, 17 to 861395
Data columns (total 35 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   vaers_id         624073 non-null  int64         
 1   age_yrs          624073 non-null  float64       
 2   sex              624073 non-null  object        
 3   symptom_text     624073 non-null  object        
 4   died             624073 non-null  int64         
 5   l_threat         624073 non-null  int64         
 6   hospital         624073 non-null  int64         
 7   hospdays         624073 non-null  float64       
 8   x_stay           624073 non-null  int64         
 9   disable          624073 non-null  int64         
 10  recovd           624073 non-null  object        
 11  vax_date         624073 non-null  datetime64[ns]
 12  onset_date       624073 non-null  datetime64[ns]
 13  numdays          624073 non-null  int64         
 14  lab_data         30

#### Dropping of other text columns
Some text columns will not be used and will be dropped: `lab_data`, `cur_ill` and `prior_vax`. The date columns `onset_date` and `vax_date` are no longer needed once `numdays` has been calculated, so they will be dropped here as well.


In [42]:
# Drop text columns not used
df = df.drop(columns=['lab_data', 'cur_ill', 'onset_date', 'vax_date', 'prior_vax'])

#### ER/ED visits
Since null values in this column indicate that the patient did not visit the ER/ED, they will be imputed as 0, while rows with 'Y' will be encoded as 1.

In [43]:
# Look at unique values for ER/ED visits
df['er_ed_visit'].unique()

array([nan, 'Y'], dtype=object)

In [44]:
df['er_ed_visit'] = df['er_ed_visit'].apply(lambda x: 1 if x == 'Y' else 0)

#### Dose Series
There is a wide range of values for the number of doses. Rows with 'NaN' and 'UNK' will be dropped. Since the recommended dosing during the period of AE collection was 2 shots, the remaining values will be classified into 2 groups: doses 1 and 2 are labelled 0, and all higher doses are labelled 1.

In [45]:
df['vax_dose_series'].unique()

array(['1', 'UNK', '2', '3', nan, '7+', '4', '5', '6'], dtype=object)

In [46]:
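# Convert 'UNK' to 'NaN' and drop all null values
# Note: this .loc assignment has no column selector, so every column of the matching rows
# is set to NaN before they are dropped; this upcasts the integer columns to float64
df.loc[df['vax_dose_series'] == 'UNK'] = np.nan 
df = df.dropna(subset = ['vax_dose_series'])
# df = df[df['vax_dose_series'].notna()]
# Group doses: 1 and 2 -> 0, 3 and above -> 1
df['vax_dose_series'] = df['vax_dose_series'].map({'1':0, '2':0, '3':1, '4':1, '5':1, '6':1, '7+':1})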
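
As an illustration of the dose grouping, the sketch below restricts the 'UNK' replacement to the `vax_dose_series` column so the other columns keep their original dtypes. It is one possible variant on hypothetical data (`toy` and `dose_map` are names introduced here), not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'vaers_id': [1, 2, 3, 4],
    'vax_dose_series': ['1', 'UNK', '3', np.nan],
})

# Replace 'UNK' in this one column only, then drop rows with a missing dose
toy.loc[toy['vax_dose_series'] == 'UNK', 'vax_dose_series'] = np.nan
toy = toy.dropna(subset=['vax_dose_series']).copy()

# 0 = dose 1 or 2, 1 = dose 3 or later
dose_map = {'1': 0, '2': 0, '3': 1, '4': 1, '5': 1, '6': 1, '7+': 1}
toy['vax_dose_series'] = toy['vax_dose_series'].map(dose_map)
```

With the column selector in place, `vaers_id` remains an integer column after the rows are dropped.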

#### Vaccine Route
Null values will be imputed as 'UN'. 

In [47]:
# Look at the unique values and count for vax_route
print(df['vax_route'].unique())
df['vax_route'].value_counts()

['IM' 'SYR' nan 'ID' 'UN' 'SC' 'JET' 'IN' 'OT' 'PO']


IM     294719
SYR    136142
OT      49904
UN       3461
SC       1929
ID        328
JET       120
IN          4
PO          3
Name: vax_route, dtype: int64

In [48]:
# Impute null values as 'UN'
df['vax_route'] = df['vax_route'].fillna('UN')

#### Vaccine Site
Null values will be imputed as 'UN'. 

In [49]:
print(df['vax_site'].unique())
df['vax_site'].value_counts()

['LA' 'RA' nan 'AR' 'OT' 'UN' 'LL' 'GM' 'RL' 'NS' 'LG' 'MO']


LA    346494
RA    117731
AR     11377
UN     10884
OT       411
LL       184
RL       158
GM        15
LG         9
NS         7
MO         3
Name: vax_site, dtype: int64

In [50]:
df['vax_site'] = df['vax_site'].fillna('UN')

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543323 entries, 17 to 861395
Data columns (total 30 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         543323 non-null  float64
 1   age_yrs          543323 non-null  float64
 2   sex              543323 non-null  object 
 3   symptom_text     543323 non-null  object 
 4   died             543323 non-null  float64
 5   l_threat         543323 non-null  float64
 6   hospital         543323 non-null  float64
 7   hospdays         543323 non-null  float64
 8   x_stay           543323 non-null  float64
 9   disable          543323 non-null  float64
 10  recovd           543323 non-null  object 
 11  numdays          543323 non-null  float64
 12  v_adminby        543323 non-null  object 
 13  other_meds       385946 non-null  object 
 14  history          382625 non-null  object 
 15  form_vers        543323 non-null  float64
 16  birth_defect     543323 non-null  flo

### Text columns 
Text columns will undergo some additional data cleaning steps.

Some text columns will not be imputed, as they will be used for EDA but not for modelling.
The text columns kept are `symptom_text`, `other_meds`, `history` and `allergies`.

In [52]:
# Take a look at text columns
df[['symptom_text', 'other_meds', 'history', 'allergies']].head()

Unnamed: 0,symptom_text,other_meds,history,allergies
17,"Left side of face became numb, including to behind the left ear. Happened within 10 minutes of injection. Subsided within 30 minutes. The next day, some numbness returned at about 9pm in the evening. Pain behind left ear.","levothyroxine 100mcg/day, estradiol 1mg/day",Graves Disease,"penicillin, toradol, methimazole"
57,"Vertigo every evening when lying down and every morning when getting up. I have been lying in bed for 5-10 minutes with eyes open, then sitting up slowly. Next, I sit on the side of the bed for a few minutes. When I get up, I need to hold onto something so I don't fall down.","multivitamin, D3, baby aspirin",none,"latex, sulfa drugs"
138,body aches and stomach ache,Triamterene HCTZ Montelukast Celecoxib Aller-Tec Multivitamin Vitamin D3 Magneseum,asthma when I get a cold,too much cordosone
821,"12/31/2020 H/a, diarrhea, SEVERE joint pain all through body, severe exhaustion., nausea, chills, fever 99.9. It felt almost identical to my first couple says of covid.",,Serious episode of covid + 11/18/2020,
822,"12/31/2020 H/a, diarrhea, SEVERE joint pain all through body, severe exhaustion., nausea, chills, fever 99.9. It felt almost identical to my first couple says of covid.",,Serious episode of covid + 11/18/2020,


In [53]:
# df[['symptom_text', 'other_meds', 'history', 'allergies']] = df[['symptom_text', 'other_meds', 'history', 'allergies']].astype(dtype="string")

In [54]:
# df[['symptom_text', 'other_meds', 'history', 'allergies']].dtypes

In [55]:
%%time
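# Create function to remove special terms, digits and non-alphabetic characters
def regex_clean(text):
    # Cast once so that non-string values can be processed (NaN becomes the string 'nan')
    text = str(text)

    # Remove special terms (HTML entities, zero-width space code, underscores)
    text = re.sub(pattern=r'#x200B;|&lt;|&gt;|&amp;|_', repl=' ', string=text)

    # Remove all digits
    text = re.sub(pattern=r'\d+', repl=' ', string=text)
    #text = re.sub(pattern=r'\w*\d\w*', repl='', string=text)

    # Replace every run of non-letter characters with a single space
    # (digits are already removed above, so [^a-zA-Z] is sufficient)
    text = re.sub(pattern=r'[^a-zA-Z]+', repl=' ', string=text)

    return text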

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs


In [56]:
# Apply regex cleaning and look at results
df['symptom_text'] = df['symptom_text'].apply(regex_clean)
df['other_meds'] = df['other_meds'].apply(regex_clean)
df['history'] = df['history'].apply(regex_clean)
df['allergies'] = df['allergies'].apply(regex_clean)
df[['symptom_text', 'other_meds', 'history', 'allergies']].head()

Unnamed: 0,symptom_text,other_meds,history,allergies
17,Left side of face became numb including to behind the left ear Happened within minutes of injection Subsided within minutes The next day some numbness returned at about pm in the evening Pain behind left ear,levothyroxine mcg day estradiol mg day,Graves Disease,penicillin toradol methimazole
57,Vertigo every evening when lying down and every morning when getting up I have been lying in bed for minutes with eyes open then sitting up slowly Next I sit on the side of the bed for a few minutes When I get up I need to hold onto something so I don t fall down,multivitamin D baby aspirin,none,latex sulfa drugs
138,body aches and stomach ache,Triamterene HCTZ Montelukast Celecoxib Aller Tec Multivitamin Vitamin D Magneseum,asthma when I get a cold,too much cordosone
821,H a diarrhea SEVERE joint pain all through body severe exhaustion nausea chills fever It felt almost identical to my first couple says of covid,,Serious episode of covid,
822,H a diarrhea SEVERE joint pain all through body severe exhaustion nausea chills fever It felt almost identical to my first couple says of covid,,Serious episode of covid,
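
One side effect of the cleaning step is that `str(text)` turns a missing value into the literal string 'nan', so previously null cells become non-null. If missing values should stay missing, a variant along these lines could be used. This is a sketch only: `regex_clean_keep_missing` is a hypothetical name, and it folds digit removal into the final pattern instead of using a separate step.

```python
import re

def regex_clean_keep_missing(text):
    """Hypothetical variant of regex_clean that leaves non-string values untouched."""
    if not isinstance(text, str):
        return text  # e.g. NaN or None is passed through unchanged

    # Remove special terms (HTML entities, zero-width space code, underscores)
    text = re.sub(r'#x200B;|&lt;|&gt;|&amp;|_', ' ', text)

    # Replace every run of non-letter characters (digits included) with one space
    text = re.sub(r'[^a-zA-Z]+', ' ', text)

    return text

print(regex_clean_keep_missing('fever 99.9 &amp; chills_2 days'))  # fever chills days
print(regex_clean_keep_missing(None))                              # None
```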


### Creating a column for severity
Adverse events can be classified as either serious or non-serious. An event is classified as serious if the patient outcome is one of the following ([source](https://www.fda.gov/safety/reporting-serious-problems-fda/what-serious-adverse-event)):
* Death
* Life-threatening
* Hospitalisation (initial or prolonged)
* Disability or Permanent Damage
* Congenital Anomaly or Birth Defect
* Other Serious (Important Medical Events (IME))

The first 5 serious criteria are represented in our dataset by the following columns: `died`, `l_threat`, `hospital`, `x_stay`, `disable` and `birth_defect`. Hospitalisation is covered by two columns, `hospital` (initial) and `x_stay` (prolonged).

The list of IME can be obtained from MedDRA/EMA and matched against the `symptomX` (X = 1, 2, 3, 4 or 5) columns ([source](https://www.meddra.org/how-to-use/support-documentation/english)).

#### Serious column from serious criteria columns

In [57]:
# Creating a df of serious criteria
serious = df[['died', 'l_threat', 'hospital', 'x_stay', 'disable', 'birth_defect']].copy()

In [58]:
serious.died.value_counts()

0.0    535936
1.0      7387
Name: died, dtype: int64

In [59]:
%%time
# Create a function that flags a row as serious (1) if any serious criterion column equals 1

def serious_criteria(row):
    row['serious'] = 0
    for col in ['died', 'l_threat', 'hospital', 'x_stay', 'disable', 'birth_defect']:
        if row[col] == 1:
            row['serious'] = 1
            
    return row

#     if row['died'] == 'Y' or row['l_threat'] == 'Y' or row['hospital'] == 'Y' or row['x_stay'] == 'Y' or row['disable'] == 'Y' or row['birth_defect'] == 'Y':
#         return 1
#     else:
#         return 0


df = df.apply(serious_criteria, axis=1)

CPU times: user 2min 34s, sys: 3.19 s, total: 2min 37s
Wall time: 2min 38s


In [60]:
df['serious'].value_counts()

0    482320
1     61003
Name: serious, dtype: int64
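
The row-wise `apply` above took over two minutes. The same flag can likely be computed with column-wise operations, which are usually much faster on a frame of this size. A minimal sketch on hypothetical data (`toy` and `criteria` are names introduced here):

```python
import pandas as pd

criteria = ['died', 'l_threat', 'hospital', 'x_stay', 'disable', 'birth_defect']

# Toy data (hypothetical values): one row per report, 1 = criterion met
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0]],
    columns=criteria,
)

# A report is serious if at least one criterion column equals 1
toy['serious'] = toy[criteria].eq(1).any(axis=1).astype(int)
```

Here the first toy report is non-serious and the other two are serious.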

#### IME column from IME list

In [61]:
ime.head()

Unnamed: 0,MedDRA Code,PT Name,SOC Name,Comment,Added in 24.0,Primary SOC Change
0,10083258,Erythropoietin deficiency anaemia,Blood and lymphatic system disorders,Existing PT. Added after review by EVEWG.,X,
1,10051778,Factor IX inhibition,Blood and lymphatic system disorders,Existing PT. Added after review by EVEWG.,X,
2,10048619,Factor VIII inhibition,Blood and lymphatic system disorders,Existing PT. Added after review by EVEWG.,X,
3,10058116,Nephrogenic anaemia,Blood and lymphatic system disorders,Existing PT. Added after review by EVEWG.,X,
4,10068698,Familial hypocalciuric hypercalcaemia,"Congenital, familial and genetic disorders",Existing PT. Added after review by EVEWG.,X,


In [62]:
# Create list of ime conditions and convert them to lowercase
ime_conditions = ime['PT Name'].tolist()
ime_conditions = [i.lower() for i in ime_conditions]
ime_conditions[:5]

['erythropoietin deficiency anaemia',
 'factor ix inhibition',
 'factor viii inhibition',
 'nephrogenic anaemia',
 'familial hypocalciuric hypercalcaemia']

In [63]:
%%time
# Define a function that checks each row of the dataframe to see if any of symptoms 1-5 is in the IME list
# (named flag_ime so that it does not overwrite the `ime` dataframe)
def flag_ime(row):
    
    row['ime'] = 0
    
    for col in ['symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5']:
        
        if str(row[col]).lower() in ime_conditions:
            
            row['ime'] = 1
    
    return row

# apply the function to the dataframe and check the relevant columns
df = df.apply(flag_ime, axis=1)
# df[['symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5', 'ime']]

CPU times: user 6min 45s, sys: 5.15 s, total: 6min 50s
Wall time: 6min 54s


In [64]:
df['ime'].value_counts()

0    474438
1     68885
Name: ime, dtype: int64
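
The IME matching can also be sketched without a row-wise loop, using `isin` against a set of lowercase terms. The data below is hypothetical, and `ime_terms` holds only two of the terms shown in the `ime_conditions` output above.

```python
import pandas as pd

# Two lowercase IME terms for illustration; the real list is much longer
ime_terms = {'nephrogenic anaemia', 'factor ix inhibition'}
symptom_cols = ['symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5']

# Toy data (hypothetical values); None marks an unused symptom slot
toy = pd.DataFrame({
    'symptom1': ['Headache', 'Nephrogenic anaemia', 'Pyrexia'],
    'symptom2': ['Pyrexia', 'Fatigue', None],
    'symptom3': [None, None, None],
    'symptom4': [None, None, None],
    'symptom5': [None, None, None],
})

# Cast to str and lowercase, mirroring str(row[col]).lower() in the row-wise version
lowered = toy[symptom_cols].astype(str).apply(lambda s: s.str.lower())

# Flag a report if any of its five symptom terms is on the IME list
toy['ime'] = lowered.isin(ime_terms).any(axis=1).astype(int)
```

Only the second toy report is flagged, because 'Nephrogenic anaemia' matches an IME term once lowercased.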

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543323 entries, 17 to 861395
Data columns (total 32 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         543323 non-null  float64
 1   age_yrs          543323 non-null  float64
 2   sex              543323 non-null  object 
 3   symptom_text     543323 non-null  object 
 4   died             543323 non-null  float64
 5   l_threat         543323 non-null  float64
 6   hospital         543323 non-null  float64
 7   hospdays         543323 non-null  float64
 8   x_stay           543323 non-null  float64
 9   disable          543323 non-null  float64
 10  recovd           543323 non-null  object 
 11  numdays          543323 non-null  float64
 12  v_adminby        543323 non-null  object 
 13  other_meds       543323 non-null  object 
 14  history          543323 non-null  object 
 15  form_vers        543323 non-null  float64
 16  birth_defect     543323 non-null  flo

In [66]:
# Combine columns `serious` and `ime`
df['serious'] = df['serious'] + df['ime']
df['serious'] = df['serious'].apply(lambda x: 0 if x == 0 else 1)

In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543323 entries, 17 to 861395
Data columns (total 32 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   vaers_id         543323 non-null  float64
 1   age_yrs          543323 non-null  float64
 2   sex              543323 non-null  object 
 3   symptom_text     543323 non-null  object 
 4   died             543323 non-null  float64
 5   l_threat         543323 non-null  float64
 6   hospital         543323 non-null  float64
 7   hospdays         543323 non-null  float64
 8   x_stay           543323 non-null  float64
 9   disable          543323 non-null  float64
 10  recovd           543323 non-null  object 
 11  numdays          543323 non-null  float64
 12  v_adminby        543323 non-null  object 
 13  other_meds       543323 non-null  object 
 14  history          543323 non-null  object 
 15  form_vers        543323 non-null  float64
 16  birth_defect     543323 non-null  flo

In [68]:
df.to_csv('../data/clean_df.csv', index=False)