I'm still learning the ins and outs of ML and of creating and sharing notebooks. Any constructive feedback would be greatly appreciated.

# Importing the data

In [None]:
import numpy as np
import pandas as pd

df_2021VAERSVAX = pd.read_csv('../input/covid19-vaccine-adverse-reactions/2021VAERSVAX.csv')
#This table provides the remaining vaccine information for each of the vaccines listed; there can be multiple rows per VAERS ID

df_2021VAERSSYMPTOMS = pd.read_csv('../input/covid19-vaccine-adverse-reactions/2021VAERSSYMPTOMS.csv')
#Provides the adverse event coded terms utilizing the MedDRA dictionary
#Each row in the .csv will contain up to 5 MedDRA terms per VAERS ID; thus, there could be multiple rows per VAERS ID

df_2021VAERSDATA = pd.read_csv('../input/covid19-vaccine-adverse-reactions/2021VAERSDATA.csv', encoding='latin-1')
#This is the primary fact table

In [None]:
df_2021VAERSVAX.head()

In [None]:
df_2021VAERSSYMPTOMS.head()

In [None]:
df_2021VAERSDATA.head()

In [None]:
df_flattable = df_2021VAERSDATA.merge(df_2021VAERSVAX, on="VAERS_ID").merge(df_2021VAERSSYMPTOMS, on="VAERS_ID")
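
Because the vaccine and symptom tables can hold several rows per VAERS_ID, the merge above can repeat a report several times. A minimal sketch on toy frames (all IDs and values below are invented for illustration, not taken from the dataset) of how that duplication could be measured:

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented for illustration.
data = pd.DataFrame({"VAERS_ID": [1, 2], "AGE_YRS": [34.0, 71.0]})
vax = pd.DataFrame({"VAERS_ID": [1, 2, 2], "VAX_TYPE": ["COVID19", "COVID19", "FLU3"]})
symptoms = pd.DataFrame({"VAERS_ID": [1, 1, 2], "SYMPTOM1": ["Headache", "Fatigue", "Chills"]})

# Same chained inner merge as in the notebook cell above.
flat = data.merge(vax, on="VAERS_ID").merge(symptoms, on="VAERS_ID")

# Rows per report after the merge: a value above 1 means the report is repeated.
rows_per_report = flat.groupby("VAERS_ID").size()
print(rows_per_report)
```

Any count or histogram computed on the flat table counts rows, not reports, so repeated reports would be counted more than once.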

# EDA
## Pandas Profiling

In [None]:
import pandas_profiling
pandas_profile = pandas_profiling.ProfileReport(df_flattable, progress_bar=False, correlations={"cramers": {"calculate": False}})
pandas_profile.to_widgets()

The pandas_profiling warnings and the data description point to several data-cleaning steps.

2021VAERSSYMPTOMS
- The SYMPTOM1 to SYMPTOM5 columns could also be treated as a single array of symptoms rather than separate columns
- The SYMPTOMVERSION1 to SYMPTOMVERSION5 columns, which specify the MedDRA dictionary version, are constant for all entries in the current dataset because of the limited time scope of the data, so they can be dropped
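
As a rough illustration of the "array of symptoms" idea, here is a small sketch on an invented toy frame. It uses `melt`, which is one possible way to do this and not necessarily the approach used later in this notebook:

```python
import pandas as pd

# Toy stand-in for 2021VAERSSYMPTOMS; values are invented for illustration.
symptoms = pd.DataFrame({
    "VAERS_ID": [1, 1, 2],
    "SYMPTOM1": ["Headache", "Fatigue", "Chills"],
    "SYMPTOM2": ["Pyrexia", None, "Nausea"],
})

symptom_lists = (
    symptoms
    .melt(id_vars="VAERS_ID", value_name="SYMPTOMS")  # one row per (report, symptom slot)
    .dropna(subset=["SYMPTOMS"])                       # drop empty symptom slots
    .groupby("VAERS_ID")["SYMPTOMS"]
    .agg(list)                                         # collect symptoms into one list per report
)
print(symptom_lists)
```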

2021VAERSDATA
- CAGE_YR and CAGE_MO are AGE_YRS split into years and months. In addition, CAGE_MO is missing in 99.1% of cases.
- Died is either Yes or NaN. For the purpose of this analysis we will assume NaN means "No(t yet)"
- RPT_DATE is missing in 99.9% of cases so we drop this column due to low added value
- ER_VISIT is NaN for all entries
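
Findings like these can also be checked without the profiling report. The sketch below uses an invented toy frame that mixes columns from the tables above. The helper variables `missing_share`, `all_missing` and `constant` are names made up for this example:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset, values are invented for illustration.
df = pd.DataFrame({
    "VAERS_ID": [1, 2, 3, 4],
    "DIED": ["Y", np.nan, np.nan, np.nan],
    "ER_VISIT": [np.nan, np.nan, np.nan, np.nan],
    "SYMPTOMVERSION1": [23.1, 23.1, 23.1, 23.1],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()
all_missing = missing_share[missing_share == 1.0].index.tolist()

# Columns with no missing values and exactly one distinct value.
constant = [c for c in df.columns if df[c].notna().all() and df[c].nunique() == 1]

# Assumption from the text above: NaN in DIED is read as "No(t yet)".
died_filled = df["DIED"].fillna("N")

print(missing_share, all_missing, constant, sep="\n")
```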

2021VAERSVAX
- This analysis focuses on COVID19 vaccines. There are currently only 9 reports (distinct VAERS_ID) that also include another vaccine. This sample is too small to analyze any effects of heterologous vaccine administration, so we will treat these reports as having received only the COVID19 vaccine
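
A minimal sketch of how such a count of mixed reports could be obtained, shown on an invented toy frame (the IDs and the non-COVID vaccine type are placeholders):

```python
import pandas as pd

# Toy stand-in for 2021VAERSVAX; values are invented for illustration.
vax = pd.DataFrame({
    "VAERS_ID": [1, 2, 2, 3],
    "VAX_TYPE": ["COVID19", "COVID19", "FLU3", "COVID19"],
})

# IDs that have a COVID19 row, and IDs that have a row for any other vaccine.
has_covid = vax.loc[vax["VAX_TYPE"] == "COVID19", "VAERS_ID"]
other = vax.loc[vax["VAX_TYPE"] != "COVID19", "VAERS_ID"]

# Distinct reports listing both a COVID19 vaccine and another vaccine.
n_mixed_reports = other[other.isin(has_covid)].nunique()

# Keeping only COVID19 rows treats mixed reports as COVID19-only.
covid_only_rows = vax[vax["VAX_TYPE"] == "COVID19"]
print(n_mixed_reports, len(covid_only_rows))
```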


Additional cleaning not yet done:
- Conversion of dates
- Additional column cleaning
- Deeper analysis of pandas_profiling output
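
For the date conversion, something along these lines might work. This is only a sketch on toy data: it assumes the date columns are stored as MM/DD/YYYY strings, and `DAYS_TO_ONSET` is a hypothetical derived column, not part of the dataset:

```python
import pandas as pd

# Toy frame; the MM/DD/YYYY string format is an assumption about the raw files.
df = pd.DataFrame({
    "VAERS_ID": [1, 2],
    "VAX_DATE": ["01/04/2021", "12/28/2020"],
    "ONSET_DATE": ["01/06/2021", None],
})

date_cols = ["VAX_DATE", "ONSET_DATE"]
for col in date_cols:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

# Hypothetical derived column: days between vaccination and symptom onset.
df["DAYS_TO_ONSET"] = (df["ONSET_DATE"] - df["VAX_DATE"]).dt.days
print(df)
```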

In [None]:
df_2021VAERSSYMPTOMS_arr = (
    df_2021VAERSSYMPTOMS
    .drop(columns=[f"SYMPTOMVERSION{i}" for i in range(1, 6)])  # constant MedDRA version columns
    .set_index("VAERS_ID")
    .unstack()                          # long format: one row per (symptom column, VAERS_ID)
    .dropna()                           # drop empty symptom slots
    .reset_index(name='SYMPTOMS')[["VAERS_ID", "SYMPTOMS"]]
    .groupby("VAERS_ID")
    .agg(list)                          # one list of symptoms per VAERS_ID
)

df_2021VAERSDATA_clean = df_2021VAERSDATA.drop(columns=["CAGE_YR", "CAGE_MO", "ER_VISIT", "RPT_DATE", "NUMDAYS", "V_FUNDBY", "FORM_VERS"]).fillna(value={'DIED': 'N', "RECOVD": 'U'}).astype({'SEX': 'category', "RECOVD": 'category', "V_ADMINBY": 'category'})

df_2021VAERSVAX_clean = df_2021VAERSVAX[df_2021VAERSVAX["VAX_TYPE"]=="COVID19"].drop(columns=["VAX_DOSE_SERIES"])

# Merge the cleaned primary table (not the raw one) so the dropped columns and filled values carry over
df_cleaned = (
    df_2021VAERSDATA_clean
    .merge(df_2021VAERSVAX_clean, how="inner", on="VAERS_ID")
    .merge(df_2021VAERSSYMPTOMS_arr, how="left", on="VAERS_ID")
    .set_index("VAERS_ID")
)



In [None]:
df_cleaned.head()

## Specific Column Analysis

The receive date shows a weekly pattern with fewer reports on weekends, which is expected since fewer facilities are open on weekends or are open only for emergencies.
Reporting is slightly higher at the beginning of February, but there is too little data so far to call this a monthly pattern.

In [None]:
import plotly.express as px
df_2021VAERSDATA['RECVDATE_dt'] = pd.to_datetime(df_2021VAERSDATA['RECVDATE']) #Convert to Pandas DateTime
df_2021VAERSDATA['RECVDATE_dt_dayname'] = df_2021VAERSDATA['RECVDATE_dt'].dt.day_name()
px.violin(df_2021VAERSDATA, y='RECVDATE_dt', points="all", hover_data=["VAERS_ID", 'RECVDATE_dt_dayname'])
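
The weekly pattern can also be checked numerically by counting reports per weekday. A small sketch on invented toy dates (the MM/DD/YYYY format is again an assumption, and `weekday_order` is a helper defined only for this example):

```python
import pandas as pd

# Toy receive dates; values are invented for illustration.
recv = pd.DataFrame({"RECVDATE": ["01/04/2021", "01/05/2021", "01/09/2021", "01/11/2021"]})
recv["RECVDATE_dt"] = pd.to_datetime(recv["RECVDATE"], format="%m/%d/%Y")

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count reports per weekday and keep calendar order, with 0 for days without reports.
reports_per_weekday = (
    recv["RECVDATE_dt"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
print(reports_per_weekday)
```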

Note that the count below is not population-adjusted. Although California currently has the most AEs, this could simply be because a larger share of its population is already vaccinated, or because the state has more people in general.
The distribution of manufacturers does seem to be similar across states at this moment.

In [None]:
fig = px.histogram(df_cleaned, x="STATE", color="VAX_MANU")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
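
If per-state vaccination figures were available, the raw counts could be turned into rates. The sketch below only illustrates the arithmetic: the `doses` series is a hypothetical input with placeholder numbers and is not part of this dataset:

```python
import pandas as pd

# Toy reports; one row per report, values invented for illustration.
reports = pd.DataFrame({"STATE": ["CA", "CA", "CA", "TX", "TX", "VT"]})

# Hypothetical doses administered per state; placeholder numbers, not real figures.
doses = pd.Series({"CA": 3000, "TX": 1000, "VT": 500}, name="DOSES")

# Reports per state, aligned with the doses on the state code.
counts = reports["STATE"].value_counts().rename("REPORTS")
rates = pd.concat([counts, doses], axis=1)

# Reports per 1000 administered doses.
rates["REPORTS_PER_1000_DOSES"] = rates["REPORTS"] / rates["DOSES"] * 1000
print(rates)
```

In this toy example the state with the most reports ends up with the lowest rate, which is the effect the note above warns about.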

As the STATE analysis already showed, the majority of AEs currently come from Pfizer, which matches our current understanding of how vaccinations are distributed across manufacturers. Population-adjusted, we therefore expect the difference to be smaller.
There are also a limited number of AEs with an unknown manufacturer, but it is unclear to me where these come from.

In [None]:
px.pie(df_cleaned, names="VAX_MANU")

# Summary
This is currently only an early draft of an EDA and much further work is required, but I hope it gives some of you inspiration and that you can re-use parts of the data cleaning.

In particular, per-state and per-date vaccination figures would be a great addition to the dataset, as they would enable population-adjusted analysis.

Feedback is much appreciated.