# EDA of (CFSAN) Adverse Event Reporting System

### Content

- [What is CFSAN](http://)?
- Imports
- Missing and duplicate data
- Product Role
- Gender
- Date
- Outcome
- Industry
- Brand
- Symptoms
- Outcomes by ...
- Ovarian Cancer

In [43]:
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [44]:
df = pd.read_csv('../input/CAERS_ASCII_2004_2017Q2.csv')

In [45]:
df.shape

In [46]:
df.info()

In [47]:
df.head()

In [48]:
plt.figure(figsize=(12,9))
sns.heatmap(df.isnull(),
            cmap='plasma',
            yticklabels=False,
            cbar=False)
plt.title('Missing Data\n',fontsize=20)
plt.xticks(fontsize=15)
plt.show()

Just by looking at the head of the dataframe, I can tell that there is a lot of duplicate data. Let's see what we can do to remove some of it. The 'RA_Report #' column looks like an ID column. Let's look for duplicates in that column and go from there.

In [49]:
print('Duplicate Data?')
df.duplicated('RA_Report #').value_counts()

In [50]:
df.drop_duplicates(['RA_Report #'],keep='last',inplace=True)

In [51]:
len(df)

In [52]:
print('Duplicate Data?')
df.duplicated('RA_Report #').value_counts()
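To make the effect of `keep='last'` concrete, here is a minimal sketch on a toy dataframe. The values are invented for illustration; only the column names come from the dataset. If a single report spans several rows (for example one row per product), this approach keeps only the last of them, so it is worth checking what is being discarded.

```python
import pandas as pd

# Toy stand-in for the CAERS table; the values are made up for illustration.
toy = pd.DataFrame({
    'RA_Report #': [101, 101, 102, 103, 103, 103],
    'PRI_Reported Brand/Product Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

# True marks every repeat of a report ID after its first occurrence.
dup_mask = toy.duplicated('RA_Report #')
print(dup_mask.value_counts())

# keep='last' retains only the final row seen for each report ID.
deduped = toy.drop_duplicates(['RA_Report #'], keep='last')
print(deduped)
```

In this toy case the rows for products 'A', 'D' and 'E' are dropped even though they differ from the rows that survive.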

Dropping the two columns with the most NaN values.

In [53]:
df.drop(['AEC_Event Start Date','CI_Age at Adverse Event'],axis=1,inplace=True)
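The heatmap shows which columns are mostly empty, but the same information can be read off as numbers. A small sketch, using a toy dataframe with invented values, of ranking columns by their share of missing entries before dropping the worst ones:

```python
import numpy as np
import pandas as pd

# Toy data with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'RA_Report #': [1, 2, 3, 4],
    'AEC_Event Start Date': [np.nan, np.nan, '1/1/2010', np.nan],
    'CI_Age at Adverse Event': [np.nan, 34.0, np.nan, 50.0],
    'CI_Gender': ['Female', 'Male', 'Female', 'Female'],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Drop the two columns with the largest share of missing values.
worst_two = missing_share.index[:2].tolist()
trimmed = toy.drop(columns=worst_two)
print(trimmed.columns.tolist())
```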

In [60]:
plt.figure(figsize=(12,9))
sns.heatmap(df.isnull(),
            cmap='plasma',
            yticklabels=False,
            cbar=False)
plt.title('Missing Data\n',fontsize=20)
plt.xticks(fontsize=15)
plt.show()

In [56]:
print('Rows in the Dataframe?')
len(df)

In [14]:
df.head()

In [57]:
df.columns

## Product Role

In [61]:
plt.figure(figsize=(12,9))
df['PRI_Product Role'].value_counts().plot.bar()
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('Suspect or Concomitant?\n',fontsize=20)
plt.show()
print('Suspect or Concomitant?\n')
print(df['PRI_Product Role'].value_counts())

'Suspect' products are those suspected of being related to the adverse event (*as reported*). 'Concomitant' products are believed to have been taken but not related to the adverse event (*as reported*). Notice how many of the instances are suspect.

## Gender

In [66]:
plt.figure(figsize=(10,10))
df['CI_Gender'].value_counts()[:3].plot(kind='pie')
plt.title('Adverse Events Reported by Gender\n',fontsize=20)
plt.show()
print('Reported Events by Gender\n')
print(df['CI_Gender'].value_counts())

**Females account for around twice as many reports to the FDA as males**. I wonder if those reports are injuries, serious injuries or deaths?

# Date

In [67]:
type(df['RA_CAERS Created Date'].iloc[0])

The 'RA_CAERS Created Date' is a string type. Let's convert it to a pandas.Timestamp type.

In [69]:
df['RA_CAERS Created Date'] = pd.to_datetime(df['RA_CAERS Created Date'])

In [70]:
type(df['RA_CAERS Created Date'].iloc[0])

Create a month and year column from the Created Date.

In [71]:
# The .dt accessor extracts date parts without a row-by-row lambda.
df['Created Year'] = df['RA_CAERS Created Date'].dt.year
df['Created Month'] = df['RA_CAERS Created Date'].dt.month

In [74]:
plt.figure(figsize=(12,9))
df.groupby('Created Month').count()['RA_CAERS Created Date'].plot(kind='bar')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('Reports by Month\n',fontsize=20)
plt.show()
print(df.groupby('Created Month').count()['RA_CAERS Created Date'])

Reporting seems to slow down slightly during November and December. Other than that, adverse event reporting (by the month) looks random.

In [76]:
plt.figure(figsize=(12,9))
df.groupby('Created Year').count()['RA_CAERS Created Date'].plot(kind='bar')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('Reports by Year\n',fontsize=20)
plt.show()
print(df.groupby('Created Year').count()['RA_CAERS Created Date'])

Without a doubt, adverse event reporting has increased throughout the years. Keep in mind that the file only runs through 2017 Q2, so the 2017 bar covers a partial year. **Are products getting more dangerous, or are people becoming more aware and reporting** the effects of consumer products?

# Outcome

In [77]:
plt.figure(figsize=(12,12))
df['AEC_One Row Outcomes'].value_counts()[:20].sort_values(ascending=True).plot(kind='barh')
plt.title('20 Most Common Adverse Event Outcome\n',fontsize=20)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.show()
print('\nNumber of different Outcomes: ',len(df['AEC_One Row Outcomes'].value_counts()))

There seems to be a lot of overlap in the outcomes column. Instead of finding all of the unique values, let's count the times certain terms come up.

In [78]:
visit_count = 0
death_count = 0
non_serious = 0
serious = 0
dis_count = 0

for i in df['AEC_One Row Outcomes']:
    if 'HOSPITALIZATION' in i or 'VISITED A HEALTH CARE PROVIDER' in i or 'VISITED AN ER' in i:
        visit_count += 1
    if 'DEATH' in i:
        death_count += 1
    if 'NON-SERIOUS INJURIES/ ILLNESS' in i:
        non_serious += 1
    if  i == 'SERIOUS INJURIES/ ILLNESS' or ' SERIOUS INJURIES/ ILLNESS' in i or i[:25] == 'SERIOUS INJURIES/ ILLNESS':    
        serious += 1
    if 'DISABILITY' in i:
        dis_count += 1
        
print('PERCENTAGES OF OUTCOMES\n')        
# Multiply by 100 before rounding so the printed value has no floating-point noise.
print('VISITED AN ER, VISITED A HEALTH CARE PROVIDER, HOSPITALIZATION percentage: {}%'.format(round(visit_count/len(df)*100,1)))
print('NON-SERIOUS INJURIES/ ILLNESS percentage: {}%'.format(round(non_serious/len(df)*100,1)))
print('SERIOUS INJURIES/ ILLNESS percentage: {}%'.format(round(serious/len(df)*100,1)))
print('DEATH percentage: {}%'.format(round(death_count/len(df)*100,1)))
print('DISABILITY percentage: {}%'.format(round(dis_count/len(df)*100,1)))

Almost 40% of all reported events involved seeking medical treatment or examination. 3% reported 'SERIOUS INJURIES/ ILLNESS' and 2.7% reported 'DEATH'.

In [86]:
d = {'VISITED AN ER, VISITED A HEALTH CARE PROVIDER, HOSPITALIZATION':visit_count,
     'NON-SERIOUS INJURIES/ ILLNESS':non_serious,
     'SERIOUS INJURIES/ ILLNESS':serious,
     'DEATH':death_count,
     'DISABILITY':dis_count}

outcomesDF = pd.Series(data=d)

In [87]:
plt.figure(figsize=(10,8))
outcomesDF.sort_values().plot(kind='barh')
plt.title('Outcome Counts',fontsize=20)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.show()
print(outcomesDF.sort_values(ascending=False))

## Note: 

Many of these values overlap with each other. An instance that reads 'SERIOUS INJURY' can also say 'HOSPITALIZATION', 'DEATH' and 'DISABILITY'. An instance such as this would count four times, once under each term. See below.

In [88]:
df[df['AEC_One Row Outcomes']=="DISABILITY, LIFE THREATENING, HOSPITALIZATION, DEATH"]
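One way to make the overlap explicit is to split each outcome string into its individual terms and count those. The sketch below assumes the terms are separated by commas, as in the example row above; the four strings are a hand-made sample, not real counts. Matching on whole terms also sidesteps the problem that 'SERIOUS INJURIES/ ILLNESS' is a substring of 'NON-SERIOUS INJURIES/ ILLNESS'.

```python
import pandas as pd

# Hand-made sample of outcome strings in the same style as the dataset.
outcomes = pd.Series([
    'DISABILITY, LIFE THREATENING, HOSPITALIZATION, DEATH',
    'NON-SERIOUS INJURIES/ ILLNESS',
    'SERIOUS INJURIES/ ILLNESS, HOSPITALIZATION',
    'DEATH',
])

# Split on the comma separator, then give every term its own row.
terms = outcomes.str.split(',').explode().str.strip()
term_counts = terms.value_counts()
print(term_counts)

# How many terms each report carries; values above 1 are the overlapping rows.
terms_per_report = outcomes.str.split(',').str.len()
print(terms_per_report)
```

The term counts add up to more than the number of reports, which is exactly the overlap described in the note.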

# Industry

In [89]:
plt.figure(figsize=(12,15))
df['PRI_FDA Industry Name'].value_counts()[:40].sort_values(ascending=True).plot(kind='barh')
plt.title('Reports by Industry\n',fontsize=20)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.show()
print(df['PRI_FDA Industry Name'].value_counts()[:40])

# Brand

In [90]:
plt.figure(figsize=(12,12))
df['PRI_Reported Brand/Product Name'].value_counts()[1:21].sort_values(ascending=True).plot(kind='barh')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('Most Reported Brands\n',fontsize=20)
plt.show()
print(df['PRI_Reported Brand/Product Name'].value_counts()[:21])

## REDACTED?

Notice that the brand with by far the most reports is 'REDACTED'. I have no idea what that means, but for the sake of our visualization, I removed it. **'REDACTED' has 10 times as many reports as the next highest brand.**

In [92]:
plt.figure(figsize=(12,12))
df[df['PRI_Reported Brand/Product Name']=='REDACTED']['PRI_FDA Industry Name'].value_counts()[:21].sort_values(ascending=True).plot(kind='barh')
plt.title('"REDACTED" by Industry\n',fontsize=20)
plt.show()
print(df[df['PRI_Reported Brand/Product Name']=='REDACTED']['PRI_FDA Industry Name'].value_counts()[:21])

In [93]:
# 5455 is hard-coded; the precision argument now goes to round(), not to format().
print('{}%'.format(round(5455 / len(df[df['PRI_Reported Brand/Product Name']=='REDACTED'])*100, 3)) + ' of the "REDACTED" instances belong to the cosmetics industry.')
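The percentage can also be computed without typing a count in by hand. A minimal sketch on toy data, where the rows and the industry labels 'Cosmetics' and 'Other' are placeholders rather than real values, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy rows; brand and industry values are placeholders for illustration.
toy = pd.DataFrame({
    'PRI_Reported Brand/Product Name': ['REDACTED', 'REDACTED', 'REDACTED', 'REDACTED', 'BRAND X'],
    'PRI_FDA Industry Name': ['Cosmetics', 'Cosmetics', 'Cosmetics', 'Other', 'Cosmetics'],
})

redacted = toy[toy['PRI_Reported Brand/Product Name'] == 'REDACTED']

# normalize=True returns each industry's share of the REDACTED rows.
industry_share = redacted['PRI_FDA Industry Name'].value_counts(normalize=True)
cosmetics_pct = round(industry_share['Cosmetics'] * 100, 1)
print('{}% of the "REDACTED" instances belong to the cosmetics industry.'.format(cosmetics_pct))
```

This way the figure stays correct if the data is refreshed or the deduplication step changes.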

In [94]:
redacted_df = df[df['PRI_Reported Brand/Product Name']=='REDACTED']

visit_count = 0
death_count = 0
non_serious = 0
serious = 0
dis_count = 0

for i in redacted_df['AEC_One Row Outcomes']:
    if 'HOSPITALIZATION' in i or 'VISITED A HEALTH CARE PROVIDER' in i or 'VISITED AN ER' in i:
        visit_count += 1
    if 'DEATH' in i:
        death_count += 1
    if 'NON-SERIOUS INJURIES/ ILLNESS' in i:
        non_serious += 1
    if  i == 'SERIOUS INJURIES/ ILLNESS' or ' SERIOUS INJURIES/ ILLNESS' in i or i[:25] == 'SERIOUS INJURIES/ ILLNESS':    
        serious += 1
    if 'DISABILITY' in i:
        dis_count += 1

In [95]:
redacted_dict = {'VISITED AN ER, VISITED A HEALTH CARE PROVIDER, HOSPITALIZATION':visit_count,
     'NON-SERIOUS INJURIES/ ILLNESS':non_serious,
     'SERIOUS INJURIES/ ILLNESS':serious,
     'DEATH':death_count,
     'DISABILITY':dis_count}

# Use the REDACTED counts, not the whole-dataset dict `d` built earlier.
redacted_data = pd.Series(data=redacted_dict)

In [98]:
plt.figure(figsize=(12,9))
redacted_data.sort_values().plot(kind='barh')
plt.title('Count of "REDACTED" Outcomes',fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
print(redacted_data.sort_values(ascending=False))

In [99]:
redacted_df[redacted_df['AEC_One Row Outcomes']=='DEATH']['SYM_One Row Coded Symptoms'].value_counts()[:10]

Seems that the REDACTED instances have a large proportion of DEATH. I wonder what the symptoms are?

# Symptoms

In [100]:
df['SYM_One Row Coded Symptoms'].value_counts()[:25]

**Whoa. Ovarian cancer is a thing**. We will look into this later.

In [101]:
# Drop rows where the symptoms column is missing; the loops that follow cannot handle NaN.
# Calling dropna(inplace=True) on df['col'] alone does not reliably change df itself.

df.dropna(subset=['SYM_One Row Coded Symptoms'], inplace=True)

In [102]:
cancer_count = 0
choking_count = 0
diarrhoea_count = 0
vomit_count = 0
nausea_count = 0
dysgeusia_count = 0
malaise_count = 0
alopecia_count = 0
abpain_count = 0
rash_count = 0
headache_count = 0
laceration_count = 0
convulsion_count = 0
hyper_count = 0

#If you know how to make this loop more pythonic, please let me know. 

for i in df['SYM_One Row Coded Symptoms']:
    if 'CHOKING' in i:
        choking_count += 1
    if 'DIARRHOEA' in i:
        diarrhoea_count += 1
    if 'CANCER' in i:
        cancer_count += 1
    if 'VOMIT' in i:
        vomit_count += 1
    if 'NAUSEA' in i:
        nausea_count += 1
    if 'DYSGEUSIA' in i:
        dysgeusia_count += 1
    if 'MALAISE' in i:
        malaise_count += 1
    if 'ALOPECIA' in i:
        alopecia_count += 1
    if 'ABDOMINAL PAIN' in i:
        abpain_count += 1
    if 'RASH' in i:
        rash_count += 1
    if 'HEADACHE' in i:
        headache_count += 1
    if 'LACERATION' in i:
        laceration_count += 1
    if 'CONVULSION' in i:
        convulsion_count += 1
    if 'HYPERSENSITIVITY' in i:
        hyper_count += 1
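One possible answer to the question in the comment above: pandas can do the substring test for a whole column at once with `Series.str.contains`, so the counters collapse into a single dict comprehension. This is a sketch on a hand-made sample with a shortened term list, not a drop-in replacement checked against the real file.

```python
import pandas as pd

# Hand-made sample of symptom strings, including one missing value.
symptoms = pd.Series([
    'OVARIAN CANCER',
    'NAUSEA, VOMITING, DIARRHOEA',
    'RASH, HEADACHE',
    None,
    'ABDOMINAL PAIN, NAUSEA',
])

# Shortened list of search terms for the example.
search_terms = ['ABDOMINAL PAIN', 'CANCER', 'DIARRHOEA', 'HEADACHE',
                'NAUSEA', 'RASH', 'VOMIT']

# Drop missing entries on a copy, then count rows containing each term.
# regex=False treats every term as a plain substring.
clean = symptoms.dropna()
symptom_counts = pd.Series({
    term: clean.str.contains(term, regex=False).sum()
    for term in search_terms
})
print(symptom_counts.sort_values(ascending=False))
```

Adding a symptom then means adding one string to `search_terms` instead of a new counter and a new `if`.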

In [103]:
symptoms_dict = {
 'ABDOMINAL PAIN': abpain_count,
 'ALOPECIA': alopecia_count,
 'CHOKING': choking_count,
 'CONVULSION': convulsion_count,
 'DIARRHOEA': diarrhoea_count,
 'DYSGEUSIA': dysgeusia_count,
 'HEADACHE': headache_count,
 'HYPERSENSITIVITY': hyper_count,
 'LACERATION': laceration_count,
 'MALAISE': malaise_count,
 'NAUSEA': nausea_count,
 'CANCER': cancer_count,
 'RASH': rash_count,
 'VOMITING': vomit_count
}

for k,v in symptoms_dict.items():
    print(k + ': ',v)

In [104]:
symptom_df = pd.Series(symptoms_dict)

In [106]:
plt.figure(figsize=(12,9))
symptom_df.sort_values(ascending=True).plot(kind='barh')
plt.title('SYMPTOMS COUNT',fontsize=20)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.show()
print(symptom_df.sort_values(ascending=False))

It is very hard for me to imagine cancer as more common as an adverse event than headaches, rashes and choking. How does the REDACTED dataframe's column for symptoms look?

# Outcome by Industry

In [107]:
plt.figure(figsize=(12,9))
df[df['AEC_One Row Outcomes']=='NON-SERIOUS INJURIES/ ILLNESS']['PRI_FDA Industry Name'].value_counts()[:25].sort_values(ascending=True).plot(kind='barh')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('Non-Serious Injuries or Illness by Industry\n',fontsize=20)
plt.show()
print(df[df['AEC_One Row Outcomes']=='NON-SERIOUS INJURIES/ ILLNESS']['PRI_FDA Industry Name'].value_counts()[:25])

For **non serious adverse events**, **Nuts and Seeds, Cosmetics, Vegetables** and **Vit/Min products** produced the most instances.

In [108]:
plt.figure(figsize=(12,9))
df[df['AEC_One Row Outcomes']=='OTHER SERIOUS (IMPORTANT MEDICAL EVENTS)']['PRI_FDA Industry Name'].value_counts()[:25].sort_values(ascending=True).plot(kind='barh')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('OTHER SERIOUS (IMPORTANT MEDICAL EVENTS) by Industry\n',fontsize=20)
plt.show()
print(df[df['AEC_One Row Outcomes']=='OTHER SERIOUS (IMPORTANT MEDICAL EVENTS)']['PRI_FDA Industry Name'].value_counts()[:25])

In [110]:
plt.figure(figsize=(12,9))
df[df['AEC_One Row Outcomes']=='DEATH']['PRI_FDA Industry Name'].value_counts()[:25].sort_values(ascending=True).plot(kind='barh')
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.title('DEATH by Industry\n',fontsize=20)
plt.show()
print(df[df['AEC_One Row Outcomes']=='DEATH']['PRI_FDA Industry Name'].value_counts()[:25])

# Outcome by Gender

In [135]:
injury = df[df['AEC_One Row Outcomes']=='NON-SERIOUS INJURIES/ ILLNESS']

plt.figure(figsize=(20,6))
sns.countplot(data=injury,x='PRI_FDA Industry Name',hue='CI_Gender')
plt.title('NON-SERIOUS INJURIES/ ILLNESS by Gender\n',fontsize=20)
plt.xticks(fontsize=15,rotation=90)
plt.yticks(fontsize=15)
plt.xlabel('Industry',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.legend(loc=7)
plt.show()
print('Reported Male Non Serious Injuries: ' + str(len(df[(df['CI_Gender']=='Male')&(df['AEC_One Row Outcomes']=='NON-SERIOUS INJURIES/ ILLNESS')]['PRI_FDA Industry Name'])))
print('Reported Female Non Serious Injuries: ' + str(len(df[(df['CI_Gender']=='Female')&(df['AEC_One Row Outcomes']=='NON-SERIOUS INJURIES/ ILLNESS')]['PRI_FDA Industry Name'])))

In [136]:
serious = df[df['AEC_One Row Outcomes']=='OTHER SERIOUS (IMPORTANT MEDICAL EVENTS)']

plt.figure(figsize=(20,6))
sns.countplot(data=serious,x='PRI_FDA Industry Name',hue='CI_Gender')
plt.title('OTHER SERIOUS (IMPORTANT MEDICAL EVENTS) by Gender\n',fontsize=20)
plt.xticks(fontsize=15,rotation=90)
plt.yticks(fontsize=15)
plt.xlabel('Industry',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.legend(loc=7)
plt.show()
print('Reported Male Serious Injuries: ' + str(len(df[(df['CI_Gender']=='Male')&(df['AEC_One Row Outcomes']=='OTHER SERIOUS (IMPORTANT MEDICAL EVENTS)')]['PRI_FDA Industry Name'])))
print('Reported Female Serious Injuries: ' + str(len(df[(df['CI_Gender']=='Female')&(df['AEC_One Row Outcomes']=='OTHER SERIOUS (IMPORTANT MEDICAL EVENTS)')]['PRI_FDA Industry Name'])))

In [115]:
death = df[df['AEC_One Row Outcomes']=='DEATH']

plt.figure(figsize=(15,6))
sns.countplot(data=death,x='PRI_FDA Industry Name',hue='CI_Gender')
plt.title('DEATH by Gender\n',fontsize=20)
plt.xticks(fontsize=15,rotation=90)
plt.yticks(fontsize=15)
plt.xlabel('Industry',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.legend(loc=7)
plt.show()
print('Reported Male Deaths: ' + str(len(df[(df['CI_Gender']=='Male')&(df['AEC_One Row Outcomes']=='DEATH')]['PRI_FDA Industry Name'])))
print('Reported Female Deaths: ' + str(len(df[(df['CI_Gender']=='Female')&(df['AEC_One Row Outcomes']=='DEATH')]['PRI_FDA Industry Name'])))

Ouch. **Females outpace males** in **Non Serious Injury, Serious Injury** and **Death** in Adverse Event Reporting by a large margin. Is this because women are more likely to report an adverse event or that women actually suffer from more adverse events?
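Raw counts inherit the fact that women file more reports overall, so part of that gap is expected. A rough way to separate the two effects is to compare the share of each outcome within each gender instead of the counts. The sketch below uses invented toy rows; it cannot tell us whether women suffer more adverse events, only whether the outcome mix differs between genders.

```python
import pandas as pd

# Invented toy rows: twice as many female reports, same outcome mix.
toy = pd.DataFrame({
    'CI_Gender': ['Female'] * 6 + ['Male'] * 3,
    'AEC_One Row Outcomes': [
        'DEATH', 'DEATH',
        'NON-SERIOUS INJURIES/ ILLNESS', 'NON-SERIOUS INJURIES/ ILLNESS',
        'NON-SERIOUS INJURIES/ ILLNESS', 'NON-SERIOUS INJURIES/ ILLNESS',
        'DEATH',
        'NON-SERIOUS INJURIES/ ILLNESS', 'NON-SERIOUS INJURIES/ ILLNESS',
    ],
})

# Raw counts per gender and outcome.
counts = pd.crosstab(toy['CI_Gender'], toy['AEC_One Row Outcomes'])
print(counts)

# normalize='index' turns each row into shares that sum to 1 per gender.
shares = pd.crosstab(toy['CI_Gender'], toy['AEC_One Row Outcomes'],
                     normalize='index')
print(shares.round(3))
```

In the toy data females have twice the death count, yet the share of deaths is identical for both genders.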

# Ovarian Cancer

In [116]:
df['SYM_One Row Coded Symptoms'].value_counts()[:20]

In [123]:
death = df[df['AEC_One Row Outcomes']=='DEATH']
death['SYM_One Row Coded Symptoms'].value_counts()[:20]

In [124]:
ovarian = 0
for i in death['SYM_One Row Coded Symptoms']:
    if 'OVARIAN CANCER' in i:
        ovarian += 1

In [125]:
print('{}%'.format(round(ovarian/len(death)*100, 3)) + ' of the symptoms in the "DEATH" dataframe had the term "OVARIAN CANCER" in it.')

In [126]:
plt.figure(figsize=(12,9))
death[death['SYM_One Row Coded Symptoms']=='OVARIAN CANCER']['Created Year'].value_counts().plot(kind='bar')
plt.title('Ovarian Cancer by Year Reported',fontsize=20)
plt.show()
print(death[death['SYM_One Row Coded Symptoms']=='OVARIAN CANCER']['Created Year'].value_counts())

In [130]:
ovarian_death = death[death['SYM_One Row Coded Symptoms']=='OVARIAN CANCER']

plt.figure(figsize=(12,9))
ovarian_death.groupby('Created Month').count()['RA_Report #'].plot(kind='bar')
plt.title('Ovarian Cancer by Month Reported',fontsize=20)
plt.show()
print(ovarian_death.groupby('Created Month').count()['RA_Report #'])

In [131]:
ovarian_death['PRI_FDA Industry Name'].value_counts()

This is pretty big! ALL of the instances in the ovarian_death dataframe come from the cosmetics industry.

In [132]:
ovarian_death['PRI_Reported Brand/Product Name'].value_counts()

## Adverse Events or Adverse Reporting?

After looking at the ovarian_death dataframe, I noticed some funky behavior and reporting. For example, let's look at the dates '2015-01-28' and '2017-02-27'.

In [133]:
ovarian_death[ovarian_death['RA_CAERS Created Date']=='2015-01-28']

In [134]:
ovarian_death[ovarian_death['RA_CAERS Created Date']=='2017-02-27']

On '2015-01-28', **there were 5 instances of the exact same report** (with the exception of the 'RA_Report #'). On '2017-02-27', **there were 13 instances of nearly the exact same report** (again with the exception of the 'RA_Report #'). Maybe my intuition is a bit off, but this reporting looks suspicious.
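Instead of checking dates one at a time, reports that match on everything except the ID can be flagged in one pass. A sketch on invented toy rows, using `duplicated` with a column subset that leaves out 'RA_Report #':

```python
import pandas as pd

# Invented toy rows: three reports identical except for their ID.
toy = pd.DataFrame({
    'RA_Report #': [1, 2, 3, 4],
    'RA_CAERS Created Date': pd.to_datetime(
        ['2015-01-28', '2015-01-28', '2015-01-28', '2017-02-27']),
    'PRI_Reported Brand/Product Name': ['REDACTED'] * 4,
    'SYM_One Row Coded Symptoms': ['OVARIAN CANCER'] * 4,
})

# Compare every column except the report ID.
compare_cols = [c for c in toy.columns if c != 'RA_Report #']

# keep=False marks every member of a repeated group, not only the later copies.
repeat_mask = toy.duplicated(subset=compare_cols, keep=False)
repeats = toy[repeat_mask]

# Size of each group of otherwise identical reports.
group_sizes = repeats.groupby(compare_cols).size()
print(group_sizes)
```

On the real data some of the compared columns contain NaN, and `groupby` leaves out rows with missing keys by default, so the group sizes may need a second look there.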

Unfortunately, the brand column does not provide much insight. Most of the results come back as REDACTED, whatever that means. Furthermore, **I do not believe we have enough information** to totally blame the cosmetics industry for the instances of ovarian cancer. 

## Feedback is massively appreciated! Thank you!