# Adverse Event Detection in Pharmacovigilance (v0.1)
## Objective
Build a basic system that detects adverse events (AEs) in free-text input (e.g., patient reports) using keyword matching and a SQLite database for drug-side effect lookup.
This is v0.1 of a pharmacovigilance NLP project, focusing on data preparation and basic NLP.
## Features
- Load and clean text data (synthetic AE reports).
- Store drug-side effect pairs in SQLite.
- Tokenize text and match AE keywords with simple negation handling.
- Output: 'AE detected: <symptom>' or 'No AE detected'.

In [9]:
import pandas as pd
import numpy as np

import sqlite3  # database access
import nltk  # Natural Language Toolkit; used here only to tokenize the text

nltk.download('punkt')  # Punkt tokenizer models for word and sentence splitting
nltk.download('punkt_tab')  # Punkt data in the format used by newer NLTK releases


from nltk.tokenize import word_tokenize #importing word tokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vinay\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\vinay\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Stage 1: Data Collection and Preparation
Loading synthetic data similar to AE reports, cleaning the text, and creating a SQLite DB for drug-side effect pairs.

In [40]:
# Small synthetic dataset for v0.1; to be replaced with a real dataset in the next version
data = pd.DataFrame({
    'report_id': [1, 2, 3, 4, 5, 6],
    'text': [
        'I took Aspirin and felt severe dizzy',
        'After DrugX, no nausea reported',
        'Ibuprofen caused severe headache',
        'Took DrugY, feeling great',
        'I inhaled DrugV, feeling mild dizzy',
        'I applied LotionX, I felt severe fatigue',
    ],
    'severity': [
        'severe',
        'none',
        'severe',
        'none',
        'mild',
        'severe',
    ],
})


In [41]:

# Clean the text: lowercase it, then drop every character that is not a letter or whitespace (punctuation, digits)
data['text_clean'] = data['text'].str.lower().str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Inspect the cleaned data
print('Cleaned Data:')
print(data[['report_id', 'text_clean']])




Cleaned Data:
   report_id                               text_clean
0          1     i took aspirin and felt severe dizzy
1          2           after drugx no nausea reported
2          3         ibuprofen caused severe headache
3          4                 took drugy feeling great
4          5       i inhaled drugv feeling mild dizzy
5          6  i applied lotionx i felt severe fatigue
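
The same cleaning rule can be tried on a single string without pandas. This is a minimal sketch using only the standard library; `clean_text` is a hypothetical helper name, not part of the notebook.

```python
import re

# Hypothetical helper (not in the notebook): same rule as the pandas cleaning step.
_NON_LETTER = re.compile(r'[^a-z\s]')

def clean_text(text: str) -> str:
    """Lowercase the text, then drop every character that is not a letter or whitespace."""
    return _NON_LETTER.sub('', text.lower())

print(clean_text('After DrugX, no nausea reported'))
print(clean_text('Took 200mg Aspirin!'))
```

Note that digits are removed along with punctuation, so dose information such as `200mg` is reduced to `mg`. That is acceptable for keyword matching but may need revisiting once real reports are used.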


In [42]:
# Creating SQLite DB for drug-side effect pairs
conn = sqlite3.connect('ae_database.db')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS drug_side_effects (
        drug TEXT,
        side_effect TEXT
    )
''')


<sqlite3.Cursor at 0x296ef5260c0>

In [43]:
# Insert sample drug-side effect pairs
side_effects = [
    ('Aspirin', 'dizzy'),
    ('Aspirin', 'nausea'),
    ('Ibuprofen', 'headache'),
    ('DrugX', 'nausea'),
    ('DrugV', 'dizzy'),
    ('LotionX', 'fatigue'),
]
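# Note: the table has no PRIMARY KEY or UNIQUE constraint, so OR REPLACE never finds a conflict
# and every re-run of this cell appends the same rows again.
cursor.executemany('INSERT OR REPLACE INTO drug_side_effects VALUES (?, ?)', side_effects)
conn.commit()  # commit the changes to the DB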


In [44]:

# Check the rows stored for Aspirin
cursor.execute("SELECT * FROM drug_side_effects WHERE drug = 'Aspirin'")
print('\nDrug-Side Effect Table:')
print(cursor.fetchall())


Drug-Side Effect Table:
[('Aspirin', 'dizzy'), ('Aspirin', 'nausea'), ('Aspirin', 'dizzy'), ('Aspirin', 'nausea'), ('Aspirin', 'dizzy'), ('Aspirin', 'nausea'), ('Aspirin', 'dizzy'), ('Aspirin', 'nausea')]
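
The repeated rows in the output above appear because the table has no uniqueness constraint, so re-running the insert cell adds the same pairs again. One possible fix, sketched here against an in-memory database, is a composite primary key combined with `INSERT OR IGNORE`. The `NOT NULL` columns and the key itself are assumptions, not part of the notebook's schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory DB so the sketch leaves no file behind
# Assumption: each (drug, side_effect) pair should be stored at most once.
conn.execute('''
    CREATE TABLE drug_side_effects (
        drug TEXT NOT NULL,
        side_effect TEXT NOT NULL,
        PRIMARY KEY (drug, side_effect)
    )
''')

pairs = [('Aspirin', 'dizzy'), ('Aspirin', 'nausea')]
for _ in range(3):  # simulate re-running the insert cell three times
    conn.executemany('INSERT OR IGNORE INTO drug_side_effects VALUES (?, ?)', pairs)
conn.commit()

rows = conn.execute(
    "SELECT drug, side_effect FROM drug_side_effects "
    "WHERE drug = 'Aspirin' ORDER BY side_effect"
).fetchall()
print(rows)
conn.close()
```

Because `CREATE TABLE IF NOT EXISTS` keeps an existing table unchanged, an `ae_database.db` file created with the old schema would have to be deleted (or the table dropped) before a new constraint takes effect.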


## Stage 2: Basic NLP Pipeline
Tokenize text, match AE keywords, and handle simple negation.

In [45]:
# Adverse event keywords for this version; more will be added in later versions
ae_keywords = {'dizzy', 'nausea', 'headache', 'fatigue', 'rash', 'allergy'}


# Tokenize the text and check whether any adverse events are mentioned
def detect_ae(text, keywords, conn):
    # Tokenize
    tokens = word_tokenize(text.lower())

    # Simple negation handling: a keyword is skipped when the word directly before it is a negation word
    negation_words = {'no', 'not', 'never'}
    detected_aEs = []  # keywords detected as adverse events
    cursor = conn.cursor()
    for i, token in enumerate(tokens):
        if token in keywords:
            if i > 0 and tokens[i-1] in negation_words:  # previous word negates the keyword, so skip it
                continue
            # Keep the keyword only if the table lists it as a side effect of at least one drug
            cursor.execute('SELECT drug FROM drug_side_effects WHERE side_effect = ?', (token,))
            drugs = cursor.fetchall()
            if drugs:
                detected_aEs.append(token)
    return detected_aEs  # list of detected adverse event keywords
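
The negation check above looks only at the word directly before a keyword, so a phrase like "did not feel any nausea" would still count as an AE. A possible extension is a small look-back window. The sketch below is an illustration only: `is_negated`, `detect_keywords`, and the default window of 3 are hypothetical, and it uses `str.split()` instead of NLTK to stay self-contained.

```python
NEGATION_WORDS = {'no', 'not', 'never'}

def is_negated(tokens, index, window=3):
    """Return True if a negation word occurs within `window` tokens before tokens[index]."""
    start = max(0, index - window)
    return any(tok in NEGATION_WORDS for tok in tokens[start:index])

def detect_keywords(text, keywords, window=3):
    """Return keywords found in the text that are not preceded by a nearby negation word."""
    tokens = text.lower().split()
    return [
        tok for i, tok in enumerate(tokens)
        if tok in keywords and not is_negated(tokens, i, window)
    ]

keywords = {'dizzy', 'nausea', 'headache', 'fatigue'}
print(detect_keywords('did not feel any nausea', keywords))            # window of 3
print(detect_keywords('did not feel any nausea', keywords, window=1))  # previous word only
print(detect_keywords('ibuprofen caused severe headache', keywords))
```

A fixed window trades one error for another: "no fever but headache" would now be treated as negated, so the window size is something to tune against real reports.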



In [46]:
# Process reports
data['detected_aEs'] = data['text_clean'].apply(lambda x: detect_ae(x, ae_keywords, conn))  # run detection on every row
data['result'] = data['detected_aEs'].apply(lambda x: f'AE detected: {x}' if x else 'No AE detected')  # a non-empty list means at least one AE keyword was found

print('\nResults:')
print(data[['report_id', 'text', 'result']])

# Close database
conn.close()


Results:
   report_id                                      text  \
0          1      I took Aspirin and felt severe dizzy   
1          2           After DrugX, no nausea reported   
2          3          Ibuprofen caused severe headache   
3          4                 Took DrugY, feeling great   
4          5       I inhaled DrugV, feeling mild dizzy   
5          6  I applied LotionX, I felt severe fatigue   

                      result  
0     AE detected: ['dizzy']  
1             No AE detected  
2  AE detected: ['headache']  
3             No AE detected  
4     AE detected: ['dizzy']  
5   AE detected: ['fatigue']  
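
The lookup in `detect_ae` only confirms that a keyword is listed as a side effect of some drug; it does not check that the drug named in the report is the one paired with it. A pair-level check could look like the sketch below. `known_pairs` is a hypothetical helper, the table is rebuilt in memory with a few of the sample rows, and negation is ignored for brevity.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE drug_side_effects (drug TEXT, side_effect TEXT)')
conn.executemany(
    'INSERT INTO drug_side_effects VALUES (?, ?)',
    [('Aspirin', 'dizzy'), ('Aspirin', 'nausea'), ('Ibuprofen', 'headache')],
)

def known_pairs(text, conn):
    """Return (drug, side_effect) rows whose drug and side effect both occur in the text."""
    tokens = set(text.lower().split())
    rows = conn.execute(
        'SELECT drug, side_effect FROM drug_side_effects ORDER BY drug, side_effect'
    ).fetchall()
    # Drug names are stored capitalized, while the cleaned text is lowercase.
    return [(drug, effect) for drug, effect in rows
            if drug.lower() in tokens and effect in tokens]

print(known_pairs('i took aspirin and felt severe dizzy', conn))
print(known_pairs('ibuprofen made me dizzy', conn))
```

The second call returns an empty list because the sample table does not pair Ibuprofen with dizziness. Whether such unlisted pairs should be dropped or flagged as potentially new signals is a design decision left open here.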


## Stage 3: Prepare for Severity Classification
Add data for logistic regression to classify AE severity (mild/severe). To be continued in the next version.
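
As a rough idea of what this preparation could involve, the sketch below turns the six cleaned reports into a label vector and bag-of-words count features in plain Python. The label encoding (`mild` = 0, `severe` = 1), the decision to drop `none` rows, and the `vectorize` helper are all assumptions, not the notebook's final design.

```python
# Cleaned report text and severity labels, copied from the synthetic dataset above.
reports = [
    ('i took aspirin and felt severe dizzy', 'severe'),
    ('after drugx no nausea reported', 'none'),
    ('ibuprofen caused severe headache', 'severe'),
    ('took drugy feeling great', 'none'),
    ('i inhaled drugv feeling mild dizzy', 'mild'),
    ('i applied lotionx i felt severe fatigue', 'severe'),
]

LABELS = {'mild': 0, 'severe': 1}  # assumed binary encoding

# Keep only reports that carry a mild/severe label.
labelled = [(text, LABELS[sev]) for text, sev in reports if sev in LABELS]

# Vocabulary: every distinct word in the labelled reports, in sorted order.
vocab = sorted({tok for text, _ in labelled for tok in text.split()})

def vectorize(text):
    """Return one word count per vocabulary entry for the given text."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

X = [vectorize(text) for text, _ in labelled]
y = [label for _, label in labelled]

print(len(X), len(vocab))
print(y)
```

Four labelled rows are far too few to train a meaningful classifier, and here the words "mild" and "severe" appear literally in the text, so this only illustrates the shape of the inputs a logistic regression would need.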