# Feature Extraction

The purpose of this notebook is to show how and the process of extracting each feature form texts of posts in Reddis and migraine.com

Based on the work done for the [project proposal](./group_11_proposal.ipynb) we identified following features to extract:

- Age
- Gender
- Medicine use and dosage
- Suicidal thoughts
- Migraine triggers
- Presence of aura
- Trouble with sleeping
- ADHD

In the rest of this notebook we show the process of getting each feature out and constructing the code.

In [47]:
import pandas as pd
import re
import unittest
import copy


In [2]:
reddis_data_filename = 'reddis_migraine_posts.csv'

In [3]:
def read_reddis_data(filename):
    df = pd.read_csv(f'data/{filename}', header=0)
    df = df.dropna(subset=['Text'])
    return df

In [4]:
posts_and_commnets = read_reddis_data(reddis_data_filename)

In [5]:
posts_and_commnets[:5]

Unnamed: 0,Type,Parent,Author,Text,Title,Tags,Webpage
0,P,q1pdf8,Conscious_Escape_408,I've been awake the entire night with the wors...,Worst I've ever had/calling in sick,,
1,P,q1p2lt,Sia-King,"Hey y’all, I got a referral for a neurologist ...",What preventative to trial next? (Asthmatic &a...,,
2,P,q1otox,netluv,It’s day two night two of a migraine. I’ve max...,Pain,,
3,P,q1odf7,Dazee80,I am in a fucked position. I have had migraine...,Pain vs Relationship,,
5,P,q1kv1i,doitforthepizza,"Hi everyone, I'm new here (44f). First I'll sa...",New to this and wondering if others experience...,,


# Feature Extraction Approach

Different features needed somewhat different approach to retrieving them from the posts and comments.  However, there is general workflow that we used for working on all of the features.  

The nature of the data is that authors write in multiple posts and comments.  The main goal is to scan through these posts and identify features for each author.  Different features can come from author's different posts.  For example, in one post author can be speaking about something that identifies their gender and in another post about something that identifies their age.

To capture all the features we create index where author is the key and value is a dictionary of the features.

Workflow:

- Find sentences that describe feature we are looking for.  We do this by first identifying some keyword and filtering posts by that keyword and than taking random sample from that list.
- Once the list of sample sentences that represent feature is created, we build set of regex expression to identify all of the language patterns that the feature can be expressed with.
- With both sample sentences and regex expressions we build function that can identify the feature given text.  To ensure that the function works correctly we create unittest and use sample sentences as input for the unittest.  This saves us time from debugging later on large dataset.
- Finally, we run feature extracting function again entire dataset and update author index with found features.

# Set up Author Index

In [41]:
from collections import defaultdict


author_index = defaultdict(dict)

# Determine Number of Unique Authors

Checking number of unique authors in the dataset. This will be maximum possible number of entries for our resulting dataset.

In [7]:
print(f'Total posts and comments: {len(posts_and_commnets)}')
print(f'Unique authors: {len(set(posts_and_commnets["Author"]))}')

Total posts and comments: 387535
Unique authors: 41209


# Feature - Discover Gender

For Reddis authors we must extract information on author's gender from the posts themselves as userids are auto generated by Reddis so there is very little chance that someone would change it to their names.

In order to figure out how to do it we looked through posts and looked for examples of how people either identify themselves or if they say something that would help us to identify their gender, for example, "Me and my husband."

We found following sentences that we used to retrieve regex patterns from:

In [8]:
sample_male_sentences = [
    "I am married and my Wife and I....",
    "Me and my girlfriend went somewhere",
    "Hello, me 44m and have migraines",
    "Hello, me 44(m) and have migraines",
    "Hello, me 44 (M) and have migraines",
    "Hello, me 44 male and have migraines",
    "Hello, I am male 44 and have migraines"
]

sample_female_sentences = [
    "Me and my husband have a car.",
    "Something I am currently pregnant and so on",
    "Something I am pregnant and so on",
    "Something I'm pregnant and so on",
    "I have had menstruation related migraine",
    "Me and my boyfriend went somewhere",
    "Hello, me 44f and have migraines",
    "Hello, me 44(f) and have migraines",
    "Hello, me 44 (F) and have migraines",
    "Hello, me 44 female and have migraines",
    "Hello, I am female 44 and have migraines"
]

In [9]:
# regex patterns
male_matchers = [
    re.compile('my\s+wife', re.IGNORECASE),
    re.compile('my\s.*girlfriend', re.IGNORECASE),
    re.compile('\s[0-9][0-9](m\s|\(m\)|\s\(m\))', re.IGNORECASE),
    re.compile('\s[0-9][0-9].*male', re.IGNORECASE),
    re.compile('male.*[0-9][0-9]', re.IGNORECASE)
]

female_matchers = [
    re.compile('my\s+husband', re.IGNORECASE),
    re.compile('I( am|\'m)\s.*pregnant', re.IGNORECASE),
    re.compile('I\s.*menstruation', re.IGNORECASE),
    re.compile('my\s.*boyfriend', re.IGNORECASE),
    re.compile('\s[0-9][0-9](f|\(f\)|\s\(f\))', re.IGNORECASE),
    re.compile('\s[0-9][0-9].*female', re.IGNORECASE),
    re.compile('female.*[0-9][0-9]', re.IGNORECASE)
]

In [10]:
# Gender discovery functions
def discover_gender(matchers):
    def find_in_text(text):
        return any([
            matcher.search(text) for matcher in matchers
        ])
    return find_in_text

find_females = discover_gender(female_matchers)
find_males = discover_gender(male_matchers)

In [11]:
# Unit tests
class TestGenderDiscovery(unittest.TestCase):
    def test_male_matchers(self):
        self.assertTrue(all([find_males(text) for text in sample_male_sentences]))
        self.assertFalse(all([find_males(text) for text in sample_female_sentences]))

    def test_female_matchers(self):
        self.assertTrue(all([find_females(text) for text in sample_female_sentences]))
        self.assertFalse(all([find_females(text) for text in sample_male_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.002s

OK


In [27]:
def identify_gender(text):
    if find_males(text):
        return 'male'
    elif find_females(text):
        return 'female'
    return 'unknown'

In [55]:
def identify_gender_in_posts(idx):
    gender_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        gender_idx[author] = {'gender': identify_gender(text)}

    posts_and_commnets.apply(lambda x: process_entry(x['Author'], x['Text']), axis=1)
    return gender_idx


In [56]:
author_index = identify_gender_in_posts(author_index)
len(author_index)

41209

Count how many authors were identified as male or female or unknown.

In [62]:
count_m = 0
count_f = 0
count_u = 0

for _, v in author_index.items():
    if v['gender'] == 'male':
        count_m += 1
    if v['gender'] == 'female':
        count_f += 1
    if v['gender'] == 'unknown':
        count_u += 1

print(f'male: {count_m}, female: {count_f}, unknown: {count_u}')

male: 634, female: 631, unknown: 39944


# Feature - Drug Information

This section describes process of identifying medicine and dosage used by authors.

We first started by searching for most common drugs used for migraine.  We found that information on this [website](https://www.healthgrades.com/right-care/migraine-and-headache/12-drugs-commonly-prescribed-for-migraine)

## Common Migraine Drugs

- **Amitriptyline (Elavil)** is an antidepressant. The dosing ranges from once a day up to four times a day. It belongs to a group of antidepressants called tricyclics. Drowsiness and sleepiness are very common with this group, so your doctor may recommend taking it at bedtime.
- **Divalproex sodium extended-release (Depakote ER)** is an anticonvulsant. You take the extended-release tablet once a day. Taking it with food can help prevent stomach upset.
- **Eletriptan (Relpax)** is a triptan. It is a tablet you take at the onset of your migraine symptoms. For triptans, your doctor will tell you how many tablets you can take in a 24 hour period.
- **Metoprolol (Lopressor, Toprol XL)** is a beta blocker. It comes in both an immediate-release and an extended-release form.
- **Propranolol extended-release (Inderal, Inderal LA, Inderal XL)** is another beta blocker. It comes in several forms, each with their own dosing. Talk with your doctor or pharmacist about how to take your medicine.
- **Rizatriptan (Maxalt)** is a triptan you use at the onset of symptoms. It comes as a tablet and a disintegrating tablet, which melts in your mouth without water.
- **Sumatriptan (Imitrex)** is another triptan. It comes in several forms, including a tablet, injection, and nasal spray.
- **Topiramate (Topamax, Trokendi XR)** is an anticonvulsant. It comes in a regular-release tablet and an extended-release capsule. You can take either kind with or without food.
- **Venlafaxine (Effexor, Effexor XR)** is an antidepressant. You take both the tablet and the extended-release capsule with food. Stomach upset, headache, and appetite loss are common side effects.
- **Zolmitriptan (Zomig)** is another triptan. It comes as a tablet, disintegrating table, and nasal spray.
- **OnabotulinumtoxinA (Botox)** is a botulinum toxin that, when injected into areas of the face and scalp, can prevent the brain's pain response from activating. This stops migraine attacks before they occur.
- **Erenumab (Aimovig)** is a CGRP blocker. It's given by self-injection once a month.


From this we created a list of drugs and added some additional drugs we saw in Reddit posts.

In [206]:
drug_list = [
    'Amitriptyline',
    'Elavil',
    'Divalproex',
    'Depakote',
    'Eletriptan',
    'Relpax',
    'triptan',
    'Metoprolol',
    'Lopressor',
    'Toprol',
    'Propranolol',
    'Inderal',
    'beta blocker',
    'Rizatriptan',
    'Maxalt',
    'Sumatriptan',
    'Imitrex',
    'Topiramate',
    'Topamax',
    'Trokendi',
    'Venlafaxine',
    'Effexor',
    'Zolmitriptan',
    'Zomig',
    'OnabotulinumtoxinA',
    'Botox',
    'Erenumab',
    'Aimovig',
    'CGRP',
    'Nurtec',  # found in the subreddit post
    'Topomax',  # popular misspelling of Topamax,
    'nortiptyline',  # found in the subreddit post
    'metoclopramide',  # found in the subreddit post
    'caffeine pill',  # found in the subreddit posts_and_comments
    'naproxen',
    'magnesium',
    'Delta 8',
    'Aimovig',
    'sulfate',
    'Xanax',
    'amitryptiline',
    'Amoxicillin'
]

# drug_list = set([s.lower() for s in drug_list])

## Medicine Patterns

Manually sampling above output we found following patterns to base our regex expressions on:

- 75mg topamax daily
- Recently been prescribed 25mg topiramate to take one a day before bed
- I'm on 50mg and feel no effect at all
- I'm on 50mg and dont feel completely wrecked the next day. But I also have a sleep disorder that benefits from amitriptyline making me tired at night
- I’m currently on daily 40mg of Propranolol
- my current amitryptiline dose (50mg)
- prescribed me 50mg amitriptyline
- I take 2 x 600 mg caps of magnesium
- I take 10mg of edible Delta 8
- I was on 10mg which worked great for 6 months
- Now I take 30 mg
- It's a combination of sumatriptan 85 mg and naproxen sodium 500 mg
- I use 50 mg with a triptan
- I currently take 75mg daily
- I'm at 20mg 3x a day now
- I started Aimovig in July 2018 at the 70mg dose
- I take 50mg daily
- She said 875mg of Amoxicillin
- I take a total of 2400mg/day: 900mg/600mg/900mg.
- it 80mg twice a day
- Years ago i took 900mg three times a day
- 300mg 3x a day
- sulfate 325 mg twice daily
- I took .5 mg of Xanax
- my doc prescribed me 10mg of amitriptyline nightly
- my dose is 2x 25mg
- Topamax took about a month for me (at 50 mg)
- Mine comes in 2.5mg.
- but I take 250mg/day


In [212]:
sample_drug_sentences = [
    '75mg topamax daily',
    'Recently been prescribed 25mg topiramate to take one a day before bed',
    "I'm on 50mg Topomax and feel no effect at all",
    "I'm on 50mg Topomax and dont feel completely wrecked the next day. But I also have a sleep disorder that benefits from amitriptyline making me tired at night",
    'I’m currently on daily 40mg of Propranolol',
    'my current amitryptiline dose (50mg)',
    'prescribed me 50mg amitriptyline',
    'I take 2 x 600 mg caps of magnesium',
    'I take 10mg of edible Delta 8',
    'I was on 10mg of Topomax which worked great for 6 months',
    'Now I take 30 mg of Topomax',
    'It\'s a combination of sumatriptan 85 mg and naproxen sodium 500 mg',
    'I use 50 mg with a triptan',
    'I currently take Topomax 75mg daily',
    'I\'m on Topomax at 20mg 3x a day now',
    'I started Aimovig in July 2018 at the 70mg dose',
    'I take 50mg daily of Topomax',
    'She said 875mg of Amoxicillin',
    'I take Topomax a total of 2400mg/day: 900mg/600mg/900mg.',
    'Topomax 80mg twice a day',
    'Years ago i took Topomax 900mg three times a day',
    'Topomax 300mg 3x a day',
    'sulfate 325 mg twice daily',
    'I took .5 mg of Xanax',
    'my doc prescribed me 10mg of amitriptyline nightly',
    'my dose of Aimovig is 2x 25mg',
    'Topamax took about a month for me (at 50 mg)',
    'Mine Topamax comes in 2.5mg.',
    'but I take 250mg/day of Xanax'
]

invalid_drug_sentences = [
    'I took .5 days of vaction',
    'my doc prescribed me 10 days of rest',
    'my car is 25kg heavier',
    '50 mg of stuff',
    'Mine Topamax comes in 2-5 weeks.'
]

List of regex patterns is a bit more complex than previous one as we needed to store regex group indexes for dosage and quantity as they can appear at different positions.

In [208]:
# Regex patterns
drug_matchers = [
    {
        'regex': re.compile('([0-9]x).*([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg)', flags=re.IGNORECASE),
        'dosage_group': 2,
        'qty_group': 1
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg).*([0-9]x)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg).*(three times a day|four times a day|twice a day|one a day|twice daily)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+mg|[0-9]+\smg).*(nightly|daily|dose|day)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': -1
    }
]

In [220]:
# Medicine discovery functions
def normalize_qty(qty_text):
    if qty_text == 'daily' or qty_text == 'dose' or qty_text == 'day' or qty_text == 'one a day' or qty_text == '1x':
        return '1x'

    if qty_text == 'twice daily' or qty_text == 'twice a day':
        return '2x'

    if qty_text == 'three times a day':
        return '3x'

    return qty_text

def find_medicine_name(text):
    meds_matched = []
    for drug in drug_list:
        if re.search(drug, text, re.IGNORECASE):
            meds_matched.append(drug)
    return meds_matched

def find_dosage(reg_res, matcher):
    if matcher['qty_group'] == -1:
        qty = '1x'
    else:
        qty = normalize_qty(reg_res.group(matcher['qty_group']))
    return reg_res.group(matcher['dosage_group']), qty

def discover_medicine_dosage(matchers):
    def process_medicine_dosage(text):
        for matcher in matchers:
            if (reg_res := matcher['regex'].search(text)):
                dosage, qty = find_dosage(reg_res, matcher)
                if dosage:
                    med = find_medicine_name(text)
                    if med:
                        return (
                            med[0],
                            dosage,
                            qty
                        )

        return 'unknown', 'unknown', 'unknown'
    return process_medicine_dosage

find_medicine_dosage = discover_medicine_dosage(drug_matchers)

In [221]:
# Unit tests
def is_dosage_found(result):
    med, dosage, qty = result
    if med != 'unknown' and dosage != 'unknown' and qty != 'unknown':
        return True
    return False


class TestMedicineDosageDiscovery(unittest.TestCase):
    def test_dosage_matchers(self):
        self.assertTrue(all([is_dosage_found(find_medicine_dosage(text)) for text in sample_drug_sentences]))
        self.assertFalse(all([is_dosage_found(find_medicine_dosage(text)) for text in invalid_drug_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.007s

OK


Now run `find_medicine_dosage` function to find all entries in posts.

In [224]:
def identify_medicine_in_posts(idx):
    medicine_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        med, dosage, qty = find_medicine_dosage(text)
        medicine_idx[author] = {
            'medicine': med,
            'dosage': dosage,
            'qty': qty
        }

    posts_and_commnets.apply(lambda x: process_entry(x['Author'], x['Text']), axis=1)
    return medicine_idx

In [225]:
author_index = identify_medicine_in_posts(author_index)
len(author_index)

41209

In [226]:
count_medicine = 0

for _, v in author_index.items():
    if v['medicine'] != 'unknown':
        count_medicine += 1

print(f'entries with medicine dosage: {count_medicine}')

entries with medicine dosage: 1044


# Feature - Suicidal Thoughts

This section attempts to identify all of the authors that had suicidal thoughts.

Following is the list of sample sentences talking about suicide and they will serve as a test sentences and patterns for creating regex expression to retrieve information on suicidal thoughts.

Regex expression are divided into positive and negative.
With these sentences it is important not to pick up negative sentences that would mean the exact opposite to what we are trying to detect.

For example, we want to capture `I have suicidal thoughts` BUT we don't want to capture `I don't have suicidal thoughts`.  For this reason we will create regex set for capturing the negative sentences as well so that we can eliminate them from consideration.

In [227]:
suicide_sample_sentences = [
    'Had suicidal thoughts',
    'made me think a lot about suicide',
    'I still thought about suicide',
    'Suicide ideation',
    'suicidal ideations',
    'The “intrusive thoughts” and experiencing life far away'
    'I had felt suicidal',
    'feeling sad/suicidal',
    'it made me wildly suicidal',
    'I have been extremely suicidal',
    'i also have been extremely suicidal',
    'and then attempted suicide to stop the pain',
    'my near suicide attempt',
    'Depression and suicide thoughts are unbearable for me',
    'I was having suicidal thoughts',
    'I self harm and have suicidal thoughts',
    'just made me suicidal',
    'I was so suicidal',
    'made me legitimately suicidal',
    'Had suicidal thoughts',
    'I still thought about suicide',
    'made me think a lot about suicide'
]

neg_suicide_sample_sentences = [
    'I am not suicidal',
    'I have never been suicidal'
]

In [228]:
# Regex expressions
positive_suicide_matchers = [
    re.compile('(am|have|had|felt|having|me|was|been|think|about|feeling).*(suicidal|suicide)', re.IGNORECASE),
    re.compile('(my near|made me|have been|thought about|).*(suicidal|suicide)', re.IGNORECASE)
]

negative_suicide_matchers = [
    re.compile('(am|have|had|felt|having|me|was|been|think|about|feeling) (not|never).*(suicidal|suicide)', re.IGNORECASE),
    re.compile('(my near|made me|have been|thought about|) (not|never).*(suicidal|suicide)', re.IGNORECASE)
]

In [229]:
def search_for_suicide(text):
    return any([matcher.search(text) for matcher in positive_suicide_matchers]) \
        and not any([matcher.search(text) for matcher in negative_suicide_matchers])


In [231]:
# Unit tests
class TestSuicidalThoughtsDiscovery(unittest.TestCase):
    def test_suicidal_matchers(self):
        self.assertTrue(all([search_for_suicide(text) for text in suicide_sample_sentences]))
        self.assertFalse(all([search_for_suicide(text) for text in neg_suicide_sample_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok
test_suicidal_matchers (__main__.TestSuicidalThoughtsDiscovery) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.010s

OK


In [232]:
def identify_suicidal_thoughts_in_posts(idx):
    suicidal_thoughts_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        if search_for_suicide(text):
            suicidal_thoughts_idx[author] = { 'suicidal': 'yes' }
        else:
            suicidal_thoughts_idx[author] = { 'suicidal': 'no' }

    posts_and_commnets.apply(lambda x: process_entry(x['Author'], x['Text']), axis=1)
    return suicidal_thoughts_idx

In [233]:
author_index = identify_suicidal_thoughts_in_posts(author_index)
len(author_index)

41209

In [234]:
total_suicidal = 0

for _, v in author_index.items():
    if v['suicidal'] == 'yes':
        total_suicidal += 1

print(f'Authors with with suicidal thoughts: {total_suicidal}')

Authors with with suicidal thoughts: 139
