# Feature Extraction

The purpose of this notebook is to show how and the process of extracting each feature form texts of posts in Reddis and migraine.com

Based on the work done for the [project proposal](./group_11_proposal.ipynb) we identified following features to extract:

- Age
- Gender
- Medicine use and dosage
- Suicidal thoughts
- Migraine triggers
- Presence of aura
- Trouble with sleeping
- ADHD

In the rest of this notebook we show the process of getting each feature out and constructing the code.

In [1]:
import pandas as pd
import re
import unittest
import copy


In [2]:
reddis_data_filename = 'reddis_migraine_posts.csv'
migraine_dot_com = 'migraine.com.csv'
csv_files = [reddis_data_filename, migraine_dot_com]

In [3]:
def read_reddis_data(files):
    dfs = []
    for file in files:
        dfs.append(pd.read_csv(f'data/{file}', header=0))
        df = pd.concat(dfs)
        df = df.dropna(subset=['Text'])
    return df

In [4]:
posts_and_commnets = read_reddis_data(csv_files)

In [5]:
posts_and_commnets[:5]

Unnamed: 0,Type,Parent,Author,Text,Title,Tags,Webpage
0,P,q1pdf8,Conscious_Escape_408,I've been awake the entire night with the wors...,Worst I've ever had/calling in sick,,
1,P,q1p2lt,Sia-King,"Hey y’all, I got a referral for a neurologist ...",What preventative to trial next? (Asthmatic &a...,,
2,P,q1otox,netluv,It’s day two night two of a migraine. I’ve max...,Pain,,
3,P,q1odf7,Dazee80,I am in a fucked position. I have had migraine...,Pain vs Relationship,,
5,P,q1kv1i,doitforthepizza,"Hi everyone, I'm new here (44f). First I'll sa...",New to this and wondering if others experience...,,


For most of our processing we will need Author and Text columns

In [6]:
posts_and_commnets = list(posts_and_commnets[['Author', 'Text']].to_records(index=False))

In [7]:
posts_and_commnets[:2]

[('Conscious_Escape_408', "I've been awake the entire night with the worst migraine I have ever had. Im a long time sufferer but this one is different. My thinking is more impaired than usual and my blood pressure is 150/111. I need to call out sick from work but I'm so afraid of getting in trouble. I'm afraid to go to hospital emergency room with the number of COVID cases in my area. I don't know what to do."),
 ('Sia-King', 'Hey y’all, I got a referral for a neurologist and while I’m waiting on it, I’ve decided to trial another preventative. I’ve tried Topomax (was on it 2 weeks, couldn’t handle the side effects), currently on Sandomigran (been on it for 19 months, stopped working 9 months in but I continued w it because I didn’t want to accept it wasn’t working 😭 when it did work, it was bloody amazing. While I can increase the dosage for effectiveness, I can barely handle the fatigue it gives). \n\nI have asthma and take venlafaxine for depression &amp; anxiety btw. My GP said this

# Feature Extraction Approach

Different features needed somewhat different approach to retrieving them from the posts and comments.  However, there is general workflow that we used for working on all of the features.  

The nature of the data is that authors write in multiple posts and comments.  The main goal is to scan through these posts and identify features for each author.  Different features can come from author's different posts.  For example, in one post author can be speaking about something that identifies their gender and in another post about something that identifies their age.

To capture all the features we create index where author is the key and value is a dictionary of the features.

Workflow:

- Find sentences that describe feature we are looking for.  We do this by first identifying some keyword and filtering posts by that keyword and than taking random sample from that list.
- Once the list of sample sentences that represent feature is created, we build set of regex expression to identify all of the language patterns that the feature can be expressed with.
- With both sample sentences and regex expressions we build function that can identify the feature given text.  To ensure that the function works correctly we create unittest and use sample sentences as input for the unittest.  This saves us time from debugging later on large dataset.
- Finally, we run feature extracting function again entire dataset and update author index with found features.

# Set up Author Index

In [8]:
from collections import defaultdict


author_index = defaultdict(dict)

# Determine Number of Unique Authors

Checking number of unique authors in the dataset. This will be maximum possible number of entries for our resulting dataset.

In [10]:
from itertools import groupby


print(f'Total posts and comments: {len(posts_and_commnets)}')
unique_authors = [x[0] for x in posts_and_commnets]
print(f'Unique authors: {len(set(unique_authors))}')

Total posts and comments: 410708
Unique authors: 46918


# Feature - Discover Gender

For Reddis authors we must extract information on author's gender from the posts themselves as userids are auto generated by Reddis so there is very little chance that someone would change it to their names.

In order to figure out how to do it we looked through posts and looked for examples of how people either identify themselves or if they say something that would help us to identify their gender, for example, "Me and my husband."

We found following sentences that we used to retrieve regex patterns from:

In [11]:
sample_male_sentences = [
    "I am married and my Wife and I....",
    "Me and my girlfriend went somewhere",
    "Hello, me 44m and have migraines",
    "Hello, me 44(m) and have migraines",
    "Hello, me 44 (M) and have migraines",
    "Hello, me 44 male and have migraines",
    "Hello, I am male 44 and have migraines"
]

sample_female_sentences = [
    "Me and my husband have a car.",
    "Something I am currently pregnant and so on",
    "Something I am pregnant and so on",
    "Something I'm pregnant and so on",
    "I have had menstruation related migraine",
    "Me and my boyfriend went somewhere",
    "Hello, me 44f and have migraines",
    "Hello, me 44(f) and have migraines",
    "Hello, me 44 (F) and have migraines",
    "Hello, me 44 female and have migraines",
    "Hello, I am female 44 and have migraines"
]

In [12]:
# regex patterns
male_matchers = [
    re.compile('my\s+wife', re.IGNORECASE),
    re.compile('my\s.*girlfriend', re.IGNORECASE),
    re.compile('\s[0-9][0-9](m\s|\(m\)|\s\(m\))', re.IGNORECASE),
    re.compile('\s[0-9][0-9].*male', re.IGNORECASE),
    re.compile('male.*[0-9][0-9]', re.IGNORECASE)
]

female_matchers = [
    re.compile('my\s+husband', re.IGNORECASE),
    re.compile('I( am|\'m)\s.*pregnant', re.IGNORECASE),
    re.compile('I\s.*menstruation', re.IGNORECASE),
    re.compile('my\s.*boyfriend', re.IGNORECASE),
    re.compile('\s[0-9][0-9](f|\(f\)|\s\(f\))', re.IGNORECASE),
    re.compile('\s[0-9][0-9].*female', re.IGNORECASE),
    re.compile('female.*[0-9][0-9]', re.IGNORECASE)
]

In [13]:
# Gender discovery functions
def discover_gender(matchers):
    def find_in_text(text):
        return any([
            matcher.search(text) for matcher in matchers
        ])
    return find_in_text

find_females = discover_gender(female_matchers)
find_males = discover_gender(male_matchers)

In [14]:
# Unit tests
class TestGenderDiscovery(unittest.TestCase):
    def test_male_matchers(self):
        self.assertTrue(all([find_males(text) for text in sample_male_sentences]))
        self.assertFalse(all([find_males(text) for text in sample_female_sentences]))

    def test_female_matchers(self):
        self.assertTrue(all([find_females(text) for text in sample_female_sentences]))
        self.assertFalse(all([find_females(text) for text in sample_male_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.001s

OK


In [15]:
def identify_gender(text):
    if find_males(text):
        return 'male'
    elif find_females(text):
        return 'female'
    return 'unknown'

In [18]:
def identify_gender_in_posts(idx):
    gender_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        gender_idx[author]['gender'] = identify_gender(text)

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return gender_idx


In [19]:
author_index = identify_gender_in_posts(author_index)
len(author_index)

46918

Count how many authors were identified as male or female or unknown.

In [20]:
count_m = 0
count_f = 0
count_u = 0

for _, v in author_index.items():
    if v['gender'] == 'male':
        count_m += 1
    if v['gender'] == 'female':
        count_f += 1
    if v['gender'] == 'unknown':
        count_u += 1

print(f'male: {count_m}, female: {count_f}, unknown: {count_u}')

male: 780, female: 826, unknown: 45312


# Feature - Drug Information

This section describes process of identifying medicine and dosage used by authors.

We first started by searching for most common drugs used for migraine.  We found that information on this [website](https://www.healthgrades.com/right-care/migraine-and-headache/12-drugs-commonly-prescribed-for-migraine)

## Common Migraine Drugs

- **Amitriptyline (Elavil)** is an antidepressant. The dosing ranges from once a day up to four times a day. It belongs to a group of antidepressants called tricyclics. Drowsiness and sleepiness are very common with this group, so your doctor may recommend taking it at bedtime.
- **Divalproex sodium extended-release (Depakote ER)** is an anticonvulsant. You take the extended-release tablet once a day. Taking it with food can help prevent stomach upset.
- **Eletriptan (Relpax)** is a triptan. It is a tablet you take at the onset of your migraine symptoms. For triptans, your doctor will tell you how many tablets you can take in a 24 hour period.
- **Metoprolol (Lopressor, Toprol XL)** is a beta blocker. It comes in both an immediate-release and an extended-release form.
- **Propranolol extended-release (Inderal, Inderal LA, Inderal XL)** is another beta blocker. It comes in several forms, each with their own dosing. Talk with your doctor or pharmacist about how to take your medicine.
- **Rizatriptan (Maxalt)** is a triptan you use at the onset of symptoms. It comes as a tablet and a disintegrating tablet, which melts in your mouth without water.
- **Sumatriptan (Imitrex)** is another triptan. It comes in several forms, including a tablet, injection, and nasal spray.
- **Topiramate (Topamax, Trokendi XR)** is an anticonvulsant. It comes in a regular-release tablet and an extended-release capsule. You can take either kind with or without food.
- **Venlafaxine (Effexor, Effexor XR)** is an antidepressant. You take both the tablet and the extended-release capsule with food. Stomach upset, headache, and appetite loss are common side effects.
- **Zolmitriptan (Zomig)** is another triptan. It comes as a tablet, disintegrating table, and nasal spray.
- **OnabotulinumtoxinA (Botox)** is a botulinum toxin that, when injected into areas of the face and scalp, can prevent the brain's pain response from activating. This stops migraine attacks before they occur.
- **Erenumab (Aimovig)** is a CGRP blocker. It's given by self-injection once a month.


From this we created a list of drugs and added some additional drugs we saw in Reddit posts.

In [21]:
drug_list = [
    'Amitriptyline',
    'Elavil',
    'Divalproex',
    'Depakote',
    'Eletriptan',
    'Relpax',
    'triptan',
    'Metoprolol',
    'Lopressor',
    'Toprol',
    'Propranolol',
    'Inderal',
    'beta blocker',
    'Rizatriptan',
    'Maxalt',
    'Sumatriptan',
    'Imitrex',
    'Topiramate',
    'Topamax',
    'Trokendi',
    'Venlafaxine',
    'Effexor',
    'Zolmitriptan',
    'Zomig',
    'OnabotulinumtoxinA',
    'Botox',
    'Erenumab',
    'Aimovig',
    'CGRP',
    'Nurtec',  # found in the subreddit post
    'Topomax',  # popular misspelling of Topamax,
    'nortiptyline',  # found in the subreddit post
    'metoclopramide',  # found in the subreddit post
    'caffeine pill',  # found in the subreddit posts_and_comments
    'naproxen',
    'magnesium',
    'Delta 8',
    'Aimovig',
    'sulfate',
    'Xanax',
    'amitryptiline',
    'Amoxicillin'
]

# drug_list = set([s.lower() for s in drug_list])

## Medicine Patterns

Manually sampling above output we found following patterns to base our regex expressions on:

- 75mg topamax daily
- Recently been prescribed 25mg topiramate to take one a day before bed
- I'm on 50mg and feel no effect at all
- I'm on 50mg and dont feel completely wrecked the next day. But I also have a sleep disorder that benefits from amitriptyline making me tired at night
- I’m currently on daily 40mg of Propranolol
- my current amitryptiline dose (50mg)
- prescribed me 50mg amitriptyline
- I take 2 x 600 mg caps of magnesium
- I take 10mg of edible Delta 8
- I was on 10mg which worked great for 6 months
- Now I take 30 mg
- It's a combination of sumatriptan 85 mg and naproxen sodium 500 mg
- I use 50 mg with a triptan
- I currently take 75mg daily
- I'm at 20mg 3x a day now
- I started Aimovig in July 2018 at the 70mg dose
- I take 50mg daily
- She said 875mg of Amoxicillin
- I take a total of 2400mg/day: 900mg/600mg/900mg.
- it 80mg twice a day
- Years ago i took 900mg three times a day
- 300mg 3x a day
- sulfate 325 mg twice daily
- I took .5 mg of Xanax
- my doc prescribed me 10mg of amitriptyline nightly
- my dose is 2x 25mg
- Topamax took about a month for me (at 50 mg)
- Mine comes in 2.5mg.
- but I take 250mg/day


In [22]:
sample_drug_sentences = [
    '75mg topamax daily',
    'Recently been prescribed 25mg topiramate to take one a day before bed',
    "I'm on 50mg Topomax and feel no effect at all",
    "I'm on 50mg Topomax and dont feel completely wrecked the next day. But I also have a sleep disorder that benefits from amitriptyline making me tired at night",
    'I’m currently on daily 40mg of Propranolol',
    'my current amitryptiline dose (50mg)',
    'prescribed me 50mg amitriptyline',
    'I take 2 x 600 mg caps of magnesium',
    'I take 10mg of edible Delta 8',
    'I was on 10mg of Topomax which worked great for 6 months',
    'Now I take 30 mg of Topomax',
    'It\'s a combination of sumatriptan 85 mg and naproxen sodium 500 mg',
    'I use 50 mg with a triptan',
    'I currently take Topomax 75mg daily',
    'I\'m on Topomax at 20mg 3x a day now',
    'I started Aimovig in July 2018 at the 70mg dose',
    'I take 50mg daily of Topomax',
    'She said 875mg of Amoxicillin',
    'I take Topomax a total of 2400mg/day: 900mg/600mg/900mg.',
    'Topomax 80mg twice a day',
    'Years ago i took Topomax 900mg three times a day',
    'Topomax 300mg 3x a day',
    'sulfate 325 mg twice daily',
    'I took .5 mg of Xanax',
    'my doc prescribed me 10mg of amitriptyline nightly',
    'my dose of Aimovig is 2x 25mg',
    'Topamax took about a month for me (at 50 mg)',
    'Mine Topamax comes in 2.5mg.',
    'but I take 250mg/day of Xanax'
]

invalid_drug_sentences = [
    'I took .5 days of vaction',
    'my doc prescribed me 10 days of rest',
    'my car is 25kg heavier',
    '50 mg of stuff',
    'Mine Topamax comes in 2-5 weeks.'
]

List of regex patterns is a bit more complex than previous one as we needed to store regex group indexes for dosage and quantity as they can appear at different positions.

In [23]:
# Regex patterns
drug_matchers = [
    {
        'regex': re.compile('([0-9]x).*([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg)', flags=re.IGNORECASE),
        'dosage_group': 2,
        'qty_group': 1
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg).*([0-9]x)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg).*(three times a day|four times a day|twice a day|one a day|twice daily)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+mg|[0-9]+\smg).*(nightly|daily|dose|day)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': 2
    },
    {
        'regex': re.compile('([0-9]+\.?[0-9]+mg|[0-9]+\.?[0-9]+\smg|\.?[0-9]+mg|\.?[0-9]+\smg)', flags=re.IGNORECASE),
        'dosage_group': 1,
        'qty_group': -1
    }
]

In [24]:
# Medicine discovery functions
def normalize_qty(qty_text):
    if qty_text == 'daily' or qty_text == 'dose' or qty_text == 'day' or qty_text == 'one a day' or qty_text == '1x':
        return '1x'

    if qty_text == 'twice daily' or qty_text == 'twice a day':
        return '2x'

    if qty_text == 'three times a day':
        return '3x'

    return qty_text

def find_medicine_name(text):
    meds_matched = []
    for drug in drug_list:
        if re.search(drug, text, re.IGNORECASE):
            meds_matched.append(drug)
    return meds_matched

def find_dosage(reg_res, matcher):
    if matcher['qty_group'] == -1:
        qty = '1x'
    else:
        qty = normalize_qty(reg_res.group(matcher['qty_group']))
    return reg_res.group(matcher['dosage_group']), qty

def discover_medicine_dosage(matchers):
    def process_medicine_dosage(text):
        for matcher in matchers:
            if (reg_res := matcher['regex'].search(text)):
                dosage, qty = find_dosage(reg_res, matcher)
                if dosage:
                    med = find_medicine_name(text)
                    if med:
                        return (
                            med[0],
                            dosage,
                            qty
                        )

        return 'unknown', 'unknown', 'unknown'
    return process_medicine_dosage

find_medicine_dosage = discover_medicine_dosage(drug_matchers)

In [25]:
# Unit tests
def is_dosage_found(result):
    med, dosage, qty = result
    if med != 'unknown' and dosage != 'unknown' and qty != 'unknown':
        return True
    return False


class TestMedicineDosageDiscovery(unittest.TestCase):
    def test_dosage_matchers(self):
        self.assertTrue(all([is_dosage_found(find_medicine_dosage(text)) for text in sample_drug_sentences]))
        self.assertFalse(all([is_dosage_found(find_medicine_dosage(text)) for text in invalid_drug_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.007s

OK


Now run `find_medicine_dosage` function to find all entries in posts.

In [26]:
def identify_medicine_in_posts(idx):
    medicine_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        med, dosage, qty = find_medicine_dosage(text)
        medicine_idx[author]['medicine'] = med
        medicine_idx[author]['dosage'] = dosage
        medicine_idx[author]['qty'] = qty

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return medicine_idx

In [27]:
author_index = identify_medicine_in_posts(author_index)
len(author_index)

46918

In [28]:
count_medicine = 0

for _, v in author_index.items():
    if v['medicine'] != 'unknown':
        count_medicine += 1

print(f'entries with medicine dosage: {count_medicine}')

entries with medicine dosage: 1387


# Feature - Suicidal Thoughts

This section attempts to identify all of the authors that had suicidal thoughts.

Following is the list of sample sentences talking about suicide and they will serve as a test sentences and patterns for creating regex expression to retrieve information on suicidal thoughts.

Regex expression are divided into positive and negative.
With these sentences it is important not to pick up negative sentences that would mean the exact opposite to what we are trying to detect.

For example, we want to capture `I have suicidal thoughts` BUT we don't want to capture `I don't have suicidal thoughts`.  For this reason we will create regex set for capturing the negative sentences as well so that we can eliminate them from consideration.

In [32]:
suicide_sample_sentences = [
    'Had suicidal thoughts',
    'made me think a lot about suicide',
    'I still thought about suicide',
    'Suicide ideation',
    'suicidal ideations',
    'The “intrusive thoughts” and experiencing life far away'
    'I had felt suicidal',
    'feeling sad/suicidal',
    'it made me wildly suicidal',
    'I have been extremely suicidal',
    'i also have been extremely suicidal',
    'and then attempted suicide to stop the pain',
    'my near suicide attempt',
    'Depression and suicide thoughts are unbearable for me',
    'I was having suicidal thoughts',
    'I self harm and have suicidal thoughts',
    'just made me suicidal',
    'I was so suicidal',
    'made me legitimately suicidal',
    'Had suicidal thoughts',
    'I still thought about suicide',
    'made me think a lot about suicide'
]

neg_suicide_sample_sentences = [
    'I am not suicidal',
    'I have never been suicidal'
]

In [33]:
# Regex expressions
positive_suicide_matchers = [
    re.compile('(am|have|had|felt|having|me|was|been|think|about|feeling).*(suicidal|suicide)', re.IGNORECASE),
    re.compile('(my near|made me|have been|thought about|).*(suicidal|suicide)', re.IGNORECASE)
]

negative_suicide_matchers = [
    re.compile('(am|have|had|felt|having|me|was|been|think|about|feeling) (not|never).*(suicidal|suicide)', re.IGNORECASE),
    re.compile('(my near|made me|have been|thought about|) (not|never).*(suicidal|suicide)', re.IGNORECASE)
]

In [34]:
def search_for_suicide(text):
    return any([matcher.search(text) for matcher in positive_suicide_matchers]) \
        and not any([matcher.search(text) for matcher in negative_suicide_matchers])


In [35]:
# Unit tests
class TestSuicidalThoughtsDiscovery(unittest.TestCase):
    def test_suicidal_matchers(self):
        self.assertTrue(all([search_for_suicide(text) for text in suicide_sample_sentences]))
        self.assertFalse(all([search_for_suicide(text) for text in neg_suicide_sample_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok
test_suicidal_matchers (__main__.TestSuicidalThoughtsDiscovery) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.007s

OK


In [36]:
def identify_suicidal_thoughts_in_posts(idx):
    suicidal_thoughts_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        if search_for_suicide(text):
            suicidal_thoughts_idx[author]['suicidal'] = 'yes'
        else:
            suicidal_thoughts_idx[author]['suicidal'] = 'no'

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return suicidal_thoughts_idx

In [37]:
author_index = identify_suicidal_thoughts_in_posts(author_index)
len(author_index)

46918

In [38]:
total_suicidal = 0

for _, v in author_index.items():
    if v['suicidal'] == 'yes':
        total_suicidal += 1

print(f'Authors with with suicidal thoughts: {total_suicidal}')

Authors with with suicidal thoughts: 155


# Feature - Author's Age

This section discovers author's age from the posts.  This is similar pattern of finding sample sentences and creating regex expressions based on those samples.

In [39]:
age_sample_sentences = [
    "I'm 26",
    "I'm 9",
    "I am 26",
    "I'm in my 40's",
    "45 years old",
    "I am 20 (F)",
    "I am 28M",
    "I am now at 58",
    "I'm now 20",
    "40f"
]

no_age_sentences = [
    "I'm good",
    "I am bad",
    "I'm in my house",
    "years old",
    "I am (F)",
    "I am M",
    "I am now at the shop",
    "I'm now better",
    "is f",
    "I am on vyepti (every 3 months) along with",
    "a good 30-45 minutes"
]

In [40]:
# Regex expressions
age_matchers = [
    {'matcher': re.compile("I('m| am) ([0-9][0-9]*)", re.IGNORECASE), 'group': 2},
    {'matcher': re.compile("I('m| am) in my ([0-9][0-9]*)", re.IGNORECASE), 'group': 2},
    {'matcher': re.compile("([0-9][0-9]*) years old", re.IGNORECASE), 'group': 1},
    {'matcher': re.compile("I('m| am) now ([0-9][0-9]*)", re.IGNORECASE), 'group': 2},
    {'matcher': re.compile("I('m| am) now at ([0-9][0-9]*)", re.IGNORECASE), 'group': 2},
    {'matcher': re.compile("([0-9][0-9]*)(f\b|m\b|f$|m$)", re.IGNORECASE), 'group': 1},
    {'matcher': re.compile("([0-9][0-9]*) (f\b|m\b|f$|m$)", re.IGNORECASE), 'group': 1},
    {'matcher': re.compile("([0-9][0-9]*)\((f|m)\)", re.IGNORECASE), 'group': 1},
    {'matcher': re.compile("([0-9][0-9]*) \((f|m)\)", re.IGNORECASE), 'group': 1}
]

In [41]:
# Find age in text or return 0 if no age information
def find_age(text):
    for matcher in age_matchers:
        if (r := matcher['matcher'].search(text)):
            return int(r.group(matcher['group']))
    return 0

In [42]:
# Unit tests
class TestAgeDiscovery(unittest.TestCase):
    def test_age_matchers(self):
        self.assertTrue(all([find_age(text) for text in age_sample_sentences]))
        self.assertFalse(all([find_age(text) for text in no_age_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_age_matchers (__main__.TestAgeDiscovery) ... ok
test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok
test_suicidal_matchers (__main__.TestSuicidalThoughtsDiscovery) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.010s

OK


In [44]:
def identify_authors_age_in_posts(idx):
    age_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        age = find_age(text)
        age_idx[author]['age'] = age

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return age_idx

In [45]:
author_index = identify_authors_age_in_posts(author_index)
len(author_index)

46918

In [46]:
total_authors_with_age = 0

for _, v in author_index.items():
    if v['age'] != 0:
        total_authors_with_age += 1

print(f'Authors with identified age: {total_authors_with_age}')

Authors with identified age: 1192


# Feature - Migraine Triggers

This section identifies authors and their triggers.  The approach to discovery of this feature is somewhat different.

First we create a list of posts with word trigger(s) and then we sample those to discover sentences we can use to identify triggers.

In [48]:
pattern = re.compile('(trigger|triggers)', re.IGNORECASE)

def trigger_filter(text):
    if pattern.search(text, re.IGNORECASE):
        return True
    return False

posts_with_triggers = [post for post in posts_and_commnets if trigger_filter(post[1])]
len(posts_with_triggers)

29046

In [49]:
# Sample
posts_with_triggers[:5]

[('danawl', 'Hello fellow cool kids, \n\nI’m wondering if any of you have screen based triggers, and notice any certain side effects. This is just out of pure curiosity. \n\nI, myself, am able to stare at my phone and tv screen a lot longer than I can my computer screen. I have blue light glasses, I always have “night mode” on. \n\nI think because I use dark mode on my phone, that helps, and I’m not staring at flat text as much on a phone / tv than I am when I’m on a computer. I’m a coding student, so my assignments consist of staring at small lines of code on a high contrast screen. \n\nThoughts?'),
 ('redsquirrel5000', "So I moved from Nevada 2 years ago to Florida, I had migraines in Nevada as well, but they are a bit worse in Florida especially the humid and stormy summers.\n\nSome info says I should move. I'm on preventatives, but maybe there are better meds my doc is not aware of for rain or lightning and stormy weather related migraines?\n\n&amp;#x200B;\n\nMy triggers include st

The output from the above produced lots of samples of how triggers are described:

- have screen based triggers
- My triggers include stress, screens, bright lights, fast motion, tv's, and rain, lightning and things like that
- experience migraine trigger from leafy greens / power greens (kale, baby kale, chard spinach, arugula)?
- Triggers include screens on computers, phones etc, flicker type lights, and motion, tv, and when it rains
- exercise triggering migraines
- I have identified coffee/caffeine and lack of sleep as my two main triggers
- have any perfume brands/scents that do not trigger headaches or migraines
- I found i can't wear perfumes, they all trigger me
- Coffee creamers were my trigger specifically International Delight Brand
- I can personally confirm that coffee/caffeine is a migraine trigger
- My migraines were triggered by a stressful life event
- Mold triggers horrifying migraines for me
- I'll check the weather as my main trigger seems to be barometric pressure changes
- When I'm in danger of migraines I avoid orgasm because it's a trigger for me
- Of course certain sound trigger them
- Sound sensitivity *triggers* migraines for me
- Scent triggers are the primary challenge for me
- The Covid vaccines we had did trigger migraines
- Barometric pressure is my other trigger
- covid can trigger nasty migraines
- One of my triggers is disturbed sleep so sure
- fasting is my #1 trigger
- I found out that my migraine triggers were foods
- Stress is my #1 trigger too
- Bright light is my #1 trigger
- I’m triggered by weather so it’s usually a storm or weather front that causes mine.
- My main trigger is weather
- Weather triggers were the easiest to be solved by a good prophylactic
- Some of my triggers are unavoidable (like lack of sleep)
- I realize I can’t smoke sativa or it can trigger a migraine and anxiety
- OMG aspartame is a migraine trigger for me too
- My worst triggers are smells/scent/odors
- Common triggers for me though are strong scents
- Even chocolate can trigger it with a high amount of caffeine.
- Stress triggers tension headaches for me
- Stress is my biggest trigger by far
- swimming is a trigger for me
- Stress and neck problems are my top two triggers
- Stress is my major trigger
- Triggers for me: Sunlight, headlights, smells (perfume and cologne and candles), stress, lack of sleep, and dark chocolate.
- I just recently discovered lemonade as a trigger
- my other triggers are weather, certain foods
- Stress, weather and sugar are my triggers
- Blood sugar swings are a primary trigger for me.
- Cranial pressure is a trigger for migraines
- My migraines are triggered by high air pressure
- bright white lights are a big trigger
- I have migraines very evidently triggered by sugar
- Too many carbs or sugar in a meal can trigger one for me as well
- The vaccine triggered migraines
- migraines (no headache) triggered by sunlight and glare
- My other triggers are barometric pressure and smells

There seems to be a specific pattern here.  Most of the description of triggers are contained in one sentence and they form pattern of word trigger and some form of description or list of symptoms. This might be tough to capture with just regex, therefore we turn to  Spacy library to help with identifying what words are referred to by word trigger.

In [50]:
# To this once
# ! python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')

To figure out how we can capture triggers we used Spacy's dependency graph to better understand the relationship between word trigger and trigger description(s).

In [51]:
# From each post take text and split it into senteces
# and only keep sentences referring to triggers
def get_sentences_with_triggers():
    trigger_sentences = []
    pattern = re.compile('(trigger|triggers)', re.IGNORECASE)
    for post in posts_with_triggers:
        text = post[1]
        doc = nlp(text)
        sentences = [str(sent) for sent in doc.sents if pattern.search(str(sent))]
        trigger_sentences.extend(sentences)
    return trigger_sentences

trigger_sentences = get_sentences_with_triggers()
len(trigger_sentences)

37734

In [52]:
# Take couple sample sentences
trigger_sentences[:2]

['Hello fellow cool kids, \n\nI’m wondering if any of you have screen based triggers, and notice any certain side effects.',
 "\n\n&amp;#x200B;\n\nMy triggers include stress, screens, bright lights, fast motion, tv's, and rain, lightning and things like that."]

In [53]:
from spacy import displacy


# graph word dependencies
doc_vis = nlp(trigger_sentences[0])
displacy.render(doc_vis, style="dep")

In [54]:
doc_vis = nlp(trigger_sentences[1])
displacy.render(doc_vis, style="dep")

From couple examples shown above we can see that we can identify word `trigger(s)` as NOUN part of speech and then from that point we can look at the dependencies.  In all cases triggers will be NOUN but they will vary in dependency type so that they can be dobj or conj or npadvmod.

Based on this we can write a function that will look for work `trigger(s)` that is NOUN and can be `nsubj`, `dobj`, or `pobj`.  Once we capture part of speech and dependency we can start looking for the triggers based on the rules above.

Here is the resulting function:

In [55]:
# Find all of the triggers and store them in a list
def find_unfiltered_triggers(nlp, text):
    triggers = []
    doc = nlp(text)
    dep_type = None
    for token in doc:
        if (token.dep_ == 'nsubj' or token.dep_ == 'dobj' or token.dep_ == 'pobj') \
            and re.search('(trigger|triggers)', token.text, re.IGNORECASE):
            dep_type = token.dep_
        if token.pos_ == 'NOUN' and token.dep_ != dep_type and (token.dep_ == 'punc' or token.dep_ == 'dobj' or token.dep_ == 'conj'):
            triggers.append(token.text)
    return triggers

We called above function `find_unfiltered_triggers` because we know they we will find some nouns that are not actual triggers.  This happens because we are unable to build tight enough rules to account for people writing in different ways or even not adhering to grammar.

However, we decided to run this and see how many total triggers we can identify and maybe if there aren't too many we could just prune those.

In [57]:

unfiltered_triggers = []
for trigger_sentence in trigger_sentences:
    unfiltered_triggers.extend(find_unfiltered_triggers(nlp, trigger_sentence))

len(unfiltered_triggers)

41732

This is a lot of triggers but there are duplicates in there.  So for the next step we count each trigger type and sort to see which are the most frequent.

In [58]:
from collections import Counter

counted_triggers = Counter(unfiltered_triggers)
most_frequent = sorted(counted_triggers.items(), key=lambda k: k[1], reverse=True)
most_frequent

[('migraines', 3826),
 ('migraine', 3093),
 ('triggers', 653),
 ('pain', 510),
 ('stress', 505),
 ('foods', 466),
 ('things', 464),
 ('headaches', 460),
 ('lot', 441),
 ('attack', 407),
 ('headache', 389),
 ('changes', 388),
 ('symptoms', 381),
 ('lights', 318),
 ('alcohol', 314),
 ('sleep', 270),
 ('food', 267),
 ('caffeine', 258),
 ('attacks', 257),
 ('diet', 248),
 ('injections', 244),
 ('weather', 232),
 ('pressure', 229),
 ('diary', 220),
 ('time', 217),
 ('chocolate', 210),
 ('anxiety', 208),
 ('patterns', 201),
 ('sugar', 199),
 ('dehydration', 198),
 ('light', 188),
 ('trigger', 188),
 ('issues', 172),
 ('neck', 164),
 ('hormones', 158),
 ('exercise', 157),
 ('idea', 149),
 ('meds', 149),
 ('wine', 140),
 ('water', 137),
 ('lack', 133),
 ('information', 129),
 ('days', 125),
 ('coffee', 124),
 ('head', 121),
 ('meals', 120),
 ('cheese', 113),
 ('journal', 112),
 ('list', 112),
 ('points', 108),
 ('tension', 108),
 ('one', 103),
 ('heat', 102),
 ('medication', 101),
 ('people', 

In [59]:
len(counted_triggers)

4367

Looking at this list we can clearly see that anything with frequency below 90 is just noise.  And even in the most frequent list there are some words that are clearly not triggers.  So let's keep only the triggers with frequency above 90.

In [61]:
most_frequent_and_useful = [entry for entry in most_frequent if entry[1] > 90]
len(most_frequent_and_useful)

65

In [62]:
most_frequent_and_useful

[('migraines', 3826),
 ('migraine', 3093),
 ('triggers', 653),
 ('pain', 510),
 ('stress', 505),
 ('foods', 466),
 ('things', 464),
 ('headaches', 460),
 ('lot', 441),
 ('attack', 407),
 ('headache', 389),
 ('changes', 388),
 ('symptoms', 381),
 ('lights', 318),
 ('alcohol', 314),
 ('sleep', 270),
 ('food', 267),
 ('caffeine', 258),
 ('attacks', 257),
 ('diet', 248),
 ('injections', 244),
 ('weather', 232),
 ('pressure', 229),
 ('diary', 220),
 ('time', 217),
 ('chocolate', 210),
 ('anxiety', 208),
 ('patterns', 201),
 ('sugar', 199),
 ('dehydration', 198),
 ('light', 188),
 ('trigger', 188),
 ('issues', 172),
 ('neck', 164),
 ('hormones', 158),
 ('exercise', 157),
 ('idea', 149),
 ('meds', 149),
 ('wine', 140),
 ('water', 137),
 ('lack', 133),
 ('information', 129),
 ('days', 125),
 ('coffee', 124),
 ('head', 121),
 ('meals', 120),
 ('cheese', 113),
 ('journal', 112),
 ('list', 112),
 ('points', 108),
 ('tension', 108),
 ('one', 103),
 ('heat', 102),
 ('medication', 101),
 ('people', 

This number is reasonable to manually prune and remove incorrect entries.

In [84]:
bad_entries = set([
    'migraines',
    'migraine',
    'triggers',
    'things',
    'headaches',
    'lot',
    'attack',
    'headache',
    'changes',
    'symptoms',
    'food',
    'attacks',
    'diet',
    'injections',
    'diary',
    'time',
    'patterns',
    'trigger',
    'issues',
    'neck',
    'idea',
    'lack',
    'information',
    'days',
    'head',
    'meals',
    'journal',
    'list',
    'points',
    'one',
    'people',
    'sense',
    'relief',
    'others',
    'thing',
    'lots',
    'ones',
    'severity',
    'life',
    'track'
])

In [85]:
migraine_triggers = set([entry[0] for entry in most_frequent_and_useful if entry[0] not in bad_entries])
migraine_triggers

{'alcohol',
 'anxiety',
 'caffeine',
 'cheese',
 'chocolate',
 'coffee',
 'dehydration',
 'exercise',
 'foods',
 'heat',
 'hormones',
 'light',
 'lights',
 'medication',
 'meds',
 'nausea',
 'pain',
 'pressure',
 'sleep',
 'stress',
 'sugar',
 'tension',
 'water',
 'weather',
 'wine'}

Now we can create index of triggers per author!

In [86]:
def find_triggers_in_posts(nlp):
    def process(idx):
        trigger_idx = copy.deepcopy(idx)
        pattern = re.compile('(trigger|triggers)', re.IGNORECASE)

        def normalize_triggers(word):
            if word == 'pressure':
                return 'barometric pressure'
            if word == 'water':
                return 'dehydration'
            if word == 'meds':
                return 'medication'
            return word

        def find_triggers_in_text(text):
            triggers = []
            doc = nlp(text)
            dep_type = None
            for token in doc:
                if (token.dep_ == 'nsubj' or token.dep_ == 'dobj' or token.dep_ == 'pobj') \
                    and pattern.search(token.text):
                    dep_type = token.dep_
                if token.pos_ == 'NOUN' and token.dep_ != dep_type and (token.dep_ == 'punc' or token.dep_ == 'dobj' or token.dep_ == 'conj'):
                    if token.text in migraine_triggers:
                        triggers.append(
                            normalize_triggers(token.text)
                        )
            return triggers

        def process_entry(author, text):
            triggers = []
            if pattern.search(text) is None:
                trigger_idx[author]['triggers'] = triggers
                return
            for sentence in sentences_with_triggers(text):
                triggers.extend(find_triggers_in_text(sentence))
            trigger_idx[author]['triggers'] = triggers

        def sentences_with_triggers(text):
            doc = nlp(text)
            sentences = [str(sent) for sent in doc.sents if pattern.search(str(sent))]
            return sentences

        for author, text in posts_and_commnets:
            process_entry(author, text)
        return trigger_idx
    return process

In [87]:
identify_authors_triggers_in_posts = find_triggers_in_posts(nlp)

In [88]:
author_index = identify_authors_triggers_in_posts(author_index)
len(author_index)

46918

In [97]:
total_with_triggers = 0

for _, v in author_index.items():
    if len(v['triggers']):
        total_with_triggers += 1

print(f'Total authors with identifiable triggers {total_with_triggers}')

Total authors with identifiable triggers 627


# Feature - Author's Who Experience Aura

Start by finding all articles that refer to aura and then get random sample to find sentences and determine patterns.

In [99]:
auro_pattern = re.compile('(aura)', re.IGNORECASE)

aura_posts = []
for entry in posts_and_commnets:
    text = entry[1]
    if auro_pattern.search(text):
        aura_posts.append(text)

len(aura_posts)

15778

In [101]:
import random


# We used larger samples of 100 but changed to 10 so the notebook
# doesn't get to large
aura_samples = random.sample(aura_posts, 10)

for sample in aura_samples:
    print(sample)
    print('---------------------------------')

Mine started a little before then, I was having extremely painful migraines with the aura, headache; the throwing up. It always came before my period.
---------------------------------
I want to start out by saying that I’m very familiar with migraines, ocular ones specially. Once every few months I get the visual aura, limbs and face falling asleep, and confusion that comes with these types of migraines, and then usually get the headache that follows, but not always.

The last three days I’ve had a headache in the back of my head that doesn’t go away unless I take Advil, and a constant nausea that’s more in my head/throat than in my stomach. I don’t think I have a stomach bug or ate anything bad because I still get hungry and can eat just fine, there’s just a constant weird nauseous feeling in the bottom of my throat.

Anyway, can this nausea be a migraine? I’m just so used to the visual type of migraines that this seems weird to me. Does anyone else experience this? Any remedies you 

With the above input we get following sample sentences that we can base regex patterns from:

- my aura started
- to the first aura
- migraines with aura
- I start seeing auras
- I’ve been experiencing migraine with aura
- during my auras
- I do have aura
- My auras vision is permanent
- I get visual auras
- This morning during the aura
- I notice the aura
- the aura tells me
- I used to get auras
- My early stage aura
- When I get auras
- I had an aura
- I get a visual aura

So there are couple patterns here but the most common one is `I <some words> aura(s)`.  However, with this pattern we need to watch out for
negations, specifically, following patterns:

- I don't have aura
- I do not have aura
- I am without aura
- I am w/o aura


In [102]:
positive_aura_sentences = [
    "my aura started",
    "to the first aura",
    "migraines with aura",
    "I start seeing auras",
    "I’ve been experiencing migraine with aura",
    "during my auras",
    "I do have aura",
    "My auras vision is permanent",
    "I get visual auras",
    "This morning during the aura",
    "I notice the aura",
    "the aura tells me",
    "I used to get auras",
    "My early stage aura",
    "When I get auras",
    "I had an aura",
    "I get a visual aura"
]

negative_aura_sentences = [
    "I don't have aura",
    "I do not have aura",
    "I am without aura",
    "I am w/o aura"
]

In [103]:
positive_aura_matchers = [
    re.compile('(i|my).*(aura|auras)', re.IGNORECASE),
    re.compile('(with|first).*(aura|auras)', re.IGNORECASE),
    re.compile('(aura|auras).*(me)', re.IGNORECASE)
]

negative_aura_matchers = [
    re.compile('(i|my).*(\snot\s|without).*(aura|auras)', re.IGNORECASE),
    re.compile("(i|my).*(don't|w/o).*(aura|auras)", re.IGNORECASE)
]

In [104]:
def search_for_auras(text):
    return any([matcher.search(text) for matcher in positive_aura_matchers]) \
        and not any([matcher.search(text) for matcher in negative_aura_matchers])

In [105]:
# Unit tests
class TestAuraDiscovery(unittest.TestCase):
    def test_aura_matchers(self):
        self.assertTrue(all([search_for_auras(text) for text in positive_aura_sentences]))
        self.assertFalse(all([search_for_auras(text) for text in negative_aura_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_age_matchers (__main__.TestAgeDiscovery) ... ok
test_aura_matchers (__main__.TestAuraDiscovery) ... ok
test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok
test_suicidal_matchers (__main__.TestSuicidalThoughtsDiscovery) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.010s

OK


Now we can create author to aura index.

In [106]:
def identify_authors_aura_in_posts(idx):
    aura_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        aura_idx[author]['aura'] = str(bool(search_for_auras(text))).lower()

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return aura_idx

In [110]:
author_index = identify_authors_aura_in_posts(author_index)
len(author_index)

46918

In [111]:
total_authors_with_aura = 0

for _, v in author_index.items():
    if v['aura'] == 'true':
        total_authors_with_aura += 1

print(f'Total authors with aura {total_authors_with_aura}')

Total authors with aura 2370


# Feature - Authors with ADHD

This has similar approach as Aura detection.  Find texts with ADHD in it and sample to gather sentences describing ADHD and build regex based on them.

In [112]:
adhd_pattern = re.compile('(ADHD)', re.IGNORECASE)

adhd_posts = []
for entry in posts_and_commnets:
    text = entry[1]
    if adhd_pattern.search(text):
        adhd_posts.append(text)

len(adhd_posts)

720

In [114]:
adhd_samples = random.sample(adhd_posts, 5)

for sample in adhd_samples:
    print(sample)
    print('---------------------------------')

Yes! ADHD medications are stimulants, and stimulants can make migraines worse. So anything like caffeine, nicotine, or even prescription drugs are not very good for migraines.

I was just diagnosed with ADHD in April, so I'm not super missing out on the meds for it, I've had to learn to cope without my whole life (I'm 25). I have a little caffeine every day, and have for years, because I have issues focusing without. This is basically me self medicating my ADHD without realizing it. Keeping how much caffeine I have every day as low as possible did help my migraines, but cutting it out completely didn't help further, so I've managed to find a good balance that my doctor's approve of.

Just have a conversation with your doctor! They might suggest lowering your dose a bit to see if that helps with migraines. But if your migraines got worse after birth control, they might be triggered by hormones (mine are). Every person is different, and unfortunately it takes some trial and error to figu

Here is the list of the picked sample sentences.

In [166]:
positive_adhd_sentences = [
    "I struggle with ADHD",
    "I have ADHD",
    "I was diagnosed with ADHD",
    "ADHD here",
    "treating ADHD",
    "I was already struggling with ADHD",
    "I’ve got ADHD",
    "I have adhd",
    "prescription for ADHD medication",
    "because my ADHD meds",
    "I have adhd and asd",
    "I take ADHD meds",
    "you can take charge of the ADHd",
    "my ADHD makes",
    "diagnosed with ADHD",
    "I have ADHD"
]

negative_adhd_sentences = [
    "I have asd",
    "I ride a bike",
    "I don't have ADHD",
    "I've got no ADHD",
    "I do not have ADHD",
]

In [175]:
positive_adhd_matchers = [
    re.compile('(i|my).*(adhd)', re.IGNORECASE),
    re.compile('(take|treat|treating|diagnosed|prescription).*(adhd)', re.IGNORECASE),
    re.compile('(adhd).*(here)', re.IGNORECASE)
]

negative_adhd_matchers = [
    re.compile('(i|my).*(\snot\s|without).*(aura|auras)', re.IGNORECASE),
    re.compile("(i|my).*(don't|\sno\s).*(aura|auras)", re.IGNORECASE)
]

In [176]:
def search_for_adhd(text):
    return any([matcher.search(text) for matcher in positive_adhd_matchers]) \
        and not any([matcher.search(text) for matcher in negative_adhd_matchers])

In [179]:
# Unit tests
class TestAdhdDiscovery(unittest.TestCase):
    def test_adhd_matchers(self):
        self.assertTrue(all([search_for_adhd(text) for text in positive_adhd_sentences]))
        self.assertFalse(all([search_for_adhd(text) for text in negative_adhd_sentences]))

res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_adhd_matchers (__main__.TestAdhdDiscovery) ... ok
test_age_matchers (__main__.TestAgeDiscovery) ... ok
test_aura_matchers (__main__.TestAuraDiscovery) ... ok
test_female_matchers (__main__.TestGenderDiscovery) ... ok
test_male_matchers (__main__.TestGenderDiscovery) ... ok
test_dosage_matchers (__main__.TestMedicineDosageDiscovery) ... ok
test_suicidal_matchers (__main__.TestSuicidalThoughtsDiscovery) ... ok

----------------------------------------------------------------------
Ran 7 tests in 0.012s

OK


In [181]:
def identify_authors_adhd_in_posts(idx):
    adhd_idx = copy.deepcopy(idx)
    def process_entry(author, text):
        adhd_idx[author]['adhd'] = str(bool(search_for_adhd(text))).lower()

    for author, text in posts_and_commnets:
        process_entry(author, text)
    return adhd_idx

In [182]:
author_index = identify_authors_adhd_in_posts(author_index)
len(author_index)

46918

In [183]:
total_authors_with_adhd = 0

for _, v in author_index.items():
    if v['adhd'] == 'true':
        total_authors_with_adhd += 1

print(f'Total authors with ADHD {total_authors_with_adhd}')

Total authors with ADHD 92


# Final Features Counts

Now we want to check how many authors we have that have at least one feature.  We really want to get rid of any author that we could not identify any features for as there is no value to that entry.

In addition, it's interesting to know how many authors have all of the features.

First let's define some helper functions...

In [186]:
# Features like suicidal, ADHD, aura are set for all of the authors
# so we ignore those 
def has_features(entry):
    if entry['age'] == 0 and \
       len(entry['triggers']) == 0 and \
       entry['medicine'] == 'unknown' and \
       entry['gender'] == 'unknown':
        return False
    return True

def has_all_features(entry):
    if entry['age'] != 0 and \
       len(entry['triggers']) != 0 and \
       entry['medicine'] != 'unknown' and \
       entry['gender'] != 'unknown':
        return True
    return False

In [187]:
# Count at least one and all features
total_at_least_one = 0
total_all = 0

for author, entry in author_index.items():
    if has_features(entry):
        total_at_least_one += 1
    if has_all_features(entry):
        total_all += 1

print(f'Authors with at least one feature: {total_at_least_one}')
print(f'Authors with all features: {total_all}')

Authors with at least one feature: 4298
Authors with all features: 3


# Build the  Dataset

Our dataset will take all the author entries that have at least one feature and create a CSV file to store it.

However, to ensure author's privacy we need to replace author's user IDs with some unique identifier. We decided to simply replace author's Reddis and Migraine.com userid with UUID.  This allows us to still have unique id for each author without giving away their Reddis or Migraine.com userid.


In [189]:
import uuid

for _, entry in author_index.items():
    entry['id'] = uuid.uuid4()


In [191]:
data_list = [entry for _, entry in author_index.items() if has_features(entry)]
output_df = pd.DataFrame(data=data_list)
len(output_df)

4298

In [198]:
output_df[195:200]

Unnamed: 0,gender,medicine,dosage,qty,suicidal,age,triggers,aura,adhd,id
195,unknown,unknown,unknown,unknown,no,0,[pain],False,False,f943925d-e66b-4ecf-8cd0-0df7dd0a4ec6
196,unknown,unknown,unknown,unknown,no,0,[stress],False,False,343502cc-db1f-4aa0-b761-822297f47c1d
197,unknown,unknown,unknown,unknown,no,0,"[caffeine, chocolate, lights, lights, alcohol,...",False,False,e577e633-d3e5-4bd2-b5c6-cb1488062a5f
198,unknown,triptan,40mg,1x,no,0,[],False,False,f0587974-3841-4e9b-853a-b34fcd03eabf
199,unknown,unknown,unknown,unknown,no,0,[sleep],False,False,d6b642cf-8dfd-481d-b568-5aee7f8c056e


In [197]:
output_dataset_filename = 'migraine_all.csv'
output_df.to_csv(f'data/{output_dataset_filename}')