# Information Extraction from patient.info

Patient.info is another source where people share their experiences with migraines, the challenges they face, and how they cope. It is less popular than Reddit and migraine.com, but it was scraped because it still has an active community.

The migraine forum is located at:
https://patient.info/forums/discuss/browse/migraine-1415?page=0#group-discussions

This site doesn't provide an API, so its forums need to be scraped. As with migraine.com, whose `Terms of Use` require permission to scrape its articles, the patient.info terms should be checked before scraping. The patient.info robots.txt file, however, doesn't disallow the discussion pages under patient.info/forums/discuss:

[robots.txt](https://patient.info/robots.txt)

    Sitemap: https://patient.info/sitemap.xml
    User-agent: *
    Disallow: /aspnet_client/
    Disallow: /pdf/
    Disallow: /forums/profiles/
    Disallow: /printer.asp
    Disallow: /print/
    Disallow: /sponsored
    Disallow: *order=*
    Disallow: /feedback?ref=*
    Disallow: /forums/user?returnurl*
    Disallow: /forums/new-discussion/
    Disallow: /in/
    Disallow: /wellbeing/
    Disallow: /forums/me/
    Disallow: */related
    Disallow: /forums/index-*
    Disallow: *returnurl=*
    Disallow: /search.asp?*
    Disallow: *?discussionOrder*
    Disallow: *onlyWithImages*
    Disallow: /advertisement

    User-agent: Mediapartners-Google
    Disallow:
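
The check against these rules can also be done programmatically. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` module and a hypothetical user agent name (`example-scraper`); it was not part of the original extraction. Only a few plain path rules from the listing above are included, because the standard-library parser does simple prefix matching and does not interpret wildcard rules such as `*order=*` as patterns.

```python
from urllib.robotparser import RobotFileParser

# A few of the plain path rules quoted above (wildcard rules are left out,
# since urllib.robotparser only does prefix matching).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /forums/profiles/",
    "Disallow: /forums/new-discussion/",
    "Disallow: /forums/me/",
    "Disallow: /print/",
]


def is_allowed(url, user_agent="example-scraper"):
    """Return True if the rules above permit `user_agent` to fetch `url`.

    `example-scraper` is a hypothetical user agent name used for illustration.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


forum_url = "https://patient.info/forums/discuss/browse/migraine-1415?page=0"
profile_url = "https://patient.info/forums/profiles/some-user"  # made-up profile path

print(is_allowed(forum_url))
print(is_allowed(profile_url))
```

In a live run, `RobotFileParser.set_url(...)` followed by `read()` would fetch the current file instead of using the hard-coded lines.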

The forum contains 23 pages of topic listings (as of 11/24/2021). Selenium was used to open each listing page and collect the links to all forum topics, and then to visit each topic and extract the parent comment and each reply.

A topic's responses may be spread over multiple webpages. The script looks for a 'Next' button (a link labelled 'Next page') and, if one is found, follows it to the next set of responses, repeating until the 'Next' button is no longer present.
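
The pagination loop can be illustrated in isolation. This is a minimal sketch, not the scraper's own code: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without dependencies, and the `fetch` callable, the `example.org` URLs, and the fake pages are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a aria-label="Next page"> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == 'a' and self.href is None
                and attributes.get('aria-label') == 'Next page'):
            self.href = attributes.get('href')


def find_next_url(page_html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    return urljoin(current_url, finder.href) if finder.href else None


def iter_pages(first_url, fetch):
    """Yield (url, html) for each page, following 'Next page' links."""
    url, seen = first_url, set()
    while url and url not in seen:  # `seen` guards against link cycles
        seen.add(url)
        page_html = fetch(url)
        yield url, page_html
        url = find_next_url(page_html, url)


# Hypothetical two-page topic used in place of a live browser session.
FAKE_SITE = {
    'https://example.org/topic': '<a aria-label="Next page" href="/topic?page=1">Next</a>',
    'https://example.org/topic?page=1': '<p>last page of replies</p>',
}
visited = [url for url, _ in iter_pages('https://example.org/topic', FAKE_SITE.__getitem__)]
print(visited)
```

With Selenium, `fetch` would wrap `driver.get(url)` followed by `driver.page_source`.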

The script extracts the following fields:

    Type: Parent ('P') or child ('C') comment

    Parent: For a parent comment, a 'P' prefix followed by a sequential integer (e.g. 'P1')
            For a child comment, the parent identifier followed by a '.' and a second sequential integer (e.g. 'P1.2'), so the parent comment can be found

    Author: patient.info user name of the comment author

    Text: Text of the comment

    Title: Title of the topic the comment belongs to

    Webpage: Location of the topic page on 'https://patient.info' (recorded for parent comments)
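
To make the identifier scheme concrete, here is a small sketch that writes one parent row and one child row and then recovers the parent from the child identifier. The helper names (`parent_id`, `child_id`, `parent_of`) and the row contents are invented for illustration; only the column names and the 'P1' / 'P1.2' style of identifier come from the description above.

```python
import csv
import io

FIELD_NAMES = ['Type', 'Parent', 'Author', 'Text', 'Title', 'Webpage']


def parent_id(n):
    """Identifier of the n-th parent comment, e.g. 'P1'."""
    return f'P{n}'


def child_id(pid, n):
    """Identifier of the n-th reply to parent `pid`, e.g. 'P1.2'."""
    return f'{pid}.{n}'


def parent_of(identifier):
    """Strip the reply suffix, so 'P1.2' (and 'P1' itself) map to 'P1'."""
    return identifier.split('.', 1)[0]


# Write two made-up rows to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES)
writer.writeheader()
pid = parent_id(1)
writer.writerow({'Type': 'P', 'Parent': pid, 'Author': 'user_a',
                 'Text': 'Example question', 'Title': 'Example topic',
                 'Webpage': '/forums/discuss/example-topic-000000'})
writer.writerow({'Type': 'C', 'Parent': child_id(pid, 1), 'Author': 'user_b',
                 'Text': 'Example reply', 'Title': 'Example topic'})

# Read the rows back and link each reply to its parent comment.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
for row in rows:
    print(row['Type'], row['Parent'], '-> parent', parent_of(row['Parent']))
```

Fields missing from a row, such as `Webpage` on the child row here, are written as empty strings by `csv.DictWriter`.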
    
  
    

In [1]:
#Code requires the following python modules:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import csv
from os.path import exists
import html
import unicodedata

In [4]:
# Code also requires a webdriver; chromedriver is used here.
# Ensure chromedriver is on PATH, or set DRIVER_BIN to its full path.
import os

DRIVER_BIN = 'chromedriver'
pages = []
ID = 1
migraine_file_name = 'patient.info.csv'
existingTitles = set()

# Check whether data already exists; if so, previously accessed webpages are skipped.
# Note: This does not check whether a parent topic had additional replies added since the last run.
field_names = ['Type', 'Parent', 'Author', 'Text', 'Title', 'Webpage']
os.makedirs('data', exist_ok=True)  # open() fails if the data directory is missing
if not exists(f'data/{migraine_file_name}'):
    with open(f'data/{migraine_file_name}', 'w', newline='') as posts_file:
        csv.DictWriter(posts_file, fieldnames=field_names).writeheader()
else:
    with open(f'data/{migraine_file_name}', 'r', newline='') as posts_file:
        for row in csv.DictReader(posts_file):
            if row['Webpage']:
                existingTitles.add(row['Webpage'])
            # Continue numbering after the highest parent ID already stored
            if row['Type'] == 'P':
                ID = max(ID, int(row['Parent'][1:]) + 1)

In [7]:
# Find pages of posts
# Set the range to the number of listing pages found on the main forum page.
# It is set to 1 in this file to reduce runtime, but as of 11/24/2021 it should be set to 23 to get all posts.
# Note: executable_path is the Selenium 3 style argument; Selenium 4 expects
# webdriver.Chrome(service=Service(DRIVER_BIN)), with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(executable_path=DRIVER_BIN)
root = 'https://patient.info'
for i in range(1):
    driver.get(f'{root}/forums/discuss/browse/migraine-1415?page={i}#group-discussions')
    soup = BeautifulSoup(driver.page_source, features="html.parser")
    for a in soup.find_all('h3', attrs={'class': "post__title"}):
        for b1 in a.find_all('a', href=True):
            # Skip topics already stored in the CSV or already queued
            if b1['href'] not in existingTitles and b1['href'] not in pages:
                pages.append(b1['href'])

In [20]:
# Go through each topic and find each parent and all child comments
from urllib.parse import urljoin

def clean(value):
    # Normalize unicode and unescape HTML entities in a comment body
    return html.unescape(unicodedata.normalize('NFKD', value))

posts_file = open(f'data/{migraine_file_name}', 'a', newline='')
csv_writer = csv.DictWriter(posts_file, fieldnames=field_names)
try:
    for pagecount, page in enumerate(pages, start=1):
        print(page)
        print('Page {} of {}'.format(pagecount, len(pages)))
        PID = 'P' + str(ID)
        parent = {'Type': 'P', 'Parent': PID, 'Author': 'none', 'Text': 'none',
                  'Title': 'none', 'Webpage': page}
        driver.get(root + page)
        soup = BeautifulSoup(driver.page_source, features="html.parser")
        for a in soup.find_all('div', attrs={'id': "topic"}):
            for b in a.find_all('h5', attrs={'class': 'author__info'}):
                parent['Author'] = b.text if b.text else 'none'
            for c in a.find_all('input', value=True):
                parent['Text'] = clean(c['value'])
            for d in a.find_all('h1', attrs={'class': 'u-h1 post__title'}):
                parent['Title'] = d.text
        csv_writer.writerow(parent)
        childID = 1
        # Write the replies on the current page, then follow 'Next page' links until none remain
        while True:
            for ca in soup.find_all('div', attrs={'id': 'topic-replies'}):
                for cb in ca.find_all('article', attrs={'class': 'post'}):
                    child = {'Type': 'C', 'Parent': PID + '.' + str(childID), 'Title': parent['Title']}
                    childID += 1
                    # find() returns None when a reply has no author link
                    author = cb.find('a', attrs={'itemprop': 'name'})
                    name = author.get_text(strip=True) if author else ''
                    child['Author'] = name if name else 'none'
                    body = cb.find('input', value=True)
                    child['Text'] = clean(body['value']) if body else 'none'
                    csv_writer.writerow(child)
            next_link = soup.find('a', attrs={'aria-label': 'Next page'})
            if not next_link or not next_link.get('href'):
                break
            # urljoin handles both absolute and site-relative links
            driver.get(urljoin(root, next_link['href']))
            soup = BeautifulSoup(driver.page_source, features="html.parser")
        posts_file.flush()  # Save progress after each topic
        ID += 1
finally:
    #input('Press ENTER to close the automated browser')
    driver.quit()
    posts_file.close()

/forums/discuss/vestibular-migraine-for-weeks-does-anyone-find-that-overthecounter-painkillers-make-a-difference--774632
Page 1 of 35
/forums/discuss/daily-migraine-since-02-26-2021-help--773803
Page 2 of 35
/forums/discuss/constant-ear-fullness-popping-ringing-754218
Page 3 of 35
/forums/discuss/medication-for-ocular-migraine--772360
Page 4 of 35
/forums/discuss/tapering-off-of-nortriptyline-643137
Page 5 of 35
/forums/discuss/i-am-currently-going-through-hell-696625
Page 6 of 35
/forums/discuss/24-7-headache-all-day-everyday-for-18-months-532717
Page 7 of 35
/forums/discuss/nortryptalin-tapering-768145
Page 8 of 35
/forums/discuss/migraine-medication-help-at-me-wits-end--767795
Page 9 of 35
/forums/discuss/all-day-every-day-pain-765738
Page 10 of 35
/forums/discuss/migraine-diagnosis--765586
Page 11 of 35
/forums/discuss/has-anyone-had-neurological-problems-after-botox-treatment-for-migraine--236467
Page 12 of 35
/forums/discuss/daily-tension-headache-fluctuates-all-day-double-vision