## a ~~proper `numpy`~~ `pandas` dataset

Organize the full dataset in a usable format and save it to disk so that it can be shared on GitHub.

In [28]:
!pip install rdkit-pypi
!pip install transformers
!pip install deepchem



In [29]:
# for files
import os
import pickle
from google.colab import drive, files

# to get source
import requests

# rabbit hunting with a howitzer
from bs4 import BeautifulSoup

# for the structured arrays built further down
import numpy as np

# pandas for saving
import pandas as pd

# for parsing
import re
from functools import reduce

# for Mol and SMILES objects
import rdkit
from rdkit import Chem

# for deepchem data
import deepchem as dc

# for sentiment
from transformers import pipeline

Carrying over a helpful function and creating another one...

In [30]:
# to get smiles from requests response text content
def textToSMILES(text):
  # text is a string recording the .mol file
  # mode 'w' overwrites any earlier myMol.mol, so no os.remove or try/except is needed
  with open('myMol.mol', 'w') as molF:
    molF.write(text)
  return Chem.MolToSmiles(Chem.MolFromMolFile('myMol.mol'))

# to get rdkit mol object from mol string
def textToMol(text):
  # text is a string recording the .mol file
  with open('myMol.mol', 'w') as molF:
    molF.write(text)
  return Chem.MolFromMolFile('myMol.mol')

Let's load the preliminary datasets that I saved to my Drive.

In [31]:
drive.mount('/content/drive')

Mounted at /content/drive


In [32]:
# loading tihkal dataset from Drive
with open('/content/drive/MyDrive/Colab Notebooks/rawMolTihkalDrive', 'rb') as tihk:
  tihkalMolDr = pickle.load(tihk)

# loading pihkal dataset from Drive
with open('/content/drive/MyDrive/Colab Notebooks/rawMolPihkalDrive', 'rb') as pihk:
  pihkalMolDr = pickle.load(pihk)  

We can start building our structured arrays. I will actually go through a couple of intermediate lists first, but here is the final list of dtypes we are aiming for:

In [33]:
# smiles: string
# dosing: string describing how substance was administered
# text: string, raw content of comment on response to dosing
# score: float, huggingface's sentiment score (confidence signed by sentiment)
shulginTypes = [('smiles', object), ('dosing', object), ('text', object), ('score', float)]
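
As a quick illustration of how this dtype list behaves, here is a minimal sketch with two made-up records (the SMILES strings, dosings, comments, and scores are placeholders, not entries from the dataset). NumPy fills a structured array from a list of tuples, one tuple per record, and each field can then be pulled out by name.

```python
import numpy as np

# same field layout as shulginTypes above
shulginTypes = [('smiles', object), ('dosing', object), ('text', object), ('score', float)]

# made-up placeholder records; each record must be a tuple, not a list
toyRecords = [
    ('CCO', '(with 10 mg)', ' A pleasant afternoon. ', 0.5),
    ('CCN', '(with 20 mg)', ' Not much to report. ', -0.25),
]

toyArray = np.array(toyRecords, dtype=shulginTypes)

print(toyArray.dtype.names)   # field names, in order
print(toyArray['score'])      # one column, as a float array
print(toyArray[0]['smiles'])  # one field of one record
```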

That reminds me: let's get our sentiment scorer. We'll wrap it in a thing that gives it a sign.

In [34]:
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


The function that will produce our sentiment score. The model only accepts 512 tokens, so we'll truncate the comments to 512 characters later.

In [35]:
def sentimentScore(x):
  # run the pipeline once, then sign its confidence by the predicted label
  result = sentiment(x)[0]
  return -result['score'] if result['label'] == 'NEGATIVE' else result['score']

Wow, looks pretty dumb! That is OK, we are just getting started. It will be interesting to see whether its scores contain any information (e.g. will everything just be a 98-percent-confident positive score?). Let's find out.
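
The sign convention can be tried out without downloading a model. The sketch below assumes only the output shape used above (a list holding one dict with `label` and `score` keys); `fakeSentiment` and `signedScore` are hypothetical names, and the fixed confidences are invented for illustration.

```python
def fakeSentiment(text):
    # hypothetical stand-in for the Hugging Face pipeline: it mimics the
    # output shape, a list holding one dict with 'label' and 'score'
    if 'awful' in text:
        return [{'label': 'NEGATIVE', 'score': 0.75}]
    return [{'label': 'POSITIVE', 'score': 0.5}]

def signedScore(text, classifier=fakeSentiment):
    # call the classifier once, then sign the confidence by the label
    result = classifier(text)[0]
    return -result['score'] if result['label'] == 'NEGATIVE' else result['score']

print(signedScore('an awful headache'))   # negative label, so the sign flips
print(signedScore('a pleasant evening'))  # positive label, sign unchanged
```

Passing the real pipeline as `classifier` should behave the same way, provided it returns that shape.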



## Constructing the Shulgin Object

At first I thought I would do this manually, and that would be faster, but I want this dataset to be *auditable* in the sense that a person can look at Erowid's *PiHKAL* and *TiHKAL* pages next to this code and say "Yes, this dataset is as claimed." The process should be reproducible, so that in fact another person can start from the scraping of the HTML and wind up with the same dataset.

It turns out that ***you really don't need soup for this***, but I had used it before and thought I might use it again here. It is possible, though, that I inadvertently relied on the cleanup Soup applies to broken HTML.

In [36]:
# a function to get the (smiles, soup) pair
baseData = lambda x: (textToSMILES(x[0]),
                     BeautifulSoup(x[1]))

Let's collect the base data.

In [37]:
# with the soup
tihkalBase = list(map(baseData, tihkalMolDr))
pihkalBase = list(map(baseData, pihkalMolDr))

### Qualitative Comments

Let's define a function to extract the *Qualitative Comments* section from Shulgin's pages.

In [38]:
def dosingFunc(page):
  """
  page is the source (raw string)
  returns a list of dosage strings and a list of quotations
  strategy: split on 'QUALITATIVE COMMENTS' first, then 'EXTENSIONS'
  """
  # first and second splitters
  firstSplitter = re.compile(r'QUALITATIVE COMMENTS|QUALITATIVE|COMMENTS')
  secondSplitter = re.compile(r'EXTENSIONS|COMMENTARY')
  chunks = firstSplitter.split(page)[1:] # chunks following 'QUAL...'
  # how does the second splitter act on a chunk? really just to clean
  chunkCleaner = lambda chunk: secondSplitter.split(chunk)[0]
  chunks = [chunkCleaner(x) for x in chunks]
  # dosing regex
  dosingRegex = re.compile(r'\(.*\d{1,3} µg.*?\)|\(.*\d{1,3} mg.*?\)|\(.*\d*.*\d* mg.*?\)')
  # collect all dosings
  # (named findDosings so it does not shadow the enclosing dosingFunc)
  findDosings = dosingRegex.findall
  dosingList = reduce(lambda x,y: x + y, map(findDosings, chunks), [])
  # get the comments by splitting on dosages
  commentFunc = lambda x: dosingRegex.split(x)[1:] #(comments follow dosings)
  commentList = reduce(lambda x,y: x + y, map(commentFunc, chunks), [])
  # get rid of newlines
  commentList = [x.replace('\n', ' ') for x in commentList if x]
  # return the lists
  return dosingList, commentList
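
To see what the three patterns do, here is a small worked example on made-up page text (the toy page is an assumption, not a real Erowid page). It repeats the steps of `dosingFunc` by hand: keep what follows the heading, cut at the next section, then find the dosings and split on them to get the comments.

```python
import re

# the same three patterns used in dosingFunc
firstSplitter = re.compile(r'QUALITATIVE COMMENTS|QUALITATIVE|COMMENTS')
secondSplitter = re.compile(r'EXTENSIONS|COMMENTARY')
dosingRegex = re.compile(r'\(.*\d{1,3} µg.*?\)|\(.*\d{1,3} mg.*?\)|\(.*\d*.*\d* mg.*?\)')

# made-up page text, with each dosing on its own line
toyPage = ("DOSAGE: 20 - 35 mg.\n"
           "QUALITATIVE COMMENTS: (with 20 mg) Felt calm.\n"
           "(with 35 mg) Felt restless.\n"
           "EXTENSIONS AND COMMENTARY: Nothing more.")

# keep what follows the heading, then cut at the next section
chunk = firstSplitter.split(toyPage)[1]
chunk = secondSplitter.split(chunk)[0]

dosings = dosingRegex.findall(chunk)
comments = [c.replace('\n', ' ') for c in dosingRegex.split(chunk)[1:] if c]

print(dosings)   # ['(with 20 mg)', '(with 35 mg)']
print(comments)  # [' Felt calm. ', ' Felt restless. ']

# the greedy '.*' cannot cross a newline, but on ONE line it swallows
# everything up to the last dosing, so the two dosings merge into one match
oneLine = "(with 20 mg) Felt calm. (with 35 mg) Felt restless."
merged = dosingRegex.findall(oneLine)
print(merged)    # ['(with 20 mg) Felt calm. (with 35 mg)']
```

The last case is worth keeping in mind when auditing: the dosing pattern only separates entries cleanly when they sit on different lines.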

In [39]:
tihkDoses = [dosingFunc(str(x)) for x in [y[1] for y in tihkalBase]]
# how many are missing any comments from the dosed?
[i for i,x in enumerate(tihkDoses) if not x[0]]

[1, 24, 25, 28, 29, 41, 43, 48, 49]

That passes the sanity check; all of those pages had zero dose information. Let's try it on *PiHKAL*.

In [40]:
pihkDoses = [dosingFunc(str(x)) for x in [y[1] for y in pihkalBase]]
# how many are missing any comments from the dosed?
#len([i for i,x in enumerate(pihkDoses) if not x[0]])
# print the actual page numbers (1-indexed)
print([i+1 for i,x in enumerate(pihkDoses) if not x[0]])

[1, 17, 19, 29, 32, 54, 73, 74, 75, 79, 80, 83, 86, 90, 101, 102, 103, 104, 107, 111, 112, 117, 120, 121, 124, 166]


In [41]:
# helpers for reducing
# flattenPage: one (smiles, dosing, comment) triple per dosing on a page;
# a page with no dosings contributes nothing
flattenPage = lambda x,p: [(x, y, z) for y,z in zip(*p) if p[0]]
# aggPage: append a page's triples to the running list
aggPage = lambda x,y: x + flattenPage(*y)
# reduce tihk
tihkZip = zip(map(lambda x: x[0], tihkalBase), tihkDoses)
tihkDosings = reduce(aggPage, tihkZip, [])
# reduce pihk
pihkZip = zip(map(lambda x: x[0], pihkalBase), pihkDoses)
pihkDosings = reduce(aggPage, pihkZip, [])

In [42]:
len(pihkDosings), len(tihkDosings)

(464, 255)
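
The flattening step is easier to follow on toy data. In this sketch the SMILES strings, dosings, and comments are all made up; the second page has no dosings and so contributes no rows.

```python
from functools import reduce

# same helpers as in the reducing cell
flattenPage = lambda x, p: [(x, y, z) for y, z in zip(*p) if p[0]]
aggPage = lambda x, y: x + flattenPage(*y)

# made-up inputs: one SMILES string per page, paired with that page's
# (dosings, comments) lists; the second page has no dosings at all
toySmiles = ['CCO', 'CCN']
toyDoses = [(['(with 1 mg)', '(with 2 mg)'], [' mild ', ' strong ']),
            ([], [])]

toyDosings = reduce(aggPage, zip(toySmiles, toyDoses), [])
print(toyDosings)
# [('CCO', '(with 1 mg)', ' mild '), ('CCO', '(with 2 mg)', ' strong ')]
```

Because `zip` stops at the shorter list, a page whose dosing and comment lists differ in length would silently lose the unmatched tail.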

In [43]:
pihkDosings[0]

('C=CCOc1c(OC)cc(CCN)cc1OC',
 '(with 24 mg)',
 ' I first became aware of something in about 10 minutes, a pleasant increase in energy.  By 20 minutes it was getting pronounced and was a nice, smooth development.  During the next hour positive and negative feelings developed simultaneously. Following a suggestion, I ate a bit of food even though I had not been hungry, and to my surprise all the negative feelings dropped away.  I felt free to join the others wherever they were at.  I moved into the creative, free-flowing kind of repertoire which I dearly love, and found everything enormously funny.  Much of the laughter was so deep that I felt it working through buried depressions inside me and freeing me.  From this point on, the experience was most enjoyable. The experience was characterized by clear-headedness and an abundance of energy which kept on throughout the day and evening.  At one point I went out back and strolled along to find a place to worship.  I had a profound sense of 

Finally, we tack on the sentiment. We are truncating comments strictly for the purpose of gitting er done.

In [44]:
pihkScoredDosings = list(map(lambda t: (*t, sentimentScore(t[-1][:512])), pihkDosings))
tihkScoredDosings = list(map(lambda t: (*t, sentimentScore(t[-1][:512])), tihkDosings))

In [45]:
tihkScoredDosings[0]

('[cH:1]1[cH:2][cH:3][c:14]2[c:13]3[c:4]1[C:6]1=[CH:11][CH:10]([C:19](=[O:20])[N:21]([CH2:22][CH3:25])[CH2:23][CH3:24])[CH2:9][N:8]([CH2:26][CH:27]=[CH2:28])[CH:7]1[CH2:5][c:12]3[cH:16][nH:15]2',
 '(with 50 µg)',
 ' "I am aware in twenty minutes, and am into a stoned place, not too LSD like, in another hour. I would very much like to push higher, but that is not in the cards today and I must acknowledge recovery by hour eight."  ',
 -0.9833899140357971)

In [46]:
tihkArray = np.array(tihkScoredDosings, dtype=shulginTypes)
pihkArray = np.array(pihkScoredDosings, dtype=shulginTypes)

In [47]:
with open('tihkBin','wb') as binFile:  # renamed so the builtin bin() is not shadowed
  pickle.dump(tihkArray, binFile)
!cp tihkBin /content/drive/MyDrive/Colab\ Notebooks

In [None]:
with open('pihkBin','wb') as binFile:  # renamed so the builtin bin() is not shadowed
  pickle.dump(pihkArray, binFile)
!cp pihkBin /content/drive/MyDrive/Colab\ Notebooks

Let's do pandas instead.

In [48]:
tihkFrame = pd.DataFrame(tihkArray)

In [None]:
tihkFrame.to_json('tihkFrame.json')

In [None]:
!cp tihkFrame.json /content/drive/MyDrive/Colab\ Notebooks/

In [None]:
tihkFrame.dtypes

smiles     object
dosing     object
text       object
score     float64
dtype: object

Wonderful.
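
For a quick check that nothing is lost on the way to disk, here is a minimal sketch that builds a frame from a toy structured array and round-trips it through CSV. The records are made up, and the CSV goes to an in-memory buffer rather than a file so the example leaves nothing behind.

```python
import io
import numpy as np
import pandas as pd

shulginTypes = [('smiles', object), ('dosing', object), ('text', object), ('score', float)]

# made-up placeholder records, not entries from the dataset
toyArray = np.array([('CCO', '(with 10 mg)', ' A pleasant afternoon. ', 0.5),
                     ('CCN', '(with 20 mg)', ' Not much to report. ', -0.25)],
                    dtype=shulginTypes)

# field names of the structured array become the column names
toyFrame = pd.DataFrame(toyArray)

# write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toyFrame.to_csv(buffer, index=False)
buffer.seek(0)
roundTrip = pd.read_csv(buffer)

print(list(roundTrip.columns))
print(roundTrip['score'].tolist())
```

Writing with `index=False` keeps the row index out of the file, so reading it back gives the same four columns and no extra one.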

In [49]:
pihkFrame = pd.DataFrame(pihkArray)

In [50]:
tihkFrame.to_csv('tihkFrameTwo.csv', index=False)
pihkFrame.to_csv('pihkFrameTwo.csv', index=False)

In [51]:
!cp tihkFrameTwo.csv /content/drive/MyDrive/Colab\ Notebooks
!cp pihkFrameTwo.csv /content/drive/MyDrive/Colab\ Notebooks

In [None]:
pihkFrame.to_json('pihkFrame.json')

In [None]:
!cp pihkFrame.json /content/drive/MyDrive/Colab\ Notebooks/

In [None]:
# to csv
tihkFrame.to_csv('tihkFrame.csv')

In [None]:
# again 
pihkFrame.to_csv('pihkFrame.csv')

In [None]:
!cp tihkFrame.csv /content/drive/MyDrive/Colab\ Notebooks/
!cp pihkFrame.csv /content/drive/MyDrive/Colab\ Notebooks/

In [52]:
drive.flush_and_unmount()