# Troubleshooting of logD estimation

In our current SAMPL7 analysis we follow Bannan et al. (10.1007/s10822-016-9954-8) and use roughly the following for a basic solute:

$\log D = \log P - \log(1 + 10^{pK_a - pH})$ (eq. 3)

or for an acidic solute:
$\log D = \log P - \log(1 + 10^{pH - pK_a})$ (eq. 4)

These hold for a compound with a single change in protonation state; accounting for additional states could use Equation 5 of that paper.


Pion, whose instruments yielded the logD analysis (in Francisco et al., https://doi.org/10.1016/j.ejmech.2021.113399) says that their logD analysis uses a general equation which originates from Avdeef, A., "Assessment of distribution-pH profiles", *Lipophilicity in drug action and toxicology*, Pliska, V.; Testa, B.; Van de Waterbeemd, H. eds., VCH, Weinheim; p109-139 (1996). The equation given is 
\begin{equation}
D = \frac{P^0 + [H^+]\beta_1 P^1 + [H^+]^2 \beta_2 P^2 + \cdots}{1 + [H^+]\beta_1 + [H^+]^2 \beta_2 + \cdots}
\end{equation}

where I assume $[H^+]$ is found from the pH and $\beta_1 = K_{a1}$, $\beta_2 = K_{a1}K_{a2}$, etc. $\beta$ is a "cumulative" protonation constant.
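As a minimal sketch of how this general expression could be evaluated, the hypothetical helper below (`distribution_coefficient` is my name, not Pion's) takes the cumulative constants $\beta_j$ and the per-species partition values $P^j$ as plain inputs, so it makes no assumption about how the $\beta_j$ are derived from the pKa values. The example inputs are illustrative only: SM41's logP and pKa, $\beta_1 = K_{a1}$ as assumed above, and the same partition value for both species.

```python
import numpy as np

def distribution_coefficient(pH, betas, partitions):
    """Evaluate D = sum_j [H+]^j beta_j P^j / sum_j [H+]^j beta_j, with beta_0 = 1.

    betas      : [beta_1, beta_2, ...] cumulative constants, supplied by the caller
    partitions : [P^0, P^1, P^2, ...] one partition value per species
    """
    if len(partitions) != len(betas) + 1:
        raise ValueError("need exactly one more partition value than betas")
    conc_h = 10.0 ** (-pH)
    # Weight of species j is [H+]^j * beta_j; species 0 has weight 1.
    weights = np.array([1.0] + [conc_h ** (j + 1) * b for j, b in enumerate(betas)])
    return float(np.dot(weights, partitions) / weights.sum())

# Illustrative inputs only (assumptions, see the text above).
P = 10 ** 0.58
D = distribution_coefficient(7.4, betas=[10 ** -5.22], partitions=[P, P])
print(np.log10(D))
```

Keeping the $\beta_j$ and $P^j$ as explicit arguments makes it easy to swap in different assumptions later without touching the formula itself.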



Transition networks: https://docs.google.com/presentation/d/16SlgjA3mxwmi6bdbAkhiZrNEc6wJ94hLW8pvTBkI3Uo/edit#slide=id.g98334362cf_0_199

Note there are several cases where our current approach disagrees profoundly with that used by Pion. These are SM42, SM25, SM26, SM41 and SM43, in increasing order of disagreement. From the transition networks:
- SM42 has two neutral states, a +1 state, and a -1 state
- SM25 is complex, with three neutral states, a +1 state, and two -1 states
- SM26 is complex, with three neutral states, a +1 state, and two -1 states
- SM41 has a single neutral state, a +1 state, and a -1 state
- SM43 is complex, with two neutral states, two +1 states, a +2 state, and a -1 state

**The best starting point for simple troubleshooting, then, is SM41**

## Analysis of SM41

Experimental values for the challenge are found here: https://github.com/samplchallenges/SAMPL7/blob/master/physical_property/experimental_data/Experimental_Properties_of_SAMPL7_Compounds.csv

SM41 has a pKa of 5.22 and a logP of 0.58. The reported logD7.4 is -0.42.

### Using the Bannan expression for a base


In [7]:
import numpy as np

# Experimental values for SM41
pKa = 5.22
logP = 0.58
pH = 7.4

# Bannan eq. 3 (basic solute)
logD = logP - np.log10(1.+10**(pKa-pH))
print(logD)

0.57714008208903


### Using the Bannan expression for an acid

In [8]:
import numpy as np

# Experimental values for SM41
pKa = 5.22
logP = 0.58
pH = 7.4

# Bannan eq. 4 (acidic solute)
logD = logP - np.log10(1.+10**(pH-pKa))
print(logD)

-1.6028599179109704
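To put the two Bannan estimates side by side with the reported value, here is a small sketch that evaluates eq. 3 and eq. 4 for SM41 and prints the deviation from the experimental logD7.4 of -0.42. The function name `bannan_logd` and its `kind` argument are my own labels, not anything from the paper.

```python
import numpy as np

def bannan_logd(logP, pKa, pH, kind):
    """Single-pKa logD estimate: eq. 3 for kind='base', eq. 4 for kind='acid'."""
    if kind == "base":
        exponent = pKa - pH
    elif kind == "acid":
        exponent = pH - pKa
    else:
        raise ValueError("kind must be 'base' or 'acid'")
    return logP - np.log10(1.0 + 10.0 ** exponent)

logD_exp = -0.42  # reported logD7.4 for SM41
errors = {}
for kind in ("base", "acid"):
    logD = bannan_logd(logP=0.58, pKa=5.22, pH=7.4, kind=kind)
    errors[kind] = logD - logD_exp
    print(f"{kind}: logD = {logD:.3f}, deviation from experiment = {errors[kind]:+.3f}")
```

Neither form lands on the experimental value: the base expression comes out about one log unit too high and the acid expression about 1.2 log units too low.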


### Analyze using Pion's expression

This involves solving for the proton concentration from the pH.



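In [21]:
import numpy as np

pKa = 5.22
logP = 0.58
pH = 7.4
concH = 10**(-pH)   # [H+] from the pH
P = 10**(logP)
Ka = 10**(-pKa)     # used as beta_1

# Note: the same P is used here for both P^0 and P^1
D = ( P + concH*Ka*P)/(1 + concH*Ka)
logD = np.log10(D)
print(logD)
print(Ka, P, concH)
print(concH*Ka, concH*Ka*P)
print(D)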

0.58
6.025595860743581e-06 3.8018939632056115 3.981071705534969e-08
2.39883291901949e-13 9.120108393559093e-13
3.801893963205612
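
It may help to see why this cell returns exactly logP. Because the same $P$ was used for both species, the pH-dependent weights cancel, whatever the value of $[H^+]\beta_1$:

\begin{equation}
D = \frac{P + [H^+]\beta_1 P}{1 + [H^+]\beta_1} = P\,\frac{1 + [H^+]\beta_1}{1 + [H^+]\beta_1} = P
\end{equation}

So this result carries no information about the pKa or the pH; the expression can only move away from logP once the two species are given distinct partition values.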


# Throw that out and analyze a different way

The Comer et al. paper and Sirius's slide deck provide more context, indicating that $P^0$ is the partition coefficient of the neutral species, $P^1$ that of the +1 species, and so on. This means the above analysis, where I treated the superscripts as exponents instead, is completely a mistake.

My key question is how to get $P^0$, $P^1$, etc. for an example case like SM41. Figure 4 in the Comer paper suggests a path forward, since its graphs look a lot like those in the Sirius report for this particular compound; the caption says "Partial lipophilicity profiles derived using equations 8 or 9, after calculating logP^N from the data in graphs (a) and (b) using Eqns 10 or 11." This basically describes the data I have from Sirius, and I have equations 10 and 11 (one concern is that these are for the monoprotic case only, but I can still try).

The other concern is that Eqs. 10-11 contain $r$, a volume fraction, which does not occur in the Sirius report, but I think that is a minor, surmountable issue. I am also not sure I know $p_o K_a$, the pKa in octanol, but perhaps this can be figured out.

So let's try that. 

- Begin with $p_o K_a$. Where do I get it? Sirius slide deck (e.g. ch 7 slide 2) indicates it's the shifted pKa in the presence of octanol. (There is also a special version of this, the limiting one, called the Scherrer $p_o K_a$, which occurs in octanol saturation, but that is not what we need here.) 
    - **But where exactly does the $p_o K_a$ come from?** 
       - It's PROBABLY the rightmost shifted pKa in the "mean molecular charge" graph. Checking by viewing the underlying data:
       - The XLS spreadsheet has columns with the theoretical mean molecular charge based on the pKa
       - Next to those is the actual one, "shifted due to solvent", run at three solvent concentrations (not themselves stated)
       - So probably use the rightmost of these
       - The pH at which the curve for that concentration hits 0.5 is about 7.3 (between 7.18 and 7.4)
- Now what about $r$?
    - Could use just a placeholder value of some kind
    - There is no data in the paper on what $r$ is, nor in the Sirius report
    - Emailed Karol/Carlo asking
    - In the interim probably try some limits, e.g. volume fraction of 1, 0.1, 0.5
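
As a rough sketch of the interim plan in the last bullet, the snippet below scans the placeholder values of $r$ using the relation $P^1 = 10^{p_o K_a - pK_a}/r - 1$ that the next cell applies. The helper name `p1_from_shift` is hypothetical, and $p_o K_a = 7.3$ is only the approximate value read off above, so the numbers are indicative at best.

```python
def p1_from_shift(poKa, pKa, r):
    """Partition value of the charged species from the octanol-induced pKa shift.

    Uses P1 = 10**(poKa - pKa) / r - 1, the relation applied in this notebook.
    r is treated as a plain number; its exact definition is still an open question.
    """
    return 10.0 ** (poKa - pKa) / r - 1.0

pKa = 5.22   # aqueous pKa of SM41
poKa = 7.3   # approximate shifted pKa in the presence of octanol
for r in (1.0, 0.5, 0.1):
    print(f"r = {r}: P1 = {p1_from_shift(poKa, pKa, r):.1f}")
```

For all three placeholder values $P^1$ stays positive, and it grows as $r$ shrinks.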

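In [64]:
pKa = 5.22
logP = 0.58
pH = 7.4
concH = 10**(-pH)
poKa = 7.3       # approximate shifted pKa in the presence of octanol
r = 2/1.09995    # octanol volume (2 mL) / water volume (1.09995 mL)

P0 = 10**(logP)               # partition coefficient of the neutral species
P1 = (10**(poKa-pKa))/r - 1   # partition coefficient of the +1 species

Ka = 10**(-pKa)

D = (P0 + P1*concH*Ka)/(1+concH*Ka)
print(D)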

3.8018939632203215


In [65]:
print(P0, P1)

3.8018939632056115 65.12153824287118


In [66]:
print(concH*Ka)


2.39883291901949e-13


In [67]:
print(1+concH*Ka)

1.0000000000002398


In [68]:
10**5.22

165958.69074375596

In [69]:
print(np.log10(D))

0.5800000000016803
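
A quick sanity check on the numbers above: since the weight `concH*Ka` is about 2.4e-13, the $P^1$ term barely contributes. The sketch below inverts the same expression to ask what $P^1$ would be required to reproduce the reported logD7.4 of -0.42. It assumes the weighting used in the cells above is right, which is itself in question, and `P1_needed` is just an illustrative name.

```python
import numpy as np

pKa, logP, pH = 5.22, 0.58, 7.4
logD_target = -0.42                  # reported logD7.4 for SM41

x = 10.0 ** (-pH) * 10.0 ** (-pKa)   # weight of the charged species, concH*Ka
P0 = 10.0 ** logP
D_target = 10.0 ** logD_target

# Solve D = (P0 + P1*x) / (1 + x) for P1.
P1_needed = (D_target * (1.0 + x) - P0) / x
print(f"P1 needed: {P1_needed:.3e}")

# Round trip: plugging P1_needed back in recovers the target logD.
D_check = (P0 + P1_needed * x) / (1.0 + x)
print(np.log10(D_check))
```

The required value is negative and of order 1e13 in magnitude, nowhere near the $P^1$ of about 65 computed above, so with this weighting no plausible $P^1$ closes the gap.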


The Sirius analysis gives a negative value for logD here; the only way this could happen is if $P^1$ is negative, which requires the $10^{p_o K_a - pK_a}/r$ term in $P^1 = 10^{p_o K_a - pK_a}/r - 1$ to be less than 1. For a $p_o K_a - pK_a$ gap of about 2.1, as here, that term is about $120/r$, so $r$ would have to exceed roughly 120, meaning the partition solvent is far in excess of the aqueous phase. That seems... odd.

Sirius has slides on this in Ch 7, though (e.g. slides 16-17). The basic point is that you want an $r$ that gives a shift in pKa as you modulate the amount of octanol. They have an example for a logP between 0 and 2.5 where you might use, as a third octanol concentration, an octanol volume of 1.2 and a water volume of 1.

Looks like I'm looking at data point 114, which has a water volume of 1.09995 mL and an octanol volume of 2 mL.
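
To make the "far in excess" concern concrete, here is a back-of-the-envelope sketch. It assumes $r$ is the octanol-to-water volume ratio, as in the `r = 2/1.09995` used earlier, and asks how large $r$ must be before $10^{p_o K_a - pK_a}/r$ drops below 1.

```python
pKa = 5.22
poKa = 7.3            # approximate shifted pKa
v_water = 1.09995     # mL, water volume at data point 114
v_octanol = 2.0       # mL, octanol volume at data point 114

r_actual = v_octanol / v_water
r_threshold = 10.0 ** (poKa - pKa)   # P1 < 0 requires r above this
v_octanol_needed = r_threshold * v_water

print(f"actual r = {r_actual:.3f}, threshold r = {r_threshold:.1f}")
print(f"octanol needed for P1 < 0: more than {v_octanol_needed:.0f} mL")
```

With about 1.1 mL of water this would take more than roughly 130 mL of octanol, against the 2 mL actually used at this data point.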

In [71]:
print(10**(poKa -pKa)/r)

66.12153824287118


In [63]:
poKa-pKa

2.08