In [23]:
%load_ext autoreload
%autoreload 2

In [40]:
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv
load_dotenv()

from typing import List, Set
from pydantic import BaseModel, Field

from plotreader.utils.document import MultimodalDirectoryHandler

In [None]:
fig_handler = MultimodalDirectoryHandler(
    name = 'figure',
    dirpath = './storage/tmp',
    desc = "A scientific paper from which we'd like to extract quantitative information.",
    storage_dir = './storage',
    use_cache=False,
#     parsing_instructions =
# """
# The provided document is a scientific paper.
# Extract all of the text but DO NOT generate text descriptions of the images.
# DO NOT include separations between pages in the markdown or text output. Ignore all headers and footers that are metadata (like citation info).
# """
)

In [3]:
query_tool = fig_handler.query_engine_tool(top_k = 5)

NameError: name 'fig_handler' is not defined

In [53]:
tool_response = query_tool("""In Figure 2d, plot number 1, what values are taken by the independent variable Wavelength (nm)?
    IMPORTANT: 
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS. ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE PLOTTED POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!
        THERE MAY BE TYPOS IN THE TEXT, SO TRY TO RESOLVE DISCREPANCIES IN A PARSIMONIOUS WAY!!
    Return your answer as structured data.
    


Here's a JSON schema to follow:
{{"properties": {{"name": {{"title": "Name", "type": "string"}}, "values": {{"items": {{}}, "title": "Values", "type": "array"}}, "unit": {{"default": "None", "title": "Unit", "type": "string"}}}}, "required": ["name"], "title": "IndependentVariable", "type": "object"}}
""")

In [54]:
print(tool_response.content)

Based on the information provided in the images and text, I can provide the following structured response:

{
  "name": "Wavelength",
  "values": [397, 439, 481, 546, 567, 600],
  "unit": "nm"
}

Explanation:
I derived this answer primarily from examining Figure 2d in the image and cross-referencing it with the information provided in the "Light delivery and imaging" section of the text.

The plot in Figure 2d shows wavelength on the x-axis, ranging from 400 to 600 nm. However, the exact values used are not clearly labeled on the axis.

From the text, we learn that for full spectra measurements (which Figure 2e refers to, and Figure 2d likely follows the same protocol), they measured photocurrents at seven different wavelengths:

1. (violet) 397 ± 3 nm
2. (blue) 439 ± 8 nm LED
3. (cyan) 481 ± 3 nm LED
4. (green) 546 ± 16 nm
5. (yellow) 567 ± 13 nm
6. (red) 640 ± 3 nm

Looking at the plot, we can see data points that correspond to these wavelengths, except for 640 nm which appears to be
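
The reply above wraps the JSON object in explanatory prose, so it cannot be passed to `json.loads` as-is. One way to recover the structured part is sketched below using only the standard library. `extract_first_json_object` is a hypothetical helper (it is not part of plotreader or LlamaIndex), and `reply` is an abbreviated copy of the output above.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[Any]:
    """Return the first decodable JSON object found in free text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            start = text.find("{", start + 1)
    return None


reply = """Based on the information provided in the images and text, I can provide the following structured response:

{
  "name": "Wavelength",
  "values": [397, 439, 481, 546, 567, 600],
  "unit": "nm"
}

Explanation:
I derived this answer primarily from examining Figure 2d ...
"""

variable = extract_first_json_object(reply)
print(variable)
```

The helper only locates and decodes the object; checking it against a schema would still be a separate step.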

In [55]:
tool_response = query_tool("""In Figure 2d, plot number 1, how many points are plotted for each line?
    IMPORTANT: 
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS. ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE PLOTTED POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!
        THERE MAY BE TYPOS IN THE TEXT, SO TRY TO RESOLVE DISCREPANCIES IN A PARSIMONIOUS WAY!!
    Return your answer as structured data.
    


Here's a JSON schema to follow:
{{"properties": {{"name": {{"title": "Name", "type": "string"}}, "values": {{"items": {{}}, "title": "Values", "type": "array"}}, "unit": {{"default": "None", "title": "Unit", "type": "string"}}}}, "required": ["name"], "title": "IndependentVariable", "type": "object"}}
""")

In [56]:
print(tool_response.content)

I apologize, but I cannot provide an accurate answer to this query based on the information given. The context and images provided do not contain a Figure 2d or any information specifically related to it. The available information primarily discusses Figure 1 and Figure 4, but does not include details about a Figure 2d or its specific plot points.

To accurately answer this question, we would need to see Figure 2d and any accompanying text that describes the data presented in that figure. Without this information, it's not possible to determine how many points are plotted for each line in the first plot of Figure 2d, or what the independent and dependent variables are.

If you have additional information or images that include Figure 2d, I'd be happy to analyze those and provide a more accurate response.
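
The reply above reports that the retrieved context does not include Figure 2d, which suggests the problem may lie in retrieval rather than in the prompt. A quick sanity check is to scan the retrieved text for the figure label before trusting an answer. The sketch below treats each retrieved window as a plain string; in this notebook that text would presumably come from `node.metadata["window"]`. The helper names are hypothetical, and the third sample string is invented to show that "Figure 20" is not mistaken for "Figure 2".

```python
import re
from typing import List


def mentions_figure(text: str, figure_number: int) -> bool:
    """True if the text names the figure, e.g. 'Figure 2', 'Fig. 2d', 'FIGURE 2'."""
    # (?!\d) keeps 'Figure 2' from matching inside 'Figure 20'.
    pattern = rf"\b(?:Figure|Fig\.?)\s*{figure_number}(?!\d)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def windows_mentioning(windows: List[str], figure_number: int) -> List[int]:
    """Indices of the retrieved windows that mention the given figure."""
    return [i for i, w in enumerate(windows) if mentions_figure(w, figure_number)]


windows = [
    "[START FIGURE 1 CAPTION]\nFigure 1.\n Machine learning-guided optimization of ChRs.",
    "[START FIGURE 2 CAPTION]\nFigure 2.\n The model-predicted ChRs exhibit a large range ...",
    "Supplemental Figure 20 shows unrelated data.",  # invented example text
]

hits = windows_mentioning(windows, 2)
print(hits)
```

An empty result for the figure being queried would indicate that none of the `top_k` nodes cover it, so the answer cannot be grounded in that figure.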


In [65]:
tool_response = query_tool("""Summarize Figure 2 and each panel in it.
    IMPORTANT: 
        THERE MAY BE TYPOS IN THE TEXT, SO TRY TO RESOLVE DISCREPANCIES IN A PARSIMONIOUS WAY!!
""")

In [66]:
print(tool_response.content)

Based on the information provided in the caption and the image for Figure 2, I can summarize the figure and its panels as follows:

Figure 2 demonstrates that model-predicted channelrhodopsins (ChRs) exhibit a wide range of functional properties, often surpassing those of their parent ChRs.

Panel (a): Shows representative current traces after 0.5 seconds of light exposure for selected designer ChR variants. It also displays the corresponding expression and localization in HEK cells. Each ChR current trace has a vertical colored scale bar representing 500 pA, and a horizontal scale bar representing 250 ms. The color assigned to each variant in this panel is used consistently throughout the other panels.

Panel (b): Presents measured peak and steady-state photocurrents with different wavelengths of light in HEK cells. The measurements were taken using:
- 397 nm light at 1.5 mW mm^-2
- 481 nm light at 2.3 mW mm^-2
- 546 nm light at 2.8 mW mm^-2
- 640 nm light at 2.2 mW mm^-2
The data is 

In [67]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='cd99e54d-99eb-46be-850c-b9d3a8fce9e7', embedding=None, metadata={'images': [{'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-img_p21_1.png', 'page_num': 22}, {'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-page_22.jpg', 'page_num': 22}], 'window': 'Bedbrook et al.  Page 22\n\n[START FIGURE 2 CAPTION]\nFigure 2.\n The model-predicted ChRs exhibit a large range of functional properties often far exceeding the parents.  (a) Representative current traces after 0.5 s light exposure for select designer ChR variants with corresponding expression and localization in HEK cells.  Vertical colored scale bar for each ChR current trace represents 500 pA, and horizontal scale bar represents 250 ms.  The variant color presented in (a) is constant throughout panels.  (b) Measured peak and steady-state photocurrent with different wavelengths of light in HEK cells (n = 4–8 cells, see Dataset 2).  397 nm light at 1.5 m

In [69]:
class Experiment(BaseModel):
    independent_variables: List[str]
    dependent_variables: List[str]

class Plot(BaseModel):
    name: str
    experiments: List[Experiment]

class Panel(BaseModel):
    name: str
    plots: List[Plot]

class Figure(BaseModel):
    panels: List[Panel]
    # statistics: List[str]



In [70]:
import plotreader
from llama_index.core.prompts import PromptTemplate
from llama_index.core.output_parsers.pydantic import PydanticOutputParser

In [73]:
output_parser = PydanticOutputParser(output_cls=Figure)

prompt = """For each plot in each panel of Figure 2, determine the experiment in terms of Independent variables (IVs) and dependent variable (DV). 
IMPORTANT:
    By plot, we mean each set of axes or displays. Each panel can have multiple plots.
Return your answer as structured data.
"""

prompt = PromptTemplate(prompt, output_parser=output_parser).format(llm=plotreader._MM_LLM)
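
The hand-written prompts earlier in this notebook embed the JSON schema with doubled braces. That is the usual way to keep literal braces intact through a `str.format`-style template pass, which is presumably what the template formatting on the line above relies on. The sketch below shows the round trip with the standard library only. The `schema` dict copies the `IndependentVariable` schema from those prompts, and `escape_braces` is a hypothetical helper name.

```python
import json


def escape_braces(text: str) -> str:
    """Double every brace so the text survives str.format unchanged."""
    return text.replace("{", "{{").replace("}", "}}")


schema = {
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "values": {"items": {}, "title": "Values", "type": "array"},
        "unit": {"default": "None", "title": "Unit", "type": "string"},
    },
    "required": ["name"],
    "title": "IndependentVariable",
    "type": "object",
}

schema_json = json.dumps(schema)
escaped = escape_braces(schema_json)

# Formatting the escaped text (with no arguments) collapses the doubled
# braces back into ordinary JSON.
restored = escaped.format()
print(restored == schema_json)
```

If a prompt containing the escaped form is sent without any formatting pass, the model sees the doubled braces literally, so it is worth checking which form actually reaches the LLM.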

In [74]:
tool_response = query_tool(prompt)

panels=[Panel(name='a', plots=[Plot(name='Current strength', experiments=[Experiment(independent_variables=['Time'], dependent_variables=['Current'])]), Plot(name='Off-kinetics', experiments=[Experiment(independent_variables=['Time'], dependent_variables=['Current'])]), Plot(name='Wavelength sensitivity of activation', experiments=[Experiment(independent_variables=['Wavelength'], dependent_variables=['Normalized photocurrent'])])]), Panel(name='b', plots=[Plot(name='Train classification models', experiments=[Experiment(independent_variables=['Predicted probability of localizing'], dependent_variables=['Predicted probability of functioning'])])]), Panel(name='c', plots=[Plot(name='Current strength', experiments=[Experiment(independent_variables=['Measured value'], dependent_variables=['Predicted value'])]), Plot(name='Off-kinetics', experiments=[Experiment(independent_variables=['Measured value'], dependent_variables=['Predicted value'])]), Plot(name='Spectral properties', experiments=[

In [77]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='22ca468e-3ec5-4932-b378-8c2a74c5fd54', embedding=None, metadata={'images': [{'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-img_p19_1.png', 'page_num': 20}, {'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-page_20.jpg', 'page_num': 20}], 'window': '[START FIGURE 1 CAPTION]\nFigure 1.\n Machine learning-guided optimization of ChRs.  (a) Upon light exposure, ChRs open and reach a peak inward current and then desensitize reaching a lower steady-state current.  We use both peak and steady-state current as metrics for photocurrent strength.  To evaluate ChR off-kinetics we used the current decay rate (τoff) after a 1 ms light exposure and also the time to reach 50% of the light-exposed current after light removal.  As a metric for wavelength sensitivity of activation, we used the normalized photocurrent with green (546 nm) light, which easily differentiates blue-shifted ChRs (peak activation: ~450–480 nm)

In [75]:
figure_struct = output_parser.parse(tool_response.content)

In [76]:
for panel in figure_struct.panels:
    print(f"Panel: {panel.name}")
    for plot in panel.plots:
        print(f"\tPlot: {plot.name}")
        for experiment in plot.experiments:
            print(f"\t\t{experiment}")

Panel: a
	Plot: Current strength
		independent_variables=['Time'] dependent_variables=['Current']
	Plot: Off-kinetics
		independent_variables=['Time'] dependent_variables=['Current']
	Plot: Wavelength sensitivity of activation
		independent_variables=['Wavelength'] dependent_variables=['Normalized photocurrent']
Panel: b
	Plot: Train classification models
		independent_variables=['Predicted probability of localizing'] dependent_variables=['Predicted probability of functioning']
Panel: c
	Plot: Current strength
		independent_variables=['Measured value'] dependent_variables=['Predicted value']
	Plot: Off-kinetics
		independent_variables=['Measured value'] dependent_variables=['Predicted value']
	Plot: Spectral properties
		independent_variables=['Measured value'] dependent_variables=['Predicted value']
Panel: e
	Plot: Peak photocurrent
		independent_variables=['Mutation strategy'] dependent_variables=['Peak photocurrent']
	Plot: Log10 off-kinetics
		independent_variables=['Mutation strat
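
The nested panel, plot, experiment structure printed above can also be flattened into rows, which makes it easier to compare against the figure by eye or to load into a table. This sketch works on a plain dict shaped like the `Figure` model defined earlier; the sample data copies two entries from the output above, and `flatten_figure` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, str]  # (panel, plot, independent vars, dependent vars)


def flatten_figure(figure: Dict) -> List[Row]:
    """Turn the nested panels/plots/experiments dict into one row per experiment."""
    rows: List[Row] = []
    for panel in figure["panels"]:
        for plot in panel["plots"]:
            for experiment in plot["experiments"]:
                rows.append((
                    panel["name"],
                    plot["name"],
                    ", ".join(experiment["independent_variables"]),
                    ", ".join(experiment["dependent_variables"]),
                ))
    return rows


figure = {
    "panels": [
        {"name": "a", "plots": [
            {"name": "Current strength", "experiments": [
                {"independent_variables": ["Time"],
                 "dependent_variables": ["Current"]},
            ]},
        ]},
        {"name": "b", "plots": [
            {"name": "Train classification models", "experiments": [
                {"independent_variables": ["Predicted probability of localizing"],
                 "dependent_variables": ["Predicted probability of functioning"]},
            ]},
        ]},
    ]
}

rows = flatten_figure(figure)
for row in rows:
    print(" | ".join(row))
```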

In [78]:
tool_response = query_tool(
    """Tell me as much as possible about Figure 2 using only the provided text. DO NOT USE IMAGES!!!!!"""
)
print(tool_response.content)

Based solely on the provided text description of Figure 2, I can tell you the following:

1. The figure demonstrates that model-predicted ChRs (channelrhodopsins) exhibit a wide range of functional properties, often surpassing those of their parent variants.

2. Part (a) of the figure shows representative current traces after 0.5 seconds of light exposure for selected designer ChR variants. These traces are accompanied by corresponding expression and localization data in HEK cells. Each ChR current trace has a vertical colored scale bar representing 500 pA, and a horizontal scale bar representing 250 ms.

3. Part (b) presents measured peak and steady-state photocurrent data for different wavelengths of light in HEK cells. The specific light conditions used were:
   - 397 nm light at 1.5 mW mm−2
   - 481 nm light at 2.3 mW mm−2
   - 546 nm light at 2.8 mW mm−2
   - 640 nm light at 2.2 mW mm−2
   The data comes from 4-8 cells, as referenced in Dataset 2.

4. Part (c) shows the off-kineti

In [79]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='cd99e54d-99eb-46be-850c-b9d3a8fce9e7', embedding=None, metadata={'images': [{'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-img_p21_1.png', 'page_num': 22}, {'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-page_22.jpg', 'page_num': 22}], 'window': 'Bedbrook et al.  Page 22\n\n[START FIGURE 2 CAPTION]\nFigure 2.\n The model-predicted ChRs exhibit a large range of functional properties often far exceeding the parents.  (a) Representative current traces after 0.5 s light exposure for select designer ChR variants with corresponding expression and localization in HEK cells.  Vertical colored scale bar for each ChR current trace represents 500 pA, and horizontal scale bar represents 250 ms.  The variant color presented in (a) is constant throughout panels.  (b) Measured peak and steady-state photocurrent with different wavelengths of light in HEK cells (n = 4–8 cells, see Dataset 2).  397 nm light at 1.5 m

In [58]:
tool_response = query_tool(
    """In Figure 2d, plot number 1, which opsin variant is signified by each color? Check for both
    IMPORTANT: 
        YOU HAVE ACCESS TO IMAGES TOO. BE SURE TO LOOK AT THE FIGURES.
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS. ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!"""
)
print(tool_response.content)

I apologize, but I do not have access to Figure 2d in the provided information. The images and text provided do not contain a Figure 2d or any specific color-coded plot showing opsin variants. 

The closest relevant information I can find is in the text, which mentions:

"We assessed the light sensitivity of select designer ChRs. Compared with CsChrimR, CheRiff, and C1C2, the designer ChRs have ≥9-times larger currents at the lowest intensity of light tested (10−1 mW mm−2), larger currents at all intensities of light tested, and minimal decrease in photocurrent magnitude over the range of intensities tested (10−1–101 mW mm−2), suggesting that photocurrents were saturated at these intensities and would only attenuate at much lower light intensities (Figure 2d)."

However, this text does not provide specific color assignments for each opsin variant or detailed data points that would be needed to answer the query as requested. Without the actual Figure 2d, I cannot provide the specific in

In [60]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='b6a6cd2d-8e02-4411-a1c7-ddd762b7473c', embedding=None, metadata={'images': [{'image_path': './storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-page_12.jpg', 'page_num': 12}], 'window': '## Light delivery and imaging\n\nPatch-clamp recordings were done with short light pulses to measure photocurrents.  Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spect

In [61]:
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from pprint import pprint
pprint(
    "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) + "\n\n" for r in tool_response.raw_output.source_nodes]
        )
)

("images: [{'image_path': "
 "'./storage/data_images/9c556600-70ee-4d6b-b9b1-881985b8d8bc-page_12.jpg', "
 "'page_num': 12}]\n"
 '\n'
 '## Light delivery and imaging\n'
 '\n'
 'Patch-clamp recordings were done with short light pulses to measure '
 'photocurrents.  Light pulse duration, wavelength, and power were varied '
 'depending on the experiment (as described in the text).  Light pulses were '
 'generated using a Lumencor SPECTRAX light engine.  The illumination/output '
 'spectra for each color were measured (Supplemental Figure 5).  To evaluate '
 'normalized green photocurrent, we measured photocurrent strength at three '
 'wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) '
 '546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light '
 'intensity was matched for these measurements, with 481 nm light at 2.3 mW '
 'mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For '
 'full spectra measurements depicted in Figure 2e, we me

In [139]:
tool_response = query_tool(
    """In Figure 2, for each plot in each panel, determine the experiment in terms of Independent variables (IVs) and dependent variable (DV). 
IMPORTANT:
    By plot, we mean each set of axes or displays. Each panel can have multiple plots.
Return your answer as structured data.



Here's a JSON schema to follow:
{{"$defs": {{"Experiment": {{"properties": {{"independent_variables": {{"items": {{"type": "string"}}, "title": "Independent Variables", "type": "array"}}, "dependent_variables": {{"items": {{"type": "string"}}, "title": "Dependent Variables", "type": "array"}}}}, "required": ["independent_variables", "dependent_variables"], "title": "Experiment", "type": "object"}}, "Panel": {{"properties": {{"name": {{"title": "Name", "type": "string"}}, "plots": {{"items": {{"$ref": "#/$defs/Plot"}}, "title": "Plots", "type": "array"}}}}, "required": ["name", "plots"], "title": "Panel", "type": "object"}}, "Plot": {{"properties": {{"name": {{"title": "Name", "type": "string"}}, "experiments": {{"items": {{"$ref": "#/$defs/Experiment"}}, "title": "Experiments", "type": "array"}}}}, "required": ["name", "experiments"], "title": "Plot", "type": "object"}}}}, "properties": {{"panels": {{"items": {{"$ref": "#/$defs/Panel"}}, "title": "Panels", "type": "array"}}}}, "required": ["panels"], "title": "Figure", "type": "object"}}

Output a valid JSON object but do not repeat the schema."""
)
print(tool_response.content)


I apologize, but I cannot provide a detailed analysis of Figure 2 as requested. The information provided in the context does not include a Figure 2 or its specific details. The images and text provided primarily focus on Figure 1, Figure 3, and Figure 4. Without the actual Figure 2 to reference, I cannot accurately determine the independent and dependent variables for each plot within its panels.

To provide a valid response following the given JSON schema, I would need to see Figure 2 and its associated description. The information available is insufficient to construct the structured data as requested for Figure 2.

If you meant to ask about Figure 1 instead, I could provide an analysis based on the information visible in the image and caption for Figure 1. However, to avoid making assumptions, I believe it's best to clarify which figure you're interested in analyzing before proceeding with a structured response.


In [140]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='cb41dfc6-3bbc-4348-b98a-926ccd6e74bf', embedding=None, metadata={'image_path': 'storage/data_images/29687a3b-1eea-4fd7-9934-78e4870d10f6-page_20.jpg', 'page_num': 20, 'window': "The figure illustrates various aspects of the study, including photocurrent properties, classification and regression models, and training data.\n\n Panel a shows photocurrent properties including current strength, off-kinetics, and wavelength sensitivity of activation.\n\n Panel b displays a classification model for predicting ChR localization and function.\n\n Panel c presents regression models for different properties like current strength, off-kinetics, and spectral properties.\n\n Panel d outlines the workflow of the study, showing the progression from training data through classification and regression models to characterization and validation.\n\n Panel e shows the training set data for different properties across various ChR types or conditions.\n\n Panel f contains sca

In [11]:
tool_response = query_tool(
    """Figure 2"""
)
print(tool_response.content)


Based on the information provided, I can describe Figure 2 as follows:

Figure 2 contains multiple panels labeled a through e:

a) Shows current traces and corresponding cell images for different ChR variants.

b) Bar graphs showing photocurrent strength with different wavelength excitation.

d) A line graph showing normalized photocurrent vs wavelength (nm) for different ChR variants.

e) Two scatter plots showing peak photocurrent and steady-state photocurrent vs intensity (mW mm⁻²) for different ChR variants.

This description is pieced together from the parsed text descriptions of individual panels provided for Figure 2. The information comes from multiple instances where "Figure 2" is mentioned, each describing different parts of the figure.

I did not get this information from a single coherent caption or description, but rather had to combine details from multiple mentions of Figure 2's components scattered throughout the provided text.

There's no comprehensive caption provided

In [12]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='fbc8b3f1-f5a8-4ada-b30f-bd403d02f8c1', embedding=None, metadata={'image_path': 'storage/data_images/63bb490d-05d2-4c24-9646-8baf1c7157dd-page_17.jpg', 'page_num': 22, 'window': 'Bedbrook et al.  Page 22\n\nFigure 2\n\n[Figure caption start]\nFigure 2.\n The model-predicted ChRs exhibit a large range of functional properties often far exceeding the parents.  (a) Representative current traces after 0.5 s light exposure for select designer ChR variants with corresponding expression and localization in HEK cells.  Vertical colored scale bar for each ChR current trace represents 500 pA, and horizontal scale bar represents 250 ms.  The variant color presented in (a) is constant throughout panels.  (b) Measured peak and steady-state photocurrent with different wavelengths of light in HEK cells (n = 4–8 cells, see Dataset 2). ', 'original_text': 'Page 22\n\nFigure 2\n\n[Figure caption start]\nFigure 2.\n'}, excluded_embed_metadata_keys=['window', 'original_tex

In [46]:
from pprint import pprint
pprint(tool_response.raw_input)

{'input': 'In Panel d, plot number 1, what values are taken by the independent '
          'variable Wavelength (nm)?\n'
          '    If the variable is not quantitative (like an image), only set '
          'the name field of IndependentVariable.\n'
          '    IMPORTANT: \n'
          '        DO NOT JUST READOUT AXIS TICK VALUES, REPORT THE VALUES OF '
          'THE DATA POINTS!!!\n'
          '        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL '
          'OTHER INDEPENDENT OR DEPENDENT VARIABLES!!\n'
          '        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO '
          'THE POSITION OF EACH DATAPOINT!!\n'
          '    Return your answer as structured data.\n'
          '    \n'
          '\n'
          '\n'
          "Here's a JSON schema to follow:\n"
          '{{"properties": {{"name": {{"title": "Name", "type": "string"}}, '
          '"values": {{"items": {{}}, "title": "Values", "type": "array"}}, '
          '"unit": {{"default": "

In [47]:
query_engine = fig_handler.query_engine(top_k = 5)

Error while parsing the file './storage/tmp/.DS_Store': Currently, only the following file types are supported: ['.pdf', '.602', '.abw', '.cgm', '.cwk', '.doc', '.docx', '.docm', '.dot', '.dotm', '.hwp', '.key', '.lwp', '.mw', '.mcw', '.pages', '.pbd', '.ppt', '.pptm', '.pptx', '.pot', '.potm', '.potx', '.rtf', '.sda', '.sdd', '.sdp', '.sdw', '.sgl', '.sti', '.sxi', '.sxw', '.stw', '.sxg', '.txt', '.uof', '.uop', '.uot', '.vor', '.wpd', '.wps', '.xml', '.zabw', '.epub', '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg', '.tiff', '.webp', '.htm', '.html', '.xlsx', '.xls', '.xlsm', '.xlsb', '.xlw', '.csv', '.dif', '.sylk', '.slk', '.prn', '.numbers', '.et', '.ods', '.fods', '.uos1', '.uos2', '.dbf', '.wk1', '.wk2', '.wk3', '.wk4', '.wks', '.123', '.wq1', '.wq2', '.wb1', '.wb2', '.wb3', '.qpw', '.xlr', '.eth', '.tsv']
Current file type: 
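
The error above is raised for the macOS `.DS_Store` file sitting in `./storage/tmp`; its file type is reported as empty because the name has no extension. One possible guard is to list only files with a supported extension before parsing. The sketch below uses a small subset of the extensions named in the error message; `parseable_files` is a hypothetical helper, and the notebook does not show whether `MultimodalDirectoryHandler` accepts an explicit file list.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Subset of the extensions listed in the parser's error message.
SUPPORTED = {".pdf", ".docx", ".txt", ".jpg", ".jpeg", ".png"}


def parseable_files(dirpath: str, supported: Iterable[str] = SUPPORTED) -> List[Path]:
    """Return the files in dirpath whose extension is in the supported set."""
    allowed = {ext.lower() for ext in supported}
    return sorted(
        p for p in Path(dirpath).iterdir()
        # Path(".DS_Store").suffix is "", so dotfiles like it are skipped.
        if p.is_file() and p.suffix.lower() in allowed
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (".DS_Store", "paper.pdf", "notes.TXT", "data.bin"):
        (Path(tmp) / name).touch()
    kept = [p.name for p in parseable_files(tmp)]

print(kept)
```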
Started parsing the file under job_id 9f8e0009-6585-49a0-b7f4-ea0aae09dd9f
> Image for page 1: [{'name': 'page_1.jpg', 'height': 0, 'width': 0, 'x': 0, 'y'

In [48]:
result = query_engine.query("""In Panel d, plot number 1, what values are taken by the independent variable Wavelength (nm)?
    IMPORTANT:
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS. ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!""")

In [51]:
pprint(result.response)

('To answer this question, I need to examine Panel d in Figure 1, which is '
 'described in the image content for page 20. However, after carefully '
 "reviewing the information provided, I don't find any specific details about "
 'wavelength values in Panel d of Figure 1.\n'
 '\n'
 'The figure caption for Figure 1 describes Panel d as follows:\n'
 '"Panel d outlines the workflow of the study, showing the progression from '
 'training data through classification and regression models to '
 'characterization and validation."\n'
 '\n'
 "This description doesn't mention any plot with wavelength as an independent "
 'variable.\n'
 '\n'
 "Additionally, I've searched through the provided text for any mention of "
 "specific wavelength values that might correspond to Panel d, but I don't "
 'find any such information.\n'
 '\n'
 "It's possible that the question is referring to a different figure or panel "
 "that isn't fully described in the provided information. The closest relevant "
 'infor

In [41]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='2ec8b35e-d3fe-489c-a8da-a59b16876693', embedding=None, metadata={'images': [{'image_path': './storage/data_images/103e836e-2bdf-4cea-9b41-62c4e7e48539-page_12.jpg', 'page_num': 12}], 'window': 'Bedbrook et al.                                                                                                       Page 12\n\n## Light delivery and imaging\n\nPatch-clamp recordings were done with short light pulses to measure photocurrents.  Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for th

In [38]:
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from pprint import pprint
pprint(
    "\n\n".join(
            [r.metadata['window'] + "\n\n" for r in tool_response.raw_output.source_nodes]
        )
)

('Bedbrook et '
 'al.                                                                                                       '
 'Page 12\n'
 '\n'
 '## Light delivery and imaging\n'
 '\n'
 'Patch-clamp recordings were done with short light pulses to measure '
 'photocurrents.  Light pulse duration, wavelength, and power were varied '
 'depending on the experiment (as described in the text).  Light pulses were '
 'generated using a Lumencor SPECTRAX light engine.  The illumination/output '
 'spectra for each color were measured (Supplemental Figure 5).  To evaluate '
 'normalized green photocurrent, we measured photocurrent strength at three '
 'wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) '
 '546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light '
 'intensity was matched for these measurements, with 481 nm light at 2.3 mW '
 'mm−2, 546 nm light at 2.8 mW mm−2, and 640 nm light at 2.2 mW mm−2.  For '
 'full spectra measurements depicted in Figur

In [34]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")

In [36]:
processed_nodes = postprocessor.postprocess_nodes(tool_response.raw_output.source_nodes)
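
Conceptually, a metadata-replacement step with `target_metadata_key="window"` swaps each node's text for the value stored under that metadata key, so the LLM sees the wider sentence window instead of the single retrieved sentence. The sketch below illustrates that idea with plain dicts. It is not the LlamaIndex implementation, `replace_text_with_metadata` is a hypothetical helper, and it returns copies rather than modifying nodes in place.

```python
from typing import Dict, List


def replace_text_with_metadata(nodes: List[Dict], key: str) -> List[Dict]:
    """Return copies of the nodes with text replaced by metadata[key] when present."""
    replaced = []
    for node in nodes:
        # Fall back to the node's own text if the key is missing.
        text = node["metadata"].get(key, node["text"])
        replaced.append({**node, "text": text})
    return replaced


nodes = [
    {
        "text": "Light pulses were generated using a Lumencor SPECTRAX light engine.",
        "metadata": {
            "window": (
                "Patch-clamp recordings were done with short light pulses to "
                "measure photocurrents. Light pulses were generated using a "
                "Lumencor SPECTRAX light engine."
            ),
        },
    },
    {"text": "A node without a stored window.", "metadata": {}},
]

expanded = replace_text_with_metadata(nodes, "window")
for node in expanded:
    print(node["text"])
```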

In [46]:
tool_response.raw_output.source_nodes

[NodeWithScore(node=TextNode(id_='2ec8b35e-d3fe-489c-a8da-a59b16876693', embedding=None, metadata={'images': [{'image_path': './storage/data_images/103e836e-2bdf-4cea-9b41-62c4e7e48539-page_12.jpg', 'page_num': 12}], 'window': 'Bedbrook et al.                                                                                                       Page 12\n\n## Light delivery and imaging\n\nPatch-clamp recordings were done with short light pulses to measure photocurrents.  Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for th

In [45]:
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from pprint import pprint
pprint(
    "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) + "\n\n" for r in tool_response.raw_output.source_nodes]
        )
)

("images: [{'image_path': "
 "'./storage/data_images/103e836e-2bdf-4cea-9b41-62c4e7e48539-page_12.jpg', "
 "'page_num': 12}]\n"
 '\n'
 'Bedbrook et '
 'al.                                                                                                       '
 'Page 12\n'
 '\n'
 '## Light delivery and imaging\n'
 '\n'
 'Patch-clamp recordings were done with short light pulses to measure '
 'photocurrents.  Light pulse duration, wavelength, and power were varied '
 'depending on the experiment (as described in the text).  Light pulses were '
 'generated using a Lumencor SPECTRAX light engine.  The illumination/output '
 'spectra for each color were measured (Supplemental Figure 5).  To evaluate '
 'normalized green photocurrent, we measured photocurrent strength at three '
 'wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) '
 '546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light '
 'intensity was matched for these measurements, with 481 nm light 

In [54]:
docs

[TextNode(id_='7a584ce4-6ec9-479b-952d-0f2fb623f008', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_1.jpg', 'page_num': 1}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# HHS Public Access\nAuthor manuscript\nNat Methods. Author manuscript; available in PMC 2020 April 14.\n\nPublished in final edited form as:\nNat Methods. 2019 November ; 16(11): 1176–1184. doi:10.1038/s41592-019-0583-8.\n\n## Machine learning-guided channelrhodopsin engineering enables minimally-invasive optogenetics\n\nClaire N. Bedbrook1, Kevin K. Yang2,†, J. Elliott Robinson1,†, Elisha D. Mackey1, Viviana Gradinaru1,*, Frances H. Arnold1,2,*\n\n1Division of Biology and Biological Engineering; California Institute of Technology; Pasadena; California; USA\n\n2Division of Chemistry and Chemical Engineering; California Institute of Technology; Pasadena; California; USA\n\n### Abstract\n\nWe engineered light-gated channel

In [58]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(docs)

In [60]:
len(nodes)

107
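
The splitter above caps each chunk at 256 tokens and carries 20 tokens of overlap between neighbouring chunks. As a rough illustration of that idea only (not the library's algorithm, which is sentence-aware and uses a real tokenizer), the sketch below slides a fixed window over whitespace-separated words. The name `split_with_overlap` and the word-level counting are assumptions made for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size words.

    Consecutive windows share chunk_overlap words. Simplified, hypothetical
    stand-in: it counts whitespace-separated words rather than tokens and
    ignores sentence boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

A smaller `chunk_size` yields more, shorter chunks; the overlap keeps a sentence that straddles a boundary visible from both sides.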

In [61]:
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

In [76]:
from llama_index.core import Settings

# Sentence-window parser: one node per sentence, with up to 5 neighbouring
# sentences on each side stored under the "window" metadata key.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=5,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# Baseline parser: the default sentence splitter, registered globally.
text_splitter = SentenceSplitter()
Settings.text_splitter = text_splitter

In [None]:
Settings.embed_model

In [77]:
nodes = node_parser.get_nodes_from_documents(docs)

In [67]:
len(nodes)

594
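
The jump from 107 nodes to 594 is consistent with the window parser emitting one node per sentence rather than one per 256-token chunk. A minimal sketch of that layout, assuming a naive regex sentence splitter and a symmetric window of `window_size` sentences on each side; `sentence_windows` is a hypothetical helper, not the library implementation.

```python
import re


def sentence_windows(text: str, window_size: int) -> list[dict]:
    """Return one record per sentence, with its neighbours stored as metadata.

    Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    Real parsers use a more careful sentence tokenizer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append(
            {
                "text": sentence,  # what gets embedded and matched
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[lo:hi]),  # context for the LLM
                },
            }
        )
    return records


records = sentence_windows("A one. B two. C three. D four.", window_size=1)
for record in records:
    print(record["text"], "|", record["metadata"]["window"])
```

Retrieval matches against the single sentence, while the wider window is what is later handed to the model as context.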

In [68]:
base_nodes = text_splitter.get_nodes_from_documents(docs)
len(docs)

26

In [107]:

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

# base_index = VectorStoreIndex(base_nodes)

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

In [79]:
window_response = query_engine.query(
    """In Panel d, plot number 1, what values are taken by the independent variable Wavelength (nm)?
    IMPORTANT: 
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS, ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!"""
)
print(window_response)

Based on the detailed methods description provided, the wavelengths used for measuring photocurrents in the full spectra measurements, which likely correspond to Panel d, are:

397 nm, 439 nm, 481 nm, 523 nm, 546 nm, 567 nm, and 640 nm

These seven wavelengths were used with a 0.5 second light pulse for each color, with light intensity matched across wavelengths at 1.3 mW mm⁻². The text specifically mentions that these measurements are depicted in Figure 2e, but given that Panel d is also discussing photocurrent strength across wavelengths, it's reasonable to assume these same wavelengths were used.


In [72]:
window_response.source_nodes

[NodeWithScore(node=TextNode(id_='dbae37dc-abf9-4274-a0ab-0df41ae19f37', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_12.jpg', 'page_num': 12, 'window': 'The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spectra measurements depicted in Figure 2e, we measured photocurrents at seven different wavelengths (peak ± half width half maximum): (red) 640 ± 3 nm, (yellow) 567 ± 13 nm, (green) 546 ± 16 nm, (teal) 523 ± 6 nm, (cyan) 481 ± 3 nm LED, (blue) 439 ± 8 nm LED, and (violet) 397 ± 3 nm with a 0.5 s light pulse for each c

In [74]:
# Baseline: index the plain SentenceSplitter nodes for comparison.
base_index = VectorStoreIndex(base_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = base_query_engine.query(
     """In Panel d, plot number 1, what values are taken by the independent variable Wavelength (nm)?
    IMPORTANT: 
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS, ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!"""
)
print(vector_response)

I apologize, but I cannot provide specific wavelength values for Panel d, plot number 1 based on the given context information. The context does not contain detailed data about wavelength values for that particular plot. 

The context mentions wavelength sensitivity in panel (e), but does not provide specific wavelength values. It only states that wavelength sensitivity of activation was compared for select ChRs and parental ChRs, without giving the actual wavelength values used.

Without seeing the actual figure or having more detailed information about the specific data points, I cannot accurately report the wavelength values used in the experiment. If you have access to the full figure or additional data, that would be needed to determine the precise wavelength values.


In [80]:
window_response = query_engine.query(
    """In Panel d, plot number 1, which opsin variant is signified by each color?
    IMPORTANT: 
        YOU HAVE ACCESS TO IMAGES TOO. BE SURE TO LOOK AT THE FIGURES.
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS, ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!"""
)
print(window_response)

I apologize, but I cannot provide specific information about which opsin variant is signified by each color in Panel d, plot number 1. The context information does not provide detailed color-coding for specific opsin variants in that particular plot. The image description and surrounding text do not give a clear breakdown of color assignments to individual opsin variants for that specific part of the figure. Without more detailed information about the color scheme used in that exact plot, I cannot confidently state which colors represent which opsin variants.


In [81]:
window_response.source_nodes

[NodeWithScore(node=TextNode(id_='33d8d565-0124-41af-843e-f8e876c75fe2', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_12.jpg', 'page_num': 12, 'window': 'Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spectra measurements depicted in Figure 2e, we measured photocurrents at seven different wavelengths (peak ± half width half maximum): (re

In [108]:
retriever = sentence_index.as_retriever(
    similarity_top_k=5,
    # the target key defaults to `window` to match the node_parser's default
    # node_postprocessors=[
    #     MetadataReplacementPostProcessor(target_metadata_key="window")
    # ],
)

In [109]:
window_response = retriever.retrieve(
    """In Panel d, plot number 1, what values are taken by the independent variable Wavelength (nm)?
    IMPORTANT: 
        BE SURE TO EXAMINE THE TEXT THAT REFERENCES THIS PANEL AND NEARBY PANELS, ESPECIALLY THE METHODS AND RESULTS SECTIONS.
        DO NOT JUST READ OUT AXIS TICK VALUES, REPORT THE VALUES OF THE DATA POINTS!!!
        INCLUDE ALL VALUES EVEN IF THEY ARE NOT MEASURED FOR ALL OTHER INDEPENDENT OR DEPENDENT VARIABLES!!
        THE VALUES MAY NOT BE EVENLY SPACED, PAY CLOSE ATTENTION TO THE POSITION OF EACH DATAPOINT!!"""
)
print(window_response)

[NodeWithScore(node=TextNode(id_='33d8d565-0124-41af-843e-f8e876c75fe2', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_12.jpg', 'page_num': 12, 'window': 'Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spectra measurements depicted in Figure 2e, we measured photocurrents at seven different wavelengths (peak ± half width half maximum): (re

In [110]:
[node for node in window_response]

[NodeWithScore(node=TextNode(id_='33d8d565-0124-41af-843e-f8e876c75fe2', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_12.jpg', 'page_num': 12, 'window': 'Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spectra measurements depicted in Figure 2e, we measured photocurrents at seven different wavelengths (peak ± half width half maximum): (re

In [111]:
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")

In [112]:
processed_nodes = postprocessor.postprocess_nodes(window_response)

In [113]:
[node for node in window_response]

[NodeWithScore(node=TextNode(id_='33d8d565-0124-41af-843e-f8e876c75fe2', embedding=None, metadata={'image_path': 'storage/data_images/8b86f7b0-5429-4ff8-aaf7-8fef28c8c362-page_12.jpg', 'page_num': 12, 'window': 'Light pulse duration, wavelength, and power were varied depending on the experiment (as described in the text).  Light pulses were generated using a Lumencor SPECTRAX light engine.  The illumination/output spectra for each color were measured (Supplemental Figure 5).  To evaluate normalized green photocurrent, we measured photocurrent strength at three wavelengths (peak ± half width at half maximum): (red) 640 ± 3 nm, (green) 546 ± 16 nm, and (cyan) 481 ± 3 nm with a 0.5 s light pulse.  Light intensity was matched for these measurements, with 481 nm light at 2.3 mW mm⁻², 546 nm light at 2.8 mW mm⁻², and 640 nm light at 2.2 mW mm⁻².  For full spectra measurements depicted in Figure 2e, we measured photocurrents at seven different wavelengths (peak ± half width half maximum): (re

In [114]:
[node.node.metadata.keys() for node in window_response]

[dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text'])]

In [115]:
[node.node.metadata.keys() for node in processed_nodes]

[dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text']),
 dict_keys(['image_path', 'page_num', 'window', 'original_text'])]
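
The metadata keys are identical before and after post-processing, which is what one would expect if the post-processor only swaps each node's text for the value stored under `window` and leaves the metadata dict alone. A minimal sketch of that behaviour, assuming in-place mutation; `Node` and `replace_text_with_metadata` are hypothetical stand-ins, and the sample sentences are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a retrieved node: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def replace_text_with_metadata(nodes: list[Node], target_key: str = "window") -> list[Node]:
    """Overwrite each node's text with metadata[target_key], if present.

    The metadata dict is left untouched and the same Node objects are
    returned (assumed in-place mutation).
    """
    for node in nodes:
        node.text = node.metadata.get(target_key, node.text)
    return nodes


retrieved = [
    Node(
        text="Light intensity was matched.",
        metadata={
            "window": "Pulses were 0.5 s. Light intensity was matched. Seven wavelengths were used.",
            "original_text": "Light intensity was matched.",
        },
    )
]
keys_before = sorted(retrieved[0].metadata)
processed = replace_text_with_metadata(retrieved)

print(sorted(processed[0].metadata) == keys_before)  # metadata keys are unchanged
print(processed[0].text)                              # text is now the window
print(retrieved[0] is processed[0])                   # same object, mutated in place
```

Under this assumption, comparing `node.node.text` before and after is a more informative check than comparing metadata keys.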

In [1]:
from llmsherpa.readers import LayoutPDFReader

In [2]:
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
pdf_url = "/Users/loyalshababo/dev/plotreader/sandbox/storage/tmp/nihms-1538039.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
# pdf_url = "/Users/loyalshababo/dev/plotreader/sandbox/storage/README.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

In [3]:
len(doc.sections())

37

In [16]:
chunk = doc.chunks()[0]

In [17]:
chunk.block_json

{'bbox': [90.0, 77.0, 439.29, 99.0],
 'block_class': 'cls_5',
 'block_idx': 3,
 'level': 2,
 'page_idx': 0,
 'sentences': ['Published in final edited form as: Nat Methods.',
  '2019 November ; 16(11): 1176–1184.',
  'doi:10.1038/s41592-019-0583-8.'],
 'tag': 'para'}

In [34]:
from llmsherpa.readers.layout_reader import Paragraph


def to_markdown(doc):
    """Convert a parsed llmsherpa document to a Markdown string.

    Each section title becomes a heading whose depth follows the section
    level; paragraphs directly under a section are emitted as body text.
    List items and tables are skipped for now.
    """
    parts = []

    for section in doc.sections():
        # Section level 0 maps to '#', level 1 to '##', and so on.
        parts.append(f"{'#' * (section.level + 1)} {section.to_text()}")

        for child in section.children:
            if isinstance(child, Paragraph):
                parts.append(child.to_text())
            # ListItem and Table children are intentionally left out.

    return "\n\n".join(parts).strip()

In [35]:
print(to_markdown(doc))


# HHS Public Access

# Author manuscript

## Nat Methods. Author manuscript; available in PMC 2020 April 14.

Published in final edited form as: Nat Methods.
2019 November ; 16(11): 1176–1184.
doi:10.1038/s41592-019-0583-8.

# Machine learning-guided channelrhodopsin engineering enables minimally-invasive optogenetics

## Claire N. Bedbrook1, Kevin K. Yang2,†, J. Elliott Robinson1,†, Elisha D. Mackey1, Viviana Gradinaru1,*, Frances H. Arnold1,2,*

1Division of Biology and Biological Engineering; California Institute of Technology; Pasadena; California; USA 2Division of Chemistry and Chemical Engineering; California Institute of Technology; Pasadena; California; USA

## Abstract

We engineered light-gated channelrhodopsins (ChRs) whose current strength and light sensitivity enable minimally-invasive neuronal circuit interrogation.
Current ChR tools applied to the mammalian brain require intracranial surgery for transgene delivery and implantation of invasive fiber-optic cables to produc

In [14]:
# Use a plain loop: a list comprehension around print() only builds a list of None.
for chunk in doc.chunks():
    print(chunk)

<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2690>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec26c0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2720>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2780>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec27b0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec27e0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2810>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2840>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2870>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec28a0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec28d0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2900>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2990>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec29c0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x118ec2a20>
<llmsherpa

In [4]:

for section in doc.sections():
    # Indent one tab per nesting level (plus one) so the outline reads as a tree.
    prefix = "\t" * (section.level + 1)
    print(f"{section.page_idx}{prefix}Section: {section.title} --- {len(section.paragraphs())} paragraphs")

0	Section: HHS Public Access --- 0 paragraphs
0	Section: Author manuscript --- 1 paragraphs
0		Section: Nat Methods. Author manuscript; available in PMC 2020 April 14. --- 1 paragraphs
0	Section: Machine learning-guided channelrhodopsin engineering enables minimally-invasive optogenetics --- 83 paragraphs
0		Section: Claire N. Bedbrook1, Kevin K. Yang2,†, J. Elliott Robinson1,†, Elisha D. Mackey1, Viviana Gradinaru1,*, Frances H. Arnold1,2,* --- 1 paragraphs
0		Section: Abstract --- 1 paragraphs
0		Section: Introduction --- 9 paragraphs
2		Section: Results --- 23 paragraphs
2			Section: Functional characterization of ChR variants for machine learning --- 2 paragraphs
2			Section: Training Gaussian process (GP) classification and regression models --- 4 paragraphs
2				Section: 4.5 Å in the C1C2 crystal structure (3UG9 --- 3 paragraphs
3			Section: Selection of designer ChRs using trained models --- 4 paragraphs
4			Section: Sequence and structural determinants of ChR functional propert

In [129]:
from llmsherpa.readers.layout_reader import Section
from llama_index.core.schema import TextNode
from llama_index.core.ingestion import IngestionPipeline
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor


from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

import plotreader

In [121]:
text_nodes = [
    TextNode(
        text = chunk.to_text(),
        metadata = {
            "parsed_parent_titles": ">".join([parent.title for parent in chunk.parent_chain()[1:] if isinstance(parent, Section) and parent.level != 0]),
            "page_num": chunk.page_idx,
            "num_sentences": len(chunk.sentences)
        },
    )
    for chunk in doc.chunks()
]


In [109]:


class NodeMetadata(BaseModel):
    """Node metadata."""

    is_aux_text: bool = Field(
        ...,
        description=(
            "Is this text extracted from a part of the document that is not useful (e.g. headers, footers, titles...)"
        )
    )
    entities: list[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A one sentence summary of this text chunk."
    )
    fig_or_panel_refs: list[str] = Field(
        ...,
        description=(
            "The names of any figures or panels that are referenced explicitly or implicitly."
        ),
    )

In [113]:
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

# Text-only structured extraction; EXTRACT_TEMPLATE_STR above is currently unused.
openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
    llm=plotreader._GPT4O_TEXT,
)

program_extractor = PydanticProgramExtractor(
    program=openai_program, input_key="input", show_progress=True
)

In [122]:
new_nodes = program_extractor.process_nodes(text_nodes)

100%|██████████| 135/135 [01:48<00:00,  1.25it/s]


In [124]:
good_nodes = []
bad_nodes = []

for node in new_nodes:
    if node.metadata['is_aux_text']:
        bad_nodes.append(node)
    else:
        good_nodes.append(node)
        print(node.metadata["summary"])
        print(node.metadata["fig_or_panel_refs"])

Engineered light-gated channelrhodopsins (ChRs) enable minimally-invasive neuronal circuit interrogation with high-photocurrent and light sensitivity, facilitating optogenetics without invasive implants.
[]
Channelrhodopsins (ChRs) are light-gated ion channels found in photosynthetic algae, and their transgenic expression in the brain enables light-dependent neuronal activation, making them useful tools in neuroscience research.
[]
Limitations of available ChRs restrict various optogenetic applications due to broad activation spectra, high-intensity light requirements, and low conductance, confining activation to small brain tissue volumes.
[]
Engineering ChRs to overcome limits in conductance and light sensitivity and extend the reach of optogenetic experiments requires overcoming three major challenges.
[]
Diverse ChRs have been published, including variants discovered from nature, engineered through recombination and mutagenesis, and resulting from rational design, but predicting fu

In [131]:
len(good_nodes)

77

In [132]:
len(bad_nodes)

58
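
With 77 of 135 nodes kept, the extracted `fig_or_panel_refs` field can be turned into a lookup from figure or panel name to the text that mentions it. A minimal sketch, assuming each node is reduced to a plain dict with the metadata fields defined in `NodeMetadata`; `index_by_figure_ref` and the sample summaries are hypothetical.

```python
from collections import defaultdict


def index_by_figure_ref(nodes: list[dict]) -> dict[str, list[str]]:
    """Map each normalised figure/panel reference to the summaries citing it.

    Nodes flagged as auxiliary text are skipped. References are lower-cased
    and stripped so that 'Figure 2e' and 'figure 2e ' collapse to one key.
    """
    index = defaultdict(list)
    for node in nodes:
        meta = node["metadata"]
        if meta.get("is_aux_text"):
            continue
        for ref in meta.get("fig_or_panel_refs", []):
            index[ref.strip().lower()].append(meta["summary"])
    return dict(index)


# Made-up nodes for illustration only.
sample_nodes = [
    {"metadata": {"is_aux_text": False, "summary": "Seven wavelengths were measured.",
                  "fig_or_panel_refs": ["Figure 2e"]}},
    {"metadata": {"is_aux_text": False, "summary": "Spectra were compared.",
                  "fig_or_panel_refs": ["figure 2e ", "Figure 2d"]}},
    {"metadata": {"is_aux_text": True, "summary": "Citation info.",
                  "fig_or_panel_refs": ["Figure 2d"]}},
]
figure_index = index_by_figure_ref(sample_nodes)
print(figure_index)
```

An index like this would let a query about a specific panel pull the referencing text directly instead of relying on embedding similarity alone.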

In [125]:
bad_nodes[0]

TextNode(id_='c08f1bbe-4dbf-4627-9373-9956082e73b4', embedding=None, metadata={'parsed_parent_titles': 'Nat Methods. Author manuscript; available in PMC 2020 April 14.', 'page_num': 0, 'num_sentences': 3, 'is_aux_text': True, 'entities': [], 'summary': 'This section contains publication details and citation information.', 'fig_or_panel_refs': []}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Published in final edited form as: Nat Methods.\n2019 November ; 16(11): 1176–1184.\ndoi:10.1038/s41592-019-0583-8.', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n', metadata_template='{key}: {value}', metadata_seperator='\n')

In [128]:
for node in bad_nodes:
    print(node.text)
    print(node.metadata["fig_or_panel_refs"])


Published in final edited form as: Nat Methods.
2019 November ; 16(11): 1176–1184.
doi:10.1038/s41592-019-0583-8.
[]
1Division of Biology and Biological Engineering; California Institute of Technology; Pasadena; California; USA 2Division of Chemistry and Chemical Engineering; California Institute of Technology; Pasadena; California; USA
[]
†K.K.Y.
& J.E.R.
contributed equally to this work.
[]
Author Contributions C.N.B., K.K.Y., V.G., and F.H.A.
conceptualized the project.
C.N.B.
coordinated all experiments and data analysis.
C.N.B.
and K.K.Y.
built machine-learning models.
C.N.B.
performed construct design and cloning.
C.N.B.
and E.D.M.
performed AAV production.
E.D.M.
prepared cultured neurons.
C.N.B and J.E.R.
conducted electrophysiology.
C.N.B.
and J.E.R.
performed injections.
J.E.R.
performed fiber cannula implants and behavioral experiments.
C.N.B.
performed all data analysis.
C.N.B.
wrote the manuscript with input and editing from all authors.
V.G.
supervised optogenetics/electr

In [25]:
type(doc)

llmsherpa.readers.layout_reader.Document

In [32]:
for paragraph in doc.root_node.paragraphs():
    print(paragraph.parent.title)

Nat Methods. Author manuscript; available in PMC 2020 April 14.
Claire N. Bedbrook1, Kevin K. Yang2,†, J. Elliott Robinson1,†, Elisha D. Mackey1, Viviana Gradinaru1,*, Frances H. Arnold1,2,*
Abstract
Introduction
Introduction
Introduction
Introduction
Introduction
Introduction
Introduction
Introduction
Introduction
Functional characterization of ChR variants for machine learning
Functional characterization of ChR variants for machine learning
Training Gaussian process (GP) classification and regression models
4.5 Å in the C1C2 crystal structure (3UG9
4.5 Å in the C1C2 crystal structure (3UG9
4.5 Å in the C1C2 crystal structure (3UG9
Selection of designer ChRs using trained models
Selection of designer ChRs using trained models
Selection of designer ChRs using trained models
Selection of designer ChRs using trained models
Sequence and structural determinants of ChR functional properties
Machine-guided search identifies ChRs with a range of useful functional properties
Machine-guided sea

In [33]:
for section in doc.sections():
    print(len(section.paragraphs()))

0
1
1
83
1
1
9
23
2
4
3
4
1
5
2
5
2
36
3
2
3
2
1
1
1
2
18
2
1
1
1
1
1
9
1
1
7


In [42]:
from groundx import Groundx, ApiException
import os

groundx = Groundx(
  api_key=os.environ['GROUNDX_API_KEY'],
)

In [None]:
groundx.buckets.list()

In [44]:
response = groundx.buckets.create(
    name="test_papers"
)

In [45]:
response.body

{'bucket': {'bucketId': 11481, 'name': 'test_papers'}}

In [46]:
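# Open the PDF in a context manager so the file handle is closed after upload.
with open(pdf_url, "rb") as pdf_file:
    response = groundx.documents.ingest_local(
        body=[
            {
                "blob": pdf_file,
                "metadata": {
                    "bucketId": 11481,
                    "fileName": 'nihms-1538039',
                    "fileType": "pdf",
                    "searchData": {},
                },
            },
        ]
    )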

In [47]:
response.body

{'ingest': {'processId': 'd57a270d-fe50-4257-9830-95b96168eed9',
  'status': 'queued'}}

In [54]:
response = groundx.documents.get_processing_status_by_id(
    process_id=response.body['ingest']['processId']
)

In [55]:
response.body

{'ingest': {'processId': 'd57a270d-fe50-4257-9830-95b96168eed9',
  'progress': {'complete': {'documents': [{'documentId': '3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04',
      'fileName': 'nihms-1538039',
      'fileSize': '626 KB',
      'fileTokens': 148795,
      'fileType': 'pdf',
      'bucketId': 11481,
      'processId': 'd57a270d-fe50-4257-9830-95b96168eed9',
      'searchData': {},
      'sourceUrl': 'https://upload.groundx.ai/prod/file/039538b8-e220-4e11-8047-6d43fde4bd66/nihms-1538039',
      'status': 'complete',
      'xrayUrl': 'https://upload.eyelevel.ai/layout/processed/d57a270d-fe50-4257-9830-95b96168eed9/3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04-xray.json'}],
    'total': 1}},
  'status': 'complete'}}
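
The status moved from `queued` to `complete` between two manual checks. A small polling helper can replace re-running the status cell by hand. This is a sketch only: `wait_until_complete` is hypothetical, it takes any zero-argument callable that returns a body shaped like the output above, and it treats every status other than `complete` as still pending, since only `queued` and `complete` were observed here.

```python
import time
from typing import Callable


def wait_until_complete(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until body['ingest']['status'] == 'complete'.

    Raises TimeoutError if the status never becomes 'complete' within
    max_polls attempts.
    """
    for _ in range(max_polls):
        body = fetch_status()
        if body["ingest"]["status"] == "complete":
            return body
        sleep(poll_seconds)
    raise TimeoutError(f"ingest not complete after {max_polls} polls")


# Demo with canned responses instead of a real API call.
fake_bodies = iter([
    {"ingest": {"status": "queued"}},
    {"ingest": {"status": "queued"}},
    {"ingest": {"status": "complete"}},
])
final = wait_until_complete(lambda: next(fake_bodies), poll_seconds=0.0, sleep=lambda s: None)
print(final["ingest"]["status"])
```

In the notebook, `fetch_status` could wrap the status call and return `response.body`, keeping the process id in a separate variable so it is not overwritten.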

In [56]:
response = groundx.documents.get(
    document_id="3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04"
)

In [57]:
response.body

{'document': {'documentId': '3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04',
  'fileName': 'nihms-1538039',
  'fileSize': '626 KB',
  'fileTokens': 148795,
  'fileType': 'pdf',
  'bucketId': 11481,
  'processId': 'd57a270d-fe50-4257-9830-95b96168eed9',
  'searchData': {},
  'sourceUrl': 'https://upload.groundx.ai/prod/file/039538b8-e220-4e11-8047-6d43fde4bd66/nihms-1538039',
  'status': 'complete',
  'xrayUrl': 'https://upload.eyelevel.ai/layout/processed/d57a270d-fe50-4257-9830-95b96168eed9/3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04-xray.json'}}

In [62]:
import urllib3

doc_json = urllib3.request("GET", response.body['document']['xrayUrl'])

In [66]:
doc_json.json()

{'chunks': [{'boundingBoxes': [{'bottomRightX': 1088,
     'bottomRightY': 96,
     'pageNumber': 2,
     'topLeftX': 1042,
     'topLeftY': 79},
    {'bottomRightX': 264,
     'bottomRightY': 94,
     'pageNumber': 2,
     'topLeftX': 163,
     'topLeftY': 79},
    {'bottomRightX': 1064,
     'bottomRightY': 1497,
     'pageNumber': 1,
     'topLeftX': 186,
     'topLeftY': 1462},
    {'bottomRightX': 1084,
     'bottomRightY': 1459,
     'pageNumber': 1,
     'topLeftX': 187,
     'topLeftY': 1342},
    {'bottomRightX': 330,
     'bottomRightY': 1336,
     'pageNumber': 1,
     'topLeftX': 186,
     'topLeftY': 1323},
    {'bottomRightX': 523,
     'bottomRightY': 1321,
     'pageNumber': 1,
     'topLeftX': 187,
     'topLeftY': 1303},
    {'bottomRightX': 724,
     'bottomRightY': 1303,
     'pageNumber': 1,
     'topLeftX': 191,
     'topLeftY': 1287},
    {'bottomRightX': 1074,
     'bottomRightY': 1278,
     'pageNumber': 1,
     'topLeftX': 187,
     'topLeftY': 1243},
    {'bo

In [67]:
parsed_doc = doc_json.json()

In [90]:
len(parsed_doc['chunks'])

31

In [71]:
first_chunk = parsed_doc['chunks'][0]

In [74]:
figures = []

for chunk in parsed_doc['chunks']:
    if 'figure' in chunk['contentType']:
        figures.append(chunk)

In [77]:
figures[1]

{'boundingBoxes': [{'bottomRightX': 1186.7144,
   'bottomRightY': 1034.8567,
   'pageNumber': 22,
   'topLeftX': 217.97604,
   'topLeftY': 136.88692}],
 'chunk': 'whq1zm-0',
 'contentType': ['figure'],
 'json': [{'description': 'The model-predicted ChRs exhibit a large range of functional properties often far exceeding the parents.',
   'title': 'Figure 2',
   'type': 'graphic'},
  {'ChR_variants': ['CheRiff',
    'CsChrimR',
    'C1C2',
    '11_10',
    '12_10',
    '25_9',
    '10_10',
    '15_10',
    '28_10',
    '21_10',
    '3_10'],
   'color_coding': {'10_10': 'yellow',
    '11_10': 'cyan',
    '12_10': 'pink',
    '15_10': 'light blue',
    '21_10': 'purple',
    '25_9': 'blue',
    '28_10': 'orange',
    '3_10': 'red',
    'C1C2': 'gray',
    'CheRiff': 'gray',
    'CsChrimR': 'black'},
   'current_traces': '0.5 s light exposure',
   'expression_localization': 'HEK cells',
   'horizontal_scale_bar': '250 ms',
   'section': 'a',
   'vertical_scale_bar': '500 pA'},
  {'measured_

In [78]:
# Print the content type of each chunk to see how the document was segmented
for chunk in parsed_doc['chunks']:
    print(chunk['contentType'])


['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['paragraph']
['figure']
['paragraph']
['table']
['paragraph']
['figure']
['paragraph']
['figure']
['paragraph']
['figure']
['paragraph']
['table']
['paragraph']
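
The listing above shows that each chunk carries a `contentType` list. Rather than writing one filter loop per type, the chunks can be bucketed in a single pass. The following is a minimal sketch, assuming every chunk is a dict with a `contentType` list as in the output above; `group_chunks_by_type` and `sample_chunks` are hypothetical names used for illustration, and the real input would be `parsed_doc['chunks']`.

```python
from collections import defaultdict


def group_chunks_by_type(chunks):
    """Map each content type to the list of chunks tagged with it."""
    grouped = defaultdict(list)
    for chunk in chunks:
        # 'contentType' is a list, so a chunk can land in more than one bucket
        for content_type in chunk.get('contentType', []):
            grouped[content_type].append(chunk)
    return dict(grouped)


# Toy chunks shaped like the parser output (stand-in for parsed_doc['chunks'])
sample_chunks = [
    {'chunk': 'a', 'contentType': ['paragraph']},
    {'chunk': 'b', 'contentType': ['figure']},
    {'chunk': 'c', 'contentType': ['table']},
    {'chunk': 'd', 'contentType': ['figure']},
]

grouped = group_chunks_by_type(sample_chunks)
counts = {content_type: len(items) for content_type, items in grouped.items()}
print(counts)
```

With the real document, `grouped['figure']` and `grouped['table']` would play the role of the `figures` and `tables` lists built in the neighbouring cells.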


In [85]:
parsed_doc['chunks'][2]['suggestedText'].split('\n\n')

['Using the channelrhodopsin (ChR) sequence, structure, and functional data as inputs, we trained Gaussian Process (GP) classification and regression models (Figure 1). GP models successfully predicted thermostability, substrate binding affinity, and kinetics for several soluble enzymes, and more recently, ChR membrane localization. For a detailed description of the GP model architecture used for protein engineering, see references 8 and 24. Briefly, these models infer predictive values for new sequences from training examples by assuming that similar inputs (ChR sequence variants) will have similar outputs (photocurrent properties). ',
 "To quantify the relatedness of inputs (ChR sequence variants), we compared both sequence and structure. ChR sequence information is encoded in the amino acid sequence. For structural comparisons, we convert the 3D crystal-structural information into a 'contact map' that is convenient for modeling. Two residues are considered to be in contact and poten

In [86]:
tables = []

for chunk in parsed_doc['chunks']:
    if 'table' in chunk['contentType']:
        tables.append(chunk)

In [88]:
tables[1]

{'boundingBoxes': [{'bottomRightX': 1065,
   'bottomRightY': 677,
   'pageNumber': 26,
   'topLeftX': 711,
   'topLeftY': 564}],
 'chunk': 'l51eoh-0',
 'contentType': ['table'],
 'json': [{'frame_0': 'image_data_1',
   'frame_22': 'image_data_2',
   'frame_45': 'image_data_3',
   'light_condition': '447 nm'},
  {'frame_0': 'image_data_4',
   'frame_22': 'image_data_5',
   'frame_45': 'image_data_6',
   'light_condition': '671 nm'}],
 'multimodalUrl': 'https://upload.eyelevel.ai/layout/raw/prod/d57a270d-fe50-4257-9830-95b96168eed9/3bbdc9f9-47b9-4e13-8e0e-cd8e939bdd04/table-26-0.jpg',
 'narrative': ['Under the 447 nm light condition, the images at frame 0, frame 22, and frame 45 are displayed. Under the 671 nm light condition, the images at frame 0, frame 22, and frame 45 are displayed.'],
 'pageNumbers': [26],
 'sectionSummary': 'The document focuses on the development and optimization of channelrhodopsins (ChRs) using machine learning techniques for optogenetic applications. The main t

In [91]:
parsed_doc['chunks'][4]

{'boundingBoxes': [{'bottomRightX': 1085,
   'bottomRightY': 210,
   'pageNumber': 7,
   'topLeftX': 311,
   'topLeftY': 130},
  {'bottomRightX': 1086,
   'bottomRightY': 96,
   'pageNumber': 7,
   'topLeftX': 1042,
   'topLeftY': 80},
  {'bottomRightX': 264,
   'bottomRightY': 94,
   'pageNumber': 7,
   'topLeftX': 163,
   'topLeftY': 79},
  {'bottomRightX': 64,
   'bottomRightY': 1421,
   'pageNumber': 6,
   'topLeftX': 39,
   'topLeftY': 1213},
  {'bottomRightX': 63,
   'bottomRightY': 1065,
   'pageNumber': 6,
   'topLeftX': 39,
   'topLeftY': 861},
  {'bottomRightX': 63,
   'bottomRightY': 715,
   'pageNumber': 6,
   'topLeftX': 38,
   'topLeftY': 508},
  {'bottomRightX': 63,
   'bottomRightY': 363,
   'pageNumber': 6,
   'topLeftX': 39,
   'topLeftY': 157},
  {'bottomRightX': 804,
   'bottomRightY': 1505,
   'pageNumber': 6,
   'topLeftX': 358,
   'topLeftY': 1489},
  {'bottomRightX': 1083,
   'bottomRightY': 1442,
   'pageNumber': 6,
   'topLeftX': 311,
   'topLeftY': 1276},
  {
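
The chunk above has bounding boxes on both page 6 and page 7, so it can help to group the boxes by page and look at their sizes before deciding which regions matter. This is a rough sketch, assuming the coordinates are in a shared page coordinate system with the origin at the top left (the units are not stated in the output); `boxes_by_page` and `sample_chunk` are hypothetical names, and the sample values are copied from the output above.

```python
from collections import defaultdict


def boxes_by_page(chunk):
    """Group a chunk's bounding boxes by page as (width, height) tuples."""
    grouped = defaultdict(list)
    for box in chunk.get('boundingBoxes', []):
        width = box['bottomRightX'] - box['topLeftX']
        height = box['bottomRightY'] - box['topLeftY']
        grouped[box['pageNumber']].append((width, height))
    return dict(grouped)


# Two boxes taken from the chunk shown above (stand-in for parsed_doc['chunks'][4])
sample_chunk = {
    'boundingBoxes': [
        {'bottomRightX': 1085, 'bottomRightY': 210, 'pageNumber': 7,
         'topLeftX': 311, 'topLeftY': 130},
        {'bottomRightX': 64, 'bottomRightY': 1421, 'pageNumber': 6,
         'topLeftX': 39, 'topLeftY': 1213},
    ]
}

page_boxes = boxes_by_page(sample_chunk)
print(page_boxes)
```

Narrow, tall boxes such as the 25 x 208 one on page 6 are plausibly margin text, but that is a guess from the geometry alone and should be checked against the page image.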

In [133]:
test = "[Excerpt from document]\nimages: []\npage_num: 8\nis_aux_text: False\nentities: ['ChRger2', 'optogenetic intracranial self-stimulation', 'oICSS', 'dopaminergic neurons', 'ventral tegmental area', 'VTA', 'rAAV-PHP.eB', 'DIO', 'ChR2(H134R)', 'Dat-Cre mice', '447 nm laser', 'ChRger2', 'ChR2(H134R)']\nsummary: The study evaluated the optogenetic efficiency of ChRger2 in Dat-Cre mice using systemic delivery and optogenetic intracranial self-stimulation of VTA dopaminergic neurons.\npage_numbers: [8]\nfig_or_panel_refs: ['Figure 4a']\nfigures_on_page: []\nExcerpt:\n-----\nWe next evaluated the optogenetic efficiency of ChRger2 after systemic delivery using optogenetic intracranial self-stimulation (oICSS) of dopaminergic neurons of the ventral tegmental area (VTA)32.\nWe systemically delivered rAAV-PHP.eB packaging a double- floxed inverted open reading frame (DIO) containing either ChRger2 or ChR2(H134R) into Dat-Cre mice (Figure 4a and Supplemental Table 5).\nThree weeks after systemic delivery and stereotaxic implantation of fiber-optic cannulas above the VTA, mice were placed in an operant box and were conditioned to trigger a burst of 447 nm laser stimulation via nose poke.\nAnimals expressing ChRger2 displayed robust optogenetic self-stimulation in a frequency-dependent and laser power-dependent manner.\nHigher frequencies (up to 20 Hz) and higher light power (up to 10 mW) promoted greater maximum operant response rates (Figure 4a).\nConversely, laser stimulation failed to reinforce operant responding in ChR2(H134R)-expressing animals (Figure 4a); these results were consistent with results in acute slice where the light-induced currents of ChR2(H134R) are too weak at the low copy number produced by systemic delivery for robust neuronal activation.\n-----"

In [134]:
print(test)

[Excerpt from document]
images: []
page_num: 8
is_aux_text: False
entities: ['ChRger2', 'optogenetic intracranial self-stimulation', 'oICSS', 'dopaminergic neurons', 'ventral tegmental area', 'VTA', 'rAAV-PHP.eB', 'DIO', 'ChR2(H134R)', 'Dat-Cre mice', '447 nm laser', 'ChRger2', 'ChR2(H134R)']
summary: The study evaluated the optogenetic efficiency of ChRger2 in Dat-Cre mice using systemic delivery and optogenetic intracranial self-stimulation of VTA dopaminergic neurons.
page_numbers: [8]
fig_or_panel_refs: ['Figure 4a']
figures_on_page: []
Excerpt:
-----
We next evaluated the optogenetic efficiency of ChRger2 after systemic delivery using optogenetic intracranial self-stimulation (oICSS) of dopaminergic neurons of the ventral tegmental area (VTA)32.
We systemically delivered rAAV-PHP.eB packaging a double- floxed inverted open reading frame (DIO) containing either ChRger2 or ChR2(H134R) into Dat-Cre mice (Figure 4a and Supplemental Table 5).
Three weeks after systemic delivery and ste
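
The printed excerpt follows a simple layout: `key: value` header lines, then an `Excerpt:` marker, then the body between two `-----` rules. A minimal sketch of splitting such a string back into metadata and body is shown below, assuming that layout holds for every excerpt; `parse_excerpt` and `sample` are hypothetical names, and `sample` is a shortened stand-in for the `test` string above.

```python
import ast


def parse_excerpt(text):
    """Split an excerpt string into a metadata dict and the body text."""
    header, _, rest = text.partition('Excerpt:\n')
    metadata = {}
    for line in header.splitlines():
        key, sep, value = line.partition(': ')
        if not sep:
            continue  # skip lines without a key, e.g. '[Excerpt from document]'
        try:
            # Turn values such as "[8]", "False" or "8" into Python objects
            metadata[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            metadata[key] = value  # free text such as the summary stays a string
    # The body sits between the two '-----' rules
    parts = rest.split('-----')
    body = parts[1].strip() if len(parts) >= 3 else rest.strip()
    return metadata, body


sample = (
    "[Excerpt from document]\nimages: []\npage_num: 8\nis_aux_text: False\n"
    "summary: The study evaluated ChRger2 in Dat-Cre mice.\n"
    "fig_or_panel_refs: ['Figure 4a']\nExcerpt:\n-----\n"
    "First sentence.\nSecond sentence.\n-----"
)

metadata, body = parse_excerpt(sample)
print(metadata['fig_or_panel_refs'], metadata['page_num'])
print(body)
```

Using `ast.literal_eval` keeps the parsing safe for list-like fields; a body that itself contains a `-----` line would be cut short, which this sketch does not handle.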