# Advanced Querying Techniques in Elasticsearch for BiorXiv Data Analysis

Elasticsearch, renowned for its speed and scalability, is an indispensable tool for data scientists and researchers working with large datasets like those from BiorXiv. BiorXiv provides a rich corpus of preprint publications in the life sciences, offering a wealth of data for analysis. In this guide, we delve into sophisticated querying techniques to extract meaningful insights from BiorXiv data using Elasticsearch. We'll explore everything from basic keyword searches to complex aggregations and filters.

### Establishing a Secure Connection to Elasticsearch


Establishing a secure connection to your Elasticsearch cluster is paramount. This ensures that your data interactions are encrypted and protected. Here’s a brief refresher on setting up a secure connection:

In [2]:
import ssl
import json
from elasticsearch import Elasticsearch
from typing import List, Dict, Any
    

# Path to the CA certificate
ca_cert_path = '/workspace/repos/osl/rxiv-restapi/containers/esconfig/certs/http_ca.crt'
# Create an SSL context
ssl_context = ssl.create_default_context(cafile=ca_cert_path)
# Create a connection to Elasticsearch with authentication and SSL context
es = Elasticsearch(
    ["https://es:9200"],
    basic_auth=("elastic", "worksfine"),
    ssl_context=ssl_context
)
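
Hardcoding credentials, as in the cell above, is convenient in a notebook but easy to leak when the notebook is shared. The following is a minimal sketch of reading the same settings from the environment instead. The helper name `es_kwargs_from_env` and the variable names `ES_URL`, `ES_USER`, `ES_PASSWORD`, and `ES_CA_CERT` are assumptions made for illustration, not part of the original setup.

```python
import os
import ssl
from typing import Any, Dict, Mapping, Optional


def es_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for `Elasticsearch(...)` from environment variables.

    The variable names (ES_URL, ES_USER, ES_PASSWORD, ES_CA_CERT) are
    hypothetical; rename them to match your deployment.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, Any] = {
        "hosts": [env.get("ES_URL", "https://es:9200")],
        # ES_PASSWORD has no default: a missing password raises KeyError
        "basic_auth": (env.get("ES_USER", "elastic"), env["ES_PASSWORD"]),
    }
    ca_cert = env.get("ES_CA_CERT")
    if ca_cert:
        # Only build an SSL context when a CA certificate path is provided
        kwargs["ssl_context"] = ssl.create_default_context(cafile=ca_cert)
    return kwargs
```

With the variables exported, the client could then be created as `es = Elasticsearch(**es_kwargs_from_env())`.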

### Index JSON Data into Elasticsearch
Before we can search, the data must be indexed in Elasticsearch. Here's a simple function that reads a JSON file and indexes its contents. It assumes that the JSON data is an array of objects, each representing one document to be indexed.

In [3]:

def index_json_data(es: Elasticsearch, file_path: str, index_name: str) -> None:
    """
    Reads data from a JSON file and indexes it into Elasticsearch.

    Parameters:
    es (Elasticsearch): An Elasticsearch client instance.
    file_path (str): The path to the JSON file.
    index_name (str): The name of the Elasticsearch index where data will be stored.
    """
    # Load JSON data from the file
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Assuming `data` is a list of documents
    for doc in data:
        # Index each document
        res = es.index(index=index_name, document=doc)
        # print(res['result'])


### Indexing Data into Elasticsearch

In [4]:
%%time

# Assuming index_json_data is a previously defined function that indexes data from a JSON file to Elasticsearch
# Path to the JSON file containing data to be indexed
file_path = '/workspace/repos/osl/rxiv-restapi/docs/notebooks/data/biorxiv_2022-01-01_2024-01-11.json'
# Name of the Elasticsearch index
index_name = 'biorxiv'
# Index data from the specified JSON file into Elasticsearch
index_json_data(es, file_path, index_name)

CPU times: user 3min 41s, sys: 19.1 s, total: 4min
Wall time: 25min 48s
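
Most of that wall time is likely spent on one HTTP round trip per document. A common alternative is the `bulk` helper from `elasticsearch.helpers`, which sends documents in batches. The sketch below is not the method used in this notebook. The function names `generate_actions` and `bulk_index_json_data` are hypothetical, and no timing is claimed for it.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def generate_actions(docs: Iterable[Dict[str, Any]], index_name: str) -> Iterator[Dict[str, Any]]:
    """Yield one bulk 'index' action per document."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index_json_data(es: Any, file_path: str, index_name: str) -> Tuple[int, Any]:
    """Index a JSON array of documents in batches using the bulk helper."""
    # Imported here so generate_actions can be used without the client installed
    from elasticsearch.helpers import bulk

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    # bulk() returns (number of successful actions, list of errors)
    return bulk(es, generate_actions(data, index_name))
```

Because the actions come from a generator, only the loaded JSON list is held in memory, not a second list of actions.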


## Querying Analyses for BiorXiv Data


### Create a Search Function

After indexing the data, we can create a function to perform searches using the Elasticsearch client.

**Note**: This function, `search_data`, retrieves all documents matching the query using **Elasticsearch's Scroll API**, which is suited to retrieving large result sets, and returns them as a single list. The default `page_size` of 9999 stays just under Elasticsearch's default limit of 10,000 results per request (`index.max_result_window`).

In [104]:

def search_data(es: Elasticsearch, index_name: str, query: Dict[str, Any], page_size: int = 9999) -> List[Dict[str, Any]]:
    """
    Performs a search query in an Elasticsearch index and returns all documents matching the query.

    Parameters:
    - es (Elasticsearch): An Elasticsearch client instance.
    - index_name (str): The name of the Elasticsearch index to search in.
    - query (dict): The search query in Elasticsearch Query DSL format.
    - page_size (int): The number of results to return per page.

    Returns:
    - List[Dict[str, Any]]: A list of all documents from the search results.
    """
    documents = []

    # Copy the query so the caller's dict is not mutated, then set the page size
    body = dict(query)
    body['size'] = page_size

    # Initialize the scroll; each request keeps the scroll context alive for 2 minutes
    response = es.search(index=index_name, body=body, scroll='2m')
    scroll_id = response['_scroll_id']
    hits = response['hits']['hits']

    # Keep scrolling until a page comes back empty
    while hits:
        documents.extend(hits)
        response = es.scroll(scroll_id=scroll_id, scroll='2m')
        scroll_id = response['_scroll_id']
        hits = response['hits']['hits']

    # Release the scroll context on the server once all pages have been read
    es.clear_scroll(scroll_id=scroll_id)

    return documents



In [105]:
%%time

query = {
    "query": {
        "match": {
            "abstract": "CRISPR"
        }
    }
}

all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 12470 

First result: {'doi': '10.1101/2024.01.05.574328', 'title': 'CRISPR-repressed toxin-antitoxin provides population-level immunity against diverse anti-CRISPR elements', 'authors': 'Li, M.; Shu, X.; Wang, R.; Li, Z.; Xue, Q.; Liu, C.; Cheng, F.; Zhao, H.; Wang, J.; Liu, J.; Hu, C.; Li, J.; Ouyang, S.', 'author_corresponding': 'Ming Li', 'author_corresponding_institution': 'Institute of Microbiology, CAS', 'date': '2024-01-05', 'version': '1', 'license': 'cc_no', 'category': 'Microbiology', 'jatsxml': 'https://www.biorxiv.org/content/early/2024/01/05/2024.01.05.574328.source.xml', 'abstract': 'Prokaryotic CRISPR-Cas systems are highly vulnerable to phage-encoded anti-CRISPR (Acr) factors. How CRISPR-Cas systems protect themselves remains unclear. Here, we uncovered a broad-spectrum anti-anti-CRISPR strategy involving a phage-derived toxic protein. Transcription of this toxin is normally reppressed by the CRISPR-Cas effector, but is activated to halt cell division wh
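
Each hit returned by `search_data` wraps the original document under the `_source` key, as the output above shows. For tabular analysis it is often handy to keep only a few fields. A minimal sketch follows. The helper name `hits_to_rows` and its default field list are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Sequence


def hits_to_rows(
    hits: List[Dict[str, Any]],
    fields: Sequence[str] = ("doi", "title", "date", "category"),
) -> List[Dict[str, Any]]:
    """Keep only the selected `_source` fields of each hit.

    Fields missing from a document are returned as None.
    """
    return [
        {field: hit.get("_source", {}).get(field) for field in fields}
        for hit in hits
    ]
```

The resulting list of flat dictionaries can be passed directly to tools that expect row-like records.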

### Keyword Searches: The Basics
#### Search documents by field values

The examples below use different parts of the Elasticsearch Query DSL (Domain Specific Language) to retrieve documents by specific criteria, from single keywords to combined conditions.

Each query can be passed to the `search_data` function defined above. Remember to replace `index_name` with the name of your own index when calling the function.

### Match Query


Keyword searches are the foundation of data retrieval in Elasticsearch. For instance, finding all BiorXiv papers related to CRISPR:


In [106]:
%%time 

query = {
      "query": {
          "match": {
              "abstract": "CRISPR"
          } 
      }         
}

all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 12470 

First result: {'doi': '10.1101/2024.01.05.574328', 'title': 'CRISPR-repressed toxin-antitoxin provides population-level immunity against diverse anti-CRISPR elements', 'authors': 'Li, M.; Shu, X.; Wang, R.; Li, Z.; Xue, Q.; Liu, C.; Cheng, F.; Zhao, H.; Wang, J.; Liu, J.; Hu, C.; Li, J.; Ouyang, S.', 'author_corresponding': 'Ming Li', 'author_corresponding_institution': 'Institute of Microbiology, CAS', 'date': '2024-01-05', 'version': '1', 'license': 'cc_no', 'category': 'Microbiology', 'jatsxml': 'https://www.biorxiv.org/content/early/2024/01/05/2024.01.05.574328.source.xml', 'abstract': 'Prokaryotic CRISPR-Cas systems are highly vulnerable to phage-encoded anti-CRISPR (Acr) factors. How CRISPR-Cas systems protect themselves remains unclear. Here, we uncovered a broad-spectrum anti-anti-CRISPR strategy involving a phage-derived toxic protein. Transcription of this toxin is normally reppressed by the CRISPR-Cas effector, but is activated to halt cell division wh
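
With a single word such as "CRISPR" the default behaviour is unambiguous, but for multi-word text a `match` query by default matches documents containing any of the analyzed terms. The sketch below builds a `match` query with an explicit `operator`, so that all terms can be required. The helper name `match_query` is hypothetical.

```python
from typing import Any, Dict


def match_query(field: str, text: str, operator: str = "or") -> Dict[str, Any]:
    """Build a match query; operator 'and' requires every analyzed term to match."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}
```

For example, `match_query("abstract", "CRISPR base editing", operator="and")` would only match abstracts containing all three terms.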

### Term Query
Retrieve documents where the `license` field exactly matches the specified value.


In [107]:
%%time

query = {
      "query": {
          "term": {
              "license": {
                  "value": "cc_by_nc_nd"
            }
        }
    }
}


all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 195728 

First result: {'doi': '10.1101/2023.09.21.558920', 'title': 'Automated customization of large-scale spiking network models to neuronal population activity', 'authors': 'Wu, S.; Huang, C.; Snyder, A.; Smith, M. A.; Doiron, B.; Yu, B.', 'author_corresponding': 'Shenghao Wu', 'author_corresponding_institution': 'Carnegie Mellon University', 'date': '2023-09-22', 'version': '1', 'license': 'cc_by_nc_nd', 'category': 'Neuroscience', 'jatsxml': 'https://www.biorxiv.org/content/early/2023/09/22/2023.09.21.558920.source.xml', 'abstract': 'Understanding brain function is facilitated by constructing computational models that accurately reproduce aspects of brain activity. Networks of spiking neurons capture the underlying biophysics of neuronal circuits, yet the dependence of their activity on model parameters is notoriously complex. As a result, heuristic methods have been used to configure spiking network models, which can lead to an inability to discover activity regim
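
Unlike `match`, a `term` query does not analyze the search value, so it looks for that exact token in the index. To accept several exact values at once, the related `terms` query takes a list. A small sketch, with the helper name `terms_query` being an assumption:

```python
from typing import Any, Dict, List


def terms_query(field: str, values: List[str]) -> Dict[str, Any]:
    """Build a terms query matching documents whose field equals any listed value."""
    if not values:
        raise ValueError("values must contain at least one entry")
    return {"query": {"terms": {field: list(values)}}}
```

For instance, `terms_query("license", ["cc_no", "cc_by_nc_nd"])` would match documents carrying either of the two license codes seen in the outputs above.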

### Range Query
Find documents published within a specific date range.


In [108]:
%%time

query = {
    "query": {
        "range": {
            "date": {
                "gte": "2022-12-01",
                "lte": "2022-12-31"
            }
        }
    }
}

all_results = search_data(es, index_name, query, page_size=100)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 20035 

First result: {'doi': '10.1101/2021.04.27.441649', 'title': 'Developmental diversity and unique sensitivity to injury of lung endothelial subtypes during a period of rapid postnatal growth', 'authors': 'Zanini, F.; Che, X.; Knutsen, C.; Liu, M.; Suresh, N. E.; Domingo-Gonzalez, R.; Dou, S. H.; Pryhuber, G. S.; Jones, R. C.; Quake, S. R.; Cornfield, D. N.; Alvira, C. M.', 'author_corresponding': 'Cristina M. Alvira', 'author_corresponding_institution': 'Division of Critical Care Medicine, Department of Pediatrics, Stanford University School of Medicine', 'date': '2022-12-21', 'version': '2', 'license': 'cc_by_nc_nd', 'category': 'Developmental Biology', 'jatsxml': 'https://www.biorxiv.org/content/early/2022/12/21/2021.04.27.441649.source.xml', 'abstract': 'At birth, the lung is still immature, heightening susceptibility to injury but enhancing regenerative capacity. Angiogenesis drives postnatal lung development. Therefore, we profiled the transcriptional ontogeny
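
Typing month boundaries by hand is error-prone, particularly for February. The sketch below derives the first and last day of a month with the standard library and builds the same kind of `range` query. The helper name `month_range_query` is hypothetical, and the `date` field name is taken from the example above.

```python
import calendar
from datetime import date
from typing import Any, Dict


def month_range_query(year: int, month: int, date_field: str = "date") -> Dict[str, Any]:
    """Build a range query covering one full calendar month (inclusive bounds)."""
    # monthrange returns (weekday of first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    return {
        "query": {
            "range": {
                date_field: {
                    "gte": date(year, month, 1).isoformat(),
                    "lte": date(year, month, last_day).isoformat(),
                }
            }
        }
    }
```

Calling `month_range_query(2022, 12)` reproduces the December 2022 query used in the cell above.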


### Bool Query
Combine multiple search criteria. For example, search for documents by a given author, in a specific category, and with a specific license.


In [109]:
%%time

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"authors": "Colman-Lerner"}},
                {"match": {"category": "Systems Biology"}},
                {"term": {"license": "cc_by_nc_nd"}}
            ]
        }
    }        
}


all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 40 

First result: {'doi': '10.1101/2022.10.06.511167', 'title': 'High selectivity of frequency induced transcriptional responses', 'authors': 'Givre, A.; Colman-Lerner, A.; Ponce-Dawson, S.', 'author_corresponding': 'Silvina Ponce-Dawson', 'author_corresponding_institution': 'School of Natural and Exact Sciences, University of Buenos Aires', 'date': '2022-10-07', 'version': '1', 'license': 'cc_by_nc_nd', 'category': 'Systems Biology', 'jatsxml': 'https://www.biorxiv.org/content/early/2022/10/07/2022.10.06.511167.source.xml', 'abstract': 'Cells continuously interact with their environment, detect its changes and generate responses accordingly. This requires interpreting the variations and, in many occasions, producing changes in gene expression. In this paper we use information theory and a simple transcription model to analyze the extent to which the resulting gene expression is able to identify and assess the intensity of extracellular stimuli when they are encoded in 
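
Because `match` analyzes its input and, by default, accepts any of the resulting terms, a clause such as `{"match": {"category": "Systems Biology"}}` can also match other categories that contain "Biology". The sketch below is a stricter variant that uses `match_phrase` for the author and exact `term` clauses in filter context. It assumes the index has a `category.keyword` sub-field, as the aggregation example in this guide does. The helper name is hypothetical.

```python
from typing import Any, Dict


def author_category_license_query(author: str, category: str, license_code: str) -> Dict[str, Any]:
    """Build a bool query: phrase match on the author, exact category and license."""
    return {
        "query": {
            "bool": {
                # Scored clause: the author's name must appear as a phrase
                "must": [{"match_phrase": {"authors": author}}],
                # Filter clauses do not affect scoring and require exact values
                "filter": [
                    {"term": {"category.keyword": category}},
                    {"term": {"license": license_code}},
                ],
            }
        }
    }
```

Example call: `author_category_license_query("Colman-Lerner", "Systems Biology", "cc_by_nc_nd")`.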

### Match Phrase Query
Use `match_phrase` to search for documents whose titles contain an exact sequence of words, in the given order.


In [110]:
%%time

query = {
    "query": {
        "match_phrase": {
            "title": "Carm1 regulates the speed of"
        }
    }
}

all_results = search_data(es, index_name, query, page_size=100)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 15 

First result: {'doi': '10.1101/2022.10.03.510647', 'title': 'Carm1 regulates the speed of C/EBPa-induced transdifferentiation by a cofactor stealing mechanism', 'authors': 'Garcia, G. T.; Kowenz-Leutz, E.; Tian, T. V.; Klonizakis, A.; Lerner, J.; De Andres-Aguayo, L.; Berenguer, C.; Carmona, M. P.; Casadesus, M. V.; Bulteau, R.; Francesconi, M.; Leutz, A.; Zaret, K. S.; Zaret, K. S.; Peiro, S.', 'author_corresponding': 'Achim Leutz', 'author_corresponding_institution': 'MDC, Berlin', 'date': '2022-10-04', 'version': '1', 'license': 'cc_by_nc_nd', 'category': 'Cell Biology', 'jatsxml': 'https://www.biorxiv.org/content/early/2022/10/04/2022.10.03.510647.source.xml', 'abstract': 'Cell fate decisions are driven by lineage-restricted transcription factors but how they are regulated is incompletely understood. The C/EBP-induced B cell to macrophage transdifferentiation (BMT) is a powerful system to address this question. Here we describe that C/EBP with a single arginine 
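
A phrase match can be relaxed with the `slop` parameter, which lets the phrase terms sit a limited number of positions apart. A minimal sketch of a builder, with the helper name `match_phrase_query` being an assumption:

```python
from typing import Any, Dict


def match_phrase_query(field: str, phrase: str, slop: int = 0) -> Dict[str, Any]:
    """Build a match_phrase query; slop=0 requires the exact word sequence."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}
```

For example, `match_phrase_query("title", "Carm1 regulates speed", slop=1)` would tolerate one intervening word such as "the".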


### Multi-Match Query
Search for the same text across multiple fields.


In [111]:
%%time

query = {
    "query": {
        "multi_match": {
            "query": "transcription",
            "fields": ["title", "abstract"]
        }
    }
}

all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 34881 

First result: {'doi': '10.1101/2022.11.21.517317', 'title': 'Targeted disruption of transcription bodies causes widespread activation of transcription', 'authors': 'Ugolini, M.; Kuznetsova, K.; Oda, H.; Kimura, H.; Vastenhouw, N. L.', 'author_corresponding': 'Nadine L. Vastenhouw', 'author_corresponding_institution': 'MPI-CBG, UNIL', 'date': '2022-11-21', 'version': '1', 'license': 'cc_by_nc_nd', 'category': 'Cell Biology', 'jatsxml': 'https://www.biorxiv.org/content/early/2022/11/21/2022.11.21.517317.source.xml', 'abstract': 'The localization of transcriptional activity in specialized transcription bodies is a hallmark of gene expression in eukaryotic cells. It remains unclear, however, if and how they affect gene expression. Here, we disrupted the formation of two prominent endogenous transcription bodies that mark the onset of zygotic transcription in zebrafish embryos and analysed the effect on gene expression using enriched SLAM-Seq and live-cell imaging. We
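
Fields in a `multi_match` query can be weighted with the `field^boost` syntax, so that a hit in the title counts more than one in the abstract. The sketch below builds such a query from a mapping of field names to boosts. The helper name `multi_match_query` is hypothetical and the boost values are only examples.

```python
from typing import Any, Dict


def multi_match_query(text: str, field_boosts: Dict[str, float]) -> Dict[str, Any]:
    """Build a multi_match query; a boost of 1 leaves the field name unchanged."""
    fields = [
        name if boost == 1 else f"{name}^{boost:g}"
        for name, boost in field_boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}
```

Example call: `multi_match_query("transcription", {"title": 2, "abstract": 1})`.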

### Advanced Filtering: Beyond Keywords

Filtering allows for more refined searches, such as retrieving documents within a specific date range or by particular authors, enhancing the precision of your data analysis.


In [112]:
%%time

query = {
    "query": {
        "bool": {
            "must": {
                "match": {"title": "Molecular Biology"}
            },
            "filter": {
                "range": {
                    # "date" holds the publication date
                    "date": {
                        "gte": "2022-01-01",
                        "lte": "2023-12-31"
                    }
                }
            }
        }
    }
}

all_results = search_data(es, index_name, query)

print("Total results:", len(all_results), "\n")
if all_results:
    print("First result:", all_results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

Total results: 10198 

First result: {'doi': '10.1101/2023.05.24.542151', 'title': 'POMBOX: a fission yeast toolkit for molecular and synthetic biology', 'authors': 'Hebra, T.; Smrckova, H.; Elkatmis, B.; Prevorovsky, M.; Pluskal, T.', 'author_corresponding': 'Tomas Pluskal', 'author_corresponding_institution': 'Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Praha, Czech Republic', 'date': '2023-05-24', 'version': '1', 'license': 'cc_by_nc_nd', 'category': 'Synthetic Biology', 'jatsxml': 'https://www.biorxiv.org/content/early/2023/05/24/2023.05.24.542151.source.xml', 'abstract': 'Schizosaccharomyces pombe is a popular model organism in molecular biology and cell physiology. With its ease of genetic manipulation and growth, supported by in-depth functional annotation in the PomBase database and genome-wide metabolic models, S. pombe is an attractive option for synthetic biology applications. However, S. pombe currently lacks modular tools for generatin

### Aggregation Query


Aggregations are pivotal for summarizing data, enabling the analysis of trends across thousands of documents. For example, an aggregation query to count publications by category:

In [113]:
def search_data_with_aggregation(es: Elasticsearch, index_name: str, query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Runs an aggregation query and returns the buckets of the 'categories' aggregation.

    Returns an empty list if the response has no aggregations or the search fails.
    """
    try:
        response = es.search(index=index_name, body=query)
        if 'aggregations' in response:
            # Safely access the 'categories' aggregation results
            categories_agg = response.get('aggregations', {}).get('categories', {}).get('buckets', [])
            return categories_agg  # Directly return the categories aggregation results
        else:
            print("No aggregations found in the response.")
            return []
    except Exception as e:
        print(f"Search failed: {e}")
        return []

In [115]:
query = {
    "size": 0,  # No hits, only aggregations
    "query": {
        "match_all": {}  # Or adjust to your specific matching needs
    },
    "aggs": {
        "categories": {
            "terms": {
                "field": "category.keyword",  # Adjust field name as needed
                "size": 10000  # Increase this to accommodate all expected buckets
            }
        }
    }
}


#### Aggregate data, such as counting documents by category.

The `search_data_with_aggregation` function checks whether the response contains an `aggregations` key and, if so, returns the buckets of the `categories` aggregation directly; otherwise it returns an empty list.

In [116]:
%%time

# Assuming 'es', 'index_name', and 'query' are properly defined
results = search_data_with_aggregation(es, index_name, query)

# Display the aggregation results
if results:
    for category in results:
        print(f"Category: {category['key']}, Count: {category['doc_count']}")
else:
    print("No results found.")

Category: Neuroscience, Count: 98616
Category: Microbiology, Count: 45626
Category: Bioinformatics, Count: 44355
Category: Cell Biology, Count: 30761
Category: Biophysics, Count: 25587
Category: Evolutionary Biology, Count: 25552
Category: Biochemistry, Count: 21820
Category: Immunology, Count: 21693
Category: Cancer Biology, Count: 21634
Category: Ecology, Count: 21539
Category: Genomics, Count: 21376
Category: Molecular Biology, Count: 20221
Category: Plant Biology, Count: 17687
Category: Bioengineering, Count: 16946
Category: Developmental Biology, Count: 15226
Category: Genetics, Count: 14602
Category: Systems Biology, Count: 10545
Category: Physiology, Count: 8716
Category: Animal Behavior And Cognition, Count: 8459
Category: Pharmacology And Toxicology, Count: 5535
Category: Synthetic Biology, Count: 4922
Category: Pathology, Count: 3186
Category: Zoology, Count: 2734
Category: Scientific Communication And Education, Count: 2195
Category: Paleontology, Count: 767
CPU times: user 
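
The same pattern extends to trends over time. The sketch below builds a `date_histogram` aggregation that counts documents per month and flattens the response buckets into pairs. It assumes the `date` field is mapped as a date type. The names `monthly_counts_query`, `buckets_to_counts`, and `per_month` are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Tuple


def monthly_counts_query(date_field: str = "date", agg_name: str = "per_month") -> Dict[str, Any]:
    """Build an aggregation-only query counting documents per calendar month."""
    return {
        "size": 0,  # No hits, only aggregations
        "aggs": {
            agg_name: {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": "month",
                    "format": "yyyy-MM",
                }
            }
        },
    }


def buckets_to_counts(response: Mapping[str, Any], agg_name: str) -> List[Tuple[str, int]]:
    """Flatten aggregation buckets into (label, doc_count) pairs."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # Date histograms expose a formatted label as key_as_string; terms buckets only have key
    return [(b.get("key_as_string", str(b["key"])), b["doc_count"]) for b in buckets]
```

The query can be run with `es.search(index=index_name, body=monthly_counts_query())` and the response passed to `buckets_to_counts(response, "per_month")`.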

---

### Generating and Executing an Elasticsearch Query

#### Create a function that generates Elasticsearch queries according to specified logic operators and a date range:

In [117]:
from typing import List, Union, Dict, Any

def generate_es_queries(logic_operators: List[Union[str, List[str]]], start_date: str, end_date: str, abstract_field: str = "abstract", date_field: str = "date") -> Dict[str, Any]:
    """
    Generates an Elasticsearch query based on logic operators and a date range.

    Parameters:
    - logic_operators (List[Union[str, List[str]]]): The search terms. Top-level strings are combined with AND;
      a nested list is a group of alternatives combined with OR, and the group as a whole is ANDed with the rest.
    - start_date (str): The start date in 'YYYY-MM-DD' format.
    - end_date (str): The end date in 'YYYY-MM-DD' format.
    - abstract_field (str): The document field to search for abstract text.
    - date_field (str): The document field that contains the date.

    Returns:
    - Dict[str, Any]: An Elasticsearch query in DSL format.
    """
    must_conditions = []  # To store AND conditions
    for operator in logic_operators:
        if isinstance(operator, list):  # Handle OR conditions
            should_conditions = [{"match": {abstract_field: term}} for term in operator]
            must_conditions.append({
                "bool": {"should": should_conditions, "minimum_should_match": 1}
            })
        else:  # Handle AND conditions
            must_conditions.append({"match": {abstract_field: operator}})
    
    # Add the date range as one more required (must) condition
    must_conditions.append({
        "range": {
            date_field: {  # Use the provided date field name
                "gte": start_date,
                "lte": end_date,
                "format": "yyyy-MM-dd"
            }
        }
    })
    
    # Construct the final query
    es_query = {
        "query": {
            "bool": {
                "must": must_conditions
            }
        }
    }
    
    return es_query


#### Build a query with `generate_es_queries` from a list of search terms and a date range, pretty-print it, run it with `search_data`, and display the results.


In [118]:
%%time

logic_operators = ['COVID-19', 'coronavirus', 'vaccine']

start_date = '2020-01-01'
end_date = '2024-12-31'
es_query = generate_es_queries(logic_operators, start_date, end_date)

# Pretty print the Elasticsearch query object
pretty_es_query = json.dumps(es_query, indent=4)
print(pretty_es_query)

page_size = 9999

results = search_data(es, index_name, es_query, page_size)

print("Total results:", len(results), "\n")
if results:
    print("First result:", results[0]['_source'])
else:
    print("No results found.")

print("-" * 25)

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "abstract": "COVID-19"
                    }
                },
                {
                    "match": {
                        "abstract": "coronavirus"
                    }
                },
                {
                    "match": {
                        "abstract": "vaccine"
                    }
                },
                {
                    "range": {
                        "date": {
                            "gte": "2020-01-01",
                            "lte": "2024-12-31",
                            "format": "yyyy-MM-dd"
                        }
                    }
                }
            ]
        }
    }
}
Total results: 394 

First result: {'doi': '10.1101/2023.05.24.541850', 'title': 'Cross-Protection Induced by Highly Conserved Human B, CD4+, and CD8+ T Cell Epitopes-Based Coronavirus Vaccine Against
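
The generated query above places the date range inside `must`, where it takes part in relevance scoring. A common variant puts it in `filter` context instead, which restricts results without affecting scores and allows Elasticsearch to cache the clause. The sketch below is a condensed variant under that assumption, not the function used in this notebook. The name `generate_filtered_query` is hypothetical, and it also shows a nested OR group.

```python
from typing import Any, Dict, List, Union


def generate_filtered_query(
    terms: List[Union[str, List[str]]],
    start_date: str,
    end_date: str,
    text_field: str = "abstract",
    date_field: str = "date",
) -> Dict[str, Any]:
    """Build a bool query: terms are ANDed, nested lists are OR groups, dates are filtered."""
    must: List[Dict[str, Any]] = []
    for term in terms:
        if isinstance(term, list):
            # A nested list is an OR group: at least one alternative must match
            must.append({
                "bool": {
                    "should": [{"match": {text_field: alt}} for alt in term],
                    "minimum_should_match": 1,
                }
            })
        else:
            must.append({"match": {text_field: term}})

    date_filter = {
        "range": {date_field: {"gte": start_date, "lte": end_date, "format": "yyyy-MM-dd"}}
    }
    return {"query": {"bool": {"must": must, "filter": [date_filter]}}}
```

For example, `generate_filtered_query(["vaccine", ["COVID-19", "SARS-CoV-2"]], "2022-01-01", "2023-12-31")` requires "vaccine" plus at least one of the two alternatives.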

---

### The BiorXiv data was downloaded with the medrxivr library

The JSON file used here is structured as a list of dictionaries, one per preprint record, downloaded with medrxivr for a date range from 2022 to 2024. A sample record:

```python
[
  {
    "doi": "10.1101/043794",
    "title": "Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet",
    "authors": "Viner, C.; Ishak, C. A.; Johnson, J.; Walker, N. J.; Shi, H.; Sjöberg-Herrera, M. K.; Shen, S. Y.; Lardo, S. M.; Adams, D. J.; Ferguson-Smith, A. C.; De Carvalho, D. D.; Hainer, S. J.; Bailey, T. L.; Hoffman, M. M.",
    "author_corresponding": "Michael M. Hoffman",
    "author_corresponding_institution": "Princess Margaret Cancer Centre, Toronto, ON, Canada",
    "date": "2022-07-29",
    "version": "2",
    "license": "cc_by_nc_nd",
    "category": "Bioinformatics",
    "jatsxml": "https://www.biorxiv.org/content/early/2022/07/29/043794.source.xml",
    "abstract": "Transcription factors bind DNA in specific sequence contexts. In addition to distinguishing one nucleobase from another, some transcription factors can distinguish between unmodified and modified bases. Current models of transcription factor binding tend not take DNA modifications into account, while the recent few that do often have limitations. This makes a comprehensive and accurate profiling of transcription factor affinities difficult.\n\nHere, we developed methods to identify transcription factor binding sites in modified DNA. Our models expand the standard A/C/G/T DNA alphabet to include cytosine modifications. We developed Cytomod to create modified genomic sequences and enhanced the Multiple EM for Motif Elicitation (MEME) Suite by adding the capacity to handle custom alphabets. We adapted the well-established position weight matrix (PWM) model of transcription factor binding affinity to this expanded DNA alphabet.\n\nUsing these methods, we identified modification-sensitive transcription factor binding motifs. We confirmed established binding preferences, such as the preference of ZFP57 and C/EBP{beta} for methylated motifs and the preference of c-Myc for unmethylated E-box motifs. Using known binding preferences to tune model parameters, we discovered novel modified motifs for a wide array of transcription factors. Finally, we validated predicted binding preferences of OCT4 using cleavage under targets and release using nuclease (CUT&RUN) experiments across conventional, methylation-, and hydroxymethylation-enriched sequences. Our approach readily extends to other DNA modifications. As more genome-wide single-base resolution modification data becomes available, we expect that our method will yield insights into altered transcription factor binding affinities across many different modifications.",
    "published": "NA",
    "node": 2,
    "link_page": "https://www.biorxiv.org/content/10.1101/043794v2?versioned=TRUE",
    "link_pdf": "https://www.biorxiv.org/content/10.1101/043794v2.full.pdf"
  },
]
```
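
In this structure every value except `node` is stored as a string, including the date, the version, and the full author list. A minimal sketch of normalizing one record before analysis follows. It assumes authors are separated by semicolons and dates use the `YYYY-MM-DD` form, as in the sample above. The helper name `parse_record` is hypothetical.

```python
from datetime import date
from typing import Any, Dict


def parse_record(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a record with authors as a list, date as a date, version as an int."""
    parsed = dict(doc)
    # "Viner, C.; Ishak, C. A." -> ["Viner, C.", "Ishak, C. A."]
    parsed["authors"] = [a.strip() for a in doc.get("authors", "").split(";") if a.strip()]
    parsed["date"] = date.fromisoformat(doc["date"])
    parsed["version"] = int(doc["version"])
    return parsed
```

The original dictionary is left unchanged, so the raw record remains available for indexing.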

---

### Final Note: The Impact of Advanced Querying on Research


Advanced querying techniques in Elasticsearch help researchers navigate and analyze the large repository of BiorXiv data with depth and precision. From basic keyword searches to filters, aggregations, and programmatically generated queries, Elasticsearch supports a broad view of the life sciences landscape. By using these querying capabilities, researchers can accelerate discovery, foster innovation, and contribute to the advancement of science.



---