In [1]:
from pathlib import Path
import json
import random

from pprint import pprint

import notebook_bootstrap
import config

In [2]:
FILES = {
    'md': config.MD_JSONL,
    'txt': config.TXT_JSONL,
}
KEYWORDS = ("introduction", "detail", "prelim", "result")
NUMBER_RANDOM_ELEMENTS = 20

In [3]:
def load_matches(path, keywords=KEYWORDS):
    """Return the JSONL records whose section or section_path contains a keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    with path.open("r", encoding="utf-8") as fh:
        for raw_line in fh:
            line = raw_line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed rows
            # Test each field separately so that a keyword cannot match across
            # the boundary of the two strings; treat missing or null fields as empty.
            fields = (record.get("section") or "", record.get("section_path") or "")
            if any(kw in field.lower() for kw in lowered for field in fields):
                matches.append(record)
    return matches

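def sample_wo_replacement(collection, num_elements):
    """Return up to num_elements items drawn from collection without replacement."""
    if num_elements > len(collection):
        # Fewer items than requested: return a copy so callers never alias the input.
        return list(collection)
    return random.sample(collection, num_elements)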
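
A minimal sketch for sanity-checking the keyword filter without touching the real data files. The helper `record_matches`, the file name `toy.jsonl`, and the three toy records are hypothetical and not part of the notebook; the helper applies the same case-insensitive substring test as `load_matches`, assuming each field is checked on its own.

```python
import json
import tempfile
from pathlib import Path

KEYWORDS = ("introduction", "detail", "prelim", "result")


def record_matches(record, keywords=KEYWORDS):
    """Hypothetical helper: True if a keyword occurs in section or section_path."""
    fields = (record.get("section") or "", record.get("section_path") or "")
    return any(kw.lower() in field.lower() for kw in keywords for field in fields)


# Toy records (invented for illustration only).
rows = [
    {"section": "Introduction", "text": "a"},
    {"section": "Proof of Theorem 1", "section_path": "Main Results/Proof", "text": "b"},
    {"section": "Acknowledgements", "text": "c"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl"
    # Append one malformed row to confirm that it is skipped.
    lines = [json.dumps(row) for row in rows] + ["{not valid json"]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    kept = []
    with path.open("r", encoding="utf-8") as fh:
        for raw_line in fh:
            try:
                record = json.loads(raw_line)
            except json.JSONDecodeError:
                continue  # skip malformed rows
            if record_matches(record):
                kept.append(record["text"])

print(kept)
```

Because the test is a plain substring match, a short keyword such as "prelim" also matches "Preliminaries", and "result" matches a `section_path` like "Main Results/Proof" even when `section` itself does not contain it.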

In [4]:
config.MD_JSONL

PosixPath('/home/rajinder-mavi/code/SME/data/data_etl/md_data.jsonl')

In [5]:
md_matches = load_matches(FILES["md"])
md_sample = sample_wo_replacement(md_matches, NUMBER_RANDOM_ELEMENTS)
txt_matches = load_matches(FILES["txt"])
txt_sample = sample_wo_replacement(txt_matches, NUMBER_RANDOM_ELEMENTS)
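
The samples above change on every run because the global `random` state is unseeded. If repeatable samples are wanted when comparing the `md` and `txt` chunks, one option is a seeded generator. This is only a sketch: `sample_with_seed` and the placeholder items are hypothetical and not part of the notebook.

```python
import random


def sample_with_seed(collection, num_elements, seed=0):
    """Hypothetical helper: seeded sampling without replacement."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    k = min(num_elements, len(collection))  # cap at the population size
    return rng.sample(list(collection), k)


items = [f"chunk-{i}" for i in range(100)]  # placeholder data, not real chunk ids
first = sample_with_seed(items, 20, seed=42)
second = sample_with_seed(items, 20, seed=42)
print(first == second)  # identical seeds give identical samples
print(len(sample_with_seed(items[:5], 20)))  # never more items than exist
```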

In [9]:
md_matches[0]['text']

'# Introduction\n\nIn this paper we study one-dimensional Schrödinger operators of the form\n\n$$\\label{operator}\nH = -\\frac{d^2}{dx^2} + V ( x ),$$ acting on $L^2({\\mathbb R})$, with some real-valued $L^1_{\\rm loc}$-potential $V$. We will be particularly interested in potentials of the form\n\n$$\\label{potential}\nV ( x ) = V_1 ( x )  + V_2 ( x \\alpha + \\theta ),$$ where we assume that $V_1$ and $V_2$ are $1$-periodic and locally integrable, and $\\alpha , \\theta \\in [0,1)$. If $\\alpha = \\frac{p}{q}$ is rational, then the potential $V$ is $q$-periodic and $H$ has purely absolutely continuous spectrum. If $\\alpha$ is irrational, then the potential is quasiperiodic and the spectral theory of $H$ is far from trivial; compare .'

In [7]:
md_sample

[{'chunk_id': '1909.04429v1::L25-32::s2',
  'paper_id': '1909.04429v1',
  'source_file': 'data/data_etl/full_markdown/1909.04429v1.md',
  'section': 'Introduction',
  'labels': ['zero'],
  'refs': [],
  'neighbors': [{'id': '1909.04429v1::L15-24::s1', 'direction': 'previous'},
   {'id': '1909.04429v1::L33-40::s3', 'direction': 'next'},
   {'id': '1909.04429v1::L33-40::s3', 'direction': 'comment'},
   {'id': '1909.04429v1::L41-47::s4', 'direction': 'comment'},
   {'id': '1909.04429v1::L65-66::s7', 'direction': 'comment'},
   {'id': '1909.04429v1::L67-69::s8', 'direction': 'comment'},
   {'id': '1909.04429v1::L214-241::s0', 'direction': 'comment'}],
  'start_line': 25,
  'end_line': 32,
  'text': '</div>\n\nOf course, it only makes sense to discuss upper bounds on the Hausdorff dimension of a set on the real line once its Lebesgue measure is shown to be zero. The Aubry-Andre conjecture stated that the measure of the spectrum of $H_{\\alpha,\\theta,\\lambda}$ is equal to $4|1-|\\lambda||,

In [8]:
txt_sample

[{'chunk_id': '1712.04700v1::L2214-2291::c35',
  'paper_id': '1712.04700v1',
  'source_file': 'data/data_etl/full_text/1712.04700v1.txt',
  'section_path': 'Preliminaries',
  'section': 'Preliminaries',
  'start_line': 2214,
  'end_line': 2291,
  'chunk_type': 'math_body',
  'text': 'and hence\n\nIn view of Proposition 18 of, there exists , with\nsuch that\n\nSince , we have\n\nand then, by , one has\n\nBy  of Lemma , we have\n\nThen, under the condition (), we have\n\nconsequently,  the cocycle  is uniformly hyperbolic, and , which means that .\nIn particular, if ,  by  of Lemma ,  we have\n\nSimilarly as above, under the condition (), we have\n.\n\nApplications of the criterion - upper bound\n\nAs the first application of Theorem ,  for discrete  quasi-periodic Schrodinger operator with small potential, we get exponentially decaying upper bounds on the size of spectral gaps.  As we mentioned before, the result is perturbative for a multifrequency. However, it is non-perturbative in t