File Operations

License: MIT | docs: latest | coverage: 92%

Python scripts to perform the following file operations:

  • Jump to a given line number in a file and read that line.
  • Read a large file as a stream of lines and keep only the lines that match some criterion.
  • Read a large file, filter the lines that match some criterion, and write the filtered lines to another file (sketched below).
  • Read a JSON input and load it into a Python object.
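
The streaming operations boil down to iterating over the file object lazily, which keeps memory usage constant regardless of file size. A minimal sketch of the filter-and-write and JSON operations (the function names here are illustrative, not the repo's actual API):

```python
import json

def filter_lines(src_path, dst_path, predicate):
    """Stream src_path line by line, writing the matching lines to dst_path."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:            # the file object yields lines lazily
            if predicate(line):
                dst.write(line)

def load_json(path):
    """Read a JSON file and return the corresponding Python object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example: keep only the lines that mention "abstract"
filter_lines("dump.xml", "filtered.txt", lambda line: "abstract" in line)
```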

Timing Results

The following timings were obtained by reading a Wikimedia abstracts dump file (a 5.8 GB .xml file with almost 75.6M lines; the file can be downloaded from here).

  • Adding line numbers to the file (sketched below):
    addLineNumber : 58.024850428 s
    addLineNumber_inplace : 103.272668963 s

  • Reading a line at a given line number:
    getline from the linecache module is not practical for large files.
    getLine uses enumerate() to read the file line by line until it reaches the target line number.
    getLine_binarysearch searches for the given line number using binary search; the input file must already contain line numbers (the time spent adding them is reported above). Both approaches are sketched below.
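
Minimal sketches of these helpers, assuming each numbered line has the form '<number><TAB><text>' (the repo's actual separator, names, and signatures may differ). First, writing a numbered copy of a file:

```python
def add_line_numbers(src_path, dst_path, sep="\t"):
    """Write a copy of src_path where each line is prefixed with its
    1-based line number, e.g. '42<TAB>original text'."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for num, line in enumerate(src, start=1):
            dst.write(f"{num}{sep}{line}")
```

Since bytes cannot be inserted into the middle of an existing file, an in-place variant has to rewrite the contents through a temporary buffer, which would explain why addLineNumber_inplace is roughly twice as slow.

Reading a line then works either by scanning lazily or by binary-searching byte offsets in the numbered file. The binary search seeks into the current byte range, realigns to the next line start, and compares that line's leading number against the target:

```python
def get_line(path, target):
    """Return the line at 1-based line number `target` by scanning lazily."""
    with open(path, "r", encoding="utf-8") as f:
        for num, line in enumerate(f, start=1):
            if num == target:
                return line
    return None

def get_line_binarysearch(path, target, sep=b"\t"):
    """Find the line whose leading number equals `target` in a file with
    lines of the form '<number><TAB><text>', via binary search on byte offsets."""
    with open(path, "rb") as f:
        lo, hi = 0, f.seek(0, 2)    # hi = file size in bytes
        # Invariant: if the target line exists, it starts at an offset in [lo, hi].
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()            # skip the (possibly partial) line around mid
            line = f.readline()     # first complete line after mid
            if not line:            # ran past the end: search the lower half
                hi = mid
                continue
            num = int(line.split(sep, 1)[0])
            if num == target:
                return line.decode("utf-8")
            if num < target:
                lo = f.tell()       # target starts after this line
            else:
                hi = mid            # target starts at or before mid
        f.seek(lo)                  # if present, the target line starts at lo
        line = f.readline()
        first = line.split(sep, 1)[0]
        if first.isdigit() and int(first) == target:
            return line.decode("utf-8")
    return None
```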

Use ./tools/timingplot.py to generate an interactive Plotly plot. The timing data can be found in ./data/.

Test Data
shakespeare.txt : "As You Like It" by William Shakespeare.
exoplanets.json : a list of potentially habitable exoplanets; source: Wikipedia (accessed Mar. 2021), with the table converted into a .json file.

Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/
