Python scripts to perform the following file operations:
- Jump to a line number in a file and read a line.
- Read a large file as a stream of lines and filter only the lines that match some criteria.
- Read a large file, filter only the lines that match some criteria, redirect and write those filtered lines to another file.
- Read a JSON input and load it into an object.
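As a rough illustration of the stream-and-filter operations listed above, the sketch below reads a file lazily and optionally writes the matching lines to a second file. The function names `filter_lines` and `filter_to_file` and the UTF-8 encoding are assumptions made for this example, not necessarily what the scripts in this repository use.

```python
from typing import Callable, Iterator


def filter_lines(path: str, predicate: Callable[[str], bool]) -> Iterator[str]:
    """Yield the lines of `path` for which `predicate` returns True."""
    with open(path, "r", encoding="utf-8") as src:
        # A file object is a lazy line iterator, so only one line
        # is held in memory at a time, even for very large files.
        for line in src:
            if predicate(line):
                yield line


def filter_to_file(src_path: str, dst_path: str,
                   predicate: Callable[[str], bool]) -> int:
    """Write the matching lines to `dst_path`; return how many were written."""
    count = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_lines(src_path, predicate):
            dst.write(line)
            count += 1
    return count
```

A call such as `filter_to_file("in.xml", "out.xml", lambda l: "<title>" in l)` (hypothetical file names) would copy only the lines containing `<title>`.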
The following timings were obtained by reading a Wikimedia abstracts dump file (an .xml
file of size 5.8 GB with almost 75.6M lines; the file can be downloaded from here).
- Adding line numbers to the file:
  - addLineNumber : 58.024850428 s
  - addLineNumber_inplace : 103.272668963 s
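For orientation, here is a minimal sketch of what adding line numbers might look like. The names `add_line_numbers` and `add_line_numbers_inplace`, the tab separator, and the temporary-file strategy for the in-place variant are all assumptions for this example; the repository's `addLineNumber` and `addLineNumber_inplace` may work differently.

```python
import os
import tempfile


def add_line_numbers(src_path: str, dst_path: str, sep: str = "\t") -> int:
    """Copy src_path to dst_path, prefixing each line with its 1-based number.

    Returns the number of lines written (0 for an empty file).
    """
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for count, line in enumerate(src, start=1):
            dst.write(f"{count}{sep}{line}")
    return count


def add_line_numbers_inplace(path: str, sep: str = "\t") -> int:
    """Number the lines of `path`, replacing the original file.

    Assumption: writes to a temporary file in the same directory and then
    swaps it in, so the original is never left half-written.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    os.close(fd)
    try:
        count = add_line_numbers(path, tmp_path, sep)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return count
```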
- Reading a line at a given line number:
  - getline from the linecache module is not practical for large files.
  - getLine uses enumerate() to read the file line-by-line until it reaches the target line number.
  - getLine_binarysearch searches for the given line number using binary search. The input file must have line numbers. The time spent to add line numbers is reported above.
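The two lookup strategies described above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the names `get_line` and `get_line_binary_search` are hypothetical, and the binary search assumes every line starts with its 1-based number followed by a tab, with numbers in increasing order.

```python
import os
from typing import BinaryIO, Optional


def get_line(path: str, lineno: int) -> Optional[str]:
    """Linear scan: read line by line until the 1-based `lineno` is reached."""
    with open(path, "r", encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if n == lineno:
                return line
    return None


def _line_at_or_after(f: BinaryIO, pos: int) -> bytes:
    """Return the first full line starting at byte offset >= pos (b'' if none)."""
    if pos > 0:
        # Seek one byte back and discard up to the next newline, so that a
        # line starting exactly at `pos` is not skipped.
        f.seek(pos - 1)
        f.readline()
    else:
        f.seek(0)
    return f.readline()


def get_line_binary_search(path: str, lineno: int,
                           sep: bytes = b"\t") -> Optional[str]:
    """Binary search over byte offsets in a file whose lines are numbered."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has number >= lineno.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if line and int(line.partition(sep)[0]) < lineno:
                lo = mid + 1
            else:
                hi = mid
        line = _line_at_or_after(f, lo)
        if line:
            number, _, text = line.partition(sep)
            if int(number) == lineno:
                return text.decode("utf-8")
    return None
```

The linear scan touches every line before the target, while the binary search performs a number of seeks that grows with the logarithm of the file size, at the cost of requiring the numbered input.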
Use ./tools/timingplot.py to generate an interactive plotly plot. The timing data can be found at: ./data/
Test Data
- shakespeare.txt: "As You Like It" by William Shakespeare.
- exoplanets.json: list of potentially habitable exoplanets, source: Wikipedia (accessed: Mar. 2021), table converted into a .json file.
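A JSON file such as exoplanets.json could be loaded into an object along the following lines. This is only a sketch: the helper name `load_json_as_object` is hypothetical, and the sample record uses made-up placeholder keys and values, since the actual schema of exoplanets.json is not described here.

```python
import json
from types import SimpleNamespace


def load_json_as_object(text: str):
    """Parse JSON text, turning every JSON object into a SimpleNamespace.

    Keys then become attributes (obj.name instead of obj["name"]).
    """
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


# Placeholder data for illustration only; not taken from exoplanets.json.
sample = '[{"name": "example-1", "tags": ["a", "b"], "meta": {"rank": 1}}]'
records = load_json_as_object(sample)
first = records[0]
print(first.name, first.tags, first.meta.rank)
```

To read from a file instead of a string, `json.load(fp, object_hook=...)` accepts the same `object_hook` argument.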
Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/