Common scripts, mainly for text processing and experimental control
Python Perl Shell
html2text
local
shuffle
README
all-xml-to-json.sh
boilerpipe-stdin-urls-to-mongo.py
choose-columns.py
chop-columns.pl
citeseer-get.pl
compare-file-lengths.pl
condor-count-user-jobs-in-queue.sh
convert-yaml-to-hashdb.py
convert-yaml-to-json.py
convert-yaml-to-split-json.py
count-undefined_references.pl
cumulative.py
delexicalize-low-frequency-words.py
dumpdb.py
enscript-landscape-all.pl
file-count-recursive.py
filter-json-by-key.py
fix-filenames.py
from-one-line-per-word-to-one-line-per-sentence.py
fuck-condor.py
grep-json-by-field.py
grep-json.py
hashdb-append.py
hashdb-dump.py
htmldecode.pl
htmlencode.pl
ikigrabimage
ikiquip
interleave
join-json.py
lines-with-funny-characters.pl
lines-with-no-funny-characters.pl
load-directory-of-textfiles-into-mongodb.py
load-json-into-mongodb.py
make-single-sided.pl
mongodb-count.py
mongodb-field-lengths.py
mongodb-remove-field.py
mongodb-remove-short-fields.py
mongodb-to-lucene.py
non-ascii-lines.pl
numberlines
numberlines-normalized
one-sentence-per-line-to-json.py
page-count.pl
paired-T-test.py
percentile.py
print-all.pl
quickrm
read-xml-mysqldump.py
remove-funny-characters.pl
remove-non-utf10-characters.pl
remove-non-utf11-characters.pl
remove-nonascii-characters.pl
run-lock.pl
rzip.pl
rzipdir.py
sample.pl
shuffle-files.pl
shuffle.sh
shuffle_galleries.pl
sort-curves.py
statistics.pl
statistics.py
strip-invalid-lines.py
tailpercent
tokenize-English.pl
tokenizer.sed
tsv-to-json.py
tsv_to_html.py
unichars
untokenizer
version.pl
vowpal-to-libsvm.py
weighted-sum.pl
words-integers-mapfile.py
words-to-integers.py
xmlmysqldump.py

README

local/                      - Files of local interest, e.g. with fixed
                            hostnames.

README.txt                  - This file

all-xml-to-json.sh          - For every XML file in the command-line,
                            convert it to JSON.

boilerpipe-stdin-urls-to-mongo.py
                            - Run every sys.stdin URL through Boilerpipe
                            (or diffbot), and store in a MongoDB.

citeseer-get.pl             - Fetch PDFs from citeseer.

cumulative.py               - Output a cumulative sum for each line in
                            the input file.

delexicalize-low-frequency-words.py
                            - Delexicalize all words with freq less than
                            minfreq to *UNKNOWN*

dumpdb.py                   - Dump the MongoDB

enscript-landscape-all.pl   - Enscript all files listed in @ARGV in
                            landscape mode.

filter-json.py              - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.

from-one-line-per-word-to-one-line-per-sentence.py
                            - Read one-line-per-word and convert to
                            one-line-per-sentence.

grep-json.py                - Filter JSON in sys.stdin to find only docs
                            that match each regex against raw JSON.

grep-json-by-field.py       - Filter JSON in sys.stdin to find only docs
                            that match each regex with at least one
                            field value.

join-json.py                - For each JSON file in sys.argv, join them
                            and output to stdout.

lines-with-funny-characters.pl
                            - Print lines with funny characters

lines-with-no-funny-characters.pl
                            - Print lines without funny characters

load-directory-of-textfiles-into-mongodb.py
                            - For all files recursively in a subdir, load
                            them into a MongoDB with a certain field name.

load-json-into-mongodb.py   - Load JSON from stdin into a MongoDB

htmldecode.pl               - Decode HTML entities, e.g. &lt; becomes <

htmlencode.pl               - Encode HTML entities, e.g. < becomes &lt;

html2text                   - Convert HTML to text

mongodb-count.py            - Count the number of entries in a mongodb
                            collection.

mongodb-field-lengths.py    - Print MongoDB field length and field,
                            for every row.

mongodb-remove-field.py     - Remove every occurrence of some field,
                            for every row, in MongoDB.

mongodb-remove-short-fields.py
                            - Remove every occurrence of some field if it
                            is shorter than some length, for every row,
                            in MongoDB.

mongodb-to-lucene.py        - Read all mongo docs, and insert them
                            into Lucene.

one-sentence-per-line-to-json.py
                            - For line in stdin, convert it to a JSON
                            dict with key: "content" and value: line.

page-count.pl               - For each file (usually .ps or .pdf)
                            specified in stdin, count the number of
                            pages in the file.

print-all.pl                - For each file (.ps or .pdf) specified
                            as a command-line argument, print the
                            file to a random printer.

ptb/one-sentence-per-line.pl    - Output one PTB sentence per line,
                            using PTB tagged/ files.

read-xml-mysqldump.py       - Read in the XML mysqldump from sys.stdin.

remove-funny-characters.pl  - Remove any funny character

remove-nonascii-characters.pl   - Remove non-ASCII characters

remove-non-utf10-characters.pl  - Remove non-UTF 1.0 characters

remove-non-utf11-characters.pl  - Remove non-UTF 1.1 characters

sample.pl                   - Sample and print only a certain percentage
                            of input lines.

shuffle/shuffle.sh          - Shuffle lines of stdin

sort-curves.py              - Sort gnuplot curves

tokenizer.sed               - Penn Treebank tokenizer.

tokenize-English.pl         - Word Tokenizer for English by Al-Onaizan
                            and Melamed.

tsv-to-json.py              - Read TSV from stdin and output as JSON.

unichars                    - List characters for one or more properties
                            (by Tom Christiansen)

untokenizer                 - Detokenize Penn Treebank formatted text.

vowpal-to-libsvm.py         - Convert a vowpal-wabbit file in stdin
                            to libsvm.

words-integers-mapfile.py   - Create an integer mapfile for the words
                            in textfile.

words-to-integers.py        - Convert words to integers, according to
                            the mapping in mapfile.

xmlmysqldump.py             - Read in the XML mysqldump from sys.stdin.
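
Most of the scripts above are small line-oriented filters. As an illustration
only, here is a minimal sketch of what the descriptions of cumulative.py and
one-sentence-per-line-to-json.py suggest. It is not the code in this
repository: the function names are hypothetical, and it assumes one number per
line for the cumulative sum, that blank lines are skipped, and that the
trailing newline is dropped before a line is stored under "content".

```python
import json


def cumulative_sums(lines):
    """Yield the running total after each numeric line.

    Assumption: one number per line; blank lines are skipped.
    """
    total = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += float(line)
        yield total


def sentences_to_json(lines):
    """Yield one JSON object per line, storing the line under "content".

    Assumption: the trailing newline is not part of the stored value.
    """
    for line in lines:
        yield json.dumps({"content": line.rstrip("\n")})


# In-memory demo; a real filter would iterate over sys.stdin instead.
for total in cumulative_sums(["1\n", "2\n", "3.5\n"]):
    print(total)
for record in sentences_to_json(["Hello world.\n"]):
    print(record)
```

Both functions are generators, so they handle input one line at a time and
can be fed sys.stdin directly without reading the whole file into memory.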