Alexander Trufanov edited this page Feb 20, 2017 · 3 revisions

Shuf-t is a cross-platform command line tool that shuffles the lines of a text file. It supports the same command line interface as shuf from GNU Core Utilities, but it is designed to work with text files much bigger than the available RAM.
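Since shuf-t mirrors shuf's interface, basic usage looks like the GNU shuf command line. The shuf-t invocation below is an assumption based on that compatibility claim; the exact flag set may differ:

```shell
# Create a small sample file, then shuffle it.
printf 'line1\nline2\nline3\nline4\n' > input.txt

# GNU shuf (keeps the whole file in RAM):
shuf -o shuffled.txt input.txt

# shuf-t accepts the same syntax but works from disk
# (hypothetical invocation, assuming full shuf compatibility):
# shuf-t -o shuffled.txt input.txt
```

The `-o` option writes the result to a file instead of stdout, which is the typical pattern when shuffling large datasets.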

Why would I ever need to shuffle the lines of a huge text file?

The main use case for shuf-t is machine learning, specifically data preparation for Vowpal Wabbit and similar online learning systems. Machine learning lets you predict (among other things) the outcome of some process based on its historical data. For example, banks use it for credit scoring, to decide whether you are eligible for a loan. Usually, the more historical data you use to train your prediction model, the better its later accuracy.

Thus a data scientist eventually faces datasets bigger than a single PC's RAM and is forced to use clusters (MapReduce, etc.) or clouds. Alternatively, they can stick with a single PC and use an online machine learning system. Currently the most popular of these is Vowpal Wabbit (VW). The main idea behind such systems is to process one record, or a small chunk of the historical data, at a time. They read a few records, update the prediction model's coefficients, remove those examples from RAM, and then read a new portion of data. Only the coefficients of the decision rules or equations are stored in memory permanently. This allows VW to work with datasets of unlimited size.

Such systems have several drawbacks. One of them is sensitivity to the order of records in the historical data. For example, suppose you sort a dataset of your customers' previous loans by outcome (paid on time / not paid), then train the prediction model on 10000 unreliable borrowers followed by 10000 examples of successfully repaid debts. The resulting model will tend to predict anyone as a trustworthy borrower, because it has already "forgotten" the unreliable ones. A simple workaround is to make sure that good and bad borrowers are uniformly distributed across the dataset, and one obvious way to achieve this is to shuffle it.

Some technical details

  • Fisher–Yates algorithm is used to shuffle the data.
  • The tool is written in C++ and does not depend on any third-party libraries.
  • The QtCreator IDE and the qmake toolchain are currently used to build the project.
  • The license is Simplified BSD.