CSV Files
CSV (Comma Separated Values) is a very widespread file format. The content of such files can be easily represented with DataFrames. If you don't know what a DataFrame is, read this first. To make working with CSV-files easy, the io-package provides the CSVReader and CSVWriter classes.
Let's say that you have a CSV-file "myFile.csv" with the following content:
id,name,age
100,Seth McFarlane,31
101,Peter Griffin,39
102,Adam West,23
103,Joe Swanson,34
104,Glenn Quagmire,43
You can read the content of that file by constructing a CSVReader object and then calling read() on it. The constructor takes the file as an argument, either as a File object or directly as a String. Since the above file has a header (the first line describing the columns), you can also state that explicitly after constructing the reader:
CSVReader csv = new CSVReader("myFile.csv");
csv.withHeader(true);
DataFrame df = csv.read();
The withHeader() method in the above example indicates whether the first line should be treated as a header. This is turned on by default. You can then reference the columns in the returned DataFrame directly by their names. The CSVReader uses a comma (',') as the default separator. If the values in the above file were separated by semicolons (';') rather than commas, you would need to specify that by calling:
csv.useSeparator(';');
The DataFrame returned by read() can then be used like any other DataFrame. Of course, if you want to use the default configuration anyway, you can also read a CSV-file in one line:
DataFrame df = new CSVReader("myFile.csv").read();
However, it is important to note that all columns in the returned DataFrame are of type String, because the above example did not specify any column types. In some situations that may not matter, but often you want the returned DataFrame to consist of properly typed columns. You can specify the type to use for each column with the useColumnTypes() method. For example:
DataFrame df = new CSVReader("myFile.csv")
.useColumnTypes(Integer.class, String.class, Byte.class)
.read();
or equivalently:
CSVReader csv = new CSVReader("myFile.csv");
csv.useColumnTypes(Integer.class, String.class, Byte.class);
DataFrame df = csv.read();
The first column will then be an IntColumn, the second column a StringColumn and the third column a ByteColumn.
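To make the idea of per-column types more concrete, here is a minimal sketch of what converting the fields of a single line could look like. It uses only the Java standard library and is not the CSVReader implementation; the class TypedCsvSketch and its methods parseField() and parseLine() are hypothetical names introduced for illustration, and quoted fields are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TypedCsvSketch {

    // Converts one raw CSV field to the requested type.
    // Only the three types used in the example above are handled.
    static Object parseField(String raw, Class<?> type) {
        if (type == Integer.class) {
            return Integer.valueOf(raw);
        } else if (type == Byte.class) {
            return Byte.valueOf(raw);
        } else if (type == String.class) {
            return raw;
        }
        throw new IllegalArgumentException("Unsupported type: " + type);
    }

    // Splits one line at the separator and converts each field
    // according to the type given for its column position.
    static List<Object> parseLine(String line, char separator, Class<?>... types) {
        // Quote the separator so that characters like '|' are not treated as regex syntax.
        String[] fields = line.split(Pattern.quote(String.valueOf(separator)), -1);
        if (fields.length != types.length) {
            throw new IllegalArgumentException(
                    "Expected " + types.length + " fields but found " + fields.length);
        }
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            row.add(parseField(fields[i], types[i]));
        }
        return row;
    }

    public static void main(String[] args) {
        List<Object> row = parseLine("101,Peter Griffin,39", ',',
                Integer.class, String.class, Byte.class);
        for (Object value : row) {
            System.out.println(value + " (" + value.getClass().getSimpleName() + ")");
        }
    }
}
```

Note that a Byte only covers the range -128 to 127, which is enough for the age values in the sample file but would fail with a NumberFormatException for larger numbers.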
If your CSV-file is rather large (we are talking about millions of lines), then reading the entire file might take several seconds. In many cases it's not desirable to perform such heavy operations on the calling thread. Therefore both CSVReader and CSVWriter provide *Async() methods which perform the same operation as their non-async counterpart, but on a background thread. So if you want to read your CSV-file in the background, simply call the readAsync() method like this:
CompletableFuture<DataFrame> future = new CSVReader("myFile.csv").readAsync();
Once the background thread finishes, it completes the Future returned by the readAsync() method with the resulting DataFrame.
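How you consume the returned future is up to you. The following sketch shows two common options from the standard CompletableFuture API: registering a callback and blocking until completion. So that it runs without the library, it uses a hypothetical stand-in method loadAsync() that returns a list of lines rather than a DataFrame; the same pattern should apply to the future returned by readAsync().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncReadSketch {

    // Hypothetical stand-in for readAsync(): splits the given text into lines
    // on a background thread and completes the future with the result.
    static CompletableFuture<List<String>> loadAsync(String content) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList(content.split("\n")));
    }

    public static void main(String[] args) {
        CompletableFuture<List<String>> future =
                loadAsync("id,name,age\n100,Seth McFarlane,31");

        // Non-blocking: register a callback that runs once the result is available,
        // plus a handler in case the background operation failed.
        CompletableFuture<Void> done = future
                .thenAccept(rows -> System.out.println("Read " + rows.size() + " lines"))
                .exceptionally(ex -> {
                    System.err.println("Reading failed: " + ex);
                    return null;
                });

        // Blocking: wait for the callback to finish before the program exits.
        done.join();
    }
}
```

The callback style keeps the calling thread free, whereas join() or get() is only appropriate where blocking is acceptable.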
Persisting a DataFrame to a CSV-file is just as easy as reading one. Simply construct a CSVWriter object and pass the DataFrame you want to persist to its write() method:
new CSVWriter("myFile.csv").write(df);
Again, if you want a different separator, for example a semicolon, you must specify it before writing:
CSVWriter csv = new CSVWriter("myFile.csv");
csv.useSeparator(';');
csv.write(df);
NOTE: If the DataFrame you want to persist has any column names set, they will be used as a header for the CSV-file. This behaviour can be controlled with the withHeader() method.
As already mentioned, you can also perform a write operation in the background. The Future completes when the background thread finishes.
CompletableFuture<Void> future = new CSVWriter("myFile.csv").writeAsync(df);
NOTE: The Future's get() method returns null when completed.
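The behaviour of a CompletableFuture<Void> can be tried out without the library. The sketch below uses a hypothetical stand-in method saveAsync() that writes plain lines to a temporary file on a background thread; it is not how CSVWriter is implemented, but it illustrates that such a future signals completion only and carries no value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncWriteSketch {

    // Hypothetical stand-in for writeAsync(): writes the lines to the file
    // on a background thread. The future carries no value, only completion.
    static CompletableFuture<Void> saveAsync(Path file, List<String> lines) {
        return CompletableFuture.runAsync(() -> {
            try {
                Files.write(file, lines);
            } catch (IOException ex) {
                // Checked exceptions cannot escape a Runnable, so wrap them.
                throw new UncheckedIOException(ex);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("sketch", ".csv");
        CompletableFuture<Void> future =
                saveAsync(file, Arrays.asList("id;name", "100;Seth McFarlane"));

        // get() blocks until the write has finished and then returns null.
        Void result = future.get();
        System.out.println("Result: " + result
                + ", lines written: " + Files.readAllLines(file).size());
        Files.delete(file);
    }
}
```

If the background write fails, get() throws an ExecutionException wrapping the original cause, so a successful return of get() is what tells you the file was written.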