KSV - robust mapping of comma separated values to user-defined data classes for Kotlin on the JVM

The robustness stems from columns being identified by their (normalized) column name instead of their column index, so the mapping keeps working when columns have been swapped or new columns have been inserted. The column names might even have changed slightly. (The default name normalization removes lower/uppercase differences as well as spaces.) This property is very useful if the source of the csv file(s) is not within your organization and you can't enforce a certain format (e.g. if you regularly import csv files from government sites).
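
To illustrate the idea of name-based lookup, here is a minimal sketch in plain Kotlin. It does not use KSV and is not its implementation; the helper names (normalizeName, indexByName, cellOf) are made up for this example, and the normalization only follows the description above (spaces removed, case ignored).

```kotlin
// Sketch only: illustrates name-based column lookup, not KSV's actual internals.

// Assumed default normalization as described above: drop spaces, ignore case.
fun normalizeName(name: String): String = name.replace(" ", "").lowercase()

// Maps each normalized header name to its column index.
fun indexByName(headerCells: List<String>): Map<String, Int> =
    headerCells.withIndex().associate { (index, cell) -> normalizeName(cell) to index }

// Reads the cell of the named column from one row, wherever that column sits.
fun cellOf(columnName: String, header: List<String>, row: List<String>): String? =
    indexByName(header)[normalizeName(columnName)]?.let { index -> row.getOrNull(index) }

// The same column is found although the two files order and spell the header differently.
val oldLayout = cellOf("Number of beds", listOf("city", "Number of beds"), listOf("Paris", "4"))
val newLayout = cellOf("Number of beds", listOf("number Of Beds", "region", "city"), listOf("4", "IDF", "Paris"))
// both evaluate to "4"
```

An index-based mapping would have read "Paris" or "IDF" from the second file, which is the failure mode the name-based approach avoids.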

You only have to annotate a data class with @CsvRow and its properties with either @CsvValue (for Strings, Ints, Doubles and Booleans), @CsvTimestamp (for LocalDate and LocalDateTime) or @CsvGeneric (for user-defined mappings). Because this library is written in Kotlin, you can define the nullability of properties. A blank value in the csv results in a null value of the property (assuming the property doesn't have a default value).

@CsvRow data class DataRow(
    @CsvValue(name = "RQIA") val id: String,
    @CsvValue(name = "Number of beds") val bedCount: Int?, // types can be nullable
    val addressLine1: String,                              // without annotation it's assumed that the column name is the property name
    val city: String = "London",                           // without value in the csv file the Kotlin default value is used
    @CsvTimestamp(name = "latest check", format = "yyyy/MM/dd|dd/MM/yyyy")  
    val latestCheckDate: LocalDate?,                       // multiple formats can be provided separated by '|'
    @CsvGeneric(name = "offers Cola, Sprite or Fanta", converterName = "beverageBoolean")
    val refreshments: Boolean?                             // a user-defined converter can be used
)

// register a user-defined converter
registerGenericConverter("beverageBoolean") {
    it.lowercase() == "indeed"
}

val csvStream: InputStream = """
  city, addressLine1, Number of beds, latest check, RQIA, "offers Cola, Sprite or Fanta"
  if a line doesn't fit the pattern, it will be discarded <- like this line, the next line is fine because city has a default value and Number of beds is nullable
      , "2 Marylebone Rd",          ,2020/03/11,   WERS234, nope
  Berlin, "Berkaer Str 41", 1       ,28/08/2012, "NONE123", indeed
  Paris,"Rue Gauge, Maison 1", 4    ,          , "FR92834",
  Atlantis,,25000,,,
  """.trimIndent().byteInputStream()

val dataRows: List<DataRow> = csv2List(
  CsvSourceConfig(
    stream = csvStream 
  )    
)

This code is actually executed in the test class TestExample.

How to Import this Lib into your (Kotlin) Gradle project

via a source dependency!

First add this git repository to your project's settings.gradle.kts file:

sourceControl {
    gitRepository(java.net.URI.create("https://github.com/whichdigital/ksv.git")) {
        producesModule("uk.co.whichdigital.ksv:ksv")
    }
}

then add this dependency (in its latest git-tagged version) to your build.gradle.kts file:

implementation("uk.co.whichdigital.ksv:ksv:1.0.0")

Done.

Annotations

class annotation(s)

@CsvRow

A marker annotation on a data class, marking it as the type that csv rows are mapped to.

property annotation(s)

All values are trimmed and stripped of surrounding quotes (default quote is double quote).

@CsvValue

for mapping values to String, Int, Double or Boolean.

Booleans are mapped from a String value by comparing the lowercase version to "true", "yes", "y" and "1", which are mapped to true, otherwise false.

annotation parameter:

name (optional): the name of the column this property is instantiated from. If no name is provided, the name of the annotated property is used.
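
The value clean-up and the Boolean rule described above can be sketched in plain Kotlin as follows. This is an illustration under assumptions, not KSV's code: the function names cleanValue and toBoolean are hypothetical, and only one pair of surrounding quotes is removed.

```kotlin
// Sketch only: mirrors the documented rules, not KSV's actual code.

// Trims a raw cell and removes one pair of surrounding quotes, if present.
fun cleanValue(raw: String, quoteChar: Char = '"'): String {
    val trimmed = raw.trim()
    val quoted = trimmed.length >= 2 && trimmed.first() == quoteChar && trimmed.last() == quoteChar
    return if (quoted) trimmed.substring(1, trimmed.length - 1) else trimmed
}

// "true", "yes", "y" and "1" (compared in lowercase) become true, anything else false.
fun toBoolean(raw: String): Boolean =
    cleanValue(raw).lowercase() in setOf("true", "yes", "y", "1")
```

Note that under this rule a value such as "nope" or "0" silently becomes false rather than causing an error, which is worth keeping in mind for columns with unexpected content.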

@CsvTimestamp

for LocalDate and LocalDateTime.

annotation parameter:

name (optional): the name of the column this property is instantiated from. If no name is provided, the name of the annotated property is used.
format: format is either a single timestamp pattern (e.g. "yyyy/MM/dd" ) or multiple patterns separated by '|' (e.g. "yyyy/MM/dd|dd-MM-yyyy" )
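
How several '|'-separated patterns could be applied to one value is sketched below with java.time. The function parseLocalDate is a hypothetical helper written for this example, and the assumption that patterns are tried from left to right is mine, not something the library documents.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeParseException

// Sketch only: tries each '|'-separated pattern in order and returns the first successful parse.
fun parseLocalDate(value: String, format: String): LocalDate? =
    format.split('|').firstNotNullOfOrNull { pattern ->
        try {
            LocalDate.parse(value.trim(), DateTimeFormatter.ofPattern(pattern))
        } catch (e: DateTimeParseException) {
            null // this pattern did not match, try the next one
        }
    }
```

With the format "yyyy/MM/dd|dd/MM/yyyy" both "2020/03/11" and "28/08/2012" from the example above parse to a LocalDate, while a value matching neither pattern yields null.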

@CsvGeneric

for user-defined mappings to any type. The user-defined converter has to be registered before the annotation is used. This is done by invoking the global registerGenericConverter function.

fun <T: Any> registerGenericConverter(
  converterName: String,
  converter: (String) -> T
)

where T is the type of the property.

annotation parameter:

name (optional): the name of the column this property is instantiated from. If no name is provided, the name of the annotated property is used.
converterName: has to match the name of a registered converter. The return type of the converter has to match the type of the annotated property.
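
Why registration has to happen before the annotation is used becomes clearer with a toy registry. The following is a hypothetical sketch, not KSV's implementation; the names SketchConverters, register and convert are invented for illustration.

```kotlin
// Sketch only: a hypothetical name-keyed registry, not KSV's implementation.
object SketchConverters {
    private val converters = mutableMapOf<String, (String) -> Any>()

    // Stores the converter under its name, replacing any earlier one with the same name.
    fun <T : Any> register(converterName: String, converter: (String) -> T) {
        converters[converterName] = converter
    }

    // Fails fast when no converter was registered under that name.
    fun convert(converterName: String, value: String): Any {
        val converter = converters[converterName]
            ?: error("no converter registered under '$converterName'")
        return converter(value)
    }
}
```

In such a design the lookup happens by name at conversion time, so a missing registration only shows up when the first row is converted. Because the registry stores converters as (String) -> Any, a mismatch between the converter's return type and the property type would also surface only at runtime.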

Code

csv2List

The global function that converts a csv source (an InputStream plus optional further configuration parameters) to a list of the user-defined row type. This is where the (reflective) magic happens.

val dataRows: List<DataRow> = csv2List(
  CsvSourceConfig(
    stream = csvStream 
  )    
)

Invoking csv2List will close the InputStream.

Apart from the main parameter, which takes a CsvSourceConfig, csv2List has a number of optional parameters that report how many rows were discarded/parsed. (As the naming suggests, the main idea here is to allow for logging.)

logInvalidLine: (line: String, msg: String)->Unit: the line and the reason why a certain row/line was dropped from the csv (mostly because the number of commas didn't fit).
logRejectedRecord: (record: String)->Unit: the String representation of a CsvRecord (a slightly processed row/line) that was rejected by the keepCsvRecord predicate.
logConversionError: (record: String, msg: String)->Unit: a record and why its conversion to the expected (row) type failed (e.g. because of unfulfilled nullability constraints).
logSummary: (invalidLineCount: Int, rejectedRecordCount: Int, conversionErrorCount: Int, itemsCreated: Int)->Unit: after all lines have been considered, a summary of the complete process can be logged here.

e.g.

val csvFilePath: String = "data/someFile.csv"
val dataRows: List<DataRow> = csv2List(
  CsvSourceConfig(
    stream = classLoader.getResourceAsStream(csvFilePath)
  ),
  // logSummary is a parameter of csv2List, not of CsvSourceConfig
  logSummary = { invalidLineCount: Int, rejectedRecordCount: Int, conversionErrorCount: Int, itemsCreated: Int ->
    logger.info("""
      Finished importing file $csvFilePath
        items imported: $itemsCreated
        lines with invalid format: $invalidLineCount (probably wrong amount of commas)
        rejected csvRecord: $rejectedRecordCount (optional filter provided)
        csvRecord which couldn't be converted to item: $conversionErrorCount
      """.trimIndent())
  }
)
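
The most common reason for an invalid line, a number of commas that doesn't fit the header, can be illustrated with a simplified quote-aware splitter. This is a sketch under assumptions and not KSV's parser: splitLine and hasValidShape are hypothetical names, escaped quotes and multi-line cells are not handled, and the parameter names commaChar and quoteChar are borrowed from CsvSourceConfig only for readability.

```kotlin
// Sketch only: a simplified splitter, not KSV's parser (no escaped quotes, no multi-line cells).
fun splitLine(line: String, commaChar: Char = ',', quoteChar: Char = '"'): List<String> {
    val cells = mutableListOf<String>()
    val current = StringBuilder()
    var insideQuotes = false
    for (ch in line) {
        when {
            // a quote toggles the quoted state and is kept as part of the cell
            ch == quoteChar -> { insideQuotes = !insideQuotes; current.append(ch) }
            // a delimiter outside quotes ends the current cell
            ch == commaChar && !insideQuotes -> { cells += current.toString(); current.clear() }
            else -> current.append(ch)
        }
    }
    cells += current.toString()
    return cells
}

// A line is treated as invalid here when its cell count differs from the header's.
fun hasValidShape(headerLine: String, line: String): Boolean =
    splitLine(headerLine).size == splitLine(line).size
```

Applied to the example at the top, the header splits into six cells because the commas inside "offers Cola, Sprite or Fanta" are quoted, and the free-text line with only two commas is recognized as invalid.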

CsvSourceConfig

Assuming the InputStream uses UTF-8, the instantiation of a CsvSourceConfig only needs said InputStream. But there are more configuration options:

stream: InputStream: the source of the csv
charset: Charset: the default is UTF-8
commaChar: Char: the default is a normal comma (',') but csv files are known to sometimes use other characters (e.g. a semicolon) as a delimiter
quoteChar: Char: the default is a double quote, but another char (e.g. a single quote) can be used
fixLine: (String)->String: this function is applied to every line of the csv file. The idea is to remove e.g. illegal characters. The default removes invisible characters (the BOM \uFEFF and the zero-width space \u200B) from the start of the line.
keepCsvRecord: (CsvHeader, CsvRecord)->Boolean: the csv input stream can be extremely large. Sometimes we want to filter out rows from the csv before an object (in our example of type DataRow) is instantiated.
normalizeColumnName: (String)->String: if we don't control the source of the csv data (e.g. because the files come from an external source), it often happens that the column names change slightly between different versions. The normalizeColumnName parameter is supposed to make a configuration more robust against such changes. The default version removes all spaces from the column names and maps them to their lower case version. If you have different requirements (or the default version leads to collisions of normalized column names), provide your own function.
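
A custom function for the normalizeColumnName parameter could look like the sketch below. Both alphanumericOnly and the collisions helper are hypothetical names made up for this example; the helper is merely one way to check a normalizer against a known header before relying on it.

```kotlin
// Sketch only: a stricter normalizer than the documented default.
// It keeps letters and digits only, so "Number_of-beds" and "number of beds" coincide.
val alphanumericOnly: (String) -> String = { name ->
    name.filter { it.isLetterOrDigit() }.lowercase()
}

// Returns the normalized names that more than one header cell maps to.
fun collisions(headerCells: List<String>, normalize: (String) -> String): Set<String> =
    headerCells.groupingBy(normalize).eachCount().filterValues { it > 1 }.keys
```

The more aggressively a normalizer strips characters, the more name variations it tolerates, but the likelier two distinct columns end up with the same normalized name.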

Here is an example of how to define a predicate for the optional keepCsvRecord parameter (it allows only lines where the number of beds is at least 2):

val dataRows: List<DataRow> = csv2List(
  CsvSourceConfig(
    stream = csvStream,
    keepCsvRecord = ::onlyRowsWithAtLeastTwoBeds 
  )    
)

fun onlyRowsWithAtLeastTwoBeds(header: CsvHeader, record: CsvRecord): Boolean {
  val (nrOfBedsIndex: Int) = header.getIndexesOf("Number of beds") // more than one index can be queried at the same time, which is why the result has to be destructured
  val nrOfBeds: Int = record.getAsNonBlankStringOrNull(nrOfBedsIndex)?.toIntOrNull() ?: 0
  return nrOfBeds >= 2
}

Of course this filter operation could also be implemented on dataRows after the complete csv source has been parsed:

val dataRows: List<DataRow> = csv2List(CsvSourceConfig(csvStream)).filter {row ->
  (row.bedCount ?: 0) >= 2
}

But assuming a case where the csv source is truly big (like gigabytes big) and a lot of instances get discarded because of this filter, it can be a reasonable idea to filter those rows out before computation time and memory are wasted on them.
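
The difference between filtering before and after instantiation can be sketched with plain Kotlin sequences. Nothing here is KSV code: CareHome, toCareHome and the in-memory records are hypothetical stand-ins for the row type, the conversion step and a large csv source.

```kotlin
// Sketch only: plain Kotlin sequences standing in for a large csv source.
data class CareHome(val city: String, val bedCount: Int)

// Counts how many objects the conversion step actually creates.
var instancesCreated = 0

// Hypothetical stand-in for the comparatively expensive record-to-object conversion.
fun toCareHome(cells: List<String>): CareHome {
    instancesCreated++
    return CareHome(cells[0], cells[1].toInt())
}

val records = sequenceOf(listOf("Berlin", "1"), listOf("Paris", "4"), listOf("Atlantis", "25000"))

// The predicate runs on the raw cells, so rejected records never become objects.
val filteredEarly: List<CareHome> = records
    .filter { cells -> (cells[1].toIntOrNull() ?: 0) >= 2 }
    .map(::toCareHome)
    .toList()
```

Here only two of the three records are ever converted; filtering the finished list instead would have created all three objects first and then thrown one away.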
