Don't parse entire CSV file before inserting the first row

We were separating the CSV import into two steps: parsing the CSV file
and inserting the parsed data. This had the advantage of keeping the
parsing code and the database code nicely separated, and of giving us
full knowledge of the CSV file before we start inserting the data into
the database. However, it made it necessary to keep the entire parse
result in RAM, which for large CSV files uses enormous amounts of
memory.

This commit changes the import to parse the first 20 lines and analyse
them. This should give us a good impression of what to expect from the
rest of the file. Based on that information we then parse the file row
by row and insert each row into the database as soon as it is parsed.
This means we only have to keep one row at a time in memory while more
or less retaining the ability to analyse the file before inserting any
data.
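
In rough outline, the new flow looks like this (a minimal sketch only; the real CSVParser also handles quote characters, encodings and a progress callback, and parseRowByRow and RowFunction are illustrative names for the std::function<bool(size_t, QStringList)> callback introduced in this commit):

```cpp
#include <QStringList>
#include <QTextStream>
#include <functional>

// Per-row callback: return false to abort parsing (and thus the import).
using RowFunction = std::function<bool(size_t, const QStringList&)>;

// Sketch of row-by-row parsing: each row is handed to the callback as soon
// as it has been read, so only one row needs to be kept in memory at a time.
bool parseRowByRow(QTextStream& stream, QChar separator,
                   const RowFunction& rowFunction, qint64 count = -1)
{
    size_t rowNum = 0;
    while(!stream.atEnd() && (count == -1 || static_cast<qint64>(rowNum) < count))
    {
        // A plain split is only good enough for this sketch; real CSV parsing
        // also has to honour quoting and embedded line breaks.
        QStringList fields = stream.readLine().split(separator);
        if(!rowFunction(rowNum++, fields))
            return false;   // callback cancelled, e.g. because an insert failed
    }
    return true;
}
```

The same entry point then serves both the 20-row preview/analysis pass (small
count, preview callback) and the full import (count == -1, insert callback).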

On my system this does seem to change the runtime for small files,
which take a little longer now (<5%), though these measurements aren't
conclusive. For large files, however, it changes memory consumption
from using all available memory and starting to swap within seconds to
almost no memory consumption at all. And not having to swap speeds
things up a lot.
MKleusberg committed Sep 12, 2017
1 parent e0ced4a commit 6ed8080fdb19e4b6cfadbb67d336bc6ce828d74c
Showing with 173 additions and 166 deletions.
  1. +113 −130 src/ImportCsvDialog.cpp
  2. +4 −2 src/ImportCsvDialog.h
  3. +18 −12 src/csvparser.cpp
  4. +21 −13 src/csvparser.h
  5. +17 −9 src/tests/TestImport.cpp
@@ -14,7 +14,6 @@
#include <QFile>
#include <QTextStream>
#include <QSettings>
#include <QDebug>
#include <QFileInfo>
#include <memory>

@@ -85,12 +84,10 @@ namespace {
void rollback(
ImportCsvDialog* dialog,
DBBrowserDB* pdb,
QProgressDialog& progress,
const QString& savepointName,
size_t nRecord,
const QString& message)
{
progress.hide();
QApplication::restoreOverrideCursor(); // restore original cursor
if(!message.isEmpty())
{
@@ -110,7 +107,7 @@ class CSVImportProgress : public CSVProgress
explicit CSVImportProgress(size_t filesize)
{
m_pProgressDlg = new QProgressDialog(
QObject::tr("Decoding CSV file..."),
QObject::tr("Importing CSV file..."),
QObject::tr("Cancel"),
0,
filesize);
@@ -183,16 +180,10 @@ void ImportCsvDialog::updatePreview()
ui->editCustomSeparator->setVisible(ui->comboSeparator->currentIndex() == ui->comboSeparator->count()-1);
ui->editCustomEncoding->setVisible(ui->comboEncoding->currentIndex() == ui->comboEncoding->count()-1);

// Get preview data
QFile file(selectedFile);
file.open(QIODevice::ReadOnly);

CSVParser csv(ui->checkBoxTrimFields->isChecked(), currentSeparatorChar(), currentQuoteChar());

QTextStream tstream(&file);
tstream.setCodec(currentEncoding().toUtf8());
csv.parse(tstream, 20);
file.close();
// Reset preview widget
ui->tablePreview->clear();
ui->tablePreview->setColumnCount(0);
ui->tablePreview->setRowCount(0);

// Analyse CSV file
sqlb::FieldVector fieldList = generateFieldList(selectedFile);
@@ -205,36 +196,40 @@ void ImportCsvDialog::updatePreview()
if(fieldList.size() == 0)
return;

// Use first row as header if necessary
CSVParser::TCSVResult::const_iterator itBegin = csv.csv().begin();
if(ui->checkboxHeader->isChecked())
{
ui->tablePreview->setHorizontalHeaderLabels(*itBegin);
++itBegin;
}

// Fill data section
ui->tablePreview->setRowCount(std::distance(itBegin, csv.csv().end()));

for(CSVParser::TCSVResult::const_iterator ct = itBegin;
ct != csv.csv().end();
++ct)
{
for(QStringList::const_iterator it = ct->begin(); it != ct->end(); ++it)
// Set horizontal header data
QStringList horizontalHeader;
foreach(const sqlb::FieldPtr& field, fieldList)
horizontalHeader.push_back(field->name());
ui->tablePreview->setHorizontalHeaderLabels(horizontalHeader);

// Parse file
parseCSV(selectedFile, [this](size_t rowNum, const QStringList& data) -> bool {
// Skip first row if it is to be used as header
if(rowNum == 0 && ui->checkboxHeader->isChecked())
return true;

// Decrease the row number by one if the header checkbox is checked to take into account that the first row was used for the table header labels
// and therefore all data rows move one row up.
if(ui->checkboxHeader->isChecked())
rowNum--;

// Fill data section
ui->tablePreview->setRowCount(ui->tablePreview->rowCount() + 1);
for(QStringList::const_iterator it=data.begin();it!=data.end();++it)
{
int rowNum = std::distance(itBegin, ct);
if(it == ct->begin())
{
ui->tablePreview->setVerticalHeaderItem(
rowNum,
new QTableWidgetItem(QString::number(rowNum + 1)));
}
// Generate vertical header items
if(it == data.begin())
ui->tablePreview->setVerticalHeaderItem(rowNum, new QTableWidgetItem(QString::number(rowNum + 1)));

// Add table item
ui->tablePreview->setItem(
rowNum,
std::distance(ct->begin(), it),
std::distance(data.begin(), it),
new QTableWidgetItem(*it));
}
}

return true;
}, 20);
}

void ImportCsvDialog::checkInput()
@@ -325,69 +320,61 @@ void ImportCsvDialog::matchSimilar()
checkInput();
}

CSVParser ImportCsvDialog::parseCSV(const QString &fileName, qint64 count)
CSVParser::ParserResult ImportCsvDialog::parseCSV(const QString &fileName, std::function<bool(size_t, QStringList)> rowFunction, qint64 count)
{
// Parse all csv data
QFile file(fileName);
file.open(QIODevice::ReadOnly);

CSVParser csv(ui->checkBoxTrimFields->isChecked(), currentSeparatorChar(), currentQuoteChar());
// If count is one, we only want the header, no need to see progress
if (count != 1) csv.setCSVProgress(new CSVImportProgress(file.size()));

// Only show progress dialog if we parse all rows. The assumption here is that if a row count limit has been set, it won't be a very high one.
if(count == -1)
csv.setCSVProgress(new CSVImportProgress(file.size()));

QTextStream tstream(&file);
tstream.setCodec(currentEncoding().toUtf8());
csv.parse(tstream, count);
file.close();

return csv;
return csv.parse(rowFunction, tstream, count);
}

sqlb::FieldVector ImportCsvDialog::generateFieldList(const QString& filename)
{
// Parse the first couple of records of the CSV file and only analyse them
CSVParser parser = parseCSV(filename, 20);
sqlb::FieldVector fieldList; // List of fields in the file

// If there is no data, we don't return any fields
if(parser.csv().size() == 0)
return sqlb::FieldVector();
// Parse the first couple of records of the CSV file and only analyse them
parseCSV(filename, [this, &fieldList](size_t rowNum, const QStringList& data) -> bool {
// Does this row have more columns than the previous one? If so, add more fields to the field list as necessary.
for(int i=fieldList.size();i<data.size();i++)
{
QString fieldname;

// How many columns are there in the CSV file?
int columns = 0;
for(int i=0;i<parser.csv().size();i++)
{
if(parser.csv().at(i).size() > columns)
columns = parser.csv().at(i).size();
}
// If the user wants to use the first row as table header and if this is the first row, extract a field name
if(rowNum == 0 && ui->checkboxHeader->isChecked())
{
// Take field name from CSV and remove invalid characters
fieldname = data.at(i);
fieldname.replace("`", "");
fieldname.replace(" ", "");
fieldname.replace('"', "");
fieldname.replace("'","");
fieldname.replace(",","");
fieldname.replace(";","");
}

// Generate field names. These are either taken from the first CSV row or are generated in the format of "fieldXY" depending on the user input
sqlb::FieldVector fieldList;
for(int i=0;i<columns;i++)
{
QString fieldname;
// If we don't have a field name by now, generate one
if(fieldname.isEmpty())
fieldname = QString("field%1").arg(i+1);

// Only take the names from the CSV file if the user wants that and if the first row in the CSV file has enough columns
if(ui->checkboxHeader->isChecked() && i < parser.csv().at(0).size())
{
// Take field name from CSV and remove invalid characters
fieldname = parser.csv().at(0).at(i);
fieldname.replace("`", "");
fieldname.replace(" ", "");
fieldname.replace('"', "");
fieldname.replace("'","");
fieldname.replace(",","");
fieldname.replace(";","");
// Add field to the column list
fieldList.push_back(sqlb::FieldPtr(new sqlb::Field(fieldname, "")));
}

// If we don't have a field name by now, generate one
if(fieldname.isEmpty())
fieldname = QString("field%1").arg(i+1);

// TODO Here's also the place to do some sort of data type analysis of the CSV data

// Add field to the column list
fieldList.push_back(sqlb::FieldPtr(new sqlb::Field(fieldname, "")));
}
// All good
return true;
}, 20);

return fieldList;
}
@@ -396,6 +383,7 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)
{
#ifdef CSV_BENCHMARK
// If benchmark mode is enabled start measuring the performance now
qint64 timesRowFunction = 0;
QElapsedTimer timer;
timer.start();
#endif
@@ -415,19 +403,8 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)

// Analyse CSV file
sqlb::FieldVector fieldList = generateFieldList(fileName);

// Parse entire file
CSVParser csv = parseCSV(fileName);
if (csv.csv().size() == 0) return;

#ifdef CSV_BENCHMARK
qint64 timer_after_parsing = timer.elapsed();
#endif

// Show progress dialog
QProgressDialog progress(tr("Inserting data..."), tr("Cancel"), 0, csv.csv().size());
progress.setWindowModality(Qt::ApplicationModal);
progress.show();
if(fieldList.size() == 0)
return;

// Are we importing into an existing table?
bool importToExistingTable = false;
@@ -452,22 +429,18 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)
}
}

#ifdef CSV_BENCHMARK
qint64 timer_before_insert = timer.elapsed();
#endif

// Create a savepoint, so we can rollback in case of any errors during importing
// db needs to be saved or an error will occur
QString restorepointName = pdb->generateSavepointName("csvimport");
if(!pdb->setSavepoint(restorepointName))
return rollback(this, pdb, progress, restorepointName, 0, tr("Creating restore point failed: %1").arg(pdb->lastError()));
return rollback(this, pdb, restorepointName, 0, tr("Creating restore point failed: %1").arg(pdb->lastError()));

// Create table
QStringList nullValues;
if(!importToExistingTable)
{
if(!pdb->createTable(sqlb::ObjectIdentifier("main", tableName), fieldList))
return rollback(this, pdb, progress, restorepointName, 0, tr("Creating the table failed: %1").arg(pdb->lastError()));
return rollback(this, pdb, restorepointName, 0, tr("Creating the table failed: %1").arg(pdb->lastError()));
} else {
// Importing into an existing table. So find out something about its structure.

@@ -497,68 +470,78 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)
sqlite3_stmt* stmt;
sqlite3_prepare_v2(pdb->_db, sQuery.toUtf8(), sQuery.toUtf8().length(), &stmt, nullptr);

// now lets import all data, one row at a time
CSVParser::TCSVResult::const_iterator itBegin = csv.csv().begin();
if(ui->checkboxHeader->isChecked()) // If the first row contains the field names we should skip it here because this is the data import
++itBegin;
for(CSVParser::TCSVResult::const_iterator it = itBegin;
it != csv.csv().end();
++it)
{
// Parse entire file
size_t lastRowNum = 0;
CSVParser::ParserResult result = parseCSV(fileName, [&](size_t rowNum, const QStringList& data) -> bool {
// Process the parser results row by row

#ifdef CSV_BENCHMARK
qint64 timeAtStartOfRowFunction = timer.elapsed();
#endif

// Save row num for later use. This is used in the case of an error to tell the user in which row the error occurred
lastRowNum = rowNum;

// If this is the first row and we want to use the first row as table header, skip it now because this is the data import, not the header parsing
if(rowNum == 0 && ui->checkboxHeader->isChecked())
return true;

// Bind all values
unsigned int bound_fields = 0;
for(int i=0;i<it->size();i++,bound_fields++)
for(int i=0;i<data.size();i++,bound_fields++)
{
// Empty values need special treatment, but only when importing into an existing table where we could find out something about
// its table definition
if(importToExistingTable && it->at(i).isEmpty() && nullValues.size() > i)
if(importToExistingTable && data.at(i).isEmpty() && nullValues.size() > i)
{
// This is an empty value. We'll need to look up how to handle it depending on the field to be inserted into.
QString val = nullValues.at(i);
if(!val.isNull()) // No need to bind NULL values here as that is the default bound value in SQLite
sqlite3_bind_text(stmt, i+1, val.toUtf8(), val.toUtf8().size(), SQLITE_TRANSIENT);
} else {
// This is a non-empty value. Just add it to the statement
sqlite3_bind_text(stmt, i+1, static_cast<const char*>(it->at(i).toUtf8()), it->at(i).toUtf8().size(), SQLITE_TRANSIENT);
sqlite3_bind_text(stmt, i+1, static_cast<const char*>(data.at(i).toUtf8()), data.at(i).toUtf8().size(), SQLITE_TRANSIENT);
}
}

// Insert row
if(sqlite3_step(stmt) != SQLITE_DONE)
{
sqlite3_finalize(stmt);
return rollback(this, pdb, progress, restorepointName, std::distance(itBegin, it) + 1, tr("Inserting row failed: %1").arg(pdb->lastError()));
}
return false;

// Reset statement for next use. Also reset all bindings to NULL. This is important, so we don't need to bind missing columns or empty values in NULL
// columns manually.
sqlite3_reset(stmt);
sqlite3_clear_bindings(stmt);

// Update progress bar and check if cancel button was clicked
unsigned int prog = std::distance(csv.csv().begin(), it);
if(prog % 100 == 0)
progress.setValue(prog);
if(progress.wasCanceled())
{
sqlite3_finalize(stmt);
return rollback(this, pdb, progress, restorepointName, std::distance(itBegin, it) + 1, "");
}
#ifdef CSV_BENCHMARK
timesRowFunction += timer.elapsed() - timeAtStartOfRowFunction;
#endif

return true;
});

// Success?
if(result != CSVParser::ParserResult::ParserResultSuccess)
{
// Some error occurred or the user cancelled the action

// Rollback the entire import. If the action was cancelled, don't show an error message. If it errored, show an error message.
sqlite3_finalize(stmt);
if(result == CSVParser::ParserResult::ParserResultCancelled)
return rollback(this, pdb, restorepointName, 0, QString());
else
return rollback(this, pdb, restorepointName, lastRowNum, tr("Inserting row failed: %1").arg(pdb->lastError()));
}

// Clean up prepared statement
sqlite3_finalize(stmt);

#ifdef CSV_BENCHMARK
// If benchmark mode is enabled calculate the results now
qint64 timer_after_insert = timer.elapsed();

QMessageBox::information(this, qApp->applicationName(),
tr("Importing the file '%1' took %2ms. The parser took %3ms and the insertion took %4ms.")
tr("Importing the file '%1' took %2ms. Of this %3ms were spent in the row function.")
.arg(fileName)
.arg(timer_after_insert)
.arg(timer_after_parsing)
.arg(timer_after_insert-timer_before_insert));
.arg(timer.elapsed())
.arg(timesRowFunction));
#endif
}


10 comments on commit 6ed8080

@MKleusberg (Member Author) replied Sep 12, 2017

Yay, 2000th commit 😄

@justinclift If you have the time, can you try that 700MB file again? I wonder whether this makes it better or worse on your system. It would also be interesting to check the maximum memory consumption in the task manager while the import is running. If you turn on benchmark mode, it now reports the "row function time" instead of the parse time and insert time, because the import isn't split up into two phases anymore. But the row function time is roughly equivalent to what the insert time was.

@justinclift (Member) replied Sep 12, 2017

Yep, 2000 commits. Who would've thought we'd ever get to that just a few years ago? 😄

With the 700MB file, I'll try again shortly. Just about to jump onto that desktop anyway, so good timing.

An idea occurred to me last night about the import. The import code (at least prior to this commit) seems to be single-threaded. Kind of wondering if it would be a good candidate for breaking up into chunks (one per CPU core?) and doing the processing step for the rows in parallel.

Then again, since we're not doing a large upfront processing step any more, that's probably a moot issue. 😉

Anyway, I'll try the 700MB file again shortly. 😄

@justinclift (Member) replied Sep 12, 2017

Initial timing run ("Trim fields" enabled):

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121486ms. Of this 37054ms were spent in the row function.

Prior to this commit the resident memory size while doing the import was ~600MB. With this commit it was around ~100MB.

I'll do a few more runs now just to make sure, and also test the other file which seemed to really stretch out the processing time.

@justinclift (Member) replied Sep 12, 2017

Also with "Trim fields" enabled:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121646ms. Of this 36842ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121007ms. Of this 37025ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
118890ms. Of this 35416ms were spent in the row function.

With "Trim fields" disabled:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
118626ms. Of this 36670ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
119002ms. Of this 36839ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
119789ms. Of this 37241ms were spent in the row function.

Resident memory size of DB4S remained ~100MB during all runs, with little variation. 😄

@justinclift (Member) replied Sep 12, 2017

Timing for the smaller-but-seems-to-be-harder CSV file ("Trim fields" enabled):

Importing the file '/home/jc/Databases/csv stress/selected_crimes_2012_2015.csv' took
663035ms. Of this 3681ms were spent in the row function.

Resident memory size with this one didn't stay constant. Instead, it kept growing over time to just over 200MB. I haven't looked at the memory usage of that with the previous commit, but I'll try it now. Guessing it'll be pretty huge. 😉

@justinclift (Member) replied Sep 12, 2017

With the previous commit, memory usage grew to 700MB for that CSV.

Something interesting about that CSV is that the processing rate appears to start out fast but slows down dramatically over time. Kind of wondering if it's triggering some kind of worst-case processing scenario.

@justinclift (Member) replied Sep 12, 2017

Changing "Trim fields" to disabled has no effect on processing time or resident memory usage. Still just over 200MB for this CSV.

Importing the file '/home/jc/Databases/csv stress/selected_crimes_2012_2015.csv' took
666227ms. Of this 3676ms were spent in the row function.

@MKleusberg (Member Author) replied Sep 13, 2017

Ah, that's pretty interesting 😄

Here you wrote that, with "Trim fields" enabled and with neither this commit nor the three-rows-at-a-time commit applied, you got these times:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
124843ms. The parser took 84334ms and the insertion took 40500ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
125222ms. The parser took 83589ms and the insertion took 41630ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
126896ms. The parser took 85466ms and the insertion took 41412ms.

And now it's

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121646ms. Of this 36842ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121007ms. Of this 37025ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
118890ms. Of this 35416ms were spent in the row function.

So it does seem marginally faster. I was afraid it might be a little slower than the old version, so that's good news 😄

And yeah, the memory usage is as expected. It mainly depends on the length of a row now: the more columns and the longer their contents, the more memory it uses. So the 200MB are ok, it just means that there's more data per row than in the other file. If it's rising to 200MB over time, that means that the later rows contain more data than the first ones.

So that's all good. The reason that the Hebrew file is taking so much longer is probably because of some data problem in there. I've sent you an email with the problematic rows.

@MKleusberg (Member Author) replied Sep 13, 2017

The multi-threading idea is pretty complicated (I would say impossible 😉), by the way, because you have to parse the CSV file in order to know where to split it up for multi-threaded processing. So that doesn't work. What is possible is to use two threads: one for parsing and one for inserting into the SQLite database. But that would only save us the row function time minus some overhead. That might become interesting at some point, but right now we have plenty of options to optimise the parser itself 😄
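
If anyone ever wants to try the two-thread variant, the handover between the parser thread and the insert thread could look roughly like this (a hypothetical sketch only; RowQueue and its capacity are made-up names, not existing code):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <QStringList>

// Hypothetical bounded queue between a parser (producer) thread and an
// insert (consumer) thread. The small capacity keeps memory usage bounded,
// similar to the current one-row-at-a-time approach.
class RowQueue
{
public:
    void push(QStringList row)          // called by the parser thread per row
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this]{ return m_rows.size() < Capacity; });
        m_rows.push(std::move(row));
        m_cv.notify_all();
    }

    std::optional<QStringList> pop()    // called by the insert thread in a loop
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this]{ return !m_rows.empty() || m_done; });
        if(m_rows.empty())
            return std::nullopt;        // parser finished and queue drained
        QStringList row = std::move(m_rows.front());
        m_rows.pop();
        m_cv.notify_all();
        return row;
    }

    void finish()                       // called once when parsing is done
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_done = true;
        m_cv.notify_all();
    }

private:
    static constexpr size_t Capacity = 1000;
    std::queue<QStringList> m_rows;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_done = false;
};
```

The parser thread would push() every row and call finish() at the end, while the insert thread loops on pop() and binds/steps the prepared INSERT statement. At best that overlaps the row function time with the parse time, minus the locking overhead.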

@justinclift (Member) replied Sep 13, 2017

No worries at all. 😄
