Adding CSVT file reading - using CSVT file alongside data file to determine field types
ccrook committed May 24, 2013
1 parent b5a5264 commit bdcc01e
Showing 13 changed files with 384 additions and 18 deletions.
20 changes: 18 additions & 2 deletions resources/context_help/QgsDelimitedTextSourceSelect-en_US
@@ -26,9 +26,25 @@ describe points, lines, and polygons of arbitrary complexity. The file can also
only table, which can then be joined to other tables in QGIS.
</p>
<p>
In addition to the geometry definition the file can contain text, integer, and real number fields. QGIS
will choose the type of field based on its contents.
In addition to the geometry definition the file can contain text, integer, and real number fields. By default
QGIS will choose the type of field based on the non-blank values of the field. If all can be interpreted
as integers then the type will be integer, if all can be interpreted as real numbers then the type will
be double, otherwise the type will be text.
</p>
<p>
QGIS can also read the types from an OGR CSV driver compatible &quot;csvt&quot; file.
This is a file alongside the data file, but with a &quot;t&quot; appended to the file name.
The file should contain just one line, which lists the type of each field.
Valid types are &quot;integer&quot;, &quot;real&quot;, &quot;string&quot;, &quot;date&quot;, &quot;time&quot;, and &quot;datetime&quot;. The date, time, and datetime types are treated as strings in QGIS.
Each type may be followed by a width and precision, for example &quot;real(10.4)&quot;.
The types are separated by commas, regardless of the delimiter used in the data file. An
example of a valid format file would be:
</p>

<pre>
&quot;integer&quot;,&quot;string&quot;,&quot;string(20)&quot;,&quot;real(20.4)&quot;
</pre>

<h4><a name="creating">Creating a delimited text layer</a></h4>
<p>Creating a delimited text layer involves choosing the data file, defining the format (how each record is to
be split into fields), and defining how the geometry is represented.
132 changes: 122 additions & 10 deletions src/providers/delimitedtext/qgsdelimitedtextprovider.cpp
@@ -19,6 +19,7 @@

#include <QtGlobal>
#include <QFile>
#include <QFileInfo>
#include <QDataStream>
#include <QTextStream>
#include <QStringList>
@@ -164,6 +165,83 @@ QgsDelimitedTextProvider::QgsDelimitedTextProvider( QString uri )
}
}

QStringList QgsDelimitedTextProvider::readCsvtFieldTypes( QString filename, QString *message )
{
// Look for a file with the same name as the data file, but an extra 't' or 'T' at the end
QStringList types;
QFileInfo csvtInfo( filename + 't' );
if ( ! csvtInfo.exists() ) csvtInfo.setFile( filename + 'T' );
if ( ! csvtInfo.exists() ) return types;
QFile csvtFile( csvtInfo.filePath() );
if ( ! csvtFile.open( QIODevice::ReadOnly ) ) return types;


// If anything goes wrong here the file is not valid,
// so any exceptions are simply ignored.

// For it to be valid, there must be just one non blank line at the beginning of the
// file.

QString strTypeList;
try
{
QTextStream csvtStream( &csvtFile );
strTypeList = csvtStream.readLine();
if ( strTypeList.isEmpty() ) return types;
QString extra = csvtStream.readLine();
while ( ! extra.isNull() )
{
if ( ! extra.isEmpty() ) return types;
extra = csvtStream.readLine();
}
}
catch ( ... )
{
return types;
}
csvtFile.close();

// Is the type string valid?
// This is a slightly generous regular expression in that it allows spaces and unquoted field types
// not allowed in OGR CSVT files. Also doesn't care if int and string fields have

strTypeList = strTypeList.toLower();
QRegExp reTypeList( "^(?:\\s*(\\\"?)(?:integer|real|string|date|datetime|time)(?:\\(\\d+(?:\\.\\d+)?\\))?\\1\\s*(?:,|$))+" );
if ( ! reTypeList.exactMatch( strTypeList ) )
{
// Looks like this was supposed to be a CSVT file, so report bad formatted string
if ( message ) { *message = tr( "File type string in %1 is not correctly formatted" ).arg( csvtInfo.fileName() ); }
return types;
}

// All good, so pull out the types from the string. Currently only returning integer, real, and string types

QgsDebugMsg( QString( "Reading field types from %1" ).arg( csvtInfo.fileName() ) );
QgsDebugMsg( QString( "Field type string: %1" ).arg( strTypeList ) );

int pos = 0;
QRegExp reType( "(integer|real|string|date|datetime|time)" );

while (( pos = reType.indexIn( strTypeList, pos ) ) != -1 )
{
QgsDebugMsg( QString( "Found type: %1" ).arg( reType.cap( 1 ) ) );
types << reType.cap( 1 );
pos += reType.matchedLength();
}

if ( message )
{
// Would be a useful info message, but don't want dialog to pop up every time...
// *message=tr("Reading field types from %1").arg(csvtInfo.fileName());
}


return types;


}


void QgsDelimitedTextProvider::resetCachedSubset()
{
mCachedSubsetString = QString();
@@ -482,6 +560,10 @@ void QgsDelimitedTextProvider::scanFile( bool buildIndexes )
mFieldCount = fieldNames.size();
attributeColumns.clear();
attributeFields.clear();

QString csvtMessage;
QStringList csvtTypes = readCsvtFieldTypes( mFile->fileName(), &csvtMessage );

for ( int i = 0; i < fieldNames.size(); i++ )
{
// Skip over WKT field ... don't want to display in attribute table
@@ -490,8 +572,21 @@ void QgsDelimitedTextProvider::scanFile( bool buildIndexes )
// Add the field index lookup for the column
attributeColumns.append( i );
QVariant::Type fieldType = QVariant::String;
QString typeName = "Text";
if ( i < couldBeInt.size() )
QString typeName = "text";
if ( i < csvtTypes.size() )
{
if ( csvtTypes[i] == "integer" )
{
fieldType = QVariant::Int;
typeName = "integer";
}
else if ( csvtTypes[i] == "real" )
{
fieldType = QVariant::Double;
typeName = "double";
}
}
else if ( i < couldBeInt.size() )
{
if ( couldBeInt[i] )
{
@@ -513,6 +608,7 @@ void QgsDelimitedTextProvider::scanFile( bool buildIndexes )
QgsDebugMsg( "feature count is: " + QString::number( mNumberFeatures ) );

QStringList warnings;
if ( ! csvtMessage.isEmpty() ) warnings.append( csvtMessage );
if ( nBadFormatRecords > 0 )
warnings.append( tr( "%1 records discarded due to invalid format" ).arg( nBadFormatRecords ) );
if ( nEmptyGeometry > 0 )
@@ -1104,25 +1200,41 @@ void QgsDelimitedTextProvider::fetchAttribute( QgsFeature& feature, int fieldIdx
switch ( attributeFields[fieldIdx].type() )
{
case QVariant::Int:
if ( value.isEmpty() )
val = QVariant( attributeFields[fieldIdx].type() );
{
int ivalue;
bool ok = false;
if ( ! value.isEmpty() ) ivalue = value.toInt( &ok );
if ( ok )
val = QVariant( ivalue );
else
val = QVariant( value );
val = QVariant( attributeFields[fieldIdx].type() );
break;
}
case QVariant::Double:
if ( value.isEmpty() )
{
double dvalue; // must be double: an int would truncate the parsed value
bool ok = false;
if ( ! value.isEmpty() )
{
val = QVariant( attributeFields[fieldIdx].type() );
if ( mDecimalPoint.isEmpty() )
{
dvalue = value.toDouble( &ok );
}
else
{
dvalue = QVariant( QString( value ).replace( mDecimalPoint, "." ) ).toDouble( &ok );
}
}
else if ( mDecimalPoint.isEmpty() )
if ( ok )
{
val = QVariant( value.toDouble() );
val = QVariant( dvalue );
}
else
{
val = QVariant( QString( value ).replace( mDecimalPoint, "." ).toDouble() );
val = QVariant( attributeFields[fieldIdx].type() );
}
break;
}
default:
val = QVariant( value );
break;
8 changes: 8 additions & 0 deletions src/providers/delimitedtext/qgsdelimitedtextprovider.h
@@ -201,6 +201,14 @@ class QgsDelimitedTextProvider : public QgsVectorDataProvider
*/
bool boundsCheck( QgsGeometry *geom );

/**
* Try to read field types from CSVT (or equivalent xxxT) file.
* @param filename The name of the file from which to read the field types
* @param message Pointer to a string to receive a status message
* @return A list of field type strings, empty if not found or not valid
*/
QStringList readCsvtFieldTypes( QString filename, QString *message = 0 );

private slots:

void onFileUpdated();
46 changes: 42 additions & 4 deletions tests/src/python/test_qgsdelimitedtextprovider.py
@@ -94,6 +94,7 @@ def layerData( layer, request={}, offset=0 ):
first = True
data = {}
fields = []
fieldTypes = []
fr = QgsFeatureRequest()
if request:
if 'exact' in request and request['exact']:
@@ -112,6 +113,7 @@ def layerData( layer, request={}, offset=0 ):
first = False
for field in f.fields():
fields.append(str(field.name()))
fieldTypes.append(str(field.typeName()))
fielddata = dict ( (name, unicode(f[name].toString()) ) for name in fields )
g = f.geometry()
if g:
@@ -129,7 +131,7 @@ def layerData( layer, request={}, offset=0 ):
if 'description' not in fields: fields.insert(1,'description')
fields.append(fidkey)
fields.append(geomkey)
return fields, data
return fields, fieldTypes, data

# Retrieve the data for a delimited text url

@@ -157,6 +159,7 @@ def delimitedTextData( testname, filename, requests, verbose, **params ):
basename='file'
uri = re.sub(r'^file\:\/\/[^\?]*','file://'+basename,uri)
fields = []
fieldTypes = []
data = {}
if layer.isValid():
for nr,r in enumerate(requests):
@@ -165,14 +168,16 @@ def delimitedTextData( testname, filename, requests, verbose, **params ):
if callable(r):
r( layer )
continue
rfields,rdata = layerData(layer,r,nr*1000)
if len(rfields) > len(fields): fields = rfields
rfields,rtypes, rdata = layerData(layer,r,nr*1000)
if len(rfields) > len(fields):
fields = rfields
fieldTypes=rtypes
data.update(rdata)
if not rdata:
log.append("Request "+str(nr)+" did not return any data")
for msg in logger.messages():
log.append(msg.replace(filepath,'file'))
return dict( fields=fields, data=data, log=log, uri=uri)
return dict( fields=fields, fieldTypes=fieldTypes, data=data, log=log, uri=uri)

def printWanted( testname, result ):
# Routine to export the result as a function definition
@@ -186,6 +191,7 @@ def printWanted( testname, result ):
# Dump the data for a layer - used to construct unit tests
print prefix+"wanted={}"
print prefix+"wanted['uri']="+repr(result['uri'])
print prefix+"wanted['fieldTypes']="+repr(result['fieldTypes'])
print prefix+"wanted['data']={"
for k in sorted(data.keys()):
row = data[k]
@@ -265,6 +271,10 @@ def runTest( file, requests, **params ):
result['uri'],wanted['uri'])
print ' '+msg
failures.append(msg)
if result['fieldTypes'] != wanted['fieldTypes']:
msg = "Layer field types ({0}) doesn't match expected ({1})".format(
result['fieldTypes'],wanted['fieldTypes'])
failures.append(msg)
wanted_data = wanted['data']
for id in sorted(wanted_data.keys()):
wrec = wanted_data[id]
@@ -614,6 +624,34 @@ def test_033_reset_subset_string(self):
]
runTest(filename,requests,**params)

def test_034_csvt_file(self):
# CSVT field types
filename='testcsvt.csv'
params={'geomType': 'none', 'type': 'csv'}
requests=None
runTest(filename,requests,**params)

def test_035_csvt_file2(self):
# CSV field types 2
filename='testcsvt2.txt'
params={'geomType': 'none', 'type': 'csv', 'delimiter': '|'}
requests=None
runTest(filename,requests,**params)

def test_036_csvt_file_invalid_types(self):
# CSV field types invalid string format
filename='testcsvt3.csv'
params={'geomType': 'none', 'type': 'csv'}
requests=None
runTest(filename,requests,**params)

def test_037_csvt_file_invalid_file(self):
# CSV field types invalid file
filename='testcsvt4.csv'
params={'geomType': 'none', 'type': 'csv'}
requests=None
runTest(filename,requests,**params)


if __name__ == '__main__':
unittest.main()

4 comments on commit bdcc01e

@ccrook (Contributor, Author) commented on bdcc01e May 24, 2013

Added this feature to avoid problems Régis Hauberg found with the provider's current behaviour of automatically assigning field types, which caused problems when joining data if the inferred field type was not the desired one. See discussion at:

http://lists.osgeo.org/pipermail/qgis-developer/2013-May/026255.html

With this change the provider will look for an OGR CSV driver compliant .csvt file alongside the data file, and read field types from that if it exists. In fact it looks for a file with the same extension as the data file with a "t" added, so for a .txt file it will check whether there is a matching .txtt file, and so on.

While it is late to add a feature, I see this as adding a lot of value with minimal impact on anything else. It adds one extra translation string (only seen in the message log) and some text in the delimited text provider help file. Other than that, the changes are all within the provider code. There is no UI change.

This does not cover the much greater feature set that Régis is discussing, but it does address a significant weakness in the provider.
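
The sidecar lookup and type-string check described above can be sketched in Python roughly as follows. This is only an illustration, not the provider's C++ code: the names `parse_csvt_line` and `read_csvt_types` are hypothetical, the UTF-8 encoding is an assumption, and the sketch is slightly stricter than the provider's regular expression (for example, it rejects a trailing comma).

```python
import os
import re

# One type entry: optional matching quotes, a type name, optional (width[.precision]).
# "datetime" is listed before "date" so the longer name is tried first.
_TYPE_RE = re.compile(
    r'\s*("?)(integer|real|string|datetime|date|time)(?:\(\d+(?:\.\d+)?\))?\1\s*'
)


def parse_csvt_line(line):
    """Return the type names in a CSVT line, or [] if any entry is malformed."""
    types = []
    for item in line.lower().split(","):
        match = _TYPE_RE.fullmatch(item)
        if match is None:
            return []
        types.append(match.group(2))
    return types


def read_csvt_types(data_path):
    """Look for data_path + 't' (or 'T') and return its field types, else []."""
    for suffix in ("t", "T"):
        sidecar = data_path + suffix
        if os.path.exists(sidecar):
            break
    else:
        return []  # no sidecar file found
    # Assumption: the sidecar is UTF-8 text.
    with open(sidecar, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    # Valid only if the first line is non-empty and all later lines are empty.
    if not lines or not lines[0] or any(lines[1:]):
        return []
    return parse_csvt_line(lines[0])
```

For a data file `data.txt`, `read_csvt_types("data.txt")` would look for `data.txtt` and then `data.txtT`, returning an empty list when neither exists or the content is not a single valid type line.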

@NathanW2 (Member)

What happens if there is no csvt file?

@ccrook (Contributor, Author) commented on bdcc01e May 25, 2013

Hi Nathan

Just the same as before: it infers field types based on the content. The same applies if there is a csvt file but its content is incorrect. Basically it only changes the behaviour if there is a valid CSVT file.
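
As a rough Python sketch of that fallback (not the provider code; `infer_type` and `field_types` are hypothetical names, and Python's `int`/`float` parsing differs in detail from Qt's): a column listed in the CSVT gets the declared type, with anything other than integer or real treated as text, and any column beyond the CSVT list is inferred from its non-blank values.

```python
# Map CSVT type names to provider field type names; date, time, datetime
# and string all fall back to text.
_CSVT_TO_FIELD = {"integer": "integer", "real": "double"}


def infer_type(values):
    """integer if all non-blank values parse as int, double if all parse
    as float, otherwise text."""
    non_blank = [v for v in values if v.strip()]

    def all_parse(convert):
        try:
            for value in non_blank:
                convert(value)
            return True
        except ValueError:
            return False

    # Assumption: a column with no non-blank values ends up as integer here;
    # the provider's behaviour in that case is not shown in this commit.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "text"


def field_types(csvt_types, columns):
    """Pick a type per column: the CSVT type if one is listed, else inferred."""
    result = []
    for index, values in enumerate(columns):
        if index < len(csvt_types):
            result.append(_CSVT_TO_FIELD.get(csvt_types[index], "text"))
        else:
            result.append(infer_type(values))
    return result
```

With an empty `csvt_types` list (no CSVT file, or an invalid one) every column goes through `infer_type`, which matches the behaviour before this change.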

Cheers
Chris

@NathanW2 (Member)

Good stuff. Just checking :)
