
CSV Importer

Kiah Stroud edited this page Jun 14, 2024 · 32 revisions

Bulkrax can import from a CSV file that follows the guidelines below.

Required fields

  • The CSV MUST have a header row that names each column.
  • This header row MUST have a field representing the source_identifier, containing a unique identifier for the item (see Source Identifier below for more detail).
  • The CSV MUST have a title column.
  • Every work MUST have a value in both the field representing the source_identifier and the title field (unless you are auto-generating source_identifiers in the Bulkrax config file).
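
As a rough pre-flight check, the rules above could be verified before a file is handed to Bulkrax. The sketch below uses only Ruby's standard csv library. The method name missing_required_fields is hypothetical and not part of Bulkrax, and the sketch assumes the source identifier column is literally named source_identifier and that identifiers are not being auto-generated.

```ruby
require 'csv'

# Hypothetical helper (not part of Bulkrax): returns a list of problems found
# in the CSV text, or an empty array when the required fields look complete.
def missing_required_fields(csv_text, required: %w[source_identifier title])
  table = CSV.parse(csv_text, headers: true)
  problems = []

  # The header row must contain every required column.
  required.each do |column|
    problems << "missing column: #{column}" unless table.headers.include?(column)
  end
  return problems unless problems.empty?

  # Every row must have a non-blank value in each required column.
  table.each_with_index do |row, index|
    required.each do |column|
      value = row[column].to_s.strip
      problems << "row #{index + 1}: blank #{column}" if value.empty?
    end
  end
  problems
end

sample = <<~DATA
  source_identifier,title,creator
  work_1,First Work,Aaliyah
  ,Second Work,Ruth
DATA

puts missing_required_fields(sample)
```

Row numbers in the messages count data rows only, so the header row is not included.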

Source Identifier

Refer to https://github.com/samvera-labs/bulkrax/wiki/Configuring-Bulkrax#source-identifier.

Supported fields

All columns will be imported if the column name matches an existing metadata property in Hyrax, e.g. title, creator, etc.

In addition, the following columns will be imported:

  • collection or collection_# (deprecated in v3.0.0)
  • file or file_#
  • file_url or file_url_#
  • remote_files
  • model

Properties with multiple values

A property's value is most often a single string or an array of strings. We are also accounting for the value being an array of hashes. Refer to the field mapping for more configuration details on how to handle these use cases.

There are two ways that a property with multiple values can be imported.

Single Header

contributor,language,license
Aaliyah; Ruth,En,cc3.0

Multiple Headers

contributor_1,contributor_2,language,license
Aaliyah,Ruth,En,cc3.0
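
To illustrate that the two layouts carry the same information, here is a minimal sketch that gathers the values for one property from either form. It is not how Bulkrax itself parses rows (that is governed by the field mapping); values_for is a hypothetical helper, and the ";" delimiter is assumed from the Single Header example.

```ruby
# Hypothetical helper (not part of Bulkrax): collects every value for a
# property from a row, whether it arrives as "contributor" with a delimiter
# or as numbered columns such as "contributor_1" and "contributor_2".
def values_for(row, property, delimiter: ';')
  pattern = /\A#{Regexp.escape(property)}(_\d+)?\z/
  row.select { |header, _| header.match?(pattern) }
     .values
     .compact
     .flat_map { |value| value.split(delimiter) }
     .map(&:strip)
     .reject(&:empty?)
end

single   = { 'contributor' => 'Aaliyah; Ruth', 'language' => 'En', 'license' => 'cc3.0' }
multiple = { 'contributor_1' => 'Aaliyah', 'contributor_2' => 'Ruth',
             'language' => 'En', 'license' => 'cc3.0' }

p values_for(single, 'contributor')
p values_for(multiple, 'contributor')
```

Both calls yield the same two contributors, Aaliyah and Ruth.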

Collections

As of v3.0.0, collections are imported as their own row instead of as a column header. Use the format below to create or edit your CSV (the columns can be in any order).

  • In the example below, Second Work is a child of First Work, while both works are children of Collection
    • If you don't want Second Work to be a child of Collection, don't add the Collection source_identifier as a parent
  • Collections can also be children of other collections
  • A "children" column can also be used to establish relationships, but use either "parents" or "children", not both.
  • The character separating multiple source identifiers can be ; or |, or whatever value has been established as the delimiter for the parents/children field in your bulkrax.rb mapping.

source_identifier,model,title,description,parents
collection_1,Collection,First Collection,This will be the collection's description,
work_1,Work,First Work,This is a work,collection_1
,Work,Second Work,This is another work,work_1 ; collection_1
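
The relationships described by a "parents" column can be pictured as a parent-to-children lookup. The following is only a sketch under stated assumptions: parent_ids is a hypothetical helper, the delimiter is assumed to be ";" or "|", and the identifier work_2 is invented here so the second work can be referred to by name.

```ruby
# Hypothetical helper (not part of Bulkrax): splits a "parents" cell on ";" or
# "|" and returns the referenced source identifiers.
def parent_ids(cell, delimiter: /\s*[;|]\s*/)
  cell.to_s.split(delimiter).map(&:strip).reject(&:empty?)
end

# work_2 is an assumed identifier used only for this illustration.
rows = [
  { 'source_identifier' => 'collection_1', 'parents' => nil },
  { 'source_identifier' => 'work_1',       'parents' => 'collection_1' },
  { 'source_identifier' => 'work_2',       'parents' => 'work_1 ; collection_1' }
]

# Build a parent => children lookup from the rows.
children_of = Hash.new { |hash, key| hash[key] = [] }
rows.each do |row|
  parent_ids(row['parents']).each do |parent|
    children_of[parent] << row['source_identifier']
  end
end

p children_of
```

Here collection_1 ends up with both works as children, and work_1 has work_2 as its only child.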

Caveats

  • Since a collection is imported as a row with its own metadata now, you must give the collection a source_identifier value to reference in the "parents" column of whatever work(s) you want to belong to it
  • If you are importing works into an existing collection, you don't need the collection row. You must still reference either the source_identifier or id already attached to that collection in the "parents" column of the work(s) and/or collection(s).

Deprecated in v3.0.0

A column titled collection will be used to define which collection imported works should be added to. Works are added to collections based on the collection's source_identifier, which would be provided in the collection field on the csv. To create a new collection, put the title in the collection field.

Multiple collections can be supplied.

If the value provided matches a value found in the system_identifier_field of an existing collection, then works will be added to that collection. If not, a new collection will be created and both title and system_identifier_field will be set to the value supplied in the collection column.

For example:

source_identifier,title,collection
imported_work_1,Work One,Collection One
imported_work_2,Work Two,Collection One; Collection Two

In the first row (after the header), the Work being imported will be added to Collection One, and in the second, to both Collection One and Collection Two.

If either of those already exists, the existing collection is used. If not, a new one is created.

Model

The model column is used to determine the work type. It is not required. In its absence, either the field mapping or default_work_type will be used. Read more about these in the Configuration guide.

Importing Files

Method 1

This method is capable of importing files and assigning them to works, but is incapable of assigning metadata to the files themselves.

Files will be imported from a column called file_#, file_url_# or remote_files if they are present.

The file_# columns will each contain a single filename (these must be unique). Multiple files can be imported by using additional numbered headers.

The file_url_# columns will each contain a single URL to a file which will be downloaded and imported (these must be unique). Multiple files can be imported by using additional numbered headers.

The remote_files column will contain one or more URLs to files which will be downloaded and imported. Multiple files can be imported if separated by a pipe (|). (Semicolons are valid URL syntax, so don't use them as the separator. URLs themselves MUST NOT contain pipes.)
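
The reason for the pipe separator can be seen in a small sketch: a URL may legitimately contain a semicolon, so splitting on pipes keeps it intact. remote_file_urls is a hypothetical helper (not part of Bulkrax) and the example.org URLs are placeholders.

```ruby
require 'uri'

# Hypothetical helper (not part of Bulkrax): splits a remote_files cell on
# pipes and keeps only values that parse as http(s) URLs with a host.
def remote_file_urls(cell)
  cell.to_s.split('|').map(&:strip).select do |candidate|
    uri = URI.parse(candidate)
    uri.is_a?(URI::HTTP) && !uri.host.nil?
  rescue URI::InvalidURIError
    false
  end
end

# Placeholder URLs; the first contains a semicolon in its query string.
cell = 'https://example.org/a.pdf?x=1;y=2 | https://example.org/b.tif'

puts remote_file_urls(cell)
```

Had the cell been split on ";" instead, the first URL would have been cut in two.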

Method 2

This method is capable of importing files, assigning them to works, and assigning metadata to the files themselves.

One or more files can be uploaded into a set that contains metadata. This is referred to as a "File Set".

NOTE: Currently (as of v2.1), this method does not support the file_url_# and remote_files columns mentioned in Method 1. Only the file_# column is supported. See the Important Configuration Details section below for more details on how to use the file_# column.

The following are required to import File Sets:

Example CSV:

source_identifier,model,file,parent,title,description
work_1,Work,,,My Work,This is a work
file_set_1,FileSet,image_1.png,work_1,My FileSet,This is a file set

Important Configuration Details

Regardless of which method you choose to ingest files, the following rules apply.

Files Location

If imported from a pre-existing server location, files MUST be placed in a directory called files relative to the location of the CSV file. By default, Bulkrax will process the file column in the provided CSV, or treat all file_<number> columns (e.g. file_0, file_1, file_2) as the columns for filenames.

Below is an example of the current directory with the file metadata.csv and the sub-directory files containing two .tif files. When the metadata.csv has a column file, you can provide one or more filenames (separated by ;).

.
├── files
│   ├── P000001.tif
│   └── P000002.tif
└── metadata.csv

With the above current directory, when our metadata.csv looks as follows, we'd ingest one work and attach those two files to the work.

file,                      title
P000001.tif; P000002.tif,  My Work

Alternatively, with the same current directory, when our metadata.csv looks as follows, we'd ingest two works and attach one file to each work.

file,         title
P000001.tif,  My Work
P000002.tif,  My Other Work

Another example:

source_identifier,title,creator,publisher,file
first_work,First work title,"Smith, John",Faber and Faber,document.pdf
second_work,Second work title,"Jones, David",Macmillan,firstdocument.docx; seconddocument.pdf
third_work,Third work title,"Other, A.N.",Penguin,

If the CSV to be imported is written to the server at:

/tmp/imports/1/csv-to-be-imported.csv

The files would be at:

/tmp/imports/1/files/document.pdf
/tmp/imports/1/files/firstdocument.docx
/tmp/imports/1/files/seconddocument.pdf

The third_work does not have any associated files.
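
The path rule above can be sketched in a few lines of Ruby. expected_file_paths is a hypothetical helper, not part of Bulkrax, and it assumes ";" separates multiple filenames as in the examples.

```ruby
# Hypothetical helper (not part of Bulkrax): resolves the filenames in a
# "file" cell against the files directory that sits beside the CSV.
def expected_file_paths(csv_path, file_cell, delimiter: ';')
  files_dir = File.join(File.dirname(csv_path), 'files')
  file_cell.to_s.split(delimiter).map(&:strip).reject(&:empty?).map do |name|
    File.join(files_dir, name)
  end
end

csv_path = '/tmp/imports/1/csv-to-be-imported.csv'

puts expected_file_paths(csv_path, 'firstdocument.docx; seconddocument.pdf')
# A row with an empty file cell, such as third_work, resolves to no paths.
p expected_file_paths(csv_path, nil)
```

A check such as File.exist? on each returned path could then flag missing files before an import is started.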

If uploading using Browse Everything, the location of the files will be handled by the system.

Importing from a Zip file

A Zip file containing a single CSV and a folder named files/ can be imported by the CSV Importer. The structure of the Zip is very important and is as follows:

metadata.csv
files/
  |
  file_1.png
  file_2.jpg

See the Files Location section above for how to reference the files within the CSV.

In Finder, select the CSV and the files/ folder (cmd + click to select multiple items), right click, and select Compress. This will create the Zip file that will be imported.

NOTE: The names of the files themselves don't matter, as long as they match what's in the file column in the CSV. Likewise, the name of the CSV does not matter. However, the name of the folder containing the files does matter and should be written exactly as "files" (lowercase and plural). Also, the structure of the Zip is important; for example, if you compress a directory containing the CSV and the files/ folder, it will not import properly.
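
The layout rules in this note can be expressed as a simple check over the entry names in an archive. This is a sketch only: zip_layout_problems is a hypothetical helper, it works on a plain list of names rather than a real archive (reading one would need a Zip library), and it assumes at least one file is expected under files/.

```ruby
# Hypothetical helper (not part of Bulkrax): checks a list of Zip entry names
# against the layout described above.
def zip_layout_problems(entry_names)
  top_level_csvs = entry_names.select do |name|
    !name.include?('/') && name.downcase.end_with?('.csv')
  end
  attachments = entry_names.select do |name|
    name.start_with?('files/') && !name.end_with?('/')
  end

  problems = []
  unless top_level_csvs.size == 1
    problems << "expected exactly one top-level CSV, found #{top_level_csvs.size}"
  end
  problems << 'no files found under files/' if attachments.empty?
  problems
end

good = ['metadata.csv', 'files/', 'files/file_1.png', 'files/file_2.jpg']
# What results from compressing a parent directory instead of its contents.
bad = ['import/metadata.csv', 'import/files/file_1.png']

p zip_layout_problems(good)
p zip_layout_problems(bad)
```

The first list passes with no problems, while the second is flagged twice because everything sits one directory too deep.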

Configuration and Customization

Please see the Configuration guide for information on how to configure and customize import, for example by excluding columns from import or splitting data on specific delimiters.

Bulkrax.setup do |config|
  # Use the doi field (note: doi must be available on all works and collections).
  config.field_mappings['Bulkrax::CsvParser'] = {
    'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
  }
end

  • Allow Bulkrax to create the source_identifier
    • If there isn't a field that's available and unique across all Works and Collections, Bulkrax can make a custom field. An example of how this can be configured in the local application is as follows:
      config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
      config.field_mappings['Bulkrax::CsvParser'] = {
        'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
      }

    • You will also need to add the following to "app/indexers/shared_indexer" in your local app:
      solr_doc[Solrizer.solr_name('bulkrax_identifier', :facetable)] = object.bulkrax_identifier

