
Releases: socrata/datasync

DataSync 1.6

04 Sep 23:05

Overview

This release greatly simplifies setting up your integration jobs using DataSync. In addition to simplifying the initial setup screen, we have added an interactive UI to walk you through the often difficult task of mapping your CSV or TSV to the columns in your dataset. This UI gives you an interactive preview of exactly how your file will import, as well as real-time feedback on whether the job will complete successfully. Once set up, you can then use the output to quickly set up a scheduled “headless job” in the same way as before. With DataSync 1.6 it’s now even easier to reliably and repeatedly upload your data into Socrata.

What’s new

  • UI updates
    • Realtime feedback
    • Interactive control file generation
    • Inline validation
  • Up to 25% improvement in initial upload speed for large files.

Bug fixes

  • Datasync logs do not update with the number of inserts, updates and appends
  • “Unable to parse CRUD” command line error

DataSync 1.5.4

31 Dec 20:08

This release fixes four bugs.

Bugs

  • Validity checks were prematurely finishing in some cases, allowing jobs to start that would otherwise be denied.
  • Constructing a synthetic location could fail if the component columns were found in the file to publish but not in the dataset (unless the previous bug allowed the job to proceed).
  • Unknown control file fields were allowed and ignored; we're now restricting control files to have only those fields DataSync supports.
  • The publish method could not be given on the command line if a control file was being used.

DataSync 1.5.3

12 Dec 18:26

This release fixes two bugs and simplifies control file selection and editing in the UI.

Changes around Control File Selection

  • You may now view or edit control files regardless of whether they were generated or sourced from a file.
  • Control files will be ready for viewing and/or editing if an .sij file is loaded into the UI.
  • Control file content may be specified in two ways in an .sij file: via the controlFileContent field or the pathToControlFile field. Should the content be inconsistent across these:
    • in the UI, warnings are issued
    • in headless mode, errors are returned
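
For illustration, an .sij job file might carry both sourcing fields side by side (only pathToControlFile and controlFileContent come from the notes above; the values here are hypothetical). If the two disagree, the UI warns and headless mode errors out:

```json
{
  "pathToControlFile": "/jobs/control.json",
  "controlFileContent": "{\"action\": \"Replace\"}"
}
```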

Bugs

  • If a control file for a CSV lacked a "separator" field, the default separator in use was a tab rather than a comma.
  • If the CSV/TSV you want to upload has a byte order mark, the DataSync HTTP option will now remove it prior to upload.
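
Given that separator default, it is safest to name the separator explicitly in the control file. A minimal sketch; the nesting under a "csv" key is an assumption about the control file layout, while the "separator" field itself comes from the notes above:

```json
{
  "action": "Replace",
  "csv": {
    "separator": ","
  }
}
```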

DataSync 1.5.2

06 Nov 23:40

This minor release

  • adds proxy support for generating control files and finding the column ids (options available in the GUI under the standard job tabs).
  • makes the suite of validity checks for HTTP replace, append, upsert and delete jobs available to users working behind a proxy server.

With thanks to the DataSync users who uncovered this oversight.

DataSync 1.5.1

23 Oct 19:02

This minor release addresses

  • #84 - Synthetically made columns were being ignored when checking column agreement between the dataset and the csv/tsv.
  • #86 - Edits to the control file in the generate/edit panel were being forgotten if the user chose to browse for a control file then go back to editing one.
  • #87 - When opening job files after the first, DataSync will now remember the directory from which the last job was opened and open the file browser to that directory.

With thanks to mikegiddens for reporting these issues.

DataSync 1.5

24 Sep 18:17

Overview

We are excited to announce the release of DataSync 1.5 (download link below).

DataSync 1.5 provides a number of updates to increase the reliability of data ingress, as well as to allow DataSync to easily work in more environments. One of the most important improvements in this release is that data publishers can now use the ‘SSync replace’ operation over HTTP. In DataSync 1.0 we introduced ‘(SSync) replace’ via FTP which automatically detects which rows have been added, updated, or deleted and only publishes the minimal set of changes to the dataset. This functionality is now available over HTTP. Switching from the FTP variant is as easy as toggling a button in the user interface or including the “-ph” flag on the command line.

A number of changes were also made to increase the reliability of ingress. First, transmitting only the changes minimizes the amount of data that must be transferred, which in turn minimizes the opportunity for an unreliable network to fail the operation. Second, should DataSync encounter an unreliable network, it will execute a series of retries to allow the job to automatically recover and continue. These changes should greatly enhance the reliability of all jobs, including scheduled ETLs.

If you are already using DataSync, you just need to download the new JAR file below and replace your existing JAR file. If you are not using a previous version of DataSync, you can simply download version 1.5 below. DataSync 1.5 requires Java version 1.7 or higher.

DataSync documentation has also been improved and expanded. We have added a quick start guide which aims to simplify first time use of DataSync. We have detailed all of the options that are available to each job, in terms of both your personal configuration file and the job control file. We have added a resource that describes the data restrictions by data type. For instance, Percent type data must not include the % symbol and should range between 0 and 100, not 0 and 1.
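
The Percent restriction just described can be illustrated with a small check. This is a hypothetical helper, not part of DataSync, and it treats the "should range between 0 and 100" guidance as a hard rule:

```java
/** Sketch of the Percent-column restriction: values must be plain
 *  numbers in [0, 100] with no '%' symbol (hypothetical helper). */
class PercentCheck {
    static boolean isValidPercent(String raw) {
        if (raw.contains("%")) return false;   // '%' symbol is not allowed
        try {
            double v = Double.parseDouble(raw.trim());
            return v >= 0 && v <= 100;         // 0-100 scale, not 0-1
        } catch (NumberFormatException e) {
            return false;                      // not a number at all
        }
    }
}
```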

We also invite you to contribute to the documentation using a GitHub pull request to the gh-pages branch of the DataSync repository.

DataSync 1.5 comes with additional enhancements and new features, many of which are based on customer requests. The full list of changes can be found below:

What’s new

  • New SSync replace via HTTP - Enables simple and efficient replace operations on datasets of essentially any size. Diff-based file transferring is used so that only the changes between the CSV and the dataset are passed across the wire. All replace, upsert, and delete operations flow over HTTPS by default.
  • New SSync upsert and delete via HTTP - All of the SSync benefits brought to replace jobs are also available for upsert and delete jobs.
  • New HTTP proxy support - For the new SSync replace, upsert, and delete jobs, both authenticated and unauthenticated proxies are supported and configurable from the configuration file or preferences menu.
  • New Compressed diffs - DataSync compresses diffs before sending.
  • New Robust DataSync retry logic - DataSync will automatically pause and retry jobs in the case of network failure. You can now start a ‘Replace via HTTP’ job, turn off your internet, watch DataSync retry, turn your internet back on, and watch DataSync succeed!
  • New Early failure notification - For the new SSync suite of jobs, DataSync will attempt to find and report any control file misspecifications or data alignment problems before starting the job.
  • New Version information - You can retrieve the DataSync version of your JAR via the command line using -v or --version.
  • New Porting column formatting - Port jobs can now copy column formatting along with data.
  • Changed More job options - Options are now available in the UI to choose between the legacy SODA2 and FTP paths and the new HTTP path.
  • Changed Version warnings - The customer is only warned about new versions in the case of a major version update, and DataSync no longer breaks in the case of a major version change.
  • Changed Preferences location - Preferences moved into the File menu.
  • Changed Control file source - The source of the control file can now be changed in the GUI.
  • Changed Simpler configuration files - Previously, configuration files had to be fully specified regardless of the job. Now only the domain and user credentials are required in most cases.
  • Bug Fix Fixed a bug preventing the use of non-SSL SMTP servers.
  • Bug Fix Fixed a bug preventing port jobs of datasets with resource names.
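
The pause-and-retry behavior described under "Robust DataSync retry logic" can be sketched roughly as follows. The attempt count and backoff here are illustrative assumptions, not DataSync's documented policy:

```java
import java.util.function.Supplier;

/** Minimal sketch of pause-and-retry logic for a flaky network call. */
class RetrySketch {
    /** Runs the call, retrying up to maxAttempts times and
     *  sleeping backoffMillis between failed attempts. */
    static <T> T withRetries(int maxAttempts, long backoffMillis, Supplier<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();                 // success: hand back the result
            } catch (RuntimeException e) {
                last = e;                          // transient failure: remember and retry
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis); // pause before the next attempt
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException(ie);
                    }
                }
            }
        }
        throw last;                                // attempts exhausted: surface last failure
    }
}
```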

Known issues

  • Customers are limited to 2 simultaneously running jobs per domain - To keep customers from starving their own resources (and other customers’ resources), DataSync will only run two jobs at a time; additional jobs are queued and must wait for earlier jobs to complete. Socrata monitors the queue and will allocate additional resources if we find that jobs are not able to clear in an acceptable time.
  • HTTP proxies do not work with existing SODA2 jobs - Customers will need to set up new jobs using the updated upsert, replace, and delete jobs. This will require the customer to create a control file as described in our GitHub documentation.
  • Customers may be locked out for 15 minutes after incorrectly keying their password - Because DataSync retries failed network calls on the user’s behalf, an incorrectly keyed password triggers the 3-strikes-and-you’re-locked-out-for-15-minutes rule.

Want to leave a question, comment, suggestion, or bug report on DataSync? Submit these to the DataSync Github repository issue tracker - all you need is a free GitHub account:
https://github.com/socrata/datasync/issues

Watch the GitHub repository to remain up to speed with new features on the roadmap and deploy schedules for future versions.


DataSync 1.0

24 Apr 23:43

We are excited to announce the release of DataSync 1.0 (download link below).

One of the most important improvements in this release is that data publishers can now use the ‘replace’ operation as the default way to update essentially any dataset, even very large datasets (millions of rows). This is possible because the new ‘replace via FTP’ method in DataSync automatically detects which rows have been added, updated, or deleted and only publishes those changes to the dataset. For the vast majority of datasets, this will remove the need for data publishers to take on the rather complicated task of scripting a process to determine which rows have been added, updated, or deleted since the last dataset update. Publishers will no longer have to use the "upsert" method to update their datasets, a method which often requires significant developer resources. With DataSync 1.0, automating data publishing is as easy as extracting all the data into a CSV or TSV file and creating a simple DataSync job to publish the CSV or TSV to the Socrata dataset. The data publisher can then use Windows Task Scheduler or Cron to schedule the DataSync job to run automatically (e.g. every day).
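
The added/updated/deleted detection that ‘replace via FTP’ performs can be sketched as follows. This is a simplified illustration keyed by row ID; DataSync's actual diffing works on files and is more sophisticated:

```java
import java.util.*;

/** Sketch of computing a minimal change set between two dataset snapshots. */
class RowDiffSketch {
    /** Given old and new snapshots keyed by row id, returns the rows to
     *  upsert (added or changed) and the rows to delete (removed). */
    static Map<String, List<String>> diff(Map<String, String> before,
                                          Map<String, String> after) {
        List<String> upserts = new ArrayList<>();
        List<String> deletes = new ArrayList<>();
        for (Map.Entry<String, String> e : after.entrySet()) {
            String oldRow = before.get(e.getKey());
            if (oldRow == null || !oldRow.equals(e.getValue()))
                upserts.add(e.getKey());       // new row, or contents changed
        }
        for (String id : before.keySet())
            if (!after.containsKey(id))
                deletes.add(id);               // row vanished from the new file
        Map<String, List<String>> out = new HashMap<>();
        out.put("upsert", upserts);
        out.put("delete", deletes);
        return out;
    }
}
```

Only the "upsert" and "delete" sets need to cross the wire, which is why this approach scales to very large datasets.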

If you are already using DataSync you just need to download the new JAR file below and replace your existing JAR file. If you are not using a previous version of DataSync you can simply download version 1.0 below. DataSync 1.0 requires Java version 1.6 or higher. You can also download a version compiled with Java 1.7 if you prefer to use that (datasync_1.0_java1.7.jar).

DataSync documentation has also been dramatically improved and expanded. There is now comprehensive documentation for using DataSync exclusively as a command-line tool (headless mode).

We also invite you to contribute to the documentation using a GitHub pull request to the gh-pages branch of the DataSync repository.

DataSync 1.0 comes with additional enhancements and new features, many of which are based on customer requests:

  • ‘Replace via FTP’ update method: Enables simple and efficient replace operations on datasets of essentially any size
  • Reduces complexity of updating datasets with Location datatype columns: You can now use the Control file configuration (available when using the ‘replace via FTP’ method) to “pull” address, city, state, zip code, or latitude/longitude data from other (non-Location) columns into the Location column (to enable Map visualizations or geocoding)
  • Update dataset metadata: You can now use DataSync to automate updating dataset metadata using a Metadata Job (go to File -> New.. -> Metadata Job). Many thanks to the generous open source code contribution to DataSync by Brian Williamson for that new job type!
  • Improved command-line interface: More user-friendly and fully-featured interface to configure and run Standard integration and Port jobs without the user interface
  • Delete update operation: Now you can use the ‘delete’ method
  • Improved logging for long-running jobs: When you run a job in a terminal or command prompt there is detailed logging information outputting the job’s progress toward completion
  • Developer documentation for compiling with Eclipse (on Windows) which was generously contributed by Jeff Chamblee.
  • Other small features:
    • Support for importing data with any date format
    • Optional fine-grained control of other data importing parameters such as automatically trimming whitespace, setting the timezone of imported dates, text file encoding, null value handling, overriding the CSV header, etc.
    • Ability to set the name of the destination dataset when running a Port job headlessly
    • Get a list of column identifiers (API field names) for any dataset

View the full list of features added in version 1.0 here:
https://github.com/socrata/datasync/issues?milestone=3&page=1&state=closed

Want to leave a question, comment, suggestion, or bug report on DataSync? Submit these to the DataSync Github repository issue tracker - all you need is a free GitHub account:
https://github.com/socrata/datasync/issues

Watch the GitHub repository to remain up to speed with new features on the roadmap and deploy schedules for future versions.


DataSync 0.3

14 Nov 01:40

You are now able to upgrade to the newest version of DataSync - version 0.3 (download link below). If you already have DataSync installed, the next time you open DataSync you will see a prompt to update your version to the current one. If you do not have the previous version of DataSync installed, you can download version 0.3 below.

This release of DataSync comes with enhancements and new features, many of which are based on customer requests:

  • Ability to transfer data between two datasets: Port Jobs now support transferring data between two existing datasets on one domain or across two different domains.
  • Support for publishing files with and without a header row: Previously the CSV file to be published was required to have a header row; now you can publish data with or without a header.
  • Run and configure jobs purely through the command line: Now you can configure job parameters and, optionally, authentication details through the command line. This gives more flexibility when integrating DataSync into ETL processes. Previously, one had to configure jobs and input authentication details using the UI and then save the job before running it through the command line. Now you can do everything needed to run a standard job through the command line. To learn how to use DataSync in command line mode run:
    java -jar datasync_0.3.jar --help
  • Support for TSV files: DataSync now supports publishing .TSV (tab-separated value) files in addition to .CSV (comma-separated value) files
  • Help bubbles to explain features within the UI: make it easier for new users to learn DataSync.
  • Ability to configure file chunking settings: Now you can configure the parameters for automatic file chunking for optimizing uploading of very large files (go to Edit -> Preferences).
  • JUnit Testing Framework: this is relevant to developers working with the DataSync source code.

To view the full list of features added in version 0.3 see this page:
https://github.com/socrata/datasync/issues?milestone=2&state=closed

Want to leave a question, comment, suggestion or bug report on DataSync 0.3?
You can now submit these within the DataSync Github repository issue tracker - all you need is a free GitHub account:
https://github.com/socrata/datasync/issues

Check the repository to keep up to speed with new features on the roadmap and deploy schedules for future versions.


DataSync 0.2

21 Sep 03:52

You are now able to upgrade to the newest version of DataSync - version 0.2 (download link below). If you already have DataSync installed, the next time you open DataSync you will see a prompt to update your version to the current one. If you do not have the previous version of DataSync installed, you can download version 0.2 below.

This release of DataSync comes with enhancements and new features, many of which are based on customer requests:

  • Automatic chunking of large files: Uploading very large files (200 MB or larger) was not possible with the previous version of DataSync due to hitting publisher API file size limits. DataSync can now handle large files with ease by splitting them up and publishing them in separate “chunks” (files larger than 75 MB are automatically chunked). DataSync 0.2 has been tested with files larger than 200 MB. NOTE: file chunking is only supported by the ‘upsert’ and ‘append’ methods (not ‘replace’). Replace will support chunking in the future.
  • Use DataSync to copy a dataset or dataset blueprint (schema) quickly: DataSync now has a function allowing you to make copies of existing datasets on one domain or across two different domains. You can copy the dataset blueprint (schema) without any data, or the dataset along with all rows of data. DataSync now has a new job type called a “Port” job which supports this capability.
  • Better Error Handling: DataSync now has further improved error messages to help troubleshoot issues with publishing data.
  • Major Bug Fix: Due to a fix made in the soda-java library (version 0.9.4) publishing via 'upsert' and ‘append’ on certain datasets will no longer cause an error (UnrecognizedPropertyException).
  • Migration to Maven: Developers can now use Maven for package management (rather than importing JARs as external libraries) for the mail, soda-java, and org.json packages
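
The automatic chunking described above can be sketched as follows. The 75 MB threshold comes from the notes, while the byte-level split is a simplification: DataSync presumably splits at row boundaries so each chunk is a valid CSV:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified sketch of splitting a large upload into fixed-size chunks. */
class ChunkSketch {
    /** Splits raw file bytes into chunks of at most chunkSize bytes;
     *  the final chunk holds whatever remains. */
    static List<byte[]> chunk(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize)
            chunks.add(Arrays.copyOfRange(data, off,
                    Math.min(off + chunkSize, data.length)));
        return chunks;
    }
}
```

Each chunk would then be published in its own request, keeping every request under the publisher API's size limit.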

Want to leave a question, comment, suggestion or bug report on DataSync 0.2?
You can now submit these within the DataSync Github repository issue tracker - all you need is a free GitHub account.

Check the repository to keep up to speed with new features on the roadmap and deploy schedules for future versions.
