Import of sotorrent CSV data files fails on MacOS #7

Closed
Alfusainey opened this issue Nov 12, 2018 · 17 comments

@Alfusainey

Importing the sotorrent CSV data files using 6_load_sotorrent.sql fails with the following error message:

ERROR 1300 (HY000): Invalid utf8mb4 character string: ''.

@sbaltes: Any idea why this is happening? I am running MySQL 5.7.24 on macOS High Sierra.

@Alfusainey Alfusainey changed the title Import CSV Import of sotorrent csv data files fails Nov 12, 2018
@sbaltes sbaltes self-assigned this Nov 12, 2018
@sbaltes
Member

sbaltes commented Nov 12, 2018

Could you check for which of the tables the import is failing?

@Alfusainey
Author

I first tried the PostBlockVersion table and the import failed; then I tried the CommentUrl table and it failed too. This leads me to conclude that it will probably not work for the other tables either.

Another thing: if I run less PostBlockVersion.csv in the terminal to view the contents of the file, less warns that I am trying to open a binary file. If I open CommentUrl.csv in a spreadsheet application (i.e., Numbers), all I see are garbled characters. Could the contents of the file be the reason the import is failing?

@Alfusainey
Author

Alfusainey commented Nov 12, 2018

I should also clarify that I executed individual import statements to load data into specific tables. I did not run the entire 6_load_sotorrent.sql script at once. This way, I do not have to download all the files and then import (because I have limited space).

Essentially, I did this:

  1. Download a single .gz file
  2. Extract the file
  3. Import the file data into the database
  4. Delete the file and repeat the process from (1)

I did this for all the xml.gz files and it worked nicely. Now I want to do the same for the .csv.gz files. So far, it fails for the PostBlockVersion.csv and CommentUrl.csv files.
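
The four steps above could be sketched roughly as the shell function below. This is only a sketch under assumptions: process_archive and the import hook passed to it are hypothetical names, the download step is left to the caller because the Zenodo URLs are not given in this thread, and the import hook stands in for whichever statement from 6_load_sotorrent.sql applies to the table.

```shell
# process_archive ARCHIVE IMPORT_CMD
# ARCHIVE    : path to an already downloaded .gz file (step 1 is up to the caller)
# IMPORT_CMD : hypothetical command or function, called with the extracted file path
process_archive() {
    local archive="$1"
    local import_cmd="$2"
    local extracted="${archive%.gz}"   # CommentUrl.csv.gz -> CommentUrl.csv
    local status=0

    # Step 2: extract. -c writes to stdout, so the archive itself is untouched.
    gunzip -c "$archive" > "$extracted" || status=$?

    # Step 3: import, but only if the extraction succeeded.
    if [ "$status" -eq 0 ]; then
        "$import_cmd" "$extracted" || status=$?
    fi

    # Step 4: free disk space. The archive is kept on failure so that the
    # step can be retried without downloading the file again.
    rm -f "$extracted"
    if [ "$status" -eq 0 ]; then
        rm -f "$archive"
    fi
    return "$status"
}
```

Keeping the archive when something fails is a design choice for a setup with limited disk space and a slow download; deleting it unconditionally would match step 4 more literally.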

@sbaltes
Member

sbaltes commented Nov 12, 2018

Could you execute file -i <FILENAME>.csv to check the file encoding?

@Alfusainey
Author

Alfusainey commented Nov 12, 2018

"Could you execute file -i <FILENAME>.csv to check the file encoding?"

Output: CommentUrl.csv: regular file

I just realized that on macOS I need the -I switch instead.
If I run file -I CommentUrl.csv, I get:
CommentUrl.csv: application/x-gzip; charset=binary
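
Since the flag differs between the two systems used in this thread, a small helper can pick it automatically. A minimal sketch: mime_flag is a hypothetical name, and treating every system other than macOS like Linux is an assumption.

```shell
# mime_flag [OS_NAME]
# Prints the file(1) flag that reports MIME type and charset.
# OS_NAME defaults to the output of `uname -s`.
mime_flag() {
    local os="${1:-$(uname -s)}"
    case "$os" in
        Darwin) printf '%s\n' "-I" ;;   # macOS, as found in this thread
        *)      printf '%s\n' "-i" ;;   # Linux; assumed for any other system
    esac
}

# Usage: file "$(mime_flag)" CommentUrl.csv
```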

@Alfusainey
Author

I ran file -I <filename>.xml on one of the XML files, and it reports charset=utf-8.

@Alfusainey
Author

@sbaltes What file encoding do your CSV files have once they're compressed? It looks like the ones uploaded to Zenodo are binary files.

@sbaltes
Member

sbaltes commented Nov 13, 2018

I was able to reproduce this on macOS 10.13:

gunzip CommentUrl.csv.gz
file -I CommentUrl.csv
CommentUrl.csv: application/x-gzip; charset=binary

However, when executing the same commands on an Ubuntu 16.04 LTS system, the extraction works as expected:

gunzip CommentUrl.csv.gz
file -i CommentUrl.csv
CommentUrl.csv: text/plain; charset=us-ascii

This issue seems to be specific to macOS. I don't have time to look into this now, but as a workaround you could either use an Ubuntu system to unzip the files or set up a VirtualBox VM on your macOS system.
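
To check an extracted file the same way on both systems, without depending on file -i versus file -I, one option is to look at the first two bytes, which are 1f 8b for gzip data. A minimal sketch; is_gzip is a hypothetical helper name, not part of SOTorrent.

```shell
# is_gzip FILE
# Succeeds (exit status 0) if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    local magic
    # Read the first two bytes and print them as hex without separators.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \t\n')
    [ "$magic" = "1f8b" ]
}

# Usage:
#   if is_gzip CommentUrl.csv; then
#       echo "CommentUrl.csv still looks like gzip data" >&2
#   fi
```

Running this check before the import would catch the failing case earlier than the MySQL error does.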

@Alfusainey
Author

"This issue seems to be specific to macOS."

[Attachment: whatsapp image 2018-11-13 at 18 13 50]
Hmm, I don't think so. I just tested with Ubuntu 16.04 LTS and still get CommentUrl.csv: application/x-gzip; charset=binary (see attachment).

Did you download the file from Zenodo?

@sbaltes
Member

sbaltes commented Nov 13, 2018

I downloaded the file from Zenodo on both systems (macOS and Ubuntu).

@sbaltes
Member

sbaltes commented Nov 13, 2018

I'm currently traveling, but I will take a closer look at this next week.

@Alfusainey
Author

Thanks!

Three different people have confirmed to me that it works on Mint and Ubuntu 16.04. I can also confirm that it works on Debian 9.6 (stretch). It seems the problem is with macOS.

@Alfusainey Alfusainey changed the title Import of sotorrent csv data files fails Import of sotorrent CSV data files fails on MacOS Nov 14, 2018
@sbaltes
Member

sbaltes commented Nov 19, 2018

I just installed gzip 1.9 using Homebrew, but the same error occurs when using /usr/local/bin/gunzip (as opposed to /usr/bin/gunzip).

@sbaltes
Member

sbaltes commented Nov 19, 2018

Interesting observation:

gunzip man page Ubuntu 16.04:

gunzip can currently decompress files created by gzip, zip, compress, compress -H or pack. The detection of the input format is automatic.

gunzip man page macOS 10.13:

This version of gzip is also capable of decompressing files compressed using compress(1), bzip2(1), or xz(1).

@sbaltes
Member

sbaltes commented Nov 19, 2018

This works for me on macOS 10.13:

/System/Library/CoreServices/Applications/Archive\ Utility.app/Contents/MacOS/Archive\ Utility CommentUrl.csv.gz
mv CommentUrl CommentUrl.csv
file -I CommentUrl.csv
CommentUrl.csv: text/plain; charset=us-ascii

Could you try this on your system?

@Alfusainey
Author

@sbaltes: Great, yes, this works for me on macOS 10.13.6. Thanks for investigating this.

@sbaltes
Member

sbaltes commented Nov 21, 2018

I will update the README file in the next SOTorrent release and I also added a remark to the SOTorrent project page.

@sbaltes sbaltes closed this as completed Nov 21, 2018