8-bit bytestrings vs. Unicode strings in Python: fix this once and for all #62

Open
haleagar opened this issue Jun 10, 2011 · 33 comments

@haleagar

I've seen issue #25 and gotten the most recent version with the related update, but I'm still getting similar errors, now in arbitrator.py.
I believe this is one of the problem files: WoodyHallæ°åç_0.jpg
Python 2.6.5
Ubuntu 10.04.1 LTS

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "arbitrator.py", line 289, in run
self.__process_db_queue()
File "arbitrator.py", line 634, in __process_db_queue
self.dbcur.execute("SELECT COUNT(*) FROM synced_files WHERE input_file=? AND server=?", (input_file, server))
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

@haleagar
Author

So I tried a simple fix of adding

    input_file = input_file.decode('utf-8')
    output_file = output_file.decode('utf-8')
    transported_file = transported_file.decode('utf-8')

to arbitrator.py at line 623 (it should probably be done earlier, though).
That "fixed" the above error, but not quite my problem, because I'm using Rackspace Cloud Files.

Now I get the errors below, which probably means the decoding should indeed happen earlier, or maybe we have to convert back to non-Unicode strings altogether to use those processors and transporters.

2011-06-10 17:08:29,332 - Arbitrator.Transporter - ERROR - The transporter 'mosso' has failed while transporting the file '/tmp/daemon/var/www/webroot/sites/default/files/imagefield_thumbs/featured_img/WoodyHallæ°åç_0_1306282603.jpg' (action: 1). Error: 'u'\u6c0f''.
2011-06-10 17:08:29,534 - Arbitrator.ProcessorChain - ERROR - The processsor 'link_updater.CSSURLUpdater' has failed while processing the file '/var/www/webroot/sites/all/modules/jquery_ui/jquery.ui/tests/unit/testsuite.css'. Exception class: <class 'xml.dom.SyntaxErr'>. Message: CSSValue: No match: ('CHAR', u':', 62, 16).

and Googled my way to this update:
https://github.com/rackspace/python-cloudfiles/pull/29
so I grabbed that version of python-cloudfiles, but the errors above still persist. I've fallen out of my depth, but I hope that's helpful.

@haleagar
Author

One more error is cropping up now as well; then I'll be quiet.

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "arbitrator.py", line 289, in run
self.__process_db_queue()
File "arbitrator.py", line 709, in __process_db_queue
self.remaining_transporters[key].remove(server)
KeyError: u"/var/www/webroot/sites/default/files/imagecache/detail_featured/featured_img/IT\u6295\u8cc72011.jpg2{'filter': <filter.Filter object at 0x1b1e210>, 'source': 'drupal', 'destinations': {'cloudfiles': {'path': 'static'}}, 'processorChain': ['yui_compressor.YUICompressor', 'link_updater.CSSURLUpdater', 'unique_filename.Mtime'], 'label': 'CSS, JS, images and Flash'}"

@EricB1021

Thank you for the fix above. However, after those three lines of code fixed the original problem, File Conveyor runs for a little while and then I get this output. I am not very familiar with Python, but any help would be appreciated. Thanks.

2011-06-20 15:32:50,795 - Arbitrator - WARNING - Arbitrator is initializing.
2011-06-20 15:32:50,797 - Arbitrator - WARNING - Loaded config file.
2011-06-20 15:32:51,103 - Arbitrator - WARNING - Created 'ftp' transporter for the 'ftp push cdn' server.
2011-06-20 15:32:51,103 - Arbitrator - WARNING - Server connection tests succesful!
2011-06-20 15:32:51,104 - Arbitrator - WARNING - Setup: created transporter pool for the 'ftp push cdn' server.
2011-06-20 15:32:51,112 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 15055 items.
2011-06-20 15:32:51,113 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 0 items.
2011-06-20 15:32:51,113 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
2011-06-20 15:32:51,114 - Arbitrator - WARNING - Setup: moved 0 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
2011-06-20 15:32:51,114 - Arbitrator - WARNING - Setup: connected to the synced files DB. Contains metadata for 0 previously synced files.
2011-06-20 15:32:51,151 - Arbitrator - WARNING - Setup: initialized FSMonitor.
2011-06-20 15:32:51,151 - Arbitrator - WARNING - Fully up and running now.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 107, in run
self.__process_queues()
File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 132, in __process_queues
self.__add_dir(path, event_mask)
File "/etc/file_conveyor/code/fsmonitor_inotify.py", line 58, in __add_dir
FSMonitor.generate_missed_events(self, path, event_mask)
File "/etc/file_conveyor/code/fsmonitor.py", line 121, in generate_missed_events
for event_path, result in self.pathscanner.scan_tree(path):
File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/etc/file_conveyor/code/pathscanner.py", line 240, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/etc/file_conveyor/code/pathscanner.py", line 227, in scan_tree
result = self.scan(path)
File "/etc/file_conveyor/code/pathscanner.py", line 191, in scan
for path, filename, mtime, is_dir in self.__listdir(path):
File "/etc/file_conveyor/code/pathscanner.py", line 76, in __listdir
path_to_file = os.path.join(path, filename)
File "/usr/lib/python2.6/posixpath.py", line 70, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 13: ordinal not in range(128)

@andriy-gerasika

IMHO the true fix should involve dbcon.text_factory = str, not str.decode('utf-8').

http://www.gerixsoft.com/sites/gerixsoft.com/files/fileconveyor-utf8.patch fixed the problem for me.

@wimleers
Owner

wimleers commented Aug 1, 2011

I uploaded the patch you linked to on gist.github.com, in case your site goes offline: https://gist.github.com/1118004.

@wimleers
Owner

wimleers commented Aug 1, 2011

The patch posted by andriy-gerasika is definitely interesting. Look at the documentation of pysqlite.Connection.text_factory: http://pysqlite.googlecode.com/svn/doc/sqlite3.html#sqlite3.Connection.text_factory.

However, the log output provided at #25 suggests this is not the right way to solve the problem:
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Clearly, this implies that we should be using proper Unicode strings in Python, whatever that may mean in practice, because that's absolutely not clear. (If anything is messy in Python, it's Unicode strings.) I guess a good starting point is http://docs.python.org/howto/unicode.html.
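
To make the trade-off concrete, here is a minimal standalone sketch (Python 2, not File Conveyor code) that reproduces the ProgrammingError quoted above and shows both ways around it: the text_factory = str workaround versus passing real Unicode objects:

    # Sketch only: reproduces the sqlite3 ProgrammingError and the two workarounds.
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE synced_files (input_file TEXT)')

    non_ascii_bytes = 'Woody\xe6\xb0\x8f.jpg'  # a UTF-8 encoded byte string

    # 1. Binding an 8-bit byte string with non-ASCII bytes raises ProgrammingError.
    try:
        conn.execute('INSERT INTO synced_files VALUES (?)', (non_ascii_bytes,))
    except sqlite3.ProgrammingError, e:
        print 'ProgrammingError:', e

    # 2a. The workaround suggested above: accept byte strings via text_factory = str.
    conn.text_factory = str
    conn.execute('INSERT INTO synced_files VALUES (?)', (non_ascii_bytes,))

    # 2b. What the error message recommends: use Unicode objects everywhere.
    conn.text_factory = unicode  # the default, set explicitly for clarity
    conn.execute('INSERT INTO synced_files VALUES (?)',
                 (non_ascii_bytes.decode('utf-8'),))

    print conn.execute('SELECT COUNT(*) FROM synced_files').fetchone()[0]  # 2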

@jacobSingh
Contributor

This one just won't die, huh? I did a bunch of research on this before, but I really don't remember at which point we should be intercepting this. If I remember correctly, we should be handling it at the point where we harvest the path names, before they go into the DB.

I think I tried dbcon.text_factory = str and it didn't work, IIRC. I can't remember why, but I think it had something to do with the point at which it was trying to massage the text. Sorry I can't be more help; I'm kinda foggy on this one.

@unn

unn commented Aug 1, 2011

That's what we'd tried, Jacob. I don't specifically remember if we implemented dbcon.text_factory = str everywhere it was needed, but I do remember we tried it and were still seeing stack traces similar to the one in #62 (comment).

@wimleers
Owner

wimleers commented Aug 1, 2011

@jacobSingh: Thanks for weighing in! :)
@unn: Thanks for also reporting back on this.

It's clear that additional attention will be needed to solve this once and for all, and no clear solution is available at the moment.

@unn

unn commented Aug 1, 2011

I'll email you some of the files that were causing the issues we were seeing.

@wimleers
Owner

wimleers commented Aug 3, 2011

I've read through Python's entire Unicode HOWTO (which seems to be authoritative). Especially the Unicode filenames section is interesting. The os.listdir() trick mentioned there just might be all that we need…
(This will make the changes introduced in 12f2ddf for #25 obsolete; these changes will be undone.)
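
For reference, the os.listdir() behavior that makes this work in Python 2: passing a unicode path makes it return unicode file names.

    import os
    print os.listdir('/tmp')[:3]    # byte-string path in, byte-string names out
    print os.listdir(u'/tmp')[:3]   # unicode path in, unicode names out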

Combined with sqlite3.Connection.text_factory = unicode (which is the default, but setting it explicitly should prevent any confusion/assumptions/different defaults in a specific SQLite build).

There's also this daunting post on Stack Overflow, which was also fairly informational.

From effbot.org's "Python Unicode Objects", I discovered the need to use re.UNICODE when using regular expressions.
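
A trivial example (not File Conveyor code) of the difference re.UNICODE makes when matching non-ASCII file names:

    # Without re.UNICODE, \w matches ASCII word characters only (Python 2).
    import re

    name = u'd\xe9part.jpg'                       # u"départ.jpg"
    print re.findall(r'\w+', name)                # [u'd', u'part', u'jpg']
    print re.findall(r'\w+', name, re.UNICODE)    # [u'd\xe9part', u'jpg']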

Further, I changed the Config module: I added Config.__ensure_unicode() and used it for all values parsed from the config.xml file through xml.etree.ElementTree, to ensure that all strings taken from the config file are Python Unicode strings (i.e. u'string' instead of 'string'). This was necessary because xml.etree.ElementTree tries to optimize memory consumption by only storing values as Python Unicode strings (u'string') when it's impossible to represent them as regular strings (in the system's default encoding). By calling Config.__ensure_unicode() on every string, we can be sure that all strings are Unicode.

After all, the initial call to os.listdir() will happen with a parameter read directly from the config file. If that string is a Unicode string, then os.listdir() will return Unicode strings, which implies that future calls to os.listdir() (to traverse the directory tree) will also be called with a Unicode string as a parameter. Hence these changes to the Config module were necessary.
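
The gist of that normalization, as a sketch of the idea only (this is not the actual Config.__ensure_unicode() implementation, and the element and attribute names below are made up):

    # Sketch (Python 2): ElementTree hands back plain str objects for values it
    # can represent in the default encoding, so normalize everything to unicode.
    from xml.etree import ElementTree

    def ensure_unicode(value, encoding='utf-8'):
        if isinstance(value, str):
            return value.decode(encoding)
        return value

    root = ElementTree.fromstring('<source name="drupal" scanPath="/var/www" />')
    scan_path = ensure_unicode(root.get('scanPath'))
    print repr(scan_path)   # u'/var/www': guaranteed unicode, even for ASCII input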

Next, I had to ensure all strings received through FSMonitor (i.e. coming from inotify and FSEvents) were in Unicode. On OS X, this is easy, since the file system always uses UTF-8. On Linux, many different encodings are possible, hence we use sys.getfilesystemencoding() to make sure we decode from the right one.
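
The boundary decoding described here boils down to something like this sketch, assuming the byte strings coming out of inotify are encoded in the file system encoding:

    # Sketch (Python 2): paths reported by inotify are byte strings in the file
    # system's encoding; decode them once, at the boundary, so that only unicode
    # objects circulate inside the daemon.
    import sys

    FS_ENCODING = sys.getfilesystemencoding()

    def decode_fs_path(path):
        if isinstance(path, str):
            return path.decode(FS_ENCODING)
        return path

    print repr(decode_fs_path('/var/www/files/d\xc3\xa9part.jpg'))
    # u'/var/www/files/d\xe9part.jpg' when the file system encoding is UTF-8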

Because we're now using Unicode strings everywhere in File Conveyor (as it should be), we'll need to encode it to byte strings to be able to use certain functions (like this: u'unicode string'.encode('utf-8')), such as hashlib.md5() in PersistentQueue.__hash_key().
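
For example (a sketch mirroring the PersistentQueue.__hash_key() situation):

    # Sketch (Python 2): hashlib needs bytes, so unicode keys must be encoded
    # explicitly; hashlib.md5(key) would trigger an implicit ASCII encode and
    # fail on non-ASCII characters.
    import hashlib

    key = u'/var/www/files/IT\u6295\u8cc72011.jpg'
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    print digest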

Finally, I was having problems with PersistentQueue and PersistentList: both of these cPickle.dumps() arbitrary Python data and then store it in a SQLite DB for persistence. I was already loading data correctly using sqlite3.register_converter("pickle", cPickle.loads), but apparently it should be stored in a SQLite BLOB column, not a TEXT column (source). Then, it needs to be inserted with special care, using sqlite3.Binary(). This function is documented nowhere, not even on the official sqlite3 documentation page on docs.python.org!
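
A standalone sketch of that pattern (not the actual PersistentQueue code):

    # Sketch (Python 2): store cPickle'd items in a BLOB column via
    # sqlite3.Binary(), and register a converter so they come back unpickled.
    import cPickle
    import sqlite3

    sqlite3.register_converter("pickle", cPickle.loads)
    conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
    conn.execute('CREATE TABLE queue (id INTEGER PRIMARY KEY, item pickle)')

    item = (u'/var/www/files/d\xe9part.jpg', 'CREATED')
    conn.execute('INSERT INTO queue (item) VALUES (?)',
                 (sqlite3.Binary(cPickle.dumps(item, cPickle.HIGHEST_PROTOCOL)),))

    print conn.execute('SELECT item FROM queue').fetchone()[0]
    # (u'/var/www/files/d\xe9part.jpg', 'CREATED')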

This covers Unicode issues 99% of the way, but there's still the potential problem of not knowing the encoding of the file system of the destination — for that, I just created #75.

Phew. That was not easy! I hope I didn't forget to mention anything.

P.S.: some more interesting functions:

  • sys.getdefaultencoding()
  • sys.setdefaultencoding()

@patrickfournier

I still get errors (using release d1c55b8):

/usr/lib/python2.6/urllib.py:1222: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
res = map(safe_map.__getitem__, s)
2011-08-27 02:27:05,165 - Arbitrator.Transporter - ERROR - The transporter 'S3' has failed while transporting the file '/var/www/sites/curiosae/files/RB 47 - Départ pour la pêche des huîtres un jour de grande marée.jpg' (action: 1). Error: 'u'\xe9''.

and later (maybe for a different file, I am not sure):

2011-08-27 02:27:17,199 - Arbitrator - ERROR - Unhandled exception of type 'ProgrammingError' detected, arguments: '('You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.',)'.
Traceback (most recent call last):
File "arbitrator.py", line 301, in run
self.process_retry_queue()
File "arbitrator.py", line 778, in __process_retry_queue
if (input_file, event) not in self.failed_files and (input_file, event) not in self.pipeline_queue:
File "/home/ubuntu/wimleers-fileconveyor-d1c55b8/code/persistent_queue.py", line 77, in __contains

return self.dbcur.execute("SELECT COUNT(item) FROM %s WHERE item=?" % (self.table), (cPickle.dumps(item), )).fetchone()[0]
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

@wimleers
Owner

wimleers commented Sep 7, 2011

Did you start from scratch with File Conveyor or did you upgrade from a previous File Conveyor installation? In the latter case, did you run the upgrade script?

@ykyuen

ykyuen commented Sep 16, 2011

I am using release d1c55b8. I want to use File Conveyor to sync my Drupal site to Rackspace Cloud Files.

The server is Ubuntu 10.04 with Python 2.6.5. I got the following error, which is the same as EricB1021's:


Exception in thread FSMonitorThread:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 122, in run
self.__process_queues()
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 155, in __process_queues
self.__add_dir(path, event_mask)
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 91, in __add_dir
FSMonitor.generate_missed_events(self, path)
File "/home/halo/fileconveyor/code/fsmonitor.py", line 128, in generate_missed_events
for event_path, result in self.pathscanner.scan_tree(path):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 226, in scan_tree
result = self.scan(path)
File "/home/halo/fileconveyor/code/pathscanner.py", line 190, in scan
for path, filename, mtime, is_dir in self.__listdir(path):
File "/home/halo/fileconveyor/code/pathscanner.py", line 77, in __listdir
path_to_file = os.path.join(path, filename)
File "/usr/lib/python2.6/posixpath.py", line 70, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)


Then I tried with Python 2.5, but no luck:


2011-09-16 04:06:58,628 - Arbitrator - WARNING - Fully up and running now.
Exception in thread FSMonitorThread:
Traceback (most recent call last):
File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
self.run()
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 122, in run
self.__process_queues()
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 155, in __process_queues
self.__add_dir(path, event_mask)
File "/home/halo/fileconveyor/code/fsmonitor_inotify.py", line 91, in __add_dir
FSMonitor.generate_missed_events(self, path)
File "/home/halo/fileconveyor/code/fsmonitor.py", line 128, in generate_missed_events
for event_path, result in self.pathscanner.scan_tree(path):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 239, in scan_tree
for subpath, subresult in self.scan_tree(os.path.join(path, filename)):
File "/home/halo/fileconveyor/code/pathscanner.py", line 226, in scan_tree
result = self.scan(path)
File "/home/halo/fileconveyor/code/pathscanner.py", line 190, in scan
for path, filename, mtime, is_dir in self.__listdir(path):
File "/home/halo/fileconveyor/code/pathscanner.py", line 77, in __listdir
path_to_file = os.path.join(path, filename)
File "/usr/lib/python2.5/posixpath.py", line 65, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)


Then I replaced dependencies/cloudfiles with the latest Rackspace python-cloudfiles (https://github.com/rackspace/python-cloudfiles) as suggested at #75. Still no luck.

I checked the following in the Python console:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'ANSI_X3.4-1968'

Any way to solve the problem?
Thanks.

@j0rd

j0rd commented Sep 25, 2011

Same problem using Amazon CloudFront.

@ykyuen

ykyuen commented Sep 26, 2011

I found that the problem is related to the uploaded file names. For example, if I have some CCK images with spaces in the file names, those files cannot be synced, and I notice that the spaces are converted into %20 as shown in the Drupal CDN module statistics.

Anyway, I tried to solve the problem by converting the strings to UTF-8. Everything seems to work fine, but there are still some errors in daemon.log, so the issue is probably not completely resolved in the right way.

I forked the code and committed my changes. See if it works for you.
https://github.com/ykyuen/fileconveyor

@checkerap

Thanks ykyuen,

Your fork worked perfectly for me.

@wimleers
Owner

The (hopefully) last Unicode problem has been fixed at #90.

@ykyuen

ykyuen commented Jan 20, 2012

thanks @wimleers =)

@wimleers
Owner

@ykyuen Please let me know if you have any more suggestions or bugs :)

@woutrbe

woutrbe commented Nov 12, 2012

I'm actually still getting the same error

Exception in thread FSMonitorThread:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
self.__process_queues()
File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in __process_queues
self.__add_dir(path, event_mask)
File "/src/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in __add_dir
wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
File "/usr/local/lib/python2.6/dist-packages/pyinotify.py", line 1853, in add_watch
for rpath in self.__walk_rec(apath, rec):
File "/usr/local/lib/python2.6/dist-packages/pyinotify.py", line 2041, in __walk_rec
for root, dirs, files in os.walk(top):
File "/usr/lib/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib/python2.6/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib/python2.6/posixpath.py", line 70, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

@wimleers
Owner

@woutrbe Which version of File Conveyor? Which particular file name triggers this error?

@wimleers wimleers reopened this Nov 16, 2012
@woutrbe

woutrbe commented Nov 20, 2012

Sorry about the late reply. I'm using the latest version here on GitHub, installed with

pip install -e git+https://github.com/wimleers/fileconveyor@master#egg=fileconveyor

I tried outputting the file name, but it seems it doesn't even get to that point.
It seems that the error is quite similar to what others have experienced.

@wimleers
Owner

Can you enable DEBUG logging?

@woutrbe

woutrbe commented Nov 23, 2012

It's not giving that much more information when I enable DEBUG logging for both CONSOLE_LOGGER_LEVEL and FILE_LOGGER_LEVEL (http://pastebin.com/zmCiQ0RN).

I've printed the path in pathscanner.py, but that doesn't seem to be outputting anything.

@wimleers
Owner

Wow. The error you're getting doesn't occur in File Conveyor; it occurs in Python's internals!

  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

I'm afraid there's not much I can do there then. Some googling led me to these things:

import locale
locale.setlocale( locale.LC_ALL, 'C.UTF-8' )

fails


As per the latter link, I'm convinced this is the solution:

$ git d
 fileconveyor/fsmonitor_inotify.py |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fileconveyor/fsmonitor_inotify.py b/fileconveyor/fsmonitor_inotify.py
index 04a46ed..938c3f5 100644
--- a/fileconveyor/fsmonitor_inotify.py
+++ b/fileconveyor/fsmonitor_inotify.py
@@ -28,6 +28,11 @@ class FSMonitorInotify(FSMonitor):
     """inotify support for FSMonitor"""


+    # On Linux, you can choose which encoding is used for your file system's
+    # file names. Hence, whenever we interact with pyinotify, we must ensure
+    # that the paths we pass it are encoded in the file system's encoding.
+    encoding = sys.getfilesystemencoding()
+
     EVENTMAPPING = {
         FSMonitor.CREATED             : pyinotify.IN_CREATE,
         FSMonitor.MODIFIED            : pyinotify.IN_MODIFY | pyinotify.IN_ATTRIB,
@@ -71,7 +76,7 @@ class FSMonitorInotify(FSMonitor):
         # Immediately start monitoring this directory.
         event_mask_inotify = self.__fsmonitor_event_to_inotify_event(event_mask)
         try:
-            wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
+            wdd = self.wm.add_watch(path.encode(cls.encoding), event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
         except WatchManagerError, e:
             raise FSMonitorError, "Could not monitor '%s', reason: %s" % (path, e)
         # Verify that inotify is able to monitor this directory and all of its
@@ -79,7 +84,7 @@ class FSMonitorInotify(FSMonitor):
         for monitored_path in wdd:
             if wdd[monitored_path] < 0:
                 code = wdd[monitored_path]
-                raise FSMonitorError, "Could not monitor %s (%d)" % (monitored_path, code)
+                raise FSMonitorError, "Could not monitor %s (%d)" % (monitored_path.decode(cls.encoding), code)
         self.monitored_paths[path] = MonitoredPath(path, event_mask, wdd)
         self.monitored_paths[path].monitoring = True

@@ -100,7 +105,7 @@ class FSMonitorInotify(FSMonitor):
     def __remove_dir(self, path):
         """override of FSMonitor.__remove_dir()"""
         if path in self.monitored_paths.keys():
-            self.wm.rm_watch(path, rec=True, quiet=True)
+            self.wm.rm_watch(path.encode(cls.encoding), rec=True, quiet=True)
             del self.monitored_paths[path]

Could you please try that?


If that doesn't work, can you do this on your system and report back your output (mine is inline):

python2.5
>>> import locale
>>> locale.getlocale()
(None, None)
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'
>>> sys.getdefaultencoding()
'ascii'

In this case I think that the solution might be to do this in FSMonitor.__init__(), and possibly also in Arbitrator.__init__():

sys.setdefaultencoding('utf-8')

@MaffooBristol

Hi, I know this is an old post, but I'm still having this issue... I followed your steps in the post above but it seems to just throw up this error:

Exception in thread ArbitratorThread:
Traceback (most recent call last):
File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "arbitrator.py", line 286, in run
self.__setup()
File "arbitrator.py", line 271, in __setup
self.fsmonitor = fsmonitor_class(self.fsmonitor_callback, True, True, self.config.ignored_dirs.split(":"), "fsmonitor.db", "Arbitrator")
File "/opt/fileconveyor/fileconveyor/code/fsmonitor_inotify.py", line 43, in __init__
sys.setdefaultencoding('utf-8')
AttributeError: 'module' object has no attribute 'setdefaultencoding'

@wimleers
Owner

A quick googling reveals that it's essentially evil to call sys.setdefaultencoding(), so on some systems/builds, that function has been removed. What a mess, Python!
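
For context: site.py removes sys.setdefaultencoding() from the sys module at interpreter startup, which is exactly why the AttributeError above appears. The workaround that circulates (shown here purely to explain the error, not as a recommendation) is:

    # Widely discouraged hack: reload(sys) re-exposes the function that site.py
    # deleted at startup, then the process-wide default encoding is changed.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')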

@lfourcade

Hi @wimleers ,

Thank you very much for this fantastic tool. Unfortunately, I still have a problem after applying your changes to fsmonitor_inotify.py.

Here's my output, hope you can help. Thank you.

/var/fileconveyor/fileconveyor/filter.py:10: DeprecationWarning: the sets module is deprecated
from sets import Set, ImmutableSet
2013-07-31 16:24:48,836 - Arbitrator - WARNING - File Conveyor is initializing.
2013-07-31 16:24:48,836 - Arbitrator - INFO - Loading config file.
2013-07-31 16:24:48,838 - Arbitrator.Config - INFO - Parsing sources.
2013-07-31 16:24:48,839 - Arbitrator.Config - INFO - Parsing servers.
2013-07-31 16:24:48,839 - Arbitrator.Config - INFO - Parsing rules.
2013-07-31 16:24:48,840 - Arbitrator - WARNING - Loaded config file.
2013-07-31 10:24:49,727 - Arbitrator - WARNING - Created 'cumulus' transporter for the 'rackspace' server.
2013-07-31 10:24:49,727 - Arbitrator - WARNING - Server connection tests succesful!
2013-07-31 10:24:49,728 - Arbitrator - WARNING - Setup: created transporter pool for the 'rackspace' server.
2013-07-31 10:24:49,729 - Arbitrator - INFO - Setup: collected all metadata for rule 'cdn' (source: 'rackspace').
2013-07-31 10:24:49,730 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 0 items.
2013-07-31 10:24:49,731 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 0 items.
2013-07-31 10:24:49,731 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
2013-07-31 10:24:49,732 - Arbitrator - WARNING - Setup: initialized 'files_to_delete' persistent list, contains 0 items.
2013-07-31 10:24:49,733 - Arbitrator - WARNING - Setup: moved 0 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
2013-07-31 10:24:49,733 - Arbitrator - WARNING - Setup: connected to the synced files DB. Contains metadata for 0 previously synced files.
2013-07-31 10:24:49,913 - Arbitrator.FSMonitor - INFO - FSMonitor class used: FSMonitorInotify.
2013-07-31 10:24:49,914 - Arbitrator - WARNING - Setup: initialized FSMonitor.
2013-07-31 10:24:49,914 - Arbitrator - INFO - Setup: monitoring '/var/www/vhosts/packshot-creator.com/httpdocs/' (rackspace).
2013-07-31 10:24:49,915 - Arbitrator - INFO - Cleaned up the working directory '/tmp/fileconveyor'.
2013-07-31 10:24:49,915 - Arbitrator - WARNING - Fully up and running now.
Exception in thread FSMonitorThread:
Traceback (most recent call last):
File "/usr/local/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
self.__process_queues()
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in __process_queues
self.__add_dir(path, event_mask)
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in __add_dir
wdd = self.wm.add_watch(path.encode(cls.encoding), event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
NameError: global name 'cls' is not defined

@wimleers
Owner

wimleers commented Sep 3, 2013

d'oh, the mention of cls on line 74 of fsmonitor_inotify.py should be replaced by FSMonitorInotify. Then all should be well. Small mistake :(
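
In other words, line 74 would become (with the same substitution for the other two cls.encoding references in the diff above):

    wdd = self.wm.add_watch(path.encode(FSMonitorInotify.encoding), event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)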

Can you try that?

@Trozz

Trozz commented Sep 6, 2013

FYI

Exception in thread FSMonitorThread:
Traceback (most recent call last):
File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 122, in run
self.__process_queues()
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 155, in __process_queues
self.__add_dir(path, event_mask)
File "/var/fileconveyor/fileconveyor/fsmonitor_inotify.py", line 74, in __add_dir
wdd = self.wm.add_watch(path, event_mask_inotify, proc_fun=self.process_event, rec=True, auto_add=True, quiet=False)
File "/usr/lib/python2.6/site-packages/pyinotify.py", line 1742, in add_watch
for rpath in self.__walk_rec(apath, rec):
File "/usr/lib/python2.6/site-packages/pyinotify.py", line 1929, in __walk_rec
for root, dirs, files in os.walk(top):
File "/usr/lib64/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib64/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib64/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib64/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib64/python2.6/os.py", line 294, in walk
for x in walk(path, topdown, onerror, followlinks):
File "/usr/lib64/python2.6/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib64/python2.6/posixpath.py", line 70, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 2: ordinal not in range(128)

@wimleers
Owner

Sigh :(

I won't have time any time soon to dive deeper into this. Sorry.

@insparrow

I haven't been able to resolve the UnicodeDecodeError issue. However, a few grey hairs later I was able to come up with a good enough workaround (for my use case) which has enabled me to use File Conveyor with Rackspace Cloud Files.

For my server setup (Ubuntu 10.04, Python 2.6.5 and latest File Conveyor), I encountered two UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 2: ordinal not in range(128) issues.

First issue: The daemon would throw an exception before it attempted to transfer any files. If you encounter the same problem, check your server's locale settings with the locale command. I wish I had checked this first! When I ran locale, the following errors were reported:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

Adding export LC_ALL="en_US.UTF-8" to the server's .bashrc file resolved the issues reported by locale and File Conveyor would then (in part) work.

Second issue: The daemon would throw an exception when it attempted to transfer a file name containing special characters. I tried in vain to fix this (until today I hadn't written a single line of Python code).

Creating a solution that worked with Rackspace Cloud Files was non-negotiable, so I set out to create an acceptable workaround for a Drupal 7 site that has over 50 GB of images. I patched arbitrator.py to skip files that I knew would cause the daemon to throw an exception:

diff --git a/fileconveyor/arbitrator.py b/fileconveyor/arbitrator.py
index 394b4b4..6fa3e5a 100644
--- a/fileconveyor/arbitrator.py
+++ b/fileconveyor/arbitrator.py
@@ -347,6 +347,7 @@ class Arbitrator(threading.Thread):
         while self.discover_queue.qsize() > 0:


             # Discover queue -> pipeline queue.
             (input_file, event) = self.discover_queue.get()
             item = self.pipeline_queue.get_item_for_key(key=input_file)
             # If the file does not yet exist in the pipeline queue, put() it.
@@ -400,6 +401,16 @@ class Arbitrator(threading.Thread):
             (input_file, event) = self.filter_queue.get()
             self.lock.release()


+            # Skip filenames which we know will not work with File Conveyor or the CDN module.
+            import re
+            path, filename = os.path.split(input_file)
+            regexp = re.compile(r'^[a-zA-Z0-9_ .-]+$')
+            if regexp.search(filename) is None:
+                import codecs
+                output_file = codecs.open('skipped_files.txt', 'a', 'utf8')
+                output_file.write(input_file + '\n')
+                continue
+
             # The file may have already been deleted, e.g. when the file was
             # moved from the pipeline list into the pipeline queue after the
             # application was interrupted. When that's the case, drop the

With this patch in, the daemon doesn't throw an exception and will transfer the files that it is able to transfer. The patch also logs the problematic files to skipped_files.txt, which gives me / the client a list to fix. For a Drupal 7 site, the Transliteration module gives you the option to bulk rename files; I have installed this on the site I'm working with to take care of new files. The bulk rename function doesn't work for me, but I believe that's an isolated issue related to the images stored in complex parent and child field collections.
