Problem converting cyrillic .doc OR .odt file to .txt #73

Closed
Yuseinov opened this issue Aug 7, 2012 · 7 comments

@Yuseinov

Yuseinov commented Aug 7, 2012

OS: Linux Ubuntu 10.04
unoconv version: 0.3-6

When converting a file with Cyrillic text to txt format, the output has a broken encoding.
Input text:

Здравейте. Как сте?
This is how we do...

Output text:

?????????. ??? ????
This is how we do...

Converting from .odt or .doc to .pdf works fine, but when converting to .txt the Cyrillic text is replaced by question marks.

@dagwieers dagwieers reopened this Aug 7, 2012
@ghost ghost assigned dagwieers Sep 1, 2012
@dagwieers
Member

Could you try again with the latest version from the master branch? I have updated the manual page to reflect the import and export filter options. You can now use -i FilterOptions=,,76 and -e FilterOptions=,,76 to enforce a certain encoding during the import and export phases.

Look at the following link for more information: http://wiki.services.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options

Assuming that your ODT or DOC is in Unicode, you can use:

unoconv -f txt -e FilterOptions=,,76 /path/to/file.odt

This converts to a UTF-8 (76) text file. If you would prefer another encoding for text files (because your editor or your system expects it), use one of the following:

  • ISO-8859-5 (Cyrillic) use 16
  • DOS/OS2-855 (Cyrillic) use 26
  • Windows-1251 (Cyrillic) use 34
  • Apple Macintosh (Cyrillic) use 44
  • Apple Macintosh/Ukrainian (Cyrillic) use 55
  • KOI8-U (Cyrillic) use 88
  • PT 154 (Windows Cyrillic Asian codepage developed in ParaType) use 93

More options are in the above referenced URL. Too many options, I know ;-)
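
For scripted conversions, the codes listed above can be kept in a small lookup so the numbers do not have to be remembered. This is only a sketch: the function names `encoding_code` and `unoconv_txt_command` and the short encoding labels are made up for illustration and are not part of unoconv, the codes are copied from the list in this comment, and the command is printed rather than run.

```shell
#!/bin/sh
# Hypothetical helper (not part of unoconv): map an encoding label to the
# FilterOptions code quoted in this thread.
encoding_code() {
    case "$1" in
        UTF-8)          echo 76 ;;
        ISO-8859-5)     echo 16 ;;
        DOS-855)        echo 26 ;;
        Windows-1251)   echo 34 ;;
        Mac-Cyrillic)   echo 44 ;;
        Mac-Ukrainian)  echo 55 ;;
        KOI8-U)         echo 88 ;;
        PT154)          echo 93 ;;
        *)  echo "unknown encoding: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the text-export command for an encoding and input file.
# Paths containing spaces would need quoting before the line is executed.
unoconv_txt_command() {
    code=$(encoding_code "$1") || return 1
    printf 'unoconv -f txt -e FilterOptions=,,%s %s\n' "$code" "$2"
}

unoconv_txt_command UTF-8 /path/to/file.odt
```

Calling `unoconv_txt_command Windows-1251 /path/to/file.odt` would print the same command with 34 in place of 76.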

@dagwieers
Member

@Yuseinov Any feedback on this?

@Yuseinov
Author

Sorry for the late response. Problem is fixed. Thanks.

@dagwieers
Member

Great to hear! In my view, UTF-8 should be the default for export whenever possible. Import encoding is of course very specific to the original document. I would expect LibreOffice import filters to auto-detect it when that's possible (depends on the source format).

Let us know if there's anything we can improve, especially for (to me) foreign encodings.
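
A quick way to check an exported text file is to test that it decodes as UTF-8 and does not show the question-mark symptom from the original report. The following is a rough sketch using the standard `iconv` and `grep` tools; the helper names `is_valid_utf8` and `has_question_mark_run` are invented here, and `/path/to/file.txt` is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed if the file contains three or more literal
# question marks in a row, the symptom reported at the top of this issue.
has_question_mark_run() {
    grep -qF '???' "$1"
}

out=/path/to/file.txt   # placeholder path
if is_valid_utf8 "$out" && ! has_question_mark_run "$out"; then
    echo "output looks like intact UTF-8"
fi
```

The question-mark check is only a heuristic, since a document may legitimately contain "???".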

@Yuseinov
Author

I have a new problem. I don't know whether this is an unoconv issue or not, which is why I'm writing it here.
The problem:
On my Ubuntu 10.04 64-bit server I can't start unoconv from Apache/PHP (as the www-data user).
When I start it as admin (root user), the conversion works correctly. But when I start unoconv as the www-data user, the program crashes with this error:

creation of executable memory area failed: Permission denied
unoconv: UnoException during conversion:
The provided document cannot be converted to the desired format.
Leaking python objects bridged to UNO for reason pyuno runtime is not initialized, (the pyuno.bootstrap needs to be called before using any uno classes)

If I start unoconv with sudo as the www-data user, it still does not work on the server.
On my local machine I don't have this problem: I can start unoconv from Apache there, but not on my server.

I removed SELinux, I have installed xvfb, and I have read and write permissions on the folder and files.

@dagwieers
Member

Please open a new issue for this. It helps getting attention from people with a similar issue (and myself too ;-))

@Krknv

Krknv commented Apr 20, 2017

I'm trying to convert cyrillic.rtf to unicode.rtf with the command:

unoconv -f rtf -i FilterOptions=34 -e FilterOptions=76 -o '~/unicode.rtf' '~/cyrillic.rtf'

  • on mac (local), the files convert normally without filters
  • on ubuntu (remote), the text looks the same as in the source (ÞÌÒÍÅÈ Þ ÌÒÞÌÒÍÅÈ ÍÅÈ)

remote:

unoconv 0.7
platform posix/linux
python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4]
LibreOffice 5.3.2.2

local:

unoconv 0.7
platform posix/darwin
python 2.7.10 (default, Feb  6 2017, 23:53:20)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]
OpenOffice 4.1.3
