Add support for new character sets #1

GeoffMontee opened this issue Apr 24, 2014 · 24 comments

@GeoffMontee
Collaborator

Currently, tds_fdw can't set the character set for the connection.

The FreeTDS implementation of DB-Library does have the DBSETLCHARSET macro to set the character set:

http://www.freetds.org/reference/a00284.html#ga114

However, the documentation says it doesn't work on TDS 7.0+ connections. This makes it less useful for most versions of MS SQL Server, since those use 7.0+ by default:

http://www.freetds.org/userguide/choosingtdsprotocol.htm

The reason it is not supported for 7.0+ is described here:

http://www.freetds.org/userguide/localization.htm

"It is also worth clarifying that TDS 7.0 and above do not accept any specified character set during login, as 4.2 does. A TDS 7.0 login packet uses UCS-2."

There's also the dbsetdefcharset function:

http://www.freetds.org/reference/a00284.html#ga86

But it looks like that might be unimplemented in the current versions of FreeTDS.

Character sets can be set in freetds.conf:

http://www.freetds.org/userguide/freetdsconf.htm

It is probably possible to change character sets with DB-Library in a way that is also compatible with TDS 7.0+. I should figure out how.

On the PostgreSQL side, I also wonder how BuildTupleFromCStrings will handle character sets with multi-byte characters. I might need to find another way to build tuples.

@GeoffMontee
Collaborator Author

From the FreeTDS localization page:

"To learn what character set the client wants, FreeTDS prefers the applicable freetds.conf client charset property. If that is not set, it parses the LANG environment variable. In either case, the found string is passed to iconv(3) (or its built-in replacement). [1]. If neither is found, UCS-2 data are converted to ISO 8859-1."

@GeoffMontee
Collaborator Author

One user said that setting "client charset" in freetds.conf to UTF-8 worked.

@sumariva

sumariva commented Jun 1, 2015

Today I got a foreign table returning the charset-related error below:

NOTE:  DB-Library notice: Msg #: 5701, Msg state: 2, Msg: Changed database context to 'SIARM'., Server: WINDOWS-0CMIFUI, Process: , Line: 1, Level: 0
NOTE:  DB-Library notice: Msg #: 5703, Msg state: 1, Msg: Changed language setting to us_english., Server: WINDOWS-0CMIFUI, Process: , Line: 1, Level: 0
ERRO:  DB-Library error: DB #: 2403, DB Msg: Some character(s) could not be converted into client's character set.  Unconverted bytes were changed to question marks ('?'), OS #: 0, OS Msg: Sucess, Level: 4
********** Error **********

ERRO: DB-Library error: DB #: 2403, DB Msg: Some character(s) could not be converted into client's character set.  Unconverted bytes were changed to question marks ('?'), OS #: 0, OS Msg: Sucess, Level: 4
SQL state: HV00L

I tried putting the configuration
client charset = UTF-8
in freetds.conf, first in the global section and then in the server section, but I still get the same issue.
I'm using tds version = 7.0.

Doesn't the message indicate that this is just a warning?
It says it will replace unconvertible characters with question marks.

@GeoffMontee
Collaborator Author

Hi @sumariva,

I believe that client charset = UTF-8 in freetds.conf should work, as long as tds version = 7.0 is also set.

Can you please show the full contents of your freetds.conf?

@sumariva

sumariva commented Jun 1, 2015

Yes, this is my freetds.conf

#   $Id: freetds.conf,v 1.12 2007/12/25 06:02:36 jklowden Exp $
#
# This file is installed by FreeTDS if no file by the same
# name is found in the installation directory.
#
# For information about the layout of this file and its settings,
# see the freetds.conf manpage "man freetds.conf".

# Global settings are overridden by those in a database
# server specific section
[global]
        # TDS protocol version
;       tds version = 4.2

        # Whether to write a TDSDUMP file for diagnostic purposes
        # (setting this to /tmp is insecure on a multi-user system)
;       dump file = /tmp/freetds.log
;       debug flags = 0xffff

        # Command and connection timeouts
;       timeout = 10
;       connect timeout = 10

        # If you get out-of-memory errors, it may mean that your client
        # is trying to allocate a huge buffer for a TEXT field.
        # Try setting 'text size' to a more reasonable limit
        text size = 64512
        client charset = UTF-8

# A typical Sybase server
[egServer50]
        host = symachine.domain.com
        port = 5000
        tds version = 5.0

# A typical Microsoft server
[duquedecaxias]
        host = 192.168.2.30
        port = 1433
        tds version = 7.0
        client charset = UTF-8

@GeoffMontee
Collaborator Author

Thanks @sumariva!

The definition of duquedecaxias looks OK to me.

Can you show me the definition of your foreign server and foreign table please?

@sumariva

sumariva commented Jun 1, 2015

Regarding PostgreSQL-side conversions, there are iconv-like string functions:
http://www.postgresql.org/docs/9.3/static/functions-string.html#FUNCTIONS-STRING-OTHER
convert_from will convert to the database encoding.
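
For example, a call along these lines (just an illustrative sketch; the byte value is made up) converts LATIN1 bytes into the database encoding:

SELECT convert_from('\xe9'::bytea, 'LATIN1');
-- the single LATIN1 byte 0xE9 comes back as 'é' when the database encoding is UTF-8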

@GeoffMontee
Collaborator Author

The character set conversions are done by FreeTDS using the iconv library before PostgreSQL sees the strings, so those functions shouldn't be necessary for this.

@sumariva

sumariva commented Jun 1, 2015

The configured server

CREATE SERVER siarm
FOREIGN DATA WRAPPER tds_fdw
OPTIONS (servername '192.168.2.30',port '1433');

The table mapping the system information tables; here I also noted a weird character, ASCII(2), at the end of a column:

CREATE FOREIGN TABLE sgm.information_schema_columns_tds (
    table_catalog varchar
  , table_schema varchar
  , table_name varchar
  , column_name varchar
  , ordinal_position integer
  , column_default varchar
  , is_nullable varchar
  , data_type varchar
  , character_maximum_length integer
  , character_octect_length integer
  , numeric_precision integer
  , numeric_precision_radix integer
  , numeric_scale integer
  , datetime_precision integer
  , interval_type varchar
  , interval_precision integer
  , character_set_catalog varchar
  , character_set_schema varchar
  , character_set_name varchar
  , collation_catalog varchar
  , collation_schema varchar
  , collation_name varchar
  , domain_catalog varchar
  , domain_schema varchar
  , domain_name varchar
  , udt_catalog varchar
  , udt_schema varchar
  , udt_name varchar
  , scope_catalog varchar
  , scope_schema varchar
  , scope_name varchar
  , maximun_cardinality integer
  , dtd_identifier varchar
  /*
  , is_self_referencing varchar
  , is_identity varchar
  , identity_generation varchar
  , identity_start varchar
  , identity_increment varchar
  , identity_maximum varchar
  , identity_minimum varchar
  , identity_cycle varchar
  , is_generated varchar
  , generation_expression varchar
  , is_updatable varchar */
  )
    SERVER siarm
    OPTIONS ( query 'SELECT * FROM information_schema.columns ORDER BY table_schema, table_name, ordinal_position' )

The table definition as harvested using the system information table:

    CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos"
    (
    "Inscricao" integer NOT NULL,
    "ID" smallint NOT NULL,
    "Processo" character varying(12) ,
    "Historico" text NOT NULL,
    "Data" timestamp without time zone ,
    "Departamento" smallint ,
    "Data_Atualizacao" timestamp without time zone
    )
    SERVER siarm
    OPTIONS (table 'MV_IPTU_Contribuintes_CadastroHistoricos' );

@GeoffMontee
Collaborator Author

@sumariva,

Your foreign server isn't using the duquedecaxias server definition in freetds.conf because you have servername set to 192.168.2.30. If you want it to use this definition, set servername to duquedecaxias.

Otherwise, you should set tds version = 7.0 and client charset = UTF-8 in the [global] section of freetds.conf. That way, those settings will apply to all connections.
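
For example, the existing foreign server could be pointed at that freetds.conf section with something like this (just a sketch, untested here; the port and tds version would then come from the [duquedecaxias] section):

ALTER SERVER siarm OPTIONS (SET servername 'duquedecaxias', DROP port);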

@sumariva

sumariva commented Jun 1, 2015

Hmm, I updated freetds.conf as suggested:

[global]
        # TDS protocol version
        tds version = 7.0
        client charset = UTF-8
[duquedecaxias]
        host = 192.168.2.30
        port = 1433
        tds version = 7.0
        client charset = UTF-8

but I still get the same message.
I also checked the PostgreSQL client_encoding:

SHOW client_encoding
UNICODE

@GeoffMontee
Collaborator Author

Did you restart the PostgreSQL server after making the change to freetds.conf?

@sumariva

sumariva commented Jun 1, 2015

Yes, the PostgreSQL server has been restarted after each change, since I do not know when those parameters are reread by the tds_fdw driver.
I'm trying to find out what the 2403 error code means in the MS SQL documentation.

@GeoffMontee
Collaborator Author

Maybe your configuration is being read correctly, but one of your rows has non-UTF-8 data.

The table mapping the system information tables; here I also noted a weird character, ASCII(2), at the end of a column

What do you mean by this? Are you saying that ASCII(2) is the data type of a column on the MS SQL Server side? Or something else?

@sumariva

sumariva commented Jun 1, 2015

When I executed a SELECT statement on that system information table (it is a database-internal table), I got the following result in psql:

 table_catalog | table_schema |           table_name           | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octect_length | numeric_precision | numeric_precision_radix | numeric_scale | datetime_precision | interval_type | interval_precision | character_set_catalog | character_set_schema | character_set_name | collation_catalog | collation_schema | collation_name | domain_catalog | domain_schema | domain_name | udt_catalog | udt_schema | udt_name | scope_catalog | scope_schema | scope_name | maximun_cardinality | dtd_identifier 
---------------+--------------+--------------------------------+-------------+------------------+----------------+-------------+-----------+--------------------------+-------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-----------------------+----------------------+--------------------+-------------------+------------------+----------------+----------------+---------------+-------------+-------------+------------+----------+---------------+--------------+------------+---------------------+----------------
 SIARM_CAXIAS  | dbo          | MV_IPTU_Contribuintes_Cadastro | Inscricao   |                1 |                | NO          | int       |                          |                         |                10 |                      10 |             0 |                    |               |                    |                       |                      |                    |                   |                  |                |                |               |             |             |            |          |               |              |            |                     | Y\x02

Note the \x02 character reported in the last column, named dtd_identifier; that is what I called ASCII(2).
Unfortunately, I do not have an MS SQL client to issue the same query and see whether that \x02 is reported there as well.

@sumariva

sumariva commented Jun 1, 2015

I found the 2403 error code defined in FreeTDS as a constant called SYBEICONVI in the file https://github.com/FreeTDS/freetds/blob/master/include/sybdb.h.
Shouldn't this "error code" be caught as a warning, allowing execution to continue normally?

@GeoffMontee
Collaborator Author

As far as I can tell, the column dtd_identifier doesn't exist in information_schema.columns in MS SQL Server:

https://msdn.microsoft.com/en-us/library/ms188348.aspx

This probably means that this column is being populated with whatever random bytes are already in memory when tds_fdw is fetching results. This looks like a new bug, so I've submitted that here:

#25

Either way, this seems completely unrelated to any character set issues you are experiencing.

It still sounds to me like you may have data in one of your rows that is incompatible with UTF-8. Maybe try to find out which column is failing? (e.g. maybe try fetching only one varchar column at a time?)

Also, what version of MS SQL Server are you using? If you are using 2008 or later, maybe try tds version = 7.3:

http://www.freetds.org/userguide/choosingtdsprotocol.htm

@GeoffMontee
Collaborator Author

@sumariva,

I don't know if this warning should be treated in a non-fatal manner without knowing exactly what is going wrong here for you. Can you please try to find out what data is failing? You should be able to find out which varchar column is failing by creating two new foreign tables:

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Processo"
(
"Processo" character varying(12)
)
SERVER siarm
OPTIONS (query 'SELECT Processo FROM MV_IPTU_Contribuintes_CadastroHistoricos' );

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico"
(
"Historico" text NOT NULL
)
SERVER siarm
OPTIONS (query 'SELECT Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos' );

Then you could figure out which column is failing conversion with:

SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Processo";
SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico";

@sumariva

I followed your suggestions and found that the failing column is the one in the second query:

SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico";

The failure occurs beyond 10,000 rows fetched.

@sumariva

Digging into the MS SQL information_schema.columns table, I found that the type used for the column
is text (max??) with the iso_1 character set and the collation Latin1_General_CI_AI.

The following attempt to translate the characters to a binary representation failed with the error reported below:

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico"
(
  "Historico" character varying
)
SERVER siarm
OPTIONS (query 'Select CAST( CAST( Historico AS varchar(max) ) AS varbinary(max) ) AS Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos' );
======= Backtrace: =========
/lib64/libc.so.6(+0x7d4e6)[0x7fa5b5ba54e6]
/usr/lib64/libsybdb.so.5(dbconvert+0x7ae)[0x7fa5b1107fce]
/usr/lib64/postgresql-9.2/lib64/tds_fdw.so(+0x1eaa)[0x7fa5b1361eaa]

If I understood the documentation correctly, when the text type is used, the database stores the data in the configured database code page, which can be a single-byte encoding rather than UCS-2 (Unicode).
Do you know how to tell MS SQL to convert the characters to a different code page in the query, before FreeTDS receives them?
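
One idea I have not tried yet (just a sketch; the foreign table name below is made up) is to cast the text column to nvarchar on the SQL Server side, so the server sends UCS-2 and FreeTDS handles the conversion to the client charset:

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_HistoricoN"
("Historico" text)
SERVER siarm
OPTIONS (query 'SELECT CAST(Historico AS nvarchar(max)) AS Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos');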

@GeoffMontee
Collaborator Author

You might want to try downloading the most recent commit of tds_fdw and then compiling tds_fdw with DEBUG defined. Then tds_fdw will give you a lot of information about what's going on.

You might also want to consider enabling a FreeTDS log file.

@sumariva

Just for reference, I'm recording the server version used:

Microsoft SQL Server 2014 - 12.0.4100.1 (X64) 
Apr 20 2015 17:29:27 
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: )

@sumariva

sumariva commented Jul 4, 2015

I'm currently trying to read the "Historico" (history) column as binary:

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_HistoricoHex"
("Historico" text NOT NULL)
SERVER siarm
OPTIONS (query 'SELECT convert(varbinary(8000), convert(varchar(8000),Historico), 0) AS Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos');

A simple SELECT in PostgreSQL showed that the data has arrived:

select
  char_length( "Historico" ), octet_length( "Historico" ), "Historico", encode( "Historico"::bytea, 'hex' )
--, convert_to( "Historico"::bytea, 'iso_8859_1') mssql_varchar_from_varbinary
from
cliente."MV_IPTU_Contribuintes_CadastroHistoricos_HistoricoHex"
limit 100

@wprigollopes

wprigollopes commented Dec 27, 2018

I had the same problem.

What happened: I configured my freetds.conf with the default options and imported the foreign schema from a SQL Server into PostgreSQL.

If I fetch some integer columns, FreeTDS works well, but if I fetch some text columns, error 2403 happens.

To find a solution, I tried different charsets and, finally, setting the client charset to CP1252 worked fine. It seems to be a problem very similar to the one reported by @sumariva.

My freetds.conf follows (the read-only intent option is used because my PostgreSQL database only reads and refreshes some materialized views; I am not writing to the SQL Server database):

[totvs]
        host = <ip>
        port = 1433
        tds version = 7.4
        database = <dabatabe>
        client charset = CP1252 
        read-only intent yes

It seems that the SQL Server has data stored in the "CP1252" format, so iconv returned an error because it couldn't convert some of the characters stored in the database. Using "CP1252", the conversion was done correctly and without errors.

Setup info
SQL Server version: 2017 - Encoding LATIN1 - Win
Postgresql: 10 - Encoding UTF-8 (default) - Ubuntu 18.04 LTS
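
To check which collation (and therefore which code page for char/varchar/text data) the server and database actually use, queries along these lines can help (a small sketch; 'MyDatabase' is a placeholder):

SELECT SERVERPROPERTY('Collation') AS server_collation;
SELECT DATABASEPROPERTYEX('MyDatabase', 'Collation') AS database_collation;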
