Merge pull request #89 from treasure-data/doc-issue-87-88
Doc: URL fix & comparison between pytd, td-client-python, and pandas-td
takuti authored May 11, 2020
2 parents 78409e0 + 7bb739e commit 1862dc5
Showing 9 changed files with 51 additions and 23 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -32,7 +32,7 @@ v0.8.0 (2019-09-17)
(`#43 <https://github.com/treasure-data/pytd/pull/43>`__, `#44 <https://github.com/treasure-data/pytd/pull/44>`__)
- Disable ``type``, one of the Treasure Data-specific query parameters, because it conflicts with the ``engine`` option.
(`#45 <https://github.com/treasure-data/pytd/pull/45>`__)
- Add `td-pyspark <https://pypi.org/project/td-pyspark/>`__ dependency for easy access to the `td-spark <https://support.treasuredata.com/hc/en-us/articles/360001487167-Apache-Spark-Driver-td-spark-FAQs>`__ functionalities.
- Add `td-pyspark <https://pypi.org/project/td-pyspark/>`__ dependency for easy access to the `td-spark <https://treasure-data.github.io/td-spark/>`__ functionalities.
(`#46 <https://github.com/treasure-data/pytd/pull/46>`__, `#47 <https://github.com/treasure-data/pytd/pull/47>`__)

v0.7.0 (2019-08-23)
45 changes: 33 additions & 12 deletions README.rst
@@ -6,7 +6,7 @@ pytd
**pytd** provides user-friendly interfaces to Treasure Data’s `REST
APIs <https://github.com/treasure-data/td-client-python>`__, `Presto
query
engine <https://support.treasuredata.com/hc/en-us/articles/360001457427-Presto-Query-Engine-Introduction>`__,
engine <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083607/Presto+Query+Engine+Introduction>`__,
and `Plazma primary
storage <https://www.slideshare.net/treasure-data/td-techplazma>`__.

@@ -29,9 +29,9 @@ Usage
Colaboratory <https://colab.research.google.com/drive/1ps_ChU-H2FvkeNlj1e1fcOebCt4ryN11>`__

Set your `API
key <https://support.treasuredata.com/hc/en-us/articles/360000763288-Get-API-Keys>`__
key <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1081428/Getting+Your+API+Keys>`__
and
`endpoint <https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints>`__
`endpoint <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085143/Sites+and+Endpoints>`__
to the environment variables, ``TD_API_KEY`` and ``TD_API_SERVER``,
respectively, and create a client instance:
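A minimal, hedged sketch of that setup (the helper name ``make_client`` is ours, and ``pytd`` must be installed with valid credentials set for it to actually connect):

```python
import os


def make_client(database='sample_datasets'):
    """Build a pytd client from the TD_API_KEY / TD_API_SERVER environment variables."""
    import pytd  # imported lazily so the sketch reads without pytd installed

    return pytd.Client(
        apikey=os.environ['TD_API_KEY'],
        endpoint=os.environ['TD_API_SERVER'],
        database=database,
    )
```

Per the text above, the client also resolves these environment variables on its own when the arguments are omitted.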

@@ -93,7 +93,7 @@ data to Treasure Data:
query through the Presto query engine.
- Recommended only for a small volume of data.

3. `td-spark <https://support.treasuredata.com/hc/en-us/articles/360001487167-Apache-Spark-Driver-td-spark-FAQs>`__:
3. `td-spark <https://treasure-data.github.io/td-spark/>`__:
``spark``

- Local customized Spark instance directly writes ``DataFrame`` to
@@ -137,8 +137,36 @@ with ``td_spark_path`` option would be helpful.
writer = SparkWriter(apikey='1/XXX', endpoint='https://api.treasuredata.com/', td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')
Comparison between pytd, td-client-python, and pandas-td
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Treasure Data offers three different Python clients on GitHub, and the following list summarizes their characteristics.

1. `td-client-python <https://github.com/treasure-data/td-client-python>`__

- Basic REST API wrapper.
- Similar functionalities to td-client-{`ruby <https://github.com/treasure-data/td-client-ruby>`__, `java <https://github.com/treasure-data/td-client-java>`__, `node <https://github.com/treasure-data/td-client-node>`__, `go <https://github.com/treasure-data/td-client-go>`__}.
- The capability is limited by `what Treasure Data REST API can do <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085354/REST+APIs+in+Treasure+Data>`__.

2. **pytd**

- Access to Plazma via td-spark as introduced above.
- Efficient connection to Presto based on `presto-python-client <https://github.com/prestodb/presto-python-client>`__.
- Multiple data ingestion methods and a variety of utility functions.

3. `pandas-td <https://github.com/treasure-data/pandas-td>`__ *(deprecated)*

- Old tool optimized for `pandas <https://pandas.pydata.org>`__ and `Jupyter Notebook <https://jupyter.org>`__.
- **pytd** offers a compatible function set (see below for details).

The optimal choice of package depends on your specific use case, but common guidelines are as follows:

- Use td-client-python if you want to execute *basic CRUD operations* from Python applications.
- Use **pytd** for (1) *analytical purposes* relying on pandas and Jupyter Notebook, and (2) achieving *more efficient data access* with ease.
- Do not use pandas-td. If you are using pandas-td, replace the code with pytd based on the following guidance as soon as possible.
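The contrast between the first two guidelines can be sketched as follows (hypothetical helper functions of our own; each requires its package installed and valid credentials to actually run):

```python
def list_databases_with_tdclient(apikey):
    # td-client-python: thin REST API wrapper, suited to basic CRUD operations.
    import tdclient

    with tdclient.Client(apikey) as client:
        return [db.name for db in client.databases()]


def read_with_pytd(apikey, endpoint):
    # pytd: analytics-oriented, returns query results ready for pandas.
    import pandas as pd
    import pytd

    client = pytd.Client(apikey=apikey, endpoint=endpoint, database='sample_datasets')
    res = client.query('select symbol, count(1) as cnt from nasdaq group by 1')
    return pd.DataFrame(res['data'], columns=res['columns'])
```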

How to replace pandas-td
------------------------
^^^^^^^^^^^^^^^^^^^^^^^^

**pytd** offers
`pandas-td <https://github.com/treasure-data/pandas-td>`__-compatible
@@ -180,13 +208,6 @@ Consequently, all ``pandas_td`` code should keep running correctly with
`here <https://github.com/treasure-data/pytd/issues/new>`__ if you
noticed any incompatible behaviors.
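For instance, a typical migration only swaps the import line; a sketch (wrapped in a hypothetical function so it can be read without running, and with placeholder credentials):

```python
def read_access_counts(apikey, endpoint):
    # Before: import pandas_td as td
    import pytd.pandas_td as td  # After: same call pattern, different package

    con = td.connect(apikey=apikey, endpoint=endpoint)
    engine = td.create_engine('presto:sample_datasets', con=con)
    return td.read_td('select count(1) as cnt from www_access', engine)
```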

.. note:: There is a known difference to the ``pandas_td.to_td`` function in type conversion.
   Since :class:`pytd.writer.BulkImportWriter`, the default writer in pytd, uses CSV as an
   intermediate file before uploading a table, column types may change via ``pandas.read_csv``.
   To preserve column types as much as possible, pass the ``fmt="msgpack"`` argument to the
   ``to_td`` function.

   For more details, see the ``fmt`` option of :func:`pytd.pandas_td.to_td`.

.. |Build status| image:: https://github.com/treasure-data/pytd/workflows/Build/badge.svg
:target: https://github.com/treasure-data/pytd/actions/
.. |PyPI version| image:: https://badge.fury.io/py/pytd.svg
7 changes: 7 additions & 0 deletions doc/index.rst
@@ -5,6 +5,13 @@
.. include:: ../README.rst

.. note:: There is a known difference to the ``pandas_td.to_td`` function in type conversion.
   Since :class:`pytd.writer.BulkImportWriter`, the default writer in pytd, uses CSV as an
   intermediate file before uploading a table, column types may change via ``pandas.read_csv``.
   To preserve column types as much as possible, pass the ``fmt="msgpack"`` argument to the
   ``to_td`` function.

   For more details, see the ``fmt`` option of :func:`pytd.pandas_td.to_td`.
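The type-conversion pitfall is easy to reproduce with plain pandas; this illustrative round trip (our own example, not pytd code) shows a string column losing its leading zeros:

```python
import io

import pandas as pd

# A string column that merely looks numeric.
df = pd.DataFrame({'code': ['001', '002']})

# Round-trip through CSV, analogous to writing an intermediate file.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

print(df['code'].tolist())   # ['001', '002']
print(df2['code'].tolist())  # [1, 2] -- inferred as int64, leading zeros lost
```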

More Examples
-------------

4 changes: 2 additions & 2 deletions pytd/client.py
@@ -25,7 +25,7 @@ class Client(object):
endpoint : str, optional
Treasure Data API server. If not given, ``https://api.treasuredata.com`` is
used by default. List of available endpoints is:
https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085143/Sites+and+Endpoints
database : str, default: 'sample_datasets'
Name of connected database.
@@ -203,7 +203,7 @@ def query(self, query, engine=None, **kwargs):
- ``wait_callback`` (function): called every interval against job itself
- ``engine_version`` (str): run query with Hive 2 if this parameter
is set to ``"experimental"`` and ``engine`` denotes Hive.
https://support.treasuredata.com/hc/en-us/articles/360027259074-How-to-use-Hive-2
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083123/Using+Hive+2+to+Create+Queries
Meanwhile, when the following argument is set to ``True``, the query is
deterministically issued via ``tdclient``.
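A hedged sketch of passing those parameters through ``Client.query`` (hypothetical helper and query; requires a connected client to actually run):

```python
def run_on_hive2(client):
    # engine_version='experimental' switches the job to Hive 2 when engine is Hive.
    return client.query(
        'select count(1) as cnt from www_access',
        engine='hive',
        engine_version='experimental',
    )
```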
4 changes: 2 additions & 2 deletions pytd/pandas_td/__init__.py
@@ -23,7 +23,7 @@ def connect(apikey=None, endpoint=None, **kwargs):
endpoint : str, optional
Treasure Data API server. If not given, ``https://api.treasuredata.com`` is
used by default. List of available endpoints is:
https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085143/Sites+and+Endpoints
kwargs : dict, optional
Optional arguments
@@ -174,7 +174,7 @@ def read_td_query(
- ``wait_callback`` (function): called every interval against job itself
- ``engine_version`` (str): run query with Hive 2 if this parameter is
set to ``"experimental"`` in ``HiveQueryEngine``.
https://support.treasuredata.com/hc/en-us/articles/360027259074-How-to-use-Hive-2
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083123/Using+Hive+2+to+Create+Queries
Returns
-------
6 changes: 3 additions & 3 deletions pytd/query_engine.py
@@ -75,7 +75,7 @@ def execute(self, query, **kwargs):
- ``wait_callback`` (function): called every interval against job itself
- ``engine_version`` (str): run query with Hive 2 if this parameter
is set to ``"experimental"`` in ``HiveQueryEngine``.
https://support.treasuredata.com/hc/en-us/articles/360027259074-How-to-use-Hive-2
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083123/Using+Hive+2+to+Create+Queries
Meanwhile, when the following argument is set to ``True``, the query is
deterministically issued via ``tdclient``.
@@ -179,7 +179,7 @@ def _get_tdclient_cursor(self, con, **kwargs):
- ``wait_callback`` (function): called every interval against job itself
- ``engine_version`` (str): run query with Hive 2 if this parameter
is set to ``"experimental"`` in ``HiveQueryEngine``.
https://support.treasuredata.com/hc/en-us/articles/360027259074-How-to-use-Hive-2
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083123/Using+Hive+2+to+Create+Queries
Returns
-------
@@ -399,7 +399,7 @@ def cursor(self, force_tdclient=True, **kwargs):
- ``wait_callback`` (function): called every interval against job itself
- ``engine_version`` (str): run query with Hive 2 if this parameter
is set to ``"experimental"``.
https://support.treasuredata.com/hc/en-us/articles/360027259074-How-to-use-Hive-2
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083123/Using+Hive+2+to+Create+Queries
Returns
-------
2 changes: 1 addition & 1 deletion pytd/spark.py
@@ -68,7 +68,7 @@ def fetch_td_spark_context(
endpoint : str, optional
Treasure Data API server. If not given, ``https://api.treasuredata.com`` is
used by default. List of available endpoints is:
https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085143/Sites+and+Endpoints
td_spark_path : str, optional
Path to td-spark-assembly_x.xx-x.x.x.jar. If not given, seek a path
2 changes: 1 addition & 1 deletion pytd/table.py
@@ -73,7 +73,7 @@ def create(self, column_names=[], column_types=[]):
column_types : list of str, optional
Column types corresponding to the names. Note that Treasure Data
supports a limited set of types, as documented in:
https://support.treasuredata.com/hc/en-us/articles/360001266468-Schema-Management
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083743/Schema+Management
"""
if len(column_names) > 0:
schema = ", ".join(
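The truncated ``join`` above presumably assembles a ``"name type, ..."`` schema string; a self-contained approximation (column names and types are hypothetical):

```python
# Approximate the schema-string assembly used by Table.create (illustrative only).
column_names = ['user_id', 'amount']
column_types = ['bigint', 'double']

schema = ', '.join(
    f'{name} {ctype}' for name, ctype in zip(column_names, column_types)
)
print(schema)  # user_id bigint, amount double
```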
2 changes: 1 addition & 1 deletion pytd/writer.py
@@ -235,7 +235,7 @@ def _insert_into(self, table, list_of_tuple, column_names, column_types, if_exis
column_types : list of str
Column types corresponding to the names. Note that Treasure Data
supports a limited set of types, as documented in:
https://support.treasuredata.com/hc/en-us/articles/360001266468-Schema-Management
https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083743/Schema+Management
if_exists : {'error', 'overwrite', 'append', 'ignore'}
What happens when a target table already exists.
