This repository has been archived by the owner on Aug 4, 2020. It is now read-only.

pycassa

pycassa is a Cassandra library with the following features:

  1. Auto-failover single or thread-local connections
  2. A simplified version of the thrift interface
  3. A method to map an existing class to a Cassandra ColumnFamily
  4. Support for SuperColumns

Requirements

thrift: http://incubator.apache.org/thrift/
Cassandra: http://incubator.apache.org/cassandra/

Install thrift with the python bindings.

pycassa has only been tested against Cassandra 0.5.0-beta2, but it should work with some earlier versions.

pycassa ships with the Cassandra Python files for convenience, but you can replace them with your own.

Installation

The simplest way to get started is to copy the pycassa and cassandra directories into your program. To install pycassa system-wide instead, run setup.py as a superuser:

python setup.py install

If you also want to install the cassandra python package (if it's not already included on your system):

python setup.py --cassandra install

Basic Usage

All functions are documented with docstrings. To read usage documentation:

>>> import pycassa
>>> help(pycassa.ColumnFamily.get)

For a single connection (which is not thread-safe), pass a list of servers.

>>> client = pycassa.connect() # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect(['localhost:9160'])

If you need Framed Transport, pass the framed_transport argument.

>>> client = pycassa.connect(framed_transport=True)

Thread-local connections open a connection for every thread that calls a Cassandra function. They also automatically balance the number of connections between servers, unless round_robin=False.

>>> client = pycassa.connect_thread_local() # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect_thread_local(['localhost:9160', 'other_server:9160']) # Round robin connections
>>> client = pycassa.connect_thread_local(['localhost:9160', 'other_server:9160'], round_robin=False) # Connect in list order
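As a rough illustration of the idea (not pycassa's actual implementation), per-thread clients with round-robin server selection can be sketched with threading.local. The ThreadLocalClients class and the connect callable are hypothetical names made up for this sketch.

```python
import itertools
import threading


class ThreadLocalClients(object):
    """Hand each calling thread its own client, created on first use."""

    def __init__(self, servers, connect, round_robin=True):
        self._servers = list(servers)
        self._connect = connect          # hypothetical: callable(server) -> client
        self._round_robin = round_robin
        self._cycle = itertools.cycle(self._servers)
        self._lock = threading.Lock()    # protects the shared cycle iterator
        self._local = threading.local()  # per-thread storage

    def _pick_server(self):
        if not self._round_robin:
            # List order: prefer the first server (failover is not shown here).
            return self._servers[0]
        with self._lock:
            return next(self._cycle)

    def client(self):
        # The first call in a thread opens a connection; later calls reuse it.
        if not hasattr(self._local, 'client'):
            self._local.client = self._connect(self._pick_server())
        return self._local.client


# Stand-in connect function: returns a string instead of a real client.
pool = ThreadLocalClients(['localhost:9160', 'other_server:9160'],
                          connect=lambda server: 'client for ' + server)
seen = []


def worker():
    seen.append(pool.client())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With four threads and two servers, each server ends up with two of the connections.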

Connections are robust to server failures. Upon a disconnection, the client attempts to connect to each server in the list in turn. If no server is available, it raises a NoServerAvailable exception.
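The failover behavior can be pictured as a loop over the server list. This is a minimal sketch under assumptions: open_connection and fake_open are hypothetical helpers, and NoServerAvailable is redefined locally so the example runs without pycassa.

```python
class NoServerAvailable(Exception):
    """Local stand-in for the exception named above."""


def connect_with_failover(servers, open_connection):
    """Try each server in list order and return the first connection that opens.

    open_connection is a hypothetical callable(server) that returns a
    connection or raises an exception when the server is unreachable.
    """
    for server in servers:
        try:
            return open_connection(server)
        except Exception:
            continue  # this server is down; move on to the next one
    raise NoServerAvailable('tried: ' + ', '.join(servers))


def fake_open(server):
    # Pretend that only other_server is reachable.
    if server != 'other_server:9160':
        raise IOError('connection refused: ' + server)
    return 'connected to ' + server


conn = connect_with_failover(['localhost:9160', 'other_server:9160'], fake_open)
```

Here conn comes from the second server, because the first one refuses the connection.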

To use the standard interface, create a ColumnFamily instance.

>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test ColumnFamily')

The value returned by an insert is the timestamp used for the insertion, which by default is int(time.mktime(time.gmtime())). You may replace this function with your own (see Extra Documentation).

>>> cf.insert('foo', {'column1': 'val1'})
1261349837
>>> cf.get('foo')
{'column1': 'val1'}
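To make the timestamp function concrete, here is the default expression next to one possible replacement. microsecond_timestamp is a hypothetical name; the only requirement assumed here is that the function takes no arguments and returns an integer.

```python
import time


def default_timestamp():
    # The default expression quoted above: whole seconds.
    return int(time.mktime(time.gmtime()))


def microsecond_timestamp():
    # Hypothetical replacement with finer resolution, so that two writes
    # within the same second still get distinct timestamps.
    return int(time.time() * 1e6)


seconds = default_timestamp()
micros = microsecond_timestamp()
```

Such a function would be passed through the timestamp parameter of the ColumnFamily constructor. Since Cassandra compares timestamps numerically, it is safest to pick one resolution per ColumnFamily and keep it.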

Insert also acts to update values.

>>> cf.insert('foo', {'column1': 'val2'})
1261349910
>>> cf.get('foo')
{'column1': 'val2'}

You may insert multiple columns at once.

>>> cf.insert('bar', {'column1': 'val3', 'column2': 'val4'})
1261350013
>>> cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
>>> cf.get_count('bar')
2

get_range() returns an iterable. Pass it to list() to convert it to a list.

>>> list(cf.get_range())
[('bar', {'column1': 'val3', 'column2': 'val4'}), ('foo', {'column1': 'val2'})]
>>> list(cf.get_range(row_count=1))
[('bar', {'column1': 'val3', 'column2': 'val4'})]

You can remove entire keys or just a certain column.

>>> cf.remove('bar', column='column1')
1261350220
>>> cf.get('bar')
{'column2': 'val4'}
>>> cf.remove('bar')
1261350226
>>> cf.get('bar')
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()

pycassa retains Cassandra's behavior in that get_range() may return removed keys, with no columns, for a while. Cassandra eventually deletes them, and they disappear from the results.

>>> cf.remove('foo')
>>> cf.remove('bar')
>>> list(cf.get_range())
[('bar', {}), ('foo', {})]

... After some amount of time

>>> list(cf.get_range())
[]
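If removed keys get in the way, they can be filtered out on the client side. A minimal sketch, assuming that a removed row always comes back with an empty column dictionary as in the output above; live_rows is a hypothetical helper, not part of pycassa.

```python
def live_rows(rows):
    """Yield only the (key, columns) pairs that still have columns."""
    for key, columns in rows:
        if columns:
            yield key, columns


# Sample data in the shape returned by get_range().
rows = [('bar', {}), ('foo', {'column1': 'val2'})]
result = list(live_rows(rows))
```

In real use the input would be the iterable from cf.get_range(), for example list(live_rows(cf.get_range())).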

Class Mapping

You can also map existing classes using ColumnFamilyMap.

>>> class Test(object):
...     string_column       = pycassa.String(default='Your Default')
...     int_str_column      = pycassa.IntString(default=5)
...     int_column          = pycassa.Int64(default=0)
...     float_str_column    = pycassa.FloatString(default=8.0)
...     float_column        = pycassa.Float64(default=0.0)
...     datetime_str_column = pycassa.DateTimeString() # default=None
...     datetime_column     = pycassa.DateTime()

The defaults will be filled in whenever you retrieve instances from the Cassandra server and the column doesn't exist. If, for example, you add columns in the future, you simply add the relevant column and the default will be there when you get old instances.

The difference between IntString and Int64 is how the value is stored in Cassandra. If you want maximum compatibility with other languages, use IntString, FloatString, and DateTimeString. Int64 is stored as an int64_t, Float64 is stored as a double, and DateTime is stored in the same format as the time() system call (seconds since 1970-01-01 00:00:00). These binary formats may be more compact and faster than the string representations.
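The two storage styles can be compared with the struct module. This is a sketch of the idea only: the function names are hypothetical, and the big-endian byte order is an assumption, since the text above does not say which byte order pycassa uses.

```python
import struct


def pack_int_string(val):
    # String form: readable from any language, size grows with the digits.
    return str(val).encode('ascii')


def unpack_int_string(raw):
    return int(raw.decode('ascii'))


def pack_int64(val):
    # Binary form: always 8 bytes. '>q' is a big-endian signed 64-bit
    # integer; the byte order is an assumption of this sketch.
    return struct.pack('>q', val)


def unpack_int64(raw):
    return struct.unpack('>q', raw)[0]


def pack_float64(val):
    # '>d' is a big-endian IEEE 754 double.
    return struct.pack('>d', val)


def unpack_float64(raw):
    return struct.unpack('>d', raw)[0]


timestamp = 1261395560
as_string = pack_int_string(timestamp)  # 10 bytes
as_int64 = pack_int64(timestamp)        # 8 bytes
```

For small values such as 18 the string form is actually shorter (2 bytes against 8), which is why the binary formats are only sometimes more compact.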

>>> Test.objects = pycassa.ColumnFamilyMap(Test, cf)

ColumnFamilyMap provides the same functions as ColumnFamily, except that they take and return instances of the supplied class when possible.

>>> t = Test()
>>> t.key = 'maptest'
>>> t.string_column = 'string test'
>>> t.int_column = t.int_str_column = 18
>>> t.float_column = t.float_str_column = 35.8
>>> from datetime import datetime
>>> t.datetime_column = t.datetime_str_column = datetime.now()
>>> Test.objects.insert(t)
1261395560

>>> Test.objects.get(t.key).string_column
'string test'
>>> Test.objects.get(t.key).int_str_column
18
>>> Test.objects.get(t.key).float_column
35.799999999999997
>>> Test.objects.get(t.key).datetime_str_column
datetime.datetime(2009, 12, 23, 17, 6, 3)

>>> Test.objects.multiget([t.key])
{'maptest': <__main__.Test object at 0x7f8ddde0b9d0>}
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7f8ddde0b710>]
>>> Test.objects.get_count(t.key)
7

>>> Test.objects.remove(t)
1261395603
>>> Test.objects.get(t.key)
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()

Note that, as mentioned previously, get_range() may continue to return removed rows for some time:

>>> Test.objects.remove(t)
1261395603
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7fac9c85ea90>]
>>> list(Test.objects.get_range())[0].string_column
'Your Default'
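The 'Your Default' result follows from how defaults are filled in: a removed row comes back with no columns, so every attribute falls back to its default. A simplified sketch of that step, with a hypothetical Field class standing in for the pycassa column types and no unpacking of stored values:

```python
class Field(object):
    """Hypothetical stand-in for a mapped column type with a default."""

    def __init__(self, default=None):
        self.default = default


FIELDS = {
    'string_column': Field(default='Your Default'),
    'int_column': Field(default=0),
}


def build_attributes(stored_columns, fields=FIELDS):
    """Use the stored value when the column exists, else the field default."""
    attrs = {}
    for name, field in fields.items():
        attrs[name] = stored_columns.get(name, field.default)
    return attrs


live = build_attributes({'string_column': 'string test', 'int_column': 18})
removed = build_attributes({})  # a removed row has no columns left
```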

SuperColumns

To use SuperColumns, pass super=True to the ColumnFamily constructor.

>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test SuperColumnFamily', super=True)
>>> cf.insert('key1', {'1': {'sub1': 'val1', 'sub2': 'val2'}, '2': {'sub3': 'val3', 'sub4': 'val4'}})
1261490144
>>> cf.get('key1')
{'1': {'sub2': 'val2', 'sub1': 'val1'}, '2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.remove('key1', '1')
1261490176
>>> cf.get('key1')
{'2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.get('key1', super_column='2')
{'sub3': 'val3', 'sub4': 'val4'}
>>> cf.multiget(['key1'], super_column='2')
{'key1': {'sub3': 'val3', 'sub4': 'val4'}}
>>> list(cf.get_range(super_column='2'))
[('key1', {'sub3': 'val3', 'sub4': 'val4'})]

These output values retain the same format as given by the Cassandra thrift interface.
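Because a SuperColumn row is just a dictionary of dictionaries, ordinary Python is enough to walk it. A small sketch with a hypothetical helper, iter_subcolumns, that flattens a row into (super_column, subcolumn, value) triples:

```python
def iter_subcolumns(row):
    """Yield (super_column, subcolumn, value) triples in sorted order."""
    for super_name in sorted(row):
        subcolumns = row[super_name]
        for sub_name in sorted(subcolumns):
            yield super_name, sub_name, subcolumns[sub_name]


# Same shape as the value inserted under 'key1' above.
row = {'1': {'sub1': 'val1', 'sub2': 'val2'},
       '2': {'sub3': 'val3', 'sub4': 'val4'}}
triples = list(iter_subcolumns(row))
```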

Advanced

pycassa currently returns Cassandra Columns and SuperColumns as Python dictionaries. Sometimes, though, you care about the order of elements. If you have access to an ordered dictionary class (such as collections.OrderedDict in Python 2.7), you may pass it to the constructor as dict_class. All returned values will be of that class.

>>> import collections
>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test ColumnFamily',
...                           dict_class=collections.OrderedDict)
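The reason an ordered class matters can be seen without a server. In this sketch the list of pairs stands in for columns arriving from Cassandra in sorted order; column3 and val5 are made-up sample data.

```python
import collections

# Columns as (name, value) pairs, in the order they arrive.
pairs = [('column1', 'val3'), ('column2', 'val4'), ('column3', 'val5')]

# An OrderedDict remembers insertion order, so the arrival order survives.
ordered = collections.OrderedDict(pairs)

# A reversed fetch is only observable when the container keeps order.
reversed_columns = collections.OrderedDict(reversed(pairs))
```

Iterating over ordered yields column1, column2, column3, while reversed_columns yields them back to front. A dict in the Python versions this library targets makes no such promise.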

You may also define your own Column types for the mapper. For example, the IntString may be defined as:

>>> class IntString(pycassa.Column):
...     def pack(self, val):
...         return str(val)
...     def unpack(self, val):
...         return int(val)
... 
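Following the same pack/unpack pattern, a column type that stores structured values as JSON text might look like the sketch below. JsonString is a hypothetical type, and the Column class here is a local stand-in for pycassa.Column so that the example runs on its own.

```python
import json


class Column(object):
    """Local stand-in for pycassa.Column, for illustration only."""

    def __init__(self, default=None):
        self.default = default

    def pack(self, val):
        return val

    def unpack(self, val):
        return val


class JsonString(Column):
    """Hypothetical column type that stores lists and dicts as JSON text."""

    def pack(self, val):
        # sort_keys keeps the stored text stable for equal dictionaries.
        return json.dumps(val, sort_keys=True)

    def unpack(self, val):
        return json.loads(val)


column = JsonString(default={})
stored = column.pack({'tags': ['a', 'b'], 'count': 2})
restored = column.unpack(stored)
```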

Extra Documentation

All the functions provide the same functionality as their thrift counterparts, though some of it is only reachable through keyword arguments.

ColumnFamily.__init__()
    Parameters
    ----------
    client   : cassandra.Cassandra.Client
        Cassandra client with thrift API
    keyspace : str
        The Keyspace this ColumnFamily belongs to
    column_family : str
        The name of this ColumnFamily
    buffer_size : int
        When calling get_range(), the intermediate results need to be
        buffered if we are fetching many rows, otherwise the Cassandra
        server will overallocate memory and fail.  This is the size of
        that buffer.
    read_consistency_level : ConsistencyLevel
        Affects the guaranteed replication factor before returning from
        any read operation
    write_consistency_level : ConsistencyLevel
        Affects the guaranteed replication factor before returning from
        any write operation
    timestamp : function
        The default timestamp function returns:
        int(time.mktime(time.gmtime()))
        That is, the number of seconds since the Unix epoch in GMT.
        Set timestamp to replace the default timestamp function with your
        own.
    super : bool
        Whether this ColumnFamily has SuperColumns
    dict_class : class (must act like the dict type)
        The default dict_class is dict.
        If the order of columns matters to you, pass your own dictionary
        class, or python 2.7's new collections.OrderedDict. All returned
        rows and subcolumns are instances of this.
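The buffering described for buffer_size can be pictured as paging through the key range. This is a sketch of the general technique, not pycassa's actual code: buffered_range, fetch_page and fake_fetch are hypothetical names, and the sketch assumes rows come back sorted by key with an inclusive start key.

```python
def buffered_range(fetch_page, buffer_size, start='', row_count=None):
    """Yield (key, columns) pairs, fetching at most buffer_size rows per request.

    fetch_page is a hypothetical callable(start_key, count) returning up to
    count rows, sorted by key, beginning at start_key (inclusive).
    """
    if buffer_size < 2:
        raise ValueError('buffer_size must be at least 2 to make progress')
    yielded = 0
    last_key = None
    while True:
        page = fetch_page(start, buffer_size)
        for key, columns in page:
            if key == last_key:
                continue  # start is inclusive: skip the row repeated from the last page
            if row_count is not None and yielded >= row_count:
                return
            yield key, columns
            yielded += 1
        if len(page) < buffer_size:
            return  # a short page means there are no more rows
        last_key = start = page[-1][0]


# In-memory stand-in for the server.
data = {'a': {'c': '1'}, 'b': {'c': '2'}, 'c': {'c': '3'},
        'd': {'c': '4'}, 'e': {'c': '5'}}


def fake_fetch(start_key, count):
    keys = sorted(k for k in data if k >= start_key)
    return [(k, data[k]) for k in keys[:count]]


all_keys = [k for k, _ in buffered_range(fake_fetch, buffer_size=2)]
first_three = [k for k, _ in buffered_range(fake_fetch, buffer_size=2, row_count=3)]
```

Only buffer_size rows are held in memory at a time, at the cost of one extra request per page.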

ColumnFamily.get()
    Parameters
    ----------
    key : str
        The key to fetch
    columns : [str]
        Limit the columns fetched to the specified list
    column_start : str
        Only fetch when a column is >= column_start
    column_finish : str
        Only fetch when a column is <= column_finish
    column_reversed : bool
        Fetch the columns in reverse order. This has no visible effect
        unless you passed an ordered dict_class to the constructor.
    column_count : int
        Limit the number of columns fetched per key
    include_timestamp : bool
        If true, return a (value, timestamp) tuple for each column
    super_column : str
        Return columns only in this super_column
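How the column parameters combine can be modeled on a plain dictionary. This is a client-side approximation for illustration: slice_columns is a hypothetical helper, None stands for "no bound", and the sketch keeps column_start <= column_finish even when reversed, which may differ from the Thrift convention.

```python
def slice_columns(columns, column_start=None, column_finish=None,
                  column_reversed=False, column_count=None):
    """Return (name, value) pairs selected the way the parameters describe."""
    names = sorted(columns, reverse=column_reversed)
    selected = []
    for name in names:
        if column_start is not None and name < column_start:
            continue  # below the inclusive lower bound
        if column_finish is not None and name > column_finish:
            continue  # above the inclusive upper bound
        selected.append((name, columns[name]))
    if column_count is not None:
        selected = selected[:column_count]  # the limit applies after ordering
    return selected


columns = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
middle = slice_columns(columns, column_start='b', column_finish='c')
last_two = slice_columns(columns, column_reversed=True, column_count=2)
```

A list of pairs is returned instead of a dict so that the order, including the reversed order, stays visible.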

ColumnFamily.multiget()
    Parameters
    ----------
    keys : [str]
        A list of keys to fetch
    columns : [str]
        Limit the columns fetched to the specified list
    column_start : str
        Only fetch when a column is >= column_start
    column_finish : str
        Only fetch when a column is <= column_finish
    column_reversed : bool
        Fetch the columns in reverse order. This has no visible effect
        unless you passed an ordered dict_class to the constructor.
    column_count : int
        Limit the number of columns fetched per key
    include_timestamp : bool
        If true, return a (value, timestamp) tuple for each column
    super_column : str
        Return columns only in this super_column

ColumnFamily.get_count()
    Parameters
    ----------
    key : str
        The key with which to count columns
    super_column : str
        Count the columns only in this super_column

ColumnFamily.get_range()
    Parameters
    ----------
    start : str
        Start from this key (inclusive)
    finish : str
        End at this key (inclusive)
    columns : [str]
        Limit the columns fetched to the specified list
    column_start : str
        Only fetch when a column is >= column_start
    column_finish : str
        Only fetch when a column is <= column_finish
    column_reversed : bool
        Fetch the columns in reverse order. This has no visible effect
        unless you passed an ordered dict_class to the constructor.
    column_count : int
        Limit the number of columns fetched per key
    row_count : int
        Limit the number of rows fetched
    include_timestamp : bool
        If true, return a (value, timestamp) tuple for each column
    super_column : str
        Return columns only in this super_column

ColumnFamily.insert()
    Insert or update columns for a key

    Parameters
    ----------
    key : str
        The key to insert or update the columns at
    columns : dict
        Column: {'column': 'value'}
        SuperColumn: {'column': {'subcolumn': 'value'}}
        The columns or supercolumns to insert or update

ColumnFamily.remove()
    Parameters
    ----------
    key : str
        The key to remove. If column is not set, remove all columns
    column : str
        If set, remove only this column or supercolumn