In [None]:
CHAPTER 12
Python - Cassandra
In this last chapter, we are going to deal with another important NoSQL
database: Cassandra. Today some of the biggest IT companies (including
Facebook, Twitter, Cisco, and so on) use Cassandra because of its high
scalability, consistency, and fault tolerance. Cassandra is a distributed
database from the Apache Software Foundation. It is a wide column store
database. Large amounts of data are stored across many commodity servers,
which makes the data highly available.

In [None]:
12.1 Cassandra Architecture
The fundamental unit of data storage is a node. A node is a single server on
which data is stored in the form of keyspaces. For understanding, you can
think of a keyspace as a single database. Just as any server running a SQL
engine can host multiple databases, a node can have many keyspaces. Again,
as in a SQL database, a keyspace may have multiple column families, which
are similar to tables.
However, the architecture of Cassandra is logically as well as physically
different from any SQL-oriented server (Oracle, MySQL, PostgreSQL, and
so on). Cassandra is designed to be a fault-tolerant database without a single
point of failure. Hence, data in one node is replicated across a peer-to-peer
network of nodes. The network is called a data center, and if required,
multiple data centers are interconnected to form a cluster. The replication
strategy and replication factor can be defined at the time of creation of a
keyspace. (Figure 12.1)

In [None]:
Each ‘write’ operation over a keyspace is first recorded in the Commit Log, which acts
as a crash-recovery system. After recording here, data is stored in a
Mem-table. The Mem-table is just a cache or buffer in memory. Data from the
Mem-table is periodically flushed to SSTables, which are physical disk files
on the node.
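The write path described above can be mimicked with a toy in-memory model. The sketch below is not Cassandra code: the class name `ToyNode`, the `flush_threshold` parameter, and the use of Python lists and dictionaries as stand-ins for the Commit Log, Mem-table, and SSTables are all assumptions made purely for illustration.

```python
class ToyNode:
    """Toy model of a node's write path (illustration only, not Cassandra code)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory buffer: key -> value
        self.sstables = []        # each flush produces one immutable "file"
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        # 1. Record the write in the commit log first, for crash recovery.
        self.commit_log.append((key, value))
        # 2. Then apply it to the in-memory Mem-table.
        self.memtable[key] = value
        # 3. Flush to an SSTable once the Mem-table is full.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified after being written.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Look in the Mem-table first, then in SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(flush_threshold=2)
node.write(1, "Laptop")
node.write(2, "TV")       # second write fills the Mem-table and triggers a flush
node.write(3, "Router")   # stays in the Mem-table for now
print(len(node.sstables), node.memtable)
```

In this simplified model every write lands in the commit log before anything else, so the Mem-table could be rebuilt after a crash by replaying the log.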
Cassandra’s data model, too, is entirely different from that of a typical relational
database. It is often described as a column store or column-oriented
NoSQL database. A keyspace holds one or more column families, similar to
tables in an RDBMS. Each table (column family) is a collection of rows,
each of which stores columns in an ordered manner. The column, therefore, is
the basic unit of data in Cassandra. Each column is characterized by its
name, value, and timestamp.
The difference between a SQL table and Cassandra’s table is that the latter
stores its rows sparsely. Although a CQL table declares its columns in the
CREATE TABLE statement, a row stores only the columns that were actually
written to it. As a result, rows in a Cassandra table may hold different sets
of columns. (Figure 12.2)

In [None]:
12.2 Installation
The latest version of Cassandra is available for download at
http://cassandra.apache.org/download/. Community distributions of
Cassandra (DDC) can be found at
https://academy.datastax.com/planet-cassandra/cassandra. Code examples in
this chapter are tested on the DataStax distribution installed on Windows OS.
Just as any relational database uses SQL for performing operations on data
in tables, Cassandra has its own query language, CQL, which stands for
Cassandra Query Language. The DataStax distribution comes with a
useful front-end IDE for CQL. All operations such as creating a keyspace and
table, running different queries, and so on can be done either visually or
by using text queries. The following diagram shows a view of the DataStax
DevCenter IDE. (Figure 12.3)

In [None]:
12.3 CQL Shell
Cassandra installation also provides a shell inside which you can execute
CQL queries. It is similar to MySQL console, SQLite console, or Oracle’s
SQL Plus terminal. (Figure 12.4)
Figure 12.4 CQL Shell
We shall first learn to perform basic CRUD operations with Cassandra from
inside CQLSH and then use Python API for the purpose.

In [None]:
12.4 Create Keyspace
As mentioned above, Cassandra Query Language (CQL) is the primary tool
for communication with a Cassandra database. Its syntax is strikingly similar
to SQL. The first step is to create a keyspace.
A keyspace is a container of column families or tables. CQL provides the
CREATE KEYSPACE statement to do exactly that: create a new
keyspace. The statement defines its name and replication strategy.

In [None]:
Example 12.1
CREATE KEYSPACE name WITH replication = {options}
The replication clause is mandatory and is characterized by the class and
replication_factor attributes. The ‘class’ decides the replication strategy to
be used for the keyspace. The simplest choice is SimpleStrategy, which
spreads replicas across the entire cluster without regard to data centers.
Another value of class is NetworkTopologyStrategy. It is a production-ready
strategy with the help of which the replication factor can be set
independently for each data center.
The replication_factor attribute defines the number of replicas kept of each row.
A value of 3 is commonly used, so that data availability
is reasonably high.
The following statement creates ‘MyKeySpace’ with ‘SimpleStrategy’ and
a replication_factor of 3.
Note that the name of a keyspace is case-insensitive unless given in
double quotes. CQL provides the USE keyword to set a certain keyspace as current.
(Similar to the MySQL ‘use’ statement, isn’t it?) To display the list of keyspaces in
the current cluster, use the DESCRIBE KEYSPACES command.
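As a minimal sketch, the statements discussed in this section can be composed as plain strings in Python and then pasted into the CQL shell. The helper `create_keyspace_cql` is hypothetical (it is not part of any Cassandra library), and the keyspace name `multi_dc` and the data center names `dc1` and `dc2` are placeholders invented for this example.

```python
def create_keyspace_cql(name, replication):
    """Build a CREATE KEYSPACE statement (hypothetical helper, not a driver API)."""
    options = ", ".join("{!r}: {!r}".format(k, v) for k, v in replication.items())
    return "CREATE KEYSPACE {} WITH replication = {{{}}};".format(name, options)


# SimpleStrategy with three replicas, as described in the text.
simple = create_keyspace_cql(
    "MyKeySpace", {"class": "SimpleStrategy", "replication_factor": 3})

# NetworkTopologyStrategy sets a replication factor per data center.
# 'dc1' and 'dc2' are placeholder data center names.
multi_dc = create_keyspace_cql(
    "multi_dc", {"class": "NetworkTopologyStrategy", "dc1": 3, "dc2": 2})

for cql in (simple, multi_dc, "USE mykeyspace;", "DESCRIBE KEYSPACES;"):
    print(cql)
```

DESCRIBE is a command of the CQL shell rather than a CQL statement, so it is meant to be typed into cqlsh.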

In [None]:
Create New Table
As mentioned earlier, one or more column families or tables may be present
in a keyspace. The CREATE TABLE command in CQL creates a new table
in the current keyspace. Remember the same command was used in SQL?
The general syntax is, as follows:
Example 12.2
create table if not exists table_name
(
col1_definition,
col2_definition,
..
..
)
A column definition contains its name and data type, optionally setting it as
the primary key. The primary key can also be set after the list of columns
has been defined. The ‘if not exists’ clause is not mandatory but is
recommended to avoid an error if a table of the given name already exists.
The following statement creates the ‘Products’ table in mykeyspace.
The following definition is equivalent:
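The two equivalent forms of the definition can be sketched as follows. The statements are composed as Python strings so that they can be printed for the CQL shell or later passed to the driver. The column types (`int`, `text`) are assumptions inferred from the sample data used in this chapter.

```python
# Column names come from this chapter; the types are assumed.
columns = [("productID", "int"), ("name", "text"), ("price", "int")]

# Form 1: the primary key is declared inline with its column.
inline = "CREATE TABLE IF NOT EXISTS products ({});".format(
    ", ".join(
        "{} {}{}".format(name, ctype, " PRIMARY KEY" if name == "productID" else "")
        for name, ctype in columns
    )
)

# Form 2: the primary key is declared after the list of columns.
trailing = "CREATE TABLE IF NOT EXISTS products ({}, PRIMARY KEY (productID));".format(
    ", ".join("{} {}".format(name, ctype) for name, ctype in columns)
)

print(inline)
print(trailing)
```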

In [None]:
Partition Key
The partition key determines on which node a certain row will be stored. If a
table has a single-column primary key (as in the above definition), it is treated as the partition
key as well. The hash value (token) of this partition key is used to determine the
node or replica on which a certain row is located. Each node is responsible
for a range of these hash values. For example, rows whose
partition key hashes to a value between 1 and 100 are stored on node A, between 101 and 200
on node B, and so on.
The primary key may comprise more than one column. In that case, the
first column name acts as the partition key and subsequent columns are
clustering keys. Let us change the definition of the Products table slightly as
follows:
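A minimal sketch of such a definition is given below, written as a Python string. Only the column names are taken from this chapter; the column types are assumptions. The short parsing step at the end simply illustrates the rule stated above: the first name in the PRIMARY KEY clause is the partition key and the rest are clustering keys.

```python
# Hypothetical column types; only the column names come from this chapter.
create_products = """
CREATE TABLE IF NOT EXISTS products (
    productID int,
    manufacturer text,
    name text,
    price int,
    PRIMARY KEY (manufacturer, productID)
);
"""

# The first name inside PRIMARY KEY (...) is the partition key;
# the remaining names are clustering keys.
inside = create_products.split("PRIMARY KEY (")[1].split(")")[0]
partition_key, *clustering = [col.strip() for col in inside.split(",")]
print(partition_key, clustering)
```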

In [None]:
In this case, the ‘manufacturer’ column acts as the partition key and
‘productID’ as a clustering key. As a result, all products from the same
manufacturer will reside on the same node. Hence a query to search for
products from a certain manufacturer will return results faster.

In [None]:
12.5 Inserting Rows
The INSERT statement in CQL is very similar to the one in SQL. However, the
column list before the ‘VALUES’ clause is not optional, as it is in
SQL. That is because, in Cassandra, a row need not supply a value for every
column of the table.
Issue the INSERT statement multiple times to populate the ‘products’
table with the sample data given in chapter 9. You can also import data from a
CSV file using the COPY command, as follows:
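The sketch below illustrates both approaches from Python: it composes one INSERT statement per row, and it writes the same rows to a CSV file together with a matching COPY command for the CQL shell. The three sample rows, the file name `products.csv`, and its location in the system's temporary directory are assumptions made for this example.

```python
import csv
import os
import tempfile

# A few sample rows (productID, name, price).
products = [(1, "Laptop", 25000), (2, "TV", 40000), (3, "Router", 2000)]

# One INSERT per row, with the column list spelled out explicitly.
# String formatting is acceptable here only because the values are fixed
# sample data; from Python, parameterized queries are preferable.
inserts = [
    "INSERT INTO products (productID, name, price) VALUES ({}, '{}', {});".format(*row)
    for row in products
]
print(inserts[0])

# Alternatively, write the rows to a CSV file for the COPY command of cqlsh.
csv_path = os.path.join(tempfile.gettempdir(), "products.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["productID", "name", "price"])  # header row
    writer.writerows(products)

copy_command = (
    "COPY products (productID, name, price) FROM '{}' WITH HEADER = TRUE;".format(csv_path)
)
print(copy_command)
```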

In [None]:
12.6 Querying Cassandra Table
Predictably, CQL also has a SELECT statement to fetch data from a Cassandra
table. The easiest usage is employing ‘*’ to fetch data from all columns in a
table.
Conventional comparison operators are allowed in the filter criteria specified
with the WHERE clause. The following statement returns product names
with a price greater than 10000.

In [None]:
Use of ALLOW FILTERING is necessary here. By default, CQL only
allows SELECT queries in which every record read is returned in the result set.
Such queries have predictable performance. The ALLOW FILTERING
option explicitly permits (some) queries that require filtering, at the cost
of less predictable performance. If the
filter criteria involve partition key columns, only the = and IN operators are
allowed.
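As a sketch of how such a filtered query might be issued from Python, the function below passes the threshold as a `%s` parameter to `execute()`. It assumes `session` is a driver session (or any object with a compatible `execute()` method). `FakeSession` is a hypothetical stand-in with made-up behaviour so that the example runs without a cluster.

```python
def expensive_products(session, min_price):
    """Return the names of products priced above min_price.

    `session` is assumed to be a cassandra.cluster.Session, or any object
    offering a compatible execute(query, parameters) method.
    """
    query = "SELECT name FROM products WHERE price > %s ALLOW FILTERING"
    rows = session.execute(query, (min_price,))
    return [row[0] for row in rows]


class FakeSession:
    """Stand-in for a real session so the sketch runs without a cluster."""

    data = [("Laptop", 25000), ("Router", 2000), ("TV", 40000)]

    def execute(self, query, params):
        # Mimic the server-side filter on the price column.
        return [(name,) for name, price in self.data if price > params[0]]


print(expensive_products(FakeSession(), 10000))
```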
UPDATE and DELETE statements of CQL are used as in SQL. However,
both must have filter criteria based on the primary key. (Note the use of ‘--’
as a commenting symbol)
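The following sketch shows one way UPDATE and DELETE might be issued from Python against the ‘products’ table keyed on productID alone. Both statements identify the row by its primary key. The function names and the `RecordingSession` class are hypothetical; the latter only records the calls so that the example runs without a cluster.

```python
def update_price(session, product_id, new_price):
    # The WHERE clause must identify the row by its primary key.
    session.execute(
        "UPDATE products SET price = %s WHERE productID = %s",
        (new_price, product_id),
    )


def delete_product(session, product_id):
    session.execute("DELETE FROM products WHERE productID = %s", (product_id,))


class RecordingSession:
    """Hypothetical stand-in that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, query, params=None):
        self.calls.append((query, params))


session = RecordingSession()
update_price(session, 3, 2500)
delete_product(session, 10)
for query, params in session.calls:
    print(query, params)
```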

In [None]:
12.7 Table with Compound Partition Key
In the above example, the products table had been defined with a
single-column primary key, which also serves as the partition key. Rows in such a table are stored on different
nodes depending upon the hash value of the primary key. However, data is
stored across the cluster using a slightly different method when the table has
a compound primary key. The following table’s primary key consists of
two columns.

In [None]:
For this table, ‘manufacturer’ is the partition key and ‘productID’ behaves as
a clustering key. As a result, products with the same ‘manufacturer’ are stored on
the same node. Let us understand this with the help of the following example.
The table contains the following data:
Example 12.3
cqlsh:mykeyspace> select * from products;
 productid | manufacturer | name       | price
-----------+--------------+------------+-------
         5 |      'Epson' |  'Printer' |  9000
        10 |      'IBall' | 'Keyboard' |  1000
         1 |       'Acer' |   'Laptop' | 25000
         8 |       'Acer' |      'Tab' | 10000
         2 |    'Samsung' |       'TV' | 40000
         4 |      'Epson' |  'Scanner' |  5000
         7 |      'IBall' |    'Mouse' |   500
         6 |    'Samsung' |   'Mobile' | 15000
         9 |    'Samsung' |       'AC' | 35000
         3 |      'IBall' |   'Router' |  2000
(10 rows)
Rows in the above table will be stored among nodes such that products from
the same manufacturer are together. (Figure 12.5)
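The grouping effect can be illustrated with a toy placement function. This is a deliberate simplification and an assumption of this sketch: it uses an MD5 hash and a modulo over three made-up node names, whereas a real cluster assigns token ranges to nodes and uses the Murmur3 partitioner by default. What carries over is that only the partition key decides where a row lives.

```python
import hashlib
from collections import defaultdict

NODES = ["node-A", "node-B", "node-C"]   # hypothetical node names


def node_for(partition_key):
    """Map a partition key to a node via a hash (toy stand-in for token ranges)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# (productID, manufacturer, name) for a few rows of the table above.
rows = [(5, "Epson", "Printer"), (10, "IBall", "Keyboard"), (1, "Acer", "Laptop"),
        (8, "Acer", "Tab"), (2, "Samsung", "TV"), (4, "Epson", "Scanner")]

placement = defaultdict(list)
for product_id, manufacturer, name in rows:
    # Only the partition key (manufacturer) decides the node.
    placement[node_for(manufacturer)].append((manufacturer, product_id, name))

for node, items in sorted(placement.items()):
    print(node, sorted(items))
```

Whatever node a manufacturer hashes to, all of its products end up in the same list, ordered here by productID much as a clustering key orders rows within a partition.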

In [None]:
12.8 Python Cassandra Driver
Cassandra’s Python driver is distributed as the cassandra-driver package. It works
with CQL version 3 and uses Cassandra’s native protocol.
This Python driver also has an ORM API in addition to the core API, which is
similar in many ways to DB-API.
To install this module, use the pip installer as always.
Verify successful installation with the following commands:
Example 12.4
>>> import cassandra
>>> print (cassandra.__version__)
3.17.0
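The same check can be written as a small script that does not fail when the driver is missing. This is only a sketch; it relies on the fact that the cassandra-driver package installs a top-level module named `cassandra`, and the installation hint it prints assumes pip is available on the path.

```python
import importlib.util

# cassandra-driver installs a top-level package named "cassandra".
spec = importlib.util.find_spec("cassandra")
if spec is None:
    print("cassandra-driver is not installed; try: pip install cassandra-driver")
else:
    import cassandra
    print("cassandra-driver version:", cassandra.__version__)
```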


In [None]:
To execute CQL queries, we have to set up a Cluster object, first.
Example 12.5
>>> from cassandra.cluster import Cluster
>>> clstr=Cluster()
Next up, we need to start a session by establishing a connection with our
keyspace in the cluster.
Example 12.6
>>> session=clstr.connect('mykeyspace')
The ubiquitous execute() method of the session object is used to perform all
CQL operations. For instance, a basic SELECT query over the
‘products’ table in ‘mykeyspace’ returns a result set object. Using a typical
for loop, all rows can be traversed.
Example 12.7
#cassandra-select.py
from cassandra.cluster import Cluster
clstr=Cluster()
session=clstr.connect('mykeyspace')
rows=session.execute("select * from products;")
for row in rows:
    print ('Manufacturer: {} ProductID:{} Name:{} price:{}'.format(row[1],row[0], row[2], row[3]))
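Indexing by position depends on the column order returned by `select *`. By default the driver returns each row as a named tuple, so columns can also be read by name. The sketch below uses `collections.namedtuple` as a stand-in for the rows a real query would return; the two sample rows and the `describe` helper are assumptions of this example.

```python
from collections import namedtuple

# Stand-in for the rows a real session.execute() call would return.
# Unquoted CQL column names are lower-cased, hence 'productid'.
Row = namedtuple("Row", ["productid", "manufacturer", "name", "price"])
rows = [Row(5, "Epson", "Printer", 9000), Row(1, "Acer", "Laptop", 25000)]


def describe(row):
    # Attribute access avoids depending on the column order of SELECT *.
    return "Manufacturer: {} ProductID: {} Name: {} Price: {}".format(
        row.manufacturer, row.productid, row.name, row.price)


for row in rows:
    print(describe(row))
```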

In [None]:
12.9 Parameterized Queries
The cassandra.query submodule defines following Statement classes:
SimpleStatement: A simple, unprepared CQL query contained in a query
string. For example:
Example 12.8
from cassandra.query import SimpleStatement
stmt=SimpleStatement("select * from products;")
rows=session.execute(stmt)
BatchStatement: A batch combines multiple DML operations (such as
INSERT, UPDATE, and DELETE) and executes them at once to achieve
atomicity. For the following example, first create a ‘customers’ table in the
current keyspace.
Customer data is provided in the form of a list of tuples. An individual INSERT
query is populated with each tuple and added to a BatchStatement. The batch is
then executed at once.
Example 12.9
#cassandra-batch.py
from cassandra.cluster import Cluster
clstr=Cluster()
session=clstr.connect('mykeyspace')
custlist=[(1,'Ravikumar','27AAJPL7103N1ZF'),
(2,'Patel','24ASDFG1234N1ZN'),
(3,'Nitin','27AABBC7895N1ZT'),
(4,'Nair','32MMAF8963N1ZK'),
(5,'Shah','24BADEF2002N1ZB'),
(6,'Khurana','07KABCS1002N1ZV'),
(7,'Irfan','05IIAAV5103N1ZA'),
(8,'Kiran','12PPSDF22431ZC'),
(9,'Divya','15ABCDE1101N1ZA'),
(10,'John','29AAEEC4258E1ZK')]
from cassandra.query import SimpleStatement, BatchStatement
batch=BatchStatement()
for cst in custlist:
    batch.add(SimpleStatement("INSERT INTO customers (custID,name,GSTIN) VALUES (%s, %s, %s)"),
              (cst[0], cst[1],cst[2]))
session.execute(batch)
Run above code and then check rows in ‘customers’ table in CQL shell.

In [None]:
PreparedStatement: Prepared statement contains a query string that is
parsed by Cassandra and then saved for later use. Subsequently, it only needs
to send the values of parameters to bind. This reduces network traffic and
CPU utilization because Cassandra does not have to re-parse the query each
time. The Session.prepare() method returns a PreparedStatement instance.
Example 12.10
#cassandra-prepare.py
from cassandra.cluster import Cluster
from cassandra.query import PreparedStatement
clstr=Cluster()
session=clstr.connect('mykeyspace')
stmt=session.prepare("INSERT INTO customers (custID, name,GSTIN) VALUES (?,?,?)")
boundstmt=stmt.bind([11,'HarishKumar', '12PQRDF22431ZN'])
session.execute(boundstmt)
Each time, the prepared statement can be executed by binding it with a new
set of parameters. Note that the PreparedStatement uses ‘?’ as a placeholder
and not ‘%s’ as in the BatchStatement example.
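The benefit of preparing shows up when the same statement is executed repeatedly. The sketch below prepares the INSERT once and then executes it for each tuple, passing the values directly to `execute()`. The function name, the `FakeSession` class, and the customer values (including the placeholder GSTIN strings) are hypothetical and exist only so that the example runs without a cluster.

```python
def insert_customers(session, customers):
    """Prepare once, then execute with a new set of values for each customer."""
    stmt = session.prepare(
        "INSERT INTO customers (custID, name, GSTIN) VALUES (?, ?, ?)")
    for customer in customers:
        session.execute(stmt, customer)   # values are bound to the ? markers
    return len(customers)


class FakeSession:
    """Hypothetical stand-in that counts prepare() calls and records executions."""

    def __init__(self):
        self.prepared = 0
        self.executed = []

    def prepare(self, query):
        self.prepared += 1
        return query

    def execute(self, statement, params):
        self.executed.append(params)


session = FakeSession()
# Placeholder values, not real customer data.
count = insert_customers(session, [(12, "CustomerA", "GSTIN-PLACEHOLDER-1"),
                                   (13, "CustomerB", "GSTIN-PLACEHOLDER-2")])
print(count, session.prepared, len(session.executed))
```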

In [None]:
12.10 User-defined Types
While executing the queries, Python data types are implicitly converted to
corresponding CQL types as per the following table: (Table 12.1)
Table 12.1 Data types
Python Type              CQL Type
None                     NULL
bool                     boolean
float                    float, double
int, long                int, bigint, varint, smallint, tinyint, counter
decimal.Decimal          decimal
str, unicode             ascii, varchar, text
buffer, bytearray        blob
date                     date
datetime                 timestamp
time                     time
list, tuple, generator   list
set, frozenset           set
dict, OrderedDict        map
uuid.UUID                timeuuid, uuid
In addition to the above built-in CQL data types, a Cassandra table may have a
column of a user-defined type to which an object of a Python class can be
mapped.
Cassandra provides a CREATE TYPE statement to define a new
user-defined type, which can be used as the type of a column in a table defined with the
CREATE TABLE statement.

In [None]:
In the script given below (cassandra-udt.py), we define a Cassandra
user-defined type named ‘contact’ and use it as the data type of the ‘contact’
column in the ‘users’ table. The register_user_type() method of the cluster object
helps us to map the Python class ‘ContactInfo’ to the user-defined type.
Example 12.11
#cassandra-udt.py
from cassandra.cluster import Cluster
cluster = Cluster(protocol_version=3)
session = cluster.connect()
session.set_keyspace('mykeyspace')
session.execute("CREATE TYPE contact (email text, phone text)")
session.execute("CREATE TABLE users (userid int PRIMARY KEY, name text, contact frozen<contact>)")
class ContactInfo:
    def __init__(self, email, phone):
        self.email = email
        self.phone = phone
cluster.register_user_type('mykeyspace', 'contact', ContactInfo)
# insert a row using an instance of ContactInfo
session.execute("INSERT INTO users (userid, name, contact) VALUES (%s, %s, %s)",
                (1, 'Admin', ContactInfo("admin@testserver.com", '9988776655')))
The following display of the CQL shell confirms the insertion operation of the
above script.
In this chapter, we learnt about the basic features of the Cassandra database,
and importantly how to perform read/write operations on it with Python