# Not Only SQL Databases
NoSQL ("Not Only SQL") databases offer an alternative to relational databases by storing and querying data outside the traditional tabular structure, often as JSON documents. Their flexible, often schema-less design makes it easier to scale out and to manage large, semi-structured or unstructured datasets. Most NoSQL databases are also distributed, which provides data availability and reliability across multiple servers and supports the speed and scalability that modern web applications demand in cloud, big data, and mobile environments. The choice between relational and non-relational databases depends on the specific use case.

Here we discuss three types of NoSQL databases:

1. Document databases
2. Columnar databases (including wide-column stores)
3. Graph databases


In [None]:
%%sh
pip install "requests" "pyarrow" "neo4j"

## 1. Document Databases

Document databases store data as documents, typically in JSON, XML, or BSON formats, making them ideal for managing semi-structured data. This structure keeps related data together, reduces the translation needed before an application can use it, and offers flexibility because schemas don't need to match across documents. However, transactions that span multiple documents are harder to keep consistent than in a relational database, so they need careful design.
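
To make the flexible-schema point concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for documents. The field names and values are hypothetical and are not taken from any real collection.

```python
import json

# Two hypothetical documents in the same collection; their fields differ.
listings = [
    {"_id": 1, "name": "Loft", "price": 120, "amenities": ["wifi", "kitchen"]},
    {"_id": 2, "name": "Cabin", "price": 80, "host": {"name": "Ana", "superhost": True}},
]

# Serialise to JSON and back: nested fields survive the round trip unchanged.
restored = json.loads(json.dumps(listings))

# Code that reads documents must tolerate missing fields, hence .get().
hosts = [doc.get("host", {}).get("name") for doc in restored]
print(hosts)  # [None, 'Ana']
```

The flexibility moves some responsibility to the application: nothing stops one document from omitting a field that another one has.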

Popular for content management systems and user profiles, MongoDB is a notable example. Here we look at how we can connect and query documents from a MongoDB database using a data API.



In [None]:
import requests
import json
url = 'https://ap-southeast-1.aws.data.mongodb-api.com/app/data-gldcs/endpoint/data/v1/action/find'
api_key = 'sdP21rq0OCSiHILoR0rELTxFi7aCLB6ObIJHACEUE452kqoOXYcESPacFThmsOOV'

### 1.1 Create the API body

The body of the API request includes details about the database, collection, filter clauses, and field projections.

In [None]:
data = {
    "dataSource": "Cluster0",
    "database": "sample_airbnb",
    "collection": "listingsAndReviews",
    "limit": 5555,
    "filter": {
        "price": {"$gt": 500},
        "number_of_reviews": {"$lt": 10}
    },
    "projection": {
        "name": True,
        "reviews": True
    }
}
payload = json.dumps(data)
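
As a rough illustration of what the `filter` and `projection` entries ask the server to do, the sketch below applies the same two ideas to a small in-memory list. The sample documents and the helper functions `matches` and `project` are hypothetical; a real MongoDB server supports many more operators and, by default, also returns the `_id` field.

```python
# Hypothetical sample documents shaped like the fields used in the request body.
docs = [
    {"name": "Villa", "price": 900, "number_of_reviews": 3, "reviews": ["great"]},
    {"name": "Studio", "price": 90, "number_of_reviews": 2, "reviews": []},
    {"name": "Penthouse", "price": 1500, "number_of_reviews": 40, "reviews": ["ok"]},
]

def matches(doc, flt):
    """Return True if doc satisfies every {field: {"$gt"/"$lt": value}} clause."""
    ops = {"$gt": lambda a, b: a > b, "$lt": lambda a, b: a < b}
    return all(
        field in doc and ops[op](doc[field], bound)
        for field, clause in flt.items()
        for op, bound in clause.items()
    )

def project(doc, projection):
    """Keep only the fields whose projection flag is truthy."""
    return {key: value for key, value in doc.items() if projection.get(key)}

flt = {"price": {"$gt": 500}, "number_of_reviews": {"$lt": 10}}
projection = {"name": True, "reviews": True}

selected = [project(doc, projection) for doc in docs if matches(doc, flt)]
print(selected)  # [{'name': 'Villa', 'reviews': ['great']}]
```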

In [None]:
payload

In [None]:
headers = {
    'api-key': api_key,
    'Content-Type': 'application/json',
    'Access-Control-Request-Headers': '*'
}

### 1.2 API Response

Status 200 means OK, that is, a successful API response.
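
For orientation, status codes fall into broad classes, and `raise_for_status()` raises an error for the 4xx and 5xx classes. The sketch below uses only the standard library to label a few codes; the helper name `describe` is hypothetical.

```python
from http import HTTPStatus

def describe(status_code):
    """Map a numeric HTTP status code to its standard phrase and broad class."""
    status = HTTPStatus(status_code)
    if 200 <= status_code < 300:
        kind = "success"
    elif 400 <= status_code < 500:
        kind = "client error"
    elif status_code >= 500:
        kind = "server error"
    else:
        kind = "informational or redirect"
    return f"{status_code} {status.phrase}: {kind}"

for code in (200, 401, 500):
    print(describe(code))
```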

In [None]:
response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()
print(response)

### 1.3 JSON Output

JSON is one of the most common data formats for APIs. In Python we can dump the JSON response to a text file and save it to disk.

In [None]:
result = response.json()
with open('results.json','w') as f:
  json.dump(result,f)

In [None]:
print(json.dumps(result, indent=4, sort_keys=True))
documents = result["documents"]
print("There are",len(documents), "properties")

## 2. Columnar Databases
Columnar databases store data in columns, allowing users to access specific columns without allocating memory for irrelevant data. They address limitations of key-value and document stores, but their complexity makes them less suitable for new teams and projects. Examples include Apache HBase, built on the Hadoop Distributed File System (HDFS) for storing sparse data sets in big data applications, and Apache Cassandra, designed for managing large volumes of data across multiple servers and data centers and used in social networking and real-time analytics.
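
The difference between row-oriented and column-oriented layouts can be sketched in plain Python. The records below are hypothetical, and real columnar engines add typing, compression, and on-disk layout on top of this idea.

```python
# Hypothetical records in a row-oriented layout: one dict per record.
rows = [
    {"title": "A", "year": 2001, "rating": 8.1},
    {"title": "B", "year": 2010, "rating": 6.4},
    {"title": "C", "year": 2015, "rating": 9.0},
]

# Column-oriented layout: one list per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one field reads a single list, not every full record.
average_rating = sum(columns["rating"]) / len(columns["rating"])
print(columns["rating"])         # [8.1, 6.4, 9.0]
print(round(average_rating, 2))  # 7.83
```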

### 2.1 Apache Cassandra


Cassandra is a NoSQL distributed database. By design, NoSQL databases are lightweight, open-source, non-relational, and largely distributed. Counted among their strengths are horizontal scalability, distributed architectures, and a flexible approach to schema definition.

NoSQL databases enable rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types. That’s become more important in recent years, with the advent of Big Data and the need to rapidly scale databases in the cloud. Cassandra is among the NoSQL databases that have addressed the constraints of previous data management technologies, such as SQL databases. Source: https://cassandra.apache.org/_/cassandra-basics.html

In [None]:
!pip install --upgrade astrapy

In [None]:
from astrapy import DataAPIClient

# Initialize the client
client = DataAPIClient("AstraCS:BsxEcxZwtiLMEqiOAQChJldz:6df6b63168171b9ce1223596948ee1e59d04f609e1edf4b52ea0d4a140e42c93")
db = client.get_database_by_api_endpoint(
  "https://7554c06f-dcf1-412d-81a2-c5e7075f5497-us-east-2.apps.astra.datastax.com"
)

print(f"Connected to Astra DB: {db.list_collection_names()}")

In [None]:
collection = db.get_collection("movie_reviews")
print(collection.find_one())

In [None]:
reviews = collection.find({"criticname": "Todd Jorgenson"})
for review in reviews:
    print(review['title'])


In [None]:
rotten_reviews = collection.find({
    "reviewstate": "rotten"
})

for review in rotten_reviews:
    print(review)


### 2.2 Apache Arrow
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead. Source: https://arrow.apache.org/overview/

Apache Arrow is an in-memory format that can load data lazily as it is iterated over, which keeps latency low, and its table layout supports simple filters on columns. It can also work with GPU memory for data processing. Because data stays in the Arrow format, it avoids serialisation and deserialisation overhead, in contrast to an in-memory store such as Redis, which holds values as bytes that the application has to encode and decode.

In [None]:
import pyarrow as pa
import pyarrow.json

In [None]:
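# Read in 10 MiB blocks so that long JSON lines fit into a single block
block_size_10MB = 10 << 20
read_options = pyarrow.json.ReadOptions(
    block_size=block_size_10MB
)
# Load the file saved in section 1.3 from the current working directory
reviews = pyarrow.json.read_json('results.json', read_options=read_options)
print(reviews)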

### 2.3 Parquet Files
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.

In [None]:
import pyarrow.parquet as pq
pq.write_table(reviews,'results.parquet')

In [None]:
reviews = pq.read_table('results.parquet')
print(reviews)

## 3. Graph Databases
Graph databases store data as nodes, edges, and properties, forming a knowledge graph where any object, place, or person can be a node, and edges define relationships between nodes. For example, a node could represent a client like IBM, and an edge could indicate the customer relationship between IBM and Ogilvy agency. Graph databases are used for managing connections within the graph network. Neo4j is a prominent graph-based database service, offering both open-source and licensed versions with additional features like online backup and high availability extensions.
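
The node, edge, and property model can be sketched with plain Python structures before involving a database. The `acme` node, the property values, and the helper `clients_of` are hypothetical and only extend the IBM and Ogilvy example for illustration.

```python
# Nodes keyed by id, each with a label and properties.
nodes = {
    "ibm": {"label": "Company", "name": "IBM"},
    "ogilvy": {"label": "Agency", "name": "Ogilvy"},
    "acme": {"label": "Company", "name": "Acme"},  # hypothetical extra node
}

# Directed edges: (source, relationship type, target, edge properties).
edges = [
    ("ibm", "CLIENT_OF", "ogilvy", {"status": "active"}),
    ("acme", "CLIENT_OF", "ogilvy", {"status": "active"}),
]

def clients_of(agency_id):
    """Follow CLIENT_OF edges that point at the given agency node."""
    return [
        nodes[source]["name"]
        for source, relation, target, _properties in edges
        if relation == "CLIENT_OF" and target == agency_id
    ]

print(clients_of("ogilvy"))  # ['IBM', 'Acme']
```

A graph database stores and indexes these relationships directly, so a traversal like this does not require joining tables.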

In [None]:
pip install neo4j

In [None]:
from neo4j import GraphDatabase
URI = "neo4j+s://032d4418.databases.neo4j.io"
AUTH = ("neo4j", "en0Kwb2l6tfypPGj4U6ASlvNSEtsuSxT_13ZiyCwNUk")

# Keep the driver open for the cells below; a `with` block would close it on exit
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()

In [None]:
records, summary, keys = driver.execute_query(
    "MATCH (m:movies) WHERE m.imdbRating > 8 RETURN m.title as title",
    database_="neo4j"
)

# Each record is keyed by the names in the RETURN clause
for record in records:
    print(record["title"])

print("The query `{query}` returned {records_count} records in {time} ms.".format(
    query=summary.query, records_count=len(records),
    time=summary.result_available_after,
))

In [None]:
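import pandas as pd
from pandas import DataFrame

# Record.data() converts each record to a dict, so columns take the query's names
DataFrame([record.data() for record in records])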


In [None]:
records, summary, keys = driver.execute_query(
    "MATCH (p:person)-->(m:movies) WHERE m.imdbRating > 8 RETURN p.name",
    database_="neo4j"
)

# Without an alias, the key is the returned expression itself: "p.name"
for record in records:
    print(record["p.name"])

print("The query `{query}` returned {records_count} records in {time} ms.".format(
    query=summary.query, records_count=len(records),
    time=summary.result_available_after,
))
DataFrame([record.data() for record in records])