# Unified Data Analysis: SQL & NoSQL on a Single Database with Kai

<img src=https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/notebooks/unified-data-analysis-sql-nosql-kai/banking_analytics.png width="100%">

### What you will learn in this notebook:

In this notebook we ingest data from different sources such as MySQL, MongoDB, and S3, and analyze multimodal data (tabular and JSON) efficiently using both NoSQL and SQL.

### Highlights
1. Set up CDC from MongoDB and MySQL in a few easy steps. Replicate data in real time so that analytics run on up-to-date information, without complex tooling for data movement.

2. Analyze data using both NoSQL and relational approaches, depending on your specific needs. Developers and data analysts who are familiar with different query languages, such as the MongoDB query language and SQL, can work together on the same database. Perform familiar SQL queries on your NoSQL data!

Ready to unlock real-time analytics and unified data access? Let's start!

In [696]:
pip install pymongo prettytable matplotlib --quiet

Note: you may need to restart the kernel to use updated packages.


### Create database for importing data from different sources 

This example gets banking data from three different sources: ATM locations from S3, transaction data from MySQL, and user profile details from MongoDB. It joins data from these sources to generate rich insights about transactional activity across user profiles and locations around the globe.

In [None]:
%%sql
DROP DATABASE IF EXISTS BankingAnalytics;
CREATE DATABASE BankingAnalytics;

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Action Required</b></p>
        <p> Make sure to select the 'BankingAnalytics' database from the drop-down menu at the top of this notebook. It updates the <tt>connection_url</tt> to connect to that database.</p>
    </div>
</div>

<img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/notebooks/unified-data-analysis-sql-nosql-kai/selectdb.png" style="width: 500px; border: 1px solid darkorchid"> 

## Set up CDC from MySQL

### SingleStore lets you ingest data from MySQL using pipelines

In this step, we create a link to the MySQL source (`mysqllink`), then create pipelines that infer the schema and start CDC.

In [None]:
%%sql
create link mysqllink as mysql config
'{"database.hostname": "3.141.19.255",
"database.exclude.list": "mysql,performance_schema", "table.include.list": "DomainAnalytics.transactions","database.port": 3306, "database.ssl.mode":"required"}'
credentials '{"database.password": "Password@123", "database.user": "repl_user"}';
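The link definition above embeds the source host and credentials directly in the statement. As a minimal sketch, the same statement could be assembled in Python from environment variables. The variable names `MYSQL_CDC_HOST`, `MYSQL_CDC_USER`, and `MYSQL_CDC_PASSWORD`, the placeholder fallbacks, and the helper `build_mysql_link_sql` are all hypothetical; only the config keys are copied from the cell above.

```python
import json
import os


def build_mysql_link_sql(host, user, password,
                         table="DomainAnalytics.transactions", port=3306):
    """Return a CREATE LINK statement mirroring the cell above (hypothetical helper)."""
    config = {
        "database.hostname": host,
        "database.exclude.list": "mysql,performance_schema",
        "table.include.list": table,
        "database.port": port,
        "database.ssl.mode": "required",
    }
    credentials = {"database.password": password, "database.user": user}

    def sql_literal(obj):
        text = json.dumps(obj)
        # Assumes the server treats backslash as an escape character inside
        # string literals, so backslashes and single quotes are both doubled.
        text = text.replace("\\", "\\\\").replace("'", "''")
        return "'" + text + "'"

    return (
        "CREATE LINK mysqllink AS MYSQL "
        f"CONFIG {sql_literal(config)} "
        f"CREDENTIALS {sql_literal(credentials)};"
    )


# Hypothetical environment variable names; the fallbacks are placeholders,
# not real endpoints or secrets.
statement = build_mysql_link_sql(
    host=os.environ.get("MYSQL_CDC_HOST", "mysql.example.com"),
    user=os.environ.get("MYSQL_CDC_USER", "repl_user"),
    password=os.environ.get("MYSQL_CDC_PASSWORD", "change-me"),
)
# Print only the part before the credentials so secrets stay out of cell output.
print(statement.split(" CREDENTIALS ")[0])
```

The resulting string could then be run through whichever SQL client the notebook already uses.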

In [None]:
%%sql
create tables as infer pipeline as load data link
mysqllink "*" format avro;

In [None]:
%%sql
START ALL PIPELINES;

### Migrate the data from S3 storage to SingleStore using Pipelines

We migrate the data from S3 using pipelines. The target tables must be created beforehand.

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS atm_locations (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    address VARCHAR(255),
    city VARCHAR(255),
    country VARCHAR(255),
    latitude DECIMAL(9, 6),
    longitude DECIMAL(9, 6)
);
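Because the pipeline loads into an existing table, each incoming record has to fit the column types declared above. The sketch below shows one way to sanity-check a sample row locally before starting the pipeline. It assumes the S3 objects are comma-separated files with fields in the same order as the table columns (the notebook does not state the file format); the sample row and the helper `parse_atm_row` are hypothetical.

```python
import csv
import io
from decimal import Decimal

# Column order copied from the CREATE TABLE statement above.
COLUMNS = ["id", "name", "address", "city", "country", "latitude", "longitude"]
SIX_PLACES = Decimal("0.000001")  # scale of DECIMAL(9, 6)


def parse_atm_row(line):
    """Parse one CSV line into a dict typed like the atm_locations table."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["id"] = int(row["id"])
    for key in ("latitude", "longitude"):
        value = Decimal(row[key]).quantize(SIX_PLACES)
        # DECIMAL(9, 6) leaves three digits before the decimal point.
        if abs(value) >= 1000:
            raise ValueError(f"{key} out of range for DECIMAL(9, 6): {value}")
        row[key] = value
    for key in ("name", "address", "city", "country"):
        if len(row[key]) > 255:
            raise ValueError(f"{key} exceeds VARCHAR(255)")
    return row


# Hypothetical sample row, not taken from the actual bucket.
sample = '1,"Central ATM","1 Example Street",Singapore,Singapore,1.290270,103.851959'
print(parse_atm_row(sample))
```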

In [None]:
%%sql
CREATE PIPELINE atmlocations AS
LOAD DATA S3 's3://ocbfinalpoc1/data'
CONFIG '{"region":"ap-southeast-1"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE atm_locations;

In [None]:
%%sql
start pipeline atmlocations

### Set up CDC from MongoDB to SingleStore

We will create pipelines to migrate the data from MongoDB to SingleStore.

Note that SingleStore can infer the schema from the MongoDB collections when it creates the tables.

In [None]:
%%sql
CREATE link mongo AS MONGODB
CONFIG '{"mongodb.hosts":"ac-t7n47to-shard-00-00.tfutgo0.mongodb.net:27017,ac-t7n47to-shard-00-01.tfutgo0.mongodb.net:27017,ac-t7n47to-shard-00-02.tfutgo0.mongodb.net:27017",
         "collection.include.list": "bank.*",
         "mongodb.ssl.enabled":"true",
         "mongodb.authsource":"admin",
         "mongodb.members.auto.discover": "true"}'
CREDENTIALS '{"mongodb.user":"forimport",
              "mongodb.password":"4Zfb0SKGCcDz5bBt"}';

In [None]:
%%sql
CREATE TABLES AS INFER PIPELINE AS LOAD DATA link mongo '*' FORMAT AVRO;

In [None]:
%%sql
show pipelines

In [None]:
%%sql
START ALL PIPELINES

### Check for records in tables

1. Data from MySQL

In [None]:
%%sql
select count(*) from transactions

In [None]:
%%sql
SELECT * FROM transactions WHERE transaction_type LIKE '%Deposit%' LIMIT 1;

2. Data from S3 

In [None]:
%%sql
select count(*) from atm_locations

In [None]:
%%sql
SELECT * FROM atm_locations LIMIT 1;

3. Data from MongoDB

In [None]:
%%sql
SELECT _id:>JSON, _more:>JSON FROM profile LIMIT 1;

In [None]:
%%sql

SELECT _id:>JSON, _more:>JSON FROM history LIMIT 1;

### Join tables from different sources using SQL queries 

SQL Query 1: View user details and their associated ATMs

In [None]:
%%sql
SELECT
    p._more::$full_name AS NameOfPerson,
    p._more::$email AS Email,
    a.id,
    a.name AS ATMName,
    a.city,
    a.country
FROM
    profile p,
    atm_locations a
WHERE
    p._more::$account_id = a.id limit 10;
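In the query above, `p._more::$account_id` pulls a single field out of the document stored in the `_more` column and compares it with the integer `a.id`. The following is a rough plain-Python analogy of that join condition, run on hypothetical sample rows. It assumes the extracted value is read as text and then compared numerically with the ATM id; the helper names are illustrative only.

```python
import json

# Hypothetical rows standing in for the profile and atm_locations tables.
profiles = [
    {"_more": json.dumps({"account_id": 1, "full_name": "Ada Example",
                          "email": "ada@example.com"})},
    {"_more": json.dumps({"account_id": 7, "full_name": "Bo Example",
                          "email": "bo@example.com"})},
]
atm_locations = [
    {"id": 1, "name": "Central ATM", "city": "Singapore", "country": "Singapore"},
    {"id": 2, "name": "Harbour ATM", "city": "Sydney", "country": "Australia"},
]


def extract_text(more, key):
    """Analogy for `_more::$key`: read one field of the document as text."""
    value = json.loads(more).get(key)
    return None if value is None else str(value)


def join_profiles_to_atms(profiles, atms, limit=10):
    """Inner join on account_id = id, mirroring the WHERE clause above."""
    atm_by_id = {atm["id"]: atm for atm in atms}
    rows = []
    for p in profiles:
        account_id = extract_text(p["_more"], "account_id")
        atm = atm_by_id.get(int(account_id)) if account_id is not None else None
        if atm is not None:  # profiles without a matching ATM are dropped
            rows.append({
                "NameOfPerson": extract_text(p["_more"], "full_name"),
                "Email": extract_text(p["_more"], "email"),
                "id": atm["id"],
                "ATMName": atm["name"],
                "city": atm["city"],
                "country": atm["country"],
            })
    return rows[:limit]


for row in join_profiles_to_atms(profiles, atm_locations):
    print(row)
```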

SQL Query 2: View user details, their associated ATMs, and transaction details

In [None]:
%%sql
SELECT
    p._more::$full_name AS NameOfPerson,
    p._more::$email AS Email,
    a.id,
    a.name AS ATMName,
    a.city,
    t.transaction_id,
    t.transaction_date,
    t.amount,
    t.transaction_type,
    t.description
FROM
    profile p
JOIN
    atm_locations a ON p._more::$account_id = a.id
LEFT JOIN
    transactions t ON p._more::$account_id = t.account_id limit 10;
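Query 2 combines an inner join (profiles to ATMs) with a `LEFT JOIN` to transactions, so a profile that matches an ATM is kept even when it has no transactions, with the transaction columns left as NULL. A minimal sketch of that behaviour on hypothetical in-memory rows:

```python
# Hypothetical rows; account_id is assumed to be already extracted from the profile documents.
profiles = [
    {"account_id": 1, "full_name": "Ada Example"},
    {"account_id": 2, "full_name": "Bo Example"},
    {"account_id": 9, "full_name": "Cy Example"},  # no ATM with id 9
]
atms = [{"id": 1, "name": "Central ATM"}, {"id": 2, "name": "Harbour ATM"}]
transactions = [
    {"transaction_id": 100, "account_id": 1, "amount": 50.0},
    {"transaction_id": 101, "account_id": 1, "amount": 20.0},
]


def profiles_atms_transactions(profiles, atms, transactions):
    """JOIN profiles to ATMs, then LEFT JOIN the result to transactions."""
    atm_by_id = {a["id"]: a for a in atms}
    tx_by_account = {}
    for t in transactions:
        tx_by_account.setdefault(t["account_id"], []).append(t)

    rows = []
    for p in profiles:
        atm = atm_by_id.get(p["account_id"])
        if atm is None:
            continue  # inner join: profiles without an ATM are dropped
        # Left join: a profile with no transactions still yields one row of None values.
        for t in tx_by_account.get(p["account_id"]) or [None]:
            rows.append({
                "NameOfPerson": p["full_name"],
                "ATMName": atm["name"],
                "transaction_id": t["transaction_id"] if t else None,
                "amount": t["amount"] if t else None,
            })
    return rows


for row in profiles_atms_transactions(profiles, atms, transactions):
    print(row)
```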


### Run queries in Mongo Query Language using Kai

In [None]:
from pymongo import MongoClient
import pprint
from prettytable import PrettyTable

client = MongoClient(connection_url_mongo)

# Get the profile collection

db = client['BankingAnalytics']

profile_coll = db['profile']

for profile in profile_coll.find().limit(1):
  pprint.pprint(profile)

In [None]:
pipeline = [
    {
        "$lookup": {
            "from": "profile",
            "localField": "account_id",
            "foreignField": "account_id",
            "as": "profile_data"
        }
    },
    {
        "$limit": 5
    },
    {
        "$group": {
            "_id": "$_id",
            "history_data": {"$first": "$$ROOT"},
            "profile_data": {"$first": {"$arrayElemAt": ["$profile_data", 0]}}
        }
    },
    {
        "$project": {
            "_id": "$history_data._id",
            "account_id": "$history_data.account_id",
            "history_data": "$history_data",
            "profile_data": "$profile_data"
        }
    }
]

# Execute the aggregation pipeline
result = list(db.history.aggregate(pipeline))

# Print the result in a tabular format
table = PrettyTable(["Account ID", "Full Name", "Date of Birth", "City", "State", "Country", "Postal Code", "Phone Number", "Email"])
for doc in result:
    # Fall back to an empty dict when no profile matched this history document
    profile_data = doc.get("profile_data") or {}
    table.add_row([
        doc["account_id"],
        profile_data.get("full_name", ""),
        profile_data.get("date_of_birth", ""),
        profile_data.get("city", ""),
        profile_data.get("state", ""),
        profile_data.get("country", ""),
        profile_data.get("postal_code", ""),
        profile_data.get("phone_number", ""),
        profile_data.get("email", "")
    ])

print(table)
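The `$lookup` stage above attaches to each `history` document an array of `profile` documents with the same `account_id`, and `$arrayElemAt` then keeps the first match. A pure-Python emulation of that idea on hypothetical documents may help when reading the pipeline; it is a sketch for illustration, not a replacement for running the aggregation through Kai.

```python
def lookup_first(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Emulate $lookup followed by taking element 0 of the joined array."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        out = dict(doc)
        # Like $arrayElemAt on an empty array, leave the field absent when nothing matches.
        if matches:
            out[as_field] = matches[0]
        joined.append(out)
    return joined


# Hypothetical documents standing in for the history and profile collections.
history = [
    {"_id": "h1", "account_id": 1, "amount": 50},
    {"_id": "h2", "account_id": 3, "amount": 75},  # no matching profile
]
profile = [
    {"account_id": 1, "full_name": "Ada Example", "state": "CA"},
    {"account_id": 2, "full_name": "Bo Example", "state": "NY"},
]

result = lookup_first(history, profile, "account_id", "account_id", "profile_data")
for doc in result:
    profile_data = doc.get("profile_data") or {}
    print(doc["account_id"], profile_data.get("full_name", ""))
```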

In [None]:
# Get the five states with the highest number of customers
from bson.son import SON

pipeline = [
    {"$group": {"_id": "$state", "count": {"$sum": 1}}},
    {"$sort": SON([("count", -1), ("_id", -1)])},
    {"$limit": 5}
]

p1 = [
    {"$count": "count"}
]
pprint.pprint(list(profile_coll.aggregate(pipeline)))
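The grouping pipeline above can be parameterized so the same shape works for other fields. The helper `top_groups_pipeline` below is a hypothetical convenience, not part of pymongo. It returns plain dictionaries, which keep insertion order in Python 3.7+ and so preserve the sort-key order that `SON` is used for above.

```python
def top_groups_pipeline(field, limit=5):
    """Build a $group/$sort/$limit pipeline counting documents per value of `field`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        # Sort by count descending, then by the group key descending to break ties.
        {"$sort": {"count": -1, "_id": -1}},
        {"$limit": limit},
    ]


pipeline = top_groups_pipeline("state")
print(pipeline)
```

Passing the result to `profile_coll.aggregate(...)` should then behave like the hand-written pipeline, and `top_groups_pipeline("country", 3)` would give the same shape for another field, assuming the profile documents carry it.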

In [None]:
import matplotlib.pyplot as plt

data = list(profile_coll.aggregate(pipeline))

print(data)

# Split the aggregation result into bar labels (states) and bar heights (counts)
states, counts = [d['_id'] for d in data], [d['count'] for d in data]

plt.bar(states, counts)
plt.show()
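The aggregation result is a list of `{"_id": ..., "count": ...}` documents, so it has to be split into labels and heights before plotting. A small sketch of that step follows, assuming some profiles may lack a `state` value (in which case `_id` would be `None`); the helper name and the sample data are hypothetical.

```python
def to_bar_series(docs, missing_label="(unknown)"):
    """Split grouped documents into parallel lists of labels and counts."""
    labels = [missing_label if d.get("_id") is None else str(d["_id"]) for d in docs]
    counts = [int(d["count"]) for d in docs]
    return labels, counts


# Hypothetical aggregation output, not real query results.
sample = [{"_id": "CA", "count": 12}, {"_id": None, "count": 3}]
labels, counts = to_bar_series(sample)
print(labels, counts)
```

The two lists can be passed directly to `plt.bar(labels, counts)`.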