# NoSQL Data Modeling
## Project: Data Modeling with Apache Cassandra

This project was completed as part of Udacity's Data Engineering Nanodegree. Data set and prompt provided by [2011–2020 Udacity, Inc.](https://www.udacity.com), used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/).

#### Import Python packages 

In [1]:
import cassandra
import csv

#### Creating a Cluster

The connection and session below require a Cassandra cluster running locally. Called with no arguments, `Cluster()` connects to `127.0.0.1` by default.

In [2]:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()

#### Create Keyspace

In [3]:
session.execute("""
CREATE KEYSPACE IF NOT EXISTS udacity 
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
""")

<cassandra.cluster.ResultSet at 0x7f908d4e4e80>

#### Set Keyspace

In [4]:
session.set_keyspace('udacity')

## Queries to answer the following three questions about the data:

#### 1. Find the artist, song title, and song length in the music app history for sessionId = 338 and itemInSession = 4


#### 2. Find the artist name, song (sorted by itemInSession), and user (first and last name) for userId = 10 and sessionId = 182


#### 3. Find every user name (first and last) in the music app history who listened to the song 'All Hands Against His Own'

## Query 1

In this query, `sessionId` was the partition key and `itemInSession` was the clustering column. Together they formed a composite primary key that identified a single row in the table, so a WHERE clause on both columns returned the matching `artist`, `song`, and `length` values.
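The role of the two key columns can be pictured without a database. The sketch below is a plain-Python analogy, not a description of how Cassandra is implemented: an outer dict stands in for partitions keyed by `sessionId`, and an inner dict for rows keyed by `itemInSession`. The sample `events` and the `lookup` helper are hypothetical and exist only for illustration.

```python
# Plain-Python analogy for PRIMARY KEY (sessionId, itemInSession).
# The sample events are made up; they are not rows from the project's CSV.
events = [
    # (sessionId, itemInSession, artist, song, length)
    (1, 0, "Artist A", "Song A", 200.0),
    (1, 1, "Artist B", "Song B", 180.5),
    (2, 0, "Artist C", "Song C", 240.25),
]

# Outer key plays the role of the partition key, inner key the clustering column.
partitions = {}
for session_id, item, artist, song, length in events:
    partitions.setdefault(session_id, {})[item] = (artist, song, length)


def lookup(session_id, item_in_session):
    """Return (artist, song, length) for one primary key, or None if absent."""
    return partitions.get(session_id, {}).get(item_in_session)


print(lookup(1, 1))
```

Supplying both key values pins down at most one row, which is why the query needs no further filtering.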

In [5]:
# main query
query1 = "SELECT artist, song, length FROM song_session WHERE sessionId = 338 AND itemInSession = 4"

# creating the required table (once, before any rows are loaded)
query = "CREATE TABLE IF NOT EXISTS song_session "
query = query + "(sessionId int, itemInSession int, artist text, song text, length float, PRIMARY KEY (sessionId, itemInSession))"
session.execute(query)

# loading the rows from the CSV file
file = 'event_datafile_new.csv'
insert = "INSERT INTO song_session (sessionId, itemInSession, artist, song, length)"
insert = insert + " VALUES (%s, %s, %s, %s, %s)"
with open(file, encoding='utf8', newline='') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        session.execute(insert, (int(line[8]), int(line[3]), line[0], line[9], float(line[5])))

# executing the query
rows = session.execute(query1)
for row in rows:
    print(row.artist, row.song, row.length)

Faithless Music Matters (Mark Knight Dub) 495.30731201171875
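The positional indexes used when loading the CSV are easy to misread. One option, offered here as a sketch rather than as the project's approach, is to name them once and convert each row in a small helper. The index values are the ones used in the loading loop above; the constant names, the `song_session_params` function, and the sample row are hypothetical.

```python
# Column positions taken from the indexes used in the loading loop.
ARTIST, ITEM_IN_SESSION, LENGTH, SESSION_ID, SONG = 0, 3, 5, 8, 9


def song_session_params(line):
    """Convert one CSV row (a list of strings) into INSERT parameters."""
    return (
        int(line[SESSION_ID]),
        int(line[ITEM_IN_SESSION]),
        line[ARTIST],
        line[SONG],
        float(line[LENGTH]),
    )


# Made-up row with 11 fields; only the positions named above matter here.
sample = ["Artist A", "", "", "4", "", "200.5", "", "", "338", "Song A", "10"]
print(song_session_params(sample))
```

The returned tuple follows the column order of the `INSERT INTO song_session` statement, so it could be passed directly as the second argument to `session.execute`.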


## Query 2

In this query, `userId` and `sessionId` formed the composite partition key, and `itemInSession` was the clustering column. The WHERE clause on `userId` and `sessionId` selected a single partition, and because Cassandra stores the rows of a partition sorted by the clustering column, the `artist`, `song`, and `user` values were returned in `itemInSession` order.
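The combination of a composite partition key and a clustering column can be sketched in plain Python. This is an analogy under simplifying assumptions, not Cassandra's storage engine: a dict keyed by the pair `(userId, sessionId)` stands in for partitions, and sorting on `itemInSession` stands in for clustering order. The sample `events` and the `songs_in_session` helper are hypothetical.

```python
from collections import defaultdict

# Made-up events, deliberately out of order:
# (userId, sessionId, itemInSession, song)
events = [
    (7, 50, 2, "Song C"),
    (7, 50, 0, "Song A"),
    (7, 51, 0, "Song X"),
    (7, 50, 1, "Song B"),
]

# The partition key is the pair (userId, sessionId).
partitions = defaultdict(list)
for user_id, session_id, item, song in events:
    partitions[(user_id, session_id)].append((item, song))


def songs_in_session(user_id, session_id):
    """Return the songs of one partition, ordered by the clustering column."""
    rows = sorted(partitions.get((user_id, session_id), []))
    return [song for _, song in rows]


print(songs_in_session(7, 50))
```

Both parts of the partition key must be supplied to find the partition, which matches the two equality conditions in the query's WHERE clause.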

In [6]:
# main query
query2 = "SELECT artist, song, user FROM song_user_session WHERE userid = 10 AND sessionid = 182"

# creating the required table (once, before any rows are loaded)
query = "CREATE TABLE IF NOT EXISTS song_user_session "
query = query + "(userid int, sessionid int, itemInSession int, artist text, song text, user text, PRIMARY KEY ((userid, sessionid), itemInSession))"
session.execute(query)

# loading the rows from the CSV file
file = 'event_datafile_new.csv'
insert = "INSERT INTO song_user_session (userid, sessionid, itemInSession, artist, song, user)"
insert = insert + " VALUES (%s, %s, %s, %s, %s, %s)"
with open(file, encoding='utf8', newline='') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        # user is the first and last name joined without a separator
        session.execute(insert, (int(line[10]), int(line[8]), int(line[3]), line[0], line[9], line[1] + line[4]))

# executing the query
rows = session.execute(query2)
for row in rows:
    print(row.artist, row.song, row.user)

Down To The Bone Keep On Keepin' On SylvieCruz
Three Drives Greece 2000 SylvieCruz
Sebastien Tellier Kilometer SylvieCruz
Lonnie Gordon Catch You Baby (Steve Pitron & Max Sanna Radio Edit) SylvieCruz


## Query 3

In this query, `song` was the partition key, and `userId` and `user` were the clustering columns. The WHERE clause on `song` selected a single partition. Because the primary key combined all three columns, each listener was stored only once per song, and the query returned one `user` value per listener.
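Repeated plays of a song by the same user do not create duplicate rows, because a Cassandra INSERT acts as an upsert: writing a row whose primary key already exists overwrites it. The plain-Python sketch below mimics that behaviour with a dict keyed by the full primary key. It is an analogy only, and the sample `events` and the `listeners` helper are hypothetical.

```python
# Made-up listening events: (song, userId, user). User 1 plays "Song A" twice.
events = [
    ("Song A", 1, "AnnLee"),
    ("Song B", 2, "BobRay"),
    ("Song A", 1, "AnnLee"),
    ("Song A", 3, "CyDunn"),
]

# Keyed by the full primary key (song, userId, user); a repeated key
# overwrites the earlier entry, mirroring upsert behaviour.
table = {}
for song, user_id, user in events:
    table[(song, user_id, user)] = True


def listeners(song):
    """Return the users stored for one song, ordered by (userId, user)."""
    keys = sorted(k for k in table if k[0] == song)
    return [user for _, _, user in keys]


print(listeners("Song A"))
```

Including `userId` in the key keeps two different users who share a name as separate rows.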

In [7]:
# main query
query3 = "SELECT user FROM user_song WHERE song = 'All Hands Against His Own'"

# creating the required table (once, before any rows are loaded)
query = "CREATE TABLE IF NOT EXISTS user_song "
query = query + "(song text, userid int, user text, PRIMARY KEY (song, userid, user))"
session.execute(query)

# loading the rows from the CSV file
file = 'event_datafile_new.csv'
insert = "INSERT INTO user_song (song, userid, user)"
insert = insert + " VALUES (%s, %s, %s)"
with open(file, encoding='utf8', newline='') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        # user is the first and last name joined without a separator
        session.execute(insert, (line[9], int(line[10]), line[1] + line[4]))

# executing the query
rows = session.execute(query3)
for row in rows:
    print(row.user)

JacquelineLynch
TeganLevine
SaraJohnson


#### Drop the tables

In [8]:
query = "drop table song_session"
session.execute(query)

query = "drop table song_user_session"
session.execute(query)

query = "drop table user_song"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7f908d4d2b38>

#### Close the session and cluster

In [9]:
session.shutdown()
cluster.shutdown()