# Data Modeling with Apache Cassandra

In this project, we model songs and user activity data and develop an ETL pipeline using Python to transfer
and insert the data into Apache Cassandra tables. We divide this project into 2 parts:

* [Part I. Creating an ETL Pipeline for Pre-Processing the Files](#Part-I.-Creating-an-ETL-Pipeline-for-Pre-Processing-the-Files)
* [Part II. Modeling Data with the Apache Cassandra](#Part-II.-Modeling-Data-with-the-Apache-Cassandra)

## Part I. Creating an ETL Pipeline for Pre-Processing the Files

#### Import Python packages 

In [1]:
import csv
import glob
import json
import os
import re
from typing import List

import cassandra
import numpy as np
import pandas as pd

#### Creating a list of file paths to process the original event CSV data files

In [2]:
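filepath = os.path.join(os.getcwd(), "event_data")
# Accumulate CSV paths from every directory; reassigning would keep only the last one
file_path_list = []
for root, dirs, files in os.walk(filepath):
    file_path_list.extend(glob.glob(os.path.join(root, "*.csv")))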

#### Processing the files to create the CSV data file that will be used for the Apache Cassandra tables

In [3]:
def read_csv_files(file_path_list: List[str]) -> List[List[str]]:
    """
    Description: Read a list of CSV files and return the combined
    event data, one list of values per row (header rows are skipped)
    
    Arguments:
        file_path_list: the list of file paths
        
    Returns:
        List of event data (a list of values)
    """

    full_data_rows_list = [] 
    
    for f in file_path_list:
        with open(f, "r", encoding="utf8", newline="") as csvfile: 
            csvreader = csv.reader(csvfile) 
            next(csvreader)

            for line in csvreader:
                full_data_rows_list.append(line) 

    return full_data_rows_list

In [4]:
full_data_rows_list = read_csv_files(file_path_list)

print(f"Total number of rows: {len(full_data_rows_list)}")
print(f"List of event data (only first 5 rows): {full_data_rows_list[:5]}")

Total number of rows: 8056
List of event data (only first 5 rows): [['Harmonia', 'Logged In', 'Ryan', 'M', '0', 'Smith', '655.77751', 'free', 'San Jose-Sunnyvale-Santa Clara, CA', 'PUT', 'NextSong', '1.54102E+12', '583', 'Sehr kosmisch', '200', '1.54224E+12', '26'], ['The Prodigy', 'Logged In', 'Ryan', 'M', '1', 'Smith', '260.07465', 'free', 'San Jose-Sunnyvale-Santa Clara, CA', 'PUT', 'NextSong', '1.54102E+12', '583', 'The Big Gundown', '200', '1.54224E+12', '26'], ['Train', 'Logged In', 'Ryan', 'M', '2', 'Smith', '205.45261', 'free', 'San Jose-Sunnyvale-Santa Clara, CA', 'PUT', 'NextSong', '1.54102E+12', '583', 'Marry Me', '200', '1.54224E+12', '26'], ['', 'Logged In', 'Wyatt', 'M', '0', 'Scott', '', 'free', 'Eureka-Arcata-Fortuna, CA', 'GET', 'Home', '1.54087E+12', '563', '', '200', '1.54225E+12', '9'], ['', 'Logged In', 'Austin', 'M', '0', 'Rosales', '', 'free', 'New York-Newark-Jersey City, NY-NJ-PA', 'GET', 'Home', '1.54106E+12', '521', '', '200', '1.54225E+12', '12']]


In [5]:
csv.register_dialect("myDialect", quoting=csv.QUOTE_ALL, skipinitialspace=True)

column_names = [
    "artist",
    "firstName",
    "gender",
    "itemInSession",
    "lastName",
    "length",
    "level",
    "location",
    "sessionId",
    "song",
    "userId",
]
with open("event_datafile_new.csv", "w", encoding="utf8", newline="") as f:
    writer = csv.writer(f, dialect="myDialect")
    writer.writerow(column_names)
    for row in full_data_rows_list:
        # Skip events that have no artist, i.e. events that are not song plays
        if row[0] == "":
            continue
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))

In [6]:
with open("event_datafile_new.csv", "r", encoding="utf8") as f:
    print(f"Total number of rows: {sum(1 for line in f)}")

Total number of rows: 6821


## Part II. Modeling Data with the Apache Cassandra


Now we're ready to work with `event_datafile_new.csv`, the CSV file we created in Part I. The file contains the following columns:

- artist
- firstName of user
- gender of user
- itemInSession (item number in session)
- lastName of user
- length of the song
- level (paid or free subscription)
- location of the user
- sessionId
- song title
- userId
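
The loading code in the rest of this notebook reads these columns by position (`line[8]`, `line[3]`, and so on). As a quick reference, here is a minimal sketch of the name-to-index mapping, assuming the column order written in Part I. The `COLUMN_INDEX` name is ours and is used only for illustration.

```python
# Column order of event_datafile_new.csv, as written in Part I.
column_names = [
    "artist", "firstName", "gender", "itemInSession", "lastName",
    "length", "level", "location", "sessionId", "song", "userId",
]

# Hypothetical helper mapping: column name -> position in each CSV row.
COLUMN_INDEX = {name: position for position, name in enumerate(column_names)}

for name, position in COLUMN_INDEX.items():
    print(f"line[{position}] -> {name}")
```

With this mapping, `line[COLUMN_INDEX["sessionId"]]` reads the same value as `line[8]`.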

### Working with Apache Cassandra

#### Creating a Cluster and a Session

In [7]:
from cassandra.cluster import Cluster


cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

#### Create Keyspace

In [8]:
try:
    session.execute(
        """
        CREATE KEYSPACE IF NOT EXISTS sparkify 
        WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
        """
    )
except Exception as e:
    print(e)

#### Set Keyspace

In [9]:
try:
    session.set_keyspace("sparkify")
except Exception as e:
    print(e)

### Now we need to create tables to answer the following three queries against the data

1. Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
1. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
1. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'

#### Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4

We'll partition the data by the session ID and use the item in session as a clustering column, so that together they uniquely identify each row.
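
To make the roles of the two key columns concrete, here is a toy in-memory model in plain Python. It is only an analogy, not how Cassandra actually stores data: the outer dictionary stands in for partitions keyed by `session_id`, and the inner dictionary for rows within a partition keyed by `item_in_session`. The sample rows and the `insert` and `select` helper names are made up for illustration.

```python
from collections import defaultdict

# partition key (session_id) -> clustering key (item_in_session) -> row
table = defaultdict(dict)


def insert(session_id, item_in_session, artist, song, length):
    """Store one row; the (session_id, item_in_session) pair identifies it."""
    table[session_id][item_in_session] = (artist, song, length)


def select(session_id, item_in_session):
    """Return the single row for the full primary key, or None."""
    return table.get(session_id, {}).get(item_in_session)


# Made-up sample rows, not taken from the dataset.
insert(1, 0, "Artist A", "Song A", 200.0)
insert(1, 1, "Artist B", "Song B", 180.5)
insert(2, 0, "Artist C", "Song C", 240.25)

print(select(1, 1))  # the row for session 1, item 1
```

Supplying both key values always narrows the lookup to at most one row, which is what the first query needs.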

In [10]:
query = """
CREATE TABLE IF NOT EXISTS songs_heard_during_session
(
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length double,
    PRIMARY KEY (
        session_id,
        item_in_session
    )
)
"""
try:
    session.execute(query)
except Exception as e:
    print(e)

In [11]:
file = "event_datafile_new.csv"

# Build the INSERT statement once, outside the loop
query = "INSERT INTO songs_heard_during_session (session_id, item_in_session, artist, song, length)"
query = query + " VALUES (%s, %s, %s, %s, %s)"

with open(file, encoding="utf8") as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        session.execute(query, (int(line[8]), int(line[3]), line[0], line[9], float(line[5])))
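
Because the same statement runs once per row, one possible refinement is a prepared statement, which the DataStax Python driver exposes as `Session.prepare`. The following is a minimal sketch, assuming a connected `session` with the keyspace already set; the `load_songs_heard_during_session` function name is hypothetical. Note that prepared statements use `?` placeholders rather than `%s`.

```python
from typing import Any, Iterable, List


def load_songs_heard_during_session(session: Any, rows: Iterable[List[str]]) -> int:
    """Insert CSV rows with a prepared statement and return the row count.

    `session` is assumed to be a connected cassandra.cluster.Session
    whose keyspace contains the songs_heard_during_session table.
    """
    # Prepared statements use "?" placeholders and are parsed once by the server.
    prepared = session.prepare(
        "INSERT INTO songs_heard_during_session "
        "(session_id, item_in_session, artist, song, length) "
        "VALUES (?, ?, ?, ?, ?)"
    )
    count = 0
    for line in rows:
        session.execute(
            prepared,
            (int(line[8]), int(line[3]), line[0], line[9], float(line[5])),
        )
        count += 1
    return count
```

To use it, one would open `event_datafile_new.csv` with `csv.reader`, skip the header with `next`, and pass the reader as `rows`.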

In [12]:
query = "SELECT artist, song, length FROM songs_heard_during_session WHERE session_id = 338 AND item_in_session = 4"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Only iterate when the query succeeded, so `rows` is always defined
    for row in rows:
        print(row)

Row(artist='Faithless', song='Music Matters (Mark Knight Dub)', length=495.3073)


#### Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182

For this query, we can make a composite partition key of the user ID and session ID. However, we also want to sort the data by the item in session, so we'll set the item in session as a clustering column when we create the table.
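
In CQL, a composite partition key is written as an inner parenthesized group at the start of the `PRIMARY KEY` clause, followed by the clustering columns. The small helper below, a hypothetical `primary_key_clause` function written only for illustration, assembles that clause from the two lists so the distinction is explicit.

```python
from typing import Sequence


def primary_key_clause(partition_key: Sequence[str], clustering: Sequence[str] = ()) -> str:
    """Build a CQL PRIMARY KEY clause from partition and clustering columns."""
    if not partition_key:
        raise ValueError("at least one partition key column is required")
    # The partition key is always wrapped in its own parentheses here,
    # which CQL accepts for single-column keys as well.
    parts = ["(" + ", ".join(partition_key) + ")"]
    parts.extend(clustering)
    return "PRIMARY KEY (" + ", ".join(parts) + ")"


print(primary_key_clause(["user_id", "session_id"], ["item_in_session"]))
# PRIMARY KEY ((user_id, session_id), item_in_session)
```

Rows that share the same `(user_id, session_id)` pair land in one partition and are ordered by `item_in_session` within it, ascending by default.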

In [13]:
query = """
CREATE TABLE IF NOT EXISTS songs_users_listen_to
(
    user_id int,
    session_id int,
    item_in_session int,
    artist text,
    song text,
    first_name text,
    last_name text,
    PRIMARY KEY (
        (
            user_id, 
            session_id
        ),
        item_in_session
    )
)
"""
try:
    session.execute(query)
except Exception as e:
    print(e)

In [14]:
file = "event_datafile_new.csv"

# Build the INSERT statement once, outside the loop
query = "INSERT INTO songs_users_listen_to (user_id, session_id, item_in_session, artist, song, first_name, last_name)"
query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s)"

with open(file, encoding="utf8") as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        session.execute(query, (int(line[10]), int(line[8]), int(line[3]), line[0], line[9], line[1], line[4]))

In [15]:
query = "SELECT artist, song, first_name, last_name, item_in_session FROM songs_users_listen_to WHERE user_id = 10 AND session_id = 182"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Only iterate when the query succeeded, so `rows` is always defined
    for row in rows:
        print(row)

Row(artist='Down To The Bone', song="Keep On Keepin' On", first_name='Sylvie', last_name='Cruz', item_in_session=0)
Row(artist='Three Drives', song='Greece 2000', first_name='Sylvie', last_name='Cruz', item_in_session=1)
Row(artist='Sebastien Tellier', song='Kilometer', first_name='Sylvie', last_name='Cruz', item_in_session=2)
Row(artist='Lonnie Gordon', song='Catch You Baby (Steve Pitron & Max Sanna Radio Edit)', first_name='Sylvie', last_name='Cruz', item_in_session=3)


#### Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'

Here the song title alone cannot serve as the primary key: many users can listen to the same song, so the title does not uniquely identify a row. To make each row unique, we partition the data by the song and add the user ID as a clustering column.
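
Cassandra treats an `INSERT` with an existing primary key as an overwrite of that row. The toy comparison below uses plain dictionaries as a stand-in (an analogy, not Cassandra itself) to show why the user ID belongs in the key. The listening events and names are made up for illustration.

```python
# Made-up listening events: (song, user_id, first_name, last_name)
events = [
    ("Song X", 1, "Ada", "Lovelace"),
    ("Song X", 2, "Alan", "Turing"),
    ("Song X", 1, "Ada", "Lovelace"),  # same user listens again
]

# Key on the song alone: each insert overwrites the previous listener.
by_song_only = {}
for song, user_id, first, last in events:
    by_song_only[song] = (first, last)

# Key on (song, user_id): one row per distinct listener of each song.
by_song_and_user = {}
for song, user_id, first, last in events:
    by_song_and_user[(song, user_id)] = (first, last)

print(len(by_song_only))      # 1
print(len(by_song_and_user))  # 2
```

A useful side effect of the `(song, user_id)` key is that a user who plays the same song several times still appears only once in the result.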

In [16]:
query = """
CREATE TABLE IF NOT EXISTS users_by_song
(
    song text,
    user_id int,
    first_name text,
    last_name text,
    PRIMARY KEY (
        song,
        user_id
    )
)
"""
try:
    session.execute(query)
except Exception as e:
    print(e)

In [17]:
file = "event_datafile_new.csv"

# Build the INSERT statement once, outside the loop
query = "INSERT INTO users_by_song (song, first_name, last_name, user_id)"
query = query + " VALUES (%s, %s, %s, %s)"

with open(file, encoding="utf8") as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header row
    for line in csvreader:
        session.execute(query, (line[9], line[1], line[4], int(line[10])))

In [18]:
query = "SELECT first_name, last_name FROM users_by_song WHERE song = 'All Hands Against His Own'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Only iterate when the query succeeded, so `rows` is always defined
    for row in rows:
        print(row)

Row(first_name='Jacqueline', last_name='Lynch')
Row(first_name='Tegan', last_name='Levine')
Row(first_name='Sara', last_name='Johnson')


### Drop the tables before closing out the session

In [19]:
# Drop each table in turn; report a failure without stopping the remaining drops
for table in ("songs_heard_during_session", "songs_users_listen_to", "users_by_song"):
    try:
        session.execute(f"DROP TABLE {table}")
    except Exception as e:
        print(e)

### Close the session and cluster connection

In [20]:
session.shutdown()
cluster.shutdown()