# Part I. ETL Pipeline for Pre-Processing the Files

## 1. Import necessary Python packages 

In [68]:
# Import Python packages 
import pandas as pd
import cassandra
import re
import os
import glob
import numpy as np
import json
import csv

## 2. Create a list of filepaths to process original event CSV data files

First, check the current working directory. Because the output can expose local directory names, run this cell only for testing and verification, and keep the privacy implications in mind when sharing this notebook with others.

In [None]:
print(os.getcwd())

Next, locate the event data, which is stored in the `./event_data` directory, and collect the filepaths of its files for later data wrangling.

In the code below, `os.walk` visits the directory and its subdirectories, `os.path.join` builds a wildcard pattern for each directory, and `glob` expands that pattern into the matching filepaths.

In [70]:
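filepath = os.path.join(os.getcwd(), 'event_data')

# Collect CSV paths from every directory visited, rather than keeping only the last one.
file_path_list = []
for root, dirs, files in os.walk(filepath):
    file_path_list.extend(glob.glob(os.path.join(root, '*.csv')))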
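
As a rough illustration of this kind of traversal, the sketch below builds a throwaway directory tree and collects every CSV path beneath it. The helper name `collect_csv_paths` and the file names are made up for the example. It accumulates matches from every directory that `os.walk` visits and keeps only `*.csv` files.

```python
import glob
import os
import tempfile


def collect_csv_paths(base_dir):
    """Return a sorted list of every *.csv path under base_dir (hypothetical helper)."""
    paths = []
    for root, _dirs, _files in os.walk(base_dir):
        # glob expands the wildcard; os.path.join only builds the pattern string.
        paths.extend(glob.glob(os.path.join(root, '*.csv')))
    return sorted(paths)


# Throwaway tree: two CSV files at the top level, one in a subfolder, one non-CSV file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'nested'))
    for name in ('a.csv', 'b.csv', 'notes.txt', os.path.join('nested', 'c.csv')):
        with open(os.path.join(tmp, name), 'w', encoding='utf8') as fh:
            fh.write('artist\n')
    found = [os.path.relpath(p, tmp) for p in collect_csv_paths(tmp)]

print(found)
```

Sorting the result is optional, but it makes the processing order reproducible across runs.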

## 3. Process the files to create a merged data file CSV for Apache Cassandra tables

Here, a merged CSV data file named <font color=red>event_datafile_new.csv</font> is created, containing data from all the event files whose filepaths were collected in the previous step. The process has 3 steps:

1. Initiate an empty list of rows that will be generated from each file, named `full_data_rows_list`.
2. For every filepath in the file path list:
    - Read the event data CSV file.
    - Extract each data row one by one and append it to `full_data_rows_list`.

In [71]:
full_data_rows_list = [] 
    
for f in file_path_list:

    with open(f, 'r', encoding = 'utf8', newline='') as csvfile: 
        csvreader = csv.reader(csvfile) 
        next(csvreader)
        
        for line in csvreader:
            full_data_rows_list.append(line) 
            
print(f"Total number of rows: {len(full_data_rows_list)}")

Total number of rows: 8056


3. Create <font color=red>event_datafile_new.csv</font>, which will be used to insert data into the Apache Cassandra tables. For each row fetched from the event data CSV files, skip rows with an empty artist and write out the 11 columns needed for later tasks. These 11 columns are listed in the next section.

In [72]:
csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, skipinitialspace=True)

with open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:
    writer = csv.writer(f, dialect='myDialect')
    writer.writerow([
        'artist', 'firstName', 'gender', 'itemInSession',
        'lastName', 'length', 'level', 'location',
        'sessionId', 'song', 'userId'
    ])
    for row in full_data_rows_list:
        if (row[0] == ''):
            continue
        writer.writerow((
            row[0], row[2], row[3], row[4], 
            row[5], row[6], row[7], row[8], 
            row[12], row[13], row[16]
        ))

In [73]:
with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:
    print(f"Total number of rows in merged CSV file: {sum(1 for line in f)}")

Total number of rows in merged CSV file: 6821
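
The merge-and-filter steps above can be sketched on in-memory data. The two sample "files", their values, and the helper names `read_data_rows` and `project_row` are invented for illustration; only the header skip, the empty-artist filter, and the 11 kept column positions come from the cells above.

```python
import csv
import io

# Positions of the 11 columns kept in the cells above.
KEEP = (0, 2, 3, 4, 5, 6, 7, 8, 12, 13, 16)


def make_row(artist, tag):
    """Build a fake 17-column event row; every value here is a placeholder."""
    return [artist] + [f'{tag}-col{i}' for i in range(1, 17)]


def to_csv_text(rows):
    """Serialise a header plus data rows to CSV text, standing in for a file on disk."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f'col{i}' for i in range(17)])
    writer.writerows(rows)
    return buf.getvalue()


def read_data_rows(csv_text):
    """Yield the data rows of one CSV text, skipping its header line."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    next(reader)  # drop the header row
    yield from reader


def project_row(row):
    """Keep only the 11 columns of interest."""
    return tuple(row[i] for i in KEEP)


files = [
    to_csv_text([make_row('Artist A', 'r1'), make_row('', 'r2')]),
    to_csv_text([make_row('Artist B', 'r3')]),
]

all_rows = [row for text in files for row in read_data_rows(text)]
kept = [project_row(row) for row in all_rows if row[0] != '']

print(len(all_rows), len(kept))
```

Note that the count printed by the last cell above is a line count of the merged file, so it includes the header line.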


# Part II. Apache Cassandra coding challenge 

The CSV file titled <font color=red>event_datafile_new.csv</font> is now complete and located in the Workspace directory. It contains the following columns: 

- artist 
- firstName of user
- gender of user
- item number in session
- last name of user
- length of the song
- level (paid or free song)
- location of the user
- sessionId
- song title
- userId

The image below is a screenshot of what the denormalized data should look like in <font color=red>event_datafile_new.csv</font> after the pre-processing ETL pipeline code in Part I is run:<br>

<img src="./assets/image_event_datafile_new.jpg">

## 1. Apache Cassandra initial configurations

### a. Create a Cluster

The cell below connects to a Cassandra instance on the local machine (127.0.0.1) and opens a session for executing queries.

In [74]:
from cassandra.cluster import Cluster
cluster = Cluster()

session = cluster.connect()

### b. Create a Keyspace

A keyspace is the top-level namespace in Cassandra and defines how data is replicated across nodes. The cell below creates a keyspace named `huy_udacity_project1b` and configures its replica placement strategy and replication factor.

In [75]:
try:
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS huy_udacity_project1b
        WITH REPLICATION = {
            'class' : 'SimpleStrategy',
            'replication_factor' : 1
        }
    """)
    print("Keyspace created successfully!")
except Exception as e:
    print(e)

Keyspace created successfully!


### c. Set Keyspace

Set `huy_udacity_project1b` as the default keyspace for all queries made in this session.

In [76]:
try:
    session.set_keyspace('huy_udacity_project1b')
    print("Keyspace set successfully!")
except Exception as e:
    print(e)

Keyspace set successfully!


## 2. Project tasks - create queries for the following 3 data questions

### Task 1. Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4

#### a. Question answer approach

Question 1 asks for the artist name, song title, and song length for a given `sessionId` and `itemInSession`. Because Cassandra is a NoSQL database in which tables are modeled around queries, we start from the query and then design the table it needs.

##### i. The queries

In SQL-like terms, the query looks like this:

```sql
SELECT
    artist_name,
    song_title,
    song_length
FROM table_name
WHERE
    sessionId = 338
    AND itemInSession = 4
```

But first we need a table designed specifically for this question and the query it involves.

##### ii. The table

Let's give the table a good, self-explanatory name, like `song_session`.

For the table structure:
- Columns : all those columns that are either required or involved in query condition, which are:
    + Artist name
    + Song title
    + Song length
    + `sessionId`
    + `itemInSession`
- Primary key : uniquely identifies each row. Since we filter on `sessionId` and `itemInSession`, both go into the primary key: `sessionId` as the partition key and `itemInSession` as the clustering column.
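
To make the roles of the two key columns concrete, here is a toy in-memory model, not Cassandra's actual storage engine: the outer dictionary stands in for partitions chosen by `sessionId`, and the inner one for rows identified by `itemInSession` inside a partition. The class name `ToyTable` and the sample rows are made up.

```python
from collections import defaultdict


class ToyTable:
    """Toy stand-in for a table with PRIMARY KEY (sessionId, itemInSession)."""

    def __init__(self):
        # partition key -> {clustering key -> remaining columns}
        self._partitions = defaultdict(dict)

    def insert(self, session_id, item_in_session, **columns):
        # Writing the same full key twice overwrites the earlier row (an upsert).
        self._partitions[session_id][item_in_session] = columns

    def select(self, session_id, item_in_session=None):
        partition = self._partitions.get(session_id, {})
        if item_in_session is None:
            # Whole partition, returned in clustering-key order.
            return [partition[k] for k in sorted(partition)]
        row = partition.get(item_in_session)
        return [row] if row is not None else []


table = ToyTable()
table.insert(1, 0, artist_name='Artist A', song_title='Song A', song_length=200.0)
table.insert(1, 1, artist_name='Artist B', song_title='Song B', song_length=210.5)
table.insert(2, 0, artist_name='Artist C', song_title='Song C', song_length=180.25)

print(table.select(1, 1))
```

Looking up a full key returns at most one row, which is the shape of answer that Task 1 asks for.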

#### b. Table initiation

We first create the table that serves query 1 above.
Because this notebook is rerun many times during development, each run drops the existing table before creating a new one.

In [77]:
table_query1_name = "song_session"

query1_dropExist = f"DROP TABLE IF EXISTS {table_query1_name}"
try:
    session.execute(query1_dropExist)
    print("DROP TABLE IF EXISTS completed successfully.")
except Exception as e:
    print(e)

DROP TABLE IF EXISTS completed successfully.


Columns should be declared and inserted in the same order as the composite primary key. This matters because Apache Cassandra is a partitioned row store: the partition key determines which node stores any particular row.

It is therefore good practice to list columns in the `CREATE` and `INSERT` statements in this order: partition key, then clustering columns, then the remaining columns.

Thus, the order of the columns is `sessionId` - `itemInSession` - `artist_name` - `song_title` - `song_length`.

In [78]:
query1_createNew = f"""
    CREATE TABLE IF NOT EXISTS {table_query1_name} (
        sessionId int, 
        itemInSession int, 
        artist_name text, 
        song_title text,
        song_length float, 
        PRIMARY KEY (sessionId, itemInSession)
    )
"""

try:
    session.execute(query1_createNew)
    print("Successfully created new table for query 1.")
except Exception as e:
    print(e)

Successfully created new table for query 1.


#### c. Insert data into table for the query

The cell below fetches data from <font color=red>event_datafile_new.csv</font> file (the merged CSV file we mentioned earlier), based on the table structure we discussed above. 

The `INSERT` statement runs once per row (`line`) of the CSV file, taking each table column's value from the matching CSV column. For example, `sessionId` is the 9th column of the CSV file (index 8), so the value for `song_session`'s `sessionId` comes from:
- The current row, `line`.
- Its 9th column, `line[8]`, converted with `int(line[8])` so that the value matches the table column's `int` type.
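
One way to make the index bookkeeping less error-prone is to read columns by header name and build the `INSERT` text once, outside the loop. This is a sketch under assumptions rather than the code used in this notebook: the sample row is invented, `build_insert` and `to_params` are hypothetical helpers, and the statement is only printed, not sent to Cassandra.

```python
import csv
import io

HEADER = ['artist', 'firstName', 'gender', 'itemInSession', 'lastName',
          'length', 'level', 'location', 'sessionId', 'song', 'userId']

# (table column, CSV column, converter) in the table's column order.
COLUMN_MAP = [
    ('sessionId', 'sessionId', int),
    ('itemInSession', 'itemInSession', int),
    ('artist_name', 'artist', str),
    ('song_title', 'song', str),
    ('song_length', 'length', float),
]


def build_insert(table_name, column_map):
    """Return an INSERT statement with one %s placeholder per column."""
    columns = ', '.join(name for name, _src, _conv in column_map)
    placeholders = ', '.join(['%s'] * len(column_map))
    return f'INSERT INTO {table_name} ({columns}) VALUES ({placeholders})'


def to_params(record, column_map):
    """Convert one CSV record (a dict of strings) to a typed parameter tuple."""
    return tuple(conv(record[src]) for _name, src, conv in column_map)


# Invented sample standing in for event_datafile_new.csv.
sample = io.StringIO(newline='')
writer = csv.writer(sample, quoting=csv.QUOTE_ALL)
writer.writerow(HEADER)
writer.writerow(['Artist A', 'Jane', 'F', '4', 'Doe',
                 '123.5', 'free', 'Somewhere', '338', 'Song A', '7'])
sample.seek(0)

statement = build_insert('song_session', COLUMN_MAP)
params = [to_params(rec, COLUMN_MAP) for rec in csv.DictReader(sample)]

print(statement)
print(params)
```

With a live session, each tuple in `params` could be passed to `session.execute` together with `statement`, in the same way the cell below passes its values.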

In [79]:
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:

        query1_insert = f"""
            INSERT INTO {table_query1_name} (
                sessionId, 
                itemInSession, 
                artist_name, 
                song_title, 
                song_length
            )
        """
        query1_insert = query1_insert + "VALUES (%s, %s, %s, %s, %s)"

        try:
            session.execute(query1_insert, (
                int(line[8]), 
                int(line[3]), 
                line[0], 
                line[9], 
                float(line[5])
            ))
        except Exception as e:
            print(e)
    
    print(f"Successfully filled {table_query1_name} with rows.")

Successfully filled song_session with rows.


#### d. Do a SELECT to verify that the data have been inserted into the table

Once the data has been inserted, we verify it with a `SELECT` statement. This is the query shown at the beginning of this task, now run against the table we built for it.

In [80]:
query1_verify = f"""
    SELECT
        artist_name, 
        song_title, 
        song_length
    FROM {table_query1_name}
    WHERE sessionId = 338
    AND itemInSession = 4
"""

try:
    rows = session.execute(query1_verify)
    print(f"Successfully fetched rows for conditions of query 1 from {table_query1_name} : \n")
except Exception as e:
    print(e)

for row in rows:
    print(
        row.artist_name, "\t",
        row.song_title, "\t",
        row.song_length
    )

Successfully fetched rows for conditions of query 1 from song_session : 

Faithless 	 Music Matters (Mark Knight Dub) 	 495.30731201171875


The output is a single record : 

```
Faithless Music Matters (Mark Knight Dub) 495.30731201171875
```

This means our operation was successful, as it retrieves exactly what the question asks for.
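
The long tail of digits in the song length is consistent with the column being declared as CQL `float`, a 32-bit type: the value is rounded to single precision when stored and widened to a 64-bit Python `float` when read back. The sketch below reproduces the effect with the standard library, assuming the CSV held `495.3073` for this row (an assumption, since the raw value is not shown here).

```python
import struct


def round_trip_float32(value):
    """Pack a Python float into 4 bytes (IEEE 754 single precision) and unpack it again."""
    return struct.unpack('<f', struct.pack('<f', value))[0]


csv_value = 495.3073  # assumed source value, not taken from the data file
stored = round_trip_float32(csv_value)

print(stored)
print(stored == csv_value)
```

If exact display matters, declaring the column as `double` or rounding the value on output are possible options.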

### Task 2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182

#### a. Question answer approach

This question expects name of artist, song title and user name, for a specific `userId` and `sessionId` pair of values. This question has 2 additional requirements:
- Song names are sorted by `itemInSession`.
- Show both first and last names of the user.

##### i. The queries

The queries appear in the code cells below for this task.

##### ii. The table

Table name : `user_playlist_session`.

Table columns, in order :
- `userId` (first part of the composite partition key)
- `sessionId` (second part of the composite partition key)
- `itemInSession` (clustering column, which also sorts the rows)
- Artist name
- Song name
- First name
- Last name

#### b. Table initiation

Since rows must be sorted by `itemInSession`, we make it the clustering column and state the direction with `CLUSTERING ORDER BY`. The question does not specify a direction, so ascending order is assumed.
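
The effect of the clustering column can be pictured with a small in-memory sketch: rows sharing the same (`userId`, `sessionId`) pair form one group, and within that group they come back ordered by `itemInSession` regardless of insertion order. This is only a model of the behaviour; the function name and sample rows are invented.

```python
from collections import defaultdict


def group_and_order(rows, descending=False):
    """Group rows by (userId, sessionId) and order each group by itemInSession."""
    partitions = defaultdict(list)
    for user_id, session_id, item_in_session, song in rows:
        partitions[(user_id, session_id)].append((item_in_session, song))
    return {
        key: sorted(items, reverse=descending)
        for key, items in partitions.items()
    }


# Rows arrive out of order on purpose.
rows = [
    (10, 182, 2, 'Song C'),
    (10, 182, 0, 'Song A'),
    (11, 182, 0, 'Song X'),
    (10, 182, 1, 'Song B'),
]

ordered = group_and_order(rows)
print(ordered[(10, 182)])
```

Passing `descending=True` mimics what a `DESC` clustering order would return for the same data.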

In [81]:
table_query2_name = "user_playlist_session"

query2_dropExist = f"DROP TABLE IF EXISTS {table_query2_name}"
try:
    session.execute(query2_dropExist)
    print("DROP TABLE IF EXISTS completed successfully.")
except Exception as e:
    print(e)

query2_createNew = f"""
    CREATE TABLE IF NOT EXISTS {table_query2_name} (
        userId int,
        sessionId int,
        itemInSession int,
        artist_name text, 
        song_name text,
        firstName text, 
        lastName text,
        PRIMARY KEY ((userId, sessionId), itemInSession)
    ) WITH CLUSTERING ORDER BY (itemInSession ASC)
"""

try:
    session.execute(query2_createNew)
    print("Successfully created new table for query 2.")
except Exception as e:
    print(e)

DROP TABLE IF EXISTS completed successfully.
Successfully created new table for query 2.


#### c. Insert data into table for the query

The cell below fetches data from <font color=red>event_datafile_new.csv</font> file (the merged CSV file we mentioned earlier), based on the table structure we discussed above. The process is the same as in the previous task, only with different columns and their corresponding CSV indexes.

In [82]:
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:

        query2_insert = f"""
            INSERT INTO {table_query2_name} (
                userId,
                sessionId, 
                itemInSession,
                artist_name, 
                song_name,
                firstName,
                lastName
            )
        """
        query2_insert = query2_insert + "VALUES (%s, %s, %s, %s, %s, %s, %s)"

        # CSV column indexes, in the order of the table columns above: 10 - 8 - 3 - 0 - 9 - 1 - 4.
        try:
            session.execute(query2_insert, (
                int(line[10]),
                int(line[8]), 
                int(line[3]),
                line[0], 
                line[9],
                line[1],
                line[4]
            ))
        except Exception as e:
            print(e)

    print(f"Successfully filled {table_query2_name} with rows.")

Successfully filled user_playlist_session with rows.


#### d. Do a SELECT to verify that the data have been inserted into the table

Once the data has been inserted, we verify it with a `SELECT` statement that filters on the table's partition key columns, `userId` and `sessionId`, as the question requires.

In [83]:
query2_verify = f"""
    SELECT
        itemInSession,
        artist_name, 
        song_name, 
        firstName,
        lastName
    FROM {table_query2_name}
    WHERE userId = 10
    AND sessionId = 182
"""

try:
    rows = session.execute(query2_verify)
    print(f"Successfully fetched rows for conditions of query 2 from {table_query2_name} : \n")
except Exception as e:
    print(e)

for row in rows:
    print(
        row.iteminsession, "\t",
        row.artist_name, "\t",
        row.song_name, "\t",
        row.firstname, "\t",
        row.lastname
    )

Successfully fetched rows for conditions of query 2 from user_playlist_session : 

0 	 Down To The Bone 	 Keep On Keepin' On 	 Sylvie 	 Cruz
1 	 Three Drives 	 Greece 2000 	 Sylvie 	 Cruz
2 	 Sebastien Tellier 	 Kilometer 	 Sylvie 	 Cruz
3 	 Lonnie Gordon 	 Catch You Baby (Steve Pitron & Max Sanna Radio Edit) 	 Sylvie 	 Cruz


The rows returned by the `SELECT` query are ordered by `itemInSession` and belong to the requested user and session, so the table answers the question correctly.

### Task 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'

#### a. Question answer approach

This question asks for the first and last names of all users who have listened to a specific song. Since the question gives no hint about which columns to use as the partition key or clustering columns, we must choose them ourselves.

##### i. The queries

The queries appear in the code cells below for this task.

##### ii. The table

Table name : `song_listeners_session`.

Table columns, in order :
- Song's name (partition key)
- `userId` (clustering column)
- First name
- Last name

Although either `sessionId` or `userId` could complete the primary key, we are querying user information, so `userId` is the better fit: it yields one row per listener of each song.
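
Why `userId` belongs in the primary key can be seen with a plain dictionary standing in for the table. In Cassandra, an `INSERT` whose primary key already exists overwrites that row, so a key made of the song name alone would keep only one listener per song. The helper name and sample listeners below are made up.

```python
def load(rows, key_columns):
    """Simulate upserts: later rows replace earlier rows with the same key."""
    table = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        table[key] = row
    return table


plays = [
    {'song_name': 'Song A', 'userId': 1, 'firstName': 'Ann', 'lastName': 'Lee'},
    {'song_name': 'Song A', 'userId': 2, 'firstName': 'Bob', 'lastName': 'Ray'},
    {'song_name': 'Song A', 'userId': 1, 'firstName': 'Ann', 'lastName': 'Lee'},  # repeat play
]

by_song_only = load(plays, ['song_name'])
by_song_and_user = load(plays, ['song_name', 'userId'])

print(len(by_song_only), len(by_song_and_user))
```

With the two-column key, repeat plays by the same user collapse into one row while distinct listeners are all kept.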

#### b. Table initiation

Similar to other questions, `DROP TABLE IF EXISTS` then `CREATE TABLE IF NOT EXISTS`, with columns in order described above.

In [95]:
table_query3_name = "song_listeners_session"

query3_dropExist = f"DROP TABLE IF EXISTS {table_query3_name}"
try:
    session.execute(query3_dropExist)
    print("DROP TABLE IF EXISTS completed successfully.")
except Exception as e:
    print(e)

query3_createNew = f"""
    CREATE TABLE IF NOT EXISTS {table_query3_name} (
        song_name text,
        userId int,
        firstName text, 
        lastName text,
        PRIMARY KEY (song_name, userId)
    )
"""

try:
    session.execute(query3_createNew)
    print("Successfully created new table for query 3.")
except Exception as e:
    print(e)

DROP TABLE IF EXISTS completed successfully.
Successfully created new table for query 3.


#### c. Insert data into table for the query

The cell below fetches data from <font color=red>event_datafile_new.csv</font> file (the merged CSV file we mentioned earlier), based on the table structure we discussed above. The process is the same as in the previous task, only with different columns and their corresponding CSV indexes.

In [96]:
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:

        query3_insert = f"""
            INSERT INTO {table_query3_name} (
                song_name,
                userId,
                firstName,
                lastName
            )
        """
        query3_insert = query3_insert + "VALUES (%s, %s, %s, %s)"

        try:
            session.execute(query3_insert, (
                line[9],
                int(line[10]),
                line[1],
                line[4]
            ))
        except Exception as e:
            print(e)

    print(f"Successfully filled {table_query3_name} with rows.")

Successfully filled song_listeners_session with rows.


#### d. Do a SELECT to verify that the data have been inserted into the table

Once the data has been inserted, we verify it with a `SELECT` statement that filters on the table's partition key, `song_name`, as the question requires.

In [97]:
query3_verify = f"""
    SELECT
        firstName,
        lastName
    FROM {table_query3_name}
    WHERE song_name = 'All Hands Against His Own'
"""

try:
    rows = session.execute(query3_verify)
    print(f"Successfully fetched rows for conditions of query 3 from {table_query3_name} : \n")
except Exception as e:
    print(e)

for row in rows:
    print(
        row.firstname, "\t",
        row.lastname
    )

Successfully fetched rows for conditions of query 3 from song_listeners_session : 

Jacqueline 	 Lynch
Tegan 	 Levine
Sara 	 Johnson


We can see that the return of `SELECT` query verifies that our code correctly answers the question.

## 3. Drop the tables before closing out the sessions

In [98]:
drop_queries = {
    table_query1_name: query1_dropExist,
    table_query2_name: query2_dropExist,
    table_query3_name: query3_dropExist,
}

# Track failures so the success message is printed only when every drop worked.
failed = []
for table_name, drop_query in drop_queries.items():
    try:
        session.execute(drop_query)
    except Exception as e:
        failed.append(table_name)
        print(f"Failed to drop table {table_name}...")
        print(e)

if not failed:
    print("DROP TABLE IF EXISTS all completed successfully.")

DROP TABLE IF EXISTS all completed successfully.


## 4. Close the session and cluster connection

In [99]:
session.shutdown()
cluster.shutdown()