## SET UP

In [None]:
#GET DATA FROM CASSANDRA AS DATAFRAME
from queries import make_queries_get_df
#FOR MANIPULATING DATAFRAME
import pandas as pd
#FOR MAKING QUICK CHARTS
import chart_helper
#PLOTTING (THIS IS OPTIONAL IF YOU WISH TO TWEAK A CHART FURTHER)
import plotly.express as px

## 1st Example: Total File Usage Across All Courses

The following query gives an overview of file usage across different courses. If needed, an extra layer of grouping, such as department or university, can be added to the table.

### Step 1: Get the Data from Cassandra

Notice that Python only transforms the result into a table and visualises it as a sunburst chart, since the data has already been prepared by Cassandra.

The actual query is the *string* argument passed to the make_queries_get_df() function.

In [4]:
from queries import make_queries_get_df
import pandas as pd
import chart_helper

# get data for file usage
file_usage_df = make_queries_get_df('''
SELECT course_id, paper_id, document_id, type, SUM(size) as total_file_usage_in_KB
FROM component
GROUP BY course_id, paper_id, document_id, type;
''')

#the quick_sunburst() function works as long as the columns are in the right order
#something like: 1st layer > 2nd layer > ... > values_columns
#for the chart in question: course > paper > doc > type > size
#Thanks to Cassandra, it's already grouped this way in the SELECT statement.
chart_helper.quick_sunburst(file_usage_df, 3).show()

An administrator can quickly zoom in on a particular paper or document to see file usage across different types of materials. Further reports can easily be generated from the same statistics to inform business decisions about usage plans.
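
The column-order convention noted in the code comments (1st layer > 2nd layer > ... > value) can be illustrated without any plotting library. The sketch below is not the implementation of `quick_sunburst()`, whose internals are not shown in this notebook. It uses made-up rows and a hypothetical `rollup()` helper (with its own `depth` parameter) to show how rows ordered this way collapse into the nested totals that a sunburst chart draws.

```python
from collections import defaultdict

def rollup(rows, depth):
    """Sum the last column of each row for every path prefix up to `depth` layers.

    Each row is (layer_1, layer_2, ..., value). Returns {path_tuple: total}.
    """
    totals = defaultdict(int)
    for *path, value in rows:
        for level in range(1, min(depth, len(path)) + 1):
            totals[tuple(path[:level])] += value
    return dict(totals)

# Made-up sample rows in the order course > paper > doc > type > size.
sample_rows = [
    ("courseA", "paperA1", "docA11", "image", 120),
    ("courseA", "paperA1", "docA11", "video", 900),
    ("courseA", "paperA2", "docA21", "audio", 300),
    ("courseB", "paperB1", "docB11", "image", 80),
]

totals = rollup(sample_rows, depth=3)
for path, total in sorted(totals.items()):
    print(" > ".join(path), total)
```

Because every ring is just a prefix of the column list, reordering the columns in the SELECT statement is enough to change which attribute sits at the center of the chart.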

## 2nd Example: Contributions by Each Individual in a Paper

The queries below can be run to produce an instant breakdown of each student's contributions to each document in a paper.

If an instructor is teaching multiple papers, similar queries can be made for each paper and offered as separate options in a drop-down menu.

### Step 1: Get Data as Dataframe from Cassandra

The helper function *make_queries_get_df()* accepts a CQL statement as a string argument.

It hides the details of connecting to Cassandra and executing the CQL query from Python.

This means you can test your CQL statement in DataStax Studio and, once you're happy with the result, simply copy and paste it into your Jupyter notebook.

The returned data will be a dataframe, ready to be processed and visualised.

In [7]:
from queries import make_queries_get_df
import pandas as pd
import chart_helper

contribution_by_paperC1_df = make_queries_get_df('''
SELECT paper_id, document_id, author_full_name, type, COUNT(type) as count
FROM component
WHERE course_id = 'courseC'
AND paper_id = 'paperC1'
GROUP BY document_id, type, author_id;
''')

#generate the figure object
chart_helper.quick_sunburst(contribution_by_paperC1_df, 3).show()

Notice that the data is already sorted and grouped by Cassandra on the server side. This takes advantage of the design of the component table and avoids pushing that processing onto the client.

Similar to the 1st example, we use *quick_sunburst()* to visualise the received dataframe.
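
For comparison, the sketch below shows roughly what the client would have to do if Cassandra returned ungrouped rows instead. It is an illustration only: the rows are made up, and `count_by()` is a hypothetical helper, not part of `queries` or `chart_helper`.

```python
from collections import Counter

def count_by(rows, key_columns):
    """Count rows per distinct combination of `key_columns`."""
    return Counter(tuple(row[col] for col in key_columns) for row in rows)

# Made-up ungrouped rows, one per contributed component.
raw_rows = [
    {"document_id": "docC11", "type": "comment", "author_full_name": "Student A"},
    {"document_id": "docC11", "type": "comment", "author_full_name": "Student A"},
    {"document_id": "docC11", "type": "image", "author_full_name": "Student B"},
    {"document_id": "docC12", "type": "comment", "author_full_name": "Student B"},
]

counts = count_by(raw_rows, ("document_id", "type"))
for key in sorted(counts):
    print(key, counts[key])
```

With GROUP BY in the CQL statement, this loop, and the transfer of every individual row to the client, is unnecessary.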

## Using Materialized Views for Student-Centric Visualisations

### Example 1: Individual Contributions

Using a materialized view in Cassandra, we can build a derived table that Cassandra keeps in sync with the component table. Its primary key may include one extra column that is not part of the base table's primary key.

We can also redefine which keys are the partition keys and which are the clustering keys.

### Redefining the Primary Key using Materialized View

Let's say we want to know the contributions of the student with id 2. We can't get an overview from the component table, because its partition key is made up of 'course_id' and 'paper_id', and its clustering key begins with 'document_id' and 'type', followed by 'author_id'.

This means we can only select a student's contributions one document at a time, and we need several SELECT statements to get an overview of the contributions across all documents.

To solve this problem, we can use a materialized view that makes 'author_id' the partition key and the remaining keys clustering columns. This allows the data to be accessed efficiently and returned pre-sorted.

Here's the full CQL statement for creating the Materialized View:

CREATE MATERIALIZED VIEW component_by_author_id <br>
AS SELECT * FROM component <br>
WHERE course_id IS NOT NULL <br>
AND paper_id IS NOT NULL <br>
AND document_id IS NOT NULL <br>
AND type IS NOT NULL <br>
AND author_id IS NOT NULL <br>
AND time_added IS NOT NULL <br>
PRIMARY KEY (author_id, course_id, paper_id, document_id, type, time_added);
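
To see why re-keying helps, here is a toy model in plain Python. It is only an analogy, with made-up rows and dictionaries standing in for partitions, and it says nothing about how Cassandra stores data internally. The hypothetical `partition_by()` helper groups the same rows first by the base table's partition key and then by the view's.

```python
from collections import defaultdict

# Made-up rows: (course_id, paper_id, document_id, type, author_id)
rows = [
    ("courseA", "paperA1", "docA11", "image", "2"),
    ("courseA", "paperA1", "docA12", "video", "2"),
    ("courseA", "paperA1", "docA11", "image", "8"),
    ("courseC", "paperC2", "docC23", "audio", "2"),
]

def partition_by(rows, key_func):
    """Group rows into partitions keyed by key_func(row)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key_func(row)].append(row)
    return partitions

# Base table layout: the partition key is (course_id, paper_id).
base = partition_by(rows, lambda r: (r[0], r[1]))
# View layout: the partition key is author_id.
view = partition_by(rows, lambda r: r[4])

# In the base layout, student 2's rows are spread over several partitions,
# so each of them has to be visited.
partitions_holding_student_2 = [
    key for key, part in base.items() if any(r[4] == "2" for r in part)
]
# In the view layout, a single partition lookup returns everything.
student_2_rows = view["2"]

print(len(partitions_holding_student_2), len(student_2_rows))
```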


### Getting the Data

Next, we simply query the materialized view like any CQL table.

In [8]:
from queries import make_queries_get_df
import pandas as pd
import chart_helper

contributions_of_student_2_df = make_queries_get_df('''
SELECT author_full_name, course_id, paper_id, document_id, type, count(type) as count
FROM component_by_author_id
WHERE author_id = '2'
GROUP BY course_id, paper_id, document_id, type;
''')

contributions_of_student_2_df

Unnamed: 0,author_full_name,course_id,paper_id,document_id,type,count
0,Rory Davies,courseA,paperA1,docA11,discussion,1
1,Rory Davies,courseA,paperA1,docA11,image,1
2,Rory Davies,courseA,paperA1,docA12,attachment,1
3,Rory Davies,courseA,paperA1,docA12,image,1
4,Rory Davies,courseA,paperA1,docA12,video,1
...,...,...,...,...,...,...
59,Rory Davies,courseC,paperC2,docC22,comment,1
60,Rory Davies,courseC,paperC2,docC22,discussion,1
61,Rory Davies,courseC,paperC2,docC23,audio,1
62,Rory Davies,courseC,paperC2,docC23,discussion,2


### Visualise the Data

Finally, we can take advantage of Cassandra's pre-sorted result to visualise the data as a sunburst chart.

Notice that only the name of the student in question appears in the center. This is because *author_full_name* is the first column, and it has the same value in every row: *author_id* is the partition key, so the query reads a single student's data from one place.

As a result, a student, in this case Rory, can quickly get an overview of what they have contributed to each document in a paper.

In [9]:
from queries import make_queries_get_df
import pandas as pd
import chart_helper

# the 3 passed to quick_sunburst() below is the chart's maxdepth

contributions_of_student_2_df = make_queries_get_df('''
SELECT author_full_name, paper_id, document_id, type, count(type) as count
FROM component_by_author_id
WHERE author_id = '2'
GROUP BY course_id, paper_id, document_id, type;
''')

chart_helper.quick_sunburst(contributions_of_student_2_df, 3).show()

### Example 2: Finding Uncited Sources

A student may wish to add missing references to their contributions as an academic requirement and good practice. It would be handy to get a report of which of their contributions need citation and where they can be found.

Our current *component* table does not support direct filtering on 'source', but a materialized view can take one extra primary key column, and this is a good use case for it.

Ideally, the view should return all contributions of a given user whose 'source' is missing. The result should also say where those contributions are located, so that the student can navigate to them quickly.

### Create a materialized view with *source* in the primary key

Here's the CQL statement.

// create a materialized view for identifying missing source

CREATE MATERIALIZED VIEW component_source_by_author_id AS SELECT * FROM component <br>
WHERE course_id IS NOT NULL <br>
AND paper_id IS NOT NULL <br>
AND document_id IS NOT NULL <br>
AND type IS NOT NULL <br>
AND author_id IS NOT NULL <br>
AND time_added IS NOT NULL <br>
AND source IS NOT NULL <br> <small>(the extra primary key)</small><br>
PRIMARY KEY (author_id, source, course_id, paper_id, document_id, type, time_added);

Notice the order of the new primary key. Since we expect a user to ask for a list of their uncited contributions, we map that query onto the order of the primary key components: 'author_id' first, then 'source'.

This design also illustrates how Cassandra tables should be conceived: the queries come first, and the tables are designed to serve them, not the other way around.
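
The query-first idea can be expressed as a small check. The sketch below uses a deliberately simplified rule: a query with equality restrictions is served by a primary key if it restricts every partition key column and a leading prefix of the clustering columns. Real CQL is more permissive (range restrictions on the last restricted clustering column, `IN`, secondary indexes, `ALLOW FILTERING`), so treat the hypothetical `is_served_by_key()` as a design aid, not a CQL validator. The base table's clustering key is abbreviated to the three columns named earlier.

```python
def is_served_by_key(partition_key, clustering_key, equality_columns):
    """Return True if equality restrictions on `equality_columns` fit the key.

    Simplified rule: all partition key columns must be restricted, and the
    other restricted columns must form a prefix of the clustering key.
    """
    restricted = set(equality_columns)
    if not set(partition_key) <= restricted:
        return False
    remaining = restricted - set(partition_key)
    prefix = clustering_key[:len(remaining)]
    return set(prefix) == remaining

# Base table: partition key and (abbreviated) clustering key.
base_pk = ["course_id", "paper_id"]
base_ck = ["document_id", "type", "author_id"]

# View from the CREATE MATERIALIZED VIEW statement above.
view_pk = ["author_id"]
view_ck = ["source", "course_id", "paper_id", "document_id", "type", "time_added"]

query = ["author_id", "source"]  # "uncited contributions of one student"
print(is_served_by_key(base_pk, base_ck, query))
print(is_served_by_key(view_pk, view_ck, query))
```

Putting 'source' directly after 'author_id' is what makes the second check pass; had 'source' come last, the same query would need restrictions on every column before it.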

### The Code

As in the other examples, we first extract the data and then present it in a suitable form. Here, a plain table is enough.

In [10]:
from queries import make_queries_get_df
import pandas as pd
import chart_helper

uncited_contributions_of_Tom = make_queries_get_df('''
SELECT document_id, type, source as status, time_added
FROM component_source_by_author_id
WHERE author_id = '8'
AND source = 'missing'
GROUP BY course_id, paper_id, document_id;
''')

uncited_contributions_of_Tom

Unnamed: 0,document_id,type,status,time_added
0,docA11,audio,missing,2020-01-14
1,docA12,video,missing,2020-02-13
2,docA13,image,missing,2020-01-04
3,docA21,image,missing,2020-01-20
4,docA22,audio,missing,2020-03-03
5,docA23,audio,missing,2020-01-09
6,docB11,video,missing,2020-01-03
7,docB12,attachment,missing,2020-03-05
8,docB13,attachment,missing,2020-02-25
9,docB21,attachment,missing,2020-02-05


The result looks promising (or rather frustrating for Tom), but it doesn't allow him to jump inside a document to fix things. This is because our test table does not contain an actual *document_id* or *component_id* (substituted by *time_added*). 

These ids can in turn serve as breadcrumbs as there could be a table that records the location of a component by its id (like a URL). We can also include this attribute inside our *component* table.

## NAVIGATION QUERIES

At the moment, the ob3 platform could greatly improve the user experience by adding navigation breadcrumbs of various forms to its client side. The following tables are some suggestions toward that goal, with a focus on students as users.

### Bookmarked, Favorite, and Annotated Components

Instead of searching through each document in each paper for marked materials, students should be able to easily locate their desired materials through a sidebar tab showing the list of all of their bookmarks, favorites, and notes, plus links to these places. 

This means there should be a table containing this information for each student, and the list should be sorted by document and paper. In CQL terms, our CREATE TABLE statement could look something like this:

CREATE TABLE marked_component_by_user_id ( <br>
    user_id TEXT,  <br>
    paper_id TEXT,  <br>
    doc_id TEXT,  <br>
    bookmarked map<timeuuid, text>,  <br>
    favorite map<timeuuid, text>,  <br>
    annotated map<timeuuid, text>,  <br>
    PRIMARY KEY ((user_id, paper_id),  <br>
    doc_id));  <br>

Using Cassandra's built-in map collection type, we can store a map in which the key of each element is the component id and the value is the link to it. There are three separate maps, one for each type of interaction.

Note that when updating the table, we should use CQL's *UPDATE ... SET field = field + {key: value}* (or *field = field - {key}* to remove an entry) instead of *INSERT INTO ... VALUES*, since the latter replaces the old map with a new one. The former simply adds an element to the map or removes one from it.
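
As a minimal sketch of the update pattern just described, the hypothetical helpers below build the UPDATE statements and their parameters without touching the database. They assume the `%s` placeholder style that the DataStax Python driver accepts for simple statements, so the returned pair could be passed to `session.execute(cql, params)`. The helper names and the URL are made up for illustration.

```python
import uuid

TABLE = "marked_component_by_user_id"
MAP_COLUMNS = {"bookmarked", "favorite", "annotated"}
WHERE = "WHERE user_id = %s AND paper_id = %s AND doc_id = %s"

def build_mark_statement(column, component_id, url, user_id, paper_id, doc_id):
    """Build an UPDATE that adds one entry to a map column."""
    # Column names cannot be bound as parameters, so check them explicitly.
    if column not in MAP_COLUMNS:
        raise ValueError(f"unknown map column: {column}")
    cql = f"UPDATE {TABLE} SET {column} = {column} + %s {WHERE}"
    return cql, ({component_id: url}, user_id, paper_id, doc_id)

def build_unmark_statement(column, component_id, user_id, paper_id, doc_id):
    """Build an UPDATE that removes one entry from a map column by its key."""
    if column not in MAP_COLUMNS:
        raise ValueError(f"unknown map column: {column}")
    cql = f"UPDATE {TABLE} SET {column} = {column} - %s {WHERE}"
    # Removal takes a set of keys rather than a key/value map.
    return cql, ({component_id}, user_id, paper_id, doc_id)

component_id = uuid.uuid1()  # a time-based UUID, matching the timeuuid key type
mark_cql, mark_params = build_mark_statement(
    "bookmarked", component_id, "https://example.org/component", "2", "paperB", "docB1"
)
unmark_cql, unmark_params = build_unmark_statement(
    "bookmarked", component_id, "2", "paperB", "docB1"
)
print(mark_cql)
print(unmark_cql)
```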

Let's look at the result from such a table for user_id = '2' who's interested in checking out all the components that they have interacted with in a paper, grouped by document.

In [11]:
# Since the result contains Cassandra's map type, we import a row factory that converts the result into a dataframe.

from cloud import session
from pandas_factory import pandas_factory
session.row_factory = pandas_factory

query = '''
SELECT doc_id, bookmarked, favorite, annotated 
FROM marked_component_by_user_id
WHERE user_id = '2' 
AND paper_id = 'paperB' 
GROUP BY doc_id;
'''

result = session.execute(query, timeout=None)
marked_component_of_user_2_df = result._current_rows

marked_component_of_user_2_df

Unnamed: 0,doc_id,bookmarked,favorite,annotated
0,docB1,"{'bookmarked1': 'URL', 'bookmarked2': 'URL', '...","{'favorite1': 'URL', 'favorite2': 'URL', 'favo...","{'annotated1': 'URL', 'annotated2': 'URL', 'an..."
1,docB2,"{'bookmarked1': 'URL', 'bookmarked2': 'URL', '...","{'favorite1': 'URL', 'favorite2': 'URL', 'favo...","{'annotated1': 'URL', 'annotated2': 'URL'}"


Notice that in production the key of each entry in each dictionary will be the id of the component that the user marked, and the value will be the actual URL leading to that component.
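
To feed a sidebar, the three map columns would typically be flattened into a single list of links. The sketch below shows one way to do that on plain dictionaries. The rows are made up to resemble the output above, and `flatten_marks()` is a hypothetical helper that is not part of this notebook's modules.

```python
def flatten_marks(rows):
    """Turn rows with map columns into one flat, sorted list of sidebar entries.

    Each entry is (doc_id, kind, component_id, url).
    """
    entries = []
    for row in rows:
        for kind in ("bookmarked", "favorite", "annotated"):
            # Treat a missing (None) map and an empty map the same way.
            for component_id, url in (row.get(kind) or {}).items():
                entries.append((row["doc_id"], kind, component_id, url))
    return sorted(entries)

# Made-up rows using the same placeholder keys and URLs as the test data.
sample_rows = [
    {
        "doc_id": "docB1",
        "bookmarked": {"bookmarked1": "URL"},
        "favorite": {"favorite1": "URL", "favorite2": "URL"},
        "annotated": None,
    },
    {
        "doc_id": "docB2",
        "bookmarked": {},
        "favorite": None,
        "annotated": {"annotated1": "URL"},
    },
]

sidebar_entries = flatten_marks(sample_rows)
for entry in sidebar_entries:
    print(entry)
```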

Since *doc_id* is the clustering key, we can also filter the result by document, in case the user wants these items for a single document rather than a whole paper.

In [12]:
from cloud import session
from pandas_factory import pandas_factory
session.row_factory = pandas_factory

query = '''
SELECT doc_id, bookmarked, favorite, annotated 
FROM marked_component_by_user_id
WHERE user_id = '2' 
AND paper_id = 'paperB' 
AND doc_id = 'docB1';
'''

result = session.execute(query, timeout=None)
marked_component_of_user_2_in_docB1_df = result._current_rows

marked_component_of_user_2_in_docB1_df

Unnamed: 0,doc_id,bookmarked,favorite,annotated
0,docB1,"{'bookmarked1': 'URL', 'bookmarked2': 'URL', '...","{'favorite1': 'URL', 'favorite2': 'URL', 'favo...","{'annotated1': 'URL', 'annotated2': 'URL', 'an..."
