##### Author: Praveen Saxena
##### Email: saxep01@gmail.com
##### Create Date: 7/16/2021
##### Purpose: Validate the need for certain indexes on _nanohub.jos_users_ & _nanohub_metrics.userlogin_.

----

Some preliminaries.

In [1]:
import pandas as pd
from pprint import pprint
from IPython.display import display, Markdown

In [2]:
%%capture 

from nanoHUB.application import Application

application = Application.get_instance()
nanohub_db = application.new_db_engine('nanohub')
nanohub_metrics_db = application.new_db_engine('nanohub_metrics')

We start with a basic query that finds all the users except for the one-day users.
One-day users are users who registered with nanoHUB but never logged in after their first 24 hours of registration.
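
To make the definition concrete, here is a minimal Python sketch of the same rule. The function name `is_one_day_user` and the sample timestamps are made up for illustration; the comparison is simply the negation of the `WHERE` clause used in the query below.

```python
from datetime import datetime, timedelta

def is_one_day_user(register_date: datetime, last_visit_date: datetime) -> bool:
    """Return True when the last visit falls within 24 hours of registration."""
    # Negation of: registerDate <= DATE_SUB(lastvisitDate, INTERVAL 1 DAY)
    return register_date > last_visit_date - timedelta(days=1)

# Hypothetical sample data, not taken from nanoHUB.
registered = datetime(2021, 7, 1, 12, 0, 0)
print(is_one_day_user(registered, datetime(2021, 7, 1, 18, 0, 0)))  # last visit 6 hours later
print(is_one_day_user(registered, datetime(2021, 7, 5, 9, 0, 0)))   # last visit days later
```

The first call describes a one-day user and the second a returning user, so only the second would pass the query's filter.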

In [3]:
sql = '''
SELECT registerDate, lastvisitDate
    FROM nanohub.jos_users
    WHERE registerDate <= DATE_SUB(lastvisitDate, INTERVAL 1 DAY)
LIMIT 100;
'''
df = pd.read_sql(sql, nanohub_db)
df.head()

| | registerDate | lastvisitDate |
|---|---|---|
| 0 | 2008-11-18 17:29:56 | 2020-02-14 18:50:14 |
| 1 | 2007-01-29 09:34:45 | 2019-03-15 09:00:31 |
| 2 | 2005-07-28 01:28:13 | 2015-02-18 17:03:45 |
| 3 | 2002-04-17 09:43:44 | 2010-06-02 20:22:00 |
| 4 | 2005-07-29 18:24:36 | 2012-10-02 17:22:20 |


Let's run an _EXPLAIN_ query on this.

In [4]:
sql = '''
EXPLAIN SELECT registerDate, lastvisitDate
    FROM nanohub.jos_users
    WHERE registerDate <= DATE_SUB(lastvisitDate, INTERVAL 1 DAY)
LIMIT 100;
'''
df = pd.read_sql(sql, nanohub_db)
df.head()

| | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | SIMPLE | jos_users | ALL | | | | | 263425 | Using where |


We can see that the query uses the much-maligned _ALL_ access type, which means a full table scan. This, together with the fact that the plan expects to examine all 263425 rows and lists neither _possible_keys_ nor a chosen _key_, indicates the lack of appropriate indexes. In this case, the needed index would be on _registerDate_ and possibly _lastvisitDate_.
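
Reading plans by eye gets tedious once there are several of them, so here is a small sketch of how the check could be automated. The helper `find_full_scans` is hypothetical and not part of the nanoHUB code base; it only assumes that each plan row is a dict keyed by the EXPLAIN column names.

```python
# Access types that read every row (ALL) or every index entry (index).
FULL_SCAN_TYPES = {"ALL", "index"}

def find_full_scans(explain_rows):
    """Return the EXPLAIN rows whose access type is a full table or full index scan."""
    return [row for row in explain_rows if row.get("type") in FULL_SCAN_TYPES]

# Row copied from the EXPLAIN output above.
explain_rows = [
    {"table": "jos_users", "type": "ALL", "key": None, "rows": 263425, "Extra": "Using where"},
]

for row in find_full_scans(explain_rows):
    print(f"{row['table']}: type={row['type']}, key={row['key']}, estimated rows={row['rows']}")
```

With the notebook's DataFrame, the input list could be obtained with `df.to_dict(orient="records")`.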

-----

Let's take it up a notch. This time we will run a query that computes the _FREQUENCY_ of user visits: the user's lifetime on nanoHUB in days (now - registerDate) divided by the number of distinct days on which the user performed an action. We could use their last day on nanoHUB instead of now(), but for our purposes we will use now().
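
As a sanity check on the arithmetic, the ratio can be sketched in plain Python. The function `visit_frequency` and the sample timestamps are invented for illustration; the calculation follows the `DATEDIFF(NOW(), registerDate) / COUNT(DISTINCT DATE(datetime))` expression in the query below, for a single user.

```python
from datetime import datetime

def visit_frequency(register_date, action_times, now):
    """Days since registration divided by the number of distinct days with activity."""
    distinct_days = {t.date() for t in action_times}
    if not distinct_days:
        return None  # no recorded actions, so the ratio is undefined
    # Like DATEDIFF, only the date parts are compared.
    lifetime_days = (now.date() - register_date.date()).days
    return lifetime_days / len(distinct_days)

# Hypothetical user: four actions spread over three distinct days.
register_date = datetime(2021, 1, 1, 8, 0, 0)
now = datetime(2021, 1, 31, 8, 0, 0)
action_times = [
    datetime(2021, 1, 2, 9, 0, 0),
    datetime(2021, 1, 2, 15, 30, 0),
    datetime(2021, 1, 10, 11, 0, 0),
    datetime(2021, 1, 20, 16, 45, 0),
]
print(visit_frequency(register_date, action_times, now))  # 30 days / 3 active days = 10.0
```

Defined this way, a smaller value means more frequent visits: this user was active roughly once every 10 days.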

In [5]:

sql = '''
EXPLAIN SELECT DISTINCT users.id, users.name, users.registerDate,
        COUNT(action) AS `num_actions`, COUNT(DISTINCT DATE(datetime)) as `distinct_days`,
        DATEDIFF (NOW(), users.registerDate)/COUNT(DISTINCT DATE(datetime)) as Frequency
    FROM nanohub.jos_users AS users
        LEFT JOIN nanohub_metrics.userlogin as logins ON logins.uidnumber = users.id
        WHERE logins.uidnumber != 0
            AND logins.uidnumber IS NOT NULL
            AND registerDate <= DATE_SUB(lastvisitDate, INTERVAL 1 DAY)
LIMIT 10
;
'''

df = pd.read_sql(sql, nanohub_db)
df

| | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | SIMPLE | logins | index | | userlogin | 951 | | 484906592 | Using where; Using index |
| 1 | 1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | nanohub_metrics.logins.uidnumber | 1 | Using where |


We can see again the need for indexes: the _index_ access type on _logins_ means the entire tree of the _userlogin_ index, an estimated 484906592 entries, is scanned to find matching rows. The needed index would be on _nanohub_metrics.userlogin.uidnumber_.

Let's see which columns on _nanohub_metrics.userlogin_ do have an index.

In [6]:
sql = '''
SHOW INDEX FROM nanohub_metrics.userlogin; 
'''

df = pd.read_sql(sql, nanohub_db)
df

| | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | userlogin | 0 | PRIMARY | 1 | id | A | 484906592 | | | | BTREE | | |
| 1 | userlogin | 0 | userlogin | 1 | datetime | A | 242453296 | | | | BTREE | | |
| 2 | userlogin | 0 | userlogin | 2 | user | A | 242453296 | | | | BTREE | | |
| 3 | userlogin | 0 | userlogin | 3 | uidnumber | A | 242453296 | | | YES | BTREE | | |
| 4 | userlogin | 0 | userlogin | 4 | ip | A | 484906592 | | | | BTREE | | |
| 5 | userlogin | 0 | userlogin | 5 | action | A | 484906592 | | | | BTREE | | |
| 6 | userlogin | 1 | username | 1 | user | A | 630567 | | | | BTREE | | |


So _userlogin_ has a unique multi-column index on _datetime, user, uidnumber, ip,_ and _action_ because we want to make sure that specific combination of column values is unique. Unfortunately, a multi-column index works like a phone book. A phone book is essentially indexed by last name, then first name. This is great when we need to find someone by their last name. But what if we had to find people by their first name alone? We would have to read every entry.

So a multi-column index that has _uidnumber_ as its 3rd column doesn't help us look up rows by _uidnumber_ alone, because the index can only be searched through its leading columns. We need a separate index on _uidnumber_.
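
The phone book effect can be reproduced without touching the production database. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL, with an empty table that only mimics the column layout of _userlogin_; the column types and the index names `userlogin_unique` and `idx_userlogin_uidnumber` are assumptions made for this example. SQLite's `EXPLAIN QUERY PLAN` reports a `SCAN` when it has to read everything and a `SEARCH` when it can seek into an index.

```python
import sqlite3

# In-memory stand-in for nanohub_metrics.userlogin (layout only, no data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE userlogin ("
    "id INTEGER PRIMARY KEY, datetime TEXT, user TEXT, uidnumber INTEGER, ip TEXT, action TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX userlogin_unique ON userlogin (datetime, user, uidnumber, ip, action)"
)

def query_plan(sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

by_leading_column = "SELECT id FROM userlogin WHERE datetime = '2021-07-16 00:00:00'"
by_third_column = "SELECT id FROM userlogin WHERE uidnumber = 42"

# Filtering on the leading column can seek into the multi-column index.
plan_leading = query_plan(by_leading_column)
# Filtering on the 3rd column alone cannot, so everything is scanned.
plan_before = query_plan(by_third_column)

# A dedicated index on uidnumber turns the scan into a seek.
conn.execute("CREATE INDEX idx_userlogin_uidnumber ON userlogin (uidnumber)")
plan_after = query_plan(by_third_column)

print(plan_leading)
print(plan_before)
print(plan_after)
```

The exact wording of the plan strings varies between SQLite versions, but the first and third plans should start with `SEARCH` and the second with `SCAN`. MySQL applies the same leftmost-prefix rule to its multi-column B-tree indexes, although its optimizer and plan format differ.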

-----

The conclusion is that we desperately need indexes on:

1. nanohub_metrics.userlogin.uidnumber
2. nanohub_metrics.userlogin.datetime
3. nanohub.jos_users.registerDate
4. nanohub.jos_users.lastvisitDate
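
For reference, the statements that would create these indexes could be generated as follows. This is only a sketch: the `idx_<table>_<column>` naming convention and the helper `create_index_sql` are assumptions, and the statements are printed rather than executed, since building an index on a table with roughly 485 million rows should be scheduled with care.

```python
# (schema, table, column) for each index listed above.
NEEDED_INDEXES = [
    ("nanohub_metrics", "userlogin", "uidnumber"),
    ("nanohub_metrics", "userlogin", "datetime"),
    ("nanohub", "jos_users", "registerDate"),
    ("nanohub", "jos_users", "lastvisitDate"),
]

def create_index_sql(schema, table, column):
    """Build a MySQL CREATE INDEX statement for a single-column index."""
    index_name = f"idx_{table}_{column}"  # hypothetical naming convention
    return f"CREATE INDEX `{index_name}` ON `{schema}`.`{table}` (`{column}`);"

for schema, table, column in NEEDED_INDEXES:
    print(create_index_sql(schema, table, column))
```

After an index is created, rerunning the _EXPLAIN_ queries above is the way to confirm whether the optimizer actually picks it up.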

----