In [2]:
### PREAMBLE
# Redis - Representing Data and Basic Usage
# redis.svg


In [2]:
# Some imports for the tutorial
from time import sleep

# Introduction

While we have covered relational databases in class and learned about entity relationships between tables in a relational database, not all data can be organized in a structured manner. In this tutorial, we examine the uses and advantages of non-relational databases such as MongoDB, Cassandra, and Redis. In particular, we delve into the capabilities of Redis and how we can apply it to data science problems.

First, we note some main differences between relational and non-relational databases __(SQL vs. NoSQL)__:
- The schema for a SQL database must be predefined, while NoSQL databases have dynamic schemas.
- SQL databases represent data in the form of tables, while NoSQL databases do not share a single data model. In the case of Redis, data is stored as key-value pairs.
- SQL databases are typically scaled "vertically," by improving hardware on a single server, while NoSQL databases scale "horizontally," by distributing the database across more servers.
- SQL databases share a common SQL-style syntax, while the query syntax of NoSQL databases varies from system to system.
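
To make the first two differences concrete, here is a minimal sketch that stores the same fact in both styles. It uses Python's built-in `sqlite3` module for the relational side and a plain `dict` as a stand-in for a key-value store; the table name `visits` and the key names are hypothetical and chosen only for illustration.

```python
import sqlite3

# Relational side: the schema is declared before any row can be stored.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (username TEXT PRIMARY KEY, count INTEGER)')
conn.execute('INSERT INTO visits VALUES (?, ?)', ('username4', 5))
sql_count = conn.execute(
    'SELECT count FROM visits WHERE username = ?', ('username4',)
).fetchone()[0]

# Key-value side (a dict standing in for a store such as Redis):
# there is no schema, and values of different shapes can live side by side.
kv_store = {}
kv_store['visits:username4'] = 5
kv_store['path:username4:1'] = ['item987', 'cart', 'checkout']
kv_count = kv_store['visits:username4']

print(sql_count, kv_count)
```

Both lookups return the same number; the difference is that the relational version had to commit to columns and types up front, while the key-value version only had to agree on a naming pattern for its keys.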


## Motivation
As a motivation to use a NoSQL database for data science, consider the following problem:
- You are working for Amazon and your task is to analyze the order in which pages on Amazon get visited. In particular, you want to find paths that lead to visitors becoming customers, as well as paths that lead to visitors leaving. Given enough of this data, it would be possible to generate suggested pages that lead to more customers, but first, you would like to store the data efficiently.

Although it is possible to come up with a SQL schema for this problem, most implementations would be either very inefficient or would violate the relational nature of the database. As a result, we turn to a NoSQL database, Redis, and show that such a problem can be easily solved.

# Redis
You have probably heard of Redis as a cache. In most cases, Redis acts as a key-value cache for large web-based applications due to its processing speed. In addition to being able to process over 100,000 SETs and GETs per second on a normal machine, Redis comes with a rich set of built-in data structures. To be more precise, Redis is a key-value store, where values can take on the forms of 
[Strings](https://redis.io/topics/data-types-intro#redis-strings), 
[Lists](https://redis.io/topics/data-types-intro#redis-lists), 
[Sets](https://redis.io/topics/data-types-intro#redis-sets), 
[Sorted Sets](https://redis.io/topics/data-types-intro#redis-sorted-sets), 
[Hashes](https://redis.io/topics/data-types-intro#redis-hashes), 
[Bitmaps](https://redis.io/topics/data-types-intro#bitmaps), and 
[HyperLogLogs](https://redis.io/topics/data-types-intro#hyperloglogs).
For this tutorial, we will go over Strings, Lists, and Sorted Sets and see how they can be used for the problem stated earlier. But first, let us go over the setup instructions to run Redis with Python.

### Installation & Setup
- Download and install Redis for your machine [here](https://redis.io/download). Instructions are included at the bottom of the page.

    For Windows users, [this](https://github.com/rgl/redis/downloads) might be an easier setup solution.
    
    
- Install redis-py, a Python Redis client.

    `> conda install redis-py`
    
    
- Run the Redis server on your machine. By default, it runs on port 6379.

   On Windows, you will be able to find the Redis server under Services.
   
   On MacOS/Linux, you can simply run `redis-server`.
   
   
- In Python, import Redis and connect to your server.


In [3]:
import redis

r = redis.Redis(host='localhost', port=6379)

By default, `redis.Redis()` connects to localhost on port 6379.

## Strings
In Redis, all keys must be Strings, and values can be Strings as well. This is probably the most used and simplest type of data.

Although it is called a String, this datatype can represent any binary content. This means that pictures and videos can be stored as Strings as well. This datatype is fairly intuitive, so let us move on to some operations we can do with them in Redis.

`SET` and `GET` are the most common operations, and as the names suggest, the former sets a key to be a certain value and the latter retrieves the value associated with a key.

`MSET` and `MGET` can set and retrieve multiple keys and values respectively.

`EXISTS` can be used to check for the existence of a key, and `DEL` can be used to delete a key and its value.

In [4]:
# Writes the value, 'google.com' to the key, 'url'
r.set('url', 'google.com')

# Retrieves the value associated with the key, 'url'
value = str(r.get('url'))
print('Value for url is ' + value)

print('Does count3 exist? ' + str(r.exists('count3')))

r.mset({'count1' : 3, 'count2' : 8, 'count3' : 7})
print('Does count3 exist now? ' + str(r.exists('count3')))
print(r.mget(['count1', 'count2', 'count3']))

r.delete('count3')
print('Is count3 still in there? ' + str(r.exists('count3')))

Value for url is b'google.com'
Does count3 exist? False
Does count3 exist now? True
[b'3', b'8', b'7']
Is count3 still in there? False


As you can see, the values returned by Redis are bytes. To convert them to Python strings, we can define the following helper function:

In [5]:
# Converts byte returned by Redis to python string for printing
def to_string(value):
    return value.decode('utf-8')

print(to_string(r.get('url')))

google.com
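
`to_string` handles a single value. Commands such as `MGET` return a list, and a missing key comes back as `None`, so a slightly more general helper can be convenient. The following is a minimal sketch under that assumption; the name `decode` is hypothetical and not part of redis-py.

```python
def decode(value, encoding='utf-8'):
    """Recursively convert bytes (and lists or tuples of bytes) to str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type with decoded items
        return type(value)(decode(item, encoding) for item in value)
    # None (missing key), numbers, and str pass through unchanged
    return value

print(decode([b'3', b'8', None]))   # ['3', '8', None]
print(decode((b'Apples', 2.442)))   # ('Apples', 2.442)
```

redis-py also accepts a `decode_responses=True` argument when creating the client, which asks the library to decode replies for you; the rest of this tutorial keeps the default bytes behavior and decodes explicitly.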


We can also increment Strings that hold numbers with a single command, `INCR`:

In [6]:
print('count1 is ' + to_string(r.get('count1')))
r.incr('count1')
print('count1 is now ' + to_string(r.get('count1')))

# You can also specify how much you want to increase the value by
r.incr('count1', 8)
print('count1 is now ' + to_string(r.get('count1')))

count1 is 3
count1 is now 4
count1 is now 12
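
To see what `INCR` does to the stored value, here is a small in-memory model, not Redis itself. It assumes a plain dict that maps keys to bytes, treats a missing key as 0, and writes the result back as a string of digits, which mirrors how Redis stores counters. The function name `incr` and the dict `store` are illustrative only.

```python
def incr(store, key, amount=1):
    """Model of INCR/INCRBY on a dict that maps str keys to bytes values."""
    # A key that does not exist yet is treated as 0
    current = int(store.get(key, b'0').decode('utf-8'))
    updated = current + amount
    # The counter is stored back as a string, not as a native integer
    store[key] = str(updated).encode('utf-8')
    # Like the real command, report the new value
    return updated

store = {'count1': b'3'}
incr(store, 'count1')        # 3 -> 4
incr(store, 'count1', 8)     # 4 -> 12
first_visit = incr(store, 'visits:username337')  # key did not exist before

print(store['count1'], first_visit)
```

One thing this sketch cannot show is atomicity: in Redis the read, add, and write happen as one step on the server, so two clients incrementing the same key never lose an update.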


Another cool thing we can do in Redis is set the time to live for a key with `EXPIRE` and check how much time the key has left with `TTL`.

In [7]:
r.set('text', 'I will be deleted soon!')
# Allow the key to live for 2 seconds
r.expire('text', 2)
sleep(1)
# Check the remaining time to live in seconds (pttl reports it in milliseconds)
print(r.ttl('text'))
sleep(2)
print('Is text still in there? ' + str(r.exists('text')))

1
Is text still in there? False
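
The behavior of `EXPIRE` and `TTL` can be modeled with a dictionary of deadlines. The class below is a simplified sketch, not how Redis is implemented: `ExpiringStore` and its injected `clock` are hypothetical names, and keys are only purged lazily when they are touched. It does follow two conventions of the real `TTL` command: -2 for a key that does not exist and -1 for a key without an expiry.

```python
import time

class ExpiringStore:
    """Toy key-value store with per-key time to live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}      # key -> value
        self._deadline = {}  # key -> absolute expiry time

    def _purge(self, key):
        # Lazily drop the key once its deadline has passed
        deadline = self._deadline.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._data.pop(key, None)
            self._deadline.pop(key, None)

    def set(self, key, value):
        self._data[key] = value
        self._deadline.pop(key, None)  # a fresh SET has no expiry

    def expire(self, key, seconds):
        self._purge(key)
        if key not in self._data:
            return False
        self._deadline[key] = self._clock() + seconds
        return True

    def exists(self, key):
        self._purge(key)
        return key in self._data

    def ttl(self, key):
        self._purge(key)
        if key not in self._data:
            return -2  # key does not exist
        if key not in self._deadline:
            return -1  # key exists but never expires
        return round(self._deadline[key] - self._clock())

# A controllable clock replaces sleep() so the example runs instantly
now = [0.0]
store = ExpiringStore(clock=lambda: now[0])
store.set('text', 'I will be deleted soon!')
store.expire('text', 2)
now[0] = 1.0
ttl_after_one_second = store.ttl('text')
now[0] = 3.0
still_there = store.exists('text')
print(ttl_after_one_second, still_there)
```

Passing the clock in as a parameter is a design choice for the sketch only: it keeps the example deterministic and avoids real waiting.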


## Lists
Lists in Redis can be thought of as double-ended queues. Elements can be added to or popped off either end in constant time. It is possible to view an element at a specific index, but the operation can be costly (linear in the distance from the nearest end), as lists are implemented using linked lists in Redis. For faster access by position, sorted sets can be used instead, since they look up elements by rank in logarithmic time.

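`LPUSH` and `RPUSH` can be used to insert elements into the list from the head and the tail respectively. To remove elements and view them from the ends, we can use `LPOP` and `RPOP`.

Note that the output shown for the next cell was produced when `list1` already held elements from an earlier run of this notebook, which is why the first listing contains more than one element. On an empty database the first listing would contain only the element 1.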

In [8]:
# Simple helper function to display everything inside a list without modifying the list
def visualize_list(key):
    print('Content of ' + key)
    elements = [to_string(x) for x in r.lrange(key, 0, -1)]
    for i in range(len(elements)):
        print('  ' + str(i) + '. ' + elements[i])
    print()
        
# Add elements at the head of the list
r.lpush('list1', 1)
visualize_list('list1')
r.lpush('list1', 2)
visualize_list('list1')

# Add elements at the tail of the list
r.rpush('list1', 3)
r.rpush('list1', 4)
visualize_list('list1')

# Remove an element from the tail and print it
print('The former tail is ' + to_string(r.rpop('list1')))
visualize_list('list1')

# Remove an element from the head and print it
print('The former head is ' + to_string(r.lpop('list1')))
visualize_list('list1')

Content of list1
  0. 1
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9

Content of list1
  0. 2
  1. 1
  2. 1
  3. 3
  4. 5
  5. 7
  6. 9

Content of list1
  0. 2
  1. 1
  2. 1
  3. 3
  4. 5
  5. 7
  6. 9
  7. 3
  8. 4

The former tail is 4
Content of list1
  0. 2
  1. 1
  2. 1
  3. 3
  4. 5
  5. 7
  6. 9
  7. 3

The former head is 2
Content of list1
  0. 1
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
  6. 3
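
For comparison, Python's `collections.deque` offers the same four operations, so the sequence of commands in the cell can be replayed on an initially empty deque. This is only an analogy run locally, not a call to Redis; the variable names are illustrative.

```python
from collections import deque

items = deque()
items.appendleft('1')   # LPUSH list1 1
items.appendleft('2')   # LPUSH list1 2
items.append('3')       # RPUSH list1 3
items.append('4')       # RPUSH list1 4

former_tail = items.pop()       # RPOP list1
former_head = items.popleft()   # LPOP list1

print(former_tail, former_head, list(items))
```

Starting from empty, the two elements that remain are the first one pushed on the left and the first one pushed on the right.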



It is also possible to view elements without removing them from the list using `LINDEX` and `LRANGE`:

In [9]:
# Let us add a few more elements to the list
for i in range(5, 10, 2):
    r.rpush('list1', i)
visualize_list('list1')

# View the element at index 3 (indices start at 0)
print('The third element is ' + to_string(r.lindex('list1', 3)))
# Verify that the list is unchanged
visualize_list('list1')

# View the elements from index 2 to 4, inclusive
print(r.lrange('list1', 2, 4))

Content of list1
  0. 1
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
  6. 3
  7. 5
  8. 7
  9. 9

The third element is 5
Content of list1
  0. 1
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
  6. 3
  7. 5
  8. 7
  9. 9

[b'3', b'5', b'7']
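
Unlike Python slices, the stop index of `LRANGE` is inclusive, and negative indices count from the tail. The helper below is a sketch of those rules applied to an ordinary Python list; the function name `lrange` is reused here for readability and is a local illustration, not the redis-py method.

```python
def lrange(items, start, stop):
    """Return items[start..stop] with an inclusive stop, LRANGE-style."""
    n = len(items)
    if start < 0:
        start = max(n + start, 0)   # negative start counts from the tail
    if stop < 0:
        stop = n + stop             # -1 is the last element
    stop = min(stop, n - 1)         # out-of-range stop is clamped
    if start > stop or start >= n:
        return []
    return items[start:stop + 1]    # +1 converts to Python's exclusive stop

items = ['1', '1', '3', '5', '7', '9', '3', '5', '7', '9']
print(lrange(items, 2, 4))     # ['3', '5', '7']
print(lrange(items, -3, -1))   # ['5', '7', '9']
```

This is also why `lrange(key, 0, -1)` is the usual way to fetch a whole list.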


## Sorted Sets
Sorted sets are arguably the most powerful data structure in Redis. Keys (also called members) in a sorted set are unique, and each key is associated with a floating point number called its score. The elements (key, score pairs) in a sorted set are sorted by score first, and elements that have the same score are sorted lexicographically by key. Since all keys are unique, no two elements are ever tied. In this tutorial, we will only go over the basic sorted set operations.

Elements can be added to a sorted set using `ZADD`.

In [15]:
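# Add a variety of fruits to a sorted set based on the annual national sales of that fruit (in billion $)
# Note: these calls use the older redis-py argument order (member, score);
# redis-py 3.0+ expects a mapping instead, e.g. r.zadd('fruit_scores', {'Apples': 2.442})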
r.zadd('fruit_scores', 'Apples', 2.442)
r.zadd('fruit_scores', 'Bananas', 2.183)
r.zadd('fruit_scores', 'Blueberries', 0.889)
r.zadd('fruit_scores', 'Oranges', 0.792)
r.zadd('fruit_scores', 'Grapes', 2.135)
r.zadd('fruit_scores', 'Mandarins', 0.672)

# A simple function to visualize the elements and their scores in a sorted set associated with key
def visualize_sorted_set(key):
    # Retrieve all items in the sorted set stored at key (use -1 to return all)
    items_with_scores = r.zrange(key, 0, -1, withscores=True)
    # Get length of longest key for better printing
    maxlen = max([len(p[0]) for p in items_with_scores])
    print('Content of ' + key)
    # The format for printing item
    format_string = '{:' + str(maxlen) + '}'
    for item,score in items_with_scores: 
        print(' ITEM: ' + format_string.format(to_string(item)) + '  SCORE: ' + str(score))
        
visualize_sorted_set('fruit_scores')

Content of fruit_scores
 ITEM: Mandarins    SCORE: 0.672
 ITEM: Oranges      SCORE: 0.792
 ITEM: Blueberries  SCORE: 0.889
 ITEM: Grapes       SCORE: 2.135
 ITEM: Bananas      SCORE: 2.183
 ITEM: Apples       SCORE: 2.442
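
The ordering rule (score first, then key) can be reproduced locally with a sort on `(score, key)` tuples. The sketch below uses a plain dict in place of the sorted set and the same fruit figures as the cell above; `sorted_members` is a hypothetical helper name. It assumes ASCII keys, for which Python's string comparison agrees with a byte-wise comparison.

```python
fruit_scores = {
    'Apples': 2.442, 'Bananas': 2.183, 'Blueberries': 0.889,
    'Oranges': 0.792, 'Grapes': 2.135, 'Mandarins': 0.672,
}

def sorted_members(scored):
    """Order keys by score, breaking ties lexicographically by key."""
    ordered = sorted(scored.items(), key=lambda pair: (pair[1], pair[0]))
    return [member for member, _score in ordered]

print(sorted_members(fruit_scores))

# With equal scores, the key alone decides the order
tied = {'b': 1.0, 'a': 1.0, 'c': 0.5}
print(sorted_members(tied))   # ['c', 'a', 'b']
```

The first list matches the order printed by `visualize_sorted_set`. A real sorted set keeps this order up to date on every insert instead of sorting on demand.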


Now, we look at two cool ways to retrieve information from a sorted set, `ZRANGEBYSCORE` and `ZRANGEBYLEX`.

`ZRANGEBYSCORE` retrieves items with scores between min and max, inclusive. In this case, min and max are numbers. 'inf' (or '+inf') and '-inf' can be used to represent positive and negative infinity.

`ZRANGEBYLEX` retrieves items whose keys are lexicographically between min and max. In this case, min and max are strings. '[' or '(' must be prepended to a string to specify whether the limit is inclusive or exclusive, respectively. '+' and '-' can be used to represent the infinities in this case. Note that `ZRANGEBYLEX` is only meaningful when all elements of the sorted set have the same score; when the scores differ, as they do for `fruit_scores`, the result is unspecified.

In [23]:
# Get all fruits that had sales greater than or equal to 2 billion dollars
fruits = r.zrangebyscore('fruit_scores', 2, 'inf')
print([to_string(fruit) for fruit in fruits])
# Get all fruits that had sales less than or equal to 1 billion dollars
fruits = r.zrangebyscore('fruit_scores', '-inf', 1)
print([to_string(fruit) for fruit in fruits])

# Get fruits lexicographically greater than or equal to 'Blueberries'
fruits = r.zrangebylex('fruit_scores', '[Blueberries', '+')
print([to_string(fruit) for fruit in fruits])
# Get fruits lexicographically greater than 'Blueberries'
fruits = r.zrangebylex('fruit_scores', '(Blueberries', '+')
print([to_string(fruit) for fruit in fruits])

['Grapes', 'Bananas', 'Apples']
['Mandarins', 'Oranges', 'Blueberries']


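ResponseError: unknown command 'ZRANGEBYLEX'

The error above means that the Redis server used for this run does not support `ZRANGEBYLEX`, which was added in Redis 2.8.9. On a newer server the command is available.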
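
Since the lexicographic query did not run here, the intended semantics of both range commands can be sketched in plain Python. This is a local model, not Redis: `range_by_score` and `range_by_lex` are hypothetical helpers, the lexicographic version treats all keys as if they shared one score, and ASCII keys are assumed.

```python
fruit_scores = {
    'Apples': 2.442, 'Bananas': 2.183, 'Blueberries': 0.889,
    'Oranges': 0.792, 'Grapes': 2.135, 'Mandarins': 0.672,
}

def range_by_score(scored, low, high):
    """Keys whose score lies in [low, high], in sorted-set order."""
    ordered = sorted(scored.items(), key=lambda pair: (pair[1], pair[0]))
    return [member for member, score in ordered if low <= score <= high]

def _passes_lower(member, low):
    if low == '-':
        return True                  # negative infinity
    if low == '+':
        return False
    bound = low[1:]
    return member >= bound if low[0] == '[' else member > bound

def _passes_upper(member, high):
    if high == '+':
        return True                  # positive infinity
    if high == '-':
        return False
    bound = high[1:]
    return member <= bound if high[0] == '[' else member < bound

def range_by_lex(members, low, high):
    """Keys between low and high, assuming every key has the same score."""
    return [m for m in sorted(members)
            if _passes_lower(m, low) and _passes_upper(m, high)]

print(range_by_score(fruit_scores, 2, float('inf')))
print(range_by_score(fruit_scores, float('-inf'), 1))
print(range_by_lex(fruit_scores, '[Blueberries', '+'))
print(range_by_lex(fruit_scores, '(Blueberries', '+'))
```

The two score queries reproduce the lists printed above. The two lexicographic queries differ only in whether 'Blueberries' itself is included.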

## Example
Now, let us move on to an example and see how we can use Redis to store data about page visits. To make the problem a bit more challenging, you would like to take into account how often a user visits Amazon. It would be tremendously helpful to Amazon if you could find patterns for newcomers that lead to purchases.

Our idea will be to use regular strings to store how many times a user has visited the site, lists for each visit's page navigation path, and a sorted set to separate the paths of users who visit the site very often and users who do not.
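
Because Redis has no tables, the structure of this design lives in the key names. The sketch below spells out one possible naming convention, matching the `visits:<username>` and `path:<username>:<visit>` keys used in the rest of this example. The helper names are hypothetical, and the parser assumes that usernames never contain a colon.

```python
def visits_key(username):
    """Key of the String that counts a user's visits."""
    return 'visits:' + username

def path_key(username, visit):
    """Key of the List that stores the pages seen during one visit."""
    return 'path:{}:{}'.format(username, visit)

def parse_path_key(key):
    """Split a path key back into (username, visit number)."""
    prefix, username, visit = key.split(':')
    if prefix != 'path':
        raise ValueError('not a path key: ' + key)
    return username, int(visit)

print(visits_key('username4'))
print(path_key('username4', 1))
print(parse_path_key('path:username158:2'))
```

Keeping key construction in one place makes it harder for the writer and the reader of the data to disagree about the format.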

##### Dummy Data
Since the Redis database cannot be encapsulated in a file like SQLite, the following function will add dummy data into Redis.

In [56]:
# Appends each value in values to the list stored at key, preserving their order
def add_list(key, values):
    for value in values:
        r.rpush(key, value)

# Creates 3 dummy users and 10 paths for the users
# username4 has visited the site 5 times,
# username37 has visited the site 2 times, and
# username158 has visited the site 3 times
def create_data():
    # First clear everything in Redis, so we have a clean start
    r.flushall()
    
    # Create user who has visited the site 5 times
    r.set('visits:username4', 5)
    # Create 5 random paths, path:username4:i is for the ith visit
    add_list('path:username4:1', ['item987', 'item4526', 'cart', 'item552', 'cart', 'checkout'])
    add_list('path:username4:2', ['item237', 'item6833', 'item6834', 'item2820'])
    add_list('path:username4:3', ['cart', 'item152', 'cart', 'checkout'])
    add_list('path:username4:4', ['item133', 'cart', 'checkout'])
    add_list('path:username4:5', ['item388', 'item688', 'cart', 'item15', 'cart', 'checkout'])
    # Add paths to sorted set with different scores
    for i in range(1, 6):
        r.zadd('pathscores', 'path:username4:'+ str(i), float(i))
        
    # Create user who has visited the site twice
    r.set('visits:username37', 2)
    # Create 2 random paths, path:username37:i is for the ith visit
    add_list('path:username37:1', ['item987', 'item552', 'cart', 'checkout'])
    add_list('path:username37:2', ['item6833', 'item6834', 'item2820', 'settings'])
    # Add paths to sorted set with different scores
    for i in range(1, 3):
        r.zadd('pathscores', 'path:username37:' + str(i), float(i))
        
    # Create user who has visited the site three times
    r.set('visits:username158', 3)
    # Create 3 random paths, path:username158:i is for the ith visit
    add_list('path:username158:1', ['item987', 'item552', 'cart', 'item6834', 'item2820'])
    add_list('path:username158:2', ['item388', 'item688', 'cart', 'item15', 'cart', 'checkout'])
    add_list('path:username158:3', ['item4284', 'item570', 'item15', 'item2820', 'item6833', 'item6834'])
    # Add paths to sorted set with different scores
    for i in range(1, 4):
        r.zadd('pathscores', 'path:username158:' + str(i), float(i))
        
create_data()
    

##### Data Insertion
Assuming that our database is in some state, we can use the following function, `new_visit`, to add a new sample.

In [57]:
# Gets the number of visits a username has made
def num_visits(username):
    return to_string(r.get('visits:' + username))

# Increase the number of visits username has made by 1
def increase_visits(username):
    r.incr('visits:' + username)

# Creates a new path for the visitor with the given username
# Creates a key to store the path and puts the key in the sorted set
def add_path(username, path):
    path_key = 'path:' + username + ':' + num_visits(username)
    add_list(path_key, path)
    r.zadd('pathscores', path_key, float(num_visits(username)))

# Adds a new datapoint (username of visitor and the navigation path of the visit) to the database
def new_visit(username, path):
    increase_visits(username)
    add_path(username, path)
    
# Add another visit for username4
new_visit('username4', ['item2258', 'item5624', 'cart', 'item6834'])
print('username4 now has ' + str(num_visits('username4')) + ' visits.')

# A user who has never visited the site before makes a first visit
new_visit('username337', ['item426', 'cart', 'item552', 'cart', 'checkout'])
print('username337 now has ' + str(num_visits('username337')) + ' visit.')

username4 now has 6 visits.
username337 now has 1 visit.
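
A possible variation, sketched below, uses the number returned by `INCR` directly instead of reading the counter back, and pushes the whole path with one `RPUSH` call. It is written against a client passed in as a parameter and uses the redis-py 3.0+ mapping form of `zadd`. So that the sketch runs without a server, it is exercised with `InMemoryClient`, a hypothetical stand-in that implements only the three calls needed here and is not part of redis-py.

```python
class InMemoryClient:
    """Hypothetical stand-in implementing only incr, rpush, and zadd."""

    def __init__(self):
        self.strings, self.lists, self.zsets = {}, {}, {}

    def incr(self, name, amount=1):
        self.strings[name] = self.strings.get(name, 0) + amount
        return self.strings[name]

    def rpush(self, name, *values):
        self.lists.setdefault(name, []).extend(values)
        return len(self.lists[name])

    def zadd(self, name, mapping):
        self.zsets.setdefault(name, {}).update(mapping)

def new_visit(client, username, path):
    """Record one visit and return the key under which its path is stored."""
    # INCR returns the updated count, so no second read is needed
    visit_number = client.incr('visits:' + username)
    key = 'path:{}:{}'.format(username, visit_number)
    # RPUSH accepts several values in a single call
    client.rpush(key, *path)
    # The visit number doubles as the score in the sorted set
    client.zadd('pathscores', {key: float(visit_number)})
    return key

client = InMemoryClient()
first_key = new_visit(client, 'username337', ['item426', 'cart', 'checkout'])
second_key = new_visit(client, 'username337', ['item552', 'cart'])
print(first_key, second_key)
```

With a real connection, the same function would be called as `new_visit(r, ...)`. Using the return value of `INCR` also avoids a window in which another client could bump the counter between the increment and the read.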


##### Basic Retrieval
Now, we move on to some basic data retrieval. As stated earlier, you would like to find paths of new users. In this case, let us consider the first visit and the second visit of a user to be new visits. To get the keys of the paths, you can simply do:

In [58]:
keys = r.zrangebyscore('pathscores', 0, 2)
print([to_string(key) for key in keys])

['path:username158:1', 'path:username337:1', 'path:username37:1', 'path:username4:1', 'path:username158:2', 'path:username37:2', 'path:username4:2']


To view the actual paths, we can do the following:

In [63]:
def get_list(key):
    return [to_string(x) for x in r.lrange(key, 0, -1)]
for key in keys:
    print(get_list(key))

['item987', 'item552', 'cart', 'item6834', 'item2820']
['item426', 'cart', 'item552', 'cart', 'checkout']
['item987', 'item552', 'cart', 'checkout']
['item987', 'item4526', 'cart', 'item552', 'cart', 'checkout']
['item388', 'item688', 'cart', 'item15', 'cart', 'checkout']
['item6833', 'item6834', 'item2820']
['item237', 'item6833', 'item6834', 'item2820']
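
As a first step towards analyzing such paths, one could measure how many of them end in a purchase. The sketch below works on the seven paths printed above, copied in as literals so that it runs without a server. Treating a visit as a purchase when its last page is `checkout` is an assumption made for this illustration, and `ends_in_purchase` is a hypothetical helper.

```python
new_visit_paths = [
    ['item987', 'item552', 'cart', 'item6834', 'item2820'],
    ['item426', 'cart', 'item552', 'cart', 'checkout'],
    ['item987', 'item552', 'cart', 'checkout'],
    ['item987', 'item4526', 'cart', 'item552', 'cart', 'checkout'],
    ['item388', 'item688', 'cart', 'item15', 'cart', 'checkout'],
    ['item6833', 'item6834', 'item2820'],
    ['item237', 'item6833', 'item6834', 'item2820'],
]

def ends_in_purchase(path):
    """Assumed criterion: the visit counts as a purchase if it ends at checkout."""
    return bool(path) and path[-1] == 'checkout'

purchases = [path for path in new_visit_paths if ends_in_purchase(path)]
conversion_rate = len(purchases) / len(new_visit_paths)

print(len(purchases), 'of', len(new_visit_paths), 'new visits end in a purchase')
```

In practice the list of paths would come from `get_list(key)` for each key returned by the sorted set query rather than from literals.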


One note we would like to make is that indexing elements in a list is costly, as explained in the Lists section. We could have replaced the lists with sorted sets whose scores equal the index to get faster (logarithmic time) indexing. However, since we wanted to demonstrate all of our mentioned data structures and lists are very intuitive in this case, we used lists for the paths.

Now that you have an efficient way of storing paths, you can move on to the next part of your task: coming up with a fancy algorithm to find similarities between paths that lead to purchases and similarities between paths that don't lead to purchases!

### References
In no particular order:
- https://redis.io/topics/data-types-intro
- https://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/?utm_source=tuicool
- http://101.datascience.community/tag/redis/
- https://daringfireball.net/projects/markdown/basics
- http://redis-py.readthedocs.io/en/latest/
- http://tylerstroud.com/2014/11/18/storing-and-querying-objects-in-redis/
- http://fortune.com/2014/11/04/best-selling-fruit-us/