# Avro Schema
Kevin Davis, Walter Erquinigo, Guillermo Monge, Carlos Rodriguez & Alex Smith

***
### Resources
- <a href="https://avro.apache.org/docs/1.8.1/gettingstartedpython.html">Getting Started with Avro</a>
- <a href="https://www-01.ibm.com/software/data/infosphere/hadoop/avro/">IBM's Avro Overview</a>
- <a href="https://dzone.com/articles/introduction-apache-avro">Introduction to Avro</a>
- <a href="http://www.treselle.com/blog/avro-with-python-part-2/">Avro with Python Example</a>
***

### Define the schema for the Slack messages
- user_id
    - string or null value
    - anonymized; Slack RTM's 'user' field passed through an anonymization function
- record_type
    - string or null value
    - type of record; possible values include reconnect_url, presence_change, and message, and it should be message for our data
- text
    - string or null value
    - text of the message sent by the user
- channel
    - string or null value
    - channel in which the message was sent
- timestamp
    - float (required field)
    - Unix timestamp of the record assigned by the Slack API, equivalent to Slack RTM's 'ts' field
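
Avro's `float` type is a 32-bit IEEE 754 value, so it is worth checking how much of a Unix timestamp it can hold. The following is a minimal sketch using only the standard `struct` module; the sample `ts` value is made up, and the check is an illustration rather than part of the original pipeline. If the lost precision matters, Avro's 64-bit `double` is one option to consider.

```python
import struct

# Hypothetical Slack 'ts' value (the RTM API delivers it as a string)
ts = float("1355517523.000005")

# Round-trip through a 32-bit float (Avro "float") and a 64-bit double (Avro "double")
as_float32 = struct.unpack("<f", struct.pack("<f", ts))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", ts))[0]

print("original:         ", repr(ts))
print("as 32-bit float:  ", repr(as_float32))
print("as 64-bit double: ", repr(as_float64))
print("32-bit error (s): ", abs(as_float32 - ts))
```

At this magnitude, adjacent 32-bit floats are 128 seconds apart, so two messages sent within the same minute can end up with the same stored timestamp.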

In [1]:
%%writefile slackSchema.avsc

{
"namespace": "slack.avro",
"type": "record",
"name": "slack_schema",
"fields" : [
    {
        "name": "user_id",
        "type": ["string", "null"],
        "doc": "User ID, anonymized, equivalent to Slack RTM's 'user' field + anonymization function"
    },
    {
        "name": "record_type",
        "type": ["string","null"],
        "doc": "type of record, includes options such as reconnect_url, presence_change, message; should be message for our data"
    },
    {
        "name": "text",
        "type": ["string", "null"],
        "doc": "Text of message sent by user, equivalent to Slack RTM's 'text' field"
    },
    {
        "name": "channel",
        "type": ["string", "null"],
        "doc": "The channel in which the message is being sent, equivalent to Slack RTM's 'channel' field"
    },
    {
        "name": "timestamp",
        "type": "float",
        "doc": "Unix timestamp of record assigned by the Slack API, equivalent to Slack RTM's 'ts' field, required for each record"
    }
 ],
"doc": "A Schema for storing Slack messages."
}

Writing slackSchema.avsc
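
Because an `.avsc` file is plain JSON, a quick check with the standard `json` module can catch syntax slips, such as a missing comma between fields, before Avro parses the file. A minimal sketch, using a shortened inline copy of the schema (two fields only) so that it runs without the file on disk; the helper name `nullable_fields` is hypothetical.

```python
import json

# Shortened inline copy of the schema, for illustration only
schema_text = """
{
  "namespace": "slack.avro",
  "type": "record",
  "name": "slack_schema",
  "fields": [
    {"name": "user_id", "type": ["string", "null"]},
    {"name": "timestamp", "type": "float"}
  ]
}
"""


def nullable_fields(schema):
    """Return the names of fields whose type is a union that includes "null"."""
    return [
        field["name"]
        for field in schema["fields"]
        if isinstance(field["type"], list) and "null" in field["type"]
    ]


# json.loads raises a ValueError subclass if the text is not valid JSON
schema = json.loads(schema_text)
field_names = [field["name"] for field in schema["fields"]]

print("fields:  ", field_names)
print("nullable:", nullable_fields(schema))
```

To check the real file, the inline string could be replaced with the contents of `slackSchema.avsc`.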


In [5]:
%%writefile avroSerializer.py

# import our avro libraries
import io

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import BinaryEncoder, DatumReader, DatumWriter

# load the schema written above
with open("slackSchema.avsc", "rb") as schema_file:
    schema = avro.schema.parse(schema_file.read())


# serialize a single message from Slack's RTM API
def avroSerialize(message):
    """Map one RTM message onto the schema's fields and return the
    Avro-encoded bytes, or None if the message does not fit the schema."""

    # set up a writer to serialize data
    writer = DatumWriter(schema)

    # the RTM API also returns events that are not chat messages and do
    # not fit our Avro schema, so the conversion is wrapped in a try block
    try:

        # build a record whose keys match the schema's field names
        new_message = {}
        new_message['user_id'] = message['user']
        new_message['record_type'] = message['type']
        new_message['text'] = message['text']
        new_message['channel'] = message['channel']
        # Slack sends 'ts' as a string; the schema expects a float
        new_message['timestamp'] = float(message['ts'])

        # DatumWriter.write needs an encoder, so encode into an
        # in-memory buffer and return the resulting bytes
        buffer = io.BytesIO()
        writer.write(new_message, BinaryEncoder(buffer))
        return buffer.getvalue()

    # a failure here most likely means the event is one we don't care
    # about, like a status change, so we skip it
    except Exception:
        return None

Overwriting avroSerializer.py
