This is where we take notes for this class...
Applications of data engineering and big data: social networks,
data analytics (sampling to machine learning),
storage (NoSQL, pronounced "no sequel")
Issues with queries and performance
Graph
Key values
Columnar
Data Collection and Cleaning
Data modeling is wicked hard!
Infoviz
ex. D3
Data Lifecycle:
Question (Curation: longevity of data, triage: prioritization) -> Collection, Generation -> Clean up -> Storage -> Processing/Analysis -> Query + Visualize + Act -> New Questions
Software engineers play large role in the stages of the data lifecycle
Request Response Cycle:
Start -> Get/Post/Put/Delete -> HTTP Server
Go to https://github.com/cu-data-engineering-s15/syllabus -> wiki
Markdown presentation:
Markdown is a plain-text formatting syntax that converts easily to HTML
Types of Markdown: Standard Markdown (SM) and GitHub Flavored Markdown (GFM)
What can you do with it? Styling, word formatting, add images, create lists, links, code blocks
Headers - created with the '#' symbol
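Bold/Italics - wrap text in asterisks or underscores (one pair for italics, two for bold)
Links - created with square brackets and parentheses
Code blocks - created by placing triple backticks on the lines before and after the code
Tables - created with pipes (|) and dashes (-)
Horizontal lines - created with triple hyphens, asterisks, or underscores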
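To make the syntax above concrete, here is a minimal sketch of how a few of these constructs map to HTML. The function name `miniMarkdown` is hypothetical, and it only handles headers, bold, italics, links, and a hyphen rule, one line at a time; real Markdown processors cover far more cases.

```javascript
// Hypothetical helper: converts a tiny subset of Markdown to HTML.
function miniMarkdown(text) {
  return text.split('\n').map(function (line) {
    // Inline styles: bold first so '**' is not consumed as two italics.
    var html = line
      .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
      .replace(/\*(.+?)\*/g, '<em>$1</em>')
      .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');
    // Headers: one to six '#' characters followed by a space.
    var header = /^(#{1,6}) (.*)$/.exec(html);
    if (header) {
      var level = header[1].length;
      return '<h' + level + '>' + header[2] + '</h' + level + '>';
    }
    // Horizontal rule: a line of exactly three hyphens.
    if (html === '---') {
      return '<hr>';
    }
    return html;
  }).join('\n');
}

console.log(miniMarkdown('# Notes\nSee the **class** [wiki](https://github.com/cu-data-engineering-s15/syllabus)'));
```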
Services:
REST - Representational State Transfer
Resources - URI
CRUD (Create, Read, Update, Destroy)
Given /users :
GET - read
POST - create (sends {data})
PUT - update
DELETE - destroy
Simple Example:
require 'sinatra'
require 'sinatra/reloader' if development?
require 'json'
configure do
set :port, 3000
end
get '/api/1.0/whattimeisit' do
{status: true, message: Time.now}.to_json + "\n"
end
More Complicated Example:
require 'json'
require 'time'
class Contact
attr_reader :id, :name, :birthdate, :email, :phone, :twitter
RESTful Web Services
REST
REST - architectural style for web services (inventor: Roy Fielding)
Approach to developing web services that mimics the design of the Web itself
Service provides access to linked set of resources
Operations: CRUD (Create, Read, Update, Delete)
Example:
GET /api/1.0/users Retrieves list of users
GET /api/1.0/users/0 Retrieves details of user0
POST /api/1.0/users Creates new user
PUT /api/1.0/users/0 Update user0
DELETE /api/1.0/users/0 Delete user0
GET /api/1.0/search?q=tattersall Performs a search with the query tattersall
Each operation may produce a result (JSON format is KING)
POST and PUT methods typically send data
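The mapping from HTTP methods to CRUD operations listed above can be sketched without a web server. The `handle` function and the in-memory `users` array below are hypothetical stand-ins, not code from the class repository; a real service would sit behind Sinatra or Express and persist its data.

```javascript
// In-memory "users" resource; the handler below is a hypothetical sketch.
var users = [{ name: 'user0' }];

// Maps an HTTP method and path to a CRUD operation and returns a JSON string.
function handle(method, path, body) {
  var match = /^\/api\/1\.0\/users(?:\/(\d+))?$/.exec(path);
  if (!match) {
    return JSON.stringify({ status: false, message: 'not found' });
  }
  var id = match[1] === undefined ? null : Number(match[1]);
  if (method === 'GET' && id === null) {
    return JSON.stringify({ status: true, data: users });       // Read (list)
  }
  if (method === 'POST' && id === null) {
    users.push(body);                                            // Create
    return JSON.stringify({ status: true, data: { id: users.length - 1 } });
  }
  if (id === null || users[id] === undefined) {
    return JSON.stringify({ status: false, message: 'no such user' });
  }
  if (method === 'GET') {
    return JSON.stringify({ status: true, data: users[id] });   // Read (one)
  }
  if (method === 'PUT') {
    users[id] = body;                                            // Update
    return JSON.stringify({ status: true, data: users[id] });
  }
  if (method === 'DELETE') {
    users[id] = undefined;                                       // Delete, keeping other ids stable
    return JSON.stringify({ status: true });
  }
  return JSON.stringify({ status: false, message: 'unsupported method' });
}

console.log(handle('POST', '/api/1.0/users', { name: 'user1' }));
console.log(handle('GET', '/api/1.0/users/1'));
```

Every branch returns JSON, matching the note that each operation may produce a result, and only POST and PUT read the `body` argument.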
Dealing with accessing shared resources
One approach:
GET /api/1.0/posts/0/comments/1 Gets first comment on post0
POST /api/1.0/posts/0/comments Creates a new comment on post0
Alternative approach: While performing an operation on one resource, you reference other resources in the data that is sent with the request.
Issues
Security, Identity, Failure, Persistence
Example
Contacts Web Service
Implemented in Ruby and JavaScript
Technologies used: Sinatra, RSpec, Typhoeus, Node, Express
Goto: https://github.com/cu-data-engineering-s15/contacts
Git Presentations
Initialization: git init - starts a git repo in current directory with no files currently tracked
Clone: git clone remote_repository_address - creates a new git repo, in a new subdirectory of the current directory, that is a copy of the remote repository
Branching: git branch, git branch new_branch_name (create), git branch -d new_branch_name (delete), git checkout branch_name (change to branch), git checkout commit
Add: git add file
Commit: git commit -m "Commit Message" - "-m" flag is optional; without it, git opens an editor for the message
Merge: git merge branch_name - merges named branch into the current branch
Pull: git pull [remote_repository] [branch_name]
Push: git push [remote_repository]
Other cool things: log (commit history), remote (manage the remote repos your repo knows about), stash (shelving), rebase (change how branches are related), diff (show differences between commits), fetch (get changes but does not integrate them), reset (move the current branch head to a given commit), tag (mark git objects), mv (moves files), rm (removes a file and stops tracking it)
GitHub Presentation
Fork - exact copy of repo, use to run experiments without risk of messing things up
GitHub workflow: create a branch, commit your changes, submit a pull request, merge after incorporating feedback
Demo: https://github.com/Zandrr/pull_requests_demo
git checkout -b "new-branch-bug-fix"
(make changes)
git add README.md
git commit -m "updated readme"
Node.js
Node.js - server-side platform for executing JavaScript
Hello World:
var http = require('http');
http.createServer(function (req, res) {
res.writeHead(200, {'Content-Type': 'text/plain'});
res.end('Hello World\n');
}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');
Basic Structure:
while (there are events to handle) {
event
}
Callback hell - situation where callbacks are deeply nested, which makes code hard to read; delays in callbacks
Problems with synchronicity
Two ways to solve problem:
- Use synchronous functions
- Use named callback functions
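The second option can be sketched as follows. The helpers `loadUser` and `loadPosts` are hypothetical, and `setTimeout` stands in for real IO; the point is only that the nested version and the named version do the same work, while the named one stays flat.

```javascript
// Hypothetical async helpers; setTimeout stands in for real IO.
function loadUser(id, callback) {
  setTimeout(function () { callback(null, { id: id, name: 'user' + id }); }, 0);
}
function loadPosts(user, callback) {
  setTimeout(function () { callback(null, [user.name + ' post A', user.name + ' post B']); }, 0);
}

// Nested style: each step is indented inside the previous callback.
function firstPostNested(id, done) {
  loadUser(id, function (err, user) {
    if (err) { return done(err); }
    loadPosts(user, function (err, posts) {
      if (err) { return done(err); }
      done(null, posts[0]);
    });
  });
}

// Named style: the same steps, flattened into named functions.
function firstPostNamed(id, done) {
  function onUser(err, user) {
    if (err) { return done(err); }
    loadPosts(user, onPosts);
  }
  function onPosts(err, posts) {
    if (err) { return done(err); }
    done(null, posts[0]);
  }
  loadUser(id, onUser);
}

firstPostNamed(0, function (err, post) {
  console.log(post); // prints "user0 post A"
});
```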
Node Execution Model
Node is single-threaded!
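Any code you write is guaranteed to run synchronously, one callback at a time
No need to worry about race conditions between threads in your own code
IO is handled in parallel
If you issue an asynchronous call for IO:
Callback is registered
IO call is executed in separate thread
That thread immediately blocks because it is an IO call; your code keeps running
When the IO completes, the callback is queued to run on the main thread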
Makes it easy to implement services that run server-side
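A small sketch of the ordering guarantee described above: a registered callback cannot run until the code that is currently executing has finished. The timer here is only a stand-in for an IO call, and the `order` array is an illustrative name.

```javascript
var order = [];

// Register a callback; it cannot run until the current code has finished.
setTimeout(function () {
  order.push('callback');
}, 0);

// Synchronous work on the single thread; no callback can interrupt this loop.
var sum = 0;
for (var i = 0; i < 1000; i++) {
  sum += i;
}
order.push('loop done');

// A later timer reports the order in which things actually happened.
setTimeout(function () {
  console.log(order.join(' -> ')); // prints "loop done -> callback"
}, 10);
```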
Express
Web app framework written in Javascript for use in Node.js
Design influenced by Sinatra
Makes it easy to define endpoints of web-based service
Includes features to create website
Minimal framework - designed to be augmented by Node packages
Create Directory - cd into directory and run 'npm init'
Install express: npm install --save express
Install middleware: npm install --save body-parser, then npm install --save morgan
See what npm installed: npm list
Creating Test.json: start with Get request
Creating a module: npm install --save moment, mkdir lib, cd lib, vi time.js
AngularJS
- client side web application framework
- written in JavaScript
- compatibility with JSON RESTful services.
- Data bindings
- HTML tag is associated with model object and automatically updates
- Controllers
- Define all states/methods, modularize and decompose data
- Services
- Services remember things that controllers may leave behind when they are instantiated multiple times, e.g. a login controller
- Directives
- Allow angular to integrate into HTML in a natural way
- They can also be used to create reusable components that combine controllers, data, and HTML
- Embeddable
- Angular can control as much or as little of a web page as you specify
- Provides control over web page, easy to add new functionality
public class Employee {
    public Employee(Database d) {
        // employee needs a database to exist,
        // injectable -> finds a database for you without you needing to set up database connections, etc.
        // Spring framework is an example of this that gives you dependency injection
    }
}
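The same idea can be sketched in JavaScript. `InMemoryDatabase` and the `persist` method are hypothetical names chosen for illustration; the point is that `Employee` receives its database instead of creating one, so the caller (or a framework such as Angular's injector) decides which implementation it gets.

```javascript
// Hypothetical database stand-in; a real one would open a connection.
function InMemoryDatabase() {
  this.rows = {};
}
InMemoryDatabase.prototype.save = function (key, value) { this.rows[key] = value; };
InMemoryDatabase.prototype.find = function (key) { return this.rows[key]; };

// Employee does not create its database; it receives one (constructor injection).
function Employee(database, name) {
  this.database = database;
  this.name = name;
}
Employee.prototype.persist = function () {
  this.database.save(this.name, { name: this.name });
};

// Whoever constructs the Employee decides which database it gets.
var db = new InMemoryDatabase();
new Employee(db, 'Ada').persist();
console.log(db.find('Ada'));
```

Swapping `InMemoryDatabase` for another object with `save` and `find` requires no change to `Employee`, which is what makes the class easy to test.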
- Module is the primary way to package up a set of controllers into an Angular application
- To Create a module: give name and list dependencies
angular.module('contactsApp', [])
- Creates a module called contactsApp; with no dependencies
angular.module('contactsApp')
- Once you have created a module, you can gain a handle to it by calling angular.module
- When you have defined a module you can tell Angular where it lives in the HTML like this:
<html ng-app="contactsApp">
</html>
- To do something in angular, you need a controller.
- Declared using controller function:
angular.module('contactsApp').controller('MainController', [<dependencies and code>])
- second param is an array that allows controller to declare its dependencies
angular.module('contactsApp').controller('MainController', [function() {
var self = this;
self.name = "Ken Anderson";
self.update = function() {
self.name = "Kenneth M. Anderson";
};
}]);
- Anything defined on this is available to the HTML that makes use of the controller.
* here is an example of using dependencies

```javascript
angular.module('contactsApp').controller('MainController', ['$http', function($http) {
  var self = this;
  self.name = "Ken Anderson";
  self.update = function() {
    return $http.get('/api/1.0/update_name').then(function(response) {
      self.name = response.data.new_name;
      return response;
    });
  };
}]);
```

* Create a controller that requires use of Angular's built-in http module.
* Event lifecycle
* http get -> returns a promise
* pass a function, gives a response.
* inside function do what you want with the response
* can chain these things... i.e. .then().then()....

```javascript
var age = 22       // private var
this.age = 22      // public var
```

#### AngularJS Hello World
<!DOCTYPE html>
<html>
<head>
<title>Hello World</title>
</head>
<body ng-app>
<h1>Hello {{name}}</h1>
<input type="text" ng-model="name" placeholder="First Name">
<script src="http://ajax.googleapis.com/ajax/libs/angularjs/1.3.11/angular.min.js"></script>
</body>
</html>
Demonstration with html files.
Go to contacts_web_app for example code
Prerequisites: Need ruby, gem
consumer key: identifies app
access tokens: identify user
Goto: https://github.com/cu-data-engineering-s15/get_tweets
Sample Call
ruby get_tweets.rb --props=/home/user/oauth.properties badastronomer
Statically-typed language (like Java) - uses interfaces
Dynamically-typed language (like Ruby) - improvises with a method that raises NotImplementedError until a subclass overrides it
def url
raise NotImplementedError, "No URL Specified for the Request"
end
Subclasses provide implementations of: url, request_name, twitter_endpoint, success
Subclasses may provide implementations for error, authorization, options, make_request, collect
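The contract above can be sketched in JavaScript as well. `BaseRequest`, `TimelineRequest`, and `describe` are hypothetical names, and the URL is a placeholder; this is an illustration of the pattern, not the code in the get_tweets repository.

```javascript
// Base class: defines the contract; names here are hypothetical.
function BaseRequest() {}
BaseRequest.prototype.url = function () {
  throw new Error('No URL Specified for the Request');
};
// Shared behavior that relies on the subclass-provided url().
BaseRequest.prototype.describe = function () {
  return 'REQUESTING: ' + this.url();
};

// Subclass: supplies the required method.
function TimelineRequest() {
  BaseRequest.call(this);
}
TimelineRequest.prototype = Object.create(BaseRequest.prototype);
TimelineRequest.prototype.constructor = TimelineRequest;
TimelineRequest.prototype.url = function () {
  return 'https://example.com/timeline'; // placeholder URL, not a real endpoint
};

console.log(new TimelineRequest().describe());
```

Calling `describe` on a bare `BaseRequest` throws, which is the dynamic-language substitute for a compiler rejecting a class that fails to implement an interface.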
Consists of custom class def and method to create default logger
Created in TwitterRequest's constructor:
@log = args[:log] || default_logger
Accessed via log attribute:
log.info("REQUESTING: #{request.base_url}?#{display_params}")
Automatically keeps track of rate limits for Twitter endpoint
Blocks call and sleeps until current Twitter window is done
make_request: ensures rates are checked on each request
def make_request
check_rates
request = Typhoeus::Request.new(url, options)
log.info("REQUESTING: #{request.base_url}?#{display_params}")
response = request.run
@rate_count = @rate_count - 1
response
end
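What a rate check might do can be sketched as below. `RateTracker`, its parameters, and the injectable clock are assumptions made for illustration; the actual `check_rates` implementation is not shown in these notes, and real limits and window lengths depend on the Twitter endpoint.

```javascript
// Hypothetical rate tracker; limit and windowMs are illustrative values.
function RateTracker(limit, windowMs, now) {
  this.limit = limit;
  this.windowMs = windowMs;
  this.now = now;                 // injectable clock, returns milliseconds
  this.remaining = limit;
  this.windowEnd = now() + windowMs;
}

// Returns how long the caller must wait (in ms) before the next request.
RateTracker.prototype.waitTime = function () {
  if (this.now() >= this.windowEnd) {
    // The window has passed: start a fresh one with a full allowance.
    this.remaining = this.limit;
    this.windowEnd = this.now() + this.windowMs;
  }
  return this.remaining > 0 ? 0 : this.windowEnd - this.now();
};

// Call once per request actually sent, mirroring @rate_count - 1.
RateTracker.prototype.recordRequest = function () {
  this.remaining -= 1;
};

var clock = 0;
var tracker = new RateTracker(2, 1000, function () { return clock; });
tracker.recordRequest();
tracker.recordRequest();
console.log(tracker.waitTime()); // prints 1000: the allowance is used up
```

A caller would sleep for `waitTime()` milliseconds before sending, which corresponds to blocking until the current window is done.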
MaxIdRequest, CursorRequest, StreamingRequest
MaxIdRequest: subclass for endpoints that traverse timelines with the max_id parameter
Defines new contract with: init_condition, condition, update_condition
CursorRequest: does not need to define contract for subclasses, but can implement functionality directly
StreamingRequest: collect designed to run forever
handlers - on_headers, on_body, on_complete
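Traversing a timeline with max_id can be sketched against a fake in-memory timeline. `TIMELINE`, `fetchPage`, and `collectAll` are hypothetical, and the comments mapping steps to init_condition, condition, and update_condition are an assumption about how that contract is used.

```javascript
// Fake timeline of tweet ids, newest first; stands in for a real endpoint.
var TIMELINE = [110, 109, 107, 104, 103, 101, 100];

// Hypothetical fetch: returns up to `count` tweets with id <= maxId.
function fetchPage(maxId, count) {
  return TIMELINE.filter(function (id) {
    return maxId === null || id <= maxId;
  }).slice(0, count);
}

// Walks the whole timeline by lowering max_id after every page.
function collectAll(count) {
  var all = [];
  var maxId = null;                       // init_condition: no max_id yet
  while (true) {
    var page = fetchPage(maxId, count);
    if (page.length === 0) { break; }     // condition: stop on an empty page
    all = all.concat(page);
    maxId = page[page.length - 1] - 1;    // update_condition: lowest id seen, minus 1
  }
  return all;
}

console.log(collectAll(3));
```

Subtracting 1 from the lowest id seen keeps the next page from returning the last tweet of the previous page again.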
NoSQL databases are AWARE of their distributed nature
They manage sharding and replication for you and are horizontally scalable
NoSQL databases tend to avoid mutable data
NoSQL databases are fault tolerant
Key-Value, Graphs, Columnar, Documents
Key-Value Stores
Just like a hash table, values are untyped
Benefits: Simple
Graph Stores
Optimized to store and traverse graph structures
Provide structural query language to locate info based on data structure
Columnar Stores
Able to store enormous amounts of data and achieve very fast writes while also reading efficiently
Distributed hash table that is easy to partition across nodes of cluster
Document Stores
Similar to key-value but with more structure
Each document gets indexed in various ways and can be grouped into collections, which are grouped into databases
Document Database
Document NoSQL Database
Implemented in Erlang
Embraces the web
Document Model
self-contained data
stores documents, which contain everything that might be needed by an app that uses it
No schema is enforced
CAP Theorem
When designing a distributed data store, there are issues you must confront as soon as your system has more than one running server
Issues: Consistency, Availability, Partition Tolerance
CAP theorem: PICK ANY TWO
Consistency and Transactions
CAP theorem: Consistency and availability compete with one another
MongoDB
Terminology
document - basic unit of data
collection - table with dynamic schema
Documents
Case sensitive and type sensitive
Cannot contain duplicate keys
Collections
group of documents
names can be any UTF-8 string, but cannot be the empty string, contain the null character, start with "system.", or contain the '$' character
Databases
Reserved database names: admin, local, config
When to use MongoDB?
Medical records, other large document systems
read heavy environments like analytics and mining
partnered with relational databases
Install MongoDB on Windows
Go to: https://www.mongodb.org/downloads
Lucene
Powerful search engine written in Java
Used in Solr
Solr
search server built on top of Apache Lucene that provides array of features
Searches indexed documents
Each document has ID and term list
Has list of documents for each term
Identifies each document by ID and returns
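The term-to-documents structure described above is an inverted index. The sketch below is a toy version with hypothetical function names; Lucene and Solr add analysis, ranking, and on-disk storage on top of this idea.

```javascript
// Builds term -> list of document ids; a toy version of an inverted index.
function buildIndex(docs) {
  var index = Object.create(null); // no inherited keys such as "constructor"
  Object.keys(docs).forEach(function (id) {
    var terms = docs[id].toLowerCase().split(/\W+/).filter(Boolean);
    terms.forEach(function (term) {
      if (!index[term]) { index[term] = []; }
      if (index[term].indexOf(id) === -1) { index[term].push(id); }
    });
  });
  return index;
}

// Looks a term up and returns the ids of the documents that contain it.
function search(index, term) {
  return index[term.toLowerCase()] || [];
}

var docs = { d1: 'Solr is built on Lucene', d2: 'Lucene is written in Java' };
var index = buildIndex(docs);
console.log(search(index, 'lucene'));
```

A search never scans the documents themselves; it reads one list from the index, which is why lookups stay fast as the collection grows.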
Sunspot
Gem for Ruby on Rails
Bundled version of Solr
Pitfalls
Issues with Solr and Rake
Starting server with Solr - different ports
Conclusion
Solr is a good way to get search functionality for DB
Fast, reliable, fantastic features like pagination and indexing
key value store
Basics
not a database replacement
all keys and final values are strings
Users
Github, stackoverflow, twitter, etc.
Optional Features
Persistence, replication, clustering, server-side scripting
Special Commands
Incr, Decr, etc.
Cloud Providers
AWS, Morpheous, etc.
Lists
Blocking operations provided
Sorted Set
Don't have to sort each time set is presented
Bitmaps
constant time set/get methods
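Why bitmap set/get operations take constant time can be seen in a small sketch. The `Bitmap` type below is hypothetical and does not reproduce the store's own memory layout or commands; it only shows that a bit is located by arithmetic on its offset, with no searching.

```javascript
// Fixed-size bitmap backed by bytes; each bit is addressed by its offset.
function Bitmap(size) {
  this.bytes = new Uint8Array(Math.ceil(size / 8));
}
// Sets the bit at `offset` to 0 or 1 in constant time.
Bitmap.prototype.set = function (offset, value) {
  var mask = 1 << (offset % 8);
  if (value) {
    this.bytes[offset >> 3] |= mask;
  } else {
    this.bytes[offset >> 3] &= ~mask;
  }
};
// Reads the bit at `offset` in constant time.
Bitmap.prototype.get = function (offset) {
  return (this.bytes[offset >> 3] >> (offset % 8)) & 1;
};

var seen = new Bitmap(64);
seen.set(10, 1);
console.log(seen.get(10), seen.get(11)); // prints "1 0"
```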
Kafka
works with streaming data (e.g., gathering tweets about Kanye on Twitter)
Bottleneck created
keeps all messages for up to N days
written in Scala
named after Franz Kafka
Alternatives
RabbitMQ, A Database, Redis, etc.
Example Architecture
detecting civil unrest
Interaction of Kafka with Spark, Express.js, MongoDB, AngularJS
Zookeeper - allows for greater coordination in machines
Graph-based database, NoSQL Graph Database
Graph Database
Stores data as nodes and relationships
Cheap to traverse along relationships
Nodes represent entities
Edges show relationships
No constraints on data
Cannot shard subgraphs
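The node-and-relationship model can be sketched with an adjacency list. The `Graph` type, the `FOLLOWS` relationship, and the sample nodes are hypothetical; a graph database stores the same structure persistently and exposes it through a query language.

```javascript
// Tiny property graph: nodes by id, relationships as typed, directed edges.
function Graph() {
  this.nodes = {};
  this.edges = {};
}
Graph.prototype.addNode = function (id, properties) {
  this.nodes[id] = properties;
  this.edges[id] = [];
};
Graph.prototype.relate = function (from, type, to) {
  this.edges[from].push({ type: type, to: to });
};
// Breadth-first traversal along relationships of one type.
Graph.prototype.reachable = function (start, type) {
  var visited = [start];
  var queue = [start];
  while (queue.length > 0) {
    var current = queue.shift();
    this.edges[current].forEach(function (edge) {
      if (edge.type === type && visited.indexOf(edge.to) === -1) {
        visited.push(edge.to);
        queue.push(edge.to);
      }
    });
  }
  return visited.slice(1); // exclude the start node itself
};

var g = new Graph();
g.addNode('ann', { name: 'Ann' });
g.addNode('bob', { name: 'Bob' });
g.addNode('cat', { name: 'Cat' });
g.relate('ann', 'FOLLOWS', 'bob');
g.relate('bob', 'FOLLOWS', 'cat');
console.log(g.reachable('ann', 'FOLLOWS'));
```

Each hop only reads the edge list of the current node, which is the sense in which traversing along relationships is cheap.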
HBase
Hadoop database - open source distributed column-oriented database
Serves tables with billions of rows and millions of columns
Built on top of the Hadoop Distributed File System (HDFS)
Why HBase?
HBase leverages distributed data storage provided by HDFS
HBase provides low latency access to single rows from billions of records
HBase Data Model
Stored in tables
Each cell value has timestamp
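A sketch of what timestamped cells allow: several versions of the same cell can coexist, and a read can ask for the latest one or for the value as of an earlier time. `VersionedTable` is a hypothetical in-memory stand-in, not the HBase client API; the `info:email` column name is illustrative.

```javascript
// Toy table: row key -> column -> list of {timestamp, value} versions.
function VersionedTable() {
  this.rows = {};
}
VersionedTable.prototype.put = function (rowKey, column, value, timestamp) {
  var row = this.rows[rowKey] || (this.rows[rowKey] = {});
  var versions = row[column] || (row[column] = []);
  versions.push({ timestamp: timestamp, value: value });
  // Keep newest first so reads can take the first matching entry.
  versions.sort(function (a, b) { return b.timestamp - a.timestamp; });
};
// Returns the newest value at or before `asOf` (defaults to the latest).
VersionedTable.prototype.get = function (rowKey, column, asOf) {
  var versions = (this.rows[rowKey] || {})[column] || [];
  for (var i = 0; i < versions.length; i++) {
    if (asOf === undefined || versions[i].timestamp <= asOf) {
      return versions[i].value;
    }
  }
  return undefined;
};

var table = new VersionedTable();
table.put('user0', 'info:email', 'old@example.com', 1);
table.put('user0', 'info:email', 'new@example.com', 2);
console.log(table.get('user0', 'info:email'), table.get('user0', 'info:email', 1));
```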
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware
Map, Reduce, MapReduce
Functions to transform data
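The map and reduce steps can be sketched with a single-process word count. The function names are hypothetical, and the explicit `shuffle` step stands in for the grouping that a framework such as Hadoop performs between the two phases across a cluster.

```javascript
// Map: turn one line into (word, 1) pairs.
function mapLine(line) {
  return line.toLowerCase().split(/\W+/).filter(Boolean).map(function (word) {
    return [word, 1];
  });
}
// Shuffle: group the emitted values by key.
function shuffle(pairs) {
  var groups = Object.create(null);
  pairs.forEach(function (pair) {
    (groups[pair[0]] || (groups[pair[0]] = [])).push(pair[1]);
  });
  return groups;
}
// Reduce: collapse each key's values to a single count.
function reduceCounts(values) {
  return values.reduce(function (sum, n) { return sum + n; }, 0);
}

function wordCount(lines) {
  var pairs = [];
  lines.forEach(function (line) { pairs = pairs.concat(mapLine(line)); });
  var groups = shuffle(pairs);
  var counts = {};
  Object.keys(groups).forEach(function (word) {
    counts[word] = reduceCounts(groups[word]);
  });
  return counts;
}

console.log(wordCount(['big data', 'big clusters']));
```

Because `mapLine` looks at one line and `reduceCounts` at one key, both can run on many machines at once, which is what makes the pattern suit large clusters.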
Open source data analytics cluster computing framework
Real time vs interactive vs batch processing
Open Source
Helps subdivide UI into small components
Maintains virtual DOM
Mix and match custom components with HTML
One-way state binding
Can only return single node
Google app engine
Support for Python, Java, PHP, Go
Virtual Machines: Standard, high CPU, high memory
MySQL databases in the cloud
Users: Snapchat, Coca-cola, etc.
Automating Server Development
Makes it easier to get up and running on a new server
Allows users to work from local versions
Simple - uses SSH to connect to machine and runs commands from terminal
Advanced Geospatial
GeoJSON + Turf + Mapping = Infographics
format to encode geographical data structures
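For reference, a GeoJSON point looks like the sketch below. The place name and coordinates are illustrative values; note that GeoJSON positions are written longitude first, then latitude.

```javascript
// A GeoJSON Feature: geometry plus free-form properties.
var boulder = {
  type: 'Feature',
  geometry: {
    type: 'Point',
    coordinates: [-105.27, 40.01] // [longitude, latitude], illustrative values
  },
  properties: { name: 'Boulder' }
};

// Features are usually grouped into a FeatureCollection.
var collection = { type: 'FeatureCollection', features: [boulder] };

console.log(JSON.stringify(collection));
```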
Library to create interactive data visualizations for the web
Flask
Microframework
Based on Werkzeug WSGI toolkit and Jinja2 template engine
Powered by SQLite
JavaScript library for manipulating documents based on data
Data organized at top of architecture
Similar to CSS in select
Layers, MultiPlatform, GIS, interactive
Map Example
Visualization Grammar built on top of D3.js
Enable fast, customizable design
Make reusable-sharable visualizations
Tabular data model
Axes (lines, ticks, labels) and Legends (colors, shapes, sizes)
Marks - basic visual building block for visualization
Graph database
Uses Languages Gremlin and SPARQL
Simple way to work with services over the internet
Cloud computing
Reliable, scalable, low-latency, easy to use, inexpensive
Networking
Amazon Route 53
VPC
Amazon DirectConnect
Storage
S3
EBS
Glacier
Database
DynamoDB
ElastiCache
RDS
Redshift