# Big data integration and processing

Data modelling
- tells how data is structured and what operations can be done
- relational, semi-structured, graph, text

BDMS
- designed for parallel and distributed processing 
- may not guarantee consistency (most likely supports eventual consistency)
- often built on Hadoop

Requirements of a BDMS
- support big data operations (split volumes of data, access data fast, distribute computations to nodes)
- handle fault tolerance (replicate data partitions, recover files when needed)
- enable scaling out
- optimized and extensible for many data types
- enable both streaming and batch processing (low latency for streaming, accuracy for batch)

Query language
- declarative: specify what you need rather than how to obtain it

When data is stored on multiple machines
- local indexing: use local index on each machine
- global indexing: use a combined index in a global index server
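
The difference between the two indexing schemes can be sketched in plain Python. This is only an illustration: the `machines` dictionary, the node names, and the helper functions are hypothetical and do not correspond to any particular BDMS.

```python
from collections import defaultdict

# Hypothetical data: each "machine" holds a partition of (record_id, city) rows.
machines = {
    "node1": [(1, "Dublin"), (2, "Cork")],
    "node2": [(3, "Dublin"), (4, "Galway")],
}

def build_local_indexes(machines):
    """Local indexing: one index per machine, covering only its own rows."""
    local = {}
    for node, rows in machines.items():
        index = defaultdict(list)
        for record_id, city in rows:
            index[city].append(record_id)
        local[node] = dict(index)
    return local

def build_global_index(machines):
    """Global indexing: one combined index that also records the owning node."""
    index = defaultdict(list)
    for node, rows in machines.items():
        for record_id, city in rows:
            index[city].append((node, record_id))
    return dict(index)

def lookup_local(local_indexes, city):
    # Every machine has to be asked, because no one knows who holds the key.
    return [(node, rid)
            for node, index in local_indexes.items()
            for rid in index.get(city, [])]

def lookup_global(global_index, city):
    # A single lookup on the index server says which nodes to contact.
    return global_index.get(city, [])
```

With local indexes a lookup is broadcast to all machines; with a global index one lookup identifies the relevant machines, at the cost of maintaining the combined index.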

Semijoin (used in distributed settings)
- A semijoin from R to S reduces data transmission cost
- steps
    - "Project" R on attribute A (call it R[A])
    - "Ship" this projection from the site of R to the site of S
    - "Reduce" S by eliminating tuples whose value of attribute A does not match any value in R[A]
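
The three steps above can be sketched in Python. The relations are modelled as lists of dictionaries, and the function names and sample rows are hypothetical; in a real distributed system the value returned by the first function is what would travel over the network.

```python
def project_at_r_site(r_rows, attr):
    """Step 1 (project): compute R[A], the distinct values of attribute A in R."""
    return {row[attr] for row in r_rows}

def reduce_at_s_site(s_rows, r_a, attr):
    """Step 3 (reduce): keep only the S tuples whose A value appears in R[A]."""
    return [row for row in s_rows if row[attr] in r_a]

# Hypothetical relations living on two different sites.
R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"A": 2, "y": "s1"}, {"A": 3, "y": "s2"}, {"A": 2, "y": "s3"}]

r_a = project_at_r_site(R, "A")          # computed where R is stored
# Step 2 (ship): only r_a, not all of R, is sent to the site of S.
reduced_s = reduce_at_s_site(S, r_a, "A")  # computed where S is stored
```

Only the small set `r_a` crosses the network, and only the matching tuples of S need to be sent onward for the full join.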

Querying MongoDB
- EX1
    - SQL: SELECT * FROM Beers
    - MongoDB: db.Beers.find()
- EX2
    - SQL: SELECT beer, price FROM Sells
    - MongoDB: db.Sells.find({}, {beer:1, price:1})
- EX3
    - SQL: SELECT manf FROM Beers WHERE name = "Heineken"
    - MongoDB: db.Beers.find({name: "Heineken"}, {manf:1, _id:0})
- EX4
    - SQL: SELECT DISTINCT beer, price FROM Sells WHERE price > 15
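    - MongoDB: db.Sells.aggregate([{\$match: {price: {\$gt: 15}}}, {\$group: {_id: {beer: "\$beer", price: "\$price"}}}])
    - note: distinct() accepts only a single field, e.g. db.Sells.distinct("beer", {price: {\$gt: 15}})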
- EX5: 
    - count the number of manufacturers whose names contain the partial string "am" (case insensitive)
    - MongoDB: db.Beers.find({name: {\$regex: /am/i}}).count()
- EX6: 
    - count the number of manufacturers whose names start with "am" and end with "corp"
    - MongoDB: db.Beers.find({name: {\$regex: /^am.*corp\$/}}).count()
- EX7: 
    - find items which are tagged as "popular" or "organic"
    - MongoDB: db.inventory.find({tags: {\$in: ["popular", "organic"]}})
- EX8: 
    - find items which are not tagged as "popular" or "organic"
    - MongoDB: db.inventory.find({tags: {\$nin: ["popular", "organic"]}})
- EX9: 
    - find 2nd and 3rd element of tags
    - MongoDB: db.inventory.find({}, {tags: {\$slice: [1,2]}})
- EX10: 
    - find a document whose 2nd element in tags is "summer"
    - MongoDB: db.inventory.find({"tags.1": "summer"})
- EX11: 
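    - count the number of unique addresses of drinkers
    - MongoDB: db.Drinkers.distinct("addr").length
    - (db.Drinkers.count({addr: {\$exists: true}}) would instead count the drinkers that have an addr field)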
- EX12: 
    - get the distinct values of the array field places
    - MongoDB: db.countryDB.distinct("places")
- EX13:
    - How many tweets have location not null?
    - db.users.find({"user.Location": {\$ne: null}}).count()
- EX14:
    - How many people have more followers than friends?
    - db.users.find({ \$where : "this.user.FollowersCount > this.user.FriendsCount"}).count()
- EX15:
    - Perform a query that returns the text of tweets which have the string "http://"
    - db.users.find({"tweet_text" : {\$regex : ".*http://.*"}})
- EX16:
    - Get all the tweets from the location "Ireland" which also contain the string "UEFA"
    - db.users.find({"tweet_text" : {\$regex : "UEFA"}, "user.Location" : "Ireland"})
    
## Data information integration

- problem: too many sources
- solution: 
    - pay-as-you-go model: only integrate sources that are needed when needed
    - probabilistic schema mapping
        - mediated schema
           - attribute grouping
               - similarity of attributes
               - probability of two attributes co-occurring
               
Data fusion
- data item: particular aspect of real-world entity
- find true values of data items from sources

Too many data sources
- many values for the same item => leads to conflicts
- hard to estimate trustworthiness of sources
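
As a rough illustration of resolving conflicting values, the sketch below uses weighted voting. This is a simple baseline chosen for the example, not a method stated in these notes; the source names, the data item, and the trust weights are all hypothetical.

```python
from collections import defaultdict

def fuse(claims, trust=None):
    """Pick one value per data item by (optionally trust-weighted) voting.

    claims: list of (source, data_item, value) tuples.
    trust:  optional dict source -> weight; unknown sources count as 1.0.
    """
    trust = trust or {}
    votes = defaultdict(lambda: defaultdict(float))
    for source, item, value in claims:
        votes[item][value] += trust.get(source, 1.0)
    # For each data item, keep the value with the largest total weight.
    return {item: max(values, key=values.get) for item, values in votes.items()}

# Hypothetical conflicting claims about the same data item.
claims = [
    ("siteA", "flight42.gate", "B7"),
    ("siteB", "flight42.gate", "B7"),
    ("siteC", "flight42.gate", "C1"),
]

majority = fuse(claims)                       # plain majority vote
weighted = fuse(claims, trust={"siteC": 5.0})  # siteC assumed far more reliable
```

The second call shows why trustworthiness matters: once one source is weighted heavily, it can overrule the majority, so poor trust estimates directly produce wrong fused values.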

## Big data processing pipeline

MapReduce
- split -> map -> shuffle and sort -> reduce
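
The four stages can be sketched as a single-process word count in Python. This only imitates the data flow; it is not Hadoop code, and the function names are made up for the example.

```python
from collections import defaultdict

def split_input(text, n_parts):
    """Split: divide the input lines into n_parts chunks."""
    lines = text.splitlines()
    return [lines[i::n_parts] for i in range(n_parts)]

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle_and_sort(mapped_parts):
    """Shuffle and sort: bring all values of a key together, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapped_parts:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped}

def word_count(text, n_parts=2):
    parts = split_input(text, n_parts)
    mapped = [map_phase(part) for part in parts]  # would run in parallel on nodes
    return reduce_phase(shuffle_and_sort(mapped))
```

In a real cluster each chunk is mapped on a different node, and the shuffle moves pairs across the network so that every key ends up at exactly one reducer.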

Data parallelism
- running the same function simultaneously on partitions of a data set across multiple cores

Three layers of Hadoop
- data management and storage
- data integration and processing
- coordination and workflow management

Categorization of big data processing systems
- execution model: batch (Hadoop, Spark, Flink, Beam), streaming (Spark, Flink, Beam, Storm) 
- latency: high-latency (Hadoop: no in-memory processing), low-latency (Spark, Flink, Beam: in-memory processing), very low-latency (Storm)
- scalability: 
- programming model: Java (Hadoop, Flink, Beam), Scala (Spark, Flink, Beam), Python (Spark), R (Spark)
- fault tolerance: replication (Hadoop)

Lambda architecture
- hybrid data processing architecture that combines batch and stream (real-time) processing

Hadoop limitations
- only MapReduce-based computation
- no interactive shell 
- native support for Java only
- relies on reading data from HDFS, which becomes a bottleneck

Spark
- resilient distributed datasets (RDD)
- driver program creates RDD
- create RDD -> apply transformations -> perform an action
    - transformations are lazily evaluated: they wait until an action is performed
    - errors may show up in the action stage, not in the transformation stage
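
Lazy evaluation can be imitated with a small Python class. `LazyDataset` is a toy stand-in invented for this example, not the Spark API: it only records transformations and runs them when the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new dataset with one more pending step.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, func):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._steps + (("filter", func),))

    def collect(self):
        # Action: only now are the recorded steps actually applied.
        result = self._data
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


numbers = LazyDataset([1, 2, 0])
inverted = numbers.map(lambda x: 1 / x)  # no error here: nothing has run yet
try:
    inverted.collect()
except ZeroDivisionError:
    print("error surfaced in the action, not in the transformation")
```

The division by zero is only detected inside `collect`, which mirrors why a faulty transformation in Spark is typically reported when an action runs.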

Narrow transformations 
- take place in worker nodes locally
- map: apply function to each element of RDD
- flatMap: map then flatten output
- filter: keep only elements where function is true
- coalesce: reduce number of partitions

Wide transformations
- processing depends on data residing in multiple partitions distributed across worker nodes
- groupByKey: (K, V) pairs => (K, list of all V)
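
The contrast between narrow and wide transformations can be sketched with partitions modelled as Python lists. This is an illustration only, not Spark code; the partition contents and the CRC-based partitioner are assumptions made for the example.

```python
import zlib
from collections import defaultdict

# Hypothetical (K, V) data already split into two partitions.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

def narrow_map(partitions, func):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[func(x) for x in part] for part in partitions]

def group_by_key(partitions, n_out=2):
    """Wide: an output partition may need records from every input partition."""
    out = [defaultdict(list) for _ in range(n_out)]
    for part in partitions:
        for key, value in part:
            # The shuffle: a deterministic partitioner decides where each key goes.
            target = zlib.crc32(key.encode()) % n_out
            out[target][key].append(value)
    return [sorted(d.items()) for d in out]

scaled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 10))
grouped = group_by_key(partitions)
```

In `narrow_map` no record leaves its partition, whereas `group_by_key` has to collect the values of key "a" from both input partitions, which is the data movement that makes wide transformations expensive.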

Actions
- collect: copy all elements to the driver
- take(n): copy first n elements
- reduce(func): aggregate elements with func
- saveAsTextFile(filename): save to local file or HDFS

Spark SQL
- enables querying structured and unstructured data through Spark
- APIs for Scala, Java, Python to convert result into RDDs
- deploy business intelligence tools over Spark

Data frames
- look just like tables in relational databases
- RDDs can be converted to data frames 

Spark streaming
- sources: Kafka, Flume, HDFS, S3, etc

Spark MLlib
- machine learning library for Spark

Spark GraphX
- API for graph computation
- triplets: views that logically join vertex and edge properties