Programming MapReduce with Scalding
Antwnis Merge pull request #6 from FavioVazquez/fixing-typos
- Fixed typos in chapter 9 codes
Latest commit 9161769 May 13, 2015


Source code for PACKT Book 'Programming MapReduce With Scalding'

Find more information at http://scalding.io/

The book consists of 9 chapters:

  • Introduction to Map-Reduce - An introduction to Hadoop, MapReduce, pipelining, Cascading, Pig and Hive. The chapter presents the benefits of higher-level abstractions over MapReduce (concepts and capabilities).

  • Get ready for Scalding - Theory about Scalding, the Scala domain-specific language built on Cascading. Development environment setup, including a local Hadoop cluster for development. Executing the first Hello World Scalding example.

  • Scalding by example - The core capabilities of Scalding: i) map-like functions, ii) grouping/reducing functions, iii) join operations.

  • Intermediate examples - A Scalding log-processing flow for a news company that aggregates multiple sources. An example with multiple pipelines introduces some more advanced concepts.

  • Scalding Design Patterns - Interesting design patterns applicable to Scalding data processing applications. Using the 'External Operations' pattern enables us to unit test our applications and structure them in a modular way.

  • Testing & TDD - Best practices of first defining behaviour (Behaviour Driven Development), then tests (Test Driven Development), and then completing the implementation. How to write unit and integration tests, and how to apply black-box testing methodologies in the context of Big Data.

  • Running Scalding in Production - Tips and tricks on how to execute and schedule jobs, and how to coordinate the execution of Scalding/Scala/Java and even external system processes. Finally, how to configure Scalding jobs using property files or Hadoop parameters, how to monitor and optimize jobs, and other useful tips.

  • Using external data stores - Interaction with external SQL, NoSQL and in-memory applications like HBase, SQL, ElasticSearch etc.

  • Matrix Calculations and Machine Learning - Matrix calculations using the Matrix API and Algebird to calculate text similarity (TF-IDF) and set similarity (Jaccard). A further example covers Mahout K-Means clustering and outlier detection.
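
As a rough illustration of the set similarity (Jaccard) mentioned in the last chapter, here is a minimal sketch in plain Scala. It is not taken from the book and deliberately does not use Scalding's Matrix API or Algebird; the object name `JaccardSketch`, the helper `tokens`, and the sample sentences are hypothetical and exist only for this example.

```scala
// Plain Scala sketch of Jaccard set similarity.
// Assumption: this is an illustration only; it does not use Scalding or Algebird.
object JaccardSketch {

  /** Jaccard similarity |A ∩ B| / |A ∪ B|.
    * Two empty sets are treated as identical (1.0) by convention in this sketch.
    */
  def jaccard[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 1.0
    else (a intersect b).size.toDouble / unionSize
  }

  /** Split a text into a set of lowercase word tokens (hypothetical helper). */
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  def main(args: Array[String]): Unit = {
    val doc1 = tokens("Scalding is a Scala DSL for Cascading")
    val doc2 = tokens("Cascading is a Java API for Hadoop")
    // Prints the share of distinct words that both sentences have in common.
    println(jaccard(doc1, doc2))
  }
}
```

On a cluster, the same ratio would be computed over distributed data rather than in-memory sets, which is where the Matrix API described in the chapter comes in.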