spark-lab

Apache Spark experiments in Scala, built with Gradle. spark-shell provides pre-initialised spark and sc variables; here I did the same using a Scala trait that you can extend.

Original project template source:
https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a
https://github.com/faizanahemad/spark-gradle-template

Prerequisites

Libraries Included

  • Gradle wrapper - 4.7
  • Spark - 2.3.0
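
For orientation, the dependency wiring in a setup like this usually looks roughly as follows. This is a hypothetical sketch, not the repo's actual build file; the coordinates assume Spark 2.3.0's standard Scala 2.11 build, and the main class name is illustrative.

```groovy
// Hypothetical build.gradle sketch - check the repo's actual build file
// for the authoritative versions and plugin setup.
apply plugin: 'scala'
apply plugin: 'application'

repositories {
    mavenCentral()
}

dependencies {
    compile 'org.scala-lang:scala-library:2.11.12'
    // Spark 2.3.0 artifacts are published for Scala 2.11
    compile 'org.apache.spark:spark-core_2.11:2.3.0'
    compile 'org.apache.spark:spark-sql_2.11:2.3.0'
}

// Illustrative; must match the package of the executable class
mainClassName = 'template.spark.Main'
```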

Build and Demo process

Clone the Repo

git clone https://github.com/slangeberg/spark-lab.git

Build

./gradlew clean build

Run

./gradlew run

All Together

./gradlew clean run

What does the demo do?

Take a look at the src/main/scala/template/spark directory.

We have two items here.

The trait InitSpark, which is extended by any class that wants to run Spark code. This trait has all the initialization code. I have also suppressed the logging to the error level for less noise.

The file Main.scala has the executable class Main. In this class, I do four things:

  • Print spark version.
  • Find sum from 1 to 100 (inclusive).
  • Read a csv file into a structured DataSet.
  • Find average age of persons from the csv.

InitSpark.scala

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.sql.SparkSession

trait InitSpark {
  val spark: SparkSession = SparkSession.builder().appName("Spark example").master("local[*]")
                            .config("spark.some.config.option", "some-value").getOrCreate()
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  def reader = spark.read.option("header", true).option("inferSchema", true).option("mode", "DROPMALFORMED")
  def readerWithoutHeader = spark.read.option("header", false).option("inferSchema", true).option("mode", "DROPMALFORMED")
  private def init = {
    sc.setLogLevel("ERROR")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    LogManager.getRootLogger.setLevel(Level.ERROR)
  }
  init
  def close = {
    spark.close()
  }
}
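
Any object that extends the trait gets spark, sc, and the pre-configured readers without writing any setup code. A minimal sketch (the FirstNames object below is a hypothetical example, not part of the repo):

```scala
package template.spark

// Hypothetical example job: extending InitSpark gives us
// spark and sc already initialised, with logging quietened.
object FirstNames extends InitSpark {
  def main(args: Array[String]): Unit = {
    import spark.implicits._

    // Build a small Dataset from a local collection and show it
    val names = sc.parallelize(Seq("Ada", "Grace", "Alan")).toDS()
    names.show()

    close
  }
}
```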

Main.scala

import org.apache.spark.sql.functions.avg

final case class Person(firstName: String, lastName: String, country: String, age: Int)

object Main extends InitSpark {
  def main(args: Array[String]): Unit = {
    import spark.implicits._

    val version = spark.version
    println("VERSION_STRING = " + version)

    val sumHundred = spark.range(1, 101).reduce(_ + _)
    println(sumHundred)

    val persons = reader.csv("people-example.csv").as[Person]
    val averageAge = persons.agg(avg("age")).first.get(0).asInstanceOf[Double]
    println(f"Average Age: $averageAge%.2f")

    close
  }
}
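
Since reader is configured with header and inferSchema, the CSV is expected to start with a header row matching the Person fields. A hypothetical people-example.csv (illustrative rows only, not the repo's actual data) would look like:

```csv
firstName,lastName,country,age
Ada,Lovelace,UK,36
Alan,Turing,UK,41
Grace,Hopper,USA,85
```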

Using this Repo

Just import it into your favorite IDE as a Gradle project (tested with IntelliJ), or use your favorite editor and build from the command line with Gradle.
