Apache Spark experiments in Scala, built with Gradle. spark-shell provides the spark and sc variables pre-initialised; this project does the same with a Scala trait that you can extend.
Original project template source:
https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a
https://github.com/faizanahemad/spark-gradle-template
- Gradle wrapper - 4.7
- Spark - 2.3.0
git clone https://github.com/slangeberg/spark-lab.git
./gradlew clean build
./gradlew run
./gradlew clean run
Take a look at the src/main/scala/template/spark directory.
There are two items here.
The trait InitSpark, which is extended by any class or object that wants to run Spark code. This trait contains all the initialization code. It also suppresses logging to error level only, for less noise.
The file Main.scala, which contains the executable object Main.
In Main, I do four things:
- Print the Spark version.
- Find the sum of 1 to 100 (inclusive).
- Read a CSV file into a structured Dataset.
- Find the average age of persons from the CSV.
InitSpark.scala
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.sql.SparkSession

trait InitSpark {
  // Local session that uses all available cores
  val spark: SparkSession = SparkSession.builder().appName("Spark example").master("local[*]")
    .config("spark.some.config.option", "some-value").getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext

  // CSV readers: infer column types and drop malformed rows
  def reader = spark.read.option("header", true).option("inferSchema", true).option("mode", "DROPMALFORMED")
  def readerWithoutHeader = spark.read.option("header", false).option("inferSchema", true).option("mode", "DROPMALFORMED")

  // Restrict logging to ERROR level for less noise
  private def init = {
    sc.setLogLevel("ERROR")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    LogManager.getRootLogger.setLevel(Level.ERROR)
  }
  init

  // Stop the session when the job is done
  def close = {
    spark.close()
  }
}
Main.scala
import org.apache.spark.sql.functions.avg

final case class Person(firstName: String, lastName: String, country: String, age: Int)

object Main extends InitSpark {
  def main(args: Array[String]): Unit = {
    import spark.implicits._

    // 1. Print the Spark version
    val version = spark.version
    println("VERSION_STRING = " + version)

    // 2. Sum of 1 to 100 (the upper bound of range is exclusive)
    val sumHundred = spark.range(1, 101).reduce(_ + _)
    println(sumHundred)

    // 3. Read the CSV into a typed Dataset[Person]
    val persons = reader.csv("people-example.csv").as[Person]

    // 4. Average age over all persons
    val averageAge = persons.agg(avg("age")).first.get(0).asInstanceOf[Double]
    println(f"Average Age: $averageAge%.2f")

    close
  }
}
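Since InitSpark is meant to be extended by anything that runs Spark code, a second job could look roughly like the sketch below. This is only an illustration: the object AgeByCountry, its helper averageAgeByCountry, the sample rows, and the package name template.spark (guessed from the directory layout) are assumptions, not part of the project. It reuses the Person case class from Main.scala.

```scala
package template.spark // assumed package, inferred from the directory layout

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.avg

// Hypothetical second job that reuses the session created by InitSpark.
object AgeByCountry extends InitSpark {

  // Average age per country, collected to the driver as a Map.
  def averageAgeByCountry(persons: Dataset[Person]): Map[String, Double] = {
    import spark.implicits._
    persons
      .groupBy($"country")
      .agg(avg($"age").as("averageAge"))
      .as[(String, Double)] // tuple columns are mapped by position
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    import spark.implicits._
    // Made-up in-memory rows, so the sketch does not depend on the CSV file.
    val persons = Seq(
      Person("Ann", "Lee", "UK", 30),
      Person("Bob", "Ray", "UK", 40),
      Person("Cy", "Fox", "US", 50)
    ).toDS()

    averageAgeByCountry(persons).foreach { case (country, age) =>
      println(f"$country: $age%.2f")
    }
    close
  }
}
```

To run it against the real data instead, the in-memory Seq could be swapped for reader.csv("people-example.csv").as[Person], as in Main.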
Import the project into your favorite IDE as a Gradle project (tested with IntelliJ), or use your favorite editor and build from the command line with Gradle.