
spark-example-ezmeral-data-fabric

Notes from learning this Spark course (https://time.geekbang.org/column/intro/100090001?tab=catalog). The examples are written and packaged to run on Ezmeral Data Fabric (formerly known as MapR).

The following is a cheat sheet of operators commonly used with RDDs, quoted from this Geek Time course lesson:
📖 Getting Started with Spark from Scratch, Lesson 02 | RDDs and the Programming Model: What Is Lazy Evaluation?

(Cheat sheet image: operators commonly used with RDDs)
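The lesson linked above explains lazy evaluation: RDD transformations (map, filter, flatMap, ...) only build a lineage, and nothing executes until an action (count, take, collect, ...) runs. As a rough illustration that needs no Spark cluster, plain Scala views behave analogously; the object and function names below are purely illustrative:

```scala
// Plain-Scala analogue of Spark's lazy evaluation: a view defers work
// until a terminal operation, just as RDD transformations defer work
// until an action triggers them.
object LazyEvalDemo {
  // Returns the first three doubled values, plus how many input elements
  // were actually evaluated to produce them.
  def lazyDoubleTakeThree(xs: Seq[Int]): (List[Int], Int) = {
    var evaluated = 0
    val pipeline = xs.view.map { x => evaluated += 1; x * 2 } // "transformation": nothing runs yet
    val result = pipeline.take(3).toList                      // "action": forces only what is needed
    (result, evaluated)
  }

  def main(args: Array[String]): Unit = {
    println(lazyDoubleTakeThree(1 to 100)) // only 3 of the 100 elements are ever mapped
  }
}
```

With an eager collection the map would touch all 100 elements; with the view, as with an RDD, only the three elements demanded by take(3) are ever computed.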

What did I do in this sample project❓ (i.e., what did I record in these notes)

  • Use the knowledge learned about Spark development (how to use Spark's API to implement business logic) to write example applications.

  • Some basic information about this project:

    • Develop the applications on the local work PC and run them in Spark local mode.

    • Build (package) the applications and run them on the Ezmeral Data Fabric cluster (submit the application with spark-submit).

How to build and run the example app - "Spark RDD Neighboring Word Count🗄"

TL;DR

ls -1 `pwd`

Sample output:

README.md
build.sbt
project
src
tree ./src

Sample output:

./src
└── main
    └── scala
        └── shouneng
            └── RddNeighboringWordCount.scala

3 directories, 1 file
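The actual implementation in RddNeighboringWordCount.scala is not reproduced on this page. As a sketch of the idea only, the pairing-and-counting logic can be expressed with plain Scala collections; in the Spark version the same steps map onto RDD operators such as map, reduceByKey, and sortByKey. All names below are illustrative:

```scala
// Sketch of "neighboring word count": count occurrences of adjacent word
// pairs and list them most-frequent-first, in (count, pair) form like the
// sample run output shown later in this README.
object NeighboringPairsDemo {
  def countNeighboringPairs(words: Seq[String]): Seq[(Int, String)] =
    words.zip(words.drop(1))                                // adjacent pairs, e.g. ("Apache", "Spark")
      .map { case (a, b) => s"$a-$b" }                      // join each pair into a single key
      .groupBy(identity)                                    // plays the role of reduceByKey(_ + _)
      .toSeq
      .map { case (pair, occurrences) => (occurrences.size, pair) } // swap to (count, pair)
      .sortBy(-_._1)                                        // plays the role of sortByKey(ascending = false)

  def main(args: Array[String]): Unit = {
    val words = "Apache Spark is an Apache Software Foundation project and Apache Software projects evolve"
      .split("\\s+").toSeq
    countNeighboringPairs(words).take(3).foreach(println)
  }
}
```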

Take a look at the core configuration file of SBT - build.sbt🗄
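The build.sbt itself is not captured on this page. A minimal version consistent with the sbt session below (project name "Spark RDD Example", Scala 2.12, Spark 3.1.2 on the cluster, and the artifact spark-rdd-example_2.12-0.1.0.jar submitted later) might look like this; treat the exact version numbers as assumptions:

```scala
// Hypothetical build.sbt reconstructed from the surrounding logs.
name := "Spark RDD Example"   // sbt normalizes this to the artifact name spark-rdd-example
version := "0.1.0"
scalaVersion := "2.12.15"     // assumed 2.12.x patch release

// Spark 3.1.2 to match the cluster's /opt/mapr/spark/spark-3.1.2.
// This dependency is often marked % "provided" so the packaged jar does not
// bundle Spark, but that would break the plain `run` shown below, so it is
// left on the default compile scope here.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
```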

sbt

Sample output:

[info] welcome to sbt 1.5.4 (Ubuntu Java 11.0.13)
[info] loading settings for project spark_rddexample-build-build-build from metals.sbt ...
[info] loading project definition from /home/raymondyan/spark_rddExample/project/project/project
[info] loading settings for project spark_rddexample-build-build from metals.sbt ...
[info] loading project definition from /home/raymondyan/spark_rddExample/project/project
[success] Generated .bloop/spark_rddexample-build-build.json
[success] Total time: 1 s, completed Jan 7, 2022, 5:19:42 PM
[info] loading settings for project spark_rddexample-build from metals.sbt,plugins.sbt ...
[info] loading project definition from /home/raymondyan/spark_rddExample/project
[success] Generated .bloop/spark_rddexample-build.json
[success] Total time: 1 s, completed Jan 7, 2022, 5:19:44 PM
[info] loading settings for project sparkRDD from build.sbt ...
[info] set current project to Spark RDD Example (in build file:/home/raymondyan/spark_rddExample/)
[info] sbt server started at local:///home/raymondyan/.sbt/1.0/server/df4dbb717ba189562adb/sock
[info] started sbt server
sbt:Spark RDD Example> 
sbt:Spark RDD Example> compile

Sample output:

[info] compiling 1 Scala source to /home/raymondyan/spark_rddExample/target/scala-2.12/classes ...
[success] Total time: 6 s, completed Jan 7, 2022, 5:24:17 PM
sbt:Spark RDD Example> package

Sample output:

[success] Total time: 0 s, completed Jan 7, 2022, 5:25:57 PM
sbt:Spark RDD Example> run

Sample output:

...
22/01/07 17:45:20 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 6) in 103 ms on 172.24.223.187 (executor driver) (1/1)
22/01/07 17:45:20 INFO DAGScheduler: ResultStage 4 (take at RddNeighboringWordCount.scala:52) finished in 0.133 s
22/01/07 17:45:20 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/01/07 17:45:20 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
22/01/07 17:45:20 INFO TaskSchedulerImpl: Killing all running tasks in stage 4: Stage finished
22/01/07 17:45:20 INFO DAGScheduler: Job 1 finished: take at RddNeighboringWordCount.scala:52, took 0.356140 s
(10,Apache-Software)
(8,Apache-Spark)
(7,can-be)
(7,of-the)
(6,Software-Foundation)
22/01/07 17:45:20 WARN FileSystem: Cleaner thread interrupted, will stop
java.lang.InterruptedException
        at java.base/java.lang.Object.wait(Native Method)
        at java.base/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
        at java.base/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:176)
        at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
        at java.base/java.lang.Thread.run(Thread.java:829)
...

📔 After running the 💡package command in SBT, the built application JAR file will be placed in 📁./target/scala-2.12/.

Submit the application on a Data Fabric cluster node or on an edge node:

su - mapr
export SPARK_HOME=/opt/mapr/spark/spark-3.1.2 && \
cd /home/mapr/work && \
$SPARK_HOME/bin/spark-submit \
--class shouneng.RddNeighboringWordCount \
--master yarn \
--deploy-mode cluster \
spark-rdd-example_2.12-0.1.0.jar

Check the result from DF cluster's Resource Manager UI:

Check the result of the Spark RDD example: Neighboring Word Count
