seabull/resources
Some useful URLs

Big Data

CI/CD

Security

Git/Github

  • pre-commit

    • installation

      pip install pre-commit
      brew install pre-commit
      
    • Add pre-commit plugins to project

      • add .pre-commit-config.yaml to root directory of the project
      • Example:
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v1.4.0
        hooks:
          - id: trailing-whitespace
          - id: end-of-file-fixer
          - id: check-yaml
          - id: check-added-large-files
          - id: check-symlinks
          - id: mixed-line-ending
      - repo: git@github.com:Yelp/detect-secrets
        rev: 0.9.1
        hooks:
          - id: detect-secrets
            args: ['--baseline', '.secrets.baseline']
            exclude: ./temp/.
      
      
    • Updating hooks automatically

      • Run pre-commit autoupdate
  • git tips

    • Remove sensitive data from history after a push

      git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch <path-to-your-file>' --prune-empty --tag-name-filter cat -- --all && git push origin --force --all

  • git beyond the basics

  • git flow cheatsheet

  • git upstream explained

Docker

GoLang

Kotlin

Python

Functional Programming

Tmux

Spark

  • spark internal docs

  • data frame partitioning

  • Scala Future and Spark Concurrency

  • pyspark production best practices

  • advanced spark training by sameer

  • spark unit tests

  • Spark docker workbench

  • spark tuning tips

    • Locality

    • custom listener

    • Spark Perf Tuning Checklist

    • Spark Listener

    • configuration properties

    • spark-submit --master yarn --deploy-mode cluster \
        --executor-cores 3 \
        --executor-memory 1g \
        --conf "spark.executor.extraJavaOptions=-Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}" \
        --conf spark.dynamicAllocation.enabled=false \
        --num-executors 36 \
        --conf spark.driver.cores=1 \
        --conf spark.streaming.receiver.maxRate=150 \
        --conf spark.rpc.netty.dispatcher.numThreads=2 \
        --conf spark.ui.retainedJobs=100 \
        --conf spark.ui.retainedStages=100 \
        --conf spark.ui.retainedTasks=100 \
        --conf spark.worker.ui.retainedExecutors=100 \
        --conf spark.worker.ui.retainedDrivers=100 \
        --conf spark.sql.ui.retainedExecutions=100 \
        --conf spark.streaming.ui.retainedBatches=10000 \
        --conf spark.ui.retainedDeadExecutors=100 \
        --conf spark.eventLog.enabled=false \
        --conf spark.history.retainedApplications=2 \
        --conf spark.history.fs.cleaner.enabled=true \
        --conf spark.history.fs.cleaner.maxAge=2d \
        --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener \
        --class "$APPCLASS" "$APPFILE" >> "/var/log/${APPCLASS}.log" 2>&1
    • some config

    • from pyspark.sql import SparkSession

      spark = SparkSession.builder.master('yarn').appName('tga-dfa-adhoc')\
          .enableHiveSupport()\
          .config("hive.exec.dynamic.partition", "true")\
          .config("hive.exec.dynamic.partition.mode", "nonstrict")\
          .config("spark.sql.cbo.enabled", "true")\
          .config("spark.rpc.netty.dispatcher.numThreads", "2")\
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
          .config("spark.default.parallelism", "10000")\
          .config("spark.driver.maxResultSize", "10G")\
          .config("spark.rdd.compress", "true")\
          .config("spark.sql.inMemoryColumnarStorage.compressed", "true")\
          .config("spark.io.compression.codec", "snappy")\
          .config("spark.executor.memory", "20g")\
          .config("spark.executor.memoryOverhead", "10g")\
          .config("spark.sql.hive.thriftServer.singleSession", "true")\
          .getOrCreate()
    • # useful to debug and find the source file name
      from pyspark.sql.functions import input_file_name
      df = df.withColumn("fname", input_file_name())
    The following configuration settings are set by default in the bigRED environment and *should not* be modified:
    
    spark.authenticate = true (turns on authentication between Spark components)
    spark.master = yarn (Configures spark to use YARN as the execution engine)
    spark.executorEnv.LD_LIBRARY_PATH (varies depending on version / environment)
    spark.yarn.appMasterEnv.SPARK_HOME (varies depending on version / environment) [Spark 2.0 only]
    spark.yarn.archive (varies depending on version / environment)
    spark.driver.extraJavaOptions (varies depending on version)
    spark.yarn.am.extraJavaOptions (varies depending on version)
    spark.eventLog.enabled = true (turns on Spark event logging)
    spark.eventLog.dir (varies depending on version / environment)
    spark.yarn.historyServer.address (varies depending on version / environment)
    spark.shuffle.service.enabled = true (enables Spark external shuffle service)
    
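    A quick way to double-check these values is to read the effective configuration from a running session. The sketch below is not bigRED-specific; it only uses the standard PySpark configuration API:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # spot-check a few of the fixed settings (conf.get takes a fallback for unset keys)
      for key in ("spark.authenticate", "spark.master", "spark.shuffle.service.enabled"):
          print(key, "=", spark.conf.get(key, "<not set>"))

      # or dump everything that was actually applied to this application
      for key, value in sorted(spark.sparkContext.getConf().getAll()):
          print(key, "=", value)
    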
    The following configuration settings are set by default in the bigRED environment. They should work for many (if not most) tasks, especially shell jobs, but may be customized as necessary:
    
    spark.yarn.am.waitTime = 180000000 (sets the time to wait for an ApplicationMaster to be allocated before giving up -- 3 minutes)
    spark.network.timeout = 300s (sets the time before network transfers should be considered failed -- 5 minutes)
    spark.speculation = true (enables speculative execution of Spark tasks)
    spark.dynamicAllocation.enabled = true (enables dynamic allocation of Spark executors)
    spark.dynamicAllocation.minExecutors = 1 (sets the minimum number of executors to allocate -- please do NOT change for shell jobs)
    spark.dynamicAllocation.maxExecutors = 100 (sets the maximum number of executors to allocate)
    spark.dynamicAllocation.executorIdleTimeout = 300s (sets the amount of time before idle executors are terminated -- please do NOT change for shell jobs)
    spark.executor.instances = 1 (sets the default # of Spark executors -- for dynamic allocation, use spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.maxExecutors instead)
    spark.driver.memory = 1g (sets the amount of RAM to allocate for Driver instances)
    spark.executor.memory = 4g (sets the amount of RAM to allocate for Executor instances)
    spark.r.command = /usr/local/R-xyz-201701/bin/Rscript (sets the R command used to execute SparkR scripts)
    spark.r.shell.command = /usr/local/R-xyz-201701/bin/R (sets the R command used to execute SparkR shells) [Spark >= 2.1 only]
    spark.pyspark.python = /usr/local/python-xyz-201701/bin/python2.7 (sets the python executable to use by default for pyspark) [Spark >= 2.1 only]
    spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/local/python-xyz-201701/bin/python2.7 (sets the python executable to use by default for pyspark on YARN AM) [Spark <= 2.0 only]
    spark.port.maxRetries = 64
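    
    As a sketch (the values below are illustrative only, not recommendations), the customizable settings above can be overridden when the session is built:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("example-overrides")   # hypothetical application name
               .config("spark.dynamicAllocation.maxExecutors", "40")
               .config("spark.executor.memory", "8g")
               .config("spark.driver.memory", "2g")
               .config("spark.network.timeout", "600s")
               .getOrCreate())
    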
    • Spark Streaming log configuration

    • Mastering Spark Book

    • Some Tips

      • Spark's NarrowDependency is actually defined as "each partition of the parent RDD is used by at most one partition of the child RDD".

      • Spark Native ORC: CREATE TABLE MyTable ... USING ORC

      • Nested Schema Pruning: spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) prunes nested columns (e.g. struct fields) that are not selected. Without this config, Spark reads ALL fields (see the sketch after this list).

      • collapse projects: use .asNondeterministic() on the UDF to keep it from being duplicated when projections are collapsed

      • Spark 3.0: Join Hints:

        • BROADCAST (available in prior versions)
        • MERGE: Shuffle sort merge join
        • SHUFFLE_HASH: Shuffle hash join
        • SHUFFLE_REPLICATE_NL: Shuffle and Replicate Nested Loop join

      • spark.sql.codegen

        The default value of spark.sql.codegen is false. When set to true, Spark SQL compiles each query to Java bytecode, which improves performance for large queries. The drawback is that codegen slows down very short queries, because a compiler has to run for each query.

      • spark.sql.inMemoryColumnarStorage.compressed

        The default value of spark.sql.inMemoryColumnarStorage.compressed is true. When true, Spark SQL automatically compresses the in-memory columnar storage based on statistics of the data.

      • spark.sql.inMemoryColumnarStorage.batchSize

        The default value of spark.sql.inMemoryColumnarStorage.batchSize is 10000. It is the batch size for columnar caching. Larger values can improve memory utilization and compression, but risk out-of-memory errors when caching data.

      • spark.sql.parquet.compression.codec

        spark.sql.parquet.compression.codec defaults to snappy. Snappy is a compression/decompression library that aims for very high speed with reasonable compression; its output is typically 20 to 100% larger than that of other codecs, but compression is an order of magnitude faster. Other options include uncompressed, gzip and lzo.
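
      The sketch below pulls several of these tips into one PySpark snippet. It is illustrative only: the paths, table and column names are hypothetical, and the config values simply restate the defaults discussed above.

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast, udf
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()

        # nested schema pruning: read only the selected struct fields
        spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
        events = spark.read.parquet("/data/events")               # hypothetical path
        events.select("payload.user_id").explain()                # assumes a struct column "payload"

        # keep an expensive UDF from being duplicated when projects are collapsed
        clean = udf(lambda s: s.strip(), StringType()).asNondeterministic()

        # Spark 3.0 join hints, expressed through the DataFrame API
        dim = spark.read.parquet("/data/dim")                     # hypothetical path
        events.join(broadcast(dim), "user_id")                    # BROADCAST
        events.join(dim.hint("merge"), "user_id")                 # shuffle sort merge join
        events.join(dim.hint("shuffle_hash"), "user_id")          # shuffle hash join
        events.join(dim.hint("shuffle_replicate_nl"), "user_id")  # shuffle-and-replicate nested loop join

        # in-memory columnar cache and Parquet compression settings
        spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
        spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
        spark.conf.set("spark.sql.parquet.compression.codec", "snappy")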

    • productionize spark ETL video

    • Spark Metrics

    • Spark Shuffling

    • Spark AI Summit 2020

    • parallelizing spark by Anna Holschuh TGT

    • spark unit testing

Security

  • Kerberos

    • keytab file

    • Typing passwords is annoying, especially when you are forced to change
      them arbitrarily. Here's what I did on bigred:
      
      Set up passwordless ssh:
      
      	cat ~/.ssh/id_rsa.pub | ssh <user>@domain.XYZ.com "cat >> ~/.ssh/authorized_keys"
      
      On host, create a local keytab file for your principal:
      
      	$ ktutil
      	ktutil: addent -password -p <USER>@DOMAIN.XYZ.COM -k 1 -e aes256-cts-hmac-sha1-96
      	ktutil: wkt .keytab
      	ktutil: q
      
      In ~/.profile:
      
      	if [ -f $HOME/.keytab ]; then
      		kinit -kt $HOME/.keytab <USER>@XYZ.COM
      	fi
      
      Now forget your password and go on with your life.
      
      Alternatively, if a site-specific generate-keytab helper is available:
      
      $ generate-keytab .keytab <USER>@XYZ.COM
      Kerberos password for <USER>@XYZ.COM: ********
      
    • ssh keys

      ssh-keygen -t rsa -C "your_email@example.com"
    • ssh config

      cat ~/.ssh/config
      
      HashKnownHosts yes
      ServerAliveInterval 120
      TCPKeepAlive yes
      
      Host gitlab.com
        Hostname altssh.gitlab.com
        User git
        Port 443
        PreferredAuthentications publickey
        IdentityFile ~/.ssh/gitlab
      
      Host *
        UseKeychain yes

Java

Scala

Architecture

Misc Tools
