seabull/resources
Some useful URLs

Big Data

CI/CD

Security

Git/Github

  • pre-commit

    • installation

      pip install pre-commit
      brew install pre-commit
      
    • Add pre-commit plugins to project

      • add .pre-commit-config.yaml to root directory of the project
      • Example:
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v1.4.0
        hooks:
          - id: trailing-whitespace
          - id: end-of-file-fixer
          - id: check-yaml
          - id: check-added-large-files
          - id: check-symlinks
          - id: mixed-line-ending
      - repo: git@github.com:Yelp/detect-secrets
        rev: 0.9.1
        hooks:
          - id: detect-secrets
            args: ['--baseline', '.secrets.baseline']
            exclude: ./temp/.
      
      
    • Updating hooks automatically

      • Run pre-commit autoupdate
  • git tips

    • Remove sensitive data from history after a push

      git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch <path-to-your-file>' --prune-empty --tag-name-filter cat -- --all && git push origin --force --all

  • git beyond the basics

  • git flow cheatsheet

  • git upstream explained

Docker

GoLang

Kotlin

Python

Functional Programming

Tmux

Spark

  • spark internal docs

  • data frame partitioning

  • Scala Future and Spark Concurrency

  • pyspark production best practices

  • advanced spark training by sameer

  • spark unit tests

  • Spark docker workbench

  • spark tuning tips

    • Locality

    • custom listener

    • Spark Perf Tuning Checklist

    • Spark Listener

    • configuration properties

    • spark-submit --master yarn --deploy-mode cluster \
        --executor-cores 3 \
        --executor-memory 1g \
        --conf "spark.executor.extraJavaOptions=-Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}" \
        --conf spark.dynamicAllocation.enabled=false \
        --num-executors 36 \
        --conf spark.driver.cores=1 \
        --conf spark.streaming.receiver.maxRate=150 \
        --conf spark.rpc.netty.dispatcher.numThreads=2 \
        --conf spark.ui.retainedJobs=100 \
        --conf spark.ui.retainedStages=100 \
        --conf spark.ui.retainedTasks=100 \
        --conf spark.worker.ui.retainedExecutors=100 \
        --conf spark.worker.ui.retainedDrivers=100 \
        --conf spark.sql.ui.retainedExecutions=100 \
        --conf spark.streaming.ui.retainedBatches=10000 \
        --conf spark.ui.retainedDeadExecutors=100 \
        --conf spark.eventLog.enabled=false \
        --conf spark.history.retainedApplications=2 \
        --conf spark.history.fs.cleaner.enabled=true \
        --conf spark.history.fs.cleaner.maxAge=2d \
        --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener \
        --class "$APPCLASS" "$APPFILE" >> "/var/log/${APPCLASS}.log" 2>&1
    • some config

    • from pyspark.sql import SparkSession

      spark = SparkSession.builder.master('yarn').appName('tga-dfa-adhoc')\
          .enableHiveSupport()\
          .config("hive.exec.dynamic.partition", "true")\
          .config("hive.exec.dynamic.partition.mode", "nonstrict")\
          .config("spark.sql.cbo.enabled", "true")\
          .config("spark.rpc.netty.dispatcher.numThreads", "2")\
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
          .config("spark.default.parallelism", "10000")\
          .config("spark.driver.maxResultSize", "10G")\
          .config("spark.rdd.compress", "true")\
          .config("spark.sql.inMemoryColumnarStorage.compressed", "true")\
          .config("spark.io.compression.codec", "snappy")\
          .config("spark.executor.memory", "20g")\
          .config("spark.executor.memoryOverhead", "10g")\
          .config("spark.sql.hive.thriftServer.singleSession", "true")\
          .getOrCreate()
    • # useful to debug and find the source file name
      from pyspark.sql.functions import input_file_name
      df = df.withColumn("fname", input_file_name())
    The following configuration settings are set by default in the bigRED environment and *should not* be modified:
    
    spark.authenticate = true (turns on authentication between Spark components)
    spark.master = yarn (Configures spark to use YARN as the execution engine)
    spark.executorEnv.LD_LIBRARY_PATH (varies depending on version / environment)
    spark.yarn.appMasterEnv.SPARK_HOME (varies depending on version / environment) [Spark 2.0 only]
    spark.yarn.archive (varies depending on version / environment)
    spark.driver.extraJavaOptions (varies depending on version)
    spark.yarn.am.extraJavaOptions (varies depending on version)
    spark.eventLog.enabled = true (turns on Spark event logging)
    spark.eventLog.dir (varies depending on version / environment)
    spark.yarn.historyServer.address (varies depending on version / environment)
    spark.shuffle.service.enabled = true (enables Spark external shuffle service)
    
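    A quick way to double-check these values is to read the effective configuration from a running session. The sketch below is not bigRED-specific; it only uses the standard PySpark configuration API:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # spot-check a few of the fixed settings (conf.get takes a fallback for unset keys)
      for key in ("spark.authenticate", "spark.master", "spark.shuffle.service.enabled"):
          print(key, "=", spark.conf.get(key, "<not set>"))

      # or dump everything that was actually applied to this application
      for key, value in sorted(spark.sparkContext.getConf().getAll()):
          print(key, "=", value)
    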
    The following configuration settings are set by default in the bigRED environment. They should work for many (if not most) tasks, especially shell jobs, but may be customized as necessary:
    
    spark.yarn.am.waitTime = 180000000 (sets the time to wait for an ApplicationMaster to be allocated before giving up -- 3 minutes)
    spark.network.timeout = 300s (sets the time before network transfers should be considered failed -- 5 minutes)
    spark.speculation = true (enables speculative execution of Spark tasks)
    spark.dynamicAllocation.enabled = true (enables dynamic allocation of Spark executors)
    spark.dynamicAllocation.minExecutors = 1 (sets the minimum number of executors to allocate -- please do NOT change for shell jobs)
    spark.dynamicAllocation.maxExecutors = 100 (sets the maximum number of executors to allocate)
    spark.dynamicAllocation.executorIdleTimeout = 300s (sets the amount of time before idle executors are terminated -- please do NOT change for shell jobs)
    spark.executor.instances = 1 (sets the default # of Spark executors -- for dynamic allocation, use spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.maxExecutors instead)
    spark.driver.memory = 1g (sets the amount of RAM to allocate for Driver instances)
    spark.executor.memory = 4g (sets the amount of RAM to allocate for Executor instances)
    spark.r.command = /usr/local/R-xyz-201701/bin/Rscript (sets the R command used to execute SparkR scripts)
    spark.r.shell.command = /usr/local/R-xyz-201701/bin/R (sets the R command used to execute SparkR shells) [Spark >= 2.1 only]
    spark.pyspark.python = /usr/local/python-xyz-201701/bin/python2.7 (sets the python executable to use by default for pyspark) [Spark >= 2.1 only]
    spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/local/python-xyz-201701/bin/python2.7 (sets the python executable to use by default for pyspark on YARN AM) [Spark <= 2.0 only]
    spark.port.maxRetries = 64
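    
    As a sketch (the values below are illustrative only, not recommendations), the customizable settings above can be overridden when the session is built:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("example-overrides")   # hypothetical application name
               .config("spark.dynamicAllocation.maxExecutors", "40")
               .config("spark.executor.memory", "8g")
               .config("spark.driver.memory", "2g")
               .config("spark.network.timeout", "600s")
               .getOrCreate())
    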
    • Spark Streaming log configuration

    • Mastering Spark Book

    • Some Tips

      • Spark's NarrowDependency is actually defined as "each partition of the parent RDD is used by at most one partition of the child RDD".

      • Spark Native ORC: CREATE TABLE MyTable ... USING ORC

      • Nested Schema Pruning: spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) prunes nested columns (e.g. struct fields) that are not selected. Without this config, Spark reads ALL fields (see the sketch after this list).

      • collapse projects: use .asNondeterministic() on the UDF to keep it from being duplicated when projections are collapsed

      • Spark 3.0: Join Hints:

        • BROADCAST (available in prior versions)
        • MERGE: Shuffle sort merge join
        • SHUFFLE_HASH: Shuffle hash join
        • SHUFFLE_REPLICATE_NL: Shuffle and Replicate Nested Loop join

      • spark.sql.codegen

        The default value of spark.sql.codegen is false. When set to true, Spark SQL compiles each query to Java bytecode, which improves performance for large queries. The drawback is that codegen slows down very short queries, because a compiler has to run for each query.

      • spark.sql.inMemoryColumnarStorage.compressed

        The default value of spark.sql.inMemoryColumnarStorage.compressed is true. When true, Spark SQL automatically compresses the in-memory columnar storage based on statistics of the data.

      • spark.sql.inMemoryColumnarStorage.batchSize

        The default value of spark.sql.inMemoryColumnarStorage.batchSize is 10000. It is the batch size for columnar caching. Larger values can improve memory utilization and compression, but risk out-of-memory errors when caching data.

      • spark.sql.parquet.compression.codec

        spark.sql.parquet.compression.codec defaults to snappy. Snappy is a compression/decompression library that aims for very high speed with reasonable compression; its output is typically 20 to 100% larger than that of other codecs, but compression is an order of magnitude faster. Other options include uncompressed, gzip and lzo.
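
      The sketch below pulls several of these tips into one PySpark snippet. It is illustrative only: the paths, table and column names are hypothetical, and the config values simply restate the defaults discussed above.

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast, udf
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()

        # nested schema pruning: read only the selected struct fields
        spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
        events = spark.read.parquet("/data/events")               # hypothetical path
        events.select("payload.user_id").explain()                # assumes a struct column "payload"

        # keep an expensive UDF from being duplicated when projects are collapsed
        clean = udf(lambda s: s.strip(), StringType()).asNondeterministic()

        # Spark 3.0 join hints, expressed through the DataFrame API
        dim = spark.read.parquet("/data/dim")                     # hypothetical path
        events.join(broadcast(dim), "user_id")                    # BROADCAST
        events.join(dim.hint("merge"), "user_id")                 # shuffle sort merge join
        events.join(dim.hint("shuffle_hash"), "user_id")          # shuffle hash join
        events.join(dim.hint("shuffle_replicate_nl"), "user_id")  # shuffle-and-replicate nested loop join

        # in-memory columnar cache and Parquet compression settings
        spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
        spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
        spark.conf.set("spark.sql.parquet.compression.codec", "snappy")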

    • productionize spark ETL video

    • Spark Metrics

    • Spark Shuffling

    • Spark AI Summit 2020

    • parallelizing spark by Anna Holschuh TGT

    • spark unit testing

Security

  • Kerberos

    • keytab file

    • Typing passwords is annoying, especially when you are forced to change
      them arbitrarily. Here's what I did on bigred:
      
      Set up passwordless ssh:
      
      	cat ~/.ssh/id_rsa.pub | ssh <user>@domain.XYZ.com "cat >> ~/.ssh/authorized_keys"
      
      On host, create a local keytab file for your principal:
      
      	$ ktutil
      	ktutil: addent -password -p <USER>@DOMAIN.XYZ.COM -k 1 -e aes256-cts-hmac-sha1-96
      	ktutil: wkt .keytab
      	ktutil: q
      
      In ~/.profile:
      
      	if [ -f $HOME/.keytab ]; then
      		kinit -kt $HOME/.keytab <USER>@XYZ.COM
      	fi
      
      Now forget your password and go on with your life.
      
      Alternatively, if a site-specific generate-keytab helper is available:
      
      $ generate-keytab .keytab <USER>@XYZ.COM
      Kerberos password for <USER>@XYZ.COM: ********
      
    • ssh keys

      ssh-keygen -t rsa -C "your_email@example.com"
    • ssh config

      cat ~/.ssh/config
      
      HashKnownHosts yes
      ServerAliveInterval 120
      TCPKeepAlive yes
      
      Host gitlab.com
        Hostname altssh.gitlab.com
        User git
        Port 443
        PreferredAuthentications publickey
        IdentityFile ~/.ssh/gitlab
      
      Host *
        UseKeychain yes

Java

Scala

Architecture

Misc Tools
