-
OWASP API Security Project
https://owasp.org/www-project-api-security/
LINDDUN privacy engineering
Privacy by Design - The 7 Foundational Principles
https://iapp.org/resources/article/privacy-by-design-the-7-foundational-principles/
The State of Open Source Security
https://snyk.io/open-source-security/
npm and the Future of JavaScript
https://www.infoq.com/presentations/npm-javascript-users/
Semgrep
https://github.com/returntocorp/semgrep
OWASP pytm
https://owasp.org/www-project-pytm/
OWASP SEDATED
-
-
installation
pip install pre-commit
or
brew install pre-commit
-
Add pre-commit plugins to project
- add a .pre-commit-config.yaml to the root directory of the project
- Example:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v1.4.0
  hooks:
    - id: trailing-whitespace
    - id: end-of-file-fixer
    - id: check-yaml
    - id: check-added-large-files
    - id: check-symlinks
    - id: mixed-line-ending
- repo: git@github.com:Yelp/detect-secrets
  rev: 0.9.1
  hooks:
    - id: detect-secrets
      args: ['--baseline', '.secrets.baseline']
      exclude: ./temp/.
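To activate the hooks in each clone of the repo (standard pre-commit commands):
pre-commit install
pre-commit run --all-files # one-off run against the entire repo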
-
Updating hooks automatically
- Run
pre-commit autoupdate
-
Remove sensitive data from history after a push
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch <path-to-your-file>' \
  --prune-empty --tag-name-filter cat -- --all \
&& git push origin --force --all
-
- Concurrency Best Practices
- Ensure consumers can only consume: receive-only channel types (recvOnly <-chan Thing) are your friends.
- Track completion of goroutines: sync.WaitGroup is your friend.
- Close a channel only when the producing goroutines can be verified as no longer able to send on it (see the sketch below).
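A minimal sketch tying the three rules together (package and type names are illustrative, not from a particular codebase):

package main

import (
	"fmt"
	"sync"
)

type Thing struct{ ID int }

// consume takes a receive-only channel, so the compiler guarantees
// this goroutine can never send on (or close) ch.
func consume(ch <-chan Thing, wg *sync.WaitGroup) {
	defer wg.Done()
	for t := range ch { // range exits once ch is closed and drained
		fmt.Println("got", t.ID)
	}
}

func main() {
	ch := make(chan Thing)

	var producers sync.WaitGroup
	for i := 0; i < 3; i++ {
		producers.Add(1)
		go func(id int) {
			defer producers.Done()
			ch <- Thing{ID: id}
		}(i)
	}

	var consumers sync.WaitGroup
	consumers.Add(1)
	go consume(ch, &consumers)

	// Close only after all producers are verifiably done sending.
	producers.Wait()
	close(ch)

	consumers.Wait()
}

Separate WaitGroups for producers and consumers make the "safe to close" point explicit: close(ch) happens strictly after producers.Wait() returns.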
- Go in production
- 7 common mistakes in Go
- 5 useful ways to use Closure
- Retry example
- Code Review Comments
- Awesome Go
- Dependency Management
- Error Handling
- go mod: just tell me how to use it
- gRPC and REST tutorial
- gRPC ecosystem
- Go High Performance
- Sourced
- Useful packages
-
-
virtualenv and git best practices
- Python virtualenvs do NOT need to live in the same directory as the Python sources; e.g. all venvs could go under ~/.envs (~/.envs/myenv1, ~/.envs/my-test-env1) and be activated with "source ~/.envs/myenv1/bin/activate" (see the sketch below).
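A quick sketch of that workflow (myenv1 as in the note above; requirements.txt is a placeholder):

mkdir -p ~/.envs
python3 -m venv ~/.envs/myenv1        # the venv lives outside the source tree
source ~/.envs/myenv1/bin/activate    # activate it from any directory
pip install -r requirements.txt       # packages land in ~/.envs/myenv1
deactivate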
-
pip SSL config
-
#!/usr/bin/env bash
MY_ID=$(whoami)
HOME_DIR=$(cd ~ && pwd)
TIMEOUT=60
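A typical pip SSL setup lives in ~/.pip/pip.conf (or ~/.config/pip/pip.conf); a minimal sketch, with placeholder paths and hosts (not from the original note):

[global]
cert = /path/to/corp-ca-bundle.crt   # custom CA bundle for SSL verification
timeout = 60
# for an internal mirror that cannot be verified (less safe):
# index-url = https://pypi.internal.example.com/simple
# trusted-host = pypi.internal.example.com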
-
-
-
spark-submit --master yarn --deploy-mode cluster \
  --executor-cores 3 \
  --executor-memory 1g \
  --num-executors 36 \
  --conf "spark.executor.extraJavaOptions=-Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}" \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.driver.cores=1 \
  --conf spark.streaming.receiver.maxRate=150 \
  --conf spark.rpc.netty.dispatcher.numThreads=2 \
  --conf spark.ui.retainedJobs=100 \
  --conf spark.ui.retainedStages=100 \
  --conf spark.ui.retainedTasks=100 \
  --conf spark.ui.retainedDeadExecutors=100 \
  --conf spark.worker.ui.retainedExecutors=100 \
  --conf spark.worker.ui.retainedDrivers=100 \
  --conf spark.sql.ui.retainedExecutions=100 \
  --conf spark.streaming.ui.retainedBatches=10000 \
  --conf spark.eventLog.enabled=false \
  --conf spark.history.retainedApplications=2 \
  --conf spark.history.fs.cleaner.enabled=true \
  --conf spark.history.fs.cleaner.maxAge=2d \
  --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener \
  --class "$APPCLASS" "$APPFILE" >> "/var/log/${APPCLASS}.log" 2>&1

Note: all --conf flags must come before --class and the application JAR; anything after "$APPFILE" is passed to the application itself, not to spark-submit.
-
some config
-
spark = SparkSession.builder.master('yarn').appName('tga-dfa-adhoc') \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.sql.cbo.enabled", "true") \
    .config("spark.rpc.netty.dispatcher.numThreads", "2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.default.parallelism", "10000") \
    .config("spark.driver.maxResultSize", "10G") \
    .config("spark.rdd.compress", "true") \
    .config("spark.sql.inMemoryColumnarStorage.compressed", "true") \
    .config("spark.io.compression.codec", "snappy") \
    .config("spark.executor.memory", "20g") \
    .config("spark.executor.memoryOverhead", "10g") \
    .config("spark.sql.hive.thriftServer.singleSession", "true") \
    .getOrCreate()
-
# useful to debug and find the source file name
from pyspark.sql.functions import input_file_name
df = df.withColumn("fname", input_file_name())
The following configuration settings are set by default in the bigRED environment and *should not* be modified:
spark.authenticate = true (turns on authentication between Spark components)
spark.master = yarn (configures Spark to use YARN as the execution engine)
spark.executorEnv.LD_LIBRARY_PATH (varies depending on version / environment)
spark.yarn.appMasterEnv.SPARK_HOME (varies depending on version / environment) [Spark 2.0 only]
spark.yarn.archive (varies depending on version / environment)
spark.driver.extraJavaOptions (varies depending on version)
spark.yarn.am.extraJavaOptions (varies depending on version)
spark.eventLog.enabled = true (turns on Spark event logging)
spark.eventLog.dir (varies depending on version / environment)
spark.yarn.historyServer.address (varies depending on version / environment)
spark.shuffle.service.enabled = true (enables the Spark external shuffle service)

The following configuration settings are set by default in the bigRED environment. They should work for many (if not most) tasks, especially shell jobs, but may be customized as necessary:
spark.yarn.am.waitTime = 180000000 (time to wait for an ApplicationMaster to be allocated before giving up -- 3 minutes)
spark.network.timeout = 300s (time before network transfers are considered failed -- 5 minutes)
spark.speculation = true (enables speculative execution of Spark tasks)
spark.dynamicAllocation.enabled = true (enables dynamic allocation of Spark executors)
spark.dynamicAllocation.minExecutors = 1 (minimum number of executors to allocate -- please do NOT change for shell jobs)
spark.dynamicAllocation.maxExecutors = 100 (maximum number of executors to allocate)
spark.dynamicAllocation.executorIdleTimeout = 300s (time before idle executors are terminated -- please do NOT change for shell jobs)
spark.executor.instances = 1 (default # of Spark executors -- for dynamic allocation, use spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.maxExecutors instead)
spark.driver.memory = 1g (RAM to allocate for Driver instances)
spark.executor.memory = 4g (RAM to allocate for Executor instances)
spark.r.command = /usr/local/R-xyz-201701/bin/Rscript (R command used to execute SparkR scripts)
spark.r.shell.command = /usr/local/R-xyz-201701/bin/R (R command used to execute SparkR shells) [Spark >= 2.1 only]
spark.pyspark.python = /usr/local/python-xyz-201701/bin/python2.7 (python executable to use by default for pyspark) [Spark >= 2.1 only]
spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/local/python-xyz-201701/bin/python2.7 (python executable to use by default for pyspark on the YARN AM) [Spark <= 2.0 only]
spark.port.maxRetries = 64
-
Some Tips
-
Spark's NarrowDependency is actually defined as "each partition of the parent RDD is used by at most one partition of the child RDD".
-
Spark Native ORC: CREATE TABLE MyTable ... USING ORC
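For example (the column list is hypothetical; the "..." above stands for your schema):

spark.sql("CREATE TABLE MyTable (id BIGINT, name STRING) USING ORC")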
-
Nested Schema Pruning: spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", True) prunes nested columns (e.g. inside a struct) when not all fields are selected. Without this config, ALL fields are read.
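A minimal sketch (the Parquet path and struct column are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", True)

# df has a struct column, e.g. person: struct<name: string, address: string>
df = spark.read.parquet("/path/to/people")
df.select("person.name").explain()  # ReadSchema in the plan should list only person.name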
-
Collapse projects: use .asNondeterministic() on the UDF to stop the optimizer from collapsing adjacent projections (which would re-execute the UDF once per reference).
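A sketch, assuming a DataFrame df with hypothetical columns raw and other_col:

from pyspark.sql.functions import udf

# Marking the UDF non-deterministic keeps the optimizer from collapsing
# the projections below into one plan node that evaluates the UDF repeatedly.
expensive = udf(lambda s: s.strip().lower(), "string").asNondeterministic()

df2 = df.withColumn("clean", expensive("raw")) \
        .select("clean", "other_col")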
-
Spark 3.0: Join Hints:
- BROADCAST (available in prior versions): broadcast hash join
- MERGE: shuffle sort merge join
- SHUFFLE_HASH: shuffle hash join
- SHUFFLE_REPLICATE_NL: shuffle-and-replicate nested loop join
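Usage sketch (the hint() API and hint names are as in Spark 3.0; the tables, DataFrames and join key are hypothetical):

df1.join(df2.hint("MERGE"), "key")           # force a shuffle sort merge join
df1.join(df2.hint("SHUFFLE_HASH"), "key")    # force a shuffle hash join
spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key")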
-
spark.sql.codegen
The default value of spark.sql.codegen is false. When set to true, Spark SQL compiles each query to Java bytecode on the fly, which improves performance for large queries. The drawback is that codegen slows down very short queries, because it has to run a compiler for each one.
-
spark.sql.inMemoryColumnarStorage.compressed
The default value of spark.sql.inMemoryColumnarStorage.compressed is true. When true, Spark SQL automatically compresses the in-memory columnar storage based on statistics of the data.
-
spark.sql.inMemoryColumnarStorage.batchSize
The default value of spark.sql.inMemoryColumnarStorage.batchSize is 10000. It is the batch size for columnar caching. Larger values can improve memory utilization and compression, but risk out-of-memory errors when caching data.
-
spark.sql.parquet.compression.codec
spark.sql.parquet.compression.codec defaults to snappy. Snappy is a compression/decompression library that aims for very high speed with reasonable compression; its output is typically 20 to 100% larger than that of other codecs, but it is an order of magnitude faster. Other possible options include uncompressed, gzip and lzo.
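These knobs can be set per session; a sketch showing the defaults discussed above, just to make them explicit:

spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", True)
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", 10000)
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
# Note: spark.sql.codegen is a legacy Spark 1.x setting; in Spark 2+,
# whole-stage code generation is governed by spark.sql.codegen.wholeStage.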
-
-
Kerberos
-
keytab file
-
Typing passwords is annoying, especially when you are forced to change them arbitrarily. Here's what I did on bigred:

Set up passwordless ssh:
cat ~/.ssh/id_rsa.pub | ssh <user>@domain.XYZ.com "cat >> ~/.ssh/authorized_keys"

On the host, create a local keytab file for your principal:
$ ktutil
ktutil: addent -password -p <USER>@DOMAIN.XYZ.COM -k 1 -e aes256-cts-hmac-sha1-96
ktutil: wkt .keytab
ktutil: q

In ~/.profile:
if [ -f $HOME/.keytab ]; then
  kinit -kt $HOME/.keytab <USER>@XYZ.COM
fi

Now forget your password and go on with your life. (Alternatively, the keytab can be produced with the generate-keytab helper:)
$ generate-keytab .keytab <USER>@XYZ.COM
Kerberos password for <USER>@XYZ.COM: ********
-
ssh keys
ssh-keygen -t rsa -C "your_email@example.com"
-
ssh config
cat ~/.ssh/config

HashKnownHosts yes
ServerAliveInterval 120
TCPKeepAlive yes

Host gitlab.com
  Hostname altssh.gitlab.com
  User git
  Port 443
  PreferredAuthentications publickey
  IdentityFile ~/.ssh/gitlab

Host *
  UseKeychain yes
-
- Architecture katas
- Document Architectural Decisions
- Lightweight Arch Decision Records
- Thinking Architecturally
- Should that be a microservice?
-
Terminal tools
- bench
- Tools collection
- The Art of Command Line
- df, ag, fzf, z, ripgrep (rg), htop, glances, ctop, lazydocker, tree, exa, bat
- httpie, tldr (similar to man), ncdu
- bash strict mode
-
VS Code Extensions
-
Golang
-
Dash
-
Docker
-
Ligatures setup
-
brew tap homebrew/cask-fonts
brew cask install font-fira-code
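Then enable the font and ligatures in VS Code's settings.json (standard VS Code settings keys):

{
  "editor.fontFamily": "Fira Code",
  "editor.fontLigatures": true
}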
-
-
-
productivity tools
- asciinema
- exa (dir/ls tool)
-
Notes and tips
-
Hive CLI in Oozie actions
-
In the Hive script:
SET mapreduce.job.credentials.binary=${HADOOP_TOKEN_FILE_LOCATION};
On the CLI invocation:
--hiveconf hive.execution.engine=mr
-
Machine Learning
-
Apache Beam
-
Adhoc
graph TB
  A[Parse Command Line Arguments and Load Config File] --> B[Execute Pipeline Logic, Send Slack Msg, Perf Metrics Mgr and Cached Metrics]
  B --> Z[Load Tables Concurrently]
  Z --> C[Load SF Hive Table]
  Z --> D[Load CMP Hive Table]
  Z --> E[Load PbR Hive Table]
  C --> F[Business Logic and Filters]
  D --> G[Decorate with Business Logic/Filters]
  E --> H[Decorate with Business Logic/Filters]
  F --> I[Sync All DataFrames Loaded]
  G --> I
  H --> I
  I --> J[Write to Hive Table]
  J --> K[Post Process, Send Cached Perf Metrics]
graph TB
  A[Farscape Source] --> B[Mobile Orders]
  B --> C[Export to HDFS]
  C --> D[Push to BigRED HDFS]
  D --> E[Create Done Flag]
  E --> F[Success Email]
  B --> G[Failure Email]
  C --> G
  D --> G
  E --> G