## proc directory
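/proc: a pseudo-filesystem that exposes kernel state and is the main interface for Linux performance optimization

The proc directory has 3 parts
1. PID directories: one per running process
2. Information files such as /proc/meminfo: present information from the kernel
3. /proc/sys: tunables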

## sys
Important interfaces in /proc/sys
- fs: filesystem
- kernel: for kernel
- net: networking
- vm: memory

swappiness: the willingness of the kernel to swap out memory pages when memory pressure arises
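Each tunable is a plain file under /proc/sys, so it can be read without the sysctl tool. The following is a minimal sketch; the helper name `read_tunable` and the `SYSCTL_ROOT` override are assumptions made for illustration, not part of any standard tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a sysctl key such as vm.swappiness to its file
# under /proc/sys and print the current value.
# SYSCTL_ROOT is an illustrative override so the function can be tried
# against a scratch directory.
read_tunable() {
    local key="$1"
    local root="${SYSCTL_ROOT:-/proc/sys}"
    # Dots in the key correspond to slashes in the path.
    cat "${root}/${key//.//}"
}

# Print the current swappiness if the file is readable on this system.
if [ -r /proc/sys/vm/swappiness ]; then
    echo "vm.swappiness = $(read_tunable vm.swappiness)"
fi
```

Writing a new value to the same file (as root) changes the setting until the next reboot.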

### systemd-sysctl service
```systemctl status systemd-sysctl```
Its configuration is in /etc/sysctl.conf<br>
Changes made there are applied at boot<br>
You can change swappiness there, for example

Useful sysctl commands

In [None]:
%%bash
sysctl --help
sysctl -p /etc/sysctl.conf

In [None]:
%%bash
# list sysctl parameters
sysctl -a

# Managing Kernel Module Parameters
- Use **modinfo** to find which parameters are available
- Use **modprobe module key=value** to specify a parameter at runtime
- Use /etc/modprobe.d/modulename.conf to specify a permanent parameter

## Limiting Resource Usage

### ulimit
ulimit is the old way of configuring resource usage<br>
Applying POSIX Resource Limits
- Set runtime limits with **ulimit**
- - It applies to the shell in which the command is used
- Apply persistent ulimit settings in /etc/security/limits.conf
- Soft limits can be modified by the user, hard limits implement an absolute ceiling
- - Users can set soft limits, but only up to the enforced hard limit
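The difference between soft and hard limits can be tried in a shell. This is a small sketch; the helper `with_soft_nofile` and the value 64 are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Show the soft and hard limit for open files in the current shell.
echo "soft nofile: $(ulimit -Sn)"
echo "hard nofile: $(ulimit -Hn)"

# Hypothetical helper: run a command in a subshell with a different soft
# limit, so the calling shell keeps its own limit.
with_soft_nofile() {
    local limit="$1"; shift
    ( ulimit -Sn "$limit" && "$@" )
}

# The child shell sees the new soft limit; 64 is an arbitrary example value.
with_soft_nofile 64 bash -c 'echo "soft nofile in child: $(ulimit -Sn)"'
```

Because the limit is set inside a subshell, the calling shell is unaffected, which matches the note that ulimit applies to the shell in which it is used.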

In [None]:
%%bash
cat /etc/security/limits.conf

#### Managing Persistent limits
- **pam_limits** is applied from PAM sessions to configure persistent limits
- It works with /etc/security/limits.conf and /etc/security/limits.d/*.conf
- See **man 5 limits.conf** for documentation
- - Notice that not all limits currently work on RHEL. The **rss** limit, for instance, does not work
- Example: to set the maximum number of logins for the students group to 2, create the file /etc/security/limits.d/students.conf with the following contents:
- - **@students hard maxlogins 2**

#### Setting Limits to Services
- In systemd units, add **Limit\*** entries to the **[Service]** block of a unit file to limit what services can do
- If, for instance, you want to allow your blah.service a maximum of 60 seconds of CPU time before it is killed, add the following to **/etc/systemd/system/blah.service.d/10-cpulimits.conf**

[Service]
LimitCPU=60

- Use **systemctl daemon-reload** and **systemctl restart blah** to make the changes effective
- See **man 5 systemd.exec** and **man 2 setrlimit** for a full list of Limit\* settings

PAM (Pluggable Authentication Modules) shapes what happens when a user logs in to the system

<img src='screenshots/PAM.png'>

In [None]:
%%bash
man systemd.exec | grep limits

In [None]:
! mkdir /etc/systemd/system/sleep.service.d

In [None]:
%%bash
# reload unit files, then restart the service and check its status
systemctl daemon-reload
systemctl restart sleep
systemctl status sleep

## Control Groups
Understanding Control Groups
- Control Groups place resources in controllers that represent the type of resource
- Some common default controllers are **cpu**, **memory**, and **blkio**
- These controllers are subdivided into a tree structure where different weights or limits are applied to each branch
- - Each of these branches is a cgroup
- - One or more processes are assigned to a cgroup
- Cgroups can be applied from the command line, or from systemd
- - Manual creation happens through the **cgconfig** service and the **cgred** process
- - In all cases, cgroup settings are written to /sys/fs/cgroup

In [None]:
%%bash
ls /sys/fs/cgroup

## Resource Allocation 
- Machine
- System
- User

<img src='screenshots/CGroup.png'>

### Integrating Cgroups in Systemd
Systemd divides **cpu**, **cpuacct**, **blkio** into slices
- **system** for system processes and daemons
- **machine** for virtual machines
- **user** for user sessions

On a systemd system, the systemd-enabled cgroups can be omitted, and you can still use **cgconfig** and **cgred**. See **systemd-system.conf** for instructions on how to do that, and make sure to rebuild the initramfs to make this effective

#### Using Custom Slices
- Administrators can create their own slices, naming them *.slice, or slices within a slice, using the **parent-child.slice** naming
- Child slices will inherit the settings of the parent slices
- Make sure to turn on CPU, memory, or I/O accounting to see how they are used within a slice

#### Enabling Accounting
Enable accounting in the [Service] section of the unit file
- **CPUAccounting=true**
- **MemoryAccounting=true**
- **BlockIOAccounting=true**

Or better: enable it in /etc/systemd/system.conf<br>
Use drop-in files to take care of this
- e.g. the SSH service would use a drop-in /etc/systemd/system/sshd.service.d/*.conf

See **man 5 systemd.resource-control** for all parameters
- CPUShares=512
- MemoryLimit=512M
- BlockIO*=
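A drop-in that enables accounting can be generated with a few lines of shell. This is a sketch only: the helper name `write_accounting_dropin`, the file name `10-accounting.conf`, and the `UNIT_ROOT` override are assumptions for illustration, and the example writes to a scratch directory rather than to /etc/systemd/system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write an accounting drop-in for a unit.
# UNIT_ROOT is an illustrative override; on a real system the drop-in
# lives under /etc/systemd/system and writing there requires root.
write_accounting_dropin() {
    local unit="$1"
    local dir="${UNIT_ROOT:-/etc/systemd/system}/${unit}.d"
    mkdir -p "$dir"
    cat > "${dir}/10-accounting.conf" <<'EOF'
[Service]
CPUAccounting=true
MemoryAccounting=true
BlockIOAccounting=true
EOF
    # Print the path of the file that was written.
    echo "${dir}/10-accounting.conf"
}

# Try it against a scratch directory instead of the live configuration.
scratch=$(mktemp -d)
UNIT_ROOT="$scratch" write_accounting_dropin sshd.service
```

On a live system the drop-in only takes effect after **systemctl daemon-reload** and a restart of the service.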

In [None]:
%%bash
# listing folder for cpu
ls /sys/devices/system/cpu

Run ```top``` and then press 1 to check usage per CPU

### Managing slice
Putting Commands into a Slice<br>
To put a command into a slice, you can use the **systemd-run** command with the **--slice=** option
- **systemd-run --slice=example.slice sleep 10d**
- Show it with **systemd-cgls /example.slice/\<servicename\>**
- If the **--slice** option is not used, commands started with **systemd-run** will be put into the **system.slice**  
    
Using Custom Slices
- Put a service into a custom slice using **Slice=my.slice**; if the slice doesn't exist it will be created when the service is started
- You can also pre-create custom slices by creating a *.slice file in /etc/systemd/system. Put the tunables in the **[Slice]** section
- To make a slice a child of another slice, give it a name **\<parent\>-\<child\>.slice**; it will inherit all settings of the parent slice
- Note that the slice will only be created once the first process is started within
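The parent-child naming rule can be expressed as a small string operation. This is a sketch under the assumption that the dash is the only separator; the helper name `parent_slice` and the slice names are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the parent slice from a slice name by
# dropping the last dash-separated component.
parent_slice() {
    local name="${1%.slice}"
    if [[ "$name" == *-* ]]; then
        echo "${name%-*}.slice"
    else
        # Top-level slices sit directly under the root slice, "-.slice".
        echo "-.slice"
    fi
}

parent_slice web-frontend.slice
parent_slice web.slice
```

So `web-frontend.slice` inherits from `web.slice`, which in turn sits under the root slice.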


In [None]:
! systemd-cgtop

In [None]:
! systemd-cgls

# Benchmarking
- Benchmarking is comparing performance characteristics to industry standards
- This means that data needs to be gathered and compared
- In IT there often is no such thing as a standard benchmark, so you'll need to gather and compare a lot of data yourself
- Benchmarking is NOT profiling, which is about gathering information about performance hot spots

## Subsystem involved
While benchmarking, different subsystems should be involved
- Processor
- Memory
- Scheduler
- I/O
- Network

## Benchmarking Utilities
- **vmstat**
- - Notice that the first line of vmstat output gives the average since boot!
- **iostat**
- **mpstat**
- **sar**: can be used for gathering performance data
- **awk**: is useful for data analysis
- **gnuplot**: can be useful for plotting data
- **pcp**: is useful as an extensive testing framework
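As an illustration of using awk for analysis, the sketch below averages one column of vmstat-style output. The sample numbers are made up, and the helper name `column_average` is an assumption for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sample in vmstat layout: two header lines, then one line per
# interval. The numbers are made up for illustration.
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 800000  50000 900000    0    0    10    20  100  200  5  2 92  1  0
 2  0      0 790000  50000 905000    0    0     0    40  150  300 15  5 80  0  0
 4  0      0 780000  50000 910000    0    0     0    60  200  400 25  8 67  0  0'

# Hypothetical helper: average a numbered column, skipping the two header lines.
column_average() {
    awk -v col="$1" 'NR > 2 { sum += $col; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Column 13 is "us" (user CPU time) in this layout.
echo "$sample" | column_average 13
```

With live `vmstat` output you may want to skip one more line (`NR > 3`), because the first data line is the average since boot.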

### Getting Performance data with sar
**sar**
- For filtering purposes, use **LANG=C sar -b**
- Create an alias to do this automatically
- Consider /etc/sysconfig/sysstat for additional settings
- Data collection interval is set through cron

## Gnuplot
You can plot performance data with Gnuplot


# Kernel Information
with dmesg

In [None]:
%%bash
dmesg | less

# CPU, OS and cache Information

In [None]:
%%bash
lscpu
x86info -c

## Cache Architecture
- Cache is organized in lines, and each line can be used to cache a specific location in memory
- Each CPU has its separate cache and its own cache controller
- If a processor references main memory, it first checks cache for data. If it's there, then that is referred to as a cache hit
- A cache line fill occurs after a cache miss, and means that data is loaded from main memory
### Write-through and Write-back
- If write-through caching is enabled, when a line of cache memory is updated, the line is updated in memory as well
- If write-back is enabled, the write to cache is only written to main memory at the moment the cache line is deallocated
- Write-back is more efficient; write-through ensures a higher level of stability
- If on multi-CPU systems changes are not committed to memory immediately, the other CPUs need to be informed that something has changed if they are caching it as well
- - This is referred to as cache snooping, which is a hardware feature
### Other Cache Features
- Direct mapped cache means that each line of cache can only cache a specific location in main memory
- - This is the cheaper solution
- Fully associative cache means that a cache line can cache any location in main RAM
- - More flexible, more expensive
- Set associative cache offers a compromise between direct mapped and fully associative cache, and allows a memory location to be cached in any of n lines of cache, where n can be a number like 2 for instance
### Locality of Reference
- Cache memory is most efficient when the majority of memory accesses are served from cache
- Programs that access memory sequentially benefit most from cache
- Sometimes routines within a program access memory in a way that makes poor use of the cache. If that is the case, different **gcc** compiler options can be used to optimize cache access
- - These options are -O, -O2, -O3, and -Os; individual optimizations can be switched on or off using the -f options to gcc
- - The -O levels optimize gcc output for speed to an increasing degree, and -Os optimizes for size. Consult the man page for more details.

# Tracing
System and Library Calls
- The kernel exposes system calls to provide kernel access to applications
- - While executing system calls, the application claims kernel time, which is known as system time (and visible in top as such)
- A library call is the application-level way of providing functions
- - Time spent handling library calls is seen as user time in top

## Using **strace**
- **strace**: is used to trace system calls
- **strace \< command \>**: will show what the command is doing
- **strace -p \< PID \>**: gives information about a PID
- **strace -c \< command \>**: shows counters and thus insight on who is doing what
- **strace -f**: follows child processes as well, which by default is not the case
- **strace -e** allows you to follow specific system calls only, as in **strace -e open ls**

**ltrace** is similar to strace, but traces library calls

You can attach strace to a program that never finishes in order to debug it

In [3]:
%%bash
strace -c ls

linux.ipynb
screenshots


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 22.71    0.000094           9        11           close
 21.50    0.000089          45         2           getdents
 16.67    0.000069           8         9           openat
 12.08    0.000050           5        10           fstat
  8.45    0.000035          35         1           write
  6.76    0.000028          14         2         2 ioctl
  6.28    0.000026           2        17           mmap
  5.56    0.000023           3         8         8 access
  0.00    0.000000           0         7           read
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           exe

In [4]:
%%bash
strace -e open ls

linux.ipynb
screenshots


+++ exited with 0 +++


In [5]:
%%bash
strace -e access ls

linux.ipynb
screenshots


access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/selinux/config", F_OK)     = -1 ENOENT (No such file or directory)
+++ exited with 0 +++


## Kernel tracing with **ftrace**
ftrace was originally developed for function tracing in the Linux kernel<br>
Also gives access to events using static kernel traces
- System calls
- Scheduler events
- Memory management
- Interrupts
It uses static traces that are present in the kernel by default<br>
Using ftrace
- Plugins provide new trace types
- Exposed via **debugfs** in /sys/kernel/debug/tracing
- User-space **trace-cmd** tools are used to access traces
- - Results are written to a file trace.dat
- Notice that tracing does cause a lot of overhead!
- - Use filters: **trace-cmd record -e sched_switch -f 'prev_prio < 100'**
- - Exclude specific events: **trace-cmd record -e sched -v -e '\*static\*'**

Using **trace-cmd**
- **trace-cmd** uses events and plugins
- man pages are available
- - **trace-cmd list**: shows available plugins
- First, start capturing traces
- - **trace-cmd record** dumps all trace data
- - **trace-cmd record -p function_graph touch /tmp/file** traces a single command
- Use **trace-cmd report** to see the result
- - Use filtering for more specific results, as in **trace-cmd report | grep selinux**


In [None]:
%%bash
yum install trace-cmd
trace-cmd record -p function_graph touch /tmp/file
trace-cmd report
trace-cmd report | grep selinux

## SystemTap
SystemTap is for monitoring your Linux system while applications are running. For more information read<br>
https://sourceware.org/systemtap/SystemTap_Beginners_Guide

# I/O Workflow
<img src='screenshots/IO_Workflow.png' style='height: 50%; width: 50%'>
## I/O to storage
Working with Block Devices
- Data is written in blocks
- Block devices are accessible through device nodes in /dev
- These support seeking for specific locations
- Page cache is used to optimize performance
- - Bypass page cache by using unbuffered I/O
- Smart block devices don't need OS level optimization
- - They have large caches
- - Use DMA for memory access
- - And do their own I/O scheduling
- - With these, you're better off passing I/O directly to the device
Understanding I/O Challenges
- HDDs have a delay because the read/write head needs to move to the right position
- - Seek time is where the hard drive positions the head over the right track
- - Rotational delay is where the HDD waits for the right sector to pass under the heads
- If data is spread out over the disk, a lot of time is lost
- - Disk head movements can be minimized by re-arranging disk requests
- - RHEL does this automatically, putting the requests in a queue and running an elevator algorithm
- - With a badly designed elevator algorithm, starvation can occur: only floors in the middle are getting serviced

I/O Schedulers
- **noop**: FIFO: the first requests that come in are handled first. Used for SAN, hypervisors and SSD
- **deadline**: queued requests are executed in batches whose size is defined in the **fifo_batch** parameter. Deadline is good for file and database servers
- - Higher values = enhanced throughput
- - Lower values = low latencies
- - **read_expire** and **write_expire** define the maximum waiting time
- - Read requests get priority; give read requests more priority by increasing the value in **writes_starved**
- **cfq** (completely fair queueing) is useful if many processes are operating on the disk at the same time. Use **ionice** or the cgroup **blkio** controller to set priorities in this scheduler
- - Do NOT use on servers
- - Use **ionice -c 2 -n N -p PID** where N is between 0 and 7 and 0 is the highest priority

Selecting I/O Schedulers
- As a boot time kernel argument
- - elevator=
- Through /sys/block/sda/queue/scheduler
- - Each scheduler has its own tunables in /sys/block/sda/queue/iosched
- - Change by echoing a new value into the scheduler file
- Using tuned profiles
- - In [disk] section, use elevator=
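The scheduler file lists the available schedulers and marks the active one with square brackets. A minimal sketch for reading it follows; the helper name `active_scheduler` and the example line are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the bracketed (active) entry from the
# contents of a scheduler file read on standard input.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Made-up example content; on a live system, feed it the scheduler file.
echo 'noop deadline [cfq]' | active_scheduler

# Print the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    echo "${dev%%/*}: $(active_scheduler < "$f")"
done
```

Switching schedulers is then a matter of echoing one of the listed names into the same file as root.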

# Memory Management
- Memory is organized in pages, 4 KiB by default
- Processes have a major and minor page fault counter
- - **ps** will show them
- Systemd can be used to enforce memory limits
- Linux uses virtual memory that is mapped to physical memory
- - Use **pmap** to display

The TLB caches the mapping of virtual memory to resident memory

<img src='screenshots/virtual-resident-memory.png' style='height: 50%; width: 50%'>

Understanding Memory and Paging
- Physical RAM is divided into page frames and the OS uses memory pages to address memory
- - A page size normally is 4 KiB
- Processes have a virtual address space. From there, physical memory pages are mapped
- - Processes can only see their own physical memory pages
- - On 32-bit systems, the virtual address space is limited to 4 GiB; on 64-bit systems the address space is 16 EiB
- - Notice that the 16 EiB is a theoretical limit
- Virtual vs physical memory mappings are monitored using **top** or **ps**
- Physical memory pages can be shared between processes. If this is the case, they count to the resident memory of each process

Understanding the TLB (Translation Lookaside Buffer)
- Each process needs its own page table that contains mapping of virtual addresses to physical addresses
- Each virtual page has an entry in this table. If this entry is not mapped to a physical page, accessing it causes a page fault
- - A major page fault occurs if a page is swapped out or needs to be loaded from a file on disk
- - A minor page fault occurs if the process has just loaded and the virtual page still needs to be mapped to a physical page
- - Monitor through /proc/\<PID\>/stat
- If each process has a complete table, it would require lots of memory
- To mitigate that problem, the page table is organized hierarchically, and only page tables that contain physical addresses are administered
- Looking up a page is expensive
- For that reason, the Translation Lookaside Buffer (TLB) is used
- - TLB is a hardware cache on CPU
- - Depending on hardware, the TLB cache can be organized in L1 and L2, differentiating between instructions and data
- - It caches page mappings the process has recently used
- Use **x86info -c** to find the size of the TLB
- - The TLB is relatively small
- - To optimize its use, huge pages can be used
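The fault counters mentioned above can be pulled out of the stat file directly. This sketch assumes the field layout documented in proc(5), where minflt is field 10 and majflt is field 12; the helper name `fault_counts` is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print minor and major fault counters from one line in
# /proc/<PID>/stat format, read on standard input. The command name in
# parentheses may contain spaces, so everything up to the last ") " is
# stripped first.
fault_counts() {
    local line rest
    local -a f
    read -r line
    rest="${line##*) }"
    read -r -a f <<< "$rest"
    # After stripping, f[0] is the state (original field 3), so
    # minflt (original field 10) is f[7] and majflt (field 12) is f[9].
    echo "minflt=${f[7]} majflt=${f[9]}"
}

# Counters for the current shell, if /proc is available.
if [ -r "/proc/$$/stat" ]; then
    fault_counts < "/proc/$$/stat"
fi
```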

Memory CoW
- New processes are created by forking the parent process
- - A duplicate process is created
- - Pages are marked as copy-on-write, which means that a page is only duplicated at the moment it is modified
- Where data can be shared between parent and child, the child has pointers to read-only parts of parent memory address space
- The data is only really copied to the child when written to, which leads to performance improvement

Managing Process Memory
- Use **ps -o pid,comm,minflt,majflt** to get information about page faults that have occurred
- - Minor page faults cannot be avoided, major page faults should be avoided
- To limit the amount of memory available to a process, use Systemd

- Monitor current virtual address space usage using **pmap \<PID\>** or **cat /proc/\<PID\>/maps** and **/proc/\<PID\>/smaps**

Understanding **pmap** output
- **pmap** shows virtual address use - no information about RSS
- Notice that not all virtual memory is mapped
- - The mapping typically starts at 0x00400000 (4 MiB) for the executable part of the process that comes from the executable file
- - Next comes the heap, which is memory that has been dynamically allocated using **malloc**, followed by shared libraries
- - Near the top of the user address space, at 0x7fffffffffff (128 TiB), there is the stack, which contains anonymous memory that is requested by the process
- - The kernel is available from address 0xffff800000000000
- If a process tries to access memory that doesn't occur in the virtual memory map, the kernel sends a SIGSEGV: a segmentation fault
- - If that happens, the program stops and a core dump can be written

In [None]:
! pmap $(ps -o pid= | head -1)

## Understanding Memory Leaks
- Processes should return memory after they have finished using it. If this doesn't happen, it is known as a memory leak
- Memory leaks are irrelevant for short-running processes such as **ls**, as all memory will be freed by the kernel when the process stops running
- For daemon processes, memory leaks are a severe problem
- The only way to recover leaked memory is to kill and restart the process

### Types of Memory Leak
There are two types: virtual and resident
- If a program requests memory but doesn't use it, the virtual size of program memory goes up but no physical memory is used: this is a leak in virtual memory
- - The total amount of virtual memory that is allocated is visible as **Committed_AS** in **/proc/meminfo**
- If the program also starts using the leaked memory, resident memory grows as well; this is a leak in resident memory, and the system will suffer from it
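To watch the committed total over time, the field can be read from /proc/meminfo with awk. A minimal sketch; the helper name `meminfo_field` is an assumption for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value (in kB) of one field from a file in
# /proc/meminfo format. The file defaults to the live /proc/meminfo.
meminfo_field() {
    local field="$1" file="${2:-/proc/meminfo}"
    awk -v key="${field}:" '$1 == key { print $2 }' "$file"
}

if [ -r /proc/meminfo ]; then
    echo "Committed_AS: $(meminfo_field Committed_AS) kB"
    echo "MemTotal:     $(meminfo_field MemTotal) kB"
fi
```

A Committed_AS value that keeps growing while the workload stays the same is a hint that some process is leaking virtual memory.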

### Finding Memory leaks
- The best tool to find memory leaks is **valgrind**. Use **valgrind --tool=memcheck program** to do so
- Use **--leak-check=full** as an option to get information about which function is leaking memory

In [None]:
%%bash
valgrind --tool=memcheck ls

## Memory Reclamation
A memory page can be in different states
- Free: immediately available
- Inactive clean: it is not used and its contents are synchronized with the corresponding data on disk. It can be treated as a free page, but
- - this will result in a page fault if the process accesses it again
- - the page needs to be swapped if it is anonymous memory
- Inactive dirty: it isn't used but the page contents have not been synchronized to disk
- Active: it's doing something
Monitoring Memory States
- /proc/meminfo gives a generic overview
- /proc/PID/smaps shows sizes of Shared/Private clean and dirty memory
- Dirty pages need to be written to disk
- - Recent Red Hat uses a per-backing device flush thread to flush dirty data, it shows as flush-MAJOR:MINOR

MAJOR:MINOR numbers are shown by lsblk

In [None]:
! lsblk

## Managing OOM
- OOM is out of memory
- As most applications never use their entire address space, Linux uses memory overcommitting
- As a result, you may get in an OOM situation
- Set one of 3 different modes for overcommitting through **vm.overcommit_memory**
- - 0: Heuristic overcommit (default). Overcommitting is allowed, unless it's a very large unrealistic request
- - 1: Always overcommit
- - 2: Ratio based overcommit: based on available RAM + swap
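For mode 2, the limit can be approximated with simple arithmetic. This sketch assumes the kernel's documented formula of swap plus a percentage of RAM, where the percentage comes from the **vm.overcommit_ratio** tunable (not mentioned above), and it ignores reserved huge pages. The helper name and the numbers are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate the commit limit used in mode 2, in kB.
# Assumes no huge pages are reserved; ratio is vm.overcommit_ratio in percent.
commit_limit_kb() {
    local ram_kb="$1" swap_kb="$2" ratio="$3"
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Made-up example: 8 GiB RAM, 2 GiB swap, ratio 50.
commit_limit_kb 8388608 2097152 50
```

The kernel reports its own value as **CommitLimit** in /proc/meminfo, which is the number to trust on a live system.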

In [None]:
! cat /proc/sys/vm/overcommit_memory

## Understanding OOM
- It occurs when a minor page fault happens but no free pages are available
- If OOM happens, the OOM killer becomes active and will kill one or more processes to free memory
- - You can trigger this using **echo f > /proc/sysrq-trigger** and read output in **dmesg**
- - Find which triggers are available by using **echo h > /proc/sysrq-trigger**
- This is bad, very bad, which is why alternatively you can set **vm.panic_on_oom = 1** to have the kernel panic instead
- It is better to avoid getting into OOM at all times!

## OOM Killer
- Every process has an **oom_score** in **/proc/PID/oom_score**
- Higher scores are more likely to get killed
- The kernel and systemd are immune; root processes, processes with a higher runtime and processes involved in direct hardware ac*. Its value will be added to **oom_score**
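To see which processes the OOM killer would consider first, the scores can be collected and sorted. A minimal sketch; the helper name `top_oom_scores` and the `PROC_ROOT` override are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the processes with the highest oom_score as
# "score pid command". PROC_ROOT is an illustrative override for trying
# the function on a scratch tree.
top_oom_scores() {
    local count="${1:-5}" root="${PROC_ROOT:-/proc}" dir
    for dir in "$root"/[0-9]*; do
        # Processes can exit while we iterate, so skip unreadable entries.
        [ -r "$dir/oom_score" ] || continue
        echo "$(cat "$dir/oom_score" 2>/dev/null) ${dir##*/} $(cat "$dir/comm" 2>/dev/null)"
    done | sort -rn | head -n "$count"
}

top_oom_scores 5
```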

## Memory Zones
- You may get in an OOM situation, even if **free** still reports available memory (in particular on 32-bit systems)
- This is because the Linux kernel works in memory zones
- - Zone DMA goes from 0 MiB to 16 MiB
- - Zone DMA32 goes from 16 MiB to 4 GiB
- - Zone Normal goes from 4 GiB to end of available memory
- The above is for 64-bit systems. 32-bit systems have low memory up to 896 MiB, all above is high memory
- Monitor /proc/buddyinfo to get information about these different zones


DMA stands for Direct Memory Access

In [None]:
# https://learning.oreilly.com/videos/linux-performance-optimization/9780134985961/9780134985961-LPOC_03_09_07
! sysctl -a | grep overcom

# CPU
CPUs have different components that are important for performance optimization
- Socket: the connector on the motherboard that houses a CPU and provides the mechanical and electrical connection between the microprocessor and the printed circuit board
- Core
- Registers
- Cache
- Memory Controller
- External bus

Other attributes also play a role
- Architecture
- CPU family
- Model

In [None]:
%%bash
lscpu # display information about cpus

In [None]:
%%bash
cat /proc/cpuinfo

## Understanding CPU-bound Tasks
- A task is CPU-bound when CPU availability is the limiting factor
- Different factors can influence CPU-speed
- - The number of other tasks that also need attention from the CPU
- - Efficiency of hardware cache
- - Type of instruction that is executed
- Executing an instruction takes different steps
- - The fetch unit looks up the instruction in the L1i and L2i hardware caches, or fetches it from RAM and moves it to cache
- - The instruction is processed by the instruction decoder
- - Instruction-related data is fetched
- - Instruction is sent to the execution unit 
- - And it will run

## Process Scheduler
<img src='screenshots/Process-Scheduler.png' style='height: 50%; width: 50%'>
- The kernel Process Scheduler determines which process to run when
- It needs to meet different criteria
- - Quickly determine which process to run next
- - Give each process a fair share of time
- - Distinguish between high priority and low priority processes
- - Be responsive to application requests
- - Be predictable and scalable
- Because all of these are hard to unite in 1 solution, different CPU scheduler solutions exist

### Understanding the O(1) Scheduler
- The legacy O(1) Scheduler works with 2 queues: a run queue and an expired queue
- The processor services the process with the highest priority in the run queue
- When the time slice of a process expires, the scheduler moves the process to the appropriate position in the expired queue (according to priority)
- It next runs the highest priority process from the run queue and repeats the process
- When the run queue has completely been serviced, the expired queue becomes the run queue and the cycle starts again

### Understanding Complete Fair Scheduling
- CFS uses a red-black tree that is ordered by virtual runtime
- Virtual runtime is based on the CPU time a process has already used, weighted by process priority
- The process with the least virtual runtime (which is the process that has had the smallest share of the CPU) gets CPU time
- While it uses CPU time, its virtual runtime increases
- Once it no longer has the least virtual runtime, it gets pre-empted
- Using this approach makes it easier to prevent users that are claiming too much CPU time from seriously hurting performance

### CFS **sysctl** Parameters
- **sched_latency_ns**: epoch duration in nanoseconds
- **sched_min_granularity_ns**: granularity of the epoch in ns. If the number of running tasks in the queue is greater than **sched_latency_ns** divided by **sched_min_granularity_ns**, there are too many tasks on the system and the epoch length must be increased
- **sched_migration_cost_ns**: if the real running time of a process is longer than this value, the scheduler will try not to move it to another CPU
- **sched_rt_period_us**: the time slice that is defined to run RT processes
- **sched_rt_runtime_us**: maximum CPU time that can be used by all real time tasks in a **sched_rt_period_us** time period
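The relation between epoch length and granularity can be sketched as arithmetic. The stretch rule used here (number of tasks times the minimum granularity) is an assumption about how the epoch is lengthened, and the helper name and values are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: epoch length in ns for a given number of runnable tasks.
epoch_ns() {
    local latency_ns="$1" min_gran_ns="$2" tasks="$3"
    local max_tasks=$(( latency_ns / min_gran_ns ))
    if (( tasks > max_tasks )); then
        # Too many tasks: stretch the epoch so each still gets min_gran_ns.
        echo $(( tasks * min_gran_ns ))
    else
        echo "$latency_ns"
    fi
}

# Made-up values: 24 ms latency, 3 ms granularity, so 8 tasks fit in one epoch.
epoch_ns 24000000 3000000 4
epoch_ns 24000000 3000000 12
```

With 4 tasks the epoch stays at the configured latency; with 12 tasks it has to grow.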

## Managing CPU Scheduling
- Systemd CGroups CPUShares can be used to set run-time priority
- Do this at run time using **systemctl set-property vsftpd.service --runtime CPUShares=512**
- Skip **--runtime** to make it persistent
- Or create a conf file in /etc/systemd/system/\<service\>.service.d/ that has CPUShares=512 in the [Service] section
    
## Managing Real-Time Scheduling
- **SCHED_RR**: Round Robin real-time scheduler which has a process only run until its time slice is expired
- **SCHED_FIFO**: runs until it is blocked by I/O calls, calls **sched_yield**, or a higher priority process comes along

## Understanding Process Priority Numbers for the Kernel
- Kernel level system priority goes from 0 to 99, where a higher number is the lower priority (0 is the highest priority)
- Kernel level real-time priority goes from 99 to 0, where a lower number is the lower priority (99 is the highest priority)
- Real time priority 99 corresponds to system priority 0
- The **nice** command affects non-real time processes only, which run with a system priority of 99
- The **top** command just shows RT for real-time processes and 20 for "regular" processes, which can be adjusted by using the nice command
- No matter what you do with the **nice** command, it doesn't change the real kernel priority of the process

### Managing Real-time
For a sysadmin, it is important to ensure that the system stays responsive. To do so, use
- **kernel.sched_rt_period_us** to define the CPU allocation time frame
- **kernel.sched_rt_runtime_us** to define the share of time frame that can be allocated for real time
- - By default this is set to 0.95 seconds, which always leaves 0.05 seconds per second for processes in the **SCHED_OTHER** queue
- - If set to 0, no time periods can be allocated to real-time processes
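The share that real-time tasks may claim is the runtime divided by the period. A minimal sketch; the helper name `rt_share_percent` is an assumption for illustration, and the /proc paths are the file form of the two sysctl keys named above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage of each period available to real-time tasks.
rt_share_percent() {
    local runtime_us="$1" period_us="$2"
    echo $(( runtime_us * 100 / period_us ))
}

# The default described above: 0.95 s out of every 1 s.
rt_share_percent 950000 1000000

# Live values, where the kernel exposes them.
if [ -r /proc/sys/kernel/sched_rt_runtime_us ] && [ -r /proc/sys/kernel/sched_rt_period_us ]; then
    runtime=$(cat /proc/sys/kernel/sched_rt_runtime_us)
    period=$(cat /proc/sys/kernel/sched_rt_period_us)
    # A runtime of -1 means real-time tasks are not throttled at all.
    if [ "$runtime" -ge 0 ]; then
        echo "live: $(rt_share_percent "$runtime" "$period")%"
    fi
fi
```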

### Non-real Time Scheduling
- **SCHED_IDLE** is for running very low priority applications
- - This runs processes with a priority that is lower than **nice 19**

## Change Process Schedulers
- An application can use the syscall **sched_setscheduler** to set the scheduler
- Or administrators can use **chrt** to do so: **chrt [scheduler] priority command**
- - **-b** runs in SCHED_BATCH
- - **-f** runs in SCHED_FIFO
- - **-i** runs in SCHED_IDLE
- - **-o** runs in SCHED_OTHER
- - **-r** runs in SCHED_RR
- Use for instance **chrt -p -f 10 \$(cat /var/run/vsftpd.pid)** to run the vsftpd process in SCHED_FIFO
- Documentation in /usr/share/doc/kernel-doc-*/Documentation/scheduler after install ```yum install kernel-doc```

To see the scheduler options

In [1]:
%%bash
chrt --help

Show or change the real-time scheduling attributes of a process.

Set policy:
 chrt [options] <priority> <command> [<arg>...]
 chrt [options] --pid <priority> <pid>

Get policy:
 chrt [options] -p <pid>

Policy options:
 -b, --batch          set policy to SCHED_BATCH
 -d, --deadline       set policy to SCHED_DEADLINE
 -f, --fifo           set policy to SCHED_FIFO
 -i, --idle           set policy to SCHED_IDLE
 -o, --other          set policy to SCHED_OTHER
 -r, --rr             set policy to SCHED_RR (default)

Scheduling options:
 -R, --reset-on-fork       set SCHED_RESET_ON_FORK for FIFO or RR
 -T, --sched-runtime <ns>  runtime parameter for DEADLINE
 -P, --sched-period <ns>   period parameter for DEADLINE
 -D, --sched-deadline <ns> deadline parameter for DEADLINE

Other options:
 -a, --all-tasks      operate on all the tasks (threads) for a given pid
 -m, --max            show min and max valid priorities
 -p, --pid            operate on existing given pid
 -v, --verbose        disp

In [2]:
%%bash
chrt -b --max ## shows priority available

SCHED_OTHER min/max priority	: 0/0
SCHED_FIFO min/max priority	: 1/99
SCHED_RR min/max priority	: 1/99
SCHED_BATCH min/max priority	: 0/0
SCHED_IDLE min/max priority	: 0/0
SCHED_DEADLINE min/max priority	: 0/0


In [4]:
%%bash
chrt -p -f 10 $(pidof ssh)

pid 10's current scheduling policy: SCHED_FIFO
pid 10's current scheduling priority: 99


## CPU Pinning
Managing CPU Pinning
- Every time the scheduler reschedules a process, it will determine which CPU it will go on
- This is bad for cache usage
- Use the systemd CPUAffinity setting in the [Service] block to pin a service to a CPU
- - CPUAffinity=0, 1 allows the process to run on CPU cores 0 and 1 only
- - If you use this setting on a NUMA system, make sure to start the **numad.service** as well to manage NUMA usage
- For more scalability, use the cpuset Cgroup to manage multiple features (not currently supported by systemd)

Understanding CPUSet
- The cpuset Cgroup offers some useful parameters
- - **cpuset.cpus**: the CPUs that can be inside this group
- - **cpuset.mems**: the NUMA memory zones in this group. On non-NUMA systems, always use this parameter and set it to 0
- - **cpuset.{cpu,mem}_exclusive**: set to 1 if the CPU or memory for this group is exclusive, making it not accessible for other processes
- As systemd does not currently support cpuset, define an ExecStartPost script within the systemd unit

In [None]:
%%bash
# cpuset ExecStartPost script example
#! /bin/bash
# create the cpuset and restrict it to CPU 0 and memory node 0
mkdir -p /sys/fs/cgroup/cpuset/cpuset0
echo 0 > /sys/fs/cgroup/cpuset/cpuset0/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/cpuset0/cpuset.mems
# move every httpd process into the cpuset
for PID in $(pgrep httpd); do
    echo ${PID} > /sys/fs/cgroup/cpuset/cpuset0/tasks
done

# in the unit file:
# [Service]
# ExecStartPost=/usr/local/bin/cpuset0

In [10]:
! systemctl show --all sshd.service | grep CPUAffinity

CPUAffinity=


In [None]:
%%bash
cp /usr/lib/systemd/system/sshd.service /etc/systemd/system/
vi /etc/systemd/system/sshd.service
## then insert CPUAffinity='number'

systemctl daemon-reload
systemctl restart sshd

In [None]:
%%bash
systemctl show --all sshd.service | grep CPUAffinity

Making an exec start script

In [None]:
%%bash
cd /usr/local/bin
vi cpu.sh

In [None]:
# cpu.sh
#! /bin/bash
mkdir -p /sys/fs/cgroup/cpuset/cpuset0
echo 0 > /sys/fs/cgroup/cpuset/cpuset0/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/cpuset0/cpuset.mems
for PID in $(pgrep sshd); do
    echo ${PID} > /sys/fs/cgroup/cpuset/cpuset0/tasks
done

### Balancing Interrupts
- Interrupts are used to indicate that there is work to do at this moment
- When it occurs, a CPU needs to be chosen
- Monitor /proc/interrupts for an overview of which is handled when
- Use **/proc/irq/\<number\>/smp_affinity** to determine which interrupt is handled where

The **smp_affinity** value is set as a bitmask which is represented as a hexadecimal number

| bitmask | CPU | HEX|
| -- | -- | -- |
| 00000001 | 0 | 0x1 |
| 00000010 | 1 | 0x2 |
| 00000100 | 2 | 0x4 |
| 00001000 | 3 | 0x8 |
| 00010000 | 4 | 0x10 |
| 00100000 | 5 | 0x20 |
| 01000000 | 6 | 0x40 |

- CPU affinity can be set to multiple CPUs
- For instance, 10100100 would set affinity to CPUs 2, 5, and 7
- This is equal to 0xa4
- Calculate the mask from the bash shell as a sum of powers of two, writing 2^n as 2**n
- - printf '%x' \$((2\*\*2+2\*\*5+2\*\*7)) > /proc/irq/\<n\>/smp_affinity
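The mask calculation can be wrapped in a small function that takes a list of CPU numbers. A minimal sketch; the helper name `cpus_to_mask` is an assumption for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the hexadecimal smp_affinity mask for a list of
# CPU numbers (CPU 0 is the lowest bit, as in the table above).
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0        # single CPU 0
cpus_to_mask 2 5 7    # CPUs 2, 5 and 7
```

The second call sets bits 2, 5 and 7, which is 4 + 32 + 128 = 164, or a4 in hexadecimal. The result can be written to the smp_affinity file of an interrupt as root.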



Using **irqbalance**
- The **irqbalance** service adjusts the smp_affinity of all interrupts every 10 seconds
- This increases the chance of getting cache hits
- No effect on single-core or dual-core systems with shared cache
- Settings in /etc/sysconfig/irqbalance
- - IRQBALANCE_ONESHOT: set to yes to run once
- - IRQBALANCE_BANNED_CPUS: set as a hexadecimal bitmask to exclude CPUs

In [None]:
%%bash
cat /proc/interrupts

In [None]:
%%bash
cd /proc/irq/3 # some folder number that exists
printf '%0x' $[2**2+2**5+2**7] > /proc/irq/3/smp_affinity 


In [2]:
%%bash
systemctl status irqbalance.service

● irqbalance.service - irqbalance daemon
   Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-09-04 01:39:53 PDT; 1 day 3h ago
 Main PID: 1119 (irqbalance)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/irqbalance.service
           └─1119 /usr/sbin/irqbalance --foreground

Sep 04 01:39:53 yi-XPS-13-9380 systemd[1]: Started irqbalance daemon.


## Real-Time Scheduling
Selecting the Scheduler<br>
Multiple ways exist to set the scheduler policy for a process
- By default, a process will inherit the scheduler from its parent
- **chrt** - discussed earlier - can be used to manually select a scheduler
- The process can call **sched_setscheduler** to run in a specific scheduler
- **systemd** can set the policy and priority when starting the service
- - Use **CPUSchedulingPolicy** to select scheduler from a Systemd Unit file
- - - Notice that this does not work if **CPUShares** are also set
- - Use **CPUSchedulingPriority** to set priority from Systemd
- See **man systemd.exec** for more information