
# Context

This is a review document for the DGMD17: Robotics, Autonomous Vehicles, Drones, and Artificial Intelligence course. This review sheet is not meant to be a comprehensive, exhaustive review of all the course content. Instead, it's meant to be a summary of the key topics and takeaways. For more depth, please refer to the course material and lectures found on Canvas. Where deemed appropriate, links to relevant sources will be provided.

**Note**: not all topics from this document may appear on the final exam and not all topics in the final exam may appear in this document. Furthermore, this document is not exhaustive of all the topics and details covered in the class.

# Week 1: Introduction & Syllabus Review

* It is an exciting time for the field of robotics, autonomous vehicles, drones, and artificial intelligence. We're witnessing a modern-day renaissance, where a number of advancements are being made and what were merely ideas in the past are turning into reality

* Here are some examples of recent innovations:
    * Driverless robotaxis from [Waymo](https://waymo.com/) and [Cruise](https://www.getcruise.com/) available in [San Francisco](https://www.sfchronicle.com/sf/article/waymo-driverless-robotaxis-19400255.php), [Austin](https://www.reuters.com/business/autos-transportation/waymo-robotaxi-begin-driverless-services-employees-austin-2024-03-05/#:~:text=Waymo%20robotaxi%20to%20begin%20driverless%20services%20for%20employees%20in%20Austin,-By%20Reuters&text=March%205%20(Reuters)%20%2D%20Alphabet's,in%20Austin%2C%20Texas%20starting%20Wednesday.), and [Phoenix](https://techcrunch.com/2024/04/09/cruise-robotaxis-are-back-sort-of/?guccounter=1#:~:text=GM's%20Cruise%20robotaxis%20are%20back%20in%20Phoenix%20%E2%80%94%20but%20people%20are%20driving%20them,-Rebecca%20Bellan%40rebeccabellan&text=General%20Motors'%20Cruise%20is%20redeploying,won't%20be%20driving%20themselves.)
    * Advanced imaging and mapping for [endangered species via drones](https://ageagle.com/use-cases/using-mapping-drones-for-wildlife-monitoring-and-conservation/)
    * Amazon's fully autonomous warehouse robot [Proteus](https://www.youtube.com/watch?v=AmmEbYkYfHY)
    * Boston Dynamics [Spot](https://bostondynamics.com/products/spot/) and [Atlas](https://bostondynamics.com/atlas/) robots

# Week 2: Systems, Linear Algebra, & NumPy

* A system is a collection or group of components all working together to achieve some desired functionalities
    * Thinking in terms of systems means taking a "big" problem and breaking it down into its component parts, which makes challenging problems easier to tackle
* Embedded systems refer to systems designed to perform specific functions, leveraging hardware and software, within a larger system
    * In other words, they are components of a larger system
* Sample framework for robotics, autonomous vehicles, and drones:
    * Chassis / Encasing: the components housing the device
    * Electronics / Sensors: the components that collect data on the environment
    * Software / Algorithms: the processing of the data
* Many considerations when designing the chassis / encasing: price, materials, durability, weight, reflectivity, color, waterproofing, toxicity, comfort, touch, flexibility, form factor, battery life, ease of use, human design, moddability, reparability, customizability, personality, skeuomorphism, child safety, accessibility, usability, innovation, heating, interactability, reliability, speed, and maintenance
    * No perfect solution for every metric / attribute, often need to work with trade-offs (e.g., very precise, but high cost and low battery)
    * Autonomous robotaxi (needs to be a general, well-rounded solution) vs warehouse robot (can be aimed for high performance in a specific environment)
* We can further break down electronics / sensors into:
    * Processor / Chip: the brains of the device; two types:
        * Microprocessor: the CPU component, which needs to connect to other components (e.g., memory and peripherals); typically more powerful and capable, but at a higher cost and power need; used for higher-end / complex devices
        * Microcontroller: an integrated microprocessor that is cheaper, more compact, and less battery intensive, at the cost of compute capability; used for simpler devices
    * Sensors: collect data from the environment
        * Many different sensors that are commonplace in the field (covered in the next week section)
* After collecting data from sensors, common next steps are: signal processing, algorithms, and/or machine learning

---

* It is common to use linear algebra when working with data.
* The two key objects in linear algebra are:
    * Vectors: a 1-dimensional list of numbers or collection of numbers (we can relate each entry to some attribute, e.g., in physics, each entry corresponds to a direction)
    * Matrices: a 2D collection of numbers or a collection of vectors
* Key Vector Operations:
    * Addition / Subtraction: add / subtract each element elementwise
    * Multiply by a Constant: multiply each element by the constant
    * Dot Product: multiply each element elementwise and sum all entries
* Key Matrix Operations:
    * Addition / Subtraction: add / subtract each element elementwise
    * Multiply by a Constant: multiply each element by the constant
    * Transpose: flip the rows and columns along the diagonal
    * Matrix Multiplication: given a matrix, $A$, with $n$ rows and $m$ columns and another matrix, $B$, with $m$ rows and $p$ columns, the matrix multiplication of $A$ and $B$ takes each row of $A$ and each column of $B$ and uses their dot product to fill the corresponding entry of the resulting $n$ by $p$ matrix (note: shape and order matter)
* Vectors and matrices are important as they let us conveniently and succinctly represent different ideas and are often more efficient to implement (specialized libraries and hardware).
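
The vector and matrix operations listed above can be tried directly in NumPy. The sketch below is only an illustration: the arrays `u`, `v`, `A`, and `B` are made-up values, chosen so that `A` is 2 by 3 and `B` is 3 by 2 and the shapes line up for matrix multiplication.

```python
import numpy as np

# Two example vectors (arbitrary illustrative values)
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

vector_sum = u + v       # elementwise addition
scaled = 2 * u           # multiply every element by a constant
dot = np.dot(u, v)       # multiply elementwise, then sum: 1*4 + 2*5 + 3*6

# A is 2 x 3 (n=2, m=3) and B is 3 x 2 (m=3, p=2), so A @ B is 2 x 2
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])

A_transposed = A.T       # rows and columns swapped, shape (3, 2)
product = A @ B          # entry (i, j) is row i of A dotted with column j of B

print(dot)               # 32
print(product)           # [[ 4  5] [10 11]]
```

Swapping the order to `B @ A` is also valid here but gives a 3 by 3 result, which is one way to see that order matters.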




---

* A common library for working with data is NumPy
    * It lets us efficiently operate on vectors, matrices, and high-dimensional arrays by storing data in contiguous blocks of memory (compared to native Python's pointer system)
* We can do many operations, like vector / matrix addition, multiplication, dot product, etc.
* A key property of NumPy is broadcasting where arrays of different shapes can be operated on and NumPy infers the necessary modifications
    * Example: `np.array([1, 2, 3]) + 1 == np.array([2, 3, 4])`; NumPy infers that `1` should be added to each element even though it is not the same type or shape as the array
* NumPy has rich documentation, examples, and tutorials online and is the bedrock for many scientific computing and machine learning libraries (e.g., scikit-learn, OpenCV, etc.)
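
As a small illustration of broadcasting, the sketch below combines arrays of different shapes. The array names and values are made up for the example.

```python
import numpy as np

row = np.array([1, 2, 3])              # shape (3,)
matrix = np.array([[10, 20, 30],
                   [40, 50, 60]])      # shape (2, 3)
column = np.array([[100], [200]])      # shape (2, 1)

plus_one = row + 1        # the scalar is applied to every element
shifted = matrix + row    # `row` is reused for each row of `matrix`
grid = column + row       # (2, 1) and (3,) broadcast to shape (2, 3)

print(plus_one)   # [2 3 4]
print(shifted)    # [[11 22 33] [41 52 63]]
print(grid)       # [[101 102 103] [201 202 203]]
```

Shapes that cannot be lined up this way (for example, adding a length-3 array to a length-4 array) raise an error instead of being broadcast.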

# Week 3: Sensors & Sensor Fusion

* Sensor: A device that captures and measures quantities / attributes of the environment
    * Many different applications from robots to drones to self-driving cars
* Sensors capture different measurands - types of common measurands are: mechanical, optical, electric, acoustic, thermal, chemical, radiation, biological, and magnetic
* A key part of sensors is analog (continuous signals) to digital (discrete signals that a computer can understand) conversion:
    * Sampling: get samples of the measurands at some frequency (higher frequency, more samples)
    * Quantization: round continuous values to nearest bucket or threshold (more buckets gives greater precision, but requires storing more information)
    * Encoding: convert the quantized thresholds to binary for digital systems (i.e., computers)
* Similarly, there are techniques to convert a digital signal to an analog signal. A reference carrier signal is used with phase, frequency, or amplitude shifting (i.e., the chosen property of the carrier signal is switched between set values according to the digital data)
* For sensors, there are some general key characteristics:
    * Time vs Space: sensors can collect data along the time domain (e.g., motion or audio) or space domain (e.g., images and videos)
    * Precision vs Accuracy: precision is how close repeated measurements are to each other and accuracy is how close measurements are to the true value. Often, there's a trade-off between these two.
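
The sampling, quantization, and encoding steps described above can be sketched in a few lines. This is a minimal illustration rather than how any particular sensor implements it: `analog_to_digital` is a hypothetical helper, and the 1 Hz sine wave, 8 Hz sample rate, and 3-bit resolution are assumed values.

```python
import numpy as np

def analog_to_digital(signal_fn, duration_s, sample_rate_hz, n_bits, v_min, v_max):
    """Sample, quantize, and encode a continuous signal (hypothetical helper)."""
    # Sampling: evaluate the signal at evenly spaced times
    n_samples = int(round(duration_s * sample_rate_hz))
    times = np.arange(n_samples) / sample_rate_hz
    samples = signal_fn(times)

    # Quantization: snap each sample to the nearest of 2**n_bits evenly spaced levels
    n_levels = 2 ** n_bits
    step = (v_max - v_min) / (n_levels - 1)
    levels = np.clip(np.round((samples - v_min) / step), 0, n_levels - 1).astype(int)

    # Encoding: write each level index as a fixed-width binary string
    codes = [format(int(level), f"0{n_bits}b") for level in levels]
    return times, levels, codes

# Assumed input: a 1 Hz sine wave between -1 and 1, sampled 8 times per second
times, levels, codes = analog_to_digital(
    lambda t: np.sin(2 * np.pi * t),
    duration_s=1.0, sample_rate_hz=8, n_bits=3, v_min=-1.0, v_max=1.0,
)
print(codes)
```

Raising `sample_rate_hz` gives more samples per second and raising `n_bits` gives more buckets, at the cost of more data to store.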

---

* Cameras capture light photons in photosites, convert them into electrical energy, and provide a 2D spatial signal
* Grayscale images are represented as 2-D matrices where each pixel value represents the light intensity (0 for black and 255 for white)
* Color images are represented as 3-D arrays with 3 color channels, where each pixel in each channel represents the color intensity (0 for black and 255 for maximum color)
* There are two main types of camera sensors:
    * CMOS (Complementary Metal-Oxide Semiconductor): Each pixel has its own circuitry and can be read independently
        * CMOS sensors are cheaper, faster, and more energy-efficient, but are noisier, struggle in low light, and suffer from rolling shutter effects (i.e., distortion in moving images)
    * CCD (Charge-Coupled Device): Capacitors store and transfer the electrical charges from each pixel on the sensor and pixels are read sequentially and sent to a single amplifier
        * CCDs have less noise, higher dynamic range, better color accuracy, and resilience to the rolling shutter effect, but are expensive, slower, and more power-hungry

---

* Ranging sensors work by emitting a wave, waiting for it to bounce back from objects, and using the time delta on the journey to determine the distance of the object (i.e., ranging). There are two main components: the transmitter that emits waves and the receiver that receives waves. Several sensors use this mechanism: LiDAR, RADAR, IR, SONAR, and Ultrasonic
* The main distinctions between different range sensors are:
    * Wave types (electromagnetic vs sound): they exhibit different properties (under different mediums)
    * Wave characteristics (frequency and wavelength): they exhibit different ranges, energy, cost, and accuracy
        * Lower frequency / longer wavelength means greater range, but lower resolution versus higher frequency / shorter wavelength

---

* An Inertial Measurement Unit (IMU) measures the direction, angle, and speed of motion
* This is typically done with:
    * Accelerometers: acceleration
    * Gyroscopes: angular motion
    * Magnetometers (optional): magnetic field
* An IMU describes a 6 Degrees of Freedom (DoF) system:
    * Translation along the three perpendicular XYZ axes
    * The rotation around the XYZ axes (pitch, roll, yaw)
    * Orientation of the sensor influences the data interpretation

---

* Button / Touch sensors detect a press, touch, or application of force (the inner circuitry changes the voltage signal on tactile engagement).
* Different types of these sensors:
    * Button sensors complete a circuit when pressed
    * Resistive touch sensors complete a circuit when touched (human skin acts as a resistor to let current flow)
    * Capacitive touch sensors detect the disruption of their electrostatic field caused by a conductive object (finger / stylus); the change in capacitance is used to determine the position

---

* Global Positioning System (GPS) sensors provide us information of our location on Earth
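* GPS satellites broadcast radio signals that the receiver picks up (radio waves are used since they can cover long distances); the signal travel time gives the distance to each satellite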
    * (186,000 mi/sec) x (signal travel time in seconds) = distance of the satellite to the receiver in miles
* Using 4 satellites and 3D Trilateration, an exact $(x, y, z)$ coordinate can be derived
    * Each satellite distance defines a sphere, and the overlap of the spheres produces a system of equations - this system can be expressed as a matrix-vector problem, allowing for an efficient solution (e.g., with NumPy)
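
One way to see how trilateration becomes a matrix-vector problem: each satellite $i$ at position $p_i$ with measured distance $r_i$ gives a sphere equation $\lVert x - p_i \rVert^2 = r_i^2$. Subtracting the first equation from the other three cancels the $\lVert x \rVert^2$ term and leaves a 3 by 3 linear system. The sketch below assumes perfect distance measurements and ignores receiver clock error, which real GPS receivers also have to estimate; the helper name `trilaterate` and the satellite coordinates are made up for illustration.

```python
import numpy as np

def trilaterate(satellites, distances):
    """Solve for (x, y, z) from 4 satellite positions and distances (hypothetical helper)."""
    p = np.asarray(satellites, dtype=float)
    r = np.asarray(distances, dtype=float)
    # Subtracting sphere 1 from spheres 2..4 gives:
    #   2 (p_i - p_1) . x = |p_i|^2 - |p_1|^2 - r_i^2 + r_1^2
    A = 2.0 * (p[1:] - p[0])
    b = (np.sum(p[1:] ** 2, axis=1) - np.sum(p[0] ** 2)) - (r[1:] ** 2 - r[0] ** 2)
    return np.linalg.solve(A, b)

# Made-up geometry in arbitrary units (not real satellite positions)
satellites = np.array([[0.0, 0.0, 0.0],
                       [10.0, 0.0, 0.0],
                       [0.0, 10.0, 0.0],
                       [0.0, 0.0, 10.0]])
true_position = np.array([1.0, 2.0, 3.0])

# In practice each distance would come from (speed of light) x (signal travel time)
distances = np.linalg.norm(satellites - true_position, axis=1)

estimate = trilaterate(satellites, distances)
print(estimate)   # approximately [1. 2. 3.]
```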

---

* A common type of temperature sensor works by using a bimetallic strip
* Changes in temperature cause the two metals to expand or contract at different rates due to their distinct coefficients of thermal expansion
* The difference in expansion bends the strip, and the amount of bending allows us to determine the temperature
* Used in thermostats, thermometers, appliances like ovens, irons, and steamers, HVAC systems, biometric wearables, robots, vehicles, and more

---

* Audio sensors (microphones) capture noise and sound
* Sound waves cause vibrations on the diaphragm
* The vibrations on the diaphragm are captured in one of several ways:
    * Electromagnetic (Dynamic): coil and magnet
    * Piezoelectric Microphone: mechanical stress applied to a piezoelectric material
    * Condenser Microphone: delta in capacitance between the diaphragm and backplate (needs external polarization voltage)
    * Electret Microphone: uses an electret, a permanently charged material, to polarize the diaphragm and backplate
* This converts sound energy into electrical energy
* Amplification is performed to strengthen the signal

---

* Sensor fusion refers to using multiple and / or different sensors to better capture data about the environment
* Use multiple of the same sensors for reliability / maintainability
    * Sensor Averaging
    * Odd One Out
* Use different sensors to capture different measurands
* Many devices leverage multiple sensors to achieve their goals
* A key example we covered in class was with detecting the height of different objects from a drone using LIDAR, GPS, and IMU sensors
    * We use the fact that for a LIDAR sensor, we can get the distance between the sensor and the object with $d = \frac{ct}{2}$ where $d$ is distance, $c$ is the speed of light, since LIDAR shoots light waves, and $t$ is time. We divide by 2 because the light wave had to go and come back (so we need to correct by a half)
    * Then we combined the height of the drone from the GPS coordinate to determine the height of the object the light wave hit from the drone
    * And finally, we factored in the orientation of the drone itself from the IMU sensor, which gave us the angle of the drone; we can model the scenario with the trigonometric properties of a triangle
    * It is recommended to review the problem set and midterm exam solution for these calculations
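
The drone example above can be sketched as follows. The geometry here is an assumption made for illustration and may differ from the problem set: the GPS altitude is measured above the same flat ground the object sits on, and the IMU reports how far the LIDAR beam is tilted away from straight down. The function names are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def lidar_distance(round_trip_time_s):
    """d = c * t / 2: the pulse travels to the object and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0

def object_height(drone_altitude_m, round_trip_time_s, tilt_from_vertical_rad):
    """Height of the point the beam hit, under the assumed geometry."""
    slant_distance = lidar_distance(round_trip_time_s)
    # The beam is the hypotenuse of a right triangle; its vertical leg is d * cos(tilt)
    vertical_drop = slant_distance * math.cos(tilt_from_vertical_rad)
    return drone_altitude_m - vertical_drop

# Assumed scenario: drone at 100 m with the beam tilted 30 degrees from vertical
tilt = math.radians(30.0)
round_trip_time = 2.0 * (80.0 / math.cos(tilt)) / SPEED_OF_LIGHT_M_PER_S
print(object_height(100.0, round_trip_time, tilt))   # close to 20 m
```

With zero tilt the cosine is 1, and the calculation reduces to subtracting the LIDAR distance from the drone altitude.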

# Week 4: Wheeled Robots & Odometry

* Wheeled robots are robots with wheels; some examples are: autonomous vehicles, warehouse robots, and robots in space (e.g., Mars Rover)
* There are different types of wheeled robots:
    * Differential Drive Robots: robots with two independently driven wheels, one on each side; very simple to set up and can typically make sharp turns, but at the cost of stability in straight-line motion
    * Car-type Drive Robots: have four wheels and are configured like cars; also simple to set up and allow for more configurations (i.e., front-wheel drive, rear-wheel drive, and all-wheel drive), but are more complex to model than differential drive robots
    * Dual Differential Drive Robots: similar to differential drive robots, but the wheels are connected with a gear mechanism to enable better straight-line motion
    * Synchro Drive Robots: have two motors connected to all the wheels; one motor is responsible for turning all the wheels and the other is responsible for driving all the wheels forward
    * Skid-steer Drive Robots: instead of wheels, they use tracks and rely on skidding to move around; advantageous on rough or uneven terrain
    * Articulated Drive Robots: have a pivot point in the chassis that allows for greater maneuverability; useful for long robots that need to make sharp turns
    * Pivot Drive Robots: have a four-wheeled chassis with non-pivoting wheels and a rotating platform which can be raised or lowered
* Different types of wheeled robots have their own trade-offs, which also depend on the application and environment

---

* Odometry is estimating motion, orientation, and position using data from a variety of sensors - one way of doing this is with wheel encoders
    * Odometry achieved by wheel encoders is often referred to as wheel odometry
* There are [different types of encoders](https://medium.com/@nahmed3536/rotary-encoders-for-odometry-713551a705e6)
    * We looked at optical encoders in class, which use slits on a disk and light passing through the slits to determine how many rotations have occurred
    * Using properties of the encoders, like the number of slits, circumference, etc., we can relate the encoder data to the amount of distance traveled by the wheels

---

* After we have information about how much the wheels traveled, we can make a wheel odometry model that lets us track the motion of the robot
* We focused on building a wheel odometry model for differential drive robots (due to their simplicity); here's the full derivation: [Wheel Odometry Model for Differential Drive Robotics](https://medium.com/@nahmed3536/wheel-odometry-model-for-differential-drive-robotics-91b85a012299)
    * The process in which we derived the model was using the fact that the motion between two time steps can be modeled as going along some curve
    * Using the arc formula and various geometric properties, we were able to derive specific quantities like how much distance was traveled by the robot, the change in orientation, and absolute position and angle (in reference to some coordinate system)
    * We can follow a similar process for other types of wheeled robots, but it might be more complex; additionally, our model is simplistic since we treat the robot as a massless point, so we can further expand our model by factoring in real-world constraints / the design of the robot itself

---

From [Wheel Odometry Model for Differential Drive Robotics](https://medium.com/@nahmed3536/wheel-odometry-model-for-differential-drive-robotics-91b85a012299), the key results are:

<br>

\begin{align}
d_t &= \frac{d_{L, t}+d_{R, t}}{2}\\
\Delta \theta_t &= \frac{d_{R, t}-d_{L, t}}{2d_w}\\\\
x_t &= x_{t-1} + d_t \cos \Bigg(\theta_{t-1} + \frac{\Delta \theta_t}{2} \Bigg) \\
y_t &= y_{t-1} + d_t \sin \Bigg(\theta_{t-1} + \frac{\Delta \theta_t}{2} \Bigg) \\
\theta_t &= \theta_{t-1} + \Delta \theta_t
\end{align}

where:

* $t$ represents the timestamp
* $d_t$ represents the distance traveled by the robot (reference point) at time $t$
* $d_{L, t}$ represents the distance traveled by the left wheel as measured by the wheel encoder at time $t$
* $d_{R, t}$ represents the distance traveled by the right wheel as measured by the wheel encoder at time $t$
* $\Delta \theta_t$ represents the angle of rotation the robot performed / the angle of the arc of motion of the robot at time $t$
* $d_{w}$ represents the distance from the robot reference point to each of the right and left wheels (assuming the reference point is equidistant between the two wheels)
* $x_t$ represents the $x$-coordinate of the robot (reference point) based on a set coordinate plane at time $t$
* $y_t$ represents the $y$-coordinate of the robot (reference point) based on a set coordinate plane at time $t$
* $\theta_t$ represents the orientation / angle from the positive $x$-axis of the robot (reference point) based on a set coordinate plane at time $t$
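
The five equations above translate almost line for line into code. The sketch below is a minimal version of one update step; the function name `odometry_step` and the example wheel distances are assumptions for illustration, while the variable names mirror the symbols defined above.

```python
import math

def odometry_step(x, y, theta, d_left, d_right, d_w):
    """One wheel odometry update for a differential drive robot."""
    d = (d_left + d_right) / 2.0                    # distance traveled by the reference point
    delta_theta = (d_right - d_left) / (2.0 * d_w)  # change in orientation
    x_new = x + d * math.cos(theta + delta_theta / 2.0)
    y_new = y + d * math.sin(theta + delta_theta / 2.0)
    theta_new = theta + delta_theta
    return x_new, y_new, theta_new

# Both wheels travel 1 m: the robot moves straight along its current heading
print(odometry_step(0.0, 0.0, 0.0, d_left=1.0, d_right=1.0, d_w=0.5))   # (1.0, 0.0, 0.0)

# Wheels travel equal and opposite distances: the robot spins in place
x, y, theta = odometry_step(0.0, 0.0, 0.0, d_left=-math.pi / 4, d_right=math.pi / 4, d_w=0.5)
print(x, y, theta)   # position unchanged, theta is pi / 2
```

Calling the function repeatedly, feeding each output pose back in as the next input, tracks the robot over time.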



# Week 5: Computer Vision for Edge & Corner Detection

* Computer Vision is the field that looks at computational techniques to work with images, videos, and other visual data; there are broadly two main categories of techniques:
    * Computational-based or classical methods (we create algorithms for images)
    * Machine Learning based methods (we design models to learn from visual data)
* Core to computer vision is the image
    * An image is a matrix of numbers. Each number is a pixel value. A pixel value is the light (in the case of grayscale) or color intensity (3 colors, red, green, and blue, in the case of color images), ranging from 0 to 255.
* A key technique for images is filtering, and in particular linear filtering, also called convolution
* Convolution involves sliding a kernel (also called mask or filter) across the image, multiplying the numbers in the image by the numbers in the kernel that overlap at corresponding positions, and summing to get the new value of the center pixel
    * For example, a mean filter, whose kernel is all `1`s (normalized by the number of entries), averages each pixel with its neighbors and produces a mean blur
* A key kernel used in many computer vision tasks is the Gaussian kernel, based on the 2-D Gaussian function. In essence, the Gaussian kernel puts more weight in the middle and exponentially less weight on the parts further from the center of the kernel, allowing us to average pixels in a weighted fashion
    * Directly, it can produce a blur or smoothing effect
    * Combined with other techniques, it can be used to extract key features (such as sharpening the image by removing a blurred version of the image from the original image)
* Convolutions with Gaussian Kernels also play a key role in Edge Detection (they can be used to do low-pass filtering and extract the areas of contrast via sharpening)
* Another key idea in classical computer vision is gradients, which refer to the direction and amount of change in pixel values in an image
    * We analyze gradients in the $x$ and $y$ direction by looking at the delta between adjacent pixel values - this approximates the magnitude of change in those respective directions (key word here is approximates since our image is discrete and we can't take a true derivative or gradient of a discrete function)
    * Using trigonometry, we can then calculate the magnitude ($\sqrt{G_x^2 + G_y^2}$) and orientation ($\arctan(G_y / G_x)$) of the gradients from the $x$ and $y$ components $G_x$ and $G_y$
* In extracting the gradients, we often apply a Gaussian kernel first to smooth the image as noise can negatively affect our gradient calculations
    * This results in kernels such as the Sobel, Roberts, or Prewitt kernels, which can be used to find gradients along the $x$ and $y$ directions
* Combining convolution, the Gaussian kernel, and gradients, we can implement the Canny Edge Detection algorithm which allows us to extract edges from a grayscale image
    * Step 1: calculate smoothed gradient magnitude and orientation (i.e., apply Gaussian kernel convolution and calculate gradients)
    * Step 2: perform non-maximum suppression where gradients are filtered if there's an adjacent pixel whose gradient is higher and pointing in the same direction
    * Step 3: use two-stage thresholding (low and high values) to filter the remaining edge candidates (use the high threshold to start edge curves and the low threshold to continue them)
    * The parameters of the Canny Edge Detection algorithm are the level of blur from the Gaussian kernel (less blur, finer edges) and the thresholds (which determine how many edges to keep)
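
To make the sliding-kernel idea and the gradient calculation concrete, here is a small NumPy sketch. It follows the description above and multiplies entries at corresponding positions, i.e. without the kernel flip used in the textbook definition of convolution (for symmetric kernels such as the mean or Gaussian kernel the two give the same result). The helper `filter2d` and the tiny test image are made up for illustration, and no padding is used, so the output is smaller than the input.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over `image`, multiply overlapping entries, and sum (no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

mean_kernel = np.ones((3, 3)) / 9.0            # mean blur
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])         # change along x (columns)
sobel_y = sobel_x.T                            # change along y (rows)

# Toy grayscale image: dark on the left, bright on the right (a vertical edge)
image = np.zeros((5, 6))
image[:, 3:] = 255.0

blurred = filter2d(image, mean_kernel)
gx = filter2d(image, sobel_x)
gy = filter2d(image, sobel_y)
magnitude = np.hypot(gx, gy)                   # sqrt(gx^2 + gy^2)
orientation = np.arctan2(gy, gx)               # angle of the gradient in radians

print(magnitude)
```

The magnitude is large only in the columns that straddle the dark-to-bright boundary, which is the kind of signal the Canny steps above go on to thin and threshold.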

# Week 6: Computer Vision for Feature Extraction & Matching

* Image feature extraction refers to generating keypoints with descriptors that capture important aspects of an image (used for image stitching, SLAM, object detection, and more)
    * Key attributes of image features: locality, quantity, distinctiveness, and efficiency
    * Image features should be unusual or unique (lead to unambiguous matches in other images)
* Corners are good candidates for image features because they represent changes in two directions which can be very distinctive (compared to flat regions or simple edges)
* Harris Corner Detection detects corners by shifting a window around and measuring how much difference there is when the window is shifted (the larger the difference in pixel values, the greater the chance there is a corner)
    * Of course, sliding a window is very computationally expensive so we can represent this problem as linear algebra for greater efficiency
    * Key insight is that the gradients at a corner represent sizeable changes in two directions, so the patterns of the gradients would indicate a big spread in two directions
    * We can reformulate this problem as the magnitudes of the eigenvalues of an $H$ matrix which contains information about the gradients
    * If the eigenvalues are above some threshold, we deem that region a corner
    * We combine with non-maximum suppression and thresholding to further filter for strong corners
* For image features to be useful, they should exhibit:
    * Invariance: when the image is transformed, the corner locations do not change
    * Equivariance: if we have two transformed versions of the same image, features should be detected in corresponding locations
* Corners are equivariant to image translation (sliding objects around the image), equivariant to image rotation (rotating objects), partially invariant to affine intensity change (i.e., changes in light intensities), but neither invariant nor equivariant to scaling (i.e., making images bigger)
    * The reason for this is that as the image gets bigger, the window used to determine whether a region is a corner doesn't capture enough information
    * The solution is to look at the image at different scales to find all the corner candidates - that's the idea behind Scale Invariant Feature Transform!
* Scale Invariant Feature Transform (SIFT) aims to generate features using Gaussian pyramids, Difference of Gaussians (DoGs), and other tricks to find and describe important features in an image. The algorithm is as follows:
    * Step 1: Apply a series of Gaussian blurs at different scales (i.e., image sizes) and strengths (i.e., how strong the blur is)
        * Intuition: important image features will remain even after you blur and make the image smaller
    * Step 2: Take the difference between the Gaussian blurred images
        * Intuition: by taking the difference, you highlight the information that is unique between each image you generated in step 1
    * Step 3: Find local extrema across scales and strengths by checking whether a pixel is the maximal or minimal value among its 26 neighbors (8 in the same image plus 9 in each of the two adjacent images)
        * Intuition: by finding the maximal and minimal pixel across scales and strengths, you identify the most interesting points in the image
    * Step 4: filter based on threshold
        * Intuition: remove the weak image features
    * Step 5: create unique descriptors for all remaining points by constructing (normalized) histograms of the gradients for the region encompassing the interest point
        * Intuition: this creates unique and distinctive descriptors that are robust to many common image transformations (scaling, translation, rotation, lighting, etc.)
* SIFT is used to compare different image features, perform image search (i.e., match different patterns of different images), and perform image stitching
    * Note: there are other image feature algorithms that attempt to do similar things such as Histogram of Oriented Gradients (HOG), Fast Retina Keypoint (FREAK), and Learned Invariant Feature Transform (LIFT)
* For using SIFT for image stitching, we often use Random sample consensus, or RANSAC, which is an iterative method for estimating a mathematical model from a data set that contains outliers (since SIFT can often have many outlier or irrelevant points when performing image stitching). The general algorithm is as follows:
    * Step 1: Select a subsample of the data (typically the subset will have the minimum number of points needed for making a model)
    * Step 2: Fit a model (i.e., stitch the image)
    * Step 3: Count how many of the data points are within some threshold of the model (called inliers)
    * Step 4: Repeat steps 1-3 for a prescribed number of times and keep the model that has the largest number of inliers
* RANSAC is simple, effective, and general, but it can be slow and sensitive to hyperparameters (initial settings), and there are often better approaches than a brute force combination checker
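
The four RANSAC steps can be sketched on a simpler problem than image stitching. The example below swaps in a 2D line fit as the model (two points are the minimum needed to define a line), whereas stitching would fit an image transformation to matched keypoints. The function name, the threshold, the vertical-distance inlier test, and the data are all assumptions for illustration.

```python
import numpy as np

def ransac_line(points, n_iterations=100, threshold=0.1, seed=0):
    """Fit y = slope * x + intercept to points that contain outliers."""
    rng = np.random.default_rng(seed)
    best_model, best_count = None, -1
    for _ in range(n_iterations):
        # Step 1: pick the minimum number of points needed for the model
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # a vertical line cannot be written as y = slope * x + intercept
        # Step 2: fit the model to the sample
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Step 3: count points within the threshold (vertical distance to the line)
        residuals = np.abs(points[:, 1] - (slope * points[:, 0] + intercept))
        count = int(np.sum(residuals < threshold))
        # Step 4: keep the model with the most inliers
        if count > best_count:
            best_model, best_count = (slope, intercept), count
    return best_model, best_count

# Made-up data: 20 points exactly on y = 2x + 1, plus 5 outliers far from the line
x = np.arange(20, dtype=float)
inlier_points = np.column_stack([x, 2.0 * x + 1.0])
outlier_points = np.array([[2.5, 30.0], [7.5, -20.0], [11.5, 40.0], [15.5, -5.0], [18.5, 0.0]])
points = np.vstack([inlier_points, outlier_points])

(slope, intercept), n_inliers = ransac_line(points)
print(slope, intercept, n_inliers)
```

A least-squares fit over all 25 points would be pulled toward the outliers, while the model kept here is the one that agrees with the largest number of points.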

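
The eigenvalue view of Harris Corner Detection described earlier in this section can also be sketched in NumPy. This is a simplified illustration rather than the full detector: it sums gradient products over a plain 3 by 3 window to build the $H$ matrix and scores each pixel by the smaller of the two eigenvalues, whereas Harris's original formulation uses a score based on the determinant and trace of $H$. The function name and toy image are made up.

```python
import numpy as np

def corner_response(image, window=3):
    """Smaller eigenvalue of the 2x2 gradient matrix H in a window around each pixel."""
    iy, ix = np.gradient(image.astype(float))   # discrete gradients along rows (y) and columns (x)
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy
    pad = window // 2
    rows, cols = image.shape
    response = np.zeros((rows, cols))
    for r in range(pad, rows - pad):
        for c in range(pad, cols - pad):
            win = (slice(r - pad, r + pad + 1), slice(c - pad, c + pad + 1))
            h_matrix = np.array([[ixx[win].sum(), ixy[win].sum()],
                                 [ixy[win].sum(), iyy[win].sum()]])
            # eigvalsh returns eigenvalues in ascending order; a corner needs both to be large
            response[r, c] = np.linalg.eigvalsh(h_matrix)[0]
    return response

# Toy image: a bright square on a dark background
image = np.zeros((15, 15))
image[4:11, 4:11] = 1.0

response = corner_response(image)
corner, edge, flat = response[4, 4], response[4, 7], response[1, 1]
print(corner, edge, flat)
```

At the square's corner both eigenvalues are sizeable, along an edge one of them is zero, and in a flat region both are zero, which matches the intuition that corners change in two directions.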

# Week 7: Deep Learning Based Computer Vision with Neural Networks

* Machine Learning for Computer Vision refers to using, typically, neural network models for vision problems
    * Popularized after the ImageNet project collected a large amount of labelled image data and a CNN model, AlexNet, beat classical computer vision algorithms on benchmarks by 10 percentage points in 2012. Since then, neural network based approaches have dominated the Computer Vision space
* Focusing on the problem of image classification, we want to create a model, in this case a neural network, that takes in an image and outputs a label
* A neural network model is a model loosely inspired by the brain and how neurons communicate with one another
    * Information passes through a network of neurons where each neuron analyzes the signals and determines which signals are best to send to the next set of neurons
    * In a neural network model, each neuron takes in inputs, calculates a weighted sum (where each weight represents how important the corresponding input is), and then applies a non-linear function to transform the inputs into a new feature for the next set of neurons to use. This process repeats until the network reaches the final set of neurons and gets the prediction
    * The goal of training a neural network is to determine the best weights for each neuron by iteratively selecting weights that minimize the error or loss of the model
* The general training algorithm for a neural network model is:
    * Step 1: Determine the Neural Network Architecture
        * This is done via experimentation, seeing what others have done, or using previous experience. Typically, bigger models do better but at the risk of overfitting and being more memory intensive. If you can use a smaller model and get the same performance, you generally should (easier to deploy)
    * Step 2: Initialize with random weights
        * The weights are unknown and need to be learned. We usually set them randomly or can use weight initialization strategies (depending on the current research or literature)
    * Step 3: Do a Forward Pass (pass inputs into the model and evaluate how well it did on the loss function)
        * We want to pass data through the model and see how well it did, as measured by some meaningful function that tells us the ability of the model (that function is usually decided based on the task at hand and literature / research). Initially, the model will get a high loss because it doesn't know anything, but over time it should improve on the loss metric (i.e., the loss should get smaller)
    * Step 4: Do a Backward Pass (update the weights via backpropagation with gradient descent)
        * For each weight value, we can wiggle the value a little and observe a change in our loss. If we wiggle all the weights such that the loss is lowered, that is ideal. We basically do this for all the weights in a computationally efficient manner using an algorithm called backpropagation with gradient descent (the details of which involve multivariable calculus and linear algebra to fully appreciate)
    * Step 5: Repeat steps 3-4 until end (some number of epochs)
        * We keep training over and over again until we get a model with good loss or until we no longer can train (run out of time or compute resources)
* For image data in particular, if we want to pass it into a neural network, we need to flatten the image down to 1 dimension by taking the rows or columns and stacking them together (this is because a standard neural network expects a 1D vector of features)
* Fortunately for us, many of the details of a neural network are abstracted away thanks to libraries like PyTorch and TensorFlow, where we can focus on data preparation, model architecture, and training (as opposed to building the nitty gritty from scratch)
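
To show the training loop without a deep learning library, the sketch below trains the smallest possible "network": a single neuron (weighted sum plus a sigmoid non-linearity) using gradient descent. The toy OR dataset, the learning rate, the number of epochs, and the cross-entropy loss are assumptions chosen for illustration; a real image classifier would have many neurons and layers and would rely on backpropagation from a library such as PyTorch or TensorFlow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y):
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy dataset (assumed): learn the logical OR of two inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

# Step 1: architecture is one neuron with two inputs
# Step 2: initialize with (small) random weights
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=2)
bias = 0.0
learning_rate = 0.5

losses = []
for epoch in range(2000):                 # Step 5: repeat for some number of epochs
    # Step 3: forward pass and loss
    p = sigmoid(X @ weights + bias)
    losses.append(cross_entropy(p, y))
    # Step 4: backward pass; for sigmoid + cross-entropy the gradient
    # of the loss with respect to the weighted sum is (p - y)
    error = p - y
    grad_w = X.T @ error / len(y)
    grad_b = np.mean(error)
    weights -= learning_rate * grad_w     # gradient descent update
    bias -= learning_rate * grad_b

predictions = (sigmoid(X @ weights + bias) > 0.5).astype(int)
print(losses[0], losses[-1])
print(predictions)
```

The loss printed at the end is lower than at the start, which is the behaviour step 3 describes: the model starts out knowing nothing and improves as the weights are updated.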

---

* In a standard neural network, an image is flatten to 1-dimension and then passed into the network
    * In this approach, each pixel value is treated independently as we assign an independent weight to each pixel value
* However images are spatial in nature and there are patterns that in 2-dimensional space that can be useful for the model to leverage
* We can use a key idea from classical computer vision: convolutions
    * Convolutions apply a filter or kernel accross an image to extract important characteristics (like a Gaussian kernel to blur or Sobel kernel for edges)
    * In classical computer vision, we determine or derive the kernels manually but in a deep learning approach, the values in the kernels are learned from training (i.e., they become weights that the model learns)
* This gives us the convolutional neural network, which follows the same training algorithm as standard neural networks but now includes convolutional layers at the beginning of the network
    * The image goes through the convolution layer where a bunch of filters are applied. This results in a bunch of filtered images. Often, we apply a max pooling layer (i.e., condense the images down to smaller regions by selecting the highest pixel values in a small local window) after this to reduce the size of the resulting images, as there's a lot of new information that results from convolution
    * This process of convolution and pooling repeats for as many convolutional layers as we choose to include; we then flatten all the resulting images and pass them through a standard neural network
    * It has been shown that these convolutional layers act as feature extractors and identify key properties of the images (like corners, edges, shapes, colors, etc.) that can be used by the standard or feed-forward network to make better, more spatially aware decisions
* In general, convolutional neural networks (CNNs) are a powerful extension to standard neural networks that enable better results on image data (thus becoming industry standard)
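
To make the convolution and max pooling steps concrete, here is a small plain-Python sketch. It assumes a hand-picked Sobel kernel (in a CNN the kernel values would be learned), no padding, and a stride of 1; the 6x6 image and the function names are hypothetical. Like most deep learning libraries, it slides the kernel without flipping it, which is technically cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def max_pool(image, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(image[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(image[0]) - size + 1, size)
        ]
        for r in range(0, len(image) - size + 1, size)
    ]


# Hypothetical 6x6 image: dark left half, bright right half (a vertical edge)
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to vertical edges

features = convolve2d(image, sobel_x)  # 4x4 filtered image
pooled = max_pool(features)            # 2x2 after pooling
flat = [value for row in pooled for value in row]  # input to the dense layers
print(features[0], pooled, flat)
```

The filtered image is non-zero only where the window straddles the edge, which is the sense in which a convolutional layer acts as a feature extractor.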

---

* Since neural networks for computer vision are very popular, there are now many well-performing models publicly available. As opposed to training from scratch, which can be cost and time intensive, we can start from an existing model and make it better (in a process called transfer learning)
    * The idea is that a model trained on a general task is further trained with more data or examples on a new task, thus transferring its knowledge from the general domain to the specialized domain
    * Transfer learning can work by introducing a new dataset to train on, adding new layers to the end of the network, and/or allowing the original model weights to change (i.e., keeping the base model weights frozen vs unfrozen)



# Week 8: Spring Break

No content from this week.

# Week 9: Probabilistic State Estimation

* The process of autonomy can be viewed as collecting data from the physical world / environment, analyzing the data, and then making some decisions
    * The wheel odometry model is a good example of where we collected data from the wheel encoders, analyzed it via our model, and then made some determination of where the robot was and how it moved
* Since the world is inherently noisy (i.e., there is some level of randomness) and sensors can have measurement error (i.e., they will not be accurate 100% of the time), we need to account for this uncertainty, hence we need algorithms that leverage probabilistic state estimation
* To appreciate probabilistic state estimation, we need to be comfortable with probability theory:
    * A sample space is the set of possible outcomes (e.g., the sample space for a 6-sided die roll is: 1, 2, 3, 4, 5, 6)
    * An event is some subset of the sample space (e.g., rolling an even number or a number greater than 5)
    * Probability describes the likelihood of an event occurring
        * When the sample space is discrete (e.g., a finite number of outcomes, like rolling a die), we can add up the probabilities of the individual outcomes to find the probability of some set of outcomes
        * When the sample space is continuous (i.e., an infinite number of outcomes, like the time spent waiting for the bus), we need to work with a probability density function; the probability of landing in some range is the area under the curve over that range
    * Joint probability is the probability of multiple events occurring together
        * This is often expressed as $P(x, y)$ which means probability of $x$ and $y$ happening.
    * Conditional probability is the probability of an event(s) occurring given some other event(s)
        * This is often expressed as $P(x | y)$ which means probability of $x$ happening given $y$ happened.
    * Independent events are when different events don't affect one another
        * This means $P(x | y) = P(x)$
    * From the definitions of joint and conditional probability, we have Bayes' rule, which says $P(x | y) = \frac{P(y | x)P(x)}{P(y)}$
        * This formula is useful since it might be hard to solve for $P(x | y)$ directly, but $P(y|x)$, $P(x)$, and $P(y)$ are easier to find
    * Expectation, intuitively, is the weighted average for a distribution
    * Variance is the spread of the distribution and standard deviation is the square root of variance (this is useful because now the standard deviation is the same units as the original distribution)
    * Expectation and variance are useful properties to understand a distribution and its shape (think of it as a way of summarizing a distribution)
    * In probability, there are an infinite number of distributions, but some are more useful and common than others, one of which is the normal distribution
    * The normal distribution has a number of nice properties (e.g., the sum of two independent normally distributed variables is also normally distributed) and appears in a number of scenarios (see [Why is Everything Normal](https://medium.com/@nahmed3536/why-is-everything-normal-2f6b0f6efd73))
* Probabilistic state estimation is estimating the next state of a system based on the previous states observed and leveraging probabilities to weigh different pieces of information (i.e., more certainty means that a piece of information will be trusted more)
* The first algorithm we looked at for probabilistic state estimation was the alpha beta filter
* When using the alpha beta filter, we have some model of the system and some measurement of the system; both the model and measurement are noisy, but by combining them together, we can get a more confident estimate of the system
* The algorithm is as follows:
    * We initialize with our initial state (could be a guess, some prediction, or perhaps the first measurement)
    * Then we make a prediction of what the next state would be using our model; a simple model may only use information about what the previous state was whereas a more complex model might use information of the past several states
    * Then we perform an update of our prediction using a measurement; we compare the difference between the measurement and prediction, which is called the residual, and adjust our prediction by some factor of the residual
        * The amount we adjust the prediction by is called the alpha term and it expresses our confidence in the measurement
    * In cases where our model is wrong, we might need to adjust the prediction even further, so we introduce a gain term which is added to the prediction and scaled residual; we can think of the gain term as a correction
        * The amount we scale the gain by is called the beta term and it expresses how much we need to correct our filter
    * We repeat this process iteratively with the next prediction and measurement
        * In the version we covered in class, the alpha and beta terms are fixed and are found experimentally
        * More complex versions of the algorithm might involve changing alpha and beta over time based on some set of conditions
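
The predict / update loop above can be sketched as follows. This is one common way to write the alpha beta filter and not necessarily the exact version from class: it assumes the gain term is the estimated rate of change per time step, which is used in the prediction and corrected by the beta-scaled residual. The function name, parameter names, and sample measurements are hypothetical.

```python
def alpha_beta_filter(measurements, x0, rate0, alpha, beta, dt=1.0):
    """Return one filtered estimate per measurement.

    x0    : initial state guess
    rate0 : initial guess of how much the state changes per unit time
    alpha : how much of the residual is applied to the state
    beta  : how much of the residual is applied to the rate (gain term)
    """
    x, rate = x0, rate0
    estimates = []
    for z in measurements:
        # Predict the next state using the model (constant rate of change)
        prediction = x + rate * dt
        # Residual: how far the measurement is from the prediction
        residual = z - prediction
        # Update: correct the rate and the state by a fraction of the residual
        rate = rate + beta * residual / dt
        x = prediction + alpha * residual
        estimates.append(x)
    return estimates


# Hypothetical noisy readings of something that grows by about 1 per step
noisy = [1.2, 1.9, 3.1, 3.8, 5.2]
print(alpha_beta_filter(noisy, x0=0.0, rate0=1.0, alpha=0.4, beta=0.1))
```

Setting `alpha=0` and `beta=0` ignores the measurements entirely and trusts the model, while `alpha=1` makes each estimate equal to the latest measurement; useful values sit in between and are tuned experimentally.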




# Week 10: Kalman Filter

* SLAM stands for simultaneous localization and mapping; it plays a key role in autonomy and is used in many applications related to robotics, autonomous vehicles, and drones
* Broadly speaking, we can break down the SLAM space into three categories:
    * Filter-based SLAM with Kalman Filters & Particle Filters: treats the problem as state estimation, refining the estimate as new measurements arrive
    * Graph-based SLAM: formulates the problem as graph optimization
    * Deep Learning SLAM: uses neural networks for autonomy, although this approach is still very nascent

---

* The Kalman filter is an iterative, probabilistic state estimation algorithm that combines predictions from a model of the system and measurements to generate estimates weighing in the probability distribution of the predictions and measurements
    * Intuitively, think of it as an extension of the alpha beta filter, but instead of fixed alpha and beta terms, we're now using the probability distributions surrounding the model prediction and the measurement
* More formally, the Kalman filter is described by the following equations (note we exclude the time subscript for simplicity):

Prediction:
\begin{align}
\mathbf{\bar{x}} &= \mathbf{Fx} + \mathbf{Bu} \\
\mathbf{\bar{P}} &= \mathbf{FPF}^T + \mathbf{Q}
\end{align}

Update:
\begin{align}
\mathbf{y} &= \mathbf{z} - \mathbf{H\bar{x}}\\
\mathbf{S} &= \mathbf{H}\mathbf{\bar{P}}\mathbf{H}^T+\mathbf{R}\\
\mathbf{K} &= \mathbf{\bar{P}H}^T\mathbf{S}^{-1}\\
\mathbf{x} &= \mathbf{\bar{x}} + \mathbf{Ky}\\
\mathbf{P} &= (\mathbf{I} - \mathbf{KH})\mathbf{\bar{P}}
\end{align}

where:
- $\textbf{x}$: state
- $\textbf{P}$: state noise (covariance of the state estimate)
- $\bar{\textbf{x}}$: prediction
- $\bar{\textbf{P}}$: prediction noise (covariance of the prediction)
- $\textbf{F}$: state transition function
- $\textbf{Q}$: process noise
- $\textbf{B}$: control function
- $\textbf{u}$: control input
- $\textbf{z}$: measurement
- $\textbf{H}$: measurement function
- $\textbf{R}$: measurement noise
- $\textbf{y}$: residual
- $\textbf{S}$: system / residual uncertainty
- $\textbf{K}$: Kalman gain
- $\textbf{I}$: identity matrix
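
As a rough illustration of the equations above, here is a scalar (one-dimensional) sketch where every matrix collapses to a single number, so no linear algebra library is needed. The variable names mirror the symbols in the equations; the default values, the function name, and the sample measurements are assumptions made for the example, not values from the course tutorial.

```python
def kalman_1d(measurements, x0, P0, Q, R, F=1.0, H=1.0, B=0.0, u=0.0):
    """Scalar Kalman filter; returns a list of (x, P) after each update."""
    x, P = x0, P0
    history = []
    for z in measurements:
        # Prediction
        x_bar = F * x + B * u
        P_bar = F * P * F + Q
        # Update
        y = z - H * x_bar        # residual
        S = H * P_bar * H + R    # residual uncertainty
        K = P_bar * H / S        # Kalman gain (S^-1 is just 1/S here)
        x = x_bar + K * y
        P = (1 - K * H) * P_bar
        history.append((x, P))
    return history


# Hypothetical readings from a sensor observing a constant value near 5
readings = [5.3, 4.8, 5.1, 4.9, 5.2]
for x, P in kalman_1d(readings, x0=0.0, P0=1.0, Q=0.0, R=1.0):
    print(f"estimate={x:.3f}  uncertainty={P:.3f}")
```

Unlike the alpha beta filter, the weight given to each measurement (the gain `K`) is recomputed every step from the uncertainties `P_bar` and `R`. With `Q=0` the uncertainty `P` only shrinks, which shows how poorly chosen noise values can make the filter overconfident.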

---

* To use the Kalman Filter, we need to explicitly design and define the vectors and matrices. Properly selecting what values the components of the Kalman filter should take **is not trivial** - often it can be very difficult to select the appropriate values and involves leveraging ideas from various engineering domains as well as iterating / fine tuning. It's important to select the right values for the Kalman filter to ensure it estimates correctly because the Kalman filter can confidently converge on the wrong values (i.e., be a smug filter).
* In class, we explored SLAM with the Kalman Filter for 2D Robot motion (i.e., a robot moving in 2 dimensions) where the robot is traveling at a constant speed and we can model the system dynamics as Newtonian motion.
    * Here's the [tutorial for it](https://colab.research.google.com/drive/1k3JCXWOMxItyUf5vxPUAGHajLAu7znvi?usp=sharing): you should feel comfortable with the concept and setup in this tutorial (i.e., everything makes sense when reading through it)


# Week 11: Particle Filter

* A limitation of the Kalman Filter is that it makes some assumptions about the underlying probability distributions: they have to be normally distributed
    * In many real-life applications, the underlying distribution is much more complex and doesn't follow a normal distribution
* The Particle Filter is an alternative to the Kalman Filter that can handle more complex probability distributions
    * The Particle Filter falls under the broader family of Monte Carlo Filters and Signal-Observation models
* The generic algorithm for the particle filter is:
    * Generate samples (particles) based on some initial distribution (typically a uniform or normal distribution)
    * Predict the next state of the particles based on knowledge of the system (each sample will update based on the model of the system)
    * Update the distribution based on measurements (i.e., particles that agree with the observed measurement are weighted as more likely, and the others as less likely)
    * Resample from the updated distribution and repeat
* Fundamentally, the Particle Filter is based on Bayes' rule (the predict-observe-update nature of the algorithm)
* There are many variations of the particle filter based on applications and the trade-offs between different versions
    * For example, a common problem with particle filters is confidently diverging away from the true state of the system or having one particle represent all the confidence of the filter, thus not being very expressive - in these cases, one modification that helps is a better resampling scheme that allows for greater exploration
* In class, we looked at several examples worth being familiar with:
    * A robot traversing a 1D hallway with three doors
    * A robot exploring a 2D room
    * In both examples, notice how the distribution is updated after each measurement
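
The generate / predict / update / resample loop can be sketched for a hallway-style problem as below. This is a simplified stand-in for the class example, not a reproduction of it: the hallway is assumed to be a circular strip of 10 cells, the door positions, sensor accuracy (`hit`), and motion noise are made-up values, and the robot is assumed to move one cell per step.

```python
import random
from collections import Counter

HALLWAY_LENGTH = 10
DOORS = {0, 1, 8}  # hypothetical door cells


def predict(particles, move, rng, noise=0.1):
    """Move every particle; occasionally over- or undershoot by one cell."""
    moved = []
    for p in particles:
        step = move
        if rng.random() < noise:
            step += rng.choice([-1, 1])
        moved.append((p + step) % HALLWAY_LENGTH)
    return moved


def update(particles, saw_door, hit=0.9):
    """Weight particles by how well they agree with the door sensor."""
    weights = [hit if (p in DOORS) == saw_door else 1 - hit for p in particles]
    total = sum(weights)
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Draw a new particle set in proportion to the weights."""
    return rng.choices(particles, weights=weights, k=len(particles))


def run_filter(num_particles=500, steps=12, seed=0):
    rng = random.Random(seed)
    # Generate: the robot could be anywhere, so start from a uniform spread
    particles = [rng.randrange(HALLWAY_LENGTH) for _ in range(num_particles)]
    true_pos = 0
    for _ in range(steps):
        true_pos = (true_pos + 1) % HALLWAY_LENGTH
        particles = predict(particles, 1, rng)
        weights = update(particles, true_pos in DOORS)
        particles = resample(particles, weights, rng)
    return true_pos, particles


true_pos, particles = run_filter()
estimate = Counter(particles).most_common(1)[0][0]
print(f"true position: {true_pos}, most common particle: {estimate}")
```

Because the particles are just samples, the resulting distribution can have several peaks early on (one per plausible door) before collapsing onto a single position, which a single normal distribution could not represent.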


# Week 12: Graph Based SLAM

* Kalman and Particle filters treat the SLAM problem as a state estimation problem
* Another popular approach to SLAM is a graph-based approach where a pose-graph is constructed
    * The nodes of the pose-graph are the poses of the robot
    * The edges of the pose-graph are the sensor measurements that connect or constrain poses
    * The goal is to minimize the errors in the constructed pose-graph (i.e., pose-graph optimization)
* In pose-graph optimization, we keep track of:
    * the states / poses (denoted by $x$)
        * this represents how the robot is oriented and positioned
    * the controls (denoted by $u$)
        * this represents any input provided to the robot by a user or human
    * the landmarks (denoted by $m$)
        * this represents key points or objects in the environment that are distinctive and highly recognizable, thus allowing us to relate two different poses
    * the observations (denoted by $z$)
        * this represents the readings from the sensors
* To construct the pose-graph, we add a node every time the robot enters a new state and connect it by an edge to the previous node
    * This process by itself will create a graph that looks like a line - to make the graph useful, we need to perform loop closure, which means we need to revisit a state we've already been in
    * When loop closure occurs, we can connect and merge different nodes together, thus creating a better map of the environment
    * More formally, when loop closures occur, we want to minimize the error between the two states recorded and adjust the graph accordingly
    * The exact mathematics of this optimization is not something we cover - however the concept of pose-graph optimization should be something to be familiar with
* When pose-graph optimization is performed effectively, we end up with a map of where there is free space and blocked space, in something called an occupancy grid
    * An occupancy grid is a grid of the environment where a cell can be free or blocked (in the case of a binary occupancy grid) or hold a probability of being free (in the case of a probabilistic occupancy grid)
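
Although the exact mathematics is outside the scope of the course, a toy version of the idea can still be sketched. The example below assumes a robot moving along a 1D line, so each pose is a single number, and uses plain gradient descent on the sum of squared edge errors; real pose-graph solvers work with 2D/3D poses and more sophisticated optimizers. The odometry values and function names are made up for illustration.

```python
def dead_reckoning(odometry):
    """Chain odometry measurements into poses, starting from 0."""
    poses = [0.0]
    for displacement in odometry:
        poses.append(poses[-1] + displacement)
    return poses


def graph_error(poses, edges):
    """Sum of squared differences between each edge and the poses it links."""
    return sum((poses[j] - poses[i] - z) ** 2 for i, j, z in edges)


def optimize(poses, edges, iterations=500, learning_rate=0.1):
    """Nudge the poses to reduce the graph error (pose 0 stays fixed)."""
    poses = list(poses)
    for _ in range(iterations):
        grad = [0.0] * len(poses)
        for i, j, z in edges:
            residual = poses[j] - poses[i] - z
            grad[j] += 2 * residual
            grad[i] -= 2 * residual
        for k in range(1, len(poses)):  # pose 0 anchors the map
            poses[k] -= learning_rate * grad[k]
    return poses


# The robot drives out and back; noisy odometry says it ended 0.4 from start
odometry = [1.1, 1.0, -0.9, -0.8]
initial = dead_reckoning(odometry)

# Edges are (from_node, to_node, measured displacement)
edges = [(k, k + 1, d) for k, d in enumerate(odometry)]
# Loop closure: the robot recognizes that pose 4 is the same place as pose 0
edges.append((4, 0, 0.0))

optimized = optimize(initial, edges)
print("before:", [round(p, 3) for p in initial], graph_error(initial, edges))
print("after: ", [round(p, 3) for p in optimized], graph_error(optimized, edges))
```

Without the loop closure edge the graph is a simple chain and the odometry error cannot be detected; adding it lets the optimizer spread the 0.4 discrepancy across all the edges instead of leaving it at the end of the path.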

# Week 13: Motion Planning

* After we have a map of our environment, we want to plan our motion and path
    * Motion Planning: how does a robot execute along a path?
    * Path Planning: what path does the robot take to go from point A to B?
    * Both are crucial for robotic autonomy
* Intuition for motion planning:
    * Consider moving a piano from one room to another (i.e., Piano-Mover's Problem) - there's more than just the path to consider, we need to worry about how we orient and position the piano itself
* We can formalize motion planning with C-spaces
    * A C-space (configuration space) is just a vector space where some set of vectors represents free positions and another set of vectors represents blocked space or space with an obstacle; we can view each vector as analogous to a state of the robot
    * Our goal in motion planning is to get from one free space state to another free space state without running into an obstacle (i.e., always being in free space and never entering blocked space)
        * Depending on the application, we can further extend the problem to finding a plan that meets some criteria (e.g., being efficient, shortest, least energy, etc.)
* Constructing the C-space varies based on the robot / scenario - in class we looked at a robot arm with 2-degrees of freedom with different sets of obstacles
* C-spaces are great for mathematical discussion, but they are limited when it comes to implementation since they are continuous spaces - in order to implement motion planning, we need to discretize our C-spaces
    * This involves converting our C-spaces into a Roadmap graph, which transforms a continuous vector space into a graph and allows motion planning to be treated as a graph traversal problem
    * Now the key question is: how do we construct this graph?
* One method is visibility graphs:
    * The key idea here is that in an environment with only immovable, permanent obstacles, the goal of the robot is to avoid colliding with these obstacles
    * Thus, we can construct a graph by looking at the boundaries or corners of these obstacles and connect them accordingly
    * Since every point in our graph is deliberately added such that it avoids the obstacles, our robot will always be in free space when traveling along the constructed graph
* Another method is C-space Grids
    * We take the environment and overlay a grid - each cell represents a potential node in the graph
    * If the cell is within an obstacle or too close to an obstacle, it is not added to the graph
    * The remaining cells are then connected (adjacent cells and / or corner neighboring cells are connected)
* C-space grids are denser and can allow the robot to enter more spaces, but require more memory to store and leave more options to explore when constructing a path
    * Thus, one modification to C-space grids is adaptive decomposition, where the grid is created in a recursive fashion and cell regions only divide further if needed (i.e., there is some obstacle in the region that needs to be excluded in the graph construction)
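
The C-space grid construction can be sketched as a small function that turns an occupancy-style grid into an adjacency list. This is an illustrative sketch only: it assumes a cell value of `1` means blocked and `0` means free, and it does not model the "too close to an obstacle" safety margin. The grid and the function name are hypothetical.

```python
def build_grid_graph(grid, connect_diagonals=False):
    """Return {free_cell: [neighboring free cells]} for a 0/1 grid."""
    rows, cols = len(grid), len(grid[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connect_diagonals:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    graph = {}
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                continue  # blocked cells are not added to the graph
            neighbors = []
            for dr, dc in steps:
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] == 0:
                    neighbors.append((nr, nc))
            graph[(r, c)] = neighbors
    return graph


# Hypothetical 3x3 environment with one obstacle in the middle
grid = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
graph = build_grid_graph(grid)
print(len(graph), "free cells;", "neighbors of (0, 0):", graph[(0, 0)])
```

Every free cell becomes a node, which is why memory use and the number of paths to explore grow quickly with grid resolution. Note that connecting diagonal neighbors may let a path clip the corner of an obstacle, so a real implementation would likely need an extra check there.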

# Week 14: Path Finding

* After a graph is constructed, we want to find the most efficient path on the graph (typically the shortest path or minimum weighted path)
* When it comes to graphs, there are several types:
    * Unweighted, undirected graphs: nodes are connected by edges with no weight and a connection between two nodes is bidirectional (can go from node A to node B and node B to node A)
    * Unweighted, directed graphs: nodes are connected by edges with no weight and the connection between two nodes has direction (can go from node A to node B, but not necessarily from node B to node A unless that directed edge exists)
    * Weighted graphs: these can be either undirected or directed, but essentially, the edges now have some number associated with them called their weights - these weights might represent some type of cost associated with going from one node to the next
* Typically, in path planning algorithms, we try to find the shortest path. For unweighted graphs, this is the path that visits the fewest nodes. For weighted graphs, this is the path with the least total weight.
    * As a side note, for path planning problems that aren't naturally shortest path problems, it's typically easier to convert the problem into finding the shortest path as opposed to constructing a custom, new graph algorithm
* Two naive algorithms for finding the path from a start node to an end node are breadth first search (BFS) and depth first search (DFS)
    * BFS explores all the unvisited neighbors of the current node before moving onto the next set of nodes (usually goes level by level)
    * DFS explores a single path of unvisited nodes until it can't explore any more (so it goes deep along a single path until it has to turn back)
    * BFS and DFS have their trade-offs:
        * BFS will find the shortest path in an unweighted graph (i.e., the path visiting the fewest nodes) since it goes level by level / set of neighbors by set of neighbors; however, it can be slow since it has to explore all these levels (especially in very dense graphs like a C-space grid)
        * DFS will find a path quicker (if it terminates on the first valid path it finds) but it is not guaranteed to be the shortest since it doesn't exhaustively check all paths (and if it did, it wouldn't be faster than BFS)
* Better graph traversal algorithms do exist (better in the sense that they are faster and more efficient), such as A*
    * The A* algorithm finds the lowest-cost path on a graph where the path cost is the sum of the positive costs of the individual edges
    * The A* algorithm works by using an optimistic heuristic function that estimates how far any node in the graph is from the goal node
        * Optimistic means that the function never tells us a larger value than the true / real cost (i.e., it never overestimates)
        * The heuristic needs to be fast to compute and reliable for A* to perform better
    * The heuristic function helps shortcut some of the potential exploration and helps the algorithm prioritize paths that would lead to the goal node with less cost
* In class, we looked at a simple example of using the A* algorithm with the Euclidean distance formula between nodes as the heuristic function
    * An assumption that makes this example work is that the path cost factors in more than just distance, such as energy, wear and tear, etc., so the distance function would always be smaller than the true cost in this scenario
    * It is recommended to be well-versed in the example provided in class and in how to perform A* on a new graph given a heuristic function
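
For comparison, here is a compact sketch of BFS and A* on the same small weighted graph. The graph, node coordinates, and edge costs are invented for the example (they are not the graph used in class); each edge cost is chosen to be at least the straight-line distance between its nodes so that the Euclidean heuristic stays optimistic.

```python
import heapq
import math
from collections import deque

# Hypothetical node coordinates and weighted, undirected edges
coords = {"A": (0, 0), "B": (3, 0), "C": (0, 4), "D": (3, 4)}
graph = {
    "A": {"B": 4.0, "C": 5.0},
    "B": {"A": 4.0, "D": 9.0},
    "C": {"A": 5.0, "D": 3.5},
    "D": {"B": 9.0, "C": 3.5},
}


def bfs_path(graph, start, goal):
    """Fewest-edges path; ignores edge weights entirely."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def a_star(graph, coords, start, goal):
    """Lowest-cost path using straight-line distance as the heuristic."""
    def heuristic(node):
        return math.dist(coords[node], coords[goal])

    # Each entry: (cost so far + heuristic, cost so far, node, path)
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, math.inf):
            continue  # a cheaper route to this node was already found
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                priority = new_cost + heuristic(neighbor)
                heapq.heappush(
                    frontier, (priority, new_cost, neighbor, path + [neighbor])
                )
    return None, math.inf


print("BFS:", bfs_path(graph, "A", "D"))
print("A*: ", a_star(graph, coords, "A", "D"))
```

On this graph both routes from `A` to `D` use two edges, so BFS returns the first one it discovers even though it is the more expensive of the two, whereas A* accounts for the edge costs and returns the cheaper route.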
    
    

# Week 15: Reinforcement Learning

* Reinforcement learning is a branch of artificial intelligence and machine learning that focuses on learning from states, actions, and rewards
    * The field of AI can be broken down into two components: symbolic AI, which focuses on developing algorithms to solve problems, and machine learning, which focuses on designing algorithms that learn from data to solve problems
    * Within machine learning, there are three main branches: supervised where the data has labels (i.e., we know the right or desired answer), unsupervised (i.e., we don't know the right or desired answer), and reinforcement learning (i.e., we learn by observing the environment and seeing which actions give the most reward)
* The inspiration for reinforcement learning can be traced back to Thorndike's Law of Effect and B.F. Skinner's operant conditioning
    * Thorndike's Law of Effect states that behaviors followed by pleasant or rewarding consequences are more likely to be repeated, while behaviors followed by unpleasant or punishing consequences are less likely to be repeated.
    * B.F. Skinner's operant conditioning states that reinforced behaviors will repeat (i.e., behaviors that earn a positive reward for doing them or a negative reward for not doing them) while non-reinforced behaviors die out
* The field of reinforcement learning formalizes these ideas so that we can develop and implement algorithms; the core concepts within reinforcement learning are: environment, observation, state, action, and reward.
* The environment is the world we operate in
    * For example, for a warehouse robot, the environment is the warehouse with all of its obstacles and dynamics
* Within this environment, we can make observations. Observations represent what is going on in the environment. In more formal terms, an observation captures the state of the environment!
    * Depending on our environment or ability to measure (via sensors), our observation and therefore our states might fully capture the environment, partially capture, and/or probabilistically capture
* The program, often referred to as an agent, needs to take some action based on the observation(s)
    * An action defines what the agent does
    * The crux of reinforcement learning is to learn what the best action is (i.e., what strategy or policy the agent should follow)
* Determining the policy is difficult and needs to be learned
    * In terms of what this policy can be, there is a wide variety: some are static, some are random, some are greedy (i.e., doing what's best now or in the short term), some only use the current observation (i.e., are Markovian), and some use the history of what has happened
* To ascertain the best policy, we need to have rewards which inform the agent of how good or bad their action was
    * Positive reward indicates a positive signal
    * Negative reward indicates a negative signal
    * Zero reward doesn't provide any signal (not good or bad)
* Constructing the reward scheme is important - a proper reward setup will ensure the agent learns the best or desired behavior
* In class, we looked at a maze for the robot to navigate. When the reward was negative on obstacles, positive at the goal, and zero everywhere else, the agent would learn to find the goal but not necessarily along the shortest possible path
    * When we add a small negative reward on each free cell, the robot prioritizes finding the shortest path
    * If we want the robot to visit a particular cell along the way, we can add a small positive reward there, but we need to be careful that the reward is properly set so the agent doesn't get stuck looping back there - we did this by making the positive reward smaller in magnitude than the negative reward accumulated by looping back to it
    * Alternatively, we could have changed the reward function (i.e., make it dynamic) - in this scenario, we would need to work with a different set of algorithms
    * And in settings where the reward function is too complex or unknown, we use function approximation of the reward function (popular with neural networks, hence the field of deep reinforcement learning)
* In the example in class, we discussed how the agent would go through the maze many times, observe which scenarios performed well (got the most reward), and learn from them
    * The approach would involve associating a value with each state-action pair and then picking the action with the highest value in each state (in a greedy fashion)
    * To avoid getting stuck in premature optimization, we can sometimes go in a random direction to explore new paths and adjust if our reward function is evolving over time (in a process called epsilon-greedy)
* It is recommended to be familiar with the example of the grid world covered in class
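
The state-action value idea with epsilon-greedy exploration can be sketched with tabular Q-learning. Q-learning is assumed here as a representative algorithm, and the 3x3 grid, reward values, and hyperparameters (`alpha`, `gamma`, `epsilon`) are made up for illustration rather than taken from the class example. Note that `alpha` below is the learning rate, unrelated to the alpha beta filter.

```python
import random

ROWS, COLS = 3, 3
START, GOAL = (0, 0), (2, 2)
OBSTACLES = {(1, 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for one move in the grid world."""
    d_row, d_col = ACTIONS[action]
    nxt = (state[0] + d_row, state[1] + d_col)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state, -0.1, False   # bumped a wall: stay put
    if nxt in OBSTACLES:
        return state, -5.0, False   # negative reward on obstacles
    if nxt == GOAL:
        return nxt, 10.0, True      # positive reward at the goal
    return nxt, -0.1, False         # small negative reward on free cells


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move the stored value toward reward + discounted future value
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from START until the goal."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
print(greedy_path(q_table))
```

Setting the free-cell reward to `0.0` removes the direct penalty for wandering, which is the situation described above where the agent finds the goal but not necessarily along the shortest path.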
