# What Is Computer Vision?
Computer Vision is a field in Artificial Intelligence that uses Machine Learning and Neural Networks to teach computers and systems to derive meaningful information from digital images, videos and other visual inputs - and to make recommendations or take actions when they see defects or issues.

If AI enables computers to think, Computer Vision enables them to see, observe and understand.

Computer Vision works much the same as Human Vision, except Humans have a head start. Human sight has the advantage of lifetimes of context to train how to tell objects apart, how far away they are, whether they are moving or something is wrong with an image.

Computer Vision trains Machines to perform these functions, but it must do it in much less time with cameras, data and algorithms rather than retinas, optic nerves and visual cortex. Because a system trained to inspect products or watch a production asset can analyze thousands of products or processes a minute, noticing imperceptible defects or issues, it can quickly surpass human capabilities.

Computer Vision is used in industries that range from energy and utilities to manufacturing and automotive, etc.

# How Does Computer Vision Work?
Computer Vision needs a lot of data. It runs analyses of data over and over until it discerns distinctions and ultimately recognizes images. For example, to train a computer to recognize automobile tires, it needs to be fed vast quantities of tire images and tire-related items to learn the differences and recognize a tire, especially one with no defects. Two essential technologies are used to accomplish this: a type of Machine Learning called Deep Learning and a Convolutional Neural Network (CNN).

Machine Learning uses algorithmic models that enable a computer to teach itself about the context of visual data. If enough data is fed through the model, the computer will "look" at the data and teach itself to tell one image from another. Algorithms enable the machine to learn by itself, rather than someone programming it to recognize an image.

A CNN helps a Machine Learning or a Deep Learning model "look" by breaking images down into pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical operation on 2 functions to produce a third function) and makes predictions about what it is "seeing". The Neural Network runs convolutions and checks the accuracy of its predictions in a series of iterations until the predictions start to come true. It is then recognizing or seeing images in a way similar to humans.
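To make the convolution step more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how a CNN library implements it: the function name `convolve2d`, the toy image and the hand-picked kernel are all hypothetical. Like most Deep Learning libraries, the sketch slides the kernel over the image without flipping it, which is strictly speaking cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the image patch under it and sum the products.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 grayscale image: dark (0) on the left, bright (255) on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Hand-picked kernel (an assumption) that responds to a left-to-right brightness jump.
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)  # [[0, 510, 0], [0, 510, 0], [0, 510, 0]]
```

The large values in the middle column mark where the brightness changes, i.e. the vertical edge. In a CNN the kernel values are not chosen by hand; they are learned during training.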

Much like a human making out an image at a distance, a CNN first discerns hard edges and simple shapes, then fills in the information as it runs iterations of its predictions. A CNN is used to understand single images. A Recurrent Neural Network (RNN) is used in a similar way for video applications to help computers understand how pictures in a series of frames are related to one another.

# History Of Computer Vision
Scientists and Engineers have been trying to develop ways for machines to see and understand visual data for about 60 years. Experimentation began in 1959 when neurophysiologists showed a cat an array of images, attempting to correlate a response in its brain. They discovered that it responded first to hard edges or lines, which suggested that image processing starts with simple shapes like straight edges.

At about the same time, the first computer image scanning technology was developed, enabling computers to digitize and acquire images. Another milestone was reached in 1963 when computers were able to transform 2D images into 3D forms. In the 1960s, AI emerged as an academic field of study, which also marked the beginning of the AI quest to solve the human vision problem.

1974 saw the introduction of Optical Character Recognition (OCR) technology, which could recognize text printed in any font or typeface. Similarly, Intelligent Character Recognition (ICR) could decipher hand-written text using Neural Networks. Since then, OCR and ICR have found their way into document and invoice processing, vehicle plate recognition, mobile payments, machine translation, etc.

In 1982, neuroscientist David Marr established that vision works hierarchically and introduced algorithms for machines to detect edges, corners, curves and similar basic shapes. Concurrently, computer scientist Kunihiko Fukushima developed a network of cells that could recognize patterns. The network, called the Neocognitron, included convolutional layers in a Neural Network.

By 2000, the focus of study was on object recognition, and by 2001, the first real-time face recognition applications appeared. Standardization of how visual data sets are tagged and annotated emerged through the 2000s. In 2010, the ImageNet dataset became available. It contained millions of tagged images across a thousand object classes and provided a foundation for the CNNs and Deep Learning models used today. In 2012, a team from the University of Toronto entered a CNN into an image recognition contest. The model, called AlexNet, significantly reduced the error rate for image recognition. Since this breakthrough, error rates have fallen to just a few percent.

# What Is Object Detection?
Object detection is a Computer Vision technique that identifies and locates objects of interest within an image or a video. It goes beyond simply recognizing objects (like image classification) by not only identifying the objects but also pinpointing their exact locations and boundaries within the scene.

![computer_vision_2.png](attachment:computer_vision_2.png)

### Key aspects of object detection
- Identification: Recognizing and classifying objects (e.g., cars, people, animals).
- Localization: Determining the precise location and size of each object using bounding boxes.

### Common applications of object detection
- Self-driving cars: Detecting pedestrians, vehicles and traffic signs for safe navigation.
- Surveillance systems: Monitoring for suspicious activities and identifying individuals.
- Medical imaging: Analysing medical images to detect abnormalities.
- Retail: Tracking inventory and customer behavior.
- Robotics: Enabling robots to interact with their environments.

### How does object detection work?
1. Image or video input: The system receives an image or video frame as input.
2. Feature extraction: The system analyzes the image to extract relevant features, such as edges, shapes and textures.
3. Object localization: The system identifies potential objects and draws bounding boxes around them.
4. Object classification: The system classifies each detected object into a specific category (e.g., car, person, dog).
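The output of these steps is commonly represented as a class label, a confidence score and a bounding box per object. The sketch below assumes boxes are stored as `(x_min, y_min, x_max, y_max)` in pixel coordinates and computes Intersection over Union (IoU), a widely used measure of how well two boxes overlap. IoU is not discussed above and is included only as an illustration; the `Detection` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # predicted category, e.g. "car"
    score: float  # confidence between 0.0 and 1.0
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels (assumed convention)


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (0 if the boxes do not overlap).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = Detection(label="car", score=0.9, box=(0, 0, 10, 10))
ground_truth_box = (5, 0, 15, 10)
overlap = iou(predicted.box, ground_truth_box)  # half of each box overlaps
```

An IoU of 1.0 means the predicted box matches the reference box exactly, while 0.0 means the two boxes do not overlap at all.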

### Popular object detection algorithms
- R-CNN (Regions with Convolutional Neural Networks).
- Fast R-CNN.
- Faster R-CNN.
- YOLO (You Only Look Once).
- SSD (Single Shot Detector).

Object detection is a powerful tool with numerous applications across various industries. It plays a crucial role in enabling machines to understand and interact with the visual world, making it a cornerstone of modern Computer Vision.

# What Is OCR?
OCR stands for Optical Character Recognition. It is a technology that enables computers to read text from scanned documents, images or even videos.

### How does it work?
1. Image input: The process begins with an image containing the text that is to be converted. This could be a scanned document, a photo of a sign or even a screenshot.
2. Image preprocessing: The image undergoes preprocessing to enhance the text's visibility. This might involve adjusting contrast, removing noise or isolating the text from the background.
3. Character segmentation: The system then divides the image into individual characters or words. This step is crucial for accurate recognition.
4. Feature extraction: The system analyzes the features of each character, such as its shape, lines and curves.
5. Character recognition: Finally, the system compares the extracted features to a database of known characters and assigns the most likely match.
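The final matching step can be sketched with a toy example. The code below assumes each segmented character is a tiny binary bitmap and that the "database of known characters" is a small dictionary of templates; recognition picks the template with the fewest differing pixels. All names and bitmaps are hypothetical, and real OCR engines rely on far richer features or on Neural Networks.

```python
# Hypothetical 3x3 binary templates (1 = ink, 0 = background).
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def pixel_distance(bitmap_a, bitmap_b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        a != b
        for row_a, row_b in zip(bitmap_a, bitmap_b)
        for a, b in zip(row_a, row_b)
    )


def recognize(bitmap):
    """Return the template character with the smallest pixel distance to `bitmap`."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(bitmap, TEMPLATES[ch]))


# A noisy "T": the bottom pixel of the stem is missing.
noisy_t = ((1, 1, 1),
           (0, 1, 0),
           (0, 0, 0))

print(recognize(noisy_t))  # T
```

Even with one pixel missing, the noisy bitmap is still closer to the "T" template than to the others, which is why the preprocessing step (noise removal) matters so much for accuracy.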

### Applications of OCR
OCR has a wide range of applications, including:
- Document digitization: Converting paper documents into searchable electronic formats.
- Data entry: Automating the process of entering data from forms and invoices.
- Image recognition: Recognizing text in images and videos, such as license plates or street signs.
- Accessibility: Making scanned documents accessible to visually impaired individuals.

### Limitations of OCR
- Accuracy: OCR's accuracy can be affected by factors such as image quality, font style, and language complexity.
- Handwriting: OCR may struggle to accurately recognize handwritten text, especially if the handwriting is messy or cursive.
- Background noise: Background noise or complex layouts can interfere with OCR accuracy.

# What Is Image Captioning?
Image Captioning involves generating a textual description of an image's content.

### How does it work?
1. Image analysis: The system first analyzes the image to understand its visual content. This involves identifying objects, their attributes, relationships between objects and the overall scene.
2. Language generation: Based on the visual understanding, the system generates a coherent and descriptive sentence or phrase. This often involves using Deep Learning models like RNNs or Transformers.

### Applications
- Accessibility: Providing textual descriptions so that visually impaired individuals can understand images.
- Image retrieval: Enhancing image search by allowing users to query with textual descriptions.
- Content creation: Automating the generation of captions for social media posts or news articles.
- Image understanding: Evaluating the performance of Computer Vision models by comparing generated captions with human-written descriptions.

### Challenges
- Complex scenes: Accurately describing scenes with multiple objects and intricate relationships can be challenging.
- Fine-grained details: Capturing subtle details or nuances in the image can be difficult.
- Creativity and fluency: Generating captions that are not only accurate but also creative and fluent can be a significant hurdle.

# What Is Image Generation?
Image Generation is a field in AI that leverages advanced Deep Learning algorithms to produce a wide range of visual content, from photorealistic landscapes and portraits to abstract art and imaginative concepts.

### How does it work?
At the core of image generation lie powerful Neural Networks, often organized into a framework called Generative Adversarial Networks (GANs). GANs consist of 2 key components:
1. Generator: This network's role is to create new images based on random noise or input prompts. It starts with random data and gradually refines it into a coherent image.
2. Discriminator: This network acts as a judge, evaluating the authenticity of the generated images. It distinguishes between real images from the training dataset and those created by the generator.

During the training process, the generator and discriminator engage in a continuous game of improvement. The generator strives to produce increasingly realistic images that can fool the discriminator, while the discriminator learns to better identify fake images. This adversarial process leads to remarkable results, where the generator can create highly convincing and visually appealing images.
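This adversarial game is commonly written as a minimax objective. The formula below is the standard formulation from the GAN literature rather than something stated in these notes, and the symbols are notation introduced here: $G$ is the generator, $D$ is the discriminator, $x$ is a real image drawn from the training data and $z$ is random noise.

$$
\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$

The discriminator tries to maximize $V$ by assigning high probability to real images and low probability to generated ones, while the generator tries to minimize it by making $D(G(z))$ as close to 1 as possible.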

### Key techniques involved
- Variational Autoencoders (VAEs): An alternative to GANs that learns a compressed representation of the input data and generates new samples from this compressed space.
- Diffusion models: These models gradually add noise to training images and learn to reverse the process, so that new images can be generated starting from pure noise.

### Applications
- Creative arts: Generating unique artwork, designs and visual concepts.
- Entertainment: Creating realistic characters, environments and special effects for movies, games and virtual reality experiences.
- Scientific research: Simulating complex phenomena, visualizing scientific data and accelerating drug discovery.
- Business: Personalizing marketing materials, creating realistic product visualizations and generating diverse content for social media.

# How Do Computers See Images?
Computers see images as a grid of tiny squares called pixels. Each pixel is assigned a numerical value representing its color intensity. The range of the numerical values is between 0 and 255.

The following is a breakdown:
1. Image capture: A camera or a scanner captures an image and converts it into a digital format.
2. Pixelation: The image is divided into a grid of pixels.
3. Color encoding: Each pixel is assigned a numerical value representing its color. This is often done using the RGB color model, where each pixel's color is determined by the intensity of red, green and blue light.
4. Data representation: The numerical values for each pixel are stored in a digital file such as a JPEG or PNG.

So, essentially, a computer sees an image as a massive array of numbers.

To understand the image, computers use algorithms and techniques like:
- Edge detection: Identifying boundaries and shapes within the image.
- Feature extraction: Identifying key characteristics like corners, curves and textures.
- Pattern recognition: Recognizing familiar objects or patterns within the image.

Deep Learning, particularly Convolutional Neural Networks (CNNs), has revolutionized how computers "see" images. CNNs can automatically learn complex features from raw pixel data, enabling them to perform tasks like image classification, object detection and image segmentation with remarkable accuracy.

### Grayscale image
A grayscale image is an image where the only colors are shades of gray. It ranges from pure black to pure white, with various shades of gray in between.

When dealing with a grayscale image, the 2 aspects involved are height and width. Each combination of row (height) and column (width) has one pixel value associated with it. Therefore, the parameters of a grayscale image are the height, the width and the numerical value of each pixel. In essence, a grayscale image can be thought of as a 2D matrix (height x width) where each cell in the matrix holds a pixel value.

Therefore, a grayscale image can be stored in a 2D matrix.
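As a minimal sketch, a grayscale image can be written as a nested list in plain Python. The pixel values below are made up, and the code assumes the common convention that the first index is the row (height) and the second is the column (width).

```python
# A hypothetical 3x4 grayscale image: 3 rows (height) by 4 columns (width).
# Each value is a pixel intensity from 0 (black) to 255 (white).
gray_image = [
    [0,   64, 128, 255],
    [32,  96, 160, 224],
    [255, 128, 64,   0],
]

height = len(gray_image)     # number of rows
width = len(gray_image[0])   # number of columns

# Pixels are addressed as image[row][column].
top_right = gray_image[0][width - 1]
brightest = max(max(row) for row in gray_image)

print(height, width, top_right, brightest)  # 3 4 255 255
```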

### Colored images
A colored image is an image that displays a wide range of colors. Unlike grayscale images, which only use shades of gray, colored images utilize a combination of colors to represent the visual information. Most real-world images are colored.

How are colors represented in digital images? The most common color model used in digital imaging is the RGB (Red, Green, Blue) model. In this model, each pixel is represented by 3 values:
- Red: The intensity of red light.
- Green: The intensity of green light.
- Blue: The intensity of blue light.

By combining different intensities of these primary colors, a vast array of colors can be created. Each color component is typically represented by a value between 0 and 255, where 0 represents the absence of that color and 255 represents the maximum intensity.

While RGB is the most common model, there are other color models as well:
- CMYK (Cyan, Magenta, Yellow, Black): Used in printing.
- HSV (Hue, Saturation, Value): A more intuitive model for representing colors.
- HSL (Hue, Saturation, Lightness): Similar to HSV, but uses lightness instead of value.
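Converting between these models is a small computation. The sketch below uses `colorsys.rgb_to_hsv` from Python's standard library, which expects RGB values scaled to the range 0.0 to 1.0 and returns hue, saturation and value in the same range. The wrapper name `rgb255_to_hsv` is a hypothetical helper added here for 0 to 255 inputs.

```python
import colorsys


def rgb255_to_hsv(r, g, b):
    """Convert 0-255 RGB values to HSV, with each HSV component in 0.0-1.0."""
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)


pure_red = rgb255_to_hsv(255, 0, 0)
mid_gray = rgb255_to_hsv(128, 128, 128)

print(pure_red)     # (0.0, 1.0, 1.0): hue 0, fully saturated, full brightness
print(mid_gray[1])  # 0.0: a gray pixel has no saturation
```

HSV separates "which color" (hue) from "how vivid" (saturation) and "how bright" (value), which is why it is often described as more intuitive than RGB.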

Colored images are essential for capturing and displaying the full spectrum of visual information, making them ubiquitous in photography, video and digital art.

When dealing with a color image, there is a third aspect in addition to height and width: the color channel (R, G or B). Each of the 3 channels has its own height-by-width matrix of pixel values.

Therefore, a colored image can be stored in a 3D tensor.
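The following sketch shows one such 3D layout in plain Python, assuming a height x width x channels ordering with made-up pixel values. Some frameworks store the channel dimension first instead, so the ordering here is an assumption, not a rule. The helper `extract_channel` is hypothetical.

```python
# A hypothetical 2x2 color image stored as height x width x channels.
# Each pixel is [R, G, B], with every value in the range 0-255.
color_image = [
    [[255, 0, 0], [0, 255, 0]],      # red,  green
    [[0, 0, 255], [255, 255, 255]],  # blue, white
]

height = len(color_image)
width = len(color_image[0])
channels = len(color_image[0][0])


def extract_channel(image, index):
    """Return one color channel as a 2D (height x width) matrix."""
    return [[pixel[index] for pixel in row] for row in image]


red_channel = extract_channel(color_image, 0)

print((height, width, channels))  # (2, 2, 3)
print(red_channel)                # [[255, 0], [0, 255]]
```

Each extracted channel is itself a 2D matrix, just like a grayscale image.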

Note:
- 0D: Scalar (`1`).
- 1D: Vector (`nx1` or `1xn`).
- 2D: Matrix (`mxn`).
- 3D and more: Tensor (`kxmxn`).
- In fact, "tensor" is the general term and covers scalars, vectors and matrices as well.

# What Is A Tensor?
Tensors are a fundamental concept in Mathematics and Computer Science, particularly in the field of Machine Learning. They are essentially multi-dimensional arrays that can hold numerical data. Think of them as a generalized version of matrices, which are themselves 2D arrays.

### Key points
- Dimensions: Tensors can have any number of dimensions, from 0 (scalars) to many.
- Rank: The rank of a tensor is the number of dimensions it has.
- Shape: The shape of a tensor describes the size of each dimension. For example, a tensor with shape (3, 2, 4) has 3 dimensions, with sizes 3, 2 and 4 respectively.
- Element: The individual values stored within the tensor are called elements.

### Examples
- Scalars: 0-dimensional tensors (e.g., 5, -2.3, π)   
- Vectors: 1-dimensional tensors (e.g., [1, 2, 3], [-1.5, 0.7])   
- Matrices: 2-dimensional tensors (e.g., [[1, 2], [3, 4]], [[5, 6], [7, 8]])
- Higher-dimensional tensors: Used in image processing, natural language processing, and other areas of machine learning. For example, a color image can be represented as a 3-dimensional tensor, with dimensions for height, width, and color channels (RGB).
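Rank and shape can be computed directly for the examples above. The sketch below represents tensors as nested Python lists and assumes they are regular (every sub-list at the same depth has the same length) and non-empty; `shape_of` and `rank_of` are hypothetical helper names, not functions from any library.

```python
def shape_of(tensor):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    shape = []
    current = tensor
    while isinstance(current, list):
        shape.append(len(current))
        current = current[0]  # descend one dimension
    return tuple(shape)


def rank_of(tensor):
    """The rank is the number of dimensions, i.e. the length of the shape."""
    return len(shape_of(tensor))


scalar = 5
vector = [1, 2, 3]
matrix = [[1, 2], [3, 4]]
# A 3 x 2 x 4 tensor filled with zeros.
tensor_3d = [[[0] * 4 for _ in range(2)] for _ in range(3)]

print(shape_of(scalar), rank_of(scalar))        # () 0
print(shape_of(vector), rank_of(vector))        # (3,) 1
print(shape_of(matrix), rank_of(matrix))        # (2, 2) 2
print(shape_of(tensor_3d), rank_of(tensor_3d))  # (3, 2, 4) 3
```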

### Why are tensors important in ML?
- Data representation: Tensors provide a flexible way to represent various types of data used in Machine Learning, such as images, text and numerical features.
- Neural Networks: Deep Learning models, such as Neural Networks, heavily rely on tensors for their computations and data flow.
- TensorFlow and PyTorch: Popular ML packages TensorFlow and PyTorch are built around the concept of tensors.

# Why Is Achieving Computer Vision Difficult?
Achieving Computer Vision is difficult for several reasons, one of the most prominent being occlusion. Occlusion in Computer Vision refers to the situation where one object in a scene partially or completely blocks another object from view. This phenomenon poses a significant challenge for various Computer Vision tasks, such as:
- Object detection: Accurately identifying and localizing objects can be difficult when they are partially or fully occluded.
- Object tracking: Tracking an object's movement becomes challenging when it disappears behind another object.
- 3D reconstruction: Reconstructing the 3D structure of a scene can be inaccurate if objects are occluded.

### Why is occlusion a challenge?
- Incomplete information: Occlusion reduces the amount of visible information about an object, making it harder to recognize, track or reconstruct.
- Ambiguity: When objects overlap, it can be difficult to determine which object is in front and which is behind.

### How to handle occlusion?
- Motion analysis: Tracking the movement of objects over time can help predict their positions even when they are occluded.
- Depth information: Using depth sensors or stereo vision can provide information about the relative distances of objects, helping to resolve occlusions.
- Contextual information: Utilizing prior knowledge about the scene or the objects involved can help to infer the presence of occluded objects.
- Advanced algorithms: Researchers are developing sophisticated algorithms that can handle occlusions more effectively, such as those based on Deep Learning.

Further reading:
- Illumination variability.
- Pose variability.

# Why Are Deep Neural Networks Preferred For Images?
Deep Neural Networks (DNNs), particularly Convolutional Neural Networks (CNNs), have become the preferred choice for image-related tasks due to their remarkable ability to extract complex features from visual data. The main reasons are:
1. Feature learning:
    - CNNs automatically learn hierarchical features from the raw image data.
    - Early layers detect basic features like edges and curves.
    - Deeper layers combine these basic features to recognize more complex patterns like objects and scenes.
    - This eliminates the need for manual feature engineering, which was a major bottleneck in traditional Computer Vision methods.
2. High accuracy:
    - CNNs have achieved state-of-the-art results on a wide range of image-related tasks, including image classification, object detection and image segmentation.
    - Their ability to learn intricate patterns and representations from massive datasets has led to significant improvements in accuracy.
3. Invariance to transformations:
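    - CNNs are fairly robust to small translations because the same filters are applied everywhere and pooling summarizes local regions. Robustness to rotation and scaling is more limited and usually comes from training on varied or augmented data.
    - This means they can recognize objects even if they are slightly shifted, rotated or resized in the image.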
4. End-to-end learning:
    - CNNs can be trained end-to-end, meaning the entire network, from input to output, is optimized for the specific task.
    - This eliminates the need for hand-crafted pipelines and allows the network to learn the most effective feature representations for the given problem.
5. Adaptability:
    - CNN architectures can be adapted and customized for various image-related tasks.
    - This flexibility makes them a versatile tool for a wide range of applications.
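The "invariance to transformations" point can be illustrated with max pooling, a layer commonly used in CNNs. The sketch below is a plain-Python toy with made-up feature maps, and `max_pool_2x2` is a hypothetical helper: it shows that a strong activation shifted by one pixel inside the same pooling window produces an identical pooled output.

```python
def max_pool_2x2(feature_map):
    """Apply non-overlapping 2x2 max pooling (height and width must be even)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


# A strong activation (9) in the top-left corner...
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# ...and the same activation shifted one pixel to the right.
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(max_pool_2x2(original))  # [[9, 0], [0, 0]]
print(max_pool_2x2(shifted))   # [[9, 0], [0, 0]]
```

A shift that crosses a pooling window boundary would change the output, so this robustness is approximate and only holds for small shifts.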

# Why Use CNN Over MLP?
CNNs have largely superseded MLPs in image processing tasks due to several key advantages.

1. Parameter sharing and reduced overfitting:
    - CNNs: Employ weight sharing, meaning the same filter is applied across the entire input image. This significantly reduces the number of parameters, lowering the risk of overfitting and making training more efficient.
    - MLPs: Connect every neuron in one layer to every neuron in the next, leading to a massive number of parameters, especially with high-resolution images. This increases the risk of overfitting and requires significantly more training data.
2. Local connectivity and spatial information:
    - CNNs: Utilize local receptive fields, meaning each neuron only connects to a small region of the input image. This allows CNNs to capture local patterns and spatial relationships within the image, crucial for tasks like object detection and segmentation.
    - MLPs: Treat the input image as a flattened vector, losing the spatial information that is essential for understanding visual data.
3. Feature learning:
    - CNNs: Learn hierarchical features through convolutional and pooling layers. Early layers detect basic features like edges and curves, while deeper layers combine these features to recognize more complex patterns.
    - MLPs: Require careful feature engineering, which can be time-consuming and domain specific.
4. Invariance to transformations:
    - CNNs: Exhibit some degree of invariance to transformations like translation, rotation and scaling due to their local connectivity and pooling operations.
    - MLPs: Are less robust to these transformations, as they rely on the exact position of pixels in the input vector.
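The difference in parameter counts from the first point can be worked out with simple arithmetic. The sketch below compares one fully connected layer with one convolutional layer; the input size (224x224 RGB), the 128 hidden neurons and the 32 filters of size 3x3 are assumptions chosen only for illustration.

```python
def dense_layer_params(num_inputs, num_outputs):
    """Weights plus one bias per output neuron in a fully connected layer."""
    return num_inputs * num_outputs + num_outputs


def conv_layer_params(kernel_h, kernel_w, in_channels, out_channels):
    """Shared kernel weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Assumed input: a 224x224 RGB image.
height, width, channels = 224, 224, 3

# MLP: flatten the image and connect every pixel value to 128 hidden neurons.
mlp_params = dense_layer_params(height * width * channels, 128)

# CNN: 32 filters of size 3x3 applied across the 3 input channels.
cnn_params = conv_layer_params(3, 3, channels, 32)

print(mlp_params)  # 19267712
print(cnn_params)  # 896
```

The convolutional layer's parameter count does not depend on the image size at all, because the same small filters are reused at every position.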

# How Are Images Processed In A Human Brain?

![computer_vision_3.png](attachment:computer_vision_3.png)

The visual cortex, located in the occipital lobe of the brain, is responsible for processing visual information. It is a complex and fascinating area, with different regions specializing in various aspects of vision.

### Key stages
1. Retina to V1 (Primary Visual Cortex):
    - Edge detection: Neurons in V1 respond to edges and contours, breaking down the image into basic shapes.
    - Orientation selectivity: Neurons are tuned to specific orientations (e.g., vertical, horizontal).
    - Receptive fields: Neurons have specific regions of the visual field they respond to.
2. V1 to higher-level areas (V2, V3, V4, etc.):
    - Feature extraction: More complex features like color, motion and depth are extracted.
    - Parallel processing: Information is processed simultaneously along different pathways:
        - Ventral stream ("what" pathway): Object recognition and identification.
        - Dorsal stream ("where" pathway): Spatial location, movement and action.

### Specialized areas
- V1: Basic feature extraction (edges, orientation).
- V2: More complex shape analysis, color processing.
- V3: Form perception, motion perception.
- V4: Color perception, complex shape analysis.
- V5/MT: Motion perception.

### Key concepts
- Hierarchical processing: Information is processed in stages, with increasing complexity.
- Parallel processing: Different aspects of the image are processed simultaneously.
- Plasticity: The visual cortex can adapt and change based on experience.

### Visual Illusions and Visual Cortex
Visual Illusions can provide insights into how the visual cortex works. For example, the Müller-Lyer illusion demonstrates how the perception of length can be influenced by the surrounding context.

### Recent research
- Deep Learning: Inspired by the Visual Cortex, Deep Learning algorithms have revolutionized image recognition in AI.
- Brain-Computer Interfaces: Researchers are exploring ways to use brain activity in the Visual Cortex to control devices.

# Problem Statement
Classify each Fashion image into one of the 10 categories.

![computer_vision_1.png](attachment:computer_vision_1.png)

