These notes are based on the book [Data Visualization Book](https://github.com/tgarg535/Data-Visualization/blob/d64d7b8accdf421c04189bda3e020d0b8e8ec245/Data%20Visualization%20Book.pdf).

### **Chapter 1: Introduction**

This chapter provides a high-level introduction to data visualization, covering its definition, historical context, relationship to other fields, the process of creating visualizations, and the crucial role of the user and human perception.

---

#### **1.1 What Is Visualization?**

**Definition**: Visualization is defined as "the communication of information using graphical representations". It leverages the human perceptual system's ability to process images in parallel, making it much faster than reading text, which is a sequential process. Visualizations can also be language-independent, like a map or graph understood by people who don't share a common language.

**Visualization in Daily Life**: We encounter visualizations constantly, such as subway maps, weather charts, stock market graphs, 3D medical scans, and assembly instructions.

**Why Visualization is Important**: The way data is presented can profoundly impact decision-making. Two key examples from the book highlight this:

1.  **Data Distortion Through Scaling**: The same dataset can be perceived in dramatically different ways depending on how its axes are scaled. As seen in Figure 1.1 of the book, plotting car retail price vs. MPG with different scales can make the data appear as:
    * A tight, undifferentiated cluster (when both axes have a large, uniform scale).
    * A horizontal linear pattern (when the y-axis is scaled larger).
    * A vertical linear pattern (when the x-axis is scaled larger).
    * An inversely proportional curve (when the scale is determined by the data's range).
    This shows that scaling can be used, intentionally or not, to distort the "truthful" representation of data.

2.  **Human Interpretation of Different Formats**: A 1999 study by Elting et al. presented hypothetical clinical trial data to 34 clinicians using four different formats: a table, pie charts, stacked bar graphs, and an icon display.
    * **Results**: The format had a massive impact on the clinicians' ability to make the correct decision to stop the trial.
        * **Icon Displays**: 82% correct decisions.
        * **Tables**: 68% correct decisions.
        * **Pie Charts / Bar Graphs**: 56% correct decisions.
    * **Crucial Insight**: This means that up to 25% of patients could have received inappropriate treatment based on data shown in bar or pie charts. Ironically, most clinicians preferred the table and were "contemptuous of the icon display," demonstrating that user preference does not equate to effectiveness.

---

#### **1.2 History of Visualization**

The use of graphics to convey information has a long and rich history.

* **Early Visualizations**:
    * The earliest examples are **cave paintings**, such as those in the Chauvet Cave from approximately 30,000 years ago, used to record and pass on information.
    * **Maps** were essential for travel, commerce, and religion.
        * The **Peutinger Map** was an early Roman road map that distorted east-west distances to fit on a long scroll.
        * The **Hereford Map** is a large medieval map of the world with Jerusalem at its center, mixing real geography with religious and mythical information.
        * **John Snow's Cholera Map (1854)** is a famous example of thematic cartography. By plotting deaths on a map of London, he identified a strong cluster around the Broad Street water pump. He had the pump handle removed, ending the epidemic and demonstrating a clear geographical link to disease.
    * **Time-Series Visualizations** existed even before the 1600s. An early example from around 1030 by al-Biruni shows the phases of the moon in orbit.

* **Abstract & Thematic Graphics**: A critical development was the use of graphics to represent abstract, non-geographical data.
    * **William Playfair (late 1700s)** was a pioneer, inventing the line graph and bar chart to show economic data like national debt and trade balances over time.
    * **Charles Joseph Minard's Map (1869)** of Napoleon's march on Moscow is considered a masterpiece. It visualizes six variables on a 2D surface: army size (via the width of the line), geographical location (two coordinates), direction of movement, temperature during the retreat, and time.
    * **Florence Nightingale (1850s)** used a "coxcomb" diagram to show that far more soldiers died from preventable diseases in hospitals than from wounds on the battlefield, compelling sanitary reforms.

* **Modern Visualization & the Role of Statistics**:
    * **Anscombe's Quartet (1973)** is a critical example demonstrating the need for visualization. It comprises four datasets that have nearly identical simple statistical properties (mean, variance, correlation, linear regression line). However, when plotted, they reveal four very different structures. This shows that relying on summary statistics alone can be highly misleading.
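
The point about Anscombe's Quartet can be checked numerically. The sketch below uses the values commonly reproduced for the quartet and computes a few summary statistics in plain Python; the helper names (`mean`, `pearson`, `summarize`) are illustrative and not taken from the book.

```python
import math

# Anscombe's quartet; sets I-III share the same x values.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Collect the summary statistics that the four sets have in common."""
    return {"mean_x": mean(xs), "mean_y": mean(ys), "r": pearson(xs, ys)}


for name, (xs, ys) in QUARTET.items():
    s = summarize(xs, ys)
    print(f"{name}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} r={s['r']:.3f}")
```

All four rows print nearly the same numbers, which is exactly why only a plot reveals how different the datasets are.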

---

#### **1.3 Relationship between Visualization and Other Fields**

* **Visualization vs. Computer Graphics**:
    * **Computer Graphics (CG)** is the set of tools and techniques used to create images. Its primary focus is often on synthesizing realistic images of 3D objects for art, entertainment, and games. It is the **underpinning** of visualization.
    * **Visualization** uses CG as a medium, but its goal is the **effective communication of data**, not visual realism. It encompasses aspects from other fields like HCI, psychology, statistics, and data mining. The underlying models (data vs. geometric objects) and goals (insight vs. realism) are different.

* **Scientific Visualization vs. Information Visualization**:
    * Historically, a distinction was made between **scientific visualization** (for physical data like MRI scans or fluid flow) and **information visualization** (for abstract data like financial records or networks).
    * This book does not make a strong distinction, treating them as allied fields that both provide representations of data.

---

#### **1.4 The Visualization Process**

Creating a visualization is a multi-stage process, often described as a pipeline. User interaction can occur at any stage.

* **The Core Visualization Pipeline**:
    * **Data Modeling / Preprocessing**: Raw data is filtered, sampled, and structured into a usable format.
    * **Data Selection**: A subset of the data is chosen for visualization (similar to clipping in graphics).
    * **Data to Visual Mappings**: The heart of the process. Data values are mapped to graphical attributes like position, size, shape, and color.
    * **View Transformations / Scene Parameters**: The user specifies non-data attributes like camera position, lighting, and color maps.
    * **Rendering**: The computer generates the final image, including axes, keys, and annotations.

* **Role of Perception**: The final stage is the human. A visualization's effectiveness depends on the abilities and limitations of the human visual system.
    * **Preattentive Processing**: The visual system can rapidly and accurately detect a limited set of visual properties in parallel, without focused attention. These include differences in color, shape, orientation, and size.
    * **Attentive Processing**: A slower, serial process that requires focused attention to understand more complex conjunctions of features.
    Effective visualizations harness preattentive features to draw the user's attention to important aspects of the data.

---

#### **1.5 The Role of Cognition**

Cognition goes beyond perception. It involves understanding, remembering, and reasoning about what is seen. A more complete model of the visualization process includes not just the computer's rendering pipeline but also the human's cognitive pipeline, where perception feeds into cognition to produce knowledge.

---


#### **1.6 Pseudocode Conventions**

This section of the book establishes the conventions used for presenting algorithms, ensuring clarity and consistency. The pseudocode is designed to convey the core logic of an algorithm without being tied to a specific programming language, graphics library, or detailed data management.

The book assumes the following global variables and functions are available in the environment where the pseudocode runs:

  * **`data`**: Represents the working data table, which is assumed to contain only numeric values.
  * **`m`**: The number of dimensions (columns) in the working data table.
  * **`n`**: The number of records (rows) in the working data table.
  * **`NORMALIZE(record, dimension, min, max)`**: A function that maps a data value to a specified range. If `min` and `max` are not provided, it maps the value to the range between 0 and 1. The book notes this can be adapted for different normalization types, such as linear or logarithmic.
  * **`COLOR(color)`**: A function that sets the current drawing color for the graphics environment.
  * **`MAPCOLOR(record, dimension)`**: A function that applies a global color map to the normalized value of a given record and dimension to set the drawing color.
  * **Drawing Primitives**:
      * **`CIRCLE(x, y, radius)`**: Fills a circle at a given location with a specific radius using the current color setting.
      * **`POLYLINE(xs, ys)`**: Draws a series of connected line segments based on arrays of x and y coordinates.
      * **`POLYGON(xs, ys)`**: Fills a polygon defined by arrays of x and y coordinates with the current color setting.
  * **Geographic and Graph Functions**:
      * The book also assumes functions for handling geographic data (like `GETLATITUDES`) and graph data (`GETCONNECTIONS`) exist when needed.
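
To make these conventions concrete, here is a minimal Python sketch of the `data`, `m`, `n` globals and a linear `NORMALIZE`. The table contents are made up, the parameter names `lo` and `hi` stand in for the book's `min` and `max`, and the handling of a constant column is an assumption rather than something the book specifies.

```python
# Hypothetical working data table: n records (rows) by m dimensions (columns).
data = [
    [10.0, 200.0],
    [20.0, 400.0],
    [30.0, 300.0],
]
n = len(data)      # number of records
m = len(data[0])   # number of dimensions


def normalize(record, dimension, lo=0.0, hi=1.0):
    """Linearly map data[record][dimension] into the range [lo, hi]."""
    column = [data[i][dimension] for i in range(n)]
    d_min, d_max = min(column), max(column)
    if d_max == d_min:
        # Constant column: returning lo is an arbitrary choice made here.
        return lo
    t = (data[record][dimension] - d_min) / (d_max - d_min)
    return lo + t * (hi - lo)


print(normalize(1, 0))             # middle record, default range
print(normalize(2, 1, 2.0, 10.0))  # custom output range
```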

-----

#### **1.7 The Scatterplot**

The scatterplot is presented as one of the most fundamental, earliest, and widely used visualization techniques, built upon the Cartesian coordinate system. It serves as a foundational example to discuss the process of transforming data into a visual representation.

  * **How it Works**
    A scatterplot maps records from a dataset to points in a 2D or 3D space. The book provides the following pseudocode for a 2D scatterplot where additional data dimensions control the color and size (radius) of the points:

    ```
    SCATTERPLOT(xDim, yDim, cDim, rDim, rMin, rMax)
    for each record i
      do x ← NORMALIZE(i, xDim)             // derive the location
         y ← NORMALIZE(i, yDim)
         r ← NORMALIZE(i, rDim, rMin, rMax) // derive the radius
         MAPCOLOR(i, cDim)                  // derive the color
         CIRCLE(x, y, r)                    // draw the record as a circle
    ```

    This algorithm iterates through each data record, normalizes its values for the chosen dimensions to determine its x-y position and radius, maps another dimension to its color, and then draws the resulting circle on the screen.

  * **Example: The Power of Exploration**
    The book uses a dataset of 428 new cars and trucks from 2004 to demonstrate the scatterplot's power. It explains that trying to find patterns by looking at the raw data in a table is extremely difficult, even for a small subset of the data.

    However, when the data for just Toyota vehicles is visualized in a scatterplot (Figure 1.44) with **horsepower on the x-axis** and **city MPG on the y-axis**, a pattern becomes immediately clear: there is an almost **linear inverse relationship** between the two variables.

    This initial insight demonstrates the exploratory process that visualization enables:

    1.  **Observe**: A pattern is spotted in the Toyota data.
    2.  **Hypothesize**: A hypothesis is formed (e.g., "for vehicles, higher horsepower leads to lower MPG").
    3.  **Test**: The hypothesis is tested by looking at other subsets of the data, like Kia vehicles (which show a similar pattern) and then Lexus vehicles (where the relationship is less simple).
    4.  **Refine**: The initial hypothesis is refined based on the new evidence.

    This cycle of observation and inquiry is difficult with raw data tables but is a primary strength of the scatterplot, allowing an analyst to ask and begin answering complex questions about distributions, trends, and groups within the data.
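
The `SCATTERPLOT` pseudocode above can be translated into Python roughly as follows. This is a sketch under several assumptions: instead of drawing, it returns a list of circle descriptions; `map_color` is a hypothetical blue-to-red ramp standing in for the book's global color map; and the `cars` records are invented values, not the book's 2004 dataset.

```python
def normalize(data, record, dimension, lo=0.0, hi=1.0):
    """Linearly map one value of the table into [lo, hi]."""
    column = [row[dimension] for row in data]
    d_min, d_max = min(column), max(column)
    if d_max == d_min:
        return lo  # constant column: arbitrary choice for this sketch
    t = (data[record][dimension] - d_min) / (d_max - d_min)
    return lo + t * (hi - lo)


def map_color(data, record, dimension):
    """Stand-in for MAPCOLOR: blue (low) to red (high) as an (r, g, b) tuple."""
    t = normalize(data, record, dimension)
    return (t, 0.0, 1.0 - t)


def scatterplot(data, x_dim, y_dim, c_dim, r_dim, r_min, r_max):
    """Return one circle description per record instead of drawing it."""
    circles = []
    for i in range(len(data)):
        x = normalize(data, i, x_dim)                # derive the location
        y = normalize(data, i, y_dim)
        r = normalize(data, i, r_dim, r_min, r_max)  # derive the radius
        color = map_color(data, i, c_dim)            # derive the color
        circles.append({"x": x, "y": y, "r": r, "color": color})
    return circles


# Made-up records: (horsepower, city MPG, price, weight).
cars = [
    [100.0, 35.0, 15000.0, 2200.0],
    [200.0, 25.0, 25000.0, 3200.0],
    [300.0, 15.0, 45000.0, 4200.0],
]
circles = scatterplot(cars, x_dim=0, y_dim=1, c_dim=2, r_dim=3, r_min=2.0, r_max=10.0)
for c in circles:
    print(c)
```

Each dictionary could then be handed to whatever drawing primitive plays the role of `CIRCLE` in a given graphics environment.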

---

#### **1.8 The Role of the User**

The user's purpose or goal is central to the design of a visualization. Visualizations can be categorized by their intended role:

* **Exploration**: The user has a dataset and wants to examine it to find interesting, unknown patterns, features, or outliers.
* **Confirmation**: The user already has a hypothesis about the data and uses the visualization to confirm or refute it.
* **Presentation**: The user already knows the story in the data and is creating a static visualization to communicate that story clearly to an audience.
* **Interactive Presentation**: A presentation that allows the end-user some degree of exploration within a guided framework, common on the web.

***

### **Chapter 2: Data Foundations**

Since every visualization begins with data, this chapter examines the fundamental characteristics of data, its structure, and the common preprocessing steps required to make it suitable for visualization.

A typical dataset consists of a list of *n* records ($r_1, r_2, ..., r_n$), where each record $r_i$ is made up of *m* variables or observations ($v_1, v_2, ..., v_m$). Variables can be classified as **independent** (not affected by other variables, like time) or **dependent** (affected by other variables, like temperature at a given time and location).

---

#### **2.1 Types of Data**

Data can be categorized based on its measurement type (ordinal vs. nominal) and its mathematical scale.

* **Ordinal (Numeric) Data**: This data consists of numeric values.
    * **Binary**: Can only take values of 0 or 1.
    * **Discrete**: Can only take integer values or values from a specific subset (e.g., (2, 4, 6)).
    * **Continuous**: Represents real values within a given interval (e.g., [0, 5]).

* **Nominal (Non-numeric) Data**: This data consists of non-numeric values.
    * **Categorical**: A value selected from a finite list of possibilities (e.g., 'red', 'blue', 'green').
    * **Ranked**: A categorical variable that has a clear, implied ordering (e.g., 'small', 'medium', 'large').
    * **Arbitrary**: A variable with a potentially infinite range of values with no implied ordering (e.g., street addresses).

* **Data Scale Attributes**: These attributes help further classify variables.
    * **Ordering relation**: A property that allows the data to be sorted. All ordinal and ranked nominal variables have this property.
    * **Distance metric**: A property that allows the distance between different records to be computed. This is present in all ordinal variables but generally not in nominal ones.
    * **Existence of absolute zero**: A property where a variable has a fixed lowest possible value. This helps differentiate types of ordinal data; for example, 'weight' has an absolute zero, while 'bank balance' (which can be negative) does not.

The type of data determines the valid mathematical operations. For example, comparison operators (`<`, `>`) can only be applied to ranked nominal and ordinal data.

---

#### **2.2 Structure within and between Records**

Data sets have structure, both in their representation (syntax) and in the interrelationships between data items (semantics).

* **2.2.1 Scalars, Vectors, and Tensors**
    * A **scalar** is a single number in a data record, like the cost of an item.
    * A **vector** is a composite item made of multiple values, like a 2D position `(x, y)` or a 3-component color `(R, G, B)`.
    * A **tensor** is a more general structure represented as an array or matrix. A scalar is a tensor of rank 0, and a vector is a tensor of rank 1. A rank M tensor in a D-dimensional space requires $D^M$ data values.

* **2.2.2 Geometry and Grids**
    * Geometric structure can be **explicit**, where coordinates are included for each data record (e.g., longitude and latitude for temperature sensors).
    * It can also be **implied**, where a grid structure is assumed, and successive records are located at successive grid locations (e.g., pixels in an image). Such grids can be uniform or **nonuniform/irregular**, where data is computed more densely in areas of high importance.
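
As a small illustration of implied geometry, the sketch below recovers a record's grid location from its position in the record list. It assumes a uniform 2D grid stored in row-major order, which is a common layout for image pixels but is an assumption here, not something the notes specify.

```python
def grid_position(index, width):
    """Return (row, col) of the index-th record on a row-major uniform grid."""
    return divmod(index, width)


def grid_index(row, col, width):
    """Inverse mapping: position in the record list for a grid location."""
    return row * width + col


# The 8th record (index 7) of a grid that is 4 cells wide.
print(grid_position(7, 4))
print(grid_index(1, 3, 4))
```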

* **2.2.3 Other Forms of Structure**
    * **Timestamp**: An important attribute for time-oriented data, which can be relative or absolute.
    * **Topology**: Describes how data records are connected. For example, vertices on a surface are connected by edges. This connectivity information is essential for processes like resampling and interpolation.

---

#### **2.3 Data Preprocessing**

Before visualization, raw data often needs to be cleaned, transformed, or reduced.

* **2.3.1 Metadata and Statistics**: Information *about* the data, known as **metadata**, provides crucial context like data formats, units, and what value signifies a missing entry. Basic statistics like the **mean** ($\mu = \frac{1}{n} \sum_{i=0}^{n-1} x_i$) and **standard deviation** ($\sigma = \sqrt{\frac{1}{n}\sum_{i=0}^{n-1}(x_i - \mu)^2}$) can provide useful insights and help detect outliers.
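
The mean and standard deviation formulas translate directly into code. The sketch below also flags values that lie more than `k` standard deviations from the mean; that threshold rule and its default of `k = 2` are assumptions chosen for illustration, not a rule from the book.

```python
import math


def mean(xs):
    """Mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def std_dev(xs):
    """Population standard deviation: sqrt((1/n) * sum of (x_i - mu)^2)."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


def outliers(xs, k=2.0):
    """Values more than k standard deviations from the mean (hypothetical rule)."""
    mu, sigma = mean(xs), std_dev(xs)
    return [x for x in xs if abs(x - mu) > k * sigma]


values = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]
print(mean(values), std_dev(values), outliers(values))
```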

* **2.3.2 Missing Values and Data Cleansing**: Real-world data is often incomplete or erroneous. Common strategies to handle this include:
    * **Discard the record**: The simplest but potentially most costly in terms of information loss.
    * **Assign a sentinel value**: Use a designated value (e.g., -999) to mark missing entries, making them clearly visible in a visualization.
    * **Assign the average value**: Replaces the missing entry with the mean for that variable. This minimally affects statistics but may mask interesting outliers.
    * **Assign value based on nearest neighbor**: Find the most similar record based on other variables and use its value.
    * **Compute a substitute value**: Use advanced statistical methods, a process known as **imputation**, to generate a high-confidence substitute value.
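
Four of the strategies above can be sketched in a few lines each. The example assumes missing entries are represented as `None`, that every column has at least one present value, and that "most similar" means smallest squared Euclidean distance over the known columns; all three are assumptions made for this sketch.

```python
SENTINEL = -999.0  # example marker value, as in the text


def discard_incomplete(records):
    """Drop every record that has at least one missing entry."""
    return [r for r in records if None not in r]


def fill_with_sentinel(records, sentinel=SENTINEL):
    """Replace missing entries with a clearly visible marker value."""
    return [[sentinel if v is None else v for v in r] for r in records]


def fill_with_mean(records):
    """Replace missing entries with the mean of the present values in that column."""
    column_means = []
    for j in range(len(records[0])):
        present = [r[j] for r in records if r[j] is not None]
        column_means.append(sum(present) / len(present))
    return [[column_means[j] if v is None else v for j, v in enumerate(r)]
            for r in records]


def fill_with_nearest_neighbor(records):
    """Copy missing entries from the most similar complete record."""
    complete = discard_incomplete(records)
    filled = []
    for r in records:
        known = [j for j, v in enumerate(r) if v is not None]
        nearest = min(complete,
                      key=lambda c: sum((c[j] - r[j]) ** 2 for j in known))
        filled.append([nearest[j] if v is None else v for j, v in enumerate(r)])
    return filled


records = [[1.0, 10.0], [2.0, None], [9.0, 90.0]]
print(fill_with_mean(records))
print(fill_with_nearest_neighbor(records))
```

Note how the mean strategy and the nearest-neighbor strategy give very different substitutes for the same gap, which is why the choice should depend on the data and the task.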

* **2.3.3 Normalization**: The process of transforming data to fit a particular statistical property, most commonly scaling values to the range of 0.0 to 1.0. This is crucial for mapping data to graphical attributes that have a fixed range (like color).
    * **Linear Normalization**: $d_{normalized} = (d_{original} - d_{min}) / (d_{max} - d_{min})$.
    * **Non-linear Normalization**: For skewed distributions, a non-linear mapping like square root ($d_{sqrt-normalized} = (\sqrt{d_{original}} - \sqrt{d_{min}}) / (\sqrt{d_{max}} - \sqrt{d_{min}})$) or logarithmic ($d_{log-normalized} = (\log d_{original} - \log d_{min}) / (\log d_{max} - \log d_{min})$) may be more appropriate.
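
All three normalization formulas share one shape: apply a transform to the value and to both bounds, then rescale. The sketch below expresses that with a single hypothetical helper; note that the square-root variant needs non-negative inputs and the logarithmic variant needs strictly positive ones.

```python
import math


def normalize(d, d_min, d_max, transform=lambda v: v):
    """Map d into [0, 1] after applying transform to d, d_min, and d_max."""
    t_min, t_max = transform(d_min), transform(d_max)
    return (transform(d) - t_min) / (t_max - t_min)


print(normalize(25.0, 0.0, 100.0))             # linear
print(normalize(25.0, 0.0, 100.0, math.sqrt))  # square root
print(normalize(10.0, 1.0, 100.0, math.log))   # logarithmic
```

The same input lands at a different place in the output range depending on the transform, which is how non-linear normalization spreads out values that a skewed distribution would otherwise crowd together.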

* **2.3.4 Segmentation**: The process of separating data into contiguous regions corresponding to a particular classification (e.g., an MRI scan segmented into bone, muscle, fat, and skin). This is often followed by an iterative **split-and-merge** refinement stage to improve the quality of the segments.

* **2.3.5 Sampling and Subsetting**: Transforming a dataset to a different spatial resolution.
    * **Interpolation**: Estimating values at locations between known samples.
        * **Linear Interpolation**: Estimates a value at point C between points A and B using the formula: $d_C = d_A + (d_B - d_A) \times (x_C - x_A) / (x_B - x_A)$.
        * **Bilinear Interpolation**: Extends this to 2D using the four grid points surrounding the target location: first interpolate horizontally along the two rows of grid points, then interpolate vertically between those two results.
        * **Nonlinear Interpolation**: Uses higher-order polynomials like splines to create smoother transitions. The **Catmull-Rom curve** is a popular choice that ensures the curve passes through the data points.
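
The three interpolation schemes can be sketched as follows. The bilinear version assumes a unit grid cell with fractional offsets `tx` and `ty` in [0, 1], and the Catmull-Rom version uses the standard uniform formulation for scalar values; both parameterizations are choices made for this sketch.

```python
def lerp(x_a, d_a, x_b, d_b, x_c):
    """Linear interpolation: d_C = d_A + (d_B - d_A) * (x_C - x_A) / (x_B - x_A)."""
    return d_a + (d_b - d_a) * (x_c - x_a) / (x_b - x_a)


def bilinear(d00, d10, d01, d11, tx, ty):
    """Bilinear interpolation inside a unit cell.

    d00 is the value at (0, 0), d10 at (1, 0), d01 at (0, 1), d11 at (1, 1).
    """
    bottom = d00 + (d10 - d00) * tx      # horizontal pass, lower row
    top = d01 + (d11 - d01) * tx         # horizontal pass, upper row
    return bottom + (top - bottom) * ty  # vertical pass


def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom curve between p1 (t=0) and p2 (t=1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


print(lerp(0.0, 10.0, 10.0, 20.0, 5.0))
print(bilinear(0.0, 10.0, 20.0, 30.0, 0.5, 0.5))
print(catmull_rom(0.0, 1.0, 4.0, 9.0, 0.5))
```

Evaluating `catmull_rom` at `t = 0` and `t = 1` returns `p1` and `p2` exactly, which is the pass-through property mentioned above.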

* **2.3.6 Dimension Reduction**: Techniques to reduce the dimensionality of data for visualization while preserving as much information as possible.
    * **Principal Component Analysis (PCA)**: A popular method that computes new dimensions (principal components) which are linear combinations of the originals. These new dimensions are sorted by how much they contribute to explaining the variance of the data.
    * **Multidimensional Scaling (MDS)**: A class of algorithms that finds a lower-dimensional representation of data that best preserves the inter-point distances from the original high-dimensional space. It iteratively adjusts point locations to minimize a **stress** function, which measures the discrepancy between distances in the high-D and low-D spaces.
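
To give a feel for PCA, the sketch below finds only the first principal component, using power iteration on the covariance matrix. Power iteration is a simplification chosen to keep the example dependency-free; practical implementations typically use an eigendecomposition or SVD from a numerical library. The function names and sample records are hypothetical.

```python
import math


def covariance_matrix(records):
    """Return the covariance matrix and the mean-centered records."""
    n, m = len(records), len(records[0])
    means = [sum(r[j] for r in records) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in records]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(m)]
           for a in range(m)]
    return cov, centered


def first_principal_component(records, iterations=200):
    """Approximate the direction of greatest variance via power iteration.

    Assumes the data has nonzero variance and that the starting vector is
    not orthogonal to the dominant eigenvector.
    """
    cov, centered = covariance_matrix(records)
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m  # starting guess
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Coordinates of each record along the new dimension.
    projections = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, projections


# Made-up records lying exactly on the line y = 2x.
records = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
component, projections = first_principal_component(records)
print(component)
print(projections)
```

Because these sample records lie on a single line, one new dimension captures all of their variance, so the 2D data can be shown in 1D with no loss.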

* **2.3.7 Mapping Nominal Dimensions to Numbers**: To visualize nominal (non-numeric) data, a mapping to a graphical attribute is needed that doesn't introduce artificial relationships. For variables with few distinct values, **color** and **shape** are good choices. For more complex cases, statistical techniques like **correspondence analysis** can be used to map nominal values to a numeric scale based on the similarity of the records associated with them.

* **2.3.8 Aggregation and Summarization**: In cases of very large datasets, it's often useful to group similar data points and represent the group with a summary (e.g., count, average, extent) to reduce visual clutter and provide an overview.
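
A minimal sketch of aggregation, assuming that "similar" records are simply those sharing the same value in one column. Real systems often group by clustering or binning instead; the function name and sample records here are hypothetical.

```python
from collections import defaultdict


def aggregate(records, key_index, value_index):
    """Group records by one column and summarize another column per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key_index]].append(r[value_index])
    return {
        key: {
            "count": len(vs),
            "average": sum(vs) / len(vs),
            "extent": (min(vs), max(vs)),
        }
        for key, vs in groups.items()
    }


# Made-up records: (category, value).
records = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(aggregate(records, key_index=0, value_index=1))
```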

* **2.3.9 Smoothing and Filtering**: A process, often done via **convolution**, that uses a weighted average of a point's neighbors to reduce noise and blur sharp discontinuities. A simple 1D smoothing formula is $p'_{i} = \frac{p_{i-1}}{4} + \frac{p_{i}}{2} + \frac{p_{i+1}}{4}$.
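
The 1D smoothing formula corresponds to convolving the data with the kernel `[1/4, 1/2, 1/4]`. The sketch below applies it to a list of values; leaving the two endpoints unchanged is an assumption, since the formula does not define what happens where a neighbor is missing.

```python
def smooth(points):
    """Apply p'_i = p_{i-1}/4 + p_i/2 + p_{i+1}/4 to interior points."""
    if len(points) < 3:
        return list(points)
    out = [points[0]]  # endpoints are copied unchanged (assumption)
    for i in range(1, len(points) - 1):
        out.append(points[i - 1] / 4 + points[i] / 2 + points[i + 1] / 4)
    out.append(points[-1])
    return out


# A single spike is spread over its neighbors.
print(smooth([0, 0, 4, 0, 0]))
```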

* **2.3.10 Raster-to-Vector Conversion**: The process of extracting linear or polygonal structures from raster (pixel-based) data. This is useful for compression, comparison, and segmentation. Techniques include **thresholding**, **region-growing**, **boundary-detection** (using convolution), and **thinning**.