# Data Representation

## Terminology

The term "data" is defined in various ways in the literature, depending on the field and the application. A data item can be:

- **a record** characterized by a set of fields (database terminology).
- **an individual/observation** defined by a set of characteristics or parameters or variables (statistical terminology).
- **an instance** characterized by a set of attributes (object-oriented terminology in computer science).
- **a point or a vector** characterized by its coordinates in a vector space (algebra terminology).

## Representation

Data is generally represented in the form of a rectangular table (or matrix) with $N$ rows representing individuals and $K$ columns corresponding to variables. Two equivalent layouts are used:

- **$M_{obs}$**: the matrix of dimension $N \times K$ containing the data in terms of observations (one row per individual).
- **$M_{var}$**: the matrix of dimension $K \times N$ containing the data in terms of variables (one row per variable). It is the transpose of $M_{obs}$.

Here $x^j_i$ is the value of individual $i$ for variable $j$.

We denote by $x_i = (x^1_i, x^2_i, \dots, x^K_i)$ the vector of variables for individual $i$ and by $x^j = (x^j_1, x^j_2, \dots, x^j_N)$ the vector of individuals for variable $j$.

$$
M_{obs} = \begin{pmatrix}
x^1_1 & x^2_1 & \cdots & x^K_1 \\
x^1_2 & x^2_2 & \cdots & x^K_2 \\
\vdots & \vdots & \ddots & \vdots \\
x^1_N & x^2_N & \cdots & x^K_N
\end{pmatrix}
$$

$$
M_{var} = \begin{pmatrix}
x^1_1 & x^1_2 & \cdots & x^1_N \\
x^2_1 & x^2_2 & \cdots & x^2_N \\
\vdots & \vdots & \ddots & \vdots \\
x^K_1 & x^K_2 & \cdots & x^K_N
\end{pmatrix}
$$
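
The relationship between the two layouts can be illustrated with a small NumPy sketch. The numbers below are made-up toy values (3 individuals, 2 variables), and the variable names are chosen only to mirror the notation above.

```python
import numpy as np

# Hypothetical toy data: N = 3 individuals (rows), K = 2 variables (columns).
M_obs = np.array([
    [25.0, 170.0],
    [32.0, 165.5],
    [47.0, 180.2],
])

N, K = M_obs.shape   # N individuals, K variables
M_var = M_obs.T      # K x N: one row per variable

# Vector of variables for the second individual (0-based index 1).
x_i = M_obs[1]
# Vector of individuals for the first variable (0-based index 0).
x_j = M_var[0]

print(M_var.shape)   # (K, N)
print(x_i, x_j)
```

Note that NumPy indexes from 0, whereas the mathematical notation above starts at 1.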

## Data Structure Types Based on Variables

- **Structured Data**: Data that is organized in a specific manner and follows a predefined schema or structure.
- **Semi-structured Data**: Data that is neither completely structured nor completely unstructured.
- **Unstructured Data**: Data that is not organized in a predefined manner and does not follow a specific model.

### Structured Data

Structured data is organized in a specific manner and follows a predefined schema. Key characteristics include:

- **Predefined Format**: Usually stored in tables, relational databases, or CSV files.
- **Defined Schema**: Follows a predefined data model, for example a relational database whose tables have specific columns and data types.
- **Ease of Analysis**: Because of its consistent organization, structured data is easier to query and analyze, and meaningful information is easier to extract from it.

Examples include financial data, inventory tracking data, sales data, demographic data, billing data, and structured weather data.
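
As a minimal sketch of what a predefined schema looks like in practice, the snippet below loads a small CSV with explicit column types using pandas. The CSV content, the column names, and the `schema` dictionary are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical sales records in CSV form (structured: fixed columns, fixed types).
csv_text = """order_id,product,quantity,unit_price
1,keyboard,2,49.90
2,mouse,1,19.50
3,monitor,3,189.00
"""

# Assumed schema: each column has a declared data type.
schema = {"order_id": "int64", "product": str, "quantity": "int64", "unit_price": "float64"}

sales = pd.read_csv(io.StringIO(csv_text), dtype=schema)

# Because the structure is known in advance, queries are straightforward.
sales["total"] = sales["quantity"] * sales["unit_price"]
print(sales.dtypes)
print(sales[sales["total"] > 50])
```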

### Unstructured Data

Unstructured data does not follow a predefined organization or schema. Characteristics include:

- **Lack of Structure**: Includes free text, media (images, videos, audio files), PDF documents, emails, web pages, social media posts, raw sensor data, etc.
- **Variety**: Comes in many different forms, making analysis complex.
- **Difficulty of Analysis**: Requires advanced analysis techniques such as natural language processing (NLP), computer vision, speech recognition, etc.

Examples include social media posts, customer comments, YouTube videos, Instagram images, emails, audio recordings, conversation transcripts, unformatted reports, blogs, social media monitoring data, etc.

### Semi-structured Data

Semi-structured data falls between structured tabular data and unstructured textual data. Characteristics include:

- **Variable Format**: Does not follow a fixed schema.
- **Partial Schema**: Can have partially defined or flexible schemas. Some parts may be well-defined, while others are more flexible.
- **Use of Tags or Markers**: Often relies on tags or markers to delimit elements, as in XML or JSON.

Examples include HTML documents, XML files, JSON data, configuration data, log data, social media data, IoT sensor data, etc.
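
To make the idea of a partial schema concrete, here is a small sketch that flattens JSON records into a table with `pandas.json_normalize`. The records are invented for illustration: every record has an `id` and a `user.name`, while the other fields appear only in some records.

```python
import json
import pandas as pd

# Hypothetical semi-structured records: shared keys plus optional ones.
raw = """[
  {"id": 1, "user": {"name": "Ana", "city": "Lyon"}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Ben"}},
  {"id": 3, "user": {"name": "Chloe", "city": "Paris"}, "premium": true}
]"""

records = json.loads(raw)

# Nested keys become dotted column names; absent fields become missing values.
flat = pd.json_normalize(records)
print(flat)
```

The resulting table has one column per key seen in any record, which shows why semi-structured data usually needs a flattening step before tabular analysis.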

## Data Types Based on Variables/Features

Determining the type of each variable is a necessary step in exploratory data analysis (EDA), because it determines which analysis methods are appropriate.

### Qualitative Variable

A variable is qualitative (or categorical) if its values are categories or labels rather than measured quantities. Examples include gender, profession, marital status, etc. The values of a qualitative variable are called modalities.

#### Nominal Qualitative Variable

A qualitative variable is nominal if its modalities are not naturally ordered. For example, in an active population, profession is a nominal variable.

#### Ordinal Qualitative Variable

A qualitative variable is ordinal if its modalities follow an order relation. For example, a pathology can be mild, moderate, or severe.
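
The nominal/ordinal distinction can be sketched with pandas categorical types. The series below are toy values, and the order `mild < moderate < severe` is stated explicitly because pandas cannot infer it.

```python
import pandas as pd

# Nominal: modalities without a natural order.
profession = pd.Series(["Engineer", "Doctor", "Artist"], dtype="category")

# Ordinal: modalities with an explicit order relation.
severity = pd.Series(
    pd.Categorical(
        ["moderate", "mild", "severe", "mild"],
        categories=["mild", "moderate", "severe"],
        ordered=True,
    )
)

print(profession.cat.ordered)    # False: no order defined
print(severity.max())            # largest modality under the declared order
print(severity >= "moderate")    # order comparisons are only valid for ordinal data
```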

### Quantitative Variable

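A variable is quantitative if its values are numbers that count or measure a quantity. A quantitative variable is discrete if it can only take values that can be enumerated (for example, a number of children). It is continuous if its potential values cannot be enumerated (for example, a height). Binary variables are discrete quantitative variables with special properties.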

#### Binary Quantitative Variable

- **Symmetrical**: A binary variable is symmetrical if both modalities have equal importance and can be coded as 0 or 1 interchangeably.
- **Asymmetrical**: A binary variable is asymmetrical if the two modalities do not have equal importance.
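
As a rough illustration of how these variable types might be recognized in a table, the sketch below applies a simple heuristic based on pandas dtypes. The function `variable_kind` and the demo columns are hypothetical, and the rules are assumptions: a dtype alone cannot tell whether an integer column is a true count or a coded category, nor whether a binary variable is symmetrical.

```python
import pandas as pd
from pandas.api import types as ptypes

def variable_kind(col: pd.Series) -> str:
    """Heuristic guess of a column's variable type (hypothetical helper)."""
    if col.nunique(dropna=True) == 2:
        return "binary"
    if ptypes.is_integer_dtype(col):
        return "quantitative discrete"
    if ptypes.is_float_dtype(col):
        return "quantitative continuous"
    return "qualitative"

# Made-up demo data.
df_demo = pd.DataFrame({
    "smoker": [0, 1, 0, 1],
    "children": [0, 2, 1, 3],
    "height_m": [1.62, 1.80, 1.75, 1.68],
    "profession": ["Engineer", "Doctor", "Artist", "Lawyer"],
})

kinds = {name: variable_kind(df_demo[name]) for name in df_demo.columns}
print(kinds)
```

A guess like this should be checked against the meaning of each variable before choosing an analysis method.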

## Data Collection

Data collection is a crucial phase in research where the researcher/data scientist gathers information to be analyzed to address a problem.

## Python Libraries for Data Manipulation

- **numpy**: For supporting multi-dimensional arrays.
- **matplotlib**: For data visualization.
- **pandas**: For data analysis.
- **seaborn**: For making statistical graphics more aesthetically pleasing.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate some random data for demonstration
np.random.seed(42)
data = {
    'Gender': np.random.choice(['Male', 'Female'], size=100),
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Profession': np.random.choice(['Engineer', 'Doctor', 'Artist', 'Lawyer'], size=100)
}

# Create a DataFrame
df = pd.DataFrame(data)

df
```

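|  | Gender | Age | Salary | Profession |
|---|---|---|---|---|
| 0 | Male | 35 | 111734 | Artist |
| 1 | Female | 43 | 105450 | Artist |
| 2 | Male | 61 | 52299 | Lawyer |
| 3 | Male | 51 | 73585 | Lawyer |
| 4 | Male | 27 | 94044 | Doctor |
| ... | ... | ... | ... | ... |
| 95 | Female | 60 | 69384 | Engineer |
| 96 | Female | 46 | 77254 | Lawyer |
| 97 | Female | 53 | 51918 | Artist |
| 98 | Female | 30 | 115981 | Lawyer |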
