## Data Management for Machine Learning: Concepts and Code

## Data Models 

- a data model describes how data is represented in terms of attributes and relationships
    - e.g. a car can be represented by attributes such as make, model, year, and color, as well as by its owner and license plate
- applications built by layering one data model on top of another
- many different data models exist, which
    - embody different assumptions about the data
    - are suited to different types of applications
- the choice of data model affects how data is stored, queried, and updated
    - and therefore how systems are built, which problems can be solved, and what the performance characteristics are
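
As a minimal sketch of the car example above, the same facts can be modeled as plain attributes plus a relationship to a separate owner entity. The class and field names (`Car`, `Owner`, `owner_id`) and the sample values are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    owner_id: int
    name: str


@dataclass(frozen=True)
class Car:
    # Plain attributes of the car itself
    make: str
    model: str
    year: int
    color: str
    license_plate: str
    # Relationship: refers to an Owner by key instead of embedding it
    owner_id: int


owners = {1: Owner(owner_id=1, name="Ada")}
car = Car("Toyota", "Corolla", 2015, "blue", "ABC-123", owner_id=1)

# Following the relationship means looking the owner up by key
owner_name = owners[car.owner_id].name
print(owner_name)
```

Embedding the owner inside `Car` instead would be a different data model for the same facts, with different trade-offs for querying and updating.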

### Relational Data Model
- Edgar Codd, 1970
- data is represented as a collection of relations (tables)
    - each relation has a set of named attributes (columns)
    - each tuple (row) has a value for each attribute
    - rows and columns are unordered: shuffling either one still gives the same relation
    - often stored in file formats such as CSV or Parquet
- normalization
    - process of decomposing relations with anomalies into smaller, well-structured relations (1NF, 2NF, 3NF, BCNF, etc)
    - reduces redundancy and improves data integrity
    - downside: data is spread across multiple relations, and joining them back together can be expensive for large tables
- databases built around relational model are called relational databases
    - most common type of database
    - SQL is the most common language for querying and manipulating data in relational databases
    - examples: MySQL, PostgreSQL, SQLite, Oracle, Microsoft SQL Server, IBM DB2
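
The normalization trade-off can be sketched with Python's built-in `sqlite3` module. The `publishers` and `books` tables and their contents are made up for illustration. Publisher details live in one row, so one update fixes them everywhere, but reading a complete record needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: publisher details live in exactly one row
    CREATE TABLE publishers (
        publisher_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        publisher_id INTEGER NOT NULL REFERENCES publishers(publisher_id)
    );
    INSERT INTO publishers VALUES (1, 'Example Press', 'UK');
    INSERT INTO books VALUES (1, 'Book A', 1), (2, 'Book B', 1);
    """
)

# One UPDATE corrects the country for every book of that publisher
conn.execute("UPDATE publishers SET country = 'US' WHERE publisher_id = 1")

# The cost: reading a full record now requires a join
rows = conn.execute(
    """
    SELECT b.title, p.name, p.country
    FROM books AS b
    JOIN publishers AS p ON p.publisher_id = b.publisher_id
    ORDER BY b.book_id
    """
).fetchall()
print(rows)
conn.close()
```

In a denormalized single table, the publisher name and country would be repeated on every book row, and the same correction would have to touch each of them.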

#### SQL 
- is a declarative language
- user specifies what data they want, not how to get it
    - tables, conditions, transformations such as joins and aggregations
- query optimizer determines how to execute query
    - which tables to read, which indexes to use, etc
    - how to break query into smaller subqueries, order of operations, etc
- the relational model has been generalized to many use cases but is still restrictive: data must follow a strict schema, and schema changes are expensive
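
A small sketch of the declarative style, again using `sqlite3` with a made-up `rides` table: the query states which result is wanted, and the engine decides how to get it. SQLite's `EXPLAIN QUERY PLAN` shows the plan the optimizer picked. The exact plan text varies between SQLite versions, so it is printed here rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("Oslo", 10.0), ("Oslo", 14.0), ("Lima", 6.0)],
)
conn.execute("CREATE INDEX idx_rides_city ON rides (city)")

query = "SELECT city, AVG(fare) FROM rides WHERE city = ? GROUP BY city"

# Declarative: we state the result we want, not the access path
result = conn.execute(query, ("Oslo",)).fetchall()
print(result)

# Inspect the plan chosen by the optimizer (whether the index is used is its decision)
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Oslo",)).fetchall()
for row in plan:
    print(row)
conn.close()
```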

### NoSQL
- non-relational databases
- retroactively reinterpreted as "not only SQL"
- data is stored in a variety of ways
    - key-value / document stores
        - targets use cases where data comes in self-contained documents
        - each document is a single continuous string, encoded as JSON, XML, or a similar format
        - each document has a unique key that is used to retrieve it
    - wide-column stores
        - targets use cases where data is stored in sparse tables, with many columns
        - each row has a unique key, but unlike key-value stores, each row can have different columns
        - each column has a name and a value
        - examples: Cassandra, HBase, BigTable
    - graph databases
        - targets use cases where relationships between data entities are common and important
- does not enforce a schema
    - misleading, as schema is still assumed by the reader of the data
    - shifts the burden of ensuring data integrity from the database to the application
- document stores often have better locality than relational databases
    - related data can be stored together in one document and retrieved in one operation
    - can be faster than relational databases for some use cases
    - but difficult to execute joins over data from different entities
- examples: MongoDB, Cassandra, HBase, Neo4j, Redis, CouchDB
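
The points above about document stores, locality, and the missing enforced schema can be illustrated with a toy in-memory store. This is only a sketch using a Python dict and `json`. The `put` and `get` helpers and the key format are hypothetical, not the API of any real database.

```python
import json

# A toy in-memory "document store": unique key -> JSON-encoded document
store = {}


def put(key, document):
    store[key] = json.dumps(document)


def get(key):
    return json.loads(store[key])


# Documents in the same store need not share the same fields
put("book:1", {"title": "Book A", "author": "A. Writer", "price": 12.5})
put("book:2", {"title": "Book B", "tags": ["ml", "data"]})

# Locality: everything about one book comes back in a single lookup
book = get("book:1")

# Schema-on-read: the reader must cope with fields that may be missing
prices = {key: get(key).get("price") for key in store}
print(book["title"], prices)
```

Nothing stops a writer from omitting `price`, so the check for missing fields has moved from the database into the application code that reads the data.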

| Data Type | Definition | Examples | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| Structured Data | This type of data is organized in a highly systematic and predictable manner. It is usually stored in relational databases and can be efficiently queried using a language like SQL. | Databases, CSV files, Excel spreadsheets. | Easy to store, search, and analyze. High accuracy and reliability. | Lack of flexibility. Not suitable for complex, hierarchical, or multi-dimensional data. |
| Semi-Structured Data | This type of data does not conform to the formal structure of data models, but contains tags and other markers to separate semantic elements. It is more flexible than structured data, and more organized than unstructured data. | XML, JSON, NoSQL databases. | More flexible than structured data. Can represent more complex and hierarchical relationships. | Less efficient to query and process than structured data. Requires more storage. |
| Unstructured Data | This type of data doesn't have a predefined model or is not organized in a predefined manner. It is typically text-heavy, but can also be in the form of images, videos, etc. | Emails, Word documents, PDFs, images, videos, web pages. | Highly flexible. Can represent any type of information. | Difficult to analyze and process. Requires advanced tools and algorithms, such as Natural Language Processing (NLP) for text, or Computer Vision for images and videos. |

### Data Warehouses and Data Lakes
- data warehouse
    - database that is optimized for analytics
    - typically used to store structured data
    - often used for reporting and dashboarding
    - examples: Amazon Redshift, Google BigQuery, Snowflake
- data lake
    - repository for structured and unstructured data
    - typically used for storing large amounts of raw data before processing
    - examples: Amazon S3, Google Cloud Storage, Hadoop Distributed File System (HDFS)
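
A rough local analogy of the two roles, under loud assumptions: a temporary directory with a JSON Lines file stands in for the lake, and an in-memory SQLite table stands in for the warehouse. The file name, event fields, and table layout are made up, and real lakes and warehouses differ in far more than this.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

raw_events = [
    {"user_id": "u1", "amount": "19.99", "note": "first order"},
    {"user_id": "u2", "amount": "5.00"},
    {"user_id": "u1", "amount": "not-a-number"},  # bad record still lands in the lake
]

# "Warehouse" stand-in: a typed table that only receives cleaned rows
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (user_id TEXT NOT NULL, amount REAL NOT NULL)")

with tempfile.TemporaryDirectory() as lake_dir:
    # "Lake" stand-in: raw events dumped as-is, no schema enforced
    raw_file = Path(lake_dir) / "events.jsonl"
    raw_file.write_text("\n".join(json.dumps(e) for e in raw_events))

    for line in raw_file.read_text().splitlines():
        event = json.loads(line)
        try:
            amount = float(event["amount"])
        except (KeyError, ValueError):
            continue  # skip records that do not fit the warehouse schema
        warehouse.execute(
            "INSERT INTO orders VALUES (?, ?)", (event["user_id"], amount)
        )

# Analytics query of the kind a report or dashboard might run
totals = warehouse.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)
warehouse.close()
```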

## Data Management
- about transforming data into a format that is more convenient to work with for later stages of the pipeline
- apply data transformations to make data easier to work with
    - filtering, aggregating, joining, sorting, etc
- train models, anonymize data, etc
- delete data that is no longer needed
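
The transformations listed above can be sketched on plain Python lists and dicts. The order records, the country lookup, and the threshold of 10 are made-up illustrations. The anonymization step uses a bare SHA-256 digest, which is only pseudonymization; a real pipeline may need salting or stronger techniques.

```python
import hashlib
from collections import defaultdict

orders = [
    {"order_id": 1, "user": "ann@example.com", "amount": 30.0},
    {"order_id": 2, "user": "bob@example.com", "amount": 5.0},
    {"order_id": 3, "user": "ann@example.com", "amount": 20.0},
]
countries = {"ann@example.com": "NO", "bob@example.com": "PE"}

# Filtering: keep only orders of at least 10
large = [o for o in orders if o["amount"] >= 10]

# Joining: attach each user's country from the lookup table
joined = [{**o, "country": countries[o["user"]]} for o in large]

# Aggregating: total amount per country
totals = defaultdict(float)
for row in joined:
    totals[row["country"]] += row["amount"]

# Sorting: largest total first
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Anonymizing: replace the email with a truncated SHA-256 digest
anonymized = [
    {**o, "user": hashlib.sha256(o["user"].encode()).hexdigest()[:12]}
    for o in orders
]
print(ranked)
```

The same user maps to the same digest, so per-user aggregation still works on the anonymized records.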

### Multi-phases
1. creation
    - data is created by some process, often outside of our control, and captured in some storage system
    - some are:
        - static (infrequently updated) eg. photo recognition dataset
        - dynamic (frequently updated / real-time) eg. stock market data
    - may be structured, semi-structured, or unstructured
        - requires different tools and techniques in each case
    - may require augmentation, to make it more useful
        - eg. adding labels to images, adding timestamps to stock market data    
2. ingestion
    - filtering / selection / sampling may be done
        - we may not necessarily want to keep all the data
        - sampling may lose some detail, but may be necessary for performance reasons
        - it is a trade-off between the cost in model quality and the savings in time and money
    - may be simple or complex
        - dumping data into a database, or running a complex ETL pipeline
    - may be done in real-time or in batches
    - reliability concerns focus on correctness and throughput
        - correctness: is the data being ingested correctly?
        - throughput: how fast can we ingest data?
    - monitoring existence and condition of data before and during ingestion is the most difficult part of the process
3. processing (validation, cleaning, enrichment)
4. post-processing (data management, storage, analysis, visualization)
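
The ingestion concerns above (sampling, correctness, throughput) can be combined in one small sketch. It validates each record, keeps a uniform random sample of the valid ones with reservoir sampling (Algorithm R), and reports simple counters. The `ingest` function, the "value must be numeric" rule, and the synthetic stream are assumptions made for illustration.

```python
import random
import time


def ingest(records, sample_size, seed=0):
    """Validate records and keep a uniform random sample of the valid ones."""
    rng = random.Random(seed)
    sample, valid, rejected = [], 0, 0
    start = time.perf_counter()
    for record in records:
        # Correctness check: hypothetical rule that the value must be numeric
        if not isinstance(record.get("value"), (int, float)):
            rejected += 1
            continue
        valid += 1
        # Reservoir sampling: once the reservoir is full, the new record
        # replaces a random slot with probability sample_size / valid
        if len(sample) < sample_size:
            sample.append(record)
        else:
            j = rng.randrange(valid)
            if j < sample_size:
                sample[j] = record
    elapsed = time.perf_counter() - start
    # Throughput: records seen per second (guard against a zero timer reading)
    throughput = (valid + rejected) / elapsed if elapsed > 0 else float("inf")
    stats = {"valid": valid, "rejected": rejected, "records_per_s": throughput}
    return sample, stats


# Synthetic stream: every tenth record carries a missing value
stream = ({"value": i if i % 10 else None} for i in range(1000))
sample, stats = ingest(stream, sample_size=5)
print(stats["valid"], stats["rejected"], len(sample))
```

Reservoir sampling suits real-time ingestion because it needs a single pass and does not require knowing the stream length in advance. For batch ingestion with a known size, `random.sample` would be simpler.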
