### 10Apr25
#### Database Normalization and Algorithms
- designing a database
    - generative AI is actually pretty good at identifying potential data properties and generating test data
    - what makes a bad relation?
        - attributes that don't belong together
        - attributes with lots of NULL values
    - general guidelines
        - each tuple should represent a single entity or relationship instance
            - the schema should be easily explained relation by relation
            - attributes should be easy to interpret
            - foreign keys should be used to refer to other entities
                - entity and relationship attributes should be kept separate as much as possible
        - design a schema that avoids anomalies from insert/delete/updates
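            - consider `EMP_PROJ(Emp#, Proj#, Hours)` when employee and project details (names, locations) are stored in the same relation
                - you cannot insert a project without an employee assigned to it
                - deleting a project also deletes any employee who works only on that project
                - updating a project's details requires updating the tuple of every employee associated with it
                    - inefficient, and a missed tuple leaves the data inconsistent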
        - relations should be designed to minimize NULL values in their tuples
            - attributes that are frequently NULL may be better off in a separate relation
        - relations should be designed to satisfy lossless join
            - no spurious tuples should be generated when joining relations
            - non-additive (lossless) joins
                - **Lossless Join Property:** a decomposition of R into R1 and R2 is lossless if joining R1 and R2 returns exactly the tuples of R, with nothing missing and nothing added
                    - **exam question^**
                - **Lossy Join:** a decomposition of R into R1 and R2 is lossy if joining R1 and R2 returns tuples that were not in R
                    - **exam question^**
                - if you decompose a relation and join the pieces, you should get back the original relation
                - any extra tuple that appears in the join but not in the original relation is a spurious tuple
            - preservation of dependencies
                - every FD of the original relation should still be enforceable on the decomposed relations without having to join them
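
A minimal sketch of how a lossy decomposition produces spurious tuples. The relation `R(emp, proj, location)`, its two rows, and the helper names `project` and `natural_join` are all invented for illustration; they are not taken from the lecture.

```python
# Toy instance of R(emp, proj, location); the data is invented for illustration.
R = {
    ("alice", "p1", "Houston"),
    ("bob", "p2", "Houston"),
}
ATTRS = ("emp", "proj", "location")


def project(rows, attrs, keep):
    """Project a set of tuples onto the attributes in `keep` (duplicates collapse)."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}


def natural_join(r1, attrs1, r2, attrs2):
    """Natural join of two relations on the attribute names they share."""
    shared = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[attrs1.index(a)] == t2[attrs2.index(a)] for a in shared):
                out.add(t1 + tuple(t2[attrs2.index(a)] for a in extra))
    return out, tuple(attrs1) + tuple(extra)


# Decompose on `location`, which is not a key of either piece.
r1 = project(R, ATTRS, ("emp", "location"))
r2 = project(R, ATTRS, ("proj", "location"))
joined, joined_attrs = natural_join(r1, ("emp", "location"), r2, ("proj", "location"))

# Reorder the joined columns to match R before comparing.
joined = project(joined, joined_attrs, ATTRS)
spurious = joined - R
print(sorted(spurious))
# [('alice', 'p2', 'Houston'), ('bob', 'p1', 'Houston')]
```

The join still contains every original tuple, but it also pairs each employee with the other employee's project. Those extra rows are the spurious tuples.
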
- how to do this? Normalization
    - multiple normal forms
        - these forms are not mutually exclusive
        - dictate how to design a database
        - each normal form builds on the previous one
        - progressively refine the design to minimize redundancy and improve integrity of data
        - help to prevent anomalies from poorly structured relations and data
        - 1NF, 2NF, 3NF, BCNF
            - 3NF is the most commonly used in industry
            - BCNF is stronger and more restrictive
                - it is easier to define than 3NF
            - a well-designed ER model will usually yield relational tables in BCNF or even 4NF
    - how?
        - refining the schema
        - decomposing relations with undesirable properties into smaller relations
            - these follow a series of rules called normal forms
    - objectives
        - reduce data redundancy
            - minimize storage space
        - improve data integrity
            - ensure accuracy, consistency, and reliability of data by preventing anomalies
        - simplify maintenance
            - reducing complexity and risk for error
    - determining the normal form:
        - the form is determined by the **keys** (the primary key and any other candidate keys) and the **functional dependencies**
            - Functional dependencies are the relationships between attributes in a relation
        - i.e. you must understand how one piece of data relates to another within the entity
        - the key helps determine if there are partial or transitive dependencies
            - partial dependency: a functional dependency where a non-prime attribute is functionally dependent on part of a candidate key
                - i.e. if the primary key is a composite key, then a non-prime attribute is dependent on only part of the composite key
                - this is bad because it means that the relation is not in 2NF
            - transitive dependency: a functional dependency where a non-prime attribute is functionally dependent on another non-prime attribute, which in turn depends on the key
                - i.e. key -> X and X -> A, where X is not a key, so A depends on the key only through X
                - this is bad because it means that the relation is not in 3NF
        - the functional dependencies help identify potential redundancies and other areas of concern
            - i.e. if the same data is stored in multiple places, then it is redundant
            - FD e.g.
                - ZIP_CODE -> CITY means:
                    - for every ZIP_CODE, there is a unique CITY
                    - if two records have the same ZIP_CODE, they must have the same CITY
                    - a zip code identifies exactly one city (a city may still have many zip codes)
                    - the city is functionally dependent on the zip code
                    - knowing the zip code allows you to determine the city
                    - attribute ZIP_CODE functionally determines CITY
                    - attribute CITY is functionally dependent on ZIP_CODE
    - determining the normal form
        - more to follow
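
A small sketch of checking whether an FD such as `ZIP_CODE -> CITY` is violated by a given set of rows. The function name `fd_holds` and the sample rows are hypothetical. A table instance can only refute an FD; whether the FD truly holds is a statement about the real-world meaning of the attributes.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on `lhs` also agrees on `rhs`."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen and returns the stored value afterwards
        if seen.setdefault(x, y) != y:
            return False
    return True


# Invented sample rows; the ZIP/city pairs are illustrative only.
addresses = [
    {"ZIP_CODE": "10001", "CITY": "New York"},
    {"ZIP_CODE": "10002", "CITY": "New York"},
    {"ZIP_CODE": "60601", "CITY": "Chicago"},
]

print(fd_holds(addresses, ["ZIP_CODE"], ["CITY"]))  # True: no ZIP maps to two cities
print(fd_holds(addresses, ["CITY"], ["ZIP_CODE"]))  # False: New York has two ZIP codes
```

The second call shows that the dependency is directional: the ZIP code determines the city, but the city does not determine the ZIP code.
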
- functional dependencies
    - constraints derived from the meaning and interrelationships of data attributes
    - a set of attributes X functionally determines a set of attributes Y if each value of X is associated with exactly one value of Y
        - i.e. if you know the value of X, you can determine the value of Y
    - X -> Y holds if all tuples that agree on X also agree on Y
    - an FD is like a partial key
        - it uniquely determines some but not all attributes
    - a key is a special case of an FD
        - a key uniquely determines all attributes in the relation
    - these are derived from real-world constraints
        - e.g. a student ID uniquely identifies a student
        - e.g. a course code uniquely identifies a course
    - e.g.
        -   ```sql
            Courses (cid, title, prof, office)
            prof -> office
            ```
    -   e.g. within the company database
        - ```sql
            project_number -> {project_name, project_location}
            {SSN, project_number} -> hours
            ```
    - keys vs FDs
        - see slides around 25
        - e.g. what FDs hold in the following relation?
            - `Offerings (Course_id, Teacher_id, Hour, Room, Stu_id, Grade)`


### 17Apr25
- examples
    - syntax
        -   ```
            R(a,b,c)
            K
            FD F = {a->b, b->c, ...}
            ```
    - if `a` functionally determines `b` and `c`, you can also write `a->b,c` or `a->bc`
    - if a functional dependency violates a normal form, you can decompose the relation
        - i.e. split the relation into two or more relations
- decomposition must be lossless
    - lossy decomposition is worse than normal form violations
    - lossy $\implies R \subset R_1 \bowtie R_2$
        - i.e. the join of the two relations contains every tuple of the original relation plus spurious tuples that were never in it
    - lossless $\implies R_1 \bowtie R_2 = R$
        - i.e. the join of the two relations is equal to the original relation
            - nothing is lost and nothing is added in the join
    - you can test for losslessness after performing a decomposition
        - compute the intersection of the attribute sets of the two relations
        - if $R_1 \cap R_2$ is a key for either relation (i.e. it functionally determines all of $R_1$ or all of $R_2$), then the decomposition is lossless
            - e.g. `R(A,B,C,D,E)` with `FDs = {A->B, B->C, A->D, D->E}`
                - **likely exam question**
                - decompose into `R1(A,B,C)` and `R2(B,D,E)`
                - if $R_1 \cap R_2 = B$
                    - does `B` functionally determine all of $R_1$ or all of $R_2$?
                        - no, `B` only functionally determines `C`
                    - `B` is not a key for either relation
                    - $\implies$ the decomposition is lossy
                - if the decomposition instead shares `A`, e.g. `R1(A,B,C)` and `R2(A,D,E)`, then $R_1 \cap R_2 = A$
                    - does `A` functionally determine all of $R_1$ or all of $R_2$?
                        - yes, `A` functionally determines `B`, and `C` through `B->C`
                    - `A` is a key for `R1`
                    - $\implies$ the decomposition is lossless
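
The binary losslessness test can be sketched with an attribute-closure helper. The function names `closure` and `is_lossless` are hypothetical, and the second decomposition `R1(A,B,C)`, `R2(A,D,E)` is an assumed example of a split that shares `A`; the FDs are the ones from the example.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """Binary test: the shared attributes must determine all of r1 or all of r2."""
    common = set(r1) & set(r2)
    reach = closure(common, fds)
    return set(r1) <= reach or set(r2) <= reach


# FDs of R(A,B,C,D,E), written as (lhs, rhs) strings of single-letter attributes.
F = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E")]

print(sorted(closure("B", F)))       # ['B', 'C']
print(is_lossless("ABC", "BDE", F))  # False: B reaches only {B, C}
print(is_lossless("ABC", "ADE", F))  # True: A reaches every attribute
```

This test covers only a split into two relations; a decomposition into more pieces needs a more general procedure.
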
- trivial functional dependencies
    - the right-hand side is contained in the left-hand side, so the FD always holds
        - e.g. `A->A`, `AB->A`
- closure of an FD set
    - it may be possible to imply other FDs from a set of FDs
    - e.g. `R(A,B,C)` with `FDs = {A->B, B->C}`
        - `A->C` is implied by `A->B` and `B->C`
        - $F^+$ denotes the closure of $F$
            - i.e. $F^+$ is the set of all FDs that can be derived from $F$
    - derived using the Armstrong Axioms
        - reflexivity
            - if $Y \subseteq X$, then $X \to Y$
        - augmentation
            - if $X \to Y$, then $XZ \to YZ$
        - transitivity
            - if $X \to Y$ and $Y \to Z$, then $X \to Z$
    - examples
        -   ```
            R(A,B,C,G,H,I)
            F = {A->B, A->C, CG->H, CG->I, B->H}
            ```
            - some other members of $F^+$
                - `A->H`
                    - by transitivity from `A->B` and `B->H`
                - `AG->I`
                    - by augmentation from `A->C` to get `AG->CG`
                    - then by transitivity with `CG->I` (pseudo-transitivity reaches the same result in one step from `A->C` and `CG->I`)
                    - hence `AG->CG->I`
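    - algorithm for computing $F^+$
        - start with $F^+ = F$
        - repeat until $F^+$ no longer changes:
            - for each FD $f \in F^+$:
                - add all FDs generated from $f$ by reflexivity (if $Y \subseteq X$, then $X \to Y$)
                - add all FDs generated from $f$ by augmentation (if $X \to Y$, then $XZ \to YZ$)
            - for each pair of FDs $f_1, f_2 \in F^+$:
                - if they can be combined by transitivity (if $X \to Y$ and $Y \to Z$, then $X \to Z$), add the resulting FD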
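
Enumerating all of $F^+$ grows exponentially, so in practice membership is usually tested through the attribute closure instead: `X->Y` is in $F^+$ exactly when `Y` lies inside the closure of `X`. The sketch below swaps in that technique rather than the full enumeration; the function names `attribute_closure` and `implies` are hypothetical, and the FDs are those of the `R(A,B,C,G,H,I)` example.

```python
def attribute_closure(attrs, fds):
    """Repeatedly apply FDs whose left side is already covered until nothing changes."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """X -> Y is in F+ exactly when Y is contained in the closure of X."""
    return set(rhs) <= attribute_closure(lhs, fds)


# F for R(A,B,C,G,H,I), written as (lhs, rhs) strings of single-letter attributes.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(attribute_closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
print(implies(F, "A", "H"))   # True, matches the transitivity argument
print(implies(F, "AG", "I"))  # True, matches the augmentation argument
print(implies(F, "A", "I"))   # False, A alone never reaches G
```

Because the closure of `AG` is every attribute of `R`, the same run also shows that `AG` is a key of this relation.
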




- determining the normal form of relations
    - **likely test questions**
    - testing for 2NF
        - definition: no non-prime attribute is partially dependent on any candidate key
            - i.e. no partial dependencies
        - check: for any `X->A`
            - is A a prime attribute of R?
                - prime: an attribute that is part of a key
            - is X a partial key of R (a proper subset of a candidate key)?
            - if A is not prime and X is a partial key, decompose R
    - testing for 3NF
        - definition: all non-prime attributes depend only on the candidate keys
            - i.e. no transitive dependencies
        - check: for any `X->A`
            - is X a key (superkey) of R?
            - is A a prime attribute of R?
            - if the answer to both is no, decompose R
    - testing for BCNF
        - definition: for every nontrivial FD `X->A`, X is a key (superkey) of R
        - check: for any `X->A`
            - is X a key of R?
                - if no, decompose R
    - e.g. relation
        - given `R(A1, A2, A3, ..., An)`
        - given $K \subseteq \{A_1, A_2, A_3, \ldots, A_n\}$ is a key of $R$
        - given R has a set of FDs of form `X->A` where `X` and `A` are subsets of $\{A_1, A_2, A_3, \ldots, A_n\}$
        - Question: is R in 2NF, 3NF, BCNF? If not, transform to the desired form.
            - check slides
    - achieving 2NF
        - **likely exam question**
        - consider `R(A,B,C,D,E)`, key = `{A,B}`, FDs = `{B->D}`
            - check steps
                - B determines D
                - B is not the whole key
                - D is not prime
                - hence `B->D` is a partial dependency
                - **R is not in 2NF**
            - decompose
                - split into `R1(A,B,C,E)` and `R2(B,D)`
                    - `{A,B}` is the key for `R1`
                    - `B` is the key for `R2`
        - consider
            -   ```
                Relation Employee_Project(SSN, P_number, Hours, E_name, P_name, P_location)
                Key:{SSN, P_number}
                ```
            - check steps
                - look at the dependencies
                    - FD1: `SSN,P_number -> Hours`
                    - FD2: `SSN -> E_name`
                    - FD3: `P_number -> P_name, P_location`
                        <img src="images/2NF_fail.png">
                        - possible anomalies
                            - deleting the last employee attached to a project deletes the project
                - decompose
                    - project into `Employee_Project(SSN, P_number, Hours)`, `Project(P_number, P_name, P_location)`, and `Employee(SSN, E_name)`
                        - now you can delete an employee without deleting the project
                        - retrieve the original table by joining the three relations on `SSN` and `P_number`
                    <img src="images/decomposition.png">
    - achieving 3NF
        - consider a relation in 2NF
            - i.e. no non-prime attributes are partially dependent on any candidate key
        - recall the test for 3NF
            - if `X->A` is a FD, then at least one of the following must hold:
                - `X` is a key of `R`
                - `A` is a prime attribute of `R`
        - consider
            -   ```
                Employee_Department(E_name, SSN, Address, D_num, D_name, M_SSN)
                Key: {SSN}
                ```
                <img src="images/3NF_fail.png">
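                - `SSN->E_name, Address, D_num`
                - `D_num->D_name, M_SSN`
                    - `D_num` is not a key and `D_name`, `M_SSN` are not prime, so they depend on `SSN` only transitively: **not in 3NF**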
            - decompose
                - project into `Employee_Department(SSN, E_name, Address, D_num)`, `Department(D_num, D_name, M_SSN)`
                    - join the two relations on `D_num` to get the original relation
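
The 2NF, 3NF, and BCNF checks above can be sketched as one classification pass over the FDs. The function name `violations` and the verdict strings are hypothetical. The sketch assumes every candidate key is supplied by the caller, and it inspects only the FDs it is given, not the full closure $F^+$.

```python
def violations(fds, candidate_keys):
    """Report, for each listed FD X -> A, the lowest normal form it breaks.

    `fds` is a list of (lhs, rhs) pairs of attribute-name sets.
    `candidate_keys` must list every candidate key (assumed given, not computed).
    """
    keys = [frozenset(k) for k in candidate_keys]
    prime = set().union(*keys)  # attributes that appear in some candidate key
    report = []
    for lhs, rhs in fds:
        lhs = frozenset(lhs)
        if any(k <= lhs for k in keys):
            continue  # X is a superkey: passes BCNF, so also 3NF and 2NF
        for a in sorted(set(rhs) - lhs):  # ignore the trivial part of the FD
            if a in prime:
                verdict = "violates BCNF only"
            elif any(lhs < k for k in keys):
                verdict = "violates 2NF (partial dependency)"
            else:
                verdict = "violates 3NF (transitive dependency)"
            report.append((sorted(lhs), a, verdict))
    return report


# R(A,B,C,D,E) with key {A,B} and FD B -> D
print(violations([({"B"}, {"D"})], [{"A", "B"}]))

# Employee_Department with key {SSN}
emp_dept = [
    ({"SSN"}, {"E_name", "Address", "D_num"}),
    ({"D_num"}, {"D_name", "M_SSN"}),
]
for v in violations(emp_dept, [{"SSN"}]):
    print(v)
```

The first call flags `B->D` as a partial dependency, and the second flags `D_num->D_name` and `D_num->M_SSN` as transitive dependencies, matching the two worked examples.
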


- e.g. the LOTS relation
    - a lot is a piece of land which someone owns
    - consider a state with two counties, Earp and Kidd
    - both counties keep records about lots
    - each county had internal lot numbering
    - later the state took over lot management
    - each lot was given a state lot number
    - now LOTS has a primary key ID, the state lot number
        - it also has a second candidate key, the combination of the county and the county lot number
    - each county still needs to know its lot number
    <img src="images/LOTS_Relation.png">

        - if ID# were the only key, LOTS would be in 2NF, since a single-attribute key cannot have partial dependencies