# Joins and Linear Algebra

In [None]:
!pip install pandas numpy

In [None]:
import pandas as pd
import numpy as np

## Inner Join
![Inner join](images/inner_join.png)
- Combines rows from two tables using a related column (called a key)
- Returns only rows where a match exists in both tables
  - Example: If you join two tables on a date column, only rows where the date appears in both tables are included
- Excludes any rows that do not have a matching value in the other table
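
The date example above can be sketched with `pd.merge`. The two small tables here (`left`, `right`, and their `sales` and `visitors` columns) are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration only
left = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "sales": [100, 150, 90],
})
right = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "visitors": [40, 35, 50],
})

# how="inner" keeps only the dates that appear in BOTH tables
inner = pd.merge(left, right, on="date", how="inner")
print(inner)
```

Only 2024-01-02 and 2024-01-03 appear in both tables, so those are the only rows in the result.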

## Your Turn! Inner Join

- Load the Student and Enrollment CSV files into pandas DataFrames
- Inspect both tables and identify the column they have in common
- Create an inner join between the two tables using that column
- Display the result and observe which rows were included and which were excluded

# Left Join
![Left Join](images/left_join.png)
- Returns all records from the left table
- Only shows matching rows from the right table
- Displays NaN in the right table's columns where there is no matching row in the right table
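
A minimal sketch of a left join, assuming two tiny hypothetical tables (`students_demo` and `enrollments_demo` are stand-ins, not the course CSV files):

```python
import pandas as pd

# Hypothetical stand-in tables for illustration
students_demo = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
enrollments_demo = pd.DataFrame({
    "student_id": [2, 3, 4],
    "course": ["Math", "Art", "Bio"],
})

# how="left" keeps every row of the left table (students_demo)
left_join = pd.merge(students_demo, enrollments_demo, on="student_id", how="left")
print(left_join)
```

Student 1 has no enrollment, so its `course` value is missing. Student 4 exists only in the right table and is dropped.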


# Right Join
![Right Join](images/right_join.png)
- Returns all records from the right table
- Only shows matching rows from the left table
- Displays NaN in the left table's columns for right table rows without a match
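
One way to think about a right join is as a left join with the tables swapped. The sketch below checks that on two hypothetical tables (`students_demo` and `enrollments_demo` are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in tables for illustration
students_demo = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
enrollments_demo = pd.DataFrame({
    "student_id": [2, 3, 4],
    "course": ["Math", "Art", "Bio"],
})

# how="right" keeps every row of the right table (enrollments_demo)
right_join = pd.merge(students_demo, enrollments_demo, on="student_id", how="right")

# Same rows as a left join with the tables swapped; only the column order differs
swapped_left = pd.merge(enrollments_demo, students_demo, on="student_id", how="left")

print(right_join)
print(swapped_left[right_join.columns])
```

Enrollment 4 has no matching student, so its `name` value is missing in both results.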

# Your Turn! Right and Left Join
- Perform a right join on enrollments and students
- Perform a left join on enrollments and students

# Outer Join
![Outer or Full Join](images/outer_join.png)
- Returns all rows from both tables
- Rows that do not match will have NaN in the columns that come from the other table
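
To see where each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch on hypothetical tables (the table names and values are invented):

```python
import pandas as pd

# Hypothetical stand-in tables for illustration
students_demo = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
enrollments_demo = pd.DataFrame({
    "student_id": [2, 3, 4],
    "course": ["Math", "Art", "Bio"],
})

# how="outer" keeps every row from both tables;
# indicator=True labels each row as left_only, right_only, or both
outer = pd.merge(
    students_demo, enrollments_demo,
    on="student_id", how="outer", indicator=True,
)
print(outer)
```

Rows labeled `left_only` or `right_only` are the ones an inner join would have excluded.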

# Your Turn: Outer Join
- Create an outer join on enrollments and students

# Concatenate vs Merge

**Merge**
- Combines two DataFrames based on columns or indexes

**Concatenate**
- Stacks DataFrames vertically or horizontally without needing a key


# Vertical Concatenation (Rows)

- Used to add more rows to an existing DataFrame
- Combines DataFrames by stacking them on top of each other
- Missing values are filled with NaN if a column exists in only one DataFrame

In [None]:
patients2 = pd.DataFrame({
    "patient_id": [1015, 1016, 1017, 1018, 1019],
    "first_name": ["Daniel", "Sofia", "Malik", "Hannah", "Victor"],
    "last_name": ["Lopez", "Anderson", "Brown", "Kim", "Alvarez"],
    "date_of_birth": ["1989-02-14", "1993-07-22", "1976-11-05", "2000-01-30", "1984-09-18"],
    "phone": [
        "206-555-0311",
        "425-555-0322",
        "253-555-0333",
        "206-555-0344",
        "425-555-0355"
    ],
    "city": ["Seattle", "Bellevue", "Tacoma", "Seattle", "Redmond"]
})


In [None]:
# What if the DataFrames have different features (columns)?
patients2 = pd.DataFrame({
    "patient_id": [1015, 1016, 1017, 1018, 1019],
    "first_name": ["Daniel", "Sofia", "Malik", "Hannah", "Victor"],
    "last_name": ["Lopez", "Anderson", "Brown", "Kim", "Alvarez"],
    "date_of_birth": ["1989-02-14", "1993-07-22", "1976-11-05", "2000-01-30", "1984-09-18"],
    "phone": [
        "206-555-0311",
        "425-555-0322",
        "253-555-0333",
        "206-555-0344",
        "425-555-0355"
    ],
    "city": ["Seattle", "Bellevue", "Tacoma", "Seattle", "Redmond"],
    "insurance_provider": [
        "Premera",
        "Kaiser Permanente",
        "Regence",
        "Aetna",
        "Cigna"
    ]
})
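
The case of differing features can be sketched on a smaller scale. Here `patients_demo` and `new_patients_demo` are hypothetical two-row stand-ins, and only the second one has an `insurance_provider` column:

```python
import pandas as pd

# Hypothetical, shortened stand-ins for an existing table and a new batch
patients_demo = pd.DataFrame({
    "patient_id": [1001, 1002],
    "city": ["Seattle", "Tacoma"],
})
new_patients_demo = pd.DataFrame({
    "patient_id": [1015, 1016],
    "city": ["Seattle", "Bellevue"],
    "insurance_provider": ["Premera", "Kaiser Permanente"],
})

# Stack rows; ignore_index=True renumbers the index 0..n-1
all_patients = pd.concat([patients_demo, new_patients_demo], ignore_index=True)
print(all_patients)
```

Rows from `patients_demo` have no insurance information, so pandas fills that column with missing values for them.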


# Horizontal Concatenation (Columns)
- Adds additional columns (features) to an existing DataFrame
- Combines DataFrames side-by-side
- Rows are aligned by index label, not by row position, so rows whose index labels do not match are filled with NaN
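
Index alignment is easy to miss, so here is a small sketch using two hypothetical DataFrames (`a` and `b`) whose index labels only partly overlap:

```python
import pandas as pd

# Hypothetical DataFrames with deliberately different index labels
a = pd.DataFrame({"name": ["Ana", "Ben", "Cy"]}, index=[0, 1, 2])
b = pd.DataFrame({"score": [90, 85, 70]}, index=[1, 2, 3])

# axis=1 places the frames side by side, matching rows by index LABEL
by_label = pd.concat([a, b], axis=1)

# Resetting both indexes first pairs rows purely by position
by_position = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)

print(by_label)
print(by_position)
```

`by_label` has four rows with gaps where a label exists in only one frame, while `by_position` has three fully populated rows. Which one is right depends on whether the index carries meaning in your data.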

# Your Turn! Concatenation
- You are given a new DataFrame, `students2`
- Inspect the `students` and `students2` DataFrames
- Identify which column exists in `students2` but not in `students`
- Vertically concatenate the two DataFrames and store the result in a new variable called `all_students`
- Display `all_students`
- Horizontally concatenate the `students` and `enrollment` DataFrames and store the result in a variable called `horizontal_concat`
- Display `horizontal_concat`




In [None]:
students2 = pd.DataFrame({
    "student_id": [2013, 2014, 2015, 2016],
    "first_name": ["Diego", "Priya", "Ethan", "Maya"],
    "last_name": ["Santos", "Sharma", "Collins", "Desai"],
    "email": [
        "diego.santos@example.edu",
        "priya.sharma@example.edu",
        "ethan.collins@example.edu",
        "maya.desai@example.edu"
    ],
    "program": ["Computer Science", "Data Science", "Cybersecurity", "Information Technology"],
    "start_year": [2025, 2025, 2026, 2026],
    "gpa": [3.6, 3.9, 3.4, 3.8]
})

students2

### Column Space  
All possible linear combinations of the columns of a matrix.

In other words, the column space is every output you can create by scaling and adding the columns together.

---

### Rank  
- Rank is a nonnegative integer (0, 1, 2, …).  
- Each matrix has exactly one rank.  
- Rank tells you how many independent columns (or rows) the matrix has.  
- This also tells you the dimension of the column space.

---

### What does $A \in \mathbb{R}^{4 \times 3}$ mean?
- $A$ is a matrix of real numbers with 4 rows and 3 columns

### What does rank 3 mean?
- For a 4 × 3 matrix, rank 3 means that all 3 columns are linearly independent
- You can't create any one column by scaling and adding the others
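
Rank can be checked numerically with `np.linalg.matrix_rank`. The two 4 × 3 matrices below are made-up examples: one with independent columns, and one whose third column is the sum of the first two.

```python
import numpy as np

# Hypothetical 4x3 matrix with three independent columns
A_full = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])

# Hypothetical 4x3 matrix where column 3 = column 1 + column 2
A_dependent = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 2],
    [2, 0, 2],
])

rank_full = np.linalg.matrix_rank(A_full)
rank_dependent = np.linalg.matrix_rank(A_dependent)

print(A_full.shape, rank_full)
print(A_dependent.shape, rank_dependent)
```

Both matrices have the same shape, but the second has rank 2 because its third column adds no new direction.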

### What does $v \in C(A)$ mean?

- $C(A)$: the column space of matrix A, meaning all possible outputs that can be created from the columns of A  
- $\in$: means "is an element of" or "belongs to"  
- $v$: the output vector  

So $v \in C(A)$ means the vector **v can be created by combining the columns of A**.
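
One possible numerical check, offered as a sketch rather than the only method: $v$ is in $C(A)$ exactly when appending $v$ as an extra column does not increase the rank. The helper name `in_column_space` and the example values are hypothetical.

```python
import numpy as np

def in_column_space(A, v):
    """Return True if v is a linear combination of the columns of A.

    Appending v as a new column raises the rank only when v is
    NOT already reachable from the existing columns.
    """
    augmented = np.column_stack([A, v])
    return np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(A)

# Hypothetical 3x2 matrix
A = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])

v_inside = np.array([2, 3, 5])   # 2 * column 1 + 3 * column 2
v_outside = np.array([2, 3, 6])  # last entry breaks the pattern

print(in_column_space(A, v_inside))
print(in_column_space(A, v_outside))
```

Rank computations on floating-point data depend on a numerical tolerance, so treat this as a check for small, clean examples.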

---

### What's the point of this?

This tells us whether a linear regression model can actually produce a given output.

Linear regression works by combining the columns of A (the features) with weights. If the real output vector is in the column space of A, then the model can produce it exactly. If it is not in the column space, then the model cannot produce it exactly, and there will always be some error.

In other words:  
**Can the real output be created by combining the features?**

Let's say you are trying to predict salary from years of experience and level of education.

v represents the actual salaries (the real info from the data).

C(A) represents all possible salary predictions the model can produce using only years of experience and education.

If the salary vector v is in C(A), then the model can predict every salary exactly.
If it is not in C(A), then the model can only approximate it.
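
The salary example can be sketched with `np.linalg.lstsq`, which finds the weights that bring $Av$'s closest match, that is, the combination of the columns of A nearest to v. All numbers below are invented, the helper `fit_error` is hypothetical, and the model has no intercept term, which keeps it simple but is an assumption.

```python
import numpy as np

# Hypothetical features: [years of experience, years of education]
A = np.array([
    [1.0, 12.0],
    [3.0, 16.0],
    [5.0, 16.0],
    [7.0, 18.0],
])

# Salaries built directly from the columns, so this vector is in C(A)
v_exact = A @ np.array([2000.0, 3000.0])

# Salaries with extra variation the two features cannot explain
v_noisy = v_exact + np.array([500.0, -500.0, 500.0, -500.0])

def fit_error(A, v):
    """Least-squares weights and the size of the leftover error."""
    weights, *_ = np.linalg.lstsq(A, v, rcond=None)
    return weights, np.linalg.norm(A @ weights - v)

w_exact, err_exact = fit_error(A, v_exact)
w_noisy, err_noisy = fit_error(A, v_noisy)

print(w_exact, err_exact)  # error is essentially zero
print(w_noisy, err_noisy)  # error stays above zero
```

For `v_exact` the fit recovers the weights used to build it. For `v_noisy` no choice of weights reproduces the salaries, so the best the model can do is the closest point in C(A).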

## AI Transparency Statement

AI tools were used to assist with grammar correction, formatting, and the creation of example datasets for demonstration purposes. All concepts, interpretations, and final content were created by me. 