# Lesson 1: Merging Datasets

# Lesson Introduction

Welcome! Today, we'll dive into **merging datasets**. Imagine you have two sets of information about the same group of people or things, and you want to combine them into one set. This is crucial in data analysis because it helps you enrich your data and uncover deeper insights.

By the end of this lesson, you'll understand:

- How to merge datasets using the `pd.merge()` function.
- Different ways to merge datasets: left, right, inner, and outer joins.
- Practical examples connecting theory with real-life scenarios.

---

## Introduction to `pd.merge()`

Merging combines rows from two tables that share a key column, much like joining two puzzle sections along a common edge. In Python, the `pd.merge()` function from pandas does this. Let's start by creating two simple datasets:

```python
import pandas as pd

# Dataset 1: Basic information about some students
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [15, 16, 17]
})

# Dataset 2: Performance information for some of these students, plus one (ID 4) not in `students`
performance = pd.DataFrame({
    'student_id': [1, 3, 4],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})
```

---

## Example with Two DataFrames

Now that we have two datasets, let's combine them using `pd.merge()`. We'll merge these datasets based on the common column `student_id`.

```python
# Merging the datasets on 'student_id'
students_merged = pd.merge(students, performance, on='student_id', how='left')
print(students_merged)
```

Output:
```
   student_id     name  age grade  attendance
0           1    Alice   15     A        95.0
1           2      Bob   16   NaN         NaN
2           3  Charlie   17     B        85.0
```

### Explanation:
1. **`students`**: Contains basic information about students.
2. **`performance`**: Contains additional details like grades and attendance.
3. **`pd.merge(students, performance, on='student_id', how='left')`**: Merges the two DataFrames based on `student_id`. The `how='left'` parameter keeps all rows from the left DataFrame (`students`).
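To see which rows of a left join actually found a match, one option is the `indicator` parameter of `pd.merge()`. The sketch below reuses the lesson's two DataFrames; the names `checked` and `unmatched` are illustrative only.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [15, 16, 17]
})
performance = pd.DataFrame({
    'student_id': [1, 3, 4],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# indicator=True adds a '_merge' column; for a left join each row is
# marked 'both' (matched) or 'left_only' (no match in the right frame)
checked = pd.merge(students, performance, on='student_id', how='left', indicator=True)

# Names of students that have no row in `performance`
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)  # ['Bob']
```

This is a convenient check before deciding whether missing values in the merged result are expected.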

---

## Different Types of Joins: Part 1

Merging datasets can be done in various ways, depending on the type of data you want to include. Here are the most common types:

### Left Join
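Keeps every row from the left DataFrame (`students`) and adds the matching rows from the right DataFrame (`performance`). Left rows without a match get `NaN` in the right-hand columns. Because `NaN` is a floating-point value, `attendance` is displayed as `95.0` rather than `95`.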

```python
left_join = pd.merge(students, performance, on='student_id', how='left')
print(left_join)
```

Output:
```
   student_id     name  age grade  attendance
0           1    Alice   15     A        95.0
1           2      Bob   16   NaN         NaN
2           3  Charlie   17     B        85.0
```

### Right Join
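Keeps every row from the right DataFrame (`performance`) and adds the matching rows from the left DataFrame (`students`). Student 4 has no entry in `students`, so `name` and `age` are `NaN` for that row.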

```python
right_join = pd.merge(students, performance, on='student_id', how='right')
print(right_join)
```

Output:
```
   student_id     name   age grade  attendance
0           1    Alice  15.0     A          95
1           3  Charlie  17.0     B          85
2           4      NaN   NaN     A         100
```
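A right join can also be read as a left join with the two DataFrames swapped. The sketch below, using the same lesson data, checks that both calls produce the same rows; only the column order differs, so the columns are reordered before comparing. The names `swapped_left` and `same` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [15, 16, 17]
})
performance = pd.DataFrame({
    'student_id': [1, 3, 4],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# Right join: keep every row of `performance`
right_join = pd.merge(students, performance, on='student_id', how='right')

# Left join with the arguments swapped: also keeps every row of `performance`
swapped_left = pd.merge(performance, students, on='student_id', how='left')

# Reorder the columns to match, then compare the two results
same = right_join.equals(swapped_left[right_join.columns])
print(same)
```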

---

## Different Types of Joins: Part 2

### Inner Join
Includes only rows that match in both DataFrames.

```python
inner_join = pd.merge(students, performance, on='student_id', how='inner')
print(inner_join)
```

Output:
```
   student_id     name  age grade  attendance
0           1    Alice   15     A          95
1           3  Charlie   17     B          85
```

### Outer Join
Includes all rows from both DataFrames, matching where possible.

```python
outer_join = pd.merge(students, performance, on='student_id', how='outer')
print(outer_join)
```

Output:
```
   student_id     name   age grade  attendance
0           1    Alice  15.0     A        95.0
1           2      Bob  16.0   NaN         NaN
2           3  Charlie  17.0     B        85.0
3           4      NaN   NaN     A       100.0
```
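As a quick side-by-side comparison, the following sketch counts the rows each join type returns for the lesson's two DataFrames. The dictionary name `row_counts` is illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [15, 16, 17]
})
performance = pd.DataFrame({
    'student_id': [1, 3, 4],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# Run the same merge with each join type and record the number of rows
row_counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(students, performance, on='student_id', how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```

The inner join is the smallest because only IDs 1 and 3 appear in both DataFrames; the outer join is the largest because it keeps IDs 1 through 4.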

---

## Lesson Summary

Great job! Today, we've covered:

- The basics of merging datasets using `pd.merge()`.
- Types of joins: left, right, inner, and outer.
- Practical examples to see how merging can be applied in real-life situations.

You're now ready to move on to the practice part. Hands-on exercises will help you solidify your understanding of merging datasets. Combining different pieces of information can lead to better insights and informed decisions. Let's get started!

## Merging Student Data with Outer Join

Ever wonder how to combine student information with their performance data, even if some information is missing? The given code merges these datasets using an outer join to include all students and their available performance data. Let's run it and see how it works!

```py
import pandas as pd

# Dataset 1: Basic information about some students
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [15, 16, 17, 16]
})

# Dataset 2: Additional performance information for some of the same students
performance = pd.DataFrame({
    'student_id': [2, 3, 4],
    'grade': ['B', 'A', 'C'],
    'attendance': [85, 100, 90]
})

# Merging the datasets on 'student_id' using outer join
students_merged = pd.merge(students, performance, on='student_id', how='outer')
print(students_merged)
```

The code demonstrates how to merge two datasets using an **outer join** in Python with Pandas. Here's the explanation and the expected output:

### Code Explanation:
1. **Dataset 1 (`students`)**:
   - Contains basic information about students (IDs, names, and ages).
2. **Dataset 2 (`performance`)**:
   - Contains additional details such as grades and attendance for some of the same students.
3. **Outer Join**:
   - Ensures that all rows from both datasets are included, whether or not there's a matching `student_id` in both.

### Code Output:
When you run the script, the merged DataFrame will look like this:

```
   student_id     name  age grade  attendance
0           1    Alice   15   NaN         NaN
1           2      Bob   16     B        85.0
2           3  Charlie   17     A       100.0
3           4    David   16     C        90.0
```

### Key Points:
- **`Alice` (ID 1)**: Present in the `students` dataset but missing in `performance`, so `grade` and `attendance` are `NaN`.
- **`Bob` (ID 2), `Charlie` (ID 3), and `David` (ID 4)**: Appear in both datasets, so their data is combined.
- Columns without matching data are filled with `NaN` (indicating missing values).

### Practical Use:
Outer joins are useful when you want to consolidate data from multiple sources without losing any entries, even if some data points are incomplete. This is particularly common in data integration and analytics tasks.
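After an outer join, a common follow-up is to list the entries that are incomplete. The sketch below uses `isna()` on the merged result from this exercise; the names `incomplete` and `missing_names` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [15, 16, 17, 16]
})
performance = pd.DataFrame({
    'student_id': [2, 3, 4],
    'grade': ['B', 'A', 'C'],
    'attendance': [85, 100, 90]
})

students_merged = pd.merge(students, performance, on='student_id', how='outer')

# Rows where no grade was found, i.e. students without performance data
incomplete = students_merged[students_merged['grade'].isna()]
missing_names = incomplete['name'].tolist()
print(missing_names)  # ['Alice']
```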

## Change Merge Join Type

Now that we've learned about merging datasets, let's practice this!

Change the join type in the starter code from an outer join to a right join. This will let you see how different joins affect the combined data.

```py
import pandas as pd

# Dataset 1: Basic information about some students
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [15, 16, 17, 18]
})

# Dataset 2: Additional performance information, missing some students
performance = pd.DataFrame({
    'student_id': [1, 2, 3],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# Outer Join
students_merged = pd.merge(students, performance, on='student_id', how='outer')
print(students_merged)
```

Here’s how to modify the code to use a **right join** instead of an **outer join**, and the expected outcome.

### Modified Code
Replace the `how='outer'` parameter in the `pd.merge()` function with `how='right'`:

```python
import pandas as pd

# Dataset 1: Basic information about some students
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [15, 16, 17, 18]
})

# Dataset 2: Additional performance information, missing some students
performance = pd.DataFrame({
    'student_id': [1, 2, 3],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# Right Join
students_merged = pd.merge(students, performance, on='student_id', how='right')
print(students_merged)
```

### Expected Output
The output will include all rows from the `performance` DataFrame and only the matching rows from the `students` DataFrame:

```
   student_id     name  age grade  attendance
0           1    Alice   15     A          95
1           2      Bob   16     B          85
2           3  Charlie   17     A         100
```

### Explanation:
1. **Right Join**:
   - Includes all rows from the `performance` DataFrame.
   - Matches and merges rows from the `students` DataFrame where `student_id` values align.
   - Drops rows from `students` that are not present in `performance` (e.g., `David` with `student_id=4` is excluded).
2. Any `performance` row without a match in `students` would get `NaN` in `name` and `age`. Here every row matches, so no `NaN` appears.

### Practical Use:
A **right join** is particularly useful when you want to focus on the data from the second DataFrame (in this case, `performance`) and include any available corresponding information from the first.
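The same right join can also be written with the `DataFrame.merge()` method instead of the top-level `pd.merge()` function. The sketch below shows both forms on the exercise data and checks that they agree; the names `function_form` and `method_form` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [15, 16, 17, 18]
})
performance = pd.DataFrame({
    'student_id': [1, 2, 3],
    'grade': ['A', 'B', 'A'],
    'attendance': [95, 85, 100]
})

# Function form: the left DataFrame is the first argument
function_form = pd.merge(students, performance, on='student_id', how='right')

# Method form: the left DataFrame is the object the method is called on
method_form = students.merge(performance, on='student_id', how='right')

print(function_form.equals(method_form))  # True
```

Which form to use is a matter of style; the method form chains more naturally with other DataFrame operations.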

## Merging Student Information and Performance Data

You're on a roll, Space Wanderer!

Let's put your newfound merging skills to the test. Add the missing pieces to merge two datasets containing student information and their academic performance. Use a join that includes ONLY data that is present in both dataframes.

May the cosmic forces be with you!

```py
import pandas as pd

# Student Information DataFrame
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Academic Performance DataFrame
performance = pd.DataFrame({
    'student_id': [2, 3, 4],
    'grade': ['B', 'A', 'C'],
    'attendance': [85, 95, 80]
})

# TODO: Merge the two dataframes using an appropriate join type
# and assign the result to a variable

print(merged_data)
```

To include only the data that is present in both DataFrames, you should use an **inner join**. Here's the completed code:

```python
import pandas as pd

# Student Information DataFrame
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Academic Performance DataFrame
performance = pd.DataFrame({
    'student_id': [2, 3, 4],
    'grade': ['B', 'A', 'C'],
    'attendance': [85, 95, 80]
})

# Merge the two dataframes using an inner join
merged_data = pd.merge(students, performance, on='student_id', how='inner')

print(merged_data)
```

### Expected Output:
The result will include only the rows where `student_id` is present in both DataFrames:

```
   student_id     name grade  attendance
0           2      Bob     B          85
1           3  Charlie     A          95
2           4    David     C          80
```

### Explanation:
- **Inner Join**:
  - Includes only the rows where the `student_id` exists in both `students` and `performance` DataFrames.
  - Excludes rows with `student_id` values that are present in only one of the DataFrames (e.g., `Alice` with `student_id=1`).
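One way to sanity-check an inner join is to compare its keys with the plain set intersection of the two `student_id` columns. The sketch below does this for the exercise data; the names `common_ids` and `merged_ids` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
performance = pd.DataFrame({
    'student_id': [2, 3, 4],
    'grade': ['B', 'A', 'C'],
    'attendance': [85, 95, 80]
})

merged_data = pd.merge(students, performance, on='student_id', how='inner')

# IDs present in both DataFrames, computed without merging
common_ids = set(students['student_id'].tolist()) & set(performance['student_id'].tolist())

# IDs that survived the inner join
merged_ids = set(merged_data['student_id'].tolist())

print(sorted(common_ids))        # [2, 3, 4]
print(merged_ids == common_ids)  # True
```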

### Cosmic Note:
You’ve successfully filtered out incomplete data and focused on the overlap. Now you're one step closer to mastering the art of data merging! 🚀

## Merging Student Data with Outer Join

Hello, Space Explorer! It's time to merge datasets about students and their academic performance. Combine the data with the `pd.merge()` function to get a complete view of each student's profile and grades. Use the type of join that ensures all the student names are included, even if some performance data is missing. Follow the TODO comments to complete this task.

You're almost there. Let's go for it!

```py
import pandas as pd

# TODO: Create a DataFrame for basic information about some students (columns: 'student_id', 'name', 'age')

# TODO: Create a DataFrame for additional performance information about the same students and one extra student (columns: 'student_id', 'grade', 'attendance')

# TODO: Perform an outer join to combine both DataFrames using 'student_id' with pd.merge()

# TODO: Print the combined DataFrame to see the result


```
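One possible way to fill in the TODOs is sketched below. The exercise does not prescribe any data, so every value here (names, ages, grades, attendance, and the extra student with ID 4) is a made-up assumption chosen only to illustrate the outer join.

```python
import pandas as pd

# Hypothetical sample values: basic information about three students
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [15, 16, 17]
})

# Hypothetical sample values: performance data for the same students
# plus one extra student (ID 4) who is not in `students`
performance = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'grade': ['A', 'B', 'A', 'C'],
    'attendance': [95, 85, 100, 90]
})

# Outer join: keep every student_id that appears in either DataFrame
students_merged = pd.merge(students, performance, on='student_id', how='outer')

# The extra student has NaN in 'name' and 'age'
print(students_merged)
```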