# Data Aggregation, Data Provenance, and Reproducible Data Preparation Workflows

## 1. Data Aggregation

### 1.1 Overview
Data aggregation is the process of gathering data, often from multiple sources, and summarizing it so that it becomes more useful and informative. In pandas, the `groupby()` method of a DataFrame is the main tool for aggregation operations.

### 1.2 Grouping Data
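To group rows by one or more columns, call the `groupby()` method with a column name or a list of column names. The result is a lazy group-by object: nothing is computed until you apply an aggregation to it.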

In [1]:
import pandas as pd

# Example DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)

# Group data by 'Category'
grouped_data = df.groupby('Category')
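
The following is a minimal sketch of how a group-by object can be inspected and how grouping by more than one column might look. The `Region` column is a hypothetical addition for illustration; it is not part of the example data above.

```python
import pandas as pd

# Same example data as above, plus a hypothetical 'Region' column
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Region': ['North', 'South', 'North', 'South', 'North', 'North'],
    'Value': [10, 15, 20, 25, 30, 35],
})

# Grouping by a single column
by_category = df.groupby('Category')

# Number of rows in each group
group_sizes = by_category.size()
print(group_sizes)

# Retrieve the rows that belong to one group
group_a = by_category.get_group('A')
print(group_a)

# Group by several columns by passing a list of column names
by_category_region = df.groupby(['Category', 'Region'])
print(by_category_region.ngroups)
```

Inspecting group sizes before aggregating is a cheap way to confirm that the grouping keys split the data as expected.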

### 1.3 Aggregating Data
Apply aggregation functions like `sum()`, `mean()`, `max()`, and `min()` to the grouped data.

In [2]:
# Sum values within each group
summed_data = grouped_data.sum()
print(summed_data)

          Value
Category       
A            55
B            80
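
Several aggregation functions can also be applied in one call. The sketch below uses the same example data and assumes the output column names `total`, `average`, and `n_rows`, which are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35],
})

# Apply several aggregations to one column at once
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'min', 'max'])
print(summary)

# Named aggregation: pick the output column names explicitly
named = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
    n_rows=('Value', 'count'),
)
print(named)
```

Named aggregation tends to be easier to read downstream, because each output column states what it contains.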


## 2. Data Provenance

### 2.1 Overview
Data provenance refers to the origin and history of a dataset, including how it was collected, processed, and analyzed. Maintaining data provenance is essential for ensuring the reliability and reproducibility of research findings.

### 2.2 Tracking Data Provenance
Document data sources, processing steps, and any changes made to the dataset. Use code comments, version control systems (e.g., Git), and metadata to maintain a record of data provenance.

In [3]:
# Load data from CSV file (specify the data source)
df = pd.read_csv('pre-course_survey.csv')

# Clean data, documenting each step as it is applied
df = df.dropna()           # Step 1: drop rows that contain missing values
df = df.drop_duplicates()  # Step 2: drop exact duplicate rows
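
One possible way to keep such a record alongside the data is a small metadata file. The sketch below is an assumption-laden illustration, not a standard format: the file names, the `provenance` dictionary layout, and the `log_step` helper are all hypothetical, and a stand-in CSV is written to a temporary directory so the example runs on its own.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Create a small stand-in CSV so the sketch is self-contained
workdir = Path(tempfile.mkdtemp())
source_path = workdir / 'example_survey.csv'  # hypothetical file name
source_path.write_text('id,score\n1,4\n2,\n3,5\n3,5\n')

df = pd.read_csv(source_path)

# Record the source, a checksum of the exact bytes read, and the environment
provenance = {
    'source': source_path.name,
    'sha256': hashlib.sha256(source_path.read_bytes()).hexdigest(),
    'loaded_at': datetime.now(timezone.utc).isoformat(),
    'pandas_version': pd.__version__,
    'steps': [],
}

def log_step(name, before, after):
    """Append one processing step and its effect on the row count."""
    provenance['steps'].append(
        {'step': name, 'rows_before': len(before), 'rows_after': len(after)}
    )

no_missing = df.dropna()
log_step('dropna', df, no_missing)

deduplicated = no_missing.drop_duplicates()
log_step('drop_duplicates', no_missing, deduplicated)

# Store the record next to the data as a JSON sidecar file
record_path = workdir / 'example_survey.provenance.json'
record_path.write_text(json.dumps(provenance, indent=2))
print(record_path.read_text())
```

The checksum makes it possible to verify later that an analysis was run on the same file, and the row counts show how much data each cleaning step removed.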


## 3. Reproducible Data Preparation Workflows

### 3.1 Overview
Reproducible data preparation workflows involve organizing your code and data in a way that makes it easy for others to understand, use, and replicate your work.

### 3.2 Best Practices

- Use version control systems (e.g., Git) to track changes in code and data.
- Write clean, modular, and well-documented code.
- Separate data cleaning, processing, and analysis steps.
- Use virtual environments to manage dependencies.
- Share your code and data through public repositories (e.g., GitHub).

In [4]:
# Example of a clean, modular, and well-documented code snippet
def clean_data(df):
    """
    Clean the input DataFrame by removing missing values and duplicates.

    Parameters:
    df (DataFrame): The input DataFrame to be cleaned.

    Returns:
    DataFrame: The cleaned DataFrame.
    """
    # Return a new DataFrame so the caller's input is left unmodified
    return df.dropna().drop_duplicates()

# Load data
data = pd.read_csv('pre-course_survey.csv')

# Clean data
cleaned_data = clean_data(data)
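
Separating loading, cleaning, processing, and analysis into their own functions could look like the sketch below. The column names `group` and `score`, the derived `score_fraction` column, and the in-memory CSV are hypothetical stand-ins chosen so the example runs without an external file.

```python
import io

import pandas as pd

def load_data(source):
    """Read the raw data; `source` may be a path or a file-like object."""
    return pd.read_csv(source)

def clean_data(df):
    """Return a new DataFrame without missing values or duplicate rows."""
    return df.dropna().drop_duplicates()

def process_data(df):
    """Add a derived column (hypothetical: score as a fraction of 5)."""
    return df.assign(score_fraction=df['score'] / 5)

def analyze_data(df):
    """Compute the mean score per group."""
    return df.groupby('group')['score'].mean()

# In-memory stand-in for a CSV file
raw_csv = io.StringIO('group,score\nx,4\nx,\ny,5\ny,5\ny,3\n')

# Each stage returns a new object, so intermediate results stay inspectable
raw = load_data(raw_csv)
cleaned = clean_data(raw)
processed = process_data(cleaned)
result = analyze_data(processed)
print(result)
```

Because no stage modifies its input, any step can be rerun or tested in isolation without reloading the data.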

## 4. Hands-on Exercises

### Exercise 1
Create a DataFrame with random data and perform data aggregation using the `groupby()` method. Apply different aggregation functions such as `sum()`, `mean()`, and `count()`.

### Exercise 2
Design a simple reproducible data preparation workflow. Load a dataset, clean it, and visualize the results. Document your code and steps using comments and Markdown cells.