# Data Aggregation, Data Provenance, and Reproducible Data Preparation Workflows

## 1. Data Aggregation

### 1.1 Overview
Data aggregation is the process of gathering data from one or more sources and summarizing it so that it is more useful and informative. In pandas, aggregation is usually done with the `groupby()` method of a DataFrame, followed by an aggregation function.

### 1.2 Grouping Data
To group data by one or more columns, you can use the `groupby()` method.

In [1]:
import pandas as pd

# Example DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)

In [3]:
df

Unnamed: 0,Category,Value
0,A,10
1,A,15
2,B,20
3,B,25
4,A,30
5,B,35


In [9]:
# Group data by 'Category' and compute the standard deviation of each group
grouped_data = df.groupby('Category').std()
grouped_data

Category,Value
A,10.40833
B,7.637626
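
Grouping by more than one column works the same way: pass a list of column names. The following is a minimal sketch, assuming a hypothetical `Region` column that is not part of the example DataFrame above.

```python
import pandas as pd

# Hypothetical data: 'Region' is an illustrative second grouping column.
sales = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Region': ['East', 'West', 'East', 'East', 'East', 'West'],
    'Value': [10, 15, 20, 25, 30, 35],
})

# A list of column names groups rows by each unique combination of values.
# The result is a Series indexed by (Category, Region).
mean_by_group = sales.groupby(['Category', 'Region'])['Value'].mean()

# as_index=False keeps the grouping columns as ordinary columns,
# which is often easier to export or merge later.
mean_flat = sales.groupby(['Category', 'Region'], as_index=False)['Value'].mean()
```

Whether to keep the grouping keys in the index or as columns is mostly a matter of what the next step in the workflow expects.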


### 1.3 Aggregating Data
Apply aggregation functions like `sum()`, `mean()`, `max()`, and `min()` to the grouped data.

In [12]:
# Find the maximum value within each group
aggregated_data = df.groupby('Category').max()
aggregated_data

Category,Value
A,30
B,35
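
Several aggregation functions can also be applied in one pass with `agg()`. The sketch below rebuilds the same example DataFrame so that it runs on its own; the output column names `total` and `average` are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35],
})

# A list of function names yields one output column per function.
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'min', 'max'])

# Named aggregation: each keyword is an output column name,
# each tuple is (input column, aggregation function).
named = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
)
```

Named aggregation is convenient when the summary table is shared with others, because the column names describe what was computed.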


## 2. Data Provenance

### 2.1 Overview
Data provenance refers to the origin and history of a dataset, including how it was collected, processed, and analyzed. Maintaining data provenance is essential for ensuring the reliability and reproducibility of research findings.

### 2.2 Tracking Data Provenance
Document data sources, processing steps, and any changes made to the dataset. Use code comments, version control systems (e.g., Git), and metadata to maintain a record of data provenance.

In [16]:
# Load data from CSV file (specify the data source)
df = pd.read_csv('pre-course_survey.csv')

# Clean data (document the cleaning steps)
# df.dropna(inplace=True)
# df.drop_duplicates(inplace=True)

In [17]:
df

Unnamed: 0,Timestamp,"1. On a scale of 1 to 5, how would you rate your current knowledge of Python programming?","2. On a scale of 1 to 5, how would you rate your current knowledge of data science concepts?",3. Have you ever used version control systems such as Git and GitHub?,4. Are you familiar with Jupyter Notebooks or JupyterHub?,"5. On a scale of 1 to 5, how would you rate your understanding of reproducible research principles?",6. Have you ever conducted text analysis or natural language processing (NLP) projects?,7. Have you ever worked with social media data or network analysis?,8. What are your primary learning goals for this course?,9. What specific skills or techniques do you hope to gain from this course?,10. Do you have any concerns or challenges that you anticipate facing in this course?
0,4/1/2023 17:50:33,1,1,No,No,1.0,No,No,Python and the data science methods described ...,I currently have zero coding knowledge but I w...,All I have is my laptop. I am worried about ov...
1,4/1/2023 18:11:14,1,2,No,No,5.0,No,Yes,To learn more about data science processes.,Coding experience,no
2,4/1/2023 18:50:40,1,1,No,No,4.0,No,No,I’d like to gain more experience in python and...,SQL， R，Python.,Since I’ve never taken a CS course except a li...
3,4/1/2023 18:53:09,1,1,No,No,1.0,No,No,My primary learning goals are to be skillful i...,I want to learn Python programming and data sc...,I'm graduating this quarter and I really want ...
4,4/1/2023 19:25:36,2,2,No,No,1.0,No,No,Learning how we can use data science in commun...,Technical skills I can use in the professional...,Using programming languages
5,4/1/2023 19:43:48,1,3,No,No,1.0,No,Yes,Learning basic English skills,Not sure yet since I don’t have a good underst...,Yes but I don’t know what may come up
6,4/1/2023 20:03:11,1,1,No,No,1.0,No,Yes,I want to understand data science principles b...,"Learning to code would be incredible, since I ...",Since I have very limited knowledge of program...
7,4/1/2023 20:42:55,1,2,No,No,1.0,Yes,Yes,"As a senior, I will go to a Marketing Master p...",I hope I can master some basic skills of codin...,"I have little experience in coding, so I'm qui..."
8,4/1/2023 20:58:16,1,1,No,No,2.0,No,No,Understanding what data science for social stu...,I hope to get a basic grasp of coding and good...,I have little to no experience with coding and...
9,4/1/2023 21:13:10,1,2,No,No,1.0,No,No,Learning the basics to apply to real world pro...,I hope to be able to conduct coding my own dat...,I don’t have much experience with coding


In [18]:
df.isnull()

Unnamed: 0,Timestamp,"1. On a scale of 1 to 5, how would you rate your current knowledge of Python programming?","2. On a scale of 1 to 5, how would you rate your current knowledge of data science concepts?",3. Have you ever used version control systems such as Git and GitHub?,4. Are you familiar with Jupyter Notebooks or JupyterHub?,"5. On a scale of 1 to 5, how would you rate your understanding of reproducible research principles?",6. Have you ever conducted text analysis or natural language processing (NLP) projects?,7. Have you ever worked with social media data or network analysis?,8. What are your primary learning goals for this course?,9. What specific skills or techniques do you hope to gain from this course?,10. Do you have any concerns or challenges that you anticipate facing in this course?
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False
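
Provenance metadata can also be stored in a machine-readable form next to the data. The following is one possible sketch, not a standard format: the function names and the record fields (`source_file`, `sha256`, `recorded_at`, `pandas_version`, `processing_steps`) are assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance_record(path, steps):
    """Describe where a dataset came from and what was done to it.

    The field names are illustrative, not a standard schema.
    """
    return {
        'source_file': str(path),
        'sha256': file_sha256(path),  # detects later changes to the file
        'recorded_at': datetime.now(timezone.utc).isoformat(),
        'pandas_version': pd.__version__,
        'processing_steps': list(steps),
    }


def save_provenance_record(record, out_path):
    """Write the record as JSON so it can be versioned with the code."""
    Path(out_path).write_text(json.dumps(record, indent=2), encoding='utf-8')
```

For the survey file loaded above, a call such as `build_provenance_record('pre-course_survey.csv', ['loaded with pd.read_csv', 'checked for missing values with isnull()'])` would produce a record that can be saved and committed to Git alongside the notebook.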


## 3. Reproducible Data Preparation Workflows

### 3.1 Overview
Reproducible data preparation workflows involve organizing your code and data in a way that makes it easy for others to understand, use, and replicate your work.

### 3.2 Best Practices

- Use version control systems (e.g., Git) to track changes in code and data.
- Write clean, modular, and well-documented code.
- Separate data cleaning, processing, and analysis steps.
- Use virtual environments to manage dependencies.
- Share your code and data through public repositories (e.g., GitHub).

In [4]:
# Example of a clean, modular, and well-documented code snippet
def clean_data(df):
    """
    Clean the input DataFrame by removing rows with missing values and duplicate rows.

    Parameters:
    df (DataFrame): The input DataFrame to be cleaned.

    Returns:
    DataFrame: A new, cleaned DataFrame. The input DataFrame is not modified.
    """
    # Return a new DataFrame instead of using inplace=True,
    # so the caller's original data is left unchanged
    return df.dropna().drop_duplicates()

# Load data
data = pd.read_csv('pre-course_survey.csv')

# Clean data
cleaned_data = clean_data(data)
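
The practice of separating cleaning, processing, and analysis steps can be sketched with small single-purpose functions chained through `DataFrame.pipe()`. The function names and the in-memory sample data below are hypothetical and used only so the example runs without an external file.

```python
import pandas as pd


def drop_incomplete_rows(df):
    """Cleaning step: return a copy without rows that contain missing values."""
    return df.dropna()


def drop_repeated_rows(df):
    """Cleaning step: return a copy without exact duplicate rows."""
    return df.drop_duplicates()


def summarize_by_category(df):
    """Analysis step: mean of 'Value' for each 'Category'."""
    return df.groupby('Category', as_index=False)['Value'].mean()


# Hypothetical raw data with one duplicate row and two incomplete rows.
raw = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', None],
    'Value': [10.0, 10.0, 20.0, None, 5.0],
})

# Each step returns a new DataFrame, so `raw` stays untouched and
# any intermediate result can be inspected or re-run on its own.
cleaned = raw.pipe(drop_incomplete_rows).pipe(drop_repeated_rows)
summary = cleaned.pipe(summarize_by_category)
```

Keeping the raw data unmodified makes it possible to rerun the whole workflow from the original source at any time.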

## 4. Hands-on Exercises

### Exercise 1
Create a DataFrame with random data and perform data aggregation using the `groupby()` method. Apply different aggregation functions like `sum()`, `mean()`, and `count()`.

### Exercise 2
Design a simple reproducible data preparation workflow. Load a dataset, clean it, and visualize the results. Document your code and steps using comments and Markdown cells.