In this part, we will focus on basic Git operations that are essential for managing your code. We will start by creating some initial scripts for our ETL pipeline and then learn how to track changes using Git.
To keep this hands-on and practical, the scripts are written in Python, with one script each for data extraction, transformation, and loading.
- Creating Initial Python Scripts:
  - Data Extraction:
    - Create a script named `extract.py` that reads data from a source, such as a CSV file or an API.

          import pandas as pd


          def extract_data(file_path):
              # Read the source CSV file into a DataFrame.
              return pd.read_csv(file_path)


          if __name__ == "__main__":
              data = extract_data('data/weather_data.csv')
              print(data.head())
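  - Data Transformation:
    - Create a script named `transform.py` that cleans and transforms the extracted data, then saves the result for the loading step.

          import pandas as pd


          def transform_data(data):
              # Convert the Fahrenheit readings to Celsius in a new column.
              data['temperature_celsius'] = (data['temperature_fahrenheit'] - 32) * 5.0 / 9.0
              return data


          if __name__ == "__main__":
              data = pd.read_csv('data/weather_data.csv')
              transformed_data = transform_data(data)
              print(transformed_data.head())
              # Save the result so that load.py can read it.
              transformed_data.to_csv('data/transformed_weather_data.csv', index=False)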
  - Data Loading:
    - Create a script named `load.py` that loads the transformed data into a database.

          import pandas as pd
          from sqlalchemy import create_engine


          def load_data(data, db_connection_string):
              # Write the DataFrame to the 'weather' table, replacing any existing table.
              engine = create_engine(db_connection_string)
              data.to_sql('weather', con=engine, if_exists='replace', index=False)


          if __name__ == "__main__":
              data = pd.read_csv('data/transformed_weather_data.csv')
              load_data(data, 'sqlite:///weather_data.db')
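The Fahrenheit-to-Celsius formula used in the transformation step is easy to sanity-check on its own before running it over a whole dataset. The sketch below is only an illustration: the helper name `fahrenheit_to_celsius` and the sample readings are made up for this check and are not part of the pipeline scripts.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    # Same arithmetic as the temperature_celsius column in transform.py.
    return (temp_f - 32) * 5.0 / 9.0


# Hypothetical sample readings: freezing point, boiling point, and the
# value at which both scales agree.
readings_f = [32.0, 212.0, -40.0]
readings_c = [fahrenheit_to_celsius(t) for t in readings_f]

for temp_f, temp_c in zip(readings_f, readings_c):
    print(f"{temp_f} F = {temp_c} C")
```

Checking a few known reference points like these is a cheap way to catch a mistyped constant or a misplaced parenthesis.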
Next, we will introduce key Git concepts: the working directory and the staging area.
- Working Directory:
  - The working directory is the current state of your project files. It includes all the files in your project folder, whether they are tracked by Git or not.
  - Any changes you make to your files occur in the working directory. New files are untracked until you explicitly tell Git to track them, and edits to files that are already tracked stay out of the next commit until you stage them.
- Staging Area (Index):
  - The staging area, also known as the index, is a space where you can prepare your changes before committing them to the repository.
  - Think of the staging area as a holding area for changes that you want to include in your next commit. This allows you to control exactly which changes are included in a commit.
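One way to see the difference between the working directory and the staging area is to ask Git for its status before and after staging a file. The sketch below is only an illustration under some assumptions: it drives the `git` command line from Python, works in a throwaway temporary repository, and uses a hypothetical helper named `git` plus placeholder file content.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def status_before_and_after_add():
    """Return `git status --porcelain` lines before and after staging a file."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")
        # A new file exists only in the working directory at first.
        (repo / "extract.py").write_text("import pandas as pd\n")
        before = git(repo, "status", "--porcelain").splitlines()
        # `git add` records the file's current content in the staging area.
        git(repo, "add", "extract.py")
        after = git(repo, "status", "--porcelain").splitlines()
        return before, after


if shutil.which("git"):  # do nothing when git is not installed
    before, after = status_before_and_after_add()
    print(before)  # a '??' prefix marks an untracked file
    print(after)   # an 'A' prefix marks a file staged for the next commit
```

The `--porcelain` option is used here because it prints a short, stable format that is easier to read in a script than the default status text.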
Now, let's add our newly created scripts to the staging area and commit them to the repository.
- Adding Changes to the Staging Area:
  - Use the `git add` command to add files to the staging area:

        git add extract.py transform.py load.py

  - This command tells Git to start tracking these files and prepares them for committing.
- Committing Changes:
  - Use the `git commit` command to save the changes from the staging area to the repository:

        git commit -m "Initial commit with extract, transform, and load scripts"

  - The `-m` flag allows you to include a commit message, which should describe the changes you are committing.
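The add-then-commit sequence above can be rehearsed safely in a scratch repository. The following sketch assumes the `git` command line is installed and runs it from Python; the `git` helper, the placeholder file contents, and the throwaway identity passed with `-c` are all assumptions made for the illustration, not part of the project.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

MESSAGE = "Initial commit with extract, transform, and load scripts"
SCRIPTS = ("extract.py", "transform.py", "load.py")


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def stage_and_commit():
    """Commit three placeholder scripts and report what Git recorded."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")
        for name in SCRIPTS:
            (repo / name).write_text("# placeholder content\n")
        git(repo, "add", *SCRIPTS)
        # The -c options supply a throwaway identity so the commit works
        # even where user.name and user.email are not configured.
        git(repo, "-c", "user.name=Example User",
            "-c", "user.email=user@example.com",
            "commit", "--quiet", "-m", MESSAGE)
        subject = git(repo, "log", "-1", "--pretty=format:%s").strip()
        tracked = git(repo, "ls-files").splitlines()
        pending = git(repo, "status", "--porcelain").splitlines()
        return subject, tracked, pending


if shutil.which("git"):  # do nothing when git is not installed
    subject, tracked, pending = stage_and_commit()
    print(subject)   # the message passed with -m
    print(tracked)   # files now tracked by the repository
    print(pending)   # empty once everything staged has been committed
```

After the commit, the status list is empty because the working directory, the staging area, and the latest commit all hold the same content.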
Finally, let's view the commit history to see the changes we've made.
- Viewing Commit History:
  - Use the `git log` command to see a log of the commits made to the repository:

        git log

  - This command lists the commits, newest first, including their commit hashes, author information, date, and commit messages. This history is crucial for tracking the development of your project over time.
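When the default `git log` output is more detail than you need, the `--pretty=format:` option can trim each commit to a single line. The sketch below is one possible way to read that output from Python, assuming the `git` command line is installed; the `git` helper, the two sample commits, and the identity passed with `-c` are made up for the illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Throwaway identity so commits work without any global Git configuration.
IDENTITY = ("-c", "user.name=Example User", "-c", "user.email=user@example.com")


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def two_commit_history():
    """Create two commits and return [hash, author, subject] for each."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")
        script = repo / "extract.py"
        changes = (
            ("# version 1\n", "Add extract script"),
            ("# version 2\n", "Update extract script"),
        )
        for content, message in changes:
            script.write_text(content)
            git(repo, "add", "extract.py")
            git(repo, *IDENTITY, "commit", "--quiet", "-m", message)
        # %h = abbreviated hash, %an = author name, %s = subject line.
        log = git(repo, "log", "--pretty=format:%h|%an|%s")
        return [line.split("|") for line in log.splitlines()]


if shutil.which("git"):  # do nothing when git is not installed
    for commit_hash, author, subject in two_commit_history():
        print(commit_hash, author, subject)
```

The most recent commit is listed first, which matches the order of the plain `git log` output.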
By the end of this session, you will have:
- Created initial scripts for data extraction, transformation, and loading.
- Understood the concepts of the working directory and the staging area.
- Learned how to add changes to the staging area and commit them to the repository.
- Viewed the commit history to track changes.
These foundational Git skills are essential for managing the code in your ETL pipeline project effectively.