# Exercise 3 - Data Processing Phases


## Understanding the Data Science Lifecycle
The data science lifecycle breaks a project into seven phases, from raw data to a deployed result:

**Phases Include:**
1. Data Collection – gather raw data
2. Data Cleaning – handle missing or incorrect data
3. Data Exploration – understand patterns and distributions
4. Feature Engineering – create inputs for modeling
5. Modeling – apply machine learning algorithms
6. Evaluation – assess model performance
7. Deployment – deliver insights or live models

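Each phase consumes the output of the one before it, so a problem introduced early (for example, missing values that were never cleaned) carries through to modeling and evaluation.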


In [1]:
lifecycle = [
    "1. Data Collection",
    "2. Data Cleaning",
    "3. Data Exploration",
    "4. Feature Engineering",
    "5. Modeling",
    "6. Evaluation",
    "7. Deployment"
]

print("Data Science Lifecycle Phases:")
for step in lifecycle:
    print(step)


Data Science Lifecycle Phases:
1. Data Collection
2. Data Cleaning
3. Data Exploration
4. Feature Engineering
5. Modeling
6. Evaluation
7. Deployment


### Practice
- Describe an example project and match it to each phase in the lifecycle.
- Which step do you think takes the most time in practice? Why?


Example project: stock price prediction
- Data Collection: gather price data through web scraping or API calls.
- Data Cleaning: put all records in the same format and handle missing values.
- Data Exploration: look for the features that are most predictive.
- Feature Engineering: try different combinations of features.
- Modeling: test different models.
- Evaluation: measure the performance of the chosen model.
- Deployment: release the model so that people can submit real-time data.

## Extract – Getting Data from a Source
**Extract** is the first step in ETL (Extract, Transform, Load). It involves pulling data from sources such as CSV files, databases, APIs, or scraped web pages.

In this example, we simulate extracting data from a CSV file.


In [3]:
import pandas as pd

# Simulating extract from CSV
# In practice, you'd use: df = pd.read_csv("filename.csv")
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Score': [85, 90, None, 88]
}
df_extracted = pd.DataFrame(data)
print("Extracted Data:")
print(df_extracted)


Extracted Data:
    Name  Score
0  Alice   85.0
1    Bob   90.0
2   None    NaN
3  David   88.0
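
The commented-out `pd.read_csv` call above can be tried without a file on disk. The following is a minimal sketch that parses CSV text held in memory; the variable names and the CSV content are made up for illustration and mirror the dictionary used in the cell above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
# The third data row has both fields empty.
csv_text = """Name,Score
Alice,85
Bob,90
,
David,88
"""

# read_csv accepts a file path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv)
# Count the missing values per column.
print(df_from_csv.isna().sum())
```

Empty fields come back as missing values, which is why the extracted data usually needs a Transform step before it can be used.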


### Practice
- Try simulating a different dataset with missing or inconsistent data.
- What happens if you extract from a different file format like Excel?


In [22]:
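# Practice: fill the gaps on a copy so that df_extracted keeps its
# missing values for the Transform step below
df_practice = df_extracted.copy()
df_practice['Name'] = df_practice['Name'].fillna("Unknown")
df_practice['Score'] = df_practice['Score'].fillna(df_practice['Score'].mean())
df_practice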

      Name      Score
0    Alice  85.000000
1      Bob  90.000000
2  Unknown  87.666667
3    David  88.000000


## Transform – Cleaning and Shaping Data
**Transform** is the second step in ETL. It involves cleaning, correcting, and shaping the data into usable formats.

Common operations include:
- Filling missing values
- Renaming columns
- Type conversions
- Creating derived columns
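
The four operations listed above can be sketched on one small table. This is an illustrative example only: the `raw` table, its column names, and the pass mark of 80 are assumptions chosen to resemble the student data in this exercise.

```python
import pandas as pd

# Hypothetical raw table: awkward column names and scores stored as text.
raw = pd.DataFrame({
    "student name": ["Alice", "Bob", None, "David"],
    "score": ["85", "90", None, "88"],
})

# Renaming columns
cleaned = raw.rename(columns={"student name": "Name", "score": "Score"})

# Type conversion: parse the text, then pin the dtype to float64
cleaned["Score"] = pd.to_numeric(cleaned["Score"]).astype("float64")

# Filling missing values
cleaned["Name"] = cleaned["Name"].fillna("Unknown")
cleaned["Score"] = cleaned["Score"].fillna(cleaned["Score"].mean())

# Creating a derived column
cleaned["Passed"] = cleaned["Score"] > 80

print(cleaned)
```

The type conversion comes before the mean is computed, because a column of text values has no meaningful average.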


In [5]:
# Fill missing values and standardize names
df_transformed = df_extracted.copy()
df_transformed['Name'] = df_transformed['Name'].fillna('Unknown')
df_transformed['Score'] = df_transformed['Score'].fillna(df_transformed['Score'].mean())

print("Transformed Data:")
print(df_transformed)


Transformed Data:
      Name      Score
0    Alice  85.000000
1      Bob  90.000000
2  Unknown  87.666667
3    David  88.000000


### Practice
- Try rounding the scores to the nearest whole number.
- Add a new column that indicates whether the student passed (e.g., score > 80).


In [18]:
# Your practice code here 
df_transformed['Score'] = df_transformed['Score'].round()
df_transformed

      Name  Score
0    Alice   85.0
1      Bob   90.0
2  Unknown   88.0
3    David   88.0


In [20]:
df_transformed['Passed'] = df_transformed['Score'] > 80
df_transformed

      Name  Score  Passed
0    Alice   85.0    True
1      Bob   90.0    True
2  Unknown   88.0    True
3    David   88.0    True


## Load – Storing Processed Data
**Load** is the final step in ETL. Once the data is clean, it's saved to a database, data warehouse, or file for use in downstream systems.

In this example, we simulate loading by saving to a CSV.


In [7]:
# Simulate loading by writing a CSV file
# (check the working directory for the new file)
df_transformed.to_csv("processed_students.csv", index=False)

print("Data saved to 'processed_students.csv'")


Data saved to 'processed_students.csv'
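
Loading into a database follows the same pattern as loading into a file. Below is a minimal sketch that uses an in-memory SQLite database, so nothing is written to disk; the table name `students` and the sample rows are assumptions for illustration, and a real project would connect to a database file or server instead.

```python
import sqlite3

import pandas as pd

# Sample of already-cleaned data (illustrative values).
df_to_load = pd.DataFrame({
    "Name": ["Alice", "Bob", "Unknown", "David"],
    "Score": [85.0, 90.0, 88.0, 88.0],
})

# In-memory database: it disappears when the connection is closed.
conn = sqlite3.connect(":memory:")

# "students" is a hypothetical table name.
df_to_load.to_sql("students", conn, index=False, if_exists="replace")

# Read the table back to confirm that the load succeeded.
df_loaded = pd.read_sql("SELECT * FROM students", conn)
conn.close()

print(df_loaded)
```

Reading the data back after writing it is a cheap way to catch a load that silently dropped rows or changed types.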


### Practice
- Try saving the data to Excel using `to_excel()`.
- Think about where this data might go in a real project (e.g., database, dashboard).


In [None]:
# Save to Excel; writing .xlsx files requires an Excel engine such as openpyxl
df_transformed.to_excel("output.xlsx", index=False)

## Putting It All Together – The Full ETL Process
This step combines all three ETL phases (Extract, Transform, and Load) into one reusable function.

Wrapping the phases in a function makes the pipeline easy to rerun and to automate.


In [9]:
import pandas as pd

def etl_pipeline():
    # Extract
    raw_data = {
        'Name': ['Alice', 'Bob', None, 'David'],
        'Score': [85, 90, None, 88]
    }
    df = pd.DataFrame(raw_data)

    # Transform
    df['Name'] = df['Name'].fillna('Unknown')
    df['Score'] = df['Score'].fillna(df['Score'].mean())
    df['Passed'] = df['Score'] > 80

    # Load
    # Check file system for new file
    df.to_csv("etl_output.csv", index=False)

    return df

# Run the pipeline
df_final = etl_pipeline()
print("Final ETL Result:")
print(df_final)


Final ETL Result:
      Name      Score  Passed
0    Alice  85.000000    True
1      Bob  90.000000    True
2  Unknown  87.666667    True
3    David  88.000000    True
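
The pipeline above hard-codes its input data and output file. One way to make it reusable is to pass those in as parameters. The following is a sketch under that assumption; `run_etl`, `pass_mark`, and the file names are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd


def run_etl(source_path, output_path, pass_mark=80):
    """Read a CSV, clean it, write the result, and return the DataFrame."""
    # Extract
    df = pd.read_csv(source_path)

    # Transform
    df["Name"] = df["Name"].fillna("Unknown")
    df["Score"] = df["Score"].fillna(df["Score"].mean())
    df["Passed"] = df["Score"] > pass_mark

    # Load
    df.to_csv(output_path, index=False)
    return df


# Demonstrate with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "students_raw.csv"
    output = Path(tmp) / "students_clean.csv"
    source.write_text("Name,Score\nAlice,85\nBob,90\n,\nDavid,88\n")

    result = run_etl(source, output)
    reloaded = pd.read_csv(output)  # read back what was written

print(result)
```

With the paths as arguments, the same function can be pointed at a different file each day, which is the usual starting point for a scheduled pipeline.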


### Practice
- Modify the ETL function to include a new feature, like score categories (Low, Medium, High).
- Try abstracting the steps into separate functions for extract, transform, and load.


In [11]:
# Your practice code here
def etl_pipeline2():
    # Extract
    raw_data = {
        'Name': ['Alice', 'Bob', None, 'David'],
        'Score': [85, 90, None, 88]
    }
    df = pd.DataFrame(raw_data)

    # Transform
    df['Name'] = df['Name'].fillna('Unknown')
    df['Score'] = df['Score'].fillna(df['Score'].mean())
    df['Passed'] = df['Score'] > 80

    df['ScoreBucket'] = pd.cut(
        df['Score'],
        bins=[0, 70, 85, 100],             # define bucket edges
        labels=['Low', 'Medium', 'High'],  # bucket labels
        include_lowest=True,
    )

    # Load
    # Check file system for new file
    df.to_csv("etl_output.csv", index=False)

    return df

# Run the pipeline
df_final = etl_pipeline2()
print("Final ETL Result:")
print(df_final)

Final ETL Result:
      Name      Score  Passed ScoreBucket
0    Alice  85.000000    True      Medium
1      Bob  90.000000    True        High
2  Unknown  87.666667    True        High
3    David  88.000000    True        High
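
Alice's score of 85 lands in `Medium` rather than `High` because `pd.cut` closes each interval on the right by default, so the buckets are [0, 70], (70, 85], and (85, 100]. The sketch below compares that default with `right=False` on a few boundary scores chosen for illustration.

```python
import pandas as pd

# Scores sitting on or near the bucket edges (illustrative values).
scores = pd.Series([70, 85, 85.5, 100])
edges = [0, 70, 85, 100]
labels = ["Low", "Medium", "High"]

# Default: intervals closed on the right, e.g. (70, 85].
right_closed = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)

# right=False: intervals closed on the left, e.g. [70, 85).
# The top edge (100) is then excluded and becomes NaN.
left_closed = pd.cut(scores, bins=edges, labels=labels, right=False)

comparison = pd.DataFrame({
    "Score": scores,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

Which side to close depends on how the grading rule is worded: "85 and above is High" calls for `right=False`, with the last edge raised above the maximum possible score.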


In [12]:
def extract():
    raw_data = {
        'Name': ['Alice', 'Bob', None, 'David'],
        'Score': [85, 90, None, 88]
    }
    df = pd.DataFrame(raw_data)
    return df

def transform(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is unchanged
    df['Name'] = df['Name'].fillna('Unknown')
    df['Score'] = df['Score'].fillna(df['Score'].mean())
    df['Passed'] = df['Score'] > 80

    df['ScoreBucket'] = pd.cut(
        df['Score'],
        bins=[0, 70, 85, 100],             # define bucket edges
        labels=['Low', 'Medium', 'High'],  # bucket labels
        include_lowest=True,
    )

    return df

def load(df):
    df.to_csv("etl_output.csv", index=False)

df = extract()
transformed_df = transform(df)
load(transformed_df)
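
One payoff of separating the steps is that the transform logic can be checked on its own, without reading or writing any files. The sketch below assumes a simplified `transform` that applies the same cleaning rules to a copy of its input; the sample rows are made up for the check.

```python
import pandas as pd


def transform(df):
    # Same cleaning rules as the transform step, applied to a copy
    # so the input DataFrame is left unchanged.
    out = df.copy()
    out["Name"] = out["Name"].fillna("Unknown")
    out["Score"] = out["Score"].fillna(out["Score"].mean())
    out["Passed"] = out["Score"] > 80
    return out


# Tiny hand-made input with one missing name and one missing score.
sample = pd.DataFrame({"Name": ["Eve", None], "Score": [60.0, None]})
checked = transform(sample)

assert checked["Name"].tolist() == ["Eve", "Unknown"]
assert checked["Score"].tolist() == [60.0, 60.0]  # mean of the one known score
assert checked["Passed"].tolist() == [False, False]
assert sample["Name"].isna().sum() == 1  # original input was not modified

print("transform checks passed")
```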