# MLOps Episode 3: Data Extraction, Validation & Preparation

In this episode, we focus on three critical steps in getting data ready for machine learning models:

1. **Data Extraction**: Retrieving data from various sources.
2. **Data Validation**: Ensuring data quality and integrity.
3. **Data Preparation**: Transforming the data for analysis.

These steps are essential for ensuring the data used in machine learning pipelines is accurate, relevant, and ready for analysis.


## Data Extraction

Data extraction involves retrieving data from sources such as databases, APIs, flat files, or scraped web pages. Important considerations include:

- **Source Identification**: Determine where the data resides.
- **Data Formats**: Understand formats (CSV, JSON, XML, etc.).
- **Automation**: Use scripts to streamline the process.
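
To make the format and automation points concrete, here is a minimal sketch of a loader that picks a pandas reader from the file suffix. The `READERS` mapping and the `load_table` helper are hypothetical names introduced for illustration, and the two temporary files exist only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file suffix to the pandas reader that handles it.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
}


def load_table(path):
    """Load a tabular file into a DataFrame, choosing the reader by suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)


# Demo: write the same two records as CSV and as JSON, then load both.
records = [{"user_id": 1, "age": 28}, {"user_id": 2, "age": 34}]
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "users.csv"
    json_path = Path(tmp) / "users.json"
    pd.DataFrame(records).to_csv(csv_path, index=False)
    json_path.write_text(json.dumps(records))

    from_csv = load_table(csv_path)
    from_json = load_table(json_path)

print(from_csv)
print(from_json)
```

Supporting another format under this design would mean adding one entry to the mapping rather than changing the calling code.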

### Sample Data
We'll start with a simple CSV file:

```csv
user_id,name,email,age
1,John Doe,john.doe@example.com,28
2,Jane Smith,jane.smith@example.com,34
3,Sam Johnson,sam.johnson@example.com,22
```


```python
# Import required libraries
import pandas as pd

# Build the sample data in memory (mirrors the CSV shown above)
data = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['John Doe', 'Jane Smith', 'Sam Johnson'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'sam.johnson@example.com'],
    'age': [28, 34, 22]
})

# Display extracted data
print("Extracted Data:")
print(data.head())
```
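
The cell above builds the DataFrame by hand. In practice the same table would usually be parsed from the CSV itself. A minimal sketch, assuming the sample CSV is available as text (the `csv_text` and `data_from_csv` names are illustrative):

```python
import io

import pandas as pd

# The sample CSV from this section, embedded as text so the example is self-contained.
csv_text = """user_id,name,email,age
1,John Doe,john.doe@example.com,28
2,Jane Smith,jane.smith@example.com,34
3,Sam Johnson,sam.johnson@example.com,22
"""

# read_csv accepts a file path or a file-like object such as StringIO.
data_from_csv = pd.read_csv(io.StringIO(csv_text))

print(data_from_csv.shape)
print(data_from_csv.dtypes)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file's path.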

## Data Validation

After extraction, validating the data ensures quality and consistency. Important steps include:

- **Schema Validation**: Ensure proper data types.
- **Missing Values**: Handle missing entries.
- **Outlier Detection**: Detect and manage outliers.
- **Consistency Checks**: Ensure consistency across sources.

### Sample Data with Issues

```csv
user_id,name,email,age
1,John Doe,john.doe@example.com,28
2,Jane Smith,jane.smith@example.com,
3,Sam Johnson,,22
4,Emily Davis,emily.davis@example.com,120
```


```python
# Recreate the sample data with the issues shown above
data_with_issues = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['John Doe', 'Jane Smith', 'Sam Johnson', 'Emily Davis'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', None, 'emily.davis@example.com'],
    'age': [28, None, 22, 120]
})

# Check for missing values (count per column)
missing_values = data_with_issues.isnull().sum()
print("Missing Values:\n", missing_values)

# Outlier detection: flag ages outside the 0-100 range
# (rows with a missing age are not flagged, because comparisons with NaN are False)
outliers = data_with_issues[(data_with_issues['age'] > 100) | (data_with_issues['age'] < 0)]
print("Outliers:\n", outliers)

# Basic schema validation: inspect the inferred column types
print("Data Types:\n", data_with_issues.dtypes)
```
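
The individual checks above can also be gathered into one reusable function that reports every problem it finds. The following is a sketch under stated assumptions: `EXPECTED_COLUMNS`, the 0 to 100 age bounds, and the `validate_users` helper are hypothetical choices for this sample table, not a general-purpose validator.

```python
import pandas as pd

# Hypothetical expectations for the sample table; adjust to your own schema.
EXPECTED_COLUMNS = ["user_id", "name", "email", "age"]
AGE_MIN, AGE_MAX = 0, 100  # same bounds as the outlier check in this section


def validate_users(df):
    """Return a list of human-readable problems found in df."""
    # Schema: stop early if an expected column is absent.
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing_cols:
        return [f"missing columns: {missing_cols}"]

    problems = []

    # Missing values, reported per column.
    for col, count in df[EXPECTED_COLUMNS].isnull().sum().items():
        if count > 0:
            problems.append(f"{col}: {int(count)} missing value(s)")

    # Range check on age; only meaningful when the column is numeric.
    if pd.api.types.is_numeric_dtype(df["age"]):
        bad = df[(df["age"] < AGE_MIN) | (df["age"] > AGE_MAX)]
        if not bad.empty:
            ids = bad["user_id"].tolist()
            problems.append(f"age out of range for user_id(s): {ids}")
    else:
        problems.append("age is not numeric")

    return problems


# The sample data with issues from this section.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "name": ["John Doe", "Jane Smith", "Sam Johnson", "Emily Davis"],
    "email": ["john.doe@example.com", "jane.smith@example.com", None, "emily.davis@example.com"],
    "age": [28, None, 22, 120],
})

problems = validate_users(sample)
for problem in problems:
    print(problem)
```

Returning a list instead of raising on the first failure lets a pipeline log all issues in one pass and then decide whether to stop.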

## Data Preparation

This phase transforms validated data for analysis:

- **Data Cleaning**: Remove duplicates and handle errors.
- **Feature Engineering**: Create or adjust features for better model performance.
- **Normalization and Scaling**: Rescale numeric features to a common range, which helps gradient-based models converge.
- **Splitting the Data**: Create training, validation, and testing datasets.


```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Cleaning data: remove duplicates and fill missing ages with the median.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about or silently ignore.
data_cleaned = data_with_issues.drop_duplicates().copy()
data_cleaned['age'] = data_cleaned['age'].fillna(data_cleaned['age'].median())

# Normalize the age feature to the [0, 1] range
scaler = MinMaxScaler()
data_cleaned['age_scaled'] = scaler.fit_transform(data_cleaned[['age']])

# Splitting data into train and test sets
train, test = train_test_split(data_cleaned, test_size=0.2, random_state=42)

print("Training Data:")
print(train)
print("Testing Data:")
print(test)
```
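
The cell above fits the scaler before splitting, so the test row influences the scaling range. A common alternative is to split first and fit the scaler on the training portion only. The sketch below assumes the cleaned ages from this section (the missing age filled with the median, 28); the `df`, `train_df`, and `test_df` names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Cleaned sample data: the missing age has already been filled with the median (28).
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [28.0, 28.0, 22.0, 120.0],
})

# Split first so the test rows play no part in fitting the scaler.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df = train_df.copy()
test_df = test_df.copy()

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = MinMaxScaler()
train_df["age_scaled"] = scaler.fit_transform(train_df[["age"]]).ravel()
test_df["age_scaled"] = scaler.transform(test_df[["age"]]).ravel()

print(train_df)
print(test_df)
```

With this ordering, a scaled test value can fall outside the 0 to 1 range when its age lies outside the range seen in training. That is expected, since it reflects what the model would encounter on unseen data.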