# Lab 1: Data Preprocessing & Feature Engineering

**AI/ML for Data Scientists - Day 1**

In this lab, you'll practice:
- Loading and exploring datasets
- Handling missing values
- Feature scaling and encoding
- Feature engineering
- Model training and evaluation

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Libraries loaded successfully!")

## Part 1: Load and Explore the Dataset

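We'll generate a synthetic customer churn dataset for this lab. Every column, including `churn`, is drawn independently at random, so the features carry no real signal about the target. Don't expect the models to find meaningful patterns; the focus here is on the preprocessing workflow.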

In [None]:
# Create a sample dataset (simulating customer churn data)
np.random.seed(42)
n_samples = 1000

data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.random.randint(18, 70, n_samples),
    'tenure_months': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 5000, n_samples),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'churn': np.random.choice([0, 1], n_samples, p=[0.73, 0.27])
}

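df = pd.DataFrame(data)

# Add some missing values.
# 'age' is cast to float first because assigning NaN into an integer column
# is deprecated in recent pandas versions.
# replace=False ensures exactly 50, 30, and 20 distinct rows are affected.
df['age'] = df['age'].astype(float)
df.loc[np.random.choice(df.index, 50, replace=False), 'age'] = np.nan
df.loc[np.random.choice(df.index, 30, replace=False), 'total_charges'] = np.nan
df.loc[np.random.choice(df.index, 20, replace=False), 'tech_support'] = np.nan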

print(f"Dataset shape: {df.shape}")
df.head()

### Exercise 1.1: Explore the Dataset

Use pandas methods to explore the dataset structure and statistics.

In [None]:
# YOUR CODE HERE: Check the data types and info
df.info()

In [None]:
# YOUR CODE HERE: Get descriptive statistics
df.describe()

In [None]:
# YOUR CODE HERE: Check for missing values
df.isnull().sum()

In [None]:
# YOUR CODE HERE: Check the target variable distribution
df['churn'].value_counts(normalize=True)
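
Raw missing-value counts are easier to judge next to percentages and dtypes. The following is a minimal sketch on a small made-up frame (`demo` and its columns are hypothetical, not the lab dataset); the same pattern should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
demo = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan],
    'plan': ['A', 'B', None, 'A'],
})

# One row per column: count of missing values, share of missing values, dtype
summary = pd.DataFrame({
    'n_missing': demo.isnull().sum(),
    'pct_missing': demo.isnull().mean() * 100,
    'dtype': demo.dtypes.astype(str),
})
print(summary)
```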

## Part 2: Handling Missing Values

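Let's handle the missing values in our dataset. For simplicity, this lab imputes on the full dataset before splitting it. In a real project, fit imputers on the training split only, so that test-set statistics do not leak into training.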

### Exercise 2.1: Impute Numeric Missing Values

Use SimpleImputer to fill missing numeric values with the median.

In [None]:
# YOUR CODE HERE: Impute numeric columns
numeric_cols = ['age', 'total_charges']

# Create imputer with median strategy
imputer = SimpleImputer(strategy='median')

# Fit and transform
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Verify no missing values remain
print("Missing values after imputation:")
print(df[numeric_cols].isnull().sum())
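
To see what the median strategy does, here is a minimal sketch on a tiny made-up frame (`toy` and its values are illustrative assumptions, not lab data). After fitting, the imputer exposes the per-column medians it learned in its `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'age': [20.0, 30.0, np.nan, 50.0],
    'total_charges': [100.0, np.nan, 300.0, 1000.0],
})

toy_imputer = SimpleImputer(strategy='median')
toy_filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)

# Medians learned during fit, one per column (NaNs are ignored when computing them)
print(toy_imputer.statistics_)
print(toy_filled)
```

The median is less sensitive to outliers than the mean: the large `total_charges` value of 1000 does not pull the fill value upward.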

### Exercise 2.2: Impute Categorical Missing Values

Fill missing categorical values with the most frequent value.

In [None]:
# YOUR CODE HERE: Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')

df[['tech_support']] = cat_imputer.fit_transform(df[['tech_support']])

# Verify
print("Missing values after imputation:")
print(df['tech_support'].isnull().sum())

## Part 3: Feature Engineering

Create new features that might help the model.

### Exercise 3.1: Create New Features

Create the following engineered features:
1. `avg_monthly_charge`: total_charges / tenure_months
2. `age_group`: Binned age (Young, Middle, Senior)
3. `is_long_term`: 1 if the contract is not month-to-month, 0 otherwise

In [None]:
# YOUR CODE HERE: Create engineered features

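# Average monthly charge
# tenure_months is never 0 in this synthetic dataset; the replace() is a guard
# against division by zero that would matter with real data.
df['avg_monthly_charge'] = df['total_charges'] / df['tenure_months'].replace(0, 1)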

# Age group
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])

# Long-term contract indicator
df['is_long_term'] = (df['contract_type'] != 'Month-to-month').astype(int)

print("New features created:")
df[['avg_monthly_charge', 'age_group', 'is_long_term']].head()
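
By default, `pd.cut` uses right-closed intervals, so with the bins above an age of exactly 30 falls in `Young` and an age of exactly 50 falls in `Middle`. A small sketch with hypothetical boundary ages makes this visible:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([18, 30, 31, 50, 51, 69])

# Same bins and labels as in the exercise: (0, 30], (30, 50], (50, 100]
groups = pd.cut(ages, bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])

print(pd.DataFrame({'age': ages, 'age_group': groups}))
```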

## Part 4: Feature Encoding

Encode categorical variables for model training.

### Exercise 4.1: One-Hot Encode Categorical Variables

In [None]:
# Define categorical columns to encode
cat_columns = ['contract_type', 'payment_method', 'internet_service', 'tech_support', 'age_group']

# YOUR CODE HERE: One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=cat_columns, drop_first=True)

print(f"Shape after encoding: {df_encoded.shape}")
df_encoded.head()

## Part 5: Feature Scaling

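Standardize the numeric features so they share a comparable scale. This helps scale-sensitive models such as logistic regression; tree-based models such as random forests are largely unaffected by it.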

In [None]:
# Define features and target
feature_cols = [col for col in df_encoded.columns if col not in ['customer_id', 'churn']]
X = df_encoded[feature_cols]
y = df_encoded['churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# YOUR CODE HERE: Scale numeric features
numeric_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges', 'avg_monthly_charge']

scaler = StandardScaler()

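# Work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Fit on training data only!
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])

# Transform test data with the statistics learned from the training data
X_test[numeric_features] = scaler.transform(X_test[numeric_features])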

print("Features scaled successfully!")
X_train[numeric_features].describe()
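
`StandardScaler` computes `z = (x - mean) / std` per column, using the mean and (population) standard deviation learned from the data passed to `fit`. The sketch below checks this by hand on hypothetical values and shows that test data is scaled with the training statistics, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values, for illustration only
train_vals = np.array([[10.0], [20.0], [30.0]])
test_vals = np.array([[40.0]])

demo_scaler = StandardScaler().fit(train_vals)

# Manual standardization; NumPy's std() defaults to ddof=0, matching StandardScaler
manual = (train_vals - train_vals.mean(axis=0)) / train_vals.std(axis=0)

scaled_train = demo_scaler.transform(train_vals)
scaled_test = demo_scaler.transform(test_vals)  # uses the training mean and std

print(np.allclose(manual, scaled_train))
print(scaled_test)
```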

## Part 6: Model Training & Evaluation

### Exercise 6.1: Train a Logistic Regression Model

In [None]:
# YOUR CODE HERE: Train logistic regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train, y_train)

# Predict
y_pred_lr = lr_model.predict(X_test)

# Evaluate
print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
# zero_division=0 avoids a warning if the model predicts no churners at all
print(f"Precision: {precision_score(y_test, y_pred_lr, zero_division=0):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_lr):.4f}")

### Exercise 6.2: Train a Random Forest Model

In [None]:
# YOUR CODE HERE: Train random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
# zero_division=0 avoids a warning if the model predicts no churners at all
print(f"Precision: {precision_score(y_test, y_pred_rf, zero_division=0):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_rf):.4f}")
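
With roughly 73% non-churners, accuracy alone can look respectable even for a model that never predicts churn. A rough way to put the numbers above in context is a majority-class baseline. The sketch below uses `DummyClassifier` on hypothetical labels (`y_demo`) generated with the same class proportions as the lab; it is an illustration, not part of the lab's required solution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels with roughly the same class balance as the lab data
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1], size=1000, p=[0.73, 0.27])
X_demo = np.zeros((1000, 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

print(f"Baseline accuracy: {accuracy_score(y_demo, y_base):.4f}")
print(f"Baseline F1: {f1_score(y_demo, y_base, zero_division=0):.4f}")
```

A model is only useful if it beats this baseline, which is why precision, recall, and F1 are reported alongside accuracy.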

### Exercise 6.3: Cross-Validation

In [None]:
# YOUR CODE HERE: Perform stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation scores
cv_scores_lr = cross_val_score(lr_model, X_train, y_train, cv=cv, scoring='f1')
cv_scores_rf = cross_val_score(rf_model, X_train, y_train, cv=cv, scoring='f1')

print(f"Logistic Regression CV F1: {cv_scores_lr.mean():.4f} (+/- {cv_scores_lr.std():.4f})")
print(f"Random Forest CV F1: {cv_scores_rf.mean():.4f} (+/- {cv_scores_rf.std():.4f})")
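
One caveat with cross-validating on already-preprocessed data is that the imputer and scaler have seen every fold. A common remedy is to wrap preprocessing and model in a single `Pipeline`, so each fold refits the preprocessing on its own training portion. The sketch below shows the idea on a small hypothetical dataset (`raw`, `target`, and the column lists are assumptions made for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with a few missing ages, for illustration only
rng = np.random.default_rng(42)
n = 200
raw = pd.DataFrame({
    'age': rng.integers(18, 70, n).astype(float),
    'monthly_charges': rng.uniform(20, 100, n),
    'contract_type': rng.choice(['Month-to-month', 'One year', 'Two year'], n),
})
raw.loc[rng.choice(n, 10, replace=False), 'age'] = np.nan
target = rng.choice([0, 1], n, p=[0.73, 0.27])

numeric = ['age', 'monthly_charges']
categorical = ['contract_type']

# Numeric columns: impute then scale; categorical columns: one-hot encode
preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

model = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Each fold fits the whole pipeline on its training part only
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, raw, target, cv=cv_demo, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```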

### Exercise 6.4: Confusion Matrix Visualization

In [None]:
# YOUR CODE HERE: Plot confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Logistic Regression')
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')

# Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title('Random Forest')
axes[1].set_ylabel('Actual')
axes[1].set_xlabel('Predicted')

plt.tight_layout()
plt.show()
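
For binary labels, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`, with actual classes on the rows and predicted classes on the columns. The sketch below, using hypothetical labels, unpacks the four cells and recomputes precision and recall from them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # of predicted churners, how many really churned
recall_manual = tp / (tp + fn)     # of actual churners, how many were caught

print(tn, fp, fn, tp)
print(precision_manual, precision_score(y_true, y_hat))
print(recall_manual, recall_score(y_true, y_hat))
```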

### Exercise 6.5: Feature Importance

In [None]:
# YOUR CODE HERE: Plot feature importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(15), x='importance', y='feature', palette='viridis')
plt.title('Top 15 Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

## Summary

In this lab, you learned how to:

1. **Load and explore** datasets with pandas
2. **Handle missing values** using SimpleImputer
3. **Engineer new features** from existing data
4. **Encode categorical variables** with one-hot encoding
5. **Scale numeric features** with StandardScaler
6. **Train and evaluate** classification models
7. **Perform cross-validation** for robust evaluation
8. **Visualize results** with confusion matrices and feature importance

---

*AI/ML for Data Scientists | AI Elevate*