# Data Science Lifecycle Basics

## 1. README

### **Objective**
This notebook provides a basic, end-to-end walkthrough of the data science lifecycle using the classic Titanic survival dataset. It covers problem definition, data collection, cleaning, exploratory data analysis (EDA), feature engineering, modeling, evaluation, a toy deployment example, and an introduction to monitoring. The goal is to provide a simple, runnable example for an intern or beginner to understand the flow of a typical data science project.

### **Setup Instructions**
1.  **Create a virtual environment:** It's best practice to isolate your project's dependencies. 
    * On macOS/Linux:
        ```bash
        python3 -m venv .venv
        source .venv/bin/activate
        ```
    * On Windows:
        ```bash
        python -m venv .venv
        .venv\Scripts\activate
        ```
2.  **Install dependencies:** Upgrade `pip` first (the `-U` flag upgrades a package that is already installed), then install the required libraries.
    ```bash
    pip install -U pip
    pip install pandas numpy matplotlib seaborn scikit-learn joblib jupyter
    ```
3.  **Run the notebook:** Start the Jupyter Notebook server in the same directory.
    ```bash
    jupyter notebook
    ```
    Then, open `ds_lifecycle_basics.ipynb` in your browser.

---

# Data Science Lifecycle Tutorial

## 2. Phase 1 – Problem Definition

**Business Problem:** We want to predict which passengers survived the sinking of the Titanic. This is a classic **classification problem**. Understanding the factors that influenced survival can provide valuable insights, such as what characteristics (e.g., age, gender, class) were most important in determining a person's fate. This type of analysis could be used for historical research or as a learning exercise in machine learning.

**Target Variable:** The `Survived` column. It is a binary variable where `0` = did not survive and `1` = survived.

**Success Criteria:** This is an introductory model, so we set no hard performance target; the model should at least beat the naive baseline of predicting that nobody survived. Our primary metrics will be **accuracy, precision, recall, and F1-score**. We will also use a **confusion matrix** and an **ROC curve** to understand the model's performance beyond a single number.
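
To make these metrics concrete, here is a minimal sketch that derives them by hand from the four confusion-matrix counts. The labels below are made up for illustration and are not Titanic results.

```python
# Hypothetical labels for illustration only (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The `sklearn.metrics` functions used later in the notebook compute the same quantities from `y_test` and `y_pred`.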

## 3. Phase 2 – Data Collection

We will load the data directly from a public URL, so the notebook needs no local data files (an internet connection is required). The dataset is sourced from the `datasciencedojo` GitHub repository, which is a common source for introductory datasets.
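
Because the data comes from a remote file, it can help to confirm that the expected columns are present before going further. The sketch below is one possible check; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the inline CSV is a deliberately incomplete stand-in for the real download.

```python
import io

import pandas as pd

# Columns this notebook relies on (hypothetical constant, not part of the notebook).
EXPECTED_COLUMNS = {
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
}


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return sorted(set(expected) - set(frame.columns))


# A deliberately incomplete sample, standing in for a bad or changed download.
sample_csv = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22\n'
)
partial_df = pd.read_csv(sample_csv)

missing = missing_columns(partial_df)
print("Missing columns:", missing)
```

An empty list would mean the frame has every column the later cells expect.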

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

# Set a random seed for reproducibility across the notebook
np.random.seed(42)

# Dataset URL
data_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# Load the dataset
try:
    df = pd.read_csv(data_url)
    print("Data loaded successfully from URL.\n")
except Exception as e:
    print(f"Error loading data: {e}")
    df = None
    
if df is not None:
    # Display basic information about the dataset
    print("Dataset Shape:", df.shape)
    print("\nColumn Information:\n")
    df.info()
    print("\nFirst 5 rows:\n")
    display(df.head())

## 4. Phase 3 – Data Cleaning and EDA

This phase involves handling missing data, converting data types, and exploring relationships within the data through visualizations. We'll identify missing values and decide on a strategy to handle them. For EDA, we'll look at the distribution of key features and their relationship with the target variable, `Survived`.
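
As a small illustration of why the imputation strategy matters, the sketch below compares the mean and the median on a toy age column with one outlier. The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy ages with one outlier (80) and one missing value; not taken from the dataset.
ages = pd.Series([22, 24, 26, 28, 80, np.nan], name="Age")

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # stays near the typical passenger

# Fill the missing value with the median and assign the result back.
filled = ages.fillna(median_age)

print(f"mean={mean_age}, median={median_age}")
print(filled.tolist())
```

Here the single outlier moves the mean well away from the bulk of the values, while the median does not move, which is why the median is often the safer default for skewed columns.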

In [None]:
# Check for missing values
print("\nMissing values per column:\n")
print(df.isnull().sum())

# Strategy for missing values:
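# 'Age' has a significant number of missing values. We impute them with the median, which is less sensitive to outliers than the mean.
# 'Cabin' has too many missing values (~77%) to be useful, so we will drop this column.
# 'Embarked' has only 2 missing values. We will fill them with the most frequent value (mode).
# Note: for simplicity the median and mode are computed on the full dataset; a stricter workflow would fit them on the training split only.

# Assign the results back instead of calling fillna(..., inplace=True) on a column,
# which triggers chained-assignment warnings in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop('Cabin', axis=1)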

print("\nMissing values after cleaning:\n")
print(df.isnull().sum())

# Basic EDA
# Univariate Analysis: Distribution of 'Age'
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], kde=True, bins=30)
plt.title('Distribution of Passenger Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Bivariate Analysis: Survival Rate by Sex
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival Rate by Sex')
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
plt.show()
print("\nInsight: Females had a significantly higher survival rate than males.\n")

# Bivariate Analysis: Survival Rate by Passenger Class (P-class)
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.show()
print("\nInsight: Passengers in higher classes (1st class) had a much higher survival rate.\n")

## 5. Phase 4 – Feature Engineering

Feature engineering is the process of using domain knowledge to create new features that are more informative and can improve model performance. We will create two new features: `FamilySize` and `IsAlone`. We'll also extract a `Title` from the `Name` column to capture social status, which is often a strong predictor of survival.
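
The three engineered features can be sketched on a handful of rows as follows. The names follow the dataset's "Surname, Title. Given names" format, and the small DataFrame is built by hand only for illustration.

```python
import pandas as pd

# A few illustrative rows in the same format as the Titanic 'Name' column.
passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Family size counts siblings/spouses, parents/children, and the passenger.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1

# A passenger is alone when the family size is exactly one.
passengers["IsAlone"] = (passengers["FamilySize"] == 1).astype(int)

# The title is the word that follows a space and ends with a period.
passengers["Title"] = passengers["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

print(passengers[["FamilySize", "IsAlone", "Title"]])
```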

In [None]:
# Create FamilySize feature
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Create IsAlone feature
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Extract Title from Name (a raw string avoids an invalid-escape warning for '\.')
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Replace rare titles (fewer than 10 occurrences) with a single 'Rare' category
title_counts = df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index
df['Title'] = df['Title'].replace(rare_titles, 'Rare')

print("Top 5 most common titles after cleaning:\n", df['Title'].value_counts().head())

# Drop original features that are no longer needed
df.drop(['Name', 'Ticket', 'PassengerId', 'SibSp', 'Parch'], axis=1, inplace=True)

# Separate features and target variable
X = df.drop('Survived', axis=1)
y = df['Survived']

# Define which features are numerical and which are categorical
numerical_features = ['Age', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'IsAlone']

# Create a preprocessing pipeline
# StandardScaler standardizes numerical features to zero mean and unit variance, which helps scale-sensitive models such as Logistic Regression.
# OneHotEncoder converts categorical features into binary indicator columns; handle_unknown='ignore' encodes categories not seen during fitting as all zeros instead of raising an error.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

print("\nFinal feature matrix shape (after preprocessing setup):", (preprocessor.fit_transform(X).shape))
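
To see what the two transformers in the preprocessor do, here is a minimal sketch on a three-row toy frame. The data is invented, and the transformers are applied separately (rather than through `ColumnTransformer`) only to keep the output easy to read.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data for illustration only.
train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Embarked": ["S", "C", "S"]})

# Standardization: subtract the mean and divide by the standard deviation.
scaler = StandardScaler().fit(train[["Fare"]])
scaled_fare = scaler.transform(train[["Fare"]]).ravel()

# One-hot encoding: one indicator column per category seen during fit.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["Embarked"]])
known = encoder.transform(pd.DataFrame({"Embarked": ["S"]})).toarray()
unseen = encoder.transform(pd.DataFrame({"Embarked": ["Q"]})).toarray()

print("Scaled fares:", scaled_fare)
print("Categories:", encoder.categories_[0].tolist())
print("Encoding of 'S':", known.tolist())
print("Encoding of unseen 'Q':", unseen.tolist())
```

The unseen category is encoded as a row of zeros, which is the behaviour `handle_unknown='ignore'` selects.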


## 6. Phase 5 – Modeling

We will split the data into training and testing sets to evaluate our model on unseen data. Then, we will train a simple Logistic Regression model. Logistic Regression is a good baseline model for classification tasks because it's fast, interpretable, and effective for many problems.
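
One optional variation, not used in this notebook's split, is to pass `stratify` to `train_test_split` so that both splits keep the same class balance. The sketch below shows the effect on synthetic labels; `X_demo` and `y_demo` are placeholder arrays, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 40% positive class (illustrative only).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the 40/60 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

On a small dataset, keeping the class ratio stable between splits makes the test metrics a little less dependent on the luck of the shuffle.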

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

# Create the modeling pipeline
# The Pipeline object combines the preprocessing steps and the model into a single estimator.
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Train the model on the training data
model_pipeline.fit(X_train, y_train)

print("\nModel training complete.")

## 7. Phase 6 – Evaluation

After training the model, we need to evaluate its performance on the test set. We will use several metrics to get a comprehensive view of how well the model performs. We'll also visualize the confusion matrix and the ROC curve to better understand its predictions.
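
The ROC curve summarizes how predictions change as the decision threshold moves. As a rough illustration of the underlying trade-off, the sketch below computes precision and recall at three thresholds. The probabilities are invented, and `precision_recall_at` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical true labels and predicted survival probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
proba = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.20, 0.90]


def precision_recall_at(threshold):
    """Classify as positive when probability >= threshold, then score."""
    preds = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = {t: precision_recall_at(t) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy example, lowering the threshold catches every survivor at the cost of more false alarms, while raising it does the opposite. `model_pipeline.predict` uses a threshold of 0.5 on the predicted probability.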

In [None]:
# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)
y_pred_proba = model_pipeline.predict_proba(X_test)[:, 1]

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Plot Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Did Not Survive', 'Survived'], yticklabels=['Did Not Survive', 'Survived'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

print("\nInterpretation: compare the metrics above. Precision and recall usually trade off against each other, and an AUC well above 0.5 means the model ranks survivors above non-survivors more often than chance would.")

## 8. Phase 7 – Deployment (Toy Example)

After a model is trained and evaluated, it can be packaged for use outside the notebook. We can save the entire pipeline (including preprocessing steps) to a file using `joblib`. When the pipeline is reloaded, it applies exactly the same transformations to new data as it did during training, which avoids mismatches between training-time and prediction-time preprocessing.
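
The save-and-reload round trip can be checked in isolation. The sketch below fits a tiny pipeline on synthetic data, writes it to a temporary directory, and confirms the reloaded copy predicts identically. The file name `toy_pipeline.joblib` and the synthetic data are assumptions made for this example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with a simple linear decision rule (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X, y)

# Save to a temporary directory, then load the pipeline back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_pipeline.joblib"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same_predictions = bool(np.array_equal(pipeline.predict(X), restored.predict(X)))
print("Predictions identical after reload:", same_predictions)
```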

In [None]:
# Save the trained pipeline to a file
model_filename = 'titanic_survival_model.joblib'
joblib.dump(model_pipeline, model_filename)

print(f"Model and preprocessor saved to {model_filename}")

# --- Toy Deployment Simulation --- #

# Load the saved model (in a real deployment this would happen in a separate process or service)
loaded_model = joblib.load(model_filename)

print("\nModel loaded successfully.")

# Create a small sample of new data (simulating a live prediction request).
# The pipeline expects the same engineered columns it was trained on, so
# FamilySize, IsAlone, and Title must be supplied (or derived) before predicting.
new_data = pd.DataFrame([
    # Sample 1: A female from 1st class travelling alone (likely to survive)
    {'Pclass': 1, 'Sex': 'female', 'Age': 28, 'Fare': 70.0, 'Embarked': 'S', 'FamilySize': 1, 'IsAlone': 1, 'Title': 'Mrs'},
    # Sample 2: A male from 3rd class with 1 sibling/spouse and 2 parents/children aboard (less likely to survive)
    {'Pclass': 3, 'Sex': 'male', 'Age': 45, 'Fare': 25.0, 'Embarked': 'C', 'FamilySize': 4, 'IsAlone': 0, 'Title': 'Mr'}
])

# Use the loaded model to make a prediction on the new data
predictions = loaded_model.predict(new_data)

print("\nPredictions on new data:")
for i, pred in enumerate(predictions):
    status = 'Survived' if pred == 1 else 'Did Not Survive'
    print(f"Sample {i+1}: {status}")

## 9. Phase 8 – Monitoring (Intro Level)

After deployment, it's crucial to monitor the model's performance in the real world. Over time, the distribution of incoming data can change (**data drift**), which can cause the model's performance to degrade. We'll simulate a simple check for data drift by comparing the mean of a key feature from our training data to a new, hypothetical batch of incoming data.
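
A fixed tolerance in raw units (such as years of age) is easy to read but depends on the feature's scale. One possible refinement, sketched below, is to express the shift in units of the reference standard deviation. The function name `standardized_mean_shift`, the 0.5 threshold, and all the age values are assumptions made for this illustration.

```python
from statistics import mean, stdev


def standardized_mean_shift(reference, incoming):
    """Absolute difference in means, in units of the reference standard deviation."""
    return abs(mean(incoming) - mean(reference)) / stdev(reference)


# Hypothetical age samples: a reference batch and two incoming batches.
reference_ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]
similar_batch = [24, 36, 28, 33, 30, 50, 5, 29, 16, 16]
older_batch = [48, 61, 55, 70, 52, 66, 59, 63, 45, 58]

DRIFT_THRESHOLD = 0.5  # hypothetical cut-off, to be tuned per feature

for name, batch in [("similar", similar_batch), ("older", older_batch)]:
    shift = standardized_mean_shift(reference_ages, batch)
    status = "ALERT" if shift > DRIFT_THRESHOLD else "PASS"
    print(f"{name} batch: shift={shift:.2f} std units -> {status}")
```

Because the shift is scale-free, the same threshold can be reused across features such as `Age` and `Fare`, although a mean comparison alone will not catch changes in the shape of a distribution.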

In [None]:
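# Simulate a new, small batch of incoming data by sampling from the cleaned dataset.
# Because the batch is drawn from the same data the model was built on, no real drift is expected here.
incoming_data_sample = df.sample(n=50, replace=False, random_state=1)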

print("Simulating Monitoring Checks:")
print("Model Version: v1.0")
print(f"Data Timestamp: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")

# 1. Data Schema Check (simplified)
if sorted(list(incoming_data_sample.columns)) == sorted(list(df.columns)):
    print("\nSchema Check: PASS. Incoming data columns match training data.")
else:
    print("\nSchema Check: FAIL. Column mismatch detected.")

# 2. Simple Data Drift Check (comparing mean 'Age')
train_age_mean = X_train['Age'].mean()
incoming_age_mean = incoming_data_sample['Age'].mean()

mean_diff = abs(train_age_mean - incoming_age_mean)
tolerance = 2.0  # Set a simple threshold for drift detection

print(f"\nTraining 'Age' Mean: {train_age_mean:.2f}")
print(f"Incoming 'Age' Mean: {incoming_age_mean:.2f}")

if mean_diff > tolerance:
    print(f"\nData Drift Alert: The mean 'Age' has shifted by more than {tolerance:.2f}.")
else:
    print("\nData Drift Check: PASS. Mean 'Age' is within tolerance.")

# In a real-world scenario, you would log these metrics and send alerts automatically.

---
## Conclusion and Next Steps

This notebook has walked through the complete data science lifecycle, from defining a problem to a basic simulation of model monitoring. Along the way we trained a simple Logistic Regression baseline on the Titanic dataset and evaluated it with accuracy, precision, recall, F1-score, a confusion matrix, and an ROC curve.

**Key Findings:**
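* In the EDA, sex and passenger class showed the clearest association with survival: women and first-class passengers survived at noticeably higher rates. The notebook does not measure feature importance directly.
* The baseline model's performance could likely be improved with more advanced feature engineering and a more complex model.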

**Known Limitations:**
* The model is a simple linear classifier. More complex models like Gradient Boosting or a Random Forest might yield better results.
* The dataset is small and historical. Real-world data is often messier and larger.

**Next Steps:**
* Experiment with different models (e.g., `RandomForestClassifier`).
* Use more sophisticated methods for handling missing values and feature engineering.
* Perform **hyperparameter tuning** to optimize the model's performance.