# 04: Python & Tooling for Machine Learning

Welcome to your ML toolkit! Today we'll work through the core Python libraries behind most machine learning projects: pandas, NumPy, matplotlib, seaborn, and scikit-learn. These tools will be your daily companions in ML work.

> 💡 **Companion Reading**: This notebook pairs with [04_python_tooling.md](04_python_tooling.md) for deeper insights, best practices, and additional guidance.

## 🎯 Objectives
- Master data manipulation using pandas and NumPy for ML workflows
- Create insightful visualizations using matplotlib and seaborn
- Load, explore, and understand real-world datasets
- Perform data cleaning and feature engineering for machine learning
- Build and evaluate simple models using scikit-learn
- Develop practical skills for the entire ML pipeline

## 📊 Working with DataFrames

Pandas DataFrames are like supercharged spreadsheets. They're the primary way we handle structured data in machine learning projects.
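
Before loading a real file, here is a minimal offline sketch of the "spreadsheet" idea: columns you can select, filter, and compute on. The `bills` DataFrame and its values are invented for illustration and are not part of the tips dataset.

```python
import pandas as pd

# Hypothetical mini-dataset, invented for illustration only
bills = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 6.0, 4.0],
    "day": ["Sat", "Sun", "Sat", "Sun"],
})

# Select one column (a Series) and several columns (a DataFrame)
tips_only = bills["tip"]
subset = bills[["total_bill", "tip"]]

# Boolean filtering: keep rows where the bill exceeds 15
large_bills = bills[bills["total_bill"] > 15]

# Vectorized arithmetic: operates on whole columns, no explicit loop
bills["tip_share"] = bills["tip"] / bills["total_bill"]

print(large_bills)
print(bills["tip_share"].round(2).tolist())
```

The same three moves (select, filter, compute) apply unchanged to the full dataset loaded in the next cell.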


In [None]:
import pandas as pd
import numpy as np

# Load sample data - restaurant tips dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

print("📋 Dataset Overview:")
print(f"Shape: {df.shape} (rows, columns)")
print(f"Columns: {list(df.columns)}")
print("\n📊 First 5 rows:")
print(df.head())

print("\n🔍 Data Types:")
print(df.dtypes)

print("\n📈 Quick Info:")
df.info()  # info() prints its report directly and returns None, so don't wrap it in print()

## 📈 Exploratory Data Analysis

Understanding your data is the first step in any ML project. Let's explore what we're working with!
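
One exploration step that pairs well with summary statistics is comparing groups. The sketch below uses a small invented `meals` DataFrame (not the tips data) to show `value_counts()` and `groupby()` side by side.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
meals = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Dinner", "Lunch", "Dinner"],
    "tip": [2.0, 3.0, 5.0, 4.0, 4.0],
})

# How many rows fall in each category?
counts = meals["time"].value_counts()

# Mean and median tip per category
per_time = meals.groupby("time")["tip"].agg(["mean", "median"])

print(counts)
print(per_time)
```

On the real dataset, the same pattern would be something like `df.groupby('day')['tip'].mean()`.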


In [None]:
# Summary statistics for numerical columns
print("📊 Summary Statistics:")
print(df.describe())

print("\n🏷️ Categorical Variables:")
for col in df.select_dtypes(include=['object']).columns:
    print(f"\n{col}:")
    print(df[col].value_counts())

# Check for missing values
print("\n❓ Missing Values:")
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values! 🎉")

## 🧹 Data Cleaning & Feature Engineering

Raw data is rarely ready for modeling. Let's clean and enhance our dataset!
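
The tips dataset has no missing values, so the cell below never needs to fill any. As a hedged sketch of what that step could look like on messier data, here is one common approach (median for numbers, most frequent value for categories) applied to an invented `raw` DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, invented for illustration only
raw = pd.DataFrame({
    "total_bill": [12.0, np.nan, 30.0, 18.0],
    "day": ["Sat", "Sun", None, "Sun"],
})

cleaned = raw.copy()

# Numeric column: fill gaps with the median of the observed values
cleaned["total_bill"] = cleaned["total_bill"].fillna(cleaned["total_bill"].median())

# Categorical column: fill gaps with the most frequent value (the mode)
cleaned["day"] = cleaned["day"].fillna(cleaned["day"].mode()[0])

print(raw.isnull().sum())
print(cleaned)
```

Median and mode are only one choice among several. When a model is involved, the fill values are usually computed from the training split alone so that no information from the test rows slips in.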


In [None]:
# Create a copy for cleaning
df_clean = df.copy()

# Feature engineering: create tip percentage
df_clean['tip_percent'] = 100 * df_clean['tip'] / df_clean['total_bill']

# Create additional features that might be useful
df_clean['party_size_category'] = pd.cut(df_clean['size'], 
                                        bins=[0, 2, 4, float('inf')], 
                                        labels=['Small (1-2)', 'Medium (3-4)', 'Large (5+)'])

# Convert categorical variables to numerical (for modeling later)
df_clean['sex_encoded'] = df_clean['sex'].map({'Male': 1, 'Female': 0})
df_clean['smoker_encoded'] = df_clean['smoker'].map({'Yes': 1, 'No': 0})

print("🔧 Feature Engineering Results:")
print(f"Original columns: {len(df.columns)}")
print(f"Enhanced columns: {len(df_clean.columns)}")
print("\n📊 New features preview:")
print(df_clean[['total_bill', 'tip', 'tip_percent', 'party_size_category']].head())

print("\n📈 Tip percentage statistics:")
print(f"Mean: {df_clean['tip_percent'].mean():.1f}%")
print(f"Median: {df_clean['tip_percent'].median():.1f}%")
print(f"Range: {df_clean['tip_percent'].min():.1f}% - {df_clean['tip_percent'].max():.1f}%")
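
The `.map()` calls above work because `sex` and `smoker` each have two values. For a column with more categories, such as `day`, one option is one-hot encoding. The sketch below uses an invented `meals` DataFrame whose category names mirror the `day` column.

```python
import pandas as pd

# Hypothetical rows using the same category names as the 'day' column
meals = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sat"]})

# One-hot encoding: one 0/1 indicator column per category
day_dummies = pd.get_dummies(meals["day"], prefix="day", dtype=int)

print(day_dummies)
```

Each row has exactly one indicator set to 1, which avoids implying an order among the days the way a single integer code would.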

## 🎨 Data Visualization

Visualization is crucial for understanding patterns in your data. Let's create insightful plots that reveal relationships and trends!


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the plotting style
plt.style.use('default')
sns.set_palette("husl")

# Create a comprehensive visualization dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Scatter plot: Tips vs Total Bill
sns.scatterplot(data=df_clean, x='total_bill', y='tip', hue='sex', ax=axes[0,0])
axes[0,0].set_title("Tips vs. Total Bill by Gender")
axes[0,0].grid(True, alpha=0.3)

# 2. Distribution of tip percentages
sns.histplot(data=df_clean, x='tip_percent', bins=20, ax=axes[0,1])
axes[0,1].set_title("Distribution of Tip Percentages")
axes[0,1].axvline(df_clean['tip_percent'].mean(), color='red', linestyle='--', 
                  label=f'Mean: {df_clean["tip_percent"].mean():.1f}%')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# 3. Box plot: Tip percentage by day and time
sns.boxplot(data=df_clean, x='day', y='tip_percent', hue='time', ax=axes[1,0])
axes[1,0].set_title("Tip Percentage by Day and Time")
axes[1,0].tick_params(axis='x', rotation=45)

# 4. Correlation heatmap
numeric_cols = ['total_bill', 'tip', 'size', 'tip_percent']
correlation_matrix = df_clean[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[1,1])
axes[1,1].set_title("Feature Correlation Matrix")

plt.tight_layout()
plt.show()

# Print insights
print("🔍 Key Insights from Visualizations:")
print(f"1. Correlation between total_bill and tip: {correlation_matrix.loc['total_bill', 'tip']:.3f}")
print(f"2. Average tip percentage: {df_clean['tip_percent'].mean():.1f}%")
print("3. Compare the box plot medians to see how tip percentage varies by day and time")
print(f"4. Correlation between size and total_bill: {correlation_matrix.loc['size', 'total_bill']:.3f}")
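
To make the heatmap numbers less of a black box, here is a small sketch that computes a Pearson correlation by hand and compares it with pandas. The `x` and `y` values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration only
x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
y = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# Pearson r: sum of co-deviations divided by the product of the deviation norms
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum()
)

# pandas computes the same quantity (Pearson is the default method)
r_pandas = x.corr(y)

print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

Values near +1 or -1 indicate a nearly linear relationship, and values near 0 indicate little linear relationship. Correlation alone says nothing about causation.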

## 🔍 Machine Learning with Scikit-learn

Now let's build our first machine learning model! We'll predict tip amounts from the bill total, party size, and two customer attributes (sex and smoker status).


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Prepare features and target
# Using multiple features for better prediction
X = df_clean[['total_bill', 'size', 'sex_encoded', 'smoker_encoded']]
y = df_clean['tip']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("🤖 Model Performance:")
print(f"R² Score: {r2:.3f} (higher is better, max = 1.0)")
print(f"Mean Squared Error: {mse:.3f}")
print(f"Root Mean Squared Error: {np.sqrt(mse):.3f}")

print("\n📊 Model Coefficients:")
feature_names = ['Total Bill', 'Party Size', 'Gender (Male=1)', 'Smoker (Yes=1)']
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"Intercept: {model.intercept_:.3f}")

# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Tips')
plt.ylabel('Predicted Tips')
plt.title('Predictions vs Actual')
plt.grid(True, alpha=0.3)

# Feature importance visualization
# Raw coefficients depend on each feature's units, so scale each one by its
# feature's standard deviation to make the magnitudes comparable
plt.subplot(1, 2, 2)
importance = np.abs(model.coef_) * X_train.std().values
plt.barh(feature_names, importance)
plt.xlabel('|Coefficient| × Feature Std. Dev.')
plt.title('Feature Importance (Scaled Coefficients)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Model Insights:")
print("- Coefficients are per-unit effects, so compare the scaled bars rather than the raw values")
top_feature = feature_names[int(np.argmax(importance))]
print(f"- Largest scaled effect: {top_feature}")
print(f"- Model explains {r2*100:.1f}% of the variance in tip amounts on the test set")
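
The metrics printed above can be reproduced with a few lines of NumPy, which helps when interpreting them. This sketch uses invented `y_true` and `y_hat` arrays rather than the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted tips, invented for illustration only
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.0, 3.5, 5.0])

residuals = y_true - y_hat

# MSE: average squared residual; RMSE: the same error, back in the target's units
mse_manual = np.mean(residuals**2)
rmse_manual = np.sqrt(mse_manual)

# R² = 1 - SS_res / SS_tot, where SS_tot measures spread around the mean
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f}, RMSE: {rmse_manual:.3f}, R²: {r2_manual:.3f}")
print(f"sklearn MSE: {mean_squared_error(y_true, y_hat):.3f}, "
      f"sklearn R²: {r2_score(y_true, y_hat):.3f}")
```

RMSE is often the easiest of the three to explain, because it is expressed in the same units as the target (dollars of tip, in this notebook).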

## ✅ Summary Quiz & Checklist

### Quiz Questions
1. **What does `df.describe()` show you?**
   > `df.describe()` provides summary statistics for numerical columns including count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%). It's essential for understanding the distribution and range of your data.

2. **Why is feature engineering (like tip percentage) helpful?**
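   > Feature engineering turns raw columns into variables that express the pattern you care about. Tip percentage normalizes the tip by bill size, so tipping behavior can be compared across small and large bills. Note that `tip_percent` is computed from `tip`, so it must not be used as an input feature when `tip` is the prediction target; that would leak the answer into the model.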

3. **What does the `.fit()` method do in scikit-learn?**
   > The `.fit()` method trains the machine learning model on the provided training data. It learns the patterns and relationships between features (X) and target variable (y), adjusting the model's internal parameters.

4. **Why do we split data into training and testing sets?**
   > To evaluate how well our model generalizes to unseen data. Training on all data and testing on the same data would give overly optimistic results that don't reflect real-world performance.
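
Questions 3 and 4 can be checked with a small experiment. The sketch below uses synthetic data (an assumption for illustration, generated from y = 2x + 1 plus noise) to show `.fit()` recovering the parameters, and an unpruned decision tree to show why scores on training data are too optimistic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# .fit() estimates the slope and intercept from the training data
linear = LinearRegression().fit(X_train, y_train)
print(f"slope ≈ {linear.coef_[0]:.2f}, intercept ≈ {linear.intercept_:.2f}")

# An unpruned tree can memorize its training rows, so its training score
# is overly optimistic compared with its score on held-out rows
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"tree R² on training data: {train_r2:.3f}, on test data: {test_r2:.3f}")
```

The decision tree is swapped in here only because it overfits visibly; the notebook's own model is linear regression.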

### Self-Assessment Checklist
Check off each item as you master it:

- [ ] I can load and inspect datasets with pandas
- [ ] I can perform exploratory data analysis to understand my data
- [ ] I can clean data and handle missing values
- [ ] I can create new features through feature engineering
- [ ] I can create informative visualizations with matplotlib and seaborn
- [ ] I can interpret correlation matrices and statistical summaries
- [ ] I can build and evaluate machine learning models with scikit-learn
- [ ] I understand the importance of train/test splits
- [ ] I can interpret model coefficients and performance metrics

### 🔗 Next Steps
- Review the [companion theory file](04_python_tooling.md) for deeper insights and best practices
- Practice with different datasets and visualization types
- Experiment with other scikit-learn models (e.g., decision trees, random forests)
- Learn about cross-validation and more advanced evaluation techniques
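
As a starting point for the last two items, here is a hedged sketch of 5-fold cross-validation comparing linear regression with a random forest. The data is a synthetic stand-in for bill and tip amounts (an assumption for illustration), and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for bill/tip data (an assumption for illustration)
rng = np.random.default_rng(42)
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + rng.normal(0, 1, size=200)
X = total_bill.reshape(-1, 1)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation: each model is trained and scored five times,
# each time holding out a different fifth of the rows
cv_results = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X, tip, cv=5, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single train/test split, at the cost of fitting each model several times.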

### 💡 Key Takeaways
- **Pandas**: Your go-to tool for data manipulation and analysis
- **Visualization**: Essential for understanding data patterns and relationships
- **Feature Engineering**: Often more important than the choice of algorithm
- **Scikit-learn**: Provides a consistent interface for machine learning
- **Evaluation**: Always test your model on unseen data to assess real performance
- **Workflow**: Load → Explore → Clean → Engineer → Model → Evaluate
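
The workflow in the last bullet can also be bundled into a single scikit-learn object. This is one possible sketch, not the approach used in the cells above: it chains preprocessing and a model with `Pipeline` and `ColumnTransformer`, using synthetic data whose column names mirror the tips dataset (the values are an assumption for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption); column names mirror the tips dataset
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "size": rng.integers(1, 7, size=n),
    "smoker": rng.choice(["Yes", "No"], size=n),
})
data["tip"] = 0.15 * data["total_bill"] + 0.2 * data["size"] + rng.normal(0, 1, size=n)

X = data[["total_bill", "size", "smoker"]]
y = data["tip"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Clean/engineer step: scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["total_bill", "size"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])

# Model step chained after preprocessing; fit() learns both from training data only
workflow = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
workflow.fit(X_train, y_train)

test_r2 = workflow.score(X_test, y_test)
print(f"Test R²: {test_r2:.3f}")
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only, which matches the "test on unseen data" takeaway.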
