# Machine Learning Beginner Tutorial
## TensorFlow & scikit-learn

This notebook introduces two essential Python libraries for machine learning:
- **scikit-learn**: Traditional ML algorithms (classification, regression, clustering)
- **TensorFlow**: Deep learning and neural networks

---

## Part 1: Setup and Imports

In [None]:
# Install libraries if needed (uncomment to run)
# !pip install tensorflow scikit-learn pandas matplotlib seaborn

In [None]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
np.random.seed(42)

print("Setup complete!")

---
# PART A: scikit-learn Tutorial
---

scikit-learn provides simple and efficient tools for:
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection and evaluation

## 1. Loading Data

scikit-learn includes several built-in datasets for practice.

In [None]:
from sklearn.datasets import load_iris, load_wine, make_classification, make_regression

# Load the famous Iris dataset
iris = load_iris()

# X = features (measurements), y = target (species)
X = iris.data
y = iris.target

print(f"Feature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")
print(f"Data shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Convert to DataFrame for easier viewing
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in y]
df.head(10)

## 2. Train-Test Split

Always split your data into training and testing sets to evaluate model performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # For reproducibility
    stratify=y          # Maintain class distribution
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
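
To see what `stratify=y` changes, the sketch below repeats the split with and without it and counts the classes that land in the test set. The names `X_demo`, `y_demo`, `counts_plain`, and `counts_strat` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same test size and seed, once without and once with stratification
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

counts_plain = np.bincount(y_test_plain, minlength=3)
counts_strat = np.bincount(y_test_strat, minlength=3)
print("Test class counts without stratify:", counts_plain)
print("Test class counts with stratify:   ", counts_strat)  # 10 per class
```

Iris has 50 samples per class, so a stratified 20% test set holds exactly 10 of each; the unstratified counts depend on the random draw.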

## 3. Data Preprocessing

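Many ML algorithms perform better when features are on a similar scale, especially distance-based models (KNN, SVM) and models trained by gradient descent (logistic regression, neural networks). Tree-based models are largely insensitive to feature scaling.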

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler: transforms data to have mean=0 and std=1
scaler = StandardScaler()

# Fit on training data, transform both train and test
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)        # transform only (use training stats)

print("Before scaling:")
print(f"  Mean: {X_train.mean(axis=0).round(2)}")
print(f"  Std: {X_train.std(axis=0).round(2)}")

print("\nAfter scaling:")
print(f"  Mean: {X_train_scaled.mean(axis=0).round(2)}")
print(f"  Std: {X_train_scaled.std(axis=0).round(2)}")
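
The fit/transform split can be sketched by hand with NumPy. This is a minimal illustration on made-up data (the arrays `train` and `test` are hypothetical), not a replacement for `StandardScaler`: the statistics are learned from the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
train = rng.normal(loc=[5.0, 200.0], scale=[1.0, 50.0], size=(100, 2))
test = rng.normal(loc=[5.0, 200.0], scale=[1.0, 50.0], size=(20, 2))

# "fit": learn mean and standard deviation from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print("Train mean:", train_scaled.mean(axis=0).round(2))
print("Train std: ", train_scaled.std(axis=0).round(2))
print("Test mean: ", test_scaled.mean(axis=0).round(2))  # near 0, not exactly 0
```

The test set ends up only approximately standardized, which is expected: its own statistics are never used, so no information leaks from test to train.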

## 4. Classification Models

scikit-learn follows a consistent API:
1. Create the model
2. `fit()` on training data
3. `predict()` on new data
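
The three steps above can be sketched with a toy estimator. `ToyCentroidClassifier` is a hypothetical class written only for illustration (it is not a scikit-learn class); it follows the same create / `fit()` / `predict()` contract that the models below use.

```python
import numpy as np

class ToyCentroidClassifier:
    """Hypothetical classifier: predicts the class whose mean point is closest."""

    def fit(self, X, y):
        # Learn one centroid (mean feature vector) per class
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample to every centroid, then pick the nearest
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 1, 1])

clf = ToyCentroidClassifier()                             # 1. create the model
clf.fit(X_toy, y_toy)                                     # 2. fit on training data
preds = clf.predict(np.array([[0.1, 0.0], [4.8, 5.2]]))   # 3. predict on new data
print(preds)  # [0 1]
```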

### 4.1 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create and train model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train_scaled, y_train)

# Predict
y_pred = log_reg.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

### 4.2 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create and train
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)  # Trees don't require scaling

# Predict and evaluate
y_pred_tree = tree.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree):.2%}")

### 4.3 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create and train
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.2%}")

# Feature importance
print("\nFeature Importance:")
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"  {name}: {importance:.3f}")

### 4.4 K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# KNN is distance-based, so it is sensitive to feature scale: use the scaled data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

y_pred_knn = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn):.2%}")
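
A small NumPy sketch shows why scaling matters for KNN. The data and the helper `knn_predict` are hypothetical: the class depends only on the first feature, but the second feature has a much larger numeric range and dominates the Euclidean distance until the features are standardized.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

# Hypothetical data: feature 0 decides the class, feature 1 is large-valued noise
X_raw = np.array([[1.0, 1000.0], [1.1, 5000.0], [3.0, 1100.0], [3.1, 5100.0]])
y_raw = np.array([0, 0, 1, 1])
query = np.array([1.05, 5100.0])  # feature 0 says class 0

mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)

pred_raw = knn_predict(X_raw, y_raw, query, k=1)
pred_scaled = knn_predict((X_raw - mu) / sigma, y_raw, (query - mu) / sigma, k=1)

print("Prediction on raw features:   ", pred_raw)     # 1 (misled by feature 1)
print("Prediction on scaled features:", pred_scaled)  # 0
```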

### 4.5 Support Vector Machine

In [None]:
from sklearn.svm import SVC

# SVMs with an RBF kernel are sensitive to feature scale: use the scaled data
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train_scaled, y_train)

y_pred_svm = svm.predict(X_test_scaled)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred_svm):.2%}")

## 5. Regression Example

Regression predicts continuous values instead of categories.

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Create synthetic regression data
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)

# Split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

# Predict
y_pred_reg = lin_reg.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print("Linear Regression Results:")
print(f"  RMSE: {rmse:.2f}")
print(f"  R² Score: {r2:.4f}")
print(f"  Coefficients: {lin_reg.coef_.round(2)}")
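
For reference, the two metrics can be computed by hand. The sketch below uses a tiny made-up example (`y_true` and `y_hat` are hypothetical values) to show what RMSE and R² measure.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical targets
y_hat = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)       # average squared error
rmse_manual = np.sqrt(mse_manual)          # same units as the target

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.3f}")   # 0.375
print(f"RMSE: {rmse_manual:.3f}")  # 0.612
print(f"R^2:  {r2_manual:.3f}")    # 0.949
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.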

## 6. Cross-Validation

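Cross-validation gives a more robust estimate of model performance than a single train/test split, because every training sample is used for validation exactly once. Note that the cell below reuses `X_train_scaled`, whose scaler was fit on the full training set; wrapping the scaler and model in a pipeline (Section 10) keeps each validation fold out of the scaling statistics.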

In [None]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf')
}

print("5-Fold Cross-Validation Results:\n")
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    print(f"{name}:")
    print(f"  Scores: {scores.round(3)}")
    print(f"  Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})\n")
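
The fold mechanics can be sketched in plain NumPy. `kfold_indices` is a hypothetical helper implementing simple shuffled k-fold; scikit-learn uses stratified folds for classifiers when `cv` is an integer, so this is an illustration of the idea rather than what `cross_val_score` does internally.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

n = 120  # size of the Iris training split used above
seen = []
for fold, (train_idx, val_idx) in enumerate(kfold_indices(n, 5), start=1):
    print(f"Fold {fold}: train={len(train_idx)}, validation={len(val_idx)}")
    seen.extend(val_idx.tolist())

# Every index appears in exactly one validation fold
print("All samples validated once:", sorted(seen) == list(range(n)))
```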

## 7. Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Create GridSearchCV
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all CPU cores
)

# Fit
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
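
Grid search is exhaustive, so its cost grows multiplicatively with each parameter. The sketch below counts the work for the grid above using only the standard library; `param_grid_demo` simply repeats that grid.

```python
from itertools import product

param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}

# Every combination of one value per parameter is a candidate model
candidates = [
    dict(zip(param_grid_demo, values))
    for values in product(*param_grid_demo.values())
]
n_candidates = len(candidates)   # 3 * 4 * 3 = 36
n_fits = n_candidates * 5        # each candidate is trained once per CV fold

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_fits}")  # plus one final refit with the default refit=True
print("First candidate:", candidates[0])
```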

## 8. Confusion Matrix Visualization

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Get predictions from best model
y_pred_best = best_model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_best)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))  # explicit Axes so figsize applies to the plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap='Blues', ax=ax)
ax.set_title('Confusion Matrix - Random Forest')
plt.tight_layout()
plt.show()
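
To read the plot, it helps to know how the matrix is built: rows are true classes and columns are predicted classes. A minimal sketch with made-up labels (`confusion_counts` is a hypothetical helper, not a scikit-learn function):

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2, 1])

cm_demo = confusion_counts(y_true_demo, y_pred_demo, 3)
print(cm_demo)

# Correct predictions sit on the diagonal
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
print(f"Accuracy from the matrix: {accuracy_demo:.3f}")  # 5 of 7 correct
```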

## 9. Clustering (Unsupervised Learning)

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_train_scaled)

# Evaluate clustering
silhouette = silhouette_score(X_train_scaled, cluster_labels)
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")

# Visualize clusters (using first 2 features)
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title('K-Means Clustering')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title('True Labels')

plt.tight_layout()
plt.show()
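
K-Means alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a simplified version of that loop on made-up data; `lloyd_kmeans` is a hypothetical helper, and scikit-learn's `KMeans` adds smarter initialization and a convergence check on top of this idea.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, n_iter=10):
    """Simplified K-Means: fixed number of assign/update rounds."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

# Two well-separated hypothetical blobs
X_blobs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init = X_blobs[[0, 3]]  # one starting centroid taken from each blob

centers, labels = lloyd_kmeans(X_blobs, init)
print("Labels: ", labels)
print("Centers:", centers.round(3))
```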

## 10. Pipelines

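Pipelines chain preprocessing and modeling steps into a single estimator, so the transformations fit on the training data are applied automatically at prediction time.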

In [None]:
from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200))
])

# Use like a regular model
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)

print(f"Pipeline Accuracy: {accuracy:.2%}")
print("\nPipeline steps:")
for name, step in pipeline.named_steps.items():
    print(f"  {name}: {step.__class__.__name__}")
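
One practical benefit is leak-free cross-validation: when a pipeline is passed to `cross_val_score`, the scaler is re-fit on the training portion of every fold. The sketch below assumes the Iris data and uses `make_pipeline`, a shorthand that names the steps automatically; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaler and classifier travel together, so each fold scales with its own statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
cv_scores = cross_val_score(pipe, X_iris, y_iris, cv=5, scoring='accuracy')

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```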

---
# PART B: TensorFlow Tutorial
---

TensorFlow is a deep learning framework for:
- Neural networks
- Image recognition
- Natural language processing
- And much more

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(f"TensorFlow version: {tf.__version__}")

# Set seed for reproducibility
tf.random.set_seed(42)

## 1. TensorFlow Basics - Tensors

In [None]:
# Tensors are the fundamental data structure

# Scalar (0-D tensor)
scalar = tf.constant(42)
print(f"Scalar: {scalar}, shape: {scalar.shape}")

# Vector (1-D tensor)
vector = tf.constant([1, 2, 3, 4, 5])
print(f"Vector: {vector}, shape: {vector.shape}")

# Matrix (2-D tensor)
matrix = tf.constant([[1, 2], [3, 4], [5, 6]])
print(f"Matrix shape: {matrix.shape}")

# 3-D tensor
tensor_3d = tf.constant([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"3D Tensor shape: {tensor_3d.shape}")

In [None]:
# Tensor operations
a = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
b = tf.constant([[5, 6], [7, 8]], dtype=tf.float32)

print("Element-wise addition:")
print(tf.add(a, b).numpy())

print("\nMatrix multiplication:")
print(tf.matmul(a, b).numpy())

print("\nElement-wise multiplication:")
print(tf.multiply(a, b).numpy())
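
Matrix multiplication and element-wise multiplication are easy to confuse. As a cross-check, the same operations in NumPy (which TensorFlow tensors convert to via `.numpy()`) give the values below; `a_np` and `b_np` mirror `a` and `b` from the cell above.

```python
import numpy as np

a_np = np.array([[1, 2], [3, 4]], dtype=np.float32)
b_np = np.array([[5, 6], [7, 8]], dtype=np.float32)

matmul_result = a_np @ b_np       # rows times columns: [[19, 22], [43, 50]]
elementwise_result = a_np * b_np  # position by position: [[5, 12], [21, 32]]

print("Matrix multiplication:\n", matmul_result)
print("Element-wise multiplication:\n", elementwise_result)
```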

## 2. Neural Network for Classification (Iris Dataset)

In [None]:
# Prepare data for neural network
from sklearn.preprocessing import LabelBinarizer

# One-hot encode the labels
lb = LabelBinarizer()
y_train_onehot = lb.fit_transform(y_train)
y_test_onehot = lb.transform(y_test)

print(f"Original label: {y_train[0]}")
print(f"One-hot encoded: {y_train_onehot[0]}")
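
One-hot encoding replaces each integer label with a vector that has a single 1 at the label's index. A minimal NumPy sketch of the same idea, assuming three classes and made-up labels:

```python
import numpy as np

labels_demo = np.array([0, 2, 1, 2])  # hypothetical integer labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(3, dtype=int)[labels_demo]
print(one_hot)

# argmax reverses the encoding
recovered = one_hot.argmax(axis=1)
print("Recovered labels:", recovered)
```

The same `argmax` trick is how class predictions are read back from a softmax output later in this notebook.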

In [None]:
# Build a Sequential model
model = keras.Sequential([
    # Input layer (4 features)
    layers.Input(shape=(4,)),
    
    # Hidden layer 1
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),  # Regularization
    
    # Hidden layer 2
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    
    # Output layer (3 classes)
    layers.Dense(3, activation='softmax')
])

# View model architecture
model.summary()

In [None]:
# Compile the model
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
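
The output activation and the loss work as a pair: softmax turns raw scores into class probabilities, and categorical cross-entropy penalizes a low probability on the true class. The NumPy sketch below is a simplified illustration of both (the clipping constant `eps` is an assumption chosen for numerical safety), not Keras's implementation.

```python
import numpy as np

def softmax(logits):
    """Convert each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

logits_demo = np.array([[2.0, 1.0, 0.1],   # fairly confident in class 0
                        [0.0, 0.0, 0.0]])  # no preference: uniform probabilities
targets_demo = np.array([[1, 0, 0],
                         [0, 1, 0]])

probs_demo = softmax(logits_demo)
loss_demo = categorical_crossentropy(targets_demo, probs_demo)

print("Probabilities:\n", probs_demo.round(3))
print(f"Loss: {loss_demo:.4f}")
```

For the uniform row the per-sample loss is ln(3) ≈ 1.0986, which is a useful baseline: a 3-class model with a loss near that value is doing no better than guessing.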

In [None]:
# Train the model
history = model.fit(
    X_train_scaled, y_train_onehot,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    verbose=0  # Quiet training
)

print("Training complete!")

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Over Time')
axes[0].legend()

# Accuracy
axes[1].plot(history.history['accuracy'], label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Over Time')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Evaluate on test set
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test_onehot, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.2%}")

## 3. Image Classification with MNIST

The MNIST dataset contains 70,000 grayscale images (28×28 pixels) of handwritten digits (0-9): 60,000 for training and 10,000 for testing.

In [None]:
# Load MNIST dataset
(X_train_mnist, y_train_mnist), (X_test_mnist, y_test_mnist) = keras.datasets.mnist.load_data()

print(f"Training images: {X_train_mnist.shape}")
print(f"Test images: {X_test_mnist.shape}")

# Visualize some samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train_mnist[i], cmap='gray')
    ax.set_title(f"Label: {y_train_mnist[i]}")
    ax.axis('off')
plt.suptitle('Sample MNIST Images', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Preprocess the data

# Normalize pixel values to [0, 1]
X_train_mnist = X_train_mnist.astype('float32') / 255.0
X_test_mnist = X_test_mnist.astype('float32') / 255.0

# Reshape for CNN: (samples, height, width, channels)
X_train_mnist = X_train_mnist.reshape(-1, 28, 28, 1)
X_test_mnist = X_test_mnist.reshape(-1, 28, 28, 1)

print(f"Reshaped training data: {X_train_mnist.shape}")

In [None]:
# Build a Convolutional Neural Network (CNN)
cnn_model = keras.Sequential([
    # Input
    layers.Input(shape=(28, 28, 1)),
    
    # Convolutional block 1
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    
    # Convolutional block 2
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    
    # Flatten and dense layers
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    
    # Output layer (10 digits)
    layers.Dense(10, activation='softmax')
])

cnn_model.summary()
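
The shapes and parameter counts reported by `summary()` can be checked by hand. The sketch below assumes the Keras defaults used above (`'valid'` padding and stride 1 for `Conv2D`, non-overlapping windows for `MaxPooling2D`); the helper names are illustrative.

```python
def conv_out(size, kernel):
    """Output side length for 'valid' padding and stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output side length for non-overlapping pooling."""
    return size // pool

side = 28
side = conv_out(side, 3)  # 26 after Conv2D(32, (3, 3))
side = pool_out(side, 2)  # 13 after MaxPooling2D((2, 2))
side = conv_out(side, 3)  # 11 after Conv2D(64, (3, 3))
side = pool_out(side, 2)  # 5 after MaxPooling2D((2, 2))
flat = side * side * 64   # 1600 values after Flatten

# Parameters = weights + biases for each trainable layer
params = {
    'conv_1': 3 * 3 * 1 * 32 + 32,
    'conv_2': 3 * 3 * 32 * 64 + 64,
    'dense': flat * 64 + 64,
    'output': 64 * 10 + 10,
}
total_params = sum(params.values())

print(f"Flattened size: {flat}")
for layer_name, count in params.items():
    print(f"  {layer_name}: {count:,}")
print(f"Total parameters: {total_params:,}")  # 121,930
```

Most of the parameters sit in the first dense layer, not in the convolutions, which is typical for small CNNs.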

In [None]:
# Compile
cnn_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # Use with integer labels
    metrics=['accuracy']
)

# Train
cnn_history = cnn_model.fit(
    X_train_mnist, y_train_mnist,
    epochs=5,
    batch_size=64,
    validation_split=0.1,
    verbose=1
)

In [None]:
# Evaluate
test_loss, test_accuracy = cnn_model.evaluate(X_test_mnist, y_test_mnist, verbose=0)
print(f"\nCNN Test Accuracy: {test_accuracy:.2%}")

In [None]:
# Make predictions and visualize
predictions = cnn_model.predict(X_test_mnist[:10], verbose=0)
predicted_labels = np.argmax(predictions, axis=1)

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test_mnist[i].reshape(28, 28), cmap='gray')
    color = 'green' if predicted_labels[i] == y_test_mnist[i] else 'red'
    ax.set_title(f"Pred: {predicted_labels[i]} (True: {y_test_mnist[i]})", color=color)
    ax.axis('off')
plt.suptitle('CNN Predictions on Test Images', fontsize=14)
plt.tight_layout()
plt.show()

## 4. Building Models with Functional API

The Functional API allows more complex architectures (multiple inputs/outputs, skip connections).

In [None]:
# Functional API example
inputs = keras.Input(shape=(4,), name='input_layer')

x = layers.Dense(64, activation='relu', name='hidden_1')(inputs)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)

x = layers.Dense(32, activation='relu', name='hidden_2')(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)

outputs = layers.Dense(3, activation='softmax', name='output_layer')(x)

functional_model = keras.Model(inputs=inputs, outputs=outputs, name='functional_model')
functional_model.summary()

## 5. Callbacks

Callbacks hook into the training loop to customize its behavior, for example stopping early, adjusting the learning rate, or saving checkpoints.

In [None]:
# Early stopping - stop training when validation loss stops improving
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# Reduce learning rate when validation loss plateaus
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6
)

# Model checkpoint - save best model
checkpoint = keras.callbacks.ModelCheckpoint(
    'best_model.keras',
    monitor='val_accuracy',
    save_best_only=True,
    mode='max'
)

print("Callbacks defined!")
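
The `patience` logic behind early stopping can be sketched in a few lines of plain Python. `early_stopping_epoch` is a hypothetical helper that mimics the basic rule (stop after `patience` consecutive epochs without improvement); the real Keras callback has more options, such as `min_delta`.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch where training stops, epoch with the best loss), 0-indexed."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2
val_losses_demo = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.59]

stopped_at, best_at = early_stopping_epoch(val_losses_demo, patience=3)
print(f"Stopped at epoch {stopped_at}, best epoch was {best_at}")  # 5 and 2
```

With `patience=3` training stops before the later improvement to 0.59 is ever seen, which is the trade-off the parameter controls: a larger patience tolerates longer plateaus at the cost of more training time.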

In [None]:
# Train with callbacks
functional_model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history = functional_model.fit(
    X_train_scaled, y_train_onehot,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[early_stopping, reduce_lr],  # add `checkpoint` to also save the best model to disk
    verbose=0
)

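print(f"Training stopped after {len(history.history['loss'])} epochs")
# With restore_best_weights=True the model keeps the best epoch's weights, not the last epoch's
print(f"Last-epoch validation accuracy: {history.history['val_accuracy'][-1]:.2%}")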

## 6. Regression with Neural Networks

In [None]:
# Create synthetic data
np.random.seed(42)
X_nn = np.random.rand(1000, 5)
y_nn = 3 * X_nn[:, 0] + 2 * X_nn[:, 1] - X_nn[:, 2] + np.random.randn(1000) * 0.1

X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(
    X_nn, y_nn, test_size=0.2, random_state=42
)

# Build regression model
reg_model = keras.Sequential([
    layers.Input(shape=(5,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)  # No activation for regression
])

reg_model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

# Train
reg_history = reg_model.fit(
    X_train_nn, y_train_nn,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Evaluate
test_mse, test_mae = reg_model.evaluate(X_test_nn, y_test_nn, verbose=0)
print(f"Test MSE: {test_mse:.4f}")
print(f"Test MAE: {test_mae:.4f}")

## 7. Saving and Loading Models

In [None]:
# Save the entire model
cnn_model.save('mnist_cnn_model.keras')
print("Model saved!")

# Load the model
loaded_model = keras.models.load_model('mnist_cnn_model.keras')
print("Model loaded!")

# Verify it works
test_loss, test_acc = loaded_model.evaluate(X_test_mnist[:100], y_test_mnist[:100], verbose=0)
print(f"Loaded model test accuracy: {test_acc:.2%}")

## 8. Custom Training Loop (Advanced)

In [None]:
# Custom training gives you full control

# Create a simple model
custom_model = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(3, activation='softmax')
])

# Define optimizer and loss
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss_fn = keras.losses.CategoricalCrossentropy()

# Custom training step
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = custom_model(x, training=True)
        loss = loss_fn(y, predictions)
    
    gradients = tape.gradient(loss, custom_model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, custom_model.trainable_variables))
    return loss

# Training loop
epochs = 50
batch_size = 16

# Convert to tensors
train_dataset = tf.data.Dataset.from_tensor_slices(
    (X_train_scaled.astype('float32'), y_train_onehot.astype('float32'))
).shuffle(1000).batch(batch_size)

for epoch in range(epochs):
    total_loss = 0.0
    num_batches = 0
    
    for x_batch, y_batch in train_dataset:
        loss = train_step(x_batch, y_batch)
        total_loss += float(loss)  # convert the scalar tensor to a Python float
        num_batches += 1
    
    if (epoch + 1) % 10 == 0:
        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch + 1}: Loss = {avg_loss:.4f}")

# Evaluate
predictions = custom_model(X_test_scaled.astype('float32'))
accuracy = np.mean(np.argmax(predictions, axis=1) == y_test)
print(f"\nCustom training accuracy: {accuracy:.2%}")
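
What `GradientTape` and `apply_gradients` automate can be seen in a hand-written version. The sketch below trains a linear model with plain gradient descent in NumPy, with the gradient of the MSE loss derived manually; the data is synthetic and the variable names are illustrative. It uses a fixed learning rate, whereas the loop above uses Adam.

```python
import numpy as np

rng = np.random.default_rng(42)
X_gd = rng.random((200, 3))
true_w = np.array([3.0, 2.0, -1.0])  # weights used to generate the synthetic targets
y_gd = X_gd @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.5
loss_history = []

for step in range(500):
    preds = X_gd @ w                          # forward pass
    error = preds - y_gd
    loss_history.append(np.mean(error ** 2))  # MSE loss
    grad = 2.0 * X_gd.T @ error / len(y_gd)   # d(loss)/dw, computed by hand
    w -= learning_rate * grad                 # gradient-descent update

print(f"Initial loss: {loss_history[0]:.4f}")
print(f"Final loss:   {loss_history[-1]:.4f}")
print("Learned weights:", w.round(2))
```

In the TensorFlow loop, the tape records the forward pass and computes `grad` automatically, and the optimizer performs the update line.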

---
# Summary: Quick Reference
---

## scikit-learn Workflow
```python
# 1. Prepare data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Create and train model
model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)

# 3. Predict and evaluate
predictions = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
```

## TensorFlow/Keras Workflow
```python
# 1. Build model
model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])

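# 2. Compile (categorical_crossentropy expects one-hot labels;
#    use sparse_categorical_crossentropy for integer labels)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])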

# 3. Train
model.fit(X_train, y_train, epochs=50, validation_split=0.2)

# 4. Evaluate
model.evaluate(X_test, y_test)
```

## Key Concepts Covered

### scikit-learn:
- Loading datasets
- Train/test splitting
- Data preprocessing (scaling)
- Classification models (Logistic Regression, Decision Trees, Random Forest, KNN, SVM)
- Regression models
- Cross-validation
- Hyperparameter tuning (GridSearchCV)
- Clustering (K-Means)
- Pipelines

### TensorFlow/Keras:
- Tensors and operations
- Sequential API
- Functional API
- Convolutional Neural Networks (CNNs)
- Callbacks (Early Stopping, Learning Rate Scheduling)
- Saving and loading models
- Custom training loops

---
**Happy Learning!** 🚀