# Introduction to Machine Learning for Business Applications

<img src="https://media.datacamp.com/legacy/image/upload/v1689699751/Comparing_supervised_and_unsupervised_learning_af4d5eccb0.png">

## Machine Learning Hierarchy with Business Applications

<table border="1" cellspacing="0" cellpadding="8" style="margin: auto; text-align: left; font-family: sans-serif;">
  <tr style="background-color: #d0e6f7; text-align: center;">
    <th><strong>Type</strong></th>
    <th><strong>Subtypes / Algorithms</strong></th>
    <th><strong>Business Applications</strong></th>
    <th><strong>Real-World Examples</strong></th>
  </tr>

  <!-- Supervised Learning -->
  <tr style="background-color: #f0f8ff;">
    <td rowspan="2"><strong>Supervised Learning</strong></td>
    <td><strong>Regression</strong><br>• Linear Regression<br>• Random Forest Regression</td>
    <td>
      <ul>
        <li>Sales &amp; revenue forecasting</li>
        <li>Demand prediction</li>
        <li>Customer lifetime value (CLV)</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Amazon Demand Forecasting</li>
        <li>Uber Fare Estimation</li>
        <li>Spotify User Retention Models</li>
      </ul>
    </td>
  </tr>

  <tr style="background-color: #f0f8ff;">
    <td><strong>Classification</strong><br>• Logistic Regression<br>• Support Vector Machines (SVM)</td>
    <td>
      <ul>
        <li>Spam email detection</li>
        <li>Credit scoring / loan approval</li>
        <li>Medical diagnosis</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Gmail Spam Filter</li>
        <li>FICO Credit Risk Models</li>
        <li>IBM Watson Health</li>
      </ul>
    </td>
  </tr>

  <!-- Unsupervised Learning -->
  <tr style="background-color: #e6f7e6;">
    <td rowspan="2"><strong>Unsupervised Learning</strong></td>
    <td><strong>Clustering</strong><br>• K-Means<br>• DBSCAN</td>
    <td>
      <ul>
        <li>Customer segmentation</li>
        <li>Market research</li>
        <li>Anomaly detection</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Spotify Listener Segments</li>
        <li>Airbnb Guest Types</li>
        <li>Credit Card Fraud Detection</li>
      </ul>
    </td>
  </tr>

  <tr style="background-color: #e6f7e6;">
    <td><strong>Dimensionality Reduction</strong><br>• PCA<br>• t-SNE</td>
    <td>
      <ul>
        <li>Data visualization</li>
        <li>Noise reduction in IoT sensors</li>
        <li>Genomics / high-dimensional biology</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Netflix Recommendation System (latent space)</li>
        <li>Intel Sensor Compression</li>
        <li>Gene Expression Analysis (NIH)</li>
      </ul>
    </td>
  </tr>
</table>


## [scikit-learn](https://scikit-learn.org/stable/)

Machine Learning in Python

<img src="https://scikit-learn.org/1.3/_static/ml_map.png" width=800>

[Source](http://scikit-learn.org/stable/)

# Unsupervised ML

> Clustering: **Minimizing within-cluster distances while maximizing between-cluster distances**

> Dimensionality reduction: **Minimizing information loss while reducing dimensions**

## Clustering

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans

# Set a seed for reproducibility
np.random.seed(42)

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.DataFrame({
    'Size':  [1, 5, 1.5, 8, 1, 9],
    'Color': [2, 8, 1.8, 8, 0.6, 11]
})
df

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(df['Size'], df['Color'], marker='o', s=100, color='blue')
plt.xlabel("Size")
plt.ylabel("Color")
plt.title("Data Points")
plt.grid(True)
plt.show()

How many clusters are in this data?
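
One hedged way to approach this question is the elbow heuristic: fit K-Means for several values of k and watch how the within-cluster sum of squared distances (scikit-learn's `inertia_`) falls as k grows. The sketch below rebuilds the same six points under the assumed name `points` and sets `n_init=10` explicitly; both choices are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same six points as the toy DataFrame above (the name `points` is an assumption)
points = pd.DataFrame({
    'Size':  [1, 5, 1.5, 8, 1, 9],
    'Color': [2, 8, 1.8, 8, 0.6, 11],
})

# Inertia = sum of squared distances from each point to its nearest centroid
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

A sharp drop followed by a flat tail suggests that adding more clusters past the bend buys little extra compactness. On six points this is only a toy illustration of the idea.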

In [None]:
# Fit KMeans (n_clusters=2, random_state=0)
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(df)

In [None]:
# Get centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

In [None]:
# Set up colors for each cluster
colors = ['blue', 'red']

# Plot each point with its cluster color
for i in range(len(df)):
    plt.scatter(df['Size'][i], df['Color'][i], color=colors[labels[i]], s=100)

# Plot centroids
plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='x', s=150, label='Centroids')

# Final touches
plt.xlabel('Size')
plt.ylabel('Color')
plt.title('K-Means Clustering')
plt.legend()
plt.grid(True)
plt.show()

> Real-world applications:

- Spotify Listener Segments
- Airbnb Guest Types
- Credit Card Fraud Detection (fraudulent or unusual transactions appear as outliers)
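
The fraud-detection idea, where unusual transactions appear as outliers, can be sketched with K-Means by measuring how far each point lies from its nearest centroid. Everything in this example is an assumption made for illustration: the synthetic "transaction" features, the single planted outlier, and the mean-plus-three-standard-deviations rule. It is not how any production fraud system is known to work.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical transaction features (e.g. scaled amount and hour of day):
# two dense groups of normal behaviour plus one unusual transaction.
normal_a = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2))
normal_b = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(50, 2))
unusual = np.array([[3.0, 9.0]])
X = np.vstack([normal_a, normal_b, unusual])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns the distance from every point to every centroid;
# the row-wise minimum is the distance to the nearest (assigned) centroid.
dist_to_centroid = kmeans.transform(X).min(axis=1)

# Illustrative rule: flag points further than mean + 3 standard deviations
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
flagged = np.where(dist_to_centroid > threshold)[0]
print("Flagged row indices:", flagged)
```

The threshold is a design choice. A stricter or looser cutoff trades missed anomalies against false alarms.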

<img src="https://businessmodelanalyst.com/wp-content/uploads/2024/12/Spotify-Target-Market-1024x576.webp">

## Dimensionality reduction

In [None]:
from sklearn.datasets import fetch_olivetti_faces

# Load the Olivetti faces dataset
faces_data = fetch_olivetti_faces(shuffle=True, random_state=42)
faces = faces_data.data
targets = faces_data.target

In [None]:
# Dataset dimensions
n_samples, n_features = faces.shape
n_faces = len(np.unique(targets))
print(f"Dataset: {n_samples} images")
print(f"Image size: 64x64 pixels = {n_features} dimensions")
print(f"Number of different people: {n_faces}")

About the Olivetti Faces Dataset

- **Size:** 400 grayscale images  
- **Dimensions:** 64×64 pixels  
- **Subjects:** 40 distinct individuals, 10 images per person  
- **Variation:** Includes slight changes in facial expressions and lighting

In [None]:
# Create a DataFrame with the face data

# First, create a DataFrame with the pixel values
pixel_columns = [f'pixel_{i}' for i in range(faces.shape[1])]
faces_df = pd.DataFrame(faces, columns=pixel_columns)

# Add the target (person identifier) as a column
faces_df['person_id'] = targets

print("First 5 rows of the DataFrame:")
faces_df.head()

- Rows = Individual face images (400 total)
- Columns = Pixel positions (4,096) plus person identifier
- Cells = Grayscale intensity values for each pixel

The next cell displays the first 12 images in the loaded Olivetti faces array. These are photographs of real people.

In [None]:
# Display some original face images
n_row, n_col = 3, 4
plt.figure(figsize=(2. * n_col, 2.26 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)

for i in range(n_row * n_col):
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(faces[i].reshape((64, 64)), cmap=plt.cm.gray)
    plt.title(f"Person #{targets[i]}", size=12)
    plt.xticks(())
    plt.yticks(())

plt.suptitle("Original Face Images (64x64 pixels = 4,096 dimensions each)",
             fontsize=16)
plt.show()

The original images above are represented in **4,096 dimensions (64 × 64 pixels)**.

The images below have been reconstructed using only **66 principal components**.

**Despite this significant reduction in dimensionality, the reconstructed images retain most of the important visual information, demonstrating how PCA effectively captures the underlying structure of the data**.

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to reduce to 66 components
n_components = 66
pca = PCA(n_components=n_components, whiten=False, random_state=42)
faces_pca = pca.fit_transform(faces)
faces_reconstructed = pca.inverse_transform(faces_pca)

# Plot the reconstructed faces
plt.figure(figsize=(2. * n_col, 2.26 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)

for i in range(n_row * n_col):
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(faces_reconstructed[i].reshape((64, 64)), cmap=plt.cm.gray)
    plt.title(f"Person #{targets[i]}", size=12)
    plt.xticks(())
    plt.yticks(())

plt.suptitle(f"Reconstructed Faces Using {n_components} PCA Components", fontsize=16)
plt.show()
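
How much information a given number of components retains can be checked through the cumulative explained variance ratio. The sketch below is a hedged illustration. The helper name `components_for_variance` and the 95% target are assumptions, and it uses scikit-learn's bundled 8x8 digits images as a stand-in so that it runs without downloading the faces.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.95):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches `target`, plus the cumulative curve."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= target
    n_needed = int(np.searchsorted(cumulative, target)) + 1
    # Guard against floating-point shortfall when target is close to 1.0
    return min(n_needed, len(cumulative)), cumulative

# Stand-in data: 8x8 digit images (64 dimensions) bundled with scikit-learn
X_demo = load_digits().data
n_needed, cumulative = components_for_variance(X_demo, target=0.95)
print(f"{n_needed} of {X_demo.shape[1]} components reach 95% of the variance")
```

Passing `faces` instead of `X_demo` would apply the same check to the Olivetti data used in this notebook.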

> Applications: Face recognition

- Netflix Recommendation System (latent space): compressing a massive user-item matrix into a lower-dimensional latent space
- Intel Sensor Compression
- Gene Expression Analysis (NIH)
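
The latent-space idea behind the recommendation example can be sketched with a truncated SVD of a tiny user-item matrix. The ratings below are invented for illustration, treating unrated titles as 0 is a simplification, and `TruncatedSVD` is used here as one convenient factorization. None of this describes Netflix's actual system.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical ratings: rows are users, columns are titles, 0 = not rated
ratings = np.array([
    [5, 4, 0, 0, 1, 0],
    [4, 5, 1, 0, 0, 0],
    [0, 1, 5, 4, 0, 0],
    [0, 0, 4, 5, 1, 0],
    [1, 0, 0, 0, 5, 4],
    [0, 0, 1, 0, 4, 5],
], dtype=float)

svd = TruncatedSVD(n_components=3, random_state=42)
user_factors = svd.fit_transform(ratings)   # shape: (n_users, 3)
item_factors = svd.components_              # shape: (3, n_titles)

# Low-rank approximation rebuilt from the 3-dimensional latent space
approx = user_factors @ item_factors
rel_error = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
print("Latent user factors shape:", user_factors.shape)
print(f"Relative reconstruction error: {rel_error:.3f}")
```

Each user is now described by three numbers instead of six ratings. At real scale the same compression maps millions of columns to a much smaller number of latent factors.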

<img src="https://hellopm.co/wp-content/uploads/2024/07/hipertextual-si-te-vas-netflix-no-olvides-descargar-mi-actividad-mi-lista-2019814675.webp" width=500>

# Training Deep Learning Models (CNNs) with the Olivetti Faces dataset

## With the Original Dataset

In [None]:
from sklearn.model_selection import train_test_split

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

In [None]:
# Load dataset
data = fetch_olivetti_faces()
X = data.images  # shape: (400, 64, 64)
y = data.target  # shape: (400,)

# Reshape for CNN input: (samples, height, width, channels)
X = X.reshape((X.shape[0], 64, 64, 1)).astype("float32")

# One-hot encode labels
y_cat = to_categorical(y, num_classes=40)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, test_size=0.2, random_state=42)
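
With 40 people and only 10 images each, a purely random split can leave some people with few or no test images. A stratified split is one possible remedy. The sketch below uses placeholder arrays (`X_demo`, `y_demo`, both assumptions) with the same layout as the Olivetti data so that it runs without downloading anything.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the Olivetti layout: 40 people x 10 images each
y_demo = np.repeat(np.arange(40), 10)
X_demo = np.zeros((400, 64, 64, 1), dtype="float32")

# stratify=y_demo keeps each person's share the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Distinct test-image counts per person:", np.unique(np.bincount(y_te)))
```

In the notebook's own split, the equivalent change would be to pass the integer labels, `stratify=y`, alongside the one-hot `y_cat`.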

In [None]:
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),

    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(40, activation='softmax')  # 40 classes
])

In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=20, batch_size=16,
                    validation_data=(X_test, y_test))


In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.2f}")


In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy
plt.figure(figsize=(14, 5))

# Accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='CNN Train Acc')
plt.plot(history.history['val_accuracy'], label='CNN Val Acc')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

# Loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='CNN Train Loss')
plt.plot(history.history['val_loss'], label='CNN Val Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


> **What’s Going Wrong**:

The model is **overfitting** the training set and fails to generalize to unseen images. A likely cause is the small dataset: 400 samples spread over 40 classes leaves only about 10 images per person.
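
Overfitting of this kind shows up as a widening gap between the training and validation curves. The sketch below computes that gap from a Keras-style history dictionary. The helper name `generalization_gap` is an assumption, and the numbers are hypothetical values chosen for illustration, not results from this notebook.

```python
def generalization_gap(history_dict, metric="accuracy"):
    """Per-epoch difference between the training and validation values
    of `metric`, given a Keras-style history dictionary."""
    train = history_dict[metric]
    val = history_dict[f"val_{metric}"]
    return [t - v for t, v in zip(train, val)]

# Hypothetical numbers, for illustration only (not results from this notebook)
example_history = {
    "accuracy":     [0.10, 0.45, 0.80, 0.95, 0.99],
    "val_accuracy": [0.08, 0.30, 0.50, 0.55, 0.56],
}

gaps = generalization_gap(example_history)
print("Final train/validation gap:", round(gaps[-1], 2))
```

With a trained model, `generalization_gap(history.history)` would give the same per-epoch view for the real run.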

# Exercise

Please answer the following questions:

1. Background: What is your major or field of study? (e.g., Data Analytics, Finance, Operations, Marketing, Accounting, Computer Science)

2. Clustering Applications: Explain how clustering methods might reveal insights in your field. Provide an example scenario where clustering would be valuable.