# Introduction to Machine Learning for Business Applications

<img src="https://media.datacamp.com/legacy/image/upload/v1689699751/Comparing_supervised_and_unsupervised_learning_af4d5eccb0.png">

## Machine Learning Hierarchy with Business Applications

<table border="1" cellspacing="0" cellpadding="8" style="margin: auto; text-align: left; font-family: sans-serif;">
  <tr style="background-color: #d0e6f7; text-align: center;">
    <th><strong>Type</strong></th>
    <th><strong>Subtypes / Algorithms</strong></th>
    <th><strong>Business Applications</strong></th>
    <th><strong>Real-World Examples</strong></th>
  </tr>

  <!-- Supervised Learning -->
  <tr style="background-color: #f0f8ff;">
    <td rowspan="2"><strong>Supervised Learning</strong></td>
    <td><strong>Regression</strong><br>• Linear Regression<br>• Random Forest Regression</td>
    <td>
      <ul>
        <li>Sales & revenue forecasting</li>
        <li>Demand prediction</li>
        <li>Customer lifetime value (CLV)</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Amazon Demand Forecasting</li>
        <li>Uber Fare Estimation</li>
        <li>Spotify User Retention Models</li>
      </ul>
    </td>
  </tr>

  <tr style="background-color: #f0f8ff;">
    <td><strong>Classification</strong><br>• Logistic Regression<br>• Support Vector Machines (SVM)</td>
    <td>
      <ul>
        <li>Spam email detection</li>
        <li>Credit scoring / loan approval</li>
        <li>Medical diagnosis</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Gmail Spam Filter</li>
        <li>FICO Credit Risk Models</li>
        <li>IBM Watson Health</li>
      </ul>
    </td>
  </tr>

  <!-- Unsupervised Learning -->
  <tr style="background-color: #e6f7e6;">
    <td rowspan="2"><strong>Unsupervised Learning</strong></td>
    <td><strong>Clustering</strong><br>• K-Means<br>• DBSCAN</td>
    <td>
      <ul>
        <li>Customer segmentation</li>
        <li>Market research</li>
        <li>Anomaly detection</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Spotify Listener Segments</li>
        <li>Airbnb Guest Types</li>
        <li>Credit Card Fraud Detection</li>
      </ul>
    </td>
  </tr>

  <tr style="background-color: #e6f7e6;">
    <td><strong>Dimensionality Reduction</strong><br>• PCA<br>• t-SNE</td>
    <td>
      <ul>
        <li>Data visualization</li>
        <li>Noise reduction in IoT sensors</li>
        <li>Genomics / high-dimensional biology</li>
      </ul>
    </td>
    <td>
      <ul>
        <li>Netflix Recommendation System (latent space)</li>
        <li>Intel Sensor Compression</li>
        <li>Gene Expression Analysis (NIH)</li>
      </ul>
    </td>
  </tr>
</table>


## [scikit-learn](https://scikit-learn.org/stable/)

Machine Learning in Python

<img src="https://scikit-learn.org/1.3/_static/ml_map.png" width=800>

[Source](http://scikit-learn.org/stable/)

# Supervised ML

<img src="https://panamahitek.com/wp-content/uploads/2023/09/image-6-696x237.png">

> Supervised machine learning is fundamentally about solving optimization problems, often formulated as **minimization problems**.


### Regression: Minimizing Prediction Error

In regression, the model tries to **minimize the total error** between actual and predicted values.

This is often done by minimizing the **Mean Squared Error (MSE)**:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

Where:
- $n$ is the number of observations
- $y_i$ is the actual value
- $\hat{y}_i$ is the predicted value
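
As a minimal sketch of the MSE formula above, the snippet below computes it step by step with NumPy. The values in `y_true` and `y_hat` are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Squared error for each observation: (y_i - y_hat_i)^2
squared_errors = (y_true - y_hat) ** 2

# Average over all n observations
mse = squared_errors.mean()

print(f"Squared errors: {squared_errors}")
print(f"MSE: {mse:.2f}")  # (1 + 0 + 1 + 4) / 4 = 1.50
```

Because the errors are squared, the single prediction that is off by 2 contributes more to the MSE than the two predictions that are off by 1 combined.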


### Classification: Minimizing Misclassification

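In classification, the model aims to **minimize the number of incorrect predictions**.

Performance is often measured using **accuracy**, the fraction of predictions that are correct. Fewer misclassifications means higher accuracy.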
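
A small sketch of how accuracy relates to misclassification, assuming hypothetical true and predicted labels made up for this example:

```python
import numpy as np

# Hypothetical true labels and model predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: fraction of predictions that match the true label
accuracy = np.mean(y_true == y_pred)

# Misclassification rate: fraction of predictions that are wrong
error_rate = 1 - accuracy

print(f"Accuracy: {accuracy:.2f}")      # 6 of 8 correct = 0.75
print(f"Error rate: {error_rate:.2f}")  # 2 of 8 wrong = 0.25
```

Accuracy and error rate always sum to 1, so minimizing misclassifications and maximizing accuracy are the same goal.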



In [None]:
# import python packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# data preprocessing
from sklearn.preprocessing import StandardScaler

# Regression
from sklearn.linear_model import LinearRegression

# Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

# evaluation
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Set a seed for reproducibility
np.random.seed(42)

import warnings
warnings.filterwarnings('ignore')

# Regression

## Univariate regression

In [None]:
# data
df = pd.DataFrame({
    'HoursStudy': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'TestScore': [3, 4, 3, 6, 6, 5, 8, 9, 10, 10]
})

df

In [None]:
plt.scatter(df['HoursStudy'], df['TestScore'], c='blue', label='Actual')
plt.xlabel('Hours of study')
plt.ylabel('Test scores')
plt.legend()
plt.show();

In [None]:
# Define X and y
X = df['HoursStudy'].values
y = df['TestScore'].values

> **Slope (m) Formula — for simple linear regression**:

$$
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

> **Intercept (b) Formula - The intercept is calculated as**:

$$
b = \bar{y} - m \cdot \bar{x}
$$

Where:
- $\bar{x}$ is the **mean of x values**
- $\bar{y}$ is the **mean of y values**
- $m$ is the **slope**



In [None]:
# Calculate means
X_mean = np.mean(X)
y_mean = np.mean(y)

# Calculate slope (m) and intercept (b)
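numerator = np.sum((X - X_mean) * (y - y_mean))
denominator = np.sum((X - X_mean) ** 2)

m = numerator / denominator
b = y_mean - m * X_mean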

# Print results
print(f"Slope: {m}")
print(f"Intercept: {b}")
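
One way to sanity-check the slope and intercept formulas is to compare them with NumPy's built-in least-squares line fit. The sketch below repeats the study-hours data so it runs on its own. The names `m_formula`, `b_formula`, `m_polyfit`, and `b_polyfit` are introduced here for illustration only.

```python
import numpy as np

# Same data as the HoursStudy / TestScore example
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3, 4, 3, 6, 6, 5, 8, 9, 10, 10])

# Slope and intercept from the closed-form formulas
m_formula = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b_formula = y.mean() - m_formula * X.mean()

# np.polyfit with deg=1 fits a straight line and returns [slope, intercept]
m_polyfit, b_polyfit = np.polyfit(X, y, deg=1)

print(f"Formula: slope={m_formula:.4f}, intercept={b_formula:.4f}")
print(f"polyfit: slope={m_polyfit:.4f}, intercept={b_polyfit:.4f}")
print("Match:", np.isclose(m_formula, m_polyfit) and np.isclose(b_formula, b_polyfit))
```

Both approaches solve the same least-squares problem, so the two pairs of numbers should agree up to floating-point rounding.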

In [None]:
w1 = m  # slope computed above; b already holds the intercept
print(f"Regression Equation: TestScore = {w1:.2f} * HoursStudy + {b:.2f}")

In [None]:
# Generate predicted y values based on the regression line
y_pred = m * X + b

In [None]:
# Create the plot
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label=f'Regression Line: y = {m:.2f}x + {b:.2f}')
plt.xlabel('Hours of Study')
plt.ylabel('Test Score')
plt.title('Simple Linear Regression: Hours of Study vs Test Score')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)

# Print the results
print(f"R² Score: {r2:.3f}")
print(f"Mean Absolute Error (MAE): {mae:.3f}")

> **Using ```scikit-learn```, Python’s popular machine learning library, makes building ML models simple and efficient**.

In [None]:
# Define X and y
X = df[['HoursStudy']]
y = df['TestScore']

# initializing algorithm and training the model
model = LinearRegression()
model.fit(X, y)

print("Coefficients (w):", model.coef_)
print("Intercept (b):", model.intercept_)

In [None]:
w1 = model.coef_[0]
b = model.intercept_
print(f"Regression Equation: TestScore = {w1:.2f} * HoursStudy + {b:.2f}")

## Multiple regression

> **Formula - For two input variables**:

$$
y = b_0 + b_1 x_1 + b_2 x_2
$$

Where:
- $y$: Predicted output (e.g., test score)  
- $x_1$: Hours of study  
- $x_2$: Hours of sleep  
- $b_0$: Intercept (bias term)  
- $b_1, b_2$: Coefficients for each input variable  

---

The code uses the same equation with weights $w_1 = b_1$, $w_2 = b_2$ and intercept $b = b_0$:

$$
\hat{y} = w_1 \cdot \text{HoursStudy} + w_2 \cdot \text{HoursSleep} + b
$$



In [None]:
# data
df = pd.DataFrame({
    'HoursStudy': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'HoursSleep': [8, 7, 6, 7, 6, 5, 4, 5, 3, 2],
    'TestScore': [3, 4, 3, 6, 6, 5, 8, 9, 10, 10]
})

df

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

x = df['HoursStudy']
y = df['HoursSleep']
z = df['TestScore']

scatter = ax.scatter(x, y, z, c=z, cmap='cool', s=100, edgecolor='k')

ax.set_xlabel('Hours Studied')
ax.set_ylabel('Hours Slept')
ax.set_zlabel('Test Score')
ax.set_title('3D Visualization: Study & Sleep vs Test Score')

ax.set_zlim(0, 12)  # ensure we can see the top
ax.view_init(elev=20, azim=120)  # better perspective
fig.colorbar(scatter, ax=ax, label='Test Score')

plt.tight_layout()
plt.show()

> **What insights can we draw from this 3D plot?**

In [None]:
# Define X and y
X = df[['HoursStudy', 'HoursSleep']]
y = df['TestScore']

# initializing algorithm and training the model
model = LinearRegression()
model.fit(X, y)

print("Coefficients (w):", model.coef_)
print("Intercept (b):", model.intercept_)

$$
\hat{y} = w_1 \cdot \text{HoursStudy} + w_2 \cdot \text{HoursSleep} + b
$$

In [None]:
w1, w2 = model.coef_
b = model.intercept_
print(f"Regression Equation: TestScore = {w1:.2f} * HoursStudy + {w2:.2f} * HoursSleep + {b:.2f}")
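
To connect the regression equation to what scikit-learn does internally, the sketch below applies the equation by hand and compares the result with `predict`. It refits the model on the same data so it runs on its own. The names `demo_model`, `manual_pred`, and `sk_pred` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same data as the multiple regression example
df = pd.DataFrame({
    'HoursStudy': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'HoursSleep': [8, 7, 6, 7, 6, 5, 4, 5, 3, 2],
    'TestScore': [3, 4, 3, 6, 6, 5, 8, 9, 10, 10]
})
X = df[['HoursStudy', 'HoursSleep']]
y = df['TestScore']

demo_model = LinearRegression().fit(X, y)
w = demo_model.coef_       # array [w1, w2]
b = demo_model.intercept_  # scalar intercept

# Apply y_hat = w1 * HoursStudy + w2 * HoursSleep + b to every row
manual_pred = X.to_numpy() @ w + b

# scikit-learn's own predictions for the same rows
sk_pred = demo_model.predict(X)

print("Manual and sklearn predictions match:", np.allclose(manual_pred, sk_pred))
```

For a linear model, `predict` is just this weighted sum plus the intercept, which is why the coefficients can be read directly as "change in predicted score per extra hour".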

In [None]:
y_pred = model.predict(X)

In [None]:
r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)

# Print the results
print(f"R² Score: {r2:.3f}")
print(f"Mean Absolute Error (MAE): {mae:.3f}")


Conclusion: Adding another variable (**HoursSleep**) can improve a model’s fit, but only if that variable provides new, useful information.
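
A quick way to check this conclusion is to fit the model with and without the extra variable and compare the scores. The sketch below does that on the same data. The helper `training_r2` is a hypothetical function written for this example, not part of scikit-learn.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same data as the multiple regression example
df = pd.DataFrame({
    'HoursStudy': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'HoursSleep': [8, 7, 6, 7, 6, 5, 4, 5, 3, 2],
    'TestScore': [3, 4, 3, 6, 6, 5, 8, 9, 10, 10]
})
target = df['TestScore']

def training_r2(features):
    """Fit a linear regression on the given columns and return R² on the same data."""
    inputs = df[features]
    fitted = LinearRegression().fit(inputs, target)
    return r2_score(target, fitted.predict(inputs))

r2_one = training_r2(['HoursStudy'])
r2_two = training_r2(['HoursStudy', 'HoursSleep'])

print(f"R² with HoursStudy only:          {r2_one:.3f}")
print(f"R² with HoursStudy + HoursSleep:  {r2_two:.3f}")
print(f"Gain from adding HoursSleep:      {r2_two - r2_one:.3f}")
```

R² measured on the training data cannot go down when a variable is added to a linear regression, so a small gain on its own is not evidence that the new variable is useful. Comparing scores on data the model has not seen is a more reliable test.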

> Real-world applications:
>
> - Amazon Demand Forecasting
> - Uber Fare Estimation
> - Spotify User Retention Models

<img src="https://www.wikihow.com/images/thumb/8/8f/Get-an-Uber-Fare-Estimate-in-Advance-Step-25.jpg/aid7831868-v4-728px-Get-an-Uber-Fare-Estimate-in-Advance-Step-25.jpg.webp">

# Classification

In [None]:
import pandas as pd

df = pd.DataFrame({
    'Size':  [3, 1, 2, 3, 7, 8, 7, 8],
    'Color': [1, 2, 1, 2, 3, 3, 2, 2],
    'Label': [0, 0, 0, 0, 1, 1, 1, 1]   # fruit quality: 0 = Bad, 1 = Good
})
df

In [None]:
# Create scatter plot
for i in range(len(df)):
    if df['Label'][i] == 0:
        plt.scatter(df['Size'][i], df['Color'][i], color='red', label='Bad' if i == 0 else "")
    else:
        plt.scatter(df['Size'][i], df['Color'][i], color='blue', label='Good' if i == 4 else "")

plt.xlabel("Size")
plt.ylabel("Color")
plt.title("Classification Example")
plt.legend()
plt.grid(True)
plt.show()

Build a classification model. It takes just two lines of code: create the estimator, then call `fit`.

In [None]:
X = df[['Size', 'Color']]
y = df['Label']

## KNeighborsClassifier

In [None]:
# Create a k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model
knn.fit(X, y)                  # The magic happens in fit()

Let's suppose I have a new fruit [4, 2]. Is this fruit good or bad? Ask our model!!!

In [None]:
# New fruit data for prediction: [size, color]
new_fruit = np.array([[4, 2]])

In [None]:
# Plot existing fruits
for i in range(len(df)):
    if df['Label'][i] == 0:
        plt.scatter(df['Size'][i], df['Color'][i], color='red', label='Bad' if i == 0 else "")
    else:
        plt.scatter(df['Size'][i], df['Color'][i], color='blue', label='Good' if i == 4 else "")

# Plot new fruit
plt.scatter(new_fruit[:, 0], new_fruit[:, 1], color='black', marker='x', s=100, label='New Fruit')

plt.xlabel('Size')
plt.ylabel('Color')
plt.title('New Fruit Classification')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# New fruit data for prediction: [Size, Color]
new_fruit = np.array([[4, 2]])

# Make prediction
prediction = knn.predict(new_fruit)[0]  # Get the scalar value

# Output the result
label = "Good" if prediction == 1 else "Bad"
print(f"Prediction for new fruit with features {new_fruit.tolist()[0]}: {label}")
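
To see where a k-NN prediction comes from, the sketch below reproduces it by hand with NumPy for k=3. It assumes Euclidean distance and an unweighted majority vote, which are the defaults of `KNeighborsClassifier`. The variable names are introduced here for illustration.

```python
import numpy as np

# Same fruit data: columns are [Size, Color]; labels are 0 = Bad, 1 = Good
X_train = np.array([[3, 1], [1, 2], [2, 1], [3, 2],
                    [7, 3], [8, 3], [7, 2], [8, 2]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
new_point = np.array([4, 2])

# Step 1: Euclidean distance from the new fruit to every training fruit
distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))

# Step 2: indices of the 3 closest training fruits
nearest = np.argsort(distances)[:3]

# Step 3: majority vote among their labels
votes = y_train[nearest]
predicted = np.bincount(votes).argmax()

print("Nearest neighbors:", X_train[nearest].tolist())
print("Their labels:", votes.tolist())
print("Predicted label:", "Good" if predicted == 1 else "Bad")  # Bad
```

The three closest fruits are all small ones labeled Bad, so the vote is unanimous. Because the prediction depends entirely on distances, features measured on very different scales can dominate the result.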


## Decision Tree

In [None]:
# Train decision tree (max_depth=2, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

In [None]:
plt.figure(figsize=(10, 6))
plot_tree(tree,
          feature_names=['Size', 'Color'],
          class_names=['Bad', 'Good'],
          filled=True,
          rounded=True,
          fontsize=12)
plt.title("Decision Tree for Fruit Classification")
plt.show()

Conclusion: **```Size``` is more important than ```Color```**. ```Color``` had no effect on the model's decision-making.
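
The same conclusion can be read from the fitted tree's `feature_importances_` attribute, which reports how much each feature contributed to the splits. The sketch below refits a tree on the same data so it runs on its own. The name `demo_tree` is introduced here for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same fruit data: labels are 0 = Bad, 1 = Good
df = pd.DataFrame({
    'Size':  [3, 1, 2, 3, 7, 8, 7, 8],
    'Color': [1, 2, 1, 2, 3, 3, 2, 2],
    'Label': [0, 0, 0, 0, 1, 1, 1, 1]
})
X = df[['Size', 'Color']]
y = df['Label']

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
for name, importance in zip(X.columns, demo_tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
# Size: 1.00
# Color: 0.00
```

A single split on `Size` separates the two classes perfectly in this dataset, so the tree never needs `Color`. On a different dataset the importances could look quite different.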

In [None]:
# New fruit data for prediction: [Size, Color]
new_fruit = np.array([[4, 2]])

# Make prediction
prediction = tree.predict(new_fruit)[0]  # Get the scalar value

# Output the result
label = "Good" if prediction == 1 else "Bad"
print(f"Prediction for new fruit with features {new_fruit.tolist()[0]}: {label}")

> Real-world applications:
>
> - Gmail Spam Filter
> - FICO Credit Risk Models (predict the likelihood of a borrower **defaulting** on a loan or becoming delinquent)
> - IBM Watson Health

<img src="https://www.gizchina.com/wp-content/uploads/images/2023/10/h-700x394.jpg">

# Exercise

Please answer the following questions:

1. Background: What is your major or field of study? (e.g., Data Analytics, Finance, Operations, Marketing, Accounting, Computer Science)

2. Regression Applications: Identify a specific way regression analysis could be applied in your field. Provide a brief, concrete example of how it would be used to solve a real problem.

3. Classification Applications: Describe how classification techniques could benefit your field. Share a practical example of when and why you would use classification.