# UBC
## Introduction to Machine Learning

### Week 1
Instructor: Socorro Dominguez-Vidana

### Module 1

Overview:

- [ ] Motivation to Study Machine Learning
- [ ] Supervised vs. Unsupervised Learning
- [ ] Classification vs. Regression Problems
- [ ] Machine Learning Terminology
- [ ] Baseline Models with `DummyClassifier`/`DummyRegressor`
- [ ] Understanding `.fit()`, `.predict()`, and `.score()` in Machine Learning

#### Motivation to Study Machine Learning

- Machine learning allows computers to learn from data and make predictions or decisions without being explicitly programmed with rules. This is the main difference from traditional programming.

- It's used in fields like healthcare, finance, autonomous vehicles, and more to solve complex problems.

#### Supervised vs. Unsupervised Learning

##### Supervised Learning

- In supervised learning, the model is trained using labeled data.
- It learns to map input data to the correct output, i.e., it learns a function.
- **Example:** Predicting house prices based on features (area, bedrooms, etc.).

##### Unsupervised Learning

- In unsupervised learning, the model finds patterns or structure in data without labeled outputs.
- It's often used for clustering and dimensionality reduction.
- **Example:** Grouping similar customer behavior for targeted marketing.

In this course, we will only talk about Supervised Learning.

#### Classification vs. Regression Problems
- Supervised learning problems are divided into classification and regression problems.

##### Classification

- Classification is used to predict a category or class label.
- **Example:** Spam vs. Not Spam email classification.

##### Regression

- Regression allows us to predict a continuous numerical value.
- **Example:** Predicting the price of a house.
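
The difference between the two problem types shows up in the target column. The sketch below is illustrative only: it uses the Iris dataset (used later in these notes) for classification and `load_diabetes`, a dataset that does not appear elsewhere in these notes and is chosen here as an assumption, for regression.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris

# Classification: the target takes a small number of discrete class labels
iris = load_iris()
iris_classes = np.unique(iris.target)
print("Iris target values:", iris_classes)

# Regression: the target is a continuous numerical value
diabetes = load_diabetes()
n_distinct_targets = len(np.unique(diabetes.target))
print("First diabetes targets:", diabetes.target[:5])
print("Number of distinct diabetes targets:", n_distinct_targets)
```

A target with only a few distinct values usually signals classification, while a target with many distinct numeric values usually signals regression.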

#### Machine Learning Terminology

- **Features**: Input variables or attributes used to make predictions.
  - **Example:** In predicting house prices, features might include `square footage`, `number of bedrooms`, and `location`.

- **Targets**: The output or the value we want to predict.
  - In classification, it's the class label (e.g., spam or not spam).
  - In regression, it's the numerical value (e.g., house price).

- **Training**: The process of the model learning from the data.
  - It optimizes internal parameters to make accurate predictions.

- **Error**: The difference between the predicted output and the actual target.
  - Reducing error is the goal of training.
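
The terms above can be made concrete with a very small table. The house data below is invented purely for illustration, and the "model" is deliberately naive (it always predicts the mean price); it is a sketch of the vocabulary, not a real training procedure.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
houses = pd.DataFrame({
    "square_footage": [850, 1200, 1500],
    "bedrooms": [2, 3, 3],
    "price": [200_000, 290_000, 350_000],
})

X = houses[["square_footage", "bedrooms"]]  # features: the inputs
y = houses["price"]                          # target: the value to predict

# A deliberately naive "model": always predict the mean price
predictions = pd.Series([y.mean()] * len(y), index=y.index)

# Error: difference between the predicted output and the actual target
errors = predictions - y
print(errors)
```

A better model would use the features in `X` to bring each of these errors closer to zero.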

#### Baseline Models with `DummyClassifier` / `DummyRegressor`

- Before using complex models, we establish a baseline that later models should beat.
- `DummyClassifier` and `DummyRegressor` are simple models used for benchmarking.
- Dummy models ignore the input features and make predictions based on simple rules (e.g., random guessing or always predicting the most frequent class).
- We learn them because they follow the same "mechanical" process as a "real" model in `sklearn`, the library we will be using.
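
The worked example in these notes uses `DummyClassifier`. For completeness, here is a minimal sketch of the regression counterpart. The dataset (`load_diabetes`), the variable names, and the choice of `strategy="mean"` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Load a regression dataset and hold out 20% of it for testing
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# The "mean" strategy always predicts the mean of the training targets
dummy_reg = DummyRegressor(strategy="mean")
dummy_reg.fit(X_reg_train, y_reg_train)
reg_pred = dummy_reg.predict(X_reg_test)

print("Mean of training targets:", y_reg_train.mean())
print("First three predictions:", reg_pred[:3])
```

Every prediction equals the training mean, regardless of the features, which is what makes it a useful floor for comparing real regressors.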

##### Example

In [1]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
data = load_iris()

# Create a DataFrame for the features
iris_df = pd.DataFrame(data.data, columns=data.feature_names)

# Create a Series for the target values
target = pd.Series(data.target, name="target")

# Concatenate the features and target into a single DataFrame
iris_df = pd.concat([iris_df, target], axis=1)
iris_df.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0


In [2]:
# Split the data into training and testing sets
X = iris_df.drop("target", axis=1)
y = iris_df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nFeatures")
print(X.head(3))
print("\nTarget")
print(y.head(3))


Features
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2

Target
0    0
1    0
2    0
Name: target, dtype: int64


In [9]:
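# Create a DummyClassifier that always predicts the most frequent training class
# (random_state only affects the random strategies, "stratified" and "uniform")
dummy = DummyClassifier(strategy="most_frequent", random_state=1)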


In [7]:
iris_df['target'].value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

#### Understanding `.fit()`, `.predict()`, and `.score()` in Machine Learning

- `.fit()` is used to train the model on the provided data.
- `.predict()` makes predictions on new, unseen data.
- `.score()` calculates a default performance metric for the model. In `sklearn`, this is accuracy for classifiers and R² for regressors.
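
To see what these three methods do mechanically, the sketch below hand-writes a tiny most-frequent-class baseline. `MostFrequentBaseline` and the toy data are hypothetical names invented for this illustration; this is not how `sklearn` implements `DummyClassifier`, only a simplified picture of the same interface.

```python
import numpy as np


class MostFrequentBaseline:
    """Hypothetical stand-in for illustration; not part of sklearn."""

    def fit(self, X, y):
        # "Training": remember the most common label seen in y
        labels, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # Ignore the features and repeat the remembered label for every row
        return np.full(len(X), self.most_frequent_)

    def score(self, X, y):
        # Accuracy: fraction of predictions that match the true labels
        return float(np.mean(self.predict(X) == np.asarray(y)))


# Toy data, invented for illustration only
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array(["spam", "not spam", "spam", "spam"])

model = MostFrequentBaseline().fit(X_toy, y_toy)
toy_pred = model.predict(X_toy)
toy_score = model.score(X_toy, y_toy)
print(toy_pred)
print(toy_score)
```

Real models differ in what `.fit()` learns and how `.predict()` uses the features, but the sequence of calls stays the same.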

In [10]:
# Fit the model
dummy.fit(X_train, y_train)


In [11]:
# Predict on test data
y_pred = dummy.predict(X_test)
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])
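
Every prediction is the same class even though the full dataset is perfectly balanced. A likely explanation is that the random split leaves the training set slightly unbalanced, and `most_frequent` picks whichever class is most common in `y_train`. The sketch below rebuilds the same split in a self-contained way (using `return_X_y=True, as_frame=True` as a shortcut, which is an assumption not used in the cells above) and checks this.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Rebuild the same split as above
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Class counts in the training set only
train_counts = y_train.value_counts()
print(train_counts)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

# The constant prediction should be the most common training class
most_common = train_counts.idxmax()
print("Most frequent training class:", most_common)
print("All predictions equal it:", bool((y_pred == most_common).all()))
```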

In [12]:
y_test

73     1
18     0
118    2
78     1
76     1
31     0
64     1
141    2
68     1
82     1
110    2
12     0
36     0
9      0
19     0
56     1
104    2
69     1
55     1
132    2
29     0
127    2
26     0
128    2
131    2
145    2
108    2
143    2
45     0
30     0
Name: target, dtype: int64

In [14]:
# Calculate accuracy on the training data
accuracy = dummy.score(X_train, y_train)
print("Training accuracy:", accuracy)

Training accuracy: 0.3416666666666667
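
The cell above scores the model on the training data (`X_train`, `y_train`). To judge how a model does on unseen data, it is common to score it on the held-out test set as well. The following is a self-contained sketch under the same split settings; `manual_test_accuracy` is a helper variable introduced here to show what `.score()` computes for a classifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Rebuild the same split as above
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Score on data the model was fit on, and on held-out data
train_accuracy = dummy.score(X_train, y_train)
test_accuracy = dummy.score(X_test, y_test)

# For a classifier, .score() is the fraction of correct predictions
manual_test_accuracy = np.mean(dummy.predict(X_test) == y_test)

print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
print("Manual test accuracy:", manual_test_accuracy)
```

Any "real" classifier trained on this data should beat both numbers to be worth using.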
