## Course Assignment Instructions
You should have Python (version 3.8 or later) and Jupyter Notebook installed to complete this assignment. You will write code in the empty cell/cells below the problem. While most of this will be a programming assignment, some questions will ask you to "write a few sentences" in markdown cells. 

Submission Instructions:

Create a labs directory in your personal class repository (e.g., located in your home directory)
Clone the class repository
Copy this Jupyter notebook file (.ipynb) into your repo/labs directory
Make your edits, commit changes, and push to your repository
All submissions must be pushed before the due date to avoid late penalties. 

Labs are graded out of a 100 pts. Each day late is -5. For a max penalty of -50 after 10 days. From there you may submit the lab anytime before the semester ends for a max score of 50.  

Lab 2 is due on 2/18/25

## Basic Modeling
In the 342 class an example was given that considered a variable `x_3` which measured "criminality". In this example there are L = 4 levels "none", "infraction", "misdemeanor" and "felony". Create a variable `x_3` here with 100 random elements (equally probable). Create it as a nominal (i.e. unordered) factor. Hint: use random.choice from NumPy and Categorical from Pandas.

In [1]:
import numpy as np
import pandas as pd

# Define the categories
categories = ["none", "infraction", "misdemeanor", "felony"]

# Generate 100 random elements with equal probability
x_3 = np.random.choice(categories, size = 100, replace = True)
# Convert to a categorical (nominal) variable in pandas
pd.Categorical(x_3, categories = categories, ordered = False)
print(x_3)

['felony' 'misdemeanor' 'none' 'felony' 'infraction' 'felony'
 'misdemeanor' 'misdemeanor' 'felony' 'felony' 'none' 'none' 'misdemeanor'
 'none' 'infraction' 'none' 'misdemeanor' 'none' 'none' 'misdemeanor'
 'misdemeanor' 'none' 'infraction' 'felony' 'felony' 'misdemeanor'
 'felony' 'felony' 'misdemeanor' 'none' 'felony' 'misdemeanor'
 'misdemeanor' 'misdemeanor' 'felony' 'none' 'felony' 'infraction'
 'infraction' 'misdemeanor' 'felony' 'none' 'none' 'misdemeanor'
 'infraction' 'none' 'felony' 'felony' 'infraction' 'misdemeanor'
 'infraction' 'felony' 'none' 'infraction' 'none' 'misdemeanor'
 'misdemeanor' 'felony' 'felony' 'none' 'infraction' 'misdemeanor'
 'felony' 'none' 'infraction' 'felony' 'infraction' 'misdemeanor' 'none'
 'felony' 'infraction' 'none' 'infraction' 'felony' 'misdemeanor' 'felony'
 'none' 'misdemeanor' 'misdemeanor' 'misdemeanor' 'misdemeanor' 'felony'
 'none' 'felony' 'infraction' 'none' 'misdemeanor' 'none' 'infraction'
 'infraction' 'infraction' 'felony' 'felon

Use x_3 to create x_3_bin, a binary feature where 0 is no crime and 1 is any crime.

In [3]:
# creates a boolean array (True for crime, False for no crime)
x_3_bin = (x_3 != "none").astype(int)
print(x_3_bin)

[1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1
 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1]


Use `x_3` to create `x_3_ord`, an ordered factor variable. Ensure the proper ordinal ordering.

In [5]:
x_3_ord = pd.Categorical(x_3, categories = categories, ordered = True)
print(x_3_ord)

['felony', 'misdemeanor', 'none', 'felony', 'infraction', ..., 'infraction', 'infraction', 'none', 'felony', 'infraction']
Length: 100
Categories (4, object): ['none' < 'infraction' < 'misdemeanor' < 'felony']


Convert this variable into three binary variables without any information loss and put them into a data matrix. Hint: use column_stack from Numpy.

In [7]:
x_3_matrix = np.column_stack([
    (x_3 == "infraction").astype(int),
    (x_3 == "misdemeanor").astype(int),
    (x_3 == "felony").astype(int)
])

x_3_matrix = pd.DataFrame(x_3_matrix, columns = ["is_infraction", "is_misdemeanor", "is_felony"])
print(x_3_matrix)

    is_infraction  is_misdemeanor  is_felony
0               0               0          1
1               0               1          0
2               0               0          0
3               0               0          1
4               1               0          0
..            ...             ...        ...
95              1               0          0
96              1               0          0
97              0               0          0
98              0               0          1
99              1               0          0

[100 rows x 3 columns]


What should the sum of each row be (in English)? Write your answer in the markdown cell below

0 or 1. 0 if the person did no crime, 1 if the person did one of the types of crime

Verify that in the code cell below

In [9]:
row_sum = x_3_matrix.sum(axis = 1)
print(row_sum.value_counts())

1    75
0    25
Name: count, dtype: int64


 How should the column sum look (in English)? Write your answer in the markdown cell below

The sum of each row should be around 25 as we used random to assign the categories which should disperse them evenely

Verify that in the code cell below

In [11]:
col_sum = x_3_matrix.sum(axis = 0)
print(col_sum)

is_infraction     21
is_misdemeanor    25
is_felony         29
dtype: int64


Generate a matrix with 100 rows where the first column is realization from a normal with mean 17 and variance 38, the second column is uniform between -10 and 10, the third column is poisson with mean 6, the fourth column is exponential with lambda of 9, the fifth column is binomial with n = 20 and p = 0.12 and the sixth column is a binary variable with exactly 24% 1's dispersed randomly. Name the rows the entries of the `fake_first_names` vector. You will need to use Numpy

In [13]:
# Number of rows
n = 100

# Assign row names (index) from fake_first_names
fake_first_names = [
    "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
    "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
    "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
    "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
    "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
    "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
    "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
    "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
    "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
    "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
    "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
    "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
    "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
    "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
    "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
    "Lincoln"
]

# Create a DataFrame with the specified distributions
X = pd.DataFrame({
    "Normal": np.random.normal(loc = 17, scale = np.sqrt(38), size = n),  # Normal(17, variance 38)
    "Uniform": np.random.uniform(low = -10, high = 10, size = n),         # Uniform(-10, 10)
    "Poisson": np.random.poisson(6, size = n),                            # Poisson(6)
    "Exponential": np.random.exponential(1/9, size = n),                  # Exponential(λ=9)
    "Binomial": np.random.binomial(n = 20, p = .12, size = n),            # Binomial(n=20, p=0.12)
    "Binary": np.random.permutation([1] * int(n * .24) + [0] * int(n * .76))  # 24% 1s, shuffled
})
X.index = fake_first_names[:n]
print(X)

              Normal   Uniform  Poisson  Exponential  Binomial  Binary
Sophia     15.486595  5.491942        7     0.042153         2       0
Emma       16.812618 -1.689414        5     0.164838         2       0
Olivia     19.247926 -2.364159        5     0.364949         2       0
Ava        12.173202  8.585097        4     0.078225         3       0
Mia        23.521994 -2.141151        6     0.019589         1       0
...              ...       ...      ...          ...       ...     ...
Christian  26.065681  0.708227        8     0.258644         2       0
Andrew     18.380533 -1.388724        6     0.003340         1       1
Brayden    15.980011 -0.230347        6     0.068276         3       1
John       15.749537 -0.708023        8     0.013984         2       0
Lincoln     9.904255 -7.295067        2     0.018377         5       0

[100 rows x 6 columns]


Create a data frame of the same data as above except make the binary variable a factor "DOMESTIC" vs "FOREIGN" for 0 and 1 respectively. In Rstudio you used the `View` function to ensure this worked as desired. In python use .head() on the DataFrame. I recommend creating a copy of the DataFrame and then using the .replace in conjunction with .astype("category") to make the binary variable a factor. 

In [15]:
# Convert matrix DataFrame to categorical for the binary variable
# Make a copy to keep X unchanged
X_copy = X.copy() 

# Convert binary column (6th column) to categorical labels
X_copy["Binary"] = X_copy["Binary"].replace({0: "Domestic", 1: "Foreign"}).astype("category")

# Display first few rows
X_copy.head()

Unnamed: 0,Normal,Uniform,Poisson,Exponential,Binomial,Binary
Sophia,15.486595,5.491942,7,0.042153,2,Domestic
Emma,16.812618,-1.689414,5,0.164838,2,Domestic
Olivia,19.247926,-2.364159,5,0.364949,2,Domestic
Ava,12.173202,8.585097,4,0.078225,3,Domestic
Mia,23.521994,-2.141151,6,0.019589,1,Domestic


Print out a table of the binary variable. Then print out the proportions of "DOMESTIC" vs "FOREIGN". Pandas DataFrames has a .value_count() feature. 

In [17]:
X_copy["Binary"].value_counts(normalize = True)

Binary
Domestic    0.76
Foreign     0.24
Name: proportion, dtype: float64

Print out a summary of the whole dataframe.

In [19]:
print(X_copy.describe())
print(X_copy["Binary"].value_counts())

           Normal     Uniform     Poisson  Exponential    Binomial
count  100.000000  100.000000  100.000000   100.000000  100.000000
mean    16.825248    1.304737    5.910000     0.112094    2.130000
std      6.832460    5.701908    2.127715     0.116061    1.440083
min      1.756510   -9.499054    1.000000     0.000814    0.000000
25%     12.761071   -2.920130    4.750000     0.036676    1.000000
50%     16.943084    2.028144    6.000000     0.067879    2.000000
75%     20.577845    5.964783    7.250000     0.147208    3.000000
max     34.793817    9.660658   10.000000     0.575841    6.000000
Binary
Domestic    76
Foreign     24
Name: count, dtype: int64


## Dataframe creation
Imagine you are running an experiment with many manipulations. You have 14 levels in the variable "treatment" with levels a, b, c, etc. For each of those manipulations you have 3 submanipulations in a variable named "variation" with levels A, B, C. Then you have "gender" with levels M / F. Then you have "generation" with levels Boomer, GenX, Millenial. Then you will have 6 runs per each of these groups. In each set of 6 you will need to select a name without duplication from the appropriate set of names (from the last question). Create a data frame with columns treatment, variation, gender, generation, name and y that will store all the unique unit information in this experiment. Leave y empty because it will be measured as the experiment is executed. In Rstudio you used `rep` function using the `times` argument. For python use np.tile, and np.repeat.

In [23]:
# Define categories
treatments = list("abcdefghijklmn")
variations = list("ABC")
genders = ["M", "F"]
generations = ["Boomer", "GenX", "Millenial"]


# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}


# Create experiment dataframe
df = pd.DataFrame({
    "treatment": np.repeat(treatments, len(variations) * len(genders) * len(generations) * 6),
    "variation": np.tile(np.repeat(variations, len(genders) * len(generations) * 6), len(treatments)),
    "gender": np.tile(np.repeat(genders, len(generations) * 6), len(treatments) * len(variations)),
    "generation": np.tile(np.repeat(generations, 6), len(treatments) * len(variations) * len(genders))
})

# Add a unique identifier to preserve the original order
df = df.reset_index().rename(columns={'index': 'orig_index'})

# Function to assign unique names per group
def assign_names(group):
    gender_val = group["gender"].iloc[0]  # Extract gender
    generation_val = group["generation"].iloc[0]  # Extract generation
    return np.random.choice(name_sets[gender_val][generation_val], 6, replace=False)

# Apply function to assign names
df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names).explode().reset_index(drop=True)

# Add empty column y
df["y"] = np.nan

# Display first few rows
print(df.head())
df

   orig_index treatment variation gender generation      name   y
0           0         a         A      M     Boomer  Marjorie NaN
1           1         a         A      M     Boomer       Kay NaN
2           2         a         A      M     Boomer    Dianne NaN
3           3         a         A      M     Boomer   Shirley NaN
4           4         a         A      M     Boomer   Mildred NaN


  df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names).explode().reset_index(drop=True)


Unnamed: 0,orig_index,treatment,variation,gender,generation,name,y
0,0,a,A,M,Boomer,Marjorie,
1,1,a,A,M,Boomer,Kay,
2,2,a,A,M,Boomer,Dianne,
3,3,a,A,M,Boomer,Shirley,
4,4,a,A,M,Boomer,Mildred,
...,...,...,...,...,...,...,...
1507,1507,n,C,F,Millenial,Gabriel,
1508,1508,n,C,F,Millenial,Evan,
1509,1509,n,C,F,Millenial,Austin,
1510,1510,n,C,F,Millenial,Casey,


Now that you've done it with the np.tile and np.repeat, Try doing this by importing product from the itertools module. This will be analogous to using `expand.grid` function from Rstudio. 

| **R Function** | **Python Equivalent** |
|--------------|-----------------|
| `rep(x, times=n)` | `np.repeat(x, n)` |
| `rep(x, each=n)` | `np.tile(np.repeat(x, n), times)` |
| `rep(x, length.out=n)` | `np.resize(x, n)` |
| `expand.grid()` | `itertools.product()` |

| **R Function** | **Python Equivalent** | **Use Case** |
|--------------|-----------------|-----------|
| `rep(x, times=n)` | `np.repeat(x, n)` | Repeat each element **`n` times** in order |
| `rep(x, each=n)` | `np.tile(x, n)` | Repeat the full sequence **`n` times** |
| `rep(x, length.out=n)` | `np.resize(x, n)` | Repeat `x` but **truncate** or **expand** to length `n` |

**`expand.grid()` → `itertools.product()`** for generating **all combinations**  
**`rep(..., each=n)` → `np.repeat()`** for **repeating values in order**  
**`rep(..., times=n)` → `np.tile()`** for **cycling through values**  
**`Combination of `np.repeat()` and `np.tile()`** replaces **nested `rep()`** in R

In [29]:
from itertools import product

# Define categories
treatments = list("abcdefghijklmn")
variations = list("ABC")
genders = ["M", "F"]
generations = ["Boomer", "GenX", "Millenial"]
runs_per_group = 6
# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}

# Generate all unique combinations (equivalent to expand.grid in R)
df = pd.DataFrame(product(treatments, variations, genders, generations, range(1, runs_per_group + 1)),
                  columns=["treatment", "variation", "gender", "generation", "run"])



# Function to assign unique names per gender-generation group
def assign_names(group):
    gender = group["gender"].iloc[0]
    generation = group["generation"].iloc[0]
    return np.random.choice(name_sets[gender][generation], size = len(group), replace = False)
    
# Apply function ensuring each group gets 6 names
df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], sort = False, group_keys = False).apply(assign_names).explode().reset_index(drop = True)

# Add an empty column for y (to be measured later)
df["y"] = np.nan

# Print first few rows
print(df.head())

# Verify the number of rows (should be 1512)
print(f"Total rows: {len(df)} (Expected: {14 * 3 * 2 * 3 * 6})")
df

  treatment variation gender generation  run      name   y
0         a         A      M     Boomer    1       Lee NaN
1         a         A      M     Boomer    2       Tom NaN
2         a         A      M     Boomer    3     Eddie NaN
3         a         A      M     Boomer    4   Herbert NaN
4         a         A      M     Boomer    5  Theodore NaN
Total rows: 1512 (Expected: 1512)


  df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], sort = False, group_keys = False).apply(assign_names).explode().reset_index(drop = True)


Unnamed: 0,treatment,variation,gender,generation,run,name,y
0,a,A,M,Boomer,1,Lee,
1,a,A,M,Boomer,2,Tom,
2,a,A,M,Boomer,3,Eddie,
3,a,A,M,Boomer,4,Herbert,
4,a,A,M,Boomer,5,Theodore,
...,...,...,...,...,...,...,...
1507,n,C,F,Millenial,2,Alexis,
1508,n,C,F,Millenial,3,Samantha,
1509,n,C,F,Millenial,4,Lauren,
1510,n,C,F,Millenial,5,Taylor,


## Basic Binary Classification Modeling

Load the famous `iris` data frame into the namespace. In Rstudio you used the `skim` function from the package `skimr` to provide a summary of the columns. In python we will use df.describe() and the ProfileReport from the ydata-profiling package. The `iris` data set is not available in base python, but we can get this data from the sklearn package. Write a few descriptive sentences about the distributions using the code below in English.

In [None]:
#install scikit-learn by uncommenting the code below
%pip install scikit-learn

In [None]:
#install ydata-profile by uncommenting the code below
%pip install -U ydata-profiling[notebook]

### **Comparing the `iris` Dataset in R vs Python**
| Feature  | **R (`datasets::iris`)**  | **Python (`sklearn.datasets.load_iris()`)**  |
|----------|-------------------------|--------------------------------|
| **Total Rows**  | 150 | 150 |
| **Columns (Features)** | 5 (`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) | 5 (`sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`, `species`) |
| **Species Encoding**  | `"setosa"`, `"versicolor"`, `"virginica"` (Categorical Factor) | `0` (setosa), `1` (versicolor), `2` (virginica) (Numerical Encoding) |
| **Data Type for Species** | Factor (Categorical) | Integer (0,1,2) |
| **Data Loading Method** | `data(iris)` (built-in dataset) | `datasets.load_iris()` (from `sklearn`) |

### **Key Differences**
- **Species Encoding:**  
  - **R uses categorical factor labels (`setosa`, `versicolor`, `virginica`).**  
  - **Python (`sklearn`) encodes species numerically as `0`, `1`, and `2`.**
- **Column Names:**  
  - **R:** `Sepal.Length`, `Sepal.Width`, etc.  
  - **Python:** `sepal length (cm)`, `sepal width (cm)`, etc.  

In [None]:
from sklearn import datasets
import ydata_profiling  

# Load the famous Iris dataset
Iris = datasets.load_iris()
df_iris = pd.DataFrame(Iris.data, columns = Iris.feature_names)

df_iris["species"] = Iris.target
print(df_iris.describe())

profile = ydata_profiling.ProfileReport(df_iris, title = "iris summary", explorative = True)
# Generate the profiling report (Uncomment to generate HTML file)
profile.to_file("iris_report.html")

TO-DO: describe this data

The iris dataset has 150 observations and 5 variables: sepal length, sepal width, petal length, petal width and species
sepal width, sepal length, petal width and petal length are numerical observations while species is categorical with 3 levels.
There is also no missing data in this dataset.
The minimum sepal length is 4.3 and the max is 7.9 with a mean of 5.843
The minimum sepal width is 2 and the max is 4.4 with a mean of 3.057
The minimum petal length is 1 and the max is 6.9 with a mean of 3.758
The minimum petal width is .1 and the max is 2.5 with a mean of 1.199
There are 3 species and each one's count is 50
Our goal is to be able to perdict if it is species 0 or 1 using only the sepal length. 

The outcome / label / response is `Species`. This is what we will be trying to predict. However, we only care about binary classification between "setosa" and "versicolor" for the purposes of this exercise. Thus the first order of business is to drop one class. Let's drop the data for the level "virginica" from the data frame.

In [None]:
# Filter out "virginica" from the dataset
df_iris_binary = df_iris[df_iris["species"] != 2].copy()
print(df_iris_binary["species"].unique())
df_iris_binary

Now create a vector `y` that is length the number of remaining rows in the data frame whose entries are 0 if "setosa" and 1 if "versicolor".

In [None]:
# Create binary target vector `y` (0 for setosa, 1 for versicolor)
y = (df_iris_binary["species"] == 1).astype(int)
print(y)

Write a function `mode` returning the sample mode of a vector of numeric values. Use np.random.choice from NumPy and import Counter from the collections module.

In [None]:
from collections import Counter

# Define mode function
def mode(v):
    return Counter(v).most_common(1)[0][0]

# Test with a random sample (equivalent to `sample(letters, 1000, replace=TRUE)`)
sample_data = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 1000, replace=True)
print("Mode of sample letters:", mode(sample_data))

# Test with binary target vector `y`
print("Mode of y:", mode(y))

Fit a threshold model to `y` using the feature `Sepal.Length`. Write your own code to do this. What is the estimated value of the threshold parameter? Save the threshold value as `threshold`. Hint: use np.zeros and np.sum from Numpy. You will need to use a for loop using the range() function.  

In [None]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all possible threshold values
for i in range(n):
    threshold = sepal_length[i]  # Set current threshold
    num_errors = np.sum((sepal_length > threshold) != y_values)  # Count classification errors
    num_errors_by_parameter[i] = [threshold, num_errors]  # Store values

# Sort by number of errors
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]

# Get the threshold with the least number of errors
best_threshold = num_errors_by_parameter[0, 0]

# Print results
print(f"Optimal threshold for classification: {best_threshold}")

What is the total number of errors this model makes? This requires a couple of minor modifications to the previous code.

In [None]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))
total_errors = 0
# Loop over all possible threshold values
for i in range(n):
    threshold = sepal_length[i]  # Set current threshold
    num_errors = np.sum((sepal_length > threshold) != y_values)  # Count classification errors
    
    # Store threshold and corresponding errors
    num_errors_by_parameter[i] = [threshold, num_errors]
    
    # Accumulate total errors across all thresholds
    total_errors += num_errors

# Sort by number of errors to find the best threshold
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]
best_threshold = num_errors_by_parameter[0, 0]  # Best threshold with the least errors

# Print results
print(f"Optimal threshold for classification: {best_threshold}")
print(f"Total number of errors across all thresholds: {total_errors}")

Does the threshold model's performance make sense given the following summaries:

In [None]:
# Print the best threshold found earlier
print(f"Optimal threshold for classification: {best_threshold}")

# Summary statistics for setosa and versicolor Sepal.Length
setosa_summary = df_iris_binary[df_iris_binary["species"] == 0]["sepal length (cm)"].describe()
versicolor_summary = df_iris_binary[df_iris_binary["species"] == 1]["sepal length (cm)"].describe()

# Print summaries
print("\nSummary statistics for Setosa Sepal Length:")
print(setosa_summary)

print("\nSummary statistics for Versicolor Sepal Length:")
print(versicolor_summary)

TO-DO: Write your answer here in English

Yes, the threshold is in middle of the mean's of the setosa and versicolor sepal length

Create the function `g` explicitly that can predict `y` from `x` being a new `Sepal.Length`. Hint: use np.where from Numpy ... this can also be down using a lambda function. 

In [None]:
# Define function `g` for threshold-based prediction
def g(x):
    return np.where(x > best_threshold, 1, 0)

test_val = 4
g(test_val)

In [None]:
g = lambda x: np.where(x > best_threshold, 1, 0)

g(8)