## Course Assignment Instructions
You should have Python (version 3.8 or later) and Jupyter Notebook installed to complete this assignment. You will write code in the empty cell/cells below the problem. While most of this will be a programming assignment, some questions will ask you to "write a few sentences" in markdown cells. 

Submission Instructions:

Create a labs directory in your personal class repository (e.g., located in your home directory)
Clone the class repository
Copy this Jupyter notebook file (.ipynb) into your repo/labs directory
Make your edits, commit changes, and push to your repository
All submissions must be pushed before the due date to avoid late penalties. 

Labs are graded out of a 100 pts. Each day late is -5. For a max penalty of -50 after 10 days. From there you may submit the lab anytime before the semester ends for a max score of 50.  

Lab 2 is due on 2/18/25

## Basic Modeling
In the 342 class an example was given that considered a variable `x_3` which measured "criminality". In this example there are L = 4 levels "none", "infraction", "misdemeanor" and "felony". Create a variable `x_3` here with 100 random elements (equally probable). Create it as a nominal (i.e. unordered) factor. Hint: use random.choice from NumPy and Categorical from Pandas.

In [1]:
import numpy as np
import pandas as pd

# Define the categories
categories = ["none", "infraction", "misdemeanor", "felony"]

# Generate 100 random elements with equal probability
x_3 = np.random.choice(categories, size = 100, replace = True)

# Convert to a categorical (nominal) variable in pandas
x_3 = pd.Categorical(x_3, categories = categories, ordered = False) # order by the given categories list
print(x_3)


['misdemeanor', 'none', 'felony', 'misdemeanor', 'felony', ..., 'infraction', 'infraction', 'none', 'felony', 'none']
Length: 100
Categories (4, object): ['none', 'infraction', 'misdemeanor', 'felony']


Use x_3 to create x_3_bin, a binary feature where 0 is no crime and 1 is any crime.

In [2]:
# creates a boolean array (True for crime, False for no crime)
x_3_bin = (x_3 != "none").astype(int)
print(x_3_bin)


[1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1
 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0]


Use `x_3` to create `x_3_ord`, an ordered factor variable. Ensure the proper ordinal ordering.

In [3]:
x_3_ord = pd.Categorical(x_3, categories = categories, ordered = True)
x_3_ord

['misdemeanor', 'none', 'felony', 'misdemeanor', 'felony', ..., 'infraction', 'infraction', 'none', 'felony', 'none']
Length: 100
Categories (4, object): ['none' < 'infraction' < 'misdemeanor' < 'felony']

Convert this variable into three binary variables without any information loss and put them into a data matrix. Hint: use column_stack from Numpy.

In [4]:
x_3_matrix = np.column_stack([
    (x_3 == "infraction").astype(int),
    (x_3 == "misdemeanor").astype(int),
    (x_3 == "felony").astype(int)
]
)

x_3_matrix = pd.DataFrame(x_3_matrix, columns = ["infraction", "misdemeanor", "felony"])
print(x_3_matrix)


    infraction  misdemeanor  felony
0            0            1       0
1            0            0       0
2            0            0       1
3            0            1       0
4            0            0       1
..         ...          ...     ...
95           1            0       0
96           1            0       0
97           0            0       0
98           0            0       1
99           0            0       0

[100 rows x 3 columns]


What should the sum of each row be (in English)? Write your answer in the markdown cell below

Ans: The sum of each row be 0 or 1. 0 mean it doen't have any crime or none. and 1 means it has one of the crime.

Verify that in the code cell below

In [5]:
row_sum = x_3_matrix.sum(axis = 1)

print(row_sum.value_counts()) 

1    77
0    23
Name: count, dtype: int64


 How should the column sum look (in English)? Write your answer in the markdown cell below

The column sum should be around 25 as we generate a random of 100 between 4 categories

Verify that in the code cell below

In [7]:
col_sum = x_3_matrix.sum(axis = 0)
print(col_sum)

infraction     23
misdemeanor    31
felony         23
dtype: int64


Generate a matrix with 100 rows where the first column is realization from a normal with mean 17 and variance 38, the second column is uniform between -10 and 10, the third column is poisson with mean 6, the fourth column is exponential with lambda of 9, the fifth column is binomial with n = 20 and p = 0.12 and the sixth column is a binary variable with exactly 24% 1's dispersed randomly. Name the rows the entries of the `fake_first_names` vector. You will need to use Numpy

In [8]:
# Number of rows
n = 100

# Assign row names (index) from fake_first_names
fake_first_names = [
    "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
    "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
    "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
    "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
    "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
    "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
    "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
    "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
    "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
    "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
    "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
    "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
    "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
    "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
    "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
    "Lincoln"
]

# Create a DataFrame with the specified distributions
X = pd.DataFrame({
    "Normal": np.random.normal(loc = 17, scale=np.sqrt(38), size = n),  # Normal(17, variance 38)
    "Uniform": np.random.uniform(low = -10, high = 10, size = n),                 # Uniform(-10, 10)
    "Poisson": np.random.poisson(6, size = n),                            # Poisson(6)
    "Exponential": np.random.exponential(1/9, size = n),                  # Exponential(λ=9)
    "Binomial": np.random.binomial(n = 20, p = 0.12, size = n),             # Binomial(n=20, p=0.12)
    "Binary": np.random.permutation([1]*int(n * 0.24) + [0]*int(n*0.76))  # 24% 1s, shuffled
})

X.index = fake_first_names[:n]

X

Unnamed: 0,Normal,Uniform,Poisson,Exponential,Binomial,Binary
Sophia,18.868902,-2.525198,8,0.349672,2,0
Emma,22.116624,-5.457612,7,0.065484,1,0
Olivia,13.816416,2.252798,3,0.105131,2,0
Ava,12.810844,-7.371833,8,0.016127,2,0
Mia,12.189338,-1.561189,5,0.213204,4,1
...,...,...,...,...,...,...
Christian,14.203085,-3.960957,6,0.095198,0,0
Andrew,19.716283,1.704336,2,0.003634,4,0
Brayden,19.143242,7.176130,4,0.067066,2,1
John,18.083321,1.741598,6,0.144042,4,0


Create a data frame of the same data as above except make the binary variable a factor "DOMESTIC" vs "FOREIGN" for 0 and 1 respectively. In Rstudio you used the `View` function to ensure this worked as desired. In python use .head() on the DataFrame. I recommend creating a copy of the DataFrame and then using the .replace in conjunction with .astype("category") to make the binary variable a factor. 

In [9]:
# Convert matrix DataFrame to categorical for the binary variable
# Make a copy to keep X unchanged
df = X.copy()

# Convert binary column (6th column) to categorical labels
df["Binary"] = df["Binary"].replace({0:"DOMESTIC", 1: "FOREIGN"}).astype("category")

# Display first few rows
df.head()

Unnamed: 0,Normal,Uniform,Poisson,Exponential,Binomial,Binary
Sophia,18.868902,-2.525198,8,0.349672,2,DOMESTIC
Emma,22.116624,-5.457612,7,0.065484,1,DOMESTIC
Olivia,13.816416,2.252798,3,0.105131,2,DOMESTIC
Ava,12.810844,-7.371833,8,0.016127,2,DOMESTIC
Mia,12.189338,-1.561189,5,0.213204,4,FOREIGN


Print out a table of the binary variable. Then print out the proportions of "DOMESTIC" vs "FOREIGN". Pandas DataFrames has a .value_count() feature. 

In [10]:
df["Binary"].value_counts(normalize=True)

Binary
DOMESTIC    0.76
FOREIGN     0.24
Name: proportion, dtype: float64

Print out a summary of the whole dataframe.

In [11]:
print(df.describe())
print(df["Binary"].value_counts())

           Normal     Uniform     Poisson  Exponential   Binomial
count  100.000000  100.000000  100.000000   100.000000  100.00000
mean    16.963458    0.628906    6.060000     0.101890    2.55000
std      6.311549    5.705476    2.423871     0.088484    1.62291
min      4.307257   -9.063880    0.000000     0.001533    0.00000
25%     13.420740   -3.988257    4.000000     0.039538    1.00000
50%     17.501764    0.326173    6.000000     0.082301    2.50000
75%     21.351098    5.498100    8.000000     0.134087    4.00000
max     31.294212    9.962693   15.000000     0.416597    7.00000
Binary
DOMESTIC    76
FOREIGN     24
Name: count, dtype: int64


## Dataframe creation
Imagine you are running an experiment with many manipulations. You have 14 levels in the variable "treatment" with levels a, b, c, etc. For each of those manipulations you have 3 submanipulations in a variable named "variation" with levels A, B, C. Then you have "gender" with levels M / F. Then you have "generation" with levels Boomer, GenX, Millenial. Then you will have 6 runs per each of these groups. In each set of 6 you will need to select a name without duplication from the appropriate set of names (from the last question). Create a data frame with columns treatment, variation, gender, generation, name and y that will store all the unique unit information in this experiment. Leave y empty because it will be measured as the experiment is executed. In Rstudio you used `rep` function using the `times` argument. For python use np.tile, and np.repeat.

In [None]:
# Define categories
'''
summary
14 levels of treatments from a to n so using repeat i repeat the treatemets for 3 * 2 * 3 * 6
3 levels of variations so using tile i repeat the variations for  2 * 3 * 6 and the whole thing for len(treatments)
2 levels of genders, it use tile to repeat the gender for 3 * 6 and the whole thing fro len(treatments) * len(variations)
3 levels of generations, it use tile to repeat the generations for 6 and the whole thing for len(treatments) * len(variations) * len(gender)
then, create a set of 6 unique name and repeat the process for each group. then add an empty column y
'''
treatments =  list("abcdefghijklmn") # 14 levels
variations = list("ABC")  # 3 levels
genders = ["M", "F"]  # 2 levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 levels


# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}


# Create experiment dataframe
df = pd.DataFrame({
    "treatment": np.repeat(treatments, len(variations) * len(genders) * len(generations) * 6),
    "variation": np.tile(np.repeat(variations, len(genders) * len(generations) * 6), len(treatments)),
    "gender": np.tile(np.repeat(genders, len(generations) * 6), len(treatments) * len(variations)),
    "generation": np.tile(np.repeat(generations, 6), len(treatments) * len(variations) * len(genders)),
}) 


# Function to assign unique names per group
def assign_names(group):
    gender_val = group["gender"].iloc[0]  # Extract gender
    generation_val = group["generation"].iloc[0]  # Extract generation
    return np.random.choice(name_sets[gender_val][generation_val], 6, replace=False)

# Apply function to assign names
df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names).explode().reset_index(drop=True)

# Add empty column y
df["y"] = np.nan

# Display first few rows
print(df.head())
df

  treatment variation gender generation      name   y
0         a         A      M     Boomer   Mildred NaN
1         a         A      M     Boomer  Lorraine NaN
2         a         A      M     Boomer   Dorothy NaN
3         a         A      M     Boomer      Joan NaN
4         a         A      M     Boomer     Betty NaN


  df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names).explode().reset_index(drop=True)


Unnamed: 0,treatment,variation,gender,generation,name,y
0,a,A,M,Boomer,Mildred,
1,a,A,M,Boomer,Lorraine,
2,a,A,M,Boomer,Dorothy,
3,a,A,M,Boomer,Joan,
4,a,A,M,Boomer,Betty,
...,...,...,...,...,...,...
1507,n,C,F,Millenial,Dylan,
1508,n,C,F,Millenial,Gabriel,
1509,n,C,F,Millenial,Casey,
1510,n,C,F,Millenial,Christian,


Now that you've done it with the np.tile and np.repeat, Try doing this by importing product from the itertools module. This will be analogous to using `expand.grid` function from Rstudio. 

| **R Function** | **Python Equivalent** |
|--------------|-----------------|
| `rep(x, times=n)` | `np.repeat(x, n)` |
| `rep(x, each=n)` | `np.tile(np.repeat(x, n), times)` |
| `rep(x, length.out=n)` | `np.resize(x, n)` |
| `expand.grid()` | `itertools.product()` |

| **R Function** | **Python Equivalent** | **Use Case** |
|--------------|-----------------|-----------|
| `rep(x, times=n)` | `np.repeat(x, n)` | Repeat each element **`n` times** in order |
| `rep(x, each=n)` | `np.tile(x, n)` | Repeat the full sequence **`n` times** |
| `rep(x, length.out=n)` | `np.resize(x, n)` | Repeat `x` but **truncate** or **expand** to length `n` |

**`expand.grid()` → `itertools.product()`** for generating **all combinations**  
**`rep(..., each=n)` → `np.repeat()`** for **repeating values in order**  
**`rep(..., times=n)` → `np.tile()`** for **cycling through values**  
**`Combination of `np.repeat()` and `np.tile()`** replaces **nested `rep()`** in R

In [19]:
from itertools import product

# Define categories
treatments = list("abcdefghijklmn")  # 14 treatment levels
variations = list("ABC")             # 3 variation levels
genders = ["M", "F"]                 # 2 gender levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 generation levels
runs_per_group = 6               # 6 runs per group

# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}

# Generate all unique combinations (equivalent to expand.grid in R)
df = pd.DataFrame(
    product(treatments, variations, genders, generations, range(1, runs_per_group + 1)),
    columns=["treatment", "variation", "gender", "generation", "run"]
)



# Function to assign unique names per group (each group has 6 rows)
def assign_names(group):
    gender = group["gender"].iloc[0]
    generation = group["generation"].iloc[0]
    # Sample 6 unique names (without replacement) from the appropriate name set
    return np.random.choice(name_sets[gender][generation], size=len(group), replace=False)

# Group by all four factors and apply the function.
# Using sort=False preserves the order generated by product.
df["name"] = (df.groupby(["treatment", "variation", "gender", "generation"], sort=False, group_keys=False).apply(assign_names).explode().reset_index(drop=True))

# Add an empty column for y (to be measured later)
df["y"] = np.nan

# Display first few rows and verify the total number of rows
print(df.head())
print(f"Total rows: {len(df)} (Expected: {14 * 3 * 2 * 3 * 6})")
df

  treatment variation gender generation  run      name   y
0         a         A      M     Boomer    1      Gene NaN
1         a         A      M     Boomer    2     Leroy NaN
2         a         A      M     Boomer    3  Theodore NaN
3         a         A      M     Boomer    4   Bernard NaN
4         a         A      M     Boomer    5       Lee NaN
Total rows: 1512 (Expected: 1512)


  df["name"] = (df.groupby(["treatment", "variation", "gender", "generation"], sort=False, group_keys=False).apply(assign_names).explode().reset_index(drop=True))


Unnamed: 0,treatment,variation,gender,generation,run,name,y
0,a,A,M,Boomer,1,Gene,
1,a,A,M,Boomer,2,Leroy,
2,a,A,M,Boomer,3,Theodore,
3,a,A,M,Boomer,4,Bernard,
4,a,A,M,Boomer,5,Lee,
...,...,...,...,...,...,...,...
1507,n,C,F,Millenial,2,Brittany,
1508,n,C,F,Millenial,3,Taylor,
1509,n,C,F,Millenial,4,Bethany,
1510,n,C,F,Millenial,5,Samantha,


## Basic Binary Classification Modeling

Load the famous `iris` data frame into the namespace. In Rstudio you used the `skim` function from the package `skimr` to provide a summary of the columns. In python we will use df.describe() and the ProfileReport from the ydata-profiling package. The `iris` data set is not available in base python, but we can get this data from the sklearn package. Write a few descriptive sentences about the distributions using the code below in English.

In [20]:
#install scikit-learn by uncommenting the code below
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [21]:
#install ydata-profile by uncommenting the code below
%pip install -U "ydata-profiling[notebook]"

Note: you may need to restart the kernel to use updated packages.


### **Comparing the `iris` Dataset in R vs Python**
| Feature  | **R (`datasets::iris`)**  | **Python (`sklearn.datasets.load_iris()`)**  |
|----------|-------------------------|--------------------------------|
| **Total Rows**  | 150 | 150 |
| **Columns (Features)** | 5 (`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) | 5 (`sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`, `species`) |
| **Species Encoding**  | `"setosa"`, `"versicolor"`, `"virginica"` (Categorical Factor) | `0` (setosa), `1` (versicolor), `2` (virginica) (Numerical Encoding) |
| **Data Type for Species** | Factor (Categorical) | Integer (0,1,2) |
| **Data Loading Method** | `data(iris)` (built-in dataset) | `datasets.load_iris()` (from `sklearn`) |

### **Key Differences**
- **Species Encoding:**  
  - **R uses categorical factor labels (`setosa`, `versicolor`, `virginica`).**  
  - **Python (`sklearn`) encodes species numerically as `0`, `1`, and `2`.**
- **Column Names:**  
  - **R:** `Sepal.Length`, `Sepal.Width`, etc.  
  - **Python:** `sepal length (cm)`, `sepal width (cm)`, etc.  

In [1]:
from sklearn import datasets
import ydata_profiling  
import pandas as pd

# Load the famous Iris dataset
Iris = datasets.load_iris()
df_iris = pd.DataFrame(Iris.data, columns = Iris.feature_names)

df_iris["Species"] = Iris.target
df_iris.describe()

# Generate the profiling report (Uncomment to generate HTML file)
# profile = ydata_profiling.ProfileReport(df_iris, title = "Iris Summary", explorative = True)
profile = ydata_profiling.ProfileReport(df_iris, title="Iris Summary", explorative=True)

profile.to_file("iris_report.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

TO-DO: describe this data

The dataset has 5 columns and 150 rows. There is one duplicate row which is ok and usually we need to drop duplicate rows. 4 columns (sepal length, sepal width, petal length and petal width) are numeric and 1 is categorical(species).

The outcome / label / response is `Species`. This is what we will be trying to predict. However, we only care about binary classification between "setosa" and "versicolor" for the purposes of this exercise. Thus the first order of business is to drop one class. Let's drop the data for the level "virginica" from the data frame.

In [2]:
# Filter out "virginica" from the dataset
df_iris_binary = df_iris[df_iris["Species"] != 2].copy()
df_iris_binary.head()
print(df_iris_binary["Species"].unique())
print(df_iris_binary)

[0 1]
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                 5.1               3.5                1.4               0.2   
1                 4.9               3.0                1.4               0.2   
2                 4.7               3.2                1.3               0.2   
3                 4.6               3.1                1.5               0.2   
4                 5.0               3.6                1.4               0.2   
..                ...               ...                ...               ...   
95                5.7               3.0                4.2               1.2   
96                5.7               2.9                4.2               1.3   
97                6.2               2.9                4.3               1.3   
98                5.1               2.5                3.0               1.1   
99                5.7               2.8                4.1               1.3   

    Species  
0         0  
1    

Now create a vector `y` that is length the number of remaining rows in the data frame whose entries are 0 if "setosa" and 1 if "versicolor".

In [3]:
# Create binary target vector `y` (0 for setosa, 1 for versicolor)
y = (df_iris_binary["Species"] == 1).astype(int)
y.unique()

array([0, 1])

Write a function `mode` returning the sample mode of a vector of numeric values. Use np.random.choice from NumPy and import Counter from the collections module.

In [6]:
from collections import Counter
import numpy as np

# Define mode function
def mode(v):
    return Counter(v).most_common(1)[0][0]
    
# Test with a random sample (equivalent to `sample(letters, 1000, replace=TRUE)`)
sample_data = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 1000, replace=True)
print("Mode of sample letters:", mode(sample_data))

# Test with binary target vector `y`
print("Mode of y:", mode(y))

Mode of sample letters: l
Mode of y: 0


Fit a threshold model to `y` using the feature `Sepal.Length`. Write your own code to do this. What is the estimated value of the threshold parameter? Save the threshold value as `threshold`. Hint: use np.zeros and np.sum from Numpy. You will need to use a for loop using the range() function.  

In [13]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all possible threshold values
for i in range(n):
    threshold = sepal_length[i]  # Set current threshold
    num_errors = np.sum((sepal_length > threshold) != y_values)  # Count classification errors
    num_errors_by_parameter[i] = [threshold, num_errors]  # Store values

# Sort by number of errors
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]

# Get the threshold with the least number of errors
best_threshold = num_errors_by_parameter[0, 0]

# Print results
print(f"Optimal threshold for classification: {best_threshold}")



Optimal threshold for classification: 5.4


What is the total number of errors this model makes? This requires a couple of minor modifications to the previous code.

In [15]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))
total_errors = 0
# Loop over all possible threshold values
for i in range(n):
 
    threshold = sepal_length[i]  # Set current threshold
    num_errors = np.sum((sepal_length > threshold) != y_values)  # Count classification errors
    
    # Store threshold and corresponding errors
    num_errors_by_parameter[i] = [threshold, num_errors]
    
    # Accumulate total errors across all thresholds
    total_errors += num_errors


# Sort by number of errors to find the best threshold
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]
best_threshold = num_errors_by_parameter[0, 0]  # Best threshold with the least errors

# Print results
print(f"Optimal threshold for classification: {best_threshold}")
print(f"Total number of errors across all thresholds: {total_errors}")


Optimal threshold for classification: 5.4
Total number of errors across all thresholds: 2796


Does the threshold model's performance make sense given the following summaries:

In [16]:
# Print the best threshold found earlier
print(f"Optimal threshold for classification: {best_threshold}")

# Summary statistics for setosa and versicolor Sepal.Length
setosa_summary = df_iris_binary[df_iris_binary["Species"] == 0]["sepal length (cm)"].describe()
versicolor_summary = df_iris_binary[df_iris_binary["Species"] == 1]["sepal length (cm)"].describe()

# Print summaries
print("\nSummary statistics for Setosa Sepal Length:")
print(setosa_summary)

print("\nSummary statistics for Versicolor Sepal Length:")
print(versicolor_summary)

Optimal threshold for classification: 5.4

Summary statistics for Setosa Sepal Length:
count    50.00000
mean      5.00600
std       0.35249
min       4.30000
25%       4.80000
50%       5.00000
75%       5.20000
max       5.80000
Name: sepal length (cm), dtype: float64

Summary statistics for Versicolor Sepal Length:
count    50.000000
mean      5.936000
std       0.516171
min       4.900000
25%       5.600000
50%       5.900000
75%       6.300000
max       7.000000
Name: sepal length (cm), dtype: float64


TO-DO: Write your answer here in English

Optimal threshold for classification is 5.4 which means if I have 5.4 and below it is setosa and otherwise it is versicolor. There are 50 numbers of setosa and 50 numbers of versicolor. 

Summary for Setosa Sepal Length:
The average of the sepal length is 5.00600 cm
It has 0.35249 cm deviation
The minimum sepal length is 4.3 cm
25% of the sepal has 4.8 cm length
50% of the sepal has 5 cm length
75% of the sepal has 5.2 cm length
The maximum sepal length is 5.8 cm

Summary for Versicolor Sepal Length:
The average of the sepal length is 5.936000 cm
It has 0.516171 cm deviation
The minimum sepal length is 4.900000 cm
25% of the sepal has 5.600000 cm length
50% of the sepal has 5.900000 cm length
75% of the sepal has 6.3 cm length
The maximum sepal length is 7 cm

Create the function `g` explicitly that can predict `y` from `x` being a new `Sepal.Length`. Hint: use np.where from Numpy ... this can also be down using a lambda function. 

In [21]:
# Define function `g` for threshold-based prediction
def g(x):
    return np.where(x> best_threshold, 1, 0)
test = g(5.5)
print(test)

1


In [18]:
g = lambda x: np.where(x> best_threshold, 1, 0)
g(9)

array(1)