5. Load the height/weight data using data = np.genfromtxt('heightWeightData.txt', delimiter=','). The first
column is the class label (1=male, 2=female), the second column is height, and the third is weight.

In [20]:
# Import the necessary packages:
import numpy as np
import scipy.stats  # import the submodule explicitly; a plain `import scipy` may not expose scipy.stats

# Import the data for the assignment:
data = np.loadtxt("heightWeightData.csv", delimiter=",", dtype=float)

# Split the data into its attributes (class label, height, weight):
G = data[:, 0]
H = data[:, 1]
W = data[:, 2]

# Find which samples are male:
is_male = G == 1

In [21]:
# Compute the sample means:
mu_G = np.mean(G)
mu_H = np.mean(H)
mu_W = np.mean(W)

a) Write a Python script to fit a Gaussian model to each class using all the data for training. What’s the training error?

Answer: The problem asks us to fit one model per class, so the class-conditional density $P(x|G)$ can be estimated directly. The gender is then estimated with Bayes' theorem:
$P(G|x)=\frac{P(x|G)P(G)}{P(x)}$
for a given $x=(height,weight)$.
The evidence $P(x)$ is the same for both genders because it is evaluated at the same point $x$, and the priors $P(G)$ are assumed equal here. The gender can therefore be estimated through the inequality:
$P(male|x)>P(female|x) \iff P(x|male)>P(x|female)$
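The decision rule above can be sketched as a comparison of log-posteriors. This is a minimal sketch on made-up parameters, not the assignment's data: the function name `classify`, the `prior_male` argument, and all means and covariances are hypothetical. With `prior_male=0.5` the prior term is identical for both classes, so the rule reduces to comparing the class-conditional densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians (illustrative numbers only).
shared_cov = [[9.0, 20.0], [20.0, 400.0]]
male = multivariate_normal(mean=[70.0, 180.0], cov=shared_cov)
female = multivariate_normal(mean=[65.0, 140.0], cov=shared_cov)

def classify(x, prior_male=0.5):
    """Return 1 (male) or 2 (female) by comparing log p(x|G) + log P(G)."""
    x = np.atleast_2d(x)
    score_male = male.logpdf(x) + np.log(prior_male)
    score_female = female.logpdf(x) + np.log(1.0 - prior_male)
    # Ties go to class 2, matching the "else 2" convention used in this notebook.
    return np.where(score_male > score_female, 1, 2)

points = np.array([[72.0, 190.0], [63.0, 130.0]])
labels = classify(points)
print(labels)
```

Working with log-densities instead of densities does not change the decision, and it avoids numerical underflow when a point lies far from both class means.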

In [22]:
# Split the original data into male and female subsets:
male_H = H[is_male]
male_W = W[is_male]
female_H = H[~is_male]
female_W = W[~is_male]

# Compute the class means:
mean_male_H = np.mean(male_H)
mean_male_W = np.mean(male_W)
mean_female_H = np.mean(female_H)
mean_female_W = np.mean(female_W)

# Fit a Gaussian to each class (class mean and full class covariance):
male_HW = np.stack((male_H, male_W), axis=0)  # shape (2, n_male); rows are variables
cov_male_HW = np.cov(male_HW)
# p(x|g=male); the distribution is built once and its pdf method is reused:
pdf_male_HW = scipy.stats.multivariate_normal(
    mean=[mean_male_H, mean_male_W],
    cov=cov_male_HW).pdf

female_HW = np.stack((female_H, female_W), axis=0)  # shape (2, n_female)
cov_female_HW = np.cov(female_HW)
# p(x|g=female):
pdf_female_HW = scipy.stats.multivariate_normal(
    mean=[mean_female_H, mean_female_W],
    cov=cov_female_HW).pdf


# Now estimate the gender of every sample: male (1) if p(x|g=1) > p(x|g=2), female (2) otherwise.
estimated_g1 = np.array([1 if pdf_male_HW([h, w]) > pdf_female_HW([h, w]) else 2 for h, w in zip(H, W)])

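# Compute the L2 norm of the label errors. Despite the "MSE" name, this is the
# square root of the number of misclassified samples, since each error is +/-1:
mse_1 = np.linalg.norm(estimated_g1 - G)
print(f"MSE1 = {mse_1}")
# print(f"Error={estimated_g1 - G}")
print(f"Error in percentage = {np.sum(np.abs(estimated_g1 - G))/len(G)*100:.2f} %")

MSE1 = 5.0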
Error in percentage = 11.90 %
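The two numbers printed above measure the same thing in different units. Because the labels are 1 and 2, every misclassified sample contributes an error of exactly 1 in absolute value, so the norm is the square root of the number of misclassified samples (5.0 corresponds to 25 errors). A minimal sketch with hypothetical labels, where `y_true` and `y_pred` are made-up arrays rather than the assignment's data:

```python
import numpy as np

# Hypothetical true and predicted labels (1 = male, 2 = female).
y_true = np.array([1, 1, 2, 2, 1, 2])
y_pred = np.array([1, 2, 2, 2, 1, 1])

n_wrong = int(np.sum(y_pred != y_true))    # number of misclassified samples
error_rate = n_wrong / len(y_true)         # training error as a fraction
l2_norm = np.linalg.norm(y_pred - y_true)  # equals sqrt(n_wrong) for labels in {1, 2}

print(f"misclassified = {n_wrong}, error rate = {error_rate:.2%}, norm = {l2_norm:.4f}")
```

Counting mismatches with `y_pred != y_true` also stays correct if the labels are later recoded (for example to 0 and 5), whereas the absolute-difference sum only works when the two label values differ by 1.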


b) Repeat a) imposing the same covariance matrix for both classes.

Answer: If the classes are ignored when estimating the covariance, there is a single global covariance matrix shared by both distributions. The class means remain the same as in the previous answer.
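The global covariance used in this answer also contains the spread between the two class means. A common alternative for a shared covariance is the pooled within-class estimate, which centers each class on its own mean first. The sketch below is an assumption-laden illustration on synthetic data, not the approach taken in this notebook; the helper name `pooled_covariance` is hypothetical.

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance: summed class scatter divided by N - K."""
    classes = np.unique(y)
    d = X.shape[1]
    scatter = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)  # center each class on its own mean
        scatter += centered.T @ centered
    return scatter / (len(X) - len(classes))

# Synthetic data, for illustration only: same spread, different means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(200, 2))])
y = np.repeat([1, 2], 200)

cov_pooled = pooled_covariance(X, y)
cov_global = np.cov(X, rowvar=False)  # ignores the class labels

print("pooled:\n", cov_pooled)
print("global:\n", cov_global)
```

On data like this, the global estimate is noticeably larger than the pooled one because it absorbs the distance between the class means. Which estimate the exercise intends is not stated in the source, so both readings are worth checking.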

In [23]:
# Compute one single covariance matrix, ignoring the classes:
HW = np.stack((H, W), axis=0)  # shape (2, n); rows are variables
cov_hw = np.cov(HW)
# Build each distribution once and reuse its pdf method:
male_pdf_HW = scipy.stats.multivariate_normal(
    mean=[mean_male_H, mean_male_W],
    cov=cov_hw).pdf
female_pdf_HW = scipy.stats.multivariate_normal(
    mean=[mean_female_H, mean_female_W],
    cov=cov_hw).pdf

# Now estimate the gender of every sample: male (1) if p(x|g=1) > p(x|g=2), female (2) otherwise.
estimated_g2 = np.array([1 if male_pdf_HW([h, w]) > female_pdf_HW([h, w]) else 2 for h, w in zip(H, W)])

# Compute the L2 norm of the label errors (the square root of the number of misclassified samples, not a true MSE):
mse_2 = np.linalg.norm(estimated_g2 - G)
print(f"MSE2 = {mse_2}")
# print(f"Error={estimated_g - G}")
print(f"Error in percentage = {np.sum(np.abs(estimated_g2 - G))/len(G)*100:.2f} %")

MSE2 = 5.0990195135927845
Error in percentage = 12.38 %


c) Repeat a) imposing diagonal covariance matrices.

Answer: Imposing a diagonal covariance matrix amounts to treating the variables as uncorrelated, which for a Gaussian means independent. The off-diagonal covariance terms are therefore set to 0.
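The independence claim can be checked numerically: with a diagonal covariance, the joint Gaussian density equals the product of the one-dimensional densities of each variable. A minimal sketch with hypothetical numbers (the mean, variances, and evaluation point are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mean = np.array([68.0, 160.0])      # hypothetical means of height and weight
variances = np.array([9.0, 400.0])  # hypothetical variances
x = np.array([70.0, 150.0])         # point at which the density is evaluated

# Joint density with a diagonal covariance matrix.
joint = multivariate_normal(mean=mean, cov=np.diag(variances)).pdf(x)

# Product of the two univariate densities (norm takes a standard deviation).
product = np.prod(norm(loc=mean, scale=np.sqrt(variances)).pdf(x))

print(joint, product)
```

This is the Gaussian naive Bayes assumption: each class is described by one mean and one variance per feature, with no cross terms.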

In [24]:
# Compute distributions for both classes, treating height (H) and weight (W) as
# independent variables: the covariances are 0 and only the variances are nonzero.
# Note: np.var uses ddof=0 by default, whereas np.cov uses ddof=1.
male_HW = np.stack((male_H, male_W), axis=0)
cov_male_HW = np.zeros(shape=(2, 2))
cov_male_HW[0, 0] = np.var(male_H)
cov_male_HW[1, 1] = np.var(male_W)

pdf_male_HW = scipy.stats.multivariate_normal(
    mean=[mean_male_H, mean_male_W],
    cov=cov_male_HW) # P(x|g=male)

# For the female:
cov_female_HW = np.zeros(shape=(2, 2))
cov_female_HW[0, 0] = np.var(female_H)
cov_female_HW[1, 1] = np.var(female_W)
pdf_female_HW = scipy.stats.multivariate_normal(
    mean=[mean_female_H, mean_female_W],
    cov=cov_female_HW) # P(H,W|g=female)

# Now estimate the gender of every sample: male (1) if p(x|g=1) > p(x|g=2), female (2) otherwise.
estimated_g3 = np.array([1 if pdf_male_HW.pdf([h, w]) > pdf_female_HW.pdf([h, w]) else 2 for h, w in zip(H, W)])

# Compute the L2 norm of the label errors (the square root of the number of misclassified samples, not a true MSE):
mse_3 = np.linalg.norm(estimated_g3 - G)
print(f"MSE3 = {mse_3}")
# print(f"Error={estimated_g - G}")
print(f"Error in percentage = {np.sum(np.abs(estimated_g3 - G))/len(G)*100:.2f} %")

MSE3 = 5.0990195135927845
Error in percentage = 12.38 %
