# ML Assignment: 1
Starting Assignment 1 of Machine Learning, tackling both theoretical concepts and practical questions.

## Theoretical Questions

### Exercise 1
Compare t-test and z-test in terms of assumptions, population standard deviation, and
use case.

#### Answer
Let's first look at the formulas for Student's t-statistic and the normal z-statistic used in hypothesis testing:


$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$

$$ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$

The only difference between the two formulas is in the denominator: the t-statistic uses `s` (the **sample standard deviation**), while the z-statistic uses `σ` (the **population standard deviation**).

So we can answer the question as follows:
- Assumptions:
  - z-test: Requires a known population standard deviation, and either normally distributed data or a large sample size (commonly n ≥ 30).
  - t-test: Uses the sample standard deviation, assumes the data are normal (or approximately normal), and is suitable for small samples.
- Population Standard Deviation:
  - The z-test uses the population standard deviation; the t-test estimates it with the sample standard deviation.
- Use Case:
  - Use a z-test for large samples with a known population standard deviation. Use a t-test for small samples or when the population standard deviation is unknown.
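To make the comparison concrete, here is a minimal sketch that computes both statistics for the same sample. The sample values, the hypothesised mean `mu_0`, and the "known" `sigma_known` are made-up numbers for illustration only; they do not come from the assignment.

```python
import math
import statistics

# Hypothetical sample and hypothesised mean (illustrative values only).
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.5]
mu_0 = 5.0
sigma_known = 0.3  # assumed known population standard deviation (z-test only)

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)

# Same numerator, different denominator: sigma for z, s for t.
z_stat = (x_bar - mu_0) / (sigma_known / math.sqrt(n))
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value for the z-test from the standard normal distribution.
p_z = 2 * (1 - statistics.NormalDist().cdf(abs(z_stat)))

print(f"x_bar = {x_bar:.4f}, s = {s:.4f}")
print(f"z = {z_stat:.3f} (p = {p_z:.4f}), t = {t_stat:.3f} with {n - 1} degrees of freedom")
```

The t-statistic would be compared against a t-distribution with `n - 1` degrees of freedom, which has heavier tails than the normal distribution to account for the extra uncertainty in estimating `s`.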

### Exercise 2
A tech company is deploying a recommendation system for a music streaming platform.
Suppose user ratings for songs (on a scale from 1 to 5) follow a non-normal distribution.
If we take a large enough sample, how does CLT justify using a normal approximation
for constructing confidence intervals? If the mean predicted rating for a song is 4.2 with
a standard deviation of 0.5, how can we use CLT to determine the probability that a
randomly selected user will rate the song above 4.5?

#### Answer
First, we must clarify a point about the Central Limit Theorem that is easy to get wrong.

- **CLT**:
The Central Limit Theorem in Statistics states that as the sample size increases and its variance is finite, then the **distribution of the sample mean approaches normal distribution** irrespective of the shape of the population distribution. [geeksforgeeks reference](https://www.geeksforgeeks.org/central-limit-theorem/)

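So it is the distribution of the **sample mean** that approaches a normal distribution, not the distribution of individual observations drawn from the population. This is what justifies normal-approximation confidence intervals for the mean rating when the sample is large, even though the individual ratings are not normally distributed.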
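The behaviour described by the CLT can be checked with a small simulation. The sketch below assumes a hypothetical skewed rating distribution on the 1 to 5 scale (the `weights` are invented for illustration) and compares the spread of simulated sample means with the CLT prediction.

```python
import random
import statistics

random.seed(0)

# Hypothetical skewed rating distribution on a 1-5 scale (an assumption for illustration).
ratings = [1, 2, 3, 4, 5]
weights = [0.05, 0.05, 0.10, 0.30, 0.50]

pop_mean = sum(r * w for r, w in zip(ratings, weights))
pop_var = sum(w * (r - pop_mean) ** 2 for r, w in zip(ratings, weights))

n = 50            # size of each sample
n_samples = 5000  # number of repeated samples

sample_means = [
    statistics.mean(random.choices(ratings, weights=weights, k=n))
    for _ in range(n_samples)
]

# CLT prediction: the sample mean has standard deviation sqrt(pop_var / n).
predicted_se = (pop_var / n) ** 0.5
observed_se = statistics.stdev(sample_means)

# Share of sample means within 1.96 predicted standard errors of the population mean;
# this should be close to 0.95 if the normal approximation is reasonable.
coverage = sum(abs(m - pop_mean) <= 1.96 * predicted_se for m in sample_means) / n_samples

print(f"population mean = {pop_mean:.3f}, mean of sample means = {statistics.mean(sample_means):.3f}")
print(f"predicted SE = {predicted_se:.3f}, observed SE = {observed_se:.3f}")
print(f"share within 1.96 SE = {coverage:.3f}")
```

Although the individual ratings are heavily skewed, the simulated sample means cluster symmetrically around the population mean, which is the property a normal-based confidence interval relies on.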

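For the probability calculation, we first convert the value `X = 4.5` to a standard normal `Z` value. Note that this step treats an individual rating as approximately normal, which is an extra assumption, since the CLT only describes the sample mean.

Given the values:
$$ \mu = 4.2 $$
$$ \sigma = 0.5$$
$$ z = \frac{X - \mu}{\sigma} = \frac{4.5 - 4.2}{0.5} = 0.6$$
Now we find the probability that Z > 0.6 using the standard normal table from `John E. Freund's Mathematical Statistics`: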
$$P(Z > 0.6) = 1 - 0.7257 = 0.2743$$
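As a cross-check of the table lookup, the same tail probability can be computed with Python's standard library. This is only a sketch of the calculation above; it relies on the same normality assumption for an individual rating.

```python
from statistics import NormalDist

mu, sigma, x = 4.2, 0.5, 4.5

# Standardise the rating, then take the upper tail of the standard normal distribution.
z = (x - mu) / sigma
p_above = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(Z > z) = {p_above:.4f}")  # z = 0.60, P(Z > z) = 0.2743
```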

### Exercise 3
You are given a dataset containing information about customers in an online retail store.
The dataset includes the following features:
- Age
- Annual Income
- Customer Satisfaction Score
- Preferred Payment Method

For each of the following scenarios, determine the most appropriate statistical test from the options: Pearson’s Correlation, Spearman’s Rank Correlation, or Chi-Square Test. Then, apply the test using the dataset provided and interpret the result.

#### Tests
- **[Pearson's Correlation](https://datatab.net/tutorial/pearson-correlation)** -> Measures the linear relationship between two continuous variables. **Parametric**
- **[Spearman’s Rank Correlation](https://datatab.net/tutorial/spearman-correlation)** -> Measures the monotonic relationship between two variables, which need not be linear. **Non-Parametric**
- **[Chi-Square Test](https://datatab.net/tutorial/chi-square-test)** -> Tests the association between categorical variables **Non-Parametric**.

#### Parametric Vs Non-Parametric
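Parametric tests assume that the data follow a specific distribution (usually the normal distribution), whereas non-parametric tests make no strict distributional assumption.
Parametric tests are used for continuous numerical data, whereas non-parametric tests can also handle ordinal or categorical data.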

There are further differences, but these are the ones we need to answer this part.
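The difference between a linear and a monotonic relationship can be illustrated with a short sketch. The data below are hypothetical (an exponential curve), and the helper functions `pearson` and `ranks` are written here for illustration only. Spearman's coefficient is simply Pearson's coefficient applied to the ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def ranks(values):
    """Ranks from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

pearson_r = pearson(x, y)
spearman_rho = pearson(ranks(x), ranks(y))  # Spearman = Pearson on the ranks

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Because the relationship is perfectly monotonic, Spearman's coefficient is 1, while Pearson's coefficient is noticeably lower because the points do not lie on a straight line.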

#### Scenario (a):
The store owner wants to examine whether there is a linear relationship between Age and Annual Income of customers. Which statistical test should be applied? Compute the test statistic and interpret the result.

#### Answer
Because both Age and Annual Income are continuous variables and we want to measure the linear relationship between them, the comparison above points to `Pearson's Correlation`.

We use some Python code to make the calculation easier:

In [66]:
import pandas as pd
from scipy.stats import pearsonr

# Sample Data in Assignment: 
data = {
    'Age': [25, 32, 40, 50, 60],
    'Annual Income (USD)': [40_000, 55_000, 65_000, 70_000, 85_000]
}

df = pd.DataFrame(data)
display(df)
r, p_value = pearsonr(df['Age'], df['Annual Income (USD)'])
print(f"Pearson's r: {r:.2f}, p-value: {p_value:.4f}")

|   | Age | Annual Income (USD) |
|---|-----|---------------------|
| 0 | 25 | 40000 |
| 1 | 32 | 55000 |
| 2 | 40 | 65000 |
| 3 | 50 | 70000 |
| 4 | 60 | 85000 |


Pearson's r: 0.98, p-value: 0.0035


#### Interpretation
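Pearson's correlation ranges from -1 to 1, so `r = 0.98` indicates a strong positive linear relationship between Age and Annual Income. The p-value of 0.0035 is below 0.05, so the correlation is statistically significant, although with only five observations the estimate should be read with some caution.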
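The significance of a Pearson correlation is commonly assessed with the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which follows a t-distribution with $n - 2$ degrees of freedom under the null hypothesis of no correlation. The sketch below recomputes `r` by hand from the assignment data and derives this statistic; it is an illustration of the formula, not a replacement for `pearsonr`.

```python
import math

age = [25, 32, 40, 50, 60]
income = [40_000, 55_000, 65_000, 70_000, 85_000]
n = len(age)

mean_age = sum(age) / n
mean_income = sum(income) / n

# Sum of cross-products and sums of squares around the means.
cov = sum((a - mean_age) * (b - mean_income) for a, b in zip(age, income))
ss_age = sum((a - mean_age) ** 2 for a in age)
ss_income = sum((b - mean_income) ** 2 for b in income)

r = cov / math.sqrt(ss_age * ss_income)

# Test statistic for H0: no linear correlation, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, t = {t_stat:.2f}, degrees of freedom = {n - 2}")
```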

#### Scenario (b):
The store owner wants to check whether there is a monotonic relationship between
Age and Customer Satisfaction Score. Which statistical test should be applied?
Compute the test statistic and interpret the result.

#### Answer
Because the question asks about a monotonic relationship between Age and Customer Satisfaction Score, we use `Spearman’s Rank Correlation`.

We use some Python code to make the calculation easier:

In [73]:
from scipy.stats import spearmanr

# Sample Data in Assignment: 
age = [25, 32, 40, 50, 60]
satisfaction = [8, 6, 7, 5, 3]

rho, p_value = spearmanr(age, satisfaction)
print(f"Spearman's ρ: {rho:.2f}, p-value: {p_value:.4f}")

Spearman's ρ: -0.90, p-value: 0.0374


#### Interpretation
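Since `ρ = -0.90` is close to -1, there is a strong negative monotonic relationship between age and satisfaction: older customers tend to report lower satisfaction. The p-value of 0.0374 is below 0.05, so the relationship is statistically significant.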
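When there are no tied values, Spearman's coefficient can also be obtained from the rank differences with $\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$. The following sketch applies this formula to the same data; the `ranks` helper is written here for illustration and assumes no ties.

```python
age = [25, 32, 40, 50, 60]
satisfaction = [8, 6, 7, 5, 3]

def ranks(values):
    """Ranks from 1 (smallest) to n (largest); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Squared differences between the rank of each age and the rank of its satisfaction score.
d_squared = sum((ra - rs) ** 2 for ra, rs in zip(ranks(age), ranks(satisfaction)))

n = len(age)
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(f"sum of d^2 = {d_squared}, rho = {rho:.2f}")  # sum of d^2 = 38, rho = -0.90
```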

### Exercise 4
What is the key difference between the Mann-Whitney U test and the Wilcoxon Signed-Rank test? Suppose you are testing two marketing strategies’ effectiveness, but the data is non-normally distributed. Which test would you use?


#### Tests
- **[Mann-Whitney U test](https://datatab.net/tutorial/mann-whitney-u-test)**: The Mann-Whitney U-Test tests whether there is a difference between two samples. To determine if there is a difference between two samples, the rank sums of the two samples are used rather than the means as in the t-test for **independent samples**. The Mann-Whitney U test is thus the non-parametric counterpart to the t-test.
- **[Wilcoxon Signed-Rank test](https://datatab.net/tutorial/wilcoxon-test)**: The Wilcoxon test (Wilcoxon signed-rank test) determines whether two **dependent groups** differ significantly from each other. To do this, the Wilcoxon test uses the ranks of the paired differences instead of the mean values. The Wilcoxon test is thus the non-parametric counterpart to the paired-samples t-test.


#### Comparison
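1) Both tests are non-parametric alternatives to t-tests, meaning they don’t assume normality.
2) The Mann-Whitney U test compares two independent groups, whereas the Wilcoxon signed-rank test compares two dependent (paired) groups, for example the same runners timed once in the morning (group A) and again at night (group B).
3) Both of them use ranks rather than raw values to determine the differences.
4) For the two marketing strategies, the choice depends on the design: if each strategy was tested on a different set of customers, the samples are independent and the Mann-Whitney U test applies; if the same customers were exposed to both strategies, the Wilcoxon signed-rank test applies.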
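The sketch below shows how the two test statistics are built from ranks. All numbers are hypothetical, and the helper functions are simplified illustrations that assume no ties and no zero differences; in practice `scipy.stats.mannwhitneyu` and `scipy.stats.wilcoxon` would also provide p-values.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: number of (a_i, b_j) pairs with a_i > b_j (ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def wilcoxon_signed_rank(before, after):
    """Return (W+, W-): rank sums of positive and negative paired differences.

    Assumes no zero differences and no tied absolute differences.
    """
    diffs = sorted((y - x for x, y in zip(before, after)), key=abs)
    w_plus = sum(rank for rank, d in enumerate(diffs, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(diffs, start=1) if d < 0)
    return w_plus, w_minus

# Independent design: two different sets of customers (hypothetical sales figures).
strategy_a = [12, 15, 11, 19, 14, 13]
strategy_b = [18, 21, 17, 16, 22, 20]
u_a = mann_whitney_u(strategy_a, strategy_b)
u_b = len(strategy_a) * len(strategy_b) - u_a

# Paired design: the same six customers measured under both strategies (hypothetical).
before = [12, 15, 11, 19, 14, 13]
after = [15, 20, 18, 17, 23, 14]
w_plus, w_minus = wilcoxon_signed_rank(before, after)

print(f"Mann-Whitney: U_a = {u_a}, U_b = {u_b}")
print(f"Wilcoxon signed-rank: W+ = {w_plus}, W- = {w_minus}")
```

The Mann-Whitney statistic compares every observation in one group with every observation in the other, while the Wilcoxon statistic only looks at within-pair differences, which is why it requires paired data.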

### Exercise 5
We have collected a dataset with three numerical features and one categorical feature called group. The group feature represents three different experimental conditions (1, 2, or 3). Our goal is to determine whether the groups significantly differ given the numerical feature X.

#### Test
Before we dive into each scenario, let's look at a summary of some relevant tests:

- **[ANOVA](https://datatab.net/tutorial/anova)**: Analysis of variance (ANOVA) tests whether statistically significant differences exist between more than two samples. For this purpose, the means and variances of the respective groups are compared with each other. In contrast to the t-test, which tests between two samples, ANOVA tests between more than two groups. Because this test is parametric, it assumes **normally distributed data and homogeneity of variances**. It is used for independent groups.
- **[Kruskal-Wallis](https://datatab.net/tutorial/kruskal-wallis-test)**: The Kruskal-Wallis test (H test) is a non-parametric statistical test used to compare three or more independent groups to determine if there are statistically significant differences between them. Because this test is non-parametric, **it does not strictly require normality or homogeneity of variances**. It is also used for independent groups.
- **Shapiro-Wilk Test**: The Shapiro-Wilk test is a hypothesis test applied to a sample, with the null hypothesis that the sample was drawn from a normal distribution. We will use it later in this exercise. Note that there are other tests for normality (e.g. Kolmogorov-Smirnov Test, Anderson-Darling Test), but since the exercise does not require them, we skip them.
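To show what "comparing means and variances" means for ANOVA, here is a minimal sketch of the one-way F statistic, the ratio of between-group to within-group variance. It uses the feature X values from this exercise and assumes three consecutive observations per group, as in the dataset used later in this exercise. The helper `one_way_anova_f` is written for illustration; `scipy.stats.f_oneway` would also return the p-value.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group variance / within-group variance."""
    all_values = [v for g in groups for v in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Feature X, assuming the grouping 1, 1, 1, 2, 2, 2, 3, 3, 3.
x_by_group = [
    [10.2, 10.8, 11.0],
    [18.5, 17.9, 18.2],
    [30.1, 29.8, 30.3],
]

f_stat = one_way_anova_f(x_by_group)
print(f"F = {f_stat:.1f} with (2, 6) degrees of freedom")
```

A very large F value means the group means are far apart relative to the spread inside each group, which is the situation in which ANOVA reports a small p-value.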

#### Scenario (a)
We want to see whether the groups differ given this feature X or not. Now, we should choose between these two tests (ANOVA or Kruskal-Wallis). In your opinion, what should we check (test) to help us choose between these two tests? Once you have made your choice, apply your test. (Consider a p-value less than 0.05 as significant.)

#### Answer
The question asks us to check whether the groups differ, and we have two options: ANOVA and Kruskal-Wallis. From the summary above, their biggest difference is that ANOVA is parametric (it assumes normality and homogeneity of variances) while Kruskal-Wallis is non-parametric, so we run a Shapiro-Wilk test on the data to decide which test to use.

We use some Python packages to make the calculation easier:

In [108]:
from scipy.stats import shapiro

feature_x = [10.2, 10.8, 11.0, 18.5, 17.9, 18.2, 30.1, 29.8, 30.3]
feature_y = [7.8, 8.1, 7.5, 6.9, 7.3, 6.8, 6.2, 6.5, 6.0]
feature_z = [5.4, 5.2, 5.8, 6.2, 6.0, 5.9, 7.0, 7.1, 6.9]

# Shapiro-Wilk test for normality
_, p1 = shapiro(feature_x)
_, p2 = shapiro(feature_y)
_, p3 = shapiro(feature_z)

print(f"Normality p-values: {p1:.3f}, {p2:.3f}, {p3:.3f}")

Normality p-values: 0.048, 0.891, 0.387


#### Interpretation
Since the p-value for feature X (0.048) is less than 0.05, we reject the normality assumption for this feature. When the normality assumption is violated, the validity of the ANOVA test becomes questionable.

Therefore, the non-parametric `Kruskal-Wallis` test is more appropriate in this scenario because it does not assume normality of the data.

#### Scenario (b)
Based on the result of (a), apply the appropriate Statistical Test (ANOVA or
Kruskal-Wallis) to determine if there is a significant difference between the groups.
Which one would you use and why?


#### Answer 
The answer follows from the previous result: since normality is rejected for feature X, we use the Kruskal-Wallis test. Because we want to compare the groups on feature X, the samples passed to the test are the X values of each group (three consecutive observations per group). We again use Python for the calculation:

In [120]:
from scipy.stats import kruskal

# Feature X split by experimental condition (group labels 1, 1, 1, 2, 2, 2, 3, 3, 3)
x_group_1 = [10.2, 10.8, 11.0]
x_group_2 = [18.5, 17.9, 18.2]
x_group_3 = [30.1, 29.8, 30.3]

h_stat, p_kruskal = kruskal(x_group_1, x_group_2, x_group_3)
print(f"Kruskal-Wallis p-value: {p_kruskal:.4f}")

Kruskal-Wallis p-value: 0.0273


#### Interpretation 
Since the p-value (0.0273) is less than the significance level of 0.05, we conclude that there is a statistically significant difference between the groups for the numerical feature X.

#### Scenario (c)
Assume the normality assumption holds. If you wanted to check whether the groups
also significantly differ given all three features (X, Y and Z), what statistical test
could have been used?

#### Answer
When we have multiple continuous dependent variables (in this case X, Y, and Z) and a single categorical independent variable (the group), and the assumption of normality holds, we would typically use Multivariate Analysis of Variance (**[MANOVA](https://www.ibm.com/docs/sl/spss-statistics/beta?topic=statistics-multivariate-analysis-variance-manova)**).

MANOVA tests whether the mean vectors of the dependent variables differ across the groups. It is particularly useful when the dependent variables might be correlated, as it accounts for their interrelationships while testing for overall group differences.

#### Scenario (d)
For feature selection, you can apply ANOVA or Kruskal-Wallis to each feature individually to test if its distribution significantly differs across groups. If a feature yields a p-value below 0.05, it suggests that the feature varies across groups and may be a useful predictor. Use ANOVA when the data are normally distributed, or Kruskal-Wallis when the normality assumption is violated.

In [133]:
import pandas as pd
from scipy.stats import f_oneway, kruskal

# Sample dataset with a 'group' column and three features X, Y, and Z
data = {
    'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'X': [10.2, 10.8, 11.0, 18.5, 17.9, 18.2, 30.1, 29.8, 30.3],
    'Y': [7.8, 8.1, 7.5, 6.9, 7.3, 6.8, 6.2, 6.5, 6.0],
    'Z': [5.4, 5.2, 5.8, 6.2, 6.0, 5.9, 7.0, 7.1, 6.9]
}
df = pd.DataFrame(data)

groups = df['group'].unique()

print("ANOVA results:")
for feature in ['X', 'Y', 'Z']:
    group_data = [df[df['group'] == g][feature].values for g in groups]
    stat, p = f_oneway(*group_data)
    print(f"Feature {feature}: p-value = {p:.4f}")

print("\nKruskal-Wallis results:")
for feature in ['X', 'Y', 'Z']:
    group_data = [df[df['group'] == g][feature].values for g in groups]
    stat, p = kruskal(*group_data)
    print(f"Feature {feature}: p-value = {p:.4f}")


ANOVA results:
Feature X: p-value = 0.0000
Feature Y: p-value = 0.0013
Feature Z: p-value = 0.0003

Kruskal-Wallis results:
Feature X: p-value = 0.0273
Feature Y: p-value = 0.0273
Feature Z: p-value = 0.0273


#### Interpretation 
Both tests indicate that features X, Y, and Z differ significantly across the groups, since all p-values are below 0.05. This supports the conclusion that these features are useful for predicting group membership.
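The three Kruskal-Wallis p-values above are identical because the test uses only ranks, and in every feature the three groups are completely separated, so the rank sums per group are the same set of values. The sketch below recomputes the H statistic by hand to show this. The helper `kruskal_wallis_h` is an illustration that assumes no tied values, and the p-value uses the fact that a chi-square distribution with 2 degrees of freedom has survival function $e^{-H/2}$.

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic; assumes no tied values in the pooled data."""
    pooled = sorted(v for g in groups for v in g)
    rank_of = {v: rank for rank, v in enumerate(pooled, start=1)}
    n_total = len(pooled)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

features = {
    "X": [10.2, 10.8, 11.0, 18.5, 17.9, 18.2, 30.1, 29.8, 30.3],
    "Y": [7.8, 8.1, 7.5, 6.9, 7.3, 6.8, 6.2, 6.5, 6.0],
    "Z": [5.4, 5.2, 5.8, 6.2, 6.0, 5.9, 7.0, 7.1, 6.9],
}

results = {}
for name, values in features.items():
    # Groups 1, 2, 3 are the first, middle, and last three observations.
    groups = [values[0:3], values[3:6], values[6:9]]
    h = kruskal_wallis_h(groups)
    p = math.exp(-h / 2)  # chi-square survival function for 3 - 1 = 2 degrees of freedom
    results[name] = (h, p)
    print(f"Feature {name}: H = {h:.2f}, p-value = {p:.4f}")
```

With only three observations per group, 0.0273 is the smallest p-value this chi-square approximation can produce, so the Kruskal-Wallis results here cannot rank the features by strength of separation the way the ANOVA p-values do.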