In [None]:
import sys
sys.path.append('/Hypothesis 4/')

from config import files_directory

import pandas as pd

In [2]:
# Import transformed raw files

df_table6 = pd.read_csv(f"{files_directory}/df_table6_age.csv")
df_table10 = pd.read_csv(f"{files_directory}/df_table10_age.csv")

## Statistical Analysis: Age, Sex and Transmission Category

### 1. Age Group Hypothesis:
- **Null Hypothesis (H₀):** Age groups 15-19, 20-24, and 25-29 are **not** more likely to get diagnosed with HIV compared to other age groups.
- **Alternative Hypothesis (H₁):** Age groups 15-19, 20-24, and 25-29 are **more** likely to get diagnosed with HIV compared to other age groups.
- **Significance Level:** 0.05 (probability threshold for rejecting the null hypothesis).
- **Statistical Method:** Chi-square statistic
  - Categorical data: The age groups and diagnoses are aggregated counts of categorical variables, so this test is well suited to determining whether there is a significant association between them.
  - Independence: The chi-square test of independence evaluates whether the observed diagnosis counts differ significantly from the counts we would expect if there were no association between age group and diagnosis rate.

- [Reference](https://dwstockburger.com/Introbook/sbk22.htm#:~:text=Hypothesis%20Testing%20with%20Contingency%20Tables&text=That%20is%2C%20a%20statistic%20is,called%20the%20chi%2Dsquare%20statistic.)


### Chi-Square (χ²) Formula:

The chi-square statistic is calculated as:

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

Where:
- $O_i$ = Observed frequency in each category
- $E_i$ = Expected frequency in each category, assuming the null hypothesis is true
- The sum is taken over all categories

In [3]:
from scipy.stats import chi2_contingency

In [4]:
# Focus on the relevant age groups for the age hypothesis
age_groups_interest = ['15 - 19', '20 - 24', '25 - 29']

In [6]:
# Group by age group and sum diagnoses in df_table6 (to address the age hypothesis)
age_group_summary = df_table6.groupby('Age Group')['Number of diagnosis'].sum()

In [7]:
# Define the subset for age groups of interest and everything else
age_groups_interest_data = age_group_summary.loc[age_groups_interest].sum()  # Sum of interest groups
other_age_groups_data = age_group_summary.drop(age_groups_interest).sum()    # Sum of other age groups
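
The split into "groups of interest" and "everything else" can be checked on a small example. The sketch below uses made-up counts (the `toy` frame and its values are hypothetical; only the column names and age-group labels come from this notebook) and confirms that the two sums partition the overall total.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads df_table6 from a CSV file.
toy = pd.DataFrame({
    "Age Group": ["15 - 19", "15 - 19", "20 - 24", "25 - 29", "30 - 34", "35 - 39"],
    "Number of diagnosis": [10, 5, 20, 30, 25, 15],
})

age_groups_interest = ["15 - 19", "20 - 24", "25 - 29"]

# Sum diagnoses per age group (repeated age groups collapse into one total).
summary = toy.groupby("Age Group")["Number of diagnosis"].sum()

interest_total = summary.loc[age_groups_interest].sum()  # groups of interest
other_total = summary.drop(age_groups_interest).sum()    # every other group

# The two pieces should add up to the overall total.
assert interest_total + other_total == toy["Number of diagnosis"].sum()
print(interest_total, other_total)
```

Note that `.loc` raises a `KeyError` if any label in `age_groups_interest` is missing from the index, so the labels must match the CSV exactly, including the spaces around the hyphen.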

**Contingency Table**

|                            | Diagnosed in Age Groups of Interest | Diagnosed in Other Age Groups | Total Diagnosed |
|----------------------------|-------------------------------------|------------------------------|----------------|
| **HIV Diagnosed**           | 141,747                            | 233,549                      | 375,296        |
| **Total Diagnosed**         | 375,296                            | 375,296                      | 750,592        |

In [8]:
# Create a contingency table for the chi-square test
contingency_table = [
    [age_groups_interest_data, other_age_groups_data],
    [df_table6['Number of diagnosis'].sum(), df_table6['Number of diagnosis'].sum()]
]

In [9]:
contingency_table

[[141747, 233549], [375296, 375296]]

In [10]:
# Perform the chi-square test
# p: p-value
# dof: degrees of freedom
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the results
chi2, p, dof, expected

(15070.299428358769,
 0.0,
 1,
 array([[172347.66666667, 202948.33333333],
        [344695.33333333, 405896.66666667]]))

**Observed vs Expected Values**

| Age Group Category             | Observed Value | Expected Value |
|--------------------------------|----------------|----------------|
| **Age Groups of Interest**      | 141,747        | 172,348        |
| **Other Age Groups**            | 233,549        | 202,948        |
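
The expected values follow directly from the row and column totals of the contingency table, and the chi-square formula can be applied to them by hand. The sketch below reproduces this with NumPy and compares it against SciPy. One assumption to flag: `correction=False` is passed here so that SciPy computes the plain Pearson statistic; the default call used in this notebook applies Yates' continuity correction for a 2x2 table, so its statistic differs slightly from the uncorrected one.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table in this notebook.
observed = np.array([[141747, 233549],
                     [375296, 375296]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected count per cell: row total * column total / grand total.
expected_manual = row_totals @ col_totals / grand_total

# Pearson chi-square statistic, without continuity correction.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False turns off Yates' continuity correction, which SciPy
# applies by default when the table has one degree of freedom.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(
    observed, correction=False
)

print(expected_manual.round(2))
print(chi2_manual, chi2_scipy)
```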


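The p-value is reported as 0.0, meaning it is too small to represent in floating point, and is far below the 0.05 significance level, so we reject the null hypothesis. Because the chi-square test is non-directional, this result shows that the distribution of diagnoses differs between age groups 15-19, 20-24, and 25-29 and the other age groups; the direction of the difference has to be read from the observed and expected counts above.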

### 2. Sex Hypothesis:
- **Null Hypothesis (H₀):** Males are **not** more likely to get diagnosed with HIV than females.
- **Alternative Hypothesis (H₁):** Males are **more** likely to get diagnosed with HIV than females.
- **Data Limitations:** The only data published by the government is aggregated from 1983 to 2024. This is not ideal: with cases broken down by sex and year, we could analyze HIV diagnoses relative to the population size of each sex in each year. Aggregating over such a long period averages out fluctuations in population growth, health policies, medical advancements, and socioeconomic factors that could have influenced diagnosis rates.
- **Data Enhancements:** To mitigate this limitation, we will compare the proportions of diagnoses between males and females within each age group. This will allow us to assess whether males account for a disproportionate share of diagnoses compared to females.
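
A minimal sketch of the within-age-group comparison described above, assuming `df_table6` has one row per age group and sex. The counts and the `"Male"`/`"Female"` labels below are hypothetical; only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical counts, for illustration only; column names mirror df_table6.
toy = pd.DataFrame({
    "Age Group": ["15 - 19", "15 - 19", "20 - 24", "20 - 24"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Number of diagnosis": [30, 10, 45, 15],
})

# One row per age group, one column per sex.
counts = toy.pivot_table(
    index="Age Group",
    columns="Sex",
    values="Number of diagnosis",
    aggfunc="sum",
    fill_value=0,
)

# Share of each sex within an age group (each row sums to 1).
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

These shares describe how diagnoses are split between the sexes; they are not diagnosis rates, since no population denominators are available.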


To-do:
- Complete this section with all the information
- Add the data limitations
- Do the proportional comparison

### 3. Transmission Category Hypothesis:
- **Null Hypothesis (H₀):** Sexual transmission is **not** the most common way to contract HIV compared to other modes of transmission.
- **Alternative Hypothesis (H₁):** Sexual transmission is the **most common** way to contract HIV compared to other modes of transmission.

In [None]:
# Group by sex and sum diagnoses (to address the sex hypothesis)
sex_group_summary = df_table6.groupby('Sex')['Number of diagnosis'].sum()

# Group by transmission category in df_table10 and sum diagnoses
# (to address the transmission category hypothesis)
transmission_summary = df_table10.groupby('Transmission Category')['Number of diagnosis'].sum()
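
One way to read a summary like `transmission_summary` is to convert the counts into shares and look at the largest one. The sketch below does this on made-up data; the category names and counts are hypothetical, and only the column names come from this notebook. It is a descriptive check, not a significance test.

```python
import pandas as pd

# Hypothetical categories and counts; only the column names come from the notebook.
toy = pd.DataFrame({
    "Transmission Category": ["Sexual", "Sexual", "Injection drug use", "Perinatal"],
    "Number of diagnosis": [60, 20, 15, 5],
})

summary = toy.groupby("Transmission Category")["Number of diagnosis"].sum()

# Share of all diagnoses per category, largest first.
shares = (summary / summary.sum()).sort_values(ascending=False)

# Label of the category with the largest share.
most_common = shares.idxmax()
print(shares)
print(most_common)
```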