### Report: summary of steps and code used.

I divided the EDA project into three main steps:

1) Preprocessing
2) Cleaning
3) Analysis



### 1. Preprocessing of the datasets

The datasets that have been used are of two types:

- **Health surveys** of individuals
- Results of the **analysis of certain biomarkers** (blood analysis, immune cells, etc.).

Categorical variables were extracted from the health surveys and correlated with the numerical biomarker data.

First, I processed the **health survey** tables (categorical). Since they had around 7000 columns, I kept only the ones needed to address the hypotheses (age, gender) and those that seemed interesting to analyze (diabetes, cancer, heart attack, dementia, etc.). In these tables, I renamed the columns (originally codes) and replaced numeric answer codes with YES/NO responses, among other changes.
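The column selection, renaming, and recoding described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column codes (`C001`, `C002`, `C999`), the `ID` column, the 1/5 answer coding, and the helper name `preprocess_survey` are all placeholders chosen for the example.

```python
import pandas as pd

# Hypothetical raw survey extract: the column codes and the 1/5 coding
# are placeholders, not the real survey codebook.
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "C001": [54, 67, 71],   # age
    "C002": [1, 5, 1],      # diabetes: 1 = yes, 5 = no (assumed coding)
    "C999": [0, 0, 0],      # stands in for the many unused columns
})

RENAME = {"C001": "age", "C002": "diabetes"}
YES_NO = {1: "YES", 5: "NO"}


def preprocess_survey(df, rename, yes_no_cols, id_col="ID"):
    """Keep the selected columns, rename them, and recode yes/no answers."""
    out = df[[id_col, *rename]].rename(columns=rename)
    for col in yes_no_cols:
        # Codes missing from the mapping become NaN, so they show up
        # later as nulls instead of being silently kept.
        out[col] = out[col].map(YES_NO)
    return out


survey_clean = preprocess_survey(raw, RENAME, ["diabetes"])
print(survey_clean)
```

Mapping through an explicit dictionary keeps the recoding auditable: any code that was not anticipated turns into a missing value rather than a wrong label.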

Next, I worked with the **biomarker analysis** tables (numerical). Again, I selected the most relevant columns in each table and renamed them (originally codes).

All preprocessed tables were saved with the suffix "_clean" in the ./data folder.

### 2. Cleaning the datasets

In this part of the exploratory data analysis, I visualized the variables and identified possible null values and outliers in order to clean the data.

- **Categorical Datasets**

I started by visualizing the tables of health surveys:

1. I created the function **categorical_distribution**, which shows the distribution (relative frequency) of each categorical variable in these tables. After checking for errors in the data (for example, numeric codes that had not been cleaned earlier), I corrected the data in all columns.
2. I created the function **boxplot_with_outliers**, which takes only the numerical columns of a dataframe and draws them, with their outliers, as boxplot subplots. After examining the data, I cleaned the necessary outliers.
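As an illustration of the first step, relative frequencies per categorical column can be computed with `value_counts(normalize=True)`. The sketch below reuses the name `categorical_distribution` for readability, but it is an assumed reconstruction that returns tables instead of plots; the sample data, including the stray `"8"` code, is invented.

```python
import pandas as pd


def categorical_distribution(df):
    """Return the relative frequency of every value in each non-numeric column."""
    cat_cols = df.select_dtypes(exclude="number").columns
    return {
        # dropna=False keeps missing answers visible as their own category.
        col: df[col].value_counts(normalize=True, dropna=False)
        for col in cat_cols
    }


survey = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Female"],
    "diabetes": ["YES", "NO", "NO", "8"],  # "8" mimics a leftover numeric code
    "age": [54, 67, 71, 60],
})

freqs = categorical_distribution(survey)
for col, dist in freqs.items():
    print(dist, end="\n\n")
```

A leftover code such as `"8"` appears as an extra category with a small share, which is the kind of error this inspection is meant to reveal.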

All clean2 tables were saved with the suffix "_clean" in the ./data folder.

- **Numerical Datasets**

I continued visualizing the tables of biomarkers (blood/cells/etc):

As the data in all these tables were numerical, I created a general **boxplots** function that generates boxplot subplots, with outliers, for all numerical columns.

Despite the presence of some outliers, I did not remove them in case they could be correlated with age or other parameters.
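Outliers can be counted without removing them. The sketch below flags points using the 1.5 × IQR rule, which is the default whisker rule in matplotlib boxplots; whether the project's boxplots used that default is an assumption, and the function name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(df, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    return (num < q1 - k * iqr) | (num > q3 + k * iqr)


blood = pd.DataFrame({
    "glucose": [90, 95, 100, 105, 110, 400],
    "hdl": [50, 52, 54, 56, 58, 60],
})

mask = iqr_outlier_mask(blood)
print(mask.sum())  # number of flagged values per column

# The original table is left untouched, so the flagged rows remain
# available for the correlation analysis.
```

Keeping the mask separate from the data makes it possible to rerun an analysis with and without the flagged values later.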

I kept these tables with the suffix "_clean" (I did not save new ones).

### 3. Analysis

Once the data had been visualized and cleaned, I proceeded to the analysis.

The analysis focused mainly on a study of correlations.

All biological markers were correlated with:

- Age
- Gender (Male/Female)
- Age group (Young/Middle/Old)
- High blood pressure (Y/N)
- Diabetes (Y/N)
- Cancer (Y/N)
- Lung disease (Y/N)
- Heart condition (Y/N)
- Stroke (Y/N)
- Psychiatric problems (Y/N)
- Dementia (Y/N)
- Cholesterol (Y/N)


First, I built a table combining the results of the 2016 survey with the biological marker data collected that same year. I excluded the 2018 and 2020 surveys because there were no biological data for them.
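Combining the survey answers with the biomarker measurements amounts to a join on a respondent identifier. A minimal sketch, assuming both tables share a key column (here called `ID`, a hypothetical name) with one row per respondent:

```python
import pandas as pd

# Invented sample data; column names are placeholders.
survey_2016 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "age": [54, 67, 71, 60],
    "gender": ["Female", "Male", "Female", "Male"],
})
biomarkers_2016 = pd.DataFrame({
    "ID": [1, 2, 4, 5],
    "glucose": [90.0, 110.0, 101.0, 97.0],
})

# Inner join: keep only respondents present in both tables.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
merged = survey_2016.merge(
    biomarkers_2016, on="ID", how="inner", validate="one_to_one"
)
print(merged)
```

An inner join drops respondents who answered the survey but have no biomarker data (ID 3 here) and vice versa (ID 5), so every remaining row can enter a correlation.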

1. To correlate a main numerical variable (such as Age) with each of the biological markers, I created the function **scatter_plots_aggregated**, which calculates the regression coefficients and draws one scatterplot per marker, each with its regression line, as subplots.

2. To obtain the regression results of point 1 as a table, I created the function **calculate_correlation_regression** and saved the calculations in ./data.

3. To correlate a main numerical variable (such as Age) with each of the biological markers while taking a categorical variable into account (e.g., Gender), I created the function **scatter_plots_aggregated_with_categorical**, which calculates the regression coefficients for each category and draws the corresponding scatterplots and regression lines as subplots.

4. As in point 2, to obtain the regression results of point 3 as a table, I created the function **calculate_correlation_regression_with_categorical** and saved the calculations in ./data.
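The tabular part of these steps (points 2 and 4) can be sketched as below. The function names are borrowed from the report, but the bodies are an assumed reconstruction: they use Pearson's r from pandas and a least-squares line from `numpy.polyfit`, which may differ from the statistics the project computed. The sample data is invented so that the two genders have opposite trends.

```python
import numpy as np
import pandas as pd


def calculate_correlation_regression(df, x, markers):
    """Pearson r plus least-squares slope/intercept of each marker against x."""
    rows = []
    for marker in markers:
        pair = df[[x, marker]].dropna()
        if len(pair) < 3 or pair[x].nunique() < 2:
            continue  # too little data for a meaningful fit
        slope, intercept = np.polyfit(pair[x], pair[marker], deg=1)
        rows.append({
            "marker": marker,
            "n": len(pair),
            "r": pair[x].corr(pair[marker]),
            "slope": slope,
            "intercept": intercept,
        })
    return pd.DataFrame(rows)


def calculate_correlation_regression_with_categorical(df, x, markers, by):
    """Same statistics, computed separately for each level of column `by`."""
    parts = []
    for level, group in df.groupby(by):
        stats = calculate_correlation_regression(group, x, markers)
        if not stats.empty:
            stats.insert(0, by, level)
            parts.append(stats)
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()


df = pd.DataFrame({
    "age": [50, 60, 70, 80, 50, 60, 70, 80],
    "gender": ["Female"] * 4 + ["Male"] * 4,
    "marker_a": [101, 121, 141, 161, 101, 121, 141, 161],  # 2 * age + 1
    "marker_b": [1, 2, 3, 4, 4, 3, 2, 1],  # rises for Female, falls for Male
})

overall = calculate_correlation_regression(df, "age", ["marker_a", "marker_b"])
by_gender = calculate_correlation_regression_with_categorical(
    df, "age", ["marker_b"], by="gender"
)
print(overall)
print(by_gender)
```

In this toy data `marker_b` shows no overall correlation with age, yet a perfect positive one for women and a perfect negative one for men, which is why splitting by a categorical variable can reveal relationships that the pooled analysis hides.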

After obtaining all the correlation visualizations and their statistics, I created the **filter_and_extract_values** function to extract from each dataset the correlation values > 0.15 or < -0.15, see which of them are relevant to our hypotheses, and check whether any other parameter (gender, health problems, etc.) could be of interest.
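The threshold filter amounts to keeping rows whose absolute correlation exceeds 0.15. A minimal sketch, assuming the correlation table has one row per marker and a column named `r` (both assumptions; the values are invented):

```python
import pandas as pd


def filter_and_extract_values(corr_table, threshold=0.15, r_col="r"):
    """Keep rows whose correlation is above +threshold or below -threshold."""
    strong = corr_table[corr_table[r_col].abs() > threshold]
    # Strongest relationships first, regardless of sign.
    return strong.sort_values(r_col, key=lambda s: s.abs(), ascending=False)


corr_table = pd.DataFrame({
    "marker": ["glucose", "hdl", "crp", "albumin"],
    "r": [0.22, -0.05, 0.15, -0.31],
})

selected = filter_and_extract_values(corr_table)
print(selected)
```

Because the comparison is strict, a value of exactly 0.15 (`crp` here) is excluded, matching the "> 0.15 or < -0.15" rule.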

Next, I combined all the correlation data into a single table and, using the function **plot_all_variables_across_groups**, generated line plots to observe whether the existing correlations became stronger or weaker with any specific parameter.
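The table behind such line plots can be shaped with a pivot: one row per group, one column per marker. The sketch below only prepares the data and omits the plotting; the long-format layout, the column names, and the correlation values are assumptions made for the example.

```python
import pandas as pd

# Invented combined correlation table in long format.
corr_long = pd.DataFrame({
    "group": ["Old", "Young", "Middle", "Old", "Young", "Middle"],
    "marker": ["glucose"] * 3 + ["hdl"] * 3,
    "r": [0.30, 0.10, 0.20, -0.25, -0.05, -0.15],
})

ORDER = ["Young", "Middle", "Old"]

# One row per group (in a meaningful order), one column per marker.
wide = corr_long.pivot(index="group", columns="marker", values="r").reindex(ORDER)

# Change in correlation strength (absolute r) from the first to the last group.
trend = wide.abs().iloc[-1] - wide.abs().iloc[0]

print(wide)
print(trend)
```

With matplotlib installed, `wide.plot(marker="o")` would draw one line per marker across the groups. Comparing absolute values treats a correlation moving from -0.05 to -0.25 as strengthening, just like one moving from 0.10 to 0.30.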

Finally, I created the **find_max_min** function to find max and min values along with their corresponding columns and generate a dataframe with all the data.
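A minimal sketch of such a lookup, assuming a table with one row per marker and one numeric column per group; the real function may be organized differently, and the values below are invented:

```python
import pandas as pd


def find_max_min(df):
    """For each row, report the largest and smallest value and the column holding it."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({
        "max_value": num.max(axis=1),
        "max_column": num.idxmax(axis=1),
        "min_value": num.min(axis=1),
        "min_column": num.idxmin(axis=1),
    })


corr = pd.DataFrame(
    {"Young": [0.10, -0.05], "Middle": [0.20, -0.15], "Old": [0.30, -0.25]},
    index=["glucose", "hdl"],
)

extremes = find_max_min(corr)
print(extremes)
```

Note that the maximum and minimum are signed here: for a negatively correlated marker, the minimum is the strongest correlation.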

Once the results had been obtained, I proceeded to prepare the presentation of the findings.
