# Statistical Analysis and Advanced Visualization of FluPRINT Data

## Introduction

This notebook examines the FluPRINT dataset using statistical tests and data visualization. The aim is to identify relationships, trends, and patterns that can inform predictive modeling of vaccine responses.

---

## Step 1: Setup and Data Loading

First, we'll import the required libraries and load our preprocessed data.

In [None]:
import csv
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, ttest_ind, chi2_contingency, f_oneway, levene

## Data Loading

We begin by loading the preprocessed FluPRINT dataset (`aggregated_participants.csv`). This dataset was already cleaned and filtered in the 'data_aquisition_preprocessing' notebook, where unnecessary columns were removed.

In [3]:
# Load the preprocessed dataset
data_path = r"C:\Users\OneDrive\Documents\Applied Data science\FluPRINT_database\FluPRINT_filtered_data\aggregated_participants.csv"
fluprint_data = pd.read_csv(data_path)

# Display basic information about the dataset
display(fluprint_data.head())
print(fluprint_data.info())

Unnamed: 0,donor_id,study_id,gender,race,visit_year,visit_type_hai,visit_age,vaccine,geo_mean,d_geo_mean,vaccine_response
0,49,29,Male,Caucasian,2014,pre,9.71,1.0,269.09,1.0,0.0
1,50,29,Female,Caucasian,2014,pre,12.31,1.0,320.0,1.0,0.0
2,51,29,Male,Other,2014,pre,9.86,1.0,134.54,1.0,0.0
3,53,29,Female,Asian,2014,pre,9.01,4.0,47.57,2.0,0.0
4,54,29,Male,Asian,2014,pre,10.47,4.0,67.27,2.0,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   donor_id          293 non-null    int64  
 1   study_id          293 non-null    int64  
 2   gender            293 non-null    object 
 3   race              292 non-null    object 
 4   visit_year        293 non-null    int64  
 5   visit_type_hai    293 non-null    object 
 6   visit_age         293 non-null    float64
 7   vaccine           293 non-null    float64
 8   geo_mean          293 non-null    float64
 9   d_geo_mean        293 non-null    float64
 10  vaccine_response  293 non-null    float64
dtypes: float64(5), int64(3), object(3)
memory usage: 25.3+ KB
None


### Gender Distribution

In [6]:
gender_counts = fluprint_data['gender'].value_counts()
print("Gender Distribution:")
print(gender_counts)

Gender Distribution:
gender
Female    172
Male      121
Name: count, dtype: int64


### Race Distribution

In [7]:
race_counts = fluprint_data['race'].value_counts()
print("\nRace Distribution:")
print(race_counts)


Race Distribution:
race
Caucasian                    155
Other                         85
Asian                         45
Black or African American      4
Hispanic/Latino                3
Name: count, dtype: int64
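The race counts above sum to 292, one fewer than the 293 rows, which matches the single missing `race` value reported by `info()`. The following is a minimal sketch of how `value_counts(dropna=False)` would surface that missing entry. It runs on a synthetic Series rebuilt from the printed counts, not on the real file, and the name `race_demo` is illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in rebuilt from the counts printed above, plus one missing value.
printed_counts = {
    "Caucasian": 155,
    "Other": 85,
    "Asian": 45,
    "Black or African American": 4,
    "Hispanic/Latino": 3,
}
labels = [name for name, n in printed_counts.items() for _ in range(n)]
race_demo = pd.Series(labels + [np.nan], name="race")

# dropna=False keeps NaN as its own category instead of silently dropping it.
race_counts_with_missing = race_demo.value_counts(dropna=False)
print(race_counts_with_missing)

# normalize=True reports shares instead of raw counts.
race_shares = race_demo.value_counts(normalize=True, dropna=False).round(3)
print(race_shares)
```

Two of the categories contain fewer than five participants, which is worth keeping in mind when reading any per-race comparison later in the notebook.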


## Statistics

In [50]:
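# NOTE: `fluprint_filtered` is not defined in this notebook; it is presumably the
# filtered DataFrame from the preprocessing notebook and must be loaded first.
participants_numbers = fluprint_filtered["donor_id"].nunique()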
print(f"Number of participants across the studies: {participants_numbers}")

Number of participants across the studies: 740


In [11]:
# Select the columns you're interested in
columns_of_interest = ['geo_mean', 'd_geo_mean', 'vaccine_response', 'vaccine']

# Calculate the correlation matrix for these columns
correlation_matrix = fluprint_data[columns_of_interest].corr()

print("\nCorrelation Matrix:")
display(correlation_matrix)


Correlation Matrix:


Unnamed: 0,geo_mean,d_geo_mean,vaccine_response,vaccine
geo_mean,1.0,-0.199802,-0.111422,0.079247
d_geo_mean,-0.199802,1.0,0.292278,0.155443
vaccine_response,-0.111422,0.292278,1.0,0.07891
vaccine,0.079247,0.155443,0.07891,1.0
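A correlation matrix is often easier to read as a heatmap. The sketch below rebuilds the matrix from the values printed above (so it does not need the data file) and draws it with seaborn. The colour map, figure size, and the name `corr_demo` are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Values copied from the correlation matrix printed above.
cols = ["geo_mean", "d_geo_mean", "vaccine_response", "vaccine"]
corr_demo = pd.DataFrame(
    [
        [1.0, -0.199802, -0.111422, 0.079247],
        [-0.199802, 1.0, 0.292278, 0.155443],
        [-0.111422, 0.292278, 1.0, 0.07891],
        [0.079247, 0.155443, 0.07891, 1.0],
    ],
    index=cols,
    columns=cols,
)

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] and centre it at 0 so weak correlations look weak.
sns.heatmap(corr_demo, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, center=0, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

In the notebook itself, passing `correlation_matrix` in place of `corr_demo` should give the same picture. Note that `vaccine` looks like a vaccine-type code, so its correlation coefficients may not be meaningful if the codes are categorical rather than ordered.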


### Chi-square test of independence

In [None]:
contingency_table = pd.crosstab(fluprint_data['gender'], fluprint_data['vaccine_response'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p_value}")

Chi-square statistic: 0.03443654723489459
p-value: 0.8527813002913769
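With a p-value of about 0.85, this test gives no evidence that vaccine response depends on gender. One detail worth knowing: for a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default (`correction=True`). The sketch below compares the corrected and uncorrected statistics and adds Cramér's V as an effect size. The counts are hypothetical, not FluPRINT numbers.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts (rows: gender, columns: response 0/1); NOT FluPRINT data.
table_demo = np.array([[90, 30],
                       [120, 50]])

# Default behaviour: Yates' continuity correction is applied because dof == 1.
chi2_yates, p_yates, dof, expected = chi2_contingency(table_demo)

# Same table without the correction.
chi2_plain, p_plain, _, _ = chi2_contingency(table_demo, correction=False)

# Cramér's V (effect size), computed from the uncorrected statistic.
n = table_demo.sum()
cramers_v = np.sqrt(chi2_plain / (n * (min(table_demo.shape) - 1)))

print(f"dof: {dof}")
print(f"With Yates correction:    chi2={chi2_yates:.4f}, p={p_yates:.4f}")
print(f"Without Yates correction: chi2={chi2_plain:.4f}, p={p_plain:.4f}")
print(f"Cramér's V: {cramers_v:.4f}")
```

The corrected statistic is never larger than the uncorrected one, so the default is the more conservative of the two.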


### One-way ANOVA

In [None]:
races = fluprint_data['race'].unique()
vaccine_responses = [fluprint_data[fluprint_data['race'] == race]['vaccine_response'] for race in races]
f_statistic, p_value = f_oneway(*vaccine_responses)
print(f"F-statistic: {f_statistic}")
print(f"p-value: {p_value}")

F-statistic: nan
p-value: nan


  f_statistic, p_value = f_oneway(*vaccine_responses)
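The `nan` result is most likely caused by the one missing `race` value. `unique()` returns `NaN` as a category, the comparison `race == NaN` matches no rows, and `f_oneway` returns `nan` when any group is empty. The sketch below reproduces the problem on a small synthetic frame and shows one possible fix, dropping missing categories before building the groups. All data and names here are made up for illustration.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic frame: three groups of ten plus one row with a missing race.
demo = pd.DataFrame({
    "race": ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + [np.nan],
    "vaccine_response": (
        [0.0] * 7 + [1.0] * 3      # group A
        + [0.0] * 5 + [1.0] * 5    # group B
        + [0.0] * 8 + [1.0] * 2    # group C
        + [0.0]                    # row with missing race
    ),
})

# Original pattern: NaN is one of the unique values, and `== NaN` selects nothing.
groups_with_nan = [
    demo.loc[demo["race"] == r, "vaccine_response"].to_numpy()
    for r in demo["race"].unique()
]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the empty-group warning for the demo
    f_bad, p_bad = f_oneway(*groups_with_nan)

# Possible fix: drop missing categories before building the groups.
groups_clean = [
    demo.loc[demo["race"] == r, "vaccine_response"].to_numpy()
    for r in demo["race"].dropna().unique()
]
f_ok, p_ok = f_oneway(*groups_clean)

print(f"With NaN category:    F={f_bad}, p={p_bad}")
print(f"Without NaN category: F={f_ok:.4f}, p={p_ok:.4f}")
```

Applied to the notebook, the equivalent change would be `fluprint_data['race'].dropna().unique()`. The resulting F-statistic is not shown here because it has not been computed on the real data.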


### Two-sample t-test
This test compares the mean vaccine response between male and female participants:

In [None]:
male_response = fluprint_data[fluprint_data['gender'] == 'Male']['vaccine_response']
female_response = fluprint_data[fluprint_data['gender'] == 'Female']['vaccine_response']
t_statistic, p_value = ttest_ind(male_response, female_response)
print(f"T-statistic: {t_statistic}")
print(f"p-value: {p_value}")

T-statistic: -0.3112968241472829
p-value: 0.7557978692308014


### Levene's test
To check the equal-variance assumption behind the t-test above:

In [None]:
stat, p_value = levene(male_response, female_response)
print(f"Levene's Test p-value: {p_value}")

Levene's Test p-value: 0.7557978692307444
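The Levene p-value is almost identical to the t-test p-value above, and that is probably not a coincidence. If `vaccine_response` is a 0/1 indicator (as the sample rows suggest) and the median is 0 in both groups, then the absolute deviations from the median are the raw values themselves. Levene's test with its default `center='median'` then reduces to a one-way ANOVA on the responses, which for two groups is equivalent to the pooled t-test. The sketch below demonstrates this on hypothetical 0/1 data. The group sizes (121 and 172) come from the gender counts above, but the responder splits are invented.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Hypothetical binary responses; the 0/1 splits are NOT taken from FluPRINT.
male_demo = np.array([0.0] * 90 + [1.0] * 31)     # 121 values, median 0
female_demo = np.array([0.0] * 125 + [1.0] * 47)  # 172 values, median 0

# Levene's test (default center='median').
_, levene_p = levene(male_demo, female_demo)

# Student's t-test (pooled variance, the ttest_ind default).
_, student_p = ttest_ind(male_demo, female_demo)

# Welch's t-test does not assume equal variances.
_, welch_p = ttest_ind(male_demo, female_demo, equal_var=False)

print(f"Levene p-value:  {levene_p:.6f}")
print(f"Student p-value: {student_p:.6f}")
print(f"Welch p-value:   {welch_p:.6f}")
```

In this setting Levene's test carries no information beyond the t-test itself, so it cannot serve as an independent check of the variance assumption. Welch's variant (`equal_var=False`) is one way to avoid relying on that assumption.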


### Pearson correlation

In [18]:
correlation, p_value = pearsonr(fluprint_data['visit_age'], fluprint_data['vaccine_response'])
print(f"Correlation coefficient: {correlation}")
print(f"p-value: {p_value}")

Correlation coefficient: 0.10360991493916716
p-value: 0.07661003088824983
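The coefficient of about 0.10 with a p-value of about 0.077 indicates a weak positive association between age and vaccine response that is not significant at the conventional 0.05 level. If `vaccine_response` is a 0/1 indicator, as the sample rows suggest, this Pearson coefficient is the same quantity as the point-biserial correlation. The sketch below checks that equivalence on synthetic data. The ages and responses are randomly generated and unrelated to FluPRINT.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Synthetic data: a continuous "age" and a binary "response" (illustrative only).
rng = np.random.default_rng(1)
age_demo = rng.uniform(1, 50, size=200)
response_demo = (rng.random(200) < 0.3).astype(float)

# Pearson correlation, as used in the cell above.
r_pearson, p_pearson = pearsonr(age_demo, response_demo)

# Point-biserial correlation: binary variable first, continuous variable second.
r_pb, p_pb = pointbiserialr(response_demo, age_demo)

print(f"Pearson:        r={r_pearson:.6f}, p={p_pearson:.6f}")
print(f"Point-biserial: r={r_pb:.6f}, p={p_pb:.6f}")
```

Because the two agree, the Pearson result above can be read as a comparison of mean age between responders and non-responders.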
