# Data Retrieval

### 1. Introduction

This section details the steps taken to obtain the "Alzheimer's Disease Dataset" from Kaggle, describes its features, and outlines the project’s goals. The dataset, provided by Rabie El Kharoua, contains tabular data with a binary target variable, `Diagnosis`, indicating the presence or absence of Alzheimer’s disease. The goal is to use this data for a binary classification task to predict Alzheimer’s disease using machine learning models.

### 2. Dataset Source
- **Dataset Name:** Alzheimer’s Disease Dataset
- **Source:** Kaggle, provided by Rabie El Kharoua
- **URL:** [https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset](https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset)
- **Access Date:** March 17, 2025
- **License:** The dataset is publicly available on Kaggle; its terms of use are stated on the dataset page. No additional permissions were required for access.
- **Rationale for Selection:** The dataset offers well-labeled tabular data for Alzheimer’s disease classification: 2,149 samples with a binary target variable, which suits a binary classification problem. Its public availability supports reproducibility, and its tabular format aligns with the project’s focus on traditional machine learning models.

### 3. Data Retrieval Process
The data retrieval process involved the following steps to acquire the dataset and perform an initial inspection:

#### 3.1. Downloading the Dataset
- **Method:** The dataset was downloaded directly from Kaggle using the provided URL.
- **Steps:**
  1. Navigated to the dataset page on Kaggle ([https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset](https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset)).
  2. Signed into a Kaggle account to access the download option (Kaggle requires authentication for dataset downloads).
  3. Clicked the "Download" button to retrieve the dataset as a ZIP file.
  4. Extracted the ZIP file to obtain the CSV file, named `alzheimers_disease_data.csv`.
- **Storage:** The extracted CSV file was stored in the project directory at `data/alzheimers_disease_data.csv` (a path relative to the project root), so that the loading code runs unchanged on any checkout of the project.
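
The extraction and storage steps above were done by hand, but they can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_csv` and the archive name `archive.zip` in the usage note are illustrative assumptions, not part of the original workflow.

```python
import zipfile
from pathlib import Path


def extract_csv(zip_path, dest_dir, member="alzheimers_disease_data.csv"):
    """Extract one CSV member from a downloaded ZIP archive into dest_dir.

    Returns the path of the extracted file. Raises FileNotFoundError if the
    archive does not contain the expected member.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        archive.extract(member, path=dest)
    return dest / member
```

Assuming the downloaded archive was saved as `archive.zip` (a hypothetical name), calling `extract_csv("archive.zip", "data")` would place the file at `data/alzheimers_disease_data.csv`.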

#### 3.2. Initial Data Inspection

- **Tool Used:** Python with the pandas library was used to load and inspect the dataset.

```python
import pandas as pd

df = pd.read_csv('data/alzheimers_disease_data.csv')
df.head()
```

| | PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | ... | MemoryComplaints | BehavioralProblems | ADL | Confusion | Disorientation | PersonalityChanges | DifficultyCompletingTasks | Forgetfulness | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | ... | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | XXXConfid |
| 1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | ... | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | XXXConfid |
| 2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | ... | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | XXXConfid |
| 3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | ... | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | XXXConfid |
| 4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | ... | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | XXXConfid |


```python
print(df.shape)
df.info()  # info() prints its summary directly and returns None
```

```text
(2149, 35)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  2149 non-null   int64  
 1   Age                        2149 non-null   int64  
 2   Gender                     2149 non-null   int64  
 3   Ethnicity                  2149 non-null   int64  
 4   EducationLevel             2149 non-null   int64  
 5   BMI                        2149 non-null   float64
 6   Smoking                    2149 non-null   int64  
 7   AlcoholConsumption         2149 non-null   float64
 8   PhysicalActivity           2149 non-null   float64
 9   DietQuality                2149 non-null   float64
 10  SleepQuality               2149 non-null   float64
 11  FamilyHistoryAlzheimers    2149 non-null   int64  
 12  CardiovascularDisease      2149 non-null   int64  
 13  Diabetes                   2149 non-n
```

```python
print(df['Diagnosis'].value_counts())
```

```text
Diagnosis
0    1389
1     760
Name: count, dtype: int64
```


- **Findings:**
  - Confirmed that the dataset has 2,149 rows and 35 columns.
  - Identified the data types of each column (numerical, categorical, or binary).
  - Verified the distribution of the target variable to understand the class balance between the two categories (Alzheimer’s vs. No Alzheimer’s).

### 4. Explanation of Features and Target Variable
The dataset contains 35 columns, including 34 features and 1 target variable. Below is a detailed description of each feature, the target variable, and their relevance to Alzheimer’s disease prediction, based on the dataset’s structure as described on Kaggle.

#### 4.1. Target Variable
- **Diagnosis (Target):** This is the target variable, indicating the presence or absence of Alzheimer’s disease. It is a binary variable with two possible values:
  - 0: No Alzheimer’s (healthy or non-demented)
  - 1: Alzheimer’s (demented)
  

#### 4.2. Features
The dataset includes 34 features, grouped below into demographic, lifestyle, medical history, clinical, cognitive and functional, behavioral and symptom-based, and metadata categories. Each feature is described in turn:

##### Demographic Features
1. **PatientID:** A unique identifier for each patient (numerical, integer).
   
2. **Age:** The patient’s age in years (numerical, integer, range: 60–90).
   
3. **Gender:** The patient’s gender (binary, 0 = Male, 1 = Female).
   
4. **Ethnicity:** The patient’s ethnicity (categorical, 0 = Caucasian, 1 = African American, 2 = Asian, 3 = Other).
   
5. **EducationLevel:** The patient’s highest education level (categorical, 0 = None, 1 = High School, 2 = Bachelor's, 3 = Higher).
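
The integer codes listed above are easier to read during inspection when mapped back to their labels. The sketch below records the code tables exactly as documented in this section; the names `CODE_LABELS` and `decode_record` are illustrative and not part of the dataset or the project code.

```python
# Code tables as documented for the demographic features.
CODE_LABELS = {
    "Gender": {0: "Male", 1: "Female"},
    "Ethnicity": {0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"},
    "EducationLevel": {0: "None", 1: "High School", 2: "Bachelor's", 3: "Higher"},
}


def decode_record(record, code_labels=CODE_LABELS):
    """Return a copy of `record` with coded demographic fields replaced by labels.

    An undocumented code raises KeyError, which surfaces unexpected values early.
    """
    decoded = dict(record)
    for column, labels in code_labels.items():
        if column in decoded:
            decoded[column] = labels[decoded[column]]
    return decoded


# Demographic fields of the first row shown in the initial inspection.
first_row = {"PatientID": 4751, "Age": 73, "Gender": 0, "Ethnicity": 0, "EducationLevel": 2}
decoded_first_row = decode_record(first_row)
```

With pandas, the same dictionaries can be passed to `Series.map`, for example `df["Ethnicity"].map(CODE_LABELS["Ethnicity"])`.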
   

##### Lifestyle Features
6. **BMI:** Body Mass Index (numerical, float, range: 15–40).
   
7. **Smoking:** Smoking status (binary, 0 = No, 1 = Yes).
   
8. **AlcoholConsumption:** Weekly alcohol consumption in units (numerical, float, range: 0–20).

9. **PhysicalActivity:** Weekly physical activity in hours (numerical, float, range: 0–10).
   
10. **DietQuality:** Quality of diet (numerical, float, range: 0–10, higher indicates better quality).
    
11. **SleepQuality:** Quality of sleep (numerical, float, range: 4–10, higher indicates better quality).
    

##### Medical History Features
12. **FamilyHistoryAlzheimers:** Family history of Alzheimer’s (binary, 0 = No, 1 = Yes).
    
13. **CardiovascularDisease:** Presence of cardiovascular disease (binary, 0 = No, 1 = Yes).
    
14. **Diabetes:** Presence of diabetes (binary, 0 = No, 1 = Yes).
    
15. **Depression:** Presence of depression (binary, 0 = No, 1 = Yes).
    
16. **HeadInjury:** History of head injury (binary, 0 = No, 1 = Yes).
    
17. **Hypertension:** Presence of hypertension (binary, 0 = No, 1 = Yes).
    

##### Clinical Measurements
18. **SystolicBP:** Systolic blood pressure (numerical, integer, range: 90–180).
    
19. **DiastolicBP:** Diastolic blood pressure (numerical, integer, range: 60–120).
    
20. **CholesterolTotal:** Total cholesterol level (numerical, float, range: 150–300).
    
21. **CholesterolLDL:** Low-density lipoprotein cholesterol level (numerical, float, range: 50–200).
    
22. **CholesterolHDL:** High-density lipoprotein cholesterol level (numerical, float, range: 20–100).
    
23. **CholesterolTriglycerides:** Triglyceride level (numerical, float, range: 50–400).
    

##### Cognitive and Functional Assessments
24. **MMSE:** Mini-Mental State Examination score (numerical, float, range: 0–30, higher indicates better cognitive function).
    
25. **FunctionalAssessment:** Functional assessment score (numerical, float, range: 0–10, higher indicates better functioning).
    
26. **MemoryComplaints:** Presence of memory complaints (binary, 0 = No, 1 = Yes).
    
27. **BehavioralProblems:** Presence of behavioral problems (binary, 0 = No, 1 = Yes).
    
28. **ADL:** Activities of Daily Living score (numerical, float, range: 0–10, higher indicates better independence).
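
The numeric ranges documented so far can serve as a simple sanity check on the loaded data. The sketch below is one possible way to do this in plain Python; `DOCUMENTED_RANGES` copies the ranges stated in this section, and the helper name `out_of_range` is an assumption made for illustration.

```python
# (low, high) bounds as documented in the feature descriptions, inclusive.
DOCUMENTED_RANGES = {
    "Age": (60, 90),
    "BMI": (15, 40),
    "AlcoholConsumption": (0, 20),
    "PhysicalActivity": (0, 10),
    "DietQuality": (0, 10),
    "SleepQuality": (4, 10),
    "SystolicBP": (90, 180),
    "DiastolicBP": (60, 120),
    "CholesterolTotal": (150, 300),
    "CholesterolLDL": (50, 200),
    "CholesterolHDL": (20, 100),
    "CholesterolTriglycerides": (50, 400),
    "MMSE": (0, 30),
    "FunctionalAssessment": (0, 10),
    "ADL": (0, 10),
}


def out_of_range(records, ranges=DOCUMENTED_RANGES):
    """Return (row_index, column, value) for every value outside its documented range.

    `records` is an iterable of dict-like rows; columns absent from a row are skipped.
    """
    violations = []
    for i, row in enumerate(records):
        for column, (low, high) in ranges.items():
            if column in row and not (low <= row[column] <= high):
                violations.append((i, column, row[column]))
    return violations
```

For a pandas DataFrame, the rows can be supplied as `out_of_range(df.to_dict("records"))`; an empty list means every checked value lies within its documented range.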
    

##### Behavioral and Symptom-Based Features
29. **Confusion:** Presence of confusion (binary, 0 = No, 1 = Yes).
    
30. **Disorientation:** Presence of disorientation (binary, 0 = No, 1 = Yes).

31. **PersonalityChanges:** Presence of personality changes (binary, 0 = No, 1 = Yes).
    
32. **DifficultyCompletingTasks:** Difficulty completing familiar tasks (binary, 0 = No, 1 = Yes).
    
33. **Forgetfulness:** Presence of forgetfulness (binary, 0 = No, 1 = Yes).

##### Metadata
34. **DoctorInCharge:** A column indicating the doctor in charge (string; every row inspected above contains the placeholder value `XXXConfid`).
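
The feature descriptions above can be condensed into column groups that later steps can reuse. This is a sketch of one possible grouping, following the types stated in this section; the variable names are illustrative, and treating `PatientID` as a non-predictive identifier alongside `DoctorInCharge` is an assumption rather than a decision recorded in the source.

```python
# Columns that identify or annotate a record rather than describe the patient.
ID_AND_METADATA = ["PatientID", "DoctorInCharge"]

# Integer- or float-valued measurements and scores.
NUMERIC = [
    "Age", "BMI", "AlcoholConsumption", "PhysicalActivity", "DietQuality",
    "SleepQuality", "SystolicBP", "DiastolicBP", "CholesterolTotal",
    "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides",
    "MMSE", "FunctionalAssessment", "ADL",
]

# 0/1 indicator columns.
BINARY = [
    "Gender", "Smoking", "FamilyHistoryAlzheimers", "CardiovascularDisease",
    "Diabetes", "Depression", "HeadInjury", "Hypertension", "MemoryComplaints",
    "BehavioralProblems", "Confusion", "Disorientation", "PersonalityChanges",
    "DifficultyCompletingTasks", "Forgetfulness",
]

# Integer-coded columns with more than two levels.
CATEGORICAL = ["Ethnicity", "EducationLevel"]

TARGET = "Diagnosis"

# Columns that would be passed to a model under the assumption above.
MODEL_FEATURES = NUMERIC + BINARY + CATEGORICAL
ALL_COLUMNS = ID_AND_METADATA + MODEL_FEATURES + [TARGET]

# The groups should account for all 35 columns exactly once.
assert len(ALL_COLUMNS) == len(set(ALL_COLUMNS)) == 35
```

Keeping the groups in one place means that encoding and scaling steps can select columns by name, for example `df[NUMERIC]`, without repeating the lists.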

#### 4.3. Observations
- **Class Distribution:** The initial inspection showed the following distribution of the target variable (`Diagnosis`):
  - No Alzheimer’s (0): 1,389 samples
  - Alzheimer’s (1): 760 samples
  This indicates a moderate class imbalance: Alzheimer’s cases make up approximately 35% of the dataset (760 of 2,149). The imbalance will be addressed in the preprocessing phase.
- **Feature Types:** The dataset includes a mix of numerical (e.g., Age, MMSE), binary (e.g., Gender, Smoking), and categorical (e.g., Ethnicity, EducationLevel) features, requiring appropriate preprocessing (e.g., encoding, scaling) in the next phase.
- **Metadata Column:** The `DoctorInCharge` column appears to be a placeholder and will be removed during preprocessing, as it does not contribute to the prediction task.
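
The class shares in the observations above follow directly from the counts reported by `value_counts()`. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Counts taken from the value_counts() output in the initial inspection.
class_counts = {0: 1389, 1: 760}

total = sum(class_counts.values())  # 2149 rows in total
class_shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = class_counts[0] / class_counts[1]  # majority / minority

for label, count in class_counts.items():
    print(f"Diagnosis={label}: {count} samples ({class_shares[label]:.1%})")
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
# Diagnosis=0: 1389 samples (64.6%)
# Diagnosis=1: 760 samples (35.4%)
# Majority-to-minority ratio: 1.83
```

With the DataFrame loaded, the same shares are available from `df['Diagnosis'].value_counts(normalize=True)`.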