## Classification Problem: Parkinson’s Disease Prediction
The goal of this project is to develop a classification model that predicts whether an individual has Parkinson’s disease (label 1) or is healthy (label 0) based on vocal characteristics measured from voice recordings. These features are indicative of the distinct vocal impairments commonly observed in Parkinson’s patients, making this a binary classification problem where the model needs to accurately distinguish between healthy and Parkinson’s cases.



## Parkinson's Disease Prediction Project Lifecycle

### 1. Data Pre-Processing & Cleaning
- **Load and Review Data**: Import data and get an overview of the dataset.
- **Understand Target Variable Distribution**: Analyze the distribution of the target variable (e.g., healthy vs Parkinson's diagnosis).
- **Handle Missing or Outlier Values**: Identify and address any missing data or outliers to ensure consistency and reliability.
- **Data Transformation**: Scale or normalize features so they share comparable ranges, fitting the scaler on the training data only to avoid leaking information from the test set.
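
The scaling step can be sketched as follows. This is a minimal illustration assuming z-score standardization, with the mean and standard deviation learned from the training rows only; the small arrays are made-up values, not rows from the dataset.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales,
# loosely mimicking a pitch column (Hz) and a jitter column (fraction).
X_train = np.array([[120.0, 0.0078],
                    [150.0, 0.0049],
                    [180.0, 0.0035],
                    [210.0, 0.0062]])
X_test = np.array([[170.0, 0.0060]])

# Learn the mean and standard deviation from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same statistics to both splits.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print(X_train_scaled.mean(axis=0).round(6))  # close to 0 for each column
print(X_train_scaled.std(axis=0).round(6))   # 1 for each column
```

After scaling, both columns contribute on a comparable scale to distance-based models such as SVM or k-nearest neighbors.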

### 2. Exploratory Data Analysis (EDA)
- **Define EDA Goals**: Establish the goals for EDA, focusing on extracting meaningful insights from the dataset.
- **Analyze Voice Frequency Variables**: Analyze frequency-related features such as fundamental frequency (pitch) and jitter.
- **Study Amplitude & Noise Variation Metrics**: Explore amplitude (shimmer) and noise (NHR, HNR) metrics to understand their distribution and trends.
- **Examine Nonlinear Complexity Measures**: Investigate nonlinear features such as fractal dimension or entropy metrics.
- **Summarize Insights and Identify Potential Predictors**: Record insights from EDA and identify key variables that may play a significant role in prediction.

### 3. Feature Engineering (If Required)
- **Create Interactions or Transform Features**: Create interaction features or transformations to potentially enhance predictive power.
- **Dimensionality Reduction (e.g., PCA)**: Apply dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce redundancy and simplify the feature space.

### 4. Data Splitting
- **Train-Test Split**: Split the data into training and testing sets (e.g., 80/20 or 70/30), stratifying on `status` so both sets keep a similar class ratio.
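
A minimal sketch of this step, assuming scikit-learn's `train_test_split` with stratification on the label. The feature matrix is random stand-in data; only the class counts (147 and 48) match the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data with the same class counts as the dataset:
# 147 positive (1) and 48 healthy (0) rows, 22 numeric features.
X = rng.normal(size=(195, 22))
y = np.array([1] * 147 + [0] * 48)

# stratify=y keeps the class ratio roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print("train positive rate:", round(y_train.mean(), 3))
print("test positive rate:", round(y_test.mean(), 3))
```

Because the `name` column suggests several recordings per subject, a group-aware splitter such as scikit-learn's `GroupShuffleSplit` may be worth considering so that one person's recordings do not land in both sets.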

### 5. Balance Data with Sampling Techniques
- **Apply SMOTE or Other Methods**: If the training set is imbalanced, use a technique such as SMOTE (Synthetic Minority Over-sampling Technique) to balance its class distribution. Leave the test set untouched so that evaluation reflects the real class ratio.

### 6. Training Machine Learning Model
- **Select Models and Define Evaluation Metrics**: Choose appropriate models (e.g., Logistic Regression, SVM, Random Forest) and metrics (e.g., accuracy, precision, recall) to evaluate model performance.
- **Train Models and Perform Hyperparameter Tuning**: Train selected models and use hyperparameter tuning techniques (e.g., Grid Search, Random Search) to improve their performance.
- **Select Final Model Based on Performance**: Assess all trained models and select the one with the best performance based on defined metrics.
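
The training and tuning steps above can be sketched as follows. This is an illustration under assumptions rather than the project's final setup: the data is synthetic, the SVM and its parameter grid are arbitrary choices, and F1 is used as the selection metric.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical training data: 120 rows, 5 features, with the label
# loosely tied to the first feature so there is something to learn.
X_train = rng.normal(size=(120, 5))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=120) > -0.6).astype(int)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion, which avoids optimistic scores.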

### 7. Feature Importance Analysis (Post-Modeling)
- **Perform Feature Importance Analysis**: Analyze the importance of features for the final model to understand which variables had the most impact on the predictions.
- **Summarize Key Predictor Insights for Interpretability**: Summarize the important features and document key insights to make the model more interpretable for stakeholders.

### End Goal:
An accurate and interpretable model for Parkinson's disease prediction.



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [3]:
data = pd.read_csv(r'C:\Users\Usman\Desktop\Portofolio\Datasets\parkinson-disease-detection\archive\Parkinsson disease.csv')

data.head()

,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


## Dataset Overview

### Name
- **Name**: Contains an identifier for each subject and recording.

### Vocal Attributes
- **MDVP:Fo(Hz)**, **MDVP:Fhi(Hz)**, **MDVP:Flo(Hz)**: Measures of the fundamental vocal frequency, capturing average, maximum, and minimum frequency respectively.
- **MDVP:Jitter(%)**, **MDVP:Jitter(Abs)**, **MDVP:RAP**, **MDVP:PPQ**, **Jitter:DDP**: Metrics capturing variations in fundamental frequency (jitter), which indicate stability in vocal pitch, often affected in Parkinson’s disease.
- **MDVP:Shimmer**, **MDVP:Shimmer(dB)**, **Shimmer:APQ3**, **Shimmer:APQ5**, **MDVP:APQ**, **Shimmer:DDA**: Measures of amplitude variation (shimmer), representing voice stability in terms of loudness.
- **NHR**, **HNR**: Ratios capturing noise to tonal components in the voice, which may reflect vocal disorder severity.

### Dynamical Complexity Measures
- **RPDE**, **DFA**: Measures of vocal signal complexity and fractal scaling, linked to vocal stability.
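- **Spread1**, **Spread2**, **D2**, **PPE**: Nonlinear measures. Spread1, Spread2, and PPE quantify variation in fundamental frequency, while D2 is a dynamical complexity measure (correlation dimension). All are relevant for Parkinson’s diagnosis.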

### Status
- **Status**: Target variable indicating health status (1 = Parkinson’s, 0 = Healthy).
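
Given these roles, the features and target can be separated as in the sketch below. The small frame reuses three rows and a few columns from the `data.head()` output. The subject-id parsing assumes the recording index is the last underscore-separated token of `name`.

```python
import pandas as pd

# Three rows copied from the data.head() output, restricted to a few columns.
data = pd.DataFrame({
    "name": ["phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S01_3"],
    "MDVP:Fo(Hz)": [119.992, 122.400, 116.682],
    "HNR": [21.033, 19.085, 20.651],
    "status": [1, 1, 1],
    "PPE": [0.284654, 0.368674, 0.332634],
})

# `name` is an identifier, not a measurement, so it is excluded from the features.
X = data.drop(columns=["name", "status"])
y = data["status"]

# Assumption: the recording index is the final underscore-separated token,
# so everything before it identifies the subject (useful for grouped splits).
subject = data["name"].str.rsplit("_", n=1).str[0]

print(list(X.columns))
print(subject.unique())
```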


---

## Discussing Attributes & Measures

### Vocal Attributes

#### 1. Fundamental Frequency
- **MDVP:Fo(Hz)**, **MDVP:Fhi(Hz)**, **MDVP:Flo(Hz)**:
  - These features measure the **fundamental frequency (pitch)** of the voice:
    - **MDVP:Fo(Hz)**: This is the **average fundamental frequency** of the voice. The fundamental frequency, often called pitch, is the basic frequency at which the vocal cords vibrate. In healthy individuals, this pitch tends to be more stable, whereas in Parkinson’s patients, there can be noticeable variations.
    - **MDVP:Fhi(Hz)**: This represents the **maximum fundamental frequency** reached during the recording, that is, the **highest pitch** the subject produced in that recording.
    - **MDVP:Flo(Hz)**: This represents the **minimum fundamental frequency** during the recording, that is, the **lowest pitch** the subject produced in that recording.
  - **Lower variability** in frequency is typically **better**, as **higher variability** might indicate vocal instability, a common symptom in Parkinson’s patients.

#### 2. Jitter Metrics (Frequency Variation)
- **MDVP:Jitter(%)**, **MDVP:Jitter(Abs)**, **MDVP:RAP**, **MDVP:PPQ**, **Jitter:DDP**:
  - These metrics measure **variations in the fundamental frequency** of the voice, which is referred to as **jitter**:
    - **Jitter** represents the **cycle-to-cycle variability** in pitch during sustained phonation. It measures the consistency of the vocal cord vibrations.
    - **MDVP:Jitter(%)**: Represents the **frequency variation as a percentage**. **Higher values** indicate **inconsistency** in vocal fold vibration, suggesting reduced vocal stability.
    - **MDVP:Jitter(Abs)**: This is the **absolute measure of pitch variation**.
    - **MDVP:RAP** (Relative Average Perturbation): Measures **short-term variability** in pitch, averaged over 3 pitch periods. **Higher RAP** values indicate **greater instability** in pitch.
    - **MDVP:PPQ** (five-point Period Perturbation Quotient): Measures **pitch variation** averaged over 5 pitch periods, a longer window than RAP. Again, **higher values** are a sign of instability.
    - **Jitter:DDP**: The average absolute difference between consecutive differences of consecutive periods. By definition it equals three times RAP, so it carries the same information. Higher values indicate **more instability** in vocal fold vibrations.
  - Individuals with Parkinson’s often show **increased jitter values**, implying **less stability** in pitch.
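
To make the jitter definitions concrete, here is a minimal sketch that computes them from a sequence of glottal cycle lengths. It follows the definitions commonly given for these measures (for example in Praat's documentation); the exact MDVP implementation may differ, and the period values are hypothetical.

```python
import numpy as np

def jitter_measures(periods):
    """Return jitter measures for a sequence of glottal cycle lengths (seconds)."""
    T = np.asarray(periods, dtype=float)
    mean_T = T.mean()

    # Absolute jitter: mean absolute difference between consecutive periods.
    jitter_abs = np.abs(np.diff(T)).mean()

    # Local jitter as a percentage of the mean period.
    jitter_pct = 100.0 * jitter_abs / mean_T

    # RAP: deviation of each period from its 3-point moving average.
    three_point_avg = (T[:-2] + T[1:-1] + T[2:]) / 3.0
    rap = np.abs(T[1:-1] - three_point_avg).mean() / mean_T

    # DDP: absolute second difference of the periods, relative to the mean.
    ddp = np.abs(np.diff(T, n=2)).mean() / mean_T

    return {"jitter_abs": jitter_abs, "jitter_pct": jitter_pct, "rap": rap, "ddp": ddp}

# Hypothetical periods around 8 ms (about 125 Hz) with small perturbations.
periods = [0.00800, 0.00806, 0.00797, 0.00803, 0.00799, 0.00805]
m = jitter_measures(periods)
print(m)
print(np.isclose(m["ddp"], 3 * m["rap"]))  # DDP equals 3 x RAP by construction
```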

#### 3. Shimmer Metrics (Amplitude Variation)
- **MDVP:Shimmer**, **MDVP:Shimmer(dB)**, **Shimmer:APQ3**, **Shimmer:APQ5**, **MDVP:APQ**, **Shimmer:DDA**:
  - These metrics measure **variations in amplitude** (or loudness), which is referred to as **shimmer**:
    - **Shimmer** represents **cycle-to-cycle variability in the loudness** of the voice, reflecting the stability in volume produced by the vocal cords.
    - **MDVP:Shimmer**: Measures **amplitude variation as a percentage**. **Higher values** indicate **less control** over the amplitude.
    - **MDVP:Shimmer(dB)**: Represents the **variation in loudness in decibels**. **Higher values** suggest greater instability.
    - **Shimmer:APQ3** (Amplitude Perturbation Quotient): Measures **average amplitude variability** over 3 periods. **Higher values** indicate instability.
    - **Shimmer:APQ5**: Measures shimmer over **5 periods**, indicating variability over a longer time frame compared to APQ3. **Higher values** suggest worse control.
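    - **MDVP:APQ**: An amplitude perturbation quotient computed over a longer window than APQ5 (11 periods in the MDVP definition).
    - **Shimmer:DDA**: Represents the **average absolute difference between consecutive differences** in the amplitudes of consecutive periods. By definition it equals three times APQ3.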
  - **Increased shimmer values** indicate **difficulty controlling loudness**, which is often seen in individuals with Parkinson’s, leading to unstable voice volume.
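
Some of the jitter and shimmer columns appear to be near-exact multiples of each other, which matters when selecting features. The sketch below checks this using only numbers visible in the notebook output: the five `data.head()` rows for RAP and DDP, and the `data.describe()` means for APQ3 and DDA.

```python
import numpy as np

# MDVP:RAP and Jitter:DDP for the five rows shown by data.head().
rap = np.array([0.00370, 0.00465, 0.00544, 0.00502, 0.00655])
ddp = np.array([0.01109, 0.01394, 0.01633, 0.01505, 0.01966])

# Column means of Shimmer:APQ3 and Shimmer:DDA from data.describe().
apq3_mean, dda_mean = 0.015664, 0.046993

print(np.round(ddp / rap, 3))           # each ratio is close to 3
print(round(dda_mean / apq3_mean, 3))   # also close to 3
```

If the same ratio holds across all rows of the full dataset, each pair is effectively one feature stored twice, and dropping one column from each pair is a reasonable option before fitting linear models.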

#### 4. Noise to Harmonics Ratio
- **NHR (Noise to Harmonics Ratio)**, **HNR (Harmonics to Noise Ratio)**:
  - **NHR** measures the amount of **noise in the voice compared to the tonal (harmonic) components**:
    - **Higher NHR** values indicate **more noise** relative to the tonal parts, which is considered **worse** for vocal quality. It suggests greater vocal fold irregularities and is often associated with Parkinson’s.
  - **HNR** measures the **ratio of harmonic content to noise**:
    - **Lower HNR** values indicate a **noisier voice**, which implies **worse vocal clarity**. People with Parkinson’s often have lower HNR values, leading to a breathier or hoarser voice.
    - **Higher HNR** values are **better**, as they indicate a clearer, more stable voice.

### Dynamical Complexity Measures

#### 1. Complexity Metrics
- **RPDE (Recurrence Period Density Entropy)**, **DFA (Detrended Fluctuation Analysis)**:
  - These metrics are used to measure the **complexity and scaling properties** of the vocal signal:
    - **RPDE** measures **how predictable the signal is over time**. **Higher RPDE** values indicate greater **unpredictability** or irregularity, which is **worse** for vocal stability.
    - **DFA** provides an indication of **fractal scaling properties** of the voice signal. **Higher DFA** values imply **greater instability** and are usually **worse** for vocal health.

#### 2. Nonlinear Vocal Metrics
- **Spread1**, **Spread2**, **D2**, **PPE (Pitch Period Entropy)**:
  - These metrics capture **nonlinear aspects** of the vocal signal:
    - **Spread1** and **Spread2**: Nonlinear measures of **fundamental frequency variation**. **Higher values** often indicate **greater instability**, which is **worse** for vocal health. Note that `spread1` is negative throughout this dataset, so a higher value means one closer to zero.
    - **D2**: The **correlation dimension** of the signal, with **higher values** indicating more complexity and being generally **worse** for vocal stability.
    - **PPE**: Measures the **entropy of pitch periods**. **Higher PPE** values indicate greater irregularity in pitch, which is **worse** for vocal health.

### Status
- **Status**: The target variable that indicates the health status of the subject:
  - **1**: Indicates the presence of Parkinson’s Disease.
  - **0**: Indicates a healthy individual.

---

### Key Points
- For most metrics (**Jitter**, **Shimmer**, **NHR**, **RPDE**, **DFA**, **Spread1**, **Spread2**, **D2**, and **PPE**), **lower values are better** as they indicate stability in vocal features.
- For **HNR**, **higher values are better**, as they represent clearer, more stable vocal quality.
- These features collectively help in identifying the characteristic changes in voice stability, amplitude, and noise, which are commonly impacted by Parkinson's disease.


### 1. Data Pre-Processing & Cleaning

In [10]:
pd.set_option('display.max_columns', None)

data.describe()

,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,0.015664,0.017878,0.024081,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,0.010153,0.012024,0.016947,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,0.00455,0.0057,0.00719,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,0.008245,0.00958,0.01308,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,0.01279,0.01347,0.01826,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,0.020265,0.02238,0.0294,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,0.05647,0.0794,0.13778,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


In [11]:
pd.reset_option('display.max_columns')


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 ...

In [5]:
data.isna().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

##### There are no null values in the dataset.

In [8]:
data['status'].unique()

array([1, 0], dtype=int64)

In [7]:
data['status'].value_counts()

1    147
0     48
Name: status, dtype: int64

### Target Variable Analysis: `status`

The target variable for this classification problem is **`status`**, which represents the presence or absence of Parkinson's disease:
- **1**: Indicates a positive diagnosis of Parkinson's disease.
- **0**: Indicates a negative diagnosis (healthy).

The output of `data['status'].unique()` confirms that this is a **binary classification problem**, with the two possible values being **1** and **0**.

#### Value Counts of `status`
To understand the distribution of the target variable, we examined the value counts:

```
data['status'].value_counts()
```
**Output**:
```
1    147
0     48
Name: status, dtype: int64
```

From the value counts:
- **147 samples** are labeled as **1** (Parkinson's positive).
- **48 samples** are labeled as **0** (Healthy).

This indicates a substantial **class imbalance**: the positive class (Parkinson's) makes up about 75% of the samples and is roughly **three times as frequent** as the negative class (Healthy). Such an imbalance can bias a model toward predicting the majority class (Parkinson's). It also makes accuracy alone misleading, since always predicting **1** would already be correct for 147 of 195 samples (about 75%).
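
The arithmetic behind the imbalance can be checked directly from the value counts. The sketch below also shows what a trivial classifier that always predicts the majority class would score, which is a useful baseline for later models.

```python
# Class counts reported by data['status'].value_counts().
n_parkinsons, n_healthy = 147, 48
n_total = n_parkinsons + n_healthy

# A "model" that always predicts 1 gets every Parkinson's row right
# and every healthy row wrong.
baseline_accuracy = n_parkinsons / n_total
healthy_recall = 0 / n_healthy

print(f"imbalance ratio: {n_parkinsons / n_healthy:.2f}")    # 3.06
print(f"majority-class accuracy: {baseline_accuracy:.3f}")   # 0.754
print(f"recall on healthy class: {healthy_recall:.3f}")      # 0.000
```

Any trained model should therefore be judged against roughly 75% accuracy, and on per-class metrics such as recall for the healthy class.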

#### Addressing Class Imbalance with SMOTE
Since we cannot simply collect additional data to balance the classes, we will use **SMOTE (Synthetic Minority Over-sampling Technique)** to **oversample** the minority class (`0`). SMOTE generates synthetic minority-class examples by interpolating between an existing minority sample and one of its nearest minority-class neighbors, which yields a more balanced dataset. Oversampling should be applied to the training data only, after the train-test split, so that synthetic points derived from test samples never influence training. This can reduce bias toward the majority class and help the model learn to recognize both classes.
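
To illustrate the idea behind SMOTE, here is a simplified NumPy sketch of its interpolation step. It is not the imbalanced-learn implementation (in practice `SMOTE(...).fit_resample(X_train, y_train)` from `imblearn.over_sampling` would typically be used). The function name `smote_like`, the split sizes, and the data are all hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random seed points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment

    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical training split: 118 Parkinson's rows and 38 healthy rows.
rng = np.random.default_rng(1)
X_healthy_train = rng.normal(size=(38, 4))
synthetic = smote_like(X_healthy_train, n_new=118 - 38)

print(synthetic.shape)  # (80, 4)
```

Each synthetic row lies on the line segment between two real healthy-class rows, so the new points stay inside the region the minority class already occupies.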

