# Exploratory Data Analysis (EDA)
**Team:** The Closer  
**Week:** 10  

## Objective
To understand the dataset structure, identify patterns, and detect relationships between the target variable (`Persistency_Flag`) and other features.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load Data
df = pd.read_excel('Healthcare_dataset (1).xlsx', sheet_name='Dataset')
print("Shape:", df.shape)
df.head()

## 1. Univariate Analysis
Analyzing the distribution of individual variables.

In [None]:
# Target Variable Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='Persistency_Flag', data=df)
plt.title('Distribution of Persistency Flag')
plt.show()

In [None]:
# Numerical Variable Distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['Count_Of_Risks'], kde=True, bins=20)
plt.title('Distribution of Count_Of_Risks')
plt.show()
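
The plots above show shapes but not exact figures. As a minimal sketch, the class shares and a numeric summary can be printed alongside them. The `toy` frame and the `class_shares` helper are hypothetical stand-ins introduced for illustration; with the real data, pass `df` instead.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (the real data comes from the Excel file).
toy = pd.DataFrame({
    "Persistency_Flag": ["Persistent", "Non-Persistent", "Non-Persistent",
                         "Persistent", "Non-Persistent", "Non-Persistent",
                         "Non-Persistent", "Persistent"],
    "Count_Of_Risks": [2, 0, 1, 3, 1, 0, 1, 2],
})

def class_shares(frame, target="Persistency_Flag"):
    """Return each class's share of rows (fractions that sum to 1)."""
    return frame[target].value_counts(normalize=True)

shares = class_shares(toy)
print(shares)

# Numeric summary to read alongside the histogram.
risk_summary = toy["Count_Of_Risks"].describe()
print(risk_summary[["mean", "50%", "max"]])
```

Reading the shares next to the countplot makes it easier to judge how far the target is from a 50/50 split.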

## 2. Bivariate Analysis
Analyzing relationships between variables.

In [None]:
# Persistency by Gender
plt.figure(figsize=(6, 4))
sns.countplot(x='Gender', hue='Persistency_Flag', data=df)
plt.title('Persistency by Gender')
plt.show()

In [None]:
# Risk Count vs Persistency
plt.figure(figsize=(6, 4))
sns.boxplot(x='Persistency_Flag', y='Count_Of_Risks', data=df)
plt.title('Count of Risks vs Persistency')
plt.show()
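
Raw counts per gender can mislead when group sizes differ, so a row-normalized table may be easier to compare. The sketch below uses a hypothetical `toy` frame; the `Gender` labels `Female` and `Male` are assumed rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the Gender labels are an assumption about the real data.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "Persistency_Flag": ["Persistent", "Non-Persistent", "Non-Persistent",
                         "Non-Persistent", "Persistent", "Non-Persistent"],
    "Count_Of_Risks": [3, 1, 0, 1, 2, 0],
})

# Share of each persistency class within each gender (every row sums to 1).
rate_by_gender = pd.crosstab(
    toy["Gender"], toy["Persistency_Flag"], normalize="index"
)
print(rate_by_gender)

# Median and mean risk count per persistency class, to back up the boxplot.
risk_by_flag = toy.groupby("Persistency_Flag")["Count_Of_Risks"].agg(["median", "mean"])
print(risk_by_flag)
```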

## 3. Correlation Analysis
Checking correlation between numerical features.

In [None]:
numeric_df = df.select_dtypes(include=[np.number])
col_count = len(numeric_df.columns)
print(f"Number of numeric columns: {col_count}")

if col_count > 1:
    plt.figure(figsize=(10, 8))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.show()
else:
    print("Not enough numeric columns for proper correlation analysis.")
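
If `Persistency_Flag` is stored as text, `select_dtypes` drops it and the heatmap says nothing about the target. One possible workaround, sketched here on a hypothetical `toy` frame, is to encode the flag as 0/1 and rank the numeric features by their correlation with it. Pearson correlation against a binary column is the point-biserial correlation. The `is_persistent` column name is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "Persistency_Flag": ["Persistent", "Non-Persistent", "Persistent",
                         "Non-Persistent", "Non-Persistent", "Persistent"],
    "Count_Of_Risks": [3, 0, 2, 1, 0, 2],
    "Dexa_Freq_During_Rx": [4, 0, 0, 2, 0, 6],
})

# Encode the target as 1 (Persistent) / 0 (Non-Persistent).
encoded = toy.assign(
    is_persistent=(toy["Persistency_Flag"] == "Persistent").astype(int)
)

# Correlation of every numeric feature with the encoded target,
# sorted by absolute strength (strongest first).
target_corr = (
    encoded.select_dtypes(include=[np.number])
    .corr()["is_persistent"]
    .drop("is_persistent")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```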

## 4. Final Recommendations
Based on the EDA and previous analysis:
1.  **Target Imbalance**: Check whether the `Persistency_Flag` classes are balanced. If they are not, consider resampling (e.g., SMOTE) applied to the training split only, or class weighting in the model.
2.  **Key Drivers**: `Count_Of_Risks` appears to differ between the Persistent and Non-Persistent groups in the boxplot; confirm this with group statistics before treating it as a driver.
3.  **Data Quality**: Ensure that missing values in `Dexa_Freq_During_Rx` (if any were found in the cleaning phase) are handled before modeling.
4.  **Modeling**: Start with Logistic Regression as a baseline, then try Random Forest, which can capture feature interactions that a linear model misses.
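
Recommendations 1 and 3 can be checked before any model is fit. The following is a minimal sketch on a hypothetical `toy` frame; the `premodel_checks` helper and the 0.35 imbalance threshold are assumptions chosen for illustration, not values from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned df.
toy = pd.DataFrame({
    "Persistency_Flag": ["Persistent"] * 2 + ["Non-Persistent"] * 6,
    "Dexa_Freq_During_Rx": [4, np.nan, 0, 0, 2, np.nan, 0, 1],
})

def premodel_checks(frame, target="Persistency_Flag", imbalance_threshold=0.35):
    """Summarize class balance and missing values before modeling.

    imbalance_threshold is an arbitrary illustrative cutoff: a minority
    share below it is flagged as a candidate for rebalancing.
    """
    shares = frame[target].value_counts(normalize=True)
    minority_share = float(shares.min())
    missing_share = frame.isna().mean()  # fraction of NaN per column
    return {
        "minority_share": minority_share,
        "needs_rebalancing": minority_share < imbalance_threshold,
        "columns_with_missing": missing_share[missing_share > 0].to_dict(),
    }

report = premodel_checks(toy)
print(report)
```

If `needs_rebalancing` comes back true, any resampling should be fit on the training split only so that the evaluation data stays untouched.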