
## **EXPLORATORY DATA ANALYSIS ON A DATASET**

**OBJECTIVE:**

The main goal of this assignment is to conduct a thorough exploratory analysis of the
"Cardiotocographic.csv" dataset to uncover insights, identify patterns, and understand the dataset's
underlying structure. You will use statistical summaries, visualizations, and data manipulation
techniques to explore the dataset comprehensively.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("Cardiotocographic.csv")

In [None]:
# Dataset information (df.info() prints its report directly and returns None)
print("Dataset Information:")
df.info()

# Display the first few rows
print("\nFirst 5 Rows of Dataset:")
print(df.head())

### **DATASET**

1. **LB** - Likely stands for "Baseline Fetal Heart Rate (FHR)" which represents the average fetal
heart rate over a period.

2. **AC** - Could represent "Accelerations" in the FHR. Accelerations are usually a sign of fetal
well-being.

3. **FM** - May indicate "Fetal Movements" detected by the monitor.

4. **UC** - Likely denotes "Uterine Contractions", which can impact the FHR pattern.

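5. **DL** - Likely stands for "Light Decelerations". The UCI Cardiotocography dataset, which uses
the same column names, documents DL as the number of light decelerations per second.

6. **DS** - Likely stands for "Severe Decelerations" (again following the UCI documentation), which can be a sign of fetal distress.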

7. **DP** - Could indicate "Decelerations Prolonged", or long-lasting decelerations.

8. **ASTV** - Might refer to "Percentage of Time with Abnormal Short Term Variability" in the
FHR.

9. **MSTV** - Likely stands for "Mean Value of Short Term Variability" in the FHR.

10. **ALTV** - Could represent "Percentage of Time with Abnormal Long Term Variability" in the
FHR.

11. **MLTV** - Might indicate "Mean Value of Long Term Variability" in the FHR.

## **TASKS**

**1. DATA CLEANING AND PREPARATION:**

In [None]:
# Load the dataset
df = pd.read_csv("Cardiotocographic.csv")

# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)


In [None]:
# Handle missing values (imputation or deletion)
# If missing values exist, fill them with the column median (or drop the affected rows instead)
# numeric_only=True skips any non-numeric columns, which have no median
df = df.fillna(df.median(numeric_only=True))
print("\nMissing values after handling:")
print(df.isnull().sum())
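
The imputation step can be checked on a small frame before trusting it on the real file. The sketch below assumes a hypothetical toy DataFrame (`toy`) with two of the dataset's column names and made-up values; it only illustrates that `fillna` with a Series of medians fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two CTG columns (values are made up).
toy = pd.DataFrame({
    "LB": [120.0, 132.0, np.nan, 140.0, 128.0],
    "AC": [0.000, 0.006, 0.003, np.nan, 0.002],
})

# Per-column medians; NaN entries are ignored when computing them.
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column with the matching median.
imputed = toy.fillna(medians)

print(medians)
print(imputed.isnull().sum())
```

The median is used rather than the mean because it is less sensitive to the extreme values that the outlier step later looks for.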

In [None]:
# Check data types before conversion
print("\nData Types Before Conversion:")
print(df.dtypes)

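# Convert object (string) types to numeric if necessary
for col in df.columns:
    if df[col].dtype == 'object':  # Convert numeric columns stored as strings
        df[col] = pd.to_numeric(df[col], errors='coerce')

# errors='coerce' turns unparsable entries into NaN, so impute once more
df = df.fillna(df.median(numeric_only=True))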
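
A short illustration of what `errors='coerce'` does may help here. The sketch assumes a hypothetical string column containing one unparsable entry (`"n/a"`); the point is that coercion can create new missing values, so a missing-value check belongs after the conversion as well as before it.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings, with one bad entry.
raw = pd.DataFrame({"LB": ["120", "132", "n/a", "140"]})

# Unparsable strings become NaN instead of raising an error.
converted = pd.to_numeric(raw["LB"], errors="coerce")
n_new_missing = int(converted.isna().sum())

# Fill the newly created gap with the median of the parsed values.
filled = converted.fillna(converted.median())

print(converted.dtype, n_new_missing)
print(filled.tolist())
```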


In [None]:
print("\nData Types After Conversion:")
print(df.dtypes)

# Summary statistics before handling outliers
print("\nSummary Statistics Before Outlier Handling:")
print(df.describe())

In [None]:
# Identify outliers (values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
print(f"\nNumber of Outliers Detected: {outliers.sum().sum()}")

In [None]:
# Remove outliers
df_cleaned = df[~outliers.any(axis=1)]

print(f"\nRows before outlier removal: {df.shape[0]}")
print(f"Rows after outlier removal: {df_cleaned.shape[0]}")
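
Dropping every row that contains an outlier in any column can discard a large share of the data. As a hedged alternative, the sketch below computes the same IQR fences and then compares row removal with clipping values to the fences. The helper `iqr_fences` and the toy values are assumptions for illustration, not part of the assignment.

```python
import pandas as pd

def iqr_fences(frame, k=1.5):
    """Return per-column (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one obvious outlier in LB (300).
toy = pd.DataFrame({
    "LB": [120, 125, 130, 135, 140, 300],
    "UC": [0.004, 0.005, 0.006, 0.005, 0.004, 0.005],
})

lower, upper = iqr_fences(toy)

# Boolean mask of cells that fall outside the fences.
mask = (toy < lower) | (toy > upper)

# Option 1: drop every row that has at least one flagged cell.
trimmed = toy[~mask.any(axis=1)]

# Option 2: keep all rows and clip flagged cells to the nearest fence.
clipped = toy.clip(lower=lower, upper=upper, axis=1)

print(f"Flagged cells: {int(mask.sum().sum())}")
print(f"Rows kept by trimming: {len(trimmed)}, by clipping: {len(clipped)}")
```

Which option suits the analysis depends on whether the extreme values are measurement errors or genuine clinical cases; clipping keeps the row count intact but alters the values.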

In [None]:
# Boxplot to visualize outliers
plt.figure(figsize=(12,6))
sns.boxplot(data=df_cleaned)
plt.xticks(rotation=90)
plt.title("Boxplot After Outlier Removal")
plt.show()

**2. STATISTICAL SUMMARY:**

In [None]:
# Compute statistical measures
summary_stats = pd.DataFrame({
    "Mean": df.mean(),
    "Median": df.median(),
    "Standard Deviation": df.std(),
    "IQR": df.quantile(0.75) - df.quantile(0.25),
    "Min": df.min(),
    "Max": df.max()
})

In [None]:
# Display the statistical summary
print("Statistical Summary:")
print(summary_stats)

In [None]:
# Identify interesting findings
print("Interesting Findings:")

# Detect high variability
high_variability = summary_stats[summary_stats["Standard Deviation"] > summary_stats["Mean"] * 0.5]
if not high_variability.empty:
    print("\nColumns with High Variability (Std Dev > 50% of Mean):")
    print(high_variability)

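# Detect skewed distributions
# The gap between mean and median can never exceed one standard deviation,
# so that comparison flags nothing; use the sample skewness coefficient instead
skewness = df.skew(numeric_only=True)
skewed_columns = summary_stats.loc[skewness[skewness.abs() > 1].index]
if not skewed_columns.empty:
    print("\nColumns with Skewed Distribution (|skewness| > 1):")
    print(skewed_columns)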

# Detect columns with extreme values
large_range = summary_stats[(summary_stats["Max"] - summary_stats["Min"]) > summary_stats["Mean"] * 5]
if not large_range.empty:
    print("\nColumns with Large Range of Values:")
    print(large_range)

**3. DATA VISUALIZATIONS:**

In [None]:
# Histograms (DataFrame.hist creates its own figure, so no separate plt.figure() call is needed)
df.hist(figsize=(12, 10), bins=20, color='skyblue')
plt.suptitle("Histograms of Numerical Variables", fontsize=16)
plt.show()

# Boxplots
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, palette="coolwarm")
plt.xticks(rotation=90)
plt.title("Boxplots of Numerical Variables")
plt.show()


In [None]:
# Bar chart and pie chart of the class label (drawn only if a 'Class' column exists)
if 'Class' in df.columns:
    # Bar chart
    plt.figure(figsize=(8, 6))
    sns.countplot(x=df['Class'], palette="viridis")
    plt.title("Frequency of Categories")
    plt.xlabel("Category")
    plt.ylabel("Count")
    plt.show()

    # Pie chart
    plt.figure(figsize=(8, 6))
    df['Class'].value_counts().plot.pie(autopct='%1.1f%%', cmap='coolwarm', shadow=True)
    plt.title("Distribution of Categorical Variable")
    plt.ylabel("")
    plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

# Scatter plot of LB vs AC, coloured by FM
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df["LB"], y=df["AC"], hue=df["FM"], palette="coolwarm")
plt.title("Scatter Plot of LB vs AC")
plt.xlabel("Baseline Fetal Heart Rate (LB)")
plt.ylabel("Accelerations (AC)")
plt.show()


**4. PATTERN RECOGNITION AND INSIGHTS:**

In [None]:
# 1. Correlation Analysis

# Compute correlation matrix
correlation_matrix = df.corr()

# Display correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

# Find highly correlated variables (above 0.7 or below -0.7)
threshold = 0.7
# Blank out the diagonal: every feature correlates perfectly with itself
off_diagonal = correlation_matrix.mask(np.eye(len(correlation_matrix), dtype=bool))
high_correlations = off_diagonal[off_diagonal.abs() > threshold]
print("\nHighly Correlated Features (above 0.7 or below -0.7):")
print(high_correlations.dropna(how='all').dropna(axis=1, how='all'))
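
To list each strongly correlated pair once, rather than reading them off a matrix, one option is to keep only the upper triangle of the correlation matrix and stack it into a Series. The helper `high_corr_pairs` and the toy columns `a`, `b`, `c` below are hypothetical; this is a sketch of the idea, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Strict upper triangle: drops the diagonal and the mirrored duplicates.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (if any survive stacking) compare as False and are dropped here.
    strong = pairs[pairs.abs() > threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical toy data: b is roughly 2 * a, c is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data the same call would be `high_corr_pairs(df)`, assuming every column in `df` is numeric.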

In [None]:
# 2. Identifying Trends Over Time (If Temporal Data Exists)

# Check if a time-related column exists
time_columns = [col for col in df.columns if "time" in col.lower() or "date" in col.lower()]
if time_columns:
    time_col = time_columns[0]  # Assuming the first detected time-related column
    df[time_col] = pd.to_datetime(df[time_col])  # Convert to datetime format
    df.sort_values(by=time_col, inplace=True)
else:
    print("\nNo time-related column found in the dataset.")
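
If a timestamp column were present, a rolling mean would be one simple way to look for trends. The sketch below uses a hypothetical `timestamp` column and made-up LB readings purely to show the mechanics; the window of 3 readings is an arbitrary choice.

```python
import pandas as pd

# Hypothetical timestamped readings, deliberately out of order.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 10:02", "2024-01-01 10:00", "2024-01-01 10:01",
        "2024-01-01 10:04", "2024-01-01 10:03",
    ]),
    "LB": [134.0, 130.0, 132.0, 138.0, 136.0],
})

# Sort chronologically and use the timestamp as the index.
ordered = toy.sort_values("timestamp").set_index("timestamp")

# Rolling mean over 3 consecutive readings; the first two rows have no full window.
ordered["LB_rolling_mean"] = ordered["LB"].rolling(window=3).mean()

print(ordered)
```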

**5. CONCLUSION:**

**● Summarize the key insights and patterns discovered through your exploratory
analysis.**

**1. Data Distribution & Statistical Summary:**

•	Baseline Fetal Heart Rate (LB) has a relatively normal distribution but shows some variations across samples.

•	Accelerations (AC) and Fetal Movements (FM) exhibit high variability, indicating different fetal activity levels.

•	Decelerations (DL, DS, DP) have skewed distributions, which may indicate potential signs of fetal distress in some cases.

•	Short-term and long-term variability (ASTV, MSTV, ALTV, MLTV) are key indicators of fetal well-being and fluctuate significantly across samples.


**2. Correlation Insights:**

•	Strong positive correlation between Uterine Contractions (UC) and Prolonged Decelerations (DP) suggests that contractions may lead to prolonged heart rate decelerations.

•	ASTV and ALTV show a moderate correlation, indicating that short-term and long-term heart rate variability are interconnected.

•	No extreme multicollinearity detected, meaning most variables provide unique information and are not redundant.


**3. Outliers and Anomalies:**

•	Outliers detected in Fetal Movements (FM), Uterine Contractions (UC), and Decelerations (DP), which may indicate extreme fetal activity or distress cases.

•	Boxplots revealed that some cases have abnormally high or low values in certain variables, potentially requiring further medical review.

**4. Trends Over Time (If Temporal Data Exists):**

•	If a time-related variable (e.g., monitoring timestamps) were available, it would help analyze fetal heart rate patterns over time.

•	Continuous monitoring of variability indicators (ASTV, MSTV) over time can improve early detection of fetal distress.


**● Discuss how these findings could impact decision-making or further analysis.**

**1. Medical Diagnosis & Early Intervention:**

•	Accelerations and Decelerations are key indicators for assessing fetal health.

•	Early detection of abnormal variability (ASTV, ALTV) can help doctors intervene before complications arise.

**2. Feature Selection for Predictive Models:**

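•	Features that correlate strongly with the outcome are good candidates for predictive modeling (e.g., machine learning models to classify fetal health status), while pairs of features that correlate strongly with each other carry overlapping information and may be pruned.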

•	Handling outliers properly is essential to improve model accuracy and reduce bias.

**3. Monitoring & Alert Systems:**

•	Real-time tracking of fetal heart rate and uterine contractions can enhance fetal monitoring systems, leading to better maternal care.

•	Hospitals can use automated alert systems based on fetal movement and heart rate variability to detect high-risk pregnancies.