        "# Exploratory Data Analysis (EDA)\n",
        "\n",
        "This notebook walks through common EDA steps:\n",
        "1. Importing libraries.\n",
        "2. Loading the dataset.\n",
        "3. Viewing data structure (head, info, describe) and checking for missing values.\n",
        "4. Plotting histograms and box plots.\n",
        "5. Generating a correlation matrix and heatmap.\n",
        "6. Identifying potential outliers.\n",
        "7. Listing potential next steps.\n",
        "8. Summarizing findings.\n"

        "## 1. Import Libraries\n",
        "We'll import `pandas`, `numpy`, `matplotlib`, and `seaborn` for data analysis and visualization.\n"

In [None]:
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "from IPython.display import display\n",
        "\n",
        "# Set plotting style\n",
        "sns.set_style('whitegrid')\n",
        "plt.rcParams['figure.figsize'] = (10, 6)"

        "## 2. Load Dataset\n",
        "We read a sample dataset from `../../data/raw/dataset.csv`, a path relative to the notebook's working directory. Adjust it to match your directory structure."

In [None]:
        "# Adjust this path to point to your actual data file\n",
        "data_path = '../../data/raw/dataset.csv'\n",
        "\n",
        "df = pd.read_csv(data_path)\n",
        "print('Data loaded. Shape:', df.shape)\n",
        "display(df.head())"
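
A wrong relative path is the most common failure at this step. The sketch below shows one way to fail early with a clear message. The helper name `load_dataset` and the throwaway CSV are illustrative assumptions, not part of this notebook's codebase.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_dataset(path):
    """Hypothetical helper: read a CSV, raising a clear error if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    return pd.read_csv(path)


# Demo with a temporary file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "dataset.csv"
    demo_path.write_text("a,b\n1,x\n2,y\n")
    demo_df = load_dataset(demo_path)

print("Data loaded. Shape:", demo_df.shape)
```

In the notebook itself, the call would presumably be `load_dataset(data_path)` in place of `pd.read_csv(data_path)`.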

        "## 3. Basic Data Overview\n",
        "- `.info()` shows column types and non-null counts.\n",
        "- `.describe()` gives summary statistics for numeric columns.\n"

In [None]:
        "df.info()\n",
        "display(df.describe())"
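
By default `.describe()` summarizes only numeric columns. If the dataset also has text or categorical columns, passing `include="all"` adds count, unique, top, and freq rows for them. The sketch below uses a made-up toy frame (`toy`, with hypothetical `age` and `city` columns) rather than the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "age": [23, 35, 41, 35],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# include="all" reports numeric statistics and categorical statistics together;
# cells that do not apply to a column are filled with NaN.
full_summary = toy.describe(include="all")
print(full_summary)
```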

        "### 3.1 Checking for Missing Values\n",
        "`.isnull()` and `.isna()` are aliases, so either `.isnull().sum()` or `.isna().sum()` returns the number of missing values in each column.\n"

In [None]:
        "missing_counts = df.isnull().sum()\n",
        "print('Missing values per column:')\n",
        "display(missing_counts)\n",
        "\n",
        "# Optionally, see percentage of missing data\n",
        "missing_percent = (missing_counts / len(df)) * 100\n",
        "print('Percentage of missing data per column:')\n",
        "display(missing_percent)"
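
On wide datasets it can be easier to read counts and percentages side by side, restricted to the columns that actually have gaps. A minimal sketch, using a made-up toy frame (`toy`) in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [None, None, "x", "y"],
    "c": [1, 2, 3, 4],
})

# isna().mean() is the fraction of missing rows, so * 100 gives a percentage.
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_percent": toy.isna().mean() * 100,
})

# Keep only columns with missing data, worst first.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)
print(missing_summary)
```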

        "## 4. Data Visualization\n",
        "We'll plot histograms and box plots of the numeric columns to inspect their distributions and spot potential outliers.\n"


In [None]:
        "# Identify numeric columns for histograms\n",
        "numeric_cols = df.select_dtypes(include=[np.number]).columns\n",
        "df[numeric_cols].hist(bins=30, figsize=(12, 8))\n",
        "plt.suptitle('Histograms of Numeric Features', fontsize=16)\n",
        "plt.tight_layout()\n",
        "plt.show()"
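
Histograms show asymmetry visually. As a numeric complement, `DataFrame.skew()` gives a per-column skewness estimate. The sketch below is illustrative only and uses a made-up toy frame (`toy`) rather than `df`.

```python
import pandas as pd

# Hypothetical toy data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6],
    "right_tailed": [1, 1, 2, 2, 3, 50],
})

# Positive skew suggests a long right tail, negative skew a long left tail.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```

Strongly skewed columns are often candidates for a transform (for example a log) during feature engineering.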

In [None]:
        "# Box plots for numeric features, one subplot per column\n",
        "# squeeze=False always returns a 2D array of axes, even for a single column\n",
        "fig, axs = plt.subplots(nrows=1, ncols=len(numeric_cols), figsize=(5*len(numeric_cols), 5), squeeze=False)\n",
        "axs = axs.ravel()\n",
        "\n",
        "for i, col in enumerate(numeric_cols):\n",
        "    sns.boxplot(data=df, y=col, ax=axs[i])\n",
        "    axs[i].set_title(f'Boxplot: {col}')\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.show()"

        "## 5. Correlation Matrix and Heatmap\n",
        "We'll create a correlation matrix for numeric features and visualize it with a heatmap.\n"

In [None]:
        "corr_matrix = df[numeric_cols].corr()\n",
        "plt.figure(figsize=(8, 6))\n",
        "sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)\n",
        "plt.title('Correlation Heatmap')\n",
        "plt.show()"
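
A heatmap gets hard to read with many features, so it can help to list the strongest pairs explicitly. The sketch below is one possible approach. The toy frame and the `0.9` cutoff are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y is an exact multiple of x, z is only loosely related.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep the upper triangle only (k=1 drops the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Assumed cutoff: treat |r| >= 0.9 as "strong".
strong_pairs = pairs[pairs.abs() >= 0.9].sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(strong_pairs)
```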

        "## 6. Outlier Detection (Optional)\n",
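        "We can detect outliers via the **Interquartile Range (IQR)** or Z-scores. Below is a quick IQR method, which flags values that lie more than `factor` times the IQR below the first quartile or above the third quartile:\n"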

In [None]:
        "def find_outliers_iqr(data, factor=1.5):\n",
        "    \"\"\"Return a boolean mask that is True where a value is an IQR outlier.\"\"\"\n",
        "    q1 = np.percentile(data, 25)\n",
        "    q3 = np.percentile(data, 75)\n",
        "    iqr = q3 - q1\n",
        "    lower_bound = q1 - (factor * iqr)\n",
        "    upper_bound = q3 + (factor * iqr)\n",
        "    return (data < lower_bound) | (data > upper_bound)\n",
        "\n",
        "outlier_counts = {}\n",
        "for col in numeric_cols:\n",
        "    outliers_mask = find_outliers_iqr(df[col].dropna())\n",
        "    outlier_count = int(outliers_mask.sum())  # plain int so the dict displays cleanly\n",
        "    outlier_counts[col] = outlier_count\n",
        "\n",
        "print('Outlier counts (IQR method, factor=1.5):')\n",
        "display(outlier_counts)"
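
Section 6 also mentions Z-scores as an alternative. A minimal sketch of that approach follows. The function name `find_outliers_zscore`, the default threshold of 3, and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_outliers_zscore(data, threshold=3.0):
    """Return a boolean mask that is True where |z-score| exceeds threshold."""
    data = pd.Series(data, dtype=float)
    std = data.std(ddof=0)  # population standard deviation
    if std == 0 or np.isnan(std):
        # Constant or empty data: no point can be flagged.
        return pd.Series(False, index=data.index)
    z = (data - data.mean()) / std
    return z.abs() > threshold


# Hypothetical toy column: nineteen typical values and one extreme value.
values = pd.Series([10] * 19 + [100])
zscore_mask = find_outliers_zscore(values)
print("Outliers flagged:", int(zscore_mask.sum()))
```

Unlike the IQR rule, Z-scores depend on the mean and standard deviation, which the outliers themselves inflate. With the population standard deviation used here, no point in a sample of n values can have |z| above sqrt(n - 1), so a threshold of 3 can never trigger on samples of 10 or fewer points.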

        "## 7. Potential Next Steps\n",
        "- **Data Cleaning**: Handle missing data or outliers.\n",
        "- **Feature Engineering**: Create new features, transform existing ones.\n",
        "- **Splitting Data**: Split into training and test sets.\n",
        "- **Modeling**: Move on to the training script in `ml/scripts/train.py`.\n",
        "- **Evaluation**: Use `ml/scripts/evaluate.py` to evaluate the model.\n"
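
As a rough illustration of the splitting step listed above, the sketch below holds out a random 20% of rows using pandas alone. The toy frame, the 80/20 ratio, and the seed are assumptions. How `ml/scripts/train.py` actually splits data is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset.
toy = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Assumed settings: 20% test split with a fixed seed for reproducibility.
test_df = toy.sample(frac=0.2, random_state=42)
train_df = toy.drop(test_df.index)

print("Train rows:", len(train_df), "Test rows:", len(test_df))
```

scikit-learn's `train_test_split` is a common alternative, particularly when a stratified split is needed.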

        "## 8. Summary of Findings\n",
        "Write any key observations here. For example:\n",
        "- Which features are strongly correlated?\n",
        "- How many values are missing, and in which columns?\n",
        "- Which features appear to have outliers?\n",
        "- Any interesting distribution patterns?\n"