[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1epau1m6yAK5PKl2HkLNyKo3WRS-6Bhqy?usp=sharing)

In [1]:
%%capture

!pip install numpy matplotlib seaborn opencv-python pillow pandas tqdm scikit-image scikit-learn

Before data cleaning and annotation
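
The EDA class in this notebook reserves a `duplicates` entry in the issues dictionary built by `detect_potential_issues`, but none of its checks fill it. The following is a minimal sketch of one way to find duplicates before cleaning, assuming that only byte-identical files count (resized or re-encoded copies would need a perceptual hash instead). The helper name `find_exact_duplicates` and the constant `IMAGE_EXTENSIONS` are hypothetical and not part of the original notebook.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Same extensions the EDA class filters on
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff')


def find_exact_duplicates(dataset_path):
    """Group image files under dataset_path whose bytes are identical.

    Hypothetical helper (not in the original notebook). Returns a list of
    lists; each inner list holds the paths that share one SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(dataset_path).rglob('*')):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(str(path))
    # Keep only digests that occur more than once
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Because the search walks all class folders together, it can also surface the same image filed under two different severity scales, which may be worth reviewing before annotation.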

In [2]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from collections import Counter
import pandas as pd
from tqdm import tqdm
import random
from skimage.feature import graycomatrix, graycoprops
from sklearn.decomposition import PCA
from concurrent.futures import ThreadPoolExecutor

In [3]:
class GroundnutLeafspotEDA:
    def __init__(self, dataset_path):
        """
        Initialize with the path to the Groundnut leaf spot dataset.

        Args:
            dataset_path (str): Path to the parent folder containing the 6 scale folders
        """
        self.dataset_path = dataset_path
        self.classes = ["Leafspot Scale 1", "Leafspot Scale 2", "Leafspot Scale 3",
                        "Leafspot Scale 4", "Leafspot Scale 5", "Leafspot Scale 6"]
        self.results = {}

    def analyze_dataset_structure(self):
        """Analyze the structure of the dataset and count images per class"""
        print("Analyzing dataset structure...")

        class_counts = {}
        total_images = 0

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                print(f"Warning: Path {class_path} does not exist.")
                class_counts[class_name] = 0
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]
            class_counts[class_name] = len(image_files)
            total_images += len(image_files)

        self.results['class_counts'] = class_counts
        self.results['total_images'] = total_images

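        # Print summary
        print(f"Total images: {total_images}")
        for class_name, count in class_counts.items():
            # Guard against division by zero when no images were found
            percent = count / total_images * 100 if total_images else 0.0
            print(f"{class_name}: {count} images ({percent:.2f}%)")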

        # Create distribution plot
        plt.figure(figsize=(12, 6))
        sns.barplot(x=list(class_counts.keys()), y=list(class_counts.values()))
        plt.title("Distribution of Images Across Classes")
        plt.xlabel("Class")
        plt.ylabel("Number of Images")
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig("class_distribution.png")
        plt.close()

        return class_counts

    def analyze_image_properties(self, sample_size=100):
        """
        Analyze properties of images: dimensions, aspect ratios, file sizes, formats

        Args:
            sample_size (int): Number of images to sample from each class for analysis
        """
        print("Analyzing image properties...")

        dimensions = []
        aspect_ratios = []
        file_sizes = []
        file_formats = []

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                image_files = random.sample(image_files, sample_size)

            for img_file in image_files:
                img_path = os.path.join(class_path, img_file)

                # Get file size
                file_size = os.path.getsize(img_path) / 1024  # size in KB
                file_sizes.append(file_size)

                # Get file format
                file_format = os.path.splitext(img_file)[1].lower()
                file_formats.append(file_format)

                # Open image to get dimensions
                try:
                    with Image.open(img_path) as img:
                        width, height = img.size
                        dimensions.append((width, height))
                        aspect_ratio = width / height
                        aspect_ratios.append(aspect_ratio)
                except Exception as e:
                    print(f"Error processing {img_path}: {e}")

        # Store results
        self.results['dimensions'] = dimensions
        self.results['aspect_ratios'] = aspect_ratios
        self.results['file_sizes'] = file_sizes
        self.results['file_formats'] = file_formats

        # Create visualizations

        # Image dimensions scatter plot
        plt.figure(figsize=(10, 8))
        if dimensions:  # unpacking zip(*dimensions) fails when no image could be opened
            widths, heights = zip(*dimensions)
            plt.scatter(widths, heights, alpha=0.5)
        plt.title("Image Dimensions")
        plt.xlabel("Width (pixels)")
        plt.ylabel("Height (pixels)")
        plt.tight_layout()
        plt.savefig("image_dimensions.png")
        plt.close()

        # Aspect ratio histogram
        plt.figure(figsize=(10, 6))
        plt.hist(aspect_ratios, bins=20)
        plt.title("Distribution of Aspect Ratios")
        plt.xlabel("Aspect Ratio (width/height)")
        plt.ylabel("Count")
        plt.tight_layout()
        plt.savefig("aspect_ratios.png")
        plt.close()

        # File size histogram
        plt.figure(figsize=(10, 6))
        plt.hist(file_sizes, bins=20)
        plt.title("Distribution of File Sizes")
        plt.xlabel("File Size (KB)")
        plt.ylabel("Count")
        plt.tight_layout()
        plt.savefig("file_sizes.png")
        plt.close()

        # File formats pie chart
        plt.figure(figsize=(8, 8))
        format_counts = Counter(file_formats)
        plt.pie(format_counts.values(), labels=format_counts.keys(), autopct='%1.1f%%')
        plt.title("Distribution of File Formats")
        plt.tight_layout()
        plt.savefig("file_formats.png")
        plt.close()

    def analyze_image_quality(self, sample_size=50):
        """
        Analyze image quality: brightness, contrast, blur, noise

        Args:
            sample_size (int): Number of images to sample from each class
        """
        print("Analyzing image quality...")

        brightness_values = []
        contrast_values = []
        blur_scores = []

        class_brightness = {c: [] for c in self.classes}
        class_contrast = {c: [] for c in self.classes}
        class_blur = {c: [] for c in self.classes}

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                image_files = random.sample(image_files, sample_size)

            for img_file in image_files:
                img_path = os.path.join(class_path, img_file)

                try:
                    # Read image with OpenCV
                    img = cv2.imread(img_path)
                    if img is None:
                        print(f"Warning: Could not read {img_path}")
                        continue

                    # Convert to grayscale for some metrics
                    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

                    # Calculate brightness (mean pixel value)
                    brightness = np.mean(gray)
                    brightness_values.append(brightness)
                    class_brightness[class_name].append(brightness)

                    # Calculate contrast (standard deviation of pixel values)
                    contrast = np.std(gray)
                    contrast_values.append(contrast)
                    class_contrast[class_name].append(contrast)

                    # Calculate blur score using Laplacian variance
                    laplacian = cv2.Laplacian(gray, cv2.CV_64F)
                    blur_score = np.var(laplacian)
                    blur_scores.append(blur_score)
                    class_blur[class_name].append(blur_score)

                except Exception as e:
                    print(f"Error processing {img_path}: {e}")

        # Store results
        self.results['brightness_values'] = brightness_values
        self.results['contrast_values'] = contrast_values
        self.results['blur_scores'] = blur_scores
        self.results['class_brightness'] = class_brightness
        self.results['class_contrast'] = class_contrast
        self.results['class_blur'] = class_blur

        # Create visualizations

        # Brightness distribution
        plt.figure(figsize=(10, 6))
        plt.hist(brightness_values, bins=20)
        plt.title("Distribution of Image Brightness")
        plt.xlabel("Brightness (mean pixel value)")
        plt.ylabel("Count")
        plt.tight_layout()
        plt.savefig("brightness_distribution.png")
        plt.close()

        # Contrast distribution
        plt.figure(figsize=(10, 6))
        plt.hist(contrast_values, bins=20)
        plt.title("Distribution of Image Contrast")
        plt.xlabel("Contrast (std of pixel values)")
        plt.ylabel("Count")
        plt.tight_layout()
        plt.savefig("contrast_distribution.png")
        plt.close()

        # Blur score distribution
        plt.figure(figsize=(10, 6))
        plt.hist(blur_scores, bins=20)
        plt.title("Distribution of Image Blur Scores")
        plt.xlabel("Blur Score (Laplacian variance)")
        plt.ylabel("Count")
        plt.tight_layout()
        plt.savefig("blur_distribution.png")
        plt.close()

        # Boxplot of brightness by class (classes without readable images are skipped)
        plotted = {c: values for c, values in class_brightness.items() if values}
        if plotted:
            plt.figure(figsize=(12, 6))
            plt.boxplot(list(plotted.values()))
            plt.title("Brightness Distribution by Class")
            plt.xlabel("Class")
            plt.ylabel("Brightness")
            # Set tick labels explicitly instead of via boxplot's `labels` keyword,
            # which Matplotlib 3.9 deprecated in favor of `tick_labels`
            plt.xticks(range(1, len(plotted) + 1), list(plotted.keys()), rotation=45)
            plt.tight_layout()
            plt.savefig("brightness_by_class.png")
            plt.close()

        # Boxplot of contrast by class (classes without readable images are skipped)
        plotted = {c: values for c, values in class_contrast.items() if values}
        if plotted:
            plt.figure(figsize=(12, 6))
            plt.boxplot(list(plotted.values()))
            plt.title("Contrast Distribution by Class")
            plt.xlabel("Class")
            plt.ylabel("Contrast")
            # Set tick labels explicitly instead of via boxplot's `labels` keyword,
            # which Matplotlib 3.9 deprecated in favor of `tick_labels`
            plt.xticks(range(1, len(plotted) + 1), list(plotted.keys()), rotation=45)
            plt.tight_layout()
            plt.savefig("contrast_by_class.png")
            plt.close()

    def analyze_color_distribution(self, sample_size=50):
        """
        Analyze color distributions across images and classes

        Args:
            sample_size (int): Number of images to sample from each class
        """
        print("Analyzing color distributions...")

        # Collect average RGB values per image and class
        avg_colors = {c: {'r': [], 'g': [], 'b': []} for c in self.classes}
        all_colors = {'r': [], 'g': [], 'b': []}

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                image_files = random.sample(image_files, sample_size)

            for img_file in image_files:
                img_path = os.path.join(class_path, img_file)

                try:
                    # Read image with OpenCV (BGR format)
                    img = cv2.imread(img_path)
                    if img is None:
                        continue

                    # Convert BGR to RGB
                    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

                    # Calculate average RGB
                    avg_r = np.mean(img_rgb[:, :, 0])
                    avg_g = np.mean(img_rgb[:, :, 1])
                    avg_b = np.mean(img_rgb[:, :, 2])

                    # Store by class
                    avg_colors[class_name]['r'].append(avg_r)
                    avg_colors[class_name]['g'].append(avg_g)
                    avg_colors[class_name]['b'].append(avg_b)

                    # Store all colors
                    all_colors['r'].append(avg_r)
                    all_colors['g'].append(avg_g)
                    all_colors['b'].append(avg_b)

                except Exception as e:
                    print(f"Error processing {img_path}: {e}")

        # Store results
        self.results['avg_colors'] = avg_colors
        self.results['all_colors'] = all_colors

        # Create visualizations

        # RGB distribution across all images
        plt.figure(figsize=(15, 5))

        plt.subplot(1, 3, 1)
        plt.hist(all_colors['r'], bins=20, color='red', alpha=0.7)
        plt.title("Red Channel Distribution")
        plt.xlabel("Average Red Value")
        plt.ylabel("Count")

        plt.subplot(1, 3, 2)
        plt.hist(all_colors['g'], bins=20, color='green', alpha=0.7)
        plt.title("Green Channel Distribution")
        plt.xlabel("Average Green Value")

        plt.subplot(1, 3, 3)
        plt.hist(all_colors['b'], bins=20, color='blue', alpha=0.7)
        plt.title("Blue Channel Distribution")
        plt.xlabel("Average Blue Value")

        plt.tight_layout()
        plt.savefig("rgb_distribution.png")
        plt.close()

        # 3D scatter plot of RGB values by class
        fig = plt.figure(figsize=(10, 8))
        ax = fig.add_subplot(111, projection='3d')

        colors = ['r', 'g', 'b', 'c', 'm', 'y']

        for i, class_name in enumerate(self.classes):
            if not avg_colors[class_name]['r']:  # Skip if empty
                continue

            ax.scatter(
                avg_colors[class_name]['r'],
                avg_colors[class_name]['g'],
                avg_colors[class_name]['b'],
                c=colors[i % len(colors)],
                label=class_name,
                alpha=0.7
            )

        ax.set_xlabel('Red')
        ax.set_ylabel('Green')
        ax.set_zlabel('Blue')
        ax.set_title('RGB Color Distribution by Class')
        plt.legend()
        plt.tight_layout()
        plt.savefig("rgb_by_class_3d.png")
        plt.close()

    def detect_potential_issues(self, sample_size=30):
        """
        Detect potential issues in the dataset that need cleaning or special attention

        Args:
            sample_size (int): Number of images to sample from each class
        """
        print("Detecting potential issues...")

        issues = {
            'duplicates': [],  # reserved; none of the checks below populate this list
            'corrupted': [],
            'very_low_brightness': [],
            'very_high_brightness': [],
            'very_blurry': [],
            'unusual_aspect_ratio': [],
            'outlier_size': []
        }

        # Set thresholds for issue detection
        brightness_low_threshold = 40
        brightness_high_threshold = 220
        blur_threshold = 50  # Lower values indicate more blur
        aspect_ratio_thresholds = (0.5, 2.0)  # (min, max)

        # Sample images for processing
        all_image_files = []

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [os.path.join(class_path, f) for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                sampled_files = random.sample(image_files, sample_size)
            else:
                sampled_files = image_files

            all_image_files.extend(sampled_files)

        # Process each image
        for img_path in tqdm(all_image_files, desc="Checking images"):
            try:
                # Try to open the image
                img = cv2.imread(img_path)
                if img is None:
                    issues['corrupted'].append(img_path)
                    continue

                # Check brightness
                gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                brightness = np.mean(gray)

                if brightness < brightness_low_threshold:
                    issues['very_low_brightness'].append((img_path, brightness))
                elif brightness > brightness_high_threshold:
                    issues['very_high_brightness'].append((img_path, brightness))

                # Check blur
                laplacian = cv2.Laplacian(gray, cv2.CV_64F)
                blur_score = np.var(laplacian)

                if blur_score < blur_threshold:
                    issues['very_blurry'].append((img_path, blur_score))

                # Check aspect ratio
                height, width = img.shape[:2]
                aspect_ratio = width / height

                if aspect_ratio < aspect_ratio_thresholds[0] or aspect_ratio > aspect_ratio_thresholds[1]:
                    issues['unusual_aspect_ratio'].append((img_path, aspect_ratio))

                # Check file size
                file_size = os.path.getsize(img_path) / 1024  # KB
                if file_size < 10 or file_size > 1000:  # Example thresholds
                    issues['outlier_size'].append((img_path, file_size))

            except Exception as e:
                print(f"Error processing {img_path}: {e}")
                issues['corrupted'].append(img_path)

        # Store results
        self.results['issues'] = issues

        # Print summary of issues
        print("\nPotential issues detected:")
        for issue_type, issue_list in issues.items():
            print(f"  {issue_type}: {len(issue_list)} images")

        return issues

    def sample_images_by_class(self, samples_per_class=5, figsize=(15, 10)):
        """
        Display random sample images from each class

        Args:
            samples_per_class (int): Number of sample images to display per class
            figsize (tuple): Figure size for the display
        """
        print("Sampling images from each class...")

        # Determine grid dimensions
        n_classes = len(self.classes)
        n_samples = samples_per_class

        # squeeze=False keeps `axes` 2D even with one class or one sample per class
        fig, axes = plt.subplots(n_classes, n_samples, figsize=figsize, squeeze=False)

        for i, class_name in enumerate(self.classes):
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Skip if no images in this class
            if not image_files:
                continue

            # Select random samples
            if len(image_files) > n_samples:
                selected_files = random.sample(image_files, n_samples)
            else:
                selected_files = image_files

            # Fill remaining slots with blank images if needed
            selected_files.extend([''] * (n_samples - len(selected_files)))

            # Display each sample
            for j, img_file in enumerate(selected_files[:n_samples]):
                if not img_file:
                    axes[i, j].axis('off')
                    continue

                img_path = os.path.join(class_path, img_file)
                try:
                    img = cv2.imread(img_path)
                    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
                    axes[i, j].imshow(img)
                    axes[i, j].set_title(f"{class_name}\n{img_file[:10]}...")
                    axes[i, j].axis('off')
                except Exception as e:
                    print(f"Error displaying {img_path}: {e}")
                    axes[i, j].axis('off')

        plt.tight_layout()
        plt.savefig("sample_images.png")
        plt.close()

    def analyze_texture_features(self, sample_size=30):
        """
        Analyze texture features using GLCM and visualize differences between classes

        Args:
            sample_size (int): Number of images to sample from each class
        """
        print("Analyzing texture features...")

        # Features to extract
        properties = ['contrast', 'dissimilarity', 'homogeneity', 'energy', 'correlation']

        # Dictionary to store features by class
        texture_features = {c: {prop: [] for prop in properties} for c in self.classes}

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                image_files = random.sample(image_files, sample_size)

            for img_file in tqdm(image_files, desc=f"Processing {class_name}"):
                img_path = os.path.join(class_path, img_file)

                try:
                    # Read image and convert to grayscale
                    img = cv2.imread(img_path)
                    if img is None:
                        continue

                    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

                    # Resize to a fixed 128x128 to bound GLCM computation time
                    # (this ignores the original aspect ratio and changes texture scale)
                    resized = cv2.resize(gray, (128, 128))

                    # Calculate GLCM
                    distances = [1]
                    angles = [0, np.pi/4, np.pi/2, 3*np.pi/4]
                    glcm = graycomatrix(resized, distances=distances, angles=angles,
                                      levels=256, symmetric=True, normed=True)

                    # Calculate properties
                    for prop in properties:
                        glcm_prop = graycoprops(glcm, prop).mean()
                        texture_features[class_name][prop].append(glcm_prop)

                except Exception as e:
                    print(f"Error processing {img_path}: {e}")

        # Store results
        self.results['texture_features'] = texture_features

        # Create visualizations for each property
        for prop in properties:
            plt.figure(figsize=(12, 6))

            data_to_plot = []
            labels = []

            for class_name in self.classes:
                if texture_features[class_name][prop]:
                    data_to_plot.append(texture_features[class_name][prop])
                    labels.append(class_name)

            if data_to_plot:
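                plt.boxplot(data_to_plot)
                plt.title(f"GLCM {prop.capitalize()} by Class")
                plt.xlabel("Class")
                plt.ylabel(prop.capitalize())
                # Set tick labels explicitly; boxplot's `labels` keyword is deprecated since Matplotlib 3.9
                plt.xticks(range(1, len(labels) + 1), labels, rotation=45)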
                plt.tight_layout()
                plt.savefig(f"texture_{prop}.png")

            plt.close()

        # Create PCA visualization for texture features
        self.visualize_texture_pca(texture_features, properties)

    def visualize_texture_pca(self, texture_features, properties):
        """
        Create PCA visualization for texture features to see class separability

        Args:
            texture_features (dict): Dictionary of texture features by class
            properties (list): List of feature properties used
        """
        # Prepare data for PCA
        X = []
        y = []
        class_names = []

        for class_name in self.classes:
            if not any(texture_features[class_name].values()):
                continue

            class_names.append(class_name)
            n_samples = len(texture_features[class_name][properties[0]])

            for i in range(n_samples):
                features = [texture_features[class_name][prop][i] for prop in properties]
                X.append(features)
                y.append(class_names.index(class_name))

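        # PCA with two components needs at least two samples
        if len(X) < 2:
            return

        X = np.array(X)

        # Apply PCA. The features are not standardized here, so properties with
        # large numeric ranges (typically GLCM contrast) dominate the components.
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X)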

        # Create scatter plot
        plt.figure(figsize=(10, 8))

        colors = plt.cm.tab10(np.linspace(0, 1, len(class_names)))

        for i, class_name in enumerate(class_names):
            idx = np.array(y) == i
            plt.scatter(X_pca[idx, 0], X_pca[idx, 1], c=[colors[i]],
                      label=class_name, alpha=0.7)

        plt.title("PCA of Texture Features")
        plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)")
        plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)")
        plt.legend()
        plt.tight_layout()
        plt.savefig("texture_pca.png")
        plt.close()

    def generate_eda_report(self):
        """Generate a summary report of the EDA findings"""
        report = {
            "dataset_summary": {
                "total_images": self.results.get('total_images', 0),
                "class_counts": self.results.get('class_counts', {})
            },
            "image_properties": {
                "unique_dimensions": len(set(self.results.get('dimensions', []))),
                # `or [0]` also covers a stored-but-empty list, where np.mean would return nan
                "avg_file_size_kb": np.mean(self.results.get('file_sizes') or [0]),
                "file_formats": Counter(self.results.get('file_formats', []))
            },
            "image_quality": {
                "avg_brightness": np.mean(self.results.get('brightness_values') or [0]),
                "avg_contrast": np.mean(self.results.get('contrast_values') or [0]),
                "avg_blur_score": np.mean(self.results.get('blur_scores') or [0])
            },
            "issues_summary": {
                issue_type: len(issues)
                for issue_type, issues in self.results.get('issues', {}).items()
            },
            "cleaning_recommendations": []
        }

        # Generate cleaning recommendations
        if self.results.get('issues'):
            issues = self.results['issues']

            if len(issues.get('corrupted', [])) > 0:
                report['cleaning_recommendations'].append(
                    f"Remove {len(issues['corrupted'])} corrupted images that couldn't be opened.")

            if len(issues.get('very_low_brightness', [])) > 0:
                report['cleaning_recommendations'].append(
                    f"Consider adjusting brightness for {len(issues['very_low_brightness'])} very dark images.")

            if len(issues.get('very_high_brightness', [])) > 0:
                report['cleaning_recommendations'].append(
                    f"Consider adjusting brightness for {len(issues['very_high_brightness'])} very bright images.")

            if len(issues.get('very_blurry', [])) > 0:
                report['cleaning_recommendations'].append(
                    f"Consider removing or enhancing {len(issues['very_blurry'])} blurry images.")

            if len(issues.get('unusual_aspect_ratio', [])) > 0:
                report['cleaning_recommendations'].append(
                    f"Standardize dimensions for {len(issues['unusual_aspect_ratio'])} images with unusual aspect ratios.")

        # Class imbalance check
        if self.results.get('class_counts'):
            counts = list(self.results['class_counts'].values())
            if max(counts) > 2 * min(counts):
                report['cleaning_recommendations'].append(
                    "Address class imbalance through augmentation of minority classes or sampling strategies.")

        # Standard deviation of brightness across classes
        if self.results.get('class_brightness'):
            class_means = [np.mean(vals) for vals in self.results['class_brightness'].values() if vals]
            if class_means and np.std(class_means) > 20:  # Arbitrary threshold
                report['cleaning_recommendations'].append(
                    "Consider normalizing brightness across classes as there is significant variation.")

        # Print report
        print("\n========== EDA REPORT ==========")
        print(f"Dataset contains {report['dataset_summary']['total_images']} images across {len(report['dataset_summary']['class_counts'])} classes")

        print("\nImage Properties:")
        print(f"  - {report['image_properties']['unique_dimensions']} unique image dimensions")
        print(f"  - Average file size: {report['image_properties']['avg_file_size_kb']:.2f} KB")
        print(f"  - File formats: {dict(report['image_properties']['file_formats'])}")

        print("\nImage Quality:")
        print(f"  - Average brightness: {report['image_quality']['avg_brightness']:.2f}")
        print(f"  - Average contrast: {report['image_quality']['avg_contrast']:.2f}")
        print(f"  - Average blur score: {report['image_quality']['avg_blur_score']:.2f}")

        print("\nIssues Detected:")
        for issue_type, count in report['issues_summary'].items():
            print(f"  - {issue_type}: {count}")

        print("\nCleaning Recommendations:")
        for i, rec in enumerate(report['cleaning_recommendations'], 1):
            print(f"  {i}. {rec}")

        return report

    def analyze_leafspot_specific_features(self, sample_size=30):
        """
        Analyze features specific to leaf spot disease severity scales

        Args:
            sample_size (int): Number of images to sample from each class
        """
        print("Analyzing leaf spot specific features...")

        # Store leaf spot related metrics by class
        spot_metrics = {c: {
            'spot_count': [],
            'spot_area_percent': [],
            'leaf_area_percent': []
        } for c in self.classes}

        for class_name in self.classes:
            class_path = os.path.join(self.dataset_path, class_name)
            if not os.path.exists(class_path):
                continue

            image_files = [f for f in os.listdir(class_path)
                          if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'))]

            # Sample images if there are too many
            if len(image_files) > sample_size:
                image_files = random.sample(image_files, sample_size)

            for img_file in image_files:
                img_path = os.path.join(class_path, img_file)

                try:
                    # Read image
                    img = cv2.imread(img_path)
                    if img is None:
                        continue

                    # Convert to HSV to better isolate leaf and spots
                    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

                    # Detect green areas (likely leaf areas)
                    # Adjust these ranges based on your specific dataset
                    lower_green = np.array([25, 40, 40])
                    upper_green = np.array([85, 255, 255])
                    leaf_mask = cv2.inRange(hsv, lower_green, upper_green)

                    # Detect potential leaf spots (darker regions in the green areas)
                    # Simple fixed threshold; brown lesions can fall outside the green hue
                    # range above, so treat the result as a rough proxy and tune both ranges
                    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                    _, spot_mask = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY_INV)

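                    # Apply the leaf mask to only look for spots on the leaf.
                    # Note: leaf_mask keeps only green-hued pixels, so lesions whose hue falls
                    # outside the green range above (e.g. brown tissue) may be excluded here.
                    spot_on_leaf = cv2.bitwise_and(spot_mask, spot_mask, mask=leaf_mask)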

                    # Count spots (connected components); the label image, stats and
                    # centroids are not used, so they are discarded
                    num_labels, _, _, _ = cv2.connectedComponentsWithStats(spot_on_leaf, connectivity=8)
                    # Label 0 is the background, so subtract 1
                    spot_count = max(num_labels - 1, 0)

                    # Calculate area percentages
                    img_area = img.shape[0] * img.shape[1]
                    leaf_area = cv2.countNonZero(leaf_mask)
                    spot_area = cv2.countNonZero(spot_on_leaf)

                    leaf_area_percent = (leaf_area / img_area) * 100
                    spot_area_percent = (spot_area / leaf_area) * 100 if leaf_area > 0 else 0

                    # Store metrics
                    spot_metrics[class_name]['spot_count'].append(spot_count)
                    spot_metrics[class_name]['spot_area_percent'].append(spot_area_percent)
                    spot_metrics[class_name]['leaf_area_percent'].append(leaf_area_percent)

                except Exception as e:
                    print(f"Error processing {img_path}: {e}")

        # Store results
        self.results['spot_metrics'] = spot_metrics

        # Create visualizations

        # Box plot of spot count by class
        plt.figure(figsize=(12, 6))
        data_to_plot = []
        labels = []

        for class_name in self.classes:
            if spot_metrics[class_name]['spot_count']:
                data_to_plot.append(spot_metrics[class_name]['spot_count'])
                labels.append(class_name)

        if data_to_plot:
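            plt.boxplot(data_to_plot)
            plt.title("Leaf Spot Count by Class")
            plt.xlabel("Class")
            plt.ylabel("Number of Spots")
            # Set tick labels through xticks (boxes sit at positions 1..N by default);
            # the `labels` keyword of boxplot is deprecated in Matplotlib 3.9+
            plt.xticks(range(1, len(labels) + 1), labels, rotation=45)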
            plt.tight_layout()
            plt.savefig("spot_count_by_class.png")
        plt.close()

        # Box plot of spot area percentage by class
        plt.figure(figsize=(12, 6))
        data_to_plot = []
        labels = []

        for class_name in self.classes:
            if spot_metrics[class_name]['spot_area_percent']:
                data_to_plot.append(spot_metrics[class_name]['spot_area_percent'])
                labels.append(class_name)

        if data_to_plot:
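            plt.boxplot(data_to_plot)
            plt.title("Leaf Spot Area Percentage by Class")
            plt.xlabel("Class")
            plt.ylabel("Spot Area % of Leaf")
            # Set tick labels through xticks (boxes sit at positions 1..N by default);
            # the `labels` keyword of boxplot is deprecated in Matplotlib 3.9+
            plt.xticks(range(1, len(labels) + 1), labels, rotation=45)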
            plt.tight_layout()
            plt.savefig("spot_area_by_class.png")
        plt.close()

        return spot_metrics

    def run_full_eda(self, output_dir="./eda_results"):
        """
        Run full EDA pipeline and save results

        Args:
            output_dir (str): Directory to save EDA results
        """
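        # json is used below; import it here so this method does not rely on a later cell
        import json

        # Resolve the dataset path before changing directory so a relative path keeps working
        self.dataset_path = os.path.abspath(self.dataset_path)

        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)

        # Change working directory to output dir for saving results
        original_dir = os.getcwd()
        os.chdir(output_dir)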

        print(f"Running full EDA on Groundnut Leaf Spot dataset at {self.dataset_path}")
        print(f"Results will be saved to {output_dir}")

        # Run all analysis methods
        self.analyze_dataset_structure()
        self.analyze_image_properties()
        self.analyze_image_quality()
        self.analyze_color_distribution()
        self.sample_images_by_class()
        self.analyze_texture_features()
        self.analyze_leafspot_specific_features()
        self.detect_potential_issues()

        # Generate summary report
        report = self.generate_eda_report()

        # Save report as JSON
        with open("eda_report.json", "w") as f:
            json.dump(report, f, indent=4)

        # Save report as markdown
        with open("eda_report.md", "w") as f:
            f.write("# Groundnut Leaf Spot Dataset EDA Report\n\n")

            f.write("## Dataset Summary\n")
            total_images = report['dataset_summary']['total_images']
            f.write(f"- Total images: {total_images}\n")
            f.write("- Class distribution:\n")
            for cls, count in report['dataset_summary']['class_counts'].items():
                # Guard against an empty dataset to avoid dividing by zero
                share = count / total_images * 100 if total_images else 0.0
                f.write(f"  - {cls}: {count} images ({share:.2f}%)\n")

            f.write("\n## Image Properties\n")
            f.write(f"- Unique dimensions: {report['image_properties']['unique_dimensions']}\n")
            f.write(f"- Average file size: {report['image_properties']['avg_file_size_kb']:.2f} KB\n")
            f.write("- File formats:\n")
            for fmt, count in report['image_properties']['file_formats'].items():
                f.write(f"  - {fmt}: {count}\n")

            f.write("\n## Image Quality\n")
            f.write(f"- Average brightness: {report['image_quality']['avg_brightness']:.2f}\n")
            f.write(f"- Average contrast: {report['image_quality']['avg_contrast']:.2f}\n")
            f.write(f"- Average blur score: {report['image_quality']['avg_blur_score']:.2f}\n")

            f.write("\n## Issues Detected\n")
            for issue_type, count in report['issues_summary'].items():
                f.write(f"- {issue_type}: {count}\n")

            f.write("\n## Cleaning Recommendations\n")
            for i, rec in enumerate(report['cleaning_recommendations'], 1):
                f.write(f"{i}. {rec}\n")

            f.write("\n## Visualizations\n")
            f.write("The following visualizations were generated during EDA:\n\n")

            # List all PNG files in a stable (sorted) order
            vis_files = sorted(f for f in os.listdir() if f.lower().endswith('.png'))
            for vis_file in vis_files:
                name = os.path.splitext(vis_file)[0].replace('_', ' ')
                f.write(f"- [{name}]({vis_file})\n")

        # Return to original directory
        os.chdir(original_dir)

        print(f"\nEDA completed. Results saved to {output_dir}")

        return report
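
`run_full_eda` changes the working directory and restores it only at the end, so an exception in any analysis step would leave the notebook inside the output folder. One possible way to make the restore unconditional is a small context manager. This is a sketch only: `working_directory` is a hypothetical helper, not part of the class above.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path` (hypothetical helper)."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises
        os.chdir(previous)


# Demonstration with a throwaway directory and a simulated failure
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    try:
        with working_directory(tmp):
            raise RuntimeError("simulated analysis failure")
    except RuntimeError:
        pass
    restored = os.getcwd()

print("Working directory restored:", restored == start)
```

Inside `run_full_eda`, the analysis calls and report writing could then sit in a `with working_directory(output_dir):` block instead of the paired `os.chdir` calls.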


In [4]:
# Helper functions for image annotation suggestions
def suggest_annotation_approach(spot_metrics, class_counts):
    """
    Suggest appropriate annotation approaches based on EDA results

    Args:
        spot_metrics (dict): Metrics related to leaf spots from EDA
        class_counts (dict): Number of images per class

    Returns:
        dict: Annotation recommendations
    """
    # Initialize recommendations
    recommendations = {
        "annotation_type": None,
        "tools": [],
        "approach": "",
        "special_considerations": []
    }

    # Calculate average spot counts across classes
    avg_spot_counts = {}
    for class_name, metrics in spot_metrics.items():
        if metrics['spot_count']:
            avg_spot_counts[class_name] = np.mean(metrics['spot_count'])

    # Determine annotation type based on average spot counts
    max_avg_spots = max(avg_spot_counts.values()) if avg_spot_counts else 0

    if max_avg_spots > 20:
        # Many spots - semantic segmentation might be best
        recommendations["annotation_type"] = "semantic_segmentation"
        recommendations["tools"] = ["LabelMe", "CVAT", "Supervisely"]
        recommendations["approach"] = "Use pixel-level segmentation to mark all leaf spot areas. This will provide the most detailed information for severe cases."
        recommendations["special_considerations"].append("Focus on accurately marking boundaries between healthy and diseased tissue.")
    elif max_avg_spots > 5:
        # Moderate number of spots - could do instance segmentation or object detection
        recommendations["annotation_type"] = "instance_segmentation"
        recommendations["tools"] = ["VGG Image Annotator (VIA)", "CVAT", "Roboflow"]
        recommendations["approach"] = "Mark individual spots as separate instances. This will help differentiate between spot sizes and distributions."
        recommendations["special_considerations"].append("Consider grouping very small spots if they're clustered together.")
    else:
        # Few spots - bounding boxes might be sufficient
        recommendations["annotation_type"] = "object_detection"
        recommendations["tools"] = ["LabelImg", "CVAT", "Roboflow"]
        recommendations["approach"] = "Use bounding boxes around individual leaf spots. For early stages with few spots, this is efficient and sufficient."
        recommendations["special_considerations"].append("Make sure to annotate even small or faint spots in early disease stages.")

    # Additional considerations
    total_images = sum(class_counts.values())

    if total_images > 1000:
        recommendations["special_considerations"].append("Given the large dataset size, consider active learning approaches to prioritize which images to annotate.")

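    # Check for class imbalance (largest class more than twice the smallest);
    # skip the check when no class counts are available
    if class_counts and max(class_counts.values()) > 2 * min(class_counts.values()):
        recommendations["special_considerations"].append("Address class imbalance by ensuring thorough annotation of minority classes.")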

    # Add annotation schema recommendation
    recommendations["annotation_schema"] = {
        "classes": ["leaf", "leaf_spot"],
        "attributes": {
            "leaf": ["healthy", "diseased"],
            "leaf_spot": ["severity_scale"]
        }
    }

    return recommendations
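
The thresholds in `suggest_annotation_approach` can be checked in isolation. The sketch below mirrors the same cut-offs (more than 20, more than 5, otherwise) in a hypothetical helper, `annotation_type_for`, and feeds it made-up spot counts. The numbers are illustrative only, not measurements from the dataset.

```python
from statistics import mean


def annotation_type_for(max_avg_spots):
    """Map the largest per-class mean spot count to an annotation type (hypothetical helper)."""
    if max_avg_spots > 20:
        return "semantic_segmentation"
    if max_avg_spots > 5:
        return "instance_segmentation"
    return "object_detection"


# Made-up spot counts per class, for illustration only
example_spot_counts = {
    "Leafspot Scale 1": [1, 2, 3],
    "Leafspot Scale 6": [18, 22, 26],
}

avg_spots = {name: mean(counts) for name, counts in example_spot_counts.items()}
max_avg = max(avg_spots.values())
chosen = annotation_type_for(max_avg)

print(f"Largest class mean: {max_avg}, suggested annotation type: {chosen}")
```

Both comparisons are strict, so a mean of exactly 20 falls into the instance segmentation branch and a mean of exactly 5 into object detection.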

In [5]:
# Example usage
if __name__ == "__main__":
    import json

    # Dataset and output locations on the mounted Google Drive
    dataset_dir = "/content/drive/MyDrive/MSCS_folder/Computer-Vision/Assignment three/Data/wetransfer_leafspot-scores-photos_2024-03-20_1533"
    output_dir = "/content/drive/MyDrive/MSCS_folder/Computer-Vision/Assignment three/Output_v1"

    # Run EDA
    eda = GroundnutLeafspotEDA(dataset_dir)
    report = eda.run_full_eda(output_dir)

    # Generate annotation recommendations
    if 'spot_metrics' in eda.results and 'class_counts' in eda.results:
        annotation_recs = suggest_annotation_approach(
            eda.results['spot_metrics'],
            eda.results['class_counts']
        )

        # Save annotation recommendations next to the EDA results
        with open(os.path.join(output_dir, "annotation_recommendations.json"), "w") as f:
            json.dump(annotation_recs, f, indent=4)

        print("\nAnnotation recommendations generated and saved.")

Running full EDA on Groundnut Leaf Spot dataset at /content/drive/MyDrive/MSCS_folder/Computer-Vision/Assignment three/Data/wetransfer_leafspot-scores-photos_2024-03-20_1533
Results will be saved to /content/drive/MyDrive/MSCS_folder/Computer-Vision/Assignment three/Output_v1
Analyzing dataset structure...
Total images: 273
Leafspot Scale 1: 37 images (13.55%)
Leafspot Scale 2: 40 images (14.65%)
Leafspot Scale 3: 48 images (17.58%)
Leafspot Scale 4: 27 images (9.89%)
Leafspot Scale 5: 56 images (20.51%)
Leafspot Scale 6: 65 images (23.81%)
Analyzing image properties...
Analyzing image quality...




Analyzing color distributions...
Sampling images from each class...
Analyzing texture features...


Processing Leafspot Scale 1: 100%|██████████| 30/30 [00:11<00:00,  2.53it/s]
Processing Leafspot Scale 2: 100%|██████████| 30/30 [00:07<00:00,  3.96it/s]
Processing Leafspot Scale 3: 100%|██████████| 30/30 [00:09<00:00,  3.22it/s]
Processing Leafspot Scale 4: 100%|██████████| 27/27 [00:11<00:00,  2.29it/s]
Processing Leafspot Scale 5: 100%|██████████| 30/30 [00:07<00:00,  3.76it/s]
Processing Leafspot Scale 6: 100%|██████████| 30/30 [00:10<00:00,  2.76it/s]


Analyzing leaf spot specific features...




Detecting potential issues...


Checking images: 100%|██████████| 177/177 [01:29<00:00,  1.98it/s]


Potential issues detected:
  duplicates: 0 images
  corrupted: 0 images
  very_low_brightness: 0 images
  very_high_brightness: 0 images
  very_blurry: 66 images
  unusual_aspect_ratio: 0 images
  outlier_size: 177 images

Dataset contains 273 images across 6 classes

Image Properties:
  - 4 unique image dimensions
  - Average file size: 6676.75 KB
  - File formats: {'.jpg': 273}

Image Quality:
  - Average brightness: 116.49
  - Average contrast: 52.86
  - Average blur score: 210.99

Issues Detected:
  - duplicates: 0
  - corrupted: 0
  - very_low_brightness: 0
  - very_high_brightness: 0
  - very_blurry: 66
  - unusual_aspect_ratio: 0
  - outlier_size: 177

Cleaning Recommendations:
  1. Consider removing or enhancing 66 blurry images.
  2. Address class imbalance through augmentation of minority classes or sampling strategies.

EDA completed. Results saved to /content/drive/MyDrive/MSCS_folder/Computer-Vision/Assignment three/Output_v1

Annotation recommendations generated and saved.
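
The class imbalance recommendation in the output above can be reproduced from the printed class counts. The sketch below re-enters those counts by hand and applies the same "largest class more than twice the smallest" rule used in `suggest_annotation_approach`. The blurry share assumes that the 177 shown in the "Checking images" progress bar is the number of images actually checked.

```python
# Class counts copied from the printed output above
class_counts = {
    "Leafspot Scale 1": 37,
    "Leafspot Scale 2": 40,
    "Leafspot Scale 3": 48,
    "Leafspot Scale 4": 27,
    "Leafspot Scale 5": 56,
    "Leafspot Scale 6": 65,
}

total = sum(class_counts.values())
largest = max(class_counts.values())
smallest = min(class_counts.values())

# Same rule as in suggest_annotation_approach: largest > 2 * smallest
imbalance_ratio = largest / smallest
is_imbalanced = largest > 2 * smallest

# Share of checked images flagged as very blurry
# (assumption: 177 images were checked, per the progress bar)
blurry_share = 66 / 177 * 100

print(f"Total images: {total}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f} (imbalanced: {is_imbalanced})")
print(f"Very blurry: {blurry_share:.1f}% of checked images")
```

The smallest class (Scale 4, 27 images) is less than half the size of the largest (Scale 6, 65 images), which is why the imbalance recommendation is triggered.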


