
# Task
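Implement the Candidate Elimination algorithm to learn a concept from a medical diagnosis dataset. The dataset has the attributes 'Fever', 'Cough', and 'Fatigue', and the target 'Disease'. Because Candidate Elimination learns a binary concept, one disease ('Flu' in this notebook) is treated as the positive class and every other label as negative. After implementing the algorithm, display the final Most Specific Hypothesis (S) and Most General Hypothesis (G).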

## Define Dataset

### Subtask:
Define the medical diagnosis dataset with 'Fever', 'Cough', 'Fatigue' as attributes and 'Disease' as the target, based on the examples provided. This will be represented as a list of lists or a pandas DataFrame.


**Reasoning**:
The subtask requires defining a dataset as a pandas DataFrame. This involves importing pandas, creating a list of lists with the specified data, converting it into a DataFrame with appropriate column names, and then displaying the head of the DataFrame for verification.



In [None]:
import pandas as pd

# Define the dataset as a list of lists
data = [
    ['Yes', 'Yes', 'No', 'Flu'],
    ['No', 'Yes', 'Yes', 'Cold'],
    ['Yes', 'No', 'Yes', 'Flu'],
    ['No', 'Yes', 'No', 'Cold'],
    ['Yes', 'Yes', 'Yes', 'COVID-19'],
    ['No', 'No', 'No', 'Healthy'],
    ['Yes', 'No', 'No', 'Healthy'],
    ['No', 'Yes', 'Yes', 'COVID-19'],
    ['Yes', 'Yes', 'Yes', 'Flu'],
    ['No', 'No', 'Yes', 'Cold']
]

# Define the column names
columns = ['Fever', 'Cough', 'Fatigue', 'Disease']

# Create a pandas DataFrame
df_medical_diagnosis = pd.DataFrame(data, columns=columns)

# Display the first five of the ten rows (head() defaults to n=5)
print("Medical Diagnosis Dataset:")
print(df_medical_diagnosis.head())
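
Before running a concept learner on this table, it can help to check whether any attribute combination carries more than one label, since Candidate Elimination assumes noise-free data. The following is a minimal sketch in plain Python; the helper name `find_label_conflicts` is hypothetical and not part of the notebook, and the rows are copied from the dataset above.

```python
from collections import defaultdict

# Same rows as the dataset above: (Fever, Cough, Fatigue, Disease).
rows = [
    ('Yes', 'Yes', 'No', 'Flu'),
    ('No', 'Yes', 'Yes', 'Cold'),
    ('Yes', 'No', 'Yes', 'Flu'),
    ('No', 'Yes', 'No', 'Cold'),
    ('Yes', 'Yes', 'Yes', 'COVID-19'),
    ('No', 'No', 'No', 'Healthy'),
    ('Yes', 'No', 'No', 'Healthy'),
    ('No', 'Yes', 'Yes', 'COVID-19'),
    ('Yes', 'Yes', 'Yes', 'Flu'),
    ('No', 'No', 'Yes', 'Cold'),
]

def find_label_conflicts(rows):
    """Hypothetical helper: return attribute tuples that occur with more than one label."""
    labels_by_attrs = defaultdict(set)
    for *attrs, label in rows:
        labels_by_attrs[tuple(attrs)].add(label)
    return {
        attrs: sorted(labels)
        for attrs, labels in labels_by_attrs.items()
        if len(labels) > 1
    }

conflicts = find_label_conflicts(rows)
for attrs, labels in conflicts.items():
    print(attrs, "->", labels)
```

Any tuple reported here that mixes the target concept with another label is simultaneously a positive and a negative example, which no hypothesis can satisfy.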

## Initialize Hypotheses

### Subtask:
Initialize the Most Specific Hypothesis (S) with the first positive example and the Most General Hypothesis (G) with the most general possible hypotheses (e.g., all '?'s for each attribute).


**Reasoning**:
The subtask requires initializing the Most Specific Hypothesis (S) from the first positive example and the Most General Hypothesis (G) with all '?'s for each attribute. This involves identifying attributes, filtering for the target concept 'Flu', extracting the first positive example, and then constructing G with '?'s.



In [None]:
attributes = ['Fever', 'Cough', 'Fatigue']
target_concept = 'Flu'

# Select the rows labeled with the target concept; these are the positive examples
positive_examples = df_medical_diagnosis[df_medical_diagnosis['Disease'] == target_concept]

# Initialize S from the attribute values of the first positive example.
# Without a positive example there is nothing to initialize S from. Falling back to
# all '?' would turn S into the most general hypothesis, so fail loudly instead.
if positive_examples.empty:
    raise ValueError(f"No positive examples found for '{target_concept}'.")
s_initial = positive_examples.iloc[0][attributes].tolist()

# Initialize G with a single hypothesis that has '?' for each attribute
g_initial = [['?' for _ in attributes]]

print(f"Initial Most Specific Hypothesis (S): {s_initial}")
print(f"Initial Most General Hypothesis (G): {g_initial}")
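
Textbook presentations of Candidate Elimination usually start S from a "null" hypothesis that covers nothing and let the first positive example generalize it, which gives the same starting point as copying the first positive example directly. The sketch below illustrates that equivalence under stated assumptions: the `NULL` marker and the `min_generalize` helper are hypothetical names introduced only for this example.

```python
NULL = '0'  # hypothetical marker meaning "no value is acceptable"

def min_generalize(s, example):
    """Minimally generalize s so that it covers the positive example."""
    result = []
    for s_val, x_val in zip(s, example):
        if s_val == NULL:
            result.append(x_val)   # nothing accepted yet: adopt the example's value
        elif s_val == '?' or s_val == x_val:
            result.append(s_val)   # already covers this attribute value
        else:
            result.append('?')     # conflicting specific values: generalize
    return result

s = [NULL] * 3
s = min_generalize(s, ['Yes', 'Yes', 'No'])  # first 'Flu' row of the dataset
print(s)
```

Starting from the null hypothesis also avoids a special case when the first row of the data happens to be a negative example.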

## Implement Candidate Elimination Algorithm

### Subtask:
Implement the Candidate Elimination algorithm by iterating through each example in the dataset and updating the Most Specific Hypothesis (S) and Most General Hypothesis (G) accordingly.


**Reasoning**:
First, define the `covers` helper function. It determines whether a hypothesis matches (covers) a given instance, the basic test on which every other step of the Candidate Elimination algorithm relies.



In [None]:
def covers(hypothesis, instance):
    """
    Checks if a hypothesis covers an instance.
    A hypothesis covers an instance if for every attribute, the hypothesis's value for that attribute
    is either '?' or matches the instance's value.
    """
    for i in range(len(hypothesis)):
        if hypothesis[i] == '?':
            continue
        if hypothesis[i] != instance[i]:
            return False
    return True

print("Defined 'covers' helper function.")
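
As a quick illustration of the coverage test, here is an equivalent compact formulation together with a few checks on rows from the dataset. This is a sketch rather than a replacement: unlike the index-based loop above, `zip` silently stops at the shorter sequence, so it assumes hypothesis and instance have the same length.

```python
def covers(hypothesis, instance):
    """True if every hypothesis value is '?' or equals the instance value."""
    return all(h == '?' or h == x for h, x in zip(hypothesis, instance))

# The fully general hypothesis covers every instance.
print(covers(['?', '?', '?'], ['No', 'No', 'Yes']))
# A hypothesis requiring Fever == 'Yes' rejects an instance without fever.
print(covers(['Yes', '?', '?'], ['No', 'Yes', 'Yes']))
# A fully specific hypothesis covers only the identical instance.
print(covers(['Yes', 'Yes', 'No'], ['Yes', 'Yes', 'No']))
```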

**Reasoning**:
Now, define the `is_more_general_than` helper function, which determines if one hypothesis is more general than or equal to another, a key component for refining the general hypothesis set.



In [None]:
def is_more_general_than(h1, h2):
    """
    Checks if hypothesis h1 is more general than or equal to hypothesis h2.
    This means for every attribute, h1's value is either '?' or matches h2's value.
    """
    for i in range(len(h1)):
        if h1[i] == '?':
            continue
        if h1[i] != h2[i]:
            return False
    return True

print("Defined 'is_more_general_than' helper function.")
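
The more-general-than relation is only a partial order: two hypotheses can be incomparable, which is why G may hold several hypotheses at once. The sketch below restates the relation compactly and adds a hypothetical `compare` helper (not part of the notebook) to make the four possible outcomes explicit.

```python
def is_more_general_than(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def compare(h1, h2):
    """Hypothetical helper: describe how h1 relates to h2."""
    ge = is_more_general_than(h1, h2)
    le = is_more_general_than(h2, h1)
    if ge and le:
        return 'equal'
    if ge:
        return 'more general'
    if le:
        return 'more specific'
    return 'incomparable'

print(compare(['Yes', '?', '?'], ['Yes', 'Yes', 'No']))
print(compare(['Yes', 'Yes', 'No'], ['Yes', '?', '?']))
print(compare(['Yes', '?', '?'], ['?', '?', 'No']))
```

The last pair is incomparable because each hypothesis constrains an attribute that the other leaves open.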

**Reasoning**:
Next, define the `generalize_specific_hypothesis` helper function. It updates the Most Specific Hypothesis (S) when S fails to cover a positive example, replacing each mismatching attribute value with '?'.



In [None]:
def generalize_specific_hypothesis(s, positive_example_attrs):
    """
    Generalizes the specific hypothesis s to cover the positive example.
    """
    new_s = list(s) # Make a mutable copy
    for i in range(len(new_s)):
        if new_s[i] == '?':
            continue
        elif new_s[i] != positive_example_attrs[i]:
            new_s[i] = '?'
    return new_s

print("Defined 'generalize_specific_hypothesis' helper function.")
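
To see how S evolves, the generalization step can be traced over the three 'Flu' rows of the dataset in order. The sketch below uses a compact, behaviorally equivalent version of the helper; the `history` list is introduced only for this illustration.

```python
def generalize_specific_hypothesis(s, positive_example_attrs):
    """Keep matching values; replace every mismatch with '?'."""
    return [
        s_val if s_val == x_val else '?'
        for s_val, x_val in zip(s, positive_example_attrs)
    ]

# The 'Flu' rows of the dataset, in the order they appear.
flu_rows = [
    ['Yes', 'Yes', 'No'],
    ['Yes', 'No', 'Yes'],
    ['Yes', 'Yes', 'Yes'],
]

s = list(flu_rows[0])
history = [s]
for example in flu_rows[1:]:
    s = generalize_specific_hypothesis(s, example)
    history.append(s)

for step in history:
    print(step)
```

Only 'Fever' survives because it is the single attribute on which all three positive rows agree.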

**Reasoning**:
Now, define the `specialize_general_hypothesis` helper function. It refines the Most General Hypothesis (G) when a negative example is encountered: each '?' in a covering hypothesis is replaced, one attribute at a time, by the values that exclude the negative example.



In [None]:
def specialize_general_hypothesis(g_hypothesis, negative_example_attrs, attribute_names, attribute_domains):
    """
    Generates minimal specializations of a general hypothesis `g_hypothesis`
    so that it no longer covers the `negative_example_attrs`.
    Returns an empty list if `g_hypothesis` already excludes the negative example.
    """
    specializations = []
    # A hypothesis that does not cover the negative example needs no specialization.
    if not covers(g_hypothesis, negative_example_attrs):
        return specializations

    for i, attr_name in enumerate(attribute_names):
        # Only '?' positions can be specialized. Specific values already match the
        # negative example here, otherwise covers() would have returned False.
        if g_hypothesis[i] != '?':
            continue
        # Replace '?' with each domain value that differs from the negative example's value.
        for value in attribute_domains[attr_name]:
            if value != negative_example_attrs[i]:
                new_g_hyp = list(g_hypothesis)
                new_g_hyp[i] = value
                # Skip exact duplicates
                if new_g_hyp not in specializations:
                    specializations.append(new_g_hyp)

    return specializations

print("Defined 'specialize_general_hypothesis' helper function.")
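
As a worked example of the specialization step, consider the first negative row of the dataset, `['No', 'Yes', 'Yes']` (Cold), applied to the initial G. The sketch below uses a simplified, hypothetical `minimal_specializations` helper that takes the domains as a positional list and assumes the hypothesis covers the negative example; it then keeps only candidates that remain more general than the current S.

```python
def is_more_general_than(h1, h2):
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def minimal_specializations(g, negative, domains):
    """Hypothetical helper: assumes g covers negative; domains is one value list per attribute."""
    specs = []
    for i, (g_val, n_val) in enumerate(zip(g, negative)):
        if g_val != '?':
            continue  # a specific value cannot be specialized further
        for value in domains[i]:
            if value != n_val:
                specs.append(g[:i] + [value] + g[i + 1:])
    return specs

domains = [['Yes', 'No'], ['Yes', 'No'], ['No', 'Yes']]
S = ['Yes', 'Yes', 'No']  # S after the first 'Flu' example

candidates = minimal_specializations(['?', '?', '?'], ['No', 'Yes', 'Yes'], domains)
kept = [g for g in candidates if is_more_general_than(g, S)]

print("candidates:", candidates)
print("kept:", kept)
```

The candidate `['?', 'No', '?']` is dropped because it would exclude the positive example that S already represents.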

**Reasoning**:
Initialize the Most Specific Hypothesis (S) and Most General Hypothesis (G) with mutable copies of their initial states. Then, dynamically determine the domain of possible values for each attribute from the `df_medical_diagnosis` DataFrame, which is crucial for specializing the general hypotheses.



In [None]:
S = list(s_initial)
G = [list(h) for h in g_initial]

# Dynamically determine attribute domains
attribute_domains = {}
for attr in attributes:
    attribute_domains[attr] = list(df_medical_diagnosis[attr].unique())

print(f"Initial S (mutable): {S}")
print(f"Initial G (mutable): {G}")
print(f"Attribute Domains: {attribute_domains}")
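
The domain extraction can also be sketched without pandas, which makes the ordering explicit: `dict.fromkeys` keeps values in first-seen order, matching what `unique()` returns for this data. The `n_hypotheses` count is an addition for illustration; it counts conjunctive hypotheses where each attribute is either '?' or one observed value. Note that domains derived this way contain only values that actually occur in the data.

```python
from math import prod

attributes = ['Fever', 'Cough', 'Fatigue']
# Attribute columns of the dataset, in row order.
rows = [
    ['Yes', 'Yes', 'No'],
    ['No', 'Yes', 'Yes'],
    ['Yes', 'No', 'Yes'],
    ['No', 'Yes', 'No'],
    ['Yes', 'Yes', 'Yes'],
    ['No', 'No', 'No'],
    ['Yes', 'No', 'No'],
    ['No', 'Yes', 'Yes'],
    ['Yes', 'Yes', 'Yes'],
    ['No', 'No', 'Yes'],
]

# dict.fromkeys removes duplicates while preserving first-seen order.
attribute_domains = {
    attr: list(dict.fromkeys(row[i] for row in rows))
    for i, attr in enumerate(attributes)
}

# Each attribute can be '?' or one of its observed values.
n_hypotheses = prod(len(values) + 1 for values in attribute_domains.values())

print(attribute_domains)
print(n_hypotheses)
```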

**Reasoning**:
Now, implement the main loop of the Candidate Elimination algorithm, iterating through each example in the dataset to update the Most Specific Hypothesis (S) and Most General Hypothesis (G) based on whether the example is positive or negative for the target concept. This step incorporates all previously defined helper functions.



In [None]:
for index, row in df_medical_diagnosis.iterrows():
    example_attrs = list(row[attributes].values)
    example_label = row['Disease']

    print(f"\nProcessing example: {example_attrs} -> {example_label}")

    # If the example is positive
    if example_label == target_concept:
        # Update S
        if not covers(S, example_attrs):
            S = generalize_specific_hypothesis(S, example_attrs)

        # Filter G: remove hypotheses that do not cover the positive example
        G = [g_hyp for g_hyp in G if covers(g_hyp, example_attrs)]

    # If the example is negative
    else:
        G_temp = []
        for g_hyp in G:
            # If g_hyp covers the negative example, it must be specialized
            if covers(g_hyp, example_attrs):
                specializations = specialize_general_hypothesis(g_hyp, example_attrs, attributes, attribute_domains)
                for new_g in specializations:
                    # Only add specialization if it is more general than or equal to S
                    # and does not cover the negative example (which specialize_general_hypothesis ensures)
                    if is_more_general_than(new_g, S) and not covers(new_g, example_attrs):
                        G_temp.append(new_g)
            else:
                # If g_hyp already does not cover the negative example, keep it
                G_temp.append(g_hyp)
        G = G_temp

    # After processing each example, filter G to remove any duplicate hypotheses or any hypothesis that is overly specific
    # Remove duplicates from G and hypotheses that are subsumed by others in G
    new_G = []
    for g1 in G:
        is_redundant = False
        for g2 in G:
            if g1 != g2 and is_more_general_than(g2, g1): # g2 is more general than g1, so g1 is redundant
                is_redundant = True
                break
        if not is_redundant and list(g1) not in [list(x) for x in new_G]: # Avoid adding exact duplicates
            new_G.append(list(g1))
    G = new_G

    print(f"Current S: {S}")
    print(f"Current G: {G}")

print("\nCandidate Elimination Algorithm complete.")
print(f"Final Most Specific Hypothesis (S): {S}")
print(f"Final Most General Hypothesis (G): {G}")
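
The loop above checks S only against positive examples, so it is worth verifying afterwards whether the reported S also covers any negative example. The following sketch performs that check in plain Python for the final S reported by this notebook; the variable name `covered_negatives` is introduced only for this illustration.

```python
# Same rows as the dataset above: (Fever, Cough, Fatigue, Disease).
rows = [
    ('Yes', 'Yes', 'No', 'Flu'),
    ('No', 'Yes', 'Yes', 'Cold'),
    ('Yes', 'No', 'Yes', 'Flu'),
    ('No', 'Yes', 'No', 'Cold'),
    ('Yes', 'Yes', 'Yes', 'COVID-19'),
    ('No', 'No', 'No', 'Healthy'),
    ('Yes', 'No', 'No', 'Healthy'),
    ('No', 'Yes', 'Yes', 'COVID-19'),
    ('Yes', 'Yes', 'Yes', 'Flu'),
    ('No', 'No', 'Yes', 'Cold'),
]
target_concept = 'Flu'
S = ['Yes', '?', '?']  # final S reported by the notebook

def covers(hypothesis, instance):
    return all(h == '?' or h == x for h, x in zip(hypothesis, instance))

# Negative rows (0-based index, row) that S wrongly covers.
covered_negatives = [
    (i, row) for i, row in enumerate(rows)
    if row[3] != target_concept and covers(S, row[:3])
]

for i, row in covered_negatives:
    print(f"row {i}: {row}")
```

A non-empty result means S is not consistent with the training data, even though the loop still prints it as the final specific hypothesis.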

## Final Task

### Subtask:
Summarize the implementation of the Candidate Elimination algorithm and present the final specific and general hypotheses derived from the medical diagnosis dataset.


## Summary:

### Q&A
The final Most Specific Hypothesis (S) derived from the medical diagnosis dataset for the concept 'Flu' is `['Yes', '?', '?']`.
The final Most General Hypothesis (G) derived for the concept 'Flu' is `[]`, meaning no general hypothesis could be found that was consistent with all positive and negative examples processed from the dataset.

### Data Analysis Key Findings
*   A medical diagnosis dataset was successfully created as a pandas DataFrame with attributes 'Fever', 'Cough', 'Fatigue', and a 'Disease' target.
*   The Most Specific Hypothesis (S) was initialized to `['Yes', 'Yes', 'No']`, based on the first positive example for 'Flu'.
*   The Most General Hypothesis (G) was initialized as `[['?', '?', '?']]`, representing the most general possible hypothesis.
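*   The algorithm generalized the Most Specific Hypothesis (S) from `['Yes', 'Yes', 'No']` to `['Yes', '?', '?']` after the second 'Flu' example, `['Yes', 'No', 'Yes']`. This hypothesis also covers two negative examples, `['Yes', 'Yes', 'Yes']` (COVID-19) and `['Yes', 'No', 'No']` (Healthy), so it should not be read as a rule that fever alone indicates 'Flu'. The implementation never tests S against negative examples, which is why S is still reported.
*   The Most General Hypothesis (G) started as `[['?', '?', '?']]`, was specialized to `[['Yes', '?', '?'], ['?', '?', 'No']]` by the first negative example, and was pruned to `[['Yes', '?', '?']]` by the next positive one. It became empty at the fifth example, `['Yes', 'Yes', 'Yes']` (COVID-19): the only specializations that exclude it, `['Yes', 'No', '?']` and `['Yes', '?', 'No']`, are not more general than S. An empty G means the version space has collapsed.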
*   The attribute domains were dynamically extracted as `{'Fever': ['Yes', 'No'], 'Cough': ['Yes', 'No'], 'Fatigue': ['No', 'Yes']}`.

### Insights or Next Steps
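*   The empty G set means that no single conjunction of attribute values separates 'Flu' from the other labels in this dataset. Two concrete causes are visible in the data. The instance `['Yes', 'Yes', 'Yes']` appears with both 'COVID-19' and 'Flu', which is contradictory for a binary concept. In addition, the least general conjunction covering all three 'Flu' rows, `['Yes', '?', '?']`, also covers the 'Healthy' row `['Yes', 'No', 'No']`, so the concept would need a disjunctive description even without the contradiction.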
*   Consider revisiting the dataset examples (resolving contradictory labels, adding more diverse instances) or exploring alternative learning algorithms (e.g., decision trees, neural networks) that tolerate noise and can represent more complex concepts than a single conjunction.
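
Because the hypothesis space is tiny (27 conjunctions of '?' and observed values), the result can be cross-checked by brute force: enumerate every hypothesis and keep those that cover exactly the rows of a given disease. This is an independent sketch, not the Candidate Elimination procedure itself, and the helper name `version_space` is hypothetical.

```python
from itertools import product

# Same rows as the dataset above: (Fever, Cough, Fatigue, Disease).
rows = [
    ('Yes', 'Yes', 'No', 'Flu'),
    ('No', 'Yes', 'Yes', 'Cold'),
    ('Yes', 'No', 'Yes', 'Flu'),
    ('No', 'Yes', 'No', 'Cold'),
    ('Yes', 'Yes', 'Yes', 'COVID-19'),
    ('No', 'No', 'No', 'Healthy'),
    ('Yes', 'No', 'No', 'Healthy'),
    ('No', 'Yes', 'Yes', 'COVID-19'),
    ('Yes', 'Yes', 'Yes', 'Flu'),
    ('No', 'No', 'Yes', 'Cold'),
]

def covers(hypothesis, instance):
    return all(h == '?' or h == x for h, x in zip(hypothesis, instance))

def version_space(rows, target):
    """Hypothetical helper: all conjunctive hypotheses consistent with every row."""
    n_attrs = len(rows[0]) - 1
    choices = [sorted({row[i] for row in rows}) + ['?'] for i in range(n_attrs)]
    return [
        list(h) for h in product(*choices)
        # Consistent: covers a row exactly when the row is a positive example.
        if all(covers(h, row[:n_attrs]) == (row[n_attrs] == target) for row in rows)
    ]

results = {t: version_space(rows, t) for t in ['Flu', 'Cold', 'COVID-19', 'Healthy']}
for target, hypotheses in results.items():
    print(target, hypotheses)
```

On this dataset only 'Healthy' has a consistent conjunctive hypothesis, which supports the reading that the empty G for 'Flu' reflects the data rather than an implementation error.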
