
# Task
Create a K-Nearest Neighbors (KNN) classifier to predict if a student will pass or fail. Use the following training data:
- Student 1: Study Hours=7, Attendance=90%, Result=Pass
- Student 2: Study Hours=2, Attendance=60%, Result=Fail
- Student 3: Study Hours=8, Attendance=95%, Result=Pass
- Student 4: Study Hours=3, Attendance=50%, Result=Fail
- Student 5: Study Hours=6, Attendance=85%, Result=Pass
- Student 6: Study Hours=4, Attendance=65%, Result=Fail
Then, use the classifier to predict the result for a new student with Study Hours=5 and Attendance=75%. Finally, display the predicted result and explain the reasoning based on the nearest neighbors.

## Create Training Dataset

### Subtask:
Define the provided student training data, including 'study hours', 'attendance', and 'result' (pass/fail), as a structured dataset for the KNN classifier.


**Reasoning**:
First, I will create a list of dictionaries, one per student, holding 'Study Hours', 'Attendance', and 'Result' for the six students. This is the first step in structuring the dataset.



In [None]:
import pandas as pd

# Data for six students (placeholder values: they do not match the task statement and are replaced in a later cell)
data = [
    {'Study Hours': 3, 'Attendance': 7, 'Result': 'Fail'},
    {'Study Hours': 5, 'Attendance': 8, 'Result': 'Pass'},
    {'Study Hours': 2, 'Attendance': 6, 'Result': 'Fail'},
    {'Study Hours': 6, 'Attendance': 9, 'Result': 'Pass'},
    {'Study Hours': 4, 'Attendance': 7, 'Result': 'Pass'},
    {'Study Hours': 1, 'Attendance': 5, 'Result': 'Fail'}
]

# Convert the list of dictionaries to a Pandas DataFrame
training_data = pd.DataFrame(data)

print("Initial training data DataFrame created:")
print(training_data)

**Reasoning**:
The next step is to convert the 'Result' column in the `training_data` DataFrame from categorical labels (Pass/Fail) to numerical ones (Pass = 1, Fail = 0), so the target is in a numeric form that machine learning code can work with directly.



In [None]:
training_data['Result'] = training_data['Result'].map({'Pass': 1, 'Fail': 0})

print("Training data after encoding 'Result' column:")
print(training_data)

## Correct Training Data and Implement KNN Classifier

### Subtask:
First, correct the training data to align with the values provided in the problem description. Then, implement the core K-Nearest Neighbors (KNN) algorithm components, including distance calculation, finding nearest neighbors, and making a classification prediction.


**Reasoning**:
First, I will recreate the training data DataFrame with the correct values provided in the problem description, ensuring that the 'Result' column is numerically encoded. Then, I will separate the features (X) and the target variable (y).



In [None]:
import pandas as pd
import numpy as np

# Corrected data for six students based on the problem description
data_corrected = [
    {'Study Hours': 7, 'Attendance': 90, 'Result': 'Pass'},
    {'Study Hours': 2, 'Attendance': 60, 'Result': 'Fail'},
    {'Study Hours': 8, 'Attendance': 95, 'Result': 'Pass'},
    {'Study Hours': 3, 'Attendance': 50, 'Result': 'Fail'},
    {'Study Hours': 6, 'Attendance': 85, 'Result': 'Pass'},
    {'Study Hours': 4, 'Attendance': 65, 'Result': 'Fail'}
]

# Convert the corrected data to a Pandas DataFrame
training_data = pd.DataFrame(data_corrected)

# Encode the 'Result' column: 'Pass' = 1, 'Fail' = 0
training_data['Result'] = training_data['Result'].map({'Pass': 1, 'Fail': 0})

# Separate features (X) and target variable (y)
X_train = training_data[['Study Hours', 'Attendance']]
y_train = training_data['Result']

print("Corrected training data DataFrame:")
print(training_data)
print("\nFeatures (X_train) head:")
print(X_train.head())
print("\nTarget (y_train) head:")
print(y_train.head())

**Reasoning**:
Now that the training data is prepared, I will define the necessary functions for the KNN algorithm: `euclidean_distance` to calculate the distance between points, `get_neighbors` to find the k nearest neighbors, and `predict_classification` to make a prediction based on the neighbors' labels.



In [None]:
import numpy as np
from collections import Counter

def euclidean_distance(point1, point2):
    """Calculates the Euclidean distance between two data points."""
    point1_array = np.array(point1)
    point2_array = np.array(point2)
    return np.sqrt(np.sum((point1_array - point2_array)**2))

def get_neighbors(X_train, y_train, test_point, k_neighbors):
    """Identifies the k nearest neighbors for a given test point."""
    distances = []
    for i, train_point in enumerate(X_train.values):
        dist = euclidean_distance(test_point, train_point)
        distances.append((dist, y_train.iloc[i]))

    distances.sort(key=lambda x: x[0])
    neighbors_labels = [label for dist, label in distances[:k_neighbors]]
    return neighbors_labels

def predict_classification(neighbors_labels):
    """Predicts the class label based on the majority vote of the neighbors."""
    most_common = Counter(neighbors_labels).most_common(1)
    return most_common[0][0]

print("KNN algorithm functions (euclidean_distance, get_neighbors, predict_classification) defined.")
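
One detail of the majority vote is worth checking: when two labels receive the same number of votes, `Counter.most_common(1)` returns whichever label was encountered first (Python 3.7+), so with an even `k` the prediction can depend on neighbor order. The following is a minimal standalone sketch of that behavior. `majority_vote` is a hypothetical helper that mirrors `predict_classification`; it is not part of the notebook.

```python
from collections import Counter

def majority_vote(labels):
    """Hypothetical stand-in for predict_classification: the most common label wins."""
    return Counter(labels).most_common(1)[0][0]

# Clear majority: two votes for 1 (Pass), one vote for 0 (Fail).
clear = majority_vote([1, 0, 1])

# Tied votes: the label encountered first wins, so the order of the neighbors matters.
tie_pass_first = majority_vote([1, 0])
tie_fail_first = majority_vote([0, 1])

print(clear, tie_pass_first, tie_fail_first)
```

Choosing an odd `k` for a two-class problem is one simple way to avoid tied votes.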

**Reasoning**:
Now that the KNN functions are defined, I will define the new student's data point, set the value of K (number of neighbors), use the `get_neighbors` function to find the nearest neighbors for the new student, and then use `predict_classification` to predict the result for this new student. Finally, I will explain the reasoning based on the nearest neighbors.



In [None]:
import pandas as pd
import numpy as np
from collections import Counter

# Define the new student's data point
new_student = {'Study Hours': 5, 'Attendance': 75}
new_student_features = np.array([new_student['Study Hours'], new_student['Attendance']])

# Set the number of neighbors (K)
k = 3  # A common choice for K

# Find the k nearest neighbors
neighbors_labels = get_neighbors(X_train, y_train, new_student_features, k)

# Predict the result for the new student
predicted_result_encoded = predict_classification(neighbors_labels)

# Map the encoded result back to 'Pass' or 'Fail'
predicted_result = 'Pass' if predicted_result_encoded == 1 else 'Fail'

# Explain the reasoning
print(f"New student: {new_student}")
print(f"Predicted result for the new student (K={k}): {predicted_result}")
print("\nReasoning based on nearest neighbors:")

# To provide a detailed explanation, let's get the actual neighbors and their original results
distances = []
for i, train_point in enumerate(X_train.values):
    dist = euclidean_distance(new_student_features, train_point)
    distances.append({'student_index': i, 'Study Hours': train_point[0], 'Attendance': train_point[1], 'Result_Encoded': y_train.iloc[i], 'Distance': dist})

# Sort by distance
distances_df = pd.DataFrame(distances)
distances_df = distances_df.sort_values(by='Distance').reset_index(drop=True)

nearest_k_neighbors = distances_df.head(k)
nearest_k_neighbors['Original_Result'] = nearest_k_neighbors['Result_Encoded'].map({1: 'Pass', 0: 'Fail'})

print(f"The {k} nearest neighbors are:")
for idx, row in nearest_k_neighbors.iterrows():
    print(f"- Student with Study Hours={row['Study Hours']}, Attendance={row['Attendance']}: Result={row['Original_Result']} (Distance: {row['Distance']:.2f})")

pass_count = sum(1 for label in neighbors_labels if label == 1)
fail_count = sum(1 for label in neighbors_labels if label == 0)

print(f"\nAmong these {k} neighbors, there are {pass_count} 'Pass' results and {fail_count} 'Fail' results.")
print(f"The majority class is '{predicted_result}', hence the prediction.")

**Reasoning**:
The previous code triggered a `SettingWithCopyWarning` because `nearest_k_neighbors` was obtained by slicing `distances_df` with `.head(k)`, and pandas cannot tell whether assigning a new column to it modifies a view or a copy. To resolve this, I will explicitly copy the slice with `.copy()` before adding the 'Original_Result' column.



In [None]:
import pandas as pd
import numpy as np
from collections import Counter

# Define the new student's data point
new_student = {'Study Hours': 5, 'Attendance': 75}
new_student_features = np.array([new_student['Study Hours'], new_student['Attendance']])

# Set the number of neighbors (K)
k = 3  # A common choice for K

# Find the k nearest neighbors
neighbors_labels = get_neighbors(X_train, y_train, new_student_features, k)

# Predict the result for the new student
predicted_result_encoded = predict_classification(neighbors_labels)

# Map the encoded result back to 'Pass' or 'Fail'
predicted_result = 'Pass' if predicted_result_encoded == 1 else 'Fail'

# Explain the reasoning
print(f"New student: {new_student}")
print(f"Predicted result for the new student (K={k}): {predicted_result}")
print("\nReasoning based on nearest neighbors:")

# To provide a detailed explanation, let's get the actual neighbors and their original results
distances = []
for i, train_point in enumerate(X_train.values):
    dist = euclidean_distance(new_student_features, train_point)
    distances.append({'student_index': i, 'Study Hours': train_point[0], 'Attendance': train_point[1], 'Result_Encoded': y_train.iloc[i], 'Distance': dist})

# Sort by distance
distances_df = pd.DataFrame(distances)
distances_df = distances_df.sort_values(by='Distance').reset_index(drop=True)

# Create an explicit copy to avoid SettingWithCopyWarning
nearest_k_neighbors = distances_df.head(k).copy()
nearest_k_neighbors['Original_Result'] = nearest_k_neighbors['Result_Encoded'].map({1: 'Pass', 0: 'Fail'})

print(f"The {k} nearest neighbors are:")
for idx, row in nearest_k_neighbors.iterrows():
    print(f"- Student with Study Hours={row['Study Hours']}, Attendance={row['Attendance']}: Result={row['Original_Result']} (Distance: {row['Distance']:.2f})")

pass_count = sum(1 for label in neighbors_labels if label == 1)
fail_count = sum(1 for label in neighbors_labels if label == 0)

print(f"\nAmong these {k} neighbors, there are {pass_count} 'Pass' results and {fail_count} 'Fail' results.")
print(f"The majority class is '{predicted_result}', hence the prediction.")

## Final Task

### Subtask:
Present the predicted result (pass or fail) for the new student and explain the reasoning based on the KNN algorithm, specifically highlighting the nearest neighbors and their outcomes.


## Summary:

### Q&A
The predicted result for the new student (Study Hours=5, Attendance=75) using the K-Nearest Neighbors (KNN) classifier with K=3 is **Pass**.

The reasoning is as follows. The three nearest neighbors to the new student, measured by Euclidean distance, are:
*   Student with Study Hours=4, Attendance=65: Result=Fail (Distance: 10.05)
*   Student with Study Hours=6, Attendance=85: Result=Pass (Distance: 10.05)
*   Student with Study Hours=7, Attendance=90: Result=Pass (Distance: 15.13)

Among these three neighbors, two students passed, and one failed. Since the majority (2 out of 3) of the nearest neighbors had a 'Pass' result, the KNN algorithm predicts 'Pass' for the new student.
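
As a check on the distances quoted above, the Euclidean distance from the new student at (5, 75) to each of the three nearest neighbors, plus the fourth-closest student for comparison, works out as follows. Points are written as (Study Hours, Attendance).

$$
\begin{aligned}
d\big((5,75),(4,65)\big) &= \sqrt{(5-4)^2 + (75-65)^2} = \sqrt{101} \approx 10.05 \\
d\big((5,75),(6,85)\big) &= \sqrt{(5-6)^2 + (75-85)^2} = \sqrt{101} \approx 10.05 \\
d\big((5,75),(7,90)\big) &= \sqrt{(5-7)^2 + (75-90)^2} = \sqrt{229} \approx 15.13 \\
d\big((5,75),(2,60)\big) &= \sqrt{(5-2)^2 + (75-60)^2} = \sqrt{234} \approx 15.30
\end{aligned}
$$

The fourth-closest student (Study Hours=2, Attendance=60, Result=Fail) is only slightly farther away than the third, and in every case the attendance difference accounts for nearly all of the squared distance.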

### Data Analysis Key Findings
*   The initial training data was corrected to match the problem description, including 'Study Hours' and 'Attendance' for six students, with their respective 'Result' (Pass/Fail).
*   The 'Result' column in the training data was numerically encoded, mapping 'Pass' to 1 and 'Fail' to 0 for machine learning compatibility.
*   Custom functions for Euclidean distance calculation, finding K nearest neighbors, and predicting classification based on majority vote were successfully implemented.
*   For a new student with Study Hours=5 and Attendance=75, and K=3:
    *   The closest neighbors were identified as:
        *   Student (Study Hours=4, Attendance=65) with Result='Fail' (distance 10.05).
        *   Student (Study Hours=6, Attendance=85) with Result='Pass' (distance 10.05).
        *   Student (Study Hours=7, Attendance=90) with Result='Pass' (distance 15.13).
    *   Out of the three nearest neighbors, two had a 'Pass' result and one had a 'Fail' result.
*   Based on the majority vote of the nearest neighbors, the new student is predicted to 'Pass'.

### Insights or Next Steps
*   The KNN model provides an interpretable prediction by explicitly showing the influence of the closest data points. This transparency can be valuable for explaining outcomes in educational settings.
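*   To check how sensitive the prediction is to K, try other values (e.g., K=1, K=5) and compare the resulting neighbor sets and votes. Note that the two closest neighbors are equidistant (10.05) and have opposite results, so the outcome for K=1 depends on how the tie in distance is broken, and every even K produces a tied vote on this dataset.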
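
The K comparison suggested above can be sketched without any third-party libraries. This is a minimal standalone illustration, not the notebook's implementation: the `students` list repeats the task data, `squared_distance` is a hypothetical helper, and squared distances are used only because they rank points in the same order as true Euclidean distances.

```python
from collections import Counter

# Training data from the task statement: (study hours, attendance %, result).
students = [
    (7, 90, "Pass"), (2, 60, "Fail"), (8, 95, "Pass"),
    (3, 50, "Fail"), (6, 85, "Pass"), (4, 65, "Fail"),
]
new_student = (5, 75)

def squared_distance(a, b):
    """Squared Euclidean distance; it ranks points the same as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Sort once by distance to the new student (a stable sort keeps input order on ties).
ranked = sorted(students, key=lambda s: squared_distance(s[:2], new_student))

votes_by_k = {}
for k in range(1, len(ranked) + 1):
    votes = Counter(result for _, _, result in ranked[:k])
    votes_by_k[k] = (votes["Pass"], votes["Fail"])
    tied = votes["Pass"] == votes["Fail"]
    outcome = "tie" if tied else votes.most_common(1)[0][0]
    print(f"K={k}: Pass={votes['Pass']}, Fail={votes['Fail']} -> {outcome}")
```

Under these assumptions, K=3 and K=5 both give a Pass majority and every even K ties. The K=1 outcome depends on which of the two equidistant neighbors happens to be ranked first.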
