# Q3: Predicting T2 Risk Level and Handling Missing Values

This notebook estimates the probability that test record T2 is **High Risk**, using training records with similar Age and CreditScore values. T2's Education value is missing, so Education is left out of the comparison. We then propose methods for handling similar missing values in the future.

In [None]:
import pandas as pd

# Create the training dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High']
}

df = pd.DataFrame(data)
# pandas already stores the None entry as NaN; the cast just makes the float dtype explicit
df['Education'] = df['Education'].astype('float')

# Display the training dataset
print("Training Data:\n", df)

### Define the Test Case T2

T2 has the following attributes:

- **Age:** 30
- **CreditScore:** 645
- **Education:** missing

Since Education is missing, we focus on Age and CreditScore.

In [None]:
# Define the test record T2 as a dictionary
T2 = {'Age': 30, 'CreditScore': 645}

print("Test Record T2:", T2)

### Identify Similar Training Records

We select training records with similar Age and CreditScore. A record counts as similar only when both of the following conditions hold:
- Absolute difference in Age ≤ 5 years
- Absolute difference in CreditScore ≤ 25 points

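These thresholds are a judgment call and can be adjusted for the dataset: wider thresholds admit more records but less similar ones, while narrower thresholds can leave too few records for a stable estimate.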

In [None]:
# Define thresholds for similarity
age_threshold = 5
credit_threshold = 25

# Filter the training data for similar records
similar_records = df[
    ((df['Age'] - T2['Age']).abs() <= age_threshold)
    & ((df['CreditScore'] - T2['CreditScore']).abs() <= credit_threshold)
]

print("Similar Training Records:\n", similar_records)

### Calculate the Probability of T2 Being High Risk

Now, we compute the proportion of similar records that are classified as High Risk.

In [None]:
# Count the number of similar records and those with High risk
total_similar = len(similar_records)
high_risk_count = len(similar_records[similar_records['RiskLevel'] == 'High'])

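if total_similar > 0:
    probability_high = high_risk_count / total_similar
    print(f"Probability of T2 being High Risk: {probability_high:.2f} "
          f"({high_risk_count} of {total_similar} similar records)")
else:
    # No training record falls within both thresholds, so no estimate is possible
    probability_high = None
    print("No similar training records found; the probability is undefined.")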
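The filtering and counting steps above can also be wrapped in one reusable helper. The following is a minimal sketch rather than part of the original analysis: the function name `high_risk_probability` and the variables `train` and `t2` are hypothetical, and the default thresholds simply repeat the values chosen earlier.

```python
import pandas as pd

def high_risk_probability(train, record, age_threshold=5, credit_threshold=25):
    """Share of 'High' labels among training rows close to `record`.

    Hypothetical helper: returns None when no training row falls
    inside both thresholds.
    """
    close_age = (train['Age'] - record['Age']).abs() <= age_threshold
    close_credit = (train['CreditScore'] - record['CreditScore']).abs() <= credit_threshold
    neighbours = train[close_age & close_credit]
    if neighbours.empty:
        return None
    return float((neighbours['RiskLevel'] == 'High').mean())

# Same training data and test record as in the cells above
train = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
})
t2 = {'Age': 30, 'CreditScore': 645}

p_high = high_risk_probability(train, t2)
print(p_high)  # 1.0: IDs 2, 6 and 8 are the only matches, and all three are High
```

With only three matching records, an estimate of 1.0 should be read as "every similar training record is High Risk", not as certainty about T2.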

### Handling Missing Education Values in the Future

To address missing values such as Education, you could use one or more of the following methods:

- **K-Nearest Neighbors (KNN) Imputation:** Impute the missing Education value by averaging the values from the most similar records based on Age and CreditScore.
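- **Regression Imputation:** Use a regression model that predicts Education from other features such as Age and CreditScore. RiskLevel can also serve as a predictor for training records, but it is unknown for a test record such as T2.
- **Mean/Median Imputation:** Replace missing Education values with the mean or median Education value, computed over the whole column or over a group of similar records.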
- **Missingness Indicator:** Add an extra binary feature that indicates whether the Education value is missing, allowing the model to capture any signal in the missingness itself.
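
As an illustration of the KNN option, the sketch below imputes T2's Education using pandas only. It makes several assumptions that are not in the original notebook: the helper name `knn_impute_education`, the choice of `k=3`, and scaling each feature by its training standard deviation before computing Euclidean distance so that CreditScore does not dominate Age.

```python
import pandas as pd

def knn_impute_education(train, record, k=3):
    """Mean Education of the k training rows nearest to `record`.

    Hypothetical helper. Distance is Euclidean on Age and CreditScore,
    each divided by its training standard deviation (an assumed scaling).
    """
    features = ['Age', 'CreditScore']
    # Only rows with a known Education value can act as neighbours
    known = train.dropna(subset=['Education'])
    scale = known[features].std()
    diffs = (known[features] - pd.Series(record)[features]) / scale
    distance = (diffs ** 2).sum(axis=1) ** 0.5
    nearest = distance.nsmallest(k).index
    return float(known.loc[nearest, 'Education'].mean())

# Same training data and test record as in the cells above
train = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
})
t2 = {'Age': 30, 'CreditScore': 645}

t2_education = knn_impute_education(train, t2, k=3)
print(round(t2_education, 2))  # 13.33, the mean of 14, 14 and 12 (IDs 2, 6 and 8)
```

The imputed value depends on both `k` and the scaling, so those choices are worth validating on records whose Education is known.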

These methods keep records with missing values in the dataset instead of discarding them, so the model can still use the features that were observed.
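
The two simplest options, median imputation and a missingness indicator, can be combined in a few lines. This is a sketch under stated assumptions: the column name `Education_missing` and the variable names are hypothetical, and the median is taken over the whole training column rather than over similar records.

```python
import pandas as pd

# Same training data as in the cells above
train = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
})

# Record which rows were missing BEFORE filling, otherwise the signal is lost
train['Education_missing'] = train['Education'].isna().astype(int)

# Median of the observed training values (NaN is skipped by default)
median_education = train['Education'].median()
train['Education'] = train['Education'].fillna(median_education)

# Reuse the training median for the test record so no test data informs the fill
t2 = {'Age': 30, 'CreditScore': 645,
      'Education': median_education, 'Education_missing': 1}

print(train[['ID', 'Education', 'Education_missing']])
```

Here only ID 3 is flagged, and it receives the training median of 14.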