<a href="https://colab.research.google.com/github/reyhanfisena/flood-prediction-using-historical-data/blob/main/Flood_prediction_using_historical_data_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data preprocessing (update)

### Subtask:
Extract the sub-district (kecamatan) name from the 'Area' column and one-hot encode the kecamatan feature.


**Reasoning**:
Define the function to extract 'Kecamatan', apply it to create a new column, and then one-hot encode this new column before concatenating with the existing encoded DataFrame and redefining the feature set X.



In [24]:
import re

def extract_kecamatan(area_string):
    """
    Extracts the sub-district (kecamatan) name from the 'Area' string.

    Args:
        area_string (str): The string from the 'Area' column.

    Returns:
        str: The extracted sub-district name or 'Unknown'.
    """
    if pd.isna(area_string):
        return 'Unknown'

    # Look for patterns like 'Kecamatan [Name]' or 'Kec. [Name]'
    match = re.search(r'(Kecamatan|Kec\.)\s*([^<,\n]+)', area_string, re.IGNORECASE)
    if match:
        # Return the captured group (the name after Kecamatan/Kec.) and strip whitespace
        return match.group(2).strip()

    # If no specific pattern is found, return 'Unknown'
    return 'Unknown'

# Apply the function to create the new 'Kecamatan' column
df_banjir['Kecamatan'] = df_banjir['Area'].apply(extract_kecamatan)

# Perform one-hot encoding on the 'Kecamatan' column
kecamatan_encoded = pd.get_dummies(df_banjir['Kecamatan'], prefix='Kecamatan', drop_first=True)

# Concatenate the new one-hot encoded 'Kecamatan' columns with the existing encoded DataFrame
# Ensure 'Area' column from the original encoding is dropped if it was included
if 'Area' in df_banjir_encoded.columns:
    df_banjir_encoded = df_banjir_encoded.drop('Area', axis=1)

# Drop every previously one-hot encoded 'Area_*' column (including 'Area_Unknown');
# the 'Kecamatan_*' columns replace them
area_cols = [col for col in df_banjir_encoded.columns if col.startswith('Area_')]
df_banjir_temp = df_banjir_encoded.drop(columns=area_cols)


df_banjir_final_encoded = pd.concat([df_banjir_temp, kecamatan_encoded], axis=1)


# Update the definition of features (X)
# Drop identifier, original disaster type, eventdate, original Area, and target
columns_to_drop = ['ID Logs', 'Disaster type', 'Eventdate', 'Level', 'Area']
X = df_banjir_final_encoded.drop(columns=columns_to_drop, errors='ignore')

# Verify the new columns are included and the old 'Area' related columns are removed
print("Columns in the final feature set X:")
print(X.columns.tolist())

display(X.head())

Columns in the final feature set X:
['Latitude', 'Longitude', 'Dead', 'Missing', 'Serious Wound', 'Minor Injuries', 'Regency_Banyuwangi Kabupaten', 'Regency_Batu Kota', 'Regency_Blitar Kabupaten', 'Regency_Blitar Kota', 'Regency_Bojonegoro Kabupaten', 'Regency_Bondowoso Kabupaten', 'Regency_Gresik Kabupaten', 'Regency_Jember Kabupaten', 'Regency_Jombang Kabupaten', 'Regency_Kediri Kabupaten', 'Regency_Lamongan Kabupaten', 'Regency_Lumajang Kabupaten', 'Regency_Madiun Kabupaten', 'Regency_Madiun Kota', 'Regency_Magetan Kabupaten', 'Regency_Malang Kabupaten', 'Regency_Malang Kota', 'Regency_Mojokerto Kabupaten', 'Regency_Mojokerto Kota', 'Regency_Nganjuk Kabupaten', 'Regency_Ngawi Kabupaten', 'Regency_Pacitan Kabupaten', 'Regency_Pamekasan Kabupaten', 'Regency_Pasuruan Kabupaten', 'Regency_Pasuruan Kota', 'Regency_Ponorogo Kabupaten', 'Regency_Probolinggo Kabupaten', 'Regency_Probolinggo Kota', 'Regency_Sampang Kabupaten', 'Regency_Sidoarjo Kabupaten', 'Regency_Situbondo Kabupaten', 'Reg

Unnamed: 0,Latitude,Longitude,Dead,Missing,Serious Wound,Minor Injuries,Regency_Banyuwangi Kabupaten,Regency_Batu Kota,Regency_Blitar Kabupaten,Regency_Blitar Kota,...,Kecamatan_Unknown,Kecamatan_Watulimo,Kecamatan_Winongan,Kecamatan_Winongan :,Kecamatan_Wongsorejo,Kecamatan_Wonoasri,Kecamatan_Wonosari,Kecamatan_Wonotirto,Kecamatan_Wungu,Kecamatan_pace
0,7.29708,112.456,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,-10000.0,111.793,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
11,-8.12098,111.9,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
15,-7.73591,112.958,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
24,-7.6879,111.95,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
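The column listing above contains near-duplicates such as `Kecamatan_Winongan` and `Kecamatan_Winongan :`, plus a lower-case `Kecamatan_pace`, which suggests the extracted names would benefit from normalization before encoding. The following is a minimal sketch of one way to do that; `normalize_kecamatan` is a hypothetical helper, not part of the notebook, and the exact cleanup rules are an assumption.

```python
import re

def normalize_kecamatan(name):
    """Hypothetical helper: collapse spelling variants of a kecamatan name."""
    # Drop trailing punctuation and whitespace such as " :" or "."
    cleaned = re.sub(r'[\s:;.\-]+$', '', str(name))
    # Collapse internal runs of whitespace and trim the ends
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    # Use one capitalization so 'pace' and 'Pace' share a column
    return cleaned.title() if cleaned else 'Unknown'

variants = ['Winongan', 'Winongan :', 'pace', '  Wungu ']
normalized = [normalize_kecamatan(v) for v in variants]
print(normalized)
```

If used, it could be applied to the `Kecamatan` column before `pd.get_dummies`, so that variants of the same name share a single one-hot column.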


**Reasoning**:
Split the updated feature set X and the target variable y into training and testing sets to prepare for model training.



In [25]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (204, 1120)
Shape of X_test: (51, 1120)
Shape of y_train: (204,)
Shape of y_test: (51,)


## Case-based reasoning (CBR) implementation (update)

### Subtask:
Adapt the `find_similar_cases` function to take the newly extracted kecamatan features into account.


**Reasoning**:
Modify the `find_similar_cases` function to handle the updated `X_train` DataFrame which includes the one-hot encoded 'Kecamatan' features. Test the function with a sample new case that includes a hypothetical 'Kecamatan'.



In [26]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def find_similar_cases(new_case, X_train, y_train, k=5):
    """
    Finds the top-k most similar cases in the training data to a new case.

    Args:
        new_case (pd.DataFrame): A DataFrame with one row representing the
                                 characteristics of a new potential flood event.
                                 Must have the same columns as X_train.
        X_train (pd.DataFrame): The training data containing historical cases.
        y_train (pd.Series): The target variable (flood levels) for the training data.
        k (int): The number of most similar cases to retrieve.

    Returns:
        tuple: A tuple containing:
            - pd.DataFrame: The top-k similar cases from X_train.
            - pd.Series: The target levels for the top-k similar cases from y_train.
            - np.ndarray: The distances of the top-k similar cases from the new case.
    """
    # Ensure the new case has the same columns as X_train
    # Add missing columns to new_case and fill with 0
    missing_cols = set(X_train.columns) - set(new_case.columns)
    for c in missing_cols:
        new_case[c] = 0
    # Ensure the order of columns is the same
    new_case = new_case[X_train.columns]

    # Use NearestNeighbors to find the k nearest neighbors based on Euclidean distance
    nn = NearestNeighbors(n_neighbors=k, metric='euclidean')
    nn.fit(X_train)

    distances, indices = nn.kneighbors(new_case)

    # Get the similar cases and their target levels
    similar_cases = X_train.iloc[indices[0]]
    similar_cases_levels = y_train.iloc[indices[0]]

    return similar_cases, similar_cases_levels, distances[0]

# Create a sample new case DataFrame that includes the relevant one-hot encoded 'Kecamatan' column.
# Start with a base case (e.g., from X_test) and modify it.
sample_new_event_base = X_test.iloc[[0]].copy()

# Assume the new event is in a specific kecamatan, e.g., 'Kecamatan_Sampang'
# Set the corresponding one-hot encoded column to True (or 1) and others to False (or 0)
# First, set all 'Kecamatan_' columns in the sample to False
for col in sample_new_event_base.columns:
    if col.startswith('Kecamatan_'):
        sample_new_event_base[col] = False

# Now set the specific kecamatan to True (or 1) - replace 'Kecamatan_Sampang' with a kecamatan present in your data
# You might need to check the actual column names in X_train or X_test
# Let's pick one from the existing X_train columns for demonstration
example_kecamatan_col = [col for col in X_train.columns if col.startswith('Kecamatan_')][0] # Pick the first one as an example
sample_new_event_base[example_kecamatan_col] = True # Or 1

# Ensure all columns match X_train columns (this was already handled in the function, but good to be explicit)
sample_new_event = pd.DataFrame(0, index=[0], columns=X_train.columns)
for col in sample_new_event_base.columns:
    if col in sample_new_event.columns:
        sample_new_event[col] = sample_new_event_base[col]


# Call the updated find_similar_cases function
similar_features_updated, similar_levels_updated, distances_updated = find_similar_cases(sample_new_event, X_train, y_train, k=5)

print("Sample New Flood Event Features (with hypothetical Kecamatan):")
display(sample_new_event)

print("\nTop 5 Similar Cases Features (considering Kecamatan):")
display(similar_features_updated)

print("\nTop 5 Similar Cases Levels (considering Kecamatan):")
display(similar_levels_updated)

print("\nDistances to Similar Cases (considering Kecamatan):")
print(distances_updated)

ValueError: Input X contains NaN.
NearestNeighbors does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

**Reasoning**:
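The previous attempt failed because the sample new case DataFrame contained NaN values. The likely cause is index alignment: `sample_new_event` was created with index 0, while `sample_new_event_base` kept its original row label from `X_test`, so the column-by-column assignment found no matching label and produced NaN. I need to ensure the sample new case is initialized with all columns from `X_train` and filled with appropriate values (0 for the one-hot encoded columns and realistic values for the numerical features).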
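To illustrate how a NaN can appear in a case like this, here is a small sketch of pandas index alignment on made-up data. The column names and the index label 6 are illustrative assumptions; only the alignment behavior itself is the point.

```python
import pandas as pd

# A one-row frame sliced from a larger frame keeps its original index label (here 6)
base = pd.DataFrame({'Latitude': [-7.5], 'Dead': [0.0]}, index=[6])

# A fresh frame indexed at 0, built the same way as the sample case
target = pd.DataFrame(0, index=[0], columns=['Latitude', 'Dead'])

# Column assignment aligns on index labels: label 6 has no match at label 0
misaligned = target.copy()
misaligned['Latitude'] = base['Latitude']

# Assigning the underlying values sidesteps alignment
fixed = target.copy()
fixed['Latitude'] = base['Latitude'].to_numpy()

print(misaligned['Latitude'].isna().all(), fixed['Latitude'].iloc[0])
```

Resetting the index of the one-row frame (`reset_index(drop=True)`) before copying values would be another way to avoid the mismatch.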



In [27]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def find_similar_cases(new_case, X_train, y_train, k=5):
    """
    Finds the top-k most similar cases in the training data to a new case.

    Args:
        new_case (pd.DataFrame): A DataFrame with one row representing the
                                 characteristics of a new potential flood event.
                                 Must have the same columns as X_train.
        X_train (pd.DataFrame): The training data containing historical cases.
        y_train (pd.Series): The target variable (flood levels) for the training data.
        k (int): The number of most similar cases to retrieve.

    Returns:
        tuple: A tuple containing:
            - pd.DataFrame: The top-k similar cases from X_train.
            - pd.Series: The target levels for the top-k similar cases from y_train.
            - np.ndarray: The distances of the top-k similar cases from the new case.
    """
    # Align the new case with X_train: add any missing columns (filled with 0)
    # and match the column order, without mutating the caller's DataFrame
    new_case = new_case.reindex(columns=X_train.columns, fill_value=0)

    # Use NearestNeighbors to find the k nearest neighbors based on Euclidean distance
    nn = NearestNeighbors(n_neighbors=k, metric='euclidean')
    nn.fit(X_train)

    distances, indices = nn.kneighbors(new_case)

    # Get the similar cases and their target levels
    similar_cases = X_train.iloc[indices[0]]
    similar_cases_levels = y_train.iloc[indices[0]]

    return similar_cases, similar_cases_levels, distances[0]

# Create a sample new case DataFrame that includes the relevant one-hot encoded 'Kecamatan' column.
# Initialize with zeros and ensure all columns from X_train are present.
sample_new_event = pd.DataFrame(0, index=[0], columns=X_train.columns)

# Populate with hypothetical values for numerical features (e.g., medians from training data)
for col in ['Latitude', 'Longitude', 'Dead', 'Missing', 'Serious Wound', 'Minor Injuries']:
    if col in sample_new_event.columns:
        sample_new_event[col] = X_train[col].median()

# Assume the new event is in a specific kecamatan and set its one-hot encoded column to 1.
# For demonstration, pick the first 'Kecamatan_' column in X_train.
example_kecamatan_col = [col for col in X_train.columns if col.startswith('Kecamatan_')][0]
sample_new_event[example_kecamatan_col] = 1

# Ensure the column order is correct before passing to the function
sample_new_event = sample_new_event[X_train.columns]


# Call the updated find_similar_cases function
similar_features_updated, similar_levels_updated, distances_updated = find_similar_cases(sample_new_event, X_train, y_train, k=5)

print("Sample New Flood Event Features (with hypothetical Kecamatan):")
display(sample_new_event)

print("\nTop 5 Similar Cases Features (considering Kecamatan):")
display(similar_features_updated)

print("\nTop 5 Similar Cases Levels (considering Kecamatan):")
display(similar_levels_updated)

print("\nDistances to Similar Cases (considering Kecamatan):")
print(distances_updated)

Sample New Flood Event Features (with hypothetical Kecamatan):


Unnamed: 0,Latitude,Longitude,Dead,Missing,Serious Wound,Minor Injuries,Regency_Banyuwangi Kabupaten,Regency_Batu Kota,Regency_Blitar Kabupaten,Regency_Blitar Kota,...,Kecamatan_Unknown,Kecamatan_Watulimo,Kecamatan_Winongan,Kecamatan_Winongan :,Kecamatan_Wongsorejo,Kecamatan_Wonoasri,Kecamatan_Wonosari,Kecamatan_Wonotirto,Kecamatan_Wungu,Kecamatan_pace
0,-7.63148,112.533,0.0,0.0,0.0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0



Top 5 Similar Cases Features (considering Kecamatan):


Unnamed: 0,Latitude,Longitude,Dead,Missing,Serious Wound,Minor Injuries,Regency_Banyuwangi Kabupaten,Regency_Batu Kota,Regency_Blitar Kabupaten,Regency_Blitar Kota,...,Kecamatan_Unknown,Kecamatan_Watulimo,Kecamatan_Winongan,Kecamatan_Winongan :,Kecamatan_Wongsorejo,Kecamatan_Wonoasri,Kecamatan_Wonosari,Kecamatan_Wonotirto,Kecamatan_Wungu,Kecamatan_pace
713,-7.10636,112.173,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1429,-7.626055,112.533,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
197,-7.53009,112.609,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1565,-7.2707,112.472,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
15,-7.73591,112.958,0.0,0.0,0.0,0.0,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False



Top 5 Similar Cases Levels (considering Kecamatan):


Unnamed: 0,Level
713,0
1429,0
197,0
1565,0
15,1



Distances to Similar Cases (considering Kecamatan):
[2.53087949 2.64575687 2.64878386 2.67093302 2.68170293]
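One way to read the distances above, offered as an inference rather than something the notebook states: with unscaled features, each one-hot column that differs between two cases adds exactly 1 to the squared Euclidean distance, so mismatched dummy columns can outweigh small differences in latitude and longitude. The sketch below recomputes two of the reported distances from the displayed coordinates, assuming 7 and 6 differing one-hot columns respectively (counts chosen to match the output, not read from the data).

```python
import math

# Displayed coordinates for the sample case and two neighbors (rounded in the output)
sample = (-7.63148, 112.533)
neighbor_1429 = (-7.626055, 112.533)
neighbor_713 = (-7.10636, 112.173)

def distance(a, b, differing_dummies):
    """Euclidean distance when `differing_dummies` one-hot columns disagree."""
    # Each disagreeing 0/1 column contributes exactly 1 to the squared distance
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + differing_dummies)

d_1429 = distance(sample, neighbor_1429, 7)  # assumption: 7 columns differ
d_713 = distance(sample, neighbor_713, 6)    # assumption: 6 columns differ
print(round(d_1429, 4), round(d_713, 4))
```

Under those assumptions the values agree with the reported 2.64575687 and 2.53087949 to about four decimals. If location should carry more weight, scaling the numeric features or weighting the columns before the neighbor search is one option to consider.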


## XGBoost model training (update)

### Subtask:
Retrain the XGBoost model with the new kecamatan features included.


**Reasoning**:
Train the XGBoost model using the updated training data, ensuring column names are valid for XGBoost.



In [28]:
from xgboost import XGBClassifier

# XGBoost rejects feature names containing characters such as '[', ']' or '<',
# so replace every non-alphanumeric character with '_'
def clean_column(name):
    return "".join(c if c.isalnum() else "_" for c in str(name))

X_train.columns = [clean_column(x) for x in X_train.columns]
X_test.columns = [clean_column(x) for x in X_test.columns]

# Create an instance of XGBClassifier
# Using multi:softmax objective for multi-class classification and setting num_class
xgb_model = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='mlogloss')

# Train the XGBoost model using the updated training data
xgb_model.fit(X_train, y_train)

print("XGBoost model trained successfully with updated features.")

XGBoost model trained successfully with updated features.
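Because the cleaning step maps every non-alphanumeric character to `_`, two distinct original columns could in principle end up with the same name. The sketch below shows a simple collision check using the same rule; `clean_name` and `find_collisions` are hypothetical helper names, and the last column in the example list is invented to force a collision.

```python
from collections import Counter

def clean_name(name):
    """Same rule as the cell above: non-alphanumerics become underscores."""
    return "".join(c if c.isalnum() else "_" for c in str(name))

def find_collisions(columns):
    """Return sanitized names that more than one original column maps to."""
    counts = Counter(clean_name(c) for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)

# Illustrative column names; 'Regency_Batu_Kota' is hypothetical and chosen to collide
columns = ['Serious Wound', 'Kecamatan_Winongan :', 'Regency_Batu Kota', 'Regency_Batu_Kota']
print(clean_name('Kecamatan_Winongan :'))
print(find_collisions(columns))
```

Note also that the rule renames `Serious Wound` to `Serious_Wound`, so any later lookup by the original name needs the same cleaning applied.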


## Prediction (update)

### Subtask:
Create a new function that takes a kecamatan name as input, builds a new case with the matching features, and uses the CBR-XGBoost model to predict the flood level.


**Reasoning**:
Define the predict_flood_level_by_kecamatan function, create a sample new event, and call the function to get and print the prediction.
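The function in the next cell settles on a final level by majority vote over the model's predictions for the retrieved cases. The sketch below isolates that voting step and shows how `Counter.most_common` resolves ties. The integer-to-label mapping is an assumption for illustration (the notebook's `level_mapping_reverse` is defined elsewhere), and the first vote list simply reuses the neighbor labels shown earlier.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

# Assumed mapping, for illustration only
level_names = {0: 'RENDAH', 1: 'SEDANG', 2: 'TINGGI'}

clear_winner = majority_vote([0, 0, 0, 0, 1])
tie_winner = majority_vote([1, 0, 0, 1])  # two votes each: the first-seen label wins
print(level_names[clear_winner], level_names[tie_winner])
```

Since the neighbors come back sorted by distance, a tie is resolved in favor of the label whose first occurrence is closest to the new case.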



In [32]:
import numpy as np
from collections import Counter
import pandas as pd

def predict_flood_level_by_kecamatan(kecamatan_name, xgb_model, X_train, y_train, level_mapping_reverse, k=5):
    """
    Predicts the flood level for a new event based on the kecamatan using CBR and XGBoost.

    Args:
        kecamatan_name (str): The name of the sub-district (kecamatan).
        xgb_model: The trained XGBoost model.
        X_train (pd.DataFrame): The training data containing historical cases.
        y_train (pd.Series): The target variable (flood levels) for the training data.
        level_mapping_reverse (dict): Dictionary to map numerical predictions back to string labels.
        k (int): The number of most similar cases to retrieve for CBR.

    Returns:
        tuple: (predicted_level, error_message). predicted_level is 'RENDAH', 'SEDANG',
            or 'TINGGI' on success and None on failure; error_message is None on success.
    """
    # Create a new case DataFrame initialized with zeros and correct columns
    new_case = pd.DataFrame(0, index=[0], columns=X_train.columns)

    # X_train's column names were sanitized before training (non-alphanumerics -> "_"),
    # so apply the same rule to every name looked up here
    def clean(name):
        return "".join(c if c.isalnum() else "_" for c in str(name))

    # Populate with median values from X_train for numerical features
    numerical_cols = ['Latitude', 'Longitude', 'Dead', 'Missing', 'Serious Wound', 'Minor Injuries']
    for col in map(clean, numerical_cols):
        if col in new_case.columns:
            new_case[col] = X_train[col].median()

    # Set the value to 1 for the one-hot encoded 'Kecamatan' column
    kecamatan_col_name = clean(f'Kecamatan_{kecamatan_name}')
    if kecamatan_col_name not in new_case.columns:
        return None, f"Error: Kecamatan '{kecamatan_name}' not found in training data."

    new_case[kecamatan_col_name] = 1

    # Ensure the columns of new_case are in the same order as X_train
    new_case = new_case[X_train.columns]

    # Find similar cases using CBR
    similar_features, similar_levels, distances = find_similar_cases(new_case, X_train, y_train, k=k)

    # Get predictions from the XGBoost model for the similar cases
    # Ensure column names of similar_features match the training data columns
    similar_features.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in similar_features.columns]
    xgb_predictions = xgb_model.predict(similar_features)

    # Determine the final predicted flood level using majority voting
    prediction_counts = Counter(xgb_predictions.tolist())
    predicted_level_numerical = prediction_counts.most_common(1)[0][0]

    # Convert the numerical prediction back to the original level string
    predicted_level_string = level_mapping_reverse[predicted_level_numerical]

    return predicted_level_string, None

# Example usage: Predict flood level for a sample kecamatan
sample_kecamatan_name = 'Jombang' # Replace with a valid kecamatan name from your data if needed

predicted_level, error_message = predict_flood_level_by_kecamatan(sample_kecamatan_name, xgb_model, X_train, y_train, level_mapping_reverse, k=5)

if error_message:
    print(error_message)
else:
    print(f"Predicted Flood Level for {sample_kecamatan_name}: {predicted_level}")

# Example with a kecamatan not in the training data to test error handling
# sample_kecamatan_name_invalid = 'InvalidKecamatan'
# predicted_level_invalid, error_message_invalid = predict_flood_level_by_kecamatan(sample_kecamatan_name_invalid, xgb_model, X_train, y_train, level_mapping_reverse, k=5)

# if error_message_invalid:
#     print(error_message_invalid)
# else:
#     print(f"Predicted Flood Level for {sample_kecamatan_name_invalid}: {predicted_level_invalid}")

Error: Kecamatan 'Jombang' not found in training data.
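The output does not show why 'Jombang' is missing. In the column listing it appears as a regency (`Regency_Jombang Kabupaten`), so it may simply not occur as a kecamatan in this data. When a lookup fails, suggesting the closest known names can help; the sketch below does this with the standard library. `suggest_kecamatan` is a hypothetical helper, and the column list is a small sample copied from the output shown earlier.

```python
import difflib

def suggest_kecamatan(name, columns, prefix='Kecamatan_', n=3):
    """Hypothetical helper: suggest known kecamatan names close to `name`."""
    # Map lower-cased names back to their original spelling
    known = {c[len(prefix):].lower(): c[len(prefix):]
             for c in columns if c.startswith(prefix)}
    matches = difflib.get_close_matches(name.lower(), list(known), n=n, cutoff=0.6)
    return [known[m] for m in matches]

# A few column names taken from the output shown earlier
columns = ['Latitude', 'Kecamatan_Watulimo', 'Kecamatan_Winongan',
           'Kecamatan_Wonosari', 'Kecamatan_Wungu', 'Kecamatan_pace']
typo_suggestions = suggest_kecamatan('Wonosar', columns)
case_suggestions = suggest_kecamatan('PACE', columns)
print(typo_suggestions, case_suggestions)
```

The comparison is case-insensitive, so an input like 'PACE' still finds the lower-case `Kecamatan_pace` column.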
