<a href="https://colab.research.google.com/github/sakethchodi15/NuclearPrediction/blob/main/NuclearPredictionColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Nuclear Power Plant Prediction**

**Objective:**
Use existing power plant characteristics, geography, and generation patterns to predict whether a country is suitable for future nuclear power plants.

**Target variable:**
future_nuclear_suitable

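The label is derived from existing nuclear plants: the models are trained on a plant-level flag, `is_nuclear` (1 if the plant's primary fuel is nuclear, 0 otherwise), and the predicted probabilities are later averaged per country as a proxy for future suitability.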

##Load and Read Data Set power_plant_database_global

In [None]:
import pandas as pd
import numpy as np
import warnings
# This will suppress all warnings, including UserWarning and FutureWarning
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

In [None]:
from google.colab import files
files.upload()

In [None]:
df = pd.read_csv('power_plant_database_global.csv')
df.head()


##Data PreProcessing

In [None]:
NucPlant = df.copy()

# Shape of dataset
print("Dataset Shape:", NucPlant.shape)


# Column names
print("\nColumns:\n", NucPlant.columns)


# Data types & nulls
NucPlant.info()

In [None]:

# look at the statistical summary of the data
NucPlant.describe(include="all").T


In [None]:

# checking for duplicate values in the data
print("There are ", NucPlant.duplicated().sum(), "duplicate values in the data")

In [None]:
# checking for missing values in the data.
NucPlant.isnull().sum()

In [None]:
print("Original columns:", NucPlant.columns.tolist())

In [None]:
# Many of the columns with mostly null values add little to the prediction
# Drop the columns that are not needed
columns_to_drop = [
 'name', 'gppd_idnr', 'other_fuel1', 'other_fuel2', 'other_fuel3',
'commissioning_year', 'owner', 'source', 'url', 'geolocation_source', 'wepp_id',
'year_of_capacity_data', 'generation_gwh_2013','generation_gwh_2014','generation_gwh_2015',
'generation_gwh_2016','generation_gwh_2017','generation_gwh_2018','generation_gwh_2019',
'generation_data_source','estimated_generation_gwh_2013','estimated_generation_gwh_2014',
'estimated_generation_gwh_2015','estimated_generation_gwh_2016','estimated_generation_gwh_2017',
'estimated_generation_note_2013','estimated_generation_note_2014','estimated_generation_note_2015',
'estimated_generation_note_2016','estimated_generation_note_2017'
]

Drop all columns that have a high share of missing values and low predictive value.
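
One way to make the "high missing" criterion explicit is to compute the share of missing values per column and flag the columns above a cutoff. The sketch below uses a small made-up frame and a hypothetical 50% threshold (`MISSING_THRESHOLD`); neither comes from this notebook, which selects the columns by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the plant table (values are made up)
toy = pd.DataFrame({
    "capacity_mw": [10.0, 250.0, 33.0, 1200.0],
    "owner": [None, None, None, "Acme"],
    "wepp_id": [np.nan, np.nan, 1001.0, np.nan],
})

MISSING_THRESHOLD = 0.5  # hypothetical cutoff, not taken from the notebook

# Fraction of missing values in each column, highest first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns whose missing share exceeds the cutoff
sparse_columns = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

print(missing_share)
print("Candidates to drop:", sparse_columns)
```

A missing-value share only covers the "high missing" half of the criterion; whether a column has predictive value still needs a judgment call.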


In [None]:
NucPlant.drop(columns=[col for col in columns_to_drop if col in NucPlant.columns], inplace=True)

In [None]:
# look at the statistical summary of the data
NucPlant.describe(include="all").T

In [None]:
NucPlant.country.value_counts()


In [None]:
print("Original columns:", NucPlant.columns.tolist())
NucPlant.info()

In [None]:
# Drop rows with missing critical info
NucPlant = NucPlant.dropna(subset=['country','country_long', 'capacity_mw', 'primary_fuel','latitude', 'longitude'])


# Fill remaining numeric NaNs with mean
NucPlant.fillna(NucPlant.mean(numeric_only=True), inplace=True)

##Data Analysis and Visualization

In [None]:
#engineer climate zones based on latitude
def get_climate(lat):
    if abs(lat) < 23.5:
        return 'Tropical'
    elif abs(lat) < 35:
        return 'Subtropical'
    elif abs(lat) < 66.5:
        return 'Temperate'
    else:
        return 'Polar'

In [None]:
NucPlant['climate'] = NucPlant['latitude'].apply(get_climate)

In [None]:
NucPlant[['country', 'latitude', 'climate']].head()
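
The latitude bands used by `get_climate` can also be assigned in one vectorized step, which may be noticeably faster than `apply` on a large table. This is a minimal sketch, assuming latitudes contain no missing values (the notebook drops such rows earlier); `climate_from_latitude` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd

def climate_from_latitude(latitudes: pd.Series) -> pd.Series:
    """Vectorized counterpart of get_climate: bin |latitude| into climate zones."""
    bins = [0, 23.5, 35, 66.5, np.inf]
    labels = ["Tropical", "Subtropical", "Temperate", "Polar"]
    # right=False makes each bin closed on the left, matching the `<` comparisons
    zones = pd.cut(latitudes.abs(), bins=bins, labels=labels, right=False)
    return zones.astype(str)

sample = pd.Series([0.0, -23.5, 34.9, 50.0, -70.0])
print(climate_from_latitude(sample).tolist())
```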

In [None]:
colors = {
    'Tropical': 'green',
    'Subtropical': 'orange',
    'Temperate': 'blue',
    'Polar': 'purple'
}

In [None]:
plt.figure(figsize = (10,6))
for climate, color in colors.items():
    subset = NucPlant[NucPlant['climate'] == climate]
    plt.scatter(subset['longitude'], subset['latitude'],
                color=color, label=climate, s=10, alpha=0.7)


plt.title('Power Plants by Climate')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
NeoPlant = NucPlant.copy()

In [None]:
NeoPlant['primary_fuel'].unique()

In [None]:
primaryFuel = NeoPlant['primary_fuel'].unique()


In [None]:
fuelColors = {fuel: ('red' if fuel == 'Nuclear' else 'green') for fuel in primaryFuel}
fuelSize = {fuel: (40 if fuel == 'Nuclear' else 10) for fuel in primaryFuel}
fuelOrder = {fuel: (2 if fuel == 'Nuclear' else 1) for fuel in primaryFuel}
#fuelColors = ['green' if f == 'Nuclear' else 'red' for f in primaryFuel]
fuelOrder.items()


In [None]:
plt.figure(figsize = (20,10))
for fuel, color in fuelColors.items():
    # Select the plants that use this fuel (mask built from NeoPlant itself)
    subset = NeoPlant[NeoPlant['primary_fuel'] == fuel]
    size = fuelSize[fuel]
    zorder = fuelOrder[fuel]
    plt.scatter(subset['longitude'], subset['latitude'],
                color=color, label=fuel, s=size, alpha=0.7, zorder=zorder)

plt.title('Power Plants by primary_fuel')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
NeoPlant.columns

In [None]:
climates = NeoPlant['climate'].unique()

In [None]:
# Map climates to numeric codes
climate_mapping = {'Tropical': 0, 'Subtropical': 1, 'Temperate': 2, 'Polar': 3}
inverse_mapping = {v:k for k,v in climate_mapping.items()}

NeoPlant['climate_code'] = NeoPlant['climate'].map(climate_mapping)

# Compute total capacity per fuel per country
country_fuel = NeoPlant.groupby(['country', 'primary_fuel'])['capacity_mw'].sum().unstack(fill_value=0)

# Compute nuclear share per country (percentage)
country_fuel['nuclear_share'] = country_fuel.get('Nuclear', 0) / country_fuel.sum(axis=1) * 100

# Compute mean climate per country
country_climate_mean = NeoPlant.groupby('country')['climate_code'].mean()

# Assign country to nearest climate category
country_climate_cat = country_climate_mean.round().astype(int).map(inverse_mapping)

# Combine nuclear share and climate
country_data = country_fuel[['nuclear_share']].merge(country_climate_cat.rename('climate'), left_index=True, right_index=True)

#  Compute average nuclear share per climate
climate_avg = country_data.groupby('climate')['nuclear_share'].mean().sort_values(ascending=False)

# Bar Plot
plt.figure(figsize=(8,5))
climate_avg.plot(kind='bar', color=['orange','yellow','blue','purple'])
plt.ylabel('Average Nuclear Share (%) per Country')
plt.title('Average Nuclear Share by Climate Zone')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
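
To make the nuclear-share calculation easier to follow, here is the same groupby/unstack pattern on a tiny made-up table. The country codes and capacities are invented for illustration only.

```python
import pandas as pd

# Made-up plants: country AAA has one nuclear plant, BBB has none
toy = pd.DataFrame({
    "country": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "primary_fuel": ["Nuclear", "Coal", "Hydro", "Solar", "Gas"],
    "capacity_mw": [1000.0, 500.0, 500.0, 100.0, 300.0],
})

# Rows: countries, columns: fuels, values: total capacity (0 where a fuel is absent)
capacity = (
    toy.groupby(["country", "primary_fuel"])["capacity_mw"]
    .sum()
    .unstack(fill_value=0)
)

# Share of each country's total capacity that is nuclear, in percent
total_capacity = capacity.sum(axis=1)
nuclear_share = capacity["Nuclear"] / total_capacity * 100

print(nuclear_share)
```

Computing the row total before adding any derived column keeps the share itself out of the denominator.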


In [None]:

plt.figure(figsize=(10,6))
NeoPlant.boxplot(column='capacity_mw', by='climate', grid=False)
plt.title('Capacity Distribution by Climate')
plt.suptitle('')  # Remove default 'Boxplot grouped by climate' title
plt.xlabel('Climate')
plt.ylabel('Capacity (MW)')
plt.show()

In [None]:
plt.figure(figsize=(18,6))

sns.countplot(
    data=NeoPlant,
    x='primary_fuel',
    hue='primary_fuel',
    order=NeoPlant['primary_fuel'].value_counts().index,
    palette='tab20',
    legend=False
)

plt.xticks(rotation=45)
plt.title('Primary Fuel Count Distribution', fontsize=16)
plt.xlabel('Primary Fuel')
plt.ylabel('Number of Plants')
plt.show()

In [None]:
plt.figure(figsize=(10,6))

# We use sns.barplot and set estimator=sum to calculate total capacity
sns.barplot(
    data=NeoPlant,
    x='primary_fuel',
    y='capacity_mw',
    hue='primary_fuel',
    estimator=sum,
    # This ensures the bars are sorted by total capacity from highest to lowest
    order=NeoPlant.groupby('primary_fuel')['capacity_mw'].sum().sort_values(ascending=False).index,
    palette='tab20',
    legend=False,
    errorbar=None  # This removes the small error bars to keep the plot clean
)

plt.xticks(rotation=45)
plt.title('Total Power Generation Capacity (MW) by Fuel Source', fontsize=16)
plt.xlabel('Primary Fuel')
plt.ylabel('Total Capacity (MW)')
plt.tight_layout()
plt.show()

In [None]:
NeoPlant

In [None]:
#Filter only nuclear plants
nuclear_plants = NeoPlant[NeoPlant['primary_fuel'] == 'Nuclear']

#  Compute total capacity per country
nuclear_capacity_by_country = nuclear_plants.groupby('country_long')['capacity_mw'].sum()

# Sort descending for easier visualization
nuclear_capacity_by_country = nuclear_capacity_by_country.sort_values(ascending=False)

# Bar Plot
plt.figure(figsize=(12,6))
nuclear_capacity_by_country.plot(kind='bar')
plt.ylabel('Total Nuclear Capacity (MW)')
plt.xlabel('Country')
plt.title('Total Nuclear Capacity by Country')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
#Filter only nuclear plants
# .copy() lets the climate_code column be added below without a SettingWithCopyWarning
nuclear_plants = NeoPlant[NeoPlant['primary_fuel'] == 'Nuclear'].copy()

#Compute total nuclear capacity per country
nuclear_capacity_by_country = nuclear_plants.groupby('country')['capacity_mw'].sum()

#Compute mean climate code per country for nuclear plants
climate_mapping = {'Tropical': 0, 'Subtropical': 1, 'Temperate': 2, 'Polar': 3}
inverse_mapping = {v: k for k, v in climate_mapping.items()}

nuclear_plants['climate_code'] = nuclear_plants['climate'].map(climate_mapping)
mean_climate_code = nuclear_plants.groupby('country')['climate_code'].mean()
# Round to nearest climate and map back
country_climate = mean_climate_code.round().astype(int).map(inverse_mapping)

#Assign colors to climates
climate_colors = {'Tropical': 'orange', 'Subtropical': 'yellow', 'Temperate': 'green', 'Polar': 'blue'}
bar_colors = country_climate.map(climate_colors)

#Sort nuclear capacity descending for plotting
nuclear_capacity_by_country = nuclear_capacity_by_country.sort_values(ascending=False)
bar_colors = bar_colors.loc[nuclear_capacity_by_country.index]  # match sorted order

plt.figure(figsize=(12,6))
plt.bar(nuclear_capacity_by_country.index, nuclear_capacity_by_country.values, color=bar_colors)
plt.ylabel('Total Nuclear Capacity (MW)')
plt.xlabel('Country')
plt.title('Total Nuclear Capacity by Country (Colored by Climate)')
plt.xticks(rotation=90)

# Create a legend using dummy bars
for climate, color in climate_colors.items():
    plt.bar(0, 0, color=color, label=climate)

plt.legend(title='Climate')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,10))

for primaryFuel, color in fuelColors.items():
    subset = NeoPlant[NeoPlant['primary_fuel'] == primaryFuel]
    zorder = fuelOrder[primaryFuel]

    if primaryFuel == 'Nuclear':
        # Scale nuclear dot size by capacity
        size = subset['capacity_mw'] * 0.5  # adjust multiplier as needed
    else:
        # Fixed size for all other fuels
        size = fuelSize[primaryFuel]

    plt.scatter(
        subset['longitude'],
        subset['latitude'],
        color=color,
        label=primaryFuel,
        s=size,
        alpha=0.7,
        zorder=zorder
    )

plt.title('Power Plants by Primary Fuel (Nuclear Sized by Capacity)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
NPlant = NeoPlant.copy()
NPlant

In [None]:
# Drop unnecessary features for model development
NPlant.drop(['country', 'climate_code'], axis=1, inplace=True)

###Define Target Variable

In [None]:
# Nuclear plant = 1, others = 0
NPlant['is_nuclear'] = (NPlant['primary_fuel'] == 'Nuclear').astype(int)

In [None]:
#Target Variable Distribution
plt.figure(figsize=(5,4))
sns.countplot(x='is_nuclear', data=NPlant)
plt.title("Distribution of Future Nuclear Suitability")
plt.xlabel("Nuclear Suitable (1 = Yes, 0 = No)")
plt.ylabel("Count")
plt.show()

**Highly imbalanced → nuclear plants make up only a small fraction of all plants and exist in relatively few countries**
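
A quick way to see why this imbalance matters: a classifier that never predicts "nuclear" still scores high accuracy while finding no nuclear plants at all. The counts below (990 non-nuclear, 10 nuclear) are hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical label counts for illustration: 990 non-nuclear, 10 nuclear
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # a "model" that never predicts nuclear

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual nuclear plants that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")  # accuracy=0.990, recall=0.000
```

This is why precision, recall, and F1 are reported alongside accuracy later in the notebook.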

In [None]:
# Latitude vs Nuclear Suitability
plt.figure(figsize=(6,4))
sns.boxplot(x='is_nuclear', y='latitude', data=NPlant)
plt.title("Latitude vs Nuclear Suitability")
plt.show()

##Feature Engineering

In [None]:
# One-hot encode climate
NPlant_encoded = pd.get_dummies(NPlant, columns=['climate'], drop_first=False)

In [None]:
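# Encode fuel types (drop_first=True omits the first fuel category)
NPlant_encoded = pd.get_dummies(NPlant_encoded, columns=['primary_fuel'], drop_first=True)


# Remove the nuclear fuel indicator from the features so the target is not included directly.
# Caveat: the remaining fuel dummies still encode it indirectly, because a row with
# every fuel column at 0 can only be nuclear or the dropped first category.
nuclear_cols = [c for c in NPlant_encoded.columns if 'Nuclear' in c]
NPlant_encoded.drop(columns=nuclear_cols, inplace=True)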

In [None]:
NPlant_encoded

##FEATURE–TARGET SPLIT

In [None]:
# Dropping 'is_nuclear' (target) and 'country_long' (identifier)
X = NPlant_encoded.drop(columns=['is_nuclear', 'country_long'])
y = NPlant_encoded['is_nuclear']

# Split data into training and testing sets (70% train, 30% test), stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=42
)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# 1. Rebuild a DataFrame from the scaled training array
X_train_df = pd.DataFrame(X_train, columns=X.columns)

# 2. Combine with y_train (y_train is still a Series, so it needs reset_index)
train_combined = pd.concat([
    X_train_df,
    y_train.reset_index(drop=True)
], axis=1)

# 3. Display the result
print("--- Training Data with Target Variable (First 5 Rows) ---")
display(train_combined.head())

In [None]:
X_test_df = pd.DataFrame(X_test, columns=X.columns)
test_combined = pd.concat([X_test_df, y_test.reset_index(drop=True)], axis=1)

print("--- Test Data Table ---")
display(test_combined.head())

##TRAIN & EVALUATE MODELS
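
Several of the models compared in this section (SVM, KNN, Logistic Regression) are sensitive to feature scale. The usual pattern is to compute the scaling statistics on the training rows only and reuse those same statistics for every other split. The NumPy sketch below illustrates that pattern on synthetic data; the arrays and the `standardize` helper are assumptions for illustration, and `StandardScaler` does the equivalent internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features
X_test_raw = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Fit: the statistics come from the training rows only
train_mean = X_train_raw.mean(axis=0)
train_std = X_train_raw.std(axis=0)  # population std (ddof=0), as StandardScaler uses

def standardize(X):
    """Apply the training-set mean and standard deviation to any split."""
    return (X - train_mean) / train_std

X_train_std = standardize(X_train_raw)
X_test_std = standardize(X_test_raw)

# Refitting on already-standardized data gives mean ~0 and std ~1, so a scaler
# fitted that way would leave raw inputs practically unscaled.
refit_mean = X_train_std.mean(axis=0)
refit_std = X_train_std.std(axis=0)

print(refit_mean.round(6), refit_std.round(6))
```

The practical consequence is that one fitted scaler should be kept and reused wherever raw features are transformed, including at prediction time.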

In [None]:
# X_train and X_test were already standardized in the split step above,
# so reuse them instead of fitting a second scaler on scaled data
# (this also keeps `scaler` fitted on the raw features for later use)
X_train_scaled = X_train
X_test_scaled = X_test

# Compare several models on accuracy, precision, recall, and F1 score
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),

}

results = []

for name, model in models.items():
    # Use scaled data for  SVM, KNN, Logistic Regression
    if name in ["SVM", "KNN", "Logistic Regression"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1 Score": f1_score(y_test, y_pred, zero_division=0)
    })

results_df = pd.DataFrame(results)
print(results_df)

In [None]:
results_df

In [None]:
#Compare all models
# Set plot style
sns.set(style="whitegrid")

# Melt the dataframe for easier plotting
results_melted = results_df.melt(id_vars="Model",
                                 value_vars=["Accuracy", "Precision", "Recall", "F1 Score"],
                                 var_name="Metric",
                                 value_name="Score")

# Create grouped bar plot
plt.figure(figsize=(12,6))
sns.barplot(data=results_melted, x="Model", y="Score", hue="Metric")
plt.xticks(rotation=45, ha='right')
plt.ylim(0.75, 1.02)  # focus on range near 1 for clarity
plt.title("Comparison of Machine Learning Models on Test Set")
plt.ylabel("Score")
plt.legend(title="Metric")
plt.tight_layout()
plt.show()

**Observations:**

- High Accuracy, Low Information: All models show very high accuracy ($\ge$ 99.7%), which reflects how imbalanced the dataset is (the vast majority of plants are non-nuclear). In such scenarios, Accuracy is not a reliable metric.
- Best Overall Performer: The KNN model is the best performer across all relevant metrics for imbalanced data, boasting the highest Precision (1.000), Recall (0.949), and F1 Score (0.974).

- Precision (Avoiding False Positives):

  - Every plant that SVM and KNN flag as nuclear is in fact nuclear (Precision = 1.000).
  - Decision Tree has the lowest precision (0.825), meaning it has the highest rate of false alarms.

- Recall (Finding All Positive Cases):

  - KNN is best at identifying actual positive cases (Recall = 0.949).
  - Logistic Regression struggles significantly here (Recall = 0.678), missing over 30% of actual positive instances.

- F1 Score (Balanced View):

  - KNN (0.974) and SVM (0.956) are the top models.
  - Logistic Regression has the lowest balanced score (0.784).
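
The F1 figures in the observations follow from precision and recall as their harmonic mean, $F_1 = 2PR / (P + R)$. The short check below reproduces the KNN value from the quoted precision and recall; `f1_from_precision_recall` is a hypothetical helper written for this illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# KNN figures quoted in the observations: Precision = 1.000, Recall = 0.949
knn_f1 = f1_from_precision_recall(1.000, 0.949)
print(round(knn_f1, 3))  # 0.974
```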



##PREDICT FUTURE NUCLEAR SUITABILITY (TARGET)
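
The prediction step in this section uses `KNeighborsClassifier` with `weights='distance'`. As a rough illustration of what that weighting means, the sketch below computes a positive-class probability by hand: each of the k nearest neighbors votes with weight 1/distance, and an exact match (distance 0) takes the whole vote. The function name and the toy points are assumptions for illustration, not the library's implementation.

```python
import numpy as np

def knn_distance_weighted_proba(X_train, y_train, x_query, k=5):
    """Positive-class probability from the k nearest neighbors, weighted by 1/distance."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    d = distances[nearest]
    labels = y_train[nearest]
    if np.any(d == 0):
        # Exact matches take all the weight; other neighbors are ignored
        weights = (d == 0).astype(float)
    else:
        weights = 1.0 / d
    return float(np.sum(weights * labels) / np.sum(weights))

# Toy training set: two positive and two negative points
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
y_demo = np.array([1, 0, 1, 0])

print(knn_distance_weighted_proba(X_demo, y_demo, np.array([0.0, 1.0]), k=3))
print(knn_distance_weighted_proba(X_demo, y_demo, np.array([1.0, 0.0]), k=3))
```

Under this scheme a query point that coincides with a training point simply gets that point's label back, which is worth keeping in mind when scoring rows that were also used for fitting.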

In [None]:
best_model = KNeighborsClassifier(n_neighbors=5, weights='distance')
best_model.fit(X_train_scaled, y_train)

# Predict Probabilities for the Entire Dataset
# IMPORTANT: KNN requires scaled input to calculate distances correctly
all_probs = best_model.predict_proba(scaler.transform(X))[:, 1]
NPlant_encoded['nuclear_probability'] = all_probs

# Aggregate to Country Level (Using the FULL dataset)
# We calculate the mean probability for every country, not just 0 or 1
country_suitability_full = NPlant_encoded.groupby('country_long')['nuclear_probability'].mean().reset_index()

# Get Top 5 Countries Overall
top_5_overall = country_suitability_full.sort_values(by='nuclear_probability', ascending=False).head(5)

# Visualization ---
plt.figure(figsize=(10, 6))

sns.barplot(
    data=top_5_overall,
    x='nuclear_probability',
    y='country_long',
    palette='viridis',
    hue='country_long',
    legend=False
)

plt.title('Top 5 Countries for Nuclear Potential (Overall Data)', fontsize=14, fontweight='bold')
plt.xlabel('Average Nuclear-Likeness Probability', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.xlim(0, 1.0)
plt.tight_layout()
plt.show()

# Show the top 5 countries and their average probabilities
print(top_5_overall)

##Conclusion

Based on our analysis, the K‑Nearest Neighbors (KNN) model was the most effective algorithm for this project. It achieved the highest F1 score on the test set (0.974), with perfect precision and the highest recall (0.949). Other models such as SVM (F1 = 0.956), Random Forest, and Gradient Boosting also performed well, but none matched KNN's balance of precision and recall.

After choosing KNN as the best model, we used it to estimate a nuclear‑likeness probability for every plant in the dataset and averaged those probabilities by country. The results showed that Burundi, Central African Republic, Liberia, Sudan, and Uganda were the top five countries, with the highest estimated potential for nuclear power development. In the model's feature space (location, climate zone, capacity, and fuel type), plants in these countries resemble existing nuclear plants, even though many of these countries currently have little or no nuclear infrastructure.

Overall, this study shows that machine learning can be a helpful tool for identifying where clean and reliable energy sources, such as nuclear power, could be expanded in the future. The findings point to several African countries as promising locations and highlight how data‑driven methods can support global efforts toward more sustainable energy.

In the future, this research could be expanded by including more detailed environmental, economic, and political data to make the predictions even more accurate. Using larger or more updated datasets, as well as testing more advanced machine learning models, could also improve the results. These enhancements would help create an even clearer picture of which countries are best prepared for future nuclear energy development.
