---
title: "DS202W - Group Project"
author: "Civic Tensor (Group 2)"
format: html
self-contained: true
jupyter: python3
engine: jupyter
editor:
  render-on-save: true
  preview: true
---

In [18]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report, f1_score, precision_recall_curve
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

In [19]:
# Open the data file
df = pd.read_csv('ds202w-group-projects-civil-war.csv')

In [20]:
# Drop unnamed columns
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

### **EDA**

In [21]:
# Get full expected range of years (modify as needed)
expected_years = set(range(df['year'].min(), df['year'].max() + 1))

# Dictionary to hold missing years per cowcode
missing_years_by_cowcode = {}

# Group by 'cowcode' and check missing years
for cowcode, group in df.groupby('cowcode'):
    actual_years = set(group['year'].dropna().unique())
    missing_years = expected_years - actual_years
    if missing_years:  # Only store if there are missing years
        missing_years_by_cowcode[cowcode] = sorted(missing_years)

# Display
for cowcode, years in missing_years_by_cowcode.items():
    print(f"COWCODE {cowcode} is missing years: {years}")

COWCODE 31 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972]
COWCODE 51 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961]
COWCODE 52 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961]
COWCODE 53 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965]
COWCODE 55 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973]
COWCODE 80 is missing years: [1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969

In [22]:
# Target variable analysis
print("\nTarget variable (warstds) distribution:")
target_counts = df['warstds'].value_counts()
print(target_counts)
print(f"Percentage of civil wars: {target_counts[1] / len(df) * 100:.2f}%")


Target variable (warstds) distribution:
warstds
0    7024
1     116
Name: count, dtype: int64
Percentage of civil wars: 1.62%


In [23]:
# Create a figure showing the distribution of civil wars over time
civil_wars_by_year = df.groupby('year')['warstds'].sum().reset_index()
civil_wars_by_year.columns = ['Year', 'Number of Civil Wars']

fig1 = px.line(civil_wars_by_year, x='Year', y='Number of Civil Wars', 
              title='Number of Civil War Onsets by Year (1945-2000)')
fig1.update_layout(template='plotly_white')
fig1.show()

> **Observations**  
> 1991: the number of civil war onsets peaks as the Soviet Union collapses and the Cold War ends. A plausible explanation is the economic and political restructuring, and the resulting instability, that followed the breakdown of pre-existing systems.

In [24]:
# Geographic distribution of civil wars
# Create a region variable based on geo columns if available
geo_cols = [col for col in df.columns if col.startswith('geo')]
if geo_cols:
    region_map = {
        'geo1': 'Western Europe & North America',
        'geo2': 'Eastern Europe',
        'geo34': 'Latin America & Caribbean',
        'geo57': 'Asia & Pacific',
        'geo69': 'Middle East & North Africa',
        'geo8': 'Sub-Saharan Africa'
    }
    
    # Create region variable (if several geo flags are set for a row, the last matching one wins)
    region_df = df.copy()
    region_df['region'] = 'Unknown'
    for col in geo_cols:
        if col in region_map:
            mask = df[col] == 1
            region_df.loc[mask, 'region'] = region_map[col]
    
    # Create a bar chart showing civil wars by region
    region_wars = region_df.groupby('region')['warstds'].sum().reset_index()
    region_wars.columns = ['Region', 'Number of Civil Wars']
    region_wars = region_wars.sort_values('Number of Civil Wars', ascending=False)
    
    fig2 = px.bar(region_wars, x='Region', y='Number of Civil Wars',
                 title='Number of Civil Wars by Region (1945-2000)',
                 color='Number of Civil Wars')
    fig2.update_layout(template='plotly_white', xaxis_tickangle=-45)
    fig2.show()

In [31]:
df['ln_gdpen']

0       0.851709
1      -1.639897
2      -1.629641
3      -1.639897
4      -1.398367
          ...   
7135    0.236829
7136    0.244755
7137    0.261856
7138    0.244217
7139    0.856959
Name: ln_gdpen, Length: 7140, dtype: float64

In [25]:
# Economic factors and civil war
# GDP per capita
df['gdpen'] = np.exp(df['ln_gdpen'])
fig3 = px.box(df, x='warstds', y='ln_gdpen', color='warstds',
             labels={'warstds': 'Civil War', 'ln_gdpen': 'Log GDP per capita'},
             title='Log GDP per Capita vs Civil War Onset',
             category_orders={'warstds': [0, 1]},
             color_discrete_map={0: 'blue', 1: 'red'})
fig3.update_layout(template='plotly_white')
fig3.show()

# GDP growth
fig4 = px.box(df, x='warstds', y='gdpgrowth', color='warstds',
             labels={'warstds': 'Civil War', 'gdpgrowth': 'GDP Growth (%)'},
             title='GDP Growth vs Civil War Onset',
             category_orders={'warstds': [0, 1]},
             color_discrete_map={0: 'blue', 1: 'red'})
fig4.update_layout(template='plotly_white')
fig4.show()

In [27]:
df['gdpen'] = np.exp(df['ln_gdpen'])
fig3_actual = px.box(df, x='warstds', y='gdpen', color='warstds',
             labels={'warstds': 'Civil War', 'gdpen': 'GDP per Capita'},
             title='GDP per Capita vs Civil War Onset',
             category_orders={'warstds': [0, 1]},
             color_discrete_map={0: 'blue', 1: 'red'})
fig3_actual.update_layout(template='plotly_white')
fig3_actual.show()

In [32]:
if 'ef' in df.columns:
    # Create bins for ethnic fractionalization
    ef_df = df.copy()
    ef_df['ef_bin'] = pd.cut(ef_df['ef'], bins=10)

    # Group and calculate mean
    ef_war_prob = ef_df.groupby('ef_bin')['warstds'].mean().reset_index()
    ef_war_prob.columns = ['Ethnic Fractionalization', 'Civil War Probability']

    # Convert Interval objects to string for plotting
    ef_war_prob['Ethnic Fractionalization'] = ef_war_prob['Ethnic Fractionalization'].astype(str)

    # Plot
    fig5 = px.bar(
        ef_war_prob,
        x='Ethnic Fractionalization',
        y='Civil War Probability',
        title='Probability of Civil War by Ethnic Fractionalization',
        color='Civil War Probability'
    )
    fig5.update_layout(template='plotly_white')
    fig5.show()

In [None]:
# Political regime and civil war
# Plot civil war probability by polity score bin to check for a non-linear (curved) relationship
if 'pol4' in df.columns:
    # Create bins for polity score
    pol_df = df.copy()
    pol_df['pol4_bin'] = pd.cut(pol_df['pol4'], bins=10)

    # Group and calculate mean
    pol_war_prob = pol_df.groupby('pol4_bin')['warstds'].mean().reset_index()
    pol_war_prob.columns = ['Polity Score', 'Civil War Probability']

    # Convert Interval to string for plotting
    pol_war_prob['Polity Score'] = pol_war_prob['Polity Score'].astype(str)

    # Plot
    fig6 = px.line(
        pol_war_prob,
        x='Polity Score',
        y='Civil War Probability',
        title='Probability of Civil War by Polity Score (Political Regime)',
        markers=True
    )
    fig6.update_layout(template='plotly_white')
    fig6.show()

In [None]:
# Correlation heatmap of key variables
# Select relevant variables for correlation
key_vars = ['warstds', 'ln_gdpen', 'gdpgrowth', 'lpopns', 'lmtnest', 'ef', 'relfrac', 'pol4', 'pol4sq']
# Filter to variables that exist in the dataset
key_vars = [var for var in key_vars if var in df.columns]

corr = df[key_vars].corr()

fig7 = px.imshow(corr, text_auto=True, aspect="auto",
                title="Correlation Matrix of Key Variables",
                color_continuous_scale='RdBu_r')
fig7.update_layout(template='plotly_white')
fig7.show()

### **RF modelling**

In [None]:
yearly_war_counts = df.groupby('year')['warstds'].sum().reset_index()

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

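target_col = 'warstds'
# Keep 'year' for now: it is needed to build the temporal split below
feature_cols = [col for col in df.columns if col not in [target_col, 'cowcode']]

X = df[feature_cols].copy()
y = df[target_col].copy()

# Temporal split: train on years before 1988, test on 1988 onwards.
# 'year' is dropped afterwards so it is not used as a predictor.
X_train = X[X['year'] < 1988].drop(columns='year')
X_test = X[X['year'] >= 1988].drop(columns='year')

y_train = y[df['year'] < 1988]
y_test = y[df['year'] >= 1988]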

In [None]:
# Train Random Forest model
# First, perform a grid search for optimal hyperparameters
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

rf = RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("\nBest hyperparameters:")
print(grid_search.best_params_)

# Train final model with best parameters
best_rf = RandomForestClassifier(
    random_state=42,
    class_weight='balanced',
    n_jobs=-1,
    **grid_search.best_params_
)
best_rf.fit(X_train, y_train)

**Dealing with Class Imbalance using Downsampling**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    fbeta_score, classification_report
)

# Downsample the training set
rus = RandomUnderSampler(random_state=42, sampling_strategy=0.2)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

# Instantiate RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    random_state=42,
    n_jobs=-1
)

# Fit model
rf.fit(X_train_resampled, y_train_resampled)

# Predict on training and test sets
y_train_resampled_pred = rf.predict(X_train_resampled)
y_train_resampled_pred_proba = rf.predict_proba(X_train_resampled)[:, 1]
y_test_resampled_pred = rf.predict(X_test)
y_test_resampled_pred_proba = rf.predict_proba(X_test)[:, 1]

# Evaluate
print("TRAINING SET PERFORMANCE (DOWNSAMPLING)")
print("ROC-AUC:", roc_auc_score(y_train_resampled, y_train_resampled_pred_proba))
print("PR-AUC:", average_precision_score(y_train_resampled, y_train_resampled_pred_proba))
print("F2-score:", fbeta_score(y_train_resampled, y_train_resampled_pred, beta=2))
print(classification_report(y_train_resampled, y_train_resampled_pred, digits=3))

print("\nTEST SET PERFORMANCE (DOWNSAMPLING)")
print("ROC-AUC:", roc_auc_score(y_test, y_test_resampled_pred_proba))
print("PR-AUC:", average_precision_score(y_test, y_test_resampled_pred_proba))
print("F2-score:", fbeta_score(y_test, y_test_resampled_pred, beta=2))
print(classification_report(y_test, y_test_resampled_pred, digits=3))

**Dealing with Class Imbalance using Oversampling**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    fbeta_score, classification_report
)

# Oversample the training set using SMOTE
smote = SMOTE(random_state=42, sampling_strategy=0.2)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Instantiate RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    random_state=42,
    n_jobs=-1
)

# Fit model
rf.fit(X_train_resampled, y_train_resampled)

# Predict on training and test sets
y_train_resampled_pred = rf.predict(X_train_resampled)
y_train_resampled_pred_proba = rf.predict_proba(X_train_resampled)[:, 1]
y_test_resampled_pred = rf.predict(X_test)
y_test_resampled_pred_proba = rf.predict_proba(X_test)[:, 1]

# Evaluate
print("TRAINING SET PERFORMANCE (OVERSAMPLING)")
print("ROC-AUC:", roc_auc_score(y_train_resampled, y_train_resampled_pred_proba))
print("PR-AUC:", average_precision_score(y_train_resampled, y_train_resampled_pred_proba))
print("F2-score:", fbeta_score(y_train_resampled, y_train_resampled_pred, beta=2))
print(classification_report(y_train_resampled, y_train_resampled_pred, digits=3))

print("\nTEST SET PERFORMANCE (OVERSAMPLING)")
print("ROC-AUC:", roc_auc_score(y_test, y_test_resampled_pred_proba))
print("PR-AUC:", average_precision_score(y_test, y_test_resampled_pred_proba))
print("F2-score:", fbeta_score(y_test, y_test_resampled_pred, beta=2))
print(classification_report(y_test, y_test_resampled_pred, digits=3))
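
Both resampling cells above pass `sampling_strategy=0.2`. As a rough illustration of what that float means (the target ratio of minority to majority rows after resampling), the sketch below works through the arithmetic. The helper `resampled_counts` and the class counts are hypothetical, not taken from the dataset, and imbalanced-learn's exact rounding may differ by a row.

```python
def resampled_counts(n_majority, n_minority, ratio, method):
    """Approximate (majority, minority) counts after resampling to a minority/majority ratio."""
    if method == "under":
        # Undersampling keeps every minority row and shrinks the majority class
        return round(n_minority / ratio), n_minority
    if method == "over":
        # Oversampling keeps every majority row and grows the minority class
        return n_majority, round(n_majority * ratio)
    raise ValueError("method must be 'under' or 'over'")


# Hypothetical training set: 5000 non-onset rows and 80 onset rows
under_counts = resampled_counts(5000, 80, ratio=0.2, method="under")
over_counts = resampled_counts(5000, 80, ratio=0.2, method="over")
print("undersampled (majority, minority):", under_counts)
print("oversampled  (majority, minority):", over_counts)
```

With these made-up counts, undersampling trains on far fewer rows than oversampling, which may partly explain differences in variance between the two approaches.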

----

In [17]:
# Prepare data for modeling
# Define features and target
# Let's select important variables based on the literature
"""feature_cols = [
    'ln_gdpen',  # GDP per capita (log)
    'gdpgrowth', # GDP growth
    'lpopns',    # Population (log)
    'lmtnest',   # Mountainous terrain
    'ef',        # Ethnic fractionalization
    'ef2',       # Ethnic fractionalization squared
    'relfrac',   # Religious fractionalization
    'pol4',      # Polity score
    'pol4sq',    # Polity score squared
    'oil',       # Oil exporter
    'warhist',   # Prior war history
]"""

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

target_col = 'warstds'
feature_cols = [col for col in df.columns if col not in [target_col, 'year', 'cowcode']]

# Filter to features that exist in the dataset
# features = [feature for feature in feature_cols if feature in df.columns]

# Handle missing values with RF imputation
# First create X and y
X = df[feature_cols].copy()
y = df[target_col].copy()

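# Split data into training and testing sets first, so the imputer never sees test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Impute missing values: fit on the training set only, then apply to the test set
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
X_train = pd.DataFrame(rf_imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(rf_imputer.transform(X_test), columns=X.columns, index=X_test.index)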

# Handle class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

KeyboardInterrupt: 

In [None]:
# Train Random Forest model
# First, perform a grid search for optimal hyperparameters
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

rf = RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train_balanced, y_train_balanced)

print("\nBest hyperparameters:")
print(grid_search.best_params_)

# Train final model with best parameters
best_rf = RandomForestClassifier(
    random_state=42,
    class_weight='balanced',
    n_jobs=-1,
    **grid_search.best_params_
)
best_rf.fit(X_train_balanced, y_train_balanced)

In [None]:
# Model evaluation
# Make predictions
y_pred_proba = best_rf.predict_proba(X_test)[:, 1]
threshold = 0.5
y_pred = (y_pred_proba >= threshold).astype(int)

# Calculate metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

fig8 = px.area(
    x=fpr, y=tpr,
    title=f'ROC Curve (AUC = {roc_auc:.3f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig8.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)
fig8.update_layout(template='plotly_white')
fig8.show()

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
pr_auc = auc(recall, precision)

fig9 = px.area(
    x=recall, y=precision,
    title=f'Precision-Recall Curve (AUC = {pr_auc:.3f})',
    labels=dict(x='Recall', y='Precision'),
    width=700, height=500
)
fig9.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=y_test.mean(), y1=y_test.mean()
)
fig9.update_layout(template='plotly_white')
fig9.show()

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Use a dedicated name for the axis labels so the target variable `y` is not overwritten
class_labels = ['No Civil War', 'Civil War']

# Create heatmap
fig10 = ff.create_annotated_heatmap(
    z=conf_matrix,
    x=class_labels,
    y=class_labels,
    annotation_text=conf_matrix,
    colorscale='Blues'
)
fig10.update_layout(
    title='Confusion Matrix',
    xaxis=dict(title='Predicted'),
    yaxis=dict(title='Actual')
)
fig10.show()

In [None]:
# Feature importance
importance = best_rf.feature_importances_
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,  # `features` was never defined; use the columns the model was trained on
    'Importance': importance
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

fig11 = px.bar(
    feature_importance,
    x='Importance',
    y='Feature',
    orientation='h',
    title='Feature Importance in Random Forest Model'
)
fig11.update_layout(template='plotly_white')
fig11.show()

In [None]:
# Create a separation plot
# Sort observations by predicted probability
df_sep = pd.DataFrame({
    'true': y_test.values,
    'prob': y_pred_proba
})
# Reset the index so it equals the sorted rank used on the x-axis
df_sep = df_sep.sort_values('prob').reset_index(drop=True)

# Create the plot
fig12 = go.Figure()

# Add a scatter plot for probabilities
fig12.add_trace(go.Scatter(
    x=list(range(len(df_sep))),
    y=df_sep['prob'],
    mode='lines',
    name='Predicted Probability',
    line=dict(color='red')
))

# Add vertical lines for actual events
events = df_sep[df_sep['true'] == 1].index
for event in events:
    fig12.add_shape(
        type='line',
        x0=event, y0=0,
        x1=event, y1=1,
        line=dict(color='black', width=1, dash='dot'),
    )

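# Add triangle for expected number of events
# (placed so that the expected number of events lies to its right, where the highest probabilities are)
expected_events = df_sep['prob'].sum()
fig12.add_trace(go.Scatter(
    x=[len(df_sep) - expected_events],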
    y=[0.05],
    mode='markers',
    marker=dict(symbol='triangle-up', size=15, color='black'),
    name='Expected Events'
))

fig12.update_layout(
    title='Separation Plot',
    xaxis_title='Observations (sorted by predicted probability)',
    yaxis_title='Predicted Probability',
    template='plotly_white',
    showlegend=True
)
fig12.show()

In [None]:
# Map of civil war predictions
# Create a world map showing actual civil wars
if 'cowcode' in df.columns:
    # Get unique cowcodes and their civil war counts
    country_wars = df.groupby('cowcode')['warstds'].sum().reset_index()
    country_wars.columns = ['cowcode', 'civil_wars']
    
    # Create hover text
    hover_text = []
    for index, row in country_wars.iterrows():
        hover_text.append(f'Country Code: {row["cowcode"]}<br>Civil Wars: {row["civil_wars"]}')
    
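    # NOTE: 'cowcode' holds numeric Correlates of War codes, not ISO-3 alpha codes;
    # they need to be mapped to ISO-3 before this map will render any countries
    fig13 = go.Figure(data=go.Choropleth(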
        locations=country_wars['cowcode'],
        z=country_wars['civil_wars'],
        locationmode='ISO-3',
        colorscale='Reds',
        autocolorscale=False,
        marker_line_color='darkgray',
        marker_line_width=0.5,
        colorbar_title='Number of Civil Wars',
        text=hover_text,
        hoverinfo='text'
    ))
    
    fig13.update_layout(
        title_text='Civil Wars by Country (1945-2000)',
        geo=dict(
            showframe=False,
            showcoastlines=True,
            projection_type='equirectangular'
        ),
        width=900,
        height=600
    )
    fig13.show()

In [None]:
# Time trends - GDP and civil war probability
# Calculate average GDP per capita and civil war probability by year
if 'ln_gdpen' in df.columns:
    yearly_data = df.groupby('year').agg({
        'ln_gdpen': 'mean',
        'warstds': 'mean'
    }).reset_index()
    
    yearly_data.columns = ['Year', 'Avg Log GDP per Capita', 'Civil War Probability']
    
    # Create the figure with two y-axes
    fig14 = make_subplots(specs=[[{"secondary_y": True}]])
    
    # Add GDP per capita line
    fig14.add_trace(
        go.Scatter(
            x=yearly_data['Year'],
            y=yearly_data['Avg Log GDP per Capita'],
            name='Avg Log GDP per Capita',
            line=dict(color='blue')
        ),
        secondary_y=False
    )
    
    # Add civil war probability line
    fig14.add_trace(
        go.Scatter(
            x=yearly_data['Year'],
            y=yearly_data['Civil War Probability'],
            name='Civil War Probability',
            line=dict(color='red')
        ),
        secondary_y=True
    )
    
    # Set titles
    fig14.update_layout(
        title_text='GDP per Capita and Civil War Probability (1945-2000)',
        template='plotly_white'
    )
    
    # Set x-axis title
    fig14.update_xaxes(title_text='Year')
    
    # Set y-axes titles
    fig14.update_yaxes(title_text='Avg Log GDP per Capita', secondary_y=False)
    fig14.update_yaxes(title_text='Civil War Probability', secondary_y=True)
    
    fig14.show()

In [None]:
# 3D plot of GDP, Ethnic Fractionalization, and Civil War
if all(col in df.columns for col in ['ln_gdpen', 'ef', 'warstds']):
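    # Create a sample to reduce markers for better visualization
    sample_size = min(2000, len(df))
    df_sample = df.sample(sample_size, random_state=42)
    # Cast the target to string so Plotly treats it as a discrete category
    # (a numeric column gets a continuous color scale and ignores color_discrete_map)
    df_sample['warstds'] = df_sample['warstds'].astype(int).astype(str)
    
    fig15 = px.scatter_3d(
        df_sample,
        x='ln_gdpen',
        y='ef',
        z='pol4' if 'pol4' in df.columns else 'lpopns',
        color='warstds',
        opacity=0.7,
        color_discrete_map={'0': 'blue', '1': 'red'},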
        title='3D Plot of Key Civil War Predictors',
        labels={
            'ln_gdpen': 'Log GDP per Capita',
            'ef': 'Ethnic Fractionalization',
            'pol4': 'Polity Score',
            'lpopns': 'Log Population',
            'warstds': 'Civil War'
        }
    )
    
    fig15.update_layout(
        template='plotly_white',
        scene=dict(
            xaxis_title='Log GDP per Capita',
            yaxis_title='Ethnic Fractionalization',
            zaxis_title='Polity Score' if 'pol4' in df.columns else 'Log Population'
        )
    )
    
    fig15.show()

> **Issues**
> * Every observation is predicted as "not civil war" (class 0), a symptom of the class imbalance
> * The ROC-AUC score is quite high, so the model ranks observations well, but the default 0.5 threshold is too high for it to ever predict a civil war
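
One way to act on the threshold issue above is to scan candidate thresholds and keep the one that maximises an F-beta score. The sketch below is a minimal illustration with NumPy only; the helper `best_fbeta_threshold` and the toy labels and probabilities are hypothetical, and `beta=2` is an assumption that mirrors the F2-score already printed in the resampling cells.

```python
import numpy as np


def best_fbeta_threshold(y_true, y_prob, beta=2.0):
    """Return (threshold, score) maximising F-beta over the observed probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    best_t, best_score = 0.5, -1.0
    for t in np.unique(y_prob):  # every distinct probability is a candidate threshold
        y_hat = y_prob >= t
        tp = np.sum(y_hat & y_true)
        fp = np.sum(y_hat & ~y_true)
        fn = np.sum(~y_hat & y_true)
        denom = (1 + beta**2) * tp + beta**2 * fn + fp
        score = (1 + beta**2) * tp / denom if denom > 0 else 0.0
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score


# Hypothetical toy data: 2 positives among 10 rows, every probability below 0.5,
# so a 0.5 cut-off would predict class 0 everywhere
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob_demo = [0.01, 0.02, 0.02, 0.03, 0.05, 0.06, 0.08, 0.20, 0.15, 0.30]

threshold_demo, f2_demo = best_fbeta_threshold(y_true_demo, y_prob_demo, beta=2.0)
print(f"best threshold = {threshold_demo:.2f}, F2 = {f2_demo:.3f}")
```

In the notebook, `y_test` and `y_pred_proba` would take the place of the toy lists. Choosing the threshold on a validation split rather than on the test set would avoid an optimistic estimate.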

> **Further Steps** (non-exhaustive)  
> (Fixing issues)
> * Add precision-recall curve analysis (more informative than ROC for imbalanced data)
> * Choose a better classification threshold
> * Add SMOTE for class imbalance, class weights in the RF classifier, and stratified k-fold cross-validation to maintain class proportions
> 
> (Optimising)
> * Which scaler to use?
> * Country-level prediction analysis to identify which are under/over-predicted countries
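
The country-level analysis listed above could start from something like the sketch below, which compares observed onsets with the expected number of onsets (the sum of predicted probabilities) per country. The helper `country_prediction_gaps`, the column names, and the demo values are all hypothetical; this is one possible approach, not the group's method.

```python
import pandas as pd


def country_prediction_gaps(cowcodes, y_true, y_prob):
    """Compare observed onsets with expected onsets (summed probabilities) per country."""
    scored = pd.DataFrame({"cowcode": cowcodes, "observed": y_true, "expected": y_prob})
    by_country = scored.groupby("cowcode")[["observed", "expected"]].sum()
    # Positive gap: more onsets than the model expected (under-predicted country);
    # negative gap: fewer onsets than expected (over-predicted country)
    by_country["gap"] = by_country["observed"] - by_country["expected"]
    return by_country.sort_values("gap", ascending=False)


# Hypothetical scores for three made-up country codes
demo_gaps = country_prediction_gaps(
    cowcodes=[1, 1, 1, 2, 2, 2, 3, 3, 3],
    y_true=[0, 1, 1, 0, 0, 0, 0, 0, 1],
    y_prob=[0.1, 0.2, 0.1, 0.3, 0.4, 0.2, 0.1, 0.5, 0.6],
)
print(demo_gaps)
```

In the notebook, `df.loc[X_test.index, 'cowcode']` could supply the country codes, assuming the test set keeps the original row index.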