# 🇺🇬 A Machine Learning Approach to Accurate Import Valuation in Uganda

This notebook builds predictive models for import valuation from Uganda's customs data (2020–2024). Inaccurate manual valuation contributes to revenue leakage, and the aim here is to test whether machine learning can narrow that gap. The work aligns with the goals of URA's Vision 2040 and its digital transformation agenda.

**Objectives**:
- Predict CIF values accurately using supervised ML models
- Compare ML vs traditional valuation methods
- Provide visual insights for operational integration

---


In [4]:
# === Essential libraries ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import scipy.stats as stats
import warnings
import shap
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

warnings.filterwarnings("ignore")
pd.set_option("display.float_format", lambda x: "%.2f" % x)


### LOADING THE DATA

In [5]:
# Load training and testing datasets
df_train = pd.read_csv("Uganda_imports_train.csv")
df_test = pd.read_csv("Uganda_imports_test.csv")

In [7]:
# Overview
print(df_train.shape)
print(df_test.shape)

(70734, 26)
(70734, 25)


In [9]:
df_train.head(5)

| | HS_Code | Item_Description | Country_of_Origin | Port_of_Shipment | Quantity | Quantity_Unit | Net_Mass_kg | Gross_Mass_kg | FOB_Value_USD | Freight_USD | ... | Mode_of_Transport | Year | Month | Invoice_Amount | Valuation_Method | Value_per_kg | Value_per_unit | FOB_per_kg | Freight_per_kg | Insurance_per_kg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30049099 | Generic pharmaceutical products | China | Port Bell | 482.42 | kg | 2220.29 | 2403.43 | 2352.84 | 220.04 | ... | Water | 2021 | 11 | 9671924.57 | Deductive Value Method (DVM) | 4356.15 | 20048.76 | 1.06 | 0.1 | 0.02 |
| 1 | 30049099 | Generic pharmaceutical products | China | Entebbe Airport | 131.97 | liters | 348.67 | 377.42 | 2084.1 | 169.47 | ... | Air | 2022 | 11 | 8412978.38 | Computed Value Method (CVM) | 24128.77 | 63749.17 | 5.98 | 0.49 | 0.05 |
| 2 | 15079090 | Vegetable fats and oils | Germany | Entebbe Airport | 113.44 | pairs | 449.93 | 487.04 | 2759.84 | 151.3 | ... | Air | 2022 | 3 | 10672562.76 | Transaction Value of Similar Goods (TVSG) | 23720.5 | 94081.12 | 6.13 | 0.34 | 0.12 |
| 3 | 10063010 | Milled rice | India | Busia | 230.52 | units | 808.09 | 874.73 | 2917.65 | 214.86 | ... | Land | 2023 | 4 | 11692581.49 | Computed Value Method (CVM) | 14469.41 | 50722.63 | 3.61 | 0.27 | 0.05 |
| 4 | 84089010 | Industrial machinery parts | Saudi Arabia | Entebbe Airport | 341.7 | boxes | 896.63 | 970.58 | 6971.39 | 366.85 | ... | Air | 2021 | 6 | 26519078.57 | Computed Value Method (CVM) | 29576.39 | 77609.24 | 7.78 | 0.41 | 0.15 |


In [10]:
df_test.head(5)

| | HS_Code | Item_Description | Country_of_Origin | Port_of_Shipment | Quantity | Quantity_Unit | Net_Mass_kg | Gross_Mass_kg | FOB_Value_USD | Freight_USD | ... | Mode_of_Transport | Year | Month | Invoice_Amount | Valuation_Method | Value_per_kg | Value_per_unit | FOB_per_kg | Freight_per_kg | Insurance_per_kg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30049099 | Generic pharmaceutical products | China | Port Bell | 482.42 | kg | 2220.29 | 2403.43 | 2352.84 | 220.04 | ... | Water | 2021 | 11 | 9671924.57 | Deductive Value Method (DVM) | 4356.15 | 20048.76 | 1.06 | 0.1 | 0.02 |
| 1 | 30049099 | Generic pharmaceutical products | China | Entebbe Airport | 131.97 | liters | 348.67 | 377.42 | 2084.1 | 169.47 | ... | Air | 2022 | 11 | 8412978.38 | Computed Value Method (CVM) | 24128.77 | 63749.17 | 5.98 | 0.49 | 0.05 |
| 2 | 15079090 | Vegetable fats and oils | Germany | Entebbe Airport | 113.44 | pairs | 449.93 | 487.04 | 2759.84 | 151.3 | ... | Air | 2022 | 3 | 10672562.76 | Transaction Value of Similar Goods (TVSG) | 23720.5 | 94081.12 | 6.13 | 0.34 | 0.12 |
| 3 | 10063010 | Milled rice | India | Busia | 230.52 | units | 808.09 | 874.73 | 2917.65 | 214.86 | ... | Land | 2023 | 4 | 11692581.49 | Computed Value Method (CVM) | 14469.41 | 50722.63 | 3.61 | 0.27 | 0.05 |
| 4 | 84089010 | Industrial machinery parts | Saudi Arabia | Entebbe Airport | 341.7 | boxes | 896.63 | 970.58 | 6971.39 | 366.85 | ... | Air | 2021 | 6 | 26519078.57 | Computed Value Method (CVM) | 29576.39 | 77609.24 | 7.78 | 0.41 | 0.15 |
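
The first five rows of `df_test` shown above match the first five rows of `df_train`, and both files report 70,734 rows. One way to check how far this overlap extends is to count the distinct test rows that also occur in the training data, compared on the columns the two frames share. The sketch below is illustrative only: the helper `count_shared_rows` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_shared_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count distinct test rows that also appear in train (hypothetical helper).

    Rows are compared on the columns present in both frames, using exact equality.
    """
    shared_cols = [c for c in test.columns if c in train.columns]
    merged = test[shared_cols].drop_duplicates().merge(
        train[shared_cols].drop_duplicates(), on=shared_cols, how="inner"
    )
    return len(merged)


# Toy frames standing in for df_train and df_test (values are made up).
toy_train = pd.DataFrame(
    {
        "HS_Code": [1, 2, 3],
        "FOB_Value_USD": [10.0, 20.0, 30.0],
        "Unit_Price_UGX": [5.0, 6.0, 7.0],  # target column, absent from the test frame
    }
)
toy_test = pd.DataFrame({"HS_Code": [1, 2, 9], "FOB_Value_USD": [10.0, 20.0, 90.0]})

shared = count_shared_rows(toy_train, toy_test)
print(f"{shared} of {len(toy_test)} distinct test rows also occur in train")
```

On the real data the call would be `count_shared_rows(df_train, df_test)`. A count close to the full test size would suggest the test file is not an independent sample.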


In [11]:
df_train.columns

Index(['HS_Code', 'Item_Description', 'Country_of_Origin', 'Port_of_Shipment',
       'Quantity', 'Quantity_Unit', 'Net_Mass_kg', 'Gross_Mass_kg',
       'FOB_Value_USD', 'Freight_USD', 'Insurance_USD', 'CIF_Value_USD',
       'CIF_Value_UGX', 'Unit_Price_UGX', 'Tax_Rate', 'Currency_Code',
       'Mode_of_Transport', 'Year', 'Month', 'Invoice_Amount',
       'Valuation_Method', 'Value_per_kg', 'Value_per_unit', 'FOB_per_kg',
       'Freight_per_kg', 'Insurance_per_kg'],
      dtype='object')

In [12]:
df_test.columns

Index(['HS_Code', 'Item_Description', 'Country_of_Origin', 'Port_of_Shipment',
       'Quantity', 'Quantity_Unit', 'Net_Mass_kg', 'Gross_Mass_kg',
       'FOB_Value_USD', 'Freight_USD', 'Insurance_USD', 'CIF_Value_USD',
       'CIF_Value_UGX', 'Tax_Rate', 'Currency_Code', 'Mode_of_Transport',
       'Year', 'Month', 'Invoice_Amount', 'Valuation_Method', 'Value_per_kg',
       'Value_per_unit', 'FOB_per_kg', 'Freight_per_kg', 'Insurance_per_kg'],
      dtype='object')
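
The two column listings above can also be compared programmatically rather than by eye. The following is a minimal sketch using the column names exactly as printed; the helper `compare_columns` is a hypothetical name introduced for illustration.

```python
def compare_columns(train_cols, test_cols):
    """Return the shared, train-only and test-only column names (hypothetical helper)."""
    train_set, test_set = set(train_cols), set(test_cols)
    return {
        "shared": sorted(train_set & test_set),
        "train_only": sorted(train_set - test_set),
        "test_only": sorted(test_set - train_set),
    }


# Column names copied from the df_train.columns and df_test.columns outputs.
train_cols = [
    "HS_Code", "Item_Description", "Country_of_Origin", "Port_of_Shipment",
    "Quantity", "Quantity_Unit", "Net_Mass_kg", "Gross_Mass_kg",
    "FOB_Value_USD", "Freight_USD", "Insurance_USD", "CIF_Value_USD",
    "CIF_Value_UGX", "Unit_Price_UGX", "Tax_Rate", "Currency_Code",
    "Mode_of_Transport", "Year", "Month", "Invoice_Amount",
    "Valuation_Method", "Value_per_kg", "Value_per_unit", "FOB_per_kg",
    "Freight_per_kg", "Insurance_per_kg",
]
test_cols = [
    "HS_Code", "Item_Description", "Country_of_Origin", "Port_of_Shipment",
    "Quantity", "Quantity_Unit", "Net_Mass_kg", "Gross_Mass_kg",
    "FOB_Value_USD", "Freight_USD", "Insurance_USD", "CIF_Value_USD",
    "CIF_Value_UGX", "Tax_Rate", "Currency_Code", "Mode_of_Transport",
    "Year", "Month", "Invoice_Amount", "Valuation_Method", "Value_per_kg",
    "Value_per_unit", "FOB_per_kg", "Freight_per_kg", "Insurance_per_kg",
]

diff = compare_columns(train_cols, test_cols)
print(len(diff["shared"]))  # 25
print(diff["train_only"])   # ['Unit_Price_UGX']
print(diff["test_only"])    # []
```

In the notebook itself the same call would be `compare_columns(df_train.columns, df_test.columns)`.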

### OBSERVATION
The train and test datasets are structurally consistent: all 25 columns of the test set also appear in the training set. The one column missing from the test data is `Unit_Price_UGX`, which this notebook treats as the target variable. Withholding the target from test data is standard practice in supervised machine learning, because predictions must then be made without access to the true values, which guards against target leakage.

The consistent column set means a model trained on one file can be applied to the other without schema changes. Note, however, that both files contain 70,734 rows and their first five rows are identical, so it is worth confirming that the test file holds genuinely unseen records before treating test-set performance as a measure of generalization.

It is also critical that every preprocessing step (e.g., encoding, scaling, and imputation) is fitted on the training data only and then applied unchanged to the test data, so that predictions stay accurate and reliable.
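
The requirement that preprocessing be replicated identically on the test data can be sketched with a scikit-learn `ColumnTransformer` that is fitted once on the training frame and then reused. This is an illustration under stated assumptions, not the notebook's actual pipeline: the choice of columns, the median imputation, and the use of `OneHotEncoder` (rather than the `LabelEncoder` imported earlier) are all assumptions, and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed split of a few columns into numeric and categorical groups.
numeric_cols = ["FOB_Value_USD", "Freight_USD"]
categorical_cols = ["Mode_of_Transport"]

preprocessor = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the training median, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        # Categorical columns: one-hot encode; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy frames standing in for df_train and df_test.
toy_train = pd.DataFrame({
    "FOB_Value_USD": [100.0, 200.0, np.nan, 400.0],
    "Freight_USD": [10.0, 20.0, 30.0, 40.0],
    "Mode_of_Transport": ["Air", "Land", "Water", "Air"],
})
toy_test = pd.DataFrame({
    "FOB_Value_USD": [150.0, np.nan],
    "Freight_USD": [15.0, 25.0],
    "Mode_of_Transport": ["Air", "Rail"],  # "Rail" never appears in training
})

# Statistics (median, mean, std, category list) are learned from train only ...
X_train = preprocessor.fit_transform(toy_train)
# ... and reused unchanged on test: transform(), never fit_transform().
X_test = preprocessor.transform(toy_test)

print(X_train.shape, X_test.shape)  # (4, 5) (2, 5)
```

Calling `transform` on the test frame guarantees that the same median, scaling parameters, and category order are used for both sets. Refitting on the test data would change those parameters and make the two feature matrices incomparable.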