# Project Overview: Machine Learning for Predicting Unit Price of Imported Goods
## Author: Paul Sentongo

## Introduction
This project aims to detect undervaluation and overvaluation of imported goods by predicting the unit price of a commodity in local currency (UGX) and comparing it with the declared price. Accurate valuation supports fair trade practices and correct taxation. The project covers data preparation, exploratory data analysis, feature engineering, model building, evaluation, and deployment.
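
As a rough illustration of how a predicted unit price could be turned into an undervaluation or overvaluation flag, the sketch below compares declared prices with predicted ones using the relative deviation. The function name `flag_valuation`, the 30% tolerance, and the sample prices are assumptions made for illustration; they are not values taken from this project.

```python
import pandas as pd


def flag_valuation(declared: pd.Series, predicted: pd.Series,
                   tolerance: float = 0.30) -> pd.Series:
    """Label each record by how far the declared price is from the prediction.

    `tolerance` is a hypothetical threshold (30% here), not a project value.
    """
    # Relative deviation: negative when the declared price is below the prediction.
    deviation = (declared - predicted) / predicted
    labels = pd.Series("consistent", index=declared.index)
    labels[deviation < -tolerance] = "undervalued"
    labels[deviation > tolerance] = "overvalued"
    return labels


# Made-up declared and predicted unit prices in UGX.
declared = pd.Series([20_000.0, 60_000.0, 150_000.0])
predicted = pd.Series([21_000.0, 95_000.0, 100_000.0])

flags = flag_valuation(declared, predicted)
print(flags.tolist())
```

In practice the tolerance would likely need to vary by commodity, since price dispersion differs widely across HS codes.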

## Key Steps
1. **Data Collection and Preparation**
   - Load the dataset and understand its structure.
   - Clean and preprocess the data.
   - Handle missing values and detect outliers.
   
2. **Exploratory Data Analysis (EDA)**
   - Conduct descriptive statistics to understand data distribution.
   - Visualize data to identify patterns and relationships.

3. **Feature Engineering**
   - Create new features that may enhance model performance.
   - Normalize or scale features if necessary.

4. **Model Building**
   - Select appropriate machine learning algorithms.
   - Train and validate models using cross-validation.
   
5. **Model Evaluation**
   - Evaluate models using relevant metrics.
   - Select the best-performing model based on evaluation results.

6. **Model Deployment**
   - Prepare the final model for deployment.
   - Deploy the model on a suitable platform (e.g., Streamlit or Flask for web deployment).

7. **Monitoring and Maintenance**
   - Monitor model performance in production.
   - Update the model as necessary based on new data and feedback.
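
Step 1 above mentions handling missing values and detecting outliers. One common way to do both is median imputation followed by the interquartile range (IQR) rule, sketched below on made-up prices. The data, the `is_outlier` column name, and the choice of median imputation are illustrative assumptions, not the method confirmed for this project.

```python
import numpy as np
import pandas as pd

# Made-up records; column names follow the dataset described in this notebook.
df = pd.DataFrame({
    "HS_Code": [30049099] * 6,
    "Unit_Price_UGX": [20_000.0, 22_000.0, np.nan, 19_500.0, 21_000.0, 400_000.0],
})

# Impute missing prices with the median (one simple option among several).
median_price = df["Unit_Price_UGX"].median()
df["Unit_Price_UGX"] = df["Unit_Price_UGX"].fillna(median_price)

# IQR rule: values more than 1.5 * IQR outside the quartiles are flagged.
q1, q3 = df["Unit_Price_UGX"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["Unit_Price_UGX"].between(lower, upper)

print(df)
```

Because the goal here is to find mispriced declarations, it may be safer to flag extreme prices than to drop them: some of those records could be exactly the cases of interest.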

## Tools and Technologies
- **Programming Language:** Python
- **Libraries:** Pandas, NumPy, SciPy, Scikit-learn, XGBoost, TensorFlow (Keras), Matplotlib, Seaborn, joblib, tabulate
- **Deployment Platform:** Streamlit, Flask, Docker

This approach is intended to provide a repeatable way to detect pricing discrepancies in imported goods, supporting regulatory compliance and fair taxation.

## Import necessary libraries

In [22]:
# Import the libraries used in the project
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import streamlit as st
import tensorflow as tf
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
from xgboost import XGBRegressor

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Variable descriptions

In [23]:
from tabulate import tabulate

# Defining attributes and their descriptions
attributes = [
    ["Date", "Date of import record"],
    ["HS_Code", "Harmonized system commodity code (HS code)"],
    ["Item_Description", "Description of the imported item"],
    ["Country_of_Origin", "Country from which the goods originated"],
    ["Port_of_Shipment", "Port through which the goods entered Uganda"],
    ["Quantity", "Quantity of goods imported"],
    ["Quantity_Unit", "Unit of measurement for the quantity (e.g., units, boxes, liters)"],
    ["Net_Mass_kg", "Net mass (excluding packaging)"],
    ["Gross_Mass_kg", "Gross mass (including packaging)"],
    ["FOB_Value_USD", "Free on Board value in USD (excluding freight and insurance)"],
    ["Freight_USD", "Freight cost in USD"],
    ["Insurance_USD", "Insurance cost in USD"],
    ["CIF_Value_USD", "CIF (Cost, Insurance, and Freight) value in USD"],
    ["CIF_Value_UGX", "CIF value in local currency (UGX)"],
    ["Unit_Price_Actual_UGX", "Actual unit price in local currency (UGX)"],
    ["Unit_Price_Predicted_UGX", "Predicted unit price in local currency (UGX)"],
    ["Tax_Rate", "Import duty rate as a percentage"],
    ["Currency_Code", "Currency code for original values (e.g., USD)"],
    ["Mode_of_Transport", "Mode of transport used (Land, Air, or Water)"],
    ["Unit_Price_Calculated_UGX", "Unit price in local currency including all duties and taxes"]
]

# Creating markdown table
markdown_table = tabulate(attributes, headers=["Attribute", "Description"], tablefmt="github")

print(markdown_table)


| Attribute                 | Description                                                       |
|---------------------------|-------------------------------------------------------------------|
| Date                      | Date of import record                                             |
| HS_Code                   | Harmonized system commodity code (HS code)                        |
| Item_Description          | Description of the imported item                                  |
| Country_of_Origin         | Country from which the goods originated                           |
| Port_of_Shipment          | Port through which the goods entered Uganda                       |
| Quantity                  | Quantity of goods imported                                        |
| Quantity_Unit             | Unit of measurement for the quantity (e.g., units, boxes, liters) |
| Net_Mass_kg               | Net mass (excluding packaging)                                    |
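| Gross_Mass_kg             | Gross mass (including packaging)                                  |
| FOB_Value_USD             | Free on Board value in USD (excluding freight and insurance)      |
| Freight_USD               | Freight cost in USD                                               |
| Insurance_USD             | Insurance cost in USD                                             |
| CIF_Value_USD             | CIF (Cost, Insurance, and Freight) value in USD                   |
| CIF_Value_UGX             | CIF value in local currency (UGX)                                 |
| Unit_Price_Actual_UGX     | Actual unit price in local currency (UGX)                         |
| Unit_Price_Predicted_UGX  | Predicted unit price in local currency (UGX)                      |
| Tax_Rate                  | Import duty rate as a percentage                                  |
| Currency_Code             | Currency code for original values (e.g., USD)                     |
| Mode_of_Transport         | Mode of transport used (Land, Air, or Water)                      |
| Unit_Price_Calculated_UGX | Unit price in local currency including all duties and taxes       |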

In [24]:
# Loading the dataset
data = pd.read_csv('uganda_imports_trade_data_train_set.csv')

In [25]:
data.head(5)

Unnamed: 0,Date,HS_Code,Item_Description,Country_of_Origin,Port_of_Shipment,Quantity,Quantity_Unit,Net_Mass_kg,Gross_Mass_kg,FOB_Value_USD,Freight_USD,Insurance_USD,CIF_Value_USD,CIF_Value_UGX,Unit_Price_UGX,Tax_Rate,Currency_Code,Mode_of_Transport
0,23/04/2021,27101931,Mineral fuels and oils,Saudi Arabia,Busia,23.58,units,114.15,123.56,9044.66,1050.6,143.72,10238.98,37439216.34,1587753.03,0.05,USD,Land
1,03/11/2021,30049099,Generic pharmaceutical products,China,Port Bell,482.42,kg,2253.97,2439.86,2352.84,220.04,47.01,2619.89,9671924.57,20048.76,0.18,USD,Water
2,13/11/2022,30049099,Generic pharmaceutical products,China,Entebbe Airport,131.97,liters,348.67,377.42,2084.1,169.47,17.04,2270.61,8412978.38,63749.17,0.18,USD,Air
3,13/03/2022,15079090,Vegetable fats and oils,Germany,Entebbe Airport,113.44,pairs,449.93,487.04,2759.84,151.3,53.46,2964.6,10672562.76,94081.12,0.15,USD,Air
4,04/06/2020,12010090,Soybeans,China,Port Bell,44.42,units,210.32,227.66,9861.95,1345.54,62.96,11270.45,41822493.34,941523.94,0.05,USD,Water
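
The first rows above suggest two arithmetic relationships: `CIF_Value_USD` looks like the sum of `FOB_Value_USD`, `Freight_USD`, and `Insurance_USD`, and `Unit_Price_UGX` looks like `CIF_Value_UGX` divided by `Quantity`. The sketch below checks both on two of the displayed rows. That these relationships hold across the full dataset is an assumption; the check would need to be run on all 80,000 records to confirm it.

```python
import io

import numpy as np
import pandas as pd

# Two of the rows shown above, restricted to the value columns.
sample = io.StringIO(
    "Quantity,FOB_Value_USD,Freight_USD,Insurance_USD,"
    "CIF_Value_USD,CIF_Value_UGX,Unit_Price_UGX\n"
    "23.58,9044.66,1050.6,143.72,10238.98,37439216.34,1587753.03\n"
    "482.42,2352.84,220.04,47.01,2619.89,9671924.57,20048.76\n"
)
rows = pd.read_csv(sample)

# CIF should equal FOB + freight + insurance (allowing for rounding to cents).
cif_rebuilt = rows["FOB_Value_USD"] + rows["Freight_USD"] + rows["Insurance_USD"]
cif_ok = np.isclose(cif_rebuilt, rows["CIF_Value_USD"], atol=0.01)

# Assumption: the unit price is the CIF value in UGX divided by the quantity.
unit_price_rebuilt = rows["CIF_Value_UGX"] / rows["Quantity"]
unit_price_ok = np.isclose(unit_price_rebuilt, rows["Unit_Price_UGX"], rtol=1e-3)

print("CIF identity holds:", cif_ok.all())
print("Unit price identity holds:", unit_price_ok.all())
```

If the second relationship holds throughout, a model given both `CIF_Value_UGX` and `Quantity` as features could reconstruct the target almost exactly, which is worth keeping in mind when choosing features.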


In [26]:
# Summarize the columns: non-null counts and data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               80000 non-null  object 
 1   HS_Code            80000 non-null  int64  
 2   Item_Description   80000 non-null  object 
 3   Country_of_Origin  80000 non-null  object 
 4   Port_of_Shipment   80000 non-null  object 
 5   Quantity           80000 non-null  float64
 6   Quantity_Unit      80000 non-null  object 
 7   Net_Mass_kg        80000 non-null  float64
 8   Gross_Mass_kg      80000 non-null  float64
 9   FOB_Value_USD      80000 non-null  float64
 10  Freight_USD        80000 non-null  float64
 11  Insurance_USD      80000 non-null  float64
 12  CIF_Value_USD      80000 non-null  float64
 13  CIF_Value_UGX      80000 non-null  float64
 14  Unit_Price_UGX     80000 non-null  float64
 15  Tax_Rate           80000 non-null  float64
 16  Currency_Code      800

In [27]:
# Check the shape of the dataset (rows, columns)
data.shape

(80000, 18)

In [28]:
# Check the data type of each column
data.dtypes

Date                  object
HS_Code                int64
Item_Description      object
Country_of_Origin     object
Port_of_Shipment      object
Quantity             float64
Quantity_Unit         object
Net_Mass_kg          float64
Gross_Mass_kg        float64
FOB_Value_USD        float64
Freight_USD          float64
Insurance_USD        float64
CIF_Value_USD        float64
CIF_Value_UGX        float64
Unit_Price_UGX       float64
Tax_Rate             float64
Currency_Code         object
Mode_of_Transport     object
dtype: object
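
The output above shows `Date` stored as `object` and `HS_Code` as `int64`. A possible cleanup, sketched here on two records taken from the first rows shown earlier, parses the dates as day-first, treats HS codes as text identifiers, and stores a low-cardinality column as a categorical. The `HS_Chapter` feature name and the assumption that every code has 8 digits are illustrative, not confirmed by the source.

```python
import pandas as pd

# Two records taken from the first rows of the dataset shown earlier.
df = pd.DataFrame({
    "Date": ["23/04/2021", "03/11/2021"],
    "HS_Code": [27101931, 30049099],
    "Mode_of_Transport": ["Land", "Water"],
})

# Parse day-first dates explicitly so 03/11/2021 is read as 3 November.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# HS codes are identifiers, not quantities: keep them as strings.
# zfill(8) restores a leading zero lost on integer parsing (assumes 8-digit codes).
df["HS_Code"] = df["HS_Code"].astype(str).str.zfill(8)

# Hypothetical feature: the 2-digit HS chapter.
df["HS_Chapter"] = df["HS_Code"].str[:2]

# Low-cardinality text columns can be stored as categoricals.
df["Mode_of_Transport"] = df["Mode_of_Transport"].astype("category")

print(df.dtypes)
```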