# **Agricultural Exports Categories Analysis**
*by Sergio Postigo and Víctor Diví*

## **1. Introduction**
Many countries base a large portion of their economies on foreign trade. Customs agencies around the world therefore collect data about every imported or exported good that passes through their ports, airports, borders, etc. In some countries this data is open, and anyone can access and analyze it to make better-informed decisions when importing or exporting goods. However, the data poses some challenges before it can be used. One of them is labelling. For example, in Peru, every time a good is imported or exported, a customs agent fills in a form with information about the product(s), including a description. However, the records carry no proper category label, which is needed, for instance, to aggregate imported or exported amounts by category.

The labelling of goods is done manually, mainly by consultancy agencies that acquire this data to produce analytics reports for companies and institutions interested in foreign trade information about specific products. They usually label the rows in MS Excel spreadsheets, which is inefficient and time-consuming. We want to address this pain point by automating the labelling process with Machine Learning.

## **2. Data Wrangling**
The data was provided by a consultancy company in Peru called RTM. They were hired by an agricultural exports company that was interested in knowing which categories of products were exported from Peru from 2017 to 2021. RTM acquired the data from the company Veritrade, which consolidates foreign trade databases from many countries in South America.

RTM provided us with the data in .xlsx format. We converted it to CSV (using Excel) and then imported it into this notebook as a DataFrame with the Pandas library.

In [None]:
# Import Pandas library
import pandas as pd

In [None]:
# Convert the CSV data into Dataframe
data =  pd.read_csv("../data/raw_data/data.csv", encoding='latin-1', sep=';')
data.head()
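The file uses `;` as field separator, and the cleaning section later replaces commas with dots before casting numbers, which suggests a comma decimal mark. As a minimal sketch (the two sample rows are invented, and passing `decimal=","` is an assumption about how one might load such a file, not what this notebook does), pandas can parse those numbers directly at load time:

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the file layout (values are invented)
csv_text = io.StringIO(
    "Partida Aduanera;Kg Neto;U$ FOB Tot\n"
    "806100000;1200,5;3601,5\n"
    "804400000;800,0;1600,0\n"
)

# sep=';' splits the fields, decimal=',' turns '1200,5' into the float 1200.5
demo = pd.read_csv(csv_text, sep=";", decimal=",")
print(demo.dtypes)
```

Parsing the numbers at load time would make the later string replacement unnecessary, at the cost of assuming that every numeric column follows the same convention.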

Since we don't have a separate data source for testing, we will split the available data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

# We use 20% of the data to test and 80% to train
train, test = train_test_split(data, test_size=0.2, random_state=5)
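The split above is purely random. If the labels turn out to be imbalanced, a stratified split is one option to keep the class proportions equal in both partitions. The sketch below uses an invented toy frame (the column names `feature` and `label` are hypothetical); it is an illustration, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced label (90 rows of "a", 10 of "b")
toy = pd.DataFrame({"feature": range(100), "label": ["a"] * 90 + ["b"] * 10})

# stratify keeps the label proportions identical in both partitions
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=5, stratify=toy["label"]
)
print(toy_test["label"].value_counts().to_dict())
```

Note that stratification raises an error when a class has a single row, so very rare classes would need to be handled first.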

## **3. Exploratory Data Analysis**

Let's first describe the columns of the table

In [None]:
# Get all the columns
train.columns

1. Partida Aduanera: Specific code of a product included in the Harmonized System of the World Customs Organization (WCO)

2. Descripcion de la Partida Aduanera: Description about the product associated with the customs code

3. Aduana: Customs office from which the export was performed

4. DUA: Single Administrative Document, a document that gathers information about the shipment

5. Fecha: Shipping date

6. Año: Shipping year

7. Cod. Tributario: Tax code of the company exporting the good

8. Exportador en Perú: Company or entity exporting the good

9. Importador Extranjero: Company or entity importing the good

10. Kg Bruto: Weight of the good in kg, including the weight of the container or box

11. Kg Neto: Weight of good in kg, excluding the weight of the container or box

12. Toneladas Netas: Weight of good in tons, excluding the weight of the container or box

13. Qty 1: Quantity of the good in terms of a specific measurement unit (1)

14. Und 1: Unit of measurement (1)

15. Qty 2: Quantity of the good in terms of a specific measurement unit (2)

16. Und 2:  Unit of measurement (2)

17. U$ FOB Tot: The value of the goods at the exporter's customs frontier in USD

18. Miles de USD Fob TOTAL: The value of the goods at the exporter's customs frontier in thousands of USD

19. U$ FOB Und 1: The value of the goods per unit (1)

20. U$ FOB Und 2: The value of the goods per unit (2)

21. Pais de Destino: Destination country

22. Puerto de destino: Destination port

23. Último Puerto Embarque: Last port of shipment

24. Via: Transport mode (air, sea, maritime)

25. Agente Portuario: Port agent

26. Agente de Aduana: Customs agent

27. Descripcion Comercial: Commercial description of the good

28. Descripcion1: Commercial description portion 1

29. Descripcion2: Commercial description portion 2

30. Descripcion3: Commercial description portion 3

31. Descripcion4: Commercial description portion 4

32. Descripcion5: Commercial description portion 5

33. Naviera: Shipping company

34. Agente Carga(Origen): Cargo agent (origin)

35. Agente Carga(Destino): Cargo agent (destination)

36. Canal: Selectivity channel. Type of control that the Customs Service will carry out on the merchandise to be exported. There are three channels: Green, Orange and Red

37. Concatenar: Column that concatenates 27, 28, 29, 30, 31, 32

38. Categoría macro Aurum: Designated category/label

39. Subcategoría inicial: Designated subcategories/sub-labels

40. Subcategoría Consolidada Aurum: Designated subcategories/sub-labels (with less granularity, it groups some subcategories in "others")

41. Categoría Consolidada Aurum: Designated category/label (with less granularity, it groups some categories in "others")

#### **Remark 1:**

All possible categories are labelled in *Categoría macro Aurum* and all possible subcategories are labelled in *Subcategoría inicial*. Aurum grouped some of the categories as "others" in *Categoría Consolidada Aurum* and did the same for the subcategories in *Subcategoría Consolidada Aurum*. These last two columns were very likely a requirement from their client, who may have been interested only in a specific list of categories, with the rest simply labelled as "others". However, what is of interest to us are the columns with all the categories and all the subcategories (*Categoría macro Aurum* and *Subcategoría inicial*).

**Since the categories can be mapped from the subcategories, the model(s) to work on should predict the subcategories that are in the column *Subcategoría inicial***
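The claim that categories can be derived from subcategories holds only if every subcategory belongs to exactly one category. A minimal sketch of how this could be checked and how the lookup could be built follows; the label values are invented for illustration, only the two column names come from the dataset.

```python
import pandas as pd

# Hypothetical labels; the real values live in the two Aurum columns
labels = pd.DataFrame({
    "Subcategoría inicial": ["uva red globe", "uva thompson", "palta hass", "uva red globe"],
    "Categoría macro Aurum": ["uvas", "uvas", "paltas", "uvas"],
})

pairs = labels.drop_duplicates()
# Each subcategory must point to exactly one category for the mapping to be valid
parents_per_sub = pairs.groupby("Subcategoría inicial")["Categoría macro Aurum"].nunique()
assert (parents_per_sub == 1).all()

# Lookup table from subcategory to category
sub_to_cat = dict(zip(pairs["Subcategoría inicial"], pairs["Categoría macro Aurum"]))
print(sub_to_cat["palta hass"])
```

With such a lookup, a predicted subcategory can be translated into its category without training a second model.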

Let's analyze the distribution of the target class: *Subcategoría inicial*

Check unique values 

In [None]:
print(train['Subcategoría inicial'].nunique())

Check count of appearances of each class

In [None]:
train['Subcategoría inicial'].value_counts()

As seen, there is an important class imbalance. Let's show this in a bar chart

In [None]:
# Plot the number of rows per class of *Categoría macro Aurum*
train['Categoría macro Aurum'].value_counts().plot(kind='bar')

The bar chart shows that there is already an important class imbalance at the macro-category level
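Besides the chart, the imbalance can be summarized with a single number, for example the ratio between the largest and the smallest class. The sketch below uses only the standard library; the helper name `imbalance_summary` and the toy labels are hypothetical.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return (largest class size, smallest class size, their ratio)."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return largest, smallest, largest / smallest

# Hypothetical labels, not taken from the dataset
toy_labels = ["uvas"] * 50 + ["paltas"] * 10 + ["mangos"] * 2
print(imbalance_summary(toy_labels))
```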

#### **Remark 2:**

*Descripcion Comercial* is the concatenation of *Descripcion1*, *Descripcion2*, *Descripcion3*, *Descripcion4* and *Descripcion5*

In [None]:
# Replace NaN values with ''
train.fillna('', inplace=True)

# Concatenate Descripcion1, Descripcion2, Descripcion3, Descripcion4 and Descripcion5 and save the string in the column "Concatenated_Descriptions"
train['Concatenated_Descriptions'] = train['Descripcion1'] + ' ' + train['Descripcion2'] + ' ' + train['Descripcion3'] + ' ' + train['Descripcion4'] + ' ' + train['Descripcion5']

# Remove spaces before and after the string
train['Concatenated_Descriptions'] = train['Concatenated_Descriptions'].str.strip()

# Compare "Descripcion Comercial" and "Concatenated_Descriptions"
train['Equal?'] = train['Descripcion Comercial'] == train['Concatenated_Descriptions']

# Print the comparison
temp_df = train[['Descripcion Comercial', 'Concatenated_Descriptions', 'Equal?']]
temp_df

In [None]:
# Check if all are equal (the sum of a boolean column counts its True values)
print(f"From {len(temp_df)} rows, *Descripcion Comercial* and the *Concatenated_Descriptions* are equal in {temp_df['Equal?'].sum()}.")

In [None]:
temp_df[~temp_df['Equal?']]

Let's check what happens in a row where *Equal?* is false

In [None]:
# Descripcion Comercial
print(temp_df.iloc[215653]['Descripcion Comercial'])
# Concatenated_Descriptions
print(temp_df.iloc[215653]['Concatenated_Descriptions'])

It seems that when splitting the column *Descripcion Comercial* into *Descripcion1* to *Descripcion5*, Veritrade removed some characters, in this case some white spaces. That is why, when we rebuild *Concatenated_Descriptions* from *Descripcion1* to *Descripcion5*, we don't get exactly the same string as in *Descripcion Comercial*.

We can make an additional remark here: *Descripcion Comercial* has repeated sentences in its values, as shown in the example above. This must be cleaned
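If the mismatch is indeed caused only by removed white spaces, collapsing whitespace on both sides before comparing should make the strings equal. A minimal sketch with invented strings (the helper `normalize_spaces` is hypothetical):

```python
def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())

# Hypothetical strings illustrating the mismatch
raw_text = "UVAS FRESCAS  RED GLOBE   EN CAJAS"
rebuilt_text = "UVAS FRESCAS RED GLOBE EN CAJAS"

print(raw_text == rebuilt_text)
print(normalize_spaces(raw_text) == normalize_spaces(rebuilt_text))
```

Applying the same normalization to both columns before the comparison would show how many of the differing rows are explained by whitespace alone.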

In [None]:
# Remove the columns used to explain this remark
train.drop(columns=['Concatenated_Descriptions', 'Equal?'], inplace=True)

#### **Remark 3:**

The column *Concatenar* concatenates *Descripcion Comercial* with *Descripcion1* to *Descripcion5*. Thus, it basically contains the string from *Descripcion Comercial* twice. It seems that the consultants didn't know that *Descripcion1* to *Descripcion5* are fragments of *Descripcion Comercial*. Maybe they thought these extra columns contained additional information, and that is why they concatenated everything in the *Concatenar* column to then process the information from there.

#### **Remark 4:**

For each *Partida aduanera* there is only one possible *Descripcion de la partida aduanera*.

In [None]:
# Create a temporary dataframe
temp_df = train[["Partida Aduanera", "Descripcion de la Partida Aduanera"]].copy()
# Remove duplicated rows
temp_df.drop_duplicates(inplace=True)
# Get the number of rows
print("The number of distinct combinations of the columns *Partida Aduanera* and *Descripcion de la Partida Aduanera* is "+str(temp_df.shape[0]))
# Get the number of unique values of *Partida Aduanera*
print("The number of unique values of the column *Partida Aduanera* is "+str(temp_df['Partida Aduanera'].nunique()))
# Get the number of unique values of *Descripcion de la Partida Aduanera*
print("The number of unique values of the column *Descripcion de la Partida Aduanera* is "+str(temp_df['Descripcion de la Partida Aduanera'].nunique()))
temp_df

There are some values of *Descripcion de la partida aduanera* that correspond to multiple values of *Partida Aduanera*

In [None]:
# Get the values of *Descripcion de la Partida Aduanera* that are related to multiple values of *Partida Aduanera*
temp_df[temp_df.duplicated(['Descripcion de la Partida Aduanera'], keep=False)].sort_values(by=['Descripcion de la Partida Aduanera'])

This is very likely an error in the customs agency systems. We will have to deal with it
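An alternative way to list the ambiguous descriptions is to count the distinct codes per description. The sketch below uses invented code/description pairs; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical code/description pairs
pairs = pd.DataFrame({
    "Partida Aduanera": [101, 102, 103, 104],
    "Descripcion de la Partida Aduanera": ["uvas frescas", "paltas", "uvas frescas", "mangos"],
})

# Number of distinct customs codes attached to each description
codes_per_description = (
    pairs.groupby("Descripcion de la Partida Aduanera")["Partida Aduanera"].nunique()
)
# Keep only the descriptions shared by more than one code
ambiguous = codes_per_description[codes_per_description > 1]
print(ambiguous.to_dict())
```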

#### **Remark 5:**

Since we are dealing with agricultural items, we can presume that there is a seasonal influence on the dates on which they are exported. Let's test this assumption

We will compare the *Partida Aduanera* with the date column named *Fecha*

In [None]:
# Get the columns *Partida Aduanera* and *Fecha*
temp_df = train[["Partida Aduanera", "Fecha"]].copy()
# Cast the column of date (*Fecha*) to datetime
temp_df['Fecha'] = pd.to_datetime(temp_df['Fecha'], format='%d/%m/%Y')
# Count the exports by date
temp_df['Count of exports'] = temp_df.groupby(['Partida Aduanera','Fecha'])['Fecha'].transform('count')
temp_df.drop_duplicates(inplace=True)
# Sort the dataframe
temp_df.sort_values(by=['Partida Aduanera', 'Fecha'])

Sample randomly some values of "Partida Aduanera" to plot

In [None]:
# Import random library
import random

# Number of samples
samples_qty = 6
# Get samples from *Partida Aduanera* (without repetitions)
samples = random.sample(list(dict.fromkeys(temp_df['Partida Aduanera'].tolist())), samples_qty)
print("The random selected values from *Partida Aduanera* are: ")
print(samples)
# Create a list with the dataframes of each sample
samples_dfs = []
for sample in samples:
    samples_dfs.append(temp_df[temp_df['Partida Aduanera'] == sample ].sort_values(by=["Partida Aduanera", "Fecha"])[['Fecha', 'Count of exports']])
# For each dataframe, populate the missing dates (imputing values of 0 for Count of exports)
populated_samples_dfs =[]
for sample_df in samples_dfs:
    dates = pd.date_range(sample_df['Fecha'].min(),sample_df['Fecha'].max())
    sample_df.set_index('Fecha', inplace=True)
    sample_df = sample_df.reindex(dates, fill_value=0) #this cant be done inplace
    populated_samples_dfs.append(sample_df) 


Let's plot all the selected samples

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np


for i in range(len(populated_samples_dfs)):
    #----------------------------------------------------------------
    # GRAPH
    #----------------------------------------------------------------

    # size:
    plt.figure(figsize=(30,3))

    # title:
    plt.title('Shipments of '+ str(samples[i])+' by day', fontsize=20)

    # x axis:
    # x values
    x = range(len(populated_samples_dfs[i]))
    # x ticks ('%b' is the portable abbreviated month name; '%h' is platform dependent)
    my_xticks = [d.strftime('%b-%Y') for d in populated_samples_dfs[i].index]
    plt.xticks(x[::30], my_xticks[::30], rotation='vertical')
    # # x label
    plt.xlabel("date", fontsize=16)

    # y axis:
    # y values
    y = populated_samples_dfs[i]["Count of exports"].tolist()
    # y ticks
    #plt.yticks(np.arange(populated_samples_dfs[i]["Count of exports"].min(), populated_samples_dfs[i]["Count of exports"].max()+1, 1))

    # y label
    plt.ylabel("# shipments", fontsize=16)

    # create plot
    plt.plot(x, y, marker='o')

    plt.grid()

    # show plot
    plt.show()

    #----------------------------------------------------------------

As we can see, some products (represented by their *Partida Aduanera* code) show a seasonal pattern (as expected), while others do not.
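The visual impression can be complemented with a simple number: the share of shipments that fall in the busiest calendar months. This is only a rough heuristic, not a formal seasonality test; the helper `top_months_share` and the dates below are hypothetical.

```python
import pandas as pd

def top_months_share(dates: pd.Series, n_months: int = 3) -> float:
    """Share of shipments that fall in the n busiest calendar months."""
    per_month = dates.dt.month.value_counts()
    return per_month.nlargest(n_months).sum() / per_month.sum()

# Hypothetical shipment dates: one product concentrated around summer, one spread evenly
seasonal = pd.Series(pd.to_datetime(
    ["2020-12-05", "2021-01-10", "2021-01-20", "2021-02-03", "2021-06-01"]
))
uniform = pd.Series(pd.to_datetime([f"2021-{m:02d}-15" for m in range(1, 13)]))

print(top_months_share(seasonal))
print(top_months_share(uniform))
```

A value close to 1 suggests that exports are concentrated in a few months, while a value near `n_months / 12` suggests an even spread over the year.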

#### **Remark 6**

The customs agents in Peru fill in the column *Importador Extranjero*. Thus, we presume that there may be inconsistencies in the naming of the same company in different rows. Let's check this

We take as an example the *Importador Extranjero* value "Comercial Agricola Montoliva Ltda.". Let's check which rows have a similar name (at least 90% similarity according to the Levenshtein ratio)

In [None]:
# Import Levenshtein
import Levenshtein

# Get the column *Importador Extranjero*
df = train[['Importador Extranjero']].copy()
# Add a column with the similarity according to the Levenshtein ratio
df["Similarity"] = df['Importador Extranjero'].apply(lambda name: Levenshtein.ratio(name, "Comercial Agricola Montoliva Ltda."))
# Keep rows with at least 90% similarity
df[df["Similarity"] >= 0.90]


As seen, rows referring to the same company in the *Importador Extranjero* column have slightly different values. This must be cleaned in the next section.
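The same idea can be sketched without third-party packages. The example below uses `difflib.SequenceMatcher` from the standard library instead of the Levenshtein package, so the scores will differ somewhat from the ones above; the list of names and the helper `similar_names` are invented for illustration.

```python
from difflib import SequenceMatcher

def similar_names(target: str, candidates, threshold: float = 0.9):
    """Return the candidates whose similarity ratio to target reaches the threshold."""
    target_norm = target.lower().strip()
    return [
        name for name in candidates
        if SequenceMatcher(None, name.lower().strip(), target_norm).ratio() >= threshold
    ]

# Hypothetical spellings of one importer plus an unrelated company
names = [
    "Comercial Agricola Montoliva Ltda.",
    "COMERCIAL AGRICOLA MONTOLIVA LTDA",
    "Comercial Agricola Montoliva Limitada",
    "Fresh Fruits Trading Inc.",
]
matches = similar_names("Comercial Agricola Montoliva Ltda.", names)
print(matches)
```

Lowercasing and trimming before the comparison is a design choice made here so that differences in letter case alone do not lower the score.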

Let's check now for *Exportador en Peru*. We will use a random company as an example

In [None]:
# Get the column *Exportador en Perú*
df = train[['Exportador en Perú']].copy()
# Random company
company = df.sample()['Exportador en Perú'].values[0]
print("The company to be analyzed is: "+company)
# Add a column with the similarity according to the Levenshtein ratio
df["Similarity"] = df['Exportador en Perú'].apply(lambda name: Levenshtein.ratio(name, company))
# Keep rows with at least 90% similarity
df[df["Similarity"] >= 0.9]

Check how many unique values are in each column:

In [None]:
df.iloc[(df["Similarity"]>=0.8).values].nunique()

We know from Aurum that the column *Exportador en Perú* is filled in from a dropdown menu. Since it is not typed, there are no inconsistencies. This is aligned with the results above for this column. As such, there is no need to clean it.

#### **Remark 7:**

The column *Descripcion de la Partida Aduanera* gives general information about the associated product code in *Partida Aduanera*, while the column *Descripcion Comercial* contains more detailed information. Let's study these columns

In [None]:
# Select the three columns of interest
df = train[["Partida Aduanera", "Descripcion de la Partida Aduanera", "Descripcion Comercial"]]
# Get the cardinality of *Partida aduanera*
print("There are "+ str(len(df['Partida Aduanera'].unique()))+ " different codes of Partida Aduanera in total")
# Get the cardinality of *Descripcion de la Partida Aduanera*
print("There are "+ str(len(df['Descripcion de la Partida Aduanera'].unique()))+ " different values of Descripcion de la Partida Aduanera in total")
# Get the cardinality of *Descripcion Comercial*
print("There are "+ str(len(df['Descripcion Comercial'].unique()))+ " different values of Descripcion Comercial in total")


## **4. Data cleaning**

In this stage we will clean the data, specifically the columns that we will use in the model(s) in the next sections. We don't need to clean all the columns, since many of them are not relevant for labelling the rows. So, let's first determine the columns to be used and justify why

| COLUMN | USEFUL | JUSTIFICATION |
| --- | --- | --- |
| Partida Aduanera | NO | For each customs code there is one description in *Descripcion de la Partida Aduanera*, which carries more information about the product. So, we won't take this attribute and will consider the next one instead. |
| Descripcion de la Partida Aduanera | **YES** | This is a general description of the product, so it carries valuable information for the labelling |
| Aduana | NO | The customs office from which the product is shipped. For now, we won't consider it for our models |
| DUA | NO | This is a randomly generated code associated with the shipment; it does not carry information that can be captured |
| Fecha | **YES** | Associating the shipping date with a category is insightful. As we saw, some products are exported in specific seasons of the year |
| Año | NO | Already included in the previous attribute |
| Cod. Tributario | NO | There is one tax code for each company. A company can be associated with specific groups of products; however, the number of different companies can be huge. |
| Exportador en Perú | NO | Same idea as previous row |
| Importador Extranjero | NO | The number of different importers abroad may be huge, and new data may carry names not learned by the model |
| Kg Bruto | NO | See next attribute |
| Kg Neto | **YES** | The weight of a shipment is insightful, but it is highly variable among shipments of the same product, so initially we won't use it directly as a feature. However, we will use it to calculate the price per kg, which is insightful |
| Toneladas Netas | NO | See previous attribute |
| Qty 1 | NO | Same as before |
| Und 1 | NO | Same as before |
| Qty 2 | NO | Same as before |
| Und 2 | NO | Same as before |
| U$ FOB Tot | **YES** | The value of the shipment will be used to calculate the price per kg of the product |
| Miles de USD Fob TOTAL | NO | It is just a repetition of the previous attribute |
| U$ FOB Und 1 | NO | |
| U$ FOB Und 2 | NO | |
| Pais de Destino | **YES** | The country to which these products are exported can be related to groups of products |
| Puerto de destino | NO | The previous attribute indirectly captures this information already |
| Último Puerto Embarque | NO | |
| Via | NO | |
| Agente Portuario | NO | |
| Agente de Aduana | NO | |
| Descripcion Comercial | **YES** | The commercial description also carries valuable information for the labelling |
| Descripcion1 | NO | Already captured in *Descripcion Comercial* |
| Descripcion2 | NO | Already captured in *Descripcion Comercial* |
| Descripcion3 | NO | Already captured in *Descripcion Comercial* |
| Descripcion4 | NO | Already captured in *Descripcion Comercial* |
| Descripcion5 | NO | Already captured in *Descripcion Comercial* |
| Naviera | NO |  |
| Agente Carga(Origen) | NO |  |
| Agente Carga(Destino) | NO |  |
| Canal | NO |  |
| Concatenar | NO |  |
| Categoría macro Aurum | NO | While we also need this category, it can be inferred given a prediction of the subcategory |
| Subcategoría inicial | **YES** | **LABEL** |
| Subcategoría Consolidada Aurum | NO |  |
| Categoría Consolidada Aurum | NO |  |

In [None]:
train_data = train[["Descripcion de la Partida Aduanera", "Fecha", "Kg Neto", "U$ FOB Tot", "Pais de Destino", "Descripcion Comercial", "Categoría macro Aurum" ]].copy()
train_data.head()

From now on we will focus on each of the selected columns

#### **Descripcion de la Partida Aduanera (description of the customs code)**

In [None]:
train_data[["Descripcion de la Partida Aduanera"]]

Since in this column we are dealing with textual descriptions of the product, we will use Natural Language Processing techniques. A first important step is to remove the so-called *stop words* from each cell, so that we get rid of words that carry little information. For example, the second row in the table above contains the words 'Y' (and) and 'O' (or). These words should not be considered in our future model.

To do this we will use the Natural Language Toolkit (NLTK).

In [None]:
# Import the library
import nltk
# Download the stopwords feature
nltk.download('stopwords')
# Import the stopwords
from nltk.corpus import stopwords

# Get the stopword in Spanish
sw_nltk = stopwords.words('spanish')
print("The words considered stopwords in spanish are: ")
print(sw_nltk)

Now let's remove the stopwords, punctuation and accents, and set the strings to lowercase

In [None]:
import unidecode

# Create an array with the column values
old_descriptions = train_data["Descripcion de la Partida Aduanera"].tolist()

# Array to store cleaned values
new_descriptions = []

# Remove the stopwords from each old cell and populate the new array
for sentence in old_descriptions:
    # Remove stopwords
    words = [word for word in sentence.split() if word.lower() not in sw_nltk ]
    new_text = " ".join(words)
    # Additionally remove punctuations
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    words = tokenizer.tokenize(new_text)
    new_text = " ".join(words)
    # Set to lowercase
    new_text = new_text.lower()
    # Remove accents
    new_text = unidecode.unidecode(new_text)
    # Append to the array
    new_descriptions.append(new_text)
new_descriptions

# Add the cleaned data to the training dataframe
train_data["Descripcion de la Partida Aduanera"] = new_descriptions
train_data[["Descripcion de la Partida Aduanera"]]
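The cleaning steps above can also be condensed into one small function. The sketch below relies only on the standard library: `unicodedata` stands in for `unidecode`, and a tiny invented stopword set stands in for NLTK's Spanish list, so its output may differ from the cell above. The names `STOPWORDS`, `strip_accents` and `clean_text` are hypothetical.

```python
import re
import unicodedata

# Hypothetical, tiny stopword list; the notebook uses NLTK's Spanish list instead
STOPWORDS = {"y", "o", "de", "la", "los", "las", "el", "en"}

def strip_accents(text: str) -> str:
    """Drop combining marks after Unicode decomposition (á -> a, ñ -> n)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_text(text: str) -> str:
    """Lowercase, strip accents, drop punctuation, then drop stopwords."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"\w+", text)
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

print(clean_text("Uvas frescas, y pasas; de la región"))
```

In this sketch the stopwords are filtered after tokenizing, so a stopword that is directly followed by a comma or a period is removed as well.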

#### **Fecha (date)**

For this column we will keep only the month of shipment

In [None]:
# Parse the dates with the explicit day/month/year format and keep only the month
train_data['Fecha'] = pd.to_datetime(train_data['Fecha'], format='%d/%m/%Y').dt.month
train_data['Fecha']
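Stating the date format explicitly matters because the dates are written day first. A minimal illustration with the standard library (the date below is invented, and `shipment_month` is a hypothetical helper):

```python
from datetime import datetime

def shipment_month(date_text: str) -> int:
    """Parse a day/month/year string and return the month number (1-12)."""
    return datetime.strptime(date_text, "%d/%m/%Y").month

# Hypothetical date: 3 April 2017 written day first
print(shipment_month("03/04/2017"))
# Reading the same string month first would wrongly yield March
print(datetime.strptime("03/04/2017", "%m/%d/%Y").month)
```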

#### **Kg Neto (net weight of the good in kg) and U$ FOB Tot (total price of the good)**

As we said before, here we will compute the price per kg of the good. To do this we will use both columns and transform them into one.

In [None]:
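# First, convert both columns to float (the raw values use a comma as decimal separator)
train_data['Kg Neto'] = train_data['Kg Neto'].astype(str).str.replace(',', '.', regex=False).astype(float)
train_data['U$ FOB Tot'] = train_data['U$ FOB Tot'].astype(str).str.replace(',', '.', regex=False).astype(float)

# Then drop rows where the weight is 0, which would otherwise cause a division by zero
print("From "+ str(len(train_data)) + " rows there are "+str(len(train_data[train_data['Kg Neto']==0]))+" with weight = 0")
train_data.drop(train_data[train_data["Kg Neto"] == 0].index, inplace=True)

# Finally divide the price by the weight
price_by_kg = (train_data['U$ FOB Tot'] / train_data['Kg Neto']).values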

# Drop the used columns
train_data.drop(columns=["Kg Neto", "U$ FOB Tot"], inplace=True)

# Add the new column and name it usd_kg
train_data["usd_kg"]=price_by_kg
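The per-row logic behind this transformation can be sketched in plain Python. The helpers `parse_decimal_comma` and `usd_per_kg` are hypothetical and the numbers are invented; the sketch assumes that the comma is a decimal mark and that no thousands separators are present.

```python
from typing import Optional

def parse_decimal_comma(value: str) -> float:
    """Convert a string such as '1200,5' into the float 1200.5."""
    return float(value.replace(",", "."))

def usd_per_kg(fob_usd: str, net_kg: str) -> Optional[float]:
    """Return the price per kilogram, or None when the weight is zero."""
    weight = parse_decimal_comma(net_kg)
    if weight == 0:
        return None
    return parse_decimal_comma(fob_usd) / weight

print(usd_per_kg("3601,5", "1200,5"))
print(usd_per_kg("100,0", "0,0"))
```

Returning `None` for zero weights mirrors the decision above to drop those rows instead of producing infinite prices.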

#### **País de destino (destination country)**

In [None]:
countries = train_data["Pais de Destino"].unique()
countries.sort()
countries

The column is consistent and shows no corrupted data. We will only set the values to lowercase and remove accents.

In [None]:
import unidecode

# Create an array with the column values
old_countries = train_data["Pais de Destino"].tolist()

# Array to store cleaned values
new_countries = []

# Clean each value and populate the new array
for country in old_countries:
    # Set to lowercase
    new_text = country.lower()
    # Remove accents
    new_text = unidecode.unidecode(new_text)
    # Append to the array
    new_countries.append(new_text)

# Add the cleaned data to the training dataframe
train_data["Pais de Destino"] = new_countries
train_data[["Pais de Destino"]]

#### **Descripcion Comercial (commercial description)**

As shown below, some values in this column contain repeated sentences

In [None]:
comercial_description = train_data["Descripcion Comercial"].tolist()
comercial_description[0]

Let's clean this and also remove accents, repeated white spaces, stopwords and punctuation, and set the strings to lowercase

In [None]:
import re
import unidecode
from tqdm import tqdm

# Function to remove repeated sentences inside a same string
def get_unrepeated_string(source: str) -> str:
    return re.match(r'^\s*([\w\s!"#$%&\'()*+,-./:;<=>?@{|}~º°«»\[\]§y¨`¦´¤¿]+?)(?:\s*\1)*\s*$', source)[1]

new_comercial_description = []

for description in tqdm(comercial_description):
    # First remove all accents
    new_description = unidecode.unidecode(description)
    # Remove two or more consecutive spaces and set one
    new_description= ' '.join(new_description.split())
    # Remove stopwords
    words = [word for word in new_description.split() if word.lower() not in sw_nltk ]
    new_description = " ".join(words)
    # Additionally remove punctuations
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    words = tokenizer.tokenize(new_description)
    new_description = " ".join(words)
    # Set to lowercase
    new_description = new_description.lower()
    # Then remove the duplicated sentences
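    try:
        new_comercial_description.append(get_unrepeated_string(new_description))
    except TypeError:
        # No match (e.g. an empty string): keep the cleaned text so the list stays aligned with the rows
        print(new_description)
        new_comercial_description.append(new_description)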

train_data["Descripcion Comercial"] = new_comercial_description
new_comercial_description[0]
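The regular expression used above relies on a backreference: it lazily captures the shortest unit and then requires the rest of the string to consist of copies of that unit. A simplified variant of the pattern (an illustration, not the exact expression from the cell above) shows the behaviour on invented strings; `REPEATED` and `collapse_repeats` are hypothetical names.

```python
import re

# Simplified variant: a lazily matched unit followed by copies of itself
REPEATED = re.compile(r"^\s*(.+?)(?:\s*\1)*\s*$", re.DOTALL)

def collapse_repeats(text: str) -> str:
    """Return the shortest unit whose repetition rebuilds the whole string."""
    match = REPEATED.match(text)
    return match.group(1) if match else text

print(collapse_repeats("uva fresca red globe uva fresca red globe"))
print(collapse_repeats("mango kent en cajas"))
```

One caveat of this approach: any string that is itself a repetition of a shorter unit collapses too, so a description consisting only of the word "papa" would be reduced to "pa".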

#### **Subcategoría inicial (subcategories)**

This is the column to predict

Finally, our data is clean and ready to be preprocessed. As a last step, we will reset the indexes.

In [None]:
train_data.reset_index(drop=True, inplace=True)
train_data.to_csv('../data/cleaned_data/cleaned_data.csv', index=False)
train_data

## **5. Data Preprocessing**

In this stage we will preprocess the data to be used in a classification model. As seen in the Exploratory Data Analysis section, there is a big class imbalance. We will address this issue as a first step.



In [None]:
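# Import the cleaned data
train_data = pd.read_csv('../data/cleaned_data/cleaned_data.csv')
# Descriptions that became empty after cleaning are read back as NaN; restore them as empty strings
text_columns = ['Descripcion de la Partida Aduanera', 'Descripcion Comercial']
train_data[text_columns] = train_data[text_columns].fillna('')
# Select the target variable (as a 1-D array) and the explanatory variables
y = train_data['Categoría macro Aurum'].values
X = train_data.drop(['Categoría macro Aurum'], axis=1)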

We will use a method called oversampling, in which we increase the size of the low-count classes by duplicating their rows as many times as needed

In [None]:
# Import the library for Oversampling
from imblearn.over_sampling import RandomOverSampler
# Create the oversampling model
ros = RandomOverSampler(random_state=0)
# Get the oversampled dataset
X_resampled, y_resampled = ros.fit_resample(X, y)
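The idea behind random oversampling can be sketched in plain Python. This illustrates the principle only and is not imblearn's implementation; the function `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each minority class up to the majority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Every class is grown to the size of the largest one
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = rng.choices(members, k=target - len(members))
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: 4 rows of class "a", 1 row of class "b"
rows, labels = random_oversample(["r1", "r2", "r3", "r4", "r5"], ["a", "a", "a", "a", "b"])
print(Counter(labels))
```

Because the added rows are exact copies, the model sees the same minority examples many times, which is worth keeping in mind when interpreting accuracy on oversampled data.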

Since there was a big imbalance, the oversampling generates too many extra rows. We will therefore draw a sample from the resampled dataset.

In [None]:
import numpy as np
# We sample 100,000 rows from the new oversampled dataset
idx = np.random.choice(np.arange(len(X_resampled)), 100000, replace=False)
# X_resampled is a DataFrame, so rows are selected by position with .iloc
x_sample = X_resampled.iloc[idx]
y_sample = np.asarray(y_resampled)[idx]

Convert the resampled dataset into a dataframe and persist it locally for easy future use

In [None]:
import numpy as np
resampled_train = np.column_stack((x_sample, y_sample))
# The last column holds the target selected above (*Categoría macro Aurum*)
resampled_train = pd.DataFrame(resampled_train, columns=['Descripcion de la Partida Aduanera', 'Fecha', 'Pais de Destino', 'Descripcion Comercial', 'usd_kg', 'Categoría macro Aurum'])
resampled_train.to_csv("../data/preprocessed_data/resampled_data.csv", index=False)

Now recheck the class balance

In [None]:
resampled_train["Categoría macro Aurum"].value_counts()

As shown above, now the classes are balanced

We are dealing with text, categorical and numerical data in this dataset. The next step is to represent the text columns as numbers, which is known as *sentence embedding*. This will be done for the columns *Descripcion de la Partida Aduanera* and *Descripcion Comercial*. Let's create a function that trains an embedding model on a text column.

In [None]:
# Function to train a Doc2Vec model on a column of strings (the input is a list with the strings of the column)

def col2vectors(rows):
    # Import libraries for sentence embedding
    import gensim
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Get arrays of words for each row
    data = [row.split() for row in rows]

    # Create a TaggedDocument for each array (this is the input format for Doc2Vec)
    def tagged_document(list_of_list_of_words):
        for i, list_of_words in enumerate(list_of_list_of_words):
            yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

    # Get the data for training by converting the arrays to TaggedDocuments
    data_for_training = list(tagged_document(data))

    # Create the model
    model = gensim.models.doc2vec.Doc2Vec(vector_size=10, min_count=2, epochs=10)
    model.build_vocab(data_for_training)

    # Train the model
    model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

    return model

Convert *Descripcion de la Partida Aduanera (description of the customs code)*

In [None]:
# Get the column
descriptions = X["Descripcion de la Partida Aduanera"].values
model = col2vectors(descriptions)

Save the model locally for future use

In [None]:
# Save model
model.save("../models/custom_descriptions_doc2vec_model")

Convert *Descripcion Comercial (commercial description)*

Let's do the same for this column

In [None]:
# Get the column
descriptions = X["Descripcion Comercial"].values
model = col2vectors(descriptions)

Save the model locally for future use

In [None]:
# Save model
model.save("../models/comercial_descriptions_doc2vec_model")

In [None]:
descriptions[0]

## **6. Model Building**

In this section we will create predictive models using different approaches

### **6.1. Multi-Layer Perceptron**

#### **6.1.1. Using only text columns**

##### **6.1.1.1 Using *Descripcion Comercial* (commercial description)**

In [None]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec
from tqdm import tqdm

# Import the preprocessed data
# training_data = pd.read_csv('../data/preprocessed_data/resampled_data.csv')
training_data = X.copy()

# Get the column and convert to vector using the doc2vec model trained before
model = Doc2Vec.load('../models/comercial_descriptions_doc2vec_model')
X = []
for row in tqdm(training_data['Descripcion Comercial'].tolist()):
    X.append(model.infer_vector(row.split(), epochs=10))

# for i in tqdm(range(0,len(model.dv))):
#     X.append(model.dv[i])

# Get the target variable column
y = y.copy()

In [None]:
training_data.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder().fit_transform(training_data['Pais de Destino'].values.reshape(-1,1))
ohe

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='adam', alpha=1e-3, hidden_layer_sizes=(16, 8, 8),
                    random_state=1, max_iter=1000, early_stopping=True, verbose=True)

In [None]:
clf.fit(X[:], y[:])

In [None]:
len(np.unique(y[:]))

In [None]:
clf.predict([X[5]])

In [None]:
predicted = clf.predict(X)

In [None]:
predicted

In [None]:
original = y[:]
original

In [None]:
# Number of rows where the prediction matches the original label
np.sum(np.ravel(original) == np.ravel(predicted))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
f,ax = plt.subplots(figsize=(20,20))
cm = confusion_matrix(original, predicted)
ConfusionMatrixDisplay(confusion_matrix=cm).plot(ax=ax)
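The predictions above are made on the same rows that were used for fitting, so they tend to give an optimistic picture. A minimal sketch of measuring accuracy on held-out rows follows. It uses synthetic data from `make_classification` as a stand-in for the embedded descriptions; all variable names are hypothetical, and the resulting numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the embedded descriptions (10 features, 3 classes)
features, target = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_fit, X_held, y_fit, y_held = train_test_split(
    features, target, test_size=0.2, random_state=5, stratify=target
)

demo_clf = MLPClassifier(hidden_layer_sizes=(16, 8, 8), max_iter=500, random_state=1)
demo_clf.fit(X_fit, y_fit)

# Accuracy on rows seen during fitting versus rows held out
fit_accuracy = accuracy_score(y_fit, demo_clf.predict(X_fit))
held_accuracy = accuracy_score(y_held, demo_clf.predict(X_held))
print(f"fit rows: {fit_accuracy:.3f}, held-out rows: {held_accuracy:.3f}")
```

In this notebook, the `test` partition created in section 2 could play the role of the held-out rows, provided it goes through the same cleaning and embedding steps as the training data.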