# **Agricultural Exports Categories Analysis**
*by Sergio Postigo and Víctor Diví*

## **4. Data cleaning**

In this stage we clean the data, specifically the columns that we will use in the model(s) in the next section. We don't need to clean every column, since many of them are not relevant for labeling the rows. So let's first determine which columns to use and justify why.

| COLUMN                             | USEFUL  | JUSTIFICATION                                                                                                                                                                                                    |
|------------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Partida Aduanera                   | NO      | For each customs code there is one description in *Descripcion de la Partida Aduanera*. The latter carries more information about the product. So, we won't take this attribute and consider the next one.       |
| Descripcion de la Partida Aduanera | **YES** | This is a general description about the product, so this carries valuable information for the labeling                                                                                                           |
| Aduana                             | NO      | The port from which the product is being shipped. For now, we won't consider it for our models                                                                                                                   |
| DUA                                | NO      | This is a randomly generated code associated with the shipment; it does not carry information that can be captured                                                                                                |
| Fecha                              | **YES** | Associating the date of shipping to a category is insightful. As we saw, some products are exported in specific seasons of the year                                                                              |
| Año                                | NO      | Already included in the previous attribute                                                                                                                                                                       |
| Cod. Tributario                    | NO      | There is one tax code for each company. A company can be associated with specific groups of products; however, the number of different companies can be huge.                                                    |
| Exportador en Perú                 | NO      | Same idea as previous row                                                                                                                                                                                        |
| Importador Extranjero              | NO      | The number of different importers abroad may be huge, and new data may carry new names not learned by the model                                                                                                  |
| Kg Bruto                           | NO      | See next attribute                                                                                                                                                                                               |
| Kg Neto                            | **YES** | The shipment weight varies widely within the same product, so we won't use it directly as a feature. However, we will use it to calculate the price per kg, which is more informative.                           |
| Toneladas Netas                    | NO      | See previous attribute                                                                                                                                                                                           |
| Qty 1                              | NO      | Same as before                                                                                                                                                                                                   |
| Und 1                              | NO      | Same as before                                                                                                                                                                                                   |
| Qty 2                              | NO      | Same as before                                                                                                                                                                                                   |
| Und 2                              | NO      | Same as before                                                                                                                                                                                                   |
| U$ FOB Tot                         | **YES** | The value of the shipment will be used to calculate the price per kg of the product                                                                                                                              |
| Miles de USD Fob TOTAL             | NO      | It is just a repetition of the previous attribute                                                                                                                                                                |
| U$ FOB Und 1                       | NO      |                                                                                                                                                                                                                  |
| U$ FOB Und 2                       | NO      |                                                                                                                                                                                                                  |
| Pais de Destino                    | **YES** | The country where these products are imported can be related to groups of products                                                                                                                               |
| Puerto de destino                  | NO      | The previous attribute indirectly captures this information already                                                                                                                                              |
| Último Puerto Embarque             | NO      |                                                                                                                                                                                                                  |
| Via                                | NO      |                                                                                                                                                                                                                  |
| Agente Portuario                   | NO      |                                                                                                                                                                                                                  |
| Agente de Aduana                   | NO      |                                                                                                                                                                                                                  |
| Descripcion Comercial              | **YES** | The commercial description also carries valuable information for the labeling                                                                                                                                    |
| Descripcion1                       | NO      | Already captured in *Descripcion Comercial*                                                                                                                                                                      |
| Descripcion2                       | NO      | Already captured in *Descripcion Comercial*                                                                                                                                                                      |
| Descripcion3                       | NO      | Already captured in *Descripcion Comercial*                                                                                                                                                                      |
| Descripcion4                       | NO      | Already captured in *Descripcion Comercial*                                                                                                                                                                      |
| Descripcion5                       | NO      | Already captured in *Descripcion Comercial*                                                                                                                                                                      |
| Naviera                            | NO      |                                                                                                                                                                                                                  |
| Agente Carga(Origen)               | NO      |                                                                                                                                                                                                                  |
| Agente Carga(Destino)              | NO      |                                                                                                                                                                                                                  |
| Canal                              | NO      |                                                                                                                                                                                                                  |
| Concatenar                         | NO      |                                                                                                                                                                                                                  |
| Categoría macro Aurum              | **YES** | **LABEL**                                                                                                                                                                                                        |
| Subcategoría inicial               | NO      | While we also need this category, it can be inferred given a prediction of the subcategory                                                                                                                       |
| Subcategoría Consolidada Aurum     | NO      |                                                                                                                                                                                                                  |
| Categoría Consolidada Aurum        | NO      |                                                                                                                                                                                                                  |

In [None]:
import pandas as pd

full_data = pd.read_csv("../data/raw_data/data.csv", encoding='latin-1', sep=';')

In [None]:
data = full_data[
    ["Descripcion de la Partida Aduanera", "Fecha", "Kg Neto", "U$ FOB Tot", "Pais de Destino", "Descripcion Comercial",
     "Categoría macro Aurum"]].copy()
data.head()

From now on we will go through each of the selected columns in turn.

#### **Descripcion de la Partida Aduanera (description of the customs code)**

In [None]:
data[["Descripcion de la Partida Aduanera"]]

Since this column holds textual descriptions of the product, we will use Natural Language Processing techniques. An important first step is to remove the so-called *stop words* from each cell, so that we get rid of words that carry little information. For example, the second row in the table above contains words such as 'Y' (and) or 'O' (or). These words should not be considered in our future model.

To do this we will use the Natural Language Toolkit (NLTK).

First, let's define a function to normalize strings. Here we don't care about accents\*, punctuation, non-alphabetic characters in general, or letter case, so we will strip all of that.

\*While it's true that in Spanish an accent can change the meaning of a word, this rarely happens with nouns. We also have to take into account that most accented words probably appear both with and without the accent in the dataset.

In [None]:
import unidecode
import re


def to_alpha_lower_ascii(val: str) -> str:
    ascii_value = unidecode.unidecode(val)
    lower = ascii_value.lower()
    alpha = re.sub(r'[^a-z]', ' ', lower)
    alpha_spaces = re.sub(r'\s+', ' ', alpha)
    return alpha_spaces
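To make the normalization concrete, here is a minimal standalone sketch of the same idea. It is an assumption-laden stand-in, not the notebook's function: it uses the standard library's `unicodedata` instead of `unidecode` (which transliterates a wider range of characters), and the names `strip_accents` and `normalize_text` are hypothetical.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD decomposition separates base letters from combining accent marks,
    # so dropping the combining marks leaves the unaccented letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_text(text: str) -> str:
    lowered = strip_accents(text).lower()
    # Anything that is not a-z (digits, punctuation, symbols) becomes a space
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", letters_only).strip()


print(normalize_text("Café orgánico, 100% ARÁBICA"))  # cafe organico arabica
```

Unlike this sketch, the notebook's function does not trim leading or trailing spaces; that makes no difference once the string is split into words.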

In [None]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

sw_nltk = [to_alpha_lower_ascii(x) for x in stopwords.words('spanish')]
print("The words considered stopwords in spanish are: ")
print(sw_nltk)

Now let's convert the values to plain ASCII using the previous function and get rid of stop words and other short words (1 or 2 letters).

In [None]:
def clean_str(value: str) -> str:
    converted_words = to_alpha_lower_ascii(value).split()
    return ' '.join(word for word in converted_words if word not in sw_nltk and len(word) > 2)
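The filtering step can be tried in isolation on an already normalized string. The sketch below assumes a small hand-picked stop word set (`SAMPLE_STOPWORDS`, hypothetical) in place of NLTK's full Spanish list, and a hypothetical helper name `drop_stopwords`.

```python
# Hypothetical hand-picked subset; the notebook uses NLTK's full Spanish list
SAMPLE_STOPWORDS = frozenset({"de", "la", "los", "las", "del", "para", "con", "sin"})


def drop_stopwords(normalized: str, min_length: int = 3) -> str:
    # Keep words that are not stop words and have at least min_length letters
    kept = [
        word
        for word in normalized.split()
        if word not in SAMPLE_STOPWORDS and len(word) >= min_length
    ]
    return " ".join(kept)


print(drop_stopwords("los demas frutos de cascara frescos o secos"))
# demas frutos cascara frescos secos
```

Storing the stop words in a set rather than a list keeps each membership test constant-time, which can matter when the function is applied to every row of a large dataset.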

In [None]:
data['Descripcion de la Partida Aduanera_clean'] = data['Descripcion de la Partida Aduanera'].apply(clean_str)
data[['Descripcion de la Partida Aduanera', 'Descripcion de la Partida Aduanera_clean']]

In [None]:
data.drop('Descripcion de la Partida Aduanera', axis=1, inplace=True)
data.rename({'Descripcion de la Partida Aduanera_clean': 'Descripcion de la Partida Aduanera'}, axis=1, inplace=True)

#### **Fecha (date)**

For this column we will keep only the month of shipment.

In [None]:
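# Keep only the month (1-12); infer_datetime_format is deprecated since pandas 2.0,
# where format inference is already the default behaviour
data['Fecha'] = pd.to_datetime(data['Fecha']).dt.month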
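Letting the parser infer the date format can silently swap day and month for ambiguous values. The sketch below illustrates the issue with the standard library; the `dd/mm/YYYY` layout is an assumption (the raw format of `Fecha` is not shown here), and `shipment_month` is a hypothetical helper.

```python
from datetime import datetime


def shipment_month(raw_date: str, date_format: str = "%d/%m/%Y") -> int:
    # date_format is an assumption; adjust it to the actual layout of 'Fecha'
    return datetime.strptime(raw_date, date_format).month


print(shipment_month("03/04/2021"))              # 4 (read as 3 April)
print(shipment_month("03/04/2021", "%m/%d/%Y"))  # 3 (read as 4 March)
```

`pd.to_datetime` accepts the same directives through its `format` argument, so the layout can be pinned once it has been confirmed against the raw data.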

#### **Kg Neto (net weight of the goods in kg) and U$ FOB Tot (total price of the goods)**

As we said before, here we will compute the price per kg of the goods. To do this we will combine both columns into one.

In [None]:
import numpy as np

data['Kg Neto'] = data['Kg Neto'].str.replace(',', '.').astype(float).values
data['U$ FOB Tot'] = data['U$ FOB Tot'].str.replace(',', '.').astype(float).values

In [None]:
(data["Kg Neto"] == 0).value_counts()

In [None]:
data = data.drop(data[data["Kg Neto"] == 0].index)

In [None]:
# Price per kg = total FOB value (USD) / net weight (kg)
data["usd_kg"] = data['U$ FOB Tot'] / data['Kg Neto']
data['usd_kg'] = data['usd_kg'].fillna(0)
data['usd_kg'] = data['usd_kg'].replace([np.inf, -np.inf], 0)

data = data.drop(columns=["Kg Neto", "U$ FOB Tot"])

In [None]:
data['usd_kg'].describe()
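The conversion and the per-kg price can be checked by hand on a couple of made-up values. This is a plain-Python sketch with hypothetical helper names; it assumes the raw strings use a decimal comma and no thousands separator, and it returns `None` for zero-weight rows instead of dropping them.

```python
from typing import Optional


def parse_decimal_comma(raw: str) -> float:
    # "1234,5" -> 1234.5 (assumes no thousands separator)
    return float(raw.replace(",", "."))


def usd_per_kg(fob_usd: str, net_kg: str) -> Optional[float]:
    weight = parse_decimal_comma(net_kg)
    if weight == 0:
        # No meaningful price per kg for a zero-weight shipment
        return None
    return parse_decimal_comma(fob_usd) / weight


# Made-up (FOB value, net weight) pairs, not taken from the dataset
rows = [("2500,0", "1000,0"), ("120,5", "0")]
print([usd_per_kg(fob, kg) for fob, kg in rows])  # [2.5, None]
```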

#### **País de destino (country of destination)**

In [None]:
countries = data["Pais de Destino"].unique()
countries.sort()
countries

The column looks correct and contains no corrupted values. We will only convert the values to lowercase and remove accents.

In [None]:
data["Pais de Destino"] = data["Pais de Destino"].apply(lambda country: unidecode.unidecode(country).lower())

#### **Descripcion Comercial (commercial description)**

As shown below, some values in this column contain the same sentence repeated several times.

In [None]:
comercial_description = data["Descripcion Comercial"].tolist()
comercial_description[0]

Let's remove these repetitions and also strip accents, repeated whitespace, stop words and punctuation, and convert everything to lowercase.

In [None]:
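def remove_repetitions(source: str) -> str:
    # Shortest run of whole words that, repeated, spans the string; input is returned if nothing matches
    match = re.match(r'^\s*(\w+(?:\s+\w+)*?)(?:\s+\1)*\s*$', source)
    return match[1] if match else source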

In [None]:
data['Descripcion Comercial'] = data['Descripcion Comercial'].apply(clean_str).apply(remove_repetitions)
data['Descripcion Comercial']
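The same repetition removal can be sketched without regular expressions by comparing whole words. Working on words rather than characters avoids shortening a single word that happens to repeat internally (for example `papa`). The function name `collapse_repeated_phrase` is hypothetical, and the sample strings are made up.

```python
def collapse_repeated_phrase(text: str) -> str:
    words = text.split()
    total = len(words)
    # Try every candidate phrase length, shortest first
    for size in range(1, total + 1):
        # The phrase must tile the whole word list exactly
        if total % size == 0 and words[:size] * (total // size) == words:
            return " ".join(words[:size])
    return text  # only reached when there are no words at all


print(collapse_repeated_phrase("palta hass fresca palta hass fresca"))  # palta hass fresca
print(collapse_repeated_phrase("papa"))                                 # papa
```

A description with no repetition is returned unchanged, because the only phrase length that tiles it is the full length.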

#### **Categoría macro Aurum (subcategories)**

This is the column to predict, so we leave it as is.

Finally, our data is clean and ready to be preprocessed. As a last step, we reset the index, move the label to the last column, and save the result.

In [None]:
data.reset_index(drop=True, inplace=True)
data = data[[col for col in data if col != 'Categoría macro Aurum'] + ['Categoría macro Aurum']]
data.to_csv('../data/cleaned_data/cleaned_data.csv', index=False)
data