# **Agricultural Exports Categories Analysis**
*by Sergio Postigo and Víctor Diví*

## **5. Data Preprocessing**

In this stage we preprocess the data for use in a classification model. As seen in the Data Exploration section, there is a large class imbalance. We will address this issue first.



In [None]:
import pandas as pd

data = pd.read_csv('../data/cleaned_data/cleaned_data.csv')

We will use down-sampling to reduce the number of instances of the most frequent categories. Each category will keep at most 20,000 instances; categories with fewer instances are left unchanged.

In [None]:
samples_per_group = {name: min(count, 20000) for name, count in data['Categoría macro Aurum'].value_counts().items()}
data = data.groupby('Categoría macro Aurum').apply(
    lambda group: group.sample(samples_per_group[group.name], random_state=5)).reset_index(drop=True)
data['Categoría macro Aurum'].value_counts()
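
To make the capping behaviour easier to see, here is a minimal sketch on toy data. The frame `toy`, its column names and the cap of 4 are hypothetical stand-ins, and the shuffle-then-`head` formulation is an alternative way to express the same idea, not the exact code used above (it will not select the same rows).

```python
import pandas as pd

# Hypothetical toy data: the column names and the cap are illustrative only.
toy = pd.DataFrame({
    "category": ["a"] * 6 + ["b"] * 3 + ["c"] * 1,
    "value": range(10),
})
cap = 4  # stands in for the 20,000 cap used in the notebook

# Shuffle all rows once, then keep the first `cap` rows of each category.
downsampled = (
    toy.sample(frac=1, random_state=5)
       .groupby("category")
       .head(cap)
       .reset_index(drop=True)
)

counts = downsampled["category"].value_counts().to_dict()
print(counts)
```

Only the category with more rows than the cap is reduced; the smaller ones keep all their rows.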

Persist the resampled dataframe locally for easy future use.

In [None]:

data.to_csv("../data/preprocessed_data/resampled_data.csv", index=False)

So far we have worked on the whole dataset, since the actions performed would also be applied to new data. However, the next steps should be fitted only on the training data, so we will split the data into two sets (80-20) and continue working with only the 80% training portion.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=5)
train.to_csv('../data/preprocessed_data/train_data.csv', index=False)
test.to_csv('../data/preprocessed_data/test_data.csv', index=False)
data = train
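
As a quick sanity check of what the split produces, the sketch below applies the same call to a hypothetical toy frame (`toy`, `train_toy` and `test_toy` are illustrative names, not part of the notebook) and confirms that the two parts have the expected sizes and share no rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the resampled data.
toy = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

# Same parameters as the split above: 20% held out, fixed seed.
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=5)

# The two parts should be disjoint and together cover every row.
overlap = set(train_toy.index) & set(test_toy.index)
print(len(train_toy), len(test_toy), len(overlap))
```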

This dataset contains text, categorical and numerical data. The next step is to represent the text columns as numeric vectors, which is known as *sentence embedding*. This will be done for the columns *Descripcion de la Partida Aduanera* and *Descripcion Comercial*. Let's create a function that trains a Doc2Vec model on a text column, from which the vectors can then be obtained.

In [None]:
from typing import List
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


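def col2vectors(rows: List[str]) -> Doc2Vec:
    """Train a Doc2Vec model on a text column and return the trained model."""
    # Tag each whitespace-tokenized row with its position in the column.
    model_input = [TaggedDocument(row.split(), [i]) for i, row in enumerate(rows)]

    # 10-dimensional vectors; words that appear fewer than 2 times are ignored.
    doc2vec_model = Doc2Vec(vector_size=10, min_count=2, epochs=10)
    doc2vec_model.build_vocab(model_input)

    doc2vec_model.train(model_input, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

    return doc2vec_model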

Train a model on *Descripcion de la Partida Aduanera* (description of the customs code) and save it locally for future use.

In [None]:
model = col2vectors(data["Descripcion de la Partida Aduanera"].values)
model.save("../models/custom_descriptions_doc2vec_model")

Let's do the same for *Descripcion Comercial* (commercial description).

In [None]:
model = col2vectors(data["Descripcion Comercial"].values)
model.save("../models/comercial_descriptions_doc2vec_model")