# Goal 1
## Given a partial permit, can we predict what classification it belongs to?

Trade permits can be messy and incomplete. Can we use this partial data to predict which Purpose code a permit should carry?

### Getting Started
- Open up the CITES trade database at https://trade.cites.org/
- Select a year range and click *Search*
- Download a Comparative Tabulation report and place it in `data/`
- Install requirements with `pip install -r requirements.txt`
- Run this notebook

In [342]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

In [343]:
seed = 1
np.random.seed(seed)

### Importing our data
Let's import our data into a pandas dataframe and take a look at it.

In [344]:
dataframe = pd.read_csv("data/data.csv", skipinitialspace=True, dtype={
    'Importer reported quantity': float,
    'Exporter reported quantity': float
})

dataframe

Unnamed: 0,Year,App.,Taxon,Class,Order,Family,Genus,Importer,Exporter,Origin,Importer reported quantity,Exporter reported quantity,Term,Unit,Purpose,Source
0,2016,I,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,TR,NL,CZ,,1.0,bodies,,T,C
1,2016,I,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,XV,RS,RS,,1.0,bodies,,Q,O
2,2016,I,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,,43.0,feathers,,S,W
3,2016,I,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,,43.0,specimens,,S,W
4,2016,I,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,DK,IS,,700.00,,specimens,,S,W
5,2016,I,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,XV,RS,RS,,1.0,bodies,,Q,O
6,2016,I,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,12.0,feathers,,S,C
7,2016,I,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,4.0,feathers,,S,U
8,2016,I,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,2.0,feathers,,S,W
9,2016,I,Acipenser brevirostrum,Actinopteri,Acipenseriformes,Acipenseridae,Acipenser,CH,DE,,,4.0,live,,T,C


### Formatting the data
The Year and App. columns are unlikely to help us classify these records, so let's drop those first.
We'll also strip any extra whitespace from the column names to make them easier to work with later on...

In [345]:
dataframe.columns = dataframe.columns.str.strip()
dataframe = dataframe.drop(columns=['Year', 'App.'])

dataframe

Unnamed: 0,Taxon,Class,Order,Family,Genus,Importer,Exporter,Origin,Importer reported quantity,Exporter reported quantity,Term,Unit,Purpose,Source
0,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,TR,NL,CZ,,1.0,bodies,,T,C
1,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,XV,RS,RS,,1.0,bodies,,Q,O
2,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,,43.0,feathers,,S,W
3,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,,43.0,specimens,,S,W
4,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,DK,IS,,700.00,,specimens,,S,W
5,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,XV,RS,RS,,1.0,bodies,,Q,O
6,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,12.0,feathers,,S,C
7,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,4.0,feathers,,S,U
8,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,,2.0,feathers,,S,W
9,Acipenser brevirostrum,Actinopteri,Acipenseriformes,Acipenseridae,Acipenser,CH,DE,,,4.0,live,,T,C


We're going to have to replace those NaN values in the reported quantities columns...

In [346]:
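# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original dataframe
dataframe['Importer reported quantity'] = dataframe['Importer reported quantity'].fillna(0)
dataframe['Exporter reported quantity'] = dataframe['Exporter reported quantity'].fillna(0)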

dataframe

Unnamed: 0,Taxon,Class,Order,Family,Genus,Importer,Exporter,Origin,Importer reported quantity,Exporter reported quantity,Term,Unit,Purpose,Source
0,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,TR,NL,CZ,0.00,1.0,bodies,,T,C
1,Aquila heliaca,Aves,Falconiformes,Accipitridae,Aquila,XV,RS,RS,0.00,1.0,bodies,,Q,O
2,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,0.00,43.0,feathers,,S,W
3,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,BE,NO,,0.00,43.0,specimens,,S,W
4,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,DK,IS,,700.00,0.0,specimens,,S,W
5,Haliaeetus albicilla,Aves,Falconiformes,Accipitridae,Haliaeetus,XV,RS,RS,0.00,1.0,bodies,,Q,O
6,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,0.00,12.0,feathers,,S,C
7,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,0.00,4.0,feathers,,S,U
8,Harpia harpyja,Aves,Falconiformes,Accipitridae,Harpia,BR,FR,,0.00,2.0,feathers,,S,W
9,Acipenser brevirostrum,Actinopteri,Acipenseriformes,Acipenseridae,Acipenser,CH,DE,,0.00,4.0,live,,T,C


We have a lot of text data in the form of ISO country codes and specialist categories. We'll need to encode these as one-hot vectors in the next step so that our neural net can understand them. First we'll grab a list of all the columns, then remove the ones we don't want to encode: the two numeric quantity columns and Purpose, which is our label.

In [347]:
print("Number of Columns: ", len(dataframe.columns))
columns = list(dataframe.columns)
columns.remove('Importer reported quantity')
columns.remove('Exporter reported quantity')
columns.remove('Purpose')
print(columns)

Number of Columns:  14
['Taxon', 'Class', 'Order', 'Family', 'Genus', 'Importer', 'Exporter', 'Origin', 'Term', 'Unit', 'Source']


### Encoding our labels and data

In order to test the performance of our neural net, we'll need to split our dataframe into the input features and their corresponding classifications.

The Purpose column is what we are going to attempt to predict (notice we already removed it from the list of columns we'd like to one-hot encode).

Let's pop off our labels from our dataframe, and keep them separate...

In [348]:
labels = dataframe.pop('Purpose')

labels

0        T
1        Q
2        S
3        S
4        S
5        Q
6        S
7        S
8        S
9        T
10       T
11       P
12       T
13       Z
14       Z
15       Z
16       B
17       Z
18       E
19       E
20       Z
21       Z
22       E
23       Z
24       S
25       Z
26       S
27       S
28       Q
29       Z
        ..
75861    S
75862    S
75863    H
75864    T
75865    T
75866    T
75867    T
75868    T
75869    T
75870    T
75871    T
75872    T
75873    T
75874    S
75875    T
75876    S
75877    H
75878    H
75879    T
75880    T
75881    S
75882    S
75883    S
75884    T
75885    T
75886    T
75887    T
75888    T
75889    T
75890    T
Name: Purpose, Length: 75891, dtype: object

We'll need to convert our classifications into one hot vectors...

In [349]:
labels = pd.get_dummies(labels)

labels

Unnamed: 0,B,E,G,H,L,M,N,P,Q,S,T,Z
0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0
5,0,0,0,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,1,0,0
8,0,0,0,0,0,0,0,0,0,1,0,0
9,0,0,0,0,0,0,0,0,0,0,1,0
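
Later on we'll want to turn one-hot rows (or predicted probabilities) back into Purpose codes. A minimal sketch of that round trip, using a handful of made-up codes rather than our real labels; the `dtype=int` argument is only there so the toy output shows 0/1 instead of booleans:

```python
import pandas as pd

# Toy Purpose codes standing in for the real column (illustrative values only)
purpose = pd.Series(["T", "Q", "S", "S", "T"], name="Purpose")

# One column per distinct code, with a single 1 in each row
one_hot = pd.get_dummies(purpose, dtype=int)

# idxmax(axis=1) returns, for each row, the label of the column holding the
# largest value, which recovers the original code
decoded = one_hot.idxmax(axis=1)

print(one_hot.columns.tolist())
print(decoded.tolist())
```

The same `idxmax` (or `argmax` on a NumPy array) trick should work on the softmax output of a model, since the most probable class is the column with the largest value.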


Next, we'll one-hot encode the categorical columns in the rest of our dataframe and call the result *data*

In [None]:
data = pd.get_dummies(dataframe, columns=columns)

data
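
To see what `pd.get_dummies` does with the `columns` argument, here is a small sketch on a three-row toy frame (the values are illustrative, not taken from the full dataset). It assumes the default `dummy_na=False`, under which a missing value simply gets zeros in every dummy column for that field:

```python
import pandas as pd

toy = pd.DataFrame({
    "Term": ["bodies", "feathers", "live"],
    "Origin": ["CZ", None, None],  # missing origins, as in many of our rows
    "Exporter reported quantity": [1.0, 43.0, 4.0],
})

# Only the listed columns are expanded; the numeric column passes through untouched
encoded = pd.get_dummies(toy, columns=["Term", "Origin"], dtype=int)

print(encoded.columns.tolist())
print(encoded["Origin_CZ"].tolist())
```

Every distinct value becomes its own column, which is why a dataframe with only a dozen or so original columns can end up with thousands of features once taxa and country codes are encoded.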

Our data is looking better, but to make things easier on our model, we can scale every column to the range 0 to 1...

In [None]:
scaler      = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(data)
data        = pd.DataFrame(data_scaled)

data
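
For reference, min-max scaling to the range 0 to 1 maps each column with `(x - min) / (max - min)`. The sketch below applies that formula by hand with NumPy to a few made-up quantities; it is an illustration of the arithmetic, not a replacement for `MinMaxScaler`:

```python
import numpy as np

# Illustrative quantities, not taken from the real dataset
quantities = np.array([1.0, 43.0, 4.0, 12.0])
low, high = quantities.min(), quantities.max()
scaled = (quantities - low) / (high - low)

# A one-hot column that contains both 0 and 1 is left unchanged by the same formula
one_hot_column = np.array([0.0, 1.0, 0.0, 1.0])
one_hot_scaled = (one_hot_column - one_hot_column.min()) / (
    one_hot_column.max() - one_hot_column.min()
)

print(scaled)
print(one_hot_scaled)
```

So in practice the scaling step mostly affects the two reported quantity columns, where a single very large shipment can squeeze the remaining values close to 0.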

### Creating a train / test split

In order to evaluate our model, we'll split our data into two groups: a training set, which the neural net learns from, and a test set, which the neural net never sees during training and is only evaluated against once trained.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)

In [None]:
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

### Building a simple model

We'll build a simple neural network that accepts our 9400 input features and passes them to a hidden layer of 6279 neurons (two thirds of the input layer size plus the output layer size, a common rule of thumb for sizing a hidden layer). Finally, the hidden layer feeds an output layer with one neuron per category (12 in this case), which uses a softmax activation function to turn our predictions into a probability for each class...
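
The hidden layer size quoted above can be checked with a couple of lines of arithmetic. The variable names here are just for illustration, and the 9400 and 12 are the input and output sizes mentioned in the text:

```python
n_inputs = 9400   # one-hot encoded features plus the two quantity columns
n_outputs = 12    # one neuron per Purpose code

# Rule of thumb: two thirds of the input size plus the output size
hidden_units = round(2 / 3 * n_inputs + n_outputs)

# Weights and biases for the two Dense layers
n_parameters = (n_inputs * hidden_units + hidden_units) + (hidden_units * n_outputs + n_outputs)

print(hidden_units)   # 6279
print(n_parameters)   # 59104239
```

With roughly 59 million trainable parameters against about 76 thousand records, this is a large model for the data, which is one reason the dropout layer and cross-validation are worth having.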

In [None]:
def build_model():
    model = Sequential()
    model.add(Dense(6279, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(y_train.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model
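
As a reminder of what the softmax activation on the output layer does, here is a minimal NumPy sketch. The scores are made up, and subtracting the maximum before exponentiating is a standard numerical stability trick that does not change the result:

```python
import numpy as np

def softmax(scores):
    # Shift by the maximum so np.exp never sees large positive values
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw output scores for three classes
scores = np.array([2.0, 1.0, 0.1])
probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum())
```

The outputs are all positive and sum to 1, so the largest one can be read as the model's most likely Purpose code.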

### Hyperparameters

In [None]:
epochs = 5
batch_size = 2000

Let's train our simple model...

In [None]:
# Instantiate the network before training it
model = build_model()
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)

In [None]:
score = model.evaluate(X_test, y_test)

print("%s: %.2f%%" % (model.metrics_names[1], score[1]*100))

### Evaluating our model with K-Fold Cross Validation

We'll use stratified k-fold cross-validation to get a more reliable estimate of how our model performs...

In [None]:
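k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cv_scores = []

# StratifiedKFold needs 1-D class labels, so collapse the one-hot labels to class indices
class_indices = labels.values.argmax(axis=1)

for train, test in k_fold.split(data, class_indices):
    model = build_model()
    # train and test are arrays of row positions, so select rows with .iloc
    model.fit(data.iloc[train], labels.iloc[train], epochs=150, batch_size=10, verbose=0)
    scores = model.evaluate(data.iloc[test], labels.iloc[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

    cv_scores.append(scores[1] * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores), np.std(cv_scores)))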
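
To make the "stratified" part concrete, the sketch below runs `StratifiedKFold` on a tiny made-up label array (six `T` and three `S`) and counts the classes in each test fold. The labels and feature matrix are placeholders chosen only to make the counts easy to check:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: twice as many T permits as S permits
y = np.array(["T"] * 6 + ["S"] * 3)
X = np.zeros((len(y), 1))  # features are irrelevant to how the folds are chosen

splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

fold_counts = []
for train_idx, test_idx in splitter.split(X, y):
    codes, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(dict(zip(codes.tolist(), counts.tolist())))

print(fold_counts)
```

Each test fold keeps the same 2:1 ratio of `T` to `S` as the full array, which matters for our data because some Purpose codes are far rarer than others.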