# <center>Tracking Illegal Fishing Using Machine Learning</center>

##### <center>STUDENT NAME : ARUL RAYMONDS GEORGE JOSEPH</center>
##### <center>STUDENT ID: C00278718</center>

## Table of Contents:
1. [Introduction](#first-bullet)
2. [Business Understanding](#second-bullet)
3. [Data Understanding](#third-bullet)<br>
    3.1 [Load Data](#ld)<br>
    3.2 [Data Description](#desc)<br>
    3.3 [Visualization](#vs)<br>
4. [Data Preparation](#fourth-bullet)<br>
    4.1 [Fishing Activity](#fa)<br>
    4.2 [Data Cleaning](#dc)<br>
    4.3 [Vessel movements](#vm)<br>
5. [Modelling](#fifth-bullet)<br>
    5.1 [Random Forest](#rf)<br>
    5.2 [K-Nearest neighbor](#knn)<br>
    5.3 [Gaussian Naive Bayes](#gnb)<br>
    5.4 [Logistic Regression](#lr)<br>
    5.5 [Neural Networks](#nn)<br>
6. [Evaluation](#sixth-bullet)
7. [Deployments](#seventh-bullet)
8. [Discussion](#eighth-bullet)
9. [Results](#nineth-bullet)
10. [References](#tenth-bullet)


## 1. Introduction <a class="anchor" id="first-bullet"></a>

Seafood supplied by the global fishing industry accounts for about 12% of the world's protein intake (Sans, P. and Combris, P., 2015)(What does the world eat? - Sustainable Fisheries UW, 2022). Demand for seafood is increasing as it becomes more affordable in developing nations, and consumption of exotic fish is on the rise in developed countries. The fishing industry also generates employment for millions of people. The resulting pressure on the industry leads to overfishing, poaching, human rights abuse and other illegal activities. Overfishing in particular contributes greatly to the decline of many fish stocks, especially for species that are in high demand or rare. For this reason, many governments and organisations such as Global Fishing Watch (GFW), Google, Skytruth and Spire are creating tools to actively monitor fishing activities, generating public awareness and helping policymakers take action against illegal activities.
<br><br>
In this project, we try to understand and predict the fishing activity of different vessel types across the globe, using anonymized open-source data provided by GFW. We apply multiple machine learning (ML) algorithms and techniques to predict fishing activity, and tune the models to improve the accuracy of their predictions so that they can be applied to new data. Finally, we discuss how the different ML techniques compare.

## 2. Business Understanding <a class="anchor" id="second-bullet"></a>
Global Fishing Watch, Google, Skytruth, Oceana and many governments, including those of the USA and Europe, are working together to build a system for gathering information on the fishing industry, in order to address issues such as illegal fishing, human trafficking and fish stock estimation. To this end, GFW created a platform that compiles data on fishing vessels' movement and activity by monitoring VMS (Vessel Monitoring System) and AIS (Automatic Identification System) signals. GFW also provides a web platform to visualise historical fishing and non-fishing vessel movements at sea (GFW | Map, 2022), and collects AIS data and satellite imagery from different satellite providers. Google has partnered with GFW to run machine learning models that predict fishing activity from these data sources and compile the results into an open-source database on vessels. The data for this project is downloaded from the GFW open-source database (Global Fishing Watch | Data download portal, 2022).

In this project, the data comprises anonymized AIS records of fishing vessels such as trawlers, drifting longliners, purse seiners and trollers. However, only the trawler and drifting longliner data are used, since the features are the same in all data files and the machine's performance bottleneck prevents using all of the files. The same models can be used with the other files after minor data cleaning.

Overall, the objective of this project is to identify whether a vessel is fishing or not using ML models.



## 3. Data understanding <a class="anchor" id="third-bullet"></a>

The data is downloaded from the GFW [website](https://globalfishingwatch.org/data-download/) (Global Fishing Watch | Data download portal, 2022) and contains seven files separated by vessel type. In this section, the data is loaded, visualised and analysed. For simplicity, the trawlers.csv file is chosen, since all files have the same features.

First, let's import pandas and numpy to load the CSV files.

In [None]:
import numpy as np
import pandas as pd
from geopy import distance
from global_land_mask import globe
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
plt.rcParams['figure.figsize'] = [6, 2]

### 3.1 Load Data from datapath <a class="anchor" id="ld"></a>
I have uploaded the data and the code to [Github](https://github.com/raymond571/ML-assignment) in compressed format. Please unzip the file and place it under the data folder in the project directory.
<br>Note: for drifting_longline, use the same notebook and change the load line to df = pd.read_csv(datapath+drifting_longlines).

In [None]:
datapath = "data/0dab1200-c004-11ec-8a45-f167084fd93d/"
purse_seines = "purse_seines.csv"
unknown = "unknown.csv"
trollers = "trollers.csv"
trawlers = "trawlers.csv"
pole_and_line = "pole_and_line.csv"
fixed_gear = "fixed_gear.csv"
drifting_longlines = "drifting_longlines.csv"

pd.set_option('display.float_format', lambda x: '%0.4f' % x)
df = pd.read_csv(datapath+purse_seines)

In [None]:
df.shape

### 3.2 Description of attributes in the data file: <a class="anchor" id="desc"></a>
As we can see from above, there are 9 attributes besides 'source', which is not used since it only tells us which organisation validated the data.<br>
Also, all the data types are floats by default from the database, so there is no need for any data type conversion.
<br> 

* mmsi: Anonymized vessel identifier
* timestamp: Unix timestamp
* distance_from_shore: Distance from shore (meters)
* distance_from_port: Distance from port (meters)
* speed: Vessel speed (knots)
* course: Vessel course
* lat: Latitude in decimal degrees
* lon: Longitude in decimal degrees
* is_fishing: Label indicating fishing activity.
    * `0` = Not fishing
    * `>0` = Fishing. Data values between 0 and 1 indicate the average score for the position if scored by multiple people.
    * `-1` = No data
* source: The training data batch. Data was prepared by GFW, Dalhousie, and a crowdsourcing campaign. False positives are marked as false_positives.
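
Because `is_fishing` mixes a missing-data code (-1) with averaged scores between 0 and 1, it can help to see how the scores might be reduced to a binary label. The following is a minimal sketch and not part of the original notebook: the sample values are made up, the helper name `binarize_is_fishing` is hypothetical, and the 0.5 cut-off is an assumption.

```python
import pandas as pd

# Made-up sample of is_fishing scores (not taken from the GFW files)
labels = pd.Series([-1.0, 0.0, 0.25, 0.5, 0.75, 1.0], name="is_fishing")


def binarize_is_fishing(scores, threshold=0.5):
    """Drop unlabeled rows (-1) and map averaged scores to 0 or 1."""
    labeled = scores[scores != -1]
    # Scores at or above the (assumed) threshold count as fishing
    return (labeled >= threshold).astype(int)


binary = binarize_is_fishing(labels)
print(binary.tolist())
```

A different threshold changes how positions scored by several people with mixed votes are treated, so the choice should be checked against the data.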

### 3.3 Visualization <a class="anchor" id="vs"></a>
Correlation heatmap of attributes.

In the heatmap above we can observe a high correlation between distance_from_port and distance_from_shore. We will therefore drop distance_from_shore, since the two attributes have similar values and meaning. Also, we can observe mmsi
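
The heatmap itself is not reproduced here. As a rough sketch of the idea, assuming synthetic data in place of the GFW file and an arbitrary 0.9 cut-off (both are assumptions, not values from the notebook), the correlation matrix can be computed with pandas and one column of a highly correlated pair dropped:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
port = rng.uniform(0, 100_000, size=500)

# Synthetic stand-in: distance_from_shore is built to track distance_from_port
demo = pd.DataFrame({
    "distance_from_port": port,
    "distance_from_shore": port * 0.9 + rng.normal(0, 1_000, size=500),
    "speed": rng.uniform(0, 12, size=500),
})

corr = demo.corr()  # Pearson correlation between every pair of columns
pair_corr = corr.loc["distance_from_port", "distance_from_shore"]

# sns.heatmap(corr, annot=True) would draw this matrix as a heatmap

if pair_corr > 0.9:  # assumed cut-off for "highly correlated"
    demo = demo.drop(columns=["distance_from_shore"])

print(round(pair_corr, 3), list(demo.columns))
```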

## 4. Data Preparation <a class="anchor" id="fourth-bullet"></a>



### 4.1 Analysing is_fishing attribute <a class="anchor" id="fa"></a>
is_fishing = -1; not available <br>
is_fishing = 0; not fishing <br>
is_fishing > 0; possibility of fishing activity <br> <br>
Let's find out which unique values of is_fishing are captured.
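
The cell that lists the unique values is not included in this export. A minimal sketch of how such a check could look, using a small made-up frame rather than the real `df`:

```python
import pandas as pd

# Made-up sample; in the notebook this would be the loaded df
sample = pd.DataFrame({"is_fishing": [-1.0, -1.0, 0.0, 0.0, 0.0, 0.25, 1.0, 1.0]})

# Distinct label values, sorted for readability
unique_values = sorted(sample["is_fishing"].unique())

# How many records carry each label value
counts = sample["is_fishing"].value_counts().sort_index()

print(unique_values)
print(counts)
```

Looking at the counts as well as the distinct values shows how many records would be lost by dropping the -1 rows.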

### 4.2 Data cleaning <a class="anchor" id="dc"></a>
There are 7 unique values. We do not need -1, since it carries no information on fishing activity, so we will remove those records. Additionally, we create new columns from timestamp (year, month, day and hour), since timestamp is a float that does not provide much information to the ML models. We also filter out records with null or empty values.

In [None]:
# create a datetime column for convenience
df["datetime"] = pd.to_datetime(df['timestamp'],unit='s')
df = df.dropna()
# drop records without a fishing label; copy so new columns are added to an independent frame
df = df[df["is_fishing"] != -1].copy()
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["hour"] = df["datetime"].dt.hour
df["day"] = df["datetime"].dt.day
df = df.drop_duplicates()

This cleaning step handles a special case that I found while plotting the data on a geographic map. The mmsi value 186746307373264 has false locations on land rather than in the ocean, so removing this mmsi from the data will reduce errors in prediction.

In [None]:
# find a location coordinate is in land or ocean
def land(row):
    if globe.is_land(row["lat"],row["lon"]):
        return 1
    else:
        return 0        
df["land"]= df.apply(land,axis=1)

As mentioned earlier, we will remove distance_from_shore

In [None]:
# data for fishing vessel location in ocean
ocean_cover = df[df["land"]==0].copy()
ocean_cover.drop(['land','distance_from_shore'], axis=1, inplace=True)

fishing_vessels = ocean_cover.copy()

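# distance (km) between consecutive positions of the same vessel (same mmsi);
# relies on column positions 0 = mmsi, 5 = lat, 6 = lon (column order of section 3.2 without distance_from_shore)
def get_dist(dataframe):
    dist = [0]  # the first record has no previous position
    for i in range(len(dataframe)-1):
        if dataframe.iloc[i+1,0] == dataframe.iloc[i,0]:
            cod1 = (dataframe.iloc[i,5],dataframe.iloc[i,6])
            cod2 = (dataframe.iloc[i+1,5],dataframe.iloc[i+1,6])
            dist.append( distance.geodesic(cod1, cod2).km )
        else:
            dist.append(0)  # a new vessel starts here
    return dist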

# Group by mmsi and sort with date to get the distance
grp_by = fishing_vessels.sort_values("datetime").groupby("mmsi")
pddf = grp_by.apply(lambda x: x) 
pddf["dist"] = get_dist(pddf)
fishing_vessels = pddf.copy()

fishing_vessels.shape

We have 175320 rows of data to train and test our ML models.
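
The loop-based `get_dist` above calls `distance.geodesic` once per row, which can be slow on large files. As an illustrative alternative, the sketch below computes the same kind of per-vessel step distance in a vectorised way. It swaps in the haversine (spherical) formula instead of geopy's geodesic, so results will differ slightly; the helper names `haversine_km` and `add_step_distance` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius used by the spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_step_distance(frame):
    """Add 'dist': km travelled since the previous record of the same mmsi."""
    out = frame.sort_values(["mmsi", "datetime"]).copy()
    # Previous position within each vessel; NaN for a vessel's first record
    prev = out.groupby("mmsi")[["lat", "lon"]].shift(1)
    out["dist"] = haversine_km(prev["lat"], prev["lon"], out["lat"], out["lon"]).fillna(0.0)
    return out


# Made-up records: vessel 1 moves one degree of longitude along the equator
demo = pd.DataFrame({
    "mmsi": [1, 1, 2],
    "datetime": pd.to_datetime(["2020-01-01 01:00", "2020-01-01 00:00", "2020-01-01 00:00"]),
    "lat": [0.0, 0.0, 10.0],
    "lon": [1.0, 0.0, 10.0],
})
result = add_step_distance(demo)
print(result[["mmsi", "datetime", "dist"]])
```

Because the columns are addressed by name, this version does not depend on the column order the way the positional `iloc` lookups do.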

In [None]:
# fishing vessel error location in land
land_cover = df[df["land"]==1].copy()
land_cover.drop(['land','distance_from_shore'], axis=1, inplace=True)
grp_by = land_cover.sort_values("datetime").groupby("mmsi")
pddf = grp_by.apply(lambda x: x)
pddf["dist"] = get_dist(pddf)
pddf.shape

In [None]:
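dataf = pddf.copy()
def get_dist2(dataframe):
    # collect records of the same vessel with dist > 1500 km and a time gap with .days <= 1
    global dataf
    rows = []
    for i in range(len(dataframe)-1):
        same_vessel = dataframe.iloc[i+1,0] == dataframe.iloc[i,0]
        within_day = (dataframe.iloc[i+1,9] - dataframe.iloc[i,9]).days <= 1
        if within_day and dataframe.iloc[i,14] > 1500 and same_vessel:
            rows.append(dataframe.iloc[i])
    # DataFrame.append was removed in pandas 2.0, so concatenate the collected rows once
    if rows:
        dataf = pd.concat([dataf, pd.DataFrame(rows)])
    print(len(rows))
get_dist2(fishing_vessels)

# fishing_vessels[fishing_vessels["dist"]>1500].count()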

In [None]:
dataf = dataf.drop_duplicates()  # assign the result; drop_duplicates does not modify in place
# dataf.to_csv(datapath+"purse_seine_land.csv")
dataf.shape

In [None]:
# fishing_vessels.hist(layout=(5,3), figsize=(15,15), bins = 100)
# fishing_vessels["day"].hist(figsize=(6,2), bins = 100)
# note: the filter is_fishing == 0 selects the non-fishing records
fishing_vessels[fishing_vessels["is_fishing"]==0]["day"].hist(figsize=(6,2), bins = 100)
plt.title("Distribution of non-fishing records in a month", loc = 'left')
plt.xlabel("day")
plt.ylabel("Frequency")
plt.show()

fishing_vessels[fishing_vessels["is_fishing"]==0]["month"].hist(figsize=(6,2), bins = 100)
plt.title("Distribution of non-fishing records in a year", loc = 'left')
plt.xlabel("month")
plt.ylabel("Frequency")
plt.show()


In [None]:
del grp_by, pddf, dataf

Visualising the cleaned data with pair plot

### 4.3 Vessel movements  <a class="anchor" id="vm"></a>
Using geographic plotting with plotly, we can see the movement of different vessels on the sea over time.

Below is the special case data cleaning for drifting_longline data file. Please enable this code while loading drifting_longline

## 5. Modelling <a class="anchor" id="fifth-bullet"></a>

In this section we train ML models and evaluate the algorithms' performance metrics.
Machine learning models used:
1. Random Forest Classifier/Regressor
2. K-nearest neighbours
3. Naive Bayes
4. Logistic Regression
5. Neural Networks

Setting the features and target used by the models.

In [None]:
features = ["mmsi","timestamp","distance_from_port","speed","course","lat","lon","month","hour","day","dist"]
target = ["is_fishing"]

Importing machine learning, metrics and validation libraries from scikit-learn.<br>
Loading features and target into variables.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc

from sklearn import metrics
from sklearn import preprocessing
from sklearn import utils

X = fishing_vessels[features]
y = fishing_vessels[target]
lab = preprocessing.LabelEncoder()

### 5.1 Random forest <a class="anchor" id="rf"></a>

To optimize the random forest algorithm we have to find the optimal parameters.<br>

1. Find the optimal n_estimators value (number of decision trees) of the random forest.
2. Find the optimal max_depth (maximum depth of each tree).
3. Find the optimal min_samples_split (minimum number of samples required to split a node).

1. Find n_estimator.

In [None]:
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
y_transformed = lab.fit_transform(y.values.flatten())
X_train, X_test, y_train, y_test = train_test_split(X,y_transformed, test_size=0.25)

a=[]
for estimator in n_estimators:
    clf=RandomForestClassifier(n_estimators=estimator,min_samples_split=2,n_jobs=-1)
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    a.append(clf.score(X_test, y_test))
    
plt.title("Optimal n-estimators", loc = 'left')
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
line = plt.plot(n_estimators, a)
plt.show()

From the above graph we can observe that the optimal n_estimators is 65. Note that the following cells use 100.

In [None]:
n_estimator = 100

2. Find max_depth.

In [None]:
max_depths = [1, 2, 4, 8, 16, 32, 64, 100]
y_transformed = lab.fit_transform(y.values.flatten())
X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.25)

a=[]
for depth in max_depths:
    clf=RandomForestClassifier(n_estimators=n_estimator,max_depth=depth,min_samples_split=2,n_jobs=-1)
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    a.append(clf.score(X_test, y_test))
    
plt.title("Optimal Max depth", loc = 'left')
plt.xlabel("max depth")
plt.ylabel("Accuracy")
line = plt.plot(max_depths, a)
plt.show()

From the above graph we can conclude that 35 is the optimal max_depth. Note that the following cells use 15.

In [None]:
depth = 15

3. Find min_samples_split.

In [None]:
splits = [2, 4, 8, 16]
y_transformed = lab.fit_transform(y.values.flatten())
X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.25)

a=[]
for split in splits:
    clf=RandomForestClassifier(n_estimators=n_estimator,max_depth=depth,min_samples_split=split,n_jobs=-1)
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    a.append(clf.score(X_test, y_test))

plt.title("Optimal Tree Node Split", loc = 'left')
plt.xlabel("Number of splits")
plt.ylabel("Accuracy")
line = plt.plot(splits, a)
plt.show()

In [None]:
split = 2
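
The three one-at-a-time sweeps above each hold the other parameters fixed. As an illustrative alternative, and not what the notebook does, the parameters could be searched jointly with `GridSearchCV`. The sketch below assumes synthetic data from `make_classification` and a deliberately small grid so that it runs quickly; the grid values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the vessel features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Placeholder grid; a real search would use wider ranges
param_grid = {
    "n_estimators": [16, 64],
    "max_depth": [4, 16],
    "min_samples_split": [2, 8],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

A joint search scores every combination with cross validation, so it costs more compute than three separate sweeps but does not assume the parameters are independent.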

Using the optimal parameters, we can train the model and cross validate the predictions using stratified KFold splits to split the datasets.

In [None]:
y_transformed = lab.fit_transform(y.values.flatten())
skf = StratifiedKFold(n_splits=6,random_state=None,shuffle=True)
# skf = RepeatedKFold(n_splits=6, n_repeats=2, random_state=42)

a = []
for train_index,test_index in skf.split(X,y_transformed):
    print(f"Train:{train_index} Test:{test_index}")
    X_train,X_test = X.iloc[train_index], X.iloc[test_index]
    y_train,y_test = y_transformed[train_index], y_transformed[test_index]
    clf=RandomForestClassifier(n_estimators=n_estimator,max_depth=depth,min_samples_split=split,n_jobs=-1)
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    a.append(clf.score(X_test, y_test))
    
# average the per-fold accuracies (np.mean avoids shadowing the built-in sum)
print(f"mean score from K-fold cross validation: {np.mean(a)}")

The results of the evaluations will be discussed in Evaluation section.

cross_val_score is also used to find the accuracy.

In [None]:
accuracy = cross_val_score(clf, X, y_transformed, scoring='accuracy', cv = 6)
print(accuracy)

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

print(classification_report(y_test, y_pred))

We can see the results in the confusion matrix and the classification report. The accuracy score is 98% using random forest with cross validation.
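
To make the link between the confusion matrix and the classification report concrete, the sketch below derives accuracy and per-class precision and recall by hand. The matrix is made up for illustration (it is not the notebook's output) and follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([
    [50, 2, 0],
    [3, 30, 1],
    [0, 4, 10],
])

correct = np.diag(cm_demo)                 # correctly classified count per class
accuracy = correct.sum() / cm_demo.sum()   # share of all records classified correctly
precision = correct / cm_demo.sum(axis=0)  # of records predicted as class k, share truly k
recall = correct / cm_demo.sum(axis=1)     # of records truly class k, share predicted as k

print(f"accuracy: {accuracy:.3f}")
print("precision:", np.round(precision, 3))
print("recall:", np.round(recall, 3))
```

Overall accuracy can stay high while recall for a rare class is much lower, which is why the per-class rows of the report are worth reading alongside the headline score.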

Now we can visualise the feature importance of the random forest classifier. Note that, based on the feature importance from previous runs, I removed the year column since it had low importance.

In [None]:

feature_imp = pd.Series(clf.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
feature_imp

In [None]:
# %matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

We can observe that speed has high importance. In contrast, vessel course has low importance in the model.

### 5.2 K-Nearest Neighbors <a class="anchor" id="knn"></a>

To find the optimal number of neighbors (n_neighbors), I have used GridSearch.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

y_transformed = lab.fit_transform(y.values.flatten())

X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.25)

parameters = {"n_neighbors": range(1, 50,)}
gridsearch = GridSearchCV(KNeighborsClassifier(),  parameters,cv=2,n_jobs=-1)
gridsearch.fit(X_train, y_train)
gridsearch.best_params_

The optimal number of neighbors is 2. Note that the following cell sets n_neighbors = 1.
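
KNN compares records by the distance between their feature vectors, so columns with very large numeric ranges (for example a Unix timestamp) can dominate that distance. One option that may be worth trying, which is not part of the original notebook, is to standardise the features inside a pipeline before the grid search. The sketch below assumes synthetic data, with one column inflated to mimic a large-valued feature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column 0 is inflated to mimic a timestamp-sized feature
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_demo[:, 0] *= 1e6

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(pipe, {"knn__n_neighbors": range(1, 11)}, cv=3)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Whether scaling changes the result on the GFW data has not been checked here.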

In [None]:
n_neighbors = 1

Now, using stratified KFold data splits, we train the model and predict.

In [None]:
y_transformed = lab.fit_transform(y.values.flatten())

X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.25,random_state=109)
#skf = KFold(n_splits=5,random_state=None,shuffle=True)
# skf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
skf = StratifiedKFold(n_splits=6,random_state=None,shuffle=True)
a = []
for train_index,test_index in skf.split(X,y_transformed):
    print(f"Train:{train_index} Test:{test_index}")
    X_train,X_test = X.iloc[train_index], X.iloc[test_index]
    y_train,y_test = y_transformed[train_index], y_transformed[test_index]
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbors,n_jobs=-1)
    knn_model.fit(X_train, y_train)
    y_pred=knn_model.predict(X_test)
    a.append(knn_model.score(X_test, y_test))
    print()
# average the per-fold accuracies (np.mean avoids shadowing the built-in sum)
print(f"mean score from K-fold cross validation: {np.mean(a)}")

In [None]:
knn_model.score(X_test,y_test)

With 2 neighbors, the KNN model reaches an accuracy score of 95%. However, the mean of the cross-validation scores is 60%. We will discuss this in detail in the Evaluation section.

In [None]:
# score the KNN model here (clf is the random forest from section 5.1)
accuracy = cross_val_score(knn_model, X, y_transformed, scoring='accuracy', cv = 2)
print(accuracy)

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

print(classification_report(y_test, y_pred))

### 5.3 Gaussian Naive Bayes <a class="anchor" id="gnb"></a>

Using stratified KFold, we split the data to train the model and predict.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB


y = fishing_vessels[target]
y_transformed = lab.fit_transform(y.values.flatten())
# y[target].apply(lambda x: 1 if x.is_fishing>=0.5 else 0, axis=1)  # result was never assigned, so it had no effect
X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.25,random_state=109)
#skf = KFold(n_splits=5,random_state=None,shuffle=True)
# skf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)
skf = StratifiedKFold(n_splits=6,random_state=None,shuffle=True)
a = []
for train_index,test_index in skf.split(X,y_transformed):
    print(f"Train:{train_index} Test:{test_index}")
    X_train,X_test = X.iloc[train_index], X.iloc[test_index]
    y_train,y_test = y_transformed[train_index], y_transformed[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train,y_train)
    y_pred=gnb.predict(X_test)
    a.append(gnb.score(X_test, y_test))
    print()
# average the per-fold accuracies (np.mean avoids shadowing the built-in sum)
print(f"mean score from K-fold cross validation: {np.mean(a)}")


The mean score is 61%.

The accuracy scores from cross_val_score (cv = 2) are also shown below.

In [None]:
accuracy = cross_val_score(gnb, X, y_transformed, scoring='accuracy', cv = 2)
print(accuracy)

Confusion matrix and classification report.

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

print(classification_report(y_test, y_pred))

### 5.4 Logistic Regression <a class="anchor" id="lr"></a>

Using a stratified KFold data split to train the model and predict.

In [None]:
from sklearn.linear_model import LogisticRegression
X = fishing_vessels[features]
y = fishing_vessels[target]
# y[target].apply(lambda x: 1 if x.is_fishing>=0.5 else 0, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y.values.flatten(), test_size=0.25,random_state=109)
#skf = KFold(n_splits=5,random_state=None,shuffle=True)
# skf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
skf = StratifiedKFold(n_splits=6,random_state=None,shuffle=True)
a = []
for train_index,test_index in skf.split(X,y_transformed):
    print(f"Train:{train_index} Test:{test_index}")
    X_train,X_test = X.iloc[train_index], X.iloc[test_index]
    y_train,y_test = y_transformed[train_index], y_transformed[test_index]
    lr = LogisticRegression(multi_class='multinomial',solver='lbfgs',max_iter=20000)
    lr.fit(X_train,y_train)
    y_pred=lr.predict(X_test)
    a.append(lr.score(X_test, y_test))
    print()
# average the per-fold accuracies (np.mean avoids shadowing the built-in sum)
print(f"mean score from K-fold cross validation: {np.mean(a)}")

The mean accuracy score is 64%. However, the model has a convergence issue even when the solver and max_iter are changed.
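
A commonly suggested remedy for convergence problems with gradient-based solvers such as lbfgs is to standardise the features first, since columns on very different scales make the optimisation harder. The sketch below illustrates this on synthetic data with one timestamp-sized column. It is an assumption that this applies to the GFW features, and it has not been run against them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column 0 mimics a Unix-timestamp-sized feature
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=2)
X_demo[:, 0] = X_demo[:, 0] * 1e6 + 1.5e9

# Standardise, then fit logistic regression with the default lbfgs solver
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used
iterations = int(model.named_steps["logisticregression"].n_iter_[0])
train_accuracy = model.score(X_demo, y_demo)
print(iterations, round(train_accuracy, 3))
```

If the iteration count stays well below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.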

The accuracy scores of 6-fold cross validation.

In [None]:
accuracy = cross_val_score(lr, X, y_transformed, scoring='accuracy', cv = 6)
print(accuracy)

Confusion matrix and classification report.

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

print(classification_report(y_test, y_pred))

### 5.5 Neural Networks <a class="anchor" id="nn"></a>

I have used a sequential neural network model to do regression on the data.

Importing tensorflow-gpu libraries for training neural networks.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.losses import MeanSquaredLogarithmicError
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

Set the tensorflow session and enable GPU

In [None]:
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))

List all CPUs and GPUs

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
# tf.debugging.set_log_device_placement(True)

Transform the feature data with StandardScaler to help fit the model

In [None]:
X = fishing_vessels[features]
y = fishing_vessels[target]
# y[target].apply(lambda x: 1 if x.is_fishing>0 else 0, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y.values.flatten(), test_size=0.25,random_state=109)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=109)
def scale_datasets(x_train, x_test):
    # fit the scaler on the training split only, then apply it to both splits;
    # use the function arguments rather than the globals X_train / X_test
    standard_scaler = StandardScaler()
    x_train_scaled = pd.DataFrame(standard_scaler.fit_transform(x_train),columns=x_train.columns)
    x_test_scaled = pd.DataFrame(standard_scaler.transform(x_test),columns = x_test.columns)
    return x_train_scaled, x_test_scaled
x_train_scaled, x_test_scaled = scale_datasets(X_train, X_test)

3 dense layers are used in the neural network and the model is built

In [None]:
hidden_units1 = 160
hidden_units2 = 480
hidden_units3 = 256
learning_rate = 0.01
# Creating model using the Sequential in tensorflow
def build_model_using_sequential():
  model = Sequential([
    Dense(hidden_units1, kernel_initializer='normal', activation='sigmoid'),
    Dropout(0.2),
    Dense(hidden_units2, kernel_initializer='normal', activation='sigmoid'),
    Dropout(0.2),
    Dense(hidden_units3, kernel_initializer='normal', activation='sigmoid'),
    Dense(1, kernel_initializer='normal', activation='linear')
  ])
  return model
# build the model
model = build_model_using_sequential()

Loss function is set and model is trained.

In [None]:
# loss function
msle = MeanSquaredLogarithmicError()
model.compile(
    loss=msle, 
    optimizer=Adam(learning_rate=learning_rate), 
    metrics=['accuracy',msle]
)
# train the model
history = model.fit(
    x_train_scaled.values, 
    y_train, 
    epochs=10, 
    batch_size=64,
    validation_split=0.2,
    verbose=1
)
accuracy = model.evaluate(x_test_scaled, y_test)
print(accuracy)

In [None]:
def plot_history(history, key):
  plt.plot(history.history[key])
  plt.plot(history.history['val_'+key])
  plt.xlabel("Epochs")
  plt.ylabel(key)
  plt.legend([key, 'val_'+key])
  plt.show()
# Plot the history
# binary_crossentropy
# mean_squared_logarithmic_error
plot_history(history, 'mean_squared_logarithmic_error')

In [None]:
scores = model.evaluate(x_test_scaled, y_test, verbose=1)
scores

The error is low at 4 epochs. However, the curve is not steady, and the model is not as good as the other ML models used above.

## 6. Evaluation <a class="anchor" id="sixth-bullet"></a>

1. Random Forest<br>
    The optimal parameters were found one at a time, and the model was then evaluated with a stratified KFold data split, giving a mean accuracy of 98%. However, when shuffle is set to false the model performs very badly, scoring 58%. This suggests that shuffling balances the data across folds and lets the model produce better predictions.

2. K-Nearest Neighbours<br>
    The optimal number of neighbors was generated by GridSearch. With stratified KFold data splits the model gives a mean accuracy score of 60%, whereas the accuracy score on the test data is 95%. Like the other models, it performs poorly when shuffle is set to false.

3. Naive Bayes<br>
    The model is evaluated using a stratified KFold data split, and the mean accuracy score is 61%.
4. Logistic Regression<br>
    The model generates a mean accuracy score of 64% using the stratified KFold data split technique. However, the model did not converge as expected. I tried increasing max_iter and changing the solver to 'saga', but the model still does not converge, so this model may not be accurate.
5. Neural Networks<br>
    The model is built with 3 dense layers, but the predictions and the mean errors are not consistent from one run to the next.
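
The gap between shuffled and unshuffled folds may be related to how records are ordered: the prepared frame is grouped by mmsi, so unshuffled folds tend to contain different vessels from the ones used for training. One way to measure performance on vessels the model has never seen, offered here only as a sketch and not something the notebook does, is scikit-learn's `GroupKFold` with mmsi as the group. The vessel ids and labels below are made up.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up data: six vessels with five records each
mmsi_demo = np.repeat([101, 102, 103, 104, 105, 106], 5)
X_demo = np.arange(len(mmsi_demo)).reshape(-1, 1)
y_demo = np.tile([0, 1, 0, 1, 1], 6)

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in gkf.split(X_demo, y_demo, groups=mmsi_demo):
    # Vessels that appear on both sides of the split (should be none)
    shared = set(mmsi_demo[train_idx]) & set(mmsi_demo[test_idx])
    overlaps.append(len(shared))

print(overlaps)
```

Scores from a grouped split answer a different question (how well the model transfers to new vessels) than scores from shuffled stratified folds, so the two are not directly comparable.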

## 7. Deployments <a class="anchor" id="seventh-bullet"></a>


All models are deployed on my machine, using the GPU for the neural network (TensorFlow) and parallel processing via n_jobs=-1 for the scikit-learn models. Only the neural network workload was high, due to the dense layers.

## 8. Discussion <a class="anchor" id="eighth-bullet"></a>

* The random forest model produces confident predictions because tuning the algorithm kept the effect of data imbalance low. With default parameters, the model gives a lower accuracy score of 94%.
* The KNN algorithm was the quickest of all the algorithms, using n_jobs=-1 for parallel threads. Its accuracy score is the second best of the five models. However, under cross validation the accuracy score drops to 60%. This may be because the model overfits, so it cannot be relied on in real-time use.
* Naive Bayes produced a 61% accuracy score, the fourth best of the five models. Naive Bayes was also quick, although iterating with the KFold technique increased the overall time.
* The Logistic Regression model produced a 64% accuracy score, the third best of the five models. However, the gradient descent did not work as expected, and even after tweaking the parameters I was not able to fix the model. Setting max_iter to 4000 might work, but the process kept running. This model is therefore not trustworthy for predicting fishing activity.
* Neural networks produced widely varying results across runs. The network is also large for a regression task and relies on GPU performance. However, the process waits and halts, making it time consuming to train and tweak. In future, moving the workload to cloud machines may help tune the model towards consistent accuracy.

## 9. Results <a class="anchor" id="nineth-bullet"></a>

After experimenting with five machine learning algorithms, the Random Forest model clearly performs best, with 98% accuracy under cross validation. The future scope of this project is to classify the vessel type from the data and find possible illegal fishing activities.

## 10. References <a class="anchor" id="tenth-bullet"></a>



1. Sans, P. and Combris, P., 2015. World meat consumption patterns: An overview of the last fifty years (1961–2011). Meat Science, 109, pp.106-111.
2. Globalfishingwatch.org. 2022. GFW | Map. [online] Available at: <https://globalfishingwatch.org/map/> [Accessed 29 April 2022].
3. Global Fishing Watch | Data download portal. 2022. Global Fishing Watch | Data download portal. [online] Available at: <https://globalfishingwatch.org/data-download/> [Accessed 29 April 2022].
4. Sustainable Fisheries UW. 2022. What does the world eat? - Sustainable Fisheries UW. [online] Available at: <https://sustainablefisheries-uw.org/seafood-101/what-does-the-world-eat/> [Accessed 29 April 2022].
5. Kroodsma, D.A., Mayorga, J., Hochberg, T., Miller, N.A., Boerder, K., Ferretti, F., Wilson, A., Bergman, B., White, T.D., Block, B.A. and Woods, P., 2018. Tracking the global footprint of fisheries. Science, 359(6378), pp.904-908.
6. de Souza, E.N., Boerder, K., Matwin, S. and Worm, B., 2016. Improving fishing pattern detection from satellite AIS using data mining and machine learning. PLoS ONE, 11(7), p.e0158248.

Tutorial Websites:

* https://realpython.com/knn-python/
* https://www.datacamp.com/community/tutorials/random-forests-classifier-python
* https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d
* https://www.analyticsvidhya.com/blog/2021/05/4-ways-to-evaluate-your-machine-learning-model-cross-validation-techniques-with-python-code/
* https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right
* https://neptune.ai/blog/how-to-deal-with-imbalanced-classification-and-regression-data
* https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
* https://www.analyticsvidhya.com/blog/2021/08/a-walk-through-of-regression-analysis-using-artificial-neural-networks-in-tensorflow/
* https://towardsdatascience.com/is-a-trawler-fishing-modelling-the-global-fishing-watch-dataset-d1ffb3e7624a

