# PART 1 - Machine Learning

### Task 1.1 - Data Preparation
There are 100 rows and 11 dimensions in the dataset. The data is 100% dense. Feature labels were added.

Task is to predict the number of containers a ship can carry. 
The target is known hence this is supervised learning.

From the inputs we want to get a output of the weighted sum : 𝑦 = 𝑓(𝑥𝑤)

- Import necessary modules and Read data,add feature names
- Call the head method to get a general overview of the data

In [None]:
#Import required modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

cols = ["IMO_NO.", "Vessel_Name", "Year_Built", "Gross_Tonnage", "Deadweight_Tonnage", "Length", "Beam", "Capacity_(TEU)", "Forward_Bays", "Center_Bays", "Aft_Bays"]
data = pd.read_csv('containers.csv',names=cols)
data.head()



- **Explore the data**

Our first step is to summarize the DataFrame by  computing aggregations. We can do this by using the info method in Pandas. We can see that all data is non-null as expected and we have 8 numeric values, one target, and one String.

In [None]:
data.info()
data.describe()

**Finding outliers and inconsistent data** 

For each of these features, comparing the max and 75% values, we can start to see a huge difference in the Beam (Width) feature. This confirms that there may be an error with some of the tuples. 

In [None]:
print(data['IMO_NO.'].value_counts())

**Duplicate Vessel Identification Number (IMO)**

Now lets count the vessel identification number, which should be unique for the given dataset, however we can easily spot a duplicate IMO vessel. An online search shows that this vessel was renamed/purchased, we will leave in this duplicate vessel. IMO No. 9314947 in the dataset.

Let us investigate further.

Let’s have a quick look at the distribution of the Beam feature by plotting the histograms.

In [None]:

# for col in data.iloc[:,2:11]:
data['Beam'].hist(figsize=(5, 3), bins=30, edgecolor="black", )
plt.subplots_adjust(hspace=0.7, wspace=0.4)
plt.title('Beam (width distribution)')
plt.xlabel('Beam (width)')
plt.ylabel("Sum of Totals")
plt.show()

**Fix the outlier, incorrect data**

Here we can see one ship has a length and width of 300x290 meters. I've never seen a almost square vessel before, probably doesn't go very fast!  The outlier is identified as MSC Albany with IMO 9619438. Correct beam should be 48meters. (https://www.vesselfinder.com/vessels/details/9619438) Since we have good reason to believe this is factually incorrect data it would be appropriate to correct the Beam to 48 meters

In [None]:
# Find the outlier as index 32, width cannot be the same as the length of the ship.

print(data.loc[32])

- Clean the data

In [None]:
corrected_beam_MSC_Albany = 48
edited_data = data.copy()
edited_data.at[32,'Beam'] = corrected_beam_MSC_Albany
edited_data.loc[32]
edited_data


* **Split the data (training & testing)**

- 75% Training data (Default)
- 25% Testing data (Default)
  A high traininset set of 75% will prevent overfitting*
  underfitting is too simple

* Shuffle the data

By default train_test_split method shuffle the Dataframe randomly prior to splitting, hence we do not need to shuffle beforehand. Shuffling the data ensures that there are no patterns or structure in the order of the data that could *bias the results* of the model. It also ensures that both the training and testing dataset contains a good generalization of the model and is representative of the overall distribution of the vessel data.

In [None]:

from sklearn.model_selection import train_test_split

X = edited_data.copy() # Copy prevents mutation of the original dataset incase we need to revert changes.
y = edited_data['Capacity_(TEU)'].copy() # Prevents mutation.
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
X_Test_With_All_Columns = X_test.copy() # Keep a copy of the X_Test before dropping columns & Normalization below

# Log the number of training and testing data, you can see 75% for training,a nd 25% for testing.
print(X_train.shape, X_test.shape,y_train.shape,y_test.shape)



  * Drop features that are not required to train the model, such as the target variable and the IMO_NO & Vessel_Name in order to lower dimension. 


In [None]:
cols_to_drop = ['IMO_NO.','Vessel_Name','Capacity_(TEU)']
X_train.drop(columns=cols_to_drop, axis=1, inplace=True) #Dropped cols
X_test.drop(columns=cols_to_drop, axis=1, inplace=True) #Dropped Cols

print(len(X_train)) # 75% Training data
print(len(X_test)) # 25% Testing data


**Normalize using StandardScaler**

Now let us normalize the data to ensure that the values share a common scale, this will reduce complexity and optimize the data for machine learning. In this case we will use the StandardScaler

In [None]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaled_X_Train = scaler.fit_transform(X_train)
scaled_X_Test = scaler.fit_transform(X_test)
print(scaled_X_Train) # Prints the scaled Training data



### Our data is now prepared for Modeling

### TASK 1.2 REGRESSION

**Pipelines**

We will be training our model on 3 regression algorithm's mentioned below. In order to efficiently load the models and evaluate each of them. We can use *make_pipeline from sklearn.pipeline*.

- Random Forest (Decision Trees)
- Multi Layer Perceptron  (MLP)
- Support Vector Regression

In [None]:
# Load the required dependencies
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

* Set up pipelines for each algorithm

In [None]:
from sklearn.pipeline import make_pipeline

# Store all the pipeline inside a dictionary.
pipelines = {
    'Random_Forest': make_pipeline(RandomForestRegressor(random_state=0)),
    'Multi_Layer_Percepton' : make_pipeline(MLPRegressor(random_state=0)),
    'Support_Vector_Regression' : make_pipeline(SVR())
}


Each of the 3 algorithms contain tunable hyperparameters, instead of manually tuning parameter for our dataset, we can utilize sklearn GridSearchCV. 

GridSearchCV provides a exhaustive search *(2-10 mins depending on CPU power)* on our predefined parameters for each algorithm. This returns the best possible combination of hyperparameters for each of our 3 algorithms used.

In [None]:

#Set up a hyperparameter grid, the model will go through each of the parameters one by one in order to find the best convergence.

hyper_param_grid = {
    'Random_Forest': {
        'randomforestregressor__n_estimators':[50,100,200]
    },
    'Multi_Layer_Percepton' : {
        'mlpregressor__hidden_layer_sizes':[100],
        'mlpregressor__solver':['adam','lbfgs'],
        'mlpregressor__max_iter':[1000,10000,20000]
    },
    'Support_Vector_Regression': {
        'svr__kernel':['rbf','sigmoid'],

    }    
}

In [None]:
# import Grid Search CV

from sklearn.model_selection import GridSearchCV
from sklearn.exceptions import NotFittedError #Suppress warnings from stdout
import warnings #Import python warning package
from sklearn.exceptions import ConvergenceWarning # Disable Convergence Warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning,)

fit_model = {} #Dictionary that holds our models

for algo,pipeline in pipelines.items():
    try:
        model = GridSearchCV(pipeline,hyper_param_grid[algo], cv=10, n_jobs=1)
        print('Training started for',algo,'...')
        model.fit(scaled_X_Train,y_train)
        fit_model[algo] = model
        print (algo, 'has been fitted! 👏')
        print ("========================================")
    except NotFittedError as e:
        print ("Error detected")
        print(repr(e))

print("All Training has been completed!! 👏👏")

**TOP 10 CONTAINERS ORDERED BY PREDICTED CAPACITY**

In [35]:
predictions = {}
Top10 = X_Test_With_All_Columns.loc[:,cols_to_drop]
for algo,pipeline in pipelines.items():
   predictions[algo] = fit_model[algo].predict(scaled_X_Test)
   Top10[algo] = predictions[algo]

In [None]:
# print(Top10.sort_values('Random_Forest', ascending=False).reset_index(drop=True).head(10))
RandomForestSet = Top10.drop(["Multi_Layer_Percepton","Support_Vector_Regression"],axis=1).sort_values('Random_Forest', ascending=False).reset_index(drop=True).head(10)
MLPSet = Top10.drop(["Random_Forest","Support_Vector_Regression"],axis=1).sort_values('Multi_Layer_Percepton', ascending=False).reset_index(drop=True).head(10)
SVR = Top10.drop(["Random_Forest","Multi_Layer_Percepton"],axis=1).sort_values('Support_Vector_Regression', ascending=False).reset_index(drop=True).head(10)

In [36]:
RandomForestSet

Unnamed: 0,IMO_NO.,Vessel_Name,Capacity_(TEU),Random_Forest
0,9776418,CMA CGM ANTOINE DE SAINT EXUPERY,20776,23414.075
1,9695121,CSCL GLOBE,19100,20288.46
2,9454436,CMA CGM MARCO POLO,16022,18796.99
3,9869186,HMM GARAM,16010,16867.355
4,9728942,TAURUS,14354,15654.37
5,9467263,CSCL JUPITER,14074,15246.165
6,9467392,MSC BERYL,12400,13974.76
7,9612997,ANTWERPEN EXPRESS,13167,13962.58
8,9739680,MAERSK GENOA,10100,10180.66
9,9685334,MOL BRILLIANCE,10100,10081.775


Use R2 SCORE AND MAE
high r2 is better
low mae is better

In [37]:
MLPSet

Unnamed: 0,IMO_NO.,Vessel_Name,Capacity_(TEU),Multi_Layer_Percepton
0,9776418,CMA CGM ANTOINE DE SAINT EXUPERY,20776,24122.937263
1,9695121,CSCL GLOBE,19100,20258.303437
2,9869186,HMM GARAM,16010,19456.650494
3,9454436,CMA CGM MARCO POLO,16022,18437.310102
4,9728942,TAURUS,14354,16682.308123
5,9467263,CSCL JUPITER,14074,14674.559537
6,9612997,ANTWERPEN EXPRESS,13167,14037.431501
7,9739680,MAERSK GENOA,10100,13375.84818
8,9467392,MSC BERYL,12400,12903.598315
9,9685334,MOL BRILLIANCE,10100,12600.573498


In [38]:
SVR

Unnamed: 0,IMO_NO.,Vessel_Name,Capacity_(TEU),Support_Vector_Regression
0,9776418,CMA CGM ANTOINE DE SAINT EXUPERY,20776,8591.403943
1,9695121,CSCL GLOBE,19100,8586.345558
2,9869186,HMM GARAM,16010,8585.08941
3,9454436,CMA CGM MARCO POLO,16022,8583.338198
4,9728942,TAURUS,14354,8580.994141
5,9467263,CSCL JUPITER,14074,8576.421311
6,9612997,ANTWERPEN EXPRESS,13167,8574.979698
7,9467392,MSC BERYL,12400,8571.885834
8,9739680,MAERSK GENOA,10100,8554.684207
9,9685334,MOL BRILLIANCE,10100,8551.865549


In [None]:
from sklearn.metrics import r2_score, mean_absolute_error,mean_squared_error



In [None]:
for algo,model in fit_model.items():
    y_prediction = model.predict(X_test_Dropped)
    print(f'{algo}| R2 = {r2_score(y_test, y_prediction)} | MAE = {mean_absolute_error(y_test,y_prediction)} | MSE = {mean_squared_error(y_test,y_prediction)}')
    

In [None]:
# X_test
X_Test_Unscaled = pd.DataFrame(scaler.inverse_transform(X_test).astype(int))
X_Test_Unscaled.columns = ['Year_Built','Gross_Tonnage','Deadweight_Tonnage','Length','Beam','Forward_Bays','Center_Bays','Aft_Bays']
X_Test_Unscaled = pd.concat([X_Test_Unscaled,])
# print('edidt date lengt ', len(edited_data))
# addedBack = X_Test_Unscaled.join(edited_data.set_index(edited_data.index)[cols_to_drop]).reindex(columns=edited_data.columns)
# print(addedBack)
addedBack = pd.concat(X_edited_data.loc[X_Test_Unscaled.index, cols_to_drop],left_index=True,right_index=True)
addedBack
df_test_unnormalized = pd.concat([X_Test_Unscaled.reset_index(drop=True), edited_data.loc[X_test.index, ['A', 'C']].reset_index(drop=True), pd.DataFrame({'A_pred': y_pred_unnormalized})], axis=1)