[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/richard-cartwright/personal/blob/master/Fractal_IronOre.ipynb)

# SUMMARY

Q1. How much iron ore did Brazil export Jan-May 2017? 
- = ~125million DWT

Q2. Proportion of ships coming from South Africa, Indian Ocean, North Atlantic Ocean 
- = ~0.6

Q3. What features of the Capesize position history are helpful in predicting the behaviour of the index?
- If all ships far from shore = no ships available that day = low supply = higher price
- If all ships far from their destination = no ships available in coming days = low supply = high price
- If all ships low under water (high draft) = they're already carrying lots of weight = limited spare capacity = low supply = high price


# PACKAGES

## Installs

In [0]:
# to visualise ROC curve
# !pip install scikit-plot

# for read_excel
!pip install xlrd

# for extracting country from lat-lon
!pip install reverse_geocode

## Imports

In [0]:
# Basic imports, including ML libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import xlrd

import pprint
%matplotlib inline

# Setting plotting styles
plt.style.use('fivethirtyeight')
sns.set_style('white')

# Displays all cell's output, not just last output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Sklearn
# import scikitplot as skplt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures, Imputer
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, recall_score, brier_score_loss, f1_score

# Tensorflow & Keras
# import tensorflow as tf
# from tensorflow import keras
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense

# ENVIRON SET-UP


In [0]:
# Add GDrive to Colab environment

from google.colab import drive
drive.mount('/content/drive')

# Create path for data
path = '/content/drive/My Drive/Colab Notebooks/Personal/Fractal/Data/'

# View files in folder
!ls '/content/drive/My Drive/Colab Notebooks/Personal/Fractal/Data/'

In [0]:
# Extract data

fleet_reg_df = pd.read_csv(path+'fleet_reg.csv',
                           index_col='imo',
                           parse_dates=['built','last_update','launch_date','order_date','year','broken_up','created_at','updated_at'])\
                  .sort_index()

vessel_position_df = pd.read_csv(path+'vsl_pos_170101_170531.csv',
                                 index_col=['imo','date'],
                                 parse_dates=['date','eta'])\
                        .sort_index()

capesize_index_df = pd.read_csv(path+'bci_170101_170531.csv',
                                index_col='date',
                                parse_dates=['date'])\
                        .sort_index()\
                        .drop(columns=['name']) # Always 'BCI 5TC'

portlog_df = pd.read_excel(path+'portlog.xlsx',
                           index_col='name')\
                  .sort_index()

# INITIAL EDA

## fleet_reg_df

*A fleet register which identifies dry bulk cargo ships (name, deadweight = carrying capacity, age etc.) by their unique IMO (International Maritime Organization) number.*

In [0]:
# fleet_reg_df

# fleet_reg_df.info()

# x=0
# num_cols = 8
# while x<=38:
#   fleet_reg_df.iloc[:,x:x+num_cols].head(2)
#   x+=num_cols

## vessel_position_df

*Vessel position reports for Jan-May 2017. This is AIS (Automatic Identification System) satellite data. Column meanings: IMO, date of reading, latitude of ship, longitude of ship, speed (in knots), draft (in meters - the depth of the ship under water), indicated ETA, indicated destination (a free text field populated by the crew), indicated vessel status.*

In [0]:
# vessel_position_df

# vessel_position_df.head(2)
# vessel_position_df.info()

## capesize_index_df

*The Baltic Capesize Timecharter Index: this represents the average daily hire rate which Capesize vessels (deadweight > 120,000t) earned (in $/day) when agreeing that day to a new round-trip voyage from a given discharge port, to a load port, and back to a discharge port. This is the index against which Capesize freight futures settle.*

In [0]:
# capesize_index_df

# capesize_index_df.head(2)
# capesize_index_df.info()

## portlog_df

*A portlog which specifies the geographical coordinates (latitude minmax, longitude minmax) of areas containing some of the major world ports.*

In [0]:
# portlog_df

# portlog_df.head(2)
# portlog_df.info()

# Q1. How much iron ore did Brazil export Jan-May 2017? = ~125million DWT
*This is comprised of the cumulative deadweight of all large vessels (with a dwt > 60,000) that entered and then exited the following set of ports during the period: Ponta da Madeira, Tubarao, Ponta Ubu, Porto Acu, ItaguaiGuaiba.*

## Clean 'destination' data

In [0]:
# Clean 'destination' column: standardise name for the ports I care about

# Make 'destination' column: 1) string dtype; 2) lower case for better regex
vessel_position_df['destination'] = vessel_position_df['destination'].astype(str).apply(lambda dest: dest.lower())

# List of ports I care about
ports = ['madeira','tubarao','ubu','acu','itaguai','guaiba']

# List of unique unclean 'destination' for all vessel_position_df
destinations_raw = vessel_position_df['destination'].unique()

destinations_ports = {}
# Create dictionary of all raw unclean 'destination' for each port
for port in ports:
  destinations_ports[port] = [dest for dest in destinations_raw if port in dest]

# Make manual tweaks to extract just the names relevant to the port
destinations_ports['ubu'] = [dest for dest in destinations_ports['ubu'] 
                                  if 'aubu' not in dest 
                                  and 'lubu' not in dest
                                  ]
destinations_ports['acu'] = [dest for dest in destinations_ports['acu'] 
                                  if 'macu' not in dest 
                                  and 'jacu' not in dest
                                  and 'yacu' not in dest
                                  and 'sin_acu' not in dest
                                  and 'iacu' not in dest
                                  ]

# Display unclean port names
# for port in ports:
#   print(port,'\n')
#   destinations_ports[port]
#   print('\n')

# Replace unclean port names with cleaned
for port in ports:
  vessel_position_df['destination'] = vessel_position_df['destination'].replace(destinations_ports[port],port)

## Extract export movements data

In [0]:
# Create df of the previous and current destination of each vessel position
export_moment_df = pd.concat(axis=1, 
                             objs=[vessel_position_df['destination'].shift(1),
                                  vessel_position_df['destination']])
export_moment_df.columns = ['previous_destination','new_destination']

# ----------
# ISSUE: because of .shift(), last destination of the previous imo erroneously becomes the 'previous_destination' for the next imo
# Therefore refill these first 'previous_destination' with 'Unknown'

# Create series of first dates
first_dates = export_moment_df.reset_index().groupby('imo',as_index=False)['date'].first()

# Set previous destination of first dates as 'Unknown'
first_dates['previous_destination'] = 'Unknown'
first_dates.set_index(['imo','date'],inplace=True)

# Set first previous_destination as 'Unknown', instead of just the erroneous shifted destination
export_moment_df.loc[first_dates.index,'previous_destination'] = first_dates['previous_destination']


In [115]:
# Extract only those moments where the previous_destination is a Brazilian IO port,
# and the new destination is NOT one of those ports (therefore exported out of Brazil)
export_moment_df = export_moment_df[
    (export_moment_df['previous_destination'].isin(ports)) 
    & (~export_moment_df['new_destination'].isin(ports))
]

# Join on 'dwt' from fleet reference table
export_moment_df = pd.merge(export_moment_df.reset_index(),
                            fleet_reg_df[['dwt']],
                            how='left',
                            left_on='imo',
                            right_index=True)\
                         .set_index(['imo','date'])\
                         .sort_index()

export_moment_df.head(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,previous_destination,new_destination,dwt
imo,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8420804,2017-05-02 00:11:35,tubarao,br pdm,364767
8800286,2017-05-14 23:59:31,tubarao,kwang yang,290160


## Exports = ~125million DWT

In [116]:
# Restrict to large (dwt>60000) and only Jan to May 2017
large_janToMay = export_moment_df.loc[(slice(None),slice('2017-01','2017-05')),:][export_moment_df.dwt>60000]

print('The cumulative DWT of all large vessels who exported from Brazilian Iron Ore ports between Jan & May 2017 is:',
      large_janToMay['dwt'].sum())

The cumulative DWT of all large vessels who exported from Brazilian Iron Ore ports between Jan & May 2017 is: 125335892


# Q2. Proportion of ships coming from South Africa, Indian Ocean, North Atlantic Ocean = ~0.6

*Those are the two common areas for ships that come and load iron ore in Brazil.*

In [117]:
# reverse_geocode package for extracting country from lat-lon

import reverse_geocode

# Example
coordinates = ((-19.900000, 148.1000), (31.76, 35.21)) #tuple
reverse_geocode.search(coordinates)

[{'city': 'Bowen', 'country': 'Australia', 'country_code': 'AU'},
 {'city': 'Jerusalem', 'country': 'Israel', 'country_code': 'IL'}]

In [0]:
# Create df of 'previous_destination', 'new_destination' & lat-lon for each vessel position
departure_moment_df = pd.concat(axis=1, 
                             objs=[vessel_position_df['destination'].shift(1),
                                  vessel_position_df[['destination','lat','lon']]])
departure_moment_df.columns = ['previous_destination','new_destination','departure_lat','departure_lon']

# Extract only those moments where the previous_destination is NOT a Brazilian IO port,
# and the new_destination IS a one of those ports
departure_moment_df = departure_moment_df[
    (~departure_moment_df['previous_destination'].isin(ports)) 
    & (departure_moment_df['new_destination'].isin(ports))
]

# Create departure_country from departure lat-lon
departure_moment_df['departure_country'] = departure_moment_df.apply(
    axis=1,
    func=lambda row: reverse_geocode.search(
        ((row['departure_lat'],row['departure_lon']),)
    )[0]['country']
)


In [0]:
# countries from 'departure_country' in: South Africa, Indian Ocean, North Atlantic Ocean

selected_countries = [
    'Belgium',
    'Canada',
    'Cape Verde',
    'Cocos (Keeling) Islands',
    'Comoros',
    'Finland',
    'France',
    'Germany',
    'Gibraltar',
    'India',
    'Indonesia',
    'Kenya',
    'Latvia',
    'Madagascar',
    'Maldives',
    'Mauritius',
    'Morocco',
    'Mozambique',
    'Namibia',
    'Netherlands',
    'Portugal',
    'Saint Helena',
    'South Africa',
    'Spain',
    'Sri Lanka',
    'Sweden',
    'United Kingdom',
    'United States']

In [120]:
print('Proportion of ships travelling to Brazilian IO ports coming from South Africa, Indian Ocean, North Atlantic Ocean: ',
      round(sum(departure_moment_df['departure_country'].isin(selected_countries))
            / len(departure_moment_df),
            3))

Proportion of ships travelling to Brazilian IO ports coming from South Africa, Indian Ocean, North Atlantic Ocean:  0.581


# Q3. What features of the Capesize position history are helpful in predicting the behaviour of the index? 

Early answer:
- draft
- lat & lon

## Supply-Demand of DWT

Hypothesis: as available supply of DWT **increases**, price **decreases**

Capesize is >120k but will be affected by supply of ships <=120k

Dynamics:
- This price is forward looking: "agreeing that day to a new round-trip voyage from a given discharge port"
- Therefore, if a ship space **seller** expects a higher price tomorrow, the space seller will not agree to the (lower) price today and instead will wait for tomorrow
- Therefore, the ship space **buyer** will bid a higher price today to tempt the seller to accept
- Therefore, tomorrow's price feeds positively back into today's price: if tomorrow's price will be higher, then this will drag today's price higher
- Tomorrow's price is dictated by **supply of DWT**: if tomorrow's DWT is lower, tomorrow's price will be higher, pushing today's price higher

## How to measure supply of DWT?

Intuitively:
- If all ships **far from shore** = no ships available that day = low supply = higher price
- If all ships **far from their destination** = no ships available in coming days = low supply = high price
- If all ships **low under water (high draft)** = they're already carrying lots of weight = limited spare capacity = low supply = high price

(*Features are aggregated across all ships for each day*)

Features: 
- **draft** mean,median,std: if average draft larger, less spare capacity
- **hours_until_ETA** mean,median,std: if longer until destination, less capacity at the shore
- **status** dummies: whether more ships are Anchored vs Moored vs Sailing will dictate how much spare capacity there is
- **speed** mean,median,std: proxies activity of ship. High speed if sailing, low speed if moored

Features I'm unsure about but may have predictive power:
- **number of unique**: this is number of ships each day which give out vessel_position
- **lat & lon**: this proxies positions of ship so can say how far ships are from shores

## Create aggregate daily features for vessel positions

In [121]:
vessel_position_df.head(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,lat,lon,speed,draft,eta,destination,status
imo,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
7105495,2017-01-01 00:09:29,46.6974,-92.0187,0.0,8.5,2017-01-03 05:11:00,burns hbr,Moored
7105495,2017-01-01 22:54:29,47.2661,-86.5826,13.0346,8.5,2017-01-03 15:11:00,burns hbr,Under way using engine


In [0]:
# Derive 'status' dummies & 'hours_until_ETA'

# Dummies for categories of 'status'
vessel_position_features_df = pd.get_dummies(vessel_position_df.reset_index(),
                                                   columns=['status'],
                                                   dummy_na=True)

# Derive 'hours_until_ETA'
vessel_position_features_df['hours_until_ETA'] = (vessel_position_features_df['eta'] 
                                                  - vessel_position_features_df['date']).dt.seconds / 3600


In [0]:
# Create aggregate daily features

daily_features_dict = {}

# Extract series of the datetimes
dates = vessel_position_features_df['date'].dt.date

# Number of unique imo each day 
daily_features_dict['nunique'] = vessel_position_features_df.groupby(dates)[['imo']].nunique().rename(columns={'imo':'nunique'})

# distribution variables each day for: lat,lon,speed,draft,hours_until_ETA
daily_features_dict['numerical'] = vessel_position_features_df.groupby(dates)[
    ['lat','lon','speed','draft','hours_until_ETA']].agg(['mean','median','std'])

# Get mean each day of status dummies
daily_features_dict['status_dummies'] = vessel_position_features_df.groupby(dates)[
    [col for col in vessel_position_features_df.columns if col.startswith('status')]].mean()

# Create new df
position_daily_features_df = pd.concat(axis=1,
                                       objs=[daily_features_dict['nunique'],
                                             daily_features_dict['numerical'],
                                             daily_features_dict['status_dummies']])
# Datetimeindex
position_daily_features_df.index = pd.to_datetime(position_daily_features_df.index)

In [0]:
# EDA

# position_daily_features_df.info()

# # viz each variable over time
# for col in position_daily_features_df.columns:
#   position_daily_features_df[col].iloc[0:-1].plot();
#   plt.title(col);
#   plt.figure();

## Early model on basic data

- Just using at-time variables, no moving averages or pct changes
- The motivation is to get a baseline accuracy

In [0]:
# Create different types of targets: regression & classification

basic_targets = capesize_index_df.shift(-1).rename(columns={'value':'next_price'})

# Regression target
basic_targets['next_pct_change'] = capesize_index_df.pct_change().shift(-1)

# Classification target
basic_targets['next_price_higher'] = (capesize_index_df.pct_change().shift(-1) > 0)
basic_targets = basic_targets.dropna()

# Use only data when I have a target
basic_model_data = position_daily_features_df.reindex(index=basic_targets.index)

### Classification

In [126]:
# Classification - up or down

# Data train-test split
y = basic_targets['next_price_higher']
X = basic_model_data

train_threshold = round(0.7*len(basic_model_data))

X_train = X.iloc[:train_threshold]
y_train = y[:train_threshold]
X_test = X.iloc[train_threshold:]
y_test = y[train_threshold:]

# ---
# Models
forest = RandomForestClassifier(random_state=42,
                                n_estimators=100,
                                max_depth=2)
forest.fit(X_train,y_train)

logreg = LogisticRegression()
logreg.fit(X_train,y_train)

print('\n')
print('Score by chance = 0.5')

print('\n')
print('Basic RandomForest Classificationtrain score:',forest.score(X_train,y_train))
print('Basic RandomForest Classification test score:',forest.score(X_test,y_test))

print('\n')
print('Basic LogReg Classification train score:',logreg.score(X_train,y_train))
print('Basic LogReg Classification test score:',logreg.score(X_test,y_test))

# Feature importances
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance',ascending=False)

# Print 10 most important features
print('\n',feature_importances.head(10))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



Score by chance = 0.5


Basic RandomForest Classificationtrain score: 0.8873239436619719
Basic RandomForest Classification test score: 0.5806451612903226


Basic LogReg Classification train score: 0.7746478873239436
Basic LogReg Classification test score: 0.6774193548387096

                                importance
(draft, mean)                    0.102725
status_nan                       0.097739
status_Under way using engine    0.087403
(lon, mean)                      0.073068
status_Anchored                  0.071703
nunique                          0.065296
(lat, std)                       0.059790
(hours_until_ETA, median)        0.053535
status_Moored                    0.050951
(speed, mean)                    0.047769


### Regression

In [127]:
# Regression - quantify percentage price change

# Data train-test split
y = basic_targets['next_pct_change']
X = basic_model_data

train_threshold = round(0.7*len(basic_model_data))

X_train = X.iloc[:train_threshold]
y_train = y[:train_threshold]
X_test = X.iloc[train_threshold:]
y_test = y[train_threshold:]

# ---
# Models
forest = RandomForestRegressor(random_state=42,
                               n_estimators=100,
                               max_depth=2)
forest.fit(X_train,y_train)

linreg = LinearRegression()
linreg.fit(X_train,y_train)

print('\n')
print('Score by zero prediction = 0')

print('\n')
print('Basic RandomForest Regression train score:',forest.score(X_train,y_train))
print('Basic RandomForest Regression test score:',forest.score(X_test,y_test))

print('\n')
print('Basic LinReg Regression train score:',linreg.score(X_train,y_train))
print('Basic LinReg Regression test score:',linreg.score(X_test,y_test))

# Feature importances
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance',ascending=False)

# Print 10 most important features
print('\n',feature_importances.head(10))

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



Score by zero prediction = 0


Basic RandomForest Regression train score: 0.5530932840451284
Basic RandomForest Regression test score: -0.3678249600751802


Basic LinReg Regression train score: 0.5117790499048771
Basic LinReg Regression test score: -16.282206944099705

                                importance
(draft, mean)                    0.202861
status_Under way using engine    0.167726
(lon, mean)                      0.133484
nunique                          0.076158
status_Anchored                  0.065567
(hours_until_ETA, median)        0.057347
status_Moored                    0.036879
status_nan                       0.034002
(speed, mean)                    0.028698
(draft, median)                  0.025647


## Model data

Using:
- unweighted **rolling means**: 2,5,10 day
- **percent changes**: 1,4,9 day

### MAs & pct_changes

In [0]:
# Capesize MA & pct_change to capture autoregressive tendencies

# Capesize index interpolated to include weekends
interpolated_capesize_index = capesize_index_df.reindex(position_daily_features_df.index).interpolate()

# Capesize MAs & pct_changes
capesize_features_df = pd.concat(axis=1,
                                 objs=[interpolated_capesize_index.pct_change(1),
                                       interpolated_capesize_index.pct_change(4),
                                       interpolated_capesize_index.pct_change(9),
                                       interpolated_capesize_index.rolling(2).mean(),
                                       interpolated_capesize_index.rolling(5).mean(),
                                       interpolated_capesize_index.rolling(10).mean()])
capesize_features_df.columns = ['capesize_'+x for x in ['pct1','pct4','pct9','ma2','ma5','ma10']]

In [0]:
# MAs and pct_changes for all model features

# ---
# pct_changes

# Tweak zero values to avoid infinities with pct_change
position_daily_features_df.replace(0,0.000001,inplace=True)

pct_1d = position_daily_features_df.pct_change(1)
pct_1d.columns = ['pct1_'+str(col) for col in pct_1d.columns]

pct_4d = position_daily_features_df.pct_change(4)
pct_4d.columns = ['pct4_'+str(col) for col in pct_4d.columns]

pct_9d = position_daily_features_df.pct_change(9)
pct_9d.columns = ['pct9_'+str(col) for col in pct_9d.columns]

# ---
# MAs
ma_2d = position_daily_features_df.rolling(2).mean()
ma_2d.columns = ['ma2_'+str(col) for col in ma_2d.columns]

ma_5d = position_daily_features_df.rolling(5).mean()
ma_5d.columns = ['ma5_'+str(col) for col in ma_5d.columns]

ma_10d = position_daily_features_df.rolling(10).mean()
ma_10d.columns = ['ma10_'+str(col) for col in ma_10d.columns]


In [0]:
# Model data contains all MAs and pct_changes
model_df = pd.concat(axis=1,
                     objs=[capesize_features_df,
                           position_daily_features_df,
                           pct_1d,
                           pct_4d,
                           pct_9d,
                           ma_2d,
                           ma_5d,
                           ma_10d])

# NaNs caused by lag from MAs & pct_changes - therefore can't interpolate
# Fill NaN with *median* - to dampen signal but avoid distortion
model_df = model_df.fillna(value=model_df.median())

In [0]:
# Only include data with an associated target
X_data = model_df.reindex(index=basic_targets.index)

### Classification

In [132]:
# Classification - up or down
y_target = basic_targets['next_price_higher']

# Data train-test split
train_threshold = round(0.7*len(basic_model_data))

X_train = X_data.iloc[:train_threshold]
y_train = y_target[:train_threshold]

X_test = X_data.iloc[train_threshold:]
y_test = y_target[train_threshold:]

# ---
# Models
forest = RandomForestClassifier(random_state=42,
                                n_estimators=100,
                                max_depth=2)
forest.fit(X_train,y_train)

logreg = LogisticRegression()
logreg.fit(X_train,y_train)

print('\n')
print('Basic RandomForest Classificationtrain score:',forest.score(X_train,y_train))
print('Basic RandomForest Classification test score:',forest.score(X_test,y_test))

print('\n')
print('Basic LogReg Classification train score:',logreg.score(X_train,y_train))
print('Basic LogReg Classification test score:',logreg.score(X_test,y_test))

# Feature importances
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance',ascending=False)

# Print 10 most important features
print('\n',feature_importances.head(10))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



Basic RandomForest Classificationtrain score: 0.971830985915493
Basic RandomForest Classification test score: 0.5806451612903226


Basic LogReg Classification train score: 0.9859154929577465
Basic LogReg Classification test score: 0.3225806451612903

                         importance
capesize_pct1             0.066114
ma5_('lon', 'median')     0.046237
ma10_('lon', 'mean')      0.041038
ma10_('lon', 'median')    0.034937
ma5_('draft', 'mean')     0.030897
pct9_('draft', 'std')     0.027329
nunique                   0.026596
ma5_('lon', 'mean')       0.026387
ma5_('speed', 'mean')     0.023495
ma10_('draft', 'mean')    0.020926


### Regression

In [133]:
# Regression - quantify percentage price change
y = basic_targets['next_pct_change']

# Data train-test split
train_threshold = round(0.7*len(basic_model_data))

X_train = X_data.iloc[:train_threshold]
y_train = y_target[:train_threshold]

X_test = X_data.iloc[train_threshold:]
y_test = y_target[train_threshold:]


forest = RandomForestRegressor(random_state=42,
                               n_estimators=100,
                               max_depth=2)
forest.fit(X_train,y_train)

linreg = LinearRegression()
linreg.fit(X_train,y_train)

print('\n')
print('Basic RandomForest Regression train score:',forest.score(X_train,y_train))
print('Basic RandomForest Regression test score:',forest.score(X_test,y_test))

print('\n')
print('Basic LinReg Regression train score:',linreg.score(X_train,y_train))
print('Basic LinReg Regression test score:',linreg.score(X_test,y_test))

# Feature importances
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance',ascending=False)
print('\n',feature_importances.head(10))

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



Basic RandomForest Regression train score: 0.8064312326481122
Basic RandomForest Regression test score: 0.11841351202218797


Basic LinReg Regression train score: 1.0
Basic LinReg Regression test score: -1697.1970900038077

                                         importance
capesize_pct1                             0.349743
ma5_('lon', 'mean')                       0.120646
ma5_('lon', 'median')                     0.048977
ma5_('draft', 'mean')                     0.034593
ma10_('lon', 'median')                    0.024235
pct1_('hours_until_ETA', 'median')        0.021969
pct4_status_Under way sailing             0.016222
pct9_('draft', 'std')                     0.015718
ma10_status_Restricted maneuverability    0.015170
pct1_status_Restricted maneuverability    0.013231


## Feature Selection & Scaling

- I face an issue of low obs: only **m=102** days of the Capesize index
- With the moving averages and pct_changes included, I have **n=188** features
- Therefore I need to use feature selection to extract only the most relevant features
- However I also should train my model on a higher m dataset (Capesize index over a longer period)