In this notebook we train a multi-layer perceptron and a logistic regression model on our ship data.

We load our libraries below.

In [1]:
import csv
import random
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

Below we download our ship data.

In [2]:
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/data/ship.csv

--2023-11-21 19:14:59--  https://github.com/mlittmancs/great_courses_ml/raw/master/data/ship.csv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/data/ship.csv [following]
--2023-11-21 19:15:00--  https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/data/ship.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61194 (60K) [text/plain]
Saving to: ‘ship.csv’


2023-11-21 19:15:00 (725 KB/s) - ‘ship.csv’ saved [61194/61194]



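Next, we parse the CSV file, skip its header row, and turn each record into a numeric feature vector. Missing ages are filled in with 50, and the port of embarkation is one-hot encoded into three boolean features.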

In [3]:
# Read every row of the CSV except the header row.
first = True
with open("ship.csv") as f:
    csvdata = csv.reader(f, delimiter=',')
    data = []
    for row in csvdata:
        if not first: data += [row]
        first = False

# For each column, map every distinct value to an integer code,
# assigned in order of first appearance.
array = []
for col in range(len(data[0])):
    array += [{}]
    new = 0
    for i in range(len(data)):
        line = data[i]
        if line[col] not in array[col]:
            array[col][line[col]] = new
            new += 1

# Build the labels (column 1) and the numeric feature rows.
alldat = []
alllabs = []
for line in data:
    alllabs += [int(line[1])]
    if line[5] == '': line[5] = '50'  # fill in a missing age with 50
    alldat += [[int(line[2]), array[4][line[4]], float(line[5]), int(line[6]), int(line[7]), float(line[9]),
                line[11]=='S', line[11]=='C', line[11]=='Q']]
feats = ['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked S', 'Embarked C', 'Embarked Q']
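
The per-column dictionaries above assign integer codes in order of first appearance, so the code a value receives depends on which value shows up first in the file. The sketch below isolates that idea on a made-up column; `build_index` and the toy values are illustrative names and data, not part of the notebook.

```python
def build_index(values):
    """Map each distinct value to an integer, in order of first appearance."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index

# Toy column for illustration only (not read from ship.csv).
sex_column = ["male", "female", "female", "male"]
sex_index = build_index(sex_column)

# Replace each raw value with its integer code.
encoded = [sex_index[v] for v in sex_column]
print(sex_index, encoded)
```

Because "male" appears first in this toy column it receives code 0, which matches the convention the later cells rely on when they set the `Sex` feature to 0 for male and 1 for female.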

With our data processed, we create a `trainmask` that assigns each row a random integer from 0 to 2. Rows with a value below 2 (about two-thirds of the data) form the training set, and rows with a value of 2 form the test set.

In [4]:
trainmask = [random.randint(0,2) for i in range(len(alldat))]

traindat = [alldat[i] for i in range(len(alldat)) if trainmask[i]<2]
trainlabs = [alllabs[i] for i in range(len(alldat)) if trainmask[i]<2]
testdat = [alldat[i] for i in range(len(alldat)) if trainmask[i]==2]
testlabs = [alllabs[i] for i in range(len(alldat)) if trainmask[i]==2]
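
Since `random.randint(0,2)` includes both endpoints, the mask sends roughly two-thirds of the rows to training, and the exact split changes on every run. The following is a minimal sketch of the same masking idea with a fixed seed so the split is repeatable; the seed value and the row count `n` are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # arbitrary seed, chosen only to make the split repeatable
n = 9000                # hypothetical number of rows

# One random integer in {0, 1, 2} per row, as in the notebook's trainmask.
mask = [rng.randint(0, 2) for _ in range(n)]

train_idx = [i for i in range(n) if mask[i] < 2]   # values 0 and 1
test_idx = [i for i in range(n) if mask[i] == 2]   # value 2

print(len(train_idx) / n, len(test_idx) / n)
```

Every row lands in exactly one of the two sets, and the training fraction comes out close to 2/3 for a reasonably large `n`.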

Next, we train a multi-layer perceptron with one hidden layer of 60 units and print its error rate on the test set, that is, the fraction of test examples it misclassifies.

In [5]:
nhidden = 60
clf = MLPClassifier(hidden_layer_sizes=[nhidden], max_iter = 50000)
clf = clf.fit(traindat, trainlabs)
pred = clf.predict(testdat)
[sum([pred[i] != testlabs[i] for i in range(len(testlabs))]) / len(testlabs)]


[0.22456140350877193]
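
The cell above counts mismatches between predictions and labels, so the printed number is an error rate; the corresponding accuracy is one minus that value (about 0.775 for the run shown). A small sketch of the relationship, using made-up predictions and labels and a helper name, `error_rate`, that is not in the notebook:

```python
def error_rate(pred, labels):
    """Fraction of positions where the prediction differs from the label."""
    return sum(p != y for p, y in zip(pred, labels)) / len(labels)

# Toy predictions and labels for illustration only.
pred = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1]

err = error_rate(pred, labels)
acc = 1 - err
print(err, acc)
```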

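Next, we estimate how much higher the model's predictions are for females than for males. For each passenger we set the `Sex` feature to 0 (male) and then to 1 (female), keep every other feature fixed, and record the difference in the predicted probability of class 1. We then average that difference over all passengers.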

In [6]:
# feats = ['Pclass','Sex','Age','SibSp','Parch','Fare', 'Embarked S', 'Embarked C', 'Embarked Q']

imp = []
for v in alldat:
  real = v[1]
  v[1] = 0
  asmale = clf.predict_proba([v])[0][1]
  v[1] = 1
  asfemale = clf.predict_proba([v])[0][1]
  v[1] = real
  imp += [ asfemale-asmale ]

print(sum(imp)/len(imp))

0.3616355420908969
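
The same probing pattern works for any feature and any model that returns a probability. The sketch below restates it with a toy scoring function in place of the trained network; `toy_prob`, its weights, and the sample rows are all invented for illustration. Unlike the cell above, it copies each row instead of modifying it in place and restoring it.

```python
import math

def toy_prob(row):
    """Hypothetical stand-in for clf.predict_proba([row])[0][1]; weights are made up."""
    pclass, sex, age = row
    z = 1.0 - 0.9 * pclass + 2.5 * sex - 0.03 * age
    return 1 / (1 + math.exp(-z))

def average_effect(rows, col, low, high, predict):
    """Mean change in predict() when column `col` moves from `low` to `high`."""
    diffs = []
    for row in rows:
        as_low = list(row)    # copy so the original row is untouched
        as_low[col] = low
        as_high = list(row)
        as_high[col] = high
        diffs.append(predict(as_high) - predict(as_low))
    return sum(diffs) / len(diffs)

# Toy rows of [Pclass, Sex, Age], for illustration only.
rows = [[1, 0, 30.0], [3, 1, 22.0], [2, 0, 45.0]]
effect = average_effect(rows, col=1, low=0, high=1, predict=toy_prob)
print(effect)
```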


Next, we train a logistic regression model and print its error rate on the training and test sets.

In [7]:
clf = LogisticRegression(max_iter = 500)

clf.fit(traindat, trainlabs)

pred = clf.predict(traindat)
trainerr = sum([pred[i] != trainlabs[i] for i in range(len(trainlabs))]) / len(trainlabs)
pred = clf.predict(testdat)
testerr = sum([pred[i] != testlabs[i] for i in range(len(testlabs))]) / len(testlabs)

print(trainerr, testerr)

0.19966996699669967 0.18596491228070175


Finally, we print the coefficients of the logistic regression model, one per feature.

In [8]:
for i in range(len(feats)):
  print(feats[i], clf.coef_[0][i])


Pclass -0.899621945705204
Sex 2.5289706956110853
Age -0.03675846019496377
SibSp -0.47032861923436736
Parch -0.0427788814558152
Fare 0.0031436434597836363
Embarked S -0.2547469522890994
Embarked C 0.20134951139286178
Embarked Q -0.014571985989051354
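
A logistic regression coefficient is the change in the log-odds of class 1 for a one-unit increase in that feature with the others held fixed, so exponentiating it gives a multiplicative change in the odds. The sketch below applies that conversion to the coefficients printed above; the `coefs` dictionary simply copies those values by hand.

```python
import math

# Coefficients copied from the output above.
coefs = {
    "Pclass": -0.899621945705204,
    "Sex": 2.5289706956110853,
    "Age": -0.03675846019496377,
    "SibSp": -0.47032861923436736,
    "Parch": -0.0427788814558152,
    "Fare": 0.0031436434597836363,
    "Embarked S": -0.2547469522890994,
    "Embarked C": 0.20134951139286178,
    "Embarked Q": -0.014571985989051354,
}

# exp(coefficient) is the factor by which the odds change per unit increase.
odds_ratios = {name: math.exp(w) for name, w in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 raises the odds of class 1 and a ratio below 1 lowers them. The features here are not rescaled, so the sizes of the coefficients are not directly comparable across features measured in different units, such as `Fare` and the 0/1 `Sex` code.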
