# Linear Regression on Ship Dataset

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    Passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that predicts how many crew members future ships will need. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!


In [1]:
# Let's start by creating a Spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cruise').getOrCreate()

## Read the dataset

In [2]:
df = spark.read.csv('resources/cruise_ship_info.csv', inferSchema=True, header=True)

df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [4]:
# Let's look at the first few rows

df.show(10)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [5]:
# Let's view summary statistics (describe() also lists the string columns, whose mean and stddev are not meaningful)

df.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|     null|       null| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

## Data Transformation for Machine Learning

In [10]:
# We will ignore the Ship_name variable, but use the Cruise_line variable as a categorical variable
# Check the unique values
df.select('Cruise_line').distinct().show()

+-----------------+
|      Cruise_line|
+-----------------+
|            Costa|
|              P&O|
|           Cunard|
|Regent_Seven_Seas|
|              MSC|
|         Carnival|
|          Crystal|
|           Orient|
|         Princess|
|        Silversea|
|         Seabourn|
| Holland_American|
|         Windstar|
|           Disney|
|        Norwegian|
|          Oceania|
|          Azamara|
|        Celebrity|
|             Star|
|  Royal_Caribbean|
+-----------------+



In [11]:
# Let's count the ships per cruise line

df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [12]:
# The cruise line may affect how many crew members a ship needs.

# To use this column in the model, we index it: StringIndexer assigns a numerical index
# to each distinct cruise line.

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_cat")
indexed = indexer.fit(df).transform(df)

# Check the first three rows for verification
indexed.head(3)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_cat=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_cat=16.0),
 Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7, Cruise_cat=1.0)]
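
The `Cruise_cat` values above follow from the counts in the previous cell: by default `StringIndexer` orders labels by descending frequency, so the most frequent cruise line gets index 0. The sketch below reproduces that ordering in plain Python from the counts shown earlier. It is an illustration, not Spark's implementation, and the alphabetical tie-break between equally frequent labels is an assumption based on the documented behavior of Spark 3.x.

```python
# Counts copied from the groupBy('Cruise_line').count() output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(label_counts):
    """Map each label to an index: most frequent first.

    Ties are broken alphabetically (assumed tie-break rule).
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


cruise_cat = frequency_desc_index(counts)

# Compare with the Cruise_cat values in the rows printed by indexed.head(3).
print(cruise_cat["Azamara"], cruise_cat["Carnival"])
```

Because these indices are labels rather than quantities, a linear model that uses `Cruise_cat` directly treats them as ordered numbers. One-hot encoding the index (for example with `pyspark.ml.feature.OneHotEncoder`) is a common alternative.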

In [13]:
# Now we will convert the data into features and labels for the algorithms

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [14]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_cat']

In [15]:
assembler = VectorAssembler(inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 
                                       'passenger_density', 'Cruise_cat'],
                           outputCol="features")

In [16]:
# We will transform the data and create the final dataframe

output = assembler.transform(indexed)

final_data = output.select('features', 'crew')

final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows
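
Conceptually, `VectorAssembler` concatenates the selected columns of each row into a single numeric vector, in the order given by `inputCols`. The plain-Python sketch below mimics this for the first row of `indexed`. The helper `assemble` is hypothetical and only illustrates the idea; Spark itself returns a `Vector` object and handles the whole DataFrame.

```python
# Feature order copied from the VectorAssembler inputCols above.
input_cols = ["Age", "Tonnage", "passengers", "length", "cabins",
              "passenger_density", "Cruise_cat"]

# First row of `indexed`, copied from the indexed.head(3) output above.
row = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_cat": 16.0,
}


def assemble(record, columns):
    """Return the selected column values as one list of floats, in column order."""
    return [float(record[column]) for column in columns]


features = assemble(row, input_cols)
print(features)
```

The label column `crew` is deliberately left out of `input_cols`, which is why `final_data` keeps it as a separate column next to `features`.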



## Perform Regression

In [18]:
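# Split the data into training (70%) and test (30%) sets.
# No seed is set, so the split, and therefore the results below, will vary between runs.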

train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [19]:
from pyspark.ml.regression import LinearRegression

# Create a linear regression model object (featuresCol defaults to 'features')

lr = LinearRegression(labelCol='crew')

# Fit the model to the data

lrmodel = lr.fit(train_data)

In [20]:
# Get the coefficients and intercept

print(f"coefficients : {lrmodel.coefficients}")
print("\n")
print(f"Intercept : {lrmodel.intercept}")

coefficients : [-0.014325219988419435,0.006716902017975473,-0.14713140519486861,0.4779449718256105,0.8570924324777663,0.0022912208408813668,0.04818430726512149]


Intercept : -1.4875218211025563
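
To make the fitted numbers concrete: a linear regression prediction is the dot product of the coefficients with a feature vector, plus the intercept. The sketch below applies the coefficients printed above to the feature vector of the first ship in the dataset. The function `predict` is a hypothetical helper, not part of the PySpark API, and since the split is random, a rerun of the notebook would give different coefficients.

```python
# Coefficients and intercept copied from the output above, in inputCols order:
# Age, Tonnage, passengers, length, cabins, passenger_density, Cruise_cat.
coefficients = [
    -0.014325219988419435, 0.006716902017975473, -0.14713140519486861,
    0.4779449718256105, 0.8570924324777663, 0.0022912208408813668,
    0.04818430726512149,
]
intercept = -1.4875218211025563

# Feature vector of the first ship ('Journey'), taken from the rows shown earlier.
features = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]


def predict(feature_values, weights, bias):
    """Linear prediction: sum of weight * feature, plus the intercept."""
    return sum(w * x for w, x in zip(weights, feature_values)) + bias


predicted_crew = predict(features, coefficients, intercept)
actual_crew = 3.55  # crew (100s) reported for this ship

print(f"predicted crew (100s): {predicted_crew:.2f}")
print(f"residual (actual - predicted): {actual_crew - predicted_crew:.2f}")
```

This ship may have been part of the training split, so the residual here is only an illustration of the arithmetic, not an evaluation of the model.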


In [21]:
# Now we are going to evaluate this model on test data

test_results = lrmodel.evaluate(test_data)

In [22]:
# Get the residuals, RMSE, MSE, R2 and adjusted R2

test_results.residuals.show()

print(f"RMSE : {test_results.rootMeanSquaredError}")
print(f"MSE : {test_results.meanSquaredError}")
print(f"R2 : {test_results.r2}")
print(f"Adjusted R2 : {test_results.r2adj}")

+--------------------+
|           residuals|
+--------------------+
| 0.43922100476459125|
| -0.8091203901501052|
| -1.1966072203035676|
|  0.4078105401217762|
| -0.6758720728350998|
| -0.5746595993770978|
|  0.9739545363227098|
|  0.9739545363227098|
| -0.6098178633633555|
|-0.33559532459755914|
|  0.6673696232667918|
| -1.1573508537167303|
| -1.4502146898831745|
| -0.3405661936221378|
| -1.2330256337283085|
|   0.247154431896063|
|  -1.135889469894754|
| -0.8215923826700102|
| 0.07243317652413239|
| -1.6601333835526209|
+--------------------+
only showing top 20 rows

RMSE : 0.7749058544822227
MSE : 0.6004790833108237
R2 : 0.9159526022956866
Adjusted R2 : 0.8963415428313468


An adjusted R2 of about 0.90 is fairly good. We will inspect the data a little more closely.

In [23]:
from pyspark.sql.functions import corr

df.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [24]:
df.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
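
The `corr` function computes the Pearson correlation coefficient between two columns. As a minimal sketch of that calculation, the plain-Python function below applies the standard formula to the `cabins` and `crew` values of the first seven rows shown by `df.show(10)`. The name `pearson_corr` is hypothetical, and because only seven of the 158 ships are used, the result will differ from the full-dataset value above.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# cabins and crew (both in 100s) for the first seven ships shown earlier.
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]

r = pearson_corr(cabins, crew)
print(f"corr(crew, cabins) on seven rows: {r:.3f}")
```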

