# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

# B1: Create SparkSession

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cruise_ship").getOrCreate()

# B2: Load input data
'cruise_ship_info.csv'

In [2]:
dir_input_path = "./../input_data/"
file_input_path = dir_input_path + "cruise_ship_info.csv"

In [3]:
import os

if not os.path.exists(file_input_path):
    print("File Not Found :", file_input_path)
else:
    print("Verified input file path :", file_input_path)

Verified input file path : ./../input_data/cruise_ship_info.csv


In [4]:
df = spark.read.csv(file_input_path, header=True, inferSchema=True)

# B3: Show overview of input data
## Schema

In [5]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



## Description

In [6]:
df.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|      NaN|       null| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

In [7]:
df.describe("crew").show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



## The column names

In [8]:
df.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']

## Sample Data

In [9]:
df.head(2)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)]

## Print each item in the first row

In [10]:
for item in df.head():
    print(item)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55


# B4: Data Preprocessing

## Deal with the categorical variable
Using StringIndexer

In [11]:
df.show(1)

+---------+-----------+---+------------------+----------+------+------+-----------------+----+
|Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+---------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
+---------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 1 row



The categorical feature we need to encode is **'Cruise_line'**.

In [12]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_index")

indexed = indexer.fit(df).transform(df)
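
To make the encoding step less of a black box, here is a minimal pure-Python sketch of what a frequency-based string index does. It assumes StringIndexer's default ordering (most frequent label gets index 0.0, with ties broken alphabetically); `fit_string_index` and the toy column are hypothetical names and data for illustration, not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically. This is a sketch of the idea behind a
    frequency-ordered string index, not Spark's actual code.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column, not the real cruise data.
toy_lines = ["Carnival", "Azamara", "Carnival", "Costa", "Carnival", "Costa"]

mapping = fit_string_index(toy_lines)
indexed_toy = [mapping[line] for line in toy_lines]

print(mapping)
print(indexed_toy)
```

Note that the resulting index is only a frequency rank. A linear model will treat it as an ordinary number, so a one-hot encoding (for example with OneHotEncoder from pyspark.ml.feature) may be worth trying as an alternative.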

In [13]:
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|              1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|              1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|              1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

# B5: Create VectorAssembler for 'features'

In [14]:
from pyspark.ml.feature import VectorAssembler

In [15]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_index']

In [16]:
assembler = VectorAssembler(inputCols=['Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'Cruise_line_index'], outputCol="features")

In [17]:
df_all = assembler.transform(indexed).select("features", "crew")

In [18]:
df_all.show(2)

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
+--------------------+----+
only showing top 2 rows
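
Conceptually, the assembler just concatenates the chosen columns, in the order given, into one numeric vector per row. The sketch below reproduces that for the first row shown earlier using plain Python; `assemble` is a hypothetical helper, and a list stands in for Spark's vector type.

```python
# Column order must match the inputCols passed to VectorAssembler.
input_cols = ["Age", "Tonnage", "passengers", "length", "cabins",
              "passenger_density", "Cruise_line_index"]

# First row of the indexed DataFrame, copied from the notebook output.
row = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_line_index": 16.0,
}


def assemble(record, cols):
    """Return the selected columns as one list of floats (a stand-in vector)."""
    return [float(record[c]) for c in cols]


features = assemble(row, input_cols)
label = row["crew"]

print(features, label)
```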



# B6: Split the Full Data into Training and Testing Sets

In [19]:
train_set, test_set = df_all.randomSplit([0.7, 0.3])

In [20]:
train_set.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              106|
|   mean|7.942358490566049|
| stddev|3.614685536447589|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



In [21]:
test_set.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                52|
|   mean| 7.492115384615383|
| stddev|3.2782308335732013|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+
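
The split above produced 106 and 52 rows, roughly 67/33 rather than exactly 70/30, because the rows are sampled randomly instead of being cut at fixed counts. The sketch below illustrates the idea with an independent random draw per row; this is a simplification I am assuming for illustration, and `bernoulli_split` is a hypothetical helper. It also shows why fixing a seed makes the split repeatable (randomSplit accepts an optional `seed` argument for the same purpose).

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Stand-in for the 158 ships: row ids only.
rows = list(range(158))

train_a, test_a = bernoulli_split(rows, 0.7, seed=42)
train_b, test_b = bernoulli_split(rows, 0.7, seed=42)

# Sizes are close to 70/30 but not exact; the same seed gives the same split.
print(len(train_a), len(test_a))
print(train_a == train_b)
```

Without a fixed seed, every run trains and tests on different rows, so the metrics reported later in this notebook will vary slightly from run to run.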



# B7: Train & Test Phase

## Create Model object

In [22]:
from pyspark.ml.regression import LinearRegression

In [23]:
lr = LinearRegression(featuresCol="features", labelCol="crew")

## Fit the model to data

In [24]:
model = lr.fit(train_set)

## Print coefficients and intercept for ML model (if needed)

In [25]:
model.coefficients

DenseVector([-0.01, 0.0047, -0.1622, 0.32, 0.9754, -0.0029, 0.0486])

In [26]:
model.intercept

-0.7365500255603772
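
To see how the coefficients and intercept turn into a prediction, the sketch below applies them by hand to the first row of the dataset. It uses the rounded coefficients exactly as printed above, so the result only approximates what the fitted model would return; `predict` is a hypothetical helper.

```python
# Rounded coefficients as printed by model.coefficients, in the order
# Age, Tonnage, passengers, length, cabins, passenger_density, Cruise_line_index.
coefficients = [-0.01, 0.0047, -0.1622, 0.32, 0.9754, -0.0029, 0.0486]
intercept = -0.7365500255603772

# Feature vector of the first row (ship 'Journey'), from the notebook output.
features = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]


def predict(coefs, bias, x):
    """Linear model: intercept plus the dot product of coefficients and features."""
    return bias + sum(w * v for w, v in zip(coefs, x))


predicted_crew_hundreds = predict(coefficients, intercept, features)

# The 'crew' column is in hundreds, so multiply by 100 for a head count.
print(round(predicted_crew_hundreds, 2))
print(round(predicted_crew_hundreds * 100))
```

The recorded crew for this ship is 3.55 (hundreds), so the hand-computed value can be compared against it directly.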

## Evaluate the model based on the testing set

In [27]:
test_result = model.evaluate(test_set)

## Show residuals after evaluating testing set

In [28]:
test_result.residuals.show(3)

+--------------------+
|           residuals|
+--------------------+
|  1.3382764391615396|
|0.003148013346427...|
| -0.6881511873256505|
+--------------------+
only showing top 3 rows



## Show the evaluation scores on the testing set
### Regression Model: MSE, RMSE, R2

In [29]:
test_result.meanSquaredError

0.7182991970650386

In [30]:
test_result.rootMeanSquaredError

0.8475253371227545

In [31]:
test_result.r2

0.9318510000783998
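
For reference, here is a minimal sketch of how these three metrics relate to the residuals, assuming the usual definitions (residual = label minus prediction). The labels and predictions are made-up toy values, and `regression_metrics` is a hypothetical helper; only the two reported scores at the end come from this notebook.

```python
import math


def regression_metrics(labels, predictions):
    """Return residuals, MSE, RMSE and R2 for paired labels and predictions."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    n = len(labels)
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return residuals, mse, rmse, r2


# Toy values for illustration only.
labels = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]
residuals, mse, rmse, r2 = regression_metrics(labels, predictions)
print(residuals, mse, rmse, r2)

# Sanity check on the scores reported above: RMSE is the square root of MSE.
reported_mse = 0.7182991970650386
reported_rmse = 0.8475253371227545
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-6))
```

Because 'crew' is measured in hundreds, an RMSE of about 0.85 corresponds to a typical error of roughly 85 crew members.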

### Show the correlation between each feature and the target
Using the corr function from pyspark.sql.functions

In [32]:
from pyspark.sql.functions import corr

In [33]:
df.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']

In [34]:
list_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
col_target = "crew"
for col in list_cols:
    print("-" * 72)
    print("Correlation between target feature '{}' and feature '{}':".format(col_target, col))
    df.select(corr(col_target, col)).show()

------------------------------------------------------------------------
Correlation between target feature 'crew' and feature 'Age':
+-------------------+
|    corr(crew, Age)|
+-------------------+
|-0.5306565039638852|
+-------------------+

------------------------------------------------------------------------
Correlation between target feature 'crew' and feature 'Tonnage':
+-------------------+
|corr(crew, Tonnage)|
+-------------------+
|  0.927568811544939|
+-------------------+

------------------------------------------------------------------------
Correlation between target feature 'crew' and feature 'passengers':
+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+

------------------------------------------------------------------------
Correlation between target feature 'crew' and feature 'length':
+------------------+
|corr(crew, length)|
+------------------+
|0.8958566271016579|
+-------------

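The output shows that 'crew' correlates most strongly with 'cabins' (a coefficient of about 0.95) and most weakly with 'passenger_density' (about 0.16 in magnitude). 'Tonnage', 'passengers', and 'length' also correlate with 'crew' at roughly 0.90 or higher, and these strong linear relationships explain why the linear regression model performs so well.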
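
As a reminder of what the correlation numbers mean, here is a small pure-Python sketch of the Pearson correlation coefficient, which is the standard definition I am assuming corr computes. `pearson_corr` is a hypothetical helper, and the toy lists are invented values, not rows from the dataset.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: crew grows almost linearly with cabins.
cabins_toy = [1.0, 2.0, 3.0, 4.0]
crew_toy = [2.0, 4.1, 5.9, 8.0]

print(pearson_corr(cabins_toy, crew_toy))
```

A value near 1 or -1 means the two columns lie close to a straight line, which is exactly the situation in which a linear regression fits well; a value near 0 means the feature carries little linear information about the target.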