# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

# Start

The first step is to start a new Spark session. Let's call it cruise_ship:

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cruise_ship').getOrCreate()

Next, read the data from the CSV file:

In [2]:
df = spark.read.csv('input data/cruise_ship_info.csv', header=True, inferSchema=True)

Before diving into the data, let's check the schema:

In [3]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



There are two string columns and the rest are numeric. Let's look at the first few rows:

In [4]:
df.head(3)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55),
 Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)]

This output is hard to read. Let's print one row at a time instead:

In [5]:
for s in df.head(3):
    print(s)
    print('-------')
    print('\n')

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)
-------


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)
-------


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)
-------




After checking the features and some of their possible values, let's explore the Cruise_line categorical variable.

To check which cruise lines appear in the dataset and how many ships each one has, one can use the following:

In [6]:
df.groupBy(df['Cruise_line']).count().orderBy('count', ascending=False).show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|  Royal_Caribbean|   23|
|         Carnival|   22|
|         Princess|   17|
| Holland_American|   14|
|        Norwegian|   13|
|            Costa|   11|
|        Celebrity|   10|
|              MSC|    8|
|             Star|    6|
|              P&O|    6|
|Regent_Seven_Seas|    5|
|        Silversea|    4|
|           Cunard|    3|
|          Oceania|    3|
|         Windstar|    3|
|         Seabourn|    3|
|           Disney|    2|
|          Crystal|    2|
|          Azamara|    2|
|           Orient|    1|
+-----------------+-----+



Some cruise lines have more than twenty ships while others have only one or two.

The introduction notes that, according to the client, particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature for this analysis. The Cruise_line column will therefore be included in the model.

Since Cruise_line is a string column, StringIndexer is used to encode it as numeric indices that the chosen model can consume.

Note that a different approach could be used, such as dummy variables (one new column of 1 or 0 values per cruise line name, also known as one-hot encoding). A single index column makes a linear model treat the categories as points on a numeric scale, so dummy variables are often the safer choice when the categories have no natural order.
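To make the dummy-variable idea concrete, here is a minimal plain-Python sketch (it is not Spark code and not what this notebook uses). The helper `one_hot` is hypothetical and the short list of cruise lines is only an illustration; in Spark the same idea is offered by the OneHotEncoder feature transformer.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Illustrative sample only, not the full dataset.
cruise_lines = ['Azamara', 'Azamara', 'Carnival', 'Princess']
categories, dummies = one_hot(cruise_lines)

print(categories)
for name, row in zip(cruise_lines, dummies):
    print(name, row)
```

Each row has exactly one 1, in the column of its own cruise line, so no artificial ordering between cruise lines is introduced.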

In [7]:
from pyspark.ml.feature import StringIndexer

str_indexer = StringIndexer(inputCol='Cruise_line', outputCol='cruise_category')
fit_indexer = str_indexer.fit(df)
indexed = fit_indexer.transform(df)

Let's use the same approach as earlier to print the top rows with the new feature included:

In [8]:
for s in indexed.head(3):
    print(s)
    print('-------')
    print('\n')

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_category=16.0)
-------


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_category=16.0)
-------


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7, cruise_category=1.0)
-------




One can verify that each Cruise_line value maps to its own cruise_category index, stored as a double (16.0 for Azamara and 1.0 for Carnival in the rows above). This column can now be used by the chosen algorithm.
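The index values are not arbitrary: by default StringIndexer gives index 0 to the most frequent label, 1 to the next most frequent, and so on. The plain-Python sketch below mimics that ordering using the ship counts from the table above. Breaking ties alphabetically is an assumption about the Spark version in use, and `frequency_index` is a hypothetical helper, not part of PySpark.

```python
def frequency_index(counts):
    """Map each label to an index: most frequent first, ties broken by name."""
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: float(position) for position, name in enumerate(ordered)}


# Ship counts per cruise line, copied from the groupBy output above.
ship_counts = {
    'Royal_Caribbean': 23, 'Carnival': 22, 'Princess': 17,
    'Holland_American': 14, 'Norwegian': 13, 'Costa': 11, 'Celebrity': 10,
    'MSC': 8, 'Star': 6, 'P&O': 6, 'Regent_Seven_Seas': 5, 'Silversea': 4,
    'Cunard': 3, 'Oceania': 3, 'Windstar': 3, 'Seabourn': 3,
    'Disney': 2, 'Crystal': 2, 'Azamara': 2, 'Orient': 1,
}

index = frequency_index(ship_counts)
print(index['Carnival'], index['Azamara'])
```

Under these assumptions the sketch assigns Carnival and Azamara the same indices that appear in the printed rows.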

Speaking of which, let's check the column names and choose which ones should be left out of the model:

In [9]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_category']

Ship_name is just an identifier, and the Cruise_line names are already encoded in cruise_category, so neither column will be part of inputCols.

The crew column is also excluded because it is the value we want to predict (the label).

In [10]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['Age',
                                      'Tonnage',
                                      'passengers',
                                      'length',
                                      'cabins',
                                      'passenger_density',
                                      'cruise_category'],
                           outputCol='features')
output = assembler.transform(indexed)

The dataset can now be checked:

In [11]:
output.select('features', 'crew').show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



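Everything looks right, so the data can be split into training and test sets. No seed is passed to randomSplit, so the split, and therefore the numbers below, will vary from run to run: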

In [12]:
data = output.select('features', 'crew')
train_data, test_data = data.randomSplit([0.7, 0.3])
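As a rough illustration of what a weighted random split does, the plain-Python sketch below assigns each row to the training set with probability 0.7. It mimics the idea rather than Spark's implementation, and `random_split` is a hypothetical helper. It shows two things: the resulting sizes are only approximately 70/30, and fixing a seed makes the split repeatable (PySpark's randomSplit accepts an optional seed argument for the same purpose).

```python
import random


def random_split(rows, train_weight, seed=None):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))  # close to, but not exactly, 70/30
print(train_a == train_b)         # same seed, same split
```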

Let's now build a Linear Regression Model to perform the prediction:

In [13]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol='crew')
# Fit the model to the training data
lr_model = lr.fit(train_data)
print('Coefficients: {} Intercept: {}'.format(lr_model.coefficients, lr_model.intercept))

Coefficients: [-0.016129154511980208,-0.0035431773991121203,-0.1846486466263946,0.42865033012709447,1.042592215183075,0.0034210798228779915,0.04497639421954517] Intercept: -1.3335048135839587
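To see how these numbers turn into a prediction, the sketch below applies them by hand to one ship. A linear regression prediction is the intercept plus the dot product of the coefficients and the feature vector. The coefficients and intercept are copied from the output above (they will differ on another run, since the split is random), the feature values are those of the Celebration row shown earlier, and `predict` is a hypothetical helper written for this illustration.

```python
# Copied from the printed model output; order matches the VectorAssembler inputCols.
coefficients = [
    -0.016129154511980208,   # Age
    -0.0035431773991121203,  # Tonnage
    -0.1846486466263946,     # passengers
    0.42865033012709447,     # length
    1.042592215183075,       # cabins
    0.0034210798228779915,   # passenger_density
    0.04497639421954517,     # cruise_category
]
intercept = -1.3335048135839587

# Feature values of the 'Celebration' row (Carnival, cruise_category=1.0).
celebration = [26.0, 47.262, 14.86, 7.22, 7.43, 31.8, 1.0]


def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


predicted_crew = predict(celebration, coefficients, intercept)
print(predicted_crew)  # compare with the observed crew value of 6.7
```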


After building the model, metrics such as the root mean squared error (RMSE) can be computed on the held-out test data:

In [14]:
result = lr_model.evaluate(test_data)
print('RMSE: {}'.format(result.rootMeanSquaredError))
print("MSE: {}".format(result.meanSquaredError))
print("R2: {}".format(result.r2))

RMSE: 0.7597612951960527
MSE: 0.5772372256779836
R2: 0.9452027721570069
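For reference, the three metrics follow directly from the prediction errors. The sketch below computes them in plain Python with the standard definitions; the small `y_true` and `y_pred` lists are made-up values for illustration, and `regression_metrics` is a hypothetical helper, not a PySpark function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE and R2 for paired true and predicted values."""
    n = len(y_true)
    # Sum of squared errors between observations and predictions.
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the observations.
    mean_true = sum(y_true) / n
    sst = sum((t - mean_true) ** 2 for t in y_true)
    mse = sse / n
    return {'rmse': math.sqrt(mse), 'mse': mse, 'r2': 1 - sse / sst}


# Made-up values, only to exercise the formulas.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because the RMSE is the square root of the MSE, the two values printed by the model above carry the same information; the RMSE is simply expressed in the units of the label.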


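Since crew is measured in hundreds, an RMSE of about 0.76 corresponds to roughly 76 crew members. Intuitively, the more passengers and cabins a ship has, the more crew members it will need. The relationship between each of these two features and the label can be checked through their correlation: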

In [15]:
from pyspark.sql.functions import corr

df.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [16]:
df.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+



As expected, these two features are highly correlated with the number of crew members (about 0.92 for passengers and 0.95 for cabins), which means they are strong indicators of a ship's crew size. With features like these, it is natural that the R2 value is high, since they explain a large part of the variance in crew.
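The corr function used above computes the Pearson correlation coefficient. As a reminder of what that number measures, here is a minimal plain-Python sketch of the formula. The `pearson` helper is hypothetical, and the `cabins` and `crew` lists are made-up values for illustration, not rows from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values in which crew grows almost linearly with cabins.
cabins = [3.55, 7.43, 10.0, 12.0]
crew = [3.55, 6.7, 9.0, 11.5]

r = pearson(cabins, crew)
print(r)
```

A value close to 1 means the points lie close to an increasing straight line, which is the situation a linear regression fits well.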

Thank you!