# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("linear_regression_project").getOrCreate()

In [3]:
from pyspark.ml.regression import LinearRegression

In [4]:
df = spark.read.csv("hdfs:///...cruise_ship_info.csv",
                     inferSchema = True, header=True)

In [5]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)

In [6]:
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [26]:
df.head(10)

[Row(Ship_name=u'Journey', Cruise_line=u'Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55), Row(Ship_name=u'Quest', Cruise_line=u'Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55), Row(Ship_name=u'Celebration', Cruise_line=u'Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7), Row(Ship_name=u'Conquest', Cruise_line=u'Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1), Row(Ship_name=u'Destiny', Cruise_line=u'Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0), Row(Ship_name=u'Ecstasy', Cruise_line=u'Carnival', Age=22, Tonnage=70.367, passengers=20.52, length=8.55, cabins=10.2, passenger_density=34.29, crew=9.2), Row(Ship_name=u'Elation', Cruise_line=u'Car

In [29]:
for ship in df.head(5):
    print(ship)
    print("\n")

Row(Ship_name=u'Journey', Cruise_line=u'Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name=u'Quest', Cruise_line=u'Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name=u'Celebration', Cruise_line=u'Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)


Row(Ship_name=u'Conquest', Cruise_line=u'Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1)


Row(Ship_name=u'Destiny', Cruise_line=u'Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0)

In [30]:
df.groupBy("Cruise_line").count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+

## Convert category into index (numerical value)

In [9]:
# make cruise line string into categorical index
from pyspark.ml.feature import StringIndexer

indexed_df = StringIndexer(inputCol = "Cruise_line", 
                        outputCol="Cruise_index").fit(df).transform(df)

In [10]:
indexed_df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|         1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|         1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|         1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|         1.0|
|    Elati
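
The indices above follow from how `StringIndexer` assigns labels: by default it orders categories by descending frequency, so the most frequent cruise line gets index 0.0. Below is a minimal plain-Python sketch of that ordering, using the counts from the `groupBy` output above. The `frequency_index` helper is hypothetical, and the alphabetical tie-break is an assumption (Spark's tie-breaking can differ between versions).

```python
# Counts copied from the groupBy("Cruise_line").count() output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}

def frequency_index(label_counts):
    """Hypothetical helper: map each label to a float index, most frequent first."""
    # Sort by descending count; break ties alphabetically (an assumption).
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

cruise_index = frequency_index(counts)
print(cruise_index["Royal_Caribbean"], cruise_index["Carnival"])
```

Because the index is only a frequency rank, the regression later treats it as an ordinary number. Keep that in mind when interpreting its coefficient.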

# Setting up DataFrame for Machine Learning

In [12]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [13]:
assembler = VectorAssembler(inputCols = ["Tonnage", 
                                         "passengers", 
                                         "cabins",
                                        "Cruise_index"],
                            outputCol = "features")

In [28]:
# Transform indexed_df, not df, because it contains the Cruise_index column.
transformed_data = assembler.transform(indexed_df)

In [29]:
transformed_data.select("features", "crew").show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[30.2769999999999...|3.55|
|[30.2769999999999...|3.55|
|[47.262,14.86,7.4...| 6.7|
|[110.0,29.74,14.8...|19.1|
|[101.353,26.42,13...|10.0|
|[70.367,20.52,10....| 9.2|
|[70.367,20.52,10....| 9.2|
|[70.367,20.56,10....| 9.2|
|[70.367,20.52,10....| 9.2|
|[110.238999999999...|11.5|
|[110.0,29.74,14.8...|11.6|
|[46.052,14.52,7.2...| 6.6|
|[70.367,20.52,10....| 9.2|
|[70.367,20.52,10....| 9.2|
|[86.0,21.24,10.62...| 9.3|
|[110.0,29.74,14.8...|11.6|
|[88.5,21.24,10.62...|10.3|
|[70.367,20.52,10....| 9.2|
|[88.5,21.24,11.62...| 9.3|
|[70.367,20.52,10....| 9.2|
+--------------------+----+
only showing top 20 rows
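
Conceptually, `VectorAssembler` concatenates the chosen columns of each row, in the order given by `inputCols`, into a single vector. Below is a plain-Python sketch of that idea using the first row shown earlier. The `assemble` helper is hypothetical and only stands in for the Spark transformer; it is not Spark's API.

```python
input_cols = ["Tonnage", "passengers", "cabins", "Cruise_index"]

# First row of indexed_df, copied from the output shown earlier.
row = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_index": 16.0,
}

def assemble(record, columns):
    """Hypothetical stand-in: collect the selected columns into one list."""
    return [float(record[name]) for name in columns]

features = assemble(row, input_cols)
print(features)
```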

In [30]:
model_data = transformed_data.select("features", "crew")

In [31]:
(train_data, test_data) = model_data.randomSplit([0.7, 0.3])

In [32]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               120|
|   mean|           7.76525|
| stddev|3.5520700468384714|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+
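
Note that the training set holds 120 of the 158 rows, roughly 76% rather than exactly 70%. `randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximate, and without a seed they change from run to run. Below is a rough plain-Python sketch of that behaviour. The `random_split` helper and the seed value are illustrative assumptions, not Spark's implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Illustrative helper: send each row to train with probability train_weight."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # stand-ins for the 158 ships
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # sizes are near 70/30 but not exact
```

In PySpark, the analogous knob is the optional `seed` argument of `randomSplit`, which makes the split reproducible across runs.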

## Make Model

In [33]:
# create linear regression model object
lr = LinearRegression(labelCol = "crew")

In [34]:
# Fit the model to the training data and store the fitted model as lrModel
lrModel = lr.fit(train_data)

In [35]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [0.0242189465605,-0.152448332961,0.907057096827,0.0451451781922] Intercept: 0.680324626225
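
A fitted linear regression predicts by adding the intercept to the dot product of the coefficients and the feature vector. Below is a minimal sketch using the numbers printed above and one feature vector that appears in full in the prediction table further down (the `predict` helper is hypothetical). The result is not expected to match that table exactly: the cell numbers suggest parts of the notebook were re-run, and each unseeded `randomSplit` yields a slightly different model.

```python
# Values copied from the printed output above.
coefficients = [0.0242189465605, -0.152448332961, 0.907057096827, 0.0451451781922]
intercept = 0.680324626225

def predict(features):
    """Hypothetical helper: intercept + sum(coefficient * feature)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Feature order: Tonnage, passengers, cabins, Cruise_index.
crew_hundreds = predict([38.0, 7.49, 3.96, 3.0])
print(round(crew_hundreds, 2))  # crew is measured in hundreds
```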

### Evaluate

In [22]:
# Evaluate on the held-out test data to get summary statistics (RMSE, R2, residuals)
test_results = lrModel.evaluate(test_data)

In [23]:
test_results.rootMeanSquaredError

0.839594599951821

In [24]:
# An R2 this high looks suspicious at first, but it is plausible here:
# several features (e.g. cabins, passengers) are strongly correlated with crew.

test_results.r2

0.9272247492370356

In [37]:

test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| -1.8407250250156704|
| -0.6866724815307287|
|0.009825325005489383|
|  0.3605277433566414|
|-0.01570126046498821|
|-0.13891629631967994|
|-0.13672111643644413|
| -1.2550623953487747|
|  0.7444960346300222|
| -1.0896967284812895|
| -1.9307250250156702|
|  -1.017606709381445|
| -0.3138878980915827|
|  -0.840910460688594|
| 0.39293577855625905|
|-0.13891629631967994|
|  1.0634777219518075|
|-0.40011576615990485|
| -0.7986637937735797|
| -0.8467774334507521|
+--------------------+
only showing top 20 rows

In [38]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 0.839594599952
MSE: 0.704919092268
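
These metrics are all derived from the residuals (label minus prediction). MSE is the mean of the squared residuals, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. Below is a small plain-Python sketch with made-up labels and predictions. The numbers and the `regression_metrics` helper are illustrative only and are not taken from the data set.

```python
import math

def regression_metrics(labels, predictions):
    """Illustrative helper: return (mse, rmse, r2) for paired labels and predictions."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    ss_residual = sum(r * r for r in residuals)
    mse = ss_residual / len(residuals)
    rmse = math.sqrt(mse)
    mean_label = sum(labels) / len(labels)
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - ss_residual / ss_total
    return mse, rmse, r2

# Made-up values for illustration.
labels = [3.5, 6.7, 19.1, 10.0, 9.2]
predictions = [4.0, 6.0, 17.5, 11.0, 9.0]
mse, rmse, r2 = regression_metrics(labels, predictions)
print(mse, rmse, r2)

# Consistency check on the values reported above: RMSE is the square root of MSE.
print(math.isclose(math.sqrt(0.704919092268), 0.839594599952, rel_tol=1e-6))
```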

In [40]:
# Check the correlation between the label and individual features; values close to 1 indicate a strong positive correlation

from pyspark.sql.functions import corr

df.select(corr("crew", "cabins")).show()
df.select(corr("crew", "passengers")).show()
# ...

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+
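
`corr` computes the Pearson correlation coefficient. Below is a plain-Python sketch of that formula, applied to the `cabins` and `crew` values of the first seven rows shown earlier. The `pearson` helper is hypothetical, and seven rows are far too few to reproduce the full-data figure above.

```python
import math

def pearson(xs, ys):
    """Hypothetical helper: Pearson correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# cabins and crew for the first seven ships in df.show().
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]
r = pearson(cabins, crew)
print(r)
```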

# Applying model on data to see those predictions

Take the features from the test data and use the model to generate predictions from them.

In [60]:
unlabeled_data = test_data.select("features")

In [61]:
predictions = lrModel.transform(unlabeled_data)

In [62]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[5.35,1.58,0.74,1...|2.0240807080685204|
|[10.0,2.08,1.04,1...| 2.270795701209788|
|[16.8,2.96,1.48,1...| 2.589600873644948|
|[16.852,9.52,3.83...|3.4904817369216885|
|[22.08,8.26,4.25,...|4.6985393076699005|
|[25.0,3.82,1.94,1...|3.0675422291753147|
|[25.0,3.88,1.94,1...|3.0615871700681527|
|[30.2769999999999...| 4.069554415267965|
|[33.0,4.9,2.45,10.0]| 3.503352712263592|
|[35.143,12.5,5.32...| 4.798701319186012|
|[38.0,7.49,3.96,3.0]| 4.143827277300444|
|[38.0,10.56,5.28,...| 4.880263215383586|
|[44.348,12.0,6.0,...| 5.341159758825772|
|[46.052,14.52,7.2...| 6.027872994636762|
|[50.0,7.0,3.54,10.0]| 4.564861248644508|
|[50.76,17.48,8.74...| 7.125486032211407|
|[53.872,14.94,7.4...| 6.456913917962715|
|[55.451,12.66,6.3...|5.8714290382392225|
|[58.6,15.66,7.83,...|6.9908772644481365|
|[59.652,13.2,6.6,...| 6.132137708230873|
+--------------------+------------------+

# The prediction column shows how many crew members each ship will need, in hundreds