# Linear Regression Project

* Question: Build a model and use it to predict how many crew members a ship needs
* Data description <br>

    Variables/Columns  
    Ship Name     1-20  
    Cruise Line   21-40  
    Age (as of 2013)   46-48  
    Tonnage (1000s of tons)   50-56  
    passengers (100s)   58-64  
    Length (100s of feet)  66-72  
    Cabins  (100s)   74-80  
    Passenger Density   82-88  
    Crew  (100s)   90-96  
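
The ranges after each variable read most naturally as 1-based, inclusive character positions in a fixed-width text record. As a minimal sketch under that assumption, the positions can be turned into Python slices. The sample line is not taken from a raw file: it is built by a hypothetical `build_record` helper from the first row that `df.show()` prints in this notebook, and `parse_record` is likewise an illustrative name.

```python
# Column positions (1-based, inclusive) copied from the data description.
FIELDS = [
    ("Ship_name", 1, 20, str),
    ("Cruise_line", 21, 40, str),
    ("Age", 46, 48, int),
    ("Tonnage", 50, 56, float),
    ("passengers", 58, 64, float),
    ("length", 66, 72, float),
    ("cabins", 74, 80, float),
    ("passenger_density", 82, 88, float),
    ("crew", 90, 96, float),
]


def parse_record(line):
    """Slice one fixed-width line into a dict of typed values."""
    record = {}
    for name, start, end, convert in FIELDS:
        raw = line[start - 1:end].strip()  # convert 1-based inclusive to a slice
        record[name] = convert(raw)
    return record


def build_record(values):
    """Hypothetical helper: lay values out at the documented positions."""
    chars = [" "] * 96
    for (_, start, end, _), value in zip(FIELDS, values):
        width = end - start + 1
        text = str(value)
        # Strings are left-justified, numbers right-justified (an assumption).
        padded = text.ljust(width) if isinstance(value, str) else text.rjust(width)
        chars[start - 1:end] = padded
    return "".join(chars)


sample = build_record(["Journey", "Azamara", 6, 30.277, 6.94, 5.94, 3.55, 42.64, 3.55])
parsed = parse_record(sample)
print(parsed)
```

The notebook itself reads the data from a SQL table, so this parsing step only matters when starting from the raw text file.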
    
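Mission: build a regression model that helps predict how many crew members future ships will need.  
The client has also found that the acceptable number of crew members differs between particular cruise lines, and noted that the cruise line is therefore an important feature to include in the analysis.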

In [2]:
import repackage
repackage.up(2)

from configuration import make_engine
import pandas as pd

In [4]:
engine = make_engine()

In [7]:
query = "SELECT * FROM cruise_ship_info"
cruise_ship_info = pd.read_sql(query, con=engine)

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("cruise").getOrCreate()

In [19]:
df = spark.createDataFrame(cruise_ship_info)
df.show()

+-----------+-----------+---+-------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+-------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6| 30.277|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6| 30.277|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26| 47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|  110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22| 70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15| 70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 70.367|     20.56|  8.55| 10.22|            34.23| 9.2|
|Fascination|   Carnival| 19| 70.367|     20.52|  8.55

In [14]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



Summary statistics for this DataFrame are as follows.

In [28]:
df.describe().show()

+-------+---------+-----------+------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|        passengers|            length|           cabins|passenger_density|              crew|
+-------+---------+-----------+------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+
|  count|      158|        158|               158|               158|               158|               158|              158|              158|               158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607594|18.457405063291148| 8.130632911392405|8.830000000000002|39.90094936708861|7.7941772151898725|
| stddev|     null|       null| 7.615691058751412|37.229540025907866| 9.677094775143413|1.7934735480548247|4.471417222148062|8.639217113915418| 3.503486564627033|
|    min|Adventure|   

The output above is hard to read, so let's improve its readability with the `format_number` function.

In [27]:
from pyspark.sql.functions import format_number

In [29]:
df.describe().columns

['summary',
 'Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']

In [32]:
summary = df.describe()
summary.select(
    summary['summary'],
    summary['Ship_name'],
    summary['Cruise_line'],
    format_number(summary['Age'].cast('float'), 2).alias('Age'),
    format_number(summary['Tonnage'].cast('float'), 2).alias('Tonnage'),
    format_number(summary['passengers'].cast('float'), 2).alias('passengers'),
    format_number(summary['length'].cast('float'), 2).alias('length'),
    format_number(summary['cabins'].cast('float'), 2).alias('cabins'),
    format_number(summary['passenger_density'].cast('float'), 2).alias('passenger_density'),
    format_number(summary['crew'].cast('float'), 2).alias('crew')
).show()

+-------+---------+-----------+------+-------+----------+------+------+-----------------+------+
|summary|Ship_name|Cruise_line|   Age|Tonnage|passengers|length|cabins|passenger_density|  crew|
+-------+---------+-----------+------+-------+----------+------+------+-----------------+------+
|  count|      158|        158|158.00| 158.00|    158.00|158.00|158.00|           158.00|158.00|
|   mean| Infinity|       null| 15.69|  71.28|     18.46|  8.13|  8.83|            39.90|  7.79|
| stddev|     null|       null|  7.62|  37.23|      9.68|  1.79|  4.47|             8.64|  3.50|
|    min|Adventure|    Azamara|  4.00|   2.33|      0.66|  2.79|  0.33|            17.70|  0.59|
|    max|Zuiderdam|   Windstar| 48.00| 220.00|     54.00| 11.82| 27.00|            71.43| 21.00|
+-------+---------+-----------+------+-------+----------+------+------+-----------------+------+
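
`format_number(col, d)` rounds to `d` decimal places, adds thousands separators, and returns a string column rather than a numeric one. A rough pure-Python analogue is sketched below, assuming Python's `,.2f` format spec is close enough for illustration. The `fmt` helper is hypothetical and not part of PySpark; the input values are the unformatted means printed by `df.describe().show()`.

```python
def fmt(value, decimals=2):
    """Format a number with thousands separators and a fixed number of decimals."""
    return f"{value:,.{decimals}f}"


# Unformatted means copied from the describe() output.
means = {
    "Age": 15.689873417721518,
    "Tonnage": 71.28467088607594,
    "crew": 7.7941772151898725,
}

formatted = {name: fmt(value) for name, value in means.items()}
print(formatted)
```

Because the result is text, formatted columns suit display only; keep the original numeric columns for any further computation.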



## Dealing with the Cruise_line categorical variable
Ship name is an arbitrary string with no predictive value, but Cruise_line may be useful, so let's turn it into a categorical variable.

In [34]:
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



To turn a string column into a categorical variable, use `pyspark.ml.feature.StringIndexer`.

In [41]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Cruise_line', outputCol='cruise_cat')
indexed = indexer.fit(df).transform(df)
# indexed.take(5)
indexed.show()

+-----------+-----------+---+-------+----------+------+------+-----------------+----+----------+
|  Ship_name|Cruise_line|Age|Tonnage|passengers|length|cabins|passenger_density|crew|cruise_cat|
+-----------+-----------+---+-------+----------+------+------+-----------------+----+----------+
|    Journey|    Azamara|  6| 30.277|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|      Quest|    Azamara|  6| 30.277|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|Celebration|   Carnival| 26| 47.262|     14.86|  7.22|  7.43|             31.8| 6.7|       1.0|
|   Conquest|   Carnival| 11|  110.0|     29.74|  9.53| 14.88|            36.99|19.1|       1.0|
|    Destiny|   Carnival| 17|101.353|     26.42|  8.92| 13.21|            38.36|10.0|       1.0|
|    Ecstasy|   Carnival| 22| 70.367|     20.52|  8.55|  10.2|            34.29| 9.2|       1.0|
|    Elation|   Carnival| 15| 70.367|     20.52|  8.55|  10.2|            34.29| 9.2|       1.0|
|    Fantasy|   Carnival| 23| 
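
By default `StringIndexer` orders labels by descending frequency, so the most common cruise line receives index 0.0. The sketch below reproduces that ordering in plain Python from the `groupBy('Cruise_line').count()` table. It assumes ties are broken alphabetically, which matches the indices shown in the output (Carnival 1.0, Azamara 16.0); `frequency_desc_index` is a hypothetical helper, not a PySpark function.

```python
# Counts copied from the groupBy('Cruise_line').count() output.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(label_counts):
    """Map each label to a float index: most frequent first, ties alphabetical."""
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_cat = frequency_desc_index(counts)
print(cruise_cat["Royal_Caribbean"], cruise_cat["Carnival"], cruise_cat["Azamara"])
```

The resulting index is only a frequency rank. Feeding it to a linear model as a number treats that rank as a quantity, which is one reason one-hot encoding is a common alternative for categorical features.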

In [42]:
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler

In [43]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

The feature columns are combined into a single vector column with VectorAssembler.

In [52]:
assembler = VectorAssembler(
    inputCols=['Age',
               'Tonnage',
               'passengers',
               'length',
               'cabins',
               'passenger_density',
               'cruise_cat'],
    outputCol="features"
)

In [53]:
output = assembler.transform(indexed)

In [54]:
output.select("features", "crew").show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.277,6.94,...|3.55|
|[6.0,30.277,6.94,...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.239,37.0...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows
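
Conceptually, the assembler concatenates the listed input columns, in order, into one numeric vector per row. A minimal pure-Python sketch of that idea follows, using the first row of the indexed DataFrame; `assemble` is a hypothetical helper and a plain list stands in for Spark's vector type.

```python
# Same column order as the inputCols passed to VectorAssembler.
FEATURE_COLS = [
    "Age", "Tonnage", "passengers", "length",
    "cabins", "passenger_density", "cruise_cat",
]


def assemble(row, input_cols=FEATURE_COLS):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the indexed DataFrame (ship "Journey").
journey = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.277, "passengers": 6.94, "length": 5.94, "cabins": 3.55,
    "passenger_density": 42.64, "crew": 3.55, "cruise_cat": 16.0,
}

features = assemble(journey)
print(features)
```

The label column `crew` and the raw string columns are deliberately left out of the vector, mirroring the `inputCols` list in the notebook.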



In [56]:
final_data = output.select("features", "crew")

Split the data into 70% training and 30% test.

In [58]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
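
`randomSplit` assigns rows at random, so the 70/30 proportions are only approximate and the split changes from run to run unless a seed is supplied (PySpark's `randomSplit` accepts an optional `seed` argument). The pure-Python sketch below mimics the idea with one random draw per row. It is an illustration under that assumption, not Spark's actual sampling code, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.7, seed=None):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)
print(len(train_a), len(test_a))
```

In the notebook the equivalent would be `final_data.randomSplit([0.7, 0.3], seed=42)`, where the seed value 42 is arbitrary. Fixing it makes the coefficients and metrics further down repeatable.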

Import the machine learning model.

In [59]:
from pyspark.ml.regression import LinearRegression

# Create a Linear Regression Model object
lr = LinearRegression(labelCol='crew')

In [60]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [63]:
# print the coefficients and intercept for linear regression
coef = lrModel.coefficients
icept = lrModel.intercept
print(f"Coefficients: {coef} \nIntercept: {icept}")

Coefficients: [-0.013212300187179156,0.013439477517109303,-0.15972137899026934,0.41712345581937577,0.8489020304260144,-0.009166958930580426,0.06371581235596006] 
Intercept: -0.8342786207559152
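
A fitted linear regression predicts the label as the intercept plus the dot product of the coefficients with the feature vector. The sketch below applies the coefficients printed above to the first row of the data (ship "Journey"). The numbers come from this particular run and would differ with another random split; `predict` is a hypothetical helper rather than the PySpark API.

```python
# Coefficients and intercept copied from the printed model output.
coefficients = [
    -0.013212300187179156,  # Age
    0.013439477517109303,   # Tonnage
    -0.15972137899026934,   # passengers
    0.41712345581937577,    # length
    0.8489020304260144,     # cabins
    -0.009166958930580426,  # passenger_density
    0.06371581235596006,    # cruise_cat
]
intercept = -0.8342786207559152

# Feature vector of the first row, in the same order as the assembler inputs.
features = [6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]


def predict(features, coefficients, intercept):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


predicted_crew = predict(features, coefficients, intercept)
print(round(predicted_crew, 2))
```

For this row the prediction comes out near 4.5 (in hundreds of crew), against a recorded value of 3.55. The largest coefficient belongs to `cabins`.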


Evaluate the model on the test data with `evaluate`.

In [64]:
test_results = lrModel.evaluate(test_data)

In [65]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))

RMSE: 0.546254296608282
MSE: 0.2983937565630088
R2: 0.9671733267796191
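
The three metrics are closely related: MSE is the mean of the squared residuals, RMSE is its square root, and $R^{2}$ compares the residual sum of squares with the total variation of the label. The sketch below spells out these standard definitions in plain Python. The small `y_true` and `y_pred` lists are made-up illustration values, not results from the notebook, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE and R^2 using the usual textbook definitions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"rmse": math.sqrt(mse), "mse": mse, "r2": 1 - ss_res / ss_tot}


# Made-up values purely for illustration.
y_true = [3.0, 6.0, 9.0, 12.0]
y_pred = [3.5, 5.5, 9.5, 11.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the values reported above: RMSE squared equals MSE.
reported_rmse = 0.546254296608282
reported_mse = 0.2983937565630088
print(abs(reported_rmse ** 2 - reported_mse) < 1e-9)
```

Since `crew` is measured in hundreds, an RMSE of about 0.55 corresponds to a typical error of roughly 55 crew members.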


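The $R^{2}$ value is about 0.97, which is quite high. To see where this comes from, look at how strongly crew correlates with individual features: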

In [66]:
from pyspark.sql.functions import corr

In [71]:
df.select(corr("crew", "passengers"))\
        .withColumnRenamed("corr(crew, passengers)", "corr")\
        .show()

+------------------+
|              corr|
+------------------+
|0.9152341306065387|
+------------------+



In [72]:
df.select(corr("crew", "cabins"))\
        .withColumnRenamed("corr(crew, cabins)", "corr")\
        .show()

+----------------+
|            corr|
+----------------+
|0.95082260635785|
+----------------+
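
`corr` computes the Pearson correlation coefficient by default: the covariance of the two columns divided by the product of their standard deviations. A minimal pure-Python sketch of that formula follows. It uses only the first eight complete rows printed by `df.show()`, so it will not reproduce the values above, which are based on all 158 ships; `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# passengers and crew for the first eight rows shown by df.show().
passengers = [6.94, 6.94, 14.86, 29.74, 26.42, 20.52, 20.52, 20.56]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2, 9.2]

r = pearson(passengers, crew)
print(r)
```

Both correlations reported above exceed 0.9, which is consistent with the high $R^{2}$: crew size rises almost in step with the number of passengers and cabins.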

