# Big Data Final
## Shahin Mammadov

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845514 sha256=7df5bed79589cf37b779d1fac89e0c1e67d47cecbd731ff161a34d9dea5d5994
  Stored in directory: /root/.cache/pip/wheels/42/59/f5/79a5bf931714dcd201b26025347785f087370a10a3329a899c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1


In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

In [7]:
spark = SparkSession.builder \
          .appName("Number of crew members") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()
  
df=spark.read.csv('cruise_ship_info.csv',inferSchema=True,header=True)

df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



Converting the string column Cruise_line to a numeric index with StringIndexer, which returns a new DataFrame with an added cruise_cat column.

In [8]:
indexer=StringIndexer(inputCol='Cruise_line',outputCol='cruise_cat')
indexed=indexer.fit(df).transform(df)
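
To make the encoding concrete: by default, StringIndexer gives index 0.0 to the most frequent label, 1.0 to the next most frequent, and so on. The plain-Python sketch below mimics that ordering without Spark. The function name frequency_index and the sample list are hypothetical, and the alphabetical tie-break is an assumption about the default behavior.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties (assumed).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample values, not rows read from cruise_ship_info.csv.
sample = ["Carnival", "Royal_Caribbean", "Carnival",
          "Azamara", "Carnival", "Royal_Caribbean"]
mapping = frequency_index(sample)
print(mapping)
```

The resulting indices are arbitrary codes, but a linear model will treat cruise_cat as an ordinary number, so the ordering does influence the fit.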

Assembling the seven feature columns into a single vector column, features, and showing it next to the label column, crew.

In [9]:
assembler=VectorAssembler(inputCols=['Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'cruise_cat'],outputCol='features')
output=assembler.transform(indexed)
output.select('features','crew').show(5)

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
+--------------------+----+
only showing top 5 rows



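Splitting the data into training and test sets in an approximate 80:20 ratio. randomSplit is called without a seed, so the rows in each set, and the numbers below, change from run to run.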

In [10]:
final_data=output.select('features','crew')
train_data,test_data=final_data.randomSplit([0.8,0.2])
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               124|
|   mean| 7.802661290322591|
| stddev|3.6656553909660734|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+
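
The split sizes are only approximately 80:20 because each row is assigned to a set independently at random. The plain-Python sketch below illustrates that idea and shows how a fixed seed makes the split repeatable. The function random_split and the stand-in rows are hypothetical, and this is not Spark's actual sampler.

```python
import random

def random_split(rows, train_fraction=0.8, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One draw per row, so the set sizes are approximate, not exact.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(100))  # hypothetical stand-in for DataFrame rows
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)
print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```

In the notebook itself, the analogous change would be passing a seed, for example final_data.randomSplit([0.8,0.2], seed=42), where 42 is an arbitrary choice.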



Creating a LinearRegression estimator and fitting it on the training data.

In [11]:
ship_lr=LinearRegression(featuresCol='features',labelCol='crew')
trained_ship_model=ship_lr.fit(train_data)
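
The fitted model exposes its parameters as trained_ship_model.coefficients and trained_ship_model.intercept. As a rough sketch of what a linear regression model computes for one row, the prediction is the intercept plus the dot product of the coefficients and the feature vector. The function predict_crew and all numbers below are hypothetical, not values from this fit.

```python
def predict_crew(features, coefficients, intercept):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical numbers for illustration only.
example = predict_crew([1.0, 2.0], [0.5, 0.25], 1.0)
print(example)
```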

Evaluating the model on the training data. The R-squared below measures in-sample fit, not performance on unseen ships.

In [12]:
ship_results=trained_ship_model.evaluate(train_data)
print('R-squared :',ship_results.r2)

R-squared : 0.9353659513220284
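
For reference, R-squared is a goodness-of-fit score rather than an error: it is 1 minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below shows that definition. The function r_squared and the sample values are hypothetical and are not taken from this dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical labels for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)       # predictions match the labels
baseline = r_squared(actual, [2.5] * 4)   # always predict the mean
print(perfect, baseline)
```

Because the cell above scores the model on train_data, a held-out estimate in the notebook would come from the analogous call trained_ship_model.evaluate(test_data).r2.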


Predicting crew size for the test set, using only the features column with the labels dropped.

In [13]:
unlabeled_data=test_data.select('features')
predictions=trained_ship_model.transform(unlabeled_data)
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[6.0,112.0,38.0,9...|11.412998237912813|
|[8.0,110.0,29.74,...|12.119792251482512|
|[9.0,88.5,21.24,9...|  9.50212887029444|
|[9.0,90.09,25.01,...| 9.251100953816968|
|[9.0,105.0,27.2,8...|11.282565591701264|
|[10.0,68.0,10.8,7...| 6.561346194347827|
|[11.0,90.0,22.4,9...|  9.97288808790399|
|[12.0,91.0,20.32,...| 9.166838570283591|
|[13.0,25.0,3.82,5...|2.9809299679804218|
|[13.0,63.0,14.4,7...| 6.657408579290374|
|[13.0,138.0,31.14...|13.041381458193513|
|[14.0,30.27699999...| 3.418480873984252|
|[14.0,77.104,20.0...| 8.710771403726774|
|[14.0,83.0,17.5,9...|  9.15541545905515|
|[14.0,138.0,31.14...|13.030971246030663|
|[15.0,30.27699999...|3.9256704457582368|
|[15.0,70.367,20.5...| 8.559052459859284|
|[15.0,70.367,20.5...| 8.559052459859284|
|[15.0,75.33800000...| 8.679989022937857|
|[16.0,59.652,13.2...| 6.247927040632935|
+--------------------+------------