# **PySpark**

### **1. Preparation (Set up Spark Session)**

In [1]:
# Create spark session
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DataFrame').getOrCreate()

23/05/15 13:07:18 WARN Utils: Your hostname, Wahyus-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 172.16.28.60 instead (on interface en0)
23/05/15 13:07:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/15 13:07:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

23/05/15 13:07:29 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


### **2. Import Dataset**

In [4]:
df_pyspark = spark.read.csv('Machining_Data_Full.csv')
df_pyspark.show(4)

+------+-------------+---------+------------+----+--------+
|   _c0|          _c1|      _c2|         _c3| _c4|     _c5|
+------+-------------+---------+------------+----+--------+
|T_awal|Spindle_Speed|Feed_Rate|Depth_of_Cut|Time|T_Output|
|  24.8|          800|       50|           3|   1|    24.8|
|  24.8|          800|       50|           3|   2|    26.6|
|  24.8|          800|       50|           3|   3|    26.6|
+------+-------------+---------+------------+----+--------+
only showing top 4 rows



The output above shows that the column names are wrong: Spark assigned the default names `_c0` to `_c5` and treated the header row as a data row.

In [6]:
# Read the CSV again, using the first row as the column names
df_pyspark = spark.read.option('header', 'true').csv('Machining_Data_Full.csv')
df_pyspark.show(4)

+------+-------------+---------+------------+----+--------+
|T_awal|Spindle_Speed|Feed_Rate|Depth_of_Cut|Time|T_Output|
+------+-------------+---------+------------+----+--------+
|  24.8|          800|       50|           3|   1|    24.8|
|  24.8|          800|       50|           3|   2|    26.6|
|  24.8|          800|       50|           3|   3|    26.6|
|  24.8|          800|       50|           3|   4|    26.6|
+------+-------------+---------+------------+----+--------+
only showing top 4 rows



The output above shows that the columns now have the correct names.

### **3. Check Type**

In [7]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

### **4. Check the Schema of the PySpark DataFrame**

In [8]:
df_pyspark.printSchema()

root
 |-- T_awal: string (nullable = true)
 |-- Spindle_Speed: string (nullable = true)
 |-- Feed_Rate: string (nullable = true)
 |-- Depth_of_Cut: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- T_Output: string (nullable = true)



The schema above shows that every column was read as a string. We want each column to have its proper numeric type, so we read the file again with `inferSchema=True`.

In [9]:
df_pyspark = spark.read.csv('Machining_Data_Full.csv', header=True, inferSchema=True)
df_pyspark.show(4)

+------+-------------+---------+------------+----+--------+
|T_awal|Spindle_Speed|Feed_Rate|Depth_of_Cut|Time|T_Output|
+------+-------------+---------+------------+----+--------+
|  24.8|          800|       50|           3|   1|    24.8|
|  24.8|          800|       50|           3|   2|    26.6|
|  24.8|          800|       50|           3|   3|    26.6|
|  24.8|          800|       50|           3|   4|    26.6|
+------+-------------+---------+------------+----+--------+
only showing top 4 rows



In [10]:
df_pyspark.printSchema()

root
 |-- T_awal: double (nullable = true)
 |-- Spindle_Speed: integer (nullable = true)
 |-- Feed_Rate: integer (nullable = true)
 |-- Depth_of_Cut: integer (nullable = true)
 |-- Time: integer (nullable = true)
 |-- T_Output: double (nullable = true)



The schema above shows that each column now has the correct type.

### **5. Check Descriptive Statistics of the Data**

In [12]:
df_pyspark.describe().show()

23/05/15 13:27:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+------------------+------------------+------------------+-----------------+------------------+------------------+
|summary|            T_awal|     Spindle_Speed|         Feed_Rate|     Depth_of_Cut|              Time|          T_Output|
+-------+------------------+------------------+------------------+-----------------+------------------+------------------+
|  count|              1516|              1516|              1516|             1516|              1516|              1516|
|   mean|23.622493403693962|1136.5435356200528|106.18073878627969|6.342348284960422| 45.30474934036939|31.779023746702016|
| stddev| 2.460374139749325|341.86007224372935|103.59793592746297|2.877627511072085|34.697934671939564| 6.817404099087923|
|    min|              20.2|               800|                50|                3|                 1|              20.2|
|    max|              31.0|              1600|               400|               10|               121|              51.5|
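+-------+------------------+------------------+------------------+-----------------+------------------+------------------+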

### **6. Column Operations**

In [15]:
# Add a column to the DataFrame
df_pyspark = df_pyspark.withColumn('Column_added', df_pyspark.Feed_Rate+2)
df_pyspark.show(4)

+------+-------------+---------+------------+----+--------+------------+
|T_awal|Spindle_Speed|Feed_Rate|Depth_of_Cut|Time|T_Output|Column_added|
+------+-------------+---------+------------+----+--------+------------+
|  24.8|          800|       50|           3|   1|    24.8|          52|
|  24.8|          800|       50|           3|   2|    26.6|          52|
|  24.8|          800|       50|           3|   3|    26.6|          52|
|  24.8|          800|       50|           3|   4|    26.6|          52|
+------+-------------+---------+------------+----+--------+------------+
only showing top 4 rows



In [16]:
# Drop column
df_pyspark = df_pyspark.drop('Column_added')
df_pyspark.show(5)

+------+-------------+---------+------------+----+--------+
|T_awal|Spindle_Speed|Feed_Rate|Depth_of_Cut|Time|T_Output|
+------+-------------+---------+------------+----+--------+
|  24.8|          800|       50|           3|   1|    24.8|
|  24.8|          800|       50|           3|   2|    26.6|
|  24.8|          800|       50|           3|   3|    26.6|
|  24.8|          800|       50|           3|   4|    26.6|
|  24.8|          800|       50|           3|   5|    26.6|
+------+-------------+---------+------------+----+--------+
only showing top 5 rows



In [17]:
# Rename a column ('Kecepatan_Gerak' is Indonesian for "movement speed")
df_pyspark = df_pyspark.withColumnRenamed('Feed_Rate', 'Kecepatan_Gerak')
df_pyspark.show(5)

+------+-------------+---------------+------------+----+--------+
|T_awal|Spindle_Speed|Kecepatan_Gerak|Depth_of_Cut|Time|T_Output|
+------+-------------+---------------+------------+----+--------+
|  24.8|          800|             50|           3|   1|    24.8|
|  24.8|          800|             50|           3|   2|    26.6|
|  24.8|          800|             50|           3|   3|    26.6|
|  24.8|          800|             50|           3|   4|    26.6|
|  24.8|          800|             50|           3|   5|    26.6|
+------+-------------+---------------+------------+----+--------+
only showing top 5 rows

