# **Labs 1 and 2 PySpark:**

In these labs we will be using the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022 for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).


The CSV file that we will be using in these labs is **PatientInfo.csv**.

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

### Import pyspark and check its version

In [1]:
import pyspark


In [4]:
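# Requires the SparkSession created in the next section (this cell was run after it).
print(spark.version)
# print(spark.sparkContext.version)  # same value, read from the SparkContext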

3.1.2


### Import and create SparkSession

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lab1-2').getOrCreate()

### Load the PatientInfo.csv file and show the first 5 rows

In [None]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [3]:
df = spark.read.csv('PatientInfo.csv',header=True)

In [7]:
df.show(5,truncate=False)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|sex   |age|country|province|city       |infection_case      |infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state   |
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|male  |50s|Korea  |Seoul   |Gangseo-gu |overseas inflow     |null       |75            |2020-01-22        |2020-01-23    |2020-02-05   |null         |released|
|1000000002|male  |30s|Korea  |Seoul   |Jungnang-gu|overseas inflow     |null       |31            |null              |2020-01-30    |2020-03-02   |null         |released|
|1000000003|male  |50s|Korea  |Seoul   |Jongno-gu  |contact with patient|2002000001 |17            |null              |2020-01-30    |2020-0

### Display the schema of the dataset

In [8]:
df.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: string (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: string (nullable = true)
 |-- state: string (nullable = true)



### Display the statistical summary

In [9]:
df.describe().show(truncate=False)

+-------+--------------------+------+----+----------+--------+--------------+--------------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|summary|patient_id          |sex   |age |country   |province|city          |infection_case            |infected_by         |contact_number      |symptom_onset_date|confirmed_date|released_date|deceased_date|state   |
+-------+--------------------+------+----+----------+--------+--------------+--------------------------+--------------------+--------------------+------------------+--------------+-------------+-------------+--------+
|count  |5165                |4043  |3785|5165      |5165    |5071          |4246                      |1346                |791                 |690               |5162          |1587         |66           |5165    |
|mean   |2.8636345618679576E9|null  |null|null      |null    |null          |null                      |2.2845944015643125E9|1.6

In [10]:
df.describe(["patient_id","sex"   ,"age" ,"country" ,  "province","city","infection_case" ,"infected_by"]).show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|
|   mean|2.8636345618679576E9|  null|null|      null|    null|          null|                null|2.2845944015643125E9|
| stddev| 2.074210725277473E9|  null|null|      null|    null|          null|                null|1.5265072953383324E9|
|    min|          1000000001|female|  0s|Bangladesh|   Busan|     Andong-si|Anyang Gunpo Past...|          1000000002|
|    max|          7000000019|  male| 90s|   Vietnam|   Ulsan|sankyeock-dong|     overseas inflow|          7000000009|
+-------+--------------------+------+---

In [11]:
df.describe(["contact_number" , "symptom_onset_date","confirmed_date","released_date","deceased_date","state"]).show()

+-------+--------------------+------------------+--------------+-------------+-------------+--------+
|summary|      contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+-------+--------------------+------------------+--------------+-------------+-------------+--------+
|  count|                 791|               690|          5162|         1587|           66|    5165|
|   mean|1.6772572523506988E7|              null|          null|         null|         null|    null|
| stddev| 3.093097580985502E8|              null|          null|         null|         null|    null|
|    min|                   -|                  |    2020-01-20|   2020-02-05|   2020-02-19|deceased|
|    max|                  95|        2020-06-28|    2020-06-30|   2020-06-28|   2020-05-25|released|
+-------+--------------------+------------------+--------------+-------------+-------------+--------+



### Using the state column.
### How many people survived (released), and how many didn't survive (isolated/deceased)?

In [12]:
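# Explicit imports avoid shadowing Python built-ins (a wildcard import would also work)
from pyspark.sql.functions import col, count, isnan, when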

In [13]:
df.groupBy('state').agg(count("patient_id").alias("people count")).show()


+--------+------------+
|   state|people count|
+--------+------------+
|isolated|        2158|
|released|        2929|
|deceased|          78|
+--------+------------+



In [14]:
print("survived people count :",(df.select("state").where(col("state") == "released")).count())
print("unSurvived people count :",df.filter((col("state") == "isolated") | (col("state") == "deceased")).count())

survived people count : 2929
unSurvived people count : 2236
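
The two totals can be cross-checked against the grouped counts shown above. The following is a minimal plain-Python sketch (no Spark needed); the dictionary simply copies the numbers from the `groupBy` output, and the variable names are illustrative.

```python
# Counts copied from the groupBy('state') output above.
state_counts = {"isolated": 2158, "released": 2929, "deceased": 78}

survived = state_counts["released"]
# Every state other than "released" is counted as "not survived" in this lab.
not_survived = sum(n for state, n in state_counts.items() if state != "released")
total = sum(state_counts.values())

print("survived:", survived)
print("not survived:", not_survived)
print("total rows:", total)
```

The survived and not-survived figures should add up to the total number of rows in the DataFrame.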


### Display the number of null values in each column

In [15]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|patient_id| sex| age|country|province|city|infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+
|         0|1122|1380|      0|       0|  94|           919|       3819|          4374|              4475|             3|         3578|         5099|    0|
+----------+----+----+-------+--------+----+--------------+-----------+--------------+------------------+--------------+-------------+-------------+-----+



In [35]:
df.count()

5165

In [16]:
print("nulls in each columns")
l=["patient_id","sex"   ,"age" ,"country" ,  "province","city","infection_case" ,"infected_by"]
l2=["contact_number" , "symptom_onset_date","confirmed_date","released_date","deceased_date","state"]
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in l]).show()
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in l2]).show()

nulls in each columns
+----------+----+----+-------+--------+----+--------------+-----------+
|patient_id| sex| age|country|province|city|infection_case|infected_by|
+----------+----+----+-------+--------+----+--------------+-----------+
|         0|1122|1380|      0|       0|  94|           919|       3819|
+----------+----+----+-------+--------+----+--------------+-----------+

+--------------+------------------+--------------+-------------+-------------+-----+
|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+--------------+------------------+--------------+-------------+-------------+-----+
|          4374|              4475|             3|         3578|         5099|    0|
+--------------+------------------+--------------+-------------+-------------+-----+



## Data preprocessing

### Fill the nulls in the deceased_date with the released_date.
- You can use the <b>coalesce</b> function

In [56]:
df.rdd.getNumPartitions()

1

In [63]:
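# DataFrame.coalesce(n) only reduces the number of partitions; it never increases them,
# so asking for 5165 leaves the single existing partition unchanged.
# It is unrelated to the column function pyspark.sql.functions.coalesce used below.
df2 = df.coalesce(5165)
#df2.count()
df2.rdd.getNumPartitions()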

1

In [17]:
# Overwrite deceased_date: keep its value where present, otherwise fall back to released_date.
# coalesce(target, fallback) returns the first non-null argument for each row.
from pyspark.sql.functions import coalesce
df=df.withColumn("deceased_date",coalesce(df.deceased_date,df.released_date))

In [18]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in l2]).show()

+--------------+------------------+--------------+-------------+-------------+-----+
|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state|
+--------------+------------------+--------------+-------------+-------------+-----+
|          4374|              4475|             3|         3578|         3514|    0|
+--------------+------------------+--------------+-------------+-------------+-----+



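In [67]:
"""
Notes: fill the nulls in column B with the value of column A from the same row.

before
A|B
0,1
2,null
3,null
4,2
after
A|B
0,1
2,2
3,3
4,2

1- coalesce(to, from) returns the first non-null argument
from pyspark.sql.functions import coalesce
df.withColumn("B",coalesce(df.B,df.A)) #to,from

2- df.na.fill(value, subset) does not work here:
it only accepts a literal value (or a dict of literals), not another column

3- when / otherwise
df1.select('A',
           when( df1.B.isNull(), df1.A).otherwise(df1.B).alias('B')
          )\
   .show()
"""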


### Add a column named no_days which is the difference between the deceased_date and the confirmed_date, then show the top 5 rows. Print the schema.
- <b>Hint: You need to typecast these columns as date first</b>

In [19]:
from pyspark.sql.functions import col
from pyspark.sql.types import DateType

df=df.withColumn("deceased_date",col("deceased_date").cast(DateType())) \
.withColumn("confirmed_date",col("confirmed_date").cast(DateType()))

In [20]:
df.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)



In [57]:
from pyspark.sql.functions import col, datediff
# datediff(end, start) returns the number of days between the two dates as an integer
df2=df.withColumn("no_days",datediff(col("deceased_date"),col("confirmed_date")))


In [58]:
df2.select(["no_days","deceased_date","confirmed_date"]).show(5)

+-------+-------------+--------------+
|no_days|deceased_date|confirmed_date|
+-------+-------------+--------------+
|     13|   2020-02-05|    2020-01-23|
|     32|   2020-03-02|    2020-01-30|
|     20|   2020-02-19|    2020-01-30|
|     16|   2020-02-15|    2020-01-30|
|     24|   2020-02-24|    2020-01-31|
+-------+-------------+--------------+
only showing top 5 rows
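
`datediff(end, start)` returns the number of days from `start` to `end` as an integer. The first rows above can be cross-checked with Python's standard `datetime` module. This is a plain-Python sketch; `days_between` is an illustrative helper, not part of PySpark.

```python
from datetime import date

def days_between(end: str, start: str) -> int:
    """Whole days from start to end, both given as ISO 'YYYY-MM-DD' strings."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# (deceased_date, confirmed_date) pairs taken from the first rows shown above.
pairs = [("2020-02-05", "2020-01-23"),
         ("2020-03-02", "2020-01-30"),
         ("2020-02-19", "2020-01-30")]
no_days = [days_between(end, start) for end, start in pairs]
print(no_days)
```

Note that 2020 is a leap year, so the second pair spans 29 February.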



In [23]:
df2.show(5,truncate=False)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+---------------+
|patient_id|sex   |age|country|province|city       |infection_case      |infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state   |no_days        |
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+---------------+
|1000000001|male  |50s|Korea  |Seoul   |Gangseo-gu |overseas inflow     |null       |75            |2020-01-22        |2020-01-23    |2020-02-05   |2020-02-05   |released|13 days        |
|1000000002|male  |30s|Korea  |Seoul   |Jungnang-gu|overseas inflow     |null       |31            |null              |2020-01-30    |2020-03-02   |2020-03-02   |released|1 months 2 days|
|1000000003|male  |50s|Korea  |Seoul   |Jongno-gu  |contact 

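In [75]:
'''
Alternative tried earlier: in this Spark version, subtracting two date columns
yields an interval (e.g. "1 months 2 days") rather than an integer,
which is why datediff is used above.

adjustedDf = df.withColumn("deceased_date", to_date(col("deceased_date"), "yyyy-MM-dd"))\
    .withColumn("confirmed_date", to_date(col("confirmed_date"), "yyyy-MM-dd"))\
    .withColumn('no_days', col("deceased_date")-col("confirmed_date"))
'''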



### Add an is_male column: True if the patient is male, otherwise False

In [59]:
df2=df2.withColumn("is_male",when( (col("sex") == "male"), True).otherwise(False)) 

In [60]:
df2.show(2,truncate=False)

+----------+----+---+-------+--------+-----------+---------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+
|patient_id|sex |age|country|province|city       |infection_case |infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|state   |no_days|is_male|
+----------+----+---+-------+--------+-----------+---------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+
|1000000001|male|50s|Korea  |Seoul   |Gangseo-gu |overseas inflow|null       |75            |2020-01-22        |2020-01-23    |2020-02-05   |2020-02-05   |released|13     |true   |
|1000000002|male|30s|Korea  |Seoul   |Jungnang-gu|overseas inflow|null       |31            |null              |2020-01-30    |2020-03-02   |2020-03-02   |released|32     |true   |
+----------+----+---+-------+--------+-----------+---------------+-----------+--------------+--

In [61]:
df2.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: integer (nullable = true)
 |-- is_male: boolean (nullable = false)



### Add an is_dead column: True if the patient's state is not released, otherwise False

- Use a <b>UDF</b> to perform this task.
- However, a UDF is only recommended when no built-in function can do the required operation.
- UDFs are slower than built-in functions.

In [62]:
from pyspark.sql.types import BooleanType,StringType

In [63]:
from pyspark.sql.functions import udf


In [29]:
# Built-in equivalent (preferred):
#df2=df2.withColumn("is_dead",when( (col("state") == "released"), False).otherwise(True))

# User Defined Function (UDF): True unless the state is "released"
def is_dead_fun(x):
    return x != "released"

deadUDF = udf(is_dead_fun, BooleanType())

df2=df2.withColumn("is_deade", deadUDF(col("state")))

In [30]:
df2.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: interval (nullable = true)
 |-- is_male: boolean (nullable = false)
 |-- is_deade: boolean (nullable = true)



In [31]:
df2.show(2)

Py4JJavaError: An error occurred while calling o396.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 1 times, most recent failure: Lost task 0.0 in stage 33.0 (TID 224) (basamoka.mshome.net executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec.evaluate(BatchEvalPythonExec.scala:81)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:130)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
	at java.net.ServerSocket.implAccept(ServerSocket.java:545)
	at java.net.ServerSocket.accept(ServerSocket.java:513)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
	... 24 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


In [32]:
df2=df2.drop("is_deade")

In [64]:
df2=df2.withColumn("is_dead",when( (col("state") == "released"), False).otherwise(True)) 

In [65]:
df2.select(["is_dead","state"]).show(5)

+-------+--------+
|is_dead|   state|
+-------+--------+
|  false|released|
|  false|released|
|  false|released|
|  false|released|
|  false|released|
+-------+--------+
only showing top 5 rows



In [66]:
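# Alternatives tried:
#df2.write.csv("PatientInfo_modified.csv")
#df2.toPandas().to_csv("PatientInfo_modified.csv")
#df2.repartition(1).write.csv("PatientInfo_modified.csv")
#df2.coalesce(1).write.csv("PatientInfo_modified.csv")

# 'com.databricks.spark.csv' is the legacy name of the built-in 'csv' source.
# The default save mode fails if the path already exists; add .mode("overwrite") to replace it.
df2.write.format('com.databricks.spark.csv').save('PatientInfo_modified.csv')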

AnalysisException: path file:/D:/iti studing core/spark/lab1-2/PatientInfo_modified.csv already exists.

In [35]:
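# Earlier attempts, kept for reference. While no_days was still an interval column,
# they failed with the same error: Cannot save interval data type into external storage.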
"""
df2.write.mode("overwrite").csv(r"D:\iti studing core\spark\lab1-2\PatientInfo_modified")
/////
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string,StringType())

df3 = df2.withColumn('state-stringified',array_to_string_udf(df2["state"]))
ll=["is_dead","is_male"]
ll.extend(l)
ll.extend(l2)
df3.drop(ll).write.csv("dd.csv")
////
df2.select(
    "age", "city", "confirmed_date", "contact_number", "country", "deceased_date", "infected_by", "infection_case", "is_dead", "is_male", "no_days", "patient_id", "province", "released_date", "sex", "state", "symptom_onset_date"
    ).write.csv("good_loc.csv")
"""


### Change the age bins from 0s, 10s, 20s, etc. to 0, 10, 20

In [37]:
df2.select("age").show(5)

+---+
|age|
+---+
|50s|
|30s|
|50s|
|20s|
|20s|
+---+
only showing top 5 rows



In [67]:
from pyspark.sql.functions import translate

df2=df2.withColumn('age', translate('age', 's', ''))

df2.select('age').show(10,truncate=False)

+---+
|age|
+---+
|50 |
|30 |
|50 |
|20 |
|20 |
|50 |
|20 |
|20 |
|30 |
|60 |
+---+
only showing top 10 rows
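
`translate('age', 's', '')` deletes every `s` character from the string, which turns a label such as `50s` into `50`; the result is still a string until it is cast to a numeric type. A plain-Python sketch of the same idea follows. `age_bin_to_number` is a hypothetical helper for illustration, with `None` standing in for null.

```python
from typing import Optional

def age_bin_to_number(label: Optional[str]) -> Optional[float]:
    """Strip the 's' suffix from an age bin such as '50s' and return it as a float."""
    if label is None:
        return None  # nulls stay null
    return float(label.replace("s", ""))

ages = [age_bin_to_number(x) for x in ["50s", "0s", "90s", None]]
print(ages)
```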



### Cast age and no_days to Double

In [9]:
from pyspark.sql.functions import col
import pyspark.sql.types as t


In [69]:
df2=df2.withColumn("age",col("age").cast(t.DoubleType())) 


In [70]:
df2=df2.withColumn("no_days",col("no_days").cast(t.DoubleType()))

In [71]:
df2.printSchema()

root
 |-- patient_id: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: string (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)
 |-- no_days: double (nullable = true)
 |-- is_male: boolean (nullable = false)
 |-- is_dead: boolean (nullable = false)



In [72]:
df2.select('no_days').show(5,truncate=False)

+-------+
|no_days|
+-------+
|13.0   |
|32.0   |
|20.0   |
|16.0   |
|24.0   |
+-------+
only showing top 5 rows




### Drop the columns
["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [73]:
df2=df2.drop("patient_id","sex","infected_by","contact_number","released_date","state", "symptom_onset_date","confirmed_date","deceased_date","country","no_days", "city","infection_case")

In [74]:
df2.printSchema()

root
 |-- age: double (nullable = true)
 |-- province: string (nullable = true)
 |-- is_male: boolean (nullable = false)
 |-- is_dead: boolean (nullable = false)



### Recount the number of nulls now

In [77]:
#df2.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df2.columns] ).show()
l4=["age","province"]
df2.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in l4] ).show()

+----+--------+
| age|province|
+----+--------+
|1380|       0|
+----+--------+



In [6]:
"""
df2.select([count(when(col(c).contains('None') | \
                            col(c).contains('NULL') | \
                            (col(c) == '' ) | \
                            col(c).isNull() | \
                            isnan(c), c 
                           )).alias(c)
                    for c in l4]).show()

"""

## Now do the same but using SQL select statement

### From the original Patient DataFrame, Create a temporary view (table).

In [4]:
df.createOrReplaceTempView("patients_tabel")

### Use a SELECT statement to select all columns from the view and show the output.

In [5]:
spark.sql("""SELECT * from patients_tabel """).show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|         null|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|

### *Using SQL commands*, limit the output to only 5 rows 

In [6]:
spark.sql("""SELECT * FROM patients_tabel LIMIT 5""").show()

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|    2020-01-23|   2020-02-05|         null|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|    2020-01-30|   2020-03-02|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              null|    2020-01-30|   202

### Select the count of males and females in the dataset

In [7]:
spark.sql("""SELECT sex, count(sex) FROM patients_tabel GROUP BY sex ORDER BY count(sex)""").show()

+------+----------+
|   sex|count(sex)|
+------+----------+
|  null|         0|
|  male|      1825|
|female|      2218|
+------+----------+



### How many people survived, and how many didn't?

In [8]:
spark.sql("""SELECT state, count(state) FROM patients_tabel GROUP BY state ORDER BY count(state)""").show()

+--------+------------+
|   state|count(state)|
+--------+------------+
|deceased|          78|
|isolated|        2158|
|released|        2929|
+--------+------------+



### Now, let's perform some preprocessing using SQL:
1. Convert the *age* column to double after removing the trailing 's' -- *hint: check the SUBSTRING function*
2. Select only the following columns: `['sex', 'age', 'province', 'state']`
3. Store the result of the query in a new dataframe

In [11]:
import pyspark.sql.functions as fs

In [21]:
from pyspark.sql.functions import translate

In [24]:
from pyspark.sql.functions import substring


In [18]:
spark.sql("""SELECT substring(age, 0,2)  FROM patients_tabel 
            
          """).show(5)

+--------------------+
|substring(age, 0, 2)|
+--------------------+
|                  50|
|                  30|
|                  50|
|                  20|
|                  20|
+--------------------+
only showing top 5 rows



In [22]:
spark.sql("""SELECT translate(age, 's', '') FROM patients_tabel""").show(5)

+-------------------+
|translate(age, s, )|
+-------------------+
|                 50|
|                 30|
|                 50|
|                 20|
|                 20|
+-------------------+
only showing top 5 rows



In [23]:
spark.sql("""UPDATE patients_tabel SET age=translate(age, 's', '')
          """)

Py4JJavaError: An error occurred while calling o22.sql.
: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.


In [26]:
#ALTER TABLE table_identifier ADD COLUMNS ( col_spec [ , ... ] )

In [25]:
df3=spark.sql("""SELECT
  sex
 ,age
 ,province
 ,state
FROM patients_tabel
""")
df3.show(5)

+------+---+--------+--------+
|   sex|age|province|   state|
+------+---+--------+--------+
|  male|50s|   Seoul|released|
|  male|30s|   Seoul|released|
|  male|50s|   Seoul|released|
|  male|20s|   Seoul|released|
|female|20s|   Seoul|released|
+------+---+--------+--------+
only showing top 5 rows



## Machine Learning 
### Create a pipeline model to predict is_dead and evaluate the performance.
- Use <b>StringIndexer</b> to transform <b>string</b> data type to indices.
- Use <b>OneHotEncoder</b> to deal with categorical values.
- Use <b>Imputer</b> to fill missing data with mean.