# **Lab 1: PySpark**

In this lab we will use the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022 for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).


The CSV file that we will use in this lab is **PatientInfo.csv**.

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

### Import pyspark and check its version

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pyspark
pyspark.__version__


'3.5.1'

### Import and create SparkSession

In [None]:
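import pyspark
import pyspark.sql.functions as F  # explicit alias instead of a wildcard import that shadows built-ins such as sum and max
from pyspark.sql import SparkSession

# Reuse a running session if there is one, otherwise start a new one.
spark = SparkSession.builder.appName("lab1").getOrCreate()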

### Load the PatientInfo.csv file and show the first 5 rows

In [None]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [None]:
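# header=True takes column names from the first row; inferSchema=True lets Spark infer the column types.
df=spark.read.csv("/content/drive/MyDrive/Data (1)/Data/PatientInfo.csv",header=True,inferSchema=True)
df.show(5)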

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|         NULL|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|         NULL|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|   202

### Display the schema of the dataset

In [None]:
df.printSchema()

root
 |-- patient_id: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: date (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)



### Display the statistical summary

In [None]:
df.describe()

DataFrame[summary: string, patient_id: string, sex: string, age: string, country: string, province: string, city: string, infection_case: string, infected_by: string, contact_number: string, symptom_onset_date: string, state: string]

In [None]:
df1=df.describe()
df1.show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|    5165|
|   mean|2.8636345618679576E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|2.2845944015643125E9|1.6772572523506988E7|              NULL|    NULL|
| stddev| 2.074210725277473E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|1.5265072953383324E9| 3.093097580985502E8|              N

In [None]:
df.summary().show()

+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|summary|          patient_id|   sex| age|   country|province|          city|      infection_case|         infected_by|      contact_number|symptom_onset_date|   state|
+-------+--------------------+------+----+----------+--------+--------------+--------------------+--------------------+--------------------+------------------+--------+
|  count|                5165|  4043|3785|      5165|    5165|          5071|                4246|                1346|                 791|               690|    5165|
|   mean|2.8636345618679576E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|2.2845944015643125E9|1.6772572523506988E7|              NULL|    NULL|
| stddev| 2.074210725277473E9|  NULL|NULL|      NULL|    NULL|          NULL|                NULL|1.5265072953383324E9| 3.093097580985502E8|              N

### Using the state column, how many people survived (released), and how many didn't survive (isolated/deceased)?

In [None]:
df2=df.groupBy("state").count()
df2.show()

+--------+-----+
|   state|count|
+--------+-----+
|isolated| 2158|
|released| 2929|
|deceased|   78|
+--------+-----+



### Display the number of null values in each column

In [None]:
for c in df.columns:
  print(c,df.filter(df[c].isNull()).count())

patient_id 0
sex 1122
age 1380
country 0
province 0
city 94
infection_case 919
infected_by 3819
contact_number 4374
symptom_onset_date 4475
confirmed_date 3
released_date 3578
deceased_date 5099
state 0


## Data preprocessing

### Fill the nulls in deceased_date with the released_date.
- You can use the <b>coalesce</b> function
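
As a rough plain-Python analogue (not Spark code), `coalesce` returns, row by row, the first argument that is not null. The helper name `coalesce_values` and the sample dates below are illustrative assumptions, not part of the lab.

```python
from datetime import date


def coalesce_values(*values):
    """Return the first value that is not None, or None if every value is None."""
    for value in values:
        if value is not None:
            return value
    return None


# deceased_date is missing, so released_date is used instead.
filled = coalesce_values(None, date(2020, 2, 5))
# deceased_date is present, so it is kept unchanged.
kept = coalesce_values(date(2020, 2, 23), date(2020, 3, 1))
print(filled, kept)
```

When both dates are missing the result stays null, which is why the filled column can still contain nulls afterwards.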

In [None]:
df_deceased=df.withColumn("deceased_date",pyspark.sql.functions.coalesce(df['deceased_date'],df['released_date']))
df_deceased.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|

### Add a column named no_days, which is the difference between the deceased_date and the confirmed_date, then show the top 5 rows. Print the schema.
- <b>Hint: You need to typecast these columns as date first</b>
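
A quick plain-Python sanity check of the expected day counts, using the confirmed and filled deceased dates of the first two patients in the output above. The helper `days_between` is a hypothetical stand-in that mirrors the argument order of `datediff(end, start)`; it is not a Spark function.

```python
from datetime import date


def days_between(end, start):
    """Days from start to end; None if either date is missing."""
    if end is None or start is None:
        return None
    return (end - start).days


first = days_between(date(2020, 2, 5), date(2020, 1, 23))
second = days_between(date(2020, 3, 2), date(2020, 1, 30))
print(first, second)
```

The second interval spans 29 February because 2020 was a leap year.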

In [None]:
df_deceased.printSchema()

root
 |-- patient_id: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: date (nullable = true)
 |-- released_date: date (nullable = true)
 |-- deceased_date: date (nullable = true)
 |-- state: string (nullable = true)



In [None]:
# datediff(end, start) returns the number of days from start to end.
df_with_difference=df_deceased.withColumn(
    "no_days",
    pyspark.sql.functions.datediff(df_deceased['deceased_date'],df_deceased['confirmed_date']),
)
df_with_difference.show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|

### Remove the null values of the sex column.
### Add an is_male column: true if the patient is male, false otherwise (female)

In [None]:
df4=df_with_difference.filter(df_with_difference['sex'].isNotNull())
df4.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|         

In [None]:
# The comparison already yields a boolean column: true for male, false for female.
df_male=df4.withColumn("is_male",df4['sex']=='male')
df_male.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|is_male|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|   true|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|   true|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|co

### Add an is_dead column: true if the patient's state is not released, false otherwise

- Use a <b>UDF</b> to perform this task.
- However, a UDF is only recommended when no built-in function can do the required operation.
- A UDF is slower than the built-in functions.

In [None]:
import pyspark.sql.functions as F

In [None]:
def is_state(state):
  # True for every state other than 'released' (isolated or deceased).
  return state!='released'

In [None]:
# Declare the return type; by default a UDF returns strings rather than booleans.
df_dead=df_male.withColumn("is_dead",F.udf(is_state,"boolean")(df_male['state']))
df_dead.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|is_male|is_dead|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|   True|  false|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|   True|  false|
|1000000003|  m

### Change the age bins from 0s, 10s, 20s, etc. to 0, 10, 20, etc.

In [None]:
df_with_bins=df_dead.withColumn("age",pyspark.sql.functions.regexp_replace(df_dead['age'],'s',''))
df_with_bins.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|is_male|is_dead|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|1000000001|  male| 50|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|   True|  false|
|1000000002|  male| 30|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|   True|  false|
|1000000003|  m

### Cast age and no_days to Double

In [None]:
df_change=df_with_bins.withColumn("age",df_with_bins['age'].cast("double"))
df_change.show()

+----------+------+----+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|patient_id|   sex| age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|is_male|is_dead|
+----------+------+----+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|1000000001|  male|50.0|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|     13|   True|  false|
|1000000002|  male|30.0|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|     32|   True|  false|
|100000000

In [None]:
df_change_days=df_change.withColumn("no_days",df_change['no_days'].cast("double"))
df_change_days.show()

+----------+------+----+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|patient_id|   sex| age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|no_days|is_male|is_dead|
+----------+------+----+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+-------+-------+-------+
|1000000001|  male|50.0|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|   2020-02-05|released|   13.0|   True|  false|
|1000000002|  male|30.0|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|   2020-03-02|released|   32.0|   True|  false|
|100000000

### Drop the columns
["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [None]:
drop_columns=["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [None]:
df_dropped_columns=df_change_days.drop(*drop_columns)
df_dropped_columns.show()

+----+--------+-------+-------+
| age|province|is_male|is_dead|
+----+--------+-------+-------+
|50.0|   Seoul|   True|  false|
|30.0|   Seoul|   True|  false|
|50.0|   Seoul|   True|  false|
|20.0|   Seoul|   True|  false|
|20.0|   Seoul| Female|  false|
|50.0|   Seoul| Female|  false|
|20.0|   Seoul|   True|  false|
|20.0|   Seoul|   True|  false|
|30.0|   Seoul|   True|  false|
|60.0|   Seoul| Female|  false|
|50.0|   Seoul| Female|  false|
|20.0|   Seoul|   True|  false|
|80.0|   Seoul|   True|   true|
|60.0|   Seoul| Female|  false|
|70.0|   Seoul|   True|  false|
|70.0|   Seoul|   True|  false|
|70.0|   Seoul|   True|  false|
|20.0|   Seoul|   True|  false|
|70.0|   Seoul| Female|  false|
|70.0|   Seoul| Female|  false|
+----+--------+-------+-------+
only showing top 20 rows



### Recount the number of nulls now

In [None]:
# Use c as the loop variable so the name col stays free for pyspark.sql.functions.col.
for c in df_dropped_columns.columns:
  print(c,df_dropped_columns.filter(df_dropped_columns[c].isNull()).count())

age 261
province 0
is_male 0
is_dead 0


## Now do the same but using SQL select statement

### From the original patient DataFrame, create a temporary view (table).

In [None]:
view_name="patient"
df.createOrReplaceTempView(view_name)

### Use SELECT statement to select all columns from the dataframe and show the output.

In [None]:
# Query the temporary view with SQL instead of the DataFrame API.
data_selected=spark.sql("SELECT * FROM patient")
data_selected.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|         NULL|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|         NULL|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|

### *Using SQL commands*, limit the output to only 5 rows

In [None]:
spark.sql("SELECT * FROM patient LIMIT 5").show()

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+--------------+-------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       NULL|            75|        2020-01-22|    2020-01-23|   2020-02-05|         NULL|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       NULL|            31|              NULL|    2020-01-30|   2020-03-02|         NULL|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|            17|              NULL|    2020-01-30|   202

### Select the count of males and females in the dataset

In [None]:
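# COUNT(*) also counts the rows whose sex is NULL, so the NULL group is reported too.
data_count_male_and_female=spark.sql("SELECT sex, COUNT(*) AS count FROM patient GROUP BY sex")
data_count_male_and_female.show()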

+------+-----+
|   sex|count|
+------+-----+
|  NULL| 1122|
|female| 2218|
|  male| 1825|
+------+-----+



### How many people survived, and how many didn't?

In [None]:
data_survived_1= spark.sql("select state,count(state) as count from patient group by state")
data_survived_1.show()

+--------+-----+
|   state|count|
+--------+-----+
|isolated| 2158|
|released| 2929|
|deceased|   78|
+--------+-----+



### Now, let's perform some preprocessing using SQL:
1. Convert *age* column to double after removing the 's' at the end -- *hint: check SUBSTRING method*
2. Select only the following columns: `['sex', 'age', 'province', 'state']`
3. Store the result of the query in a new dataframe
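
Before writing the query, it may help to check the parsing rule on a few labels in plain Python. This sketch assumes the column can hold labels of different widths, such as `0s` or `100s` (the latter is an assumption about the data). The helper `age_bin_to_number` is hypothetical.

```python
def age_bin_to_number(label):
    """Convert a label such as '50s' to 50.0; return None for a missing label."""
    if label is None:
        return None
    return float(label.rstrip("s"))


labels = ["0s", "50s", "100s", None]
parsed = [age_bin_to_number(x) for x in labels]
print(parsed)

# A fixed two-character prefix handles '50s' but not the other widths.
prefixes = [x[:2] for x in labels if x is not None]
print(prefixes)
```

Dropping only the trailing character works for every width, so a length-based substring is the safer choice.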

In [None]:
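# substring is 1-based; length(age) - 1 drops only the trailing 's', whatever the width of the label.
age_remove_s=spark.sql("select sex, cast(substring(age, 1, length(age) - 1) as double) as age, province, state from patient")
age_remove_s.show()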

+------+----+--------+--------+
|   sex| age|province|   state|
+------+----+--------+--------+
|  male|50.0|   Seoul|released|
|  male|30.0|   Seoul|released|
|  male|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|female|20.0|   Seoul|released|
|female|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|30.0|   Seoul|released|
|female|60.0|   Seoul|released|
|female|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|80.0|   Seoul|deceased|
|female|60.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|female|70.0|   Seoul|released|
|female|70.0|   Seoul|released|
+------+----+--------+--------+
only showing top 20 rows



In [None]:
data_clean=age_remove_s.select("sex","age","province","state")
data_clean.show()

+------+----+--------+--------+
|   sex| age|province|   state|
+------+----+--------+--------+
|  male|50.0|   Seoul|released|
|  male|30.0|   Seoul|released|
|  male|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|female|20.0|   Seoul|released|
|female|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|30.0|   Seoul|released|
|female|60.0|   Seoul|released|
|female|50.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|  male|80.0|   Seoul|deceased|
|female|60.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|70.0|   Seoul|released|
|  male|20.0|   Seoul|released|
|female|70.0|   Seoul|released|
|female|70.0|   Seoul|released|
+------+----+--------+--------+
only showing top 20 rows



In [None]:
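# Spark writes a directory of part files; overwrite lets the cell be re-run, header=True keeps the column names.
data_clean.write.mode("overwrite").csv("data_clean.csv",header=True)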

In [None]:
data_clean.distinct().show()

+------+----+-----------------+--------+
|   sex| age|         province|   state|
+------+----+-----------------+--------+
|female|40.0|     Jeollabuk-do|released|
|  male|50.0|            Ulsan|released|
|  male|30.0|      Gyeonggi-do|released|
|female|10.0|      Gyeonggi-do|isolated|
|female|50.0|     Jeollanam-do|released|
|  male|10.0|            Ulsan|released|
|  male|30.0|Chungcheongbuk-do|isolated|
|  male|30.0|     Jeollanam-do|isolated|
|female|10.0| Gyeongsangbuk-do|released|
|female|50.0| Gyeongsangbuk-do|isolated|
|female|30.0|          Jeju-do|released|
|  male|50.0|          Incheon|isolated|
|  male|20.0|          Incheon|released|
|female|60.0|          Daejeon|isolated|
|  male|80.0| Gyeongsangbuk-do|released|
|female|80.0|            Busan|deceased|
|female|70.0|            Daegu|isolated|
|  male|40.0|            Busan|released|
|female|80.0|Chungcheongnam-do|isolated|
|female|20.0|Chungcheongbuk-do|released|
+------+----+-----------------+--------+
only showing top

# Machine Learning

In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder,Imputer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
dtypes=data_clean.dtypes
dtypes

[('sex', 'string'),
 ('age', 'double'),
 ('province', 'string'),
 ('state', 'string')]

## StringIndexer

In [None]:
CatCols= [ s for (s,d) in dtypes if d=="string"]
CatCols

['sex', 'province', 'state']

In [None]:
catCols_indexed= [ s+"_indexed" for s in CatCols]
catCols_indexed

['sex_indexed', 'province_indexed', 'state_indexed']

In [None]:
stind=StringIndexer(inputCols=CatCols,outputCols=catCols_indexed,handleInvalid="keep")
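
By default `StringIndexer` orders labels by descending frequency, and `handleInvalid="keep"` sends nulls and unseen labels to one extra index after the known labels. The sketch below mimics that rule in plain Python; it is not Spark code. It uses the state counts printed earlier, and the helpers `fit_string_index` and `index_value` are hypothetical. In the pipeline the indexer is fitted on the training split only, so the actual counts are smaller.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(v for v in values if v is not None)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_value(mapping, value):
    """Look up a label; missing or unseen values get the extra index len(mapping)."""
    return mapping.get(value, float(len(mapping)))


states = ["released"] * 2929 + ["isolated"] * 2158 + ["deceased"] * 78
state_index = fit_string_index(states)
print(state_index)
print(index_value(state_index, None))
```

This matches the prediction output further down, where `isolated` is indexed as 1.0 and a missing `sex` gets index 2.0, one past the two known labels.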


## OneHotEncoder

In [None]:
catCols_ohe= [ s+"_ohe" for s in CatCols]
catCols_ohe

['sex_ohe', 'province_ohe', 'state_ohe']

In [None]:
ohe=OneHotEncoder(inputCols=catCols_indexed,outputCols=catCols_ohe)
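
`OneHotEncoder` drops the last category by default (`dropLast=True`), so a column with n categories becomes a vector of length n - 1 and the last category is encoded as all zeros. Here is a dense plain-Python sketch of that rule; Spark itself produces sparse vectors, and `one_hot` is a hypothetical helper.

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# sex: female, male, plus the extra "invalid" bucket gives 3 categories.
missing_sex = one_hot(2, 3)
# state: 3 labels plus the "invalid" bucket gives 4 categories.
isolated_state = one_hot(1, 4)
print(missing_sex, isolated_state)
```

These correspond to the sparse values `(2,[],[])` and `(3,[1],[1.0])` in the prediction output further down.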

In [None]:
numCols= [ s for (s,d) in dtypes if d!="string"]
numCols

['age']

## Imputing

In [None]:
imput=Imputer(inputCols=numCols,outputCols=numCols)
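
With its default strategy, `Imputer` replaces missing values with the column mean. Inside a pipeline that mean is computed when the pipeline is fitted, so it comes from the training data. A minimal plain-Python sketch of the idea follows; `impute_mean` and the sample ages are illustrative assumptions.

```python
from statistics import fmean


def impute_mean(values):
    """Replace None with the mean of the values that are present."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


ages = [50.0, 30.0, None, 20.0, None]
imputed = impute_mean(ages)
print(imputed)
```

This is why rows with a missing age all show the same non-round value in the prediction output further down.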

## Collect categorical and numeric columns

In [None]:
vec_Cols=catCols_ohe+numCols
vec_Cols

['sex_ohe', 'province_ohe', 'state_ohe', 'age']

In [None]:
# state_ohe is left out: state is the prediction target, so it must not be used as a feature.
final_cols=['sex_ohe', 'province_ohe','age']

## VectorAssembler

In [None]:
vecAssem=VectorAssembler(inputCols=final_cols,outputCol="features")
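
`VectorAssembler` concatenates its input columns, in order, into a single feature vector. The plain-Python sketch below rebuilds the layout for a row with a missing sex, province index 2 and an imputed age. The helper `assemble` is hypothetical, and the age value is rounded from the prediction output further down.

```python
def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(part)
        else:
            features.append(float(part))
    return features


sex_ohe = [0.0, 0.0]          # missing sex: all zeros
province_ohe = [0.0] * 17
province_ohe[2] = 1.0         # province index 2
age = 40.83                   # illustrative imputed age

features = assemble(sex_ohe, province_ohe, age)
nonzero = [i for i, v in enumerate(features) if v != 0.0]
print(len(features), nonzero)
```

The vector has 2 + 17 + 1 = 20 entries, with non-zero values at positions 4 and 19. This agrees with the sparse form `(20,[4,19],[1.0,4...` shown in the features column of the prediction output.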

### Divide the data into Train/Test

In [None]:
train_df, test_df=data_clean.randomSplit([0.8,0.2],seed=42)
print(f"There are {train_df.count()} rows in the training set, and {test_df.count()} in the test set")

There are 4166 rows in the training set, and 999 in the test set


### Create a Linear Regression Model

In [None]:
lr=LinearRegression(featuresCol="features",labelCol="state_indexed",predictionCol="prediction")

### Create a Pipeline model

In [None]:
pipe=Pipeline(stages=[stind,ohe,imput,vecAssem,lr])

### Fit the Pipeline model to the training data

In [None]:
pipe_model=pipe.fit(train_df)

### Make predictions for the test data and evaluate the model performance using RMSE and R2

In [None]:
pred_test_df=pipe_model.transform(test_df)

In [None]:
pred_test_df.show(5)

+----+-----------------+-----------+--------+-----------+----------------+-------------+---------+--------------+-------------+--------------------+-----------------+
| sex|              age|   province|   state|sex_indexed|province_indexed|state_indexed|  sex_ohe|  province_ohe|    state_ohe|            features|       prediction|
+----+-----------------+-----------+--------+-----------+----------------+-------------+---------+--------------+-------------+--------------------+-----------------+
|NULL|40.83025210084033|Gyeonggi-do|isolated|        2.0|             2.0|          1.0|(2,[],[])|(17,[2],[1.0])|(3,[1],[1.0])|(20,[4,19],[1.0,4...|1.131520721686016|
|NULL|40.83025210084033|Gyeonggi-do|isolated|        2.0|             2.0|          1.0|(2,[],[])|(17,[2],[1.0])|(3,[1],[1.0])|(20,[4,19],[1.0,4...|1.131520721686016|
|NULL|40.83025210084033|Gyeonggi-do|isolated|        2.0|             2.0|          1.0|(2,[],[])|(17,[2],[1.0])|(3,[1],[1.0])|(20,[4,19],[1.0,4...|1.131520721686016

In [None]:
rmse_evaluator_test=RegressionEvaluator(predictionCol="prediction",labelCol="state_indexed",metricName="rmse")
rmse_evaluator_test.evaluate(pred_test_df)

0.3766367508769946

In [None]:
r2_evaluator_test=RegressionEvaluator(predictionCol="prediction",labelCol="state_indexed",metricName="r2")
r2_evaluator_test.evaluate(pred_test_df)

0.4892391406165031

In [None]:
print(f"RMSE is {rmse_evaluator_test.evaluate(pred_test_df)}")
print(f"R2 is {r2_evaluator_test.evaluate(pred_test_df)}")

RMSE is 0.3766367508769946
R2 is 0.4892391406165031
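
For reference, both metrics can be written out by hand. RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the label mean. The sketch below uses plain Python and made-up numbers, not the lab's predictions; the function names are hypothetical.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root mean squared error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared) / len(squared))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [0.0, 1.0, 1.0, 2.0]
predictions = [0.5, 1.0, 1.0, 1.5]
print(rmse(labels, predictions), r2(labels, predictions))
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the label mean.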
