Treating left as the dependent variable in the HR dataset, split the data into training and test sets according to the last digit of your roll number: a roll number ending in 0 gives a 70/30 split, one ending in 1 gives 71/29, one ending in 2 gives 72/28, one ending in 3 gives 73/27, and so on. Determine the accuracy of the model.
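The rule above is simple arithmetic: the training percentage is 70 plus the last digit, and the test percentage is whatever remains of 100. A minimal sketch follows; the function name split_ratio and the sample roll numbers are hypothetical, used only for illustration.

```python
def split_ratio(roll_no):
    """Return (train_pct, test_pct) for a roll number given as int or str.

    Assumes the last character of the roll number is a digit 0-9.
    """
    last_digit = int(str(roll_no)[-1])
    train_pct = 70 + last_digit
    return train_pct, 100 - train_pct


def split_weights(roll_no):
    """Return the same ratio as fractional weights, e.g. [0.71, 0.29]."""
    train_pct, test_pct = split_ratio(roll_no)
    return [train_pct / 100, test_pct / 100]


if __name__ == "__main__":
    # Hypothetical roll numbers, one per last digit.
    for roll_no in range(10):
        print(roll_no, split_ratio(roll_no))
```

The list returned by split_weights has the shape expected by DataFrame.randomSplit.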

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [2]:
from pyspark.sql import SparkSession
session = SparkSession.builder.appName("HR_comma_Dataset").getOrCreate()
data = session.read.csv("HR comma.csv", header = True, inferSchema = True)

In [3]:
data.show(10)

+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----+------+
|satisfaction_level|last_evaluation|number_project|average_montly_hours|time_spend_company|Work_accident|left|promotion_last_5years|sales|salary|
+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----+------+
|              0.38|           0.53|             2|                 157|                 3|            0|   1|                    0|sales|   low|
|               0.8|           0.86|             5|                 262|                 6|            0|   1|                    0|sales|medium|
|              0.11|           0.88|             7|                 272|                 4|            0|   1|                    0|sales|medium|
|              0.72|           0.87|             5|                 223|                 5|            0|   1|              

In [4]:
data.columns

['satisfaction_level',
 'last_evaluation',
 'number_project',
 'average_montly_hours',
 'time_spend_company',
 'Work_accident',
 'left',
 'promotion_last_5years',
 'sales',
 'salary']

In [5]:
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
str_idx = StringIndexer(inputCols = ['sales','salary'], outputCols = ["newsales", "newsalary"])

In [6]:
one_hot = OneHotEncoder(inputCols = ["newsales","newsalary"], outputCols = ["newsales_onehot","newsalary_onehot"])

In [7]:
vec_ass = VectorAssembler(
    inputCols=[
        'satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_montly_hours',
        'time_spend_company',
        'Work_accident',
        'promotion_last_5years',
        'newsales_onehot',
        'newsalary_onehot',
    ],
    outputCol="all_features",
)

In [8]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol= "all_features", labelCol = "left")

In [9]:
from pyspark.ml import Pipeline
mypipeline = Pipeline(stages = [str_idx, one_hot, vec_ass, lr])

In [10]:
# 71/29 split, i.e. the ratio for a roll number ending in 1.
training, test = data.randomSplit([0.71, 0.29])
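randomSplit assigns rows at random, so the realized proportions are only approximately those requested, and without a seed the split changes on every run. The sketch below is one possible way to fix the seed and check the realized fractions. The helper names and the seed value 42 are assumptions, not part of the original notebook; the helpers only rely on the DataFrame methods randomSplit and count.

```python
def split_with_seed(df, train_pct, seed=42):
    """Split df into (training, test) using a fixed seed for repeatability.

    train_pct is a whole-number percentage such as 71.
    The seed value 42 is an arbitrary illustrative choice.
    """
    weights = [train_pct / 100.0, (100 - train_pct) / 100.0]
    training, test = df.randomSplit(weights, seed=seed)
    return training, test


def realized_fractions(training, test):
    """Return the fractions of rows that actually landed in each split."""
    n_train = training.count()
    n_test = test.count()
    total = n_train + n_test
    return n_train / total, n_test / total
```

With the notebook's DataFrame this could be called as training, test = split_with_seed(data, 71), followed by realized_fractions(training, test). Note that fixing a seed will generally produce a different split, and therefore a different metric value, than the unseeded run recorded here.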

In [11]:
lr_model = mypipeline.fit(training)

In [12]:
result = lr_model.transform(test)

In [13]:
result.show(4, truncate = False)

+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----------+------+--------+---------+---------------+----------------+--------------------------------------------------------+----------------------------------------+----------------------------------------+----------+
|satisfaction_level|last_evaluation|number_project|average_montly_hours|time_spend_company|Work_accident|left|promotion_last_5years|sales      |salary|newsales|newsalary|newsales_onehot|newsalary_onehot|all_features                                            |rawPrediction                           |probability                             |prediction|
+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----------+------+--------+---------+---------------+----------------+--------------------------------------------------------+--------------------------------

In [14]:
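from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Note: the default metric of BinaryClassificationEvaluator is areaUnderROC, not accuracy.
# The name "evaluator" avoids shadowing Python's built-in eval().
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="left")

In [15]:
evaluator.evaluate(result)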

0.8297104126919155
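The task asks for accuracy, whereas BinaryClassificationEvaluator reports the area under the ROC curve unless told otherwise, so the figure above is an AUC value. A minimal sketch of computing accuracy directly from the prediction DataFrame follows. The helper name accuracy is hypothetical; it assumes the label column is left and the predicted class is in the prediction column, as in the output of cell 13.

```python
def accuracy(predictions, label_col="left", prediction_col="prediction"):
    """Fraction of rows whose predicted class equals the true label.

    predictions is expected to be the DataFrame returned by
    lr_model.transform(test). Backticks quote the column names because
    "left" is also a SQL keyword.
    """
    total = predictions.count()
    if total == 0:
        raise ValueError("no rows to evaluate")
    correct = predictions.filter(
        f"`{label_col}` = `{prediction_col}`"
    ).count()
    return correct / total
```

In the notebook this would be called as accuracy(result). An alternative, if preferred, is MulticlassClassificationEvaluator from pyspark.ml.evaluation with labelCol="left", predictionCol="prediction" and metricName="accuracy".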