<a href="https://colab.research.google.com/github/tiasaxena/PySpark/blob/main/Adult_Census_Income_(Using_PySpark)_Kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **Predict whether income exceeds $50K/yr based on census data**

Dataset link: [https://www.kaggle.com/datasets/uciml/adult-census-income](https://www.kaggle.com/datasets/uciml/adult-census-income)



In [87]:
import os
from google.colab import files

# Check if kaggle.json already exists
if not os.path.exists('kaggle.json'):
  # Upload the kaggle.json file
  uploaded = files.upload()

  # Confirm the upload
  if 'kaggle.json' in uploaded:
    print("kaggle.json has been successfully uploaded!")
  else:
    print("Something went wrong while uploading kaggle.json")
else:
  print("kaggle.json already exists!")

kaggle.json already exists!


In [71]:
! pip install -q kaggle
! kaggle datasets download uciml/adult-census-income
! unzip adult-census-income.zip

Dataset URL: https://www.kaggle.com/datasets/uciml/adult-census-income
License(s): CC0-1.0
adult-census-income.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  adult-census-income.zip
replace adult.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


### 1. Start the Spark Session

In [72]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adult_income_prediction").getOrCreate()
spark

### 2. Read the CSV Data

In [73]:
df = spark.read.csv('adult.csv', header=True, inferSchema=True)

df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education.num: integer (nullable = true)
 |-- marital.status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital.gain: integer (nullable = true)
 |-- capital.loss: integer (nullable = true)
 |-- hours.per.week: integer (nullable = true)
 |-- native.country: string (nullable = true)
 |-- income: string (nullable = true)



In [74]:
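# Rename the `income` column to `label`

# Method 1: rename a single column
df = df.withColumnRenamed('income', 'label')

# Method 2: rename all columns at once (this also replaces the dots in the names with underscores)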
new_cols = ['age','workclass','fnlwgt','education','education_num','marital','occupation','relationship','race','sex','capital_gain','capital_loss','hours_week','native_country','label']
df=df.toDF(*new_cols)

df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: integer (nullable = true)
 |-- marital: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = true)
 |-- capital_loss: integer (nullable = true)
 |-- hours_week: integer (nullable = true)
 |-- native_country: string (nullable = true)
 |-- label: string (nullable = true)



In [75]:
df.show(5)

+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
|age|workclass|fnlwgt|   education|education_num|  marital|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_week|native_country|label|
+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
| 90|        ?| 77053|     HS-grad|            9|  Widowed|                ?|Not-in-family|White|Female|           0|        4356|        40| United-States|<=50K|
| 82|  Private|132870|     HS-grad|            9|  Widowed|  Exec-managerial|Not-in-family|White|Female|           0|        4356|        18| United-States|<=50K|
| 66|        ?|186061|Some-college|           10|  Widowed|                ?|    Unmarried|Black|Female|           0|        4356|        40| United-States|<=50K|
| 54|  Private|140359|

In [76]:
df.describe().show()

+-------+------------------+-----------+------------------+------------+-----------------+--------+----------------+------------+------------------+------+------------------+-----------------+------------------+--------------+-----+
|summary|               age|  workclass|            fnlwgt|   education|    education_num| marital|      occupation|relationship|              race|   sex|      capital_gain|     capital_loss|        hours_week|native_country|label|
+-------+------------------+-----------+------------------+------------+-----------------+--------+----------------+------------+------------------+------+------------------+-----------------+------------------+--------------+-----+
|  count|             32561|      32561|             32561|       32561|            32561|   32561|           32561|       32561|             32561| 32561|             32561|            32561|             32561|         32561|32561|
|   mean| 38.58164675532078|       NULL|189778.36651208502|        N

#### Crosstab computation

A **crosstab** table shows how two variables are related by counting how often different combinations of their values occur. It helps you see the distribution of one variable across the categories of another.

In [77]:
df.crosstab('age', 'label').sort('age_label').show()

+---------+-----+----+
|age_label|<=50K|>50K|
+---------+-----+----+
|       17|  395|   0|
|       18|  550|   0|
|       19|  710|   2|
|       20|  753|   0|
|       21|  717|   3|
|       22|  752|  13|
|       23|  865|  12|
|       24|  767|  31|
|       25|  788|  53|
|       26|  722|  63|
|       27|  754|  81|
|       28|  748| 119|
|       29|  679| 134|
|       30|  690| 171|
|       31|  705| 183|
|       32|  639| 189|
|       33|  684| 191|
|       34|  643| 243|
|       35|  659| 217|
|       36|  635| 263|
+---------+-----+----+
only showing top 20 rows
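To make the counting concrete, here is a minimal plain-Python sketch of what a crosstab computes. It does not use the PySpark API, and the `rows` values are made up for illustration.

```python
from collections import Counter

# Toy (age, label) pairs; these values are invented for illustration only.
rows = [
    (19, "<=50K"), (19, "<=50K"), (19, ">50K"),
    (25, "<=50K"), (25, ">50K"), (25, ">50K"),
]

# Count how often each (age, label) combination occurs.
pair_counts = Counter(rows)
ages = sorted({age for age, _ in rows})
labels = sorted({label for _, label in rows})

# One row per age, one column per label, zero where a pair never occurs.
table = {
    age: {label: pair_counts.get((age, label), 0) for label in labels}
    for age in ages
}

for age in ages:
    print(age, table[age])
```

Each printed row corresponds to one line of the `age_label` table: the age on the left and the per-label counts on the right.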



#### Drop columns

* `.drop()` - Drop a column
* `.dropna()` or `.na.drop()` - Drops rows with NULL values
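This dataset marks missing values with `?` rather than NULL, so the placeholder has to be turned into a null before dropping rows has any effect. The following plain-Python sketch (not PySpark code, and using abbreviated rows) illustrates the two steps.

```python
# Three abbreviated rows in the spirit of the dataset; "?" marks a missing value.
rows = [
    {"age": 90, "workclass": "?", "occupation": "?"},
    {"age": 82, "workclass": "Private", "occupation": "Exec-managerial"},
    {"age": 66, "workclass": "?", "occupation": "?"},
]

# Step 1: turn the "?" placeholder into a real missing value (None).
cleaned = [{k: (None if v == "?" else v) for k, v in row.items()} for row in rows]

# Step 2: keep only the rows in which no field is missing.
complete = [row for row in cleaned if all(v is not None for v in row.values())]

print(complete)
```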

In [78]:
df.show(5)

+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
|age|workclass|fnlwgt|   education|education_num|  marital|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_week|native_country|label|
+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
| 90|        ?| 77053|     HS-grad|            9|  Widowed|                ?|Not-in-family|White|Female|           0|        4356|        40| United-States|<=50K|
| 82|  Private|132870|     HS-grad|            9|  Widowed|  Exec-managerial|Not-in-family|White|Female|           0|        4356|        18| United-States|<=50K|
| 66|        ?|186061|Some-college|           10|  Widowed|                ?|    Unmarried|Black|Female|           0|        4356|        40| United-States|<=50K|
| 54|  Private|140359|

In [79]:
df = df.replace({"?": None})
df = df.dropna()
df.show(5)

+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
|age|workclass|fnlwgt|   education|education_num|  marital|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_week|native_country|label|
+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+
| 82|  Private|132870|     HS-grad|            9|  Widowed|  Exec-managerial|Not-in-family|White|Female|           0|        4356|        18| United-States|<=50K|
| 54|  Private|140359|     7th-8th|            4| Divorced|Machine-op-inspct|    Unmarried|White|Female|           0|        3900|        40| United-States|<=50K|
| 41|  Private|264663|Some-college|           10|Separated|   Prof-specialty|    Own-child|White|Female|           0|        3900|        40| United-States|<=50K|
| 34|  Private|216864|

### 3. Data Processing

`age` does not have a linear relationship with the `label`, so the predictions will be poor if we try to fit a straight line along the `age` feature.

So we introduce non-linearity by adding an age<sup>2</sup> column to the dataset, which makes the equation quadratic in `age` instead of linear and allows a better-fitting curve.
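As a rough illustration of why a squared term helps, the sketch below evaluates a logistic model with both an `age` and an `age`<sup>2</sup> term. The coefficients are hypothetical and chosen only to show the shape; they are not taken from the model fitted in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
INTERCEPT, W_AGE, W_AGE_SQ = -9.0, 0.4, -0.004

def prob_high_income(age):
    """Probability from a logistic model that is quadratic in age."""
    z = INTERCEPT + W_AGE * age + W_AGE_SQ * age ** 2
    return sigmoid(z)

for age in (20, 50, 80):
    print(age, round(prob_high_income(age), 3))
```

With a negative weight on the squared term, the probability can rise and then fall again with age, which a model that is linear in `age` cannot express.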

In [80]:
# Add age_square column to the df
df = df.withColumn('age_square', df.age**2)

df.show(5)

+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+----------+
|age|workclass|fnlwgt|   education|education_num|  marital|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_week|native_country|label|age_square|
+---+---------+------+------------+-------------+---------+-----------------+-------------+-----+------+------------+------------+----------+--------------+-----+----------+
| 82|  Private|132870|     HS-grad|            9|  Widowed|  Exec-managerial|Not-in-family|White|Female|           0|        4356|        18| United-States|<=50K|    6724.0|
| 54|  Private|140359|     7th-8th|            4| Divorced|Machine-op-inspct|    Unmarried|White|Female|           0|        3900|        40| United-States|<=50K|    2916.0|
| 41|  Private|264663|Some-college|           10|Separated|   Prof-specialty|    Own-child|White|Female|           0|        3900|

It is important to check the size of each group. If a group has a count of 1, the model learns very little from it, and it can lead to an error during cross-validation. So we must remove the rows of such a group.

In [81]:
df.filter(df.native_country == 'Holand-Netherlands').count()

# ---------- OR ------------

df.groupby('native_country').count().orderBy('count', ascending=True).show()

+--------------------+-----+
|      native_country|count|
+--------------------+-----+
|  Holand-Netherlands|    1|
|            Scotland|   11|
|            Honduras|   12|
|             Hungary|   13|
|Outlying-US(Guam-...|   14|
|          Yugoslavia|   16|
|                Laos|   17|
|            Thailand|   17|
|            Cambodia|   18|
|     Trinadad&Tobago|   18|
|                Hong|   19|
|             Ireland|   24|
|              France|   27|
|             Ecuador|   27|
|              Greece|   29|
|                Peru|   30|
|           Nicaragua|   33|
|            Portugal|   34|
|              Taiwan|   42|
|                Iran|   42|
+--------------------+-----+
only showing top 20 rows



**Note:** `Holand-Netherlands` appears only once, so let us drop it.

In [82]:
df = df.filter(df.native_country != 'Holand-Netherlands')

df.filter(df.native_country == 'Holand-Netherlands').count()

0

### 4. Data Processing Pipeline

The following steps will be performed in the pipeline for data processing:
* Encode the categorical data (education, marital, workclass, etc.).
* Index the `label` column as 0/1.
* Convert the continuous variables to the right format and add them.
* Assemble the steps.
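The steps above can be sketched in plain Python. This is not the PySpark implementation: the helper names `fit_string_index` and `one_hot_drop_last` are hypothetical, and the sketch only assumes the default behaviour of ordering categories by descending frequency and dropping the last category from the one-hot vector.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot vector of length num_categories - 1; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Toy (sex, age) rows, invented for illustration.
rows = [("Male", 39.0), ("Female", 50.0), ("Male", 38.0)]

# Index the categorical column, then one-hot encode it.
mapping = fit_string_index([sex for sex, _ in rows])

# "Assemble": concatenate the encoded category with the continuous value.
features = [one_hot_drop_last(mapping[sex], len(mapping)) + [age] for sex, age in rows]

print(mapping)
print(features)
```

Dropping the last category explains why a two-valued column such as `sex` ends up as a vector of length 1 in the transformed data.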

In [83]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
# Import all from `sql.types`
from pyspark.sql.types import *

In [84]:
categorical_features = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'sex', 'native_country']

# pipeline stages
stages = []

# 1. Encode the categorical data into one-hot encoded vector
for category in categorical_features:
  # StringIndexer converts categorical string values into numerical indices
  stringIndexer = StringIndexer(inputCol=category, outputCol=category+"_Indexed")

  # OneHotEncoder converts the numerical indices (from StringIndexer) into a one-hot encoded vector
  encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[category + "classVec"])

  stages += [stringIndexer, encoder]

# 2. Index the label feature
label_stringIndexer = StringIndexer(inputCol='label', outputCol='label_Indexed')
stages += [label_stringIndexer]

# 3. Add continuous values
def convertColumn(df, names, newType):
  """
  Cast the given DataFrame columns to `newType`. Here every continuous feature is cast to FloatType so that the columns passed to VectorAssembler share one numeric type.
  """
  for name in names:
    df = df.withColumn(name, df[name].cast(newType))
  return df

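# List of continuous features
# Note: `age_square` is not listed here, so the squared term created earlier is not part of the assembled feature vector.
continuous_features  = ['age', 'fnlwgt','capital_gain', 'education_num', 'capital_loss', 'hours_week']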

# Convert the type
df = convertColumn(df, continuous_features, FloatType())

assemblerInputs = [c + "classVec" for c in categorical_features] + continuous_features

print(f"Assembler Inputs: {assemblerInputs}")

# 4. Assemble the steps
# Vector Assembler - All one-hot encoded categorical columns and continuous features are combined into a single features column
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

stages

Assembler Inputs: ['workclassclassVec', 'educationclassVec', 'maritalclassVec', 'occupationclassVec', 'relationshipclassVec', 'raceclassVec', 'sexclassVec', 'native_countryclassVec', 'age', 'fnlwgt', 'capital_gain', 'education_num', 'capital_loss', 'hours_week']


[StringIndexer_4d9497cf0b27,
 OneHotEncoder_10a48ba17e54,
 StringIndexer_1b7b5c3d7352,
 OneHotEncoder_927608e24eee,
 StringIndexer_eb7d599a9c02,
 OneHotEncoder_aa0277272085,
 StringIndexer_a827a4742f2c,
 OneHotEncoder_7cb8cb4a8b67,
 StringIndexer_6a1ba361657d,
 OneHotEncoder_350cac765aca,
 StringIndexer_760f0d5a4270,
 OneHotEncoder_8cc89d01cb8c,
 StringIndexer_2f6ff47f8663,
 OneHotEncoder_8e11ec8ef36f,
 StringIndexer_c9132ac482c9,
 OneHotEncoder_e3cd0ade5a28,
 StringIndexer_037ad6731af2,
 VectorAssembler_79a287c432ac]

In [86]:
# Create the pipeline
pipeline = Pipeline(stages=stages)
# Process the input data (df) according to the defined stages.
pipelineModel = pipeline.fit(df)
# After fitting (learning), the data is transformed (the learned stages are applied)
model = pipelineModel.transform(df)

model

DataFrame[age: float, workclass: string, fnlwgt: float, education: string, education_num: float, marital: string, occupation: string, relationship: string, race: string, sex: string, capital_gain: float, capital_loss: float, hours_week: float, native_country: string, label: string, age_square: double, workclass_Indexed: double, workclassclassVec: vector, education_Indexed: double, educationclassVec: vector, marital_Indexed: double, maritalclassVec: vector, occupation_Indexed: double, occupationclassVec: vector, relationship_Indexed: double, relationshipclassVec: vector, race_Indexed: double, raceclassVec: vector, sex_Indexed: double, sexclassVec: vector, native_country_Indexed: double, native_countryclassVec: vector, label_Indexed: double, features: vector]

### 5. Build the Classifier: Logistic

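In the transformed DataFrame, `features` is stored as a sparse vector (see the `SparseVector` entries in the output below), because most of the one-hot entries are zero. The next cell converts it to a `DenseVector`. A **dense vector** represents data as an array of numbers in which every element, including the zeros, is **explicitly** stored.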
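The difference between the two representations can be sketched in plain Python. The helper `to_dense` is hypothetical and only mirrors the idea of expanding a `(size, {index: value})` pair; it is not the PySpark conversion itself.

```python
def to_dense(size, active):
    """Expand a sparse (size, {index: value}) pair into a full list."""
    dense = [0.0] * size
    for index, value in active.items():
        dense[index] = value
    return dense

# Same layout as a SparseVector(6, {0: 1.0}): six slots, only slot 0 is non-zero.
print(to_dense(6, {0: 1.0}))
```

The sparse form stores one index and one value, while the dense form stores all six numbers.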

In [98]:
from pyspark.ml.linalg import DenseVector

print(model.rdd.take(5))

# Map the RDD to the required format
try:
  input_data = model.rdd.map(lambda x: (x['label_Indexed'], DenseVector(x['features'])))
except KeyError as e:
  print(f"KeyError: {e} - Check if the RDD contains the 'label_Indexed' and 'features' keys.")
  raise

[Row(age=82.0, workclass='Private', fnlwgt=132870.0, education='HS-grad', education_num=9.0, marital='Widowed', occupation='Exec-managerial', relationship='Not-in-family', race='White', sex='Female', capital_gain=0.0, capital_loss=4356.0, hours_week=18.0, native_country='United-States', label='<=50K', age_square=6724.0, workclass_Indexed=0.0, workclassclassVec=SparseVector(6, {0: 1.0}), education_Indexed=0.0, educationclassVec=SparseVector(15, {0: 1.0}), marital_Indexed=4.0, maritalclassVec=SparseVector(6, {4: 1.0}), occupation_Indexed=2.0, occupationclassVec=SparseVector(13, {2: 1.0}), relationship_Indexed=1.0, relationshipclassVec=SparseVector(5, {1: 1.0}), race_Indexed=0.0, raceclassVec=SparseVector(4, {0: 1.0}), sex_Indexed=1.0, sexclassVec=SparseVector(1, {}), native_country_Indexed=0.0, native_countryclassVec=SparseVector(39, {0: 1.0}), label_Indexed=0.0, features=SparseVector(95, {0: 1.0, 6: 1.0, 25: 1.0, 29: 1.0, 41: 1.0, 45: 1.0, 50: 1.0, 89: 82.0, 90: 132870.0, 92: 9.0, 93: 4

In [99]:
# Create DataFrame from the transformed RDD
df_train = spark.createDataFrame(input_data, ['label', 'features'])

# Show the resulting DataFrame
df_train.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[1.0,0.0,0.0,0.0,...|
|  0.0|[1.0,0.0,0.0,0.0,...|
|  0.0|[1.0,0.0,0.0,0.0,...|
|  0.0|[1.0,0.0,0.0,0.0,...|
|  0.0|[1.0,0.0,0.0,0.0,...|
+-----+--------------------+
only showing top 5 rows



### 6. Create a train/test split

In [101]:
train_data, test_data = df_train.randomSplit([0.8, 0.2], seed=1234)

In [104]:
train_data.groupby('label').agg({'label': 'count'}).show()

+-----+------------+
|label|count(label)|
+-----+------------+
|  0.0|       18148|
|  1.0|        5995|
+-----+------------+



In [105]:
test_data.groupby('label').agg({'label': 'count'}).show()

+-----+------------+
|label|count(label)|
+-----+------------+
|  0.0|        4505|
|  1.0|        1513|
+-----+------------+
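A quick way to sanity-check the split is to compare the class shares in both parts. The sketch below is plain arithmetic on the counts printed above; the helper name `positive_share` is introduced here for illustration.

```python
# Counts copied from the two groupby outputs above.
train_counts = {0.0: 18148, 1.0: 5995}
test_counts = {0.0: 4505, 1.0: 1513}

def positive_share(counts):
    """Fraction of rows whose label is 1.0."""
    return counts[1.0] / sum(counts.values())

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())
train_fraction = train_total / (train_total + test_total)

# Accuracy obtained by always predicting the majority class on the test set.
majority_baseline = test_counts[0.0] / test_total

print(f"train fraction of rows: {train_fraction:.4f}")
print(f"positive share (train): {positive_share(train_counts):.4f}")
print(f"positive share (test):  {positive_share(test_counts):.4f}")
print(f"majority baseline:      {majority_baseline:.4f}")
```

The majority baseline is a useful reference point: a classifier is only informative if its test accuracy is clearly above it.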



### 7. Logistic Regression Model

In [106]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(
    labelCol="label",
    featuresCol="features",
    maxIter=10,
    regParam=0.01
)

lrModel = lr.fit(train_data)

In [109]:
# Print the coefficients and intercept for logistic regression
print(f"Coefficients: {lrModel.coefficients}")
print(f"Intercept: {lrModel.intercept}")

Coefficients: [0.09101634409615882,-0.3076456696313937,-0.11410322061773778,-0.21228286732936735,0.24998018548992554,0.5871464481380892,-0.2278909389604252,0.018891039657095016,0.3617872616070221,0.6146346711775287,0.0233525804676015,-0.4458636603060742,0.004935191642457577,-0.5470463251904182,-0.7484567342055787,1.0590967625496734,-0.4445223469799122,-0.3799207213348473,0.9845214078731364,-0.5681373296175183,-0.42608583656379784,0.7705968940736075,-0.6925166117226711,-0.2907412398589956,-0.3855129606845306,-0.04717841721244779,-0.26114206913611254,0.3721642580873677,-0.00868455011001916,0.6787909765126494,-0.1110378387683327,0.22435682212922511,-0.7283004480467764,-0.2745954078578791,-0.06807804756350536,-0.6243339708159834,-1.0052893125041196,0.5302803077845539,0.505259845514715,-1.40877118213896,0.527741335427786,-0.09632853161893945,-0.8385372794989355,-0.20097276940617143,1.4970830257224557,0.09623195036713898,-0.06457282228310895,0.20987828639264514,-0.28457425999732594,0.61005844

### 8. Make predictions on the test data

In [110]:
predictions = lrModel.transform(test_data)

predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [113]:
result = predictions.select('label', 'prediction', 'probability')
result.show(5)

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  0.0|       0.0|[0.68028723598545...|
|  0.0|       0.0|[0.66176467093464...|
|  0.0|       0.0|[0.72160480855499...|
|  0.0|       0.0|[0.68439278901977...|
|  0.0|       1.0|[0.42374648158736...|
+-----+----------+--------------------+
only showing top 5 rows
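The `prediction` column follows from the `probability` column. The sketch below assumes the default decision threshold of 0.5 and uses the class-0 probabilities from the rows above, rounded to three decimals because the output is truncated. The function name `predict` is hypothetical.

```python
# Probability of class 0, rounded from the first five rows shown above.
p_class0 = [0.680, 0.662, 0.722, 0.684, 0.424]

def predict(p0, threshold=0.5):
    """Return 1.0 when the class-1 probability exceeds the threshold."""
    p1 = 1.0 - p0
    return 1.0 if p1 > threshold else 0.0

predicted = [predict(p) for p in p_class0]
print(predicted)
```

Only the fifth row has a class-0 probability below 0.5, which is why it is the only one predicted as class 1.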



### 9. Evaluate the model

In [116]:
def accuracy(model):
  predictions = model.transform(test_data)
  confusion_matrix = predictions.select('label', 'prediction')
  acc = confusion_matrix.filter(confusion_matrix.label == confusion_matrix.prediction).count() / confusion_matrix.count()
  print(f"Accuracy of the Logistic Regression Model is: {acc*100:.3f}%")

accuracy(lrModel)

Accuracy of the Logistic Regression Model is: 84.812%
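Accuracy is the share of rows where `label` equals `prediction`. As a small worked example, the sketch below builds the confusion-matrix counts for the five `(label, prediction)` rows displayed in the previous section; it is plain Python, not the PySpark computation on the full test set.

```python
from collections import Counter

# (label, prediction) pairs from the five rows displayed in the previous section.
pairs = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]

# Count each (label, prediction) combination.
confusion = Counter(pairs)
tn = confusion[(0.0, 0.0)]  # true negatives
fp = confusion[(0.0, 1.0)]  # false positives
fn = confusion[(1.0, 0.0)]  # false negatives
tp = confusion[(1.0, 1.0)]  # true positives

sample_accuracy = (tp + tn) / len(pairs)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={sample_accuracy:.2f}")
```

On an imbalanced dataset such as this one, the individual counts are often more informative than the single accuracy figure.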


### 10. ROC metrics

In [117]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
print(evaluator.getMetricName(), evaluator.evaluate(predictions))


areaUnderROC 0.9035533698695667
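One way to read the area under the ROC curve is as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it by comparing all such pairs. The labels and scores are toy values, and this is a simple illustration rather than the method the evaluator uses internally.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy data, invented for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(area_under_roc(labels, scores))
```

Unlike accuracy, this metric does not depend on a decision threshold, only on how the scores rank the rows.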


### 11. Tune the hyperparameter

In [121]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Add the parameters you want to tune
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5])
             .build())

paramGrid

[{Param(parent='LogisticRegression_255150048b5d', name='regParam', doc='regularization parameter (>= 0).'): 0.01},
 {Param(parent='LogisticRegression_255150048b5d', name='regParam', doc='regularization parameter (>= 0).'): 0.5}]

In [123]:
from time import time
start_time = time()

# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(train_data)

end_time = time()
elapsed_time = end_time - start_time
print(f"Time taken to train the model: {elapsed_time:.3f} seconds")
accuracy(model = cvModel)

Time taken to train the model: 450.879 seconds
Accuracy of the Logistic Regression Model is: 84.812%
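The long training time is easier to understand when counting the model fits involved. The sketch below assumes a simple deterministic fold assignment (row `i` goes to fold `i % num_folds`), which is a simplification of how rows are really assigned to folds, and the helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx) pairs; row i belongs to fold i % num_folds."""
    for fold in range(num_folds):
        valid = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, valid

# Same grid and fold count as in the cells above.
reg_params = [0.01, 0.5]
num_folds = 5

# One fit per (parameter value, fold) combination ...
fits = [(reg, fold) for reg in reg_params for fold in range(num_folds)]
# ... plus one final refit of the best parameter on the whole training set.
total_fits = len(fits) + 1

for train, valid in k_fold_indices(10, num_folds):
    print("train:", train, "valid:", valid)
print("total model fits:", total_fits)
```

Each added grid value costs another `num_folds` fits, so the grid size is the main lever on the running time.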


**Note:** We extract the recommended parameters by chaining `cvModel.bestModel` with `extractParamMap()`.

In [125]:
# Get the best hyperparams
bestModel = cvModel.bestModel
print(f"Best hyperparameters: {bestModel.extractParamMap()}")

Best hyperparameters: {Param(parent='LogisticRegression_255150048b5d', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_255150048b5d', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_255150048b5d', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_255150048b5d', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_255150048b5d', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_255150048b5d', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_255150048b5d', name='maxBlockSizeInMB', doc='maximum me