<a href="https://colab.research.google.com/github/saleh-imran/BigDataProcessing/blob/main/DecisionTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import Necessary Libraries**

In [9]:
!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils




**Create Spark Session**

In [10]:
spark = SparkSession.builder.appName("MLlibDecisionTreeClassification").getOrCreate()


**Load Data**

In [14]:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

# Load and prepare the Titanic dataset
data = spark.read.csv('titanic.csv', header=True, inferSchema=True)
data = data.select("Pclass", "Age", "Survived")  # Selecting relevant columns
data = data.na.drop()  # Dropping rows with missing values

**Convert data to RDD of LabeledPoints**


In [15]:
data = data.rdd.map(lambda row: LabeledPoint(row["Survived"], [row["Pclass"], row["Age"]]))


**Split the data into training and testing sets**


In [16]:
trainData, testData = data.randomSplit([0.8, 0.2], seed=123)


**Train a DecisionTree model**


In [17]:
model = DecisionTree.trainClassifier(trainData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)


**Evaluate Model**

In [18]:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print("Test Error = {}".format(testErr))

Test Error = 0.3212121212121212


**Display Learned Decision Tree**

In [19]:
print("Learned classification tree model:")
print(model.toDebugString)


Learned classification tree model:
<bound method DecisionTreeModel.toDebugString of DecisionTreeModel classifier of depth 5 with 23 nodes>


In [8]:
spark.stop()
