# **KDDCup Data Analytics with PySpark RDD: A structured case study**

## YouTube channel: Code with Kristi
## Tutor: Dr Sachin Saxena (PhD, MTech, BTech)

### Data source: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

In [None]:
########## ONLY in Colab ##########
# !pip3 install pyspark
########## ONLY in Colab ##########

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Initialize the Spark context
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('SivaBigData17MB').setMaster('local[*]')
sc=SparkContext(conf=conf)
print(sc)
print('Ready to Go!!!!')

<SparkContext master=local[*] appName=SivaBigData17MB>
Ready to Go!!!!


In [2]:
# Read the gzipped KDD Cup file into an RDD (one element per line of text)
rdd = sc.textFile('/content/drive/MyDrive/Big Data Pyspark Project/kddcup.data.gz')

In [5]:
# Repartition and Cache Data:

In [3]:
# How many partitions do we have?
# sc.textFile() normally asks for at least min(sc.defaultParallelism, 2) partitions,
# and large files are split further (roughly one partition per input block).
# A .gz file is not splittable, however, so the whole file is read as a single partition,
# regardless of how many CPU cores are available.
rdd.getNumPartitions()

1

In [4]:
type(rdd)

In [5]:
rdd.glom().map(len).collect()

# glom(): Transforms each partition of the RDD into a list. Instead of working with individual elements, you now have a list of elements for each partition.

# map(len): Applies the len function to each partition (which is now a list) to get the count of elements in that partition.

# collect(): Collects the result back to the driver as a list, giving the count of elements in each partition.

[4898431]
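The `glom().map(len).collect()` chain can be hard to picture, so here is a minimal plain-Python model of it that runs without Spark. The `partitions` list and both helper functions are hypothetical stand-ins invented for illustration; they are not part of the PySpark API.

```python
# Plain-Python model of rdd.glom().map(len).collect(); no Spark required.
# `partitions` is a hypothetical stand-in for the partitions of an RDD.
partitions = [
    ["r1", "r2", "r3"],  # partition 0
    ["r4", "r5"],        # partition 1
    [],                  # partition 2 (empty partitions are possible)
]


def glom(parts):
    """Return each partition as one list, mirroring what RDD.glom() produces."""
    return [list(p) for p in parts]


def partition_sizes(parts):
    """Model of glom().map(len).collect(): one element count per partition."""
    return [len(p) for p in glom(parts)]


sizes = partition_sizes(partitions)
print(sizes)       # [3, 2, 0]
print(sum(sizes))  # 5, the total number of elements across all partitions
```

The single-entry result `[4898431]` above is the same idea with one partition holding every line of the file.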

In [31]:
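# To check the contents of the RDD.
# Note: collect() would pull all 4,898,431 lines to the driver,
# so a small sample such as rdd.take(5) is safer on this dataset.
# print(rdd.collect())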

In [8]:
rdd = rdd.repartition(10)

# Can increase or decrease the level of parallelism in this RDD.
# Internally, this uses a shuffle to redistribute data.
# If you are decreasing the number of partitions in this RDD, consider using coalesce,
#  which can avoid performing a shuffle.
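The difference between the two calls described in the comments above can be sketched in plain Python. This is a simplified model, not Spark's actual implementation: `coalesce_no_shuffle` and `repartition_round_robin` are hypothetical helpers, and the real coalescer also takes data locality into account.

```python
def coalesce_no_shuffle(parts, n):
    """Merge neighbouring partitions into n groups; elements are never re-dealt.

    Like coalesce() without a shuffle, this cannot increase the partition count.
    """
    k = len(parts)
    if n >= k:
        return [list(p) for p in parts]
    merged = []
    for i in range(n):
        start, end = i * k // n, (i + 1) * k // n
        merged.append([x for p in parts[start:end] for x in p])
    return merged


def repartition_round_robin(parts, n):
    """Deal every element out to n new partitions (a model of a full shuffle)."""
    out = [[] for _ in range(n)]
    i = 0
    for p in parts:
        for x in p:
            out[i % n].append(x)
            i += 1
    return out


# A skewed layout: one large partition and three tiny ones.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in coalesce_no_shuffle(parts, 2)])      # [7, 2]
print([len(p) for p in repartition_round_robin(parts, 2)])  # [5, 4]
```

In this model, merging whole partitions is cheap but keeps the skew, while re-dealing every element balances the sizes at the cost of moving all the data.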


In [9]:
rdd.glom().map(len).collect()

[489850,
 489850,
 489850,
 489830,
 489850,
 489850,
 489840,
 489831,
 489830,
 489850]

In [10]:
print(sc.defaultParallelism)
print(rdd.getNumPartitions())

rdd.persist()
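# 2 cores and 10 partitions: each partition becomes one task, so at most 2 tasks
# run at a time and the 10 tasks are worked through in roughly 5 rounds.
# persist() is lazy: the data is cached the first time an action computes it.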

2
10


MapPartitionsRDD[12] at coalesce at NativeMethodAccessorImpl.java:0
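As a rough back-of-the-envelope check on how 10 partitions map onto 2 cores, the arithmetic can be written out as below. `task_waves` is a hypothetical helper, and the sketch assumes one core per task (Spark's default for `spark.task.cpus`) and tasks of similar duration.

```python
import math


def task_waves(num_partitions, num_cores):
    """Estimate how many rounds of tasks are needed to process every partition.

    Assumes each task occupies one core and all tasks take about the same time.
    """
    return math.ceil(num_partitions / num_cores)


print(task_waves(10, 2))  # 5: after repartition(10) on 2 cores
print(task_waves(1, 2))   # 1: the single gzip partition leaves one core idle
```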

# Custom dataset

In [8]:
# A small list of (name, value) tuples to use as custom data
data = [('Siva',30), ('Sachin',25),('Manish',41),('Lavya',47),('Varun',72)]



In [9]:
type(data)

list

In [10]:
# Convert the list into an RDD
rdd = sc.parallelize(data)

In [11]:
# To check the contents of the RDD
print(rdd.collect())

[('Siva', 30), ('Sachin', 25), ('Manish', 41), ('Lavya', 47), ('Varun', 72)]


In [12]:
type(rdd)

In [13]:
rdd.glom().map(len).collect()

# The result has one entry per partition; each entry is the number of elements in that partition.
# With the default of 2 partitions here, parallelize() splits the list in order:

# Partition 1: [('Siva', 30), ('Sachin', 25)]
# Partition 2: [('Manish', 41), ('Lavya', 47), ('Varun', 72)]

[2, 3]
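The `[2, 3]` split can be reproduced with a small plain-Python sketch of the slicing arithmetic. `slice_bounds` and `split` are hypothetical helpers written for illustration, and this is a simplified model: PySpark also batches elements during serialization, so actual splits may differ in some cases.

```python
def slice_bounds(n, num_slices):
    """Start and end index of each slice when n items are cut into num_slices parts."""
    return [(i * n // num_slices, (i + 1) * n // num_slices)
            for i in range(num_slices)]


def split(items, num_slices):
    """Cut a list into contiguous slices, preserving the original order."""
    return [items[a:b] for a, b in slice_bounds(len(items), num_slices)]


data = [('Siva', 30), ('Sachin', 25), ('Manish', 41), ('Lavya', 47), ('Varun', 72)]

print(slice_bounds(len(data), 2))          # [(0, 2), (2, 5)]
print([len(p) for p in split(data, 2)])    # [2, 3]
print([len(p) for p in split(data, 5)])    # [1, 1, 1, 1, 1]
```

Under this model, 5 elements in 2 slices give sizes 2 and 3, and asking for 5 slices gives one element per slice.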

In [19]:
# Create RDD with a specific number of partitions (e.g., 5 partitions)

rdd = sc.parallelize(data, 5)



In [20]:
# Check the number of partitions again

num_partitions = rdd.getNumPartitions()

In [21]:
num_partitions

5

In [22]:
rdd.glom().map(len).collect()

[1, 1, 1, 1, 1]

In [23]:
print(sc.defaultParallelism)

2


In [24]:
print(rdd.getNumPartitions())

5


In [25]:
rdd.persist()
# Set this RDD’s storage level to persist its values across operations after
# the first time it is computed. This can only be used to
# assign a new storage level if the RDD does not have a storage level set yet.

ParallelCollectionRDD[8] at readRDDFromFile at PythonRDD.scala:289