# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Additional Notebook (Ungraded): Student Data Analysis RDD

## Learning Objectives

At the end of the experiment, you will be able to:

* Set up a PySpark environment and initialize RDDs for distributed data processing.
* Perform basic transformations such as map and filter, and actions such as count and collect, to process data.
* Filter students based on grades (pass/fail) and aggregate data such as total and average marks.
* Use key-value pairs to group data and perform operations like summing and averaging by categories (e.g., course or gender).
* Analyze the highest and lowest marks within each group using reduceByKey to find maximum and minimum values.

## Information

# Student Data Analysis using PySpark and RDDs

PySpark is the Python API for Apache Spark, an engine for large-scale distributed data processing. Resilient Distributed Datasets (RDDs) are Spark's low-level data abstraction: immutable collections that are partitioned across a cluster and processed in parallel. The approach is designed for large datasets; here it is applied to student data.

In this analysis, we use RDD operations to filter, aggregate, and summarize student data loaded from a CSV file. The key steps are:

### Loading and Initializing RDDs:

* The data is loaded from a CSV file and converted into an RDD. This represents the dataset in a distributed manner, which can then be processed across multiple machines.

### Data Transformation:

* Using map and filter functions, the dataset is transformed by selecting relevant columns (such as student IDs and scores) and filtering students based on conditions (e.g., passing or failing based on scores).
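
The map-and-filter pattern described above can be sketched in plain Python before running it on Spark. The rows below are invented for illustration, and the column positions (index 3 for course, index 5 for marks) are an assumption taken from how the notebook's code cells index each row; the other columns are placeholders.

```python
# Hypothetical rows, already split on commas; only indices 3 and 5 matter here.
rows = [
    ["-", "Female", "-", "DB", "-", "59", "-"],
    ["-", "Male", "-", "Cloud", "-", "62", "-"],
    ["-", "Male", "-", "DB", "-", "45", "-"],
]

# map: keep only the columns of interest and convert marks from str to int
course_marks = list(map(lambda x: (x[3], int(x[5])), rows))

# filter: keep only the pairs that satisfy a condition (here, marks above 50)
passed = list(filter(lambda pair: pair[1] > 50, course_marks))

print(course_marks)
print(passed)
```

On an RDD, the same lambdas would be passed to `rdd.map(...)` and `rdd.filter(...)`. The difference is that Spark evaluates them lazily and in parallel across partitions, and nothing runs until an action such as `count` or `collect` is called.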

### Aggregation and Grouping:

* With reduceByKey, we aggregate values that share a key: summing marks by gender or course, counting enrollments per course, and computing average marks per course.
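
To make the behaviour of reduceByKey concrete, here is a minimal local model of it in plain Python. The helper `reduce_by_key` and the sample pairs are hypothetical and exist only for this illustration; the function is a stand-in, not Spark's implementation.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: group values by key, then fold each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Hypothetical (course, marks) pairs
pairs = [("DB", 59), ("Cloud", 62), ("DB", 45), ("Cloud", 70)]

# Sum of marks per course
totals = reduce_by_key(pairs, lambda x, y: x + y)

# Number of records per course: replace every value with 1, then sum
counts = reduce_by_key([(k, 1) for k, _ in pairs], lambda x, y: x + y)

print(totals)
print(counts)
```

Spark merges partial results from different partitions in no guaranteed order, so the function passed to `reduceByKey` should be associative and commutative, as addition is here.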

### Analysis of Results:

* The dataset is analyzed for insights such as the total number of students who passed or failed, course-wise performance, and the maximum and minimum marks in each course.

## Benefits of using PySpark and RDDs:
* Scalability: Can handle large datasets efficiently by distributing computations across a cluster.
* Fault Tolerance: RDDs record the lineage of transformations that produced them, so lost partitions are recomputed automatically.
* Flexibility: Allows a wide range of transformations and actions to process data according to the needs of the analysis.

In [None]:
#@title Install packages and Download Dataset
!wget https://cdn.iisc.talentsprint.com/CDS/Datasets/StudentsDataAnalysis.csv
!pip install pyspark

Importing necessary packages

In [None]:
from pyspark import SparkConf, SparkContext

Initialize a SparkContext whose configuration sets the application name to "Mini Project". `getOrCreate` reuses an existing context if one is already running, so this cell can be re-run safely.

In [None]:
conf = SparkConf().setAppName("Mini Project")
sc = SparkContext.getOrCreate(conf=conf)

Load the CSV file into an RDD, remove the header row, and split each remaining line into a list of string values.

In [None]:
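# Read the file as an RDD of raw text lines
rdd = sc.textFile('/content/StudentsDataAnalysis.csv')
# The first line is the header; drop every line equal to it
headers = rdd.first()
rdd = rdd.filter(lambda x: x != headers)
# Split each remaining line into a list of string fields
rdd = rdd.map(lambda x: x.split(','))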
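
Splitting on every comma works only if no field contains a comma inside quotes. Whether this dataset has such fields is not stated, so the following is a hedged sketch of a more defensive parser built on Python's standard `csv` module; `parse_line` and the sample line are hypothetical.

```python
import csv


def parse_line(line):
    """Parse one CSV line, honouring double-quoted fields that contain commas."""
    return next(csv.reader([line]))


sample = 'a,"Doe, Jane",DB'   # hypothetical line with a quoted comma

naive = sample.split(',')     # breaks the quoted field into two pieces
robust = parse_line(sample)   # keeps the quoted field intact

print(naive)
print(robust)
```

With this helper, the last step of the cell above could be written as `rdd.map(parse_line)`. Fields that contain embedded newlines would still break, because `textFile` splits the input into lines before any parsing happens.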

Count the total number of student records in the RDD.

In [None]:
rdd.count()

Total Marks by Male & Female Students

In [None]:
# Build (gender, marks) pairs: column 1 is gender, column 5 is marks
rdd2 = rdd.map(lambda x: (x[1], int(x[5])))
# Sum the marks for each gender
rdd2 = rdd2.reduceByKey(lambda x, y: x + y)
rdd2.collect()

Total Passed & Failed Students

In [None]:
rdd3 = rdd
# A student passes with marks strictly above 50; 50 or below counts as a fail
passed = rdd3.filter(lambda x: int(x[5]) > 50).count()
failed = rdd3.filter(lambda x: int(x[5]) <= 50).count()
print(passed, failed)

In [None]:
# Alternative: count the passes once and subtract them from the total
passed2 = rdd3.filter(lambda x: int(x[5]) > 50).count()
failed2 = rdd.count() - passed2
print(passed2, failed2)
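
Both cells above trigger more than one job over the data. As a sketch of a single-pass alternative, each record can be labelled "passed" or "failed" and the labels counted by key. The example below shows the idea in plain Python with invented marks; `PASS_MARK` and `status` are hypothetical names, and the threshold mirrors the one used in the cells above.

```python
from collections import Counter

PASS_MARK = 50  # marks strictly above this value count as a pass


def status(marks):
    """Label a marks value as 'passed' or 'failed'."""
    return "passed" if marks > PASS_MARK else "failed"


marks = [59, 62, 45, 50, 88]  # hypothetical marks
tally = Counter(status(m) for m in marks)
print(tally["passed"], tally["failed"])

# On the notebook's RDD the same idea would be:
# rdd.map(lambda x: (status(int(x[5])), 1)).reduceByKey(lambda a, b: a + b).collect()
```

Note that a student with exactly 50 marks is counted as failed under this rule.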

Total Enrollments per Course

In [None]:
# Emit (course, 1) for every student, then sum the ones per course
rdd4 = rdd.map(lambda x: (x[3], 1))
rdd4.reduceByKey(lambda x, y: x + y).collect()

Total Marks per Course

In [None]:
# Emit (course, marks) for every student, then sum the marks per course
rdd5 = rdd.map(lambda x: (x[3], int(x[5])))
rdd5.reduceByKey(lambda x, y: x + y).collect()

Average Marks per Course

In [None]:
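# Pair each course with (marks, 1) so totals and counts can be summed together
rdd6 = rdd.map(lambda x: (x[3], (int(x[5]), 1)))
# Add totals and counts element-wise: the result is (course, (total_marks, count))
rdd6 = rdd6.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))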

In [None]:
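# Divide total by count to get the average. The two lines are equivalent,
# and the notebook displays only the result of the last expression.
rdd6.map(lambda x: (x[0], (x[1][0] / x[1][1]))).collect()
# mapValues transforms only the value and leaves the key untouched
rdd6.mapValues(lambda x: (x[0] / x[1])).collect()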
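
The (total, count) trick can be checked by hand on a few numbers. The sketch below uses invented marks for a single course and a hypothetical helper, `merge_sum_count`, that plays the same role as the lambda passed to `reduceByKey` above.

```python
from functools import reduce


def merge_sum_count(a, b):
    """Combine two (total, count) pairs element-wise."""
    return (a[0] + b[0], a[1] + b[1])


db_marks = [59, 45, 70]  # hypothetical marks for one course

# Each mark starts as (mark, 1); folding the pairs yields (total, count)
total, count = reduce(merge_sum_count, [(m, 1) for m in db_marks])
average = total / count
print(total, count, average)
```

Carrying the count alongside the total is what lets partial results be merged correctly. Averaging partial averages would not work: (59 + 45) / 2 = 52, and averaging 52 with 70 gives 61, not the true mean of 58.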

Finding Minimum & Maximum Marks per Course

In [None]:
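rdd7 = rdd.map(lambda x: (x[3], int(x[5])))
# Maximum marks per course: keep the larger value of each pair
print(rdd7.reduceByKey(lambda x, y: x if x > y else y).collect())
# Minimum marks per course: keep the smaller value of each pair
print(rdd7.reduceByKey(lambda x, y: x if x < y else y).collect())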
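
The cell above runs two separate reductions. As a possible refinement, both extremes can be tracked in one pass by carrying a (min, max) pair per key. The sketch below demonstrates the combining function locally on invented marks; `merge_min_max` is a hypothetical helper, not part of PySpark.

```python
from functools import reduce


def merge_min_max(a, b):
    """Combine two (min, max) pairs into one."""
    return (min(a[0], b[0]), max(a[1], b[1]))


marks = [59, 45, 70, 62]  # hypothetical marks for one course

# Each mark starts as (mark, mark); folding the pairs yields (lowest, highest)
lowest, highest = reduce(merge_min_max, [(m, m) for m in marks])
print(lowest, highest)

# On the notebook's RDD the same function could be used as:
# rdd7.mapValues(lambda m: (m, m)).reduceByKey(merge_min_max).collect()
```

Because taking a minimum or maximum is associative and commutative, the combined pair is the same regardless of the order in which Spark merges partitions.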