## Start SparkSession

In [1]:
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()

## Load the Titanic Dataset

In [2]:
df = spark.read.csv("dataset/titanic.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|   

## Count Total Records

In [3]:
print("Total Records:", df.count())

Total Records: 891


## Group by Survival

In [4]:
df.groupBy("Survived").count().show()

+--------+-----+
|Survived|count|
+--------+-----+
|       1|  342|
|       0|  549|
+--------+-----+
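
The insights list at the end of this notebook also reports a gender split (577 male, 314 female), but no cell shown here computes it. The same `groupBy(...).count()` pattern would cover it. Below is a minimal sketch, assuming `df` is the DataFrame loaded in In [2] and the Spark session has not been stopped yet; the helper name `count_by` is hypothetical and not part of the original notebook.

```python
def count_by(df, column):
    """Return {value: number of rows} for one column of a Spark DataFrame.

    Hypothetical helper for illustration; it only relies on the standard
    DataFrame calls groupBy(), count() and collect().
    """
    rows = df.groupBy(column).count().collect()
    # Each collected Row exposes its fields by name.
    return {row[column]: row["count"] for row in rows}


# Usage (assumes `df` from In [2] and an active session):
# count_by(df, "Sex")
# count_by(df, "Survived")
```

Collecting to a dictionary is reasonable here because the grouped result has only a handful of rows; for high-cardinality columns, `.show()` or writing the result out would be the safer choice.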



## Average Age

In [5]:
df.select("Age").describe().show()

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|               714|
|   mean| 29.69911764705882|
| stddev|14.526497332334035|
|    min|              0.42|
|    max|              80.0|
+-------+------------------+



## Average Fare by Class

In [6]:
df.groupBy("Pclass").avg("Fare").show()

+------+------------------+
|Pclass|         avg(Fare)|
+------+------------------+
|     1| 84.15468749999992|
|     3|13.675550101832997|
|     2| 20.66218315217391|
+------+------------------+



## Stop Spark

In [7]:
spark.stop()

## Insights Derived from Big Data Analysis
- **Total Records Processed**: 891
- **Survival Distribution**:
  - Survived: 342
  - Not Survived: 549
- **Gender Distribution**:
  - Male: 577
  - Female: 314
- **Average Age**: ~29.7 years (mean over the 714 records with a recorded `Age`; the remaining 177 of 891 have no age value)
- **Average Fare by Class**:
  - 1st Class: 84.15
  - 2nd Class: 20.66
  - 3rd Class: 13.68
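
The raw counts above are easier to interpret as proportions. The following sketch uses plain Python and only the figures reported in this notebook's outputs (no Spark session needed); the variable names are illustrative.

```python
# Figures copied from the notebook outputs and the insights list above.
total = 891
survived, not_survived = 342, 549
male, female = 577, 314
age_count = 714  # non-null Age values reported by describe()

# Sanity checks: each breakdown should account for every record.
assert survived + not_survived == total
assert male + female == total

survival_rate = survived / total
female_share = female / total
missing_age = total - age_count

print(f"Survival rate: {survival_rate:.1%}")  # Survival rate: 38.4%
print(f"Female share: {female_share:.1%}")    # Female share: 35.2%
print(f"Records without Age: {missing_age} ({missing_age / total:.1%})")  # 177 (19.9%)
```

Roughly one record in five lacks an age, so the ~29.7 average describes only the passengers whose age was recorded.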