# Chapter 2: Spark Application Development

This notebook covers basic Python programming, Python's core data structures, data analysis with Pandas, PySpark DataFrames, and basic exploratory data analysis.

## Agenda
1. Basic Python Programming Concepts
2. Python Data Structures
3. Pandas for Data Analysis
4. PySpark DataFrames
5. Exploratory Data Analysis

In [None]:
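# Section 4 imports pyspark, so install it alongside pandas (PySpark also needs a Java runtime)
! pip install pandas pyspark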

## 1. Basic Python Programming Concepts

In [None]:
# Hello World Program in Python
print("Hello, World!")

### Variables and Data Types

In [None]:
# Integer, Float, and String
x = 10          # Integer
y = 3.14        # Float
name = "Spark"  # String

print(f"Integer: {x}, Float: {y}, String: {name}")
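
Python infers a variable's type from the value assigned to it. This matters later in the notebook, because Pandas and PySpark both infer column types from the Python values they receive. The following is a minimal sketch, reusing the values above, of inspecting and converting types with the built-ins `type`, `float`, `int`, and `str`; the names `x_as_float`, `y_as_int`, `x_as_text`, and `total` are introduced here only for illustration.

```python
x = 10          # Integer
y = 3.14        # Float
name = "Spark"  # String

# type() reports the runtime type of each value
print(type(x).__name__, type(y).__name__, type(name).__name__)

# Explicit conversions between types
x_as_float = float(x)  # 10 becomes 10.0
y_as_int = int(y)      # int() truncates toward zero, so 3.14 becomes 3
x_as_text = str(x)     # 10 becomes the string "10"

# Mixing an int and a float in arithmetic yields a float
total = x + y
print(x_as_float, y_as_int, x_as_text, type(total).__name__)
```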

## 2. Python Data Structures

### Lists

In [None]:
# Creating and accessing lists
numbers = [1, 2, 3, 4, 5]
print("List elements:", numbers)
print("First element:", numbers[0])
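
Lists are mutable and ordered, so they support in-place updates and positional slicing. As a short sketch building on the `numbers` list above (the names `first_three`, `last_element`, and `squares` are illustrative additions):

```python
numbers = [1, 2, 3, 4, 5]

# Lists are mutable: append adds to the end, index assignment replaces in place
numbers.append(6)
numbers[0] = 10

# Slicing returns a new list; negative indices count from the end
first_three = numbers[:3]
last_element = numbers[-1]

# A list comprehension builds a new list from an existing one
squares = [n * n for n in numbers]

print(numbers, first_three, last_element, squares)
```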

### Tuples

In [None]:
# Creating and accessing tuples
coordinates = (10, 20)
print("Tuple elements:", coordinates)
print("First element:", coordinates[0])
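
Unlike lists, tuples cannot be changed after creation. That makes them a natural fit for fixed records, such as the row tuples passed to PySpark in section 4. A minimal sketch of unpacking and immutability follows; the `mutated` flag and the `labels` dictionary are hypothetical names used only to demonstrate the behavior.

```python
coordinates = (10, 20)

# Unpacking assigns each element to its own name
x, y = coordinates

# Tuples are immutable: item assignment raises TypeError
try:
    coordinates[0] = 99
    mutated = True
except TypeError:
    mutated = False

# A tuple of hashable elements is itself hashable, so it can be a dictionary key
labels = {coordinates: "start"}

print(x, y, mutated, labels[(10, 20)])
```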

### Dictionaries

In [None]:
# Creating and accessing dictionaries
person = {"name": "Alice", "age": 25}
print("Name:", person["name"])
print("Age:", person["age"])
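
Indexing with `person["name"]` raises `KeyError` when the key is missing. The sketch below shows a safer lookup with `get`, along with updating and iterating over the same dictionary; the `"city"` key is an illustrative addition, not part of the original example.

```python
person = {"name": "Alice", "age": 25}

# get() returns a default instead of raising KeyError for a missing key
city = person.get("city", "unknown")

# Assigning to a new key adds it; assigning to an existing key overwrites it
person["city"] = "New York"
person["age"] = 26

# items() yields (key, value) pairs in insertion order
for key, value in person.items():
    print(f"{key}: {value}")
```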

### Sets

In [None]:
# Creating a set: duplicates are dropped, so the repeated 4 is stored only once
unique_numbers = {1, 2, 3, 4, 4, 5}
print("Set elements:", unique_numbers)
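
Beyond removing duplicates, sets support the usual set algebra. A brief sketch, assuming a second, hypothetical set `evens` for comparison:

```python
unique_numbers = {1, 2, 3, 4, 4, 5}  # the duplicate 4 is stored once
evens = {2, 4, 6}                    # hypothetical second set for illustration

common = unique_numbers & evens      # intersection: elements in both sets
combined = unique_numbers | evens    # union: elements in either set
odds_only = unique_numbers - evens   # difference: elements only in the first set

# Converting a list to a set is a common way to drop duplicates
deduplicated = sorted(set([3, 1, 3, 2, 1]))

print(len(unique_numbers), common, combined, odds_only, deduplicated)
```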

## 3. Pandas for Data Analysis

In [None]:
import pandas as pd

# Creating a DataFrame
data = {
    "Name": ["Alice", "Bob", "Cathy"],
    "Age": [25, 30, 27],
    "City": ["New York", "San Francisco", "Los Angeles"]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)

### Selecting and Filtering Data

In [None]:
# Selecting a column
print("Names:")
print(df["Name"])

# Filtering rows with a boolean condition
print("People older than 25:")
print(df[df["Age"] > 25])
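
Filters often need more than one condition. The sketch below rebuilds the same DataFrame so that it runs on its own, then combines two conditions and uses `loc` to filter rows and pick a column in one step. The variable names `older_not_sf` and `names_over_25` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cathy"],
    "Age": [25, 30, 27],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_not_sf = df[(df["Age"] > 25) & (df["City"] != "San Francisco")]

# loc selects rows by condition and columns by label in a single step
names_over_25 = df.loc[df["Age"] > 25, "Name"].tolist()

print(older_not_sf)
print(names_over_25)
```

Python's `and` and `or` keywords do not work element-wise on Pandas Series, which is why the bitwise operators are used here.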

## 4. PySpark DataFrames

In [None]:
from pyspark.sql import SparkSession

# Initializing a Spark Session
spark = SparkSession.builder.appName("PySparkDataFrame").getOrCreate()

# Creating a PySpark DataFrame from a list of row tuples and a list of column names
rows = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "San Francisco"),
    ("Cathy", 27, "Los Angeles"),
]
columns = ["Name", "Age", "City"]
spark_df = spark.createDataFrame(rows, columns)

print("PySpark DataFrame:")
spark_df.show()

### Querying and Filtering DataFrames

In [None]:
# Selecting columns
spark_df.select("Name", "City").show()

# Filtering rows
spark_df.filter(spark_df.Age > 25).show()

## 5. Exploratory Data Analysis

In [None]:
# Descriptive statistics in PySpark: count, mean, stddev, min, and max per column
spark_df.describe().show()

### Measuring Central Tendency and Dispersion

In [None]:
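from pyspark.sql import functions as F

# Aggregating over the whole DataFrame: groupBy() with no columns treats all rows as one group
spark_df.groupBy().mean("Age").show()  # Mean (central tendency)
spark_df.groupBy().max("Age").show()   # Maximum
spark_df.groupBy().min("Age").show()   # Minimum
spark_df.agg(F.stddev("Age")).show()   # Sample standard deviation (dispersion)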
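
On a dataset this small, the Spark results can be cross-checked by hand. The sketch below uses Python's standard `statistics` module on the same three ages. It is an illustrative check, not part of the Spark workflow, and it adds the median and range as further measures of central tendency and dispersion. The `stddev` that `describe()` reports is the sample standard deviation, which corresponds to `statistics.stdev` here.

```python
import statistics

ages = [25, 30, 27]  # the same ages as in the example DataFrame

mean_age = statistics.mean(ages)      # central tendency
median_age = statistics.median(ages)  # central tendency, less sensitive to outliers
stdev_age = statistics.stdev(ages)    # dispersion: sample standard deviation (divides by n - 1)
age_range = max(ages) - min(ages)     # dispersion: spread between the extremes

print(round(mean_age, 2), median_age, round(stdev_age, 2), age_range)
```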

## Chapter Summary
In this notebook, we covered:
- Python programming fundamentals
- Data structures in Python (lists, tuples, dictionaries, sets)
- Using Pandas for data manipulation
- PySpark DataFrames for distributed data analysis
- Basic exploratory data analysis techniques