### Working with CSV Data in Python
Now let’s see how to read and process **CSV data** using Python’s built-in `csv` module:

1. **Create CSV data as a string**  
   - Instead of reading from a file, we store CSV data in a multiline string.  

2. **Use `StringIO` to treat the string like a file**  
   - `io.StringIO(csv_data)` allows us to use the string as if it were a file object.  

3. **Read CSV using `DictReader`**  
   - `csv.DictReader(file_like)` reads each row as a dictionary where column names are keys.  

4. **Process the records**  
   - Loop through each row, strip extra spaces from keys, and print in a formatted way.


In [None]:
import csv
import io

# Step 1: Create CSV data as a string
csv_data = """id, name, department, salary
1, Rahul Sharma, IT, 55000
2, Priya Singh, HR, 60000
3, Aman Kumar, Finance, 48000
4, Sneha Reddy, Marketing, 52000
5, Arjun Mehta, IT, 75000"""

# Step 2: Use StringIO to treat the string like a file
file_like = io.StringIO(csv_data)

# Step 3: Read CSV using DictReader
reader = csv.DictReader(file_like)

print("Employee Records:")

# Step 4: Process the records
for row in reader:
    # Remove leading and trailing spaces from keys (values are left as read)
    row = {k.strip(): v for k, v in row.items()}
    print(f"{row['id']} - {row['name']} ({row['department']}) {row['salary']}")

Employee Records:
1 -  Rahul Sharma ( IT)  55000
2 -  Priya Singh ( HR)  60000
3 -  Aman Kumar ( Finance)  48000
4 -  Sneha Reddy ( Marketing)  52000
5 -  Arjun Mehta ( IT)  75000
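
The records printed above still carry a leading space in each value (for example `( IT)`), because only the keys are stripped. One possible way to avoid this, sketched here with a shortened copy of the sample data, is the `csv` module's `skipinitialspace=True` option, which ignores whitespace that directly follows a delimiter in both the header and the data rows:

```python
import csv
import io

# Shortened copy of the sample data, with a space after each comma
csv_data = """id, name, department, salary
1, Rahul Sharma, IT, 55000
2, Priya Singh, HR, 60000"""

# skipinitialspace=True drops whitespace that directly follows a delimiter,
# so header names and values both come back without leading spaces
reader = csv.DictReader(io.StringIO(csv_data), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(f"{row['id']} - {row['name']} ({row['department']}) {row['salary']}")
```

With this option the manual `k.strip()` step is no longer needed for this data, although trailing spaces (if any) would still have to be stripped separately.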


### Install and Initialize PySpark
1. **Install PySpark**  
   - A Google Colab runtime may not have PySpark pre-installed, so we run `!pip install pyspark` (this is harmless if the package is already present).  

2. **Create a SparkSession**  
   - `SparkSession` is the entry point for PySpark applications.  
   - Here, we set the app name as `"Employee.Analysis"`.  
   - Once created, we can use `spark` to work with DataFrames and SQL queries.


In [None]:
!pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Employee.Analysis").getOrCreate()



### Create a CSV File in Colab
Instead of uploading a file manually, we can **create a CSV file directly in Colab**:

1. Define the CSV data as a **multiline string**.  
   - Each row contains `id, name, department, salary`.  

2. Use Python’s `open()` in **write mode ("w")**.  
   - This creates a new file named `employees.csv` in the current working directory.  

3. Write the CSV string into the file using `f.write(csv_data)`.  

After this step, we will have an actual file (`employees.csv`) stored in Colab that can be read using PySpark or Pandas.


In [None]:
# CSV content: a header row followed by one row per employee
csv_data = """id,name,department,salary
1,Rahul Sharma,IT,55000
2,Priya Singh,HR,60000
3,Aman Kumar,Finance,48000
4,Sneha Reddy,Marketing,52000
5,Arjun Mehta,IT,75000
6,Divya Nair,Finance,67000
"""

# Write the string to employees.csv in the current working directory
with open("employees.csv", "w") as f:
    f.write(csv_data)
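
Writing a raw string works for this data because no value contains a comma. As a hedged alternative, the sketch below uses `csv.writer`, which quotes such values automatically. The rows, the file name `employees_demo.csv`, and the temporary directory are illustrative assumptions, chosen so the sketch does not overwrite `employees.csv`:

```python
import csv
import os
import tempfile

# Hypothetical rows for illustration; the second name contains a comma
rows = [
    ["id", "name", "department", "salary"],
    [1, "Rahul Sharma", "IT", 55000],
    [2, "Singh, Priya", "HR", 60000],
]

# Use a temporary directory so the real employees.csv is left untouched
path = os.path.join(tempfile.mkdtemp(), "employees_demo.csv")

# newline="" lets the csv module control line endings itself
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm the comma inside the name survived
with open(path, newline="") as f:
    records = list(csv.DictReader(f))

print(records[1]["name"])
```

The value `Singh, Priya` is written inside quotes, so it is read back as one field rather than two.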

### Load CSV into PySpark DataFrame
Now we will **read the CSV file we created** into a PySpark DataFrame:

1. `spark.read.csv("employees.csv", header=True, inferSchema=True)`  
   - `header=True` → Treat the first row as column names.  
   - `inferSchema=True` → Automatically detect the data type of each column.  

2. `df.show()` → Display the contents of the DataFrame in a tabular format.


In [None]:
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show()

+---+------------+----------+------+
| id|        name|department|salary|
+---+------------+----------+------+
|  1|Rahul Sharma|        IT| 55000|
|  2| Priya Singh|        HR| 60000|
|  3|  Aman Kumar|   Finance| 48000|
|  4| Sneha Reddy| Marketing| 52000|
|  5| Arjun Mehta|        IT| 75000|
|  6|  Divya Nair|   Finance| 67000|
+---+------------+----------+------+



---

### 📝 Key Points about Transformations

* **Lazy Execution**:

  Spark doesn’t run transformations right away. Instead, it builds a **logical plan** (a DAG – Directed Acyclic Graph).

  The computation only runs when an **action** (like `.show()` or `.count()`) is called.

* **Return Type**:

  A transformation always returns a **new DataFrame or RDD**. It does **not modify the existing one**.

* **Two Types of Transformations**:

  1. **Narrow Transformations** → Each input partition contributes to only one output partition.

     (e.g., `map()`, `filter()`, `select()`)

  2. **Wide Transformations** → Data is shuffled across partitions, so one output partition may depend on many input partitions.

     (e.g., `groupBy()`, `join()`, `orderBy()`)
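
Lazy execution can be illustrated without Spark. The sketch below is a plain-Python analogy only, using a generator; it is not how Spark is implemented, and the function name `keep_high_salaries` and the `log` list are made up for this illustration. Building the pipeline does no work, and the work happens only when the result is consumed, which loosely mirrors the transformation/action split:

```python
# Plain-Python analogy: generators are lazy in a way that loosely resembles
# Spark transformations. This is NOT Spark code.
log = []

def keep_high_salaries(salaries, threshold):
    for s in salaries:
        log.append(f"checked {s}")  # records when work actually happens
        if s > threshold:
            yield s

salaries = [55000, 60000, 48000, 52000, 75000, 67000]

# "Transformation": builds a generator, nothing is evaluated yet
pipeline = keep_high_salaries(salaries, 60000)
steps_before = len(log)

# "Action": consuming the generator triggers the work
result = list(pipeline)
steps_after = len(log)

print(steps_before, steps_after, result)
```

Note that `salaries` itself is unchanged afterwards, in the same spirit as a transformation returning a new DataFrame rather than modifying the existing one.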

---


In [None]:
# Select name & salary
df.select("name", "salary").show()
# Filter employees with salary > 60,000
df.filter(df["salary"] > 60000).show()
# Order by salary descending
df.orderBy(df["salary"].desc()).show()

+------------+------+
|        name|salary|
+------------+------+
|Rahul Sharma| 55000|
| Priya Singh| 60000|
|  Aman Kumar| 48000|
| Sneha Reddy| 52000|
| Arjun Mehta| 75000|
|  Divya Nair| 67000|
+------------+------+

+---+-----------+----------+------+
| id|       name|department|salary|
+---+-----------+----------+------+
|  5|Arjun Mehta|        IT| 75000|
|  6| Divya Nair|   Finance| 67000|
+---+-----------+----------+------+

+---+------------+----------+------+
| id|        name|department|salary|
+---+------------+----------+------+
|  5| Arjun Mehta|        IT| 75000|
|  6|  Divya Nair|   Finance| 67000|
|  2| Priya Singh|        HR| 60000|
|  1|Rahul Sharma|        IT| 55000|
|  4| Sneha Reddy| Marketing| 52000|
|  3|  Aman Kumar|   Finance| 48000|
+---+------------+----------+------+



---

### 📝 What is Aggregation?

* An operation that **groups data** and applies a **summary function** (like sum, avg, count, min, max).

* Used to answer questions like:

  * *“What is the average salary per department?”*

  * *“How many employees are in each department?”*

  * *“What is the highest salary in Finance?”*

---


In [None]:
# Average salary per department
df.groupBy("department").avg("salary").show()
# Maximum salary per department
df.groupBy("department").max("salary").show()
# Count employees per department
df.groupBy("department").count().show()

+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|        HR|    60000.0|
|   Finance|    57500.0|
| Marketing|    52000.0|
|        IT|    65000.0|
+----------+-----------+

+----------+-----------+
|department|max(salary)|
+----------+-----------+
|        HR|      60000|
|   Finance|      67000|
| Marketing|      52000|
|        IT|      75000|
+----------+-----------+

+----------+-----+
|department|count|
+----------+-----+
|        HR|    1|
|   Finance|    2|
| Marketing|    1|
|        IT|    2|
+----------+-----+
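
The figures in the tables above can be cross-checked by hand. The following is a minimal plain-Python sketch (not Spark) that groups the same six employee rows by department and computes the average, maximum, and count; the variable names are illustrative:

```python
from collections import defaultdict

# The six rows from employees.csv as (name, department, salary)
employees = [
    ("Rahul Sharma", "IT", 55000),
    ("Priya Singh", "HR", 60000),
    ("Aman Kumar", "Finance", 48000),
    ("Sneha Reddy", "Marketing", 52000),
    ("Arjun Mehta", "IT", 75000),
    ("Divya Nair", "Finance", 67000),
]

# Group step: collect salaries per department
salaries_by_dept = defaultdict(list)
for _name, dept, salary in employees:
    salaries_by_dept[dept].append(salary)

# Summary step: apply avg, max and count to each group
summary = {
    dept: {"avg": sum(vals) / len(vals), "max": max(vals), "count": len(vals)}
    for dept, vals in salaries_by_dept.items()
}

for dept, stats in summary.items():
    print(dept, stats)
```

This is the same "group, then summarize" pattern that `groupBy(...).avg(...)`, `.max(...)`, and `.count()` express, only on a single machine and without partitions.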



### SQL Queries on PySpark DataFrame
PySpark allows us to run **SQL queries** on DataFrames by creating a temporary view:

1. `df.createOrReplaceTempView("employees")`  
   - Creates a **temporary view** named `"employees"` that we can query using SQL syntax.  

2. Run an **SQL query** against the view with `spark.sql(...)`:  
   ```sql
   SELECT department, AVG(salary) AS avg_salary
   FROM employees
   GROUP BY department
   ```


In [None]:
df.createOrReplaceTempView("employees")

spark.sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department").show()

+----------+----------+
|department|avg_salary|
+----------+----------+
|        HR|   60000.0|
|   Finance|   57500.0|
| Marketing|   52000.0|
|        IT|   65000.0|
+----------+----------+
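
To experiment with the query itself outside Spark, the sketch below runs it against an in-memory SQLite database from Python's standard library. SQLite is a stand-in here, not Spark SQL, and dialects can differ for more complex queries; an `ORDER BY` is added (an assumption of this sketch, not part of the query above) so the row order is deterministic:

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Spark temporary view
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Rahul Sharma", "IT", 55000),
        (2, "Priya Singh", "HR", 60000),
        (3, "Aman Kumar", "Finance", 48000),
        (4, "Sneha Reddy", "Marketing", 52000),
        (5, "Arjun Mehta", "IT", 75000),
        (6, "Divya Nair", "Finance", 67000),
    ],
)

# Same aggregation as the Spark SQL query, with ORDER BY added for stable output
query = """
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY department
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

The averages match the `avg_salary` column shown above; only the row order differs, because Spark does not guarantee an order for `GROUP BY` results.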

