### Install PySpark
We first install **PySpark**, since this tutorial assumes it is not already available in the Google Colab runtime.  
The command below uses `pip` to install the latest PySpark release; the leading `!` tells the notebook to run the line as a shell command.


In [None]:
!pip install pyspark



### Create a SparkSession
To use PySpark, we need to create a **SparkSession**.  
- `SparkSession` is the entry point to programming with Spark in Python.  
- Here, we set the application name to `"Colab-PySpark-Basics"`.  
- After creating the session, we print the current **Apache Spark version** to confirm everything is set up correctly.


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Colab-PySpark-Basics").getOrCreate()
print("Apache Spark Version:", spark.version)

Apache Spark Version: 3.5.1


### Create a Sample DataFrame
Now we will create a simple **DataFrame** in PySpark.  

1. Define some **sample data** as a list of tuples.  
   - Example: `("Rahul", 21)` represents a row.  
2. Define the **schema** (column names) → `["Name", "Age"]`.  
3. Use `spark.createDataFrame(data, columns)` to create the DataFrame.  
4. Finally, use `df.show()` to display the DataFrame in a tabular format.


In [None]:
# Sample data: each tuple is one row
data = [("Rahul", 21), ("Priya", 22), ("Aman", 20)]

# Define schema (column names)
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Rahul| 21|
|Priya| 22|
| Aman| 20|
+-----+---+



### Basic DataFrame Operations
Let’s perform some common operations on the DataFrame:

1. **Select a column** → `df.select("Name").show()`  
   - Displays only the `Name` column.  

2. **Filter rows** → `df.filter(df["Age"] > 20).show()`  
   - Returns only the rows where `Age` is greater than 20.  

3. **Count rows** → `df.count()`  
   - Returns the total number of rows in the DataFrame.


In [None]:
# Select column
df.select("Name").show()

# Filter rows
df.filter(df["Age"] > 20).show()

# Count rows (count is a method, so it must be called with parentheses)
print("Total rows:", df.count())

+-----+
| Name|
+-----+
|Rahul|
|Priya|
| Aman|
+-----+

+-----+---+
| Name|Age|
+-----+---+
|Rahul| 21|
|Priya| 22|
+-----+---+

Total rows: 3
