Below are some commonly used PySpark DataFrame attributes, actions, and transformations, each with a brief explanation. Transformations such as .select(...) and .filter(...) return a new DataFrame, so you can chain them and finish with .show(10, truncate=False) to inspect a subset of your data:
	1.	df.dtypes
	•	Purpose: Returns a list of (columnName, columnType) pairs for the DataFrame.
	•	Usage:

print(df.dtypes)


	2.	df.columns
	•	Purpose: Returns a list of the column names.
	•	Usage:

print(df.columns)


	3.	df.printSchema()
	•	Purpose: Prints the schema of the DataFrame in a tree format.
	•	Usage:

df.printSchema()


	4.	df.count()
	•	Purpose: Returns the number of rows in the DataFrame.
	•	Usage:

row_count = df.count()


	5.	df.head(n) or df.take(n)
	•	Purpose: Returns the first n rows as a list of Row objects.
	•	Usage:

first_ten_rows = df.head(10)


	6.	df.show(n, truncate=False)
	•	Purpose: Prints the first n rows in a tabular form.
	•	Usage:

df.show(10, truncate=False)


	7.	df.select(*cols)
	•	Purpose: Selects a subset of columns.
	•	Usage:

df.select("col1", "col2").show(10, truncate=False)


You can chain .select(...) before .show(...) to print only certain columns:

df.select("course_cd", "race_date", "horse_id").show(10, truncate=False)


	8.	df.filter(condition) or df.where(condition)
	•	Purpose: Filters rows based on a condition.
	•	Usage:

from pyspark.sql.functions import col
df.filter(col("speed") > 10).select("course_cd", "horse_id").show(10, truncate=False)


	9.	df.groupBy(*cols)
	•	Purpose: Groups the DataFrame by the specified columns and returns a GroupedData object for aggregation.
	•	Usage:

df.groupBy("course_cd").count().show(10, truncate=False)


	10.	df.agg(…)
	•	Purpose: Applies aggregate functions such as count, mean, and sum to the entire DataFrame or to grouped data.
	•	Usage:

from pyspark.sql.functions import mean, count
df.agg(mean("speed").alias("avg_speed"), count("*").alias("row_count")).show()


	11.	df.orderBy(*cols) or df.sort(*cols)
	•	Purpose: Sorts the DataFrame by the specified columns, in ascending order by default.
	•	Usage:

df.orderBy("speed").select("course_cd", "speed").show(10, truncate=False)


	12.	df.distinct()
	•	Purpose: Returns a new DataFrame containing distinct rows.
	•	Usage:

distinct_horses = df.select("horse_id").distinct()
distinct_horses.show(10, truncate=False)


	13.	df.drop(*cols)
	•	Purpose: Returns a new DataFrame without the specified columns.
	•	Usage:

df.drop("time_stamp", "location").show(10, truncate=False)


	14.	df.withColumn(newColName, expression)
	•	Purpose: Adds or replaces a column based on a column expression.
	•	Usage:

from pyspark.sql.functions import col, lit
df.withColumn("adjusted_speed", col("speed") * lit(1.1)).show(10, truncate=False)


	15.	df.describe(*cols)
	•	Purpose: Computes basic statistics (count, mean, stddev, min, max) for numeric and string columns.
	•	Usage:

df.describe("speed", "progress").show()


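	16.	Checking a column for nulls
	•	Purpose: Shows or counts the rows in which a column is null (missing).
	•	Usage:

from pyspark.sql.functions import col
# Show the null entries of the equip column
results.select("equip").filter("equip IS NULL").show()
# Or count the rows where equip is null
results.filter(col("equip").isNull()).count()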
    
Combining Methods

You can chain these methods. For example, to show 10 records of a filtered dataset with only certain columns:

from pyspark.sql.functions import col

df.filter(col("speed") > 10).select("course_cd", "horse_id", "speed").show(10, truncate=False)

This command:
	•	Filters rows where speed > 10
	•	Selects only course_cd, horse_id, and speed columns
	•	Displays the first 10 rows without truncating string columns.

By mixing and matching these operations, you can inspect, transform, and analyze your data effectively before modeling.