In PySpark, the crosstab() function computes a cross-tabulation (contingency table) of two columns, showing how often each combination of their values occurs. It’s particularly useful for examining relationships between categorical columns in a DataFrame.

📌 Syntax
DataFrame.crosstab(col1, col2)


col1 → Name of the first column (row dimension).

col2 → Name of the second column (column dimension).

It returns a new DataFrame where:

The first column is named col1_col2 and holds the distinct values of col1.

The other columns represent distinct values of col2.

The cells contain the frequency counts; a combination that never occurs gets a count of 0.

In [0]:
data = [
    ("A", "X"),
    ("A", "Y"),
    ("B", "X"),
    ("B", "Y"),
    ("B", "X"),
    ("C", "Y"),
]

df = spark.createDataFrame(data, ["col1", "col2"])
df.display()


col1,col2
A,X
A,Y
B,X
B,Y
B,X
C,Y


In [0]:
cross_df = df.crosstab("col1", "col2")
cross_df.display()


col1_col2,X,Y
B,2,1
C,0,1
A,1,1
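
To make the counts above easier to verify by hand, here is a plain-Python sketch that builds the same table from the example rows. It is only an illustration of what the result contains, not how Spark computes it; the helper name crosstab_sketch is hypothetical, and the rows are sorted here for readability, whereas Spark does not guarantee any row order.

```python
from collections import Counter


def crosstab_sketch(pairs, col1_name="col1", col2_name="col2"):
    """Build a contingency table (header + rows) from (col1, col2) pairs."""
    counts = Counter(pairs)                      # frequency of each (col1, col2) pair
    row_values = sorted({a for a, _ in pairs})   # distinct col1 values -> rows
    col_values = sorted({b for _, b in pairs})   # distinct col2 values -> columns
    header = [f"{col1_name}_{col2_name}"] + col_values
    # Counter returns 0 for pairs that never occur, matching crosstab()'s zeros
    rows = [[r] + [counts[(r, c)] for c in col_values] for r in row_values]
    return header, rows


pairs = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("B", "X"), ("C", "Y")]
header, rows = crosstab_sketch(pairs)

print(",".join(header))
for row in rows:
    print(",".join(str(cell) for cell in row))
```

The (C, X) cell comes out as 0 because that pair never appears in the data, which is the same behaviour the crosstab() output shows.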


✅ Notes

The first column name is formed by joining the two input column names with an underscore.

In this example, that gives "col1_col2".

The table shows counts of occurrences of each (col1, col2) pair.

Example: For B, X, the count is 2.

Good for categorical analysis, exploratory data analysis (EDA), or feature engineering.

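crosstab() accepts exactly two columns. To group by more than one column, use groupBy() + pivot() instead. Note that pivot() leaves combinations that never occur as null rather than 0.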

In [0]:
df.groupBy("col1").pivot("col2").count().display()


col1,X,Y
B,2,1
C,null,1
A,1,1
