# Lesson 3: Cross-Tabulation Analysis in Clustering: A Python Approach

Welcome! Today, we’ll focus on **Cross-Tabulation Analysis**, a vital tool for evaluating clustering models. Cross-tabulation helps study relationships between categorical variables, offering insights into data distribution and clustering performance. This lesson will guide you through its role in clustering evaluation and its implementation using Python, particularly the `pandas.crosstab` function. Let’s dive in!

---

## What is Cross-Tabulation Analysis?

**Cross-Tabulation Analysis** (also called contingency table analysis) is a statistical method that summarizes the joint frequency distribution of two or more categorical variables. Counting how often each combination of categories occurs gives a compact way to quantify the relationship between the variables.

In clustering, cross-tabulation shows how data objects are distributed across clusters, revealing potential associations between cluster assignments and categorical features or known class labels.

### Example Cross-Tabulation Table

|             | Category 1 | Category 2 | ... | Category n |
|-------------|------------|------------|-----|------------|
| **Class 1** | n₁₁        | n₁₂        | ... | n₁ₙ        |
| **Class 2** | n₂₁        | n₂₂        | ... | n₂ₙ        |
| ...         | ...        | ...        | ... | ...        |
| **Class m** | nₘ₁        | nₘ₂        | ... | nₘₙ        |

Here, `nᵢⱼ` is the number of observations that belong to class `i` and fall into category `j`.
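To make the link to clustering concrete, here is a minimal sketch that cross-tabulates cluster assignments against known class labels. The labels below are hypothetical, invented only for illustration; in practice they would come from a fitted clustering model and a labeled dataset.

```python
import pandas as pd

# Hypothetical labels, invented for illustration only
true_class = ['cat', 'cat', 'dog', 'dog', 'dog', 'bird']
cluster_id = [0, 0, 1, 1, 0, 2]

# Rows are known classes, columns are cluster assignments;
# each cell counts the observations with that (class, cluster) pair
table = pd.crosstab(
    pd.Series(true_class, name='Class'),
    pd.Series(cluster_id, name='Cluster'),
)
print(table)
```

In this sketch, the `bird` row has all of its observations in a single cluster, while the `dog` row is split across clusters 0 and 1, which is the kind of pattern a cross-tabulation makes easy to spot.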

---

## Implementing Cross-Tabulation in Python

### Using Python Dictionaries

We can implement cross-tabulation manually using Python dictionaries. The function below expects `data` to be a dictionary of equal-length lists that includes a `'Target'` key:

```python
def cross_tabulation(data, feature):
    classes = set(data['Target'])
    feature_values = set(data[feature])

    # Initialize cross table with zeros
    cross_tab = {value: {class_: 0 for class_ in classes} for value in feature_values}

    # Populate cross table with counts
    for i in range(len(data['Target'])):
        cross_tab[data[feature][i]][data['Target'][i]] += 1

    return cross_tab
```

This dictionary-based approach is efficient and straightforward for small datasets.
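For comparison, the same counts can be collected in one pass with `collections.Counter` from the standard library. This is an alternative sketch, not the lesson's function; the helper name `pair_counts` and the tiny sample are assumptions made for illustration.

```python
from collections import Counter

def pair_counts(data, feature, target='Target'):
    """Count how often each (feature value, target value) pair occurs."""
    return Counter(zip(data[feature], data[target]))

# Tiny illustrative sample (hypothetical)
sample = {
    'Feature1': ['A', 'B', 'A', 'B'],
    'Target': [1, 0, 1, 1],
}

counts = pair_counts(sample, 'Feature1')
print(counts[('A', 1)])  # pairs where Feature1 == 'A' and Target == 1
print(counts[('B', 0)])  # pairs where Feature1 == 'B' and Target == 0
print(counts[('A', 0)])  # a pair that never occurs in the sample
```

Unlike the nested-dictionary version, a `Counter` stores only the pairs that actually occur, and looking up an unseen pair returns 0 instead of raising a `KeyError`.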

---

### Using `pandas.crosstab`

Python’s `pandas` library simplifies cross-tabulation with the `crosstab` function. Here’s an example:

```python
import pandas as pd

data = {
    'Feature1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Feature2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Target': [1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
print(pd.crosstab(df['Target'], df['Feature1']))
```

**Output:**

| Feature1 | A | B |
|----------|---|---|
| **Target** |   |   |
| 0        | 0 | 4 |
| 1        | 4 | 0 |

This table shows that all observations with `Target = 1` have `Feature1 = A`, and all with `Target = 0` have `Feature1 = B`.
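Raw counts can be hard to compare when classes differ in size. As a small sketch, `pd.crosstab` accepts a `normalize` argument; with `normalize='index'` each row is divided by its row total, so the cells become within-class proportions. The dataset below is a hypothetical, slightly unbalanced variation, chosen so that the proportions are not all 0 or 1.

```python
import pandas as pd

# Hypothetical, unbalanced variation of the sample data
df = pd.DataFrame({
    'Feature1': ['A', 'A', 'A', 'B', 'B', 'A'],
    'Target':   [1, 1, 1, 1, 0, 0],
})

# Each row is divided by its own total, so every row sums to 1
proportions = pd.crosstab(df['Target'], df['Feature1'], normalize='index')
print(proportions)
```

Passing `normalize='columns'` or `normalize='all'` instead would divide by column totals or by the grand total, respectively.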

---

### Applying Cross-Tabulation

We can apply the `cross_tabulation` function to analyze multiple features:

```python
table1 = cross_tabulation(data, 'Feature1')
table2 = cross_tabulation(data, 'Feature2')

print(pd.DataFrame(table1))
print(pd.DataFrame(table2))
```

**Output:**

For `Feature1`:

|   | A | B |
|---|---|---|
| 0 | 0 | 4 |
| 1 | 4 | 0 |

For `Feature2`:

|   | X | Y |
|---|---|---|
| 0 | 0 | 4 |
| 1 | 4 | 0 |

These tables summarize how observations for each feature are distributed across class labels.
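As a sanity check, the manual function and `pd.crosstab` should agree once rows and columns are put in the same order. The sketch below repeats the `cross_tabulation` function (in slightly condensed form, using `zip`) and the sample data so that it runs on its own. Sorting both axes is a precaution, because Python sets do not guarantee iteration order.

```python
import pandas as pd

def cross_tabulation(data, feature):
    """Nested-dict cross table: {feature value: {class: count}}."""
    classes = set(data['Target'])
    feature_values = set(data[feature])
    cross_tab = {v: {c: 0 for c in classes} for v in feature_values}
    for value, class_ in zip(data[feature], data['Target']):
        cross_tab[value][class_] += 1
    return cross_tab

data = {
    'Feature1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Feature2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Target': [1, 0, 1, 0, 1, 0, 1, 0]
}

# Outer dict keys become columns, inner keys become the row index
manual = pd.DataFrame(cross_tabulation(data, 'Feature1'))
manual = manual.sort_index().sort_index(axis=1)

df = pd.DataFrame(data)
auto = pd.crosstab(df['Target'], df['Feature1'])

# Compare the raw counts cell by cell
print((manual.values == auto.values).all())
```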

---

## Summary

Great work! You’ve learned about **Cross-Tabulation Analysis** and its importance in evaluating clustering models. You’ve also explored how to implement it in Python, both manually with dictionaries and with the `pandas.crosstab` function. These techniques will help you uncover valuable insights from your data. Keep practicing and enjoy your journey into clustering!

--- 


## Exploring Cluster Assignments with Cross-Tabulation

Great job on learning about cross-tabulation using pandas, Stellar Navigator! Here is a small dataset, along with code that creates cross-tabulation tables. These tables show how frequently the different categories of `Feature1` and `Feature2` occur within each cluster labeled by `Target`. Click Run to observe the created tables!

```py
import pandas as pd

# Sample dataset with features and 'Target' representing cluster assignments
data = {
    'Feature1': ['A', 'B', 'A', 'A', 'B', 'B'],
    'Feature2': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
    'Target': [1, 2, 1, 1, 2, 1]
}

# The pandas DataFrame holding our data
df = pd.DataFrame(data)

# Using pandas crosstab to create cross-tabulation tables
table1 = pd.crosstab(df['Target'], df['Feature1'])
table2 = pd.crosstab(df['Target'], df['Feature2'])

# Displaying the cross-tabulation tables
print("Cross-tabulation table for Feature1:")
print(table1)
print("\nCross-tabulation table for Feature2:")
print(table2)
```

This code snippet demonstrates how to use `pandas.crosstab` to create cross-tabulation tables for analyzing the frequency distribution of categorical variables (`Feature1` and `Feature2`) across clusters labeled by `Target`. Here's a breakdown of what happens when you run the code:

### Dataset
The dataset contains:
- **Feature1**: Categories `A` and `B`.
- **Feature2**: Categories `X` and `Y`.
- **Target**: Cluster labels `1` and `2`.

### Code Explanation
1. **Data Preparation**:
   - A dictionary is used to define the dataset, which is then converted into a pandas DataFrame.

2. **Cross-Tabulation**:
   - `pd.crosstab` is used to compute the frequency of each category in `Feature1` and `Feature2` for each cluster (`Target`).

3. **Output**:
   - Two cross-tabulation tables are printed: one for `Feature1` and one for `Feature2`.

### Expected Output
When you run the code, the output will look like this:

```
Cross-tabulation table for Feature1:
Feature1  A  B
Target        
1         3  1
2         0  2

Cross-tabulation table for Feature2:
Feature2  X  Y
Target        
1         1  3
2         2  0
```

### Interpretation
1. **Feature1 Table**:
   - For `Target = 1`, there are 3 occurrences of `A` and 1 occurrence of `B`.
   - For `Target = 2`, there are 0 occurrences of `A` and 2 occurrences of `B`.

2. **Feature2 Table**:
   - For `Target = 1`, there is 1 occurrence of `X` and 3 occurrences of `Y`.
   - For `Target = 2`, there are 2 occurrences of `X` and 0 occurrences of `Y`.

### Key Takeaways
- Cross-tabulation tables provide a clear summary of how categorical variables are distributed across clusters.
- This method is particularly useful for understanding relationships between features and cluster assignments in clustering analysis.
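One way to read such a table programmatically is to ask which category dominates each cluster. The sketch below rebuilds the `Feature1` table from the dataset above and uses `DataFrame.idxmax(axis=1)` to pick the most frequent category per cluster. Treating the most frequent category as a cluster's "dominant" one is an illustrative convention here, not something defined by the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    'Feature1': ['A', 'B', 'A', 'A', 'B', 'B'],
    'Target': [1, 2, 1, 1, 2, 1],
})

table1 = pd.crosstab(df['Target'], df['Feature1'])

# Most frequent Feature1 category within each cluster (row-wise argmax)
dominant = table1.idxmax(axis=1)

# Share of each cluster's observations that fall in its dominant category
share = table1.max(axis=1) / table1.sum(axis=1)

print(dominant)
print(share)
```

A share close to 1 suggests the cluster lines up almost entirely with a single category, while a lower share indicates a more mixed cluster.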


## Cross-Tabulation Power Unleashed

Space Voyager, it's time to implement a crucial piece from our data analysis toolkit. Use the power of pandas to reveal the relationship between our data categories. Remember, cross-tabulation tells us how categories and classes intertwine!

```py
import pandas as pd

# Example dataset
data = {
    'Target': ['Class1', 'Class2', 'Class1', 'Class3', 'Class2', 'Class1'],
    'Feature1': ['Category1', 'Category2', 'Category1', 'Category2', 'Category1', 'Category3']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# TODO: Use pandas to calculate the cross-tabulation of 'Target' and 'Feature1'

print(cross_tab_result)

```

To calculate the cross-tabulation of `Target` and `Feature1` using pandas, we can use the `pd.crosstab` function. Here's the completed code:

```python
import pandas as pd

# Example dataset
data = {
    'Target': ['Class1', 'Class2', 'Class1', 'Class3', 'Class2', 'Class1'],
    'Feature1': ['Category1', 'Category2', 'Category1', 'Category2', 'Category1', 'Category3']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Use pandas to calculate the cross-tabulation of 'Target' and 'Feature1'
cross_tab_result = pd.crosstab(df['Target'], df['Feature1'])

# Display the result
print(cross_tab_result)
```

### Expected Output
When you run the code, the output will look like this:

```
Feature1  Category1  Category2  Category3
Target                                   
Class1            2          0          1
Class2            1          1          0
Class3            0          1          0
```

### Explanation
1. **Cross-Tabulation**:
   - The `pd.crosstab` function computes the frequency of each combination of `Target` and `Feature1`.
   - Rows represent the `Target` classes (`Class1`, `Class2`, `Class3`).
   - Columns represent the categories of `Feature1` (`Category1`, `Category2`, `Category3`).

2. **Interpretation**:
   - For `Class1`, there are 2 occurrences of `Category1`, 0 occurrences of `Category2`, and 1 occurrence of `Category3`.
   - For `Class2`, there is 1 occurrence of `Category1` and 1 occurrence of `Category2`, but no occurrences of `Category3`.
   - For `Class3`, there is 1 occurrence of `Category2` and no occurrences of the other categories.

This cross-tabulation table provides a clear summary of how the categories in `Feature1` are distributed across the classes in `Target`.
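If row and column totals are also of interest, `pd.crosstab` can append them via `margins=True`. The sketch below reuses the exercise dataset; the label `'Total'` passed to `margins_name` is an arbitrary choice (the default label is `'All'`).

```python
import pandas as pd

df = pd.DataFrame({
    'Target': ['Class1', 'Class2', 'Class1', 'Class3', 'Class2', 'Class1'],
    'Feature1': ['Category1', 'Category2', 'Category1', 'Category2', 'Category1', 'Category3'],
})

# margins=True adds a final row and column holding the totals
with_totals = pd.crosstab(
    df['Target'], df['Feature1'], margins=True, margins_name='Total'
)
print(with_totals)
```

The bottom-right cell holds the grand total, which should equal the number of rows in `df`.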

## Implementing Cross-Tabulation Analysis with Pandas

Now that you understand how to apply cross-tabulation analysis, let's test your knowledge, Space Voyager! Write a Python script that examines the relationship between features and class labels using `pandas.crosstab`. Build a DataFrame from the sample dataset and cross-tabulate each of the two features against the target.

```py
import pandas as pd

# Sample dataset
data = {
    'Feature1': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C'],
    'Feature2': ['X', 'Y', 'X', 'Z', 'Y', 'Y', 'X', 'Z'],
    'Target': [1, 2, 1, 1, 2, 2, 1, 1]
}

# TODO: Create a DataFrame called df from the sample dataset

# TODO: Perform cross-tabulation between 'Target' and 'Feature1', save it to cross_tab1

# TODO: Perform cross-tabulation between 'Target' and 'Feature2', save it to cross_tab2

# TODO: Print out cross_tab1 and cross_tab2 with a descriptive message

```

One possible solution:


```python
import pandas as pd

# Sample dataset
data = {
    'Feature1': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C'],
    'Feature2': ['X', 'Y', 'X', 'Z', 'Y', 'Y', 'X', 'Z'],
    'Target': [1, 2, 1, 1, 2, 2, 1, 1]
}

# Create a DataFrame with the sample dataset
df = pd.DataFrame(data)

# Perform cross-tabulation between 'Target' and 'Feature1'
cross_tab1 = pd.crosstab(df['Target'], df['Feature1'])

# Perform cross-tabulation between 'Target' and 'Feature2'
cross_tab2 = pd.crosstab(df['Target'], df['Feature2'])

# Print the results with descriptive messages
print("Cross-tabulation between 'Target' and 'Feature1':")
print(cross_tab1)
print("\nCross-tabulation between 'Target' and 'Feature2':")
print(cross_tab2)
```

### Expected Output
When you run the script, you should see the following output:

```
Cross-tabulation between 'Target' and 'Feature1':
Feature1  A  B  C
Target            
1         3  0  2
2         0  3  0

Cross-tabulation between 'Target' and 'Feature2':
Feature2  X  Y  Z
Target            
1         3  0  2
2         0  3  0
```

### Explanation
1. **Cross-tabulation for `Feature1`**:
   - For `Target = 1`, there are 3 occurrences of `A`, 0 occurrences of `B`, and 2 occurrences of `C`.
   - For `Target = 2`, there are 0 occurrences of `A`, 3 occurrences of `B`, and 0 occurrences of `C`.

2. **Cross-tabulation for `Feature2`**:
   - For `Target = 1`, there are 3 occurrences of `X`, 0 occurrences of `Y`, and 2 occurrences of `Z`.
   - For `Target = 2`, there are 0 occurrences of `X`, 3 occurrences of `Y`, and 0 occurrences of `Z`.

This script provides a clear summary of how the categories in `Feature1` and `Feature2` are distributed across the `Target` classes.
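Both features can also be placed in a single table by passing a list of columns to `pd.crosstab`, which produces hierarchical column labels. This is an optional extension of the exercise, sketched here for illustration, rather than part of the required solution.

```python
import pandas as pd

df = pd.DataFrame({
    'Feature1': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C'],
    'Feature2': ['X', 'Y', 'X', 'Z', 'Y', 'Y', 'X', 'Z'],
    'Target': [1, 2, 1, 1, 2, 2, 1, 1],
})

# Columns are labeled by (Feature1, Feature2) pairs
combined = pd.crosstab(df['Target'], [df['Feature1'], df['Feature2']])
print(combined)

# Look up one cell: Target == 1 with Feature1 == 'A' and Feature2 == 'X'
print(combined.loc[1, ('A', 'X')])
```

In this dataset each `Feature1` category always appears with the same `Feature2` category, so the combined table carries the same counts as the two separate tables.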