# Data Cleaning and Belief Network Creation Guide

This notebook walks through how we clean the General Social Survey (GSS) data and build a belief network from it, covering each key step of our data pipeline in turn.

## Overview

Our process consists of three main steps:
1. Importing the raw data (`CLEAN/datasets/import_gss.py`)
2. Cleaning and preparing the data (`CLEAN/datasets/create_clean_dataset.py`)
3. (Optional) Validating the data (`CLEAN/datasets/validate_cleaned_datasets.py`)

Let's examine each step in detail.

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directory to path to import CLEAN modules
import sys
sys.path.append('../../')

# Set plotting style
%matplotlib inline
# Newer matplotlib releases no longer accept the bare 'seaborn' style name,
# so we apply the seaborn theme and palette through seaborn itself.
sns.set_theme(palette='husl')

## 1. Data Import

The GSS data is imported using the `import_gss.py` script. This script handles:
- Reading the raw GSS data
- Initial column selection
- Basic data type conversions
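
To make these three responsibilities concrete, here is a minimal sketch of what an import step of this kind could look like. It is not the code in `import_gss.py`: the function name `import_sketch`, the inline CSV, and the column names are assumptions made for illustration, and the real script may read a different file format.

```python
import io

import pandas as pd

# Hypothetical stand-in for a raw GSS extract; the real file and columns may differ.
RAW_CSV = """year,id,age,partyid,extra
2018,1,34,2,x
2018,2,IAP,5,y
2020,3,61,0,z
"""


def import_sketch(source, columns):
    """Read raw data, keep only the selected columns, and coerce them to numbers."""
    # Read everything as text first so unexpected codes do not break parsing
    df = pd.read_csv(source, usecols=columns, dtype=str)
    for col in columns:
        # Non-numeric codes (such as "IAP") become NaN instead of raising an error
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


raw_sketch = import_sketch(io.StringIO(RAW_CSV), ["year", "id", "age", "partyid"])
print(raw_sketch.dtypes)
```

Reading as text and converting afterwards is one possible design choice: it keeps the read step from failing on survey codes that are not numbers, at the cost of an explicit conversion pass.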

In [None]:
from CLEAN.datasets.import_gss import import_gss_data

# Import the raw data
raw_data = import_gss_data()

# Display basic information about the dataset
print("Dataset Shape:", raw_data.shape)
print("\nColumns:", raw_data.columns.tolist())
print("\nSample of the data:")
raw_data.head()

## 2. Data Cleaning

The cleaning process is handled by `clean_data.py` and involves:
- Handling missing values
- Standardizing categorical variables
- Creating derived features
- Removing invalid responses
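
The four operations above can be sketched on a toy DataFrame as follows. This is an illustration under stated assumptions, not the project's cleaning code: `clean_sketch`, the column names, the sentinel code `-1`, the party labels, and the age bins are all hypothetical and do not reflect the real GSS coding scheme.

```python
import pandas as pd

# Toy data with hypothetical columns and codes
toy = pd.DataFrame({
    "age": [34, -1, 61, 150],
    "partyid": [0, 3, 9, 6],
})


def clean_sketch(df):
    out = df.copy()
    # Handle missing values: treat the sentinel code -1 as missing
    out["age"] = out["age"].mask(out["age"] == -1)
    # Remove invalid responses: keep rows with a plausible age (drops NaN and 150)
    out = out[out["age"].between(18, 120)]
    # Standardize a categorical variable: codes without a label become NaN
    labels = {0: "democrat", 3: "independent", 6: "republican"}
    out["party"] = out["partyid"].map(labels).astype("category")
    # Create a derived feature: bucket age into groups
    out["age_group"] = pd.cut(
        out["age"], bins=[17, 39, 64, 120], labels=["18-39", "40-64", "65+"]
    )
    return out.reset_index(drop=True)


cleaned_sketch = clean_sketch(toy)
print(cleaned_sketch)
```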

In [None]:
from CLEAN.datasets.clean_data import clean_gss_data

# Clean the data
cleaned_data = clean_gss_data(raw_data)

# Show the changes in the dataset
print("Cleaned Dataset Shape:", cleaned_data.shape)
print("\nMissing values per column:")
print(cleaned_data.isnull().sum())

# Visualize the distribution of key variables
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, col in zip(axes.flat, cleaned_data.select_dtypes(include=['category']).columns[:4]):
    cleaned_data[col].value_counts().plot(kind='bar', ax=ax)
    ax.set_title(f'Distribution of {col}')
    ax.tick_params(axis='x', rotation=45)
plt.tight_layout()

## 3. Validation

The `validate_cleaned_datasets.py` script performs various checks to ensure data quality:
- Verifies data types
- Checks value ranges
- Ensures categorical variables contain only expected values
- Validates relationships between variables
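
As a rough illustration of checks like these, the sketch below returns a dictionary that maps each check name to `True` or `False`, which is the shape the results loop in this section expects. The function `validate_sketch`, the column names, the allowed values, and the cross-variable rule are assumptions for the example, not the contents of `validate_cleaned_datasets.py`.

```python
import pandas as pd


def validate_sketch(df):
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    allowed_party = {"democrat", "independent", "republican"}
    seniors = df.loc[df["age_group"] == "65+", "age"]
    return {
        # Data type check
        "age_is_numeric": pd.api.types.is_numeric_dtype(df["age"]),
        # Value range check
        "age_in_range": bool(df["age"].dropna().between(18, 120).all()),
        # Categorical values check
        "party_values_expected": set(df["party"].dropna().unique()) <= allowed_party,
        # Relationship check (hypothetical rule): the "65+" group only holds ages >= 65
        "senior_group_consistent": bool((seniors >= 65).all()),
    }


toy = pd.DataFrame({
    "age": [34, 61, 70],
    "party": ["democrat", "independent", "green"],
    "age_group": ["18-39", "40-64", "65+"],
})
results_sketch = validate_sketch(toy)
for check, passed in results_sketch.items():
    print(f"{check}: {'Passed' if passed else 'Failed'}")
```

In this toy data only the categorical check fails, because `"green"` is not in the allowed set.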

In [None]:
from CLEAN.datasets.validate_cleaned_datasets import validate_dataset

# Validate the cleaned dataset
validation_results = validate_dataset(cleaned_data)

# Display validation results
print("Validation Results:")
for check, result in validation_results.items():
    print(f"{check}: {'Passed' if result else 'Failed'}")

## Creating the Belief Network

After cleaning, we create a belief network to model relationships between variables. This involves:

1. Identifying key variables that might influence beliefs
2. Calculating correlation matrices
3. Building a directed graph of relationships

In [None]:
# Calculate correlations between variables
correlation_matrix = cleaned_data.select_dtypes(include=['number']).corr()

# Visualize correlations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='RdBu_r',
            center=0)
plt.title('Correlation Matrix of Numerical Variables')
plt.tight_layout()
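
The cell above covers the correlation step. One simple way to move from a correlation matrix towards a graph is to keep only the variable pairs whose absolute correlation clears a threshold. The sketch below does that on toy data; `correlation_edges` and the threshold value are assumptions chosen for illustration. Note that correlations are symmetric, so this yields undirected weighted edges, and assigning a direction to each edge would need additional assumptions that are not shown here.

```python
import pandas as pd


def correlation_edges(corr, threshold=0.3):
    """List (var_a, var_b, weight) for pairs with |correlation| >= threshold."""
    cols = list(corr.columns)
    edges = []
    for i, a in enumerate(cols):
        # Only look at the upper triangle so each pair appears once
        for b in cols[i + 1:]:
            weight = corr.loc[a, b]
            if pd.notna(weight) and abs(weight) >= threshold:
                edges.append((a, b, float(weight)))
    return edges


# Toy numeric data: x and y move together, z is unrelated to both
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, -1, 1, -1, 1],
})
edges = correlation_edges(toy.corr(), threshold=0.5)
print(edges)
```

The resulting edge list can be passed to a graph library of your choice, or applied to `correlation_matrix` from the cell above in place of the toy data.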

## Summary

This notebook has demonstrated:
1. How we import and clean the GSS data
2. The validation steps we perform
3. The process of creating a belief network

For more details, you can examine the individual Python files in the CLEAN/datasets directory:
- `import_gss.py`
- `clean_data.py`
- `validate_cleaned_datasets.py`

The cleaned dataset is ready for further analysis and modeling.