# Data Optimiser: Job Posting Dataset - Integrity Check

## Project Overview
This analysis is for **Data Optimiser**, a fictional recruitment company that needs insights into job posting data to better understand market demands for data science roles.

### Objective
Perform a comprehensive data integrity check on the job posting dataset to identify and document data quality issues before proceeding with analysis.

### Dataset Description
The dataset contains job postings for three main data roles:
- Data Scientists
- Data Analysts
- Data Engineers

Each record includes information about job requirements, company details, location, salary, and required skills.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load the dataset
df = pd.read_csv('job_postings_dataset.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
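
Before running the checks, it may help to confirm that the columns referenced later in this notebook are actually present. The sketch below is an illustration only: the expected column list is collected from the cells in this notebook, and `check_columns` and the sample frame are hypothetical helpers, not part of the original analysis.

```python
import pandas as pd

# Column names collected from the cells in this notebook; adjust if the CSV differs.
EXPECTED_COLUMNS = [
    "job_id", "job_title", "company", "company_size", "experience_level",
    "work_arrangement", "location", "salary", "posting_date", "date_posted",
    "required_skills",
]

def check_columns(frame, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected

# Tiny made-up frame that deliberately lacks most columns and misnames one.
sample = pd.DataFrame({"job_id": [1], "job_title": ["Data Analyst"], "Salary": [50000]})
missing, unexpected = check_columns(sample, EXPECTED_COLUMNS)
print("Missing columns:", missing)
print("Unexpected columns:", unexpected)
```

In the notebook, `check_columns(df, EXPECTED_COLUMNS)` would be called right after loading the CSV, so that a renamed or absent column surfaces here rather than as a `KeyError` in a later cell.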

## 1. Basic Dataset Overview

In [None]:
# Display basic information about the dataset
print("=== DATASET OVERVIEW ===")
print(f"Number of records: {len(df)}")
print(f"Number of features: {len(df.columns)}")
print("\n=== COLUMN INFORMATION ===")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\n=== FIRST 5 ROWS ===")
df.head()

## 2. Data Quality Assessment

In [None]:
# Check for missing values
print("=== MISSING VALUES ANALYSIS ===")
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percentage
})

print(missing_summary[missing_summary['Missing Count'] > 0])

# Visualize missing data
plt.figure(figsize=(10, 6))
missing_data[missing_data > 0].plot(kind='bar')
plt.title('Missing Values by Column')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
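
`isnull()` only counts values that pandas already treats as missing. Text placeholders such as "Unknown" or "TBD" pass through as ordinary strings. The sketch below counts such entries per text column. The placeholder set, the helper name `placeholder_counts`, and the sample values are assumptions for illustration; the actual dataset may use different markers or none at all.

```python
import pandas as pd

# Placeholder strings are assumptions; extend the set to match what the data shows.
PLACEHOLDERS = {"unknown", "not specified", "tbd", "-", "?"}

def placeholder_counts(frame):
    """Count, per non-numeric column, entries that are placeholders rather than real values."""
    counts = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # numeric columns cannot hold text placeholders
        values = frame[col].dropna().astype(str).str.strip().str.lower()
        hits = int(values.isin(PLACEHOLDERS).sum())
        if hits:
            counts[col] = hits
    return counts

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "company": ["Acme", "TBD", " - ", None],
    "location": ["Austin, TX", "Unknown", "Boston, MA", "Denver, CO"],
    "salary": [70000, 80000, None, 90000],
})
print(placeholder_counts(sample))
```

Any column reported here is less complete than the missing-value table suggests, since these entries are not included in `df.isnull().sum()`.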

In [None]:
# Check for duplicate records
print("=== DUPLICATE RECORDS ANALYSIS ===")
total_duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {total_duplicates}")

# Check for duplicate job IDs (should be unique)
duplicate_job_ids = df['job_id'].duplicated().sum()
print(f"Duplicate job IDs: {duplicate_job_ids}")

# Check for potential duplicates based on key fields
key_fields = ['job_title', 'company', 'location', 'salary']
potential_duplicates = df.duplicated(subset=key_fields, keep=False).sum()
print(f"Potential duplicates based on key fields: {potential_duplicates}")

if potential_duplicates > 0:
    print("\nSample potential duplicates:")
    duplicate_samples = df[df.duplicated(subset=key_fields, keep=False)].head(10)
    print(duplicate_samples[['job_id'] + key_fields])
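
The key-field comparison above is exact, so two postings that differ only in letter case or stray whitespace are not reported. A minimal sketch of a more tolerant comparison follows; `normalized_duplicates` and the sample rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def normalized_duplicates(frame, text_fields, other_fields=()):
    """Flag rows that match on the key fields after trimming and lower-casing text."""
    keys = pd.DataFrame(index=frame.index)
    for col in text_fields:
        # The pandas "string" dtype keeps missing values as <NA> instead of the text "nan"
        keys[col] = frame[col].astype("string").str.strip().str.lower()
    for col in other_fields:
        keys[col] = frame[col]
    return keys.duplicated(keep=False)

# Made-up rows: the first two describe the same posting with different formatting.
sample = pd.DataFrame({
    "job_id": [1, 2, 3],
    "job_title": ["Data Analyst", "data analyst ", "Data Engineer"],
    "company": ["Acme", "ACME", "Acme"],
    "location": ["Austin, TX", "austin, tx", "Austin, TX"],
    "salary": [70000, 70000, 90000],
})
mask = normalized_duplicates(sample, ["job_title", "company", "location"], ["salary"])
print(sample[mask])
```

On this sample the exact comparison finds no duplicates, while the normalized one flags the first two rows. Whether such rows are true duplicates still needs a manual look.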

## 3. Data Consistency Checks

In [None]:
# Check categorical variables for consistency
print("=== CATEGORICAL VARIABLES ANALYSIS ===")

categorical_columns = ['job_title', 'company_size', 'experience_level', 'work_arrangement', 'location']

for col in categorical_columns:
    print(f"\n{col.upper()}:")
    value_counts = df[col].value_counts()
    print(f"Unique values: {len(value_counts)}")
    print(value_counts.head(10))
    
    # Check for potential data inconsistencies
    if col == 'location':
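        # Check for location format inconsistencies (missing values are skipped,
        # since the membership test below fails on NaN)
        # Note: the title-case test also flags upper-case region codes such as 'TX'
        locations = df[col].dropna().unique()
        problematic_locations = [loc for loc in locations if ',' not in str(loc) or str(loc) != str(loc).title()]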
        if problematic_locations:
            print(f"Potentially problematic location formats: {problematic_locations[:5]}")
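
A title-case comparison cannot say why a location looks wrong. The sketch below labels each value with the first problem it finds. It assumes a "City, Region" convention where the region is either a capitalised name or a two-letter code; that convention, the pattern, and the helper `location_issue` are assumptions, not something the dataset guarantees.

```python
import re
import pandas as pd

# Assumed convention: "City, Region", with Region a capitalised name or a 2-letter code.
LOCATION_PATTERN = re.compile(r"^[A-Z][A-Za-z.'\- ]*, (?:[A-Z]{2}|[A-Z][A-Za-z.'\- ]*)$")

def location_issue(value):
    """Return a short label for the formatting problem, or None if the value looks fine."""
    if pd.isna(value):
        return "missing"
    text = str(value)
    if text != text.strip():
        return "leading/trailing whitespace"
    if "," not in text:
        return "no comma"
    if not LOCATION_PATTERN.match(text):
        return "unexpected casing or punctuation"
    return None

# Made-up values for illustration only.
sample = pd.Series(["Austin, TX", "new york, NY", "London", " Berlin, Germany", None])
print(sample.map(location_issue))
```

Applied as `df['location'].map(location_issue).value_counts()`, this would give a count per problem type instead of a flat list of suspicious values.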

In [None]:
# Salary data quality analysis
print("=== SALARY DATA QUALITY ANALYSIS ===")

# Basic salary statistics
salary_stats = df['salary'].describe()
print("Salary Statistics:")
print(salary_stats)

# Check for unrealistic salary values
min_reasonable_salary = 30000
max_reasonable_salary = 200000

low_salaries = df[(df['salary'] < min_reasonable_salary) & (df['salary'].notna())]
high_salaries = df[(df['salary'] > max_reasonable_salary) & (df['salary'].notna())]

print(f"\nSalaries below ${min_reasonable_salary:,}: {len(low_salaries)}")
if len(low_salaries) > 0:
    print("Sample low salaries:")
    print(low_salaries[['job_title', 'salary', 'experience_level']].head())

print(f"\nSalaries above ${max_reasonable_salary:,}: {len(high_salaries)}")
if len(high_salaries) > 0:
    print("Sample high salaries:")
    print(high_salaries[['job_title', 'salary', 'experience_level']].head())

# Visualize salary distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
df['salary'].hist(bins=30, alpha=0.7)
plt.title('Salary Distribution')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='job_title', y='salary')
plt.title('Salary Distribution by Job Title')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
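
The statistics and range checks above assume that `salary` was read as a numeric column. If the CSV contains entries with currency symbols or thousands separators, pandas reads the column as text and the comparisons fail. The sketch below lists such entries; `non_numeric_entries` and the sample values are hypothetical.

```python
import pandas as pd

def non_numeric_entries(series):
    """Return the non-missing entries that cannot be converted to a number."""
    converted = pd.to_numeric(series, errors="coerce")
    return series[converted.isna() & series.notna()]

# Made-up values: two of them carry formatting that blocks numeric conversion.
sample = pd.Series([85000, "92,000", "$70k", None, "110000"], name="salary")
bad = non_numeric_entries(sample)
print(bad)
```

An empty result for `non_numeric_entries(df['salary'])` suggests the column can be used for numeric checks as-is; otherwise the listed entries need cleaning first.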

## 4. Date Format Consistency

In [None]:
# Check date format consistency
print("=== DATE FORMAT ANALYSIS ===")

# Check posting_date formats
print("Posting Date samples:")
print(df['posting_date'].head(10).tolist())

print("\nDate Posted samples:")
print(df['date_posted'].head(10).tolist())

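# Try to identify different date formats by the separator each entry uses
# (counting the parts returns 3 for both styles, and calling split() on NaN fails)
def date_separator(value):
    if pd.isna(value):
        return 'missing'
    text = str(value)
    if '-' in text:
        return '-'
    if '/' in text:
        return '/'
    return 'other'

date_formats_posting = df['posting_date'].apply(date_separator)
date_formats_posted = df['date_posted'].apply(date_separator)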

print("\nPosting date format distribution:")
print(date_formats_posting.value_counts())

print("\nDate posted format distribution:")
print(date_formats_posted.value_counts())

# Check for inconsistent date ranges
try:
    df['posting_date_clean'] = pd.to_datetime(df['posting_date'])
    print(f"\nDate range: {df['posting_date_clean'].min()} to {df['posting_date_clean'].max()}")
    
    # Check for future dates
    future_dates = df[df['posting_date_clean'] > datetime.now()]
    print(f"Job postings with future dates: {len(future_dates)}")
    
except Exception as e:
    print(f"Error parsing dates: {e}")
    print("This indicates inconsistent date formats that need cleaning.")
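
When `pd.to_datetime` raises on mixed input, the error message names only the first failing value. Parsing against a list of explicit formats shows how many rows follow each one. The sketch below is one way to do that; the candidate formats, the helper `classify_date_format`, and the sample values are assumptions to be adapted to the actual data.

```python
import pandas as pd

# Candidate formats are assumptions; extend the mapping to match what the data shows.
CANDIDATE_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "MM/DD/YYYY": "%m/%d/%Y",
    "DD-MM-YYYY": "%d-%m-%Y",
}

def classify_date_format(series):
    """Label each entry with the first candidate format that parses it."""
    labels = pd.Series("unrecognised", index=series.index, dtype="object")
    labels[series.isna()] = "missing"
    for name, fmt in CANDIDATE_FORMATS.items():
        # errors="coerce" turns entries that do not match this format into NaT
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        labels[(labels == "unrecognised") & parsed.notna()] = name
    return labels

# Made-up values for illustration only.
sample = pd.Series(["2024-01-15", "01/15/2024", "15-01-2024", "Jan 15 2024", None])
print(classify_date_format(sample).value_counts())
```

Note that a value such as `01/02/2024` is ambiguous between day-first and month-first order; this sketch assigns it to whichever matching format comes first in the mapping.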

## 5. Skills Data Analysis

In [None]:
# Analyze skills data structure
print("=== SKILLS DATA ANALYSIS ===")

# Check skills format
print("Sample skills entries:")
for i, skills in enumerate(df['required_skills'].head(5)):
    print(f"{i+1}. {skills}")

# Count number of skills per job
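# Split on ',' and strip, matching the tokenisation used for the skill counts below
df['num_skills'] = df['required_skills'].apply(lambda x: len([s for s in x.split(',') if s.strip()]) if pd.notna(x) else 0)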

print(f"\nSkills per job statistics:")
print(df['num_skills'].describe())

# Most common skills across all jobs
all_skills = []
for skills_str in df['required_skills'].dropna():
    skills_list = [skill.strip() for skill in skills_str.split(',')]
    all_skills.extend(skills_list)

from collections import Counter
skill_counts = Counter(all_skills)

print("\nTop 15 most required skills:")
for skill, count in skill_counts.most_common(15):
    print(f"{skill}: {count} ({count/len(df)*100:.1f}%)")

# Visualize skills distribution
plt.figure(figsize=(12, 6))
top_skills = dict(skill_counts.most_common(10))
plt.bar(top_skills.keys(), top_skills.values())
plt.title('Top 10 Most Required Skills')
plt.xlabel('Skills')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
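
The skill counts above treat "Python" and "python" as different skills, which can split one skill across several bars. The sketch below lists skills that appear under more than one spelling once case and surrounding whitespace are ignored. `skill_variants` and the sample entries are hypothetical.

```python
from collections import Counter, defaultdict
import pandas as pd

def skill_variants(series):
    """Group skill spellings that differ only by case or surrounding whitespace."""
    variants = defaultdict(Counter)
    for entry in series.dropna():
        for raw in str(entry).split(","):
            skill = raw.strip()
            if skill:
                variants[skill.lower()][skill] += 1
    # Keep only skills that appear under more than one spelling
    return {key: dict(counts) for key, counts in variants.items() if len(counts) > 1}

# Made-up entries for illustration only.
sample = pd.Series(["Python, SQL", "python,sql ", "SQL, Tableau", None])
print(skill_variants(sample))
```

An empty dictionary from `skill_variants(df['required_skills'])` would indicate that the counts above are not distorted by casing differences.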

## 6. Data Quality Summary Report

In [None]:
# Generate comprehensive data quality report
print("=== DATA QUALITY SUMMARY REPORT ===")
print("="*50)

# Dataset overview
print(f"📊 DATASET OVERVIEW")
print(f"   Total Records: {len(df):,}")
print(f"   Total Features: {len(df.columns)}")
print(f"   Memory Usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# Data completeness
print(f"\n📋 DATA COMPLETENESS")
completeness = (1 - df.isnull().sum() / len(df)) * 100
for col in df.columns:
    print(f"   {col}: {completeness[col]:.1f}% complete")

# Data consistency issues
print(f"\n⚠️  DATA QUALITY ISSUES IDENTIFIED")
issues = []

if df.isnull().sum().sum() > 0:
    issues.append(f"Missing Values: {df.isnull().sum().sum()} total missing values")

if df.duplicated().sum() > 0:
    issues.append(f"Duplicate Records: {df.duplicated().sum()} duplicate rows")

# Check for salary outliers
salary_q1 = df['salary'].quantile(0.25)
salary_q3 = df['salary'].quantile(0.75)
salary_iqr = salary_q3 - salary_q1
salary_outliers = len(df[(df['salary'] < salary_q1 - 1.5*salary_iqr) | (df['salary'] > salary_q3 + 1.5*salary_iqr)])
if salary_outliers > 0:
    issues.append(f"Salary Outliers: {salary_outliers} potential outliers")

# Check for date inconsistencies
# Missing values are dropped first so that NaN is not counted as a '/' format
if df['date_posted'].dropna().astype(str).apply(lambda x: '-' if '-' in x else '/').nunique() > 1:
    issues.append("Date Format Inconsistency: Multiple date formats detected")

if issues:
    for issue in issues:
        print(f"   ❌ {issue}")
else:
    print("   ✅ No major data quality issues detected")

# Recommendations
print(f"\n💡 RECOMMENDATIONS FOR DATA CLEANING")
print("   1. Standardize date formats to YYYY-MM-DD")
print("   2. Handle missing salary values (imputation or exclusion)")
print("   3. Remove or investigate duplicate records")
print("   4. Standardize location format (Title Case, consistent punctuation)")
print("   5. Validate salary ranges against industry benchmarks")
print("   6. Standardize company names for consistency")

print(f"\n✅ DATA INTEGRITY CHECK COMPLETE")
print("="*50)

## 7. Power BI Translation Notes

### For Power BI Implementation:

**Power Query Steps:**
1. **Data Import**: Use `Home > Get Data > Text/CSV` to import job_postings_dataset.csv
2. **Data Type Detection**: Power Query auto-detects data types; verify them and adjust as needed
3. **Missing Value Handling**: Use `Transform > Replace Values` to handle null values
4. **Date Standardization**: Use `Transform > Date > Parse` to standardize date formats
5. **Duplicate Removal**: Select the key columns, then use `Home > Remove Rows > Remove Duplicates`
6. **Data Validation**: Enable `View > Column quality` in the Power Query Editor to visualize data quality metrics

**DAX Measures for Data Quality:**
- `Data Quality Score = DIVIDE(COUNTROWS(FILTER(JobData, NOT(ISBLANK(JobData[salary])))), COUNTROWS(JobData))` (share of rows with a non-blank salary)
- `Missing Salary Count = COUNTBLANK(JobData[salary])`
- `Duplicate Records = COUNTROWS(JobData) - DISTINCTCOUNT(JobData[job_id])`
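
To cross-check the Power BI results against this notebook, the three measures above can be approximated in pandas. This is a sketch under assumptions: `quality_measures` is a hypothetical helper, the sample rows are made up, and the table loaded into Power BI is assumed to hold the same rows as the DataFrame.

```python
import pandas as pd

def quality_measures(frame):
    """pandas counterparts of the three DAX measures listed above."""
    total = len(frame)
    missing_salary = int(frame["salary"].isna().sum())
    return {
        # DIVIDE returns BLANK when the denominator is zero; None plays that role here
        "salary_completeness": (total - missing_salary) / total if total else None,
        "missing_salary_count": missing_salary,
        # DISTINCTCOUNT counts BLANK as a value, hence dropna=False
        "duplicate_job_ids": total - int(frame["job_id"].nunique(dropna=False)),
    }

# Made-up rows: one missing salary and one repeated job_id.
sample = pd.DataFrame({
    "job_id": [1, 2, 2, 3],
    "salary": [50000, None, 60000, 70000],
})
print(quality_measures(sample))
```

If the values from `quality_measures(df)` differ from the Power BI cards, the usual suspects are type conversions or row filters applied in Power Query before the measures are evaluated.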

**Next Steps:**
After completing this integrity check, proceed to:
1. Exploratory Data Analysis (EDA)
2. Business Question Investigation
3. Dashboard Creation