# Data Optimiser: Business Questions Investigation

## Strategic Questions for Data Optimiser Recruitment Company

Based on our EDA findings, this notebook addresses specific business questions that will help Data Optimiser make strategic decisions about their recruitment services.

### Key Business Questions:
1. **Skills Gap Analysis**: Which skills are most in demand relative to available talent?
2. **Market Positioning**: Which data roles should we prioritize for recruitment?
3. **Geographic Strategy**: Where should we focus our recruitment efforts?
4. **Salary Benchmarking**: How can we advise clients on competitive compensation?
5. **Client Targeting**: What company sizes and types should we focus on?
6. **Talent Segmentation**: How should we categorize and market different experience levels?

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Load and prepare data
df = pd.read_csv('job_postings_dataset.csv')
df['posting_date_clean'] = pd.to_datetime(df['posting_date'])
df_clean = df.dropna(subset=['job_title', 'company', 'location']).copy()
salary_clean = df_clean[df_clean['salary'].notna()]

print(f"Dataset loaded: {len(df_clean)} job postings ready for business analysis")

## Business Question 1: Skills Gap Analysis
### *"What are the critical skills gaps we should help our clients address?"*

In [None]:
print("=== BUSINESS QUESTION 1: SKILLS GAP ANALYSIS ===")

# Extract all skills and analyze demand patterns
def extract_skills_data(df):
    skills_by_role = {}
    all_skills = []
    
    for role in df['job_title'].unique():
        role_data = df[df['job_title'] == role]
        role_skills = []
        
        for skills_str in role_data['required_skills'].dropna():
            skills_list = [skill.strip() for skill in skills_str.split(',')]
            role_skills.extend(skills_list)
            all_skills.extend(skills_list)
        
        skills_by_role[role] = Counter(role_skills)
    
    return skills_by_role, Counter(all_skills)

skills_by_role, overall_skills = extract_skills_data(df_clean)

# Classify skills by demand, measured against the total number of postings
total_jobs = len(df_clean)
high_demand_threshold = total_jobs * 0.20  # Skills required in 20%+ of jobs
medium_demand_threshold = total_jobs * 0.10  # Skills required in 10-20% of jobs

high_demand_skills = {skill: count for skill, count in overall_skills.items() 
                     if count >= high_demand_threshold}
medium_demand_skills = {skill: count for skill, count in overall_skills.items() 
                       if medium_demand_threshold <= count < high_demand_threshold}
emerging_skills = {skill: count for skill, count in overall_skills.items() 
                  if count < medium_demand_threshold}

print(f"\n🔥 HIGH-DEMAND SKILLS (Required in {int(high_demand_threshold)}+ jobs):")
for skill, count in sorted(high_demand_skills.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / total_jobs) * 100
    print(f"   • {skill}: {count} jobs ({percentage:.1f}%)")

print(f"\n📈 MEDIUM-DEMAND SKILLS ({int(medium_demand_threshold)}-{int(high_demand_threshold)} jobs):")
for skill, count in sorted(medium_demand_skills.items(), key=lambda x: x[1], reverse=True)[:10]:
    percentage = (count / total_jobs) * 100
    print(f"   • {skill}: {count} jobs ({percentage:.1f}%)")

# Skills specialization by role
print(f"\n🎯 ROLE-SPECIFIC SKILL SPECIALIZATION:")
for role in df_clean['job_title'].unique():
    role_total = len(df_clean[df_clean['job_title'] == role])
    top_skills = skills_by_role[role].most_common(5)
    print(f"\n{role} (Top 5 skills):")
    for skill, count in top_skills:
        percentage = (count / role_total) * 100
        print(f"   • {skill}: {percentage:.1f}% of {role} jobs")

# Visualize skills demand
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# High-demand skills
hd_skills_df = pd.DataFrame(list(high_demand_skills.items()), columns=['Skill', 'Count'])
hd_skills_df = hd_skills_df.sort_values('Count', ascending=True)
axes[0,0].barh(hd_skills_df['Skill'], hd_skills_df['Count'], color='red')
axes[0,0].set_title('High-Demand Skills (Critical for Recruitment)', fontweight='bold')
axes[0,0].set_xlabel('Number of Job Requirements')

# Medium-demand skills
md_skills_df = pd.DataFrame(list(medium_demand_skills.items()), columns=['Skill', 'Count'])
md_skills_df = md_skills_df.sort_values('Count', ascending=True).tail(10)
axes[0,1].barh(md_skills_df['Skill'], md_skills_df['Count'], color='orange')
axes[0,1].set_title('Medium-Demand Skills (Growth Opportunities)', fontweight='bold')
axes[0,1].set_xlabel('Number of Job Requirements')

# Skills by role comparison: share of each role's postings that list the top 15 skills
role_skills_data = []
for role in df_clean['job_title'].unique():
    role_data = []
    for skill, _ in overall_skills.most_common(15):
        role_total = len(df_clean[df_clean['job_title'] == role])
        skill_count = skills_by_role[role].get(skill, 0)
        percentage = (skill_count / role_total) * 100
        role_data.append(percentage)
    role_skills_data.append(role_data)

role_skills_df = pd.DataFrame(role_skills_data, 
                             columns=[skill for skill, _ in overall_skills.most_common(15)],
                             index=df_clean['job_title'].unique())

sns.heatmap(role_skills_df, annot=True, fmt='.1f', cmap='YlOrRd', ax=axes[1,0])
axes[1,0].set_title('Skills Demand by Role (%)', fontweight='bold')
axes[1,0].set_xlabel('Skills')
axes[1,0].set_ylabel('Job Roles')

# Skills frequency distribution
skill_freq_dist = list(overall_skills.values())
axes[1,1].hist(skill_freq_dist, bins=20, alpha=0.7, color='skyblue')
axes[1,1].set_title('Skills Frequency Distribution', fontweight='bold')
axes[1,1].set_xlabel('Frequency (Number of Jobs)')
axes[1,1].set_ylabel('Number of Skills')

plt.tight_layout()
plt.show()
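
The counts above tally every occurrence of a skill, so a posting that lists the same skill twice is counted twice. As a cross-check, the following is a minimal sketch (not part of the original analysis) that computes the share of postings mentioning each skill at least once, using pandas `str.split` and `explode`. The function name `skill_posting_share` and the toy data are illustrative assumptions; only the `required_skills` column name comes from the dataset.

```python
import pandas as pd

def skill_posting_share(df: pd.DataFrame, col: str = "required_skills") -> pd.Series:
    """Share of postings (0-1) that mention each skill at least once."""
    skills = (
        df[col]
        .dropna()
        .str.split(",")
        .explode()      # one row per (posting, skill); the posting index is kept
        .str.strip()
    )
    skills = skills[skills != ""]
    # Collapse repeats of the same skill within a single posting.
    per_posting = skills.groupby(level=0).unique().explode()
    # Divide by all postings, including those with no skills listed.
    return (per_posting.value_counts() / len(df)).sort_values(ascending=False)

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {"required_skills": ["Python, SQL", "SQL, Tableau, SQL", None, "Python"]}
)
share = skill_posting_share(toy)
print(share)
```

In the notebook this could be called as `skill_posting_share(df_clean)` and compared against the 20% and 10% thresholds directly, since the result is already a fraction of postings.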

## Business Question 2: Market Positioning Strategy
### *"Which data roles should we prioritize for maximum business impact?"*

In [None]:
print("=== BUSINESS QUESTION 2: MARKET POSITIONING STRATEGY ===")

# Analyze market opportunity by role
role_analysis = df_clean.groupby('job_title').agg({
    'job_id': 'count',  # Market size
    'salary': ['mean', 'median', 'std'],  # Compensation levels
    'required_skills': lambda x: x.apply(lambda skills: len(skills.split(', ')) if pd.notna(skills) else 0).mean()  # Complexity
}).round(2)

role_analysis.columns = ['Market_Size', 'Avg_Salary', 'Median_Salary', 'Salary_Std', 'Avg_Skills_Required']

# Calculate market opportunity score
# Higher score = larger market + higher salaries + more skills required per posting
role_analysis['Market_Share_Pct'] = (role_analysis['Market_Size'] / len(df_clean)) * 100
role_analysis['Salary_Premium'] = (role_analysis['Avg_Salary'] - role_analysis['Avg_Salary'].mean()) / role_analysis['Avg_Salary'].mean()
role_analysis['Complexity_Score'] = role_analysis['Avg_Skills_Required'] / role_analysis['Avg_Skills_Required'].max()

# Opportunity score: nominal weights are market share (40%), salary (40%), complexity (20%).
# The three terms are not on a common scale, so the effective weights differ from the nominal ones.
role_analysis['Opportunity_Score'] = (
    (role_analysis['Market_Share_Pct'] / 100) * 0.4 +
    (role_analysis['Salary_Premium'] + 1) * 0.4 +
    role_analysis['Complexity_Score'] * 0.2
) * 100

role_analysis = role_analysis.sort_values('Opportunity_Score', ascending=False)

print("\n🎯 ROLE PRIORITIZATION ANALYSIS:")
print(role_analysis[['Market_Size', 'Market_Share_Pct', 'Avg_Salary', 'Avg_Skills_Required', 'Opportunity_Score']])

print("\n📊 STRATEGIC RECOMMENDATIONS:")
for i, (role, data) in enumerate(role_analysis.iterrows(), 1):
    print(f"\n{i}. {role} (Opportunity Score: {data['Opportunity_Score']:.1f})")
    print(f"   • Market size: {data['Market_Size']:.0f} jobs ({data['Market_Share_Pct']:.1f}% of market)")
    print(f"   • Average salary: ${data['Avg_Salary']:,.0f}")
    print(f"   • Skill complexity: {data['Avg_Skills_Required']:.1f} skills per job")
    
    if i == 1:
        print(f"   ⭐ PRIMARY FOCUS: Highest opportunity for recruitment business")
    elif i == 2:
        print(f"   🔄 SECONDARY FOCUS: Strong secondary market")
    else:
        print(f"   📈 GROWTH POTENTIAL: Consider for specialized services")

# Experience level demand by role
exp_role_crosstab = pd.crosstab(df_clean['job_title'], df_clean['experience_level'], normalize='index') * 100
print(f"\n👔 EXPERIENCE LEVEL DEMAND BY ROLE:")
print(exp_role_crosstab.round(1))

# Visualize role positioning
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Market size vs average salary
scatter_data = role_analysis.reset_index()
axes[0,0].scatter(scatter_data['Market_Size'], scatter_data['Avg_Salary'], 
                 s=scatter_data['Opportunity_Score']*10, alpha=0.7)
for i, role in enumerate(scatter_data['job_title']):
    axes[0,0].annotate(role, (scatter_data['Market_Size'][i], scatter_data['Avg_Salary'][i]))
axes[0,0].set_xlabel('Market Size (Number of Jobs)')
axes[0,0].set_ylabel('Average Salary ($)')
axes[0,0].set_title('Market Size vs Salary (Bubble size = Opportunity Score)', fontweight='bold')

# Opportunity score comparison
opportunity_data = role_analysis['Opportunity_Score'].sort_values(ascending=True)
axes[0,1].barh(range(len(opportunity_data)), opportunity_data.values, 
               color=['gold' if x == opportunity_data.max() else 'lightblue' for x in opportunity_data.values])
axes[0,1].set_yticks(range(len(opportunity_data)))
axes[0,1].set_yticklabels(opportunity_data.index)
axes[0,1].set_xlabel('Opportunity Score')
axes[0,1].set_title('Role Opportunity Ranking', fontweight='bold')

# Experience level distribution by role
exp_role_crosstab.plot(kind='bar', stacked=True, ax=axes[1,0])
axes[1,0].set_title('Experience Level Distribution by Role', fontweight='bold')
axes[1,0].set_xlabel('Job Role')
axes[1,0].set_ylabel('Percentage')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].legend(title='Experience Level', bbox_to_anchor=(1.05, 1), loc='upper left')

# Market share pie chart
market_share = role_analysis['Market_Size']
axes[1,1].pie(market_share.values, labels=market_share.index, autopct='%1.1f%%',
             colors=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[1,1].set_title('Market Share by Role', fontweight='bold')

plt.tight_layout()
plt.show()
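
The opportunity score above adds a market share fraction, a salary ratio near 1, and a complexity ratio, so the salary term tends to dominate regardless of the stated 40/40/20 weights. One possible alternative, sketched here on the assumption that min-max scaling is acceptable for ranking, puts every metric on a 0 to 1 scale before weighting. The function name `weighted_score` and the toy figures are hypothetical and not taken from the dataset.

```python
import pandas as pd

def weighted_score(metrics: pd.DataFrame, weights: dict) -> pd.Series:
    """Min-max scale each weighted column to [0, 1], then combine into a 0-100 score."""
    w = pd.Series(weights)
    cols = metrics[w.index]
    span = cols.max() - cols.min()
    # A constant column carries no ranking information: avoid dividing by zero
    # and let it contribute 0. Missing values are also treated as 0 here.
    scaled = ((cols - cols.min()) / span.replace(0, float("nan"))).fillna(0.0)
    return (scaled * w).sum(axis=1) / w.sum() * 100

# Toy figures, invented for illustration only.
toy_roles = pd.DataFrame(
    {
        "Market_Size": [500, 300, 200],
        "Avg_Salary": [95_000, 120_000, 110_000],
        "Avg_Skills_Required": [4.0, 6.0, 5.0],
    },
    index=["Data Analyst", "Data Scientist", "Data Engineer"],
)
scores = weighted_score(
    toy_roles,
    {"Market_Size": 0.4, "Avg_Salary": 0.4, "Avg_Skills_Required": 0.2},
)
print(scores.sort_values(ascending=False).round(1))
```

With scaling, a role scores 100 only if it leads on every metric, which makes scores comparable across the role, geography, and client analyses. The trade-off is that min-max scaling is sensitive to outliers, so a single extreme market can compress the others.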

## Business Question 3: Geographic Strategy
### *"Where should we establish recruitment offices for maximum market coverage?"*

In [None]:
print("=== BUSINESS QUESTION 3: GEOGRAPHIC STRATEGY ===")

# Geographic market analysis
geo_analysis = df_clean.groupby('location').agg({
    'job_id': 'count',  # Market size
    'salary': ['mean', 'median', 'std'],  # Salary levels
    'job_title': lambda x: len(x.unique())  # Role diversity
}).round(2)

geo_analysis.columns = ['Market_Size', 'Avg_Salary', 'Median_Salary', 'Salary_Std', 'Role_Diversity']
geo_analysis = geo_analysis[geo_analysis['Market_Size'] >= 10]  # Focus on substantial markets

# Calculate market attractiveness score
geo_analysis['Market_Share_Pct'] = (geo_analysis['Market_Size'] / len(df_clean)) * 100
geo_analysis['Salary_Index'] = geo_analysis['Avg_Salary'] / geo_analysis['Avg_Salary'].mean()
geo_analysis['Diversity_Score'] = geo_analysis['Role_Diversity'] / geo_analysis['Role_Diversity'].max()

# Attractiveness score: Market size (50%) + Salary premium (30%) + Role diversity (20%)
geo_analysis['Attractiveness_Score'] = (
    (geo_analysis['Market_Share_Pct'] / geo_analysis['Market_Share_Pct'].max()) * 0.5 +
    (geo_analysis['Salary_Index']) * 0.3 +
    geo_analysis['Diversity_Score'] * 0.2
) * 100

geo_analysis = geo_analysis.sort_values('Attractiveness_Score', ascending=False)

print("\n🌍 TOP 10 GEOGRAPHIC MARKETS:")
top_10_markets = geo_analysis.head(10)
for i, (location, data) in enumerate(top_10_markets.iterrows(), 1):
    print(f"\n{i}. {location}")
    print(f"   • Jobs available: {data['Market_Size']:.0f} ({data['Market_Share_Pct']:.1f}% of total market)")
    print(f"   • Average salary: ${data['Avg_Salary']:,.0f} (Index: {data['Salary_Index']:.2f})")
    print(f"   • Role diversity: {data['Role_Diversity']:.0f}/3 role types")
    print(f"   • Attractiveness score: {data['Attractiveness_Score']:.1f}")
    
    if i <= 3:
        print(f"   🏢 TIER 1: High priority for office establishment")
    elif i <= 6:
        print(f"   🏬 TIER 2: Strong market for regional coverage")
    else:
        print(f"   🏪 TIER 3: Consider remote/partner coverage")

# Role distribution by top markets
top_markets = top_10_markets.index[:5]
role_geo_matrix = pd.crosstab(df_clean[df_clean['location'].isin(top_markets)]['location'], 
                             df_clean[df_clean['location'].isin(top_markets)]['job_title'])

print(f"\n🎯 ROLE DISTRIBUTION IN TOP MARKETS:")
print(role_geo_matrix)

# Market concentration analysis
total_jobs = len(df_clean)
top_5_concentration = (top_10_markets.head(5)['Market_Size'].sum() / total_jobs) * 100
top_10_concentration = (top_10_markets['Market_Size'].sum() / total_jobs) * 100

print(f"\n📈 MARKET CONCENTRATION INSIGHTS:")
print(f"   • Top 5 markets control {top_5_concentration:.1f}% of all jobs")
print(f"   • Top 10 markets control {top_10_concentration:.1f}% of all jobs")
print(f"   • Market fragmentation: {len(geo_analysis)} cities with 10+ jobs")

# Visualize geographic strategy
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Top markets by size
top_10_markets['Market_Size'].plot(kind='barh', ax=axes[0,0], color='steelblue')
axes[0,0].set_title('Top 10 Markets by Job Volume', fontweight='bold')
axes[0,0].set_xlabel('Number of Jobs')

# Salary vs market size
axes[0,1].scatter(top_10_markets['Market_Size'], top_10_markets['Avg_Salary'], 
                 s=top_10_markets['Attractiveness_Score']*2, alpha=0.7, color='orange')
for i, location in enumerate(top_10_markets.index):
    # The index holds location labels, so positional access needs .iloc
    axes[0,1].annotate(location.split(',')[0], 
                      (top_10_markets['Market_Size'].iloc[i], top_10_markets['Avg_Salary'].iloc[i]))
axes[0,1].set_xlabel('Market Size (Number of Jobs)')
axes[0,1].set_ylabel('Average Salary ($)')
axes[0,1].set_title('Market Size vs Salary Premium', fontweight='bold')

# Role distribution in top markets
role_geo_pct = role_geo_matrix.div(role_geo_matrix.sum(axis=1), axis=0) * 100
sns.heatmap(role_geo_pct, annot=True, fmt='.1f', cmap='Blues', ax=axes[1,0])
axes[1,0].set_title('Role Distribution in Top Markets (%)', fontweight='bold')
axes[1,0].set_xlabel('Job Roles')
axes[1,0].set_ylabel('Markets')

# Market attractiveness ranking
attract_data = top_10_markets['Attractiveness_Score'].sort_values(ascending=True)
colors = ['gold' if x == attract_data.max() else 'silver' if x == attract_data.iloc[-2] 
         else '#CD7F32' if x == attract_data.iloc[-3] else 'lightgray' for x in attract_data.values]
axes[1,1].barh(range(len(attract_data)), attract_data.values, color=colors)
axes[1,1].set_yticks(range(len(attract_data)))
axes[1,1].set_yticklabels([loc.split(',')[0] for loc in attract_data.index])
axes[1,1].set_xlabel('Attractiveness Score')
axes[1,1].set_title('Market Attractiveness Ranking', fontweight='bold')

plt.tight_layout()
plt.show()
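
The concentration figures above are computed over the ten markets ranked by attractiveness score, which is not necessarily the same set as the ten largest markets by posting volume. A minimal sketch of a volume-based alternative follows. The helper name `top_k_share` and the toy counts are assumptions made for illustration.

```python
import pandas as pd

def top_k_share(counts: pd.Series, k: int) -> float:
    """Percentage of all postings that fall in the k largest markets by volume."""
    return float(counts.nlargest(k).sum() / counts.sum() * 100)

# Toy posting counts per location, invented for illustration only.
toy_counts = pd.Series(
    {"City A": 400, "City B": 250, "City C": 150, "City D": 120, "City E": 80}
)
top_2 = top_k_share(toy_counts, 2)
top_5 = top_k_share(toy_counts, 5)
print(f"Top 2 markets: {top_2:.1f}% of postings")
print(f"Top 5 markets: {top_5:.1f}% of postings")
```

In the notebook, the input could be `df_clean['location'].value_counts()`, which also includes locations with fewer than 10 postings in the denominator.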

## Business Question 4: Salary Benchmarking Strategy
### *"How should we advise clients on competitive compensation packages?"*

In [None]:
print("=== BUSINESS QUESTION 4: SALARY BENCHMARKING STRATEGY ===")

# Comprehensive salary analysis
salary_data = salary_clean.copy()

# Create salary benchmarking framework
def calculate_salary_benchmarks(df, group_by_cols):
    """Calculate salary benchmarks for different groupings"""
    benchmarks = df.groupby(group_by_cols)['salary'].agg([
        'count',
        ('P25', lambda x: x.quantile(0.25)),
        ('P50_Median', lambda x: x.quantile(0.50)),
        ('Mean', 'mean'),
        ('P75', lambda x: x.quantile(0.75)),
        ('P90', lambda x: x.quantile(0.90)),
        ('Std', 'std')
    ]).round(0)
    
    # Calculate competitive ranges
    benchmarks['Competitive_Min'] = benchmarks['P25']
    benchmarks['Competitive_Max'] = benchmarks['P75']
    benchmarks['Premium_Range'] = benchmarks['P90']
    
    return benchmarks

# 1. Salary benchmarks by role
role_benchmarks = calculate_salary_benchmarks(salary_data, ['job_title'])
print("\n💰 SALARY BENCHMARKS BY ROLE:")
print(role_benchmarks[['count', 'P25', 'P50_Median', 'Mean', 'P75', 'P90']])

# 2. Salary benchmarks by role and experience
role_exp_benchmarks = calculate_salary_benchmarks(salary_data, ['job_title', 'experience_level'])
print("\n📊 DETAILED BENCHMARKS BY ROLE AND EXPERIENCE:")
for role in salary_data['job_title'].unique():
    print(f"\n{role.upper()}:")
    role_data = role_exp_benchmarks.loc[role]
    for exp_level in role_data.index:
        data = role_data.loc[exp_level]
        print(f"  {exp_level}:")
        print(f"    • Market rate: ${data['P50_Median']:,.0f} (median)")
        print(f"    • Competitive range: ${data['Competitive_Min']:,.0f} - ${data['Competitive_Max']:,.0f}")
        print(f"    • Premium offers: ${data['Premium_Range']:,.0f}+")
        print(f"    • Sample size: {data['count']:.0f} jobs")

# 3. Geographic salary adjustments
geo_salary_index = salary_data.groupby('location')['salary'].mean() / salary_data['salary'].mean()
geo_salary_index = geo_salary_index[geo_salary_index.index.isin(top_10_markets.index)]
geo_salary_index = geo_salary_index.sort_values(ascending=False)

print(f"\n🌍 GEOGRAPHIC SALARY ADJUSTMENT FACTORS:")
print("(Base = 1.00, >1.00 = Premium market, <1.00 = Below average)")
for location, factor in geo_salary_index.items():
    adjustment = (factor - 1) * 100
    print(f"  {location}: {factor:.2f} ({adjustment:+.0f}%)")

# 4. Company size impact on compensation
company_size_benchmarks = calculate_salary_benchmarks(salary_data, ['company_size'])
print(f"\n🏢 SALARY BY COMPANY SIZE:")
print(company_size_benchmarks[['count', 'P50_Median', 'Mean', 'P75']])

# Create salary recommendation engine
def salary_recommendation_engine(role, experience, location, company_size):
    """Generate salary recommendation based on inputs"""
    try:
        # Base salary from role and experience
        base_salary = role_exp_benchmarks.loc[(role, experience), 'P50_Median']
        competitive_min = role_exp_benchmarks.loc[(role, experience), 'Competitive_Min']
        competitive_max = role_exp_benchmarks.loc[(role, experience), 'Competitive_Max']
        
        # Geographic adjustment
        geo_factor = geo_salary_index.get(location, 1.0)
        
        # Company size adjustment (estimated)
        company_adjustments = {
            'Startup (1-50)': 0.95,
            'Small (51-200)': 1.0,
            'Medium (201-1000)': 1.05,
            'Large (1001+)': 1.10
        }
        company_factor = company_adjustments.get(company_size, 1.0)
        
        # Calculate adjusted ranges
        adjusted_base = base_salary * geo_factor * company_factor
        adjusted_min = competitive_min * geo_factor * company_factor
        adjusted_max = competitive_max * geo_factor * company_factor
        
        return {
            'market_rate': adjusted_base,
            'competitive_min': adjusted_min,
            'competitive_max': adjusted_max,
            'geo_factor': geo_factor,
            'company_factor': company_factor
        }
    except KeyError:
        return None

# Example recommendations
print(f"\n🎯 EXAMPLE SALARY RECOMMENDATIONS:")
examples = [
    ('Data Scientist', 'Senior Level (6-10 years)', 'San Francisco, CA', 'Large (1001+)'),
    ('Data Analyst', 'Mid Level (3-5 years)', 'New York, NY', 'Medium (201-1000)'),
    ('Data Engineer', 'Lead Level (10+ years)', 'Seattle, WA', 'Large (1001+)')
]

for role, exp, location, company in examples:
    rec = salary_recommendation_engine(role, exp, location, company)
    if rec:
        print(f"\n{role} | {exp} | {location} | {company}:")
        print(f"  • Market rate: ${rec['market_rate']:,.0f}")
        print(f"  • Competitive range: ${rec['competitive_min']:,.0f} - ${rec['competitive_max']:,.0f}")
        print(f"  • Geo adjustment: {rec['geo_factor']:.2f}x")
        print(f"  • Company size adjustment: {rec['company_factor']:.2f}x")

# Visualize salary benchmarking
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Salary ranges by role
role_salary_viz = role_benchmarks[['P25', 'P50_Median', 'P75', 'P90']]
role_salary_viz.plot(kind='bar', ax=axes[0,0])
axes[0,0].set_title('Salary Benchmarks by Role', fontweight='bold')
axes[0,0].set_ylabel('Salary ($)')
axes[0,0].tick_params(axis='x', rotation=45)
axes[0,0].legend(['25th %ile', 'Median', '75th %ile', '90th %ile'])

# Geographic salary factors
geo_salary_index.plot(kind='bar', ax=axes[0,1], color='orange')
axes[0,1].set_title('Geographic Salary Adjustment Factors', fontweight='bold')
axes[0,1].set_ylabel('Salary Factor (1.0 = Average)')
axes[0,1].axhline(y=1.0, color='red', linestyle='--', alpha=0.7)
axes[0,1].tick_params(axis='x', rotation=45)

# Company size impact
company_size_benchmarks['P50_Median'].plot(kind='bar', ax=axes[1,0], color='green')
axes[1,0].set_title('Median Salary by Company Size', fontweight='bold')
axes[1,0].set_ylabel('Median Salary ($)')
axes[1,0].tick_params(axis='x', rotation=45)

# Salary distribution overall
salary_data['salary'].hist(bins=30, ax=axes[1,1], alpha=0.7, color='purple')
axes[1,1].axvline(salary_data['salary'].median(), color='red', linestyle='--', 
                 label=f"Median: ${salary_data['salary'].median():,.0f}")
axes[1,1].axvline(salary_data['salary'].mean(), color='orange', linestyle='--',
                 label=f"Mean: ${salary_data['salary'].mean():,.0f}")
axes[1,1].set_title('Overall Salary Distribution', fontweight='bold')
axes[1,1].set_xlabel('Salary ($)')
axes[1,1].set_ylabel('Frequency')
axes[1,1].legend()

plt.tight_layout()
plt.show()
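
Percentile benchmarks for role and experience combinations can rest on very few postings, which is why the sample size is printed alongside each range. The sketch below shows one way to compute the same percentiles with pandas named aggregation and to flag thin groups. The helper names, the `Reliable` column, and the `min_n` threshold are assumptions for illustration, and the toy salaries are invented.

```python
import pandas as pd

def p25(s: pd.Series) -> float:
    """25th percentile of a salary series."""
    return s.quantile(0.25)

def p75(s: pd.Series) -> float:
    """75th percentile of a salary series."""
    return s.quantile(0.75)

def salary_benchmarks(df: pd.DataFrame, group_cols: list, min_n: int = 30) -> pd.DataFrame:
    """Percentile benchmarks per group, with a flag for groups below min_n postings."""
    out = df.groupby(group_cols)["salary"].agg(
        count="count", P25=p25, P50_Median="median", P75=p75
    )
    # min_n is an arbitrary illustrative cut-off, not a statistical rule.
    out["Reliable"] = out["count"] >= min_n
    return out

# Toy salaries, invented for illustration only.
toy_salaries = pd.DataFrame(
    {
        "job_title": ["Data Analyst"] * 4 + ["Data Engineer"],
        "salary": [100_000, 110_000, 120_000, 130_000, 90_000],
    }
)
bench = salary_benchmarks(toy_salaries, ["job_title"], min_n=3)
print(bench)
```

A recommendation function could then fall back to the role-level benchmark whenever the role and experience group is flagged as unreliable, rather than quoting a range built from a handful of postings.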

## Business Question 5: Client Targeting Strategy
### *"What types of companies should we prioritize as clients?"*

In [None]:
print("=== BUSINESS QUESTION 5: CLIENT TARGETING STRATEGY ===")

# Company analysis for client targeting
company_analysis = df_clean.groupby(['company_size', 'job_title']).agg({
    'job_id': 'count',
    'salary': ['mean', 'std'],
    'required_skills': lambda x: x.apply(lambda skills: len(skills.split(', ')) if pd.notna(skills) else 0).mean()
}).round(2)

company_analysis.columns = ['Job_Count', 'Avg_Salary', 'Salary_Std', 'Avg_Skills']

# Calculate client attractiveness metrics
company_summary = df_clean.groupby('company_size').agg({
    'job_id': 'count',  # Hiring volume
    'salary': ['mean', 'std'],  # Budget capacity
    'job_title': lambda x: len(x.unique()),  # Role diversity
    'required_skills': lambda x: x.apply(lambda skills: len(skills.split(', ')) if pd.notna(skills) else 0).mean()  # Complexity
}).round(2)

company_summary.columns = ['Total_Jobs', 'Avg_Budget', 'Budget_Std', 'Role_Diversity', 'Avg_Complexity']

# Calculate client priority score
company_summary['Market_Share'] = (company_summary['Total_Jobs'] / len(df_clean)) * 100
company_summary['Budget_Index'] = company_summary['Avg_Budget'] / company_summary['Avg_Budget'].mean()
company_summary['Complexity_Factor'] = company_summary['Avg_Complexity'] / company_summary['Avg_Complexity'].max()

# Priority score: Volume (40%) + Budget (35%) + Complexity (25%)
company_summary['Client_Priority_Score'] = (
    (company_summary['Market_Share'] / company_summary['Market_Share'].max()) * 0.40 +
    company_summary['Budget_Index'] * 0.35 +
    company_summary['Complexity_Factor'] * 0.25
) * 100

company_summary = company_summary.sort_values('Client_Priority_Score', ascending=False)

print("\n🎯 CLIENT TARGETING ANALYSIS:")
for i, (company_size, data) in enumerate(company_summary.iterrows(), 1):
    print(f"\n{i}. {company_size} Companies")
    print(f"   • Hiring volume: {data['Total_Jobs']:.0f} jobs ({data['Market_Share']:.1f}% of market)")
    print(f"   • Average budget: ${data['Avg_Budget']:,.0f} (Index: {data['Budget_Index']:.2f})")
    print(f"   • Role diversity: {data['Role_Diversity']:.0f}/3 different roles")
    print(f"   • Job complexity: {data['Avg_Complexity']:.1f} skills per role")
    print(f"   • Client priority score: {data['Client_Priority_Score']:.1f}")
    
    # Strategic recommendations
    if data['Client_Priority_Score'] >= 80:
        print(f"   🥇 TIER 1 CLIENT: High volume, high budget - Priority target")
        strategy = "Direct sales, dedicated account management, premium service packages"
    elif data['Client_Priority_Score'] >= 60:
        print(f"   🥈 TIER 2 CLIENT: Good opportunity - Secondary focus")
        strategy = "Targeted marketing, standard service packages, scalable solutions"
    else:
        print(f"   🥉 TIER 3 CLIENT: Specialized approach needed")
        strategy = "Digital marketing, self-service tools, volume-based pricing"
    
    print(f"   📋 Strategy: {strategy}")

# Analyze hiring patterns by company type
hiring_patterns = pd.crosstab(df_clean['company_size'], df_clean['job_title'], normalize='index') * 100
print(f"\n📊 HIRING PATTERNS BY COMPANY SIZE:")
print(hiring_patterns.round(1))

# Work arrangement preferences by company size
work_patterns = pd.crosstab(df_clean['company_size'], df_clean['work_arrangement'], normalize='index') * 100
print(f"\n🏢 WORK ARRANGEMENT PREFERENCES:")
print(work_patterns.round(1))

# Experience level requirements by company size
exp_patterns = pd.crosstab(df_clean['company_size'], df_clean['experience_level'], normalize='index') * 100
print(f"\n👔 EXPERIENCE LEVEL REQUIREMENTS:")
print(exp_patterns.round(1))

# Client segmentation insights (qualitative notes, not computed from the dataset)
print(f"\n💡 CLIENT SEGMENTATION INSIGHTS:")
print("\nStartup Companies (1-50 employees):")
startup_data = company_summary.loc['Startup (1-50)']
print(f"  • Typically hire fewer people but need specialized skills")
print(f"  • Budget constraints may require flexible pricing models")
print(f"  • Likely to value speed and cultural fit over extensive requirements")

print("\nLarge Companies (1001+ employees):")
large_data = company_summary.loc['Large (1001+)']
print(f"  • High volume hiring with substantial budgets")
print(f"  • Complex requirements and longer hiring processes")
print(f"  • Value comprehensive candidate screening and compliance")

# Visualize client targeting strategy
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Client priority scores
priority_scores = company_summary['Client_Priority_Score'].sort_values(ascending=True)
colors = ['gold' if x == priority_scores.max() else 'lightblue' for x in priority_scores.values]
axes[0,0].barh(range(len(priority_scores)), priority_scores.values, color=colors)
axes[0,0].set_yticks(range(len(priority_scores)))
axes[0,0].set_yticklabels(priority_scores.index)
axes[0,0].set_xlabel('Client Priority Score')
axes[0,0].set_title('Client Priority Ranking', fontweight='bold')

# Volume vs Budget scatter
axes[0,1].scatter(company_summary['Total_Jobs'], company_summary['Avg_Budget'],
                 s=company_summary['Client_Priority_Score']*3, alpha=0.7)
for i, company_type in enumerate(company_summary.index):
    # The index holds company-size labels, so positional access needs .iloc
    axes[0,1].annotate(company_type.split(' ')[0], 
                      (company_summary['Total_Jobs'].iloc[i], company_summary['Avg_Budget'].iloc[i]))
axes[0,1].set_xlabel('Total Jobs (Market Volume)')
axes[0,1].set_ylabel('Average Budget ($)')
axes[0,1].set_title('Market Volume vs Budget Capacity', fontweight='bold')

# Hiring patterns heatmap
sns.heatmap(hiring_patterns, annot=True, fmt='.1f', cmap='Blues', ax=axes[1,0])
axes[1,0].set_title('Hiring Patterns by Company Size (%)', fontweight='bold')
axes[1,0].set_xlabel('Job Roles')
axes[1,0].set_ylabel('Company Size')

# Work arrangement preferences
work_patterns.plot(kind='bar', stacked=True, ax=axes[1,1])
axes[1,1].set_title('Work Arrangement Preferences by Company Size', fontweight='bold')
axes[1,1].set_xlabel('Company Size')
axes[1,1].set_ylabel('Percentage')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].legend(title='Work Arrangement', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

## Business Summary and Strategic Recommendations

In [None]:
print("=== DATA OPTIMISER: STRATEGIC BUSINESS RECOMMENDATIONS ===")
print("="*70)

print("\n🎯 EXECUTIVE SUMMARY")
print(f"Based on analysis of {len(df_clean):,} job postings across {df_clean['location'].nunique()} markets")
print(f"for {df_clean['job_title'].nunique()} data professional roles.")

print("\n🏆 TOP STRATEGIC PRIORITIES")
print("\n1. ROLE FOCUS STRATEGY:")
top_opportunity_role = role_analysis.index[0]
print(f"   • PRIMARY: {top_opportunity_role} (Highest opportunity score: {role_analysis.iloc[0]['Opportunity_Score']:.1f})")
print(f"   • Market size: {role_analysis.iloc[0]['Market_Size']:.0f} jobs ({role_analysis.iloc[0]['Market_Share_Pct']:.1f}% market share)")
print(f"   • Average salary: ${role_analysis.iloc[0]['Avg_Salary']:,.0f}")

print("\n2. GEOGRAPHIC EXPANSION STRATEGY:")
top_market = geo_analysis.index[0]
print(f"   • TIER 1 MARKET: {top_market}")
print(f"   • Market size: {geo_analysis.iloc[0]['Market_Size']:.0f} jobs ({geo_analysis.iloc[0]['Market_Share_Pct']:.1f}% of total)")
print(f"   • Attractiveness score: {geo_analysis.iloc[0]['Attractiveness_Score']:.1f}")
print(f"   • Recommended action: Establish primary office")

tier_2_markets = geo_analysis.index[1:4]
print(f"   • TIER 2 MARKETS: {', '.join([m.split(',')[0] for m in tier_2_markets])}")
print(f"   • Recommended action: Regional coverage or partnerships")

print("\n3. CLIENT TARGETING STRATEGY:")
top_client_type = company_summary.index[0]
print(f"   • PRIMARY CLIENT TYPE: {top_client_type}")
print(f"   • Priority score: {company_summary.iloc[0]['Client_Priority_Score']:.1f}")
print(f"   • Market volume: {company_summary.iloc[0]['Total_Jobs']:.0f} jobs ({company_summary.iloc[0]['Market_Share']:.1f}%)")
print(f"   • Average budget: ${company_summary.iloc[0]['Avg_Budget']:,.0f}")

print("\n4. SKILLS SPECIALIZATION STRATEGY:")
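# Take the three most frequently required skills; the dict itself is not ordered by count
critical_skills = [skill for skill, _ in sorted(high_demand_skills.items(), key=lambda x: x[1], reverse=True)[:3]]
print(f"   • CRITICAL SKILLS TO FOCUS ON: {', '.join(critical_skills)}")
print(f"   • {len(high_demand_skills)} skills meet the high-demand threshold (required in 20%+ of job postings)")
print(f"   • Recommended action: Build specialized talent pools for these skills")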

print("\n💰 SALARY BENCHMARKING RECOMMENDATIONS")
print("\nUse our dynamic salary recommendation engine that considers:")
print("   • Base role and experience level benchmarks")
print(f"   • Geographic salary adjustments (up to {(geo_salary_index.max() - 1) * 100:+.0f}% in the top 10 markets)")
print("   • Company size adjustments (estimated at -5% for startups to +10% for large companies)")
print("   • Market positioning (25th-90th percentile ranges)")

print("\n📊 KEY PERFORMANCE INDICATORS (KPIs) TO TRACK")
print("\n1. Market Coverage:")
print(f"   • Current analysis covers {top_10_concentration:.1f}% of market with top 10 cities")
print(f"   • Target: 80%+ market coverage in Tier 1 markets")

print("\n2. Skills Alignment:")
print(f"   • High-demand skills identified: {len(high_demand_skills)} critical skills")
print(f"   • Target: 90%+ candidate coverage for critical skills")

print("\n3. Client Mix Optimization:")
print(f"   • Current market distribution: {company_summary.iloc[0]['Market_Share']:.1f}% from top client type")
print(f"   • Target: Balanced portfolio across company sizes")

print("\n🚀 IMPLEMENTATION ROADMAP")
print("\nPHASE 1 (0-3 months): Foundation")
print(f"   • Establish presence in {top_market.split(',')[0]}")
print(f"   • Build {top_opportunity_role} talent pipeline")
print(f"   • Target {top_client_type} companies")
print(f"   • Develop {', '.join(critical_skills)} skill assessments")

print("\nPHASE 2 (3-6 months): Expansion")
tier_2_cities = [m.split(',')[0] for m in tier_2_markets]
print(f"   • Expand to {tier_2_cities[0]} and {tier_2_cities[1]}")
print(f"   • Add secondary role specializations")
print(f"   • Implement salary benchmarking tools")
print(f"   • Scale client acquisition")

print("\nPHASE 3 (6-12 months): Optimization")
print(f"   • Cover all Tier 1 and Tier 2 markets")
print(f"   • Achieve market leadership in target segments")
print(f"   • Launch advanced analytics and AI-powered matching")
print(f"   • Expand internationally")

print(f"\n✅ ANALYSIS COMPLETE - Data Optimiser Strategic Plan Ready")
print("="*70)

## Power BI Implementation Guide

### Dashboard Structure for Power BI Service:

**1. Executive Dashboard**
- KPI cards: Total jobs, Average salary, Top markets, Critical skills
- Market opportunity matrix (Role vs Market size)
- Geographic heat map
- Trend analysis over time

**2. Skills Analysis Dashboard**
- Skills demand by role (horizontal bar charts)
- Skills gap analysis matrix
- Emerging vs established skills
- Skills correlation network

**3. Salary Benchmarking Dashboard**
- Interactive salary calculator
- Salary ranges by role and experience
- Geographic salary adjustments
- Company size premium analysis

**4. Client Targeting Dashboard**
- Client priority scoring
- Company size analysis
- Hiring pattern analysis
- Market segmentation views

### Key DAX Measures:
```dax
Total Jobs = COUNTROWS(JobData)
Average Salary = AVERAGE(JobData[salary])
Market Share = DIVIDE([Total Jobs], CALCULATE([Total Jobs], ALL(JobData[location])))
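-- Skills Count references a column row by row, so define it as a calculated column on JobData rather than a measure
Skills Count = LEN(JobData[required_skills]) - LEN(SUBSTITUTE(JobData[required_skills], ",", "")) + 1
-- [Salary Index] and [Complexity Score] are not defined above and must be created as separate measures first
Opportunity Score = [Market Share] * 0.4 + [Salary Index] * 0.4 + [Complexity Score] * 0.2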
```
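
The `Skills Count` expression counts commas and adds one. A small Python sketch of the same arithmetic can help validate the Power BI column against the notebook's `split`-based counts. The function name `skills_count_comma_rule` is hypothetical. One deliberate difference: for an empty string the DAX expression as written evaluates to 1, whereas this sketch returns 0 for missing or empty values to match how the notebook treats missing skills.

```python
def skills_count_comma_rule(skills) -> int:
    """Count comma-separated skills as len(s) - len(s without commas) + 1."""
    # Missing values (None, NaN) and blank strings count as zero skills.
    if not isinstance(skills, str) or not skills.strip():
        return 0
    return len(skills) - len(skills.replace(",", "")) + 1

# Illustrative inputs only.
samples = ["Python, SQL, Tableau", "Python", "", None]
for s in samples:
    print(repr(s), "->", skills_count_comma_rule(s))

# For non-empty strings the comma rule agrees with counting split parts.
assert skills_count_comma_rule("Python, SQL, Tableau") == len("Python, SQL, Tableau".split(","))
```

If the Power BI column should also return 0 for postings without skills, the DAX expression would need an explicit blank check around it.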

### Next Steps for Power BI Service:
1. Create workspace in Power BI Service
2. Import cleaned dataset
3. Build dashboard following the structure above
4. Set up scheduled data refresh
5. Share with stakeholders and gather feedback
6. Publish live dashboard link for submission