# Skillra PDA: HSE course project
Author contributions: TODO (fill in before submission).

Dataset: hh.ru vacancies in IT and analytics (file `data/raw/hh_moscow_it_2025_11_30.csv`).
Source: the Skillra parser; collection period: late November 2025.

## 1. Loading the data
We use the helper functions from `src/skillra_pda` and run everything from the repository root.

In [None]:
from pathlib import Path
import pandas as pd

from src.skillra_pda import io, cleaning, features, eda, viz, personas, config

config.ensure_directories()
raw_path = config.RAW_DATA_FILE
raw_path

In [None]:
df_raw = io.load_raw(raw_path)
df_raw_shape = df_raw.shape
df_raw_shape

In [None]:
df_raw.head(3)

## 2. Dataset profile and data quality

In [None]:
profile = cleaning.basic_profile(df_raw)
profile

In [None]:
is_unique, dup_count = cleaning.check_unique_id(df_raw)
{"vacancy_id_unique": is_unique, "duplicates_found": dup_count}

In [None]:
groups = cleaning.detect_column_groups(df_raw)
{k: len(v) for k, v in groups.items()}

In [None]:
missing_top = eda.missing_share(df_raw)
missing_top

### Data quality takeaways
- We check that `vacancy_id` is unique and inspect the share of missing values per column.
- Grouping columns by prefix makes it easier to build aggregates and features.
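
The two checks above can be sketched in plain pandas. This is a minimal illustration on a toy frame, not the actual code of `cleaning.check_unique_id` or `eda.missing_share`; the helper names `check_unique` and `missing_share_sketch` and the `salary_from` column are assumptions made for the example.

```python
import pandas as pd

# Toy data: only `vacancy_id` and the `skill_` prefix come from the notebook,
# the remaining column name is hypothetical.
toy_quality = pd.DataFrame(
    {
        "vacancy_id": [101, 102, 102, 104],
        "salary_from": [100_000, None, None, 250_000],
        "skill_python": [True, False, False, True],
        "skill_sql": [True, True, True, False],
    }
)


def check_unique(df: pd.DataFrame, id_col: str = "vacancy_id") -> tuple[bool, int]:
    """Return (is_unique, number of rows that repeat an earlier id)."""
    dup_count = int(df[id_col].duplicated().sum())
    return dup_count == 0, dup_count


def missing_share_sketch(df: pd.DataFrame) -> pd.Series:
    """Share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


id_is_unique, id_dup_count = check_unique(toy_quality)
toy_missing = missing_share_sketch(toy_quality)
```

On the real data the same two numbers (duplicate count and per-column missing share) are what decide whether deduplication and column dropping are needed in the next section.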

## 3. Preprocessing
We parse date columns, handle missing values and duplicates, and prepare the salary fields.
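
As a rough illustration of the steps listed above, the sketch below parses dates, drops duplicate ids, and builds a single salary figure. It does not reproduce `cleaning.parse_dates`, `cleaning.deduplicate`, or `cleaning.salary_prepare`; the column names `published_at`, `salary_from`, `salary_to`, `salary_currency`, the `RUR` currency code, and the midpoint rule are all assumptions made for the example.

```python
import pandas as pd

# Toy data with hypothetical column names (only `vacancy_id` is from the notebook).
toy_raw = pd.DataFrame(
    {
        "vacancy_id": [1, 2, 2, 3],
        "published_at": ["2025-11-24", "2025-11-25", "2025-11-25", "not a date"],
        "salary_from": [100_000.0, 150_000.0, 150_000.0, None],
        "salary_to": [140_000.0, None, None, 5_000.0],
        "salary_currency": ["RUR", "RUR", "RUR", "USD"],
    }
)

toy_clean = toy_raw.copy()
# Unparseable dates become NaT instead of raising an error.
toy_clean["published_at"] = pd.to_datetime(toy_clean["published_at"], errors="coerce")
# Keep the first row for every vacancy id.
toy_clean = toy_clean.drop_duplicates(subset="vacancy_id", keep="first").reset_index(drop=True)
# Share of vacancies quoted in a currency other than roubles.
toy_non_rub_share = float((toy_clean["salary_currency"] != "RUR").mean())
# Midpoint of the salary range; mean(axis=1) skips NaN, so a one-sided range
# falls back to the bound that is present.
toy_clean["salary_mid"] = toy_clean[["salary_from", "salary_to"]].mean(axis=1)
# Assumption: non-rouble salaries are blanked out rather than converted.
toy_clean.loc[toy_clean["salary_currency"] != "RUR", "salary_mid"] = float("nan")
```

Blanking out non-rouble salaries is only one option; converting them at a fixed exchange rate would keep more rows at the cost of an extra assumption.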

In [None]:
df = df_raw.copy()
df = cleaning.parse_dates(df)
df = cleaning.deduplicate(df)
df = cleaning.handle_missingness(df)
df = cleaning.salary_prepare(df)
{
    "after_clean_shape": df.shape,
    "dropped_columns": df.attrs.get("dropped_columns", []),
    "deduplicated": df.attrs.get("deduplicated_rows", 0),
    "non_rub_share": df.attrs.get("non_rub_share", 0),
}

In [None]:
if "salary_gross" in df.columns:
    # If the imported module does not expose the helper (for example, a stale
    # import in a long-running kernel), reload it once and look it up again.
    ensure_fn = getattr(cleaning, "ensure_salary_gross_boolean", None)
    if ensure_fn is None:
        from importlib import reload
        import src.skillra_pda.cleaning as cleaning_module
        cleaning = reload(cleaning_module)
        ensure_fn = cleaning.ensure_salary_gross_boolean
    df = ensure_fn(df)
    # The column must use the nullable boolean dtype, with no string placeholders left.
    assert str(df["salary_gross"].dtype) == "boolean", "salary_gross must be boolean"
    assert not df["salary_gross"].isin(["unknown", "Unknown", "UNKNOWN", ""]).any(), "salary_gross contains unknown markers"


## 4. New features
We add time-based features, city tier, work mode, skill aggregates, and salary bins.
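
To make the feature list above concrete, here is a small sketch of three of the features on a toy frame. It is not the implementation behind `features.assemble_features`; the input columns `published_at` and `salary_mid`, the `skills_count` aggregate, and the bin thresholds are assumptions chosen for the example.

```python
import pandas as pd

# Toy data; `published_at`, `salary_mid` and `skills_count` are hypothetical names.
toy_feat = pd.DataFrame(
    {
        "published_at": pd.to_datetime(["2025-11-24", "2025-11-29", "2025-11-30"]),
        "salary_mid": [90_000.0, 180_000.0, 320_000.0],
        "skill_python": [True, False, True],
        "skill_sql": [True, True, True],
    }
)

# Time feature: 0 = Monday ... 6 = Sunday.
toy_feat["published_weekday"] = toy_feat["published_at"].dt.dayofweek

# Skill aggregate: number of skill flags set in each vacancy.
toy_feat_skill_cols = [c for c in toy_feat.columns if c.startswith("skill_")]
toy_feat["skills_count"] = toy_feat[toy_feat_skill_cols].sum(axis=1)

# Salary bins with fixed, arbitrarily chosen edges (right edge inclusive).
toy_feat["salary_bucket"] = pd.cut(
    toy_feat["salary_mid"],
    bins=[0, 100_000, 200_000, float("inf")],
    labels=["<100k", "100-200k", "200k+"],
)
```

Quantile-based bins (`pd.qcut`) would be a reasonable alternative when the salary distribution is heavily skewed, at the cost of bin edges that change with the sample.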

In [None]:
df_features = features.assemble_features(df.copy())
df_features[["published_weekday", "city_tier", "work_mode", "primary_role", "salary_bucket"]].head()

In [None]:
feat_path = config.FEATURE_DATA_FILE
io.save_processed(df_features, feat_path)
feat_path

## 5. EDA
Breakdowns by role, grade, and work mode, plus skill frequencies.
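
The kind of breakdown described above can be sketched with a `groupby` and a column-wise mean. This is only an illustration of the idea; it is not the code of `eda.describe_salary_by_group` or `eda.skill_frequency`, and the `salary_mid` column is a hypothetical name.

```python
import pandas as pd

# Toy data; `grade` and the `skill_` prefix come from the notebook,
# `salary_mid` is an assumed column name.
toy_eda = pd.DataFrame(
    {
        "grade": ["junior", "junior", "middle", "middle", "senior"],
        "salary_mid": [80_000.0, 100_000.0, 180_000.0, None, 350_000.0],
        "skill_python": [True, False, True, True, True],
        "skill_sql": [True, True, False, True, False],
    }
)

# Salary summary per grade; `count` ignores vacancies without a salary.
toy_salary_by_grade = (
    toy_eda.groupby("grade")["salary_mid"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)

# Skill frequency: the mean of a boolean flag is the share of vacancies that mention it.
toy_eda_skill_cols = [c for c in toy_eda.columns if c.startswith("skill_")]
toy_skill_share = toy_eda[toy_eda_skill_cols].mean().sort_values(ascending=False)
```

Reporting `count` next to the median matters here: a group with only a handful of salaried vacancies can show an extreme median that should not be over-interpreted.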

In [None]:
snapshot = {
    "primary_role": df_features["primary_role"].value_counts().head(10),
    "grade": df_features.get("grade", pd.Series(dtype=int)).value_counts().head(10),
    "city_tier": df_features["city_tier"].value_counts().head(10),
    "work_mode": df_features["work_mode"].value_counts().head(10),
}
snapshot

In [None]:
junior_flags = ["is_for_juniors", "allows_students", "has_mentoring", "has_test_task"]
junior_ready = eda.junior_friendly_share(df_features, junior_flags)
junior_ready

In [None]:
salary_by_grade = eda.describe_salary_by_group(df_features, 'grade')
salary_by_role = eda.describe_salary_by_group(df_features, 'primary_role', top_n=8)
salary_by_city = eda.describe_salary_by_group(df_features, 'city_tier')
{"salary_by_grade": salary_by_grade.head(), "salary_by_role": salary_by_role.head(), "salary_by_city": salary_by_city.head()}

In [None]:
salary_grade_role = eda.describe_salary_two_dim(df_features, 'grade', 'primary_role')
salary_work_mode = eda.describe_salary_two_dim(df_features, 'work_mode', 'primary_role')
salary_city_grade = eda.describe_salary_two_dim(df_features, 'city_tier', 'grade')
{"grade_x_role": salary_grade_role.head(), "work_mode_x_role": salary_work_mode.head(), "city_x_grade": salary_city_grade.head()}

In [None]:
skill_cols = [c for c in df_features.columns if c.startswith('skill_')][:50]
top_skills_all = eda.skill_frequency(df_features, skill_cols, top_n=15)
top_skills_all

In [None]:
premium_df = features.compute_skill_premium(df_features, skill_cols, min_count=30)
premium_df.head(10)

In [None]:
numeric_cols = df_features.select_dtypes(include=['number']).columns
corr = eda.correlation_matrix(df_features, cols=numeric_cols)
corr.head()

## 6. Visualizations
Figures are saved to `reports/figures/`.

In [None]:
fig_salary_grade = viz.salary_by_grade_box(df_features)
fig_salary_role = viz.salary_by_role_box(df_features)
fig_work_mode = viz.work_mode_share_by_city(df_features)
fig_top_skills = viz.top_skills_bar(df_features, skill_cols, role_filter='data')
fig_skill_premium = viz.skill_premium_bar(premium_df)
fig_corr = viz.corr_heatmap(corr)
[fig_salary_grade, fig_salary_role, fig_work_mode, fig_top_skills, fig_skill_premium, fig_corr]

## 7. Product layer: personas
We compute the skill gap for several career trajectories.
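
One plausible reading of a skill gap is: among vacancies for the target role and grade, which skills are requested most often that the persona does not yet have. The sketch below implements that reading on toy data. It is an assumption about the idea, not the actual logic of `personas.skill_gap_for_persona`; the function `skill_gap_sketch` is hypothetical and ignores persona constraints such as `work_mode`.

```python
import pandas as pd

# Toy data using column names that appear in the notebook.
toy_market = pd.DataFrame(
    {
        "primary_role": ["data", "data", "data", "product"],
        "grade": ["junior", "junior", "middle", "junior"],
        "skill_python": [True, True, True, False],
        "skill_sql": [True, True, False, True],
        "skill_airflow": [True, False, True, False],
        "skill_excel": [False, False, False, True],
    }
)


def skill_gap_sketch(df, current_skills, target_role, target_grade, top_k=10):
    """Most requested skills in the target segment that the persona lacks."""
    segment = df[(df["primary_role"] == target_role) & (df["grade"] == target_grade)]
    skill_columns = [c for c in df.columns if c.startswith("skill_")]
    # Share of vacancies in the segment that mention each skill.
    demand = segment[skill_columns].mean()
    # Remove skills the persona already has; unknown names are ignored.
    gap = demand.drop(labels=list(current_skills), errors="ignore")
    # Keep only skills that are actually requested, most frequent first.
    gap = gap[gap > 0].sort_values(ascending=False)
    return gap.head(top_k)


toy_student_gap = skill_gap_sketch(toy_market, ["skill_python"], "data", "junior", top_k=2)
```

Ranking by demand share alone treats every skill as equally valuable; weighting by a salary premium would be a natural extension, but it is not shown here.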

In [None]:
student = personas.Persona(
    name='Student',
    current_skills=['skill_python', 'skill_sql'],
    target_role='data',
    target_grade='junior',
    constraints={'work_mode': ['remote', 'hybrid']}
)
switcher = personas.Persona(
    name='BI switcher',
    current_skills=['skill_excel', 'skill_powerbi'],
    target_role='product',
    target_grade='middle'
)
analyst = personas.Persona(
    name='Data analyst',
    current_skills=['skill_sql', 'skill_python', 'skill_tableau'],
    target_role='analyst',
    target_grade='middle'
)
student_gap = personas.skill_gap_for_persona(df_features, student, top_k=10)
switcher_gap = personas.skill_gap_for_persona(df_features, switcher, top_k=10)
analyst_gap = personas.skill_gap_for_persona(df_features, analyst, top_k=10)
{'student': student_gap, 'switcher': switcher_gap, 'analyst': analyst_gap}

## 8. Final conclusions
TODO: fill in from the EDA results and figures before the final submission.

## 9. Requirements checklist
- [x] Data loaded and explored.
- [x] Preprocessing: types, missing values, duplicates, salaries.
- [x] Features: time-based, city, work mode, skill aggregates, primary role, salary buckets.
- [x] EDA: breakdowns by role/grade/work mode, skills, correlations.
- [x] Visualizations: 6 figures in `reports/figures`.
- [x] Personas and skill gap.
- [ ] Final written conclusions added.