# Install Libraries

In [None]:
!pip install pandas numpy matplotlib seaborn



## Import and Load Data

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

linkedin = pd.read_csv("linkedin_jobs.csv")
indeed = pd.read_csv("indeed_jobs.csv")


## Dataset Overview - LinkedIn

The LinkedIn dataset contains **123,849 job postings** with **31 columns**.
This is large enough for exploratory analysis, NLP-based skill extraction,
and downstream machine learning tasks.

### Structural Summary
- Rows: 123,849
- Columns: 31
- Data types:
  - Numerical: 16 (int64 + float64)
  - Categorical/Text: 15 (object)
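
Counts like these can be read off the frame directly instead of tallying `info()` output by hand. A minimal sketch on a toy frame; the helper name `dtype_summary` is hypothetical and not part of the notebook:

```python
import pandas as pd

def dtype_summary(df: pd.DataFrame) -> dict:
    """Count rows, columns, numerical columns, and non-numerical columns."""
    numeric = df.select_dtypes(include="number").columns
    non_numeric = df.select_dtypes(exclude="number").columns
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "numerical": len(numeric),
        "categorical_text": len(non_numeric),
    }

# Toy stand-in for the real dataset
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Analyst", "Engineer", "Designer"],
    "max_salary": [50000.0, None, 72000.0],
})
summary = dtype_summary(toy)
print(summary)
```

Calling `dtype_summary(linkedin)` after loading the CSV would give the figures for the real data.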

In [6]:
linkedin.shape


(123849, 31)

In [7]:
linkedin.head()
linkedin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   company_name                122130 non-null  object 
 2   title                       123849 non-null  object 
 3   description                 123842 non-null  object 
 4   max_salary                  29793 non-null   float64
 5   pay_period                  36073 non-null   object 
 6   location                    123849 non-null  object 
 7   company_id                  122132 non-null  float64
 8   views                       122160 non-null  float64
 9   med_salary                  6280 non-null    float64
 10  min_salary                  29793 non-null   float64
 11  formatted_work_type         123849 non-null  object 
 12  applies                     23320 non-null   float64
 13  original_liste

### Column Completeness Analysis

Several important columns contain a high proportion of missing values,
which has implications for modeling and feature engineering.


#### Most Incomplete Columns
- `skills_desc`: ~99.3% missing
- `closed_time`: ~99.3% missing
- `med_salary`: ~94.9% missing (6,280 of 123,849 non-null)
- `remote_allowed`: ~86.5% missing
- `max_salary` / `min_salary`: ~75.9% missing (29,793 of 123,849 non-null)
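
Percentages of this kind can be computed from `isna()`. The sketch below uses a small made-up frame, and `missing_report` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, most incomplete first."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return report.sort_values("pct_missing", ascending=False)

# Toy data: one complete column, two sparse salary columns
toy = pd.DataFrame({
    "description": ["a", "b", "c", "d"],
    "med_salary": [np.nan, np.nan, np.nan, 40.0],
    "max_salary": [np.nan, 90.0, np.nan, 80.0],
})
report = missing_report(toy)
print(report)
```

Running `missing_report(linkedin).head()` would list the most incomplete columns of the real dataset.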


### Implications

- **Skill extraction cannot rely on `skills_desc`** and must be derived
  from `description` using NLP techniques.
- Salary-related fields are sparse, requiring:
  - Normalization
  - Fallback strategies
  - Possibly excluding rows with no salary signal for prediction tasks
- Remote work information is unreliable and cannot be treated as a primary feature.
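
As a rough illustration of deriving skills from `description`, the sketch below uses plain keyword matching. This is only a baseline, not necessarily the NLP technique used later in the project, and the `SKILL_VOCAB` list is a made-up assumption:

```python
import re

# Hypothetical skill vocabulary; a real project would curate a much larger list
SKILL_VOCAB = ["python", "sql", "excel", "machine learning", "aws"]

# One pattern per skill, matched case-insensitively and only as a whole term
_PATTERNS = {
    skill: re.compile(r"(?<!\w)" + re.escape(skill) + r"(?!\w)", re.IGNORECASE)
    for skill in SKILL_VOCAB
}

def extract_skills(description):
    """Return the vocabulary skills mentioned in a job description."""
    if not isinstance(description, str):
        return []  # missing descriptions yield no skills
    return [skill for skill, pattern in _PATTERNS.items() if pattern.search(description)]

example = "Seeking an analyst with SQL and Python; Machine Learning is a plus."
skills = extract_skills(example)
print(skills)
```

The whole-term check keeps "Excellent" from matching "excel"; applying the function with `linkedin["description"].apply(extract_skills)` would produce one skill list per posting.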


### Key Observations

- `description` is fully populated and will serve as the primary source
  for skill extraction and job clustering.
- `formatted_experience_level` is present for ~65% of jobs, making it usable
  with some loss.
- Salary fields (`med_salary`, `normalized_salary`) are present for a minority
  of postings, which limits supervised salary modeling.


## Dataset Overview - Indeed

The Indeed dataset contains **5,492 job postings** with **17 columns**.
The dataset is compact but well-structured and suitable as a secondary
source for salary and experience analysis.

### Structural Summary
- Rows: 5,492
- Columns: 17
- Data types:
  - Numerical: 7 (float64 + int64)
  - Categorical/Text: 10 (object)


In [8]:
indeed.shape


(5492, 17)

In [9]:
indeed.head()
indeed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5492 entries, 0 to 5491
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               5492 non-null   object 
 1   company             5478 non-null   object 
 2   location            5492 non-null   object 
 3   type                5492 non-null   object 
 4   salary              5492 non-null   object 
 5   contract_type       5492 non-null   object 
 6   job_description     5492 non-null   object 
 7   sub_industry        5491 non-null   object 
 8   industry            5491 non-null   object 
 9   experience          3276 non-null   object 
 10  exp_normailzed      3276 non-null   float64
 11  type_norm           5492 non-null   float64
 12  contract_type_norm  5492 non-null   int64  
 13  salary_min          5492 non-null   float64
 14  salary_max          5492 non-null   float64
 15  salary_avg          5492 non-null   float64
 16  avg_an

### Column Completeness Analysis

The Indeed dataset demonstrates strong overall completeness, particularly in salary and job description fields, making it a reliable source for compensation and market-level analysis.



#### Most Incomplete Columns

- `experience`: 2,216 missing values (~40.3%)
- `exp_normailzed`: 2,216 missing values (~40.3%)

These fields represent inferred or extracted experience information and are not consistently present across postings.



#### Minor Missingness

- `company`: 14 missing values
- `industry`: 1 missing value
- `sub_industry`: 1 missing value

These omissions are negligible and can be safely handled via:

- Row removal
- Filling with a placeholder value such as `"unknown"` (forward fill would copy values from an unrelated posting)
- Mapping during normalization
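
Two of these options are sketched below on a toy frame. The `"unknown"` placeholder is an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "company": ["Acme", np.nan, "Globex"],
    "industry": ["Tech", "Finance", np.nan],
    "sub_industry": ["SaaS", "Banking", np.nan],
})

# Option 1: drop the few rows that lack any of these fields
dropped = toy.dropna(subset=["company", "industry", "sub_industry"])

# Option 2: keep every row and mark the gaps with an explicit placeholder
filled = toy.fillna({
    "company": "unknown",
    "industry": "unknown",
    "sub_industry": "unknown",
})

print(len(dropped), int(filled.isna().sum().sum()))
```

With so few affected rows, either choice should have little effect on downstream results.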



#### Fully Populated Columns

The following columns have 0 missing values:

- Job metadata: `title`, `location`, `type`, `contract_type`
- Job content: `job_description`
- Salary fields: `salary_min`, `salary_max`, `salary_avg`, `avg_annual_salary`
- Normalized fields: `type_norm`, `contract_type_norm`



#### Implications

- Indeed's complete salary coverage makes it well suited to salary analysis and benchmarking.
- Experience-based modeling will require either:
    - Dropping rows with missing experience, or
    - Treating experience as an optional / secondary feature.
- Indeed serves as a strong complementary dataset to LinkedIn, particularly for:
    - Salary normalization
    - Industry-level comparisons
    - Market trend validation

#### Key Completeness Observations
- Salary fields (`salary_min`, `salary_max`, `salary_avg`, `avg_annual_salary`)
  are **100% populated**
- Experience information is present for ~60% of postings
- Job descriptions are fully populated


## Canonical Schema Validation

| Canonical Field | LinkedIn Column | Indeed Column | Notes |
|----------------|----------------|---------------|-------|
| job_id | job_id | - | Indeed lacks unique IDs; generated from the row index |
| job_title | title | title | Consistent |
| company | company_name | company | Minor naming differences |
| location | location | location | Needs normalization |
| experience_level | formatted_experience_level | exp_normailzed | Different formats; the Indeed column name is misspelled in the source data |
| skills | skills_desc | - | LinkedIn coverage is sparse (~99.3% missing); Indeed has no skills column |
| salary_avg | med_salary | avg_annual_salary | High missing rate on LinkedIn; fully populated on Indeed |
| job_description | description | job_description | OK |
| source | "linkedin" | "indeed" | Manually assigned |
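
A mapping like this can be checked against the loaded frames before any renaming. The sketch below is one way to do that; the dictionaries and the `missing_source_columns` helper are hypothetical, and the toy frames carry column names only:

```python
import pandas as pd

# Canonical field -> source column; None means the source has no such column
LINKEDIN_MAP = {
    "job_id": "job_id",
    "job_title": "title",
    "company": "company_name",
    "location": "location",
    "experience_level": "formatted_experience_level",
    "salary_avg": "med_salary",
    "job_description": "description",
}
INDEED_MAP = {
    "job_id": None,
    "job_title": "title",
    "company": "company",
    "location": "location",
    "experience_level": "exp_normailzed",  # spelled as in the info() output
    "salary_avg": "avg_annual_salary",
    "job_description": "job_description",
}

def missing_source_columns(df: pd.DataFrame, mapping: dict) -> list:
    """List mapped source columns that are absent from the DataFrame."""
    return [col for col in mapping.values()
            if col is not None and col not in df.columns]

# Toy frames: the LinkedIn one is complete, the Indeed one lacks a column
toy_linkedin = pd.DataFrame(columns=list(LINKEDIN_MAP.values()))
toy_indeed = pd.DataFrame(columns=["title", "company", "location",
                                   "exp_normailzed", "job_description"])
linkedin_gaps = missing_source_columns(toy_linkedin, LINKEDIN_MAP)
indeed_gaps = missing_source_columns(toy_indeed, INDEED_MAP)
print(linkedin_gaps, indeed_gaps)
```

An empty list means every mapped column is present; anything else names the columns to investigate before the rename step.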


### Suitability for Project Goals - LinkedIn

The LinkedIn dataset is suitable as a **primary data source** due to:
- Complete job descriptions
- Rich job metadata
- Sufficient dataset size

However, salary and explicit skill fields are sparse,
making NLP-based inference and careful feature selection mandatory.


### Suitability for Project Goals - Indeed

The Indeed dataset is suitable as a **secondary dataset** primarily for:
- Salary benchmarking
- Experience-level analysis
- Industry-based segmentation

However, due to the absence of explicit skill listings,
NLP-based skill extraction must rely solely on job descriptions.


## Data Cleaning

In [11]:
linkedin["source"] = "linkedin"
indeed["source"] = "indeed"


### Normalize Column Names

#### LinkedIn

In [12]:
linkedin_clean = linkedin.rename(columns={
    "title": "job_title",
    "company_name": "company",
    "description": "job_description",
    "formatted_experience_level": "experience_level",
    "med_salary": "salary_avg"
})


#### Indeed

In [13]:
indeed_clean = indeed.rename(columns={
    "title": "job_title",
    # "exp_normailzed" is misspelled in the source data; the key must match it
    "exp_normailzed": "experience_level"
})


### Handle Missing Experience

In [14]:
def normalize_experience(x):
    if pd.isna(x):
        return "unknown"
    if x <= 1:
        return "junior"
    if x <= 3:
        return "mid"
    return "senior"


In [15]:
indeed_clean["experience_level"] = indeed_clean["experience_level"].apply(normalize_experience)
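
The same bucketing can be written without a row-wise `apply`. The sketch below uses `pd.cut` on a toy series and assumes, as the thresholds above suggest, that the column holds years of experience as numbers:

```python
import numpy as np
import pandas as pd

years = pd.Series([0.0, 1.0, 2.5, 3.0, 7.0, np.nan])

# Bins are right-inclusive: (-inf, 1] junior, (1, 3] mid, (3, inf) senior
levels = pd.cut(
    years,
    bins=[-np.inf, 1, 3, np.inf],
    labels=["junior", "mid", "senior"],
)

# Missing values fall outside every bin; label them explicitly
levels = levels.cat.add_categories("unknown").fillna("unknown")
print(levels.tolist())
```

The result is a categorical series with the same four labels as `normalize_experience`.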


### Salary Harmonization

In [16]:
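def compute_salary(row):
    # Indeed: use the annualized figure, as the canonical schema specifies.
    # Indeed's own `salary_avg` is always populated and would otherwise shadow it.
    if not pd.isna(row.get("avg_annual_salary")):
        return row["avg_annual_salary"]
    # LinkedIn: `salary_avg` is the renamed `med_salary`
    if not pd.isna(row.get("salary_avg")):
        return row["salary_avg"]
    return np.nan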


In [17]:
linkedin_clean["salary_final"] = linkedin_clean.apply(compute_salary, axis=1)
indeed_clean["salary_final"] = indeed_clean.apply(compute_salary, axis=1)
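
Row-wise `apply` can be slow on a frame with more than 100,000 rows. A vectorized alternative is sketched below; `coalesce_columns` is a hypothetical helper, and the priority order is whatever list the caller passes in:

```python
import numpy as np
import pandas as pd

def coalesce_columns(df: pd.DataFrame, columns: list) -> pd.Series:
    """Row-wise first non-missing value across `columns`, in priority order.

    Columns absent from `df` are treated as entirely missing.
    """
    candidates = df.reindex(columns=columns)
    # Back-fill across columns so the first column holds the first available value
    return candidates.bfill(axis=1).iloc[:, 0].rename("salary_final")

# Toy frames; the column order used here is illustrative
order = ["avg_annual_salary", "salary_avg"]
toy_linkedin = pd.DataFrame({"salary_avg": [50000.0, np.nan]})
toy_indeed = pd.DataFrame({
    "salary_avg": [20.0, 30.0],
    "avg_annual_salary": [41600.0, np.nan],
})
linkedin_salary = coalesce_columns(toy_linkedin, order)
indeed_salary = coalesce_columns(toy_indeed, order)
print(linkedin_salary.tolist(), indeed_salary.tolist())
```

Rows with no value in any listed column stay `NaN`, matching the fallback in `compute_salary`.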


### Select Canonical Columns

#### Add Missing Columns

In [18]:
linkedin_clean["industry"] = np.nan
linkedin_clean["sub_industry"] = np.nan

In [19]:
indeed_clean["job_id"] = indeed_clean.index.astype(str)
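
Using the bare row index gives Indeed string IDs such as `"0"`, while LinkedIn keeps integer IDs, so the merged `job_id` column mixes types. One possible alternative, shown only as a sketch and not what the notebook does, is to prefix every ID with its source; `prefixed_ids` is a hypothetical helper:

```python
import pandas as pd

def prefixed_ids(ids: pd.Series, source: str) -> pd.Series:
    """Build string IDs such as 'indeed_0' so IDs stay unique across sources."""
    return source + "_" + ids.astype(str)

# Toy data: LinkedIn has real integer IDs, Indeed only has a row index
toy_linkedin = pd.DataFrame({"job_id": [921716, 1829192]})
toy_indeed = pd.DataFrame({"job_title": ["Analyst", "Engineer"]})

linkedin_ids = prefixed_ids(toy_linkedin["job_id"], "linkedin")
indeed_ids = prefixed_ids(pd.Series(toy_indeed.index), "indeed")
print(linkedin_ids.tolist(), indeed_ids.tolist())
```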


In [20]:
canonical_cols = [
    "job_id",
    "job_title",
    "company",
    "location",
    "experience_level",
    "job_description",
    "salary_final",
    "industry",
    "sub_industry",
    "source"
]

In [21]:
linkedin_final = linkedin_clean[canonical_cols]
indeed_final = indeed_clean[canonical_cols]

### Merge Datasets

In [22]:
jobs_canonical = pd.concat(
    [linkedin_final, indeed_final],
    ignore_index=True
)


In [23]:
jobs_canonical.shape
jobs_canonical.head()
jobs_canonical["source"].value_counts()


| source | count |
|--------|-------|
| linkedin | 123849 |
| indeed | 5492 |


## Final Data Sanity Checks

In [32]:
combined = pd.concat([linkedin_final, indeed_final], ignore_index=True)
combined.shape


(129341, 10)
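
The shape matches the two sources (123,849 + 5,492 = 129,341 rows). A few further checks could be automated along these lines; `sanity_check` is a hypothetical helper, demonstrated on toy frames:

```python
import pandas as pd

def sanity_check(combined: pd.DataFrame, parts: list,
                 key=("source", "job_id")) -> dict:
    """Run basic post-merge checks and return the findings as a dict."""
    return {
        # No rows lost or gained by the concatenation
        "row_count_ok": len(combined) == sum(len(p) for p in parts),
        # Rows sharing the same (source, job_id) pair
        "duplicate_keys": int(combined.duplicated(subset=list(key)).sum()),
        # Row count per source
        "sources": combined["source"].value_counts().to_dict(),
    }

toy_a = pd.DataFrame({"job_id": [1, 2], "source": "linkedin"})
toy_b = pd.DataFrame({"job_id": ["0", "1", "1"], "source": "indeed"})
toy_combined = pd.concat([toy_a, toy_b], ignore_index=True)

result = sanity_check(toy_combined, [toy_a, toy_b])
print(result)
```

In the toy data the repeated Indeed ID is reported as one duplicate key, which is the kind of issue the check is meant to surface.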

### Data Availability Constraints

Salary, industry, and experience fields are partially populated due to platform-specific posting behavior. These fields are treated as optional signals rather than mandatory attributes. Models are trained only on rows where relevant features are present.
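
Selecting the rows usable for a given model might look like the sketch below. `rows_with_features` is a hypothetical helper, and treating the `"unknown"` label as missing is an assumption about how the placeholder should be handled:

```python
import numpy as np
import pandas as pd

def rows_with_features(df: pd.DataFrame, required: list,
                       placeholders=("unknown",)) -> pd.DataFrame:
    """Keep rows where every required column is populated with a real value."""
    subset = df[list(required)]
    # A value counts as present if it is neither NaN nor a placeholder label
    present = subset.notna() & ~subset.isin(list(placeholders))
    return df[present.all(axis=1)]

toy = pd.DataFrame({
    "job_title": ["A", "B", "C"],
    "salary_final": [50000.0, np.nan, 70000.0],
    "experience_level": ["junior", "mid", "unknown"],
})
usable = rows_with_features(toy, ["salary_final", "experience_level"])
print(usable)
```

Each model can pass its own list of required columns, so a row excluded from salary prediction can still be used for tasks that only need the description.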

In [39]:
combined.to_csv(
    "combined_canonical_v1.csv",
    index=False
)
