# QCTO - Workplace Module

### Project Title: AI-Powered Job Market Insights
#### Done By: Samukelisiwe Nkosi

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** 
The purpose of this project is to analyze a large dataset of job market information to understand the evolving dynamics in various industries, particularly how technology adoption, automation risk, and required skills impact salaries, job growth projections, and remote work trends. This analysis aims to provide actionable insights for job seekers, companies, and policymakers. By examining factors such as AI adoption levels, automation risk, and location-specific salary data, the project seeks to identify key trends shaping the future of work, highlighting which industries and roles are likely to experience growth or disruption.

* **Details:** 

This project operates within the domain of labor market analytics, focusing on the intersection of technology and employment. The dataset includes columns for Job Title, Industry, Company Size, Location, AI Adoption Level, Automation Risk, Required Skills, Salary in USD, Remote Friendly, and Job Growth Projection.

Key challenges and questions addressed in this project include:

* How are AI adoption and automation risks affecting job roles and industries? We aim to uncover which sectors are more vulnerable to automation and which are benefiting from AI, reshaping workforce needs.
* What are the trends in required skills across industries? Understanding which skills are most in demand can help job seekers tailor their training and development.
* How do salaries vary across locations and industries? This question is crucial for identifying high-paying regions and roles, considering local costs of living and market demand.
* What is the prevalence of remote work options, and how does it correlate with salary and industry? We aim to explore how remote-friendly roles compare with traditional office-based jobs in terms of compensation and job growth prospects.
* Which jobs and industries are projected to grow? This project will analyze job growth projections, identifying which fields are likely to experience significant expansion in the near future.
---


---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [None]:
%pip install scikit-learn

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** 
The dataset, titled "AI-Powered Job Market Insights," was collected to simulate and analyze the modern job market, with a particular focus on the impact of artificial intelligence (AI) and automation on various industries. This dataset offers insights into key job market dynamics, including AI adoption levels, automation risks, salaries, required skills, and job growth projections. The purpose of the dataset is to facilitate research into the future of work, helping researchers, policymakers, and data scientists explore trends driven by AI.

* **Details:** 
The "AI-Powered Job Market Insights" dataset was sourced from Kaggle and provides a synthetic yet realistic collection of 500 unique job listings. It includes information on a wide range of job attributes, such as:

- Job Title: Categorical data on various job roles.
- Industry: Categorical data covering different sectors like healthcare, finance, entertainment, etc.
- Company Size: Categorical data indicating small, medium, or large companies.
- Location: Categorical data based on geographic regions.
- AI Adoption Level: Ordinal data indicating the extent of AI integration (e.g., low, medium, high).
- Automation Risk: Ordinal data showing the risk of automation for each role (e.g., low, medium, high).
- Required Skills: Categorical data showing the skills needed for each job.
- Salary in USD: Numerical data showing salary figures for different roles.
- Remote Friendly: Binary data (Yes/No) indicating whether the job allows remote work.
- Job Growth Projection: Categorical data predicting whether the job market for the role is expected to grow or shrink.
---
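
The ordinal columns listed above carry a natural ranking that a plain label encoding can lose. The following is a minimal sketch of one way to preserve it, assuming the levels are exactly `Low`, `Medium`, and `High`; the toy rows and the `_Code` column names are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "AI_Adoption_Level": ["Medium", "Low", "High", "Low"],
    "Automation_Risk": ["High", "High", "Low", "Medium"],
})

# Assumed ranking shared by both ordinal columns.
LEVELS = ["Low", "Medium", "High"]

for col in ["AI_Adoption_Level", "Automation_Risk"]:
    # An ordered categorical keeps the labels and records their ranking.
    toy[col] = pd.Categorical(toy[col], categories=LEVELS, ordered=True)
    # Integer codes follow the ranking: Low=0, Medium=1, High=2.
    toy[col + "_Code"] = toy[col].cat.codes

print(toy)
```

Note that scikit-learn's `LabelEncoder` assigns codes in sorted label order (`High`, `Low`, `Medium`), so it would not reflect the ranking without an explicit mapping like the one above.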

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>


In [5]:
file_path = 'ai_job_market_insights.csv'  
df = pd.read_csv(file_path)

# Displaying the first few rows of the dataset to inspect
df.head()

| | Job_Title | Industry | Company_Size | Location | AI_Adoption_Level | Automation_Risk | Required_Skills | Salary_USD | Remote_Friendly | Job_Growth_Projection |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cybersecurity Analyst | Entertainment | Small | Dubai | Medium | High | UX/UI Design | 111392.165243 | Yes | Growth |
| 1 | Marketing Specialist | Technology | Large | Singapore | Medium | High | Marketing | 93792.562466 | No | Decline |
| 2 | AI Researcher | Technology | Large | Singapore | Medium | High | UX/UI Design | 107170.263069 | Yes | Growth |
| 3 | Sales Manager | Retail | Small | Berlin | Low | High | Project Management | 93027.953758 | No | Growth |
| 4 | Cybersecurity Analyst | Entertainment | Small | Tokyo | Low | Low | JavaScript | 87752.922171 | Yes | Decline |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>



In [8]:
# Get a summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Job_Title              500 non-null    object 
 1   Industry               500 non-null    object 
 2   Company_Size           500 non-null    object 
 3   Location               500 non-null    object 
 4   AI_Adoption_Level      500 non-null    object 
 5   Automation_Risk        500 non-null    object 
 6   Required_Skills        500 non-null    object 
 7   Salary_USD             500 non-null    float64
 8   Remote_Friendly        500 non-null    object 
 9   Job_Growth_Projection  500 non-null    object 
dtypes: float64(1), object(9)
memory usage: 39.2+ KB


In [9]:
# Get descriptive statistics
print(df.describe(include='all'))



             Job_Title       Industry Company_Size       Location  \
count              500            500          500            500   
unique              10             10            3             10   
top     Data Scientist  Manufacturing        Small  San Francisco   
freq                62             58          171             62   
mean               NaN            NaN          NaN            NaN   
std                NaN            NaN          NaN            NaN   
min                NaN            NaN          NaN            NaN   
25%                NaN            NaN          NaN            NaN   
50%                NaN            NaN          NaN            NaN   
75%                NaN            NaN          NaN            NaN   
max                NaN            NaN          NaN            NaN   

       AI_Adoption_Level Automation_Risk     Required_Skills     Salary_USD  \
count                500             500                 500     500.000000   
unique       

In [10]:
# Check for missing values
print(df.isnull().sum())

Job_Title                0
Industry                 0
Company_Size             0
Location                 0
AI_Adoption_Level        0
Automation_Risk          0
Required_Skills          0
Salary_USD               0
Remote_Friendly          0
Job_Growth_Projection    0
dtype: int64


In [11]:
# Check for duplicate rows
print(df.duplicated().sum())

0


In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Salary_USD'].quantile(0.25)
Q3 = df['Salary_USD'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
outliers_iqr = df[(df['Salary_USD'] < (Q1 - 1.5 * IQR)) | (df['Salary_USD'] > (Q3 + 1.5 * IQR))]
print(outliers_iqr)
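
The IQR rule used in the cell above can be wrapped in a small helper so that the same fences are reusable for other numeric columns. This is a sketch under stated assumptions: `iqr_bounds` is a hypothetical helper name, and the salaries are toy values chosen to show one clear outlier, not figures from the dataset.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy salaries for illustration only; not taken from the dataset.
toy_salaries = pd.Series(
    [60_000, 65_000, 70_000, 75_000, 80_000, 250_000], name="Salary_USD"
)

lower, upper = iqr_bounds(toy_salaries)

# A value is flagged when it falls outside either fence.
is_outlier = (toy_salaries < lower) | (toy_salaries > upper)

print(f"fences: {lower:.0f} to {upper:.0f}")
print(toy_salaries[is_outlier])
```

Keeping the boolean mask (`is_outlier`) rather than only the filtered rows makes it easy to either inspect or exclude the flagged records later.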

In [7]:
# Convert 'Salary_USD' to numeric, if not already
df['Salary_USD'] = pd.to_numeric(df['Salary_USD'], errors='coerce')

# Check data types
print(df.dtypes)

Job_Title                 object
Industry                  object
Company_Size              object
Location                  object
AI_Adoption_Level         object
Automation_Risk           object
Required_Skills           object
Salary_USD               float64
Remote_Friendly           object
Job_Growth_Projection     object
dtype: object
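
Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth checking what, if anything, was coerced. The sketch below illustrates the behaviour on toy strings; the values are made up for the example, since the real `Salary_USD` column is already `float64`.

```python
import pandas as pd

# Toy values for illustration only; the real column is already numeric.
raw = pd.Series(["87752.92", "93,027", "n/a", "111392.17"], name="Salary_USD")

# Entries that cannot be parsed as numbers become NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards
# are the ones that failed to parse.
failed = raw[converted.isna() & raw.notna()]

print(converted)
print(failed)
```

If `failed` is non-empty on real data, re-running the missing-value check after the conversion would reveal the newly introduced `NaN` entries.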


In [8]:
# Standardize text columns (strip surrounding whitespace, convert to lowercase)
text_columns = ['Job_Title', 'Industry', 'Location', 'Required_Skills']
for column in text_columns:
    df[column] = df[column].str.strip().str.lower()


In [9]:
# Check final cleaned data
print(df.head())

               Job_Title       Industry Company_Size   Location  \
0  cybersecurity analyst  entertainment        Small      dubai   
1   marketing specialist     technology        Large  singapore   
2          ai researcher     technology        Large  singapore   
3          sales manager         retail        Small     berlin   
4  cybersecurity analyst  entertainment        Small      tokyo   

  AI_Adoption_Level Automation_Risk     Required_Skills     Salary_USD  \
0            Medium            High        ux/ui design  111392.165243   
1            Medium            High           marketing   93792.562466   
2            Medium            High        ux/ui design  107170.263069   
3               Low            High  project management   93027.953758   
4               Low             Low          javascript   87752.922171   

  Remote_Friendly Job_Growth_Projection  
0             Yes                Growth  
1              No               Decline  
2             Yes         

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
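
As a starting point for the statistics described above, the sketch below shows two grouped summaries that relate to the project questions: median salary per industry, and the share of growth projections among remote and non-remote roles. The rows are toy values invented for illustration, so the numbers say nothing about the real dataset; only the column names come from it.

```python
import pandas as pd

# Toy rows for illustration only; values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Industry": ["technology", "technology", "retail", "retail", "finance", "finance"],
    "Remote_Friendly": ["Yes", "No", "No", "No", "Yes", "Yes"],
    "Job_Growth_Projection": ["Growth", "Decline", "Growth", "Decline", "Growth", "Decline"],
    "Salary_USD": [110_000.0, 95_000.0, 70_000.0, 72_000.0, 120_000.0, 105_000.0],
})

# Row count and median salary per industry, highest median first.
salary_by_industry = (
    toy.groupby("Industry")["Salary_USD"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(salary_by_industry)

# Share of each growth projection within remote and non-remote roles
# (normalize="index" makes every row sum to 1).
growth_by_remote = pd.crosstab(
    toy["Remote_Friendly"], toy["Job_Growth_Projection"], normalize="index"
)
print(growth_by_remote)
```

The median is used here instead of the mean because it is less sensitive to a few very high salaries.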


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
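
One possible setup, shown only as a sketch, is a scikit-learn pipeline that one-hot encodes the categorical features, scales the salary, and fits a classifier. The choice of `Job_Growth_Projection` as the target, the feature subset, and logistic regression as the model are all assumptions made for illustration and are not stated in this notebook; the rows are toy values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Industry": ["technology", "retail", "finance", "technology",
                 "retail", "finance", "technology", "retail"],
    "AI_Adoption_Level": ["High", "Low", "Medium", "Medium",
                          "Low", "High", "High", "Medium"],
    "Salary_USD": [110_000.0, 70_000.0, 105_000.0, 98_000.0,
                   68_000.0, 120_000.0, 115_000.0, 72_000.0],
    "Job_Growth_Projection": ["Growth", "Decline", "Growth", "Growth",
                              "Decline", "Decline", "Growth", "Decline"],
})

categorical = ["Industry", "AI_Adoption_Level"]
numeric = ["Salary_USD"]

# Encode categorical columns and scale the numeric one in a single step.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chaining preprocessing and the classifier keeps both inside one estimator,
# so the same transformations are applied at fit and predict time.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = toy[categorical + numeric]
y = toy["Job_Growth_Projection"]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Predicting on the training rows, as done here, only confirms that the pipeline runs; it is not a measure of model quality.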


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---
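
The metrics named above can be computed with `sklearn.metrics` once true and predicted labels are available. The sketch below uses hand-written toy labels rather than real model output, and treats `Growth` as the positive class, which is an assumption made for the example.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels for illustration only; these are not model results.
y_true = ["Growth", "Growth", "Decline", "Decline",
          "Growth", "Decline", "Growth", "Decline"]
y_pred = ["Growth", "Decline", "Decline", "Decline",
          "Growth", "Growth", "Growth", "Decline"]

# With string labels, the positive class has to be named explicitly.
POSITIVE = "Growth"

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=POSITIVE)
recall = recall_score(y_true, y_pred, pos_label=POSITIVE)
f1 = f1_score(y_true, y_pred, pos_label=POSITIVE)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
```

In practice `y_pred` would come from a held-out test split or from cross-validation, so that the scores reflect performance on data the model has not seen.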

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
