# In this final notebook, we assess the skills mentioned in the PDF files by comparing them with the skills extracted from LinkedIn profiles. We work with two datasets: a dataframe of skills in the data science field and a dataframe of skills extracted from the PDF files.

### To begin, we will import both datasets into our notebook and proceed with the evaluation.

In [6]:
import pandas as pd
import openai
import time
import re

In [7]:
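# Data science skills, with their frequency and course unit
df = pd.read_excel('data_2.xlsx')
# Skills extracted from the course PDF files
df_fiche_module = pd.read_excel('data_fiche_ds_2.xlsx')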


### Data overview

In [9]:
df.head()

Unnamed: 0,skill_name,frequency,course_unit
0,Python (Programming Language),443,Web Development
1,Python (Programming Language),443,Data Visualization
2,Python (Programming Language),443,Artificial Intelligence
3,Machine Learning,294,Artificial Intelligence
4,Deep Learning,181,Artificial Intelligence


In [10]:
df_fiche_module.head()

Unnamed: 0,raw_text,skill_type,skill_name,course,Unnamed: 4
0,big datum,Hard Skill,Big Data,bigdata.pdf,Data Visualization
1,mapreduce,Hard Skill,MapReduce,bigdata.pdf,Data Visualization
2,spark streaming,Hard Skill,Spark Streaming,bigdata.pdf,Data Visualization
3,HDFS,Hard Skill,Hadoop Distributed File System (HDFS),bigdata.pdf,Data Visualization
4,hive,Hard Skill,Apache Hive,bigdata.pdf,Data Visualization
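
The preview above shows that the course unit of each PDF skill sits in a column that pandas named `Unnamed: 4`, while the same information is called `course_unit` in `df`. A minimal sketch of how that column could be renamed for readability is given below. It runs on a small toy dataframe (`toy_fiche` is a hypothetical name, and its two rows are copied from the preview), and it assumes that `Unnamed: 4` plays the same role as `course_unit`; the cells in this notebook keep using the original column name.

```python
import pandas as pd

# Toy stand-in for df_fiche_module (rows copied from the preview above).
toy_fiche = pd.DataFrame({
    "raw_text": ["big datum", "hive"],
    "skill_type": ["Hard Skill", "Hard Skill"],
    "skill_name": ["Big Data", "Apache Hive"],
    "course": ["bigdata.pdf", "bigdata.pdf"],
    "Unnamed: 4": ["Data Visualization", "Data Visualization"],
})

# Assumption: 'Unnamed: 4' holds the course unit, like df['course_unit'].
toy_fiche_renamed = toy_fiche.rename(columns={"Unnamed: 4": "course_unit"})

print(toy_fiche_renamed.columns.tolist())
```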


### Comparing skills in every category that exist in both the PDF files and the collected data

In [7]:
import pandas as pd


# Skill names that appear in both dataframes
common_skill_names = pd.Series(list(set(df['skill_name']).intersection(set(df_fiche_module['skill_name']))))

# Iterate over the course units in the second dataframe
# (the 'Unnamed: 4' column holds the course unit of each PDF skill)
for course_name in df_fiche_module['Unnamed: 4'].unique():
    print(f"Course Names in university: {course_name}:")
    matching_skills = df[df['course_unit'] == course_name]['skill_name']
    matching_skills = matching_skills[matching_skills.isin(common_skill_names)]
    
    if not matching_skills.empty:
        print("        Matched Skills in", course_name + ":")
        for skill in matching_skills.unique():
            print(skill)

        print()


Course Names in university: Data Visualization:
        Matched Skills in Data Visualization:
Python (Programming Language)
SQL (Programming Language)
R (Programming Language)
Pandas (Python Package)
Time Series
MySQL
NumPy
Kibana
Apache Spark

Course Names in university: Artificial Intelligence:
        Matched Skills in Artificial Intelligence:
Python (Programming Language)
Machine Learning
Deep Learning
Machine Learning Methods
R (Programming Language)
Recurrent Neural Network (RNN)
Apache Spark
Random Forest Algorithm

Course Names in university: Database Administration:
        Matched Skills in Database Administration:
Big Data
SQL (Programming Language)
Apache Hive
MySQL
Linux

Course Names in university: Web Development:
        Matched Skills in Web Development:
Python (Programming Language)
Application Programming Interface (API)
Scala (Programming Language)
Java (Programming Language)
Web Applications
Web Services

Course Names in university: Operations research:
Course Name
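
The loop above keeps, for each course unit, the skills of `df` whose name also appears anywhere in `df_fiche_module`. The same result can be sketched without an explicit loop by filtering with `isin` and grouping by course unit. The function name `matched_skills_by_unit` and the toy dataframes are hypothetical and only serve the illustration. One difference to keep in mind: this sketch lists the course units found in the first dataframe, whereas the loop iterates over those found in the PDF dataframe.

```python
import pandas as pd

def matched_skills_by_unit(linkedin_df, pdf_df):
    """Return {course_unit: [skills also present in pdf_df]}."""
    # True for every row whose skill name also occurs in the PDF dataframe
    in_pdf = linkedin_df["skill_name"].isin(pdf_df["skill_name"])
    # Unique matched skill names per course unit
    grouped = linkedin_df[in_pdf].groupby("course_unit")["skill_name"].unique()
    return {unit: list(skills) for unit, skills in grouped.items()}

# Toy data (hypothetical, shaped like df and df_fiche_module)
toy_df = pd.DataFrame({
    "skill_name": ["Python", "Python", "Machine Learning", "Power BI"],
    "course_unit": ["Web Development", "Data Visualization",
                    "Artificial Intelligence", "Data Visualization"],
})
toy_pdf = pd.DataFrame({"skill_name": ["Python", "Machine Learning"]})

matched = matched_skills_by_unit(toy_df, toy_pdf)
for unit, skills in matched.items():
    print(unit, "->", skills)
```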

### Missing skills in the PDF files

In [8]:
import pandas as pd

# Find the common skill_names
common_skill_names = pd.Series(list(set(df['skill_name']).intersection(set(df_fiche_module['skill_name']))))

# Iterate over the course units in the second dataframe
for course_unit in df_fiche_module['Unnamed: 4'].unique():
    
    # Skills in df that do not exist in df_fiche_module
    unmatched_skills = df[df['course_unit'] == course_unit]['skill_name']
    unmatched_skills = unmatched_skills[~unmatched_skills.isin(common_skill_names)]
    
    if not unmatched_skills.empty:
        print(f"Course Unit: {course_unit}")
        print("    Missing skill Skills in University:")
        for skill in unmatched_skills.unique():
            print("        Skill Name:", skill)
        print()


Course Unit: Data Visualization
    Missing skill Skills in University:
        Skill Name: Financial Data Analysis
        Skill Name: Power BI
        Skill Name: Prediction
        Skill Name: Analytics
        Skill Name: Data Visualization
        Skill Name: Dashboard
        Skill Name: Forecasting
        Skill Name: SAS (Software)
        Skill Name: SQL Server Express
        Skill Name: Business Intelligence
        Skill Name: Data Analysis
        Skill Name: Cleaned Data
        Skill Name: Survey Data Analysis
        Skill Name: Visualization
        Skill Name: SQL Server Integration Services (SSIS)
        Skill Name: Indicators (Measuring Device)
        Skill Name: Matplotlib
        Skill Name: Feature Engineering
        Skill Name: Sentiment Analysis
        Skill Name: Gitlab
        Skill Name: BigQuery
        Skill Name: Azure Databricks

Course Unit: Artificial Intelligence
    Missing skill Skills in University:
        Skill Name: Prediction
        Skill 
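
The missing skills are the rows of `df` whose skill name has no counterpart in `df_fiche_module`, which is an anti-join. As an alternative to the `~isin` filter used above, a rough sketch with `DataFrame.merge` and its `indicator` option is shown below. The function name `missing_skills` and the toy dataframes are hypothetical; the frequencies reuse values printed elsewhere in this notebook. Keeping the full rows (rather than only the names) leaves the `frequency` column available for later ranking.

```python
import pandas as pd

def missing_skills(linkedin_df, pdf_df):
    """Rows of linkedin_df whose skill_name never appears in pdf_df."""
    # One row per PDF skill name, so the merge cannot duplicate left rows
    pdf_names = pdf_df[["skill_name"]].drop_duplicates()
    merged = linkedin_df.merge(pdf_names, on="skill_name",
                               how="left", indicator=True)
    # 'left_only' marks rows that found no match in the PDF dataframe
    only_left = merged["_merge"] == "left_only"
    return merged.loc[only_left, linkedin_df.columns.tolist()].reset_index(drop=True)

# Toy data (hypothetical, shaped like df and df_fiche_module)
toy_df = pd.DataFrame({
    "skill_name": ["Python", "Power BI", "TensorFlow"],
    "frequency": [443, 119, 58],
    "course_unit": ["Data Visualization", "Data Visualization",
                    "Artificial Intelligence"],
})
toy_pdf = pd.DataFrame({"skill_name": ["Python", "Python"]})

missing = missing_skills(toy_df, toy_pdf)
print(missing)
```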

### Printing the top 5 missing skills in the PDF files based on skill frequency

In [9]:
import pandas as pd

# Find the common skill_names
common_skill_names = pd.Series(list(set(df['skill_name']).intersection(set(df_fiche_module['skill_name']))))

# Iterate over the course units in the second dataframe
for course_unit in df_fiche_module['Unnamed: 4'].unique():
    
    # Skills in df that do not exist in df_fiche_module
    unmatched_skills = df[df['course_unit'] == course_unit]['skill_name']
    unmatched_skills = unmatched_skills[~unmatched_skills.isin(common_skill_names)]
    
    if not unmatched_skills.empty:
        print(f"Course Unit: {course_unit}")
        print("    Missing skill Skills in University:")
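        # Note: value_counts() ranks skills by how many rows they occupy in this
        # course unit, not by the 'frequency' column, so these five are not
        # necessarily the most frequent missing skills
        for skill in unmatched_skills.value_counts().head(5).index:
            print("        Skill Name:", skill)
            # Look up the frequency recorded in df for the unmatched skill
            frequency = df[df['skill_name'] == skill]['frequency']
            # Assumes a skill has the same frequency on every row of df
            print("        Frequency:", frequency.values[0])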
        print()


Course Unit: Data Visualization
    Missing skill Skills in University:
        Skill Name: Financial Data Analysis
        Frequency: 159
        Skill Name: Power BI
        Frequency: 119
        Skill Name: BigQuery
        Frequency: 21
        Skill Name: Gitlab
        Frequency: 21
        Skill Name: Sentiment Analysis
        Frequency: 23

Course Unit: Artificial Intelligence
    Missing skill Skills in University:
        Skill Name: Prediction
        Frequency: 109
        Skill Name: Natural Language Programming
        Frequency: 89
        Skill Name: Predictive Modeling
        Frequency: 61
        Skill Name: TensorFlow
        Frequency: 58
        Skill Name: Computer Vision
        Frequency: 54

Course Unit: Database Administration
    Missing skill Skills in University:
        Skill Name: Extract Transform Load (ETL)
        Frequency: 137
        Skill Name: Data Extraction
        Frequency: 48
        Skill Name: Data Modeling
        Frequency: 39
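
In the Data Visualization output above, Sentiment Analysis (23) is printed after BigQuery and Gitlab (21 each), which suggests the five skills are not ordered by the `frequency` column. A possible way to rank the missing skills by that column is sketched below: sort by frequency within each course unit, then keep the first rows of each group. The variable names and the choice of `top_n = 3` are hypothetical, and the toy rows reuse the Data Visualization values printed above.

```python
import pandas as pd

# Toy rows copied from the Data Visualization output above
toy_missing = pd.DataFrame({
    "skill_name": ["Financial Data Analysis", "Power BI", "BigQuery",
                   "Gitlab", "Sentiment Analysis"],
    "frequency": [159, 119, 21, 21, 23],
    "course_unit": ["Data Visualization"] * 5,
})

top_n = 3  # hypothetical cut-off, the notebook uses 5

top_missing = (
    toy_missing
    # A skill listed several times in one unit should count once
    .drop_duplicates(["course_unit", "skill_name"])
    # Highest frequency first inside each course unit
    .sort_values(["course_unit", "frequency"], ascending=[True, False])
    # Keep the first top_n rows of every course unit
    .groupby("course_unit")
    .head(top_n)
)

for unit, group in top_missing.groupby("course_unit"):
    print(f"Course Unit: {unit}")
    for row in group.itertuples(index=False):
        print("    Skill Name:", row.skill_name, "| Frequency:", row.frequency)
```

With this ordering, ties on frequency (such as BigQuery and Gitlab) are broken arbitrarily, so a secondary sort key would be needed if a stable ranking matters.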
        