## Glassdoor Data Science Job Listings
This dataset comprises information extracted from 1,500 job postings related to data science from Glassdoor.com. The data encompasses essential details about each job listing, facilitating comprehensive analysis and insights into the data science job market.
link: https://www.kaggle.com/datasets/rrkcoder/glassdoor-data-science-job-listings

In [None]:
# Import libraries

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
# Import csv file

df = pd.read_csv("glassdoor_data_jobs.csv")

## Exploring the data

In [None]:
df.head(3)

In [None]:
df.info()

In [None]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")
print(f"The names of the columns are:\n{df.columns}")

In [None]:
#Checking for duplicates
print(f"There are {df.duplicated().sum()} duplicated rows in the dataset.")

In [None]:
#Checking for null data
print(df.isnull().sum())
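
`isnull()` only counts real missing values. The processing step further down also treats placeholders such as `-1` and `Unknown` as missing, so a per-column count of those can be useful at this point. The following is a minimal sketch: the helper `count_placeholders`, the `PLACEHOLDERS` list and the toy DataFrame are hypothetical and only illustrate the idea; on the real data it would be called as `count_placeholders(df)`.

```python
import pandas as pd

# Placeholder strings are an assumption based on the values handled in the
# processing step; "-1.0" covers -1 stored in a float column.
PLACEHOLDERS = ["-1", "-1.0", "Unknown"]

def count_placeholders(frame, placeholders=PLACEHOLDERS):
    """Count, per column, the cells whose text form is a placeholder."""
    return frame.astype(str).isin(placeholders).sum()

# Toy data for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "Rating": [-1.0, 4.2, 3.9],
    "Size": ["-1", "Unknown", "10000+ Employees"],
    "Founded": [1999, -1, 2005],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

Comparing the text form of each cell keeps the check independent of whether a column was read as numbers or as strings.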

In [None]:
df.describe()

## Processing data

In [None]:
# Cleaning and manipulating a copy of the data so the original dataset stays unchanged

def processing_data(df):

    # Deleting duplicates
    df.drop_duplicates(inplace=True)

    # Deleting column Job Description
    df.drop(columns=['Job Description'], inplace=True)

    # Treating the -1 placeholder as missing (Rating, Salary Estimate, Industry, Sector)
    df.loc[df["Rating"] == -1, "Rating"] = None
    df.loc[df["Salary Estimate"] == "-1", "Salary Estimate"] = None
    df.loc[df["Industry"] == "-1", "Industry"] = None
    df.loc[df["Sector"] == "-1", "Sector"] = None

    # Founded may be read as numbers or as text, so it is parsed first;
    # -1 and values that cannot be parsed become missing
    founded = pd.to_numeric(df["Founded"], errors="coerce")
    df["Founded"] = founded.where(founded != -1)

    # Treating company size of -1 or Unknown as missing
    df.loc[df["Size"] == "-1", "Size"] = None
    df.loc[df["Size"] == "Unknown", "Size"] = None

    # Eliminating rows with missing values (must happen before the int conversion)
    df.dropna(inplace=True)

    # Converting years to int
    df["Founded"] = df["Founded"].astype(int)

    # Sorting by rating in descending order
    df.sort_values(by="Rating", ascending=False, inplace=True)

df_processed = df.copy()
processing_data(df_processed)
df_processed

In [None]:
df_processed.describe()

## Analysing

In [None]:
# Plotting the rating distribution helped identify outlying data to be cleaned up.
plt.figure(figsize=(10, 5))
sns.histplot(data=df_processed, x="Rating", bins=20, kde=True)
plt.title("Distribution of Company Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

In [None]:
# Plotting the company size counts helped identify outlying data to be cleaned up.

plt.figure(figsize=(10, 5))
sns.countplot(data=df_processed, x='Size')
plt.title('Company Size Distribution')
plt.xlabel('Company Size')
plt.xticks(rotation=45)
plt.ylabel('Count')
plt.show()

## Conclusions

Several insights can be extracted from the dataset, but doing so requires better cleaning, organization and standardization of the data.
For now, we can observe a very high concentration of companies with ratings between 3.5 and 4.5.
We also observe that the vast majority of companies offering data science positions have more than 10000 employees.
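
Both observations are read off the plots, so it may help to back them with numbers. The sketch below is one possible way to do that: the helpers `rating_share` and `size_shares` are hypothetical, and the toy DataFrame (including the size labels) is made up for illustration and does not reproduce values from the dataset. On the cleaned data the calls would be `rating_share(df_processed)` and `size_shares(df_processed)`.

```python
import pandas as pd

def rating_share(frame, low=3.5, high=4.5, column="Rating"):
    """Fraction of non-missing ratings that fall within [low, high]."""
    ratings = frame[column].dropna()
    if ratings.empty:
        return float("nan")
    return ratings.between(low, high).mean()

def size_shares(frame, column="Size"):
    """Relative frequency of each company size category."""
    return frame[column].value_counts(normalize=True)

# Toy data for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "Rating": [3.6, 4.0, 4.4, 2.9],
    "Size": ["10000+ Employees", "10000+ Employees",
             "51 to 200 Employees", "10000+ Employees"],
})
share_in_band = rating_share(toy)
shares_by_size = size_shares(toy)
print(f"Share of ratings between 3.5 and 4.5: {share_in_band:.0%}")
print(shares_by_size)
```

`Series.between` includes both endpoints by default, so ratings of exactly 3.5 or 4.5 count as inside the band.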

In [None]:
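# Saving the cleaned copy (not the original df); index=False leaves out the row index
df_processed.to_csv('data_jobs_cleaned.csv', index=False)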