# Wuzzuf Egypt Jobs Postings EDA
## Fatimah Ehab Farouk


## Contents

<ul>
<li><a href="#intro">1. Introduction</a></li>
<li><a href="#wrangling">2. Data Wrangling</a></li>
<li><a href="#eda">3. Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4. Conclusions</a></li>
</ul>

<a id='intro'></a>
## 1. Introduction

This dataset contains 4,380 job postings with attributes such as Title, Company, and Location.

This is an exploratory data analysis project to discover hidden trends in the Egyptian job postings on Wuzzuf.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd 
import os
import matplotlib.pyplot as plt
import collections


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id='wrangling'></a>

## 2. Data Wrangling

In [None]:
#Import the data as a dataframe
df = pd.read_csv('/kaggle/input/wuzzuf-jobs/Wuzzuf_Jobs.csv')

# Take a look at the data
df

In [None]:
# Check data types and missing values
df.info()

In [None]:
# Count the columns that contain at least one null value
df.isnull().any().sum()

In [None]:
# Checking for duplicates
df.duplicated().sum()

So we need to delete duplicates.

Now let's dive deeper into the data. I want to check these for a start:
- All of these job postings are located in Egypt, so having Egypt as the value in the country column is meaningless. Having the country column itself is debatable.
- Will keeping confidential companies be of benefit?
- Clean these columns: `Title` and `Company`.

More items to explore and clean will be added to this list iteratively as the analysis moves forward.

In [None]:
#Check country column
df.Country.value_counts()

In [None]:
#Further checking
df[df.Country == 'Egypt'].Location.value_counts()

So a few postings are located outside Egypt, and these need to be removed. Renaming the `Country` column to **City** also makes more sense, because the remaining postings are all in Egypt anyway. For that to work, each *Egypt* value in the `Country` column must first be replaced with the corresponding city value from the `Location` column.
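The replacement step described here can be sketched on a small toy frame. The values below are invented for illustration and are not taken from the Wuzzuf dataset; the sketch only assumes that `Country` sometimes holds the placeholder *Egypt* while `Location` holds the more specific value.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the Wuzzuf dataset)
toy = pd.DataFrame({
    "Country": ["Cairo", "Egypt", "Giza", "Egypt"],
    "Location": ["Maadi", "Alexandria", "Dokki", "Cairo"],
})

# Where Country is the placeholder "Egypt", take the value from Location instead;
# every other row keeps its original Country value
toy["Country"] = toy["Country"].mask(toy["Country"] == "Egypt", toy["Location"])

print(toy["Country"].tolist())
# ['Cairo', 'Alexandria', 'Giza', 'Cairo']
```

Assigning the result back to the column, rather than calling `mask` with `inplace=True` on an attribute, is probably the safer choice here, since it does not depend on whether pandas hands back a view or a copy.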

In [None]:
#Check location column
df.Location.value_counts()

In [None]:
#Check company column
df.Company.value_counts()

In [None]:
#Check title column
df.Title.value_counts()

### To-clean list:
- Eliminate duplicates.
- Replace the value *Egypt* in the `Country` column with its corresponding value from the `Location` column.
- Drop postings located outside Egypt.
- Rename the `Country` column to `City`.
- Rename the `Location` column to `District` so it makes more sense.
- Wrangle the `Skills` column so that each cell contains a list of the skills.
- Clean unnecessary characters from the `Title` column.
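As a rough illustration of the `Skills` item in the list above, here is a minimal sketch on made-up data. It assumes each cell is a comma-separated string, and it uses `explode` as one possible way to count skills once the cells have been split into lists.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Skills": ["Sales, Marketing", "Sales, Excel", "Excel"]})

# Split each cell on commas so it holds a list of skills
toy["Skills"] = toy["Skills"].str.split(",")

# Give every skill its own row, trim stray whitespace, then count occurrences
skill_counts = toy["Skills"].explode().str.strip().value_counts()

print(skill_counts)
```

Trimming whitespace matters in this sketch because splitting `"Sales, Marketing"` on `","` leaves a leading space on `" Marketing"`, which would otherwise be counted as a different skill.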


In [None]:
# Create a copy to preserve data
df_clean = df.copy()

In [None]:
# Delete duplicates
df_clean.drop_duplicates(inplace=True)
# Test
df_clean.duplicated().sum()

In [None]:
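# Replace the placeholder 'Egypt' with the corresponding value from Location
df_clean['Country'] = df_clean['Country'].mask(df_clean['Country'] == 'Egypt', df_clean['Location'])
# Remove all spaces from the values (e.g. 'Saudi Arabia' becomes 'SaudiArabia')
df_clean['Country'] = df_clean['Country'].str.replace(' ', '', regex=False)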

In [None]:
# Drop postings located in countries other than Egypt

countries_list = ['SaudiArabia', 'UnitedArabEmirates', 'Oman', 'ElSalvador', 'Brazil',  'India', 'UnitedStates',
                  'Qatar', 'Kuwait', 'Tunisia', 'China', 'Bahrain', 'Philippines', 'Austria', 'Pakistan',
                  'Indonesia', 'Ukraine', 'SriLanka', 'Iraq']

# Keep only the rows whose Country value is not in the list above
df_clean = df_clean[~df_clean.Country.isin(countries_list)]

In [None]:
# Rename the Country column to City and the Location column to District
df_clean.rename(columns = {'Country':'City', 'Location':'District'}, inplace = True)

In [None]:
#Convert skills cells to a list of each skill
df_clean.Skills = df_clean.Skills.str.split(",")

#Test
df_clean.head()

In [None]:
# Create a dataframe counting how often each skill is requested, using the cleaned data
# (df_clean.Skills already holds a list of skills per posting)
skills_counter = collections.Counter(
    skill.strip() for skills in df_clean.Skills for skill in skills
)
skill_df = pd.DataFrame.from_dict(skills_counter, orient='index', columns=['skill_count'])
# Sort in descending order, from the most requested skill to the least
skill_df.sort_values(by='skill_count', ascending=False, inplace=True)

#Test
skill_df

The result shows some inaccurate values, such as *Maadi* listed as a skill when it is in fact a district. Correcting these values would not significantly affect the purposes of this analysis.
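If those entries did need to be removed, one possible approach is to filter the skill counts against the known district names. The sketch below uses invented stand-ins: `skill_counts` and `districts` are hypothetical names, and in the notebook the district set could plausibly be built from the `District` column.

```python
import pandas as pd

# Toy stand-ins, invented for illustration only
skill_counts = pd.DataFrame(
    {"skill_count": [5, 3, 2]},
    index=["Sales", "Maadi", "Excel"],
)
districts = {"Maadi", "Dokki"}  # hypothetical set of known district names

# Keep only the rows whose index label is not a district name
cleaned_counts = skill_counts[~skill_counts.index.isin(districts)]

print(cleaned_counts.index.tolist())
# ['Sales', 'Excel']
```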

In [None]:
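# Strip everything from the first separator onwards: a non-word character
# followed by a space or another non-word character (e.g. " - " or " (")
df_clean.Title = df_clean.Title.str.replace(r'[\W][ \W].*', '', regex=True)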
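To see what the title pattern does, here is a small sketch on invented job titles (none of them come from the dataset). The pattern matches the first non-word character that is followed by a space or another non-word character, and removes everything from that point on. Passing `regex=True` explicitly is assumed to be necessary on recent pandas versions, where `str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# Toy titles, invented for illustration only
titles = pd.Series([
    "Sales Manager - Cairo",
    "Accountant (Part Time)",
    "Software Engineer",
    "Senior Developer, Remote",
    "C++ Developer",
])

# Remove everything from the first separator-like sequence onwards
cleaned = titles.str.replace(r'[\W][ \W].*', '', regex=True)

print(cleaned.tolist())
# ['Sales Manager', 'Accountant', 'Software Engineer', 'Senior Developer', 'C']
```

The last title shows a limitation of the pattern: two consecutive non-word characters, as in `C++`, also count as a separator, so such titles are truncated.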

<a id='eda'></a>
## 3. Exploratory Data Analysis

This is the part where data visualization is done to explore data and discover insightful patterns in it.

In [None]:
# Pie plot of job types
sorted_counts = df_clean.Type.value_counts()
labels = sorted_counts.index

plt.figure(figsize=[14,14])
plt.pie(sorted_counts, labels=labels, rotatelabels =True, startangle=170, radius=2,
        counterclock=False, autopct='%1.00f%%', labeldistance=1.02)
plt.axis('square')
plt.title('Egyptian Wuzzuf Jobs Types 2020', pad=2, fontsize=15);

In [None]:
# Bar plot top 10 skills
plt.figure(figsize=[15,6])
top_skills = skill_df.index[:10]
count = skill_df.skill_count.head(10)

plt.barh(top_skills, count)

plt.title('Top 10 Wanted Skills in Wuzzuf Egypt 2020', fontsize= 15)
plt.xticks(fontsize= 11)
plt.yticks(fontsize= 12)
plt.xlabel('Number of Times the Skill Was Requested', fontsize=12);

In [None]:
# Bar plot top 10 jobs
plt.figure(figsize=[15,6])
# Compute the top 10 titles once and reuse the result for labels and counts
top_jobs = df_clean.Title.value_counts().head(10)

plt.barh(top_jobs.index, top_jobs)

plt.title('Top 10 Wuzzuf Jobs Needed in Egypt 2020', fontsize= 15)
plt.xticks(fontsize= 11)
plt.yticks(fontsize= 12)
plt.xlabel('Number of Times the Job Was Requested', fontsize=12);

In [None]:
# Bar plot top 10 districts
plt.figure(figsize=[15,6])
# Compute the top 10 districts once and reuse the result for labels and counts
top_districts = df_clean.District.value_counts().head(10)

plt.barh(top_districts.index, top_districts)

plt.title('Top 10 Job Locations at Wuzzuf Egypt 2020', fontsize= 15)
plt.xticks(fontsize= 11)
plt.yticks(fontsize= 12)
plt.xlabel('Number of Jobs in the Location', fontsize=12);

This is excellent! Now let's summarize the most important insights in the conclusions section.

<a id='conclusions'></a>
## 4. Conclusions

This dashboard shows the ***top 10 skills, job titles and job locations*** requested by employers on Wuzzuf in Egypt in 2020.

In [None]:
import matplotlib.gridspec as gridspec

def create_figure(plot1, plot2, plot3):
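    # The "seaborn" style was renamed "seaborn-v0_8" in Matplotlib 3.6, so use whichever name exists
    seaborn_style = "seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn"
    with plt.style.context([seaborn_style, "ggplot"]):
        fig = plt.figure(constrained_layout=True, figsize=(10,15))
        specs = gridspec.GridSpec(ncols=1, nrows=3, figure=fig) # 3 rows x 1 column grid

        ax1 = fig.add_subplot(specs[0, 0]) # First row: skills
        ax2 = fig.add_subplot(specs[1, 0]) # Second row: job titles
        ax3 = fig.add_subplot(specs[2, 0]) # Third row: districts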

        #1 Bar plot top 10 skills
        top_skills = skill_df.index[:10]
        count = skill_df.skill_count.head(10)
        ax1.barh(top_skills, count, color='#421244')
        ax1.set_title(plot1, fontsize= 15)
        ax1.set_xlabel('Number of Times the Skill Was Requested', fontsize=12);

        #2 Bar plot top 10 jobs
        top_jobs = df_clean.Title.value_counts().head(10).index
        count = df_clean.Title.value_counts().head(10)
        ax2.barh(top_jobs, count, color='#446664')
        ax2.set_title(plot2, fontsize= 15)
        ax2.set_xlabel('Number of Times the Job Was Requested', fontsize=12);

        #3 Bar plot top 10 districts
        top_districts = df_clean.District.value_counts().head(10).index
        count = df_clean.District.value_counts().head(10)
        ax3.barh(top_districts, count, color='#448844')
        ax3.set_title(plot3, fontsize= 15)
        ax3.set_xlabel('Number of Jobs in the Location', fontsize=12);

        plt.close(fig)
        return fig

create_figure('Top 10 Wanted Skills in Wuzzuf Egypt 2020', 'Top 10 Wuzzuf Jobs Needed in Egypt 2020', 'Top 10 Job Locations at Wuzzuf Egypt 2020')