# 1. Importing libraries and datasets

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
df=pd.read_csv('~/Documents/Naukri Kaggle/Datasets/marketing_sample_for_naukri_com-jobs__20190701_20190830__30k_data.csv')

In [3]:
df.head()

Unnamed: 0,Uniq Id,Crawl Timestamp,Job Title,Job Salary,Job Experience Required,Key Skills,Role Category,Location,Functional Area,Industry,Role
0,9be62c49a0b7ebe982a4af1edaa7bc5f,2019-07-05 01:46:07 +0000,Digital Media Planner,Not Disclosed by Recruiter,5 - 10 yrs,Media Planning| Digital Media,Advertising,Mumbai,"Marketing , Advertising , MR , PR , Media Plan...","Advertising, PR, MR, Event Management",Media Planning Executive/Manager
1,3c52d436e39f596b22519da2612f6a56,2019-07-06 08:04:50 +0000,Online Bidding Executive,Not Disclosed by Recruiter,2 - 5 yrs,pre sales| closing| software knowledge| clien...,Retail Sales,"Pune,Pune","Sales , Retail , Business Development","IT-Software, Software Services",Sales Executive/Officer
2,ffad8a2396c60be2bf6d0e2ff47c58d4,2019-08-05 15:50:44 +0000,Trainee Research/ Research Executive- Hi- Tec...,Not Disclosed by Recruiter,0 - 1 yrs,Computer science| Fabrication| Quality check|...,R&D,Gurgaon,"Engineering Design , R&D","Recruitment, Staffing",R&D Executive
3,7b921f51b5c2fb862b4a5f7a54c37f75,2019-08-05 15:31:56 +0000,Technical Support,"2,00,000 - 4,00,000 PA.",0 - 5 yrs,Technical Support,Admin/Maintenance/Security/Datawarehousing,Mumbai,"IT Software - Application Programming , Mainte...","IT-Software, Software Services",Technical Support Engineer
4,2d8b7d44e138a54d5dc841163138de50,2019-07-05 02:48:29 +0000,Software Test Engineer -hyderabad,Not Disclosed by Recruiter,2 - 5 yrs,manual testing| test engineering| test cases|...,Programming & Design,Hyderabad,IT Software - QA & Testing,"IT-Software, Software Services",Testing Engineer


# 2. Data wrangling

Let us check the data types of the columns we have been given.

In [4]:
df.dtypes

Uniq Id                    object
Crawl Timestamp            object
Job Title                  object
Job Salary                 object
Job Experience Required    object
Key Skills                 object
Role Category              object
Location                   object
Functional Area            object
Industry                   object
Role                       object
dtype: object

As we can see, every column is stored with the object dtype. However, a look at the data shows that some columns would be better in another form. For example, Crawl Timestamp is better stored as a datetime, and the Job Experience Required and Job Salary columns could be represented as integers.
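As a rough sketch of what the integer conversion could look like, the snippet below parses a few values copied from the preview above. The new column names (`min_exp`, `max_exp`, `min_salary`, `max_salary`) are hypothetical, and the regular expressions assume every disclosed value follows the `low - high` pattern seen in the sample rows; other formats in the full dataset would need extra handling.

```python
import pandas as pd

# Toy rows copied from the preview; the derived column names are hypothetical.
sample = pd.DataFrame({
    'Job Experience Required': ['5 - 10 yrs', '0 - 1 yrs', '2 - 5 yrs'],
    'Job Salary': ['Not Disclosed by Recruiter',
                   '2,00,000 - 4,00,000 PA.',
                   'Not Disclosed by Recruiter'],
})

# Experience: capture the two integers on either side of the dash.
exp = sample['Job Experience Required'].str.extract(r'(\d+)\s*-\s*(\d+)')
sample['min_exp'] = pd.to_numeric(exp[0]).astype('Int64')
sample['max_exp'] = pd.to_numeric(exp[1]).astype('Int64')

# Salary: capture the two comma-grouped numbers and strip the commas.
# Rows without a numeric range (e.g. 'Not Disclosed by Recruiter') become <NA>.
sal = sample['Job Salary'].str.extract(r'([\d,]+)\s*-\s*([\d,]+)')
sample['min_salary'] = pd.to_numeric(sal[0].str.replace(',', '', regex=False)).astype('Int64')
sample['max_salary'] = pd.to_numeric(sal[1].str.replace(',', '', regex=False)).astype('Int64')

print(sample[['min_exp', 'max_exp', 'min_salary', 'max_salary']])
```

The nullable `Int64` dtype is used so that undisclosed salaries can stay missing without forcing the column to float.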

## Converting the crawl timestamp column into timestamp datatype

Let us remove the unwanted **+0000** in the timestamp column first.

In [5]:
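# Strip the '+0000' UTC offset from every value in one vectorised call.
# This avoids the chained assignment df[col][i]=... inside a Python loop.
df['Crawl Timestamp']=df['Crawl Timestamp'].str.replace('+0000','',regex=False).str.strip()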

In [6]:
df.head()

Unnamed: 0,Uniq Id,Crawl Timestamp,Job Title,Job Salary,Job Experience Required,Key Skills,Role Category,Location,Functional Area,Industry,Role
0,9be62c49a0b7ebe982a4af1edaa7bc5f,2019-07-05 01:46:07,Digital Media Planner,Not Disclosed by Recruiter,5 - 10 yrs,Media Planning| Digital Media,Advertising,Mumbai,"Marketing , Advertising , MR , PR , Media Plan...","Advertising, PR, MR, Event Management",Media Planning Executive/Manager
1,3c52d436e39f596b22519da2612f6a56,2019-07-06 08:04:50,Online Bidding Executive,Not Disclosed by Recruiter,2 - 5 yrs,pre sales| closing| software knowledge| clien...,Retail Sales,"Pune,Pune","Sales , Retail , Business Development","IT-Software, Software Services",Sales Executive/Officer
2,ffad8a2396c60be2bf6d0e2ff47c58d4,2019-08-05 15:50:44,Trainee Research/ Research Executive- Hi- Tec...,Not Disclosed by Recruiter,0 - 1 yrs,Computer science| Fabrication| Quality check|...,R&D,Gurgaon,"Engineering Design , R&D","Recruitment, Staffing",R&D Executive
3,7b921f51b5c2fb862b4a5f7a54c37f75,2019-08-05 15:31:56,Technical Support,"2,00,000 - 4,00,000 PA.",0 - 5 yrs,Technical Support,Admin/Maintenance/Security/Datawarehousing,Mumbai,"IT Software - Application Programming , Mainte...","IT-Software, Software Services",Technical Support Engineer
4,2d8b7d44e138a54d5dc841163138de50,2019-07-05 02:48:29,Software Test Engineer -hyderabad,Not Disclosed by Recruiter,2 - 5 yrs,manual testing| test engineering| test cases|...,Programming & Design,Hyderabad,IT Software - QA & Testing,"IT-Software, Software Services",Testing Engineer


As we can see, the timestamp is now in a cleaner form and can be converted to the datetime dtype using the pandas.to_datetime method.

In [7]:
df['Crawl Timestamp']=pd.to_datetime(df['Crawl Timestamp'])

In [8]:
df['Crawl Timestamp'][:5]

0   2019-07-05 01:46:07
1   2019-07-06 08:04:50
2   2019-08-05 15:50:44
3   2019-08-05 15:31:56
4   2019-07-05 02:48:29
Name: Crawl Timestamp, dtype: datetime64[ns]

As we can see, the above Crawl Timestamp column is in the required datetime format.
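As an aside, the offset does not have to be removed by hand. The sketch below, which uses two timestamps copied from the preview, assumes every value follows the `YYYY-MM-DD HH:MM:SS +0000` layout and lets `pd.to_datetime` consume the offset through the `%z` directive. The variable names are illustrative only.

```python
import pandas as pd

# Two raw values copied from the original preview.
raw = pd.Series(['2019-07-05 01:46:07 +0000', '2019-08-05 15:50:44 +0000'])

# %z parses the '+0000' offset; utc=True yields a timezone-aware UTC column.
aware = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S %z', utc=True)

# Drop the timezone if naive timestamps are preferred for plotting.
naive = aware.dt.tz_localize(None)

print(naive)
```

Keeping the timezone-aware version may be preferable if the timestamps are later compared with data from other time zones.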

## Missing values

We need to check for missing values and deal with them. In this case, we will simply drop the rows that contain them, since we are primarily dealing with data visualisation and dropping a few entries will not severely harm any calculations.

In [43]:
df.isna().any()

Uniq Id                    False
Crawl Timestamp            False
Job Title                   True
Job Salary                  True
Job Experience Required     True
Key Skills                  True
Role Category               True
Location                    True
Functional Area             True
Industry                    True
Role                        True
dtype: bool

In [45]:
cols=[ 'Job Title', 'Job Salary',
       'Job Experience Required', 'Key Skills', 'Role Category', 'Location',
       'Functional Area', 'Industry', 'Role']
for col in cols:
    # isna().sum() counts the True values directly and also works when a column has no missing values
    print('Number of missing values in {}: {}'.format(col,df[col].isna().sum()))
print('Total entries:{}'.format(len(df)))

Number of missing values in Job Title: 575
Number of missing values in Job Salary: 50
Number of missing values in Job Experience Required: 573
Number of missing values in Key Skills: 1271
Number of missing values in Role Category: 2305
Number of missing values in Location: 577
Number of missing values in Functional Area: 573
Number of missing values in Industry: 573
Number of missing values in Role: 901
Total entries:30000


As we can see, the number of missing values in each column is small: the worst case, Role Category, is missing 2305 of 30000 entries (about 7.7%). Even if we drop all the missing values, we should still get a good depiction of the general trend in the data. Let us now drop every row that contains a missing value.
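The per-column counts and their share of the total can also be gathered in a single table. The following is a minimal sketch on a made-up four-row frame (the values and the `summary` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up frame with a few missing entries.
toy = pd.DataFrame({
    'Job Title': ['Analyst', None, 'Engineer', 'Tester'],
    'Location': ['Mumbai', 'Pune', None, None],
})

# isna().sum() gives counts; isna().mean() gives the fraction missing per column.
summary = pd.DataFrame({
    'missing': toy.isna().sum(),
    'percent': (toy.isna().mean() * 100).round(2),
})

print(summary)
```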

In [46]:
df.dropna(axis=0,inplace=True)

In [48]:
df.isna().any()

Uniq Id                    False
Crawl Timestamp            False
Job Title                  False
Job Salary                 False
Job Experience Required    False
Key Skills                 False
Role Category              False
Location                   False
Functional Area            False
Industry                   False
Role                       False
dtype: bool

In [52]:
df.size

297055

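Note that df.size counts cells (rows × columns), not rows. With 11 columns, 297055 cells correspond to 27005 rows, so dropping the missing values reduced the dataframe from 30000 to 27005 rows, a loss of about 10%. That loss is modest, and the remaining data can be worked with.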
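To make the difference between row count and cell count concrete, here is a small sketch on a made-up frame (the column names `a`, `b`, `c` are placeholders). It compares `len`, `shape`, and `size` before and after `dropna`:

```python
import pandas as pd

# Made-up frame: rows 1 and 2 each contain one missing value.
toy = pd.DataFrame({
    'a': [1, None, 3, 4],
    'b': ['x', 'y', None, 'w'],
    'c': [1.0, 2.0, 3.0, 4.0],
})

rows_before = len(toy)
cleaned = toy.dropna(axis=0)

# shape is (rows, columns); size is their product.
rows_after, n_cols = cleaned.shape

print('Rows before:', rows_before)
print('Rows after:', rows_after)
print('Cells after:', cleaned.size)
```

Comparing `len(df)` or `df.shape[0]` before and after the drop gives the number of rows lost directly.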