# [v1] - Exploratory Data Analysis/EDA (Data understanding)

> The goal of this notebook is to run a general *Exploratory Data Analysis (EDA)* over all features of the dataset.

No preprocessing or modeling is done here, only:

 - Analysis.
 - Hypotheses.
 - Observations.

---

## Dataset Overview

In [1]:
import pandas as pd

# For Linux users.
train_df = pd.read_csv("../datalake/landing/Train_rev1.csv")

In [2]:
train_df.head()

| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12612628 | Engineering Systems Analyst | Engineering Systems Analyst Dorking Surrey Sal... | Dorking, Surrey, Surrey | Dorking | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |
| 1 | 12612830 | Stress Engineer Glasgow | Stress Engineer Glasgow Salary \*\*\*\* to \*\*\*\* We... | Glasgow, Scotland, Scotland | Glasgow | NaN | permanent | Gregory Martin International | Engineering Jobs | 25000 - 35000/annum 25-35K | 30000 | cv-library.co.uk |
| 2 | 12612844 | Modelling and simulation analyst | Mathematical Modeller / Simulation Analyst / O... | Hampshire, South East, South East | Hampshire | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 40000/annum 20-40K | 30000 | cv-library.co.uk |
| 3 | 12613049 | Engineering Systems Analyst / Mathematical Mod... | Engineering Systems Analyst / Mathematical Mod... | Surrey, South East, South East | Surrey | NaN | permanent | Gregory Martin International | Engineering Jobs | 25000 - 30000/annum 25K-30K negotiable | 27500 | cv-library.co.uk |
| 4 | 12613647 | Pioneer, Miser Engineering Systems Analyst | Pioneer, Miser Engineering Systems Analyst Do... | Surrey, South East, South East | Surrey | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


The dataset is large:

- 244,768 samples.
- 12 features.

---

## Check data types

In [4]:
train_df.dtypes

Id                     int64
Title                 object
FullDescription       object
LocationRaw           object
LocationNormalized    object
ContractType          object
ContractTime          object
Company               object
Category              object
SalaryRaw             object
SalaryNormalized       int64
SourceName            object
dtype: object

 - **"object" is the most common data type:**
   - These "object" columns hold text describing the job advertisement.
   - We will probably need to preprocess these texts to extract insights from them.
 - **There are only two numerical features:**
   - **Id:** Job ad identifier.
   - **SalaryNormalized:** Target variable.
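
The split between text and numerical features can also be checked programmatically. The sketch below is only an illustration: it builds a tiny hypothetical frame (`sample_df`, with made-up rows that reuse three of the dataset's column names) and uses `select_dtypes` to separate the columns by type.

```python
import pandas as pd

# Hypothetical miniature frame; the rows are illustrative, not taken from Train_rev1.csv.
sample_df = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "Title": ["Analyst", "Engineer", "Teacher"],
        "SalaryNormalized": [25000, 30000, 27500],
    }
)

# Columns with a numeric dtype (int64 here).
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()

# Every remaining column (the text features).
text_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```

Applied to the notebook's data, the same two calls would be made on `train_df` instead of `sample_df`.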

---

## Check missing data

Here, let's check missing data by:

 - Numerical quantity.
 - Percent (%).

**Numerical quantity approach:**

In [5]:
missingByQuantity = train_df.isnull().sum()
missingByQuantity

Id                         0
Title                      1
FullDescription            0
LocationRaw                0
LocationNormalized         0
ContractType          179326
ContractTime           63905
Company                32430
Category                   0
SalaryRaw                  0
SalaryNormalized           0
SourceName                 1
dtype: int64

**Percent (%) approach:**

In [6]:
missingByPercent = (missingByQuantity / len(train_df.index)) * 100
missingByPercent

Id                     0.000000
Title                  0.000409
FullDescription        0.000000
LocationRaw            0.000000
LocationNormalized     0.000000
ContractType          73.263662
ContractTime          26.108397
Company               13.249281
Category               0.000000
SalaryRaw              0.000000
SalaryNormalized       0.000000
SourceName             0.000409
dtype: float64

 - The features **ContractType**, **ContractTime**, and **Company** each have more than 10% missing data.
 - **ContractType** is missing in more than 73% of the samples:
   - That is a critical problem: a feature with more than 60% of its values missing carries very little usable information for a model.
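
Both views of the missing data can be combined into a single table, with a flag for the columns that exceed a chosen limit. This is a minimal sketch on a hypothetical miniature frame (`sample_df`, made-up rows); the 60% limit is the figure mentioned in the text and is an assumption, not a fixed rule.

```python
import pandas as pd

# Hypothetical miniature frame with some missing values (illustrative rows only).
sample_df = pd.DataFrame(
    {
        "ContractType": [None, "full_time", None, None],
        "ContractTime": ["permanent", None, "contract", "permanent"],
        "Category": ["IT Jobs", "IT Jobs", "Sales Jobs", "Legal Jobs"],
    }
)

# Count and percentage of missing values per column, side by side.
missing_summary = pd.DataFrame(
    {
        "missing": sample_df.isna().sum(),
        "percent": sample_df.isna().mean() * 100,
    }
).sort_values("percent", ascending=False)

# Flag columns above an assumed threshold (60%, the figure used in the text).
THRESHOLD = 60.0
missing_summary["above_threshold"] = missing_summary["percent"] > THRESHOLD

print(missing_summary)
```

Sorting by percentage puts the most problematic columns at the top, which is convenient when a dataset has many features.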

---

## Categorical Analysis

> Now, let's look at the categorical features.

**Categorical Analysis for the "LocationNormalized" feature:**

In [7]:
from collections import Counter
c = Counter(train_df.LocationNormalized)
c.most_common(10)

[('UK', 41093),
 ('London', 30522),
 ('South East London', 11713),
 ('The City', 6678),
 ('Manchester', 3516),
 ('Leeds', 3401),
 ('Birmingham', 3061),
 ('Central London', 2607),
 ('West Midlands', 2540),
 ('Surrey', 2397)]

In [8]:
LocationNormalized_values = c.most_common(10)

In [9]:
# Check categories percent (%) + Missing data.
for category in LocationNormalized_values:
    percentCategory = (category[1] / len(train_df.index)) * 100
    print(f"'{category[0]}' has {category[1]} samples representing {round(percentCategory, 1)}% data.")

'UK' has 41093 samples representing 16.8% data.
'London' has 30522 samples representing 12.5% data.
'South East London' has 11713 samples representing 4.8% data.
'The City' has 6678 samples representing 2.7% data.
'Manchester' has 3516 samples representing 1.4% data.
'Leeds' has 3401 samples representing 1.4% data.
'Birmingham' has 3061 samples representing 1.3% data.
'Central London' has 2607 samples representing 1.1% data.
'West Midlands' has 2540 samples representing 1.0% data.
'Surrey' has 2397 samples representing 1.0% data.


 - *LocationNormalized feature:*
   - LocationNormalized has no missing data.
   - The 10 most common job ad locations are:
     - 'UK' has 41,093 samples, representing 16.8% of the data.
     - 'London' has 30,522 samples, representing 12.5% of the data.
     - 'South East London' has 11,713 samples, representing 4.8% of the data.
     - 'The City' has 6,678 samples, representing 2.7% of the data.
     - 'Manchester' has 3,516 samples, representing 1.4% of the data.
     - 'Leeds' has 3,401 samples, representing 1.4% of the data.
     - 'Birmingham' has 3,061 samples, representing 1.3% of the data.
     - 'Central London' has 2,607 samples, representing 1.1% of the data.
     - 'West Midlands' has 2,540 samples, representing 1.0% of the data.
     - 'Surrey' has 2,397 samples, representing 1.0% of the data.
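
The count-and-percent loop used above is repeated for every categorical feature, so it could be wrapped in a small helper. This is a minimal sketch: `category_shares` is a hypothetical name, and the sample list is made up for illustration.

```python
from collections import Counter


def category_shares(values, top=None):
    """Return (category, count, percent) tuples, most common first."""
    values = list(values)
    total = len(values)
    counts = Counter(values)
    return [
        (category, count, round(count / total * 100, 1))
        for category, count in counts.most_common(top)
    ]


# Illustrative values only (not taken from the dataset).
sample_locations = ["UK", "London", "UK", "Leeds", "London", "UK", "Surrey", "UK"]

for category, count, percent in category_shares(sample_locations, top=3):
    print(f"'{category}' has {count} samples representing {percent}% of the data.")
```

With the notebook's data, the call would be `category_shares(train_df.LocationNormalized, top=10)`.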

**Categorical Analysis for the "ContractType" feature:**

In [10]:
from collections import Counter
c = Counter(train_df.ContractType)
c.most_common()

[(nan, 179326), ('full_time', 57538), ('part_time', 7904)]

In [11]:
ContractType_values = c.most_common()

In [12]:
# Check categories percent (%) + Missing data.
for category in ContractType_values:
    percentCategory = (category[1] / len(train_df.index)) * 100
    print(f"The '{category[0]}' category has {category[1]} samples representing {round(percentCategory, 1)}% data.")

The 'nan' category has 179326 samples representing 73.3% data.
The 'full_time' category has 57538 samples representing 23.5% data.
The 'part_time' category has 7904 samples representing 3.2% data.


 - *ContractType feature:*
   - ContractType has a lot of missing data:
     - 179,326 missing values.
     - Representing 73.3% of the data.
   - The *'full_time'* category:
     - Has 57,538 samples.
     - Representing 23.5% of the data.
   - The *'part_time'* category:
     - Has 7,904 samples.
     - Representing 3.2% of the data.

**Categorical Analysis for the "ContractTime" feature:**

In [13]:
from collections import Counter
c = Counter(train_df.ContractTime)
c.most_common()

[('permanent', 151521), (nan, 63905), ('contract', 29342)]

In [14]:
ContractTime_values = c.most_common()

In [15]:
# Check categories percent (%) + Missing data.
for category in ContractTime_values:
    percentCategory = (category[1] / len(train_df.index)) * 100
    print(f"The '{category[0]}' category has {category[1]} samples representing {round(percentCategory, 1)}% data.")

The 'permanent' category has 151521 samples representing 61.9% data.
The 'nan' category has 63905 samples representing 26.1% data.
The 'contract' category has 29342 samples representing 12.0% data.


 - *ContractTime feature:*
   - ContractTime has a lot of missing data:
     - 63,905 missing values.
     - Representing 26.1% of the data.
   - The *'permanent'* category:
     - Has 151,521 samples.
     - Representing 61.9% of the data.
   - The *'contract'* category:
     - Has 29,342 samples.
     - Representing 12.0% of the data.
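
As an alternative to `Counter`, pandas can produce the same counts and percentages directly. A minimal sketch, assuming a small hypothetical Series (`contract_time`) rather than the real column: `value_counts(dropna=False)` keeps missing values as their own row, and `normalize=True` returns fractions instead of counts.

```python
import pandas as pd

# Hypothetical values; None marks a missing entry.
contract_time = pd.Series(
    ["permanent", "permanent", None, "contract", "permanent", None, "permanent", "permanent"],
    name="ContractTime",
)

# Absolute counts, with missing values kept as a separate row.
counts = contract_time.value_counts(dropna=False)

# The same counts expressed as percentages of all rows.
percents = (contract_time.value_counts(dropna=False, normalize=True) * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": percents})
print(summary)
```

Without `dropna=False`, the missing entries would be dropped and the percentages would be computed over the non-missing rows only.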

**Categorical Analysis for the "Category" feature:**

In [16]:
from collections import Counter
c = Counter(train_df.Category)
Category_values = c.most_common()

In [17]:
# Check categories percent (%) + Missing data.
for category in Category_values:
    percentCategory = (category[1] / len(train_df.index)) * 100
    print(f"The '{category[0]}' category has {category[1]} samples representing {round(percentCategory, 1)}% data.")

The 'IT Jobs' category has 38483 samples representing 15.7% data.
The 'Engineering Jobs' category has 25174 samples representing 10.3% data.
The 'Accounting & Finance Jobs' category has 21846 samples representing 8.9% data.
The 'Healthcare & Nursing Jobs' category has 21076 samples representing 8.6% data.
The 'Sales Jobs' category has 17272 samples representing 7.1% data.
The 'Other/General Jobs' category has 17055 samples representing 7.0% data.
The 'Teaching Jobs' category has 12637 samples representing 5.2% data.
The 'Hospitality & Catering Jobs' category has 11351 samples representing 4.6% data.
The 'PR, Advertising & Marketing Jobs' category has 8854 samples representing 3.6% data.
The 'Trade & Construction Jobs' category has 8837 samples representing 3.6% data.
The 'HR & Recruitment Jobs' category has 7713 samples representing 3.2% data.
The 'Admin Jobs' category has 7614 samples representing 3.1% data.
The 'Retail Jobs' category has 6584 samples representing 2.7% data.
The 'Cust

 - *Category feature:*
   - Category has no missing data.
   - The categories, from most to least common, are:
     - *'IT Jobs'* has 38,483 samples, representing 15.7% of the data.
     - *'Engineering Jobs'* has 25,174 samples, representing 10.3% of the data.
     - *'Accounting & Finance Jobs'* has 21,846 samples, representing 8.9% of the data.
     - *'Healthcare & Nursing Jobs'* has 21,076 samples, representing 8.6% of the data.
     - *'Sales Jobs'* has 17,272 samples, representing 7.1% of the data.
     - *'Other/General Jobs'* has 17,055 samples, representing 7.0% of the data.
     - *'Teaching Jobs'* has 12,637 samples, representing 5.2% of the data.
     - *'Hospitality & Catering Jobs'* has 11,351 samples, representing 4.6% of the data.
     - *'PR, Advertising & Marketing Jobs'* has 8,854 samples, representing 3.6% of the data.
     - *'Trade & Construction Jobs'* has 8,837 samples, representing 3.6% of the data.
     - *'HR & Recruitment Jobs'* has 7,713 samples, representing 3.2% of the data.
     - *'Admin Jobs'* has 7,614 samples, representing 3.1% of the data.
     - *'Retail Jobs'* has 6,584 samples, representing 2.7% of the data.
     - *'Customer Services Jobs'* has 6,063 samples, representing 2.5% of the data.
     - *'Legal Jobs'* has 3,939 samples, representing 1.6% of the data.
     - *'Manufacturing Jobs'* has 3,765 samples, representing 1.5% of the data.
     - *'Logistics & Warehouse Jobs'* has 3,633 samples, representing 1.5% of the data.
     - *'Social work Jobs'* has 3,455 samples, representing 1.4% of the data.
     - *'Consultancy Jobs'* has 3,263 samples, representing 1.3% of the data.
     - *'Travel Jobs'* has 3,126 samples, representing 1.3% of the data.
     - *'Scientific & QA Jobs'* has 2,489 samples, representing 1.0% of the data.
     - *'Charity & Voluntary Jobs'* has 2,332 samples, representing 1.0% of the data.
     - *'Energy, Oil & Gas Jobs'* has 2,255 samples, representing 0.9% of the data.
     - *'Creative & Design Jobs'* has 1,605 samples, representing 0.7% of the data.
     - *'Maintenance Jobs'* has 1,542 samples, representing 0.6% of the data.
     - *'Graduate Jobs'* has 1,331 samples, representing 0.5% of the data.
     - *'Property Jobs'* has 1,038 samples, representing 0.4% of the data.
     - *'Domestic help & Cleaning Jobs'* has 291 samples, representing 0.1% of the data.
     - *'Part time Jobs'* has 145 samples, representing 0.1% of the data.

---

## Statistical Analysis

> Finally, let's do a brief **Statistical Analysis** on the dataset.

**Statistical Overview:**

In [18]:
train_df.describe()

| | Id | SalaryNormalized |
|---|---|---|
| count | 244768.0 | 244768.0 |
| mean | 69701420.0 | 34122.577576 |
| std | 3129813.0 | 17640.543124 |
| min | 12612630.0 | 5000.0 |
| 25% | 68695500.0 | 21500.0 |
| 50% | 69937000.0 | 30000.0 |
| 75% | 71626060.0 | 42500.0 |
| max | 72705240.0 | 200000.0 |


First, let's ignore the *"Id"* feature and focus only on *"SalaryNormalized"*:

 - *SalaryNormalized:*
   - The lowest annual salary is 5,000.
   - The highest annual salary is 200,000.
   - The *mean* annual salary is 34,122.58.
   - The *median* annual salary is 30,000:
     - This is the second quartile (Q2), the value that splits the data in half.
     - The *median* is not far from the *mean* (about 4,100 lower).
   - The *Standard Deviation* is 17,640.54:
     - The *Standard Deviation* measures how spread out the salaries are around the *mean*.
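
To put a number on "not far from the mean", the figures reported by `describe()` can be combined directly. The sketch below hard-codes those reported values and computes two standard summaries that are not part of the original notebook: Pearson's second skewness coefficient, 3 * (mean - median) / std, and the Tukey upper fence, Q3 + 1.5 * IQR, a common rule of thumb for flagging high outliers.

```python
# Values copied from the describe() output for SalaryNormalized.
mean = 34122.577576
median = 30000.0
std = 17640.543124
q1 = 21500.0
q3 = 42500.0
maximum = 200000.0

# Gap between mean and median, in absolute terms and in units of std.
gap = mean - median
gap_in_std = gap / std

# Pearson's second skewness coefficient: positive values suggest a right skew.
pearson_skew = 3 * gap / std

# Tukey's rule of thumb: values above Q3 + 1.5 * IQR are candidate outliers.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"mean - median = {gap:.2f} ({gap_in_std:.2f} standard deviations)")
print(f"Pearson skewness = {pearson_skew:.2f}")
print(f"upper fence = {upper_fence:.0f}, max above fence: {maximum > upper_fence}")
```

The mean sits roughly a quarter of a standard deviation above the median, and the maximum of 200,000 lies well above the upper fence of 74,000, which hints at a right-skewed salary distribution with a long upper tail.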

**Getting the statistical "mode":** For numerical columns, `describe()` does not report the "mode" of the data.

In [19]:
train_df.SalaryNormalized.mode()

0    35000
Name: SalaryNormalized, dtype: int64

**NOTE:**  
The *mode* (35,000) is also not far from the *mean*.

**Getting the most common salaries:** Now, let's use the "Counter" class to get the top 10 most common salaries.

In [20]:
from collections import Counter
c = Counter(train_df.SalaryNormalized)
c.most_common(10)

[(35000, 9178),
 (30000, 8319),
 (40000, 7688),
 (45000, 6735),
 (25000, 6309),
 (32500, 6215),
 (37500, 5756),
 (27500, 5544),
 (50000, 5424),
 (42500, 4555)]

**NOTE:**  
 - The most common salary is 35,000, with 9,178 samples.
 - It is also not far from the mean.
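
The counts above also show how concentrated the salaries are around a few round values. A short sketch using only the counts printed by `most_common(10)` and the total number of samples reported by `info()`:

```python
# (salary, count) pairs copied from the most_common(10) output.
top_salaries = [
    (35000, 9178), (30000, 8319), (40000, 7688), (45000, 6735), (25000, 6309),
    (32500, 6215), (37500, 5756), (27500, 5544), (50000, 5424), (42500, 4555),
]
total_samples = 244768  # from train_df.info()

# Share of the single most common salary and of the top 10 combined.
mode_salary, mode_count = top_salaries[0]
mode_share = mode_count / total_samples * 100
top10_count = sum(count for _, count in top_salaries)
top10_share = top10_count / total_samples * 100

print(f"{mode_salary} covers {mode_share:.1f}% of the samples")
print(f"The top 10 salaries cover {top10_share:.1f}% of the samples")
```

Just over a quarter of all ads fall on one of these ten values, each of which is a multiple of 2,500.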

---

# [v1] - Exploratory Data Analysis/EDA (Summary)

 - **The dataset is large:**
   - 244,768 samples and 12 features.
 - **Most features are text ("object" dtype), so we will probably need to preprocess them to extract insights.**
 - **Some features have a lot of missing data:**
   - ContractType, ContractTime, and Company each have more than 10% missing data.
   - ContractType is a critical problem (more than 73% missing data).
 - **Statistics for the "SalaryNormalized" feature:**
   - The lowest annual salary is 5,000.
   - The highest annual salary is 200,000.
   - The **mean** annual salary is 34,122.58.
   - The **median** annual salary is 30,000:
     - This is the second quartile (Q2), the value that splits the data in half.
     - The **median** is not far from the **mean** (about 4,100 lower).
   - The **mode** (most common salary) is 35,000, which is also not far from the mean.
   - The top 10 most common salaries are:
     - 35,000 with 9,178 samples;
     - 30,000 with 8,319 samples;
     - 40,000 with 7,688 samples;
     - 45,000 with 6,735 samples;
     - 25,000 with 6,309 samples;
     - 32,500 with 6,215 samples;
     - 37,500 with 5,756 samples;
     - 27,500 with 5,544 samples;
     - 50,000 with 5,424 samples;
     - 42,500 with 4,555 samples.
   - The **Standard Deviation** is 17,640.54:
     - The **Standard Deviation** measures how spread out the salaries are around the **mean**.
 - **Categorical analysis:**
   - **LocationNormalized feature:**
     - LocationNormalized has no missing data.
     - The 10 most common job ad locations are:
       - 'UK' has 41,093 samples, representing 16.8% of the data.
       - 'London' has 30,522 samples, representing 12.5% of the data.
       - 'South East London' has 11,713 samples, representing 4.8% of the data.
       - 'The City' has 6,678 samples, representing 2.7% of the data.
       - 'Manchester' has 3,516 samples, representing 1.4% of the data.
       - 'Leeds' has 3,401 samples, representing 1.4% of the data.
       - 'Birmingham' has 3,061 samples, representing 1.3% of the data.
       - 'Central London' has 2,607 samples, representing 1.1% of the data.
       - 'West Midlands' has 2,540 samples, representing 1.0% of the data.
       - 'Surrey' has 2,397 samples, representing 1.0% of the data.
   - **ContractType feature:**
     - ContractType has a lot of missing data:
       - 179,326 missing values.
       - Representing 73.3% of the data.
     - The **'full_time'** category:
       - Has 57,538 samples.
       - Representing 23.5% of the data.
     - The **'part_time'** category:
       - Has 7,904 samples.
       - Representing 3.2% of the data.
   - **ContractTime feature:**
     - ContractTime has a lot of missing data:
       - 63,905 missing values.
       - Representing 26.1% of the data.
     - The **'permanent'** category:
       - Has 151,521 samples.
       - Representing 61.9% of the data.
     - The **'contract'** category:
       - Has 29,342 samples.
       - Representing 12.0% of the data.
   - **Category feature:**
     - Category has no missing data.
     - The categories, from most to least common, are:
       - **'IT Jobs'** has 38,483 samples, representing 15.7% of the data.
       - **'Engineering Jobs'** has 25,174 samples, representing 10.3% of the data.
       - **'Accounting & Finance Jobs'** has 21,846 samples, representing 8.9% of the data.
       - **'Healthcare & Nursing Jobs'** has 21,076 samples, representing 8.6% of the data.
       - **'Sales Jobs'** has 17,272 samples, representing 7.1% of the data.
       - **'Other/General Jobs'** has 17,055 samples, representing 7.0% of the data.
       - **'Teaching Jobs'** has 12,637 samples, representing 5.2% of the data.
       - **'Hospitality & Catering Jobs'** has 11,351 samples, representing 4.6% of the data.
       - **'PR, Advertising & Marketing Jobs'** has 8,854 samples, representing 3.6% of the data.
       - **'Trade & Construction Jobs'** has 8,837 samples, representing 3.6% of the data.
       - **'HR & Recruitment Jobs'** has 7,713 samples, representing 3.2% of the data.
       - **'Admin Jobs'** has 7,614 samples, representing 3.1% of the data.
       - **'Retail Jobs'** has 6,584 samples, representing 2.7% of the data.
       - **'Customer Services Jobs'** has 6,063 samples, representing 2.5% of the data.
       - **'Legal Jobs'** has 3,939 samples, representing 1.6% of the data.
       - **'Manufacturing Jobs'** has 3,765 samples, representing 1.5% of the data.
       - **'Logistics & Warehouse Jobs'** has 3,633 samples, representing 1.5% of the data.
       - **'Social work Jobs'** has 3,455 samples, representing 1.4% of the data.
       - **'Consultancy Jobs'** has 3,263 samples, representing 1.3% of the data.
       - **'Travel Jobs'** has 3,126 samples, representing 1.3% of the data.
       - **'Scientific & QA Jobs'** has 2,489 samples, representing 1.0% of the data.
       - **'Charity & Voluntary Jobs'** has 2,332 samples, representing 1.0% of the data.
       - **'Energy, Oil & Gas Jobs'** has 2,255 samples, representing 0.9% of the data.
       - **'Creative & Design Jobs'** has 1,605 samples, representing 0.7% of the data.
       - **'Maintenance Jobs'** has 1,542 samples, representing 0.6% of the data.
       - **'Graduate Jobs'** has 1,331 samples, representing 0.5% of the data.
       - **'Property Jobs'** has 1,038 samples, representing 0.4% of the data.
       - **'Domestic help & Cleaning Jobs'** has 291 samples, representing 0.1% of the data.
       - **'Part time Jobs'** has 145 samples, representing 0.1% of the data.


---

Ro**drigo** **L**eite da **S**ilva - **drigols**