# Exploratory Data Analysis

> The goal of this notebook is to perform an **Exploratory Data Analysis (EDA)**.

No preprocessing or modeling is done here, only:
 - Analysis
 - Hypotheses
 - Observations

---

# Download and Import necessary libraries

In [1]:
# To install the required libraries, uncomment the line below and run this cell.
#!pip install --upgrade -r ../requirements.txt --user

In [2]:
import pandas as pd
import py7zr

---

# Decompress dataset
 - Because the dataset is very large, I chose to download the compressed data **(.7z)**.
 - I also chose to decompress it into a temporary folder **(/tmp)**.

In [3]:
# For Linux users.
# with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
#    archive.extractall(path="/tmp")

In [4]:
# For Windows users.
with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
    archive.extractall(path=r"C:\Windows\Temp")  # raw string: backslashes are kept literally
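
The two cells above differ only in the temporary folder. As a minimal sketch, assuming you are happy to use whatever temporary directory the operating system reports, the standard library can pick the folder for you. The names `ARCHIVE_PATH`, `TMP_DIR`, `CSV_PATH` and `extract_dataset` are hypothetical helpers, not part of the original notebook. Note that on Windows `tempfile.gettempdir()` usually returns the user's temp folder rather than `C:\Windows\Temp`.

```python
import tempfile
from pathlib import Path

# Hypothetical names; the archive location is the one used in this notebook.
ARCHIVE_PATH = Path("../datasets/Train_rev1.7z")
TMP_DIR = Path(tempfile.gettempdir())  # temp folder reported by the OS
CSV_PATH = TMP_DIR / "Train_rev1.csv"  # where the extracted CSV is expected


def extract_dataset(archive_path=ARCHIVE_PATH, target_dir=TMP_DIR):
    """Extract the .7z archive into target_dir and return the expected CSV path."""
    # Imported lazily so the path logic above works even without py7zr installed.
    import py7zr

    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=target_dir)
    return Path(target_dir) / "Train_rev1.csv"
```

With this in place, a single call such as `pd.read_csv(extract_dataset())` would work on both Linux and Windows.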

---

# Dataset Overview
Let's start with a **Dataset Overview**:

In [5]:
# For Linux users.
# df = pd.read_csv("/tmp/Train_rev1.csv")

In [6]:
# For Windows users.
df = pd.read_csv(r"C:\Windows\Temp\Train_rev1.csv")  # raw string: backslashes are kept literally

In [7]:
df.head()

|   | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12612628 | Engineering Systems Analyst | Engineering Systems Analyst Dorking Surrey Sal... | Dorking, Surrey, Surrey | Dorking | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |
| 1 | 12612830 | Stress Engineer Glasgow | Stress Engineer Glasgow Salary **** to **** We... | Glasgow, Scotland, Scotland | Glasgow | NaN | permanent | Gregory Martin International | Engineering Jobs | 25000 - 35000/annum 25-35K | 30000 | cv-library.co.uk |
| 2 | 12612844 | Modelling and simulation analyst | Mathematical Modeller / Simulation Analyst / O... | Hampshire, South East, South East | Hampshire | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 40000/annum 20-40K | 30000 | cv-library.co.uk |
| 3 | 12613049 | Engineering Systems Analyst / Mathematical Mod... | Engineering Systems Analyst / Mathematical Mod... | Surrey, South East, South East | Surrey | NaN | permanent | Gregory Martin International | Engineering Jobs | 25000 - 30000/annum 25K-30K negotiable | 27500 | cv-library.co.uk |
| 4 | 12613647 | Pioneer, Miser Engineering Systems Analyst | Pioneer, Miser Engineering Systems Analyst Do... | Surrey, South East, South East | Surrey | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


 - **We have a big dataset with:**
   - 244,768 samples.
   - 12 features.

---

# Check data types

In [9]:
df.dtypes

Id                     int64
Title                 object
FullDescription       object
LocationRaw           object
LocationNormalized    object
ContractType          object
ContractTime          object
Company               object
Category              object
SalaryRaw             object
SalaryNormalized       int64
SourceName            object
dtype: object

 - **"object" is the most common data type:**
   - These "object" columns hold text (information) about the job vacancies.
   - We will probably need to preprocess the data to extract *insights* from this text.
 - **We have only two numerical features:**
   - **Id:** Job ad identifier.
   - **SalaryNormalized:** Target variable.
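
The split between numerical and text features can also be obtained programmatically. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real dataset, and relies on `DataFrame.select_dtypes`. It selects text columns with `exclude="number"` instead of naming the text dtype, on the assumption that every non-numeric column here is text.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Analyst", "Engineer", "Nurse"],
    "SalaryNormalized": [25000, 30000, 27500],
})

# Columns with a numeric dtype (int64, float64, ...).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else; in this dataset those are the text columns.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

On the real data, the same two calls on `df` would list `Id` and `SalaryNormalized` as numeric and the remaining ten columns as text.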

---

# Check missing data
Here we check missing data by:
 - Numerical quantity;
 - Percent (%).

**Numerical quantity approach:**

In [10]:
quantityMissing = df.isnull().sum()
quantityMissing

Id                         0
Title                      1
FullDescription            0
LocationRaw                0
LocationNormalized         0
ContractType          179326
ContractTime           63905
Company                32430
Category                   0
SalaryRaw                  0
SalaryNormalized           0
SourceName                 1
dtype: int64

**Percent (%) approach:**

In [11]:
percentMissing = (quantityMissing / len(df.index)) * 100
percentMissing

Id                     0.000000
Title                  0.000409
FullDescription        0.000000
LocationRaw            0.000000
LocationNormalized     0.000000
ContractType          73.263662
ContractTime          26.108397
Company               13.249281
Category               0.000000
SalaryRaw              0.000000
SalaryNormalized       0.000000
SourceName             0.000409
dtype: float64

 - The features **ContractType**, **ContractTime** and **Company** each have more than 10% missing data.
 - The feature **ContractType** stands out with more than 73% missing data:
   - That's a critical problem because, as a rule of thumb, a feature with more than 60% missing values contributes almost nothing to a model.
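
The count and percent views can be combined into one table, which makes it easy to flag features above a chosen cut-off. This is a minimal sketch on a made-up frame (`toy`, invented values). The `THRESHOLD` of 60 simply reuses the 60% figure mentioned above and is an assumption, not a fixed rule.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Title": ["Analyst", "Engineer", None, "Nurse"],
    "ContractType": [None, None, None, "full_time"],
    "SalaryNormalized": [25000, 30000, 27500, 40000],
})

# One row per feature: number of missing values and their share of all rows.
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
}).sort_values("percent", ascending=False)

THRESHOLD = 60  # assumed cut-off, taken from the 60% figure mentioned above
high_missing = missing[missing["percent"] > THRESHOLD].index.tolist()

print(missing)
print("Features above the threshold:", high_missing)
```

Replacing `toy` with `df` should reproduce the numbers from the two cells above in a single table.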

---

# Statistical Analysis

> Finally, let's do a brief **Statistical Analysis** on the dataset.

### Statistical Overview
Let's get started with a **Statistical Overview** on the features:

In [12]:
df.describe()

|       | Id | SalaryNormalized |
|---|---|---|
| count | 244768.0 | 244768.0 |
| mean | 69701420.0 | 34122.577576 |
| std | 3129813.0 | 17640.543124 |
| min | 12612630.0 | 5000.0 |
| 25% | 68695500.0 | 21500.0 |
| 50% | 69937000.0 | 30000.0 |
| 75% | 71626060.0 | 42500.0 |
| max | 72705240.0 | 200000.0 |


 - **First, let's ignore the "Id" feature and focus only on "SalaryNormalized".**
 - **SalaryNormalized:**
   - The lowest salary (annual) was 5,000.
   - The highest salary (annual) was 200,000.
   - The salary (annual) **mean** was 34,122.58.
   - The salary (annual) **median** was 30,000:
     - This is the second quartile (Q2), the value below which 50% of the data falls.
     - The **median** is not far from the **mean** (about 4,100 lower), which hints at a mild right skew.
   - The **Standard Deviation** was 17,640.54:
     - The **Standard Deviation** measures how spread out the salaries are around the **mean**.
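
A few derived quantities make these numbers easier to interpret. The sketch below only reuses values copied from the `describe()` output above. The 1.5 × IQR upper fence is Tukey's common convention for flagging unusually high values. It is an illustrative choice, not something the original analysis applies.

```python
# Values copied from the describe() output above.
mean, median, std = 34122.577576, 30000.0, 17640.543124
q1, q3, maximum = 21500.0, 42500.0, 200000.0

# Gap between mean and median: a positive gap suggests a right-skewed distribution.
mean_minus_median = mean - median

# Spread relative to the mean (coefficient of variation).
coefficient_of_variation = std / mean

# Interquartile range and Tukey's upper fence for unusually high salaries.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"mean - median:            {mean_minus_median:.2f}")
print(f"coefficient of variation: {coefficient_of_variation:.3f}")
print(f"IQR:                      {iqr:.0f}")
print(f"upper fence:              {upper_fence:.0f}")
print(f"max above upper fence:    {maximum > upper_fence}")
```

Since the maximum salary lies well above the upper fence, the right tail of the distribution probably deserves a closer look in a later cycle.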

### Getting the mode
The **describe()** function doesn't return the **"mode"** of the data, so we compute it separately:

In [13]:
df.SalaryNormalized.mode()

0    35000
Name: SalaryNormalized, dtype: int64

**NOTE:**  
The **mode** (35,000) is not far from the **mean** (34,122.58).

### Getting the most common salaries
Now, let's use the **"Counter"** class to get the 10 most common salaries:

In [14]:
from collections import Counter
c = Counter(df.SalaryNormalized)
c.most_common(10)

[(35000, 9178),
 (30000, 8319),
 (40000, 7688),
 (45000, 6735),
 (25000, 6309),
 (32500, 6215),
 (37500, 5756),
 (27500, 5544),
 (50000, 5424),
 (42500, 4555)]

**NOTE:**  
 - The most common salary was 35,000, with 9,178 samples.
 - The most common salary is also not far from the mean.
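
The same ranking can be obtained without leaving pandas. As a minimal sketch on a made-up Series (`toy_salaries`, invented values), `Series.value_counts()` returns the counts sorted from most to least frequent, and `normalize=True` gives shares instead of counts.

```python
import pandas as pd

# Toy salaries standing in for df.SalaryNormalized (values are invented).
toy_salaries = pd.Series([35000, 35000, 35000, 30000, 30000, 40000])

# Counts per salary, most frequent first; head(2) keeps the top two.
top = toy_salaries.value_counts().head(2)

# Same ranking expressed as a share of all samples.
top_share = toy_salaries.value_counts(normalize=True).head(2)

print(top)
print(top_share)
```

On the real data, `df.SalaryNormalized.value_counts().head(10)` should list the same ten salaries as `Counter.most_common(10)`.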

---

# EDA: First cycle analysis

> Here are the main findings of the **first analysis cycle**.

 - We have a big dataset with:
   - 244,768 samples and 12 features.
 - We will probably need to preprocess the data to extract insights from the text features.
 - Some features have a lot of missing data:
   - ContractType, ContractTime and Company each have more than 10% missing data.
   - The feature ContractType is a critical problem (more than 73% missing data).
 - Statistics for the "SalaryNormalized" feature:
   - The lowest salary (annual) was 5,000.
   - The highest salary (annual) was 200,000.
   - The salary (annual) **mean** was 34,122.58.
   - The salary (annual) **median** was 30,000:
     - This is the second quartile (Q2), the value below which 50% of the data falls.
     - The **median** is not far from the **mean** (about 4,100 lower).
   - The **mode (most common salary)** was 35,000 (also not far from the mean).
   - The top 10 most common salaries are:
     - 35,000 with 9,178 samples;
     - 30,000 with 8,319 samples;
     - 40,000 with 7,688 samples;
     - 45,000 with 6,735 samples;
     - 25,000 with 6,309 samples;
     - 32,500 with 6,215 samples;
     - 37,500 with 5,756 samples;
     - 27,500 with 5,544 samples;
     - 50,000 with 5,424 samples;
     - 42,500 with 4,555 samples.
   - The **Standard Deviation** was 17,640.54:
     - The **Standard Deviation** measures how spread out the salaries are around the **mean**.

---

Ro**drigo** **L**eite da **S**ilva - **drigols**