# Introduction

A company wants to hire data scientists. Candidates must first pass some courses on data science.



<font color = "pink">
Content:
    
1. [Load and Check Data](#1)
    
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable](#4)
        * [Numerical Variable](#5)    
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
5. [Fill Missing Value](#8)
 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt 
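# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')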
import seaborn as sns
from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a>
# Load and Check Data

In [None]:
test_df = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_test.csv")
train_df = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv")


In [None]:
train_df.columns

In [None]:
train_df.head()    # first 5 rows

In [None]:
train_df.describe()    # summary statistics of the numerical columns

<a id = "2"></a>
# Variable Description
1. enrollee_id: Unique ID for each candidate
1. city: City code
1. city_development_index: Development index of the city (scaled)
1. gender: Gender of the candidate
1. relevent_experience: Relevant experience of the candidate
1. enrolled_university: Type of university enrollment
1. education_level: Education level of the candidate
1. major_discipline: Major discipline of the candidate's education
1. experience: Total experience of the candidate
1. company_size: Number of employees in the employer's company
1. company_type: Type of employer
1. last_new_job: Difference in years between previous job and current job
1. training_hours: Completed training hours
1. target: Job change (0: not looking, 1: looking)

In [None]:
train_df.info()

* int64(2): enrollee_id, training_hours
* object(10): city, gender, relevent_experience, enrolled_university, education_level, major_discipline, experience, company_size, company_type, last_new_job
* float64(2): city_development_index, target
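
The same grouping by dtype can be recovered programmatically. The following is a minimal sketch using `DataFrame.select_dtypes` on a small made-up frame (`toy_df` and its values are hypothetical stand-ins for `train_df`). Note that dtype is only a first approximation: a numeric code such as `target` can still be categorical in meaning.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_1", "city_2", "city_1"],
    "city_development_index": [0.92, 0.78, 0.92],
    "target": [1.0, 0.0, 0.0],
})

# Split column names into numeric and non-numeric dtypes
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```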

<a id = "3"></a>
# Univariate Variable Analysis
* Categorical Variable: city, gender, relevent_experience, enrolled_university, education_level, major_discipline, experience, company_type, last_new_job, target, company_size
* Numerical Variable: enrollee_id, training_hours, city_development_index

<a id = "4"></a>
## Categorical Variable Analysis

For categorical variable analysis, we'll draw bar plots with the matplotlib.pyplot library.

In [None]:
def bar_plot(variable):
    """
    input: variable ex: "gender"
    output: bar plot
    """
    # get feature
    var = train_df[variable]
    
    # count the number of samples in each category
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))

In [None]:
category1 = ["city", "gender", "relevent_experience", "enrolled_university", "education_level", "major_discipline", "experience", "company_type", "last_new_job", "target","company_size"]
for c in category1:
    bar_plot(c)
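
`value_counts()` excludes missing values by default, so bar plots built from it do not show how many entries are missing. The sketch below uses a hypothetical toy Series (not the real data) to show how `dropna=False` counts them as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries (made-up values)
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan, "Other"])

counts_default = gender.value_counts()                # NaN excluded
counts_with_nan = gender.value_counts(dropna=False)   # NaN counted as its own row

# The difference between the two totals is the number of missing entries
n_missing = counts_with_nan.sum() - counts_default.sum()

print(counts_default)
print(counts_with_nan)
print("missing:", n_missing)
```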

<a id = "5"></a>
## Numerical Variable Analysis

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(train_df[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["enrollee_id", "training_hours","city_development_index"]
for n in numericVar:
    plot_hist(n)

<a id = "6"></a>
# Basic Data Analysis
* gender-training_hours
* experience-training_hours  
* enrolled_university-training_hours
* target-training_hours

In [None]:
# gender vs training_hours
train_df[["gender","training_hours"]].groupby(["gender"], as_index = False).mean().sort_values(by="training_hours", ascending= False)

In [None]:
# experience vs training_hours  
train_df[["experience","training_hours"]].groupby(["experience"], as_index = False).mean().sort_values(by="training_hours", ascending= False)

In [None]:
# enrolled_university vs training_hours  
train_df[["enrolled_university","training_hours"]].groupby(["enrolled_university"], as_index = False).mean().sort_values(by="training_hours", ascending= False)

In [None]:
# target vs training_hours  
train_df[["target","training_hours"]].groupby(["target"], as_index = False).mean().sort_values(by="training_hours", ascending= False)
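
The four comparisons above follow one pattern: group by a categorical column, average `training_hours`, and sort. As a sketch only, the pattern could be wrapped in a helper; the name `mean_by` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def mean_by(df, group_col, value_col="training_hours"):
    """Return the mean of value_col per category of group_col, largest first."""
    return (
        df[[group_col, value_col]]
        .groupby(group_col, as_index=False)
        .mean()
        .sort_values(by=value_col, ascending=False)
    )

# Hypothetical toy data (made-up values)
toy_df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Other"],
    "training_hours": [10, 40, 30, 20, 50],
})

result = mean_by(toy_df, "gender")
print(result)
```

With the real data, the equivalent call would be `mean_by(train_df, "gender")`. Rows whose group value is missing are left out, because `groupby` drops NaN keys by default.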

<a id = "7"></a>
# Outlier Detection

In [None]:
def detect_outliers(df, features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # interquartile range
        IQR = Q3 - Q1
        # outlier step: 1.5 times the IQR
        outlier_step = IQR * 1.5
        # indices of rows outside [Q1 - step, Q3 + step] for this feature
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store the indices
        outlier_indices.extend(outlier_list_col)
    
    # count how many features flag each row
    outlier_indices = Counter(outlier_indices)
    # keep only rows flagged as outliers in more than 2 features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
    
    return multiple_outliers

In [None]:
train_df.loc[detect_outliers(train_df, ["city_development_index","training_hours","target"])]
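
For reference, the conventional IQR rule (Tukey's fences) flags values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. The following small sketch applies it to made-up numbers, independent of `train_df`, to show the arithmetic for a single feature.

```python
import numpy as np

# Made-up sample; 95 is an obvious outlier
values = np.array([10, 12, 11, 13, 12, 11, 95])

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
iqr = q3 - q1                    # interquartile range
step = 1.5 * iqr                 # fence distance: 1.5 times the IQR

lower, upper = q1 - step, q3 + step
outliers = values[(values < lower) | (values > upper)]

print("fences:", lower, upper)
print("outliers:", outliers)
```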

In [None]:
train_df.isnull().sum()

<a id = "8"></a>
# Fill Missing Value
 
* enrolled_university has 386 null values

In [None]:
train_df[train_df["enrolled_university"].isnull()]

In [None]:
train_df.boxplot(column = "training_hours", by = "enrolled_university")
plt.show()

In [None]:
# fill with the existing category label (assumed to be "Part time course") so that no new category is created
train_df["enrolled_university"] = train_df["enrolled_university"].fillna("Part time course")

In [None]:
train_df[train_df["enrolled_university"].isnull()]
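
Another common option for a categorical column is to fill missing entries with the most frequent category (the mode). This is shown only as a sketch on a hypothetical toy Series with made-up labels; it is not the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (labels are made up for illustration)
toy = pd.Series(["none", "full_time", np.nan, "none", np.nan])

most_common = toy.mode()[0]        # mode() ignores NaN by default
filled = toy.fillna(most_common)   # replace missing entries with the mode

print("fill value:", most_common)
print("remaining nulls:", filled.isnull().sum())
```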