## <center> Ensemble Modeling
### <center> Problem Statement : Which classification algorithm gives the best accuracy on a given dataset?
### <center> Dataset : 'fake_job_postings.csv'

    

## Importing Libraries


In [None]:
#Import required Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

In [None]:
#Reading csv file of dataset
df = pd.read_csv("../input/real-or-fake-fake-jobposting-prediction/fake_job_postings.csv")
df.head()   #Display the first 5 examples in the dataset

## Exploring dataset

In [None]:
df.columns

In [None]:
# Summary of the data: number of rows, column dtypes and non-null counts
df.info()

In [None]:
df.describe()

In [None]:
df.shape

## Feature Selection

In [None]:
df.columns

In [None]:
df = df[['title', 'location','company_profile', 'requirements', 'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function','salary_range',
       'fraudulent']]

## Check for missing values and outliers

In [None]:
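# Count null (True) and non-null (False) entries per column
# (the top-level pd.value_counts function is deprecated in recent pandas, so call the Series method instead)
df.isna().apply(lambda col: col.value_counts())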

In [None]:
#Check for number of null values
df.isnull().sum()

In [None]:
#Check if any duplicate rows in dataset
df.duplicated().sum()

In [None]:
# Drop the duplicate rows (reassigning avoids in-place modification of a column subset)
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()

In [None]:
# Separate the binary flag columns (plus salary_range) from the text/categorical columns
df_num = df[['telecommuting','has_company_logo','has_questions','fraudulent','salary_range']]
df_cat = df[['title', 'location','company_profile', 'requirements','employment_type',
       'required_experience', 'required_education', 'industry', 'function']]

In [None]:
# Checking for Outliers in numerical data
plt.figure(figsize=[16,8])
sb.boxplot(data = df_num)
plt.show()

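- Columns 'telecommuting', 'has_company_logo' and 'fraudulent' show a few points outside the whiskers. These columns are 0/1 flags, so the flagged points are the less frequent value rather than measurement errors.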
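
Why a box plot marks values of a 0/1 column as outliers can be sketched with the usual 1.5 × IQR rule. This is a minimal illustration on a made-up flag column (`telecommuting_demo` is hypothetical and not taken from the dataset): when most values are 0, both quartiles are 0, so every 1 falls outside the whiskers.

```python
import pandas as pd

# Hypothetical binary flag: 95 zeros and 5 ones (not taken from the dataset)
flag = pd.Series([0] * 95 + [1] * 5, name="telecommuting_demo")

# Quartiles and the 1.5 * IQR whisker limits used by a standard box plot
q1, q3 = flag.quantile(0.25), flag.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are what the box plot draws as outlier points
flagged = flag[(flag < lower) | (flag > upper)]
print(f"IQR = {iqr}, whiskers = [{lower}, {upper}], flagged = {len(flagged)}")
```

Here the flagged rows are simply the minority value of the flag, which is worth keeping in mind before filtering them out.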

## Removing Outliers

In [None]:
# Filter the flagged values out of df_num (this does not modify df)
df_num = df_num[df_num['telecommuting'] < 0.9 ]
df_num = df_num[df_num['fraudulent'] < 0.9 ]
df_num = df_num[df_num['has_company_logo'] > 0.1 ]
df_num

In [None]:
df.isnull().sum()

In [None]:
df.dropna(axis= 0, how= 'any', inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.shape

- Outliers have been filtered out of `df_num`, and rows with missing values have been dropped from `df`.

## Creating Visual methods to analyze data

In [None]:
# Plots to see the distribution of individual categorical features and the target

plt.figure(figsize= (25,20))
plt.subplot(3,3,1)
plt.hist(df.employment_type, color='orange', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nEmployment type')

plt.subplot(3,3,2)
plt.hist(df.required_experience, color='lightblue', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nRequired Experience')

plt.subplot(3,3,3)
plt.hist(df.fraudulent, color='red', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nFraud')


plt.show()

- Full-time jobs are posted far more often than any other type of employment.
- More of the posted jobs require Mid-Senior level experience than any other level.

In [None]:
plt.figure(figsize=(48,20))
sb.set_style("darkgrid")
sb.countplot(x='function',data=df,palette='Set1')

- The largest number of jobs is posted in the IT field, and the smallest number is related to Distribution.
- Jobs in Sales, IT, Marketing, Engineering, Customer Service and Administrative are the most in demand.

## Que 1: Which job title has the most full-time job opportunities, and how many?

In [None]:
# Subset the dataframe to rows with 'Full-time' employment type that are not fraudulent.
df_jobs = df[(df['employment_type'] == 'Full-time') & (df['fraudulent']== 0)]

In [None]:
df_jobs.shape

In [None]:
#Checking the counts of each unique value
df_jobs['title'].value_counts()

In [None]:
df_jobs['title'].value_counts().max()

### Ans 1 : Job title 'Agent-Inbound Sales Position' has the most full-time opportunities: 12.
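
The title and its count can also be read off in one step rather than by scanning the `value_counts()` output. A minimal sketch on made-up titles (the `titles` Series is a hypothetical stand-in for `df_jobs['title']`):

```python
import pandas as pd

# Hypothetical stand-in for df_jobs['title']
titles = pd.Series(
    ["Sales Agent", "Data Analyst", "Sales Agent", "Engineer", "Sales Agent"]
)

counts = titles.value_counts()  # occurrences per title, sorted descending
top_title = counts.idxmax()     # label of the most frequent title
top_count = counts.max()        # how many times it occurs

print(top_title, top_count)
```

The same pattern would apply to `df_industry['industry']` for the second question.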

In [None]:
df.head(1)

## Que 2 : Which industry has the maximum number of fake job postings?

In [None]:
#Only including rows which are fake job postings.
df_industry = df[df['fraudulent']== 1]

In [None]:
df_industry.shape

In [None]:
#Checking each unique value counts of industry.
df_industry['industry'].value_counts()

## Ans 2 : Industry with the maximum number of fake job postings : Oil & energy

**************************

## Balancing dataset

- We have performed Exploratory Data Analysis on the dataset; now we need to check whether the dataset is balanced.
- An unbalanced dataset can bias the model towards the majority class.

In [None]:
df['fraudulent'].value_counts()

- As we can see, the dataset is very imbalanced, so we need to balance it before training our model.
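
The degree of imbalance is easier to judge from proportions than from raw counts. A small sketch on made-up labels (`fraudulent_demo` is hypothetical; the real column would be `df['fraudulent']`):

```python
import pandas as pd

# Hypothetical labels: 90 genuine postings (0) and 10 fake ones (1)
labels = pd.Series([0] * 90 + [1] * 10, name="fraudulent_demo")

# normalize=True returns class shares instead of raw counts
proportions = labels.value_counts(normalize=True)

# Ratio of the majority share to the minority share
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```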

- Separate the fraudulent and non-fraudulent rows into two dataframes:

In [None]:
df['fraudulent'].values

In [None]:
fraud = df[df['fraudulent']== 1]
fraud.shape

In [None]:
not_fraud = df[df['fraudulent']== 0]
not_fraud.shape

- We can oversample the 'fraud' dataframe (sampling with replacement) to get a balanced dataset.
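
Oversampling with replacement can also be sketched without a hard-coded row count by sizing the minority class to the majority class. This is an illustration on a made-up frame (`toy` and its columns are hypothetical) using `sklearn.utils.resample`, not the notebook's own cell:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced frame: 2 fraudulent rows, 10 genuine rows
toy = pd.DataFrame({"feature": range(12), "fraudulent": [1] * 2 + [0] * 10})

minority = toy[toy["fraudulent"] == 1]
majority = toy[toy["fraudulent"] == 0]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([minority_upsampled, majority], ignore_index=True)
print(balanced["fraudulent"].value_counts())
```

Fixing `random_state` is an assumption made here so that the sketch is reproducible.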

In [None]:
fraud = fraud.sample(1403, replace=True)

In [None]:
fraud.shape, not_fraud.shape

###### Now our dataset is balanced:)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([fraud, not_fraud])
df.reset_index()

- Many of the columns are categorical, so we need to convert them to numerical data.
- To do so, we apply label encoding.
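
What label encoding does can be shown on a handful of made-up values (the list below is hypothetical). `LabelEncoder` sorts the distinct values and assigns each an integer code:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical employment types
values = ["Full-time", "Contract", "Part-time", "Full-time"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the sorted distinct values; a value's position is its code
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))

print(list(codes))
print(mapping)
```

The integer codes imply an order that the categories do not really have. scikit-learn documents `LabelEncoder` as intended for target labels, so for feature columns `OrdinalEncoder` or one-hot encoding may be worth considering.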

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
# Encode each categorical (text) column as integer codes; the encoder is refit for every column
categorical_cols = ['title', 'location', 'company_profile', 'requirements', 'employment_type',
                    'required_experience', 'required_education', 'industry', 'function', 'salary_range']
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

In [None]:
df = df.reset_index()
df.head()

### Split dataset into training and testing

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[['index', 'title', 'location', 'company_profile', 'requirements',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'salary_range']].values
# Use a 1-D target array; scikit-learn warns when a column vector is passed as y
Y = df['fraudulent'].values

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [None]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
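
One thing to check with this split: the fraud rows were oversampled before splitting, and the original row `index` is among the features, so copies of the same posting may land in both the training and the test set and inflate the scores. A hedged sketch of the alternative order (split first, then oversample only the training part) on made-up data; all names and values below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 20 fraudulent rows out of 200
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "fraudulent": [1] * 20 + [0] * 180,
})

# 1) Split first, keeping the class ratio the same in both parts
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["fraudulent"], random_state=0
)

# 2) Oversample the minority class inside the training part only
train_fraud = train[train["fraudulent"] == 1]
train_real = train[train["fraudulent"] == 0]
train_fraud_up = resample(
    train_fraud, replace=True, n_samples=len(train_real), random_state=0
)
train_balanced = pd.concat([train_fraud_up, train_real])

# No original row appears in both the balanced training set and the test set
overlap = set(train_balanced.index) & set(test.index)
print(len(overlap))
```

With this order the test set keeps the original class ratio, so accuracy on it reflects how the model would behave on unseen postings.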

********************************************************************************************************************************

- We now apply three classification algorithms and compare their accuracy scores:
 - 1) Logistic Regression
 - 2) K Nearest Neighbors
 - 3) Random Forest
 
- For training these models,
    - Independent variables : X
    - Dependent variable : Y (whether the posted job is fake or not)
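
The comparison described above can be sketched as a single loop over the three models. This illustration uses synthetic data from `make_classification` instead of the job postings, and `max_iter=1000` and the fixed `random_state` values are assumptions made for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded job-posting features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=5, random_state=0),
}

# Fit each model on the same training data and score it on the same test data
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to add or swap a classifier without repeating the train and test cells.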


### 1) Logistic Regression

#### Train the model:

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
LgR = LogisticRegression()

In [None]:
LgR.fit(X_train, Y_train)

#### Test the Model:

In [None]:
Y_pred = LgR.predict(X_test)

In [None]:
Y_test = Y_test.flatten()
Y_pred = Y_pred.flatten()

In [None]:
Y_test.shape, Y_pred.shape

In [None]:
df_lgr = pd.DataFrame({'Y_test': Y_test , 'Y_pred': Y_pred}) 
df_lgr

#### Check Accuracy Score :

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(Y_test, Y_pred)

### Accuracy using Logistic Regression Algorithm : 68%
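
Accuracy is the share of predictions that match the true label. A confusion matrix shows where the mismatches fall, which a single accuracy figure hides. A small sketch on made-up labels (`y_true` and `y_hat` are hypothetical):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = fake posting)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)   # correct predictions / all predictions
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

In this toy case 5 of the 8 predictions are correct, and the matrix shows that two of the three errors are fake postings predicted as genuine.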

************************************

### 2) K Nearest Neighbors

#### Train the Model:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn.fit(X_train,Y_train)

#### Test the Model:

In [None]:
Y_pred = knn.predict(X_test)

In [None]:
Y_test = Y_test.flatten()
Y_pred = Y_pred.flatten()

In [None]:
df_knn = pd.DataFrame({'Y_test': Y_test , 'Y_pred': Y_pred}) 
df_knn

#### Check Accuracy Score :

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(Y_test, Y_pred)

### Accuracy using K Nearest Neighbors Algorithm : 93.7% ≈ 94%

******************************************

### 3) Random Forest Algorithm

#### Train the Model:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=5)

In [None]:
rfc.fit(X_train, Y_train)

#### Test the Model:

In [None]:
Y_pred = rfc.predict(X_test)

In [None]:
Y_test = Y_test.flatten()
Y_pred = Y_pred.flatten()

In [None]:
df_rfc = pd.DataFrame({'Y_test': Y_test , 'Y_pred': Y_pred}) 
df_rfc

#### Check Accuracy Score:

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(Y_test, Y_pred)

### Accuracy using Random Forest Classification Algorithm : 99.8%

##### --> Random Forest has the highest accuracy score of the three algorithms, so it is the best suited to this dataset.
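
A single train/test split gives one accuracy figure per model, and that figure can vary with how the rows happen to be split. One way to check how stable a score is would be k-fold cross-validation. A minimal sketch on synthetic data (the data, `cv=5` and `random_state` values are assumptions made for the sketch):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded job-posting features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

forest = RandomForestClassifier(n_estimators=5, random_state=1)

# Accuracy on each of 5 folds; the spread indicates how stable the score is
fold_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```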