# Introduction
Hello everyone, welcome to this kernel. In this kernel I am going to classify job postings as real or fake. The dataset is small, so you could handle it with traditional approaches (BoW, TF-IDF), but in this kernel I'll use word embeddings and RNNs.

# Table of Contents
1. Data Preprocessing
1. Building Model
1. Training Model
1. Testing Model
1. Conclusion

In [None]:
import numpy as np
import pandas as pd
import time

import re
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = pd.read_csv('../input/real-or-fake-fake-jobposting-prediction/fake_job_postings.csv')

# Data Preprocessing
In this section I am going to prepare the dataset so that it can be fed to our neural network. Before starting, let's check the dataframe and the class distribution.

In [None]:
data.head()

In [None]:
data.info()

* As you can see, the dataset has many features. We'll use only the text columns company profile, description, requirements and benefits, plus the fraudulent label.

In [None]:
import warnings
warnings.filterwarnings('ignore')

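# pass the column by keyword; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x=data["fraudulent"])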
plt.show()

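* As you can see, most of the dataset is 0 (non-fraudulent). The classes are heavily imbalanced, which makes the task harder: a model can reach high accuracy just by predicting 0 most of the time.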
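
To put a number on the imbalance, the class shares can be printed next to the plot. This is a minimal sketch that uses a small made-up label list in place of `data["fraudulent"]`; the counts are illustrative, not taken from the dataset.

```python
from collections import Counter

# Hypothetical labels standing in for data["fraudulent"]; 0 = real, 1 = fraudulent.
labels = [0] * 19 + [1]

counts = Counter(labels)
total = sum(counts.values())
shares = {cls: n / total for cls, n in counts.items()}

for cls in sorted(shares):
    print(f"class {cls}: {counts[cls]} postings ({shares[cls]:.1%})")

# A model that always predicts the majority class already reaches this accuracy,
# which is why accuracy alone is a weak metric on imbalanced data.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

With the real data, `data["fraudulent"].value_counts(normalize=True)` gives the same shares.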

### Part 1: Concatenating Text Parts
First, we'll keep only the text columns we need and then concatenate them.

In [None]:
x = data.loc[:,["company_profile","description","requirements","benefits"]]
y = data["fraudulent"]


* As we saw in the info table, there are NaN values in these columns. We'll fill them with a single space.

In [None]:
x = x.fillna(" ")  # returns a new frame instead of modifying a slice of `data` in place
x.isnull().sum()

* Now let's concatenate texts.

In [None]:
# Join the four text columns with single spaces, then trim the ends.
concat_data = (
    x["company_profile"] + " "
    + x["description"] + " "
    + x["requirements"] + " "
    + x["benefits"]
).str.strip().tolist()

   

* Now let's check our data

In [None]:
concat_data[0]

* There is a lot of text here; let's move on to the next step.

### Part 2: Cleaning Texts Using Regular Expressions
As you can see, the texts contain many characters we don't need, such as punctuation marks and digits. In this part we'll clean the texts using regular expressions.

In [None]:
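pattern = r"[^a-zA-Z]"  # anything that is not an ASCII letter
cleanedTexts = []
for text in concat_data:
    text = re.sub(pattern," ",text)    # replace every non-letter with a space
    cleanedTexts.append(text.lower())  # lowercase so "Job" and "job" map to the same word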


In [None]:
cleanedTexts[0]
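
To see what the pattern does on a single string, here is a small self-contained sketch. The sample sentence is made up for illustration; the pattern and the lowercasing follow the loop above.

```python
import re

pattern = r"[^a-zA-Z]"  # same character class as above: anything that is not an ASCII letter

sample = "We're hiring! 3+ years of C++ (full-time)."  # made-up example text
cleaned = re.sub(pattern, " ", sample).lower()
print(repr(cleaned))

# Runs of spaces are left behind; splitting on whitespace shows the surviving words.
words = cleaned.split()
print(words)
```

Note that digits and symbols are removed as well, so a token such as `C++` collapses to `c`.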

### Part 3: Tokenizing and Padding
In natural language, meaning is carried by words rather than by individual letters: the letters *h* and *i* mean nothing on their own, but *hi* is a greeting. A neural network works with numbers, so we give each word an integer ID. If 1 stands for *hi*, we can use 1 wherever *hi* appears.

In this part we'll convert words into integers and texts into sequences of integers.
Deep learning models generally expect inputs with a fixed shape, but texts have different lengths: one job posting might have 100 words and another 102. There are several ways to solve this; in this kernel we'll use **padding**.

Padding adds zeros to the sequences that are too short (and `pad_sequences` cuts the ones that are too long), so all sequences end up with the same length.
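
By default, Keras `pad_sequences` pads with zeros on the left and trims longer sequences from the front (`padding="pre"`, `truncating="pre"`, `value=0`). A rough sketch of that behaviour in plain NumPy follows; `pad_pre` and the toy sequences are hypothetical names used only for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # an empty sequence stays all padding
        trunc = seq[-maxlen:]             # drop tokens from the front if too long
        out[row, -len(trunc):] = trunc    # right-align; padding stays on the left
    return out

toy = [[5, 8], [3, 1, 4, 1, 5, 9]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

The short sequence gains two leading zeros, and the long one keeps only its last four tokens.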

In [None]:
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(cleanedTexts)

x_tokens = tokenizer.texts_to_sequences(cleanedTexts)

In [None]:
print(x_tokens[0])
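
The idea behind the word-to-integer mapping can be sketched without Keras. The toy corpus and the `build_index` helper below are hypothetical. Like the Keras `Tokenizer`, the sketch gives lower indices to more frequent words and starts at 1, which leaves 0 free for padding.

```python
from collections import Counter

def build_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; ties keep first-seen order because sorted() is stable.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

corpus = ["apply now for this job", "this job is remote", "apply for this"]
word_index = build_index(corpus)
sequences = [[word_index[w] for w in text.split()] for text in corpus]

print(word_index)
print(sequences)
```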

* Now let's compute the length of every sequence and take the third quartile (Q3) of those lengths as the padding length.

In [None]:
seq_lens = [len(seq) for seq in x_tokens]
q3 = np.quantile(seq_lens,.75)
print(q3)

* All sequences will be padded or truncated to length 502.
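
Why the third quartile rather than the maximum length? A small sketch with made-up lengths: one very long posting would force every sequence to be padded up to its length, while Q3 ignores it at the cost of truncating the longest quarter of the texts.

```python
import numpy as np

toy_lens = np.array([2, 4, 6, 8, 100])  # made-up token counts; 100 is an outlier

q3_toy = np.quantile(toy_lens, 0.75)
print(q3_toy, toy_lens.max())

# Share of texts that fit without truncation when Q3 is the cut-off.
fits_q3 = np.mean(toy_lens <= q3_toy)
# Number of padding cells needed if the maximum length were the cut-off instead.
pad_cells_at_max = int((toy_lens.max() - toy_lens).sum())
print(f"fit at Q3: {fits_q3:.0%}, padding cells at max length: {pad_cells_at_max}")
```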

In [None]:
x_tokens_pad = np.asarray(pad_sequences(x_tokens,maxlen=int(q3)))

In [None]:
x_tokens_pad.shape

### Part 4: Train Test Splitting
In this section we'll split the dataset into training and test sets, so that the model is evaluated on data it has never seen.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x_tokens_pad,y,test_size=0.2,random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
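
Because the classes are imbalanced, a plain random split can leave the test set with a different share of fraudulent postings than the training set. One option, not used in this kernel, is the `stratify` argument of `train_test_split`. A sketch on made-up data (`X_toy` and `y_toy` are stand-ins, not the real arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for x_tokens_pad and y: 100 samples, 10% positives.
X_toy = np.arange(100).reshape(100, 1)
y_toy = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Compare the positive share in the two splits.
print(y_tr.mean(), y_te.mean())
```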

# Building Model
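In this section I am going to build the model using the Keras API of TensorFlow. I'll use the GRU (Gated Recurrent Unit), an improved variant of the RNN. We don't use SimpleRNN because it suffers from the *vanishing gradient* problem: when gradients are backpropagated through many time steps they shrink towards zero, so the network struggles to learn from the early parts of long sequences.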
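
For intuition, one GRU time step can be written out in NumPy. This is a simplified sketch, not the Keras implementation: biases are omitted, the weights are random, and the names (`gru_step`, `Wz`, `Uz`, and so on) are made up. The update gate `z` decides how much of the previous state is carried over unchanged, which gives gradients a more direct path through time than in a simple RNN.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU step: x is the input vector, h the previous hidden state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return z * h + (1.0 - z) * h_cand                  # blend old state and candidate

rng = np.random.default_rng(0)
in_dim, hid_dim = 4, 3
# W* matrices act on the input, U* matrices act on the hidden state.
params = {
    name: rng.normal(scale=0.5, size=(hid_dim, in_dim if name[0] == "W" else hid_dim))
    for name in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]
}

h = np.zeros(hid_dim)
for x in rng.normal(size=(5, in_dim)):  # a made-up sequence of 5 input vectors
    h = gru_step(x, h, params)
print(h)
```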


In [None]:
VOCAB_SIZE = 10000 + 1
VEC_SIZE = 100
TOKEN_SIZE = int(q3)


In [None]:
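# layers.GRU replaces the deprecated CuDNNGRU; with default arguments
# TensorFlow 2 uses the fast cuDNN kernel automatically when a GPU is available.
model = keras.Sequential()
model.add(layers.Embedding(input_dim=VOCAB_SIZE,
                           output_dim=VEC_SIZE,
                           input_length=TOKEN_SIZE
                          ))

# Three stacked GRU layers; all but the last return the full sequence
model.add(layers.GRU(512,return_sequences=True))
model.add(layers.GRU(1024,return_sequences=True))
model.add(layers.GRU(2048))
# Single sigmoid unit: probability that the posting is fraudulent
model.add(layers.Dense(1,activation="sigmoid"))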

model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])


In [None]:
model.summary()

# Training Model
In this section I am going to train the model on the prepared dataset.

In [None]:
hist = model.fit(x_train,y_train,validation_split=0.2,epochs=2)

# Testing Model
In this section we'll test the model on the held-out test set.

In [None]:
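y_test = np.asarray(y_test)
# predict_classes was removed from recent TensorFlow releases;
# threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(x_test) > 0.5).astype("int32").ravel()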

print("Accuracy score of model is {}%".format(accuracy_score(y_pred=y_pred,y_true=y_test)*100))

plt.subplots(figsize=(4,4))
conf_matrix = confusion_matrix(y_pred=y_pred,y_true=y_test)
sns.heatmap(conf_matrix,annot=True,fmt="d",linewidths=1.5)  # counts are integers
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
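
Since the classes are imbalanced, accuracy can look high even when few fraudulent postings are caught. Precision and recall for the fraudulent class can be read off the confusion matrix. A sketch with a made-up matrix (the numbers are illustrative, not results from this model), using the scikit-learn layout where rows are actual labels and columns are predicted labels:

```python
import numpy as np

# Made-up confusion matrix in scikit-learn layout: [[TN, FP], [FN, TP]].
toy_cm = np.array([[90, 2],
                   [5, 3]])

tn, fp, fn, tp = toy_cm.ravel()
accuracy = (tp + tn) / toy_cm.sum()
precision = tp / (tp + fp)  # of the postings flagged as fake, how many really are
recall = tp / (tp + fn)     # of the fake postings, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On the real predictions, `sklearn.metrics.classification_report(y_test, y_pred)` reports the same quantities per class.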


# Conclusion
Thanks for your attention. If you have any questions, feel free to ask in the comment section. And if you liked the kernel, I'd be glad if you upvoted it :)
