# Problem description

The problem of interest is predicting the apply rate. Imagine a user who visits a website and performs a job search. From the displayed results, the user clicks on the listings she is interested in and, after reading a job description, may click the apply button there to land on an application page. The apply rate is defined as the fraction of job description views that lead to an apply, and the goal is to predict this metric using the dataset described in the following section.
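To make the definition concrete, here is a minimal sketch that computes the apply rate on a toy table. The values are made up for illustration and are not taken from the dataset; only the column name `apply` comes from the data described below.

```python
import pandas as pd

# Toy data (made up for illustration): one row per job-description view,
# with apply = 1 if the user went on to click the apply button.
views = pd.DataFrame({"apply": [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]})

# Apply rate = number of applies / number of job-description views.
apply_rate = views["apply"].sum() / len(views)
print(f"Apply rate: {apply_rate:.2f}")
```

Because `apply` is a 0/1 column, the same number is also its mean, i.e. `views["apply"].mean()`.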

### Data Collection

In [1]:
# import library

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('data/Apply_Rate_2019.csv')

In [None]:
df.head()

Each row in the dataset corresponds to a user’s view of a job listing. It has 10 columns as described below.

1. title proximity tfidf: Measures the closeness of query and job title.
2. description proximity tfidf: Measures the closeness of query and job description.
3. main query tfidf: A score related to user query closeness to job title and job description.
4. query jl score: Measures the popularity of query and job listing pair.
5. query title score: Measures the popularity of query and job title pair.
6. city match: Indicates whether the job listing matches the user's (or user-specified) location.
7. job age days: Age of the job listing, in days since it was posted.
8. apply: Indicates if the user has applied for this job listing.
9. search date pacific: Date of the activity.
10. class id: Class ID of the job title clicked.

There are two parts to the problem.

1. Use only the first 7 columns as features to predict whether a user applies to a job listing.

2. Add the last column (“class id”) to the feature set and check whether the classification performance improves.

### Exploratory Data Analysis

In [None]:
# Check the total number of observations in the dataset

print('Total number of observations in the dataset:', df.shape[0])

In [None]:
df.info()

**Observation:**

1. title_proximity_tfidf, description_proximity_tfidf and city_match contain null values.
2. There are 7 float, 2 integer and 1 object type columns.


In [None]:
# Calculate statistics

df.drop(['apply'],axis=1).describe()

**Observation:**

1. For almost all predictors, there is a large gap between the 75th percentile and the maximum value.
2. The median of ‘title_proximity_tfidf’, ‘description_proximity_tfidf’, ‘main_query_tfidf’, ‘query_jl_score’, ‘query_title_score’ and ‘job_age_days’ is lower than the mean, which points to right-skewed distributions.
3. Together, observations 1 and 2 suggest that there are a lot of outliers in the data.
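The reasoning above can be quantified. The sketch below uses the common 1.5 × IQR rule (an assumption on my part, the notebook itself does not fix an outlier definition) and runs on a synthetic right-skewed sample rather than on the real data. `iqr_outlier_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Fraction of non-null values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(outside.mean())

# Synthetic right-skewed sample standing in for a feature such as job_age_days.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=5.0, size=10_000))

print("mean > median:", sample.mean() > sample.median())
print("skewness:", round(sample.skew(), 2))
print("share of IQR outliers:", round(iqr_outlier_share(sample), 3))
```

On the real data the helper could be applied per column, for example `df[feature_columns].apply(iqr_outlier_share)`, where `feature_columns` is a list of the numeric predictors.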

In [None]:
# Let's check the class distribution: rows with and without an apply

# Series.value_counts replaces the deprecated top-level pd.value_counts
count_classes = df['apply'].value_counts().sort_index()
count_classes.plot(kind = 'bar')

plt.title("Apply Rate")
plt.xticks(range(2))
plt.xlabel("Class")
plt.ylabel("Frequency");

print('Number of rows without an apply:', count_classes[0])
print('Number of rows with an apply:', count_classes[1])
# Multiply by 100 so that the printed value really is a percentage
print('Applies as a percentage of non-applies:', 100 * count_classes[1] / count_classes[0], '%')

**Observation:**

The data is imbalanced, so we might have to use resampling techniques (undersampling, or oversampling such as SMOTE) and evaluate with metrics suited to imbalanced data, such as ROC AUC or AUPRC. Let's explore further to decide which technique to use. Note: AUC is already prescribed as the metric for this dataset.
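As an illustration of one of the resampling options, here is a minimal sketch of random undersampling with plain pandas. It runs on synthetic data with a made-up class ratio, and it is only one possible approach, not necessarily the one used later. In practice, resampling should be applied to the training split only, so that the evaluation data keeps the original class balance.

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced data: roughly 10% positives (a made-up ratio).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1_000),
    "apply": (rng.random(1_000) < 0.1).astype(int),
})

# Random undersampling: keep every minority row and an equally sized
# random sample of the majority class, then shuffle the result.
minority = toy[toy["apply"] == 1]
majority = toy[toy["apply"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)

print(toy["apply"].value_counts())
print(balanced["apply"].value_counts())
```

Undersampling discards majority rows, which is cheap on a large dataset but throws information away; oversampling keeps all rows at the cost of duplicated or synthetic minority examples.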

In [None]:
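# Let's check the correlation between the numeric features
# (numeric_only=True skips non-numeric columns such as the date; requires pandas >= 1.5)

sns.heatmap(df.corr(numeric_only=True))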

**Observation:**

1. title_proximity_tfidf and main_query_tfidf are correlated, with a coefficient of around 0.7.
2. The other features are not highly correlated.
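Reading exact values off a heatmap is error-prone, so it can help to list the strongly correlated pairs explicitly. The sketch below does this on synthetic data; `top_correlated_pairs` is a hypothetical helper and the 0.5 threshold is an arbitrary choice for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.5):
    """Return (col_a, col_b, r) tuples with |Pearson r| above threshold, strongest first."""
    corr = frame.corr(numeric_only=True)
    pairs = [
        (left, right, float(corr.loc[left, right]))
        for left, right in combinations(corr.columns, 2)
        if abs(corr.loc[left, right]) > threshold
    ]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)

# Synthetic example: b is built from a, c is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.5 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
print(top_correlated_pairs(toy))
```

Calling the helper on `df` would list the same pairs that stand out in the heatmap, together with their coefficients.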

In [None]:
# Check the outliers

l = ['title_proximity_tfidf', 'description_proximity_tfidf',
       'main_query_tfidf', 'query_jl_score', 'query_title_score',
       'city_match', 'job_age_days']
sns.set_style('whitegrid')
# One row of vertical box plots, one panel per feature
fig, axes = plt.subplots(1, len(l), figsize=(3*len(l), 5))
for ax, feature in zip(axes, l):
    sns.boxplot(y=df[feature], color='green', ax=ax)
plt.tight_layout()

**Observation:**

As we can see, there are a lot of outliers in the data.

In [None]:
# Check the distribution

# To compare the shape and skewness of each feature between the two classes, plot its distribution per class.
# A kernel density estimate (KDE) is a useful tool for showing the shape of a distribution.
# sns.histplot (seaborn >= 0.11) replaces the deprecated sns.distplot.

features = [col for col in df.columns[:-2] if col != 'apply']  # skip the target itself
for feature in features:
    ax = plt.subplot()
    sns.histplot(df.loc[df['apply'] == 1, feature], bins=50, kde=True, stat='density', label='Applied', ax=ax)
    sns.histplot(df.loc[df['apply'] == 0, feature], bins=50, kde=True, stat='density', label='Did not apply', ax=ax)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(feature))
    plt.legend(loc='best')
    plt.show()

**Observation:**

For all features, the distributions of the apply and non-apply classes are almost identical, which suggests that no single feature separates the two classes on its own.
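A numeric summary can back up what the plots show. The sketch below compares per-class quartiles on synthetic data in which the feature is drawn from the same distribution for both classes by construction; the column name `feature` is a placeholder.

```python
import numpy as np
import pandas as pd

# Synthetic data: the feature is drawn from one distribution regardless of class.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "feature": rng.exponential(scale=2.0, size=2_000),
    "apply": rng.integers(0, 2, size=2_000),
})

# Per-class quartiles: near-identical rows mean the feature alone
# does little to separate the classes.
summary = toy.groupby("apply")["feature"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary.round(2))
```

The same `groupby("apply")` call on a real predictor gives one row per class, which makes small shifts between the classes easier to spot than in overlaid histograms.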

### Data Cleaning

In [None]:
# First, let's drop duplicate entries

print(df.shape)
df = df.drop_duplicates(keep = 'first')
df.shape

In [None]:
# Check number of missing values in every columns
df.isnull().sum()

In [None]:
# Let's check the value counts for the three columns with missing values
df['title_proximity_tfidf'].value_counts().head()

In [None]:
df['description_proximity_tfidf'].value_counts().head()

In [None]:
df['city_match'].value_counts().head()

**Observation:**

In the first two columns the most frequent value is zero, so imputing 0 for their missing values is a safe option. For the city_match column, let's compare the percentage of apply and non-apply rows before and after removing the NaN values. If the percentage stays the same, we can conclude that it is safe to remove the rows that have NaN in city_match.
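The before/after comparison described above can be sketched as follows. The toy values are made up, and `apply_rate_with_and_without_nan` is a hypothetical helper; in the notebook it would be called on `df` before any rows are dropped.

```python
import numpy as np
import pandas as pd

def apply_rate_with_and_without_nan(frame: pd.DataFrame, column: str, target: str = "apply"):
    """Return (apply rate on all rows, apply rate after dropping rows where column is NaN)."""
    rate_all = frame[target].mean()
    rate_kept = frame.dropna(subset=[column])[target].mean()
    return float(rate_all), float(rate_kept)

# Toy frame with made-up values.
toy = pd.DataFrame({
    "city_match": [1.0, 0.0, np.nan, 1.0, np.nan, 0.0, 1.0, 0.0],
    "apply":      [1,   0,   0,      0,   1,      0,   0,   0],
})
rate_all, rate_kept = apply_rate_with_and_without_nan(toy, "city_match")
print(f"all rows: {rate_all:.3f}, after dropping NaN: {rate_kept:.3f}")
```

If the two rates are close on the real data, dropping the NaN rows does not noticeably shift the class balance; a large gap, as in this toy frame, would argue for imputing instead.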

In [None]:
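# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about
df['title_proximity_tfidf'] = df['title_proximity_tfidf'].fillna(0)
df['description_proximity_tfidf'] = df['description_proximity_tfidf'].fillna(0)
df = df.dropna(subset=['city_match'])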

Note: I will not remove the outliers, since they may carry important information that helps distinguish apply from non-apply cases.

In [None]:
df.info()

In [None]:
# The correlation heatmap showed that title_proximity_tfidf and main_query_tfidf are quite correlated,
# so let's merge them into a single feature by multiplying them

df['main_title_tfidf'] = df['title_proximity_tfidf']*df['main_query_tfidf']

In [None]:
df = df.drop(['title_proximity_tfidf','main_query_tfidf'], axis=1)

In [None]:
df.head()