# Ad Click Prediction

**American Express hosted a machine learning hackathon (https://datahack.analyticsvidhya.com/contest/american-express-amexpert-2018/). We are provided with six days of clickstream data, from 2 July 2017 to 7 July 2017. The task is to predict, from this data, whether a session will result in a click.**

**Train Data**


session_id - Unique ID for a session

DateTime - Timestamp

user_id - Unique ID for the user

product - Product ID

campaign_id - Unique ID for the ad campaign

webpage_id - ID of the webpage on which the ad is displayed

product_category_1 - Product category 1 (ordered)

product_category_2 - Product category 2

user_group_id - Customer segmentation ID

gender - Gender of the user

age_level - Age level of the user

user_depth - Interaction level of the user with the web platform (1 - low, 2 - medium, 3 - high)

city_development_index - Scaled development index of the residence city

var_1 - Anonymised session feature

is_click - 0 - no click, 1 - click

**Historical User logs**

DateTime - Timestamp

user_id - Unique ID for the user

product - Product ID

action - view/interest (view - viewed the product page, interest - registered interest in the product)

In [None]:
## importing libraries ##
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV , train_test_split
from tqdm import tqdm_notebook
import warnings
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import gc
import featuretools as ft
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
data = pd.read_csv('../input/train.csv', parse_dates= ['DateTime'])

In [None]:
test = pd.read_csv('../input/test.csv', parse_dates = ['DateTime'])

In [None]:
hist = pd.read_csv('../input/historical_user_logs.csv', parse_dates= ['DateTime'])

In [None]:
data.head()

In [None]:
hist.head()

In [None]:
data.info()

In [None]:
sns.heatmap(data.isnull())
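
The heatmap gives a visual impression of where values are missing; a numeric summary per column is often easier to act on. A minimal sketch on a small hypothetical frame follows (the values are invented for illustration; only the column names come from the data dictionary above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; all values are made up.
toy = pd.DataFrame({
    "product_category_2": [np.nan, np.nan, 82527.0, np.nan],
    "gender": ["Male", None, "Female", "Male"],
    "is_click": [0, 1, 0, 0],
})

# Share of missing values per column, as a percentage, largest first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real `data` frame, the same expression would show which columns are mostly empty and which have only a few gaps.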

### Data Imputation

In [None]:
# data imputation
data = data.drop('product_category_2', axis = 1) # dropping the column
# for the rest of the columns with missing values, impute using forward fill
# (.ffill() replaces fillna(method = 'ffill'), which is deprecated in recent pandas)
data['city_development_index'] = data['city_development_index'].ffill()
data['gender'] = data['gender'].ffill()
data['user_group_id'] = data['user_group_id'].ffill()
data['age_level'] = data['age_level'].ffill()
data['user_depth'] = data['user_depth'].ffill()
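
Forward fill copies the most recent non-missing value from an earlier row, so the result depends on row order, and a missing value in the very first row stays missing. A small sketch on made-up values illustrates both points:

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks a missing entry.
s = pd.Series([np.nan, 2.0, np.nan, np.nan, 3.0], name="user_depth")

# Each gap takes the last value seen above it; the leading gap has
# nothing above it and therefore remains NaN.
filled = s.ffill()
print(filled.tolist())
```

Whether any column in the real data starts with a missing value is not shown in this notebook; the `data.info()` call in the next cell would reveal any remaining gaps.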

In [None]:
data.info()

### Data Visualization

In [None]:
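# total clicks per timestamp; the result is indexed by DateTime
day = data.groupby('DateTime')['is_click'].sum()
# aggregate to hourly totals ('h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas)
day = day.resample('h').sum()
plt.figure(figsize=(20,5))
day.plot(kind='bar',grid = None)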

#### Visualizing the trend in clicks at hourly granularity over the full six days.
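
The hourly aggregation relies on the series being indexed by timestamps. A minimal sketch with invented timestamps walks through the same two steps, summing clicks per timestamp and then per hour:

```python
import pandas as pd

# Invented sessions; only the column names match the training data.
toy = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2017-07-02 00:05", "2017-07-02 00:40",
        "2017-07-02 01:10", "2017-07-02 03:30",
    ]),
    "is_click": [1, 0, 1, 1],
})

# Step 1: clicks per timestamp (the index becomes a DatetimeIndex).
per_timestamp = toy.groupby("DateTime")["is_click"].sum()

# Step 2: clicks per hour; hours without any session get a total of 0.
hourly = per_timestamp.resample("h").sum()
print(hourly)
```

Note that empty hours appear as zeros rather than being dropped, which keeps the bar plot's time axis evenly spaced.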

In [None]:
part_day = day.loc[slice('2017-07-02','2017-07-03')]
plt.figure(figsize=(20,5))
part_day.plot(kind='bar',grid = None)

#### The same hourly plot restricted to the first two days (2 and 3 July 2017).

In [None]:
data1 = data.reset_index()
data1['weekday'] = data1['DateTime'].dt.day_name()
byday  = pd.DataFrame(data1.groupby('weekday')['is_click'].sum())
byday = byday.reset_index()
plt.figure(figsize=(20,5))
sns.barplot(data = byday , x= 'weekday', y = 'is_click')

#### Visualizing user behavior by weekday. It seems that most of the clicks occur on Monday and Sunday.
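
Because the data covers only 2 July to 7 July 2017, each weekday corresponds to a single calendar date, so the weekday totals are really per-date totals. A quick check of the weekday names in that range:

```python
import pandas as pd

# The six calendar days covered by the training data.
days = pd.date_range("2017-07-02", "2017-07-07", freq="D")

# Map each date to its weekday name.
weekday_names = days.day_name().tolist()
print(weekday_names)
```

Saturday does not occur in this range, which is worth keeping in mind before reading the plot as a general weekly pattern.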

In [None]:
user = data.groupby(['gender','product'])['is_click'].sum()
user = pd.DataFrame(user.reset_index())
plt.figure(figsize=(20,5))
sns.barplot(data = user, x= 'product', y = 'is_click', hue = 'gender')

#### Visualizing data for different products for male and female user groups.

In [None]:
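n_data = data.reset_index()
# total clicks for every (campaign, product) pair
campaign= pd.DataFrame(n_data.groupby(['campaign_id','product'])['is_click'].sum())
campaign= campaign.reset_index()
# for each product keep the row of its best campaign; a column-wise .max() would
# pair the largest campaign_id with the largest click count even if they come from different rows
campaign= campaign.loc[campaign.groupby('product')['is_click'].idxmax(), ['product','campaign_id','is_click']]
campaign= campaign.sort_values('is_click',ascending = False).reset_index(drop = True)
campaign.columns = ['product', 'campaign_id', 'max click in any campaign']
plt.figure(figsize=(15,5))
sns.barplot(y= 'product', x= 'max click in any campaign', data = campaign, orient='h')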

#### Bar plot showing, for each product, the highest number of clicks obtained from a single campaign.
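
Finding the single best campaign for each product means selecting a whole row per group rather than taking the maximum of each column separately. A minimal sketch on invented numbers (the campaign IDs and click counts are hypothetical) shows the difference:

```python
import pandas as pd

# Invented click totals per (campaign, product) pair.
clicks = pd.DataFrame({
    "campaign_id": [10, 20, 10, 20],
    "product": ["A", "A", "B", "B"],
    "is_click": [50, 5, 3, 40],
})

# Row label of the highest click count within each product.
best_rows = clicks.groupby("product")["is_click"].idxmax()
best = clicks.loc[best_rows, ["product", "campaign_id", "is_click"]].reset_index(drop=True)
print(best)

# Column-wise maxima mix values from different rows: for product A this
# reports campaign 20 next to 50 clicks, although campaign 10 earned them.
colwise = clicks.groupby("product")[["campaign_id", "is_click"]].max()
print(colwise)
```

The click counts agree in both results, so a plot of the counts alone looks the same; only the campaign attribution differs.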

In [None]:
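n_data = data.reset_index()
# total clicks for every (campaign, product) pair
campaign= pd.DataFrame(n_data.groupby(['campaign_id','product'])['is_click'].sum())
campaign= campaign.reset_index()
# for each campaign keep the row of the product with the most clicks
# (a column-wise .max() would return the alphabetically last product instead)
campaign= campaign.loc[campaign.groupby('campaign_id')['is_click'].idxmax()].set_index('campaign_id')
campaign.sort_values('is_click',ascending = False)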

#### Table showing, for each campaign, the most successful product and its number of clicks.

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(x= 'user_group_id', hue= 'gender', data = data)

#### This visualization shows that user groups 0-6 are all male and user groups 7-12 are all female.
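
A claim like this can also be checked numerically by counting how many distinct genders occur in each user group. A sketch on made-up rows (the group IDs and genders below are illustrative, not taken from the data):

```python
import pandas as pd

# Invented rows in which every group contains a single gender.
toy = pd.DataFrame({
    "user_group_id": [1, 1, 2, 8, 8],
    "gender": ["Male", "Male", "Male", "Female", "Female"],
})

# Number of distinct genders per group; 1 everywhere means no mixed groups.
genders_per_group = toy.groupby("user_group_id")["gender"].nunique()
print(genders_per_group)

# The same information as a contingency table of counts.
table = pd.crosstab(toy["user_group_id"], toy["gender"])
print(table)
```

If the pattern holds on the real data, `gender` carries no information beyond `user_group_id`.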

In [None]:
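plt.figure(figsize=(15,5))
user_group = data.groupby('user_group_id')['is_click'].agg(['count','sum'])
user_group['%success']= round((user_group['sum']*100)/user_group['count'], 2)
# sort so that the bars run from the most to the least successful group
user_group = user_group.reset_index().sort_values('%success', ascending = False)
# order expects the category values (group ids), not the bar lengths;
# orient = 'h' is set explicitly because both columns are numeric
sns.barplot(y= 'user_group_id', x= '%success', data = user_group , orient = 'h', order = user_group['user_group_id'])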

#### Success % by user group. The most successful user group is 12.
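
Since `is_click` takes only the values 0 and 1, the success percentage of a group equals the mean of `is_click` times 100. A sketch with made-up rows:

```python
import pandas as pd

# Invented sessions for two hypothetical user groups.
toy = pd.DataFrame({
    "user_group_id": [1, 1, 1, 1, 2, 2],
    "is_click": [0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column is the fraction of ones, i.e. the click rate.
rate = toy.groupby("user_group_id")["is_click"].mean().mul(100).round(2)

# Groups ranked from highest to lowest click rate.
ranked = rate.sort_values(ascending=False)
print(ranked)
```

This gives the same numbers as dividing the `sum` by the `count`, in one step.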

In [None]:
plt.figure(figsize=(15,3))
sns.countplot(x="product", hue= "is_click" , data =data )

#### Visualizing the count of clicks and non-clicks for each product.

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x="product", hue= "product_category_1" , data =data)

#### Number of sessions for each product, broken down by product category 1.

In [None]:
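data1 = data[['user_depth', 'is_click']]
# count sessions per (user_depth, is_click) pair and spread is_click into columns 0 and 1
data1 = data1.groupby(['user_depth','is_click']).size().unstack()
# share of sessions at each depth that ended in a click
data1['success %'] = round(data1[1]*100/(data1[1]+data1[0]),2)
data1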

In [None]:
print(data['is_click'].value_counts())
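
Raw counts are easier to interpret as proportions, and when the target is imbalanced a stratified split keeps the click rate the same in every fold. The sketch below uses `StratifiedKFold`, which is imported at the top of this notebook, on a made-up target; the 10% click rate is an assumption for illustration, not the dataset's actual rate:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Invented target: 10 clicks and 90 non-clicks.
y = pd.Series([1] * 10 + [0] * 90, name="is_click")
X = np.zeros((len(y), 1))  # placeholder features, unused by the splitter

# Class shares instead of raw counts.
print(y.value_counts(normalize=True))

# Click rate inside the validation part of each stratified fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y.iloc[val_idx].mean() for _, val_idx in skf.split(X, y)]
print(fold_rates)
```

With 10 positives and 5 folds, every validation fold receives exactly 2 of them, so each fold reproduces the overall rate.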