# Introduction
This notebook is an exploratory data analysis for the Avito Demand Prediction Challenge, where the task is to estimate the success of online ads. The provided dataset contains tabular, text, and image data.
The goal is to build a model that predicts demand for an online advertisement based on its full description (title, description, images, etc.), its context (where it was posted geographically, similar ads already posted), and historical demand for similar ads in similar contexts.

Let's prepare and have a look at the dataset.

# Preparations
Here we load the libraries required for data wrangling, feature extraction, modelling, and visualization.

In [48]:
import pandas as pd
from category_encoders.hashing import HashingEncoder
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import gc
from scipy.sparse import csr_matrix, hstack
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import zipfile
import matplotlib.pyplot as plt

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('russian'))

pd.set_option('display.precision', 5)  # full option name; the short 'precision' alias is not accepted by recent pandas
pd.set_option('display.float_format', lambda x: '%.5f' % x)

MAX_FEATURES = 2000

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vravuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
train = pd.read_csv('../data/raw/train.csv', parse_dates=['activation_date'])
test = pd.read_csv('../data/raw/test.csv', parse_dates=['activation_date'])

## Glimpse of dataset

In [24]:
train.head(5)

Unnamed: 0,item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability
0,b912c3c6a6ad,e00f8ff2eaf9,Свердловская область,Екатеринбург,Личные вещи,Товары для детей и игрушки,Постельные принадлежности,,,Кокоби(кокон для сна),"Кокон для сна малыша,пользовались меньше месяц...",400.0,2,2017-03-28,Private,d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c...,1008.0,0.12789
1,2dac0150717d,39aeb48f0017,Самарская область,Самара,Для дома и дачи,Мебель и интерьер,Другое,,,Стойка для Одежды,"Стойка для одежды, под вешалки. С бутика.",3000.0,19,2017-03-26,Private,79c9392cc51a9c81c6eb91eceb8e552171db39d7142700...,692.0,0.0
2,ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ...",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a...,3032.0,0.43177
3,02996f1dd2ea,bf5cccea572d,Татарстан,Набережные Челны,Личные вещи,Товары для детей и игрушки,Автомобильные кресла,,,Автокресло,Продам кресло от0-25кг,2200.0,286,2017-03-25,Company,e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413...,796.0,0.80323
4,7c90be56d2ab,ef50846afc0b,Волгоградская область,Волгоград,Транспорт,Автомобили,С пробегом,ВАЗ (LADA),2110.0,"ВАЗ 2110, 2003",Все вопросы по телефону.,40000.0,3,2017-03-16,Private,54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2...,2264.0,0.20797


In [46]:
train.isna().any()

item_id                 False
user_id                 False
region                  False
city                    False
parent_category_name    False
category_name           False
param_1                  True
param_2                  True
param_3                  True
title                   False
description              True
price                    True
item_seq_number         False
activation_date         False
user_type               False
image                    True
image_top_1              True
deal_probability        False
dtype: bool
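
The check above only says whether a column has any missing value. Before choosing an imputation strategy it helps to know how much is missing. Below is a minimal sketch of that idea; the helper name `missing_share` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for `train` (hypothetical values).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "price": [100.0, None, 250.0, None],
    "param_3": [None, None, None, "x"],
})

shares = missing_share(toy)
print(shares)
```

On the real data this would be called as `missing_share(train)`. Judging by the non-null counts printed by `train.info()` in this notebook, `param_3` is missing in roughly 57% of rows, so it would come out on top.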

## Dataset Columns

In [25]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503424 entries, 0 to 1503423
Data columns (total 18 columns):
item_id                 1503424 non-null object
user_id                 1503424 non-null object
region                  1503424 non-null object
city                    1503424 non-null object
parent_category_name    1503424 non-null object
category_name           1503424 non-null object
param_1                 1441848 non-null object
param_2                 848882 non-null object
param_3                 640859 non-null object
title                   1503424 non-null object
description             1387148 non-null object
price                   1418062 non-null float64
item_seq_number         1503424 non-null int64
activation_date         1503424 non-null datetime64[ns]
user_type               1503424 non-null object
image                   1390836 non-null object
image_top_1             1390836 non-null float64
deal_probability        1503424 non-null float64
dtypes: datetim

There are 18 columns in total: 17 features and the target.
1. **item_id** - Ad id.
2. **user_id** - User id.
3. **region** - Ad region.
4. **city** - Ad city.
5. **parent_category_name** - Top level ad category as classified by Avito's ad model.
6. **category_name** - Fine-grained ad category as classified by Avito's ad model.
7. **param_1** - Optional parameter from Avito's ad model.
8. **param_2** - Optional parameter from Avito's ad model.
9. **param_3** - Optional parameter from Avito's ad model.
10. **title** - Ad title.
11. **description** - Ad description.
12. **price** - Ad price.
13. **item_seq_number** - Ad sequential number for user.
14. **activation_date** - Date ad was placed.
15. **user_type** - User type.
16. **image** - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
17. **image_top_1** - Avito's classification code for the image.
18. **deal_probability** - The target variable. This is the likelihood that an ad actually sold something. It's not possible to verify every transaction with certainty, so this column's value can be any float from zero to one.

In [26]:
CATEGORY_FEATURES = ['region', 'city', 'parent_category_name', 'category_name', 'param_1', 'param_2', 
                     'param_3', 'user_type',  'image_top_1']
TEXT_FEATURES = ['title', 'description']
NUMBER_FEATURES = ['price', 'item_seq_number']
ID_FEATURES = ['item_id', 'user_id', 'image']
DATE_FEATURES = ['activation_date']
GENERATED_DATE_FEATURES = ['month', 'weekday', 'month_day', 'year_day']
TARGET_FEATURE = 'deal_probability'

ENCODED_CATEGORY_FEATURES = ['enc_region', 'enc_city', 'enc_parent_category_name', 'enc_category_name', 'enc_param_1', 'enc_param_2', 
                     'enc_param_3', 'enc_user_type',  'enc_image_top_1']

ALL_FEATURES = CATEGORY_FEATURES + TEXT_FEATURES + NUMBER_FEATURES + ID_FEATURES + DATE_FEATURES

In [27]:
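# Encode category features using HashingEncoder.
# Note: with default settings the encoder appears to hash all string columns
# together into 8 bucket-count columns and to pass the numeric `image_top_1`
# through unchanged (see the output below). The `enc_*` names are therefore
# labels only and do not map one-to-one to the original features.
he = HashingEncoder()
he.fit(train[CATEGORY_FEATURES].values)
train[ENCODED_CATEGORY_FEATURES] = he.transform(train[CATEGORY_FEATURES].values)
test[ENCODED_CATEGORY_FEATURES] = he.transform(test[CATEGORY_FEATURES].values)
# Drop the original category features.
train.drop(CATEGORY_FEATURES, axis=1, inplace=True)
test.drop(CATEGORY_FEATURES, axis=1, inplace=True)
gc.collect()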

112

In [28]:
train.head()

Unnamed: 0,item_id,user_id,title,description,price,item_seq_number,activation_date,image,deal_probability,enc_region,enc_city,enc_parent_category_name,enc_category_name,enc_param_1,enc_param_2,enc_param_3,enc_user_type,enc_image_top_1
0,b912c3c6a6ad,e00f8ff2eaf9,Кокоби(кокон для сна),"Кокон для сна малыша,пользовались меньше месяц...",400.0,2,2017-03-28,d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c...,0.12789,3,1,0,1,2,0,1,0,1008.0
1,2dac0150717d,39aeb48f0017,Стойка для Одежды,"Стойка для одежды, под вешалки. С бутика.",3000.0,19,2017-03-26,79c9392cc51a9c81c6eb91eceb8e552171db39d7142700...,0.0,2,3,0,0,2,1,0,0,692.0
2,ba83aefab5dc,91e2f88dd6e3,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ...",4000.0,9,2017-03-20,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a...,0.43177,4,1,1,0,1,0,1,0,3032.0
3,02996f1dd2ea,bf5cccea572d,Автокресло,Продам кресло от0-25кг,2200.0,286,2017-03-25,e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413...,0.80323,4,0,0,1,2,0,0,1,796.0
4,7c90be56d2ab,ef50846afc0b,"ВАЗ 2110, 2003",Все вопросы по телефону.,40000.0,3,2017-03-16,54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2...,0.20797,2,1,0,1,1,1,0,2,2264.0
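
In the table above, the first eight `enc_*` values of every row sum to 8, and `enc_image_top_1` still holds the original `image_top_1` code. That pattern is what the hashing trick produces when a whole row is hashed into a fixed number of buckets. The sketch below illustrates the idea in plain Python. It is a simplified stand-in, not the `category_encoders` implementation: the function name `hash_row` and the choice of MD5 are assumptions, and the bucket positions it prints will not necessarily match the library's output.

```python
import hashlib


def hash_row(values, n_components=8):
    """Count how many values of one row fall into each hash bucket."""
    buckets = [0] * n_components
    for value in values:
        # Hash the string form of the value and map it to a bucket index.
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        buckets[int(digest, 16) % n_components] += 1
    return buckets


# The eight string-valued category fields of the first training row
# (missing param_2 / param_3 are represented here as the string "nan").
row = [
    "Свердловская область", "Екатеринбург", "Личные вещи",
    "Товары для детей и игрушки", "Постельные принадлежности",
    "nan", "nan", "Private",
]

encoded = hash_row(row)
print(encoded, sum(encoded))
```

Each bucket is a count, so the row always sums to the number of hashed values. If one encoded column per original feature is wanted, a per-column encoding (for example ordinal or target encoding) would be a different design choice.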


## Feature Engineering
### Date Features

In [29]:
# Train data
train['month'] = train['activation_date'].dt.month
train['weekday'] = train['activation_date'].dt.weekday
train['month_day'] = train['activation_date'].dt.day
train['year_day'] = train['activation_date'].dt.dayofyear
# Test data
test['month'] = test['activation_date'].dt.month
test['weekday'] = test['activation_date'].dt.weekday
test['month_day'] = test['activation_date'].dt.day
test['year_day'] = test['activation_date'].dt.dayofyear
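
The same four columns are derived twice, once for `train` and once for `test`. A small helper keeps the two in sync. This is a sketch under the assumption that the date column is already parsed as `datetime64`; the helper name `add_date_features` is ours, not from the original notebook.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, column: str = "activation_date") -> pd.DataFrame:
    """Return a copy of df with month, weekday, month_day and year_day added."""
    dates = df[column].dt
    return df.assign(
        month=dates.month,
        weekday=dates.weekday,      # Monday = 0 ... Sunday = 6
        month_day=dates.day,
        year_day=dates.dayofyear,
    )


# Two activation dates taken from the sample rows shown earlier.
toy = pd.DataFrame({"activation_date": pd.to_datetime(["2017-03-28", "2017-03-16"])})
toy = add_date_features(toy)
print(toy)
```

With this helper the cell above would reduce to `train = add_date_features(train)` and `test = add_date_features(test)`.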

In [30]:
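# Pass the stop words as a list; recent scikit-learn versions validate this parameter and reject a set.
vectorizer = TfidfVectorizer(stop_words=list(stop_words), max_features=MAX_FEATURES)
vectorizer.fit(train['title'])
train_title_tfidf = vectorizer.transform(train['title'])
test_title_tfidf = vectorizer.transform(test['title'])
idf = vectorizer.idf_
# Number of terms that were ignored, here because of the max_features cap.
len(vectorizer.stop_words_)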

204799

In [31]:
vectorizer.vocabulary_

{'кокон': 1016,
 'стойка': 1745,
 'одежды': 1382,
 'philips': 469,
 'автокресло': 584,
 'ваз': 682,
 '2110': 88,
 '2003': 64,
 'авто': 583,
 'люлька': 1179,
 'водонагреватель': 725,
 '100': 4,
 'литров': 1167,
 'платье': 1471,
 'ботиночки': 660,
 'натур': 1331,
 'квартира': 975,
 '25': 100,
 'м²': 1182,
 'эт': 1987,
 'джинсы': 838,
 'карты': 964,
 'класс': 986,
 'монитор': 1276,
 'acer': 217,
 '18': 41,
 'продаются': 1546,
 'щенки': 1964,
 'овчарки': 1379,
 'женское': 884,
 'новое': 1347,
 'chevrolet': 274,
 'lanos': 398,
 '2008': 69,
 'цифра': 1908,
 'куртка': 1135,
 'весенняя': 703,
 'осенняя': 1403,
 'сниму': 1697,
 'коттедж': 1077,
 '44': 138,
 'шапка': 1929,
 'норковая': 1361,
 'ford': 323,
 'focus': 322,
 '2005': 66,
 'туфли': 1822,
 'ботинки': 659,
 'продаю': 1545,
 'песочник': 1449,
 'crockid': 285,
 'панели': 1430,
 'кулер': 1127,
 'компьютера': 1047,
 'комната': 1036,
 'audi': 236,
 '80': 193,
 'коляска': 1027,
 '2016': 77,
 'года': 773,
 'берцы': 635,
 '70': 180,
 'свадебные

In [32]:
idf

array([8.77912514, 9.45181527, 7.69761642, ..., 9.28045702, 8.35736532,
       9.54650259])
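
To read these numbers: with its default `smooth_idf=True`, scikit-learn's `TfidfVectorizer` computes the inverse document frequency of a term as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below reproduces that formula and its inverse in plain Python. The function names are ours.

```python
import math


def smooth_idf(n_documents: int, document_frequency: int) -> float:
    """Smoothed IDF as used by TfidfVectorizer with smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0


def document_frequency_from_idf(n_documents: int, idf: float) -> float:
    """Invert the formula: approximate document frequency for a given IDF."""
    return (1 + n_documents) / math.exp(idf - 1.0) - 1


n_titles = 1503424  # number of training rows reported by train.info()
print(smooth_idf(n_titles, 1000))
print(document_frequency_from_idf(n_titles, 8.77912514))
```

An IDF between 8 and 10, as in the array above, therefore corresponds to terms that occur in only a few hundred of the roughly 1.5 million titles, which is expected when the vocabulary is capped at the 2000 most frequent terms of short ad titles.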

In [33]:
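# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement.
len(vectorizer.get_feature_names_out())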

2000

In [34]:
# Stack the dense tabular features and the sparse TF-IDF matrix into one CSR matrix.
DENSE_FEATURES = ENCODED_CATEGORY_FEATURES + NUMBER_FEATURES + GENERATED_DATE_FEATURES
x_train_csr = hstack([train[DENSE_FEATURES], train_title_tfidf], format='csr')
x_test_csr = hstack([test[DENSE_FEATURES], test_title_tfidf], format='csr')

In [35]:
X_train, X_valid, y_train, y_valid = train_test_split(x_train_csr, train[TARGET_FEATURE], test_size=0.20, random_state=42)

In [36]:

num_round = 5000
params = {'learning_rate': 0.05,
          'max_depth': 7,
          'boosting': 'gbdt',
          'objective': 'regression',
          'metric': ['auc','rmse'],
          'is_training_metric': True,
          'seed': 19,
          'num_leaves': 128,
          'feature_fraction': 0.9,
          'bagging_fraction': 0.8,
          'bagging_freq': 5}
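# Note: `verbose_eval` and `early_stopping_rounds` were removed from lgb.train in LightGBM 4.0;
# there, pass callbacks=[lgb.early_stopping(100), lgb.log_evaluation(50)] instead.
model = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train),
    num_round,
    valid_sets=[lgb.Dataset(X_valid, label=y_valid)],
    verbose_eval=50,
    early_stopping_rounds=100,
)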

Training until validation scores don't improve for 100 rounds.
[50]	valid_0's rmse: 0.239674	valid_0's auc: 0.735872
[100]	valid_0's rmse: 0.237102	valid_0's auc: 0.741951
[150]	valid_0's rmse: 0.235776	valid_0's auc: 0.743872
[200]	valid_0's rmse: 0.234762	valid_0's auc: 0.745753
[250]	valid_0's rmse: 0.2342	valid_0's auc: 0.746793
[300]	valid_0's rmse: 0.233695	valid_0's auc: 0.747693
[350]	valid_0's rmse: 0.233233	valid_0's auc: 0.748174
[400]	valid_0's rmse: 0.23286	valid_0's auc: 0.749154
[450]	valid_0's rmse: 0.232522	valid_0's auc: 0.750043
[500]	valid_0's rmse: 0.232187	valid_0's auc: 0.750893
[550]	valid_0's rmse: 0.231966	valid_0's auc: 0.75133
[600]	valid_0's rmse: 0.231698	valid_0's auc: 0.751585
[650]	valid_0's rmse: 0.231511	valid_0's auc: 0.752036
[700]	valid_0's rmse: 0.231281	valid_0's auc: 0.752495
[750]	valid_0's rmse: 0.231119	valid_0's auc: 0.75279
[800]	valid_0's rmse: 0.230939	valid_0's auc: 0.753301
[850]	valid_0's rmse: 0.230761	valid_0's auc: 0.753524
[900]	va
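
The `rmse` column in the log is the root mean squared error on the validation split, which is also the metric the competition is scored on. To judge whether a value around 0.23 is good, it helps to compare it with the error of always predicting the mean target. The sketch below shows both computations in plain Python. The function names and the five sample targets (taken from the `train.head()` output earlier) are for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def constant_baseline_rmse(y_true):
    """RMSE obtained by always predicting the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    return rmse(y_true, [mean] * len(y_true))


# deal_probability of the five rows shown by train.head()
sample_targets = [0.12789, 0.0, 0.43177, 0.80323, 0.20797]
print(constant_baseline_rmse(sample_targets))
```

On the real data the baseline would be computed as `constant_baseline_rmse(list(y_valid))`. The model is only useful to the extent that its validation RMSE is below that value.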

In [37]:
pred = model.predict(x_test_csr)

submission = pd.read_csv("../data/raw/sample_submission.csv")
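# Clip predictions to the valid probability range [0, 1] before assigning;
# an in-place clip on a selected column is not guaranteed to modify the DataFrame.
submission['deal_probability'] = pred.clip(0.0, 1.0)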
submission.to_csv('../data/submissions/submission.csv', index=False)

# Compress (zip) the submission file; the context manager closes the archive automatically.
with zipfile.ZipFile('../data/submissions/submission.csv.zip', 'w') as submission_zip:
    submission_zip.write('../data/submissions/submission.csv', compress_type=zipfile.ZIP_DEFLATED)

submission

Unnamed: 0,item_id,deal_probability
0,6544e41a8817,0.89921
1,65b9484d670f,0.45429
2,8bab230b2ecd,0.40685
3,8e348601fefc,0.53624
4,8bd2fe400b89,0.53550
5,c63dbd6c657f,0.55200
6,6d1a410df86e,0.36032
7,e8d3e7922b80,0.36406
8,2bc1ab208462,0.36030
9,7e05d77a9181,0.47045
