# Avito Demand Prediction Challenge


In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 500)

In [3]:
data = pd.read_csv("../input/train.csv")
data.head()

The dataset contains the following columns:

* `item_id` — Ad id.
* `user_id` — User id.
* `region` — Ad region.
* `city` — Ad city.
* `parent_category_name` — Top level ad category as classified by Avito's ad model.
* `category_name` — Fine grain ad category as classified by Avito's ad model.
* `param_1` — Optional parameter from Avito's ad model.
* `param_2` — Optional parameter from Avito's ad model.
* `param_3` — Optional parameter from Avito's ad model.
* `title` — Ad title.
* `description` — Ad description.
* `price` — Ad price.
* `item_seq_number` — Ad sequential number for user.
* `activation_date`— Date ad was placed.
* `user_type` — User type.
* `image` — Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
* `image_top_1` — Avito's classification code for the image.
* `deal_probability` — The target variable. This is the likelihood that an ad actually sold something. It's not possible to verify every transaction with certainty, so this column's value can be any float from zero to one.


We drop the following columns:
* `item_id`, `user_id` - identifiers that carry no useful information for us,
* `city` - too fine-grained a division,
* `title` - we have more meaningful texts,
* `param_1`, `param_2`, `param_3` - optional parameters that are not always present,
* `activation_date`, `item_seq_number` - also look uninformative,
* `image`, `image_top_1` - we concentrate on text and categorical variables, not images.

In [4]:
cols_to_drop = ["item_id", "user_id", "city", "param_1", "param_2", "param_3", "title",
    "activation_date", "item_seq_number", "image", "image_top_1"]
data = data.drop(labels=cols_to_drop, axis=1)

In [5]:
data.head()

We encode the categorical attributes, converting them to numerical form:

In [6]:
from sklearn.preprocessing import LabelEncoder

def label_encoding(data):
    # Fit a LabelEncoder on the column and return its integer codes
    return LabelEncoder().fit_transform(data)
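
To make the encoding step concrete, here is a minimal pure-Python sketch of what label encoding does: each distinct value gets an integer code, assigned in sorted order of the values, which is how `LabelEncoder` orders its classes. The helper name `label_encode` and the sample values are illustrative assumptions, not part of the notebook.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Illustrative sample values (assumed, not read from train.csv)
codes, classes = label_encode(["Private", "Company", "Shop", "Private"])
print(codes)    # [1, 0, 2, 1]
print(classes)  # ['Company', 'Private', 'Shop']
```

Note that the resulting integers imply an ordering between categories that has no real meaning, which some models may pick up on.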

In [7]:
# parent_category = LabelEncoder()
# parent_category.fit(data["parent_category_name"])
# data["parent_category_name"] = parent_category.transform(data["parent_category_name"])
data["parent_category_name"]  = label_encoding(data["parent_category_name"])
data["parent_category_name"].head()

In [8]:
# category = LabelEncoder()
# category.fit(data["category_name"])
# data["category_name"] = category.transform(data["category_name"])
data["category_name"] = label_encoding(data["category_name"])
data["category_name"].head()

In [9]:
# user_type = LabelEncoder()
# user_type.fit(data["user_type"])
# data["user_type"] = user_type.transform(data["user_type"])
data["user_type"] = label_encoding(data["user_type"])
data["user_type"].head()

In [10]:
# region = LabelEncoder()
# region.fit(data["region"])
# data["region"] = region.transform(data["region"])
data["region"] = label_encoding(data["region"])
data["region"].head()

In [11]:
data = data.dropna()

The data now looks like this:

In [12]:
data.head()

### 1.2. Preprocessing: texts

We found experimentally that some ads have an empty description. To keep them from breaking anything later on, we run `fillna()`.

In [13]:
data["description"] = data["description"].fillna("")

We first convert the texts to lower case, then count the tokens and reduce every word to its lemma, removing stop words along the way. This can be done as follows (the process takes a very long time, so preprocessing is limited when uploading data):

```python
from nltk import word_tokenize
from nltk.corpus import stopwords
from pymystem3 import Mystem

mystem = Mystem()
# Build the stop-word set once instead of on every call
stops = set(stopwords.words("russian"))


def count_words(text):
    # Number of tokens in the text; 0 if it cannot be tokenized (e.g. NaN)
    try:
        return len(word_tokenize(text))
    except Exception:
        return 0

def do_lemmas(text):
    # Lemmatize with Mystem, dropping stop words and whitespace-only tokens
    try:
        lemmas = mystem.lemmatize(text)
    except Exception:
        return []
    return [lemma for lemma in lemmas if lemma.strip() and lemma not in stops]

data_new["word_count"] = data_new["description"].apply(count_words)
data_new["lemmas"] = data_new["description"].apply(do_lemmas)
```

We tokenize the text with the `casual_tokenize` function from the nltk library.

We then remove punctuation, and any other token that is not purely alphabetic, with Python's built-in string method `.isalpha()`.

In [14]:
from nltk import casual_tokenize

In [15]:
def tokenize(text):
    tokens = casual_tokenize(str(text))
    clean_stuff = [word.lower() for word in tokens if word.isalpha()]
    line = " ".join(clean_stuff)
    return line
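
As a small illustration of the `.isalpha()` filter, the sketch below applies it to a hand-written token list. The tokens are an assumption standing in for tokenizer output, so no nltk call is involved. Numbers, punctuation, and mixed tokens such as `iPhone7` are all dropped, not only punctuation.

```python
# Hypothetical tokens, standing in for the output of a tokenizer
tokens = ["Продам", "велосипед", ",", "цена", "5000", "руб", ".", "iPhone7"]


def keep_alpha(tokens):
    """Lowercase the tokens and keep only those made entirely of letters."""
    return " ".join(t.lower() for t in tokens if t.isalpha())


cleaned = keep_alpha(tokens)
print(cleaned)  # продам велосипед цена руб
```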

In [16]:
%%time
data["description"] = data["description"].apply(tokenize)

In [17]:
data["description"].head()

### 1.3. TF-IDF for texts + categorical and numeric features

First, we select the target variable and all the categorical and quantitative features:

In [18]:
cat_num_cols = ["region", "parent_category_name", "category_name", "price", "user_type"]
X_cat_num = data[cat_num_cols].values

In [19]:
y = data["deal_probability"].values

For a stable TF-IDF + SVM solution on texts, you need either to apply PCA (principal component analysis, which you can read more about at https://habr.com/post/304214/) or to limit the size of the TF-IDF vectors, discarding data that does not significantly affect the result.

Apply TF-IDF:

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

In [21]:
tfidf = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    stop_words=stopwords.words("russian"),
    max_features=10000
)
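
The `max_features` cap keeps only the terms that occur most often across the whole corpus. A rough pure-Python sketch of that selection is shown below. The helper `top_terms`, the toy documents, and the alphabetical tie-breaking are assumptions for illustration; scikit-learn's own tie-breaking may differ.

```python
from collections import Counter


def top_terms(docs, max_features):
    """Return the max_features most frequent whitespace-separated terms, sorted."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Rank by descending corpus frequency; break ties alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(term for term, _ in ranked[:max_features])


docs = ["продам диван", "продам стол", "куплю диван недорого"]
vocab = top_terms(docs, 2)
print(vocab)  # ['диван', 'продам']
```

Terms outside the kept vocabulary are simply ignored when the documents are vectorized, which is what bounds the width of the TF-IDF matrix.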

In [22]:
%%time
tfidf.fit(data["description"].values)
X_texts = tfidf.transform(data["description"].values)

## 2. Integrating the features

At this stage, we need to combine the categorical and numeric features we obtained with the output of TF-IDF.

In [23]:
from scipy.sparse import hstack

In [24]:
X = hstack((X_texts, X_cat_num))

In [25]:
X.shape

We save the results to disk, just in case:

In [26]:
import os
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib

In [27]:
# Create the output directory if it does not exist yet
os.makedirs("./models", exist_ok=True)
joblib.dump(X, "./models/X.pkl")
joblib.dump(y, "./models/y.pkl")
joblib.dump(tfidf, "./models/tfidf.pkl")

## 3. Trying different algorithms

### 3.0. Preparation

To begin, **we split the sample** into training and test sets:

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

**Create an error function**, RMSE (root mean squared error), to evaluate the models with:

In [30]:
from sklearn.metrics import mean_squared_error, make_scorer
from math import sqrt

In [31]:
def rmse_func(y_true, y_pred):
    # make_scorer calls this as score_func(y_true, y_pred)
    return sqrt(mean_squared_error(y_true, y_pred))

rmse = make_scorer(rmse_func, greater_is_better=False)
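
For reference, RMSE is the square root of the mean of the squared differences between true and predicted values. The sketch below computes it by hand on made-up numbers (the values are assumptions for illustration only). Because the scorer is built with `greater_is_better=False`, scikit-learn negates the value, so a grid search will report a negative RMSE as its best score.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


# Made-up deal probabilities and predictions
y_true = [0.0, 0.5, 1.0, 0.0]
y_pred = [0.1, 0.5, 0.8, 0.1]
score = rmse(y_true, y_pred)
print(round(score, 4))  # 0.1225
```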

### 3.1. SVM

A classical solution for text-based tasks is an SVM with an RBF kernel.

In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

In [None]:
%env JOBLIB_TEMP_FOLDER=/tmp

In [None]:
params = {"C": np.arange(1, 100, 2)}
svr = GridSearchCV(
    SVR(),
    param_grid=params,
    scoring=rmse,
    cv=5,
    verbose=1,
    n_jobs=-1
)
svr.fit(X_train, y_train)
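
Before launching this search, it may help to estimate how many models it trains. The sketch below counts the candidates in the `C` grid and multiplies by the number of folds, using `range` as a stand-in for `np.arange` with the same arguments. With the default `refit=True`, one more fit on the full training set follows.

```python
# Same values as np.arange(1, 100, 2): 1, 3, ..., 99
c_values = list(range(1, 100, 2))
cv_folds = 5

n_candidates = len(c_values)
n_fits = n_candidates * cv_folds  # cross-validation fits, excluding the final refit
print(n_candidates, n_fits)  # 50 250
```

Since SVR training time grows quickly with the number of samples, a coarser grid or a subsample of the data may be a more practical starting point.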

In [None]:
print("SVR results:\n\t- best params: {}\n\t- best score: {}".format(svr.best_params_, svr.best_score_))

In [None]:
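# Predict on the held-out features (not on the target vector y)
result = svr.predict(X_test)
# item_id was dropped earlier, so the row index is written instead
submission = pd.DataFrame({"deal_probability": result})
submission.to_csv("submit2.csv", index=True, header=True)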