**Avito Demand Prediction Challenge:**

Description of the challenge taken from the competition page:

When selling used goods online, a combination of tiny, nuanced details in a product description can make a big difference in drumming up interest. 

And, even with an optimized product listing, demand for a product may simply not exist–frustrating sellers who may have over-invested in marketing.

Avito, Russia’s largest classified advertisements website, is deeply familiar with this problem. Sellers on their platform sometimes feel frustrated with both too little demand (indicating something is wrong with the product or the product listing) or too much demand (indicating a hot item with a good description was underpriced).

In their fourth Kaggle competition, Avito is challenging you to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts. With this information, Avito can inform sellers on how to best optimize their listing and provide some indication of how much interest they should realistically expect to receive.

**What are we doing in this kernel?**

The aim of this kernel is to perform EDA on the Avito Demand Prediction Challenge data and to add new features, which are then used by a LightGBM model to make predictions.

In [None]:
import os
import pandas_profiling as pp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

import datetime

from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams

from sklearn.feature_extraction.text import TfidfVectorizer
stop = set(stopwords.words('russian'))

import lightgbm as lgb
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

In [None]:
df_train = pd.read_csv("../input/train.csv")
df_test  = pd.read_csv("../input/test.csv")
df_sub   = pd.read_csv("../input/sample_submission.csv")
df_periods_train = pd.read_csv("../input/periods_train.csv")

**Data Overview:**

Let's understand the data that we have for this kernel.

In [None]:
df_train.head(5)

In [None]:
df_train.info()

Columns:
- item_id
- user_id
- region
- city
- parent_category_name
- category_name
- param_1 
- param_2
- param_3
- title
- description
- price
- item_seq_number
- activation_date
- user_type
- image
- image_top_1
- deal_probability


In [None]:
df_train.describe(include='all')

Most of the columns are categorical, so we have very few numerical variables.

In [None]:
df_periods_train.head(5)

In [None]:
df_periods_train.info()

In [None]:
df_periods_train.describe(include='all')

**Pandas profiling:**

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- **Essentials**: type, unique values, missing values
- **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range
- **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- **Most frequent values**
- **Histogram**
- **Correlations**: highlighting of highly correlated variables, Spearman and Pearson matrices


In [None]:
pp.ProfileReport(df_train)

Let's try to understand the output of the pandas profiling:

- Types of features: we have several types of features, i.e. numerical, categorical, text and date.
- Missing values: some columns have missing values; for example, the description text is sometimes missing for an ad.
- Users: over 51% of the user_id values are unique; most users don't post a lot of ads.
- Categories: there are 9 categories in parent_category_name.
- Price: price is skewed and hence will require careful processing.
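
Some of these observations can also be checked directly with pandas, without rendering the full report. The sketch below is only an illustration on a toy frame: the helper name quick_profile and the toy values are my own assumptions, not part of the competition data.

```python
import pandas as pd


def quick_profile(df, id_col, numeric_col):
    """Return a few of the summary numbers that a profile report shows."""
    return {
        # Share of missing values per column
        "missing_share": df.isnull().mean().to_dict(),
        # Number of distinct ids divided by the number of rows
        "unique_id_share": df[id_col].nunique() / len(df),
        # Sample skewness of the numeric column (positive means a long right tail)
        "skewness": df[numeric_col].skew(),
    }


# Toy data (hypothetical values) that mimics the structure of train.csv
toy = pd.DataFrame({
    "user_id": ["a", "b", "b", "c"],
    "description": ["text", None, "text", "text"],
    "price": [100.0, 200.0, 300.0, 100000.0],
})

summary = quick_profile(toy, id_col="user_id", numeric_col="price")
print(summary)
```

On the real data the same call would be quick_profile(df_train, "user_id", "price").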

**Feature Analysis:**

Let's analyze some of the features in more detail.

**activation_date:**
Let's create new features based on activation_date: date, weekday, day of month.



In [None]:
df_train["activation_date"] = pd.to_datetime(df_train["activation_date"])
df_train["date"]            = df_train["activation_date"].dt.date
df_train["weekday"]         = df_train["activation_date"].dt.weekday
df_train["day"]             = df_train["activation_date"].dt.day

count_by_date_train         = df_train.groupby("date")["deal_probability"].count()
mean_by_date_train          = df_train.groupby("date")["deal_probability"].mean()

df_test["activation_date"]  = pd.to_datetime(df_test["activation_date"])
df_test["date"]             = df_test["activation_date"].dt.date
df_test["weekday"]          = df_test["activation_date"].dt.weekday
df_test["day"]              = df_test["activation_date"].dt.day

count_by_date_test          = df_test.groupby('date')['item_id'].count()

Let's plot the ad counts and the mean deal_probability by date, to look for a relation between the number of ads and the mean deal_probability.

In [None]:
fig, (ax1, ax3) = plt.subplots(figsize=(26, 8), 
                              ncols=2,
                              sharey=True)
count_by_date_train.plot(ax=ax1, 
                        legend=False,
                        label="Ads Count")

ax2 = ax1.twinx()

mean_by_date_train.plot(ax=ax2,
                       color="g",
                       legend=False,
                       label="Mean deal_probability")

ax2.set_ylabel("Mean deal_probability", color="g")

count_by_date_test.plot(ax=ax3,
                       color="r",
                       legend=False,
                       label="Ads count test")

plt.grid(False)

ax1.title.set_text("Trends of deal_probability and number of ads")
ax3.title.set_text("Trends of number of ads for test data")
ax1.legend(loc=(0.8, 0.35))
ax2.legend(loc=(0.8, 0.2))
ax3.legend(loc="upper right")

As we can see, we have a few weeks of data in the training set and one week of data in the test set.

- For most of the training period, the number of ads per day is quite high (100,000 or more) and the mean deal_probability is low, around 0.14. After March 28, the number of ads drops off and the mean deal_probability fluctuates.
- In the test set, ads are posted up to April 18th and then the count falls off.

For the training dataset, let's remove the data after March 29, because the number of ads falls to almost 0.
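
A minimal sketch of such a filter is shown below. The helper name drop_sparse_tail and the toy rows are hypothetical, and the year in the cutoff is an assumption that should be checked against the dates in train.csv.

```python
import pandas as pd

# Hypothetical cutoff: the year is an assumption, adjust it to the actual data.
CUTOFF = pd.Timestamp("2017-03-29")


def drop_sparse_tail(df, date_col="activation_date", cutoff=CUTOFF):
    """Keep only rows whose activation date is on or before the cutoff."""
    dates = pd.to_datetime(df[date_col])
    return df.loc[dates <= cutoff].reset_index(drop=True)


# Toy rows (not the competition data)
toy = pd.DataFrame({
    "activation_date": ["2017-03-27", "2017-03-29", "2017-04-02"],
    "deal_probability": [0.1, 0.0, 0.8],
})

trimmed = drop_sparse_tail(toy)
print(len(toy), "->", len(trimmed))
```

The index is reset so that positional indexing, such as the fold indices used later for the meta-feature, stays aligned with the rows.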

Let's plot the ads count and the mean deal_probability by day of the week to look for patterns.

In [None]:
fig, ax1 = plt.subplots(figsize=(16, 8))

plt.title("Ads Count and Deal Probability by day of the week")

sns.countplot(x   = "weekday",
             data = df_train,
             ax   = ax1)

ax1.set_ylabel("Ads Count", color="b")

plt.legend(["Ads Count"])

ax2 = ax1.twinx()

sns.pointplot(x   = "weekday",
             y    = "deal_probability",
             data = df_train,
             ci   = 99,
             ax   = ax2,
             color = "black")

ax2.set_ylabel("deal_probability", color="g")
plt.legend(["deal_probability"], loc=(0.875, 0.9))
plt.grid(False)


Learnings from the above plot:
- Deal probability increases as the ads count falls, and it reaches a maximum on day 6 of the week.
- Ads counts are highest on days 0 and 6. Note that dt.weekday encodes Monday as 0 and Sunday as 6, so these are Monday and Sunday rather than the two weekend days.
- Deal probability falls off mid-week, potentially due to a mid-week effect.
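
When reading the weekday axis, it helps to confirm how pandas numbers the days. The short sketch below builds one arbitrary week of dates (the start date is only an example) and prints the number-to-name mapping.

```python
import pandas as pd

# One full week of consecutive dates; the start date is arbitrary.
week = pd.Series(pd.date_range("2017-03-20", periods=7, freq="D"))

# Map each weekday number to its English day name.
weekday_names = dict(zip(week.dt.weekday, week.dt.day_name()))

for number in sorted(weekday_names):
    print(number, weekday_names[number])
```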



**Categories:**

Let's spend some time looking at the categories.

In [None]:
a = df_train.groupby(["parent_category_name", "category_name"]).agg({"deal_probability": ["mean", "count"]}).reset_index().sort_values([("deal_probability", "mean")], ascending=False).reset_index(drop=True)

a

We can see that "Услуги" (services) is the category with the highest deal_probability. Other "good" categories are about animals or electronics/cars. Least successful are various accessories or expensive things.

**City:**

Let's take a look at the city feature in depth.

In [None]:
city_ads = df_train.groupby("city").agg({"deal_probability": ["mean", "count"]}).reset_index().sort_values([("deal_probability", "mean")], ascending=False).reset_index(drop=True)

print("There are {0} cities in total".format(len(df_train.city.unique())))

print("There are {1} cities with more than {0} ads".format(100, city_ads[city_ads["deal_probability"]["count"] > 100].shape[0]))

print('There are {1} cities with more than {0} ads.'.format(1000, city_ads[city_ads['deal_probability']['count'] > 1000].shape[0]))

print('There are {1} cities with more than {0} ads.'.format(10000, city_ads[city_ads['deal_probability']['count'] > 10000].shape[0]))

It seems that most of the cities have few ads posted, and only 33 of them have a lot of ads. Let's see the best and the worst cities by mean deal_probability.

In [None]:
city_ads[city_ads["deal_probability"]["count"] > 1000].head()

In [None]:
city_ads[city_ads['deal_probability']['count'] > 1000].tail()

I think it could be interesting to see what is sold in Лабинск and Миллерово.

In [None]:
print("Лабинск")

df_train.loc[df_train.city == "Лабинск"].groupby('category_name').agg({'deal_probability': ['mean', 'count']}).reset_index().sort_values([('deal_probability', 'count')], ascending=False).reset_index(drop=True).head(5)

Most popular categories are "Автомобили" (cars) and "Телефоны" (telephones).



In [None]:
print('Миллерово')
df_train.loc[df_train.city == 'Миллерово'].groupby('category_name').agg({'deal_probability': ['mean', 'count']}).reset_index().sort_values([('deal_probability', 'count')], ascending=False).reset_index(drop=True).head()

Most popular categories are "Автомобили" (cars). And it seems that second-hand wares are least popular.



**deal_probability**



In [None]:
plt.hist(df_train["deal_probability"]);
plt.title("deal_probability");

On the one hand, the distribution of the target is highly skewed towards zero; on the other hand, there is a spike at about 0.8.

**title:**

In [None]:
text = ' '.join(df_train["title"].values)
wordCloud = WordCloud(max_font_size = None,
                      stopwords = stop,
                      background_color = "white",
                      width = 1200,
                      height = 1000).generate(text)

plt.figure(figsize=(12, 8))
plt.imshow(wordCloud)
plt.title('Top words for title')
plt.axis("off")
plt.show()

**description:**

In [None]:
df_train["description"] = df_train["description"].apply(
    lambda x: str(x).replace('/\n', ' ').replace('\xa0', ' ')
)

In [None]:
text = ' '.join(df_train['description'].values)
text = [i for i in ngrams(text.lower().split(), 3)]
print('Common trigrams.')
Counter(text).most_common(40)

We can see that sellers try to tell buyers that their wares are great and also tell about possibilities of delivery. But there are some strange values, let's have a look...



In [None]:
df_train[df_train.description.str.contains('↓')]['description'].head(10).values

Some of the authors are using emoticons for their ads. 

**image:**

In this kernel I won't use the images themselves, but I'll create a feature showing whether there is an image or not.

In [None]:
df_train['has_image'] = 1
df_train.loc[df_train['image'].isnull(),'has_image'] = 0

print('There are {} ads with images. Mean deal_probability is {:.3}.'.format(len(df_train.loc[df_train['has_image'] == 1]), df_train.loc[df_train['has_image'] == 1, 'deal_probability'].mean()))

In [None]:
print('There are {} ads without images. Mean deal_probability is {:.3}.'.format(len(df_train.loc[df_train['has_image'] == 0]), df_train.loc[df_train['has_image'] == 0, 'deal_probability'].mean()))

**item_seq_number:**

In [None]:
plt.scatter(df_train.item_seq_number, df_train.deal_probability, label="item_seq_number vs deal_probability");

plt.xlabel("item_seq_number");
plt.ylabel("deal_probability");

It seems that there are many users who post a lot of ads, and the number of ads posted isn't really correlated with deal_probability.

**Params:**

There are three fields with additional information, let's combine it into one. Technically it is possible to treat these features as categorical, but there would be too many of them.

In [None]:
df_train["params"] = df_train["param_1"].fillna('') + ' ' + df_train["param_2"].fillna('') + ' ' + df_train["param_3"].fillna('')

df_train["params"] = df_train["params"].str.strip()

text = ' '.join(df_train["params"].values)
text = [i for i in ngrams(text.lower().split(), 3)]

print("common trigrams")

Counter(text).most_common(40)

Most of the params relate to clothes or cars.



**user_type:**

There are three main user types. Let's look at the prices of their wares, restricted to prices below 100,000.

In [None]:
sns.set(rc = {'figure.figsize': (15, 8)})

# Keep only rows with a known price below 100,000 for plotting
df_train_ = df_train[df_train.price.notnull()]
df_train_ = df_train_.loc[df_train_.price < 100000.0]

sns.boxplot(x = "parent_category_name",
           y = "price",
           hue = "user_type",
           data = df_train_)

plt.title("Price by parent category and user type")
plt.xticks(rotation = "vertical")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

We can see that shops usually have higher prices than companies and private sellers usually have the lowest price - maybe because they are usually second-hand.



**price:**

The first question is how to deal with missing values. I have decided to do the following:
- first, fill missing values with the median by city and category;
- then fill the missing values which are left with the median by region and category;
- finally, fill the remaining missing values with the median by category.


In [None]:
# Fill missing prices with group medians, from the most specific grouping to the most general.
# transform("median") keeps the original index, so the result aligns row by row.
for keys in (["city", "category_name"], ["region", "category_name"], ["category_name"]):
    df_train["price"] = df_train["price"].fillna(
        df_train.groupby(keys)["price"].transform("median")
    )

plt.hist(df_train["price"]);

Let's use a Box-Cox transformation to reduce the skewness.



In [None]:
plt.hist(stats.boxcox(df_train["price"] + 1)[0]);

Imagine that you are watching a race and that you are located close to the finish line. When the first and fastest runners complete the race, the differences in times between them will probably be quite small.

Now wait until the last runners arrive and consider their finishing times. For these slowest runners, the differences in completion times will be extremely large. This is due to the fact that for longer racing times a small difference in speed will have a significant impact on completion times, whereas for the fastest runners, small differences in speed will have a small (but decisive) impact on arrival times.

This phenomenon is called “heteroscedasticity” (non-constant variance). In this example, the amount of variation depends on the average value (small variations for shorter completion times, large variations for longer times).

This distribution of running times data will probably not follow the familiar bell-shaped curve (a.k.a. the normal distribution). The resulting distribution will be asymmetrical with a longer tail on the right side. This is because there's small variability on the left side with a short tail for smaller running times, and larger variability for longer running times on the right side, hence the longer tail.


Why does this matter?

Model bias and spurious interactions: If you are performing a regression or a design of experiments (any statistical modelling), this asymmetrical behavior may lead to a bias in the model. If a factor has a significant effect on the average speed, because the variability is much larger for a larger average running time, many factors will seem to have a stronger effect when the mean is larger. This is not due, however, to a true factor effect but rather to an increased amount of variability that affects all factor effect estimates when the mean gets larger. This will probably generate spurious interactions due to a non-constant variation, resulting in a very complex model with many (spurious and unrealistic) interactions.

If you are performing a standard capability analysis, this analysis is based on the normality assumption. A substantial departure from normality will bias your capability estimates.

http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation
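
To make the effect concrete, here is a small sketch on synthetic data. The log-normal "prices" are an assumption chosen only because they are right-skewed; they are not the competition data. The sketch also fits the Box-Cox lambda on the training sample and reuses it for the test sample, so that both end up on the same scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); not the competition data.
train_price = rng.lognormal(mean=8.0, sigma=1.5, size=5000)
test_price = rng.lognormal(mean=8.0, sigma=1.5, size=1000)

# Fit lambda on the training data only ...
train_bc, fitted_lambda = stats.boxcox(train_price + 1)
# ... and reuse it for the test data so both share one scale.
test_bc = stats.boxcox(test_price + 1, lmbda=fitted_lambda)

print("skew before:", stats.skew(train_price))
print("skew after: ", stats.skew(train_bc))
print("lambda:", fitted_lambda)
```

When lmbda is passed explicitly, stats.boxcox returns only the transformed array instead of a tuple.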

**Feature Engineering:**


In [None]:
# lets transform the test in the same way as train

df_test["params"] = df_test["param_1"].fillna('') + ' ' + df_test["param_2"].fillna('') + ' ' + df_test["param_3"].fillna('')
df_test["params"] = df_test["params"].str.strip()

df_test["description"] = df_test["description"].apply( lambda x: str(x).replace("/\n", ' ').replace("\xa0", " "))

df_test["has_image"] = 1
df_test.loc[df_test["image"].isnull(), "has_image"] = 0

# Fill missing test prices in the same way as for train
for keys in (["city", "category_name"], ["region", "category_name"], ["category_name"]):
    df_test["price"] = df_test["price"].fillna(
        df_test.groupby(keys)["price"].transform("median")
    )

# Fit the Box-Cox lambda on train and reuse it for test so that both share the same scale
df_train["price"], price_lambda = stats.boxcox(df_train.price + 1)
df_test["price"] = stats.boxcox(df_test.price + 1, lmbda=price_lambda)

**Aggregate features:**

I'll create a number of aggregate features. 

- user_price_mean
- user_ad_count
- region_price_mean
- region_price_median
- region_price_max
- city_price_mean
- city_price_median
- city_price_max
- parent_category_name_price_mean
- parent_category_name_price_median
- parent_category_name_price_max
- category_name_price_mean
- category_name_price_median
- category_name_price_max
- user_type_category_price_mean
- user_type_category_price_median
- user_type_category_price_max



In [None]:
df_train["user_price_mean"] = df_train.groupby("user_id")["price"].transform("mean")
df_train["user_ad_count"]   = df_train.groupby("user_id")["price"].transform("count")

df_train["region_price_mean"]   = df_train.groupby("region")["price"].transform("mean")
df_train["region_price_median"] = df_train.groupby("region")["price"].transform("median")
df_train["region_price_max"]    = df_train.groupby("region")["price"].transform("max")

df_train["city_price_mean"]   = df_train.groupby("city")["price"].transform("mean")
df_train["city_price_median"] = df_train.groupby("city")["price"].transform("median")
df_train["city_price_max"]    = df_train.groupby("city")["price"].transform("max")

df_train["parent_category_name_price_mean"]   = df_train.groupby("parent_category_name")["price"].transform("mean")
df_train["parent_category_name_price_median"] = df_train.groupby("parent_category_name")["price"].transform("median")
df_train["parent_category_name_price_max"]    = df_train.groupby("parent_category_name")["price"].transform("max")


In [None]:
df_train["category_name_price_mean"]   = df_train.groupby("category_name")["price"].transform("mean")
df_train["category_name_price_median"] = df_train.groupby("category_name")["price"].transform("median")
df_train["category_name_price_max"]    = df_train.groupby("category_name")["price"].transform("max")

df_train["user_type_category_price_mean"]   = df_train.groupby(["user_type", "parent_category_name"])["price"].transform("mean")
df_train["user_type_category_price_median"] = df_train.groupby(["user_type", "parent_category_name"])["price"].transform("median")
df_train["user_type_category_price_max"]    = df_train.groupby(["user_type", "parent_category_name"])["price"].transform("max")

In [None]:
df_test["user_price_mean"] = df_test.groupby("user_id")["price"].transform("mean")
df_test["user_ad_count"]   = df_test.groupby("user_id")["price"].transform("count")

df_test["region_price_mean"]   = df_test.groupby("region")["price"].transform("mean")
df_test["region_price_median"] = df_test.groupby("region")["price"].transform("median")
df_test["region_price_max"]    = df_test.groupby("region")["price"].transform("max")

df_test["city_price_mean"]   = df_test.groupby("city")["price"].transform("mean")
df_test["city_price_median"] = df_test.groupby("city")["price"].transform("median")
df_test["city_price_max"]    = df_test.groupby("city")["price"].transform("max")

df_test["parent_category_name_price_mean"]   = df_test.groupby("parent_category_name")["price"].transform("mean")
df_test["parent_category_name_price_median"] = df_test.groupby("parent_category_name")["price"].transform("median")
df_test["parent_category_name_price_max"]    = df_test.groupby("parent_category_name")["price"].transform("max")


In [None]:
df_test["category_name_price_mean"]   = df_test.groupby("category_name")["price"].transform("mean")
df_test["category_name_price_median"] = df_test.groupby("category_name")["price"].transform("median")
df_test["category_name_price_max"]    = df_test.groupby("category_name")["price"].transform("max")

df_test["user_type_category_price_mean"]   = df_test.groupby(["user_type", "parent_category_name"])["price"].transform("mean")
df_test["user_type_category_price_median"] = df_test.groupby(["user_type", "parent_category_name"])["price"].transform("median")
df_test["user_type_category_price_max"]    = df_test.groupby(["user_type", "parent_category_name"])["price"].transform("max")

**Categorical features:**

I'll use the target encoding to deal with categorical features.

In [None]:
def target_encode(trn_series = None,
                 tst_series  = None, 
                 target      = None,
                 min_samples_leaf = 1,
                 smoothing   = 1,
                 noise_level = 0):
    """
    
    https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    
    # Apply average function to all target data
    prior = target.mean()
    
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return ft_trn_series, ft_tst_series
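
To get a feeling for the smoothing used in target_encode, the sketch below evaluates the same sigmoid weight for a few category sizes. The helper names smoothing_weight and smoothed_mean, the category mean of 0.5 and the prior of 0.14 are illustrative assumptions.

```python
import numpy as np


def smoothing_weight(count, min_samples_leaf=1, smoothing=1):
    """Sigmoid weight given to the category mean, as in target_encode."""
    return 1 / (1 + np.exp(-(count - min_samples_leaf) / smoothing))


def smoothed_mean(category_mean, count, prior, min_samples_leaf=1, smoothing=1):
    """Blend the category mean with the global prior."""
    w = smoothing_weight(count, min_samples_leaf, smoothing)
    return prior * (1 - w) + category_mean * w


prior = 0.14  # illustrative global mean of the target

# Rare categories stay close to the prior, frequent ones approach their own mean.
for count in (1, 3, 10, 100):
    print(count, round(smoothed_mean(0.5, count, prior), 4))
```

With the default arguments, a category seen only once gets a weight of exactly 0.5, so its encoded value is halfway between the prior and its own mean. Note that the encoding above is computed on the full training set, so the encoded training values contain some information about each row's own target.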
    

In [None]:
df_train['parent_category_name'], df_test['parent_category_name'] = target_encode(df_train['parent_category_name'], df_test['parent_category_name'], df_train['deal_probability'])

df_train['category_name'], df_test['category_name'] = target_encode(df_train['category_name'], df_test['category_name'], df_train['deal_probability'])

df_train['region'], df_test['region'] = target_encode(df_train['region'], df_test['region'], df_train['deal_probability'])
df_train['image_top_1'], df_test['image_top_1'] = target_encode(df_train['image_top_1'], df_test['image_top_1'], df_train['deal_probability'])

df_train['city'], df_test['city'] = target_encode(df_train['city'], df_test['city'], df_train['deal_probability'])

df_train['param_1'], df_test['param_1'] = target_encode(df_train['param_1'], df_test['param_1'], df_train['deal_probability'])
df_train['param_2'], df_test['param_2'] = target_encode(df_train['param_2'], df_test['param_2'], df_train['deal_probability'])
df_train['param_3'], df_test['param_3'] = target_encode(df_train['param_3'], df_test['param_3'], df_train['deal_probability'])

In [None]:
df_train.drop(['date', 'day', 'user_id'], axis=1, inplace=True)
df_test.drop(['date', 'day', 'user_id'], axis=1, inplace=True)

**Text Features:**

We have several features with text data, and they need to be processed in different ways. But first, let's create new features based on the texts:
- length of text (symbols)
- number of words
- counts of punctuation
- counts of strange symbols (emoticons)

In [None]:
df_train["len_title"] = df_train["title"].apply(lambda x: len(x))
df_train["words_title"] = df_train["title"].apply(lambda x: len(x.split()))

df_train["len_description"] = df_train["description"].apply(lambda x: len(x))
df_train["words_description"] = df_train["description"].apply(lambda x: len(x.split()))

df_train["len_params"] = df_train["params"].apply(lambda x: len(x))
df_train["words_params"] = df_train["params"].apply(lambda x: len(x.split()))

df_train['symbol1_count'] = df_train['description'].str.count('↓')
# str.count treats the pattern as a regex, so metacharacters (* . ?) are escaped
df_train['symbol2_count'] = df_train['description'].str.count(r'\*')
df_train['symbol3_count'] = df_train['description'].str.count('✔')
df_train['symbol4_count'] = df_train['description'].str.count('❀')
df_train['symbol5_count'] = df_train['description'].str.count('➚')
df_train['symbol6_count'] = df_train['description'].str.count('ஜ')
df_train['symbol7_count'] = df_train['description'].str.count(r'\.')
df_train['symbol8_count'] = df_train['description'].str.count('!')
df_train['symbol9_count'] = df_train['description'].str.count(r'\?')
df_train['symbol10_count'] = df_train['description'].str.count('  ')
df_train['symbol11_count'] = df_train['description'].str.count('-')
df_train['symbol12_count'] = df_train['description'].str.count(',')

df_test['len_title']         = df_test['title'].apply(lambda x: len(x))
df_test['words_title']       = df_test['title'].apply(lambda x: len(x.split()))
df_test['len_description']   = df_test['description'].apply(lambda x: len(x))
df_test['words_description'] = df_test['description'].apply(lambda x: len(x.split()))
df_test['len_params']        = df_test['params'].apply(lambda x: len(x))
df_test['words_params']      = df_test['params'].apply(lambda x: len(x.split()))

df_test['symbol1_count'] = df_test['description'].str.count('↓')
# str.count treats the pattern as a regex, so metacharacters (* . ?) are escaped
df_test['symbol2_count'] = df_test['description'].str.count(r'\*')
df_test['symbol3_count'] = df_test['description'].str.count('✔')
df_test['symbol4_count'] = df_test['description'].str.count('❀')
df_test['symbol5_count'] = df_test['description'].str.count('➚')
df_test['symbol6_count'] = df_test['description'].str.count('ஜ')
df_test['symbol7_count'] = df_test['description'].str.count(r'\.')
df_test['symbol8_count'] = df_test['description'].str.count('!')
df_test['symbol9_count'] = df_test['description'].str.count(r'\?')
df_test['symbol10_count'] = df_test['description'].str.count('  ')
df_test['symbol11_count'] = df_test['description'].str.count('-')
df_test['symbol12_count'] = df_test['description'].str.count(',')
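
One thing to keep in mind with these counts is that Series.str.count interprets its argument as a regular expression. The toy example below (the two strings are made up) shows the difference between an unescaped and an escaped dot.

```python
import pandas as pd

# Made-up descriptions, only to illustrate the behaviour
descriptions = pd.Series(["Good price. Call now!", "New*** item..."])

# An unescaped dot is the regex wildcard: it matches every character.
any_char_count = descriptions.str.count(".")
# Escaping the dot counts literal full stops only.
dot_count = descriptions.str.count(r"\.")
# The same applies to other metacharacters such as * and ?.
star_count = descriptions.str.count(r"\*")

print(any_char_count.tolist())
print(dot_count.tolist())
print(star_count.tolist())
```

So an unescaped '.' effectively duplicates the text length feature instead of counting full stops.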

Now let's start transforming the texts. Titles contain relatively few unique words, so we can use the default values for TfidfVectorizer (only adding stopwords). I have to limit max_features due to memory constraints. I won't use descriptions and parameters due to kernel limits.

In [None]:
# stop_words is passed as a list, since recent scikit-learn versions do not accept a set
vectorizer = TfidfVectorizer(stop_words = list(stop), max_features = 6000)
vectorizer.fit(df_train["title"])

df_train_title = vectorizer.transform(df_train["title"])
df_test_title  = vectorizer.transform(df_test["title"])


In [None]:
df_train.drop(["title", "params", "description", "user_type", "activation_date"], axis=1, inplace=True)
df_test.drop(["title", "params", "description", "user_type", "activation_date"], axis=1, inplace=True)

In [None]:
pd.set_option('display.max_columns', 60)
df_train.head()

**Meta-features:**

One possible idea is creating meta-features: we use some of the features to build a model A, and then use the predictions of model A as a feature in another model B. I'll use ridge regression to create a new feature based on the tokenized title, and then I'll combine it with the other features.

In [None]:
%%time

X_meta = np.zeros((df_train_title.shape[0], 1))
X_test_meta = []

for fold_i, (train_i, test_i) in enumerate(kf.split(df_train_title)):
    print(fold_i)
    model = Ridge()
    # Fit on the training folds, then predict the held-out fold (out-of-fold predictions)
    model.fit(df_train_title[train_i], df_train["deal_probability"].values[train_i])
    X_meta[test_i, :] = model.predict(df_train_title[test_i]).reshape(-1, 1)
    X_test_meta.append(model.predict(df_test_title))
    
X_test_meta = np.stack(X_test_meta)
X_test_meta_mean = np.mean(X_test_meta, axis=0)
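
The training part of this loop produces out-of-fold predictions: every training row is predicted by a model that did not see it. As a sketch, scikit-learn's cross_val_predict gives the same kind of output in one call. The data below is synthetic (a random sparse matrix standing in for the TF-IDF features), and the names X_toy, y_toy and X_meta_toy are hypothetical.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins for the TF-IDF matrix and the target; not the competition data.
X_toy = sparse_random(200, 50, density=0.1, format="csr", random_state=0)
y_toy = rng.uniform(0, 1, size=200)

# Each row is predicted by a model fitted on the other folds only.
oof_pred = cross_val_predict(Ridge(), X_toy, y_toy, cv=KFold(n_splits=5))

# Reshape to a single column so it can be stacked next to the other features.
X_meta_toy = oof_pred.reshape(-1, 1)
print(X_meta_toy.shape)
```

The explicit loop is still useful here, because it also collects the per-fold predictions for the test set, which cross_val_predict does not return.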

In [None]:
X_full = csr_matrix(hstack([df_train.drop(['item_id', 'deal_probability', 'image'], axis=1), X_meta]))
X_test_full = csr_matrix(hstack([df_test.drop(['item_id', 'image'], axis=1), X_test_meta_mean.reshape(-1, 1)]))

X_train, X_valid, y_train, y_valid = train_test_split(X_full, df_train["deal_probability"], test_size=0.20, random_state=42)

**Building a simple model:**

In [None]:
def rmse(predictions, targets):
    return np.sqrt( ( (predictions - targets) ** 2).mean() )

In [None]:
# took parameters from this kernel:  https://www.kaggle.com/the1owl/beep-beep

params = {"learning_rate": 0.08,
          "max_depth": 8,
          "boosting": "gbdt",
          "objective": "regression",
          "metric": "rmse",  # competition metric; AUC is a classification metric
          "is_training_metric": True,
          "seed": 19,
          "num_leaves": 63,
          "feature_fraction": 0.9,
          "bagging_fraction": 0.8,
          "bagging_freq": 5
         }

model = lgb.train(params,
                 lgb.Dataset(X_train, label=y_train),
                 2000,
                 lgb.Dataset(X_valid, label=y_valid),
                 verbose_eval=50,
                 early_stopping_rounds=20)

In [None]:
pred = model.predict(X_test_full)

# Clipping is necessary: the regression output can fall outside the valid [0, 1] range.
df_sub['deal_probability'] = np.clip(pred, 0, 1)
df_sub.to_csv('sub.csv', index=False)