The dataset covers several websites owned by the same company. The company wants help deciding how to approach setting reserve prices and what range of reserve prices to use for July. The data records the actual revenue generated, not individual bids. The dataset has the following columns:

1. **date**
1. **site_id**: each id denotes a different website
1. **ad_type_id**: each id denotes a different ad type, such as display, video, or text ads
1. **geo_id**: each id denotes a different country; most of the traffic comes from English-speaking countries
1. **device_category_id**: each id denotes a different device category, such as desktop, mobile, or tablet
1. **advertiser_id**: each id denotes a different bidder in the auction
1. **order_id**: can be ignored
1. **line_item_type_id**: can be ignored
1. **os_id**: each id denotes a different operating system, for the mobile device category only (Android, iOS, etc.); for all other device categories, os_id corresponds to not_mobile
1. **integration_type_id**: describes how the demand partner is set up within a publisher's ecosystem; it can be adserver (running through the publisher's ad server) or hardcoded
1. **monetization_channel_id**: describes the mode through which the demand partner integrates with a particular publisher; it can be header bidding (running via prebid.js), dynamic allocation, exchange bidding, direct, etc.
1. **ad_unit_id**: each id denotes a different ad unit (one page can have more than one ad unit)
1. **total_impressions**: measurement column counting the impressions for the particular set of dimensions
1. **total_revenue**: measurement column giving the revenue for the particular set of dimensions
1. **viewable_impressions**: number of impressions on the site that were viewable, out of all measurable impressions. A display ad is counted as viewable if at least 50% of its area was displayed on screen for at least one second
1. **measurable_impressions**: impressions that were measurable by Active View, out of the total number of eligible impressions. The measurable share should generally be close to 100%. For example, an impression rendered in a cross-domain iframe may not be measurable.
1. **revenue_share_percent**: not every advertiser passes all the revenue on to the publisher; they keep a certain share for the services they provide. This column captures the fraction of revenue that actually reaches the publisher's pocket.
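To make the relationship between the three impression columns concrete, here is a minimal sketch on made-up rows. Only the column names come from the dataset; the numbers and the derived names `measurable_share` and `viewability_rate` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy rows with made-up numbers; only the column names come from the dataset.
toy = pd.DataFrame({
    "total_impressions": [100, 50, 0],
    "measurable_impressions": [90, 50, 0],
    "viewable_impressions": [45, 40, 0],
})

# Replace zero denominators with NaN so that empty rows yield NaN, not an error.
total = toy["total_impressions"].where(toy["total_impressions"] != 0)
measurable = toy["measurable_impressions"].where(toy["measurable_impressions"] != 0)

# Share of impressions that Active View could measure (hypothetical column name).
toy["measurable_share"] = toy["measurable_impressions"] / total
# Share of measurable impressions that were viewable (hypothetical column name).
toy["viewability_rate"] = toy["viewable_impressions"] / measurable

print(toy)
```

Rows with no impressions end up as NaN in both derived columns, which keeps them distinguishable from rows with a genuine rate of zero.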

## Import libraries

In [None]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

## Read and preprocess data

In [None]:
df = pd.read_csv('/kaggle/input/real-time-advertisers-auction/Dataset.csv')

In [None]:
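def weird_division(n, d):
    # Safe division: return 0 instead of raising when the denominator is 0
    return n / d if d else 0


# CPM is derived row by row from revenue and measurable impressions
df['CPM'] = df.apply(
    lambda x: weird_division(
        x['total_revenue'] * 100,
        x['measurable_impressions']
    ) * 1000,
    axis=1,
)
# Drop rows with a negative CPM
df = df[df['CPM'] >= 0]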

df.head(2)
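The row-wise `apply` above can be slow on a large frame. The following is a minimal sketch of an equivalent vectorized computation on made-up rows, assuming neither column contains missing values (with `fillna(0)`, a missing input would become 0 here, whereas the row-wise version would propagate NaN). The helper `cpm_rowwise` and the toy numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has zero measurable impressions.
toy = pd.DataFrame({
    "total_revenue": [0.5, 0.0, 1.2],
    "measurable_impressions": [200, 0, 300],
})

def cpm_rowwise(row):
    # Same formula as the notebook: 0 when there are no measurable impressions.
    d = row["measurable_impressions"]
    return (row["total_revenue"] * 100 / d if d else 0) * 1000

rowwise = toy.apply(cpm_rowwise, axis=1)

# Vectorized version: mask zero denominators to NaN, divide, then map NaN back to 0.
impressions = toy["measurable_impressions"]
safe_impressions = impressions.where(impressions != 0)
vectorized = (toy["total_revenue"] * 100 / safe_impressions).fillna(0) * 1000

print(pd.DataFrame({"rowwise": rowwise, "vectorized": vectorized}))
```

Both columns should agree on every row, including the zero-impression row.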

In [None]:
# These values are known only after the auction
drop_columns = [
    'total_impressions',
    'total_revenue',
    'viewable_impressions',
    'measurable_impressions',
    'revenue_share_percent',
]


df = df.drop(columns=drop_columns)
df.head(2)

In [None]:
# Split train / test by date: rows on or after split_date go to the test set
split_date = '2019-06-22 00:00:00'
split_mask = df['date'] >= split_date

df_train = df[~split_mask]
df_test  = df[split_mask]
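The split above compares the `date` column to a string. That works as long as the column holds zero-padded `YYYY-MM-DD HH:MM:SS` strings, which is an assumption suggested by the format of `split_date` and not confirmed here. A minimal sketch on made-up rows checks that the string comparison agrees with an explicit datetime comparison:

```python
import pandas as pd

# Made-up rows; the date format is assumed to match split_date.
toy = pd.DataFrame({
    "date": [
        "2019-06-20 00:00:00",
        "2019-06-21 00:00:00",
        "2019-06-22 00:00:00",
        "2019-06-23 00:00:00",
    ],
    "CPM": [1.0, 2.0, 3.0, 4.0],
})
split_date = "2019-06-22 00:00:00"

# Lexicographic comparison on strings, as in the notebook.
string_mask = toy["date"] >= split_date
# Explicit comparison on parsed timestamps.
datetime_mask = pd.to_datetime(toy["date"]) >= pd.Timestamp(split_date)

toy_train = toy[~datetime_mask]
toy_test = toy[datetime_mask]

print((string_mask == datetime_mask).all())
print(len(toy_train), len(toy_test))
```

Parsing the column first is the safer choice if the date format is not guaranteed to be zero-padded.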

In [None]:
# Filter outliers: drop rows above the 95th percentile of CPM, computed separately for each split
train_quantile_95 = np.quantile(df_train['CPM'], 0.95)
df_train = df_train[df_train['CPM'] <= train_quantile_95]

test_quantile_95 = np.quantile(df_test['CPM'], 0.95)
df_test = df_test[df_test['CPM'] <= test_quantile_95]
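A minimal sketch on made-up numbers shows what the quantile filter does. The second part illustrates a possible variant, not what the notebook does: reusing the train threshold for the test rows, so that no test-set statistic enters the preprocessing. All values and variable names below are assumptions for illustration.

```python
import numpy as np

# Made-up CPM values with one obvious outlier.
train_cpm = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# 95th percentile with NumPy's default linear interpolation.
cutoff = np.quantile(train_cpm, 0.95)
train_kept = train_cpm[train_cpm <= cutoff]

# Variant (assumption, not the notebook's method): apply the train cutoff to test rows.
test_cpm = np.array([2.0, 5.0, 90.0])
test_kept = test_cpm[test_cpm <= cutoff]

print(cutoff)
print(train_kept)
print(test_kept)
```

With a heavy right tail, the cutoff sits between the largest ordinary value and the outlier, so only the outlier is removed.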

In [None]:
X_train = df_train.drop(columns=['date', 'CPM'])
X_test  = df_test.drop(columns=['date', 'CPM'])

y_train = df_train['CPM']
y_test  = df_test['CPM']

## Train model

All features are categorical IDs, so CatBoost, which handles categorical features natively, is a natural choice for this case.

In [None]:
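# CatBoostRegressor's default loss function is RMSE
model = CatBoostRegressor()
# Every feature is a categorical ID, so all columns are passed as cat_features
model.fit(X_train, y_train, cat_features=list(X_train.columns), verbose=100)

preds = model.predict(X_test)
# Note: the metric reported here is MSE, not RMSE
test_error = mean_squared_error(y_test, preds)

print(f'Test MSE: {test_error:.4f}')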

### Test MSE: 3209.8715
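MSE is expressed in squared CPM units, which makes it hard to read against actual CPM values. Taking the square root gives the RMSE, in the same units as CPM. A short sketch using the value reported above:

```python
import math

# Test MSE as reported above.
test_mse = 3209.8715

# RMSE is the square root of MSE and is in the same units as CPM.
test_rmse = math.sqrt(test_mse)

print(f"Test RMSE: {test_rmse:.2f}")
```

This comes out at roughly 56.66, meaning that the typical prediction error is on the order of 57 CPM units on the filtered test set.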