# Real-time bidding/Online Auctioning System

    Real-time bidding refers to the online auction process in which ad impressions are bought
    and sold in real time, often through an ad exchange.

    The reserve price is the lowest price, or starting bid, at which a company/publisher
    is willing to sell an ad. If the bidders have not met the reserve price when the auction closes, the seller is not obligated to sell.

    Floor CPM is the threshold set by the publisher: the minimum cost per impression at which the publisher agrees to sell inventory on the ad exchange or ad network. For instance, if the publisher sets a CPM price floor of $1, campaigns below that amount will not be served on the publisher's website. We will therefore first calculate CPM from the company/publisher's total revenue and then use CPM (1% of actual revenue) as the target variable that our model predicts.

### **Technique**:
Advertisers running CPM ads set their desired price per 1000 ads served and pay each time their ad appears. As the publisher, the company earns revenue each time a CPM ad is served to the webpage and viewed by a user.

CPM ads compete against cost-per-click (CPC) ads in the ad auction, and the auction displays whichever ad is expected to earn more revenue for the company.

We assume the data belongs to a digital marketing company that published the ads in the user space and earned the revenue.

The performance of the auction system is measured by CPM (cost per mille, i.e. cost per thousand impressions):
  
        CPM = (revenue / impressions) * 1000
        
Let's say:
    
    * The total cost of running an ad on a website is $15,000.
    
    * The total number of impressions generated is 2,400,000.

CPM is calculated as: ($15,000 / 2,400,000) x 1000 = $0.00625 x 1000 = $6.25
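
The worked example above can be checked with a few lines of Python. This is a minimal sketch: the function name `cpm` is hypothetical, and returning 0 when there are no impressions is an assumption chosen to match the zero-division guard used later in this notebook.

```python
def cpm(revenue, impressions):
    """Return cost per mille: revenue per 1000 impressions."""
    # Assumption: rows with zero impressions are given a CPM of 0
    if impressions == 0:
        return 0.0
    return revenue / impressions * 1000


# Figures from the example above: $15,000 over 2,400,000 impressions
example_cpm = cpm(15_000, 2_400_000)
print(f"CPM = ${example_cpm:.2f}")
```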

# **Dataset Analysis:**

The given dataset has the following columns:

    1. date- Date on which the ads were published in the communication channel
    2. site_id- Id of a website owned by the marketing company and provided for publishing the ads
    3. ad_type_id- Advertisement type Id for categories such as Health, Technology, etc.
    4. geo_id- Geographic location Id of the country
    5. device_category_id- Category Id of the accessing device, such as Tablet, Laptop, Desktop, Smartphone
    6. advertiser_id- Advertiser Id; denotes a bidder in the auction
    7. order_id- Order Id created for the bidder's auction
    8. line_item_type_id- Line item type Id for ads
    9. os_id- OS Id; denotes the operating system, such as Windows, Linux, Android, iOS
    10. integration_type_id- How the client integrates with the advertiser
    11. monetization_channel_id- Channel through which the customer integrates with the advertiser
    12. ad_unit_id- Id of an ad unit on the web page
    13. total_impressions- Measured impressions (views, shares, likes, abuses) for posted ads. Showing an ad to a user constitutes one impression.
    14. total_revenue- Measured revenue for the total impressions
    15. viewable_impressions- Number of impressions counted as viewable. (An ad that stays on the website for a certain time can be considered viewable.)
    16. measurable_impressions- Impressions that were measurable by Active View out of the total number of eligible impressions. As a share of eligible impressions, this should generally be close to 100%.
    17. revenue_share_percent- The company charges a share for the services it provides to clients, which is how its revenue is generated. This is the commission percentage of revenue that is paid to the publisher.

# **Data Science Report**

    1. Load Ascendeas Dataset.
    2. Feature Engineering.
        Induce CPM into dataset
    3. Exploratory Data Analysis/Data Wrangling
    4. Modelling
    5. Building ML Pipeline
    6. Prediction on CPM
    7. Evaluation
    8. Saving the output File

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization Libraries
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **1. Load Dataset**

In [None]:
ascendeas_df= pd.read_csv('../input/real-time-advertisers-auction/Dataset.csv')
ascendeas_df.head()

In [None]:
date_values= ascendeas_df.date.values

In [None]:
## Converting date time values to date for easy interpretation
ascendeas_df['date'] = pd.to_datetime(ascendeas_df['date'])
ascendeas_df.head()

# **2. Feature Engineering**

### **Induce/Ingest CPM into Dataset**

In [None]:
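def revenue_per_impressions(r, i):
    # Guard against division by zero: rows without impressions get 0
    return r / i if i else 0

# CPM as computed in this notebook: (total_revenue * 100) / measurable_impressions * 1000
ascendeas_df['CPM'] = ascendeas_df.apply(
    lambda x: revenue_per_impressions(x['total_revenue'] * 100, x['measurable_impressions']) * 1000,
    axis=1)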
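
Row-wise `apply` can be slow on a dataset of this size. The sketch below computes the same quantity with NumPy arrays. The helper name `add_cpm` and the small `demo` frame are hypothetical and only illustrate the idea; the formula and the zero-impression rule are taken from the cell above.

```python
import numpy as np
import pandas as pd


def add_cpm(df, revenue_col="total_revenue", impressions_col="measurable_impressions"):
    """Return a copy of df with a CPM column, computed without a row-wise apply."""
    revenue = df[revenue_col].to_numpy(dtype=float)
    impressions = df[impressions_col].to_numpy(dtype=float)

    # Start from 0 so rows with zero impressions keep a CPM of 0
    cpm = np.zeros(len(df), dtype=float)
    nonzero = impressions != 0
    cpm[nonzero] = revenue[nonzero] * 100 / impressions[nonzero] * 1000
    return df.assign(CPM=cpm)


# Synthetic rows, not taken from the dataset
demo = pd.DataFrame({
    "total_revenue": [0.5, 0.0, 1.0],
    "measurable_impressions": [10, 0, 4],
})
print(add_cpm(demo))
```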

In [None]:
ascendeas_df['CPM'].describe()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(ascendeas_df['CPM'], kde=False)

# **3. Exploratory Data Analysis**

In [None]:
ascendeas_df.info()

### **Handling Null Values**

In [None]:
# Checking for null values
ascendeas_df.isnull().sum()

**Inference**: No Null values in the dataset to handle.

In [None]:
fig = px.line(ascendeas_df, x="date", y="CPM",title='CPM by date')
fig.show()

**Inference**: A high CPM of 283.62K is observed on June 11, 2019.

In [None]:
fig = px.line(ascendeas_df, x="date", y="total_revenue",title='Revenue across dates')
fig.show()

In [None]:
fig = px.line(ascendeas_df, x="date", y="measurable_impressions",title='Impressions across dates')
fig.show()

**Inference**: Total revenue was high on June 21, driven by a high number of measurable impressions.

In [None]:
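# Pass the column by keyword; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x="device_category_id", data=ascendeas_df)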

**Inference**: Device category 2 accounts for the highest number of records.

In [None]:
sns.countplot(x="site_id", data=ascendeas_df)

**Inference**: Traffic is highest for website id=346. Advertisers who want to show their banners to users of this website place their bids here.

In [None]:
f, ax = plt.subplots(1,1, figsize=(6,4))
total = float(len(ascendeas_df))
sns.countplot(x="ad_type_id", data=ascendeas_df, ax=ax)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(100*height/total),
            ha="center") 
plt.show()

**Inference**: 96.85% of Published Ads are of Type 10.

In [None]:
fig,axes=plt.subplots(1,1,figsize=(12,8))
sns.scatterplot(x='total_impressions',y='measurable_impressions',data=ascendeas_df)

### **Finding Correlation (HeatMap)**

In [None]:
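# Correlate numeric columns only; the datetime 'date' column is excluded
corr = ascendeas_df.select_dtypes(include="number").corr()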
plt.figure(figsize=(18,9))
sns.heatmap(data=corr,vmin=0, vmax=1, cmap="RdYlGn",square=True, annot=True)
plt.show()

In [None]:
for col in ascendeas_df.columns:
    print('{}: {} \n'.format(col,ascendeas_df[col].unique()))

**Inference**:

    * We will drop 'measurable_impressions' and 'total_revenue' to reduce dimensionality,
    since CPM is calculated from these features.
    * We will drop 'site_id', since it is highly correlated with 'ad_unit_id'.
    * We will drop 'integration_type_id' and 'revenue_share_percent', because these columns contain only one value.
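
Single-valued columns and highly correlated pairs can also be found programmatically instead of by reading the printed output. The following is a sketch under assumptions: the helper names `constant_columns` and `correlated_pairs`, the 0.9 threshold, and the `demo` frame are all hypothetical and not part of the original analysis.

```python
import pandas as pd


def constant_columns(df):
    """Columns that hold a single distinct value and so carry no signal."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def correlated_pairs(df, threshold=0.9):
    """Pairs of numeric columns whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            # NaN correlations (e.g. constant columns) fail this comparison and are skipped
            if value >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Synthetic data, not taken from the dataset
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # perfectly correlated with "a"
    "c": [5, 5, 5, 5],   # constant
    "d": [4, 1, 3, 2],
})
print(constant_columns(demo))
print(correlated_pairs(demo))
```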
    

In [None]:
ascendeas_original= ascendeas_df.copy()
ascendeas_original.shape

In [None]:
ascendeas_df= ascendeas_df.drop(['integration_type_id', 'revenue_share_percent','site_id', 'measurable_impressions', 'total_revenue'], axis = 1)
ascendeas_df.info()

In [None]:
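# Correlate numeric columns only; the datetime 'date' column is excluded
corr = ascendeas_df.select_dtypes(include="number").corr()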
plt.figure(figsize=(14,8))
sns.heatmap(data=corr,vmin=0, vmax=1, cmap="RdYlGn",square=True, annot=True)
plt.show()

### **Handling Outliers**

In [None]:
## Checking outliers using Box plot
sns.boxplot(x=ascendeas_df["CPM"],color="red")

In [None]:
# Remove the extremes/outliers from CPM
# Keep only rows between the 5th and 95th percentiles (the middle 90% of CPM values)
ascendeas_df = ascendeas_df[ascendeas_df['CPM'].between(ascendeas_df['CPM'].quantile(.05), ascendeas_df['CPM'].quantile(.95))]
sns.boxplot(x=ascendeas_df["CPM"],color="green")
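
The filter above keeps rows whose CPM lies between the 5th and 95th percentiles. A reusable version might look like the sketch below; the function name `trim_by_quantile` and the synthetic `demo` frame are hypothetical, and the default bounds simply repeat the ones used in the cell above.

```python
import pandas as pd


def trim_by_quantile(df, column, lower=0.05, upper=0.95):
    """Keep rows whose value in `column` lies within the [lower, upper] quantile range."""
    low_value, high_value = df[column].quantile([lower, upper])
    # Series.between is inclusive on both ends by default
    return df[df[column].between(low_value, high_value)]


# Synthetic data: CPM values 1..100
demo = pd.DataFrame({"CPM": [float(v) for v in range(1, 101)]})
trimmed = trim_by_quantile(demo, "CPM")
print(len(demo), len(trimmed))
```

Trimming by percentile always removes a fixed share of rows, whether or not they are genuine outliers, so the bounds are worth revisiting if the CPM distribution changes.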

In [None]:
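# distplot is deprecated in recent seaborn releases; histplot with kde=True overlays the density curve
sns.histplot(ascendeas_df["CPM"], kde=True)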

In [None]:
ascendeas_df.shape, ascendeas_original.shape

# **4. Modelling**

### **Split Dataset (Train, Test)**

In [None]:
# Separate the features (X) from the target (y); 'date' is not used as a feature
from sklearn.model_selection import train_test_split
X= ascendeas_df.drop(["date","CPM"],axis=1)
y= ascendeas_df.CPM.values
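
The cell above builds `X` and `y` but does not hold out a test set. One way to do the train/test split named in the section heading is to split chronologically, as sketched below. The helper `split_by_date`, the cutoff date, and the `demo` frame are assumptions for illustration and not part of the original notebook.

```python
import pandas as pd


def split_by_date(df, cutoff, date_col="date", target_col="CPM"):
    """Split chronologically: rows before `cutoff` train, rows on or after it test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]

    # The date and the target are excluded from the feature matrix
    X_train = train.drop(columns=[date_col, target_col])
    X_test = test.drop(columns=[date_col, target_col])
    return X_train, X_test, train[target_col], test[target_col]


# Synthetic data: ten daily rows with one feature column
demo = pd.DataFrame({
    "date": pd.date_range("2019-06-01", periods=10, freq="D"),
    "geo_id": range(10),
    "CPM": [float(v) for v in range(10)],
})
X_train, X_test, y_train, y_test = split_by_date(demo, "2019-06-08")
print(X_train.shape, X_test.shape)
```

A chronological split keeps later days out of training, which is closer to how the model would be used to predict future CPM.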

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from statsmodels.api import OLS
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# **5. Building ML Pipeline**

In [None]:
pipelines = []
pipelines.append(('ScaledLinear', Pipeline([('Scaler', StandardScaler()),('LR',LinearRegression())])))
# LogisticRegression is a classifier and cannot be fit on the continuous CPM target, so it is left out
pipelines.append(('ScaledXGB', Pipeline([('Scaler', StandardScaler()),('XGBR', XGBRegressor())])))
pipelines.append(('ScaledCatBoost', Pipeline([('Scaler', StandardScaler()),('CatBoost', CatBoostRegressor())])))

In [None]:
results = []
names = []
for name, model in pipelines:
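    # random_state only takes effect when shuffle=True; recent scikit-learn raises an error otherwise
    kfold = KFold(n_splits=10, shuffle=True, random_state=21)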
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='neg_root_mean_squared_error')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# Algorithm comparison
fig = plt.figure(figsize=(8,5))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

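**Inference**: CatBoost Regressor performs best, with the highest mean score and a low standard deviation. With `neg_root_mean_squared_error`, a higher score (closer to zero) means a lower error.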

# **CatBoost Regressor**

In [None]:
catboost = CatBoostRegressor(n_estimators=3000, depth=10)
catboost.fit(X,y)

# **6. Prediction**

In [None]:
ypreds= catboost.predict(X)

In [None]:
y.shape, ypreds.shape

# **7. Evaluation**

In [None]:
from sklearn.metrics import mean_squared_error
# Note: the model was fit on X, so these are training-set errors, not held-out estimates
print('MSE for CatBoost Model:',mean_squared_error(y, ypreds))
print('RMSE for CatBoost Model:', np.sqrt(mean_squared_error(y, ypreds)))

In [None]:
catboost_df= pd.DataFrame(ascendeas_df.date.values, columns=['Date'])
catboost_df['Actual_CPM']= y
catboost_df['Predict_CPM']= ypreds
catboost_df.head()

In [None]:
catboost_df.shape

# **8. Saving Output File**

In [None]:
catboost_df.to_csv('/kaggle/working/CatBoost_Auction_Output.csv', index=False)

# **1. What is the potential revenue range our publisher can make in July?**

**Solution 1:** 
    
    Calculate the mean of total_revenue for June and round it up to the next value;
    revenue for July is expected to be approximately this value or above.

In [None]:
print('Approximate revenue for July:', np.round(ascendeas_original["total_revenue"].mean(),2))

    From the mean value above, i.e. 0.06974043163033072,
    the predicted revenue for the month of July is 0.07 and above.

**Solution 2**:

        1. Put the date and total revenue data in Excel.
        2. For the datetime 7/1/2019 0:00, use the formula below to forecast the total revenue.
        
   **=FORECAST(A567293,B2:B567292,A2:A567292)**

    The forecasted revenue values for the month of July are listed below.


    7/1/2019 0:00	0.071032874
    7/2/2019 0:00	0.07111786
    7/3/2019 0:00	0.071202896
    7/4/2019 0:00	0.071287981
    7/5/2019 0:00	0.071373115
    7/6/2019 0:00	0.0714583
    7/7/2019 0:00	0.071543533
    7/8/2019 0:00	0.071628817
    7/9/2019 0:00	0.07171415
    7/10/2019 0:00	0.071799533
    7/11/2019 0:00	0.071884965
    7/12/2019 0:00	0.071970447
    7/13/2019 0:00	0.072055979
    7/14/2019 0:00	0.07214156
    7/15/2019 0:00	0.072227191
    7/16/2019 0:00	0.072312871
    7/17/2019 0:00	0.072398601
    7/18/2019 0:00	0.07248438
    7/19/2019 0:00	0.07257021
    7/20/2019 0:00	0.072656088
    7/21/2019 0:00	0.072742017
    7/22/2019 0:00	0.072827994
    7/23/2019 0:00	0.072914022
    7/24/2019 0:00	0.073000099
    7/25/2019 0:00	0.073086226
    7/26/2019 0:00	0.073172402
    7/27/2019 0:00	0.073258628
    7/28/2019 0:00	0.073344903
    7/29/2019 0:00	0.073431228
    7/30/2019 0:00	0.073517602
    7/31/2019 0:00	0.073604026
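
Excel's FORECAST fits a least-squares line through the known (date, value) points and evaluates it at a new date. The sketch below does the same in Python with `numpy.polyfit`. The function name `linear_forecast` and the synthetic June series are hypothetical, and the sketch is not claimed to reproduce the figures listed above.

```python
import numpy as np
import pandas as pd


def linear_forecast(dates, values, future_dates):
    """Fit value = intercept + slope * day and evaluate the line at future_dates."""
    dates = pd.DatetimeIndex(pd.to_datetime(dates))
    future = pd.DatetimeIndex(pd.to_datetime(future_dates))

    # Express every date as days elapsed since the first observation
    origin = dates.min()
    one_day = pd.Timedelta(days=1)
    x = np.asarray((dates - origin) / one_day, dtype=float)
    x_future = np.asarray((future - origin) / one_day, dtype=float)

    # polyfit returns coefficients from the highest degree down: slope, then intercept
    slope, intercept = np.polyfit(x, np.asarray(values, dtype=float), deg=1)
    return pd.Series(intercept + slope * x_future, index=future)


# Synthetic June series with a small upward trend (not the real data)
june = pd.date_range("2019-06-01", periods=30, freq="D")
june_revenue = 0.07 + 0.0001 * np.arange(30)
july_forecast = linear_forecast(june, june_revenue, pd.date_range("2019-07-01", periods=3, freq="D"))
print(july_forecast)
```

With the notebook's data, `dates` and `values` would be the `date` and `total_revenue` columns of `ascendeas_original`.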

# **2. What reserve price can the publisher set?**

    In microeconomics, a reservation price is a limit on the price of a good or a service.
        * On the demand side, it is the highest price that a buyer is willing to pay.
        * On the supply side, it is the lowest price a seller is willing to accept for a good or service. 
        
    Assuming that ads fall on the demand side, we can use the maximum CPM as the reserve price. To find it, let's look at the descriptive statistics to get the min and max for CPM.

In [None]:
catboost_df["Actual_CPM"].describe()

In [None]:
catboost_df["Predict_CPM"].describe()

**Conclusion**:

    * Based on Actual data, one can set 526.923077 as their Reserve price.
    * Based on predicted data, one can set 548.027233 as their Reserve price.