**Hypothesis I**: whether a user clicks an ad depends ONLY on the doc content and the access context, not on the other ads in the same display.

**Hypothesis II**: whether a user clicks an ad depends not only on the doc content and the access context, but also on the other ads in the same display.

TODO: validate which hypothesis holds.

We have checked that display_id in clicks_test.csv is ordered, so the rows of each display are contiguous and no groupby operation on clicks_test.csv is needed to collect the ads of a display.
The same holds for clicks_train.csv.
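
One possible way to run such an ordering check is sketched below on a toy frame. The column names follow clicks_train.csv, but the values are invented for illustration. When display_id is non-decreasing, the size of each display can be read off from the positions where the id changes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for clicks_train.csv (values are invented for illustration).
toy = pd.DataFrame({
    'display_id': [1, 1, 1, 2, 2, 3],
    'ad_id':      [42, 17, 99, 17, 5, 42],
    'clicked':    [0, 1, 0, 1, 0, 1],
})

# Non-decreasing display_id means each display occupies one contiguous run of rows.
is_ordered = bool(toy['display_id'].is_monotonic_increasing)

# A new display starts wherever display_id differs from the previous row.
changed = toy['display_id'].ne(toy['display_id'].shift()).to_numpy()
starts = np.flatnonzero(changed)

# Run lengths between consecutive start positions give the number of ads per display.
ads_per_display = np.diff(np.append(starts, len(toy))).tolist()
print(is_ordered, ads_per_display)
```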

For each display, we construct one instance for learning, namely,
$$ (X) \rightarrow y $$
Feature set $X$ is the content of the display, such as document content, user access context, and ad content. The first question is: what is the output label $y$? The naive answer is whether a single ad is clicked or not, as in the original layout of clicks_train.csv. This is consistent with Hypothesis I, so the data set for training is organized as:
$$(D, U, A) \rightarrow 1/0$$
where $D, U, A$ are the features of the document, the user and the ad respectively.

The second answer to the first question is more reasonable. The output label is a sequence of click/no-click flags, one for each ad in the same display (of course only one click is allowed).
Equivalently, the output is a number identifying which of the ads in the same display was clicked.
This is consistent with Hypothesis II. The difficulty with this model is that displays contain a varying number of ads. Since we only need to handle at most 12 ads per display in the test set, we can pad displays that have fewer than 12 ads with NULL ads.
So the data set for training is organized as:
$$(D,U,A_1,A_2,\ldots,A_{12}) \rightarrow n \qquad \qquad \text{(1)}$$
where $n$ is the index of the clicked ad $A_i$.
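
The padding step might look like the following sketch. The helper `display_to_instance` and the placeholder id `NULL_AD = 0` are hypothetical names chosen for illustration (this assumes that 0 is not a real ad_id), and the toy values are invented.

```python
import pandas as pd

MAX_ADS = 12   # maximum number of ads per display, as discussed in the text
NULL_AD = 0    # hypothetical placeholder id for a padded (absent) ad

def display_to_instance(ads, clicked):
    """Turn one display into (padded ad list, n), n being the 1-based position of the clicked ad."""
    ads = list(ads)
    clicked = list(clicked)
    padded = ads + [NULL_AD] * (MAX_ADS - len(ads))
    n = clicked.index(1) + 1
    return padded, n

# Toy stand-in for clicks_train.csv (values are invented for illustration).
toy = pd.DataFrame({
    'display_id': [1, 1, 1, 2, 2],
    'ad_id':      [42, 17, 99, 17, 5],
    'clicked':    [0, 1, 0, 0, 1],
})

# One training instance per display: 12 ad slots plus the class label n.
instances = {
    disp_id: display_to_instance(g['ad_id'], g['clicked'])
    for disp_id, g in toy.groupby('display_id')
}
print(instances[1])
```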

But wait a minute. Let's check what the test set asks for. It asks for the ranking of the ads in the same display. That is:
$$(D,U,A_1,A_2,\ldots,A_{12}) \rightarrow \text{rank of } A_1, \ldots, A_{12} \qquad \text{(2)}$$
In the train set, however, the non-clicked ads carry no information that distinguishes them from their sibling ads.
Only the single clicked ad stands out among the ads in the same display.
We might therefore suspect that the train set does not provide enough information to answer the test question,
unless we make some approximations.

For a test case, we first ask the trained model which ad is the best guess for the user's click.
Say $A_i$ is the answer for the test case.
Then we remove $A_i$ from the test case and ask the trained model again.
Suppose $A_j$ is the answer in the second round for the test case.
Then $A_j$ takes the 2nd rank among the displayed ads, and so on until all ads are ranked.
In this way, problem (2) can be solved by solving problem (1) repeatedly.
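
The elimination procedure can be sketched as below. `predict_best` is a hypothetical stand-in for the trained 12-class model, and the toy scores are invented purely to make the example runnable.

```python
def rank_by_elimination(ads, predict_best):
    """Rank ads by repeatedly asking for the most likely click and removing it.

    `predict_best` is a hypothetical stand-in for the trained model: it takes
    the list of remaining ads and returns the index of the predicted click.
    """
    remaining = list(ads)
    ranking = []
    while len(remaining) > 1:
        best = predict_best(remaining)
        ranking.append(remaining.pop(best))
    ranking.extend(remaining)  # the last ad left takes the final rank
    return ranking

# Toy model for illustration only: it pretends the ad with the largest score is clicked.
toy_scores = {'a': 0.1, 'b': 0.7, 'c': 0.2, 'd': 0.4}

def toy_predict_best(remaining):
    return max(range(len(remaining)), key=lambda i: toy_scores[remaining[i]])

ranking = rank_by_elimination(['a', 'b', 'c', 'd'], toy_predict_best)
print(ranking)
```

A display with $k$ ads needs $k-1$ queries to the model, which is why the number of test instances is larger than the number of test displays.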

**So we have mapped the Challenge to a 12-class classification problem.**
Let's check how large the train and test sets are.

In [1]:
import pandas as pd

dfTrain = pd.read_csv('../input/clicks_train.csv')
print('there are ', len(dfTrain), ' rows in the original train set.')
gpTrain = dfTrain.groupby('display_id')
print('there are ', len(gpTrain), ' displays in the train set.')

dfTest = pd.read_csv('../input/clicks_test.csv')
print('there are ', len(dfTest), ' rows in the original test set.')
gpTest = dfTest.groupby('display_id')
print('there are ', len(gpTest), ' displays in the test set.')
# ranking a display with k ads takes k - 1 classification rounds,
# so the total is (number of rows) - (number of displays)
countTestInst = len(dfTest) - len(gpTest)
print('there are ', countTestInst, ' test instances all together.')

In the end, we have 16M train cases and 25M test instances (derived from 6M displays) of a 12-class classification problem.
This is clearly a big-data learning problem, even before considering the dimension of the features of the data set.

## Deriving features of $D$ and $U$ from display_id

In [104]:
import numpy as np
# the platform column mixes int and char values; the cast to int64 is left to the data cleaning section
dfEvent = pd.read_csv('../input/events.csv', index_col='display_id')
print(dfEvent.dtypes)
dfEvent.head(5)

The DtypeWarning message indicates that some data cleaning is needed. All such cleaning jobs are described in a later section.
For now, let's assume that all the data is correct.

display_id determines the uuid, the doc_id and the other context of the display, such as timestamp, platform and geo.
The geo feature can be further decoded into country, state and DMA features.
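
A minimal sketch of this decoding follows. It assumes that geo_location is a '>'-separated string of the form country>state>DMA in which trailing parts may be missing. The toy values and the output column names are chosen for illustration.

```python
import pandas as pd

# Toy geo_location values; the 'country>state>DMA' layout is an assumption.
toy = pd.DataFrame({'geo_location': ['US>SC>519', 'CA>BC', 'GB', None]})

# Split into at most three parts; missing parts become NaN/None.
parts = toy['geo_location'].str.split('>', expand=True)
parts = parts.reindex(columns=range(3))   # guarantee exactly three columns
parts.columns = ['country', 'state', 'dma']

toy = toy.join(parts)
print(toy)
```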

Don't forget to check that display_id is unique, so that it can be used as the index of the dataframe.

In [105]:
print(len(dfEvent.index.unique()))
print(len(dfEvent))
print('the number of docs defined in events ', len(dfEvent['document_id'].unique()))
print('the number of uuid defined in events ', len(dfEvent['uuid'].unique()))

### Document content features

From doc_id, we can derive many content features of the doc, which make up the features $D$.

In [106]:
dfDocMeta = pd.read_csv('../input/documents_meta.csv')
print(dfDocMeta.dtypes)
print('the unique doc_id number ', len(dfDocMeta['document_id'].unique()))
print('the rows of doc_meta ', len(dfDocMeta))
# if the above two are equal, we are safe to make doc_id as index
dfDocMeta.set_index(dfDocMeta.document_id, inplace=True)
dfDocMeta.head(5)

This table shows that ~3M docs have meta information.
So the features and dimension of the meta data are:
$$Meta = [srcid, pubid, pubtime], |Meta|=3$$
Let's see what topic information is provided for the docs.

In [107]:
dfDocTopic = pd.read_csv('../input/documents_topics.csv')
print(dfDocTopic.dtypes)
dfDocTopic.head(5)

Let's check how many topics and docs are described in this file.

In [108]:
print('the number of unique topic_id', len(dfDocTopic['topic_id'].unique()))
print('the number of unique doc_id', len(dfDocTopic['document_id'].unique()))

This means 2.5M docs have been categorized into 300 topics, and the same doc_id may be assigned several topic_ids, each with its own confidence_level.
Let's check whether the number of topic_ids is the same for every doc_id.

In [109]:
gpDocTopic = dfDocTopic.groupby('document_id')
print(gpDocTopic['topic_id'].count().describe())

We have a problem here.
A doc_id can be assigned as many as 39 topic_ids and as few as 1.
We choose 7 topic_ids as the fixed number for each doc_id:
fill in NULL if the doc_id has fewer than 7 topic_ids, and keep only the 7 with the highest confidence_level if the doc_id has more than 7.
So the topic feature should be:
$$Top =[(topicid,conflevl)_1, \ldots, (topicid,conflevl)_7], |Top|=14$$
Note that about 3M docs have meta info, but only 2.5M docs have been assigned topic features.
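
The truncate-and-pad rule could be implemented along the following lines. This is only a sketch on invented toy values. The column names follow documents_topics.csv, and NaN plays the role of NULL.

```python
import pandas as pd

K = 7  # number of (topic_id, confidence_level) pairs kept per document

# Toy stand-in for documents_topics.csv (values are invented for illustration).
toy = pd.DataFrame({
    'document_id':      [10, 10, 10, 20],
    'topic_id':         [1, 2, 3, 2],
    'confidence_level': [0.2, 0.9, 0.5, 0.4],
})

# Rank topics within each document by descending confidence and keep the top K.
toy = toy.sort_values(['document_id', 'confidence_level'], ascending=[True, False])
toy['rank'] = toy.groupby('document_id').cumcount()
top = toy[toy['rank'] < K]

# One row per document; slots a document does not fill stay NaN (the NULL padding).
wide = top.pivot(index='document_id', columns='rank',
                 values=['topic_id', 'confidence_level'])
all_slots = pd.MultiIndex.from_product([['topic_id', 'confidence_level'], range(K)])
wide = wide.reindex(columns=all_slots)
print(wide.shape)
```

The same pattern would apply to the entity and category files with a different `K`.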

Now let's move on to the entity features of the docs.

In [110]:
dfDocEnt = pd.read_csv('../input/documents_entities.csv')
print(dfDocEnt.dtypes)
dfDocEnt.head(5)

Following the same analysis as for dfDocTopic, we have:

In [111]:
gpDocEnt = dfDocEnt.groupby('document_id')
print(gpDocEnt['entity_id'].count().describe())

So we have 1.3M entities to be assigned as the features of 1.8M docs.
We choose 4 (entity, confidence) pairs as the entity features of a doc:
$$Ent =[(entityid,conflevl)_1, \ldots, (entityid,conflevl)_4],|Ent|=8$$

Now let's check the category features of the docs.

In [112]:
dfDocCat = pd.read_csv('../input/documents_categories.csv')
print(dfDocCat.dtypes)
print(dfDocCat.head(5))
gpDocCat = dfDocCat.groupby('document_id')
print(gpDocCat['category_id'].count().describe())

It follows directly that:
$$Cat =[(catid,conflevl)_1, (catid,conflevl)_2],|Cat|=4$$
So, the features of the doc content can be defined as:
$$D = Meta \cup Ent \cup Top \cup Cat \qquad \qquad \text{(4)}$$
and the dimension is $|D|=3+8+14+4=29$.

### User features

In [113]:
dfPageView = pd.read_csv('../input/page_views_sample.csv')
print(dfPageView.dtypes)
dfPageView.head(5)

Since this is a sample from page_views.csv, we cannot analyse the distribution of fields. 
The (uuid, doc_id) pair defines the user features when accessing the doc.
But how does it relate to the same pair defined in events.csv?

In [114]:
for i in range(5):
    pv = dfPageView.iloc[i]
    print(pv['uuid'])
    # look up the same (uuid, document_id) pair in events.csv
    print(dfEvent[(dfEvent.uuid==pv['uuid']) & (dfEvent.document_id==pv['document_id'])])
    #print(timestamp,platform,geo)

We see that not all (uuid,doc_id) pairs are included in events.csv.
Let's do the reverse check.
Since the PageView sample is not the complete set, we have to iterate over all of Events to look for possible matches.

In [115]:
countUnmatched = 0
countMatched = 0
count = 0
for i in range(len(dfEvent)):
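    ev = dfEvent.iloc[i]
    # match on all five fields shared by events.csv and the page view sample
    matchedPV = dfPageView[(dfPageView.uuid==ev['uuid']) & (dfPageView.document_id==ev['document_id']) & (dfPageView.timestamp==ev['timestamp']) & (dfPageView.platform==ev['platform']) & (dfPageView.geo_location==ev['geo_location'])]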
    if len(matchedPV) == 0:
        # this should not happen for page_view.csv, the complete set
        countUnmatched += 1
        continue
    elif len(matchedPV) > 1:
        print('multiple matched found in pageview!')
        print(i, ev)
        print(matchedPV.head(10))
        break
    countMatched += 1
    #print(i,ev)
    #print(matchedPV.head(0))
    if countMatched == 5 :
        break
print('we have found matched pv ', countMatched)
print('and unmatched pv in sample_pageview ', countUnmatched, ' in the first ',i+1,' rows of events.csv.')

We can see that page_view provides only one extra feature for the user access context: traffic_source.
But looking up that traffic_source is very costly.
Our first attempt will drop this expensive feature, which means we do not use the page_views.csv file.
If we do not reach an acceptable rank in the competition, we will then add the traffic_source feature.
So, we have the user features:
$$
U = [timestamp, platform, cn, state, DMA, (traffic\_source)], \quad |U| = 5 \text{ or } 6
\qquad \qquad \text{(5)}
$$
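
If traffic_source is added later, the row-by-row lookup is not the only option. A vectorised left merge on the five shared fields is one possible alternative, sketched here on invented toy frames. It assumes the key columns have matching dtypes in both files (for example after cleaning the platform column), and for the full page_views.csv memory use would still need attention.

```python
import pandas as pd

# Fields shared by events.csv and the page view data.
keys = ['uuid', 'document_id', 'timestamp', 'platform', 'geo_location']

# Toy stand-ins for events.csv and page_views_sample.csv (values are invented).
events = pd.DataFrame({
    'display_id':   [1, 2, 3],
    'uuid':         ['u1', 'u2', 'u3'],
    'document_id':  [100, 200, 300],
    'timestamp':    [61, 81, 182],
    'platform':     [3, 2, 2],
    'geo_location': ['US>SC>519', 'US>CA>807', 'CA>BC'],
})
page_views = pd.DataFrame({
    'uuid':           ['u1', 'u3', 'u9'],
    'document_id':    [100, 300, 900],
    'timestamp':      [61, 182, 5],
    'platform':       [3, 2, 1],
    'geo_location':   ['US>SC>519', 'CA>BC', 'GB'],
    'traffic_source': [1, 2, 3],
})

# A left merge keeps every event; unmatched events get NaN for traffic_source.
merged = events.merge(page_views, on=keys, how='left', indicator=True)
n_matched = int((merged['_merge'] == 'both').sum())
n_unmatched = int((merged['_merge'] == 'left_only').sum())
print(n_matched, n_unmatched)
```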

In [116]:
dfPromotedCont = pd.read_csv('../input/promoted_content.csv')
print(dfPromotedCont.dtypes)
print(dfPromotedCont.head(5))

Let's check if ad_id can be used as index.

In [117]:
gpAdid = dfPromotedCont.groupby('ad_id')
print('group number of ad_id ', len(gpAdid))
print('unique number of promoted_content ', len(dfPromotedCont['ad_id'].unique()))

So it is safe to use ad_id as the index:

In [118]:
dfPromotedCont = pd.read_csv('../input/promoted_content.csv', index_col=0)
print(dfPromotedCont.dtypes)
print(dfPromotedCont.head(5))

It is easy to see that, for each ad shown for a (disp_id, doc_id), we have the features:
$$A = [campid, advid]$$
So for problem (1), we have 12 ads, and therefore 24 features for ads.

In summary, for problem (1), we have altogether 29+6+24=59 features, and one 12-class output label.

## Data cleaning

1. events.csv, platform column. Correct values: int 1, 2, 3. Erroneous values: '1', '2', '3', '\\N'.
Processing: eliminate rows with '\\N' and cast the strings to int: '1'->1, '2'->2, '3'->3.
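
One way to carry out this cleaning step is sketched below on a toy column. `pd.to_numeric` with `errors='coerce'` turns the unparseable '\\N' marker into NaN, so those rows can be dropped before the cast. The toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the platform column of events.csv, mixing ints and strings.
toy = pd.DataFrame({
    'display_id': [1, 2, 3, 4, 5],
    'platform':   [1, '2', '\\N', 3, '3'],
})

# Coerce everything to numbers; '\\N' cannot be parsed and becomes NaN.
toy['platform'] = pd.to_numeric(toy['platform'], errors='coerce')

# Drop the rows that held '\\N', then cast the remaining values to int64.
toy = toy.dropna(subset=['platform']).astype({'platform': 'int64'})
print(toy)
```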

2. 


