# **1. Question Formulation**

***Requirement doc: Data set contents and context are clearly introduced.
Clearly articulated question that is appropriate to both the data and the algorithm, and takes the limitations of the data and/or algorithm into account.
Evaluation metrics are discussed - How will you know if you have a good model?***

Marketing campaigns cost money, so it is in every company's business interest to launch targeted and effective ones. An important metric for gauging the performance of a digital advertisement is the click-through rate (CTR), a ratio that shows "how often people who see your ad end up clicking it" <sup>(1)</sup>. The purpose of this project is therefore to build a model that accurately predicts CTR and, more importantly, scales to large volumes of data. Such a model can help companies improve the effectiveness of their digital marketing campaigns and maximize their ROI.


Specifically, the question we want to answer is: **based on a set of features, will the advertisement be clicked?** Our dataset comes from the Kaggle Display Advertising Challenge launched by CriteoLabs in 2014. We downloaded `train.csv` (approximately 12 GB) from Criteo <sup>(2)</sup>. In this `csv` file, each row corresponds to a display ad served by Criteo, with 
- target variable as the first column:
  - `1` for clicked, and 
  - `0` for non-clicked, and 
- the rest being feature variables:
  - `I1` to `I13`: A total of 13 columns of integer features.
  - `C1` to `C26`: A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.


In this report, we:
* split the full data in `train.csv` into training (80%) and testing (20%) sets, and randomly sample 1,000 records from the training set as our toy dataset;
* conduct EDA on the toy dataset;
* select a model based on the takeaways from the EDA;
* explain the mathematical theory behind the Logistic Regression model;
* build a homegrown Logistic Regression model and apply it to a mini toy dataset, to clearly articulate the mathematical mechanisms behind Logistic Regression;
* apply multiple feature engineering techniques to the toy dataset (1,000 records);
* build a Spark ML Logistic Regression model and apply it to the toy dataset and eventually the full dataset;
* evaluate model performance by calculating accuracy and log loss;
* conclude with a discussion of our challenges and the application of course concepts.
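
To make the two evaluation metrics concrete, here is a minimal sketch in plain Python. It is not the notebook's Spark code: the function names, the 0.5 decision threshold, and the clipping constant `eps` are assumptions chosen for illustration, and the labels and probabilities are made up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # eps is an assumed clipping constant
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of records whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, p_pred) if (p >= threshold) == (y == 1))
    return hits / len(y_true)


# Made-up labels and predicted click probabilities, for illustration only.
labels = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.6, 0.4]

print(accuracy(labels, probs))   # 0.5: two of the four predictions are correct
print(log_loss(labels, probs))   # about 0.540
```

Accuracy only looks at which side of the threshold a prediction falls on, while log loss also penalizes confident wrong probabilities, so lower log loss indicates better-calibrated predictions. A model that always predicts 0.5 scores a log loss of ln 2 (about 0.693), which is a useful reference point.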

Certain limitations are noted: 
- Since the `train.csv` provided by Criteo covers only 7 days of traffic, the data may not be representative (e.g. click-through activity in a holiday season could differ from that in a non-holiday season). In that case, our model may not generalize well. 
- The semantics of the feature variables are not made public by Criteo, and the values of all categorical variables are hashed. Feature engineering is therefore based solely on the mathematical characteristics of each variable, without any industry knowledge. If the feature names and values were known, our feature engineering could have been done in a more effective and sensible manner. 

Our notebook was created and completed in Google Colaboratory. `train.csv` is stored in our shared Google Drive folder, so we need to set up Google Drive authentication to read in the datasets. 

We converted `train.csv` to `train.txt`.

We split the full dataset (`train.txt`) into train and test sets (80% / 20%). Furthermore, we randomly sampled 1,000 records from the training dataset to create the toy dataset. 
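
The split-then-sample procedure can be sketched in plain Python as follows. This is an illustration only, not the notebook's actual code: `split_and_sample`, the seed, and the stand-in records are hypothetical.

```python
import random


def split_and_sample(records, train_frac=0.8, toy_size=1000, seed=42):
    """Randomly assign each record to train or test, then sample a toy set from train."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    train, test = [], []
    for rec in records:
        # Each record lands in train with probability train_frac.
        (train if rng.random() < train_frac else test).append(rec)
    # The toy set is drawn without replacement from the training records only.
    toy = rng.sample(train, min(toy_size, len(train)))
    return train, test, toy


# Stand-in records; the real rows would be lines of train.txt.
records = list(range(10_000))
train, test, toy = split_and_sample(records)
print(len(train), len(test), len(toy))
```

Sampling the toy set from the training portion only keeps the test records untouched, so anything learned during EDA on the toy data cannot leak information from the test set.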

We further created 3 toy datasets (each with 10,000 records). These datasets are created for EDA purposes, in order to verify whether certain trends are common to all random samples. 

The datatype of each column is `string`. We need to convert the data types before feature engineering and modeling. 
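
As a rough sketch of that conversion, the function below turns one raw line into a label, 13 integer features, and 26 categorical features. The tab delimiter and the use of empty strings for missing values are assumptions about the file layout, and `parse_record` is a hypothetical helper rather than the notebook's code.

```python
def parse_record(line, sep="\t"):
    """Split one raw line into (label, integer features, categorical features)."""
    fields = line.rstrip("\n").split(sep)
    label = int(fields[0])
    # Columns 1-13: integer features; an empty string is treated as missing.
    ints = [int(v) if v != "" else None for v in fields[1:14]]
    # Columns 14-39: hashed categorical features, kept as strings.
    cats = [v if v != "" else None for v in fields[14:40]]
    return label, ints, cats


# Made-up line: label, 13 numeric fields (one missing), 26 categorical fields.
sample = "\t".join(["1"] + ["5", "", "12"] + ["0"] * 10 + ["68fd1e64"] + [""] * 25)
label, ints, cats = parse_record(sample)
print(label, ints[:3], cats[:2])  # 1 [5, None, 12] ['68fd1e64', None]
```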

From the histograms of all numeric variables (1 to 13), we can see that 
- nearly all variables are highly skewed. As a result, we need to apply a log transformation to reduce the skewness. 
- column 10 behaves like a categorical variable rather than a numeric one. In the feature engineering steps, we will treat it as a categorical variable. 
- the magnitudes of different columns are quite spread out, with certain columns under 5 or 50, and others in the hundreds, thousands, or even around 1,000,000. As such, we need to normalize these columns after the log transformation. 
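
The order of operations described above (log transform first, then normalization) can be sketched as follows. This is a minimal illustration, assuming non-negative values and using `log1p` so that zeros are handled; negative values would need a shift first. `log_then_standardize` is a hypothetical name and the input values are made up.

```python
import math
import statistics


def log_then_standardize(values):
    """Compress the long right tail with log1p, then rescale to mean 0 and std 1."""
    logged = [math.log1p(v) for v in values]  # assumes every value is >= 0
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in logged]
    return [(x - mean) / std for x in logged]


# Made-up column spanning the magnitudes mentioned above.
raw = [0, 3, 50, 400, 1_000_000]
scaled = log_then_standardize(raw)
print([round(x, 2) for x in scaled])
```

Standardizing the raw values directly would leave the 1,000,000 entry dominating the column; taking the log first brings the values onto a comparable footing before they are rescaled.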

From the scatterplots of all numeric variables (1 to 13), we can see that 
- The values of Column 8 are evidently spread out. 
- In contrast, Columns 12 and 13 are more concentrated.
- Columns 1 to 7, 9 and 11 are in the middle ground.   
- For Column 10, most data points have a value of 0, 1 or 2. A few data points have a value of 3, and one extreme outlier has a value of 5.  

Parquet is a columnar file format that is highly optimized for analytical reads. For larger datasets, we should consider using this file format for scalability. After saving our dataset to a Parquet file, we read the data back from it. In addition, we conducted further EDA to check, across all toy datasets:
- whether skewness is common to all of them;
- whether the wide range of values in numeric fields is present in all of them;
- the count of records with null values in each column. We can see that for certain columns, the majority of records are null. In this case, we should consider removing these columns from our modeling. 
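
The null-count check can be sketched in plain Python as below. The helper names and the tiny table are hypothetical, and empty strings are assumed to stand for nulls; the 50% cut-off mirrors the "majority of records are null" criterion.

```python
def null_fractions(rows):
    """Return, for each column position, the share of rows where the value is missing."""
    n_cols = len(rows[0])
    counts = [0] * n_cols
    for row in rows:
        for i, value in enumerate(row):
            if value is None or value == "":
                counts[i] += 1
    return [c / len(rows) for c in counts]


def mostly_null_columns(rows, threshold=0.5):
    """Column indices whose null share exceeds the threshold."""
    return [i for i, frac in enumerate(null_fractions(rows)) if frac > threshold]


# Made-up rows: the middle column is empty in three of four records.
rows = [
    ["1", "", "a"],
    ["2", "", "b"],
    ["", "", "c"],
    ["4", "7", ""],
]
print(null_fractions(rows))       # [0.25, 0.75, 0.25]
print(mostly_null_columns(rows))  # [1]
```

Running the same check on each toy dataset and comparing the resulting lists is one way to confirm that a column is sparse in general and not just in one sample.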

## **2.5 Heatmaps to see similarity and correlations**

**From the null count and heatmap visualization, we see certain similarity in some columns. The following columns tend to be null in the same rows:**
* Col 1, 10 
* Col 3, 4, 6, 13
* Col 7, 9, 11
* Col 16, 17, 25 and 29
* Col 32, 33, 38, 39 
* Col 12, 35: for these two columns, half of the rows are empty, which raises the question of whether they would be useful for prediction. We may need to consider removing these 2 features from our model. 

Next, we want to inspect the correlations between variables further by examining their correlation matrix and correlation heatmaps. From the correlation heatmaps, we can see that:
- Column 7 and 11 have strong positive correlation, 
- Column 10 and 1 have strong positive correlation, 
- Column 4 and 13 have strong positive correlation. 

Therefore, we would need to consider dropping one of the correlated columns during our feature engineering step. 
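
A minimal sketch of this "drop one of each correlated pair" idea follows. The 0.9 threshold, the column names, and the values are assumptions for illustration, and the sketch assumes the columns contain no nulls.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / math.sqrt(vx * vy)


def columns_to_drop(columns, threshold=0.9):
    """Keep the first column of each strongly correlated pair and drop the second."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b not in dropped and abs(pearson(columns[a], columns[b])) >= threshold:
                dropped.add(b)
    return sorted(dropped)


# Made-up values; the names only echo the column numbers discussed above.
columns = {
    "col_7": [1, 2, 3, 4, 5],
    "col_11": [2, 4, 6, 8, 11],
    "col_8": [5, 1, 4, 2, 3],
}
print(columns_to_drop(columns))  # ['col_11']
```

Which member of a pair to keep is a judgment call; this sketch simply keeps whichever column appears first.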

For categorical fields (Columns 14-39), we further check the counts of unique categories for each feature. The following columns show high counts of unique categories (more than 50% uniqueness):
- 16, 17, 20, 23, 24, 25, 26, 28, 29

Consider a variable with 100% uniqueness, meaning every row has its own category for this variable. Such a variable would not help our modeling, because it cannot separate data points into groups. We therefore assume that a variable with a high uniqueness percentage contributes less information than one with lower uniqueness. As the dataset gets larger, feature dimension reduction becomes more important for our modeling to scale. Thus, in our feature engineering step, we would like to consider removing categorical variables with more than 50% uniqueness. 
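
The uniqueness check might look like the sketch below. It assumes that "uniqueness" means the number of distinct non-null values divided by the number of non-null values, which is one plausible reading; the helper names and sample values are hypothetical.

```python
def uniqueness_ratio(values):
    """Distinct non-null values divided by the count of non-null values (assumed definition)."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return 0.0
    return len(set(present)) / len(present)


def high_uniqueness_columns(columns, threshold=0.5):
    """Names of columns whose uniqueness ratio exceeds the threshold."""
    return [name for name, vals in columns.items() if uniqueness_ratio(vals) > threshold]


# Made-up hashed values for two categorical columns.
columns = {
    "col_14": ["a1", "a1", "b2", "a1"],  # 2 distinct / 4 rows = 0.5
    "col_16": ["x9", "y8", "z7", "x9"],  # 3 distinct / 4 rows = 0.75
}
print(high_uniqueness_columns(columns))  # ['col_16']
```

On a sample as small as 1,000 records the ratio tends to overstate uniqueness, since a category seen once in the sample may recur many times in the full data, so the 50% cut-off is best read as a heuristic.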

## **2.7 Takeaways from EDA**  
Our EDA was conducted on 4 toy datasets: one with 1,000 records and three with 10,000 records each. Even though the EDA has not been run on the full dataset, we can already see some modeling challenges from the results:
- The numeric data are highly skewed. Log transformation should be applied.
- The scales of the numeric variables differ significantly. Normalization/standardization should be considered. 
- There is a large number of null values in both numeric and categorical fields. We need an approach for handling them: should they be removed, replaced with zeros, or replaced with the mean? Should numeric and categorical fields be handled differently? 
- Both the semantics and the values of the categorical variables are unknown. To build them into our model, we need to encode the hashed strings as numbers, and we would apply one hot encoding to do so. The biggest challenges we anticipate are: 
  - when applied to the full dataset, one hot encoding could expand our feature set substantially, because each variable may have a large number of categories; 
  - such a large number of features could lead to memory issues at the modeling stage. 
- Feature reduction becomes key to resolving this potential memory issue. Approaches we could take in the next stage are:
  - Removing features with high percentage (more than 50%) of null values;
  - Dropping similar columns, or correlated columns;
  - Removing features with high percentage (more than 50%) of uniqueness;
  - Running PCA on our engineered dataset to further reduce feature dimensions. 


# **5. Application of Course Concepts**

- **Scalability and memory issues**
  - As the data scaled up to the full dataset, we ran into out-of-memory issues multiple times at the one hot encoding stage, as well as when training our logistic regression model. We did not run into such issues on our toy dataset, which prompted us to think about reducing our feature dimensions.  
  - There are multiple options for preventing the feature dimensions from becoming overwhelmingly large. Our approach is to remove variables deemed less informative for modeling, and then run PCA. We successfully ran through the full dataset, with a prediction accuracy of 0.746 and a log loss of 0.535.
  - Other plausible approaches, which we did not have time to implement in this project, would be 
    - to control the maximum number of categories each variable could have, 
    - to use clustering algorithms to group columns together, 
    - to run Random Forest to rank feature importance, and select the top features for our modeling. 
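
The first of these unimplemented ideas, capping the number of categories per variable, could look roughly like this. It is a sketch of one common way to do it (keep the most frequent categories and pool the rest), not something the project ran; `cap_categories` and the `__other__` bucket name are hypothetical.

```python
from collections import Counter


def cap_categories(values, max_categories, other="__other__"):
    """Keep the most frequent categories and pool every other value into one bucket."""
    counts = Counter(v for v in values if v is not None)
    keep = {cat for cat, _ in counts.most_common(max_categories)}
    # Nulls are left untouched so they can be handled by a separate rule.
    return [v if (v is None or v in keep) else other for v in values]


# Made-up column with four distinct categories, capped at two.
column = ["a", "b", "a", "c", "a", "b", "d"]
print(cap_categories(column, max_categories=2))
# ['a', 'b', 'a', '__other__', 'a', 'b', '__other__']
```

After capping, one hot encoding adds at most `max_categories + 1` columns per variable, which bounds the feature dimension regardless of how many rare categories the full dataset contains.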

- **One hot encoding**
  - The categorical variables are hashed onto 32 bits, so we do not know the exact meaning of each category. In this case, we decided to use one hot encoding instead of integer encoding, since integer codes would impose an ordering that the hashed values do not have. 
  - One hot encoding is implemented by using `OneHotEncoderModel` from `pyspark.ml.feature`.  
  - However, when we ran `OneHotEncoderModel` on our full dataset, we ran into an OOM issue. This was anticipated, as our full dataset has approximately 45 million records. The categorical variables could include fields such as unique IDs, which the one hot encoder would expand into a very large number of columns without proper control. 
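
To illustrate why one hot encoding inflates the feature dimension, here is a small plain-Python sketch. It is not Spark's implementation: `fit_one_hot` and `encode_row` are hypothetical helpers, and the category values are made up.

```python
def fit_one_hot(columns):
    """Assign one output position to every (column, category) pair seen in the data."""
    index = {}
    for name, values in columns.items():
        for v in sorted(set(values)):
            index[(name, v)] = len(index)
    return index


def encode_row(row, index):
    """Return the sorted positions that are 1 for this row (a sparse representation)."""
    return sorted(index[(name, v)] for name, v in row.items() if (name, v) in index)


# Two categorical columns with 2 and 3 distinct values respectively.
columns = {"C1": ["a", "b", "a"], "C2": ["x", "y", "z"]}
index = fit_one_hot(columns)
print(len(index))                                # 5 output dimensions in total
print(encode_row({"C1": "b", "C2": "z"}, index))  # [1, 4]
```

The encoded width is the sum of the distinct-value counts across all columns, so a single ID-like column with millions of distinct values adds millions of dimensions by itself. Storing only the active positions keeps each row small, but the model's weight vector still grows with the full width.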

- **Functional programming**:
- **Normalization**:
  - As demonstrated in Section 2, the numeric columns are on very different scales, with values ranging from around 5 to as large as 1,000,000. As such, we applied normalization to all numeric fields. 
  - In addition, we applied standardization to the one hot encoded categorical variables by using `StandardScaler` from `pyspark.ml.feature`.  