### Required Libraries
Install the following libraries and confirm that each one imports without error. If a library is missing, consult its installation guide and documentation.

In [1]:
import os
import math
import scipy
import timeit
import numpy as np
import pandas as pd
from scipy import stats
import psycopg2
from sqlalchemy import create_engine
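
As a quick pre-check before running the import cell, the sketch below reports which of the third-party libraries cannot be located in the current environment. It is only an illustration: the helper name `find_missing` is hypothetical, and `importlib.util.find_spec` confirms that a module can be found, not that it imports cleanly, so the import cell above remains the real test.

```python
import importlib.util

# Top-level names of the third-party libraries imported in the cell above.
REQUIRED = ["scipy", "numpy", "pandas", "psycopg2", "sqlalchemy"]

def find_missing(modules):
    """Return the names in `modules` that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("Missing libraries:", ", ".join(missing) if missing else "none")
```

Any name printed here needs to be installed before continuing.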

### Load Feature Codes
Download the list of 1805 features with their codes and categories, and save it as a CSV file. The feature code table explains which product category each feature concerns, what the feature means, and how it is encoded. Note that it does not contain the actual feature values.
* Time: ~ 1 m 30 s
* Storage: ~ 85 KB
* File created: data/feature_code.csv

In [2]:
%run -i 'load_feature_codes.py'

Feature codes file exists: data/feature_code.csv. No need to download.


### Load Features
Download the entire sb_marketing.sl_lookalike_features_final table to a local computer. The result is saved as 106 CSV files: one for each of the 105 categories, plus one covering all categories. This allows the lookalike model to run on a single CPU machine, and is the only method currently available.
* Time: ~ 3-5 hours
* Storage: ~ 12.5 GB
* Files created: data/001_Accessory - Bag.csv, data/002_Accessory - Other Accessory.csv...

In [3]:
%run -i 'load_features.py'

Features downloaded: 2 m 58.492 s
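
To confirm that the download produced the expected 106 files, a sketch along these lines can count them. It assumes that every category file follows the naming shown in the examples above (three digits, an underscore, then the category name); the helper `list_category_files` is hypothetical and not part of load_features.py.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from names such as "001_Accessory - Bag.csv".
CATEGORY_FILE = re.compile(r"^\d{3}_.+\.csv$")

def list_category_files(data_dir="data"):
    """Return the sorted names of CSV files that look like per-category feature files."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and CATEGORY_FILE.match(p.name)
    )

files = list_category_files("data")
print(f"{len(files)} category files found (106 expected)")
```

Files such as data/feature_code.csv do not match the pattern, so they are not counted.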


### Load Population Sample Features
The lookalike model compares every feature between the source audience and the population to determine the feature's importance score. However, it is costly and unnecessary to compare against the entire population (11 million). Instead, randomly select a sample to represent the population.

Choose the population sample size to be comparable to the size of the source audience, which is expected to fall within the range of 5k to 50k for optimal performance. In the code below, the sample size is set to 10k. You can change the sample size in the load_population_sample.py file.
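
The sampling idea can be sketched as a uniform draw without replacement. This is a minimal illustration only: the function `sample_member_ids`, the fixed seed, and the stand-in ID range are assumptions, and load_population_sample.py may select its sample differently.

```python
import random

def sample_member_ids(member_ids, sample_size=10_000, seed=0):
    """Draw up to `sample_size` distinct IDs uniformly at random from `member_ids`."""
    ids = list(member_ids)
    k = min(sample_size, len(ids))  # never request more IDs than exist
    return random.Random(seed).sample(ids, k)

# Stand-in IDs for illustration; the real member_srl values come from the database.
population_ids = range(1_000_000)
sample = sample_member_ids(population_ids)
print(len(sample))  # prints 10000
```

Fixing the seed makes the sample reproducible across runs, which helps when comparing importance scores between experiments.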

In [4]:
%run -i 'load_population_sample.py'

Population sample member_srl file saved: data/population_sample_srls.csv
Downloading features from Redshift. Progress 100.00 %. Category: all
Population sample feature file saved: data/population_sample_features.csv
