# Machine Learning Engineer Nanodegree
## Capstone Project - InstaCart Market Basket Analysis

### Utility functions for pre-processing and feature engineering
This notebook duplicates the functions I created for pre-processing and building new features for the project. 
I have included them in a separate notebook so that the code is easier to read and to document.

The project uses pandas v0.18 and saves dataframes in HDF5 format (which pandas supports natively) for efficiency. This option requires the `tables` (PyTables) module to be installed on the host machine. I needed to install it in my Anaconda2 environment using `conda`.

The first block here is boilerplate code that is also used by the other notebooks I created for downstream steps. Importing `gc` helps manage the memory footprint during the pre-processing and analysis stages.

### Augmenting orders and products

- `get_sample` : a utility function that takes a small randomized sample of a large dataframe during development
- `preprocess_orders` : handles missing values and outliers in the `orders` table. In the raw data, `days_since_prior_order` is capped: a value of _30_ stands for any true value greater than or equal to 30. We replaced these capped values with a random number between 30 and 50 and added a categorical variable to flag the values that were assigned during preprocessing. _Additionally_, this routine calculates the cumulative number of days between a user's orders.
- `add_product_groups` : adds `department_id` and `aisle_id` columns to the order history (`priors`)
- `get_hist_chunk` : another function to assist development and code testing. It takes a random sample of users (size determined by `frac`) and then selects their order history. This lets us step through the pre-processing in small chunks to verify that the logic is working correctly.
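
The two less obvious helpers above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `_sketch` function names, the `dspo_imputed` flag column, the `seed` parameter, and the choice to fill a user's first-order `NaN` with 0 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess_orders_sketch(orders, seed=0):
    """Illustrative only; not the notebook's actual `preprocess_orders`."""
    rng = np.random.RandomState(seed)
    out = orders.sort_values(["user_id", "order_number"]).copy()

    # Assumption: a user's first order has no prior order, so treat NaN as 0.
    out["days_since_prior_order"] = out["days_since_prior_order"].fillna(0)

    # Flag the capped rows (raw value 30 means "30 or more") before overwriting.
    capped = out["days_since_prior_order"] >= 30
    out["dspo_imputed"] = capped.astype(int)  # hypothetical column name

    # Replace capped values with a random integer in [30, 50].
    out.loc[capped, "days_since_prior_order"] = rng.randint(
        30, 51, size=int(capped.sum())
    )

    # Running total of days per user, stored as `csum_ds`.
    out["csum_ds"] = out.groupby("user_id")["days_since_prior_order"].cumsum()
    return out


def get_hist_chunk_sketch(orders, priors, frac=0.1, seed=0):
    """Sample a fraction of users, then return only their order history."""
    users = pd.Series(orders["user_id"].unique())
    picked = users.sample(frac=frac, random_state=seed)
    order_ids = orders.loc[orders["user_id"].isin(picked), "order_id"]
    return priors[priors["order_id"].isin(order_ids)]


# Tiny made-up data set: two users, five orders.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7, 30, np.nan, 10],
})
priors = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5],
    "product_id": [10, 10, 20, 10, 30, 30],
})

processed = preprocess_orders_sketch(orders)
chunk = get_hist_chunk_sketch(orders, priors, frac=0.5)
```

Sampling users rather than rows keeps each selected user's history complete, which matters because most of the later features are computed per user.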

There is one feature that required special handling. We are given the days between orders, but a product may not appear on every order placed by a user. To calculate the average number of days between a user's orders of a specific product, we took these steps:
 1. sort the data by `user_id`, `product_id`, and `order_number`
 2. take the difference between the cumulative sum of days since the last order (calculated in `preprocess_orders` and stored as `csum_ds`) of each row and that of the previous row. For the first row of each user-product pair, we use the days since the previous order
 
 While there is a way to calculate this in `pandas` using a lambda function, the computation took a very long time. In the interest of efficiency, I wrote a function (`diff1`) to achieve this using `numpy` matrices. I was able to achieve a 20-fold speedup using this method (an estimate based on small samples).
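
The two steps above can be sketched with vectorized `numpy` operations along these lines. This is an assumed reconstruction, not the notebook's `diff1`: the function name `diff1_sketch` and the output column `days_since_prior_purchase` are hypothetical, and the input is assumed to already carry `csum_ds` and `days_since_prior_order` on every row.

```python
import numpy as np
import pandas as pd


def diff1_sketch(df):
    """Days since the same user last ordered the same product (illustrative)."""
    # Step 1: sort so that each user-product pair forms a contiguous block.
    df = df.sort_values(["user_id", "product_id", "order_number"])
    df = df.reset_index(drop=True)

    csum = df["csum_ds"].values.astype(float)
    user = df["user_id"].values
    prod = df["product_id"].values

    # Step 2: difference between each row's cumulative days and the previous row's.
    diff = np.empty(len(df), dtype=float)
    diff[1:] = csum[1:] - csum[:-1]

    # A row starts a new block when the user or the product changes.
    first = np.ones(len(df), dtype=bool)
    first[1:] = (user[1:] != user[:-1]) | (prod[1:] != prod[:-1])

    # For the first row of each block, fall back to days since the prior order.
    diff[first] = df["days_since_prior_order"].values[first].astype(float)

    df["days_since_prior_purchase"] = diff  # hypothetical column name
    return df


# Made-up history: user 1 buys product 10 on orders 1 and 3, product 20 on order 2.
history = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "product_id": [10, 20, 10, 10],
    "order_number": [1, 2, 3, 1],
    "days_since_prior_order": [0, 7, 8, 0],
    "csum_ds": [0, 7, 15, 0],
})
result = diff1_sketch(history)
```

Because the work is done on whole arrays with a single boolean mask for block boundaries, no Python-level function is called per group, which is the usual reason this style is faster than a per-group lambda.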

### Building user-product features

The features for user-product combinations are built through extensive use of the DataFrame `aggregate` function. Descriptions of these features are provided in the Preprocessing notebook.
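
As a rough illustration of the pattern, a grouped aggregation over user-product pairs might look like the sketch below. The specific features shown (order count, first and last order number, number of reorders) are placeholders chosen for the example and are not necessarily the ones described in the Preprocessing notebook; the function name is likewise hypothetical.

```python
import pandas as pd


def user_product_features_sketch(priors):
    """Example aggregation per (user, product); feature choice is illustrative."""
    grouped = priors.groupby(["user_id", "product_id"])
    feats = grouped.agg({
        "order_number": ["count", "min", "max"],
        "reordered": ["sum"],
    })
    # Flatten the two-level column index, e.g. ("order_number", "count")
    # becomes "order_number_count".
    feats.columns = ["_".join(col) for col in feats.columns]
    return feats.reset_index()


# Made-up order history with order numbers already merged in.
priors_demo = pd.DataFrame({
    "user_id": [1, 1, 1],
    "product_id": [10, 20, 10],
    "order_number": [1, 2, 3],
    "reordered": [0, 0, 1],
})
up_feats = user_product_features_sketch(priors_demo)
```

Each row of `up_feats` describes one user-product pair, so it can be joined back onto candidate pairs by `user_id` and `product_id`.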

### Building user's top-_N_ aisles feature

The function below analyses users' purchase patterns across orders to identify which groups of items they tend to purchase most frequently. In this round, we are using aisles as a convenient grouping for products. The function counts the number of products per aisle for each user, sorts the counts in descending order, then picks the aisle ids of the top-N aisles (N is an input).

I rely on some matrix manipulation to convert the top-N list into a dataframe with N columns.
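
One way to sketch the count, sort, and pick-top-N logic is shown below. Note that it uses a `pandas` pivot to produce the N columns instead of the matrix manipulation mentioned above, so it is a stand-in for the idea and not the notebook's implementation. The function name, the `top_aisle_k` column names, and the tie-breaking rule (lower `aisle_id` first) are assumptions.

```python
import pandas as pd


def top_n_aisles_sketch(priors, n=3):
    """Return one row per user with the ids of their n most-purchased aisles."""
    # Count products per (user, aisle).
    counts = (
        priors.groupby(["user_id", "aisle_id"])
        .size()
        .reset_index(name="n_products")
    )
    # Sort by count descending within each user; break ties by aisle_id.
    counts = counts.sort_values(
        ["user_id", "n_products", "aisle_id"], ascending=[True, False, True]
    )
    # Rank aisles within each user (0 = most purchased) and keep the top n.
    counts["rank"] = counts.groupby("user_id").cumcount()
    top = counts[counts["rank"] < n]

    # Spread the ranked aisle ids into n columns, one row per user.
    wide = top.pivot(index="user_id", columns="rank", values="aisle_id")
    wide = wide.reindex(columns=list(range(n)))
    wide.columns = ["top_aisle_%d" % (i + 1) for i in range(n)]
    return wide.reset_index()


# Made-up purchases: user 1 buys from three aisles, user 2 from one.
aisle_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1, 2],
    "aisle_id": [24, 24, 24, 83, 83, 120, 83],
})
top_aisles = top_n_aisles_sketch(aisle_demo, n=2)
```

Users who have purchased from fewer than N aisles end up with missing values in the trailing columns, so downstream code needs to handle `NaN` there.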

### Benchmark Models and prediction functions