# Assignment 7: Automated Machine Learning (Part 2)
## Objective:

As we learned in class, the high demand for machine learning has produced a large number of data scientists with expertise in tools and algorithms. The features in the data directly influence the results, but designing and selecting features by hand is tedious and does not scale, especially without domain knowledge. AutoML techniques can therefore save data scientists considerable labour and time.
After completing this assignment, you should be able to answer the following questions:

1. Why do we need AutoML?
2. How does auto feature generation work?
3. How do you use the featuretools library to automatically generate features?
4. How do you find useful features in a large feature space?

Imagine you are a data scientist at an online retailer such as Amazon. Your task is to recommend products to customers based on their historical purchase records.

In this assignment, we predict whether a customer will buy **Banana** in the next 4 weeks. It is a classification problem. To simplify the problem, we have already generated some features and report the performance of a baseline model (a Random Forest). Your task is to generate **10** useful features and beat the baseline performance (AUC = 0.61, see below).

For example, <br>
`MODE(orders.MODE(order_products.product_name)) = Bag of Organic Bananas` indicates whether the customer's most frequent purchase is Bag of Organic Bananas (the mode, across the customer's orders, of each order's most frequent product).

```
1: Feature: MODE(orders.MODE(order_products.product_name)) = Bag of Organic Bananas
2: Feature: MODE(order_products.aisle_id) is unknown
3: Feature: SUM(orders.NUM_UNIQUE(order_products.product_name))
4: Feature: MODE(orders.MODE(order_products.product_name)) = Boneless Skinless Chicken Breasts
5: Feature: MODE(order_products.product_name) = Boneless Skinless Chicken Breasts
6: Feature: STD(orders.NUM_UNIQUE(order_products.aisle_id))
7: Feature: MODE(order_products.aisle_id) = 83
8: Feature: MEDIAN(orders.MINUTE(order_time))
9: Feature: MODE(orders.DAY(order_time)) = 23
10: Feature: MODE(orders.MODE(order_products.department)) = produce

AUC 0.61
```
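
To make the stacked feature name concrete, here is a minimal plain-Python sketch of what `MODE(orders.MODE(order_products.product_name))` computes for a single customer. The toy purchase history and the helper `mode` are assumptions for illustration; featuretools computes this over the real tables, and its tie-breaking may differ.

```python
from collections import Counter

def mode(values):
    """Return the most common value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical purchase history of one customer: order_id -> product names
orders_of_customer = {
    101: ["Bag of Organic Bananas", "Bag of Organic Bananas", "Whole Milk"],
    102: ["Whole Milk", "Bag of Organic Bananas", "Bag of Organic Bananas"],
    103: ["Whole Milk", "Whole Milk", "Paper Towels"],
}

# Inner MODE: most frequent product within each order
per_order_mode = [mode(products) for products in orders_of_customer.values()]

# Outer MODE: most frequent of those per-order modes
customer_mode = mode(per_order_mode)

# Boolean feature, as in "... = Bag of Organic Bananas"
feature_value = customer_mode == "Bag of Organic Bananas"
print(per_order_mode, feature_value)
```

The inner aggregation runs once per order and the outer one once per customer, which is why the feature name nests one `MODE` inside the other.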


## Preliminary
If you have never used featuretools before, you will need to learn the basics first.
These are some good resources:
* [featuretools documentation](https://docs.featuretools.com/en/stable/)
* [Tutorial: Automated Feature Engineering in Python](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219)

The data can be downloaded from [A7-2-data.zip](A7-2-data.zip). 

## 0. Preparation
Import the relevant libraries and load the dataset, which consists of three tables: <br>
users: <br>
* user_id: customer identifier
* label: 1 if the customer will buy bananas in the next 4 weeks, 0 otherwise

orders: <br>
* order_id: order identifier
* user_id: customer identifier
* order_time: date on which the order was placed

order_products: <br>
* order_id: order identifier
* order_product_id: identifier of the order-product row (used as this table's index in Section 1.1)
* reordered: 1 if this product has been ordered by this user in the past, 0 otherwise
* product_name: name of the product
* aisle_id: aisle identifier
* department: name of the department
* order_time: date on which the order was placed

In [1]:
import pandas as pd

import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
ft.__version__

'0.13.2'

In [2]:
orders = pd.read_csv("orders.csv")
order_products = pd.read_csv("order_products.csv")
users = pd.read_csv("users.csv")

print(users["label"].value_counts())
orders.shape

False    628
True     139
Name: label, dtype: int64


(5997, 4)
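
The label counts above show a clear class imbalance, which is a common reason to judge a classifier by AUC rather than by plain accuracy. A small sketch of the arithmetic, using only the counts printed above:

```python
# Label counts copied from the value_counts() output above
n_negative = 628  # customers labelled False
n_positive = 139  # customers labelled True

n_total = n_negative + n_positive
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts False
majority_baseline_accuracy = n_negative / n_total

print("customers: %d" % n_total)
print("positive rate: %.3f" % positive_rate)
print("always-False accuracy: %.3f" % majority_baseline_accuracy)
```

A model that always predicts False is right for roughly 82% of the 767 customers while never identifying a banana buyer, so accuracy alone would be misleading here.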

## Task 1. Feature Generation
In this task, you need to use featuretools to generate candidate features by using the above three tables.

### 1.1 Representing Data with EntitySet

Define entities and their relationships (see [https://docs.featuretools.com/en/stable/generated/featuretools.EntitySet.html](https://docs.featuretools.com/en/stable/generated/featuretools.EntitySet.html))

In [3]:
# Build an EntitySet describing the three tables and how they relate
def load_entityset(orders, order_products, users):
    # --- Write your code below ---
    # return the EntitySet object

    # Each entry maps an entity name to (dataframe, index column[, time index column])
    entities = {
        "users": (users, "user_id"),
        "orders": (orders, "order_id", "order_time"),
        "order_products": (order_products, "order_product_id", "order_time"),
    }
    # Each relationship is (parent entity, parent key, child entity, child key)
    relationships = [
        ("users", "user_id", "orders", "user_id"),
        ("orders", "order_id", "order_products", "order_id"),
    ]
    return ft.EntitySet("users", entities, relationships)
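
Each relationship asserts that every key in the child table also appears in the parent table. Before building the EntitySet it can help to check this. The helper `orphan_keys` below is a hypothetical name, not part of featuretools; it accepts any iterables, so pandas columns work as well as the toy lists used here.

```python
def orphan_keys(child_keys, parent_keys):
    """Return the sorted child keys that have no matching parent key."""
    return sorted(set(child_keys) - set(parent_keys))

# Toy key columns standing in for users.user_id and orders.user_id
user_ids = [1, 2, 3]
order_user_ids = [1, 1, 2, 4]  # 4 has no matching user

missing = orphan_keys(order_user_ids, user_ids)
print(missing)
```

In the notebook this could be called as `orphan_keys(orders["user_id"], users["user_id"])` and `orphan_keys(order_products["order_id"], orders["order_id"])`; an empty list means the relationship is consistent.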

### 1.2 Deep Feature Synthesis

In [4]:
# Automatically generate features

# Drop the leftover CSV index column if it is still present
users = users.drop(columns="Unnamed: 0", errors="ignore")
orders = orders.drop(columns="Unnamed: 0", errors="ignore")
order_products = order_products.drop(columns="Unnamed: 0", errors="ignore")

es = load_entityset(orders, order_products, users)

# use ft.dfs to perform feature engineering
# --- Write your code below ---
# The EntitySet already holds the entities and relationships,
# so only entityset and target_entity are passed here.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="users")

In [5]:
# output what features you generate
feature_matrix

user_id,label,COUNT(orders),SUM(order_products.aisle_id),SUM(order_products.reordered),STD(order_products.aisle_id),STD(order_products.reordered),MAX(order_products.aisle_id),MAX(order_products.reordered),SKEW(order_products.aisle_id),SKEW(order_products.reordered),...,NUM_UNIQUE(orders.YEAR(order_time)),NUM_UNIQUE(orders.MONTH(order_time)),MODE(orders.MODE(order_products.department)),MODE(orders.WEEKDAY(order_time)),MODE(orders.MODE(order_products.product_name)),MODE(orders.DAY(order_time)),MODE(orders.YEAR(order_time)),MODE(orders.MONTH(order_time)),NUM_UNIQUE(order_products.orders.user_id),MODE(order_products.orders.user_id)
1,False,4,1271,11,38.070486,0.511766,121,1,0.339280,-0.102843,...,1,3,snacks,3,Aged White Cheddar Popcorn,1,2015,1,1,1
2,True,7,5419,29,39.596448,0.476918,123,1,0.141339,0.682090,...,1,3,produce,6,Baked Organic Sea Salt Crunchy Pea Snack,1,2015,1,1,2
3,False,5,2831,14,40.594305,0.480091,123,1,-0.029453,0.694312,...,1,3,produce,6,100% Recycled Paper Towels,1,2015,1,1,3
7,False,4,4781,36,35.379342,0.503413,123,1,0.172950,0.027978,...,1,3,beverages,1,85% Lean Ground Beef,1,2015,3,1,7
10,False,4,7677,20,37.683787,0.382021,123,1,-0.142492,1.729523,...,1,2,produce,5,85% Lean Ground Beef,1,2015,2,1,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,4,2372,18,44.964947,0.495434,128,1,0.220281,0.422463,...,1,2,snacks,1,All Natural Maple Almond Butter,1,2015,1,1,996
997,False,4,2121,7,40.314601,0.415149,129,1,0.290248,1.476346,...,1,3,produce,3,85% Lean Ground Beef,1,2015,1,1,997
998,False,7,3898,41,37.087147,0.469102,116,1,-0.068441,-0.808589,...,1,3,dairy eggs,1,Bag of Organic Bananas,6,2015,1,1,998
999,True,12,22181,182,40.019134,0.489306,131,1,-0.263885,-0.438918,...,1,3,produce,2,100% Raw Coconut Water,1,2015,1,1,999
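
In the rows shown above, `NUM_UNIQUE(order_products.orders.user_id)` is always 1 and `MODE(order_products.orders.user_id)` simply repeats the user ID, so neither is a meaningful feature. The following is a minimal pandas sketch for flagging such columns before feature selection; the toy frame and the helper name `uninformative_columns` are assumptions for illustration.

```python
import pandas as pd

def uninformative_columns(df):
    """Return names of columns that are constant or that copy the index."""
    flagged = []
    for name in df.columns:
        col = df[name]
        is_constant = col.nunique(dropna=False) <= 1
        copies_index = col.tolist() == df.index.tolist()
        if is_constant or copies_index:
            flagged.append(name)
    return flagged

# Toy feature matrix indexed by user_id (values from the first rows above)
toy = pd.DataFrame(
    {
        "COUNT(orders)": [4, 7, 5],
        "NUM_UNIQUE(order_products.orders.user_id)": [1, 1, 1],
        "MODE(order_products.orders.user_id)": [1, 2, 3],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

to_drop = uninformative_columns(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

Removing these columns shrinks the candidate space, so any later search spends fewer trials on features that cannot help.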


## Task 2. Feature Selection
In this task, you will select 10 useful features and train a *Random Forest* model. The goal is to beat the baseline performance shown earlier (AUC = 0.61). Note that you must use the Random Forest and the hyperparameters we provide in Section 2.2. In other words, your job is to achieve an AUC higher than 0.61 through feature generation and selection rather than through hyperparameter tuning or model selection.
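
AUC can be read as the probability that a randomly chosen positive customer receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` and the toy scores are assumptions for illustration; the notebook itself relies on scikit-learn's `roc_auc` scorer.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
print(auc)
```

A value of 0.5 corresponds to random ranking, so the baseline AUC of 0.61 is only modestly better than chance.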

### 2.1 Select top features

In [6]:
# --- Write your code below ---
# Select top-10 features and return X, y (X.shape = (767, 10))
import random
from sklearn.preprocessing import LabelEncoder
import math
from collections import OrderedDict

cp_feature_matrix = feature_matrix.copy()

COL_TARGET = 0
COL_FEATURE = 1
# Split into candidate features (X_temp) and target (y)
X_temp = cp_feature_matrix.iloc[:, COL_FEATURE:].copy()
y = cp_feature_matrix.iloc[:, COL_TARGET]

# Encode string-valued (categorical) feature columns as integers
labelencoder = LabelEncoder()
for name_col in X_temp.select_dtypes(include=['object']).columns:
    X_temp[name_col] = labelencoder.fit_transform(X_temp[name_col])

# Positions of all candidate features within X_temp.
# X_temp no longer contains the label, so positions start at 0.
num_feature = X_temp.shape[1]
list_index_feature = list(range(num_feature))

# Turn the True/False label into 1/0
y = labelencoder.fit_transform(y)

dict_features = {}

# Number of random 10-feature subsets to evaluate
NUM_TRIAL = 100
for _ in range(NUM_TRIAL):
    rand10_feature = random.sample(list_index_feature, 10)
    X = X_temp.iloc[:, rand10_feature].values
    clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
    scores = cross_val_score(estimator=clf, X=X, y=y, cv=3,
                             scoring="roc_auc", verbose=True)
    # Skip trials whose mean AUC is NaN
    if not math.isnan(scores.mean()):
        print("AUC %.2f" % (scores.mean()))
        dict_features[scores.mean()] = rand10_feature

# Sort once, after all trials, from best to worst mean AUC
sorted_dict_features = OrderedDict(sorted(dict_features.items(), key=lambda t: t[0], reverse=True))
print(sorted_dict_features)

# Positions (within X_temp) of the best 10-feature subset
best10_feature = list(sorted_dict_features.values())[0]
X = X_temp.iloc[:, best10_feature].values

# Best subsets found in earlier runs:
# [17, 26, 102, 105, 96, 30, 41, 78, 75, 13] best score: 0.74
# [26, 106, 29, 55, 12, 17, 53, 109, 90, 52] best score: 0.73
# [68, 47, 106, 7, 105, 90, 60, 53, 97, 73] best score: 0.72

AUC 0.68
AUC 0.71
AUC 0.68
AUC 0.68
AUC 0.72
AUC 0.66
AUC 0.66
AUC 0.69
AUC 0.73
AUC 0.66
AUC 0.70
AUC 0.67
AUC 0.61
AUC 0.67
AUC 0.70
AUC 0.76
AUC 0.68


OrderedDict([(0.7564656058192744, [16, 11, 43, 35, 12, 22, 37, 98, 81, 105]), (0.7262566245878482, [105, 56, 16, 96, 30, 43, 13, 39, 17, 65]), (0.7219724668383319, [94, 62, 43, 5, 30, 16, 105, 64, 42, 53]), (0.7086134027075651, [28, 55, 32, 101, 26, 89, 13, 3, 16, 83]), (0.7009392401475756, [1, 32, 68, 16, 10, 110, 86, 56, 60, 17]), (0.6998566131922187, [104, 26, 75, 85, 63, 58, 68, 2, 72, 16]), (0.6857687925838761, [69, 65, 5, 29, 88, 42, 31, 34, 16, 36]), (0.6821558809703315, [30, 19, 5, 20, 74, 86, 89, 51, 58, 62]), (0.6817551738933014, [64, 75, 41, 19, 1, 106, 21, 45, 89, 85]), (0.6807116201767841, [63, 11, 56, 23, 1, 26, 94, 14, 2, 105]), (0.676815900935868, [62, 57, 1, 4, 3, 29, 64, 76, 33, 105]), (0.6732873991808215, [104, 63, 1, 79, 6, 43, 15, 84, 17, 102]), (0.6702999825693238, [3, 62, 25, 68, 76, 71, 84, 60, 24, 90]), (0.6649764619127783, [51, 74, 66, 35, 17, 108, 8, 27, 37, 109]), (0.6589498372399195, [33, 2, 85, 30, 20, 45, 76, 86, 70, 106]), (0.65762609295644, [7, 22, 81, 
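
The search in cell [6] can be summarised as: draw random 10-feature subsets, score each one, and keep the best. The sketch below isolates that loop with a seeded generator so that runs are repeatable. `random_subset_search`, the toy `toy_score` function, and the made-up `USEFUL` set are hypothetical stand-ins; in the notebook the score would be the mean cross-validated AUC.

```python
import math
import random

def random_subset_search(n_features, k, n_trials, score_fn, seed=0):
    """Score n_trials random k-subsets of range(n_features); return the best."""
    rng = random.Random(seed)  # seeded, so the search is repeatable
    best_score, best_subset = -math.inf, None
    for _ in range(n_trials):
        subset = rng.sample(range(n_features), k)
        score = score_fn(subset)
        if math.isnan(score):  # skip failed evaluations
            continue
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset

# Toy score: fraction of the subset that falls in a made-up "useful" set
USEFUL = {16, 43, 105}

def toy_score(subset):
    return len(USEFUL.intersection(subset)) / len(subset)

best_score, best_subset = random_subset_search(
    n_features=111, k=10, n_trials=100, score_fn=toy_score, seed=42)
print(best_score, best_subset)
```

Because the winner is the best of many trials judged on the same cross-validation folds, its score is likely to be somewhat optimistic, which is consistent with the slightly lower AUC obtained when the chosen subset is scored again in Section 2.2.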

### 2.2 Get accuracy and list features

In [7]:
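clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf, X=X, y=y, cv=3,
                         scoring="roc_auc", verbose=True)
print("AUC %.2f" % (scores.mean()))

# Print top-10 features
for i, name in enumerate(X_temp.columns[best10_feature], start=1):
    print("%d: Feature: %s" % (i, name))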

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    3.9s finished


AUC 0.75
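
As an alternative to random search (not what the notebook above does), features can be ranked by a fitted Random Forest's `feature_importances_` attribute and the top 10 kept. The sketch uses synthetic data from `make_classification` as a stand-in for the real feature matrix; all sizes, seeds, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature matrix: 300 rows, 30 candidate features
X_all, y_all = make_classification(n_samples=300, n_features=30,
                                   n_informative=5, random_state=0)

# Fit once on all candidates and read the impurity-based importances
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_all, y_all)
importances = forest.feature_importances_

# Column positions of the 10 most important features, best first
top10 = np.argsort(importances)[::-1][:10]
X_top10 = X_all[:, top10]
print(top10)
```

Ranking on the full data and then cross-validating on the same data leaks a little information, so for an unbiased estimate the ranking would need to be repeated inside each cross-validation fold.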


## Task 3. Writing Questions

1. Please list three advantages and three disadvantages of featuretools.
2. For the disadvantages you listed above, do you have any ideas for improving them?

--- Write your answer here---

Advantages:
1. Powerful tools: Using the concepts of EntitySets, entities, and relationships, Featuretools performs Deep Feature Synthesis to create new features. For complex datasets with several entities in parent/child relationships, Featuretools saves a lot of time.

2. Open-source: It is free to use, with ongoing contributions from the data science community.

3. Time handling: Deep Feature Synthesis in Featuretools can automatically calculate the features for each training example at the specific time associated with that example by using cutoff times. This reduces label leakage problems.
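
The idea behind a cutoff time can be sketched without featuretools: when building features for a prediction made at a given time, only rows dated no later than that time may be used. The toy order log and the helper `count_orders_before` are assumptions for illustration, and whether the boundary itself is included is a choice made arbitrarily here.

```python
from datetime import datetime

# Toy order log for one customer: (order_id, order_time)
orders_log = [
    (1, datetime(2015, 1, 5)),
    (2, datetime(2015, 2, 10)),
    (3, datetime(2015, 3, 20)),
]

def count_orders_before(log, cutoff):
    """COUNT(orders) using only orders placed at or before the cutoff."""
    return sum(1 for _, placed in log if placed <= cutoff)

cutoff_time = datetime(2015, 3, 1)
n_orders = count_orders_before(orders_log, cutoff_time)  # order 3 is excluded
print(n_orders)
```

Excluding order 3 keeps information from after the prediction time out of the feature, which is how cutoff times guard against leakage.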

Disadvantages and how to improve them:
1. In-memory: According to the Featuretools documentation, Featuretools is intended to be run on datasets that fit in memory on one machine.
  
  Improvement: The Featuretools documentation describes how to handle computationally expensive tasks, for example by reducing the number of unique cutoff times, adjusting the chunk size, or partitioning and distributing the data.


2. Price: If you want to work on very large datasets using Apache Spark or Dask, you might need to pay for the Featuretools Enterprise APIs. Otherwise, you need to build the whole system yourself.
  
  Improvement: Featuretools allows you to test its native Enterprise APIs before paying.


3. Supervised learning: If you are doing supervised machine learning, you must specify and supply your own labels.
  
  Improvement: You can use external tools, or the 'Compose' tool offered by the Featuretools developers, to supply labels.
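
The partitioning idea mentioned under the first disadvantage can be sketched in plain Python: assign every user to one partition and send all of that user's rows to the same partition, so that features can be computed per partition and then concatenated. The helper names and the modulo assignment are assumptions for illustration, not the featuretools API.

```python
from collections import defaultdict

def partition_of(user_id, n_partitions):
    """Assign a user to a partition; all rows of that user follow it."""
    return user_id % n_partitions

def partition_rows(rows, n_partitions):
    """Group (user_id, payload) rows by the partition of their user."""
    parts = defaultdict(list)
    for user_id, payload in rows:
        parts[partition_of(user_id, n_partitions)].append((user_id, payload))
    return dict(parts)

# Toy rows: (user_id, order_id)
rows = [(1, "o1"), (2, "o2"), (1, "o3"), (3, "o4"), (2, "o5")]
parts = partition_rows(rows, n_partitions=2)
print(parts)
```

Keeping each user's rows together matters because every feature in this assignment aggregates over a single user's orders; splitting a user across partitions would give partial aggregates.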

## Submission
Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.