In [None]:
from IPython.display import HTML
from IPython.display import Image
Image(url= "https://www.autocar.co.nz/_News/_2018Bin/Mercedes-badge_www.jpg")



In [None]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

# <p style="text-align: center;"> Table of Contents </p>
- ## 1. Introduction
   - ### 1.1 [Abstract](#abstract)
   - ### 1.2 [Importing Libraries](#importing_libraries)
   - ### 1.3 [Dataset Summary](#dataset_summary)
   - ### 1.4 [Dataset Cleaning](#dataset_cleaning)
   - ### 1.5 [Exploratory Data Analysis (EDA)](#eda)
        - ### 1.5.1[Heatmap](#heatmap) 
        - ### 1.5.2[Stripplot X0](#Stripplot)
        - ### 1.5.3[Stripplot X1](#Stripplot_X1)
        - ### 1.5.4[Boxplot X2](#Boxplot_X2)
        - ### 1.5.5[Violin X3](#Violin_X3)
        - ### 1.5.6[Violin X4](#Violin_X4)
        - ### 1.5.7[Boxplot X5](#Boxplot_X5)
        - ### 1.5.8[Boxplot X6](#Boxplot_X6)
        - ### 1.5.9[Boxplot X8](#Boxplot_X8)
        - ### 1.5.10[Cardinality](#Cardinality)
- ## 2. [Modelling Of Data](#MOD)   
   - ### 2.1 [Categorical Conversion](#catconversion)
   - ### 2.2 [XGBoost Model 1](#xgb1)
   - ### 2.3 [XGBoost Model 2](#xgb2)
   - ### 2.4 [XGBoost Model 3](#xgb3)
        - ### 2.4.1[Principal Component Analysis](#pca) 
        - ### 2.4.2[Independent Component Analysis](#ica)
        - ### 2.4.3[Truncated SVD](#svd)
- ## 3. [H2O Modelling](#H2O) 
   - ### 3.1 [H2O Model 4](#xgb4)
- ## 4. [Stacked Algorithm Modeling from Kaggle Kernels](#STA) 

- ## 5. [Conclusion](#Conclusion)
- ## 6. [Contribution](#Contribution)
- ## 7. [Citation](#Citation)
- ## 8. [License](#License)

# <p style="text-align: center;">1. Introduction</p>

#   1.1 Abstract  <a id='abstract'></a>

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines.

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

Most of the code was adapted from kernels published for this competition, which made it clear what had to be done. The data has no null values and calls for a regression objective. XGBoost is applied to the dataset prepared in different ways; then PCA, ICA and truncated SVD are applied and the data is modelled again. The distinctive part of this project is the use of H2O to compute the metrics for the dataset, working from the raw dataset with its useful features. Finally, a stacking algorithm is used.





#   1.2 Importing Libraries  <a id='importing_libraries'></a>

In [None]:
import h2o
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

color = sns.color_palette()
import warnings; warnings.simplefilter('ignore')

# 1.3 Dataset Summary <a id='dataset_summary'></a>

### Statistical analysis of the given dataset

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
train_df.describe()

#### The following table shows the first five rows of the dataset, giving insight into what sort of dataset it is and which attributes it includes.

In [None]:
train_df.head()

#### Target Variable:

"y" is the variable we need to predict. It represents the time (in seconds) that the car took to pass testing for each car configuration.

#### Independent Variables:

The dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

#   1.4 Dataset Cleaning  <a id='dataset_cleaning'></a>

### Information about each column and about null values for each column

In [None]:
train_df.info()

### Types of columns in the given dataset

In [None]:
dtype_df = train_df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()

### Distribution of target variable 

The following plot shows the distribution of the target variable, and the printed summary gives the minimum, maximum, mean and standard deviation of 'y'. The distribution is close to normal, so if the dataset contained null values we would fill them using the median as the measure of central tendency. Outliers have a large effect on the mean but little effect on the median or the mode, so the median is the safer choice; it is also advisable to remove outliers.

In [None]:
y_train = train_df['y'].values
plt.figure(figsize=(15, 5))
plt.hist(y_train, bins=20)
plt.xlabel('Target value in seconds')
plt.ylabel('Occurrences')
plt.title('Distribution of the target value')

print('min: {} max: {} mean: {} std: {}'.format(min(y_train), max(y_train), y_train.mean(), y_train.std()))
print('Count of values above 180: {}'.format(np.sum(y_train > 180)))

So we have a pretty standard distribution here, which is centred around almost exactly 100. Nothing special to note here, except there is a single outlier at 265 seconds where every other value is below 180.
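
To illustrate why a single outlier matters more for the mean than for the median, the following sketch compares how both statistics react to one extreme value. The numbers are made up for illustration and are not taken from train.csv.

```python
import numpy as np

# Hypothetical test times in seconds (illustrative values, not from the dataset)
times = np.array([90.0, 95.0, 100.0, 105.0, 110.0])
with_outlier = np.append(times, 265.0)  # add one extreme value

# How far does each statistic move when the outlier is added?
mean_shift = with_outlier.mean() - times.mean()
median_shift = np.median(with_outlier) - np.median(times)

print('mean shift: {:.2f}'.format(mean_shift))
print('median shift: {:.2f}'.format(median_shift))
```

In this toy case the mean moves by 27.5 seconds while the median moves by only 2.5 seconds.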


In [None]:
#Finding missing values in the data set 
total = train_df.isnull().sum()[train_df.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(train_df)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

As we can see, there are no null values in the dataset, so we can proceed with EDA.

In [None]:
from IPython.display import HTML
from IPython.display import Image
Image(url= "https://media.giphy.com/media/jNdw5Qmy5MOpq/giphy.gif")



In [None]:
cols = [c for c in train_df.columns if 'X' in c]
cols_x = [c for c in test_df.columns if 'X' in c]

counts = [[], [], []]
counts_ = [[], [], []]
for c in cols:
    typ = train_df[c].dtype
    uniq = len(np.unique(train_df[c]))
    if uniq == 1: counts[0].append(c)
    elif uniq == 2 and typ == np.int64: counts[1].append(c)
    else: counts[2].append(c)

print('Constant features: {} Binary features: {} Categorical features: {}\n'.format(*[len(c) for c in counts]))

print('Constant features:', counts[0])
print('Categorical features:', counts[2])

print("Features for test")
for c in cols_x:
    typ = test_df[c].dtype
    uniq = len(np.unique(test_df[c]))
    if uniq == 1: counts_[0].append(c)
    elif uniq == 2 and typ == np.int64: counts_[1].append(c)
    else: counts_[2].append(c)

print('Constant features: {} Binary features: {} Categorical features: {}\n'.format(*[len(c) for c in counts_]))

print('Constant features:', counts_[0])
print('Categorical features:', counts_[2])

So all the integer columns are binary, and some of them have only one unique value, 0. We could exclude those constant columns from our modelling.
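
As a minimal sketch of how constant columns could be found and excluded, the example below uses a small made-up frame in place of train_df. The column names `A`, `B` and `C` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for train_df (column names are hypothetical)
toy = pd.DataFrame({
    'A': [0, 0, 0, 0],           # constant: carries no information
    'B': [0, 1, 0, 1],           # binary
    'C': ['a', 'b', 'a', 'c'],   # categorical
})

# nunique() counts the distinct values in a column
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

The same list comprehension applied to the real training frame should reproduce the "Constant features" list printed above, assuming no missing values.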


#    <p style="text-align: center;">1.5 Exploratory Data Analysis (EDA)  <a id='eda'></a></p>

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

## 1.5.1 Correlation Table  <a id='heatmap'></a>

Correlation is any statistical association, though in common usage it most often refers to how close two variables are to having a linear relationship with each other. The correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. It ranges from -1 to 1: r > 0 means the variables tend to increase together, r < 0 means one tends to decrease as the other increases, and values near 0 indicate little or no linear relationship.



In [None]:
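# correlation between the numeric columns (the categorical columns still hold strings here)
train_df.select_dtypes(include=[np.number]).corr()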
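
For reference, the Pearson coefficient that `corr()` computes by default can be sketched by hand. The helper `pearson_r` below is a hypothetical name introduced for illustration, and the data is made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # perfectly increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # perfectly decreasing line

print(round(r_up, 6), round(r_down, 6))
```

A perfectly increasing line gives r = 1 and a perfectly decreasing line gives r = -1; real features fall somewhere in between.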


### Let us explore the categorical columns present in the dataset.

## 1.5.2 Stripplot for Categorical Variable X0  <a id='Stripplot'></a>
Draw a scatterplot where one variable is categorical.

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

In [None]:
var_name = "X0"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.3 Stripplot for Categorical Variable X1  <a id='Stripplot_X1'></a>
Draw a scatterplot where one variable is categorical.

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

In [None]:
var_name = "X1"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.4 Boxplot for Categorical Variable X2  <a id='Boxplot_X2'></a>

A box plot is often used in exploratory data analysis to show the shape of a distribution, its central value, and its variability. The following figure shows the box plot of y for each level of X2.

In [None]:
var_name = "X2"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.5 Violinplot for Categorical Variable X3  <a id='Violin_X3'></a>

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.


In [None]:
var_name = "X3"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.6 Violinplot for Categorical Variable X4  <a id='Violin_X4'></a>

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.


In [None]:
var_name = "X4"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.7 Boxplot for Categorical Variable X5  <a id='Boxplot_X5'></a>

A box plot is often used in exploratory data analysis to show the shape of a distribution, its central value, and its variability. The following figure shows the box plot of y for each level of X5.

In [None]:
var_name = "X5"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.8 Boxplot for Categorical Variable X6  <a id='Boxplot_X6'></a>

A box plot is often used in exploratory data analysis to show the shape of a distribution, its central value, and its variability. The following figure shows the box plot of y for each level of X6.

In [None]:
var_name = "X6"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.9 Boxplot for Categorical Variable X8  <a id='Boxplot_X8'></a>

A box plot is often used in exploratory data analysis to show the shape of a distribution, its central value, and its variability. The following figure shows the box plot of y for each level of X8.

In [None]:
var_name = "X8"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()

## 1.5.10 Cardinality of Variables <a id='Cardinality'></a>

Cardinality here means the number of distinct values a variable takes. It matters for categorical features because a feature with many levels produces many columns under one hot encoding and may have levels with very few observations.

What about the cardinality of our features?

In [None]:
for c in counts[2]:
    value_counts = train_df[c].value_counts()
    fig, ax = plt.subplots(figsize=(10, 5))
    plt.title('Categorical feature {} - Cardinality {}'.format(c, len(np.unique(train_df[c]))))
    plt.xlabel('Feature value')
    plt.ylabel('Occurrences')
    plt.bar(range(len(value_counts)), value_counts.values)
    ax.set_xticks(range(len(value_counts)))
    ax.set_xticklabels(value_counts.index, rotation='vertical')
    plt.show()
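
The same cardinality figures can be read off without plotting. The sketch below uses a made-up frame; the column names `X_a` and `X_b` and their values are hypothetical.

```python
import pandas as pd

# Toy categorical columns (hypothetical names and values)
toy = pd.DataFrame({
    'X_a': ['k', 'az', 'k', 't', 'az', 'k'],
    'X_b': ['v', 'v', 'b', 'v', 'b', 'v'],
})

cardinality = toy.nunique()               # distinct values per column
level_counts = toy['X_a'].value_counts()  # occurrences of each level

print(cardinality)
print(level_counts)
```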

#    <p style="text-align: center;">2. Modelling of Data  <a id='MOD'></a></p>


## 2.1 Categorical Conversion  <a id='catconversion'></a>
#### The models below need numeric input, so the multi-class categorical variables in the dataset have to be converted to numeric data.

Two common options are cat coding and one hot encoding. Cat coding converts categorical data into integers; for ordinal data it maps the sortable categories to numbers, e.g. old < renovated < new → 0, 1, 2.

One hot encoding (binary values from categorical data)
A one hot encoding is a representation of categorical variables as binary vectors.  This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
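
The two encodings described above can be sketched with pandas on the old/renovated/new example. This is an illustration on made-up values, not the conversion applied to the Mercedes data.

```python
import pandas as pd

condition = pd.Series(['old', 'new', 'renovated', 'old'])

# Cat coding with an explicit order: old < renovated < new -> 0, 1, 2
ordered = pd.Categorical(condition, categories=['old', 'renovated', 'new'], ordered=True)
codes = ordered.codes.tolist()

# One hot encoding: one binary column per level
one_hot = pd.get_dummies(condition).astype(int)

print(codes)
print(one_hot)
```

Cat coding keeps a single column, while one hot encoding produces one column per level, each row containing exactly one 1.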

In this dataset the multi-class categorical variables are X0, X1, X2, X3, X4, X5, X6 and X8, and the remaining feature columns are binary. The binary columns are already numeric, so only the multi-class columns need converting, and for that we use cat coding only.

We have mostly used cat coding because it converts the column in place, whereas one hot encoding creates a new binary (1/0) column for every level the column holds, which makes the dataset wider and more redundant. Both methods can be used for categorical to numeric conversion, but we have preferred cat coding over one hot encoding.
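
The conversion cells that follow implement the label-to-integer step with a small helper that sums the character codes of each label. The sketch below runs that same helper on made-up labels to show one property worth knowing: labels made of the same letters in a different order map to the same integer. Whether any labels in the Mercedes data actually collide is not checked here.

```python
# Same mapping as in the conversion cells: sum of the character codes of the label
mapper = lambda x: sum([ord(digit) for digit in x])

code_a = mapper('a')     # 97
code_ab = mapper('ab')   # 97 + 98 = 195
code_ba = mapper('ba')   # also 195, so 'ab' and 'ba' collide

print(code_a, code_ab, code_ba)
```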

In [None]:
usable_columns = list(set(test_df.columns) - set(['ID', 'y']))
usable_columns = list(set(train_df.columns) - set(['ID', 'y']))

y_train = train_df['y'].values
id_test = test_df['ID'].values

##dataset for normal modelling and not H2O
x_train = train_df[usable_columns]  
x_test = test_df[usable_columns]
x_train_not_dropped = train_df[usable_columns]  
x_test_not_dropped = test_df[usable_columns]

In [None]:
x_test.head()

In [None]:
# constant columns to drop
unique_columns=['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
x_test = x_test.drop(unique_columns,axis=1)

for column in usable_columns:
    cardinality = len(np.unique(test_df[column]))
    if cardinality > 2: # Column is categorical
        mapper = lambda x: sum([ord(digit) for digit in x])
        test_df[column] = test_df[column].apply(mapper)
        x_test[column]= x_test[column].apply(mapper)
        x_test_not_dropped[column]= x_test_not_dropped[column].apply(mapper)

        


In [None]:
x_train = x_train.drop(unique_columns,axis=1)

for column in usable_columns:
    cardinality = len(np.unique(train_df[column]))
      
    if cardinality > 2: # Column is categorical
        mapper = lambda x: sum([ord(digit) for digit in x])
        train_df[column] = train_df[column].apply(mapper)
        x_train[column]= x_train[column].apply(mapper)
        x_train_not_dropped[column]= x_train_not_dropped[column].apply(mapper)



In [None]:
finaltrainset = train_df[usable_columns].values
finaltestset = test_df[usable_columns].values

## 2.2 XGBoost on Normal Dataset Without Modelling  <a id='xgb1'></a>

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is both a linear model solver and a tree learning algorithm. It performs parallel computation on a single machine and supports various objective functions, including regression, classification and ranking.
> Two reasons to use XGBoost are also the two goals of the project: 

> 1. Execution Speed. 

> 2. Model Performance.

The XGBoost library implements the gradient boosting decision tree algorithm.

This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
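
The residual-fitting idea can be sketched in a few lines of NumPy. This is a toy version with one-split trees (stumps) and squared loss on synthetic data; it is not how XGBoost is implemented, and `fit_stump` is a hypothetical helper written for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that minimises the squared error of the residual."""
    best = None
    for t in np.unique(x)[:-1]:   # candidate thresholds; the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # synthetic target

eta = 0.1                            # learning rate (shrinkage)
pred = np.full_like(y, y.mean())     # start from the mean of the target
mse_history = [np.mean((y - pred) ** 2)]
for _ in range(100):
    residual = y - pred              # negative gradient of the squared loss
    stump = fit_stump(x, residual)   # new model predicts the current errors
    pred = pred + eta * stump(x)     # add the shrunken correction
    mse_history.append(np.mean((y - pred) ** 2))

print('MSE at start: {:.4f}  after boosting: {:.4f}'.format(mse_history[0], mse_history[-1]))
```

Each round fits a new weak model to what the current ensemble still gets wrong, so the training error cannot increase from one round to the next.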

This approach supports both regression and classification predictive modeling problems.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters.

>General parameters relate to which booster we are using to do boosting, commonly tree or linear model

>Booster parameters depend on which booster you have chosen

>Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters with ranking tasks.

>Command line parameters relate to the behavior of the CLI version of XGBoost and are not used in this notebook.
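
As a sketch of how the three groups fit together, the dictionaries below keep them separate and then merge them into the single flat dictionary that `xgb.train` and `xgb.cv` accept. The values are illustrative. The names `reg:squarederror` and `verbosity` are the current spellings of what older XGBoost releases called `reg:linear` and `silent`; the number of trees is not a dictionary entry but the `num_boost_round` argument.

```python
# General parameters: which booster to use and how much to log
general_params = {'booster': 'gbtree', 'verbosity': 0}

# Booster parameters: control the individual trees
booster_params = {'eta': 0.005, 'max_depth': 4, 'subsample': 0.95}

# Learning task parameters: objective and evaluation metric
task_params = {'objective': 'reg:squarederror', 'eval_metric': 'rmse'}

# xgb.train and xgb.cv take one flat dictionary
params = {**general_params, **booster_params, **task_params}
print(params)
```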

In [None]:
mean_Y = np.mean(y_train)
# prepare dict of params for xgboost to run with
# (the number of trees is set by num_boost_round below; 'n_trees' is not an XGBoost parameter)
xgb_params = {
    'eta': 0.005,     # booster parameter: learning rate; shrinks each tree's contribution to make boosting more conservative
    'max_depth': 4,   # booster parameter: maximum depth of a tree; larger values make the model more complex and more likely to overfit
    'subsample': 0.95, # booster parameter: fraction of training rows sampled before growing each tree; values below 1 help prevent overfitting
    'objective': 'reg:squarederror', # learning task parameter: squared-error regression (formerly named 'reg:linear')
    'eval_metric': 'rmse', # learning task parameter: evaluation metric for validation data
    'base_score': mean_Y, # learning task parameter: initial prediction for all instances, here mean(target)
    'verbosity': 0    # general parameter: replaces the deprecated 'silent' flag
}

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(x_train_not_dropped, label=y_train)
dtest = xgb.DMatrix(x_test_not_dropped)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=1000, # increase to have better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=50, 
                   show_stdv=False
                  )

num_boost_rounds = 500

# train model
model = xgb.train(dict(xgb_params, verbosity=1), dtrain, num_boost_round=num_boost_rounds)

# check R^2 on the training data (to get a higher score, increase num_boost_rounds above)
from sklearn.metrics import r2_score

# r2_score is the coefficient of determination, not a classification accuracy
print("Training R^2: %.2f" % r2_score(dtrain.get_label(), model.predict(dtrain)))

# make predictions and save results
pred_y = model.predict(dtest)



In [None]:
print("Training R^2: %.2f" % r2_score(dtrain.get_label(), model.predict(dtrain)))
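
The score printed above is the coefficient of determination (R²) on the training data. A hand-written version, shown here on made-up numbers, makes two points concrete: a model that always predicts the mean scores 0, and the order of the two arguments matters. `r_squared` is a hypothetical helper that mirrors the definition used by `sklearn.metrics.r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 110.0, 90.0, 120.0]      # illustrative targets
y_hat = [102.0, 108.0, 95.0, 112.0]       # illustrative predictions

perfect = r_squared(y_true, y_true)          # exact predictions
baseline = r_squared(y_true, [105.0] * 4)    # always predict the mean of y_true
correct_order = r_squared(y_true, y_hat)     # (y_true, y_pred)
swapped_order = r_squared(y_hat, y_true)     # arguments swapped

print(perfect, baseline, round(correct_order, 3), round(swapped_order, 3))
```

Because the total sum of squares is taken around the mean of the first argument, passing predictions first gives a different number.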


### First Submission With No change in dataset

In [None]:
output = pd.DataFrame({'id': id_test, 'y': pred_y})
output.to_csv('submission_manali_1.csv', index=False)

## 2.3 XGBoost on Normal Dataset With Fewer Features  <a id='xgb2'></a>

In [None]:
mean_Y = np.mean(y_train)
# prepare dict of params for xgboost to run with
# (the number of trees is set by num_boost_round below; 'n_trees' is not an XGBoost parameter)
xgb_params = {
    'eta': 0.005,     # booster parameter: learning rate; shrinks each tree's contribution to make boosting more conservative
    'max_depth': 4,   # booster parameter: maximum depth of a tree; larger values make the model more complex and more likely to overfit
    'subsample': 0.95, # booster parameter: fraction of training rows sampled before growing each tree; values below 1 help prevent overfitting
    'objective': 'reg:squarederror', # learning task parameter: squared-error regression (formerly named 'reg:linear')
    'eval_metric': 'rmse', # learning task parameter: evaluation metric for validation data
    'base_score': mean_Y, # learning task parameter: initial prediction for all instances, here mean(target)
    'verbosity': 0    # general parameter: replaces the deprecated 'silent' flag
}

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=1000, # increase to have better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=50, 
                   show_stdv=False
                  )

num_boost_rounds = 500

# train model
model = xgb.train(dict(xgb_params, verbosity=1), dtrain, num_boost_round=num_boost_rounds)

# check R^2 on the training data (to get a higher score, increase num_boost_rounds above)
from sklearn.metrics import r2_score

# r2_score is the coefficient of determination, not a classification accuracy
print("Training R^2: %.2f" % r2_score(dtrain.get_label(), model.predict(dtrain)))

# make predictions and save results
#pred_y = model.predict(dtest)



In [None]:
print("Training R^2: %.2f" % r2_score(dtrain.get_label(), model.predict(dtrain)))


In [None]:
#make predictions and save results
pred_y = model.predict(dtest)



In [None]:
output = pd.DataFrame({'id': id_test, 'y': pred_y})
output.to_csv('submission_manali_2.csv', index=False)

## 2.4 XGBoost on Dataset With PCA, ICA and Truncated SVD  <a id='xgb3'></a>

In [None]:
from sklearn.decomposition import PCA, FastICA
from sklearn.decomposition import TruncatedSVD

In [None]:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler 


### 2.4.1 Principal Component Analysis <a id='pca'></a>

Reducing the dimension of the feature space is called “dimensionality reduction.” There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:

>Feature Elimination: we reduce the feature space by eliminating features.

>Feature Extraction: suppose we start with ten independent variables. In feature extraction, we create ten “new” independent variables, where each “new” independent variable is a combination of the ten “old” independent variables. We create these new variables in a specific way and order them by how much of the variance in the data they explain.

Principal component analysis is a technique for feature extraction: it combines our input variables in a specific way, and then we can drop the “least important” new variables while still retaining the most valuable parts of all of the original variables. As an added benefit, the “new” variables after PCA are uncorrelated with one another. This helps because linear models behave poorly when the independent variables are strongly correlated.
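
To make the description concrete, here is a minimal NumPy sketch of PCA through the eigenvectors of the covariance matrix, run on synthetic data. It is meant only to show the mechanics; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two strongly correlated synthetic features plus one independent noise feature
base = rng.normal(size=500)
X = np.column_stack([base,
                     2 * base + rng.normal(scale=0.1, size=500),
                     rng.normal(size=500)])

Xc = X - X.mean(axis=0)                   # PCA centres the data first
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # reorder by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                     # the data expressed in the new variables
explained_ratio = eigvals / eigvals.sum()
score_cov = np.cov(scores, rowvar=False)  # covariance of the new variables

print(np.round(explained_ratio, 3))
print(np.round(score_cov, 3))             # off-diagonal entries are numerically zero
```

The covariance matrix of the new variables is diagonal, which is what "uncorrelated" means here, and the diagonal holds the variances that determine the ordering.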

In [None]:
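# fit on the features only; the target y must not take part in the decomposition
pca = PCA().fit(train_df.drop(["y"], axis=1))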
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance'); 

#### The pca.explained_variance_ratio_ attribute holds the fraction of the total variance explained by each component, so pca.explained_variance_ratio_[i] is the share explained by component i+1 alone. Taking pca.explained_variance_ratio_.cumsum(), as plotted above, gives a vector x such that x[i] is the cumulative share of variance explained by the first i+1 components.
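
The cumulative curve can also be used to pick the number of components for a chosen variance target. The sketch below uses made-up ratios in place of `pca.explained_variance_ratio_`, and the 95% target is an arbitrary illustrative choice.

```python
import numpy as np

# Made-up explained-variance ratios, standing in for pca.explained_variance_ratio_
ratios = np.array([0.60, 0.30, 0.08, 0.02])
cumulative = np.cumsum(ratios)   # cumulative[i] covers the first i+1 components

# Smallest number of components whose cumulative share reaches the target
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative, n_needed)
```

With these ratios the first two components cover about 90% of the variance, so three are needed to reach 95%.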

### 2.4.2 Independent Component Analysis <a id='ica'></a>

FastICA: a fast algorithm for Independent Component Analysis.
In PCA, we find basis vectors that 'best explain' the variance of the data. The first (i.e. highest ranked) basis vector is the direction that captures the most variance. The second captures the most remaining variance while being orthogonal to the first, and so on. (These basis vectors are the eigenvectors of the data's covariance matrix.)

In ICA, we again find basis vectors, but this time we want basis vectors such that projecting the data onto each one yields one of the statistically independent components of the original data.

### 2.4.3 Truncated Singular Value Decomposition <a id='svd'></a>

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. The singular values help determine which directions are most informative and which ones we can do without.

We compute the SVD of our training data (call it matrix A) to obtain U, S and V*. We then set to zero all values of S below a chosen threshold (e.g. 0.1) and call the new matrix S'. Then A' = US'V* is a lower-rank approximation of A; the components whose singular values were zeroed can be dropped, sometimes without any performance penalty (depending on the data and the threshold chosen). This is called truncated SVD. In scikit-learn's TruncatedSVD the cut is specified as the number of components to keep (n_components) rather than as a threshold.
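
The truncation step can be sketched directly with NumPy on a small synthetic matrix. This version keeps the k largest singular values, and k = 2 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                          # small synthetic data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U @ diag(s) @ Vt
k = 2                                                # number of components to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation of A
reduced = A @ Vt[:k, :].T                            # 8 x k matrix of reduced features

error = np.linalg.norm(A - A_k)                      # Frobenius norm of what was discarded
print(reduced.shape, round(float(error), 4))
```

The approximation error equals the root of the sum of the squared discarded singular values, so looking at `s` tells us how much information a given k gives up.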




In [None]:
#svd
tsvd = TruncatedSVD(n_components=10, random_state=42)
tsvd_results_train = tsvd.fit_transform(train_df.drop(["y"], axis=1))
tsvd_results_test = tsvd.transform(test_df)
 
# PCA
pca = PCA(n_components=10, random_state=42)
pca2_results_train = pca.fit_transform(train_df.drop(["y"], axis=1))
pca2_results_test = pca.transform(test_df)

# ICA
ica = FastICA(n_components=10, random_state=42)
ica2_results_train = ica.fit_transform(train_df.drop(["y"], axis=1))
ica2_results_test = ica.transform(test_df)


In [None]:
# Append decomposition components to datasets
n_comp = 10  # must match n_components used for the decompositions above
for i in range(1, n_comp + 1):
    x_train['pca_' + str(i)] = pca2_results_train[:,i-1]
    x_test['pca_' + str(i)] = pca2_results_test[:,i-1]
    
    x_train['ica_' + str(i)] = ica2_results_train[:,i-1]
    x_test['ica_' + str(i)] = ica2_results_test[:,i-1]
    
    x_train['tsvd_' + str(i)] = tsvd_results_train[:,i-1]
    x_test['tsvd_' + str(i)] = tsvd_results_test[:, i-1]




In [None]:
### Regressor
import xgboost as xgb

# prepare dict of params for xgboost to run with
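# (the number of trees is set by num_boost_round below; 'n_trees' is not an XGBoost parameter)
xgb_params = {
    'eta': 0.005,
    'max_depth': 4,
    'subsample': 0.95,
    'objective': 'reg:squarederror',  # formerly named 'reg:linear'
    'eval_metric': 'rmse',
    'base_score': mean_Y, # base prediction = mean(target)
    'verbosity': 0  # replaces the deprecated 'silent' flag
}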



# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(x_train,label=y_train)
dtest = xgb.DMatrix(x_test)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=500, # increase to have better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=10, 
                   show_stdv=False
                 )

                

num_boost_rounds = len(cv_result)
print(num_boost_rounds)
                   


In [None]:
num_boost_rounds = 500
# train model

model = xgb.train(dict(xgb_params, verbosity=0), dtrain, num_boost_round=num_boost_rounds)
from sklearn.metrics import r2_score
# r2_score expects (y_true, y_pred) in that order
print("Training R^2: %.2f" % r2_score(dtrain.get_label(), model.predict(dtrain)))

In [None]:
# make predictions and save results
y_pred = model.predict(dtest)



In [None]:
output = pd.DataFrame({'id': id_test, 'y': y_pred})
output.to_csv('submission_manali_3.csv', index=False)

## 3 Modelling with H2O <a id='H2O'></a>

H2O is Java-based software for data modeling and general computing. The goal of H2O is to allow simple horizontal scaling of a given problem in order to produce a solution faster.

After the dataset has been prepared for modeling by keeping the significant data and removing the insignificant data, H2O is used to build a model of it.
H2O can also report which algorithms give the best result for the given dataset.
    


### 3.1 H2O Model 4 <a id='xgb4'></a>

The scores obtained above were not great; in fact they were all close to each other and none stood out. So let's try H2O and see whether it can give us the result we want.

In [None]:
print(h2o.__version__)

#### Starting H2O

In [None]:
h2o.init(strict_version_check=False) # start h2o

In [None]:
h2o.connect()

#### Creating H2O frames for test and train data

In [None]:
h2o_frame=h2o.H2OFrame(train_df)
h2o_test=h2o.H2OFrame(test_df)

#### Table overview 1: The following table shows summary statistics for the training dataset (train_df), giving us insight into what kind of dataset it is and which attributes it includes.

In [None]:
h2o_frame.describe()

#### Table overview 2: The following table shows summary statistics for the test dataset (test_df), giving us insight into what kind of dataset it is and which attributes it includes.

In [None]:
h2o_test.describe()

#### Describing the target and features for the training data frame
>Training Features :- 'ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100', 'X101', 'X102', 'X103', 'X104', 'X105', 'X106', 'X107', 'X108', 'X109', 'X110', 'X111', 'X112', 'X113', 'X114', 'X115', 'X116', 'X117', 'X118', 'X119', 'X120', 'X122', 'X123', 'X124', 'X125', 'X126', 'X127', 'X128', 'X129', 'X130', 'X131', 'X132', 'X133', 'X134', 'X135', 'X136', 'X137', 'X138', 'X139', 'X140', 'X141', 'X142', 'X143', 'X144', 'X145', 'X146', 'X147', 'X148', 'X150', 'X151', 'X152', 'X153', 'X154', 'X155', 'X156', 'X157', 'X158', 'X159', 'X160', 'X161', 'X162', 'X163', 'X164', 'X165', 'X166', 'X167', 'X168', 'X169', 'X170', 'X171', 'X172', 'X173', 'X174', 'X175', 'X176', 'X177', 'X178', 'X179', 'X180', 'X181', 'X182', 'X183', 'X184', 'X185', 'X186', 'X187', 'X189', 'X190', 'X191', 'X192', 'X194', 'X195', 'X196', 'X197', 'X198', 'X199', 'X200', 'X201', 'X202', 'X203', 'X204', 'X205', 'X206', 'X207', 'X208', 'X209', 'X210', 'X211', 'X212', 'X213', 'X214', 'X215', 'X216', 'X217', 'X218', 'X219', 'X220', 'X221', 'X222', 'X223', 'X224', 'X225', 'X226', 'X227', 'X228', 'X229', 'X230', 'X231', 'X232', 'X233', 'X234', 'X235', 'X236', 'X237', 'X238', 'X239', 'X240', 'X241', 'X242', 'X243', 'X244', 'X245', 'X246', 'X247', 'X248', 'X249', 'X250', 'X251', 'X252', 'X253', 'X254', 'X255', 'X256', 'X257', 'X258', 'X259', 'X260', 'X261', 'X262', 'X263', 'X264', 'X265', 'X266', 
'X267', 'X268', 'X269', 'X270', 'X271', 'X272', 'X273', 'X274', 'X275', 'X276', 'X277', 'X278', 'X279', 'X280', 'X281', 'X282', 'X283', 'X284', 'X285', 'X286', 'X287', 'X288', 'X289', 'X290', 'X291', 'X292', 'X293', 'X294', 'X295', 'X296', 'X297', 'X298', 'X299', 'X300', 'X301', 'X302', 'X304', 'X305', 'X306', 'X307', 'X308', 'X309', 'X310', 'X311', 'X312', 'X313', 'X314', 'X315', 'X316', 'X317', 'X318', 'X319', 'X320', 'X321', 'X322', 'X323', 'X324', 'X325', 'X326', 'X327', 'X328', 'X329', 'X330', 'X331', 'X332', 'X333', 'X334', 'X335', 'X336', 'X337', 'X338', 'X339', 'X340', 'X341', 'X342', 'X343', 'X344', 'X345', 'X346', 'X347', 'X348', 'X349', 'X350', 'X351', 'X352', 'X353', 'X354', 'X355', 'X356', 'X357', 'X358', 'X359', 'X360', 'X361', 'X362', 'X363', 'X364', 'X365', 'X366', 'X367', 'X368', 'X369', 'X370', 'X371', 'X372', 'X373', 'X374', 'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384', 'X385'

>Target Feature:- y

In [None]:
y = h2o_frame.columns[1] #target variable
X = [name for name in h2o_frame.columns if name != y] #train features

Using cross-validation rather than a single train/test split reduces the variance of the goodness-of-fit estimates. A single train/test split is appropriate only in rare cases, and that choice is best left to expert users.
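To illustrate how the fold scores are combined into one estimate, here is a minimal NumPy sketch of k-fold cross-validation. The helper `kfold_rmse`, the synthetic data, and the least-squares model are all assumptions for illustration; H2O AutoML handles its own cross-validation internally.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Return the per-fold RMSE of a least-squares fit (illustrative helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)               # k disjoint, roughly equal folds
    scores = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit y ~ X w + b by least squares on the training part only
        A = np.column_stack([X[trn], np.ones(len(trn))])
        coef, *_ = np.linalg.lstsq(A, y[trn], rcond=None)
        pred = np.column_stack([X[val], np.ones(len(val))]) @ coef
        scores.append(np.sqrt(np.mean((y[val] - pred) ** 2)))
    return np.array(scores)

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 100.0 + rng.normal(scale=1.0, size=200)

fold_scores = kfold_rmse(X, y, k=5)
print("per-fold RMSE:", fold_scores.round(3))
print("CV estimate  :", fold_scores.mean().round(3))
```

Each individual fold score is what a single hold-out split could have reported; the cross-validated estimate is their mean, and every row is used for validation exactly once.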

AutoML automates the training and tuning of many models within a user-specified time limit.

The current version of the AutoML function can train and cross-validate a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), and a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.

An AutoML tool should cover data preparation, model generation, and ensembling while exposing as few parameters as possible, so that users can complete these tasks with less confusion. H2O AutoML does this with ease, requiring only minimal parameters from the user.

In [None]:
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_runtime_secs=1200, project_name="automl_test", balance_classes=False) # init automl, run for up to 1200 seconds
aml.train(x=X,  
           y=y,
           training_frame=h2o_frame)

### Leaderboard
Next, we will view the AutoML Leaderboard. Since we did not specify a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses cross-validation metrics to rank the models.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of binary classification, the default ranking metric is Area Under the ROC Curve (AUC). In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.

The leader model is stored at aml.leader and the leaderboard is stored at aml.leaderboard.

#### Table overview: Leaderboard with best metrics

In [None]:
lb=aml.leaderboard
lb

In [None]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()

#### Getting Models for the Stacked Ensemble: the learning algorithms that worked together to give us the best model

In [None]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model (matched by name prefix, so it does not depend on one run's timestamp)
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the metalearner of the leading model (assumes the leader is a Stacked Ensemble)
metalearner = h2o.get_model(aml.leader.metalearner()['name'])

In [None]:
metalearner.coef_norm()

In [None]:
m_id = ''
for model in aml_leaderboard_df['model_id']:
    if 'StackedEnsemble' not in model:
        print(model)
        if m_id == '':
            m_id = model  # remember the first (best-ranked) non-stacked model
print("model_id ", m_id)

In [None]:
non_stacked= h2o.get_model(m_id)
print (non_stacked.algo)

#### Making a Prediction

In [None]:
pred = aml.predict(h2o_test)


#### Making a dataframe for the generated predictions

In [None]:
x=pred.as_data_frame(use_pandas=True)

In [None]:
x['id'] = test_df['ID'].values

In [None]:
x=x[['id','predict']]

In [None]:
x = x.rename(columns={'id': 'id', 'predict': 'y'})

In [None]:
x.to_csv('submission_manali_4.csv', index=False)  # omit the index column, as in the other submission files

In [None]:
x.head()

In [None]:
h2o.shutdown()

## 4 Stacking Algorithm - From a Kaggle Notebook <a id='STA'></a>

In [None]:
import numpy as np
from sklearn.base import BaseEstimator,TransformerMixin, ClassifierMixin
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import ElasticNetCV, LassoLarsCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.utils import check_array
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA, FastICA
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import r2_score
import warnings; warnings.simplefilter('ignore')

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

train_ids = train.ID




The stacking class below wraps an estimator so that its predictions are added to the feature matrix as a synthetic feature. The transformed matrix is then passed to the next estimator in the pipeline, which can give us a better model.


In [None]:
class StackingEstimator(BaseEstimator, TransformerMixin):
    
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self
    def transform(self, X):
        X = check_array(X)
        X_transformed = np.copy(X)
        # add class probabilities as a synthetic feature
        if issubclass(self.estimator.__class__, ClassifierMixin) and hasattr(self.estimator, 'predict_proba'):
            X_transformed = np.hstack((self.estimator.predict_proba(X), X))

        # add class prediction as a synthetic feature
        X_transformed = np.hstack((np.reshape(self.estimator.predict(X), (-1, 1)), X_transformed))

        return X_transformed
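To see what this transformer does to the feature matrix, the following self-contained sketch mimics the regression branch of `transform` on a toy array. The `MeanRegressor` stand-in and the `stack_predictions` helper are hypothetical names introduced only for this illustration; they are not part of the notebook or of scikit-learn.

```python
import numpy as np

class MeanRegressor:
    """Toy stand-in estimator (hypothetical): always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def stack_predictions(estimator, X):
    """Prepend the estimator's predictions to X as a synthetic feature column."""
    preds = np.reshape(estimator.predict(X), (-1, 1))
    return np.hstack((preds, X))

# Toy data, for illustration only
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([90.0, 100.0, 110.0])

est = MeanRegressor().fit(X, y)
X_stacked = stack_predictions(est, X)
print(X_stacked.shape)    # one more column than X
print(X_stacked[:, 0])    # the prediction column
```

The next estimator in a pipeline therefore sees the original features plus one column summarising what the previous estimator learned.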


In [None]:
usable_columns = list(set(train.columns) - set(['y']))


In [None]:

cat_columns = train.select_dtypes(['object']).columns
cat_columns_test= test.select_dtypes(['object']).columns
train[cat_columns] = train[cat_columns].astype('category')
test[cat_columns_test] = test[cat_columns_test].astype('category')
test[cat_columns_test] = test[cat_columns_test].apply(lambda x: x.cat.codes)
train[cat_columns] = train[cat_columns].apply(lambda x: x.cat.codes)
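One caveat with encoding train and test separately: `cat.codes` numbers the categories by their sorted order within each frame, so the same label can receive different codes when the two frames contain different sets of values. A possible alternative, sketched here on made-up toy frames (the column name `X0` is reused only for illustration), is to build one category list from both frames before taking the codes.

```python
import pandas as pd

# Toy frames, for illustration only
train_toy = pd.DataFrame({'X0': ['a', 'c', 'a']})
test_toy = pd.DataFrame({'X0': ['b', 'c', 'b']})

# Separate encoding: 'a' in train and 'b' in test both end up as code 0
separate_train = train_toy['X0'].astype('category').cat.codes.tolist()
separate_test = test_toy['X0'].astype('category').cat.codes.tolist()

# Shared encoding: one category list covering both frames
categories = sorted(set(train_toy['X0']) | set(test_toy['X0']))
shared_dtype = pd.CategoricalDtype(categories=categories)
shared_train = train_toy['X0'].astype(shared_dtype).cat.codes.tolist()
shared_test = test_toy['X0'].astype(shared_dtype).cat.codes.tolist()

print(separate_train, separate_test)
print(shared_train, shared_test)
```

With the shared dtype, each label maps to the same integer in both frames, which is what a tree model trained on one frame and applied to the other would need.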



In [None]:
y_train = train['y'].values
y_mean = np.mean(y_train)
id_test = test['ID'].values
# finaltrainset and finaltestset are used only by the stacked model (they do not contain PCA, SVD... arrays)



finaltrainset = train[usable_columns].values
finaltestset = test[usable_columns].values


'''Train the xgb model then predict the test data'''

sub = pd.DataFrame()
sub['ID'] = id_test
sub['y'] = 0
for fold in range(1,4):
    np.random.seed(fold)
    xgb_params = {
        'n_trees': 520, 
        'eta': 0.0045,
        'max_depth': 4,
        'subsample': 0.93,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'base_score': y_mean, # base prediction = mean(target)
        'silent': True,
        'colsample_bytree': 0.7,
        'seed': fold,
    }
    
    dtrain = xgb.DMatrix(train.drop('y', axis=1), y_train)
    dtest = xgb.DMatrix(test)
    
    num_boost_rounds = 1250
    # train model
    model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=num_boost_rounds)
    y_pred = model.predict(dtest)
    
    '''Train the stacked models then predict the test data'''
    
    stacked_pipeline = make_pipeline(
        StackingEstimator(estimator=LassoLarsCV(normalize=True)),
        StackingEstimator(estimator=GradientBoostingRegressor(learning_rate=0.001, loss="huber", max_depth=3, max_features=0.55, min_samples_leaf=18, min_samples_split=14, subsample=0.7)),
        LassoLarsCV()
    
    )
    
    stacked_pipeline.fit(finaltrainset, y_train)
    results = stacked_pipeline.predict(finaltestset)
    
    '''R2 Score on the entire Train data when averaging'''
    
    print('R2 score on train data:')
    print(r2_score(y_train,stacked_pipeline.predict(finaltrainset)*0.2855 + model.predict(dtrain)*0.7145))
    
    '''Average the predictions of both models on the test data, then save them to a csv file'''

    sub['y'] += y_pred*0.75 + results*0.25
sub['y'] /= 3

leaks = {
    1:71.34112,
    12:109.30903,
    23:115.21953,
    28:92.00675,
    42:87.73572,
    43:129.79876,
    45:99.55671,
    57:116.02167,
    3977:132.08556,
    88:90.33211,
    89:130.55165,
    93:105.79792,
    94:103.04672,
    1001:111.65212,
    104:92.37968,
    72:110.54742,
    78:125.28849,
    105:108.5069,
    110:83.31692,
    1004:91.472,
    1008:106.71967,
    1009:108.21841,
    973:106.76189,
    8002:95.84858,
    8007:87.44019,
    1644:99.14157,
    337:101.23135,
    253:115.93724,
    8416:96.84773,
    259:93.33662,
    262:75.35182,
    1652:89.77625
    }
sub['y'] = sub.apply(lambda r: leaks[int(r['ID'])] if int(r['ID']) in leaks else r['y'], axis=1)
sub.to_csv('stacked-models.csv', index=False)
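The blending and override logic in the cell above can be summarised on toy numbers. In this sketch all prediction values, the IDs, and the `known` mapping are invented for illustration; only the 0.75/0.25 weights and the averaging over three seeds mirror the loop.

```python
import numpy as np
import pandas as pd

ids = np.array([1, 2, 3])
# Hypothetical per-seed predictions from the two models (3 seeds x 3 rows)
xgb_preds = [np.array([100.0, 90.0, 110.0]),
             np.array([102.0, 92.0, 108.0]),
             np.array([101.0, 91.0, 109.0])]
stack_preds = [np.array([96.0, 94.0, 112.0])] * 3

sub_toy = pd.DataFrame({'ID': ids, 'y': 0.0})
for xgb_p, stack_p in zip(xgb_preds, stack_preds):
    sub_toy['y'] += 0.75 * xgb_p + 0.25 * stack_p   # weighted blend for one seed
sub_toy['y'] /= len(xgb_preds)                       # average over the seeds

# Override rows whose ID appears in a lookup of known values
known = {2: 95.0}
sub_toy['y'] = sub_toy['ID'].map(known).fillna(sub_toy['y'])
print(sub_toy)
```

Using `map` followed by `fillna` is a vectorised equivalent of the row-wise `apply` lookup: IDs missing from the mapping keep their blended prediction.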

## <p style="text-align: center;">5. Conclusion</p> <a id='Conclusion'></a>

In [None]:
import sys
from astropy.table import Table, Column

t = Table(names=('Model Name', 'Model_Score(r2squared)', 'Algorithm Used'), dtype=('S40', 'S10', 'S30'))
t.add_row(('XGBoost_Model_1', '0.54975', 'XGBoost with full data converted'))
t.add_row(('XGBoost_Model_2', '0.54998', 'XGBoost with some cleaned data'))
t.add_row(('XGBoost_Model_3', '0.55199', 'XGBoost with PCA, SVD and ICA'))
t.add_row(('H2O_Model_1', '0.55642', 'H2O AutoML'))
t.add_row(('Stacked_Model', '0.58118', 'Stacking and pipeline and XGBoost'))

t.meta['comments'] = ['Conclusion For Models']
t.write(sys.stdout, format='ascii')
print(t)

## <p style="text-align: center;">6. Contribution</p> <a id='Contribution'></a>

I contributed around 55% in terms of coding for the given assignment.

## <p style="text-align: center;">7. Citation</p> <a id='Citation'></a>

1. https://xgboost.readthedocs.io/en/latest/parameter.html
2. https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
3. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
4. https://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html
5. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html
6. https://stats.stackexchange.com/questions/35319/what-is-the-relationship-between-independent-component-analysis-and-factor-analy
7. https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-mercedes
8. https://www.kaggle.com/adityakumarsinha/stacked-then-averaged-models-private-lb-0-554
9. https://www.kaggle.com/bytestorm/stacked-then-averaged-models-0-5697
10. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
11. https://www.quora.com/Why-do-we-need-SVD
12. https://www.quora.com/What-is-an-intuitive-explanation-of-the-relation-between-PCA-and-SVD
13. https://github.com/nikbearbrown/CSYE_7245/blob/master/H2O/H2O_automl_model.ipynb
14. http://docs.astropy.org/en/stable/table/construct_table.html

## <p style="text-align: center;">8. License</p> <a id='License'></a>

Copyright (c) 2019 Manali Sharma

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

