# Machine Learning Project 1

In this project, you will learn to use the concepts we have seen in the lectures and practiced in the labs on a real-world dataset, from start to finish. You will do exploratory data analysis to understand your dataset and your features, do feature processing and engineering to clean your dataset and extract more meaningful information, implement and use machine learning methods on real data, analyze your model, generate predictions using those methods, and report your findings.

## Load and Clean Data

Before doing anything with the raw data, we need to check which components it contains and how to handle them. We therefore import the pandas library to take a quick look at the training dataset.

In [1]:
import pandas as pd
import numpy as np

from proj1_helpers import *
from implementations import *

%matplotlib inline
import matplotlib.pyplot as plt
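%load_ext autoreload
# Loading the extension alone does not enable reloading; mode 2 reloads all modules before each cell runs
%autoreload 2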

In [2]:
tt = pd.read_csv("train.csv")
tt.head()

Unnamed: 0,Id,Prediction,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
0,100000,s,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,...,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497
1,100001,b,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,...,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226
2,100002,b,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,...,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251
3,100003,b,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,...,0.06,86.062,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
4,100004,b,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,...,-0.871,53.131,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0


In [3]:
tt.shape

(250000, 32)

In [14]:
df = tt[tt.columns.difference(['Id', 'Prediction'])]
correlation_matrix = df.corr().abs() # Correlation matrix

# List the feature pairs whose absolute correlation is at least 0.9.
# Highly correlated features are candidates for removal; which ones to drop still needs to be decided.
# Caveat: the -999 placeholders are still present at this point and inflate these correlations.
correlation_matrix.where(np.triu(correlation_matrix, 1) >= 0.9).stack().reset_index()

Unnamed: 0,level_0,level_1,0
0,DER_deltaeta_jet_jet,DER_lep_eta_centrality,0.999998
1,DER_deltaeta_jet_jet,DER_mass_jet_jet,0.946045
2,DER_deltaeta_jet_jet,DER_prodeta_jet_jet,0.999981
3,DER_deltaeta_jet_jet,PRI_jet_subleading_eta,0.999995
4,DER_deltaeta_jet_jet,PRI_jet_subleading_phi,0.999996
5,DER_deltaeta_jet_jet,PRI_jet_subleading_pt,0.999346
6,DER_lep_eta_centrality,DER_mass_jet_jet,0.945584
7,DER_lep_eta_centrality,DER_prodeta_jet_jet,0.99999
8,DER_lep_eta_centrality,PRI_jet_subleading_eta,0.999997
9,DER_lep_eta_centrality,PRI_jet_subleading_phi,0.999998
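Several of the pairs above correlate at almost exactly 1, which is what one would expect when two columns share the -999 placeholder on the same rows. A minimal sketch of that effect on a small synthetic frame (the columns `a` and `b` are hypothetical and are not taken from `train.csv`): the two features are independent, yet they look perfectly correlated until the placeholder is masked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Two independent synthetic features (hypothetical, not from train.csv)
toy = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})

# Mark the same ~70% of rows as missing in both columns with the -999 placeholder
missing = rng.random(n) < 0.7
toy.loc[missing, ["a", "b"]] = -999.0

# Correlation with the placeholder left in place
corr_raw = toy.corr().abs().loc["a", "b"]

# Correlation after masking the placeholder; pandas ignores NaN pairwise
corr_masked = toy.replace(-999.0, np.nan).corr().abs().loc["a", "b"]

print(corr_raw, corr_masked)
```

On the real data, the equivalent check would be to call `.corr()` on `df.replace(-999, np.nan)` before deciding which features to drop.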


In [None]:
import missingno as msno
msno.matrix(tt.replace(-999,np.nan))

In [None]:
msno.matrix(tt.replace(0,np.nan))

In [None]:
msno.heatmap(tt.replace(-999,np.nan))

Looking at the dataset, we find that several columns whose valid values are positive contain a large number of -999 entries, which actually stand for NaN. In addition, in the column 'PRI_jet_all_pt', the value 0 appears quite frequently. All of these cases could be treated as missing values, and we need a reasonable way to handle them.

Here are some options:
1. Remove the columns that contain missing values. This would discard a large amount of potentially useful data.
2. Replace the missing values with 0 or -1. This is risky because the model could treat the placeholder as a real value.
3. Replace the missing values with the mean, median or mode. For numerical features, use the mean; if there are outliers, prefer the median, which is much less sensitive to them.

The third option is a standard and often very effective approach, so we replace each -999 with the mean of the remaining values in the same column. The share of 0 values in 'PRI_jet_all_pt' is not high, so we treat them as real data points.
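Mean replacement can be sketched as follows. The project's `replace_nan` helper is not shown in this notebook, so `replace_with_mean` below is a hypothetical stand-in that only illustrates the idea; the real helper may differ.

```python
import numpy as np

def replace_with_mean(column, missing_value=-999):
    """Return a copy of `column` in which entries equal to `missing_value`
    are replaced by the mean of the remaining entries.
    Hypothetical stand-in for the project's replace_nan helper."""
    column = np.asarray(column, dtype=float).copy()
    mask = column == missing_value
    if mask.all():
        # Nothing to average over, so leave the column unchanged
        return column
    column[mask] = column[~mask].mean()
    return column

col = np.array([1.0, -999.0, 3.0, -999.0, 5.0])
filled = replace_with_mean(col)
print(filled)  # the two placeholders become the mean of 1, 3 and 5
```

Swapping `.mean()` for `np.median(column[~mask])` gives the median variant mentioned for columns with outliers.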

In [None]:
#load train dataset
y, x, ids = load_csv_data("train.csv")

In [None]:
#replace missing values (-999) by the column mean
for column in range(x.shape[1]):
    x[:, column] = replace_nan(x[:, column], -999)
    
# Standardizing it by features
x, _, _ = standardize(x) 
pd.DataFrame(x).head()
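The `standardize` helper is imported from the project files and its body is not shown here. As a rough sketch, assuming it returns the scaled matrix together with the per-feature mean and standard deviation (which the `x, _, _` unpacking suggests), it might look like the hypothetical `standardize_sketch` below. Accepting precomputed statistics lets the same scaling be reused on held-out data.

```python
import numpy as np

def standardize_sketch(x, mean=None, std=None):
    """Scale each column of x to zero mean and unit variance.
    Hypothetical stand-in for the project's standardize helper."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std, mean, std

rng = np.random.default_rng(0)
x_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# Fit the statistics on one set of rows ...
x_std, mean_demo, std_demo = standardize_sketch(x_demo)
# ... and reuse them on other rows instead of recomputing
x_new_std, _, _ = standardize_sketch(x_demo[:10], mean_demo, std_demo)
```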

## Prepare Data and Basic Training

We use the cell below to set all the parameters we need in one place, so that every step is easy to adjust.

In [None]:
#build polynomial by degree
degree = 1
#split train dataset to 2 parts for test and train
ratio_split = 0.8
#L2 penalty parameter for ridge_regression()
lambda_ = 0.1
#GD
max_iters_GD = 100
gamma_GD = 0.05
#SGD
max_iters_SGD = 100
gamma_SGD = 0.005

__Build Polynomial Features and Split Data into Train and Test Sets__

In [None]:
x = build_log(x)
x = build_combination(x, 2)
x_ = build_poly(x, degree)
x_train, x_test, y_train, y_test = split_data(x_, y, ratio_split)

In [None]:
print('The size of x_train: {}\nThe size of x_test: {}'.format( x_train.shape, x_test.shape))
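For readers without the helper files, here is a minimal sketch of what a polynomial expansion and a random train/test split could look like. Both functions are hypothetical stand-ins: the actual `build_poly` and `split_data` may order columns or shuffle differently, and `build_log` and `build_combination` are not covered because their behaviour is not described in this notebook.

```python
import numpy as np

def build_poly_sketch(x, degree):
    """Stack a bias column and the element-wise powers x^1 ... x^degree."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    columns = [np.ones((x.shape[0], 1))]
    for d in range(1, degree + 1):
        columns.append(x ** d)
    return np.hstack(columns)

def split_data_sketch(x, y, ratio, seed=1):
    """Shuffle the rows and put a fraction `ratio` of them in the training set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    cut = int(np.floor(ratio * len(y)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x_demo = np.arange(20.0).reshape(10, 2)
y_demo = np.arange(10)
phi = build_poly_sketch(x_demo, 2)  # bias + 2 features + 2 squared features
xtr, xte, ytr, yte = split_data_sketch(phi, y_demo, 0.8)
print(phi.shape, xtr.shape, xte.shape)
```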

__Test the functions in implementations.py__

In [None]:
# least_squares()
w, loss = least_squares(y_train, x_train)
print(loss, compute_loss(y_test, x_test, w))
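As a point of reference, least squares can be solved in closed form through the normal equations. The sketch below is an assumption about what `least_squares` does, not its actual code, and it uses the MSE with a factor of 1/2 as the loss; the project's `compute_loss` may use a different convention.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (tx^T tx) w = tx^T y and return (w, mse)."""
    gram = tx.T @ tx
    w = np.linalg.solve(gram, tx.T @ y)
    residual = y - tx @ w
    mse = residual @ residual / (2 * len(y))  # MSE with the 1/2 convention
    return w, mse

rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(50), rng.normal(size=(50, 2))]
w_true = np.array([0.5, -1.0, 2.0])
y_demo = tx_demo @ w_true  # noise-free targets, so w_true is recoverable
w_hat, mse_hat = least_squares_sketch(y_demo, tx_demo)
print(w_hat, mse_hat)
```

When features are almost perfectly correlated, as several are in this dataset, `tx.T @ tx` can be close to singular; `np.linalg.lstsq` is a more robust alternative in that case.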

In [None]:
# ridge_regression()
w, loss = ridge_regression(y_train, x_train, lambda_)
print(loss, compute_loss(y_test, x_test, w))
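Ridge regression adds an L2 penalty to the normal equations. The sketch below scales the penalty by `2 * n_samples`, which is one common convention and an assumption here; the project's `ridge_regression` may scale `lambda_` differently.

```python
import numpy as np

def ridge_regression_sketch(y, tx, lambda_):
    """Solve (tx^T tx + 2 N lambda I) w = tx^T y.
    The 2 N scaling of the penalty is an assumed convention."""
    n_samples, n_features = tx.shape
    penalty = 2 * n_samples * lambda_ * np.eye(n_features)
    return np.linalg.solve(tx.T @ tx + penalty, tx.T @ y)

rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(100, 3))
y_demo = tx_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_small = ridge_regression_sketch(y_demo, tx_demo, 1e-6)  # almost unregularized
w_large = ridge_regression_sketch(y_demo, tx_demo, 10.0)  # strongly shrunk
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The weight norm shrinks as `lambda_` grows, which is the behaviour to look for when tuning `lambda_` against the test loss.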

In [None]:
# GD()
w_initial = np.zeros(x_train.shape[1])
w, loss = least_squares_GD(y_train, x_train, w_initial, max_iters_GD, gamma_GD)
print(loss, compute_loss(y_test,x_test,w))
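The gradient descent update for the MSE loss is short enough to write out. This is a sketch under the assumption that `least_squares_GD` minimizes the MSE with a factor of 1/2; the function name and the demo values of `max_iters` and `gamma` are illustrative and differ from the parameters set earlier.

```python
import numpy as np

def gradient_descent_sketch(y, tx, initial_w, max_iters, gamma):
    """Full-batch gradient descent on the MSE loss (1/2N) * ||y - tx w||^2."""
    w = initial_w.astype(float).copy()
    n = len(y)
    for _ in range(max_iters):
        error = y - tx @ w
        gradient = -tx.T @ error / n  # gradient of the MSE with respect to w
        w -= gamma * gradient
    error = y - tx @ w
    loss = error @ error / (2 * n)
    return w, loss

rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(200), rng.normal(size=(200, 2))]
w_true = np.array([1.0, 2.0, -1.0])
y_demo = tx_demo @ w_true  # noise-free targets
w_gd, loss_gd = gradient_descent_sketch(y_demo, tx_demo, np.zeros(3), 500, 0.1)
print(w_gd, loss_gd)
```

Standardizing the features first, as done above, keeps the curvature similar in every direction, which is why a single step size works for all weights.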

In [None]:
# SGD()
w_initial = np.zeros(x_train.shape[1])

# loss_mae is the argument to get the mean absolute error cost function running
w, loss = least_squares_SGD(y_train, x_train, w_initial, max_iters_SGD, gamma_SGD)#, loss_function='rmse')
# print('Training loss: {}'.format(loss))
# print('Testing loss: {}'.format(compute_loss(y_test, x_test, w, loss_function='rmse')))
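Stochastic gradient descent replaces the full gradient with the gradient of a single randomly chosen sample. The sketch below assumes a batch size of 1 and the squared-error loss; `least_squares_SGD` may use a different batch size or loss, and `sgd_sketch` and its `seed` parameter are hypothetical.

```python
import numpy as np

def sgd_sketch(y, tx, initial_w, max_iters, gamma, seed=0):
    """SGD on the squared error, one randomly drawn sample per step."""
    rng = np.random.default_rng(seed)
    w = initial_w.astype(float).copy()
    for _ in range(max_iters):
        i = rng.integers(len(y))      # pick one sample uniformly at random
        error_i = y[i] - tx[i] @ w
        w += gamma * error_i * tx[i]  # step against the single-sample gradient
    error = y - tx @ w
    return w, error @ error / (2 * len(y))

rng = np.random.default_rng(0)
tx_demo = np.c_[np.ones(200), rng.normal(size=(200, 2))]
w_true = np.array([1.0, 2.0, -1.0])
y_demo = tx_demo @ w_true  # noise-free targets

initial_loss = y_demo @ y_demo / (2 * len(y_demo))  # loss at w = 0
w_sgd, loss_sgd = sgd_sketch(y_demo, tx_demo, np.zeros(3), 2000, 0.05)
print(initial_loss, loss_sgd)
```

Because each step sees only one sample, the loss is noisier than with full-batch gradient descent, which is one reason a smaller step size is typically used for SGD.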

In [None]:
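def logistic_regression(y, tx, max_iters, gamma):
    """
    Fit logistic regression with full-batch gradient descent.
    The cross-entropy loss used here expects labels y in {0, 1}.
    Returns the final weights and the loss computed before the last update.
    """
    w = np.zeros(tx.shape[1])  # start from the zero vector
    n_samples = y.shape[0]

    for n_iter in range(max_iters):
        h = sigmoid(tx @ w)  # predicted probabilities, shape (n_samples,)
        gradient = tx.T @ (h - y) / n_samples  # mean gradient of the loss
        loss = calculate_loss_logistic(h, y)  # loss at the current weights
        w -= gamma * gradient

        print('Loss calculated at: {} , training step: {}'.format(loss, n_iter))
    return w, loss

def calculate_loss_logistic(h, y):
    """
    Given the actual labels y (0/1) and the predicted probabilities h,
    return the mean cross-entropy loss over all data points.
    """
    eps = 1e-15  # keep log() away from 0 when h saturates at 0 or 1
    h = np.clip(h, eps, 1 - eps)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()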

max_iters_logistic = 100
lr = 0.001
w, loss = logistic_regression(y_train, x_train, max_iters_logistic, lr)

In [None]:
training_predict_labels = calculate_predicted_labels(x_train, w)
testing_predict_labels = calculate_predicted_labels(x_test, w)

In [None]:
print_accuracy(training_predict_labels, x_train, y_train)
print_accuracy(testing_predict_labels, x_test, y_test, train=False)
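The helpers `calculate_predicted_labels` and `print_accuracy` come from the project files and are not shown here. A minimal sketch of the idea, assuming 0/1 labels and a decision threshold of 0.5 on the sigmoid output (all three function names below are hypothetical):

```python
import numpy as np

def sigmoid_sketch(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_labels_sketch(tx, w, threshold=0.5):
    """Label a sample 1 when its predicted probability reaches the threshold."""
    return (sigmoid_sketch(tx @ w) >= threshold).astype(int)

def accuracy_sketch(predicted, y):
    """Fraction of samples whose predicted label matches the true label."""
    return np.mean(predicted == y)

tx_demo = np.array([[1.0, 2.0], [1.0, -3.0], [1.0, 0.5], [1.0, -0.5]])
w_demo = np.array([0.0, 1.0])
y_demo = np.array([1, 0, 1, 1])

pred_demo = predict_labels_sketch(tx_demo, w_demo)
acc_demo = accuracy_sketch(pred_demo, y_demo)
print(pred_demo, acc_demo)  # three of the four labels match
```

Comparing the training and testing accuracy side by side is a quick check for overfitting: a large gap suggests the model has memorized the training split.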