# PytzMLS2018: Lab 1: Machine learning Fundamentals

<center>**Anthony Faustine (sambaiga@gmail.com)**</center>


## Import all necessary libraries and modules we will use

In [1]:
import sys
sys.path.append('../src')
import numpy as np
import pandas as pd              # used below as pd
import matplotlib.pyplot as plt  # used below as plt
import seaborn as sns
from ploting import *
beatify(fig_width=6)
%matplotlib inline

np.random.seed(777)

## Part 1. Regression Problem

**Objective**: In this task we will implement a machine learning algorithm to predict the price of a sample house. The model will provide buyers with a rough estimate of what the houses are actually worth. Specifically, we will use data on housing prices in Boston from the [Kaggle dataset](https://www.kaggle.com/c/boston-housing/data).

The Boston housing data was collected in 1978 and each of the 506 entries represents aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts. For the purposes of this project, the following preprocessing steps have been made to the dataset:

This data frame contains the following columns:

* **crim**: per capita crime rate by town.

* **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

* **indus**: proportion of non-retail business acres per town.

* **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

*  **nox**: nitrogen oxides concentration (parts per 10 million).

* **rm**: average number of rooms per dwelling.

* **age**: proportion of owner-occupied units built prior to 1940.

* **dis**: weighted mean of distances to five Boston employment centres.

* **rad**: index of accessibility to radial highways.

* **tax**: full-value property-tax rate per 10,000 USD.

* **ptratio**: pupil-teacher ratio by town.

* **black**: $1000(B_k - 0.63)^2$ where $B_k$ is the proportion of blacks by town.

* **lstat**: lower status of the population (percent).

* **medv**: median value of owner-occupied homes in 1000s of USD.


### 1.1 Load dataset

We will use pandas to load the dataset.

In [None]:
data = pd.read_csv("../data/house/train.csv")
data.head()

**Questions**

1. What is the target variable?
2. What is the total number of samples in this dataset?

### 1.2 Data visualization and feature engineering

In this section you will make a cursory investigation of the Boston housing data and provide your observations. Familiarizing yourself with the data through an exploratory process is a fundamental practice that helps you better understand and justify your results.




In [None]:
data.drop(['ID'], axis=1, inplace=True)

#### Missing Values Imputation

**Task**:
1. Check whether the data contains missing values.
2. How can you address missing values in your dataset?
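
One possible way to approach this task is sketched below on a small made-up frame. The toy values, the column subset and the choice of median imputation are assumptions for illustration only; they are not the lab's data or the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9, 7.1],
    "lstat": [4.9, 9.1, np.nan, 5.3],
    "medv": [24.0, 21.6, 18.9, 33.4],
})

# 1. Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# 2a. Option: drop every row that has at least one missing value.
dropped = toy.dropna()

# 2b. Option: impute each column with its own median.
imputed = toy.fillna(toy.median())
print(imputed)
```

Dropping rows is simple but shrinks the sample, while imputation keeps every row at the cost of inventing plausible values. Which trade-off is acceptable depends on how many values are missing.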

#### Calculate Descriptive Statistics
Calculate the following descriptive statistics: the minimum, maximum, mean, median, and standard deviation of **medv**.
Store each calculation in its respective variable.

These statistics will be extremely important later on to analyze various prediction results from the constructed model.





In [None]:
# TODO: Minimum price of the data
minimum_price = data.medv.min()

# TODO: Maximum price of the data
maximum_price = data.medv.max()

# TODO: Mean price of the data
mean_price = data.medv.mean()

# TODO: Median price of the data
median_price = data.medv.median()

# TODO: Standard deviation of prices of the data
std_price = data.medv.std()

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: {:.2f}".format(minimum_price))
print("Maximum price: {:.2f}".format(maximum_price))
print("Mean price: {:.2f}".format(mean_price))
print("Median price: {:.2f}".format(median_price))
print("Standard deviation of prices: {:.2f}".format(std_price))

Alternatively, we could obtain the descriptive statistics of our dataset using the pandas *describe* command.

In [None]:
data.describe()

For the **medv** column specifically:

In [None]:
data['medv'].describe()

#### Correlation between variables

In [None]:
beatify(fig_width=8)
sns.set(font_scale=1)  
sns.heatmap(data.corr(),annot=True, fmt=".2f")

In [None]:
corr_list = data.corr()['medv'].sort_values(axis=0,ascending=False).iloc[1:]
corr_list

**Question**
1. Which features are strongly correlated with the **medv** variable?
2. Using your intuition, for each of the 13 features above, do you think that an increase in the value of that feature would lead to an increase in the value of **'medv'** or a decrease in the value of **'medv'**? Justify your answer for each.
3. Which features are strongly correlated with each other?



#### Task
1. Plot a scatter plot between the target variable and each feature variable.
2. Is there any linear relationship between the target variable and the feature variables?
3. Which features have a strong linear relationship with the target variable? How do these results compare with the previous correlation results?

#### Automatic Feature Selection

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having too many irrelevant features in your data can decrease the accuracy of the models. 

Three benefits of performing feature selection before modeling your data are:

* Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
* Improves Accuracy: Less misleading data means modeling accuracy improves.
* Reduces Training Time: Less data means that algorithms train faster.

You can use one of the following approaches:

- Univariate statistics
- Model-based selection
- Iterative selection


**Univariate statistics**
- Check whether there is a statistically significant relationship between each feature and the target.
- Select the features with the highest confidence.

Advantage: Very fast to compute; doesn't require building a model.

Disadvantage: The selection is independent of the model you will eventually train.
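
As a rough illustration of univariate selection, the sketch below scores each feature separately with scikit-learn's `f_regression` test and keeps the two best with `SelectKBest`. The synthetic data and the choice of `k=2` are assumptions made for the example, not part of the lab.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 200 samples, 5 features,
# where only features 0 and 3 actually drive the target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

# Score each feature independently with a univariate F-test and keep the top 2.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

selected = sorted(selector.get_support(indices=True).tolist())
print("F-scores:", np.round(selector.scores_, 1))
print("p-values:", selector.pvalues_)
print("Selected feature indices:", selected)
```

Because every feature is tested on its own, no model has to be trained, which is why this approach is fast.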

**Model-based Feature Selection**
- Use a supervised machine learning model to judge the importance of each feature.

Advantage: Considers all features at once.

**Iterative Feature Selection**
A series of models is built with varying numbers of features. This is implemented in scikit-learn as [Recursive feature elimination (RFE)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).
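
A minimal sketch of iterative selection with `RFE` follows. The synthetic data, the `LinearRegression` estimator and the target of two features are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only features 1 and 2 influence the target.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(150, 6))
y_demo = 4.0 * X_demo[:, 1] + 2.5 * X_demo[:, 2] + rng.normal(scale=0.3, size=150)

# Repeatedly fit the model and drop the weakest feature until 2 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X_demo, y_demo)

kept = np.flatnonzero(rfe.support_).tolist()
print("Kept feature indices:", kept)
print("Elimination ranking (1 = kept):", rfe.ranking_)
```

Each elimination round refits the model, so this approach costs more than univariate tests or a single model-based ranking.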

We will use model-based feature selection with `RandomForestRegressor` from `sklearn.ensemble`. Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection, mean decrease impurity and mean decrease accuracy, as discussed in [this blog post](http://blog.datadive.net/selecting-good-features-part-iii-random-forests/).

### 1.2.1 Mean decrease impurity
Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.
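
To make the idea of an impurity decrease concrete, the sketch below computes the variance reduction achieved by a single split on a tiny made-up sample. The helper `variance_reduction` and the data are hypothetical illustrations, not scikit-learn internals.

```python
import numpy as np

def variance_reduction(x, y, threshold):
    """Decrease in variance (regression impurity) from splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    # Weighted impurity of the two children after the split.
    weighted_child_var = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return np.var(y) - weighted_child_var

# Made-up sample: the response jumps when the feature exceeds 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 10.0, 10.0, 20.0, 20.0, 20.0])

good_split = variance_reduction(x, y, threshold=3.0)  # separates the two groups
poor_split = variance_reduction(x, y, threshold=1.0)  # isolates a single point
print(good_split, poor_split)
```

A tree picks the split with the largest reduction at each node. Summing these reductions per feature and averaging over the trees in the forest gives the importance values plotted in this section.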

In [None]:
# Prepare Feature and Target
data.columns

In [None]:
feature = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat']

X = data[feature]
y = data.medv

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Define a model
rforest = RandomForestRegressor()
# Fit the model
rforest.fit(X,y)

In [None]:
# Plot the important features
beatify(fig_width=8)
imp_feat_rf = pd.Series(rforest.feature_importances_, index=X.columns).sort_values(ascending=False)
imp_feat_rf.plot(kind='bar', title='Feature Importance with Random Forest', color='C0')
plt.ylabel('Feature Importance values')
plt.subplots_adjust(bottom=0.25)

**Question**
1. From the figure above, which features best explain the target variable?
2. How does this result compare with the correlation results from the previous section?

### 1.2.2 Mean decrease score

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

def feature_select(model, names, metric, X, Y):
    """Rank features by the relative drop in score when each one is shuffled."""
    rs = ShuffleSplit(n_splits=len(X), test_size=.1, random_state=22)
    scores = defaultdict(list)

    for train_idx, test_idx in rs.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]

        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        # scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
        acc = metric(Y_test, Y_pred)

        for i in range(X.shape[1]):
            # Shuffle one column to break its relationship with the target
            X_t = X_test.copy()
            np.random.shuffle(X_t[:, i])
            shuff_acc = metric(Y_test, model.predict(X_t))

            scores[names[i]].append((acc - shuff_acc) / acc)

    # Mean relative decrease per feature, largest first
    results = sorted([(round(float(np.mean(score)), 3), feat) for
                      feat, score in scores.items()], reverse=True)
    print("Features sorted by their score:")
    print(results)

    plt.bar(range(len(results)), [val[0] for val in results], align='center')
    plt.xticks(range(len(results)), [val[1] for val in results], rotation=90)
    plt.title("Feature Importance")
    plt.ylabel("Relative decrease in score")

    return results

In [None]:
results = feature_select(rforest, feature, r2_score, X.values, y.values)

**Question**
1. From the figure above, which features best explain the target variable?
2. How does this result compare with the previous approach?

### 1.2.3 Model-based feature selection using **SelectFromModel**

In [None]:
from sklearn.feature_selection import SelectFromModel

select = SelectFromModel(RandomForestRegressor() , threshold="median")
select.fit(X,y)
X_features = select.transform(X)
print('Original features', X.shape)
print('Selected features', X_features.shape)
for feature_list_index in select.get_support(indices=True):
    print(feature[feature_list_index])

**Question**
1. How does this result compare with the results above and with the correlation results from the previous sections?
2. Which features will you use to develop a predictive model?

### 1.3 Develop predictive models

In this section you will develop the tools and techniques necessary for a model to make a prediction. First, however, it is important to define how each model's performance will be evaluated, by quantifying its performance on the training and testing data. This is typically done using some type of performance metric, whether through calculating some type of error, the goodness of fit, or some other useful measurement.

In this part we will use the coefficient of determination, **R2**, to quantify a model's performance. The coefficient of determination is a useful statistic in regression analysis that describes how "good" a model is at making predictions.

The values for R2 typically range from 0 to 1 and capture the squared correlation between the predicted and actual values of the target variable. A model with an R2 of 0 is no better than a model that always predicts the mean of the target variable, whereas a model with an R2 of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what percentage of the variance of the target variable can be explained by the features using this model.

Hint: The R2 score is the proportion of the variance in the dependent variable that is predictable from the independent variable. In other words:

- R2 score of 0 means that the dependent variable cannot be predicted from the independent variable.
- R2 score of 1 means the dependent variable can be predicted perfectly from the independent variable.
- R2 score between 0 and 1 indicates the extent to which the dependent variable is predictable.
- R2 score of 0.40 means that 40 percent of the variance in Y is predictable from X.

A model can be given a negative R2 as well, which indicates that the model is arbitrarily worse than one that always predicts the mean of the target variable.
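
The three cases above (perfect, mean-only and worse than the mean) can be checked by computing R2 directly from its definition, $R^2 = 1 - SS_{res}/SS_{tot}$. The helper `r2_manual` and the target values below are made up for this sketch.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up target values

perfect = r2_manual(y_true, y_true)                        # exact predictions
baseline = r2_manual(y_true, np.full(4, y_true.mean()))    # always predict the mean
worse = r2_manual(y_true, y_true[::-1])                    # predictions in reversed order
print(perfect, baseline, worse)
```

The reversed predictions give a negative score because their squared error is larger than that of simply predicting the mean.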

In [None]:
from sklearn.metrics import r2_score
def metric(y_true, y_pred):
    
    score = r2_score(y_true, y_pred)

    return score

### 1.3.1 Shuffle and Split Data

In this stage you will take the dataset and split it into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset. You will use the following features: 'crim', 'nox', 'rm', 'age', 'dis', 'black', 'lstat'.

To achieve this:
- Use `train_test_split` from `sklearn.model_selection` to shuffle and split the features and prices data into training and testing sets.
- Split the data into 90% training and 10% testing.
- Set the `random_state` for `train_test_split` to a value of your choice. This ensures results are consistent.
- Assign the training and testing splits to `X_train`, `X_test`, `y_train`, and `y_test`.

In [None]:
features = ['crim', 'nox', 'rm', 'age', 'dis', 'black', 'lstat']
X = data[features]
y = data.medv

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)

In [None]:
print("Size of training: Features: {0},  Target: {1}".format(X_train.shape, y_train.shape))
print("Size of test: Features: {0},  Target: {1}".format(X_test.shape, y_test.shape))

**Question**:
1. What is the benefit of splitting a dataset into training and testing subsets for a learning algorithm?

### 1.3.2 Model selection

Scikit-learn offers several regression models we can use for this problem. We will compare the following six regression models and find the best one.

- Linear Regression
- Random Forest regressor
- Decision tree regressor
- Lasso
- Ridge
- BayesianRidge

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, BayesianRidge 
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
def regression_fit(X_train, y_train, X_test, y_test):
    
    reg = LinearRegression()
    rf = RandomForestRegressor()
    dt = DecisionTreeRegressor()
    ls = Lasso()
    rdg = Ridge()
    brdg = BayesianRidge()
    
    models = {"LR":reg, "RF":rf, "DT":dt, "LASSO":ls, "RG":rdg, "BayesRG": brdg}
    results = {}
    
    for (key,model) in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[key]=metric(y_test, y_pred)
        
    return results   

In [None]:
results = regression_fit(X_train, y_train, X_test, y_test)

In [None]:
results

**Question**
1. Which models show better performance?
2. Repeat the above experiments using all available features and compare your results.

### 1.4 Cross validation

Evaluating model performance using the train/test split approach is very fast and well suited to large datasets. It is also recommended when the algorithm you are investigating is slow to train. However, in most cases a single train/test split provides a high-variance estimate, since changing which observations happen to be in the testing set can significantly change the testing accuracy.

Cross validation is a statistical method for evaluating how well a given algorithm will generalize when trained on a specific data set. It is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split. In cross validation we split the data repeatedly and train multiple models.


**Advantages of cross-validation**:
- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data
   - This is because every observation is used for both training and testing



**Types of cross-validation**

- K-fold cross validation
- Stratified K-fold cross validation
- Leave-one-out cross validation
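
The sketch below shows how K-fold and leave-one-out partition a dataset, using ten made-up samples. Stratified K-fold works the same way but additionally keeps the class proportions similar in every fold, which matters for classification targets.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(10).reshape(-1, 1)   # 10 made-up samples

# K-fold: every sample lands in the test fold exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
test_folds = [test_idx for _, test_idx in kfold.split(X_demo)]
for i, test_idx in enumerate(test_folds):
    print("fold", i, "test indices:", test_idx)

# Leave-one-out is K-fold with as many folds as samples.
n_loo_splits = LeaveOneOut().get_n_splits(X_demo)
print("leave-one-out splits:", n_loo_splits)
```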


In [None]:
from sklearn.model_selection import KFold, cross_val_score

def regression_cross_validation(X, y, score='r2'):
    
    reg = LinearRegression()
    rf = RandomForestRegressor()
    dt = DecisionTreeRegressor()
    ls = Lasso()
    rdg = Ridge()
    brdg = BayesianRidge()
    
    models = {"LR":reg, "RF":rf, "DT":dt, "LASSO":ls, "RG":rdg, "BayesRG": brdg}
    results = {}
    
    kfold=KFold(n_splits=5, random_state=40, shuffle=True)
    
    for (key,model) in models.items():
        cv_results = cross_val_score(model, X, y, cv=kfold, scoring=score)
        results[key]=cv_results.mean()
        
    return results   

In [None]:
results = regression_cross_validation(X_train,y_train)

In [None]:
results

**Question**
1. Which models show better performance?
2. How do your results compare with the previous results?

## Part 2. Classification: Diabetes Prediction with Pima data

**Objective**: To diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the [PIMA dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

In [None]:
data_info = pd.DataFrame({
    "Attributes": ["Pregnancies", "Glucose", "BloodPressure",
                   "SkinThickness", "Insulin", "BMI",
                   "DiabetesPedigreeFunction", "Age", "Outcome"],
    "Explanation": ["Number of times pregnant",
                    "Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
                    "Diastolic blood pressure (mm Hg)",
                    "Triceps skin fold thickness (mm)",
                    "2-Hour serum insulin (mu U/ml)",
                    "Body mass index (weight in kg/(height in m)^2)",
                    "Diabetes pedigree function",
                    "Age (years)",
                    "Diabetes or not"]})


### 2.1 Load and Explore the Data

In [None]:
data = pd.read_csv('../data/pima/diabetes.csv')
data.head()

In [None]:
data_info

**Task**: Check and handle missing data if available

In [None]:
## your code 


### Correlation between variables

In [None]:
beatify(fig_width=6)
sns.set(font_scale=1) 
sns.heatmap(data.corr(),annot=True, fmt=".2f")

We notice a high positive correlation between Age and Pregnancies, which is logical. There are also positive correlations between BMI and SkinThickness, between Glucose and Insulin, and between Glucose and Outcome.



In [None]:
corr_list = data.corr()['Outcome'].sort_values(axis=0,ascending=False).iloc[1:]
corr_list

### Distribution of variables

In [None]:
beatify(fig_width=6)
data.hist();
plt.tight_layout()

**Questions**
1. Briefly describe the distribution of each variable.

#### Outcome variable

In [None]:
plt.hist(data["Outcome"])
# Outcome is 1 for diabetic (positive) and 0 for non-diabetic (negative)
print("Number of positive outcomes", np.count_nonzero(data["Outcome"]))
print("Number of negative outcomes", len(data["Outcome"]) - np.count_nonzero(data["Outcome"]))

There are nearly twice as many negative outcomes as positive ones in the dataset.



#### Pregnancies

In [None]:
beatify(fig_width=2)
sns.distplot(data["Pregnancies"]);
data["Pregnancies"].describe()

Distribution is positively skewed. Looks exponential.

#### Glucose

In [None]:
sns.distplot(data["Glucose"]);
data["Glucose"].describe()

Again positively skewed with outliers to the left; otherwise it resembles the normal distribution. Those zeros might be missing values, since a glucose concentration of 0 is unrealistic. Let's mark them with NaN instead.

In [None]:
data["Glucose"] = data["Glucose"].replace(0, np.nan)
data["Glucose"].isnull().sum() # Number of missing values.

#### Insulin

In [None]:
sns.distplot(data["Insulin"]);
data["Insulin"].describe()

Again we have invalid zero values, so we replace them with NaN.

In [None]:
data["Insulin"] = data["Insulin"].replace(0, np.nan)
data["Insulin"].isnull().sum() # Number of missing values.

#### BMI


In [None]:
sns.distplot(data["BMI"]);
data["BMI"].describe()

In [None]:
data["BMI"] = data["BMI"].replace(0, np.nan)
data["BMI"].isnull().sum() # Number of missing values.

#### DiabetesPedigreeFunction

In [None]:
sns.distplot(data["DiabetesPedigreeFunction"]);
data["DiabetesPedigreeFunction"].describe()

Distribution is positively skewed, and subject to outliers.

#### Age

In [None]:
sns.distplot(data["Age"]);
data["Age"].describe()

Distribution is positively skewed. Again it looks roughly exponential.

### 2.2 Divide the data

Now let's divide the data into a separate test set and a training set on which we will train our models. Note that we do this before the attribute selection so that no bias is introduced. The test set should represent future data.

In [None]:
# Check for missing data
data.isnull().sum()

The Insulin column has many missing values, so we drop this column and fill the missing values of BMI and Glucose with their median values.

In [None]:
data.drop(['Insulin'], axis=1, inplace=True)

In [None]:
# Fill missing values with the column median
data['Glucose'] = data['Glucose'].fillna(data['Glucose'].median())
data['BMI'] = data['BMI'].fillna(data['BMI'].median())

In [None]:
data.isnull().sum()

In [None]:
from sklearn.model_selection import train_test_split

targets = data["Outcome"]
features = data.drop(["Outcome"], axis = 1)
cols = data.columns
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.1, random_state = 22)

In [None]:
x_test

#### Save the data 

In [None]:
# Use .values so the targets are plain NumPy arrays before adding a column axis
trainingdata = pd.DataFrame(np.hstack((x_train, y_train.values[:, np.newaxis])), columns=data.columns)
testdata = pd.DataFrame(np.hstack((x_test, y_test.values[:, np.newaxis])), columns=data.columns)
testdata.to_csv('../data/pima/test_set.csv', index=False)

### 2.3 Feature Selection

Let's have a look at the importance of the features for predicting the outcome. We might be able to remove attributes with a lot of missing values and hence increase our sample size. We will use two different methods for the selection: one unsupervised attribute extraction that changes the attributes, and one supervised attribute selection that doesn't.

In [None]:
## separate features and targets
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'DiabetesPedigreeFunction', 'Age']
train_y = trainingdata["Outcome"]
train_x = trainingdata[features]

#### Apply Z-Standardization to features
We use `StandardScaler` from `sklearn.preprocessing`; for details refer to this [link](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).
Important methods:

- `fit()`: Compute the mean and std to be used for later scaling.
- `transform()`: Standardize the data using the stored mean and std.
- `fit_transform()`: Fit to data, then transform it.
- `get_params()`: Get parameters for this estimator.
- `inverse_transform()`: Scale back the data to the original representation.
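
The methods listed above can be tried on a small made-up matrix, as in the sketch below. The arrays are assumptions for illustration. The point is that the statistics are learned from the training data and then reused unchanged on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test matrices (two features on very different scales).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(train)                         # learn mean and std from the training data only
train_scaled = scaler.transform(train)    # each column now has mean 0 and unit variance
test_scaled = scaler.transform(test)      # reuse the training statistics on new data

restored = scaler.inverse_transform(train_scaled)  # back to the original units
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```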

In [None]:
from sklearn.preprocessing import StandardScaler
z_scaler = StandardScaler()
z_scaler.fit(train_x)

In [None]:
train_x_scaled = z_scaler.transform(train_x)

In [None]:
train_x_scaled

In [None]:
#### Alternatively you could implement this yourself as follows:
train_x_ = train_x.apply(lambda x: (x - np.mean(x))/np.std(x)) # Z Standardization
train_x_.head()

We can use the mean decrease score feature selection explained in section 1.2.2 to select the best features for this problem. Since this is a classification problem, we will use accuracy as the metric and logistic regression as the classification model.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()
results = feature_select(lr, features, accuracy_score, train_x_scaled, train_y.values)

### 2.4 Model selection

How do you choose the best model for your problem? When you work on a machine learning project, you often end up with multiple good models to choose from. Since our dataset is small, the best way to approach this is probably to use cross validation. Let's establish a baseline for the problem by using cross validation and a few classification models. With k-fold cross validation on a small dataset, splitting the data into too few folds could introduce a substantial bias. On the other hand, if we choose k too large we will have a lot of variance. With the small dataset in mind, we choose k to be moderately large: 18.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score, accuracy_score

In [None]:
def classifiers(X, y, score='f1'):
    
    knn     =  KNeighborsClassifier()
    svm     = SVC()
    gnb     =  GaussianNB()
    log     =  LogisticRegression()
    dTree   =  tree.DecisionTreeClassifier()
    rForest =  RandomForestClassifier()
    
    
    models = {"LG":log, "RF":rForest, "DT":dTree, "NB":gnb, "SVM":svm, "KNN":knn}
    results = {}
    skfold = StratifiedKFold(n_splits=18)

    
    for (key,model) in models.items():
        cv_results = cross_val_score(model, X, y, cv=skfold, scoring=score)
        results[key]=cv_results.mean()
        msg = "{0}: {1}" .format(key, cv_results.mean())
        print(msg)
        
    return results   


In [None]:
results = classifiers(train_x_scaled, train_y.values, score='accuracy')

**Questions**:
1. What are the three best models for this problem?

### 2.4.2 Hyper-parameter Selection

Hyper-parameters are parameters that are not directly learnt within a model. In scikit-learn they are passed as arguments to the constructor of the model classes. To find the names and current values of all parameters for a given model, use `model.get_params()`.

It is possible to automatically find good values for the hyper-parameters by using tools such as grid search and cross validation. In machine learning, these tasks are commonly done at the same time in data pipelines. Cross validation is the process of training learners using one set of data and testing them using a different set. Parameter tuning is the process of selecting the values for a model's hyper-parameters that maximize the accuracy of the model.
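
As a rough sketch of doing both steps at once, the example below wraps scaling and a classifier in a scikit-learn `Pipeline` and tunes it with `GridSearchCV`. The synthetic data, the step names `scale` and `clf`, and the grid values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption for illustration).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.2f" % search.best_score_)
```

Keeping the scaler inside the pipeline means the validation fold never influences the scaling statistics, which avoids a subtle form of data leakage during tuning.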






**Question**
1. List all hyper-parameters of the three best models from the previous section.

In [None]:
log     =  LogisticRegression()

In [None]:
log.get_params()

### Find optimal parameter for [logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
from sklearn.model_selection import GridSearchCV

def get_best_parameter(model, parameters, X, y):
    
    skfold = StratifiedKFold(n_splits=18)
    clf = GridSearchCV(model, parameters, cv=skfold)
    clf.fit(X, y)
   
    print("Training best score: %.2f" %clf.best_score_)
    print("Best parameter: {}" .format(clf.best_params_))

In [None]:
parameters_logreg = {'C':  [0.001, 0.01, 0.1, 1, 10],
                           'penalty':['l1', 'l2'], 
                           }
# liblinear supports both the 'l1' and 'l2' penalties in the grid
log = LogisticRegression(solver='liblinear')
get_best_parameter(log, parameters_logreg, train_x_scaled, train_y.values)

### Find optimal parameter for [support vector machine](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [None]:
Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
kernels = [ 'linear', 'poly', 'rbf', 'sigmoid']
parameters_svm = {'C': Cs, 'gamma' : gammas, 'kernel': kernels}
svm     = SVC()
get_best_parameter(svm, parameters_svm, train_x_scaled, train_y.values)

#### Task
1. What are the optimal hyper-parameters for the two selected models?

## References
- [Udacity project](https://github.com/baninaveen/Predicting_Boston_House_Pricing-/blob/master/boston_housing_price.ipynb)
- [PIMA](https://github.com/Tranhd/Diabetes-Classification/blob/master/Exploratory%20Analysis.ipynb)