<div>
<img src="https://www.merck.com/wp-content/uploads/sites/5/2022/03/Merck.png" width="125" align="right"/>
</div>

# --- Insert Data Science Project Name ---

**Author**: Merck Data Science ROI  
**Purpose**: This notebook template serves supervised machine learning projects that involve common tasks such as data exploration, cleaning, transformation and preparation, and modelling.  
**Created On**: 01 September 2022  
**Last Modified**: 02 September 2022

### Table of Contents
- Initialization and Imports
- Data Loading
- Data Exploration
- Data Cleaning
- Feature Engineering
- Data Transformation and Preparation
- Model Exploration and Performance Analysis
- Final Model Building

We strongly encourage moving code into separate `.py` files if any of the steps above needs an in-depth implementation.

## Initialization and Imports

This section imports the libraries the project needs and defines basic preprocessing functions.
Add any global variables or paths in the initialization cell.
Import commonly used data science libraries here, and make sure they are available in your Python environment.

In [26]:
# Initialization code cell


In [2]:
# Import libraries necessary for projects
import numpy as np 
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Pretty display for notebooks
%matplotlib inline

# Note: `from __future__ import division` is only needed on Python 2, where it must be
# the first statement in the cell; Python 3 already uses true division.

## Data Loading

For loading data files into appropriate variables.

In [1]:
# Load the data file (e.g. CSV) using pandas
# data = pd.read_csv('')  # insert path to file

# Next steps:
# Load the test data?
# Load the feature vectors (X) and the prediction vector (Y) into separate variables

## Data Exploration

Section for **exploratory analysis** on the available data. 

The exploration techniques vary for numerical, categorical, and time-series variables.

Here we typically:

- Examine sample records in the dataset
- Investigate the data types of variables
- Calculate and investigate descriptive statistics (e.g. central tendency, variability)
- Investigate the distribution of feature vectors (e.g. to check for skewness and outliers)
- Investigate the distribution of the prediction vector
- Check the relationship (e.g. correlation) between different features
- Check the relationship between the feature vectors and the prediction vector

Common steps to check the health of the data:

- Check for missing data
- Check the skewness of the data, outlier detection
- Use the data quality report or leverage a checklist (if available) to ensure data quality and analytics readiness 

### Look at Example Records

In [28]:
# data.head(5)  # Display the first 5 records

# Additional:
#     Look at the last few records using data.tail()

### Data Types and Completeness Information

The pandas `info()` method reports the data type of each column along with its count of non-null records, from which the number of missing records can be inferred.

In [29]:
# data.info()

### Descriptive Statistics

In [30]:
# data.describe()

# Additional:
#     We can also make a guess at the skewness of the data at this stage by looking at the difference between
#     the means and medians of numerical features
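
The mean-versus-median check mentioned above can be sketched as follows. This is a minimal illustration only: `example_df` and its column names are hypothetical toy data, not part of the project, and pandas' `skew()` is shown alongside as a second estimate.

```python
import pandas as pd

# Hypothetical toy data for illustration only; replace with your own DataFrame
example_df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 50],
})

numeric = example_df.select_dtypes(include="number")

# A mean well above the median hints at right skew; well below hints at left skew
skew_summary = pd.DataFrame({
    "mean": numeric.mean(),
    "median": numeric.median(),
    "mean_minus_median": numeric.mean() - numeric.median(),
    "skew": numeric.skew(),
})
print(skew_summary)
```

Features with a large positive `mean_minus_median` are candidates for the transformation step later in the notebook.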

### Visualization: Distribution of features

*This section is intended to be a flexible space and can be expanded as the author wishes.*

Visualization techniques differ depending on the type of the feature (e.g. numerical: continuous or discrete; categorical: ordinal, etc.). They also depend on the type of data being dealt with and the insight we want to extract from it.

Common visualization techniques include:
- Bar Plots: Visualize the frequency distribution of categorical features.
- Histograms: Visualize the frequency distribution of numerical features.
- Box Plots: Visualize a numerical feature while providing more information, such as the median and the lower/upper quartiles.
- Scatter Plots: Visualize the relationship (usually the correlation) between two features. Can include a goodness of fit line, to serve as a regression plot.

Below are example code snippets to render these using seaborn.

In [31]:
# Example: drawing a seaborn barplot
# sns.barplot(x="", y="", hue="", data=data)

# Can also use pandas/matplotlib for histograms (numerical features) or bar plots (categorical features)

In [32]:
# Example: drawing a seaborn regplot (x and y are passed as keywords)
# sns.regplot(x=feature1, y=feature2, data=data)

In [33]:
# Example: drawing a pandas scatter_matrix
# pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

### Investigating correlations between features
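
One possible starting point for this section is a pairwise correlation matrix. The sketch below assumes a hypothetical toy DataFrame `example_df` (the column names are invented for illustration) and restricts the calculation to numeric columns.

```python
import pandas as pd

# Hypothetical toy data for illustration only
example_df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [2, 4, 6, 8, 10],       # perfectly correlated with feature_a
    "feature_c": [5, 3, 4, 1, 2],
    "label": ["x", "y", "x", "y", "x"],  # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr = example_df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting matrix can be rendered with seaborn, for example `sns.heatmap(corr, annot=True)`, to make strongly correlated pairs easier to spot.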

### Visualizing prediction vector

### Investigating missing values
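
A minimal sketch of a missing-value report, assuming a hypothetical toy DataFrame `example_df`; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
example_df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Prague", None, None, "Boston"],
    "score": [0.5, 0.7, 0.9, 0.1],
})

# Count and percentage of missing values per column, worst columns first
missing_report = pd.DataFrame({
    "missing_count": example_df.isna().sum(),
    "missing_pct": example_df.isna().mean() * 100,
}).sort_values("missing_count", ascending=False)
print(missing_report)
```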

### Outlier Detection

Outliers can skew any result that takes these data points into account.

One approach to detecting outliers is Tukey's method: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point whose value for a feature lies more than one outlier step below Q1 or above Q3 for that feature is considered abnormal.

One such pipeline for detecting outliers is below:

In [47]:
# def find_outliers(data):

#     #Checking for outliers that occur for more than one feature
#     outliers  = []

#     # For each feature find the data points with extreme high or low values
#     for feature in [list of features to investigate]:

#         # TODO: Calculate Q1 (25th percentile of the data) for the given feature
#         Q1 = np.percentile(data[feature],25)

#         # TODO: Calculate Q3 (75th percentile of the data) for the given feature
#         Q3 = np.percentile(data[feature],75)

#         # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
#         step = (Q3-Q1) * 1.5

#         # Display the outliers
#         out = data[~((data[feature] >= Q1 - step) & (data[feature] <= Q3 + step))]
#         print("Number of outliers for the feature '{}': {}".format(feature, len(out)))
#         outliers = outliers + list(out.index.values)


#     # Keep only the data points flagged as outliers for more than one feature
#     outliers = list(set([x for x in outliers if outliers.count(x) > 1]))

#     return outliers

# print("Data points considered outliers for more than one feature: {}".format(find_outliers(data)))

## Data Cleaning

### Imputing missing values
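
One simple option, sketched here under assumptions, is to fill numeric columns with the median and categorical columns with the most frequent value. The DataFrame `example_df` is hypothetical toy data, and other strategies (dropping rows, model-based imputation) may suit a given project better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
example_df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Prague", None, "Prague", "Boston"],
})

imputed = example_df.copy()
for column in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[column]):
        # The median is less sensitive to outliers than the mean
        imputed[column] = imputed[column].fillna(imputed[column].median())
    else:
        # Use the most frequent value for categorical columns
        imputed[column] = imputed[column].fillna(imputed[column].mode().iloc[0])
print(imputed)
```

When a separate test set exists, the fill values are usually computed on the training data only and then reused, so that no information leaks from the test set.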

### Cleaning outliers or error values

In [48]:
# Remove the outliers, if any were found
# outliers = find_outliers(data)
# good_data = data.drop(index=outliers).reset_index(drop = True)
# print("The good dataset now has {} observations after removing outliers.".format(len(good_data)))

## Feature Engineering

Section to extract more features from those currently available.

In [34]:
# code 

## Data Transformation and Preparation

### Transforming Skewed Continuous Features

It is common practice to apply a logarithmic transformation to highly skewed continuous features. A typical flow is shown in the commented code block below.

In [35]:
# skewed = [list of skewed continuous features]
# raw_features = data.copy()
# raw_features[skewed] = data[skewed].apply(lambda x: np.log(x + 1))

### Normalizing Numerical Features 

A common practice is to apply an appropriate scaling technique to numerical features. Scaling doesn't change the shape of each feature's distribution, but it ensures that each feature is treated equally when applying supervised learners. An implementation using scikit-learn's `MinMaxScaler` is below:

In [36]:
# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# numerical = [list of numerical features]
# raw_features[numerical] = scaler.fit_transform(raw_features[numerical])  # scale the log-transformed values

In [37]:
# Checking examples after transformation
# raw_features.head()

### One Hot Encoding Categorical Features

In [38]:
# Using Pandas get_dummies function
# features = pd.get_dummies(raw_features)

In [39]:
# Encode a categorical prediction vector as numerical?

Consider creating pipeline functions for data preprocessing rather than separate script blocks.
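
A minimal sketch of such a function, combining the three steps above (log transform, min-max scaling, one-hot encoding). The function name `preprocess`, its parameters, and the toy DataFrame are hypothetical, and the min-max scaling is written by hand with pandas here instead of using `MinMaxScaler`.

```python
import numpy as np
import pandas as pd

def preprocess(data, skewed, numerical):
    """Log-transform, min-max scale, and one-hot encode a DataFrame (illustrative)."""
    features = data.copy()
    # Log-transform skewed features; log1p(x) equals log(x + 1)
    features[skewed] = np.log1p(features[skewed])
    # Min-max scale numerical features to the [0, 1] range
    col_min = features[numerical].min()
    col_range = features[numerical].max() - col_min
    # Replace a zero range with 1 to avoid dividing by zero for constant columns
    features[numerical] = (features[numerical] - col_min) / col_range.replace(0, 1)
    # One-hot encode any remaining categorical columns
    return pd.get_dummies(features)

# Hypothetical toy data for illustration only
example_df = pd.DataFrame({
    "income": [0.0, 9.0, 99.0],
    "age": [20, 30, 40],
    "segment": ["a", "b", "a"],
})
features = preprocess(example_df, skewed=["income"], numerical=["income", "age"])
print(features)
```

This version derives the scaling range from whatever data it receives. In a project with a train/test split, the range would normally be fitted on the training set and reused for the test set.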

### Shuffle and Split Data

In [41]:
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(features, prediction_vector, test_size = 0.2, random_state = 0)

# Show the results of the split
# print("Training set has {} samples.".format(X_train.shape[0]))
# print("Testing set has {} samples.".format(X_test.shape[0]))

## Model Exploration

### Naive Predictor Performance

This sets a baseline for the performance of the predictor.

Common techniques:
- For categorical prediction vector, choose the most common class
- For numerical prediction vector, choose a measure of central tendency

Then calculate the evaluation metric (accuracy, F-score, etc.).
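
For the categorical case, the baseline can be sketched as below. The label array `y_true` is hypothetical toy data, and the F-score is computed by hand for the positive class with `beta = 0.5`, an assumed value chosen to match the later cells.

```python
import numpy as np

# Hypothetical binary labels for illustration only (1 = positive class)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Naive predictor: always predict the most common class
classes, counts = np.unique(y_true, return_counts=True)
majority_class = classes[np.argmax(counts)]
y_naive = np.full_like(y_true, majority_class)

accuracy = np.mean(y_naive == y_true)

# F-beta for the positive class; treated as 0 when precision and recall are both 0
beta = 0.5
tp = np.sum((y_naive == 1) & (y_true == 1))
fp = np.sum((y_naive == 1) & (y_true == 0))
fn = np.sum((y_naive == 0) & (y_true == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
denom = beta**2 * precision + recall
fbeta = (1 + beta**2) * precision * recall / denom if denom > 0 else 0.0

print("Naive predictor: accuracy = {:.2f}, F-score = {:.2f}".format(accuracy, fbeta))
```

On imbalanced labels like these, the majority-class baseline reaches a respectable accuracy while never identifying a positive case, which is one reason to track an F-score alongside accuracy.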

In [25]:
#Code to implement the above

### Choosing scoring metrics

In [None]:
# from sklearn.metrics import accuracy_score, fbeta_score

### Creating a Training and Prediction Pipeline

In [42]:
#Importing models from sklearn, or tensorflow/keras components

Modify the function below as required.

In [None]:
# def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
#     '''
#     inputs:
#        - learner: the learning algorithm to be trained and predicted on
#        - sample_size: the size of samples (number) to be drawn from training set
#        - X_train: features training set
#        - y_train: labels training set
#        - X_test: features testing set
#        - y_test: labels testing set
#     '''
    
#     results = {}
    
#     # TODO: Fit the learner to the training data using slicing with 'sample_size'
#     start = time() # Get start time
#     learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
#     end = time() # Get end time
    
#     # TODO: Calculate the training time
#     results['train_time'] = end - start
        
#     # TODO: Get the predictions on the test set,
#     #       then get predictions on the first 300 training samples
#     start = time() # Get start time
#     predictions_test = learner.predict(X_test)
#     predictions_train = learner.predict(X_train[:300])
#     end = time() # Get end time
    
#     # TODO: Calculate the total prediction time
#     results['pred_time'] = end - start
            
#     # TODO: Compute accuracy on the first 300 training samples
#     results['acc_train'] = accuracy_score(y_train[:300],predictions_train)
        
#     # TODO: Compute accuracy on test set
#     results['acc_test'] = accuracy_score(y_test,predictions_test)
    
#     # TODO: Compute F-score on the first 300 training samples
#     results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)

#     # TODO: Compute F-score on the test set
#     results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)

#     # Success
#     print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
#     # Return the results
#     return results

### Model Evaluation

In [44]:
# Change the list of classifiers and the code below as you see fit. We probably also don't need to see the effects of
# different sample sizes.

In [None]:
# # TODO: Import the three supervised learning models from sklearn
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.svm import SVC
# from sklearn.ensemble import AdaBoostClassifier

# # TODO: Initialize the three models, the random states are set to 101 so we know how to reproduce the model later
# clf_A = DecisionTreeClassifier(random_state=101)
# clf_B = SVC(random_state = 101)
# clf_C = AdaBoostClassifier(random_state = 101)

# # TODO: Calculate the number of samples for 1%, 10%, and 100% of the training data
# samples_1 = int(round(len(X_train) / 100))
# samples_10 = int(round(len(X_train) / 10))
# samples_100 = len(X_train)

# # Collect results on the learners in a dictionary
# results = {}
# for clf in [clf_A, clf_B, clf_C]:
#     clf_name = clf.__class__.__name__
#     results[clf_name] = {}
#     for i, samples in enumerate([samples_1, samples_10, samples_100]):
#         results[clf_name][i] = \
#         train_predict(clf, samples, X_train, y_train, X_test, y_test)

Printing out the results

In [None]:
# # Print the results for each classifier
# for clf_name, clf_results in results.items():
#     print(clf_name)
#     display(pd.DataFrame(clf_results).rename(columns={0:'1%', 1:'10%', 2:'100%'}))

## Final Model Building

Using grid search (GridSearchCV) with different parameter/value combinations, we can tune our model for even better results.

Example with AdaBoost below:

In [None]:
# # TODO: Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import make_scorer

# # TODO: Initialize the classifier
# # ('estimator' was named 'base_estimator' before scikit-learn 1.2)
# clf = AdaBoostClassifier(estimator=DecisionTreeClassifier())

# # TODO: Create the parameters list you wish to tune
# parameters = {'n_estimators':[50, 120], 
#               'learning_rate':[0.1, 0.5, 1.],
#               'estimator__min_samples_split' : np.arange(2, 8, 2),
#               'estimator__max_depth' : np.arange(1, 4, 1)
#              }

# # TODO: Make an fbeta_score scoring object
# scorer = make_scorer(fbeta_score,beta=0.5)

# # TODO: Perform grid search on the classifier using 'scorer' as the scoring method
# grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# # TODO: Fit the grid search object to the training data and find the optimal parameters
# grid_fit = grid_obj.fit(X_train,y_train)

# # Get the estimator
# best_clf = grid_fit.best_estimator_

# # Make predictions using the unoptimized and optimized models
# predictions = (clf.fit(X_train, y_train)).predict(X_test)
# best_predictions = best_clf.predict(X_test)

# # Report the before-and-after scores
# print("Unoptimized model\n------")
# print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
# print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
# print("\nOptimized Model\n------")
# print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
# print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
# print(best_clf)

Next steps can include feature importance extraction, predictions on the test set, etc.

## Predictions on Test Set