=============================================================================================================================

## Welcome to this PyData Mumbai talk on "Predictive Analytics Pipeline: a walkthrough"

### Presenter: Sukanya Mandal 
#### https://www.linkedin.com/in/sukanyamandal/

=============================================================================================================================

### The entire predictive analytics or machine learning pipeline involves the following steps:

##### STEP 1: Gathering the business requirement : This is the first and most important step in the entire predictive analytics lifecycle. It requires understanding the business requirement and the desired outcome, which demands domain understanding.

##### STEP 2: Forming a hypothesis : Based on the above understanding a hypothesis is proposed and worked upon. This step also involves mapping the functional viewpoint to the outcome based on the available data. 

##### STEP 3: Planning phase : Once the hypothesis is complete and agreed upon, the next step is to plan the project lifecycle and different phases. Once the entire project phase is decided, it is essential to plan the requirements for each of these phases. 

##### STEP 4: Data collection : To satisfy the business requirement, appropriate data needs to be collected, both in terms of quality and quantity.

##### STEP 5: Understanding the data description : Once the data is collected, understanding the data and its description is important to carry on further analysis. It can be either numerical, textual or categorical. Based on the data description appropriate action can be taken on how to load the data into the system.

##### STEP 6: Data ingestion : Once we have an understanding about the data, it is fed into the system using appropriate methods.

##### STEP 7: Data Wrangling : After loading the data into the system, it is essential to perform further transformations on it. This step would essentially involve - 

###### a. Filtering Data - This step would involve tasks such as removing/handling incorrect or missing data, handling outliers, and so on. Cleaning also involves standardizing attribute column names to make them more readable, intuitive, and conforming to certain standards for everyone to understand.
###### b. Typecasting - This step involves converting data into appropriate data types.
###### c. Transformation - This step would involve transforming existing columns or deriving new attributes based on the requirements.
###### d. Imputing missing values - Missing values in the dataset can cause a lot of problems for the algorithms and can distort the final outcomes, hence it is important to deal with them.
###### e. Handling duplicates - Another common issue with data is the presence of duplicates. Duplicate records do not add much value to the original dataset and can sometimes even cause problems. Hence, it is necessary to handle them.
###### f. Normalizing values - Attribute normalization is the process of standardizing the range of values of attributes. Many machine learning algorithms rely on distance metrics, so attributes or features on different scales/ranges might adversely affect the calculations or bias the outcomes. Normalization is also called feature scaling.
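
The wrangling sub-steps (a) to (f) above can be sketched with pandas. This is a minimal illustration on a hypothetical toy table (the `raw` frame, its messy column names and the derived `is_good` column are assumptions made for the example, not part of the wine dataset workflow):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: messy column names, a numeric column stored as text,
# one missing value and one duplicated row.
raw = pd.DataFrame({
    "Fixed Acidity": ["7.4", "7.8", "7.8", "11.2", "7.4"],
    "pH ": [3.51, 3.20, np.nan, 3.16, 3.51],
    "Quality": [5, 5, 5, 6, 5],
})

# a. Filtering/cleaning: standardize the column names.
df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# b. Typecasting: convert the text column to floats.
df["fixed_acidity"] = df["fixed_acidity"].astype(float)

# c. Transformation: derive a new attribute from an existing one.
df["is_good"] = df["quality"] >= 6

# d. Imputing: fill the missing pH with the column median.
df["ph"] = df["ph"].fillna(df["ph"].median())

# e. Handling duplicates: keep the first copy of each repeated row.
df = df.drop_duplicates().reset_index(drop=True)

# f. Normalizing: min-max scale a feature to the range [0, 1].
col = df["fixed_acidity"]
df["fixed_acidity_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```

Median imputation and min-max scaling are only one choice each; the right strategy depends on the data and the model.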

##### STEP 8: Feature Engineering : This is a crucial step before getting to the model. Feature engineering can make or break the entire outcome. Selecting appropriate features based on the requirements is essential. Again, this step requires domain knowledge to make appropriate decisions.

##### STEP 9: Model Building - This step requires one to choose the correct model based on the available data to produce the required outcome. It also involves understanding the types of problems that machine learning can solve and the types of models available.

##### STEP 10: Model Evaluation - Once we have decided on the representation of the problem and possible set of models, we need some judging criterion or criteria that will help us choose one model over the others, or the best model from a set of candidate models. The idea is to define a metric for evaluation or a scoring function/loss function that will help enable this.

##### STEP 11: Model Tuning - This step involves tuning the existing models based on certain features to improve the performance. This requires an understanding of the underlying math and logic of the algorithm in focus.

##### STEP 12: Model Interpretation - This step is about the question "can we explain and interpret Machine Learning models in an easy to understand way?" - so that even someone with no technical knowledge can understand what is happening inside the model. This becomes crucial because it is essential for the clients to understand that they can trust the model and the model is capable of giving the desired outcome.

##### STEP 13: Model deployment - The final step is to deploy the model in production, so that we can finally use it.

### Walking through an example: 

#### In this example, we consider the red wine quality dataset from the UC Irvine Machine Learning Repository. There is also a publication based on this dataset. Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. 

##### STEP 1: The business requirement: The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends to some extent on a rather abstract concept of wine appreciation by wine tasters, whose opinions may vary widely. Another key factor in wine certification and quality assessment is physicochemical tests, which are laboratory-based and take into account factors like acidity, pH level, presence of sugar and other chemical properties. The current requirement is to predict the wine quality based on the data available from certain physicochemical tests. 


##### STEP 2: Forming hypothesis: The dataset describes properties of red vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests. We will consider the following attributes to derive our results: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality. Understanding the attributes - 

Fixed acidity: Acids are one of the fundamental properties of wine and contribute greatly to the taste of the wine. Reducing acids significantly might lead to wines tasting flat. Fixed acids include tartaric, malic, citric, and succinic acids, which are found in grapes (except succinic). This variable is usually expressed in g(tartaric acid)/(dm)^3 

Volatile acidity: These acids are to be distilled out from the wine before completing the production process. It is primarily constituted of acetic acid, though other acids like lactic, formic, and butyric acids might also be present. An excess of volatile acids is undesirable and leads to an unpleasant flavor. In the United States, the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine. The volatile acidity is expressed in g(acetic acid)/(dm)^3 

Citric acid: This is one of the fixed acids that gives a wine its freshness. Usually most of it is consumed during the fermentation process and sometimes it is added separately to give the wine more freshness. It’s usually expressed in g/(dm)^3

Residual sugar: This typically refers to the natural sugar from grapes that remains after the fermentation process stops, or is stopped. It’s usually expressed in g/(dm)^3

Chlorides: This is usually a major contributor to saltiness in wine. It’s usually expressed in g(sodium chloride)/(dm)^3

Free sulfur dioxide: This is the part of the sulfur dioxide that, when added to a wine, is said to be free after the remaining part binds. Winemakers will always try to get the highest proportion of free sulfur to bind. They are also known as sulfites and too much is undesirable and gives a pungent odor. This variable is expressed in mg/(dm)^3

Total sulfur dioxide: This is the sum total of the bound and the free sulfur dioxide (SO2). Here, it’s expressed in mg/(dm)^3 . This is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines and excess of it can even kill good yeast and produce an undesirable odor.

Density: This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol. Here, it’s expressed in g/(cm)^3

pH: Also known as the potential of hydrogen, this is a numeric scale to specify the acidity or basicity of the wine. Fixed acidity contributes the most toward the pH of wines. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.

Sulphates: These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of the winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine aroma and flavor. Here, they are expressed in g(potassium sulphate)/(dm)^3

Alcohol: Wine is an alcoholic beverage. Alcohol is formed as a result of yeast converting sugar during the fermentation process. The percentage of alcohol can vary from wine to wine. Hence it is not a surprise for this attribute to be a part of this dataset. It’s usually measured in % vol or alcohol by volume (ABV)

Quality: Wine experts graded the wine quality between 0 (very bad) and 10 (very excellent). The eventual quality score is the median of at least three evaluations made by the same wine experts.

##### STEP 3: Planning Phase: Based on the above hypothesis, the project has been planned. The requirements are listed below: 

In [56]:
import numpy as np
import pandas as pd

In [57]:
from sklearn.model_selection import train_test_split # Split arrays or matrices into random train and test subsets. 
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn import preprocessing # To scale and standardize datasets. 
# http://scikit-learn.org/stable/modules/preprocessing.html

from sklearn.ensemble import RandomForestRegressor # a meta estimator that fits a number of decision trees (regression trees here)
# on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

from sklearn.pipeline import make_pipeline # Construct a Pipeline from the given estimators.
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

from sklearn.model_selection import GridSearchCV # search over specified parameter values for an estimator. Important members 
# are fit, predict. GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, 
# “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

from sklearn.metrics import mean_squared_error, r2_score # R^2 (coefficient of determination) regression score function. 
# Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
# Mean squared error regression loss 
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

##### STEP 4, 5, 6 and 7: 

In [58]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

In [60]:
data.shape

(1599, 12)

This dataset has 1599 datapoints across 12 attributes. Let's have a look at the data with their attributes below. 

In [61]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [62]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 149.9 KB


This gives us an idea about the datatype and constraints of each of the 12 attributes present in the dataset and also the count of datapoints for each attribute in the dataset. 

In [63]:
data.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


##### The describe function gives us the statistical understanding of the complete dataset. 

Count returns the number of non-NA/null observations over the requested axis, i.e. the number of datapoints present for each attribute. All the attributes have 1599 datapoints, which matches the row count we saw in the shape of the dataset. This means that no attribute has missing data, so we do not need to handle missing values, at least for this dataset. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count

Mean returns the mean of the values for the requested axis. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html#pandas.DataFrame.mean 

Std returns sample standard deviation over requested axis. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.std.html#pandas.DataFrame.std

Min - this method returns the minimum of the values in the object. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html#pandas.DataFrame.min

Max - this method returns the maximum of the values in the object. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html#pandas.DataFrame.max

25%, 50% and 75% - For numeric data, the result index includes the lower percentile, the 50th percentile and the upper percentile. By default, the lower percentile is 25% and the upper percentile is 75%. The 50th percentile is the same as the median. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
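
To make the meaning of the `describe()` rows concrete, here is a small sketch on a hypothetical toy column (the `toy` frame with one deliberately missing value is an assumption for illustration). It shows that `count` skips missing values, that the `50%` row is the median, and that `std` is the sample standard deviation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8, np.nan, 10.2]})

summary = toy.describe()

# count ignores missing values, so here it is smaller than the number of rows.
n_rows = len(toy)
n_non_null = int(summary.loc["count", "alcohol"])
n_missing = int(toy["alcohol"].isnull().sum())

# The 50% row is the median; std uses the sample formula (ddof=1).
median_matches = np.isclose(summary.loc["50%", "alcohol"], toy["alcohol"].median())
std_matches = np.isclose(summary.loc["std", "alcohol"], toy["alcohol"].std(ddof=1))

print(summary)
print(n_rows, n_non_null, n_missing)
```

When `count` equals the number of rows for every column, as in the wine dataset, no imputation is needed.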

##### STEP 8: Feature Engineering:

Every supervised machine learning project has features and labels. Features are the parts of a dataset that are used to predict the label, and the label is the value we want to predict from them. After the model has been trained, we give it features so that it can predict the labels. Since we have to predict the wine quality, the attribute 'quality' becomes our label and the rest of the attributes become features. So in the next step we separate the features and the label into two different objects.

In [64]:
y = data.quality
X = data.drop('quality', axis=1) 
# Syntax:
# df = df.drop('column_name', axis=1) --> where 1 is the axis number (0 for rows and 1 for columns.)
# df.drop('column_name', axis=1, inplace=True)   --> To delete the column without having to reassign df
# df.drop(df.columns[[0, 1, 3]], axis=1) --> To drop by column number instead of by column label, e.g. the 1st, 2nd and 4th columns

##### STEP 9: Model Building

We split the dataset into train and test keeping a ratio of 80:20. We keep 80% of the dataset for training set and the remaining 20% for the testing set.

In [65]:
# Splitting data into training and testing datasets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Standardizing our dataset: Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations. Standardization is a common requirement for machine learning tasks.
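
As a quick check of that definition, the sketch below standardizes a hypothetical two-feature matrix by hand and compares the result with `StandardScaler`. The values in `X_demo` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two features on very different scales.
X_demo = np.array([[7.4, 34.0], [7.8, 67.0], [11.2, 60.0], [7.9, 40.0]])

# Manual standardization: subtract each column's mean and divide by its
# standard deviation (population std, ddof=0, as StandardScaler does).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates purely because of its units.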

In [66]:
# pipeline with processing and model
# this step mainly becomes a part of feature scaling and data transformation
# below code forms a modeling pipeline that first transforms the data using StandardScaler() and then fits a model 
# using a random forest regressor.
pipeline = make_pipeline(preprocessing.StandardScaler(), RandomForestRegressor(n_estimators=100))
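
A short sketch of how such a pipeline is organized: `make_pipeline` names each step after its lowercased class name, and those names are what a hyperparameter key such as `randomforestregressor__max_depth` refers to. The random data and the `demo_pipeline` name below are assumptions used only to show that fitting and predicting run end to end:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# Step names are generated from the lowercased class names.
step_names = list(demo_pipeline.named_steps)

# Every setting is addressable as "<step name>__<parameter name>".
has_depth_key = "randomforestregressor__max_depth" in demo_pipeline.get_params()

# Hypothetical random data, only to exercise fit and predict.
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 3)
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.rand(40)

demo_pipeline.fit(X_demo, y_demo)          # scales, then fits the forest
predictions = demo_pipeline.predict(X_demo[:5])  # scales, then predicts

print(step_names, has_depth_key, predictions.shape)
```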

##### STEP 10 and 11: Model Evaluation and Tuning

###### Model Tuning

There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot. Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model. Let's take the example of random forest hyperparameters. Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are model parameters. However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. The algorithm also cannot decide how many trees to include in the forest. These are examples of hyperparameters that the user must set.
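
The distinction can be seen directly on a fitted forest. In this sketch (random data and a deliberately tiny forest, both assumptions for illustration), the hyperparameters are read back with `get_params()` before training, while the split thresholds of the first tree only exist after `fit`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical random data for illustration.
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 2)
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.rand(60)

# Hyperparameters: chosen by the user before training.
forest = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0)
hyperparams = forest.get_params()

# Model parameters: learned from the data during fit.
forest.fit(X_demo, y_demo)
first_tree = forest.estimators_[0].tree_
# Internal nodes have a feature index >= 0; leaves are marked with a negative value.
split_thresholds = first_tree.threshold[first_tree.feature >= 0]

print(hyperparams["n_estimators"], hyperparams["max_depth"], hyperparams["criterion"])
print(split_thresholds)
```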

In [67]:
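# Declaring the hyperparameters to tune
# Note: max_features='auto' was removed in scikit-learn 1.3; for a random forest
# regressor it meant "use all features", which newer versions express as 1.0.
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}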

Cross-validation is a process for reliably estimating the performance of a method for building a model by training and evaluating the model multiple times using the same method. In this context, that "method" is simply a set of hyperparameters.

How does a Cross-validation work:
    a. Split your data into k equal parts, or "folds" (typically k=10).
    b. Train your model on k-1 folds (e.g. the first 9 folds).
    c. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
    d. Perform steps (b) and (c) k times, each time holding out a different fold.
    e. Aggregate the performance across all k folds. This is your performance metric.
    
Cross-validation helps evaluate different hyperparameters and estimate their effectiveness. This helps save the test dataset and use it only when we are ready to select a model.
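
The steps (a) to (e) listed above can be written out by hand. This is a minimal sketch, assuming synthetic data, k=5 and a plain linear regression as the model being evaluated (none of which come from the wine example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic regression data.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
fold_scores = []
# a. Split the data into k folds.
for train_idx, holdout_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_demo):
    # b. Train on the k-1 training folds.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # c. Evaluate on the hold-out fold (R^2 for regressors).
    fold_scores.append(model.score(X_demo[holdout_idx], y_demo[holdout_idx]))
    # d. The loop repeats this k times with a different hold-out fold.

# e. Aggregate the performance across all folds.
cv_score = float(np.mean(fold_scores))
print(fold_scores, cv_score)
```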

A cross-validation pipeline runs the data pre-processing steps inside the cross-validation loop. This is a best practice: the transformations are fitted on the training folds only, so no information from the hold-out fold leaks into training. 

Cross-validation pipeline works like - 
    a. Split your data into k equal parts, or "folds" (typically k=10).
    b. Preprocess k-1 training folds.
    c. Train your model on the same k-1 folds.
    d. Preprocess the hold-out fold using the same transformations from step (b).
    e. Evaluate your model on the same hold-out fold.
    f. Perform steps (b) - (e) k times, each time holding out a different fold.
    g. Aggregate the performance across all k folds. This is your performance metric.
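
The pipeline variant of the loop can be sketched the same way. Here the scaler is fitted on the training folds only and then reused on the hold-out fold, and the result is compared with `cross_val_score` applied to a scikit-learn pipeline. The synthetic data, k=5 and the `Ridge` model are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with features on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3) * np.array([1.0, 50.0, 300.0])
y_demo = X_demo[:, 0] + 0.01 * X_demo[:, 2] + rng.normal(scale=0.1, size=100)

folds = KFold(n_splits=5)  # no shuffling, so both approaches see the same splits
manual_scores = []
for train_idx, holdout_idx in folds.split(X_demo):
    # b. Fit the preprocessing on the training folds only.
    scaler = StandardScaler().fit(X_demo[train_idx])
    # c. Train on the preprocessed training folds.
    model = Ridge(alpha=1.0).fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # d. + e. Apply the same transformation to the hold-out fold and evaluate.
    manual_scores.append(model.score(scaler.transform(X_demo[holdout_idx]), y_demo[holdout_idx]))

# The same procedure, delegated to a pipeline.
pipeline_scores = cross_val_score(
    make_pipeline(StandardScaler(), Ridge(alpha=1.0)), X_demo, y_demo, cv=folds
)
print(np.allclose(manual_scores, pipeline_scores))
```

Passing the pipeline, rather than pre-scaled data, to the cross-validation routine is what keeps the preprocessing inside each fold.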

In [68]:
#Tuning the model using a cross-validation pipeline
classifier = GridSearchCV(pipeline, hyperparameters, cv=10)

#Fit the training set and tune the model
classifier.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decr...mators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters.
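
To see how large that grid is, the candidate combinations can be enumerated with `ParameterGrid`. The sketch below reuses the grid from this example; the values are only listed here, never passed to an estimator, and the fit count assumes `cv=10` with `refit=True` as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "randomforestregressor__max_features": ["auto", "sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 5, 3, 1],
}

# Every combination of one value per hyperparameter is a candidate.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# Each candidate is trained once per fold, plus one final refit of the best one.
n_fits = n_candidates * 10 + 1

print(n_candidates, n_fits)
for candidate in candidates[:3]:
    print(candidate)
```

With 3 values for one hyperparameter and 4 for the other there are 12 candidates, so the search trains 120 models during cross-validation and one more for the final refit.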

Searching for the best set of parameters using cross-validation, below - 

In [69]:
print (classifier.best_params_)

{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'log2'}


After tuning, the model is refit on the training dataset: GridSearchCV automatically refits the model with the best set of hyperparameters using the entire training set. 

In [70]:
#Refit the entire training set
print(classifier.refit) # No additional code needed if classifier.refit == True (default is True)

True


###### Model Evaluation

In [71]:
# Evaluate the model pipeline on test data
y_prediction = classifier.predict(X_test)

In [72]:
#Evaluating the model performance based on the metrics imported earlier
print (r2_score(y_test, y_prediction))

0.453693531828


In [73]:
print (mean_squared_error(y_test, y_prediction))

0.3325640625
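
To clarify what the two metrics measure, the sketch below computes them from their definitions on a few hypothetical values (`y_true` and `y_pred` are made up for illustration) and compares the results with the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true quality scores and model predictions.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_pred = np.array([5.2, 5.8, 5.4, 6.1, 6.0])

# Mean squared error: the average squared difference.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Taking the square root of the MSE reported above (0.3326) gives about 0.58, which reads as a typical error of roughly 0.58 points on the quality scale.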


###### Further actions:

Ways to improve a model - 
a. Try other regression model families (e.g. regularized regression, boosted trees, etc.).
b. Collect more data if it's cheap to do so.
c. Engineer smarter features after spending more time on exploratory analysis.
d. Speak to a domain expert to get more context (...this is a good excuse to go wine tasting!).
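
Suggestion (a) can be sketched as a small comparison loop. This is an illustration on synthetic data, with three arbitrarily chosen model families and default-like settings (all assumptions); on the wine data the same loop would use `X_train` and `y_train`:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with one linear and one non-linear effect.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = 2.0 * X_demo[:, 0] + np.sin(6.0 * X_demo[:, 1]) + rng.normal(scale=0.1, size=200)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosted_trees": GradientBoostingRegressor(random_state=0),
}

# Score every family with the same cross-validation pipeline.
mean_r2 = {}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    mean_r2[name] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

best_family = max(mean_r2, key=mean_r2.get)
print(mean_r2, best_family)
```

Which family wins depends on the data, so the comparison should be rerun on the real training set before drawing conclusions.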


## Thank you!