## Supervised Learning
## Project: Finding Donors for *CharityML*

## Getting Started 

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income). The dataset was donated by Ron Kohavi and Barry Becker after being published in the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_. You can find the article by Ron Kohavi [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf). The data we investigate here is a lightly modified version of the original dataset: the `'fnlwgt'` feature and records with missing or ill-formatted entries have been removed.


## Exploring the Data


In [None]:
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

import visuals as vs

%matplotlib inline

# Load the Census dataset
data = pd.read_csv("census.csv")

display(data.head(n=5))

In [None]:
#  Total number of records
n_records = len(data)

#  Number of records where individual's income is more than $50,000
n_greater_50k = len(data[data['income']=='>50K'])

#  Number of records where individual's income is at most $50,000
n_at_most_50k = len(data[data['income']=='<=50K'])

#  Percentage of individuals whose income is more than $50,000
greater_percent = n_greater_50k/(n_greater_50k+n_at_most_50k)*100
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {}%".format(greater_percent))

**Featureset Exploration**

* **age**: continuous. 
* **workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
* **education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
* **education-num**: continuous. 
* **marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
* **occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
* **relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
* **race**: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other. 
* **sex**: Female, Male. 
* **capital-gain**: continuous. 
* **capital-loss**: continuous. 
* **hours-per-week**: continuous. 
* **native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

----
## Preparing the Data
Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured; this is typically known as **preprocessing**. Fortunately, this dataset has no invalid or missing entries that we must deal with. However, some qualities of certain features must be adjusted. This preprocessing can help tremendously with the outcome and predictive power of nearly all learning algorithms.

### Transforming Skewed Continuous Features
A dataset may sometimes contain at least one feature whose values tend to lie near a single number, but which also has a non-trivial number of values vastly larger or smaller than that number. Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. In the census dataset, two features fit this description: `'capital-gain'` and `'capital-loss'`.


In [None]:
# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Visualize skewed continuous features of original data
vs.distribution(data)

For highly-skewed feature distributions such as `'capital-gain'` and `'capital-loss'`, it is common practice to apply a <a href="https://en.wikipedia.org/wiki/Data_transformation_(statistics)">logarithmic transformation</a> on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of `0` is undefined, so we must shift the values by a small amount above `0` to apply the logarithm successfully.



In [None]:
# Log-transform the skewed features; log(x + 1) keeps zero values defined
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = features_raw.copy()  # copy so the raw features stay unchanged
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))

# Visualize the new log distributions
vs.distribution(features_log_transformed, transformed = True)
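
As a small illustration of why the shift by one matters, the sketch below applies the same transformation to a handful of made-up values (they are not taken from `census.csv`). It also compares the result with NumPy's `np.log1p`, which computes `log(1 + x)` directly.

```python
import numpy as np

# Made-up, heavily skewed values (illustrative only, not from census.csv)
capital_gain = np.array([0.0, 0.0, 0.0, 2174.0, 99999.0])

# log(x + 1) maps 0 to 0 and compresses the large values
shifted_log = np.log(capital_gain + 1)

# np.log1p computes log(1 + x) directly and gives the same result
log1p_values = np.log1p(capital_gain)

print(shifted_log.round(3))
print(np.allclose(shifted_log, log1p_values))
```

The zeros stay at zero, while the largest value shrinks from roughly one hundred thousand to under twelve.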

### Normalizing Numerical Features
In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution (such as `'capital-gain'` or `'capital-loss'` above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning, as exampled below.


In [None]:
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

features_log_minmax_transform = features_log_transformed.copy()  # copy so the log-transformed frame stays unchanged
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

display(features_log_minmax_transform.head(n = 5))
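
With its default range of `(0, 1)`, `MinMaxScaler` rescales each column by subtracting the column minimum and dividing by the column range. The sketch below works through that formula by hand on made-up ages, which are an assumption for illustration rather than values from the dataset.

```python
import numpy as np

# Made-up ages (illustrative only)
ages = np.array([17.0, 25.0, 39.0, 50.0, 90.0])

# Min-max scaling to [0, 1]: subtract the minimum, divide by the range
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(ages_scaled.round(3))
```

The ordering and relative spacing of the values are preserved; only the units change, which is why the scaled numbers no longer read as ages.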

### Implementation: Data Preprocessing



In [None]:
# One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()
features_final = pd.get_dummies(features_log_minmax_transform)

# Encode the 'income_raw' data to numerical values (1 for '>50K', 0 otherwise)
income = income_raw.apply(lambda x: 1 if x == '>50K' else 0)

encoded = list(features_final.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))
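
To see what one-hot encoding does to the feature count, here is a minimal sketch on a toy frame. The column values are made up; only `pd.get_dummies` itself is the call used in the cell above.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column (made-up values)
toy = pd.DataFrame({
    "age": [0.30, 0.45, 0.12],
    "sex": ["Male", "Female", "Male"],
})

# Categorical columns are expanded into one indicator column per category;
# numerical columns pass through unchanged
toy_encoded = pd.get_dummies(toy)

print(list(toy_encoded.columns))
```

Each category becomes its own column named `<feature>_<category>`, so a feature with many categories, such as `native-country`, contributes many columns to the final feature set.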



### Shuffle and Split Data
Now all _categorical variables_ have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.

Run the code cell below to perform this split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    income, 
                                                    test_size = 0.2, 
                                                    random_state = 0)


print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

----
## Evaluating Model Performance
In this section, we will investigate four different algorithms (three supervised learners and a naive predictor that serves as a baseline) and determine which is best at modeling the data.

### Question 1 - Naive Predictor Performance
* If we chose a model that always predicted an individual made more than $50,000, what would that model's accuracy and F-score be on this dataset?

**Please note** that the purpose of generating a naive predictor is simply to show what a base model without any intelligence would look like. In the real world, ideally your base model would be either the results of a previous model or based on a research paper upon which you are looking to improve. When there is no benchmark model set, getting a result better than random choice is a place you could start from.



In [None]:
# Counting the ones, as this is the naive case. Note that 'income' is the
# 'income_raw' data encoded to numerical values in the data preprocessing step.
TP = np.sum(income)
FP = income.count() - TP # Specific to the naive case

TN = 0 # No predicted negatives in the naive case
FN = 0 # No predicted negatives in the naive case

# Calculate accuracy, precision and recall
accuracy = TP/(income.count())
recall = TP/(TP+FN)
precision = TP/(TP+FP)

# Calculate the F-score for beta = 0.5 from the precision and recall computed above
fscore = (1 + 0.5**2) * ((precision * recall) / ((0.5**2 * precision) + recall))

print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
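
The F-score computed in the cell above is the F-beta score, `(1 + beta^2) * precision * recall / (beta^2 * precision + recall)`. The sketch below wraps that formula in a hypothetical helper, `f_beta`, and evaluates it on made-up counts for a predictor that labels everything positive. The counts are an assumption for illustration, not the census figures.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score from precision and recall (hypothetical helper)."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # convention used here when both precision and recall are 0
    return (1 + beta_sq) * precision * recall / denominator

# Made-up counts: 25 positives among 100 records, everything predicted positive
tp, fp, fn = 25, 75, 0
precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that were found

naive_f_half = f_beta(precision, recall, beta=0.5)
print(round(naive_f_half, 4))
```

With `beta = 0.5`, precision is weighted more heavily than recall, so a predictor with perfect recall but poor precision still scores low.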

### Supervised Learning Models
**The following are some of the supervised learning models that are currently available in** [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) **that you may choose from:**
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent Classifier (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression

**Three models are:**
    
    1. Decision Trees
        * Real-World Application - Used in business management to extract useful information from databases for better
          customer service.
          Reference: http://what-when-how.com/artificial-intelligence/decision-tree-applications-for-data-modelling-artificial-intelligence/
        * Strengths
            1. Non-linear relationships between the data parameters do not affect the performance of the tree.
            2. Simple to understand and interpret.
            3. Requires less effort for data preparation.
        * Weaknesses
            1. Unstable: a small change in the data can lead to a large change in the tree, and overfitting can
            occur.
            2. Creating large or complex decision trees with many branches is very time consuming.
        * Candidacy
            Since decision trees are easy to interpret, this model would be a good choice. It is also suitable because
            it handles non-linear relationships, which are likely present in the given data.
    
    2. Support Vector Machines (SVC)
        * Real-World Application - Image classification.
          Reference: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
        * Strengths
            1. The kernel trick maps low-dimensional input data into a higher-dimensional space, where a better
            separating boundary can be found.
            2. It works really well when there is a clear margin of separation.
            3. It is effective in cases where the number of dimensions is greater than the number of samples.
        * Weaknesses
            1. It doesn't perform well on large datasets because the required training time is high.
            2. It also doesn't perform very well when the dataset is noisy, i.e. when the target classes overlap.
            3. SVM doesn't directly provide probability estimates.
        * Candidacy
            SVC is chosen because it works well with high-dimensional data. The dimensionality of our dataset increased
            after one-hot encoding, so SVM would be a good choice.
    
    3. Ensemble Methods (AdaBoost)
        * Real-World Application - Face detection.
          Reference: https://www.sciencedaily.com/releases/2009/11/091110090858.htm
        * Strengths
            1. Simple classifiers are combined to form a complex model with high accuracy.
            2. By using weak learners, the generalizability of the model improves.
            3. Somewhat resistant to overfitting.
        * Weaknesses
            1. Expensive in time and computation on very large datasets.
            2. Sensitive to outliers.
        * Candidacy
            It is a simple ensemble method that is quite effective and has good average accuracy in comparison to other
            traditional supervised learning algorithms.

### Implementation - Creating a Training and Predicting Pipeline


In [None]:
from sklearn.metrics import fbeta_score,accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    '''
    
    results = {}
    
    # Fit the learner to the first 'sample_size' training samples
    # using .fit(training_features[:sample_size], training_labels[:sample_size])
    start = time() # Get start time
    learner = learner.fit(X_train[:int(sample_size)], y_train[:int(sample_size)])
    end = time() # Get end time
    
    #   Calculate the training time
    results['train_time'] = end - start
        
    #   Get the predictions on the test set(X_test),
    #       then get predictions on the first 300 training samples(X_train) using .predict()
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    #   Calculate the total prediction time
    results['pred_time'] = end - start
            
    #   Compute accuracy on the first 300 training samples which is y_train[:300]
    results['acc_train'] = accuracy_score(y_train[:300],predictions_train)
        
    #   Compute accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test,predictions_test)
    
    #   Compute F-score on the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300],predictions_train,beta = 0.5)
        
    #   Compute F-score on the test set which is y_test
    results['f_test'] = fbeta_score(y_test,predictions_test,beta=0.5)
       
    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        

    return results

### Implementation: Initial Model Evaluation


In [None]:
#Import the three supervised learning models from sklearn
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
#  Initialize the three models
clf_A = DecisionTreeClassifier()
clf_B = SVC()
clf_C = AdaBoostClassifier()

#  Calculate the number of samples for 1%, 10%, and 100% of the training data
#  (cast to int so the values can be used as slice bounds)
samples_100 = len(y_train)
samples_10 = int(samples_100 / 10)
samples_1 = int(samples_100 / 100)

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)

vs.evaluate(results, accuracy, fscore)

----
## Improving Results


Based on the graphs above, AdaBoostClassifier seems to be the most appropriate choice:

* AdaBoostClassifier has the highest F-score and accuracy when trained on the entire training set.
* Although it takes more time than DecisionTreeClassifier, its F-score is higher. It also takes significantly less time than SVC, which is the next best classifier in terms of F-score and accuracy.
* AdaBoostClassifier should scale well if the size of the dataset increases, as its training time on the entire training set is quite low.


 * The AdaBoost classifier combines several weak classifiers to form a final classifier with high accuracy.
    * A weak classifier works better than a naive classifier but still doesn't do a great job at classifying.
    * During each round of the training process, it prioritizes the wrongly classified instances for the next round.
    * It finds the best weak classifier in each round of training.
    * After a certain number of rounds (possibly predetermined), or once the classification error can't be improved, it stops and a final classifier is created by the weighted vote of all weak learners.
    * The weight of each weak learner depends on its accuracy in prediction.
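
The reweighting described above can be sketched for a single boosting round. This follows the classic discrete AdaBoost update for labels in {-1, +1}. The labels and weak-learner predictions are made up, and scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import math

# Made-up labels and weak-learner predictions in {-1, +1}
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the second sample is misclassified

# Start with uniform sample weights
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the weak learner
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's vote weight grows as its error shrinks
alpha = 0.5 * math.log((1 - error) / error)

# Increase weights of misclassified samples, decrease the rest, renormalize
weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(alpha, 4), [round(w, 3) for w in weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.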

### Implementation: Model Tuning


In [None]:

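# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = AdaBoostClassifier()

# Parameter combinations to try
parameters = {'n_estimators': [50, 200], 'learning_rate': [0.5, 0.1, 0.01]}

# Score each candidate with the F-beta score (beta = 0.5)
scorer = make_scorer(fbeta_score, beta = 0.5)

# Pass the scorer by keyword so the call works across scikit-learn versions
grid_obj = GridSearchCV(clf, parameters, scoring = scorer)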

grid_fit = grid_obj.fit(X_train,y_train)

best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
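
Grid search fits one model per parameter combination and cross-validation fold, then keeps the combination with the best mean score. The sketch below shows how the winning combination can be inspected. It runs on a small synthetic dataset from `make_classification` as a stand-in for the census data, with a smaller grid chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Small synthetic binary problem (a stand-in, not the census data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Score every candidate with F-beta (beta = 0.5), as in the tuning cell
scorer = make_scorer(fbeta_score, beta=0.5)
param_grid = {"n_estimators": [10, 20], "learning_rate": [0.5, 1.0]}

grid = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                    scoring=scorer, cv=3)
grid.fit(X, y)

# best_params_ holds the winning combination, best_score_ its mean CV score
print(grid.best_params_, round(grid.best_score_, 4))
```

Printing `grid_fit.best_params_` in the notebook in the same way would show which of the six combinations was selected.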

### Implementation - Extracting Feature Importance


In [None]:

model = AdaBoostClassifier().fit(X_train,y_train)

importances = model.feature_importances_

vs.feature_plot(importances, X_train, y_train)

### Feature Selection
How does a model perform if we only use a subset of all the available features in the data? With fewer features required to train, the expectation is that training and prediction time is much lower, at the cost of performance metrics. From the visualization above, we see that the top five most important features contribute more than half of the importance of **all** features present in the data. This hints that we can attempt to *reduce the feature space* and simplify the information required for the model to learn. The code cell below will use the same optimized model you found earlier, and train it on the same training set *with only the top five important features*.

In [None]:
from sklearn.base import clone

# Reduce the feature space
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train on the "best" model found from grid search earlier
clf = (clone(best_clf)).fit(X_train_reduced, y_train)

reduced_predictions = clf.predict(X_test_reduced)

print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta = 0.5)))
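
The column selection in the cell above relies on `np.argsort`, which sorts in ascending order and therefore has to be reversed before the top entries are taken. The sketch below shows the same indexing pattern on made-up importances for six hypothetical features, and also computes the share of total importance that the selected features cover.

```python
import numpy as np

# Made-up importances for six hypothetical features (they sum to 1)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])
importances = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

# argsort is ascending, so reverse it and keep the first k indices
k = 3
top_idx = np.argsort(importances)[::-1][:k]

top_features = feature_names[top_idx].tolist()
top_share = importances[top_idx].sum()

print(top_features, round(top_share, 2))
```

Checking the cumulative share this way is one quick test of whether a reduced feature set is likely to retain most of the model's signal.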

### Effects of Feature Selection



The final F-score decreased by nearly 0.06 and the accuracy by 0.03 when the model was trained on only five features. Even though both metrics decreased, I would consider using the reduced data if training time were a factor and the number of data points were very large. My final choice would still depend on the relative decrease in F-score and accuracy, since AdaBoost is a relatively fast model.