<a href="https://colab.research.google.com/github/swapnilxsonawane/Oren/blob/master/Welcome_To_Colaboratory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Colaboratory logo" height="45px" src="/img/colab_favicon.ico" align="left" hspace="10px" vspace="0px"></p>

<h1>Welcome to Colaboratory!</h1>


Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

In [0]:
#@title Introducing Colaboratory { display-mode: "form" }
#@markdown This 3-minute video gives an overview of the key features of Colaboratory:
from IPython.display import YouTubeVideo
YouTubeVideo('inN8seMm7UI', width=600, height=400)

## Getting Started

The document you are reading is a [Jupyter notebook](https://jupyter.org/), hosted in Colaboratory. It is not a static page, but an interactive environment that lets you write and execute code in Python and other languages.

For example, here is a **code cell** with a short Python script that computes a value, stores it in a variable, and prints the result:

In [0]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

86400

To execute the code in the above cell, select it with a click and then either press the play button to the left of the code, or use the keyboard shortcut "Command/Ctrl+Enter".

All cells modify the same global state, so variables that you define by executing a cell can be used in other cells:

In [0]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

604800
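
The same sharing applies to functions, not only to plain variables. As a minimal sketch (the names `SECONDS_IN_A_DAY` and `days_to_seconds` are illustrative and not part of this notebook), the two commented sections below stand in for two separate cells:

```python
# --- first cell: define a constant and a helper function ---
SECONDS_IN_A_DAY = 24 * 60 * 60

def days_to_seconds(days):
    """Convert a number of days to seconds."""
    return days * SECONDS_IN_A_DAY

# --- later cell: both names are still available in the shared state ---
seconds_in_a_fortnight = days_to_seconds(14)
print(seconds_in_a_fortnight)
```

If the first cell has not been executed in the current session, the later cell fails with a `NameError`, so execution order matters.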

For more information about working with Colaboratory notebooks, see [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb).


## More Resources

Learn how to make the most of Python, Jupyter, Colaboratory, and related tools with these resources:

### Working with Notebooks in Colaboratory
- [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb)
- [Guide to Markdown](/notebooks/markdown_guide.ipynb)
- [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
- [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)
- <img src="/img/new.png" height="20px" align="left" hspace="4px" alt="New"></img>
 [TensorFlow 2 in Colab](/notebooks/tensorflow_version.ipynb)

### Working with Data
- [Loading data: Drive, Sheets, and Google Cloud Storage](/notebooks/io.ipynb) 
- [Charts: visualizing data](/notebooks/charts.ipynb)
- [Getting started with BigQuery](/notebooks/bigquery.ipynb)

### Machine Learning Crash Course
These are a few of the notebooks from Google's online Machine Learning course. See the [full course website](https://developers.google.com/machine-learning/crash-course/) for more.
- [Intro to Pandas](/notebooks/mlcc/intro_to_pandas.ipynb)
- [TensorFlow concepts](/notebooks/mlcc/tensorflow_programming_concepts.ipynb)
- [First steps with TensorFlow](/notebooks/mlcc/first_steps_with_tensor_flow.ipynb)
- [Intro to neural nets](/notebooks/mlcc/intro_to_neural_nets.ipynb)
- [Intro to sparse data and embeddings](/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb)

### Using Accelerated Hardware
- [TensorFlow with GPUs](/notebooks/gpu.ipynb)
- [TensorFlow with TPUs](/notebooks/tpu.ipynb)

## Machine Learning Examples: Seedbank

To see end-to-end examples of the interactive machine learning analyses that Colaboratory makes possible, check out the [Seedbank](https://research.google.com/seedbank/) project.

A few featured examples:

- [Neural Style Transfer](https://research.google.com/seedbank/seed/neural_style_transfer_with_tfkeras): Use deep learning to transfer style between images.
- [EZ NSynth](https://research.google.com/seedbank/seed/ez_nsynth): Synthesize audio with WaveNet auto-encoders.
- [Fashion MNIST with Keras and TPUs](https://research.google.com/seedbank/seed/fashion_mnist_with_keras_and_tpus): Classify fashion-related images with deep learning.
- [DeepDream](https://research.google.com/seedbank/seed/deepdream): Produce DeepDream images from your own photos.
- [Convolutional VAE](https://research.google.com/seedbank/seed/convolutional_vae): Create a generative model of handwritten digits.

In [1]:
# -*- coding: utf-8 -*-
"""USAID_support_functions.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1kHiQFmSf-bxsCb03yKufMAOJhsEblhm-

Copyright 2019, Pavel Eftimovski, All rights reserved.
"""

## Section: Data Preprocessing ##

import pandas as pd
import os 
import sys

#SCMS Data:

# generate blocks of columns from same data type, count rows with null values
def sort_by_type(dataset):
    
    new_column_names = ['id', 'p_code',
                        'pq', 'po_so',
                        'asn_dn',
                        'country',
                        'managed_by',
                        'flf_via',
                        'vendor_terms',
                        'ship_mode',
                        'pq_client_date',
                        'po_vendor_date',
                        'sch_del_date',
                        'del_date',
                        'rec_del_date',
                        'product_grp',
                        'sub_class',
                        'vendor',
                        'itm_desc',
                        'mol_test',
                        'brand',
                        'dosage',
                        'dos_form',
                        'unit_msr',
                        'ln_itm_qty',
                        'ln_itm_val',
                        'pack_price',
                        'unit_price',
                        'manu_site',
                        'first_line',
                        'weight',
                        'freight_cost',
                        'line_item']
    
    dataset = pd.read_excel(dataset)
    newcol_dict = dict(zip(dataset.columns, new_column_names))
    dataset.rename(columns=(newcol_dict), inplace =True)
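    # DataFrame.as_blocks() was removed in pandas 1.0; group the columns by dtype instead
    blocks = {str(dtype): dataset.loc[:, dataset.dtypes == dtype]
              for dtype in dataset.dtypes.unique()}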
    
    for key in blocks.keys():
        print("Type: {} , Count: {} \nColumn name and null counts: \n{}\n".format(
            key,len(blocks[key].columns),blocks[key].isnull().sum()))
        
    return blocks


# replace null/nan values
def replace_nan(data, column, fillvalue):
    print("-Before: \n", data.isnull().sum()[data.isnull().sum()>0])
    # fill missing values in `column` with `fillvalue` (e.g. 'TestKit/Ancillary' for dosage)
    data[column] = data[column].fillna(fillvalue)
    # re-check null values
    print("\n-After: \n", data.isnull().sum()[data.isnull().sum()>0])
    return data


# describes the columns with null/nan values
def describe_nulls(data, column):
    nulls = data[data[column].isnull()]
    for col in nulls:
        print('\n-------\n',nulls[col].describe())

        
# output a summary of every column for the rows where `column` is null
def nan_desc(data, column):
    is_null = data[data[column].isnull()]
    for col in is_null:
        print('\n-------\n',is_null[col].describe())

        
# LPI Data: rank countries by score in missing years
def rank_missing(df_countries, missing_col):
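    for row in missing_col:
        # column names are expected to look like '<year>:<metric>_rank' / '<year>:<metric>_score'
        year, suffix = row.split(':')[0], row.split(':')[1]
        col_base = suffix.split('_')[0]
        if '_rank' in row:
            # the highest score receives rank 1
            df_countries[row] = df_countries[year+':'+col_base+'_score'].rank(ascending=False)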

            
# Locate manufacturing site with Google Maps API:
import urllib.parse  # the parse submodule must be imported explicitly
import requests

main_p = '.../USAID_Data/' 
ctry_cont = pd.read_csv(main_p+"country code_to continent.csv") # country location in continents
ctry_cont_dict = {x:y for x,y in zip(ctry_cont['Country Code'],ctry_cont['Continent'])}
    
def locateSite(manu_site, ctry_cont_dict = ctry_cont_dict):
    
    google_api = "https://maps.googleapis.com/maps/api/geocode/json?"
    API_KEY = '...' 
    url = google_api + urllib.parse.urlencode({'address': manu_site}) # set the location's url on the map    
    
    json = requests.get(url, params={"key": API_KEY}).json()# get data in .json format    
    
    f_address = json['results'][0]['formatted_address'] # extract formatted address
    items_bool = [x['types'] == ['country', 'political'] for x in json['results'][0]['address_components']]
    i = items_bool.index(True)
    ctry_code = json['results'][0]['address_components'][i]['short_name'] # get short name of country
    ctry = json['results'][0]['address_components'][i]['long_name'] # get long name of country
    cnt= ctry_cont_dict[ctry_code] #get continent
    return f_address, ctry, cnt


# compare countries from two data frames
def compare_countries(df1,df2,ctry1,ctry2):
    
    fsi_ser, lpi_ser = pd.Series(df1[ctry1].unique(), name='fsi'),pd.Series(df2[ctry2].unique(), name='lpi')
    print("df1 shape:",len(fsi_ser),"df2 shape:",len(lpi_ser))
    
    comparison_df = pd.merge(pd.DataFrame(fsi_ser),pd.DataFrame(lpi_ser), left_on='fsi', right_on='lpi'
             , suffixes=('_fsi', '_lsi'), how='outer')
    
    # output comparison: a null in one column means the country appears only in the other frame
    df2_not_df1 = list(comparison_df[comparison_df.fsi.isnull()].lpi)
    df1_not_df2 = list(comparison_df[comparison_df.lpi.isnull()].fsi)
    print("In df1, not in df2 length: {} \nIn df2, not in df1 length: {}".format(len(df1_not_df2), len(df2_not_df1)))
    print("In df1, not in df2: {} \nIn df2, not in df1: {}".format(df1_not_df2, df2_not_df1))
    return comparison_df


# split df to features
def split_to_features(t_data, base_exam, feat_names):
    features_dict = {}
    temp = 0
    for i in range(len(feat_names)):
        if 'item_desc' in feat_names[i]: 
            features_dict[feat_names[i]] = t_data.iloc[:, :len(base_exam[i].columns)]
        else:
            features_dict[feat_names[i]] = t_data.iloc[:, 
            len(base_exam[i-1].columns)+temp:len(base_exam[i-1].columns)+len(base_exam[i].columns)+temp]
            temp += len(base_exam[i-1].columns)
    return features_dict

## Section: Data Exploration ##

# import model data
def import_sel_data(files_path, feat_names):
    
    file_names = [feat_names[i]+".csv" for i in range(len(feat_names))]
    file_dict = {f: None for f in feat_names}

    for f in range(len(file_names)):
        file_dict[feat_names[f]] = pd.read_csv(files_path + file_names[f])
    
    for file in file_dict.values():
        try:
            file.drop('Unnamed: 0', axis=1, inplace=True) # drop unnamed column
        except KeyError:
            pass # this file has no saved index column
    
    return file_dict


# one-hot-encoder
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class OneHotEncoder(TransformerMixin):
    
    def __init__(self, cat=True):
        self.cat = cat
    
    def fit(self, X, y=None):
        self._validate = True
        return self
    
    def transform(self, X):
        self._validate = True
        return pd.get_dummies(X, drop_first=True)

    
# labeler
class LabelBiner(TransformerMixin):
    
    def __init__(self, lab=True):
        self.lab = lab
    
    def fit(self, y, X=None):
        self._validate = True
        return self
    
    def transform(self, y):
        self._validate = True
        enc = LabelBinarizer()
        return enc.fit_transform(y)

    
# output PCA results
import numpy as np
import matplotlib.pyplot as plt

def describe_pca(df, pca):

    # components
    dims = ['Component {}'.format(i) for i in range(1,len(pca.components_)+1)]
    comps = pd.DataFrame(np.round(pca.components_, 4), columns = df.keys())
    comps.index = dims
    
    # variance
    r = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    var = pd.DataFrame(np.round(r, 4), columns = ['Variance'])
    var.index = dims
    
    # visualize
    fig, ((ax0,ax1),(ax2,ax3)) = plt.subplots(
        nrows=2, ncols=2, figsize=(14,10))
    axes = [ax0,ax1,ax2,ax3]
    
    # plot
    for dim in dims:
        i = dims.index(dim)
        idx = abs(comps.loc[dim,:])>0.1 # select features with > 10% contribution
        comps.loc[dim,:][idx].sort_values(ascending=False).plot(
                                                    kind="barh",
                                                    ax=axes[i],
                                                    title=dim)
        axes[i].set_xlabel("Feature Contribution")

    # return the explained variance alongside the component loadings
    return pd.concat([var, comps], axis = 1)

    
# fill nulls
def fix_nulls(df):
    df_nulls = df.isnull().sum()[df.isnull().sum()>0].index.tolist()
    for c in df_nulls:
        try:
            cmean = df[c].mean()
            df[c] = df[c].fillna(cmean)
        except (TypeError, ValueError):
            pass # skip columns for which a mean cannot be computed

## Section: ML Model Building ##

# categorical data pipeline
from sklearn.pipeline import make_pipeline

def cat_pipeline(s_objects, s_ts_objects):
    s_objects['train'] = 1
    s_ts_objects['train'] = 0
    temp_objects = pd.concat([s_objects, s_ts_objects])

    one_hot_pipeline = make_pipeline(OneHotEncoder())
    temp_objects = one_hot_pipeline.fit_transform(temp_objects)

    # split back into train/test and drop the helper column without modifying a slice in place
    s_objects = temp_objects.loc[temp_objects['train'] == 1].drop(['train'], axis=1)
    s_ts_objects = temp_objects.loc[temp_objects['train'] == 0].drop(['train'], axis=1)
    
    return s_objects, s_ts_objects


# numerical data pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def transform_pipepline(t_objects, s_num_log, si, sc, sv, sm, sb, sml, trending, del_del):

    log_transform = FunctionTransformer(np.log1p)
    log_stand_pipeline = make_pipeline(log_transform, StandardScaler()) 
    stand_pipeline = make_pipeline(StandardScaler()) 
    lab_pipeline = make_pipeline(LabelBiner())

    ti, tc, tv, tm, tb, tml = [pd.DataFrame(log_stand_pipeline.fit_transform(d)
                                            , index=d.index
                                            , columns=d.columns) for d in [si, sc, sv, sm, sb, sml]]
    
    t_num_log = pd.DataFrame(stand_pipeline.fit_transform(s_num_log)
                                             , index=s_num_log.index
                                             , columns=s_num_log.columns)

    tt = pd.DataFrame(stand_pipeline.fit_transform(trending)
                                             , index=trending.index
                                             , columns=trending.columns)
    
    
    delayed_del = lab_pipeline.fit_transform(del_del.delayed_del.map({True:1, False:0}))
    delayed_del = pd.DataFrame(delayed_del)
    delayed_del.rename(columns = {0: 'delayed_del'}, inplace = True)

    t_data = pd.concat([t_objects, t_num_log
                        , ti, tc, tv, tm, tb
                        , tml, tt], axis=1)
    
    print("X shape: ", t_data.shape, "\ny shape:", delayed_del.shape)
    
    return t_data, delayed_del
    
    
# classifier: split to train and test datasets
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

def split_set(X, y, ts, use_smote=False):
  
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=ts,
                                                        random_state=103)
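    # imbalanced-learn >= 0.4: `ratio` became `sampling_strategy`, `fit_sample` became `fit_resample`
    smote = SMOTE(sampling_strategy=1.0, random_state=103)
    
    if use_smote:
        X_train, y_train = smote.fit_resample(X_train, y_train)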
    
    X_train = pd.DataFrame(X_train, columns=X.columns)
    y_train = pd.DataFrame(y_train, columns=y.columns)
    
    print("Shape of X_train: {} X_val: {} y_train: {} y_val: {}".format(X_train.shape,
                                                                        X_test.shape,
                                                                        y_train.shape,
                                                                        y_test.shape))  
    return X_train, X_test, y_train, y_test


# regression: split dataset
def split_set_reg(X, true_pred, datetime, delayed_del):

    # split for training
    X_train_total = X.loc[delayed_del[delayed_del==1].dropna().index.tolist(),:]
    X_train = X_train_total.drop(true_pred.index.tolist(), axis=0)

    y_train_total = datetime.loc[delayed_del[delayed_del==1].dropna().index.tolist(),['delay_t']]
    y_ttemp = y_train_total.drop(true_pred.index.tolist(), axis=0)
    y_train = y_ttemp.delay_t.dt.days

    # split for testing
    X_test = X.loc[true_pred.index.tolist(),:]
    y_test = datetime.loc[true_pred.index.tolist(),['delay_t']]['delay_t'].dt.days
    
    print("Shape of X_train: {} X_val: {} y_train: {} y_val: {}".format(X_train.shape,
                                                                        X_test.shape,
                                                                        y_train.shape,
                                                                        y_test.shape))
    return X_train, X_test, y_train, y_test


# select classifier
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.metrics import f1_score, r2_score, mean_squared_error

def select_classifier(X_train, X_test, y_train, y_test, clf):

    ml = Pipeline([('estimator', clf)])

    y_train = preprocessing.LabelEncoder().fit_transform(y_train.values.ravel())
    y_test = preprocessing.LabelEncoder().fit_transform(y_test.values.ravel())
    
    ml.fit(X_train, y_train) # fit
    pred = ml.predict(X_test) # predict
    
    return (f1_score(y_test, pred)) # return f1 score


# collect true positives from clf
def collect_tr_pred(sel_clf, X_train, X_test, y_train, y_test):

    sel_clf.fit(X_train, y_train) # fit
    y_pred = sel_clf.predict(X_test) # predict
    
    pred_df = y_test.copy() 
    pred_df['prediction']= y_pred

    true_pred_df = pred_df[(pred_df.delayed_del==1) & (pred_df.delayed_del==pred_df.prediction)]
    print("True positives:\n", true_pred_df.sum())
    return true_pred_df


# select regressor
def select_regressor(X_train, X_test, y_train, y_test, reg):

    ml = Pipeline([('estimator', reg)]) 
    
    y_train = y_train.values.ravel()
    y_test = y_test.values.ravel()
    
    ml.fit(X_train, y_train) # fit
    pred = ml.predict(X_test) # predict
    
    r_score = (r2_score(y_test, pred), np.sqrt(mean_squared_error(y_test, pred)))
    return r_score # return r2, rmse


# automate classification and regression
from sklearn.metrics import classification_report, confusion_matrix

def classify_regress(X, y, datetime, clf, reg, test_size=0.2, use_smote=False):
    
    # Classification
    X_train_c, X_test_c, y_train_c, y_test_c = split_set(X, y, test_size, use_smote=use_smote)
    clf_fit = clf.fit(X_train_c, y_train_c)
    
    # feature importance
    try:
        clf_importance = visualize_importance(clf_fit, X_train_c, 30)
    except AttributeError:
        clf_importance = pd.DataFrame([[]])
        print("\nCan't visualize feature importance for:", clf)
        
    # classification report, conf matrix  
    clf_report, conf_mtrx=[rm(y_test_c, clf_fit.predict(X_test_c)) for rm in [classification_report,
                                                                                  confusion_matrix]]
    
    # true predicted 
    predicted = y_test_c.copy()
    predicted['prediction']= clf.predict(X_test_c)
    true_pred = predicted[(predicted.delayed_del==1) & (predicted.delayed_del==predicted.prediction)]
    
    # Regression
    X_train_r, X_test_r, y_train_r, y_test_r = split_set_reg(X, true_pred,
                                                             datetime, y) 
    reg.fit(X_train_r,y_train_r)
    
    # reg scores
    r2 = r2_score(y_test_r, reg.predict(X_test_r))
    rmse = np.sqrt(mean_squared_error(y_test_r, reg.predict(X_test_r)))
    
    return predicted, true_pred, clf_importance, clf_report, conf_mtrx, r2, rmse  


# plot feature importance
def visualize_importance(model, df, xlim):
    
    features = df.columns
    major_features = pd.DataFrame(model.feature_importances_, features)
    major_features.reset_index(inplace=True)
    
    major_features.columns=['feature', 'importance']
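    # keep the 30 most important features, largest first
    top_features = major_features.set_index('feature').sort_values(
                                        'importance', ascending=False)[:30]
    top_features.plot(kind="barh",
                      xlim=(0, xlim),
                      figsize=(10,7),
                      color='tab:blue')
    return top_features # callers store this result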
    
    
# classification report
from yellowbrick.classifier import ClassificationReport

def class_report(X_train, X_test, y_train, y_test, est):    
    ml = Pipeline([('estimator', est)])
    report = ClassificationReport(ml, classes=['on-time', 'delayed']
                                                    , cmap='YlGnBu')
    report.fit(X_train, y_train)
    report.score(X_test, y_test)
    return report.poof()


# cross-validation score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
import seaborn as sns

def cross_val_report(clf, X, y, report_name):
    
    y = y.values.ravel()
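    # request train-fold scores explicitly (recent scikit-learn versions omit them by default);
    # below, train means are stored under 'on-time' and test means under 'delayed'
    scores = cross_validate(clf, X, y, cv=5, scoring=('precision', 'recall', 'f1'),
                            return_train_score=True)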
    
    validation_scores = {'precision': {}, 'recall': {}, 'f1': {}}
    
    for keys, values in scores.items():
        
            if 'precision' in keys:
                if 'train' in keys:
                    validation_scores['precision']['on-time'] = values.mean()
                elif 'test' in keys:
                    validation_scores['precision']['delayed'] = values.mean()
                    
            elif 'recall' in keys:
                if 'train' in keys:
                    validation_scores['recall']['on-time'] = values.mean()
                elif 'test' in keys:
                    validation_scores['recall']['delayed'] = values.mean()
                    
            elif 'f1' in keys:
                if 'train' in keys:
                    validation_scores['f1']['on-time'] = values.mean()
                elif 'test' in keys:
                    validation_scores['f1']['delayed'] = values.mean()        

    # create the figure and a single axes to draw the heatmap on
    fig, ax = plt.subplots(1, 1, figsize=(6,5))
    ax.set_title(report_name)
    return sns.heatmap(pd.DataFrame.from_dict(validation_scores),
                                                       ax = ax,
                                                       annot=True,
                                                       fmt='.3f',
                                                       linewidths=.5,
                                                       cmap="YlGnBu")

FileNotFoundError: ignored
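
The `cat_pipeline` function above tags the training and test frames with a temporary `train` column, one-hot encodes them together, and then splits them apart again. The following is a minimal, self-contained sketch of that idea on toy data. The helper name `encode_together` and the sample values are assumptions made for illustration, and the sketch calls `pd.get_dummies` directly instead of going through the `OneHotEncoder` wrapper class.

```python
import pandas as pd

def encode_together(train_df, test_df):
    """One-hot encode two frames jointly so that they share the same columns."""
    # Tag the rows on copies, so the caller's frames are left unchanged.
    combined = pd.concat([train_df.assign(train=1), test_df.assign(train=0)])
    # Encode all categorical columns in one pass; drop_first avoids a redundant dummy.
    encoded = pd.get_dummies(combined, drop_first=True)
    # Split back by the tag and remove the helper column.
    train_enc = encoded.loc[encoded['train'] == 1].drop(columns='train')
    test_enc = encoded.loc[encoded['train'] == 0].drop(columns='train')
    return train_enc, test_enc

# Toy data: the test frame is missing two of the three categories.
train = pd.DataFrame({'ship_mode': ['Air', 'Truck', 'Ocean']})
test = pd.DataFrame({'ship_mode': ['Air', 'Air']})

train_enc, test_enc = encode_together(train, test)
print(list(train_enc.columns))
print(list(test_enc.columns))
```

Encoding both frames in one pass keeps the dummy columns identical even when a category is absent from one of them, which is presumably why `cat_pipeline` concatenates before encoding.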