***********************************************************
# Final Model Pipeline

By: Aditya Mengani, Ognjen Sosa, Sanjay Elangovan, Song Park, Sophia Skowronski

### This pipeline:
- Runs through a set of baseline and graph features using different iterative set up options
- Performs an end to end pipeline implementation of classifers using RandomizedSearch 
- Displays calculated accuracies and tabulates the final scores
***********************************************************
### Model Context
##### The models are run with the following combinations
- Graph + Baseline
- Baseline only
- Graph + Baseline reduced
- Baseline reduced only
- Graph only
***********************************************************
### Classifiers
- Logistic Regression
- k_Nearest Neighbours
- Bernoulli Naive Bayes
- Decision Trees
- Support Vector Machine(SVM)
- Random Forest Classifier
- XGBoost
***********************************************************
### Degree of Freedom
- 4 and 5
***********************************************************
### Bootstrapping techniques
- RandomizedSearchCV 
- Iterate and calculate average of scores for each setupType and degrees of freedom for each classifie
***********************************************************
### Component Analysis
- PCA 
- Country and Industry Feature sets
***********************************************************
### Model relationships and set up types

**Graph: `p1_tag` ~		`k-core` + `min shortest path` + `shortest paths` + `degrees (in/out)` + `pagerank`**

**Baseline: `p1_tag` ~		`age` + `industry` + `employee count` + `country` + `rank` + `total funding`**

**Baseline-R:	`p1_tag` ~	`age` + `industry` + `employee count` + `country`**

**Graph + Baseline-Reduced: `p1_tag` ~ `Graph Features` + `Baseline Reduced Features`**

**Graph + Baseline: `p1_tag` ~ `Graph Features` + `Baseline  Features`**
***********************************************************
### Input and Output File type and structure
- Baseline (.csv) approx. 1 million observations
    - Baseline File path: files/output/
- Graph generated features (.csv)
    - File path DF 4: files/output/Model_DF_D4/
    - File path DF 5: files/output/Model_DF_D4/
    - Folders 'B', 'G', 'GB', 'GBR', 'BR'
- Output generated file: Generates 1 per `iteration`
    - results_baseline_`iteration`.json
    - File path files/output/

**********************************************************

### MODULE: BASELINE ONLY 

Generates the Baseline only features by reading the datasets for BL features

In [None]:
##################################################################
#####Steps##########
## 1.Input dataframe (original baseline feature set ~ 1 million records)
## 2.Extracted dataframe (UUIDS from files/output/Model_DF_D4 
## or Model_DF_D5)
## 3.Merge the two dataframes to get common list 
##################################################################

def Baseline_Only(df,n_degrees, setup, iteration):
    df = df.copy()
    print("Original DF shape",df.shape)
    
    # Have industry mapper for 'ind_1'...'ind_46' columns
    industries = ['Software', 'Information Technology', 'Internet Services', 'Data and Analytics',
                  'Sales and Marketing', 'Media and Entertainment', 'Commerce and Shopping', 
                  'Financial Services', 'Apps', 'Mobile', 'Science and Engineering', 'Hardware',
                  'Health Care', 'Education', 'Artificial Intelligence', 'Professional Services', 
                  'Design', 'Community and Lifestyle', 'Real Estate', 'Advertising',
                  'Transportation', 'Consumer Electronics', 'Lending and Investments',
                  'Sports', 'Travel and Tourism', 'Food and Beverage',
                  'Content and Publishing', 'Consumer Goods', 'Privacy and Security',
                  'Video', 'Payments', 'Sustainability', 'Events', 'Manufacturing',
                  'Clothing and Apparel', 'Administrative Services', 'Music and Audio',
                  'Messaging and Telecommunications', 'Energy', 'Platforms', 'Gaming',
                  'Government and Military', 'Biotechnology', 'Navigation and Mapping',
                  'Agriculture and Farming', 'Natural Resources']
    industry_map = {industry:'ind_'+str(idx+1) for idx,industry in enumerate(industries)}
    
    # reduce memory
    df_simple = reduce_mem_usage(df)

    print('\nDataframe shape:', df_simple.shape)
    del industries, industry_map
        
    # Extract baseline UUIDS part of Graph Network
    list_Set_Up = ['BL_Only','G_Only','G+BL','G+BL_Red','BL_Red_Only']
    folders = ['B', 'G', 'GB', 'GBR', 'BR']
    save_map = dict(zip(list_Set_Up,folders))

    # read the files based on the input setuptype/iteration/degrees
    if n_degrees == 4:
        df_bl = pd.read_csv('files/output/Model_DF_D4/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D2 shape",df.shape)
    # read the files based on the input setuptype/iteration/degrees   
    elif n_degrees == 5:
        df_bl = pd.read_csv('files/output/Model_DF_D5/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D4 shape",df.shape)

    # merge input dataframe with thre read baseline dataframe      
    df_simple = pd.merge(df_bl.copy(),df_simple.copy(),how='inner',on='uuid') 
    
    return df_simple

### MODULE: BASELINE REDUCED 

Eliminates FEATURES: RANK and total_funding_usd (as there are not much improvements with these feature sets as part of preliminary eda)

In [None]:
##################################################################
#####Steps##########
## 1.Input dataframe (original baseline feature set ~ 1 million records)
## 2.Extracted dataframe (UUIDS from files/output/Model_DF_D4 
## or Model_DF_D5)
## 3. Eliminate the features not required (Rank and total_funding_usd) 
## 4.Merge the two dataframes to get common list 
##################################################################

def Baseline_Reduced(df,n_degrees, setup, iteration):
    df = df.copy()
    print("Original DF shape",df.shape)
    
    # Have industry mapper for 'ind_1'...'ind_46' columns
    industries = ['Software', 'Information Technology', 'Internet Services', 'Data and Analytics',
                  'Sales and Marketing', 'Media and Entertainment', 'Commerce and Shopping', 
                  'Financial Services', 'Apps', 'Mobile', 'Science and Engineering', 'Hardware',
                  'Health Care', 'Education', 'Artificial Intelligence', 'Professional Services', 
                  'Design', 'Community and Lifestyle', 'Real Estate', 'Advertising',
                  'Transportation', 'Consumer Electronics', 'Lending and Investments',
                  'Sports', 'Travel and Tourism', 'Food and Beverage',
                  'Content and Publishing', 'Consumer Goods', 'Privacy and Security',
                  'Video', 'Payments', 'Sustainability', 'Events', 'Manufacturing',
                  'Clothing and Apparel', 'Administrative Services', 'Music and Audio',
                  'Messaging and Telecommunications', 'Energy', 'Platforms', 'Gaming',
                  'Government and Military', 'Biotechnology', 'Navigation and Mapping',
                  'Agriculture and Farming', 'Natural Resources']
    industry_map = {industry:'ind_'+str(idx+1) for idx,industry in enumerate(industries)}
    

    # Reduced baseline doesnt have these two columns
    df_simple = df.drop(['rank','total_funding_usd'], axis=1)
    df_simple = reduce_mem_usage(df_simple)
    print('\nDataframe shape:', df_simple.shape)
    
    # Extract baseline UUIDS part of Graph Network
    list_Set_Up = ['BL_Only','G_Only','G+BL','G+BL_Red','BL_Red_Only']
    folders = ['B', 'G', 'GB', 'GBR', 'BR']
    save_map = dict(zip(list_Set_Up,folders))

    # read the files based on the input setuptype/iteration/degrees
    if n_degrees == 4:
        df_bl = pd.read_csv('files/output/Model_DF_D4/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D2 shape",df.shape)
    # read the files based on the input setuptype/iteration/degrees
    elif n_degrees == 5:
        df_bl = pd.read_csv('files/output/Model_DF_D5/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D4 shape",df.shape)
    
    # merge input dataframe with thre read baseline dataframe 
    df_simple = pd.merge(df_bl.copy(),df_simple.copy(),how='inner',on='uuid')   
    
    del industries, industry_map
    return df_simple

### MODULE : GRAPH ONLY

Generates the Graph only features from extracting the data from graph features to dataframe

In [None]:
##################################################################
#####Steps##########
## 1. Input the baseline dataframe
## 2.Extract graph dataframe (From files/output/Model_DF_D4 
## or Model_DF_D5 related to the graph generated features) 
## 3.Impute graph features fetching infinite values
## 4.Merge the two dataframes to get common list 
##################################################################
def Graph_Only_SS(df,n_degrees, setup, iteration):
    df = df.copy()

    #select uuid and p1_tag from baseline dataframe
    df = df[['uuid','p1_tag']]
    print("Original DF shape",df.shape)

    # read the files based on the input df setuptype/iteration/degress
    list_Set_Up = ['BL_Only','G_Only','G+BL','G+BL_Red','BL_Red_Only']
    folders = ['B', 'G', 'GB', 'GBR', 'BR']
    save_map = dict(zip(list_Set_Up,folders))
    
    # read the files based on the input setuptype/iteration/degrees
    if n_degrees == 4:
        df_gr = pd.read_csv('files/output/Model_DF_D4/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D2 shape",df.shape)
    # read the files based on the input setuptype/iteration/degrees
    elif n_degrees == 5:
        df_gr = pd.read_csv('files/output/Model_DF_D5/{}/{}.csv'.format(save_map[setup], iteration),sep=',')
        print("Original Model_DF_D4 shape",df.shape)
    
    # merge input dataframe with thre read baseline dataframe 
    df_gr = pd.merge(df_gr.copy(),df.copy(),how='inner',on='uuid')
    print("Original DF_GR shape after merge",df_gr.shape)

    # reduce memory
    df_gr = reduce_mem_usage(df_gr) 

    # impute infinite spath values with 1000
    df_gr['w_spath_top_3_0'][df_gr['w_spath_top_3_0']==1e30] = 1000
    df_gr['w_spath_top_3_1'][df_gr['w_spath_top_3_1']==1e30] = 1000
    df_gr['w_spath_top_3_3'][df_gr['w_spath_top_3_3']==1e30] = 1000
    df_gr['w_spath_top_3_4'][df_gr['w_spath_top_3_4']==1e30] = 1000
    df_gr['w_spath_top_min_3'][df_gr['w_spath_top_min_3']==1e30] = 1000
    
    # impute any na values
    df_gr = df_gr.fillna(0)
       
    del df
    return df_gr

### MODULE: GENERATE TRAIN TEST SPLIT

Generates train test split for the input dataframe

In [None]:
##################################################################
#####Steps##########
## 1.Input the dataframe to be split
## 2.Randomly sample the data to pick equal count of non-P1 companies
## 3.Split the data into train/test 80:20
##################################################################
## Select equal sample of non-Pledge 1% organizations
def gen_Train_Test_Split(df_simple):

    # get all the p1 companies
    df_p1 = df_simple[df_simple['p1_tag']==1]
    print(df_p1.shape)

    # sample randomly all the non-p1 companies for equal sample size of p1
    df_notp1 = df_simple[df_simple['p1_tag']==0].sample(n=df_p1.shape[0], replace=True)

    # concat p1 and non-p1 companies
    df_model = pd.concat([df_p1, df_notp1]).reset_index(drop=True)

    # reduce memory
    df_model = reduce_mem_usage(df_model)

    # Create variable for each feature type: categorical and numerical
    numeric_features = df_model.select_dtypes(include=['uint8','int8', 'int16', 'int32', 'int64', 'float16', 'float32','float64']).drop(['p1_tag'], axis=1).columns
    categorical_features = df_model.select_dtypes(include=['object']).columns
    
    # Select all labels except the output
    X = df_model.drop('p1_tag', axis=1)

    # select precdictor label
    y = df_model['p1_tag']
    y = preprocessing.LabelEncoder().fit_transform(y)

    # create a train/test split for 80/20 in the sample data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=99)
    print('Training data shape:', X_train.shape)
    print('Train label shape:', y_train.shape)
    print('Test data shape:',  X_test.shape)
    print('Test label shape:', y_test.shape)

    # reset indexes for train and test
    X_train= X_train.reset_index(drop=True)
    X_test= X_test.reset_index(drop=True)
    return X_train,X_test,X,y,y_train,y_test,numeric_features,categorical_features

### MODULE : PERFORM PCA COUNTRY

Perform PCA on country code feature

In [None]:
##################################################################
#####Steps##########
## 1.Input the train and test datasets
## 2.Perform PCA analysis on country attributes
## 3.Plot a graph for Fraction of total variance vs. number of principal components
## 4.Run PCA, transform the data into reduced components
##################################################################

# Perform PCA of country dataset
def PCA_Country(X_train,X_test):

    # Perform PCA of country dataset
    # read all the attributes that belongs to the country features
    country_train = X_train.filter(regex='^country',axis=1).fillna(0)
    country_test = X_test.filter(regex='^country',axis=1).fillna(0)
#     # For each value of k, use PCA to project the data feature sets to k principle components
#     matrix = [['k', 'total variance']] # For display
#     k_values = list(range(1,113)) # To loop through, there are 112 country codes
#     # For each value of k, use PCA to project the data feature sets to k principle components
#     for k in k_values:
#         pca = PCA(n_components=k, whiten=True,random_state=random.seed(1234))
#         pca.fit(country_train)
#         matrix.append([k, round(pca.explained_variance_ratio_.sum(),4)])
#     # Print results
#     print('Fraction of the total variance in the training data explained by the first k principal components:\n')
#     s = [[str(e) for e in row] for row in matrix]
#     lens = [max(map(len, col)) for col in zip(*s)]
#     fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
#     table = [fmt.format(*row) for row in s]
#     print('\n'.join(table))
#     print()
#     # Plots
#     _, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,5))
#     # Plotting lineplot of fraction of total variance vs. number of principal components
#     # For all possible numbers of principal components
#     ax.plot(np.cumsum(PCA().fit(country_train).explained_variance_ratio_))
#     # Labels
#     ax.set_title('Fraction of total variance vs. number of principal components')
#     ax.set_xlabel('k = number of components')
#     ax.set_ylabel('Cumulative explained variance')
#     # Display
#     plt.show()
    
    # create PCA features for train and test set
    #print("country train",list(country_train.columns))

    # The above commented plot identifies PCA n_components as the best fit of variance
    n_components = 15

    # Run pca and create new features based on the n_components = 15
    pca = PCA(n_components=n_components,whiten=True,random_state=random.seed(1234))  
    
    # create a train pca dataset
    pca_train = pca.fit_transform(country_train)
    # create a test pca dataset
    pca_test = pca.transform(country_test)
    
    # create dataframes from numpy pca features
    df_cty_train = pd.DataFrame(pca_train,columns=['cntry_pca_'+ str(x) for x in range(n_components)])
    df_cty_test = pd.DataFrame(pca_test,columns=['cntry_pca_'+ str(x) for x in range(n_components)])
    
    # drop country prefix columns
    X_train = X_train.drop(list(X_train.filter(regex='^country_',axis=1).columns), axis=1)
    X_test = X_test.drop(list(X_test.filter(regex='^country_',axis=1).columns), axis=1)
    
    # concat with train dataset
    X_train = pd.concat([X_train, df_cty_train],axis = 1)
    X_test = pd.concat([X_test, df_cty_test],axis = 1)
    
    # delete dataframes
    del df_cty_train,df_cty_test,country_train,country_test
    return X_train,X_test


### MODULE : PERFORM PCA INDUSTRY 

Perform PCA on Industry feature

In [None]:
##################################################################
#####Steps##########
## 1.Input the train and test datasets
## 2.Perform PCA analysis on industry attributes
## 3.Plot a graph for Fraction of total variance vs. number of principal components
## 4.Run PCA, transform the data into reduced components
##################################################################

# Perform PCA of country dataset
def PCA_Industry(X_train,X_test):
    
    # Perform PCA of industry dataset
    # Perform PCA of country dataset
    # read all the attributes that belongs to the country features
    industry_train = X_train.filter(regex='^ind_',axis=1).fillna(0)
    industry_test = X_test.filter(regex='^ind_',axis=1).fillna(0)
#     # For each value of k, use PCA to project the data feature sets to k principle components
#     matrix = [['k', 'total variance']] # For display
#     k_values = list(range(1,47)) # To loop through, there are 46 industries
#     # For each value of k, use PCA to project the data feature sets to k principle components
#     for k in k_values:
#         pca = PCA(n_components=k, whiten=True,random_state=random.seed(1234))
#         pca.fit(industry_train)
#         matrix.append([k, round(pca.explained_variance_ratio_.sum(),4)])
#     # Print results
#     print('Fraction of the total variance in the training data explained by the first k principal components:\n')
#     s = [[str(e) for e in row] for row in matrix]
#     lens = [max(map(len, col)) for col in zip(*s)]
#     fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
#     table = [fmt.format(*row) for row in s]
#     print('\n'.join(table))
#     print()
#     # Plots
#     _, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,5))
#     # Plotting lineplot of fraction of total variance vs. number of principal components
#     # For all possible numbers of principal components
#     ax.plot(np.cumsum(PCA().fit(industry_train).explained_variance_ratio_))
#     # Labels
#     ax.set_title('Fraction of total variance vs. number of principal components')
#     ax.set_xlabel('k = number of components')
#     ax.set_ylabel('Cumulative explained variance')
#     # Display
#     plt.show()
    
    # create PCA features for train and test set

    # The above commented plot identifies PCA n_components as the best fit of variance
    n_components=10
    
    # Run pca and create new features based on the n_components = 15
    pca = PCA(n_components=n_components, whiten=True, random_state=random.seed(1234)) 
    
    # create a train pca dataset
    pca_train = pca.fit_transform(industry_train)
    
    # create a test pca dataset
    pca_test = pca.transform(industry_test)
    
    # create dataframes from numpy pca features
    df_ind_train = pd.DataFrame(pca_train,columns=['ind_pca'+ str(x) for x in range(n_components)])
    df_ind_test = pd.DataFrame(pca_test,columns=['ind_pca'+ str(x) for x in range(n_components)])
    
    # drop country prefix columns
    X_train = X_train.drop(list(X_train.filter(regex='^ind_',axis=1).columns), axis=1)
    X_test = X_test.drop(list(X_test.filter(regex='^ind_',axis=1).columns), axis=1)
    
    # concat with train dataset
    X_train = pd.concat([X_train, df_ind_train],axis = 1)
    X_test = pd.concat([X_test, df_ind_test],axis = 1)
    
    # delete dataframes
    del df_ind_train,df_ind_test,industry_train,industry_test

    return X_train,X_test


### MODULE: VIZUALIZE COUNTRY & INDUSTRY PCA

Visualize Country & Industry PCA spread

In [None]:
##################################################################
#####Steps##########
## 1.Input the data and labels
## 2.Plot a graph for PCA distribution of Industry and Country
##################################################################

# create graphs for PCA analysis for country and industry features
def Visualize_Country_Ind_PCA(X,y):
    print("None")
#     Country_df = X.filter(regex='^country',axis=1).fillna(0)
#     pca_new_Country = PCA(n_components=10,random_state=random.seed(1234))  
#     Country_df_PCA = pca_new_Country.fit_transform(Country_df)

#     Industry_df = X.filter(regex='^ind_',axis=1).fillna(0)
#     pca_new_Industry_df = PCA(n_components=30,random_state=random.seed(1234))  
#     Industry_df_PCA = pca_new_Industry_df.fit_transform(Industry_df)

#     # The PCA model
#     fig, axes = plt.subplots(1,2,figsize=(15,15))
#     colors = ['r','g']
#     fig.suptitle('PCA Analysis for Country and Industry', fontsize=30)
#     targets = [1,0]
#     for target, color in zip(targets,colors):
#       indexes = np.where(y == target)
#       axes[0].scatter(Country_df_PCA[indexes][:,0], Country_df_PCA[indexes][:,1],color=color)
#       axes[0].set_xlabel('PC1')
#       axes[0].set_ylabel('PC2')
#       axes[0].set_title('PCA-Country')
#       axes[1].scatter(Industry_df_PCA[indexes][:,0], Industry_df_PCA[indexes][:,1], color=color)
#       axes[1].set_xlabel('PC1')
#       axes[1].set_ylabel('PC2')
#       axes[1].set_title('PCA-Industry')
#     plt.axis('tight')

#     out_labels = ['p1','non-p1']
#     plt.legend(out_labels,prop={'size':10},loc='upper right',title='Legend of plot')

#     plt.show()

### MODULE: RUN CLASSIFIER

 Uncomment the classifier that you need to run and comment the ones that you are not running

In [1]:
##################################################################
#####Steps##########
## 1.Input the feature sets, train, test data and 
##    labels,categorical and numerical features
## 2.Define all the models to be evaluated
## 3.Create a pipeline
## 4. In pipeline perform:
            # - Encoding
            # - Scaling
            # - Simple Imputing
            # - GridSearch/RandomizedSearch
            # - Fit train data/Predict test data
            # - Evaluate accuracy score
            # - Append the scores to results
##################################################################

def Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,Type):
    # create a dict of results and append degrees and set up type
    results = OrderedDict()
    results['n_deg'] = n_deg
    results['Model_Type'] = Type
    #results['Column_Name'] = col_graph

    # create classifier list
    classifier_list = []

    # define classification models
    LRR = LogisticRegression(max_iter=10000, tol=0.1)
    KNN = KNeighborsClassifier(n_neighbors=5)
#     BNB = BernoulliNB()
#     GNB = GaussianNB()
#     SVM = svm.SVC()
#     DCT = DecisionTreeClassifier()
#     XGB = xgb.XGBRegressor() #tree_method='gpu_hist', gpu_id=0
#     RMF = RandomForestClassifier()

    # add classifier parmeters to classifier list
    classifier_list.append(('LRR', LRR, {'classifier__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000],\
                                        'classifier__random_state': [random.seed(1234)]}))
    classifier_list.append(('KNN', KNN, {}))
#     classifier_list.append(('BNB', BNB, {'classifier__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]}))
#     classifier_list.append(('GNB', GNB, {'classifier__var_smoothing': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]}))
#     classifier_list.append(('DCT', DCT, {'classifier__max_depth':np.arange(1, 21),
#                                         'classifier__min_samples_leaf':[1, 5, 10, 20, 50, 100],
#                                         'classifier__random_state' : [random.seed(1234)]}))
#     classifier_list.append(('XGB', XGB, {'classifier__random_state' : [random.seed(1234)]}))
#     classifier_list.append(('RMF', RMF, {'classifier__random_state' : [random.seed(1234)]}))
#     classifier_list.append(('SVM', SVM, {'classifier__random_state' : [random.seed(1234)]}))

    # define encoder lsit
    encoder_list = [ce.one_hot.OneHotEncoder]

    # define scaling list
    scaler_list = [StandardScaler()]

    # for each lablel/classifier/parmeter in classifier list
    # iterate over gridsearch and run the classifers/encoders/scalers
    # store the result in the dictionary
    for label, classifier, params in classifier_list:
        results[label] = {}
        #for each encoder
        for encoder in encoder_list:
            # for each feature scaler
            for feature_scaler in scaler_list:
                # for each results label 
                results[label][f'{encoder.__name__} with {feature_scaler}'] = {}
                print('{} with {} and {}'.format(label,encoder.__name__,feature_scaler))
                # define numerical transformer using standard scaler
                numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),('scaler', StandardScaler())])
                # define categorical imputer steps
                categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                                          ('woe', encoder())])
                # define preprocessor for column transformations
                preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                               ('cat', categorical_transformer, categorical_features)])
                # run pipleine for preprocessor/scaler/classifier
                pipe = Pipeline(steps=[#('preprocessor', preprocessor),
                                       ('scaler', feature_scaler),
                                       ('classifier', classifier)])
                # if params are available
                if params != {}:
                    # perform randomizedsearchCV
                    search = RandomizedSearchCV(pipe, params, n_jobs=-1)
                    # fit the randomizedSearchCV object with train features and labels
                    search.fit(X_train, y_train)
                    # display the best parameter
                    print('Best parameter (CV score={:.3f}): {}'.format(search.best_score_, search.best_params_))
                    # create the model from the best fit values of train features and label
                    model = search.fit(X_train, y_train)
                    # predict the labels for test features
                    y_pred = model.predict(X_test)

                    # if the classifier is xgboost round the values to convert to binary
                    # standard process to convert it for logistic regression
                    if label == 'XGB':
                        y_pred = [round(value) for value in y_pred]

                    # calculate the score and display best score and populate results
                    score = f1_score(y_test, y_pred,average='weighted')
                    print('Best score: {:.4f}\n'.format(score))
                    results[label][f'{encoder.__name__} with {feature_scaler}']['score'] = score
                    try:
                        results[label][f'{encoder.__name__} with {feature_scaler}']['best_params'] = search.best_params_
                    except:
                        print('Something went wrong w/ GridSearch or pipeline fitting.')
                else:
                    # if the does not have any parameters 
                    try:
                        # fit the train labels and features to the pipeline
                        model = pipe.fit(X_train, y_train)
                        # predict the model with the test features
                        y_pred = model.predict(X_test)
                        # if the classifier is xgboost round the values to convert to binary
                        # standard process to convert it for logistic regression
                        if label == 'XGB':
                            y_pred = [round(value) for value in y_pred]
                        
                        # calculate the score and display best score and populate results
                        score = f1_score(y_test, y_pred,average='weighted')
                        print('Score: {:.4f}\n'.format(score))
                        results[label][f'{encoder.__name__} with {feature_scaler}']['score'] = score
                    except:
                        print('Something went wrong with pipeline fitting')
    return results    

### MODULE: WRITE OUTPUT

Generate output json file for the model results

In [None]:
##################################################################
#####Steps##########
## 1.Input the result list generated
## 2.Run this through json encoder to convert the file to json
## 3.Save the results into a json file
##################################################################
def Write_Output(out_list,iteration):
    # encode to encode int/float and array types and write the output json
    class NpEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.integer):
                return int(obj)
            elif isinstance(obj, np.floating):
                return float(obj)
            elif isinstance(obj, np.ndarray):
                return obj.tolist()
            else:
                return super(NpEncoder, self).default(obj)

    # File is saved under Files directory. /content would be the baseline folder
    # You can click on folder icon on left side of the directory structure to
    # see the created file
    
    with open(f'files/output/results_baseline_ITER_{iteration}.json', 'w') as fp:
        json.dump(out_list, fp, sort_keys=False, indent=4, cls=NpEncoder)

### MODULE : REDUCE MEMORY USAGE

Module to reduce memory usage for dataframe

In [None]:
##################################################################
#####Steps##########
## 1.Input dataframe from original module
## 2.Identify min and max ranges of sizes for each column
## 3.Apply typecasting to reduce the memory usage 
##################################################################
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100*(start_mem-end_mem)/start_mem))
    return df

### MODULE: CALCULATE AVERAGE SCORE

Calculate the average score of all the generated files for a particular model type/model set up combination

In [None]:
##################################################################
#####Steps##########
## 1.Input the generated results file
## 2.Iterate it over each model/setup/degree type
## 3.Caclulate the average and display it
##################################################################

# Open the generated results file
def calculate_avg(iterations):
    # Enter the model names #'KNN':0,'BNB':0,'GNB':0,'DCT':0,'XGB':0,'RMF':0,'SVM':0
    test_accuracies = {'LRR':0}
    data_cnt = {'LRR':0}
    
    for i in range(iterations):
         with open(f'files/output/results_baseline_ITER_{i}.json') as g:
                #json.dump(out_list, fp, sort_keys=False, indent=4, cls=NpEncoder)
                data = json.load(g)
                for i in list(test_accuracies.keys()):
                    for j in data:
                        test_accuracies[i] = test_accuracies[i] + (j['result'][0][i]['OneHotEncoder with StandardScaler()']['score'])
                        data_cnt[i] = data_cnt[i] + len(data)
    
    for i in test_accuracies:
        test_accuracies[i] = round(test_accuracies[i]/data_cnt[i],2)

  

    print("\nAveraged accuracies: ",test_accuracies)


### MODULE: GENERATE TABLE

This module generates a dataframe for model/set up/degree combination and displays it as a table

In [None]:
##################################################################
#####Steps##########
## 1.Input the generated results files and read them to a dataframe
## 2.Iterate it over each model/setup/degree type
## 3.Caclulate the average and display it as a table
##################################################################

#List of json files

def generate_table():
    file_list = ['results_baseline_ITER_0.json','results_baseline_ITER_1.json','results_baseline_ITER_2.json','results_baseline_ITER_3.json',
                 'results_baseline_ITER_4.json','results_baseline_ITER_5.json','results_baseline_ITER_6.json','results_baseline_ITER_7.json',
                 'results_baseline_ITER_8.json','results_baseline_ITER_9.json']

    #Function for flattening nested json
    def flatten_json(nested_json, exclude=['']):
        out = {}
        def flatten(x, name='', exclude=exclude):
            if type(x) is dict:
                for a in x:
                    if a not in exclude: flatten(x[a], name + a + '_')
            elif type(x) is list:
                i = 0
                for a in x:
                    flatten(a, name + str(i) + '_')
                    i += 1
            else:
                out[name[:-1]] = x
        flatten(nested_json)
        return out
    
    #Loop through list of json files, convert to df and append
    df = pd.DataFrame([])
    for file in file_list:
        with open(f'files/output/{file}') as train_file:
            dict_train = json.load(train_file)
        df = df.append(pd.DataFrame([flatten_json(x) for x in dict_train[0]['result']]))

    df = df.groupby(['n_deg','Model_Type']).mean().round(3).reset_index()
    display(HTML(df.to_html()))

### MODULE: MAIN 
- Run 10 iterations,for each set up and each degree type
- Capture the results in json
- Calculate the average across all scores

In [None]:
##################################################################
#####Steps##########
## 1.Define setup Types/ Degree of freedom/ Iterations
## 2.Iterate over each combination
## 3.For each combination of degree and set up, call the 
## modules defined earlier
## 4.Save results and generate a table of accuracies 
##################################################################

## import modules
'''Data analysis'''
import numpy as np
import pandas as pd
import csv
import warnings
import json
import os
import time
import math
import random
from IPython.display import display, HTML
#import itertoolss
import statistics
from collections import OrderedDict 
from datetime import datetime
warnings.filterwarnings('ignore')
'''Plotting'''
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
'''Stat'''
import statsmodels.api as sm
from scipy.stats import chi2_contingency
'''ML'''
import prince
import category_encoders as ce
from sklearn import metrics, svm, preprocessing, utils
from sklearn.metrics import mean_squared_error, r2_score, f1_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV,RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model  import LogisticRegression
from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn import metrics, svm
from sklearn.tree import DecisionTreeClassifier,export_graphviz
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler,StandardScaler,\
MaxAbsScaler,RobustScaler,QuantileTransformer,PowerTransformer
from libsvm.svmutil import *

# define set up types
list_Set_Up = ['BL_Only','G_Only','G+BL','G+BL_Red','BL_Red_Only']
# define degrees
degrees = [4,5]

# Defining main function 
def main():
    final_out = []
    df = pd.read_csv('files/output/baseline.csv',sep=';')
    
    # set the total iterations needed to 10
    total_iterations = 10
    
    # for each iteration
    for iteration in range(total_iterations):
        out_dict = {}
        out_dict['iteration'] = iteration
        out_list = []
        
        # for each degree type
        for n_deg in degrees:
            # for each set up type
            for setup_Type in list_Set_Up:
                print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:SETUP:{setup_Type} BEGIN...")
                    
                #********* BASE LINE ONLY **********************************************************
                if setup_Type == 'BL_Only':
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:BASELINE_ONLY SET UP:START...")
                    # call the Baseline_Only module
                    df_bo = Baseline_Only(df,n_deg, setup_Type, iteration)
                    # drop the uuid
                    df_bo = df_bo.drop(['uuid'],axis=1)
                    df_simple = df_bo

                    # pass the df to generate train/test split
                    X_train,X_test,X,y,y_train,y_test,numeric_features,\
                    categorical_features = gen_Train_Test_Split(df_simple)

                    # perform PCA and return PCA applied feature sets
                    X_train,X_test = PCA_Industry(X_train,X_test)
                    X_train,X_test = PCA_Country(X_train,X_test)
                    #Visualize_Country_Ind_PCA(X,y)
                    
                    # Display the final train/test features shape
                    print("Final train dataset shape",X_train.shape)
                    print("\nFinal test dataset shape",X_test.shape)                             
                    
                    # Run the classifer pipeline and get the results
                    results = Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,setup_Type)
                    
                    # append results to the output list
                    out_list.append(results)
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:BASELINE_ONLY SET UP:END")
                
                #********* GRAPH ONLY ********************************************************
                elif setup_Type == 'G_Only':
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:GRAPH_ONLY SET UP:START...")
                    
                    # call the Graph_Only_SS module
                    df_gr = Graph_Only_SS(df,n_deg, setup_Type, iteration)
                    # drop the uuid
                    df_gr = df_gr.drop(['uuid'],axis=1)
                    df_simple = df_gr
                    
                    # pass the df to generate train/test split
                    X_train,X_test,X,y,y_train,y_test,numeric_features,\
                    categorical_features = gen_Train_Test_Split(df_simple)
                    
                    # Display the final train/test features shape
                    print("Final train dataset shape",X_train.shape)
                    print("\nFinal test dataset shape",X_test.shape)
                    
                    #print('\nTest Dataframe Columns:\n\n{}'.format(X_test.columns.to_list()))
                    # Run the classifer pipeline and get the results
                    results = Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,setup_Type)
                    
                    # append results to the output list
                    out_list.append(results)
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:GRAPH_ONLY SET UP:END")
                
                #********* GRAPH + BASELINE ONLY ***********************************************
                elif setup_Type == 'G+BL':   
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:GRAPH+BASELINE:START...") 
                    
                    # call the Graph_Only_SS module
                    df_gr = Graph_Only_SS(df,n_deg, setup_Type, iteration)
                    print("Graph shape after merge",df_gr.shape)

                    # call the Baseline_Only module
                    df_bo = Baseline_Only(df,n_deg,'BL_Only', iteration)

                    # merge the graphOnly feaures with baseline only features
                    df_simple = pd.merge(df_gr.copy(),df_bo.copy(), how = 'inner',on='uuid')

                    # drop unwanted columns
                    df_simple = df_simple.drop(['uuid','p1_tag_y'],axis=1)
                    print("Merged shape after baseline and graph",df_simple.shape)
                    # rename columns after merge
                    df_simple = df_simple.rename(columns={"p1_tag_x": "p1_tag"})
                    
                    # pass the df to generate train/test split
                    X_train,X_test,X,y,y_train,y_test,numeric_features,\
                    categorical_features = gen_Train_Test_Split(df_simple)
                    
                    # Display the final train/test features shape
                    print("Before pca dataset shape",X_train.shape)
                    print("\nBefore pca dataset shape",X_test.shape)
                    
                    # perform PCA and return PCA applied feature sets
                    X_train,X_test = PCA_Industry(X_train,X_test)
                    X_train,X_test = PCA_Country(X_train,X_test)
                    #Visualize_Country_Ind_PCA(X)
                    
                    # Display the final train/test features shape
                    #print("Train set columns list",X_train.columns)
                    print("Final train dataset shape",X_train.shape)
                    print("\nFinal test dataset shape",X_test.shape)
                    print('\nTrain Dataframe Columns:\n\n{}'.format(X_train.columns.to_list()))
                    #print('\nTest Dataframe Columns:\n\n{}'.format(X_test.columns.to_list()))
                    
                    #check for nan and infinite columns if any
                    nan_values = X_train.isna()
                    nan_columns = nan_values.any()
                    columns_with_nan = X_train.columns[nan_columns].tolist()
                    if columns_with_nan != []:
                        print("columns_with_nan ",columns_with_nan)
                    print("Infinite columns train",(X_train.columns.to_series()[np.isinf(X_train).any()]))
                    print("Infinite columns test",(X_test.columns.to_series()[np.isinf(X_test).any()]))
                    
                    # Run the classifer pipeline and get the results
                    results = Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,setup_Type)
                    
                    # Append results to output list
                    out_list.append(results)
                    print(f"\nITERATION:{iteration}DEGREE:{n_deg}:GRAPH+BASELINE SET UP:END")
                
                #********* GRAPH + BASELINE REDUCED ONLY *********************************************
                elif setup_Type == 'G+BL_Red':
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:GRAPH+BASELINE_REDUCED:START...") 
                    
                    # call the Graph_Only_SS module
                    df_gr = Graph_Only_SS(df,n_deg, setup_Type, iteration)
                    print("Graph shape after merge",df_gr.shape)
                    
                    # call the Baseline_Reduced module
                    df_bo = Baseline_Reduced(df,n_deg, 'BL_Red_Only', iteration)
                    
                    # merge the graphOnly feaures with baseline only features
                    df_simple = pd.merge(df_gr.copy(),df_bo.copy(), how = 'inner',on='uuid')
                    
                    # drop unwanted columns
                    df_simple = df_simple.drop(['uuid','p1_tag_y'],axis=1)
                    print("Merged shape after baseline and graph",df_simple.shape)
                    #print(list(df_simple.columns))
                    
                    # rename columns after merge
                    df_simple = df_simple.rename(columns={"p1_tag_x": "p1_tag"})
                    
                    # pass the df to generate train/test split
                    X_train,X_test,X,y,y_train,y_test,numeric_features,\
                    categorical_features = gen_Train_Test_Split(df_simple)
                    print("Before pca dataset shape",X_train.shape)
                    print("\nBefore pca dataset shape",X_test.shape)
                    
                    # perform PCA and return PCA applied feature sets
                    X_train,X_test = PCA_Industry(X_train,X_test)
                    X_train,X_test = PCA_Country(X_train,X_test)
                    #Visualize_Country_Ind_PCA(X)
                    print("Final train dataset shape",X_train.shape)
                    print("\nFinal test dataset shape",X_test.shape)
                    print('\nTrain Dataframe Columns:\n\n{}'.format(X_train.columns.to_list()))
                    #print('\nTest Dataframe Columns:\n\n{}'.format(X_test.columns.to_list()))
                    
                    #check for nan and infinite columns if any
                    nan_values = X_train.isna()
                    nan_columns = nan_values.any()
                    columns_with_nan = X_train.columns[nan_columns].tolist()
                    if columns_with_nan != []:
                        print("columns_with_nan ",columns_with_nan)
                    print("Infinite columns train",(X_train.columns.to_series()[np.isinf(X_train).any()]))
                    print("Infinite columns test",(X_test.columns.to_series()[np.isinf(X_test).any()]))
                    
                    # Run the classifer pipeline and get the results
                    results = Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,setup_Type)
                    
                    # Append results to output list
                    out_list.append(results)
                    print(f"\nDEGREE:{n_deg}:GRAPH+BASELINE_REDUCED SET UP:END")               
                
                #********* BASELINE REDUCED ONLY *****************************************
                elif setup_Type == 'BL_Red_Only':
                    
                    print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:BASELINE_REDUCED_ONLY SET UP:START...")
                    
                    # call the Baseline_Reduced module
                    df_bo = Baseline_Reduced(df,n_deg, setup_Type, iteration)
                    
                    # drop unwanted columns
                    df_bo = df_bo.drop(['uuid'],axis=1)
                    df_simple = df_bo
                    
                    # pass the df to generate train/test split
                    X_train,X_test,X,y,y_train,y_test,numeric_features,\
                    categorical_features = gen_Train_Test_Split(df_simple)
                    
                    # perform PCA and return PCA applied feature sets
                    X_train,X_test = PCA_Industry(X_train,X_test)
                    X_train,X_test = PCA_Country(X_train,X_test)
                    #Visualize_Country_Ind_PCA(X,y)
                    
                    print("Final train dataset shape",X_train.shape)
                    print("\nFinal test dataset shape",X_test.shape)                             
                    
                    # Run the classifer pipeline and get the results
                    results = Run_Classifier(X_train,X_test,y_train,y_test,numeric_features,categorical_features,n_deg,setup_Type)
                    
                    # Append results to output list
                    out_list.append(results)
                    print(f"\nDEGREE:{n_deg}:BASELINE_REDUCED_ONLY SET UP:END")
        
        # append out_list values in an iteration to out_dict 
        out_dict['result'] = out_list
        
        # append out_dict to final_out
        final_out.append(out_dict)
        
        # call Write_Output generate a json structure and store to a file
        # for each iteration created
        Write_Output(final_out,iteration)
        print(f"\nITERATION:{iteration}:DEGREE:{n_deg}:END")
    #calculate_avg(total_iterations)
    
    #call the module generate_table to generate a output result df
    # and display it for all the classifer
    generate_table()
    print("Completed all runs!....")

if __name__ == "__main__":
    # execute only if run as a script
    main()


ITERATION:0:DEGREE:4:SETUP:BL_Only BEGIN...

ITERATION:0:DEGREE:4:BASELINE_ONLY SET UP:START...
Original DF shape (1010412, 264)
Mem. usage decreased to 313.17 Mb (84.6% reduction)

Dataframe shape: (1010412, 264)
Original Model_DF_D2 shape (1010412, 264)
(3868, 263)
Mem. usage decreased to  2.34 Mb (0.0% reduction)
Training data shape: (6188, 262)
Train label shape: (6188,)
Test data shape: (1548, 262)
Test label shape: (1548,)
Final train dataset shape (6188, 29)

Final test dataset shape (1548, 29)
LRR with OneHotEncoder and StandardScaler()
Best parameter (CV score=0.707): {'classifier__random_state': None, 'classifier__C': 1.0}
Best score: 0.7079

KNN with OneHotEncoder and StandardScaler()
Score: 0.6990


ITERATION:0:DEGREE:4:BASELINE_ONLY SET UP:END

ITERATION:0:DEGREE:4:END

ITERATION:1:DEGREE:4:SETUP:BL_Only BEGIN...

ITERATION:1:DEGREE:4:BASELINE_ONLY SET UP:START...
Original DF shape (1010412, 264)
Mem. usage decreased to 313.17 Mb (84.6% reduction)

Dataframe shape: (10104

Unnamed: 0,n_deg,Model_Type,LRR_OneHotEncoder with StandardScaler()_score,LRR_OneHotEncoder with StandardScaler()_best_params_classifier__C,KNN_OneHotEncoder with StandardScaler()_score
0,4,BL_Only,0.709,1.0,0.71
1,4,BL_Red_Only,0.71,1.0,0.702
2,4,G+BL,0.794,1000.0,0.957
3,4,G+BL_Red,0.789,1000.0,0.935
4,4,G_Only,0.684,1000.0,0.675
5,5,BL_Only,0.71,1.0,0.688
6,5,BL_Red_Only,0.718,10.0,0.697
7,5,G+BL,0.813,100.0,0.95
8,5,G+BL_Red,0.803,1000.0,0.935
9,5,G_Only,0.69,1000.0,0.688


Completed all runs!....
