 ### PREDICTION OF THE SUCCESS OF START-UPS USING UNBIASED CLASSIFICATION MACHINE LEARNING MODELS 
 
**Problem Statement**
This notebook contains the final part of the research work to build a supervised classification model that predicts the success of a company based only on situational information available at the point when the company seeks financing.

**Data**
Historical data about companies was sourced from Crunchbase, which tracks and collects information about millions of companies and related personnel. Past research has been carried out using Crunchbase data, but company descriptions, people descriptions and other textual features were not used for model building. To further enrich the dataset, information from social media was scraped and analysed for sentiment. However, due to Twitter restrictions and the unrealistic resources required to scrape for all companies, the Twitter scraping code, though available, is not utilised for the final dataset. Natural Language Processing of the organisation and people descriptions was carried out, as well as feature engineering of additional features from the raw dataset.

**Models**

The following ML models will be experimented on with the final dataset.
* Logistic Regression
* Random Forest Classifier
* SVM Classifier
* Naive Bayes
* XGBoost
* LightGBM

 ### Machine Learning Model
 
**The scikit-learn ML library will be utilised and the following steps carried out**
 
  * Load the preprocessed train, validation and test datasets.
  * Train the models and evaluate them with the validation set.
  * Carry out hyperparameter tuning as required.
  * Evaluate and select the best model using the test data.
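
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual run: the data comes from `make_classification`, the 60/20/20 split is an assumption, and only two of the candidate models are included.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed start-up data (assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit on the training set and compare on the validation set only
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in candidates.items()
}
best_name = max(val_scores, key=val_scores.get)

# Touch the test set once, for the selected model
test_accuracy = candidates[best_name].score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Keeping the test set out of model selection means the final accuracy is not inflated by the choice of model.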
  

In [None]:
# !pip install lightgbm
# !pip install xgboost

In [None]:
# Import Data Analysis Libraries
import pandas as pd
import numpy as np
import os
import dill
import warnings

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# Import Machine Learning Classifiers models Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb
from xgboost import XGBClassifier


# Import evaluations modules
#from sklearn.metrics import plot_roc_curve
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay


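# Import other needed machine learning libraries
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import Pipeline
# Needed by the preprocessing pipelines defined later in the notebook
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer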


In [None]:
train_reader = pd.read_csv("D:\\OneDrive\\Documents\\Personal Project Portfolio\\PP00017_Prediction of the Success of Start-Ups\\artifacts\\data_transformation\\train_data.csv", chunksize=3000, low_memory=True)
val_reader = pd.read_csv("D:\\OneDrive\\Documents\\Personal Project Portfolio\\PP00017_Prediction of the Success of Start-Ups\\artifacts\\data_transformation\\validate_data.csv", chunksize=3000, low_memory=True)
test_reader = pd.read_csv("D:\\OneDrive\\Documents\\Personal Project Portfolio\\PP00017_Prediction of the Success of Start-Ups\\artifacts\\data_transformation\\test_data.csv", chunksize=3000, low_memory=True)

# Concatenate all chunks into a single DataFrame
train = pd.concat(train_reader)
val = pd.concat(val_reader)
test = pd.concat(test_reader)

In [None]:
# Load the training, validation and test datasets with a single call each.
# Raw strings stop the backslashes in the Windows paths being read as escapes ("\t", "\a", "\v").

train = pd.read_csv(r"D:\OneDrive\Documents\Personal Project Portfolio\PP00017_Prediction of the Success of Start-Ups\artifacts\data_transformation\train_data.csv")
val = pd.read_csv(r"D:\OneDrive\Documents\Personal Project Portfolio\PP00017_Prediction of the Success of Start-Ups\artifacts\data_transformation\validate_data.csv")
test = pd.read_csv(r"D:\OneDrive\Documents\Personal Project Portfolio\PP00017_Prediction of the Success of Start-Ups\artifacts\data_transformation\test_data.csv")

In [None]:
train

In [None]:
# Split features (all columns but the last) from labels (the last column)
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_val, y_val = val.iloc[:, :-1], val.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]


In [None]:
# Confirm the dimensions of the train, validation and test datasets
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

In [None]:
def fit_and_score_best_model(models,preprocessor_path, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates models passed to it.
    
    Parameters:
    -----------
    models : Dictionary of machine learning models
    preprocessor_path : path to the saved preprocessor (currently unused)
    X_train : training data (without labels)
    X_test : test data (without labels)
    y_train : training labels
    y_test : test labels
    
    Returns:
    -----------
    The best performing model out of the models passed to it
    
    """
    
    # Set random seed
    np.random.seed(42)
    
    # Make a dictionary to keep model scores
    model_report = {}
        
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
       
        # Evaluate the model and append its score to model_report
        model_report[name] = model.score(X_test, y_test)
        
    # Get the name and score of the best model from the dict
    best_model_name = max(model_report, key=model_report.get)
    best_model_score = model_report[best_model_name]
    best_model = models[best_model_name]

    # CustomException is not defined in this notebook, so raise a built-in error
    if best_model_score < 0.6:
        raise ValueError("No best model found")
           
#     logging.info(f"Best found model on both training and testing dataset")

#     save_object(
#                 file_path=self.model_trainer_config.trained_model_file_path,
#                 obj=best_model
#             )

    # Report the result, then return the model so the caller can reuse it
    print(f'Best Model: {best_model}, Score: {best_model_score}')
    return best_model

In [None]:
# put models in a dictionary
models = {'Logistic Regression':LogisticRegression(),'SVM': SVC(),'Random Forest': RandomForestClassifier(),
           'XGB': XGBClassifier(), 'LightGBM': lgb.LGBMClassifier() }

best_model = fit_and_score_best_model(models,"D:\\OneDrive\\Documents\\PERSONAL\\PERSONAL DEVELOPMENT\\DATA SCIENCE\\Personal Project Portfolio\\PP00017_Prediction of the Success of Start-Ups Using Unbiased Classification ML Models\\artifacts\\data_transformation\\preprocessor.pkl",
                                      X_train=X_train, X_test=X_val,y_train=y_train, y_test=y_val)

best_model
    


In [None]:
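def load_object(file_path):
    # Load a dill-serialised object, reporting the path if loading fails
    try:
        with open(file_path, "rb") as file_obj:
            return dill.load(file_obj)

    except Exception as e:
        # CustomException and sys are not defined in this notebook
        raise RuntimeError(f"Could not load object from {file_path}") from e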
        
        
def save_object(file_path, obj):
    with open(file_path, 'wb') as file_obj:
        dill.dump(obj, file_obj)

In [None]:
# Helper function that tunes each model's hyperparameters with GridSearchCV
def evaluate_models(X_train, y_train, X_test, y_test, models, param):
    """
    Fits and scores the models provided, running a grid-search cross
    validation over the parameter grid supplied for each model.

    Parameters:
    -----------
    X_train : training data input features
    y_train : training data labels
    X_test : test data input features
    y_test : test data labels
    models : dictionary of ML models to experiment with
    param : dictionary mapping each model name to its parameter grid

    Returns:
    -----------
    A dictionary of model name and test score key-value pairs
    """
    report = {}

    for name, model in models.items():
        # Search the grid for this model with 3-fold cross validation
        gs = GridSearchCV(model, param[name], cv=3, verbose=3)
        gs.fit(X_train, y_train)

        # Refit the original model object with the best parameters found
        model.set_params(**gs.best_params_)
        model.fit(X_train, y_train)

        report[name] = model.score(X_test, y_test)

    return report

In [None]:
# update the fit_score_best_model helper function

def fit_and_score_best_model(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates models passed to it.
    
    Parameters:
    -----------
    models : Dictionary of machine learning models
    X_train : training data (without labels)
    X_test : test data (without labels)
    y_train : training labels
    y_test : test labels
    
    Returns:
    -----------
    The best performing model out of the models passed to it
    
    """
    
    # Set random seed
    np.random.seed(42)

    # Create the parameter grid
    params={
                'Logistic Regression': {
                    'penalty':[None, 'l2', 'l1', 'elasticnet'],
                    'dual':[True,False],
                    'solver':['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
                },
                'SVM':{
                    'C':[0.1, 1, 10, 1000],
                    'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
                    'degree':[0, 1, 2, 3, 4, 5, 6],
                    'gamma':['scale', 'auto'],
                  
                },
              'Random Forest': {
                      'criterion': ['gini', 'entropy'],
                      'max_features': ['sqrt','log2'],
                      'n_estimators': [8, 32, 128, 256],
                      'max_leaf_nodes': [None, 16, 32 ],
                      'bootstrap': [True, False],
                      'n_jobs': [1, -1],
                      'max_samples': [2, 4,6],
                      'min_samples_split': [2, 6, 10, 14, 18],
                      'min_samples_leaf': [1, 5, 9, 13, 17]
              },
                'XGB':{
                    'learning_rate': [0.01, 0.05, 0.1],
                    'max_depth': [6, 8, 10],
                    'gamma': [2, 4, 9, 12],
                    'sampling_method': ['uniform', 'gradient_based'],
                    'grow_policy': ['depthwise', 'lossguide'],
                    'n_estimators': [8,16,32,64,128,256]
                    
                },
                'LightGBM':{
                    'boosting_type':['gbdt', 'rf', 'dart'],
                    'max_depth': [-1, 2, -10],
                    'learning_rate': [0.01, 0.05, 0.1],
                    'n_estimators': [100, 50, 200],
                    'num_leaves': [31, 50, 100]
                },
    }
          
    # Evaluate the model and append its score to model_report
    model_report:dict=evaluate_models(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
                                             models=models, param=params)
        
    # Get the name and score of the best model from the dict
    best_model_name = max(model_report, key=model_report.get)
    best_model_score = model_report[best_model_name]
    best_model = models[best_model_name]

    # CustomException is not defined in this notebook, so raise a built-in error
    if best_model_score < 0.6:
        raise ValueError("No best model found")
           
#     logging.info(f"Best found model on both training and testing dataset")

    save_object(
                file_path="D:\\OneDrive\\Documents\\PERSONAL\\PERSONAL DEVELOPMENT\\DATA SCIENCE\\Personal Project Portfolio\\PP00017_Prediction of the Success of Start-Ups Using Unbiased Classification ML Models\\artifacts\\models\\model.pkl",
                obj=best_model
            )

    # Report the result, then return the model so the caller can reuse it
    print(f'Best Model: {best_model}, Score: {best_model_score}')
    return best_model
                

In [None]:
# put models in a dictionary
models = {'Logistic Regression':LogisticRegression(),'SVM': SVC(),'Random Forest': RandomForestClassifier(),
           'XGB': XGBClassifier(), 'LightGBM': lgb.LGBMClassifier() }

best_model = fit_and_score_best_model(models, X_train=X_train, X_test=X_val,
                                      y_train=y_train, y_test=y_val)

best_model

Now we have a baseline model, so we will tune the hyperparameters and add more evaluation metrics for review:
* Hyperparameter tuning
* Feature Importance
* Confusion Matrix
* Cross-validation
* Precision
* Recall
* F1 Score
* Classification Report
* ROC curve
* Area under the ROC curve (AUC)
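
A minimal sketch of how most of these metrics can be computed with scikit-learn is shown here. It uses synthetic data and a plain logistic regression as stand-ins (assumptions), and treats coefficient magnitude as a rough proxy for feature importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data standing in for the start-up dataset (assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}

# 5-fold cross-validated accuracy on the training set
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)

# Rank features by absolute coefficient as a rough importance measure
importance_order = np.argsort(-np.abs(model.coef_[0]))

print(cm)
print(classification_report(y_test, y_pred))
print(metrics, cv_scores.mean(), importance_order)
```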

To also experiment on the dataset composition, we will create versions of the dataset as follows:

1. full_data (numeric, categorical and text)
2. medium_1_data (numeric and categories (education, location, biz category and job details))
3. medium_2_data (numeric and categories (education and biz category))
4. medium_3_data (numeric and categories (biz category only))
5. no_gender_data (full data without gender)
6. num_data (numeric only)
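
Because the dataset versions differ only in which column groups they keep, one parameterised helper could cover all of them. The sketch below is an illustration under assumptions: `build_preprocessor` and the `toy` frame are hypothetical, only a few column names are borrowed from the notebook, and the scores are training accuracy on eight invented rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_preprocessor(numeric=(), categorical=(), text=()):
    """Hypothetical helper: assemble a ColumnTransformer from the given column groups."""
    transformers = []
    if numeric:
        num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                             ("scaler", MinMaxScaler())])
        transformers.append(("num", num_pipe, list(numeric)))
    if categorical:
        cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])
        transformers.append(("cat", cat_pipe, list(categorical)))
    # TfidfVectorizer expects a 1-D column, so each text column is passed as a plain string
    for i, column in enumerate(text, start=1):
        transformers.append((f"text{i}", TfidfVectorizer(stop_words="english"), column))
    return ColumnTransformer(transformers=transformers)


# Invented rows, for illustration only
toy = pd.DataFrame({
    "followers": [10, 250, 40, 900, 5, 300, 60, 720],
    "gender": ["f", "m", "m", "f", "m", "f", "f", "m"],
    "category_list": ["fintech", "health", "fintech", "ai", "health", "ai", "fintech", "ai"],
    "description_o": ["payments app", "clinic software", "lending platform", "vision models",
                      "hospital scheduling", "speech models", "payments api", "robotics models"],
    "success": [0, 1, 0, 1, 0, 1, 0, 1],
})

variants = {
    "full_data": dict(numeric=["followers"], categorical=["gender", "category_list"], text=["description_o"]),
    "no_gender_data": dict(numeric=["followers"], categorical=["category_list"], text=["description_o"]),
    "num_data": dict(numeric=["followers"]),
}

X_toy, y_toy = toy.drop(columns="success"), toy["success"]
scores = {}
for name, groups in variants.items():
    clf = Pipeline([("preprocessor", build_preprocessor(**groups)),
                    ("classifier", LogisticRegression(max_iter=1000))])
    clf.fit(X_toy, y_toy)
    scores[name] = clf.score(X_toy, y_toy)  # training accuracy, toy data only

print(scores)
```

Columns not named in any group are dropped by `ColumnTransformer` by default, which is what makes the variants differ.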


In [None]:
# helper function for full data
def fit_and_score_fd(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the full_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A dictionary of accuracy scores of each model

    """

    # Define preprocessors

    # pipeline for text data1
    text1_features = 'short_description_o'
    text1_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for text data2
    text2_features = 'description_o'
    text2_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for text data3
    # NOTE: this reuses 'description_o', the same column as text data2
    text3_features = 'description_o'
    text3_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for categorical data

    categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                           'featured_job_title','institution_name','degree_type','subject','is_completed']

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
        transformers=[
            ("tex1", text1_transformer, text1_features),
            ("tex2", text2_transformer, text2_features),
            ("tex3", text3_transformer, text3_features),
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ])

    X = dataset.drop('success',axis=1)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'fd_{name}'] = f'{clf.score(X_test, y_test):.1%}'


#     model_score_df =pd.DataFrame(model_score)

    return model_score #model_score_df



In [None]:
fit_and_score_fd(full_data, models)

In [None]:
medium_1_data.head(1)

In [None]:
# helper function for medium_1_data
def fit_and_score_m1d(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the medium_1_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A one-column DataFrame of accuracy scores, indexed by model

    """

    # Define preprocessors


    # pipeline for categorical data

    categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                           'featured_job_title','institution_name','degree_type','subject','is_completed']

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ])

    X = dataset.drop('success',axis=1)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'm1d_{name}'] = f'{clf.score(X_test, y_test):.1%}'

    model_score = {k:[v] for k,v in model_score.items()}
    model_score_df =pd.DataFrame(model_score)

    return model_score_df.T



In [None]:
fit_and_score_m1d(medium_1_data, models)

In [None]:
# helper function for medium_2_data
def fit_and_score_m2d(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the medium_2_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A one-column DataFrame of accuracy scores, indexed by model

    """

    # Define preprocessors

    # pipeline for categorical data

    categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                           'featured_job_title','institution_name','degree_type','subject','is_completed']

    categorical_2_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


    X = dataset.drop('success',axis=1)
    #X.drop(['geometry','roles','type_o','primary_role'],axis=1,inplace=True)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'm2d_{name}'] = f'{clf.score(X_test, y_test):.1%}'

    model_score = {k:[v] for k,v in model_score.items()}
    model_score_df =pd.DataFrame(model_score)

    return model_score_df.T



In [None]:
fit_and_score_m2d(medium_2_data, models)

In [None]:
# helper function for medium_3_data
def fit_and_score_m3d(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the medium_3_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A one-column DataFrame of accuracy scores, indexed by model

    """

    # Define preprocessors

    # pipeline for categorical data

    categorical_3_features = ['status','category_list','category_groups_list']

    categorical_3_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])


    X = dataset.drop('success',axis=1)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'm3d_{name}'] = f'{clf.score(X_test, y_test):.1%}'

    model_score = {k:[v] for k,v in model_score.items()}
    model_score_df =pd.DataFrame(model_score)

    return model_score_df.T



In [None]:
fit_and_score_m3d(medium_3_data, models)

In [None]:
# helper function for no_gender_data
def fit_and_score_nog(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the no_gender_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A one-column DataFrame of accuracy scores, indexed by model

    """

    # Define preprocessors

    # pipeline for text data1
    text1_features = 'short_description_o'
    text1_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for text data2
    text2_features = 'description_o'
    text2_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for text data3
    # NOTE: this reuses 'description_o', the same column as text data2
    text3_features = 'description_o'
    text3_transformer = Pipeline(steps=[
        ('vectorizer', TfidfVectorizer(stop_words="english"))
    ])

    # pipeline for categorical data

    no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                           'featured_job_title','institution_name','degree_type','subject','is_completed']

    no_gender_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
        transformers=[
            ("tex1", text1_transformer, text1_features),
            ("tex2", text2_transformer, text2_features),
            ("tex3", text3_transformer, text3_features),
            ("num", numeric_transformer, numeric_features),
            ("nog", no_gender_transformer, no_gender_features),
        ])

    X = dataset.drop('success',axis=1)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'nog_{name}'] = f'{clf.score(X_test, y_test):.1%}'

    model_score = {k:[v] for k,v in model_score.items()}
    model_score_df =pd.DataFrame(model_score)

    return model_score_df.T




In [None]:
fit_and_score_nog(no_gender_data, models)

In [None]:
# helper function for num_data (numeric features only)
def fit_and_score_num(dataset, models):
    
    """
    Fits and evaluates the models passed to it on the num_data version of the dataset

    Parameters:
    -----------
    dataset : DataFrame containing the features and the 'success' label
    models : Dictionary of machine learning models

    Returns:
    -----------
    A one-column DataFrame of accuracy scores, indexed by model

    """


    # pipeline for numeric data
    numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                         'degree_length','employee_count_min','employee_count_max']
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


    preprocessor = ColumnTransformer(
        transformers=[

            ("num", numeric_transformer, numeric_features)

        ])

    X = dataset.drop('success',axis=1)
    y = dataset['success']

    # Split into train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_score = {}

    # Loop through models
    for name, model in models.items():

        # Fit the model to the data
        clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
        clf.fit(X_train, y_train)

        # Store the test accuracy as a percentage string
        model_score[f'num_{name}'] = f'{clf.score(X_test, y_test):.1%}'

    model_score = {k:[v] for k,v in model_score.items()}
    model_score_df =pd.DataFrame(model_score)

    return model_score_df.T


In [None]:
fit_and_score_num(num_data, models)

## We will carry out hyperparameter tuning and also evaluate the following, as stated earlier:

* Hyperparameter tuning
* Feature Importance
* Confusion Matrix
* Cross-validation
* Precision
* Recall
* F1 Score
* Classification Report
* ROC curve
* Area under the ROC curve (AUC)

In [None]:
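# We use RandomizedSearchCV because an exhaustive GridSearchCV would take too long

# Create a hyperparameter grid for Logistic Regression
# Note: 'l1' is not supported by lbfgs and 'elasticnet' needs the saga solver,
# so some sampled combinations will fail to fit and be scored as NaN
log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
                'clf__C': np.logspace(-4,4,20),
                'clf__solver':['lbfgs','liblinear'],
               'clf__max_iter': np.arange(100,300,50)}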

# Create a hyperparameter grid for SVC
svc_grid = {'clf__C': [0.1, 1, 10, 1000], 
           'clf__degree':[0, 1, 2, 3, 4, 5, 6],
           'clf__kernel':['linear', 'rbf', 'poly']}



# Create a hyperparameter grid for Random Forest
rf_grid = {'clf__n_estimators': np.arange(10,1000,50),
          'clf__max_depth': [None, 3, 5, 10],
          'clf__min_samples_split': np.arange(2, 20, 2),
          'clf__min_samples_leaf': np.arange(1, 20, 2)}

# Create a hyperparameter grid for Naive Bayes
nb_grid ={'clf__alpha':[0.1, 1, 10, 1000],
          'clf__fit_prior': [True,False]}
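
Not every penalty and solver pair is valid for `LogisticRegression`: `lbfgs` does not support `l1`, for example. One way to keep the random search from sampling invalid pairs is to pass a list of grids, one per compatible solver. The sketch below is an illustration on synthetic data; `compatible_grid` and the reduced range of `C` are assumptions and not the grid used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the start-up dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("clf", LogisticRegression(max_iter=500))])

# One sub-grid per solver, each listing only the penalties that solver accepts
compatible_grid = [
    {"clf__solver": ["lbfgs"], "clf__penalty": ["l2"],
     "clf__C": list(np.logspace(-2, 2, 5))},
    {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"],
     "clf__C": list(np.logspace(-2, 2, 5))},
]

search = RandomizedSearchCV(pipe, param_distributions=compatible_grid,
                            n_iter=8, cv=3, random_state=42)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

With only valid combinations sampled, none of the `n_iter` candidates is wasted on a failed fit.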


### Full Data

**Logistic Regression**

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
# NOTE: this reuses 'description_o', the same column as text data2
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_


print(f'RandomizedSearchCV score for log_reg fd is: {model_score}')
print(f'Best params for log_reg fd are: {best_params}')


In [None]:
# Run for 'm1d' data:

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    
    
X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg m1d is:{model_score}')
print(f'Best params for Log_Reg m1d is:{clf_best}')


In [None]:
# Run for 'm2d' data:

# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg m2d is:{model_score}')
print(f'Best params for Log_Reg m2d is:{clf_best}')
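
The commented-out `log_reg_grid` in the cell above pairs every penalty with every solver, but scikit-learn does not accept all pairs: `lbfgs` supports `l2` or no penalty, `liblinear` supports `l1` or `l2`, and `elasticnet` requires `saga`. Invalid samples fail to fit and are scored as NaN. A minimal sketch of one way to avoid this is to pass a list of dictionaries, so each sampled candidate comes from a compatible group. The grid values and `pen_` names are assumptions, and the grid actually used by this notebook is defined in an earlier cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data used only to exercise the search space (assumption).
X_pen, y_pen = make_classification(n_samples=200, n_features=6, random_state=1)

# Each dict groups a solver with the penalties it supports.
valid_log_reg_space = [
    {"clf__solver": ["lbfgs"], "clf__penalty": ["l2"],
     "clf__C": np.logspace(-2, 2, 10)},
    {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"],
     "clf__C": np.logspace(-2, 2, 10)},
]

pen_pipe = Pipeline(steps=[("scaler", MinMaxScaler()),
                           ("clf", LogisticRegression(max_iter=500))])
pen_search = RandomizedSearchCV(pen_pipe, param_distributions=valid_log_reg_space,
                                cv=5, n_iter=8, random_state=42)
pen_search.fit(X_pen, y_pen)

# Every sampled candidate was fittable, so no score is NaN.
pen_scores = pen_search.cv_results_["mean_test_score"]
print(pen_search.best_params_)
```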


In [None]:
# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg m3d is:{model_score}')
print(f'Best params for Log_Reg m3d is:{clf_best}')


In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])

X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg nog is:{model_score}')
print(f'Best params for Log_Reg nog is:{clf_best}')
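
In the cell above the text columns are given to `ColumnTransformer` as plain strings (`'short_description_o'`), while the numeric and categorical columns are given as lists. This matters because `TfidfVectorizer` expects a 1-D sequence of documents, and a scalar column name makes `ColumnTransformer` pass a single column as a vector. The sketch below illustrates this on a three-row frame; the example texts and `demo_text_` names are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative frame; the descriptions are made up.
demo_text_df = pd.DataFrame({
    "description_o": ["solar energy platform",
                      "mobile payments app",
                      "cloud analytics software"],
})

# A scalar column name (not a list) selects the column as a 1-D vector,
# which is the input shape TfidfVectorizer is designed for.
demo_text_ct = ColumnTransformer(
    transformers=[("tex", TfidfVectorizer(stop_words="english"), "description_o")]
)
demo_text_matrix = demo_text_ct.fit_transform(demo_text_df)

# One row per document, one column per retained vocabulary term.
print(demo_text_matrix.shape)
```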


In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg num is:{model_score}')
print(f'Best params for Log_Reg num is:{clf_best}')


In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg ned is:{model_score}')
print(f'Best params for Log_Reg ned is:{clf_best}')


In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg sed is:{model_score}')
print(f'Best params for Log_Reg sed is:{clf_best}')


In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=log_reg_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for log_reg fed is:{model_score}')
print(f'Best params for Log_Reg fed is:{clf_best}')
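
The logistic regression cells above differ only in the dataset and in which text, numeric and categorical columns are used. As a hedged sketch, the repeated steps could be wrapped in two helpers. `build_preprocessor`, `run_search`, the synthetic `demo_df` and `demo_helper_grid` are hypothetical names introduced here, not part of the original notebook, and the column names in the demo are borrowed from the cells above purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_preprocessor(text_cols=(), numeric_cols=(), categorical_cols=()):
    """Assemble a ColumnTransformer from the column groups that are present."""
    transformers = []
    for i, col in enumerate(text_cols, start=1):
        # Text columns are passed as scalar names so the vectorizer gets 1-D input.
        transformers.append((f"tex{i}", TfidfVectorizer(stop_words="english"), col))
    if numeric_cols:
        numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                                  ("scaler", MinMaxScaler())])
        transformers.append(("num", numeric, list(numeric_cols)))
    if categorical_cols:
        categorical = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
        transformers.append(("cat", categorical, list(categorical_cols)))
    return ColumnTransformer(transformers=transformers)


def run_search(data, estimator, grid, target="success", n_iter=20, **column_groups):
    """Split, run a randomized search, and return (fitted search, held-out score)."""
    X = data.drop(target, axis=1)
    y = data[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
    pipe = Pipeline(steps=[("preprocessor", build_preprocessor(**column_groups)),
                           ("clf", estimator)])
    search = RandomizedSearchCV(pipe, param_distributions=grid, cv=5,
                                n_iter=n_iter, random_state=42)
    search.fit(X_tr, y_tr)
    return search, search.score(X_te, y_te)


# Synthetic demonstration data (assumption), with one missing numeric value.
rng = np.random.default_rng(0)
n_rows = 60
demo_df = pd.DataFrame({
    "short_description_o": rng.choice(
        ["solar energy platform", "mobile payments app",
         "cloud analytics software", "online fashion marketplace"], size=n_rows),
    "rank_o": rng.uniform(1, 1000, size=n_rows),
    "country_code_o": rng.choice(["USA", "GBR", "IND"], size=n_rows),
    "success": np.tile([0, 1], n_rows // 2),
})
demo_df.loc[0, "rank_o"] = np.nan

demo_helper_grid = {"clf__C": np.logspace(-2, 2, 10)}  # hypothetical grid
demo_helper_search, demo_helper_score = run_search(
    demo_df, LogisticRegression(max_iter=500), demo_helper_grid, n_iter=5,
    text_cols=["short_description_o"],
    numeric_cols=["rank_o"],
    categorical_cols=["country_code_o"],
)
print(demo_helper_search.best_params_, demo_helper_score)
```

With helpers of this kind, each dataset variant would reduce to one call that lists its columns, and switching the estimator and grid (for example to `SVC()` with `svc_grid`) would not require repeating the preprocessing code.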


**SVM**

In [None]:
# Run for 'fd' data

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC fd is:{model_score}')
print(f'Best params for SVC fd is:{clf_best}')

In [None]:
# Run for 'm1d' data

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])


X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC m1d is:{model_score}')
print(f'Best params for SVC m1d is:{clf_best}')

In [None]:
# Run for 'm2d' data
# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])




X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC m2d is:{model_score}')
print(f'Best params for SVC m2d is:{clf_best}')

In [None]:
# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

#rand_params =[log_reg_grid,svc_grid,rf_grid,nb_grid]


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC m3d is:{model_score}')
print(f'Best params for SVC m3d is:{clf_best}')



In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])


X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC nog is:{model_score}')
print(f'Best params for SVC nog is:{clf_best}')



In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params  =  clf_best


print(f'RandomizedSearchCV score for SVC num is:{model_score}')
print(f'Best params for SVC num is:{clf_best}')



In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

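model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for SVC ned is: {model_score}')
print(f'Best params for SVC ned is: {best_params}')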



In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

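model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for SVC sed is: {model_score}')
print(f'Best params for SVC sed is: {best_params}')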



In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=svc_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

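model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for SVC fed is: {model_score}')
print(f'Best params for SVC fed is: {best_params}')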
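
Each cell above repeats the same split, search and scoring steps, and the `model_score` and `best_params` dictionaries are overwritten by single values. The sketch below shows one way the steps could be wrapped in a function that records results per dataset label. The function name `run_search`, the toy data and the small grid are all hypothetical, and text columns are left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC


def run_search(df, label, numeric_cols, categorical_cols, estimator, grid,
               scores, params, n_iter=4):
    """Fit a randomized search on one dataset variant and record its results."""
    numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                              ("scaler", MinMaxScaler())])
    categorical = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])

    X = df.drop("success", axis=1)
    y = df["success"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1)

    pipe = Pipeline(steps=[("preprocessor", preprocessor), ("clf", estimator)])
    search = RandomizedSearchCV(pipe, param_distributions=grid, cv=3,
                                n_iter=n_iter, random_state=42)
    search.fit(X_train, y_train)

    # Store results under the dataset label instead of overwriting the dicts.
    scores[label] = search.score(X_test, y_test)
    params[label] = search.best_params_
    return search


# Toy data standing in for one of the notebook's dataset variants.
rng = np.random.default_rng(0)
n_rows = 80
toy = pd.DataFrame({
    "rank_o": rng.uniform(0, 100, n_rows),
    "followers": rng.integers(0, 1000, n_rows).astype(float),
    "gender": pd.Series(rng.choice(["male", "female"], n_rows), dtype=object),
})
toy.loc[toy.index[::10], "gender"] = np.nan      # some missing categories
toy["success"] = (toy["rank_o"] > 50).astype(int)

model_score = {}
best_params = {}
demo_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

run_search(toy, "demo", ["rank_o", "followers"], ["gender"], SVC(),
           demo_grid, model_score, best_params)
print(model_score)
print(best_params)
```

With this layout, one call per dataset label fills the dictionaries, so the scores for the different variants can be compared side by side at the end.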



**Random Forest**

In [None]:
# Run for full data


# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

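model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF fd is: {model_score}')
print(f'Best params for RF fd is: {best_params}')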

In [None]:
# Run for 'm1d' data

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])


X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

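model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF m1d is: {model_score}')
print(f'Best params for RF m1d is: {best_params}')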

In [None]:

# Run for 'm2d' data
# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

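model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF m2d is: {model_score}')
print(f'Best params for RF m2d is: {best_params}')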

In [None]:

# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])

X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

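model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF m3d is: {model_score}')
print(f'Best params for RF m3d is: {best_params}')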

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])


X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

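model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF nog is: {model_score}')
print(f'Best params for RF nog is: {best_params}')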

In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

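model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF num is: {model_score}')
print(f'Best params for RF num is: {best_params}')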

In [None]:

# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

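model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF ned is: {model_score}')
print(f'Best params for RF ned is: {best_params}')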

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

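model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF sed is: {model_score}')
print(f'Best params for RF sed is: {best_params}')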

In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=rf_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

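model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for RF fed is: {model_score}')
print(f'Best params for RF fed is: {best_params}')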
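
In the preprocessors above, each text column is passed to `ColumnTransformer` as a single string rather than a list, because `TfidfVectorizer` expects a one-dimensional sequence of documents. The cells also register `description_o` twice, as both `tex2` and `tex3`. The sketch below, on made-up descriptions, shows that each text transformer learns its own vocabulary, so a column registered twice contributes its TF-IDF features twice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up rows for illustration; the column names follow the notebook.
toy = pd.DataFrame({
    "short_description_o": ["cloud storage platform",
                            "mobile payments app",
                            "online food delivery"],
    "description_o": ["secure cloud storage for teams",
                      "payments for small merchants",
                      "food delivery marketplace"],
})

# Each text column registered once.
single = ColumnTransformer(transformers=[
    ("tex1", TfidfVectorizer(stop_words="english"), "short_description_o"),
    ("tex2", TfidfVectorizer(stop_words="english"), "description_o"),
])

# Same layout as the cells above: description_o appears under tex2 and tex3.
duplicated = ColumnTransformer(transformers=[
    ("tex1", TfidfVectorizer(stop_words="english"), "short_description_o"),
    ("tex2", TfidfVectorizer(stop_words="english"), "description_o"),
    ("tex3", TfidfVectorizer(stop_words="english"), "description_o"),
])

n_single = single.fit_transform(toy).shape[1]
n_duplicated = duplicated.fit_transform(toy).shape[1]
n_description_terms = len(duplicated.named_transformers_["tex3"].vocabulary_)

print(n_single, n_duplicated, n_description_terms)
```

If the third text transformer was meant for a different column, pointing `text3_features` at that column would avoid the duplicated features; which column that would be is not stated in this section.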

**Naive Bayes**

In [None]:
# Run for full data

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])



preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

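model_score = clf_model.score(X_test, y_test)
# best_params_ holds the winning hyperparameters; best_estimator_ is the whole fitted pipeline
best_params = clf_model.best_params_

print(f'RandomizedSearchCV score for NB fd is: {model_score}')
print(f'Best params for NB fd is: {best_params}')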
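
`MultinomialNB` only accepts non-negative feature values during fitting, which is why the `MinMaxScaler` step in the numeric pipeline matters for the Naive Bayes cells. The sketch below, on a small made-up array, contrasts `MinMaxScaler` with `StandardScaler`, which centres the data and therefore produces negative values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric features and labels for illustration.
X = np.array([[1.0, 200.0],
              [2.0, 50.0],
              [3.0, 400.0],
              [4.0, 10.0]])
y = np.array([0, 0, 1, 1])

# MinMaxScaler maps every training column into [0, 1], so fitting succeeds.
X_minmax = MinMaxScaler().fit_transform(X)
nb = MultinomialNB().fit(X_minmax, y)

# StandardScaler yields negative values, which MultinomialNB rejects.
X_standard = StandardScaler().fit_transform(X)
try:
    MultinomialNB().fit(X_standard, y)
    standard_scaler_ok = True
except ValueError as err:
    standard_scaler_ok = False
    print(f"StandardScaler output rejected: {err}")

print(X_minmax.min(), X_standard.min(), standard_scaler_ok)
```

TF-IDF weights and one-hot encoded columns are also non-negative, so the remaining branches of the preprocessor are compatible with `MultinomialNB` as they stand.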

In [None]:

# Run for 'm1d' data

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])


X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB m1d is: {model_score}')
print(f'Best params for NB m1d is: {best_params}')


In [None]:
# Run for 'm2d' data
# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])



preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB m2d is: {model_score}')
print(f'Best params for NB m2d is: {best_params}')

In [None]:
# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])

X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB m3d is: {model_score}')
print(f'Best params for NB m3d is: {best_params}')

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

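# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated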
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])


X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']






# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB nog is: {model_score}')
print(f'Best params for NB nog is: {best_params}')

In [None]:

# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features)
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB num is: {model_score}')
print(f'Best params for NB num is: {best_params}')

In [None]:

# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

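# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated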
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])



preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB ned is: {model_score}')
print(f'Best params for NB ned is: {best_params}')

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

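# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated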
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])



preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB sed is: {model_score}')
print(f'Best params for NB sed is: {best_params}')

In [None]:

# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

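# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated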
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])



preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']




# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB())])

# log_reg_grid = {'clf__penalty': ['l1', 'l2', 'elasticnet', None],
#                 'clf__C': np.logspace(-4,4,20),
#                 'clf__solver':['lbfgs','liblinear'],
#                'clf__max_iter': np.arange(100,300,50)}

clf_model = RandomizedSearchCV(clf, param_distributions=nb_grid, cv=5, n_iter=20,verbose=True) 

clf_model.fit(X_train, y_train)
clf_best = clf_model.best_estimator_

model_score = clf_model.score(X_test, y_test)
best_params = clf_model.best_params_  # tuned parameter values only


print(f'RandomizedSearchCV score for NB fed is: {model_score}')
print(f'Best params for NB fed is: {best_params}')
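
Each tuning cell above reassigns `model_score`, so only the most recent dataset's score is kept. One possible way to retain a score per dataset is to key a dictionary by dataset name, as in the minimal sketch below. It uses synthetic data, and the names `demo_datasets`, `variant_a` and `variant_b` are hypothetical placeholders rather than the datasets used in this work.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the dataset variants (names are hypothetical)
demo_datasets = {
    "variant_a": make_classification(n_samples=200, n_features=8, random_state=0),
    "variant_b": make_classification(n_samples=200, n_features=6, random_state=1),
}

scores = {}
for name, (X_d, y_d) in demo_datasets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=1)
    model = Pipeline(steps=[("scaler", MinMaxScaler()), ("clf", MultinomialNB())])
    model.fit(X_tr, y_tr)
    # Store the held-out accuracy under the dataset's name instead of overwriting it
    scores[name] = model.score(X_te, y_te)

summary = pd.DataFrame.from_dict(scores, orient="index", columns=["test_accuracy"])
print(summary)
```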

### Final Experiments

Using the best parameters found during hyperparameter tuning, we now fit each model on each of the 9 datasets and evaluate it with the ROC curve, confusion matrix, classification report and cross-validated metrics.
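
The dataset variants differ mainly in which feature lists are passed to the `ColumnTransformer`. As an illustrative sketch only, the hypothetical helper `build_preprocessor` below assembles the same kind of preprocessor from lists of numeric, categorical and text columns. The demo frame is made up and reuses just a few of the column names that appear in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_preprocessor(numeric_features, categorical_features=(), text_features=()):
    """Hypothetical helper: assemble a ColumnTransformer from feature lists."""
    transformers = []
    # One TF-IDF vectorizer per text column (passed as a single name, not a list)
    for i, column in enumerate(text_features, start=1):
        transformers.append((f"tex{i}", TfidfVectorizer(stop_words="english"), column))
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])
    transformers.append(("num", numeric_transformer, list(numeric_features)))
    if categorical_features:
        categorical_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
        transformers.append(("cat", categorical_transformer, list(categorical_features)))
    return ColumnTransformer(transformers=transformers)


# Tiny made-up frame with missing values in the numeric and categorical columns
demo = pd.DataFrame({
    "short_description_o": ["cloud payments platform", "solar energy storage",
                            "online grocery delivery"],
    "rank_o": [10.0, np.nan, 300.0],
    "status": ["operating", np.nan, "acquired"],
})
preprocessor_demo = build_preprocessor(["rank_o"], ["status"], ["short_description_o"])
features = preprocessor_demo.fit_transform(demo)
print(features.shape)
```

With a helper of this kind, each dataset variant would only need to state its own feature lists.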

### Logistic Regression

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

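# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated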
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                                    penalty='l1', solver='liblinear'))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

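# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)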

# The confusion matrix is plotted in a separate cell below

print(f'Accuracy score for log_reg fd is {clf.score(X_test, y_test)}')
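
The ROC curve plots the true-positive rate against the false-positive rate across decision thresholds, and the AUC summarises it in one number. As a minimal sketch on synthetic data (the `demo_model` and its data are assumptions, not the fitted pipeline above), the same quantities can be computed numerically from the predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

demo_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X_tr, y_tr)

# The probability of the positive class is the score that is thresholded
pos_scores = demo_model.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, pos_scores)
auc = roc_auc_score(y_te, pos_scores)
print(f"AUC: {auc:.3f}")
```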



In [None]:
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using seaborn's heatmap
    """
    
    fig, ax = plt.subplots(figsize=(3,3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds ),
                    annot=True,
                    cbar=False,
                    fmt='.0f')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    
plot_conf_mat(y_test, y_preds)
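
When reading the heatmap, it helps to recall that scikit-learn's `confusion_matrix` places true labels on the rows and predicted labels on the columns, so the heatmap's y-axis corresponds to true labels and its x-axis to predictions. A small made-up example (the label lists are purely illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true_demo = [0, 0, 0, 1]
y_pred_demo = [0, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[1 2]
#  [0 1]]
# Row 0 (true class 0): 1 predicted as 0, 2 predicted as 1
# Row 1 (true class 1): 0 predicted as 0, 1 predicted as 1

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
```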

In [None]:
# Calculate the classification report 
print(classification_report(y_test, y_preds))

In [None]:
# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

In [None]:
# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

In [None]:
# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

In [None]:
# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

In [None]:
# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_fd)', legend=False)
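
The four cross-validated metrics above are computed with four separate `cross_val_score` calls, each of which refits the pipeline on every fold. As a possible alternative, `cross_validate` accepts a list of scorers and evaluates them in a single pass. The sketch below uses synthetic data and a plain `LogisticRegression` (`demo_model`) as stand-ins for the notebook's pipeline and datasets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
demo_model = LogisticRegression(penalty="l1", solver="liblinear")

metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(demo_model, X_demo, y_demo, cv=5, scoring=metrics)

# cross_validate stores each metric's per-fold scores under "test_<metric>"
cv_summary = pd.Series({m: cv_results[f"test_{m}"].mean() for m in metrics})
print(cv_summary)
```

The resulting series can be plotted with `cv_summary.plot.bar(...)` in the same way as the `cv_metrics` frame.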

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

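# pipeline for text data3
# NOTE: this reuses the 'description_o' column from text data2, so its TF-IDF features are duplicated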
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                                    penalty='l1', solver='liblinear'))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

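# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)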



print(f'Accuracy score for log_reg fd is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross-validated accuracy
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
print(f'Cross-validated accuracy: {cv_acc}')

# Calculate the cross-validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
print(f'Cross-validated precision: {cv_precision}')

# Calculate the cross-validated recall
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
print(f'Cross-validated recall: {cv_recall}')

# Calculate the cross-validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
print(f'Cross-validated F1: {cv_f1}')

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_fd)', legend=False)

In [None]:
# Run for 'm1d' data:

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    
    
X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                                    penalty='l1', solver='liblinear'))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

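# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)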



print(f'Accuracy score for log_reg m1d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross-validated accuracy
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
print(f'Cross-validated accuracy: {cv_acc}')

# Calculate the cross-validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
print(f'Cross-validated precision: {cv_precision}')

# Calculate the cross-validated recall
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
print(f'Cross-validated recall: {cv_recall}')

# Calculate the cross-validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
print(f'Cross-validated F1: {cv_f1}')

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_m1d)', legend=False)

In [None]:

# Run for 'm2d' data:

# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                                    penalty='l1', solver='liblinear'))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

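# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)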



print(f'Accuracy score for log_reg m2d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross-validated accuracy
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
print(f'Cross-validated accuracy: {cv_acc}')

# Calculate the cross-validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
print(f'Cross-validated precision: {cv_precision}')

# Calculate the cross-validated recall
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
print(f'Cross-validated recall: {cv_recall}')

# Calculate the cross-validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
print(f'Cross-validated F1: {cv_f1}')

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_m2d)', legend=False)

In [None]:

# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Fit the model to the data
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                                    penalty='l1', solver='liblinear'))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

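# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)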



print(f'Accuracy score for log_reg m3d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross-validated accuracy
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
print(f'Cross-validated accuracy: {cv_acc}')

# Calculate the cross-validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
print(f'Cross-validated precision: {cv_precision}')

# Calculate the cross-validated recall
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
print(f'Cross-validated recall: {cv_recall}')

# Calculate the cross-validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
print(f'Cross-validated F1: {cv_f1}')

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_m3d)', legend=False)
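
The four `cross_val_score` calls above each refit the pipeline on every fold. As a minimal sketch, `cross_validate` can collect the same four metrics from a single pass over the folds. The synthetic data from `make_classification`, the default-penalty `LogisticRegression` stand-in, and the names `X_demo`, `y_demo`, `demo_clf` and `cv_summary` are assumptions for illustration only, not the notebook's datasets or tuned model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: replaces the notebook's X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Small stand-in pipeline: scaling followed by logistic regression
demo_clf = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=150)),
])

# One call fits each fold once and scores it with every metric listed
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=scoring)

# Average each metric over the five folds
cv_summary = pd.Series({name: cv_results[f"test_{name}"].mean() for name in scoring})
print(cv_summary)
```

With the notebook's own `clf`, `X` and `y` in place of the stand-ins, `cv_summary` could feed the bar plot directly, at roughly a quarter of the fitting cost.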

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])

X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_preds = clf.predict(X_test)

# Plot the ROC curve and report the AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for log_reg nog is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_nog)', legend=False)

In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[

        ("num", numeric_transformer, numeric_features)

    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_preds = clf.predict(X_test)

# Plot the ROC curve and report the AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for log_reg num is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_num)', legend=False)

In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_preds = clf.predict(X_test)

# Plot the ROC curve and report the AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for log_reg ned is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_ned)', legend=False)

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_preds = clf.predict(X_test)

# Plot the ROC curve and report the AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for log_reg sed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_sed)', legend=False)

In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_preds = clf.predict(X_test)

# Plot the ROC curve and report the AUC
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for log_reg fed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (logistic Reg_fed)', legend=False)
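
Every dataset variant above redefines the same text, numeric and categorical transformers and changes only the column lists. One way to cut that repetition is a small factory function, sketched below. `build_preprocessor`, its `scaler` parameter and the three-row `toy` frame are hypothetical names introduced for illustration; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_preprocessor(text_cols, numeric_cols, categorical_cols, scaler=None):
    """Hypothetical helper: assemble a ColumnTransformer from column-name lists."""
    transformers = []
    # Each text column gets its own TF-IDF vectorizer. The column is passed as a
    # plain string so the vectorizer receives a 1-D series of documents.
    for i, col in enumerate(text_cols, start=1):
        transformers.append((f"tex{i}", TfidfVectorizer(stop_words="english"), col))
    if numeric_cols:
        numeric = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", scaler if scaler is not None else MinMaxScaler()),
        ])
        transformers.append(("num", numeric, numeric_cols))
    if categorical_cols:
        categorical = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ])
        transformers.append(("cat", categorical, categorical_cols))
    return ColumnTransformer(transformers=transformers)


# Toy frame (assumption: made-up rows that reuse three of the notebook's column names)
toy = pd.DataFrame({
    "short_description_o": ["mobile payments platform",
                            "cloud data analytics",
                            "online food delivery"],
    "rank_o": [10.0, np.nan, 30.0],
    "status": pd.Series(["operating", np.nan, "acquired"], dtype=object),
})

demo_pre = build_preprocessor(["short_description_o"], ["rank_o"], ["status"])
features = demo_pre.fit_transform(toy)
print(features.shape)
```

Passing `scaler=StandardScaler()` would give the variant used in the SVM cells, so each dataset cell would only need to supply its own column lists.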

## SVM

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by an SVC (default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC fd is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_fd)', legend=False)
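
An `SVC` fitted with its default settings does not provide class probabilities, so the ROC curve for the SVM cells is drawn from the signed distances returned by `decision_function`. The sketch below computes the AUC from those scores directly, assuming synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `svc_demo`, `scores` and `auc` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: replaces the notebook's datasets)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Standardise the features, then fit an RBF-kernel SVC as in the cells above
svc_demo = Pipeline(steps=[("scaler", StandardScaler()), ("clf", SVC(C=0.1))])
svc_demo.fit(X_tr, y_tr)

# Signed distance to the separating surface; larger means more confidently positive
scores = svc_demo.decision_function(X_te)

# AUC only needs a ranking of the samples, so these scores are sufficient
auc = roc_auc_score(y_te, scores)
print(f"AUC from decision_function scores: {auc:.3f}")
```

If calibrated probabilities were needed rather than a ranking, `SVC(probability=True)` is one option, at the cost of an extra internal cross-validation during fitting.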

In [None]:
# Run for 'm1d' data:

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    
    
X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by an SVC (default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC m1d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_m1d)', legend=False)

In [None]:

# Run for 'm2d' data:

# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat2", categorical_2_transformer, categorical_2_features),
])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by an SVC (default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC m2d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_m2d)', legend=False)

In [None]:

# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat3", categorical_3_transformer, categorical_3_features),
])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by an SVC (default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC m3d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_m3d)', legend=False)

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: same column as text data2, so 'description_o' is vectorized twice)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])

X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by an SVC (default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC nog is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_nog)', legend=False)

In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features)
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

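# Build the pipeline: preprocessing followed by the SVC classifier
# (degree only affects kernel="poly"; it is ignored by the default RBF kernel)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=1, degree=0))])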

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC num is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_num)', legend=False)
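
The `SVC(C=1, degree=0)` call in this cell keeps the default `kernel="rbf"`. In scikit-learn, `degree` is only used by the polynomial kernel, so it should have no effect here. The sketch below checks that on synthetic data; `X_demo`, `y_demo` and the two model names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the numeric-only dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Both models use the default RBF kernel; only `degree` differs
svc_degree_0 = SVC(C=1, degree=0).fit(X_demo, y_demo)
svc_default = SVC(C=1).fit(X_demo, y_demo)  # default degree=3

# Identical predictions show that `degree` is ignored by the RBF kernel
same_rbf = np.array_equal(svc_degree_0.predict(X_demo), svc_default.predict(X_demo))
print(same_rbf)
```

If a tuned `degree` is meant to matter, the estimator would also need `kernel="poly"`.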

In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the SVC classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC ned is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_ned)', legend=False)
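
In the preprocessor above, each text column is passed to `ColumnTransformer` as a plain string, while the numeric and categorical columns are passed as lists. A string selector hands the transformer a 1-D column, which is what `TfidfVectorizer` expects; a list selector hands over a 2-D frame. The toy frame below is a small sketch of that distinction; its column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy frame: one free-text column and one numeric column
toy = pd.DataFrame({
    "text": ["red apple", "green apple", "red car"],
    "num": [1.0, 2.0, 3.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("txt", TfidfVectorizer(), "text"),   # string selector -> 1-D input
    ("num", StandardScaler(), ["num"]),   # list selector -> 2-D input
])

# Four vocabulary terms (apple, car, green, red) plus one scaled numeric column
toy_features = toy_preprocessor.fit_transform(toy)
print(toy_features.shape)
```

Passing `["text"]` as a list instead would give the vectorizer a one-column DataFrame rather than a sequence of documents, which it is not designed to handle.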

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the SVC classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC sed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_sed)', legend=False)
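
`RocCurveDisplay.from_estimator` shows the AUC in the plot legend but does not return it as a number. A minimal sketch for obtaining the value directly, assuming the classifier exposes `decision_function` (as `SVC` does), is shown below on synthetic data; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's train/test split
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

svc = SVC(C=0.1).fit(X_tr, y_tr)

# Signed distances to the decision boundary serve as ranking scores for ROC AUC
scores = svc.decision_function(X_te)
auc = roc_auc_score(y_te, scores)
print(f"ROC AUC: {auc:.3f}")
```

The same call works on a fitted `Pipeline`, since the pipeline forwards `decision_function` to its final step.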

In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the SVC classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", SVC(C=0.1))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for SVC fed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (SVC_fed)', legend=False)
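
The preprocessing definitions are repeated almost verbatim in every cell of this section, differing only in the column lists and in whether the numeric pipeline includes `StandardScaler` (present in the SVC cells, absent in the Random Forest cells). As a sketch only, a small factory function could assemble the `ColumnTransformer` from those inputs. `make_preprocessor` and its `scale_numeric` flag are hypothetical names introduced here, not part of the notebook.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_preprocessor(text_cols, numeric_cols, categorical_cols, scale_numeric=True):
    """Hypothetical helper: build the ColumnTransformer pattern used in this notebook."""
    transformers = []
    # One TF-IDF vectorizer per text column; each column is passed as a string
    for i, col in enumerate(text_cols, start=1):
        transformers.append((f"tex{i}", TfidfVectorizer(stop_words="english"), col))
    if numeric_cols:
        num_steps = [("imputer", SimpleImputer(strategy="mean"))]
        if scale_numeric:  # scaling matters for SVC, less so for tree ensembles
            num_steps.append(("scaler", StandardScaler()))
        transformers.append(("num", Pipeline(steps=num_steps), numeric_cols))
    if categorical_cols:
        cat_steps = [
            ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
        transformers.append(("cat", Pipeline(steps=cat_steps), categorical_cols))
    return ColumnTransformer(transformers=transformers)


# Example call with a subset of the notebook's column names
demo_preprocessor = make_preprocessor(
    text_cols=["short_description_o", "description_o"],
    numeric_cols=["rank_o", "rank_p"],
    categorical_cols=["country_code_o", "gender"],
)
transformer_names = [name for name, _, _ in demo_preprocessor.transformers]
print(transformer_names)
```

Building the transformer from explicit column lists also makes it harder for two text slots to point at the same column unnoticed.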

## Random Forest

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=10, min_samples_leaf=3,
                                        min_samples_split=16,
                                        n_estimators=710))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_fd is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_fd)', legend=False)
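
The cell above relies on `np.random.seed(42)` to make the forest repeatable. When `random_state` is not set, `RandomForestClassifier` draws from NumPy's global random state, so the result depends on what else has consumed random numbers since the seed was set. A hedged alternative, sketched below on synthetic data with deliberately small illustrative settings, is to pass `random_state` to the estimator itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# Two forests with the same explicit seed (n_estimators kept small for speed)
rf_a = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42).fit(X_demo, y_demo)

# With a fixed random_state the fitted forests give identical probabilities
identical = np.array_equal(rf_a.predict_proba(X_demo), rf_b.predict_proba(X_demo))
print(identical)
```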

In [None]:
# Run for 'm1d' data:

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    
    
X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf",  RandomForestClassifier(max_depth=5, min_samples_split=18,
                                        n_estimators=610))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_m1d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_m1d)', legend=False)
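
The `train_test_split` calls in this notebook split at random without regard to the class balance of `success`. If the two classes turn out to be imbalanced, passing `stratify=y` keeps the class proportions the same in both splits. The sketch below uses a made-up 80/20 target to show the effect; it is an optional variation, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positive class
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo preserves the 20% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```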

In [None]:

# Run for 'm2d' data:

# Define preprocessors


# pipeline for categorical data

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=5, min_samples_split=18,
                                        n_estimators=610))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_m2d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_m2d)', legend=False)

In [None]:

# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=5, min_samples_split=18,
                                        n_estimators=610))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_m3d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_m3d)', legend=False)
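
The categorical pipelines in this notebook combine a constant-value imputer with `OneHotEncoder(handle_unknown='ignore')`. The short sketch below shows what that combination does with a missing value at fit time and an unseen category at prediction time. The toy `status` values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same two steps as the notebook's categorical transformer
cat_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Training data contains a missing value; test data contains an unseen category
train_toy = pd.DataFrame({"status": ["operating", "closed", np.nan]}, dtype=object)
test_toy = pd.DataFrame({"status": ["operating", "ipo"]}, dtype=object)

cat_demo.fit(train_toy)
encoded = cat_demo.transform(test_toy).toarray()

# Row 0 ('operating') has a single 1; row 1 ('ipo', unseen) is all zeros
print(encoded)
```

The missing training value becomes its own `not known` category, and an unseen test category is encoded as a row of zeros instead of raising an error.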

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])

X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=10, min_samples_leaf=3,
                                        min_samples_split=16,
                                        n_estimators=710))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_nog is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_nog)', legend=False)

In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features)
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=3, min_samples_leaf=7,
                                        min_samples_split=6,
                                        n_estimators=560))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_num is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_num)', legend=False)

In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: reuses 'description_o', the same column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the random forest classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", RandomForestClassifier(max_depth=10, min_samples_leaf=3,
                                        min_samples_split=16,
                                        n_estimators=710))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_ned is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_ned)', legend=False)
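
The four `cross_val_score` calls above each refit the pipeline on every fold, so the model is trained twenty times to produce four numbers. A minimal sketch of an alternative is shown below, assuming scikit-learn's `cross_validate` with a list of scorers; the synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `demo_clf` and `cv_metrics_demo` are illustrative stand-ins, not part of the notebook's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: NOT the Crunchbase dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Small forest so the sketch runs quickly; hyperparameters are illustrative only
demo_clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# One call evaluates all four metrics on the same five folds
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=scoring)

# Average each metric over the folds, mirroring the cv_metrics frame above
cv_metrics_demo = pd.DataFrame(
    {name.capitalize(): cv_results[f"test_{name}"].mean() for name in scoring},
    index=[0],
)
print(cv_metrics_demo.T)
```

Because every metric is computed on the same fitted models, the four averages describe the same folds, which is not guaranteed when an unseeded estimator is refit separately for each metric.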

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

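# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.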
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", RandomForestClassifier(max_depth=10, min_samples_split=18,
                                   n_estimators=910)),
])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_sed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_sed)', legend=False)
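
Each cell in this section rebuilds the same `ColumnTransformer` with small variations in the column lists and in whether the numeric columns are scaled. The sketch below shows one way to factor that out, assuming a hypothetical helper named `build_preprocessor`; the toy DataFrame and its values are invented for illustration and only reuse a few of the notebook's column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_preprocessor(text_features, numeric_features, categorical_features, scale=False):
    """Hypothetical helper: assemble the ColumnTransformer repeated in each cell."""
    transformers = []
    # One TF-IDF block per text column; a string selector passes a 1-D column
    for i, col in enumerate(text_features, start=1):
        transformers.append((f"tex{i}", TfidfVectorizer(stop_words="english"), col))

    # Mean imputation, optionally followed by scaling to [0, 1]
    numeric_steps = [("imputer", SimpleImputer(strategy="mean"))]
    if scale:
        numeric_steps.append(("scaler", MinMaxScaler()))
    transformers.append(("num", Pipeline(steps=numeric_steps), numeric_features))

    # Constant imputation then one-hot encoding, as in the cells above
    categorical = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="not known")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    transformers.append(("cat", categorical, categorical_features))
    return ColumnTransformer(transformers=transformers)


# Toy data (assumption: invented values, not taken from Crunchbase)
toy = pd.DataFrame({
    "short_description_o": ["cloud payments platform", "solar energy storage", "mobile games studio"],
    "rank_o": [10.0, np.nan, 250.0],
    "gender": ["female", np.nan, "male"],
})
pre = build_preprocessor(["short_description_o"], ["rank_o"], ["gender"], scale=True)
features = pre.fit_transform(toy)
print(features.shape)
```

With a helper like this, the random forest cells would call it with `scale=False` and the Naive Bayes cells with `scale=True`, leaving only the column lists to differ between datasets.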

In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

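# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.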
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", RandomForestClassifier(max_depth=10, min_samples_leaf=3,
                                   min_samples_split=16, n_estimators=710)),
])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for rf_fed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (rf_fed)', legend=False)

## Naive Bayes

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

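# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.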
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_fd is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_fd)', legend=False)
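
The Naive Bayes cells add a `MinMaxScaler` step to the numeric pipeline, which the random forest cells do not need. A likely reason is that `MultinomialNB` rejects negative feature values at fit time. The sketch below illustrates this on toy data; the single feature is an assumption standing in for something like a sentiment score in [-1, 1], and `X_demo`, `y_demo` and `scaled_nb` are illustrative names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy feature with negative values (assumption: e.g. a sentiment score in [-1, 1])
X_demo = np.array([[-0.8], [-0.1], [0.3], [0.9]])
y_demo = np.array([0, 0, 1, 1])

# Fitting directly on negative values raises a ValueError
try:
    MultinomialNB().fit(X_demo, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print(f"Unscaled input rejected: {err}")

# Rescaling to [0, 1] first makes the same data acceptable
scaled_nb = make_pipeline(MinMaxScaler(), MultinomialNB())
scaled_nb.fit(X_demo, y_demo)
print(scaled_nb.predict(X_demo))
```

Note that the scaler learns its range from the training fold only, so a test value below the training minimum can still map to a negative number; `MinMaxScaler(clip=True)` is one option for guarding against that.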

In [None]:
# Run for 'm1d' data:

# Define preprocessors


# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    
    
X = medium_1_data.drop('success',axis=1)
y = medium_1_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=1000, fit_prior=False))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_m1d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_m1d)', legend=False)
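
Some of the Naive Bayes cells pass `fit_prior=False` and others leave the default. The sketch below shows what that flag changes, using invented count data with a 90/10 class imbalance; `X_demo`, `y_demo`, `with_prior` and `uniform_prior` are illustrative names and the data is not drawn from the notebook's dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count features with an imbalanced target (assumption: 10% positives)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(100, 4))
y_demo = np.array([1] * 10 + [0] * 90)

# Default: class priors are estimated from the class frequencies in y
with_prior = MultinomialNB(alpha=1.0, fit_prior=True).fit(X_demo, y_demo)

# fit_prior=False: a uniform prior is used regardless of class frequencies
uniform_prior = MultinomialNB(alpha=1.0, fit_prior=False).fit(X_demo, y_demo)

print(np.exp(with_prior.class_log_prior_))     # learned priors for classes 0 and 1
print(np.exp(uniform_prior.class_log_prior_))  # uniform priors
```

On an imbalanced target, a uniform prior stops the majority class from dominating predictions purely through its frequency, which typically trades some accuracy for higher recall on the minority class.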

In [None]:

# Run for 'm2d' data:

# Define preprocessors


# pipeline for categorical data (country-code columns omitted)

categorical_2_features = ['status','category_list','category_groups_list','uuid_p','gender',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_2_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat2", categorical_2_transformer, categorical_2_features),
    ])


X = medium_2_data.drop('success',axis=1)
y = medium_2_data['success']



# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=1000))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_m2d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_m2d)', legend=False)

In [None]:

# Run for 'm3d' data:

# Define preprocessors


# pipeline for categorical data (status and category columns only)

categorical_3_features = ['status','category_list','category_groups_list']

categorical_3_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat3", categorical_3_transformer, categorical_3_features),
    ])



X = medium_3_data.drop('success',axis=1)
y = medium_3_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10, fit_prior=False))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_m3d is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_m3d)', legend=False)

In [None]:
# Run for 'nog' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

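# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.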
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

no_gender_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

no_gender_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("nog", no_gender_transformer, no_gender_features),
    ])

X = no_gender_data.drop('success',axis=1)
y = no_gender_data['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=1000, fit_prior=False))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_nog is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_nog)', legend=False)

In [None]:
# Run for 'num' data:

# Define preprocessors


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ])


X = num_data.drop('success',axis=1)
y = num_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10, fit_prior=False))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_num is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_num)', legend=False)
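
Every cell splits the data with `train_test_split(..., random_state=1)` and no `stratify` argument, so the share of successful companies in the test set can drift from the share in the full data. The sketch below compares a plain and a stratified split on an invented target with 20% positives; all names ending in `_demo`, `_plain` or `_strat` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data (assumption: 20% positive class, not the notebook's dataset)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# Plain split, as used in the cells above
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Stratified split keeps the class ratio of y in both partitions
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("overall positive rate:   ", y_demo.mean())
print("plain test positive rate:", y_test_plain.mean())
print("strat test positive rate:", y_test_strat.mean())
```

Whether stratification matters here depends on how imbalanced `success` is in the real data, which this sketch does not show.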

In [None]:
# Run for 'ned' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

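# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.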
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = no_enrich_df.drop('success',axis=1)
y = no_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_ned is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_ned)', legend=False)

In [None]:
# Run for 'sed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

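# pipeline for text data3
# NOTE: 'description_o' is the same column used for text2, so "tex2" and "tex3"
# produce identical TF-IDF features; confirm which column was intended here.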
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = senti_enrich_df.drop('success',axis=1)
y = senti_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_sed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_sed)', legend=False)
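
The four `cross_val_score` calls above each refit the pipeline on the same five folds. As a minimal sketch, `cross_validate` can score all four metrics in a single pass over the folds. It assumes synthetic data from `make_classification` in place of `senti_enrich_df`. The names `X_demo`, `y_demo`, `demo_clf` and `cv_summary` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Small pipeline mirroring the scaler + Naive Bayes structure used above
demo_clf = Pipeline(steps=[("scaler", MinMaxScaler()), ("clf", MultinomialNB(alpha=10))])

# Score every metric on the same folds with one call
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=scoring)

# Average each metric across the five folds
cv_summary = {metric: float(np.mean(cv_results[f"test_{metric}"])) for metric in scoring}
print(cv_summary)
```

The resulting dictionary can be passed to `pd.DataFrame(cv_summary, index=[0])` to reproduce the bar chart above, with roughly a quarter of the fitting cost.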

In [None]:
# Run for 'fed' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: this reuses the same 'description_o' column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = follow_enrich_df.drop('success',axis=1)
y = follow_enrich_df['success']

# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by Multinomial Naive Bayes
clf = Pipeline(steps=[("preprocessor", preprocessor), ("clf", MultinomialNB(alpha=10))])

# Fit model
clf.fit(X_train, y_train)

# Predict 

y_preds = clf.predict(X_test)

# Plot the ROC curve and calculate the AUC metric

RocCurveDisplay.from_estimator(clf, X_test, y_test)



print(f'Accuracy score for NB_fed is {clf.score(X_test, y_test)}')


sns.set(font_scale=1.5)
    
plot_conf_mat(y_test, y_preds)

# Calculate the classification report 
print(classification_report(y_test, y_preds))

# Calculate the cross validated accuracy 
cv_acc = cross_val_score(clf,  X, y, cv=5, scoring='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

# Calculate the cross validated precision
cv_precision = cross_val_score(clf,  X, y, cv=5, scoring='precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Calculate the cross validated recall 
cv_recall = cross_val_score(clf,  X, y, cv=5, scoring='recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Calculate the cross validated F1 score
cv_f1 = cross_val_score(clf,  X, y, cv=5, scoring='f1')
cv_f1 = np.mean(cv_f1)
cv_f1

# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({'Accuracy':cv_acc,
                          'Precision': cv_precision,
                          'Recall': cv_recall,
                          'F1': cv_f1}, index=[0])
cv_metrics.T.plot.bar(title='Cross-validated metrics (NB_fed)', legend=False)

**Feature Importance**

In [None]:
# Run for 'fd' data:

# Define preprocessors

# pipeline for text data1
text1_features = 'short_description_o'
text1_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data2
text2_features = 'description_o'
text2_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for text data3 (note: this reuses the same 'description_o' column as text data2)
text3_features = 'description_o'
text3_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data

categorical_features = ['country_code_o', 'status','category_list','category_groups_list','uuid_p','gender','country_code_p',
                       'featured_job_title','institution_name','degree_type','subject','is_completed']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not known')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# pipeline for numeric data
numeric_features = ['following', 'followers','polarity','subjectivity','rank_o','rank_p','num_events_part','per_exp_at_coy_start',
                     'degree_length','employee_count_min','employee_count_max']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


preprocessor = ColumnTransformer(
    transformers=[
        ("tex1", text1_transformer, text1_features),
        ("tex2", text2_transformer, text2_features),
        ("tex3", text3_transformer, text3_features),
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])

X = full_data.drop('success',axis=1)
y = full_data['success']


# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Set random seed
np.random.seed(42)

# Make a dictionary to keep model scores
model_score = {}
best_params = {}

# Build the pipeline: preprocessing followed by L1-regularised logistic regression
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(C=0.23357214690901212, max_iter=150,
                               penalty='l1', solver='liblinear')),
])

# Fit model
clf.fit(X_train, y_train)






In [None]:
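# Collect the output feature names from the fitted ColumnTransformer step.
# A Pipeline exposes its steps through named_steps, and get_feature_names()
# has been replaced by get_feature_names_out() in recent scikit-learn releases.
feature_names = clf.named_steps["preprocessor"].get_feature_names_out()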