# Entrepreneurial Competency in University Students Classification

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict entrepreneurial competency in university students?

# 2. Data

## Context

The dataset was collected in 2019 by Naman Manchanda and skhiearth for research purposes. We worked with the data and published a research paper with IEEE titled "Predicting Entrepreneurial competency in university students using Machine Learning algorithms". The research paper can be found here: https://ieeexplore.ieee.org/abstract/document/9058292

## Content

The dataset comprises 16 features collected from university students in India. The target variable indicates whether the student is likely to become an entrepreneur or not.

# 3. Evaluation

As this is a classification problem, we will use classification metrics to evaluate the model.
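As a minimal sketch of what these classification metrics measure, the snippet below computes accuracy, precision, recall and F1 by hand. The `y_true` and `y_pred` lists are made-up toy labels for illustration, not values from the dataset.

```python
# Toy labels for illustration only (not taken from the dataset).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Confusion-matrix counts for the positive class (label 1).
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)    # share of correct predictions
precision = tp / (tp + fp)           # predicted positives that are correct
recall = tp / (tp + fn)              # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f'accuracy={accuracy:.2f} precision={precision:.2f} '
      f'recall={recall:.2f} f1={f1:.2f}')
```

The notebook itself relies on scikit-learn's `classification_report`, `precision_score`, `recall_score` and `f1_score` for the same quantities.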

# 4. Features

## Inputs / Features

    EducationSector - Engineering background or not
    IndividualProject - If the student builds personal project
    Age - Age of student
    Gender - Sex of student
    City - If the student stays in a city
    Influenced - If the student is influenced by someone
    Perseverance - Rating of a student based upon perseverance
    DesireToTakeInitiative - Rating of a student based upon desire to take initiative
    Competitiveness - Competitive rating
    SelfReliance - Self reliance rating
    StrongNeedToAchieve - Strong need to achieve a goal rating
    SelfConfidence - Self confidence rating
    GoodPhysicalHealth - Good physical health rating
    MentalDisorder - If there is any mental disorder
    KeyTraits - Key traits of the student
    ReasonsForLack - Reason for lack of entrepreneurship culture

## Output / Label

    y - Whether the student is likely to become an entrepreneur or not

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the dataset

In [None]:
# Local
# df = pd.read_csv('data.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/entrepreneurial-competency-in-university-students/data.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df['ReasonsForLack'].unique()

We will fill the NaN values with 'No Reason'.

In [None]:
df['ReasonsForLack'] = df['ReasonsForLack'].fillna('No Reason')

In [None]:
df.isnull().sum()

In [None]:
df['ReasonsForLack'] = df['ReasonsForLack'].str.replace('Just not interested! (Want to work in the corporate sector, or for the government or pursue research or something else)',
                                 'Just not interested, ',regex=False)

In [None]:
df['ReasonsForLack'].unique()

In [None]:
df['ReasonsForLack'] = df['ReasonsForLack'].str.replace(',,',
                                 ',',regex=False)
df['ReasonsForLack'] = df['ReasonsForLack'].str.replace('interested, ',
                                                       'interested')

In [None]:
df['ReasonsForLack'].unique()

In [None]:
df['ReasonsForLack'].str.split(', ', expand=True)[0].unique()

In [None]:
df['No Reason'] = df['ReasonsForLack'].str.contains('No Reason')
df['Just not interested'] = df['ReasonsForLack'].str.contains('Just not interested')
df['waiting for future relocation'] = df['ReasonsForLack'].str.contains('waiting for future relocation')
df['Financial Risk'] = df['ReasonsForLack'].str.contains('Not able to take a Financial Risk')
df['Academic Pressure'] = df['ReasonsForLack'].str.contains('Academic Pressure')
df['Lack of Knowledge'] = df['ReasonsForLack'].str.contains('Lack of Knowledge')
df['Unwillingness to take risk'] = df['ReasonsForLack'].str.contains('Unwillingness to take risk')
df['Parental Pressure'] = df['ReasonsForLack'].str.contains('Parental Pressure')
df['Mental Block'] = df['ReasonsForLack'].str.contains('Mental Block')
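The cell above builds one indicator column per reason with `str.contains`. A possible alternative, sketched here on a small made-up Series (the `reasons` values are illustrative, not the real column), is pandas' `Series.str.get_dummies`, which splits on a separator and creates one column per distinct value. Note that the resulting column names come straight from the raw strings, so they would differ from the hand-picked names used in this notebook.

```python
import pandas as pd

# Toy multi-valued column for illustration only.
reasons = pd.Series([
    'No Reason',
    'Academic Pressure, Lack of Knowledge',
    'Just not interested',
    'Lack of Knowledge, Mental Block',
])

# One 0/1 column per distinct reason, converted to booleans.
indicators = reasons.str.get_dummies(sep=', ').astype(bool)
print(indicators)
```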

In [None]:
df = df.drop('ReasonsForLack', axis=1)

In [None]:
df

In [None]:
df.info()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of labels')
sns.countplot(data=df, x='y');

The labels are balanced, so we will use accuracy and F1 score for the evaluation.
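Besides the count plot, class balance can also be checked numerically. The sketch below uses a made-up `labels` Series as a stand-in for `df['y']`; the values are illustrative only.

```python
import pandas as pd

# Toy labels standing in for df['y'] (illustrative values).
labels = pd.Series([1, 0, 0, 1, 1, 0, 1, 0], name='y')

# Share of each class; values close to 0.5 indicate a balanced binary target.
shares = labels.value_counts(normalize=True)
print(shares)

# Ratio of the largest to the smallest class share (1.0 means perfectly balanced).
imbalance_ratio = shares.max() / shares.min()
print(f'Imbalance ratio: {imbalance_ratio:.2f}')
```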

# 5. Modelling

In [None]:
X = df.drop('y', axis=1)
y = df['y']
X = pd.get_dummies(X, drop_first=True)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
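The scaler is fitted on the training split only and then applied to the test split. A related option, shown here as a sketch on synthetic data from `make_classification` (the `X_demo`, `y_demo` and `pipe` names are assumptions for illustration), is to bundle the scaler and the model in a scikit-learn pipeline, so that the scaler is also re-fitted inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the entrepreneurial competency dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling and the classifier are fitted together as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Within each fold the scaler only sees that fold's training part.
fold_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
print(f'CV accuracy: {fold_scores.mean():.3f}')

# Final fit on the full training split, scored on the held-out test split.
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.3f}')
```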

## Model imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Baseline Model Scores

In [None]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        model_scores[name] = model.score(X_test,y_test)
        model_recall[name] = recall_score(y_test, y_preds)
        model_f1[name] = f1_score(y_test, y_preds)
        model_precision[name] = precision_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
    model_recall = pd.DataFrame(model_recall, index=['Recall']).transpose()
    model_recall = model_recall.sort_values('Recall')
    model_f1 = pd.DataFrame(model_f1, index=['F1']).transpose()
    model_f1 = model_f1.sort_values('F1')
    model_precision = pd.DataFrame(model_precision, index=['Precision']).transpose()
    model_precision = model_precision.sort_values('Precision')
        
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores

## Baseline Evaluation Using Cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def get_baseline_cv_scores(models, X, y, cv=5):
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    # Cross-validate every model in the dict passed as `models`
    for name, model in models.items():
        
        print(name)
        cv_accuracy = cross_val_score(model,X,y,cv=cv,
                             scoring='accuracy')
        print(f'Cross Validation accuracy Scores: {cv_accuracy}')
        print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

        cv_precision = cross_val_score(model,X,y,cv=cv,
                             scoring='precision')
        print(f'Cross Validation precision Scores: {cv_precision}')
        print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

        cv_recall = cross_val_score(model,X,y,cv=cv,
                             scoring='recall')
        print(f'Cross Validation recall Scores: {cv_recall}')
        print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

        cv_f1 = cross_val_score(model,X,y,cv=cv,
                             scoring='f1')
        print(f'Cross Validation f1 Scores: {cv_f1}')
        print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}') 
        print('\n')

        # Store each mean under its matching metric
        model_scores[name] = cv_accuracy.mean()
        model_precision[name] = cv_precision.mean()
        model_recall[name] = cv_recall.mean()
        model_f1[name] = cv_f1.mean()
    
    return model_scores, model_recall, model_f1, model_precision

In [None]:
model_scores, model_recall, model_f1, model_precision = get_baseline_cv_scores(models, X_train, y_train, cv=5)
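The helper above calls `cross_val_score` once per metric, which refits each model four times per fold. As a sketch of a possible alternative, scikit-learn's `cross_validate` accepts a list of scorers and evaluates them all in one pass. The example uses synthetic data from `make_classification`; `X_demo`, `y_demo` and `summary` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the entrepreneurial competency dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

metrics = ['accuracy', 'precision', 'recall', 'f1']

# One cross-validation run that scores every metric on each fold.
results = cross_validate(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=5, scoring=metrics)

# cross_validate stores per-fold scores under 'test_<metric>' keys.
summary = pd.Series({m: results[f'test_{m}'].mean() for m in metrics})
print(summary)
```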

Since most of the models perform well, we will use LogisticRegression for the final model, as it is a faster and simpler model.

# 6. Model Evaluation

In [None]:
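from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; the display class replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

## ROC curve

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; the display class replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)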

## Feature importance

In [None]:
model.coef_

In [None]:
feat_importances = pd.DataFrame(model.coef_[0], index=X.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Features Importance')
plt.xticks(rotation=90)
sns.barplot(data=feat_importances.sort_values(0).T);
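Because the inputs were standardized before fitting, the coefficient magnitudes are on a roughly comparable scale, with the sign giving the direction of the effect. One way to read them, sketched below with made-up coefficient values (the `feature_a` to `feature_c` names are hypothetical), is to rank features by absolute coefficient.

```python
import pandas as pd

# Hypothetical coefficients for illustration only.
coefs = pd.Series({'feature_a': 0.8, 'feature_b': -1.5, 'feature_c': 0.1})

# Order features by the size of their effect, ignoring the sign.
order = coefs.abs().sort_values(ascending=False).index
ranked = coefs.reindex(order)
print(ranked)
```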

## Evaluation using Cross-Validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    
    
    cv_accuracy = cross_val_score(model,X,y,cv=cv,
                         scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')
    
    cv_precision = cross_val_score(model,X,y,cv=cv,
                         scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')
    
    cv_recall = cross_val_score(model,X,y,cv=cv,
                         scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')
    
    cv_f1 = cross_val_score(model,X,y,cv=cv,
                         scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')   
    
    # Collect the mean of each metric; 'f1' uses the F1 scores, not recall
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])
    
    return cv_metrics

In [None]:
cv_metrics = get_cv_score(model, X_train, y_train, cv=10)

In [None]:
cv_metrics

The model performs very well, with the following 10-fold cross-validation scores:

    Accuracy: 100%
    Precision: 100%
    Recall: 100%
    F1: 100%