# US Visa Approval Classification

The goal of this project is to predict whether a visa application is certified or denied, based on the given dataset.

The model can be used to recommend which applicant profiles should be certified or denied, based on the criteria that influence the decision.

The data consists of 25,480 rows and 12 columns.

# Importing required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

# Step 1: Preliminary Analysis

In [2]:
df = pd.read_csv(r"EasyVisa.csv")

In [3]:
df.head()

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


## Number of independent and dependent variables:

In [4]:
independent_vars = df.columns[:-1]
dependent_var = df.columns[-1]
print("Number of independent variables:", len(independent_vars))
print("Number of dependent variables:", 1)

Number of independent variables: 11
Number of dependent variables: 1


## Number of records:

In [5]:
print("Number of records:", len(df))

Number of records: 25480


In [6]:
df.shape

(25480, 12)

## Categorical Features

In [7]:
categorical_features=[feature for feature in df.columns if df[feature].dtypes=='O']
print(categorical_features)
print('Number of Categorical Features :', len(categorical_features))

['case_id', 'continent', 'education_of_employee', 'has_job_experience', 'requires_job_training', 'region_of_employment', 'unit_of_wage', 'full_time_position', 'case_status']
Number of Categorical Features : 9


## Unique counts of categories in columns

In [8]:
for feature in categorical_features:
    print("The feature is {} and the number of categories are {}".format(feature,len(df[feature].unique())))

The feature is case_id and the number of categories are 25480
The feature is continent and the number of categories are 6
The feature is education_of_employee and the number of categories are 4
The feature is has_job_experience and the number of categories are 2
The feature is requires_job_training and the number of categories are 2
The feature is region_of_employment and the number of categories are 5
The feature is unit_of_wage and the number of categories are 4
The feature is full_time_position and the number of categories are 2
The feature is case_status and the number of categories are 2


## Numerical Features

In [9]:
numerical_features = [feature for feature in df.columns if df[feature].dtype in ['int64', 'float64']]
print(numerical_features)
print('Num of Numerical Features :', len(numerical_features))

['no_of_employees', 'yr_of_estab', 'prevailing_wage']
Num of Numerical Features : 3


In [10]:
numerical_features=[feature for feature in df.columns if df[feature].dtypes!='O']
print("Number of numerical variable",len(numerical_features))
df[numerical_features].head()

Number of numerical variable 3


Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage
0,14513,2007,592.2029
1,2412,2002,83425.65
2,44444,2008,122996.86
3,98,1897,83434.03
4,1082,2005,149907.39


## Binary Features

In [11]:
for col in df.columns:
    if df[col].nunique() == 2:
        print(f"{col} might be binary")

has_job_experience might be binary
requires_job_training might be binary
full_time_position might be binary
case_status might be binary


## Discrete Features

In [12]:
discrete_features=[feature for feature in numerical_features if len(df[feature].unique())<=25]
print('We have {} discrete features : {}'.format(len(discrete_features), discrete_features))
print('Number of Discrete Features :',len(discrete_features))

We have 0 discrete features : []
Number of Discrete Features : 0


## Continuous Features

In [13]:
continuous_features=[feature for feature in numerical_features if len(df[feature].unique()) > 25]
print('\nWe have {} continuous_features : {}'.format(len(continuous_features), continuous_features))
print('Number of Continuous Features :',len(continuous_features))


We have 3 continuous_features : ['no_of_employees', 'yr_of_estab', 'prevailing_wage']
Number of Continuous Features : 3


## Data types of variables:

In [14]:
print("Data types of variables:")
df.info()

Data types of variables:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


In [15]:
# Percentage share of each category in the categorical columns
for col in categorical_features:
    print(df[col].value_counts(normalize=True) * 100)
    print('---------------------------')

case_id
EZYV01       0.003925
EZYV16995    0.003925
EZYV16993    0.003925
EZYV16992    0.003925
EZYV16991    0.003925
               ...   
EZYV8492     0.003925
EZYV8491     0.003925
EZYV8490     0.003925
EZYV8489     0.003925
EZYV25480    0.003925
Name: proportion, Length: 25480, dtype: float64
---------------------------
continent
Asia             66.173469
Europe           14.646782
North America    12.919937
South America     3.343799
Africa            2.162480
Oceania           0.753532
Name: proportion, dtype: float64
---------------------------
education_of_employee
Bachelor's     40.164835
Master's       37.810047
High School    13.422292
Doctorate       8.602826
Name: proportion, dtype: float64
---------------------------
has_job_experience
Y    58.092622
N    41.907378
Name: proportion, dtype: float64
---------------------------
requires_job_training
N    88.402669
Y    11.597331
Name: proportion, dtype: float64
---------------------------
region_of_employment
Northeast    2

**Insights**
 - `case_id` has a unique value for each row, so it carries no predictive information and can be dropped.
 - `continent` is heavily skewed towards Asia, so the remaining categories could be combined into a single category.
 - `unit_of_wage` seems to be an important column, as most contracts are yearly.
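
The second insight suggests merging the less frequent continents into one category. The notebook does not apply this step, so the following is only a minimal sketch of how it could look. The sample rows and the `continent_grouped` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the `continent` column; values are illustrative only.
sample = pd.DataFrame(
    {"continent": ["Asia", "Europe", "Asia", "Oceania", "Africa", "Asia"]}
)

# Keep "Asia" as-is and map every other continent to a single "Other" bucket.
sample["continent_grouped"] = sample["continent"].where(
    sample["continent"] == "Asia", "Other"
)

# Percentage share of each grouped category.
grouped_share = sample["continent_grouped"].value_counts(normalize=True) * 100
print(grouped_share)
```

Whether grouping helps depends on the model. It reduces the number of one-hot columns but also discards any signal that separates the smaller continents.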

## Summary Statistics

In [16]:
df.describe()

Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage
count,25480.0,25480.0,25480.0
mean,5667.04321,1979.409929,74455.814592
std,22877.928848,42.366929,52815.942327
min,-26.0,1800.0,2.1367
25%,1022.0,1976.0,34015.48
50%,2109.0,1997.0,70308.21
75%,3504.0,2005.0,107735.5125
max,602069.0,2016.0,319210.27
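
The summary above reports a minimum of -26 for `no_of_employees`, which cannot be a real head count. The notebook leaves these rows untouched. The sketch below shows one way to flag them, using hypothetical rows. Taking the absolute value is an assumption about the cause (a sign error at data entry), not something the dataset confirms.

```python
import pandas as pd

# Hypothetical rows; only the column name comes from the notebook.
sample = pd.DataFrame({"no_of_employees": [14513, -26, 2412, -11, 98]})

# Flag physically impossible values before deciding how to treat them.
negative_mask = sample["no_of_employees"] < 0
print("Rows with a negative employee count:", int(negative_mask.sum()))

# One possible treatment (an assumption): treat the sign as a data-entry error.
sample["no_of_employees_abs"] = sample["no_of_employees"].abs()
print(sample)
```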


# Step 2: Data Cleaning

## Checking for Null Values

In [17]:
df.isnull().sum()

case_id                  0
continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
yr_of_estab              0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
dtype: int64

## Checking for unique values

In [18]:
df.nunique()

case_id                  25480
continent                    6
education_of_employee        4
has_job_experience           2
requires_job_training        2
no_of_employees           7105
yr_of_estab                199
region_of_employment         5
prevailing_wage          25454
unit_of_wage                 4
full_time_position           2
case_status                  2
dtype: int64

## All Features

In [19]:
df.columns

Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
       'requires_job_training', 'no_of_employees', 'yr_of_estab',
       'region_of_employment', 'prevailing_wage', 'unit_of_wage',
       'full_time_position', 'case_status'],
      dtype='object')

## Checking for low variance:

In [20]:
# Checking for columns with a single unique value
low_variance_cols = [col for col in df.columns if df[col].nunique() <= 1]
print("Columns with low variance (possible candidates for removal):", low_variance_cols)

Columns with low variance (possible candidates for removal): []


## Checking for and handling duplicates

In [21]:
# Checking for duplicates
if df.duplicated().any():
    print("Duplicates found:", df.duplicated().sum())
    df = df.drop_duplicates()
    print("Duplicates have been removed.")
else:
    print("No duplicates found.")

No duplicates found.


## Removing Case_Id

In [22]:
df.drop('case_id', inplace=True, axis=1)

## Handling Missing Values

In [23]:
# Checking for missing values in all columns
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0]

if not missing_data.empty:
    print("Missing values found in the following columns:")
    print(missing_data)
    # Handling missing values (example using median imputation for numerical columns)
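    for column in df.columns:
        # pd.api.types.is_numeric_dtype covers every int and float dtype;
        # comparing dtype with np.number does not reliably detect them
        if pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].median())
        else:
            df[column] = df[column].fillna(df[column].mode()[0])  # For categorical data, using mode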
    print("Missing values have been handled.")
else:
    print("No missing values found.")

No missing values found.


## Copy of Original Data

In [24]:
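# Note: this binds a second name to the same DataFrame rather than making an
# independent copy, so any change made through df1 is also visible through df.
df1 = df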
df1.head()

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


# Step 3: Feature Engineering

## Feature Extraction

In [25]:
# importing date class from datetime module
from datetime import date
  
# creating the date object of today's date
todays_date = date.today()
current_year= todays_date.year
print(current_year)

2024


In [26]:
df1['company_age'] = current_year-df1['yr_of_estab']

In [27]:
df1.drop('yr_of_estab', inplace=True, axis=1)

In [28]:
df1.head()

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status,company_age
0,Asia,High School,N,N,14513,West,592.2029,Hour,Y,Denied,17
1,Asia,Master's,Y,N,2412,Northeast,83425.65,Year,Y,Certified,22
2,Asia,Bachelor's,N,Y,44444,West,122996.86,Year,Y,Denied,16
3,Asia,Bachelor's,N,N,98,West,83434.03,Year,Y,Denied,127
4,Africa,Master's,Y,N,1082,South,149907.39,Year,Y,Certified,19
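
Because `current_year` comes from `date.today()`, `company_age` shifts by one every calendar year, and a rerun will not reproduce the values above. A minimal sketch of a reproducible variant follows. The `REFERENCE_YEAR` constant is an assumption introduced here, set to the year the notebook printed.

```python
import pandas as pd

# Assumption: pin the reference year instead of calling date.today(),
# so that company_age does not change between runs.
REFERENCE_YEAR = 2024  # matches the year printed by the notebook

# The first five yr_of_estab values shown in df.head().
sample = pd.DataFrame({"yr_of_estab": [2007, 2002, 2008, 1897, 2005]})
sample["company_age"] = REFERENCE_YEAR - sample["yr_of_estab"]
print(sample["company_age"].tolist())  # [17, 22, 16, 127, 19]
```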


# Step 4: Train & Test Split

In [29]:
X = df1.drop('case_status', axis=1)
y = df1['case_status']

**Manually encoding the target column**

In [30]:
# Encode the target: 'Denied' becomes 1, every other value (here 'Certified') becomes 0
y = np.where(y == 'Denied', 1, 0)

## **Feature Encoding and Scaling**

In [31]:
num_features = list(X.select_dtypes(exclude="object").columns)

In [32]:
num_features

['no_of_employees', 'prevailing_wage', 'company_age']

### **Preprocessing using Column Transformer**

In [33]:
# Create Column Transformer with 3 types of transformers
or_columns = ['has_job_experience','requires_job_training','full_time_position','education_of_employee']
oh_columns = ['continent','unit_of_wage','region_of_employment']
transform_columns= ['no_of_employees','company_age']

In [34]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler,OrdinalEncoder, PowerTransformer
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline

In [35]:
numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()
ordinal_encoder = OrdinalEncoder()

In [36]:
transform_pipe = Pipeline(steps=[
    ('transformer', PowerTransformer(method='yeo-johnson'))
])

In [37]:
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, oh_columns),
        ("Ordinal_Encoder", ordinal_encoder, or_columns),
        ("Transformer", transform_pipe, transform_columns),
        ("StandardScaler", numeric_transformer, num_features)
    ]
)

In [38]:
X = preprocessor.fit_transform(X)

### **Resampling**

In [39]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [40]:
from imblearn.combine import SMOTETomek, SMOTEENN

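# SMOTEENN oversamples with SMOTE, then removes noisy samples with Edited Nearest Neighbours.
# sampling_strategy='minority' resamples only the minority class; change it as required.
smt = SMOTEENN(random_state=42, sampling_strategy='minority')
# Fit the resampler and generate the resampled data.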
X_res, y_res = smt.fit_resample(X, y)

In [41]:
from sklearn.model_selection import  train_test_split
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X_res,y_res,test_size=0.3,random_state=42)
X_train.shape, X_test.shape

((11933, 24), (5115, 24))
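
In the cells above, the preprocessor is fitted and the data is resampled before the split, so the test set contains synthetic samples and statistics learned from the full dataset. This can make test scores look better than they would on unseen applications. The sketch below shows an alternative ordering on synthetic data. It is an illustration of the idea under stated assumptions, not a rerun of the notebook, and all names ending in `_demo`, `_tr` or `_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                      # synthetic labels

# 1. Split first, stratifying so both parts keep the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# 2. Fit the scaler on the training part only, then reuse it on the test part.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# 3. Any resampling (e.g. SMOTEENN) would be applied to X_tr_scaled / y_tr only,
#    leaving the test part as untouched, real observations.
print(X_tr_scaled.shape, X_te_scaled.shape)
```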

## Print the number of records in the training and testing data

In [42]:
print(f"Training records: {len(X_train)}, Testing records: {len(X_test)}")

Training records: 11933, Testing records: 5115


## Save training data to CSV

In [43]:
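# Name the target column so it does not clash with feature column 0 in the CSV header
train_data = pd.concat([pd.DataFrame(X_train), pd.DataFrame(y_train, columns=['case_status'])], axis=1)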
train_data.to_csv('visa_training_data.csv', index=False)
print("Training data has been saved to 'visa_training_data.csv'.")

Training data has been saved to 'visa_training_data.csv'.


## Save testing data to CSV

In [44]:
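# Name the target column so it does not clash with feature column 0 in the CSV header
test_data = pd.concat([pd.DataFrame(X_test), pd.DataFrame(y_test, columns=['case_status'])], axis=1)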
test_data.to_csv('visa_testing_data.csv', index=False)
print("Testing data has been saved to 'visa_testing_data.csv'.")

Testing data has been saved to 'visa_testing_data.csv'.


# Step 6: Logistic Regression using SkLearn

In [45]:
from sklearn.linear_model import LogisticRegression

In [46]:
from sklearn.metrics import accuracy_score, classification_report,ConfusionMatrixDisplay, \
                            precision_score, recall_score, f1_score, roc_auc_score,roc_curve 

In [47]:
# Initializing the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

In [48]:
lrmodel=log_reg.fit(X_train, y_train)

In [49]:
# Make predictions
y_train_pred = lrmodel.predict(X_train)
y_test_pred = lrmodel.predict(X_test)

In [50]:
def evaluate_clf(true, predicted):
    acc = accuracy_score(true, predicted) # Calculate Accuracy
    f1 = f1_score(true, predicted) # Calculate F1-score
    precision = precision_score(true, predicted) # Calculate Precision
    recall = recall_score(true, predicted)  # Calculate Recall
    roc_auc = roc_auc_score(true, predicted)  # ROC AUC from hard 0/1 labels (equals balanced accuracy)
    return acc, f1 , precision, recall, roc_auc
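
`evaluate_clf` passes hard 0/1 predictions to `roc_auc_score`, which yields the balanced accuracy rather than a threshold-free AUC. The sketch below, on synthetic data from `make_classification`, contrasts that with an AUC computed from predicted probabilities. The variable names are illustrative and the data is not the visa dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

hard_labels = clf.predict(X_demo)                 # 0/1 predictions
positive_proba = clf.predict_proba(X_demo)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_demo, hard_labels)
auc_from_proba = roc_auc_score(y_demo, positive_proba)
balanced_acc = balanced_accuracy_score(y_demo, hard_labels)

print("AUC from hard labels:  ", round(auc_from_labels, 4))
print("Balanced accuracy:     ", round(balanced_acc, 4))
print("AUC from probabilities:", round(auc_from_proba, 4))
```

For the notebook's models, the equivalent change would be to pass `lrmodel.predict_proba(X_test)[:, 1]` as the score argument.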

In [51]:
# Training set performance
model_train_accuracy, model_train_f1,model_train_precision,\
        model_train_recall,model_train_rocauc_score=evaluate_clf(y_train ,y_train_pred)

In [52]:
print('Model performance for Training set')
print("- Accuracy: {:.4f}".format(model_train_accuracy))
print('- F1 score: {:.4f}'.format(model_train_f1)) 
print('- Precision: {:.4f}'.format(model_train_precision))
print('- Recall: {:.4f}'.format(model_train_recall))
print('- Roc Auc Score: {:.4f}'.format(model_train_rocauc_score))

Model performance for Training set
- Accuracy: 0.7379
- F1 score: 0.7542
- Precision: 0.7659
- Recall: 0.7429
- Roc Auc Score: 0.7374


In [53]:
# Test set performance
model_test_accuracy,model_test_f1,model_test_precision,\
        model_test_recall,model_test_rocauc_score=evaluate_clf(y_test, y_test_pred)

In [54]:
print('Model performance for Test set')
print('- Accuracy: {:.4f}'.format(model_test_accuracy))
print('- F1 score: {:.4f}'.format(model_test_f1))
print('- Precision: {:.4f}'.format(model_test_precision))
print('- Recall: {:.4f}'.format(model_test_recall))
print('- Roc Auc Score: {:.4f}'.format(model_test_rocauc_score))

Model performance for Test set
- Accuracy: 0.7398
- F1 score: 0.7520
- Precision: 0.7691
- Recall: 0.7357
- Roc Auc Score: 0.7401


# Step 7: Logistic Regression Using Statsmodels

In [55]:
import statsmodels.api as sm

In [56]:
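# statsmodels does not add an intercept by default, so add a constant column manually.
# Note: this overwrites X_train, so the models fitted later also see the constant column.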
X_train = sm.add_constant(X_train)

In [57]:
# Initialize and fit the logistic regression model using Statsmodels
lrstats = sm.Logit(y_train, X_train)
lrstats_results = lrstats.fit()

Optimization terminated successfully.
         Current function value: 0.504951
         Iterations 8


In [58]:
print(lrstats_results.summary())

                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                11933
Model:                          Logit   Df Residuals:                    11908
Method:                           MLE   Df Model:                           24
Date:                Thu, 09 May 2024   Pseudo R-squ.:                  0.2679
Time:                        03:40:23   Log-Likelihood:                -6025.6
converged:                       True   LL-Null:                       -8230.5
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.2256   1.56e+06   7.85e-07      1.000   -3.06e+06    3.06e+06
x1            -0.1554   1.53e+06  -1.02e-07      1.000   -2.99e+06    2.99e+06
x2             0.6129   1.53e+06   4.01e-07      1.0

In [59]:
# # Make predictions
# X_test= sm.add_constant(X_test)
# y_train_pred = lrstats_results.predict(X_train)
# y_test_pred = lrstats_results.predict(X_test)

In [61]:
# The prediction will be in terms of probabilities for the presence of the target class
X_test= sm.add_constant(X_test)
y_test_pred = lrstats_results.predict(X_test)
predictions = np.where(y_test_pred < 0.5, 0, 1)  # Converting probabilities to binary outcomes
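
The 0.5 cut-off used above is a convention, not a requirement. Since a denial is encoded as 1, moving the threshold trades precision against recall for the "Denied" class. The sketch below illustrates this with a handful of hypothetical probabilities and labels. None of these numbers come from the fitted model.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only.
proba = np.array([0.10, 0.40, 0.55, 0.62, 0.80, 0.95])
truth = np.array([0, 0, 1, 0, 1, 1])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Same rule as the notebook: below the threshold -> 0, otherwise -> 1.
    labels = np.where(proba < threshold, 0, 1)
    true_pos = int(np.sum((labels == 1) & (truth == 1)))
    precision = true_pos / max(int(labels.sum()), 1)  # guard against zero predictions
    recall = true_pos / int(truth.sum())
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```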

In [62]:
# Evaluating the model
model_accuracy = accuracy_score(y_test, predictions)
print("Logistic Regression using statsmodels Accuracy:", model_accuracy)
print("Classification Report of Logistic Regression using statsmodels:\n", classification_report(y_test, predictions))

Logistic Regression using statsmodels Accuracy: 0.7397849462365591
Classification Report of Logistic Regression using statsmodels:
               precision    recall  f1-score   support

           0       0.71      0.74      0.73      2372
           1       0.77      0.74      0.75      2743

    accuracy                           0.74      5115
   macro avg       0.74      0.74      0.74      5115
weighted avg       0.74      0.74      0.74      5115



# Step 8: Random Forest Using SkLearn

In [63]:
from sklearn.ensemble import RandomForestClassifier

In [64]:
rfc = RandomForestClassifier()

In [65]:
rfcmodel = rfc.fit(X_train, y_train)

In [66]:
# Make predictions
y_train_pred = rfcmodel.predict(X_train)
y_test_pred = rfcmodel.predict(X_test)

In [67]:
# Training set performance
rfcmodel_train_accuracy, rfcmodel_train_f1,rfcmodel_train_precision,\
        rfcmodel_train_recall,rfcmodel_train_rocauc_score=evaluate_clf(y_train ,y_train_pred)

In [68]:
print('Model performance for Training set')
print("- Accuracy: {:.4f}".format(rfcmodel_train_accuracy))
print('- F1 score: {:.4f}'.format(rfcmodel_train_f1)) 
print('- Precision: {:.4f}'.format(rfcmodel_train_precision))
print('- Recall: {:.4f}'.format(rfcmodel_train_recall))
print('- Roc Auc Score: {:.4f}'.format(rfcmodel_train_rocauc_score))

Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000


In [69]:
# Test set performance
rfcmodel_test_accuracy,rfcmodel_test_f1,rfcmodel_test_precision,\
        rfcmodel_test_recall,rfcmodel_test_rocauc_score=evaluate_clf(y_test, y_test_pred)

In [70]:
print('Model performance for Test set')
print('- Accuracy: {:.4f}'.format(rfcmodel_test_accuracy))
print('- F1 score: {:.4f}'.format(rfcmodel_test_f1))
print('- Precision: {:.4f}'.format(rfcmodel_test_precision))
print('- Recall: {:.4f}'.format(rfcmodel_test_recall))
print('- Roc Auc Score: {:.4f}'.format(rfcmodel_test_rocauc_score))

Model performance for Test set
- Accuracy: 0.9419
- F1 score: 0.9460
- Precision: 0.9434
- Recall: 0.9486
- Roc Auc Score: 0.9414


In [71]:
# Evaluating the model
rfcaccuracy = accuracy_score(y_test, y_test_pred)
print("Random Forests using sklearn Accuracy:", rfcaccuracy)
print("Classification Report of Random Forests using sklearn:\n", classification_report(y_test, y_test_pred))

Random Forests using sklearn Accuracy: 0.9419354838709677
Classification Report of Random Forests using sklearn:
               precision    recall  f1-score   support

           0       0.94      0.93      0.94      2372
           1       0.94      0.95      0.95      2743

    accuracy                           0.94      5115
   macro avg       0.94      0.94      0.94      5115
weighted avg       0.94      0.94      0.94      5115
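
A training score of 1.0000 against a test score of about 0.94 suggests the forest memorizes the training data to some degree. Cross-validation gives a more stable estimate than a single split. The following is a minimal sketch on synthetic data. The `n_estimators=50` and `random_state=42` settings are assumptions chosen for the example, whereas the notebook uses the default `RandomForestClassifier()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Fixing random_state makes the forest reproducible between runs.
forest = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validated accuracy: one score per held-out fold.
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", round(cv_scores.mean(), 3))
```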



# Step 9: Random Forest Using XGBoost

In [72]:
from xgboost import XGBRFClassifier

In [73]:
# Initialize the XGBoost Random Forest Classifier
xgbrf = XGBRFClassifier(n_estimators=100, use_label_encoder=False, eval_metric='logloss')

In [74]:
# Fit the model
model = xgbrf.fit(X_train, y_train)

In [75]:
# Predict on the training and test data
prediction_train = model.predict(X_train)
prediction_test = model.predict(X_test)

In [76]:
# Calculate accuracy on the training and test data
accuracy_train = accuracy_score(y_train, prediction_train)
accuracy_test = accuracy_score(y_test, prediction_test)

In [77]:
# Output the accuracy
print("Accuracy on training data using XGBoost: {:.3f}".format(accuracy_train))
print("Accuracy on test data using XGBoost: {:.3f}".format(accuracy_test))

Accuracy on training data using XGBoost: 0.865
Accuracy on test data using XGBoost: 0.865


In [78]:
# Evaluate Random Forest from XGBoost
print("Random Forest Accuracy using XGBoost:", accuracy_test)
print("Random Forest using XGBoost Classification Report:\n", classification_report(y_test, prediction_test))

Random Forest Accuracy using XGBoost: 0.8651026392961877
Random Forest using XGBoost Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.90      0.86      2372
           1       0.91      0.83      0.87      2743

    accuracy                           0.87      5115
   macro avg       0.87      0.87      0.86      5115
weighted avg       0.87      0.87      0.87      5115

