# US Visa Prediction Project

## Life Cycle of Machine Learning Project

- Understanding the Problem Statement
- Data Collection 
- Exploratory Data Analysis
- Data Cleaning
- Data Pre-Processing
- Model Training
- Choosing the Best Model

# About

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts in the workplace and sets requirements that employers must meet when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

# 1) Problem statement.
OFLC processes job certification applications from employers seeking to bring foreign workers into the United States and grants certifications.
Because the number of applicants last year was very large, OFLC needs machine learning models to shortlist visa applicants based on historical data.
In this project we use the given data to build a classification model:

This model predicts whether a visa is approved or not, based on the given dataset.
It can be used to recommend whether an applicant's visa should be certified or denied, based on the criteria that influence the decision.
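
Framing the task as binary classification means the `case_status` column becomes a 0/1 target. The snippet below is a minimal sketch of that encoding, assuming the only two labels are `Certified` and `Denied` (the two values visible in the preview of the data); the variable names and the choice of `Certified` as the positive class are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy series with the case_status values of the first five records;
# in the notebook this would be df["case_status"].
case_status = pd.Series(
    ["Denied", "Certified", "Denied", "Denied", "Certified"],
    name="case_status",
)

# Assumption: "Certified" is treated as the positive class (1).
target = case_status.map({"Certified": 1, "Denied": 0})

print(target.tolist())
```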

# 2) Data Collection.

The dataset comes from the Office of Foreign Labor Certification (OFLC).
The data consists of 25480 rows and 12 columns.

https://www.kaggle.com/datasets/moro23/easyvisa-dataset

# 2.1 Import Data and Required Packages

#### Importing Pandas, NumPy, Matplotlib, Seaborn, Plotly and Warnings Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings 

warnings.filterwarnings("ignore")

%matplotlib inline




### Import the CSV Data as Pandas DataFrame

In [5]:
df = pd.read_csv('EasyVisa.csv')


### Show Top 5 Records

In [6]:
df.head(5)

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |


### Shape of the dataset

In [7]:
df.shape

(25480, 12)

### Summary of the data

In [9]:
df.describe()

| | no_of_employees | yr_of_estab | prevailing_wage |
|---|---|---|---|
| count | 25480.0 | 25480.0 | 25480.0 |
| mean | 5667.04321 | 1979.409929 | 74455.814592 |
| std | 22877.928848 | 42.366929 | 52815.942327 |
| min | -26.0 | 1800.0 | 2.1367 |
| 25% | 1022.0 | 1976.0 | 34015.48 |
| 50% | 2109.0 | 1997.0 | 70308.21 |
| 75% | 3504.0 | 2005.0 | 107735.5125 |
| max | 602069.0 | 2016.0 | 319210.27 |


In [10]:
# check the information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


# Exploring Data

In [15]:
df['unit_of_wage'].dtype

dtype('O')

In [22]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 3 numerical features : ['no_of_employees', 'yr_of_estab', 'prevailing_wage']

We have 9 categorical features : ['case_id', 'continent', 'education_of_employee', 'has_job_experience', 'requires_job_training', 'region_of_employment', 'unit_of_wage', 'full_time_position', 'case_status']


In [23]:
# proportion of count data on categorical columns 

for col in categorical_features:
    print(df[col].value_counts(normalize=True) * 100 )

    print("****--------------*****")

case_id
EZYV01       0.003925
EZYV16995    0.003925
EZYV16993    0.003925
EZYV16992    0.003925
EZYV16991    0.003925
               ...   
EZYV8492     0.003925
EZYV8491     0.003925
EZYV8490     0.003925
EZYV8489     0.003925
EZYV25480    0.003925
Name: proportion, Length: 25480, dtype: float64
****--------------*****
continent
Asia             66.173469
Europe           14.646782
North America    12.919937
South America     3.343799
Africa            2.162480
Oceania           0.753532
Name: proportion, dtype: float64
****--------------*****
education_of_employee
Bachelor's     40.164835
Master's       37.810047
High School    13.422292
Doctorate       8.602826
Name: proportion, dtype: float64
****--------------*****
has_job_experience
Y    58.092622
N    41.907378
Name: proportion, dtype: float64
****--------------*****
requires_job_training
N    88.402669
Y    11.597331
Name: proportion, dtype: float64
****--------------*****
region_of_employment
Northeast    28.237834
South      

## Insights

- case_id has a unique value for each row, so it can be dropped as it is of no importance.

- The continent column is highly skewed towards Asia, hence we can combine the other categories to form a single category.

- unit_of_wage seems to be an important column, as most of the contracts are yearly.
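
The first two insights above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the toy frame `sample`, the helper `apply_insights`, and the label `"Other"` for the merged continents are hypothetical names, not taken from the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real `df`; the values are illustrative only.
sample = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03", "EZYV04", "EZYV05"],
    "continent": ["Asia", "Asia", "Europe", "Africa", "Oceania"],
    "unit_of_wage": ["Hour", "Year", "Year", "Year", "Year"],
})


def apply_insights(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier column and merge all non-Asia continents."""
    # drop() returns a new frame, so the input is left unchanged.
    out = frame.drop(columns=["case_id"])
    # Keep "Asia" as-is; replace every other continent with "Other"
    # (hypothetical label for the combined category).
    out["continent"] = out["continent"].where(out["continent"] == "Asia", "Other")
    return out


cleaned = apply_insights(sample)
print(cleaned["continent"].value_counts())
# Share of each wage unit, in percent, to check how dominant "Year" is.
print(cleaned["unit_of_wage"].value_counts(normalize=True) * 100)
```

On the real data the same helper would be called as `apply_insights(df)`; whether merging the smaller continents actually helps the model is something to verify during training.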

# Univariate Analysis

- The term univariate analysis refers to the analysis of one variable; the prefix "uni" means "one". The purpose of univariate analysis is to understand the distribution of values for a single variable.

Other types of analysis are:
 - Bivariate Analysis: The analysis of two variables

 - Multivariate Analysis: The analysis of two or more variables.
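
To make the distinction concrete, here is a small sketch of the three kinds of analysis with pandas. It uses a toy frame built from the first five records shown earlier; the variable names (`sample`, `univariate`, `bivariate`, `multivariate`) are illustrative, and the specific column pairings are just one possible choice.

```python
import pandas as pd

# Toy frame built from the first five records of the dataset preview.
sample = pd.DataFrame({
    "education_of_employee": ["High School", "Master's", "Bachelor's", "Bachelor's", "Master's"],
    "has_job_experience": ["N", "Y", "N", "N", "Y"],
    "prevailing_wage": [592.2029, 83425.65, 122996.86, 83434.03, 149907.39],
    "case_status": ["Denied", "Certified", "Denied", "Denied", "Certified"],
})

# Univariate: summarise a single variable.
univariate = sample["prevailing_wage"].describe()

# Bivariate: relate two variables to each other.
bivariate = pd.crosstab(sample["has_job_experience"], sample["case_status"])

# Multivariate: here, three variables in one table.
multivariate = pd.crosstab(
    [sample["education_of_employee"], sample["has_job_experience"]],
    sample["case_status"],
)

print(univariate)
print(bivariate)
print(multivariate)
```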


## Univariate Analysis on Numerical data