# EasyVisa Project

## Context:

Business communities in the United States face high demand for human resources, and a constant challenge is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

## Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The growing number of applicants calls for a machine-learning-based solution that can help shortlist the candidates with a higher chance of visa approval. OFLC has hired your firm, EasyVisa, for data-driven solutions. As a data scientist, you have to analyze the data provided and, with the help of a classification model:

* Facilitate the process of visa approvals.
* Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status. 


## Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

* case_id: ID of each visa application
* continent: Continent of the employee
* education_of_employee: Education level of the employee
* has_job_experience: Does the employee have any job experience? Y = Yes; N = No
* requires_job_training: Does the employee require any job training? Y = Yes; N = No 
* no_of_employees: Number of employees in the employer's company
* yr_of_estab: Year in which the employer's company was established
* region_of_employment: Information of foreign worker's intended region of employment in the US.
* prevailing_wage:  Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment. 
* unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
* full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
* case_status:  Flag indicating if the Visa was certified or denied

## Importing necessary libraries and data

In [None]:
# Consolidated list of the dependencies used for regression and classification in this notebook.
# Formats the code in each cell automatically (good coding practice)
%load_ext nb_black

# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To split the data and tune hyperparameters
from sklearn.model_selection import train_test_split, GridSearchCV

# To build decision tree classifiers
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# To build linear and logistic regression models
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LinearRegression, LogisticRegression

##!pip install -U scikit-learn --user

# To get different metric scores
# Note: plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

In [None]:
EVP = pd.read_csv("EVP.csv")
# copying data to other variables to avoid any changes to the original data
df = EVP.copy()
data = EVP.copy()

## Data Overview

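Observations:
- There are 25480 records. The data dictionary lists 12 attributes; the count of 25 columns appears to describe the dataframe after the dummy encoding done in Data Preprocessing.
- No negative values were seen, although `no_of_employees` is worth re-checking (see the note in Data Preprocessing).
- No missing values
- All values are numeric only after dummy encoding; the raw data includes categorical columns such as `continent` and `education_of_employee`.
- No scaling step appears in this notebook, so numeric columns stay on their original scales.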
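
The observations above can be checked with a few pandas calls. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of `EVP.csv`; the helper name `quick_overview` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the visa data; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "continent": ["Asia", "Europe", "Asia", "Africa"],
        "no_of_employees": [120, -5, 3400, 87],
        "prevailing_wage": [592.2, 83425.65, 122996.86, 74362.19],
    }
)


def quick_overview(frame):
    """Summarize shape, missing values, duplicates, and negative numeric values."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,
        "missing": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "negative_counts": (numeric < 0).sum().to_dict(),
        "non_numeric_columns": list(
            frame.select_dtypes(exclude="number").columns
        ),
    }


summary = quick_overview(toy)
print(summary)
```

Calling `quick_overview(data)` on the loaded dataframe would show whether the "no negative values" and "all numeric" statements hold for the raw data.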

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:
1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification? 

2. How does the visa status vary across different continents? 
 
3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status? 
 
4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa? 
 
5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
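
Questions 1 to 4 each relate a categorical feature to `case_status`, so a row-normalized cross-tabulation is one way to approach them. This is a sketch on invented data; the helper name `status_share` is hypothetical, and the column names follow the data dictionary.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "education_of_employee": [
            "Doctorate",
            "Doctorate",
            "High School",
            "High School",
            "Master's",
            "Master's",
        ],
        "case_status": [
            "Certified",
            "Certified",
            "Denied",
            "Certified",
            "Certified",
            "Denied",
        ],
    }
)


def status_share(frame, feature, target="case_status"):
    """Share of each case status within each level of `feature` (rows sum to 1)."""
    return pd.crosstab(frame[feature], frame[target], normalize="index")


shares = status_share(toy, "education_of_employee")
print(shares)
```

On the real data, `status_share(data, "continent")` gives the same table for continents, and `shares.plot(kind="bar", stacked=True)` draws it as a stacked bar chart.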

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?
# 1. case_status vs. education_of_employee

In [None]:
labeled_barplot(data, "education_of_employee", perc=True)

In [None]:
# How does the visa status vary across different continents?
# 2. case_status vs continent

In [None]:
labeled_barplot(data, "continent", perc=True)

In [None]:
# Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?
# 3. case_status vs has_job_experience

In [None]:
labeled_barplot(data, "has_job_experience", perc=True)

In [None]:
# In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?
# 4. case_status vs unit_of_wage

In [None]:
labeled_barplot(data, "unit_of_wage", perc=True)

In [None]:
# The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
# 5. case_status vs prevailing_wage
# Note: this can also be compared with the annual_val column created during preprocessing to see whether the annualized wage correlates with case status.

In [None]:
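# prevailing_wage is continuous, so a count barplot is not suitable; compare its distribution by case status instead
sns.boxplot(data=data, x="case_status", y="prevailing_wage")
plt.show()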
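
For a numeric feature such as `prevailing_wage`, a grouped summary is a compact complement to a plot. The sketch below uses invented numbers and is not a result from the dataset.

```python
import pandas as pd

# Invented wages for illustration only.
toy = pd.DataFrame(
    {
        "prevailing_wage": [50000.0, 62000.0, 71000.0, 30000.0, 45000.0, 90000.0],
        "case_status": [
            "Certified",
            "Certified",
            "Certified",
            "Denied",
            "Denied",
            "Denied",
        ],
    }
)

# Count, median, and mean of the wage within each case status
wage_by_status = toy.groupby("case_status")["prevailing_wage"].agg(
    ["count", "median", "mean"]
)
print(wage_by_status)
```

Because the raw `prevailing_wage` mixes hourly, weekly, monthly, and yearly amounts, this comparison is likely more meaningful when done per `unit_of_wage` or on an annualized wage.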

In [None]:
data.info()

In [None]:
# sns.pairplot(data)

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; the mean is shown as a marker (a triangle by default)
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# histogram_boxplot needs a numeric column (it computes the mean and median),
# so it is applied to the numeric features of this dataset.
histogram_boxplot(data, "no_of_employees")

In [None]:
histogram_boxplot(data, "yr_of_estab")

In [None]:
histogram_boxplot(data, "prevailing_wage")

## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering 
- Outlier detection and treatment (if needed)
- Preparing data for modeling 
- Any other preprocessing steps (if needed)

In [None]:
df

In [None]:
# Data Formatting:
# Drop unnecessary
# Change to Numeric
# Get Dummies
# Set as Category

In [None]:
df.describe()

In [None]:
df.count()

In [None]:
# Drop unnecessary rows and columns
# Check for missing values (0 missing values were found)
print(df.isnull().sum())
# Drop any rows with missing values as a safeguard
df = df.dropna()
# Drop case_id and yr_of_estab
df = df.drop(columns=["case_id", "yr_of_estab"])
df.info()

In [None]:
df["continent"].value_counts(dropna=False)
# Dummies

In [None]:
df["education_of_employee"].value_counts(dropna=False)
# Dummies

In [None]:
df["has_job_experience"].value_counts(dropna=False)
# Dummies

In [None]:
df["requires_job_training"].value_counts(dropna=False)
# Dummies

In [None]:
df["no_of_employees"].value_counts(dropna=False)
# good to go

In [None]:
df["region_of_employment"].value_counts(dropna=False)
# Dummies

In [None]:
df["prevailing_wage"].value_counts(dropna=False)
# No further processing

In [None]:
df["unit_of_wage"].value_counts(dropna=False)
# Standardize this value
# Replace with multiplication factor to standardize at annual value.

In [None]:
# Convert unit_of_wage to an annual multiplier
# To reconcile yearly wages with hourly, weekly, and monthly wages, I create a multiplier column and use it to compute an annual value.


def to_annual_value(x):
    if x == "Week":
        return 52
    if x == "Month":
        return 12
    if x == "Hour":
        return 2080  # 40 hours/week * 52 weeks, assuming a full-time schedule
    if x == "Year":
        return 1


df["annual_value_modifier"] = df["unit_of_wage"].apply(to_annual_value)
# print (df)
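
The annualization step can also be written as a dictionary lookup with a check for unit labels that the mapping does not cover. This is a sketch on invented data. The conversion factors are assumptions: 2080 hours per year is a common full-time convention and is not stated in the data dictionary. The check matters because the data dictionary writes the units as Hourly, Weekly, Monthly, and Yearly, while the code above compares against `Hour`, `Week`, `Month`, and `Year`.

```python
import pandas as pd

# Assumed conversion factors; 2080 = 40 hours/week * 52 weeks.
UNIT_TO_ANNUAL = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}

# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "prevailing_wage": [25.0, 1000.0, 4000.0, 60000.0],
        "unit_of_wage": ["Hour", "Week", "Month", "Year"],
    }
)

# Look up the multiplier for each row; unknown labels become NaN
factor = toy["unit_of_wage"].map(UNIT_TO_ANNUAL)

# Fail loudly instead of silently producing missing annual values
unmapped = toy.loc[factor.isna(), "unit_of_wage"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped wage units: {list(unmapped)}")

toy["annual_val"] = toy["prevailing_wage"] * factor
print(toy)
```

If the real column used different labels, the `ValueError` would list them, which is easier to diagnose than missing values appearing later in `annual_val`.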

In [None]:
df.info()

In [None]:
df["annual_val"] = df["prevailing_wage"] * df["annual_value_modifier"]
df

In [None]:
# Drop annual_value_modifier, prevailing_wage, and unit_of_wage now that annual_val replaces them
df.drop(
    ["annual_value_modifier", "prevailing_wage", "unit_of_wage"],
    axis=1,
    inplace=True,
)
df.info()

In [None]:
df["full_time_position"].value_counts(dropna=False)

In [None]:
df["case_status"].value_counts(dropna=False)

In [None]:
df.info()

In [None]:
# Get Dummies
df = pd.get_dummies(
    df,
    columns=[
        "continent",
        "education_of_employee",
        "has_job_experience",
        "requires_job_training",
        "region_of_employment",
        "full_time_position",
        "case_status",
    ],
    dtype=float,
)
df.head()
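
Dummy-encoding `case_status` together with the predictors yields two complementary target columns. For modeling, one alternative is to keep the target as a single 0/1 column and encode only the predictors. The sketch below uses invented rows; the 1 = Certified coding, `drop_first=True`, and the split settings are illustrative assumptions rather than choices made in the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "has_job_experience": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
        "no_of_employees": [100, 2500, 40, 900, 15000, 60, 300, 7000],
        "case_status": ["Certified", "Denied"] * 4,
    }
)

# Single binary target: 1 = Certified, 0 = Denied
y = (toy["case_status"] == "Certified").astype(int)

# Encode only the predictors; drop_first removes the redundant dummy column
X = pd.get_dummies(toy.drop(columns="case_status"), drop_first=True, dtype=float)

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

Keeping the target out of `X` avoids feeding the model a column that directly encodes the answer.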

In [None]:
df.info()

In [None]:
# Negative values in no_of_employees: I didn't see this problem, but others reported negative numbers. Assuming a typo, I take the absolute value.
# The result must be assigned back to the column for the change to persist.
df["no_of_employees"] = df["no_of_employees"].abs()

In [None]:
df.describe()

## EDA

- It is a good idea to explore the data once again after manipulating it.

In [None]:
df.describe()

In [None]:
# Outlier detection and treatment (if needed)
plt.hist(df["annual_val"], 20)
plt.title("Histogram of Annual Value")
plt.show()

sns.boxplot(x=df["annual_val"])
plt.title("Boxplot of Annual Value")
plt.show()
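
Outliers seen in a boxplot can be flagged numerically with the same 1.5 × IQR rule that boxplot whiskers use by default. This is a sketch on invented values; the helper name `iqr_outlier_mask` is hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Invented values for illustration only.
values = pd.Series([10, 12, 11, 13, 12, 11, 100], name="annual_val")
mask = iqr_outlier_mask(values)
print(values[mask])
```

Whether flagged rows should be capped, dropped, or kept depends on whether they are genuine values; tree-based models are usually not very sensitive to them.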

## Building bagging and boosting models
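
As a starting point for this section, the sketch below fits two bagging-style and two boosting-style classifiers from scikit-learn and compares their test F1 scores. It runs on synthetic data from `make_classification`, which stands in for the encoded visa features, so the scores say nothing about the real dataset. The choice of models and of F1 as the metric is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded visa features
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient boosting": GradientBoostingClassifier(random_state=1),
}

# Fit each model on the training split and score it on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: test F1 = {score:.3f}")
```

Comparing training and test scores for each model would additionally show which ones overfit.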

##  Will tuning the hyperparameters improve the model performance?
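
One way to explore this question is a grid search with cross-validation, using the `GridSearchCV` and `make_scorer` utilities already imported at the top of the notebook. The sketch below runs on synthetic data; the parameter grid, the random forest estimator, and F1 as the scoring metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the encoded visa features
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Small illustrative grid; a real search would cover more values
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid=param_grid,
    scoring=make_scorer(f1_score),
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated F1:", round(search.best_score_, 3))
print("Test F1:", round(f1_score(y_test, search.predict(X_test)), 3))
```

Comparing the tuned test score with the untuned model's test score on the same split indicates whether tuning helped.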

## Model Performance Comparison and Conclusions

## Actionable Insights and Recommendations