# About Pathrise

Data Challenge Assignment 

Pathrise is an online program that provides 1-on-1 mentorship, training, and advice to help job seekers get the best possible jobs in tech. Every two weeks, Pathrise welcomes a new cohort of fellows. If a candidate is interested in joining our program and successfully passes all stages of our admission process, they receive an offer to join Pathrise and become a fellow. The first 2 weeks in the program are called a free trial period and a fellow can withdraw within this free trial period without any penalty. After 2 weeks, a fellow needs to sign an ISA (Income Share Agreement) with us if they want to stay in the program. The entire program lasts up to a year, including 8 weeks of the core curriculum. If a fellow is unable to find a job within a year after joining Pathrise, his/her contract is terminated. However, there might be some exceptions. For instance, if someone was on a break, we may extend their contract for the period of the break. 

On average, for fellows who stay with us after their free trial period, it takes about 4 months to receive a final job offer. However, there is a lot of variation in fellows’ outcomes. Being able to predict how fast every single fellow is going to find a job is crucial for our business.
 

# Analysis Goal

The main goal of this analysis is to understand whether a fellow will ultimately be placed at a company and, if so, how long the placement can be expected to take.

# Data wrangling

I will start by importing the data I collected from Pathrise. I will then organize the data and make sure it is well defined before exploring it.

In [67]:
#import data wrangling Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [68]:
#loading the data from Pathrise company. I assigned it to df as in data frame using pandas
df=pd.read_csv("Data_Pathrise.csv")

In [69]:
#I want to see what is in the first 10 rows.
df.head(10)

Unnamed: 0,id,company_status,primary_track,cohort_tag,program_duration_days,placed,employment_status,highest_level_of_education,length_of_job_search,biggest_challenge_in_search,professional_experience,work_authorization_status,number_of_interviews,number_of_applications,gender,race
0,1,Active,SWE,OCT19A,,0,Unemployed,Bachelor's Degree,3-5 months,Hearing back on my applications,3-4 years,Canada Citizen,2.0,900,Male,Non-Hispanic White or Euro-American
1,2,Active,PSO,JAN20A,,0,Unemployed,"Some College, No Degree",3-5 months,Getting past final round interviews,1-2 years,Citizen,6.0,0,Male,Non-Hispanic White or Euro-American
2,3,Closed Lost,Design,AUG19B,0.0,0,Employed Part-Time,Master's Degree,Less than one month,Figuring out which jobs to apply for,Less than one year,Citizen,0.0,0,Male,East Asian or Asian American
3,4,Closed Lost,PSO,AUG19B,0.0,0,Contractor,Bachelor's Degree,Less than one month,Getting past final round interviews,Less than one year,Citizen,5.0,25,Male,Decline to Self Identify
4,5,Placed,SWE,AUG19A,89.0,1,Unemployed,Bachelor's Degree,1-2 months,Hearing back on my applications,1-2 years,F1 Visa/OPT,10.0,100,Male,East Asian or Asian American
5,6,Closed Lost,SWE,AUG19A,0.0,0,Employed Full-Time,Master's Degree,1-2 months,Technical interviewing,3-4 years,Green Card,5.0,100,Male,East Asian or Asian American
6,7,Closed Lost,SWE,AUG19B,0.0,0,Employed Full-Time,Master's Degree,Less than one month,Getting past phone screens,3-4 years,Green Card,0.0,9,Male,"Black, Afro-Caribbean, or African American"
7,8,Withdrawn (Failed),SWE,AUG19A,19.0,0,Employed Part-Time,Bachelor's Degree,Less than one month,Getting past final round interviews,1-2 years,Citizen,4.0,15,Female,Latino or Hispanic American
8,9,Active,SWE,AUG19B,,0,Student,Master's Degree,Less than one month,Technical interviewing,1-2 years,F1 Visa/CPT,1.0,5,Male,East Asian or Asian American
9,10,Withdrawn (Trial),SWE,SEP19A,13.0,0,Employed Full-Time,Master's Degree,Less than one month,Getting past final round interviews,3-4 years,Citizen,0.0,10,Male,"Black, Afro-Caribbean, or African American"


There seem to be some NaNs in the dataset. I want more insight, so I will use the describe() function.

In [70]:
df.describe()

Unnamed: 0,id,program_duration_days,placed,number_of_interviews,number_of_applications
count,2544.0,1928.0,2544.0,2326.0,2544.0
mean,1272.5,136.098548,0.375786,2.182287,36.500786
std,734.533866,125.860248,0.48442,2.959273,53.654896
min,1.0,0.0,0.0,0.0,0.0
25%,636.75,14.0,0.0,0.0,9.0
50%,1272.5,112.0,0.0,1.0,20.0
75%,1908.25,224.0,1.0,3.0,45.0
max,2544.0,548.0,1.0,20.0,1000.0


In [71]:
#I want to know how many rows and columns I am dealing with, so I will check the .shape attribute
df.shape

(2544, 16)

In [72]:
#I want to explore the data type of each column
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2544 entries, 0 to 2543
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           2544 non-null   int64  
 1   company_status               2544 non-null   object 
 2   primary_track                2544 non-null   object 
 3   cohort_tag                   2536 non-null   object 
 4   program_duration_days        1928 non-null   float64
 5   placed                       2544 non-null   int64  
 6   employment_status            2315 non-null   object 
 7   highest_level_of_education   2486 non-null   object 
 8   length_of_job_search         2470 non-null   object 
 9   biggest_challenge_in_search  2520 non-null   object 
 10  professional_experience      2322 non-null   object 
 11  work_authorization_status    2260 non-null   object 
 12  number_of_interviews         2326 non-null   float64
 13  number_of_applicat

I see that the "placed" column is int64. I will need this column and program_duration_days in the future to predict which fellow will be placed or not. This is not final as I need to explore the data more to actually figure out what to do in my regression analysis. 

# Exploratory Data Analysis

In [73]:
#I want a closer look at my column of interest, "placed". Transposing a Series with .T
#returns the same Series, so selecting the column is enough.
df["placed"]

0       0
1       0
2       0
3       0
4       1
       ..
2539    0
2540    0
2541    0
2542    0
2543    0
Name: placed, Length: 2544, dtype: int64

From this output, it seems that a value of 1 in placed means the fellow was placed, while 0 means the fellow was not placed or dropped out. I wonder whether there is more insight to be had here and what the rest of the columns look like. Next, I need to count the number of missing values in every column.
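Before moving on, the reading of `placed` above could be checked by cross-tabulating it against `company_status`. The following is a minimal sketch on a toy frame whose values mirror the first rows shown earlier; `toy`, `placed_share`, and `status_vs_placed` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Pathrise data; values mirror the first rows shown above.
toy = pd.DataFrame({
    "company_status": ["Active", "Active", "Closed Lost", "Closed Lost",
                       "Placed", "Withdrawn (Failed)", "Withdrawn (Trial)"],
    "placed": [0, 0, 0, 0, 1, 0, 0],
})

# Share of each value in the target column.
placed_share = toy["placed"].value_counts(normalize=True)

# Cross-tabulate status against the target to see which statuses map to placed == 1.
status_vs_placed = pd.crosstab(toy["company_status"], toy["placed"])

print(placed_share)
print(status_vs_placed)
```

On the real data the same two calls would be run on `df`. If only the "Placed" status ever has `placed == 1`, that supports the interpretation above.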

In [74]:
#count the number of missing values
miss_value = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
miss_value.columns=['count', '%']
miss_value.sort_values(by=['%'],ascending=False)

Unnamed: 0,count,%
program_duration_days,616,24.213836
gender,492,19.339623
work_authorization_status,284,11.163522
employment_status,229,9.001572
professional_experience,222,8.726415
number_of_interviews,218,8.569182
length_of_job_search,74,2.908805
highest_level_of_education,58,2.279874
biggest_challenge_in_search,24,0.943396
race,18,0.707547


24.21% of program_duration_days is missing. It would be simple to drop columns with missing values, but the nature of the problem Pathrise is experiencing demands thorough insight. Hence I will keep this column even though it has the most missing values, and I will work with the remaining columns for now, except gender. I do not think gender will have any effect on whether a fellow is placed or not.
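One way to understand why program_duration_days is missing so often is to break the missingness down by company_status. In the first rows shown earlier, the fellows with an "Active" status are the ones with no duration recorded. The sketch below uses a small toy frame, not the real data, and the name `missing_by_status` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: Active fellows have no duration yet, the others do.
toy = pd.DataFrame({
    "company_status": ["Active", "Active", "Closed Lost",
                       "Placed", "Withdrawn (Trial)", "Active"],
    "program_duration_days": [np.nan, np.nan, 0.0, 89.0, 13.0, np.nan],
})

# Fraction of missing durations within each status group.
missing_by_status = (
    toy["program_duration_days"]
    .isnull()
    .groupby(toy["company_status"])
    .mean()
    .sort_values(ascending=False)
)

print(missing_by_status)
```

If the real data shows the same pattern, the missing durations are structural (the fellow is still in the program) rather than random, which would argue against imputing them with a mean or median.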

In [75]:
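#Dropping the gender column. Note that drop() returns a new DataFrame, so df itself
#keeps the gender column unless the result is assigned back (df = df.drop(...)).
df.drop(['gender'], axis=1)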

Unnamed: 0,id,company_status,primary_track,cohort_tag,program_duration_days,placed,employment_status,highest_level_of_education,length_of_job_search,biggest_challenge_in_search,professional_experience,work_authorization_status,number_of_interviews,number_of_applications,race
0,1,Active,SWE,OCT19A,,0,Unemployed,Bachelor's Degree,3-5 months,Hearing back on my applications,3-4 years,Canada Citizen,2.0,900,Non-Hispanic White or Euro-American
1,2,Active,PSO,JAN20A,,0,Unemployed,"Some College, No Degree",3-5 months,Getting past final round interviews,1-2 years,Citizen,6.0,0,Non-Hispanic White or Euro-American
2,3,Closed Lost,Design,AUG19B,0.0,0,Employed Part-Time,Master's Degree,Less than one month,Figuring out which jobs to apply for,Less than one year,Citizen,0.0,0,East Asian or Asian American
3,4,Closed Lost,PSO,AUG19B,0.0,0,Contractor,Bachelor's Degree,Less than one month,Getting past final round interviews,Less than one year,Citizen,5.0,25,Decline to Self Identify
4,5,Placed,SWE,AUG19A,89.0,1,Unemployed,Bachelor's Degree,1-2 months,Hearing back on my applications,1-2 years,F1 Visa/OPT,10.0,100,East Asian or Asian American
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2539,2540,Withdrawn (Failed),Design,JUN18A,457.0,0,Contractor,Master's Degree,6 months to a year,Technical interviewing,5+ years,Citizen,4.0,15,Non-Hispanic White or Euro-American
2540,2541,Withdrawn (Failed),Data,JAN19B,488.0,0,,Master's Degree,3-5 months,Hearing back on my applications,1-2 years,F1 Visa/OPT,1.0,7,Non-Hispanic White or Euro-American
2541,2542,Active,SWE,SEP18C,,0,Contractor,Bachelor's Degree,Less than one month,Technical interviewing,1-2 years,Citizen,1.0,30,Non-Hispanic White or Euro-American
2542,2543,Active,SWE,MAY18A,,0,,Master's Degree,Less than one month,Technical interviewing,1-2 years,Citizen,2.0,10,Decline to Self Identify
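The call above displays a frame without gender, but pandas `drop` is not in-place by default, so `df` keeps the column until the result is assigned. A minimal sketch of the difference, on a toy frame with illustrative names:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "gender": ["Male", "Female"], "placed": [0, 1]})

# drop() returns a new DataFrame; toy itself still has the column.
preview = toy.drop(columns=["gender"])
still_there = "gender" in toy.columns

# Assigning the result back makes the change persist.
toy = toy.drop(columns=["gender"])
gone_now = "gender" not in toy.columns

print(still_there, gone_now)
```

This explains why gender still appears when the object columns are listed later in the notebook.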


In [76]:
# Exploring Categorical Features of the dataset

I want to make sure that there are no duplicated columns, that is, that each column name appears only once.

In [77]:
df_object = df.select_dtypes('object')

In [78]:
df_object.columns.value_counts()

gender                         1
professional_experience        1
primary_track                  1
employment_status              1
race                           1
highest_level_of_education     1
biggest_challenge_in_search    1
company_status                 1
length_of_job_search           1
cohort_tag                     1
work_authorization_status      1
dtype: int64

There are no duplicate column names.
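The check above looks at column names only. A complementary check, sketched here on a toy frame rather than the real data, looks for fully duplicated rows and repeated `id` values. The variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with one deliberately repeated row.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "primary_track": ["SWE", "PSO", "Design", "Design"],
    "placed": [0, 0, 1, 1],
})

# Number of rows that exactly repeat an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# True only if every id appears once.
ids_are_unique = toy["id"].is_unique

# Keep the first occurrence of each row and rebuild the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicate_rows, ids_are_unique, len(deduplicated))
```

On the real data, `df.duplicated().sum()` and `df["id"].is_unique` would give the equivalent answers.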

Next, I import more libraries for the regression analysis.

In [88]:
import numpy as np
import statsmodels.api as sm
from statsmodels.graphics.api import abline_plot # For visually evaluating predictions.
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split # For splitting the data.
from sklearn import linear_model, preprocessing
import warnings # For handling warning messages.
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale