Project Proposal

Title:


Introduction:

Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
Clearly state the question you will try to answer with your project
Identify and describe the dataset that will be used to answer the question


Preliminary exploratory data analysis:

Demonstrate that the dataset can be read from the web into R 
Clean and wrangle your data into a tidy format
Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
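The summary table described above can be sketched as follows. This is a minimal illustration, not the project's actual analysis: `toy_train` is a hypothetical stand-in for the training split with made-up values (including two deliberately missing entries), and the output column names are assumptions chosen for readability.

```r
library(tidyverse)

# Hypothetical stand-in for the training data; values are invented for illustration.
toy_train <- tibble(
  Attrition = factor(c("Yes", "No", "No", "No", "Yes", "No")),
  HourlyRate = c(94, 61, NA, 87, 50, 82),
  YearsAtCompany = c(6, 10, 0, 6, 4, NA)
)

# One row per class: number of observations, predictor means,
# and how many rows have at least one missing predictor value.
summary_table <- toy_train |>
  group_by(Attrition) |>
  summarize(
    n = n(),
    mean_hourly_rate = mean(HourlyRate, na.rm = TRUE),
    mean_years_at_company = mean(YearsAtCompany, na.rm = TRUE),
    rows_with_missing = sum(is.na(HourlyRate) | is.na(YearsAtCompany))
  )

summary_table
```

Swapping `toy_train` for the real training split should give the same kind of table, assuming the split keeps these three columns.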

Methods:

Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results

Expected outcomes and significance:

What do you expect to find?
What impact could such findings have?
What future questions could this lead to?

In [16]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [4]:
attrition_data <- read_csv("https://raw.githubusercontent.com/wenshanli1231/DSCI-Group-Project/main/Employee-Attrition.csv")
glimpse(attrition_data)

Rows: 1470 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 1,470
Columns: 35
$ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
$ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
$ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
$ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
$ Department               <chr> "Sales", "Research & Development", "Research …
$ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
$ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
$ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
$ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13,

In [6]:
# Keep only the response (Attrition) and the two predictors used in the analysis
attrition_data <- attrition_data |>
    select(Attrition, HourlyRate, YearsAtCompany)

attrition_data

Attrition,HourlyRate,YearsAtCompany
<chr>,<dbl>,<dbl>
Yes,94,6
No,61,10
Yes,92,0
⋮,⋮,⋮
No,87,6
No,63,9
No,82,4


In [8]:
# Convert the response to a factor, as required for classification
attrition_data <- attrition_data |>
    mutate(Attrition = as_factor(Attrition))
attrition_data

Attrition,HourlyRate,YearsAtCompany
<fct>,<dbl>,<dbl>
Yes,94,6
No,61,10
Yes,92,0
⋮,⋮,⋮
No,87,6
No,63,9
No,82,4


In [9]:
set.seed(10)

In [11]:

attrition_split <- initial_split(attrition_data, prop = 0.75, strata = Attrition)
attrition_train <- training(attrition_split)
attrition_test <- testing(attrition_split) 

glimpse(attrition_train)
glimpse(attrition_test)

Rows: 1,101
Columns: 3
$ Attrition      <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ HourlyRate     <dbl> 61, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 93, 51, 80,…
$ YearsAtCompany <dbl> 10, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 10, 6, 1, 25, 3, 4…
Rows: 369
Columns: 3
$ Attrition      <fct> Yes, No, No, Yes, No, No, No, No, No, No, No, No, No, N…
$ HourlyRate     <dbl> 50, 83, 82, 48, 98, 79, 30, 51, 50, 43, 59, 33, 55, 30,…
$ YearsAtCompany <dbl> 4, 10, 1, 1, 9, 4, 2, 7, 10, 27, 17, 5, 1, 5, 13, 22, 1…


In [14]:
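# Build the recipe from the training data only, so that the scaling and
# centering parameters are not influenced by the test set
attrition_recipe <- recipe(Attrition ~ HourlyRate + YearsAtCompany, data = attrition_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-nearest neighbors classifier with K = 3 and unweighted (rectangular) votes
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")

# Combine preprocessing and model, then fit on the training set
attrition_fit <- workflow() |>
    add_recipe(attrition_recipe) |>
    add_model(knn_spec) |>
    fit(data = attrition_train)
attrition_fit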

ERROR: Error in `check_installs()`:
! This engine requires some package installs: 'kknn'
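The error above reports that the `kknn` package is not installed. One way to resolve it, sketched here under the assumption that the session is allowed to install packages from CRAN, is to install `kknn` only when it is missing and then rerun the fitting cell. The mirror URL is an assumption; any CRAN mirror should work.

```r
# Install kknn only if it is not already available in this R library.
# The CRAN mirror below is an assumption; substitute another mirror if preferred.
if (!requireNamespace("kknn", quietly = TRUE)) {
  install.packages("kknn", repos = "https://cloud.r-project.org")
}

# TRUE if the package can now be loaded, FALSE if the install did not succeed.
kknn_available <- requireNamespace("kknn", quietly = TRUE)
kknn_available
```

If `kknn_available` is `TRUE`, rerunning the workflow cell should get past `check_installs()`.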


In [51]:
# Count and percentage of each Attrition class in the training set
attrition_proportions_train <- attrition_train |>
    group_by(Attrition) |>
    summarize(n = n()) |>
    mutate(percent = 100 * n / nrow(attrition_train))

attrition_proportions_train

Attrition,n,percent
<fct>,<int>,<dbl>
Yes,177,16.07629
No,924,83.92371
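The proposal outline also asks for at least one plot of the training data. A minimal sketch of such a plot follows, assuming a scatter plot of the two selected predictors colored by class is a reasonable choice for this analysis. `toy_train` is a hypothetical stand-in with invented values; in the notebook it would be replaced by `attrition_train`.

```r
library(tidyverse)

# Hypothetical stand-in for attrition_train; values are invented for illustration.
toy_train <- tibble(
  Attrition = factor(c("Yes", "No", "Yes", "No", "No", "No")),
  HourlyRate = c(94, 61, 92, 87, 63, 82),
  YearsAtCompany = c(6, 10, 0, 6, 9, 4)
)

# Scatter plot of the two predictors, colored by the response class.
attrition_plot <- ggplot(toy_train,
                         aes(x = YearsAtCompany, y = HourlyRate, color = Attrition)) +
  geom_point(alpha = 0.6) +
  labs(x = "Years at company", y = "Hourly rate", color = "Attrition")

attrition_plot
```

With the real training set, heavy overlap between the two classes in this plot would suggest that these two predictors alone may not separate them well.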
