# Classification of Customer Type 

## Introduction: 
Supermarkets compete intensely for customers because demand for their goods is high. This project examines data from a supermarket company with three distinct branches, collected over a three-month period. The plan is to analyze the sales of one particular branch and to predict, from several variables, whether a new customer will be a store member.

## Preliminary exploratory data analysis

*To begin the exploratory data analysis, all of the required libraries were loaded into R.*

In [11]:
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)
if (!require("stringr")) install.packages("stringr")
library('stringr')

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

*The data was retrieved from Kaggle and read into R by way of GitHub. The code below loads the data file and replaces the spaces in the column names with underscores.*

In [14]:
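supermarket <- read_csv('supermarket_sales.csv')
# Replace spaces in the column names with underscores
names(supermarket) <- str_replace_all(names(supermarket), " ", "_")

# Store the class label as a factor for classification
supermarket_sales <- supermarket %>%
    mutate(Customer_type = as_factor(Customer_type))

# Preview the first five rows
supermarket_sales %>%
    slice(1:5)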

Parsed with column specification:
cols(
  `Invoice ID` = col_character(),
  Branch = col_character(),
  City = col_character(),
  `Customer type` = col_character(),
  Gender = col_character(),
  `Product line` = col_character(),
  `Unit price` = col_double(),
  Quantity = col_double(),
  `Tax 5%` = col_double(),
  Total = col_double(),
  Date = col_character(),
  Time = col_time(format = ""),
  Payment = col_character(),
  cogs = col_double(),
  `gross margin percentage` = col_double(),
  `gross income` = col_double(),
  Rating = col_double()
)



Invoice_ID,Branch,City,Customer_type,Gender,Product_line,Unit_price,Quantity,Tax_5%,Total,Date,Time,Payment,cogs,gross_margin_percentage,gross_income,Rating
<chr>,<chr>,<chr>,<fct>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<time>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08:00,Ewallet,522.83,4.761905,26.1415,9.1
226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29:00,Cash,76.4,4.761905,3.82,9.6
631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23:00,Credit card,324.31,4.761905,16.2155,7.4
123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33:00,Ewallet,465.76,4.761905,23.288,8.4
373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37:00,Ewallet,604.17,4.761905,30.2085,5.3


*Once the dataset was read into R, it was split into a training set (70%) and a testing set (30%), stratified by customer type. The data required no cleaning or wrangling because it was already in tidy format. Only the first five rows are shown to keep visual noise minimal.*

In [15]:
set.seed(1)
supermarket_split <- initial_split(supermarket_sales, prop = 0.70, strata = Customer_type) 
supermarket_train <- training(supermarket_split)
supermarket_test <- testing(supermarket_split)

# Preview the first five rows of the training set
supermarket_train %>%
    slice(1:5)

Invoice_ID,Branch,City,Customer_type,Gender,Product_line,Unit_price,Quantity,Tax_5%,Total,Date,Time,Payment,cogs,gross_margin_percentage,gross_income,Rating
<chr>,<chr>,<chr>,<fct>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<time>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29:00,Cash,76.4,4.761905,3.82,9.6
631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23:00,Credit card,324.31,4.761905,16.2155,7.4
123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33:00,Ewallet,465.76,4.761905,23.288,8.4
373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37:00,Ewallet,604.17,4.761905,30.2085,5.3
699-14-3026,C,Naypyitaw,Normal,Male,Electronic accessories,85.39,7,29.8865,627.6165,3/25/2019,18:30:00,Ewallet,597.73,4.761905,29.8865,4.1
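
One way to sanity-check a stratified split is to compare the class proportions in the two resulting sets. The sketch below is a minimal illustration under stated assumptions: `toy_sales` is placeholder data invented here (it is not the supermarket data), `class_share()` is a hypothetical helper, and only the `initial_split()` call mirrors the cell above.

```r
library(rsample)
library(dplyr)
library(tibble)

# Placeholder data (hypothetical): 60 members and 40 normal customers
toy_sales <- tibble(
  Customer_type = factor(rep(c("Member", "Normal"), times = c(60, 40))),
  Total = seq_len(100)
)

# Same split settings as the cell above
set.seed(1)
toy_split <- initial_split(toy_sales, prop = 0.70, strata = Customer_type)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Hypothetical helper: share of each customer type within a data frame
class_share <- function(df) {
  df %>%
    count(Customer_type) %>%
    mutate(prop = n / sum(n))
}

class_share(toy_train)
class_share(toy_test)
```

If stratification behaves as intended, the `prop` column should be similar in both tables. The same helper could be applied to `supermarket_train` and `supermarket_test`.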


In [24]:
# Keep only Branch A, with the class label and the four candidate predictors
supermarket_data <- supermarket_train %>%
    filter(Branch == 'A') %>%
    select(Customer_type, Gender, Product_line, Total, Rating)
supermarket_data


Customer_type,Gender,Product_line,Total,Rating
<fct>,<chr>,<chr>,<dbl>,<dbl>
Normal,Male,Home and lifestyle,340.5255,7.4
Member,Male,Health and beauty,489.0480,8.4
Normal,Male,Sports and travel,634.3785,5.3
Member,Female,Electronic accessories,433.6920,5.8
Member,Female,Health and beauty,76.1460,7.2
Normal,Female,Electronic accessories,246.4875,7.1
Normal,Male,Food and beverages,453.4950,8.2
Normal,Female,Health and beauty,749.4900,5.7
Member,Female,Health and beauty,506.6355,4.6
Normal,Male,Sports and travel,457.4430,6.9
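
A per-class summary is a common next step in exploratory analysis. As a minimal sketch, the block below copies by hand the ten rows printed above into a tibble (named `branch_a_preview` here purely for illustration) and summarizes them by customer type. With so few rows the numbers are not meaningful; on the real data the same pipeline would be applied to `supermarket_data`.

```r
library(dplyr)
library(tibble)

# The ten Branch A training rows printed above, copied by hand
branch_a_preview <- tribble(
  ~Customer_type, ~Gender,  ~Product_line,            ~Total,   ~Rating,
  "Normal",       "Male",   "Home and lifestyle",     340.5255, 7.4,
  "Member",       "Male",   "Health and beauty",      489.0480, 8.4,
  "Normal",       "Male",   "Sports and travel",      634.3785, 5.3,
  "Member",       "Female", "Electronic accessories", 433.6920, 5.8,
  "Member",       "Female", "Health and beauty",       76.1460, 7.2,
  "Normal",       "Female", "Electronic accessories", 246.4875, 7.1,
  "Normal",       "Male",   "Food and beverages",     453.4950, 8.2,
  "Normal",       "Female", "Health and beauty",      749.4900, 5.7,
  "Member",       "Female", "Health and beauty",      506.6355, 4.6,
  "Normal",       "Male",   "Sports and travel",      457.4430, 6.9
) %>%
  mutate(Customer_type = factor(Customer_type))

# Row count, mean invoice total, and mean rating for each customer type
branch_a_summary <- branch_a_preview %>%
  group_by(Customer_type) %>%
  summarize(
    n = n(),
    mean_total = mean(Total),
    mean_rating = mean(Rating)
  )

branch_a_summary
```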



## Methods: 
Whether a new customer will be a store member will be predicted from several variables in the dataset: gender, product line (the category an item belongs to), customer satisfaction rating, and invoice total (which reflects unit price, quantity, and the 5% tax). The data could be visualized through the use of ___________. 
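
The proposal does not name a classifier, so the following is only a sketch of what a tidymodels classification workflow might look like. It assumes logistic regression with the `glm` engine, uses just the two numeric predictors, and is fitted to the ten rows printed in the previous section (copied by hand as `toy_train`). The `new_customer` values are hypothetical. With ten rows the fit itself means nothing; the point is the shape of the recipe, model, and prediction steps.

```r
library(tidymodels)

# Ten Branch A training rows from the printed output (illustration only)
toy_train <- tribble(
  ~Customer_type, ~Total,   ~Rating,
  "Normal",       340.5255, 7.4,
  "Member",       489.0480, 8.4,
  "Normal",       634.3785, 5.3,
  "Member",       433.6920, 5.8,
  "Member",        76.1460, 7.2,
  "Normal",       246.4875, 7.1,
  "Normal",       453.4950, 8.2,
  "Normal",       749.4900, 5.7,
  "Member",       506.6355, 4.6,
  "Normal",       457.4430, 6.9
) %>%
  mutate(Customer_type = factor(Customer_type))

# Assumption: logistic regression stands in for the unspecified classifier
member_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Recipe declaring the class label and the numeric predictors
member_recipe <- recipe(Customer_type ~ Total + Rating, data = toy_train)

# Bundle the recipe and model, then fit on the training rows
member_fit <- workflow() %>%
  add_recipe(member_recipe) %>%
  add_model(member_spec) %>%
  fit(data = toy_train)

# Hypothetical new customer (values invented for illustration)
new_customer <- tibble(Total = 400, Rating = 7.0)

member_pred <- predict(member_fit, new_data = new_customer)
member_pred
```

The categorical predictors (gender and product line) are left out here to keep the sketch small. Including them would mean adding them to the recipe formula.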


## Expected outcomes and significance: 
The goal is to predict whether a new customer will sign up for a membership. This kind of classification could help supermarkets identify the spending threshold at which customers tend to get a store membership. Customer type prediction could also lead to future questions about which variable is the most effective predictor of customer type. 

