# Activity: Explore Customer Spending Preference Dataset
Let's load and explore a [customer spending preferences dataset from Kaggle](https://www.kaggle.com/code/heeraldedhia/kmeans-clustering-for-customer-data?select=Mall_Customers.csv). This dataset was created for learning customer segmentation concepts, also known as [market basket analysis](https://en.wikipedia.org/wiki/Market_basket). We'll load the data, count its records and fields, list the field names, and recode the one categorical field as a number.

## Setup
We set up the computational environment by including the `Include.jl` file and loading any needed resources, e.g., a sample dataset to cluster.
* __Include__: The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc.
* __Documentation__: For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/) and the [VLDataScienceMachineLearningPackage.jl documentation](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). 

In [3]:
include("Include.jl")

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `~/Desktop/julia_work/CHEME-150-eCornell-Repository/CHEME-150-eCornell-Repository/courses/CHEME-151/module-1/Project.toml`
  [90m[336ed68f] [39m[92m+ CSV v0.10.15[39m
  [90m[aaaa29a8] [39m[92m+ Clustering v0.15.8[39m
  [90m[5ae59095] [39m[92m+ Colors v0.13.1[39m
  [90m[a93c6f00] [39m[92m+ DataFrames v1.7.0[39m
  [90m[b4f34e82] [39m[92m+ Distances v0.10.12[39m
  [90m[5789e2e9] [39m[92m+ FileIO v1.17.0[39m
  [90m[033835bb] [39m[92m+ JLD2 v0.5.13[39m
  [90m[91a5bcdd] [39m[92m+ Plots v1.40.13[39m
  [90m[08abe8d2] [39m[92m+ PrettyTables v2.4.0[39m
  [90m[10745b16] [39m[92m+ Statistics v1.11.1[39m
  [90m[f3b207a7] [39m[92m+ StatsPlots v0.15.7[39m
  [90m[24b76065] [39m[92m+ VLDataScienceMachineLearningPackage v0.1.0 `https://github.com/varnerlab/VLDat

### Data
We load the dataset using [the `MyKaggleCustomerSpendingDataset()` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyKaggleCustomerSpendingDataset) exported by the [VLDataScienceMachineLearningPackage.jl package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). This method returns the raw data as [a `DataFrame` instance](https://github.com/JuliaData/DataFrames.jl).

We'll save the raw data in the `originaldataset::DataFrame` variable:

In [5]:
originaldataset = MyKaggleCustomerSpendingDataset()

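| Row | id | gender | age | income | spendingscore |
|-----|-------|---------|-------|--------|---------------|
|     | Int64 | String7 | Int64 | Int64 | Int64 |
| 1 | 1 | Male | 19 | 15 | 39 |
| 2 | 2 | Male | 21 | 15 | 81 |
| 3 | 3 | Female | 20 | 16 | 6 |
| 4 | 4 | Female | 23 | 16 | 77 |
| 5 | 5 | Female | 31 | 17 | 40 |
| 6 | 6 | Female | 22 | 17 | 76 |
| 7 | 7 | Female | 35 | 18 | 6 |
| 8 | 8 | Female | 23 | 18 | 94 |
| 9 | 9 | Male | 64 | 19 | 3 |
| 10 | 10 | Female | 30 | 19 | 72 |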
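
A quick way to get a feel for a freshly loaded `DataFrame` is to preview a few rows and ask for per-field summary statistics. The sketch below is only an illustration: `toy` is a hypothetical stand-in built from the first five records shown above (with plain `String` values rather than `String7`), not the full dataset, and it assumes the `DataFrames.jl` package is installed.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for `originaldataset`: the first five records shown above.
toy = DataFrame(
    id = [1, 2, 3, 4, 5],
    gender = ["Male", "Male", "Female", "Female", "Female"],
    age = [19, 21, 20, 23, 31],
    income = [15, 15, 16, 16, 17],
    spendingscore = [39, 81, 6, 77, 40],
)

preview = first(toy, 3)        # first three records, returned as a new DataFrame
summary_table = describe(toy)  # one summary row per field (mean, min, median, max, ...)
mean_age = mean(toy.age)       # mean of a single field

println(preview)
println(summary_table)
println("Mean age in the toy sample: $(mean_age)")
```

The same calls should apply unchanged to `originaldataset`, since they only rely on it being a `DataFrame`.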


__Fields and records__: Let's check the number of records we have and the number and type of the fields in each record in the `originaldataset::DataFrame`. 

_Records_: Each row in the dataset holds a record, so the number of records equals the number of rows. We compute it using [the `nrow(...)` method exported by `DataFrames.jl`](https://dataframes.juliadata.org/stable/lib/functions/#DataAPI.nrow), then use [the `|>` pipe operator](https://docs.julialang.org/en/v1/manual/functions/#Function-composition-and-piping) to pass the result to an anonymous function that prints it with [the Julia `println(...)` method](https://docs.julialang.org/en/v1/base/io-network/#Base.println).

In [41]:
nrow(originaldataset) |> n-> println("Number of records: $(n)")

Number of records: 200
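
To make the role of the `|>` operator concrete, the sketch below shows that piping a value into a function is the same as calling the function on that value, and that function composition with `∘` gives a third equivalent form. The `toy` table and the `report` helper are hypothetical names introduced only for this illustration.

```julia
using DataFrames

# Hypothetical three-record table, used only to have something to count.
toy = DataFrame(id = [1, 2, 3], age = [19, 21, 20])

# Hypothetical helper that formats the message instead of printing it directly.
report(n) = "Number of records: $(n)"

piped = nrow(toy) |> report        # pipe form: the value flows left to right
nested = report(nrow(toy))         # ordinary nested call
composed = (report ∘ nrow)(toy)    # compose first, then apply to the table

println(piped)

# size(...) returns both counts at once as a (rows, columns) tuple.
nrecords, nfields = size(toy)
println("records = $(nrecords), fields = $(nfields)")
```

All three forms produce the same string; which one reads best is largely a matter of taste.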


_Fields_: We can compute the number of fields in each record using [the `ncol(...)` method exported by the `DataFrames.jl` package](https://dataframes.juliadata.org/stable/lib/functions/#DataAPI.ncol). However, `ncol(...)` only returns the number of fields, not the field names or the types of data they contain.

In [59]:
ncol(originaldataset) |> n-> println("Number of fields: $(n)")

Number of fields: 5


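To do a deeper dive, let's list the field names using the `names(...)` method, which returns them as a `Vector{String}`: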

In [69]:
names(originaldataset)

5-element Vector{String}:
 "id"
 "gender"
 "age"
 "income"
 "spendingscore"
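
The `names(...)` call gives the field names but still not the field types. One possible way to recover both, sketched below, is to broadcast `eltype` over the columns returned by `eachcol(...)`. The `toy` table is a hypothetical stand-in built from the first three records shown earlier (with plain `String` values rather than `String7`), and `schema` is a name introduced only for this example.

```julia
using DataFrames

# Hypothetical stand-in for `originaldataset`: the first three records shown earlier.
toy = DataFrame(
    id = [1, 2, 3],
    gender = ["Male", "Male", "Female"],
    age = [19, 21, 20],
    income = [15, 15, 16],
    spendingscore = [39, 81, 6],
)

field_names = names(toy)              # Vector{String} of field names
field_types = eltype.(eachcol(toy))   # element type of each column, in the same order

# Pair each field name with its type for easy lookup.
schema = Dict(zip(field_names, field_types))

for (name, T) in zip(field_names, field_types)
    println(rpad(name, 15), T)
end
```

On the real dataset, the `gender` entry would be expected to show `String7`, matching the type row in the table printed when the data was loaded.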

__Hmmm__. We have a [categorical field](https://en.wikipedia.org/wiki/Categorical_variable). Let's remap the `gender::String7` categorical feature, which is a string, to a number, i.e., let `Male = -1` and `Female = 1`. Numbers are much easier to work with than categorical data when we look at (and analyze) this data later. We'll store the revised dataset in [the `dataset::DataFrame` variable](https://github.com/JuliaData/DataFrames.jl).
* __Note__: the original dataset did not include example shoppers who identified as non-binary. Hence, we map the original `gender::String7` field to $\{-1,1\}$. However, if non-binary shoppers were to enter the dataset, we could map them to a different number, e.g., `0`.

In [10]:
dataset = let
    treated_dataset = copy(originaldataset); # work on a copy so the raw data stays untouched
    transform!(treated_dataset, :gender => ByRow(x -> (x == "Male" ? -1 : 1)) => :gender); # maps gender to -1,1
    treated_dataset # value of the let block, stored in `dataset`
end;
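
The note above suggests mapping additional categories to other numbers such as `0`. One way this could be done, sketched here under assumptions, is to replace the ternary test with a lookup table and a default value. The `toy` table, the `"Non-binary"` label, its age value, and the `gender_codes` dictionary are all hypothetical and do not come from the dataset; the toy also uses plain `String` values rather than `String7`.

```julia
using DataFrames

# Hypothetical records; the third row is invented to exercise the default branch.
toy = DataFrame(gender = ["Male", "Female", "Non-binary"], age = [19, 20, 25])

# Hypothetical lookup table from label to numeric code.
gender_codes = Dict("Male" => -1, "Female" => 1)

# transform (no `!`) returns a new DataFrame and leaves `toy` unchanged.
# Any label missing from the lookup table falls back to 0.
encoded = transform(toy, :gender => ByRow(g -> get(gender_codes, g, 0)) => :gender)

println(encoded)
```

A design caveat: the default of `0` also absorbs misspelled labels silently, so checking `unique(toy.gender)` before encoding may be worthwhile.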