# Students Do: Understanding customers

## Instructions

You are given a dataset that contains historical purchase data from an online store for 200 customers. In this activity, you will put your data preprocessing superpowers into action and add some new skills needed to start finding customer clusters.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path


Load the data into a Pandas DataFrame, name it `df_shopping`, and display the first 10 rows.

In [2]:
# Data loading
file_path = Path("../Resources/shopping_data.csv")
df_shopping = pd.read_csv(file_path, encoding="ISO-8859-1")
df_shopping.head(10)


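|   | CustomerID | Gender | Age | Annual Income | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 |
| 1 | 2 | Male | 21 | 15000 | 81 |
| 2 | 3 | Female | 20 | 16000 | 6 |
| 3 | 4 | Female | 23 | 16000 | 77 |
| 4 | 5 | Female | 31 | 17000 | 40 |
| 5 | 6 | Female | 22 | 17000 | 76 |
| 6 | 7 | Female | 35 | 18000 | 6 |
| 7 | 8 | Female | 23 | 18000 | 94 |
| 8 | 9 | Male | 64 | 19000 | 3 |
| 9 | 10 | Female | 30 | 19000 | 72 |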


List the DataFrame's data types to ensure they match the kind of data stored in each column.

In [3]:
# List dataframe data types
df_shopping.dtypes


CustomerID                 int64
Gender                    object
Age                        int64
Annual Income              int64
Spending Score (1-100)     int64
dtype: object

**Question 1:** Is there any column whose data type needs to be changed? If so, make the corresponding adjustments.

**Answer:** All columns have an appropriate data type.
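No conversion is needed here, but if a column had been loaded with the wrong type, `astype` is one way to fix it. The following is a minimal sketch on a hypothetical sample (not the real dataset) in which `Age` was read as text; `df_example` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample in which Age was read as strings (not the real dataset)
df_example = pd.DataFrame(
    {"Gender": ["Male", "Male", "Female"], "Age": ["19", "21", "20"]}
)

# Cast the text column to a 64-bit integer type
df_example["Age"] = df_example["Age"].astype("int64")
print(df_example.dtypes)
```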

**Question 2:** Is there any unnecessary column that needs to be dropped? If so, make the corresponding adjustments.

**Answer:** We can drop the `CustomerID` column. It is not relevant for clustering because it does not describe any characteristic of customers' shopping habits.

In [4]:
# Remove the CustomerID Column
df_shopping.drop(columns=["CustomerID"], inplace=True)
df_shopping.head()


|   | Gender | Age | Annual Income | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | Male | 19 | 15000 | 39 |
| 1 | Male | 21 | 15000 | 81 |
| 2 | Female | 20 | 16000 | 6 |
| 3 | Female | 23 | 16000 | 77 |
| 4 | Female | 31 | 17000 | 40 |


Remove all rows that contain `null` values, if there are any.

In [5]:
# Find null values
for column in df_shopping.columns:
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")


Column Gender has 0 null values
Column Age has 0 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 0 null values
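This dataset has no null values, so nothing needs to be removed. If the counts had been nonzero, `dropna` could remove the affected rows. The following is a minimal sketch on a hypothetical sample with one missing `Age` value; `df_example` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample with one missing Age value (not the real dataset)
df_example = pd.DataFrame(
    {"Gender": ["Male", "Female", "Female"], "Age": [19, None, 31]}
)

# Drop every row that has at least one null value, then renumber the index
df_example = df_example.dropna().reset_index(drop=True)
print(f"Rows remaining: {len(df_example)}")
```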


Remove duplicate entries, if there are any.

In [6]:
# Find duplicate entries
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")


Duplicate entries: 0
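There are no duplicates in this dataset. If there were, `drop_duplicates` could remove them. The following is a minimal sketch on a hypothetical sample with one repeated row; `df_example` and `n_duplicates` are illustrative names.

```python
import pandas as pd

# Hypothetical sample in which the second row repeats the first
df_example = pd.DataFrame(
    {"Gender": ["Male", "Male", "Female"], "Age": [19, 19, 20]}
)

# Count rows that repeat an earlier row
n_duplicates = int(df_example.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
df_example = df_example.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_duplicates} duplicate row(s); {len(df_example)} rows remain")
```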


Unsupervised learning algorithms require all features to be numeric and on similar scales. Perform the following data transformations.

* The `Gender` column contains categorical data. Categorical variables should be transformed to numerical values; in this case, encoding `Male` as `1` and `Female` as `0` is a feasible solution.

In [7]:
# Transform Gender column
def change_gender(gender):
    """Encode gender as an integer: 1 for "Male", 0 for any other value."""
    if gender == "Male":
        return 1
    return 0


df_shopping["Gender"] = df_shopping["Gender"].apply(change_gender)
df_shopping.head()


|   | Gender | Age | Annual Income | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | 1 | 19 | 15000 | 39 |
| 1 | 1 | 21 | 15000 | 81 |
| 2 | 0 | 20 | 16000 | 6 |
| 3 | 0 | 23 | 16000 | 77 |
| 4 | 0 | 31 | 17000 | 40 |
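An alternative to a custom function is `Series.map` with a dictionary. One difference worth noting: values that are missing from the dictionary become `NaN` instead of silently turning into `0`, which makes unexpected categories easy to spot. The following is a minimal sketch on a hypothetical series; the `"Other"` value is invented for illustration and does not appear in the dataset.

```python
import pandas as pd

# Hypothetical sample; "Other" stands in for an unexpected category
genders = pd.Series(["Male", "Female", "Other"])

# Map each known category to its numeric code; unknown values become NaN
encoded = genders.map({"Male": 1, "Female": 0})

# List the original values that had no mapping
unexpected = genders[encoded.isnull()]
print(unexpected.tolist())
```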


* The `Annual Income` column is on a much larger scale than the other columns. Unsupervised learning algorithms need all variables on a similar scale, so `Annual Income` should be rescaled. In this case, dividing by `1000` is the simplest approach.

In [8]:
# Transform annual income
df_shopping["Annual Income"] = df_shopping["Annual Income"] / 1000
df_shopping.head()


|   | Gender | Age | Annual Income | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | 1 | 19 | 15.0 | 39 |
| 1 | 1 | 21 | 15.0 | 81 |
| 2 | 0 | 20 | 16.0 | 6 |
| 3 | 0 | 23 | 16.0 | 77 |
| 4 | 0 | 31 | 17.0 | 40 |
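Dividing by `1000` is enough for this activity. A more general alternative, not used in this activity, is to standardize each numeric column to zero mean and unit variance. The following is a minimal sketch in plain pandas, using the five rows shown above as sample data; `df_example` and `df_scaled` are illustrative names.

```python
import pandas as pd

# Sample built from the five rows shown above (income already in thousands)
df_example = pd.DataFrame(
    {
        "Age": [19, 21, 20, 23, 31],
        "Annual Income": [15.0, 15.0, 16.0, 16.0, 17.0],
        "Spending Score (1-100)": [39, 81, 6, 77, 40],
    }
)

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 selects the population standard deviation.
df_scaled = (df_example - df_example.mean()) / df_example.std(ddof=0)
print(df_scaled.round(2))
```

After this transformation every column has a mean of 0 and a standard deviation of 1, so no single feature dominates distance calculations because of its units.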


Save the cleaned DataFrame as a CSV file named `shopping_data_cleaned.csv`.

In [9]:
# Saving cleaned data
file_path = Path("../Resources/shopping_data_cleaned.csv")
df_shopping.to_csv(file_path, index=False)
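As an optional sanity check, the saved file can be read back and compared with the DataFrame in memory. The following is a minimal sketch that writes a hypothetical sample to a temporary directory instead of `../Resources`; `df_example`, `df_check`, and `round_trip_ok` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample shaped like the cleaned data
df_example = pd.DataFrame(
    {"Gender": [1, 0], "Age": [19, 20], "Annual Income": [15.0, 16.0]}
)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "shopping_data_cleaned.csv"
    df_example.to_csv(out_path, index=False)
    df_check = pd.read_csv(out_path)

# Compare the reloaded data with the original
round_trip_ok = df_check.equals(df_example)
print(f"Round trip preserved the data: {round_trip_ok}")
```

Passing `index=False` when saving keeps the row index out of the file, so the reloaded DataFrame does not gain an extra `Unnamed: 0` column.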
