# Students Do: Understanding customers

## Instructions

You are given a dataset that contains historical purchase data from an online store for 200 customers. In this activity, you will put your data preprocessing superpowers into action and add some new skills needed to start finding customer clusters.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path

Load the data into a Pandas DataFrame named `df_shopping` and display the first 10 rows.

In [2]:
# Data loading
file_path = Path("../Resources/shopping_data.csv")
df_shopping = pd.read_csv(file_path)
df_shopping.head(10)

,CustomerID,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,1,Yes,52,38000,45
1,2,Yes,40,39000,57
2,3,No,57,46000,59
3,4,Yes,54,41000,51
4,5,No,55,45000,53
5,6,Yes,33,41000,51
6,7,Yes,33,45000,48
7,8,Yes,41,49000,52
8,9,Yes,54,39000,55
9,10,Yes,34,44000,53


List the DataFrame's data types to ensure they match the kind of data stored in each column.

In [3]:
# List dataframe data types
df_shopping.dtypes


CustomerID                 int64
Previous Shopper          object
Age                        int64
Annual Income              int64
Spending Score (1-100)     int64
dtype: object

**Question 1:** Is there any column whose data type needs to be changed? If so, make the corresponding adjustments.

**Answer:** All columns have an appropriate data type.
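
No conversion is needed here, but as a minimal sketch of what an adjustment could look like, assume a hypothetical toy DataFrame (not the shopping data) in which `Age` was read as text. `pd.to_numeric` is one way to fix such a column:

```python
import pandas as pd

# Hypothetical toy frame: "Age" was loaded as strings instead of integers
toy = pd.DataFrame({
    "Age": ["52", "40", "57"],
    "Annual Income": [38000, 39000, 46000],
})

# Convert the text column to a numeric type
toy["Age"] = pd.to_numeric(toy["Age"])
print(toy.dtypes)
```

After the conversion, `toy.dtypes` should report an integer type for `Age` instead of `object`.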

**Question 2:** Is there any unnecessary column that needs to be dropped? If so, make the corresponding adjustments.

**Answer:** We can drop the `CustomerID` column. It's not relevant for clustering because it is only an identifier and doesn't describe any characteristic of customers' shopping habits.

In [4]:
# Remove the CustomerID Column
df_shopping = df_shopping.drop(columns=["CustomerID"])
df_shopping.head()

,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,Yes,52,38000,45
1,Yes,40,39000,57
2,No,57,46000,59
3,Yes,54,41000,51
4,No,55,45000,53


Remove all rows with `null` values if any.

In [5]:
# Find null values
for column in df_shopping.columns:
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")


Column Previous Shopper has 0 null values
Column Age has 0 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 0 null values
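
This dataset has no null values, so nothing has to be removed. If it did, one possible approach is `dropna()`. The sketch below uses a hypothetical toy DataFrame with made-up gaps, not the shopping data:

```python
import pandas as pd

# Hypothetical toy frame with two missing values (None becomes NaN)
toy = pd.DataFrame({
    "Age": [52, None, 57],
    "Annual Income": [38000, 39000, None],
})

# Count nulls per column, then drop every row that contains at least one
null_counts = toy.isnull().sum()
toy_clean = toy.dropna().reset_index(drop=True)

print(null_counts)
print(toy_clean)
```

Resetting the index after dropping rows keeps it contiguous, which matters later when columns from two DataFrames are combined by index.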


Remove duplicate entries if any.

In [6]:
# Find duplicate entries
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")


Duplicate entries: 0
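
There are no duplicates to remove in this dataset. As a sketch of how they could be removed if there were, the example below uses a hypothetical toy DataFrame with one repeated row and `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first
toy = pd.DataFrame({
    "Age": [52, 40, 52],
    "Annual Income": [38000, 39000, 38000],
})

# Count duplicated rows, then keep only the first occurrence of each
n_dupes = toy.duplicated().sum()
toy_unique = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate entries: {n_dupes}")
print(toy_unique)
```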


To use unsupervised learning algorithms, all features should be numeric and on similar scales. Perform the following data transformations.

* The `Previous Shopper` column contains categorical data. Any time you have categorical variables, you should transform them to numerical values. In this case, transforming `Yes` to `1` and `No` to `0` is a feasible solution.

In [7]:
# Transform the Previous Shopper column
def changeStatus(status):
    if status == "Yes":
        return 1
    else:
        return 0

# Along with replace() and map(), this is another way to encode the Previous Shopper column into numbers.
df_shopping["Previous Shopper"] = df_shopping["Previous Shopper"].apply(changeStatus)
df_shopping.head()


,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,1,52,38000,45
1,1,40,39000,57
2,0,57,46000,59
3,1,54,41000,51
4,0,55,45000,53
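
The comment in the cell above mentions `map()` as an alternative. A minimal sketch of that variant, using a hypothetical toy DataFrame rather than `df_shopping`, could look like this:

```python
import pandas as pd

# Hypothetical toy frame with the same Yes/No labels
toy = pd.DataFrame({"Previous Shopper": ["Yes", "No", "Yes"]})

# Map each label to its numeric code with a dictionary
toy["Previous Shopper"] = toy["Previous Shopper"].map({"Yes": 1, "No": 0})
print(toy)
```

One difference to keep in mind: `map()` with a dictionary produces `NaN` for any label that is not a key, while the `apply()` version above sends every value other than `Yes` to `0`.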


* Scale the `Age`, `Annual Income` and `Spending Score (1-100)` columns with `StandardScaler` so that each has zero mean and unit variance. This puts them on a scale comparable to the binary `Previous Shopper` column.

In [8]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_shopping[['Age', 'Annual Income', 'Spending Score (1-100)']])
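
To make the standardization step less of a black box, here is a sketch of the same calculation written by hand with pandas: subtract each column's mean and divide by its population standard deviation (`ddof=0`), which is what `StandardScaler` does by default. The toy values are hypothetical, so the results will not match the scaled output of the full dataset:

```python
import pandas as pd

# Hypothetical toy frame with two numeric features
toy = pd.DataFrame({
    "Age": [52, 40, 57, 54, 55],
    "Annual Income": [38000, 39000, 46000, 41000, 45000],
})

# z = (x - mean) / std, computed column by column
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.mean())       # each column's mean is approximately 0
print(standardized.std(ddof=0))  # each column's standard deviation is approximately 1
```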

In [9]:
# A list of the columns from the original DataFrame
df_shopping.columns

Index(['Previous Shopper', 'Age', 'Annual Income', 'Spending Score (1-100)'], dtype='object')

In [10]:
# Create a DataFrame with the transformed data
new_df_shopping = pd.DataFrame(scaled_data, columns=df_shopping.columns[1:])
new_df_shopping['Previous Shopper'] = df_shopping['Previous Shopper']
new_df_shopping.head()

,Age,Annual Income,Spending Score (1-100),Previous Shopper
0,-0.29524,-0.118424,-1.625204,1
1,-0.979855,0.05197,-0.05306,1
2,-0.009984,1.244733,0.208964,0
3,-0.181138,0.39276,-0.839132,1
4,-0.124086,1.074338,-0.577108,0


In [11]:
# Rename the spending score column
new_df_shopping = new_df_shopping.rename(columns={'Spending Score (1-100)': 'Spending Score'})
new_df_shopping.head()

,Age,Annual Income,Spending Score,Previous Shopper
0,-0.29524,-0.118424,-1.625204,1
1,-0.979855,0.05197,-0.05306,1
2,-0.009984,1.244733,0.208964,0
3,-0.181138,0.39276,-0.839132,1
4,-0.124086,1.074338,-0.577108,0


Save the cleaned DataFrame as a CSV file named `shopping_data_cleaned.csv`.

In [12]:
# Saving cleaned data
file_path = Path("../Resources/shopping_data_cleaned.csv")
new_df_shopping.to_csv(file_path, index=False)
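
As an optional sanity check, the saved file can be read back to confirm that the columns and row count survived the round trip. The sketch below assumes a hypothetical toy DataFrame and a temporary directory, so it does not touch the `Resources` folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"Age": [-0.3, 0.5], "Previous Shopper": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_cleaned.csv"
    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
print(reloaded.shape)
```

Because the file is written with `index=False`, reading it back does not produce an extra unnamed index column.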
