### Customer Segmentation with Machine Learning.
In the realm of data science, your mission is to unlock the potential hidden within a vast trove of e-commerce sales data using Python and the scikit-learn library. The business case at hand is to better understand the customer base of an e-commerce company, and as the custodian of data, you're entrusted with the task of transforming this information into actionable insights.

Your journey begins with data preparation, where you meticulously clean, format, and structure the raw data. This behind-the-scenes work may often go unnoticed, but it forms the bedrock of your mission's success, ensuring that the data is in a state that can be effectively analyzed.

The true artistry of your work comes to light in the segmentation phase. With Python and scikit-learn, you employ the K-means clustering algorithm to partition the customers into distinct segments. These segments aren't arbitrary divisions but rather the keys to understanding customer behavior, preferences, and needs. Your clustering model paints a vivid picture of the customer landscape, allowing the e-commerce company to tailor its strategies, products, and interactions to cater to each segment's unique characteristics.

But your journey doesn't end there. You recognize the importance of precision, and so you delve into hyperparameter tuning. Like a master craftsman refining their masterpiece, you fine-tune the clustering model to perfection. This step ensures that the segments aren't just loosely defined groups but accurate reflections of customer behavior.

Your dedication to precision results in a model that effectively and accurately segments the customer base. It equips the e-commerce company with the insights needed to make data-driven decisions, enhance customer satisfaction, boost sales, and optimize marketing efforts.

In this data-driven quest, you're the unsung hero, silently transforming raw data into actionable insights. While your work may often go unnoticed by the world, its impact reverberates within the e-commerce company. Your dedication to data and your ability to shape it into meaningful customer segments contribute to the ongoing story of e-commerce success, making every customer's journey towards better shopping experiences that much more extraordinary.

Module 1
Task 1: Unlocking Sales Secrets.
You've just loaded an e-commerce dataset using Python, and your task is clear: to delve into the depths of this data and reveal the hidden sales secrets it holds. With pandas by your side, you're ready to explore trends, customer behaviors, and product insights that will not only drive sales but rewrite the success story of this e-commerce business. Your journey begins, armed with data and determination, to unearth the treasure trove that is "Orders_Analysis.csv."


#### Load the data.
Import Pandas and alias it as 'pd'.
Read the CSV file movies Orders_Analysis.csv into a Pandas DataFrame named 'df'.
To import the 'Orders_Analysis.csv' file, which is located in the root path of your project, you should use the following path: './Orders_Analysis.csv'.
Inspect the data by calling the variable 'df'.

### Module 1¶
#### Task 1: Unlocking Sales Secrets.
You've just loaded an e-commerce dataset using Python, and your task is clear: to delve into the depths of this data and reveal the hidden sales secrets it holds. With pandas by your side, you're ready to explore trends, customer behaviors, and product insights that will not only drive sales but rewrite the success story of this e-commerce business. Your journey begins, armed with data and determination, to unearth the treasure trove that is "Orders_Analysis.csv."

In [1]:
#--- Import Pandas ---
import pandas as pd
#--- Read in dataset ----
df = pd.read_csv("Orders_Analysis.csv")
df1 = df
# ---WRITE YOUR CODE FOR TASK 1 ---
#--- Inspect data ---
df.head()

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
0,DPR,DPR,100,AD-982-708-895-F-6C894FB,52039657,1312378,83290718932496,04/12/2018,2,200.0,-200.0,0.0,0.0,0.0,0.0,0,2
1,RJF,Product P,28 / A / MTM,83-490-E49-8C8-8-3B100BC,56914686,3715657,36253792848113,01/04/2019,2,190.0,-190.0,0.0,0.0,0.0,0.0,0,2
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.8,-156.56,-8.24,0.0,0.0,0.0,-2,2
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1


#### Filtering DataFrame Rows.
Use the DataFrame 'df'.
Apply a filter to select only the rows where the "ordered_item_quantity" is greater than 0.
Store the filtered DataFrame back into the variable 'df'.

#### Task 2: Quantifying Success.
In your relentless pursuit of data-driven excellence, you now embark on a quest with a captivating title in mind. Filtering the dataset to retain only records with a positive ordered item quantity is not just a routine task, but a crucial step in the saga of success. You're on a mission to uncover the golden equation that defines what makes products and orders successful in this e-commerce realm. With each line of code, you inch closer to the pivotal insights that will guide product offerings and sales strategies. Your journey continues, as you sift through the data, separating the ordinary from the extraordinary.

In [2]:
#--- WRITE YOUR CODE FOR TASK 2 ---
df = df.loc[df['ordered_item_quantity'] > 0]

#--- Inspect data ---
df.head()

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
0,DPR,DPR,100,AD-982-708-895-F-6C894FB,52039657,1312378,83290718932496,04/12/2018,2,200.0,-200.0,0.0,0.0,0.0,0.0,0,2
1,RJF,Product P,28 / A / MTM,83-490-E49-8C8-8-3B100BC,56914686,3715657,36253792848113,01/04/2019,2,190.0,-190.0,0.0,0.0,0.0,0.0,0,2
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.8,-156.56,-8.24,0.0,0.0,0.0,-2,2
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1


#### Aggregating and Encoding Data.
- Define a list of columns to group by, including "customer_id" and "product_type" as 'column_list.'
- Create an 'aggregated_dataframe' by using the Pandas .groupby() method on 'df' to group the data by the columns specified in 'column_list.' Then, apply the .count() function on the "ordered_item_quantity" column within each group and reset the index. This DataFrame contains the count of ordered items for each combination of customer and product type.
- Add a new column named "products_ordered" to the 'aggregated_dataframe' by applying the encode_column function to the "ordered_item_quantity" column.
- Create a 'customers_orders' DataFrame by using the Pandas .groupby() method on 'aggregated_dataframe' and grouping by the "customer_id" column. Then, sum the "products_ordered" values within each group and reset the index. This DataFrame summarizes the total number of products ordered by each customer.
- The 'customers_orders' DataFrame contains the customer IDs and the corresponding total number of products ordered by each customer.


#### Task 3: Customer Code: Deciphering Buying Patterns.
With a captivating title in mind, you now delve deeper into the world of data manipulation. Your mission is crystal clear - decode the intricate buying patterns of e-commerce customers. By skillfully encoding the data into binary values, you are on a journey to unveil the secrets hidden within. The aggregation of customer behaviors aligns with your objective, as you aim to discern the customers who have made purchases from those who haven't. With each line of code, you illuminate the path towards a deeper understanding of customer preferences, and in doing so, you're empowering the e-commerce business to tailor its offerings more precisely to its audience. Your journey persists as you decipher the customer code, revealing a roadmap for strategic decisions.

In [3]:
### Task 3: Customer Code: Deciphering Buying Patterns.
#With a captivating title in mind, you now delve deeper into the world of data manipulation. Your mission is crystal clear - decode the intricate buying patterns of e-commerce customers. By skillfully encoding the data into binary values, you are on a journey to unveil the secrets hidden within. The aggregation of customer behaviors aligns with your objective, as you aim to discern the customers who have made purchases from those who haven't. With each line of code, you illuminate the path towards a deeper understanding of customer preferences, and in doing so, you're empowering the e-commerce business to tailor its offerings more precisely to its audience. Your journey persists as you decipher the customer code, revealing a roadmap for strategic decisions.

column_list = ["customer_id", "product_type"]

aggregated_dataframe = df.groupby(column_list)['ordered_item_quantity'].count().reset_index()

def encode_column(column):
    if column > 0:
        return 1
    if column <= 0:
        return 0

aggregated_dataframe['products_ordered'] = aggregated_dataframe['ordered_item_quantity'].apply(encode_column)
customers_orders = aggregated_dataframe.groupby('customer_id')['products_ordered'].sum().reset_index()
customers_orders.head()
customers_orders


Unnamed: 0,customer_id,products_ordered
0,1000661,1
1,1001914,1
2,1002167,3
3,1002387,1
4,1002419,2
...,...,...
24869,97805007741979,2
24870,98854671633650,2
24871,98974226154136,1
24872,99262726332691,2


#### Calculating Average Return Rate by Customer Order.
- Create a 'ordered_sum_by_customer_order' DataFrame by using the Pandas .groupby() method on 'df' and grouping by "customer_id" and "order_id." Then, calculate the sum of "ordered_item_quantity" for each group and reset the index. This DataFrame contains the sum of ordered item quantities for each customer order.
- Create a 'returned_sum_by_customer_order' DataFrame by using the Pandas .groupby() method on 'df' and grouping by "customer_id" and "order_id." Calculate the sum of "returned_item_quantity" for each group and reset the index. This DataFrame contains the sum of returned item quantities for each customer order.
- Merge the 'ordered_sum_by_customer_order' and 'returned_sum_by_customer_order' DataFrames based on common columns, creating 'ordered_returned_sums.' This DataFrame combines the sum of ordered and returned item quantities for each customer order.
- Calculate the "average_return_rate" by using a formula that takes the negation of "returned_item_quantity" and divides it by "ordered_item_quantity" within the 'ordered_returned_sums' DataFrame.
- The 'ordered_returned_sums' DataFrame contains information about the sum of ordered and returned item quantities and the calculated average return rate for each customer order.

#### Task 4: Unveiling Return Rate Insights.¶
Your journey now takes you into the world of order and return dynamics. As you meticulously calculate the sum of ordered and returned items by customer and order, you aim to unravel the balance between what's bought and what's sent back. The introduction of the "average return rate" offers a fresh perspective, providing insights into customer behaviors and product quality. Your quest continues, and with each line of code, you bring clarity to a complex puzzle, guiding the e-commerce business toward strategies that ensure a delicate equilibrium between sales and returns.


In [4]:
ordered_sum_by_customer_order = df.groupby(['customer_id', 'order_id'])['ordered_item_quantity'].sum().reset_index()
ordered_sum_by_customer_order.head()
#--- WRITE YOUR CODE FOR TASK 4 ---
returned_sum_by_customer_order = df.groupby(['customer_id', 'order_id'])['returned_item_quantity'].sum().reset_index()
returned_sum_by_customer_order.head()
#--- Inspect data ---
ordered_returned_sums = pd.merge(ordered_sum_by_customer_order, returned_sum_by_customer_order, on= 'customer_id')

ordered_returned_sums['average_return_rate'] = ~(ordered_returned_sums['returned_item_quantity']) / ordered_returned_sums['ordered_item_quantity']
ordered_returned_sums.drop(['order_id_x', 'order_id_y',], axis=1, inplace= True)
ordered_returned_sums

Unnamed: 0,customer_id,ordered_item_quantity,returned_item_quantity,average_return_rate
0,1000661,3,0,-0.333333
1,1001914,1,0,-1.000000
2,1002167,1,0,-1.000000
3,1002167,1,0,-1.000000
4,1002167,1,0,-1.000000
...,...,...,...,...
109313,99262726332691,2,0,-0.500000
109314,99262726332691,2,0,-0.500000
109315,99262726332691,1,0,-1.000000
109316,99262726332691,1,0,-1.000000


### Module 2
#### Task 1: Charting the Path to Customer Satisfaction.
This data-driven pursuit doesn't just stop at numbers; it's about revealing the stories behind each return. As you merge and reshape the data, you're paving the way for the e-commerce business to craft strategies that enhance customer satisfaction. Your journey persists, and with each line of code, you bring clarity to the complex relationship between returns, customer behaviors, and the pursuit of a more satisfying shopping experience.

#### Analyzing Customer Return Rates.
- Calculate the 'customer_return_rate' by using the Pandas .groupby() method on the 'ordered_returned_sums' DataFrame and grouping by "customer_id." Then, calculate the mean of the "average_return_rate" for each customer and reset the index. This DataFrame contains the average return rate for each customer.
- Create a 'return_rates' DataFrame by applying the .value_counts() method to the "average_return_rate" column of 'customer_return_rate.' Then, reset the index and rename the columns to "average return rate" and "count of unit return rate."
- Rename the columns in the 'return_rates' DataFrame to match the specified column names: "average return rate" and "count of unit return rate."
- Merge the 'customers_orders' DataFrame with the 'customer_return_rate' DataFrame based on the "customer_id" column. This creates the 'customers' DataFrame, which combines information about the total number of products ordered by each customer and their average return rates.
- The 'customers' DataFrame contains data on customer orders, their average return rates, and the count of unit return rates for each average return rate value.


In [5]:
#--- WRITE YOUR CODE FOR TASK 1 ---
customer_return_rate = ordered_returned_sums.groupby('customer_id')['average_return_rate'].mean().reset_index()
return_rates = customer_return_rate['average_return_rate'].value_counts().reset_index()
return_rates = return_rates.rename({'index': 'average return rate', 'average_return_rate': 'count of unit return rate'}, axis = 1)
# #--- Inspect data ---
customers = pd.merge(customers_orders, customer_return_rate, on='customer_id')
customers

Unnamed: 0,customer_id,products_ordered,average_return_rate
0,1000661,1,-0.333333
1,1001914,1,-1.000000
2,1002167,3,-1.000000
3,1002387,1,-1.000000
4,1002419,2,-0.500000
...,...,...,...
24869,97805007741979,2,-0.200000
24870,98854671633650,2,-1.000000
24871,98974226154136,1,-1.000000
24872,99262726332691,2,-0.750000


#### Calculating Customer Total Spending.
- Calculate the 'customer_total_spending' by using the Pandas .groupby() method on the 'df' DataFrame and grouping by "customer_id." Then, calculate the sum of "total_sales" for each customer and reset the index. This DataFrame contains the total spending for each customer.
- Rename the column in the 'customer_total_spending' DataFrame from "total_sales" to "total_spending" to provide a more descriptive column name.
- The 'customer_total_spending' DataFrame contains data on each customer's total spending. The column name has been updated for clarity, reflecting the total spending information.


#### Task 2: The Currency of Customer Loyalty.
Renaming columns for clarity, you're preparing a roadmap for the e-commerce company to comprehend customer spending patterns and to identify the high-value customers. With each line of code, you're crafting a narrative of how loyalty is reflected in the currency customers invest, further empowering the business to tailor strategies that nurture and retain its most valuable assets. Your journey persists, as you count the coins that echo the story of customer loyalty and financial success.

In [6]:
# Task 2: The Currency of Customer Loyalty.
# Renaming columns for clarity, you're preparing a roadmap for the e-commerce company to comprehend customer spending patterns and to identify the high-value customers. With each line of code, you're crafting a narrative of how loyalty is reflected in the currency customers invest, further empowering the business to tailor strategies that nurture and retain its most valuable assets. Your journey persists, as you count the coins that echo the story of customer loyalty and financial success.

#--- WRITE YOUR CODE FOR TASK 2 ---
customer_total_spending = df.groupby('customer_id')['total_sales'].sum().reset_index()
customer_total_spending.rename({'total_sales': 'total_spending'}, axis = 1, inplace = True)
#--- Inspect data ---
customer_total_spending

Unnamed: 0,customer_id,total_spending
0,1000661,260.0
1,1001914,79.2
2,1002167,234.2
3,1002387,89.0
4,1002419,103.0
...,...,...
24869,97805007741979,259.0
24870,98854671633650,242.5
24871,98974226154136,89.0
24872,99262726332691,267.0


#### Merging and Cleaning Customer Data.
- Merge the 'customers' DataFrame with the 'customer_total_spending' DataFrame based on the "customer_id" column. This combines information about customer orders, their average return rates, and total spending for each customer.
- Use the .drop() method to remove the "customer_id" column from the 'customers' DataFrame, as it is no longer needed.
- The 'customers' DataFrame now contains customer information, including total spending, average return rates, and the count of unit return rates for each average return rate value. The "customer_id" column has been removed for a cleaner final dataset.

#### Task 3: Customer Chronicles: Weaving a Tapestry of Insights.
With precision, you've shaped this data tapestry, and now it's a map to understanding customer behaviors, return rates, and the financial footprint of loyalty. As you remove the identifier and streamline the data, you're arming the e-commerce business with a holistic view of its customers. Your journey continues, and with each line of code, you're adding depth and color to the evolving story of customer engagement and business success.

In [7]:

#--- WRITE YOUR CODE FOR TASK 3 ---
customers = pd.merge(customers, customer_total_spending, on='customer_id')
customers.drop('customer_id', axis = 1, inplace= True)
#--- Inspect data ---
customers

Unnamed: 0,products_ordered,average_return_rate,total_spending
0,1,-0.333333,260.0
1,1,-1.000000,79.2
2,3,-1.000000,234.2
3,1,-1.000000,89.0
4,2,-0.500000,103.0
...,...,...,...
24869,2,-0.200000,259.0
24870,2,-1.000000,242.5
24871,1,-1.000000,89.0
24872,2,-0.750000,267.0


#### Transforming and Enhancing Customer Data.
- Define a list of columns to transform, including "products_ordered," "average_return_rate," and "total_spending" in the 'columns' list.
- Iterate over the columns in the 'columns' list.
- For each column, apply a natural logarithm transformation using np.log1p() to the data in the selected column, creating 'transformed_column.'
- Round the values in 'transformed_column' to two decimal places using the .round(2) method, resulting in 'rounded_column.'
- Create new columns in the 'customers' DataFrame by prefixing "log_" to the original column name. These new columns contain the rounded, logarithmically transformed data.
- The 'customers' DataFrame is enhanced with additional columns containing the rounded and logarithmically transformed values for "products_ordered", "average_return_rate", and "total_spending".

#### Task 4: Transformed Insights.
By applying logarithmic transformations and rounding to two decimal places, you're elevating the data into a realm of precision and clarity. The columns "products_ordered," "average_return_rate," and "total_spending" are now elegantly reshaped, revealing a new perspective on customer behavior and loyalty. With each line of code, you're adding a layer of sophistication to the e-commerce data, turning it into a canvas that's ready for more profound analysis and strategic decision-making. Your journey continues, as you unlock the secrets within the data, ready to paint the next chapter in the story of data-driven success.

In [8]:
import numpy as np
#--- WRITE YOUR CODE FOR TASK 4 ---
columns = ['products_ordered', 'average_return_rate', 'total_spending']

for column in columns:
    transformed_column = np.log1p(customers[column])
    rounded_column = transformed_column.round(2)
    
    customers[f"log_{column}"] = rounded_column
    

customers['log_average_return_rate'] = customers['log_average_return_rate'].replace((np.inf, -np.inf), -1)
customers

  result = getattr(ufunc, method)(*inputs, **kwargs)


Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending
0,1,-0.333333,260.0,0.69,-0.41,5.56
1,1,-1.000000,79.2,0.69,-1.00,4.38
2,3,-1.000000,234.2,1.39,-1.00,5.46
3,1,-1.000000,89.0,0.69,-1.00,4.50
4,2,-0.500000,103.0,1.10,-0.69,4.64
...,...,...,...,...,...,...
24869,2,-0.200000,259.0,1.10,-0.22,5.56
24870,2,-1.000000,242.5,1.10,-1.00,5.50
24871,1,-1.000000,89.0,0.69,-1.00,4.50
24872,2,-0.750000,267.0,1.10,-1.39,5.59


### Module 3
#### Task 1: Cluster Quest: Unveiling the Essence of Customer Segmentation.

In your enthralling journey, the stage is set for the "Cluster Quest." With the power of scikit-learn's K-means clustering, you're poised to unveil the essence of customer segmentation. The code you've crafted launches a sophisticated algorithm to partition customers into distinct groups based on their log-transformed metrics.

The K-means model's score, carefully rounded, reflects the inertia of the clusters, a critical measure of the model's performance. As you press forward, you're about to discover the clusters that define customer segments, empowering the e-commerce business to tailor its strategies with newfound precision. Your journey continues, as you prepare to reveal the secrets that lie within these clusters, setting the stage for data-driven success in the e-commerce landscape.

#### K-Means Clustering and Scoring.
- Import the K-Means clustering model from the scikit-learn library using from sklearn.cluster import KMeans.
- Initialize the K-Means model with the following parameters:
- init='k-means++': Use the k-means++ initialization method.
- max_iter=500: Set the maximum number of iterations for K-Means to 500.
- random_state=42: Set a random seed for reproducibility.
- n_init=10: Explicitly set the number of times the algorithm will be run with different centroid seeds to 10 in order to suppress a warning.
- Fit the K-Means model to the customer data using kmeans_model.fit(customers.iloc[:, 3:]). It applies clustering to the columns in the 'customers' DataFrame starting from the 4th column.
- Calculate the K-Means score using kmeans_model.inertia_, which represents the sum of squared distances between data points and their assigned clusters. Round the score to two decimal places, resulting in 'kmeans_score.'
- The 'kmeans_score' represents the quality of the clustering, with a lower score indicating more compact and well-separated clusters.
- The code provides the K-Means score for the clustering results on the customer data.


In [9]:
from sklearn.cluster import KMeans
kmeans_model = KMeans(init='k-means++', max_iter=500, random_state=42, n_init=10)
kmeans_Cols = customers.iloc[:, 3:]

kmeans_fit = kmeans_model.fit(kmeans_Cols)
#--- WRITE YOUR CODE FOR TASK 1 ---
kmeans_score = kmeans_model.inertia_
kmeans_score = round(kmeans_score, 2)
#--- Inspect data ---
kmeans_score



2381.08

#### Task 2: Finding the Sweet Spot: The Clusters' Hidden Harmony.
In your data-driven journey, you venture into a quest to determine the optimal number of clusters that will reveal the hidden harmony within the data. Your code expertly explores a range of cluster values, from 1 to 15, using the K-means algorithm.

As you iterate through each cluster value, you meticulously record the inertia, capturing the essence of the clustering quality. With each round, you're inching closer to discovering the perfect balance that defines the clusters. This insight holds the key to shaping tailored strategies that align with the nature of the customer base. Your journey continues, as you prepare to unveil the clusters' hidden harmony, setting the stage for a new level of data-driven success in the e-commerce realm.

#### Determining the Number of Clusters (K) for K-Means.
- Create a new DataFrame named 'dataframe' by selecting columns from the 'customers' DataFrame starting from the 4th column (index 3) to the end. This DataFrame contains the data for clustering.
- Define the number of clusters (K) as 15 using K = 15.
- Create a list named 'cluster_values' containing integers from 1 to K (inclusive) to represent the range of cluster numbers you want to evaluate.
- Initialize an empty list named 'inertia_values' to store the inertia values for different K values.
- Use a for loop to iterate over the values in 'cluster_values' (from 1 to 15).
- For each value of 'c' (the number of clusters), create a K-Means model with the following parameters:
- n_clusters = c: Set the number of clusters to 'c.'
- init='k-means++': Use the k-means++ initialization method.
- max_iter=500: Set the maximum number of iterations to 500.
- random_state=42: Set a random seed for reproducibility.
- n_init=10: Explicitly set the number of times the algorithm will be run with different centroid seeds to 10 to suppress a warning.
- Fit the K-Means model to the 'dataframe' and calculate the inertia, which represents the sum of squared distances between data points and their assigned clusters. Round the inertia value to two decimal places.
- Append the rounded inertia value to the 'inertia_values' list for the current number of clusters (K).
- The 'inertia_values' list contains the inertia values for different numbers of clusters (K), allowing you to analyze and determine the optimal number of clusters for K-Means clustering.
- The code provides a list of inertia values for various numbers of clusters, which can be used to identify the optimal K for clustering the customer data.


In [10]:
# pip install --upgrade threadpoolctl


In [19]:
dataframe = customers.iloc[:, 3:]
k = 15
cluster_values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
inertia_values = []

for c in cluster_values:
    kmodel = KMeans(init='k-means++', max_iter=500, random_state=42, n_init=10, n_clusters=c)
    kfit = kmodel.fit(dataframe)
    kmodel_score = kmodel.inertia_
    kmodel_score = round(kmodel_score, 2)
    inertia_values.append(kmodel_score)

    
#--- WRITE YOUR CODE FOR TASK 2 ---
inertia_values
#--- Inspect data ---

[17363.9,
 7963.07,
 5251.57,
 4215.74,
 3608.57,
 3146.05,
 2741.73,
 2381.08,
 2121.97,
 1936.82,
 1802.84,
 1675.74,
 1560.24,
 1473.27,
 1373.53]

#### Task 3: Cluster Symphony: The Grand Unveiling.¶
With the insights gained from the previous steps, you've chosen the optimal number of clusters, and the updated K-means model is poised to perform its magic.

As the model fitting and prediction unfold, you're on the verge of unveiling the refined customer segments. These segments, meticulously crafted through data alchemy, logarithmic transformations, and precision clustering, represent the heart of customer insights. With each line of code, you're on the cusp of discovering the harmonic clusters that will empower the e-commerce business to tailor its strategies with unparalleled precision. Your journey continues, as you're about to reveal the grand symphony of customer segmentation, setting the stage for a new era of data-driven success in the e-commerce landscape.

#### Applying K-Means Clustering with Optimized K.
- Create an updated K-Means model named 'updated_kmeans_model' with the following parameters:
- n_clusters=4: Set the number of clusters to 4, which represents the optimized K.
- init='k-means++': Use the k-means++ initialization method.
- max_iter=500: Set the maximum number of iterations to 500.
- random_state=42: Set a random seed for reproducibility.
- n_init=10: Explicitly set the number of times the algorithm will be run with different centroid seeds to 10 to suppress a warning.
- Use the .fit_predict() method of 'updated_kmeans_model' to apply K-Means clustering to the customer data. Select the columns from the 'customers' DataFrame starting from the 4th column (index 3).
- This method assigns each customer to one of the four clusters.
- The 'res' variable contains the cluster assignments for each customer based on the optimized K-Means clustering with K=4.
- The code applies K-Means clustering to the customer data and assigns each customer to one of the four clusters, providing the cluster assignments in the 'res' variable.

In [23]:

#--- WRITE YOUR CODE FOR TASK 3 ---
updated_kmeans_model = KMeans(n_clusters = 4, init = 'k-means++', max_iter=500, random_state=42, n_init=10)
res = updated_kmeans_model.fit_predict(dataframe)


#--- Inspect data ---
res

array([0, 1, 0, ..., 1, 3, 1], dtype=int32)

#### Task 4: Cluster Insights Unleashed: The Symphony Resonates.
In your data-driven odyssey, you've now reached the pinnacle with "Cluster Insights Unleashed." Your code takes the refined cluster centers and transforms them back into their original, interpretable values. These centers represent the essence of each customer segment, reflecting product preferences, return rates, and total spending.

As you align the clusters and round the values to two decimal places, you're preparing to reveal the symphony of insights hidden within these customer segments. The clusters are no longer abstract; they're tangible profiles that will guide the e-commerce business towards more precise strategies. With each line of code, you're about to unveil the grand symphony of customer segmentation, setting the stage for a new era of data-driven success in the e-commerce landscape. Your journey has reached its crescendo, and the insights within the clusters are ready to reshape the future of the business.

#### Calculating Cluster Centers and Creating a Final Customer DataFrame.
- Calculate the 'cluster_centers' using updated_kmeans_model.cluster_centers_, which provides the feature values for the center of each cluster.
- Use the labels_ attribute of the updated_kmeans_model to assign cluster labels to each customer and add a new column named "clusters" to the 'customers'
- Apply the inverse transformation to the cluster centers using np.expm1() to obtain 'actual_data' values.
- Concatenate transformed cluster centers (actual_data) with the original cluster centers (cluster_centers) horizontally. Here, the actual_data is concatenated first, followed by cluster_centers.
- Concatenate the 'add_points' values once again with the original cluster centers, incorporating cluster labels [0, 1, 2, 3] using axis = 1. Save the result in the existing array named 'add_points' itself.
- Build a new DataFrame named 'centers_df' from the 'add_points' array with column names: "products_ordered", "average_return_rate", "total_spending", "log_products_ordered", "log_average_return_rate", "log_total_spending" and "clusters".
- Convert the "clusters" column in 'centers_df' to the integer data type using .astype("int").
- Apply the .round(2) method to 'centers_df' to round all values to two decimal places and assign it to 'rounded_centers_df'.
- Copy the 'customers' DataFrame to 'customers_final'.
- Hint: Use the following code for the concatinating: add_points = np.append(add_points, [[0], [1], [2], [3]], axis=1)

In [86]:
cluster_centers = updated_kmeans_model.cluster_centers_
labels = updated_kmeans_model.labels_
#--- WRITE YOUR CODE FOR TASK 4 ---
customers['clusters'] = labels
actual_data = np.expm1(cluster_centers)

cluster_centers_df = pd.DataFrame(cluster_centers, columns = dataframe.columns)
actual_data_df = pd.DataFrame(actual_data, columns = dataframe.columns)

add_points = np.append(actual_data_df, cluster_centers_df, axis=1)
add_points = np.append(add_points, [[0], [1], [2], [3]], axis=1)

col_names = ["products_ordered", "average_return_rate", "total_spending", "log_products_ordered", "log_average_return_rate", "log_total_spending","clusters"]

centers_df = pd.DataFrame(add_points, columns= col_names)
centers_df['clusters'] = centers_df['clusters'].astype('int')
rounded_centers_df = centers_df.round(2)
customers_final = customers.copy()

#--- Inspect data ---

rounded_centers_df

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters
0,1.64,-0.58,174.73,0.97,-0.87,5.17,0
1,1.01,-0.63,78.0,0.7,-0.99,4.37,1
2,3.07,-0.53,402.09,1.4,-0.76,6.0,2
3,2.97,-0.81,363.91,1.38,-1.68,5.9,3


#### Task 5: Convergence of Insights: Merging the Customer Tapestry.
You're at the juncture of bringing together the customer data and the refined cluster centers. As you weave this tapestry, you're not only uniting customer profiles with cluster centers, but you're also assigning the identity of "center" to the cluster points.

The merger of these datasets creates a comprehensive view, fusing customer behaviors with the essence of each cluster. This unified data set is a roadmap for the e-commerce business, offering insights into the unique characteristics of each segment. With each line of code, you're guiding the business towards a new era of personalized strategies and data-driven success. Your journey reaches a pivotal moment, as the convergence of insights promises to reshape the future of the e-commerce landscape.

#### Combining Customer Data with Cluster Centers.
- Add a new column named "is_center" to the 'customers_final' DataFrame and initialize it with zeros for all customer data rows.
- Add a new column with the same name, "is_center," to the 'rounded_centers_df' DataFrame and set the value to 1 for all cluster center rows.
- Use the pd.concat() method to append the contents of 'rounded_centers_df' to the 'customers_final' DataFrame, setting ignore_index=True to reindex the resulting DataFrame.
- The 'customers' DataFrame now contains the combined data from both customer information and cluster centers. The "is_center" column distinguishes between customer data and cluster centers, where 0 indicates customer data, and 1 indicates cluster centers.

In [90]:
customers_final['is_center'] = 0
rounded_centers_df['is_center'] = 1
#--- WRITE YOUR CODE FOR TASK 5 ---
#customers = ...
customers_final = pd.concat([rounded_centers_df, customers_final], ignore_index= True)
#--- Inspect data ---
customers_final

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters,is_center
0,1.64,-0.58,174.73,0.97,-0.87,5.17,0,1
1,1.01,-0.63,78.00,0.70,-0.99,4.37,1,1
2,3.07,-0.53,402.09,1.40,-0.76,6.00,2,1
3,2.97,-0.81,363.91,1.38,-1.68,5.90,3,1
4,1.64,-0.58,174.73,0.97,-0.87,5.17,0,0
...,...,...,...,...,...,...,...,...
24877,2.00,-0.20,259.00,1.10,-0.22,5.56,2,0
24878,2.00,-1.00,242.50,1.10,-1.00,5.50,0,0
24879,1.00,-1.00,89.00,0.69,-1.00,4.50,1,0
24880,2.00,-0.75,267.00,1.10,-1.39,5.59,3,0


#### Task 6: The Tapestry of Segmentation: Magnitude Unveiled.¶
In the ever-evolving realm of data analysis, your journey reaches an intriguing chapter titled "The Tapestry of Segmentation." With your clusters and customer data now harmoniously combined, you're about to unveil the magnitude of each customer group.

As you convert cluster labels to strings and meticulously record the cardinality of each cluster, you're preparing to paint a vivid picture of the customer landscape. The "Customer Group Magnitude" holds the key to understanding the size and significance of each segment.

With each line of code, you're providing the e-commerce business with insights that go beyond the clusters themselves, offering a deeper understanding of customer behavior. Your journey continues, as you set the stage for a new era of data-driven success, where customer segmentation becomes a fundamental part of business strategy.

#### Creating a Summary of Customer Clusters.
- Create a new column named "cluster_name" in the 'customers' DataFrame. Assign cluster labels as strings to this column, converting the cluster labels in the "clusters" column to strings using .astype(str).
- Create a new variable named 'final_result' to summarize the customer clusters. Use the .value_counts() method on the "cluster_name" column of the 'customers' DataFrame to count the number of customers in each cluster.
- Reset the index of the 'final_result' DataFrame using reset_index() method.
- Rename the columns in 'final_result' using the rename() method only with inplace=True as follows:
"cluster_name" to "Customer Groups"
"count" to "Customer Group Magnitude"
- The 'final_result' DataFrame provides a summary of customer clusters, displaying the customer group names and the number of customers in each group.

In [98]:
customers['cluster_name'] = labels.astype(str)

#--- WRITE YOUR CODE FOR TASK 6 ---
final_result = customers['cluster_name'].value_counts().reset_index()
final_result = final_result.rename({'cluster_name': 'Customer Groups', 'cluster_name': 'Customer Group Magnitude'}, axis=1)
#--- Inspect data ---
final_result

Unnamed: 0,index,Customer Group Magnitude
0,1,10977
1,0,8170
2,2,3262
3,3,2465
