# TravelTide - segmentation comparison

- Compare the segmentation results of K-means and Fuzzy clustering
    - Fuzzy
    - K-means
- Conclude, with reasoning, which method we should use for our project

In [1]:
import pandas as pd

# Load the CSV files into DataFrames
kmeans_df = pd.read_csv('data/output/customer_segmented_kmeans.csv')
fuzzy_df = pd.read_csv('data/output/customer_segmented_fuzzy.csv')

# Extract 'user_id' and 'assigned_segment' columns
kmeans_segments = kmeans_df[['user_id', 'assigned_segment']]
fuzzy_segments = fuzzy_df[['user_id', 'assigned_segment']]

# Merge the two DataFrames based on 'user_id'
merged_segments = pd.merge(kmeans_segments, fuzzy_segments, on='user_id', suffixes=('_kmeans', '_fuzzy'))

# Display mapping between 'assigned_segment' values
segment_mapping = merged_segments.groupby(['assigned_segment_kmeans', 'assigned_segment_fuzzy']).size().reset_index(name='count')
print(segment_mapping)


   assigned_segment_kmeans                    assigned_segment_fuzzy  count
0                Cluster 1               exclusive_discounts_segment    125
1                Cluster 1                  free_checked_bag_segment   1862
2                Cluster 1                   free_hotel_meal_segment    398
3                Cluster 1  one_night_hotel_free_with_flight_segment     28
4                Cluster 2               exclusive_discounts_segment    367
5                Cluster 2                  free_checked_bag_segment   1201
6                Cluster 2                   free_hotel_meal_segment    783
7                Cluster 3               exclusive_discounts_segment    105
8                Cluster 3                  free_checked_bag_segment    474
9                Cluster 3  one_night_hotel_free_with_flight_segment      6
10               Cluster 4               exclusive_discounts_segment     34
11               Cluster 4                  free_checked_bag_segment     10
12          

In [2]:
merged_segments

      user_id  assigned_segment_kmeans    assigned_segment_fuzzy
0       23557                Cluster 3  free_checked_bag_segment
1       94883                Cluster 1  free_checked_bag_segment
2      101486                Cluster 1  free_checked_bag_segment
3      101961                Cluster 2  free_checked_bag_segment
4      106907                Cluster 5   free_hotel_meal_segment
...       ...                      ...                       ...
5993   792549                Cluster 2   free_hotel_meal_segment
5994   796032                Cluster 5   free_hotel_meal_segment
5995   801660                Cluster 2   free_hotel_meal_segment
5996   811077                Cluster 3  free_checked_bag_segment
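
To put a number on how the two labelings line up, the sketch below computes, for each K-means cluster, which fuzzy segment is most common and what share of the cluster it covers. The counts are copied from the `segment_mapping` output of the first cell; only Clusters 1 to 3 are included because the rows for Clusters 4 and 5 are cut off in that output. The helper `dominant_share` is a name introduced here for illustration and is not part of the original notebook.

```python
# Counts copied from the printed segment_mapping for Clusters 1-3.
# Clusters 4 and 5 are omitted because their rows are truncated in the output.
counts = {
    "Cluster 1": {
        "exclusive_discounts_segment": 125,
        "free_checked_bag_segment": 1862,
        "free_hotel_meal_segment": 398,
        "one_night_hotel_free_with_flight_segment": 28,
    },
    "Cluster 2": {
        "exclusive_discounts_segment": 367,
        "free_checked_bag_segment": 1201,
        "free_hotel_meal_segment": 783,
    },
    "Cluster 3": {
        "exclusive_discounts_segment": 105,
        "free_checked_bag_segment": 474,
        "one_night_hotel_free_with_flight_segment": 6,
    },
}


def dominant_share(segment_counts):
    """Return (most common fuzzy segment, its share of the cluster's users)."""
    total = sum(segment_counts.values())
    segment, count = max(segment_counts.items(), key=lambda item: item[1])
    return segment, count / total


shares = {cluster: dominant_share(c) for cluster, c in counts.items()}
for cluster, (segment, share) in shares.items():
    print(f"{cluster}: {segment} ({share:.1%})")
```

For these three clusters the most common fuzzy segment is `free_checked_bag_segment` every time, with shares of roughly 77.2%, 51.1%, and 81.0%. No cluster maps cleanly onto a single distinct fuzzy segment.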


## Conclusion

- The segmentation results of the two methods (Fuzzy and K-means) do NOT match perfectly.
- Distance-based methods like K-means are very powerful but have a significant limitation: the meaning of the segments is worked out only after they have been formed. K-means groups data in a bottom-up manner, so we have to analyze the resulting clusters to discover what they represent. K-means and other distance-based methods therefore rely mainly on inductive reasoning to discover segment meanings.
- This is not the case for methods like Categorization and Thresholding, which are used in Fuzzy segmentation. In these methods, we define segments in a top-down manner. In the language of formal logic, these methods let us use deductive reasoning.
- Since we can use deductive reasoning, we can use the knowledge we have about the data to define the segments manually. If we cannot explain the metrics and segments to a non-technical audience in simple words, we will not use them.
- Hence, we will go with Fuzzy segmentation.
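
As a rough illustration of what "top-down" means here, the sketch below assigns a segment from explicit, human-readable thresholds. The metric names (`avg_checked_bags`, `discount_booking_rate`), the threshold values, and the fallback segment are all hypothetical and chosen only for this example; they are not the rules that produced `customer_segmented_fuzzy.csv`. It is also a crisp simplification: it picks one segment per user and does not model degrees of membership.

```python
# Hypothetical rules for illustration only; NOT the project's actual thresholds.
# Each rule reads: "a user qualifies for <segment> if <metric> >= <minimum>".
RULES = [
    ("free_checked_bag_segment", "avg_checked_bags", 1.0),
    ("exclusive_discounts_segment", "discount_booking_rate", 0.5),
]
DEFAULT_SEGMENT = "free_hotel_meal_segment"  # hypothetical fallback


def assign_segment(user_metrics):
    """Return the first segment whose threshold the user meets, else the default."""
    for segment, metric, minimum in RULES:
        if user_metrics.get(metric, 0.0) >= minimum:
            return segment
    return DEFAULT_SEGMENT


# Example users with made-up metric values.
print(assign_segment({"avg_checked_bags": 1.5, "discount_booking_rate": 0.1}))
print(assign_segment({"avg_checked_bags": 0.2, "discount_booking_rate": 0.7}))
print(assign_segment({"avg_checked_bags": 0.0, "discount_booking_rate": 0.0}))
```

Because every rule is a single sentence about one metric, each assignment can be explained to a non-technical audience by reading the rule that fired, which is the property the conclusion relies on.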
