
Name : Saptarshi Mukherjee

In [58]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Load datasets
print("\n. Loading datasets...")
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

print("\nCustomers data sample:")
print(customers.head())
print("\nProducts data sample:")
print(products.head())
print("\nTransactions data sample:")
print(transactions.head())



. Loading datasets...

Customers data sample:
  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15

Products data sample:
  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31

Transactions data sample:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:2

In [59]:
# Process signup date
print("\n. Processing signup dates...")
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
customers['days_since_signup'] = (pd.to_datetime('today') - customers['SignupDate']).dt.days
print("\nCustomers data with processed dates:")
print(customers.head())

# Merge datasets
print("\n. Merging datasets...")
merged_data = transactions.merge(customers, on='CustomerID')
merged_data = merged_data.merge(products, on='ProductID')
print("\nMerged data sample:")
print(merged_data.head())
print("\nMerged data shape:", merged_data.shape)



. Processing signup dates...

Customers data with processed dates:
  CustomerID        CustomerName         Region SignupDate  days_since_signup
0      C0001    Lawrence Carroll  South America 2022-07-10                933
1      C0002      Elizabeth Lutz           Asia 2022-02-13               1080
2      C0003      Michael Rivera  South America 2024-03-07                327
3      C0004  Kathleen Rodriguez  South America 2022-10-09                842
4      C0005         Laura Weber           Asia 2022-08-15                897

. Merging datasets...

Merged data sample:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10        

In [61]:
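# Feature engineering: aggregate transactions per customer.
# The aggregation yields the ten columns that are renamed in the next cell.
customer_features = merged_data.groupby('CustomerID').agg({
    'TotalValue': ['sum', 'mean', 'std'],
    'Quantity': ['sum', 'mean', 'std'],
    'Category': lambda x: x.mode()[0],  # Most frequent category (first one on ties)
    'Region': 'first',  # Region is the same for a customer
    'days_since_signup': 'first'  # Constant per customer
}).reset_index()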

In [47]:
# Flatten column names
customer_features.columns = ['CustomerID', 'Total_Value_Sum', 'Total_Value_Mean',
                           'Total_Value_Std', 'Quantity_Sum', 'Quantity_Mean',
                           'Quantity_Std', 'Most_Frequent_Category', 'Region',
                           'days_since_signup']

print("\nCustomer features sample:")
print(customer_features.head())
print("\nFeature statistics:")
print(customer_features.describe())



Customer features sample:
  CustomerID  Total_Value_Sum  Total_Value_Mean  Total_Value_Std  \
0      C0001          3354.52           670.904       456.643861   
1      C0002          1862.74           465.685       219.519169   
2      C0003          2725.38           681.345       559.276543   
3      C0004          5354.88           669.360       325.386829   
4      C0005          2034.24           678.080       310.820746   

   Quantity_Sum  Quantity_Mean  Quantity_Std Most_Frequent_Category  \
0            12       2.400000      0.547723            Electronics   
1            10       2.500000      1.000000             Home Decor   
2            14       3.500000      0.577350             Home Decor   
3            23       2.875000      1.125992                  Books   
4             7       2.333333      0.577350            Electronics   

          Region  days_since_signup  
0  South America                933  
1           Asia               1080  
2  South America       

In [48]:
# Handle null values
customer_features = customer_features.fillna({
    'Total_Value_Std': 0,
    'Quantity_Std': 0
})
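# Calculate additional features
# Note: despite its name, avg_transaction_value is spend per unit purchased, not per transaction
customer_features['avg_transaction_value'] = customer_features['Total_Value_Sum'] / customer_features['Quantity_Sum']
# Units purchased per day since signup (days_since_signup must be non-zero)
customer_features['purchase_frequency'] = customer_features['Quantity_Sum'] / customer_features['days_since_signup']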

print(" Additional features added:")
print(customer_features[['CustomerID', 'avg_transaction_value', 'purchase_frequency']].head())

# Separate features for encoding
numerical_cols = ['Total_Value_Sum', 'Total_Value_Mean', 'Total_Value_Std',
                 'Quantity_Sum', 'Quantity_Mean', 'Quantity_Std',
                 'days_since_signup', 'avg_transaction_value', 'purchase_frequency']
categorical_cols = ['Most_Frequent_Category', 'Region']

 Additional features added:
  CustomerID  avg_transaction_value  purchase_frequency
0      C0001             279.543333            0.012862
1      C0002             186.274000            0.009259
2      C0003             194.670000            0.042813
3      C0004             232.820870            0.027316
4      C0005             290.605714            0.007804


In [52]:
# Store customer IDs
customer_ids = customer_features['CustomerID'].values

# One-hot encode categorical features
encoded_categorical = pd.get_dummies(customer_features[categorical_cols], prefix=categorical_cols)
print("\nEncoded categorical features sample:")
print(encoded_categorical.head())



Encoded categorical features sample:
   Most_Frequent_Category_Books  Most_Frequent_Category_Clothing  \
0                         False                            False   
1                         False                            False   
2                         False                            False   
3                          True                            False   
4                         False                            False   

   Most_Frequent_Category_Electronics  Most_Frequent_Category_Home Decor  \
0                                True                              False   
1                               False                               True   
2                               False                               True   
3                               False                              False   
4                                True                              False   

   Region_Asia  Region_Europe  Region_North America  Region_South America  
0        False      

In [53]:
# Normalize numerical features
print("\n. Normalizing numerical features...")
scaler = StandardScaler()
normalized_numerical = pd.DataFrame(
    scaler.fit_transform(customer_features[numerical_cols]),
    columns=numerical_cols
)
print("\nNormalized numerical features sample:")
print(normalized_numerical.head())


. Normalizing numerical features...

Normalized numerical features sample:
   Total_Value_Sum  Total_Value_Mean  Total_Value_Std  Quantity_Sum  \
0        -0.061701         -0.070263         0.079511     -0.122033   
1        -0.877744         -0.934933        -1.090817     -0.448000   
2        -0.405857         -0.026271         0.586054      0.203934   
3         1.032547         -0.076769        -0.568308      1.670787   
4        -0.783929         -0.040028        -0.640199     -0.936951   

   Quantity_Mean  Quantity_Std  days_since_signup  avg_transaction_value  \
0      -0.233464     -1.081215           1.148752               0.104195   
1      -0.054969     -0.008562           1.600431              -1.152382   
2       1.729980     -1.010948          -0.713270              -1.039267   
3       0.614387      0.290248           0.869141              -0.525277   
4      -0.352460     -1.010948           1.038137               0.253234   

   purchase_frequency  
0           -0.5

In [54]:
# Combine features
final_features = pd.concat([normalized_numerical, encoded_categorical], axis=1)
print("\n. Final features shape:", final_features.shape)
print("\nFinal features sample:")
print(final_features.head())

# Calculate similarity matrix
print("\n. Calculating similarity matrix...")
similarity_matrix = cosine_similarity(final_features)
print("\nSimilarity matrix shape:", similarity_matrix.shape)
print("\nSimilarity matrix sample (first 5x5):")
print(similarity_matrix[:5, :5])



. Final features shape: (199, 17)

Final features sample:
   Total_Value_Sum  Total_Value_Mean  Total_Value_Std  Quantity_Sum  \
0        -0.061701         -0.070263         0.079511     -0.122033   
1        -0.877744         -0.934933        -1.090817     -0.448000   
2        -0.405857         -0.026271         0.586054      0.203934   
3         1.032547         -0.076769        -0.568308      1.670787   
4        -0.783929         -0.040028        -0.640199     -0.936951   

   Quantity_Mean  Quantity_Std  days_since_signup  avg_transaction_value  \
0      -0.233464     -1.081215           1.148752               0.104195   
1      -0.054969     -0.008562           1.600431              -1.152382   
2       1.729980     -1.010948          -0.713270              -1.039267   
3       0.614387      0.290248           0.869141              -0.525277   
4      -0.352460     -1.010948           1.038137               0.253234   

   purchase_frequency  Most_Frequent_Category_Books  \
0 

In [55]:
# Get top 3 similar customers for first 20 customers
print("\n. Finding top 3 similar customers...")
similar_customers = {}
for i in range(min(20, len(customer_ids))):
    # Exclude the customer itself explicitly instead of assuming it sorts first
    similarities = [(j, s) for j, s in enumerate(similarity_matrix[i]) if j != i]
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)[:3]
    similar_customers[customer_ids[i]] = [
        (customer_ids[j], round(float(score), 4))  # plain float keeps the CSV text clean
        for j, score in sorted_similarities
    ]



. Finding top 3 similar customers...


In [56]:
# Create and save output file
output_df = pd.DataFrame({
    'CustomerID': list(similar_customers.keys()),
    'SimilarCustomers_WithScores': [str(v) for v in similar_customers.values()]
})
output_df.to_csv('Lookalike.csv', index=False)

print("\n. Final Results:")
print("\nOutput file preview:")
print(output_df.head())



. Final Results:

Output file preview:
  CustomerID                        SimilarCustomers_WithScores
0      C0001  [('C0184', 0.7055), ('C0005', 0.6771), ('C0118...
1      C0002  [('C0060', 0.7666), ('C0086', 0.7579), ('C0025...
2      C0003  [('C0144', 0.8501), ('C0136', 0.7312), ('C0091...
3      C0004  [('C0165', 0.7826), ('C0109', 0.777), ('C0153'...
4      C0005  [('C0130', 0.7769), ('C0150', 0.7328), ('C0131...


In [45]:
# Print detailed recommendations
print("\nDetailed recommendations for first 3 customers:")
for customer_id in list(similar_customers.keys())[:3]:
    print(f"\nCustomer {customer_id}:")
    print("Original customer profile:")
    print(customer_features[customer_features['CustomerID'] == customer_id].iloc[0])
    print("\nSimilar customers:")
    for similar_id, score in similar_customers[customer_id]:
        print(f"\nSimilar customer {similar_id} (Similarity score: {score}):")
        print(customer_features[customer_features['CustomerID'] == similar_id].iloc[0])


Detailed recommendations for first 3 customers:

Customer C0001:
Original customer profile:
CustomerID                        C0001
Total_Value_Sum                 3354.52
Total_Value_Mean                670.904
Total_Value_Std              456.643861
Quantity_Sum                         12
Quantity_Mean                       2.4
Quantity_Std                   0.547723
Most_Frequent_Category      Electronics
Region                    South America
days_since_signup                   933
avg_transaction_value        279.543333
purchase_frequency             0.012862
Name: 0, dtype: object

Similar customers:

Similar customer C0184 (Similarity score: 0.7055):
CustomerID                        C0184
Total_Value_Sum                 3393.18
Total_Value_Mean                 484.74
Total_Value_Std              209.079667
Quantity_Sum                         11
Quantity_Mean                  1.571429
Quantity_Std                   0.786796
Most_Frequent_Category      Electronics
Region      

1. Data Preparation and Feature Engineering:

Customer Profiles: Days since signup and region are used as features. Region is one-hot encoded into indicator columns so that it can enter the similarity calculation alongside the numerical features, which lets the model account for regional differences. Days since signup measures customer tenure (how long the customer has been registered) rather than purchase recency. It is computed relative to the date the notebook is run, so its values change between runs.

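Transaction Data: The code aggregates transaction information for each customer, focusing on purchasing behavior. Key features derived are:

Total_Value_Sum, Total_Value_Mean, Total_Value_Std: Total, mean, and standard deviation of TotalValue over the customer's transactions. A higher total suggests a more valuable customer; the mean indicates typical spend per transaction.
Quantity_Sum, Quantity_Mean, Quantity_Std: Total, mean, and standard deviation of the units purchased per transaction.
avg_transaction_value: Total_Value_Sum divided by Quantity_Sum. Despite its name, this is the average spend per unit purchased, not per transaction.
purchase_frequency: Quantity_Sum divided by days_since_signup, i.e. units purchased per day since signup. The number of transactions itself is not computed.
Most_Frequent_Category: Most frequently purchased product category. Shows preferred product areas for each customer. This category is then one-hot encoded, just like region.
Data Merging: Transactions are merged with customer profiles on CustomerID and with products on ProductID into merged_data, so that each transaction row carries the customer and product attributes needed for the aggregation.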
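The per-customer aggregation described above can be sketched on a toy table. This is a minimal illustration rather than the notebook's exact cell: the toy rows, the most_frequent helper, and the Transaction_Count, value_per_unit, and value_per_transaction columns are assumptions introduced here. It uses pandas named aggregation and contrasts spend per unit (what the notebook computes as avg_transaction_value) with spend per transaction.

```python
import pandas as pd

# Toy transactions (hypothetical values, not taken from Transactions.csv)
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "TotalValue": [100.0, 300.0, 200.0, 50.0],
    "Quantity": [1, 3, 2, 1],
    "Category": ["Books", "Books", "Clothing", "Electronics"],
})


def most_frequent(values: pd.Series) -> str:
    """Return the most frequent value; ties resolve to the first in sorted order."""
    return values.mode().iloc[0]


# Named aggregation produces flat column names in a single step
toy_features = (
    toy.groupby("CustomerID")
    .agg(
        Total_Value_Sum=("TotalValue", "sum"),
        Total_Value_Mean=("TotalValue", "mean"),
        Quantity_Sum=("Quantity", "sum"),
        Transaction_Count=("TotalValue", "count"),  # hypothetical extra column
        Most_Frequent_Category=("Category", most_frequent),
    )
    .reset_index()
)

# Spend per unit purchased (the notebook's avg_transaction_value formula)
toy_features["value_per_unit"] = (
    toy_features["Total_Value_Sum"] / toy_features["Quantity_Sum"]
)
# Spend per transaction, shown for comparison
toy_features["value_per_transaction"] = (
    toy_features["Total_Value_Sum"] / toy_features["Transaction_Count"]
)

print(toy_features)
```

For customer C1 the two ratios differ (100 per unit versus 200 per transaction), which is why the name of the feature matters when interpreting it.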

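Handling Missing Data: The only missing values that the code fills are in Total_Value_Std and Quantity_Std. The sample standard deviation is undefined for a customer with a single transaction, so those entries are NaN after aggregation and are replaced with 0. Most_Frequent_Category needs no filling: when categories tie, mode() returns all tied values and the code takes the first.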
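A small sketch of where the NaN values in the standard deviation columns come from, using a hypothetical two-customer table. pandas computes the sample standard deviation (ddof=1) by default, which is undefined for a group with one row.

```python
import pandas as pd

# Hypothetical data: C1 has two transactions, C2 has only one
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TotalValue": [100.0, 300.0, 50.0],
})

# Sample standard deviation per customer; single-row groups yield NaN
std_per_customer = toy.groupby("CustomerID")["TotalValue"].std()
print(std_per_customer)

# Replace the undefined values with 0, as the notebook does
filled = std_per_customer.fillna(0)
print(filled)
```

Filling with 0 treats a single-purchase customer as having no spread in spending, which is a modeling choice rather than a measured fact.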

2. Data Normalization:

Standardization: Using StandardScaler, the numerical features are standardized (z-score normalization), so each has zero mean and unit variance. The one-hot encoded columns are appended afterwards as unscaled indicator columns. Standardization matters for cosine similarity because, without it, features with larger scales (like total spending) would dominate the calculation compared to features with smaller scales (like the mean quantity per transaction).
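The z-score transformation can be written out by hand to show what it does to features on different scales. The sketch below uses NumPy only and hypothetical values; StandardScaler performs the same calculation using the population standard deviation (ddof=0).

```python
import numpy as np

# Hypothetical values: column 0 is on a much larger scale than column 1
X = np.array([
    [3000.0, 2.0],
    [1000.0, 4.0],
    [2000.0, 6.0],
])

# z-score: subtract the column mean, divide by the column standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 (population standard deviation)
Z = (X - mean) / std

print(Z)
print(Z.mean(axis=0))  # close to 0 for every column
print(Z.std(axis=0))   # close to 1 for every column
```

After the transformation both columns span the same range, so neither dominates a dot product or angle computed across them.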

3. Look-Alike Model (Cosine Similarity):

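Cosine Similarity: The core of the look-alike model is cosine similarity. It measures the cosine of the angle between two vectors (representing customers in this case) and ranges from -1 to 1. A value of 1 means the vectors point in the same direction (customers with the same behavioral pattern, regardless of magnitude), 0 means they are orthogonal (no similarity), and negative values, which are possible here because standardized features can be negative, indicate opposing profiles.
Similarity Matrix: A similarity_matrix is built in which entry (i, j) is the similarity between customers i and j. With 199 customers in final_features, the matrix is 199 x 199.
Top Look-Alikes: For the first 20 customers, the code identifies the top 3 most similar customers (look-alikes) based on the similarity scores. The customer itself, whose self-similarity is 1, is excluded.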
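The similarity and ranking steps can be sketched with NumPy alone. This is an illustration under stated assumptions, not the notebook's code: the function names cosine_similarity_matrix and top_k_lookalikes and the four toy vectors are hypothetical, and the ranking uses argsort after masking the diagonal instead of slicing a sorted list.

```python
import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms  # assumes no all-zero rows
    return unit @ unit.T


def top_k_lookalikes(sim: np.ndarray, ids: list, k: int = 3) -> dict:
    """For each row, return the k most similar other rows as (id, score) pairs."""
    result = {}
    for i, cust_id in enumerate(ids):
        scores = sim[i].copy()
        scores[i] = -np.inf  # mask the customer's own entry
        best = np.argsort(scores)[::-1][:k]
        result[cust_id] = [(ids[j], round(float(scores[j]), 4)) for j in best]
    return result


ids = ["A", "B", "C", "D"]
features = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as A, different magnitude
    [0.0, 1.0],   # orthogonal to A
    [-1.0, 0.0],  # opposite direction to A
])

sim = cosine_similarity_matrix(features)
lookalikes = top_k_lookalikes(sim, ids, k=2)
print(lookalikes["A"])
```

Masking the diagonal makes the self-exclusion explicit, so the result does not depend on the self-similarity happening to sort first when another customer also scores 1.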

4. Output and Storage:

Look-Alike Data: The results are stored in a DataFrame with two columns: CustomerID, and SimilarCustomers_WithScores, which holds the top 3 look-alikes and their similarity scores as a string-formatted list of (CustomerID, score) pairs.
CSV File: The look-alike data is saved to Lookalike.csv without the index column.
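Because the look-alike list is written as a string, a reader of the CSV has to parse it back. The sketch below shows one way to do that with ast.literal_eval, and an alternative long layout with one row per pair. The long layout and its column names (Rank, SimilarCustomerID, Score) are assumptions for illustration; the sample pairs are copied from the output preview, and an in-memory buffer stands in for Lookalike.csv.

```python
import ast
import io

import pandas as pd

# Two entries in the same shape as similar_customers (first two pairs each)
similar_customers = {
    "C0001": [("C0184", 0.7055), ("C0005", 0.6771)],
    "C0002": [("C0060", 0.7666), ("C0086", 0.7579)],
}

# Layout used in the notebook: one row per customer, list stored as text
wide = pd.DataFrame({
    "CustomerID": list(similar_customers.keys()),
    "SimilarCustomers_WithScores": [str(v) for v in similar_customers.values()],
})

# Round-trip through CSV text, then parse the string back into a list
buffer = io.StringIO()
wide.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
parsed = ast.literal_eval(restored.loc[0, "SimilarCustomers_WithScores"])
print(parsed)

# Alternative layout (hypothetical): one row per (customer, look-alike) pair
long_df = pd.DataFrame([
    {"CustomerID": cust, "Rank": rank, "SimilarCustomerID": other, "Score": score}
    for cust, pairs in similar_customers.items()
    for rank, (other, score) in enumerate(pairs, start=1)
])
print(long_df)
```

The long layout is easier to filter and join in later analysis, while the stringified list keeps one row per customer.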

Overall Insights and Potential Improvements:

Feature Importance: The code doesn't analyze the importance of each feature. Techniques like feature importance from tree-based models or permutation importance could help identify which factors are most influential in determining look-alike customers, although they require a supervised target, which this notebook does not have. This would enhance interpretability and allow you to refine the feature set.
Alternative Similarity Measures: Explore other similarity or distance metrics (e.g., Euclidean or Manhattan distance for the numerical features, or Jaccard similarity for the one-hot encoded features) to see if they yield better results. The choice of metric depends on the nature of the data and the problem.
Hyperparameter Tuning: If you use more sophisticated similarity methods, their parameters (for example, feature weights or the number of neighbors) should be tuned rather than left at their defaults.
Dynamic Threshold: Instead of a fixed top 3, consider a dynamic threshold based on the similarity score. This would give you a more flexible way to identify look-alike customers.
Recency, Frequency, Monetary Value (RFM) Analysis: Consider incorporating RFM analysis, which can offer more detailed insights into customer behavior patterns.
Model Evaluation: No evaluation of the model performance is included in the code. Defining metrics and validation approaches to evaluate the model's effectiveness at identifying truly similar customers would be valuable.
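The dynamic threshold idea from the list above can be sketched as follows. This is one possible reading, not part of the notebook: the function name lookalikes_above_threshold, the 0.75 cutoff, the optional max_results cap, and the toy similarity matrix are all assumptions.

```python
import numpy as np


def lookalikes_above_threshold(sim, ids, threshold=0.75, max_results=None):
    """Return, per customer, every other customer whose score meets the threshold."""
    result = {}
    for i, cust_id in enumerate(ids):
        candidates = [
            (ids[j], float(sim[i, j]))
            for j in range(len(ids))
            if j != i and sim[i, j] >= threshold
        ]
        # Highest similarity first; max_results=None keeps all matches
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        result[cust_id] = candidates[:max_results]
    return result


# Hypothetical 3x3 similarity matrix
ids = ["A", "B", "C"]
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
])

matches = lookalikes_above_threshold(sim, ids, threshold=0.75)
print(matches)
```

With a threshold, customers get a variable number of look-alikes: here B has two matches while A and C have one each, and a customer with no close neighbors would get an empty list.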
