## Table of contents

1. Introduction
2. Goal
3. Import Dataset & libraries
4. Overview
5. Data Pre-processing
6. Statistical Techniques
7. Descriptive Statistical Analyses
8. Hypothesis Formulation and Testing
9. Jupyter Notebook Analysis
10. Machine Learning
11. Splitting
12. Training and Testing
13. Conclusions
14. References
15. GitHub repo link

### Import Dataset & libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn_extra.cluster import KMedoids
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [12]:
full_path = 'C:/Users/Riccardo/OneDrive/Desktop/Higher Diploma/MACHINE LEARNING/data.xlsx'
df = pd.read_excel(full_path)
df.shape

(525461, 8)

In [None]:
print("Our Dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))
display(df.describe())
display(df.head())
display(df.dtypes.value_counts())

Our Dataset has 525461 rows and 8 columns


| | Quantity | InvoiceDate | Price | Customer ID |
|---|---|---|---|---|
| count | 525461.0 | 525461 | 525461.0 | 417534.0 |
| mean | 10.337667 | 2010-06-28 11:37:36.845017856 | 4.688834 | 15360.645478 |
| min | -9600.0 | 2009-12-01 07:45:00 | -53594.36 | 12346.0 |
| 25% | 1.0 | 2010-03-21 12:20:00 | 1.25 | 13983.0 |
| 50% | 3.0 | 2010-07-06 09:51:00 | 2.1 | 15311.0 |
| 75% | 10.0 | 2010-10-15 12:45:00 | 4.21 | 16799.0 |
| max | 19152.0 | 2010-12-09 20:01:00 | 25111.09 | 18287.0 |
| std | 107.42411 | NaN | 146.126914 | 1680.811316 |


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


object            4
float64           2
int64             1
datetime64[ns]    1
Name: count, dtype: int64

### Overview

In [None]:
# from ydata_profiling import ProfileReport
# slice_df = df.iloc[:, :10]
# report = ProfileReport(df, title='My Data', minimal=True)
# report.to_file("First Data File.html")


According to the generated profiling report (First Data File.html), three features, "Invoice", "StockCode" and "Description", need a closer look before we handle the missing values in the dataset.

In [None]:
print(df[['Invoice', 'StockCode', 'Description']])

       Invoice StockCode                          Description
0       489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS
1       489434    79323P                   PINK CHERRY LIGHTS
2       489434    79323W                  WHITE CHERRY LIGHTS
3       489434     22041         RECORD FRAME 7" SINGLE SIZE 
4       489434     21232       STRAWBERRY CERAMIC TRINKET BOX
...        ...       ...                                  ...
525456  538171     22271                 FELTCRAFT DOLL ROSIE
525457  538171     22750         FELTCRAFT PRINCESS LOLA DOLL
525458  538171     22751       FELTCRAFT PRINCESS OLIVIA DOLL
525459  538171     20970   PINK FLORAL FELTCRAFT SHOULDER BAG
525460  538171     21931               JUMBO STORAGE BAG SUKI

[525461 rows x 3 columns]


It seems that the ydata_profiling library flags the data in the "Invoice", "StockCode" and "Description" columns as unsupported or rejected. This is a limitation in how the library reads these values rather than a problem for our interpretation of the dataset, so we proceed to the missing values and handle them.

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

Invoice             0
StockCode           0
Description      2928
Quantity            0
InvoiceDate         0
Price               0
Customer ID    107927
Country             0
dtype: int64


"Description" has relatively few missing values (2,928 rows, about 0.6% of the dataset) and they are randomly distributed, so we simply remove those rows with the pandas dropna() function.

In [None]:
df = df.dropna(subset=['Description'])

As "Customer ID" has a significant share of missing values (20.5%), we decided neither to drop these rows nor to use imputation methods such as the mean or median, but to replace the missing values with the placeholder -1.
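
The share of missing values and the placeholder strategy described above can be sketched on a small example. The `toy` frame and the `has_customer_id` flag column are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "BAG", "BOX"],
    "Customer ID": [13085.0, np.nan, 13085.0, np.nan, 14000.0],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))

# Replace missing IDs with the -1 placeholder by assigning the result back
toy["Customer ID"] = toy["Customer ID"].fillna(-1)

# Hypothetical helper flag so placeholder rows can be filtered out later
toy["has_customer_id"] = toy["Customer ID"] != -1
print(toy)
```

Keeping a flag next to the placeholder makes it easy to exclude the rows without a real customer later, for example before computing per-customer statistics.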

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Customer ID'] = df['Customer ID'].fillna(-1)

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

Invoice        0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
Price          0
Customer ID    0
Country        0
dtype: int64


In [None]:
if missing_values.any():
    print("\nDataset contains missing values. Handle them before proceeding.")
else:
    print("\nNo missing values found. Dataset is ready.")


No missing values found. Dataset is ready.


Because one feature had a high percentage of missing values (more than 20%), it was necessary to handle them to prevent bias in our analysis.

### Data Pre-processing

In the step of data preparation, exploration, and feature selection, we perform the following tasks:

Data Cleaning: We check for and handle missing values, outliers, or any inconsistencies in the dataset we decide to use for clustering.

Exploratory Data Analysis (EDA): We explore the dataset to understand its structure, distributions, correlations, and potential patterns.

Feature Selection: We decide which features are relevant for clustering.
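
As a rough illustration of the data cleaning step, the sketch below flags suspicious rows in a toy frame. The summary statistics above show negative minima for Quantity and Price, so negative values are checked first. The IQR rule with a factor of 1.5 and the helper `iqr_outlier_mask` are assumptions chosen for illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame that reuses a few extreme values from the summary above
toy = pd.DataFrame({
    "Quantity": [12, 48, -9600, 3, 10, 19152],
    "Price": [6.95, 2.10, 1.25, -53594.36, 4.21, 2.10],
})

# Rows with a negative quantity or a negative price
suspicious = toy[(toy["Quantity"] < 0) | (toy["Price"] < 0)]

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Rows flagged as outliers in at least one of the two columns
outliers = toy[iqr_outlier_mask(toy["Quantity"]) | iqr_outlier_mask(toy["Price"])]

print(suspicious)
print(outliers)
```

Whether flagged rows should be dropped, capped, or kept depends on what they represent in the business context, so the masks are only a starting point for inspection.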

We now identify which features can help us run the clustering models.
All the features carry some information, but "Description", although it tells us about the product, does not contribute directly to clustering patterns, especially since StockCode already identifies the product. Likewise, "Invoice" is an important column, but we consider it not useful for clustering. The features we consider best are: ['Quantity', 'Price', 'Customer ID']. They are all numerical, they give us insight into customer behavior based on purchasing patterns, and they spare us unnecessary work such as encoding categorical features that ultimately would not have been useful.



Given that we've decided to use only three numerical features for clustering, we can proceed with the following steps in this phase:


1. Explore the distributions and statistics of the numerical features (Quantity, Price, Customer ID) to understand their characteristics.


2. Visualize relationships between features using scatter plots or correlation matrices to identify any correlations or patterns.


3. Confirm that the selected features are suitable for clustering and proceed to scale them if needed.
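
The three steps above can be sketched as follows. The data here is synthetic and randomly generated purely for illustration; only the column names and the use of StandardScaler come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected features (values are made up)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Quantity": rng.integers(1, 50, size=200),
    "Price": rng.uniform(0.5, 20.0, size=200).round(2),
    "Customer ID": rng.integers(12346, 18288, size=200).astype(float),
})
features = ["Quantity", "Price", "Customer ID"]

# Step 1: distributions and summary statistics
summary = toy[features].describe()

# Step 2: pairwise Pearson correlations between the features
corr = toy[features].corr()

# Step 3: standardize each feature to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[features]), columns=features
)

print(summary)
print(corr.round(2))
print(scaled.mean().round(3))
```

Scaling matters for distance-based clustering because otherwise a feature with a large numeric range, such as Customer ID, would dominate the distances.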


If everything looks good, we can proceed with scaling the features before applying the clustering algorithms.


In [15]:
# Reuse the cleaned DataFrame `df` from the cells above (no need to reload it from a placeholder file)

# Data preprocessing and feature selection
# Select the three numerical features chosen above and standardize them
selected_features = ['Quantity', 'Price', 'Customer ID']
df_selected = df[selected_features]
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_selected), columns=selected_features)

# Assumption: DBSCAN and K-Medoids are fitted on a random sample of 5,000 rows,
# because their neighbourhood / pairwise-distance computations may not fit in memory for ~520k rows
df_sample = df_scaled.sample(n=5000, random_state=42)

# K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(df_scaled)
kmeans_labels = kmeans.labels_

# Evaluate K-means using silhouette score (estimated on a random subsample for the same reason)
kmeans_silhouette_score = silhouette_score(df_scaled, kmeans_labels, sample_size=5000, random_state=42)
print("K-means silhouette score:", kmeans_silhouette_score)

# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(df_sample)
dbscan_labels = dbscan.labels_

# Visualizing DBSCAN clusters (2D projection on Quantity and Price)
plt.scatter(df_sample['Quantity'], df_sample['Price'], c=dbscan_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Quantity (scaled)')
plt.ylabel('Price (scaled)')
plt.show()

# K-Medoids Clustering
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids.fit(df_sample)
kmedoids_labels = kmedoids.labels_

# Visualizing K-Medoids clusters (2D projection on Quantity and Price)
plt.scatter(df_sample['Quantity'], df_sample['Price'], c=kmedoids_labels, cmap='viridis')
plt.title('K-Medoids Clustering')
plt.xlabel('Quantity (scaled)')
plt.ylabel('Price (scaled)')
plt.show()
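
NearestNeighbors is imported at the top of the notebook but not used in the cell above. One common heuristic for choosing the DBSCAN `eps` value is the sorted k-distance curve, sketched below on synthetic two-cluster data. The data, the variable names, and the choice of `min_samples = 5` are assumptions for illustration; this is not necessarily how `eps=0.5` was chosen here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data: two Gaussian blobs (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(5.0, 0.3, size=(100, 2)),
])

min_samples = 5

# Querying the training data returns each point itself as its first neighbor
# (distance 0), which matches DBSCAN's min_samples counting the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from every point to its farthest neighbor among the min_samples, sorted
k_distances = np.sort(distances[:, -1])

print(k_distances[:5])
print(k_distances[-5:])
```

Plotting `k_distances` (for example with `plt.plot(k_distances)`) and looking for the point where the curve bends sharply upwards gives a candidate value for `eps`.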