<a href="https://colab.research.google.com/github/sainirajesh17/Online-Retail-Customer-Segmentation/blob/main/Online_Retail_Customer_Segmentation_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Online Retail Customer Segmentation </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify major customer segments in a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

## <b> Data Description </b>

### <b>Attribute Information: </b>

* ### InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* ### StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### Description: Product (item) name. Nominal.
* ### Quantity: The quantities of each product (item) per transaction. Numeric.
* ### InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* ### UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* ### CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### Country: Country name. Nominal, the name of the country where each customer resides.
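
The cancellation rule in the InvoiceNo description can be checked directly. The following is a minimal sketch, assuming a small hypothetical frame `sample` that reuses the column names above; the helper `is_cancellation` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows mirroring the InvoiceNo and Quantity columns
sample = pd.DataFrame({
    "InvoiceNo": [555200, "C558313", 574074],
    "Quantity": [24, -1, 1],
})

def is_cancellation(invoice_no: pd.Series) -> pd.Series:
    """Return True where the invoice code starts with 'C' (case-insensitive)."""
    # InvoiceNo mixes integers and strings, so cast to str before the prefix test
    return invoice_no.astype(str).str.upper().str.startswith("C")

sample["IsCancelled"] = is_cancellation(sample["InvoiceNo"])
print(sample)
```

Casting to string first matters because the column holds both integers and strings, so a plain `.str` accessor would return missing values for the integer entries.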

# Importing Dataset

In [1]:
# Importing the required libraries

import math  # standard-library math; `numpy.math` is deprecated and removed in NumPy 2.0
import numpy as np
import pandas as pd
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


import statsmodels.api as sm


import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Importing the dataset

dataset_x = pd.read_excel('/content/drive/MyDrive/DATA_SCIENCE_THINGS/Data_Science_Alma_Better/Unsupervised ML Model/Copy of Online Retail.xlsx')

In [5]:
# Working on a reproducible random sample of 100,000 rows
dataset = dataset_x.sample(100000, random_state = 42)
# Keeping an untouched copy of the same sample for later reference
df_fix = dataset.copy()

In [6]:
dataset.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
209268,555200,71459,HANGING JAM JAR T-LIGHT HOLDER,24,2011-06-01 12:05:00,0.85,17315.0,United Kingdom
207108,554974,21128,GOLD FISHING GNOME,4,2011-05-27 17:14:00,6.95,14031.0,United Kingdom
167085,550972,21086,SET/6 RED SPOTTY PAPER CUPS,4,2011-04-21 17:05:00,0.65,14031.0,United Kingdom
471836,576652,22812,PACK 3 BOXES CHRISTMAS PANETTONE,3,2011-11-16 10:39:00,1.95,17198.0,United Kingdom
115865,546157,22180,RETROSPOT LAMP,2,2011-03-10 08:40:00,9.95,13502.0,United Kingdom


In [7]:
dataset.shape

(100000, 8)

# Data Inspection

In [8]:
# Checking the tail of the dataset
dataset.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
242085,C558313,22629,SPACEBOY LUNCH BOX,-1,2011-06-28 11:31:00,1.95,13870.0,United Kingdom
435441,574074,90186B,CRYSTAL HOOP EARRING FLORAL LEAF,1,2011-11-02 15:33:00,2.9,,United Kingdom
275042,560929,21094,SET/6 RED SPOTTY PAPER PLATES,2,2011-07-22 10:06:00,0.85,16866.0,United Kingdom
430654,573585,22467,GUMBALL COAT RACK,4,2011-10-31 14:41:00,4.96,,United Kingdom
290024,562347,22998,TRAVEL CARD WALLET KEEP CALM,4,2011-08-04 12:22:00,0.42,13263.0,United Kingdom


In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 209268 to 290024
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    100000 non-null  object        
 1   StockCode    100000 non-null  object        
 2   Description  99723 non-null   object        
 3   Quantity     100000 non-null  int64         
 4   InvoiceDate  100000 non-null  datetime64[ns]
 5   UnitPrice    100000 non-null  float64       
 6   CustomerID   74983 non-null   float64       
 7   Country      100000 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 6.9+ MB


In [11]:
dataset.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Quantity,100000.0,9.41339,351.904849,-80995.0,1.0,3.0,10.0,74215.0
UnitPrice,100000.0,4.464044,86.567489,-11062.06,1.25,2.08,4.13,13541.33
CustomerID,74983.0,15285.24987,1715.869301,12346.0,13952.0,15142.0,16795.0,18287.0
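
The summary above reports negative minimum values for both Quantity and UnitPrice. A minimal sketch of how such rows could be counted, assuming a hypothetical toy frame `toy` in place of `dataset`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `dataset` instead
toy = pd.DataFrame({
    "Quantity": [24, -1, 1, 4],
    "UnitPrice": [0.85, 1.95, -11062.06, 0.0],
})

# Boolean masks for values that cannot represent an ordinary sale
negative_qty = toy["Quantity"] < 0
non_positive_price = toy["UnitPrice"] <= 0

print("Rows with negative Quantity:", int(negative_qty.sum()))
print("Rows with non-positive UnitPrice:", int(non_positive_price.sum()))
```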


In [14]:
# Checking null values
dataset['Description'].isnull().sum()

277

# Information of the Dataset

In [17]:
# Deriving new date and time features from the InvoiceDate column

dataset['InvoiceDate_year'] = dataset['InvoiceDate'].dt.year
dataset['InvoiceDate_month'] = dataset['InvoiceDate'].dt.month
dataset['InvoiceDate_day'] = dataset['InvoiceDate'].dt.day
dataset['InvoiceDate_hour'] = dataset['InvoiceDate'].dt.hour
dataset['InvoiceDate_minute'] = dataset['InvoiceDate'].dt.minute
dataset['InvoiceDate_second'] = dataset['InvoiceDate'].dt.second

In [18]:
print("Columns and data types")
pd.DataFrame(dataset.dtypes).rename(columns = {0:'dtype'})

Columns and data types


Unnamed: 0,dtype
InvoiceNo,object
StockCode,object
Description,object
Quantity,int64
InvoiceDate,datetime64[ns]
UnitPrice,float64
CustomerID,float64
Country,object
InvoiceDate_year,int64
InvoiceDate_month,int64


In [20]:
dataset.shape

(100000, 14)

In [24]:
dataset.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country', 'InvoiceDate_year',
       'InvoiceDate_month', 'InvoiceDate_day', 'InvoiceDate_hour',
       'InvoiceDate_minute', 'InvoiceDate_second'],
      dtype='object')

In [27]:
dataset.describe(include = 'all')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDate_year,InvoiceDate_month,InvoiceDate_day,InvoiceDate_hour,InvoiceDate_minute,InvoiceDate_second
count,100000.0,100000,99723,100000.0,100000,100000.0,74983.0,100000,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
unique,17962.0,3610,3697,,16754,,,38,,,,,,
top,573585.0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,,2011-10-31 14:41:00,,,United Kingdom,,,,,,
freq,221.0,433,441,,221,,,91262,,,,,,
first,,,,,2010-12-01 08:26:00,,,,,,,,,
last,,,,,2011-12-09 12:50:00,,,,,,,,,
mean,,,,9.41339,,4.464044,15285.24987,,2010.92192,7.544,14.99623,13.08129,29.98301,0.0
std,,,,351.904849,,86.567489,1715.869301,,0.268299,3.507693,8.66688,2.441774,16.985543,0.0
min,,,,-80995.0,,-11062.06,12346.0,,2010.0,1.0,1.0,6.0,0.0,0.0
25%,,,,1.0,,1.25,13952.0,,2011.0,5.0,7.0,11.0,16.0,0.0


In [28]:
# Getting list of all features having numerical data
numerical_columns=list(dataset.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

Index(['Quantity', 'UnitPrice', 'CustomerID', 'InvoiceDate_year',
       'InvoiceDate_month', 'InvoiceDate_day', 'InvoiceDate_hour',
       'InvoiceDate_minute', 'InvoiceDate_second'],
      dtype='object')

In [29]:
# Getting list of all the features having categorical data
categorical_columns=list(dataset.select_dtypes(['object']).columns)
categorical_features=pd.Index(categorical_columns)
categorical_features

Index(['InvoiceNo', 'StockCode', 'Description', 'Country'], dtype='object')

In [30]:
def unique_name_no(col):
  # Print the distinct values of a column, followed by how many there are
  print(dataset[col].unique())
  print(dataset[col].nunique())


for i in categorical_columns:
  print(i.upper())
  unique_name_no(i)

INVOICENO
[555200 554974 550972 ... 568460 'C558313' 560929]
17962
STOCKCODE
[71459 21128 21086 ... '90042A' '16169P' 72811]
3610
DESCRIPTION
['HANGING JAM JAR T-LIGHT HOLDER' 'GOLD FISHING GNOME'
 'SET/6 RED SPOTTY PAPER CUPS' ... 'CHEST NATURAL WOOD 20 DRAWERS'
 'FRESHWATER PEARL BRACELET GOLD' 'SMALL ZINC/GLASS CANDLEHOLDER']
3697
COUNTRY
['United Kingdom' 'Australia' 'Norway' 'Finland' 'Germany' 'Bahrain'
 'EIRE' 'Spain' 'France' 'Canada' 'RSA' 'Netherlands' 'Italy' 'Austria'
 'Channel Islands' 'Unspecified' 'Sweden' 'Belgium' 'Portugal' 'USA'
 'Cyprus' 'Poland' 'Switzerland' 'Japan' 'Denmark' 'Hong Kong' 'Iceland'
 'Singapore' 'Israel' 'United Arab Emirates' 'Greece' 'Lithuania'
 'European Community' 'Malta' 'Saudi Arabia' 'Lebanon' 'Czech Republic'
 'Brazil']
38


In [31]:
# Counting fully duplicated rows
dataset.duplicated().sum()

200

In [33]:
dataset[dataset.duplicated()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDate_year,InvoiceDate_month,InvoiceDate_day,InvoiceDate_hour,InvoiceDate_minute,InvoiceDate_second
270266,560555,20984,12 PENCILS TALL TUBE POSY,2,2011-07-19 13:01:00,0.29,14178.0,United Kingdom,2011,7,19,13,1,0
510482,579456,22624,IVORY KITCHEN SCALES,1,2011-11-29 14:20:00,8.50,13428.0,United Kingdom,2011,11,29,14,20,0
213179,555524,22697,GREEN REGENCY TEACUP AND SAUCER,1,2011-06-05 11:37:00,2.95,16923.0,United Kingdom,2011,6,5,11,37,0
370967,569205,22624,IVORY KITCHEN SCALES,1,2011-10-02 10:55:00,8.50,16923.0,United Kingdom,2011,10,2,10,55,0
375246,569424,22436,12 COLOURED PARTY BALLOONS,10,2011-10-04 10:49:00,0.65,15529.0,United Kingdom,2011,10,4,10,49,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409368,572058,22907,PACK OF 20 NAPKINS PANTRY DESIGN,1,2011-10-20 12:43:00,0.85,18252.0,United Kingdom,2011,10,20,12,43,0
360382,568226,23232,WRAP VINTAGE LEAF DESIGN,25,2011-09-26 10:39:00,0.42,16719.0,United Kingdom,2011,9,26,10,39,0
504413,578949,22629,SPACEBOY LUNCH BOX,1,2011-11-27 14:30:00,1.95,14954.0,United Kingdom,2011,11,27,14,30,0
49635,540524,84755,COLOUR GLASS T-LIGHT HOLDER HANGING,8,2011-01-09 12:53:00,0.65,16735.0,United Kingdom,2011,1,9,12,53,0


In [35]:
# Dropping Duplicate rows
dataset=dataset.drop_duplicates()
len(dataset[dataset.duplicated()])

0

In [36]:
dataset.shape

(99800, 14)

In [37]:
dataset.isnull().sum()

InvoiceNo                 0
StockCode                 0
Description             277
Quantity                  0
InvoiceDate               0
UnitPrice                 0
CustomerID            25017
Country                   0
InvoiceDate_year          0
InvoiceDate_month         0
InvoiceDate_day           0
InvoiceDate_hour          0
InvoiceDate_minute        0
InvoiceDate_second        0
dtype: int64
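
The raw counts above are easier to judge as a share of all rows: in this output CustomerID is missing in 25017 of 99800 rows (about 25%), while Description is missing in 277 (under 0.3%). A minimal sketch of such a summary, assuming a hypothetical helper `missing_summary` and a toy frame `toy` standing in for `dataset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would pass `dataset` instead
toy = pd.DataFrame({
    "Description": ["A", None, "C", "D"],
    "CustomerID": [17315.0, np.nan, np.nan, 13502.0],
    "Country": ["United Kingdom"] * 4,
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually contain missing values, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(dataset)` in the notebook would give the same two-column view for the real data.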