About Dataset

Context

Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

Content

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."


Acknowledgements

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Image from stocksnap.io.

Inspiration

Analyses for this dataset could include time series, clustering, classification and more.

https://www.kaggle.com/datasets/carrie1/ecommerce-data

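### Data

The data used here is an E-commerce dataset that collects the purchases made by roughly 4,000 customers over about one year (2010-12-01 to 2011-12-09).

The goal is to try to distinguish new customers from existing customers.

### Table of contents

#### Data preparation

#### Exploring the variables in the data

#### Examining the product categories

#### Customer segmentation

#### Customer characterization

#### Prediction

#### Conclusion and retrospective

#### Installing the required libraries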

In [None]:
!pip install pandas-profiling



In [None]:
!pip install missingno



In [None]:

# Installed because of "ModuleNotFoundError: No module named 'nltk'".
# Installing the package resolved the error.
!pip install --user -U nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting click
  Downloading click-8.1.2-py3-none-any.whl (96 kB)
Collecting regex>=2021.8.3
  Downloading regex-2022.4.24-cp38-cp38-macosx_11_0_arm64.whl (281 kB)
Collecting tqdm
  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.2 nltk-3.7 regex-2022.4.24 tqdm-4.64.0


In [None]:

# Installed because of "ModuleNotFoundError: No module named 'wordcloud'".
!pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.8.1.tar.gz (220 kB)
Building wheels for collected packages: wordcloud
  Building wheel for wordcloud (setup.py) ... done
  Created wheel for wordcloud: filename=wordcloud-1.8.1-cp38-cp38-macosx_11_0_arm64.whl size=153430 sha256=152accae0b04ca5c1853d0f6b3566bb1bbfa85160abab6dba406667b9c2ed3e4
  Stored in directory: /Users/krc/Library/Caches/pip/wheels/4d/3f/0d/a2ba9b7895c9f1be89018b3141c3df3d4f9c786c882ccfbc3b
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.1


In [None]:
# Installed because of "ModuleNotFoundError: No module named 'plotly'".
!pip install plotly

Collecting plotly
  Downloading plotly-5.7.0-py2.py3-none-any.whl (28.8 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.7.0 tenacity-8.0.1
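
The install cells above were each added after hitting a `ModuleNotFoundError`. As a rough sketch of an alternative, the check below lists every package that is still missing in one pass, so they can be installed together. The `REQUIRED` mapping and the `find_missing` helper are hypothetical names introduced for this illustration; only `importlib.util.find_spec` comes from the standard library.

```python
import importlib.util

# Import name -> pip package name (the two differ for some packages).
# This mapping is an assumption based on the libraries this notebook installs.
REQUIRED = {
    "pandas_profiling": "pandas-profiling",
    "missingno": "missingno",
    "nltk": "nltk",
    "wordcloud": "wordcloud",
    "plotly": "plotly",
}


def find_missing(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = find_missing(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Running this once at the top of the notebook avoids discovering the missing modules one import at a time.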


In [None]:
# Import the required libraries

import datetime
import gc
import itertools
import warnings
from pathlib import Path

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import missingno as msno
import nltk
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns

# The current version of seaborn generates many warnings, so silence them.
warnings.filterwarnings("ignore")
sns.set_style('whitegrid')


In [None]:
# Import the scikit-learn packages

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import preprocessing, model_selection, metrics, feature_selection
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn import neighbors, linear_model, svm, tree, ensemble
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import PCA


In [None]:
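# wordcloud: visualizes key words so that a document's keywords and concepts can be grasped at a glance
# IPython.display: renders DataFrames and HTML inside the notebook
# plotly: interactive plotting
# Import them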

from wordcloud import WordCloud, STOPWORDS
from IPython.display import display, HTML
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
#mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
%matplotlib inline
color = sns.color_palette()

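#### Data preparation

Load the data.

To keep the text in the file from being garbled, specify the encoding explicitly when reading it.

The IDs are identifiers rather than numeric quantities, so they are read as strings.

Then check the number and the ratio of missing values.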

In [None]:
data = pd.read_csv('../Downloads/data.csv', encoding='ISO-8859-1',
                   dtype={'CustomerID': str, 'InvoiceNo': str})

print('Dataframe dimensions:', data.shape)

data.head()
# The raw InvoiceDate strings are hard to read, so they will be converted to datetime next.

Dataframe dimensions: (541909, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
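
To illustrate why the encoding and the string dtype matter when loading, here is a minimal sketch on a hypothetical two-row sample (the rows and the `CAFÉ SIGN` description are made up for the example, not taken from the dataset). When an ID column contains missing values and no dtype is given, pandas parses it as floating point, so an ID such as `17850` would show up as `17850.0`.

```python
import io

import pandas as pd

# Hypothetical sample, encoded as ISO-8859-1 like the original file.
# The second row has no CustomerID, as many rows in the real data do.
raw = (
    "InvoiceNo,Description,CustomerID\n"
    "536365,WHITE METAL LANTERN,17850\n"
    "C536379,CAFÉ SIGN,\n"
).encode("ISO-8859-1")

# Without a dtype, the missing value forces CustomerID to float64.
as_number = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")

# With dtype=str, the identifiers are kept exactly as written.
as_text = pd.read_csv(
    io.BytesIO(raw),
    encoding="ISO-8859-1",
    dtype={"CustomerID": str, "InvoiceNo": str},
)

print(as_number["CustomerID"].tolist())
print(as_text["CustomerID"].tolist())
print(as_text["Description"].tolist())
```

Passing the encoding explicitly also lets the accented character in the sample description decode correctly.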


In [None]:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])  # convert to datetime so the timestamps are easier to read
data.head()


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
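
The conversion above lets pandas infer the date layout. As a hedged alternative, the sketch below passes an explicit `format`, which states that the month comes first and avoids any ambiguity for strings such as `12/9/2011`. The two sample strings are illustrative values written in the same layout as the raw `InvoiceDate` column.

```python
import pandas as pd

# Sample strings in the month/day/year layout of the raw InvoiceDate column.
raw_dates = pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"])

# An explicit format makes the month-first interpretation unambiguous.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M")

print(parsed.dt.date.tolist())
print(parsed.min(), parsed.max())
```

On the full column, the same call would read `pd.to_datetime(data['InvoiceDate'], format="%m/%d/%Y %H:%M")`, assuming every row follows this layout.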


In [None]:
# Check the data types

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  object        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 33.1+ MB


In [None]:
# Check the missing values
data.isnull().sum().sort_values(ascending=False)

CustomerID     135080
Description      1454
InvoiceNo           0
StockCode           0
Quantity            0
InvoiceDate         0
UnitPrice           0
Country             0
dtype: int64

In [None]:
# The number of missing values per column is known, but the ratio is also of interest.
# Build a DataFrame showing the missing-value count and percentage per column.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.

raw_info = pd.concat([
    pd.DataFrame(data.dtypes).T.rename(index={0: 'column type'}),
    pd.DataFrame(data.isnull().sum()).T.rename(index={0: 'null values(nb)'}),
    pd.DataFrame(data.isnull().sum() / data.shape[0] * 100).T.rename(index={0: 'null values (%)'}),
])

display(raw_info)
display(data[:5])

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| column type | object | object | object | int64 | datetime64[ns] | float64 | object | object |
| null values(nb) | 0 | 0 | 1454 | 0 | 0 | 0 | 135080 | 0 |
| null values (%) | 0.0 | 0.0 | 0.268311 | 0.0 | 0.0 | 0.0 | 24.926694 | 0.0 |


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
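
The same summary can also be built with columns as rows, which is easier to sort and scan. This is a minimal sketch on a hypothetical toy frame; `missing_summary` is a helper name introduced here for illustration and is not part of pandas.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column dtype, missing count, and missing percentage."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Put the columns with the most missing values first.
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical toy frame, only to show the shape of the result.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": ["17850", None, None, "13047"],
    "Quantity": [6, 6, 8, 2],
})

print(missing_summary(toy))
```

Calling `missing_summary(data)` on the loaded dataset should give the same counts and percentages as the table above, one column per row.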


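Two columns contain missing values: Description and CustomerID.
Looking at the actual ratios, Description is missing in only about 0.27% of the rows, but CustomerID is missing in about 25% of them. Because so many CustomerID values are missing, the affected rows will be dropped.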
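
The notebook does not show the deletion step at this point, so the following is only a sketch of one common way to do it, assuming the intent is to remove the rows without a `CustomerID`. It uses a hypothetical toy frame in place of `data`; `dropna(subset=...)` checks only the listed columns, so rows that are missing just a `Description` are kept.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Description": [None, "LANTERN", "HOLDER"],
    "CustomerID": ["17850", None, "13047"],
})

# Drop only the rows where CustomerID is missing; other columns are ignored.
cleaned = toy.dropna(subset=["CustomerID"]).reset_index(drop=True)

print(f"Rows before: {len(toy)}, after: {len(cleaned)}")
print(cleaned)
```

Applied to the real data, the equivalent call would be `data.dropna(subset=['CustomerID'])`; comparing `data.shape` before and after is a quick way to confirm how many rows were removed.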