# e-commerce
This data set contains all the transactions occurring between 01-12-2010 and 09-12-2011 for a UK-based, registered non-store online retailer. Many of the company's customers are wholesalers.

## Index
- [1. Import libraries and download data](#section1)
- [2. EDA](#section2)
- [3. Recommender Model](#section3)
- [4. Conclusion](#section4)

## 1. Import libraries and download data<a id='section1'></a>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import datetime
from isoweek import Week

import os

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = '/kaggle/input/ecommerce-data/data.csv'
data_online = pd.read_csv(path, sep=",")

## 2. EDA<a id='section2'></a>

We analyse a data set made up of every transaction between 01-12-2010 and 09-12-2011.

### 2.1 Structure
First, we look at the shape of the data and the types of features that make up the data set.
#### 2.1.1. Shape and dataframe's head

In [None]:
print('- Shape:', data_online.shape)
data_online.head()

#### 2.1.2. Type of features

In [None]:
data_online.info()

- InvoiceNo: A unique number that a business generates when it issues an invoice to a client. It appears on the invoice and is used for payment tracking. A number that starts with C marks a cancellation. This feature is a string, categorical.
- StockCode: Product (item) code, unique to each product. This feature is a string, categorical.
- Description: Product name. This feature is a string, categorical.
- Quantity: The quantity of the product per transaction. This feature is numerical, discrete.
- InvoiceDate: When the transaction took place. This feature is a date.
- UnitPrice: Price per unit. This feature is numerical, continuous.
- CustomerID: Identifier unique to each customer. This feature is numerical.
- Country: The country where the customer lives. This feature is a string, categorical.

#### 2.1.3. Null values

In [None]:
df_null = pd.DataFrame(data_online.isnull().sum())
df_null = df_null.rename(columns={0:'Number of null values'})
df_null['Percentage null values'] = round(data_online.isnull().sum()/data_online.InvoiceNo.count()*100,2)
df_null

The previous table shows that two features contain null values: Description (0.27%) and CustomerID (24.93%).
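
The same kind of summary can be sketched more compactly with `isnull().mean()`, which gives the share of nulls per column directly. The toy frame below is made up for illustration and only stands in for `data_online`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_online (values are made up for illustration)
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "BAG"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# isnull() gives a boolean frame; sum() counts the nulls, mean() gives their share
null_summary = pd.DataFrame({
    "Number of null values": toy.isnull().sum(),
    "Percentage null values": (toy.isnull().mean() * 100).round(2),
})
print(null_summary)
```

Dividing by the number of rows (which is what `mean()` does) stays correct even when the column used as the denominator contains nulls itself.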

### 2.2. Features

#### 2.2.1. InvoiceNo

- Type of transaction and frequency

The InvoiceNo feature distinguishes sales from cancellations: when the number is preceded by C, the transaction has been cancelled. We therefore plot the number of sales and cancelled transactions to see the proportions.

As expected, the plot shows that sales account for a larger share of transactions than cancellations.
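
A minimal sketch of the prefix test on made-up invoice numbers. It uses `str.startswith('C')` so that only the leading character counts; the three invoice numbers are hypothetical examples of a sale, a cancellation and an adjustment.

```python
import numpy as np
import pandas as pd

# Hypothetical invoice numbers: a sale, a cancellation, and an adjustment
invoices = pd.Series(["536365", "C536379", "A563185"], name="InvoiceNo")

# Match only on the first character, so a 'C' elsewhere in the string is ignored
is_cancelled = invoices.str.startswith("C")
transaction_type = pd.Series(
    np.where(is_cancelled, "Cancelled", "Sales"), index=invoices.index
)
print(transaction_type.tolist())
```

Note that under this rule an invoice starting with another letter is still labelled as a sale, so such rows need separate handling.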

In [None]:
# Cancellations are the invoices whose number starts with 'C'
data_online['Type of transaction'] = np.where(data_online['InvoiceNo'].str.startswith('C'), 'Cancelled', 'Sales')

In [None]:
fig, axes=plt.subplots(nrows=1, figsize=(12,4))
data_online[['InvoiceNo', 'Type of transaction']].drop_duplicates()['Type of transaction'].value_counts().plot.bar(ax=axes)
plt.title('Number of transactions (Cancelled and Sales)')
plt.show()

- Products per transaction

In this section, we count the number of unique products per transaction and plot the distribution, distinguishing sales from cancellations.

In the cancelled plot, the points fall around 0, so most cancellations involve a small number of unique products per transaction. In the sales plot, transactions tend to contain more than one unique product.
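
Counting rows per invoice and counting distinct product codes per invoice give the same result only when no product repeats within an invoice. The sketch below shows both counts on made-up line items; whether repeats occur in the real data is not checked here.

```python
import pandas as pd

# Toy line items: invoice 'X1' lists product 'A' twice (made-up data)
lines = pd.DataFrame({
    "InvoiceNo": ["X1", "X1", "X1", "X2"],
    "StockCode": ["A", "A", "B", "C"],
})

per_invoice = lines.groupby("InvoiceNo")["StockCode"].agg(
    line_items="size",          # rows per invoice, duplicates included
    unique_products="nunique",  # distinct product codes per invoice
)
print(per_invoice)
```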


In [None]:
transaction = data_online.groupby(['Type of transaction', 'InvoiceNo'])['InvoiceNo'].count().to_frame().reset_index(0)
sales = transaction[transaction['Type of transaction']=='Sales'].InvoiceNo
cancelled = transaction[transaction['Type of transaction']=='Cancelled'].InvoiceNo

In [None]:
fig, axes=plt.subplots(nrows=1, figsize=(12,4))
axes.hist(sales, bins = 100)
axes.hist(cancelled, bins=100)
plt.title('Number of unique products per transaction distribution')

axes.legend(labels=["Sales","Cancelled"])
axes.set_xlim([-500,500])
plt.show()

- Total product per transaction

In this section, we sum the total quantity of products per transaction to build the two distributions (sales and cancelled). While analysing the data, we found some sales transactions with negative quantities, as shown in this [dataframe](#sectiondata). These rows appear to record products that were damaged, lost, and so on, and this is how they are removed from stock. We therefore remove them before drawing the distributions.

In the first plot, a few values lie between -80000 and 80000, which hides the main shape of both distributions. The second plot therefore zooms into the range between -4000 and 4000. For cancelled transactions, the number of products is concentrated below 500. For sales transactions, the number of products is also mostly below 500, but the distribution has a noticeable tail.

The third plot is a boxplot for both types of transaction. Because both contain large outliers, we apply the logarithm to the data.
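
Because cancellations carry negative quantities, the logarithm has to be taken on the magnitude. A minimal vectorised sketch on made-up totals follows; turning zero totals into NaN (rather than `-inf`) is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up per-invoice totals; cancellations carry negative quantities
totals = pd.Series([12, 150, 74215, 0, -3, -80995], dtype="float64")

# Take the log of the magnitude; zero totals become NaN instead of -inf
magnitude = totals.abs()
log_quantity = np.log(magnitude.where(magnitude > 0))
print(log_quantity.round(2).tolist())
```

On the log scale the largest magnitudes sit only a few units above the typical ones, which is what keeps the boxplot readable.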

In [None]:
transaction = data_online.groupby(['Type of transaction', 'InvoiceNo'])['Quantity'].sum().to_frame().reset_index(0)
#sales = transaction[transaction['Type of transaction']=='Sales'].Quantity.sort_values()
cancelled = transaction[transaction['Type of transaction']=='Cancelled'].Quantity.sort_values()
sales_neg = transaction[(transaction['Type of transaction']=='Sales') & (transaction['Quantity']<0)].Quantity.sort_values()
invice_neg = sales_neg.index

In [None]:
trans_sales_clean = transaction.loc[~transaction.index.isin(invice_neg)]
sales = trans_sales_clean[trans_sales_clean['Type of transaction']=='Sales'].Quantity.sort_values()
trans_sales_clean = trans_sales_clean.assign(log_Quantity=0)
trans_sales_clean['log_Quantity'] = trans_sales_clean['Quantity'].apply(lambda x: np.log(x) if x>0 else np.log(-x))

In [None]:
fig, axes=plt.subplots(nrows=3, figsize=(12,10))
plt.subplots_adjust(hspace=0.4)
#sns.kdeplot(sales,ax=axes[0],fill=True)
#sns.kdeplot(cancelled,ax=axes[0],fill=True)
axes[0].hist(sales, bins=200)
axes[0].hist(cancelled, bins=200)
axes[0].set_title('Total products per transaction histogram')
axes[0].legend(labels=["Sales","Cancelled"])

#sns.kdeplot(sales,ax=axes[1])
axes[1].hist(sales, bins=200)
axes[1].hist(cancelled, bins=200)
#sns.histplot(cancelled,ax=axes[1],fill=True, bin=50)
axes[1].set_title('Total products per transaction histogram(zoom)')
#axes[1].legend(labels=["Sales","Cancelled"])
axes[1].set_xlim([-4000,4000])



sns.boxplot(x='Type of transaction', y='log_Quantity', data=trans_sales_clean, ax = axes[2], 
            order = ['Sales','Cancelled'] )
axes[2].set_title('Total products per transaction Boxplot')
plt.show()



Sales transactions with negative product quantities:<a id='sectiondata'></a>

In [None]:
sales_1000 = transaction[(transaction['Type of transaction']=='Sales') & (transaction['Quantity']<0)].Quantity.sort_values()
invice_neg = sales_1000.index
data_online[data_online['InvoiceNo'].isin(invice_neg)].sort_values('InvoiceNo')

- Price per transaction

The total price is defined as Quantity * UnitPrice. We sum it per transaction, separately for cancelled and sales transactions, to plot the histograms.

While plotting, we found negative prices, shown in that [dataframe](#sectiondata1) and described as adjust bad debt. There are also a number of zero prices, which relate to stock corrections for damaged or lost products. We decided to remove all of these rows from the original data because they can distort the sales data.

Both histograms are concentrated in the first bins close to 0, which means that most sales and cancellations involve small amounts of money and only a few sales involve large amounts.
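
The per-transaction total can be sketched on a few made-up line items: multiply quantity by unit price per row, then sum per invoice.

```python
import pandas as pd

# Made-up line items; the cancellation 'C2' has a negative quantity
items = pd.DataFrame({
    "InvoiceNo": ["1", "1", "C2"],
    "Quantity": [6, 2, -6],
    "UnitPrice": [2.5, 4.0, 2.5],
})

items["TotalPrice"] = items["Quantity"] * items["UnitPrice"]
per_invoice_total = items.groupby("InvoiceNo")["TotalPrice"].sum()
print(per_invoice_total)
```

Cancellations end up with negative totals because their quantities are negative, which is why the two histograms sit on opposite sides of 0.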


In [None]:
data_online['TotalPrice'] = data_online['Quantity'] * data_online['UnitPrice']
# remove the damaged-product rows; keep only sales and cancellations
data_online = data_online.loc[~data_online.InvoiceNo.isin(invice_neg)]
data_online_A = data_online.loc[data_online['InvoiceNo'].str.contains('A')]
data_online = data_online.loc[~data_online['InvoiceNo'].str.contains('A')]

In [None]:
transaction = data_online.groupby(['Type of transaction', 'InvoiceNo'])['TotalPrice'].sum().to_frame().reset_index(0)
sales = transaction[transaction['Type of transaction']=='Sales'].TotalPrice
cancelled = transaction[transaction['Type of transaction']=='Cancelled'].TotalPrice

In [None]:
fig, axes=plt.subplots(nrows=2, figsize=(12,10))
axes[0].hist(sales, bins = 200)
axes[0].hist(cancelled, bins = 200)
axes[0].set_title('Total price per transaction distribution')
axes[0].legend(labels=["Sales","Cancelled"])

axes[1].hist(sales, bins = 200)
axes[1].hist(cancelled, bins = 200)
axes[1].set_title('Total price per transaction distribution(zoom)')
axes[1].legend(labels=["Sales","Cancelled"])
axes[1].set_xlim([-20000,20000])
plt.show()


Transactions with a negative total price:<a id='sectiondata1'></a>

In [None]:
data_online_A

#### 2.2.2. StockCode
This feature contains the codes assigned to the products in stock. Because the code itself carries little information, we use the description to better understand the business from the e-commerce data.

- Quantity

These plots show the most popular products (sales and cancelled). Notably, the two most popular items are the same for sales and cancellations, with exactly the same quantities, so these sales were probably mistakes that were then cancelled.


In [None]:
ind = data_online[((data_online['Type of transaction']=='Sales') & (data_online['Quantity']<=0))].index
data_online = data_online.drop(ind)

In [None]:
stock = data_online.groupby(['Type of transaction','StockCode'])['Quantity'].sum().reset_index(0)
description = data_online.groupby(['Type of transaction','Description'])['Quantity'].sum().reset_index(0)

In [None]:
fig, axes=plt.subplots(nrows=2, figsize=(12,16))
## first plot Sales
description[description['Type of transaction']=='Sales'].Quantity.sort_values(ascending=False)[0:20].plot.barh(ax=axes[0])
axes[0].set_title('The 20 most popular products (sales)')
description[description['Type of transaction']=='Cancelled'].Quantity.sort_values()[0:20].plot.barh(ax=axes[1])
axes[1].set_title('The 20 most popular products (cancelled)')
plt.show()

- Price

We plot the 30 most expensive products. Most of the items are Manual, Dotcom Postage and Amazon fee, so they do not represent the actual products the company sells.

In the plot of the 30 cheapest products, every item has a price of 0. Looking into this, we found that the same products have prices above 0 on other dates. We therefore think these items may be gifts for some customers, which would explain the zero price.
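
One way to check whether a zero-priced product is sold at a positive price elsewhere is to compare the minimum and maximum unit price per description. This is a sketch on made-up rows, not the check actually run for the notebook.

```python
import pandas as pd

# Made-up sales lines: 'LAMP' appears once at zero price and once at 4.95
sales_lines = pd.DataFrame({
    "Description": ["LAMP", "LAMP", "MUG"],
    "UnitPrice": [0.0, 4.95, 1.25],
})

price_range = sales_lines.groupby("Description")["UnitPrice"].agg(["min", "max"])
# Products sold at zero at least once but at a positive price elsewhere
sometimes_free = price_range[(price_range["min"] == 0) & (price_range["max"] > 0)]
print(sometimes_free)
```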

In [None]:
fig, axes=plt.subplots(nrows=2, figsize=(12,16))
data_online[data_online['Type of transaction']=='Sales'][['Description', 'UnitPrice']].drop_duplicates().sort_values('UnitPrice', ascending = False).set_index('Description')[0:30].plot.barh(ax=axes[0])
axes[0].set_title('The 30 most expensive products(sales)')
data_online[data_online['Type of transaction']=='Sales'].dropna()[['Description', 'UnitPrice']].drop_duplicates().sort_values('UnitPrice', ascending = True).set_index('Description')[0:30].plot.barh(ax=axes[1])
axes[1].set_title('The 30 cheapest products(sales)')
plt.show()

#### 2.2.3. InvoiceDate

The date on which the sale or cancellation took place.

- Transaction per date

The plots below show the number of transactions over different periods of time, to understand how busy the business can be.
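
The date parts used for these groupings come from the pandas `.dt` accessor. A small sketch on two made-up timestamps in the same text format as InvoiceDate; `dt.day_name()` is shown as an alternative to mapping weekday numbers by hand.

```python
import pandas as pd

# Two made-up timestamps in the same text format as InvoiceDate
raw = pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"])
stamps = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

parts = pd.DataFrame({
    "day": stamps.dt.day,
    "weekday": stamps.dt.day_name(),  # full names, e.g. 'Friday'
    "hour": stamps.dt.hour,
})
print(parts)
```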

In [None]:
data_online['InvoiceDate'] = pd.to_datetime(data_online.InvoiceDate, format ='%m/%d/%Y %H:%M' )
data_online['date_Invoice'] = data_online['InvoiceDate'].dt.date
data_online['time_InvoiceDate'] = data_online['InvoiceDate'].dt.time
data_online['year_InvoiceDate'] = data_online['InvoiceDate'].dt.year
#define day of the month
data_online['day_InvoiceDate'] = data_online['InvoiceDate'].dt.day
#define days of the week
data_online['weekday_InvoiceDate'] = data_online['InvoiceDate'].dt.weekday
days = {0:'Mon', 1:'Tues', 2:'Wed', 3:'Thurs', 4:'Fri', 5:'Sat', 6:'Sun'}
data_online['weekday_InvoiceDate'] = data_online['weekday_InvoiceDate'].apply(lambda x: days[x])
# hour
data_online['hour_InvoiceDate'] = data_online['InvoiceDate'].dt.hour

In [None]:
fig, axes=plt.subplots(nrows=4, figsize=(16,16))
plt.subplots_adjust(hspace=0.4)
#plot number transaction per date
df_transaction_date = data_online[['Type of transaction', 'InvoiceNo','date_Invoice']].drop_duplicates()
df_transaction_date = df_transaction_date.groupby(['Type of transaction', 'date_Invoice'])['date_Invoice'].count().to_frame().reset_index(0)
df_transaction_date[df_transaction_date['Type of transaction']=='Cancelled'].date_Invoice.plot.line(color = 'orange', ax = axes[0])
df_transaction_date[df_transaction_date['Type of transaction']=='Sales'].date_Invoice.plot.line(color = 'blue', ax = axes[0])
axes[0].set_title('Number of transaction per date')
axes[0].legend(labels=["Cancelled","Sales"])

#plot mean transaction per day
df_day_transaction = data_online[['Type of transaction', 'InvoiceNo','date_Invoice','day_InvoiceDate']].drop_duplicates()
df_day_transaction = df_day_transaction.groupby(['Type of transaction', 'date_Invoice','day_InvoiceDate'])['day_InvoiceDate'].count().to_frame().reset_index(0)
df_day_transaction_cancelled = df_day_transaction[df_day_transaction['Type of transaction'] == 'Cancelled'].reset_index([0])
df_day_transaction_cancelled.groupby(df_day_transaction_cancelled.index)['day_InvoiceDate'].mean().plot.line(ax=axes[1], color='orange')
df_day_transaction_sales = df_day_transaction[df_day_transaction['Type of transaction'] == 'Sales'].reset_index([0])
df_day_transaction_sales.groupby(df_day_transaction_sales.index)['day_InvoiceDate'].mean().plot.line(ax=axes[1], color='blue')
axes[1].set_title('Mean of number of transaction per day of the month')
axes[1].legend(labels=["Cancelled","Sales"])

#plot mean transaction per weekday
df_weekday_transaction = data_online[['Type of transaction', 'InvoiceNo','date_Invoice','weekday_InvoiceDate']].drop_duplicates()
df_weekday_transaction = df_weekday_transaction.groupby(['Type of transaction', 'date_Invoice','weekday_InvoiceDate'])['weekday_InvoiceDate'].count().to_frame().reset_index(0)
df_weekday_transaction_cancelled = df_weekday_transaction[df_weekday_transaction['Type of transaction'] == 'Cancelled'].reset_index([0])
df_weekday_transaction_cancelled.groupby(df_weekday_transaction_cancelled.index)['weekday_InvoiceDate'].mean().plot.line(ax=axes[2], color='orange')
df_weekday_transaction_sales = df_weekday_transaction[df_weekday_transaction['Type of transaction'] == 'Sales'].reset_index([0])
df_weekday_transaction_sales.groupby(df_weekday_transaction_sales.index)['weekday_InvoiceDate'].mean().plot.line(ax=axes[2], color='blue')
axes[2].set_title('Mean of number of transaction per weekday')
axes[2].legend(labels=["Cancelled","Sales"])

#plot mean transaction per hour
df_hour_transaction = data_online[['Type of transaction', 'InvoiceNo','date_Invoice','hour_InvoiceDate']].drop_duplicates()
df_hour_transaction = df_hour_transaction.groupby(['Type of transaction', 'date_Invoice','hour_InvoiceDate'])['hour_InvoiceDate'].count().to_frame().reset_index(0)
df_hour_transaction_cancelled = df_hour_transaction[df_hour_transaction['Type of transaction'] == 'Cancelled'].reset_index([0])
df_hour_transaction_cancelled.groupby(df_hour_transaction_cancelled.index)['hour_InvoiceDate'].mean().plot.line(ax=axes[3], color='orange')
df_hour_transaction_sales = df_hour_transaction[df_hour_transaction['Type of transaction'] == 'Sales'].reset_index([0])
df_hour_transaction_sales.groupby(df_hour_transaction_sales.index)['hour_InvoiceDate'].mean().plot.line(ax=axes[3], color='blue')
axes[3].set_title('Mean of number of transaction per hour')
axes[3].legend(labels=["Cancelled","Sales"])
plt.show()


- Income per date

As we saw before, some cancellations caused losses to the business. The graph below shows the net income per date. Net income is negative at only one point. In general, income tends to increase, particularly in the last two months.
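
Net income per date is the sum of line totals per day, with cancellations entering as negative amounts. A sketch on made-up values that also lists the days with a negative balance:

```python
import datetime
import pandas as pd

# Made-up line totals: a large cancellation pushes the second day below zero
lines = pd.DataFrame({
    "date_Invoice": [datetime.date(2011, 1, 4)] * 2 + [datetime.date(2011, 1, 5)] * 2,
    "TotalPrice": [100.0, -20.0, 50.0, -80.0],
})

net_income = lines.groupby("date_Invoice")["TotalPrice"].sum()
loss_days = net_income[net_income < 0]
print(loss_days)
```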

In [None]:
# net income per day 
fig, axes=plt.subplots(nrows=1, figsize=(16,5))
data_online.groupby('date_Invoice')['TotalPrice'].sum().plot.line(ax=axes)
plt.axhline(y=0, color='r', linestyle='-.')
plt.title('Net income per date')
plt.show()

- Product unit price variation

Looking at different items, we noticed that the same product may have different prices, as the table below shows. For that reason, we select the 10 most popular products and plot their average unit price per day over time to see how volatile the price is. We average the price per day because the price of a product can differ even within the same day.
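
A sketch of the daily averaging on made-up prices for a single hypothetical product. The `spread` column is an addition for illustration: it shows how far prices diverge within one day.

```python
import pandas as pd

# Made-up prices for one product across two days
prices = pd.DataFrame({
    "Description": ["JAR", "JAR", "JAR", "JAR"],
    "date_Invoice": ["2011-12-08", "2011-12-08", "2011-12-09", "2011-12-09"],
    "UnitPrice": [1.25, 1.25, 1.04, 2.46],
})

daily = prices.groupby(["Description", "date_Invoice"])["UnitPrice"].agg(["mean", "min", "max"])
daily["spread"] = daily["max"] - daily["min"]  # 0 means a single price that day
print(daily)
```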

In [None]:
# Get the 10 most popular products to look at how their price changes over time
list_popular_products = description[description['Type of transaction']=='Sales'].Quantity.sort_values(ascending=False)[0:10].index.to_list()
df_popular_products = data_online[data_online['Description'].isin(list_popular_products)]

In [None]:
data_online[(data_online['Description']=='MEDIUM CERAMIC TOP STORAGE JAR') & (data_online['date_Invoice']==datetime.date(2011, 12, 9))]

In [None]:
fig, axes=plt.subplots(nrows=1, figsize=(16,8))
df_popular_products.groupby(['Description', 'date_Invoice'])['UnitPrice'].mean().unstack(level=0).fillna(0).plot.line(ax=axes)
plt.title('The 10 most popular products\' price over time')
plt.show()

#### 2.2.4. CustomerID

Over this year, the company had customers from different countries, although the main country was the UK, as the map below shows.
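
Counting customers per country amounts to counting distinct CustomerID values within each country. A sketch on made-up rows, including one row without a CustomerID:

```python
import pandas as pd

# Made-up rows: customer 1 appears on two invoices in the same country
orders = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 3.0, None],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France", "France"],
})

# nunique() ignores missing CustomerID values by default
customers_per_country = orders.groupby("Country")["CustomerID"].nunique()
share = customers_per_country / customers_per_country.sum() * 100
print(share.round(1))
```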

In [None]:
import geopandas as gpd

In [None]:
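# Note: geopandas.datasets was deprecated in GeoPandas 0.14 and removed in 1.0;
# on newer versions, load a local copy of the Natural Earth data instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))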


In [None]:
# Harmonise country names with the map data; only the Country column is touched
country_names = {'RSA': 'South Africa', 'EIRE': 'Ireland', 'USA': 'United States of America',
                 'Czech Republic': 'Czechia', 'Channel Islands': 'United Kingdom'}
data_online['Country'] = data_online['Country'].replace(country_names)

df_transaction_customer = data_online[['CustomerID','Country']].drop_duplicates()
df_customer_country = df_transaction_customer.groupby('Country')['CustomerID'].count().to_frame().reset_index()

In [None]:
df_customer_geographic = world.merge(df_customer_country, left_on='name', right_on='Country', how='outer')
df_customer_geographic[['Country', 'CustomerID']] = df_customer_geographic[['Country', 'CustomerID']].fillna(0)
df_customer_geographic = df_customer_geographic.dropna()
df_customer_geographic['percenta_customer'] = df_customer_geographic.CustomerID / df_customer_geographic.CustomerID.sum()*100

In [None]:
fig, axes=plt.subplots(nrows=1, figsize=(16,10))
df_customer_geographic.plot(column='CustomerID',cmap='OrRd',scheme = 'User_Defined', legend=True,
                           classification_kwds=dict(bins=[0,20,40,60,80,100]), ax=axes)
plt.title('Customers\' Location')
axes.axis('off')


plt.show()



- Customer loyalty

We draw some plots that show customer behaviour.

First, a bar graph displays the best customers and the number of purchases they made over the period.

Second, a histogram shows the number of transactions per customer. Most customers do not make many purchases.

Finally, we plot the trend of new customers joining the company; the cumulative number of customers keeps increasing.
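
The new-customer trend can also be sketched by taking each customer's first purchase date and counting how many customers start on each date. This is an alternative formulation on made-up data, not necessarily the exact computation used in the notebook.

```python
import pandas as pd

# Made-up purchases; customer 1 buys on two different days
purchases = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3],
    "date_Invoice": pd.to_datetime(["2011-01-04", "2011-01-04", "2011-01-05", "2011-01-06"]),
})

# First purchase date per customer, then how many customers start on each date
first_purchase = purchases.groupby("CustomerID")["date_Invoice"].min()
new_per_day = first_purchase.value_counts().sort_index()
cumulative_new = new_per_day.cumsum()
print(cumulative_new)
```

Using the minimum date per customer does not depend on the row order of the data.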




In [None]:
df_customer_country_trans = data_online[['CustomerID','Country', 'InvoiceNo']].drop_duplicates()

In [None]:
fig, axes=plt.subplots(nrows=3, figsize=(12,15))
plt.subplots_adjust(hspace=0.5)
#plot 1
df_customer_country_trans.groupby('CustomerID')['InvoiceNo'].count().sort_values(ascending = False).to_frame().head(60).plot.bar(ax =axes[0])
axes[0].set_title('The best 60 customer and the number of purchases')

#plot 2
serie_customer_country = df_customer_country_trans.groupby('CustomerID')['InvoiceNo'].count()
sns.histplot(serie_customer_country, ax=axes[1])
axes[1].set_title('Number of transaction per customer distribution')
axes[1].set_xlim([0,300])

#plot 3
df_customer_date = data_online[['CustomerID', 'date_Invoice']].drop_duplicates()
list_idx = df_customer_date[['CustomerID']].drop_duplicates().index.to_list()
customer_news = df_customer_date[df_customer_date.index.isin(list_idx)].groupby('date_Invoice')['CustomerID'].count().to_frame()
customer_news.CustomerID.cumsum().plot.line(ax=axes[2])
axes[2].set_title('Cumulative sum of new customers')
plt.show()

- Behaviour of the 10 best customers

We try to understand how often the best customers make a transaction.

The first two graphs show how often the 10 best customers place an order; we split them into two plots (5 customers each) to make them easier to read. Some customers even show patterns in their purchases.

The last plot shows the cumulative number of transactions per customer over this period, which trends upward for these customers.
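
Grouping by week needs the Monday that starts each week. A vectorised sketch, assuming timestamps in a pandas Series: subtract the weekday number (Monday = 0) from the date. The second made-up timestamp is chosen because 2011-01-01 belongs to ISO week 52 of 2010, so pairing the calendar year with the ISO week number would misplace it.

```python
import pandas as pd

# Two made-up timestamps: a Friday, and a Saturday at the turn of the year
stamps = pd.Series(pd.to_datetime(["2011-12-09 12:50", "2011-01-01 10:00"]))

# Drop the time of day, then step back by the weekday number (Monday = 0)
beginning_week = stamps.dt.normalize() - pd.to_timedelta(stamps.dt.weekday, unit="D")
print(beginning_week.dt.date.tolist())
```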

In [None]:
data_online['weekOfyear_InvoiceDate'] = data_online['InvoiceDate'].dt.isocalendar().week
data_online['beginning_week'] = data_online.apply(lambda x: Week(int(x.year_InvoiceDate), 
                                                        int(x.weekOfyear_InvoiceDate)).monday(), axis=1)



In [None]:
# we separate into two groups the 10 best customers in order to see better in the plot 
list_best_10_customer = df_customer_country_trans.groupby('CustomerID')['InvoiceNo'].count().sort_values(ascending = False).to_frame().head(10).index.to_list()
df_best_5_customer = data_online[data_online['CustomerID'].isin(list_best_10_customer[0:5])]
df_best_2_5_customer = data_online[data_online['CustomerID'].isin(list_best_10_customer[5::])]
df_best_10_customer = data_online[data_online['CustomerID'].isin(list_best_10_customer)]


In [None]:
df_best_5 = df_best_5_customer[df_best_5_customer['Type of transaction']=='Sales'][['InvoiceNo', 'CustomerID', 'beginning_week']].drop_duplicates()
df_best_2_5 = df_best_2_5_customer[df_best_2_5_customer['Type of transaction']=='Sales'][['InvoiceNo', 'CustomerID', 'beginning_week']].drop_duplicates()
df_best_10 = df_best_10_customer[df_best_10_customer['Type of transaction']=='Sales'][['InvoiceNo', 'CustomerID', 'beginning_week']].drop_duplicates()

In [None]:
fig, axes=plt.subplots(nrows=3, figsize=(14,15))
plt.subplots_adjust(hspace=0.3)
df_best_5.groupby(['beginning_week','CustomerID'])['InvoiceNo'].count().unstack().fillna(0).plot.line(ax=axes[0])
axes[0].set_title('Frequency of transaction from the 5 best customer (Grouped by weeks) ')
df_best_2_5.groupby(['beginning_week','CustomerID'])['InvoiceNo'].count().unstack().fillna(0).plot.line(ax=axes[1])
axes[1].set_title('Frequency of transaction from the second of 5 best customer (Grouped by weeks) ')
df_best_10.groupby(['beginning_week','CustomerID'])['InvoiceNo'].count().unstack().fillna(0).cumsum().plot.line(ax=axes[2])
axes[2].set_title('Cumulative sum transaction from the 10 best customers (Grouped by weeks)')
plt.show()


## 3. Recommender Model<a id='section3'></a>
We are going to implement item-based collaborative filtering. We use KNN, which is one of the simplest models for recommender systems.

We assume that if a customer buys an item the rating will be 1 and otherwise it will be 0, since our data does not have any rating. For the nearest neighbour search, we will use the cosine similarity. 
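
With 0/1 ratings, the cosine similarity between two items reduces to the number of shared buyers divided by the square root of the product of their buyer counts, and the cosine distance is one minus that value. A small NumPy sketch on a made-up item-by-customer matrix:

```python
import numpy as np

# Rows are items, columns are customers; 1 means the customer bought the item (made-up)
purchases = np.array([
    [1, 1, 0, 0],  # item 0
    [1, 1, 1, 0],  # item 1
    [0, 0, 0, 1],  # item 2
], dtype=float)

norms = np.linalg.norm(purchases, axis=1, keepdims=True)
cosine_similarity = (purchases @ purchases.T) / (norms * norms.T)
cosine_distance = 1.0 - cosine_similarity

# Neighbours of item 0, nearest first (position 0 is the item itself)
neighbours_of_item0 = np.argsort(cosine_distance[0])
print(neighbours_of_item0, cosine_distance[0].round(3))
```

Items 0 and 1 share two buyers, so they are close; item 2 shares no buyers with item 0, which gives the maximum distance of 1 for non-negative vectors.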

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix


In [None]:
df_stock_customer = data_online[data_online['Type of transaction']=='Sales'][['Description','CustomerID']].drop_duplicates()
df_stock_customer['rating'] = 1

In [None]:
df_stock_customer_pivot = df_stock_customer.pivot(index = 'Description', columns = 'CustomerID', values ='rating').fillna(0)
df_stock_customer_matrix = csr_matrix(df_stock_customer_pivot.values)
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(df_stock_customer_matrix)

In [None]:
query_index = np.random.choice(df_stock_customer_pivot.shape[0])
distances, indices = model_knn.kneighbors(df_stock_customer_pivot.iloc[query_index,:].values.reshape(1,-1), n_neighbors = 8)

Below is an example of a particular case: if you buy one item, the system recommends the following seven items.

In [None]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        # the first neighbour is the query item itself
        print('Recommendations for {0}:\n'.format(df_stock_customer_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, df_stock_customer_pivot.index[indices.flatten()[i]],
                                                      distances.flatten()[i]))

## 4. Conclusion<a id='section4'></a>

In this notebook we have analysed the transactions occurring between 01-12-2010 and 09-12-2011 for a UK-based e-commerce business, and we have implemented one of the simplest models for recommending new purchases to customers.

While analysing the dataframe, we discovered some inconsistencies in the inputs, for example:
- Different prices for the same product.
- When the price is 0, the product may be damaged or a gift for the customer; the cases differ in whether the quantity is negative and whether the CustomerID is identified.

Some information would be useful to have, such as:
- For damaged or lost products, it would be interesting to have the value of these items in order to study the company's losses.
- Ratings: when building a recommender system, it is much better to have real ratings or opinions from customers.

Finally, we would like to know how new the company is and how it has evolved; in other words, we would like a longer period of transactions.