# Business Cases for Data Science
## Business Case 4 - ManyGiftsUK recommender system
### Group AA
**Members**:
- Emil Ahmadov (m20201004@novaims.unl.pt)
- Doris Macean (m20200609@novaims.unl.pt)
- Doyun Shin (m20200565@novaims.unl.pt)
- Anastasiia Tagiltseva (m20200041@novaims.unl.pt)

<a class="anchor" id="0.1"></a>

# **Table of Contents**

1. [Business Understanding](#1)

2. [Data Understanding](#2)
   - 2.1 [Exploratory Data Analysis](#2.1)
  
3. [Data Preparation](#3)
   - 3.1 [Handling missing values](#3.1)
   - 3.2 [Outliers](#3.2) 
   - 3.3 [Feature engineering](#3.3) 
   - 3.4 [Feature Selection](#3.4) 
   - 3.5 [Encoding](#3.5)
   - 3.6 [Scaling](#3.6)
   
4. [Modeling](#4)
   - 4.1 [](#4.1) 
   - 4.2 [](#4.2)
   - 4.3 [](#4.3)
   
5. [Evaluation](#5)
 
6. [Deployment](#6)

# **1. Business Understanding** <a class="anchor" id="1"></a>

ManyGiftsUK asked us to:

1. Explore the data and build models that address two problems:

    - Recommender system: the website homepage should offer a wide range of products the user might be interested in
    
    - Cold start: offer relevant products to new customers
    
2. Implement adequate evaluation strategies and select an appropriate quality measure
3. In the deployment phase, elaborate on the challenges and recommendations in implementing the recommender system
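
For the cold-start problem listed above, one common fallback is to show new customers the most popular items. The following is a minimal sketch of that idea, not the model built in this notebook; the function name `top_popular_items` and the `(customer_id, item_id)` input format are assumptions made for illustration, and the sample data is invented.

```python
from collections import Counter

def top_popular_items(interactions, k=5):
    """Return the k item ids bought by the largest number of distinct customers.

    `interactions` is an iterable of (customer_id, item_id) pairs (assumed format).
    Each customer counts at most once per item, so a single bulk buyer
    cannot dominate the ranking.
    """
    unique_pairs = set(interactions)
    item_counts = Counter(item for _, item in unique_pairs)
    # Sort by customer count (descending), then by item id for a deterministic order
    ranked = sorted(item_counts.items(), key=lambda pair: (-pair[1], str(pair[0])))
    return [item for item, _ in ranked[:k]]

# Invented toy data: customer c1 bought item A twice
interactions = [("c1", "A"), ("c1", "A"), ("c2", "A"),
                ("c2", "B"), ("c3", "C"), ("c3", "B")]
popular = top_popular_items(interactions, k=2)
print(popular)
```

A list like this could be served to any visitor without purchase history, while returning customers are routed to a personalised model.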

### Project Plan
| Phase | Time | Resources | Risks |
| :--: | :--------: |:--: | :--------: |
| Business Understanding | 2 days | All analysts | Economic and market changes |
| Data Understanding | 2 days | All analysts | Data problems, technological problems |
| Data Preparation | 2 days | Data scientists, DB engineers | Data problems, technological problems |
| Modeling | 4 days | Data scientists | Technological problems, inability to build an adequate model |
| Evaluation | 2 days | All analysts | Economic change, inability to implement results |
| Deployment | 2 days | Data scientists, DB engineers, implementation team | Economic change, inability to implement results |

# **2. Data Understanding** <a class="anchor" id="2"></a>

### Metadata
| Name | Meaning | 
| :--: | :--------|
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation|
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product|
| Description | Product (item) name. Nominal|
| Quantity | The quantities of each product (item) per transaction. Numeric|
| InvoiceDate | Invoice Date and time. Numeric, the day and time when each transaction was generated|
| UnitPrice | Unit price. Numeric, Product price per unit in pounds|
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer|
| Country | Country name. Nominal, the name of the country where each customer resides|

## 2.1 Exploratory Data Analysis <a class="anchor" id="2.1"></a>

In [4]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

In [None]:
data = pd.read_csv('events_example.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'], unit='ms')
data.head()

In [None]:
print('Data size: ', data.shape)
print('Unique visitors: ', data['visitorid'].unique().shape[0])
print('Unique items: ', data['itemid'].unique().shape[0])

In [None]:
counts = data['event'].value_counts()
fig = px.bar(counts, log_y = True, labels = {'index':'Event', 'value':'Count'}, color_discrete_sequence = ['rgba(126, 165, 222, 0.8)'])
fig.update_layout(showlegend = False, title = 'Events distribution')
fig.show()

In [None]:
print(f"There are only {round((counts['transaction'] / counts['view']) * 1000)} transactions per 1000 views")

In [None]:
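# Number of visitors per event count, sorted by event count so the bars match the 1..10 labels
counts = data['visitorid'].value_counts().value_counts().sort_index()
temp = [*counts.reindex(range(1, 11), fill_value=0), counts[counts.index > 10].sum()]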
index = list(map(str, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, '>10']))
fig = px.bar(x = index, y = temp,  labels = {'x':'Count of events by visitor', 'y':'Frequency'}, color_discrete_sequence = ['rgba(126, 165, 222, 0.8)'])
fig.show()

In [None]:
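# Number of items per event count, sorted by event count so the bars match the 1..10 labels
counts = data['itemid'].value_counts().value_counts().sort_index()
temp = [*counts.reindex(range(1, 11), fill_value=0), counts[counts.index > 10].sum()]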
index = list(map(str, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, '>10']))
fig = px.bar(x = index, y = temp, labels = {'x':'Count of events with item', 'y':'Frequency'}, color_discrete_sequence = ['rgba(126, 165, 222, 0.8)'])
fig.show()
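
The two cells above apply the same bucketing to visitors and to items. One way to factor it out is sketched below; `bucket_event_counts` is a hypothetical helper, not part of the original notebook, and it uses only the standard library so that each bucket is matched to its label by value rather than by position.

```python
from collections import Counter

def bucket_event_counts(ids, cap=10):
    """Count how many ids occur exactly 1, 2, ..., cap times, plus a '>cap' bucket."""
    events_per_id = Counter(ids)             # id -> number of events
    freq = Counter(events_per_id.values())   # number of events -> number of ids
    labels = [str(n) for n in range(1, cap + 1)] + [f">{cap}"]
    values = [freq.get(n, 0) for n in range(1, cap + 1)]
    # Everything above the cap is summed into the final bucket
    values.append(sum(c for n, c in freq.items() if n > cap))
    return labels, values

# Toy example: 'a' appears twice, 'b' once, 'c' three times, 'd' once
labels, values = bucket_event_counts(["a", "a", "b", "c", "c", "c", "d"], cap=2)
print(labels, values)
```

With the notebook's data this could be called as `labels, values = bucket_event_counts(data['visitorid'])` and passed to `px.bar(x=labels, y=values)`.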

In [None]:
fig = px.line(data['timestamp'].value_counts().groupby(pd.Grouper(freq='D')).sum(), labels = {'index':'', 'value':''}, color_discrete_sequence = ['rgba(12, 24, 41, 0.8)'], width = 1200)
fig.update_layout(
    showlegend = False,
    title = 'Event frequencies by day'
)
fig.show()

In [None]:
temp = data['timestamp'].value_counts()
temp = temp.groupby(temp.index.month).sum()
fig = px.bar(temp, color_discrete_sequence = ['rgba(126, 165, 222, 0.8)'], labels = {'index':'', 'value':''})
fig.update_layout(
    showlegend = False,
    title = 'Event frequencies by month'
)
fig.show()

In [None]:
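temp = data['timestamp'].value_counts()
# Reindex so the bars follow calendar order instead of alphabetical order
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temp = temp.groupby(temp.index.day_name()).sum().reindex(weekdays, fill_value=0)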
fig = px.bar(temp, color_discrete_sequence = ['rgba(126, 165, 222, 0.8)'], labels = {'index':'', 'value':''})
fig.update_layout(
    showlegend = False,
    title = 'Event frequencies by weekday'
)
fig.show()