# This notebook generates the synthetic data required as input to our Facebook/Prophet model
- Some noise will be added so that the data resembles real data.
- The notebook output will be saved in the data folder (RAW).

## Two tables will be generated:
- The first table will attach every product to a cluster of products (5 clusters A, B, C, D and E will be used):

| Product_code  |  Associated cluster |
|---|---|
| CLA01  |  A  |
| CLA02  |   E  |
| CLB01  |   A  |

The first three characters of the product code identify the client: CLA is Client A, CLB is Client B, and so on.
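As a small illustration of this naming convention, the sketch below splits a product code into its client prefix and the remaining product suffix. The helper `split_product_code` is hypothetical and is not part of the notebook.

```python
# Hypothetical helper (not part of the original notebook): split a product
# code into its 3-character client prefix and the remaining product suffix.
def split_product_code(product_code):
    """Return (client_code, product_suffix) for a code such as 'CLA01'."""
    return product_code[:3], product_code[3:]


client, suffix = split_product_code("CLB01")
print(client, suffix)  # CLB 01
```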

- The second table is the actual Sales history table:

| Product_code  | Date  |  Quantity |
|---|---|---|
|  CLB01 | 25/07/2019  | 1,000  |
|  CLB01 | 19/07/2019  |  1,500 |
|   CLA02 | 23/07/2019  |  10,000 |
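For reference, the sample rows above could be held in a pandas DataFrame as sketched below. The variable name `sales_example` is an assumption made for this illustration. The dates are written day-first (DD/MM/YYYY), so the format is passed explicitly rather than left to inference.

```python
import pandas as pd

# Illustrative only: the three sample rows from the table above.
sales_example = pd.DataFrame(
    {
        "Product_code": ["CLB01", "CLB01", "CLA02"],
        "Date": ["25/07/2019", "19/07/2019", "23/07/2019"],
        "Quantity": [1000, 1500, 10000],
    }
)

# The dates are day-first, so state the format instead of relying on inference.
sales_example["Date"] = pd.to_datetime(sales_example["Date"], format="%d/%m/%Y")
print(sales_example.dtypes)
```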

### Product table generation

#### 1. Import the famous pandas and numpy libraries

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Set the seed so that the same data is generated on every run

from numpy.random import RandomState
random_state=RandomState(9999)
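The sketch below illustrates why a fixed seed makes the data reproducible: two generators created with the same seed return identical draws. The names `first_run` and `second_run` are used only for this illustration.

```python
from numpy.random import RandomState

# Two independent generators created with the same seed...
first_run = RandomState(9999).randint(0, 100, 5)
second_run = RandomState(9999).randint(0, 100, 5)

# ...produce exactly the same sequence of draws.
print((first_run == second_run).all())  # True
```

Note that the generator state advances with every call, so re-running a single cell without re-creating `random_state` will generally give different values than a fresh top-to-bottom run.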

In [23]:
#Input parameters

number_of_clients=5
min_number_of_products_per_client=1
max_number_of_products_per_client=6  # exclusive upper bound for randint: at most 5 products are drawn
number_of_clusters=5

In [5]:
number_of_products_list=random_state.randint(min_number_of_products_per_client,
                                             max_number_of_products_per_client,
                                             number_of_clients)

In [6]:
number_of_products_list

array([2, 5, 2, 4, 1])
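One detail worth keeping in mind: `randint(low, high, size)` draws from the half-open interval `[low, high)`, so with the parameters above each client gets between 1 and 5 products. A quick check, using an arbitrary seed chosen only for this illustration:

```python
from numpy.random import RandomState

rs = RandomState(0)  # arbitrary seed, for illustration only
draws = rs.randint(1, 6, 10_000)

# The upper bound 6 is never drawn.
print(draws.min(), draws.max())  # 1 5
```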

In [7]:
def letter_range(start, stop="[", step=1):
    """Yield uppercase letters from start up to, but not including, stop."""
    # The default stop "[" is the character right after "Z"
    for ord_ in range(ord(start.upper()), ord(stop.upper()), step):
        yield chr(ord_)

In [8]:
clients_list=['CL'+ l for l in letter_range(chr(97),chr(97+number_of_clients))]

In [9]:
clients_list

['CLA', 'CLB', 'CLC', 'CLD', 'CLE']
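As an alternative sketch, the same list can be built without a custom generator by slicing `string.ascii_uppercase` from the standard library. The name `clients_list_alt` is used only for this illustration.

```python
import string

number_of_clients = 5  # same value as in the input parameters above

# Take the first `number_of_clients` uppercase letters and prefix them with "CL".
clients_list_alt = ["CL" + letter for letter in string.ascii_uppercase[:number_of_clients]]
print(clients_list_alt)  # ['CLA', 'CLB', 'CLC', 'CLD', 'CLE']
```

This only works for up to 26 clients, which is the same limit as the letter-based approach above.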

In [18]:
product_codes=[(client,client+str(i))
               for number_of_products,client in zip(number_of_products_list,clients_list)
               for i in range(number_of_products)]

In [19]:
product_codes

[('CLA', 'CLA0'),
 ('CLA', 'CLA1'),
 ('CLB', 'CLB0'),
 ('CLB', 'CLB1'),
 ('CLB', 'CLB2'),
 ('CLB', 'CLB3'),
 ('CLB', 'CLB4'),
 ('CLC', 'CLC0'),
 ('CLC', 'CLC1'),
 ('CLD', 'CLD0'),
 ('CLD', 'CLD1'),
 ('CLD', 'CLD2'),
 ('CLD', 'CLD3'),
 ('CLE', 'CLE0')]
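The codes generated here (`CLA0`, `CLA1`, ...) differ from the zero-padded, 1-based format shown in the introduction (`CLA01`, `CLA02`, ...). If that format is wanted, a minimal sketch could look like the following. The two-digit padding and the name `padded_product_codes` are assumptions, not something the notebook specifies.

```python
# Values copied from the cell outputs above.
number_of_products_list = [2, 5, 2, 4, 1]
clients_list = ["CLA", "CLB", "CLC", "CLD", "CLE"]

# Assumption: products are numbered from 1 and padded to two digits.
padded_product_codes = [
    (client, f"{client}{i:02d}")
    for number_of_products, client in zip(number_of_products_list, clients_list)
    for i in range(1, number_of_products + 1)
]
print(padded_product_codes[:3])  # [('CLA', 'CLA01'), ('CLA', 'CLA02'), ('CLB', 'CLB01')]
```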

In [35]:
index = pd.MultiIndex.from_tuples(product_codes, names=['first', 'second'])

In [36]:
product_table=pd.Series(random_state.randint(0,number_of_clusters, len(product_codes)), index=index)

In [46]:
product_table.rename('Cluster',inplace=True)

first  second
CLA    CLA0      1
       CLA1      3
CLB    CLB0      0
       CLB1      1
       CLB2      2
       CLB3      4
       CLB4      2
CLC    CLC0      1
       CLC1      2
CLD    CLD0      0
       CLD1      4
       CLD2      2
       CLD3      3
CLE    CLE0      3
Name: Cluster, dtype: int64

In [47]:
product_table

first  second
CLA    CLA0      1
       CLA1      3
CLB    CLB0      0
       CLB1      1
       CLB2      2
       CLB3      4
       CLB4      2
CLC    CLC0      1
       CLC1      2
CLD    CLD0      0
       CLD1      4
       CLD2      2
       CLD3      3
CLE    CLE0      3
Name: Cluster, dtype: int64
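The clusters above are integers from 0 to 4, while the introduction describes them as letters A to E in a flat two-column table. The sketch below shows one possible way to get to that shape. It uses a small stand-in for `product_table` (its first three rows), and the names `product_table_example`, `cluster_letters` and `flat_table` are assumptions made for this illustration.

```python
import pandas as pd

# Small stand-in for product_table, using the first three rows shown above.
index = pd.MultiIndex.from_tuples(
    [("CLA", "CLA0"), ("CLA", "CLA1"), ("CLB", "CLB0")], names=["first", "second"]
)
product_table_example = pd.Series([1, 3, 0], index=index, name="Cluster")

# Assumed mapping: 0 -> A, 1 -> B, ..., 4 -> E.
cluster_letters = dict(enumerate("ABCDE"))

flat_table = (
    product_table_example.map(cluster_letters)
    .reset_index(level="second")  # move the product code into a column
    .rename(columns={"second": "Product_code", "Cluster": "Associated cluster"})
    .reset_index(drop=True)       # drop the client level of the index
)
print(flat_table)
```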

### Sales table generation

In [55]:
products_list=product_table.index.levels[1].values

In [56]:
# Draw a frequency in days for each product (1 to 181, the upper bound 182 is excluded);
# these frequencies will be used by the dates generator
products_frequencies=random_state.randint(1,int(365.25//2),len(products_list))

In [57]:
products_frequencies

array([141,  75,  74, 147, 114,  91,  69, 161,  44, 104, 171, 103,  27,
        89])
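The dates generator itself is not shown in this section. As a rough sketch of how a frequency in days could be turned into sale dates, `pd.date_range` accepts a day-based frequency string. The start and end dates below are hypothetical, chosen only for this illustration.

```python
import pandas as pd

frequency_in_days = 141  # first value of products_frequencies above

# Hypothetical history window: the notebook does not state one at this point.
dates = pd.date_range(
    start="2018-01-01", end="2019-12-31", freq=f"{frequency_in_days}D"
)

# Consecutive dates are exactly `frequency_in_days` apart.
print(dates)
```

Real sales rarely fall on a perfectly regular grid, so a generator along these lines would probably add some jitter to the dates, in line with the noise mentioned in the introduction.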