## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work.

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the `raw_sales` table from the `retail_sales` database, one of Ironhack's databases.

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.

## Instructions
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
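
The last two instructions could look roughly like the sketch below. It is only an illustration under stated assumptions: the local database is taken to be SQLite (the task does not name an engine), the table names `clean_sales`, `store_sales` and `item_sales` are hypothetical, and the three-row DataFrame is a toy stand-in for the cleaned daily data.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned daily data (hypothetical values).
clean = pd.DataFrame({
    "shop_id": [2, 2, 59],
    "item_id": [19118, 21367, 4728],
    "item_price": [699.0, 749.0, 249.0],
    "item_cnt_day": [1, 1, 2],
})
clean["item_total_day"] = clean["item_price"] * clean["item_cnt_day"]

# One row per store and one row per item, summing the revenue column.
per_store = clean.groupby("shop_id", as_index=False)["item_total_day"].sum()
per_item = clean.groupby("item_id", as_index=False)["item_total_day"].sum()

# ":memory:" keeps the example self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
tables = {"clean_sales": clean, "store_sales": per_store, "item_sales": per_item}
for name, frame in tables.items():
    # if_exists="append" adds rows to an existing table; "replace" would overwrite it.
    frame.to_sql(name, conn, if_exists="append", index=False)

# Read the results back to confirm the tables were populated.
store_rows = conn.execute(
    "SELECT shop_id, item_total_day FROM store_sales ORDER BY shop_id"
).fetchall()
n_clean = conn.execute("SELECT COUNT(*) FROM clean_sales").fetchone()[0]
conn.close()
print(store_rows, n_clean)
```

Whether a daily run should append to the aggregate tables or rebuild them is a design decision the task leaves open; `if_exists` is the parameter that controls it.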

In [2]:
# import required libraries
import numpy as np
import pandas as pd

In [3]:
raw_sales = pd.read_csv('../retail_sales-raw_sales.csv', sep=';')  # import the file
# sep=';' tells pandas which character separates the fields in the file;
# it does not change how the DataFrame is displayed
raw_sales2 = raw_sales.copy()  # keep a backup copy
raw_sales.head()  # show the first 5 rows of the DataFrame

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04 00:00:00,29,1469,1199.0,1.0
1,2015-01-04 00:00:00,28,21364,479.0,1.0
2,2015-01-04 00:00:00,28,21365,999.0,2.0
3,2015-01-04 00:00:00,28,22104,249.0,2.0
4,2015-01-04 00:00:00,28,22091,179.0,1.0


In [4]:
raw_sales.info()  # inspect the column types and non-null counts
# every column has 4545 non-null entries, so there are no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4545 entries, 0 to 4544
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          4545 non-null   object 
 1   shop_id       4545 non-null   int64  
 2   item_id       4545 non-null   int64  
 3   item_price    4545 non-null   float64
 4   item_cnt_day  4545 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 159.8+ KB
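
The `info()` output shows `date` loaded as `object` and no missing values. As a hedged alternative, the date conversion and the null check can be done right at load time. The sketch below assumes the same semicolon-separated layout and uses an in-memory two-row sample (via `io.StringIO`) instead of the real file.

```python
import io

import pandas as pd

# In-memory sample in the same layout as the daily file (assumption: no index column).
sample = io.StringIO(
    "date;shop_id;item_id;item_price;item_cnt_day\n"
    "2015-01-04 00:00:00;29;1469;1199.0;1.0\n"
    "2015-01-04 00:00:00;28;21364;479.0;1.0\n"
)

# parse_dates converts the listed columns to datetime while reading.
df = pd.read_csv(sample, sep=";", parse_dates=["date"])

# Total number of missing cells across the whole DataFrame.
missing_cells = int(df.isna().sum().sum())

print(df.dtypes)
print(missing_cells)
```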


In [6]:
raw_sales_shops = raw_sales.sort_values(by='shop_id') #sort values by the shop ID
raw_sales_shops #verify the change

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
2821,2015-01-04 00:00:00,2,19118,699.00,1.0
1304,2015-01-04 00:00:00,2,21367,749.00,1.0
1305,2015-01-04 00:00:00,2,21366,1439.00,1.0
1306,2015-01-04 00:00:00,2,19118,699.00,1.0
1307,2015-01-04 00:00:00,2,21677,49.00,1.0
...,...,...,...,...,...
1067,2015-01-04 00:00:00,59,4728,249.00,1.0
1066,2015-01-04 00:00:00,59,7805,498.00,1.0
4078,2015-01-04 00:00:00,59,18454,199.00,1.0
1129,2015-01-04 00:00:00,59,1905,249.00,1.0


In [7]:
raw_sales_shops.item_cnt_day.value_counts()
# count how often each value appears in column item_cnt_day

 1.0     4161
 2.0      264
 3.0       42
-1.0       30
 4.0       30
 5.0        9
 6.0        6
 10.0       3
Name: item_cnt_day, dtype: int64

In [8]:
raw_sales_shops['item_cnt_day'] = raw_sales_shops.item_cnt_day.astype(int)
# change item_cnt_day from float to int for convenience

In [9]:
raw_sales_shops

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
2821,2015-01-04 00:00:00,2,19118,699.00,1
1304,2015-01-04 00:00:00,2,21367,749.00,1
1305,2015-01-04 00:00:00,2,21366,1439.00,1
1306,2015-01-04 00:00:00,2,19118,699.00,1
1307,2015-01-04 00:00:00,2,21677,49.00,1
...,...,...,...,...,...
1067,2015-01-04 00:00:00,59,4728,249.00,1
1066,2015-01-04 00:00:00,59,7805,498.00,1
4078,2015-01-04 00:00:00,59,18454,199.00,1
1129,2015-01-04 00:00:00,59,1905,249.00,1


In [10]:
raw_sales_shops.item_cnt_day.value_counts() 
#looking for error values in item_cnt_day

 1     4161
 2      264
 3       42
-1       30
 4       30
 5        9
 6        6
 10       3
Name: item_cnt_day, dtype: int64

In [15]:
raw_sales_shops['item_cnt_day'] = raw_sales_shops['item_cnt_day'].replace(-1, 1)
# the value -1 is treated as an entry error, so it is replaced with 1
raw_sales_shops.item_cnt_day.value_counts()
# verify the change

1     4191
2      264
3       42
4       30
5        9
6        6
10       3
Name: item_cnt_day, dtype: int64
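
Before overwriting values it can help to count how many rows are affected, so the correction can be checked afterwards. A minimal sketch on a hypothetical five-value Series, applying the same `-1` to `1` replacement used above:

```python
import pandas as pd

# Hypothetical counts, including two of the -1 values seen in the data.
counts = pd.Series([1.0, 2.0, -1.0, 1.0, -1.0], name="item_cnt_day")

# Count the suspicious rows before changing anything.
n_negative = int((counts < 0).sum())

# Same correction as in the notebook (-1 becomes 1), followed by the integer cast.
fixed = counts.replace(-1, 1).astype(int)

print(n_negative)
print(fixed.tolist())
```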

In [20]:
raw_sales_shops['date'] = pd.to_datetime(raw_sales_shops['date'])  # convert the date column from string to datetime

In [23]:
raw_sales_shops.info() #verify the changes

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4545 entries, 2821 to 4103
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          4545 non-null   datetime64[ns]
 1   shop_id       4545 non-null   int64         
 2   item_id       4545 non-null   int64         
 3   item_price    4545 non-null   float64       
 4   item_cnt_day  4545 non-null   int32         
dtypes: datetime64[ns](1), float64(1), int32(1), int64(2)
memory usage: 195.3 KB


In [38]:
# create a new column with the revenue of each row (unit price times units sold)
raw_sales_shops['item_total_day'] = raw_sales_shops['item_price'] * raw_sales_shops['item_cnt_day']

In [39]:
clean_raw_sales = raw_sales_shops.copy() #create a clean table

In [40]:
#verify the result
clean_raw_sales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day,item_total_day
2821,2015-01-04,2,19118,699.00,1,699.00
1304,2015-01-04,2,21367,749.00,1,749.00
1305,2015-01-04,2,21366,1439.00,1,1439.00
1306,2015-01-04,2,19118,699.00,1,699.00
1307,2015-01-04,2,21677,49.00,1,49.00
...,...,...,...,...,...,...
1067,2015-01-04,59,4728,249.00,1,249.00
1066,2015-01-04,59,7805,498.00,1,498.00
4078,2015-01-04,59,18454,199.00,1,199.00
1129,2015-01-04,59,1905,249.00,1,249.00


In [53]:
store_sales = raw_sales_shops.groupby('shop_id')['item_total_day'].sum()
store_sales

shop_id
2     103746.00
3      67443.00
4      29361.00
5      33138.00
6     138678.00
7      52371.00
10     22716.00
12    295173.00
14     57450.00
15    125139.00
16    121923.00
18     35787.00
19     61008.00
21    236487.00
22    150717.00
24     56955.00
25    301026.00
26    120462.00
27    172959.00
28    202512.00
29     85737.00
31    304692.00
34     12117.00
35     89709.00
37    220500.00
38     73482.00
39     34686.00
41     36840.00
42    345705.00
44    143427.00
45     82350.00
46     93903.00
47     80142.00
48     32745.00
49     35784.00
50    142053.00
51     10665.00
52     65527.02
53     50505.00
54    125343.00
55    170847.60
56     54906.00
57    226269.00
58    142863.00
59    113109.00
Name: item_total_day, dtype: float64

In [54]:
item_sales = raw_sales_shops.groupby('item_id')['item_total_day'].sum()
item_sales

item_id
30        507.0
31       1089.0
32        447.0
42        897.0
59        747.0
          ...  
22091    1074.0
22092     537.0
22104    1494.0
22140     652.5
22162    7182.0
Name: item_total_day, Length: 985, dtype: float64
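
The two aggregates above sum only `item_total_day`. If the aggregate tables should also carry the number of units sold, one possible variant uses named aggregation. This is a sketch on toy data; the helper `aggregate` and the output column names `units_sold` and `revenue` are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values).
clean = pd.DataFrame({
    "shop_id": [2, 2, 59],
    "item_id": [19118, 19118, 4728],
    "item_cnt_day": [1, 1, 2],
    "item_total_day": [699.0, 699.0, 498.0],
})


def aggregate(frame, key):
    """Sum units and revenue for each distinct value of `key`."""
    return frame.groupby(key, as_index=False).agg(
        units_sold=("item_cnt_day", "sum"),
        revenue=("item_total_day", "sum"),
    )


per_store = aggregate(clean, "shop_id")
per_item = aggregate(clean, "item_id")

print(per_store)
print(per_item)
```

Using `as_index=False` keeps `shop_id` and `item_id` as regular columns, which is convenient when the result is later written to a database table.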

In [55]:
print(clean_raw_sales)  # cleaned DataFrame
print(store_sales)  # aggregate per store
print(item_sales)  # aggregate per item

           date  shop_id  item_id  item_price  item_cnt_day  item_total_day
2821 2015-01-04        2    19118      699.00             1          699.00
1304 2015-01-04        2    21367      749.00             1          749.00
1305 2015-01-04        2    21366     1439.00             1         1439.00
1306 2015-01-04        2    19118      699.00             1          699.00
1307 2015-01-04        2    21677       49.00             1           49.00
...         ...      ...      ...         ...           ...             ...
1067 2015-01-04       59     4728      249.00             1          249.00
1066 2015-01-04       59     7805      498.00             1          498.00
4078 2015-01-04       59    18454      199.00             1          199.00
1129 2015-01-04       59     1905      249.00             1          249.00
4103 2015-01-04       59    11655     1544.82             1         1544.82

[4545 rows x 6 columns]
shop_id
2     103746.00
3      67443.00
4      29361.00
5      