# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work.

You and your team then define the following tasks:
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per store that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per store.
    - A table for the aggregate per item.
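
The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names (`clean`, `aggregate`, `build_tables`), the table names, and the cleaning rules (drop exact duplicates and rows with missing values) are hypothetical choices, and the sample rows are made up to mirror the columns of `raw_sales`.

```python
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Assumed cleaning rules: drop exact duplicate rows and rows with missing values."""
    return raw.drop_duplicates().dropna().reset_index(drop=True)

def aggregate(cleaned: pd.DataFrame, key: str) -> pd.DataFrame:
    """Sum units sold and revenue per `key` ('shop_id' or 'item_id')."""
    with_revenue = cleaned.assign(
        revenue=cleaned["item_price"] * cleaned["item_cnt_day"]
    )
    return with_revenue.groupby(key, as_index=False)[["item_cnt_day", "revenue"]].sum()

def build_tables(raw: pd.DataFrame) -> dict:
    """Return the three tables the task asks for, keyed by a hypothetical table name."""
    cleaned = clean(raw)
    return {
        "sales": cleaned,
        "sales_by_shop": aggregate(cleaned, "shop_id"),
        "sales_by_item": aggregate(cleaned, "item_id"),
    }

# Made-up sample with the same columns as raw_sales; the last row repeats the third.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04"] * 4),
    "shop_id": [29, 28, 28, 28],
    "item_id": [1469, 21364, 21365, 21365],
    "item_price": [1199.0, 479.0, 999.0, 999.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0],
})

tables = build_tables(sample)
print(tables["sales_by_shop"])
```

Keeping each step in its own function makes it easier to rerun the same process on every daily file.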

## Instructions
* Read the CSV you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [1]:
# your code here

from functools import reduce
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

import sys
sys.executable


# Connection Data
driver = 'mysql+pymysql'
ip = '34.65.10.136'
username = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
db = 'retail_sales'
connection_string  = f'{driver}://{username}:{password}@{ip}/{db}'
# Engine & Query
engine = create_engine(connection_string)
query = 'SELECT * FROM raw_sales'
# Database Request
sales = pd.read_sql(query,engine)
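
A password containing characters such as `@` or `/` can confuse URL parsing when it is interpolated directly into a connection string. A minimal sketch of one way to handle this, assuming the credentials come from an environment variable: the helper `build_connection_string`, the variable name `RETAIL_DB_PASSWORD`, and the host and user below are all hypothetical placeholders.

```python
import os
from urllib.parse import quote_plus

def build_connection_string(driver, username, password, host, database):
    """URL-encode the credentials so special characters do not break the URL."""
    return f"{driver}://{quote_plus(username)}:{quote_plus(password)}@{host}/{database}"

# Hypothetical environment variable; the fallback value is a placeholder only.
password = os.environ.get("RETAIL_DB_PASSWORD", "p@ss/word")

connection_string = build_connection_string(
    "mysql+pymysql", "example-user", password, "db.example.com", "retail_sales"
)
print(connection_string)
```

The resulting string can be passed to `create_engine` in the same way as the one built above.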


In [2]:
sales

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
5,2015-01-04,28,21842,149.0,1.0
6,2015-01-04,28,21881,299.0,1.0
7,2015-01-04,29,6930,2199.0,1.0
8,2015-01-04,29,10515,169.0,1.0
9,2015-01-04,29,8624,149.0,1.0


In [3]:
type(sales)

pandas.core.frame.DataFrame

In [4]:
sales.isna().any()

date            False
shop_id         False
item_id         False
item_price      False
item_cnt_day    False
dtype: bool

In [5]:
sales.dtypes

date            datetime64[ns]
shop_id                  int64
item_id                  int64
item_price             float64
item_cnt_day           float64
dtype: object
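
`isna()` only reports missing values. A few further checks may be useful before aggregating, sketched here on made-up rows. The helper name `quality_report` and the rules it applies (duplicate rows, non-positive prices, negative counts) are assumptions, not requirements from the task.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Count rows that may need attention during cleaning."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "non_positive_prices": int((df["item_price"] <= 0).sum()),
        "negative_counts": int((df["item_cnt_day"] < 0).sum()),
    }

# Made-up sample: row 1 repeats row 0, row 2 has a negative count,
# row 3 has a negative price.
sample = pd.DataFrame({
    "shop_id": [28, 28, 29, 29],
    "item_id": [21364, 21364, 1469, 6930],
    "item_price": [479.0, 479.0, 1199.0, -1.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0],
})

report = quality_report(sample)
print(report)
```

Whether such rows should be dropped, corrected, or kept (a negative count could be a return, for instance) is a decision the data alone does not settle.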

In [6]:
sales.shop_id.unique()


array([29, 28, 31, 27, 35, 34, 24, 25, 21, 22, 19, 26, 56, 55, 54, 57, 58,
       48, 49, 50, 47, 53, 52, 51, 44, 45, 41, 46, 37, 38, 39, 59, 42,  6,
        7, 10,  3,  4,  2,  5, 18, 16, 15, 12, 14], dtype=int64)

In [7]:
sales.item_id.unique()


array([ 1469, 21364, 21365, 22104, 22091, 21842, 21881,  6930, 10515,
        8624,  4178,  5643,  5823,  5037,  5013,  5822, 16487, 16360,
       15698, 15812, 15818, 17748, 16923, 16889, 16880, 17027, 15922,
       15645, 20225, 20434, 20069, 19816, 20949, 20916, 18407, 18148,
       18048, 18439, 19116, 18477,  2868,  2853,  2852,  2817,  3141,
        3084,  2047,  2286,  1969,  1744,  2445,  4102,  4334,  4554,
        3553,  3734, 17717, 16205, 16174, 16788, 18147, 12362, 13517,
       13511, 11927, 20610,  1364,  1858,  1563,  1570,  1568, 21362,
       20259, 21367,    32,   486,  1971,  1878,    30, 21605,  3342,
        3340,  3071,  3934,  3777,  2293,  2284,  2578,  2799, 15817,
       16441, 15739, 17322, 14822, 15257, 16113, 16136, 16116, 15856,
       15989, 15857, 20608, 20544, 20521, 18042, 18030, 18026, 18050,
       18046, 18049, 11431,  8079, 11899, 10476, 10164, 14308, 13803,
       14597, 12091, 13591, 12839,  5459,  4393,  4003,  4872,  7308,
        7894,  6129,

In [8]:
sales.item_price.unique()


array([1.19900000e+03, 4.79000000e+02, 9.99000000e+02, 2.49000000e+02,
       1.79000000e+02, 1.49000000e+02, 2.99000000e+02, 2.19900000e+03,
       1.69000000e+02, 1.59000000e+03, 2.99000000e+03, 2.79900000e+03,
       2.59900000e+03, 6.98000000e+02, 1.14900000e+03, 3.99000000e+02,
       7.99000000e+02, 9.90000000e+01, 6.99000000e+02, 3.49000000e+02,
       5.99000000e+02, 1.99000000e+02, 2.14900000e+03, 1.89900000e+03,
       9.80000000e+01, 5.00000000e+00, 1.71900000e+03, 4.49000000e+02,
       6.66000000e+02, 3.79900000e+03, 1.18000000e+03, 1.98000000e+02,
       2.99900000e+03, 3.99900000e+03, 3.39900000e+03, 1.00000000e+03,
       3.59000000e+02, 2.09900000e+03, 2.39000000e+03, 1.49900000e+03,
       1.09900000e+03, 1.79900000e+03, 1.99900000e+03, 7.49000000e+02,
       3.00000000e+02, 4.49900000e+03, 3.49900000e+03, 4.99000000e+02,
       1.25900000e+03, 1.59900000e+03, 6.59000000e+02, 8.99000000e+02,
       7.90000000e+01, 1.19800000e+03, 2.29000000e+03, 1.69900000e+03,
      

In [11]:
sales['item_price'].describe()


count     4545.000000
mean      1031.686121
std       2073.919990
min          3.000000
25%        249.000000
50%        479.000000
75%       1192.000000
max      27990.000000
Name: item_price, dtype: float64

In [12]:
sales.groupby(['item_id']).groups.keys()

dict_keys([30, 31, 32, 42, 59, 74, 109, 259, 464, 482, 486, 492, 493, 494, 686, 687, 787, 799, 803, 806, 837, 839, 970, 971, 1007, 1010, 1045, 1114, 1143, 1201, 1204, 1256, 1260, 1310, 1315, 1364, 1370, 1373, 1384, 1389, 1390, 1467, 1469, 1523, 1534, 1535, 1556, 1563, 1568, 1569, 1570, 1670, 1682, 1744, 1824, 1826, 1855, 1857, 1858, 1866, 1875, 1876, 1877, 1878, 1880, 1883, 1892, 1905, 1914, 1941, 1968, 1969, 1970, 1971, 2047, 2049, 2070, 2140, 2252, 2254, 2269, 2283, 2284, 2286, 2293, 2297, 2308, 2354, 2416, 2429, 2442, 2445, 2517, 2574, 2575, 2578, 2592, 2690, 2734, 2753, 2755, 2760, 2766, 2778, 2799, 2808, 2813, 2817, 2819, 2851, 2852, 2853, 2861, 2868, 2871, 2878, 2881, 2918, 2919, 2934, 2937, 2939, 2946, 2954, 2959, 2969, 2975, 2976, 3007, 3025, 3026, 3027, 3028, 3054, 3071, 3072, 3077, 3084, 3108, 3115, 3141, 3146, 3148, 3156, 3170, 3234, 3236, 3238, 3243, 3251, 3303, 3327, 3329, 3331, 3340, 3341, 3342, 3343, 3423, 3443, 3447, 3460, 3461, 3472, 3473, 3476, 3477, 3484, 3553, 3580,

In [13]:
# Only the last expression in a cell is displayed, so the output below comes from max().
sales.groupby('item_id')['item_id'].min()
sales.groupby('item_id')['item_id'].max()

item_id
30          30
31          31
32          32
42          42
59          59
74          74
109        109
259        259
464        464
482        482
486        486
492        492
493        493
494        494
686        686
687        687
787        787
799        799
803        803
806        806
837        837
839        839
970        970
971        971
1007      1007
1010      1010
1045      1045
1114      1114
1143      1143
1201      1201
         ...  
21486    21486
21591    21591
21605    21605
21669    21669
21674    21674
21677    21677
21679    21679
21684    21684
21704    21704
21726    21726
21761    21761
21762    21762
21793    21793
21842    21842
21881    21881
21893    21893
21902    21902
21922    21922
21962    21962
21976    21976
22058    22058
22073    22073
22076    22076
22087    22087
22088    22088
22091    22091
22092    22092
22104    22104
22140    22140
22162    22162
Name: item_id, Length: 985, dtype: int64

In [14]:
sales.groupby(['shop_id', 'item_id'])['item_cnt_day'].sum()


shop_id  item_id
2        1970       3.0
         1971       3.0
         2871       3.0
         2881       3.0
         3028       3.0
         5380       3.0
         10489      3.0
         10826      3.0
         11786      3.0
         11927      3.0
         14256      3.0
         14259      3.0
         14368      3.0
         14716      3.0
         14720      3.0
         15047      3.0
         15112      6.0
         15489      3.0
         17717      6.0
         18589      3.0
         19118      3.0
         21363      3.0
         21366      3.0
         21367      3.0
         21677      3.0
3        1469       3.0
         1970       3.0
         3692       3.0
         4493       3.0
         4870       3.0
                   ... 
59       2959       3.0
         3027       3.0
         3243       3.0
         3251       3.0
         3484       3.0
         4728       3.0
         4806       6.0
         5361       3.0
         5643       3.0
         5820       3.0

In [16]:
sales.groupby('item_id', as_index=False).agg({"item_price": "sum"})

Unnamed: 0,item_id,item_price
0,30,507.00
1,31,1089.00
2,32,447.00
3,42,897.00
4,59,747.00
5,74,1497.00
6,109,747.00
7,259,747.00
8,464,897.00
9,482,19800.00


In [None]:
ignore = ['17717','8452','6503','17662','1969','11655','13697','21363','15857','19436','2416','21361','21366','4102','17887','11926','21362','21365','1114','11688','11927','18052','18050','18042','21364','7780','21386','2954','19752','20949']
len(ignore)
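
The `ignore` list holds strings, while `item_id` is an `int64` column, so the strings would not be expected to match the integer ids. If the intent is to exclude these items (an assumption, since the list is not used elsewhere in the notebook), a sketch with a shortened list and made-up rows could look like this:

```python
import pandas as pd

# Shortened version of the list above, still stored as strings.
ignore = ["17717", "8452"]

# Convert to integers so the values match the int64 item_id column.
ignore_ids = [int(item) for item in ignore]

# Made-up rows for illustration.
sample = pd.DataFrame({
    "item_id": [17717, 30, 8452, 31],
    "item_price": [999.0, 169.0, 249.0, 363.0],
})

# Keep only the rows whose item_id is not in the ignore list.
kept = sample[~sample["item_id"].isin(ignore_ids)]
print(kept)
```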

In [17]:
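# Group the data frame by shop and item and extract stats from each group
sales.groupby(
    ['shop_id', 'item_id']
).agg(
    {
        'item_cnt_day': 'sum',
        # A dict cannot hold the same key twice: the original "min" entry
        # was silently overridden, so only "max" is computed for item_price.
        'item_price': 'max',
    })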

shop_id,item_id,item_cnt_day,item_price
2,1970,3.0,8999.00
2,1971,3.0,4499.00
2,2871,3.0,999.00
2,2881,3.0,999.00
2,3028,3.0,2599.00
2,5380,3.0,3490.00
2,10489,3.0,399.00
2,10826,3.0,28.00
2,11786,3.0,28.00
2,11927,3.0,539.00
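
A Python dict keeps only the last value for a repeated key, so a dict-based `agg` specification cannot ask for both `min` and `max` of the same column under one key. One option is pandas named aggregation, sketched below on made-up rows. The output column names (`units`, `min_price`, `max_price`) and the prices are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: shop 2 sold item 1970 at two different prices.
sample = pd.DataFrame({
    "shop_id": [2, 2, 2, 3],
    "item_id": [1970, 1970, 1971, 1469],
    "item_price": [8999.0, 8499.0, 4499.0, 1199.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 1.0],
})

# Named aggregation: each output column is a (source column, function) pair.
stats = sample.groupby(["shop_id", "item_id"]).agg(
    units=("item_cnt_day", "sum"),
    min_price=("item_price", "min"),
    max_price=("item_price", "max"),
)
print(stats)
```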


In [None]:
####
sales

In [18]:
sales_sum = sales.groupby(['shop_id', 'item_id'])

In [20]:
sales_sum.head()

Unnamed: 0,date,shop_id,item_id,item_price,item_cnt_day
0,2015-01-04,29,1469,1199.0,1.0
1,2015-01-04,28,21364,479.0,1.0
2,2015-01-04,28,21365,999.0,2.0
3,2015-01-04,28,22104,249.0,2.0
4,2015-01-04,28,22091,179.0,1.0
5,2015-01-04,28,21842,149.0,1.0
6,2015-01-04,28,21881,299.0,1.0
7,2015-01-04,29,6930,2199.0,1.0
8,2015-01-04,29,10515,169.0,1.0
9,2015-01-04,29,8624,149.0,1.0


In [None]:
sales

In [21]:
sales['ExtRev'] = sales['item_price'] * sales['item_cnt_day']

In [22]:
sales_by_item = sales['ExtRev'].groupby([sales['item_id']]).sum()


In [23]:
rev_by_shop = sales['ExtRev'].groupby([sales['shop_id']]).sum()

In [24]:
type(rev_by_shop)

pandas.core.series.Series

In [25]:
#sales.groupby("shop_id")["ExtRev"].sum().to_frame()
rev_by_shop = sales.groupby("shop_id")["ExtRev"].sum().to_frame()

In [26]:
rev_by_shop

shop_id,ExtRev
2,103746.0
3,67443.0
4,29361.0
5,33138.0
6,138678.0
7,52371.0
10,22716.0
12,295173.0
14,57450.0
15,125139.0


In [27]:
#sales.groupby("shop_id")["ExtRev"].sum().to_frame()
rev_by_item = sales.groupby("item_id")["ExtRev"].sum().to_frame()
rev_by_item

item_id,ExtRev
30,507.00
31,1089.00
32,447.00
42,897.00
59,747.00
74,1497.00
109,747.00
259,747.00
464,897.00
482,39600.00


In [28]:
rev_by_item.to_csv(r'rev_by_item.csv')
rev_by_shop.to_csv(r'rev_by_shop.csv')
sales.to_csv(r'sales_fixed.csv')

In [30]:
# Connection data for the local database
driver = 'mysql+pymysql'
ip = '127.0.0.1'
username = 'root'
password = 'root'
db = 'retail_sales'
connection_string  = f'{driver}://{username}:{password}@{ip}/{db}'
# Engine used to write the results
engine = create_engine(connection_string)
# The raw data is already in memory, so nothing is read from the local database.

In [35]:
sales.to_sql('sales', con=engine)
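
The task asks for three tables, one for the cleaned data and one per aggregate. A sketch of writing all three follows, using an in-memory SQLite connection so that it runs without a MySQL server. The table names and sample rows are assumptions; with the MySQL engine above, `con=engine` would take the place of `con=conn`.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the local MySQL database in this sketch.
conn = sqlite3.connect(":memory:")

# Made-up rows with the same numeric columns as the sales data.
sales = pd.DataFrame({
    "shop_id": [29, 28, 28],
    "item_id": [1469, 21364, 21365],
    "item_price": [1199.0, 479.0, 999.0],
    "item_cnt_day": [1.0, 1.0, 2.0],
})
sales["ExtRev"] = sales["item_price"] * sales["item_cnt_day"]

rev_by_shop = sales.groupby("shop_id", as_index=False)["ExtRev"].sum()
rev_by_item = sales.groupby("item_id", as_index=False)["ExtRev"].sum()

tables = {"sales": sales, "rev_by_shop": rev_by_shop, "rev_by_item": rev_by_item}
for name, frame in tables.items():
    # if_exists="replace" lets the daily process rerun without failing on an
    # existing table; index=False avoids writing the pandas row index.
    frame.to_sql(name, con=conn, if_exists="replace", index=False)

# Read one table back to confirm the write.
check = pd.read_sql("SELECT COUNT(*) AS n FROM rev_by_shop", conn)
print(check)
```

If the daily process should accumulate rows rather than overwrite them, `if_exists="append"` is the alternative to consider.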